# Preprocess the data
I decided to preprocess the data before splitting it into training and evaluation sets, so that both sets are encoded and scaled in exactly the same way.

## What is this data?

By researching where the data might come from, I found that it is the [Breast Cancer Wisconsin (Diagnostic) Dataset](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data).

The dataset has 32 columns:
* **id** (ID number)
* **diagnosis** (M for malignant, B for benign)
* **radius_mean** (mean of distances from center to points on perimeter)
* **texture_mean** (standard deviation of gray-scale values)
* **perimeter_mean** (mean size of the core tumor)
* **area_mean**
* **smoothness_mean** (mean of local variation in radius lengths)
* **compactness_mean** (mean of perimeter^2 / area - 1.0)
* **concavity_mean** (mean of severity of concave portions of the contour)
* **concave points_mean** (mean for number of concave portions of the contour)
* **symmetry_mean**
* **fractal_dimension_mean** (mean for "coastline approximation" - 1)
* **radius_se** (standard error for the mean of distances from center to points on the perimeter)
* **texture_se** (standard error for standard deviation of gray-scale values)
* **perimeter_se** 
* **area_se**
* **smoothness_se** (standard error for local variation in radius lengths)
* **compactness_se** (standard error for perimeter^2 / area - 1.0)
* **concavity_se** (standard error for severity of concave portions of the contour)
* **concave points_se** (standard error for number of concave portions of the contour)
* **symmetry_se**
* **fractal_dimension_se** (standard error for "coastline approximation" - 1)
* **radius_worst** ("worst" or largest mean value for mean of distances from center to points on the perimeter)
* **texture_worst** ("worst" or largest mean value for standard deviation of gray-scale values)
* **perimeter_worst**
* **area_worst**
* **smoothness_worst** ("worst" or largest mean value for local variation in radius lengths)
* **compactness_worst** ("worst" or largest mean value for perimeter^2 / area - 1.0)
* **concavity_worst** ("worst" or largest mean value for severity of concave portions of the contour)
* **concave points_worst** ("worst" or largest mean value for number of concave portions of the contour)
* **symmetry_worst**
* **fractal_dimension_worst** ("worst" or largest mean value for "coastline approximation" - 1)

The file has no header row, so I will assign these column names to the DataFrame in order to simplify the analysis.

In [5]:
import pandas as pd

df = pd.read_csv("data.csv", header=None) # header=None prevents pandas from interpreting the first data row as the column names.
columns = [
    "id",
    "diagnosis",
    "radius_mean",
    "texture_mean",
    "perimeter_mean",
    "area_mean",
    "smoothness_mean",
    "compactness_mean",
    "concavity_mean",
    "concave points_mean",
    "symmetry_mean",
    "fractal_dimension_mean",
    "radius_se",
    "texture_se",
    "perimeter_se",
    "area_se",
    "smoothness_se",
    "compactness_se",
    "concavity_se",
    "concave points_se",
    "symmetry_se",
    "fractal_dimension_se",
    "radius_worst",
    "texture_worst",
    "perimeter_worst",
    "area_worst",
    "smoothness_worst",
    "compactness_worst",
    "concavity_worst",
    "concave points_worst",
    "symmetry_worst",
    "fractal_dimension_worst",
    ]

df.columns = columns
print(df.head())

         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0    842302         M        17.99         10.38          122.80     1001.0   
1    842517         M        20.57         17.77          132.90     1326.0   
2  84300903         M        19.69         21.25          130.00     1203.0   
3  84348301         M        11.42         20.38           77.58      386.1   
4  84358402         M        20.29         14.34          135.10     1297.0   

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0          0.11840           0.27760          0.3001              0.14710   
1          0.08474           0.07864          0.0869              0.07017   
2          0.10960           0.15990          0.1974              0.12790   
3          0.14250           0.28390          0.2414              0.10520   
4          0.10030           0.13280          0.1980              0.10430   

   ...  radius_worst  texture_worst  perimeter_worst  area_wor
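
The 30 feature names follow a regular pattern: ten base measurements, each reported as a mean, a standard error, and a "worst" value. As a minimal sketch, the same list could be generated instead of typed by hand. The names `base_features`, `suffixes`, and `generated_columns` are illustrative and not part of the original notebook.

```python
# Ten base measurements, in the order they appear in the file.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Each measurement appears once per statistic, grouped by statistic.
suffixes = ["mean", "se", "worst"]

# "id" and "diagnosis" come first, followed by the 30 feature columns.
generated_columns = ["id", "diagnosis"] + [
    f"{feature}_{suffix}" for suffix in suffixes for feature in base_features
]

print(len(generated_columns))
```

Generating the names this way avoids typos in a long hand-written list, at the cost of being slightly less explicit.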

## Convert categorical values into numerical values
The model needs numeric inputs, so the diagnosis labels are encoded as 1 (malignant) and 0 (benign).

In [6]:
# Convert categorical values into numerical.
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})
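
One thing to keep in mind with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes `NaN` rather than raising an error. The sketch below uses a small made-up Series (not the real dataset) to show how unexpected labels could be detected after encoding.

```python
import pandas as pd

# Toy labels; the lowercase "m" is a deliberate typo for illustration.
labels = pd.Series(["M", "B", "B", "m"])
encoded = labels.map({"M": 1, "B": 0})

# Values missing from the mapping come back as NaN, so they can be listed.
unmapped = labels[encoded.isna()]
print(unmapped.tolist())
```

If `unmapped` is empty, every label was recognized and the encoded column contains only 0 and 1.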

## Fill missing values
This dataset is already cleaned, since it comes from a website that provides ready-made machine learning datasets.
I still include this step, because missing values would otherwise carry through scaling and into training.


In [7]:
# Fill missing values with the mean of their column, which leaves each column's mean unchanged.
df = df.fillna(df.mean())
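
To see whether the fill step actually changes anything, the missing values can be counted per column before and after. This is a minimal sketch on a made-up DataFrame (`toy` is illustrative, not the real data).

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in column "a".
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Number of missing values per column before imputation.
missing_before = toy.isna().sum()

# Mean imputation: each NaN is replaced by the mean of its own column.
filled = toy.fillna(toy.mean())

print(missing_before)
print(filled)
```

Here the missing entry in `a` is replaced by 2.0, the mean of 1.0 and 3.0, because `mean()` skips `NaN` by default.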

## Feature Scaling
Algorithms that use gradient descent for optimization (such as linear regression, logistic regression and, in this case, neural networks) converge significantly faster when features are on a similar scale.

In [9]:
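# Standardize every column except the label to zero mean and unit standard deviation (z-score).
for column in df.columns:
    if column != "diagnosis":
        df[column] = (df[column] - df[column].mean()) / df[column].std()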

print(df["diagnosis"])

0      1
1      1
2      1
3      1
4      1
      ..
564    1
565    1
566    1
567    1
568    0
Name: diagnosis, Length: 569, dtype: int64
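
Standardization (z-score scaling) subtracts each column's mean and divides by its standard deviation. As a sketch under the assumption that all feature columns are numeric, the same operation can be written without a loop and then checked. The `toy` frame and `feature_columns` name are illustrative only.

```python
import pandas as pd

# Toy frame: one label column and two features on very different scales.
toy = pd.DataFrame({
    "diagnosis": [1, 0, 1, 0],
    "radius_mean": [10.0, 12.0, 14.0, 16.0],
    "area_mean": [300.0, 500.0, 700.0, 900.0],
})

# Every column except the label is treated as a feature.
feature_columns = toy.columns.drop("diagnosis")
features = toy[feature_columns]

# Column-wise z-score: pandas aligns the per-column mean and std automatically.
toy[feature_columns] = (features - features.mean()) / features.std()

# After scaling, each feature should have mean ~0 and standard deviation ~1.
print(toy[feature_columns].mean())
print(toy[feature_columns].std())
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default, so the scaled values differ slightly from those of tools that use the population standard deviation.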
