# Mammogram Prediction
### Predict whether a mammogram mass is benign or malignant
Using the "mammographic masses" public dataset from UCI (https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass).
The dataset contains 961 instances of masses detected in mammograms, with the following attributes:

* BI-RADS assessment: 1 to 5 (ordinal)
* Age: patient's age in years (integer)
* Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
* Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
* Density: mass density: high=1 iso=2 low=3 fat-containing=4 (ordinal)
* Severity: benign=0 or malignant=1 (binominal)

* *BI-RADS is an assessment of the severity classification*
* I need to build a multi-layer perceptron and train it to classify masses as benign or malignant based on their features. The data needs to be cleaned: many rows contain missing values, and there are erroneous values that show up as outliers.
* Remember that I need to normalize the data and experiment with different topologies, optimizers, and hyperparameters.
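
The normalization step mentioned above can be sketched as follows. This is a minimal illustration using plain NumPy on a handful of made-up rows in the order (age, shape, margin, density); the array name `features` and its values are assumptions for the example, not the notebook's data. Each column is shifted to zero mean and scaled to unit standard deviation, which is the same transformation that `StandardScaler` from `sklearn.preprocessing` applies.

```python
import numpy as np

# Illustrative rows only (age, shape, margin, density); not the real dataset
features = np.array([
    [67.0, 3.0, 5.0, 3.0],
    [58.0, 4.0, 5.0, 3.0],
    [28.0, 1.0, 1.0, 3.0],
    [45.0, 2.0, 1.0, 2.0],
])

# Per-column mean and (population) standard deviation
mean = features.mean(axis=0)
std = features.std(axis=0)

# Guard against constant columns, which would otherwise divide by zero
std = np.where(std == 0, 1.0, std)

# Standardize: every column ends up with mean 0 and standard deviation 1
scaled = (features - mean) / std
print(scaled.round(2))
```

Without this step, age (tens of years) would dominate the small integer codes used for shape, margin, and density when the network computes its weighted sums.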

# Preparing the Data

In [1]:
# Import the data and parse it into a pandas DataFrame. The data file is `mammographic_masses.data`
import pandas as pd

masses_data = pd.read_csv("mammographic_masses.data")
masses_data.head()

Unnamed: 0,5,67,3,5.1,3.1,1
0,4,43,1,1,?,1
1,5,58,4,5,3,1
2,4,28,1,1,3,0
3,5,74,1,5,?,1
4,4,65,1,?,3,0


In [2]:
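# The file has no header row, so the first record was used as column names above.
# Re-read it with explicit column names and treat "?" as a missing value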
masses_data = pd.read_csv("mammographic_masses.data", na_values=["?"],
                          names=['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity'])
masses_data.head()

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
0,5.0,67.0,3.0,5.0,3.0,1
1,4.0,43.0,1.0,1.0,,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
4,5.0,74.0,1.0,5.0,,1


In [3]:
# Now I need to evaluate whether the data needs cleaning. Summarize it with `describe()`
masses_data.describe()

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
count,959.0,956.0,930.0,913.0,885.0,961.0
mean,4.348279,55.487448,2.721505,2.796276,2.910734,0.463059
std,1.783031,14.480131,1.242792,1.566546,0.380444,0.498893
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,45.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0


*Note*: I can see that there are missing values (the counts differ between columns) and that the magnitude of the values differs from column to column, so this is something I need to take care of.
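
One way to quantify the missing values is to count them per column. The sketch below is an illustration, not part of the original notebook: it parses only the first six records shown earlier from an in-memory string (the names `raw` and `sample` are assumptions for the example), using the same `na_values` and `names` arguments as the real load.

```python
import io
import pandas as pd

# First six records of the dataset, as shown in the earlier outputs
raw = """5,67,3,5,3,1
4,43,1,1,?,1
5,58,4,5,3,1
4,28,1,1,3,0
5,74,1,5,?,1
4,65,1,?,3,0
"""

sample = pd.read_csv(
    io.StringIO(raw),
    na_values=["?"],
    names=["BI-RADS", "age", "shape", "margin", "density", "severity"],
)

# Number of missing values in each column
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Number of rows that have at least one missing value
rows_with_missing = int(sample.isnull().any(axis=1).sum())
print(rows_with_missing)
```

Running the same two expressions on `masses_data` would show how many rows a later `dropna()` is going to discard.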

In [8]:
# Find the rows where data is missing
masses_data.loc[
    masses_data["age"].isnull()
    | masses_data["shape"].isnull()
    | masses_data["margin"].isnull()
    | masses_data["density"].isnull()
]

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
1,4.0,43.0,1.0,1.0,,1
4,5.0,74.0,1.0,5.0,,1
5,4.0,65.0,1.0,,3.0,0
6,4.0,70.0,,,3.0,0
7,5.0,42.0,1.0,,3.0,0
...,...,...,...,...,...,...
778,4.0,60.0,,4.0,3.0,0
819,4.0,35.0,3.0,,2.0,0
824,6.0,40.0,,3.0,4.0,1
884,5.0,,4.0,4.0,3.0,1


In [9]:
# Dropping the rows with missing data
masses_data.dropna(inplace=True)
masses_data.describe()

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
count,830.0,830.0,830.0,830.0,830.0,830.0
mean,4.393976,55.781928,2.781928,2.813253,2.915663,0.485542
std,1.888371,14.671782,1.242361,1.567175,0.350936,0.500092
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,46.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0
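
The summary still reports a BI-RADS maximum of 55, well outside the 1 to 5 range given in the attribute list, which suggests a data-entry error. A minimal sketch of how such values could be flagged is shown below. The series `bi_rads` holds illustrative values only, and the valid range of 1 to 5 is an assumption taken from the attribute description.

```python
import pandas as pd

# Illustrative values only; 55 mirrors the maximum reported by describe()
bi_rads = pd.Series([5, 4, 55, 0, 6, 4], name="BI-RADS")

# Assumed valid range from the attribute description (inclusive on both ends)
in_range = bi_rads.between(1, 5)

# Values that fall outside the documented range
out_of_range = bi_rads[~in_range]
print(out_of_range)
```

Whether to drop or correct such rows is a judgment call; the sketch only identifies them.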


In [21]:
# Now I need to convert the pandas DataFrame into NumPy arrays, the format scikit-learn prefers
feature_names = ['age', 'shape', 'margin', 'density']
all_features = masses_data[feature_names].values
all_classes = masses_data["severity"].values

In [None]:
# Remember the very important step of normalizing the data
from sklearn import preprocessing