# Diagnose Mammograms Using Multilayer Perceptron  

## Introduction
The Mammographic Mass dataset for this exercise was acquired from the UCI repository (source: http://archive.ics.uci.edu/ml/datasets/mammographic+mass).

Mammography is an effective method for breast cancer screening; however, biopsy recommendations based on mammogram interpretation have a low positive predictive value (PPV). A low PPV means that only a small fraction of the patients sent for biopsy after a suspicious mammogram actually have cancer: approximately 70% of these biopsies turn out to be unnecessary, with benign outcomes. The goal of this exercise is to help reduce these false positives and prevent unnecessary procedures from being performed on patients.
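
To make the definition concrete, PPV is the fraction of positive calls that are truly positive, TP / (TP + FP). The sketch below uses illustrative counts (not taken from the dataset) chosen to match the roughly 70% unnecessary-biopsy figure; the function name is hypothetical.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """Fraction of positive calls that are truly positive: TP / (TP + FP)."""
    total_positive_calls = true_positives + false_positives
    if total_positive_calls == 0:
        raise ValueError("PPV is undefined when there are no positive calls")
    return true_positives / total_positive_calls

# Illustrative counts only: out of 100 biopsies, 30 confirm cancer and 70 are benign
ppv = positive_predictive_value(true_positives=30, false_positives=70)
print(f"PPV = {ppv:.2f}")  # PPV = 0.30
```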

This dataset consists of 961 instances (516 benign and 445 malignant), each describing a mammographic mass lesion with six attributes:

 1. BI-RADS assessment: 1 to 5 (ordinal, non-predictive!)
 2. Age: patient's age in years (integer)
 3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
 4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
 5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
 6. Severity: benign=0 or malignant=1 (binominal, goal field!)

The BI-RADS assessment is not a predictive attribute and thus will not be used as a feature. Age, shape, margin, and density are the attributes we will use to train the Multilayer Perceptron (MLP), and severity will serve as the class we attempt to predict.

## Preparing and Cleaning the Data

In [1]:
import pandas as pd

data = pd.read_csv('mammographic_masses.data.txt')
data.head()

Unnamed: 0,5,67,3,5.1,3.1,1
0,4,43,1,1,?,1
1,5,58,4,5,3,1
2,4,28,1,1,3,0
3,5,74,1,5,?,1
4,4,65,1,?,3,0


A quick look at the dataset shows that the columns do not have appropriate names (the first data row was read as the header) and that missing data is represented as '?'. Both can be fixed with optional parameters of read_csv, as in the following lines of code. Then we can use describe() on the pandas DataFrame to get an overview of the missing data.



In [2]:
data = pd.read_csv('mammographic_masses.data.txt', na_values=['?'], names = ['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity'])
data.head()

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
0,5.0,67.0,3.0,5.0,3.0,1
1,4.0,43.0,1.0,1.0,,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
4,5.0,74.0,1.0,5.0,,1


In [3]:
data.describe()

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
count,959.0,956.0,930.0,913.0,885.0,961.0
mean,4.348279,55.487448,2.721505,2.796276,2.910734,0.463059
std,1.783031,14.480131,1.242792,1.566546,0.380444,0.498893
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,45.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0


The easiest way to deal with missing data is simply to remove the instances that contain it; however, this can sometimes introduce bias into our data. Let's take a quick look at all the instances with missing data to see if we can spot any pattern in what is missing.

In [4]:
data.loc[(data['age'].isnull()) | (data['shape'].isnull()) | (data['margin'].isnull()) | (data['density'].isnull())]

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
1,4.0,43.0,1.0,1.0,,1
4,5.0,74.0,1.0,5.0,,1
5,4.0,65.0,1.0,,3.0,0
6,4.0,70.0,,,3.0,0
7,5.0,42.0,1.0,,3.0,0
...,...,...,...,...,...,...
778,4.0,60.0,,4.0,3.0,0
819,4.0,35.0,3.0,,2.0,0
824,6.0,40.0,,3.0,4.0,1
884,5.0,,4.0,4.0,3.0,1
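
Eyeballing rows only goes so far. As a rough complementary check, one could compare how often a value is missing within each severity class. The small DataFrame below is made-up toy data standing in for `data`; only the column names mirror the ones used above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data` (values are invented for illustration)
toy = pd.DataFrame({
    'age':      [67, 43, np.nan, 28, 74, 65],
    'density':  [3, np.nan, 3, 3, np.nan, 3],
    'severity': [1, 1, 1, 0, 1, 0],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Fraction of rows with at least one missing value, split by class;
# a large gap between classes would suggest the data is not missing at random
row_has_missing = toy.drop(columns='severity').isnull().any(axis=1)
missing_rate_by_class = row_has_missing.groupby(toy['severity']).mean()
print(missing_rate_by_class)
```

With only a handful of toy rows the rates are not meaningful; on the real data the same two expressions would be applied to `data` directly.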


There doesn't seem to be an obvious pattern in the missing data; it appears to be randomly distributed. Therefore, we'll go ahead and drop the instances with missing data using dropna(). A quick comparison of the mean and std of the data before and after dropping shows that removing these instances doesn't have a big impact.

In [5]:
data.dropna(inplace = True)
data.describe()

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
count,830.0,830.0,830.0,830.0,830.0,830.0
mean,4.393976,55.781928,2.781928,2.813253,2.915663,0.485542
std,1.888371,14.671782,1.242361,1.567175,0.350936,0.500092
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,46.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0
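
The before/after comparison of mean and std can also be done programmatically rather than by reading two tables. A minimal sketch on invented toy data (the variable names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the mammography DataFrame (values are invented)
toy = pd.DataFrame({
    'age':    [67.0, 43.0, 58.0, 28.0, np.nan, 65.0],
    'margin': [5.0, 1.0, np.nan, 1.0, 5.0, 4.0],
})

before = toy.describe()           # statistics ignore NaN column by column
after = toy.dropna().describe()   # statistics over complete rows only

# Change in mean and std caused by dropping incomplete rows
shift = (after - before).loc[['mean', 'std']]
print(shift)
```

Applied to the real data, `before` would need to be captured prior to the in-place dropna() call.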


Next we'll have to normalize the data, because our previous use of head() showed that the values in the 'age' column are much larger than those of the other attributes. This can bias our model, because a patient's age will influence the result more simply due to its larger magnitude, which doesn't necessarily mean it is more important as a predictor. Bringing all the attributes onto a common scale prevents this bias.

To perform normalization we will use StandardScaler() from scikit-learn's preprocessing module, which rescales each feature to zero mean and unit variance.

In [6]:
features = data[['age', 'shape', 'margin', 'density']].values
classes = data['severity'].values
feature_names = ['age', 'shape', 'margin', 'density']
features

array([[67.,  3.,  5.,  3.],
       [58.,  4.,  5.,  3.],
       [28.,  1.,  1.,  3.],
       ...,
       [64.,  4.,  5.,  3.],
       [66.,  4.,  5.,  3.],
       [62.,  3.,  3.,  3.]])

In [7]:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
features_scaled = scaler.fit_transform(features)
features_scaled

array([[ 0.7650629 ,  0.17563638,  1.39618483,  0.24046607],
       [ 0.15127063,  0.98104077,  1.39618483,  0.24046607],
       [-1.89470363, -1.43517241, -1.157718  ,  0.24046607],
       ...,
       [ 0.56046548,  0.98104077,  1.39618483,  0.24046607],
       [ 0.69686376,  0.98104077,  1.39618483,  0.24046607],
       [ 0.42406719,  0.17563638,  0.11923341,  0.24046607]])
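
For intuition, the transformation applied to each column amounts to subtracting the column mean and dividing by the column standard deviation. The NumPy sketch below reproduces that on a few made-up rows laid out like `features`; it is an illustration of the formula, not a replacement for the scaler.

```python
import numpy as np

# A few rows in the same layout as `features`: age, shape, margin, density
sample = np.array([[67., 3., 5., 3.],
                   [58., 4., 5., 3.],
                   [28., 1., 1., 2.],
                   [57., 1., 5., 4.]])

# Standardize column by column: subtract the mean, divide by the standard deviation
# (population standard deviation, ddof=0, which is what StandardScaler uses)
mean = sample.mean(axis=0)
std = sample.std(axis=0)
sample_scaled = (sample - mean) / std

print(sample_scaled.mean(axis=0))  # approximately 0 for every column
print(sample_scaled.std(axis=0))   # approximately 1 for every column
```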

Now we're ready to train our MLP model using Keras.

## Training & Evaluating the Model

We will use scikit-learn's cross_val_score() to perform 10-fold cross-validation on the Keras model.

In [8]:
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

def create_model():
    model = Sequential()
    # 4 feature inputs going into a 6-unit hidden layer (more units do not seem to make much difference)
    model.add(Dense(6, input_dim=4, kernel_initializer='normal', activation='relu'))
    # Deep learning isn't necessary for this dataset; the additional layer on the next line doesn't help much either
    #model.add(Dense(1, kernel_initializer='normal', activation='relu'))
    # Output layer for binary classification (benign or malignant)
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    #Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
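
For intuition about what this small network computes, here is a NumPy sketch of the forward pass of a 4-6-1 MLP with ReLU and sigmoid activations. The weights are random placeholders rather than trained values, and the helper name `forward` is hypothetical; Keras handles all of this internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialised placeholder parameters (not trained values)
W1 = rng.normal(scale=0.05, size=(4, 6))  # input (4 features) -> hidden (6 units)
b1 = np.zeros(6)
W2 = rng.normal(scale=0.05, size=(6, 1))  # hidden (6 units) -> output (1 unit)
b2 = np.zeros(1)

def forward(x):
    """Forward pass: ReLU hidden layer followed by a sigmoid output."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid -> estimated probability of class 1

# Two already-standardized example rows (age, shape, margin, density)
x = np.array([[0.77, 0.18, 1.40, 0.24],
              [-1.89, -1.44, -1.16, 0.24]])
probabilities = forward(x)
print(probabilities.shape)  # (2, 1)

# Number of trainable parameters: weights plus biases in both layers
n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)  # 37
```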

In [9]:
from sklearn.model_selection import cross_val_score
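# Note: this wrapper was removed in TensorFlow 2.12; on newer versions the SciKeras package
# (from scikeras.wrappers import KerasClassifier, with model= instead of build_fn=) is the usual replacement
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

# Wrap our Keras model in an estimator compatible with scikit-learn
estimator = KerasClassifier(build_fn=create_model, epochs=100, verbose=0)
# Now we can use scikit-learn's cross_val_score to evaluate the model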
cv_scores = cross_val_score(estimator, features_scaled, classes, cv=10)
cv_scores.mean()



0.8

This means the mean accuracy of the Keras model is 80%, which can definitely be improved upon but is sufficient for this exercise.
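
One thing to keep in mind when reading this score: the scaler was fit on the full dataset before cross-validation, so each test fold has slightly influenced the scaling statistics. A common way to avoid this is to place the scaler inside a scikit-learn Pipeline so that it is refit on each set of training folds. The sketch below illustrates the pattern on synthetic data, with LogisticRegression as a lightweight stand-in for the Keras estimator (an assumption made only so the example runs without TensorFlow).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 samples, 4 features on different scales, binary labels
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4)) * [15.0, 1.0, 1.5, 0.4]
y = (X[:, 0] / 15.0 + X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# The scaler is part of the estimator, so it is fit on the training folds only
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipeline, X, y, cv=10)
print(scores.mean())
```

The same pipeline should accept a scikit-learn-compatible Keras wrapper in place of LogisticRegression, with the unscaled `features` passed to cross_val_score.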