# Breast cancer prediction with Multi-layer Perceptron classifier

### Author
Piotr Tynecki  
Last edited: May 16, 2018

#### Alternative study with Logistic Regression

If you're interested in my previous study of breast cancer prediction using **Logistic Regression**, feel free to follow this [kaggle link](https://www.kaggle.com/ptynecki/breast-cancer-prediction-with-lr-99).

### About the Breast Cancer Wisconsin Diagnostic Dataset
Breast Cancer Wisconsin Diagnostic Dataset (WDBC) consists of features which were computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. Those features describe the characteristics of the cell nuclei found in the image.

![Diagnosing Breast Cancer from Image](https://kaggle2.blob.core.windows.net/datasets-images/180/384/3da2510581f9d3b902307ff8d06fe327/dataset-cover.jpg)

This dataset has 569 instances: 212 malignant and 357 benign. It consists of 31 attributes including the class attribute. Ten real-valued measurements are computed for each cell nucleus: Radius, Texture, Perimeter, Area, Smoothness, Compactness, Concavity, Concave points, Symmetry and Fractal dimension. The mean, standard error and worst value of each measurement are recorded, which gives 30 features.
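
The counts above can be checked with a minimal sketch. It assumes that the copy of WDBC bundled with scikit-learn (`load_breast_cancer`) matches the Kaggle CSV used in this notebook; the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled WDBC copy stands in for ../input/data.csv.
data = load_breast_cancer()
n_samples, n_features = data.data.shape

# Class counts keyed by class name.
counts = dict(zip(data.target_names.tolist(), np.bincount(data.target).tolist()))

# Each of the ten measurements appears as a mean, a standard error and a worst value.
names = [str(name) for name in data.feature_names]
group_sizes = {
    'mean': sum(name.startswith('mean ') for name in names),
    'error': sum(name.endswith(' error') for name in names),
    'worst': sum(name.startswith('worst ') for name in names),
}

print(n_samples, n_features)
print(counts)
print(group_sizes)
```

Note that this bundled copy encodes malignant as `0` and benign as `1`, the opposite of the encoding used later in this notebook.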

In this document I demonstrate an automated methodology to predict if a sample is benign or malignant.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score, classification_report
from sklearn.preprocessing import Normalizer, MinMaxScaler, StandardScaler, RobustScaler, QuantileTransformer, LabelEncoder
from sklearn.pipeline import Pipeline

### Step 1: Exploratory Data Analysis (EDA)
EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. It lets us summarize the main characteristics of the data.

In [2]:
breast_cancer = pd.read_csv('../input/data.csv')
breast_cancer.head()

In [3]:
breast_cancer.info()

In [4]:
breast_cancer.shape

In [5]:
breast_cancer.describe()

In [6]:
breast_cancer.groupby('diagnosis').size()

#### Data quality checks

In [7]:
breast_cancer.isnull().sum()

In [8]:
for field in breast_cancer.columns:
    amount = np.count_nonzero(breast_cancer[field] == 0)
    
    if amount > 0:
        print('Number of 0-entries for "{field_name}" feature: {amount}'.format(
            field_name=field,
            amount=amount
        ))

### Step 2: Feature Engineering

In [9]:
# Columns "id" and "Unnamed: 32" are not useful, and "diagnosis" is the target,
# so keep only the columns between them
feature_names = breast_cancer.columns[2:-1]
X = breast_cancer[feature_names]
# "diagnosis" is the class which I want to predict
y = breast_cancer.diagnosis

#### Transforming the prediction target

In [10]:
class_le = LabelEncoder()
# M -> 1 and B -> 0
y = class_le.fit_transform(breast_cancer.diagnosis.values)
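
As a small illustration of the mapping noted in the comment above, the sketch below runs `LabelEncoder` on a hypothetical handful of labels (not rows from the dataset). The encoder sorts the unique labels, so `B` gets code 0 and `M` gets code 1.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical mini-sample of diagnosis labels, for illustration only.
labels = ['M', 'B', 'B', 'M', 'B']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; the position of a label is its code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
decoded = encoder.inverse_transform(encoded).tolist()

print(mapping)
print(encoded.tolist())
print(decoded)
```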

#### Correlation Matrix
A matrix of correlations provides useful insight into relationships between pairs of variables.

In [11]:
sns.heatmap(
    data=X.corr(),
    annot=True,
    fmt='.2f',
    cmap='RdYlGn'
)

fig = plt.gcf()
fig.set_size_inches(20, 16)

plt.show()
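
With 30 features the heatmap is dense, so it can help to list only the strongly correlated pairs. The sketch below is one possible way to do that, shown on synthetic data (a hypothetical `toy` frame in which perimeter and area are derived from radius) rather than on the real features; the 0.9 threshold is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: perimeter and area are functions of radius,
# while "noise" is unrelated to the other columns.
rng = np.random.default_rng(0)
radius = rng.uniform(6.0, 28.0, size=200)
toy = pd.DataFrame({
    'radius': radius,
    'perimeter': 2 * np.pi * radius,
    'area': np.pi * radius ** 2,
    'noise': rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep only the upper triangle so that each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong_pairs = pairs[pairs > 0.9].sort_values(ascending=False)

print(strong_pairs)
```

Replacing `toy` with `X` would apply the same filter to the notebook's feature matrix.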

### Step 3: Multi-layer Perceptron classifier evaluation after Pipeline and GridSearchCV usage

For this case study I decided to use the [Multi-layer Perceptron classifier](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).

#### Model Parameter Tuning
[GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) evaluates every combination in a grid of parameter values and returns the combination with the best cross-validated score. Chaining data preprocessing and the classifier in a [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) lets the search tune both together, with the cross-validation splitting strategy applied to the whole chain.
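
A minimal sketch of that interplay is shown here, under several assumptions: it uses scikit-learn's bundled copy of the dataset instead of the Kaggle CSV, a deliberately tiny grid, and only 3 folds so that it runs in seconds. The `_demo` names are illustrative and the scores it prints are not the results reported in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Assumption: the bundled WDBC copy stands in for ../input/data.csv.
data = load_breast_cancer()
X_demo = data.data
# The bundled copy uses 0 = malignant, 1 = benign; flip it so that
# malignant is the positive class, as in this notebook.
y_demo = 1 - data.target

pipe_demo = Pipeline(steps=[
    ('preprocess', StandardScaler()),
    ('classification', MLPClassifier(max_iter=500, random_state=42)),
])

# A whole step is replaced via its name; a step's parameter via "<step>__<param>".
param_grid_demo = {
    'preprocess': [StandardScaler(), MinMaxScaler()],
    'classification__alpha': [0.01, 0.1],
}

grid_demo = GridSearchCV(
    pipe_demo,
    param_grid=param_grid_demo,
    cv=StratifiedKFold(n_splits=3),
    scoring='f1',
)
grid_demo.fit(X_demo, y_demo)

print(grid_demo.best_params_)
print('Best F1: {:.2f}%'.format(grid_demo.best_score_ * 100))
```

Because the scaler is a pipeline step, it is refitted on the training part of every fold, so the held-out part never influences the scaling.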

#### Data standardization
[Preprocessing data](http://scikit-learn.org/stable/modules/preprocessing.html) provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.

Let's start by defining the Pipeline instance. In this case I used five different approaches to data preprocessing (`Normalizer`, `MinMaxScaler`, `StandardScaler`, `RobustScaler` and `QuantileTransformer`) and `MLPClassifier` for classification.

In [12]:
pipe = Pipeline(steps=[
    ('preprocess', StandardScaler()),
    ('classification', MLPClassifier())
])

Next, I needed to prepare the candidate values for the attributes which I want the model parameter tuning process to check: `activation`, `solver`, `max_iter` and `alpha`.

In [13]:
random_state = 42
mlp_activation = ['identity', 'logistic', 'tanh', 'relu']
mlp_solver = ['lbfgs', 'sgd', 'adam']
mlp_max_iter = range(1000, 10000, 1000)
mlp_alpha = [1e-4, 1e-3, 0.01, 0.1, 1]
preprocess = [Normalizer(), MinMaxScaler(), StandardScaler(), RobustScaler(), QuantileTransformer()]

Next, I needed to prepare the supported combinations of classifier parameters, including the above attributes. For the Multi-layer Perceptron classifier I decided to opt out of PCA and any other feature selection techniques.

In [14]:
mlp_param_grid = [
    {
        'preprocess': preprocess,
        'classification__activation': mlp_activation,
        'classification__solver': mlp_solver,
        'classification__random_state': [random_state],
        'classification__max_iter': mlp_max_iter,
        'classification__alpha': mlp_alpha
    }
]
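
To get a feel for the cost of this grid, the arithmetic below counts the candidate combinations. The list lengths are taken from the cells above; the 10 folds are the `n_splits` value used for cross-validation in this notebook.

```python
from math import prod

# Number of candidate values per grid entry, as defined in the cells above.
grid_sizes = {
    'preprocess': 5,
    'activation': 4,
    'solver': 3,
    'random_state': 1,
    'max_iter': len(range(1000, 10000, 1000)),  # 1000, 2000, ..., 9000
    'alpha': 5,
}

n_candidates = prod(grid_sizes.values())
n_splits = 10  # folds used by the cross-validation strategy
n_fits = n_candidates * n_splits

print(n_candidates, n_fits)
```

Every candidate is fitted once per fold, which is why an exhaustive search over a grid of this size takes a long time.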

Next, I needed to prepare the cross-validation splitting strategy object with `StratifiedKFold` and pass it, together with the other objects, to `GridSearchCV`. For evaluation I used the F1 score metric (`scoring='f1'`).

In [15]:
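# # random_state is omitted: it has no effect without shuffle=True,
# # and recent scikit-learn versions raise an error for that combination
# strat_k_fold = StratifiedKFold(
#     n_splits=10
# )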

# mlp_grid = GridSearchCV(
#     pipe,
#     param_grid=mlp_param_grid,
#     cv=strat_k_fold,
#     scoring='f1',
#     n_jobs=-1,
#     verbose=2
# )

# mlp_grid.fit(X, y)

# # Best MLPClassifier parameters
# print(mlp_grid.best_params_)
# # Best score for MLPClassifier with best parameters
# print('\nBest F1 score for MLP: {:.2f}%'.format(mlp_grid.best_score_ * 100))

# best_params = mlp_grid.best_params_
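
The point of `StratifiedKFold` is that every fold keeps roughly the same class ratio as the full dataset. The sketch below illustrates this with a synthetic label vector that only reproduces the class counts of the dataset (357 benign, 212 malignant); the feature matrix is a placeholder, since the split depends on the labels alone.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with the same class counts as the dataset: 0 = B, 1 = M.
y_demo = np.array([0] * 357 + [1] * 212)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features

skf = StratifiedKFold(n_splits=10)

fold_sizes = []
malignant_per_fold = []
for _, test_idx in skf.split(X_demo, y_demo):
    fold_sizes.append(len(test_idx))
    malignant_per_fold.append(int(y_demo[test_idx].sum()))

print(fold_sizes)
print(malignant_per_fold)
```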

#### Model evaluation

Finally, after a few hours of computation, I established the best parameter values, which I passed to new preprocessing and classifier instances. `best_params` returned `StandardScaler` for data preprocessing and `1000`, `0.1`, `'logistic'` and `'adam'` for the `max_iter`, `alpha`, `activation` and `solver` classifier attributes.

In addition, I discovered that the `train_test_split` function gave the best F1 score with a split of around 68% of the data for training and 32% for testing (`test_size=0.32`).

In [16]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    random_state=42,
    test_size=0.32
)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [17]:
scaler = StandardScaler()

print('\nData preprocessing with {scaler}\n'.format(scaler=scaler))

X_train_scaler = scaler.fit_transform(X_train)
X_test_scaler = scaler.transform(X_test)

mlp = MLPClassifier(
    max_iter=1000,
    alpha=0.1,
    activation='logistic',
    solver='adam',
    random_state=42
)
mlp.fit(X_train_scaler, y_train)

mlp_predict = mlp.predict(X_test_scaler)
mlp_predict_proba = mlp.predict_proba(X_test_scaler)[:, 1]

print('MLP Accuracy: {:.2f}%'.format(accuracy_score(y_test, mlp_predict) * 100))
print('MLP AUC: {:.2f}%'.format(roc_auc_score(y_test, mlp_predict_proba) * 100))
print('MLP Classification report:\n\n', classification_report(y_test, mlp_predict))
print('MLP Training set score: {:.2f}%'.format(mlp.score(X_train_scaler, y_train) * 100))
print('MLP Testing set score: {:.2f}%'.format(mlp.score(X_test_scaler, y_test) * 100))

#### Confusion Matrix

Also known as an error matrix, the confusion matrix is a table layout that visualizes the performance of an algorithm. For a binary problem the table has two rows and two columns that report the number of True Negatives (TN), False Positives (FP), False Negatives (FN) and True Positives (TP). This allows a more detailed analysis than accuracy alone.

In [18]:
outcome_labels = sorted(breast_cancer.diagnosis.unique())

# Confusion Matrix for MLPClassifier
sns.heatmap(
    confusion_matrix(y_test, mlp_predict),
    annot=True,
    fmt="d",
    xticklabels=outcome_labels,
    yticklabels=outcome_labels
)
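
For labels `0` and `1`, scikit-learn's `confusion_matrix` puts true classes in rows and predicted classes in columns, i.e. `[[TN, FP], [FN, TP]]`. The sketch below shows how the usual metrics follow from those four cells; the counts are hypothetical and are not the results of this notebook.

```python
# Hypothetical counts, for illustration only (not results from this notebook).
tn, fp, fn, tp = 120, 2, 3, 58

total = tn + fp + fn + tp
accuracy = (tp + tn) / total          # share of all samples classified correctly
precision = tp / (tp + fp)            # share of predicted malignant that are malignant
recall = tp / (tp + fn)               # share of malignant samples that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print('Accuracy:  {:.4f}'.format(accuracy))
print('Precision: {:.4f}'.format(precision))
print('Recall:    {:.4f}'.format(recall))
print('F1:        {:.4f}'.format(f1))
```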

#### Receiver Operating Characteristic (ROC)

The [ROC curve](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html) is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.

In [19]:
# ROC for MLPClassifier
fpr, tpr, thresholds = roc_curve(y_test, mlp_predict_proba)

plt.plot([0,1],[0,1],'k--')
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.rcParams['font.size'] = 12
plt.title('ROC curve for MLPClassifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
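
Each point on the curve corresponds to one threshold applied to the predicted probabilities. The sketch below computes two such points by hand on a hypothetical four-sample example; `roc_point` is an illustrative helper, not a scikit-learn function.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when every score >= threshold is predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives


# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

strict = roc_point(y_true, scores, threshold=0.5)
lenient = roc_point(y_true, scores, threshold=0.3)

print(strict)
print(lenient)
```

Lowering the threshold finds more of the positive samples (higher TPR) at the cost of more false alarms (higher FPR), which is the trade-off the curve visualizes.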

#### F1-score after 10-fold cross-validation

In [20]:
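# random_state is omitted: it has no effect without shuffle=True,
# and recent scikit-learn versions raise an error for that combination
strat_k_fold = StratifiedKFold(
    n_splits=10
)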

scaler = StandardScaler()

X_std = scaler.fit_transform(X)

fe_score = cross_val_score(
    mlp,
    X_std,
    y,
    cv=strat_k_fold,
    scoring='f1'
)

# Both the mean and the spread (two standard deviations) are reported in percent
print("MLP: F1 after 10-fold cross-validation: {:.2f}% (+/- {:.2f}%)".format(
    fe_score.mean() * 100,
    fe_score.std() * 2 * 100
))

### Final step: Conclusions

After the application of data standardization and tuning the classifier parameters I achieved the following results:

* Accuracy: ~99.5%
* F1-score: 99%
* Precision: 99%
* Recall: 99%

The F1 score after 10-fold cross-validation is slightly lower (-0.01%) than in my previous study of [breast cancer prediction using Logistic Regression](https://www.kaggle.com/ptynecki/breast-cancer-prediction-with-lr-99).

I would love to hear your comments and other tuning proposals for this case study.