### Yaser Marey
##### October 15th, 2020

****
#### Hello, in this notebook I apply several Machine Learning algorithms to the Breast Cancer Wisconsin (Diagnostic) data set to classify tumors as either malignant or benign based on the attributes in each patient record.

#### The objective is to demonstrate how to use Scikit-Learn tools to compare different algorithms quickly. I use tools from the `sklearn.model_selection` package, such as `StratifiedShuffleSplit` and `learning_curve`, and I apply and analyze the behavior of `AdaBoostClassifier`, `KNeighborsClassifier`, `MLPClassifier`, `SVC`, and `DecisionTreeClassifier`.

#### My treatment for this task will follow three steps:

Step 0 Basic Exploratory Analysis

Step 1 Preprocess Data

Step 2 Apply ML Algorithms and Plot Learning Curves
    

_One comment about terminology: I use Validation and Testing interchangeably across the code and the narrative description. I hope you don't find that confusing; I will fix this in a following update._

In [None]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
# from matplotlib import cm as cm
import seaborn as sns

# Step 0: Basic Exploratory Data Analysis
We perform basic exploratory data analysis (EDA).
Basic pandas commands such as `head`, `tail`, `describe` and `info` are really all that we need here:

In [None]:
# Read Data from File
data = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.shape

In [None]:
data.columns

In [None]:
# Check whether any attribute has None or numpy.NaN values
data.isna().any()

In [None]:
# Count the rows with a NaN value for each attribute
data.isna().sum()

#### From the output of those commands, and by cross-checking with the description from the data set provider, we can confirm the following:

1. We have a total of 569 records (samples).
2. We have 33 attributes. One of them, "Unnamed: 32", is not useful because it contains no values.
3. We have two other special attributes:
    * id, which is the id of the patient
    * diagnosis, which is the class label (M for malignant, B for benign)

#### Then, we have ten different measurements applied to the cells from each patient:

    1. radius
    2. texture               
    3. perimeter             
    4. area                  
    5. smoothness            
    6. compactness           
    7. concavity             
    8. concave points        
    9. symmetry              
    10. fractal_dimension     

4. The mean, the standard error, and the worst value (the mean of the three largest values) are computed for each of these measurements, resulting in 30 computed attributes for each patient, all of which are real numbers.
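
As a quick sanity check of the count above, here is a small sketch that builds the 30 expected attribute names from the ten measurements and three statistics. The `_mean`, `_se` and `_worst` suffixes are assumed from column names such as `perimeter_se` and `area_worst` that appear later in this notebook; `expected_columns` is just an illustrative name.

```python
# The ten base measurements, spelled as in the list above.
measurements = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Assumed suffixes for the three statistics computed per measurement.
statistics = ["mean", "se", "worst"]

# One column per (statistic, measurement) pair.
expected_columns = [f"{m}_{s}" for s in statistics for m in measurements]

print(len(expected_columns))  # 30
print(expected_columns[:3])
```

Comparing this list against `data.columns` is one way to confirm that only `id`, `diagnosis` and the empty column fall outside the 30 computed attributes.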

Now we are ready to preprocess the data.

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

# Step 1: Preprocess Data

We start by removing the 'id' and 'Unnamed: 32' attributes, then we check the distribution of the data, and finally we standardize the attributes.

In [None]:
data.drop('id',axis = 1, inplace=True)
data.drop('Unnamed: 32',axis = 1, inplace=True)
data.shape

In [None]:
print("Number of Malignant Records: {0} accounts for: {1:.2f}% of the diagnosis class\n"
      "Number of Benign Records: {2} accounts for: {3:.2f}% of the diagnosis class".format(
    data.loc[data.diagnosis == 'M'].shape[0],
    100 * data.loc[data.diagnosis == 'M'].shape[0] / data.shape[0],
    data.loc[data.diagnosis == 'B'].shape[0],
    100 * data.loc[data.diagnosis == 'B'].shape[0] / data.shape[0]))

In [None]:
# Check data set imbalance
unique, counts = np.unique(data.diagnosis, return_counts=True)
plt.bar(unique, counts, 1, color=['lightgreen', 'lightgray'])
plt.title('Class Frequency')
plt.xlabel('Class')
plt.ylabel('Frequency')
plt.show()

The data set is slightly imbalanced. Some researchers argue that this shouldn't be a problem and that the model should still generalize adequately. However, I usually find that imbalanced data leads to accuracy degradation, where the learner's performance reflects the underlying class distribution. This problem is especially clear for learners prone to overfitting, such as Decision Trees and Neural Networks. Therefore, while applying Cross-Validation I will use a Scikit-Learn object that performs stratified resampling, `sklearn.model_selection.StratifiedShuffleSplit`, which I explain further below.

In addition to class balance, many learning algorithms work best when the attributes are roughly normally distributed, so let's look into that next:

In [None]:
# Plot distribution
data.plot.density(subplots=True, layout=(5,10), sharex=False, legend=False, fontsize=1, figsize=(12,12))
plt.show()

We see that the distributions look roughly Gaussian (unimodal and bell-shaped, though a few have a longer right tail), how convenient!
Now we look at the correlation among attributes. A strong correlation between two attributes suggests that we can remove one of them without much effect on learning.
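
To make the idea concrete, here is a minimal sketch on synthetic data (not the real data set) that lists attribute pairs whose absolute correlation exceeds a threshold and picks one attribute of each pair to drop. The toy columns, the noise levels and the `threshold` value of 0.9 are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
radius = rng.normal(14.0, 3.5, size=200)

# Toy frame: perimeter and area are derived from radius, texture is independent.
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius + rng.normal(0, 0.5, size=200),
    "area": np.pi * radius ** 2 + rng.normal(0, 5.0, size=200),
    "texture": rng.normal(19.0, 4.0, size=200),
})

corr_toy = toy.corr().abs()

# Keep only the upper triangle so each pair is looked at once.
mask = np.triu(np.ones(corr_toy.shape, dtype=bool), k=1)
upper = corr_toy.where(mask)

threshold = 0.9  # hypothetical cut-off
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold]
print(high_pairs)

# Drop the later column of every highly correlated pair.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```

Because `radius` comes first in the column order, it is the attribute that survives in this sketch, while the columns derived from it are flagged for removal.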

In [None]:
plt.figure(figsize=(12, 12))

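# 'diagnosis' is still a string column here, so exclude it explicitly
# (recent pandas versions raise an error on non-numeric columns)
corr = data.drop(columns='diagnosis').corr()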

ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
)


From this we see a strong correlation between perimeter, area, and radius. We should be able to remove two of them; I will keep radius.

In [None]:
data.drop(['perimeter_mean','perimeter_se', 'perimeter_worst', 'area_mean', 'area_se', 'area_worst'],axis = 1, inplace=True)
data.shape

Now I use `sklearn.preprocessing.LabelEncoder` to encode the categorical labels ('M', 'B') as numeric values so that ML methods can handle them:

In [None]:
# Encode target column since it is categorical and most learning algorithms require 
# numeric inputs (M)alignant = 1, (B)enign = 0
labelencoder_diagnosis = LabelEncoder()
data.diagnosis = labelencoder_diagnosis.fit_transform(data.diagnosis)

In [None]:
data.diagnosis

The final preprocessing step is to standardize the features. I am using `sklearn.preprocessing.StandardScaler`, which rescales each feature to have a mean of zero and a standard deviation of 1.
This is important for machine learning algorithms that use the Euclidean distance between two points in their computations, like K-Nearest Neighbor, and it is also useful for algorithms that use gradient descent, such as Neural Networks.
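
As a small illustration of what standardization does, the sketch below compares a manual z-score, `(x - mean) / std` per column, with the output of `StandardScaler` on a toy array. The values in `X_toy` are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales.
X_toy = np.array([[10.0, 200.0],
                  [12.0, 400.0],
                  [14.0, 600.0],
                  [16.0, 800.0]])

# Manual z-score per column, using the population standard deviation (ddof=0).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns contribute on the same scale to a Euclidean distance, which is the property K-Nearest Neighbor relies on.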

In [None]:
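sc = StandardScaler()
# Standardize every feature column in one call; the target column is left untouched
feature_columns = data.columns.drop('diagnosis')
data[feature_columns] = sc.fit_transform(data[feature_columns])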


In [None]:
data.head()

# Step 2: Apply ML Algorithms and Plot Learning Curves

For performance comparison, I produce the learning curve of each algorithm. Learning curves show performance on both training and testing data. Performance is measured using prediction accuracy, calculated as a function of an increasing number of training samples, up to 100% of the training set.

The training set is 80% of the full data set, while the testing set is 20%. Also, the measurement is averaged over several iterations to obtain a smooth curve. The number of iterations varies from one algorithm to another, taking into consideration the time complexity of each algorithm.
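
To put numbers on the 80/20 split, here is a short arithmetic sketch for the 569 records and the five training fractions used by the plotting helper later in this notebook. The rounding rules (test part rounded up, fractional training sizes truncated) are my assumptions about how the sizes are derived, so treat the exact counts as approximate.

```python
import math
import numpy as np

n_samples = 569       # records in the data set
test_fraction = 0.2

# Assumption: the test part is rounded up and the training part takes the rest.
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test

fractions = np.linspace(0.1, 1.0, 5)
# Assumption: fractional sizes are truncated to whole sample counts.
sizes = [int(f * n_train) for f in fractions]

print(n_train, n_test)  # 455 114
print(sizes)            # [45, 147, 250, 352, 455]
```

So each learning curve has five points, from roughly 45 training samples up to the full training part of about 455 samples.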

To achieve this task with the least amount of code, I am using a set of scikit-learn tools:

    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.model_selection import learning_curve

First, let's divide the `data` dataframe in two: `X`, the set of training samples, and `y`, the corresponding labels.

In [None]:
X = data.iloc[:, 1:]
y = data.iloc[:, 0]

In [None]:
X

In [None]:
y

`sklearn.model_selection.learning_curve` is a convenient function if you want to compare several models quickly.

But first, I am using another tool from Scikit-learn to shuffle and split the training data for Cross-Validation: StratifiedShuffleSplit [[source](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html)]

    cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

StratifiedShuffleSplit shuffles the data before each split and produces `n_splits` splits, preserving in both the training and test parts the percentage of samples of each class (label) found in the original data.
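
The difference from a plain shuffle split can be seen in a small sketch. The labels below are hypothetical (60 negatives and 40 positives, not the real class counts), and the features are dummies because only the labels matter for splitting.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

# Hypothetical labels: 60 negatives and 40 positives.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for splitting

plain = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
stratified = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Share of positives in each test part.
plain_rates = [y_toy[test].mean() for _, test in plain.split(X_toy)]
strat_rates = [y_toy[test].mean() for _, test in stratified.split(X_toy, y_toy)]

print("ShuffleSplit positive share per test part:", plain_rates)
print("StratifiedShuffleSplit positive share per test part:", strat_rates)
```

With stratification every test part keeps the 40% positive share of the toy labels, while the plain shuffle lets that share drift from split to split.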

After setting the Cross-Validation splits, I am using learning_curve from sklearn.model_selection. [[source](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html)]

learning_curve provides a simple interface to the training and cross-validation process that we need to go through for each learning algorithm we want to assess; it returns the scores that we then plot.

An example of how to use this function is the following:

    train_sizes, train_scores, test_scores = learning_curve(
        # the estimator: an object that implements the "fit" and "predict" methods
        # (RandomForestClassifier comes from sklearn.ensemble)
        RandomForestClassifier(),
        # training set
        X,
        # training labels
        y,
        # StratifiedShuffleSplit instance
        cv=cv,
        # evaluation metric
        scoring='accuracy',
        # 5 different sizes of the training set
        train_sizes=np.linspace(0.01, 1.0, 5)
    )

This function clones the estimator for each training-set size in `train_sizes`. For every split produced by the cross-validation object (here `StratifiedShuffleSplit`, which shuffles the data and holds out the specified percentage for testing), it trains the clone on a subset of the training part of the requested size and scores it on both that subset and the test part.

Since we have `n_splits` = K splits, `learning_curve` returns K readings of `train_scores` and `test_scores` for each training size, so `test_scores.shape` is `(len(train_sizes), K)`. We therefore take the mean of each row, which averages the scores from the different splits for the same training sample size, and plot that.

Also, I use `fill_between` to shade the area between the mean training score minus and plus one standard deviation of the training scores:
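
The shapes described above can be checked with a short sketch. It assumes scikit-learn's bundled copy of the same data set, `load_breast_cancer`, instead of the CSV file, and uses a Decision Tree only because it trains quickly; the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedShuffleSplit, learning_curve
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled data set stands in for the CSV used in this notebook.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

n_splits = 10
cv_demo = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=0)
fractions = np.linspace(0.1, 1.0, 5)

sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo,
    cv=cv_demo, scoring="accuracy", train_sizes=fractions)

print(sizes)              # absolute number of training samples per point
print(test_scores.shape)  # one row per training size, one column per split

# Average over the splits (columns) to get one value per training size.
test_mean = test_scores.mean(axis=1)
test_std = test_scores.std(axis=1)
print(test_mean)
```

Each row of `test_scores` holds the K scores for one training size, so the row-wise mean and standard deviation are exactly what the curve and its shaded band show.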

`plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")` 

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit 
from sklearn.model_selection import learning_curve

from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

In [None]:
def plot_learning_curve(estimator, title, X, y, ylim=None, 
                        cv=None, n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Number of training examples")
    plt.ylabel("Accuracy Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
#     print(f"\nshape of train_score is : {train_scores.shape}\n")
    train_scores_mean = np.mean(train_scores, axis=1)
    print("Training accuracy scores for different training sizes are:\n{0}".format(train_scores_mean))
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    print("Testing accuracy scores for different training sizes are:\n{0}".format(test_scores_mean))
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Testing score")
    plt.legend(loc="best")
    plt.show()
    return plt

In [None]:
# 1 - Decision Tree
title = "Learning curves of Decision Tree\n" \
        "Cross Validation of 10 splits\n" \
        "Training-Test: 80-20%"
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
estimator = DecisionTreeClassifier()
plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=-1)

The graph shows a perfect accuracy of 1.0 on training but a much lower best accuracy of ~0.938 on testing, which implies high variance and overfitting. This is expected because Decision Trees are highly expressive and can therefore overfit quickly. The testing curve also suggests that adding more data is likely to improve testing accuracy. Simplifying the model by pruning, for example by limiting the maximum depth or requiring a minimum number of samples per leaf, is also expected to improve accuracy on testing.
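
A minimal sketch of how such pruning could be tried is shown below. It again assumes scikit-learn's bundled `load_breast_cancer` data rather than the CSV, and the values `max_depth=3` and `min_samples_leaf=5` are arbitrary examples, not tuned settings, so the printed scores are only indicative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled data set stands in for the CSV used in this notebook.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
cv_demo = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# Hypothetical pruning settings to compare against the unpruned tree.
candidates = {
    "unpruned": DecisionTreeClassifier(random_state=0),
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=0),
    "min_samples_leaf=5": DecisionTreeClassifier(min_samples_leaf=5, random_state=0),
}

results = {}
for name, tree in candidates.items():
    scores = cross_validate(tree, X_demo, y_demo, cv=cv_demo,
                            scoring="accuracy", return_train_score=True)
    # Store mean training and testing accuracy over the splits.
    results[name] = (scores["train_score"].mean(), scores["test_score"].mean())
    print(f"{name:>20}: train={results[name][0]:.3f}  test={results[name][1]:.3f}")
```

A smaller gap between the training and testing scores for a pruned tree would point to reduced variance; whether testing accuracy actually improves depends on the chosen settings.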

In [None]:
# 2 - Multilayer Perceptron Network
title = "Learning Curve for Multi Layer Perceptron (MLP)\n" \
        "Cross Validation of 10 splits\n" \
        "Training-Test: 80-20%"
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
estimator = MLPClassifier(random_state=42)
plot_learning_curve(estimator, title, X, y, (0.7, 1.01), cv=cv, n_jobs=4)

Nice performance indeed! There are no clear symptoms of overfitting or high variance. Accuracy on testing is better than with the Decision Tree algorithm.
Next I am trying AdaBoostClassifier, an ensemble classifier built from a group of 50 weak learners.

In [None]:
# 3 - Adaboost
title = "Learning Curve for Adaboost\n" \
        "Cross Validation of 10 splits,\n" \
        "Training-Test: 80-20%, n_estimators: 50, learning_rate: 1"
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
estimator = AdaBoostClassifier(n_estimators=50, learning_rate=1, random_state=0)
plot_learning_curve(estimator, title, X, y, (0.7, 1.01), cv=cv, n_jobs=4)

Good performance. More interestingly, the gap between the training and testing scores hints that we may be able to obtain better accuracy by regularizing the model.
Next is K-Nearest Neighbor; I select K = 5.

In [None]:
# 4 - K-Nearest Neighbor
title = "Learning Curves for K-Nearest Neighbor (KNN)\n" \
        "Cross Validation of 10 splits\n" \
        "Training-Test:80-20%, n_neighbors:5"
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
estimator = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
plot_learning_curve(estimator, title, X, y, (0.7, 1.01), cv=cv, n_jobs=4)

The graph shows good accuracy on both training and testing, comparable to the results of the best learning algorithm we have so far, the Multi-Layer Perceptron Neural Network. The gap between the two curves is narrow. All this implies a good fit to the data, with low variance and low bias.

Now we will look into SVM with two kernels: `rbf`, which can learn non-linear decision boundaries, and `linear`, which assumes the classes can be separated by a linear boundary:

In [None]:
# 5 - rbf Kernel - Support Vector Machines
title = "Learning Curves for Support Vector Machines (SVM)\n" \
        "Cross Validation of 100 splits\n" \
        "Training-Test:80-20%, kernel: rbf, gamma:0.001"
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
estimator = SVC(kernel='rbf', gamma=0.001, random_state=0)
plot_learning_curve(estimator, title, X, y, (0.7, 1.01), cv=cv, n_jobs=4)

In [None]:
# 6 - linear Kernel - Support Vector Machines
title = "Learning Curves for Support Vector Machines (SVM)\n" \
        "Cross Validation of 10 splits\n" \
        "Training-Test:80-20%, kernel: linear"
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
# gamma is ignored by the linear kernel, so it is not set here
estimator = SVC(kernel='linear', random_state=0)
plot_learning_curve(estimator, title, X, y, (0.7, 1.01), cv=cv, n_jobs=4)

The linear kernel achieves better results, which suggests that the classes are close to linearly separable. The small gap and the similarly high accuracy on both testing and training indicate a good fit to the data, low variance, and low bias. The accuracy is on par with the best results we have so far, from the Multi-Layer Perceptron Neural Network.