# Comparison of ML Models at Predicting Breast Cancer

In [None]:
import numpy as np
import pandas as pd

from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

from scipy.stats import randint as sp_randint

%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("../input/data.csv",header = 0)
df.head()

# Prepare Data

In [None]:
# Remove unnecessary columns
df.drop('id',axis=1,inplace=True)
df.drop('Unnamed: 32',axis=1,inplace=True)

In [None]:
# Encode diagnosis as numerical values(B=0, M=1)
le = preprocessing.LabelEncoder()
le.fit(['M', 'B'])

df['diagnosis'] = le.transform(df['diagnosis'])
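
The `B=0, M=1` mapping in the comment above comes from `LabelEncoder` sorting the class labels, not from the order they are passed to `fit`. A minimal sketch to confirm this on a few hand-written labels (the variable names `classes` and `codes` are just for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts the labels it sees, so the order given to fit() does not matter
le = LabelEncoder()
le.fit(['M', 'B'])

classes = list(le.classes_)                      # labels in encoded order
codes = le.transform(['B', 'M', 'M']).tolist()   # integer codes for a few samples
print(classes, codes)
```

`le.inverse_transform` can be used later to map predictions back to `B`/`M`.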

In [None]:
df.describe()

# Visualize Data

One of the main goals of visualizing the data here is to observe which features are most helpful in predicting malignant or benign cancer. The other is to see general trends that may aid us in model selection and hyperparameter selection.

## Principal Component Analysis

The purpose for doing principal component analysis on the labeled data here is to observe the variance explained by each of the components and the associated weights assigned to each feature. The resulting output will aid in deciding on which features to drop.

In [None]:
from sklearn.decomposition import PCA

# Use every feature column (column 0 is the encoded diagnosis)
observables = df.iloc[:,1:]
pca = PCA(n_components=3)
pca.fit(observables)

# Dimension indexing
dimensions = ['Dimension {}'.format(i) for i in range(1,len(pca.components_)+1)]

# Individual PCA Components
components = pd.DataFrame(np.round(pca.components_, 4), columns = observables.keys())
components.index = dimensions

# Explained variance in PCA
ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1)
variance_ratios = pd.DataFrame(np.round(ratios, 4), columns = ['Explained Variance'])
variance_ratios.index = dimensions

print(pd.concat([variance_ratios, components], axis = 1))

## Observations

It can be observed that **98.20%** of the variance is explained in dimension 1. This means that nearly all of the variance in the data can be described by one dimension. The remaining two dimensions describe a much smaller amount of variance. 

In dimension 1, most of the feature weight is associated with the **area_mean** and **area_worst** features. This was a surprise: my assumption was that the mean values alone would describe most of the variance in the data. Due to this observation, in the next step, I will visualize how well each of the **mean** features, as well as **area_worst** and **perimeter_worst**, explains the resulting diagnosis.

While I will not be using PCA in the actual machine learning phase, this describes the data well and helps understand which features should be further investigated for their importance in the final prediction.
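
One caveat worth keeping in mind: the PCA above is fit on unstandardized features, and PCA favors whichever features have the largest numeric variance. That may be part of why the area features dominate dimension 1, although I have not checked this on the real data. The sketch below uses purely synthetic data (the array `X_demo` and its scales are assumptions for illustration) to show how a single large-scale feature can absorb nearly all of the explained variance until the data is standardized.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three independent synthetic features; the second has a much larger scale
X_demo = np.column_stack([
    rng.normal(0, 1, 500),
    rng.normal(0, 1000, 500),
    rng.normal(0, 1, 500),
])

# Share of variance captured by the first component, before and after scaling
raw_ratio = PCA(n_components=3).fit(X_demo).explained_variance_ratio_[0]
scaled_ratio = PCA(n_components=3).fit(
    StandardScaler().fit_transform(X_demo)
).explained_variance_ratio_[0]
print(round(raw_ratio, 4), round(scaled_ratio, 4))
```

Running the notebook's PCA on `StandardScaler().fit_transform(observables)` would be one way to see whether the 98.20% figure is mostly a scale effect.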

# Feature Selection
Along with my initial hypothesis that the *mean value* features were important in predicting cancer type, **area_worst** and **perimeter_worst** will also be investigated due to their weighted importance in the previous PCA observation step.

In [None]:
# Observe correlation to the diagnosis
tst = df.corr()['diagnosis'].copy()
tst = tst.drop('diagnosis')
tst.sort_values(inplace=True)
tst.plot(kind='bar', alpha=0.6)

In [None]:
# Separate out malignant and benign data for graphing
malignant = df[df['diagnosis'] ==1]
benign = df[df['diagnosis'] ==0]

In [None]:
# Column names to observe in following graphs - the ten mean features plus area_worst and perimeter_worst
observe = list(df.columns[1:11]) + ['area_worst'] + ['perimeter_worst']
observables = df.loc[:,observe]

In [None]:
plt.rcParams.update({'font.size': 8})
plot, graphs = plt.subplots(nrows=6, ncols=2, figsize=(8,10))
graphs = graphs.flatten()
for idx, graph in enumerate(graphs):
    feature = observe[idx]
    # Use 50 equal-width bins shared by both classes so the histograms are comparable
    binwidth = (df[feature].max() - df[feature].min()) / 50
    bins = np.arange(df[feature].min(), df[feature].max() + binwidth, binwidth)
    # `density=True` replaces the `normed=True` argument removed from newer matplotlib releases
    graph.hist([malignant[feature], benign[feature]], bins=bins, alpha=0.6, density=True, label=['Malignant','Benign'], color=['red','blue'])
    graph.legend(loc='upper right')
    graph.set_title(feature)
plt.tight_layout()

## Observations
From the graphs, we can see that **radius_mean, perimeter_mean, area_mean, concavity_mean** and **concave_points_mean** are useful in predicting cancer type due to the distinct grouping between malignant and benign cancer types in these features. **area_worst** and **perimeter_worst** are also quite useful.
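
The histograms above are only comparable because both classes share the same bin edges and each class is normalized to unit area, which matters here since the two classes differ in size. A small NumPy-only sketch of that idea, using made-up values in place of a real feature (`malignant_vals` and `benign_vals` are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for one feature split by diagnosis
malignant_vals = rng.normal(20, 3, 200)
benign_vals = rng.normal(12, 2, 350)

# Shared bin edges spanning both groups (50 bins)
both = np.concatenate([malignant_vals, benign_vals])
edges = np.linspace(both.min(), both.max(), 51)

# density=True rescales each histogram so its total area is 1
m_density, _ = np.histogram(malignant_vals, bins=edges, density=True)
b_density, _ = np.histogram(benign_vals, bins=edges, density=True)

widths = np.diff(edges)
m_area = (m_density * widths).sum()
b_area = (b_density * widths).sum()
print(m_area, b_area)
```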


In [None]:
color_wheel = {0: "blue", 1: "red"}
colors = df["diagnosis"].map(lambda x: color_wheel.get(x))
# scatter_matrix lives in pandas.plotting in current pandas releases
pd.plotting.scatter_matrix(observables, c=colors, alpha = 0.5, figsize = (15, 15), diagonal = 'kde');

## Observations

The scatter matrix clarifies a few more points. **perimeter_mean, area_mean** and **radius_mean** have a strong, positive, linear correlation with one another. Most of the other features also show a rougher linear correlation with each other, with the exception of **fractal_dimension_mean, symmetry_mean** and **smoothness_mean**.
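
The tight relationship between radius, perimeter and area is what geometry would predict. As a rough illustration only, assuming idealized circular nuclei (which the real measurements are not), perimeter is exactly proportional to radius and area grows with its square, so all three carry largely the same information. The radii below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
radius = rng.uniform(6, 28, 500)     # hypothetical radii, not the dataset's values
perimeter = 2 * np.pi * radius       # circle circumference
area = np.pi * radius ** 2           # circle area

corr_rp = np.corrcoef(radius, perimeter)[0, 1]
corr_ra = np.corrcoef(radius, area)[0, 1]
print(corr_rp, corr_ra)
```

Because these features are so redundant, keeping all three may add little beyond keeping one of them.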

Within these three features we can see quite a bit of mixing between malignant and benign cancer in the scatter matrix. This supports the impression from the histograms above that they do little to aid in predicting cancer type: there is less correlation and less separability between the two diagnoses.

Given the lack of clear separability and the little variance they explain, I feel comfortable dropping them.

### Trimming Data
From observing the graphs and PCA data above: fractal_dimension_mean, smoothness_mean and symmetry_mean are not very useful in predicting the type of cancer. To aid in the learning process and remove noise, these columns will be dropped.

In [None]:
# Drop columns that do not aid in predicting type of cancer
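# Reassign instead of dropping in place: observables is a slice of df, and in-place edits on a slice can raise SettingWithCopyWarning
observables = observables.drop(['fractal_dimension_mean', 'smoothness_mean', 'symmetry_mean'], axis=1)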

# Classification

Here a comparison will be made between the different types of learning algorithms. At the end, a breakdown of the results and an explanation of each algorithm's performance will be given.

In [None]:
# Split data appropriately
X = observables
y = df['diagnosis']

## Naive Bayes

In [None]:
gnb = GaussianNB()
gnb_scores = cross_val_score(gnb, X, y, cv=10, scoring='accuracy')
print(gnb_scores.mean())

### Gaussian Naive Bayes Findings
Gaussian Naive Bayes had an accuracy score of **0.92**. While this is not ideal, it is not a terrible score for an algorithm as simple as Naive Bayes. NB likely performed well because, as seen above, the malignant and benign distributions are well separated in several of the features.
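
The single mean reported above hides how much the score varies from fold to fold. The sketch below looks at the per-fold spread. To stay self-contained it uses scikit-learn's bundled `load_breast_cancer` data, which I assume corresponds to the CSV used here, with all 30 features rather than the trimmed set, so its numbers are not expected to match the 0.92 above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Bundled Wisconsin diagnostic data (all 30 features).
# Note: in this copy the target is encoded 0 = malignant, 1 = benign.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# One accuracy value per fold
fold_scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=10, scoring='accuracy')
print(fold_scores.round(3))
print(fold_scores.mean(), fold_scores.std())
```

With a dataset this small, differences of one or two points between models can fall within the fold-to-fold spread.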

## KNN

In [None]:
# Decide what k should be for KNN
knn = KNeighborsClassifier()

k_range = list(range(1, 30))
leaf_size = list(range(1,30))
weight_options = ['uniform', 'distance']
algorithm = ['auto', 'ball_tree', 'kd_tree', 'brute']
param_grid = {'n_neighbors': k_range, 'leaf_size': leaf_size, 'weights': weight_options, 'algorithm': algorithm}

In [None]:
rand_knn = RandomizedSearchCV(knn, param_grid, cv=10, scoring="accuracy", n_iter=100, random_state=42)
rand_knn.fit(X,y)

It looks as though any value for K past 20 would work well but the simpler the better.

In [None]:
print(rand_knn.best_score_)
print(rand_knn.best_params_)
print(rand_knn.best_estimator_)

### KNN Findings

Utilizing randomized hyperparameter search along with cross validation resulted in a KNN model with an accuracy score of **0.93**. The model that was chosen by *RandomizedSearchCV* is as follows: {'weights': 'uniform', 'n_neighbors': 14, 'leaf_size': 22, 'algorithm': 'ball_tree'}.

I do not believe KNN is optimal for this problem so a more involved comparison of results will be made after several more tests.
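
To give a sense of how much work the randomized search saves, the sketch below counts the combinations in the KNN grid defined above and draws the same number of candidates that `n_iter=100` would. It only enumerates parameter settings and does not fit any model.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_grid = {
    'n_neighbors': list(range(1, 30)),
    'leaf_size': list(range(1, 30)),
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
}

total = len(ParameterGrid(param_grid))   # every combination an exhaustive grid search would try
sampled = list(ParameterSampler(param_grid, n_iter=100, random_state=42))
print(total, len(sampled))
```

With 10-fold cross validation, an exhaustive search would need ten fits per combination, while the randomized search needs 1,000 fits in total. Note also that `leaf_size` and `algorithm` mainly affect how quickly neighbors are found rather than which neighbors are found, so most of the useful variation in this grid comes from `n_neighbors` and `weights`.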

## Decision Tree Classifier

In [None]:
dt_clf = DecisionTreeClassifier(random_state=42)

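# 'auto' was dropped from the options: for decision trees it was an alias for 'sqrt' and was removed in scikit-learn 1.3
param_grid = {'max_features': ['sqrt', 'log2'],
              'min_samples_split': sp_randint(2, 11), 
              'min_samples_leaf': sp_randint(1, 11)}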

In [None]:
rand_dt = RandomizedSearchCV(dt_clf, param_grid, cv=10, scoring="accuracy", n_iter=100, random_state=42)
rand_dt.fit(X,y)

In [None]:
print(rand_dt.best_score_)
print(rand_dt.best_params_)
print(rand_dt.best_estimator_)

### Decision Tree Findings

Utilizing randomized hyperparameter search along with cross validation resulted in a Decision Tree Classification model with an accuracy score of **0.95**. The model that was chosen by *RandomizedSearchCV* is as follows: {'max_features': 'auto', 'min_samples_leaf': 2, 'min_samples_split': 3}.

## Support Vector Machine Classifier

In [None]:
sv_clf = SVC(random_state=42)

param_grid = [
              {'C': [1, 10, 100, 1000], 
               'kernel': ['linear']
              },
              {'C': [1, 10, 100, 1000], 
               'gamma': [0.001, 0.0001], 
               'kernel': ['rbf']
              },
 ]

In [None]:
grid_sv = GridSearchCV(sv_clf, param_grid, cv=10, scoring="accuracy")
grid_sv.fit(X,y)

In [None]:
print(grid_sv.best_score_)
print(grid_sv.best_params_)
print(grid_sv.best_estimator_)

### SVM Findings

Utilizing an exhaustive grid search over the hyperparameters along with cross validation resulted in an SVM model with an accuracy score of **0.96**, which is quite good. The model that was chosen by *GridSearchCV* is as follows: {'C': 100, 'kernel': 'linear'}.

I found it interesting that the linear kernel performed significantly better than the RBF kernel, at least for the two `gamma` values tried here. It is worth taking more time to look into exactly how RBF and linear kernels behave. An interesting side note was the execution time of the linear kernel, which was much longer than that of RBF.
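
One thing I have not tested in this notebook is feature scaling. SVMs are sensitive to feature scale, and unscaled inputs are a plausible (but unverified) contributor to both the slow linear fits and the weaker RBF results. The sketch below is one way to try it, under the assumption that scikit-learn's bundled `load_breast_cancer` data is an acceptable stand-in for the CSV. The grid values are illustrative, not tuned.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline, so each CV fold is scaled using
# statistics from its own training portion only
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = [
    {'svc__kernel': ['linear'], 'svc__C': [1, 10, 100]},
    {'svc__kernel': ['rbf'], 'svc__C': [1, 10, 100], 'svc__gamma': [0.001, 0.01]},
]

grid = GridSearchCV(pipe, param_grid, cv=10, scoring='accuracy')
grid.fit(X_demo, y_demo)
print(grid.best_score_, grid.best_params_)
```

The `svc__` prefix is how `GridSearchCV` addresses parameters of the `SVC` step that `make_pipeline` names automatically.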

## Random Forest Classification

In [None]:
rf_clf = RandomForestClassifier(random_state=42)

param_grid = {"max_depth": [3, None],
              "max_features":  sp_randint(1, 8),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

In [None]:
rand_rf = RandomizedSearchCV(rf_clf, param_distributions=param_grid, n_iter=100, random_state=42)
rand_rf.fit(X,y)

In [None]:
print(rand_rf.best_score_)
print(rand_rf.best_params_)
print(rand_rf.best_estimator_)

### Random Forest Findings

Here we continue to see relatively accurate predictions. The classification accuracy is **0.95** which is pretty good given the size of the dataset.
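
Note that the random forest search above does not pass `cv` or `scoring`, so it uses scikit-learn's defaults rather than the 10-fold setup used for the earlier models. A hedged sketch of a search with those arguments made explicit is shown below. It runs on the bundled `load_breast_cancer` data (assumed equivalent to the CSV), and `n_estimators` and `n_iter` are deliberately small so it finishes quickly, so its score should not be compared with the 0.95 above.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = load_breast_cancer(return_X_y=True)

param_dist = {
    'max_depth': [3, None],
    'max_features': randint(1, 8),
    'min_samples_split': randint(2, 11),
    'min_samples_leaf': randint(1, 11),
}

# cv and scoring are stated explicitly so the setup matches the other models
search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=42),
    param_distributions=param_dist,
    n_iter=5, cv=10, scoring='accuracy', random_state=42,
)
search.fit(X_demo, y_demo)
print(search.n_splits_, search.best_score_)
```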

## AdaBoost Classifier

In [None]:
# Using decision stumps due to size of sample.
# Attempting to prevent over-fitting
stump_clf =  DecisionTreeClassifier(random_state=42, max_depth=1)

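param_grid = {
              # 'auto' was dropped from the options: it was an alias for 'sqrt' and was removed in scikit-learn 1.3
              "estimator__max_features": ['sqrt', 'log2'],
              "n_estimators": list(range(1,500)),
              "learning_rate": np.linspace(0.01, 1, num=20),
             }

In [None]:
# The `base_estimator` argument was renamed to `estimator` in scikit-learn 1.2
ada_clf = AdaBoostClassifier(estimator = stump_clf)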

rand_ada = RandomizedSearchCV(ada_clf, param_grid, scoring = 'accuracy', n_iter=100, random_state=42)
rand_ada.fit(X,y)

In [None]:
print(rand_ada.best_score_)
print(rand_ada.best_params_)
print(rand_ada.best_estimator_)

### AdaBoost Findings

As I expected, AdaBoost performed quite well, with an accuracy of **0.97**. I decided to use a decision stump as the base estimator because, given the size of the dataset, I wanted to reduce the possibility of overfitting by using a very simple model. I could perhaps have achieved better results by swapping in a random forest for the decision stump, but I feel that such a model would generalize less well. I'm much more confident in the generalization of this model.
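
For readers unfamiliar with the term, a decision stump is a tree with a single split, so on its own it can only separate the data along one feature threshold. The sketch below checks that structure and compares a lone stump with a boosted ensemble of stumps. It uses the bundled `load_breast_cancer` data (assumed equivalent to the CSV) and untuned settings, so the scores are only indicative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# A stump: one split, two leaves
stump = DecisionTreeClassifier(max_depth=1, random_state=42).fit(X_demo, y_demo)

stump_score = cross_val_score(
    DecisionTreeClassifier(max_depth=1, random_state=42), X_demo, y_demo, cv=10
).mean()
# AdaBoostClassifier's default base learner is already a depth-1 tree
boosted_score = cross_val_score(
    AdaBoostClassifier(n_estimators=50, random_state=42), X_demo, y_demo, cv=10
).mean()
print(stump.get_depth(), stump.get_n_leaves(), stump_score, boosted_score)
```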

# Conclusion 

The best model for diagnosing breast cancer in my comparative analysis is **AdaBoost** using a **Decision Stump** as its base estimator. AdaBoost had a prediction accuracy of **0.97**, with Support Vector Machine (**0.96**) a close second and Random Forest and Decision Tree (both **0.95**) close behind.

I believe that with further analysis of the data, especially the misclassified samples, I could improve these scores further. There are also many more algorithms that I could attempt, but with such a small dataset I wanted to keep it relatively simple.

The largest change in performance that I found was in the feature selection phase. There are large tradeoffs in choosing which features to keep, and the right choice is not always obvious. Visualizing the data the way I did above, along with the principal component analysis, aided in selecting the most useful features.


## Notes

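It is important to note that most models were chosen using K-Fold (10-Fold) cross validation with hyperparameter optimization using *RandomizedSearchCV*. There are exceptions: Gaussian Naive Bayes was only cross validated (it has no search step), the SVM was tuned with *GridSearchCV*, and the Random Forest and AdaBoost searches do not pass `cv`, so they use scikit-learn's default number of folds rather than 10. Where I used random search instead of the exhaustive grid search, it was simply to save time.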
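
A like-for-like comparison is easiest when every model is scored on exactly the same folds. The sketch below shows one way to do that by reusing a single `StratifiedKFold` splitter. It uses the bundled `load_breast_cancer` data (assumed equivalent to the CSV) with untuned models apart from the `n_neighbors=14` found earlier, so it illustrates the setup rather than reproducing the results above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# One splitter reused for every model, so all models see identical folds
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
models = {
    'naive_bayes': GaussianNB(),
    'knn': KNeighborsClassifier(n_neighbors=14),
    'decision_tree': DecisionTreeClassifier(random_state=42),
}
results = {
    name: cross_val_score(model, X_demo, y_demo, cv=folds, scoring='accuracy').mean()
    for name, model in models.items()
}
# Print models from best to worst mean accuracy
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {score:.3f}')
```

The same `folds` object can be passed as `cv` to `RandomizedSearchCV` and `GridSearchCV`.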