This project attempts to predict whether a breast cancer mass is Malignant or Benign based on 30 features of the cell nuclei, as gathered using the fine needle aspirate method. The dataset has 569 records, which are used for the training, test, and cross-validation steps.

We include all available predictor versions in our model <b>(<i>Mean, Standard Error, and Worst</i>)</b> for the 10 core predictors (<b>Radius, Texture, Perimeter, Area, Smoothness, Compactness, Concavity, Concave Points, Symmetry, and Fractal Dimension</b>).

<b>Findings</b>: Including all available predictor versions in our model (Mean, Standard Error, and Worst) for the 10 core predictors (Radius, Texture, Perimeter, Area, Smoothness, Compactness, Concavity, Concave Points, Symmetry, and Fractal Dimension), we achieve <b>~99.42% accuracy with Multilayer Perceptron Classification and a cross-validation score of ~95%</b> (the same CV score as our Random Forest, but a higher testing accuracy). Our Random Forest Classifier achieves an <b>accuracy of ~98%</b> and a <b>cross-validation score of ~95%</b>. SVM achieves the lowest accuracy of ~96%.

Acknowledgments: Thanks to Buddhini W. for a well-articulated treatment of this dataset: [1st Place Kaggle Submission - author: Buddhini W.](https://www.kaggle.com/buddhiniw/d/uciml/breast-cancer-wisconsin-data/breast-cancer-prediction)

<h6> Dataset Description </h6>
<p> source: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29<br>
source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data<br>
1st place submission: https://www.kaggle.com/buddhiniw/d/uciml/breast-cancer-wisconsin-data/breast-cancer-prediction<br>

Features are computed from a digitized image of a fine needle aspirate (FNA) 
of a breast mass. They describe characteristics of the cell nuclei present 
in the image. The linear program used to obtain the separating plane in the 3-dimensional space is that described in: 
[K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of 
Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server: 
    ftp ftp.cs.wisc.edu cd math-prog/cpo-dataset/machine-learn/WDBC/

Also can be found on UCI Machine Learning Repository: 
    https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Attribute Information:

1) ID number 2) Diagnosis (M = malignant, B = benign) 

3-32)

Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)<br> 
b) texture (standard deviation of gray-scale values)<br> 
c) perimeter<br> 
d) area<br> 
e) smoothness (local variation in radius lengths)<br> 
f) compactness (perimeter^2 / area - 1.0)<br> 
g) concavity (severity of concave portions of the contour)<br> 
h) concave points (number of concave portions of the contour)<br> 
i) symmetry<br> 
j) fractal dimension ("coastline approximation" - 1)<br>

The mean, standard error and "worst" or largest (mean of the three largest values)
of these features were computed for each image, resulting in 30 features.
For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
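
The layout described above can be sketched as follows. The column names follow the Kaggle CSV header used in the code cells of this notebook (note the space in `concave points`); the `fields` list and the helper `field_name` are introduced here for illustration only.

```python
# Ten core measurements, in the order listed above.
core = ['radius', 'texture', 'perimeter', 'area', 'smoothness',
        'compactness', 'concavity', 'concave points', 'symmetry',
        'fractal_dimension']

# Three versions of each measurement: mean, standard error, worst.
versions = ['mean', 'se', 'worst']

# Fields 1 and 2 are the ID and the diagnosis; fields 3-32 are the features,
# grouped by version (all means, then all SEs, then all worsts).
fields = ['id', 'diagnosis'] + [f'{c}_{v}' for v in versions for c in core]

def field_name(n):
    """Return the column name for 1-based field number n (illustrative helper)."""
    return fields[n - 1]

print(len(fields) - 2)                                  # 30 features
print(field_name(3), field_name(13), field_name(23))    # radius_mean radius_se radius_worst
```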

All feature values are recoded with four significant digits.


In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as s
# sklearn.cross_validation was removed in scikit-learn 0.20; these helpers live in model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn import svm
from sklearn.neural_network import MLPClassifier
import matplotlib.pyplot as plt

In [None]:
#%% Get and Clean Data

#Read data as pandas dataframe
d = pd.read_csv('../input/data.csv')

df = d.drop('Unnamed: 32', axis=1)

#if using diagnosis as categorical
df.diagnosis = df.diagnosis.astype('category')

#Create references to subset predictor and outcome variables
x = list(df.drop('diagnosis',axis=1).drop('id',axis=1))
y ='diagnosis'

# -- Feature Normalization / Scaling -----------------------------------------
#  Normalize features for SVM and MLPClassifier
#-----------------------------------------------------------------------------
df2 = df[x]
df_norm = (df2 - df2.mean()) / (df2.max() - df2.min())
df_norm = pd.concat([df_norm, df[y]], axis=1)
#-----------------------------------------------------------------------------

#show first 10 rows
df.head(10)

In [None]:
#Explore correlations
plt.rcParams['figure.figsize']=(12,8)
s.set(font_scale=1.4)
s.heatmap(df.drop('diagnosis', axis=1).drop('id',axis=1).corr(), cmap='coolwarm')

**<h4> Mean versions of the 10 Core Predictors </h4><br>**
The boxplots below show the "mean" value for the 10 core features in the dataset. These are ranked as the most important features in the model we fit (see Feature Importances below) in terms of classifying the breast cancer mass as Malignant (M) or Benign (B).

The charts reveal a tendency for the average value of a feature to be higher for malignant diagnoses than for the benign class. This is true for every feature except <b>Fractal Dimension Mean</b>, which shows little difference between M and B diagnoses. <b>Radius Mean</b>, on the other hand, shows a more distinct distribution for M vs. B diagnoses, and is subsequently found to be the most important feature according to our fitted Random Forest model (see Feature Importances below).

In [None]:
plt.rcParams['figure.figsize']=(10,5)
f, (ax1, ax2, ax3, ax4, ax5) = plt.subplots(1,5)
# Pass x by keyword: recent seaborn releases treat the first positional argument as `data`
s.boxplot(x='diagnosis',y='radius_mean',data=df, ax=ax1)
s.boxplot(x='diagnosis',y='texture_mean',data=df, ax=ax2)
s.boxplot(x='diagnosis',y='perimeter_mean',data=df, ax=ax3)
s.boxplot(x='diagnosis',y='area_mean',data=df, ax=ax4)
s.boxplot(x='diagnosis',y='smoothness_mean',data=df, ax=ax5)
f.tight_layout()

f, (ax1, ax2, ax3, ax4, ax5) = plt.subplots(1,5)
s.boxplot(x='diagnosis',y='compactness_mean',data=df, ax=ax2)
s.boxplot(x='diagnosis',y='concavity_mean',data=df, ax=ax1)
s.boxplot(x='diagnosis',y='concave points_mean',data=df, ax=ax3)
s.boxplot(x='diagnosis',y='symmetry_mean',data=df, ax=ax4)
s.boxplot(x='diagnosis',y='fractal_dimension_mean',data=df, ax=ax5)
f.tight_layout()

**<h4> Standard Error versions of the 10 Core Predictors </h4><br>**
Visualizing the <i>Standard Error</i> feature columns below, we see much larger spreads between the maximum and average values than observed in the <i>Mean</i> features above.

We also see some similarity in average values for M vs. B in some of these standard error derived features that we did not observe in the mean derived features. For example, below <b>Texture SE</b> shows a similar flatness across the mean value for M vs. B. 

<b>Smoothness SE</b> also has a much smaller difference in mean value for M vs. B.  In the case of <b>Symmetry SE</b> the average for M is actually smaller than that for B, which is the opposite dynamic of the <b>Symmetry Mean</b> feature as seen above.  

In [None]:
#%%
plt.rcParams['figure.figsize']=(10,5)
f, (ax1, ax2, ax3, ax4, ax5) = plt.subplots(1,5)
# Pass x by keyword: recent seaborn releases treat the first positional argument as `data`
s.boxplot(x='diagnosis',y='radius_se',data=df, ax=ax1, palette='cubehelix')
s.boxplot(x='diagnosis',y='texture_se',data=df, ax=ax2, palette='cubehelix')
s.boxplot(x='diagnosis',y='perimeter_se',data=df, ax=ax3, palette='cubehelix')
s.boxplot(x='diagnosis',y='area_se',data=df, ax=ax4, palette='cubehelix')
s.boxplot(x='diagnosis',y='smoothness_se',data=df, ax=ax5, palette='cubehelix')
f.tight_layout()

f, (ax1, ax2, ax3, ax4, ax5) = plt.subplots(1,5)
s.boxplot(x='diagnosis',y='compactness_se',data=df, ax=ax2, palette='cubehelix')
s.boxplot(x='diagnosis',y='concavity_se',data=df, ax=ax1, palette='cubehelix')
s.boxplot(x='diagnosis',y='concave points_se',data=df, ax=ax3, palette='cubehelix')
s.boxplot(x='diagnosis',y='symmetry_se',data=df, ax=ax4, palette='cubehelix')
s.boxplot(x='diagnosis',y='fractal_dimension_se',data=df, ax=ax5, palette='cubehelix')
f.tight_layout()

**<h4> Worst versions of the 10 Core Predictors </h4><br>**
Finally, we look at the <i>Worst</i> set of features for the 10 core metrics. Interestingly, these features show distributions more similar to the <i>Mean</i> columns than the <i>Standard Error</i> columns do; however, they are ranked lower in feature importance than the <i>Standard Error</i> predictors.

A visual inspection shows that these features share the tendency toward higher average values for diagnosis == M vs. diagnosis == B. Given this similarity in distribution to the most important <i>Mean</i> features, and taking into account the low importance ranking even compared to <i>SE</i> features, we decide to include all available predictors to achieve an improved accuracy that may capture dynamics previously unaccounted for by the classifier's ranking of the predictors' importances (see Feature Importances below).

<i><b>Of note:</b></i> The <b>Fractal Dimension</b> core metric shows a similar dynamic of higher average value for M vs. B only when looking at the <b>Fractal Dimension Worst</b> feature values.

In [None]:
plt.rcParams['figure.figsize']=(10,5)
f, (ax1, ax2, ax3, ax4, ax5) = plt.subplots(1,5)
# Pass x by keyword: recent seaborn releases treat the first positional argument as `data`
s.boxplot(x='diagnosis',y='radius_worst',data=df, ax=ax1, palette='coolwarm')
s.boxplot(x='diagnosis',y='texture_worst',data=df, ax=ax2, palette='coolwarm')
s.boxplot(x='diagnosis',y='perimeter_worst',data=df, ax=ax3, palette='coolwarm')
s.boxplot(x='diagnosis',y='area_worst',data=df, ax=ax4, palette='coolwarm')
s.boxplot(x='diagnosis',y='smoothness_worst',data=df, ax=ax5, palette='coolwarm')
f.tight_layout()

f, (ax1, ax2, ax3, ax4, ax5) = plt.subplots(1,5)
s.boxplot(x='diagnosis',y='compactness_worst',data=df, ax=ax2, palette='coolwarm')
s.boxplot(x='diagnosis',y='concavity_worst',data=df, ax=ax1, palette='coolwarm')
s.boxplot(x='diagnosis',y='concave points_worst',data=df, ax=ax3, palette='coolwarm')
s.boxplot(x='diagnosis',y='symmetry_worst',data=df, ax=ax4, palette='coolwarm')
s.boxplot(x='diagnosis',y='fractal_dimension_worst',data=df, ax=ax5, palette='coolwarm')
f.tight_layout()

<hr>
<center><h3><i> Fitting a Random Forest Classifier </i></h3></center>

Below, we fit a RandomForestClassifier() with 1000 trees (n_estimators=1000) and then cross-validate using sklearn's native `cross_val_score()` function.

Here we are looking at all predictors, that is 3 versions of the 10 core predictors totaling 30 features. 
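
As a rough illustration of what 10-fold cross-validation does with the training rows, the sketch below splits row indices into ten validation folds. It is a plain, unshuffled k-fold written for this note (the function name `kfold_indices` is hypothetical); for classifiers, `cross_val_score` with an integer `cv` uses a stratified variant that also balances the class proportions in each fold. The figure of 398 training rows assumes a 70/30 split of the 569 records.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs for a plain, unshuffled k-fold split."""
    # The first n_samples % n_folds folds get one extra row so every row is validated once.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size

folds = list(kfold_indices(398, 10))
fold_sizes = [len(valid) for _, valid in folds]
print(fold_sizes)   # eight folds of 40 rows, then two folds of 39
```

Each model is fit ten times, each time on roughly 358 rows and scored on the remaining fold, which is why the per-fold scores vary more than the single test-set accuracy.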

In [None]:
#--------------------------------------------------------------------------------------#
# Train Random Forest
np.random.seed(10)

traindf, testdf = train_test_split(df, test_size = 0.3)

x_train = traindf[x]
y_train = traindf[y]

x_test = testdf[x]
y_test = testdf[y]

forest = RandomForestClassifier(n_estimators=1000)
fit = forest.fit(x_train, y_train)
accuracy = fit.score(x_test, y_test)
predict = fit.predict(x_test)
cmatrix = confusion_matrix(y_test, predict)

#--------------------------------------------------------------------------------------#
# Report test accuracy, then perform 10-fold cross-validation on the training set

print('Accuracy of Random Forest: {0:.2%}'.format(accuracy))

# Cross-validation: cross_val_score clones the estimator and refits it on each fold
v = cross_val_score(fit, x_train, y_train, cv=10)
for score in v:
    print('Cross Validation Score: {0:.2%}'.format(score))

A visualization of the confusion matrix below reveals that the <b>1.75% error</b> is the result of our model misclassifying 3 cases as Malignant when they were actually Benign.

We see, however, that our classification accuracy for a mass being Benign when it is actually Benign is a perfect 100%. The cross-validation performed above nonetheless reveals a higher expected out-of-sample error (~4.2%), so we expect that further tests of this model on fresh data will be accompanied either by misclassifications of Benign as well as Malignant cases, or by an increase in false positives, where Benign cases are classified as Malignant.
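
The percentages quoted here follow from the size of the held-out set. A small worked check, assuming scikit-learn's convention of rounding the test-set size up when `test_size` is a fraction (so 30% of 569 records gives 171 test rows); the helper `accuracy` is defined here only for the arithmetic:

```python
import math

n_records = 569
n_test = math.ceil(0.3 * n_records)   # held-out rows under the rounding assumption above
n_train = n_records - n_test

def accuracy(n_errors, n_total):
    """Fraction of correctly classified rows."""
    return (n_total - n_errors) / n_total

print(n_train, n_test)                               # 398 171
print('{0:.2%}'.format(1 - accuracy(3, n_test)))     # 1.75% error with 3 misclassified rows
print('{0:.2%}'.format(accuracy(1, n_test)))         # 99.42% accuracy with 1 misclassified row
```

With only 171 test rows, each misclassified case moves the accuracy by roughly 0.58 percentage points, so small differences between models on this split should be read with caution.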

In [None]:
plt.rcParams['figure.figsize']=(14,8)
ax = plt.axes()
s.heatmap(cmatrix, annot=True, fmt='d', ax=ax, cmap='BrBG', annot_kws={"size": 30})
ax.set_title('Random Forest Confusion Matrix')

In [None]:
#%%Feature importances
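importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]   # column positions, most important first

print("Feature ranking:")
for rank, idx in enumerate(indices, start=1):
    # Index the name and the importance with the same position so they stay paired
    print("%d. feature %s (%f)" % (rank, x[idx], importances[idx]))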

Gini Importances of our predictors sorted descending as derived from our Random Forest model.  

Despite the high importance of ‘Mean’ features, we forego eliminating features based on low importance (according to RF classifier) and/or potential multi-collinearity.  Feature optimization is covered in prior research, and our goal is to test accuracy using all features.
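
When ranking importances, each feature name has to stay paired with its own score while sorting. A minimal sketch of that pairing, using made-up importance values for three features (the numbers are placeholders, not results from this model):

```python
# Hypothetical importances in the original column order (placeholder values).
names = ['radius_mean', 'texture_se', 'perimeter_worst']
importances = [0.20, 0.05, 0.75]

# Sort positions by descending importance, then apply the same order to both lists.
order = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
ranking = [(names[i], importances[i]) for i in order]

for name, score in ranking:
    print('%s (%f)' % (name, score))
```

With NumPy arrays the same idea reads `np.array(names)[indices]` alongside `importances[indices]`, where `indices = np.argsort(importances)[::-1]`.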

In [None]:
# Reorder the feature names with the same indices used to sort the importances
feat_imp = pd.DataFrame({'Feature':[x[i] for i in indices],
                        'Gini importance':importances[indices]})
plt.rcParams['figure.figsize']=(8,12)
s.set_style('whitegrid')
ax = s.barplot(x='Gini importance', y='Feature', data=feat_imp)
ax.set(xlabel='Gini Importance')
plt.show()

**<hr>
<center><h3><i> Fitting a Support Vector Machine </i></h3></center>**
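
The SVM and the MLP below are fit on `df_norm`, in which each feature is mean-normalized as `(value - mean) / (max - min)`. A minimal pure-Python sketch of that transform on a toy column follows; the values and the helper names `fit_mean_range` and `apply_mean_range` are illustrative. It estimates the statistics on training rows only and reuses them for test rows, which is a stricter variant than computing them on the full table.

```python
def fit_mean_range(values):
    """Return the (mean, range) needed for mean normalization."""
    mean = sum(values) / len(values)
    return mean, max(values) - min(values)

def apply_mean_range(values, mean, value_range):
    """Center on the mean and divide by the range."""
    return [(v - mean) / value_range for v in values]

train_col = [10.0, 12.0, 14.0, 20.0]   # toy training values
test_col = [11.0, 22.0]                # toy test values

mean, value_range = fit_mean_range(train_col)
train_scaled = apply_mean_range(train_col, mean, value_range)
test_scaled = apply_mean_range(test_col, mean, value_range)
print(train_scaled)
print(test_scaled)
```

The scaled training values sum to zero and span a range of exactly 1; test values can fall slightly outside that span when they exceed the training minimum or maximum.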

In [None]:
#---------------------------------------------------------------------------------------#
# Train Support Vector Machine ---------------------------------------------------------#
#---------------------------------------------------------------------------------------#

np.random.seed(10)

traindf, testdf = train_test_split(df_norm, test_size = 0.3)

x_train = traindf[x]
y_train = traindf[y]

x_test = testdf[x]
y_test = testdf[y]

svmf = svm.SVC()
svm_fit = svmf.fit(x_train, y_train)
accuracy = svm_fit.score(x_test, y_test)
predict = svm_fit.predict(x_test)
svm_cm = confusion_matrix(y_test, predict)

#--------------------------------------------------------------------------------------#
# Report test accuracy, then perform 10-fold cross-validation on the training set
print('Accuracy of Support Vector Machine: {0:.2%}'.format(accuracy))

# Cross-validation: cross_val_score clones the estimator and refits it on each fold
v = cross_val_score(svm_fit, x_train, y_train, cv=10)
for score in v:
    print('Cross Validation Score: {0:.2%}'.format(score))

In [None]:
#   Visualize SVM Confusion Matrix
plt.rcParams['figure.figsize']=(14,8)
ax = plt.axes()
s.heatmap(svm_cm, annot=True, fmt='d', ax=ax, cmap="YlGnBu", annot_kws={"size": 30})
ax.set_title('Support Vector Machine Confusion Matrix')

**<hr>
<center><h3><i> Fitting an MLP Classifier </i></h3></center>**

In [None]:

#---------------------------------------------------------------------------------------#
# Train MLPClassifier ------------------------------------------------------------------#
#---------------------------------------------------------------------------------------#
np.random.seed(10)

traindf, testdf = train_test_split(df_norm, test_size = 0.3)

x_train = traindf[x]
y_train = traindf[y]

x_test = testdf[x]
y_test = testdf[y]

clf = MLPClassifier(solver='lbfgs', alpha=5, hidden_layer_sizes=(500,), random_state=10)
mlp_fit = clf.fit(x_train, y_train)
accuracy = mlp_fit.score(x_test, y_test)
predict = mlp_fit.predict(x_test)
mlp_cm = confusion_matrix(y_test, predict)

#--------------------------------------------------------------------------------------#
# Report test accuracy, then perform 10-fold cross-validation on the training set
print('Accuracy of Multilayer Perceptron: {0:.2%}'.format(accuracy))

# Cross-validation: cross_val_score clones the estimator and refits it on each fold
v = cross_val_score(mlp_fit, x_train, y_train, cv=10)
for score in v:
    print('Cross Validation Score: {0:.2%}'.format(score))

In [None]:
#   Visualize MLP Confusion Matrix
plt.rcParams['figure.figsize']=(14,8)
ax = plt.axes()
s.heatmap(mlp_cm, annot=True, fmt='d', ax=ax, annot_kws={"size": 30})
ax.set_title('Multilayer Perceptron Confusion Matrix')

**<h4><i>Conclusions and Remarks for Future Research:</i></h4>**

In spite of the higher importance of <i>Mean</i> category features, we find a high accuracy and cross-validation score by including all available predictor versions in our model <b>(<i>Mean, Standard Error, and Worst</i>)</b> for the 10 core predictors (<b>Radius, Texture, Perimeter, Area, Smoothness, Compactness, Concavity, Concave Points, Symmetry, and Fractal Dimension</b>).  

While our <b>Multilayer Perceptron Classifier</b> has the highest accuracy of <b>99.42%</b>, our study uses a very small dataset of only 569 observations, which must serve as both the training and test dataframes. Further adjustments will likely be necessary with increased data size. In addition, we may be able to improve accuracy incrementally by further parameter optimization (e.g. testing many different levels of the Alpha penalty in the MLPClassifier).

It would also be useful to understand why our SVM classifier achieves 100% accuracy on Malignant records while showing a higher error on Benign records than our Random Forest and Multilayer Perceptron classifiers, which achieve 100% accuracy on Benign records but show error on Malignant records.
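
One way to make this comparison concrete is to read per-class precision and recall off each confusion matrix. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes, in label order B then M); the helper `per_class_metrics` is hypothetical, and the counts are invented to show the arithmetic, not taken from the models above.

```python
def per_class_metrics(cm, labels):
    """Return {label: (precision, recall)} for a square confusion matrix.

    cm[i][j] counts rows whose true class is labels[i] and predicted class is labels[j].
    """
    result = {}
    for k, label in enumerate(labels):
        true_pos = cm[k][k]
        predicted = sum(row[k] for row in cm)   # column total: rows predicted as this class
        actual = sum(cm[k])                     # row total: rows that truly are this class
        precision = true_pos / predicted if predicted else float('nan')
        recall = true_pos / actual if actual else float('nan')
        result[label] = (precision, recall)
    return result

# Hypothetical counts: 3 Benign rows predicted as Malignant, no Malignant row missed.
cm = [[105, 3],
      [0, 63]]
scores = per_class_metrics(cm, ['B', 'M'])
for label, (precision, recall) in scores.items():
    print(label, round(precision, 4), round(recall, 4))
```

In this toy matrix every Benign prediction is correct (precision 1.0) and every Malignant row is found (recall 1.0), while Benign recall and Malignant precision both fall below 1. Reporting both numbers per class removes the ambiguity in a phrase like "100% accuracy on Benign".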

Furthermore, detail on the additional measurements that can be acquired by the FNA (Fine Needle Aspirate) technique on breast cancer mass nuclei should be explored.

----------
**Appendix**

In [None]:
diagnosis = df['diagnosis']
mean_cols = [col for col in df.columns if 'mean' in col]
meandf = pd.concat([diagnosis,df[mean_cols]], axis=1)

plt.rcParams['figure.figsize']=(12,12)
g = s.PairGrid(meandf, hue="diagnosis")
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)
g.add_legend();

plt.tight_layout()