# INFO 4604 Final Project - Predicting if cancer is benign or not

## Amogh Jahagirdar and Ryan Rouleau

Dataset: [https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/home](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/home)

## Precursor Analysis/General Data Cleansing

In [2]:
import pandas as pd

data = pd.read_csv('./data/data.csv')

In [3]:
#Basic summary statistics
data.describe()

Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,0.0
mean,30371830.0,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,
std,125020600.0,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,
min,8670.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,
25%,869218.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,
50%,906024.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,
75%,8813129.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,
max,911320500.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,


In [4]:
# Drop the last column: the trailing "Unnamed: 32" column is all NaN (it does not exist when the CSV is opened in Excel)
data_cleaned = data.iloc[:, :-1]

# Drop ID (just a bookkeeping column that is part of the original data)
data_cleaned = data_cleaned.drop(columns="id")
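
Dropping by position assumes the empty column is always the last one. A slightly more defensive sketch is shown below; the three-row frame and the `toy` names are made up for illustration, and only the column names are taken from the summary above. It drops any column that is entirely NaN instead:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for data.csv, including the empty trailing column.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Remove columns where every value is NaN, then the bookkeeping id column.
toy_cleaned = toy.dropna(axis=1, how="all").drop(columns="id")
print(list(toy_cleaned.columns))
```

This variant keeps working if the position of the empty column changes, at the cost of also removing any other column that happens to be completely empty.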

In [93]:
# Analyze the balance of the data.
nrows = data_cleaned.shape[0]
print("Percentages of benign and malignant data is \n {}".format(100 * data_cleaned["diagnosis"].value_counts()/nrows))

Percentages of benign and malignant data is 
 B    62.741652
M    37.258348
Name: diagnosis, dtype: float64


As one can see, the dataset is imbalanced: about 63% of the cases are benign and 37% are malignant, which could affect our results.
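
One common way to keep such an imbalance from skewing a split is stratification. The sketch below is only an illustration: it uses synthetic labels with the 357/212 class counts implied by the percentages above, a placeholder feature, and hypothetical `*_toy` names. The split used in this notebook does not pass `stratify`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts implied by the percentages above (357 B, 212 M).
y_toy = np.array(["B"] * 357 + ["M"] * 212)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature, not real data

# stratify=y_toy asks for the same class proportions in both parts of the split.
_, _, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=123, stratify=y_toy
)

overall = np.mean(y_toy == "B")
train_frac = np.mean(y_tr_toy == "B")
test_frac = np.mean(y_te_toy == "B")
print("benign share overall/train/test: {:.3f} {:.3f} {:.3f}".format(overall, train_frac, test_frac))
```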

In [94]:
from sklearn.model_selection import train_test_split

# Split into training and testing sets
X = data_cleaned[data_cleaned.columns.difference(["diagnosis"])]
y = data_cleaned['diagnosis']
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.25,random_state=123)

## Baseline Model

Next, we create a simple baseline classifier with no feature extraction, using scikit-learn's `DummyClassifier` with the `most_frequent` strategy. It is not meant for actual classification; it is merely a benchmark showing what a classifier would score if it learned nothing from the features and always predicted the majority class (a minimum accuracy for our actual models). All of our models should perform much better than the `DummyClassifier`.

In [7]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

baseline = DummyClassifier(strategy='most_frequent', random_state=1234)
baseline.fit(X_train, Y_train)

print("Baseline Training accuracy: %0.6f" % accuracy_score(Y_train, baseline.predict(X_train)))
print("Baseline Testing accuracy: %0.6f" % accuracy_score(Y_test, baseline.predict(X_test)))

Baseline Training accuracy: 0.629108
Baseline Testing accuracy: 0.622378


Our models should therefore perform well above the roughly 62% accuracy of this baseline.
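
The baseline accuracy is simply the share of the most common label. A minimal sketch, using made-up labels (63 benign, 37 malignant) and hypothetical `*_toy` names rather than the real data, checks that `DummyClassifier` and a direct count agree:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels (hypothetical): 63 benign, 37 malignant.
y_toy = np.array(["B"] * 63 + ["M"] * 37)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
dummy_acc = dummy.score(X_toy, y_toy)

# Share of the most common label, computed directly.
_, counts = np.unique(y_toy, return_counts=True)
majority_share = counts.max() / counts.sum()
print(dummy_acc, majority_share)
```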

## Baseline Classification Algorithms (No feature selection, preprocessing, or hyperparameter tuning)
### Decision Tree Classifier

In [8]:
from sklearn.tree import DecisionTreeClassifier

decisionTree = DecisionTreeClassifier(random_state=1234)
decisionTree.fit(X_train, Y_train)

print("Baseline Decision Tree Training accuracy: %0.6f" % accuracy_score(Y_train, decisionTree.predict(X_train)))
print("Baseline Decision Tree Testing accuracy: %0.6f" % accuracy_score(Y_test, decisionTree.predict(X_test)))

Baseline Decision Tree Training accuracy: 1.000000
Baseline Decision Tree Testing accuracy: 0.958042


### Logistic Regression

In [9]:
from sklearn.linear_model import LogisticRegression

logisticRegression = LogisticRegression(random_state=1234)
logisticRegression.fit(X_train, Y_train)

print("Baseline Logistic Regression Training accuracy: %0.6f" % accuracy_score(Y_train, logisticRegression.predict(X_train)))
print("Baseline Logistic Regression Testing accuracy: %0.6f" % accuracy_score(Y_test, logisticRegression.predict(X_test)))

Baseline Logistic Regression Training accuracy: 0.948357
Baseline Logistic Regression Testing accuracy: 0.986014


### Support Vector Machine

In [10]:
from sklearn.svm import SVC 

svm = SVC(random_state=1234)
svm.fit(X_train, Y_train)

print("Baseline Support Vector Machine Training accuracy: %0.6f" % accuracy_score(Y_train, svm.predict(X_train)))
print("Baseline Support Vector Machine Testing accuracy: %0.6f" % accuracy_score(Y_test, svm.predict(X_test)))

Baseline Support Vector Machine Training accuracy: 1.000000
Baseline Support Vector Machine Testing accuracy: 0.622378


Using a support vector machine without any hyperparameter modifications also severely overfits, with a training accuracy of 100%. The test accuracy is concerning: it is exactly the same as that of the `most_frequent` baseline classifier, which suggests the model may be predicting the majority class for every test sample.
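
One plausible explanation, not verified against this notebook's environment, is the RBF kernel operating on unscaled features: when pairwise distances are large, the kernel value between any two distinct points is effectively zero, so the model memorizes the training set and falls back to the majority class everywhere else. The sketch below reproduces that pattern on synthetic data. The data, the explicit `gamma="auto"` setting, and the `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Synthetic, unscaled features with a large spread (hypothetical stand-in for raw columns).
X_demo = rng.normal(size=(200, 30)) * 1000.0
y_demo = np.array(["B"] * 130 + ["M"] * 70)  # imbalanced labels, unrelated to the features

# Interleaved split: 65 B / 35 M in each half.
X_tr_demo, y_tr_demo = X_demo[::2], y_demo[::2]
X_te_demo, y_te_demo = X_demo[1::2], y_demo[1::2]

# gamma="auto" (1 / n_features) was the default in older scikit-learn releases;
# that this notebook ran with it is an assumption.
svc_demo = SVC(kernel="rbf", gamma="auto", C=1.0).fit(X_tr_demo, y_tr_demo)

train_acc = svc_demo.score(X_tr_demo, y_tr_demo)
test_pred = svc_demo.predict(X_te_demo)
print("train accuracy:", train_acc)
print("distinct test predictions:", np.unique(test_pred))
```

If the same mechanism is at work here, standardizing the features (or choosing a much smaller `gamma`) should remove the effect.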

### Neural Net

In [95]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(random_state=1234,max_iter=1000)
mlp.fit(X_train, Y_train)

print("Baseline Neural Net Training accuracy: %0.6f" % accuracy_score(Y_train, mlp.predict(X_train)))
print("Baseline Neural Net Testing accuracy: %0.6f" % accuracy_score(Y_test, mlp.predict(X_test)))

Baseline Neural Net Training accuracy: 0.537559
Baseline Neural Net Testing accuracy: 0.531469


Our baseline MLP accuracy is poor: at about 53% on the test set, it is roughly 9 percentage points below the `most_frequent` baseline of about 62%. It will be interesting to see how much we can improve this, or whether there is simply not enough data.

## Feature preprocessing via Standard Scaling

In [67]:
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()

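# Note: X_std is scaled with statistics from the full dataset; it is only used later for error analysis
X_std = std_scaler.fit_transform(X)
# Refit the scaler on the training split only, then apply those statistics to the test split
X_train_std = std_scaler.fit_transform(X_train)
X_test_std = std_scaler.transform(X_test)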

# A little sanity check to see how models perform after scaling
models = [logisticRegression, decisionTree, mlp, svm]
for model in models:
    # Warm start by default is off so by calling fit it "retrains from scratch" which is what we want
    model.fit(X_train_std, Y_train)
    model_name = model.__class__.__name__
    train_accuracy = accuracy_score(Y_train, model.predict(X_train_std))
    test_accuracy = accuracy_score(Y_test, model.predict(X_test_std))
    print("Train accuracy for model {} after standardizing features: {}".format(model_name, train_accuracy))
    print("Test accuracy for model {} after standardizing features: {}".format(model_name, test_accuracy))

Train accuracy for model LogisticRegression after standardizing features: 0.9859154929577465
Test accuracy for model LogisticRegression after standardizing features: 0.993006993006993
Train accuracy for model DecisionTreeClassifier after standardizing features: 1.0
Test accuracy for model DecisionTreeClassifier after standardizing features: 0.958041958041958
Train accuracy for model MLPClassifier after standardizing features: 0.9929577464788732
Test accuracy for model MLPClassifier after standardizing features: 0.986013986013986
Train accuracy for model SVC after standardizing features: 0.9835680751173709
Test accuracy for model SVC after standardizing features: 0.986013986013986


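As we can see, standardizing the features substantially improves test accuracy for the SVM (from 62.2% to 98.6%) and the neural net (from 53.1% to 98.6%), and slightly for logistic regression (from 98.6% to 99.3%). The decision tree is unchanged, which is expected because tree splits do not depend on feature scale.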
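
For reference, standardization replaces each value with `(x - mean) / std`, where the mean and standard deviation are taken from the training split. The sketch below uses synthetic two-column data (hypothetical stand-ins for a large-scale and a small-scale feature) to check that a manual z-score matches `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic features on very different scales (hypothetical, not the real columns).
train_demo = rng.normal(loc=[650.0, 0.1], scale=[350.0, 0.01], size=(100, 2))
test_demo = rng.normal(loc=[650.0, 0.1], scale=[350.0, 0.01], size=(30, 2))

scaler = StandardScaler().fit(train_demo)

# Manual z-score using the training mean and population standard deviation (ddof=0).
mu = train_demo.mean(axis=0)
sigma = train_demo.std(axis=0)
test_manual = (test_demo - mu) / sigma

train_scaled = scaler.transform(train_demo)
test_scaled = scaler.transform(test_demo)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

The test split is transformed with the training statistics, so its columns are close to, but not exactly, zero mean and unit variance.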

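## Feature Selection and Cross Validation

Here we test different percentiles (`[1, 2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]`) of the best features as ranked by ANOVA F-value. For each percentile we run a 5-fold cross-validated grid search to find the best hyperparameters for that subset of features. We do this for all four of our models; the end result is the best overall classifier, the percentile of top features to include, and the corresponding hyperparameters. Note that each model's grid is given as a list of separate dictionaries, so `GridSearchCV` varies one hyperparameter at a time (holding the others at their defaults) rather than searching all combinations, which is why each reported `Best Params` contains a single entry.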
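
As a compact alternative formulation, the same idea can be sketched with a `Pipeline`, so that scaling and feature selection are refit inside every cross-validation fold and the percentile is tuned jointly with the model's hyperparameters. This is an illustration under assumptions, not the procedure used in this notebook: it loads scikit-learn's bundled copy of the Wisconsin data via `load_breast_cancer`, uses a reduced grid, and the step names (`scale`, `select`, `clf`) are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=123
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectPercentile(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# A single dict means every combination of percentile and C is evaluated.
param_grid = {
    "select__percentile": [10, 50, 100],
    "clf__C": [0.01, 0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr_demo, y_tr_demo)

test_acc = search.score(X_te_demo, y_te_demo)
print(search.best_params_, round(search.best_score_, 3), round(test_acc, 3))
```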



In [34]:
'''
WARNING: THIS MAY TAKE A WHILE TO RUN
'''

from sklearn.model_selection import GridSearchCV, cross_val_predict
from sklearn.feature_selection import SelectPercentile, f_classif



model_to_possible_params = {}

# Populate model_to_possible_params:
# Key is model_name, value is a list of dictionaries, each mapping a parameter to a list of potential values
for model in models:
    model_name = model.__class__.__name__
    tunable_params = None
    if model_name == 'LogisticRegression':
        tunable_params = [{'C': [0.001, 0.01,0.1,1,10,100]}]
    elif model_name == 'DecisionTreeClassifier':
        tunable_params = [{'max_depth':[1,2,4,8]}, {'min_samples_leaf': [1,2,3,5,8]}]
    elif model_name == 'SVC':
        tunable_params = [{'kernel': ['rbf', 'linear', 'poly', 'sigmoid']}, {'C': [0.001, 0.01, 0.1, 1, 10, 100]}]
    else:
        tunable_params = [{'activation':['identity', 'logistic', 'tanh', 'relu']}, 
                          {'alpha': [1e-04, 1e-03, 1e-02, 0.05, 1, 10]}, 
                          {'hidden_layer_sizes': [(3,), (3,5), (3,5,3), (3,5,5)]}]
    
    model_to_possible_params[model_name] = tunable_params
    
percentiles = [1, 2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Mapping from tuple (percentile, fitted GridSearchCV) to cross validation score

accuracy_results = {}

for model in models:
    
    relevant_params = model_to_possible_params[model.__class__.__name__]
    
    # (best percentile, accuracy, model)
    best_so_far = (None, -1, None)
    
    for percentile in percentiles:
        selection = SelectPercentile(percentile=percentile, score_func=f_classif)
        X_train_selected = selection.fit_transform(X_train_std, Y_train)
        gs_classifier = GridSearchCV(model, relevant_params, cv=5, n_jobs=4)
        gs_classifier.fit(X_train_selected, Y_train)
        accuracy_results[(percentile, gs_classifier)] = gs_classifier.best_score_
        
        if gs_classifier.best_score_ > best_so_far[1]:
            best_so_far = (percentile, gs_classifier.best_score_, gs_classifier) 
        
        
        print("Percentile: {}, Classifier: {}, Best Params: {}, Best Accuracy: {:.3f}".format(
            percentile,
            model.__class__.__name__,
            gs_classifier.best_params_,
            gs_classifier.best_score_
        ))
        
    print("---")
    print("Best Percentile: {}, Classifier: {}, Best Params: {}, Best Accuracy: {:.3f}".format(
        best_so_far[0],
        model.__class__.__name__,
        best_so_far[2].best_params_,
        best_so_far[1]
    ))
    print("")
    print("")


Percentile: 1, Classifier: LogisticRegression, Best Params: {'C': 0.001}, Accuracy: 0.908
Percentile: 2, Classifier: LogisticRegression, Best Params: {'C': 0.001}, Accuracy: 0.908
Percentile: 5, Classifier: LogisticRegression, Best Params: {'C': 0.01}, Accuracy: 0.930
Percentile: 10, Classifier: LogisticRegression, Best Params: {'C': 0.01}, Accuracy: 0.946
Percentile: 20, Classifier: LogisticRegression, Best Params: {'C': 100}, Accuracy: 0.953
Percentile: 30, Classifier: LogisticRegression, Best Params: {'C': 10}, Accuracy: 0.955
Percentile: 40, Classifier: LogisticRegression, Best Params: {'C': 1}, Accuracy: 0.944
Percentile: 50, Classifier: LogisticRegression, Best Params: {'C': 100}, Accuracy: 0.941
Percentile: 60, Classifier: LogisticRegression, Best Params: {'C': 0.1}, Accuracy: 0.969
Percentile: 70, Classifier: LogisticRegression, Best Params: {'C': 0.1}, Accuracy: 0.977
Percentile: 80, Classifier: LogisticRegression, Best Params: {'C': 0.1}, Accuracy: 0.977
Percentile: 90, Class

Percentile: 30, Classifier: MLPClassifier, Best Params: {'hidden_layer_sizes': (3, 5, 3)}, Accuracy: 0.958
Percentile: 40, Classifier: MLPClassifier, Best Params: {'activation': 'relu'}, Accuracy: 0.951
Percentile: 50, Classifier: MLPClassifier, Best Params: {'activation': 'relu'}, Accuracy: 0.946
Percentile: 60, Classifier: MLPClassifier, Best Params: {'activation': 'relu'}, Accuracy: 0.972
Percentile: 70, Classifier: MLPClassifier, Best Params: {'activation': 'logistic'}, Accuracy: 0.977
Percentile: 80, Classifier: MLPClassifier, Best Params: {'activation': 'identity'}, Accuracy: 0.979
Percentile: 90, Classifier: MLPClassifier, Best Params: {'activation': 'logistic'}, Accuracy: 0.979
Percentile: 100, Classifier: MLPClassifier, Best Params: {'alpha': 10}, Accuracy: 0.974
---
Best Percentile: 80, Classifier: MLPClassifier, Best Params: {'activation': 'identity'}, Accuracy: 0.979


Percentile: 1, Classifier: SVC, Best Params: {'C': 0.1}, Accuracy: 0.908
Percentile: 2, Classifier: SVC, Best Params: {'C': 0.1}, Accuracy: 0.908
Percentile: 5, Classifier: SVC, Best Params: {'kernel': 'rbf'}, Accuracy: 0.941
Percentile: 10, Classifier: SVC, Best Params: {'C': 0.1}, Accuracy: 0.944
Percentile: 20, Classifier: SVC, Best Params: {'C': 10}, Accuracy: 0.951
Percentile: 30, Classifier: SVC,

In general, using more features results in higher accuracy. The best models used 70%-100% of the features ranked by F-value. This is consistent with there being little redundancy in the feature set, but it does not establish it, since a regularized model can also perform well with correlated features. We can do a quick check of the correlation matrix to see if any features are strongly correlated.

In [20]:
%matplotlib inline  

import matplotlib.pyplot as plt

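# Exclude the string-valued diagnosis column so that only numeric features are correlated
corr = data_cleaned.drop(columns="diagnosis").corr()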
print(corr)

                         radius_mean  texture_mean  perimeter_mean  area_mean  \
radius_mean                 1.000000      0.323782        0.997855   0.987357   
texture_mean                0.323782      1.000000        0.329533   0.321086   
perimeter_mean              0.997855      0.329533        1.000000   0.986507   
area_mean                   0.987357      0.321086        0.986507   1.000000   
smoothness_mean             0.170581     -0.023389        0.207278   0.177028   
compactness_mean            0.506124      0.236702        0.556936   0.498502   
concavity_mean              0.676764      0.302418        0.716136   0.685983   
concave points_mean         0.822529      0.293464        0.850977   0.823269   
symmetry_mean               0.147741      0.071401        0.183027   0.151293   
fractal_dimension_mean     -0.311631     -0.076437       -0.261477  -0.283110   
radius_se                   0.679090      0.275869        0.691765   0.732562   
texture_se                 -
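
The full matrix is hard to scan by eye. A short sketch for listing only the strongly correlated pairs is given below. It assumes scikit-learn's bundled copy of the same dataset (`load_breast_cancer`, whose column names differ slightly, e.g. `mean radius` instead of `radius_mean`), and the 0.95 cutoff is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

features = load_breast_cancer(as_frame=True).data
corr_abs = features.corr().abs()

# Keep only the strict upper triangle so each pair appears once and self-pairs are excluded.
mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
pairs = corr_abs.where(mask).stack()

# NaN entries (masked cells) compare as False and are filtered out here.
strong_pairs = pairs[pairs > 0.95].sort_values(ascending=False)
print(strong_pairs)
```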

### Overall best and worst models from Cross Validation

In [56]:
sorted_by_accuracy = sorted(accuracy_results.items(), key=lambda kv: kv[1])

# Each entry is a tuple ((percentile, fitted GridSearchCV), accuracy)
best_model_tup = sorted_by_accuracy[-1]
worst_model_tup = sorted_by_accuracy[0]

print("The overall best model according to cross validation results was: {}\n with params {}\n using {}% of the best features\n with CV accuracy of {:.4f}".format(
    best_model_tup[0][1].estimator.__class__.__name__,
    best_model_tup[0][1].best_params_,
    best_model_tup[0][0],
    best_model_tup[1]
))

print("")

print("The overall worst model according to cross validation results was: {}\n with params {}\n using {}% of the best features\n with CV accuracy of {:.4f}".format(
    worst_model_tup[0][1].estimator.__class__.__name__,
    worst_model_tup[0][1].best_params_,
    worst_model_tup[0][0],
    worst_model_tup[1]
))

The overall best model according to cross validation results was: LogisticRegression
 with params {'C': 0.1}
 using 100% of the best features
 with CV accuracy of 0.9812

The overall worst model according to cross validation results was: DecisionTreeClassifier
 with params {'max_depth': 1}
 using 1% of the best features
 with CV accuracy of 0.9038


In [63]:
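# Note: `.estimator` is the base estimator that was passed to GridSearchCV, so it keeps its
# constructor hyperparameters; the tuned, refitted model is available as `.best_estimator_`
best_model = best_model_tup[0][1].estimator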
best_model.fit(X_train_std, Y_train)
best_acc = accuracy_score(Y_test, best_model.predict(X_test_std))
print("The best model's test accuracy is: {:.4f}%".format(best_acc*100))

The best model's test accuracy is: 99.3007%


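Our highest test accuracy was **`99.3%`** with **`Logistic Regression`** using **`100% of features`**; cross validation selected hyperparameter **`C=0.1`**. Because the cell above refits `.estimator` rather than `.best_estimator_`, the reported test accuracy comes from the model's constructor value of `C` rather than the tuned one.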
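
The difference between the two attributes can be seen in a small sketch. It assumes scikit-learn's bundled copy of the dataset and a reduced grid that deliberately omits the default `C=1.0`; the `*_demo` names are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=123
)

scaler = StandardScaler().fit(X_tr_demo)
X_tr_std_demo = scaler.transform(X_tr_demo)
X_te_std_demo = scaler.transform(X_te_demo)

base = LogisticRegression(max_iter=1000)  # constructor default is C=1.0
search = GridSearchCV(base, {"C": [0.01, 0.1, 10]}, cv=5).fit(X_tr_std_demo, y_tr_demo)

template_C = search.estimator.C       # still the constructor value
tuned_C = search.best_estimator_.C    # value chosen by cross validation

# best_estimator_ has already been refit on the full training split (refit=True by default).
test_acc = search.best_estimator_.score(X_te_std_demo, y_te_demo)
print(template_C, tuned_C, round(test_acc, 4))
```

When the search was run on a feature subset, the same selection would also need to be applied to the test data before scoring; wrapping the steps in a `Pipeline` handles that automatically.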

## Error Analysis

### Confusion Matrix

In [69]:
from sklearn.metrics import confusion_matrix

y_pred = best_model.predict(X_std)


print(confusion_matrix(y, y_pred))

[[355   2]
 [  6 206]]


Here we predict on all of our data (training and test combined), since only one sample was misclassified when using just the test data.

Rows are true labels and columns are predicted labels, in the order benign, malignant. The classifier made three times as many false negatives (6 malignant samples predicted as benign) as false positives (2 benign samples predicted as malignant), although with so few errors it is hard to draw firm conclusions.
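
Because the classes are imbalanced, rates are more informative than raw counts. The arithmetic below uses only the matrix printed above and treats malignant as the positive class: about 2.8% of malignant samples are missed, versus about 0.6% of benign samples flagged.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels (B, M), columns are predictions (B, M).
cm = np.array([[355, 2],
               [6, 206]])
tn, fp, fn, tp = cm.ravel()  # treating malignant (M) as the positive class

false_negative_rate = fn / (fn + tp)  # malignant cases predicted benign
false_positive_rate = fp / (fp + tn)  # benign cases predicted malignant
recall = tp / (tp + fn)
print("FNR: {:.4f}, FPR: {:.4f}, recall: {:.4f}".format(
    false_negative_rate, false_positive_rate, recall))
```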

In this use case it might be useful to modify our model so that it penalizes false negatives more heavily and tolerates more false positives, so that fewer people who actually have breast cancer are predicted to be cancer-free.
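
One simple way to do this, sketched below under assumptions, is to lower the probability threshold at which a sample is called malignant. The sketch uses scikit-learn's bundled copy of the dataset (where target `0` encodes malignant), a pipeline with `C=0.1`, and an illustrative threshold of 0.3; none of these are the tuned settings from this notebook. Passing `class_weight` to `LogisticRegression` is another option.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data_demo = load_breast_cancer()
X_demo = data_demo.data
y_malignant = (data_demo.target == 0).astype(int)  # in this bundled copy, 0 encodes malignant

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_malignant, test_size=0.25, random_state=123
)
clf = make_pipeline(StandardScaler(), LogisticRegression(C=0.1, max_iter=1000))
clf.fit(X_tr_demo, y_tr_demo)
proba_malignant = clf.predict_proba(X_te_demo)[:, 1]


def false_negatives(threshold):
    """Count malignant test samples predicted benign at the given probability threshold."""
    pred = (proba_malignant >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te_demo, pred, labels=[0, 1]).ravel()
    return fn


fn_default = false_negatives(0.5)
fn_lenient = false_negatives(0.3)  # illustrative threshold, not a tuned value
print("false negatives at 0.5 vs 0.3:", fn_default, fn_lenient)
```

Lowering the threshold can only keep or reduce the number of false negatives, but it typically increases false positives, so the trade-off would need to be chosen with the clinical cost of each error in mind.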

## Interpreting what the classifier is doing

In [92]:
labels = list(data_cleaned[data_cleaned.columns.difference(["diagnosis"])].columns.values)
coefs_and_labels = list(zip(best_model.coef_[0], labels))
sorted_coefs_and_labels = sorted(coefs_and_labels, key=lambda tup: abs(tup[0]))

print(("Weight", "Feature Name"))
print("--------------------------------")
for tup in sorted_coefs_and_labels:
    print(tup)

('Weight', 'Feature Name')
--------------------------------
(0.00016324339113484281, 'concavity_se')
(0.11605046247421666, 'smoothness_mean')
(-0.12762966852250768, 'compactness_worst')
(0.1305932122497972, 'symmetry_mean')
(0.15186035604779519, 'concave points_se')
(-0.25024500310196252, 'compactness_mean')
(-0.26432970521766735, 'fractal_dimension_mean')
(0.27200613356879, 'fractal_dimension_worst')
(0.40002441449523313, 'perimeter_mean')
(0.40254829821648508, 'radius_mean')
(-0.42851346178853988, 'symmetry_se')
(0.46461075894386478, 'area_mean')
(0.46713405190813262, 'smoothness_se')
(-0.47339531085328279, 'texture_se')
(0.48550796200685459, 'texture_mean')
(0.62829707346847874, 'perimeter_se')
(0.7057089333843366, 'smoothness_worst')
(-0.7073199363992122, 'fractal_dimension_se')
(0.81332054944967513, 'perimeter_worst')
(0.83018068041975379, 'symmetry_worst')
(0.84403023566214486, 'concave points_mean')
(0.92607872367038546, 'concavity_worst')
(0.93187317857591812, 'concavity_mean')

Above we print the coefficients of our best model sorted by absolute value, from least significant to most significant, along with the corresponding human-readable feature name. Since we standardized our data, the weights of different features can be compared directly.

The 5 **most significant** features are:
1. `texture_worst (1.416)` - "worst" or largest mean value for standard deviation of gray-scale values
2. `radius_se (1.079)` - standard error for the mean of distances from center to points on the perimeter
3. `area_se (0.993)` - standard error for the area
4. `radius_worst (0.983)` - "worst" or largest mean value for mean of distances from center to points on the perimeter
5. `area_worst (0.960)` - "worst" or largest mean value for the area

The 5 **least significant** features are:
1. `concavity_se (0.000)` - standard error for severity of concave portions of the contour
2. `smoothness_mean (0.116)` - mean for local variation in radius lengths
3. `compactness_worst (-0.128)` - "worst" or largest mean value for perimeter^2 / area - 1.0
4. `symmetry_mean (0.130)` - mean for symmetry
5. `concave points_se (0.152)` - standard error for number of concave portions of the contour

Overall, considerably more features are weighted positively than negatively, and the 5 most significant features are all positive.
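
For a logistic regression on standardized features, a weight `w` can be read as an odds ratio: increasing that feature by one standard deviation multiplies the odds of the positive class by `exp(w)`, with the other features held fixed. The sketch below applies this to two weights copied from the output above. Which class counts as positive is an assumption here: in scikit-learn, `coef_` refers to `classes_[1]`, which would be `M` if the labels are sorted alphabetically.

```python
import math

# Two standardized-feature weights copied from the printed output above.
weights = {
    "concavity_mean": 0.93187317857591812,
    "fractal_dimension_se": -0.7073199363992122,
}

# exp(w) is the multiplicative change in the odds of the positive class
# for a one-standard-deviation increase in that feature, all else held fixed.
odds_ratios = {name: math.exp(w) for name, w in weights.items()}
for name, ratio in odds_ratios.items():
    print("{}: odds x {:.2f}".format(name, ratio))
```

A ratio above 1 pushes the prediction toward the positive class and a ratio below 1 pushes it away, which matches the sign of the weight.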