# INFO 4604 Final Project - Predicting if cancer is benign or not

## Amogh Jahagirdar and Ryan Rouleau

Dataset: [https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/home](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/home)

## Precursor Analysis/General Data Cleansing

In [43]:
import pandas as pd

data = pd.read_csv('./data/data.csv')

In [44]:
#Basic summary statistics
data.describe()

Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,0.0
mean,30371830.0,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,
std,125020600.0,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,
min,8670.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,
25%,869218.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,
50%,906024.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,
75%,8813129.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,
max,911320500.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,


In [45]:
#Drop the last column ("Unnamed: 32"), a column of NaNs that pandas reads from the CSV; it does not appear when the file is opened in Excel
data_cleaned = data.iloc[:, :-1]
#Drop id (a bookkeeping column from the original data, not a predictive feature)
data_cleaned = data_cleaned.drop(columns="id")

In [46]:
#Analyze the balance of the data.
nrows = data_cleaned.shape[0]
print("Percentages of benign and malignant cases are \n {}".format(100 * data_cleaned["diagnosis"].value_counts()/nrows))

Percentages of benign and malignant cases are 
 B    62.741652
M    37.258348
Name: diagnosis, dtype: float64


As the output shows, benign cases outnumber malignant ones by roughly 63% to 37%, so the classes are moderately imbalanced.

In [47]:
from sklearn.model_selection import train_test_split

X = data_cleaned[data_cleaned.columns.difference(["diagnosis"])]
y = data_cleaned['diagnosis']
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.25,random_state=123)

## Baseline Model

Next, we create a simple baseline classifier with no feature extraction. We use scikit-learn's DummyClassifier with the `most_frequent` strategy, which always predicts the most common class in the training labels. It is not meant for actual classification; it is merely a benchmark showing what a classifier would score if it learned nothing from the features, and so sets a minimum accuracy for our actual models. All of our models should perform much better than the DummyClassifier.

In [48]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

baseline = DummyClassifier(strategy='most_frequent', random_state=1234)
baseline.fit(X_train, Y_train)

print("Baseline Training accuracy: %0.6f" % accuracy_score(Y_train, baseline.predict(X_train)))
print("Baseline Testing accuracy: %0.6f" % accuracy_score(Y_test, baseline.predict(X_test)))

Baseline Training accuracy: 0.629108
Baseline Testing accuracy: 0.622378


Since the baseline reaches about 62% accuracy on both the training and test sets, our models should score well above that.

## Classification Algorithms (Baseline)
### Decision Tree Classifier

In [52]:
from sklearn.tree import DecisionTreeClassifier

decisionTree = DecisionTreeClassifier(random_state=1234)
decisionTree.fit(X_train, Y_train)

print("Baseline Decision Tree Training accuracy: %0.6f" % accuracy_score(Y_train, decisionTree.predict(X_train)))
print("Baseline Decision Tree Testing accuracy: %0.6f" % accuracy_score(Y_test, decisionTree.predict(X_test)))

Baseline Decision Tree Training accuracy: 1.000000
Baseline Decision Tree Testing accuracy: 0.958042


As seen above, a decision tree classifier with default hyperparameters fits the training set perfectly (100% training accuracy) but reaches only 95.8% on the test set, a gap that indicates overfitting.

... hyperparameter selection w/ cv here
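
As a placeholder for that step, here is a minimal sketch of cross-validated hyperparameter selection for the decision tree. It is not the project's final procedure. It assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for `data.csv`, and the grid values and the names `X_demo`, `y_demo`, `param_grid` and `search` are hypothetical choices made only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled Wisconsin breast cancer data stands in for data.csv
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=123
)

# Hypothetical grid: limiting depth and leaf size restricts how closely
# the tree can memorize the training set
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

# 5-fold cross-validation on the training split only; the test split is
# never seen during selection
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1234),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy: %0.6f" % search.best_score_)
print("Test accuracy: %0.6f" % search.score(X_te, y_te))
```

Selecting on cross-validated accuracy instead of training accuracy matters here because the default tree already scores 100% on the training set, so training accuracy cannot distinguish between candidate settings.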

### Logistic Regression

In [53]:
from sklearn.linear_model import LogisticRegression

logisticRegression = LogisticRegression(random_state=1234)
logisticRegression.fit(X_train, Y_train)

print("Baseline Logistic Regression Training accuracy: %0.6f" % accuracy_score(Y_train, logisticRegression.predict(X_train)))
print("Baseline Logistic Regression Testing accuracy: %0.6f" % accuracy_score(Y_test, logisticRegression.predict(X_test)))

Baseline Logistic Regression Training accuracy: 0.948357
Baseline Logistic Regression Testing accuracy: 0.986014


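With default hyperparameters, logistic regression reaches 94.8% training accuracy and 98.6% test accuracy, far above the `most_frequent` baseline, so it is clearly learning from the features rather than always predicting the majority class.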

... hyperparameter selection w/ cv here

### Support Vector Machine

In [54]:
from sklearn.svm import SVC 

svm = SVC(random_state=1234)
svm.fit(X_train, Y_train)

print("Baseline Support Vector Machine Training accuracy: %0.6f" % accuracy_score(Y_train, svm.predict(X_train)))
print("Baseline Support Vector Machine Testing accuracy: %0.6f" % accuracy_score(Y_test, svm.predict(X_test)))

Baseline Support Vector Machine Training accuracy: 1.000000
Baseline Support Vector Machine Testing accuracy: 0.622378


With default hyperparameters, the support vector machine also severely overfits, with a training accuracy of 100%. The test accuracy is concerning: at 0.622378 it is exactly the same as that of the `most_frequent` baseline classifier, which suggests the model predicts the majority class for every test sample.
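
One possible explanation, offered as an assumption and not something this notebook verifies, is that the RBF kernel is very sensitive to unscaled features. The sketch below compares an SVC on raw features with the same SVC behind a `StandardScaler`. It uses scikit-learn's bundled `load_breast_cancer` data as a stand-in for `data.csv`. It also sets `gamma="auto"`, which was the default in older scikit-learn releases; whether this notebook ran on such a release is a guess.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled Wisconsin breast cancer data stands in for data.csv
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=123
)

# gamma="auto" means 1 / n_features (assumed here to mimic an older default)
raw_svm = SVC(gamma="auto", random_state=1234).fit(X_tr, y_tr)

# Same model, but features are standardized inside the pipeline, so the
# scaler is fit on training data only
scaled_svm = make_pipeline(
    StandardScaler(), SVC(gamma="auto", random_state=1234)
).fit(X_tr, y_tr)

raw_test_acc = accuracy_score(y_te, raw_svm.predict(X_te))
scaled_test_acc = accuracy_score(y_te, scaled_svm.predict(X_te))

print("Raw features test accuracy: %0.6f" % raw_test_acc)
print("Scaled features test accuracy: %0.6f" % scaled_test_acc)
```

Wrapping the scaler and the classifier in one pipeline keeps the scaling statistics tied to whatever data `fit` receives, which also makes the model safe to pass to cross-validation.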

### Neural Net

In [42]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(random_state=1234)
mlp.fit(X_train, Y_train)

print("Baseline Neural Net Training accuracy: %0.6f" % accuracy_score(Y_train, mlp.predict(X_train)))
print("Baseline Neural Net Testing accuracy: %0.6f" % accuracy_score(Y_test, mlp.predict(X_test)))

Baseline Neural Net Training accuracy: 0.906103
Baseline Neural Net Testing accuracy: 0.916084


## Feature preprocessing via Standard Scaling

In [14]:
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)
X_test_std = std_scaler.transform(X_test)

#Sanity check: refit every model on the standardized features and compare accuracies
models = [logisticRegression, decisionTree, mlp, svm]
for model in models:
    #warm_start is off by default, so calling fit() retrains each model from scratch, which is what we want
    model.fit(X_train_std, Y_train)
    model_name = model.__class__.__name__
    train_accuracy = accuracy_score(Y_train, model.predict(X_train_std))
    test_accuracy = accuracy_score(Y_test, model.predict(X_test_std))
    print("Train accuracy for model {} after standardizing features: {}".format(model_name, train_accuracy))
    print("Test accuracy for model {} after standardizing features: {}".format(model_name, test_accuracy))

Train accuracy for model LogisticRegression after standardizing features: 0.9859154929577465
Test accuracy for model LogisticRegression after standardizing features: 0.986013986013986
Train accuracy for model DecisionTreeClassifier after standardizing features: 1.0
Test accuracy for model DecisionTreeClassifier after standardizing features: 0.965034965034965
Train accuracy for model MLPClassifier after standardizing features: 0.9929577464788732
Test accuracy for model MLPClassifier after standardizing features: 0.986013986013986
Train accuracy for model SVC after standardizing features: 0.9835680751173709
Test accuracy for model SVC after standardizing features: 0.986013986013986




## Feature Extraction and CV 

#### Perform feature selection using the chi-squared (chi^2) test.
#### Maintain a map between model type and a list of candidate parameters, e.g. {"logistic_regression": [list of potential C values], "neural_net": [different hidden layer sizes], etc.}
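
The two notes above could be combined roughly as follows. This is a sketch under stated assumptions, not the project's implementation: it uses scikit-learn's bundled `load_breast_cancer` data in place of `data.csv`, and the map `model_params`, the grid values, and the pipeline step names are all hypothetical. Because `chi2` requires non-negative inputs, the sketch scales with `MinMaxScaler` instead of the `StandardScaler` used earlier; that substitution is a choice made here, not one stated in the notes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Assumption: the bundled Wisconsin breast cancer data stands in for data.csv
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=123
)

# Hypothetical map: model name -> (estimator, candidate parameters).
# The "clf__" prefix routes each parameter to the pipeline's final step.
model_params = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000, random_state=1234),
        {"clf__C": [0.1, 1.0, 10.0]},
    ),
    "neural_net": (
        MLPClassifier(max_iter=500, random_state=1234),
        {"clf__hidden_layer_sizes": [(10,), (25,)]},
    ),
}

results = {}
for name, (estimator, params) in model_params.items():
    pipe = Pipeline([
        ("scale", MinMaxScaler()),      # maps features to [0, 1], as chi2 needs
        ("select", SelectKBest(chi2)),  # keep the k highest-scoring features
        ("clf", estimator),
    ])
    # Search over the number of selected features together with model parameters
    grid = {"select__k": [10, 20], **params}
    search = GridSearchCV(pipe, grid, cv=3, scoring="accuracy")
    search.fit(X_tr, y_tr)
    results[name] = search
    print(name, search.best_params_, "CV accuracy: %0.6f" % search.best_score_)
```

Placing feature selection inside the pipeline means the chi^2 scores are recomputed on each training fold, so the held-out fold never influences which features are kept.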

