<h1 style="font-size:42px; text-align:center"><span>Breast Cancer Detection:</span></h1>
<h1 style="font-size:42px; text-align:center"><span>Model Training</span></h1>

<hr>

This Jupyter notebook contains the model training code for the breast mass data from the UCI Machine Learning Repository, located [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic%29). The ultimate goal of this project is to produce a robust classification algorithm that correctly identifies a tumor as benign or malignant based on its features. Models will be compared by the area under the ROC curve (AUROC), and the model with the highest AUROC on the test set wins.

We have already explored and cleaned our data set, so we will begin with the cleaned data saved in `cleaned_df.csv`.

In [1]:
# NumPy for numerical computing
import numpy as np

# Pandas for DataFrames
import pandas as pd

# Matplotlib for visualization
from matplotlib import pyplot as plt

# display plots in the notebook
%matplotlib inline

# Seaborn for easier visualization
import seaborn as sns

# Scikit-Learn for Modeling
import sklearn

# Pickle for saving model files
import pickle

# Import Logistic Regression
from sklearn.linear_model import LogisticRegression

# Import RandomForestClassifier and GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Function for splitting training and test set
from sklearn.model_selection import train_test_split

# Function for creating model pipelines
from sklearn.pipeline import make_pipeline

# For standardization
from sklearn.preprocessing import StandardScaler

# Helper for cross-validation
from sklearn.model_selection import GridSearchCV

# Classification metrics
from sklearn.metrics import roc_curve, auc

# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Import zero_one_loss (used below to compute test accuracy)
from sklearn.metrics import zero_one_loss

Import Data:

In [2]:
# Import the analytical base table (ABT)
df = pd.read_csv("cleaned_df.csv")

In [3]:
#Display top 5 rows
df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


Build Testing & Training Sets:

In [4]:
# Create separate object for target variable
y = df['diagnosis']

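# Create separate object for input features
# Note: this keeps the 'Unnamed: 0' column (the saved row index) as a feature;
# drop it as well if it should not be used for modeling
X = df.drop('diagnosis', axis=1)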

In [5]:
# Split X and y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=1234,stratify=df.diagnosis)

# Print number of observations in X_train, X_test, y_train, and y_test
print(len(X_train),len(X_test),len(y_train),len(y_test))

455 114 455 114
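
The effect of the `stratify` argument can be checked on a small example. The sketch below uses made-up labels (`X_demo` and `y_demo` are illustrative names, not part of this notebook) and compares the share of the positive class before and after the split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 63 negatives and 37 positives
y_demo = np.array([0] * 63 + [1] * 37)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1234, stratify=y_demo
)

# With stratification, the positive-class share in each split
# stays close to the overall share of 0.37
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, a small test set can end up with a noticeably different class balance than the training set, which makes test metrics harder to interpret.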


Recall that many of the features are strongly correlated with one another (multicollinearity). In this notebook, we will apply four models to our data that cope well with correlated features:
* L1 Regularized Logistic Regression
* L2 Regularized Logistic Regression
* Random Forests
* Boosted Trees

Let's begin by defining our data processing pipeline. We standardize the features because L1 and L2 regularized logistic regression penalize coefficient size and are therefore sensitive to feature scale. Random forests and boosted trees do not need scaling, but we apply it to them as well for consistency.

In [6]:
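# Pipeline dictionary
# solver='liblinear' supports both penalties; it was the default solver in older
# scikit-learn releases, while the current default ('lbfgs') rejects penalty='l1'
pipelines = {
    'l1' : make_pipeline(StandardScaler(), LogisticRegression(random_state=123, penalty='l1', solver='liblinear')),
    'l2' : make_pipeline(StandardScaler(), LogisticRegression(random_state=123, penalty='l2', solver='liblinear')),
    'rf' : make_pipeline(StandardScaler(), RandomForestClassifier(random_state=123)),
    'gb' : make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=123))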
}
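
`make_pipeline` names each step after its lowercased class name, which is where prefixes such as `logisticregression__` in the hyperparameter grids come from. A minimal sketch (the variable `pipe` is illustrative) that lists the step names and confirms how a parameter is addressed:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Step names are generated from the lowercased class names
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Tunable parameters are addressed as '<step name>__<parameter name>'
has_c = 'logisticregression__C' in pipe.get_params()
print(has_c)
```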

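Next, we'll try a range of regularization strengths for our logistic regression models. In scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean a stronger penalty: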

In [7]:
# Logistic Regression hyperparameters
l1_hyperparameters = {
    'logisticregression__C' : np.linspace(1e-3, 1e3, 100),
}

l2_hyperparameters = {
    'logisticregression__C' : np.linspace(1e-3, 1e3, 100),
}
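
Because the grid above is linearly spaced, almost all candidate values of `C` are large. A possible alternative, shown here only as a sketch and not used in this notebook, is a logarithmic grid from `np.logspace`, which spreads the candidates evenly across orders of magnitude:

```python
import numpy as np

linear_grid = np.linspace(1e-3, 1e3, 100)  # grid used in this notebook
log_grid = np.logspace(-3, 3, 100)         # hypothetical alternative grid

# Count candidates with C < 1 (i.e. stronger regularization)
n_linear_small = int((linear_grid < 1).sum())
n_log_small = int((log_grid < 1).sum())

print(n_linear_small, n_log_small)
```

Both grids cover the same range, but the logarithmic one places half of its candidates below `C = 1`, whereas the linear one places only its first value there.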

For our random forest model, we will tune the number of trees 'grown' and features considered at each split:

In [8]:
# Random Forest hyperparameters
rf_hyperparameters = {
    'randomforestclassifier__n_estimators':[100,200],
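    # 'sqrt' is the classifier default; the older 'auto' alias for it is
    # no longer accepted by recent scikit-learn releases
    'randomforestclassifier__max_features':['sqrt',0.33]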
}

For our boosted tree model, we will tune the number of boosting stages, the learning rate, and tree depth:

In [9]:
# Boosted Tree hyperparameters
gb_hyperparameters = {
    'gradientboostingclassifier__n_estimators':[100,200],
    'gradientboostingclassifier__learning_rate':[0.05,0.1,0.2],
    'gradientboostingclassifier__max_depth':[1,3,5]
}

In [10]:
# Create hyperparameters dictionary
hyperparameters = {
    'l1':l1_hyperparameters,
    'l2':l2_hyperparameters,
    'rf':rf_hyperparameters,
    'gb':gb_hyperparameters
}

In [11]:
# Create empty dictionary called fitted_models
fitted_models = {}

# Loop through model pipelines, tuning each one and saving it to fitted_models
for name, pipeline in pipelines.items():
    # Create cross-validation object from pipeline and hyperparameters
    # (10-fold CV; with no scoring argument, candidates are ranked by accuracy)
    model = GridSearchCV(pipeline, hyperparameters[name], cv=10, n_jobs=-1)

    # Fit model on X_train, y_train
    model.fit(X_train, y_train)

    # Store model in fitted_models[name]
    fitted_models[name] = model

    # Print '{name} has been fitted.'
    print(name, 'has been fitted.')

l1 has been fitted.
l2 has been fitted.
rf has been fitted.
gb has been fitted.
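
The grid search above ranks candidates by accuracy, the default for classifiers. Since the project metric is AUROC, one option is to pass `scoring='roc_auc'` to `GridSearchCV`. The sketch below illustrates this on synthetic stand-in data (`X_demo`, `y_demo`, and the small grid are assumptions for illustration, not the notebook's data or grids):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the breast cancer data)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear'))
grid = {'logisticregression__C': [0.01, 1, 100]}

# scoring='roc_auc' ranks candidates by cross-validated AUROC instead of accuracy
search = GridSearchCV(pipe, grid, cv=5, scoring='roc_auc')
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

With this setting, `best_score_` is a mean cross-validated AUROC, so it is directly comparable to the test AUROC rather than to test accuracy.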


In [12]:
# Display best_score_ (i.e. the mean cross-validated accuracy) for each fitted model
for name, model in fitted_models.items():
    print(name, model.best_score_)

l1 0.971428571429
l2 0.98021978022
rf 0.958241758242
gb 0.96043956044


In [13]:
# Let's calculate the test accuracy for each model
for name, model in fitted_models.items():
    y_pred = model.predict(X_test)
    print(name,1-zero_one_loss(y_test, y_pred))


l1 0.929824561404
l2 0.938596491228
rf 0.973684210526
gb 0.982456140351
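
The expression `1 - zero_one_loss(...)` used above is the plain classification accuracy. A minimal sketch on made-up labels (the lists below are illustrative, not taken from the data set) shows that it matches `accuracy_score`:

```python
from sklearn.metrics import accuracy_score, zero_one_loss

# Made-up labels: 4 of the 5 predictions are correct
y_true = [0, 1, 1, 0, 1]
y_hat = [0, 1, 0, 0, 1]

acc_from_loss = 1 - zero_one_loss(y_true, y_hat)
acc_direct = accuracy_score(y_true, y_hat)

print(acc_from_loss, acc_direct)
```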


The model with the highest cross-validated score is L2 regularized logistic regression. The model with the highest test accuracy is gradient boosted trees. Let's calculate the area under the ROC Curve (AUROC) on the test data:

In [14]:
# Calculate Area Under ROC Curve
for name, model in fitted_models.items():
    # Predicted probability of the positive class (column 1)
    pred = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, pred)
    print(name, auc(fpr, tpr))

l1 0.982473544974
l2 0.983134920635
rf 0.992063492063
gb 0.99503968254


The model with the greatest AUROC is gradient boosted trees. Let's look at its confusion matrix.

In [15]:
# Predict classes using Gradient Boosted Trees:
pred = fitted_models['gb'].predict(X_test)

# Display confusion matrix for y_test and pred
print( confusion_matrix(y_test, pred) )

[[72  0]
 [ 2 40]]
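
The matrix can be read cell by cell. The sketch below assumes that label 1 denotes malignant, as in the `diagnosis` column shown earlier, and uses scikit-learn's convention that rows are true classes and columns are predicted classes. The counts are copied from the output above:

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[72, 0],
               [2, 40]])

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = cm.ravel()

sensitivity = tp / (tp + fn)   # share of malignant tumors detected
specificity = tn / (tn + fp)   # share of benign tumors correctly cleared
accuracy = (tp + tn) / cm.sum()

print(tn, fp, fn, tp)
print(round(sensitivity, 3), round(specificity, 3), round(accuracy, 3))
```

In a screening context the false negatives (missed malignant tumors) are usually the costlier error, so sensitivity is worth reporting alongside AUROC.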


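Reading rows as true classes and columns as predicted classes (with 1 = malignant), the model correctly identifies all 72 benign tumors and 40 of the 42 malignant tumors. The two errors are malignant tumors classified as benign (i.e. 2 false negatives). Let's export our winning model!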

In [16]:
# Save winning model as final_model.pkl
with open('final_model.pkl', 'wb') as f:
    pickle.dump(fitted_models['gb'].best_estimator_, f)
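
A saved pipeline can later be restored with `pickle.load` and used for prediction. The sketch below shows the round trip in memory on a tiny synthetic model (`X_demo`, `y_demo`, and `model` are stand-ins, not the notebook's data or winning model):

```python
import pickle

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tiny synthetic stand-in for the training data
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(60, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

model = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=123))
model.fit(X_demo, y_demo)

# Round trip through pickle in memory (the notebook writes to final_model.pkl instead)
restored = pickle.loads(pickle.dumps(model))

# The restored pipeline should reproduce the original predictions
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Because the scaler is part of the pickled pipeline, new data can be passed to the restored model unscaled. Pickled models are generally only safe to load with the same scikit-learn version that created them.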