## Using Machine Learning to Predict Breast Cancer

This project uses the __[Breast Cancer Wisconsin (Diagnostic) Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)__ to predict breast cancer diagnoses. Key tools used for this project: Jupyter Notebook and Python (numpy, pandas, matplotlib, plotly, seaborn and scikit-learn).


### __[Summary on the Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)__

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A few of the images can be found at [Web Link] 

The separating plane for this data set was obtained using the Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree Construction Via Linear Programming." Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes. 

The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34]. 

This database is also available through the UW CS ftp server: 
ftp ftp.cs.wisc.edu 
cd math-prog/cpo-dataset/machine-learn/WDBC/


Attribute Information:

1) ID number 
2) Diagnosis (M = malignant, B = benign) 
3-32) Ten real-valued features are computed for each cell nucleus: 

a) radius (mean of distances from center to points on the perimeter) 
b) texture (standard deviation of gray-scale values) 
c) perimeter 
d) area 
e) smoothness (local variation in radius lengths) 
f) compactness (perimeter^2 / area - 1.0) 
g) concavity (severity of concave portions of the contour) 
h) concave points (number of concave portions of the contour) 
i) symmetry 
j) fractal dimension ("coastline approximation" - 1)
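
The attribute list assigns columns 3-32 to ten measurements, which works out because each measurement is reported three ways per image (mean, standard error, and "worst" value), giving 30 feature columns. A minimal sketch of how those column names could be enumerated, assuming the CSV uses names of the form `radius_mean`, `radius_se`, `radius_worst` (the `_mean` spelling matches the names used later in this notebook; the `_se` and `_worst` suffixes are assumptions):

```python
# Base names of the ten nucleus measurements listed above.
# The exact spelling (e.g. the space in "concave points") is an assumption
# about the CSV header, not something the data set description guarantees.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# Each measurement is summarised three ways per image.
STATISTICS = ["mean", "se", "worst"]


def expected_feature_columns():
    """Return the 30 feature column names, grouped by statistic."""
    return [f"{name}_{stat}" for stat in STATISTICS for name in BASE_FEATURES]


columns = expected_feature_columns()
print(len(columns))   # 30
print(columns[:3])
```

Comparing a list like this against `data.columns` is a quick way to confirm that a slice such as "the mean features" picks up the intended columns.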


### Loading Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.gridspec as gridspec 
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
import mpld3 as mpl
import itertools
from itertools import chain
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import precision_score, recall_score, confusion_matrix, roc_curve, precision_recall_curve, accuracy_score
from sklearn.model_selection import GridSearchCV, cross_val_score, learning_curve, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import ShuffleSplit
import plotly.figure_factory as ff
import plotly.graph_objs as go
import plotly.offline as py
import plotly.tools as tls
import warnings
import xgboost as xgb  # needed by the XGBClassifier section below
warnings.filterwarnings("ignore") 

### Data Loading

In [None]:
data = pd.read_csv("Project_Data/WISC_breast_cancer_data.csv", header = 0)#Loading CSV file

### Data Cleaning and Inspection

In [None]:
data.head()

In [None]:
data.info()

All columns have the same number of non-null values except "Unnamed: 32". Next, I will check for and validate missing values.

In [None]:
null_feat = pd.DataFrame(len(data['id']) - data.isnull().sum(), columns = ['Count'])

trace = go.Bar(x = null_feat.index, y = null_feat['Count'] ,opacity = 0.8, marker=dict(color = 'steelblue',
        line=dict(color='black',width=1.5)))

layout = dict(title =  "Checking Data Missingness", plot_bgcolor = "white")
                    
fig = dict(data = [trace], layout=layout)
py.iplot(fig)

As we can see from the plot above, every column is complete except 'Unnamed: 32', which contains no values and will therefore be dropped.
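
The same check can also be done numerically rather than visually. The sketch below is one possible way to do it; the `toy` frame and the `missing_summary` helper are hypothetical stand-ins for illustration, not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the all-NaN column mimics 'Unnamed: 32'.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "radius_mean": [17.9, 20.5, np.nan, 11.4],
    "Unnamed: 32": [np.nan] * 4,
})


def missing_summary(df):
    """Count and fraction of missing values per column, worst first."""
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "fraction_missing": n_missing / len(df),
    })
    return summary.sort_values("n_missing", ascending=False)


summary = missing_summary(toy)
print(summary)

# Columns with no observed values at all are candidates for dropping.
empty_columns = summary.index[summary["fraction_missing"] == 1.0].tolist()
print(empty_columns)
```

On the real data, `missing_summary(data)` would be expected to flag only the empty column, in line with the bar chart.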

In [None]:
#Dropping 'Unnamed: 32' with no values
data.drop('Unnamed: 32', axis=1, inplace=True)

In [None]:
data.info()

In [None]:
#Validating drop
null_feat = pd.DataFrame(len(data['id']) - data.isnull().sum(), columns = ['Count'])

trace = go.Bar(x = null_feat.index, y = null_feat['Count'] ,opacity = 0.8, marker=dict(color = 'steelblue',
        line=dict(color='black',width=1.5)))

layout = dict(title =  "Checking Data Missingness", plot_bgcolor = "white")
                    
fig = dict(data = [trace], layout=layout)
py.iplot(fig)

Feature 'Unnamed: 32' has been dropped from the dataset.

As we can see from the dataset, two columns carry attribute information: 1) ID number and 2) Diagnosis (M = malignant, B = benign).

The rest are real-valued features, derived from ten measurements that are computed for each cell nucleus: 

a) radius (mean of distances from center to points on the perimeter) 
b) texture (standard deviation of gray-scale values) 
c) perimeter 
d) area 
e) smoothness (local variation in radius lengths) 
f) compactness (perimeter^2 / area - 1.0) 
g) concavity (severity of concave portions of the contour) 
h) concave points (number of concave portions of the contour) 
i) symmetry 
j) fractal dimension ("coastline approximation" - 1)

In [None]:
data.describe()

In [None]:
data.diagnosis.unique()

### Checking diagnosis distribution

In [None]:
# Encode the target as integers (malignant = 1, benign = 0),
# then split the frame into one dataset per diagnosis
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})
Malignant = data[data['diagnosis'] == 1]
Benign = data[data['diagnosis'] == 0]

trace = go.Bar( x = ['Malignant', 'Benign'], y = (len(Malignant), len(Benign)),opacity = 0.5, marker=dict(color = 'steelblue',
        line=dict(color='gray',width=0.5)))

layout = dict(title =  'Diagnosis Distribution', plot_bgcolor = "white")
                    
fig = dict(data = [trace], layout=layout)
py.iplot(fig)


### Investigating computed cell nucleus features per diagnosis type: 

In [None]:
# The ten "_mean" features sit in columns 2-11 (column 0 is id, column 1 is diagnosis)
features_mean = list(data.columns[2:12])

# Stacked, normalised histograms of each feature for the two diagnosis groups
plt.rcParams.update({'font.size': 8})
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(8,10))
axes = axes.ravel()
for idx, ax in enumerate(axes):
    feature = features_mean[idx]
    # 50 equal-width bins spanning the full range of the feature
    bins = np.linspace(data[feature].min(), data[feature].max(), 51)
    ax.hist([Malignant[feature], Benign[feature]], bins=bins, alpha=0.5, stacked=True, density=True, label=['Malignant','Benign'], color=['steelblue','darkorange'])
    ax.legend(loc='upper right')
    ax.set_title(feature)
plt.tight_layout()
plt.show()



Overall, malignant cells tend to show higher values for these features. We can leverage this for classification.

### Checking Feature Correlation

In [None]:
# Compute the correlation matrix once and reuse it for the heatmap
corr = data.corr()
plt.figure(figsize=(20,20))
sns.heatmap(corr, annot = True, cmap="cividis")
plt.title("Correlation Plot", fontweight = "bold", fontsize=18)

In [None]:
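#Pair-wise comparison of features
# Columns 1-11 are the diagnosis plus the ten "_mean" features;
# indexing with the correlation matrix itself (data[corr]) does not select columns
pair_cols = list(data.columns[1:12])
sns.pairplot(data[pair_cols], diag_kind = "kde", markers = "+", hue = "diagnosis")
plt.show()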

Based on the above analyses and observations, we can leverage correlated features and reasonably hypothesize that the cancer diagnosis depends on these.
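
A heatmap of this size is hard to read pair by pair, so it can help to list the strongly correlated pairs directly. The following is a sketch on synthetic data; the `top_correlated_pairs` helper, the 0.9 threshold, and the toy columns are illustrative assumptions rather than values taken from this analysis:

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df, threshold=0.9):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once
    # and the diagonal (self-correlation) is skipped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Synthetic stand-in: perimeter and area are derived from radius,
# texture is drawn independently.
rng = np.random.default_rng(0)
radius = rng.uniform(6.0, 28.0, size=200)
toy = pd.DataFrame({
    "radius_mean": radius,
    "perimeter_mean": 2 * np.pi * radius + rng.normal(0, 1.0, size=200),
    "area_mean": np.pi * radius ** 2,
    "texture_mean": rng.normal(19.0, 4.0, size=200),
})

result = top_correlated_pairs(toy)
print(result)
```

Geometrically related measurements such as radius, perimeter and area carry largely the same information, which is worth keeping in mind when choosing predictors.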

### Creating test and training datasets

In [None]:
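# Stratify on the diagnosis so both splits keep the same class balance;
# a fixed random_state makes the split reproducible
traindf, testdf = train_test_split(data, test_size = 0.3, stratify = data['diagnosis'], random_state = 42)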
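
With an imbalanced target, a purely random split can leave the two sets with different class proportions. One option is the `stratify` argument of `train_test_split`. The sketch below uses synthetic labels with an illustrative 63% / 37% class mix and a placeholder feature column, both of which are assumptions made for the example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 63 negatives and 37 positives (illustrative class mix).
y = np.array([0] * 63 + [1] * 37)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# stratify=y keeps the positive fraction roughly equal in both splits;
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

print("positive fraction, train:", y_train.mean())
print("positive fraction, test: ", y_test.mean())
```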

In [None]:
#Generic function for fitting a classification model and assessing its performance.
# From AnalyticsVidhya tutorial
def classification_model(model, data, predictors, outcome):
  #Fit the model:
  model.fit(data[predictors],data[outcome])
  
  #Make predictions on the same data the model was fitted on:
  predictions = model.predict(data[predictors])
  
  #Print training accuracy
  accuracy = metrics.accuracy_score(data[outcome], predictions)
  print("Accuracy : %s" % "{0:.3%}".format(accuracy))

  #Perform k-fold cross-validation with 5 folds
  kf = KFold(n_splits = 5)

  scores = []
  for train, test in kf.split(data[predictors]):
    # Filter training data
    train_predictors = (data[predictors].iloc[train,:])
    
    # The target we're using to train the algorithm.
    train_target = data[outcome].iloc[train]
    
    # Training the algorithm using the predictors and target.
    model.fit(train_predictors, train_target)
    
    #Record the accuracy on the held-out fold
    scores.append(model.score(data[predictors].iloc[test,:], data[outcome].iloc[test]))
    
  #Report the mean score once all folds are done
  print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(scores)))
    
  #Fit the model again on all the data so that it can be referred to outside the function:
  model.fit(data[predictors],data[outcome]) 
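
The manual `KFold` loop in `classification_model` does the same job as scikit-learn's `cross_val_score`. The sketch below compares the two on synthetic data from `make_classification`; the sample size, feature counts and `max_iter` value are arbitrary choices for the example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification problem standing in for the training frame.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5)

# Manual loop: fit on four folds, score on the fifth.
manual_scores = []
for train_idx, test_idx in kf.split(X):
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

# Library helper: clones the estimator and runs the same folds.
library_scores = cross_val_score(model, X, y, cv=kf)

print("manual mean :", np.mean(manual_scores))
print("library mean:", library_scores.mean())
```

Passing the same `KFold` object to both approaches is what makes the fold boundaries, and therefore the scores, line up.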

### Logistic Regression

In [None]:
predictor_var = ['radius_mean','perimeter_mean','area_mean','compactness_mean','concave points_mean']
outcome_var='diagnosis'
model=LogisticRegression()
classification_model(model,traindf,predictor_var,outcome_var)

In [None]:
predictor_var = ['radius_mean']
outcome_var='diagnosis'
model=LogisticRegression()
classification_model(model,traindf,predictor_var,outcome_var)

### Random Forest

In [None]:
# Use all the features of the nucleus
predictor_var = features_mean
model = RandomForestClassifier(n_estimators=100,min_samples_split=25, max_depth=7, max_features=2)
classification_model(model, traindf,predictor_var,outcome_var)

Using all the mean features improves the training accuracy, and the cross-validation score is high as well.
An advantage of Random Forest is that it exposes per-feature importance scores (`feature_importances_`), which can be used to select features. So let's select the top 5 features and use them as predictors.

In [None]:
#Create a series with feature importances:
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)

In [None]:
# Using top 5 features
predictor_var = ['concave points_mean','area_mean','radius_mean','perimeter_mean','concavity_mean',]
model = RandomForestClassifier(n_estimators=100, min_samples_split=25, max_depth=7, max_features=2)
classification_model(model,traindf,predictor_var,outcome_var)

Using only the top 5 features changes the prediction accuracy slightly, but I think we get a better result with all the predictors.
What happens if we use a single predictor, as before? Let's check.

In [None]:
predictor_var =  ['radius_mean']
model = RandomForestClassifier(n_estimators=100)
classification_model(model, traindf,predictor_var,outcome_var)

This gives a higher training accuracy too, but the cross-validation score is noticeably worse, which suggests overfitting. Below I will assess other classifiers.



### KNeighborsClassifier

In [None]:
predictor_var = ['concave points_mean','area_mean','radius_mean','perimeter_mean','concavity_mean',]
model = KNeighborsClassifier(n_neighbors = 2, weights ='uniform')
classification_model(model,traindf,predictor_var,outcome_var)

### SVC

In [None]:
model =SVC(kernel="rbf",random_state=15)
classification_model(model, traindf,predictor_var,outcome_var)

### DecisionTreeClassifier

In [None]:
predictor_var = ['concave points_mean','area_mean','radius_mean','perimeter_mean','concavity_mean',]
model=DecisionTreeClassifier(random_state=10)
classification_model(model, traindf,predictor_var,outcome_var)

### RandomForestClassifier

In [None]:
predictor_var = ['concave points_mean','area_mean','radius_mean','perimeter_mean','concavity_mean',]
#model=RandomForestClassifier(n_estimators=60, random_state=0)
model=RandomForestClassifier(n_estimators=100,min_samples_split=25, max_depth=7, max_features=2)
classification_model(model, traindf,predictor_var,outcome_var)

### GradientBoostingClassifier

In [None]:
predictor_var = ['concave points_mean','area_mean','radius_mean','perimeter_mean','concavity_mean',]
model=GradientBoostingClassifier(random_state=20)
classification_model(model, traindf,predictor_var,outcome_var)

### AdaBoostClassifier

In [None]:
predictor_var = ['concave points_mean','area_mean','radius_mean','perimeter_mean','concavity_mean',]
model=AdaBoostClassifier()
classification_model(model, traindf,predictor_var,outcome_var)

### XGBClassifier

In [None]:
predictor_var = ['concave points_mean','area_mean','radius_mean','perimeter_mean','concavity_mean',]
model=xgb.XGBClassifier(random_state=0,booster="gbtree", eval_metric="logloss")
classification_model(model, traindf,predictor_var,outcome_var)

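## Evaluating on the test data set

In [None]:
# Fit on the training set using all the mean features of the nucleus,
# then score on the held-out test set (the model must not be fitted on testdf)
predictor_var = features_mean
model = RandomForestClassifier(n_estimators=100,min_samples_split=25, max_depth=7, max_features=2)
model.fit(traindf[predictor_var], traindf[outcome_var])
test_accuracy = model.score(testdf[predictor_var], testdf[outcome_var])
print("Test accuracy : %s" % "{0:.3%}".format(test_accuracy))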
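
Accuracy alone can hide missed malignant cases, so precision and recall for the malignant class are worth reporting next to it. Below is a small worked example in plain Python; the label vectors are made up for illustration and the helper names are hypothetical:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) with `positive` treated as the malignant class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive and truth != positive:
            fp += 1
        elif pred != positive and truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Accuracy, precision and recall computed from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels: 4 malignant (1) and 6 benign (0) cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

report = summarize(y_true, y_pred)
print(report)
```

Here the accuracy is 70%, yet half of the malignant cases are missed (recall 0.5). scikit-learn's `precision_score`, `recall_score` and `confusion_matrix`, already imported at the top of the notebook, compute the same quantities.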

### Model Accuracies

In [None]:
# Training-set accuracies (%) copied by hand from the outputs of the cells above
models = ['Logistic Regression', 'Random Forest', 'KNeighbors', 'SVC', 'DecisionTree', 'GradientBoosting', 'AdaBoost', 'XGB']
accuracies = [89.698, 95.477, 90.201, 88.945, 100.000, 100.000, 98.492, 100.000]

trace = go.Bar(x=models, y=accuracies,
               opacity = 0.5, marker=dict(color = 'steelblue',line=dict(color='gray',width=0.5))) 

layout = dict(title =  'Model Accuracies (training set)', plot_bgcolor = "white", yaxis = dict(title = 'Accuracy (%)'))
                    
fig = dict(data = [trace], layout=layout)
py.iplot(fig)
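
The accuracies plotted above are typed in by hand, which is easy to get out of sync with the cells that produced them. A possible alternative is to collect the scores in a loop and record the cross-validated accuracy alongside the training accuracy. This is a sketch on synthetic data; the three candidate models and the data shape are arbitrary choices for the example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training predictors and target.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, random_state=0
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighbors": KNeighborsClassifier(n_neighbors=5),
    "DecisionTree": DecisionTreeClassifier(random_state=10),
}

results = {}
for name, estimator in candidates.items():
    # Training accuracy: fitted and scored on the same data (optimistic).
    train_acc = estimator.fit(X, y).score(X, y)
    # Cross-validated accuracy: mean over 5 held-out folds.
    cv_acc = cross_val_score(estimator, X, y, cv=5).mean()
    results[name] = {"train": train_acc, "cv": cv_acc}
    print(f"{name:20s} train={train_acc:.3f}  cv={cv_acc:.3f}")
```

An unpruned decision tree reaches 100% training accuracy almost by construction, so the cross-validated column is the fairer basis for comparing models. The `results` dictionary can be passed to `go.Bar` in place of the hard-coded list.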
