**Predicting Breast Cancer From Nuclear Shape**

**The nucleus is an organelle present within all eukaryotic cells, including human cells.  Aberrant nuclear shape can be used to identify cancer cells (e.g. Pap smear tests and the diagnosis of cervical cancer). Likewise, a growing body of literature suggests that there is a connection between the shape of the nucleus and human disease states such as cancer and aging. As such, the quantitative analysis of nuclear size and shape has important biomedical applications.**

For more information, please refer to the following resources:
* http://www.uwyo.edu/levy_lab/
* Vukovic, L.D., Jevtic, P., Edens, L.J., and Levy, D.L. (2016). New insights into mechanisms and functions of nuclear size regulation. Int. Rev. Cell Mol. Biol. 322, 1–59.
* Webster, M., Witkin, K.L., and Cohen-Fix, O. (2009). Sizing up the nucleus: nuclear shape, size and nuclear-envelope assembly. J. Cell Sci. 122, 1477–1486.
* Zink, D., Fischer, A.H., and Nickerson, J.A. (2004). Nuclear structure in cancer cells. Nat. Rev. Cancer 4, 677–687.

**This script takes as an input the CSV file from the Kaggle Breast Cancer Wisconsin Dataset (https://www.kaggle.com/uciml/breast-cancer-wisconsin-data) containing information on various features describing the size and shape of the nucleus.  The measurements were made from digital images of a fine needle aspirate of a breast tissue mass.  The output of this script is a prediction of whether or not a given sample is cancerous.**

*Step 1: Import Modules*

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
import itertools
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import learning_curve
#from sklearn.metrics import make_scorer, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.metrics import confusion_matrix
%matplotlib inline
# os.chdir('/Users/ptm/desktop/Current_working_directory')
# trainingData = pd.read_csv('train.csv')
trainingData = pd.read_csv('../input/data.csv')

*Step 2: Describe Data*

In [None]:
def printColumnTitles(input):
    print('\nColumn Values:\n')
    print(input.columns.values)
    print('')
printColumnTitles(trainingData)

*Step 3: Minimize and Describe Dataset*

In [None]:
trainingData = trainingData.drop(['id', 'radius_mean', 'perimeter_mean',
 'compactness_mean', 'fractal_dimension_mean', 'radius_se',
 'texture_se', 'perimeter_se', 'smoothness_se', 'compactness_se',
 'concavity_se', 'concave points_se', 'fractal_dimension_se',
 'radius_worst', 'texture_worst', 'perimeter_worst',
 'smoothness_worst', 'compactness_worst', 'concavity_worst',
 'concave points_worst', 'fractal_dimension_worst', 'Unnamed: 32', 'area_se', 'symmetry_se',
 'area_worst', 'symmetry_worst', 'concavity_mean'], axis=1)

def describeDataAgain(input):
    print('\nNew summary of data after making changes:\n')
    print('Column Values:\n')
    print(input.columns.values)
    print('\nFirst Few Values:\n')
    print(input.head())
    print('\nNull Value Counts:\n')
    print(input.isnull().sum())
describeDataAgain(trainingData)

*Step 4: Plot Data*

I predict that the nuclei from the malignant samples were larger than the nuclei from the benign samples.

In [None]:
def plotSizeDistribution(input):
    """ 
    The output is two graphs (a histogram and a kernel density estimate) that each show the
    distribution of nuclear sizes for malignant and benign samples.
    """  
    sns.set_style("whitegrid")
    distributionOne = sns.FacetGrid(input, hue="diagnosis",aspect=2)
    distributionOne.map(plt.hist, 'area_mean', bins=30)
    distributionOne.add_legend()
    distributionOne.set_axis_labels('area_mean', 'Count')
    distributionOne.fig.suptitle('Area vs Diagnosis (Blue = Malignant; Green = Benign)')
    distributionTwo = sns.FacetGrid(input, hue="diagnosis",aspect=2)
    distributionTwo.map(sns.kdeplot,'area_mean',shade= True)
    distributionTwo.set(xlim=(0, input['area_mean'].max()))
    distributionTwo.add_legend()
    distributionTwo.set_axis_labels('area_mean', 'Proportion')
    distributionTwo.fig.suptitle('Area vs Diagnosis (Blue = Malignant; Green = Benign)')
plotSizeDistribution(trainingData)
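
As a numeric complement to the plots, the size comparison can also be summarized per diagnosis. The sketch below is not part of the original script. It assumes scikit-learn's bundled copy of the Wisconsin dataset (`load_breast_cancer`) as a stand-in for `data.csv`; in that copy the column is named `mean area` rather than `area_mean`, and the target is coded 0 = malignant, 1 = benign.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy of the dataset stands in for data.csv.
bunch = load_breast_cancer(as_frame=True)
df = bunch.frame.copy()
# In the bundled copy the target is coded 0 = malignant, 1 = benign.
df["diagnosis"] = df["target"].map({0: "M", 1: "B"})

# Sample count, median, and spread of nuclear area for each diagnosis.
areaSummary = df.groupby("diagnosis")["mean area"].agg(["count", "median", "std"])
print(areaSummary)
```

Comparing the `median` and `std` columns between the `M` and `B` rows gives a quick check of what the histogram and density plots show visually.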

The plots are consistent with my prediction: nuclei from benign samples cluster around a typical size, whereas nuclei from malignant samples span a wide range of sizes that are typically larger.  Let's look at the other features now.

In [None]:
def plotConcaveDistribution(input):
    """ 
    The output is two graphs (a histogram and a kernel density estimate) that each show the
    distribution of concave points for malignant and benign samples.
    """  
    sns.set_style("whitegrid")
    distributionOne = sns.FacetGrid(input, hue="diagnosis",aspect=2)
    distributionOne.map(plt.hist, 'concave points_mean', bins=30)
    distributionOne.add_legend()
    distributionOne.set_axis_labels('concave points_mean', 'Count')
    distributionOne.fig.suptitle('# of Concave Points vs Diagnosis (Blue = Malignant; Green = Benign)')
    distributionTwo = sns.FacetGrid(input, hue="diagnosis",aspect=2)
    distributionTwo.map(sns.kdeplot,'concave points_mean',shade= True)
    distributionTwo.set(xlim=(0, input['concave points_mean'].max()))
    distributionTwo.add_legend()
    distributionTwo.set_axis_labels('concave points_mean', 'Proportion')
    distributionTwo.fig.suptitle('# of Concave Points vs Diagnosis (Blue = Malignant; Green = Benign)')
plotConcaveDistribution(trainingData)

def plotSymmetryDistribution(input):
    """ 
    The output is two graphs (a histogram and a kernel density estimate) that each show the
    distribution of nuclear symmetry for malignant and benign samples.
    """  
    sns.set_style("whitegrid")
    distributionOne = sns.FacetGrid(input, hue="diagnosis",aspect=2)
    distributionOne.map(plt.hist, 'symmetry_mean', bins=30)
    distributionOne.add_legend()
    distributionOne.set_axis_labels('symmetry_mean', 'Count')
    distributionOne.fig.suptitle('Symmetry vs Diagnosis (Blue = Malignant; Green = Benign)')
    distributionTwo = sns.FacetGrid(input, hue="diagnosis",aspect=2)
    distributionTwo.map(sns.kdeplot,'symmetry_mean',shade= True)
    distributionTwo.set(xlim=(0, input['symmetry_mean'].max()))
    distributionTwo.add_legend()
    distributionTwo.set_axis_labels('symmetry_mean', 'Proportion')
    distributionTwo.fig.suptitle('Symmetry vs Diagnosis (Blue = Malignant; Green = Benign)')
plotSymmetryDistribution(trainingData)

def plotTextureDistribution(input):
    """ 
    The output is two graphs (a histogram and a kernel density estimate) that each show the
    distribution of nuclear texture for malignant and benign samples.
    """  
    sns.set_style("whitegrid")
    distributionOne = sns.FacetGrid(input, hue="diagnosis",aspect=2)
    distributionOne.map(plt.hist, 'texture_mean', bins=30)
    distributionOne.add_legend()
    distributionOne.set_axis_labels('texture_mean', 'Count')
    distributionOne.fig.suptitle('Texture vs Diagnosis (Blue = Malignant; Green = Benign)')
    distributionTwo = sns.FacetGrid(input, hue="diagnosis",aspect=2)
    distributionTwo.map(sns.kdeplot,'texture_mean',shade= True)
    distributionTwo.set(xlim=(0, input['texture_mean'].max()))
    distributionTwo.add_legend()
    distributionTwo.set_axis_labels('texture_mean', 'Proportion')
    distributionTwo.fig.suptitle('Texture vs Diagnosis (Blue = Malignant; Green = Benign)')
plotTextureDistribution(trainingData)

def plotSmoothnessDistribution(input):
    """ 
    The output is two graphs (a histogram and a kernel density estimate) that each show the
    distribution of nuclear smoothness for malignant and benign samples.
    """  
    sns.set_style("whitegrid")
    distributionOne = sns.FacetGrid(input, hue="diagnosis",aspect=2)
    distributionOne.map(plt.hist, 'smoothness_mean', bins=30)
    distributionOne.add_legend()
    distributionOne.set_axis_labels('smoothness_mean', 'Count')
    distributionOne.fig.suptitle('Smoothness vs Diagnosis (Blue = Malignant; Green = Benign)')
    distributionTwo = sns.FacetGrid(input, hue="diagnosis",aspect=2)
    distributionTwo.map(sns.kdeplot,'smoothness_mean',shade= True)
    distributionTwo.set(xlim=(0, input['smoothness_mean'].max()))
    distributionTwo.add_legend()
    distributionTwo.set_axis_labels('smoothness_mean', 'Proportion')
    distributionTwo.fig.suptitle('Smoothness vs Diagnosis (Blue = Malignant; Green = Benign)')
plotSmoothnessDistribution(trainingData)

*Step 5: Preprocess Data*

In [None]:
def diagnosisToBinary(input):
    """ 
    The output is a modified dataframe where 0 = "benign" and 1 = "malignant".
    """ 
    # Map the labels explicitly (B -> 0, M -> 1). This gives the same coding as before
    # and avoids assigning to .cat.categories, which newer pandas versions no longer allow.
    input["diagnosis"] = input["diagnosis"].map({"B": 0, "M": 1}).astype("int")
diagnosisToBinary(trainingData)     

def areaToCategory(input):
    """ 
    The output is a modified dataframe where the area measurements are replaced with integer bin 
    indices from zero to four based on their position within predetermined bins (chosen from the previous distribution plots).
    """ 
    input['area_mean'] = input.area_mean.fillna(-0.5)
    bins = (-0.01, 100, 750, 1250, 2000, 10000)
    categories = pd.cut(input.area_mean, bins, labels=False)
    input.area_mean = categories
areaToCategory(trainingData)

def concaveToCategory(input):
    """ 
    The output is a modified dataframe where the shape measurements are replaced with integer bin 
    indices from zero to three based on their position within predetermined bins.
    """ 
    # Get rid of the space in the column names (operate on the argument, not the global)
    input.columns = input.columns.map(lambda x: x.replace(' ', '_') if isinstance(x, str) else x)
    # Run the function
    input['concave_points_mean'] = input.concave_points_mean.fillna(-0.5)
    bins = (-0.01, 0.03, 0.06, 0.1, 1.0)
    categories = pd.cut(input.concave_points_mean, bins, labels=False)
    input.concave_points_mean = categories
concaveToCategory(trainingData)

def symmetryToCategory(input):
    """ 
    The output is a modified dataframe where the symmetry measurements are replaced with integer bin 
    indices from zero to three based on their position within predetermined bins.
    """ 
    input['symmetry_mean'] = input.symmetry_mean.fillna(-0.5)
    bins = (-0.01, 0.15, 0.17, 0.2, 1.0)
    categories = pd.cut(input.symmetry_mean, bins, labels=False)
    input.symmetry_mean = categories
symmetryToCategory(trainingData)

def textureToCategory(input):
    """ 
    The output is a modified dataframe where the texture measurements are replaced with integer bin 
    indices from zero to four based on their position within predetermined bins.
    """ 
    input['texture_mean'] = input.texture_mean.fillna(-0.5)
    bins = (-0.01, 10, 15, 19, 25, 100)
    categories = pd.cut(input.texture_mean, bins, labels=False)
    input.texture_mean = categories
textureToCategory(trainingData)

def smoothnessToCategory(input):
    """ 
    The output is a modified dataframe where the smoothness measurements are replaced with integer bin 
    indices from zero to four based on their position within predetermined bins.
    """ 
    input['smoothness_mean'] = input.smoothness_mean.fillna(-0.5)
    bins = (-0.01, 0.07, 0.09, 0.11, .13, 1)
    categories = pd.cut(input.smoothness_mean, bins, labels=False)
    input.smoothness_mean = categories
smoothnessToCategory(trainingData)
# Furthermore, we will need to split up our training data, setting aside 20%
# of the training data for cross-validation testing, such that we can avoid
# potentially overfitting the data.
xValues = trainingData.drop(['diagnosis'], axis=1)
yValues = trainingData['diagnosis']
X_train, X_test, Y_train, Y_test = train_test_split(xValues, yValues, test_size=0.2, random_state=23)
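
Because benign samples outnumber malignant ones, it can help to keep the class ratio identical in the training and held-out partitions. The following is a minimal sketch of a stratified split, not the split used above. It assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) in place of `data.csv`, and the names `isMalignant`, `X_tr`, `X_te`, `y_tr`, and `y_te` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled copy of the dataset stands in for data.csv.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
# The bundled target is coded 0 = malignant, 1 = benign; recode to 1 = malignant.
isMalignant = (y == 0).astype(int)

# stratify keeps the malignant/benign ratio (almost) the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, isMalignant, test_size=0.2, random_state=23, stratify=isMalignant)

print("malignant fraction (train):", round(float(y_tr.mean()), 3))
print("malignant fraction (test): ", round(float(y_te.mean()), 3))
```

Without `stratify`, the two fractions can drift apart by chance, which makes accuracy on a small held-out set harder to interpret.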

*Step 6: Describe Processed Data*

In [None]:
describeDataAgain(trainingData)
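
The binning used in Step 5 can be sanity-checked on a handful of values. This is a small sketch rather than part of the original script: `binColumn` is a hypothetical helper, the areas are toy values (not taken from the dataset), and the bin edges are copied from `areaToCategory`.

```python
import pandas as pd

def binColumn(values, bins):
    """Return the integer index of the bin each value falls into (hypothetical helper)."""
    return pd.cut(values, bins, labels=False)

# Toy values for illustration only; bin edges copied from areaToCategory.
toyAreas = pd.Series([50.0, 500.0, 1000.0, 1500.0, 3000.0])
areaBins = (-0.01, 100, 750, 1250, 2000, 10000)

binned = binColumn(toyAreas, areaBins)
print(binned.tolist())

# A value below the first edge (such as the -0.5 fill value) lands in no bin.
outOfRange = binColumn(pd.Series([-0.5]), areaBins)
print(outOfRange.isna().all())
```

Six bin edges define five bins, so the indices run from 0 to 4. Values outside the edges come back as missing, which is worth keeping in mind if the dataset ever contains nulls that are filled with -0.5.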

In [None]:
def makeAHeatMap(input):
    """ The output is a heatmap showing the relationship between each numerical feature"""  
    plt.figure(figsize=[8,6])
    heatmap = sns.heatmap(input.corr(), vmax=1.0, square=True, annot=True)
    heatmap.set_title('Pearson Correlation Coefficients')
makeAHeatMap(trainingData)
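
The column of the heatmap that relates each feature to the diagnosis can also be read off as a ranked list. The sketch below is illustrative only: it assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`), uses the raw measurements rather than the binned values from Step 5, and recodes the target so that 1 = malignant.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy of the dataset stands in for data.csv.
bunch = load_breast_cancer(as_frame=True)
features = ["mean area", "mean concave points", "mean symmetry",
            "mean texture", "mean smoothness"]
df = bunch.frame[features].copy()
# The bundled target is coded 0 = malignant, 1 = benign; recode to 1 = malignant.
df["malignant"] = (bunch.frame["target"] == 0).astype(int)

# Pearson correlation of each feature with the malignant label, strongest first.
ranked = df.corr()["malignant"].drop("malignant").sort_values(ascending=False)
print(ranked)
```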

With this heatmap we can see that large, misshapen nuclei are typically from cancerous samples. Let's explore that in more detail.

In [None]:
def pivotTheData(input):
    """ The output is a couple of pivot tables showing the relationship between each feature."""    
    print('\nPivot Tables:\n')
    print(input[["area_mean", "diagnosis"]].groupby(['area_mean'], as_index=False).mean().sort_values(by='diagnosis', ascending=False))
    print('')
    print(input[["concave_points_mean", "diagnosis"]].groupby(['concave_points_mean'], as_index=False).mean().sort_values(by='diagnosis', ascending=False))
    print('')
    print(input[['symmetry_mean', 'diagnosis']].groupby(['symmetry_mean'], as_index=False).mean().sort_values(by='diagnosis', ascending=False))
    print('')
    print(input[['texture_mean', 'diagnosis']].groupby(['texture_mean'], as_index=False).mean().sort_values(by='diagnosis', ascending=False))
    print('')
    print(input[['smoothness_mean', 'diagnosis']].groupby(['smoothness_mean'], as_index=False).mean().sort_values(by='diagnosis', ascending=False))
    print('')
pivotTheData(trainingData)

def plotTheData(input):
    """ The output is a bunch of bar graphs illustrating the relationships between features."""  
    fig = plt.figure(figsize=[10,8])
    fig.subplots_adjust(wspace=0.3,hspace=2.0)
    plt.subplot(321)
    plotOne = sns.barplot(x='area_mean', y='diagnosis', data=input, capsize=.1, linewidth=2.5, facecolor=(1, 1, 1, 0), errcolor=".2", edgecolor=".2")
    plotOne.set_title('Diagnosis vs Area')
    plotOne.set(xlabel='Surface Area \n (0 = smallest nuclei, 4 = largest nuclei)', ylabel='Probability of \n Malignant Diagnosis')
    plt.subplot(322)
    plotTwo = sns.barplot(x='concave_points_mean', y='diagnosis', data=input, capsize=.1, linewidth=2.5, facecolor=(1, 1, 1, 0), errcolor=".2", edgecolor=".2")
    plotTwo.set_title('Diagnosis vs # Concave Points')
    plotTwo.set(xlabel='Number of Concave Points \n (0 = least points, 3 = most points)', ylabel='Probability of \n Malignant Diagnosis')
    plt.subplot(323)
    plotThree = sns.barplot(x='texture_mean', y='diagnosis', data=input, capsize=.1, linewidth=2.5, facecolor=(1, 1, 1, 0), errcolor=".2", edgecolor=".2")
    plotThree.set_title('Diagnosis vs Texture')
    plotThree.set(xlabel='Texture \n (0 = low gray value stdev, 4 = high gray value stdev)', ylabel='Probability of \n Malignant Diagnosis')
    plt.subplot(324)
    plotFour = sns.barplot(x='symmetry_mean', y='diagnosis', data=input, capsize=.1, linewidth=2.5, facecolor=(1, 1, 1, 0), errcolor=".2", edgecolor=".2")
    plotFour.set_title('Diagnosis vs Symmetry') 
    plotFour.set(xlabel='Symmetry \n (0 = low symmetry score, 3 = high symmetry score)', ylabel='Probability of \n Malignant Diagnosis')
    plt.subplot(325)
    plotFive = sns.barplot(x='smoothness_mean', y='diagnosis', data=input, capsize=.1, linewidth=2.5, facecolor=(1, 1, 1, 0), errcolor=".2", edgecolor=".2")
    plotFive.set_title('Diagnosis vs Smoothness') 
    plotFive.set(xlabel='Smoothness \n (0 = low variation in radius lengths, 4 = high variation in radius lengths)', ylabel='Probability of \n Malignant Diagnosis')
plotTheData(trainingData)

Great!  This means that our classification algorithms should have something good to work with.  Next we will identify a suitable classification algorithm that we can use to predict whether or not a given sample is malignant.

*Step 7: Compare Classification Algorithms*

In [None]:
def compareABunchOfDifferentModelsAccuracy(a, b, c, d):
    """
    The output is a table and boxplot illustrating the accuracy score for each of nine algorithms given this input.
    """    
    print('\nCompare Multiple Classifiers:\n')
    print('K-Fold Cross-Validation Accuracy:\n')
    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('RF', RandomForestClassifier()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('SVM', SVC()))
    models.append(('LSVM', LinearSVC()))
    models.append(('GNB', GaussianNB()))
    models.append(('DTC', DecisionTreeClassifier()))
    models.append(('GBC', GradientBoostingClassifier()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    resultsAccuracy = []
    names = []
    for name, model in models:
        model.fit(a, b)
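        # random_state only has an effect when shuffle=True; recent scikit-learn rejects it otherwise
        kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)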
        accuracy_results = model_selection.cross_val_score(model, a,b, cv=kfold, scoring='accuracy')
        resultsAccuracy.append(accuracy_results)
        names.append(name)
        accuracyMessage = "%s: %f (%f)" % (name, accuracy_results.mean(), accuracy_results.std())
        print(accuracyMessage)
    # boxplot algorithm comparison
    fig = plt.figure()
    fig.suptitle('Algorithm Comparison: Accuracy')
    ax = fig.add_subplot(111)
    plt.boxplot(resultsAccuracy)
    ax.set_xticklabels(names)
    ax.set_ylabel('Cross-Validation: Accuracy Score')
    plt.show()
    return
compareABunchOfDifferentModelsAccuracy(X_train, Y_train, X_test, Y_test)

def defineModels():
    print('\nLR = LogisticRegression')
    print('RF = RandomForestClassifier')
    print('KNN = KNeighborsClassifier')
    print('SVM = Support Vector Machine SVC')
    print('LSVM = LinearSVC')
    print('GNB = GaussianNB')
    print('DTC = DecisionTreeClassifier')
    print('GBC = GradientBoostingClassifier')
    print('LDA = LinearDiscriminantAnalysis\n')
defineModels()
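
Several of the classifiers compared above (for example LR, KNN, and SVM) are sensitive to feature scale, and plain k-fold splitting does not preserve the class ratio in each fold. As a hedged variation on the comparison, and not the procedure used in this script, the sketch below scores one model with a scaling pipeline and stratified folds. It assumes scikit-learn's bundled copy of the dataset and uses the raw (unbinned) versions of the five features kept in Step 3.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the dataset stands in for data.csv.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X = X[["mean area", "mean concave points", "mean symmetry",
       "mean texture", "mean smoothness"]]

# Scaling happens inside the pipeline, so it is refit on each training fold.
model = make_pipeline(StandardScaler(), LogisticRegression())
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("LR: %f (%f)" % (scores.mean(), scores.std()))
```

Putting the scaler inside the pipeline keeps information from the validation fold out of the preprocessing step.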

In [None]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Plots a learning curve. http://scikit-learn.org/stable/modules/learning_curve.html
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

plot_learning_curve(LogisticRegression(), 'Learning Curve For Logistic Regression Classifier', X_train, Y_train, (0.85,1), 10)
plot_learning_curve(SVC(), 'Learning Curve For SVM Classifier', X_train, Y_train, (0.85,1), 10)
plot_learning_curve(LinearDiscriminantAnalysis(), 'Learning Curve For LDA Classifier', X_train, Y_train, (0.85,1), 10)
plot_learning_curve(KNeighborsClassifier(), 'Learning Curve For K-Nearest Neighbors Classifier', X_train, Y_train, (0.85,1), 10)

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        # Normalize each row before plotting so the image and the cell labels agree.
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.figure(figsize = (5,5))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# diagnosisToBinary codes the sorted labels (B, M) as (0, 1)
dict_characters = {0: 'Benign', 1: 'Malignant'}

def selectParametersForSVM(a, b, c, d):
    model = SVC()
    parameters = {'C': [0.01, 0.1, 0.5, 1.0, 5.0, 10, 25, 50, 100],
                  'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}
    accuracy_scorer = make_scorer(accuracy_score)
    grid_obj = GridSearchCV(model, parameters, scoring=accuracy_scorer)
    grid_obj = grid_obj.fit(a, b)
    model = grid_obj.best_estimator_
    model.fit(a, b)
    print('Selected Parameters for SVM:\n')
    print(model,"\n")
#    predictions = model.predict(c)
#    print(accuracy_score(d, predictions))
#    print('Logistic Regression - Training set accuracy: %s' % accuracy_score(d, predictions))
    # random_state only has an effect when shuffle=True; recent scikit-learn rejects it otherwise
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
    accuracy = model_selection.cross_val_score(model, a,b, cv=kfold, scoring='accuracy')
    mean = accuracy.mean() 
    stdev = accuracy.std()
    print('Support Vector Machine - Cross-validation accuracy on training set: %s (%s)' % (mean, stdev))
    print('')
    prediction = model.predict(c)
    cnf_matrix = confusion_matrix(d, prediction)
    np.set_printoptions(precision=2)
    class_names = list(dict_characters.values())  # tick labels should be the names, not the dict keys
    plt.figure()
    plot_confusion_matrix(cnf_matrix, classes=class_names,title='Confusion matrix')
    plt.figure()
    plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')
    plot_learning_curve(model, 'Learning Curve For SVM Classifier', X_train, Y_train, (0.85,1), 10)
selectParametersForSVM(X_train, Y_train, X_test, Y_test)
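
For a diagnostic task, the off-diagonal cells of the confusion matrix matter more than overall accuracy, because a missed malignant sample is costlier than a false alarm. The sketch below shows how sensitivity and specificity can be read from a 2x2 confusion matrix. The labels are toy values for illustration only (0 = benign, 1 = malignant) and are not results from the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (0 = benign, 1 = malignant).
yTrue = np.array([0, 0, 0, 1, 1, 1, 1, 0])
yPred = np.array([0, 0, 1, 1, 1, 0, 1, 0])

# With labels=[0, 1], ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(yTrue, yPred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)  # fraction of malignant samples that were caught
specificity = tn / (tn + fp)  # fraction of benign samples correctly cleared
print(sensitivity, specificity)
```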

It looks like our model can predict with about 90% accuracy whether or not a given sample is malignant.  That is pretty good!
To improve the accuracy of our model, however, we will need to add back some of the features that we previously removed and engineer some new features.  Furthermore, I will need to add a feature selection step and a parameter optimization step.  These additions can be found in the following kernel: https://www.kaggle.com/paultimothymooney/predicting-breast-cancer-from-nuclear-shape
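
To give a rough idea of what a combined feature selection and parameter optimization step might look like, here is a minimal sketch. It is not the implementation from the linked kernel. It assumes scikit-learn's bundled copy of the dataset with all 30 features, univariate selection with `SelectKBest`, and a small illustrative parameter grid.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled copy of the dataset stands in for data.csv.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=23, stratify=y)

# Scaling, feature selection, and the classifier are tuned together, so the
# selection step only ever sees the training portion of each fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("svm", SVC()),
])
paramGrid = {
    "select__k": [5, 10, 20],          # illustrative values, not tuned choices
    "svm__C": [0.1, 1.0, 10.0],
    "svm__kernel": ["linear", "rbf"],
}
search = GridSearchCV(pipe, paramGrid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
print("held-out accuracy:", search.score(X_te, y_te))
```

The held-out partition is touched only once, after the grid search has finished, so its score is a fairer estimate than the cross-validation score used to pick the parameters.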