# Use of Machine Learning in Diagnosing Breast Cancer
# [Part 1] Building several classification models

For this project, we will use an open dataset from Kaggle containing breast cancer features. We will apply several machine learning algorithms to this dataset and eventually learn how to perform hyperparameter optimization on them.

## Download the dataset from Kaggle

In [1]:
#Here is the download url for this open dataset
#https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

The download comes as a .zip file. Unzip it in the same working directory as your Jupyter notebook file.
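
If you prefer to unzip from Python rather than from the file manager, a minimal sketch with the standard library could look like this. The helper name `extract_archive` is made up for illustration, and the archive name in the commented call is an assumption based on the directory listing.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Example call, assuming the download is named archive.zip:
# extract_archive("archive.zip")
```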

In [2]:
ls

archive.zip                             Kaggle_Breast_Cancer_Diagnosis.ipynb
data.csv                                Processed_breast_cancer_features.csv
Kaggle_Breast_Cancer_Diagnosis_2.ipynb


### Import libraries

In [3]:
import pandas as pd

In [4]:
df = pd.read_csv("data.csv")

In [5]:
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


## Cleaning the data

Looking at the data, we note that it is already quite clean, as you would expect of many datasets on Kaggle. Here we will just drop the 'id' and 'Unnamed: 32' columns and recode the values in the 'diagnosis' column as numbers.

In [6]:
df = df.drop(['id'],axis=1)

In [7]:
df = df.drop(['Unnamed: 32'],axis=1)

In [8]:
df

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [9]:
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
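
`Series.map` returns NaN for any label that is not a key in the dictionary, so a quick check after this step can catch unexpected values. A minimal sketch on a toy frame (the `toy` data is made up for illustration and is not taken from data.csv):

```python
import pandas as pd

# Toy frame standing in for df; the third label is deliberately malformed.
toy = pd.DataFrame({"diagnosis": ["M", "B", "m"]})

mapped = toy["diagnosis"].map({"M": 1, "B": 0})

# Series.map leaves NaN wherever a value has no key in the dictionary.
unmapped = toy.loc[mapped.isna(), "diagnosis"].tolist()
print(unmapped)
```

An empty list here would mean every label was recoded.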

In [10]:
df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


#### Save the clean version of the data for use in part 2

In [11]:
df.to_csv('Processed_breast_cancer_features.csv', index=False)

## Group the features

In [12]:
df.columns

Index(['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

To describe the data briefly: the features were computed from digitized images of fine needle aspirates (FNA) of breast masses, and the columns above are attributes of the cell nuclei seen in each image, summarized as a mean, a standard error and a "worst" value.

`Reason:` As cells become more cancerous, you expect changes in the morphology of the nucleus, and measuring these changes helps quantify whether the cells are typical of cancer cells or not.

In [13]:
mean_features = list(df.columns[1:11])
se_features = list(df.columns[11:21])
worst_features = list(df.columns[21:])
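
The slices above depend on column order. As an alternative sketch, the same three groups can be built from the name suffixes, which keeps working if columns are reordered. The shortened `columns` list is hypothetical; with the real data you would iterate over `df.columns`.

```python
# A shortened, hypothetical column list in the same naming scheme as df.columns.
columns = ["diagnosis", "radius_mean", "texture_mean",
           "radius_se", "texture_se", "radius_worst", "texture_worst"]

# One list of column names per suffix.
groups = {
    suffix: [c for c in columns if c.endswith("_" + suffix)]
    for suffix in ("mean", "se", "worst")
}
print(groups["se"])
```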

In [14]:
print(worst_features)

['radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']


In [15]:
mean_features.append('diagnosis')
se_features.append('diagnosis')
worst_features.append('diagnosis')

### Perform a correlation on the data 

In [16]:
mean_corr = df[mean_features].corr()
mean_corr

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,diagnosis
radius_mean,1.0,0.323782,0.997855,0.987357,0.170581,0.506124,0.676764,0.822529,0.147741,-0.311631,0.730029
texture_mean,0.323782,1.0,0.329533,0.321086,-0.023389,0.236702,0.302418,0.293464,0.071401,-0.076437,0.415185
perimeter_mean,0.997855,0.329533,1.0,0.986507,0.207278,0.556936,0.716136,0.850977,0.183027,-0.261477,0.742636
area_mean,0.987357,0.321086,0.986507,1.0,0.177028,0.498502,0.685983,0.823269,0.151293,-0.28311,0.708984
smoothness_mean,0.170581,-0.023389,0.207278,0.177028,1.0,0.659123,0.521984,0.553695,0.557775,0.584792,0.35856
compactness_mean,0.506124,0.236702,0.556936,0.498502,0.659123,1.0,0.883121,0.831135,0.602641,0.565369,0.596534
concavity_mean,0.676764,0.302418,0.716136,0.685983,0.521984,0.883121,1.0,0.921391,0.500667,0.336783,0.69636
concave points_mean,0.822529,0.293464,0.850977,0.823269,0.553695,0.831135,0.921391,1.0,0.462497,0.166917,0.776614
symmetry_mean,0.147741,0.071401,0.183027,0.151293,0.557775,0.602641,0.500667,0.462497,1.0,0.479921,0.330499
fractal_dimension_mean,-0.311631,-0.076437,-0.261477,-0.28311,0.584792,0.565369,0.336783,0.166917,0.479921,1.0,-0.012838


In [17]:
mean_prediction_var = ['radius_mean','perimeter_mean','area_mean','compactness_mean','concavity_mean','concave points_mean']
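
The lists in this section are picked by reading the correlation tables by eye. One way to make such a selection repeatable is to keep the features whose absolute correlation with `diagnosis` exceeds a cut-off. This is only a sketch: the `THRESHOLD` value and the `toy` numbers are assumptions for illustration, not the rule or the data used in this notebook.

```python
import pandas as pd

# Made-up numbers for illustration only; they are not rows from data.csv.
toy = pd.DataFrame({
    "radius_mean":  [10.0, 12.0, 18.0, 20.0, 11.0, 19.0],
    "texture_mean": [15.0, 22.0, 16.0, 21.0, 18.0, 17.0],
    "diagnosis":    [0, 0, 1, 1, 0, 1],
})

THRESHOLD = 0.5  # hypothetical cut-off, not a value used in this notebook

# Correlation of every feature with the target (the target itself is dropped).
target_corr = toy.corr()["diagnosis"].drop("diagnosis")
selected = target_corr[target_corr.abs() > THRESHOLD].index.tolist()
print(selected)
```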

In [18]:
se_corr = df[se_features].corr()
se_corr

Unnamed: 0,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,diagnosis
radius_se,1.0,0.213247,0.972794,0.95183,0.164514,0.356065,0.332358,0.513346,0.240567,0.227754,0.567134
texture_se,0.213247,1.0,0.223171,0.111567,0.397243,0.2317,0.194998,0.230283,0.411621,0.279723,-0.008303
perimeter_se,0.972794,0.223171,1.0,0.937655,0.151075,0.416322,0.362482,0.556264,0.266487,0.244143,0.556141
area_se,0.95183,0.111567,0.937655,1.0,0.07515,0.28484,0.270895,0.41573,0.134109,0.127071,0.548236
smoothness_se,0.164514,0.397243,0.151075,0.07515,1.0,0.336696,0.268685,0.328429,0.413506,0.427374,-0.067016
compactness_se,0.356065,0.2317,0.416322,0.28484,0.336696,1.0,0.801268,0.744083,0.394713,0.803269,0.292999
concavity_se,0.332358,0.194998,0.362482,0.270895,0.268685,0.801268,1.0,0.771804,0.309429,0.727372,0.25373
concave points_se,0.513346,0.230283,0.556264,0.41573,0.328429,0.744083,0.771804,1.0,0.31278,0.611044,0.408042
symmetry_se,0.240567,0.411621,0.266487,0.134109,0.413506,0.394713,0.309429,0.31278,1.0,0.369078,-0.006522
fractal_dimension_se,0.227754,0.279723,0.244143,0.127071,0.427374,0.803269,0.727372,0.611044,0.369078,1.0,0.077972


In [19]:
se_prediction_var = ['radius_se','perimeter_se','area_se','concave points_se']

In [20]:
worst_corr = df[worst_features].corr()
worst_corr

Unnamed: 0,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,diagnosis
radius_worst,1.0,0.359921,0.993708,0.984015,0.216574,0.47582,0.573975,0.787424,0.243529,0.093492,0.776454
texture_worst,0.359921,1.0,0.365098,0.345842,0.225429,0.360832,0.368366,0.359755,0.233027,0.219122,0.456903
perimeter_worst,0.993708,0.365098,1.0,0.977578,0.236775,0.529408,0.618344,0.816322,0.269493,0.138957,0.782914
area_worst,0.984015,0.345842,0.977578,1.0,0.209145,0.438296,0.543331,0.747419,0.209146,0.079647,0.733825
smoothness_worst,0.216574,0.225429,0.236775,0.209145,1.0,0.568187,0.518523,0.547691,0.493838,0.617624,0.421465
compactness_worst,0.47582,0.360832,0.529408,0.438296,0.568187,1.0,0.892261,0.80108,0.614441,0.810455,0.590998
concavity_worst,0.573975,0.368366,0.618344,0.543331,0.518523,0.892261,1.0,0.855434,0.53252,0.686511,0.65961
concave points_worst,0.787424,0.359755,0.816322,0.747419,0.547691,0.80108,0.855434,1.0,0.502528,0.511114,0.793566
symmetry_worst,0.243529,0.233027,0.269493,0.209146,0.493838,0.614441,0.53252,0.502528,1.0,0.537848,0.416294
fractal_dimension_worst,0.093492,0.219122,0.138957,0.079647,0.617624,0.810455,0.686511,0.511114,0.537848,1.0,0.323872


In [21]:
worst_prediction_var = ['radius_worst','perimeter_worst','area_worst','concavity_worst','concave points_worst','fractal_dimension_worst']

In [22]:
prediction_var = mean_prediction_var + se_prediction_var + worst_prediction_var
prediction_var

['radius_mean',
 'perimeter_mean',
 'area_mean',
 'compactness_mean',
 'concavity_mean',
 'concave points_mean',
 'radius_se',
 'perimeter_se',
 'area_se',
 'concave points_se',
 'radius_worst',
 'perimeter_worst',
 'area_worst',
 'concavity_worst',
 'concave points_worst',
 'fractal_dimension_worst']

## Model Training

### Split the data into test and train

In [23]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.15, random_state=1)
# random_state=1 seeds the random number generator, so the same split is produced each time this cell runs
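
To see the effect of `random_state`, the sketch below splits a small stand-in dataset twice with the same seed and checks that the test rows match. It also shows the optional `stratify` argument, which keeps the class ratio the same in both parts; stratification is not used in this notebook and is included only as a possible variation. The arrays `X` and `y` are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 20 samples with labels 0/1 (made up for illustration).
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 12 + [1] * 8)

# Same seed, same split.
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.25, random_state=1)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.25, random_state=1)
same_split = np.array_equal(a_test, b_test)

# stratify=y keeps the class ratio (here 12:8) the same in train and test.
_, _, _, strat_test_y = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
print(same_split, len(strat_test_y), strat_test_y.sum())
```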

In [24]:
train_X = train[prediction_var]
train_y = train['diagnosis']
test_X = test[prediction_var]
test_y = test['diagnosis']

In [25]:
train_X

Unnamed: 0,radius_mean,perimeter_mean,area_mean,compactness_mean,concavity_mean,concave points_mean,radius_se,perimeter_se,area_se,concave points_se,radius_worst,perimeter_worst,area_worst,concavity_worst,concave points_worst,fractal_dimension_worst
501,13.82,92.33,595.9,0.16810,0.13570,0.067590,0.4751,2.974,39.05,0.016160,16.01,106.00,788.0,0.33810,0.15210,0.11830
545,13.62,87.19,573.2,0.06747,0.02974,0.024430,0.3460,2.066,31.24,0.009064,15.35,97.58,729.8,0.10490,0.07174,0.06953
62,14.25,96.42,645.7,0.20080,0.21350,0.086530,0.7036,5.373,60.78,0.018480,17.67,119.10,959.5,0.69220,0.17850,0.11320
344,11.71,75.03,420.3,0.07281,0.04006,0.032500,0.3446,2.355,24.53,0.011210,13.06,84.16,516.4,0.10870,0.07864,0.07806
457,13.21,84.10,537.9,0.05205,0.02772,0.020680,0.2084,1.314,17.58,0.006451,14.35,91.29,632.9,0.13900,0.06005,0.06788
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129,19.79,130.40,1192.0,0.15890,0.25450,0.114900,0.4953,2.765,63.33,0.010430,22.63,148.70,1589.0,0.56730,0.17320,0.08465
144,10.75,68.26,355.3,0.05139,0.02251,0.007875,0.2525,1.806,17.74,0.005612,11.95,77.79,441.2,0.09755,0.03413,0.06769
72,17.20,114.20,929.4,0.18300,0.16920,0.079440,0.5907,3.705,69.47,0.011270,23.32,151.60,1681.0,0.65660,0.18990,0.13390
235,14.03,89.79,603.4,0.06945,0.01462,0.018960,0.2589,1.667,22.07,0.010040,15.33,98.27,715.5,0.06231,0.07963,0.07617


### Import Libraries for ML

In [26]:
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

In [27]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score

### 1.  MLPClassifier

In [28]:
model = MLPClassifier()
model.fit(train_X,train_y)

MLPClassifier()

In [29]:
prediction = model.predict(test_X)
#prediction
#test_y

#### Performance Evaluation

To check the performance of the machine learning models, we will use:
1. Confusion matrix
2. Precision score
3. Recall score
4. Accuracy score

In [30]:
confusion_matrix(test_y,prediction)

array([[51,  1],
       [ 6, 28]])

Simply put, the confusion matrix shows the ways in which your classification model is confused when it makes predictions. In scikit-learn, rows are the actual classes and columns are the predicted classes, so for labels 0 and 1 the layout is [[TN, FP], [FN, TP]].
https://machinelearningmastery.com/confusion-matrix-machine-learning/

Precision is the ratio of true positives to all predicted positives, TP / (TP + FP).
Recall measures how many of the actual positives the model identifies, TP / (TP + FN). It is also called sensitivity or the true-positive rate.
Accuracy is the ratio of the number of correct predictions to the total number of predictions.
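
As a worked example, the three scores can be computed by hand from the MLPClassifier confusion matrix above. The only assumption is scikit-learn's layout for labels 0 and 1, [[TN, FP], [FN, TP]].

```python
# Counts read off the MLPClassifier confusion matrix above.
# scikit-learn lays the matrix out as [[TN, FP], [FN, TP]] for labels 0 and 1.
tn, fp = 51, 1
fn, tp = 6, 28

precision = tp / (tp + fp)                    # 28 / 29
recall = tp / (tp + fn)                       # 28 / 34
accuracy = (tp + tn) / (tp + tn + fp + fn)    # 79 / 86

print('%.2f %.2f %.2f' % (precision, recall, accuracy))  # 0.97 0.82 0.92
```

For a diagnosis task, the 6 false negatives (malignant cases predicted as benign) are the costly errors, which is why recall deserves as much attention as accuracy.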

In [31]:
precision = precision_score(test_y, prediction)
print('The precision score is %.2f' % precision)
recall = recall_score(test_y, prediction)
print('The recall score is %.2f' % recall)
accuracy = accuracy_score(test_y, prediction)
print('The accuracy score is %.2f' % accuracy)

The precision score is 0.97
The recall score is 0.82
The accuracy score is 0.92


### 2. KNeighborsClassifier

In [32]:
model = KNeighborsClassifier()
model.fit(train_X, train_y)
pred = model.predict(test_X)

#### Performance Evaluation

In [33]:
confusion_matrix(test_y, pred)

array([[51,  1],
       [ 5, 29]])

In [34]:
precision = precision_score(test_y, pred)
print('The precision score is %.2f' % precision)
recall = recall_score(test_y, pred)
print('The recall score is %.2f' % recall)
accuracy = accuracy_score(test_y, pred)
print('The accuracy score is %.2f' % accuracy)

The precision score is 0.97
The recall score is 0.85
The accuracy score is 0.93


### 3. Support Vector Machine Classifier

In [35]:
model = SVC()
model.fit(train_X, train_y)
pred1 = model.predict(test_X)

#### Performance Evaluation

In [36]:
confusion_matrix(test_y, pred1)

array([[52,  0],
       [ 8, 26]])

In [37]:
precision = precision_score(test_y, pred1)
print('The precision score is %.2f' % precision)
recall = recall_score(test_y, pred1)
print('The recall score is %.2f' % recall)
accuracy = accuracy_score(test_y, pred1)
print('The accuracy score is %.2f' % accuracy)

The precision score is 1.00
The recall score is 0.76
The accuracy score is 0.91
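
SVC with its default RBF kernel works on distances between samples, and the features here span very different ranges (area values in the hundreds, concave points values well below 1). Standardizing the features first may therefore change the result. The sketch below is an illustration under stated assumptions: it uses scikit-learn's bundled copy of the Wisconsin data (`load_breast_cancer`) instead of data.csv, with all 30 features, different column names and the opposite label coding (0 = malignant), so its numbers will not match the ones above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled copy of the data stands in for data.csv.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=1
)

# The scaler is fitted on the training split only, then applied to the test split.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_train, y_train)
scaled_accuracy = scaled_svc.score(X_test, y_test)

# Same model without scaling, for comparison.
plain_accuracy = SVC().fit(X_train, y_train).score(X_test, y_test)
print('scaled: %.2f, unscaled: %.2f' % (scaled_accuracy, plain_accuracy))
```

Putting the scaler inside the pipeline avoids leaking test-set statistics into training. The same wrapper can be tried with KNeighborsClassifier and MLPClassifier, which are also sensitive to feature scale.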


### 4. Random Forest Classifier

In [38]:
model = RandomForestClassifier()

In [39]:
model.fit(train_X,train_y) # recent scikit-learn versions display only non-default parameters, so the output is just RandomForestClassifier()

predictions = model.predict(test_X)
predictions

RandomForestClassifier()

Let's compare the predictions with test_y.

In [41]:
#test_y

Comparing the test_y values with the predictions, the model got some values wrong but is generally on track.

#### Performance Evaluation

In [42]:
confusion_matrix(test_y,predictions)

array([[51,  1],
       [ 5, 29]])

In [43]:
precision = precision_score(test_y, predictions)
print('The precision score is %.2f' % precision)
recall = recall_score(test_y, predictions)
print('The recall score is %.2f' % recall)
accuracy = accuracy_score(test_y, predictions)
print('The accuracy score is %.2f' % accuracy)

The precision score is 0.97
The recall score is 0.85
The accuracy score is 0.93
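
The fit, predict and score steps are repeated for every model above. One way to reduce the repetition is a small helper that runs them for any classifier. This is a sketch, not part of the original notebook: the helper name `evaluate` is made up, and the data comes from `make_classification` as a synthetic stand-in. In the notebook you would pass `train_X`, `train_y`, `test_X` and `test_y` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def evaluate(model, train_X, train_y, test_X, test_y):
    """Fit model and return precision, recall and accuracy on the test split."""
    model.fit(train_X, train_y)
    pred = model.predict(test_X)
    return {
        "precision": precision_score(test_y, pred),
        "recall": recall_score(test_y, pred),
        "accuracy": accuracy_score(test_y, pred),
    }


# Synthetic stand-in data (an assumption made for this sketch).
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=1)

models = {
    "KNeighborsClassifier": KNeighborsClassifier(),
    # random_state makes the forest reproducible from run to run.
    "RandomForestClassifier": RandomForestClassifier(random_state=1),
}
results = {name: evaluate(m, X_tr, y_tr, X_te, y_te) for name, m in models.items()}
for name, scores in results.items():
    print(name, {k: round(v, 2) for k, v in scores.items()})
```

Note that RandomForestClassifier and MLPClassifier involve randomness, so without a fixed `random_state` their scores can vary slightly between runs.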
