# Cancer cell samples - Supervised learning - Classification

This example is based on a dataset that is publicly available from the UCI Machine Learning Repository ([Asuncion and Newman, 2007](http://mlearn.ics.uci.edu/MLRepository.html)).

|Field name|Description|
|--- |--- |
|ID|Sample identifier|
|Clump|Clump thickness|
|UnifSize|Uniformity of cell size|
|UnifShape|Uniformity of cell shape|
|MargAdh|Marginal adhesion|
|SingEpiSize|Single epithelial cell size|
|BareNuc|Bare nuclei|
|BlandChrom|Bland chromatin|
|NormNucl|Normal nucleoli|
|Mit|Mitoses|
|Class|Benign or malignant|

We want to build a model that differentiates between benign and malignant cells.

We will use a neural network. Averaged over 10 random partitions of the total dataset, the model reaches a mean test accuracy of about 0.946.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('cell_samples.csv')

In [3]:
df.head()

| |ID|Clump|UnifSize|UnifShape|MargAdh|SingEpiSize|BareNuc|BlandChrom|NormNucl|Mit|Class|
|--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |
|0|1000025|5|1|1|1|2|1|3|1|1|2|
|1|1002945|5|4|4|5|7|10|3|2|1|2|
|2|1015425|3|1|1|1|2|2|3|1|1|2|
|3|1016277|6|8|8|1|3|4|3|7|1|2|
|4|1017023|4|1|1|3|2|1|3|1|1|2|


In [4]:
df.shape

(699, 11)

In [5]:
df.isna().sum()

ID             0
Clump          0
UnifSize       0
UnifShape      0
MargAdh        0
SingEpiSize    0
BareNuc        0
BlandChrom     0
NormNucl       0
Mit            0
Class          0
dtype: int64

In [6]:
df.drop(columns = 'ID', inplace = True)

In [7]:
df.head()

| |Clump|UnifSize|UnifShape|MargAdh|SingEpiSize|BareNuc|BlandChrom|NormNucl|Mit|Class|
|--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |
|0|5|1|1|1|2|1|3|1|1|2|
|1|5|4|4|5|7|10|3|2|1|2|
|2|3|1|1|1|2|2|3|1|1|2|
|3|6|8|8|1|3|4|3|7|1|2|
|4|4|1|1|3|2|1|3|1|1|2|


In [8]:
df['Class'].unique()

array([2, 4], dtype=int64)

In [9]:
df['Class'].replace({2:0, 4:1}, inplace = True)

In [10]:
df['Class'].unique()

array([0, 1], dtype=int64)

In [11]:
df.head()

| |Clump|UnifSize|UnifShape|MargAdh|SingEpiSize|BareNuc|BlandChrom|NormNucl|Mit|Class|
|--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |
|0|5|1|1|1|2|1|3|1|1|0|
|1|5|4|4|5|7|10|3|2|1|0|
|2|3|1|1|1|2|2|3|1|1|0|
|3|6|8|8|1|3|4|3|7|1|0|
|4|4|1|1|3|2|1|3|1|1|0|


In [12]:
# The dataset is somewhat imbalanced: there are roughly 2 benign samples for every malignant one.
# That said, we can work with this level of imbalance without applying any strategy to correct it.
df.groupby(['Class']).size() 

Class
0    458
1    241
dtype: int64
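
To put a number on the imbalance, the class counts shown above can be turned into proportions. This is a minimal sketch, with the counts copied from the output above into a standalone Series (the names `class_counts`, `class_share` and `imbalance_ratio` are illustrative and not part of the notebook):

```python
import pandas as pd

# Class counts copied from the groupby output above (0 = benign, 1 = malignant).
class_counts = pd.Series({0: 458, 1: 241}, name="count")

# Share of each class in the full dataset.
class_share = class_counts / class_counts.sum()

# Ratio of benign to malignant samples.
imbalance_ratio = class_counts[0] / class_counts[1]

print(class_share.round(3))
print(round(imbalance_ratio, 2))
```

Since about 66% of the samples are benign, a classifier that always predicts "benign" would already score roughly 0.66 accuracy, which is a useful baseline when reading the model's accuracy.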

In [13]:
# One variable, BareNuc, is stored as object (string) instead of int64, so pandas treats it as categorical
df.dtypes

Clump           int64
UnifSize        int64
UnifShape       int64
MargAdh         int64
SingEpiSize     int64
BareNuc        object
BlandChrom      int64
NormNucl        int64
Mit             int64
Class           int64
dtype: object
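
An `object` dtype on a column that should hold integers usually means at least one entry could not be parsed as a number. A string placeholder would also explain why `df.isna()` reported no missing values, since a string is not `NaN`. One way to find such entries is sketched below on a small hypothetical Series (the values, including the `'?'` placeholder, are illustrative and are not read from `cell_samples.csv`):

```python
import pandas as pd

# Hypothetical stand-in for a column that was read as strings; not the real data.
bare_nuc = pd.Series(["1", "10", "2", "?", "4"], name="BareNuc")

# Try to parse every entry; anything that cannot be parsed becomes NaN.
parsed = pd.to_numeric(bare_nuc, errors="coerce")

# Entries that failed to parse, with how often each one occurs.
non_numeric = bare_nuc[parsed.isna()].value_counts()
print(non_numeric)
```

Running the same check on `df['BareNuc']` before any conversion would show whether the column holds genuine categories or numeric values mixed with a placeholder.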

In [14]:
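# Transform categorical (object) features into integer category codes.
# Caveat: the codes follow the sorted order of the string labels, not their numeric value.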

for col_name in df.columns:
    if (df[col_name].dtype == 'object'):
        df[col_name]= df[col_name].astype('category')
        df[col_name] = df[col_name].cat.codes
        
df.head()

| |Clump|UnifSize|UnifShape|MargAdh|SingEpiSize|BareNuc|BlandChrom|NormNucl|Mit|Class|
|--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |
|0|5|1|1|1|2|0|3|1|1|0|
|1|5|4|4|5|7|1|3|2|1|0|
|2|3|1|1|1|2|2|3|1|1|0|
|3|6|8|8|1|3|4|3|7|1|0|
|4|4|1|1|3|2|0|3|1|1|0|
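
Note that `cat.codes` numbers the categories in the sorted order of their labels, and string labels sort lexicographically. In the output above, the `BareNuc` value 10 in the second row has become 1, and the value 1 has become 0. The sketch below reproduces the effect on hypothetical values and contrasts it with a numeric conversion; `pd.to_numeric` is shown as a possible alternative, not as the method used in this notebook:

```python
import pandas as pd

# Hypothetical string-valued column; not the real data.
raw = pd.Series(["1", "10", "2", "4", "1"])

# Category codes follow the sorted order of the string labels:
# the categories are ['1', '10', '2', '4'], so '10' receives code 1.
codes = raw.astype("category").cat.codes

# A numeric conversion keeps the original ordinal scale instead.
numeric = pd.to_numeric(raw, errors="coerce")

print(pd.DataFrame({"raw": raw, "codes": codes, "numeric": numeric}))
```

Because the codes no longer preserve the 1 to 10 ordering, any correlation computed on the coded column measures something different from the correlation of the original values.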


In [15]:
# We drop highly correlated features
# Create correlation matrix
corr_matrix = df.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# Drop features 
df.drop(to_drop, axis=1, inplace=True)

In [16]:
df.head()

| |Clump|UnifSize|UnifShape|MargAdh|SingEpiSize|BareNuc|BlandChrom|NormNucl|Mit|Class|
|--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |
|0|5|1|1|1|2|0|3|1|1|0|
|1|5|4|4|5|7|1|3|2|1|0|
|2|3|1|1|1|2|2|3|1|1|0|
|3|6|8|8|1|3|4|3|7|1|0|
|4|4|1|1|3|2|0|3|1|1|0|
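
The columns are unchanged after this step, so no pair of features exceeded the 0.95 threshold. To see the upper-triangle filter actually remove something, here is a sketch on a small hypothetical DataFrame in which column `b` is an exact multiple of `a` (the frame `toy` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'b' is perfectly correlated with 'a', 'c' is not.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})

corr = toy.corr().abs()

# Keep only the entries above the diagonal so each pair is examined once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped if it correlates above 0.95 with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)
```

Only the later column of a correlated pair is listed, so `a` is kept and `b` is dropped.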


In [17]:
# Now we select the features most correlated with the dependent variable Class

# Absolute correlation of each column with the output variable
corr_matrix = df.corr().abs()
corr_matrix_target = corr_matrix["Class"]

# Selecting highly correlated features
relevant_features = corr_matrix_target[corr_matrix_target > 0.5]
relevant_features

Clump          0.716001
UnifSize       0.817904
UnifShape      0.818934
MargAdh        0.696800
SingEpiSize    0.682785
BlandChrom     0.756616
NormNucl       0.712244
Class          1.000000
Name: Class, dtype: float64
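
The features to discard can also be derived from the threshold instead of being named by hand, which avoids overlooking a column. This is a minimal sketch on hypothetical correlation values (the numbers and the feature names `f1` to `f3` are made up for illustration):

```python
import pandas as pd

# Hypothetical absolute correlations with the target; not the real values.
corr_with_target = pd.Series({"f1": 0.82, "f2": 0.31, "f3": 0.55, "Class": 1.0})

threshold = 0.5

# Every feature (target excluded) whose correlation does not exceed the threshold.
features = corr_with_target.drop("Class")
low_corr = features[features <= threshold].index.tolist()
print(low_corr)
```

In the notebook, the resulting list could be passed to `df.drop(columns=...)`.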

In [18]:
# The variable BareNuc is discarded for modeling because its correlation with Class is below 0.5
# (Mit is also absent from the list above, i.e. below 0.5, but it is kept as a feature here)
df.drop(columns = 'BareNuc', inplace = True)

In [19]:
# We shouldn't need to normalize the values since they are already of the same order of magnitude
from sklearn.model_selection import train_test_split
#from sklearn.preprocessing import StandardScaler

val = df.values

X = val[:, :-1]  # every column except the last one is a feature
y = val[:, -1]   # the last column is the Class label
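
If scaling were wanted after all, the commented-out `StandardScaler` import hints at one option. The sketch below combines it with a stratified, reproducible split. It runs on hypothetical random data on the same 1 to 10 scale, and the names `X_demo`, `y_demo` and the `random_state` value are assumptions for illustration, not choices made in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data on the same 1-10 scale as the features; not the real dataset.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(100, 8)).astype(float)
y_demo = np.array([0] * 65 + [1] * 35)

# stratify keeps the class proportions similar in both partitions;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Fit the scaler on the training part only, then apply it to both parts,
# so that no information from the test set leaks into training.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))
print(y_tr.mean(), y_te.mean())
```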

In [20]:
# We load libraries, methods and functions to run the model as a neural network using Keras
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [21]:
# We define our Neural network for classification problem:
def NeuralNetwork_classification_model():
    
    # Create model: 3 hidden layers of 50 nodes each; the input shape matches the number of features in X_train
    model = Sequential()
    model.add(Dense(50, activation='relu', input_shape=(X_train.shape[1],)))
    model.add(Dense(50, activation='relu'))
    model.add(Dense(50, activation='relu'))
    # The final layer has 2 nodes because this is a classification problem with 2 possible classes (benign or malignant)
    model.add(Dense(2, activation='softmax'))
    
    # compile model
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
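
The softmax layer returns one probability per class for every sample, and the predicted class is the index of the larger one. Here is a minimal NumPy sketch on hypothetical probabilities (in the notebook such an array would come from `model.predict`; the values below are made up). It also computes recall for the malignant class, which accuracy alone does not reveal:

```python
import numpy as np

# Hypothetical softmax outputs for four samples: columns are P(class 0), P(class 1).
probs = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.55, 0.45],
    [0.30, 0.70],
])
y_true = np.array([0, 1, 1, 1])

# Predicted label = column with the highest probability.
y_pred = probs.argmax(axis=1)

# Accuracy = fraction of samples whose prediction matches the label.
accuracy_demo = (y_pred == y_true).mean()

# Recall for class 1 = fraction of true malignant samples predicted as malignant.
recall_malignant = (y_pred[y_true == 1] == 1).mean()

print(y_pred, accuracy_demo, recall_malignant)
```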

In [22]:
# We run the model on 10 random splits of the dataset to get a more stable estimate of accuracy:
acc = []

for i in range(0,10):
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, shuffle = True)
    model = NeuralNetwork_classification_model()
    model.fit(X_train, y_train, epochs=30, verbose=2)
    scores = model.evaluate(X_test, y_test, verbose=0)
    acc.append(scores[1])

accuracy = np.mean(acc)
print('Mean accuracy based on 10 random partitions of the total dataset = ', accuracy)

Train on 489 samples
Epoch 1/30
489/489 - 0s - loss: 0.6732 - accuracy: 0.5378
Epoch 2/30
489/489 - 0s - loss: 0.4813 - accuracy: 0.8384
Epoch 3/30
489/489 - 0s - loss: 0.4010 - accuracy: 0.8609
Epoch 4/30
489/489 - 0s - loss: 0.3517 - accuracy: 0.8732
Epoch 5/30
489/489 - 0s - loss: 0.3040 - accuracy: 0.9121
Epoch 6/30
489/489 - 0s - loss: 0.2709 - accuracy: 0.9202
Epoch 7/30
489/489 - 0s - loss: 0.2349 - accuracy: 0.9305
Epoch 8/30
489/489 - 0s - loss: 0.2147 - accuracy: 0.9468
Epoch 9/30
489/489 - 0s - loss: 0.1892 - accuracy: 0.9509
Epoch 10/30
489/489 - 0s - loss: 0.1649 - accuracy: 0.9550
Epoch 11/30
489/489 - 0s - loss: 0.1533 - accuracy: 0.9489
Epoch 12/30
489/489 - 0s - loss: 0.1293 - accuracy: 0.9673
Epoch 13/30
489/489 - 0s - loss: 0.1179 - accuracy: 0.9693
Epoch 14/30
489/489 - 0s - loss: 0.1128 - accuracy: 0.9734
Epoch 15/30
489/489 - 0s - loss: 0.1052 - accuracy: 0.9673
Epoch 16/30
489/489 - 0s - loss: 0.0891 - accuracy: 0.9714
Epoch 17/30
489/489 - 0s - loss: 0.0838 - ac

Epoch 19/30
489/489 - 0s - loss: 0.1121 - accuracy: 0.9611
Epoch 20/30
489/489 - 0s - loss: 0.1105 - accuracy: 0.9632
Epoch 21/30
489/489 - 0s - loss: 0.1023 - accuracy: 0.9611
Epoch 22/30
489/489 - 0s - loss: 0.1000 - accuracy: 0.9591
Epoch 23/30
489/489 - 0s - loss: 0.1034 - accuracy: 0.9632
Epoch 24/30
489/489 - 0s - loss: 0.0901 - accuracy: 0.9673
Epoch 25/30
489/489 - 0s - loss: 0.0810 - accuracy: 0.9693
Epoch 26/30
489/489 - 0s - loss: 0.0829 - accuracy: 0.9673
Epoch 27/30
489/489 - 0s - loss: 0.0825 - accuracy: 0.9673
Epoch 28/30
489/489 - 0s - loss: 0.0752 - accuracy: 0.9652
Epoch 29/30
489/489 - 0s - loss: 0.0782 - accuracy: 0.9673
Epoch 30/30
489/489 - 0s - loss: 0.0766 - accuracy: 0.9734
Train on 489 samples
Epoch 1/30
489/489 - 0s - loss: 0.6426 - accuracy: 0.5930
Epoch 2/30
489/489 - 0s - loss: 0.4940 - accuracy: 0.8507
Epoch 3/30
489/489 - 0s - loss: 0.3883 - accuracy: 0.8998
Epoch 4/30
489/489 - 0s - loss: 0.3232 - accuracy: 0.9059
Epoch 5/30
489/489 - 0s - loss: 0.2735 

Epoch 7/30
489/489 - 0s - loss: 0.2368 - accuracy: 0.9305
Epoch 8/30
489/489 - 0s - loss: 0.2218 - accuracy: 0.9346
Epoch 9/30
489/489 - 0s - loss: 0.1942 - accuracy: 0.9346
Epoch 10/30
489/489 - 0s - loss: 0.1742 - accuracy: 0.9407
Epoch 11/30
489/489 - 0s - loss: 0.1508 - accuracy: 0.9509
Epoch 12/30
489/489 - 0s - loss: 0.1362 - accuracy: 0.9591
Epoch 13/30
489/489 - 0s - loss: 0.1228 - accuracy: 0.9571
Epoch 14/30
489/489 - 0s - loss: 0.1126 - accuracy: 0.9652
Epoch 15/30
489/489 - 0s - loss: 0.1093 - accuracy: 0.9652
Epoch 16/30
489/489 - 0s - loss: 0.1041 - accuracy: 0.9611
Epoch 17/30
489/489 - 0s - loss: 0.0879 - accuracy: 0.9734
Epoch 18/30
489/489 - 0s - loss: 0.0788 - accuracy: 0.9714
Epoch 19/30
489/489 - 0s - loss: 0.0749 - accuracy: 0.9775
Epoch 20/30
489/489 - 0s - loss: 0.0685 - accuracy: 0.9734
Epoch 21/30
489/489 - 0s - loss: 0.0660 - accuracy: 0.9775
Epoch 22/30
489/489 - 0s - loss: 0.0636 - accuracy: 0.9836
Epoch 23/30
489/489 - 0s - loss: 0.0584 - accuracy: 0.9877


In [27]:
accuracy = np.mean(acc)
print('Mean accuracy based on 10 random partitions of the total dataset = ', accuracy)

Mean accuracy based on 10 random partitions of the total dataset =  0.94619054
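
A single mean hides how much the accuracy varies between partitions. The spread can be reported alongside it, as sketched below on hypothetical per-run values (`acc_demo` is made up; in the notebook the per-run accuracies are held in `acc`):

```python
import numpy as np

# Hypothetical per-run test accuracies; not the notebook's actual results.
acc_demo = [0.93, 0.95, 0.96, 0.94, 0.97]

mean_acc = np.mean(acc_demo)
# ddof=1 gives the sample standard deviation across runs.
std_acc = np.std(acc_demo, ddof=1)

print(f"accuracy = {mean_acc:.3f} +/- {std_acc:.3f}")
```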
