# Project 1: Breast Cancer Classification

## Data Set Information:
Samples arrive periodically as Dr. Wolberg reports his clinical cases. The database therefore reflects this chronological grouping of the data. This grouping information appears immediately below, having been removed from the data itself:

- Group 1: 367 instances (January 1989)
- Group 2: 70 instances (October 1989)
- Group 3: 31 instances (February 1990)
- Group 4: 17 instances (April 1990)
- Group 5: 48 instances (August 1990)
- Group 6: 49 instances (Updated January 1991)
- Group 7: 31 instances (June 1991)
- Group 8: 86 instances (November 1991)

Total: 699 points (as of the donated datbase on 15 July 1992)

## Attribute Information:

1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: (2 for benign, 4 for malignant)

## Importing the Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Reading the dataset

In [2]:
# Reading the dataset in the variable bcc (Breast Cancer Classification)
bcc = pd.read_csv('breast-cancer-wisconsin.data', 
                  sep = ',', 
                  header = None)
bcc.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


- Dataset does not have columns name

In [3]:
# Renaming the columns
cols = {0: 'ID', # Column 0. Sample code number: id number
        1: 'Clump_Thickness', # Column 1. Clump Thickness: 1 - 10
        2: 'Uniformity_of_Cell_Size', # Column 2. Uniformity of Cell Size: 1 - 10
        3: 'Uniformity_of_Cell_Shape', # Column 3. Uniformity of Cell Shape: 1 - 10
        4: 'Marginal_Adhesion',  # Column 4. Marginal Adhesion: 1 - 10
        5: 'Single_Epithelial_Cell_Size', # Column 5. Single Epithelial Cell Size: 1 - 10
        6: 'Bare_Nuclei', # Column 6. Bare Nuclei: 1 - 10
        7: 'Bland_Chromatin', # Column 7. Bland Chromatin: 1 - 10
        8: 'Normal_Nucleoli', # Column 8. Normal Nucleoli: 1 - 10
        9: 'Mitoses', # Column 9. Mitoses: 1 - 10
        10: 'Class' # Column 10. Class: (2 for benign, 4 for malignant)
        }
bcc.rename(columns = cols, inplace = True)
bcc.head()

Unnamed: 0,ID,Clump_Thickness,Uniformity_of_Cell_Size,Uniformity_of_Cell_Shape,Marginal_Adhesion,Single_Epithelial_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


In [4]:
bcc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   ID                           699 non-null    int64 
 1   Clump_Thickness              699 non-null    int64 
 2   Uniformity_of_Cell_Size      699 non-null    int64 
 3   Uniformity_of_Cell_Shape     699 non-null    int64 
 4   Marginal_Adhesion            699 non-null    int64 
 5   Single_Epithelial_Cell_Size  699 non-null    int64 
 6   Bare_Nuclei                  699 non-null    object
 7   Bland_Chromatin              699 non-null    int64 
 8   Normal_Nucleoli              699 non-null    int64 
 9   Mitoses                      699 non-null    int64 
 10  Class                        699 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 60.2+ KB


In [5]:
# bcc.Bare_Nuclei.astype('int64') #ValueError

In [6]:
bcc.Bare_Nuclei.value_counts()

1     402
10    132
2      30
5      30
3      28
8      21
4      19
?      16
9       9
7       8
6       4
Name: Bare_Nuclei, dtype: int64

In [7]:
# Droping values where Bare_Nuclei contains '?'
bcc.drop(index = bcc[bcc.Bare_Nuclei=='?'].index,inplace = True)

In [8]:
bcc.Bare_Nuclei.value_counts()

1     402
10    132
2      30
5      30
3      28
8      21
4      19
9       9
7       8
6       4
Name: Bare_Nuclei, dtype: int64

In [9]:
bcc.dtypes

ID                              int64
Clump_Thickness                 int64
Uniformity_of_Cell_Size         int64
Uniformity_of_Cell_Shape        int64
Marginal_Adhesion               int64
Single_Epithelial_Cell_Size     int64
Bare_Nuclei                    object
Bland_Chromatin                 int64
Normal_Nucleoli                 int64
Mitoses                         int64
Class                           int64
dtype: object

In [10]:
# Chenging the dtype of Bare_Nuclei
bcc.Bare_Nuclei = bcc.Bare_Nuclei.astype('int64')

In [11]:
bcc.dtypes

ID                             int64
Clump_Thickness                int64
Uniformity_of_Cell_Size        int64
Uniformity_of_Cell_Shape       int64
Marginal_Adhesion              int64
Single_Epithelial_Cell_Size    int64
Bare_Nuclei                    int64
Bland_Chromatin                int64
Normal_Nucleoli                int64
Mitoses                        int64
Class                          int64
dtype: object

In [12]:
bcc.describe()

Unnamed: 0,ID,Clump_Thickness,Uniformity_of_Cell_Size,Uniformity_of_Cell_Shape,Marginal_Adhesion,Single_Epithelial_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses,Class
count,683.0,683.0,683.0,683.0,683.0,683.0,683.0,683.0,683.0,683.0,683.0
mean,1076720.0,4.442167,3.150805,3.215227,2.830161,3.234261,3.544656,3.445095,2.869693,1.603221,2.699854
std,620644.0,2.820761,3.065145,2.988581,2.864562,2.223085,3.643857,2.449697,3.052666,1.732674,0.954592
min,63375.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
25%,877617.0,2.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,2.0
50%,1171795.0,4.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0,2.0
75%,1238705.0,6.0,5.0,5.0,4.0,4.0,6.0,5.0,4.0,1.0,4.0
max,13454350.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,4.0


In [13]:
bcc.nunique()

ID                             630
Clump_Thickness                 10
Uniformity_of_Cell_Size         10
Uniformity_of_Cell_Shape        10
Marginal_Adhesion               10
Single_Epithelial_Cell_Size     10
Bare_Nuclei                     10
Bland_Chromatin                 10
Normal_Nucleoli                 10
Mitoses                          9
Class                            2
dtype: int64

In [14]:
bcc.Mitoses.value_counts()

1     563
2      35
3      33
10     14
4      12
7       9
8       8
5       6
6       3
Name: Mitoses, dtype: int64

- Mitoses does not have any observation with value 9.

In [15]:
# Droping the Id Column
bcc.drop(columns = 'ID', inplace = True)
bcc.head()

Unnamed: 0,Clump_Thickness,Uniformity_of_Cell_Size,Uniformity_of_Cell_Shape,Marginal_Adhesion,Single_Epithelial_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses,Class
0,5,1,1,1,2,1,3,1,1,2
1,5,4,4,5,7,10,3,2,1,2
2,3,1,1,1,2,2,3,1,1,2
3,6,8,8,1,3,4,3,7,1,2
4,4,1,1,3,2,1,3,1,1,2


## Splitting the dataset into Independent (X) and Dependent (y) Variables.

In [16]:
X = bcc.iloc[:, :-1].values
y = bcc.iloc[:, -1].values
X.shape, y.shape

((683, 9), (683,))

## Splitting the dataset into Training set and Test Set

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((546, 9), (137, 9), (546,), (137,))

# 1. Logistic Regression

## Training the Logistic Regression model on the Training set

In [18]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train.ravel())

LogisticRegression()

## Predicting the test set results

In [19]:
y_pred = lr.predict(X_test)
pd.DataFrame({'Predicted y': y_pred, 'Actual y': y_test})

Unnamed: 0,Predicted y,Actual y
0,4,4
1,4,4
2,2,2
3,2,2
4,2,2
...,...,...
132,4,4
133,4,4
134,4,4
135,2,2


## Making the confusion matrix

In [20]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[78,  1],
       [ 5, 53]], dtype=int64)

## Accuracy Score

In [21]:
accuracy_all_model = [[],[],[]]
acc_score = (cm[0][0]+cm[1][1])/len(y_test)
accuracy_all_model[0].append(acc_score)
print(f"Accuracy Score: {acc_score}")

Accuracy Score: 0.9562043795620438


## Computing the accuracy with k-Fold Cross Validation

In [22]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=lr, 
                             X = X_train, 
                             y = y_train, 
                             cv = 10)
print('Average Accuracy: {:.2f} %'.format(accuracies.mean()*100))
print('Standard Deviation: {:.2f} %'.format(accuracies.std()*100))
accuracy_all_model[1].append(accuracies.mean()*100)
accuracy_all_model[2].append(accuracies.std()*100)

Average Accuracy: 96.71 %
Standard Deviation: 2.13 %


# 2. K-Nearest Neighbors 

## Training the K-Nearest Neighbors model on the Training set

In [23]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

KNeighborsClassifier()

## Predicting the test set results

In [24]:
y_pred = knn.predict(X_test)
pd.DataFrame({'Predicted y': y_pred, 'Actual y': y_test})

Unnamed: 0,Predicted y,Actual y
0,4,4
1,4,4
2,2,2
3,2,2
4,2,2
...,...,...
132,4,4
133,4,4
134,4,4
135,2,2


## Making the confusion matrix

In [25]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[78,  1],
       [ 6, 52]], dtype=int64)

## Accuracy Score

In [26]:
acc_score = (cm[0][0]+cm[1][1])/len(y_test)
print(f"Accuracy Score: {acc_score}")
accuracy_all_model[0].append(acc_score)

Accuracy Score: 0.948905109489051


## Computing the accuracy with k-Fold Cross Validation

In [27]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=knn, 
                             X = X_train, 
                             y = y_train, 
                             cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
accuracy_all_model[1].append(accuracies.mean()*100)
accuracy_all_model[2].append(accuracies.std()*100)

Accuracy: 96.89 %
Standard Deviation: 2.32 %


# 3. Support Vector Machine

## Training the Support vector classifier on the training dataset

In [28]:
from sklearn.svm import SVC
svm = SVC()
svm.fit(X_train, y_train)

SVC()

## Predicting the test set results

In [29]:
y_pred = svm.predict(X_test)
pd.DataFrame({'y_test': y_test, 'y_pred': y_pred})

Unnamed: 0,y_test,y_pred
0,4,4
1,4,4
2,2,2
3,2,2
4,2,2
...,...,...
132,4,4
133,4,4
134,4,4
135,2,2


## Making the confusion matrix

In [30]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[77,  2],
       [ 5, 53]], dtype=int64)

## Accuracy Score

In [31]:
acc_score = (cm[0][0]+cm[1][1])/len(y_test)
print(f"Accuracy Score: {acc_score}")
accuracy_all_model[0].append(acc_score)

Accuracy Score: 0.948905109489051


## Computing the accuracy with k-Fold Cross Validation

In [32]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=svm, 
                             X = X_train, 
                             y = y_train, 
                             cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
accuracy_all_model[1].append(accuracies.mean()*100)
accuracy_all_model[2].append(accuracies.std()*100)

Accuracy: 96.71 %
Standard Deviation: 2.28 %


# 4. Neive Bayes

## Training the Neive Bayes classifier on the training dataset

In [33]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train, y_train)

GaussianNB()

## Predicting the test set results

In [34]:
y_pred = nb.predict(X_test)
pd.DataFrame({'Predicted y': y_pred, 'Actual y': y_test})

Unnamed: 0,Predicted y,Actual y
0,4,4
1,4,4
2,2,2
3,2,2
4,2,2
...,...,...
132,4,4
133,4,4
134,4,4
135,2,2


## Making the confusion matrix

In [35]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[76,  3],
       [ 3, 55]], dtype=int64)

## Accuracy Score

In [36]:
acc_score = (cm[0][0]+cm[1][1])/len(y_test)
print(f"Accuracy Score: {acc_score}")
accuracy_all_model[0].append(acc_score)

Accuracy Score: 0.9562043795620438


## Computing the accuracy with k-Fold Cross Validation

In [37]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=nb, 
                             X = X_train, 
                             y = y_train, 
                             cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
accuracy_all_model[1].append(accuracies.mean()*100)
accuracy_all_model[2].append(accuracies.std()*100)

Accuracy: 96.33 %
Standard Deviation: 2.17 %


# 5. Decision Tree Classification

## Training the Neive Bayes classifier on the training dataset

In [38]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

DecisionTreeClassifier()

## Predicting the test set results

In [39]:
y_pred = nb.predict(X_test)
pd.DataFrame({'Predicted y': y_pred, 'Actual y': y_test})

Unnamed: 0,Predicted y,Actual y
0,4,4
1,4,4
2,2,2
3,2,2
4,2,2
...,...,...
132,4,4
133,4,4
134,4,4
135,2,2


## Making the confusion matrix

In [40]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[76,  3],
       [ 3, 55]], dtype=int64)

## Accuracy Score

In [41]:
acc_score = (cm[0][0]+cm[1][1])/len(y_test)
print(f"Accuracy Score: {acc_score}")
accuracy_all_model[0].append(acc_score)

Accuracy Score: 0.9562043795620438


## Computing the accuracy with k-Fold Cross Validation

In [42]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=dtc, 
                             X = X_train, 
                             y = y_train, 
                             cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
accuracy_all_model[1].append(accuracies.mean()*100)
accuracy_all_model[2].append(accuracies.std()*100)

Accuracy: 93.60 %
Standard Deviation: 2.18 %


# 6. Random Forest Classification

## Training the Neive Bayes classifier on the training dataset

In [43]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

RandomForestClassifier()

## Predicting the test set results

In [44]:
y_pred = nb.predict(X_test)
pd.DataFrame({'Predicted y': y_pred, 'Actual y': y_test})

Unnamed: 0,Predicted y,Actual y
0,4,4
1,4,4
2,2,2
3,2,2
4,2,2
...,...,...
132,4,4
133,4,4
134,4,4
135,2,2


## Making the confusion matrix

In [45]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[76,  3],
       [ 3, 55]], dtype=int64)

## Accuracy Score

In [46]:
acc_score = (cm[0][0]+cm[1][1])/len(y_test)
print(f"Accuracy Score: {acc_score}")
accuracy_all_model[0].append(acc_score)

Accuracy Score: 0.9562043795620438


## Computing the accuracy with k-Fold Cross Validation

In [47]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=rfc, 
                             X = X_train, 
                             y = y_train, 
                             cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
accuracy_all_model[1].append(accuracies.mean()*100)
accuracy_all_model[2].append(accuracies.std()*100)

Accuracy: 96.16 %
Standard Deviation: 2.07 %


## Best Classification Technique for this dataset

In [54]:
model = [lr, knn, svm, nb, dtc, rfc]
from sklearn.metrics import accuracy_score as ac
models = pd.DataFrame({'Model': ['Logistic Regression', 'K-Nearest Neighbors', 'Support Vector Machine', 'Neive Bayes', 'Decision Tree Classification', 'Random Forest Classification'],
             'Accuracy Score': [i*100 for i in accuracy_all_model[0]],
             'Accuracy': accuracy_all_model[1],
             'Standard Deviation': accuracy_all_model[2]})
models.sort_values(by = 'Accuracy Score', ascending = False)

Unnamed: 0,Model,Accuracy Score,Accuracy,Standard Deviation
0,Logistic Regression,95.620438,96.707071,2.129842
3,Neive Bayes,95.620438,96.333333,2.165873
4,Decision Tree Classification,95.620438,93.59596,2.184446
5,Random Forest Classification,95.620438,96.158249,2.072196
1,K-Nearest Neighbors,94.890511,96.888889,2.31733
2,Support Vector Machine,94.890511,96.707071,2.279777


- Logistic Regression is best model for this dataset