## Breast Cancer Diagnostic Classification Models and Their Performance Comparison
- This notebook contains exploratory data analysis and ML classification models that classify whether a tumor is benign or malignant.
- Dataset used: Breast Cancer Wisconsin (Diagnostic) Data Set (https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).
- In this notebook we are going to compare the various ML classification algorithms using metrics like:
 - Classification Accuracy
 - Confusion Matrix
 - Precision and Recall
 - F1 score
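
For reference, the metrics listed above have the following standard definitions in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). These symbols are not used in the original notebook, and which class is treated as "positive" is a convention.

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
```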

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

train = pd.read_csv(os.path.join(dirname, filename))
train.head()

In [None]:
train = train.drop(["Unnamed: 32"], axis=1)
train.shape

In [None]:
# Heatmap to mark all the missing values in white colour.
sns.heatmap(train.isnull(), cbar=False)

In [None]:
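# Encode the target as 0/1 (benign = 0, malignant = 1); recent XGBoost versions expect class labels starting at 0.
train["diagnosis"] = train["diagnosis"].map({"M": 1, "B": 0})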
train.head()

In [None]:
train_corr = train.corr()

### Heatmap to show the correlation matrix of the features.

In [None]:
plt.subplots(figsize=(30, 30))
ax = sns.heatmap(
    train_corr,
    vmin = -1.0, vmax = 1.0, center=0,
    cmap = sns.diverging_palette(20, 220),
    annot = True,
    square = True
)

ax.set_xticklabels(
    ax.get_xticklabels(),
    horizontalalignment='right'
)

### Selecting features
The approach is to choose the features whose absolute correlation with diagnosis is >= 0.5.

In [None]:
corr_target = abs(train_corr["diagnosis"])
features = corr_target[corr_target>=0.5]
features = features.keys()
features = features.delete(0)
features = features.tolist()
features
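
The selection rule above (keep a feature when its absolute correlation with the target is at least 0.5) can be sketched without pandas on a tiny made-up table. The helper names `pearson` and `select_features` and the toy columns are hypothetical and only illustrate the thresholding step.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def select_features(columns, target, threshold=0.5):
    """Keep column names whose |correlation| with the target meets the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]

# Toy data, made up for illustration (not taken from the dataset).
toy_target = [0, 0, 0, 1, 1, 1]
toy_columns = {
    "radius": [1.0, 1.2, 1.1, 2.0, 2.2, 2.1],  # rises with the target
    "smooth": [3.0, 2.9, 3.1, 1.0, 1.2, 0.9],  # falls with the target
    "noise":  [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],  # only weakly related
}
selected = select_features(toy_columns, toy_target)
print(selected)
```

Taking the absolute value means strongly negative correlations are kept as well, which is why the falling column passes the filter in this toy example.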

In [None]:
X = train[features]
X.head()

In [None]:
y = train["diagnosis"]
y.head()

### Splitting the dataset into train and test dataset.
- The dataset is split into train and test sets in 2 ways:
 - First, in a ratio of 80% - 20%.
 - Second, in a ratio of 70% - 30%.

In [None]:
# 80% - 20% Split
# Note: shuffle=False keeps the original row order, so random_state has no effect on this split.
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.20, random_state=42, shuffle=False)

# 70% - 30% Split
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.30, random_state=42)

algo = ["XGBoost Classifier", "Random Forest Classifier", "Logistic Regression", "AdaBoost Classifier", "Gradient Boosting", "Bagging Classifier", "CatBoost Classifier"]
accuracy1=[]
precision1 = []
recall1 = []
f1_score1 = []

accuracy2=[]
precision2 = []
recall2 = []
f1_score2 = []
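
The two splits above differ in more than their ratio: the first passes `shuffle=False`, while the second shuffles the rows before splitting. A rough pure-Python sketch of that difference follows. `split_indices` is a hypothetical helper, not scikit-learn's implementation; rounding the test size up is an assumption, and the shuffled order will not match what `train_test_split` produces for the same seed.

```python
import math
import random

def split_indices(n_samples, test_size, shuffle=True, seed=None):
    """Return (train_indices, test_indices) for a simple holdout split."""
    indices = list(range(n_samples))
    if shuffle:
        # Seeded shuffle so the split is repeatable.
        random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)  # assumption: round the test size up
    n_train = n_samples - n_test
    return indices[:n_train], indices[n_train:]

# Unshuffled 80/20 split: the test set is simply the last rows.
train_a, test_a = split_indices(10, 0.20, shuffle=False)
# Shuffled 70/30 split: the test rows are drawn from anywhere in the data.
train_b, test_b = split_indices(10, 0.30, shuffle=True, seed=42)

print(train_a, test_a)
print(sorted(test_b))
```

If the rows of a dataset happen to be ordered in some meaningful way, an unshuffled split can leave the test set unrepresentative, so it may be worth checking the class balance of both test sets.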

## Creating and training models
The following algorithms are used in this exercise:
1. XGBoost Classifier
2. Random Forest Classifier
3. Logistic Regression 
4. AdaBoost Classifier
5. Gradient Boosting
6. Bagging Classifier
7. CatBoost Classifier
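
Each algorithm in this list goes through the same fit, predict, and score steps. One possible way to express that pattern is a loop over a name-to-constructor mapping. The sketch below uses two hypothetical stand-in classifiers and made-up data so that it runs without external libraries. Any estimator exposing `fit` and `predict` could presumably take their place.

```python
from collections import Counter

class MajorityClassifier:
    """Hypothetical stand-in: always predicts the most common training label."""
    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

class ThresholdClassifier:
    """Hypothetical stand-in: predicts 1 when the first feature exceeds the training mean."""
    def fit(self, X, y):
        self.cut_ = sum(row[0] for row in X) / len(X)
        return self

    def predict(self, X):
        return [1 if row[0] > self.cut_ else 0 for row in X]

def evaluate(models, X_train, y_train, X_test, y_test):
    """Fit each model and return {name: accuracy in percent}."""
    results = {}
    for name, make_model in models.items():
        model = make_model().fit(X_train, y_train)
        predictions = model.predict(X_test)
        correct = sum(p == t for p, t in zip(predictions, y_test))
        results[name] = round(100 * correct / len(y_test), 2)
    return results

# Toy data, made up for illustration only.
toy_X_train = [[1.0], [1.2], [3.0], [3.1], [0.9]]
toy_y_train = [0, 0, 1, 1, 0]
toy_X_test = [[1.1], [2.9], [3.3], [0.8]]
toy_y_test = [0, 1, 1, 0]

toy_models = {"Majority": MajorityClassifier, "Threshold": ThresholdClassifier}
scores = evaluate(toy_models, toy_X_train, toy_y_train, toy_X_test, toy_y_test)
print(scores)
```

Keeping the evaluation in one function means a metric formula only has to be written, and checked, once.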

### 1) XGBoost Classifier

In [None]:
# Model 1 with 80% - 20% split
xg_model1 = XGBClassifier()
xg_model1.fit(X_train1, y_train1)
y_pred_xg1 = xg_model1.predict(X_test1)

# Model 2 with 70% - 30% split
xg_model2 = XGBClassifier()
xg_model2.fit(X_train2, y_train2)
y_pred_xg2 = xg_model2.predict(X_test2)

# The statement below is meant to be used for continuous target variable not categorical
# predictions = [round(value) for value in y_pred]

# Calculating Evaluation Metrics for the Model 1
xg_accuracy1 = accuracy_score(y_test1, y_pred_xg1) * 100
xg_confusion1 = confusion_matrix(y_test1, y_pred_xg1)
xg_precision1 = xg_confusion1[0][0]/(xg_confusion1[0][0] + xg_confusion1[1][0]) * 100
xg_recall1 = xg_confusion1[0][0]/(xg_confusion1[0][0] + xg_confusion1[0][1]) * 100
xg_f1_score1 = ((2 * xg_precision1 * xg_recall1) / (xg_precision1 + xg_recall1)) / 100

# Calculating Evaluation Metrics for the Model 2
xg_accuracy2 = accuracy_score(y_test2, y_pred_xg2) * 100
xg_confusion2 = confusion_matrix(y_test2, y_pred_xg2)
xg_precision2 = xg_confusion2[0][0]/(xg_confusion2[0][0] + xg_confusion2[1][0]) * 100
xg_recall2 = xg_confusion2[0][0]/(xg_confusion2[0][0] + xg_confusion2[0][1]) * 100
xg_f1_score2 = ((2 * xg_precision2 * xg_recall2) / (xg_precision2 + xg_recall2)) / 100

# Storing all the metrics in values for Model 1 in common lists
accuracy1.append(round(xg_accuracy1, 2))
precision1.append(round(xg_precision1, 2))
recall1.append(round(xg_recall1, 2))
f1_score1.append(round(xg_f1_score1, 4))

# Storing all the metrics in values for Model 2 in common lists
accuracy2.append(round(xg_accuracy2, 2))
precision2.append(round(xg_precision2, 2))
recall2.append(round(xg_recall2, 2))
f1_score2.append(round(xg_f1_score2, 4))

In [None]:
# Evaluation Metrics for the Model 1
print("Results for the 80% - 20% split")
print("Accuracy:", xg_accuracy1)
print("Precision:", xg_precision1)
print("Recall:", xg_recall1)
print("F1 Score:", xg_f1_score1)

print("----------------------------------")
print("----------------------------------")

# Evaluation Metrics for the Model 2
print("Results for the 70% - 30% split")
print("Accuracy:", xg_accuracy2)
print("Precision:", xg_precision2)
print("Recall:", xg_recall2)
print("F1 Score:", xg_f1_score2)
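
The precision, recall, and F1 arithmetic used for this model can be wrapped in a small helper, sketched here under some assumptions. `metrics_from_confusion` is a hypothetical name, and the counts are made up. Because `confusion_matrix` orders classes by sorted label, index 0 corresponds to the smaller label, which in this notebook is the benign class, so these formulas report precision and recall for that class.

```python
def metrics_from_confusion(cm):
    """Return accuracy, precision, recall (percent) and F1 (0-1) for class index 0.

    cm[i][j] is the number of samples with true class i predicted as class j,
    the layout produced by sklearn.metrics.confusion_matrix.
    """
    correct_0 = cm[0][0]
    total = sum(sum(row) for row in cm)
    accuracy = (cm[0][0] + cm[1][1]) / total * 100
    precision = correct_0 / (cm[0][0] + cm[1][0]) * 100  # column 0: predicted as class 0
    recall = correct_0 / (cm[0][0] + cm[0][1]) * 100     # row 0: truly class 0
    f1 = (2 * precision * recall) / (precision + recall) / 100
    return {
        "accuracy": round(accuracy, 2),
        "precision": round(precision, 2),
        "recall": round(recall, 2),
        "f1": round(f1, 4),
    }

# Made-up counts for illustration only, not results from this notebook.
example_cm = [[70, 2], [3, 39]]
example_metrics = metrics_from_confusion(example_cm)
print(example_metrics)
```

Passing each model's confusion matrix through one function like this avoids copying the formulas per model, where a leftover variable name from a previous section can slip in unnoticed.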


### 2) Random Forest Classifier

In [None]:
# Model 1 with 80% - 20% split
rfc_model1 = RandomForestClassifier()
rfc_model1.fit(X_train1, y_train1)
y_pred_rf1 = rfc_model1.predict(X_test1)

# Model 2 with 70% - 30% split
rfc_model2 = RandomForestClassifier()
rfc_model2.fit(X_train2, y_train2)
y_pred_rf2 = rfc_model2.predict(X_test2)

# Calculating Evaluation Metrics for the Model 1
rf_accuracy1 = accuracy_score(y_test1, y_pred_rf1) * 100
rf_confusion1 = confusion_matrix(y_test1, y_pred_rf1)
rf_precision1 = rf_confusion1[0][0]/(rf_confusion1[0][0] + rf_confusion1[1][0]) * 100
rf_recall1 = rf_confusion1[0][0]/(rf_confusion1[0][0] + rf_confusion1[0][1]) * 100
rf_f1_score1 = ((2 * rf_precision1 * rf_recall1) / (rf_precision1 + rf_recall1)) / 100

# Calculating Evaluation Metrics for the Model 2
rf_accuracy2 = accuracy_score(y_test2, y_pred_rf2) * 100
rf_confusion2 = confusion_matrix(y_test2, y_pred_rf2)
rf_precision2 = rf_confusion2[0][0]/(rf_confusion2[0][0] + rf_confusion2[1][0]) * 100
rf_recall2 = rf_confusion2[0][0]/(rf_confusion2[0][0] + rf_confusion2[0][1]) * 100
rf_f1_score2 = ((2 * rf_precision2 * rf_recall2) / (rf_precision2 + rf_recall2)) / 100

# Storing all the metrics in values for Model 1 in common lists
accuracy1.append(round(rf_accuracy1, 2))
precision1.append(round(rf_precision1, 2))
recall1.append(round(rf_recall1, 2))
f1_score1.append(round(rf_f1_score1, 4))

# Storing all the metrics in values for Model 2 in common lists
accuracy2.append(round(rf_accuracy2, 2))
precision2.append(round(rf_precision2, 2))
recall2.append(round(rf_recall2, 2))
f1_score2.append(round(rf_f1_score2, 4))

In [None]:
# Evaluation Metrics for the Model 1
print("Results for the 80% - 20% split")
print("Accuracy:", rf_accuracy1)
print("Precision:", rf_precision1)
print("Recall:", rf_recall1)
print("F1 Score:", rf_f1_score1)

print("----------------------------------")
print("----------------------------------")

# Evaluation Metrics for the Model 2
print("Results for the 70% - 30% split")
print("Accuracy:", rf_accuracy2)
print("Precision:", rf_precision2)
print("Recall:", rf_recall2)
print("F1 Score:", rf_f1_score2)

### 3) Logistic Regression

In [None]:
# Model 1 with 80% - 20% split
lr_model1 = LogisticRegression(max_iter=500, random_state=42)
lr_model1.fit(X_train1, y_train1)
y_pred_lr1 = lr_model1.predict(X_test1)

# Model 2 with 70% - 30% split
lr_model2 = LogisticRegression(solver = 'liblinear', max_iter=300, random_state=42)
lr_model2.fit(X_train2, y_train2)
y_pred_lr2 = lr_model2.predict(X_test2)

# Calculating Evaluation Metrics for the Model 1
lr_accuracy1 = accuracy_score(y_test1, y_pred_lr1) * 100
lr_confusion1 = confusion_matrix(y_test1, y_pred_lr1)
lr_precision1 = lr_confusion1[0][0]/(lr_confusion1[0][0] + lr_confusion1[1][0]) * 100
lr_recall1 = lr_confusion1[0][0]/(lr_confusion1[0][0] + lr_confusion1[0][1]) * 100
lr_f1_score1 = ((2 * lr_precision1 * lr_recall1) / (lr_precision1 + lr_recall1)) / 100

# Calculating Evaluation Metrics for the Model 2
lr_accuracy2 = accuracy_score(y_test2, y_pred_lr2) * 100
lr_confusion2 = confusion_matrix(y_test2, y_pred_lr2)
lr_precision2 = lr_confusion2[0][0]/(lr_confusion2[0][0] + lr_confusion2[1][0]) * 100
lr_recall2 = lr_confusion2[0][0]/(lr_confusion2[0][0] + lr_confusion2[0][1]) * 100
lr_f1_score2 = ((2 * lr_precision2 * lr_recall2) / (lr_precision2 + lr_recall2)) / 100

# Storing all the metrics in values for Model 1 in common lists
accuracy1.append(round(lr_accuracy1, 2))
precision1.append(round(lr_precision1, 2))
recall1.append(round(lr_recall1, 2))
f1_score1.append(round(lr_f1_score1, 4))

# Storing all the metrics in values for Model 2 in common lists
accuracy2.append(round(lr_accuracy2, 2))
precision2.append(round(lr_precision2, 2))
recall2.append(round(lr_recall2, 2))
f1_score2.append(round(lr_f1_score2, 4))

In [None]:
# Evaluation Metrics for the Model 1
print("Results for the 80% - 20% split")
print("Accuracy:", lr_accuracy1)
print("Precision:", lr_precision1)
print("Recall:", lr_recall1)
print("F1 Score:", lr_f1_score1)

print("----------------------------------")
print("----------------------------------")

# Evaluation Metrics for the Model 2
print("Results for the 70% - 30% split")
print("Accuracy:", lr_accuracy2)
print("Precision:", lr_precision2)
print("Recall:", lr_recall2)
print("F1 Score:", lr_f1_score2)

### 4) AdaBoost Classifier

In [None]:
# Model 1 with 80% - 20% split
ab_model1 = AdaBoostClassifier(n_estimators=500, learning_rate=0.1, random_state=42)
ab_model1.fit(X_train1, y_train1)
y_pred_ab1 = ab_model1.predict(X_test1)

# Model 2 with 70% - 30% split
ab_model2 = AdaBoostClassifier(n_estimators=500, learning_rate=0.1, random_state=42)
ab_model2.fit(X_train2, y_train2)
y_pred_ab2 = ab_model2.predict(X_test2)

# Calculating Evaluation Metrics for the Model 1
ab_accuracy1 = accuracy_score(y_test1, y_pred_ab1) * 100
ab_confusion1 = confusion_matrix(y_test1, y_pred_ab1)
ab_precision1 = ab_confusion1[0][0]/(ab_confusion1[0][0] + ab_confusion1[1][0]) * 100
ab_recall1 = ab_confusion1[0][0]/(ab_confusion1[0][0] + ab_confusion1[0][1]) * 100
ab_f1_score1 = ((2 * ab_precision1 * ab_recall1) / (ab_precision1 + ab_recall1)) / 100

# Calculating Evaluation Metrics for the Model 2
ab_accuracy2 = accuracy_score(y_test2, y_pred_ab2) * 100
ab_confusion2 = confusion_matrix(y_test2, y_pred_ab2)
ab_precision2 = ab_confusion2[0][0]/(ab_confusion2[0][0] + ab_confusion2[1][0]) * 100
ab_recall2 = ab_confusion2[0][0]/(ab_confusion2[0][0] + ab_confusion2[0][1]) * 100
ab_f1_score2 = ((2 * ab_precision2 * ab_recall2) / (ab_precision2 + ab_recall2)) / 100

# Storing all the metrics in values for Model 1 in common lists
accuracy1.append(round(ab_accuracy1, 2))
precision1.append(round(ab_precision1, 2))
recall1.append(round(ab_recall1, 2))
f1_score1.append(round(ab_f1_score1, 4))

# Storing all the metrics in values for Model 2 in common lists
accuracy2.append(round(ab_accuracy2, 2))
precision2.append(round(ab_precision2, 2))
recall2.append(round(ab_recall2, 2))
f1_score2.append(round(ab_f1_score2, 4))

In [None]:
# Evaluation Metrics for the Model 1
print("Results for the 80% - 20% split")
print("Accuracy:", ab_accuracy1)
print("Precision:", ab_precision1)
print("Recall:", ab_recall1)
print("F1 Score:", ab_f1_score1)

print("----------------------------------")
print("----------------------------------")

# Evaluation Metrics for the Model 2
print("Results for the 70% - 30% split")
print("Accuracy:", ab_accuracy2)
print("Precision:", ab_precision2)
print("Recall:", ab_recall2)
print("F1 Score:", ab_f1_score2)

### 5) Gradient Boosting Classifier

In [None]:
# Model 1 with 80% - 20% split
gb_model1 = GradientBoostingClassifier()
gb_model1.fit(X_train1, y_train1)
y_pred_gb1 = gb_model1.predict(X_test1)

# Model 2 with 70% - 30% split
gb_model2 = GradientBoostingClassifier()
gb_model2.fit(X_train2, y_train2)
y_pred_gb2 = gb_model2.predict(X_test2)

# Calculating Evaluation Metrics for the Model 1
gb_accuracy1 = accuracy_score(y_test1, y_pred_gb1) * 100
gb_confusion1 = confusion_matrix(y_test1, y_pred_gb1)
gb_precision1 = gb_confusion1[0][0]/(gb_confusion1[0][0] + gb_confusion1[1][0]) * 100
gb_recall1 = gb_confusion1[0][0]/(gb_confusion1[0][0] + gb_confusion1[0][1]) * 100
gb_f1_score1 = ((2 * gb_precision1 * gb_recall1) / (gb_precision1 + gb_recall1)) / 100

# Calculating Evaluation Metrics for the Model 2
gb_accuracy2 = accuracy_score(y_test2, y_pred_gb2) * 100
gb_confusion2 = confusion_matrix(y_test2, y_pred_gb2)
gb_precision2 = gb_confusion2[0][0]/(gb_confusion2[0][0] + gb_confusion2[1][0]) * 100
gb_recall2 = gb_confusion2[0][0]/(gb_confusion2[0][0] + gb_confusion2[0][1]) * 100
gb_f1_score2 = ((2 * gb_precision2 * gb_recall2) / (gb_precision2 + gb_recall2)) / 100

# Storing all the metrics in values for Model 1 in common lists
accuracy1.append(round(gb_accuracy1, 2))
precision1.append(round(gb_precision1, 2))
recall1.append(round(gb_recall1, 2))
f1_score1.append(round(gb_f1_score1, 4))

# Storing all the metrics in values for Model 2 in common lists
accuracy2.append(round(gb_accuracy2, 2))
precision2.append(round(gb_precision2, 2))
recall2.append(round(gb_recall2, 2))
f1_score2.append(round(gb_f1_score2, 4))

In [None]:
# Evaluation Metrics for the Model 1
print("Results for the 80% - 20% split")
print("Accuracy:", gb_accuracy1)
print("Precision:", gb_precision1)
print("Recall:", gb_recall1)
print("F1 Score:", gb_f1_score1)

print("----------------------------------")
print("----------------------------------")

# Evaluation Metrics for the Model 2
print("Results for the 70% - 30% split")
print("Accuracy:", gb_accuracy2)
print("Precision:", gb_precision2)
print("Recall:", gb_recall2)
print("F1 Score:", gb_f1_score2)

### 6) Bagging Classifier

In [None]:
# Model 1 with 80% - 20% split
bc_model1 = BaggingClassifier(n_estimators=200, random_state=42)
bc_model1.fit(X_train1, y_train1)
y_pred_bc1 = bc_model1.predict(X_test1)

# Model 2 with 70% - 30% split
bc_model2 = BaggingClassifier(n_estimators=200, random_state=42)
bc_model2.fit(X_train2, y_train2)
y_pred_bc2 = bc_model2.predict(X_test2)


# Calculating Evaluation Metrics for the Model 1
bc_accuracy1 = accuracy_score(y_test1, y_pred_bc1) * 100
bc_confusion1 = confusion_matrix(y_test1, y_pred_bc1)
bc_precision1 = bc_confusion1[0][0]/(bc_confusion1[0][0] + bc_confusion1[1][0]) * 100
bc_recall1 = bc_confusion1[0][0]/(bc_confusion1[0][0] + bc_confusion1[0][1]) * 100
bc_f1_score1 = ((2 * bc_precision1 * bc_recall1) / (bc_precision1 + bc_recall1)) / 100

# Calculating Evaluation Metrics for the Model 2
bc_accuracy2 = accuracy_score(y_test2, y_pred_bc2) * 100
bc_confusion2 = confusion_matrix(y_test2, y_pred_bc2)
bc_precision2 = bc_confusion2[0][0]/(bc_confusion2[0][0] + bc_confusion2[1][0]) * 100
bc_recall2 = bc_confusion2[0][0]/(bc_confusion2[0][0] + bc_confusion2[0][1]) * 100
bc_f1_score2 = ((2 * bc_precision2 * bc_recall2) / (bc_precision2 + bc_recall2)) / 100

# Storing all the metrics in values for Model 1 in common lists
accuracy1.append(round(bc_accuracy1, 2))
precision1.append(round(bc_precision1, 2))
recall1.append(round(bc_recall1, 2))
f1_score1.append(round(bc_f1_score1, 4))

# Storing all the metrics in values for Model 2 in common lists
accuracy2.append(round(bc_accuracy2, 2))
precision2.append(round(bc_precision2, 2))
recall2.append(round(bc_recall2, 2))
f1_score2.append(round(bc_f1_score2, 4))

In [None]:
# Evaluation Metrics for the Model 1
print("Results for the 80% - 20% split")
print("Accuracy:", bc_accuracy1)
print("Precision:", bc_precision1)
print("Recall:", bc_recall1)
print("F1 Score:", bc_f1_score1)

print("----------------------------------")
print("----------------------------------")

# Evaluation Metrics for the Model 2
print("Results for the 70% - 30% split")
print("Accuracy:", bc_accuracy2)
print("Precision:", bc_precision2)
print("Recall:", bc_recall2)
print("F1 Score:", bc_f1_score2)

### 7) CatBoost Classifier

In [None]:
# Model 1 with 80% - 20% split
cb_model1 = CatBoostClassifier(iterations=200, logging_level='Silent')
cb_model1.fit(X_train1, y_train1)
y_pred_cb1 = cb_model1.predict(X_test1)

# Model 2 with 70% - 30% split
cb_model2 = CatBoostClassifier(iterations=200, logging_level='Silent')
cb_model2.fit(X_train2, y_train2)
y_pred_cb2 = cb_model2.predict(X_test2)

# Calculating Evaluation Metrics for the Model 1
cb_accuracy1 = accuracy_score(y_test1, y_pred_cb1) * 100
cb_confusion1 = confusion_matrix(y_test1, y_pred_cb1)
cb_precision1 = cb_confusion1[0][0]/(cb_confusion1[0][0] + cb_confusion1[1][0]) * 100
cb_recall1 = cb_confusion1[0][0]/(cb_confusion1[0][0] + cb_confusion1[0][1]) * 100
cb_f1_score1 = ((2 * cb_precision1 * cb_recall1) / (cb_precision1 + cb_recall1)) / 100

# Calculating Evaluation Metrics for the Model 2
cb_accuracy2 = accuracy_score(y_test2, y_pred_cb2) * 100
cb_confusion2 = confusion_matrix(y_test2, y_pred_cb2)
cb_precision2 = cb_confusion2[0][0]/(cb_confusion2[0][0] + cb_confusion2[1][0]) * 100
cb_recall2 = cb_confusion2[0][0]/(cb_confusion2[0][0] + cb_confusion2[0][1]) * 100
cb_f1_score2 = ((2 * cb_precision2 * cb_recall2) / (cb_precision2 + cb_recall2)) / 100

# Storing all the metrics in values for Model 1 in common lists
accuracy1.append(round(cb_accuracy1, 2))
precision1.append(round(cb_precision1, 2))
recall1.append(round(cb_recall1, 2))
f1_score1.append(round(cb_f1_score1, 4))

# Storing all the metrics in values for Model 2 in common lists
accuracy2.append(round(cb_accuracy2, 2))
precision2.append(round(cb_precision2, 2))
recall2.append(round(cb_recall2, 2))
f1_score2.append(round(cb_f1_score2, 4))

# To see the hyper-parameters of a fitted model, use e.g. "cb_model1.get_all_params()"

In [None]:
# Evaluation Metrics for the Model 1
print("Results for the 80% - 20% split")
print("Accuracy Score:", cb_accuracy1)
print("Precision:", cb_precision1)
print("Recall:", cb_recall1)
print("F1 Score:", cb_f1_score1)

print("----------------------------------")
print("----------------------------------")

# Evaluation Metrics for the Model 2
print("Results for the 70% - 30% split")
print("Accuracy Score:", cb_accuracy2)
print("Precision:", cb_precision2)
print("Recall:", cb_recall2)
print("F1 Score:", cb_f1_score2)

### Tabular Performance Comparison of all the algorithms w.r.t. the train-test data split

In [None]:
metric1 = pd.DataFrame({
    'Algorithms':algo,
    'Accuracy':accuracy1,
    'Precision':precision1,
    'Recall':recall1,
    'F1 Score':f1_score1
})

metric2 = pd.DataFrame({
    'Algorithms':algo,
    'Accuracy':accuracy2,
    'Precision':precision2,
    'Recall':recall2,
    'F1 Score':f1_score2
})

#### Metrics in the 80% - 20% setting

In [None]:
metric1

#### Metrics in the 70% - 30% setting

In [None]:
metric2

### Conclusion
<p>In my opinion, the 80-20 setting is the better choice for small or medium-sized datasets. However, I observed that in the 70-30 setting Logistic Regression outperformed all the other algorithms, across both settings, by a very narrow margin.</p>
Overall, it would be a good idea to choose the XGBoost Classifier in the 80-20 setting.