# Classification Model Selection

## Data preprocessing

✔️ Import the necessary libraries.

✔️ Load dataset (Breast_Cancer.csv).

❌ Our dataset has no missing data, so no imputation is needed.

❌ Our dataset has no string (categorical) data, so no encoding is needed.

✔️ We have 684 samples, so we can split them into 75% for the training set and 25% for the test set.

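✔️ Apply feature scaling to the dataset. It helps distance-based and gradient-based models (K Nearest Neighbor, SVC, Logistic Regression); tree-based models are largely unaffected by it.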

In [1]:
# Import libraries....
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# setting two digits after decimal point...
np.set_printoptions(precision=2)


In [2]:
# Load dataset...
dataset = pd.read_csv(r"../dataset/Breast_Cancer.csv")
X = dataset.iloc[:, :-1].values  # [row, column]
y = dataset.iloc[:, -1].values
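
The two ❌ items in the checklist can be checked rather than assumed. The following is a minimal sketch, not part of the original notebook: the helper name `preprocessing_report` and the column names in the stand-in DataFrame are hypothetical, and the stand-in data is not taken from Breast_Cancer.csv.

```python
import pandas as pd


def preprocessing_report(df: pd.DataFrame) -> dict:
    """Count missing cells and list non-numeric columns in a DataFrame."""
    return {
        "missing_cells": int(df.isna().sum().sum()),
        "non_numeric_columns": list(df.select_dtypes(exclude="number").columns),
    }


# Hypothetical stand-in data; pass `dataset` from the cell above instead.
demo = pd.DataFrame(
    {"feature_a": [1, 5, 3], "feature_b": [2, 4, 1], "label": [0, 1, 0]}
)
report = preprocessing_report(demo)
print(report)
```

If `missing_cells` is zero and `non_numeric_columns` is empty, neither imputation nor encoding is needed.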


In [3]:
# Split testing and training dataset...
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
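
To see what the 75/25 split means in row counts, here is a small sketch on placeholder arrays. It assumes the 684 rows stated above; the number of feature columns and the zero values are arbitrary stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 684 rows; the contents and the column count are arbitrary.
X_demo = np.zeros((684, 3))
y_demo = np.zeros(684)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

n_train, n_test = X_tr.shape[0], X_te.shape[0]
print(n_train, n_test)
```

Under that assumption the test set holds 171 rows, which is also the total of each confusion matrix printed in the sections that follow (for example 103 + 4 + 5 + 59).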


In [4]:
# Performing feature scaling for the independent variables...
from sklearn.preprocessing import StandardScaler
x_sc = StandardScaler()
X_train = x_sc.fit_transform(X_train)
X_test = x_sc.transform(X_test)
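
The cell above fits the scaler on the training set only and reuses those statistics for the test set, so no information from the test rows leaks into preprocessing. A minimal sketch on synthetic data (the random values and array names are illustrative assumptions) shows what each call does:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_demo)  # learns mean_ and scale_ from train only
test_scaled = scaler.transform(test_demo)        # reuses the training statistics

# Training columns end up with mean ~0 and standard deviation ~1.
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
# Test columns are only approximately standardized, which is expected.
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))
```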


## Train and evaluate the performance of Logistic Regression Classification


In [5]:
# Training....
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression
logistic_regression_classifier = LogisticRegression(random_state=0)
logistic_regression_classifier.fit(X_train, y_train)

# Testing....
y_pred = logistic_regression_classifier.predict(X_test)

# Confusion Matrix....
print(confusion_matrix(y_true=y_test, y_pred=y_pred))
# Score....
acc_logistic_regression_classification = accuracy_score(
    y_true=y_test, y_pred=y_pred)
print("Accuracy score for Logistic Regression Classification :",
      acc_logistic_regression_classification)


[[103   4]
 [  5  59]]
Accuracy score for Logistic Regression Classification : 0.9473684210526315
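
The accuracy score can be recomputed by hand from the confusion matrix, which also yields per-class metrics that accuracy alone hides. A short worked sketch using the numbers printed above; in scikit-learn's layout rows are true classes and columns are predicted classes, and which label the second row stands for depends on the dataset's label encoding, which is not shown here.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = [[103, 4],
      [5, 59]]

total = sum(sum(row) for row in cm)                   # all test samples
accuracy = (cm[0][0] + cm[1][1]) / total              # correct predictions / all predictions
recall_class_1 = cm[1][1] / sum(cm[1])                # share of true class-1 rows that were found
precision_class_1 = cm[1][1] / (cm[0][1] + cm[1][1])  # share of class-1 predictions that were right

print(f"total={total} accuracy={accuracy:.4f} "
      f"recall={recall_class_1:.4f} precision={precision_class_1:.4f}")
```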


## Train and evaluate the performance of K Nearest Neighbor Classification

In [6]:
# Training....
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.neighbors import KNeighborsClassifier
k_nn_classifier = KNeighborsClassifier(n_neighbors=5, p=2, metric="minkowski")
k_nn_classifier.fit(X_train, y_train)

# Testing....
y_pred = k_nn_classifier.predict(X_test)


# Confusion Matrix....
print(confusion_matrix(y_true=y_test, y_pred=y_pred))
# Score....
acc_k_nearest_neighbor_classification = accuracy_score(
    y_true=y_test, y_pred=y_pred)
print("Accuracy score for K Nearest Neighbor Classification :",
      acc_k_nearest_neighbor_classification)


[[103   4]
 [  5  59]]
Accuracy score for K Nearest Neighbor Classification : 0.9473684210526315
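
The arguments `metric="minkowski"` and `p=2` used above select the Euclidean distance; `p=1` would give the Manhattan distance instead. A minimal sketch of the formula, with a hypothetical helper `minkowski_distance` and two made-up points:

```python
import math


def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


point_a = [0.0, 0.0]
point_b = [3.0, 4.0]

euclidean = minkowski_distance(point_a, point_b, p=2)  # matches math.dist
manhattan = minkowski_distance(point_a, point_b, p=1)
print(euclidean, math.dist(point_a, point_b), manhattan)
```

Because the distance sums raw coordinate differences, features on larger scales would dominate it, which is one reason the scaling step matters for this model.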


## Train and evaluate the performance of (SVC) Support Vector Classification

In [7]:
# Training....
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.svm import SVC
svc_classifier = SVC(kernel="linear", random_state=0)
svc_classifier.fit(X_train, y_train)

# Testing....
y_pred = svc_classifier.predict(X_test)

# Confusion Matrix....
print(confusion_matrix(y_true=y_test, y_pred=y_pred))
# Score....
acc_svc_support_vector_classification = accuracy_score(
    y_true=y_test, y_pred=y_pred)
print("Accuracy score for (SVC) Support Vector Classification :",
      acc_svc_support_vector_classification)


[[102   5]
 [  5  59]]
Accuracy score for (SVC) Support Vector Classification : 0.9415204678362573


## Train and evaluate the performance of Kernel (SVC) Support Vector Classification

In [8]:
# Training....
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.svm import SVC
kernel_svc_classifier = SVC(kernel="rbf", random_state=0)
kernel_svc_classifier.fit(X_train, y_train)

# Testing....
y_pred = kernel_svc_classifier.predict(X_test)

# Confusion Matrix....
print(confusion_matrix(y_true=y_test, y_pred=y_pred))
# Score....
acc_kernel_support_vector_classification = accuracy_score(
    y_true=y_test, y_pred=y_pred)
print("Accuracy score for Kernel (SVC) Support Vector Classification :",
      acc_kernel_support_vector_classification)


[[102   5]
 [  3  61]]
Accuracy score for Kernel (SVC) Support Vector Classification : 0.9532163742690059
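
The only difference from the previous section is `kernel="rbf"`, which replaces the dot product with the similarity K(x, x') = exp(-gamma * ||x - x'||^2) and so allows a non-linear decision boundary. A small sketch that evaluates the formula by hand and compares it with scikit-learn's `rbf_kernel`; the two points and the `gamma` value are illustrative assumptions, not values used by the model above.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x1 = np.array([[1.0, 2.0]])
x2 = np.array([[2.0, 4.0]])
gamma = 0.5  # illustrative value only

# Manual evaluation: squared Euclidean distance, scaled by gamma, then exponentiated.
manual = np.exp(-gamma * np.sum((x1 - x2) ** 2))
library = rbf_kernel(x1, x2, gamma=gamma)[0, 0]
print(manual, library)
```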


## Train and evaluate the performance of Naive Bayes Classification

In [9]:
# Training....
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.naive_bayes import GaussianNB
naive_bayes_classifier = GaussianNB()
naive_bayes_classifier.fit(X_train, y_train)

# Testing....
y_pred = naive_bayes_classifier.predict(X_test)

# Confusion Matrix....
print(confusion_matrix(y_true=y_test, y_pred=y_pred))
# Score....
acc_naive_bayes_classification = accuracy_score(y_true=y_test, y_pred=y_pred)
print("Accuracy score for Naive Bayes Classification :",
      acc_naive_bayes_classification)


[[99  8]
 [ 2 62]]
Accuracy score for Naive Bayes Classification : 0.9415204678362573


## Train and evaluate the performance of Decision Tree Classification

In [10]:
# Training....
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.tree import DecisionTreeClassifier
decision_tree_classifier = DecisionTreeClassifier(
    criterion='entropy', random_state=0)
decision_tree_classifier.fit(X_train, y_train)

# Testing....
y_pred = decision_tree_classifier.predict(X_test)

# Confusion Matrix....
print(confusion_matrix(y_true=y_test, y_pred=y_pred))
# Score....
acc_decision_tree_classification = accuracy_score(y_true=y_test, y_pred=y_pred)
print("Accuracy score for Decision Tree Classification :",
      acc_decision_tree_classification)


[[103   4]
 [  3  61]]
Accuracy score for Decision Tree Classification : 0.9590643274853801
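
With `criterion='entropy'`, the tree chooses splits that reduce the Shannon entropy of the class labels in each node. A minimal sketch of that impurity measure, using a hypothetical helper `entropy`; the last call uses the true class counts of the 171 test rows (107 and 64, the row sums of the confusion matrix above) purely as an example input.

```python
import math


def entropy(class_counts):
    """Shannon entropy (in bits) of a node, given the sample count per class."""
    total = sum(class_counts)
    # Classes with zero samples contribute nothing (p * log2(p) tends to 0).
    return sum(-(c / total) * math.log2(c / total) for c in class_counts if c > 0)


print(entropy([50, 50]))   # evenly mixed node: maximum impurity for two classes
print(entropy([100, 0]))   # pure node: no impurity
print(entropy([107, 64]))  # class mix of the test rows above
```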


## Train and evaluate the performance of Random Forest Classification

In [11]:
# Training....
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.ensemble import RandomForestClassifier
random_forest_classifier = RandomForestClassifier(
    n_estimators=10, criterion='entropy', random_state=0)
random_forest_classifier.fit(X_train, y_train)

# Testing....
y_pred = random_forest_classifier.predict(X_test)

# Confusion Matrix....
print(confusion_matrix(y_true=y_test, y_pred=y_pred))
# Score....
acc_random_forest_classification = accuracy_score(y_true=y_test, y_pred=y_pred)
print("Accuracy score for Random Forest Classification :",
      acc_random_forest_classification)


[[102   5]
 [  6  58]]
Accuracy score for Random Forest Classification : 0.935672514619883


## Which model is best for the given dataset?

In [12]:
accuracy_score_list = {
    "Logistic Regression": acc_logistic_regression_classification,
    "K Nearest Neighbor Classification": acc_k_nearest_neighbor_classification,
    "(SVC) Support Vector Classification": acc_svc_support_vector_classification,
    "Kernel (SVC) Support Vector Classification": acc_kernel_support_vector_classification,
    "Naive Bayes Classification": acc_naive_bayes_classification,
    "Decision Tree Classification": acc_decision_tree_classification,
    "Random Forest Classification": acc_random_forest_classification,
}
# Print final result of all models....
for model, accuracy in accuracy_score_list.items():
    print(f"{model} with accuracy score : {accuracy}")

# Find the best of them....
best_of_them = max(accuracy_score_list.values())

# Print the best of them....
for model, accuracy in accuracy_score_list.items():
    if accuracy == best_of_them:
        print_me = f"{model} is the best model for given dataset 🥳 with Accuracy score {accuracy}"
        print("🎉" * (len(print_me) // 2))
        print(print_me)
        print("🎉" * (len(print_me) // 2))
        break


Logistic Regression with accuracy score : 0.9473684210526315
K Nearest Neighbor Classification with accuracy score : 0.9473684210526315
(SVC) Support Vector Classification with accuracy score : 0.9415204678362573
Kernel (SVC) Support Vector Classification with accuracy score : 0.9532163742690059
Naive Bayes Classification with accuracy score : 0.9415204678362573
Decision Tree Classification with accuracy score : 0.9590643274853801
Random Forest Classification with accuracy score : 0.935672514619883
🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉
Decision Tree Classification is the best model for given dataset 🥳 with Accuracy score 0.9590643274853801
🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉
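
The best entry can also be picked directly with `max` and a key function, and sorting the dictionary gives a full ranking. A short sketch using the accuracy values printed above; the shortened model labels are chosen here for readability and are not the notebook's keys.

```python
# Accuracy values copied from the printed output above; labels are shortened.
scores = {
    "Logistic Regression": 0.9473684210526315,
    "K Nearest Neighbor": 0.9473684210526315,
    "Linear SVC": 0.9415204678362573,
    "Kernel SVC": 0.9532163742690059,
    "Naive Bayes": 0.9415204678362573,
    "Decision Tree": 0.9590643274853801,
    "Random Forest": 0.935672514619883,
}

# max() with a key returns the first entry holding the highest score.
best_model = max(scores, key=scores.get)

# Full ranking, highest accuracy first.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
print("Best:", best_model)
```

On ties, `max` keeps the entry that appears first in the dictionary, the same behavior as the loop with `break` above.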


**Note:** The result above applies only to the input dataset (Breast_Cancer.csv) and to this particular train/test split (`random_state=0`). With a different dataset or a different split, the ranking will likely change.
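
The gap between the best and worst model above is only 4 of the 171 test rows (164 versus 160 correct predictions), so a single split is a fairly noisy basis for ranking. One common way to get a steadier estimate is k-fold cross-validation. The following is a hedged sketch, not the notebook's method: it runs on synthetic stand-in data from `make_classification`, compares only two of the models, and the choice of 5 folds is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); replace with X, y loaded from Breast_Cancer.csv.
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(random_state=0),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
}

cv_results = {}
for name, model in candidates.items():
    # Scaling sits inside the pipeline, so each fold fits the scaler on its own training part.
    pipeline = make_pipeline(StandardScaler(), model)
    fold_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

If the mean accuracies of two models differ by less than their spread across folds, the data does not clearly favor either one.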