# Classification Model Selection

## Data preprocessing

✔️ Import the necessary libraries.

✔️ Load dataset (Breast_Cancer.csv).

❌ Our dataset doesn't have any missing data.

❌ Our dataset doesn't have any string data.

✔️ The dataset has 684 rows, so we can split it into 75% for the training set and 25% for the testing set.

✔️ Applying feature scaling to the dataset helps scale-sensitive models such as K Nearest Neighbor, SVC, and Logistic Regression; tree-based models are largely unaffected by it.
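The split and scaling steps from the checklist can be sketched as follows. This is a minimal illustration on randomly generated stand-in data, not the notebook's own dataset; the names `X_demo`, `y_demo`, and `scaler` are assumptions introduced here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 684 rows, 9 numeric features, binary labels.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(684, 9)).astype(float)
y_demo = rng.integers(0, 2, size=684)

# Hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Fit the scaler on the training rows only, then reuse it on the test rows,
# so that no information from the test set leaks into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training rows alone keeps the test set untouched until evaluation.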

In [1]:
# Import libraries....
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Show two digits after the decimal point when printing NumPy arrays...
np.set_printoptions(precision=2)


In [2]:
# Load the dataset: the first nine columns are features, the last column is the label.
dataset = pd.read_csv("test_v3.csv")
X = dataset.iloc[:, 0:9].values
y = dataset.iloc[:, -1].values

In [3]:
from sklearn.preprocessing import LabelEncoder

# Replace the values in each feature column with integer codes, one column at a time.
le = LabelEncoder()
for i in range(X.shape[1]):
    X[:, i] = le.fit_transform(X[:, i])
print(X)

[[0 0 0 ... 0 0 2]
 [0 0 0 ... 0 0 2]
 [0 0 0 ... 0 0 2]
 ...
 [2 2 1 ... 0 1 0]
 [2 2 1 ... 1 0 0]
 [2 2 1 ... 1 0 0]]
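scikit-learn documents `LabelEncoder` as a tool for target labels; for feature columns, `OrdinalEncoder` performs the same per-column integer coding in a single call. The sketch below shows this on a small made-up array (`X_toy` is a hypothetical example, not the notebook's data).

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical features (assumption): three rows, two string columns.
X_toy = np.array([["low", "red"],
                  ["high", "blue"],
                  ["low", "blue"]], dtype=object)

# Each column gets its own sorted category list and its own integer codes.
encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(X_toy)

print(encoder.categories_)
print(X_encoded)
```

Because the fitted encoder keeps one category list per column, `encoder.inverse_transform` can map the codes back to the original values, which a single reused `LabelEncoder` cannot do.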


## Train and evaluate the performance of Logistic Regression Classification


In [4]:
# Training....
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression
logistic_regression_classifier = LogisticRegression()
logistic_regression_classifier.fit(X, y)

# Testing (predictions are made on the same data the model was trained on)....
y_pred = logistic_regression_classifier.predict(X)

# Confusion Matrix....
print(confusion_matrix(y_true=y, y_pred=y_pred))
# Score....
acc_logistic_regression_classification = accuracy_score(
    y_true=y, y_pred=y_pred)
print("Accuracy score for Logistic Regression Classification :",
      acc_logistic_regression_classification)


[[309  50  92  47  60  39  99  43 108]
 [ 67 226  63  44  21  43 167  16 170]
 [ 87  50 310  41  59  48 111  45 100]
 [ 66  49 167 211  18  15  71  49 163]
 [134  39 132  37 188  37 132  39 134]
 [162  49  68  12  18 214 170  49  67]
 [ 99  45 105  48  59  41 316  50  88]
 [168  13 165  43  21  44  65 229  69]
 [103  43  97  39  60  47  94  50 314]]
Accuracy score for Logistic Regression Classification : 0.3081117021276596
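The accuracy score is the share of samples that fall on the diagonal of the confusion matrix. As a quick check, the sketch below recomputes it from the Logistic Regression matrix printed above, copied by hand into an array (`cm` is a name introduced here for illustration).

```python
import numpy as np

# Confusion matrix copied from the Logistic Regression output
# (rows: true class, columns: predicted class).
cm = np.array([
    [309,  50,  92,  47,  60,  39,  99,  43, 108],
    [ 67, 226,  63,  44,  21,  43, 167,  16, 170],
    [ 87,  50, 310,  41,  59,  48, 111,  45, 100],
    [ 66,  49, 167, 211,  18,  15,  71,  49, 163],
    [134,  39, 132,  37, 188,  37, 132,  39, 134],
    [162,  49,  68,  12,  18, 214, 170,  49,  67],
    [ 99,  45, 105,  48,  59,  41, 316,  50,  88],
    [168,  13, 165,  43,  21,  44,  65, 229,  69],
    [103,  43,  97,  39,  60,  47,  94,  50, 314],
])

correct = np.trace(cm)      # correctly classified samples (diagonal)
total = cm.sum()            # all samples
accuracy = correct / total

print(correct, total, accuracy)
```

The diagonal sums to 2317 out of 7520 samples, which reproduces the reported score of about 0.308.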


## Train and evaluate the performance of K Nearest Neighbor Classification

In [5]:
# Training....
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.neighbors import KNeighborsClassifier
k_nn_classifier = KNeighborsClassifier(n_neighbors=5, p=2, metric="minkowski")
k_nn_classifier.fit(X, y)

# Testing (predictions are made on the same data the model was trained on)....
y_pred = k_nn_classifier.predict(X)


# Confusion Matrix....
print(confusion_matrix(y_true=y, y_pred=y_pred))
# Score....
acc_k_nearest_neighbor_classification = accuracy_score(
    y_true=y, y_pred=y_pred)
print("Accuracy score for K Nearest Neighbor Classification :",
      acc_k_nearest_neighbor_classification)


[[403  76  44  57  71  24  38  86  48]
 [162 443  21  41  48  26  24  33  19]
 [191 178 256  70  46  26  35  36  13]
 [185 157  52 242  49  25  26  53  20]
 [195 196  76  78 205  20  34  53  15]
 [217 173  67  49  66 118  36  55  28]
 [222 195  60  69  59  28 140  60  18]
 [152 172  83  70  57  52  46 166  19]
 [212 195  76  77  61  38  36  69  83]]
Accuracy score for K Nearest Neighbor Classification : 0.2734042553191489
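K Nearest Neighbor relies on distances, so it is sensitive both to feature scale and to the choice of `n_neighbors`. One possible way to examine both is a scaling pipeline scored with cross-validation, sketched here on synthetic data; `X_demo`, `y_demo`, and the candidate values 3, 5, 7 are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 rows, 9 features, 2 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0)

cv_accuracy = {}
for k in (3, 5, 7):
    # The scaler is refit inside each fold, so test folds never influence it.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_accuracy[k] = scores.mean()

for k, acc in cv_accuracy.items():
    print(f"k={k}: mean CV accuracy {acc:.3f}")
```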


## Train and evaluate the performance of (SVC) Support Vector Classification

In [6]:
# Training....
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.svm import SVC
svc_classifier = SVC(kernel="linear")
svc_classifier.fit(X, y)

# Testing (predictions are made on the same data the model was trained on)....
y_pred = svc_classifier.predict(X)

# Confusion Matrix....
print(confusion_matrix(y_true=y, y_pred=y_pred))
# Score....
acc_svc_support_vector_classification = accuracy_score(
    y_true=y, y_pred=y_pred)
print("Accuracy score for (SVC) Support Vector Classification :",
      acc_svc_support_vector_classification)


[[439  39 136  34   0   1  84   0 114]
 [ 93 127  91  23  63  23 194  20 183]
 [ 88  50 495   0   0  33 110   0  75]
 [ 80  58 220 109  63   6  76  18 179]
 [181  16 183   8 170   2 164   2 146]
 [204  48 102  15  69  96 190  18  67]
 [135  10 123  39  11   4 418  25  86]
 [197  15 228  44  75  40  71  76  71]
 [100   5 144   6  17  37 117  25 396]]
Accuracy score for (SVC) Support Vector Classification : 0.3093085106382979


## Train and evaluate the performance of Kernel (SVC) Support Vector Classification

In [7]:
# Training....
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.svm import SVC
kernel_svc_classifier = SVC(kernel="rbf")
kernel_svc_classifier.fit(X, y)

# Testing (predictions are made on the same data the model was trained on)....
y_pred = kernel_svc_classifier.predict(X)

# Confusion Matrix....
print(confusion_matrix(y_true=y, y_pred=y_pred))
# Score....
acc_kernel_support_vector_classification = accuracy_score(
    y_true=y, y_pred=y_pred)
print("Accuracy score for Kernel (SVC) Support Vector Classification :",
      acc_kernel_support_vector_classification)


[[480  12 111  19   2   9 107   6 101]
 [117  64 118  31  55  30 194  15 193]
 [102  13 491   9   2  18 100   6 110]
 [115  22 198  81  53  12 118  22 188]
 [185   1 184   1 145   0 178   0 178]
 [192  23 120  13  54  79 192  22 114]
 [109   6 105  19   3   9 488  12 100]
 [192  16 197  31  58  31 114  62 116]
 [103   6 109   9   4  19 106  12 479]]
Accuracy score for Kernel (SVC) Support Vector Classification : 0.31502659574468084


## Train and evaluate the performance of Naive Bayes Classification

In [8]:
# Training....
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.naive_bayes import GaussianNB
naive_bayes_classifier = GaussianNB()
naive_bayes_classifier.fit(X, y)

# Testing (predictions are made on the same data the model was trained on)....
y_pred = naive_bayes_classifier.predict(X)

# Confusion Matrix....
print(confusion_matrix(y_true=y, y_pred=y_pred))
# Score....
acc_naive_bayes_classification = accuracy_score(y_true=y, y_pred=y_pred)
print("Accuracy score for Naive Bayes Classification :",
      acc_naive_bayes_classification)


[[ 45  82  13  99 320 132  17 122  17]
 [  0 307   0  81 298  75   0  56   0]
 [ 17  78  44 139 319  95  15 128  16]
 [  0  77   0 318 295  43   0  76   0]
 [  0   0   0   0 872   0   0   0   0]
 [  0  76   0  48 295 313   0  77   0]
 [ 16 130  16  97 319 137  43  76  17]
 [  0  58   0  76 298  80   0 305   0]
 [ 18 122  17 134 320  97  13  82  44]]
Accuracy score for Naive Bayes Classification : 0.3046542553191489
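A single accuracy number hides how unevenly the classes are treated. As an illustration, per-class recall (diagonal divided by row sums) and precision (diagonal divided by column sums) can be read off the Naive Bayes matrix printed above; the array below is copied by hand, and `cm_nb` is a name introduced here.

```python
import numpy as np

# Confusion matrix copied from the Naive Bayes output
# (rows: true class, columns: predicted class).
cm_nb = np.array([
    [ 45,  82,  13,  99, 320, 132,  17, 122,  17],
    [  0, 307,   0,  81, 298,  75,   0,  56,   0],
    [ 17,  78,  44, 139, 319,  95,  15, 128,  16],
    [  0,  77,   0, 318, 295,  43,   0,  76,   0],
    [  0,   0,   0,   0, 872,   0,   0,   0,   0],
    [  0,  76,   0,  48, 295, 313,   0,  77,   0],
    [ 16, 130,  16,  97, 319, 137,  43,  76,  17],
    [  0,  58,   0,  76, 298,  80,   0, 305,   0],
    [ 18, 122,  17, 134, 320,  97,  13,  82,  44],
])

diagonal = np.diag(cm_nb)
recall = diagonal / cm_nb.sum(axis=1)      # share of each true class recovered
precision = diagonal / cm_nb.sum(axis=0)   # share of each prediction that is right

for idx, (r, p) in enumerate(zip(recall, precision)):
    print(f"class index {idx}: recall {r:.3f}, precision {p:.3f}")
```

The class at index 4 is always recovered (recall 1.0), but only about a quarter of the predictions for it are correct, because samples from every other class are also assigned to it.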


## Train and evaluate the performance of Decision Tree Classification

In [9]:
# Training....
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.tree import DecisionTreeClassifier
decision_tree_classifier = DecisionTreeClassifier(criterion='entropy')
decision_tree_classifier.fit(X, y)

# Testing (predictions are made on the same data the model was trained on)....
y_pred = decision_tree_classifier.predict(X)

# Confusion Matrix....
print(confusion_matrix(y_true=y, y_pred=y_pred))
# Score....
acc_decision_tree_classification = accuracy_score(y_true=y, y_pred=y_pred)
print("Accuracy score for Decision Tree Classification :",
      acc_decision_tree_classification)


[[847   0   0   0   0   0   0   0   0]
 [263 554   0   0   0   0   0   0   0]
 [277 184 390   0   0   0   0   0   0]
 [264 194 131 220   0   0   0   0   0]
 [308 203 135  79 147   0   0   0   0]
 [278 185 124  73  51  98   0   0   0]
 [280 197 129  79  56  38  72   0   0]
 [279 183 130  78  54  36  25  32   0]
 [279 195 130  80  56  35  26  25  21]]
Accuracy score for Decision Tree Classification : 0.3166223404255319


## Train and evaluate the performance of Random Forest Classification

In [10]:
# Training....
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.ensemble import RandomForestClassifier
random_forest_classifier = RandomForestClassifier(
    n_estimators=20, criterion='entropy')
random_forest_classifier.fit(X, y)

# Testing (predictions are made on the same data the model was trained on)....
y_pred = random_forest_classifier.predict(X)

# Confusion Matrix....
print(confusion_matrix(y_true=y, y_pred=y_pred))
# Score....
acc_random_forest_classification = accuracy_score(y_true=y, y_pred=y_pred)
print("Accuracy score for Random Forest Classification :",
      acc_random_forest_classification)


[[267  64  76  76  83  68  68  78  67]
 [ 75 247  59  62  81  69  69  81  74]
 [ 67  74 261  72  67  72  78  88  72]
 [ 71  69  65 252  81  59  63  85  64]
 [ 63  65  87  83 266  65  76  90  77]
 [ 80  63  68  70  72 239  81  76  60]
 [ 65  71  60  67  87  64 287  76  74]
 [ 71  56  73  60  75  67  72 289  54]
 [ 70  70  72  69  74  70  69  80 273]]
Accuracy score for Random Forest Classification : 0.3166223404255319


## Which model is best for the given dataset?

In [11]:
accuracy_score_list = {
    "Logistic Regression": acc_logistic_regression_classification,
    "K Nearest Neighbor Classification": acc_k_nearest_neighbor_classification,
    "(SVC) Support Vector Classification": acc_svc_support_vector_classification,
    "Kernel (SVC) Support Vector Classification": acc_kernel_support_vector_classification,
    "Naive Bayes Classification": acc_naive_bayes_classification,
    "Decision Tree Classification": acc_decision_tree_classification,
    "Random Forest Classification": acc_random_forest_classification,
}
# Print final result of all models....
for model, accuracy in accuracy_score_list.items():
    print(f"{model} with accuracy score : {accuracy}")

# Find the highest accuracy score....
best_of_them = max(accuracy_score_list.values())

# Print the first model that reaches it (ties are resolved by dictionary order)....
for model, accuracy in accuracy_score_list.items():
    if accuracy == best_of_them:
        print_me = f"{model} is the best model for given dataset 🥳 with Accuracy score {accuracy}"
        print("🎉" * (len(print_me) // 2))
        print(print_me)
        print("🎉" * (len(print_me) // 2))
        break


Logistic Regression with accuracy score : 0.3081117021276596
K Nearest Neighbor Classification with accuracy score : 0.2734042553191489
(SVC) Support Vector Classification with accuracy score : 0.3093085106382979
Kernel (SVC) Support Vector Classification with accuracy score : 0.31502659574468084
Naive Bayes Classification with accuracy score : 0.3046542553191489
Decision Tree Classification with accuracy score : 0.3166223404255319
Random Forest Classification with accuracy score : 0.3166223404255319
🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉
Decision Tree Classification is the best model for given dataset 🥳 with Accuracy score 0.3166223404255319
🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉


**Note:** The result above applies only to the dataset (Breast_Cancer.csv) that was given as input. If you change the dataset, the result will change as well.
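The per-model cells above repeat the same train, predict, and score steps, and the final loop reports only the first model that reaches the top score even when two models tie. A more compact variant might loop over a dictionary of models, score them on a held-out split, and list every tied winner. The sketch below does this on synthetic data with a subset of the models; `X_demo`, `y_demo`, `models`, and the `max_iter` and `random_state` settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 400 rows, 9 features, 2 classes.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=9, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Hypothetical model selection; any scikit-learn classifier can be added here.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Random Forest": RandomForestClassifier(
        n_estimators=20, criterion="entropy", random_state=0),
}

# Fit each model on the training rows and score it on rows it has not seen.
held_out_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    held_out_accuracy[name] = accuracy_score(y_test, model.predict(X_test))

# Report every model that reaches the top score, not just the first one.
best_score = max(held_out_accuracy.values())
winners = [name for name, acc in held_out_accuracy.items() if acc == best_score]

for name, acc in held_out_accuracy.items():
    print(f"{name} with accuracy score : {acc:.4f}")
print("Best:", ", ".join(winners))
```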