# Comparison of Classification methods

Author: Dr. Vijesh J. Bhute   

Highlights of this notebook:
<ul><li>Performs classification of a dataset with more than 2 features.</li>
    <li> Compares different methods using Accuracy and Cumulative Accuracy Profile analysis. </li></ul>

<b>Types of classification models:</b>
<ul><li>Logistic regression</li>
    <li>K-Nearest Neighbours</li>
    <li>Support vector machine (SVM)</li>
    <li>Kernel SVM</li>
    <li>Naive Bayes</li>
    <li>Decision Tree</li>
    <li>Random Forest</li>
    </ul> 

## Accuracy and Confusion matrix for classification models

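For classification problems, the accuracy of a model is given by $$\text{Accuracy} =\frac{\text{TN+TP}}{\text{TN+FN+TP+FP}}$$ where TN, TP, FN and FP are the numbers of true negatives, true positives, false negatives and false positives.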

You can also look at the confusion matrix to see a summary of how the model has performed:
<table>
  <tr>
    <td></td>
    <th>Predicted No ($0$)</th>
    <th>Predicted Yes ($1$)</th>
  </tr>
  <tr>
    <th>Actual No ($0$)</th>
    <td>TN</td>
    <td>FP</td>
  </tr>
  <tr>
    <th>Actual Yes ($1$)</th>
    <td>FN</td>
    <td>TP</td>
  </tr>
</table>
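As a quick check of the formula, the sketch below computes accuracy from a 2×2 confusion matrix laid out as in the table above. The counts are the logistic regression test-set results reported later in this notebook; the helper name `accuracy_from_cm` is illustrative and is not part of scikit-learn.

```python
def accuracy_from_cm(cm):
    """Accuracy from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    return (tn + tp) / (tn + fp + fn + tp)

# Logistic regression confusion matrix on the test set (from this notebook)
cm_log = [[126, 4], [5, 70]]
acc_log = accuracy_from_cm(cm_log)
print(round(acc_log, 6))  # 0.956098
```

The value agrees with the test-set score reported for logistic regression in the model comparison table.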

### CAP analysis

CAP: the Cumulative Accuracy Profile is often used to compare models, since accuracy alone may not be an ideal metric. 
<br><br>
The comparison is based on the Accuracy Ratio (AR). 
<br><br>
One way to analyse the CAP curve is to use the area under the curve. Let $a$ be the area under the random model. The Accuracy Ratio is calculated using the following steps:
<ul><li>Calculate the area between the perfect model and the random model ($a_P$)</li>
    <li>Calculate the area between the prediction model and the random model ($a_R$)</li>
    <li>Calculate the Accuracy Ratio (AR) $= a_R / a_P$</li></ul>
The closer the Accuracy Ratio is to $1$, the better the model.
<br> The CAP analysis is adapted from the <a href="https://www.kaggle.com/code/rohandawar/cap-cumulative-accuracy-profile-analysis-1/notebook" target="_blank">Kaggle website</a>
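
For reference, the first two areas have simple closed forms. The derivation below assumes the CAP curve is plotted with the number of samples examined on the horizontal axis and the number of positives captured on the vertical axis; the symbols $N$ (total number of samples) and $P$ (number of positive samples) are introduced here for illustration. The random model is the straight line from $(0,0)$ to $(N,P)$, and the perfect model rises from $(0,0)$ to $(P,P)$ and then stays flat until $(N,P)$, so
$$a = \frac{N P}{2}, \qquad a_P = \Big(\frac{P^2}{2} + P\,(N-P)\Big) - a = \frac{P\,(N-P)}{2}$$
For example, a test set with $N = 205$ samples and $P = 75$ positives gives $a = 7687.5$ and $a_P = 4875$. The area $a_R$ depends on the model's ranking of the samples and has to be computed numerically.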

### Importing the libraries

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

### Importing the dataset

I am going to use a breast cancer dataset from UCI for comparing different models. 

In [16]:
dataset = pd.read_csv('data/Data_UCI_Breast cancer_classification.csv')
dataset.head()

Unnamed: 0,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


<ul><li>The last column (Class) is a binary variable, already encoded as 2 (benign) or 4 (malignant). It can be convenient to re-encode it as 0 and 1 using a label encoder.</li>
    <li>The first column (Sample code number) is an identifier that is not relevant for modelling and should be excluded from the training and test sets.</li></ul>

In [17]:
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values

### Optional: encoding the target variable (Class) as 0 and 1 instead of 2 and 4

In [18]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y) #Maps 2 (benign) to 0 and 4 (malignant) to 1
#An ordinal encoder can be used when a variable is not only categorical but also has a certain order (low, high for example)
sum(y) #Number of malignant samples

239
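
Because malignant (4) is mapped to 1, the sum above counts the malignant samples. The mapping itself can be inspected as in the sketch below, which uses a small made-up target vector rather than the real dataset; the names `y_raw` and `le_demo` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy target vector using the same 2/4 coding as the Class column (made-up values)
y_raw = np.array([2, 4, 4, 2, 2])

le_demo = LabelEncoder()
y_enc = le_demo.fit_transform(y_raw)

print(le_demo.classes_)                  # original labels, in sorted order
print(y_enc)                             # encoded labels (position in classes_)
print(le_demo.inverse_transform(y_enc))  # back to the original 2/4 coding
```

`classes_` lists the original labels in sorted order, and each encoded value is the position of the label in that list.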

### Splitting the dataset into the Training set and Test set

In [19]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

<b>Note:</b> The predictors/features are of similar orders of magnitude. If they were not, feature scaling would be important, especially for distance-based models such as K-Nearest Neighbours and SVM, where features with larger ranges would otherwise dominate. 
<br>
In this case, it is still fine to apply feature scaling, as it ensures that all features have a similar mean and standard deviation.<br>

### Feature Scaling (Optional)

In [20]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test) #Fit using training set and transform the test set
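
The fit-on-train, transform-both pattern above can be reproduced by hand. The sketch below is a minimal NumPy illustration on made-up data, not a replacement for `StandardScaler`; the array names are hypothetical.

```python
import numpy as np

# Toy data: rows are samples, columns are features (values are made up)
X_train_demo = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
X_test_demo = np.array([[2.0, 40.0]])

# "fit": learn the per-feature mean and standard deviation from the training set only
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)  # population standard deviation (ddof=0)

# "transform": apply the same training statistics to both sets
X_train_scaled = (X_train_demo - mu) / sigma
X_test_scaled = (X_test_demo - mu) / sigma

print(X_train_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_train_scaled.std(axis=0))   # approximately 1 for every feature
```

Using the training statistics for the test set avoids leaking information about the test data into the model.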

## Logistic Regression

A multiple linear regression model (shown here with two features) is given by $$ y = b_0 + b_1 x_1 + b_2 x_2$$
<br>In logistic regression, this linear combination is passed through a function whose output lies between $0$ and $1$, such as the sigmoid function, $$p = \frac{1}{1+e^{-y}}$$
<br> where $p$ is the predicted probability, which takes values between $0$ and $1$. Solving the above equation for $y$ gives $$y = \ln \Big(\frac{p}{1-p}\Big) = b_0 + b_1 x_1 + b_2 x_2$$
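
The relationship between the linear predictor and the probability can be checked numerically. In the sketch below the coefficients and feature values are hypothetical and chosen only for illustration; they are not fitted to the dataset.

```python
import numpy as np

def sigmoid(y):
    """Map a real-valued linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-y))

def logit(p):
    """Inverse of the sigmoid: the log-odds of a probability p."""
    return np.log(p / (1.0 - p))

# Hypothetical coefficients and feature values, for illustration only
b0, b1, b2 = -1.0, 0.8, 0.5
x1, x2 = 2.0, 1.0

y_lin = b0 + b1 * x1 + b2 * x2   # linear predictor (log-odds)
p = sigmoid(y_lin)               # probability between 0 and 1

print(p)
print(logit(p))  # recovers y_lin up to floating-point error
```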

In [21]:
from sklearn.linear_model import LogisticRegression
logModel = LogisticRegression(random_state = 0)
logModel.fit(X_train, y_train)
y_predict_logModel= logModel.predict(X_test) #Predicting test set
#print(logModel.predict(sc.transform([[5,1,1,1,2,1,3,1,1]]))) #Predicting a new result using the model (9 feature values, here those of the first sample)
from sklearn.metrics import confusion_matrix #Confusion matrix
cm = confusion_matrix(y_test, y_predict_logModel)
print(cm)

[[126   4]
 [  5  70]]


## K Nearest Neighbours

In [22]:
from sklearn.neighbors import KNeighborsClassifier
kNN = KNeighborsClassifier(n_neighbors=5, metric = 'minkowski',p=2) #minkowski with p=2 corresponds to Euclidean distance
kNN.fit(X_train, y_train)
y_predict_kNN=kNN.predict(X_test)
cm = confusion_matrix(y_test, y_predict_kNN)
print(cm)

[[126   4]
 [  6  69]]
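
The comment in the cell above notes that the Minkowski metric with `p=2` is the Euclidean distance. A minimal sketch of that metric, with a hypothetical helper name and made-up points, is:

```python
import numpy as np

def minkowski_distance(u, v, p=2):
    """Minkowski distance of order p between two points."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.sum(np.abs(u - v) ** p) ** (1.0 / p))

u = [0.0, 0.0]
v = [3.0, 4.0]
d_euclid = minkowski_distance(u, v, p=2)     # Euclidean distance
d_manhattan = minkowski_distance(u, v, p=1)  # Manhattan distance

print(d_euclid, d_manhattan)
```

With `p=2` the points above are 5 units apart (the 3-4-5 triangle), and with `p=1` they are 7 units apart.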




## Support Vector Machine (SVM)

### Linear kernel for SVM

In [33]:
from sklearn.svm import SVC
LinearSVMmodel = SVC(kernel='linear', random_state = 0)
LinearSVMmodel.fit(X_train, y_train)
y_predict_LinSVM=LinearSVMmodel.predict(X_test)
cm = confusion_matrix(y_test,y_predict_LinSVM)
print(cm)

[[126   4]
 [  4  71]]


### Kernel SVM

Types of common kernel functions:<br>
<ul><li>Gaussian Radial basis function (RBF)</li>
    <li>Sigmoid kernel </li>
    <li>Polynomial kernel</li>
    </ul>
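
For a pair of feature vectors, these kernels can be written in a few lines. The sketch below follows the functional forms given in the scikit-learn documentation; the parameter names `gamma`, `coef0` and `degree` mirror those of `SVC`, but the default values used here are arbitrary and the functions are illustrative, not the library's implementation.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian RBF kernel: exp(-gamma * ||x - z||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def polynomial_kernel(x, z, gamma=1.0, coef0=0.0, degree=3):
    """Polynomial kernel: (gamma * <x, z> + coef0)^degree."""
    return float((gamma * np.dot(x, z) + coef0) ** degree)

def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    """Sigmoid kernel: tanh(gamma * <x, z> + coef0)."""
    return float(np.tanh(gamma * np.dot(x, z) + coef0))

x = [1.0, 0.0]
z = [0.0, 1.0]
print(rbf_kernel(x, x))         # identical points give the maximum value
print(rbf_kernel(x, z))         # decays with squared distance
print(polynomial_kernel(x, z))  # orthogonal vectors with coef0=0
print(sigmoid_kernel(x, z))
```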

In [34]:
from sklearn.svm import SVC
kSVMmodel = SVC() #default kernel is rbf (radial basis function)
#Available kernels: 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or a callable (default='rbf')
kSVMmodel.fit(X_train, y_train)
y_predict_kSVM=kSVMmodel.predict(X_test)
cm = confusion_matrix(y_test,y_predict_kSVM)
print(cm)

[[123   7]
 [  3  72]]


## Naive Bayes

<ul><li>Naive Bayes makes several assumptions but can be used even if the dataset doesn't satisfy them. </li><li>The most important assumption is that the <b>features are conditionally independent of each other given the class</b>. This is not true in most cases.</li> <li>
In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering. They require a small amount of training data to estimate the necessary parameters. </li>
    <li><b>This algorithm is a decent classifier but a poor probability estimator</b>, so its predicted probabilities should not be taken too seriously.</li>
<li>There are different types of Naive Bayes classifiers which differ mainly by the assumptions they make regarding the distribution of $P(x_i | y)$. These are:
<ul><li> Gaussian Naive Bayes</li>
    <li> Multinomial Naive Bayes </li>
    <li> Complement Naive Bayes</li>
    <li> Bernoulli Naive Bayes</li>
    <li> Categorical Naive Bayes </li></ul></li></ul>
You can learn more about the Naive Bayes classifier from the <a href="https://scikit-learn.org/stable/modules/naive_bayes.html" target="_blank">scikit-learn website</a> <br>
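
For reference, the independence assumption leads to the following decision rule, written here for $n$ features (the symbol $n$ and the per-class parameters below are notation introduced for this summary):
$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
The predicted class is the value of $y$ that maximises the right-hand side. In the Gaussian variant, each likelihood is assumed to be a normal density whose mean $\mu_{i,y}$ and variance $\sigma_{i,y}^2$ are estimated from the training data for each feature $i$ and class $y$:
$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{i,y}^2}} \exp\Big(-\frac{(x_i - \mu_{i,y})^2}{2\sigma_{i,y}^2}\Big)$$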

### Gaussian Naive Bayes

In [35]:
from sklearn.naive_bayes import GaussianNB #MultinomialNB, ComplementNB, BernoulliNB, CategoricalNB
NBmodel = GaussianNB()
NBmodel.fit(X_train,y_train)
y_predict_NB= NBmodel.predict(X_test)
cm = confusion_matrix(y_test,y_predict_NB)
print(cm)

[[121   9]
 [  2  73]]


## Decision Tree classifier

In [36]:
from sklearn.tree import DecisionTreeClassifier
DecTreeModel= DecisionTreeClassifier()
DecTreeModel.fit(X_train,y_train)
y_predict_DecTree= DecTreeModel.predict(X_test)
cm = confusion_matrix(y_test,y_predict_DecTree)
print(cm)

[[125   5]
 [  6  69]]


## Random Forests classifier

In [37]:
from sklearn.ensemble import RandomForestClassifier
RFModel = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 0)
RFModel.fit(X_train, y_train)
y_predict_RF= RFModel.predict(X_test)
cm = confusion_matrix(y_test,y_predict_RF)
print(cm)

[[126   4]
 [  6  69]]


## Model comparison

In [47]:
dat = {'Model name':['Log Regression', 'kNN', 
                     'Linear SVM', 'Kernel SVM', 
                     'Naive Bayes', 'Decision Trees', 
                     'Random Forests'], 
       'Training set':[logModel.score(X_train, y_train), kNN.score(X_train, y_train), 
                       LinearSVMmodel.score(X_train,y_train), kSVMmodel.score(X_train,y_train), 
                      NBmodel.score(X_train,y_train), DecTreeModel.score(X_train, y_train),
                      RFModel.score(X_train,y_train)], 
       'Test set':[logModel.score(X_test, y_test), kNN.score(X_test,y_test), 
                   LinearSVMmodel.score(X_test,y_test), kSVMmodel.score(X_test,y_test), 
                  NBmodel.score(X_test,y_test), DecTreeModel.score(X_test, y_test),
                  RFModel.score(X_test,y_test)]}
accuracyDF = pd.DataFrame(data = dat)
#accuracy[0]= logModel.score(X_train, y_train)
#accuracyTest
accuracyDF.sort_values('Test set')



Unnamed: 0,Model name,Training set,Test set
4,Naive Bayes,0.970711,0.946341
5,Decision Trees,1.0,0.946341
1,kNN,0.979079,0.95122
3,Kernel SVM,0.981172,0.95122
6,Random Forests,1.0,0.95122
0,Log Regression,0.976987,0.956098
2,Linear SVM,0.976987,0.960976


For this dataset, with largely default parameters, Linear SVM and logistic regression achieve the highest test-set accuracy, although the differences between models are small (0.946 to 0.961). Accuracy may not be the best metric, since the accuracy paradox can be an issue when classes are imbalanced. The Cumulative Accuracy Profile provides an additional metric for comparing classification models. 
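
The accuracy paradox mentioned above can be illustrated with a hypothetical, heavily imbalanced test set (the numbers below are made up and unrelated to the breast cancer data): a model that never predicts the positive class still scores a high accuracy.

```python
# Hypothetical imbalanced test set: 95 negatives and 5 positives
y_true = [0] * 95 + [1] * 5

# A trivial "model" that always predicts the majority class
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(accuracy)  # 0.95
print(recall)    # 0.0
```

Despite 95% accuracy, this model does not identify a single positive sample, which is why a second metric is useful.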

### CAP analysis

In [31]:
from sklearn.metrics import auc

In [32]:
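def CAP_analysis(y_test, y_predict):
    """Return the CAP Accuracy Ratio (AR) for binary labels encoded as 0 and 1."""
    total = len(y_test)
    one_count = np.sum(y_test)
    
    # Order the true labels by predicted value, highest first.
    # Note: y_predict holds hard 0/1 labels here, so ties are broken by the
    # true label (also in descending order), which makes the curve optimistic.
    lm = [y for _,y in sorted(zip(y_predict,y_test), reverse=True)]
    xaxis = np.arange(0, total+1)
    yaxis = np.append([0], np.cumsum(lm))
    
    # Area under Random Model
    a = auc([0, total], [0, one_count])

    # Area between Perfect and Random Model
    aP = auc([0, one_count, total], [0, one_count, one_count]) - a

    # Area between Trained and Random Model
    aR = auc(xaxis, yaxis) - a

    return aR / aP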
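
The function above ranks samples by their predicted class labels. A common alternative, sketched below as an assumption rather than as the method used in this notebook, is to rank by a continuous score such as the positive-class probability from `predict_proba` (available on `LogisticRegression`, for example) and to keep the original order for tied scores. The names `trapezoid_area` and `cap_accuracy_ratio` are hypothetical, and the toy labels and scores are made up.

```python
import numpy as np

def trapezoid_area(x, y):
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

def cap_accuracy_ratio(y_true, scores):
    """CAP Accuracy Ratio for 0/1 labels ranked by a continuous score."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    total = len(y_true)
    positives = int(y_true.sum())

    # Highest score first; a stable sort keeps tied samples in their original
    # order, so the true labels never influence the ranking
    order = np.argsort(-scores, kind="stable")
    captured = np.concatenate(([0], np.cumsum(y_true[order])))

    a = trapezoid_area([0, total], [0, positives])  # random model
    a_p = trapezoid_area([0, positives, total], [0, positives, positives]) - a
    a_r = trapezoid_area(np.arange(total + 1), captured) - a
    return a_r / a_p

# Toy example: the scores rank every positive above every negative
y_demo = [1, 0, 1, 0, 0, 1]
scores_demo = [0.9, 0.2, 0.8, 0.1, 0.3, 0.7]
ar_perfect = cap_accuracy_ratio(y_demo, scores_demo)
ar_worst = cap_accuracy_ratio(y_demo, [-s for s in scores_demo])
print(ar_perfect, ar_worst)
```

A ranking that places all positives first gives an Accuracy Ratio of 1, and the fully reversed ranking gives -1. Applying this variant to the models in this notebook would produce different values from those in the table below.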

In [50]:
datCAP = [CAP_analysis(y_test, y_predict_logModel), CAP_analysis(y_test, y_predict_kNN), 
                       CAP_analysis(y_test, y_predict_LinSVM), CAP_analysis(y_test, y_predict_kSVM), 
                      CAP_analysis(y_test, y_predict_NB), CAP_analysis(y_test, y_predict_DecTree),
                      CAP_analysis(y_test, y_predict_RF)]
accuracyDF['Accuracy Ratio (CAP)'] = datCAP

In [51]:
accuracyDF.sort_values('Accuracy Ratio (CAP)')

Unnamed: 0,Model name,Training set,Test set,Accuracy Ratio (CAP)
5,Decision Trees,1.0,0.946341,0.993846
1,kNN,0.979079,0.95122,0.995077
6,Random Forests,1.0,0.95122,0.995077
3,Kernel SVM,0.981172,0.95122,0.995692
0,Log Regression,0.976987,0.956098,0.995897
4,Naive Bayes,0.970711,0.946341,0.996308
2,Linear SVM,0.976987,0.960976,0.996718


The Linear SVM model performs best on both test-set accuracy and the Accuracy Ratio evaluated using CAP analysis.