
In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

This notebook is a tutorial on various classifiers that can be used for multiclass classification. We will also go through the classification metrics used to evaluate such classifiers.

## Strategy Used
Two strategies are commonly used to extend binary classifiers to multiclass problems:
1. One vs Rest
2. One vs One

We will also go through classifiers that can inherently handle multiclass data in scikit-learn's implementation.
For more details, see the documentation here: <br>
 https://scikit-learn.org/stable/modules/multiclass.html#:~:text=Multiclass%20classification%20is%20a%20classification,an%20apple%2C%20or%20a%20pear.
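
As a rough illustration of the two strategies, the sketch below wraps a binary-capable estimator in scikit-learn's `OneVsRestClassifier` and `OneVsOneClassifier` and counts the underlying binary models. It uses the small bundled `load_digits` dataset (8x8 images) instead of the Kaggle data; that choice, and the names `X_demo`, `y_demo`, `ovr` and `ovo`, are assumptions made only to keep the example quick and self-contained.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Small bundled digits dataset (10 classes), used only for illustration
X_demo, y_demo = load_digits(return_X_y=True)

# One vs Rest: one binary classifier per class
ovr = OneVsRestClassifier(SGDClassifier(random_state=42)).fit(X_demo, y_demo)

# One vs One: one binary classifier per pair of classes
ovo = OneVsOneClassifier(SGDClassifier(random_state=42)).fit(X_demo, y_demo)

print(len(ovr.estimators_))  # number of classes
print(len(ovo.estimators_))  # n_classes * (n_classes - 1) / 2
```

With 10 classes, One vs Rest trains 10 binary models while One vs One trains 45, each on a smaller subset of the data.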

In [None]:
#import important libraries
import pandas as pd
from sklearn.datasets import fetch_openml
import numpy as np


In [None]:
train=pd.read_csv('../input/digit-recognizer/train.csv')
train.head()

In [None]:
train.shape #42,000 rows of data; each row has a label column plus 28*28=784 pixel-value columns

In [None]:
X=train.iloc[:,1:]
Y=train.label

In [None]:
X.shape

In [None]:
test=pd.read_csv('../input/digit-recognizer/test.csv')
test.head()

In [None]:
sum(Y==2)

In [None]:
#PLOT images
import matplotlib as mpl
import matplotlib.pyplot as plt

some_digit = X.loc[4,:].values
some_digit_image = some_digit.reshape(28, 28)#reshaping into 2-D image matrix

plt.imshow(some_digit_image)
plt.axis("off")
plt.show()

# MultiClass Classification Methods

In [None]:
# #Label Encoding : If digit is  2 value true else false

# #target value is in integer converting to boolean
# y_is_digit2= (Y==2)
# y_is_digit2

In [None]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,Y,random_state=1,stratify=Y)

In [None]:
display(y_train.value_counts())
display(y_test.value_counts())

In [None]:

#Dictionary to store scores of various classifiers
score_dict={} #initialize

## 1) Decision Tree Classifier

Decision trees are among the most commonly used supervised ML methods because they are fast to train and easy to interpret.
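
To give a feel for the "easy interpretation" point, here is a minimal sketch that prints the rules learned by a shallow tree using `sklearn.tree.export_text`. It assumes the bundled iris dataset (not the digits data) because four named features make the rules readable; `tree_demo` and `rules` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short
tree_demo = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_demo.fit(iris.data, iris.target)

# Render the learned splits as indented if/else style text
rules = export_text(tree_demo, feature_names=list(iris.feature_names))
print(rules)
```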


In [None]:
from sklearn.tree import DecisionTreeClassifier

# Create an instance of the decision tree classifier.
# random_state is set for reproducibility; Python's random.seed does not affect scikit-learn.
dec_tree_clf=DecisionTreeClassifier(max_depth=20,class_weight='balanced',random_state=123)

dec_tree_clf.fit(X_train,y_train)

In [None]:
score=dec_tree_clf.score(X_test,y_test)
print(score)
score_dict['DecisionTree']=score

## 2) Stochastic Gradient Descent Classifier

This classifier has the advantage of being **capable of handling very large datasets efficiently**. This is in part because SGD deals with training instances independently, one at a time, which also makes SGD well suited for online learning.
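
The online-learning aspect can be sketched with `SGDClassifier.partial_fit`, which updates the model one mini-batch at a time. The example below is an illustration under assumptions: it uses the bundled `load_digits` data, and `batch_size` and `sgd_online` are hypothetical names and values, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

X_demo, y_demo = load_digits(return_X_y=True)
classes = np.unique(y_demo)

sgd_online = SGDClassifier(random_state=42)

# Feed the data in mini-batches, as if it arrived as a stream
batch_size = 200  # hypothetical batch size
for start in range(0, len(X_demo), batch_size):
    X_batch = X_demo[start:start + batch_size]
    y_batch = y_demo[start:start + batch_size]
    # classes tells the model about every label, even ones missing from a batch
    sgd_online.partial_fit(X_batch, y_batch, classes=classes)

print(sgd_online.coef_.shape)  # one weight vector per class
```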

In [None]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)

In [None]:
y_pred_sgd = sgd_clf.predict(X_test)
score=sgd_clf.score(X_test,y_test)
print(score)
score_dict['SGDClassifier']=score

### Modifying SGD Classifier

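By default, `SGDClassifier` handles multiclass data with the One vs Rest strategy. Here we use the `OneVsOneClassifier` class to construct a One vs One (OvO) version of the SGD classifier.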

In [None]:
from sklearn.multiclass import OneVsOneClassifier
sgd_modified=OneVsOneClassifier(SGDClassifier(random_state=42))
sgd_modified.fit(X_train,y_train)

In [None]:
y_pred_sgd_mod = sgd_modified.predict(X_test)
score=sgd_modified.score(X_test,y_test)
print(score)
score_dict['SGD Classifier Modified']=score

## 3) Gaussian Naive Bayes Classifier

Naive Bayes is a commonly used baseline classifier. It is fast and easy to compute. It uses Bayes' theorem to calculate the probability that a data point belongs to each class and assigns the label of the class with the highest probability.

For our MNIST dataset we have continuous features, so we can apply Gaussian Naive Bayes, which assumes that the values of each feature follow a Gaussian distribution within each class.
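
The idea can be sketched in a few lines of NumPy. This is a simplified toy version, not scikit-learn's implementation: the function name `gaussian_nb_predict`, the toy arrays, and the absolute `var_smoothing` term are assumptions made for illustration.

```python
import numpy as np

def gaussian_nb_predict(X_train, y_train, X_new, var_smoothing=1e-9):
    """Toy Gaussian Naive Bayes: fit per-class means/variances, then predict."""
    classes = np.unique(y_train)
    log_posteriors = []
    for c in classes:
        Xc = X_train[y_train == c]
        mean = Xc.mean(axis=0)
        var = Xc.var(axis=0) + var_smoothing  # avoid division by zero
        log_prior = np.log(len(Xc) / len(X_train))
        # Sum of per-feature Gaussian log-densities (the "naive" independence assumption)
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (X_new - mean) ** 2 / var, axis=1)
        log_posteriors.append(log_prior + log_lik)
    # Pick the class with the highest (unnormalized) log-posterior
    return classes[np.argmax(np.column_stack(log_posteriors), axis=1)]

X_toy = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
y_toy = np.array([0, 0, 1, 1])
toy_pred = gaussian_nb_predict(X_toy, y_toy, np.array([[0.1, 0.1], [5.0, 5.0]]))
print(toy_pred)
```

The per-class means computed here correspond to what `GaussianNB` exposes as `theta_`.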

In [None]:
from sklearn.naive_bayes  import GaussianNB

gauss_clf= GaussianNB()
gauss_clf.fit(X_train,y_train)

score=gauss_clf.score(X_test,y_test)
print(score)
score_dict['Gaussian NB']=score

In [None]:
print('Shape of the per-pixel mean vector for class 2:',gauss_clf.theta_[2].shape)
fig,axis=plt.subplots(2,2)
axis[0,0].imshow(gauss_clf.theta_[2].reshape(28,28))
axis[0,0].axis('off')
axis[0,1].imshow(gauss_clf.theta_[3].reshape(28,28))
axis[0,1].axis('off')
axis[1,0].imshow(gauss_clf.theta_[4].reshape(28,28))
axis[1,0].axis('off')
axis[1,1].imshow(gauss_clf.theta_[5].reshape(28,28))
axis[1,1].axis('off')

In [None]:
y_pred_gauss=gauss_clf.predict(X_test)

## 4) Logistic Regression Classifier

Logistic Regression is one of the most common regression-based classifiers. Like a Linear Regression model, a Logistic Regression model computes a weighted sum of the input features (plus a bias term). Instead of outputting the result directly as the Linear Regression model does, it passes the result through the logistic or **sigmoid** function, which gives the probability of belonging to a class. The output range of the sigmoid function is (0,1).
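
A minimal sketch of that computation for a single binary decision is shown below. The weights `w`, bias `b` and input `x` are hypothetical values chosen for illustration, not parameters learned in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias and input features
w = np.array([0.5, -0.25])
b = 0.1
x = np.array([2.0, 4.0])

z = w @ x + b   # weighted sum of the inputs plus the bias term
p = sigmoid(z)  # estimated probability of the positive class
print(z, p)
```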

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

scaler=StandardScaler()
scaler.fit(X_train)   #fit calculates the mean and std dev for each feature in data
X_train_scaled=scaler.transform(X_train) #transforming the X_train feature inputs
logis_clf=LogisticRegression(solver='sag')
logis_clf.fit(X_train_scaled,y_train)

In [None]:
X_test_scaled=scaler.transform(X_test) #scale the test set with the scaler fitted on the training set; fitting on the test set would cause data leakage
score=logis_clf.score(X_test_scaled,y_test)
print(score)
score_dict['LogisticReg']=score

In [None]:
y_pred_logis=logis_clf.predict(X_test_scaled) #predict on the scaled test set, since the model was trained on scaled features

## 5) Support Vector Machines

SVMs are powerful algorithms for both classification and regression. They are non-parametric and can construct complex decision boundaries. For multiclass classification, scikit-learn's SVC uses the One vs One method internally.
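
One way to see the One vs One behaviour is to look at the shape of `decision_function`. The sketch below assumes the small bundled `load_digits` dataset so that it runs quickly; `svc_ovo` and `svc_ovr` are illustrative names.

```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC

X_demo, y_demo = load_digits(return_X_y=True)

# decision_function_shape only changes the shape of the returned scores;
# the underlying training is one-vs-one in both cases
svc_ovo = SVC(decision_function_shape='ovo').fit(X_demo, y_demo)
svc_ovr = SVC(decision_function_shape='ovr').fit(X_demo, y_demo)

ovo_shape = svc_ovo.decision_function(X_demo[:1]).shape
ovr_shape = svc_ovr.decision_function(X_demo[:1]).shape
print(ovo_shape)  # one score per pair of classes
print(ovr_shape)  # one score per class
```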

In [None]:
from sklearn.svm import SVC

svc_clf=SVC(random_state=123,class_weight='balanced') #using default kernel 'rbf'
svc_clf.fit(X_train,y_train)


In [None]:
score=svc_clf.score(X_test,y_test)
print(score)
score_dict['SVC']=score

In [None]:
y_pred_svc=svc_clf.predict(X_test)

## 6) Random Forest Classifier



In [None]:
from sklearn.ensemble import RandomForestClassifier
random_clf=RandomForestClassifier(n_estimators=100,class_weight='balanced')
random_clf.fit(X_train,y_train)

In [None]:
score=random_clf.score(X_test,y_test)
print(score)
score_dict['RandomForest']=score

## Evaluation of Models

The highest score (accuracy) is obtained by SVC, followed by Random Forest and Logistic Regression.

In [None]:
df_score=pd.DataFrame(score_dict,index=["Score"])
df_score.transpose()

# Classification Evaluation Metrics

## Confusion Matrix


A much better way to evaluate the performance of a classifier is to look at the confusion matrix. The general idea is to count the number of times instances of class A are classified as class B. To compute the confusion matrix, you first need to have a set of predictions so that they can be compared to the actual targets.

We need the predicted classes from each of our classifiers, which we calculated earlier (y_pred_sgd, y_pred_svc, etc.).
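
The counting idea can be sketched by hand with NumPy. The helper `confusion_matrix_np` and the toy labels are hypothetical; in practice `sklearn.metrics.confusion_matrix` does the same job.

```python
import numpy as np

def confusion_matrix_np(y_true, y_pred, n_classes):
    """Rows are actual classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        cm[actual, predicted] += 1
    return cm

# Toy labels for a 3-class problem
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

cm_toy = confusion_matrix_np(y_true_toy, y_pred_toy, n_classes=3)
print(cm_toy)
```

Correct predictions land on the diagonal; every off-diagonal cell counts one kind of confusion between two classes.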

In [None]:
#CONFUSION MATRIX
from sklearn.metrics import ConfusionMatrixDisplay,confusion_matrix
import matplotlib.pyplot as plt

#Confusion Matrix
conf_mat=confusion_matrix(y_test,y_pred_svc)

#Display Confusion Matrix
#ConfusionMatrixDisplay(conf_mat,display_labels=['Not Digit 2','Is Digit 2'])
#plt.title('Confusion Matrix for KNNeighbors whether a data is digit 2 or not ')
#plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(7.5, 7.5))
ax.matshow(conf_mat, cmap=plt.cm.Blues, alpha=0.3)
for i in range(conf_mat.shape[0]):
    for j in range(conf_mat.shape[1]):
        ax.text(x=j, y=i,s=conf_mat[i, j], va='center', ha='center', size='xx-large')

plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()

## Normalized Confusion Matrix
We can get a normalized confusion matrix by using the `normalize` argument of `confusion_matrix` or `ConfusionMatrixDisplay.from_predictions`.

The `normalize` argument (default `None`) accepts:<br>
'all' = each value divided by the total number of datapoints,<br>
'true' = each value divided by the total number of true values for its class (row); the diagonal then represents recall,<br>
'pred' = each value divided by the total number of predicted values for its class (column); the diagonal then represents precision<br>
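
The three normalizations amount to dividing by different sums, as the following NumPy sketch shows. The 3x3 matrix `cm` is a made-up example, not output from the models above.

```python
import numpy as np

# Hypothetical confusion matrix: rows are actual classes, columns are predictions
cm = np.array([[8, 2, 0],
               [1, 7, 2],
               [0, 1, 9]])

cm_true = cm / cm.sum(axis=1, keepdims=True)  # like normalize='true': divide by row totals
cm_pred = cm / cm.sum(axis=0, keepdims=True)  # like normalize='pred': divide by column totals
cm_all = cm / cm.sum()                        # like normalize='all': divide by grand total

recall = np.diag(cm_true)      # per-class recall
precision = np.diag(cm_pred)   # per-class precision
print(recall)
print(precision)
```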


In [None]:
# ConfusionMatrixDisplay.from_predictions(y_true=y_test ,y_pred=y_pred_logis,cmap='plasma',normalize='true',display_labels=['Not Digit 2','Is Digit 2'])
# plt.title('NOrmalized Confusion Matrix for KNNeighbors based on true values ')
# plt.show()
# print('Diagonals Represent Recall of each class')

## Classification Report


### 1.Precision
Precision = TP / (TP + FP), calculated with respect to the predicted values.
It measures the “exactness” of our model: of all the data points predicted as positive, how many are truly positive.
High precision is required when the cost of a false positive is high. Example: spam email detection.


### 2.Recall/Sensitivity

Recall = TP / (TP + FN), calculated with respect to the true values.
It measures the “completeness” of our model: how well it captures the positive cases out of all the data points that are actually positive.
High recall is required when the cost of a false negative is high. Example: targeting customers who would accept a discount offer.


### 3.Specificity

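Specificity = TN / (TN + FP), calculated with respect to the true values. It is the fraction of actual negatives that are correctly identified as negative.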


### 4.Accuracy

Accuracy = Diagonal elements / Total datapoints = (TP + TN) / Total datapoints

*Not a reliable metric for imbalanced/skewed datasets.*
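
In a multiclass setting, TP, FP, FN and TN are defined per class by treating that class as "positive" and all others as "negative". The sketch below works this out for one class of a made-up confusion matrix; `cm` and `one_vs_rest_metrics` are hypothetical names used only for illustration.

```python
import numpy as np

# Hypothetical confusion matrix: rows are actual classes, columns are predictions
cm = np.array([[8, 2, 0],
               [1, 7, 2],
               [0, 1, 9]])

def one_vs_rest_metrics(cm, k):
    """Precision, recall and specificity for class k against all other classes."""
    tp = cm[k, k]
    fp = cm[:, k].sum() - tp        # predicted k, but actually another class
    fn = cm[k, :].sum() - tp        # actually k, but predicted as another class
    tn = cm.sum() - tp - fp - fn    # everything that involves neither
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

metrics_class1 = one_vs_rest_metrics(cm, 1)
accuracy = np.trace(cm) / cm.sum()  # diagonal elements / total datapoints
print(metrics_class1, accuracy)
```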


### 5.F-1 Score
F1 score is the harmonic mean of precision and recall: F1 = 2 * Precision * Recall / (Precision + Recall). Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both recall and precision are high.

F1 score favors classifiers that have similar precision and recall.

In some contexts you mostly care about precision, and in other contexts you really care about recall.
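
A short sketch makes the difference between the harmonic and the regular mean concrete. The helper `f1` and the precision/recall values are hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Balanced case: both metrics are high
balanced = f1(0.9, 0.9)

# Lopsided case: perfect precision but very low recall
lopsided = f1(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2

print(balanced, lopsided, arithmetic_mean)
```

In the lopsided case the regular mean is 0.55, while the F1 score stays below 0.2 because the low recall dominates.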


In [None]:
#CLassification Metric Report
from sklearn.metrics import classification_report

print(classification_report(y_test,y_pred_svc))

## Analysis of Classification Report and Matrix

* From the confusion matrix we can observe that most of the data lies on the principal diagonal, which means our SVC model is fairly accurate, with very few misclassifications.

* In the report above we can see that precision and recall are both high and similar. As a result, the F1 score is also very high, with an overall average F1 score of 97%.

In [None]:
# Calculating the F1 score for all classifiers.
# F1 is more informative than accuracy when the classes are imbalanced.
from sklearn.metrics import f1_score


# score_df=[y_pred_gauss,y_pred_sgd,y_pred_svc,y_pred_logis]
# f1_scores=[]
# name=['Gauss','SGD','SVC','Logis']

# for score in score_df:
#   # average='macro' is required here because the target is multiclass
#   f1_scores.append(round(f1_score(y_test,score,average='macro'),3))
# print("The f1 scores",f1_scores)




In [None]:
# plt.plot(name,f1_scores,'r.-')
# plt.title('F1- Scores of various classifiers')

We get the best score for SVC and the worst score for Gaussian Naive Bayes.

A possible reason is that we are analysing the pixel values of images, and the RBF-kernel SVC is a distance-based classifier, which is apt for this case.

### 6.Macro Average
It is the unweighted average of a metric (for example precision) over all classes. For two classes A and B:
Macro Average = (Precision Class A + Precision Class B) / 2

### 7.Weighted Average
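It is the average of a metric (for example precision) over all classes, weighted by the number of true instances of each class (Na, Nb):
Weighted Average = (Na*PrecisionA + Nb*PrecisionB) / (Na + Nb)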
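
The two averages can differ noticeably when class sizes are unequal, as this small sketch shows. The per-class precisions and supports are made-up numbers for illustration.

```python
import numpy as np

# Hypothetical per-class precision and number of true instances (support)
precisions = np.array([0.9, 0.5, 0.7])
support = np.array([50, 30, 20])

macro_avg = precisions.mean()                                 # every class counts equally
weighted_avg = (precisions * support).sum() / support.sum()   # larger classes count more

print(macro_avg, weighted_avg)
```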

## Precision - Recall Curve

Increasing precision typically reduces recall, and vice versa. This is called the precision/recall trade-off. We can plot the PR curve to choose a decision threshold based on our precision/recall requirements.
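
A minimal sketch of the trade-off using `sklearn.metrics.precision_recall_curve` follows. The function works on a binary target, so for a multiclass problem it would be applied to one class versus the rest; the labels and scores below are made-up values standing in for something like a `decision_function` output.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical binary labels (is the sample the class of interest?) and classifier scores
y_true_bin = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

precision, recall, thresholds = precision_recall_curve(y_true_bin, scores)

# Each threshold gives one (precision, recall) operating point
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Plotting `recall` against `precision` (for example with `plt.plot(recall, precision)`) gives the PR curve.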

## Cross Validation Score

We can use cross validation to test our model across several folds of the data. This gives a more reliable evaluation metric that does not depend on a single train/test split. It comes at the cost of more training time.
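
As a quick sketch, `cross_val_score` fits the model once per fold and returns one score per fold. To keep the run short, this example assumes the bundled `load_digits` data and a small decision tree; `tree_demo` and `cv_scores` are illustrative names.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_digits(return_X_y=True)

tree_demo = DecisionTreeClassifier(max_depth=10, random_state=0)

# 5-fold cross validation: five fits, five accuracy scores
cv_scores = cross_val_score(tree_demo, X_demo, y_demo, cv=5)
print(cv_scores)
print(cv_scores.mean(), cv_scores.std())
```

The mean summarizes overall performance, and the standard deviation indicates how much it varies between folds.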

In [None]:
#Cross val score,cross val predict
# from sklearn.model_selection import cross_val_score,cross_val_predict

# print(cross_val_score(logis_clf,X_train,y_train))

#y_cross_pred= cross_val_predict(sgd_clf,X_train,y_train,method='decision_function')

## Submission CSV

Finally, out of all the above classifiers we got the maximum accuracy for SVC, so we use it to generate predictions on test.csv for our submission.

In [None]:
y_test_predictions=svc_clf.predict(test)

In [None]:
y_test_predictions

In [None]:
ImageId=list(range(1,len(test)+1)) #image ids start at 1, one per test row (28,000 rows)
submission=pd.DataFrame({"ImageId":ImageId,"Label":y_test_predictions},columns=['ImageId','Label'])

submission.head()

In [None]:
submission.to_csv("./submission.csv",index=False)