## Selecting the relevant metric in classification

#### Select the relevant metric for classification problems in order to compare the performance of multiple models

#### Tags:
    Data: labeled data, Kaggle competition
    Technologies: python, pandas, scikit-learn
    Techniques: selecting the relevant metric for classification
    
#### Resources:
[UCI Machine Learning Repository - Default of Credit Clients](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#)

[ROC curve and AUC](https://www.youtube.com/watch?v=OAl6eAyP-yo)


## Classification metrics for model evaluation

The metrics for classification are more elaborate than for regression. The main idea is to compare the predictions with the actual classes in a confusion matrix, where usually the rows are the actual classes and the columns are the predicted classes. In the simplest classification task, where the class is binary, the confusion matrix has 4 cells: true positives, false negatives, false positives and true negatives. A perfect classifier would have non-zero values only on the main diagonal.

An example of a confusion matrix, like the one shown below, can be found here:
[Confusion Matrix](https://en.wikipedia.org/wiki/Confusion_matrix)

![title](confusion-matrix.png)
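
As a minimal sketch of the layout described above, the snippet below builds a binary confusion matrix with scikit-learn's `confusion_matrix`. The labels `y_true` and `y_pred` are made up for illustration and are not taken from the credit-card data.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (assumption, for illustration only): 1 = positive, 0 = negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# Rows are the actual classes, columns are the predicted classes,
# both ordered by label value (0 first, then 1)
cm = confusion_matrix(y_true, y_pred)

# For a binary problem the flattened matrix is ordered TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Note that scikit-learn sorts the labels, so the negative class comes first and the true negatives sit in the top-left cell; some textbook figures put the positive class first instead.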

Hence the main metrics are:

1. Accuracy - the proportion of all predictions that are correct, (TP + TN) / (TP + TN + FP + FN); it has issues with skewed datasets, where one class has a low probability
2. Precision - the accuracy of the positive predictions, TP / (TP + FP)
3. Recall - also called sensitivity or TPR, the proportion of positive instances that are correctly detected by the classifier, TP / (TP + FN)
4. TNR - also called specificity, the ratio of negative instances that are correctly predicted as negative, TN / (TN + FP)
5. FPR - false positive rate, the ratio of false positive instances out of all negative instances, FP / (FP + TN), which equals 1 - TNR
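
The five metrics listed above can be sketched directly from the four confusion-matrix counts. The helper name `classification_rates` and the counts in the example are hypothetical; they describe an invented skewed dataset with 10 positives out of 1000 instances.

```python
def classification_rates(tp, fp, fn, tn):
    """Compute the five listed metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),    # accuracy of the positive predictions
        "recall": tp / (tp + fn),       # sensitivity, TPR
        "specificity": tn / (tn + fp),  # TNR
        "fpr": fp / (fp + tn),          # equals 1 - specificity
    }


# Invented skewed example: 10 actual positives, 990 actual negatives
rates = classification_rates(tp=2, fp=5, fn=8, tn=985)
for name, value in rates.items():
    print(f"{name:12s} {value:.3f}")
```

In this example the accuracy is very high even though only 2 of the 10 positives are found, which is the issue with skewed datasets mentioned in the first item. The helper does not guard against a zero denominator, for instance when the classifier makes no positive predictions at all.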

For a given model, moving the decision threshold to increase precision typically decreases recall, and vice versa. Depending on the type of problem we are trying to solve, we may prefer higher precision or higher recall. We can plot the precision/recall curve across decision thresholds to understand the behavior of the model and pick the balance that suits our needs best.
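
The trade-off can be illustrated with a small sketch that applies three decision thresholds to the same scores. The labels, the scores and the threshold values are invented for illustration; they do not come from any of the models in this notebook.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Invented labels and classifier scores (assumption, for illustration only)
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

tradeoff = {}
for threshold in (0.3, 0.5, 0.85):
    # Predict the positive class when the score reaches the threshold
    y_pred = (y_score >= threshold).astype(int)
    tradeoff[threshold] = (precision_score(y_true, y_pred),
                           recall_score(y_true, y_pred))
    precision, recall = tradeoff[threshold]
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data a higher threshold gives higher precision and lower recall. To trace the whole curve instead of three points, scikit-learn's `precision_recall_curve` computes precision and recall across the thresholds in one call.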

Finally, a commonly used metric for evaluating classification models is the Area Under the Curve (AUC), which is computed from the ROC curve. The ROC curve plots TPR (recall) against FPR for each decision threshold. As this metric is also available in the scikit-learn library, we will use it to compare different classification models.
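
A minimal sketch of the ROC curve and its AUC, using scikit-learn's `roc_curve` and `roc_auc_score` on invented labels and scores (an assumption for illustration, not output of the models compared later):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Invented labels and classifier scores (assumption, for illustration only)
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

# One (FPR, TPR) point per decision threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  FPR={f:.2f}  TPR={t:.2f}")

# Area under the ROC curve: 1.0 is a perfect ranking, 0.5 is random guessing
auc_value = roc_auc_score(y_true, y_score)
print(f"AUC = {auc_value:.2f}")
```

The AUC can also be read as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. Here 12 of the 16 positive/negative pairs are ranked correctly, which gives 0.75.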

Let's take a closer look in Python.

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

%matplotlib inline

In [2]:
# import the relevant dataset
df = pd.read_excel('../data/default-of-credit-card-clients.xls',skiprows=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
ID                            30000 non-null int64
LIMIT_BAL                     30000 non-null int64
SEX                           30000 non-null int64
EDUCATION                     30000 non-null int64
MARRIAGE                      30000 non-null int64
AGE                           30000 non-null int64
PAY_0                         30000 non-null int64
PAY_2                         30000 non-null int64
PAY_3                         30000 non-null int64
PAY_4                         30000 non-null int64
PAY_5                         30000 non-null int64
PAY_6                         30000 non-null int64
BILL_AMT1                     30000 non-null int64
BILL_AMT2                     30000 non-null int64
BILL_AMT3                     30000 non-null int64
BILL_AMT4                     30000 non-null int64
BILL_AMT5                     30000 non-null int64
BILL_AMT6               

##### There are 30000 observations and all of them are numeric

In [3]:
X = df.drop('default payment next month',axis=1)

y = df['default payment next month']
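
Since accuracy has issues with skewed datasets, it can help to check the class balance of the target before choosing a metric. The sketch below uses a made-up stand-in Series `y_demo` rather than the real column, so the printed shares say nothing about the actual data; on the notebook's data the same call would be applied to `y`.

```python
import pandas as pd

# Made-up stand-in for the target column (assumption, not the real data)
y_demo = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
                   name='default payment next month')

# Share of each class in the target
class_share = y_demo.value_counts(normalize=True)
print(class_share)

# A classifier that always predicts the majority class reaches this accuracy
majority_baseline = class_share.max()
print(f"Accuracy of always predicting the majority class: {majority_baseline:.2f}")
```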

In [4]:
# split the data into the training, validation and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
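
The two calls above first hold out 20% of the rows as a test set and then 20% of the remainder as a validation set. As a hedged variation that is not part of the original notebook, the sketch below adds `stratify` so that each part keeps roughly the same class ratio, which can matter for skewed targets. The arrays `X_demo` and `y_demo` are synthetic stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows, 20% positive class
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 200 + [0] * 800)

# First split off the test set, then split a validation set from the rest
X_tmp, X_test_d, y_tmp, y_test_d = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)
X_train_d, X_val_d, y_train_d, y_val_d = train_test_split(
    X_tmp, y_tmp, test_size=0.2, random_state=42, stratify=y_tmp)

for name, part in [('train', y_train_d), ('val', y_val_d), ('test', y_test_d)]:
    print(f"{name:5s} rows={len(part):4d}  positive share={part.mean():.3f}")
```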

In [5]:
# prepare the parameters for the grid search for each of the classifiers

modelnames = ['LR', 'KNN', 'SVM', 'RF', 'GB']

models = {'LR':LogisticRegression(random_state=42),
         'KNN':KNeighborsClassifier(),
         'SVM':SVC(random_state=42),
         'RF':RandomForestClassifier(random_state=42),
         'GB':GradientBoostingClassifier(random_state=42)}

parameters = {'LR':{'C':[0.1,1,2]},
             'KNN':{'n_neighbors':[3,5,7]},
             'SVM':{'kernel':['linear','poly','rbf']},
             'RF':{'max_depth':[3,5,7]},
             'GB':{'learning_rate':[0.05,0.1,0.2]}}
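
Each grid above lists three values of a single hyperparameter, so every model has three candidates. A small sketch with scikit-learn's `ParameterGrid` makes the amount of work explicit; the 3 folds are an assumption here, chosen to match the `cv=3` used in the grid search cell.

```python
from sklearn.model_selection import ParameterGrid

# Same grids as in the cell above
parameters = {'LR': {'C': [0.1, 1, 2]},
              'KNN': {'n_neighbors': [3, 5, 7]},
              'SVM': {'kernel': ['linear', 'poly', 'rbf']},
              'RF': {'max_depth': [3, 5, 7]},
              'GB': {'learning_rate': [0.05, 0.1, 0.2]}}

n_folds = 3  # assumption: 3-fold cross-validation
fits_per_model = {}
for name, grid in parameters.items():
    candidates = list(ParameterGrid(grid))  # every parameter combination
    fits_per_model[name] = len(candidates) * n_folds
    print(f"{name}: {len(candidates)} candidates -> {fits_per_model[name]} fits")
```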

In [None]:
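# score every candidate on both AUC and accuracy; the best model is selected by AUC
scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}
results = {}

for m in models:
    cv = GridSearchCV(models[m], parameters[m], cv=3, scoring=scoring, refit='AUC', verbose=2, n_jobs=3)
    cv.fit(X_train, y_train)
    # keep the cross-validation scores of every candidate for later comparison
    results[m] = cv.cv_results_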

Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done   9 out of   9 | elapsed:    2.0s remaining:    0.0s
[Parallel(n_jobs=3)]: Done   9 out of   9 | elapsed:    2.0s finished


Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done   9 out of   9 | elapsed:    9.3s remaining:    0.0s
[Parallel(n_jobs=3)]: Done   9 out of   9 | elapsed:    9.3s finished
[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.


Fitting 3 folds for each of 3 candidates, totalling 9 fits
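
Once the searches have finished, the stored `cv_results_` dictionaries can be condensed into one comparison table. The sketch below is self-contained and therefore runs on synthetic data from `make_classification` with only two of the models; the `demo_` names and the `summary` table are hypothetical, and the scores it prints are not results for the credit-card data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): the credit-card data is not loaded here
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

demo_models = {'LR': LogisticRegression(max_iter=1000, random_state=42),
               'KNN': KNeighborsClassifier()}
demo_parameters = {'LR': {'C': [0.1, 1, 2]},
                   'KNN': {'n_neighbors': [3, 5, 7]}}
scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}

# Same pattern as the loop above: keep cv_results_ per model
results = {}
for m in demo_models:
    cv = GridSearchCV(demo_models[m], demo_parameters[m], cv=3,
                      scoring=scoring, refit='AUC')
    cv.fit(X_demo, y_demo)
    results[m] = cv.cv_results_

# For each model, pick the candidate ranked first by mean cross-validated AUC
rows = []
for m, res in results.items():
    best = res['rank_test_AUC'].argmin()
    rows.append({'model': m,
                 'best_params': res['params'][best],
                 'mean_test_AUC': res['mean_test_AUC'][best],
                 'mean_test_Accuracy': res['mean_test_Accuracy'][best]})

summary = pd.DataFrame(rows).sort_values('mean_test_AUC', ascending=False)
print(summary)
```

With multi-metric scoring, `cv_results_` contains one `mean_test_<name>` and one `rank_test_<name>` entry per scorer name, which is why the keys end in `AUC` and `Accuracy`.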
