
Model Evaluation Measures for Classification
---
*   Metrics Based on Confusion Matrix 
*   Area under ROC
*   KS
*   Gini
*   Lift and Gain
*   Concordance
*   Somers' D

**Classification**: Classification models predict either one of two outcomes (binary classification) or one of several classes (multiclass classification).

A **confusion matrix** is a way to visualize the results of a classification model at a specific decision threshold.


Class | Predicted Negative | Predicted Positive
--- | --- | ---
**Actual Negative** | **True Negative (TN)** | **False Positive (FP)**
**Actual Positive** | **False Negative (FN)** | **True Positive (TP)**


A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.

A false positive is an outcome where the model incorrectly predicts the positive class. And a false negative is an outcome where the model incorrectly predicts the negative class.
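The four outcomes can be counted on a small example. The sketch below uses made-up labels and scores (they are not taken from `output.csv`) and a hypothetical helper, `counts_at`, to show that the same scores give different confusion-matrix counts at different decision thresholds.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy data: true labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.3, 0.2, 0.7, 0.55])

def counts_at(threshold):
    """Return (TN, FP, FN, TP) when scores above `threshold` are labeled positive."""
    y_hat = (scores > threshold).astype(int)
    tn_, fp_, fn_, tp_ = confusion_matrix(y_true, y_hat).ravel()
    return int(tn_), int(fp_), int(fn_), int(tp_)

print("threshold 0.50:", counts_at(0.5))
print("threshold 0.25:", counts_at(0.25))
```

Lowering the threshold moves observations from the "Predicted Negative" column to the "Predicted Positive" column, so false negatives fall while false positives rise.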

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
# import seaborn as sns
d=pd.read_csv(r'output.csv')
d['y_pred']=np.where(d.prob>0.5,1,0)

In [None]:
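# Re-label with a lower threshold (0.4); the confusion-matrix metrics below depend on whichever threshold was applied last
d['y_pred']=np.where(d.prob>0.4,1,0)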

In [None]:
tn, fp, fn, tp =confusion_matrix(d.y, d.y_pred).ravel()
print(tn,fp,fn,tp)

22280 7644 511 3473


Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right.

$Accuracy = \frac{TP+TN}{TP+FN+TN+FP}$


In [None]:
accuracy=(tp+tn)/(tn+fp+fn+tp)
print("Accuracy is",accuracy)

Accuracy is 0.759496284062758


In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(d.y, d.y_pred)

0.8162675474814203

$TruePositiveRate = \frac{TP}{TP+FN}$; also called Sensitivity or Recall.

In [None]:
true_positive_rate=tp/(tp+fn)
print("True Positive Rate is",true_positive_rate)

True Positive Rate is 0.7929216867469879


$TrueNegativeRate=\frac{TN}{TN+FP}$; also called Specificity.

In [None]:
true_negative_rate=tn/(tn+fp)
print("True Negative Rate is",true_negative_rate)

True Negative Rate is 0.8193757519048256


$Precision=\frac{TP}{TP+FP}$ 

In [None]:
precision=tp/(tp+fp)
print("Precision is",precision)

Precision is 0.3688696870621205


In [None]:
from sklearn.metrics import precision_score
precision_score(d.y, d.y_pred)

0.3688696870621205

$FalsePositiveRate= \frac{FP}{FP+TN}$

In [None]:
false_positive_rate=fp/(fp+tn)
print("False Positive Rate is",false_positive_rate)

False Positive Rate is 0.18062424809517444


$FalseNegativeRate=\frac{FN}{FN+TP}$

In [None]:
false_negative_rate=fn/(fn+tp)
print("False Negative Rate is",false_negative_rate)

False Negative Rate is 0.20707831325301204


**F1 Score** The F1 score is the harmonic mean of the precision and recall.

$F1 Score=2\frac{Precision*Recall}{Precision+Recall}$
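
Substituting the definitions of precision and recall gives an equivalent form in terms of the confusion-matrix counts. It shows that true negatives play no role in the F1 score:

$F1 Score = \frac{2 \cdot \frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}{\frac{TP}{TP+FP} + \frac{TP}{TP+FN}} = \frac{2TP}{2TP+FP+FN}$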


In [None]:
f1=2*(precision*true_positive_rate)/(precision+true_positive_rate)
print("F1 Score is",f1)
from sklearn.metrics import f1_score
f1_score(d.y, d.y_pred)

F1 Score is 0.5035065349059611


0.5035065349059611

An **ROC** curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

+ True Positive Rate
+ False Positive Rate

An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives. The following figure shows a typical ROC curve.
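
As a rough illustration, the points of an ROC curve can be listed with `sklearn.metrics.roc_curve`. The four labels and scores below are made up for the example; each row printed is one (FPR, TPR) point obtained at one threshold.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical toy data, not taken from output.csv
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# drop_intermediate=False keeps one point per distinct threshold
fpr, tpr, thresholds = roc_curve(y_true, scores, drop_intermediate=False)

for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```

Reading the rows from top to bottom corresponds to lowering the threshold: both rates can only stay the same or increase, and the curve runs from (0,0) to (1,1).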

**AUC: Area Under the ROC Curve**

AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1).

In [None]:
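from sklearn.metrics import roc_auc_score
# AUC is defined on the ranking of scores, so pass the predicted probabilities.
# Passing hard labels (d.y_pred) evaluates a single threshold only;
# the value printed below was produced from hard labels.
roc_auc_score(d.y,d.prob)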

0.8110308665486703

AUC is desirable for the following two reasons:

1. AUC is **scale-invariant.** It measures how well predictions are ranked, rather than their absolute values.
1. AUC is **classification-threshold-invariant.** It measures the quality of the model's predictions irrespective of what classification threshold is chosen.
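
Both properties can be checked on simulated data. The sketch below assumes synthetic labels and scores (generated with a fixed seed, not the notebook's data): a strictly increasing transformation of the scores leaves the AUC unchanged, whereas hard labels cut at different thresholds give different values.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                      # synthetic 0/1 labels
scores = 0.3 * y_true + rng.normal(0.4, 0.2, size=200)     # positives score higher on average

auc_raw = roc_auc_score(y_true, scores)
auc_scaled = roc_auc_score(y_true, 10 * scores + 5)        # linear rescaling
auc_exp = roc_auc_score(y_true, np.exp(scores))            # monotone, non-linear

print("raw:", auc_raw, "scaled:", auc_scaled, "exp:", auc_exp)

# Hard labels depend on the threshold, so the result changes with it
for th in (0.4, 0.5, 0.6):
    hard = (scores > th).astype(int)
    print("threshold", th, "->", roc_auc_score(y_true, hard))
```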

The **KS (Kolmogorov-Smirnov) test** is a goodness-of-fit test. The one-sample version checks whether a sample comes from a specific distribution; the two-sample version, used here, compares two distributions. It is widely used in the BFSI (banking, financial services and insurance) domain.

**H0**: both cumulative distributions are the same.

**H1**: the cumulative distributions are different.

For a classifier, the two distributions are the predicted scores of events (positives) and non-events (negatives). The KS statistic is the largest gap between their cumulative distributions, so it shows how well the model discriminates between the two groups.
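
One way to see the link to the ROC curve: at any threshold, the gap between the two cumulative distributions equals TPR minus FPR, so the KS statistic is the largest vertical distance between the ROC curve and the diagonal. A minimal sketch on synthetic data (the labels and scores are simulated, not the notebook's):

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import roc_curve

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=1000)            # synthetic 0/1 labels
scores = rng.normal(loc=y_true * 1.0, scale=1.0)  # events are shifted upward by 1

# KS from the ROC curve: maximum of TPR - FPR over all thresholds
fpr, tpr, _ = roc_curve(y_true, scores, drop_intermediate=False)
ks_from_roc = np.max(tpr - fpr)

# KS from the two-sample test on the score distributions
ks_scipy = ks_2samp(scores[y_true == 0], scores[y_true == 1]).statistic

print(ks_from_roc, ks_scipy)
```

The two values agree here because events score higher than non-events on average; if the model ranked them the other way round, the absolute gap would have to be used.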

In [None]:
def ks(data=None, target=None, prob=None):
    """Build a decile table with cumulative event/non-event rates, KS and lift."""
    data = data.copy()  # avoid adding helper columns to the caller's DataFrame
    data['target0'] = 1 - data[target]
    data['bucket'] = pd.qcut(data[prob], 10)  # ten equal-sized score buckets
    grouped = data.groupby('bucket', as_index=False, observed=True)
    kstable = pd.DataFrame()
    kstable['min_prob'] = grouped.min()[prob]
    kstable['max_prob'] = grouped.max()[prob]
    kstable['events'] = grouped.sum()[target]
    kstable['nonevents'] = grouped.sum()['target0']
    kstable['N'] = kstable['events'] + kstable['nonevents']
    # Highest-scoring bucket first, so decile 1 is the riskiest group
    kstable = kstable.sort_values(by="min_prob", ascending=False).reset_index(drop=True)
    kstable['event_rate'] = (kstable.events / data[target].sum()).apply('{0:.2%}'.format)
    kstable['nonevent_rate'] = (kstable.nonevents / data['target0'].sum()).apply('{0:.2%}'.format)
    kstable['cum_eventrate'] = (kstable.events / data[target].sum()).cumsum()
    kstable['cum_noneventrate'] = (kstable.nonevents / data['target0'].sum()).cumsum()
    # KS per decile: gap between the cumulative event and non-event rates, in percent
    kstable['KS'] = np.round(kstable['cum_eventrate'] - kstable['cum_noneventrate'], 3) * 100
    # Cumulative lift: share of events captured divided by share of population covered
    kstable['Lift'] = kstable['cum_eventrate'] / (kstable['N'].cumsum() / kstable['N'].sum())
    # Formatting
    kstable['cum_eventrate'] = kstable['cum_eventrate'].apply('{0:.2%}'.format)
    kstable['cum_noneventrate'] = kstable['cum_noneventrate'].apply('{0:.2%}'.format)
    kstable.index = range(1, 11)
    kstable.index.rename('Decile', inplace=True)
    pd.set_option('display.max_columns', 11)
    return kstable

In [None]:
ks(data=d,target="y", prob="prob")

Decile | min_prob | max_prob | events | nonevents | N | event_rate | nonevent_rate | cum_eventrate | cum_noneventrate | KS | Lift
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
1 | 0.808292 | 1.0 | 1761 | 1630 | 3391 | 44.20% | 5.45% | 44.20% | 5.45% | 38.8 | 4.41992
2 | 0.591551 | 0.808214 | 1043 | 2348 | 3391 | 26.18% | 7.85% | 70.38% | 13.29% | 57.1 | 3.518869
3 | 0.432037 | 0.591457 | 583 | 2808 | 3391 | 14.63% | 9.38% | 85.02% | 22.68% | 62.3 | 2.833668
4 | 0.324549 | 0.431987 | 278 | 3112 | 3390 | 6.98% | 10.40% | 91.99% | 33.08% | 58.9 | 2.299858
5 | 0.251025 | 0.324507 | 155 | 3236 | 3391 | 3.89% | 10.81% | 95.88% | 43.89% | 52.0 | 1.917671
6 | 0.190676 | 0.251022 | 85 | 3306 | 3391 | 2.13% | 11.05% | 98.02% | 54.94% | 43.1 | 1.633602
7 | 0.139222 | 0.190664 | 46 | 3344 | 3390 | 1.15% | 11.17% | 99.17% | 66.11% | 33.1 | 1.416774
8 | 0.095562 | 0.139178 | 22 | 3369 | 3391 | 0.55% | 11.26% | 99.72% | 77.37% | 22.4 | 1.246567
9 | 0.05825 | 0.09554 | 7 | 3384 | 3391 | 0.18% | 11.31% | 99.90% | 88.68% | 11.2 | 1.110003
10 | 0.000146 | 0.058247 | 4 | 3387 | 3391 | 0.10% | 11.32% | 100.00% | 100.00% | 0.0 | 1.0


In [None]:
from scipy.stats import ks_2samp
ks_2samp(d.loc[d.y==0,"prob"], d.loc[d.y==1,"prob"])

Ks_2sampResult(statistic=0.6237771229282858, pvalue=0.0)

**Gain and Lift**

Gain and lift measure how much better one can expect to do with the predictive model than without one. Observations are ranked by predicted probability and split into groups (typically deciles); cumulative gain is the share of all events captured up to a given group, and lift is that share divided by the share of the population covered. These metrics are popular in marketing analytics, but they are also used in other domains such as risk modeling and supply chain analytics. They also help to pick the best predictive model among multiple challenger models.
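
A minimal sketch of a gain and lift table, assuming made-up labels and scores and a hypothetical helper named `gain_lift_table` (it is not part of the notebook). Observations are ranked by score, split into equal-sized bins, and the cumulative share of events is compared with the cumulative share of the population.

```python
import numpy as np
import pandas as pd

def gain_lift_table(y_true, scores, n_bins=5):
    """Hypothetical helper: cumulative gain and lift per score-ranked bin."""
    df = pd.DataFrame({"y": y_true, "score": scores})
    df = df.sort_values("score", ascending=False).reset_index(drop=True)
    # Equal-sized bins by rank; bin 1 holds the highest scores
    df["bin"] = (np.arange(len(df)) * n_bins // len(df)) + 1
    out = df.groupby("bin").agg(n=("y", "size"), events=("y", "sum"))
    out["cum_gain"] = out["events"].cumsum() / out["events"].sum()
    out["cum_pop"] = out["n"].cumsum() / out["n"].sum()
    out["cum_lift"] = out["cum_gain"] / out["cum_pop"]
    return out

# Toy data: 10 observations, 4 events
y_toy = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
s_toy = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
table = gain_lift_table(y_toy, s_toy)
print(table)
```

In this toy example the top 20% of observations contain half of all events, so the lift in the first bin is 2.5; by construction the cumulative lift in the last bin is always 1.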

**Concordant** : Percentage of pairs where the observation with the desired outcome (event) has a higher predicted probability than the observation without the outcome (non-event).

**Discordant** : Percentage of pairs where the observation with the desired outcome (event) has a lower predicted probability than the observation without the outcome (non-event).

**Tied**: Percentage of pairs where the observation with the desired outcome (event) has the same predicted probability as the observation without the outcome (non-event).

In [None]:
Event=d.loc[d.y==1]
Non_event=d.loc[d.y==0]

In [None]:
Pairs=0
Conc=0
Disc=0
Tie=0
for i in Event.prob:
  for j in Non_event.prob:
    Pairs+=1
    if (i>j):
      Conc+=1
    elif (i<j):
      Disc+=1
    else:
      Tie+=1
print("===========================================================")
print("Total pairs",Pairs)
print("Total Conc", Conc)
print("Total Disc", Disc)
print("Total Tie", Tie)
print("The percentage of Concordance",round(Conc/Pairs*100,3),"%")
print("The percentage of Discordance",round(Disc/Pairs*100,3),"%")
print("The percentage of Tie",round(Tie/Pairs*100,3),"%")

Total pairs 119217216
Total Conc 105291696
Total Disc 13925520
Total Tie 0
The percentage of Concordance 88.319 %
The percentage of Discordance 11.681 %
The percentage of Tie 0.0 %
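
The double loop visits every event/non-event pair, which is slow for over a hundred million pairs. As a sketch of one alternative, the same counts can be obtained by sorting the non-event scores once and using `np.searchsorted`; the helper name `concordance_counts` and the toy scores are assumptions made for illustration.

```python
import numpy as np

def concordance_counts(event_scores, nonevent_scores):
    """Hypothetical helper: (pairs, concordant, discordant, tied) without a double loop."""
    ev = np.asarray(event_scores)
    ne = np.sort(np.asarray(nonevent_scores))
    # For each event score: number of non-event scores strictly below it
    below = np.searchsorted(ne, ev, side="left")
    # ... and number of non-event scores less than or equal to it
    below_or_equal = np.searchsorted(ne, ev, side="right")
    conc = int(below.sum())
    tie = int((below_or_equal - below).sum())
    pairs = ev.size * ne.size
    disc = pairs - conc - tie
    return pairs, conc, disc, tie

# Toy scores: 3 events and 4 non-events give 12 pairs
print(concordance_counts([0.9, 0.6, 0.4], [0.1, 0.4, 0.7, 0.2]))
```

On the notebook's data this would be called as `concordance_counts(Event.prob, Non_event.prob)`.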


**Gini (Somers' D)**

It is a common measure for assessing the predictive power of a credit risk model. It measures the degree to which the model discriminates better than a model with random scores.

$\text{Somers' D} = \frac{\text{Concordant Percent} - \text{Discordant Percent}}{100}$



In [None]:
# Tied pairs do not enter the numerator
print("Somers D is",(Conc-Disc)/Pairs)

Somers D is 0.7663840766085328


In [None]:
from scipy.stats import mannwhitneyu
mannwhitneyu(d.loc[d.y==0,"prob"], d.loc[d.y==1,"prob"])

MannwhitneyuResult(statistic=13925520.0, pvalue=0.0)
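
The concordance counts, Somers' D, the AUC and the Mann-Whitney U statistic are linked. The arithmetic below reuses the numbers printed above; it assumes the reported U statistic counts the pairs in which the first sample (non-events) scores higher, which is what its equality with the discordant count suggests.

```python
# Counts copied from the concordance output above
pairs, conc, disc, tie = 119217216, 105291696, 13925520, 0

# Somers' D from concordant and discordant pairs
somers_d = (conc - disc) / pairs

# AUC is the concordant share, with ties counted as one half
auc_from_pairs = (conc + 0.5 * tie) / pairs

# Gini coefficient expressed through the AUC
gini_from_auc = 2 * auc_from_pairs - 1

# Mann-Whitney U with non-events as the first sample (value copied from above)
u_nonevents = 13925520.0
auc_from_u = 1 - u_nonevents / pairs

print("Somers' D     :", somers_d)
print("AUC (pairs)   :", auc_from_pairs)
print("Gini = 2AUC-1 :", gini_from_auc)
print("AUC (from U)  :", auc_from_u)
```

With no ties, Somers' D and the Gini coefficient coincide, and both equal twice the AUC minus one.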