## Classification Metrics

## Model evaluation

One needs a way to choose between models: different model types, tuning parameters, and features. In general, one can assess a model either with summary measures of goodness of fit (such as $R^2$) or by estimating its predictive ability (for example, using k-fold cross-validation).

- Use a **goodness of fit** metric to assess how closely the model fits the data.

- Use a **model evaluation procedure** to estimate the predictive ability. That is, how well a model will generalize to out-of-sample data.

### Goodness of fit metric

The goodness of fit of a statistical model describes how well it fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under the model in question. Goodness of fit metrics vary in their quality and reliability.

Some goodness of fit measures are comparative: they judge a model against an alternative rather than in isolation. For example, the likelihood ratio test statistic measures goodness of fit by testing whether an expanded form of the model provides a substantially improved fit.



### Model evaluation procedures


**Train/test split**
  - Split the dataset into two pieces (usually of different sizes), so that the model can be trained and tested on different data.
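
A train/test split can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the helper name `train_test_split`, the `test_fraction` and `seed` parameters, and the toy data are all assumptions made for this example.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of `rows` and return (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(20))                      # stand-in for 20 labeled examples
train, test = train_test_split(data, test_fraction=0.25, seed=42)
print(len(train), len(test))                # 15 5
```

Shuffling before splitting matters when the rows are stored in a meaningful order (for example, sorted by class), since otherwise the test set may not resemble the training set.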

**K-fold cross-validation**
  - Systematically create "K" train/test splits and average the results together.

K-fold cross-validation is a simple, intuitive way to estimate prediction error. It partitions the data into $k$ equally sized segments (called ‘folds’). One fold is held out for validation while the other $k-1$ folds are used to train the model, which is then used to predict the target variable in the held-out fold. This process is repeated $k$ times, so that each fold serves as the validation set once, and the performance on each hold-out fold is tracked using a metric such as accuracy.
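
The procedure can be sketched as follows. This is only an illustration under stated assumptions: the helper names (`k_fold_indices`, `cross_validate`), the majority-class "model", and the toy data are hypothetical, and the folds are taken in order without shuffling to keep the example short.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def cross_validate(features, targets, k, fit, predict):
    """Return the mean hold-out accuracy over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(features), k):
        model = fit([features[i] for i in train_idx],
                    [targets[i] for i in train_idx])
        correct = sum(predict(model, features[i]) == targets[i] for i in val_idx)
        scores.append(correct / len(val_idx))
    return sum(scores) / k

# Hypothetical toy model: always predict the most common training label.
def fit_majority(train_features, train_targets):
    return Counter(train_targets).most_common(1)[0][0]

def predict_majority(model, feature):
    return model

features = list(range(10))
targets = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]
mean_accuracy = cross_validate(features, targets, 5, fit_majority, predict_majority)
print(mean_accuracy)  # 0.8
```

In practice the data would usually be shuffled before the folds are formed, and any model with a fit step and a predict step could be passed in place of the toy functions.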

### Recall, Precision and Accuracy


#### Classification outcomes

* _true positives_
* _true negatives_
* _false positives_ (type I errors)
* _false negatives_ (type II errors)

Precision is the number of true positives divided by the number of samples predicted to be positive.

$$\text{Precision}=\frac{tp}{tp+fp} \, $$

Recall is the number of true positives divided by the number of samples that are actually positive.

$$\text{Recall}=\frac{tp}{tp+fn} \, $$

Recall in this context is also referred to as the true positive rate or sensitivity, and precision is also referred to as the positive predictive value (PPV). Other related measures used in classification include the true negative rate, also called specificity, and accuracy.

$$\text{True negative rate}=\frac{tn}{tn+fp} \, $$

$$\text{Accuracy}=\frac{tp+tn}{tp+tn+fp+fn} \, $$
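
As a worked example, the four outcome counts can be tallied from a list of actual labels and a list of predicted labels, and the formulas above applied directly. The helper name `count_outcomes` and the ten toy labels are assumptions made for this sketch.

```python
def count_outcomes(actual, predicted, positive=1):
    """Return (tp, fp, tn, fn) for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive:
            fp += 1          # predicted positive, actually negative (type I)
        elif a == positive:
            fn += 1          # predicted negative, actually positive (type II)
        else:
            tn += 1          # predicted negative, actually negative
    return tp, fp, tn, fn

actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tp, fp, tn, fn = count_outcomes(actual, predicted)   # 3, 1, 5, 1

precision   = tp / (tp + fp)
recall      = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy    = (tp + tn) / (tp + tn + fp + fn)
print(precision, recall, round(specificity, 3), accuracy)  # 0.75 0.75 0.833 0.8
```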

See [What's the difference between accuracy and precision? - Matt Anticole](https://youtu.be/hRAFPdDppzs)


### Sensitivity and Specificity

#### Sensitivity  

_Sensitivity_ (also called the _true positive rate_, the _recall_, or the _probability of detection_ in some fields) measures the proportion of actual positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition).

$$\text{Sensitivity}=\frac{tp}{tp+fn} \, $$

Note that _sensitivity_ is the same quantity as _recall_:

$$\text{Recall}=\frac{tp}{tp+fn} \, $$

Note that the denominator of _sensitivity_ (or _recall_) is the number of actual positives: those correctly predicted positive ($tp$) plus those missed ($fn$).

Suppose you had a poisonous mushroom classifier. You would want the _sensitivity_ to be very high, so that it catches almost all instances of poisonous mushrooms.

#### Specificity  

_Specificity_ (also called the _true negative rate_) measures the proportion of negatives that are correctly identified as such (i.e., the percentage of healthy people who are correctly identified as not having the condition).

$$\text{Specificity}=\frac{tn}{tn+fp} \, $$

Note that _specificity_ is the same quantity as the _true negative rate_:

$$\text{True negative rate}=\frac{tn}{tn+fp} \, $$

Note that the denominator of _specificity_ (or the _true negative rate_) is the number of actual negatives: those correctly predicted negative ($tn$) plus those falsely flagged as positive ($fp$).

For the same poisonous mushroom classifier, you would care less whether the _specificity_ was high. Falsely labeling an edible mushroom as poisonous is much less of an issue than falsely labeling a poisonous mushroom as edible.

These concepts are often expressed using the following terms. Given

* (number of) positive samples (P)
* (number of) negative samples (N)
* (number of) true positives (TP), equivalent to a hit
* (number of) true negatives (TN), equivalent to a correct rejection
* (number of) false positives (FP), equivalent to a false alarm or Type I error
* (number of) false negatives (FN), equivalent to a miss or Type II error


Then

* sensitivity or true positive rate (TPR) or hit rate or recall 
 
 $$\mathit{TPR} = \mathit{TP} / P = \mathit{TP} / (\mathit{TP}+\mathit{FN})$$  
 
* specificity (SPC) or true negative rate

 $$\mathit{SPC} = \mathit{TN} / N = \mathit{TN} / (\mathit{TN}+\mathit{FP}) $$
 
 
* precision or positive predictive value (PPV)  

 $$\mathit{PPV} = \mathit{TP} / (\mathit{TP} + \mathit{FP})$$
 
 
* negative predictive value (NPV)

 $$\mathit{NPV} = \mathit{TN} / (\mathit{TN} + \mathit{FN})$$
 
 
* fall-out or false positive rate (FPR)

 $$\mathit{FPR} = \mathit{FP} / N = \mathit{FP} / (\mathit{FP} + \mathit{TN}) = 1-\mathit{SPC}$$
 
 
* false negative rate (FNR)

 $$\mathit{FNR} = \mathit{FN} / (\mathit{TP} + \mathit{FN}) = 1-\mathit{TPR}$$
 
 
* false discovery rate (FDR)

 $$\mathit{FDR} = \mathit{FP} / (\mathit{TP} + \mathit{FP}) = 1 - \mathit{PPV} $$
 

* accuracy (ACC)

 $$\mathit{ACC} = (\mathit{TP} + \mathit{TN}) / (\mathit{TP} + \mathit{FP} + \mathit{FN} + \mathit{TN})$$
 
* F1 score 

 $$\mathit{F1} = 2 \mathit{TP} / (2 \mathit{TP} + \mathit{FP} + \mathit{FN})$$

* Matthews correlation coefficient (MCC)

 $$\mathit{MCC} = \frac{ \mathit{TP} \times \mathit{TN} - \mathit{FP} \times \mathit{FN} } {\sqrt{ (\mathit{TP}+\mathit{FP}) ( \mathit{TP} + \mathit{FN} ) ( \mathit{TN} + \mathit{FP} ) ( \mathit{TN} + \mathit{FN} ) } }$$

* Youden's J statistic

 $$J = \mathit{TPR} + \mathit{SPC} - 1$$

* markedness (MK)

 $$\mathit{MK} = \mathit{PPV} + \mathit{NPV} - 1$$
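
The combined scores at the end of this list can be computed directly from the four counts. The sketch below is a hedged illustration: the function name `derived_metrics`, the example counts, and the choice to return 0 for MCC when its denominator is zero are assumptions of this example, and the rate calculations assume every class and every prediction type occurs at least once.

```python
import math

def derived_metrics(tp, fp, tn, fn):
    """Return F1, MCC, Youden's J and markedness from confusion counts.

    Assumes tp+fn, tn+fp, tp+fp and tn+fn are all nonzero.
    """
    tpr = tp / (tp + fn)            # sensitivity / recall
    spc = tn / (tn + fp)            # specificity
    ppv = tp / (tp + fp)            # precision
    npv = tn / (tn + fn)            # negative predictive value
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # assumed convention
    return {"F1": f1, "MCC": mcc, "J": tpr + spc - 1, "MK": ppv + npv - 1}

# Hypothetical counts for illustration only.
metrics = derived_metrics(tp=40, fp=10, tn=45, fn=5)
print({name: round(value, 3) for name, value in metrics.items()})
# {'F1': 0.842, 'MCC': 0.704, 'J': 0.707, 'MK': 0.7}
```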



### Sensitivity and Specificity Tradeoff

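The sensitivity and specificity of a classifier depend on the decision threshold it uses. Lowering the threshold labels more samples as positive, which tends to raise sensitivity and lower specificity; raising the threshold does the opposite. This is the sensitivity and specificity tradeoff.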
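
A small sketch makes the dependence on the threshold concrete. The scores and labels below are made up for illustration, and the helper `sensitivity_specificity` is a hypothetical name; a sample is treated as predicted positive when its score is at or above the threshold.

```python
# Hypothetical classifier scores (higher means "more likely positive")
# and the true labels for eight samples.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) when predicting positive if score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

results = {t: sensitivity_specificity(labels, scores, t) for t in (0.25, 0.5, 0.75)}
for threshold, (sens, spec) in results.items():
    print(threshold, sens, spec)
# 0.25 1.0 0.5
# 0.5 0.75 0.75
# 0.75 0.5 1.0
```

As the threshold rises from 0.25 to 0.75, sensitivity falls from 1.0 to 0.5 while specificity rises from 0.5 to 1.0.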

[Sensitivity and specificity tradeoff - Model Building and Validation](https://youtu.be/5XMVhOZ5KMg)

[The tradeoff between sensitivity and specificity](https://youtu.be/vtYDyGGeQyo)


### Confusion Matrix


In predictive analytics, a _table of confusion_, or _confusion matrix_, is a table with two rows and two columns that reports the number of _false positives_, _false negatives_, _true positives_, and _true negatives_. This allows more detailed analysis than the mere proportion of correct guesses (accuracy). Accuracy alone is not a reliable measure of a classifier's real performance, because it yields misleading results if the data set is unbalanced (that is, when the number of samples in different classes varies greatly). For example, if there were 95 cats and only 5 dogs in the data set, the classifier could easily be biased into classifying all the samples as cats. The overall accuracy would be 95%, but in practice the classifier would have a 100% recognition rate for the cat class and a 0% recognition rate for the dog class.

| Confusion Matrix    | Actual positives                 | Actual negatives                |
|---------------------|----------------------------------|---------------------------------|
| Predicted positives | True positives                   | False positives (Type I errors) |
| Predicted negatives | False negatives (Type II errors) | True negatives                  |
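
The cat and dog example above can be reproduced with a short sketch. The "classifier" here is simply a list of predictions that says cat for every sample, which is an assumption made to mirror the text; the confusion matrix is stored as a `Counter` keyed by (predicted, actual) pairs.

```python
from collections import Counter

actual = ["cat"] * 95 + ["dog"] * 5
predicted = ["cat"] * 100            # biased classifier: always answers "cat"

# Confusion matrix as counts of (predicted, actual) pairs.
matrix = Counter(zip(predicted, actual))

accuracy = (matrix[("cat", "cat")] + matrix[("dog", "dog")]) / len(actual)
cat_recall = matrix[("cat", "cat")] / actual.count("cat")
dog_recall = matrix[("dog", "dog")] / actual.count("dog")

print(accuracy, cat_recall, dog_recall)   # 0.95 1.0 0.0
```

The accuracy looks high, yet the per-class recognition rates show that no dog is ever recognized.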


#### ROC Curves

In statistics, a _receiver operating characteristic curve_, or _ROC curve_, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied.

The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, or 'probability of detection'.

 
$$\text{True positive rate}=\frac{tp}{tp+fn} \, $$ 

 
The false positive rate is also known as the fall-out or 'probability of false alarm' and can be calculated as (1 − specificity).

$$\text{False positive rate}=\frac{fp}{tn+fp} \, $$

  
$$\text{Specificity}=\frac{tn}{tn+fp} \, $$

Note that _specificity_ is the same quantity as the _true negative rate_:

$$\text{True negative rate}=\frac{tn}{tn+fp} \, $$
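
The points of a ROC curve can be computed by sweeping the threshold over the observed scores and recording the false positive rate and true positive rate at each setting. This is a minimal sketch, assuming made-up scores and labels and a hypothetical helper named `roc_points`; it assumes both classes are present, and plotting is left out.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, starting at (0, 0), one per distinct threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]            # threshold above every score: nothing is positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical scores and true labels for eight samples.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

points = roc_points(labels, scores)
for fpr, tpr in points:
    print(fpr, tpr)
```

Each printed pair is one point on the curve; the sequence runs from (0, 0), where nothing is labeled positive, to (1, 1), where everything is.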

![ROC Curves](images/ROC_curves.png)