# General workflow for ML notebook

1. Import dataset, do basic data cleaning / feature generation
1. Train / test split
    1. Often use a time-based split for this (e.g. Jan - May train, June test)
1. Do a bunch of EDA on data
    1. Distributions of each column
    1. Looking at how to group categoricals
    1. Look at how to impute nulls
    1. Relationships of each column with outcome
    1. Correlations of columns
1. Baseline model (simple logistic regression)
1. Fit models on training set
    1. Use an sklearn pipeline to do main preprocessing / model fitting
    1. Use sklearn classification report to understand precision / recall / f1 score
1. Do hyperparameter optimization
    1. Can use BayesSearchCV (from scikit-optimize) or sklearn's GridSearchCV, cross-validated over a stratified k-fold
1. Take the best hyperparameters, train a model on the entire train set
1. Analyze test set performance
1. Feature importance plots
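
The steps above can be sketched end to end roughly as follows. This is a minimal illustration, not a reference implementation: the dataset is synthetic, every column name (`month`, `amount`, `channel`, `label`) is hypothetical, and `GridSearchCV` is swapped in for `BayesSearchCV` only because it ships with sklearn. Logistic regression doubles as the baseline and the tuned model here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for a real dataset (all column names are hypothetical)
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "month": rng.integers(1, 7, size=n),  # 1..6 = Jan..June
    "amount": rng.exponential(100.0, size=n),
    "channel": rng.choice(["web", "app", "store"], size=n),
})
df["label"] = (df["amount"] + rng.normal(0, 50, size=n) > 150).astype(int)
df.loc[rng.choice(n, size=30, replace=False), "amount"] = np.nan  # inject nulls

# Time-based split: Jan - May train, June test
train = df[df["month"] <= 5]
test = df[df["month"] == 6]
features = ["amount", "channel"]
X_train, y_train = train[features], train["label"]
X_test, y_test = test[features], test["label"]

# Preprocessing and model live in one pipeline, so imputation and scaling
# are re-fit inside each cross-validation fold
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["amount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["channel"]),
])
pipeline = Pipeline([("preprocess", preprocess),
                     ("model", LogisticRegression(max_iter=1000))])

# Hyperparameter search over a stratified k-fold
search = GridSearchCV(
    pipeline,
    param_grid={"model__C": [0.01, 0.1, 1.0, 10.0]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="f1",
)
# With the default refit=True, the best parameters are retrained on the full train set
search.fit(X_train, y_train)

# Test set performance: precision / recall / f1 per class
report = classification_report(y_test, search.predict(X_test))
print(report)
```

Keeping preprocessing inside the pipeline is the design choice that matters most here: it avoids leaking statistics from validation folds into the imputer and scaler.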


# Model evaluation metrics

### Precision and recall

**Precision**: If you classify as a positive, what is the probability it is an actual positive?

$$ \text{Precision} = \frac{\text{Relevant retrieved instances}}{\text{All retrieved instances}} = \frac{TP}{TP + FP}$$

1. Can be of positive class, or of negative class
1. Can be dollar weighted. Eg. of the total dollar transaction amount you classify as fraud, how much is actually fraud?

**Recall**: Of all the positive instances, how many did you capture?

$$ \text{Recall} = \frac{\text{Relevant retrieved instances}}{\text{All relevant instances}} = \frac{TP}{TP + FN}$$

1. Also called "True positive rate"
1. Can be of positive class, or of negative class
1. Can be dollar weighted. Eg. of the total fraudulent transaction volume, how much do you classify as fraud?

Precision and recall are threshold-specific metrics. Also look at the precision-recall curve and the area under it (PR-AUC) to get a full picture of the model.
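
As a small worked example of the definitions above, the sketch below computes count-based and dollar-weighted precision and recall for the positive class. The function name `precision_recall`, the toy labels, and the transaction amounts are all made up for illustration.

```python
def precision_recall(y_true, y_pred, weights=None):
    """Precision and recall for the positive class (label 1).

    weights: optional per-example weights (e.g. transaction amounts) for
    dollar-weighted metrics; defaults to 1 per example.
    """
    if weights is None:
        weights = [1.0] * len(y_true)
    rows = list(zip(y_true, y_pred, weights))
    tp = sum(w for t, p, w in rows if t == 1 and p == 1)
    fp = sum(w for t, p, w in rows if t == 0 and p == 1)
    fn = sum(w for t, p, w in rows if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical toy data: 1 = fraud, 0 = not fraud
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
amounts = [500.0, 100.0, 50.0, 200.0, 10.0, 20.0, 30.0, 150.0]

count_precision, count_recall = precision_recall(y_true, y_pred)
dollar_precision, dollar_recall = precision_recall(y_true, y_pred, amounts)
print(count_precision, count_recall)    # 2/3 and 0.5
print(dollar_precision, dollar_recall)  # 0.75 and 0.75
```

The count-based and dollar-weighted numbers differ because the one large fraudulent transaction that was caught dominates the dollar totals.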


### Recall at exposure

Precision-recall curve is sensitive to underlying fraud prevalence.
1. When fraud prevalence is low, precision at a specific recall will be lower than if fraud prevalence is higher.
1. Can plot recall (y-axis) vs exposure (x-axis):
    1. Exposure: percent of good users that we add friction to, i.e. the false positive rate: false positives / (false positives + true negatives)
    1. To compute, pick the score threshold from the good-user score distribution. E.g. 5% exposure means taking the 95th percentile score among good users. Recall is then the fraction of bad users with a score above that threshold.
1. This works because recall is solely calculated on positive examples, and the threshold for an exposure rate is calculated solely based on negative examples. So the ratio of the two classes does not matter.

![images](./images/recall_at_exposure.png)
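
The computation described above can be sketched in plain Python. The function name `recall_at_exposure` and the toy scores are hypothetical, and the threshold is picked by sorting rather than by an interpolated percentile, so results on small samples may differ slightly from a percentile-based version.

```python
def recall_at_exposure(scores, labels, exposure):
    """Recall on bad users (label 1) when at most `exposure` of good users
    (label 0) score strictly above the threshold."""
    good = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    bad = [s for s, y in zip(scores, labels) if y == 1]
    n_flagged_good = int(exposure * len(good))
    # Threshold comes from the good-user score distribution only
    if n_flagged_good >= len(good):
        threshold = float("-inf")
    else:
        threshold = good[n_flagged_good]
    # Recall comes from the bad-user scores only
    return sum(s > threshold for s in bad) / len(bad)


good_scores = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.6, 0.8]
bad_scores = [0.9, 0.7, 0.5, 0.3]

scores = good_scores + bad_scores
labels = [0] * len(good_scores) + [1] * len(bad_scores)
recall_base = recall_at_exposure(scores, labels, exposure=0.2)

# Same score distributions, but 5x as many good users (lower fraud prevalence)
scores_rare = good_scores * 5 + bad_scores
labels_rare = [0] * (len(good_scores) * 5) + [1] * len(bad_scores)
recall_rare = recall_at_exposure(scores_rare, labels_rare, exposure=0.2)

print(recall_base, recall_rare)  # both 0.75
```

Replicating the good users changes the class ratio but not the good-user score distribution, so the threshold and the recall stay the same.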

### Precision at k

Useful for ranking models. Remember to adjust for position bias in ranking model evaluation.
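
A minimal sketch of the metric, assuming the usual definition (fraction of the top k ranked items that are relevant). The function name and the relevance list are made up, and no position-bias adjustment is attempted here.

```python
def precision_at_k(ranked_relevance, k):
    """ranked_relevance: 0/1 relevance flags, ordered by model score (best first)."""
    top_k = ranked_relevance[:k]
    return sum(top_k) / k


# Hypothetical ranking: 1 = relevant item, 0 = irrelevant item
ranked_relevance = [1, 0, 1, 1, 0, 0, 1]

p_at_3 = precision_at_k(ranked_relevance, 3)  # 2 relevant in the top 3
p_at_5 = precision_at_k(ranked_relevance, 5)  # 3 relevant in the top 5
print(p_at_3, p_at_5)
```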

### F1 score

F1 score is the harmonic mean of precision and recall. Therefore it symmetrically represents both precision and recall in the computation.
1. The point that maximizes F1 score should have high precision and high recall.
1. Ranges between 0 and 1. The closer F1 score is to 1, the better the model's precision and recall are at that threshold.
1. Precision explicitly depends on the ratio of positive to negative test cases (recall does not, since it is computed on positive examples only). That makes F1 score also sensitive to this ratio.


\begin{align*} 
F 1 \text{ Score} &= \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}} \\
\\
&= 2 \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}

F1 is often preferred over accuracy on imbalanced classes because it ignores true negatives ([link](https://www.picsellia.com/post/understanding-the-f1-score-in-machine-learning-the-harmonic-mean-of-precision-and-recall)), but it is not invariant to the class ratio.

You can compute F1 scores for each class (positive and negative, or k classes in multi-class decision making), and then weight the F1 score by number of samples in each class.
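
The support-weighted average described above can be sketched as follows. The helper names and toy labels are hypothetical. As far as I know this corresponds to what sklearn's `f1_score(..., average="weighted")` computes, but the code below is a hand-rolled illustration rather than sklearn's implementation.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 treating `cls` as the positive class: 2TP / (2TP + FP + FN)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def weighted_f1(y_true, y_pred):
    """Average of per-class F1, weighted by each class's share of true labels."""
    n = len(y_true)
    return sum(
        f1_for_class(y_true, y_pred, cls) * y_true.count(cls) / n
        for cls in sorted(set(y_true))
    )


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

f1_pos = f1_for_class(y_true, y_pred, 1)  # 4/7
f1_neg = f1_for_class(y_true, y_pred, 0)  # 2/3
f1_weighted = weighted_f1(y_true, y_pred)  # 13/21
print(f1_pos, f1_neg, f1_weighted)
```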


The $\text{F}\beta$ score is a weighted version of the F1 score.
1. If $\beta$ is bigger than one, then recall will be overweighted.
1. If $\beta$ is less than one, then precision will be overweighted.


\begin{align*} 
F \beta \text{ Score} &= \frac{1 + \beta^2}{\frac{1}{\text{Precision}} + \frac{\beta^2}{\text{Recall}}} \\
\\
&= \frac{(1 + \beta^2) \times \text{Precision} \times \text{Recall}}{(\beta^2 \times \text{Precision} + \text{Recall})}
\end{align*}
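
To see the effect of $\beta$ numerically, here is a small sketch of the formula above. The function name `f_beta` and the example precision and recall values are chosen only for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical model with low precision and perfect recall
precision, recall = 0.5, 1.0

f_half = f_beta(precision, recall, beta=0.5)  # ~0.556, pulled toward precision
f_one = f_beta(precision, recall, beta=1.0)   # ~0.667, plain F1
f_two = f_beta(precision, recall, beta=2.0)   # ~0.833, pulled toward recall
print(f_half, f_one, f_two)
```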

### ROC-AUC

Area under the receiver operating characteristic curve. [link](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc), [good overview of multiple metrics](https://neptune.ai/blog/f1-score-accuracy-roc-auc-pr-auc).

**False positive rate** (x-axis). Intuition: "of all the negatives, how many are false positives"
$$FPR = \frac{FP}{FP + TN}$$

**True positive rate** (recall, y axis). Intuition: "of all the positives, how many are true positives"
$$TPR = \frac{TP}{TP + FN}$$

Some more details:
1. Plot the FPR and TPR for multiple thresholds and take the area under the curve to compute the metric.
1. If you rank all observations by their model scores, AUROC tells you what the probability of ranking a random positive example higher than a random negative example is.
1. AUC is classification-threshold invariant: it measures the quality of predictions irrespective of the threshold.
1. AUC is scale invariant. It measures how well predictions are ranked, rather than their absolute values.
1. It can look overly optimistic on heavily imbalanced data. The intuition: with a large number of true negatives, the false positive rate stays low even when false positives far outnumber true positives.
1. You generally want the curve to bow heavily toward the top-left corner (a concave curve well above the diagonal).
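
The ranking interpretation above can be checked directly by comparing every positive/negative pair. This brute-force sketch is quadratic in the number of examples and is meant only as an illustration; the function name and toy data are made up, and ties are counted as half a win.

```python
def auroc_pairwise(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

auc = auroc_pairwise(scores, labels)  # 8 of 9 pairs ranked correctly

# Scale invariance: a monotonic transform of the scores keeps the ranking
auc_rescaled = auroc_pairwise([s ** 2 for s in scores], labels)
print(auc, auc_rescaled)
```

Only the order of the scores enters the computation, which is why the squared scores give the same value.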

Other points / questions:
1. All of these metrics can be used for threshold selection
1. What does a stratified k-fold do?
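
On the last question: a stratified k-fold splits the data into k folds so that each fold keeps roughly the same class proportions as the full dataset, which matters when positives are rare. The sketch below is a simplified, unshuffled illustration with a hypothetical helper name. In practice sklearn's `StratifiedKFold` would be used instead.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each index to one of k folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds


# Hypothetical imbalanced labels: 4 positives, 16 negatives (20% positive)
labels = [1] * 4 + [0] * 16
folds = stratified_folds(labels, k=4)

positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
fold_sizes = [len(fold) for fold in folds]
print(positives_per_fold, fold_sizes)  # every fold: 1 positive out of 5
```

A plain k-fold on the same ordered labels would put all four positives into the first fold, leaving the other folds with no positives to evaluate on.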