
# 4.2 Accuracy and Dummy Model

**Accuracy** measures the fraction of correct predictions. Specifically, it is the number of correct predictions divided by the total number of predictions.
```
from sklearn.metrics import accuracy_score
accuracy_score(y_val, y_pred >= 0.5)  # churn is predicted when the score is at least 0.5
```

We can change the decision **threshold**; it does not always have to be 0.5. In this particular problem, however, the best decision cutoff, the one associated with the highest accuracy (80%), was indeed 0.5.
```
import numpy as np

thresholds = np.linspace(0, 1, 21)  # 0.00, 0.05, ..., 1.00

scores = []

for t in thresholds:
    score = accuracy_score(y_val, y_pred >= t)
    print('%.2f %.3f' % (t, score))
    scores.append(score)
```

Note that if we build a **dummy model** in which the decision cutoff is 1, so that the algorithm *predicts that no clients will churn*, the accuracy would be 73%. The original model (80%) therefore improves on the dummy model by only about 7 percentage points, which is less than we might expect.
```
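from collections import Counter
Counter(y_pred >= 1.0)  # decisions at threshold 1.0 (all False when no score reaches 1.0)
1 - y_val.mean()        # share of non-churners, i.e. the accuracy of the dummy model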
```

Therefore, in this problem accuracy cannot tell us how good the model is, because the dataset is **unbalanced**: there are many more instances of one category than of the other. This is also known as **class imbalance**.
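
To make the comparison with the dummy model concrete, here is a minimal sketch on made-up labels and scores (the arrays `y_val` and `y_pred` below are illustrative assumptions, not the churn data of this lesson). It computes accuracy by hand as the mean of correct decisions and compares the model with the baseline that never predicts churn.

```python
import numpy as np

# Hypothetical validation labels: 3 churners out of 10 customers (imbalanced).
y_val = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
# Hypothetical predicted churn probabilities from some model.
y_pred = np.array([0.1, 0.2, 0.3, 0.6, 0.1, 0.4, 0.2, 0.7, 0.4, 0.9])


def accuracy(y_true, decisions):
    """Fraction of decisions that match the true labels."""
    return float((y_true == decisions).mean())


model_acc = accuracy(y_val, y_pred >= 0.5)
# The dummy model never predicts churn, so every decision is False.
dummy_acc = accuracy(y_val, np.zeros(len(y_val), dtype=bool))

print(f"model accuracy: {model_acc:.2f}")
print(f"dummy accuracy: {dummy_acc:.2f}")
```

On this toy data the dummy accuracy equals `1 - y_val.mean()`, the share of the majority class, which is why a high accuracy on an unbalanced dataset says little on its own.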

# 4.3 Confusion table

The confusion table is useful when working with unbalanced classes. It is a way of measuring the different types of errors and correct decisions that a binary classifier can make. With this information, the quality of the model can be evaluated using several different metrics.

Each prediction of a logistic regression (LR) model falls into one of four categories:

- Prediction is that the customer WILL churn. This is known as the Positive class

  - And Customer actually churned - Known as a **True Positive (TP)**

    ```
    tp = (predict_positive & actual_positive).sum()
    ```

  - But Customer actually did not churn - Known as a **False Positive (FP)**
    ```
    fp = (predict_positive & actual_negative).sum()
    ```

- Prediction is that the customer WILL NOT churn. This is known as the Negative class

  - Customer did not churn - **True Negative (TN)**

    ```
    tn = (predict_negative & actual_negative).sum()
    ```

  - Customer churned - **False Negative (FN)**
    ```
    fn = (predict_negative & actual_positive).sum()
    ```
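
The four snippets above rely on boolean masks that are not defined in this section. A minimal sketch of how they could be built and arranged into a table follows; the arrays `y_val` and `y_pred` and the threshold `t` are made-up values for illustration.

```python
import numpy as np

# Hypothetical labels and scores, for illustration only.
y_val = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0.1, 0.2, 0.3, 0.6, 0.1, 0.4, 0.2, 0.7, 0.4, 0.9])
t = 0.5  # decision threshold

# Boolean masks for the actual classes and for the model's decisions.
actual_positive = (y_val == 1)
actual_negative = (y_val == 0)
predict_positive = (y_pred >= t)
predict_negative = (y_pred < t)

tp = (predict_positive & actual_positive).sum()
fp = (predict_positive & actual_negative).sum()
tn = (predict_negative & actual_negative).sum()
fn = (predict_negative & actual_positive).sum()

# Rows are actual classes (negative, positive); columns are predictions.
confusion_table = np.array([
    [tn, fp],
    [fn, tp],
])
print(confusion_table)
```

Every validation example lands in exactly one cell, so the four counts add up to `len(y_val)`. Dividing the table by its sum gives the same information as fractions.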


# 4.4 Precision and Recall

**Precision** tells us the fraction of positive predictions that are correct. It takes into account only the instances predicted as positive (TP and FP, the second column of the confusion matrix), as stated in the following formula:

 $$P = \frac{TP}{TP+FP}$$

**Recall** measures the fraction of positive instances that are correctly identified. It considers only the actual positive instances, whether they were predicted as positive or as negative (TP and FN, the second row of the confusion table). The formula for this metric is:

 $$R = \frac{TP}{TP+FN}$$

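In this problem, the precision and recall values were 67% and 54% respectively. A recall of 54% means the model misses about 46% of the customers who actually churn, and a precision of 67% means about a third of the customers flagged as churning do not churn. These measures expose errors of our model that accuracy hid because of the class imbalance.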
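
As a small worked example of the two formulas, the sketch below computes precision and recall from confusion-table counts. The counts are an assumption chosen for illustration so that the results land near the percentages quoted above; they are not taken from this text.

```python
def precision(tp, fp):
    """Fraction of positive predictions that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual positives that were identified: TP / (TP + FN)."""
    return tp / (tp + fn)


# Illustrative counts (an assumption, not data from the text above).
tp, fp, fn = 210, 101, 176

p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision = {p:.3f}, recall = {r:.3f}")
```

Both functions divide by a count that can be zero (for example, precision when the model makes no positive predictions), so a production version would need to decide how to handle that case.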

# 4.5 ROC Curves

**ROC** stands for Receiver Operating Characteristic. The idea was applied during the Second World War to evaluate the strength of radar detectors. This measure considers the **False Positive Rate (FPR)** and the **True Positive Rate (TPR)**, which are derived from the values of the confusion matrix.

**FPR** is the number of false positives (FP) divided by the total number of actual negatives (FP and TN, the first row of the confusion matrix), and we want to **minimize** it.
$$FPR = \frac{FP}{FP+TN}$$


**TPR**, or **recall**, is the number of true positives (TP) divided by the total number of actual positives (FN and TP, the second row of the confusion table), and we want to **maximize** this metric.
$$TPR = \frac{TP}{TP + FN}$$


ROC curves consider TPR (recall) and FPR under all possible thresholds. At a threshold of 0 every customer is predicted to churn, so both TPR and FPR are 1; at a threshold of 1 practically no customer is predicted to churn, so both are 0. The two rates coincide at these extremes, but they have different meanings, as explained above.

We need a point of reference to judge a ROC curve, so the corresponding curves of **random** and **ideal models** are required. It is possible to plot FPR and TPR against the thresholds, or to plot TPR against FPR, which is the ROC curve itself.
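
A minimal sketch of how the points of a ROC curve could be computed by hand is shown below. The helper name `tpr_fpr_at_thresholds` and the toy arrays are assumptions made for illustration; the function simply applies the TPR and FPR formulas at each threshold.

```python
import numpy as np


def tpr_fpr_at_thresholds(y_true, scores, thresholds):
    """Return a list of (threshold, tpr, fpr) tuples, one per threshold."""
    actual_positive = (y_true == 1)
    actual_negative = (y_true == 0)
    rows = []
    for t in thresholds:
        predict_positive = (scores >= t)
        tp = (predict_positive & actual_positive).sum()
        fp = (predict_positive & actual_negative).sum()
        tpr = tp / actual_positive.sum()  # TP / (TP + FN)
        fpr = fp / actual_negative.sum()  # FP / (FP + TN)
        rows.append((float(t), float(tpr), float(fpr)))
    return rows


# Hypothetical labels and scores, for illustration only.
y_val = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0.1, 0.2, 0.3, 0.6, 0.1, 0.4, 0.2, 0.7, 0.4, 0.9])

roc_points = tpr_fpr_at_thresholds(y_val, y_pred, np.linspace(0, 1, 11))
for t, tpr, fpr in roc_points:
    print(f"t={t:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Plotting the `fpr` values on the x-axis against the `tpr` values on the y-axis gives the ROC curve. Running the same function on uniformly random scores, or on scores that separate the classes perfectly, would give the random and ideal reference curves.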

# 4.6 ROC AUC

The ***A***rea ***U***nder the ROC ***C***urve (AUC) tells us how good our model is with a single value. The ROC AUC of a random model is 0.5, while that of an ideal model is 1.

In other words, AUC can be interpreted as the probability that a randomly selected positive example has a greater score than a randomly selected negative example.
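
The probabilistic interpretation can be checked directly by comparing every positive example with every negative example. The sketch below does this on made-up data; the function name `auc_by_pairs` is hypothetical, and counting ties as half a win is a common convention assumed here rather than something stated in these notes.

```python
def auc_by_pairs(y_true, scores):
    """Share of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumed convention: a tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores, for illustration only.
y_val = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0.1, 0.2, 0.3, 0.6, 0.1, 0.4, 0.2, 0.7, 0.4, 0.9]

auc = auc_by_pairs(y_val, y_pred)
print(f"AUC = {auc:.3f}")
```

Comparing all pairs is quadratic in the dataset size, so on large datasets the same quantity is usually estimated by sampling random pairs or computed from the ROC curve.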

# 4.7 Cross-Validation

The full training dataset is divided into **k partitions** (folds). We train the model on k-1 partitions of this dataset and evaluate it on the remaining one. We repeat this until each of the k folds has been used for evaluation once, and then calculate the average of the evaluation metric over all the folds.
```
from sklearn.model_selection import KFold
kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)
```
This method is applied in the **parameter tuning** step, which is the process of selecting the best parameter.


In general, if the dataset is large, we should use the hold-out validation strategy. On the other hand, if the dataset is small, or if we want to know the standard deviation of the model's performance across different folds, we can use the cross-validation approach.
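
To show the mechanics of the procedure, here is a minimal NumPy-only sketch of a k-fold loop that reports the mean and standard deviation of a metric. The helpers `kfold_indices` and `train_and_score` are hypothetical; in particular, `train_and_score` is a stand-in that predicts the majority class rather than a real model, and the labels are made up.

```python
import numpy as np


def kfold_indices(n_samples, n_splits, seed=1):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for i in range(n_splits):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx


def train_and_score(y_train, y_val):
    """Hypothetical stand-in for training: predict the majority class of the
    training folds and return the accuracy on the validation fold."""
    majority = int(y_train.mean() >= 0.5)
    return float((y_val == majority).mean())


# 30 made-up labels with 30% positives.
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1] * 3)

scores = [
    train_and_score(y[train_idx], y[val_idx])
    for train_idx, val_idx in kfold_indices(len(y), n_splits=5)
]
print(f"{np.mean(scores):.3f} +- {np.std(scores):.3f}")
```

For parameter tuning, the same loop would be run once per candidate parameter value, and the value with the best average score would be kept. The standard deviation indicates how stable the model is across folds.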

# Notes

**F1 score:**
$$F_1 = \frac{2PR}{P+R}$$
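
As a quick worked example, the sketch below plugs the precision and recall quoted earlier in these notes (67% and 54%) into the formula. The function name `f1_from_precision_recall` is hypothetical, and returning 0 when both inputs are 0 is an assumed convention.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both metrics are zero
    return 2 * precision * recall / (precision + recall)


f1 = f1_from_precision_recall(0.67, 0.54)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.598"
```

Because it is a harmonic mean, the F1 score always lies between the two inputs and is pulled toward the smaller one.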

**PR curve:**
- the precision/recall curve, which plots precision against recall for different thresholds
- also a useful evaluation tool