# 3. Metrics/scoring

**Classification: binary and multi-class/multi-label**

Binary:
- Accuracy: the fraction of all predictions that are correct. It is misleading on imbalanced classes, where always predicting the majority class already scores high.
- Precision: the proportion of true positives (TP) among all predicted positives. High precision means few false positives (FP): cases predicted positive that are actually negative, e.g. a COVID test that comes back positive for a healthy person. $$\text{Precision} = \frac{TP}{TP + FP}$$
- Recall (Sensitivity): the proportion of true positives (TP) among all actual positives. High recall means few false negatives (FN): cases predicted negative that are actually positive, e.g. a COVID test that comes back negative for an infected person, which is often the more serious error. $$\text{Recall} = \frac{TP}{TP + FN}$$
- F1-Score: The harmonic mean of precision and recall. It balances precision and recall, especially when you want a single metric that considers both. $$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
    - A high F1-score means both precision and recall are high.
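    - Useful when classes are imbalanced, because it ignores true negatives. It weights precision and recall equally; when false positives and false negatives have different costs, the more general F-beta score lets you weight one over the other.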

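- AUC-ROC: the area under the ROC curve, which plots the true positive rate against the false positive rate across all decision thresholds. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.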
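
As a quick check of the precision, recall, and F1 formulas above, here is a minimal sketch in plain Python. The helper name `binary_metrics` and the counts passed to it are illustrative assumptions, not results from a real dataset.

```python
def binary_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1) from confusion-matrix counts."""
    # Fall back to 0.0 when a denominator is empty (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 4 true positives, 2 false positives, 3 false negatives.
precision, recall, f1 = binary_metrics(tp=4, fp=2, fn=3)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.667 recall=0.571 f1=0.615
```

Note that true negatives never enter any of the three formulas, which is why these metrics stay informative when the negative class dominates.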


Multi-class:
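- Macro-averaging: compute the metric for each class separately, then take the unweighted mean, so every class counts equally regardless of size.
- Micro-averaging: pool the TP, FP, and FN counts over all classes, then compute the metric once, so every sample counts equally and large classes dominate.
- Weighted-averaging: compute the metric for each class, then average with weights proportional to each class's number of true samples (support).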
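
The three averaging modes can be compared with scikit-learn's `f1_score` and its `average` parameter. This is a small sketch on made-up labels (the arrays `y_true` and `y_pred` are assumptions chosen so that one class is never predicted); it checks each average against its definition instead of relying on specific numbers.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical 3-class labels with unequal class sizes (class 0 dominates).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 1])

# One F1 value per class; class 2 is never predicted, so its F1 is set to 0.
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
support = np.bincount(y_true)  # number of true samples per class

macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print("per-class F1:", per_class)
print("macro:", macro, "micro:", micro, "weighted:", weighted)

# Each average matches its definition.
assert np.isclose(macro, per_class.mean())
assert np.isclose(weighted, np.average(per_class, weights=support))
# For single-label multi-class problems, micro-averaged F1 equals accuracy.
assert np.isclose(micro, accuracy_score(y_true, y_pred))
```

Macro-averaging is the one that penalizes the missed rare class most, which makes it a common choice when minority classes matter.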



---
# 4. Imbalanced class
- Over/under-sampling
- SMOTE (oversampling minority): [original paper (SMOTE: Synthetic Minority Over-sampling Technique)](https://www.jair.org/index.php/jair/article/view/10302/24590)

https://towardsdatascience.com/smote-fdce2f605729

#### **SMOTE**

SMOTE is an algorithm that performs data augmentation by creating **synthetic data points** based on the original data points. SMOTE can be seen as an advanced version of oversampling, or as a specific algorithm for data augmentation. The advantage of SMOTE is that you are **not generating duplicates**, but rather creating synthetic data points that are **slightly different** from the original data points.

> SMOTE is an improved alternative for oversampling

The **SMOTE algorithm** works as follows:

- You draw a random sample from the minority class.
- For the observations in this sample, you will identify the k nearest neighbors.
- You will then take one of those neighbors and identify the vector between the current data point and the selected neighbor.
- You multiply the vector by a random number between 0 and 1.
- To obtain the synthetic data point, you add this to the current data point.

This operation is actually very much like **slightly moving the data point in the direction of its neighbor**. This way, you make sure that your synthetic data point is **not an exact copy** of an existing data point while making sure that it is **also not too different** from the known observations in your minority class.
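
A single interpolation step from the list above can be written as $x_{new} = x_i + \lambda (x_{nn} - x_i)$ with $\lambda \in [0, 1]$. The sketch below uses two made-up points and a fixed `lam` (instead of a random draw) so the result is easy to verify by hand.

```python
import numpy as np

x_i = np.array([1.0, 2.0])         # hypothetical minority point
x_neighbor = np.array([3.0, 6.0])  # hypothetical nearest neighbor
lam = 0.25                         # fixed here; SMOTE draws it uniformly from [0, 1]

# Move a fraction lam of the way from x_i towards x_neighbor.
x_new = x_i + lam * (x_neighbor - x_i)
print(x_new)  # the point (1.5, 3.0)
```

With `lam = 0` the synthetic point coincides with `x_i`, and with `lam = 1` it coincides with the neighbor, so every synthetic point lies on the segment between the two.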
##### SMOTE influences precision vs. recall

In the mountain sports example from the source article, we looked at the overall accuracy of the model. Accuracy measures the percentage of predictions that you got right. In classification problems, we generally want to go a bit further than that and take into account **predictive performance for each class**.

In binary classification, the **confusion matrix** is a machine learning metric that shows the number of:
- _true positives (the model correctly predicted true)_
- _false positives (the model incorrectly predicted true)_
- _true negatives_ _(the model correctly predicted false)_
- _false negatives (the model incorrectly predicted false)_

In this context, we also talk about **precision vs. recall**. Precision means how well a model succeeds in identifying **ONLY positive cases**. Recall means how well a model succeeds in identifying **ALL the positive cases within the data**.
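
To make the four confusion-matrix cells concrete, here is a minimal sketch with scikit-learn's `confusion_matrix` on made-up labels (`y_true` and `y_pred` are assumptions). Passing `labels=[0, 1]` fixes the layout so that `.ravel()` yields the counts in the order TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (1 = positive, 0 = negative).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# Rows are true labels, columns are predicted labels, both ordered as [0, 1].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many are real
recall = tp / (tp + fn)     # of the real positives, how many were found
print(tn, fp, fn, tp)       # 4 1 1 2
print(precision, recall)
```

Here one positive is missed (FN = 1) and one negative is flagged (FP = 1), so precision and recall are both 2/3.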

True positives and true negatives are both correct predictions: having many of those is the ideal situation. False positives and false negatives are both wrong predictions: having few of them is the ideal case as well. Yet in many cases, **we may prefer having false positives rather than having false negatives**.

When machine learning is used for automating business processes, false negatives (positives that are predicted as negative) will not show up anywhere and will probably never be detected, whereas false positives (negatives that are wrongly predicted as positive) will generally be filtered out quite easily in later manual checks that many businesses have in place.

> In many business cases, false positives are less problematic than false negatives.

An obvious example would be **testing for the coronavirus**. Imagine that sick people take a test and obtain a false negative: they will go out and infect other people. On the other hand, if healthy people obtain a false positive, they will be obliged to stay home: not ideal, but at least they do not form a public health hazard.

When we have a strong class imbalance, we have very few cases in one class, resulting in the model hardly ever predicting that class. **Using SMOTE we can tweak the model to reduce false negatives, at the cost of increasing false positives.** The result of using SMOTE is generally an **increase in recall**, at the cost of **lower precision**. This means that we will add **more predictions of the minority class**: some of them correct (increasing recall), but some of them wrong (decreasing precision).

> SMOTE increases recall at the cost of lower precision

For example, a model that predicts buyers all the time will be good in terms of recall, as it did identify all the positive cases. Yet it will be bad in terms of precision. The overall model accuracy may also decrease, but this is not a problem: **accuracy should not be used as a metric in case of imbalanced data**.
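
The two extreme classifiers can be scored directly. This sketch assumes a made-up dataset with 10 positives and 90 negatives and a hypothetical helper `scores`; undefined precision (no predicted positives) is reported as 0.0 by convention.

```python
import numpy as np

# Hypothetical imbalanced labels: 10 positives, 90 negatives.
y_true = np.array([1] * 10 + [0] * 90)


def scores(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary 0/1 labels."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    accuracy = float(np.mean(y_true == y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


always_pos = scores(y_true, np.ones_like(y_true))   # predicts the minority class every time
always_neg = scores(y_true, np.zeros_like(y_true))  # predicts the majority class every time

print("always positive:", always_pos)  # accuracy 0.1, precision 0.1, recall 1.0
print("always negative:", always_neg)  # accuracy 0.9, precision 0.0, recall 0.0
```

The always-negative model reaches 90% accuracy while finding no positives at all, which is the reason accuracy is a poor guide on imbalanced data.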

In [106]:
import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd

In [107]:
np.random.seed(0)

f1 = np.random.uniform(0, 1, size=30)
f2 = np.random.uniform(0, 1, size=30)
y = np.array([0] * round(0.89 * 30) + [1] * round(0.11 * 30))
np.random.shuffle(y)

pd.DataFrame(np.c_[f1, f2, y])

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.548814 | 0.264556 | 0.0 |
| 1 | 0.715189 | 0.774234 | 0.0 |
| 2 | 0.602763 | 0.45615 | 0.0 |
| 3 | 0.544883 | 0.568434 | 0.0 |
| 4 | 0.423655 | 0.01879 | 0.0 |
| 5 | 0.645894 | 0.617635 | 0.0 |
| 6 | 0.437587 | 0.612096 | 0.0 |
| 7 | 0.891773 | 0.616934 | 0.0 |
| 8 | 0.963663 | 0.943748 | 0.0 |
| 9 | 0.383442 | 0.68182 | 0.0 |


In [108]:
from sklearn.neighbors import NearestNeighbors

def custom_smote(X, y, minority_class=1, N=100, k=5):
    """
    SMOTE implementation for generating synthetic samples.
    
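    Params:
        X: Feature matrix (numpy array).
        y: Target array.
        minority_class: Label of the minority class (assumed to be 1 in the binary case).
        N: Number of synthetic samples to generate, as a percentage of the
            minority class size (N=200 adds twice as many synthetic samples
            as there are minority samples).
        k: Number of neighbors requested from the KNN query. The query point
            normally comes back as its own nearest neighbor and is dropped,
            so k - 1 true neighbors remain as candidates. Requires
            2 <= k <= number of minority samples.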

    Returns:
        X_resampled, y_resampled: Resampled feature matrix and target array.

    """

    # Step 1: extract minority samples
    X_minority = X[y == minority_class]
    n_minority = X_minority.shape[0]
    
    n_synthetic = int(N / 100 * n_minority)  # number of synthetic samples to generate (conventional way)

    # Step 2: fit KNN
    knn = NearestNeighbors(n_neighbors=k).fit(X_minority)

    # Step 3: Generate synthetic samples
    synthetic_samples = []

    for _ in range(n_synthetic):

        # Randomly choose a minority sample
        i = np.random.randint(0, n_minority)
        x_min = X_minority[i]
        
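        # Find the k nearest neighbors of x_min (indices into X_minority).
        # x_min is part of the fitted data, so it normally comes back first.
        neighbors = knn.kneighbors(x_min.reshape(1, -1), return_distance=False)[0]
        
        # Randomly pick one neighbor, skipping index 0 to avoid self-pairing
        neighbor_idx = np.random.choice(neighbors[1:])
        x_neighbor = X_minority[neighbor_idx]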
        
        # Generate synthetic sample by interpolation
        lam = np.random.uniform(0, 1)
        synthetic_sample = x_min + lam * (x_neighbor - x_min)
        synthetic_samples.append(synthetic_sample)
    
    # Step 4: Concatenate the original and synthetic samples
    X_synthetic = np.array(synthetic_samples)
    y_synthetic = np.array([minority_class] * n_synthetic)

    X_resampled = np.vstack((X, X_synthetic))
    y_resampled = np.hstack((y, y_synthetic))
    
    return X_resampled, y_resampled


X_train = np.c_[f1, f2]
y_train = y

X_resampled, y_resampled = custom_smote(X_train, y_train, minority_class=1, N=200, k=2)

pd.DataFrame(np.c_[X_resampled, y_resampled])

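|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.548814 | 0.264556 | 0.0 |
| 1 | 0.715189 | 0.774234 | 0.0 |
| 2 | 0.602763 | 0.45615 | 0.0 |
| 3 | 0.544883 | 0.568434 | 0.0 |
| 4 | 0.423655 | 0.01879 | 0.0 |
| 5 | 0.645894 | 0.617635 | 0.0 |
| 6 | 0.437587 | 0.612096 | 0.0 |
| 7 | 0.891773 | 0.616934 | 0.0 |
| 8 | 0.963663 | 0.943748 | 0.0 |
| 9 | 0.383442 | 0.68182 | 0.0 |

(Only the first 10 rows are shown. The synthetic samples are appended after the original 30 rows, so they do not appear here.)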


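Why randomly choose a minority sample instead of iterating over every minority sample?

-> Mainly a simplification: drawing base points at random lets `n_synthetic` be any number rather than a multiple of the minority class size. The original SMOTE paper does loop over every minority sample and generates N/100 synthetic points for each. A possible argument for random selection is that systematically generating points around every minority sample could cluster the synthetic points tightly around the originals and overfit to the minority class distribution. Note, though, that random draws with replacement can also pick the same point repeatedly, so extra diversity is not guaranteed.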
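
For comparison, here is a sketch of the alternative that visits every minority sample and generates a fixed number of synthetic points for each. The function name `smote_per_sample`, its parameters, and the toy points are assumptions for illustration; neighbors are found by brute force with NumPy, which is only reasonable for small minority sets.

```python
import numpy as np


def smote_per_sample(X_minority, n_per_sample=2, k=3, seed=0):
    """Generate n_per_sample synthetic points for every minority sample."""
    rng = np.random.default_rng(seed)
    n = X_minority.shape[0]
    k = min(k, n - 1)  # a point cannot have more neighbors than there are other points

    # Pairwise Euclidean distances between all minority points.
    dists = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # exclude each point from its own neighbors
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    synthetic = []
    for i in range(n):                            # visit every minority sample once
        for _ in range(n_per_sample):
            j = rng.choice(neighbors[i])          # pick one of its neighbors at random
            lam = rng.uniform(0, 1)
            synthetic.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)


# Hypothetical minority points on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = smote_per_sample(X_min, n_per_sample=2, k=2)
print(X_syn.shape)  # (8, 2)
```

With this variant every original point contributes equally, whereas random selection may use some points several times and others not at all.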

In [109]:
from imblearn.datasets import fetch_datasets
from imblearn.over_sampling import SMOTE

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

data = fetch_datasets()['ecoli']
data.data.shape, data.target.shape, np.unique(data.target, return_counts=True)

((336, 7),
 (336,),
 (array([-1,  1], dtype=int64), array([301,  35], dtype=int64)))

In [110]:
X_train, X_valid, y_train, y_valid = train_test_split(data.data, data.target, test_size=0.2, random_state=0, stratify=data.target)

X_train_resampled, y_train_resampled = custom_smote(X_train, y_train, N=200, k=5)

np.unique(y_train, return_counts=True), np.unique(y_train_resampled, return_counts=True)

((array([-1,  1], dtype=int64), array([240,  28], dtype=int64)),
 (array([-1,  1], dtype=int64), array([240,  84], dtype=int64)))

In [111]:
# not resampled

rf1 = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = rf1.predict(X_valid)
accuracy = accuracy_score(y_valid, y_pred)
precision = precision_score(y_valid, y_pred, zero_division=1)
recall = recall_score(y_valid, y_pred, zero_division=1)
conf_matrix = confusion_matrix(y_valid, y_pred)
accuracy, precision, recall, conf_matrix 

# accuracy is "high"! very misleading
# recall is terribly low
# TN | FP
# FN | TP

# https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_5_0.html#sphx-glr-auto-examples-release-highlights-plot-release-highlights-1-5-0-py

(0.9264705882352942,
 0.6666666666666666,
 0.5714285714285714,
 array([[59,  2],
        [ 3,  4]], dtype=int64))

In [112]:
# resampled

rf2 = RandomForestClassifier(random_state=0).fit(X_train_resampled, y_train_resampled)
y_pred = rf2.predict(X_valid)
accuracy = accuracy_score(y_valid, y_pred)
precision = precision_score(y_valid, y_pred, zero_division=1)
recall = recall_score(y_valid, y_pred, zero_division=1)
conf_matrix = confusion_matrix(y_valid, y_pred)
accuracy, precision, recall, conf_matrix 

# recall is higher!
# TN | FP
# FN | TP
# note that FN decreases (recall higher) and FP increases (precision lower)

(0.9264705882352942,
 0.625,
 0.7142857142857143,
 array([[58,  3],
        [ 2,  5]], dtype=int64))

In [113]:
# use SMOTE from imblearn

# tuning
ratio = 0.45
# ratio = 0.5

smote = SMOTE(
    random_state=0, 
    sampling_strategy=ratio,  # desired minority/majority sample-count ratio after resampling
    k_neighbors=5
)
X_train_resampled_SMOTE, y_train_resampled_SMOTE = smote.fit_resample(X_train, y_train)
np.unique(y_train_resampled_SMOTE, return_counts=True) 

(array([-1,  1], dtype=int64), array([240, 108], dtype=int64))
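
The count of 108 above follows from the float `sampling_strategy`, which is the desired ratio of minority to majority samples after resampling. A minimal arithmetic sketch, using the training counts shown earlier (the variable names are illustrative):

```python
n_majority, n_minority = 240, 28  # class counts in the training split
ratio = 0.45                      # value passed as sampling_strategy

target_minority = int(ratio * n_majority)    # minority size after resampling
n_synthetic = target_minority - n_minority   # synthetic samples SMOTE has to add

print(target_minority, n_synthetic)  # 108 80
```

A ratio below the current minority/majority ratio (28 / 240, about 0.117) would ask for fewer minority samples than already exist, so it is not a valid setting for oversampling.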

In [114]:
rf3 = RandomForestClassifier(random_state=0).fit(X_train_resampled_SMOTE, y_train_resampled_SMOTE)
y_pred = rf3.predict(X_valid)
accuracy = accuracy_score(y_valid, y_pred)
precision = precision_score(y_valid, y_pred, zero_division=1)
recall = recall_score(y_valid, y_pred, zero_division=1)
conf_matrix = confusion_matrix(y_valid, y_pred)
accuracy, precision, recall, conf_matrix 

(0.9264705882352942,
 0.625,
 0.7142857142857143,
 array([[58,  3],
        [ 2,  5]], dtype=int64))