### <img src=images/gdd-logo.png width=200px align=right>
# Cost-Sensitive Learning

In the previous notebook, you explored how over- and undersampling techniques can address issues with modeling imbalanced data and learned to use appropriate evaluation metrics.

Taking a step back, the goal is not only to account for class imbalance, but to build a classifier that is robust, representative of the real world and, more importantly, drives business value!

🤔 Assuming the minority class is the negative class, if the cost of false negatives is far lower than the cost of false positives, would you even bother with resampling?

In this notebook, we’re going to explore how to build robust classifiers that really focus on boosting business value. We’ll kick things off by looking at how each classification (and any misclassifications) impacts your business’s bottom line. We’ll break this down using something we call a “cost matrix.” Let’s dive in! 💰


### Outline 
- [The cost matrix](#intro)
- [Cost-sensitive learning in ```sklearn```](#sklearn)
- [Tuning the decision threshold](#tuning)

<a id = 'intro'></a>

## Cost matrix

The essence of cost-sensitive decision-making is that it can be optimal to act as if one class is true even when some other class is more probable. For example, it can be rational not to approve a large credit card transaction even if the transaction is most likely legitimate. 

When the different kinds of misclassification do not cost the same, you can use a ***cost matrix*** to specify the cost of each one.

### Example: Fraud Detection
Let's take an example of a banking application, in particular, credit card transaction fraud detection. 

In this case, the cost of labelling a fraud as a non-fraud is much higher than the cost of labelling a non-fraud as a fraud. Missing a fraudulent transaction (false negative) causes a loss directly related to the transaction amount, plus potential losses from further fraudulent use of the credit/debit card. Blocking a legitimate transaction (false positive) also has a cost: it inconveniences customers, triggers unnecessary investigations, and harms the company's reputation.

In this case, the cost matrix might look like this:

| | Predicted: Fraud | Predicted: Non-Fraud |
| --- | --- | --- |
| **Actual: Fraud** | 0 | 5 |
| **Actual: Non-Fraud** | 1 | 0 |
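
To make the idea of acting against the most probable class concrete, here is a minimal sketch that computes the expected cost of each decision from the cost matrix above. The function name `expected_costs` and the example probabilities are assumptions chosen for illustration, not part of the original material.

```python
import numpy as np

# Cost matrix from the table above: rows = actual class, columns = predicted class.
# Both axes are ordered [fraud, non-fraud].
cost_matrix = np.array(
    [
        [0, 5],  # actual fraud: caught (0) or missed (5)
        [1, 0],  # actual non-fraud: blocked by mistake (1) or approved (0)
    ]
)


def expected_costs(p_fraud):
    """Return the expected cost of predicting [fraud, non-fraud]."""
    class_probabilities = np.array([p_fraud, 1 - p_fraud])
    # Weight each row (actual class) by its probability, then sum per column (decision).
    return class_probabilities @ cost_matrix


for p_fraud in (0.05, 0.20, 0.50):
    cost_flag, cost_approve = expected_costs(p_fraud)
    decision = "fraud" if cost_flag < cost_approve else "non-fraud"
    print(f"P(fraud)={p_fraud:.2f}: flag={cost_flag:.2f}, approve={cost_approve:.2f} -> {decision}")

# Flagging becomes the cheaper decision once P(fraud) exceeds
# cost_FP / (cost_FP + cost_FN) = 1 / (1 + 5).
break_even = cost_matrix[1, 0] / (cost_matrix[1, 0] + cost_matrix[0, 1])
print(f"Break-even probability: {break_even:.3f}")
```

With these costs, a transaction that is only 20% likely to be fraudulent is already cheaper to block than to approve, even though it is most likely legitimate.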

####  <mark>Exercise: Define a cost matrix</mark>

Choose **one** of the following applications and try to define a cost matrix. You will then discuss with your peers your motivations for choosing certain costs.

1. You are a data scientist at a manufacturing company producing automobile parts and are tasked with building a model to predict whether a part is defective (positive) or not (negative) based on optical inspection. False negatives might lead to fatal accidents on the highway, while false positives might lead to good parts being discarded. What cost matrix would you define?
2. You are a data scientist working in a bank. You are tasked with building a model to predict whether a customer will default on their loan given their financial information. False negatives might lead to missed payments, while false positives might lead to lost business opportunities. What cost matrix would you define?
3. You are a data scientist at a hospital and are tasked with building a model to predict whether a patient has a tumor (positive) or not (negative) based on a biopsy. False negatives might lead to death while false positives might lead to unnecessary surgery. What cost matrix would you define?


Add your answer in the cell below making sure to specify the application you are working on and the class labels you are using. 

*Double-click or press Enter to open cell*

| | Predicted: 0| Predicted: 1|
| --- | --- | --- |
| **Actual: 0** | 0 | ? |
| **Actual: 1** | ? | 0 |

As you may have noticed, defining a cost matrix is hard and often requires the assistance of a domain expert. In practice, a simple and widely used heuristic is to weight each class inversely proportionally to its frequency, so that errors on the rarer class cost more. This is achieved by setting the ```class_weight``` parameter of the model to ```"balanced"```.
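
As a rough illustration of what the balanced heuristic computes, the sketch below uses scikit-learn's `compute_class_weight` helper on a made-up label vector. The 90/10 class split is an assumption chosen only to keep the numbers easy to read.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (assumption): 90 samples of class 0 and 10 samples of class 1.
y_toy = np.array([0] * 90 + [1] * 10)

classes = np.unique(y_toy)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same weights written out by hand: n_samples / (n_classes * class_counts).
manual_weights = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes, weights)))
print(manual_weights)
```

Here the minority class receives a weight nine times larger than the majority class, which mirrors the 9:1 class ratio rather than any real business cost.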

<a id = 'sklearn'></a>

## Cost-sensitive learning in ```sklearn```

In this example, we are going to use the [Statlog](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)) dataset to predict whether a customer will default on their loan.

In [None]:
import sklearn
from sklearn.datasets import fetch_openml

sklearn.set_config(transform_output="pandas")

credit_df = fetch_openml(data_id=31, as_frame=True, parser="pandas")
X, y = credit_df.data, credit_df.target

###  <mark>**Exercise**</mark>

1. Explore the dataset and identify the class distribution of the target variable.
2. Encode the positive class (good credit) as 1 and the negative class (bad credit) as 0.
3. Build, fit, and evaluate a pipeline instance consisting of appropriate preprocessing steps and an ensemble classifier.

In [None]:
# Your answers here

In [None]:
# %load answers/credit-lending-pipeline.py

### Adding a cost matrix to the pipeline

Let's now add a cost matrix to score our model based on the cost of misclassification.

| | Predicted: Good credit | Predicted: Bad credit |
| --- | --- | --- |
| **Actual: Good credit** | 0 | -1 |
| **Actual: Bad credit** | -5 | 0 |

<br>


<details>
  <summary>💡 Why do we multiply the costs by -1?</summary>
    Scikit-learn's model selection tools follow the convention that "higher" means "better", so they maximize the score.
    We therefore express each cost as a negative gain: minimizing the cost is equivalent to maximizing the gain.
</details>
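
To see how the table translates into a single number, here is a small sketch that can be checked by hand. The labels are made up for illustration and assume the encoding from the exercise above (good credit = 1, bad credit = 0). Note that `confusion_matrix` orders rows and columns by label, so bad credit (0) comes first.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = good credit, 0 = bad credit.
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0])
y_pred_toy = np.array([1, 1, 0, 0, 1, 0, 0])

# Rows = actual [bad, good], columns = predicted [bad, good].
cm_toy = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])

# The gain table above, rearranged to the same [bad, good] order.
gain_toy = np.array(
    [
        [0, -5],  # actual bad: predicted bad (0), predicted good (-5)
        [-1, 0],  # actual good: predicted bad (-1), predicted good (0)
    ]
)

total_gain = np.sum(cm_toy * gain_toy)
print(cm_toy)
print(f"Total gain: {total_gain}")
```

One bad customer approved (-5) and two good customers rejected (-1 each) give a total gain of -7.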



In [None]:
import numpy as np
from sklearn.metrics import confusion_matrix, make_scorer, precision_score, recall_score, accuracy_score


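def monetary_gain_score(y, y_pred):
    # With good credit encoded as 1 and bad credit as 0, the confusion matrix is
    # [[TN, FP], [FN, TP]]: rows are the actual classes, columns the predictions.
    cm = confusion_matrix(y, y_pred, labels=[0, 1])
    gain_matrix = np.array(
        [
            [0, -5],  # -5 gain for false positives (bad credit predicted as good)
            [-1, 0],  # -1 gain for false negatives (good credit predicted as bad)
        ]
    )
    return np.sum(cm * gain_matrix)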


scores = {
    "accuracy": make_scorer(accuracy_score),
    "precision": make_scorer(precision_score),
    "recall": make_scorer(recall_score),
    "monetary_gains": make_scorer(monetary_gain_score)
}

Let's now print out the model's metrics, including the monetary cost of its (mis)classifications.

In [None]:
print(f"Accuracy: {scores['accuracy'](model, X_test, y_test)}")
print(f"Precision: {scores['precision'](model, X_test, y_test)}")
print(f"Recall: {scores['recall'](model, X_test, y_test)}")
print(f"Business cost metric: {scores['monetary_gains'](model, X_test, y_test)}")

We can also investigate the precision-recall curve to better understand the model's sensitivity to the decision threshold.

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import PrecisionRecallDisplay

def plot_precision_recall_curve(est, ax, name, decision_threshold):
    PrecisionRecallDisplay.from_estimator(
        est,
        X_test,
        y_test,
        ax=ax,
        name=name,
    )
    ax.plot(
        scores["recall"](est, X_test, y_test),
        scores["precision"](est, X_test, y_test),
        marker="o",
        markersize=10,
        label=f"Decision Threshold: {decision_threshold:.2f}",
    )
    ax.set_title("Precision-Recall curve")
    ax.legend()

fig, ax = plt.subplots()
plot_precision_recall_curve(model, ax, "GBM", 0.5)

<a id = 'tuning'></a>

## Tuning the decision threshold

To find the optimal decision threshold, we need to compute the expected cost (or gain) for each possible threshold value.
Rather than computing the costs manually, we can use the ```TunedThresholdClassifierCV``` class to automatically find the optimal threshold.

In [None]:
from sklearn.model_selection import TunedThresholdClassifierCV

tuned_model = TunedThresholdClassifierCV(
    estimator=model,
    scoring=scores["monetary_gains"],
    store_cv_results=True,  # necessary to inspect all results
)

tuned_model.fit(X_train, y_train)
print(f"{tuned_model.best_threshold_=:0.2f}")
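
For intuition, the manual computation that the tuner spares us can be sketched as a simple sweep over candidate thresholds. This is a simplified illustration, not what the tuner does internally: it uses synthetic data, a logistic regression, a single validation split instead of cross-validation, and a hypothetical gain matrix. All names below are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): class 1 plays the role of "good credit".
X_toy, y_toy = make_classification(n_samples=2000, weights=[0.3, 0.7], random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X_toy, y_toy, stratify=y_toy, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
proba_good = clf.predict_proba(X_val)[:, 1]

# Hypothetical gains in confusion_matrix order:
# rows = actual [0, 1], columns = predicted [0, 1].
toy_gain_matrix = np.array([[0, -5], [-1, 0]])


def gain_at_threshold(threshold):
    """Total gain on the validation set when predicting 1 if P(class 1) >= threshold."""
    y_pred = (proba_good >= threshold).astype(int)
    cm = confusion_matrix(y_val, y_pred, labels=[0, 1])
    return np.sum(cm * toy_gain_matrix)


thresholds = np.linspace(0, 1, 101)
gains = np.array([gain_at_threshold(t) for t in thresholds])
best_threshold = thresholds[np.argmax(gains)]
print(f"Best threshold: {best_threshold:.2f}, gain: {gains.max()}")
```

Unlike this single-split sketch, ```TunedThresholdClassifierCV``` evaluates the thresholds with cross-validation, which makes the selected threshold less dependent on one particular split.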

Let's now visualize the business metric as a function of the decision threshold, mark the optimal threshold, and evaluate the model's performance at that threshold.

In [None]:
def plot_objective_score_curve(tuned_model, ax):
    ax.plot(
        tuned_model.cv_results_["thresholds"],
        tuned_model.cv_results_["scores"],
        color="tab:orange",
    )
    ax.plot(
        tuned_model.best_threshold_,
        tuned_model.best_score_,
        "o",
        markersize=10,
        color="tab:orange",
        label="Optimal cut-off point for the business metric",
    )
    ax.legend()
    ax.set_xlabel("Decision threshold (probability)")
    ax.set_ylabel("Monetary gains")
    ax.set_title("Business metric as a function of the decision threshold")

fig, axs = plt.subplots(1, 2, figsize=(12, 6))
plot_precision_recall_curve(tuned_model, axs[0], "GBM", tuned_model.best_threshold_)
plot_objective_score_curve(tuned_model, axs[1])


Have we improved the model's performance by tuning the decision threshold?

In [None]:
print(f"Accuracy: {scores['accuracy'](tuned_model, X_test, y_test)}")
print(f"Precision: {scores['precision'](tuned_model, X_test, y_test)}")
print(f"Recall: {scores['recall'](tuned_model, X_test, y_test)}")
print(f"Business cost metric: {scores['monetary_gains'](tuned_model, X_test, y_test)}")

####  <mark>Food for thought 🤔</mark>

1. Why does the precision-recall curve not change for the tuned threshold classifier?
2. The tuned threshold classifier uses cross-validation to find the optimal threshold. What if we want to tune a pre-trained model?
   1. Look at the [documentation](https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.TunedThresholdClassifierCV.html) for the ```TunedThresholdClassifierCV``` class and see what function arguments you need to change to use a pre-trained model.
   2. Would you use the same dataset to train and tune the threshold? Why or why not?
3. Repeat the above steps for a cost matrix where the cost of false positives is twice the cost of false negatives. How does the optimal threshold change?