
We will take the wine quality [dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv) and build a baseline classification model. We will convert the numeric target variable ("quality") into three segments:
- "Good" (>= 7)
- "Average" (> 5 and < 7)
- "Poor" (<= 5)

In [26]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
wine_data = pd.read_csv(url, sep=';')

# Define quality thresholds
def quality_label(quality):
    if quality >= 7:
        return 'Good'
    elif quality > 5:
        return 'Average'
    else:
        return 'Poor'

# Apply the thresholds to create a new target column
wine_data['quality_label'] = wine_data['quality'].apply(quality_label)
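
The same three-way binning can also be written in vectorized form. The following is a minimal sketch using `pd.cut` on a small hypothetical series of quality scores (`demo_quality` is illustrative, not part of the dataset); it is an alternative to the `apply` approach above, not a replacement for it.

```python
import numpy as np
import pandas as pd

# Hypothetical quality scores, chosen to cover all three segments
demo_quality = pd.Series([3, 4, 5, 6, 7, 8])

# Right-closed bins: (-inf, 5] -> Poor, (5, 6] -> Average, (6, inf) -> Good
demo_labels = pd.cut(
    demo_quality,
    bins=[-np.inf, 5, 6, np.inf],
    labels=["Poor", "Average", "Good"],
).astype(str)

print(demo_labels.tolist())
```

Because the quality scores are integers, the bin `(5, 6]` contains only the value 6, which matches the `> 5` and `< 7` rule for "Average".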

In [27]:
# Now we drop the previous continuous target variable, consider the newly generated 'quality_label' as y and split into train and test data
# Separate features and target
X = wine_data.drop(columns=['quality', 'quality_label'])
y = wine_data['quality_label']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y) # stratified random sampling

In [28]:
# let's see the class distribution of the 3 labels and compare before and after the split
print(y.value_counts(normalize = True).round(2),
      y_train.value_counts(normalize =  True).round(2),
      y_test.value_counts(normalize = True).round(2))

quality_label
Poor       0.47
Average    0.40
Good       0.14
Name: proportion, dtype: float64 quality_label
Poor       0.47
Average    0.40
Good       0.14
Name: proportion, dtype: float64 quality_label
Poor       0.46
Average    0.40
Good       0.14
Name: proportion, dtype: float64


As you can see, the class thresholds are chosen so that the dataset is imbalanced (only about 14% of the wines are "Good"), which is closer to a real-world situation. It also prepares the ground for a use case where cost-sensitive learning comes into play.

Because the split is stratified, the training and test sets keep nearly the same class distribution as the full data. We will use a scikit-learn pipeline to build the models so that the preprocessing and modeling steps are applied sequentially with ease.
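
One practical benefit of a pipeline is that the scaler is fitted on the training data only and then reused unchanged at prediction time. The sketch below illustrates this on synthetic data; all names (`X_demo`, `demo_pipeline`, and so on) are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 samples, 3 features, binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = (X_demo[:, 0] > 5.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
])
demo_pipeline.fit(X_tr, y_tr)

# The fitted scaler's mean comes from the training rows only
fitted_scaler = demo_pipeline.named_steps["scaler"]
scaler_uses_train_only = np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0))

# predict() applies the already-fitted scaler, then the classifier
test_predictions = demo_pipeline.predict(X_te)
print(scaler_uses_train_only, test_predictions.shape)
```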

In [29]:
# feature scaling and encoding

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Initialize the scaler and encoder
scaler = StandardScaler()
encoder = LabelEncoder()

# Encode the target labels
y_train_encoded = encoder.fit_transform(y_train)
y_test_encoded = encoder.transform(y_test) # used later to look up misclassification costs by encoded label

# Create a pipeline for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', scaler, X.columns)
    ])


In [30]:
# pipeline and model

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Instantiate a Random Forest Classifier model
model = RandomForestClassifier(n_estimators=100,
                               random_state=42)

# Combine preprocessing and model into a pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', model)])

# Train the model
pipeline.fit(X_train, y_train_encoded)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Decode the predictions to original labels
y_pred_labels = encoder.inverse_transform(y_pred)

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_labels))

print("\nClassification Report:")
print(classification_report(y_test, y_pred_labels))

Confusion Matrix:
[[124  13  55]
 [ 28  32   5]
 [ 40   3 180]]

Classification Report:
              precision    recall  f1-score   support

     Average       0.65      0.65      0.65       192
        Good       0.67      0.49      0.57        65
        Poor       0.75      0.81      0.78       223

    accuracy                           0.70       480
   macro avg       0.69      0.65      0.66       480
weighted avg       0.70      0.70      0.70       480



#### Defining a cost matrix and calculating real-world cost of misclassification:
I will express costs in euros, since both wine production and consumption are very high in Europe.

We will consider the following:
- An average wine costs 5 euros/bottle.
- A Good wine predicted as Average costs 10 euros/bottle in losses; if it is predicted as Poor the loss should be higher, say 20 euros/bottle.
- Similarly, a Poor wine predicted as Good is costly, but a little less than 20, say 15 euros/bottle (it is an indirect loss to the company, for example through damage to the brand).

Note: these are just assumptions or guesstimates to show you the methodology and give a feel of the problem - the real-world cost matrix may be different.


In [32]:
import numpy as np

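# Cost matrix: cost[i][j] represents the cost of predicting class j when the true class is i
# Rows/columns are written in the order Good, Average, Poor.
# Caution: LabelEncoder encodes labels alphabetically (0=Average, 1=Good, 2=Poor),
# and the cost calculation below indexes this matrix with those encoded labels.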
cost_matrix = np.array([[0, 10, 20],
                        [8, 0, 5],
                        [15, 7, 0]])
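
Since `LabelEncoder` sorts class names alphabetically, it can help to build the cost matrix with explicit labels and then reorder it to the encoder's class order. The following is a minimal sketch of that idea; `encoder_demo`, `cost_df`, and `aligned_cost_matrix` are hypothetical names, and the cost values assume the Good, Average, Poor layout described in the text. The outputs reported in this walkthrough were produced with the matrix exactly as written above, so numbers obtained with the aligned matrix would differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts classes alphabetically: Average=0, Good=1, Poor=2
encoder_demo = LabelEncoder().fit(["Good", "Average", "Poor"])

# Costs written with explicit labels (rows = true class, columns = predicted class)
intended_order = ["Good", "Average", "Poor"]
cost_df = pd.DataFrame(
    [[0, 10, 20],
     [8, 0, 5],
     [15, 7, 0]],
    index=intended_order,
    columns=intended_order,
)

# Reorder rows and columns to match the encoded label order
aligned_cost_matrix = cost_df.loc[encoder_demo.classes_, encoder_demo.classes_].to_numpy()
print(list(encoder_demo.classes_))
print(aligned_cost_matrix)
```

With this layout, `aligned_cost_matrix[1, 2]` is the cost of predicting Poor for a wine that is truly Good.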

In [41]:
# Based on the cost matrix we will now calculate the cost of misclassification

# Cost-sensitive evaluation
def calculate_total_cost(y_true, y_pred, cost_matrix):
    total_cost = 0
    # just summing up the misclassification cost over all our predictions
    for true, pred in zip(y_true, y_pred):
        total_cost += cost_matrix[true, pred] # cost[i][j]
    return total_cost

# Calculate the total cost of the baseline predictions
total_cost = calculate_total_cost(y_test_encoded, y_pred, cost_matrix)
print(f'Total Misclassification Cost: {total_cost}')

average_cost = total_cost/len(y_test)
print(f'Average Misclassification Cost per prediction: {average_cost.round(2)} Euros/bottle')


Total Misclassification Cost: 2156
Average Misclassification Cost per prediction: 4.49 Euros/bottle
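
The loop in `calculate_total_cost` can also be expressed with NumPy integer-array indexing. Below is a minimal sketch on a small hypothetical example (`y_true_demo` and `y_pred_demo` are made-up encoded labels), using the same cost values as above.

```python
import numpy as np

demo_cost_matrix = np.array([[0, 10, 20],
                             [8, 0, 5],
                             [15, 7, 0]])

# Hypothetical encoded labels for four bottles
y_true_demo = np.array([0, 1, 2, 1])
y_pred_demo = np.array([0, 0, 1, 2])

# Pairwise lookup: element k is demo_cost_matrix[y_true_demo[k], y_pred_demo[k]]
per_sample_cost = demo_cost_matrix[y_true_demo, y_pred_demo]
total_cost_demo = int(per_sample_cost.sum())

print(per_sample_cost.tolist())  # [0, 8, 7, 5]
print(total_cost_demo)           # 20
```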


Note that with the baseline model, the average cost of misclassification per prediction (that is, per bottle) is about 4.5 euros, which is quite high given that an average wine costs 5 euros.

To resolve this issue, we will use cost-sensitive learning. We will generate a weight for each class from its misclassification costs and pass these weights to the model so that errors on costlier classes are penalized more heavily during training.

In [48]:
# Calculate raw class weights by summing costs for each class
raw_class_weights = cost_matrix.sum(axis=1) # sum of each row as row is true label
print("Raw class weights:", raw_class_weights)

# Normalize by sum of weights
norm_class_weights = raw_class_weights / raw_class_weights.sum()
print("Normalized class weights by sum:", norm_class_weights) # intended order: Good, Average, Poor

# Convert the class weights to the dictionary format expected by scikit-learn
# (keys are matched against the encoded labels, where LabelEncoder gives 0=Average, 1=Good, 2=Poor)
class_weights_dict = {i: weight for i, weight in enumerate(norm_class_weights)}
print("Class weights dictionary:", class_weights_dict)

Raw class weights: [30 13 22]
Normalized class weights by sum: [0.46153846 0.2        0.33846154]
Class weights dictionary: {0: 0.46153846153846156, 1: 0.2, 2: 0.3384615384615385}


Note: The sum of the class weights is 1.
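
Written out, the weights above follow from the row sums of the cost matrix $C$, where $C_{ij}$ is the cost of predicting class $j$ when the true class is $i$. This is simply a restatement of the code in the previous cell:

```latex
w_i = \frac{\sum_{j} C_{ij}}{\sum_{k}\sum_{j} C_{kj}},
\qquad
w_0 = \frac{0 + 10 + 20}{65} = \frac{30}{65} \approx 0.462,
\quad
w_1 = \frac{8 + 0 + 5}{65} = \frac{13}{65} = 0.2,
\quad
w_2 = \frac{15 + 7 + 0}{65} = \frac{22}{65} \approx 0.338
```

The denominator is $30 + 13 + 22 = 65$, so the three weights sum to 1.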

Now, we will retrain the model with these weights.

In [54]:
# Train the model with these class weights (note: random_state=123 here, versus 42 for the baseline)
model = RandomForestClassifier(n_estimators=100,
                               class_weight=class_weights_dict,
                               random_state=123)

# Combine preprocessing and model into a pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', model)])

# Train the model
pipeline.fit(X_train, y_train_encoded)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Decode the predictions to original labels
y_pred_labels = encoder.inverse_transform(y_pred)

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_labels))

print("\nClassification Report:")
print(classification_report(y_test, y_pred_labels))

Confusion Matrix:
[[125  14  53]
 [ 29  30   6]
 [ 32   3 188]]

Classification Report:
              precision    recall  f1-score   support

     Average       0.67      0.65      0.66       192
        Good       0.64      0.46      0.54        65
        Poor       0.76      0.84      0.80       223

    accuracy                           0.71       480
   macro avg       0.69      0.65      0.67       480
weighted avg       0.71      0.71      0.71       480



Note that the f1-score improves for Average (0.65 to 0.66) and Poor (0.78 to 0.80) but drops for Good (0.57 to 0.54), while overall accuracy rises by one percentage point (0.70 to 0.71). Precision and recall have shifted for every class, so let's see how this affects the overall cost.

In [55]:
# Calculate the total cost of the cost-sensitive predictions
total_cost = calculate_total_cost(y_test_encoded, y_pred, cost_matrix)
print(f'Total Misclassification Cost: {total_cost}')

average_cost = total_cost/len(y_test)
print(f'Average Misclassification Cost per prediction: {average_cost.round(2)} Euros/bottle')

Total Misclassification Cost: 1963
Average Misclassification Cost per prediction: 4.09 Euros/bottle
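
The total cost can also be recovered directly from the confusion matrix printed above: multiply it element-wise by the cost matrix and sum. The sketch below hard-codes the cost-sensitive model's confusion matrix from the output (rows are true labels, columns are predictions); the variable names are hypothetical.

```python
import numpy as np

cost_matrix_demo = np.array([[0, 10, 20],
                             [8, 0, 5],
                             [15, 7, 0]])

# Confusion matrix of the cost-sensitive model, copied from the output above
cm_weighted = np.array([[125, 14, 53],
                        [29, 30, 6],
                        [32, 3, 188]])

# Each cell count times its per-bottle cost, summed over all cells
total_from_cm = int((cm_weighted * cost_matrix_demo).sum())
average_from_cm = round(total_from_cm / int(cm_weighted.sum()), 2)

print(total_from_cm)    # 1963
print(average_from_cm)  # 4.09
```

This agrees with the looped calculation because both index the cost matrix by (true label, predicted label).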


Great! With cost-sensitive learning, the total misclassification cost fell from 2156 to 1963 euros and the average cost from 4.49 to 4.09 euros/bottle, a reduction of about 9%.
