# **Model Evaluation**: Cost-Sensitive Learning from Imbalanced Data (SOLUTION)

Source:  [https://github.com/d-insight/code-bank.git](https://github.com/d-insight/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

-------------

## Overview

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. In case of a fraudulent transaction, the credit card company can incur substantial costs. In this exercise, we model the costs associated with prediction errors using a regularized logit model. We will learn that (1) weighting classes and (2) data resampling can help lower real-world costs for prediction errors.


<img src="https://images.unsplash.com/photo-1563013544-824ae1b704d3?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2250&q=80" width="500" height="500" align="center"/>


Image source: https://images.unsplash.com/photo-1563013544-824ae1b704d3?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2250&q=80




## Data

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
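
To make the imbalance concrete, the two counts quoted above can be plugged into a quick calculation. The sketch below uses only those counts; the variable names are illustrative. It also computes the accuracy of a hypothetical classifier that always predicts "no fraud", which suggests why accuracy alone says little on this dataset.

```python
# Counts quoted in the dataset description above
n_transactions = 284_807
n_frauds = 492

# Share of the positive (fraud) class
fraud_rate = n_frauds / n_transactions

# Accuracy of a hypothetical classifier that always predicts "no fraud"
always_negative_accuracy = 1 - fraud_rate

print(f"Fraud rate: {fraud_rate:.3%}")
print(f"Accuracy of always predicting 'no fraud': {always_negative_accuracy:.3%}")
```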

The data contain only numerical input variables which are the result of a data transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. 

Variables are defined as follows: 

| Feature name     | Variable Type | Description 
|------------------|---------------|--------------------------------------------------------
| Time             | Continuous    | Seconds elapsed between each transaction and the first transaction in the dataset
| V1               | Continuous    | Transformed feature 1 (due to confidentiality)
| ...              | ...           | ...
| V28              | Continuous    | Transformed feature 28
| Amount           | Continuous    | Transaction amount
| Class            | Binary        | Target variable (1 = fraud; 0 = no fraud)

Data source: https://www.kaggle.com/mlg-ulb/creditcardfraud



References:

- [Jason Brownlee on MachineLearningMastery.com](https://machinelearningmastery.com/cost-sensitive-learning-for-imbalanced-classification/)
- [Elkan (2001) - The Foundations of Cost-Sensitive Learning](http://web.cs.iastate.edu/~honavar/elkan.pdf)
- [Thai-Nghe et al. (2010) - Cost-sensitive learning methods for imbalanced data](https://ieeexplore.ieee.org/document/5596486)

## Part 0: Setup

### Import packages

In [None]:
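# Standard imports
import pandas as pd

# Statistical modeling functions from sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model    import LogisticRegression
from sklearn.metrics         import confusion_matrix, ConfusionMatrixDisplay, roc_auc_score, accuracy_score, f1_score
from sklearn.utils           import resample

# plot_confusion_matrix was removed in scikit-learn 1.2; fall back to the replacement API
try:
    from sklearn.metrics import plot_confusion_matrix
except ImportError:
    def plot_confusion_matrix(estimator, X, y, **kwargs):
        return ConfusionMatrixDisplay.from_estimator(estimator, X, y, **kwargs)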


### Define constants

In [None]:
# Think of CHF costs for prediction errors (change these costs -> click "Run All" -> see how model performance changes)
# Default = 1000
COST_FN = 1000

# Default = 1
COST_FP = 1

# Column names
COLUMN_NAMES = ["Time", "V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11", "V12", "V13", "V14", "V15",
                "V16", "V17", "V18", "V19", "V20", "V21", "V22", "V23", "V24" ,"V25", "V26", "V27", "V28", "Amount", "Class"]

# Seed for replication
SEED = 42

### Download data

Run the cell below to download the .csv data from our Google Cloud storage bucket.

In [None]:
!wget -N 'https://storage.googleapis.com/dsfm-datasets/model-evaluation/creditcard.csv.zip'
!unzip -o -j creditcard.csv.zip

# **MAIN EXERCISE**

## Part 1: Load data

First, we load in the credit card data and summarize the data.

**Q 1**: Load the `creditcard.csv` file and set the column names, using the column names defined above. Display the first 5 rows and the shape of the dataframe.

In [None]:
# Load the dataset and set the column names defined above
df = pd.read_csv('creditcard.csv')
df.columns = COLUMN_NAMES

print(df.shape)
df.head()

**Q 2**: Are there any missing values in the data? What are the summary statistics for each column?

In [None]:
# Summary statistics
df.describe().T

In [None]:
# Missing values by variable
df.isnull().sum()

## Part 2: EDA and train/test split

Next, we investigate the class distribution of the target.

**Q 1**: Plot a histogram of the target variable `Class`. 

In [None]:
df['Class'].hist()

**Q 2**: What percentage of transactions are fraudulent? 

Hint: Use the `value_counts` function in Pandas on the `Class` variable to count the number of transactions.
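
As a side note, `value_counts` can also return relative frequencies directly through its `normalize` argument. The sketch below uses a small made-up series rather than the credit card data; `toy_class` is a hypothetical name.

```python
import pandas as pd

# Toy target column (made-up values, not the credit card data)
toy_class = pd.Series([0, 0, 0, 0, 1], name="Class")

# Absolute counts per class
toy_counts = toy_class.value_counts()

# Relative frequencies per class (fractions that sum to 1)
toy_shares = toy_class.value_counts(normalize=True)

print(toy_counts)
print(toy_shares * 100)  # shares expressed in percent
```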

In [None]:
count = df['Class'].value_counts()
count

In [None]:
# Percentage of fraudulent transactions
percentage_fraudulent = count[1] / (count[1] + count[0]) * 100
print('{}% of transactions are fraudulent'.format(round(percentage_fraudulent, 4)))

**Q 3**: Split the data into training (80%) and testing data (20%). What's the shape of all four data sets?

Hint: Don't forget to stratify by the target variable. Why?
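
One way to see why stratification matters with so few positives is to compare a stratified and a plain random split on toy data. The sketch below uses synthetic labels (10,000 rows, 20 positives), not the credit card data, and all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10,000 rows with 20 positives (0.2%), made up for illustration
X_toy = np.arange(10_000).reshape(-1, 1)
y_toy = np.zeros(10_000, dtype=int)
y_toy[:20] = 1

# Stratified split: class shares are preserved in both parts
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Plain random split: the number of positives in the test set is left to chance
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

print("Stratified positives (train, test):  ", y_tr_strat.sum(), y_te_strat.sum())
print("Unstratified positives (train, test):", y_tr_plain.sum(), y_te_plain.sum())
```

With stratification, the 20 positives are divided 80/20 just like the rest of the data, so metrics computed on the test set rest on a predictable number of fraud cases.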

In [None]:
# Divide data into training and testing sets
X = df.drop(columns=['Class'], inplace=False)
y = df['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = SEED, stratify = y)

print(X_train.shape , X_test.shape , y_train.shape , y_test.shape)

## Part 3: Baseline - all prediction errors have the same cost

In this part, we establish a baseline that (naively) assigns the same cost to every prediction error. We use an L2-regularized logistic regression model throughout this exercise. For the baseline, the model does not know that the costs for FPs and FNs differ.

**Q 1**: Fit the logit model and predict fraud on the test data as a probability between 0 and 1. What does the confusion matrix look like? 

Hint: Use the `plot_confusion_matrix` in the `sklearn` package to plot the confusion matrix.

In [None]:
# Binary classification using a logit model
lr_clf = LogisticRegression(max_iter = 1000)
lr_clf.fit(X_train, y_train)

# Predict fraud - as a binary outcome and a probability between 0 and 1
y_pred_lr       = lr_clf.predict(X_test)
y_pred_lr_proba = lr_clf.predict_proba(X_test)

# Plot confusion matrix
print('{} observations'.format(len(X_test)))
cm_baseline = plot_confusion_matrix(lr_clf, X_test, y_test, values_format='.5g')
cm_baseline

**Q 2**: Count the total prediction errors, compute the accuracy score, AUC score, and the total prediction costs.
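
For reading the solution below, it helps to recall how `confusion_matrix` lays out its result for binary labels 0 and 1: rows are true classes and columns are predicted classes, so entry `[0][1]` counts FPs and entry `[1][0]` counts FNs. The sketch below illustrates this on made-up labels; the unit costs mirror the defaults from Part 0 and the variable names are illustrative.

```python
from sklearn.metrics import confusion_matrix

# Toy labels and predictions (made up for illustration)
y_true_toy = [0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 1, 0, 0, 0, 1]

# For binary labels {0, 1}, rows are true classes and columns are predictions:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Illustrative unit costs, mirroring the defaults defined in Part 0
toy_cost_fp, toy_cost_fn = 1, 1000

toy_errors = fp + fn
toy_total_cost = fp * toy_cost_fp + fn * toy_cost_fn

print("TN, FP, FN, TP =", tn, fp, fn, tp)
print("Errors =", toy_errors, "| Total cost =", toy_total_cost)
```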

In [None]:
# How many errors did the model make?
cm_lr = confusion_matrix(y_test, y_pred_lr)
errors_baseline = cm_lr[0][1] + cm_lr[1][0]
cost_baseline = cm_lr[0][1] * COST_FP + cm_lr[1][0] * COST_FN
amount_baseline = sum(X_test[y_test != y_pred_lr]['Amount'])

acc_baseline  = accuracy_score(y_test, y_pred_lr)
auc_baseline  = roc_auc_score(y_test, y_pred_lr_proba[:, 1])
f1_baseline   = f1_score(y_test, y_pred_lr, average='weighted')

print('Num errors =', errors_baseline, '\n')
print('Accuracy   =', "{0:.4f}".format(acc_baseline))
print('AUC score  =', "{0:.4f}".format(auc_baseline))
print('F1 score   =', "{0:.4f}".format(f1_baseline))
print('Total amt. =', "CHF {0:.2f}".format(amount_baseline))
print('Total cost =', "CHF {0:.2f}".format(cost_baseline))

**Comment**: We see that the model performs reasonably well judging from AUC and F1 scores. It still makes 28 FN prediction errors, which are particularly expensive. 
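
A caveat when reading the F1 score: with `average='weighted'`, as used in the cell above, the score is averaged over both classes in proportion to their frequency, so the majority class dominates it. The sketch below contrasts it with the F1 score of the fraud class alone on made-up labels; the data and names are illustrative only.

```python
from sklearn.metrics import f1_score

# Toy example: 98 legitimate and 2 fraudulent transactions (made-up data)
y_true_toy = [0] * 98 + [1] * 2

# Hypothetical classifier that catches only one of the two frauds
y_pred_toy = [0] * 98 + [1, 0]

# F1 averaged over both classes, weighted by class frequency
f1_weighted_toy = f1_score(y_true_toy, y_pred_toy, average="weighted")

# F1 of the positive (fraud) class only
f1_fraud_toy = f1_score(y_true_toy, y_pred_toy, average="binary")

print(f"Weighted F1:    {f1_weighted_toy:.4f}")
print(f"Fraud-class F1: {f1_fraud_toy:.4f}")
```

In this toy case the weighted score stays close to 1 even though half of the frauds are missed, which is one reason to look at the cost figure alongside the standard metrics.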

# **ADVANCED EXERCISE**

*Optional.* If time permits and you feel comfortable with Python, continue with the advanced parts of this exercise below.

## Part 4: Class weights - modifying algorithm parameters

One approach to cost-sensitive learning for imbalanced classification is to up-weight the minority class, which we will implement in this part.
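
Conceptually, a class weight scales how much each sample of that class contributes to the training loss. The following is a simplified sketch of that idea in plain Python, not scikit-learn's implementation: it ignores the regularization term, and `weighted_log_loss` is a hypothetical helper.

```python
import math

def weighted_log_loss(y_true, p_fraud, class_weight):
    """Sum of per-sample log losses, each scaled by the weight of its true class."""
    total = 0.0
    for y, p in zip(y_true, p_fraud):
        # Log loss of one sample given the predicted fraud probability p
        sample_loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += class_weight[y] * sample_loss
    return total

# One fraud and one legitimate transaction, both predicted with p(fraud) = 0.5
y_toy = [1, 0]
p_toy = [0.5, 0.5]

loss_uniform = weighted_log_loss(y_toy, p_toy, {0: 1, 1: 1})
loss_weighted = weighted_log_loss(y_toy, p_toy, {0: 1, 1: 1000})

print("Uniform weights:      ", round(loss_uniform, 3))
print("Fraud weighted x1000: ", round(loss_weighted, 3))
```

With the larger weight, an uncertain prediction on a fraud case is penalized far more than the same uncertainty on a legitimate transaction, which pushes the fitted model towards flagging more transactions as fraud.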

**Q 1**: Fit the same logit model as in Part 3, but set the `class_weight` parameter in the `LogisticRegression` function to represent different prediction error costs.

Hint: You can re-use the code from Part 3, Question 1. All you have to add is a value for the `class_weight` parameter in the `LogisticRegression` function.

In [None]:
# Binary classification using a logit model
lr_clf = LogisticRegression(max_iter = 1000, class_weight={0:COST_FP, 1: COST_FN})
lr_clf.fit(X_train, y_train)

# Predict fraud - as a binary outcome and a probability between 0 and 1
y_pred_lr       = lr_clf.predict(X_test)
y_pred_lr_proba = lr_clf.predict_proba(X_test)

# Plot confusion matrix
print('{} observations'.format(len(X_test)))
cm_weighted = plot_confusion_matrix(lr_clf, X_test, y_test, values_format='.5g')
cm_weighted

**Q 2**: Count the total prediction errors, compute the accuracy score, AUC score, and the total prediction costs.

Hint: You can re-use the code from Part 3, Question 2.

In [None]:
# How many errors did the model make?
cm_lr = confusion_matrix(y_test, y_pred_lr)
errors_weighted = cm_lr[0][1] + cm_lr[1][0]
cost_weighted = cm_lr[0][1] * COST_FP + cm_lr[1][0] * COST_FN
amount_weighted = sum(X_test[y_test != y_pred_lr]['Amount'])

acc_weighted  = accuracy_score(y_test, y_pred_lr)
auc_weighted  = roc_auc_score(y_test, y_pred_lr_proba[:, 1])
f1_weighted   = f1_score(y_test, y_pred_lr, average='weighted')

print('Num errors =', errors_weighted, '\n')
print('Accuracy   =', "{0:.4f}".format(acc_weighted))
print('AUC score  =', "{0:.4f}".format(auc_weighted))
print('F1 score   =', "{0:.4f}".format(f1_weighted))
print('Total amt. =', "CHF {0:.2f}".format(amount_weighted))
print('Total cost =', "CHF {0:.2f}".format(cost_weighted))

**Comment**: We see that the number of FNs was reduced from 28 to 8, substantially reducing the total prediction costs. However, the number of FPs increases by two orders of magnitude (!). Given the FN and FP costs, the benefits of reducing FNs outweigh the higher costs of more FPs.

## Part 5: Sample weights - modifying algorithm parameters

Another approach to cost-sensitive learning for imbalanced classification is to up-weight more important samples. We will use the transaction amount as a proxy for how important the prediction for that sample is.
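
Sample weights can be read as a generalization of class weights: giving every sample of a class the same weight should reproduce the class-weighted fit. The sketch below checks this on a small synthetic problem (not the credit card data); the weight of 10 and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic imbalanced problem (not the credit card data)
X_toy, y_toy = make_classification(
    n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=0
)

# Class weights: every sample of class 1 counts 10 times
clf_class = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 10})
clf_class.fit(X_toy, y_toy)

# The same weighting, expressed per sample
per_sample = np.where(y_toy == 1, 10.0, 1.0)
clf_sample = LogisticRegression(max_iter=1000)
clf_sample.fit(X_toy, y_toy, sample_weight=per_sample)

# Largest difference between the two sets of coefficients
max_diff = np.abs(clf_class.coef_ - clf_sample.coef_).max()
print("Max coefficient difference:", max_diff)
```

Using the transaction amount as the weight, as in this part, simply replaces the per-class constant with a value that differs from row to row.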

**Q 1**: Fit the same logit model as in Part 3, but pass the `sample_weight` argument to the `LogisticRegression.fit()` method to represent the transaction amount of each sample.

In [None]:
# Binary classification using a logit model
lr_clf = LogisticRegression(max_iter = 1000)
lr_clf.fit(X_train, y_train, sample_weight = X_train['Amount'])

# Predict fraud - as a binary outcome and a probability between 0 and 1
y_pred_lr       = lr_clf.predict(X_test)
y_pred_lr_proba = lr_clf.predict_proba(X_test)

# Plot confusion matrix
print('{} observations'.format(len(X_test)))
cm_weightedSample = plot_confusion_matrix(lr_clf, X_test, y_test, values_format='.5g')
cm_weightedSample

**Q 2**: Count the total prediction errors, compute the accuracy score, AUC score, and the total prediction costs.

Hint: You can re-use the code from Part 3, Question 2.

In [None]:
# How many errors did the model make?
cm_lr = confusion_matrix(y_test, y_pred_lr)
errors_weightedSample = cm_lr[0][1] + cm_lr[1][0]
cost_weightedSample = cm_lr[0][1] * COST_FP + cm_lr[1][0] * COST_FN
amount_weightedSample = sum(X_test[y_test != y_pred_lr]['Amount'])

acc_weightedSample  = accuracy_score(y_test, y_pred_lr)
auc_weightedSample  = roc_auc_score(y_test, y_pred_lr_proba[:, 1])
f1_weightedSample   = f1_score(y_test, y_pred_lr, average='weighted')

print('Num errors =', errors_weightedSample, '\n')
print('Accuracy   =', "{0:.4f}".format(acc_weightedSample))
print('AUC score  =', "{0:.4f}".format(auc_weightedSample))
print('F1 score   =', "{0:.4f}".format(f1_weightedSample))
print('Total amt. =', "CHF {0:.2f}".format(amount_weightedSample))
print('Total cost =', "CHF {0:.2f}".format(cost_weightedSample))

**Comment**: We see that the number of FNs increased from 8 to 21 and the number of FPs decreased from 4162 to 21. The total transaction amount of the 42 prediction errors substantially decreased from CHF 1'130'300.33 to CHF 3'186.50. In short, if the prediction costs change based on the transaction amount, we have greatly improved our model.
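
The comment above judges the sample-weighted model by the misclassified transaction amount. Under the fixed per-error costs from Part 0, the ranking can look different. The arithmetic below is a sketch that assumes the default costs (1000 per FN, 1 per FP) and takes the error counts exactly as quoted in the comment; actual counts may differ between runs and library versions.

```python
# Unit costs (defaults from Part 0)
COST_FN_DEFAULT = 1000
COST_FP_DEFAULT = 1

# Error counts as quoted in the comment above
errors_quoted = {
    "Class weights":  {"FN": 8,  "FP": 4162},
    "Sample weights": {"FN": 21, "FP": 21},
}

# Fixed-cost total per model: FP count times FP cost plus FN count times FN cost
fixed_costs = {
    name: c["FP"] * COST_FP_DEFAULT + c["FN"] * COST_FN_DEFAULT
    for name, c in errors_quoted.items()
}

for name, cost in fixed_costs.items():
    print(f"{name}: CHF {cost}")
# Class weights: CHF 12162
# Sample weights: CHF 21021
```

Which model counts as "better" therefore depends on whether errors are priced per case or by transaction amount.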

## Part 6: Data resampling - modifying training data

Yet another approach to tackling the class imbalance problem is to resample data. One can resample by undersampling the majority class (not fraudulent) or oversampling the minority class (fraudulent), as summarized by the image below. Given our pre-defined costs for prediction errors, this is also called "cost-proportionate resampling".

<img src="https://raw.githubusercontent.com/rafjaa/machine_learning_fecib/master/src/static/img/resampling.png" width="700" height="500" align="center"/>

Image source: https://raw.githubusercontent.com/rafjaa/machine_learning_fecib/master/src/static/img/resampling.png
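
The exercise below oversamples the minority class. For comparison, here is a minimal sketch of the other option mentioned above, undersampling the majority class, using the same `resample` function with `replace=False`. The toy frame and its values are made up for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training frame: 8 majority rows and 2 minority rows (made-up values)
toy_train = pd.DataFrame({
    "Amount": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    "Class":  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

toy_majority = toy_train[toy_train["Class"] == 0]
toy_minority = toy_train[toy_train["Class"] == 1]

# Undersample the majority class without replacement to match the minority class
toy_majority_down = resample(
    toy_majority,
    replace=False,
    n_samples=len(toy_minority),
    random_state=0,
)

toy_downsampled = pd.concat([toy_majority_down, toy_minority])
print(toy_downsampled["Class"].value_counts())
```

Undersampling discards majority rows, so with only a few hundred frauds it would shrink the training data drastically; that is a plausible reason to prefer oversampling here.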

**Q 1**: Upsample the minority class using the `resample` function in the `sklearn` package.

Hint: Only use the TRAINING data for resampling, not the full dataset. We want to leave the TESTING data untouched for evaluating performance.
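
The reason for this hint can be illustrated with a toy simulation: if the minority class is upsampled before the split, copies of the same row can land in both the training and the testing data, so the test set is no longer unseen. The sketch below uses made-up row ids and plain Python; it deliberately shows the wrong order of operations.

```python
import random

random.seed(0)

# Toy dataset: 5 minority rows (ids 0-4) and 95 majority rows (ids 5-99)
minority_ids = list(range(5))
majority_ids = list(range(5, 100))

# Wrong order: upsample the minority class first, then split
upsampled_minority = random.choices(minority_ids, k=len(majority_ids))
all_rows = majority_ids + upsampled_minority
random.shuffle(all_rows)

split = int(0.8 * len(all_rows))
train_rows, test_rows = all_rows[:split], all_rows[split:]

# Minority rows whose copies ended up on both sides of the split
leaked = set(train_rows) & set(test_rows) & set(minority_ids)
print("Minority rows present in both train and test:", sorted(leaked))
```

Splitting first and resampling only the training part, as done in the solution below, avoids this overlap.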

In [None]:
# Re-construct the training data with the target variable
Xy_train = X_train.copy()
Xy_train['Class'] = y_train

# Separate majority and minority classes
df_majority = Xy_train[Xy_train['Class'] == 0]
df_minority = Xy_train[Xy_train['Class'] == 1]

print('{} majority samples in the training data.'.format(len(df_majority)))
 
# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace = True,                # sample with replacement
                                 n_samples = len(df_majority),  # to match majority class
                                 random_state = SEED)           # reproducible results
 
# Combine majority class with upsampled minority class
df_upsampled_train = pd.concat([df_majority, df_minority_upsampled])
 
# Display new class counts
df_upsampled_train['Class'].value_counts()

**Q 2**: Re-fit the same logit model as in Part 3, but with the upsampled training data and unchanged testing data. Before doing that, make sure to split the features (X) from the target (y) variable.

In [None]:
# Divide data into features and target variables
X_train_upsampled = df_upsampled_train.drop(columns=['Class'], inplace=False)
y_train_upsampled = df_upsampled_train['Class']

print(X_train_upsampled.shape , X_test.shape , y_train_upsampled.shape , y_test.shape)

In [None]:
# Binary classification using a logit model
lr_clf = LogisticRegression(max_iter = 1000)
lr_clf.fit(X_train_upsampled, y_train_upsampled)

# Predict fraud - as a binary outcome and a probability between 0 and 1
y_pred_lr       = lr_clf.predict(X_test)
y_pred_lr_proba = lr_clf.predict_proba(X_test)

# Plot confusion matrix
print('{} observations'.format(len(X_test)))
cm_resampled = plot_confusion_matrix(lr_clf, X_test, y_test, values_format='.5g')
cm_resampled

**Q 3**: Count the total prediction errors, compute the accuracy score, AUC score, and the total prediction costs. What changed?

Hint: You can re-use the code from Part 3, Question 2.

In [None]:
# How many errors did the model make?
cm_lr = confusion_matrix(y_test, y_pred_lr)
errors_resampled = cm_lr[0][1] + cm_lr[1][0]
cost_resampled = cm_lr[0][1] * COST_FP + cm_lr[1][0] * COST_FN
amount_resampled = sum(X_test[y_test != y_pred_lr]['Amount'])

acc_resampled  = accuracy_score(y_test, y_pred_lr)
auc_resampled  = roc_auc_score(y_test, y_pred_lr_proba[:, 1])
f1_resampled   = f1_score(y_test, y_pred_lr, average='weighted')

print('Num errors =', errors_resampled, '\n')
print('Accuracy   =', "{0:.4f}".format(acc_resampled))
print('AUC score  =', "{0:.4f}".format(auc_resampled))
print('F1 score   =', "{0:.4f}".format(f1_resampled))
print('Total amt. =', "CHF {0:.2f}".format(amount_resampled))
print('Total cost =', "CHF {0:.2f}".format(cost_resampled))

**Comment**: We see that the number of False Negatives increased slightly from 8 to 9, while the number of False Positives decreased substantially. In sum, the total costs are slightly lower for the data resampling approach than for the class weights approach. If the prediction costs depend on the type of error (FNs and FPs), we have further improved our model.
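
One possible reading of why resampling and class weights behave similarly: upsampling frauds until the classes are balanced is roughly like giving each fraud a weight equal to the majority-to-minority ratio. The sketch below computes that ratio from the counts quoted in the Data section; it assumes the stratified training split has about the same ratio as the full dataset.

```python
# Counts quoted in the Data section (full dataset)
n_transactions = 284_807
n_frauds = 492
n_legit = n_transactions - n_frauds

# Upsampling frauds until both classes are equally large repeats each
# fraud row, on average, this many times
implied_fraud_weight = n_legit / n_frauds

# Relative weight of fraud under the default class weights of Part 4
class_weight_ratio = 1000 / 1

print(f"Implied weight from balancing: {implied_fraud_weight:.0f}")
print(f"Weight ratio used in Part 4:   {class_weight_ratio:.0f}")
```

Both approaches put a weight of the same order of magnitude on fraud cases, with balancing being somewhat less aggressive, which is consistent with fewer False Positives in the resampled model.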

## Part 7: Summary of confusion matrices and model performances


### Confusion matrices

In [None]:
# Baseline
cm_baseline.plot(values_format='.5g')

In [None]:
# Class weights
cm_weighted.plot(values_format='.5g')

In [None]:
# Sample weights
cm_weightedSample.plot(values_format='.5g')

In [None]:
# Resampling
cm_resampled.plot(values_format='.5g')

### Performance, errors, and costs

In [None]:
# Print summary of model performances
width     = 25
width_box = 100
models    = ['Baseline', 'Class weights', 'Sample weights', 'Resampling']
metrics   = [' Accuracy', ' AUC', ' F1', ' Num errors', ' CHF amount', ' CHF costs']
accs      = [acc_baseline, acc_weighted, acc_weightedSample, acc_resampled]
aucs      = [auc_baseline, auc_weighted, auc_weightedSample, auc_resampled]
f1s       = [f1_baseline, f1_weighted, f1_weightedSample, f1_resampled]
errors    = [errors_baseline, errors_weighted, errors_weightedSample, errors_resampled]
costs     = [cost_baseline, cost_weighted, cost_weightedSample, cost_resampled]
amounts   = [amount_baseline, amount_weighted, amount_weightedSample, amount_resampled]
summary   = [accs, aucs, f1s, errors, amounts, costs]

print('Summary table: Predictive performance on TEST data.')
print(str('=' * width * (len(models)+1)))
print(''.ljust(width) + '{}'.format(models[0]).ljust(width) + '{}'.format(models[1]).ljust(width) + '{}'.format(models[2]).ljust(width) + '{}'.format(models[3]).ljust(width))
print(str('=' * width * (len(models)+1)))
for i in range(len(metrics)):
    line = metrics[i].ljust(width) + '{}'.format(round(summary[i][0], 4)).ljust(width) + '{}'.format(round(summary[i][1], 4)).ljust(width) + '{}'.format(round(summary[i][2], 4)).ljust(width) + '{}'.format(round(summary[i][3], 4)).ljust(width)
    print(line.center(width_box))
print()

# Further reading

- [SMOTE for Imbalanced Classification with Python](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/)