## Predicting term deposit subscriptions

The task is to predict whether a bank client will subscribe to a term deposit. This is a typical imbalanced problem because most clients do not subscribe: in the dataset used below, only about 11% of clients (4,640 of 41,188) subscribed. Correctly identifying clients likely to subscribe (the minority class) is important for targeted marketing campaigns.

Similar to undersampling, the goal of oversampling is to address the bias towards the majority class in imbalanced datasets. However, instead of removing data from the majority class, random oversampling replicates instances from the minority class to increase its representation in the training data.  

This technique is particularly useful when the dataset is small and discarding majority class data (as in undersampling) would lose a significant amount of information. The main drawback of simple random oversampling is the risk of overfitting: because the added rows are exact copies, the model may fit the duplicated minority instances too closely and generalize poorly to unseen data.
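To make the mechanism concrete, here is a minimal standard-library sketch of random oversampling on toy data. The function `random_oversample` and the toy rows are illustrative assumptions, not part of `imblearn`; the idea (sample minority rows with replacement until the classes are the same size) is roughly what `RandomOverSampler` does under its default settings.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every class
    is as large as the largest one. All original rows are kept."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        extra = rng.choices(pool, k=target - count)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Toy data (hypothetical): 8 majority rows (label 0) and 2 minority rows (label 1)
X_toy = [[i, i * 10] for i in range(10)]
y_toy = [0] * 8 + [1] * 2

X_res, y_res = random_oversample(X_toy, y_toy)
print(sorted(Counter(y_res).items()))  # [(0, 8), (1, 8)]
```

Every added row is a copy of one of the two original minority rows, which is why the technique adds no new information about the minority class.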

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import RandomOverSampler
import warnings

# Suppress potential warnings
warnings.filterwarnings("ignore")

In [10]:
# 1. Load and prepare the dataset
df = pd.read_csv('bank-additional-full.csv', sep=';')

# Map the target variable to numerical values
df['y'] = df['y'].map({'yes': 1, 'no': 0})

# Select only numerical features for simplicity
X = df.select_dtypes(include=['int64', 'float64']).drop('y', axis=1) # Drop 'y' from features
y = df['y']

# Handle potential missing values (simple mean imputation)
X = X.fillna(X.mean())

print("Original dataset class distribution:")
print(y.value_counts())
print("-" * 30)

Original dataset class distribution:
y
0    36548
1     4640
Name: count, dtype: int64
------------------------------


In [11]:
# 2. Split the original data into training and testing sets
# Stratify ensures the test set has a similar class distribution
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print("Training set class distribution before oversampling:")
print(y_train.value_counts())
print("\nTest set class distribution:")
print(y_test.value_counts())
print("-" * 30)

Training set class distribution before oversampling:
y
0    25583
1     3248
Name: count, dtype: int64

Test set class distribution:
y
0    10965
1     1392
Name: count, dtype: int64
------------------------------


In [12]:
# 3. Train a model on the original imbalanced training data (Baseline)
print("Training model on original imbalanced data (Baseline)...")
model_original = LogisticRegression(solver='liblinear', random_state=42)
model_original.fit(X_train, y_train)

Training model on original imbalanced data (Baseline)...


In [13]:
# 4. Evaluate the baseline model on the original test set
print("Evaluating baseline model on original test set:")
y_pred_original = model_original.predict(X_test)
print(classification_report(y_test, y_pred_original, target_names=['no', 'yes']))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_original))
print("-" * 30)

Evaluating baseline model on original test set:
              precision    recall  f1-score   support

          no       0.93      0.98      0.95     10965
         yes       0.67      0.39      0.49      1392

    accuracy                           0.91     12357
   macro avg       0.80      0.68      0.72     12357
weighted avg       0.90      0.91      0.90     12357

Confusion Matrix:
[[10702   263]
 [  854   538]]
------------------------------


In [14]:
# 5. Apply Random Oversampling to the training data only
print("Applying Random Oversampling to training data...")
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)
print("Training set class distribution after Random Oversampling:")
print(y_train_ros.value_counts())
print("Shape after Random Oversampling:", X_train_ros.shape)
print("-" * 30)

Applying Random Oversampling to training data...
Training set class distribution after Random Oversampling:
y
0    25583
1    25583
Name: count, dtype: int64
Shape after Random Oversampling: (51166, 10)
------------------------------


In [15]:
# 6. Train the same model on the oversampled training data
print("Training model on oversampled data...")
model_ros = LogisticRegression(solver='liblinear', random_state=42)
model_ros.fit(X_train_ros, y_train_ros)

Training model on oversampled data...


In [16]:
# 7. Evaluate the model on the original test set
print("Evaluating model on original test set (trained on oversampled data):")
y_pred_ros = model_ros.predict(X_test)
print(classification_report(y_test, y_pred_ros, target_names=['no', 'yes']))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_ros))
print("-" * 30)

Evaluating model on original test set (trained on oversampled data):
              precision    recall  f1-score   support

          no       0.98      0.85      0.91     10965
         yes       0.42      0.87      0.57      1392

    accuracy                           0.85     12357
   macro avg       0.70      0.86      0.74     12357
weighted avg       0.92      0.85      0.87     12357

Confusion Matrix:
[[9293 1672]
 [ 183 1209]]
------------------------------


In [18]:
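# 8. Rough overfitting check: compare accuracy on the oversampled training set
# with accuracy on the original test set. The two sets have different class
# balances, so the numbers are only loosely comparable.
train_accuracy = model_ros.score(X_train_ros, y_train_ros)
test_accuracy = model_ros.score(X_test, y_test)

print('train_accuracy: ', train_accuracy)
print('test_accuracy: ', test_accuracy)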

train_accuracy:  0.8482976976898722
test_accuracy:  0.8498826576029781


Continuing with the Bank Marketing scenario, the use case remains predicting term deposit subscriptions. Random oversampling is applied here to see if increasing the representation of subscribers in the training data helps the model better identify potential subscribers.

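Model trained on Original Data (Baseline): As seen with undersampling, this model has high precision and recall for the majority class ('no': 0.93 and 0.98) but low recall and F1-score for the minority class ('yes': 0.39 and 0.49). It misses 854 of the 1,392 actual subscribers in the test set.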

Model trained on Oversampled Data: Recall for the minority class ('yes') rises sharply, from 0.39 to 0.87. The model, having seen the 'yes' class far more often during training, is now better at identifying it. The price is lower precision for the 'yes' class (0.67 to 0.42) and lower recall for the 'no' class (0.98 to 0.85), while precision for 'no' actually improves (0.93 to 0.98). The confusion matrix shows far fewer False Negatives (actual 'yes' predicted as 'no': 854 to 183) but many more False Positives (actual 'no' predicted as 'yes': 263 to 1,672).
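As a quick sanity check, the minority-class figures quoted above can be recomputed directly from the two confusion matrices printed earlier. The helper name `minority_metrics` is an illustrative assumption; the matrices are copied from the notebook output, in scikit-learn's `[[TN, FP], [FN, TP]]` layout.

```python
def minority_metrics(cm):
    """Return (precision, recall, f1) for the positive class
    from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion matrices copied from the outputs above
baseline = [[10702, 263], [854, 538]]
oversampled = [[9293, 1672], [183, 1209]]

for name, cm in [("baseline", baseline), ("oversampled", oversampled)]:
    p, r, f = minority_metrics(cm)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# baseline: precision=0.67 recall=0.39 f1=0.49
# oversampled: precision=0.42 recall=0.87 f1=0.57
```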

Random oversampling helped the model pay more attention to the minority class during training by artificially increasing its size. 

The primary benefit is improved recall for the minority class. This is valuable when the cost of missing a minority instance is high. Unlike undersampling, it also keeps all of the original data, which is an advantage for smaller datasets.

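Trade-off: The main risk is overfitting. By simply duplicating instances, the model doesn't learn new information about the minority class; it just sees the same examples multiple times. This can lead the model to become too specialized in the duplicated instances and to perform poorly on unseen, slightly different minority examples. In this run there is little sign of that: accuracy on the oversampled training set (0.848) and on the test set (0.850) is nearly identical, although the two sets have different class balances, so the comparison is only a rough check. The drop in 'yes' precision comes mainly from the model predicting 'yes' far more often after rebalancing (2,881 test predictions versus 801 for the baseline). Because the recall gain outweighs the precision loss, the F1-score for 'yes' still improves, from 0.49 to 0.57.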

In the Bank Marketing context, oversampling helps identify more potential subscribers (higher recall), which is good for lead generation. However, precision for 'yes' dropped to 0.42, which means that more than half of the clients flagged as likely subscribers do not subscribe, so the bank may waste resources contacting them (False Positives). This highlights the importance of evaluating the model's business impact based on the specific costs of False Positives and False Negatives. The overfitting risk that comes from exact duplicates can be reduced with SMOTE, which generates synthetic minority examples by interpolating between existing ones instead of copying them.
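One way to put the cost argument into numbers is to weight the two error types and compare the models. The sketch below reuses the confusion matrices from the outputs above; the cost values `COST_FP` and `COST_FN` are purely hypothetical placeholders (the notebook gives no real cost figures), and `total_cost` is an illustrative helper.

```python
def total_cost(cm, cost_fp, cost_fn):
    """Total misclassification cost for a 2x2 confusion matrix
    laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    return fp * cost_fp + fn * cost_fn

# Confusion matrices copied from the outputs above
baseline = [[10702, 263], [854, 538]]
oversampled = [[9293, 1672], [183, 1209]]

# Hypothetical costs: a wasted call costs 1 unit, a missed subscriber costs 10
COST_FP = 1.0
COST_FN = 10.0

cost_baseline = total_cost(baseline, COST_FP, COST_FN)
cost_oversampled = total_cost(oversampled, COST_FP, COST_FN)
print(cost_baseline, cost_oversampled)  # 8803.0 3502.0

# Cost ratio (FN cost / FP cost) at which both models cost the same:
# extra false positives divided by avoided false negatives.
extra_fp = oversampled[0][1] - baseline[0][1]    # 1672 - 263 = 1409
avoided_fn = baseline[1][0] - oversampled[1][0]  # 854 - 183 = 671
break_even = extra_fp / avoided_fn
print(f"{break_even:.2f}")  # 2.10
```

Under these assumed costs the oversampled model is cheaper. More generally, on this test set it comes out ahead whenever a missed subscriber costs more than roughly 2.1 times a wasted contact; below that ratio the baseline is the cheaper choice.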