## Predicting term deposit subscriptions

The task is to predict whether a bank client will subscribe to a term deposit. This is a typical imbalanced problem because most clients do not subscribe: in this dataset only about 11% do. Correctly identifying clients likely to subscribe (the minority class) is important for targeted marketing campaigns.

When dealing with imbalanced data, training a model directly on the original data often produces a model that predicts the majority class well but performs poorly on the minority class. The model sees many more examples of the majority class, so minimizing overall error pushes it towards predicting that class.

Random undersampling mitigates this by randomly dropping instances of the majority class until it is balanced with the minority class. The goal is to give the minority class more weight during training. The drawback is that the discarded majority-class rows may have contained valuable information.
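To make the idea concrete, here is a minimal sketch of random undersampling in plain Python. The helper `random_undersample` and the toy data are illustrative assumptions, not the imblearn implementation used later in this notebook; the sketch only shows the core step of sampling each class down to the size of the smallest one.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=42):
    """Illustrative helper: sample every class down to the smallest class size.

    Rows are drawn without replacement, so no row is duplicated.
    """
    rng = random.Random(seed)

    # Group row indices by class label
    indices_by_class = {}
    for idx, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(idx)

    # Every class is reduced to the size of the rarest class
    target_size = min(len(idxs) for idxs in indices_by_class.values())

    kept = []
    for idxs in indices_by_class.values():
        kept.extend(rng.sample(idxs, target_size))
    kept.sort()  # keep the original row order

    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data: 90 majority rows (label 0) and 10 minority rows (label 1)
toy_labels = [0] * 90 + [1] * 10
toy_rows = [[i] for i in range(100)]

toy_rows_rus, toy_labels_rus = random_undersample(toy_rows, toy_labels)
toy_counts = Counter(toy_labels_rus)
print(toy_counts)
```

In this toy case all 10 minority rows are kept and 80 of the 90 majority rows are thrown away, which is the information loss mentioned above.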

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.under_sampling import RandomUnderSampler
import warnings

# Suppress potential warnings from imblearn or sklearn
warnings.filterwarnings("ignore")

In [13]:
# 1. Load and prepare the dataset
df = pd.read_csv('bank-additional-full.csv', sep=';')

In [14]:
# Map the target variable to numerical values
df['y'] = df['y'].map({'yes': 1, 'no': 0})

# Select only numerical features for simplicity
X = df.select_dtypes(include=['int64', 'float64']).drop('y', axis=1) # Drop 'y' from features
y = df['y']

# Handle potential missing values (simple mean imputation)
X = X.fillna(X.mean())

print("Original dataset class distribution:")
print(y.value_counts())
print("-" * 30)

Original dataset class distribution:
y
0    36548
1     4640
Name: count, dtype: int64
------------------------------
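The class counts printed above already show why accuracy alone is a weak yardstick here. The short sketch below, which simply hard-codes those two counts, computes what a trivial "always predict 'no'" rule would score; it is an illustration, not part of the modelling pipeline.

```python
# Class counts copied from the distribution printed above
n_no, n_yes = 36548, 4640
total = n_no + n_yes

# A rule that always predicts 'no' is right for every 'no' client
baseline_accuracy = n_no / total
# ...and never flags a single subscriber
baseline_recall_yes = 0 / n_yes
imbalance_ratio = n_no / n_yes

print(f"Share of 'yes' clients:        {n_yes / total:.3f}")
print(f"Always-'no' accuracy:          {baseline_accuracy:.3f}")
print(f"Always-'no' recall for 'yes':  {baseline_recall_yes:.3f}")
print(f"'no' clients per 'yes' client: {imbalance_ratio:.2f}")
```

Any model evaluated on this data should therefore be judged against roughly 89% accuracy, and on its recall and precision for the 'yes' class rather than on accuracy alone.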


In [15]:

# 2. Split the original data into training and testing sets
# Stratify ensures that the test set has a similar class distribution to the original dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print("Training set class distribution before undersampling:")
print(y_train.value_counts())
print("\nTest set class distribution:")
print(y_test.value_counts())
print("-" * 30)

Training set class distribution before undersampling:
y
0    25583
1     3248
Name: count, dtype: int64

Test set class distribution:
y
0    10965
1     1392
Name: count, dtype: int64
------------------------------


In [16]:

# 3. Train a model on the original imbalanced training data
print("Training model on original imbalanced data...")
model_original = LogisticRegression(solver='liblinear', random_state=42)
model_original.fit(X_train, y_train)

Training model on original imbalanced data...


In [17]:
# 4. Evaluate the model on the original test set
print("Evaluating model on original test set (trained on original data):")
y_pred_original = model_original.predict(X_test)
print(classification_report(y_test, y_pred_original, target_names=['no', 'yes']))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_original))
print("-" * 30)

Evaluating model on original test set (trained on original data):
              precision    recall  f1-score   support

          no       0.93      0.98      0.95     10965
         yes       0.67      0.39      0.49      1392

    accuracy                           0.91     12357
   macro avg       0.80      0.68      0.72     12357
weighted avg       0.90      0.91      0.90     12357

Confusion Matrix:
[[10702   263]
 [  854   538]]
------------------------------


In [18]:
# 5. Apply Random Undersampling to the training data only
print("Applying Random Undersampling to training data...")
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

Applying Random Undersampling to training data...


In [19]:
print("Training set class distribution after Random Undersampling:")
print(y_train_rus.value_counts())
print("Shape after Random Undersampling:", X_train_rus.shape)
print("-" * 30)

Training set class distribution after Random Undersampling:
y
0    3248
1    3248
Name: count, dtype: int64
Shape after Random Undersampling: (6496, 10)
------------------------------


In [20]:
# 6. Train the same model on the undersampled training data
print("Training model on undersampled data...")
model_rus = LogisticRegression(solver='liblinear', random_state=42)
model_rus.fit(X_train_rus, y_train_rus)

Training model on undersampled data...


In [21]:
# 7. Evaluate the model on the original test set
print("Evaluating model on original test set (trained on undersampled data):")
y_pred_rus = model_rus.predict(X_test)
print(classification_report(y_test, y_pred_rus, target_names=['no', 'yes']))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rus))
print("-" * 30)


Evaluating model on original test set (trained on undersampled data):
              precision    recall  f1-score   support

          no       0.98      0.85      0.91     10965
         yes       0.42      0.86      0.56      1392

    accuracy                           0.85     12357
   macro avg       0.70      0.86      0.74     12357
weighted avg       0.92      0.85      0.87     12357

Confusion Matrix:
[[9284 1681]
 [ 188 1204]]
------------------------------
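The 'yes'-class figures in both classification reports can be recomputed by hand from the two confusion matrices printed in this notebook. The sketch below assumes scikit-learn's layout for a binary matrix, `[[TN, FP], [FN, TP]]`; the helper name `minority_metrics` is made up for this illustration.

```python
def minority_metrics(cm):
    """Precision, recall and F1 for the positive class of a 2x2 matrix
    laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)   # of the clients flagged 'yes', how many were
    recall = tp / (tp + fn)      # of the actual 'yes' clients, how many found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Matrices copied from the two evaluation outputs
cm_original = [[10702, 263], [854, 538]]
cm_undersampled = [[9284, 1681], [188, 1204]]

for name, cm in [("original", cm_original), ("undersampled", cm_undersampled)]:
    precision, recall, f1 = minority_metrics(cm)
    print(f"{name:>12}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Working from the raw counts makes the trade-off easy to see: undersampling turns 666 false negatives into true positives (854 down to 188) while adding 1,418 false positives (263 up to 1,681).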


In [22]:
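# Accuracy on the balanced (undersampled) training set vs. the imbalanced test set.
# The two sets have different class ratios, so similar values here are only a
# rough sanity check; the per-class metrics above are more informative.
train_accuracy = model_rus.score(X_train_rus, y_train_rus)
test_accuracy = model_rus.score(X_test, y_test)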

print('train_accuracy: ', train_accuracy)
print('test_accuracy: ', test_accuracy)

train_accuracy:  0.8505233990147784
test_accuracy:  0.8487496965282836


Model trained on original data: It has high precision and F1-score for the 'no' class (0.93 and 0.95) but much lower recall and F1-score for the 'yes' class (0.39 and 0.49). The model is very good at identifying clients who won't subscribe but misses most of those who will: the confusion matrix shows 854 false negatives (actual 'yes' predicted as 'no') against only 538 true positives.

Model trained on undersampled data: Recall for the 'yes' class rises from 0.39 to 0.86, so the model now finds 1,204 of the 1,392 subscribers in the test set. This comes at the cost of precision for the 'yes' class, which falls from 0.67 to 0.42 as false positives (predicting 'yes' when it's 'no') grow from 263 to 1,681. Recall on the majority class drops from 0.98 to 0.85 and overall accuracy from 0.91 to 0.85, while the F1-score for 'yes' improves from 0.49 to 0.56.

### Effect of undersampling on performance

Random undersampling directly addresses the class imbalance during training by reducing the dominance of the majority class. By training on a more balanced dataset, the model's decision boundary is less biased towards the majority class.   
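One way to quantify this shift: when rows are dropped based only on their class label, the odds of 'yes' for any given client are multiplied by a constant equal to the ratio of the sampling rates. The sketch below applies that standard result to the training counts from this notebook. It assumes the relationship holds for the fitted logistic regression, which is only approximately true in practice, so the numbers are indicative rather than exact.

```python
import math

# Training-set class counts before undersampling (from the output earlier)
n_no_train, n_yes_train = 25583, 3248

# All 'yes' rows are kept and only a fraction n_yes/n_no of 'no' rows,
# so the odds of 'yes' are multiplied by n_no/n_yes for every client.
odds_multiplier = n_no_train / n_yes_train
logit_shift = math.log(odds_multiplier)

# Under the assumption above, a 0.5 cutoff on the undersampled model
# corresponds to this cutoff on probabilities calibrated to the original data.
equivalent_threshold = 1 / (1 + odds_multiplier)

print(f"Odds multiplier:      {odds_multiplier:.2f}")
print(f"Shift in log-odds:    {logit_shift:.2f}")
print(f"Equivalent threshold: {equivalent_threshold:.3f}")
```

Read this way, training on the balanced data behaves roughly like flagging every client whose original-scale subscription probability exceeds about 11%, instead of 50%. It also means the probabilities output by the undersampled model overstate the true subscription rate and should not be read as calibrated estimates.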

Benefit: It improves the model's ability to detect the minority class, leading to higher recall for that class. This matters most in use cases where missing a minority instance is costly (e.g., a fraudulent transaction or a potential disease case).
Trade-off: It tends to decrease precision for the minority class because the model becomes more prone to classifying majority instances as minority. It also discards training data (here 22,335 of the 25,583 majority-class training rows, about 87%), which could be detrimental if the original dataset is small or if the discarded majority instances contained important variations.

In the context of the Bank Marketing dataset, improved recall for the 'yes' class means the bank can potentially identify a larger proportion of clients who would actually subscribe, allowing for more effective targeted marketing, even if it means contacting some clients who won't subscribe (lower precision). The business goal would determine the acceptable trade-off between recall and precision.
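To illustrate how the business goal can decide between the two models, the sketch below attaches purely hypothetical economics to the two confusion matrices from this notebook. The value per subscription and cost per contact are invented for illustration and are not taken from the dataset; the sketch also assumes the bank contacts exactly the clients the model flags as 'yes'.

```python
def campaign_outcome(cm, value_per_subscription, cost_per_contact):
    """Contacts made, subscribers reached and net value for a 2x2
    confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    contacts = tp + fp  # everyone predicted 'yes' is contacted
    net_value = tp * value_per_subscription - contacts * cost_per_contact
    return contacts, tp, net_value


cm_original = [[10702, 263], [854, 538]]
cm_undersampled = [[9284, 1681], [188, 1204]]

# HYPOTHETICAL economics: each subscription is worth 50 units,
# and a contact costs either 5 units (cheap) or 25 units (expensive).
for cost in (5.0, 25.0):
    for name, cm in [("original", cm_original), ("undersampled", cm_undersampled)]:
        contacts, reached, net = campaign_outcome(cm, 50.0, cost)
        print(f"cost={cost:>4}: {name:>12} contacts={contacts} "
              f"subscribers={reached} net={net:.0f}")
```

With cheap contacts the high-recall undersampled model comes out ahead under these made-up numbers, while with expensive contacts the more precise original model does. Which regime applies is a business question, not something the metrics can settle on their own.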

