## **Boosting Marketing Efficiency: Targeted Bank Campaign by Customer Subscription Behavior**

**Overall Project Objective:** 

Develop a data-driven marketing strategy that maximizes return on investment (ROI) by using predictive modeling to identify the optimal trade-off between broad customer outreach and precision marketing.

**Notebook 2 of 3: Feature Engineering & Predictive Modeling**

This notebook covers the core technical steps of the predictive modeling pipeline. 

The primary goals are to:
- Prepare the cleaned data for machine learning through feature engineering.
- Build and evaluate several classification models.
- Identify the best-performing model for predicting term deposit subscriptions.
- Analyze the precision-recall trade-off to inform the final strategy.

### **Data Load**

Load the required libraries and the cleaned dataset.

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, RobustScaler
from xgboost import XGBClassifier

from imblearn.over_sampling import SMOTE

from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, roc_curve


In [15]:
file_path = '../data/bank_cleaned.csv'
df = pd.read_csv(file_path)

**-Data Overview-**

This notebook uses `bank_cleaned.csv`, the cleaned and imputed output of notebook 1.

Data at a glance:
- Dataset: bank_cleaned.csv
- Observations: 4,521
- Variables: 17 (pre-feature engineering)

In [16]:
df.head()

| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown_outcome | no |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unspecified | 3 | jun | 199 | 4 | -1 | 0 | unknown_outcome | no |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unspecified | 5 | may | 226 | 1 | -1 | 0 | unknown_outcome | no |


In [17]:
# verify loaded data
print("--- Cleaned Data ---\n")
print(f"Dataset shape: {df.shape}\n")
print("--- Missing Values Check ---\n") 
print(df.isnull().sum())

--- Cleaned Data ---

Dataset shape: (4521, 17)

--- Missing Values Check ---

age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64


*The dataset contains 4,521 observations and 17 variables (16 input variables and 1 output variable), with no missing values.*

### **Feature Engineering**

The feature engineering process involves three primary steps:
1. **Feature removal:** Several columns will be removed before model training:

    - Data leakage features: The features identified as sources of data leakage ('duration', 'campaign') will be dropped.

    - Low-impact variable: The 'day' variable will be excluded. Because seasonality is already captured by the 'month' variable, the specific day of the month is unlikely to provide a meaningful signal and may add more noise than predictive value.

2. **Categorical variables conversion:** Binary categorical variables will be converted with binary encoding into a numerical format (1s and 0s) that machine learning models can process.

3. **Encoding categorical variables:** Multi-class categorical variables will be one-hot encoded. This technique converts each feature into multiple new binary columns, which prevents the model from assuming a false ordinal relationship between the categories.

In [18]:
# keep original clean df intact
df_model = df.copy()

**Feature Removal**

*'duration', 'campaign', 'day'*

In [19]:
# drop variables (duration, campaign, day)
exclude_cols = ['duration', 'campaign', 'day']
df_model = df_model.drop(columns=exclude_cols)
print("--- Remaining Variables ---\n")
print(df_model.columns.to_list())

--- Remaining Variables ---

['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact', 'month', 'pdays', 'previous', 'poutcome', 'y']


**Categorical Variables Conversion**

*'y', 'default', 'housing', 'loan'*

In [20]:
binary_values = {'yes': 1, 'no': 0}

df_model['y'] = df_model['y'].map(binary_values)
df_model['default'] = df_model['default'].map(binary_values)
df_model['housing'] = df_model['housing'].map(binary_values)
df_model['loan'] = df_model['loan'].map(binary_values)

print("--- Binary Variables Mapped ---\n")
print(df_model[['y','default','housing','loan']].head())

--- Binary Variables Mapped ---

   y  default  housing  loan
0  0        0        0     0
1  0        0        1     1
2  0        0        1     0
3  0        0        1     1
4  0        0        1     0
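
One caveat with `Series.map` and a dictionary: any value that is not a key in the dictionary becomes `NaN` instead of raising an error. A minimal sketch of a post-mapping check, using a small made-up series (the name `toy` and its values are illustrative and not taken from the dataset):

```python
import pandas as pd

binary_values = {'yes': 1, 'no': 0}

# hypothetical toy column with one unexpected label ('Yes' with a capital Y)
toy = pd.Series(['yes', 'no', 'Yes', 'no'])
mapped = toy.map(binary_values)

# unmapped labels surface as NaN, so count them before modeling
n_unmapped = int(mapped.isna().sum())
print("Unmapped values:", n_unmapped)
```

Running the same `isna().sum()` check on the four mapped columns would confirm that every original value was either 'yes' or 'no'.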


**Multi-Class Categorical Variables Encoding**

*'job', 'marital', 'education', 'contact', 'month', 'poutcome'*

In [21]:
multi_cols = ['job','marital','education','contact','month','poutcome']
df_model = pd.get_dummies(data=df_model, columns=multi_cols, drop_first=True)

print("--- Multi-Class Variables Encoded ---\n")
print("New dataset shape:", df_model.shape)
df_model.head(1)

--- Multi-Class Variables Encoded ---

New dataset shape: (4521, 38)


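| | age | default | balance | housing | loan | pdays | previous | y | job_blue-collar | job_entrepreneur | ... | month_jul | month_jun | month_mar | month_may | month_nov | month_oct | month_sep | poutcome_other | poutcome_success | poutcome_unknown_outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | 0 | 1787 | 0 | 0 | -1 | 0 | 0 | False | False | ... | False | False | False | False | False | True | False | False | False | True |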


This section transformed the cleaned dataset into a model-ready, fully numeric format.

- Feature exclusion: The 'duration' and 'campaign' variables were dropped to prevent data leakage, and the 'day' variable was dropped to reduce noise.

- Binary encoding: The target variable 'y' and the binary features ('default', 'housing', 'loan') were converted from yes/no to 1/0.

- One-hot encoding: The 6 multi-class categorical features were converted into numeric dummy variables.

The final dataset is fully numeric and is ready to be split for model training.
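
The `head(1)` output above shows the dummy columns as `True`/`False`. Most estimators accept boolean columns, but if integer 0/1 columns are preferred, `pd.get_dummies` accepts a `dtype` argument. A small sketch on a made-up frame (the name `toy` and its values are illustrative), which also shows which category `drop_first=True` removes:

```python
import pandas as pd

# hypothetical toy frame with a single multi-class column
toy = pd.DataFrame({'marital': ['married', 'single', 'divorced', 'married']})

# dtype=int yields 0/1 integers; drop_first=True drops the first category
# in sorted order ('divorced'), which becomes the implicit reference level
encoded = pd.get_dummies(toy, columns=['marital'], drop_first=True, dtype=int)

print(encoded.columns.to_list())
print(encoded)
```

A row with zeros in every remaining dummy column therefore represents the dropped reference category.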

### **Data Split & Feature Scaling**

Before training, the data will be prepared in two steps to ensure reliable model evaluation and performance.

1. **Stratified data split:** The dataset will be split into an 80% training set and a 20% test set. Because the dataset is imbalanced (approximately 11.5% positive class), we will use a **stratified split.** This keeps the proportion of subscribers nearly identical in the training and test sets, which is a best practice for imbalanced classification problems.

2. **Feature scaling:** After splitting the data, the numerical variables will be capped at the 1st and 99th percentiles and then scaled. Scaling is necessary so that large-valued features such as 'balance' do not dominate distance-based algorithms like SMOTE. Both the capping bounds and the scaler are fit on the training set only, after the split, to avoid data leakage from the test set.
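
To make the leakage point concrete, here is a minimal NumPy sketch of fit-on-train, apply-to-test preprocessing on made-up numbers. The clipping bounds, median, and interquartile range are computed from the training values and reused unchanged on the test values. The arrays are illustrative, and the median/IQR formula mirrors the default behavior of `RobustScaler` (center on the median, scale by the 25th to 75th percentile range).

```python
import numpy as np

# hypothetical balances; the last training value is an extreme outlier
train = np.array([0.0, 100.0, 200.0, 300.0, 400.0, 50_000.0])
test = np.array([150.0, 90_000.0])

# all statistics come from the training data only
lower, upper = np.quantile(train, [0.01, 0.99])
train_capped = np.clip(train, lower, upper)
test_capped = np.clip(test, lower, upper)

median = np.median(train_capped)
q25, q75 = np.percentile(train_capped, [25, 75])
iqr = q75 - q25

# the same transformation is applied to both sets
train_scaled = (train_capped - median) / iqr
test_scaled = (test_capped - median) / iqr

print("Clip bounds:", lower, upper)
print("Scaled test values:", test_scaled)
```

The extreme test value is pulled down to the training upper bound before scaling, so nothing about the test distribution influences the preprocessing parameters.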

In [22]:
# 'y' is target variable
# rest will be input variables
y = df_model['y']
X = df_model.drop(columns='y')
# 80 training vs. 20 test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42, stratify=y)

# capping outliers
num_cols = ['age','balance','pdays','previous']
X_train[num_cols] = X_train[num_cols].astype(float)
X_test[num_cols] = X_test[num_cols].astype(float)

for col in num_cols:
    lower_bound = X_train[col].quantile(0.01)
    upper_bound = X_train[col].quantile(0.99)

    X_train.loc[:, col] = X_train[col].clip(lower_bound, upper_bound)
    X_test.loc[:, col] = X_test[col].clip(lower_bound, upper_bound)

# scaling numerical variables
scaler = RobustScaler()
scaler.fit(X_train[num_cols])

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

X_train_scaled.loc[:, num_cols] = scaler.transform(X_train[num_cols])
X_test_scaled.loc[:, num_cols] = scaler.transform(X_test[num_cols])

print("--- Data Splitting Verification ---\n")
print("Training set shape:", X_train_scaled.shape)
print("Testing set shape:", X_test_scaled.shape)

print("\nSubscription rate in original dataset:", f"{y.mean()*100:.2f}%")
print("Subscription rate in training dataset:", f"{y_train.mean()*100:.2f}%")
print("Subscription rate in testing dataset:", f"{y_test.mean()*100:.2f}%")

--- Data Splitting Verification ---

Training set shape: (3616, 37)
Testing set shape: (905, 37)

Subscription rate in original dataset: 11.52%
Subscription rate in training dataset: 11.53%
Subscription rate in testing dataset: 11.49%
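
As a quick sanity check, the reported shapes and rates are mutually consistent. The sketch below recomputes them from the row counts. The positive counts (521, 417, 104) are not printed by the notebook; they are inferred here from the reported percentages, so treat them as an assumption.

```python
import math

n_total = 4521
n_test = math.ceil(n_total * 0.2)   # train_test_split rounds the test size up
n_train = n_total - n_test

# inferred positive counts (assumption: back-computed from the printed rates)
pos_total, pos_train, pos_test = 521, 417, 104

print("Train/test rows:", n_train, n_test)
print(f"Original rate: {pos_total / n_total * 100:.2f}%")
print(f"Training rate: {pos_train / n_train * 100:.2f}%")
print(f"Testing rate: {pos_test / n_test * 100:.2f}%")
```

The small differences between the three rates arise because the class counts must be whole numbers in each subset.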


### **Class Imbalance Handling**

To prevent the model from ignoring the minority class under the roughly 88/12 class imbalance, we will apply SMOTE (Synthetic Minority Over-sampling Technique) to the training set only, creating a balanced dataset for model training. The test set keeps its original class distribution.

In [23]:
# smote to the training set
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)

print("--- Class Imbalance Handling ---\n")
print("The original training dataset shape:", X_train_scaled.shape)
print("Subscription rate in original training dataset:", f"{y_train.mean()*100:.2f}%")
print("\nResampled training dataset shape:", X_train_smote.shape)
print("Subscription rate in resampled training dataset:", f"{y_train_smote.mean()*100:.2f}%")

--- Class Imbalance Handling ---

The original training dataset shape: (3616, 37)
Subscription rate in original training dataset: 11.53%

Resampled training dataset shape: (6398, 37)
Subscription rate in resampled training dataset: 50.00%
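
For intuition, SMOTE builds each synthetic minority sample by taking a minority point, choosing one of its nearest minority neighbors (imbalanced-learn uses `k_neighbors=5` by default), and interpolating a random fraction of the way between them. The NumPy sketch below illustrates a single interpolation step on made-up 2-D points, simplified to use only the single nearest neighbor. It is an illustration, not imbalanced-learn's implementation, and all names are hypothetical. It also checks the row arithmetic behind the output above, assuming 417 minority rows in the training set (inferred from 11.53% of 3,616).

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [8.0, 8.0]])

# pick a base point and find its nearest minority neighbor
base = minority[0]
dists = np.linalg.norm(minority - base, axis=1)
dists[0] = np.inf                      # exclude the base point itself
neighbor = minority[np.argmin(dists)]

# the synthetic sample lies on the segment between base and neighbor
lam = rng.uniform(0.0, 1.0)
synthetic = base + lam * (neighbor - base)
print("Synthetic sample:", synthetic)

# row arithmetic: the minority class is oversampled up to the majority count
n_train, n_minority = 3616, 417        # 417 is inferred, not printed above
n_majority = n_train - n_minority
print("Resampled rows:", 2 * n_majority)
```

Because the interpolation is applied to every feature, including the one-hot columns, synthetic rows may not correspond to valid category combinations. imbalanced-learn also provides `SMOTENC` for mixed categorical and continuous data, which may be worth comparing.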


### **Model Training & Evaluation**

In this section, we will develop and evaluate several classification models to predict customer subscriptions. Our modeling strategy will proceed in three key steps:

1. **Establish a baseline:** We will begin by building a simple, interpretable model to serve as a solid performance benchmark.

2. **Develop advanced models:** We will train more complex models with the goal of outperforming the baseline.

3. **Hyperparameter tuning:** Next, the best-performing model will be selected and its hyperparameters will be tuned to maximize its predictive power.

The models to be trained are:
- **Logistic regression:** This will serve as our baseline model. It's a highly interpretable and efficient model that provides a solid benchmark for comparison.

- **Random forest and XGBoost:** These are industry-standard ensemble models, which are expected to offer higher predictive performance.

Each model's performance will be assessed using **precision**, **recall**, and the **ROC AUC score.** On an imbalanced dataset these metrics are more informative than accuracy, which is reported only for reference, and they help us evaluate the trade-offs relevant to our business objective.
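
Precision and recall both depend on the probability threshold used to turn scores into yes/no predictions, which is what creates the trade-off. The sketch below computes both metrics at a few thresholds on synthetic scores. The data, the seed, and the helper name `precision_recall_at` are illustrative assumptions and are unrelated to the bank dataset.

```python
import numpy as np

def precision_recall_at(y_true, scores, threshold):
    """Return (precision, recall) when predicting 1 for scores >= threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(precision), float(recall)

rng = np.random.default_rng(0)
# synthetic labels with roughly 12% positives
y_true = (rng.uniform(size=1000) < 0.12).astype(int)
# synthetic scores: positives tend to score higher than negatives
scores = np.clip(rng.normal(0.35 + 0.25 * y_true, 0.15), 0.0, 1.0)

results = []
for t in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(y_true, scores, t)
    results.append((t, p, r))
    print(f"threshold={t:.1f}  precision={p:.3f}  recall={r:.3f}")
```

Raising the threshold can only keep recall the same or lower it, while precision typically (though not always) rises. The right operating point depends on the relative cost of a wasted contact versus a missed subscriber.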

**Baseline Logistic Regression Model**

In [11]:
# logistic regression
log_reg = LogisticRegression(random_state=42, solver='saga', penalty='l2', C=0.1, max_iter=5000)
# train on the balanced training set
log_reg.fit(X_train_smote, y_train_smote)

# prediction on imbalanced test set
pred_lr = log_reg.predict(X_test_scaled)
# prediction probabilities for 'yes'
prob_lr = log_reg.predict_proba(X_test_scaled)[:, 1]

print("--- Logistic Regression Performance ---\n")
print(f"Accuracy: {accuracy_score(y_test, pred_lr):.3f}")
print(f"Precision: {precision_score(y_test, pred_lr):.3f}")
print(f"Recall: {recall_score(y_test, pred_lr):.3f}")
print(f"ROC AUC: {roc_auc_score(y_test, prob_lr):.3f}")

--- Logistic Regression Performance ---

Accuracy: 0.734
Precision: 0.178
Recall: 0.365
ROC AUC: 0.628


  ret = a @ b


*Note on runtime warnings*

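The logistic regression model produced several `RuntimeWarning` messages (the `ret = a @ b` output above is what remains of them). They are raised during matrix multiplication and suggest that the optimization ran into numerical difficulty and may not have converged cleanly, even with scaled data.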

**Baseline Model Evaluation**
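
The reported baseline metrics can be unpacked into approximate confusion-matrix counts. The notebook does not print the confusion matrix, so the counts below are a reconstruction: they are integer values consistent with the rounded precision, recall, and accuracy above and with 104 positives in the 905-row test set (itself inferred from the 11.49% rate). Treat them as an assumption rather than reported output.

```python
# reconstructed counts (assumption: inferred from the rounded metrics above)
n_test, n_pos = 905, 104
tp, fp = 38, 175
fn = n_pos - tp
tn = n_test - tp - fp - fn

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / n_test

# accuracy of a trivial model that always predicts 'no'
majority_accuracy = (n_test - n_pos) / n_test

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"Accuracy: {accuracy:.3f}")
print(f"Always-'no' accuracy: {majority_accuracy:.3f}")
```

Under this reconstruction, the baseline flags about 213 of 905 customers and finds 38 of the 104 subscribers. Its accuracy (0.734) is below the 0.885 obtained by always predicting 'no', which illustrates why accuracy alone is a poor guide on imbalanced data and why precision, recall, and ROC AUC are tracked instead.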

**Hyperparameter Tuning**

### **Identify Best Model**

### **Conclusion & Key Insights**