## **Boosting Marketing Efficiency: Targeted Bank Campaign by Customer Subscription Behavior**

**Overall Project Objective:** 

Develop a data-driven marketing strategy that maximizes return on investment (ROI) by using predictive modeling to identify the optimal trade-off between broad customer outreach and precision marketing.

**Notebook 2 of 3: Feature Engineering & Predictive Modeling**

This notebook covers the core technical steps of the predictive modeling pipeline. 

The primary goals are to:
- Prepare the cleaned data for machine learning through feature engineering.
- Build and evaluate several classification models.
- Identify the best-performing model for predicting term deposit subscriptions.
- Analyze the precision-recall trade-off to inform the final strategy.

### **Setup & Data Load**

Load the required libraries and the cleaned dataset.

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, RobustScaler
from xgboost import XGBClassifier

from imblearn.over_sampling import SMOTE

from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, roc_curve, f1_score


In [22]:
file_path = '../data/bank_cleaned.csv'
df = pd.read_csv(file_path)

**-Data Overview-**

The file used is 'bank_cleaned.csv', the cleaned and imputed output from notebook 1.

Data at a glance:
- Dataset: bank_cleaned.csv
- Observations: 4,521
- Variables: 17 (pre-feature engineering)

In [23]:
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown_outcome,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unspecified,3,jun,199,4,-1,0,unknown_outcome,no
4,59,blue-collar,married,secondary,no,0,yes,no,unspecified,5,may,226,1,-1,0,unknown_outcome,no


In [24]:
# verify loaded data
print("--- Cleaned Data ---\n")
print(f"Dataset shape: {df.shape}\n")
print("--- Missing Values Check ---\n") 
print(df.isnull().sum())

--- Cleaned Data ---

Dataset shape: (4521, 17)

--- Missing Values Check ---

age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64


*The dataset contains 4,521 observations and 17 variables (16 input variables and 1 output variable), with no missing values.*

### **Feature Engineering**

The feature engineering process involves three primary steps:
1. **Feature removal:** Several columns will be removed before training to improve model performance:

    - Data leakage features: The features identified as sources of data leakage ('duration', 'campaign') will be dropped.
    
    - Low-impact variable: The 'day' variable will be excluded. Because seasonality is already captured by the 'month' variable, the specific day is unlikely to provide a meaningful signal and may add more noise than predictive value.

2. **Categorical variables conversion:** For binary categorical variables, we will use binary encoding to convert them into a numerical format (1s and 0s) that machine learning models can process.

3. **Encoding categorical variables:** For multi-class categorical variables, we will apply one-hot encoding. This technique converts each feature into multiple new binary columns, preventing the model from assuming a false ordinal relationship between the categories.
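To make the one-hot step concrete, here is a minimal sketch on a hypothetical three-row frame (the values are illustrative and not taken from the dataset). It assumes the same `drop_first=True` setting used in this notebook's encoding cell, which drops one category per feature so that it acts as the implicit reference level.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
toy = pd.DataFrame({
    "age": [30, 41, 52],
    "marital": ["married", "single", "divorced"],
})

# drop_first=True drops the first category in sorted order ("divorced"),
# so a row with all dummy columns equal to 0 means "divorced".
encoded = pd.get_dummies(toy, columns=["marital"], drop_first=True)
encoded_cols = list(encoded.columns)
print(encoded_cols)
```

With three categories, only two dummy columns are created, which avoids a perfectly redundant column in linear models.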

In [25]:
# keep original clean df intact
df_model = df.copy()

**Feature Removal**

*'duration', 'campaign', 'day'*

In [26]:
# drop variables (duration, campaign, day)
exclude_cols = ['duration', 'campaign', 'day']
df_model = df_model.drop(columns=exclude_cols)
print("--- Remaining Variables ---\n")
print(df_model.columns.to_list())

--- Remaining Variables ---

['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact', 'month', 'pdays', 'previous', 'poutcome', 'y']


**Categorical Variables Conversion**

*'y', 'default', 'housing', 'loan'*

In [27]:
binary_values = {'yes': 1, 'no': 0}

df_model['y'] = df_model['y'].map(binary_values)
df_model['default'] = df_model['default'].map(binary_values)
df_model['housing'] = df_model['housing'].map(binary_values)
df_model['loan'] = df_model['loan'].map(binary_values)

print("--- Binary Variables Mapped ---\n")
print(df_model[['y','default','housing','loan']].head())

--- Binary Variables Mapped ---

   y  default  housing  loan
0  0        0        0     0
1  0        0        1     1
2  0        0        1     0
3  0        0        1     1
4  0        0        1     0


**Multi-Class Categorical Variables Encoding**

*'job', 'marital', 'education', 'contact', 'month', 'poutcome'*

In [28]:
multi_cols = ['job','marital','education','contact','month','poutcome']
df_model = pd.get_dummies(data=df_model, columns=multi_cols, drop_first=True)

print("--- Multi-Class Variables Encoded ---\n")
print("New dataset shape:", df_model.shape)
df_model.head(1)

--- Multi-Class Variables Encoded ---

New dataset shape: (4521, 38)


Unnamed: 0,age,default,balance,housing,loan,pdays,previous,y,job_blue-collar,job_entrepreneur,...,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_other,poutcome_success,poutcome_unknown_outcome
0,30,0,1787,0,0,-1,0,0,False,False,...,False,False,False,False,False,True,False,False,False,True


This section has transformed the cleaned dataset into a model-ready, fully numeric format.

- Feature exclusion: The 'duration' and 'campaign' variables were dropped to prevent data leakage, and the 'day' variable was dropped to minimize data noise.

- Binary encoding: The target variable 'y' and the binary features ('default', 'housing', 'loan') were converted from yes/no to 1/0.

- One-hot encoding: The 6 multi-class categorical features were converted into numeric dummy variables.

The final dataset is fully numeric and is ready to be split for model training.

### **Data Split & Feature Scaling**

Before training, the data will be prepared using a two-step process to ensure reliable model evaluation and performance.

1. **Stratified data split:** The dataset will be split into an 80% training set and 20% test set. Because the dataset is imbalanced (approximately 11.5% positive class), we will use a **stratified split.** This ensures the proportion of subscribers is the same in both the training and test sets, which is a best practice for imbalanced classification problems.

2. **Feature scaling:** After splitting the data, all numerical variables will be scaled. This step is necessary to prevent numerical instability in distance-based algorithms like SMOTE. The scaler is fitted on the training set only, after the split, to avoid data leakage from the test set.
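As a rough illustration of what robust scaling does, the sketch below reproduces the median and interquartile-range formula by hand with NumPy. The balances are hypothetical, and the formula assumes the scaler's default 25th to 75th percentile range.

```python
import numpy as np

# Hypothetical balances; the last training value is a deliberate outlier.
train = np.array([0.0, 100.0, 200.0, 300.0, 10_000.0])
test = np.array([150.0, 400.0])

# Statistics are computed from the training data only.
median = np.median(train)
q1, q3 = np.percentile(train, [25, 75])
iqr = q3 - q1

# The same training statistics are applied to both sets.
train_scaled = (train - median) / iqr
test_scaled = (test - median) / iqr
print(train_scaled, test_scaled)
```

Because the median and IQR barely move when one extreme value is present, the bulk of the data stays in a small range while the outlier remains visibly large.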

In [29]:
# 'y' is target variable
# rest will be input variables
y = df_model['y']
X = df_model.drop(columns='y')
# 80 training vs. 20 test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=12, stratify=y)

# scaling numerical variables
num_cols = ['age','balance','pdays','previous']
scaler = RobustScaler()
scaler.fit(X_train[num_cols])

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

X_train_scaled[num_cols] = scaler.transform(X_train[num_cols])
X_test_scaled[num_cols] = scaler.transform(X_test[num_cols])

print("--- Data Splitting Verification ---\n")
print("Training set shape:", X_train_scaled.shape)
print("Testing set shape:", X_test_scaled.shape)

print("\nSubscription rate in original dataset:", f"{y.mean()*100:.2f}%")
print("Subscription rate in training dataset:", f"{y_train.mean()*100:.2f}%")
print("Subscription rate in testing dataset:", f"{y_test.mean()*100:.2f}%")

--- Data Splitting Verification ---

Training set shape: (3616, 37)
Testing set shape: (905, 37)

Subscription rate in original dataset: 11.52%
Subscription rate in training dataset: 11.53%
Subscription rate in testing dataset: 11.49%


### **Class Imbalance Handling**

To prevent the model from ignoring the minority class due to the roughly 88/12 class imbalance, we will apply SMOTE (Synthetic Minority Over-sampling Technique) to the training set only, creating a balanced dataset for model training.
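The core idea behind SMOTE can be sketched in a few lines: a synthetic sample is placed at a random position on the line segment between a minority point and one of its minority-class neighbors. The version below is a simplification, not the library's implementation. It uses only the single nearest neighbor (the real algorithm picks randomly among several), and the points and the `synthesize` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(12)

# Hypothetical scaled minority-class points (two features each).
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

def synthesize(points, rng):
    """Create one synthetic sample between a random point and its nearest neighbor."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf          # exclude the point itself
    j = int(np.argmin(dists))  # nearest other minority point
    lam = rng.random()         # interpolation weight in [0, 1)
    return points[i] + lam * (points[j] - points[i]), int(i), j

new_point, i, j = synthesize(minority, rng)
print(new_point, i, j)
```

Since the interpolation relies on distances, features on very different scales would dominate the neighbor search, which is why scaling comes first.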

In [30]:
# smote to the training set
smote = SMOTE(random_state=12)
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)

print("--- Class Imbalance Handling ---\n")
print("The original training dataset shape:", X_train_scaled.shape)
print("Subscription rate in original training dataset:", f"{y_train.mean()*100:.2f}%")
print("\nResampled training dataset shape:", X_train_smote.shape)
print("Subscription rate in resampled training dataset:", f"{y_train_smote.mean()*100:.2f}%")

--- Class Imbalance Handling ---

The original training dataset shape: (3616, 37)
Subscription rate in original training dataset: 11.53%

Resampled training dataset shape: (6398, 37)
Subscription rate in resampled training dataset: 50.00%


### **Model Training & Evaluation**

In this section, we will develop and evaluate several classification models to predict customer subscriptions. Our modeling strategy will proceed in three key steps:

1. **Establish a baseline:** We will begin by building a simple, interpretable model to serve as a solid performance benchmark.

2. **Develop advanced models:** We will train more complex models with the goal of outperforming the baseline.

3. **Hyperparameter tuning:** Next, the best-performing model will be selected and its hyperparameters will be tuned to maximize its predictive power.

The models to be trained are:
- **Logistic regression:** This will serve as our baseline model. It's a highly interpretable and efficient model that provides a solid benchmark for comparison.

- **Random forest and XGBoost:** These are industry-standard ensemble models, which are expected to offer higher predictive performance.

Each model's performance will be assessed using **precision**, **recall**, **ROC AUC score**, and the **F1 score.** These metrics are well suited to an imbalanced dataset and will help us evaluate the trade-offs relevant to our business objective.

**-Baseline Logistic Regression Model-**

In [31]:
# logistic regression
log_reg = LogisticRegression(random_state=12, solver='liblinear')
# train on the balanced training set
log_reg.fit(X_train_smote, y_train_smote)

# prediction on imbalanced test set
pred_lr = log_reg.predict(X_test_scaled)
# prediction probabilities for 'yes'
prob_lr = log_reg.predict_proba(X_test_scaled)[:, 1]

print("--- Logistic Regression Performance ---\n")
print(f"Accuracy: {accuracy_score(y_test, pred_lr):.3f}")
print(f"Precision: {precision_score(y_test, pred_lr):.3f}")
print(f"Recall: {recall_score(y_test, pred_lr):.3f}")
print(f"ROC AUC score: {roc_auc_score(y_test, prob_lr):.3f}")

--- Logistic Regression Performance ---

Accuracy: 0.786
Precision: 0.204
Recall: 0.298
ROC AUC score: 0.613


  ret = a @ b


*-Note on Runtime Warnings-*

The logistic regression cell produced several RuntimeWarning messages from a matrix multiplication step, which we interpret as a sign of numerical difficulty in the model's optimization, even with scaled data. While further tuning could be applied, we will accept the generated metrics as our preliminary baseline. The primary goal is to determine if more complex models provide a significant performance boost.

We do not anticipate these warnings for random forest and XGBoost, as tree-based models are insensitive to feature scaling and do not rely on the same type of convergence-based optimization.

*-Further Note-*

To resolve the issue, we moved from a standard scaler to a robust scaler and implemented outlier capping to handle extreme values. When the issue persisted, we switched to a 'RidgeClassifier' to handle potential multicollinearity. Since the warnings remained even after these troubleshooting steps, we concluded that the issue was a fundamental incompatibility between the post-SMOTE data and linear models.

**Baseline Model Evaluation**

The initial logistic regression model serves as our baseline. The performance on the test set is as follows:

- **Precision (0.204):** When the model predicts a client will subscribe, it is correct only 20.4% of the time. This means nearly 80% of marketing calls made based on these predictions would be to uninterested clients, highlighting significant inefficiency.

- **Recall (0.298):** The model successfully identifies only 29.8% of all clients who actually would subscribe. This means over 70% of potential customers are being missed by this strategy.

- **ROC AUC Score (0.613):** This score confirms that the model has learned some predictive patterns, performing slightly better than a random guess.

The baseline model quantifies the business problem. A strategy based on this model would be inefficient (low precision) and would miss the majority of potential customers (low recall). Our goal now is to improve upon these metrics with more advanced models.
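To put these percentages into call counts, the sketch below back-calculates an approximate confusion matrix for the 905-row test set. The counts are inferred from the rounded metrics and the 11.49% test subscription rate reported above, not read from an actual confusion matrix, so treat them as an estimate.

```python
# Inferred counts only: derived from rounded metrics, not from model output.
n_test = 905
positives = round(n_test * 0.1149)   # actual subscribers in the test set
tp = round(positives * 0.298)        # subscribers the model flagged (recall)
predicted_pos = round(tp / 0.204)    # all clients the model flagged (precision)
fp = predicted_pos - tp              # flagged clients who did not subscribe
fn = positives - tp                  # subscribers the model missed
tn = n_test - tp - fp - fn

# Recompute the metrics from the inferred counts as a consistency check.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / n_test
print(tp, fp, fn, tn)
print(round(precision, 3), round(recall, 3), round(accuracy, 3))
```

Under these assumptions, roughly 152 clients would be contacted to reach about 31 subscribers, while about 73 subscribers would go uncontacted. The recomputed accuracy also matches the reported 0.786.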

**-Random Forest Model-**

In [32]:
# random forest
rf_mod = RandomForestClassifier(random_state=12)
# train on the balanced training set
rf_mod.fit(X_train_smote, y_train_smote)

pred_rf = rf_mod.predict(X_test_scaled)
prob_rf = rf_mod.predict_proba(X_test_scaled)[:, 1]

print("--- Random Forest Performance ---\n")
print(f"Accuracy: {accuracy_score(y_test, pred_rf):.3f}")
print(f"Precision: {precision_score(y_test, pred_rf):.3f}")
print(f"Recall: {recall_score(y_test, pred_rf):.3f}")
print(f"ROC AUC score: {roc_auc_score(y_test, prob_rf):.3f}")

--- Random Forest Performance ---

Accuracy: 0.851
Precision: 0.322
Recall: 0.269
ROC AUC score: 0.690


**Random Forest Model Evaluation**

The random forest model demonstrates a significant performance improvement over our baseline. Its performance on the test set is as follows:

- **Precision (0.322):** When the model predicts a client will subscribe, it is correct 32.2% of the time, up from 20.4% for the baseline. This represents a major boost in marketing efficiency.

- **Recall (0.269):** By being more selective, the model identified a slightly smaller portion of all potential subscribers (26.9%) compared to the baseline.

- **ROC AUC Score (0.690):** This score indicates a moderate ability to distinguish between subscribing and non-subscribing clients, clearly better than a random guess (0.5) and above the baseline (0.613).

Overall, the random forest is the stronger of the two models. The small drop in recall is an acceptable trade-off for the gain in precision, leading to more efficient marketing outreach.

**-XGBoost Model-**

In [33]:
# xgboost
xgb_mod = XGBClassifier(random_state=12)
xgb_mod.fit(X_train_smote, y_train_smote)

pred_xgb = xgb_mod.predict(X_test_scaled)
prob_xgb = xgb_mod.predict_proba(X_test_scaled)[:, 1]

print("--- XGBoost Performance ---\n")
print(f"Accuracy: {accuracy_score(y_test, pred_xgb):.3f}")
print(f"Precision: {precision_score(y_test, pred_xgb):.3f}")
print(f"Recall: {recall_score(y_test, pred_xgb):.3f}")
print(f"ROC AUC score: {roc_auc_score(y_test, prob_xgb):.3f}")

--- XGBoost Performance ---

Accuracy: 0.864
Precision: 0.358
Recall: 0.231
ROC AUC score: 0.675


**XGBoost Model Evaluation**

The XGBoost model presents the clearest trade-off between precision and recall, prioritizing marketing efficiency above all else.

- **Precision (0.358):** At 35.8%, this model is the most precise of the three, making it the most efficient in terms of targeted marketing spend.

- **Recall (0.231):** As a result of being the most selective, it captures the smallest portion of all potential subscribers (23.1%).

- **ROC AUC Score (0.675):** This score indicates a good ability to distinguish between classes; better than the baseline, but slightly lower than the random forest model.

In summary, compared to the random forest, the XGBoost model achieves a marginal gain in precision by sacrificing a more significant amount of recall.

**Final Model Comparison & Evaluation**

In [34]:
print("--- F1 Score Comparison ---\n")
print("Logistic Regression")
print(f"F1 score: {f1_score(y_test, pred_lr):.3f}\n")
print("Random Forest")
print(f"F1 score: {f1_score(y_test, pred_rf):.3f}\n")
print("XGBoost")
print(f"F1 score: {f1_score(y_test, pred_xgb):.3f}")

--- F1 Score Comparison ---

Logistic Regression
F1 score: 0.242

Random Forest
F1 score: 0.293

XGBoost
F1 score: 0.281


A comparison of the F1 score, a metric that balances precision and recall, confirms that the random forest model (0.293) is the top performer. It surpasses both the XGBoost model (0.281) and the logistic regression baseline (0.242). 
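As a quick sanity check, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the figures reported for each model. The sketch below uses the rounded precision and recall values from the earlier outputs, which means agreement to three decimals is not guaranteed in general, although it holds here. The `f1` helper is defined only for this illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs as reported in the model outputs.
reported = {
    "Logistic Regression": (0.204, 0.298),
    "Random Forest": (0.322, 0.269),
    "XGBoost": (0.358, 0.231),
}

f1_scores = {name: round(f1(p, r), 3) for name, (p, r) in reported.items()}
print(f1_scores)
```

The harmonic mean penalizes a large gap between the two metrics, which is why XGBoost's higher precision does not offset its lower recall.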

Therefore, we will select the random forest model for the final phase of this section: hyperparameter tuning.

**Hyperparameter Tuning**

In [35]:
para_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=12),
                           param_grid=para_grid,
                           scoring='roc_auc',
                           cv=3,
                           verbose=0)

grid_search.fit(X_train_smote, y_train_smote)

print("\nBest Parameters:")
print(grid_search.best_params_)

print(f"\nBest ROC AUC Score on Cross-Validation: {grid_search.best_score_:.3f}")

best_rf_mod = grid_search.best_estimator_


Best Parameters:
{'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}

Best ROC AUC Score on Cross-Validation: 0.962


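The optimal parameters found by the grid search were 200 estimators with no maximum depth. The tree-growing conditions require a minimum of 2 samples to split an internal node and a minimum of 1 sample per leaf node. Note that the cross-validation score (0.962) was computed on folds of the SMOTE-resampled training data, so it is not directly comparable to scores on the original, imbalanced test set.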
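For a sense of the search cost, the number of model fits can be counted directly from the grid. This is a minimal sketch that copies the grid and fold count from the tuning cell. The final extra fit assumes the grid search's default behavior of refitting the best configuration on the full training data.

```python
from itertools import product

# Grid and fold count copied from the tuning cell.
para_grid = {
    "n_estimators": [100, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}
cv_folds = 3

combos = list(product(*para_grid.values()))
n_candidates = len(combos)            # one candidate per parameter combination
n_cv_fits = n_candidates * cv_folds   # each candidate is fitted once per fold
n_total_fits = n_cv_fits + 1          # plus one refit of the best candidate
print(n_candidates, n_cv_fits, n_total_fits)
```

The count grows multiplicatively with each added parameter value, so widening the grid gets expensive quickly.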

### **Identify Best Model**

In [36]:
pred_rf_tuned = best_rf_mod.predict(X_test_scaled)
prob_rf_tuned = best_rf_mod.predict_proba(X_test_scaled)[:, 1]

print("--- Tuned Random Forest Performance ---\n")
print(f"Accuracy: {accuracy_score(y_test, pred_rf_tuned):.3f}")
print(f"Precision: {precision_score(y_test, pred_rf_tuned):.3f}")
print(f"Recall: {recall_score(y_test, pred_rf_tuned):.3f}")
print(f"F1 score: {f1_score(y_test, pred_rf_tuned):.3f}")
print(f"ROC AUC score: {roc_auc_score(y_test, prob_rf_tuned):.3f}")

--- Tuned Random Forest Performance ---

Accuracy: 0.856
Precision: 0.338
Recall: 0.260
F1 score: 0.293
ROC AUC score: 0.687


### **Final Model Selection & Conclusion**

After comparing our three models, the random forest architecture was selected as the top performer based on its ROC AUC and F1-score. We then used 'GridSearchCV' to tune its hyperparameters to optimize its predictive power.

The tuning process resulted in a final model with the following performance on the test set:

- **Precision: 0.338**
- **Recall: 0.260**
- **F1 Score: 0.293**
- **ROC AUC: 0.687**

Compared to the default settings, the tuned model became slightly more **precision-focused,** making it even more efficient at identifying high-probability leads with a minimal trade-off in recall. We will proceed with this **Tuned Random Forest** as our final model.

**Business Insight**

The application of this model now depends on the bank's primary business objective, as its predictions can be used for two distinct strategies:

- **Strategy 1: Maximize Market Reach (Prioritizing Recall)**
    
    - Use the model to identify the largest possible pool of potential subscribers, accepting a less efficient marketing spend.

- **Strategy 2: Maximize Marketing ROI (Prioritizing Precision)**

    - Use the model to target only the highest-probability leads, accepting that some potential customers will be missed.
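One common way to move between these two strategies is to change the probability threshold at which a client is flagged. The sketch below illustrates the effect on hypothetical labels and predicted probabilities. The thresholds 0.25 and 0.7 are arbitrary illustrative choices, and `precision_recall_at` is a helper defined only for this example.

```python
import numpy as np

# Hypothetical true labels and predicted subscription probabilities.
y_true = np.array([1, 1, 0, 1, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1])

def precision_recall_at(threshold):
    """Precision and recall when flagging clients with probability >= threshold."""
    pred = y_prob >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn)
    return float(precision), float(recall)

reach = precision_recall_at(0.25)  # low threshold: wider reach
roi = precision_recall_at(0.7)     # high threshold: fewer, surer leads
print(reach, roi)
```

On this toy data, the low threshold finds every subscriber at 50% precision, while the high threshold contacts only confirmed subscribers but misses one of the three.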

**Next Steps**

Our final notebook will analyze these strategic trade-offs in detail, explore the model's feature importances, and provide a definitive, data-driven recommendation for the bank.

In [37]:
joblib.dump(best_rf_mod, '../models/tuned_rf_mod.joblib')
joblib.dump(scaler, '../models/scaler.joblib')
print("Final model saved")

X_test.to_csv('../data/X_test_original.csv', index=False) # original, unscaled test data
X_test_scaled.to_csv('../data/X_test_scaled.csv', index=False) # scaled test data
y_test.to_csv('../data/y_test.csv', index=False) # untouched target test data
print("Test data saved")

Final model saved
Test data saved


*We save the trained model, the scaler, and the test data so they can be loaded directly in the next notebook (03_business_impact_and_conclusion).*