## **Boosting Marketing Efficiency: Targeted Bank Campaign by Customer Subscription Behavior**

**Overall Project Objective:** 

Develop a data-driven marketing strategy that maximizes return on investment (ROI) by using predictive modeling to identify the optimal trade-off between broad customer outreach and precision marketing.

**Notebook 2 of 3: Feature Engineering & Predictive Modeling**

This notebook covers the core technical steps of the predictive modeling pipeline. 

The primary goals are to:
- Prepare the cleaned data for machine learning through feature engineering.
- Build and evaluate several classification models.
- Identify the best-performing model for predicting term deposit subscriptions.
- Analyze the precision-recall trade-off to inform the final strategy.

### **Data Load**

Load the required libraries and the cleaned dataset.

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from imblearn.over_sampling import SMOTE

from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, roc_curve


In [29]:
file_path = '../data/bank_cleaned.csv'
df = pd.read_csv(file_path)

**-Data Overview-**

The file used is 'bank_cleaned.csv', the cleaned and imputed output from Notebook 1.

Data at a glance:
- Dataset: bank_cleaned.csv
- Observations: 4,521
- Variables: 17 (pre-feature engineering)

In [37]:
df.head()

|   | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown_outcome | no |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unspecified | 3 | jun | 199 | 4 | -1 | 0 | unknown_outcome | no |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unspecified | 5 | may | 226 | 1 | -1 | 0 | unknown_outcome | no |


In [36]:
# verify loaded data
print("--- Cleaned Data ---\n")
print(f"Dataset shape: {df.shape}\n")
print("--- Missing Values Check ---\n") 
print(df.isnull().sum())

--- Cleaned Data ---

Dataset shape: (4521, 17)

--- Missing Values Check ---

age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64


*The dataset contains 4,521 observations and 17 variables (16 input variables and 1 output variable, 'y'), with no missing values.*

### **Feature Engineering**

The feature engineering process involves two primary steps:
1. Feature removal: three columns will be removed before training so that the models learn only from features that are usable at prediction time:

    - Data leakage features: 'duration' and 'campaign', identified as sources of data leakage, will be dropped. Both are recorded during the current campaign's contacts, so they are not known when a customer is selected for outreach.
    
    - Low-impact variable: The 'day' variable will be excluded. Because seasonality is already captured by the 'month' variable, the specific day of the month is unlikely to carry a meaningful signal and may add more noise than predictive value.

2. Encoding categorical variables: To prepare the categorical variables for our machine learning models, we will use one-hot encoding. This technique converts each categorical feature into a set of binary (0/1) columns, one per category, giving the models the numerical input they require.
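The two steps above can be sketched as follows. This is a minimal illustration, not necessarily the exact code used later in this notebook: the helper name `engineer_features`, the 'yes'/'no' to 1/0 target mapping, and the toy rows are assumptions made for the example, and `pd.get_dummies` is one of several ways to one-hot encode.

```python
import pandas as pd

def engineer_features(df, target="y", drop_cols=("duration", "campaign", "day")):
    """Hypothetical helper: drop excluded columns, then one-hot encode categoricals."""
    # Assumption: the target holds 'yes'/'no' strings, mapped here to 1/0.
    y = (df[target] == "yes").astype(int)

    # Step 1: remove the target plus the leakage and low-impact columns.
    X = df.drop(columns=[target, *drop_cols])

    # Step 2: one-hot encode every non-numeric column into 0/1 indicator columns.
    cat_cols = X.select_dtypes(exclude="number").columns.tolist()
    X = pd.get_dummies(X, columns=cat_cols, dtype=int)
    return X, y

# Toy rows for illustration only (not taken from bank_cleaned.csv).
toy = pd.DataFrame({
    "age": [30, 33, 35],
    "job": ["unemployed", "services", "management"],
    "balance": [1787, 4789, 1350],
    "day": [19, 11, 16],
    "duration": [79, 220, 185],
    "campaign": [1, 1, 1],
    "y": ["no", "no", "yes"],
})

X_toy, y_toy = engineer_features(toy)
print(X_toy.columns.tolist())
print(y_toy.tolist())
```

On the real data the call would be `X, y = engineer_features(df)`. For linear models such as logistic regression, passing `drop_first=True` to `pd.get_dummies` is a common option to avoid perfectly collinear indicator columns; whether that is appropriate here is a modeling choice.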

### **Data Split**

### **Class Imbalance Handling**

### **Model Training & Evaluation**

### **Identify Best Model**

### **Conclusion & Key Insights**