# Bank Marketing Term Deposit Prediction

## Project Overview
**Business Problem:** Banks spend significant resources on marketing campaigns to convince customers to subscribe to term deposits. Predicting which customers are likely to subscribe allows banks to target their marketing efforts more effectively, reducing costs and improving conversion rates.

**ML Task:** Binary Classification (Predict if a customer will subscribe to a term deposit: Yes/No)

**Dataset:** UCI Bank Marketing Dataset (bank-additional-full.csv)
- Source: https://archive.ics.uci.edu/dataset/222/bank+marketing
- Samples: 41,188 rows
- Features: 20 input features + 1 target variable

**Key Challenge:** The dataset is highly imbalanced (~88% No, ~12% Yes), requiring techniques like SMOTE to handle class imbalance.

---
## 1. Import Libraries

In [1]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn for preprocessing and modeling
from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Imbalanced-learn for SMOTE
from imblearn.pipeline import Pipeline  # IMPORTANT: use imblearn's Pipeline (not sklearn's)
from imblearn.over_sampling import SMOTE

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Metrics
from sklearn.metrics import (
    accuracy_score, 
    roc_auc_score, 
    classification_report, 
    confusion_matrix,
    ConfusionMatrixDisplay,
    roc_curve,
    f1_score,
    precision_score,
    recall_score
)

print("Libraries imported successfully!")

Libraries imported successfully!


---
## 2. Load and Explore Data

In [2]:
# Load the dataset (semicolon-separated)
df = pd.read_csv('bank-additional-full.csv', sep=';')

print(f"Dataset Shape: {df.shape}")
print(f"Total Samples: {df.shape[0]:,}")
print(f"Total Columns: {df.shape[1]} (20 features + 1 target)")

Dataset Shape: (41188, 21)
Total Samples: 41,188
Total Columns: 21 (20 features + 1 target)


In [3]:
# View first few rows
df.head()

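|   | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|-----|-----|---------|-----------|---------|---------|------|---------|-------|-------------|-----|----------|-------|----------|----------|--------------|----------------|---------------|-----------|-------------|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |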


In [4]:
# Data types and info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

In [5]:
# Statistical summary for numerical columns
df.describe()

|       | age | duration | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
|-------|-----|----------|----------|-------|----------|--------------|----------------|---------------|-----------|-------------|
| count | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 |
| mean | 40.02406 | 258.28501 | 2.567593 | 962.475454 | 0.172963 | 0.081886 | 93.575664 | -40.5026 | 3.621291 | 5167.035911 |
| std | 10.42125 | 259.279249 | 2.770014 | 186.910907 | 0.494901 | 1.57096 | 0.57884 | 4.628198 | 1.734447 | 72.251528 |
| min | 17.0 | 0.0 | 1.0 | 0.0 | 0.0 | -3.4 | 92.201 | -50.8 | 0.634 | 4963.6 |
| 25% | 32.0 | 102.0 | 1.0 | 999.0 | 0.0 | -1.8 | 93.075 | -42.7 | 1.344 | 5099.1 |
| 50% | 38.0 | 180.0 | 2.0 | 999.0 | 0.0 | 1.1 | 93.749 | -41.8 | 4.857 | 5191.0 |
| 75% | 47.0 | 319.0 | 3.0 | 999.0 | 0.0 | 1.4 | 93.994 | -36.4 | 4.961 | 5228.1 |
| max | 98.0 | 4918.0 | 56.0 | 999.0 | 7.0 | 1.4 | 94.767 | -26.9 | 5.045 | 5228.1 |


In [None]:
# Statistical summary for categorical columns
df.describe(include='object')

In [None]:
# Check for missing values
print("Missing Values:")
print(df.isnull().sum())
print(f"\nTotal missing: {df.isnull().sum().sum()}")

In [None]:
# Check for 'unknown' values in categorical columns (these are essentially missing values)
print("'Unknown' values in categorical columns:")
for col in df.select_dtypes(include='object').columns:
    unknown_count = (df[col] == 'unknown').sum()
    if unknown_count > 0:
        print(f"  {col}: {unknown_count} ({unknown_count/len(df)*100:.2f}%)")

---
## 3. Exploratory Data Analysis (EDA)

### 3.1 Target Variable Distribution (Class Imbalance)

In [None]:
# Target variable distribution
print("Target Distribution:")
print(df['y'].value_counts())
print(f"\nPercentage:")
print(df['y'].value_counts(normalize=True) * 100)

# Visualize target distribution
fig, ax = plt.subplots(figsize=(8, 5))
colors = ['#ff6b6b', '#4ecdc4']
df['y'].value_counts().plot(kind='bar', color=colors, edgecolor='black', ax=ax)
ax.set_title('Target Variable Distribution (Imbalanced Dataset)', fontsize=14, fontweight='bold')
ax.set_xlabel('Subscribed to Term Deposit', fontsize=12)
ax.set_ylabel('Count', fontsize=12)
ax.set_xticklabels(['No', 'Yes'], rotation=0)

# Add percentage labels on bars
for i, (count, pct) in enumerate(zip(df['y'].value_counts(), df['y'].value_counts(normalize=True)*100)):
    ax.text(i, count + 500, f'{count:,}\n({pct:.1f}%)', ha='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n⚠️ OBSERVATION: Dataset is highly IMBALANCED (~88% No vs ~12% Yes)")
print("   → Will use SMOTE to handle this imbalance during model training.")
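
To see why accuracy alone is misleading at this class ratio, the sketch below scores a trivial classifier that always predicts "No". The 1,000-customer sample and its 88/12 split are illustrative assumptions that mirror the approximate ratio above; they are not the actual dataset counts.

```python
# Illustrative labels: 880 "no" (0) and 120 "yes" (1), mirroring the ~88/12 split.
# These counts are an assumption for the example, not the real dataset totals.
y_true = [0] * 880 + [1] * 120

# A trivial "model" that never predicts a subscription.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual subscribers that the model identified.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)
recall = true_positives / actual_positives

print(f"Accuracy: {accuracy:.2f}")  # 0.88
print(f"Recall:   {recall:.2f}")    # 0.00
```

The always-"No" model looks strong on accuracy yet finds no subscribers at all, which is why the later sections also report recall, F1, and ROC-AUC.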

### 3.2 Numerical Features Distribution

In [None]:
# Identify numerical columns (the target 'y' is a string column, so it is excluded automatically)
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
print(f"Numerical columns: {numerical_cols}")

In [None]:
# Distribution of key numerical features
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
key_numerical = ['age', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.conf.idx']

for i, col in enumerate(key_numerical):
    ax = axes[i // 3, i % 3]
    df[col].hist(bins=30, ax=ax, color='steelblue', edgecolor='black', alpha=0.7)
    ax.set_title(f'Distribution of {col}', fontsize=12)
    ax.set_xlabel(col)
    ax.set_ylabel('Frequency')

plt.suptitle('Distribution of Key Numerical Features', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

### 3.3 Categorical Features vs Target

In [None]:
# Subscription rate by job type
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Job vs Target
job_target = df.groupby('job')['y'].apply(lambda x: (x == 'yes').mean() * 100).sort_values(ascending=False)
job_target.plot(kind='bar', ax=axes[0], color='teal', edgecolor='black')
axes[0].set_title('Subscription Rate by Job Type', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Job')
axes[0].set_ylabel('Subscription Rate (%)')
axes[0].tick_params(axis='x', rotation=45)

# Education vs Target
edu_target = df.groupby('education')['y'].apply(lambda x: (x == 'yes').mean() * 100).sort_values(ascending=False)
edu_target.plot(kind='bar', ax=axes[1], color='coral', edgecolor='black')
axes[1].set_title('Subscription Rate by Education Level', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Education')
axes[1].set_ylabel('Subscription Rate (%)')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("OBSERVATION: Students and retired people have higher subscription rates.")
print("             Education level also influences subscription likelihood.")

In [None]:
# Previous campaign outcome (poutcome) vs current subscription
fig, ax = plt.subplots(figsize=(8, 5))

poutcome_target = df.groupby('poutcome')['y'].apply(lambda x: (x == 'yes').mean() * 100).sort_values(ascending=False)
poutcome_target.plot(kind='bar', ax=ax, color=['#2ecc71', '#e74c3c', '#95a5a6'], edgecolor='black')
ax.set_title('Subscription Rate by Previous Campaign Outcome', fontsize=12, fontweight='bold')
ax.set_xlabel('Previous Outcome')
ax.set_ylabel('Subscription Rate (%)')
ax.tick_params(axis='x', rotation=0)

for i, v in enumerate(poutcome_target):
    ax.text(i, v + 1, f'{v:.1f}%', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

print("OBSERVATION: Customers who previously said 'yes' (success) have a MUCH higher subscription rate (~65%)!")
print("             This is a very strong predictor - 'poutcome' is an important feature.")

### 3.4 Correlation Analysis

In [None]:
# Convert target to numeric for correlation
df_corr = df.copy()
df_corr['y_numeric'] = (df_corr['y'] == 'yes').astype(int)

# Correlation matrix for numerical features with target
numerical_for_corr = numerical_cols + ['y_numeric']
corr_matrix = df_corr[numerical_for_corr].corr()

# Plot correlation heatmap
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0, fmt='.2f', 
            linewidths=0.5, ax=ax, vmin=-1, vmax=1)
ax.set_title('Correlation Matrix of Numerical Features', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Show correlations with target
print("\nCorrelation with Target (y):")
target_corr = corr_matrix['y_numeric'].drop('y_numeric').sort_values(key=abs, ascending=False)
print(target_corr)

### 3.5 EDA Summary & Insights

**Key Findings:**
1. **Class Imbalance:** ~88% No vs ~12% Yes - requires SMOTE or similar technique
2. **Important Features:** 
   - `poutcome` (previous campaign outcome) is a strong predictor
   - Economic indicators (`emp.var.rate`, `euribor3m`, `nr.employed`) show correlation with target
   - `duration` has high correlation but causes data leakage (will be dropped)
3. **Data Quality:** No missing values, but 'unknown' values exist in some categorical columns
4. **Feature Types:** Mix of numerical and categorical features requiring appropriate preprocessing

---
## 4. Data Preprocessing

### 4.1 Feature Selection & Target Preparation

**Important:** We drop `duration` because it causes data leakage - this value is only known after the call ends, so it cannot be used to predict whether a customer will subscribe before calling them.

In [None]:
# Create a working copy
data = df.copy()

# Encode target variable: 'yes' -> 1, 'no' -> 0
data['target'] = data['y'].map({'yes': 1, 'no': 0})

# Drop 'duration' (data leakage) and original 'y' column
# Duration is known only after call ends - cannot be used for prediction
data = data.drop(columns=['y', 'duration'])

print(f"Shape after dropping 'duration' and 'y': {data.shape}")
print(f"\nTarget distribution after encoding:")
print(data['target'].value_counts())

In [None]:
# Separate features and target
X = data.drop(columns=['target'])
y = data['target']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

### 4.2 Identify Feature Types

In [None]:
# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

print(f"Categorical columns ({len(categorical_cols)}): {categorical_cols}")
print(f"\nNumerical columns ({len(numerical_cols)}): {numerical_cols}")

### 4.3 Train-Test Split

**Important:** Split data BEFORE applying SMOTE to prevent data leakage. SMOTE should only be applied to training data.

In [None]:
# Split data: 80% train, 20% test
# stratify=y ensures both sets have similar class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y
)

print(f"Training set: {X_train.shape[0]:,} samples")
print(f"Test set: {X_test.shape[0]:,} samples")
print(f"\nTraining target distribution:")
print(y_train.value_counts())
print(f"\nTest target distribution:")
print(y_test.value_counts())
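
As a rough illustration of what `stratify=y` achieves, the sketch below splits each class separately so both subsets keep the same class ratio. The `stratified_split` helper and the 880/120 toy labels are hypothetical and written only for this example; this is not how scikit-learn implements `train_test_split` internally.

```python
import random
from collections import Counter

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Shuffle within each class, then carve off the same fraction for testing.
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with the approximate 88/12 ratio (assumed, not the real counts).
labels = [0] * 880 + [1] * 120
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Both subsets end up with 12% positives, so the test-set metrics are measured at the same class ratio the model was trained on.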

### 4.4 Create Preprocessing Pipeline

In [None]:
# Create preprocessor with ColumnTransformer
# - Numerical features: StandardScaler (standardize to mean=0, std=1 using training-set statistics)
# - Categorical features: OneHotEncoder (convert to binary columns)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_cols)
    ],
    remainder='drop'
)

print("Preprocessor created:")
print("  - Numerical features: StandardScaler")
print("  - Categorical features: OneHotEncoder")

### 4.5 Verify Preprocessing

In [None]:
# Test the preprocessor
X_train_preprocessed = preprocessor.fit_transform(X_train)
print(f"Shape after preprocessing: {X_train_preprocessed.shape}")
print(f"Original features: {X_train.shape[1]} → Preprocessed features: {X_train_preprocessed.shape[1]}")
print("\n(Increase is due to One-Hot Encoding of categorical variables)")
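
The two transformations can be sketched in plain Python to show what the preprocessor computes. The `standardize` and `one_hot` helpers and the toy values are assumptions made for illustration; the notebook itself relies on `StandardScaler` and `OneHotEncoder`. Note how a category unseen during training maps to an all-zero row, which is the behaviour requested by `handle_unknown='ignore'`.

```python
from statistics import mean, pstdev

def standardize(train_values, values):
    """Scale values with the mean and population std learned from train_values."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    return [(v - mu) / sigma for v in values]

def one_hot(train_categories, values):
    """Encode values against categories seen in training; unseen values become all zeros."""
    categories = sorted(set(train_categories))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Toy training data (assumed for illustration only).
train_age = [30, 40, 50]
train_contact = ["cellular", "telephone", "cellular"]

# Statistics come from the training data and are reused on new rows.
print(standardize(train_age, [40, 50]))
print(one_hot(train_contact, ["telephone", "email"]))
```

Learning the statistics and category list from the training data only, then reusing them on the test data, is what keeps the test set from leaking into preprocessing.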

---
## 5. Handling Class Imbalance with SMOTE

**SMOTE (Synthetic Minority Over-sampling Technique)** creates synthetic samples of the minority class by:
1. Finding k-nearest neighbors for minority class samples
2. Creating new samples along the line between the sample and its neighbors

This helps the model learn better decision boundaries for the minority class.
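
The two steps above can be sketched in a few lines of plain Python. The `smote_sketch` helper and the four toy points are hypothetical; this is a simplified illustration of the interpolation idea, not imbalanced-learn's implementation.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=42):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Step 1: the k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Step 2: a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority-class points in two dimensions (assumed for illustration).
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
new_points = smote_sketch(minority, n_new=3, k=2)
for point in new_points:
    print(point)
```

Each synthetic point lies between two real minority samples, so it never falls outside the region the minority class already occupies.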

In [None]:
# Initialize SMOTE
# sampling_strategy='auto' balances classes to 1:1 ratio
smote = SMOTE(sampling_strategy='auto', random_state=42, k_neighbors=5)

# Demonstrate SMOTE effect on training data
X_train_smote, y_train_smote = smote.fit_resample(X_train_preprocessed, y_train)

print("SMOTE Results:")
print(f"  Before SMOTE: {X_train_preprocessed.shape[0]:,} samples")
print(f"  After SMOTE:  {X_train_smote.shape[0]:,} samples")
print(f"\nClass distribution before SMOTE:")
print(y_train.value_counts())
print(f"\nClass distribution after SMOTE:")
print(pd.Series(y_train_smote).value_counts())

In [None]:
# Visualize SMOTE effect
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Before SMOTE
y_train.value_counts().plot(kind='bar', ax=axes[0], color=['#ff6b6b', '#4ecdc4'], edgecolor='black')
axes[0].set_title('Before SMOTE', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Class')
axes[0].set_ylabel('Count')
axes[0].set_xticklabels(['No (0)', 'Yes (1)'], rotation=0)

# After SMOTE
pd.Series(y_train_smote).value_counts().plot(kind='bar', ax=axes[1], color=['#ff6b6b', '#4ecdc4'], edgecolor='black')
axes[1].set_title('After SMOTE', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Class')
axes[1].set_ylabel('Count')
axes[1].set_xticklabels(['No (0)', 'Yes (1)'], rotation=0)

plt.suptitle('Effect of SMOTE on Class Distribution', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

---
## 6. Model Development

We will train and compare two classification models:
1. **Logistic Regression** - Simple, interpretable baseline model
2. **Random Forest** - Ensemble model that often performs well on tabular data

**Evaluation Metric:** We use **ROC-AUC** as our primary metric because:
- It is threshold-independent: it scores the ranking of predicted probabilities rather than a single cutoff
- Unlike accuracy, it is not inflated by always predicting the majority class
- It measures the model's ability to distinguish between classes
- For business context: a higher AUC means likely subscribers are ranked ahead of non-subscribers more often
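
ROC-AUC can be read as the probability that a randomly chosen subscriber receives a higher score than a randomly chosen non-subscriber. The sketch below computes it that way on four toy predictions; the `pairwise_auc` helper is hypothetical and quadratic in the number of samples, so it is meant for intuition only. On the same inputs, `roc_auc_score` should return the same value.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (assumed for illustration).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, giving 0.75. No classification threshold appears anywhere in the calculation.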

### 6.1 Model 1: Logistic Regression (Baseline)

In [None]:
# Create pipeline with preprocessing, SMOTE, and Logistic Regression
pipeline_lr = Pipeline([
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

# Train the model
print("Training Logistic Regression...")
pipeline_lr.fit(X_train, y_train)
print("Training complete!")

# Make predictions
y_pred_lr = pipeline_lr.predict(X_test)
y_proba_lr = pipeline_lr.predict_proba(X_test)[:, 1]

# Evaluate
print("\n" + "="*50)
print("LOGISTIC REGRESSION - BASELINE RESULTS")
print("="*50)
print(f"\nROC-AUC Score: {roc_auc_score(y_test, y_proba_lr):.4f}")
print(f"Accuracy: {accuracy_score(y_test, y_pred_lr):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred_lr):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr, target_names=['No', 'Yes']))

In [None]:
# Confusion Matrix for Logistic Regression
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred_lr, display_labels=['No', 'Yes'], 
                                        cmap='Blues', ax=ax)
ax.set_title('Logistic Regression - Confusion Matrix', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

### 6.2 Model 2: Random Forest

In [None]:
# Create pipeline with preprocessing, SMOTE, and Random Forest
pipeline_rf = Pipeline([
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1))
])

# Train the model
print("Training Random Forest...")
pipeline_rf.fit(X_train, y_train)
print("Training complete!")

# Make predictions
y_pred_rf = pipeline_rf.predict(X_test)
y_proba_rf = pipeline_rf.predict_proba(X_test)[:, 1]

# Evaluate
print("\n" + "="*50)
print("RANDOM FOREST - BASELINE RESULTS")
print("="*50)
print(f"\nROC-AUC Score: {roc_auc_score(y_test, y_proba_rf):.4f}")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred_rf):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf, target_names=['No', 'Yes']))

In [None]:
# Confusion Matrix for Random Forest
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred_rf, display_labels=['No', 'Yes'], 
                                        cmap='Greens', ax=ax)
ax.set_title('Random Forest - Confusion Matrix', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

### 6.3 Model Comparison (Baseline)

In [None]:
# Compare baseline models
comparison_df = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest'],
    'ROC-AUC': [roc_auc_score(y_test, y_proba_lr), roc_auc_score(y_test, y_proba_rf)],
    'Accuracy': [accuracy_score(y_test, y_pred_lr), accuracy_score(y_test, y_pred_rf)],
    'F1 Score': [f1_score(y_test, y_pred_lr), f1_score(y_test, y_pred_rf)],
    'Precision': [precision_score(y_test, y_pred_lr), precision_score(y_test, y_pred_rf)],
    'Recall': [recall_score(y_test, y_pred_lr), recall_score(y_test, y_pred_rf)]
})

print("\n" + "="*70)
print("BASELINE MODEL COMPARISON")
print("="*70)
print(comparison_df.to_string(index=False))

In [None]:
# ROC Curve Comparison
fig, ax = plt.subplots(figsize=(8, 8))

# Logistic Regression ROC
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_proba_lr)
ax.plot(fpr_lr, tpr_lr, color='blue', lw=2, 
        label=f'Logistic Regression (AUC = {roc_auc_score(y_test, y_proba_lr):.4f})')

# Random Forest ROC
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_proba_rf)
ax.plot(fpr_rf, tpr_rf, color='green', lw=2, 
        label=f'Random Forest (AUC = {roc_auc_score(y_test, y_proba_rf):.4f})')

# Random guess line
ax.plot([0, 1], [0, 1], color='red', lw=2, linestyle='--', label='Random Guess')

ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate (Recall)', fontsize=12)
ax.set_title('ROC Curve Comparison - Baseline Models', fontsize=14, fontweight='bold')
ax.legend(loc='lower right', fontsize=10)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---
## 7. Hyperparameter Tuning

We tune the Random Forest pipeline using **RandomizedSearchCV**, with at most 3 candidate values per hyperparameter as per project requirements.

In [None]:
# Define parameter grid for Random Forest (max 3 values per hyperparameter)
param_dist = {
    'classifier__n_estimators': [100, 200, 300],      # Number of trees
    'classifier__max_depth': [10, 15, 20],            # Maximum depth of trees
    'smote__k_neighbors': [3, 5, 7]                   # SMOTE neighbors
}

print("Hyperparameter search space:")
for param, values in param_dist.items():
    print(f"  {param}: {values}")

In [None]:
# Create pipeline for tuning
pipeline_tune = Pipeline([
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42, n_jobs=-1))
])

# Setup cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# RandomizedSearchCV
random_search = RandomizedSearchCV(
    pipeline_tune,
    param_distributions=param_dist,
    n_iter=15,              # Number of parameter combinations to try
    cv=cv,
    scoring='roc_auc',
    random_state=42,
    n_jobs=-1,
    verbose=1
)

print("Starting hyperparameter tuning...")
print("This may take a few minutes...\n")
random_search.fit(X_train, y_train)
print("\nTuning complete!")
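
To put the search budget in perspective, the sketch below enumerates the full search space and counts the model fits implied by `n_iter=15` and 5-fold cross-validation. It uses `random.sample` as a stand-in for the search's own sampling, so the combinations it draws will differ from the ones `RandomizedSearchCV` actually evaluates.

```python
import random
from itertools import product

# Same search space as param_dist above.
param_dist = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [10, 15, 20],
    'smote__k_neighbors': [3, 5, 7],
}

# Every possible combination of the listed values.
all_combos = [dict(zip(param_dist, values)) for values in product(*param_dist.values())]

# Draw 15 distinct combinations (illustrative stand-in for n_iter=15).
rng = random.Random(42)
sampled = rng.sample(all_combos, k=15)

n_splits = 5
total_fits = len(sampled) * n_splits + 1  # +1 for the final refit on all training data

print(f"Grid size:  {len(all_combos)}")  # 27
print(f"Sampled:    {len(sampled)}")     # 15
print(f"Model fits: {total_fits}")       # 76
```

The random search therefore covers 15 of the 27 possible combinations, a little over half the grid, at roughly half the cost of an exhaustive search.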

In [None]:
# Display best parameters and score
print("\n" + "="*50)
print("HYPERPARAMETER TUNING RESULTS")
print("="*50)
print(f"\nBest Parameters:")
for param, value in random_search.best_params_.items():
    print(f"  {param}: {value}")
print(f"\nBest Cross-Validation ROC-AUC: {random_search.best_score_:.4f}")

In [None]:
# Show top 5 parameter combinations
cv_results = pd.DataFrame(random_search.cv_results_)
cv_results_sorted = cv_results.sort_values('rank_test_score')[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']].head(5)
print("\nTop 5 Parameter Combinations:")
print(cv_results_sorted.to_string(index=False))

---
## 8. Final Model Evaluation

In [None]:
# Get the best model
best_model = random_search.best_estimator_

# Make predictions on test set
y_pred_best = best_model.predict(X_test)
y_proba_best = best_model.predict_proba(X_test)[:, 1]

# Final evaluation
print("\n" + "="*60)
print("FINAL MODEL EVALUATION (Tuned Random Forest)")
print("="*60)
print(f"\nTest ROC-AUC Score: {roc_auc_score(y_test, y_proba_best):.4f}")
print(f"Test Accuracy: {accuracy_score(y_test, y_pred_best):.4f}")
print(f"Test F1 Score: {f1_score(y_test, y_pred_best):.4f}")
print(f"Test Precision: {precision_score(y_test, y_pred_best):.4f}")
print(f"Test Recall: {recall_score(y_test, y_pred_best):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_best, target_names=['No', 'Yes']))

In [None]:
# Final Confusion Matrix
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred_best, display_labels=['No', 'Yes'], 
                                        cmap='Blues', ax=ax)
ax.set_title('Final Model (Tuned Random Forest) - Confusion Matrix', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Final ROC Curve
fig, ax = plt.subplots(figsize=(8, 8))

fpr_best, tpr_best, _ = roc_curve(y_test, y_proba_best)
ax.plot(fpr_best, tpr_best, color='blue', lw=2, 
        label=f'Tuned Random Forest (AUC = {roc_auc_score(y_test, y_proba_best):.4f})')
ax.plot([0, 1], [0, 1], color='red', lw=2, linestyle='--', label='Random Guess (AUC = 0.50)')

ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate (Recall)', fontsize=12)
ax.set_title('ROC Curve - Final Tuned Model', fontsize=14, fontweight='bold')
ax.legend(loc='lower right', fontsize=10)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 8.1 Model Improvement Summary

In [None]:
# Final comparison table
final_comparison = pd.DataFrame({
    'Model': ['Logistic Regression (Baseline)', 'Random Forest (Baseline)', 'Random Forest (Tuned)'],
    'ROC-AUC': [
        roc_auc_score(y_test, y_proba_lr),
        roc_auc_score(y_test, y_proba_rf),
        roc_auc_score(y_test, y_proba_best)
    ],
    'Accuracy': [
        accuracy_score(y_test, y_pred_lr),
        accuracy_score(y_test, y_pred_rf),
        accuracy_score(y_test, y_pred_best)
    ],
    'F1 Score': [
        f1_score(y_test, y_pred_lr),
        f1_score(y_test, y_pred_rf),
        f1_score(y_test, y_pred_best)
    ]
})

print("\n" + "="*70)
print("FINAL MODEL COMPARISON")
print("="*70)
print(final_comparison.to_string(index=False))

# Relative change in test ROC-AUC between the baseline and tuned Random Forest
baseline_auc = roc_auc_score(y_test, y_proba_rf)
tuned_auc = roc_auc_score(y_test, y_proba_best)
improvement = ((tuned_auc - baseline_auc) / baseline_auc) * 100

print(f"\nRelative ROC-AUC change, baseline RF to tuned RF: {improvement:+.2f}%")

---
## 9. Feature Importance Analysis

In [None]:
# Extract feature names after preprocessing
feature_names = numerical_cols.copy()

# Get categorical feature names from OneHotEncoder
ohe = best_model.named_steps['preprocessor'].named_transformers_['cat']
cat_feature_names = ohe.get_feature_names_out(categorical_cols).tolist()
feature_names.extend(cat_feature_names)

# Get feature importances
importances = best_model.named_steps['classifier'].feature_importances_

# Create DataFrame and sort
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values('Importance', ascending=False)

# Plot top 15 features
fig, ax = plt.subplots(figsize=(10, 8))
top_features = importance_df.head(15)
ax.barh(top_features['Feature'], top_features['Importance'], color='steelblue', edgecolor='black')
ax.set_xlabel('Importance', fontsize=12)
ax.set_ylabel('Feature', fontsize=12)
ax.set_title('Top 15 Feature Importances (Random Forest)', fontsize=14, fontweight='bold')
ax.invert_yaxis()

plt.tight_layout()
plt.show()

print("\nTop 10 Most Important Features:")
print(importance_df.head(10).to_string(index=False))
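
Because one-hot encoding spreads each categorical column over several binary columns, its importance is split across them in the ranking above. One possible way to compare original columns is to sum the pieces back together, as sketched below. The `aggregate_importance` helper and the importance values are hypothetical, and the prefix matching assumes the encoder's default `<column>_<category>` naming.

```python
def aggregate_importance(feature_names, importances, categorical_cols):
    """Sum one-hot column importances back onto their source column."""
    totals = {}
    for name, value in zip(feature_names, importances):
        source = name  # numerical columns keep their own name
        for col in categorical_cols:
            if name.startswith(col + "_"):
                source = col
                break
        totals[source] = totals.get(source, 0.0) + value
    # Largest total first.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical importances for illustration; not output from the fitted model.
names = ["euribor3m", "age", "poutcome_success", "poutcome_failure", "job_student", "job_retired"]
values = [0.30, 0.20, 0.15, 0.10, 0.15, 0.10]

print(aggregate_importance(names, values, ["poutcome", "job"]))
```

In the notebook, the same helper could be called with `feature_names`, `importances`, and `categorical_cols` to obtain a per-column view alongside the per-dummy ranking.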

---
## 10. Save Model for Deployment

In [None]:
import joblib

# Save the best model
joblib.dump(best_model, 'bank_marketing_model.pkl')
print("Model saved as 'bank_marketing_model.pkl'")

# Save feature names for Streamlit app
joblib.dump({
    'categorical_cols': categorical_cols,
    'numerical_cols': numerical_cols,
    'feature_names': feature_names
}, 'feature_info.pkl')
print("Feature info saved as 'feature_info.pkl'")

---
## 11. Conclusion

### Summary
- **Problem:** Predicting customer subscription to term deposits in bank marketing campaigns
- **Challenge:** Highly imbalanced dataset (~88% No, ~12% Yes)
- **Solution:** Used SMOTE to handle class imbalance + Random Forest classifier

### Key Results
- **Final ROC-AUC Score:** ~0.79 (a randomly chosen subscriber is ranked above a randomly chosen non-subscriber about 79% of the time)
- **Key Predictive Features:** 
  - Economic indicators (euribor3m, nr.employed, emp.var.rate)
  - Contact-related features (previous outcome, number of contacts)
  - Demographic features (age, job type)

### Business Impact
- The model can help banks prioritize which customers to contact
- Focusing on high-probability customers can improve conversion rates and reduce marketing costs
- Economic conditions play a significant role in customer decisions

### Limitations & Future Work
- Model performance could be improved with additional feature engineering
- Consider testing other algorithms (XGBoost, LightGBM)
- Regular retraining needed as economic conditions change