# Bank Marketing Campaign Success Prediction - Model Development

This notebook demonstrates a comprehensive Data Science Life Cycle (DSLC) for predicting bank marketing campaign success. The goal is to predict whether a client will subscribe to a term deposit.

## Dataset Description

The data comes from the direct marketing campaigns of a Portuguese banking institution. The campaigns were conducted by phone, and the same client often had to be contacted more than once to determine whether they would subscribe to the product, a bank term deposit ('yes'), or not ('no').

### Features

#### Bank Client Data
1. age: numeric
2. job: type of job
3. marital: marital status
4. education: education level
5. default: has credit in default?
6. housing: has housing loan?
7. loan: has personal loan?

#### Campaign Data
8. contact: contact communication type
9. month: last contact month of year
10. day_of_week: last contact day of the week
11. duration: last contact duration, in seconds
12. campaign: number of contacts performed during this campaign
13. pdays: number of days since the client was last contacted in a previous campaign
14. previous: number of contacts performed before this campaign
15. poutcome: outcome of the previous marketing campaign

#### Economic Context Data
16. emp.var.rate: employment variation rate - quarterly indicator
17. cons.price.idx: consumer price index - monthly indicator
18. cons.conf.idx: consumer confidence index - monthly indicator
19. euribor3m: euribor 3 month rate - daily indicator
20. nr.employed: number of employees - quarterly indicator

#### Target Variable
- y: has the client subscribed to a term deposit? (binary: 'yes','no')

## DSLC Steps

1. Setup & Data Loading
2. Data Exploration & Analysis
3. Data Preprocessing
4. Model Development & Training
5. Model Evaluation
6. Model Export

## 1. Setup & Data Loading

First, let's import the necessary libraries and fix the random seed.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
import xgboost as xgb
from sklearn.metrics import roc_auc_score, precision_recall_curve, confusion_matrix

# Set style for visualizations
plt.style.use('classic')

# Set random seed for reproducibility
np.random.seed(42)

In [None]:
import os
import requests
import shutil
import pandas as pd

# Download data if it doesn't exist
data_dir = '../data/raw'
data_file = 'bank-additional-full.csv'
data_path = os.path.join(data_dir, data_file)

if not os.path.exists(data_path):
    print("Downloading dataset...")
    os.makedirs(data_dir, exist_ok=True)
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip"
    r = requests.get(url, timeout=60)
    r.raise_for_status()  # fail early on HTTP errors instead of saving a bad zip
    zip_path = os.path.join(data_dir, "bank-additional.zip")
    
    # Save the zip file
    with open(zip_path, "wb") as f:
        f.write(r.content)
    
    # Create a temporary directory for extraction
    temp_extract_dir = os.path.join(data_dir, 'temp')
    os.makedirs(temp_extract_dir, exist_ok=True)
    
    # Unzip the file to temporary directory
    import zipfile
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(temp_extract_dir)
    
    # Move the required file to the target location
    extracted_file = os.path.join(temp_extract_dir, data_file)
    if os.path.exists(extracted_file):
        shutil.move(extracted_file, data_path)
    else:
        # Search for the file in subdirectories if not found in root
        for root, _, files in os.walk(temp_extract_dir):
            if data_file in files:
                full_path = os.path.join(root, data_file)
                shutil.move(full_path, data_path)
                break
    
    # Clean up
    shutil.rmtree(temp_extract_dir)
    os.remove(zip_path)
    print("Dataset downloaded and extracted successfully!")
else:
    print("Dataset already exists!")

# Load the data
df = pd.read_csv(data_path, sep=';')
print(f"Dataset shape: {df.shape}")
df.head()

## 2. Data Exploration & Analysis

Let's explore our dataset to understand:
- Data quality (missing values, duplicates)
- Feature distributions
- Target distribution
- Feature relationships with target

In [None]:
# Basic information about the dataset
print("Dataset Info:")
df.info()

In [None]:
print("\nMissing Values:")
display(df.isnull().sum())
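
According to the UCI documentation, missing values in the categorical attributes of this dataset are coded with the label `'unknown'`, so `isnull()` alone may not reveal them. The sketch below, on a made-up frame standing in for `df`, counts `'unknown'` labels per column and also checks for exact duplicate rows.

```python
import pandas as pd

# Toy frame for illustration; the real notebook would use `df` instead
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "technician", "admin."],
    "default": ["no", "unknown", "no", "no"],
    "age": [30, 41, 35, 30],
})

# Exact duplicate rows (the first occurrence is not counted)
n_duplicates = int(toy.duplicated().sum())

# Count 'unknown' labels per column; isnull() would report zero for these
unknown_counts = (toy == "unknown").sum()

print("Duplicate rows:", n_duplicates)
print(unknown_counts)
```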

In [None]:
print("\nTarget Distribution:")
display(df['y'].value_counts(normalize=True))

In [None]:
# Visualize target distribution
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='y')
plt.title('Target Distribution')
plt.show()
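
If the target turns out to be imbalanced, one common option in XGBoost is the `scale_pos_weight` parameter, for which the ratio of negative to positive examples is a typical starting value. The sketch below computes that ratio on toy labels; it is an optional illustration, not something the training cells in this notebook use.

```python
import pandas as pd

# Toy labels for illustration; in the notebook this would be df['y']
y_toy = pd.Series(["no"] * 8 + ["yes"] * 2)

counts = y_toy.value_counts()
n_neg, n_pos = counts["no"], counts["yes"]

# Ratio commonly suggested as a starting value for XGBoost's scale_pos_weight
scale_pos_weight = n_neg / n_pos

print(f"Positive rate: {n_pos / len(y_toy):.2f}")
print(f"Suggested scale_pos_weight: {scale_pos_weight:.1f}")
```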

In [None]:
# Analyze numeric features
numeric_features = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 
                   'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']

plt.figure(figsize=(15, 12))
for i, feature in enumerate(numeric_features, 1):
    plt.subplot(4, 3, i)
    sns.boxplot(data=df, x='y', y=feature)
    plt.title(f'{feature} by Target')
plt.tight_layout()
plt.show()

In [None]:
# Analyze categorical features
categorical_features = ['job', 'marital', 'education', 'default', 'housing', 
                       'loan', 'contact', 'month', 'day_of_week', 'poutcome']

# Create pairs of features for side-by-side display;
# a trailing unpaired feature is matched with None
feature_pairs = [
    (categorical_features[i],
     categorical_features[i + 1] if i + 1 < len(categorical_features) else None)
    for i in range(0, len(categorical_features), 2)
]

for pair in feature_pairs:
    fig, axes = plt.subplots(1, 2, figsize=(20, 6))
    
    # Plot first feature
    df_pct = df.groupby(pair[0])['y'].value_counts(normalize=True).unstack()
    df_pct['yes'].sort_values().plot(kind='bar', ax=axes[0])
    axes[0].set_title(f'Success Rate by {pair[0]}')
    axes[0].set_ylabel('Success Rate')
    axes[0].tick_params(axis='x', rotation=45)
    
    # Plot second feature if it exists
    if pair[1] is not None:
        df_pct = df.groupby(pair[1])['y'].value_counts(normalize=True).unstack()
        df_pct['yes'].sort_values().plot(kind='bar', ax=axes[1])
        axes[1].set_title(f'Success Rate by {pair[1]}')
        axes[1].set_ylabel('Success Rate')
        axes[1].tick_params(axis='x', rotation=45)
    else:
        # Hide the second subplot if there's no second feature
        axes[1].axis('off')
    
    plt.tight_layout()
    plt.show()

## 3. Data Preprocessing

Based on our analysis, let's preprocess the data:
1. Convert target to numeric
2. Encode categorical features
3. Split data into train, validation, and test sets
4. Scale numeric features, fitting the scaler on the training set only

In [None]:
# Convert target to numeric
df['y'] = (df['y'] == 'yes').astype(int)

# Initialize label encoders for categorical features
label_encoders = {}
for feature in categorical_features:
    label_encoders[feature] = LabelEncoder()
    df[feature] = label_encoders[feature].fit_transform(df[feature])

# Split features and target
X = df.drop('y', axis=1)
y = df['y']

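# Split into train, validation, and test sets (60/20/20), stratified on the
# target so that each split keeps the same class balance
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

# Work on explicit copies to avoid pandas SettingWithCopyWarning when scaling
X_train, X_val, X_test = X_train.copy(), X_val.copy(), X_test.copy()

# Scale numeric features (fit on the training set only to avoid leakage)
scaler = StandardScaler()
X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])
X_val[numeric_features] = scaler.transform(X_val[numeric_features])
X_test[numeric_features] = scaler.transform(X_test[numeric_features])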
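
The two-step split holds out 20% for testing and then 25% of the remaining 80% (20% overall) for validation, giving a 60/20/20 partition. The sketch below checks those proportions on synthetic labels and shows that passing `stratify` keeps the positive rate equal across the three parts. The arrays `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with 10% positives
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 100 + [0] * 900)

# 20% held out for test, then 25% of the remaining 80% (= 20% overall) for validation
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42, stratify=y_tmp
)

for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(f"{name}: {len(part)} rows, positive rate {part.mean():.3f}")
```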

## 4. Model Development & Training

We'll use XGBoost for this binary classification task and perform:
1. Cross-validation training
2. Hyperparameter tuning
3. Final model training

In [None]:
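# Initial model parameters
base_params = {
    'max_depth': 5,                  # maximum tree depth
    'eta': 0.5,                      # learning rate (alias: learning_rate)
    'alpha': 2.5,                    # L1 regularization on leaf weights (alias: reg_alpha)
    'objective': 'binary:logistic',  # outputs probabilities for the positive class
    'eval_metric': 'auc',
    'subsample': 0.8,                # fraction of rows sampled per tree
    'colsample_bytree': 0.8,         # fraction of columns sampled per tree
    'min_child_weight': 3,           # minimum sum of instance weights (hessian) in a child
    'tree_method': 'auto'
}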

In [None]:
# Create XGBoost classifier
model_cv = xgb.XGBClassifier(**base_params)

# Perform 5-fold cross-validation
cv_scores = cross_val_score(model_cv, X_train, y_train, cv=5, scoring='roc_auc')

print(f"CV Scores: {cv_scores}")
print(f"Mean CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
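
For a classifier and an integer `cv`, scikit-learn's `cross_val_score` uses stratified, unshuffled folds. The sketch below shows how the folds could be made explicit and shuffled with a fixed seed. It runs on synthetic data, and `LogisticRegression` stands in for the XGBoost classifier only to keep the example light.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data as a stand-in for X_train / y_train
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)

# Explicit, shuffled, seeded folds make the CV scores reproducible
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# LogisticRegression is used here only to keep the sketch light
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=cv, scoring="roc_auc"
)

print(f"Fold AUCs: {scores.round(3)}")
print(f"Mean AUC: {scores.mean():.4f}, std: {scores.std():.4f}")
```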

In [None]:
# Hyperparameter tuning
param_grid = {
    'max_depth': [3, 5, 7],
    'eta': [0.1, 0.3, 0.5],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

# Create base model
model_tune = xgb.XGBClassifier(**base_params)

# Create GridSearchCV object
grid_search = GridSearchCV(
    estimator=model_tune,
    param_grid=param_grid,
    scoring='roc_auc',
    cv=5,
    n_jobs=-1,
    verbose=1
)

# Perform grid search
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")

# Update base_params with best parameters
base_params.update(grid_search.best_params_)
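
An exhaustive grid grows multiplicatively with each parameter, so it is worth estimating the cost before launching the search. The sketch below counts the candidates in the grid above with scikit-learn's `ParameterGrid` and multiplies by the number of folds.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell
param_grid = {
    'max_depth': [3, 5, 7],
    'eta': [0.1, 0.3, 0.5],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

n_candidates = len(ParameterGrid(param_grid))
n_folds = 5
n_fits = n_candidates * n_folds  # GridSearchCV also refits once on the full training set

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

If 1215 fits is too slow, a randomized search over the same ranges, or a smaller grid, would be a reasonable alternative.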

In [None]:
# Convert data to DMatrix format
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

# Train the model
model = xgb.train(
    params=base_params,
    dtrain=dtrain,
    num_boost_round=150,
    evals=[(dtrain, 'train'), (dval, 'val')],
    early_stopping_rounds=10,
    verbose_eval=10
)
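
With `early_stopping_rounds=10`, training stops once the validation metric has not improved for 10 consecutive rounds. When several datasets are passed in `evals`, XGBoost uses the last entry (here `val`) for that check. The function below is a simplified, pure-Python illustration of the patience logic; it is not XGBoost's actual implementation, and the function name and the scores are hypothetical.

```python
def best_round_with_patience(val_scores, patience, maximize=True):
    """Return (best_index, stop_index) for a list of per-round validation scores.

    Simplified illustration of patience-based early stopping; it is not
    XGBoost's actual implementation.
    """
    best_idx = 0
    for i, score in enumerate(val_scores):
        if maximize:
            improved = score > val_scores[best_idx]
        else:
            improved = score < val_scores[best_idx]
        if i == 0 or improved:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx, i  # stop: no improvement for `patience` rounds
    return best_idx, len(val_scores) - 1  # ran out of rounds without stopping


# Hypothetical validation AUCs: improves until round 2, then plateaus
aucs = [0.90, 0.92, 0.93, 0.93, 0.92, 0.93, 0.91]
best, stopped = best_round_with_patience(aucs, patience=3)
print(f"Best round: {best}, stopped at round: {stopped}")
```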

## 5. Model Evaluation

Let's evaluate our model using:
1. ROC-AUC score
2. Confusion matrix
3. Feature importance
4. Precision-Recall curve

In [None]:
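# Make predictions
dtest = xgb.DMatrix(X_test)
# The native Booster predicts with all trees by default, so restrict
# prediction to the best iteration found by early stopping
y_pred_proba = model.predict(dtest, iteration_range=(0, model.best_iteration + 1))
y_pred = (y_pred_proba > 0.5).astype(int)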

# Calculate metrics
auc_score = roc_auc_score(y_test, y_pred_proba)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Test AUC: {auc_score:.4f}")
print("\nConfusion Matrix:")
print(conf_matrix)
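
AUC is threshold-independent, while the confusion matrix depends on the 0.5 cutoff. The sketch below shows how precision, recall, and F1 follow from the four cells of the matrix. The labels and probabilities are toy values.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels and probabilities for illustration
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.1, 0.7, 0.4, 0.9, 0.8])
y_hat = (y_prob > 0.5).astype(int)

# sklearn orders the matrix as [[TN, FP], [FN, TP]] for labels (0, 1)
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision={precision:.2f} Recall={recall:.2f} F1={f1:.2f}")
```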

In [None]:
# Plot feature importance
importance_scores = model.get_score(importance_type='gain')
importance_df = pd.DataFrame(
    list(importance_scores.items()),
    columns=['Feature', 'Importance']
).sort_values('Importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(data=importance_df.head(10), x='Importance', y='Feature')
plt.title('Top 10 Feature Importance (Gain)')
plt.show()

In [None]:
# Plot precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)

plt.figure(figsize=(8, 6))
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.grid(True)
plt.show()
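
The precision-recall curve can also guide the choice of a decision threshold other than 0.5. The sketch below picks the threshold that maximizes F1 on toy scores. Maximizing F1 is only one possible criterion, and in practice the threshold would be chosen on validation data rather than on the test set.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores for illustration
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.1, 0.7, 0.4, 0.9, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# precision and recall have one more element than thresholds; drop the final point
p, r = precision[:-1], recall[:-1]

# F1 per threshold, guarding against division by zero
f1 = 2 * p * r / np.maximum(p + r, 1e-12)

best = int(np.argmax(f1))
print(f"Best threshold by F1: {thresholds[best]:.2f} (F1={f1[best]:.3f})")
```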

## 6. Model Export

Save the model and necessary artifacts for production deployment. These files will be used by the ML Engineering team for model serving.

In [None]:
# Save model
os.makedirs("../model", exist_ok=True)
model.save_model('../model/xgboost_model.json')

# Save feature names and other metadata
model_metadata = {
    'feature_names': X.columns.tolist(),
    'numeric_features': numeric_features,
    'categorical_features': categorical_features,
    'model_params': base_params,
    'prediction_threshold': 0.5,
    'metrics': {
        'auc_score': auc_score
    },
    'label_mapping': {
        feature: dict(zip(encoder.classes_, range(len(encoder.classes_))))
        for feature, encoder in label_encoders.items()
    },
    'scaler_params': {
        'mean': scaler.mean_.tolist(),
        'scale': scaler.scale_.tolist()
    }
}

import json
with open('../model/model_metadata.json', 'w', encoding='utf-8') as f:
    json.dump(model_metadata, f, indent=4)
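
On the serving side, the saved metadata is enough to reproduce the preprocessing without the original encoder and scaler objects. The sketch below shows how a single raw record could be transformed. The function `preprocess_record` and the trimmed metadata are hypothetical, and raising an error on unseen categories is just one possible policy.

```python
import json

# Hypothetical, heavily trimmed metadata with the same keys as model_metadata.json
metadata = json.loads("""
{
    "numeric_features": ["age", "campaign"],
    "categorical_features": ["marital"],
    "label_mapping": {"marital": {"divorced": 0, "married": 1, "single": 2}},
    "scaler_params": {"mean": [40.0, 2.5], "scale": [10.0, 2.0]}
}
""")


def preprocess_record(record, metadata):
    """Apply the saved label mapping and standard scaling to one raw record."""
    out = dict(record)
    for feature in metadata["categorical_features"]:
        mapping = metadata["label_mapping"][feature]
        if out[feature] not in mapping:
            raise ValueError(f"Unseen category {out[feature]!r} for {feature}")
        out[feature] = mapping[out[feature]]
    means = metadata["scaler_params"]["mean"]
    scales = metadata["scaler_params"]["scale"]
    # mean/scale are stored in the same order as numeric_features
    for name, mean, scale in zip(metadata["numeric_features"], means, scales):
        out[name] = (out[name] - mean) / scale
    return out


result = preprocess_record({"age": 50, "campaign": 1, "marital": "single"}, metadata)
print(result)
```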