In [None]:
# Importing necessary libraries for the project
import numpy as np
import pandas as pd

from sklearn.base import ClassifierMixin
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline, make_pipeline as base_make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from xgboost import XGBClassifier

from matplotlib import pyplot as plt
from tabulate import tabulate

import calendar
from datetime import datetime, timedelta

In [None]:
# Constants for the telemarketing project
from pathlib import Path

CATEGORICAL_FEATURES = [
    "contact",
    "day_of_week",
    "default",
    "education",
    "housing",
    "job",
    "loan",
    "marital",
    "month",
    "poutcome",
    "year",
]
NUMERICAL_FEATURES = [
    "age",
    "campaign",
    "pdays",
    "previous",
    "emp.var.rate",
    "cons.price.idx",
    "cons.conf.idx",
    "euribor3m",
    "nr.employed",
]
BINARY_FEATURES = [
    "y",
]

DATA_DIR = Path("data")
RAW_DATA_DIR = DATA_DIR / "raw"
INTERIM_DATA_DIR = DATA_DIR / "interim"
PROCESSED_DATA_DIR = DATA_DIR / "processed"

DATA_FILENAME = "bank-additional-full.csv"
APPROACHED_DATA_FILENAME = "approached_data.csv"
NOT_APPROACHED_DATA_FILENAME = "not_approached_data.csv"

HONOLULU_BLUE = "#1F77B4"
IMPERIAL_RED = "#F0534F"
PERSIAN_GREEN = "#27A69A"

# Time-based Data Split Strategy

- **Test period selection**: 2010 data serves as the test set to ensure realistic and future-proof evaluation

- **Training and validation periods**: 2008-2009 data is used for model training and validation phases

- **Temporal integrity**: The time-based split of training and test datasets maintains chronological order, preventing data leakage where future information inappropriately influences model training

- **Generalization assessment**: This approach enables honest evaluation of how effectively the model performs on completely unseen data

- **Business application**: The split structure supports realistic business strategy simulations that mirror actual real-world deployment conditions

**Note**

We deliberately avoided stratifying the data splits by the target variable. Stratified splitting assigns rows to splits by class label (usually combined with random shuffling), which is incompatible with the strict chronological ordering this time-ordered problem requires.
Our priority is to prevent data leakage and create a realistic train/test split that respects the arrow of time. The class imbalance is instead addressed at the model training stage using the `class_weight` parameter, which reweights the classes during fitting without touching the temporal structure of the validation framework.
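The split described above can be sketched as follows. This is a minimal illustration on a tiny synthetic frame, not the notebook's actual split code: the helper name `split_by_year`, the toy data, and the 0/1 encoding of `y` are assumptions made for the example.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def split_by_year(df, test_year=2010, target="y", year_col="year"):
    """Hold out one calendar year as the test set; all earlier rows are training data."""
    train = df[df[year_col] < test_year]
    test = df[df[year_col] == test_year]
    return (
        train.drop(columns=[target]),
        test.drop(columns=[target]),
        train[target],
        test[target],
    )


# Tiny synthetic stand-in for the real data (assumed 0/1 target)
toy = pd.DataFrame(
    {
        "year": [2008, 2008, 2008, 2009, 2009, 2010, 2010],
        "age": [56, 57, 37, 40, 56, 45, 33],
        "y": [0, 0, 1, 0, 1, 0, 1],
    }
)
X_train, X_test, y_train, y_test = split_by_year(toy)

# class_weight="balanced" reweights classes during fitting,
# so no stratified (shuffled) split is needed to handle the imbalance
model = RandomForestClassifier(n_estimators=10, class_weight="balanced", random_state=0)
model.fit(X_train, y_train)
```

Because the split is a pure filter on `year`, no row from the test year can reach the training set.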

In [90]:
# Load not_approached dataset
df = pd.read_csv(PROCESSED_DATA_DIR / NOT_APPROACHED_DATA_FILENAME)

In [91]:
# Define X and y for modeling
X = df.drop(columns=["y"])  # `columns=` already implies the axis
y = df["y"]

print(f"X shape: {X.shape}, y shape: {y.shape}")

X.head()

X shape: (39673, 17), y shape: (39673,)


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,campaign,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,year
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,1,1.1,93.994,-36.4,4.857,5191.0,2008
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,1,1.1,93.994,-36.4,4.857,5191.0,2008
2,37,services,married,high.school,no,yes,no,telephone,may,mon,1,1.1,93.994,-36.4,4.857,5191.0,2008
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,1,1.1,93.994,-36.4,4.857,5191.0,2008
4,56,services,married,high.school,no,no,yes,telephone,may,mon,1,1.1,93.994,-36.4,4.857,5191.0,2008


In [92]:
# Determine the split into train and test data.
print(X.groupby(['year']).size().sort_index())
# Ratio between 2010 data and the rest
print(f"Ratio of 2010 data to the rest: {X[X['year'] == 2010].shape[0] / X[X['year'] != 2010].shape[0]:.2f}")

year
2008    27655
2009    10685
2010     1333
dtype: int64
Ratio of 2010 data to the rest: 0.03


Our time-based data split revealed a key challenge: the most recent data available for testing (2010, 1,333 rows) is far smaller than the preceding training years (38,340 rows for 2008-2009 combined). Evaluation on this set alone would yield high-variance, statistically unreliable estimates.

Therefore, we have adopted a more sophisticated validation protocol to ensure trustworthiness:

### Model selection will be driven by a stable, averaged score from a rigorous time-series cross-validation on the large 2008-2009 dataset

**The problem with unequally spaced time series:**

TimeSeriesSplit splits by row count, not actual time duration. This means:
1. **Unequal Spacing:** Test sets (same number of rows) can cover vastly different **time durations**.
2. **Imbalanced Periods:** Different test sets will represent different years/contexts (e.g., 2008 data vs. 2009 data).

**Result:** Cross-validation metrics (e.g., F1, ROC AUC) are **not truly comparable** across folds, as they reflect performance over different time horizons and contexts.

**How to Address It:**
1. **Keep TimeSeriesSplit**: It's still correct for chronological splits.
2. **Smart Feature Engineering (Key!)**:
   * **Time-based Features:** Add year, month, day_of_week, etc., so the model knows the temporal context.
   * **Gap Features:** Add days_since_last_observation or observations_in_last_X_days to inform the model about data density.
3. **Careful Evaluation**:
   * Report **metrics for each individual fold** to see how performance shifts over time/contexts.
   * **Contextualize** results; explain performance changes based on the different periods covered by each fold.
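The per-fold reporting suggested above could look roughly like this. It is a sketch on synthetic, chronologically ordered data; the array sizes, the 3:1 ratio between years, and the choice of F1 as the metric are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
n = 600
# Synthetic data sorted by time; 2008 is deliberately overrepresented
years = np.concatenate([np.full(450, 2008), np.full(150, 2009)])
X = rng.normal(size=(n, 3))
y = (X[:, 0] + rng.normal(size=n) > 0.8).astype(int)

tscv = TimeSeriesSplit(n_splits=4)  # splits by row count, not by duration
fold_scores = []
for fold, (train_idx, val_idx) in enumerate(tscv.split(X), start=1):
    clf = RandomForestClassifier(n_estimators=25, class_weight="balanced", random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    score = f1_score(y[val_idx], clf.predict(X[val_idx]), zero_division=0)
    fold_scores.append(score)
    # Report which years each validation fold covers, next to its score
    covered = sorted(set(years[val_idx].tolist()))
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}, "
          f"val years={covered}, F1={score:.3f}")

print(f"mean F1={np.mean(fold_scores):.3f}, std={np.std(fold_scores):.3f}")
```

Printing the covered years beside each score makes it visible when a fold's result reflects a different period rather than a different model quality.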


### The small 2010 test set will then serve as a final, mandatory sanity check, with its performance explicitly framed by confidence intervals to account for potential variance.

**References need to be checked**

**Statistical issues with small test sets**

This issue is well documented in statistics and machine learning:

- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. Chapter 7 discusses model assessment and the impact of sample size on estimation stability. Small test sets lead to high-variance estimates, necessitating robust evaluation methods like cross-validation or confidence intervals.
- Kohavi, R. (1995). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. IJCAI. Kohavi emphasizes that small test sets produce unreliable performance estimates and recommends techniques like cross-validation or confidence intervals to quantify uncertainty.
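One possible way to frame a small-test-set score with a confidence interval is a percentile bootstrap over the test rows. The sketch below is an assumption about how this could be done, not the notebook's method: `bootstrap_ci` is a hypothetical helper, and the labels and predictions are simulated at roughly the size of the 2010 set.

```python
import numpy as np
from sklearn.metrics import f1_score


def bootstrap_ci(y_true, y_pred, metric, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a metric evaluated on a fixed test set."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample test rows with replacement
        stats[b] = metric(y_true[idx], y_pred[idx])
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return metric(y_true, y_pred), lower, upper


# Simulated labels and imperfect predictions (illustrative only)
rng = np.random.default_rng(1)
y_true = (rng.random(1333) < 0.15).astype(int)
flip = rng.random(1333) < 0.10  # 10% of predictions are wrong
y_pred = np.where(flip, 1 - y_true, y_true)

point, lower, upper = bootstrap_ci(
    y_true, y_pred, lambda a, b: f1_score(a, b, zero_division=0)
)
print(f"F1 = {point:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

The width of the interval, rather than the point estimate alone, indicates how much weight the 2010 result can bear.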




# Concept drift and temporal imbalance

Our dataset is influenced by two interacting temporal phenomena. First, concept drift, plausibly driven by events such as the 2008 financial crisis, makes older data less predictive of current outcomes. Second, we observe a temporal instance imbalance: the outdated 2008-era data is overrepresented in volume. Each issue is problematic on its own, but their combination is especially harmful because the volume imbalance amplifies the negative impact of the drift. To address this, we apply time-based sample weighting, not to equalize instance counts, but to rebalance the influence of different time periods. This ensures the model prioritizes learning from more recent, and therefore more relevant, observations.
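A minimal sketch of such time-based weighting is shown below, assuming an exponential decay per year of age. The function name `recency_weights` and the `decay=0.5` value are hypothetical choices for illustration; the notebook's actual weighting scheme may differ.

```python
import numpy as np
import pandas as pd


def recency_weights(years, decay=0.5):
    """Weight each row by decay ** (years before the latest year), rescaled to mean 1."""
    years = np.asarray(years)
    age = years.max() - years
    w = decay ** age
    return w * len(w) / w.sum()


# Row counts per year taken from the training period shown earlier
years = np.array([2008] * 27655 + [2009] * 10685)
w = recency_weights(years)

# Share of total training influence per year, before and after weighting
raw_share = pd.Series(np.ones(len(years))).groupby(years).sum() / len(years)
share = pd.Series(w).groupby(years).sum() / w.sum()
print(pd.DataFrame({"raw": raw_share.round(3), "weighted": share.round(3)}))
```

With these assumed settings the 2008 share of total influence drops from about 72% to about 56%, without discarding any rows. Estimators such as `RandomForestClassifier` accept the resulting array through the `sample_weight` argument of `fit`.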
