# Bank Term Deposit Prediction — Starter Notebook
**Created:** 2025-11-12 12:40

This notebook is a clean, reproducible pipeline for your assignment:

- Loads `train.csv` (32,950 × 21) and `test.csv` (8,238 × 20).
- Preprocesses categorical / numeric features with a single `ColumnTransformer`.
- Trains and evaluates three models (logistic regression, random forest, histogram gradient boosting) with 5-fold stratified CV using the class metric:
  **score = (Accuracy + F1 + AUC) / 3**.
- Selects the best model by mean score and fits it on the full training set.
- Generates `prediction.csv` with required columns: `id`, `y_predict` (0/1), `y_prob` (P(class=1)).

> **Files expected in the same folder as this notebook**  
> - `train.csv` (must include columns: id, age, job, marital, education, default, housing, loan, contact, month, day_of_week, campaign, pdays, previous, poutcome, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed, y)  
> - `test.csv` (all the same **except** it has no `y` column)

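After running all cells, you'll get:  
- `prediction.csv`
- `model_card.txt` (brief interpretation summary)
- `feature_importance_top20.csv` (only if the selected model exposes feature importances or coefficients)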


In [1]:
# ==== Imports & Config ====
import os
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier

from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

import warnings
warnings.filterwarnings('ignore')

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

DATA_DIR = os.getcwd()  # put train.csv/test.csv in the same folder as this notebook
TRAIN_PATH = os.path.join(DATA_DIR, 'train.csv')
TEST_PATH  = os.path.join(DATA_DIR, 'test.csv')

In [2]:
# ==== Load Data ====
train = pd.read_csv(TRAIN_PATH)
test  = pd.read_csv(TEST_PATH)

print('Train shape:', train.shape)
print('Test  shape:', test.shape)
display(train.head(3))
display(test.head(3))

Train shape: (32950, 21)
Test  shape: (8238, 20)


|  | id | age | job | marital | education | default | housing | loan | contact | month | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19495 | 36 | technician | married | university.degree | no | yes | yes | cellular | aug | ... | 1 | 999 | 0 | nonexistent | 1.4 | 93.444 | -36.1 | 4.968 | 5228.1 | no |
| 1 | 38793 | 28 | admin. | single | university.degree | no | yes | no | cellular | nov | ... | 2 | 999 | 0 | nonexistent | -3.4 | 92.649 | -30.1 | 0.714 | 5017.5 | no |
| 2 | 27160 | 57 | management | divorced | professional.course | no | yes | no | cellular | nov | ... | 1 | 999 | 0 | nonexistent | -0.1 | 93.2 | -42.0 | 4.021 | 5195.8 | no |


|  | id | age | job | marital | education | default | housing | loan | contact | month | day_of_week | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8626 | 27 | technician | single | university.degree | no | no | yes | telephone | jun | wed | 1 | 999 | 0 | nonexistent | 1.4 | 94.465 | -41.8 | 4.864 | 5228.1 |
| 1 | 6749 | 29 | services | married | basic.9y | no | yes | no | telephone | may | wed | 3 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 |
| 2 | 7227 | 59 | admin. | married | high.school | no | yes | no | telephone | may | thu | 4 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.86 | 5191.0 |


In [3]:
# ==== Quick EDA (minimal, expand in your report) ====
print('\nTarget distribution (y):')
print(train['y'].value_counts(dropna=False))

print('\nMissing values per column (train):')
print(train.isna().sum())

print('\nUnknown label counts in categorical columns (train):')
cat_cols_all = ['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']
for c in cat_cols_all:
    if c in train.columns:
        unk = (train[c].astype(str).str.lower() == 'unknown').sum()
        print(f'{c}: {unk}')


Target distribution (y):
y
no     29238
yes     3712
Name: count, dtype: int64

Missing values per column (train):
id                0
age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64

Unknown label counts in categorical columns (train):
job: 262
marital: 62
education: 1371
default: 6859
housing: 790
loan: 790
contact: 0
month: 0
day_of_week: 0
poutcome: 0
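
The target distribution above is imbalanced (3,712 `yes` out of 32,950 rows), which matters for the blended score defined in the introduction. As a rough reference point, the sketch below works out the score of a hypothetical baseline that always predicts `no` with one constant probability. The baseline is an illustrative assumption and not part of this notebook's pipeline.

```python
# Class counts copied from the target distribution printed above.
n_no, n_yes = 29238, 3712
n_total = n_no + n_yes

# Hypothetical baseline: predict 'no' for every row with one constant probability.
baseline_acc = n_no / n_total   # every 'no' row is classified correctly
baseline_f1 = 0.0               # no true positives, so F1 for class 1 is 0
baseline_auc = 0.5              # a constant score ties every positive/negative pair

baseline_score = (baseline_acc + baseline_f1 + baseline_auc) / 3.0
print(f"positive rate  = {n_yes / n_total:.4f}")
print(f"baseline score = {baseline_score:.4f}")
```

The baseline lands at roughly 0.46, so accuracy alone (about 0.89 here) says little. A model has to improve F1 and AUC to move the blended score.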


In [4]:
# ==== Preprocessing Setup ====
target_col = 'y'

# Map target: 'yes'->1, 'no'->0
train[target_col] = (train[target_col].astype(str).str.lower() == 'yes').astype(int)

id_col = 'id'

numeric_features = ['age','campaign','pdays','previous','emp.var.rate','cons.price.idx','cons.conf.idx','euribor3m','nr.employed']
categorical_features = ['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']

# Numeric: median impute + (optional) scale
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler(with_mean=False))  # with_mean=False keeps it sparse-friendly
])

# Categorical: keep 'unknown' as a valid category; impute most_frequent for real NaNs; one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=True))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='drop'  # default; columns not listed above (e.g. 'id') are dropped
)

In [5]:
# ==== Define Models ====
models = {
    'logreg': LogisticRegression(max_iter=2000, class_weight='balanced', random_state=RANDOM_STATE, n_jobs=None),
    'rf'    : RandomForestClassifier(n_estimators=400, max_depth=None, min_samples_leaf=2, random_state=RANDOM_STATE, n_jobs=-1, class_weight='balanced_subsample'),
    'hgb'   : HistGradientBoostingClassifier(max_depth=None, learning_rate=0.08, max_bins=255, random_state=RANDOM_STATE)
}

pipelines = {name: Pipeline(steps=[('preprocess', preprocessor), ('model', clf)]) 
             for name, clf in models.items()}

X = train.drop(columns=[target_col])
y = train[target_col]

In [6]:
# ==== 5-Fold Stratified Cross-Validation ====
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

def blended_score(acc, f1, auc):
    return (acc + f1 + auc) / 3.0

cv_summary = []

for name, pipe in pipelines.items():
    accs, f1s, aucs = [], [], []
    for fold, (tr_idx, va_idx) in enumerate(cv.split(X, y), 1):
        X_tr, X_va = X.iloc[tr_idx], X.iloc[va_idx]
        y_tr, y_va = y.iloc[tr_idx], y.iloc[va_idx]
        
        pipe.fit(X_tr, y_tr)
        # All three models implement predict_proba; column 1 is P(class=1).
        # (A decision_function fallback would not be a probability, so the 0.5 cut-off would not apply.)
        prob_va = pipe.predict_proba(X_va)[:, 1]

        pred_va = (prob_va >= 0.5).astype(int)
        acc = accuracy_score(y_va, pred_va)
        f1  = f1_score(y_va, pred_va)
        auc = roc_auc_score(y_va, prob_va)
        
        accs.append(acc); f1s.append(f1); aucs.append(auc)
        print(f"[{name}] Fold {fold}: ACC={acc:.4f}  F1={f1:.4f}  AUC={auc:.4f}")
    
    mean_acc = np.mean(accs)
    mean_f1  = np.mean(f1s)
    mean_auc = np.mean(aucs)
    score    = blended_score(mean_acc, mean_f1, mean_auc)
    cv_summary.append({'model': name, 'ACC': mean_acc, 'F1': mean_f1, 'AUC': mean_auc, 'Score': score})

cv_df = pd.DataFrame(cv_summary).sort_values('Score', ascending=False).reset_index(drop=True)
print("\nCV Summary (higher is better):")
display(cv_df)

[logreg] Fold 1: ACC=0.8249  F1=0.4436  AUC=0.7818
[logreg] Fold 2: ACC=0.8264  F1=0.4425  AUC=0.7836
[logreg] Fold 3: ACC=0.8276  F1=0.4517  AUC=0.7879
[logreg] Fold 4: ACC=0.8264  F1=0.4552  AUC=0.7958
[logreg] Fold 5: ACC=0.8302  F1=0.4638  AUC=0.8014
[rf] Fold 1: ACC=0.8794  F1=0.4682  AUC=0.7866
[rf] Fold 2: ACC=0.8829  F1=0.4901  AUC=0.7876
[rf] Fold 3: ACC=0.8844  F1=0.4752  AUC=0.7824
[rf] Fold 4: ACC=0.8880  F1=0.5073  AUC=0.7971
[rf] Fold 5: ACC=0.8832  F1=0.4974  AUC=0.8051
[hgb] Fold 1: ACC=0.8991  F1=0.3624  AUC=0.7981
[hgb] Fold 2: ACC=0.9018  F1=0.3712  AUC=0.7932
[hgb] Fold 3: ACC=0.9032  F1=0.3594  AUC=0.7973
[hgb] Fold 4: ACC=0.9015  F1=0.3398  AUC=0.8079
[hgb] Fold 5: ACC=0.8988  F1=0.3702  AUC=0.8142

CV Summary (higher is better):


|  | model | ACC | F1 | AUC | Score |
|---|---|---|---|---|---|
| 0 | rf | 0.883551 | 0.487652 | 0.791741 | 0.720981 |
| 1 | logreg | 0.827102 | 0.451376 | 0.790114 | 0.689531 |
| 2 | hgb | 0.90088 | 0.360605 | 0.802158 | 0.687881 |
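
The manual fold loop above can also be written with `cross_validate`, which the first cell imports but never uses. The sketch below is an alternative under stated assumptions, not this notebook's method. It runs on synthetic data from `make_classification` (a stand-in for `X`, `y`) with a plain `LogisticRegression`, and `summarize_cv` is a hypothetical helper name. The `accuracy` and `f1` scorers call `predict`, which for these classifiers should match thresholding `predict_proba` at 0.5, up to ties.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the real training data (assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.89, 0.11], random_state=42
)

def summarize_cv(estimator, X, y, n_splits=5, seed=42):
    """Return mean ACC, F1, AUC and their blended average (hypothetical helper)."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    res = cross_validate(estimator, X, y, cv=cv,
                         scoring=['accuracy', 'f1', 'roc_auc'])
    acc = float(res['test_accuracy'].mean())
    f1 = float(res['test_f1'].mean())
    auc = float(res['test_roc_auc'].mean())
    return {'ACC': acc, 'F1': f1, 'AUC': auc, 'Score': (acc + f1 + auc) / 3.0}

summary = summarize_cv(
    LogisticRegression(max_iter=2000, class_weight='balanced'), X_demo, y_demo
)
print({k: round(v, 4) for k, v in summary.items()})
```

With the real data, the same helper could be called once per entry in `pipelines`. The explicit loop remains useful when per-fold printouts or a custom threshold are needed.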


In [7]:
# ==== Fit Best Model on Full Train & Predict Test ====
best_model_name = cv_df.loc[0, 'model']
best_pipe = pipelines[best_model_name]
best_pipe.fit(X, y)

# Save a tiny model card / interpretation helper
model_card_lines = [f"Best model: {best_model_name}",
                    f"CV Scores: ACC={cv_df.loc[0,'ACC']:.4f}, F1={cv_df.loc[0,'F1']:.4f}, AUC={cv_df.loc[0,'AUC']:.4f}, Score={cv_df.loc[0,'Score']:.4f}",
                    "Notes: 'unknown' categories kept as-is; numeric features median-imputed & scaled; class weights applied where appropriate.",
                    "Feature importance (top 20) below if available."]

# Try to compute feature importances/coefs after preprocessing
try:
    importances = None
    model = best_pipe.named_steps['model']
    ohe = best_pipe.named_steps['preprocess'].named_transformers_['cat'].named_steps['onehot']
    num_cols = best_pipe.named_steps['preprocess'].transformers_[0][2]
    cat_cols = ohe.get_feature_names_out(best_pipe.named_steps['preprocess'].transformers_[1][2])

    feature_names = list(num_cols) + list(cat_cols)

    if hasattr(model, 'feature_importances_'):
        importances = model.feature_importances_
    elif hasattr(model, 'coef_'):
        # LogisticRegression on a binary target: coef_ has shape (1, n_features)
        importances = model.coef_.ravel()
    else:
        importances = None

    if importances is not None:
        fi = pd.DataFrame({'feature': feature_names, 'importance': importances})
        fi['abs_importance'] = fi['importance'].abs()
        fi = fi.sort_values('abs_importance', ascending=False).head(20)
        model_card_lines.append("\nTop 20 features by importance/|coef|:")
        for _, r in fi.iterrows():
            model_card_lines.append(f"- {r['feature']}: {r['importance']:.6f}")

        fi.to_csv('feature_importance_top20.csv', index=False)
except Exception as e:
    model_card_lines.append(f"(Could not compute importances: {e})")

with open('model_card.txt','w', encoding='utf-8') as f:
    f.write("\n".join(model_card_lines))

# Predict on test
test_ids = test[id_col].values
# All candidate models implement predict_proba, so y_prob is always P(class=1)
test_prob = best_pipe.predict_proba(test)[:, 1]
test_pred = (test_prob >= 0.5).astype(int)

prediction = pd.DataFrame({
    'id': test_ids,
    'y_predict': test_pred,
    'y_prob': test_prob
})

prediction.to_csv('prediction.csv', index=False)
print("Saved: prediction.csv, model_card.txt (and feature_importance_top20.csv if available)")
display(prediction.head())

Saved: prediction.csv, model_card.txt (and feature_importance_top20.csv if available)


|  | id | y_predict | y_prob |
|---|---|---|---|
| 0 | 8626 | 0 | 0.235304 |
| 1 | 6749 | 0 | 0.035834 |
| 2 | 7227 | 0 | 0.108397 |
| 3 | 12558 | 0 | 0.172662 |
| 4 | 9628 | 0 | 0.04123 |
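
Before submitting, it can help to check the file against the format stated in the introduction (`id`, `y_predict` as 0/1, `y_prob` as a probability). The function below is a minimal sketch. `check_prediction_frame` is a hypothetical helper name, the uniqueness check on `id` is an assumption about what the grader expects, and the demo rows are copied from the preview above.

```python
import pandas as pd

def check_prediction_frame(df, expected_rows=None):
    """Raise AssertionError if df does not match the required submission layout."""
    required = {'id', 'y_predict', 'y_prob'}
    assert required.issubset(df.columns), 'missing required columns'
    assert df['id'].is_unique, 'duplicate ids'  # assumption: one row per id
    assert df['y_predict'].isin([0, 1]).all(), 'y_predict must be 0 or 1'
    assert df['y_prob'].between(0.0, 1.0).all(), 'y_prob must lie in [0, 1]'
    assert not df[list(required)].isna().any().any(), 'missing values present'
    if expected_rows is not None:
        assert len(df) == expected_rows, 'row count differs from the test set'
    return True

# Demo on the first rows shown in the preview above.
demo = pd.DataFrame({
    'id': [8626, 6749, 7227],
    'y_predict': [0, 0, 0],
    'y_prob': [0.235304, 0.035834, 0.108397],
})
ok = check_prediction_frame(demo, expected_rows=3)
print(ok)
```

On the real file this could be called as `check_prediction_frame(pd.read_csv('prediction.csv'), expected_rows=len(test))`.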


### (Optional) Threshold Tuning
If you want to optimize the 0/1 decision for your blended score, you can try thresholds on a validation split:

- Split `train` into training and validation sets (stratified on `y`).
- For each threshold in `np.linspace(0.1, 0.9, 17)`, compute Accuracy and F1 on the validation set (AUC does not depend on the threshold).
- Pick the threshold that maximizes the blended score.

> Keep it simple first; default 0.5 is already acceptable.
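
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: synthetic data from `make_classification` stands in for `train`, a plain `LogisticRegression` stands in for the selected pipeline, and `best_threshold` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

def best_threshold(y_true, prob, thresholds):
    """Return (threshold, score) maximizing (Accuracy + F1 + AUC) / 3."""
    auc = roc_auc_score(y_true, prob)  # identical for every threshold
    best_thr, best_score = None, -np.inf
    for thr in thresholds:
        pred = (prob >= thr).astype(int)
        score = (accuracy_score(y_true, pred)
                 + f1_score(y_true, pred, zero_division=0)
                 + auc) / 3.0
        if score > best_score:
            best_thr, best_score = float(thr), float(score)
    return best_thr, best_score

# Synthetic, imbalanced stand-in for the training data (assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.89, 0.11], random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
prob_va = clf.predict_proba(X_va)[:, 1]

thresholds = np.linspace(0.1, 0.9, 17)
thr, score = best_threshold(y_va, prob_va, thresholds)
default_score = best_threshold(y_va, prob_va, [0.5])[1]
print(f"best threshold={thr:.2f}  score={score:.4f}  (0.5 gives {default_score:.4f})")
```

Because the threshold is chosen on the same validation split it is scored on, the resulting score is optimistic. Re-checking the chosen threshold on separate folds would be a sensible follow-up.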


In [8]:
# ==== Print versions (for reproducibility in your report) ====
import sys, sklearn
print('Python:', sys.version)
print('pandas:', pd.__version__)
print('numpy:', np.__version__)
print('scikit-learn:', sklearn.__version__)

Python: 3.13.3 (tags/v3.13.3:6280bb5, Apr  8 2025, 14:47:33) [MSC v.1943 64 bit (AMD64)]
pandas: 2.2.3
numpy: 2.2.5
scikit-learn: 1.6.1
