
# Bank Marketing — Classifier Comparison (k-NN, Logistic Regression, Decision Trees, SVM)

**Assignment 17.1 — Practical Application 3**  
This notebook follows CRISP-DM to compare four classifiers on the Portuguese bank marketing dataset (UCI ML Repository).  
Models compared:
- k-Nearest Neighbors (k-NN)  
- Logistic Regression  
- Decision Tree  
- Support Vector Machine (SVM)

**Objective / Business Problem**  
The bank wants to increase subscriptions to a term deposit (`y = "yes"`). We will build and compare classification models to predict which clients are most likely to subscribe after a telemarketing call. Insights will guide campaign targeting to improve conversion rates and reduce costs.


In [None]:

# Core
import os
from pathlib import Path
import numpy as np
import pandas as pd

# Viz
import matplotlib.pyplot as plt

# Preprocessing & Modeling
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import (classification_report, confusion_matrix, ConfusionMatrixDisplay,
                             roc_auc_score, roc_curve, auc, precision_recall_curve,
                             average_precision_score)
from sklearn import set_config

# Models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Inference / Stats
from statsmodels.stats.contingency_tables import mcnemar
import statsmodels.api as sm

# Feature importance (model-agnostic)
from sklearn.inspection import permutation_importance

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
set_config(transform_output="pandas")

# Paths
ROOT = Path("/mnt/data/module17_starter")
DATA = ROOT / "data"
FULL = DATA / "bank-additional-full.csv"
REDUCED = DATA / "bank-additional.csv"
NAMES = DATA / "bank-additional-names.txt"

print(f"Using data at: {FULL if FULL.exists() else REDUCED}")

**Note**: Only the reduced CSV (`bank-additional.csv`) is included in the repo, to avoid committing large files.


## 1. Business & Data Understanding
We load the UCI Bank Marketing dataset. The target variable is `y` (`"yes"` if the client subscribed to a term deposit, otherwise `"no"`).

*If running from the repo folder, this will try the local reduced dataset first for faster execution.*

In [None]:

# Prefer a local reduced CSV inside the repo (small), else fall back to full dataset if available
local_reduced = Path("./data/bank-additional.csv")
if local_reduced.exists():
    csv_path = local_reduced
else:
    csv_path = FULL if FULL.exists() else REDUCED

df = pd.read_csv(csv_path, sep=';')
df.head()


In [None]:

print(df.shape)
df['y'].value_counts(normalize=True)


In [None]:

# Read feature descriptions, if available
if NAMES.exists():
    with open(NAMES, 'r', encoding='utf-8', errors='ignore') as f:
        desc = f.read()
    print(desc[:1500] + "\n...")
else:
    print("Feature names file not found.")



## 2. Exploratory Data Analysis (EDA)
We examine target imbalance, basic stats for numeric variables, and frequency tables for categoricals.


In [None]:

# Seaborn for richer visuals
import seaborn as sns
sns.set_theme()  # default theme


In [None]:

# --- Seaborn Visualizations ---
# 1) Target distribution with seaborn (categorical)
plt.figure()
sns.countplot(x=df['y'])
plt.title("Target distribution: y")
plt.xlabel("Subscription (y)")
plt.ylabel("Count")
plt.show()

# 2) Numeric distributions (subplots)
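# Select numeric columns directly from df so this cell runs on a fresh kernel
num_to_plot = df.select_dtypes(include="number").columns[:6].tolist()  # plot a handful to keep output readable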
fig, axes = plt.subplots(nrows=len(num_to_plot), ncols=1, figsize=(6, 3*len(num_to_plot)))
if len(num_to_plot) == 1:
    axes = [axes]
for ax, col in zip(axes, num_to_plot):
    sns.histplot(data=df, x=col, bins=30, ax=ax)
    ax.set_title(f"Distribution of {col}")
    ax.set_xlabel(col)
    ax.set_ylabel("Count")
plt.tight_layout()
plt.show()

# 3) Boxplot example: duration by outcome
plt.figure()
sns.boxplot(data=df, x='y', y='duration')
plt.title("Call duration by subscription outcome")
plt.xlabel("Subscription (y)")
plt.ylabel("Call duration (seconds)")
plt.show()

# 4) Top categories: job
plt.figure()
top_jobs = df['job'].value_counts().nlargest(10).index
sns.countplot(data=df[df['job'].isin(top_jobs)], y='job')
plt.title("Top 10 job categories")
plt.xlabel("Count")
plt.ylabel("Job")
plt.show()


In [None]:

# Identify feature types
target = 'y'
features = [c for c in df.columns if c != target]
numeric_feats = df[features].select_dtypes(include=['int64','float64']).columns.tolist()
categorical_feats = df[features].select_dtypes(exclude=['int64','float64']).columns.tolist()

len(numeric_feats), len(categorical_feats), numeric_feats[:5], categorical_feats[:5]


In [None]:

# Missingness
df.isna().mean().sort_values(ascending=False).head(15)


In [None]:

# Basic numeric summary
df[numeric_feats].describe().T


In [None]:

# Target distribution
ax = df[target].value_counts().plot.bar(rot=0, title="Target distribution: y")
plt.xlabel("y")
plt.ylabel("Count")
plt.show()


In [None]:

# Correlation (numeric only)
corr = df[numeric_feats].corr()
corr.round(2)



## 3. Data Preparation
We create a preprocessing pipeline:
- **Numeric**: median imputation, standardization  
- **Categorical**: most-frequent imputation, one-hot encoding (ignore unknowns)

We use a **stratified** train/test split to respect class imbalance.
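
As a rough illustration of what stratification buys, the sketch below splits synthetic labels with and without `stratify`. The 11% positive rate and the array names are assumptions made for the demo, not values measured from the bank data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with about 11% positives (illustrative value, not measured from the dataset)
rng = np.random.default_rng(42)
y_demo = (rng.random(5000) < 0.11).astype(int)
X_demo = rng.normal(size=(5000, 3))

# Same split size and seed, with and without stratification
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
_, _, y_tr_p, y_te_p = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(f"stratified: train={y_tr_s.mean():.4f}  test={y_te_s.mean():.4f}")
print(f"plain:      train={y_tr_p.mean():.4f}  test={y_te_p.mean():.4f}")
```

With stratification the train and test positive rates agree almost exactly; the plain split can drift by chance, which matters more as the positive class gets rarer.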


In [None]:

X = df.drop(columns=[target])
y = (df[target] == 'yes').astype(int)  # binary 1/0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

numeric_processor = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_processor = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

preprocess = ColumnTransformer(transformers=[
    ("num", numeric_processor, numeric_feats),
    ("cat", categorical_processor, categorical_feats)
])

X_train.shape, X_test.shape



## 4. Modeling & Hyperparameter Tuning
We compare four models with modest grids (to keep runtime reasonable) using 5-fold stratified CV.
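Grid search selects on **ROC AUC**, which is threshold-independent and does not reward always predicting the majority class. Because ROC AUC can look optimistic when positives are rare, we also report average precision and per-class metrics on the test set.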
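
To show why ROC AUC is paired with other metrics, the sketch below scores the same synthetic classifier at two class ratios. The score distributions and sample sizes are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(42)

def score_metrics(n_pos, n_neg):
    """ROC AUC and average precision for synthetic scores (positives shifted up by one std)."""
    y = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])
    s = np.concatenate([rng.normal(1.0, 1.0, n_pos), rng.normal(0.0, 1.0, n_neg)])
    return roc_auc_score(y, s), average_precision_score(y, s)

# Same separation between classes, very different class ratios
auc_bal, ap_bal = score_metrics(n_pos=2000, n_neg=2000)
auc_imb, ap_imb = score_metrics(n_pos=2000, n_neg=40000)

print(f"balanced:   ROC AUC={auc_bal:.3f}  AP={ap_bal:.3f}")
print(f"imbalanced: ROC AUC={auc_imb:.3f}  AP={ap_imb:.3f}")
```

ROC AUC stays roughly constant across the two settings, while average precision falls as positives become rarer. That is why the evaluation section reports both.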


In [None]:
# Build pipelines and modest hyperparameter grids for cross-validation

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

pipelines = {
    "knn": Pipeline([("prep", preprocess), ("clf", KNeighborsClassifier())]),
    "logreg": Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=2000, class_weight="balanced", solver="saga"))]),
    "tree": Pipeline([("prep", preprocess), ("clf", DecisionTreeClassifier(random_state=RANDOM_STATE, class_weight="balanced"))]),
    "svm": Pipeline([("prep", preprocess), ("clf", SVC(probability=True, class_weight="balanced", random_state=RANDOM_STATE))]),
}

param_grids = {
    "knn": {
        "clf__n_neighbors": [5, 15, 25],
        "clf__weights": ["uniform", "distance"],
        "clf__p": [1, 2]
    },
    "logreg": {
        "clf__C": [0.1, 1.0, 10.0],
        "clf__penalty": ["l1", "l2"],
        "clf__l1_ratio": [None]  # keep simple
    },
    "tree": {
        "clf__max_depth": [None, 6, 10, 16],
        "clf__min_samples_split": [2, 10, 30],
        "clf__min_samples_leaf": [1, 5, 20]
    },
    "svm": {
        "clf__C": [0.5, 1.0, 2.0],
        "clf__kernel": ["rbf"],
        "clf__gamma": ["scale", "auto"]
    }
}

searches = {}
for name, pipe in pipelines.items():
    gs = GridSearchCV(
        estimator=pipe,
        param_grid=param_grids[name],
        scoring="roc_auc",
        cv=cv,
        n_jobs=-1,
        refit=True,
        verbose=1
    )
    gs.fit(X_train, y_train)
    searches[name] = gs
    print(f"{name}: best AUC={gs.best_score_:.4f} | params={gs.best_params_}")



## 5. Evaluation on Test Set
We evaluate the best model from each search on the held-out test set.
Metrics reported:
- Accuracy, Precision, Recall, F1
- ROC AUC, Average Precision (PR AUC)
- Confusion Matrix


In [None]:

def evaluate_model(name, search, X_test, y_test):
    pipe = search.best_estimator_
    y_proba = pipe.predict_proba(X_test)[:, 1]
    y_pred = pipe.predict(X_test)
    roc = roc_auc_score(y_test, y_proba)
    ap = average_precision_score(y_test, y_proba)
    print(f"\n{name.upper()} | ROC AUC={roc:.4f} | AP (PR AUC)={ap:.4f}")
    print(classification_report(y_test, y_pred, digits=3))
    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(cm)
    disp.plot(values_format='d')
    plt.title(f"{name.upper()} — Confusion Matrix")
    plt.show()
    return {
        "name": name,
        "roc_auc": roc,
        "ap": ap,
        "report": classification_report(y_test, y_pred, output_dict=True),
        "y_pred": y_pred,
        "y_proba": y_proba
    }

results = {}
for name, search in searches.items():
    results[name] = evaluate_model(name, search, X_test, y_test)

# Summary table
summary = pd.DataFrame([
    {
        "model": r["name"],
        "roc_auc": r["roc_auc"],
        "ap_pr": r["ap"],
        "precision": r["report"]["1"]["precision"],
        "recall": r["report"]["1"]["recall"],
        "f1": r["report"]["1"]["f1-score"],
        "accuracy": r["report"]["accuracy"]
    } for r in results.values()
]).sort_values(by=["roc_auc","ap_pr"], ascending=False).reset_index(drop=True)

summary



### ROC & Precision–Recall Curves


In [None]:

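# ROC and precision-recall curves for each tuned model, side by side
fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(12, 5))
for name, search in searches.items():
    y_score = search.best_estimator_.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_score)
    precision, recall, _ = precision_recall_curve(y_test, y_score)
    ax_roc.plot(fpr, tpr, label=name.upper())
    ax_pr.plot(recall, precision, label=name.upper())
ax_roc.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Chance")
ax_roc.set(xlabel="False positive rate", ylabel="True positive rate", title="ROC Curves")
ax_pr.set(xlabel="Recall", ylabel="Precision", title="Precision–Recall Curves")
ax_roc.legend()
ax_pr.legend()
plt.tight_layout()
plt.show()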



## 6. Inferential Statistics
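To test whether the top two models (ranked by ROC AUC) differ in their misclassification rates, we run **McNemar's test** on their paired test-set predictions. The test uses only the discordant cases, where exactly one of the two models is correct, so it compares the models' hard predictions rather than their AUC.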
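
For reference, the continuity-corrected McNemar statistic depends only on the two discordant counts, b and c: it is (|b - c| - 1)² / (b + c), compared against a chi-square distribution with one degree of freedom. The sketch below computes it by hand; the function name and the counts 40 and 25 are made up for illustration.

```python
import math

def mcnemar_chi2(b: int, c: int) -> tuple[float, float]:
    """Continuity-corrected McNemar statistic and p-value from the discordant counts.

    b: cases model 1 got right and model 2 got wrong
    c: cases model 1 got wrong and model 2 got right
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

stat, p_value = mcnemar_chi2(b=40, c=25)  # hypothetical discordant counts
print(f"statistic={stat:.4f}, p-value={p_value:.4f}")
```

The concordant cells (both correct, both wrong) do not enter the statistic, so two models with very similar accuracy can still differ significantly if their errors fall on different clients.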


In [None]:

# pick top two by ROC AUC
top2 = summary.head(2)["model"].tolist()
m1, m2 = top2[0], top2[1]
p1 = results[m1]["y_pred"]
p2 = results[m2]["y_pred"]

# contingency table
both_correct = ((p1 == y_test) & (p2 == y_test)).sum()
m1_correct_m2_wrong = ((p1 == y_test) & (p2 != y_test)).sum()
m1_wrong_m2_correct = ((p1 != y_test) & (p2 == y_test)).sum()
both_wrong = ((p1 != y_test) & (p2 != y_test)).sum()

table = [[both_correct, m1_correct_m2_wrong],
         [m1_wrong_m2_correct, both_wrong]]

res = mcnemar(table, exact=False, correction=True)
print("Top two models:", m1, "vs", m2)
print("Contingency table:", table)
print(f"McNemar statistic={res.statistic:.4f}, p-value={res.pvalue:.6f}")



## 7. Feature Importance & Explainability
We inspect model-specific importances (where available) and **permutation importance** for the best model to identify actionable drivers.


In [None]:

# Identify best overall model by ROC AUC
best_name = summary.iloc[0]["model"]
best_search = searches[best_name]
best_pipe = best_search.best_estimator_

# Get feature names after preprocessing
ohe = best_pipe.named_steps["prep"].named_transformers_["cat"].named_steps["onehot"]
cat_out = ohe.get_feature_names_out(best_pipe.named_steps["prep"].transformers_[1][2])
num_out = best_pipe.named_steps["prep"].named_transformers_["num"].get_feature_names_out(best_pipe.named_steps["prep"].transformers_[0][2])
feat_names = np.concatenate([num_out, cat_out])

# Permutation importance on test. The pipeline is scored on raw inputs, so there is
# one importance per original column of X_test (not per one-hot encoded feature).
perm = permutation_importance(best_pipe, X_test, y_test, n_repeats=5, random_state=RANDOM_STATE, scoring="roc_auc")
imp = pd.DataFrame({"feature": X_test.columns, "importance_mean": perm.importances_mean, "importance_std": perm.importances_std})
imp = imp.sort_values("importance_mean", ascending=False).head(20).reset_index(drop=True)
imp


In [None]:

# Plot top permutation importances
ax = imp.sort_values("importance_mean").plot.barh(x="feature", y="importance_mean", legend=False)
plt.title(f"Top Permutation Importances — {best_name.upper()}")
plt.xlabel("Mean decrease in ROC AUC")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()


In [None]:

# Logistic Regression coefficients (if best model is LR)
if best_name == "logreg":
    lr = best_pipe.named_steps["clf"]
    coefs = pd.Series(lr.coef_.ravel(), index=feat_names).sort_values(key=abs, ascending=False).head(20)
    # Print explicitly: a bare expression inside an if-block is not rendered by the notebook
    print(coefs.to_frame("coefficient"))
else:
    print("Best model is not Logistic Regression; skipping coefficient table.")



## 8. Findings (with Actionable Insights)
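- **Class imbalance**: The positive class (`y = yes`) is relatively rare. To account for this we used a stratified split, `class_weight="balanced"` for Logistic Regression, Decision Tree, and SVM (k-NN has no class-weight option), and ROC/PR metrics.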
- **Model performance**: See the **summary table** and **ROC/PR** curves above. The best model by ROC AUC is identified as `best_name` at runtime.
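- **Top drivers**: See the permutation importance chart. Historically important drivers in this dataset include the outcome of the previous campaign (`poutcome`), last contact duration (`duration`), contact type, month, and economic indicators (`emp.var.rate`, `cons.price.idx`, etc.). Note that `duration` is only known once a call has ended; the dataset documentation recommends using it for benchmarking only and dropping it from any model meant to decide whom to call.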
- **Actionable target strategy**: 
  - Prioritize clients contacted via channels and months associated with higher success probabilities.
  - Use predicted probabilities to **rank** leads and allocate agent time to the top deciles.
  - Enforce call **time limits** or early stop rules leveraging the `duration` effect to reduce waste on unlikely conversions.
  - Test contact cadence and offer framing in **A/B** experiments, measuring lift in conversion among top-ranked deciles.
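
The decile ranking mentioned above can be sketched as follows. The helper name `decile_lift` and the synthetic scores are assumptions for illustration; in the notebook the inputs would be `y_test` and a model's predicted probabilities.

```python
import numpy as np

def decile_lift(y_true, y_score, n_bins=10):
    """Positive rate and lift per score bin, highest-scored bin first."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score))  # indices from highest to lowest score
    base_rate = y_true.mean()
    rows = []
    for rank, idx in enumerate(np.array_split(order, n_bins), start=1):
        rate = y_true[idx].mean()
        rows.append({"decile": rank, "rate": rate, "lift": rate / base_rate})
    return rows

# Synthetic example: conversion probability grows with the score
rng = np.random.default_rng(42)
scores = rng.random(10_000)
labels = (rng.random(10_000) < 0.3 * scores).astype(int)

lift_rows = decile_lift(labels, scores)
for row in lift_rows:
    print(f"decile {row['decile']:>2}: rate={row['rate']:.3f}  lift={row['lift']:.2f}")
```

A lift above 1 in the top deciles means agents calling those clients convert more often than they would calling at random, which is the quantity an A/B test on targeting would try to confirm.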
  
## 9. Next Steps & Recommendations
1. **Calibrated probabilities** (`CalibratedClassifierCV`) to improve decision thresholding for operations.
2. **Cost-sensitive optimization**: Define business-specific costs for FP/FN and tune the threshold to maximize expected ROI.
3. **Temporal validation**: Use a time-based split by campaign month to mirror production drift.
4. **Model monitoring**: Track conversion, AUC, and data drift by feature; retrain quarterly or when drift is detected.
5. **Privacy & fairness**: Review sensitive attributes; ensure no disparate impact in targeting.
6. **Deployment**: Export the best pipeline with `joblib` and serve behind an API or batch scoring job.
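
As a minimal sketch of the cost-sensitive step (item 2), the code below picks the score threshold that maximizes profit on a grid. The gain per sale, cost per call, function names, and synthetic data are all hypothetical placeholders; real values would come from the business.

```python
import numpy as np

def expected_profit(y_true, y_score, threshold, gain_per_sale, cost_per_call):
    """Profit from calling every client whose score is at or above the threshold."""
    called = np.asarray(y_score) >= threshold
    sales = np.sum(called & (np.asarray(y_true) == 1))
    return gain_per_sale * sales - cost_per_call * np.sum(called)

def best_threshold(y_true, y_score, gain_per_sale, cost_per_call, grid=None):
    """Grid-search the threshold with the highest profit; returns (threshold, profit)."""
    grid = np.linspace(0.0, 1.0, 101) if grid is None else np.asarray(grid)
    profits = np.array([
        expected_profit(y_true, y_score, t, gain_per_sale, cost_per_call) for t in grid
    ])
    best = int(np.argmax(profits))
    return float(grid[best]), float(profits[best])

# Synthetic, roughly calibrated scores; gain and cost values are made up
rng = np.random.default_rng(42)
scores = rng.random(5000)
labels = (rng.random(5000) < scores).astype(int)

thr, profit = best_threshold(labels, scores, gain_per_sale=20.0, cost_per_call=5.0)
print(f"best threshold={thr:.2f}, profit={profit:.0f}")
```

In practice the threshold should be chosen on a validation fold and confirmed on the test set, since tuning it on the same data it is evaluated on overstates the achievable profit.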



## 10. Reproducibility
- Random seed fixed at 42
- All steps use `sklearn` Pipelines and `ColumnTransformer`
- To run end-to-end: **Kernel → Restart & Run All**


In [None]:

# Utility: upper-case a label safely, even if it is not a string
def UPPER(s):
    return str(s).upper()


> Note: The PR plot cell labels curves with `name.upper()`. If that call raises an error (for example, because a model key is not a string), run the small helper cell above and use `UPPER(name)` instead.


### Interpreting Coefficients (Logistic Regression)
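If Logistic Regression is the top model, positive coefficients increase the log-odds of subscription and negative coefficients decrease them. Numeric features are standardized, so each numeric coefficient is the change in log-odds per one standard deviation.  
The one-hot encoder here keeps every level (no `drop` argument), so there is no omitted base level: read the coefficients within a category relative to one another, not against a baseline. Always consider **magnitude and direction** and validate with **permutation importance**.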
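
Because coefficients live on the log-odds scale, exponentiating one gives an odds ratio, which is often easier to communicate. The helper names and numbers below are illustrative assumptions, not outputs of the fitted model.

```python
import math

def odds_ratio(coef: float, delta: float = 1.0) -> float:
    """Multiplicative change in the odds when a feature increases by `delta`."""
    return math.exp(coef * delta)

def shifted_probability(p0: float, coef: float, delta: float = 1.0) -> float:
    """Probability after the change, starting from baseline probability `p0`."""
    odds = p0 / (1.0 - p0) * odds_ratio(coef, delta)
    return odds / (1.0 + odds)

# Hypothetical coefficient of ln(2): the odds double per unit increase
coef = math.log(2.0)
print(f"odds ratio: {odds_ratio(coef):.2f}")
print(f"10% baseline becomes {shifted_probability(0.10, coef):.4f}")
```

Doubling the odds does not double the probability: a 10% baseline moves to 2/11, or about 18%, which is why odds ratios should be translated back to probabilities before being presented to stakeholders.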
