
# Credit Scoring Project — Starter Notebook (Banking Style)

**Objective:** Build a **logistic regression credit scorecard-style model** to predict **default** (PD).  
**Datasets:** Prefer the *UCI Credit Card Default* (Taiwan) or *German Credit* (UCI).  
**Deliverables:** Clean EDA, compliant preprocessing, baseline logistic model, metrics (AUC/KS/Gini), drift (PSI), and an executive summary.

> Tip: Keep an **audit trail** — record data sources, transformations, versions, and model parameters.
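
One lightweight way to start such an audit trail is to write a small metadata record next to each run. The sketch below is only an illustration: the helper names `file_sha256` and `build_audit_record`, and the record's field names, are hypothetical and not part of any library.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_audit_record(data_path, model_params, notes=""):
    """Collect run metadata into a JSON-serializable dict (hypothetical schema)."""
    data_path = Path(data_path)
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "data_file": str(data_path),
        # The hash is recorded as None when the file is not available.
        "data_sha256": file_sha256(data_path) if data_path.exists() else None,
        "model_params": model_params,
        "notes": notes,
    }


record = build_audit_record(
    "data/credit_default.csv",
    {"model": "LogisticRegression", "solver": "liblinear", "max_iter": 200},
    notes="baseline run",
)
print(json.dumps(record, indent=2))
```

Storing the data hash alongside the parameters makes it possible to check later that a reported metric was produced from the same file.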


In [None]:

# --- 0) Setup -----------------------------------------------------------------
# If running locally, consider creating a new virtual environment.
# Optional installs (uncomment if you have internet & permissions):
# %pip install pandas numpy scikit-learn matplotlib seaborn shap optbinning scorecardpy

import os, sys, math, textwrap, warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, classification_report

# Try optional libraries (if installed)
try:
    import shap
except Exception:
    shap = None

try:
    import optbinning as _optbinning
except Exception:
    _optbinning = None

RANDOM_STATE = 42
pd.set_option("display.max_columns", 200)



## 1) Data Loading

You can use either:
- **UCI Credit Card Default** (recommended): https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients  
- **German Credit**: https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)

The code below first tries to load a local `data/credit_default.csv`.  
If the file is not found, it **attempts** to download a CSV mirror of the UCI default dataset (requires internet).


In [None]:

# --- 1) Data loading -----------------------------------------------------------
from pathlib import Path
DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)

local_path = DATA_DIR / "credit_default.csv"

if local_path.exists():
    df = pd.read_csv(local_path)
else:
    # Attempt a direct download of a preprocessed CSV from a common mirror (schema-compatible)
    # If this fails (no internet), manually place your CSV at data/credit_default.csv
    try:
        url = "https://raw.githubusercontent.com/plotly/datasets/master/credit-card-default.csv"
        df = pd.read_csv(url)
        df.to_csv(local_path, index=False)
        print(f"Downloaded dataset to {local_path}")
    except Exception as e:
        raise SystemExit(f"Could not load data. Place your dataset at {local_path}. Error: {e}")

print(df.shape)
df.head()



## 2) Exploratory Data Analysis (EDA)

Bank-style checks:
- Target distribution (class imbalance).
- Missing values & data types.
- Basic univariate distributions.
- Leakage checks (post-outcome features).
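
A simple screen for the leakage check is to look for single features that separate the target almost perfectly on their own. The sketch below is one possible heuristic, not a complete leakage audit: the helper names, the `0.95` threshold, and the toy column `writeoff_amount` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def univariate_auc(feature, target):
    """Rank-based AUC (Mann-Whitney U) of one numeric feature against a 0/1 target."""
    feature = pd.Series(feature).astype(float)
    target = pd.Series(target).astype(int)
    mask = feature.notna()
    feature, target = feature[mask], target[mask]
    n_pos = int((target == 1).sum())
    n_neg = int((target == 0).sum())
    if n_pos == 0 or n_neg == 0:
        return np.nan
    ranks = feature.rank(method="average")  # ties share their average rank
    u_stat = ranks[target == 1].sum() - n_pos * (n_pos + 1) / 2
    return float(u_stat / (n_pos * n_neg))


def flag_leakage_candidates(frame, target_col, threshold=0.95):
    """Return numeric features whose standalone AUC, in either direction, reaches threshold."""
    rows = []
    for col in frame.select_dtypes(include=[np.number]).columns:
        if col == target_col:
            continue
        auc = univariate_auc(frame[col], frame[target_col])
        strength = np.nan if np.isnan(auc) else max(auc, 1 - auc)
        rows.append((col, strength))
    out = pd.DataFrame(rows, columns=["feature", "auc_strength"])
    return out[out["auc_strength"] >= threshold].reset_index(drop=True)


# Toy data: 'writeoff_amount' is only known after default, so it separates perfectly.
toy = pd.DataFrame({
    "target": [0, 0, 0, 1, 1, 1],
    "age": [25, 40, 31, 38, 29, 45],
    "writeoff_amount": [0, 0, 0, 500, 120, 900],
})
flagged = flag_leakage_candidates(toy, "target")
print(flagged)
```

A flagged feature is not proof of leakage; it is a prompt to check when the field is populated relative to the outcome date.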


In [None]:

# --- 2) EDA -------------------------------------------------------------------
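target_names = ("default", "default.payment.next.month", "y", "target")
target_col_candidates = [c for c in df.columns if c.lower() in target_names]
if not target_col_candidates:
    # Heuristic for the UCI Taiwan dataset mirror
    # The Plotly mirror uses 'default payment next month' or 'default' depending on version
    target_col_candidates = [c for c in df.columns if "default" in c.lower()]
if not target_col_candidates:
    # Fail with a clear message instead of an IndexError on the next line
    raise ValueError(f"No target column found; set target_col manually. Columns: {list(df.columns)}")
target_col = target_col_candidates[0]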

print("Target column:", target_col)
print("\nData types:")
print(df.dtypes)

print("\nMissing values (top 20):")
print(df.isna().sum().sort_values(ascending=False).head(20))

# Target distribution
vc = df[target_col].value_counts(dropna=False).sort_index()
print("\nTarget distribution:")
print(vc, "\nShare:", (vc / len(df)).round(3))

# Simple histogram for a few numeric predictors
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
sample_numeric = numeric_cols[:6]

for col in sample_numeric:
    df[col].hist(bins=40)
    plt.title(f"Histogram: {col}")
    plt.xlabel(col); plt.ylabel("Count")
    plt.show()



## 3) Train/Test Split

We hold out 20% of the data as a test set, stratified on the target, for evaluation on **unseen** records. Optionally, carve a validation set out of the training portion for tuning.
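
If a validation set is wanted, one common pattern is to call `train_test_split` twice. The sketch below uses synthetic data and a 60/20/20 split; the proportions and the variable names (`X_demo`, `X_tr`, `X_va`, `X_te`) are illustrative assumptions, not a recommendation from the source.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))             # synthetic features
y_demo = (rng.random(1000) < 0.2).astype(int)   # roughly 20% positives

# Carve out the 20% test set first, then split the remainder 75/25,
# which yields 60/20/20 overall.
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42, stratify=y_tmp
)

print(len(X_tr), len(X_va), len(X_te))
print(round(y_tr.mean(), 3), round(y_va.mean(), 3), round(y_te.mean(), 3))
```

Stratifying both calls keeps the default rate nearly identical across the three sets, which matters when the positive class is rare.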


In [None]:

# --- 3) Split -----------------------------------------------------------------
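y = df[target_col].astype(int)
X = df.drop(columns=[target_col])
# Drop row identifiers (e.g., 'ID' in the UCI file) so they are not used as predictors
id_cols = [c for c in X.columns if c.strip().lower() == "id"]
X = X.drop(columns=id_cols)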

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

X_train.shape, X_test.shape



## 4) Preprocessing

**Two tracks** (choose based on your environment):

- **(A) Scorecard-style** (preferred for regulated risk): supervised binning (WOE/IV) via `optbinning`, then logistic regression on the WOE-transformed features, with monotonic bins where appropriate.
- **(B) Simpler baseline**: one-hot encode categoricals, standardize numerics, then fit logistic regression.

Below we implement **(B) baseline** (always available). If `optbinning` is installed, the WOE/IV path is sketched in Section 7.
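
One caveat for the baseline track: in the UCI Taiwan file, `SEX`, `EDUCATION`, and `MARRIAGE` are stored as integer codes, so a dtype-based column selection treats them as numeric and scales them instead of one-hot encoding them. The sketch below casts such columns to `category` before splitting; the helper name `cast_coded_categoricals` is hypothetical, and the column list is an assumption to adjust to your schema.

```python
import pandas as pd

# Assumed column names from the UCI Taiwan dataset; adjust to your schema.
CODED_CATEGORICALS = ["SEX", "EDUCATION", "MARRIAGE"]


def cast_coded_categoricals(frame, columns=CODED_CATEGORICALS):
    """Return a copy where the listed columns (if present) use the 'category' dtype."""
    out = frame.copy()
    for col in columns:
        if col in out.columns:
            out[col] = out[col].astype("category")
    return out


demo = pd.DataFrame({
    "SEX": [1, 2, 2],
    "EDUCATION": [1, 2, 3],
    "LIMIT_BAL": [20000, 50000, 90000],
})
demo = cast_coded_categoricals(demo)
print(demo.select_dtypes(include=["category"]).columns.tolist())
print(demo.select_dtypes(include=["number"]).columns.tolist())
```

After the cast, the category columns are picked up by the categorical branch of a dtype-based `ColumnTransformer`, while `LIMIT_BAL` stays numeric.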


In [None]:

# --- 4B) Baseline preprocessing (OHE + scaling) + logistic regression ---------
cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()
num_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()

numeric_transformer = Pipeline(steps=[
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("ohe", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_cols),
        ("cat", categorical_transformer, cat_cols),
    ]
)

baseline_clf = Pipeline(steps=[
    ("prep", preprocessor),
    ("logreg", LogisticRegression(max_iter=200, solver="liblinear"))
])

baseline_clf.fit(X_train, y_train)

y_prob_test = baseline_clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob_test)
print(f"Baseline Logistic AUC: {auc:.3f}")



## 5) Risk Metrics: ROC, KS, Gini, Confusion Matrix

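- **AUC (ROC)**: overall ranking quality; the probability that a random defaulter is scored above a random non-defaulter.  
- **KS**: maximum gap between the score CDFs of defaulters and non-defaulters, equal to `max(TPR - FPR)` (common in credit risk).  
- **Gini**: `2*AUC - 1`.  
- **Threshold selection**: show the confusion matrix at a business-relevant cutoff.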
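
To make the KS definition concrete, the statistic can also be computed directly from the two empirical score CDFs rather than from the ROC curve. This is a minimal sketch with a hypothetical helper name, `ks_from_cdfs`, and toy inputs.

```python
import numpy as np


def ks_from_cdfs(y_true, y_score):
    """Largest absolute gap between the score CDFs of the positive and negative classes."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = np.sort(y_score[y_true == 1])
    neg = np.sort(y_score[y_true == 0])
    if pos.size == 0 or neg.size == 0:
        raise ValueError("Both classes must be present to compute KS.")
    # Evaluate both CDFs at every distinct score value.
    thresholds = np.unique(y_score)
    cdf_pos = np.searchsorted(pos, thresholds, side="right") / pos.size
    cdf_neg = np.searchsorted(neg, thresholds, side="right") / neg.size
    return float(np.max(np.abs(cdf_pos - cdf_neg)))


# Perfect separation: every defaulter scores above every non-defaulter.
print(ks_from_cdfs([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))
# Interleaved scores: the CDFs never differ by more than one half.
print(ks_from_cdfs([0, 1, 0, 1], [0.1, 0.2, 0.3, 0.4]))
```

Evaluating at distinct score values handles tied scores consistently, since tied observations enter both CDFs at the same threshold.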


In [None]:

# --- 5) Metrics & plots -------------------------------------------------------
def ks_statistic(y_true, y_score):
    """KS statistic: the largest gap between TPR and FPR along the ROC curve."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return float(np.max(tpr - fpr))

fpr, tpr, thr = roc_curve(y_test, y_prob_test)
ks = ks_statistic(y_test, y_prob_test)
gini = 2 * roc_auc_score(y_test, y_prob_test) - 1

print(f"AUC:  {roc_auc_score(y_test, y_prob_test):.3f}")
print(f"KS:   {ks:.3f}")
print(f"Gini: {gini:.3f}")

# Plot ROC
plt.figure()
plt.plot(fpr, tpr, label=f"ROC (AUC={roc_auc_score(y_test, y_prob_test):.3f})")
plt.plot([0,1], [0,1], linestyle="--")
plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate"); plt.title("ROC Curve")
plt.legend(loc="lower right")
plt.show()

# Example threshold selection (You can tune this based on business costs)
cut = np.quantile(y_prob_test, 0.8)  # top 20% as 'risky' example
y_pred_test = (y_prob_test >= cut).astype(int)
print("\nConfusion Matrix @ 80th percentile cutoff:")
print(confusion_matrix(y_test, y_pred_test))
print("\nClassification report:")
print(classification_report(y_test, y_pred_test, digits=3))



## 6) Population Stability Index (PSI)

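Compares score distributions between two samples (e.g., **train vs test** or **time A vs time B**). A large PSI may indicate **drift**; a common rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.25 as a moderate shift worth investigating, and above 0.25 as a significant shift.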
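
With bucket shares `e_i` (expected) and `a_i` (actual), the index is `PSI = sum_i (a_i - e_i) * ln(a_i / e_i)`. The sketch below applies that formula to hand-specified shares so the arithmetic is easy to follow; the helper name `psi_from_shares` and the example shares are made up for illustration.

```python
import numpy as np


def psi_from_shares(expected_share, actual_share, eps=1e-10):
    """PSI from per-bucket shares; eps guards against empty buckets."""
    e = np.asarray(expected_share, dtype=float) + eps
    a = np.asarray(actual_share, dtype=float) + eps
    return float(np.sum((a - e) * np.log(a / e)))


expected = [0.25, 0.25, 0.25, 0.25]   # four equal-frequency buckets on the reference sample
shifted = [0.40, 0.30, 0.20, 0.10]    # the comparison sample leans toward the low buckets

psi_same = psi_from_shares(expected, expected)
psi_shift = psi_from_shares(expected, shifted)
print(round(psi_same, 6))
print(round(psi_shift, 4))
```

Every term in the sum is non-negative, because `a_i - e_i` and `ln(a_i / e_i)` always share the same sign, so PSI is zero only when the two distributions match bucket by bucket.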


In [None]:

# --- 6) PSI -------------------------------------------------------------------
def psi(expected, actual, buckets=10):
    """Population Stability Index, with buckets defined by quantiles of `expected`."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior quantile cut points (buckets - 1 of them) give roughly equal-frequency buckets
    quantiles = np.linspace(0, 1, buckets + 1)[1:-1]
    cuts = np.quantile(expected, quantiles)
    expected_bins = np.digitize(expected, cuts)
    actual_bins = np.digitize(actual, cuts)

    def dist(bins, size):
        counts = np.bincount(bins, minlength=buckets)
        perc = counts / size
        return perc + 1e-10  # smooth to avoid div-by-zero

    e_perc = dist(expected_bins, expected.shape[0])
    a_perc = dist(actual_bins, actual.shape[0])

    return np.sum((a_perc - e_perc) * np.log(a_perc / e_perc))

# Compute model score (prob) for train as well
y_prob_train = baseline_clf.predict_proba(X_train)[:, 1]
psi_train_test = psi(y_prob_train, y_prob_test, buckets=10)
print(f"PSI (Train vs Test): {psi_train_test:.4f}")



## 7) (Optional) WOE/IV Binning Path with `optbinning`

If `optbinning` is installed, you can compute **IV** (variable strength) and produce **WOE-transformed** features for a classic **scorecard** workflow.
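
For intuition about what a binning library computes, WOE and IV can be worked out by hand on quantile bins. The sketch below is not `optbinning`'s algorithm: it uses plain `pd.qcut` bins, a hypothetical helper `woe_iv_table`, an assumed smoothing constant `eps`, and the convention WOE = ln(share of non-events / share of events).

```python
import numpy as np
import pandas as pd


def woe_iv_table(feature, target, bins=4, eps=0.5):
    """WOE and IV per quantile bin; eps is added to counts to avoid log(0) (an assumption)."""
    frame = pd.DataFrame({"x": feature, "y": target})
    frame["bin"] = pd.qcut(frame["x"], q=bins, duplicates="drop")
    grouped = frame.groupby("bin", observed=True)["y"].agg(events="sum", total="count")
    grouped["non_events"] = grouped["total"] - grouped["events"]
    n_bins = len(grouped)
    dist_event = (grouped["events"] + eps) / (grouped["events"].sum() + eps * n_bins)
    dist_non_event = (grouped["non_events"] + eps) / (grouped["non_events"].sum() + eps * n_bins)
    grouped["woe"] = np.log(dist_non_event / dist_event)
    grouped["iv"] = (dist_non_event - dist_event) * grouped["woe"]
    return grouped, float(grouped["iv"].sum())


# Synthetic feature whose higher values carry a higher default probability.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
p_default = 1 / (1 + np.exp(-(x - 1.5)))
y = (rng.random(2000) < p_default).astype(int)

table, iv_total = woe_iv_table(x, y)
print(table[["events", "non_events", "woe", "iv"]])
print(f"IV: {iv_total:.3f}")
```

Under this sign convention, bins with a higher default rate get a lower WOE, so WOE should fall as this synthetic feature rises.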


In [None]:

# --- 7) WOE/IV (optional) -----------------------------------------------------
if _optbinning is None:
    print("optbinning not installed; skipping WOE/IV demo.")
else:
    from optbinning import OptimalBinning
    iv_table = []
    for col in num_cols:
        try:
            ob = OptimalBinning(name=col, dtype="numerical", solver="mip", monotonic_trend=None)
            ob.fit(X_train[col].values, y_train.values)
            ob.binning_table.build()  # IV is exposed on the binning table after build()
            iv_table.append((col, ob.binning_table.iv))
        except Exception as e:
            print(f"Skipping {col}: {e}")  # report failures instead of hiding them
    iv_df = pd.DataFrame(iv_table, columns=["feature", "IV"]).sort_values("IV", ascending=False)
    display(iv_df.head(20))



## 8) Explainability (SHAP)

Use **SHAP** values to explain individual predictions and global feature importance (if `shap` is installed).
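
For a plain logistic regression, per-feature contributions in log-odds space have a closed form: if features are treated as independent, each contribution is the coefficient times the feature's deviation from its background mean. The sketch below illustrates that idea on synthetic data without the `shap` library; `linear_contributions` is a hypothetical helper, and the feature matrix is assumed to be already preprocessed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, already-numeric features (stand-in for a preprocessed design matrix).
rng = np.random.default_rng(0)
X_bg = rng.normal(size=(500, 3))
logits = 1.5 * X_bg[:, 0] - 1.0 * X_bg[:, 1]
y_bg = (rng.random(500) < 1 / (1 + np.exp(-logits))).astype(int)

model = LogisticRegression().fit(X_bg, y_bg)


def linear_contributions(fitted_model, X_background, x_row):
    """Per-feature log-odds contributions: coef * (x - background mean)."""
    coef = fitted_model.coef_.ravel()
    return coef * (np.asarray(x_row) - X_background.mean(axis=0))


x_row = X_bg[0]
contrib = linear_contributions(model, X_bg, x_row)

# Base value: log-odds at the background mean (equals the mean log-odds for a linear model).
base_value = float(model.decision_function(X_bg.mean(axis=0, keepdims=True))[0])
row_log_odds = float(model.decision_function(x_row.reshape(1, -1))[0])

# Additivity check: base value plus contributions reproduces the row's log-odds.
print(np.isclose(base_value + contrib.sum(), row_log_odds))
```

This closed form only covers the linear model itself; explaining a full pipeline on raw inputs still calls for a model-agnostic explainer.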


In [None]:

# --- 8) SHAP (optional) -------------------------------------------------------
if shap is None:
    print("SHAP not installed; skipping explainability plots.")
else:
    # Use a sample for speed
    X_test_sample = X_test.sample(min(200, len(X_test)), random_state=RANDOM_STATE)
    # Use predict_proba on the pipeline via a KernelExplainer (model-agnostic)
    def f_model(X_array):
        return baseline_clf.predict_proba(pd.DataFrame(X_array, columns=X_test.columns))[:, 1]

    explainer = shap.KernelExplainer(f_model, X_test_sample, link="logit")
    shap_values = explainer.shap_values(X_test_sample, nsamples=100)
    shap.summary_plot(shap_values, X_test_sample, show=True)



## 9) Model Report (Executive/Risk Style) — Outline

**Business Objective:** Predict probability of default (PD) for credit applicants to improve underwriting & portfolio risk management.  
**Data:** Source, period, key fields, exclusions.  
**Method:** Train/test split, preprocessing steps, logistic regression specs, hyperparameters.  
**Performance:** AUC, KS, Gini, confusion matrix, calibration plots.  
**Stability:** PSI, drift checks, backtesting (if time-split available).  
**Governance:** Assumptions, limitations, fairness checks, documentation of versions.  
**Decisioning:** Example cutoffs, expected approvals/declines, loss impacts.
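
For the decisioning part of the report, a small table of outcomes per cutoff is often enough to start the conversation. The sketch below assumes applicants are approved when their PD score is below the cutoff; the helper name `cutoff_table`, its column names, and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def cutoff_table(y_true, pd_score, cutoffs):
    """For each PD cutoff, approve applicants scoring below it and summarize outcomes."""
    y_true = np.asarray(y_true)
    pd_score = np.asarray(pd_score, dtype=float)
    rows = []
    for cut in cutoffs:
        approved = pd_score < cut
        n_approved = int(approved.sum())
        rows.append({
            "cutoff": cut,
            "approval_rate": n_approved / len(y_true),
            # Share of approved applicants who went on to default.
            "bad_rate_approved": float(y_true[approved].mean()) if n_approved else np.nan,
            # Defaults avoided by declining applicants at or above the cutoff.
            "defaults_declined": int(y_true[~approved].sum()),
        })
    return pd.DataFrame(rows)


y_toy = [0, 0, 0, 1, 0, 1, 1, 0]
score_toy = [0.05, 0.10, 0.15, 0.20, 0.30, 0.55, 0.70, 0.40]
decisions = cutoff_table(y_toy, score_toy, cutoffs=[0.25, 0.50])
print(decisions)
```

Loss impacts can be layered on by multiplying the default counts by an assumed exposure and loss-given-default, which this sketch leaves out.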



## 10) Next Steps

- Add **calibration** (Platt scaling or isotonic).  
- Add **reject inference** (if the development sample contains only accepted applicants, which biases the model).  
- Try **monotonic constraints** (e.g., XGBoost) for regulator-friendly behavior.  
- Productionize: **MLflow** tracking, **Airflow** pipeline, **Docker** packaging.
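
As a starting point for the calibration item, scikit-learn's `CalibratedClassifierCV` wraps an estimator with Platt scaling (`method="sigmoid"`) or isotonic regression (`method="isotonic"`). The sketch below runs on synthetic data from `make_classification`; the sample size, class weights, and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (about 20% positives).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

# The estimator is passed positionally because its keyword name differs across versions.
calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=200), method="sigmoid", cv=5)
calibrated.fit(X_fit, y_fit)

p_hold = calibrated.predict_proba(X_hold)[:, 1]
brier = brier_score_loss(y_hold, p_hold)
print(f"Brier score: {brier:.4f}")
print(f"Mean predicted PD: {p_hold.mean():.3f} vs observed rate: {y_hold.mean():.3f}")
```

Comparing the mean predicted PD with the observed default rate on held-out data is a quick sanity check before drawing a full calibration curve.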
