# **Cross Validation for Logistic Regression**

Maintainer: Zhaohu (Jonathan) Fan. Contact him at psujohnny@gmail.com.

Note: This lab note is still a work in progress (WIP). Let us know if you encounter bugs or issues.





#### *Colab Notebook [Open in Colab](https://colab.research.google.com/drive/1aSnmFlT1KDhhZJQHt5nHC6q22Dnf95qC?usp=sharing)*

#### *Useful information about [Cross Validation for logistic regression (cv.glm)](https://yanyudm.github.io/Data-Mining-R/lecture/5.B_CrossValidationLogit.html)*



# 1 Cross Validation

In this section, we reproduce several cross-validation setups and scoring rules:

- A custom **symmetric** misclassification cost  
- An **asymmetric** misclassification cost with a **5:1** penalty ratio (**FN:FP**)  
- **AUC** as the evaluation score  
- **10-fold cross-validation** on the **Credit Card Default** dataset
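
All of these setups share the same k-fold mechanics: the data are split into k folds, and each fold serves as the test set exactly once while the model is fit on the rest. The sketch below illustrates this on a tiny made-up label vector (`y_toy` and `X_toy` are placeholders for illustration, not the credit data), using 5 folds to keep the example small.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 10 negatives and 5 positives (made up for illustration)
y_toy = np.array([0] * 10 + [1] * 5)
X_toy = np.zeros((len(y_toy), 1))  # placeholder features; only the labels drive the split

cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

all_test_idx = []
positives_per_fold = []
for fold, (train_idx, test_idx) in enumerate(cv_toy.split(X_toy, y_toy), start=1):
    all_test_idx.extend(test_idx.tolist())
    positives_per_fold.append(int(y_toy[test_idx].sum()))
    print(f"Fold {fold}: test size = {len(test_idx)}, positives in test = {positives_per_fold[-1]}")

# Every observation lands in a test fold exactly once
print(sorted(all_test_idx) == list(range(len(y_toy))))
```

Stratification keeps the share of positives roughly equal across folds, which matters when the positive class (here, default) is the minority.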





### 1.1 A custom **symmetric** misclassification cost










In [4]:
import numpy as np

def symmetric_misclass_cost(y_true, y_prob, threshold=0.5):
    """
    Symmetric misclassification cost (0–1 loss), using a probability cutoff.

    This is the symmetric counterpart to asymmetric_cost_5_to_1(), with the same inputs:
    - y_true: true labels (0/1)
    - y_prob: predicted probabilities P(Y=1)
    - threshold: classification cutoff (default = 0.5)

    Output
    - Misclassification rate (average 0–1 loss)
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)

    y_pred = (y_prob >= threshold).astype(int)
    return np.mean(y_pred != y_true)

threshold = 0.5
sym_costs = []
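
As a quick sanity check, the function can be applied to a small hand-made example. The arrays `y_true_toy` and `y_prob_toy` below are made up for illustration and are not taken from the credit data; the function is repeated so the snippet runs on its own.

```python
import numpy as np

def symmetric_misclass_cost(y_true, y_prob, threshold=0.5):
    """Misclassification rate (average 0-1 loss) at a given probability cutoff."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    return np.mean(y_pred != y_true)

# Toy data (hypothetical): five observations
y_true_toy = [1, 0, 1, 0, 1]
y_prob_toy = [0.9, 0.6, 0.4, 0.2, 0.7]

# At cutoff 0.5 the predictions are [1, 1, 0, 0, 1]:
# one false positive (2nd obs) and one false negative (3rd obs)
toy_cost = symmetric_misclass_cost(y_true_toy, y_prob_toy)
print(toy_cost)  # 2 errors out of 5 observations -> 0.4
```

Both kinds of error count equally here, so the cost is simply the share of misclassified observations.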


### 1.2 An **asymmetric** misclassification cost with a **5:1** penalty ratio (**FN:FP**)


In [7]:
def asymmetric_cost_5_to_1(y_true, y_prob, threshold=0.5, fn_cost=5.0, fp_cost=1.0):
    """
    Asymmetric misclassification cost with FN:FP = 5:1 by default.
    - y_true: true labels (0/1)
    - y_prob: predicted probability of class 1
    - threshold: classification cutoff (0.5 is the standard default)
    Returns: average cost per observation
    """
    # Convert to arrays so that lists and pandas Series are handled the same way
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)

    y_pred = (y_prob >= threshold).astype(int)
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    return (fn_cost * fn + fp_cost * fp) / len(y_true)
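
To see how the 5:1 weighting behaves, the sketch below applies the cost to a small made-up example at two cutoffs. The toy arrays are hypothetical, and the lower cutoff `1 / (5 + 1)` is an assumption chosen for illustration (a cutoff that reflects the cost ratio); the cells in this note use 0.5.

```python
import numpy as np

def asymmetric_cost_5_to_1(y_true, y_prob, threshold=0.5, fn_cost=5.0, fp_cost=1.0):
    """Average misclassification cost with FN:FP = 5:1 by default."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    return (fn_cost * fn + fp_cost * fp) / len(y_true)

# Toy data (hypothetical): five observations
y_true_toy = [1, 0, 1, 0, 1]
y_prob_toy = [0.9, 0.6, 0.4, 0.2, 0.7]

# Cutoff 0.5: one FN (cost 5) and one FP (cost 1) -> (5 + 1) / 5
cost_default = asymmetric_cost_5_to_1(y_true_toy, y_prob_toy, threshold=0.5)

# Cutoff 1/6 (assumed for illustration): everything is predicted positive,
# so there are no FNs and two FPs -> 2 / 5
cost_low = asymmetric_cost_5_to_1(y_true_toy, y_prob_toy, threshold=1 / 6)

print(cost_default)  # 1.2
print(cost_low)      # 0.4
```

On this toy example, lowering the cutoff trades the one expensive false negative for an extra cheap false positive, which reduces the average cost.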



### 1.3 AUC as the Evaluation Score

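In scikit-learn, we compute the AUC (area under the ROC curve) directly with `roc_auc_score`, which takes the true labels and the predicted probabilities, so no classification cutoff is needed.  
An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation; a higher AUC indicates better classification performance.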
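
A minimal sketch of the call, using a small made-up example (the labels and scores below are hypothetical, not from the credit data). It also shows that AUC depends only on how the scores rank the observations, not on their scale.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (hypothetical)
y_true_toy = np.array([0, 0, 1, 1])
y_prob_toy = np.array([0.1, 0.4, 0.35, 0.8])

# 3 of the 4 (positive, negative) pairs are ranked correctly
auc_toy = roc_auc_score(y_true_toy, y_prob_toy)
print(auc_toy)  # 0.75

# Squaring the scores keeps their order, so the AUC does not change
auc_rescaled = roc_auc_score(y_true_toy, y_prob_toy ** 2)
print(auc_rescaled)  # 0.75
```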


### 1.4 10-Fold Cross-Validation on the Credit Card Default Dataset


In [11]:
# ------------------------------------------------------------
# 10-fold CV: Symmetric misclassification cost + AUC (side-by-side)
# ------------------------------------------------------------

import numpy as np
import pandas as pd

from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# -----------------------------
# 1) Load and prepare the data
# -----------------------------
url = "https://yanyudm.github.io/Data-Mining-R/lecture/data/credit_default.csv"
credit_data = pd.read_csv(url)

# Rename target to "default" (same as the R code)
credit_data = credit_data.rename(columns={"default.payment.next.month": "default"})

# Treat categorical fields as factors (one-hot encoding in Python)
cat_cols = ["SEX", "EDUCATION", "MARRIAGE"]
credit_data[cat_cols] = credit_data[cat_cols].astype("category")

# Separate X and y
y = credit_data["default"].astype(int).values
X = credit_data.drop(columns=["default"])

# One-hot encode categorical variables (drop_first=True helps avoid perfect collinearity)
X = pd.get_dummies(X, columns=cat_cols, drop_first=True)

print("Data shape after one-hot encoding:", X.shape)

# -----------------------------
# 2) Define the model
# -----------------------------
model = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=5000, solver="lbfgs"))
])

# -----------------------------
# 3) Define CV and scoring rules
# -----------------------------
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)

def symmetric_misclass_cost(y_true, y_prob, threshold=0.5):
    """
    Symmetric misclassification cost (0–1 loss), using a probability cutoff.

    This is the symmetric counterpart to asymmetric_cost_5_to_1(), with the same inputs:
    - y_true: true labels (0/1)
    - y_prob: predicted probabilities P(Y=1)
    - threshold: classification cutoff (default = 0.5)

    Output
    - Misclassification rate (average 0–1 loss)
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)

    y_pred = (y_prob >= threshold).astype(int)
    return np.mean(y_pred != y_true)

threshold = 0.5
sym_costs = []
aucs = []

# -----------------------------
# 4) Run 10-fold cross validation
# -----------------------------
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y), start=1):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    model.fit(X_train, y_train)
    y_prob = model.predict_proba(X_test)[:, 1]  # predicted P(Y=1)

    # (1) Symmetric misclassification cost
    sym_cost_fold = symmetric_misclass_cost(y_test, y_prob, threshold=threshold)
    sym_costs.append(sym_cost_fold)

    # (2) AUC
    auc_fold = roc_auc_score(y_test, y_prob)
    aucs.append(auc_fold)

    print(f"Fold {fold:>2}: SymCost = {sym_cost_fold:.6f} | AUC = {auc_fold:.6f}")

# -----------------------------
# 5) Summarize results
# -----------------------------
sym_costs = np.array(sym_costs)
aucs = np.array(aucs)

print("\n10-Fold CV Summary")
print(f"Symmetric cost (threshold={threshold}): mean = {sym_costs.mean():.7f}, std = {sym_costs.std():.7f}")
print(f"AUC:                               mean = {aucs.mean():.7f}, std = {aucs.std():.7f}")


Data shape after one-hot encoding: (12000, 26)
Fold  1: SymCost = 0.177500 | AUC = 0.711359
Fold  2: SymCost = 0.204167 | AUC = 0.687438
Fold  3: SymCost = 0.177500 | AUC = 0.737070
Fold  4: SymCost = 0.194167 | AUC = 0.743255
Fold  5: SymCost = 0.185833 | AUC = 0.738754
Fold  6: SymCost = 0.186667 | AUC = 0.702493
Fold  7: SymCost = 0.197500 | AUC = 0.715178
Fold  8: SymCost = 0.173333 | AUC = 0.744468
Fold  9: SymCost = 0.171667 | AUC = 0.748057
Fold 10: SymCost = 0.195000 | AUC = 0.725460

10-Fold CV Summary
Symmetric cost (threshold=0.5): mean = 0.1863333, std = 0.0105817
AUC:                               mean = 0.7253533, std = 0.0194365


In [None]:
# ------------------------------------------------------------
# Credit Card Default dataset: 10-fold CV for
# (1) Asymmetric misclassification cost (5:1 FN:FP)
# (2) AUC as the scoring metric
#
# R reference:
# credit_glm1 <- glm(default~., family=binomial, data=credit_data)
# cv.glm(..., cost=costfunc,  K=10)   # asymmetric 5:1
# cv.glm(..., cost=costfunc1, K=10)   # AUC
# ------------------------------------------------------------

import numpy as np
import pandas as pd

from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# -----------------------------
# 1) Load and prepare the data
# -----------------------------
url = "https://yanyudm.github.io/Data-Mining-R/lecture/data/credit_default.csv"
credit_data = pd.read_csv(url)

# Rename target to "default" (same as the R code)
credit_data = credit_data.rename(columns={"default.payment.next.month": "default"})

# Treat categorical fields as factors (one-hot encoding in Python)
cat_cols = ["SEX", "EDUCATION", "MARRIAGE"]
credit_data[cat_cols] = credit_data[cat_cols].astype("category")

# Separate X and y
y = credit_data["default"].astype(int).values
X = credit_data.drop(columns=["default"])

# One-hot encode categorical variables (drop_first=True helps avoid perfect collinearity)
X = pd.get_dummies(X, columns=cat_cols, drop_first=True)

print("Data shape after one-hot encoding:", X.shape)

# -----------------------------
# 2) Define the model
# -----------------------------
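# Logistic regression is the Python analogue of a binomial GLM in R.
# Note: scikit-learn applies L2 regularization by default (C=1.0), so the fit
# can differ slightly from R's unpenalized glm().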
model = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=5000, solver="lbfgs"))
])

# -----------------------------
# 3) Define CV and cost function
# -----------------------------
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)

def asymmetric_cost_5_to_1(y_true, y_prob, threshold=0.5, fn_cost=5.0, fp_cost=1.0):
    """
    Asymmetric misclassification cost with FN:FP = 5:1 by default.

    Inputs
    - y_true: true labels (0/1)
    - y_prob: predicted probabilities P(Y=1)
    - threshold: classification cutoff (default = 0.5)
    - fn_cost: cost weight for a false negative (default = 5)
    - fp_cost: cost weight for a false positive (default = 1)

    Output
    - Average misclassification cost per observation
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)

    y_pred = (y_prob >= threshold).astype(int)
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))

    return (fn_cost * fn + fp_cost * fp) / len(y_true)

threshold = 0.5
fn_cost = 5.0
fp_cost = 1.0

# -----------------------------
# 4) Run 10-fold cross validation
# -----------------------------
asym_costs = []
aucs = []

for fold, (train_idx, test_idx) in enumerate(cv.split(X, y), start=1):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    model.fit(X_train, y_train)
    y_prob = model.predict_proba(X_test)[:, 1]  # predicted P(Y=1)

    # (1) Asymmetric misclassification cost (5:1 FN:FP)
    cost_fold = asymmetric_cost_5_to_1(
        y_test, y_prob,
        threshold=threshold,
        fn_cost=fn_cost,
        fp_cost=fp_cost
    )
    asym_costs.append(cost_fold)

    # (2) AUC
    auc_fold = roc_auc_score(y_test, y_prob)
    aucs.append(auc_fold)

    print(f"Fold {fold:>2}: AsymCost(5:1) = {cost_fold:.6f} | AUC = {auc_fold:.6f}")

# -----------------------------
# 5) Summarize results
# -----------------------------
asym_costs = np.array(asym_costs)
aucs = np.array(aucs)

print("\n10-Fold CV Summary")
print(f"Asymmetric cost (FN:FP = {int(fn_cost)}:{int(fp_cost)}): mean = {asym_costs.mean():.7f}, std = {asym_costs.std():.7f}")
print(f"AUC:                               mean = {aucs.mean():.7f}, std = {aucs.std():.7f}")


Data shape after one-hot encoding: (12000, 26)
Fold  1: AsymCost(5:1) = 0.807500 | AUC = 0.711359
Fold  2: AsymCost(5:1) = 0.894167 | AUC = 0.687438
Fold  3: AsymCost(5:1) = 0.817500 | AUC = 0.737070
Fold  4: AsymCost(5:1) = 0.864167 | AUC = 0.743255
Fold  5: AsymCost(5:1) = 0.855833 | AUC = 0.738754
Fold  6: AsymCost(5:1) = 0.863333 | AUC = 0.702493
Fold  7: AsymCost(5:1) = 0.880833 | AUC = 0.715178
Fold  8: AsymCost(5:1) = 0.793333 | AUC = 0.744468
Fold  9: AsymCost(5:1) = 0.778333 | AUC = 0.748057
Fold 10: AsymCost(5:1) = 0.901667 | AUC = 0.725460

10-Fold CV Summary
Asymmetric cost (FN:FP = 5:1): mean = 0.8456667, std = 0.0412375
AUC:                               mean = 0.7253533, std = 0.0194365
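
The manual loops above can also be expressed with scikit-learn's `cross_validate`, which accepts custom scorers written as callables with the signature `(estimator, X, y)`. The sketch below is one possible way to do this, not the approach used in the cells above. To stay self-contained it uses synthetic data from `make_classification` as a stand-in for the credit data, and the names `X_demo`, `y_demo`, `demo_model`, and `asym_cost_scorer` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration, not the credit data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)

demo_model = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=5000)),
])

def asym_cost_scorer(estimator, X, y, threshold=0.5, fn_cost=5.0, fp_cost=1.0):
    """Average 5:1 (FN:FP) cost of a fitted estimator on held-out data."""
    y = np.asarray(y)
    y_prob = estimator.predict_proba(X)[:, 1]
    y_pred = (y_prob >= threshold).astype(int)
    fn = np.sum((y == 1) & (y_pred == 0))
    fp = np.sum((y == 0) & (y_pred == 1))
    return (fn_cost * fn + fp_cost * fp) / len(y)

cv_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)

# One entry per metric; results are returned under "test_<name>"
results = cross_validate(
    demo_model, X_demo, y_demo, cv=cv_demo,
    scoring={"asym_cost": asym_cost_scorer, "auc": "roc_auc"},
)

asym_costs_demo = results["test_asym_cost"]
aucs_demo = results["test_auc"]
print(f"Asymmetric cost: mean = {asym_costs_demo.mean():.4f}, std = {asym_costs_demo.std():.4f}")
print(f"AUC:             mean = {aucs_demo.mean():.4f}, std = {aucs_demo.std():.4f}")
```

Note that scikit-learn treats scorer values as "higher is better" when they are used for model selection (for example in `GridSearchCV`), so a cost like this one would need its sign flipped in that setting. For reporting fold-by-fold values, as here, the raw cost is fine.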




