# AAE 722 — Lab 4: Default dataset (ISLP)

## Gary Sun

### Q1. Load & inspect data; fit Logistic Regression on full data

In [1]:
# Q1: Load Default dataset; inspect and fit logistic regression on full data
print("Q1: Load Default dataset; inspect and fit logistic regression on full data")

import numpy as np
import pandas as pd

# Statsmodels for logistic regression with interpretable coefficients
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Try to import ISLP; install if missing
try:
    from ISLP import load_data
except ImportError:
    import sys, subprocess
    print("Installing ISLP ...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "ISLP"])
    from ISLP import load_data

# 1) Load data
Default = load_data("Default").copy()

# 2) Basic structure
n_rows, n_cols = Default.shape
dtypes = Default.dtypes.astype(str)

# 3) Distribution of 'default' (Yes/No)
dist_default = Default["default"].value_counts().rename_axis("default").to_frame("count")
dist_default["proportion"] = (dist_default["count"] / len(Default)).round(4)

# Show structure
print(f"Dataset dimensions: {n_rows} rows x {n_cols} columns\n")
print("Column names and dtypes:")
display(pd.DataFrame({"dtype": dtypes}))

print("\nDistribution of 'default':")
display(dist_default)

# 4) Prepare outcome as 0/1 for modeling clarity
# Keep a copy of original string label for counts, but map to 1/0 for models we control.
Default["default_bin"] = (Default["default"] == "Yes").astype(int)

# 5) Logistic regression on full data using statsmodels (log-odds interpretation)
# student is categorical ("Yes"/"No") -> use it as a factor via formula (patsy will create a dummy)
# We will feed the binary response explicitly
Default_for_logit = Default.rename(columns={"default_bin": "y"})
logit_mod = smf.logit("y ~ income + balance + C(student)", data=Default_for_logit).fit(disp=False)

print("\nStatsmodels Logit (entire dataset) — coefficients:")
display(pd.DataFrame({
    "coef": logit_mod.params,
    "std err": logit_mod.bse,
    "z": logit_mod.tvalues,
    "p>|z|": logit_mod.pvalues
}).round(6))

# 6) Report & interpret the coefficient for balance
beta_balance = logit_mod.params.get("balance", np.nan)
print(f"\nCoefficient for balance (log-odds scale): {beta_balance:.6f}")

print("""
Interpretation:
Holding income and student status fixed, a one-unit increase in 'balance' changes the log-odds of default by the balance coefficient.
If balance is measured in dollars, then an additional $1 of balance multiplies the odds of default by exp(beta_balance).
For example:
""")
print(f"Odds multiplier for +$100 in balance: exp(100 * beta_balance) = {np.exp(100*beta_balance):.3f}")


Q1: Load Default dataset; inspect and fit logistic regression on full data
Dataset dimensions: 10000 rows x 4 columns

Column names and dtypes:


| | dtype |
|---|---|
| default | category |
| student | category |
| balance | float64 |
| income | float64 |



Distribution of 'default':


| default | count | proportion |
|---|---|---|
| No | 9667 | 0.9667 |
| Yes | 333 | 0.0333 |



Statsmodels Logit (entire dataset) — coefficients:


| | coef | std err | z | p>\|z\| |
|---|---|---|---|---|
| Intercept | -10.869045 | 0.492273 | -22.07932 | 0.0 |
| C(student)[T.Yes] | -0.646776 | 0.236257 | -2.737595 | 0.006189 |
| income | 3e-06 | 8e-06 | 0.369808 | 0.711525 |
| balance | 0.005737 | 0.000232 | 24.736506 | 0.0 |



Coefficient for balance (log-odds scale): 0.005737

Interpretation:
Holding income and student status fixed, a one-unit increase in 'balance' changes the log-odds of default by the balance coefficient.
If balance is measured in dollars, then an additional $1 of balance multiplies the odds of default by exp(beta_balance).
For example:

Odds multiplier for +$100 in balance: exp(100 * beta_balance) = 1.775


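The coefficient of `balance` is on the log-odds scale. With β_balance = b ≈ 0.005737, each additional unit of balance (one dollar, if balance is measured in dollars) multiplies the odds of default by exp(b) ≈ 1.0058, holding income and student status fixed.

For a more meaningful increment such as $100, the multiplier is exp(100 * b) ≈ 1.775, which is roughly a 77% increase in the odds of default.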
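
The arithmetic behind that interpretation can be reproduced with a few lines of standard-library Python. This is a minimal sketch: the coefficient value is copied from the Q1 output above, and `odds_multiplier` is a hypothetical helper name introduced only for illustration.

```python
import math

# Balance coefficient copied from the Q1 output (log-odds scale).
beta_balance = 0.005737


def odds_multiplier(beta: float, delta: float) -> float:
    """Factor by which the odds change when the predictor rises by `delta` units.

    Hypothetical helper for illustration; not part of statsmodels.
    """
    return math.exp(beta * delta)


# Print the multiplier for a few increments of balance.
for delta in (1, 100, 500):
    mult = odds_multiplier(beta_balance, delta)
    print(f"+{delta} in balance multiplies the odds of default by {mult:.3f}")
```

Because the model is linear in the log-odds, the multiplier for a larger increment is the per-unit multiplier raised to that power, which is why scaling inside the exponent is enough.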

### Q2. Train/Test split; LDA & QDA on (income, balance)

In [2]:
# Q2: Split data; fit LDA and QDA; report class means/priors (LDA) and test confusion/accuracy for both
print("Q2: Split data; fit LDA and QDA; report class means/priors and test performance")

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

# Use only income and balance as predictors, target as 0/1
X = Default[["income", "balance"]].to_numpy()
y = Default["default_bin"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# LDA
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Class means (row per class 0='No', 1='Yes'), priors
lda_means = pd.DataFrame(lda.means_, columns=["income", "balance"], index=["class 0 (No)", "class 1 (Yes)"])
lda_priors = pd.Series(lda.priors_, index=["class 0 (No)", "class 1 (Yes)"], name="prior")

print("\nLDA class means:")
display(lda_means.round(2))
print("\nLDA class priors:")
display(lda_priors.round(4))

y_pred_lda = lda.predict(X_test)
cm_lda = confusion_matrix(y_test, y_pred_lda, labels=[0,1])
acc_lda = accuracy_score(y_test, y_pred_lda)

print("\nLDA Confusion Matrix (rows=true, cols=pred; order [No, Yes]):")
display(pd.DataFrame(cm_lda, index=["True No","True Yes"], columns=["Pred No","Pred Yes"]))
print(f"LDA test accuracy: {acc_lda:.4f}")

# QDA
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)

y_pred_qda = qda.predict(X_test)
cm_qda = confusion_matrix(y_test, y_pred_qda, labels=[0,1])
acc_qda = accuracy_score(y_test, y_pred_qda)

print("\nQDA Confusion Matrix (rows=true, cols=pred; order [No, Yes]):")
display(pd.DataFrame(cm_qda, index=["True No","True Yes"], columns=["Pred No","Pred Yes"]))
print(f"QDA test accuracy: {acc_qda:.4f}")


Q2: Split data; fit LDA and QDA; report class means/priors and test performance

LDA class means:


| | income | balance |
|---|---|---|
| class 0 (No) | 33531.0 | 806.22 |
| class 1 (Yes) | 31626.77 | 1744.36 |



LDA class priors:


class 0 (No)     0.9667
class 1 (Yes)    0.0333
Name: prior, dtype: float64


LDA Confusion Matrix (rows=true, cols=pred; order [No, Yes]):


| | Pred No | Pred Yes |
|---|---|---|
| True No | 2891 | 9 |
| True Yes | 74 | 26 |


LDA test accuracy: 0.9723

QDA Confusion Matrix (rows=true, cols=pred; order [No, Yes]):


| | Pred No | Pred Yes |
|---|---|---|
| True No | 2886 | 14 |
| True Yes | 71 | 29 |


QDA test accuracy: 0.9717


### Q3. Naive Bayes (GaussianNB) on same split; test CM & accuracy; predict_proba at (income=40000, balance=2000)

In [3]:
# Q3: GaussianNB; compare with LDA/QDA; predict_proba for given point
print("Q3: Naive Bayes (GaussianNB); test confusion, accuracy, compare; predict_proba for (40000, 2000)")

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred_gnb = gnb.predict(X_test)
cm_gnb = confusion_matrix(y_test, y_pred_gnb, labels=[0,1])
acc_gnb = accuracy_score(y_test, y_pred_gnb)

print("\nGaussianNB Confusion Matrix (rows=true, cols=pred; order [No, Yes]):")
display(pd.DataFrame(cm_gnb, index=["True No","True Yes"], columns=["Pred No","Pred Yes"]))
print(f"GaussianNB test accuracy: {acc_gnb:.4f}")

# Predict probability for specified customer
pt = np.array([[40000, 2000]], dtype=float)
prob_default = gnb.predict_proba(pt)[0, 1]  # probability of class 1 = default
print(f"\nPredicted probability of default at income=40000, balance=2000: {prob_default:.4f}")


Q3: Naive Bayes (GaussianNB); test confusion, accuracy, compare; predict_proba for (40000, 2000)

GaussianNB Confusion Matrix (rows=true, cols=pred; order [No, Yes]):


| | Pred No | Pred Yes |
|---|---|---|
| True No | 2883 | 17 |
| True Yes | 73 | 27 |


GaussianNB test accuracy: 0.9700

Predicted probability of default at income=40000, balance=2000: 0.4994


### Q4. Standardize features; KNN with k = 1, 3, 5, 10

In [4]:
# Q4: Standardize income/balance; fit KNN with k in {1,3,5,10}; summarize test accuracy
print("Q4: Standardize features; KNN k={1,3,5,10}; test accuracy table")

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

results_knn = []

for k in [1, 3, 5, 10]:
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    results_knn.append({"k": k, "test_accuracy": acc})

knn_table = pd.DataFrame(results_knn).sort_values("k").reset_index(drop=True)
display(knn_table)

best_row = knn_table.loc[knn_table["test_accuracy"].idxmax()]  # idxmax returns an index label, so use .loc
best_k = int(best_row["k"])
best_k_acc = float(best_row["test_accuracy"])
print(f"\nBest KNN: k={best_k} with test accuracy={best_k_acc:.4f}")

print("""
Why small k (e.g., k=1) may not be optimal:
- k=1 perfectly follows the nearest training point, which makes the decision boundary very jagged.
- It has very low bias but very high variance and is sensitive to noise/outliers.
- A moderate k usually gives better generalization on unseen data.
""")


Q4: Standardize features; KNN k={1,3,5,10}; test accuracy table


| | k | test_accuracy |
|---|---|---|
| 0 | 1 | 0.958333 |
| 1 | 3 | 0.967333 |
| 2 | 5 | 0.968 |
| 3 | 10 | 0.970333 |



Best KNN: k=10 with test accuracy=0.9703

Why small k (e.g., k=1) may not be optimal:
- k=1 perfectly follows the nearest training point, which makes the decision boundary very jagged.
- It has very low bias but very high variance and is sensitive to noise/outliers.
- A moderate k usually gives better generalization on unseen data.
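
The sensitivity of k=1 to a single noisy point can be seen in a toy example. The sketch below uses a hypothetical one-dimensional training set (not the Default data) with one mislabeled point, and a hand-written `knn_predict` helper rather than scikit-learn, so every name and number in it is an illustrative assumption.

```python
from collections import Counter

# Hypothetical 1-D training set: class 0 on the left, class 1 on the right,
# plus one mislabeled (noisy) point at x=2.0 inside the class-0 region.
train = [(1.0, 0), (1.5, 0), (2.0, 1), (2.5, 0), (3.0, 0),
         (6.0, 1), (6.5, 1), (7.0, 1)]


def knn_predict(x: float, k: int) -> int:
    """Majority vote among the k training points closest to x."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


query = 2.1  # sits right next to the noisy point
for k in (1, 3, 5):
    print(f"k={k}: predicted class {knn_predict(query, k)}")
```

With k=1 the prediction copies the noisy neighbor's label, while k=3 and k=5 outvote it with the surrounding class-0 points. That is the variance reduction the bullets describe.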



### Q5. Summary across all methods; false negative rates; cost-sensitive recommendation; adjust threshold to 0.3

In [5]:
# Q5: Summary table; FNR; cost-sensitive recommendation; threshold 0.3 effect
print("Q5: Summary across methods; FNR; cost-sensitive choice; threshold adjustment to 0.3")

from sklearn.linear_model import LogisticRegression

def metrics_from_cm(cm):
    # cm order: [[TN, FP],[FN, TP]]
    TN, FP, FN, TP = cm.ravel()
    acc = (TP + TN) / cm.sum()
    fnr = FN / (FN + TP) if (FN + TP) > 0 else np.nan   # miss rate
    fpr = FP / (FP + TN) if (FP + TN) > 0 else np.nan
    return TN, FP, FN, TP, acc, fnr, fpr

summary = []

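# Refit logistic regression on the training split (sklearn, for a consistent predict API).
# Caveat: unlike the statsmodels fit in Q1, this estimator applies L2 regularization by default,
# and liblinear can converge poorly on unscaled features such as income. A degenerate fit that
# predicts "No" for every test case likely reflects this setup, not logistic regression itself.
log_clf = LogisticRegression(solver="liblinear")  # binary response, unscaled features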
log_clf.fit(X_train, y_train)
y_pred_log = log_clf.predict(X_test)
cm_log = confusion_matrix(y_test, y_pred_log, labels=[0,1])
TN, FP, FN, TP, acc, fnr, fpr = metrics_from_cm(cm_log)
summary.append({"model":"Logistic (train split)","accuracy":acc,"FNR":fnr,"FPR":fpr,"TN":TN,"FP":FP,"FN":FN,"TP":TP})

# LDA (from Q2)
TN, FP, FN, TP, acc, fnr, fpr = metrics_from_cm(cm_lda)
summary.append({"model":"LDA","accuracy":acc,"FNR":fnr,"FPR":fpr,"TN":TN,"FP":FP,"FN":FN,"TP":TP})

# QDA (from Q2)
TN, FP, FN, TP, acc, fnr, fpr = metrics_from_cm(cm_qda)
summary.append({"model":"QDA","accuracy":acc,"FNR":fnr,"FPR":fpr,"TN":TN,"FP":FP,"FN":FN,"TP":TP})

# GaussianNB (from Q3)
TN, FP, FN, TP, acc, fnr, fpr = metrics_from_cm(cm_gnb)
summary.append({"model":"GaussianNB","accuracy":acc,"FNR":fnr,"FPR":fpr,"TN":TN,"FP":FP,"FN":FN,"TP":TP})

# Best KNN (from Q4) — refit and evaluate to capture CM
best_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
best_knn.fit(X_train, y_train)
y_pred_knn = best_knn.predict(X_test)
cm_knn = confusion_matrix(y_test, y_pred_knn, labels=[0,1])
TN, FP, FN, TP, acc, fnr, fpr = metrics_from_cm(cm_knn)
summary.append({"model":f"KNN(k={best_k})","accuracy":acc,"FNR":fnr,"FPR":fpr,"TN":TN,"FP":FP,"FN":FN,"TP":TP})

summary_df = pd.DataFrame(summary).set_index("model").round(4)
display(summary_df[["accuracy","FNR","FPR","TN","FP","FN","TP"]])

# Identify method with lowest FNR (fewest misses)
best_fnr_model = summary_df["FNR"].idxmin()
print(f"\nLowest false negative rate (FNR): {best_fnr_model}  (FNR={summary_df.loc[best_fnr_model,'FNR']:.4f})")

# If cost of FN is 10x FP, compute simple cost = 10*FN + 1*FP (on test)
costs = {}
for m, row in summary_df.iterrows():
    costs[m] = 10*row["FN"] + 1*row["FP"]
costs_s = pd.Series(costs, name="cost (10*FN + 1*FP)").sort_values()
display(costs_s)

recommended = costs_s.index[0]
print(f"\nRecommendation under 10:1 FN:FP cost — choose: {recommended}")

# Threshold adjustment for the chosen method (if it supports predict_proba)
def adjust_threshold_and_report(model_name, model_obj, X_test, y_test, thr=0.3):
    if not hasattr(model_obj, "predict_proba"):
        print(f"{model_name} has no predict_proba; skipping threshold demo.")
        return None
    proba = model_obj.predict_proba(X_test)[:,1]
    y_pred_thr = (proba >= thr).astype(int)
    cm = confusion_matrix(y_test, y_pred_thr, labels=[0,1])
    TN, FP, FN, TP, acc, fnr, fpr = metrics_from_cm(cm)
    print(f"\n{model_name} with threshold {thr}:")
    display(pd.DataFrame(cm, index=["True No","True Yes"], columns=["Pred No","Pred Yes"]))
    print(f"accuracy={acc:.4f}, FNR={fnr:.4f}, FPR={fpr:.4f}")
    return {"TN":TN,"FP":FP,"FN":FN,"TP":TP,"accuracy":acc,"FNR":fnr,"FPR":fpr}

# Map names to fitted models
fitted = {
    "Logistic (train split)": log_clf,
    "LDA": lda,
    "QDA": qda,
    "GaussianNB": gnb,
    f"KNN(k={best_k})": best_knn
}

# Adjust threshold from 0.5 to 0.3 for the recommended model
_ = adjust_threshold_and_report(recommended, fitted[recommended], X_test, y_test, thr=0.3)

print("""
Interpreting threshold change:
- Lowering threshold from 0.5 to 0.3 usually reduces FN (missed defaults) at the expense of more FP (false alarms).
- When FN is much more costly than FP (10:1), this trade-off is often desirable.
""")


Q5: Summary across methods; FNR; cost-sensitive choice; threshold adjustment to 0.3


| model | accuracy | FNR | FPR | TN | FP | FN | TP |
|---|---|---|---|---|---|---|---|
| Logistic (train split) | 0.9667 | 1.0 | 0.0 | 2900 | 0 | 100 | 0 |
| LDA | 0.9723 | 0.74 | 0.0031 | 2891 | 9 | 74 | 26 |
| QDA | 0.9717 | 0.71 | 0.0048 | 2886 | 14 | 71 | 29 |
| GaussianNB | 0.97 | 0.73 | 0.0059 | 2883 | 17 | 73 | 27 |
| KNN(k=10) | 0.9703 | 0.67 | 0.0076 | 2878 | 22 | 67 | 33 |



Lowest false negative rate (FNR): KNN(k=10)  (FNR=0.6700)


KNN(k=10)                  692.0
QDA                        724.0
GaussianNB                 747.0
LDA                        749.0
Logistic (train split)    1000.0
Name: cost (10*FN + 1*FP), dtype: float64


Recommendation under 10:1 FN:FP cost — choose: KNN(k=10)

KNN(k=10) with threshold 0.3:


| | Pred No | Pred Yes |
|---|---|---|
| True No | 2831 | 69 |
| True Yes | 47 | 53 |


accuracy=0.9613, FNR=0.4700, FPR=0.0238

Interpreting threshold change:
- Lowering threshold from 0.5 to 0.3 usually reduces FN (missed defaults) at the expense of more FP (false alarms).
- When FN is much more costly than FP (10:1), this trade-off is often desirable.
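
The trade-off can be checked against the KNN(k=10) counts reported above. The sketch below recomputes accuracy, FNR, FPR, and the 10:1 cost at both thresholds. `rates_and_cost` is a hypothetical helper introduced for illustration, and the confusion-matrix counts are copied from the test-set outputs in this section.

```python
def rates_and_cost(tn, fp, fn, tp, fn_cost=10, fp_cost=1):
    """Accuracy, FNR, FPR and total misclassification cost from confusion-matrix counts.

    Hypothetical helper; the default 10:1 cost ratio follows the Q5 setup.
    """
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "FNR": fn / (fn + tp),
        "FPR": fp / (fp + tn),
        "cost": fn_cost * fn + fp_cost * fp,
    }


# Counts copied from the KNN(k=10) test-set results above.
at_050 = rates_and_cost(tn=2878, fp=22, fn=67, tp=33)  # default 0.5 threshold
at_030 = rates_and_cost(tn=2831, fp=69, fn=47, tp=53)  # threshold lowered to 0.3

for name, res in (("threshold 0.5", at_050), ("threshold 0.3", at_030)):
    print(f"{name}: accuracy={res['accuracy']:.4f}, FNR={res['FNR']:.4f}, "
          f"FPR={res['FPR']:.4f}, cost={res['cost']}")
```

On these counts the cost falls from 692 to 539 when the threshold drops to 0.3, even though accuracy falls from about 0.9703 to 0.9613, so accuracy alone would point the wrong way under a 10:1 cost ratio.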

