# Discriminant Analysis

Maintainer: Zhaohu (Jonathan) Fan. Contact: psujohnny@gmail.com

Note: This lab note is still a work in progress. Please let us know if you encounter bugs or issues.


## Table of Contents
1. [Credit Score Data](#1-credit-score-data)  
   1. [Load Data](#11-load-data)  
2. [Discriminant Analysis](#2-discriminant-analysis)  
   1. [In-sample](#21-in-sample)  
   2. [Out-of-sample](#22-out-of-sample)

#### *Colab Notebook [Open in Colab](https://colab.research.google.com/drive/1fPXTO6VkN08cLHz1kIW4UeNlYF-vEHAn?usp=sharing)*

#### *Useful information about [Discriminant Analysis in R](https://yanyudm.github.io/Data-Mining-R/lecture/9.B_DiscriminantAnalysis.html)*



## 1 Credit Score Data
### 1.1 Load Data


In [17]:
# ============================================================
# 1 Credit Score Data
# 1.1 Load Data (Python / Google Colab)
# ============================================================
import numpy as np
import pandas as pd

# Load data
credit_url = "https://yanyudm.github.io/Data-Mining-R/lecture/data/credit0.csv"
credit_data = pd.read_csv(credit_url)

print("Original shape:", credit_data.shape)
credit_data.head()


Original shape: (5000, 63)


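|   | id | Y | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | ... | X22_5 | X22_6 | X22_7 | X22_8 | X22_9 | X22_10 | X22_11 | X23_2 | X23_3 | X24_2 |
|---|----|---|----|----|----|----|----|----|----|----|-----|-------|-------|-------|-------|-------|--------|--------|-------|-------|-------|
| 0 | 413572 | 0 | 0.905 | -1.188 | -0.83 | -0.825 | -0.537 | -0.626 | -0.082 | -0.211 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 411680 | 0 | 1.239 | 0.575 | 1.055 | -0.039 | -0.537 | 0.604 | -0.228 | 1.068 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 408678 | 0 | 1.49 | 1.961 | 2.402 | 2.449 | 0.943 | 1.13 | 0.065 | -0.14 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 404876 | 0 | -0.766 | -0.936 | -0.83 | -0.563 | -0.537 | -0.626 | -0.008 | -0.211 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 401738 | 0 | -1.017 | -0.685 | -0.157 | -0.432 | 0.203 | -0.626 | -0.155 | 1.068 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |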


In [18]:
# ------------------------------------------------------------
# Remove X9 and id (not used for prediction)
# ------------------------------------------------------------
drop_cols = ["X9", "id"]
credit_data = credit_data.drop(columns=[c for c in drop_cols if c in credit_data.columns])

# Coerce Y to a numeric 0/1 label. Values that cannot be parsed become NaN,
# which the value counts below (dropna=False) would reveal.
credit_data["Y"] = pd.to_numeric(credit_data["Y"], errors="coerce")

print("After dropping columns shape:", credit_data.shape)
print("Y value counts:\n", credit_data["Y"].value_counts(dropna=False))


After dropping columns shape: (5000, 61)
Y value counts:
 Y
0    4700
1     300
Name: count, dtype: int64


In [19]:
# ------------------------------------------------------------
# Split data 90/10 into training/testing sets
# ------------------------------------------------------------
from sklearn.model_selection import train_test_split

X = credit_data.drop(columns=["Y"])
y = credit_data["Y"]

# Stratify keeps the class proportion similar in train/test (recommended)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.10,
    random_state=42,
    shuffle=True,
    stratify=y
)

print("Training set shape:", X_train.shape)   # (n_train, p)
print("Testing set shape: ", X_test.shape)    # (n_test, p)
print("Number of variables (predictors):", X_train.shape[1])
print("Number of training observations:", X_train.shape[0])


Training set shape: (4500, 60)
Testing set shape:  (500, 60)
Number of variables (predictors): 60
Number of training observations: 4500
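
As a quick sanity check of what `stratify=y` does, the sketch below repeats the split on a synthetic label vector with the same class counts as above (4,700 zeros and 300 ones) and compares the share of positives in each part. The arrays `y_demo` and `X_demo` are made up for illustration and are not the credit data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the labels: same class counts as the credit data
y_demo = np.array([0] * 4700 + [1] * 300)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=42, stratify=y_demo
)

# With stratification, both parts keep roughly the 6% share of positives
print("Share of 1s overall:", y_demo.mean())
print("Share of 1s in train:", y_tr.mean())
print("Share of 1s in test: ", y_te.mean())
```

Without `stratify`, a 10% test set drawn from data with only 6% positives could end up with noticeably more or fewer positives by chance, which would make the out-of-sample cost harder to compare with the in-sample one.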


In [5]:
# ------------------------------------------------------------
# Cost function for benchmarking performance (class-based)
# - observed: true labels (0/1)
# - predicted: predicted classes (0/1)
# Penalize false negatives (actual 1 predicted 0) more heavily
# ------------------------------------------------------------
def credit_cost(observed, predicted, weight1=10, weight0=1):
    """
    Replicates the R cost function:
      c1: observed==1 & predicted==0 (false negatives) -> weight1
      c0: observed==0 & predicted==1 (false positives) -> weight0
    Returns the mean cost.
    """
    observed = np.asarray(observed).astype(int)
    predicted = np.asarray(predicted).astype(int)

    c1 = (observed == 1) & (predicted == 0)
    c0 = (observed == 0) & (predicted == 1)

    return np.mean(weight1 * c1 + weight0 * c0)

# Example usage (dummy prediction: all zeros)
dummy_pred = np.zeros_like(y_test, dtype=int)
print(f"Example credit_cost (all-0 prediction): {credit_cost(y_test, dummy_pred):.4f}")


Example credit_cost (all-0 prediction): 0.6000
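
To see where 0.6000 comes from, the cost can be worked out by hand. The stratified test set has 500 rows, 30 of them with Y = 1, so predicting 0 everywhere produces 30 false negatives at weight 10 and no false positives: (10 × 30 + 1 × 0) / 500 = 0.6. The sketch below re-creates that arithmetic on a made-up label vector; `toy_cost` is a compact, hypothetical restatement of `credit_cost` so that the block runs on its own.

```python
import numpy as np

def toy_cost(observed, predicted, weight1=10, weight0=1):
    """Mean weighted cost: weight1 per false negative, weight0 per false positive."""
    observed = np.asarray(observed).astype(int)
    predicted = np.asarray(predicted).astype(int)
    false_neg = (observed == 1) & (predicted == 0)
    false_pos = (observed == 0) & (predicted == 1)
    return np.mean(weight1 * false_neg + weight0 * false_pos)

# Made-up labels with the same class counts as the test set: 470 zeros, 30 ones
observed = np.array([0] * 470 + [1] * 30)

cost_all_zero = toy_cost(observed, np.zeros_like(observed))  # 30 FN, 0 FP
cost_all_one = toy_cost(observed, np.ones_like(observed))    # 0 FN, 470 FP

print(f"All-0 prediction: {cost_all_zero:.4f}")  # (10 * 30) / 500 = 0.6000
print(f"All-1 prediction: {cost_all_one:.4f}")   # (1 * 470) / 500 = 0.9400
```

These two naive baselines give a reference point: a classifier is only useful under this cost if it scores below 0.6.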


## 2 Discriminant Analysis
This section fits Linear Discriminant Analysis (LDA) on the training set and evaluates it both in-sample (on the training set) and out-of-sample (on the test set). Rather than relying on the default class assignment, we convert the predicted posterior probabilities P(Y=1) into class labels using a user-chosen probability cutoff.
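
The effect of the cutoff can be seen on a small synthetic example before turning to the credit data. The sketch below fits LDA to two simulated Gaussian classes and counts how many observations are labeled 1 at two cutoffs. The data, the class sizes, and the names `X_demo`, `y_demo`, and `lda_demo` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic, imbalanced two-class data (illustrative only, not the credit data)
n0, n1 = 940, 60
X0 = rng.normal(loc=0.0, scale=1.0, size=(n0, 2))
X1 = rng.normal(loc=1.5, scale=1.0, size=(n1, 2))
X_demo = np.vstack([X0, X1])
y_demo = np.array([0] * n0 + [1] * n1)

lda_demo = LinearDiscriminantAnalysis().fit(X_demo, y_demo)

# Column 1 of predict_proba is P(Y=1) because lda_demo.classes_ is [0, 1]
prob1 = lda_demo.predict_proba(X_demo)[:, 1]

for cutoff in (0.50, 0.15):
    pred = (prob1 >= cutoff).astype(int)
    print(f"cutoff={cutoff:.2f}: flagged {pred.sum()} of {len(pred)} as class 1")
```

Lowering the cutoff can only increase the number of observations labeled 1, which trades false negatives for false positives. That trade is attractive here because the cost function weights a false negative ten times as heavily as a false positive.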

In [7]:
# ============================================================
# 2 Discriminant Analysis (LDA) — Python / Google Colab
# - In-sample + out-of-sample performance
# - Uses an arbitrary cutoff on P(Y=1)
# ============================================================
import numpy as np
import pandas as pd

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, accuracy_score


In [8]:
# ------------------------------------------------------------
# Assumes you already ran the previous cells that created:
# X_train, X_test, y_train, y_test, and credit_cost(...)
# If not, run the "Load Data" section first.
# ------------------------------------------------------------

# Ensure labels are 0/1 integers
y_train_int = pd.Series(y_train).astype(int).values
y_test_int  = pd.Series(y_test).astype(int).values

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train_int)

print("LDA fitted.")



LDA fitted.


In [9]:
# ============================================================
# 2.1 In-sample
# ============================================================
pcut_lda_in = 0.15  # matches R pcut.lda <- .15

# Posterior probabilities P(Y=1)
prob_lda_in = lda.predict_proba(X_train)[:, 1]
pred_lda_in = (prob_lda_in >= pcut_lda_in).astype(int)

# Confusion matrix (like R table)
cm_in = confusion_matrix(y_train_int, pred_lda_in, labels=[0, 1])
cm_in_df = pd.DataFrame(cm_in, index=["Obs 0", "Obs 1"], columns=["Pred 0", "Pred 1"])
cm_in_df


|       | Pred 0 | Pred 1 |
|-------|--------|--------|
| Obs 0 | 3892 | 338 |
| Obs 1 | 157 | 113 |


In [10]:
# In-sample misclassification rate (mean(Y != pred))
mis_in = np.mean(y_train_int != pred_lda_in)

print(f"In-sample cutoff (pcut): {pcut_lda_in:.2f}")
print(f"In-sample misclassification rate: {mis_in:.4f}")
print(f"In-sample accuracy: {accuracy_score(y_train_int, pred_lda_in):.4f}")


In-sample cutoff (pcut): 0.15
In-sample misclassification rate: 0.1100
In-sample accuracy: 0.8900
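
The same summary numbers can be read directly off the confusion matrix. The sketch below plugs in the in-sample counts shown above and recomputes the misclassification rate, then adds the weighted cost using the 10:1 weights from `credit_cost`. The notebook does not print an in-sample cost, so that figure is derived here from the table and is not taken from the original output. The helper name `summarize_confusion` is hypothetical.

```python
import numpy as np

def summarize_confusion(cm, weight1=10, weight0=1):
    """Summaries from a 2x2 matrix laid out as rows = observed, cols = predicted."""
    (tn, fp), (fn, tp) = np.asarray(cm)
    n = tn + fp + fn + tp
    return {
        "misclassification": (fp + fn) / n,
        "cost": (weight1 * fn + weight0 * fp) / n,
        "sensitivity": tp / (tp + fn),          # share of actual 1s caught
        "false_positive_rate": fp / (fp + tn),  # share of actual 0s flagged
    }

# In-sample counts from the table above (cutoff 0.15)
cm_in_counts = [[3892, 338],
                [157, 113]]

summary_in = summarize_confusion(cm_in_counts)
for name, value in summary_in.items():
    print(f"{name}: {value:.4f}")
```

The misclassification rate works out to (338 + 157) / 4500 = 0.11, matching the printed value. Accuracy alone hides how the errors split between the two classes, which is why the per-class rates and the weighted cost are worth reporting alongside it.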


In [11]:
# ============================================================
# 2.2 Out-of-sample
# ============================================================
pcut_lda_out = 0.12  # matches R cut.lda <- .12

prob_lda_out = lda.predict_proba(X_test)[:, 1]
pred_lda_out = (prob_lda_out >= pcut_lda_out).astype(int)

cm_out = confusion_matrix(y_test_int, pred_lda_out, labels=[0, 1])
cm_out_df = pd.DataFrame(cm_out, index=["Obs 0", "Obs 1"], columns=["Pred 0", "Pred 1"])
cm_out_df


|       | Pred 0 | Pred 1 |
|-------|--------|--------|
| Obs 0 | 419 | 51 |
| Obs 1 | 17 | 13 |


In [12]:
# Out-of-sample misclassification rate + cost function
mis_out = np.mean(y_test_int != pred_lda_out)
cost_out = credit_cost(y_test_int, pred_lda_out)

print(f"Out-of-sample cutoff (pcut): {pcut_lda_out:.2f}")
print(f"Out-of-sample misclassification rate: {mis_out:.4f}")
print(f"Out-of-sample accuracy: {accuracy_score(y_test_int, pred_lda_out):.4f}")
print(f"Out-of-sample credit cost: {cost_out:.4f}")


Out-of-sample cutoff (pcut): 0.12
Out-of-sample misclassification rate: 0.1360
Out-of-sample accuracy: 0.8640
Out-of-sample credit cost: 0.4420
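
The cutoffs 0.15 and 0.12 used above are arbitrary. One common alternative, sketched here under stated assumptions, is to pick the cutoff that minimizes the weighted cost over a grid. The probabilities and labels below are simulated (they are not the LDA output), and `toy_cost` is a hypothetical stand-in for `credit_cost`. In practice the search would be run on training data or via cross-validation, not on the test set. For reference, when predicted probabilities are well calibrated, the cost-minimizing cutoff for weights 10 and 1 is 1 / (10 + 1) ≈ 0.09.

```python
import numpy as np

def toy_cost(observed, predicted, weight1=10, weight0=1):
    """Mean weighted cost: weight1 per false negative, weight0 per false positive."""
    false_neg = (observed == 1) & (predicted == 0)
    false_pos = (observed == 0) & (predicted == 1)
    return np.mean(weight1 * false_neg + weight0 * false_pos)

rng = np.random.default_rng(1)

# Simulated probabilities, skewed toward small values, and labels drawn from them
prob_sim = rng.beta(a=1.0, b=12.0, size=5000)
y_sim = (rng.random(5000) < prob_sim).astype(int)

# Evaluate the cost at each candidate cutoff on a 0.01-spaced grid
cutoffs = np.linspace(0.01, 0.99, 99)
costs = np.array([toy_cost(y_sim, (prob_sim >= c).astype(int)) for c in cutoffs])

best_cutoff = cutoffs[np.argmin(costs)]
cost_half = toy_cost(y_sim, (prob_sim >= 0.5).astype(int))

print(f"Grid-search cutoff: {best_cutoff:.2f}, cost: {costs.min():.4f}")
print(f"Cost at cutoff 0.50: {cost_half:.4f}")
```

Because the simulated labels are drawn from the simulated probabilities, the grid-search cutoff should land near the theoretical value, up to sampling noise. With real model output the best cutoff can differ, since LDA posteriors are not guaranteed to be calibrated.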




