# Logistic Regression for Binary Classification


Maintainer: Zhaohu (Jonathan) Fan. Contact: psujohnny@gmail.com

Note: This lab note is still a work in progress. Let us know if you encounter bugs or issues.

# Table of Contents
1. [Objective](#1-objective)  
2. [Credit Card Default Data](#2-credit-card-default-data)  
3. [Logistic Regression](#3-logistic-regression)  
   3.1 [Train a logistic regression model with all variables](#31-train-a-logistic-regression-model-with-all-variables)  
   3.2 [Binary Classification](#32-binary-classification)  
   3.3 [Asymmetric cost](#33-asymmetric-cost)  
4. [Summary](#4-summary)  
   4.1 [Things to remember](#41-things-to-remember)  


#### *Colab Notebook [Open in Colab](https://colab.research.google.com/drive/1ZTxuRkIR1qwE4yG8VQtIpo6SnueVRFF2?usp=sharing)*
#### *Useful information about [Logistic Regression for Binary Classification](https://yanyudm.github.io/Data-Mining-R/lecture/4.C_LogisticReg_Classification.html)*




# 1 Objective

The objective of this case is to help you understand logistic regression (binary classification) and several important ideas, such as cross-validation, the ROC curve, and the cut-off probability.

# 2 Credit Card Default Data

We will use a subset of the Credit Card Default Data (sample size $n = 12{,}000$) for this lab and illustration. Details of the full dataset ($n = 30{,}000$) can be found at: http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

Think about what kinds of factors could cause people to fail to pay their credit card balance.

We first load the credit scoring data. The data are stored as a comma-separated values (CSV) file, which pandas can read directly from a URL.


In [1]:
# Google Colab Python equivalent of the provided R code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2_contingency

#  Load data
url = "https://yanyudm.github.io/Data-Mining-R/lecture/data/credit_default.csv"
credit_data = pd.read_csv(url)




Let's look at what information we have.



In [2]:
# colnames(credit_data)
# Two column names per line
cols = list(credit_data.columns)
for i in range(0, len(cols), 2):
    if i + 1 < len(cols):
        print(f"{cols[i]}\t{cols[i+1]}")
    else:
        print(cols[i])

LIMIT_BAL	SEX
EDUCATION	MARRIAGE
AGE	PAY_0
PAY_2	PAY_3
PAY_4	PAY_5
PAY_6	BILL_AMT1
BILL_AMT2	BILL_AMT3
BILL_AMT4	BILL_AMT5
BILL_AMT6	PAY_AMT1
PAY_AMT2	PAY_AMT3
PAY_AMT4	PAY_AMT5
PAY_AMT6	default.payment.next.month


Let’s look at the proportion of people who actually defaulted in this sample.



In [3]:

#  mean(credit_data$default.payment.next.month)
mean_default = credit_data["default.payment.next.month"].mean()
print(f"Mean default.payment.next.month: {mean_default:.4f}")



Mean default.payment.next.month: 0.2193


The name of the response variable is too long, so we shorten it by renaming the column. Recall the rename() method.



In [4]:
# rename default.payment.next.month -> default
credit_data = credit_data.rename(columns={"default.payment.next.month": "default"})

How about the variable type and summary statistics?



In [5]:
#  str(credit_data) and summary(credit_data)
print("\n--- credit_data.info() ---")
credit_data.info()

print("\n--- credit_data.describe(include='all') ---")

# Make numeric summaries show 2 decimal places
desc = credit_data.describe(include="all").T
num_cols = desc.select_dtypes(include=["number"]).columns
desc[num_cols] = desc[num_cols].round(2)

display(desc)




--- credit_data.info() ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 24 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   LIMIT_BAL  12000 non-null  int64
 1   SEX        12000 non-null  int64
 2   EDUCATION  12000 non-null  int64
 3   MARRIAGE   12000 non-null  int64
 4   AGE        12000 non-null  int64
 5   PAY_0      12000 non-null  int64
 6   PAY_2      12000 non-null  int64
 7   PAY_3      12000 non-null  int64
 8   PAY_4      12000 non-null  int64
 9   PAY_5      12000 non-null  int64
 10  PAY_6      12000 non-null  int64
 11  BILL_AMT1  12000 non-null  int64
 12  BILL_AMT2  12000 non-null  int64
 13  BILL_AMT3  12000 non-null  int64
 14  BILL_AMT4  12000 non-null  int64
 15  BILL_AMT5  12000 non-null  int64
 16  BILL_AMT6  12000 non-null  int64
 17  PAY_AMT1   12000 non-null  int64
 18  PAY_AMT2   12000 non-null  int64
 19  PAY_AMT3   12000 non-null  int64
 20  PAY_AMT4   12000 non-n

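| Variable | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LIMIT_BAL | 12000.0 | 167501.33 | 130334.21 | 10000.0 | 50000.0 | 140000.0 | 240000.0 | 1000000.0 |
| SEX | 12000.0 | 1.6 | 0.49 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 |
| EDUCATION | 12000.0 | 1.84 | 0.75 | 1.0 | 1.0 | 2.0 | 2.0 | 4.0 |
| MARRIAGE | 12000.0 | 1.55 | 0.52 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 |
| AGE | 12000.0 | 35.5 | 9.23 | 21.0 | 28.0 | 34.0 | 41.0 | 79.0 |
| PAY_0 | 12000.0 | -0.02 | 1.13 | -2.0 | -1.0 | 0.0 | 0.0 | 8.0 |
| PAY_2 | 12000.0 | -0.13 | 1.19 | -2.0 | -1.0 | 0.0 | 0.0 | 7.0 |
| PAY_3 | 12000.0 | -0.17 | 1.19 | -2.0 | -1.0 | 0.0 | 0.0 | 7.0 |
| PAY_4 | 12000.0 | -0.23 | 1.15 | -2.0 | -1.0 | 0.0 | 0.0 | 7.0 |
| PAY_5 | 12000.0 | -0.27 | 1.12 | -2.0 | -1.0 | 0.0 | 0.0 | 7.0 |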


We see that all variables are stored as integers, but SEX, EDUCATION, and MARRIAGE are categorical, so we convert them to the pandas category type (the counterpart of a factor in R).



In [6]:
# Convert to factors (categorical)
for col in ["SEX", "EDUCATION", "MARRIAGE"]:
    credit_data[col] = credit_data[col].astype("category")
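
The reason for the conversion is that a categorical variable should enter the model as a set of 0/1 indicator columns (one per non-reference level) rather than as a single numeric slope. The block below is a minimal plain-Python sketch of that idea; `treatment_code` and the toy EDUCATION codes are hypothetical and only illustrate the coding scheme, they are not what statsmodels runs internally.

```python
def treatment_code(values):
    """Dummy-code a categorical column, dropping the first (reference) level."""
    levels = sorted(set(values))
    reference, others = levels[0], levels[1:]
    # One 0/1 indicator column per non-reference level
    columns = {
        f"level_{lvl}": [1 if v == lvl else 0 for v in values]
        for lvl in others
    }
    return reference, columns

# Toy EDUCATION codes, made up for illustration only
education_toy = [1, 2, 2, 3, 4, 1]
reference, dummies = treatment_code(education_toy)

print("reference level:", reference)
for name, col in dummies.items():
    print(name, col)
```

A variable with k levels contributes k - 1 indicator columns. If all levels are present, SEX, EDUCATION, and MARRIAGE would contribute 1 + 3 + 2 indicators, which together with the 20 numeric predictors would be consistent with the `Df Model: 26` reported in the Section 3.1 summary.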



We omit further exploratory data analysis (EDA) here, but you should never skip it in your own data analysis.



# 3 Logistic Regression
Randomly split the data into training (80%) and testing (20%) sets:

In [7]:
#  Train/test split (80/20) similar to R's sample()
np.random.seed(123)  # for reproducibility
n = len(credit_data)
train_idx = np.random.choice(credit_data.index, size=int(0.80 * n), replace=False)

credit_train = credit_data.loc[train_idx].copy()
credit_test  = credit_data.drop(train_idx).copy()
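
The split above relies on sampling row labels without replacement, so the two sets never share a row. As a minimal sketch of the same logic using only the standard library (the helper `split_indices` is hypothetical, and its random stream differs from NumPy's, so it will not reproduce the exact rows chosen above):

```python
import random

def split_indices(n, train_frac=0.80, seed=123):
    """Return disjoint train/test index lists covering range(n)."""
    rng = random.Random(seed)
    # Sampling without replacement guarantees no duplicated rows
    train = rng.sample(range(n), int(train_frac * n))
    train_set = set(train)
    test = [i for i in range(n) if i not in train_set]
    return train, test

train_idx_demo, test_idx_demo = split_indices(12000)

print(len(train_idx_demo), len(test_idx_demo))
print("overlap:", len(set(train_idx_demo) & set(test_idx_demo)))
```

With n = 12,000 this gives 9,600 training rows, matching the `No. Observations` value in the model summary of Section 3.1.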

## 3.1 Train a logistic regression model with all variables


In [8]:


# Fit logistic regression: glm(default ~ ., family=binomial, data=credit_train)
# Build a formula that treats selected columns as categorical (like factors in R)
y_col = "default"
x_cols = [c for c in credit_train.columns if c != y_col]
cat_cols = {"SEX", "EDUCATION", "MARRIAGE"}

rhs_terms = [f"C({c})" if c in cat_cols else c for c in x_cols]
formula = f"{y_col} ~ " + " + ".join(rhs_terms)

credit_glm0 = smf.glm(formula=formula, data=credit_train, family=sm.families.Binomial()).fit()
print(credit_glm0.summary())



                 Generalized Linear Model Regression Results                  
Dep. Variable:                default   No. Observations:                 9600
Model:                            GLM   Df Residuals:                     9573
Model Family:                Binomial   Df Model:                           26
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -4444.0
Date:                Sun, 04 Jan 2026   Deviance:                       8888.0
Time:                        04:39:08   Pearson chi2:                 1.12e+04
No. Iterations:                     6   Pseudo R-squ. (CS):             0.1204
Covariance Type:            nonrobust                                         
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            -1.0468      0.15
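
The coefficients in this summary are on the log-odds (logit) scale. As a minimal sketch of how to read them, the block below converts the intercept reported above back to a probability. The helper names `inv_logit` and `odds_ratio` are illustrative and not part of statsmodels. The resulting probability applies to a hypothetical customer at the reference level of every categorical variable with all numeric predictors equal to zero, so it is a baseline rather than a typical customer.

```python
import math

def inv_logit(eta):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def odds_ratio(beta):
    """Multiplicative change in the odds for a one-unit increase in a predictor."""
    return math.exp(beta)

intercept = -1.0468  # copied from the summary above

baseline_prob = inv_logit(intercept)
baseline_odds = odds_ratio(intercept)

print(f"Baseline probability: {baseline_prob:.4f}")
print(f"Baseline odds:        {baseline_odds:.4f}")
```

The same `odds_ratio` helper can be applied to any slope coefficient: a value above 1 means the odds of default increase with that predictor, and a value below 1 means they decrease.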

## 3.2 Binary Classification
As we discussed in the lecture, people may be more interested in the classification results than in the predicted probabilities. To classify, we first have to define a cut-off probability.

The confusion tables below illustrate the impact of choosing different cut-off probabilities. A large cut-off probability results in few cases being predicted as 1, and a small cut-off probability results in many cases being predicted as 1.

In [9]:
# Predicted probabilities on training data (type="response" in R)
pred_glm0_train = credit_glm0.predict(credit_train)  # probabilities


In [10]:
# Confusion tables at different cutoffs (like R table(..., (pi>cutoff)*1))
def confusion_table(y_true, prob, threshold):
    y_pred = (np.asarray(prob) > threshold).astype(int)
    return pd.crosstab(
        pd.Series(np.asarray(y_true).astype(int), name="Truth"),
        pd.Series(y_pred, name="Predicted"),
        dropna=False
    )

for thr in [0.9, 0.5, 0.2, 0.0001]:
    print(f"\n--- Confusion table (cutoff={thr}) ---")
    display(confusion_table(credit_train["default"], pred_glm0_train, thr))




--- Confusion table (cutoff=0.9) ---


| Truth (rows) / Predicted (columns) | 0 | 1 |
| --- | --- | --- |
| 0 | 7477 | 10 |
| 1 | 2099 | 14 |



--- Confusion table (cutoff=0.5) ---


| Truth (rows) / Predicted (columns) | 0 | 1 |
| --- | --- | --- |
| 0 | 7293 | 194 |
| 1 | 1591 | 522 |



--- Confusion table (cutoff=0.2) ---


| Truth (rows) / Predicted (columns) | 0 | 1 |
| --- | --- | --- |
| 0 | 4462 | 3025 |
| 1 | 625 | 1488 |



--- Confusion table (cutoff=0.0001) ---


| Truth (rows) / Predicted (columns) | 0 | 1 |
| --- | --- | --- |
| 0 | 2 | 7485 |
| 1 | 0 | 2113 |


Therefore, determining the optimal cut-off probability is crucial. The simplest way to choose the cut-off is to use the proportion of “1” in the original data. We will introduce a more appropriate way to determine the optimal p-cut.
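
Each confusion table above can be condensed into the misclassification rate (MR), false positive rate (FPR), and false negative rate (FNR). The block below is a minimal sketch using the counts from the cutoff=0.5 table; the helper name `classification_rates` is hypothetical.

```python
def classification_rates(tn, fp, fn, tp):
    """Return misclassification, false positive, and false negative rates."""
    total = tn + fp + fn + tp
    return {
        "MR": (fp + fn) / total,  # share of all cases classified wrongly
        "FPR": fp / (fp + tn),    # share of true 0s predicted as 1
        "FNR": fn / (fn + tp),    # share of true 1s predicted as 0
    }

# Counts copied from the cutoff=0.5 confusion table above
rates = classification_rates(tn=7293, fp=194, fn=1591, tp=522)

for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

At this cut-off the MR is about 0.186, which agrees with the symmetric cost reported in Section 3.3, while the FNR is roughly 0.75: most actual defaulters are missed. That imbalance is the motivation for weighting the two error types differently.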



## 3.3 Asymmetric cost
When deciding whether to give someone a loan, the cost function captures the trade-off between the risk of lending to someone who cannot pay (predict 0, truth 1; a false negative) and the risk of rejecting someone who qualifies (predict 1, truth 0; a false positive). Depending on the business situation, these two errors may carry different (asymmetric) costs. The cost ratio in turn determines the cut-off probability of the binary decision rule. As before, a large cut-off probability results in few cases being predicted as 1, and a small cut-off probability results in many cases being predicted as 1.

Below we define a symmetric cost function with a 1:1 cost ratio (equivalently, pcut = 1/2) and an asymmetric cost function with a 5:1 cost ratio (equivalently, pcut = 1/6):

In [11]:
#  Cost functions
# Symmetric cost with 1:1 cost ratio (pcut=1/2)
def cost1(r, pi, pcut=0.5):
    r = np.asarray(r).astype(int)
    pi = np.asarray(pi, dtype=float)
    err = ((r == 0) & (pi > pcut)) | ((r == 1) & (pi < pcut))
    return err.mean()

# Asymmetric cost with 5:1 cost ratio (pcut=1/6)
def cost2(r, pi, pcut=1/6):
    r = np.asarray(r).astype(int)
    pi = np.asarray(pi, dtype=float)
    weight1 = 5  # cost of FN
    weight0 = 1  # cost of FP
    c1 = (r == 1) & (pi < pcut)  # FN indicator
    c0 = (r == 0) & (pi > pcut)  # FP indicator
    return np.mean(weight1 * c1 + weight0 * c0)

# Compute costs on training set (same as your R calls)
r_train = credit_train["default"].astype(int).values
pi_train = np.asarray(pred_glm0_train)

print(f"\nSymmetric cost (pcut=0.5): {cost1(r_train, pi_train, pcut=0.5):.4f}")
print(f"Asymmetric cost (pcut=1/6): {cost2(r_train, pi_train, pcut=1/6):.4f}")


Symmetric cost (pcut=0.5): 0.1859
Asymmetric cost (pcut=1/6): 0.6833


Here “pcut = 1/(1+weight1/weight0)” can be computed inside the cost2 function, so that the cost becomes a function of two arguments only, cost(r, pi). A function of that form can later be fed to a cross-validation routine such as R's cv.glm().
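
One way to sketch this two-argument form in plain Python is a small factory that fixes the weights and derives the cut-off from them. The names `make_cost`, `weight_fn`, and `weight_fp` are hypothetical, the toy labels and probabilities are made up, and unlike cost2 above this sketch predicts 0 for a probability exactly equal to the cut-off.

```python
def make_cost(weight_fn=5, weight_fp=1):
    """Build a cost(r, pi) function whose cut-off is implied by the weights."""
    pcut = 1 / (1 + weight_fn / weight_fp)

    def cost(r, pi):
        total = 0.0
        for truth, prob in zip(r, pi):
            pred = 1 if prob > pcut else 0
            if truth == 1 and pred == 0:
                total += weight_fn  # false negative
            elif truth == 0 and pred == 1:
                total += weight_fp  # false positive
        return total / len(r)

    return cost

cost_5_to_1 = make_cost(weight_fn=5, weight_fp=1)

# Toy labels and probabilities, made up for illustration only
r_toy = [1, 0, 1, 0]
pi_toy = [0.10, 0.30, 0.60, 0.05]

print(cost_5_to_1(r_toy, pi_toy))  # one FN (5) + one FP (1), averaged over 4 cases: 1.5
```

Because the weights and the cut-off live inside the closure, the same `cost_5_to_1` object can be applied to the predictions of any candidate model without risk of mixing cut-offs.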

In general, you pre-specify a cost ratio (e.g., 5:1) from domain knowledge and use the equivalent cut-off probability, 1/(5+1). You then use the resulting cost value to compare different models under the SAME cost function (here, the asymmetric cost2).
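
The equivalence between a cost ratio and a cut-off probability follows from comparing expected costs. In the derivation below, $w_1$ stands for the false negative cost (weight1), $w_0$ for the false positive cost (weight0), and $p$ for the predicted probability of default; this notation is introduced here only for illustration.

$$
\begin{aligned}
\text{expected cost of predicting 0} &= w_1\, p, \\
\text{expected cost of predicting 1} &= w_0\,(1 - p), \\
\text{predict 1 when } w_1\, p > w_0\,(1 - p)
  &\iff p > \frac{w_0}{w_0 + w_1} = \frac{1}{1 + w_1 / w_0}.
\end{aligned}
$$

With $w_1 : w_0 = 5 : 1$ the threshold is $1/6$, and with $w_1 : w_0 = 1 : 1$ it is $1/2$, matching the two cut-offs used in cost2 and cost1.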

# 4 Summary
## 4.1 Things to remember

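*   Know how to use glm() (here, smf.glm() from statsmodels) to fit a logistic regression;
*   Know how to do binary classification and how to calculate the misclassification rate (MR), false positive rate (FPR), false negative rate (FNR), and (asymmetric) cost.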






