# **Disparities in Mortgage Lending: A Statistical Analysis of Race and Gender Bias in New York State**
#### **Victor Zhang, Ellie Kogan, Erin Mutchek, Zheka Chyzhykova and Michael Lee,** CMPU 250, Vassar College

In [61]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import sklearn as sk
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from statsmodels.tools import add_constant
import fairlearn
from fairlearn.metrics import demographic_parity_ratio, MetricFrame, selection_rate, false_positive_rate, false_negative_rate
from sklearn import metrics
from sklearn.metrics import accuracy_score
from tabulate import tabulate


## 1. Introduction

Traditionally, mortgage approval was handled by human underwriters. Today, algorithms are used throughout the process—from pre-approval to risk scoring to final loan decisions. These tools are intended to reduce bias, but growing evidence suggests they may reproduce the very inequalities they aim to fix. A 2021 Markup investigation found that Black applicants were 80% more likely to be denied than white peers with similar finances.

In this notebook, we investigate potential algorithmic bias in mortgage approvals in New York. We ask:

1. How do credit-scoring algorithms affect approval rates by race and gender?  
2. Do disparities exist in interest rates assigned?  
3. Can we build a model that is both accurate and fair?

We analyze a dataset of mortgage applications, conduct exploratory data analysis, and build regression models to test these questions. We hypothesize that racial and gender bias persists in algorithmic loan decisions.


## 2. Data

We begin by loading our cleaned dataset. Relevant columns have already been selected, and we have filtered the data to include only applications whose outcome was an explicit approval or denial.

### 2.1. Load the Data

In [62]:
df = pd.read_csv('../data/clean-data/cleaned_nys_data.csv')
df_original = df.copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233144 entries, 0 to 233143
Data columns (total 20 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   action_taken                       233144 non-null  int64  
 1   derived_race                       233144 non-null  object 
 2   derived_ethnicity                  233144 non-null  object 
 3   applicant_sex                      233144 non-null  int64  
 4   applicant_age                      233144 non-null  object 
 5   income                             226538 non-null  float64
 6   debt_to_income_ratio               100684 non-null  float64
 7   applicant_credit_score_type        233144 non-null  int64  
 8   loan_amount                        233144 non-null  float64
 9   loan_to_value_ratio                224447 non-null  float64
 10  interest_rate                      176143 non-null  float64
 11  rate_spread                        1519



### 2.2 Additional Prep

Here, we create and clean the variables that will help us analyze loan outcomes by race.

In [63]:
# Selecting Black and White applications and creating variable denoting whether white
df_slice = df.loc[(df['derived_race'] == "Black or African American") | (df['derived_race'] == "White")]
df_race = df_slice.copy()
df_race['binary_race'] = 0
df_race.loc[df_race['derived_race'] == "White", 'binary_race'] = 1

# Transform derived ethnicity to a binary column (1 = Not Hispanic or Latino, 0 = otherwise)
df_race['binary_ethnicity'] = 0
df_race.loc[df_race['derived_ethnicity'] == "Not Hispanic or Latino", 'binary_ethnicity'] = 1

In [64]:
# Cleaning rate_spread
df_race = df_race[df_race['rate_spread'] != 'Exempt']
df_race['rate_spread'] = pd.to_numeric(df_race['rate_spread'])

In [65]:
# Drops N/A
df_race = df_race.dropna()
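
Because `dropna()` with no arguments removes every row that has a missing value in any column, it can help to see how much each column contributes before dropping. The `df.info()` output above shows, for example, that `debt_to_income_ratio` is populated for only 100,684 of 233,144 rows. The following is a minimal sketch on an invented toy frame (the values are made up and only the column names mirror the dataset), not a step from the original analysis:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_race; all values are invented for illustration.
toy = pd.DataFrame({
    "income": [50.0, np.nan, 80.0, 120.0],
    "debt_to_income_ratio": [np.nan, 36.0, np.nan, 20.0],
    "interest_rate": [3.5, 4.0, 3.9, 3.1],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Rows that survive a blanket dropna() compared with the starting count.
rows_before = len(toy)
rows_after = len(toy.dropna())
print(f"{rows_after} of {rows_before} rows kept")
```

If one column accounts for most of the loss, `dropna(subset=[...])` restricted to the columns a model actually uses is one way to keep more rows.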

## 3. Linear Regression Models
To investigate whether algorithmic decisions in mortgage lending reflect racial or gender bias, we developed three linear regression models, each targeting a key loan outcome: loan amount, interest rate, and rate spread. These models control for a range of financial and property-related variables to isolate the effect of applicant demographics. Together, they offer insight into whether disparities persist even when controlling for relevant qualifications.

Loan amount reflects how much lenders are willing to lend, which impacts access to credit. Interest rate affects the total cost of the loan over time and may reveal pricing disparities. Rate spread shows how the offered rate compares to the market, helping to assess fairness in a standardized way. 


### 3.1. Model 1: Loan Amount (MLR)

This model uses a selection of predictors that we thought best represented applicants' financial status and demographics. We use them to test whether applicant race or sex has an effect on the loan amount that applicants receive. We include financial status and the type of property being financed as controls, so that the race and sex coefficients are not simply picking up differences in those factors.

In [66]:
# Selecting predictor variables of interest and converting them to acceptable format
X = df_race[['tract_minority_population_percent', "binary_race", "debt_to_income_ratio", 'income', "property_value", 'applicant_sex',
             'applicant_age', 'applicant_credit_score_type', 'loan_to_value_ratio', 'loan_type', 'loan_purpose', 'lien_status', 
             'occupancy_type', 'aus-1']]
X = pd.get_dummies(X, columns=['binary_race', 'applicant_sex', 'applicant_age', 'aus-1', 'occupancy_type', 'applicant_credit_score_type', 'loan_type', 'loan_purpose', 'lien_status'], drop_first=True)
X = X.astype(float)

# Selecting outcome variable and constant
y = df_race['loan_amount']
X = sm.add_constant(X)

# Running model and printing summary
model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:            loan_amount   R-squared:                       0.790
Model:                            OLS   Adj. R-squared:                  0.790
Method:                 Least Squares   F-statistic:                     5959.
Date:                Fri, 09 May 2025   Prob (F-statistic):               0.00
Time:                        16:53:39   Log-Likelihood:            -8.9893e+05
No. Observations:               66614   AIC:                         1.798e+06
Df Residuals:                   66571   BIC:                         1.798e+06
Df Model:                          42                                         
Covariance Type:            nonrobust                                         
                                        coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------
const 
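
To read the coefficients in a summary like the one above, it helps to recall what `pd.get_dummies(..., drop_first=True)` does: the first level of each categorical variable is dropped and becomes the reference category, so each remaining dummy coefficient is a difference relative to that reference. With the coding used here, the race dummy would be named `binary_race_1` and would compare white applicants (1) with Black applicants (0). A minimal sketch on an invented toy frame:

```python
import pandas as pd

# Invented rows; only the column names mirror the notebook.
toy = pd.DataFrame({
    "binary_race": [0, 1, 1, 0],
    "loan_type": [1, 2, 3, 1],
    "income": [60.0, 85.0, 40.0, 72.0],
})

# Same call pattern as the notebook: one-hot encode, dropping the first level.
encoded = pd.get_dummies(
    toy, columns=["binary_race", "loan_type"], drop_first=True
).astype(float)

# binary_race_0 and loan_type_1 are absent: they are the reference levels.
print(sorted(encoded.columns))
```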

### 3.2. Model 2: Determining whether race and gender have an effect on the interest rate of the loan

This model uses the same predictors as Model 1, but the outcome is the interest rate, which determines how much borrowers end up paying on their loans. We selected these predictors to represent applicants' demographic and financial status and to indicate whether demographic information has an effect on the interest rate ultimately given.

In [67]:
# Selecting predictor variables of interest 
X = df_race[['tract_minority_population_percent', "binary_race", "debt_to_income_ratio", 'income', "property_value", 'applicant_sex',
             'applicant_age', 'applicant_credit_score_type', 'loan_to_value_ratio', 'loan_type', 'loan_purpose', 'lien_status', 
             'occupancy_type', 'aus-1']]

# Converting categorical variables to correct format
X = pd.get_dummies(X, columns=['binary_race', 'applicant_sex', 'applicant_age', 'aus-1', 'occupancy_type', 'applicant_credit_score_type', 'loan_type', 'loan_purpose', 'lien_status'], drop_first=True)
X = X.astype(float)

# Preparing outcome variable and constant
y = df_race['interest_rate']
X = sm.add_constant(X)

# Running model and printing out summary
model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:          interest_rate   R-squared:                       0.171
Model:                            OLS   Adj. R-squared:                  0.170
Method:                 Least Squares   F-statistic:                     326.7
Date:                Fri, 09 May 2025   Prob (F-statistic):               0.00
Time:                        16:53:40   Log-Likelihood:            -1.1545e+05
No. Observations:               66614   AIC:                         2.310e+05
Df Residuals:                   66571   BIC:                         2.314e+05
Df Model:                          42                                         
Covariance Type:            nonrobust                                         
                                        coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------
const 

### 3.3. Model 3: Determining whether race and gender have an effect on the rate spread given

Rate spread shows how much more (or less) a borrower is paying compared to the best-qualified rates available at that time. As a metric, it can be more informative than the raw interest rate because it captures the relative cost of the loan while controlling for market conditions. For this model, we used the same predictors to see whether race had an effect on the rate spread that borrowers were given.
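
As a small worked example of the idea, the spread is a difference in percentage points between a loan's rate and a benchmark rate (the codebook describes the benchmark as the average prime offer rate). The function name and the numbers below are hypothetical and are not taken from the dataset:

```python
def rate_spread(loan_rate_pct: float, benchmark_rate_pct: float) -> float:
    """Spread, in percentage points, between a loan's rate and a benchmark rate."""
    return round(loan_rate_pct - benchmark_rate_pct, 3)

# A borrower paying above the benchmark has a positive spread...
print(rate_spread(4.25, 3.50))  # 0.75
# ...and one paying below it has a negative spread.
print(rate_spread(3.10, 3.50))  # -0.4
```

Two loans with the same interest rate, written in months with different benchmark rates, therefore have different spreads, which is why the spread is easier to compare across time.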


In [68]:
# Selecting predictor variables of interest 
X = df_race[['tract_minority_population_percent', "binary_race", "debt_to_income_ratio", 'income', "property_value", 'applicant_sex',
             'applicant_age', 'applicant_credit_score_type', 'loan_to_value_ratio', 'loan_type', 'loan_purpose', 'lien_status', 
             'occupancy_type', 'aus-1']]

# Encoding categorical values and converting to correct type 
X = pd.get_dummies(X, columns=['binary_race', 'applicant_sex', 'applicant_age', 'aus-1', 'occupancy_type', 'applicant_credit_score_type', 'loan_type', 'loan_purpose', 'lien_status'], drop_first=True)
X = X.astype(float)

# Selecting outcome variable
y = df_race['rate_spread']
# Adding constant
X = sm.add_constant(X)

# Running model and print summary
model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:            rate_spread   R-squared:                       0.087
Model:                            OLS   Adj. R-squared:                  0.087
Method:                 Least Squares   F-statistic:                     151.8
Date:                Fri, 09 May 2025   Prob (F-statistic):               0.00
Time:                        16:53:40   Log-Likelihood:            -1.1676e+05
No. Observations:               66614   AIC:                         2.336e+05
Df Residuals:                   66571   BIC:                         2.340e+05
Df Model:                          42                                         
Covariance Type:            nonrobust                                         
                                        coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------
const 

## 4. Model Recreations
Our second objective was to analyze the fairness of loan approval decisions. We trained three models to predict application outcomes (accepted vs. denied). We chose Logistic Regression, Decision Tree, and K-Nearest Neighbors (KNN) because of their functionality with categorical outcomes. 

### 4.1. Data Preparation
Here, we continue to adjust our data so that we can properly analyze our predictive models. We then split the data to support the training and testing of the models.

In [69]:
# select rows for applications that were explicitly approved or denied
filtered_df = df.loc[(df["action_taken"] == 1) | (df["action_taken"] == 2) | (df["action_taken"] == 7) | (df["action_taken"] == 3)]

# remove na values
filtered_df = filtered_df.dropna(subset=["debt_to_income_ratio","income","loan_to_value_ratio","loan_amount","property_value"])

# create binary accepted column
filtered_df["binary_accepted"] = True
filtered_df.loc[filtered_df["action_taken"] == 7, 'binary_accepted'] = False
filtered_df.loc[filtered_df["action_taken"] == 3, 'binary_accepted'] = False
counts = filtered_df['binary_accepted'].value_counts()

# create binary race column - 1 for White
df_slice = filtered_df.loc[(filtered_df['derived_race'] == "Black or African American") | (filtered_df['derived_race'] == "White")]
filtered_df = df_slice.copy()
filtered_df['binary_race'] = 0
filtered_df.loc[filtered_df['derived_race'] == "White", 'binary_race'] = 1

# split data

X = filtered_df[["debt_to_income_ratio","income","loan_to_value_ratio","loan_amount","property_value"]]
y = filtered_df["binary_accepted"]

train, test = train_test_split(filtered_df, test_size=0.2)
X_train, X_test = train[["debt_to_income_ratio","income","loan_to_value_ratio","loan_amount","property_value"]], \
test[["debt_to_income_ratio","income","loan_to_value_ratio","loan_amount","property_value"]] 
y_train, y_test = train["binary_accepted"], test["binary_accepted"] 
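
The split above is random on each run, so the accuracy figures that follow will vary slightly between executions. One option, shown here as a sketch on an invented toy frame rather than as the notebook's actual procedure, is to pass `random_state` for reproducibility and `stratify` so that the train and test sets keep the same share of accepted applications:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 86 accepted and 14 denied rows, invented for illustration.
toy = pd.DataFrame({
    "income": range(100),
    "binary_accepted": [True] * 86 + [False] * 14,
})

# random_state fixes the shuffle; stratify preserves the class balance.
toy_train, toy_test = train_test_split(
    toy,
    test_size=0.2,
    stratify=toy["binary_accepted"],
    random_state=42,
)

print(len(toy_train), len(toy_test))
print(toy_test["binary_accepted"].mean())
```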

In [70]:
filtered_df['binary_accepted'].value_counts(), filtered_df['action_taken'].value_counts()

(binary_accepted
 True     73052
 False    11674
 Name: count, dtype: int64,
 action_taken
 1    69923
 3    11612
 2     3129
 7       62
 Name: count, dtype: int64)
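
These counts show a strong class imbalance, which gives a useful reference point for the accuracy scores in the following sections: a classifier that predicts "accepted" for every application would already be right most of the time. A quick calculation from the counts printed above (computed on the full filtered set, so the share in any particular test split may differ slightly):

```python
# Counts copied from the value_counts() output above.
accepted = 73052
denied = 11674

# Accuracy of a classifier that always predicts "accepted".
baseline = accepted / (accepted + denied)
print(round(baseline, 4))  # 0.8622
```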

### 4.2. Model 1: Logistic Regression

In [71]:
# logistic regression
lr = sk.linear_model.LogisticRegression(max_iter = 100000)

lr.fit(X_train,y_train)
y_pred = lr.predict(X_test)  #predict on test set 
# check model accuracy
acc_lr = sk.metrics.accuracy_score(y_pred,y_test)
print("Model accuracy: ", acc_lr)

Model accuracy:  0.8620323380148708


### 4.3. Model 2: Decision Tree

In [72]:
# decision tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics # check model accuracy

dt = DecisionTreeClassifier()
dt.fit(X_train,y_train)
y_pred = dt.predict(X_test) #predict on test set 
acc_dt = metrics.accuracy_score(y_pred,y_test)
print("Model accuracy: ", acc_dt)

Model accuracy:  0.7791219166765019


### 4.4. Model 3: K-Nearest Neighbors

In [73]:
# knn
from sklearn.neighbors import KNeighborsClassifier # KNN

knc = KNeighborsClassifier(n_neighbors=3) # for k=3
knc.fit(X_train,y_train)
y_pred = knc.predict(X_test)  #predict on test set 
acc_knn = sk.metrics.accuracy_score(y_pred,y_test)  # check model accuracy
print("Model accuracy: ", acc_knn)

Model accuracy:  0.8321727841378497


### 4.5. Comparing Accuracy

We use accuracy to help us determine how well the models are predicting outcomes in comparison to the results of the real underwriting algorithm.

In [74]:
# plot overall accuracy comparison between models
models = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree','K-Nearest Neighbours'],
    'Score': [acc_lr, acc_dt, acc_knn]})
models.sort_values(by='Score', ascending=False)

|   | Model | Score |
|---|---|---|
| 0 | Logistic Regression | 0.862032 |
| 2 | K-Nearest Neighbours | 0.832173 |
| 1 | Decision Tree | 0.779122 |


To test our hypothesis that the underwriting process reflected in the HMDA data is biased against Black applicants, we compare accuracy across races for each of the models we trained, all of which use the same input features.

In [75]:
# split data and get predictions between Black and white applicants
x_var = ["debt_to_income_ratio","income","loan_to_value_ratio","loan_amount","property_value"]

test_b = test.loc[test['derived_race'] == "Black or African American"]
test_w = test.loc[test['derived_race'] == "White"]

X_test_b, X_test_w = test_b[x_var], test_w[x_var]
y_test_b, y_test_w = test_b['binary_accepted'], test_w['binary_accepted']

# knn
knn_b = knc.predict(X_test_b)
knn_w = knc.predict(X_test_w)

# lr
lr_b = lr.predict(X_test_b)
lr_w = lr.predict(X_test_w)

# dt
dt_b = dt.predict(X_test_b)
dt_w = dt.predict(X_test_w)

In [76]:
# calculate accuracy
# knn
acc_knn_b = sk.metrics.accuracy_score(knn_b, y_test_b)
acc_knn_w = sk.metrics.accuracy_score(knn_w, y_test_w)
knn_diff = acc_knn_w - acc_knn_b
# lr
acc_lr_b = sk.metrics.accuracy_score(lr_b, y_test_b)
acc_lr_w = sk.metrics.accuracy_score(lr_w, y_test_w)
lr_diff = acc_lr_w - acc_lr_b
# dt
acc_dt_b = sk.metrics.accuracy_score(dt_b, y_test_b)
acc_dt_w = sk.metrics.accuracy_score(dt_w, y_test_w)
dt_diff = acc_dt_w - acc_dt_b
# plot differences in accuracy for race between models
models = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree','K-Nearest Neighbours'],
    'Score': [lr_diff, dt_diff, knn_diff]})
models.sort_values(by='Score', ascending=False)

|   | Model | Score |
|---|---|---|
| 0 | Logistic Regression | 0.085136 |
| 2 | K-Nearest Neighbours | 0.074386 |
| 1 | Decision Tree | 0.044311 |


Here we can see the difference in accuracy between white and Black applicants (white minus Black) for each model. The logistic regression model, which was the most accurate overall, also shows the largest racial gap. If they are not already doing so, it would be beneficial for the underwriting algorithms behind the HMDA data to evaluate outcomes, such as which loans were paid back, separately by race.
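
The per-group comparison can be wrapped in a small helper. The sketch below uses invented labels and a hypothetical function name, `accuracy_by_group`; it is not part of the original notebook. It also illustrates one way an accuracy gap can arise: a model that approves every application is less accurate in whichever group has more denials.

```python
import numpy as np

def accuracy_by_group(y_true, y_pred, group):
    """Return {group value: accuracy} for each distinct value in `group`."""
    y_true, y_pred, group = (np.asarray(a) for a in (y_true, y_pred, group))
    out = {}
    for g in np.unique(group):
        mask = group == g
        out[g.item()] = float(np.mean(y_true[mask] == y_pred[mask]))
    return out

# Invented labels: 1 = accepted, 0 = denied; group 1 = white, 0 = Black.
y_true = [1, 0, 1, 0, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1]  # a model that approves everyone
group = [0, 0, 0, 0, 1, 1, 1, 1]

acc = accuracy_by_group(y_true, y_pred, group)
gap = acc[1] - acc[0]  # white minus Black, as in the table above
print(acc, gap)
```

In this toy case the gap comes entirely from the different denial rates in the two groups, which is one reason to read accuracy gaps alongside selection rate, FPR, and FNR.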

### 4.6. Another Modeling Approach

Here, we retrain our models, this time including binary race (1 for white, 0 for Black) explicitly as an input. We then report fairness metrics for each of those models.

In [77]:
X_train, X_test = train[["debt_to_income_ratio","income","loan_to_value_ratio","loan_amount","property_value", "binary_race"]], \
test[["debt_to_income_ratio","income","loan_to_value_ratio","loan_amount","property_value", "binary_race"]] 

dt_race = DecisionTreeClassifier()
dt_race.fit(X_train,y_train)
y_pred_dt = dt_race.predict(X_test)

knn_race = KNeighborsClassifier(n_neighbors=3)
knn_race.fit(X_train,y_train)
y_pred_knn = knn_race.predict(X_test)

lr_race = sk.linear_model.LogisticRegression(max_iter = 100000)
lr_race.fit(X_train,y_train)
y_pred_lr = lr_race.predict(X_test)

In [78]:
# define fairness metrics
metrics = {
    'accuracy' : accuracy_score,
    'selection rate' : selection_rate,
    'FPR' : false_positive_rate,
    'FNR' : false_negative_rate
}
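
For readers unfamiliar with these metrics, the sketch below computes selection rate, FPR, and FNR by hand from confusion-matrix counts, using their standard definitions: selection rate is the share of positive predictions, FPR = FP / (FP + TN), and FNR = FN / (FN + TP). The helper name `rates` and the labels are invented for illustration, and the sketch uses NumPy rather than fairlearn.

```python
import numpy as np

def rates(y_true, y_pred):
    """Selection rate, false positive rate and false negative rate for binary labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)    # accepted, predicted accepted
    fp = np.sum(~y_true & y_pred)   # denied, predicted accepted
    fn = np.sum(y_true & ~y_pred)   # accepted, predicted denied
    tn = np.sum(~y_true & ~y_pred)  # denied, predicted denied
    return {
        "selection rate": float(y_pred.mean()),
        "FPR": float(fp / (fp + tn)),
        "FNR": float(fn / (fn + tp)),
    }

# Invented example: 1 = accepted, 0 = denied.
example = rates(y_true=[1, 1, 1, 0, 0], y_pred=[1, 1, 0, 1, 0])
print(example)
```

Here "positive" means accepted, so a high FPR means the model predicts approval for many applications that were in fact denied.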

In [79]:
# Bias report on logistic regression model    
mf_lr = MetricFrame(metrics = metrics, y_true=y_test, y_pred=y_pred_lr, sensitive_features=X_test['binary_race'])
mf_lr.by_group

| binary_race | accuracy | selection rate | FPR | FNR |
|---|---|---|---|---|
| 0 | 0.788295 | 0.996438 | 0.988067 | 0.001294 |
| 1 | 0.871838 | 0.999132 | 0.997911 | 0.000689 |


In [80]:
# Bias report on k-nearest neighbors model
mf_knn = MetricFrame(metrics = metrics, y_true=y_test, y_pred=y_pred_knn, sensitive_features=X_test['binary_race'])
mf_knn.by_group

| binary_race | accuracy | selection rate | FPR | FNR |
|---|---|---|---|---|
| 0 | 0.766921 | 0.932316 | 0.887828 | 0.055627 |
| 1 | 0.841065 | 0.936586 | 0.873629 | 0.054186 |


In [81]:
# Bias report on decision tree model
mf_dt = MetricFrame(metrics = metrics, y_true=y_test, y_pred=y_pred_dt, sensitive_features=X_test['binary_race'])
mf_dt.by_group

| binary_race | accuracy | selection rate | FPR | FNR |
|---|---|---|---|---|
| 0 | 0.70229 | 0.760814 | 0.637232 | 0.205692 |
| 1 | 0.788399 | 0.853882 | 0.756136 | 0.131792 |
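
The notebook imports `demographic_parity_ratio` but the cells shown do not call it. As fairlearn defines it, this is the ratio of the smallest to the largest group-level selection rate, so a value of 1 means equal selection rates. The sketch below computes that ratio by hand from the selection rates reported in the three tables above; the dictionary layout and the helper name `parity_ratio` are illustrative assumptions.

```python
# Selection rates copied from the by_group tables above (0 = Black, 1 = white).
selection_rates = {
    "Logistic Regression": {"Black": 0.996438, "White": 0.999132},
    "K-Nearest Neighbors": {"Black": 0.932316, "White": 0.936586},
    "Decision Tree": {"Black": 0.760814, "White": 0.853882},
}

def parity_ratio(rates_by_group):
    """Smallest group selection rate divided by the largest."""
    values = list(rates_by_group.values())
    return min(values) / max(values)

ratios = {model: parity_ratio(r) for model, r in selection_rates.items()}
for model, ratio in ratios.items():
    print(f"{model}: {ratio:.3f}")
```

By this measure the decision tree shows the largest gap in selection rates, while the logistic regression and KNN models select both groups at nearly the same, very high rate.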


## 5. Codebook
Variable definitions can be found below. 

In [83]:
# Creating a table of variables and their description

data = [
    ["action_taken", "Final decision made on the loan application (e.g., approved, denied).",
     "(int) 1 (Loan originated), 2 (Application approved but not accepted), 3 (Application denied), 4 (Application withdrawn by applicant), 5 (File closed for incompleteness), 6 (Purchased loan)."],
    ["derived_race", "Race of the applicant as determined by the data system", "(object)"],
    ["derived_ethnicity", "Ethnicity of the applicant as derived from application data", "(object)"],
    ["applicant_sex", "Sex of the primary loan applicant", "(int)"],
    ["applicant_age", "Age of the primary loan applicant", "(object) '<25', '25-34', '35-44', '45-54', '55-64', and '>65'."],
    ["income", "Applicant’s annual income (in thousands of dollars)", "(Float)"],
    ["debt_to_income_ratio", "Ratio of applicant's monthly debt payments to income", "(Float) '<20%', '20%-<30%', '30%-<36%', '36%-<40%', '40%-<45%', '45%-<50%', '50%-60%', and '>60%'."],
    ["applicant_credit_score_type", "Type of credit score used for the applicant", "(int)"],
    ["loan_amount", "Total amount of the loan applied for", "(Float)"],
    ["loan_to_value_ratio", "Ratio of the loan amount to the appraised value of the property", "(Float)"],
    ["interest_rate", "Interest rate charged on the loan", "(Float)"],
    ["rate_spread", "Difference between the loan’s interest rate and the average prime offer rate", "(Object)"],
    ["loan_type", "Type of loan (e.g., conventional, FHA, VA, etc.)", "(int) 1 (Conventional), 2 (FHA), 3 (VA), 4 (RHS or FSA)."],
    ["loan_purpose", "Purpose of the loan (e.g., home purchase, refinancing, etc.)", "(int) 1 (Home purchase), 2 (Home improvement), 31 (Refinancing), 32 (Cash-out refinancing), 4 (Other purpose), 5 (Not applicable)."],
    ["lien_status", "Indicates whether the loan is a first or subordinate lien", "(int) 1 (First lien), 2 (Subordinate lien)."],
    ["property_value", "Appraised value of the property backing the loan", "(Float)"],
    ["occupancy_type", "Indicates whether the property is owner-occupied, rental, etc.", "(int) 1 (Principal residence), 2 (Second residence), 3 (Investment property)."],
    ["tract_minority_population_percent", "Percentage of minority population in the census tract", "(Float)"],
    ["aus-1", "Automated Underwriting System used for the loan decision", "(int) 1 (Desktop Underwriter), 2 (Loan Prospector), 3 (Technology Open to Approved Lenders), 4 (Guaranteed Underwriting System), 5 (Other), 6 (Not applicable)."],
    ["denial_reason-1", "Primary reason for denial if the loan was denied", "(int) 1 (Debt-to-income ratio), 2 (Employment history), 3 (Credit history), 4 (Collateral), 5 (Insufficient cash), 6 (Unverifiable information), 7 (Credit application incomplete), 8 (Mortgage insurance denied), 9 (Other), 10 (Not applicable)."]
]

headers = ["Variable Name", "Description", "Type/Values"]
print(tabulate(data, headers=headers, tablefmt="grid"))

+-----------------------------------+------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable Name                     | Description                                                                  | Type/Values                                                                                                                                                                                                                                        |
| action_taken                      | Final decision made on the loan application (e.g., approved, denied).        | (int) 1 (Loan originated), 2 (Application approved but not accepted), 3 (Application denied), 4 (Application withdrawn by applicant), 5 (File closed for incomp