# The Markup Denied Mortgage Case Study

<a href="http://creativecommons.org/licenses/by-nc/4.0/" rel="license"><img style="border-width: 0;" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" alt="Creative Commons License" /></a>
This tutorial is licensed under a <a href="http://creativecommons.org/licenses/by-nc/4.0/" rel="license">Creative Commons Attribution-NonCommercial 4.0 International License</a>.

# Overview

## Acknowledgements

This lab/tutorial is based on the research and technical documentation for *The Markup's* "Denied" investigation:
- [The Secret Bias Hidden in Mortgage-Approval Algorithms](https://themarkup.org/denied/2021/08/25/the-secret-bias-hidden-in-mortgage-approval-algorithms) and [Dozens of Mortgage Lenders Showed Significant Disparities; Here Are the Worst](https://themarkup.org/denied/2021/08/25/dozens-of-mortgage-lenders-showed-significant-disparities-here-are-the-worst)
- Methodology: [How We Investigated Racial Disparities in Federal Mortgage Data](https://themarkup.org/show-your-work/2021/08/25/how-we-investigated-racial-disparities-in-federal-mortgage-data)
- GitHub repository: [the-markup/investigation-redlining](https://github.com/the-markup/investigation-redlining?tab=readme-ov-file)

# Materials

## Scripts

`clean_data.py`: This Python file contains all the functions used to clean the geographic fields, the race and ethnicity columns, and action taken columns, among others. It also finds and flags co-applicants among five different fields.

`categorize_data.py`: This Python file contains all the functions that standardize the columns used in the regression, including debt-to-income ratio and combined loan-to-value ratio, among others.

`use_regression.py`: This Python file contains all the functions needed to run the regression and other statistical tests.

[Download from GitHub](https://github.com/the-markup/investigation-redlining/tree/main/utils). Code to download in your notebook is included below.

In [1]:
# Download the three helper scripts into the notebook's working directory
import urllib.request
urllib.request.urlretrieve("https://raw.githubusercontent.com/the-markup/investigation-redlining/main/utils/categorize_data.py", "categorize_data.py")
urllib.request.urlretrieve("https://raw.githubusercontent.com/the-markup/investigation-redlining/main/utils/clean_data.py", "clean_data.py")
urllib.request.urlretrieve("https://raw.githubusercontent.com/the-markup/investigation-redlining/main/utils/use_regression.py", "use_regression.py")

('use_regression.py', <http.client.HTTPMessage at 0x7d66d8272530>)

In [2]:
from categorize_data import *
from clean_data import *
from use_regression import *

## Census Data

`propValue`
- We used 2019 American Community Survey data (table B25077) for the property values for each county in the country. We downloaded the data from the Census and included the raw dataset.

`metro`
- We used 2019 American Community Survey data for the metro area populations, which we downloaded from the Census website and acquired through the Census API.

`counties`
- We used a Census dataset that lists all counties in the country and the respective metro area that they belong to. The raw dataset is included in the repository; the cell below loads the cleaned version. We used this dataset to map counties in HMDA data to their respective metro areas while incorporating the population categories for each metro area.

`demo`
- Tract-level racial and ethnic demographic percentages for 2019, loaded from the repository's `racial_ethnic_demographics` folder.


In [None]:
import pandas as pd
counties = pd.read_csv("https://raw.githubusercontent.com/the-markup/investigation-redlining/main/data/census_data/county_to_metro_crosswalk/clean/all_counties_210804.csv")
metro = pd.read_csv("https://raw.githubusercontent.com/the-markup/investigation-redlining/main/data/census_data/metro_area_pop/raw/metro_division_pop2019.csv")
propValue = pd.read_csv("https://raw.githubusercontent.com/the-markup/investigation-redlining/main/data/census_data/property_values/ACSDT5Y2019.B25077_data_with_overlays_2021-06-23T115616.csv")
demo = pd.read_csv("https://raw.githubusercontent.com/the-markup/investigation-redlining/main/data/census_data/racial_ethnic_demographics/clean/tract_race_pct2019_210204.csv")

## CFPB Data

Original 2019 HMDA source data from the [CFPB website](https://ffiec.cfpb.gov/data-publication/dynamic-national-loan-level-dataset/2019).
- [Data dictionary](https://ffiec.cfpb.gov/documentation/publications/loan-level-datasets/public-lar-schema)
- [Other field-level documentation](https://ffiec.cfpb.gov/documentation/publications/loan-level-datasets/lar-data-fields)

The raw data here is over `6 GB`, so we'll work with the output of the reporting team's filtering and reshaping. The broad strokes of their workflow:
- Standardize data
- Standardize applicant and co-applicant race/ethnicity
- Standardize credit models
- Standardize co-applicant info
- Standardize outcomes
- Connect lender info to mortgage info

To unpack the full data processing workflow:
- 1: Data Cleaning
  * [Reporting team's Jupyter Notebook](https://github.com/the-markup/investigation-redlining/blob/main/notebooks/process/1_clean_data.ipynb)
  * [Prof. Walden's version of their notebook](https://colab.research.google.com/drive/1406la8gg4v7u8LBstU9Ec1GQrKVNe1Da?usp=sharing)
- 2: Data Categorizing
  * [Reporting team's Jupyter Notebook](https://github.com/the-markup/investigation-redlining/blob/main/notebooks/process/2_categorize_data.ipynb)
  * [Prof. Walden's version of their notebook](https://colab.research.google.com/drive/1_JYDVnb-ThLFSgwbh_B6qNgMg8fc0ib6?usp=sharing)

### `Dask`

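We're going to use `dask` as a memory-friendly alternative to `pandas` for some components of this workflow. Pandas loads an entire data file into memory at once, which can exhaust available memory on a dataset this large. Dask instead splits the file into partitions and evaluates operations lazily, only when `.compute()` is called.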
- [Dask documentation](https://www.dask.org/)
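To make the memory argument concrete, here is a minimal sketch of partition-at-a-time processing using plain pandas chunking. It is not dask and not the approach used in this notebook; it only illustrates the idea of filtering a file piece by piece so the full table never has to sit in memory. The toy rows, the chunk size, and the filter (conventional loans with positive income that were originated or denied) are assumptions chosen for illustration.

```python
import io
import pandas as pd

# Toy stand-in for the HMDA file (assumed values, not real records)
csv_text = """loan_type,income,loan_outcome
1,85,1
2,60,1
1,0,3
1,42,3
1,120,4
"""

kept = []
# chunksize=2 makes read_csv yield DataFrames of at most two rows each
for chunk in pd.read_csv(io.StringIO(csv_text), dtype=str, chunksize=2):
    chunk["income"] = pd.to_numeric(chunk["income"])
    mask = (
        (chunk["loan_type"] == "1")
        & (chunk["income"] > 0)
        & chunk["loan_outcome"].isin(["1", "3"])
    )
    kept.append(chunk[mask])

# Only the rows that survive the filter are ever held together
filtered = pd.concat(kept, ignore_index=True)
print(len(filtered))
```

Dask automates this pattern: it handles the partitioning and defers the work until a result is requested.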

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
import dask.dataframe as dtf, pandas as pd
hmda19_df = dtf.read_csv("/content/drive/MyDrive/2-S24/EoC II/for-students/mar21/hmda.csv", dtype=str) # load data
# hmda19_df = hmda19_df.compute()

# Processing

Step #1: Filter for conventional originations and denials where income is above 0

In [None]:
# hmda19_df['income'] = pd.to_numeric(hmda19_df['income'])
hmda19_df['income'] = dtf.to_numeric(hmda19_df['income'])

hmda19_df2 = hmda19_df[(hmda19_df['loan_type'] == '1') & (hmda19_df['income'] > 0) &\
                       ((hmda19_df['loan_outcome'] == '1') | (hmda19_df['loan_outcome'] == '3'))].copy()

# print(len(hmda19_df2))

Step #2: Create Dummy Variables for Regression

Select columns for dummy variables

In [None]:
regression_cols = [{'loan_outcome': {'denied': ['3']}},

                   ### Reference: White
                   {'app_race_ethnicity': {'black': ['3'], 'latino': ['6'], 'asian': ['2'], 'native': ['1'],
                                           'pac_islander': ['4'], 'race_na': ['7'], 'asian_cb': ['2', '4']}},

                   ### Reference: Coapplicant
                   {'co_applicant': {'no_coapplicant': ['2'], 'na_coapplicant': ['3']}},

                   ### Reference: Male
                   {'applicant_sex_cat': {'female': ['2'], 'sex_na': ['3', '6']}},

                   ### Reference: Between 35-44 or Between 35-54
                   {'applicant_age_cat': {'less_than25': ['1'], 'between25_34': ['2'],
                                          'between45_54': ['4'], 'between55_64': ['5'], 'between65_74': ['6'],
                                          'greater74': ['7'], 'age_na': ['8'],
                                          'younger_than_34': ['1', '2'], 'older_than_55': ['5', '6', '7'],
                                          'older_than65': ['6', '7']}},

                   ### Reference: Bucket 2 & 3
                   {'prop_value_cat': {'pvr_bucket1': ['1'], 'pvr_bucket4': ['4'], 'pvr_bucket5': ['5'],
                                        'pvr_bucket6': ['6'], 'pvr_bucket_none': ['7']}},


                   ### Reference: 30yr Mortgage
                   {'mortgage_term': {'less30yrs_mortgage': ['2'], 'more30yrs_mortgage': ['3'],
                                      'mortgage_term_na': ['4'], 'not30yr_mortgage': ['2', '3']}},

                   ### Reference: TransUnion
                   {'app_credit_model': {'equifax': ['1'], 'experian': ['2'], 'other_model': ['4', '6'],
                                         'more_than_one': ['5'], 'model_na': ['7']}},

                   {'dti_cat': {'dti_manageable': ['2'], 'dti_unmanageable': ['3'],
                                'dti_struggling': ['4'], 'dti_na': ['5', '6']}},

                   ### Reference: 20 pct downpayment
                   {'downpayment_flag': {'less20pct_downpayment': ['2'],'downpayment_na': ['3', '5']}},

                   ### Reference: Upper LMI
                   {'lmi_def': {'low_lmi': ['1'], 'moderate_lmi': ['2'], 'middle_lmi': ['3'], 'na_lmi': ['5']}},

                   ### Reference: White Cat 1
                   {'diverse_def': {'white_cat2': ['2'], 'white_cat3': ['3'], 'white_cat4': ['4'],
                                      'white_cat_na': ['0', '5']}},

                   ### Reference: Banks
                   {'lender_def': {'credit_union': ['2'], 'independent': ['3'],  'lender_na': ['4', '6']}},

                   ### Reference: Desktop
                   {'main_aus': {'non_desktop': ['2', '3', '4', '5', '6'], 'aus_na': ['7']}},

                   ### Reference: 99th Percentile
                   {'metro_percentile': {'metro_90th': ['9'], 'metro_80th': ['8'],
                                         'metro_70th': ['7'], 'metro_60th': ['6'], 'metro_50th': ['5'],
                                         'metro_40th': ['4'], 'metro_30th': ['3'], 'metro_20th': ['2'],
                                         'metro_10th': ['1'], 'metro_less10th': ['0'], 'micro_area': ['111'],
                                         'metro_none': ['000']}}]

In [None]:
continous_vars = ['income_log', 'loan_log', 'lar_count', 'property_value_ratio', 'prop_zscore']

for continuous_var in continous_vars:
    # hmda19_df2[continuous_var] = pd.to_numeric(hmda19_df2[continuous_var])
    # Convert on hmda19_df2 (the filtered frame used from here on), not the unfiltered hmda19_df
    hmda19_df2[continuous_var] = dtf.to_numeric(hmda19_df2[continuous_var])

In [None]:
"""
for columns in regression_cols:
    ### Function to create dummy variables
    hmda19_df2 = create_dummy_vars(hmda19_df2, columns)
"""

def create_dummy_vars2(df, columns):
    for column in columns:
        dummy_vars = columns[column]
        for dummy_var in dummy_vars:
            var_value = dummy_vars[dummy_var]
            df[dummy_var] = (df[column].isin(var_value)).astype(int)
    return df
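A small worked example may help show what `create_dummy_vars2` produces. The sketch below repeats the function so it runs on its own and applies it to a toy pandas DataFrame. The five rows are made up, and treating code `'5'` as the reference group is an assumption based on it having no dummy in the race/ethnicity mapping.

```python
import pandas as pd

def create_dummy_vars2(df, columns):
    # Same logic as the function above, repeated so this sketch is self-contained
    for column in columns:
        dummy_vars = columns[column]
        for dummy_var in dummy_vars:
            var_value = dummy_vars[dummy_var]
            df[dummy_var] = (df[column].isin(var_value)).astype(int)
    return df

# Toy rows; codes are strings because the data was loaded with dtype=str
toy = pd.DataFrame({"app_race_ethnicity": ["5", "3", "2", "4", "6"]})

# One mapping entry: column name -> {dummy name: [codes that set the dummy to 1]}
spec = {"app_race_ethnicity": {"black": ["3"], "latino": ["6"], "asian_cb": ["2", "4"]}}

toy = create_dummy_vars2(toy, spec)
print(toy)
```

Each dummy is 1 when the row's code appears in that dummy's list and 0 otherwise, so `asian_cb` fires for both `'2'` and `'4'`, and a row with no matching code gets 0 in every dummy.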

In [None]:
for columns in regression_cols:
    ### Function to create dummy variables
    hmda19_df2 = create_dummy_vars2(hmda19_df2, columns)

Independent Variables

In [None]:
variables = ['black', 'latino', 'asian_cb', 'native', 'race_na',
             'no_coapplicant', 'na_coapplicant',
             'female', 'sex_na',
             'less_than25', 'between25_34', 'between45_54', 'between55_64', 'older_than65', 'age_na',
             'income_log', 'loan_log',
             'pvr_bucket1', 'pvr_bucket4', 'pvr_bucket5', 'pvr_bucket6', 'pvr_bucket_none',
             'less30yrs_mortgage', 'more30yrs_mortgage', 'mortgage_term_na',
             'equifax', 'experian', 'other_model', 'more_than_one', 'model_na',
             'dti_manageable', 'dti_unmanageable', 'dti_struggling', 'dti_na',
             'less20pct_downpayment','downpayment_na',
             'moderate_lmi', 'middle_lmi', 'low_lmi', 'na_lmi',
             'credit_union', 'independent',  'lender_na',
             'lar_count',
             'non_desktop', 'aus_na',
             'white_cat2', 'white_cat3', 'white_cat4', 'white_cat_na',
             'metro_90th', 'metro_80th', 'metro_70th', 'metro_60th', 'metro_50th', 'metro_40th',
             'metro_30th', 'metro_20th', 'metro_10th', 'metro_less10th', 'micro_area', 'metro_none']

print(len(variables))

62


Step #3: Run Collinearity Test

In [None]:
hmda_independent_vars = hmda19_df2[variables]
hmda_independent_vars = hmda_independent_vars.compute()
hmda_independent_vars.head()

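| | black | latino | asian_cb | native | race_na | no_coapplicant | na_coapplicant | female | sex_na | less_than25 | ... | metro_70th | metro_60th | metro_50th | metro_40th | metro_30th | metro_20th | metro_10th | metro_less10th | micro_area | metro_none |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |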


In [None]:
import pandas as pd
import statsmodels.formula.api as smf
from tqdm import tqdm

def calculate_vif2(independent_df):
    vif_list = []
    x_cols = independent_df.columns

    for x_col in tqdm(x_cols):
        x = independent_df[x_col]
        y = independent_df.drop(columns=x_col)

        # One-hot encode categorical columns
        y = pd.get_dummies(y)  # Example: One-hot encoding

        formula = f"{x.name} ~ {' + '.join(y.columns)}"
        rsq = smf.ols(formula, data=independent_df).fit().rsquared
        vif = round(1 / (1 - rsq), 2)

        if vif > 2.5:
            threshold_flag = '1'  # Indicates a concern
        else:
            threshold_flag = '0'  # Indicates no concern

        var_dict = {'independent_var': x_col, 'vif': vif, 'threshold': threshold_flag}
        vif_list.append(var_dict)

    vif_df = pd.DataFrame(vif_list)
    return vif_df
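The core of `calculate_vif2` is the formula VIF = 1 / (1 - R²), where R² comes from regressing one independent variable on all the others. The short sketch below works through that arithmetic in isolation. The helper names are hypothetical and exist only for this illustration.

```python
def vif_from_rsquared(rsq):
    """Variance inflation factor for a predictor whose auxiliary regression has R^2 = rsq."""
    return 1.0 / (1.0 - rsq)

def rsquared_from_vif(vif):
    """Invert the formula: the R^2 implied by a given VIF."""
    return 1.0 - 1.0 / vif

# The 2.5 cutoff used above corresponds to an auxiliary R^2 of 0.6
print(round(vif_from_rsquared(0.6), 2))
print(round(rsquared_from_vif(2.5), 2))

# A VIF such as 11.47 means the other variables explain about 91% of that predictor
print(round(rsquared_from_vif(11.47), 3))
```

Read this way, the 2.5 threshold flags any variable for which the remaining predictors explain more than 60 percent of its variance.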

In [None]:
vif_df = calculate_vif2(hmda_independent_vars)


Variables that are above the 2.5 threshold

In [None]:
vif_df[(vif_df['threshold'] == '1')].sort_values(by = ['vif'], ascending = False)

| | independent_var | vif | threshold |
|---|---|---|---|
| 21 | pvr_bucket_none | 11.47 | 1 |
| 33 | dti_na | 8.72 | 1 |
| 35 | downpayment_na | 7.93 | 1 |
| 24 | mortgage_term_na | 7.03 | 1 |
| 16 | loan_log | 2.72 | 1 |
| 15 | income_log | 2.68 | 1 |


Remove variables with high VIFs
- Keeping income, loan and metro_90th

In [None]:
to_keep = ['income_log', 'loan_log', 'metro_90th']

highvif_vars = vif_df[(vif_df['threshold'] == '1') & ~(vif_df['independent_var'].isin(to_keep))]\
              ['independent_var'].unique().tolist()

variables2 = [var for var in variables if var not in highvif_vars]

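To see which variables this step actually drops, the sketch below applies the same keep-list logic to the six flagged rows from the VIF table above, using plain Python lists instead of the `vif_df` DataFrame. It is an illustration of the selection only, not a replacement for the cell.

```python
# Flagged rows copied from the VIF table above: (independent_var, vif)
flagged = [
    ("pvr_bucket_none", 11.47),
    ("dti_na", 8.72),
    ("downpayment_na", 7.93),
    ("mortgage_term_na", 7.03),
    ("loan_log", 2.72),
    ("income_log", 2.68),
]

to_keep = ["income_log", "loan_log", "metro_90th"]

# A flagged variable is dropped unless it is explicitly protected by to_keep
highvif_vars = [name for name, vif in flagged if name not in to_keep]
print(highvif_vars)
```

Only the four `_na`/`_none` indicator variables are removed; `income_log` and `loan_log` stay in the model even though they sit just above the threshold.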

Step #4: Filter Out Records Flagged by the High-VIF and Other NA Variables
- Property Value Ratio NA
- Mortgage Term NA
- DTI NA
- Downpayment NA
- LMI NA
- White Cat NA

In [None]:
hmda19_df3 = hmda19_df2[(hmda19_df2['prop_value_cat'] != '7') & (hmda19_df2['mortgage_term'] != '4') &
 (hmda19_df2['dti_cat'] != '5') & (hmda19_df2['dti_cat'] != '6') &
  (hmda19_df2['downpayment_flag'] != '3') & (hmda19_df2['lmi_def'] != '5') &
   (hmda19_df2['diverse_def'] != '0') & (hmda19_df2['diverse_def'] != '5')].copy()

Also filtering out CLTV above 100

In [None]:
# hmda19_df3['combined_loan_to_value_ratio'] = pd.to_numeric(hmda19_df3['combined_loan_to_value_ratio'])
hmda19_df3['combined_loan_to_value_ratio'] = dtf.to_numeric(hmda19_df3['combined_loan_to_value_ratio'])

hmda19_df4 = hmda19_df3[(hmda19_df3['combined_loan_to_value_ratio'] <= 100)]

# print(len(hmda19_df4))

Replace variables

In [None]:
# high_vif_vars = ['pvr_bucket_none', 'mortgage_term_na', 'dti_na', 'downpayment_na', 'na_lmi', 'white_cat_na']

vars_to_removes = ['pvr_bucket1', 'pvr_bucket4', 'pvr_bucket5', 'pvr_bucket6', 'less20pct_downpayment']

variables3 = [var for var in variables2 if var not in vars_to_removes]
variables3.insert(17, 'property_value_ratio')
variables3.insert(28, 'combined_loan_to_value_ratio')


Variables to use

In [None]:
pd.Series(variables3)

0                            black
1                           latino
2                         asian_cb
3                           native
4                          race_na
5                   no_coapplicant
6                   na_coapplicant
7                           female
8                           sex_na
9                      less_than25
10                    between25_34
11                    between45_54
12                    between55_64
13                    older_than65
14                          age_na
15                      income_log
16                        loan_log
17            property_value_ratio
18              less30yrs_mortgage
19              more30yrs_mortgage
20                         equifax
21                        experian
22                     other_model
23                   more_than_one
24                        model_na
25                  dti_manageable
26                dti_unmanageable
27                  dti_struggling
28    combined_loan_

# Analysis

## Regression Formula

In [None]:
regression_formula = create_formula(variables3)
regression_formula

'denied ~ black + latino + asian_cb + native + race_na + no_coapplicant + na_coapplicant + female + sex_na + less_than25 + between25_34 + between45_54 + between55_64 + older_than65 + age_na + income_log + loan_log + property_value_ratio + less30yrs_mortgage + more30yrs_mortgage + equifax + experian + other_model + more_than_one + model_na + dti_manageable + dti_unmanageable + dti_struggling + combined_loan_to_value_ratio + moderate_lmi + middle_lmi + low_lmi + na_lmi + credit_union + independent + lender_na + lar_count + non_desktop + aus_na + white_cat2 + white_cat3 + white_cat4 + white_cat_na + metro_90th + metro_80th + metro_70th + metro_60th + metro_50th + metro_40th + metro_30th + metro_20th + metro_10th + metro_less10th + micro_area + metro_none'
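The string above follows the formula syntax that `statsmodels` accepts: the dependent variable, a tilde, then the independent variables joined by plus signs. The real `create_formula` lives in `use_regression.py` and may differ in detail; the hypothetical `build_formula` below is only a sketch of how such a string can be assembled.

```python
def build_formula(dependent, independents):
    """Hypothetical helper: join variable names into a 'y ~ x1 + x2' formula string."""
    return f"{dependent} ~ " + " + ".join(independents)

# Three variable names borrowed from the list above, for illustration
example_formula = build_formula("denied", ["black", "latino", "income_log"])
print(example_formula)
```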

In [None]:
hmda19_df4 = hmda19_df4.compute()
print('Number of records: ' + str(len(hmda19_df4)))

Number of records: 0


In [None]:
hmda19_df4

Dask DataFrame Structure (`npartitions=28`): the output lists every column, from `activity_year` through `metro_none`, with its dtype (`object` for most raw fields, `int64` for the converted numeric fields and the dummy variables) and shows `...` placeholders in place of values.


In [None]:
# hmda19_df4 is already a pandas DataFrame after the .compute() call above, so .compute() is not called again
model = run_regression(data = hmda19_df4, formula = regression_formula).fit()
model.summary()

ValueError: negative dimensions are not allowed

## Other Models

### Collinearity

In [None]:
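# Build the results table from the fitted model before using it; [1:] skips the first entry (the Intercept)
national_findings_df = convert_results_to_df(model)
cols = national_findings_df['variable_name'].unique().tolist()[1:]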

hmda_independent_vars2 = hmda19_df4[cols]
vif_df2 = calculate_vif(hmda_independent_vars2)

No additional variables are collinear

In [None]:
vif_df2[(vif_df2['threshold'] == '1')]

### Confusion Matrix

In [None]:
calcuate_confusion_matrix(hmda19_df3, model, cols, ['denied'])
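For readers unfamiliar with the term, a confusion matrix counts how often the model's predicted outcome agrees with the actual outcome. The sketch below shows the idea with plain Python. It is not the repository's function; the toy labels, the predicted probabilities, and the 0.5 cutoff are all assumptions made for illustration.

```python
def confusion_counts(actual, predicted_prob, cutoff=0.5):
    """Count true/false negatives and positives for a binary outcome (1 = denied)."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for y, p in zip(actual, predicted_prob):
        pred = 1 if p >= cutoff else 0
        if y == 1 and pred == 1:
            counts["tp"] += 1
        elif y == 1 and pred == 0:
            counts["fn"] += 1
        elif y == 0 and pred == 1:
            counts["fp"] += 1
        else:
            counts["tn"] += 1
    return counts

# Toy data: actual outcomes and a model's predicted denial probabilities
actual = [0, 0, 1, 1, 0, 1]
probs = [0.1, 0.7, 0.8, 0.3, 0.2, 0.9]

counts = confusion_counts(actual, probs)
accuracy = (counts["tp"] + counts["tn"]) / len(actual)
print(counts, round(accuracy, 3))
```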

# Findings



## Race & Ethnicity
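- Black applicants are almost twice as likely to be denied as comparable White applicants (the reference group)
- Latino applicants are about 1.4 times as likely to be denied
- Native American applicants are about 1.7 times as likely to be denied
- Asian/Pacific Islander applicants are about 1.5 times as likely to be denied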

In [None]:
national_findings_df = convert_results_to_df(model)

races = ['black', 'latino', 'native', 'asian_cb']
national_findings_df[(national_findings_df['variable_name'].isin(races))]

## DTI Categories

In [None]:
dti_vars = ['dti_manageable', 'dti_unmanageable', 'dti_struggling']

national_findings_df[(national_findings_df['variable_name'].isin(dti_vars))]

# Exploratory Visualization

Now that we have the results of the filtering/processing, we can explore visualization options.

## National Results

In [3]:
import pandas as pd
national = pd.read_csv("https://raw.githubusercontent.com/the-markup/investigation-redlining/main/findings/national_findings/1_national_findings_210823.csv")
national # inspect output

| | variable_name | pseudo_rsquared | coefficient | standard_error | z_value | p_value | odds_ratio |
|---|---|---|---|---|---|---|---|
| 0 | Intercept | 0.225557 | -8.267021 | 0.2278565 | -36.281695 | 3.1453739999999997e-288 | 0.000257 |
| 1 | black | 0.225557 | 0.601675 | 0.01223366 | 49.181903 | 0.0 | 1.825173 |
| 2 | latino | 0.225557 | 0.36882 | 0.0098623 | 37.396918 | 4.385107e-306 | 1.446027 |
| 3 | asian_cb | 0.225557 | 0.384095 | 0.01077373 | 35.65102 | 2.2720399999999998e-278 | 1.468284 |
| 4 | native | 0.225557 | 0.508317 | 0.04111249 | 12.364051 | 4.090157e-35 | 1.662491 |
| 5 | race_na | 0.225557 | 0.344847 | 0.01225686 | 28.134986 | 3.658354e-174 | 1.411773 |
| 6 | no_coapplicant | 0.225557 | 0.217777 | 0.0066485 | 32.755832 | 2.5078589999999997e-235 | 1.24331 |
| 7 | na_coapplicant | 0.225557 | -0.077277 | 0.07922674 | -0.975389 | 0.3293673 | 0.925634 |
| 8 | female | 0.225557 | -0.074411 | 0.006538793 | -11.379968 | 5.261927e-30 | 0.92829 |
| 9 | sex_na | 0.225557 | 0.045272 | 0.01526447 | 2.965857 | 0.003018404 | 1.046313 |
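The `coefficient`, `z_value`, and `odds_ratio` columns are linked by standard logistic-regression arithmetic: the odds ratio is the exponential of the coefficient, and the z value is the coefficient divided by its standard error. The sketch below reproduces both from four rows of the national findings output; the values are copied from that table, and the small differences that remain come from rounding in the published file.

```python
import math

# (variable_name, coefficient, standard_error) copied from the national findings output
rows = [
    ("black", 0.601675, 0.01223366),
    ("latino", 0.36882, 0.0098623),
    ("asian_cb", 0.384095, 0.01077373),
    ("native", 0.508317, 0.04111249),
]

results = {}
for name, coef, se in rows:
    odds_ratio = math.exp(coef)  # logit coefficient -> odds ratio
    z_value = coef / se          # Wald z statistic
    results[name] = (odds_ratio, z_value)
    print(f"{name}: odds_ratio={odds_ratio:.3f}, z={z_value:.2f}")
```

An odds ratio of roughly 1.8 for `black` is what underlies the "almost twice as likely" finding, holding the other variables in the model constant.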


## Metro Results

In [4]:
metro = pd.read_csv("https://raw.githubusercontent.com/the-markup/investigation-redlining/main/findings/metro_findings/1_metro_findings_200823.csv")
metro # inspect output

| | metro_code | metro_name | metro_pop | metro_apps | metro_type | variable_name | total_count | loan | denied | is_reliable | reliable_note | odds_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35084 | Newark, NJ-PA | 2164575.0 | 15854 | Metropolitan Division | Black | 1011.0 | 838.0 | 173.0 | True | Statistically significant disparity | 1.9 |
| 1 | 35084 | Newark, NJ-PA | 2164575.0 | 15854 | Metropolitan Division | Latino | 1840.0 | 1608.0 | 232.0 | True | Doesn't meet level of disparity | 1.4 |
| 2 | 35084 | Newark, NJ-PA | 2164575.0 | 15854 | Metropolitan Division | Native American | 23.0 | 22.0 | 1.0 | False | Not statistically significant | 0.6 |
| 3 | 35084 | Newark, NJ-PA | 2164575.0 | 15854 | Metropolitan Division | AAPI | 1627.0 | 1483.0 | 144.0 | True | Statistically significant disparity | 1.7 |
| 4 | 35614 | New York-Jersey City-White Plains, NY-NJ | 11915488.0 | 49623 | Metropolitan Division | Black | 2677.0 | 2239.0 | 438.0 | True | Statistically significant disparity | 1.6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3831 | 17640 | Coco, PR | 28109.0 | 3 | Micropolitan Statistical Area | AAPI | 0.0 | 0.0 | 0.0 | False | No results | |
| 3832 | 27580 | Jayuya, PR | 14539.0 | 1 | Micropolitan Statistical Area | Black | 0.0 | 0.0 | 0.0 | False | No results | |
| 3833 | 27580 | Jayuya, PR | 14539.0 | 1 | Micropolitan Statistical Area | Latino | 1.0 | 0.0 | 1.0 | False | No results | |
| 3834 | 27580 | Jayuya, PR | 14539.0 | 1 | Micropolitan Statistical Area | Native American | 0.0 | 0.0 | 0.0 | False | No results | |


## Lender Findings

In [5]:
lender = pd.read_csv("https://raw.githubusercontent.com/the-markup/investigation-redlining/main/findings/lender_findings/1_lender_findings210823.csv")
lender # inspect output

Unnamed: 0,lei,respondent_name,variable_name,total_count,p_value,odds_ratio
0,5493001SXWZ4OFP8Z903,"DHI MORTGAGE COMPANY, LTD.",latino,2154.0,2.423503e-07,2.037769
1,5493001SXWZ4OFP8Z903,"DHI MORTGAGE COMPANY, LTD.",black,1276.0,1.87591e-10,2.613202
2,549300H3IZO24NSOO931,"EAGLE HOME MORTGAGE, LLC",latino,2837.0,1.388851e-14,2.124229
3,549300H3IZO24NSOO931,"EAGLE HOME MORTGAGE, LLC",black,1281.0,3.007772e-11,2.301073
4,549300MGPZBLQDIL7538,FAIRWAY INDEPENDENT MORTGAGE CORPORATION,black,2014.0,1.438527e-10,2.091842
5,549300LYRWPSYPK6S325,FREEDOM MORTGAGE CORPORATION,latino,1043.0,1.885856e-05,2.243939
6,549300DD4R4SYK5RAQ92,"MOVEMENT MORTGAGE, LLC",latino,2228.0,3.507122e-09,2.119191
7,549300DD4R4SYK5RAQ92,"MOVEMENT MORTGAGE, LLC",black,1289.0,1.383497e-06,2.131013
8,5493003GQDUH26DNNH17,NAVY FEDERAL CREDIT UNION,black,1467.0,6.205845e-15,2.056585
9,5493004WMLN60ZJ2ON46,PULTE MORTGAGE LLC,latino,1296.0,9.610523e-05,2.156586


# Outputs

In [None]:
national_findings_df.to_csv("national_findings.csv", index=False)

In [None]:
cols_to_export = cols + ['denied', 'loan_outcome', 'younger_than_34', 'older_than_55', 'not30yr_mortgage',
                         'metro_code', 'lei', 'app_race_ethnicity', 'app_credit_model', 'property_value_ratio']

In [None]:
hmda19_df5 = hmda19_df4[cols_to_export]
hmda19_df5.to_csv('regression_output.csv', index = False)

# Other Analyses

## Metro-Level Analysis

### Step #1: Data Processing

The data is at the county level, so group the counties by metro.

In [None]:
counties.info() # inspect

In [None]:
metros_df2 = pd.DataFrame(counties.groupby(by = ['metro_code', 'metro_name', 'metro_type', 'metro_pop'],
                          dropna = False).size()).reset_index().rename(columns = {0: 'count'}).\
                          drop(columns = {'count'})

metros_df2['metro_pop'] = pd.to_numeric(metros_df2['metro_pop'])

Filter out the NA values, which make up less than one percent of each column. These values cause problems at the metro level.

- 0: Yes
- 1: No

In [None]:
hmda19_df['na_coapplicant'].value_counts(dropna = False, normalize = True) * 100

In [None]:
hmda19_df['age_na'].value_counts(dropna = False, normalize = True) * 100

In [None]:
hmda19_df['lender_na'].value_counts(dropna = False, normalize = True) * 100

In [None]:
hmda19_df2 = hmda19_df[(hmda19_df['na_coapplicant'] != 0) & (hmda19_df['age_na'] != 0) &\
                       (hmda19_df['lender_na'] != 0)]
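
To show the check-then-filter pattern above on a small scale, here is a sketch with a made-up DataFrame. It assumes, as the list above suggests, that `0` marks a record where the field is NA; the column values are invented for illustration.

```python
import pandas as pd

# Toy data: 0 is assumed to mark an NA record, 1 a usable record.
toy_df = pd.DataFrame({
    "na_coapplicant": [1, 1, 1, 0, 1],
    "age_na":         [1, 1, 1, 1, 1],
    "lender_na":      [1, 0, 1, 1, 1],
})

# Share of each value in percent, counting missing values as well.
share = toy_df["na_coapplicant"].value_counts(dropna=False, normalize=True) * 100
print(share)

# Keep only rows where none of the three flags equals 0.
kept = toy_df[(toy_df["na_coapplicant"] != 0)
              & (toy_df["age_na"] != 0)
              & (toy_df["lender_na"] != 0)]
print(len(kept))
```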

Filter out metros with no code

In [None]:
hmda19_df3 = hmda19_df2[(hmda19_df2['metro_code'].notnull())]

print(len(hmda19_df3))

Set up independent variables

In [None]:
independent_vars = ['black', 'latino', 'native', 'asian_cb', 'race_na',
                    'female', 'sex_na',
                    'no_coapplicant',
                    'younger_than_34', 'older_than_55',
                    'income_log',
                    'loan_log',
                    'property_value_ratio',
                    'not30yr_mortgage',
                    'equifax', 'experian', 'other_model', 'more_than_one', 'model_na',
                    'dti_manageable', 'dti_unmanageable', 'dti_struggling',
                    'combined_loan_to_value_ratio',
                    'low_lmi', 'moderate_lmi', 'middle_lmi',
                    'credit_union', 'independent',
                    'lar_count',
                    'non_desktop', 'aus_na',
                    'white_cat2', 'white_cat3', 'white_cat4']

continuous_vars = ['income_log', 'loan_log', 'combined_loan_to_value_ratio', 'lar_count', 'prop_zscore']

Get variable count for each metro

In [None]:
metros = hmda19_df3['metro_code'].unique()

print('Number of metros: ' + str(len(metros)))

Count all the independent variables for each metro

In [None]:
### Excluding continuous variables from the counting
independent_vars2 = [var for var in independent_vars if var not in continuous_vars]

metro_var_holder = []

for independent_var in independent_vars2:
    index_values = []
    index_values.extend(('metro_code', independent_var))

    metro_var_df = pd.pivot_table(hmda19_df3, index = index_values, columns = ['loan_outcome'],
                                  values = ['denied'], aggfunc = 'count', fill_value = 0).reset_index()

    metro_var_df.columns = metro_var_df.columns.droplevel(0)
    metro_var_df.columns.name = None
    metro_var_df.columns = ['metro_code', 'variable_flag', 'loan', 'denied']
    metro_var_df['variable_name'] = independent_var

    metro_var_holder.append(metro_var_df)

metro_varcount_df = pd.concat(metro_var_holder)
metro_varcount_df['metro_code'].nunique()
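
The counting step above can also be sketched with `groupby` and `unstack`, which makes the column order explicit rather than relying on how the pivot sorts the outcome labels. The toy data and the outcome labels `'loan'` and `'denied'` are assumptions made for this illustration.

```python
import pandas as pd

# Toy applications: one dummy variable and an outcome label per row.
toy = pd.DataFrame({
    "metro_code": ["A", "A", "A", "B", "B", "B"],
    "black": [0, 0, 1, 0, 1, 1],
    "loan_outcome": ["loan", "denied", "loan", "loan", "loan", "denied"],
})

# Count outcomes per metro and flag value, then fix the column order by name.
counts = (
    toy.groupby(["metro_code", "black", "loan_outcome"]).size()
       .unstack("loan_outcome", fill_value=0)
       .reindex(columns=["loan", "denied"], fill_value=0)
       .reset_index()
       .rename(columns={"black": "variable_flag"})
)
counts.columns.name = None
counts["variable_name"] = "black"

print(counts)
```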

Add missing records to the variable count dataframe

In [None]:
metro_varcount_df2 = metro_varcount_df[(metro_varcount_df['variable_flag'] == 0)]
missing_rows_list = []

for metro in metros:
    metro_vars_df = metro_varcount_df2[(metro_varcount_df2['metro_code'] == metro)]
    metro_vars = metro_vars_df['variable_name'].unique()

    ### including the continuous variables
    for reference_var in independent_vars:
        if reference_var not in metro_vars:
            missing_row = pd.DataFrame([[metro, 0, 0, 0, reference_var]], columns = ['metro_code',
                                               'variable_flag', 'loan', 'denied', 'variable_name'])
            missing_rows_list.append(missing_row)

missing_rows_df = pd.concat(missing_rows_list)
metro_varcount_df3 = pd.concat([metro_varcount_df2, missing_rows_df])

Find variable total count and percentages

In [None]:
metro_varcount_df3['total_count'] = metro_varcount_df3['loan'] + metro_varcount_df3['denied']

metro_varcount_df3['loan_pct'] = metro_varcount_df3['loan'].div(metro_varcount_df3['total_count']).multiply(100)

metro_varcount_df3['denied_pct'] = metro_varcount_df3['denied'].\
                                   div(metro_varcount_df3['total_count']).multiply(100)
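
One thing to keep in mind with the percentages above: rows that were zero-filled for missing variables have a `total_count` of 0, so their percentages come out as `NaN` rather than 0. A small sketch with invented numbers:

```python
import pandas as pd

toy_counts = pd.DataFrame({
    "variable_name": ["black", "native"],
    "loan": [80, 0],
    "denied": [20, 0],  # second row mimics a zero-filled "missing" record
})

toy_counts["total_count"] = toy_counts["loan"] + toy_counts["denied"]
toy_counts["denied_pct"] = toy_counts["denied"].div(toy_counts["total_count"]).multiply(100)

print(toy_counts)
```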

### Step #2: Regression

In [None]:
metro_analysis = []
i = 0

for metro in metros:
    print(str(i) + ': Metro: ' + metro)
    metro_df = hmda19_df3[(hmda19_df3['metro_code'] == metro)]
    metro_apps = len(metro_df)

    regression_formula = create_formula(independent_vars)
    model = run_regression(data = metro_df, formula = regression_formula)

    try:
        results = model.fit()
        info = results.mle_retvals['converged']

        results_df = convert_results_to_df(results)
        results_df.insert(0, 'metro_code', metro)
        results_df.insert(1, 'metro_apps', metro_apps)
        results_df.insert(2, 'psuedo_rsquare', results.prsquared)
        results_df['iteration_flag'] = info

    except Exception:  # regression failed; fill this metro's rows with NaN
        independent_nan_list = []
        for regression_var in independent_vars:
            results_dict = {'metro_code': metro, 'metro_apps': metro_apps, 'variable_name': regression_var,
                            'standard_error': np.nan,  'z_value': np.nan, 'p_value': np.nan, 'odds_ratio': np.nan,
                            'iteration_flag': np.nan, 'psuedo_rsquare': np.nan}

            non_results_df = pd.DataFrame([results_dict], columns = results_dict.keys())
            independent_nan_list.append(non_results_df)

        results_df = pd.concat(independent_nan_list)

    metro_analysis.append(results_df)
    i += 1

results_df2 = pd.concat(metro_analysis)
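
The helpers `create_formula`, `run_regression`, and `convert_results_to_df` come from `use_regression.py` and are not reproduced here. As a rough idea of what one pass of the loop involves, the sketch below fits a logistic regression with the `statsmodels` formula API on synthetic data. The formula, the simulated data, and the summary layout are all assumptions for illustration, not The Markup's implementation.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic applications: one dummy variable and one continuous variable.
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "black": rng.integers(0, 2, size=n),
    "income_log": rng.normal(11, 0.5, size=n),
})

# Simulate denials from a known logistic model (made-up coefficients).
logit_p = -1.0 + 0.6 * toy["black"] - 0.8 * (toy["income_log"] - 11)
prob = (1 / (1 + np.exp(-logit_p))).to_numpy()
toy["denied"] = rng.binomial(1, prob)

model = smf.logit("denied ~ black + income_log", data=toy)

try:
    results = model.fit(disp=0)
    summary = pd.DataFrame({
        "coefficient": results.params,
        "p_value": results.pvalues,
        "odds_ratio": np.exp(results.params),
    })
    converged = results.mle_retvals["converged"]
    pseudo_rsquared = results.prsquared
except Exception as err:
    # Mirror the notebook's fallback: record that nothing usable came back.
    summary, converged, pseudo_rsquared = None, False, np.nan
    print("Regression failed:", err)

print(summary)
print(converged, pseudo_rsquared)
```

The convergence flag and pseudo R-squared collected here correspond to the `iteration_flag` and `psuedo_rsquare` columns that the notebook checks later.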

Joining results with metro data. Is every variable accounted for?
- If a variable appears fewer than 959 times, it is missing in certain metros

In [None]:
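# name the count column explicitly so the check works across pandas versions
var_used_check = results_df2['variable_name'].value_counts(dropna = False).rename_axis('variable_name').\
                 reset_index(name = 'metro_count')

var_used_check[(var_used_check['metro_count'] < 959)]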

Join with metro names and var count dataframes

In [None]:
results_df3 = pd.merge(results_df2, metros_df2, how = 'left', on = ['metro_code'])
results_df4 = pd.merge(results_df3, metro_varcount_df3, how = 'left',  on = ['metro_code', 'variable_name'])

Filter for metros that don't produce any results
- Results Flag 1: No meaningful results; all columns show up as NaN

In [None]:
results_df4.loc[((results_df4['psuedo_rsquare'].isnull()) & (results_df4['coefficient'].isnull()) & \
                 (results_df4['standard_error'] .isnull()) & (results_df4['z_value'].isnull()) & \
                 (results_df4['p_value'].isnull()) & (results_df4['odds_ratio'].isnull())),
                 'results_flag'] = '1'

metros_no_results = results_df4[(results_df4['results_flag'] == '1')]['metro_code'].nunique()

all_metros = results_df4['metro_code'].nunique()

### Step #3: Validation

Metro Results Breakdown:
- 709 metros don't produce results
- 250 metros produce results

In [None]:
print('Percent of metros with no results: ' + str(((metros_no_results/all_metros) * 100)))

print('Number of metros that DON\'T produce results: ' + str(metros_no_results))

print('Number of metros that produce results: ' + str(results_df4[(results_df4['results_flag'] != '1')] \
                                                      ['metro_code'].nunique()))

In [None]:
metros_size_df = pd.DataFrame(results_df4[(results_df4['results_flag'] == '1')].\
                 groupby(by = ['metro_code', 'metro_apps']).size()).reset_index().rename(columns = {0: 'count'})

metros_size_df['metro_apps'].describe()

Filter for metros that produce results, but where the equation needs work
- Results Flag 2: No meaningful results because of the equation

In [None]:
results_df4.loc[(results_df4['psuedo_rsquare'] < .1) | (results_df4['iteration_flag'] == False),
                'results_flag'] = '2'

broken_results_metros = results_df4[(results_df4['results_flag'] == '2')]['metro_code'].nunique()

Metro Results Breakdown:
- 121 metros with unreliable results; not enough variance in variables.

In [None]:
print('Number of metros with broken results: ' + str(broken_results_metros))
print('Percentage of metros with broken results: ' + str((broken_results_metros/all_metros) * 100))

Filter for metros where the equation is valid
- Results flag 3: valid results

In [None]:
results_df4.loc[(results_df4['psuedo_rsquare'] >= .1) & (results_df4['iteration_flag'] == True),
                'results_flag'] = '3'

valid_results_metros = results_df4[(results_df4['results_flag'] == '3')]['metro_code'].nunique()

Metro Results Breakdown:
- 128 metros with reliable results.

In [None]:
print('Number of metros with valid results: ' + str(valid_results_metros))
print('Percentage of metros with valid results: ' + str((valid_results_metros/all_metros) * 100))

Overall metros breakdown
- 1: No Results
- 2: Variable Issues
- 3: Results

In [None]:
metro_results_df = pd.DataFrame(results_df4.groupby(by = ['results_flag', 'metro_code']).size()).reset_index().\
                   rename(columns = {0: 'count'})

metro_results_df['results_flag'].value_counts(dropna = False)
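
The three results flags can be summarized as a single rule per metro. The function below is a simplified sketch of that rule, using the 0.1 pseudo R-squared threshold and the convergence flag from the cells above; the function name is hypothetical, and it treats a missing pseudo R-squared as shorthand for "all columns are NaN".

```python
import math

def results_flag(pseudo_rsquared, converged):
    """Return '1', '2', or '3' following the thresholds used above."""
    if pseudo_rsquared is None or math.isnan(pseudo_rsquared):
        return "1"  # no results at all
    if pseudo_rsquared < 0.1 or converged is False:
        return "2"  # results exist, but the equation is a poor fit
    return "3"      # valid results

print(results_flag(float("nan"), None))  # metro with no results
print(results_flag(0.05, True))          # poor fit
print(results_flag(0.20, False))         # did not converge
print(results_flag(0.20, True))          # valid
```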

Assessing the variables of valid metros

Variables where the p-value and z-value are missing
- 63 records

In [None]:
results_df4.loc[(results_df4['results_flag'] == '3') & (results_df4['z_value'].isnull()) &\
                (results_df4['p_value'].isnull()), 'variable_check'] = '1'

results_df4['variable_check'].value_counts(dropna = False)

Metros with a variable that has a missing p-value
- 17 metros are valid but have missing p-values

In [None]:
missing_pvalues_metros_df = results_df4[(results_df4['results_flag'] == '3') & \
                                        (results_df4['variable_check'] == '1')]

print(missing_pvalues_metros_df['metro_code'].nunique())
missing_pvalues_metros_df['metro_name'].value_counts(dropna = False)

Variables that are not statistically significant
- 2693 records

In [None]:
results_df4.loc[(results_df4['results_flag'] == '3') & (results_df4['p_value'] >= .05),
                'variable_check'] = '2'

results_df4['variable_check'].value_counts(dropna = False)

Filter for variables that are statistically significant but need more applications
- Variable Check 3: 448 records

In [None]:
results_df4.loc[(results_df4['results_flag'] == '3') & (results_df4['p_value'] < .05) &\
                (results_df4['total_count'] < 75), 'variable_check'] = '3'

results_df4['variable_check'].value_counts(dropna = False)

Filter for records that are statistically significant but don't meet the level of disparity
- Variable Check 4: 503 records

In [None]:
results_df4.loc[(results_df4['results_flag'] == '3') & (results_df4['p_value'] < .05) &\
                (results_df4['total_count'] >= 75) & (results_df4['odds_ratio'] < 1.45),
                'variable_check'] = '4'

results_df4['variable_check'].value_counts(dropna = False)

Filter for records that are statistically significant with a disparity
- Variable Check 5: 677 records

In [None]:
results_df4.loc[(results_df4['results_flag'] == '3') & (results_df4['p_value'] < .05) &\
                (results_df4['total_count'] >= 75) & (results_df4['odds_ratio'] >= 1.45),
                'variable_check'] = '5'

results_df4['variable_check'].value_counts(dropna = False)
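
The five variable checks above apply in a fixed order, so they can be read as one decision rule. The sketch below restates that rule with the same thresholds (p-value of 0.05, 75 applications, odds ratio of 1.45). The function name is hypothetical, it treats a missing p-value alone as check 1, and the example inputs are invented.

```python
import math

def variable_check(p_value, total_count, odds_ratio):
    """Return the variable check ('1' to '5') for one variable in a valid metro."""
    if p_value is None or math.isnan(p_value):
        return "1"  # missing p-value
    if p_value >= 0.05:
        return "2"  # not statistically significant
    if total_count < 75:
        return "3"  # significant, but not enough applications
    if odds_ratio < 1.45:
        return "4"  # significant, but below the disparity threshold
    return "5"      # statistically significant disparity

# Invented inputs, one per outcome.
print(variable_check(float("nan"), 500, 1.0))
print(variable_check(0.20, 500, 1.8))
print(variable_check(0.01, 50, 2.0))
print(variable_check(0.01, 1840, 1.4))
print(variable_check(0.01, 1011, 1.9))
```

Only checks 4 and 5 are later marked as reliable in the output file.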

Breakdown of statistically valid metros for race and ethnicity
- 96 NaN are all Intercept variables

In [None]:
results_df4[(results_df4['results_flag'] == '3')]['variable_check'].value_counts(dropna = False)

Statistically Valid Metros Race and Ethnicity Breakdown:
- 2: Not statistically significant -- 295 records
- 5: Statistically significant disparity -- 166 records
- 3: Not enough applications -- 25 records
- 4: Small disparities -- 23 records
- 1: Missing p-values -- 3 records

In [None]:
races = ['black', 'latino', 'asian_cb', 'native']

results_df5 = results_df4[(results_df4['results_flag'] == '3') & (results_df4['variable_name'].isin(races))]

results_df5['variable_check'].value_counts(dropna = False)
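
As a sanity check on the race and ethnicity breakdown listed above, the record counts should add up to one record per race or ethnicity variable in each of the 128 valid metros. A short worked check using the documented numbers:

```python
# Record counts copied from the breakdown above, keyed by variable check.
breakdown = {"2": 295, "5": 166, "3": 25, "4": 23, "1": 3}

valid_metros = 128
race_variables = 4  # black, latino, asian_cb, native

total_records = sum(breakdown.values())
expected_records = valid_metros * race_variables

print(total_records, expected_records)
```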

In [None]:
results_df5['metro_code'].nunique()

Breakdown of 128 Metros

In [None]:
metro_results = pd.DataFrame(results_df5.groupby(by = ['metro_code', 'variable_check']).size()).reset_index().\
                rename(columns = {0: 'count'})

Metros with at least one reliable result (variable check 4 or 5):

In [None]:
reliable_metros_df = metro_results[(metro_results['variable_check'] == '5') |\
                                   (metro_results['variable_check'] == '4')]

reliable_metros = reliable_metros_df['metro_code'].unique()

Number of metros where all racial and ethnic variables are not statistically significant
- 21 metros

In [None]:
print(metro_results[~(metro_results['metro_code'].isin(reliable_metros)) & \
                     (metro_results['count'] == 4)]['metro_code'].nunique())

metro_results[~(metro_results['metro_code'].isin(reliable_metros)) & \
               (metro_results['count'] == 4)]['variable_check'].unique()

Number of metros where all racial and ethnic variables are not reliable
- 16 metros
- Because not statistically significant, not enough applications, or missing a p-value

In [None]:
print(metro_results[~(metro_results['metro_code'].isin(reliable_metros)) & \
                     (metro_results['count'] < 4)]['metro_code'].nunique())

metro_results[~(metro_results['metro_code'].isin(reliable_metros)) & \
               (metro_results['count'] < 4)]['variable_check'].unique()

Number of metros with valid results
- 91 metros

In [None]:
valid_results_df = metro_results[(metro_results['variable_check'] == '5') |\
                              (metro_results['variable_check'] == '4')]

valid_results_df['metro_code'].nunique()

Metros with at least one disparity:
- 89 metros

In [None]:
valid_results_df[(valid_results_df['variable_check'] == '5')]['metro_code'].nunique()

Metros where the only valid result is a small disparity
- 2 metros

In [None]:
disparity_metro = valid_results_df[(valid_results_df['variable_check'] == '5')]['metro_code'].unique()

valid_results_df[~(valid_results_df['metro_code'].isin(disparity_metro))]['metro_code'].nunique()
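
The metro groupings above boil down to set operations on metro codes. The sketch below shows the same logic on a handful of hypothetical metros, then checks that the documented counts are consistent with one another.

```python
# (metro_code, variable_check) pairs for hypothetical metros.
toy_pairs = [("M1", "5"), ("M1", "4"), ("M2", "4"), ("M3", "2"), ("M4", "5")]

# Metros with at least one reliable result (check 4 or 5).
reliable = {metro for metro, check in toy_pairs if check in {"4", "5"}}

# Metros with at least one statistically significant disparity (check 5).
with_disparity = {metro for metro, check in toy_pairs if check == "5"}

# Metros whose only reliable results are small disparities.
only_small = reliable - with_disparity
print(sorted(only_small))

# Documented counts: 21 + 16 metros without reliable results, 91 with.
print(21 + 16 + 91)  # should match the 128 valid metros
print(89 + 2)        # should match the 91 metros with valid results
```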

Places with small disparities

In [None]:
print(results_df5[(results_df5['variable_check'] == '4')]['metro_name'].nunique())

results_df5[(results_df5['variable_check'] == '4')]['metro_name'].value_counts(dropna = False)

Places where the only reliable results are small disparities

In [None]:
small_disparities = valid_results_df[~(valid_results_df['metro_code'].isin(disparity_metro))]['metro_code']

results_df5[(results_df5['metro_code'].isin(small_disparities)) & (results_df5['variable_check'] == '4')]\
[['metro_name', 'variable_name', 'p_value', 'odds_ratio', 'variable_check']]

Smallest disparities overall

In [None]:
results_df5[(results_df5['variable_check'] == '4')][['metro_name', 'variable_name', 'p_value', 'odds_ratio',
                                                     'variable_check']].sort_values(by = ['odds_ratio']).head(5)

Largest disparities overall

In [None]:
results_df5[(results_df5['variable_check'] == '5')][['metro_name', 'variable_name', 'p_value', 'odds_ratio',
                                                     'variable_check']].sort_values(by = ['odds_ratio'],
                                                                                   ascending = False).head(5)

### Step #4: Results

Breakdown of metros with disparities, looking at the 10 most populous metros

In [None]:
top10_metros = metros_df2.sort_values(by = ['metro_pop'], ascending = False).head(10)
top10_metros

Of the largest metros, Chicago has the worst disparity for Black applicants. Lenders are 2.5 times more likely to deny Black applicants than similarly qualified White applicants.

In [None]:
results_df5[(results_df5['metro_code'].isin(top10_metros['metro_code'])) & \
            (results_df5['variable_check'] == '5')]\
[['metro_code', 'metro_name', 'metro_pop', 'psuedo_rsquare', 'variable_name', 'p_value', 'odds_ratio']].\
sort_values(by = ['metro_pop', 'odds_ratio'], ascending = False)

Minneapolis is the only metro where applicants from all four racial and ethnic groups are more likely to be denied.

In [None]:
results_df5[(results_df5['variable_check'] == '5')]['metro_name'].value_counts(dropna = False).head(5)

#### Key Metros

Chicago Results

In [None]:
results_df5[(results_df5['metro_code'] == '16984')][['metro_code', 'metro_name', 'metro_pop', 'metro_apps',
            'psuedo_rsquare', 'variable_name', 'p_value', 'odds_ratio']]

Minneapolis Results

In [None]:
results_df5[(results_df5['metro_code'] == '33460') & ((results_df5['variable_check'] == '4') |\
            (results_df5['variable_check'] == '5'))][['metro_code', 'metro_name', 'metro_pop', 'metro_apps',
            'psuedo_rsquare', 'variable_name', 'p_value', 'odds_ratio']]

Charlotte Results

In [None]:
results_df5[(results_df5['metro_code'] == '16740') & ((results_df5['variable_check'] == '4') |\
            (results_df5['variable_check'] == '5'))][['metro_code', 'metro_name', 'metro_pop',
            'psuedo_rsquare', 'variable_name', 'p_value', 'odds_ratio']]

#### Disparities by race/ethnicity

Largest disparities for Latinos

In [None]:
results_df5[(results_df5['variable_check'] == '5') & (results_df5['variable_name'] == 'latino')]\
[['metro_code', 'metro_name', 'metro_pop', 'psuedo_rsquare', 'variable_name', 'p_value', 'odds_ratio']].\
sort_values(by = ['odds_ratio'], ascending = False).head(5)

Largest disparities for AAPI

In [None]:
results_df5[(results_df5['variable_check'] == '5') & (results_df5['variable_name'] == 'asian_cb')]\
[['metro_code', 'metro_name', 'metro_pop', 'psuedo_rsquare', 'variable_name', 'p_value', 'odds_ratio']].\
sort_values(by = ['odds_ratio'], ascending = False).head(5)

Largest disparities for Native American applicants

In [None]:
results_df5[(results_df5['variable_check'] == '5') & (results_df5['variable_name'] == 'native')]\
[['metro_code', 'metro_name', 'metro_pop', 'psuedo_rsquare', 'variable_name', 'p_value', 'odds_ratio']].\
sort_values(by = ['odds_ratio'], ascending = False).head(5)

Largest disparities for Black applicants

In [None]:
results_df5[(results_df5['variable_check'] == '5') & (results_df5['variable_name'] == 'black')]\
[['metro_code', 'metro_name', 'metro_pop', 'psuedo_rsquare', 'variable_name', 'p_value', 'odds_ratio']].\
sort_values(by = ['odds_ratio'], ascending = False).head(5)

Summary of findings by race and ethnicity
- Black applicants are more likely to be denied in 71 metros
- Latino applicants in 39 metros
- Asian/Pacific Islander applicants in 55 metros
- Native American applicants in 1 metro

In [None]:
metro_race_bd = pd.pivot_table(results_df5, index = ['variable_name'], columns = ['variable_check'],
                               values = ['metro_name'], aggfunc = 'count', fill_value = 0).reset_index()

metro_race_bd.columns = metro_race_bd.columns.droplevel(0)
metro_race_bd.columns.name = None
metro_race_bd.columns = ['Race/Ethnicity', 'Missing', 'Not Sig', 'More Apps', 'Small Disparity', 'Disparity']

metro_race_bd

### Step #5: Outputs

Add definitions for output

In [None]:
lookup_dict = {'results_flag': ['1', '2', '3', '3', '3', '3', '3'],
               'variable_check': [np.nan, np.nan, '1', '2', '3', '4', '5'],
               'reliable_note': ['No results', 'Not enough variance in variables',
                                 'Not statistically significant', 'Not statistically significant',
                                 'Not enough applications', 'Doesn\'t meet level of disparity',
                                 'Statistically significant disparity']}

lookup_df = pd.DataFrame(data=lookup_dict)

lookup_df

In [None]:
results_df6 = pd.merge(results_df4, lookup_df, how = 'left', on = ['results_flag', 'variable_check'])

results_df6.loc[(results_df6['variable_check'] == '4') | (results_df6['variable_check'] == '5'),
                'is_reliable'] = True

results_df6.loc[(results_df6['variable_check'] != '4') & (results_df6['variable_check'] != '5'),
                'is_reliable'] = False

results_df6['odds_ratio_rd'] = results_df6['odds_ratio'].round(1)
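
The lookup merge above joins on two keys, one of which is `NaN` for metros flagged 1 or 2. pandas matches null keys against each other in a merge, which is what lets those rows pick up a note. The sketch below illustrates this on three invented rows, and uses `isin` as a compact equivalent of the two `.loc` assignments for `is_reliable`.

```python
import numpy as np
import pandas as pd

# Three invented result rows, one of them with no variable check.
toy_results = pd.DataFrame({
    "metro_code": ["X", "Y", "Z"],
    "results_flag": ["1", "3", "3"],
    "variable_check": [np.nan, "4", "5"],
})

# A cut-down version of the lookup table defined above.
toy_lookup = pd.DataFrame({
    "results_flag": ["1", "3", "3"],
    "variable_check": [np.nan, "4", "5"],
    "reliable_note": ["No results", "Doesn't meet level of disparity",
                      "Statistically significant disparity"],
})

merged = pd.merge(toy_results, toy_lookup, how="left",
                  on=["results_flag", "variable_check"])

# Only checks 4 and 5 count as reliable.
merged["is_reliable"] = merged["variable_check"].isin(["4", "5"])

print(merged)
```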

Clean results for all metros:

In [None]:
races_replace = {'black': 'Black', 'latino': 'Latino', 'native': 'Native American', 'asian_cb': 'AAPI'}

results_df7 = results_df6[(results_df6['variable_name'].isin(races))]\
[['metro_code', 'metro_name', 'metro_pop', 'metro_apps', 'metro_type', 'variable_name', 'total_count', 'loan',
  'denied', 'is_reliable', 'reliable_note', 'odds_ratio_rd']].rename(columns = {'odds_ratio_rd': 'odds_ratio'})

results_df8 = results_df7.replace(races_replace)

results_df8.sample(5, random_state = 303)

In [None]:
results_df8.to_csv('metro_findings.csv', index = False)

## Lender Analysis

The reporting team also used an `lei` lookup table with additional info on the lenders.
- [Link to this data](https://ffiec.cfpb.gov/data-publication/snapshot-national-loan-level-dataset/2019)

In [None]:
lenders_df = pd.read_csv("https://raw.githubusercontent.com/the-markup/investigation-redlining/main/data/supplemental_hmda_data/cleaned/lender_definitions_em210513.csv")
lenders_df # inspect output

### Step #1: Data Processing

In [None]:
lenders_df2 = lenders_df[['lei', 'respondent_name', 'lender_def']].copy()

In [None]:
print(len(hmda19_df))

hmda19_df2 = hmda19_df[(hmda19_df['na_coapplicant'] != 0) & (hmda19_df['age_na'] != 0) &\
                       (hmda19_df['lender_na'] != 0)]

print(len(hmda19_df2))

Large lenders

In [None]:
# name the index and count columns explicitly so this works across pandas versions
lenders_apps_df = hmda19_df2['lei'].value_counts(dropna = False).rename_axis('lei').\
                  reset_index(name = 'total_apps')

lenders_apps_df2 = lenders_apps_df[(lenders_apps_df['total_apps'] >= 5000)]
print(len(lenders_apps_df2))

lenders_apps_df2.head(3)
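
The large-lender selection above is a count-then-threshold step. Here is a small sketch of the same pattern on invented data, with a threshold of 3 standing in for the notebook's 5,000 applications; the `lei` values are placeholders.

```python
import pandas as pd

# Invented applications: one row per application, identified by lender.
toy_apps = pd.DataFrame({"lei": ["L1"] * 5 + ["L2"] * 2 + ["L3"] * 3})

MIN_APPS = 3  # stand-in for the 5000-application threshold used above

# Count applications per lender, naming both columns explicitly.
apps_per_lender = (
    toy_apps["lei"].value_counts(dropna=False)
            .rename_axis("lei")
            .reset_index(name="total_apps")
)

large_lenders = apps_per_lender[apps_per_lender["total_apps"] >= MIN_APPS]
print(large_lenders)
```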

Independent variables

In [None]:
independent_vars = ['black', 'latino', 'asian_cb', 'native', 'race_na',
                    'female', 'sex_na',
                    'no_coapplicant',
                    'younger_than_34', 'older_than_55',
                    'income_log',
                    'loan_log',
                    'property_value_ratio',
                    'not30yr_mortgage',
                    'equifax', 'experian', 'other_model', 'more_than_one', 'model_na',
                    'dti_manageable', 'dti_unmanageable', 'dti_struggling',
                    'combined_loan_to_value_ratio',
                    'low_lmi', 'moderate_lmi', 'middle_lmi',
                    'non_desktop', 'aus_na',
                    'white_cat2', 'white_cat3', 'white_cat4']

Count the values that show up for each lender
- Remove continuous variables because we are counting loans and denials

In [None]:
continous_vars = ['income_log', 'loan_log', 'combined_loan_to_value_ratio', 'property_value_ratio']
independent_vars2 = [var for var in independent_vars if var not in continous_vars]

In [None]:
lenders = lenders_apps_df2['lei'].unique().tolist()
lenders_list = []
df_holder = []

Count all the independent variables for each lender, by loans and denials

In [None]:
lender_var_holder = []

for independent_var in independent_vars2:
    index_values = []
    index_values.extend(('lei', independent_var))

    lender_var_df = pd.pivot_table(hmda19_df2, index = index_values, columns = ['loan_outcome'],
                                   values = ['denied'], aggfunc = 'count', fill_value = 0).reset_index()

    lender_var_df.columns = lender_var_df.columns.droplevel(0)
    lender_var_df.columns.name = None
    lender_var_df.columns = ['lei', 'variable_flag', 'loan', 'denied']
    lender_var_df['variable_name'] = independent_var
    lender_var_holder.append(lender_var_df)

lender_varcount_df = pd.concat(lender_var_holder)
lender_varcount_df['lei'].nunique()

Finding missing records for each lender
- Focus on positive variables

In [None]:
lender_varcount_df2 = lender_varcount_df[(lender_varcount_df['variable_flag'] == 0)]
missing_rows_list = []

In [None]:
for lender in lender_varcount_df2['lei'].unique():
    lender_vars_df = lender_varcount_df2[(lender_varcount_df2['lei'] == lender)]
    lender_vars = lender_vars_df['variable_name'].unique()

    for var in independent_vars2:
        if var not in lender_vars:
            missing_row = pd.DataFrame([[lender, 0, 0, 0, var]], columns = ['lei', 'variable_flag', 'loan',
                                                                            'denied', 'variable_name'])
            missing_rows_list.append(missing_row)

missing_rows_df = pd.concat(missing_rows_list)
lender_varcount_df3 = pd.concat([lender_varcount_df2, missing_rows_df])

- Calculate the denial and loan percentage

In [None]:
lender_varcount_df3['total_count'] = lender_varcount_df3['loan'] + lender_varcount_df3['denied']

lender_varcount_df3['loan_pct'] = lender_varcount_df3['loan'].\
                                  div(lender_varcount_df3['total_count']).multiply(100)

lender_varcount_df3['denied_pct'] = lender_varcount_df3['denied'].\
                                    div(lender_varcount_df3['total_count']).multiply(100)

- Filter for the select lenders

In [None]:
lender_varcount_df4 = lender_varcount_df3[(lender_varcount_df3['lei'].isin(lenders))]
len(lenders) == lender_varcount_df4['lei'].nunique()

Which variables are zero?
- Credit models and underwriters are specific to individual lenders
- Many lenders stick to Experian, Equifax, and TransUnion

In [None]:
lender_varcount_df4[(lender_varcount_df4['total_count'] == 0)]['variable_name'].value_counts(dropna = False)

In [None]:
model_vars = ['equifax', 'experian', 'other_model', 'more_than_one', 'model_na']

missing_credit = lender_varcount_df4[(lender_varcount_df4['variable_name'].isin(model_vars)) &\
                                      (lender_varcount_df4['total_count'] == 0)]['lei'].nunique()

print('Number of lenders with at least one credit model missing: ' + str(missing_credit))

In [None]:
aus = ['non_desktop', 'aus_na']

missing_aus = lender_varcount_df4[(lender_varcount_df4['variable_name'].isin(aus)) &\
                                  (lender_varcount_df4['total_count'] == 0)]['lei'].nunique()

print('Number of lenders with at least one underwriter missing: ' + str(missing_aus))

Select dummy variables that are greater than zero
- Leaving out variables where the credit model and AUS don't exist in the lender's data

In [None]:
lender_varcount_df5 = lender_varcount_df4[(lender_varcount_df4['total_count'] > 0)]

### Step #2: Analysis

Regression for individual lenders

In [None]:
lender_holder = []

for lender in lenders:
    lender_df = hmda19_df2[(hmda19_df2['lei'] == lender)]
    total_apps = len(lender_df)

    lender_independent_vars = lender_varcount_df5[(lender_varcount_df5['lei'] == lender)]\
                              ['variable_name'].unique().tolist()
    lender_independent_vars2 = lender_independent_vars + continous_vars

    regression_formula = create_formula(lender_independent_vars2)
    model = run_regression(data = lender_df, formula = regression_formula)

    try:
        results = model.fit()
        info = results.mle_retvals['converged']

        results_df = convert_results_to_df(results)
        results_df.insert(0, 'lei', lender)
        results_df.insert(1, 'psuedo_rsquare', results.prsquared)
        results_df['iteration_flag'] = info
        results_df['total_apps'] = total_apps

    except Exception:  # regression failed; fill this lender's rows with NaN
        independent_nan_list = []

        for regression_var in lender_independent_vars:
            results_dict = {'lei': lender, 'variable_name': regression_var, 'standard_error': np.nan,
                            'z_value': np.nan, 'p_value': np.nan, 'odds_ratio': np.nan,
                            'iteration_flag': np.nan, 'psuedo_rsquare': np.nan, 'total_apps': total_apps}

            results_df = pd.DataFrame([results_dict], columns = results_dict.keys())
            independent_nan_list.append(results_df)
        results_df = pd.concat(independent_nan_list)

    lender_holder.append(results_df)

lender_results_df = pd.concat(lender_holder)

Join Dataframes and filter for significant results

In [None]:
lender_results_df2 = pd.merge(lender_results_df, lender_varcount_df5, how = 'left',
                              on = ['lei', 'variable_name'])

Number of lenders that didn't produce any results

In [None]:
no_results_lenders = lender_results_df2[(lender_results_df2['psuedo_rsquare'].isnull()) & \
                                        (lender_results_df2['p_value'].isnull()) &\
                                        (lender_results_df2['z_value'].isnull())]['lei']

no_results_lenders.nunique()

Lenders where the equation didn't work for them

In [None]:
equation_lenders = lender_results_df2[(lender_results_df2['psuedo_rsquare'] < .1) |\
                                      (lender_results_df2['iteration_flag'] == False)]['lei']

equation_lenders.nunique()

Lenders with valid results

In [None]:
lender_results_df3 = lender_results_df2[(lender_results_df2['psuedo_rsquare'] >= .1) &\
                                        (lender_results_df2['iteration_flag'] == True)]

print(lender_results_df3['lei'].nunique())

No overlap between lenders with valid results and lenders where the equation didn't work or that produced no results at all (the cell below should return an empty DataFrame)

In [None]:
lender_results_df3[(lender_results_df3['lei'].isin(equation_lenders)) | \
                   (lender_results_df3['lei'].isin(no_results_lenders))]
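
The same overlap check can be written as an explicit assertion rather than an eyeballed empty table. This is a minimal sketch on toy data that reuses the notebook's column names and thresholds; the four lenders and their values are invented for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'lei': ['A', 'B', 'C', 'D'],
    'psuedo_rsquare': [0.25, 0.05, np.nan, 0.30],
    'iteration_flag': [True, True, np.nan, False],
})

# Three groups, built with the same conditions as the cells above
no_results = set(demo[demo['psuedo_rsquare'].isnull()]['lei'])
poor_fit = set(demo[(demo['psuedo_rsquare'] < .1) |
                    (demo['iteration_flag'] == False)]['lei'])
valid = set(demo[(demo['psuedo_rsquare'] >= .1) &
                 (demo['iteration_flag'] == True)]['lei'])

# Valid lenders must not appear in either excluded group
assert valid.isdisjoint(no_results | poor_fit)
# Every lender must land in at least one group
assert valid | no_results | poor_fit == set(demo['lei'])
print(sorted(valid), sorted(poor_fit), sorted(no_results))
```

Comparisons against NaN evaluate to `False`, so a lender with missing results falls out of both the poor-fit and the valid conditions and is caught only by the `isnull()` test.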

#### Collinearity

In [None]:
vif_list = []

for lender in lender_results_df3['lei'].unique():
    lender_vars_df = lender_results_df3[(lender_results_df3['lei'] == lender)]
    # Skip the first variable name (assumed here to be the intercept term)
    independent_vars = lender_vars_df['variable_name'].unique()[1:]

    lender_df = hmda19_df2[(hmda19_df2['lei'] == lender)][independent_vars]

    vif_df = calculate_vif(lender_df)
    vif_df['lei'] = lender

    vif_list.append(vif_df)

lenders_vif_df = pd.concat(vif_list)
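
For readers unfamiliar with the variance inflation factor (VIF): for each independent variable, regress it on the other independent variables and compute 1 / (1 - R²). Values near 1 suggest little collinearity, and large values suggest the variable is nearly a linear combination of the others. The sketch below is an illustration with NumPy on synthetic data. `vif_sketch` is a hypothetical name, not the `calculate_vif` helper from the utility scripts, and it does not reproduce that helper's `threshold` column or cutoff.

```python
import numpy as np
import pandas as pd

def vif_sketch(df):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on all other columns."""
    rows = []
    for col in df.columns:
        y = df[col].to_numpy(dtype=float)
        X = df.drop(columns=col).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(X)), X])   # add an intercept column
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r2 = 1 - resid.var() / y.var()
        rows.append({'independent_var': col, 'vif': 1 / (1 - r2)})
    return pd.DataFrame(rows)

# Synthetic data: 'c' is almost a copy of 'a', while 'b' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

demo_vif = vif_sketch(pd.DataFrame({'a': a, 'b': b, 'c': c}))
print(demo_vif)
```

In this synthetic example `a` and `c` get large VIFs because each nearly determines the other, while `b` stays close to 1.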

In [None]:
lenders_vif_df2 = lenders_vif_df[(lenders_vif_df['independent_var'] != 'income_log') &\
                                 (lenders_vif_df['independent_var'] != 'loan_log')]

collinarity_lenders = lenders_vif_df2[(lenders_vif_df2['threshold'] == '1')]['lei'].unique()

Lenders with collinearity issues
- 12 lenders

In [None]:
len(collinarity_lenders)

Filter out lenders with collinearity issues
- 30 lenders with no results (4 with no results + 26 with poor fit)
- 12 lenders with collinearity issues
- 30 lenders move forward

In [None]:
lender_results_df4 = lender_results_df3[~(lender_results_df3['lei'].isin(collinarity_lenders))]

lender_results_df4['lei'].nunique()

### Step #3: Results

Focus on lenders with statistically significant racial and ethnic results (p-value below 0.05)
- 26 lenders

In [None]:
races = ['black', 'latino', 'asian_cb', 'native']
lender_results_df5 = lender_results_df4[(lender_results_df4['variable_name'].isin(races))]

lender_results_df6 = lender_results_df5[(lender_results_df5['p_value'] < .05)]

print(lender_results_df6['lei'].nunique())

Join with names

In [None]:
lender_results_df7 = pd.merge(lender_results_df6, lenders_df2, how = 'left', on = ['lei'])
lender_results_df7['lei'].nunique()

Filter out results with fewer than 75 applicants

In [None]:
lender_results_df8 = lender_results_df7[(lender_results_df7['total_count'] >= 75)]

lender_results_df8['lei'].nunique()

Disparity range (largest and smallest odds ratios):

In [None]:
print(lender_results_df8['odds_ratio'].max())
print(lender_results_df8['odds_ratio'].min())
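
An odds ratio is the exponential of a logistic-regression coefficient, and it scales odds rather than probabilities. The sketch below shows what odds ratios of 1.45 and 1.95 would mean for a hypothetical reference group with a 10% baseline probability. The baseline is an assumption chosen for illustration, not a figure from the data, and both function names are hypothetical.

```python
import math

def odds_ratio_from_coef(coef):
    """Convert a logit coefficient (log-odds) to an odds ratio."""
    return math.exp(coef)

def apply_odds_ratio(baseline_prob, odds_ratio):
    """Probability implied by scaling the baseline odds by an odds ratio."""
    odds = baseline_prob / (1 - baseline_prob) * odds_ratio
    return odds / (1 + odds)

baseline = 0.10  # hypothetical baseline probability for the reference group
for ratio in (1.45, 1.95):
    implied = apply_odds_ratio(baseline, ratio)
    print(f'odds ratio {ratio}: {baseline:.0%} baseline -> {implied:.1%}')
```

Because the scaling applies to odds, an odds ratio of 1.95 does not mean the probability nearly doubles; the gap between the two readings widens as the baseline probability rises.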

#### 25 lenders with statistically significant disparities

In [None]:
lender_results_df8[(lender_results_df8['odds_ratio'] >= 1.45)]['lei'].nunique()

In [None]:
lender_results_df9 = lender_results_df8[(lender_results_df8['total_count'] >= 1000) & \
                                        (lender_results_df8['odds_ratio'] >= 1.95)].\
                      sort_values(by = ['odds_ratio'], ascending = False)

print(lender_results_df9['lei'].nunique())

lender_results_df9[['lei', 'respondent_name', 'variable_name', 'total_count', 'p_value', 'odds_ratio']].\
sort_values(by = ['respondent_name', 'odds_ratio'])

In [None]:
lender_results_df9[['lei', 'respondent_name', 'variable_name', 'total_count', 'p_value', 'odds_ratio']].\
sort_values(by = ['respondent_name', 'odds_ratio']).\
to_csv('lender_findings.csv', index = False)