# Preprocessing
## Machine Learning on Mortgage Loans
### CAPP 30245 - Jacob Leppek, Rob Mitchell, Ryan Webb

***
Our analysis attempts to predict whether a completed mortgage loan application is approved, using 2017 nationwide Consumer Financial Protection Bureau data accessed [here](https://www.consumerfinance.gov/data-research/hmda/). The data are published under the Home Mortgage Disclosure Act (HMDA). We use three primary classification models to predict whether loans are approved: Decision Tree / Random Forest, Logistic Regression, and KNN. We break out these results to examine the importance of race in our models and its predictive power. We use multiple evaluation metrics to compare models: **INSERT METRICS HERE**.

Overall, we find that **INSERT BEST MODEL HERE** is the most predictive model with an accuracy of **INSERT ACCURACY HERE** using features of **INSERT FEATURES HERE**.

This analysis makes no claim about the likelihood that an applicant can successfully pay back the loan, as the HMDA data does not include the relevant variables. Credit history, for instance, is not reported in the HMDA data but is essential to the decision to originate (approve) a loan.

Because these variables are unreported, some academics [question the validity](https://www.depauw.edu/learn/stata/Workshop/Munnell.pdf) of prior analyses that aim to demonstrate discrimination and borrowing credibility. For the same reason, we cannot directly estimate the causal effect that race or any other applicant characteristic has on the probability of a loan being approved.

For this analysis, we seek only to determine whether, from the variables provided, we can predict if a loan is approved or denied.

Code explanations for each feature may be accessed [here](https://files.consumerfinance.gov/hmda-historic-data-dictionaries/lar_record_codes.pdf).

In this notebook, we clean and prepare the data to use for our machine learning models. 

### Get the Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv('hmda_2017_nationwide_all-records_codes.csv')

### Pre-Processing

First we will drop the columns we won't be using.

In [3]:
cols = ['as_of_year',
        'respondent_id',
        'agency_code',
        'owner_occupancy',
        'preapproval',
        'applicant_race_2',
        'applicant_race_3',
        'applicant_race_4',
        'applicant_race_5',
        'co_applicant_race_2',
        'co_applicant_race_3',
        'co_applicant_race_4',
        'co_applicant_race_5',
        'purchaser_type',
        'denial_reason_1',
        'denial_reason_2',
        'denial_reason_3',
        'hoepa_status',
        'lien_status',
        'edit_status',
        'sequence_number',
        'application_date_indicator',
        'msamd',
        'state_code',
        'county_code',
        'census_tract_number',
        'rate_spread']

df = df.drop(columns=cols)
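
Because the raw file is large, one possible alternative is to skip the unwanted columns at read time rather than loading and then dropping them. The sketch below is only an illustration: it reads a tiny in-memory CSV with made-up values, and `unused` is a shortened, hypothetical stand-in for the full `cols` list above.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the HMDA file (values are made up).
raw = io.StringIO(
    "as_of_year,respondent_id,action_taken,loan_amount_000s\n"
    "2017,0000000001,1,250\n"
    "2017,0000000002,3,120\n"
)

# Shortened version of the drop list, for illustration only.
unused = {'as_of_year', 'respondent_id'}

# usecols accepts a callable that is evaluated on each column name;
# columns for which it returns False are never loaded into memory.
sample = pd.read_csv(raw, usecols=lambda name: name not in unused)
print(sample.columns.tolist())
```

With the real file, the same `usecols` callable could be passed to the `pd.read_csv` call in the cell that loads the data.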

Next, we will drop rows with values that we aren't including in our analysis.

In [4]:
# Exclude rows where action taken is withdrawn or incomplete
df = df[~df['action_taken'].isin([4, 5])]
# Exclude rows where loan type is not conventional
df = df[df['loan_type'] == 1]
# Exclude rows where the purpose of the loan is not to purchase a house
df = df[df['loan_purpose'] == 1]
# Exclude rows where property type is not 1-4 family
df = df[df['property_type'] == 1]
# Exclude rows where information on applicant race is not provided or not applicable
df = df[~df['applicant_race_1'].isin([6, 7])]
# Exclude rows where information on applicant ethnicity is not provided or not applicable
df = df[~df['applicant_ethnicity'].isin([3, 4])]
# Exclude rows where information on applicant sex is not provided or not applicable
df = df[df['applicant_sex'].isin([1, 2])]
# Drop columns that now contain only a single value
cols = ['loan_type',
        'property_type',
        'loan_purpose'
       ]
df = df.drop(columns=cols)
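
Each filter above removes a different slice of the data, so it can help to record how many rows each one drops. The following is a minimal sketch on a toy frame with made-up values; the `filters` list and the `audit` bookkeeping are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy rows using the same HMDA code columns as the cell above (values are made up).
toy = pd.DataFrame({
    'action_taken':        [1, 3, 4, 1, 6, 3],
    'loan_type':           [1, 1, 1, 2, 1, 1],
    'loan_purpose':        [1, 1, 1, 1, 1, 3],
    'property_type':       [1, 1, 1, 1, 1, 1],
    'applicant_race_1':    [5, 3, 5, 5, 6, 2],
    'applicant_ethnicity': [2, 2, 1, 2, 2, 2],
    'applicant_sex':       [1, 2, 1, 2, 1, 3],
})

# Each entry pairs a label with a function returning a boolean "keep" mask.
filters = [
    ('completed application', lambda d: ~d['action_taken'].isin([4, 5])),
    ('conventional loan',     lambda d: d['loan_type'] == 1),
    ('home purchase',         lambda d: d['loan_purpose'] == 1),
    ('1-4 family property',   lambda d: d['property_type'] == 1),
    ('race reported',         lambda d: ~d['applicant_race_1'].isin([6, 7])),
    ('ethnicity reported',    lambda d: ~d['applicant_ethnicity'].isin([3, 4])),
    ('sex reported',          lambda d: d['applicant_sex'].isin([1, 2])),
]

# Apply the filters in order and record how many rows each one removes.
audit = []
for label, keep in filters:
    before = len(toy)
    toy = toy[keep(toy)]
    audit.append((label, before - len(toy)))

for label, dropped in audit:
    print(f'{label}: dropped {dropped} rows')
```

Because the filters are applied sequentially, each count reflects only the rows that survived the earlier filters.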

Next, we will define our target variable.

In [5]:
# action_taken 1 = loan originated, 6 = loan purchased by the institution; every other remaining code maps to 0
df['target'] = df['action_taken'].isin([1, 6]).astype(int)
df = df.drop(columns=['action_taken'])
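
To make the target definition concrete, the sketch below maps each `action_taken` code that survives the earlier filters to its label from the HMDA code sheet linked above and shows the resulting target value. The toy rows are made up, and `action_labels` is an illustrative helper rather than something defined in the notebook.

```python
import pandas as pd

# action_taken codes that remain after excluding 4 (withdrawn) and 5 (incomplete).
action_labels = {
    1: 'Loan originated',
    2: 'Application approved but not accepted',
    3: 'Application denied by financial institution',
    6: 'Loan purchased by the institution',
    7: 'Preapproval request denied by financial institution',
    8: 'Preapproval request approved but not accepted',
}

# One made-up row per code, plus a repeat of code 1.
toy = pd.DataFrame({'action_taken': [1, 2, 3, 6, 7, 8, 1]})
toy['target'] = toy['action_taken'].isin([1, 6]).astype(int)
toy['label'] = toy['action_taken'].map(action_labels)

# Show which target value each code receives, then the class balance.
print(toy.drop_duplicates('action_taken')[['action_taken', 'label', 'target']])
print(toy['target'].value_counts(normalize=True))
```

Under this definition, code 2 (approved but not accepted) is labeled 0 even though the lender approved the application; whether that grouping is appropriate is a modeling choice worth keeping in mind when reading the results.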

Now, we will create some features based on variables in the dataset.

In [6]:
# Recode applicant sex from 1 (male) and 2 (female) to 1 (male) and 0 (female)
df['applicant_sex'] = (df['applicant_sex'] == 1).astype(int)
# This feature indicates how much money the loan is for relative to the applicant's income
df['loan_income_ratio'] = df['loan_amount_000s'] / df['applicant_income_000s']
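
The ratio is undefined when income is missing and infinite when income is zero. Whether zero incomes actually occur in the HMDA file is not checked here, so the sketch below only illustrates, on made-up values, how pandas handles both cases and one possible way to treat infinite ratios as missing.

```python
import numpy as np
import pandas as pd

# Made-up rows: a normal income, a zero income, and a missing income.
toy = pd.DataFrame({
    'loan_amount_000s':      [200.0, 150.0, 300.0],
    'applicant_income_000s': [50.0, 0.0, np.nan],
})

# Division by zero yields inf and division by NaN yields NaN; neither raises.
toy['loan_income_ratio'] = toy['loan_amount_000s'] / toy['applicant_income_000s']
n_inf = np.isinf(toy['loan_income_ratio']).sum()
print(f'infinite ratios: {n_inf}')

# Optional: convert infinite ratios to NaN so that a later dropna() removes them.
toy['loan_income_ratio'] = toy['loan_income_ratio'].replace([np.inf, -np.inf], np.nan)
print(toy)
```

Note that `dropna()` does not remove infinite values, so without a step like the last one such rows would reach the models.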

We use the existence of co-applicant ethnicity, race, or sex to create a binary indicator of whether a [co-applicant](https://www.investopedia.com/terms/c/co-applicant.asp) exists. From prior research, it appears that having a co-applicant can have [statistically significant effects](https://journals.sagepub.com/doi/abs/10.1177/08861090022093804) on whether the loan is approved or not. 

In [7]:
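# Codes 5 (ethnicity), 8 (race), and 5 (sex) mean "no co-applicant";
# any other value in any of the three columns implies that a co-applicant exists
df['co_applicant'] = np.where((df['co_applicant_ethnicity'] != 5) |
                              (df['co_applicant_race_1'] != 8) |
                              (df['co_applicant_sex'] != 5), 1, 0)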
df = df.drop(columns=['co_applicant_ethnicity', 'co_applicant_race_1', 'co_applicant_sex'])
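
The indicator can be read as "not all three columns say 'no co-applicant'". The sketch below uses made-up rows to show how the rule treats edge cases, such as records where the co-applicant information is not provided or where the three columns disagree. The name `no_co` is illustrative.

```python
import numpy as np
import pandas as pd

# Made-up rows: no co-applicant, a fully reported co-applicant,
# an inconsistent record, and a co-applicant with information not provided.
toy = pd.DataFrame({
    'co_applicant_ethnicity': [5, 2, 5, 3],
    'co_applicant_race_1':    [8, 5, 8, 6],
    'co_applicant_sex':       [5, 2, 1, 3],
})

# A record has no co-applicant only if all three columns carry the "no co-applicant" code.
no_co = ((toy['co_applicant_ethnicity'] == 5) &
         (toy['co_applicant_race_1'] == 8) &
         (toy['co_applicant_sex'] == 5))
toy['co_applicant'] = (~no_co).astype(int)

# The same rule written as an OR of inequalities, as in the cell above.
via_or = np.where((toy['co_applicant_ethnicity'] != 5) |
                  (toy['co_applicant_race_1'] != 8) |
                  (toy['co_applicant_sex'] != 5), 1, 0)
print(toy)
```

Both formulations agree by De Morgan's law; records where the three columns disagree are counted as having a co-applicant.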

In [8]:
# The applicant is a minority if their race or ethnicity indicates so:
# race codes 1-4 (any reported race other than White) or ethnicity code 1 (Hispanic or Latino)
df['applicant_minority'] = np.where(df['applicant_race_1'].isin([1, 2, 3, 4]) |
                                    (df['applicant_ethnicity'] == 1), 1, 0)
# We will preserve the race and ethnicity columns for future analysis

Roughly 55,000 records (7.9% of the data) have NaNs for all accompanying U.S. Census fields (population, minority population, etc.). On examination, we found that none of these loan applications were approved. We have two options at this point:

1. Impute values to the extent possible after splitting the data into training and test sets. We have state codes for roughly 17,000 of these records. We could impute each missing value with the median for the record's state (by FIPS code; this would require keeping the `state_code` column, which we drop above), and impute the rest with the median of all other values. However, since no variation exists in our target variable for these records, imputing the data would likely heavily affect our estimator.

2. Drop these records. Dropping any rows with NaNs in the Census data shows that the missing values overlap exactly: a record missing one Census field is missing all of them. Dropping these records makes sense; given that no variance in the outcome exists for these records, it is likely that imputing these values will cause overfitting in our estimator. That said, it seems unlikely that these values are missing entirely at random, and dropping these records may introduce bias into our estimators.

We chose to drop these records. 
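
The two checks behind this choice (exact overlap of the missing Census fields, and no approvals among the affected records) could be sketched as follows. The toy values are made up, and `census_cols` is an assumed, shortened list of Census column names rather than the notebook's own.

```python
import numpy as np
import pandas as pd

# Assumed subset of the Census columns; values below are made up.
census_cols = ['population', 'minority_population', 'hud_median_family_income']
toy = pd.DataFrame({
    'population':               [4000.0, np.nan, 5200.0, np.nan],
    'minority_population':      [12.5, np.nan, 40.1, np.nan],
    'hud_median_family_income': [65000.0, np.nan, 71000.0, np.nan],
    'target':                   [1, 0, 0, 0],
})

missing = toy[census_cols].isna()
any_missing = missing.any(axis=1).rename('census_missing')
all_missing = missing.all(axis=1).rename('census_missing')

# Exact overlap: every row missing one Census field is missing all of them.
exact_overlap = any_missing.equals(all_missing)

# Share of affected rows, and the approval rate with and without Census data.
share_missing = any_missing.mean()
approval_by_missing = toy.groupby(any_missing)['target'].mean()

print(exact_overlap, share_missing)
print(approval_by_missing)
```

An approval rate of zero in the `census_missing == True` group is what signals that the outcome has no variance among those records.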

In [9]:
df = df.dropna()
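
Called without arguments, `dropna()` removes a row if any column is NaN, not only the Census fields. Whether other columns in the real data contain NaNs is not established here; the sketch below only shows, on made-up values, how the result differs from restricting the drop to an assumed `census_cols` list.

```python
import numpy as np
import pandas as pd

# Made-up rows: one with a missing income, one with missing Census fields.
toy = pd.DataFrame({
    'applicant_income_000s': [80.0, np.nan, 55.0, 60.0],
    'population':            [4000.0, 5200.0, np.nan, 3100.0],
    'minority_population':   [12.5, 40.1, np.nan, 8.0],
})
census_cols = ['population', 'minority_population']  # assumed column names

# Drop rows with a NaN in any column (what the cell above does).
dropped_any = toy.dropna()

# Drop rows only when a Census field is missing.
dropped_census = toy.dropna(subset=census_cols)

# Rows removed by the broader rule but kept by the Census-only rule.
extra = dropped_census.index.difference(dropped_any.index)
print(len(toy), len(dropped_any), len(dropped_census), list(extra))
```

Comparing the two row counts on the real data would show how many records are removed for reasons other than missing Census fields.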

Finally, we export the cleaned data to a CSV file for use in the other notebooks.

In [10]:
df.to_csv('clean.csv', index=False)