# Lending disparities using Logistic Regression

**The story:** https://www.revealnews.org/article/for-people-of-color-banks-are-shutting-the-door-to-homeownership/

**Authors:** Aaron Glantz and Emmanuel Martinez

**Topics:** Logistic regression, odds ratios

**Datasets**

* **philadelphia-mortgages.csv:** Philadelphia mortgage data for 2015
    - A subset of HMDA LAR data from [FFIEC](https://www.ffiec.gov/hmda/hmdaproducts.htm)
    - Codebook is `2015HMDACodeSheet.pdf`
    - A [guide to HMDA reporting](https://www.ffiec.gov/hmda/guide.htm)
    - I've massaged it slightly to make processing a bit easier
* **nhgis0006_ds233_20175_2017_tract.csv:**
    - Table B03002: Hispanic or Latino Origin by Race
    - 2013-2017 American Community Survey data from the US Census Bureau, via [NHGIS](https://data2.nhgis.org/main)
    - Codebook is `nhgis0006_ds233_20175_2017_tract_codebook.txt`
* **lending_disparities_whitepaper_180214.pdf:** the whitepaper outlining Reveal's methodology

## What's the goal?

Do banks provide mortgages at disparate rates between white applicants and people of color? We're going to look at the following variables to find out:

* Race/Ethnicity
    - Native American
    - Asian
    - Black
    - Native Hawaiian
    - Hispanic/Latino
    - Race and ethnicity were not reported
* Sex
* Whether there was a co-applicant
* Applicant’s annual income (includes co-applicant income)
* Loan amount
* Ratio between the loan amount and the applicant’s income
* Ratio between the median income of the census tract and the median income of the metro area
* Racial and ethnic breakdown by percentage for each census tract
* Regulating agency of the lending institution

# Setup

Import pandas as usual, but also import numpy. We'll need it for logarithms and exponents.

Some of our datasets have a lot of columns, so you'll also want to use `pd.set_option` to display up to 100 columns or so.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression,LogisticRegression

%matplotlib inline
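
The setup cell imports the libraries but does not set the display option mentioned above. A minimal sketch of that step, assuming 100 columns is enough for these files:

```python
import pandas as pd

# Show up to 100 columns instead of truncating wide dataframes with "..."
pd.set_option("display.max_columns", 100)

# Read the setting back to confirm it took effect
max_cols = pd.get_option("display.max_columns")
print(max_cols)
```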

# What is each row of our data?

If you aren't sure, you might need to look at either the whitepaper or the codebook. You'll need to look at them both eventually, so might as well get started now.

# Read in your data

Read in our Philadelphia mortgage data and take a peek at the first few rows.

* **Tip:** As always, census tract columns like to cause problems if they're read in as numbers. Make sure pandas reads it in as a string.

In [2]:
df_mor = pd.read_csv("data/philadelphia-mortgages.csv", encoding='latin-1', dtype={'census_tract' :'str'})
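
To see why the `dtype` argument matters, here is a small sketch using a made-up two-row CSV held in memory (the values are hypothetical, not taken from the real file). Without `dtype`, pandas guesses a number and the leading zeros disappear:

```python
import io
import pandas as pd

# Hypothetical two-row file standing in for the real CSV
csv_text = "census_tract,income\n0101.00,26\n0264.00,40\n"

# Default behaviour: pandas sees digits and parses the column as floats
as_number = pd.read_csv(io.StringIO(csv_text))

# With dtype, the column is kept as text exactly as written in the file
as_string = pd.read_csv(io.StringIO(csv_text), dtype={"census_tract": "str"})

print(as_number["census_tract"].tolist())
print(as_string["census_tract"].tolist())
```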

In [3]:
df_mor.head()

Unnamed: 0,census_tract,county_code,state_code,applicant_sex,income,loan_amount,loan_type,property_type,occupancy,action_type,loan_purpose,agency_code,tract_to_msa_income_percent,applicant_race,co_applicant_sex
0,101.0,101,42,3,26,5,1,1,1,4,2,5,97.09,6,5
1,264.0,101,42,2,26,40,1,1,1,4,2,5,98.27,3,5
2,281.0,101,42,2,22,20,1,1,1,5,2,5,72.28,6,5
3,158.0,101,42,2,57,36,1,1,1,5,3,5,105.87,6,5
4,358.0,101,42,1,80,34,1,1,1,1,3,5,139.62,5,2


In [5]:
df_cen = pd.read_csv("data/nhgis0007_ds215_20155_2015_tract.csv", encoding='latin-1', dtype={'TRACTA' :'str'})

In [6]:
df_cen.head()

Unnamed: 0,GISJOIN,YEAR,REGIONA,DIVISIONA,STATE,STATEA,COUNTY,COUNTYA,COUSUBA,PLACEA,...,ADK5M012,ADK5M013,ADK5M014,ADK5M015,ADK5M016,ADK5M017,ADK5M018,ADK5M019,ADK5M020,ADK5M021
0,G0100010020100,2011-2015,,,Alabama,1,Autauga County,1,,,...,21,21,11,11,11,11,11,11,11,11
1,G0100010020200,2011-2015,,,Alabama,1,Autauga County,1,,,...,25,23,11,11,11,11,7,11,11,11
2,G0100010020300,2011-2015,,,Alabama,1,Autauga County,1,,,...,11,11,11,11,11,11,11,11,11,11
3,G0100010020400,2011-2015,,,Alabama,1,Autauga County,1,,,...,437,33,50,11,11,11,456,29,29,11
4,G0100010020500,2011-2015,,,Alabama,1,Autauga County,1,,,...,71,71,18,18,18,18,18,18,18,18


In [7]:
df_cen.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74001 entries, 0 to 74000
Data columns (total 80 columns):
GISJOIN      74001 non-null object
YEAR         74001 non-null object
REGIONA      0 non-null float64
DIVISIONA    0 non-null float64
STATE        74001 non-null object
STATEA       74001 non-null int64
COUNTY       74001 non-null object
COUNTYA      74001 non-null int64
COUSUBA      0 non-null float64
PLACEA       0 non-null float64
TRACTA       74001 non-null object
BLKGRPA      0 non-null float64
CONCITA      0 non-null float64
AIANHHA      0 non-null float64
RES_ONLYA    0 non-null float64
TRUSTA       0 non-null float64
AITSCEA      0 non-null float64
ANRCA        0 non-null float64
CBSAA        0 non-null float64
CSAA         0 non-null float64
METDIVA      0 non-null float64
NECTAA       0 non-null float64
CNECTAA      0 non-null float64
NECTADIVA    0 non-null float64
UAA          0 non-null float64
CDCURRA      0 non-null float64
SLDUA        0 non-null float64
SLDLA   

### Check your column types

I mentioned it above, but make sure `census_tract` is an object (a string) or merging isn't going to be any fun later on.

# Engineering and cleaning up features

## Income-related columns

> When we plotted the number of applicants, how much money they made and the size of the loan, we found that it skewed to the left, meaning the majority of applicants were clustered on the lower end of the income and loan amount scales. This was especially true for applicants of color. **We took the logarithm transformation of income and loan amount to normalize the distribution of those variables and limit the effect of extreme outliers.**

A few of the columns you'll need to calculate yourselves. **Calculate these values and assign them to three new columns.**

* Applicant’s adjusted annual income (includes co-applicant income)
* Adjusted loan amount
* Ratio between the loan amount and the applicant’s income

Instead of using the raw income and loan amount, you'll want the log of both income and loan amount. Call these new columns `log_income` and `log_loan_amount`. The third column will be `loan_income_ratio`.

* **Tip:** `np.log` gives you the logarithm

In [10]:
# Q: What does the log do, and how is the value adjusted?

In [8]:
df_mor['log_income'] = np.log(df_mor.income)

In [9]:
df_mor['log_loan_amount'] = np.log(df_mor.loan_amount)

In [10]:
df_mor['loan_income_ratio'] = df_mor['log_income'] / df_mor['log_loan_amount']
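
The cell above divides the log of income by the log of the loan amount, and the outputs in this notebook reflect that. Read literally, though, "ratio between the loan amount and the applicant's income" suggests loan amount divided by income on the raw values. A sketch of that reading, using a few income/loan pairs from the rows shown earlier (the whitepaper is the authority on which version Reveal used):

```python
import numpy as np
import pandas as pd

# A few income / loan_amount pairs borrowed from the first rows of the data
demo = pd.DataFrame({"income": [26, 57, 80], "loan_amount": [5, 36, 34]})

# Log-transformed versions, as described in the whitepaper quote
demo["log_income"] = np.log(demo["income"])
demo["log_loan_amount"] = np.log(demo["loan_amount"])

# Loan amount relative to income, computed on the raw (unlogged) values
demo["loan_income_ratio"] = demo["loan_amount"] / demo["income"]

print(demo.round(3))
```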

In [11]:
df_mor.head(10)

Unnamed: 0,census_tract,county_code,state_code,applicant_sex,income,loan_amount,loan_type,property_type,occupancy,action_type,loan_purpose,agency_code,tract_to_msa_income_percent,applicant_race,co_applicant_sex,log_income,log_loan_amount,loan_income_ratio
0,101.0,101,42,3,26,5,1,1,1,4,2,5,97.09,6,5,3.258097,1.609438,2.024369
1,264.0,101,42,2,26,40,1,1,1,4,2,5,98.27,3,5,3.258097,3.688879,0.883221
2,281.0,101,42,2,22,20,1,1,1,5,2,5,72.28,6,5,3.091042,2.995732,1.031815
3,158.0,101,42,2,57,36,1,1,1,5,3,5,105.87,6,5,4.043051,3.583519,1.128235
4,358.0,101,42,1,80,34,1,1,1,1,3,5,139.62,5,2,4.382027,3.526361,1.242649
5,27.02,101,42,3,54,70,1,1,1,5,2,5,166.23,6,5,3.988984,4.248495,0.938917
6,337.02,101,42,2,137,35,1,1,1,1,2,5,120.17,6,3,4.919981,3.555348,1.383825
7,4105.0,45,42,2,105,30,1,1,1,4,2,5,77.54,3,5,4.65396,3.401197,1.36833
8,362.02,101,42,3,72,15,1,1,1,5,2,5,148.89,6,5,4.276666,2.70805,1.579242
9,197.0,101,42,3,31,30,1,1,1,5,2,5,45.85,6,5,3.433987,3.401197,1.009641


### Co-applicants

Right now we have a column about the co-applicant's sex (see the codebook for column details). We don't want the sex, though; we're interested in whether there is a co-applicant or not. Use the co-applicant's sex to **create a new column called `co_applicant` that is either 'yes', 'no', or 'unknown'.**

* **Hint:** If the co-applicant's sex was not provided or is not applicable, count it as unknown.
* **Hint:** The easiest way is to use `.replace` on the co-applicant sex column, but store the result in your new column

In [12]:
df_mor['co_applicant'] = df_mor['co_applicant_sex'].astype(str)

In [13]:
df_mor['co_applicant'] = df_mor.co_applicant.str.replace("1","Yes")
df_mor['co_applicant'] = df_mor.co_applicant.str.replace("2","Yes")
df_mor['co_applicant'] = df_mor.co_applicant.str.replace("3","Unknown")
df_mor['co_applicant'] = df_mor.co_applicant.str.replace("4","Unknown")
df_mor['co_applicant'] = df_mor.co_applicant.str.replace("5","No")
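
The chained `.str.replace` calls work here because every code is a single digit. An alternative sketch that matches whole values with a dictionary, shown on a made-up series rather than the real column (it uses `.map` rather than the `.replace` from the hint, which is a substitution on my part):

```python
import pandas as pd

# Stand-in for df_mor['co_applicant_sex'], using the codes handled above
co_applicant_sex = pd.Series([5, 2, 3, 1, 4])

# 1 and 2 mean a co-applicant exists, 3 and 4 are not provided / not applicable,
# 5 means there is no co-applicant
codes = {1: "Yes", 2: "Yes", 3: "Unknown", 4: "Unknown", 5: "No"}

# .map looks up each whole value in the dictionary
co_applicant = co_applicant_sex.map(codes)
print(co_applicant.tolist())
```

The same pattern should carry over to the sex, race and agency columns, since each of those is also a lookup from a code to a label.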

# Filter loan applicants

According to the whitepaper (`lending_disparities_whitepaper_180214.pdf`), many filters are used to get to the target dataset for analysis.

> **Loan type**
>
> While we recognize the substantial presence of applicants of color in the FHA market, we focused on conventional home loans for several reasons.

> **Property type**
>
> Prospective borrowers submit loan applications for various types of structures: one- to four-unit properties, multifamily properties and manufactured homes. For this analysis, we focused on one- to four-unit properties.

> **Occupancy**
>
> We included only borrowers who said they planned to live in the house they were looking to buy. We did this to exclude developers or individuals who were buying property as an investment or to subsequently flip it.

> **Action Type**
>
> We wanted to look at the reasons lending institutions deny people a mortgage. After conversations with former officials at HUD, we decided to include only those applications that resulted in originations (action type 1) or denials (action type 3)

> **Income**
>
> An applicant’s income isn’t always reported in the data. In other cases, the data cuts off any incomes over \$9.9 million and any loan amounts over \$99.9 million, meaning there’s a value in the database, but it’s not precise. We focused only on those records where income and loan amount have an accurate estimation. This meant discarding about 1 percent of all conventional home loans in the country for 2016. [Note: I already edited this]
>
> When we plotted the number of applicants, how much money they made and the size of the loan, we found that it skewed to the left, meaning the majority of applicants were clustered on the lower end of the income and loan amount scales. This was especially true for applicants of color. We took the logarithm transformation of income and loan amount to normalize the distribution of those variables and limit the effect of extreme outliers.

> **Lien status**
>
> We included all cases in our analysis regardless of lien status.

> **Race and ethnicity**
>
> At first, we looked at race separate from ethnicity, but that approach introduced too many instances in which either the ethnicity or race was unknown. So we decided to combine race and ethnicity. Applicants who marked their ethnicity as Hispanic were grouped together as Hispanic/Latino regardless of race. Non-Hispanic applicants, as well as those who didn’t provide an ethnicity, were grouped together by race: non-Hispanic white, non-Hispanic black, etc. **[Note: This has already been taken care of]**

> **Loan purpose**
>
> We decided to look at home purchase, home improvement and refinance loans separately from each other. [Note: please look at **home purchase** loans.]

Use the text above (it's from the whitepaper) and the **2015HMDACodeSheet.pdf** code book to filter the dataset.

* **Tip:** there should be between 5 and 8 filters, depending on how you write them.

In [14]:
df_mor.columns

Index(['census_tract', 'county_code', 'state_code', 'applicant_sex', 'income',
       'loan_amount', 'loan_type', 'property_type', 'occupancy', 'action_type',
       'loan_purpose', 'agency_code', 'tract_to_msa_income_percent',
       'applicant_race', 'co_applicant_sex', 'log_income', 'log_loan_amount',
       'loan_income_ratio', 'co_applicant'],
      dtype='object')

In [15]:
#Loan type
df_mor = df_mor[df_mor['loan_type'] == 1]

In [16]:
#Property type
df_mor = df_mor[df_mor['property_type'] == 1]

In [17]:
#Occupancy
df_mor = df_mor[df_mor['occupancy'] == 1]

In [18]:
#Action Type
df_mor = df_mor[df_mor['action_type'].isin([1, 3])]

In [19]:
#Income
#we've done this already

In [20]:
#Loan purpose
df_mor = df_mor[df_mor['loan_purpose'] == 1]

In [21]:
df_mor2 = df_mor.copy()

When you're done filtering, save your dataframe as a "copy" with `df = df.copy()` (if it's called `df`, of course). This will prevent irritating warnings when you're trying to create new columns.
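
The filters can also be combined into a single boolean mask before taking the copy. A sketch on a tiny made-up dataframe with the same filter columns (the rows are invented for illustration):

```python
import pandas as pd

# Tiny made-up frame with the same filter columns as the mortgage data
df = pd.DataFrame({
    "loan_type":     [1, 1, 2, 1],
    "property_type": [1, 1, 1, 2],
    "occupancy":     [1, 1, 1, 1],
    "action_type":   [1, 3, 1, 3],
    "loan_purpose":  [1, 1, 1, 1],
})

keep = (
    (df["loan_type"] == 1)            # conventional loans
    & (df["property_type"] == 1)      # one- to four-unit properties
    & (df["occupancy"] == 1)          # owner-occupied
    & df["action_type"].isin([1, 3])  # originated or denied
    & (df["loan_purpose"] == 1)       # home purchase
)

# .copy() gives an independent dataframe, so adding columns later won't warn
filtered = df[keep].copy()
print(filtered.shape)
```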

### Confirm that you have 10,107 loans with 19 columns

In [22]:
df_mor2.shape

(10107, 19)

### Create a "loan denied" column

Right now the `action_type` column reflects whether the loan was granted or not, and has a value of either `1` or `3`.

Create a new column called `loan_denied`, where the value is `0` if the loan was accepted and `1` if the loan was denied. **This will be our target for the machine learning algorithm.**

* **Tip:** You should have 8,878 successful loans and 1,229 denied loans

In [23]:
df_mor2['loan_denied'] = df_mor2['action_type'].astype(str)

In [24]:
df_mor2['loan_denied'] = df_mor2['loan_denied'].str.replace("1", "0")
df_mor2['loan_denied'] = df_mor2['loan_denied'].str.replace("3", "1")

In [25]:
df_mor2.loan_denied.value_counts()

0    8878
1    1229
Name: loan_denied, dtype: int64
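
The cells above leave `loan_denied` as text (`'0'` and `'1'`), which shows up as `object` in `.info()`. A sketch of an alternative that produces integers directly, shown on a made-up series:

```python
import pandas as pd

# Stand-in for the filtered action_type column: 1 = originated, 3 = denied
action_type = pd.Series([1, 3, 1, 1, 3])

# The comparison gives True/False; astype(int) turns that into 1/0
loan_denied = (action_type == 3).astype(int)

print(loan_denied.value_counts().to_dict())
```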

# Deal with categorical variables

Let's go ahead and take a look at our categorical variables:

* Applicant sex (male, female, na)
* Applicant race
* Mortgage agency
* Co-applicant (yes, no, unknown)

Before we do anything crazy, let's use the codebook to turn them into strings.

* **Tip:** We already did this with the `co_applicant` column, you only need to do the rest
* **Tip:** Just use `.replace`

In [26]:
df_mor2['applicant_sex'] = df_mor2.applicant_sex.astype(str)
df_mor2['agency_code'] = df_mor2.agency_code.astype(str)
df_mor2['applicant_race'] = df_mor2.applicant_race.astype(str)
df_mor2['co_applicant'] = df_mor2.co_applicant.astype(str)
df_mor2['co_applicant_sex'] = df_mor2.co_applicant_sex.astype(str)

In [27]:
df_mor2['applicant_sex'] = df_mor2['applicant_sex'].str.replace("1", "Male")
df_mor2['applicant_sex'] = df_mor2['applicant_sex'].str.replace("2", "Female")
df_mor2['applicant_sex'] = df_mor2['applicant_sex'].str.replace("3", "Unknown")

In [28]:
df_mor2['co_applicant_sex'] = df_mor2['co_applicant_sex'].str.replace("1", "Male")
df_mor2['co_applicant_sex'] = df_mor2['co_applicant_sex'].str.replace("2", "Female")
df_mor2['co_applicant_sex'] = df_mor2['co_applicant_sex'].str.replace("3", "Unknown")
df_mor2['co_applicant_sex'] = df_mor2['co_applicant_sex'].str.replace("4", "Unknown")
df_mor2['co_applicant_sex'] = df_mor2['co_applicant_sex'].str.replace("5", "na")

In [29]:
df_mor2['applicant_race'] = df_mor2['applicant_race'].str.replace("1", "American Indian or Alaska Native")
df_mor2['applicant_race'] = df_mor2['applicant_race'].str.replace("2", "Asian")
df_mor2['applicant_race'] = df_mor2['applicant_race'].str.replace("3", "Black or African American")
df_mor2['applicant_race'] = df_mor2['applicant_race'].str.replace("4", "Native Hawaiian or Other Pacific Islander")
df_mor2['applicant_race'] = df_mor2['applicant_race'].str.replace("5", "White")
df_mor2['applicant_race'] = df_mor2['applicant_race'].str.replace("6", "Unknown")
df_mor2['applicant_race'] = df_mor2['applicant_race'].str.replace("99", "Hispanic and Latino")

In [30]:
df_mor2['agency_code'] = df_mor2['agency_code'].str.replace("1", "OCC")
df_mor2['agency_code'] = df_mor2['agency_code'].str.replace("2", "FRS")
df_mor2['agency_code'] = df_mor2['agency_code'].str.replace("3", "FDIC")
df_mor2['agency_code'] = df_mor2['agency_code'].str.replace("5", "NCUA")
df_mor2['agency_code'] = df_mor2['agency_code'].str.replace("7", "HUD")
df_mor2['agency_code'] = df_mor2['agency_code'].str.replace("9", "CFPB")

In [31]:
df_mor2['co_applicant'].value_counts()

No         6079
Yes        3749
Unknown     279
Name: co_applicant, dtype: int64

In [32]:
df_mor2[['applicant_sex', 'agency_code', 'applicant_race', 'co_applicant']].head()

Unnamed: 0,applicant_sex,agency_code,applicant_race,co_applicant
42,Female,OCC,White,No
43,Unknown,OCC,Unknown,Unknown
46,Male,OCC,White,No
48,Female,OCC,Asian,No
51,Female,OCC,Asian,No


In [33]:
df_mor2.head()

Unnamed: 0,census_tract,county_code,state_code,applicant_sex,income,loan_amount,loan_type,property_type,occupancy,action_type,loan_purpose,agency_code,tract_to_msa_income_percent,applicant_race,co_applicant_sex,log_income,log_loan_amount,loan_income_ratio,co_applicant,loan_denied
42,4019.0,45,42,Female,59,112,1,1,1,1,1,OCC,133.09,White,na,4.077537,4.718499,0.86416,No,0
43,4099.02,45,42,Unknown,177,375,1,1,1,1,1,OCC,208.56,Unknown,Unknown,5.17615,5.926926,0.873328,Unknown,0
46,4102.0,45,42,Male,150,381,1,1,1,1,1,OCC,215.35,White,na,5.010635,5.942799,0.843144,No,0
48,312.0,101,42,Female,65,136,1,1,1,1,1,OCC,93.11,Asian,na,4.174387,4.912655,0.849721,No,0
51,4036.01,45,42,Female,55,196,1,1,1,1,1,OCC,141.83,Asian,na,4.007333,5.278115,0.759236,No,0


Double-check these columns match these values in the first three rows (and yes, you should have a lot of other columns, too).

|applicant_sex|agency_code|applicant_race|co_applicant|
|---|---|---|---|
|female|OCC|white|no|
|na|OCC|na|unknown|
|male|OCC|white|no|

## Dummy variables

Let's say we're at the end of the homework, and we have a column called `sex`, where `0` is female and `1` is male. After we've done our regression, we can look at the coefficient/odds ratio for `sex` and say something like **"being male gives you 1.5x the odds of being denied a loan."**

We can say this because we're looking at one column: changing `sex` from `0` to `1` turns the applicant male and multiplies their odds of being denied by 1.5 (the odds ratio).
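
Odds and probability are not the same thing, so an odds ratio of 1.5 does not mean the probability of denial is 1.5 times higher. A small worked sketch, assuming a hypothetical 10% baseline denial rate (the function names are made up for illustration):

```python
def probability_to_odds(probability):
    """Odds = chance of the event happening divided by chance of it not happening."""
    return probability / (1 - probability)

def odds_to_probability(odds):
    """Convert odds back into a probability."""
    return odds / (1 + odds)

# Hypothetical baseline: 10% of female applicants are denied
baseline_odds = probability_to_odds(0.10)

# An odds ratio of 1.5 multiplies the odds, not the probability
male_odds = baseline_odds * 1.5
male_probability = odds_to_probability(male_odds)

print(round(male_probability, 3))
```

Under that assumed baseline, the denial probability rises from 10% to roughly 14%, not to 15%.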

**But let's say we're looking at a column called `race` instead.** We could do the same `0`/`1` thing with white/minority, but what about white/black/asian? If we try to give them `0`/`1`/`2`, our coefficient/odds ratio interpretation stops working: we don't have a nice True/False dichotomy any more, it's now a *real number*.

* `0`: White
* `1`: Black
* `2`: Asian

Usually with numbers you can say "...for every increase of 1...", but we can't anymore - changing from White to Black (+1) isn't the same as changing from Black to Asian (+1). And you can't subtract Black from Asian to get White. And no, you also can't average together White and Asian to get Black. Just recognize that these aren't numbers, they're categories!

**How can we turn races off and on like we can turn the `sex` variable off and on?** A good option is to make *a `0`/`1` column for each race*. We can then flip each race off and on. These are called **dummy variables**.

In [34]:
pd.get_dummies(df_mor2.applicant_race, prefix='race').head()

Unnamed: 0,race_American Indian or Alaska Native,race_Asian,race_Black or African American,race_Hispanic and Latino,race_Native Hawaiian or Other Pacific Islander,race_Unknown,race_White
42,0,0,0,0,0,0,1
43,0,0,0,0,0,1,0
46,0,0,0,0,0,0,1
48,0,1,0,0,0,0,0
51,0,1,0,0,0,0,0


Seems to take up a lot of space, but it works a lot better.

* The first person is white, so they have a `1` for white and a `0` for every other race
* The second person's race is unknown, so they have a `1` for Unknown and a `0` for every other race
* The next three are White, Asian, and Asian, so they have a `1` under the appropriate column.

When you're looking at the regression output, each column has its own coefficient (and odds ratio). Since each race now has a column, **each race will also have its own odds ratio.** Asian would have one, Black would have one, Latino would have one - now we can look at the effect of each race separately. For example, you could then say something like "being Asian (e.g., `race_asian` going from `0` to `1`) gives you 1.2x the odds of being denied, and being Black gives you 2.1x the odds of being denied."

And no, you're never going to have more than one `1` in a row at the same time.

After you've created your dummy variables, there's one more step which has a real fun name: **one-hot encoding.**

### One-hot encoding

When we have two sexes - male and female - we can flip between them with one binary digit, `0` and `1`.

If we had three races - White, Asian, Black - using `pd.get_dummies` would make three columns, which makes sense on the surface. But why can we put TWO values in ONE column for sex, and it takes THREE columns for the THREE race values?

The truth is, it doesn't have to!

Instead of having three columns, we're only going to have two: **Asian and Black**. And if both of them are `0`? The applicant is white! This is called a **reference category**, and it means **the coefficients/odds ratios for Asian and Black are in reference to a white applicant.** So it isn't "being Black gets you 2.1x the odds of being denied," it's *being Black gets you 2.1x the odds of being denied compared to a white person*. For example:

|race_asian|race_black|person's race|
|---|---|---|
|1|0|Asian|
|0|1|Black|
|0|0|White|
|1|1|Not possible if your source is a single race column|

To create a one-hot encoded variable with a reference category, you write code like this:

In [35]:
dummies_race = pd.get_dummies(df_mor2.applicant_race, prefix='race').drop('race_White', axis=1)
dummies_race.head()

Unnamed: 0,race_American Indian or Alaska Native,race_Asian,race_Black or African American,race_Hispanic and Latino,race_Native Hawaiian or Other Pacific Islander,race_Unknown
42,0,0,0,0,0,0
43,0,0,0,0,0,1
46,0,0,0,0,0,0
48,0,1,0,0,0,0
51,0,1,0,0,0,0


> We usually use `.drop(columns=...)` to drop columns, but I'm using `axis=1` here because you should be familiar with it
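
Dropping the reference column loses no information: a row of all zeros can only belong to the dropped category. A quick sketch on made-up applicants:

```python
import pandas as pd

# Made-up applicants
race = pd.Series(["White", "Asian", "Black", "White"])

# Dummy columns for every race except the reference category
dummies = pd.get_dummies(race, prefix="race").drop("race_White", axis=1)

# If every remaining dummy is 0, the applicant must be in the reference category
is_reference = dummies.sum(axis=1) == 0
print(is_reference.tolist())
```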

### Make a one-hot encoded `sex` category with `female` as the reference category

You should end up with two columns: `sex_male` and `sex_na`.

In [36]:
dummies_sex = pd.get_dummies(df_mor2.applicant_sex, prefix='sex').drop('sex_Female', axis=1)
dummies_sex.head()

Unnamed: 0,sex_Male,sex_Unknown
42,0,0
43,0,1
46,1,0
48,0,0
51,0,0


In [37]:
dummies_co_applicant = pd.get_dummies(df_mor2.co_applicant, prefix='co').drop('co_No', axis=1)
dummies_co_applicant.head()

Unnamed: 0,co_Unknown,co_Yes
42,0,0
43,1,0
46,0,0
48,0,0
51,0,0


In [38]:
dummies_agency = pd.get_dummies(df_mor2.agency_code, prefix='agency').drop('agency_FDIC', axis=1)
dummies_agency.head()

Unnamed: 0,agency_CFPB,agency_FRS,agency_HUD,agency_NCUA,agency_OCC
42,0,0,0,0,1
43,0,0,0,0,1
46,0,0,0,0,1
48,0,0,0,0,1
51,0,0,0,0,1


## Using one-hot encoded columns

Since these one-hot encoded variables are standalone dataframes, we eventually need to combine them into our original dataframe.

We have four categorical variables - sex, race, co-applicant, and the loan agency - so we need you to **make four one-hot encoded variables**. Name them like this:

* `dummies_sex` - reference category of female
* `dummies_race` - reference category of white
* `dummies_co_applicant` - reference category of no
* `dummies_agency` - reference category of FDIC

Typically your reference category is the most common category, because it makes for the most interesting comparisons.

> **Tip:** if you're cutting and pasting from above, watch out for `.head()`
>
> **Tip:** After you've made them, use `.head(2)` to check the first couple rows of each to make sure they look okay
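
If you want to check which category is the most common before choosing a reference, `value_counts` will tell you. A sketch on a made-up column (this exercise specifies its reference categories, so this is only for checking):

```python
import pandas as pd

# Made-up column standing in for a categorical variable such as agency
agency = pd.Series(["OCC", "HUD", "HUD", "FDIC", "HUD", "OCC"])

# idxmax returns the label with the highest count
most_common = agency.value_counts().idxmax()

# Drop the dummy column for that category so it becomes the reference
dummies = pd.get_dummies(agency, prefix="agency").drop(columns=f"agency_{most_common}")
print(most_common, list(dummies.columns))
```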

## Cleaning up our old dataframe

Take a look at your original dataframe real quick.

In [39]:
df_mor2.head(1)

Unnamed: 0,census_tract,county_code,state_code,applicant_sex,income,loan_amount,loan_type,property_type,occupancy,action_type,loan_purpose,agency_code,tract_to_msa_income_percent,applicant_race,co_applicant_sex,log_income,log_loan_amount,loan_income_ratio,co_applicant,loan_denied
42,4019.0,45,42,Female,59,112,1,1,1,1,1,OCC,133.09,White,na,4.077537,4.718499,0.86416,No,0


In [40]:
numeric=df_mor2[['census_tract','county_code','state_code','action_type','log_income','log_loan_amount','loan_income_ratio','loan_denied']]

In [41]:
numeric.shape

(10107, 8)

We don't need all of those columns! If we look at the list of columns we'll be using for the regression:

* Race/Ethnicity
* Sex
* Whether there was a co-applicant
* Applicant’s annual income (includes co-applicant income)
* Loan amount
* Ratio between the loan amount and the applicant’s income
* Ratio between the median income of the census tract and the median income of the metro area
* Racial and ethnic breakdown by percentage for each census tract
* Regulating agency of the lending institution

We can keep anything in that list, and remove everything else. For example, we can drop the variables we used to create the dummy variables, as we'll be adding the one-hot encoded versions in for the next step.

For "Racial and ethnic breakdown by percentage for each census tract" we'll need to join with some census data later, so we need to also keep census tract, county code and state code.

**Build a new dataframe with only the columns we're interested in, call it `numeric`.** We're calling it `numeric` because it's mostly numeric columns after the categorical ones have been removed.

> **Tip:** You can either use `.drop(columns=` to remove unwanted columns or `df = df[['col1', 'col2', ... 'col12']]` to only select the ones you're interested in
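
The feature list above includes the tract-to-metro income ratio, which the mortgage data stores as `tract_to_msa_income_percent`. One way to read the instructions, sketched on a single row with a subset of the columns, keeps that column together with the three merge keys, the three income-related columns and the target, for eight columns in total:

```python
import pandas as pd

# One row with a subset of the columns available at this point
df = pd.DataFrame([{
    "census_tract": "4019.0", "county_code": 45, "state_code": 42,
    "applicant_sex": "Female", "income": 59, "loan_amount": 112,
    "action_type": 1, "agency_code": "OCC",
    "tract_to_msa_income_percent": 133.09, "applicant_race": "White",
    "log_income": 4.077537, "log_loan_amount": 4.718499,
    "loan_income_ratio": 0.86416, "co_applicant": "No", "loan_denied": 0,
}])

keep = [
    "census_tract", "county_code", "state_code",  # needed for the census merge
    "tract_to_msa_income_percent",                # tract income vs. metro income
    "log_income", "log_loan_amount", "loan_income_ratio",
    "loan_denied",                                # the regression target
]
numeric = df[keep]
print(numeric.shape)
```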

In [42]:
numeric.head(1)

Unnamed: 0,census_tract,county_code,state_code,action_type,log_income,log_loan_amount,loan_income_ratio,loan_denied
42,4019.0,45,42,1,4.077537,4.718499,0.86416,0


Confirm that `numeric` has 8 columns.

In [43]:
numeric.shape

(10107, 8)

### Combining our features

We now have 1 dataframe of numeric features (and some merge columns), and 4 one-hot-encoded variables (each with their own dataframe). Combine all five dataframes into one large dataframe called `loan_features`.

In [44]:
loan_features = numeric.merge(dummies_sex, left_index=True, right_index=True)

In [45]:
loan_features = loan_features.merge(dummies_race, left_index=True, right_index=True)

In [46]:
loan_features = loan_features.merge(dummies_co_applicant, left_index=True, right_index=True)

In [47]:
loan_features = loan_features.merge(dummies_agency, left_index=True, right_index=True)
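
Because all five dataframes share the same index, the chained merges can also be written as a single `pd.concat`. A sketch with made-up pieces:

```python
import pandas as pd

# Made-up pieces sharing the same index, like numeric and the dummy frames
numeric = pd.DataFrame({"log_income": [4.1, 5.2]}, index=[42, 43])
dummies_sex = pd.DataFrame({"sex_Male": [0, 0]}, index=[42, 43])
dummies_co = pd.DataFrame({"co_Yes": [0, 1]}, index=[42, 43])

# axis=1 places the frames side by side, lining rows up by index
loan_features = pd.concat([numeric, dummies_sex, dummies_co], axis=1)
print(loan_features.shape)
```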

In [48]:
loan_features.head(1)

Unnamed: 0,census_tract,county_code,state_code,action_type,log_income,log_loan_amount,loan_income_ratio,loan_denied,sex_Male,sex_Unknown,...,race_Hispanic and Latino,race_Native Hawaiian or Other Pacific Islander,race_Unknown,co_Unknown,co_Yes,agency_CFPB,agency_FRS,agency_HUD,agency_NCUA,agency_OCC
42,4019.0,45,42,1,4.077537,4.718499,0.86416,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [49]:
numeric.head(1)

Unnamed: 0,census_tract,county_code,state_code,action_type,log_income,log_loan_amount,loan_income_ratio,loan_denied
42,4019.0,45,42,1,4.077537,4.718499,0.86416,0


In [50]:
dummies_sex.head(1)

Unnamed: 0,sex_Male,sex_Unknown
42,0,0


In [51]:
dummies_race.head(1)

Unnamed: 0,race_American Indian or Alaska Native,race_Asian,race_Black or African American,race_Hispanic and Latino,race_Native Hawaiian or Other Pacific Islander,race_Unknown
42,0,0,0,0,0,0


In [52]:
dummies_co_applicant.head(1)

Unnamed: 0,co_Unknown,co_Yes
42,0,0


In [53]:
dummies_agency.head(1)

Unnamed: 0,agency_CFPB,agency_FRS,agency_HUD,agency_NCUA,agency_OCC
42,0,0,0,0,1


Confirm that `loan_features` has 10,107 rows and 23 columns.

In [54]:
loan_features.shape

(10107, 23)

# Census data

Now we just need the final piece to the puzzle, the census data. Read in the census data file, calling the dataframe `census`.

> **Tip:** As always, be sure to read the tract column in as a string. Interestingly, this time we _don't_ need to worry about the state or county codes in the same way.
>
> **Tip:** You're going to encounter a problem that you find every time you read in a file from the US government!

## Rename some columns

If you like to keep your data extra clean, feel free to rename the columns you're interested in. If not, feel free to skip it!

> **Tip:** Make sure you're using the estimates columns, not the margin of error columns

## Computed columns

According to Reveal's regression output, you'll want to create the following columns:

* Percent Black in tract
* Percent Hispanic/Latino in tract (I hope you know how Hispanic/Latino + census data works by now)
* Percent Asian in tract
* Percent Native American in tract
* Percent Native Hawaiian in tract

Notice that we don't include percent white - **because the percentages add up to 100, percent white is just whatever is left over after the other columns, so we ignore it!** It's similar to a reference category.

> If we want to use buzzwords here, the technical reason we're not using percent white is called **collinearity.** We'll talk more about it on Friday.
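
A rough numerical sketch of the collinearity problem, using made-up tract percentages in which the groups sum to exactly 100 (real census tables have more categories than this). Adding percent white gives the regression a column it could already compute from the others, so the matrix gains a column without gaining any rank:

```python
import numpy as np

# Made-up tracts; columns are pct_black and pct_other
others = np.array([
    [10.0, 5.0],
    [50.0, 20.0],
    [30.0, 30.0],
    [5.0, 60.0],
    [25.0, 25.0],
])

# Percent white is whatever is left over after the other groups
pct_white = 100 - others.sum(axis=1, keepdims=True)
intercept = np.ones((5, 1))

without_white = np.hstack([intercept, others])
with_white = np.hstack([intercept, others, pct_white])

# Rank counts the columns that carry independent information
rank_without = np.linalg.matrix_rank(without_white)
rank_with = np.linalg.matrix_rank(with_white)
print(rank_without, rank_with)
```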

In [55]:
df_cen.head(1)

Unnamed: 0,GISJOIN,YEAR,REGIONA,DIVISIONA,STATE,STATEA,COUNTY,COUNTYA,COUSUBA,PLACEA,...,ADK5M012,ADK5M013,ADK5M014,ADK5M015,ADK5M016,ADK5M017,ADK5M018,ADK5M019,ADK5M020,ADK5M021
0,G0100010020100,2011-2015,,,Alabama,1,Autauga County,1,,,...,21,21,11,11,11,11,11,11,11,11


In [56]:
# ADK5E001 is the total population of each tract; every percentage divides by it

# Black
df_cen['pct_black'] = (df_cen.ADK5E004 / df_cen.ADK5E001) * 100

# Hispanic/Latino
df_cen['pct_hila'] = (df_cen.ADK5E012 / df_cen.ADK5E001) * 100

# Asian
df_cen['pct_asian'] = (df_cen.ADK5E006 / df_cen.ADK5E001) * 100

# Native American
df_cen['pct_nama'] = (df_cen.ADK5E005 / df_cen.ADK5E001) * 100

# Native Hawaiian
df_cen['pct_naha'] = (df_cen.ADK5E007 / df_cen.ADK5E001) * 100

In [57]:
census_features = df_cen[['STATEA', 'COUNTYA', 'TRACTA', 'pct_black','pct_hila', 'pct_asian', 'pct_nama', 'pct_naha']]

## Only keep what we need to join and process

We're only interested in the percentage columns that we computed. Create a new dataframe called `census_features` that is only those columns along with the one we'll need for joining with the mortgage data.

> * **Tip:** Remember we saved state, county and tract codes when working on the loan data

In [58]:
census_features.head(1)

Unnamed: 0,STATEA,COUNTYA,TRACTA,pct_black,pct_hila,pct_asian,pct_nama,pct_naha
0,1,1,20100,7.700205,0.87269,0.616016,0.308008,0.0


Confirm that your first few rows look something like this:
    
|STATEA|COUNTYA|TRACTA|pct_hispanic|pct_black|pct_amer_indian|pct_asian|pct_pac_islander|
|---|---|---|---|---|---|---|---|
|1|1|020100|0.872690|7.700205|0.308008|0.616016|0.000000|
|1|1|020200|0.788497|53.293135|0.000000|2.319109|0.000000|
|1|1|020300|0.000000|18.564690|0.505391|1.381402|0.269542|
|1|1|020400|10.490617|3.662672|1.560027|0.000000|0.000000|
|1|1|020500|0.743287|24.844374|0.000000|3.827929|0.000000|

Your column headers might be different but your numbers should match.

# Merge datasets

Merge `loan_features` and `census_features` into a new dataframe called `merged`.

Unfortunately something is a little different between our `loan_features` and `census_features` census tract columns. You'll need to fix it before you can merge.

## Cleaning

In [59]:
loan_features.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10107 entries, 42 to 60664
Data columns (total 23 columns):
census_tract                                      10107 non-null object
county_code                                       10107 non-null int64
state_code                                        10107 non-null int64
action_type                                       10107 non-null int64
log_income                                        10107 non-null float64
log_loan_amount                                   10107 non-null float64
loan_income_ratio                                 10107 non-null float64
loan_denied                                       10107 non-null object
sex_Male                                          10107 non-null uint8
sex_Unknown                                       10107 non-null uint8
race_American Indian or Alaska Native             10107 non-null uint8
race_Asian                                        10107 non-null uint8
race_Black or African American

In [60]:
# regex=False treats "." as a literal dot rather than the regex "any character"
loan_features['census_tract'] = loan_features.census_tract.str.replace(".", "", regex=False)

In [61]:
loan_features.shape

(10107, 23)

In [62]:
census_features.head(1)

Unnamed: 0,STATEA,COUNTYA,TRACTA,pct_black,pct_hila,pct_asian,pct_nama,pct_naha
0,1,1,20100,7.700205,0.87269,0.616016,0.308008,0.0
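
The tract mismatch is the kind of thing that makes a merge quietly match nothing. Here is a minimal sketch of the fix on made-up rows, assuming the loan tracts are strings such as `0201.00` and the census tracts are integers such as `20100`. The sample values and the `tract_key` column are hypothetical, not taken from the real files.

```python
import pandas as pd

# Made-up rows standing in for the real dataframes (assumption)
loans = pd.DataFrame({'census_tract': ['0201.00', '0202.00']})
census = pd.DataFrame({'TRACTA': [20100, 20200]})

# Strip the literal dot, then put both sides on the same integer dtype
loans['census_tract'] = loans['census_tract'].str.replace('.', '', regex=False)
loans['tract_key'] = loans['census_tract'].astype('int64')
census['tract_key'] = census['TRACTA'].astype('int64')

# Both key columns should now share values and dtype
print(loans[['census_tract', 'tract_key']])
print(loans['tract_key'].dtype, census['tract_key'].dtype)
```

Comparing `dtypes` on both sides before merging is a cheap way to catch a string-versus-integer key mismatch.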


## Do the merge

In [70]:
merge = loan_features.merge(census_features, how='left', left_on=['census_tract', 'county_code', 'state_code'], right_on=['TRACTA', 'COUNTYA', 'STATEA'])

In [71]:
merge.shape

(10107, 31)
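
The row count staying at 10107 tells us the left merge didn't duplicate loans, but it doesn't say whether every loan actually found a census tract. A sketch of one way to check, using the `indicator` and `validate` options of `DataFrame.merge` on toy data (the frames and column names here are invented for illustration):

```python
import pandas as pd

# Toy stand-ins: four loans, census rows for only two of the three tracts
loans = pd.DataFrame({'tract': [1, 1, 2, 3], 'amount': [100, 150, 200, 250]})
census = pd.DataFrame({'tract': [1, 2], 'pct_black': [7.7, 53.3]})

checked = loans.merge(
    census,
    how='left',
    on='tract',
    validate='many_to_one',  # raises if a tract appears twice in census
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Loans with no matching census row show up as 'left_only'
unmatched = int((checked['_merge'] == 'left_only').sum())
print(checked['_merge'].value_counts())
print('unmatched loans:', unmatched)
```

If you try this on the real data, remember to drop the `_merge` column before building `train_df`.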

# Our final dataframe

Drop all of the columns we merged on and save it as `train_df`.

In [94]:
train_df = merge.drop(columns=['action_type','census_tract', 'county_code', 'state_code','TRACTA', 'COUNTYA', 'STATEA'])

Confirm that `train_df` has 10107 rows and 24 columns (it would be 25 if we had only dropped the six merge columns, but we also dropped `action_type`).

In [95]:
train_df.shape

(10107, 24)

## Final cleanup

Because we can't have missing data before we run a regression, check the size of `train_df`, then drop any missing data and check the size again. **Confirm you don't lose any rows.**
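
A minimal sketch of that before-and-after check, shown on a toy frame with one deliberately missing value (the frame is made up; on the real data you'd run the same calls on `train_df`):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (assumption, not the real data)
df = pd.DataFrame({'log_income': [4.1, 3.9, np.nan], 'loan_denied': [0, 1, 1]})

before = df.shape
missing_per_column = df.isna().sum()  # which columns have gaps
cleaned = df.dropna()
after = cleaned.shape

print(missing_per_column)
print('before:', before, 'after:', after)
```

If `before` and `after` match, nothing was dropped. If they differ, `missing_per_column` points at the column to investigate.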

In [101]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10107 entries, 0 to 10106
Data columns (total 24 columns):
log_income                                        10107 non-null float64
log_loan_amount                                   10107 non-null float64
loan_income_ratio                                 10107 non-null float64
loan_denied                                       10107 non-null object
sex_Male                                          10107 non-null uint8
sex_Unknown                                       10107 non-null uint8
race_American Indian or Alaska Native             10107 non-null uint8
race_Asian                                        10107 non-null uint8
race_Black or African American                    10107 non-null uint8
race_Hispanic and Latino                          10107 non-null uint8
race_Native Hawaiian or Other Pacific Islander    10107 non-null uint8
race_Unknown                                      10107 non-null uint8
co_Unknown                      

In [103]:
train_df['loan_denied'] = train_df.loan_denied.astype(int)

# Performing our regression

## Try with statsmodels

First try to run a logistic regression with statsmodels, because even though sometimes it complains and breaks, the output just looks *so nice*. Instead of `sm.OLS` we'll use `sm.Logit`.

In [104]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10107 entries, 0 to 10106
Data columns (total 24 columns):
log_income                                        10107 non-null float64
log_loan_amount                                   10107 non-null float64
loan_income_ratio                                 10107 non-null float64
loan_denied                                       10107 non-null int32
sex_Male                                          10107 non-null uint8
sex_Unknown                                       10107 non-null uint8
race_American Indian or Alaska Native             10107 non-null uint8
race_Asian                                        10107 non-null uint8
race_Black or African American                    10107 non-null uint8
race_Hispanic and Latino                          10107 non-null uint8
race_Native Hawaiian or Other Pacific Islander    10107 non-null uint8
race_Unknown                                      10107 non-null uint8
co_Unknown                       

In [81]:
import statsmodels
import statsmodels.api as sm

In [105]:
X = train_df.drop(columns='loan_denied')
y = train_df.loan_denied

model = sm.Logit(y, X)
result = model.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.334540
         Iterations 7


Dep. Variable:,loan_denied,No. Observations:,10107.0
Model:,Logit,Df Residuals:,10084.0
Method:,MLE,Df Model:,22.0
Date:,"Thu, 25 Jul 2019",Pseudo R-squ.:,0.09608
Time:,16:26:10,Log-Likelihood:,-3381.2
converged:,True,LL-Null:,-3740.6
Covariance Type:,nonrobust,LLR p-value:,8.425e-138

,coef,std err,z,P>|z|,[0.025,0.975]
log_income,-0.3181,0.086,-3.703,0.000,-0.486,-0.150
log_loan_amount,-0.2678,0.060,-4.481,0.000,-0.385,-0.151
loan_income_ratio,-0.2265,0.163,-1.387,0.165,-0.546,0.093
sex_Male,0.1187,0.070,1.690,0.091,-0.019,0.256
sex_Unknown,-0.1157,0.176,-0.658,0.510,-0.460,0.229
race_American Indian or Alaska Native,1.0722,0.582,1.841,0.066,-0.069,2.213
race_Asian,0.3648,0.104,3.498,0.000,0.160,0.569
race_Black or African American,0.7614,0.114,6.696,0.000,0.538,0.984
race_Hispanic and Latino,0.3300,0.162,2.034,0.042,0.012,0.648
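
The statsmodels table reports log odds, so the coefficients and their confidence bounds need exponentiating before they read as odds ratios. A small sketch that does this for three rows copied from the summary above. With the fitted `result` object, `np.exp(result.params)` and `np.exp(result.conf_int())` should give the same thing for every row.

```python
import numpy as np
import pandas as pd

# Coefficients and 95% bounds copied from the summary table above
reported = pd.DataFrame(
    {
        'coef': [0.7614, 0.3300, 0.1187],
        'ci_low': [0.538, 0.012, -0.019],
        'ci_high': [0.984, 0.648, 0.256],
    },
    index=['race_Black or African American', 'race_Hispanic and Latino', 'sex_Male'],
)

# exp() turns log odds into odds ratios, bounds included
odds = np.exp(reported)
odds.columns = ['odds ratio', 'ci_low', 'ci_high']

# An interval that contains 1 is consistent with "no difference in odds"
odds['interval excludes 1'] = (odds['ci_low'] > 1) | (odds['ci_high'] < 1)
print(odds.round(3))
```

An odds ratio of 1 means no difference, which is why the interval is checked against 1 rather than 0.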


## Try again with scikit-learn

But I'll be honest, I like sklearn a *lot lot lot* better. Using the coefficients to build a dataframe just seems so *nice*.

> **Tip:** When you build your model, use `LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)`. If you don't increase `max_iter` (how long/hard it works), it'll complain it can't find an answer.

In [106]:
# Every column EXCEPT whether the loan was denied
X = train_df.drop(columns='loan_denied')
# label is loan denied 0/1
y = train_df.loan_denied

# Build a new classifier
# C is the inverse of regularization strength, so C=1e9 effectively turns regularization off
# If we don't say solver='lbfgs' it complains that it's the new default
clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)

# Fit the classifier on the loan and census features
clf.fit(X, y)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=4000, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
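
In scikit-learn, `C` is the inverse of the L2 regularization strength, so a huge value like `1e9` leaves the coefficients essentially unpenalized, which is closer to what statsmodels does. A sketch on synthetic data (everything here, including the "true" coefficients, is made up for illustration) showing how a small `C` shrinks the coefficients:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data with known coefficients (assumption, not the mortgage data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
true_coefs = np.array([1.5, -2.0, 0.5])
prob = 1 / (1 + np.exp(-(X_demo @ true_coefs)))
y_demo = (rng.random(500) < prob).astype(int)

# Small C = strong penalty; huge C = almost no penalty
strong_penalty = LogisticRegression(C=0.01, solver='lbfgs', max_iter=4000).fit(X_demo, y_demo)
weak_penalty = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000).fit(X_demo, y_demo)

print('C=0.01:', strong_penalty.coef_[0].round(3))
print('C=1e9: ', weak_penalty.coef_[0].round(3))
```

Shrunken coefficients mean shrunken odds ratios, which is why regularization is switched off when the goal is interpreting coefficients rather than predicting.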

### Getting your coefficients and odds ratios

After you run your regression **using sklearn**, you can use code like the below to print out an ordered list of features, coefficients, and odds ratios.

```python
feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients)
}).sort_values(by='odds ratio', ascending=False)
```

In [108]:
feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients)
}).sort_values(by='odds ratio', ascending=False)

Unnamed: 0,feature,coefficient (log odds ratio),odds ratio
16,agency_NCUA,1.300358,3.670609
9,race_Native Hawaiian or Other Pacific Islander,1.15584,3.17669
13,agency_CFPB,1.123566,3.075802
5,race_American Indian or Alaska Native,0.975514,2.652531
7,race_Black or African American,0.768796,2.157166
10,race_Unknown,0.45523,1.576536
11,co_Unknown,0.416794,1.517089
6,race_Asian,0.371528,1.449949
17,agency_OCC,0.347756,1.415887
8,race_Hispanic and Latino,0.344234,1.410909


### Wait, what's the odds ratio again?

It's how much that variable affects the outcome **if all other variables stay the same.**
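
More precisely, it's the factor the *odds* of denial get multiplied by, not the probability. A short worked sketch using the `race_Black or African American` coefficient from the table above. The 10% baseline denial probability is a made-up number chosen only to show the arithmetic.

```python
import numpy as np

# Coefficient (log odds ratio) copied from the sklearn table above
coefficient = 0.768796
odds_ratio = np.exp(coefficient)

# Hypothetical baseline: a reference applicant with a 10% chance of denial
baseline_prob = 0.10
baseline_odds = baseline_prob / (1 - baseline_prob)

# The odds ratio multiplies the odds; convert back to a probability afterwards
new_odds = baseline_odds * odds_ratio
new_prob = new_odds / (1 + new_odds)

print('odds ratio:', round(odds_ratio, 3))
print('baseline probability:', baseline_prob)
print('probability after applying the odds ratio:', round(new_prob, 3))
```

The probability rises, but by less than the odds ratio itself, and the gap grows as the baseline probability gets larger.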

# Interpreting and thinking about the analysis

### Question 1

Our results aren't exactly the same as Reveal's, as I pulled a slightly different number of rows from the database and I'm not sure what exact dataset they used for census information. How are we feeling about this reproduction? **You might want to check their 2015 results in the whitepaper.**

In [None]:
# seems ok to me

### Question 2

In the opening paragraph to the flagship piece, [Aaron and Emmanuel write](https://www.revealnews.org/article/for-people-of-color-banks-are-shutting-the-door-to-homeownership/):

> Fifty years after the federal Fair Housing Act banned racial discrimination in lending, African Americans and Latinos continue to be routinely denied conventional mortgage loans at rates far higher than their white counterparts.

If you look at the results, Hawaiians/Pacific Islanders (and maybe Native Americans) have an even higher odds ratio. **Why do they choose to talk about African Americans and Latinos instead?**

In [None]:
# bc they might have a higher total number of applicants (and potential voters)

### Question 3

Write a sentence expressing the meaning of the **odds ratio** statistic for Black mortgage applicants. Find a line in [the Reveal piece](https://www.revealnews.org/article/for-people-of-color-banks-are-shutting-the-door-to-homeownership/) where they use the odds ratio.

In [None]:
# the odds ratio is relative to white applicants, so it compares black applicants to white applicants.
# "the odds that black applicants will be denied a mortgage are x times higher than for white applicants"

### Question 4

Write a similar sentence about men.

In [110]:
# "the odds that male applicants will be denied a mortgage are x times higher/lower than for female applicants"

### Question 5

Why did Aaron and Emmanuel choose to include the loan-to-income ratio statistic? **You might want to read the whitepaper.**

In [None]:
# bc it represents socio economic correlation

### Question 6

Credit score is a common reason why loans are denied. Why are credit scores not included in our analysis? **You might want to read the whitepaper.**

In [None]:
# it was not included bc it was not publicly available

### Question 7

This data was just sitting out there for anyone to look at, they didn't even need to FOIA it. Why do you think this issue had not come up before Reveal's analysis?

In [None]:
# i'm guessing the anniversary of the federal fair housing act prompted journalists to take a closer look.. maybe..

### Question 8

As a result of this series, [a lot has happened](https://www.revealnews.org/blog/we-exposed-modern-day-redlining-in-61-cities-find-out-whats-happened-since/), although [recent changes don't look so good](https://www.revealnews.org/blog/cfpb-moves-to-limit-home-loan-data/). If you were reporting this story, what groups of people would you want to talk to in order to make sure you're getting the story right?

In [None]:
# affected mortgage applicants
# credit-derivatives specialists (cds)
# policy makers

### Question 9

When they were consulting experts, Aaron and Emmanuel received a lot of conflicting accounts about whether they should include the "N/A" values for race (they ended up including it). If the experts disagreed about something like that, why do you think they went forward with their analysis?

In [None]:
# in general i feel that just bc experts have opposing opinions it doesn't mean that an analysis has to stop.

### Question 10

What if we were working on this story, and our logistic regression or input dataset were flawed? What would be the repercussions?

In [None]:
# When only the regression model is flawed I think it would be easy to fix.
# When the input dataset is flawed it depends: can we still use it or is it completely broken?