# Redlining by County 

The term redlining comes from the practice of the Federal Housing Administration (FHA) and the Home Owners' Loan Corporation (HOLC) of color-coding neighborhood maps to indicate how safe it was to insure mortgages in each neighborhood. Neighborhoods were graded in four categories: A - “Best” (Green), B - “Still Desirable” (Blue), C - “Definitely Declining” (Yellow), D - “Hazardous” (Red). These classifications were primarily racially motivated, placing neighborhoods where minorities lived in the C or D categories.  
In 1968, the Fair Housing Act was passed, making it unlawful to discriminate in terms or conditions on the basis of race or national origin.
In 1974, the Equal Credit Opportunity Act (ECOA) made it unlawful for any creditor to discriminate against any applicant, with respect to any aspect of a credit transaction, on the basis of race, color, religion, national origin, sex, marital status, or age.
Using historical HOLC redlining information, we decided to plot the total redlined area (sq ft) against current county lines. 



In [None]:
# Load libraries
import pandas as pd
import numpy as np
import altair as alt
import geopandas as gpd
import pyspark
import censusdata
from pyspark.sql import SparkSession 
from pyspark.sql.functions  import col, when, lit
from pyspark.sql import functions as f

from vega_datasets import data
alt.data_transformers.disable_max_rows()

from io import StringIO
alt.themes.enable("fivethirtyeight") # visualization theme


### Load dataset HOLC 

Calculate Historical Redlining Score (HRS) by calculating the grade weights. 

1. Calculate the percentage of weighted area. Ex. area_A divided by area_rated.

2. Multiply by the HOLC grade factor: A=1, B=2, C=3, D=4.

3. Final score. The level of redlining goes from 1 to 4, with 1 being low redlining and 4 high.

This methodology was obtained from https://ncrc.org/redlining-score/
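
As a worked illustration of the three steps above, the sketch below computes the HRS for two made-up tracts. The column names mirror those used in the HOLC file, but the `geoid20` values and areas are invented for the example, so treat it as a sanity check of the arithmetic rather than real data.

```python
import pandas as pd

# Hypothetical tracts: identifiers and areas are invented for illustration
toy = pd.DataFrame({
    "geoid20": ["01073000100", "01073000300"],
    "area_A": [0.0, 50.0],
    "area_B": [0.0, 50.0],
    "area_C": [25.0, 0.0],
    "area_D": [75.0, 0.0],
})
toy["area_rated"] = toy[["area_A", "area_B", "area_C", "area_D"]].sum(axis=1)

# Step 1 and 2: share of rated area per grade, multiplied by the grade factor
weights = {"A": 1, "B": 2, "C": 3, "D": 4}
# Step 3: the score is the sum of the weighted shares
toy["HRS"] = sum(
    toy[f"area_{grade}"] / toy["area_rated"] * factor
    for grade, factor in weights.items()
)
print(toy[["geoid20", "HRS"]])
```

The first tract is 25% grade C and 75% grade D, so its score is 0.25 × 3 + 0.75 × 4 = 3.75; the second is split evenly between A and B, giving 1.5.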

In [None]:


holc_rated=pd.read_csv('HOLC_2020_census_tracts.csv', dtype={'geoid20': str})
#calculate % of rated area
holc_rated['A']= holc_rated['area_A']/holc_rated['area_rated']
holc_rated['B']= holc_rated['area_B']/holc_rated['area_rated']
holc_rated['C']= holc_rated['area_C']/holc_rated['area_rated']
holc_rated['D']= holc_rated['area_D']/holc_rated['area_rated']

# use the NCRC methodology to calculate the HRS (Historical Redlining Score)
holc_rated['a']= holc_rated['A']*1
holc_rated['b']= holc_rated['B']*2
holc_rated['c']= holc_rated['C']*3
holc_rated['d']= holc_rated['D']*4
holc_rated['HRS']= holc_rated[['a', 'b', 'c', 'd']].sum(axis=1)

holc_rated['fips']= holc_rated['geoid20'].str[:5]  #extract county code also known as fips
holc_rated.rename(columns={'geoid20':'GEOID'}, inplace=True)
holc_rated
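
The reason for reading `geoid20` as a string can be shown with a small sketch. The two GEOIDs below are only examples: slicing the first five characters gives the county FIPS code, but only if the leading zero has been preserved.

```python
import pandas as pd

# Example tract GEOIDs kept as strings (state 2 + county 3 + tract 6 digits)
geoids = pd.Series(["01073000100", "36061000100"])
county_fips = geoids.str[:5]

# If the same identifiers had been parsed as integers, the leading zero is lost
# and the slice no longer lines up with the county code
as_int = geoids.astype("int64").astype(str)
broken_fips = as_int.str[:5]

print(county_fips.tolist())
print(broken_fips.tolist())
```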

### Loading county codes and geolocation.

We would like to explore if redlining was more prevalent in certain areas of the country. For that we will use two more datasets for plotting.

1. https://github.com/btskinner/spatial/blob/master/data/county_centers.csv

2. https://github.com/kjhealy/fips-codes/blob/master/state_and_county_fips_master.csv

In [None]:
states=pd.read_csv('state_fips.csv', dtype={'fips': str})
states['fips'] = states['fips'].str.zfill(5)
fips= pd.read_csv('fipsnames-20221011-151647.csv', dtype={'fips': str})
fips= pd.merge(fips[['fips', 'clon00', 'clat00']], states[['fips', 'name']],how='left',on='fips')
fips.head(3)

We will join the newly created fips file with the HOLC data to plot counties and their percentage of redlining.

In [None]:
holc_fips= holc_rated[['fips', 'A', 'B', 'C', 'D', 'HRS']].groupby('fips').mean()
holc_fips.reset_index(inplace=True)
holc_fips["id"] = holc_fips["fips"].astype(int)
holc_fips= pd.merge(holc_fips, fips,how='left',on='fips')

#Long form for plotting
holc_fipsL= pd.melt(holc_fips, id_vars=['fips', 'id', 'name', 'clon00', 'clat00', 'HRS' ], value_vars=['A', 'B', 'C', 
    'D'], ignore_index=False)

holc_fips.head(3)
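
To make the wide-to-long reshaping concrete, here is a minimal sketch on an invented two-county table. The values are made up; only the column layout follows the cell above.

```python
import pandas as pd

# Hypothetical county-level shares per grade (values invented for illustration)
toy = pd.DataFrame({
    "fips": ["01073", "36061"],
    "HRS": [3.1, 2.4],
    "A": [0.05, 0.10],
    "B": [0.15, 0.30],
    "C": [0.40, 0.40],
    "D": [0.40, 0.20],
})

# One row per (county, grade): 'variable' holds the grade, 'value' its share
long = toy.melt(id_vars=["fips", "HRS"], value_vars=["A", "B", "C", "D"])
print(long.sort_values(["fips", "variable"]))
```

Each county now contributes four rows, which is what lets the chart draw one circle per grade at the same county location.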

We used Altair to plot all counties with redlining. We can see that the majority of redlined counties are on the East Coast or near metropolitan areas, and that most of the neighborhoods were classified as grade "D".

In [None]:
counties = alt.topo_feature(data.us_10m.url, 'counties')
# US counties background
background2 = alt.Chart(counties).mark_geoshape(
    fill='lightgray',
    stroke='white'
).properties(
    width=800,
    height=500
).project('albersUsa')

# county centers on the background, sized by share of rated area and colored by grade
range_=['#6d904f', '#30a2da', '#e5ae38', '#fc4f30']
points = alt.Chart(holc_fipsL).mark_circle( opacity=0.8,
       stroke='black',
       strokeWidth=1).encode(
    longitude='clon00:Q',
    latitude='clat00:Q',
    size=alt.Size('value:Q', title='% of Area Rated', scale=alt.Scale(range=[0, 500])  ),
    color= alt.Color('variable',scale=alt.Scale( range=range_,),  title='Grade'),
    tooltip=['name:N','HRS:Q']
).properties(
    title='Historical Redlining by 2021 County Lines'
)

mapCounties=background2 + points
mapCounties.configure(background='#FFFFFF')

### Census Data

Next, we will compare the demographics of ungraded and graded areas. For this we will use census data available through the Python library CensusData. The library documentation can be found at https://pypi.org/project/CensusData/. 

In [None]:
# download census data
county_pop = censusdata.download('acs5', 2015, censusdata.censusgeo([('county', '*')]),
                                ['B02001_001E', 'B02001_002E', 'B25081_001E', 'B25081_008E', 
                                'B25002_001E', 'B25002_002E', 'B25002_003E'])
county_pop.rename(columns={'B02001_001E':'population_total', 'B02001_002E':'white_pop',
  'B25081_001E':'total_houses','B25081_008E':'houses_wo_mortgage','B25002_001E': 'occupancy_total', 
  'B25002_002E': 'occupied', 'B25002_003E': 'Vacant'}, inplace=True)

county_pop.reset_index(inplace=True)
county_pop

Since the census data does not contain the five-digit fips code for each county, we will build it from the state and county codes stored in the index.

In [None]:
#extract state and 3 digit county code. And build fips code. 
county_pop['state']= county_pop['index'].astype(str).str.extract(r'(state:\d{2})')
county_pop['county']= county_pop['index'].astype(str).str.extract(r'(county:\d{3})')
county_pop['county']= county_pop['county'].str.replace("county:", "")
county_pop['state']= county_pop['state'].str.replace("state:", "")
county_pop['fips']= county_pop['state']+ county_pop['county']
county_pop.drop(columns=['state', 'county'], inplace=True)
county_pop
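
The extraction can also be written with a single regular expression that uses named groups. This is only a sketch: the sample string below is an assumption about how a `censusgeo` index entry looks once converted with `str()`, so check it against the actual `county_pop['index']` values before relying on it.

```python
import pandas as pd

# Assumed string form of a censusgeo index entry (verify against real data)
idx = pd.Series([
    "Autauga County, Alabama: Summary level: 050, state:01> county:001",
])

# Named groups become the column names of the extracted DataFrame
parts = idx.str.extract(r"state:(?P<state>\d{2})> county:(?P<county>\d{3})")
fips = parts["state"] + parts["county"]
print(fips.tolist())
```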

We will calculate the vacancy, mortgage and minority percentages

In [None]:
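# vacancy share uses all housing units (B25002_001E) as the denominator;
# 'total_houses' (B25081_001E) only counts owner-occupied units
county_pop['vacant_perc']= county_pop['Vacant']/county_pop['occupancy_total']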
county_pop['mortgage_perc']= 1-(county_pop['houses_wo_mortgage']/county_pop['total_houses'])
county_pop['minority_perc']= 1-(county_pop['white_pop']/county_pop['population_total'])
county_pop

### Joining redlining and census data 

We will join the HRS with current census data, to see the demographic composition and HRS grading together.  

The minority percentage steadily increases from ungraded areas to grade D. We can also see that the percentage of vacant units doubles between grade A and grade D areas. The mortgage percentage is similar across all areas. 

In [None]:
#merge HOLC, and census data
fips_rated= pd.merge(holc_fips, county_pop[['fips', 'population_total', 'total_houses', 'vacant_perc',
    'mortgage_perc', 'minority_perc']],
    how='right',on='fips')
fips_rated['HRS'] = fips_rated['HRS'].fillna(0.1)
fips_rated['grade'] = pd.cut(fips_rated['HRS'], bins=[0,1, 1.75, 2.49, 3.3, 4], labels=['Ungraded', 'A','B', 
    'C', 'D'])

# aggregate values by HRS grading 
names = {'population_total':'population_total', 'total_houses':'total_houses','mortgage_perc':'mean_%mortgage',
     'minority_perc':'mean_%minority', 'vacant_perc':'mean_%vacant', 'HRS':'mean_HRS'}
fips_ratedagg= fips_rated.groupby('grade').agg({'population_total':'sum', 'total_houses':'sum',
    'mortgage_perc':'mean', 'minority_perc':'mean', 'vacant_perc':'mean', 'HRS':'mean' }).rename(columns=names)
fips_ratedagg
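
The binning step is easier to follow on a handful of hand-picked scores. The sketch below reuses the same bin edges and labels; the HRS values are invented. One thing it makes visible is that `pd.cut` intervals are closed on the right by default, so a score of exactly 1.0 falls in the first bin together with the 0.1 placeholder used for ungraded counties.

```python
import pandas as pd

# Invented scores: 0.1 is the placeholder for counties with no HOLC rating
hrs = pd.Series([0.1, 1.0, 1.5, 2.0, 3.0, 4.0])

grades = pd.cut(
    hrs,
    bins=[0, 1, 1.75, 2.49, 3.3, 4],
    labels=["Ungraded", "A", "B", "C", "D"],
)
print(pd.DataFrame({"HRS": hrs, "grade": grades}))
```

If counties rated entirely grade A should be kept apart from ungraded ones, one possible adjustment is to lower the first inner edge (for example to 0.5), since the placeholder is the only value below 1.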

### Loan applications and HRS

Since the minority population and the share of vacant units both increase significantly as the HRS grade increases, we will explore loan performance to see whether a relationship exists with HRS. 

In [None]:
#pyspark session to load loan data 


spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName('My First Spark application') \
    .getOrCreate()
sc = spark.sparkContext

In [None]:
df_hm = spark.read.option("header",True) \
     .csv("2021_public_lar.csv")
df_hm.show(2,truncate=False)

We will focus on single-family homes that are not used for business purposes. We will also exclude any applications that were closed for lack of documentation or withdrawn by the applicant. 

In [None]:
# keep home-purchase loans for personal (non-business) use on principal residences;
# drop withdrawn (4) and incomplete (5) applications. See the data cleaning specs for the remaining filters.
df_hm_cleaned = df_hm.select('*')\
    .filter((df_hm.business_or_commercial_purpose == 2) & (df_hm.loan_purpose ==1) &
            (df_hm.occupancy_type ==1)& (df_hm.action_taken !=4) &
            (df_hm.action_taken !=5) & (df_hm.loan_type ==1)&
            (df_hm.derived_dwelling_category == 'Single Family (1-4 Units):Site-Built' )&
            (df_hm.derived_loan_product_type == "Conventional:First Lien") &
            (df_hm.conforming_loan_limit == "C") &
            (df_hm.lien_status == 1) &
            (df_hm.reverse_mortgage == 2) &
            (df_hm.open_end_line_of_credit == 2) &
            (df_hm.negative_amortization == 2 ) &
            (df_hm.total_units == 1)&
            (df_hm.balloon_payment ==2))

Since Hispanic is encoded under 'derived_ethnicity' rather than 'derived_race', we will create a combined variable called race. 

In [None]:
# keep only the selected race/ethnicity groups
races=['White', 'Black or African American', 'Asian', 'Hispanic or Latino']
df_hm_cleaned= df_hm_cleaned.withColumn('race', \
    f.when(f.col('derived_ethnicity')=='Hispanic or Latino', "Hispanic or Latino")\
    .otherwise(df_hm_cleaned.derived_race))

df_hm_cleaned =df_hm_cleaned.select('*').filter(df_hm_cleaned.race.isin(races))
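
The same rule can be sketched in pandas on a few invented rows, which makes the precedence explicit: the ethnicity flag wins whenever it says "Hispanic or Latino", otherwise the derived race is used, and rows outside the four groups are dropped. The rows below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Invented applicants covering each branch of the rule
toy = pd.DataFrame({
    "derived_ethnicity": [
        "Hispanic or Latino",
        "Not Hispanic or Latino",
        "Not Hispanic or Latino",
        "Not Hispanic or Latino",
    ],
    "derived_race": ["White", "Asian", "Joint", "Black or African American"],
})

# Ethnicity takes precedence over race when it is Hispanic or Latino
toy["race"] = np.where(
    toy["derived_ethnicity"] == "Hispanic or Latino",
    "Hispanic or Latino",
    toy["derived_race"],
)

races = ["White", "Black or African American", "Asian", "Hispanic or Latino"]
kept = toy[toy["race"].isin(races)]
print(kept["race"].tolist())
```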

We will calculate the total number of applications, approvals and mean interest rate, creating a consolidated pandas data frame. 

In [None]:
count_group=df_hm_cleaned.groupBy('census_tract',"race").count()

approvals=df_hm_cleaned.filter(col('action_taken').isin([1,2,6,8]))\
    .groupBy('census_tract','race').count().withColumnRenamed("count","approved")\
    .withColumnRenamed('race',"race2").withColumnRenamed("census_tract","census")

interest_rate=df_hm_cleaned.groupBy('census_tract',"race").agg(f.mean('interest_rate'))\
    .withColumnRenamed('census_tract','census_tract2').withColumnRenamed("race","race3")

approv_index=count_group.join(approvals,(count_group.race == approvals.race2)\
    & (count_group.census_tract == approvals.census),"left")\
    .join(interest_rate, (count_group.race == interest_rate.race3)\
    & (count_group.census_tract == interest_rate.census_tract2),"left").toPandas()

We would like to add the percentage of approvals to our new data frame. 

In [None]:
approv_index.drop(['race2', 'census', 'race3', 'census_tract2'], axis=1, inplace=True)
# tract/race groups with no approvals come out of the left join as NaN; set them to 0
approv_index['approved'] = approv_index['approved'].fillna(0)
approv_index['approval_perc']= approv_index['approved']/approv_index['count']
approv_index
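
A small pandas sketch of the same logic, with invented counts, shows why the missing values are filled with 0: a tract and race combination that has applications but no approvals has no row on the approvals side of the left join.

```python
import pandas as pd

# Invented application and approval counts per (tract, race)
counts = pd.DataFrame({
    "census_tract": ["T1", "T1", "T2"],
    "race": ["White", "Asian", "White"],
    "count": [10, 4, 5],
})
approved = pd.DataFrame({
    "census_tract": ["T1", "T2"],
    "race": ["White", "White"],
    "approved": [8, 5],
})

merged = counts.merge(approved, on=["census_tract", "race"], how="left")
merged["approved"] = merged["approved"].fillna(0)  # no approvals recorded -> 0
merged["approval_perc"] = merged["approved"] / merged["count"]
print(merged)
```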

In [None]:
total_applications= approv_index.groupby('race')[['count']].sum()
total_applications

Our data frame encodes the variable race in a single column. We will reshape it so that each race has its own columns, in order to calculate the correlation with HRS. 

In [None]:
ai_long= pd.pivot_table(approv_index, values=['approval_perc', 'avg(interest_rate)'], index=['census_tract'],
                            columns=['race'])
ai_long.columns = ai_long.columns.droplevel()


  
# Rename the columns (order: approval % for each race, then interest rate for each race)
columns_ = ['asian_%approv', 'black_%approv', 'hisp_%approv', 'white_%approv', 'asian_interest', 'black_interest',
     'hisp_interest', 'white_interest']
ai_long.columns = columns_
ai_long.reset_index(inplace=True)
ai_long
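
Renaming by position depends on the column order that `pivot_table` produces. As a hedged alternative, the sketch below builds the names from the column MultiIndex itself, so a missing race or a different sort order cannot silently mislabel a column. The data and the resulting column names are illustrative and differ from the short names used in the cell above.

```python
import pandas as pd

# Invented long-form rows: one per (tract, race)
toy = pd.DataFrame({
    "census_tract": ["T1", "T1", "T2", "T2"],
    "race": ["Asian", "White", "Asian", "White"],
    "approval_perc": [0.9, 0.8, 0.7, 1.0],
    "avg(interest_rate)": [3.1, 3.0, 3.4, 2.9],
})

wide = pd.pivot_table(
    toy,
    values=["approval_perc", "avg(interest_rate)"],
    index="census_tract",
    columns="race",
)
# Each column label is a (metric, race) tuple; join them into a flat name
wide.columns = [f"{race}_{metric}" for metric, race in wide.columns]
wide = wide.reset_index()
print(wide.columns.tolist())
```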

We will bring over our previously calculated HRS from 'holc_rated' data set 

In [None]:
ai_merged= pd.merge(ai_long, holc_rated[['GEOID', 'HRS']], left_on='census_tract', right_on='GEOID',
    how='inner').drop(columns = ['GEOID'])
ai_merged.head(3)

Calculate the correlation between interest rate and approval by race and HRS

In [None]:
# interest rate correlation (census_tract is an identifier, so it is left out)
interest_corr=ai_merged[['HRS', 'asian_interest', 'black_interest', 'hisp_interest',
       'white_interest']].corr().round(2).stack().reset_index()
# create long format to plot correlation 
interest_corr.rename(columns={0: 'corr_pearson', 'level_0': 'variable', 'level_1': 'variable2'}, inplace=True)

# approval percentage correlation (census_tract left out for the same reason)
approval_corr=ai_merged[['HRS', 'asian_%approv', 'black_%approv', 'hisp_%approv',
       'white_%approv']].corr().round(2).stack().reset_index()
# create long format to plot correlation 
approval_corr.rename(columns={0: 'corr_pearson', 'level_0': 'variable', 'level_1': 'variable2'}, inplace=True)
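
To show what the `corr().stack().reset_index()` chain produces, here is a minimal sketch on invented data. The column names `x` and `y` are placeholders. The final filter mirrors the `variable < variable2` condition used later in the charts, which keeps each pair once and drops the diagonal.

```python
import pandas as pd

# Invented data: x rises with HRS, y falls with HRS
toy = pd.DataFrame({
    "HRS": [1.0, 2.0, 3.0, 4.0],
    "x": [2.0, 4.0, 6.0, 8.0],
    "y": [4.0, 3.0, 2.0, 1.0],
})

# Square correlation matrix -> one row per (variable, variable2) pair
long = toy.corr().round(2).stack().reset_index()
long.columns = ["variable", "variable2", "corr_pearson"]

# Keep each unordered pair once (string comparison, as in the chart filter)
upper = long[long["variable"] < long["variable2"]]
print(upper)
```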

We will also analyze whether HRS relates to the loan profile in other ways, so we will calculate its correlation with loan amount, combined loan to value ratio, property value, debt to income ratio, tract minority population percent, discount points, tract to msa income percentage, and approved.

### Calculate correlation with loan amount. 

We will bring these variables from our cleaned 'df_hm_cleaned' dataset, which originates from our loan data. We select the subset of variables needed for the analysis and merge it with the 'holc_rated' data set. 

In [None]:
tractDF =df_hm_cleaned.select('census_tract', 'loan_amount', 'combined_loan_to_value_ratio', 'property_value', 
    'tract_minority_population_percent', 'tract_to_msa_income_percentage', 'income', 'interest_rate' ).toPandas()

In [None]:
tractMerged= pd.merge(tractDF, holc_rated[['GEOID', 'HRS']], left_on='census_tract', right_on='GEOID',
    how='left').drop(columns = ['GEOID'])

# fill null values in HRS so that unrated tracts fall in the first bin, then create the grade category by binning on intervals
tractMerged['HRS'] = tractMerged['HRS'].fillna(0.1)
tractMerged['grade'] = pd.cut(tractMerged['HRS'], bins=[0,1, 1.75, 2.49, 3.3, 4], labels=['Ungraded', 'A','B', 
    'C', 'D'])


In [None]:
#convert variables to numeric values to calculate correlation. 
colToNumeric=['loan_amount', 'combined_loan_to_value_ratio', 'property_value', 'tract_minority_population_percent',
       'tract_to_msa_income_percentage', 'income', 'HRS']
tractMerged[colToNumeric]= tractMerged[colToNumeric].apply(pd.to_numeric, errors='coerce')
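
The effect of `errors='coerce'` can be seen on a short sketch: any entry that cannot be parsed as a number becomes NaN and is then ignored by `corr()`. The sample strings are illustrative of non-numeric markers that can appear in loan data columns; they are assumptions, not values taken from the file.

```python
import pandas as pd

# Illustrative raw values as they might arrive from a CSV read as strings
raw = pd.Series(["80.5", "Exempt", "NA"])

numeric = pd.to_numeric(raw, errors="coerce")  # unparseable entries -> NaN
print(numeric)
print("rows lost to coercion:", int(numeric.isna().sum()))
```

Counting the NaN values after conversion is a quick way to see how many rows each variable actually contributes to the correlations.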


In [None]:
graded_corr= tractMerged[['loan_amount', 'combined_loan_to_value_ratio', 'property_value', 'tract_minority_population_percent',
       'tract_to_msa_income_percentage', 'income', 'HRS']].corr().round(2).stack().reset_index()
graded_corr.rename(columns={0: 'corr_pearson', 'level_0': 'variable', 'level_1': 'variable2'}, inplace=True)
graded_corr.head(3)

### Plotting correlation.

We could not find significant correlations between HRS and either interest rate or approval percentage by race. The correlations between HRS and approval percentage are all close to zero: Black (0.01), Hispanic (-0.01) and White (-0.02). 

In [None]:
#Pearson correlation matrix for interest rate
basei=alt.Chart(interest_corr).mark_rect().transform_filter(
    alt.datum.variable < alt.datum.variable2
).encode(
    x='variable:O',
    y='variable2:O',
    color= 'corr_pearson:Q'
).properties(
    width=250,
    height=250,
    title=alt.TitleParams(
            text='Interest Rate ')
)




interest= basei+basei.mark_text().transform_calculate(label = '"" + datum.x + datum.y').encode(
    text='corr_pearson:N',
    color=alt.value('black'))

#Pearson correlation matrix for approvals
basea=alt.Chart(approval_corr).mark_rect().transform_filter(
    alt.datum.variable < alt.datum.variable2
).encode(
    x='variable:O',
    y= alt.Y('variable2:O', axis=None ),
    color= 'corr_pearson:Q'
).properties(
    width=250,
    height=250,
    title=alt.TitleParams(
            text='Approval Rate')
)




approval= basea+basea.mark_text().transform_calculate(label = '"" + datum.x + datum.y').encode(
    text='corr_pearson:N',
    color=alt.value('black'))


both=interest|approval
both.configure_title(fontSize=14).configure(background='#FFFFFF').configure_axis(
    grid=False)

In [None]:
#Pearson correlation matrix for other variables
baseo=alt.Chart(graded_corr).mark_rect().transform_filter(
    alt.datum.variable < alt.datum.variable2
).encode(
    x='variable:O',
    y= alt.Y('variable2:O'),
    color= alt.Color('corr_pearson:Q', )
).properties(
    width=250,
    height=250,
    title=alt.TitleParams(
            text='HRS Correlation')
)




others= baseo+baseo.mark_text().transform_calculate(label = '"" + datum.x + datum.y').encode(
    text='corr_pearson:N',
    color=alt.value('black'))
others.configure_title(fontSize=14).configure(background='#FFFFFF').configure_axis(
    grid=False)

So far we have seen that the percentage of minorities increases with HRS, as does the share of vacant units. We did not find a significant relationship between interest rate by race and HRS, and found only a very minor negative relationship between HRS and approval rate by race. 

When we calculate the correlation of HRS with other variables, we find a correlation with minority percentage but near-zero correlations with the other variables of interest. Additionally, we identify strong correlations among income, property value, loan amount and tract to msa income percentage, which reflect the expected relationship between an individual's income and the type of home they can afford. 