# Modeling Unbanked Rates Based On Demographic and Geographic Variables Using Machine Learning
### Xavier Adomatis
### Prof Dunford's PPOL 564
### 12/16/2021

Repository Link: 
Word Count (Body Only): 

# Introduction & Background

This paper seeks to examine demographic and geographic relationships among survey respondents to create a model predicting whether or not someone participates in the financial system. By creating an accurate model, or taking elements learned from the creation of the model, we seek to inform policy on expanding financial inclusion.

The benefits of being in the financial system are limited; just by having a bank account, you gain access to meager interest rates and a variety of expensive financial services. The costs incurred by not having a bank account, however, are substantial. Someone without a bank account is likely to spend hundreds of dollars on check cashing fees and subpar payday loans, and to face physical risks like theft and home fires. Increasing participation can increase household budgets substantially, especially for the low-income households disproportionately affected by being unbanked.

Given the consequences of this issue, researchers at the Federal Deposit Insurance Corporation, like myself, and others at banking agencies have sought to understand what determines being unbanked. FDIC economists Celerier and Matray, for instance, found that increasing bank diversity increases banking participation, especially for minorities (2014). Ebonya Washington at Yale attributed it to income disparities and poor public policy (2006). Michael Barr and Rebecca Blank indicated that in addition to economic status, there are cultural factors at play as well (2008). Other economists, like Rhine and Greene at the Federal Reserve, have looked at disproportionate effects on at-risk subgroups, like documented immigrants (2006). The general conclusion of the literature is that there are broad and often intersectional issues with economic inclusivity that disproportionately affect vulnerable populations.

Through this essay, I will describe my process for producing this machine learning model and discuss the implications it brings. First, I discuss the raw data and how I manipulated it. Then, I'll highlight important variables. I'll briefly delve into the aspects of failed models, and then report on my most successful model. Throughout, I'll discuss what specific data I used, how I tuned the models, and what my predictive results were. Finally, I'll discuss possible policy conclusions related to those models, and the challenges and opportunities noted in this process.

# Inputs

## Raw Data

My primary dataset was a massive file from the FDIC's 2019 Household survey. The original file contains 1,600 columns and 70,000 rows of household level financial data. Most of the variables were irrelevant to this study, as they looked at specific question responses. It also contained substantial demographic information on the respondent.

I also opted to pull state-wide variables to see if there were any underlying geographic effects. I had two state-level variables that made it into my dataset. The first was the Bureau of Labor Statistics' 2019 unemployment rate, and the second was a bank diversity measure based on the FDIC's Number of Banking Institutions. Presumably, having a more diverse selection of banks and an economically successful state could both contribute to positive banking outcomes.

I also kept a simple State/State Name/FIPS conversion table handy, as the FDIC data was organized by FIPS and the state data was organized by name and postal code. Additionally, I downloaded state populations from the 2020 census to use as a scaling factor.

## Data Manipulation

The FDIC data was relatively messy, with approximately half of the observations not having completed the survey supplement, which contains any possible financial variable of interest that we could use as a dependent variable. Approximately one third of them selectively answered questions, leaving large gaps. This left us with 20,000 observations and a fear of measurement error. However, I was able to keep the ~10,000 incomplete responses for my state level table. Data completeness can be seen in appendix figure A.

For the FDIC data, most of my work was recoding categorical variables. As displayed in the next section, several categorical variables included groups with a major disproportionate effect; single mothers, non-high school graduates, low income individuals, and non-Asian racial minorities were all unbanked at higher rates. In order to get this data to cooperate, I turned these categorical variables into new binary variables so that they could be accurately used in modeling. These variables can be found in the appendix.
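
The recoding step can be sketched as follows. This is only an illustration: the column names (`education`, `family_type`, `income_tier`) and the category labels are hypothetical stand-ins, not the survey's actual variable names or codes.

```python
import pandas as pd

# Hypothetical stand-in for a few cleaned survey columns (names and labels are assumptions).
households = pd.DataFrame({
    "education": ["No HS diploma", "HS diploma", "Some college", "College degree"],
    "family_type": ["Female-led, unmarried", "Married couple", "Male-led, unmarried", "Married couple"],
    "income_tier": ["Under $15k", "$30k-$50k", "Under $15k", "$75k+"],
})

# Each at-risk group becomes its own 0/1 indicator column.
households["no_hs_diploma"] = (households["education"] == "No HS diploma").astype(int)
households["no_college"] = households["education"].isin(["No HS diploma", "HS diploma"]).astype(int)
households["single_mother"] = (households["family_type"] == "Female-led, unmarried").astype(int)
households["low_income"] = (households["income_tier"] == "Under $15k").astype(int)

binary_cols = ["no_hs_diploma", "no_college", "single_mother", "low_income"]
print(households[binary_cols])
```

Building each indicator by hand, rather than one-hot encoding every category, keeps only the groups that the exploratory charts single out.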

After cleaning the FDIC data, I added geographic factors for my continuous and state level models. I scaled the FDIC's number of institutions by state using census population, and incorporated the BLS unemployment rate. I added these two to my dataset after converting to state codes. I was disappointed not to have more geographic data, but I discuss this in challenges.
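
A minimal sketch of this join, assuming tiny made-up tables: every column name and number below is illustrative, and the per-100,000 scaling is one possible choice of scale rather than necessarily the one used here.

```python
import pandas as pd

# Hypothetical miniature tables; real values come from the FDIC, BLS, and Census sources.
households = pd.DataFrame({"fips": [1, 1, 6], "unbanked": [0, 1, 0]})
crosswalk = pd.DataFrame({"fips": [1, 6], "state_name": ["Alabama", "California"]})
state_data = pd.DataFrame({
    "state_name": ["Alabama", "California"],
    "institutions": [100, 150],              # illustrative counts, not actual figures
    "population": [5_000_000, 39_000_000],   # illustrative populations
    "unemployment_rate": [3.0, 4.0],         # illustrative rates
})

# Scale the institution count by population (institutions per 100,000 residents).
state_data["banks_per_100k"] = state_data["institutions"] / state_data["population"] * 100_000

# Convert state names to FIPS codes, then attach the state variables to each household.
state_vars = crosswalk.merge(state_data, on="state_name", how="left")
merged = households.merge(
    state_vars[["fips", "banks_per_100k", "unemployment_rate"]],
    on="fips", how="left", validate="many_to_one",
)
print(merged)
```

The `validate="many_to_one"` argument makes pandas raise an error if a state appears twice in the state table, which would otherwise silently duplicate households.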

I also created a dataset that took state averages of each of the factors to see if using state as a unit of analysis would predict this result. I did this by simply grouping by state and taking the average.
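
The state-level table amounts to a group-by and a mean. A sketch with made-up rows and hypothetical column names:

```python
import pandas as pd

# Hypothetical household rows; the column names are assumptions for illustration.
households = pd.DataFrame({
    "fips": [1, 1, 6, 6],
    "unbanked": [1, 0, 0, 0],
    "below_poverty": [1, 1, 0, 1],
})

# One row per state; averaging a 0/1 column gives the share of households with that trait.
state_means = households.groupby("fips", as_index=False).mean()
print(state_means)
```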

# Variables of Interest

After manipulating the data and producing a clean dataset, I subsetted my variables. I determined that Unbanked would be my primary dependent variable; while use of Alternative Financial Services, like payday loans, was on the table, whether or not someone had a bank account seemed to be the convergence of all of one's financial issues. Unbanked is a binary variable, where 1 is not having a bank account and 0 is having a bank account. Given the data, the unit of observation will be a household in 2019, and geographic variables will pertain to the state where the household is located.

In the graphs below, I have laid out a handful of variables of interest. These graphs feature categorical and continuous variables that I needed to cluster before converting into binary variables, so the concentrations are of import. A full list of variables can be seen in the outputs and the appendix for each model.

#### Figure 1: Percent Unbanked by Education Level

![Ed_Unbnk.png](attachment:Ed_Unbnk.png)

I separated education into three dummy variables: one that indicates not graduating high school, one that indicates not attending college, and one that indicates graduating college.

#### Figure 2: Percent Unbanked by Family Type

![FT_Unbnk.png](attachment:FT_Unbnk.png)

Here, I isolated unmarried female-led households in a binary, as they had the highest likelihood of being unbanked. The other category was too small to be worth separating.

#### Figure 3: Percent Unbanked by Income

![Inc_Unbnk.png](attachment:Inc_Unbnk.png)

For income, I created a binary indicator for whether the household was in the bottom income tier.

#### Figure 4: Percent Unbanked by Race

![Race_Unbnk.png](attachment:Race_Unbnk.png)

For race, I clustered White, Asian, and Pacific Islander together.

#### Figure 5: Percent Unbanked by Age

![Age_Unbnk.png](attachment:Age_Unbnk.png)

For age, seeing limited variance, I isolated the under 25 group.

After clustering these variables into binary form and cleaning the rest of them, I sought to implement various models to discover meaningful relationships between them.

# Explorations

Prior to settling on a model to best capture my data, I explored a variety of different possibilities both in type of variable and unit of analysis. All models were run through a machine learning pipeline, which took our data and tested it on a variety of statistical models. Most results yielded a decision tree regressor, which takes our data, sorts by combinations, and establishes buckets in predicting the outcome. Decision tree models help split the data, and often can assess intersectionality. Since our dependent variable is only a yes or a no, as are most of the inputs, a decision tree already seems to fit well with the data. This is backed by underlying research, as most authors suggest intersectionality is a major determinant of being unbanked. However, the pipeline also assessed fit with Random Forests, Bagging, and K Nearest Neighbors, none of which were optimal with any version of my data. It also assessed linear models.
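
A pipeline of this kind can be sketched with scikit-learn's `GridSearchCV`, which swaps candidate models in and out of a single pipeline step and keeps the one with the best cross-validated score. The synthetic data, the candidate hyperparameter values, and the five-fold split below are all assumptions for illustration, not the settings used in this paper.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic binary data standing in for the survey (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 5))
# The outcome is more likely when the first two indicators are both 1.
y = (rng.random(400) < 0.05 + 0.6 * X[:, 0] * X[:, 1]).astype(int)

# The "model" step is a placeholder that the search space replaces with each candidate.
pipe = Pipeline([("model", LinearRegression())])
search_space = [
    {"model": [LinearRegression()]},
    {"model": [KNeighborsRegressor()], "model__n_neighbors": [5, 25]},
    {"model": [DecisionTreeRegressor(random_state=0)], "model__max_depth": [2, 3, 5]},
    {"model": [BaggingRegressor(random_state=0)]},
    {"model": [RandomForestRegressor(random_state=0)],
     "model__max_depth": [2, 3], "model__n_estimators": [50]},
]

folds = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, search_space, cv=folds, scoring="r2")
search.fit(X, y)

best_model = search.best_estimator_.named_steps["model"]
print(type(best_model).__name__, round(search.best_score_, 3))
```

Scoring every candidate on the same folds with the same metric is what makes the R^2 values reported for different model families comparable.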

## Binary Modeling

Noticing that a large majority of my variables were already categorical or binary, I decided to try to implement a binary-only model, by reducing major categorical and continuous variables as discussed above.

Upon piping this data through the machine learning pipeline, it decisively yielded an R^2 of .4832 with a Decision Tree Regressor, a shockingly strong relationship between the true and predicted values. The outputs of the model are listed in Figures 6-8 below.

#### Figure 6: Truth Prediction Scatter for Binary Model with bank_prev

![Bi_tpscat.png](attachment:Bi_tpscat.png)

#### Figure 7: Correlation Heatmap for Binary Model with bank_prev

![Bi_hm.png](attachment:Bi_hm.png)

#### Figure 8: Variable Reliance Table for Binary Model with bank_prev

![Bi_vr.png](attachment:Bi_vr.png)

It's clear from the data that bank_prev, whether or not the individual previously held a bank account, is the dominant predictor and drives the substantial accuracy. While this is a useful variable, our recommendation cannot simply be "give people bank accounts so they keep them." Thus, I reran the model without it. Because of its relative success, I feature the second binary model later.

## Binary Continuous Mix Modeling

After opting to exclude bank_prev, I chose to implement a model that mixed the robust number of binary variables with a handful of continuous ones; it's possible that seeing gradients for income, age, and education would add robustness. Additionally, I joined this with my state level variables, unemployment rate and bank institutions per capita, to assess whether bank diversity and economic conditions would add to the predictive accuracy. The figures below show the most inclusive mixed model, which was a linear regression with a negative R^2, indicating that the model performed worse than simply predicting the average outcome for every household.

#### Figure 9: Truth Prediction Scatter for Mixed Model

![Mix_tpscat.png](attachment:Mix_tpscat.png)

#### Figure 10: Variable Reliance Table for Mixed Model

![Mix_vr.png](attachment:Mix_vr.png)

Across various specifications, we found this to be an insufficient model; through various inclusions and exclusions of data and allowing various tree depths, we found the mixed model to yield generally poor R^2s from .02 to .06, indicating the results were only marginally better than predicting the average outcome for everyone. The tables above show the most inclusive version of the model, which had one of the worst predictive accuracies overall. The mixed models either were based on the Decision Tree or, like above, simply output a linear regression.
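
For reference, R^2 compares a model's squared error with that of a baseline that always predicts the sample mean: the baseline scores exactly zero, and anything with more squared error than the baseline scores below zero. The toy outcomes below are made up purely to show the arithmetic.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative 0/1 outcomes (not survey data): 2 of 8 households are unbanked.
y_true = np.array([0, 0, 0, 1, 0, 0, 1, 0])

# Baseline: predict the sample mean (0.25) for every household.
mean_pred = np.full(y_true.shape, y_true.mean())
r2_mean = r2_score(y_true, mean_pred)

# A constant guess far from the mean has more squared error than the baseline.
bad_pred = np.full(y_true.shape, 0.75)
r2_bad = r2_score(y_true, bad_pred)

print(r2_mean, r2_bad)
```

Read this way, an R^2 of .02 to .06 means a model removes only two to six percent of the baseline's squared error.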

For this model, the geographic variables were often not useful. Bank diversity often ranked in the middle of the pack, while unemployment rarely registered at all. This is likely because judging state variables at a household scale obscures them entirely.

## State-Level Modeling

Disappointed that the geographic terms, which were so powerful in Celerier and Matray's model (2014), contributed so little, we sought to further examine the variables at a state level. Using the by-state averages of all of the variables, continuous and binary, we wished to examine if broader determinations could be made on a different unit of analysis.

Modeling on the state level quickly proved to be a fool's errand; reducing the dataset to roughly 50 observations decimated any predictive power. The pipeline determined that a bagging regressor was the most appropriate, but the result does not merit a discussion given the tiny N. The pursuit of this model, however, did expose significant potential for measurement-error based endogeneity, discussed in the conclusion.

# Primary Model: Revised Binary Based Regression

After eliminating the state-based and continuous-inclusive models, we opted to return to the binary model. After excluding bank_prev to highlight the other variables, we ran a maximally inclusive binary model again. Because earlier tests frequently selected Decision Trees that used the maximum allowed depth, we decided to allow the tree depth to range up to ten, instead of the previous limit of five. However, after piping the full set of variables through, the most accurate prediction came from a linear regression. In this linear model, each variable is essentially assigned a potential effect on the dependent variable, the probability of being unbanked. If enough of the variables are true, the model considers the probability of being unbanked high enough to class the household as unbanked. Below are the outputs from the final model.
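
The logic of a linear model on binary inputs can be sketched as below. The data are synthetic, the three indicators and their effect sizes are invented, and the 0.5 cutoff for flagging a household is an assumption for illustration rather than a rule taken from the pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic 0/1 risk indicators with made-up effects on the chance of being unbanked.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(5000, 3))
true_effects = np.array([0.30, 0.20, 0.10])   # invented for illustration
p_unbanked = 0.05 + X @ true_effects
y = (rng.random(5000) < p_unbanked).astype(int)

# Each coefficient estimates how much one indicator shifts the predicted probability.
model = LinearRegression().fit(X, y)
print(model.intercept_.round(3), model.coef_.round(3))

# Fitted values act as rough probabilities; enough "true" indicators push them past a cutoff.
risk = model.predict(X)
flagged = (risk >= 0.5).astype(int)   # the 0.5 cutoff is an assumption

no_risk_factors = model.predict(np.array([[0, 0, 0]]))[0]
all_risk_factors = model.predict(np.array([[1, 1, 1]]))[0]
print(round(no_risk_factors, 3), round(all_risk_factors, 3))
```

Because the effects simply add up, a linear model of this kind cannot capture interactions between traits unless interaction terms are added by hand, which is where a decision tree differs.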

#### Figure 11: Truth Prediction Scatter for Final Model

![Final_tpscat.png](attachment:Final_tpscat.png)

#### Figure 12: Correlation Heatmap for Final Model

![Final_hm.png](attachment:Final_hm.png)

#### Figure 13: Variable Reliance Table for Final Model

![Final_vr.png](attachment:Final_vr.png)

These figures reflect an R^2 score of .193, easily the highest of any test that excluded bank_prev. This shows a moderate but positive relationship between the predictions our model made and the actual outcomes. Though it's far from perfect, the model's improved predictive accuracy offered reassurance that machine learning can identify risk factors for being unbanked.

In Figure 13, the variable reliance table, we see that, like with most of our models, poverty, internet, homeownership, education, and race top the list.
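
One common way to build a reliance table like this is permutation importance: shuffle one variable at a time and record how much the model's score drops. The sketch below assumes that approach and uses synthetic data with hypothetical feature names; it is not the code behind Figure 13.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data: two indicators matter (by construction) and one is pure noise.
rng = np.random.default_rng(2)
names = ["below_poverty", "no_internet", "noise"]   # hypothetical feature names
X = pd.DataFrame(rng.integers(0, 2, size=(3000, 3)), columns=names)
p_unbanked = 0.05 + 0.35 * X["below_poverty"] + 0.15 * X["no_internet"]
y = (rng.random(3000) < p_unbanked).astype(int)

model = LinearRegression().fit(X, y)

# Shuffle each column 10 times and average the resulting drop in R^2.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
reliance = pd.Series(result.importances_mean, index=names).sort_values(ascending=False)
print(reliance)
```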

#### Figure 14: Dependence Plots for Final Model

![PD_Plots.png](attachment:PD_Plots.png)

In Figure 14 above, we can see the effects our top variables have on being unbanked, which are mostly consistent with the existing research. In non-technical terms, we see that being below the poverty line, not having internet access, not owning a home, not graduating high school, and not being White or Asian all increase one's probability of being unbanked.

# Discussion

The results of this model were mediocre, with the models being limited in their accuracy. While we can view what groups are significantly predisposed to being unbanked, the model offers little more than the correlation matrix. However, throughout the process, I identified three critical findings that a correlation matrix alone would not reveal.

First, the project's main success was actually its first hiccup: determining beyond a doubt that those who have had a bank account are likely to still have one. While this seems obvious, it highlights an incredibly important feature: when people become banked, they stay banked. If the powers that be can get an individual into the banking system, they will stay there. From a fiscal perspective, this suggests that investing in bank outreach once can have very high yields, and the procedure does not have to be continued.

Second, when I designed the state model, I paired states both by their BLS unemployment rate and by the average of the unemployed respondents. These rates differed greatly, with the respondents being on average 1.5 points more employed than their state level counterparts. I checked the documentation and asked FDIC employees (Chu, 2021), who confirmed that the definitions were the same.

This suggests that unemployed individuals were substantially undersampled in the survey. While some deviation is expected, the consistent underreporting raises concerns about endogeneity from measurement error; if at-risk populations are not represented, their coefficients could be severely biased. Moving forward, analysts need to address sampling bias, even within census surveys, to obtain accurate estimates.
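
The comparison behind this finding can be sketched as follows. All numbers are invented, and the column names are hypothetical; the point is only the shape of the check, which computes the survey's unemployed share per state and subtracts it from the official rate.

```python
import pandas as pd

# Invented respondents: 2 of 100 unemployed in one state, 5 of 200 in another.
households = pd.DataFrame({
    "state": ["AL"] * 100 + ["CA"] * 200,
    "unemployed": [1] * 2 + [0] * 98 + [1] * 5 + [0] * 195,
})
# Invented official rates, in percent (not actual BLS figures).
bls = pd.DataFrame({"state": ["AL", "CA"], "bls_rate": [3.0, 4.0]})

# Share of respondents reporting unemployment, in percent, by state.
survey_rate = (
    households.groupby("state", as_index=False)["unemployed"].mean()
    .rename(columns={"unemployed": "survey_rate"})
)
survey_rate["survey_rate"] = survey_rate["survey_rate"] * 100

# A positive gap means the survey shows less unemployment than the official rate.
comparison = survey_rate.merge(bls, on="state", validate="one_to_one")
comparison["gap"] = comparison["bls_rate"] - comparison["survey_rate"]
print(comparison)
print("Average gap (points):", comparison["gap"].mean())
```

A gap that is positive in nearly every state, rather than scattered around zero, is what points to systematic undersampling instead of ordinary sampling noise.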

Third, from the final model, we identified that some factors were of utmost importance when detecting unbanked individuals. Three of the variables, poverty, homeownership, and education, are all related to banking issues and predictably call for larger investments. Improving those outcomes may improve participation, but they also call for a broader economic response. Coupled with race, however, these variables allow us to identify the most vulnerable demographics for the purposes of advertising and local intervention. The second most powerful variable, access to internet, also has broad implications that may not yet be realized. Banking access is rarely discussed as a potential positive outcome of broadband expansion, and little research has been conducted on this subject.

The results yield three tangible goals for policy: first, policy should prioritize making an initial connection between an unbanked person and a bank, as that is highly likely to lead to success; second, better data quality and sampling procedures are necessary to learn more; and finally, more research is needed to investigate the connection between internet access and banking, as this has only trickled into policy discussions so far.

# Challenges and Opportunities

The most disappointing challenge with this data was having no access to county level reporting. Investigating the economic and banking conditions of geographic subunits might tell us more about how a person's geography and things like the number of bank branches relate to improving access.

As discussed above, finding an adequate model was another difficulty; the state-level dataset was too small, and variations on the main model often resulted in dismal predictive accuracy or were biased by a single variable.

If I had more time, I would have sought to analyze metropolitan level data; while it would exclude those in rural areas, the survey data does allow us to class households by Metropolitan Statistical Area. Limiting research to that could stimulate a more intense discussion on how outcomes relate to geographic variables.

Furthermore, if given the opportunity, I would pull the 2017, 2015, and incoming 2021 surveys to expand the clean dataset and increase the models' statistical power.

# Appendix 

## Additional Reference Figures

#### Figure A: Data Missingness of Unclean Data

![Missing_Matrix_Unclean.png](attachment:Missing_Matrix_Unclean.png)

## Variable descriptions

Binary Only Model Variables:
- Education Levels: Graduating College, Not Attending College, Not Graduating High School
- Demographic Variables: Under 25 Years Old, Citizen, Born In US, Race, Has Children, Single Mother, Has Internet Access, Is Disabled, Lives In City or Suburbs
- Financial Variables: Below Poverty Line, Unemployed, Unpredictable Income, No Bank Account, Owns Home, Had Bank Account In Past (excluded from second binary model)

Mixed Binary-Continuous & State-Level Model Variables:
- Continuous Geographic Variables: Bank Institutions Scaled By Population (represents diversity of financial services), Unemployment Rate
- Continuous Demographic Variables: Education, Income, Age
- Binary Demographic Variables: Under 25 Years Old, Citizen, Born In US, Race, Has Children, Single Mother, Has Internet Access, Is Disabled, Lives In City or Suburbs
- Binary Financial Variables: Below Poverty Line, Unemployed, Unpredictable Income, No Bank Account, Owns Home

# References

"2020 Population and Housing State Data." Census Bureau, 2021. https://www.census.gov/library/visualizations/interactive/2020-population-and-housing-state-data.html

Barr, Michael and Rebecca Blank. "Access to Financial Services, Savings, and Assets Among the Poor." National Poverty Center, Policy Brief No. 13, 2008. http://www.npc.umich.edu/publications/policy_briefs/brief13/PolicyBrief13.pdf

Celerier, Claire and Adrien Matray. "Unbanked Households: Evidence of Supply-Side Factors." FDIC, 2014. https://www.weforum.org/agenda/2014/09/unbanked-bank-accounts-supply-side-factors/

Celerier, Claire and Adrien Matray. "Why do so many Americans lack bank accounts?" World Economic Forum, 2014. https://www.weforum.org/agenda/2014/09/unbanked-bank-accounts-supply-side-factors/

Chu, Karyen. Various Information on FDIC Data Quality. 2021.

Dunford, Eric. PPOL 564 Lectures and Code Samples. Georgetown University, 2021. http://ericdunford.com/ppol564/lectures/week_09/week-09-async-material.html

"How America Banks." FDIC, 2019. https://economicinclusion.gov/surveys/2019household/

Rhine, Sherrie, and William Greene. "The Determinants of Being Unbanked for U.S. Immigrants." Journal of Consumer Affairs, Volume 40, Issue 1, 2006. https://onlinelibrary.wiley.com/doi/10.1111/j.1745-6606.2006.00044.x

"State FIPS Codes." USDA, 2021. https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013696

"State Profiles." FDIC, 2021. https://www.fdic.gov/analysis/state-profiles/index.html

Washington, Ebonya. "The Impact of Banking and Fringe Banking Regulation on the Number of Unbanked Americans." Journal of Human Resources, XLI, 2006. http://economics.yale.edu/sites/default/files/files/Faculty/washington/impactbank.pdf

### Packages Used
- pandas
- numpy
- missingno
- plotnine
- matplotlib
- sklearn