In [1]:
import pandas as pd

In [2]:
md = pd.read_csv("/Users/xavier/Desktop/DSPP/DS/Data-Science-1-Final-Project/FDIC_Pull/metadata.csv")

In [3]:
md.head()

Unnamed: 0,VariableName,VarLabel,Notes,Code,Value,in2009,in2011,in2013,in2015,in2017,in2019,StartYear,EndYear
0,gereg,Geographic region,,1.0,Northeast,y,y,y,y,y,y,2009,
1,gereg,Geographic region,,2.0,Midwest,y,y,y,y,y,y,2009,
2,gereg,Geographic region,,3.0,South,y,y,y,y,y,y,2009,
3,gereg,Geographic region,,4.0,West,y,y,y,y,y,y,2009,
4,gestfips,State,,1.0,AL,y,y,y,y,y,y,2009,


In [4]:
md.shape

(2647, 13)

In [5]:
#subset only to variables available in 2019
md = md[md["in2019"] == 'y']
md.shape

(1088, 13)

In [6]:
#eliminate variables with too many options
md = md[~md["VariableName"].isin(["gestfips", "msa13", "msa5yr13"])]
md.shape

(433, 13)

Here, we remove the detailed geographic variables (gestfips, msa13, msa5yr13) from the metadata because their many codes take up too much space in the document and make it harder to identify useful variables. We will instead add back one placeholder row for each of them.
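
One way to find which variables take up the most rows is to count the distinct codes per variable. The sketch below uses a small invented frame with the same VariableName and Code columns as the metadata; the rows and the cutoff of 4 options are illustrative assumptions, not values taken from the FDIC file.

```python
import pandas as pd

# Toy stand-in for the metadata; the rows are invented for illustration.
toy_md = pd.DataFrame({
    "VariableName": ["gereg"] * 4 + ["gestfips"] * 6 + ["hunbnk"] * 2,
    "Code": [1, 2, 3, 4, 1, 2, 4, 5, 6, 8, 1, 2],
})

# Count distinct codes per variable, largest first.
options_per_var = (
    toy_md.groupby("VariableName")["Code"]
    .nunique()
    .sort_values(ascending=False)
)

# Hypothetical cutoff: flag variables with more than 4 options.
max_options = 4
too_many = options_per_var[options_per_var > max_options].index.tolist()

# Keep only the rows for variables at or under the cutoff.
toy_trimmed = toy_md[~toy_md["VariableName"].isin(too_many)]
print(options_per_var)
print(too_many)
```

Run against the real metadata, the same count would show whether any other variables are as large as the three removed here.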

In [7]:
#drop the survey-year availability columns, which are no longer needed
md = md.drop(columns=['in2009','in2011','in2013','in2015','in2017','in2019','StartYear','EndYear'])

In [8]:
md.shape

(433, 5)

In [9]:
#Add in placeholders for data removed
md.loc[len(md.index)] = ["gestfips", 'State', '','','']
md.loc[len(md.index)] = ["msa13", 'MSA', '','','']
md.loc[len(md.index)] = ["msa5yr13", 'MSA 5yr', '','','']
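
Because md keeps its original row labels after filtering, len(md.index) can coincide with a label that is still present, in which case .loc overwrites that row instead of appending a new one. The following is a minimal sketch of an alternative that does not depend on the row count, using pd.concat on an invented toy frame (the column names match the metadata; the rows and the names toy_md and placeholders are hypothetical).

```python
import pandas as pd

# Toy frame whose labels survived a filter: two rows, but label 2 is in use,
# so toy_md.loc[len(toy_md.index)] would overwrite the "Midwest" row.
toy_md = pd.DataFrame(
    {
        "VariableName": ["gereg", "gereg"],
        "VarLabel": ["Geographic region", "Geographic region"],
        "Notes": ["", ""],
        "Code": [1, 2],
        "Value": ["Northeast", "Midwest"],
    },
    index=[0, 2],
)

placeholders = pd.DataFrame(
    [
        ["gestfips", "State", "", "", ""],
        ["msa13", "MSA", "", "", ""],
        ["msa5yr13", "MSA 5yr", "", "", ""],
    ],
    columns=toy_md.columns,
)

# Give the placeholders labels past the current maximum so nothing collides.
start = toy_md.index.max() + 1
placeholders.index = range(start, start + len(placeholders))

toy_with_placeholders = pd.concat([toy_md, placeholders])
print(toy_with_placeholders)
```

If the original labels are not needed, calling reset_index(drop=True) before appending is another option.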

In [10]:
md.head(436)

Unnamed: 0,VariableName,VarLabel,Notes,Code,Value
0,gereg,Geographic region,,1,Northeast
1,gereg,Geographic region,,2,Midwest
2,gereg,Geographic region,,3,South
3,gereg,Geographic region,,4,West
950,msa13chg,MSA boundary change between 2013 and 2015 surveys,,1,Yes
...,...,...,...,...,...
2642,hbnkfeeunbnkv2,Unbanked households: Clarity of bank communica...,Excludes households with missing information o...,98,Unknown: Do not know
2643,hbnkfeeunbnkv2,Unbanked households: Clarity of bank communica...,Unbanked households,-1,NIU
433,gestfips,State,,,
434,msa13,MSA,,,


In [11]:
md_clean = md.reset_index()

In [12]:
md_clean.to_csv("/Users/xavier/Desktop/DSPP/DS/Data-Science-1-Final-Project/Var_Selection/md_clean.csv")

Export the cleaned metadata so that the variables can be selected by manual review.

In [13]:
vars = md[md['VariableName'].isin([
    'gestfips', 'msa13', 'gtcbsast', 'hagele15', 'hbnkprev', 'hhincome',
    'hhtenure', 'hhtype', 'hryear4', 'hunbnk', 'huse12AFSC', 'huse12AFST',
    'huse12AFS', 'prtage', 'pdisabl_age25to64', 'peducgrp', 'pempstat',
    'pnativ', 'praceeth3', 'hincvolv2', 'hintaccv2', 'hbnkint',
    'hunbnkrmv4', 'hhfamtyp', 'hsupresp', 'hsupwgtk',
])]
vars

Unnamed: 0,VariableName,VarLabel,Notes,Code,Value
953,gtcbsast,Metropolitan status,,1,Metropolitan area - principal city
954,gtcbsast,Metropolitan status,,2,Metropolitan area - balance
955,gtcbsast,Metropolitan status,,3,Not in metropolitan area
956,gtcbsast,Metropolitan status,,4,Not identified
958,hagele15,Number of children aged 15 or younger,,,
...,...,...,...,...,...
2526,hunbnkrmv4,Main reason unbanked,Excludes households with missing information o...,10,Other reason
2527,hunbnkrmv4,Main reason unbanked,Excludes households with missing information o...,98,Did not select a reason
2528,hunbnkrmv4,Main reason unbanked,Unbanked households,-1,NIU
433,gestfips,State,,,


Variables of Interest
- gestfips for state mapping (non-dummy)
- msa13 for metro-area mapping (non-dummy)
- gtcbsast to signify whether or not the household is in a city (convert to dummy)
- hagele15 for number of young children (non-dummy)
- hbnkprev for if they previously had a bank account (convert to dummy for those unbanked)
- hhincome for household income (make an additional dummy for whether or not they're below the poverty line)
- hhtenure for if they own a home
- hhtype for a rough description of their family situation (non-dummy)
- hryear4 for survey year (non-dummy)
- hunbnk for if respondent is banked
- huse12AFSC for if they used alternative credit products
- huse12AFST for if they used alternative transaction products
- huse12AFS for if they used either alternative product
- prtage for age (non-dummy; also make a dummy based on cluster)
- pdisabl_age25to64 for disability status
- peducgrp for education (non-dummy)
- pempstat for employment
- pnativ for citizenship (non-dummy; also make a citizen/noncitizen dummy)
- praceeth3 for race (make dummy based on cluster)
- hincvolv2 for income volatility
- hintaccv2 for internet access
- hbnkint for whether an unbanked person wants a bank account
- hunbnkrmv4 for reason unbanked
- hhfamtyp for family or nonfamily household
- hsupresp for supplement response
- hsupwgtk for survey weighting of household
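
Since isin silently skips any requested name that has no match in VariableName (a difference in capitalization is enough), it can help to list the names that found no match. The sketch below uses an invented toy frame and a hypothetical request list; in the notebook, the frame would be md and the list would be the names above.

```python
import pandas as pd

# Toy metadata; in the notebook this would be the cleaned md frame.
toy_md = pd.DataFrame(
    {"VariableName": ["gtcbsast", "gtcbsast", "hunbnk", "prtage"]}
)

# Hypothetical request list; "huse12afs" is deliberately absent from toy_md.
wanted = ["gtcbsast", "hunbnk", "prtage", "huse12afs"]

# Names that isin() would silently ignore.
available = set(toy_md["VariableName"])
missing = [name for name in wanted if name not in available]
print(missing)
```

An empty list means every requested variable has at least one metadata row.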

Variables to create
- MSA dummy from gtcbsast
- Previous account dummy from hbnkprev
- Poverty line dummy from hhincome
- dummy based on age cluster from prtage
- dummy for citizen/noncitizen from pnativ
- dummy based on race cluster from praceeth3
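
As a rough sketch of how such dummies might be built, the example below derives a metro indicator from gtcbsast using the codes listed in the metadata (1 and 2 metropolitan, 3 not in a metropolitan area, 4 not identified). Treating code 4 as missing is an assumption. For praceeth3, plain one-hot encoding with pd.get_dummies stands in for the cluster-based grouping, which is not specified here; the praceeth3 codes, the toy rows, and the column name in_metro are all hypothetical.

```python
import pandas as pd

# Toy survey rows; values are invented for illustration.
toy = pd.DataFrame({
    "gtcbsast": [1, 2, 3, 4],
    "praceeth3": [1, 2, 3, 1],  # hypothetical codes
})

# Codes 1 and 2 are metropolitan, 3 is non-metropolitan.
# Assumption: code 4 ("Not identified") is left unmapped and becomes missing.
metro_map = {1: 1, 2: 1, 3: 0}
toy["in_metro"] = toy["gtcbsast"].map(metro_map).astype("Int64")

# One indicator column per code (a stand-in for the cluster-based dummy).
race_dummies = pd.get_dummies(toy["praceeth3"], prefix="praceeth3", dtype=int)
toy = pd.concat([toy, race_dummies], axis=1)
print(toy)
```

The same mapping pattern would apply to the other binary dummies once their codes have been confirmed in md_clean.csv.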

