# Aggregating MFT to the county level

To merge the MFT data (recorded at the zipcode level) with the IRS/ACS data (recorded at the county level), we need to convert the summary level from zipcode to county. To do this, we will use the Census Bureau's [zipcode to county relationship file](https://www.census.gov/geographies/reference-files/time-series/geo/relationship-files.2010.html).

In [1]:
import pandas as pd

### Merging relationship file and MFT data

In [2]:
rel_df = pd.read_csv("./data/zcta_county_rel_10.csv", dtype={'ZCTA5': object})
mft_df = pd.read_csv("./data/mft_county_rel.csv", dtype={'ZCTA5': object})
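Reading `ZCTA5` with `dtype=object` keeps the codes as strings, which matters because some codes begin with a zero. The sketch below illustrates the difference on a small in-memory CSV; the rows and the `POPPT` values are made up for illustration and do not come from the real files.

```python
import io

import pandas as pd

# Hypothetical two-row CSV; the first code begins with a zero.
csv_text = "ZCTA5,POPPT\n01001,100\n39573,200\n"

# Default parsing infers an integer column and drops the leading zero.
as_number = pd.read_csv(io.StringIO(csv_text))

# Forcing an object (string) dtype keeps the code exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"ZCTA5": object})

print(as_number["ZCTA5"].tolist())  # [1001, 39573]
print(as_text["ZCTA5"].tolist())    # ['01001', '39573']
```

Keeping the key as a string in both files also ensures the join column has the same type on each side of the merge.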

In [3]:
# Left join on the shared ZCTA5 key so every row of the relationship file is kept
mft_county_rel = pd.merge(rel_df, mft_df, how='left', on='ZCTA5')
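As a minimal sketch of how this kind of left join behaves, the toy frames below (hypothetical values, not the real data) use two optional `pd.merge` arguments: `validate="many_to_one"` raises an error if a zipcode appears more than once on the right-hand side, and `indicator=True` adds a `_merge` column showing which rows found a match.

```python
import pandas as pd

# Toy stand-ins for the relationship file and the MFT file (hypothetical values).
rel = pd.DataFrame({"ZCTA5": ["39573", "39573", "01001"], "COUNTY": [45, 131, 13]})
mft = pd.DataFrame({"ZCTA5": ["39573"], "#e-filed returns": [7]})

merged = pd.merge(
    rel,
    mft,
    how="left",              # keep every row of the relationship file
    on="ZCTA5",              # join key named explicitly
    validate="many_to_one",  # fail loudly if the right side repeats a ZCTA5
    indicator=True,          # add a _merge column: 'both' or 'left_only'
)

print(merged[["ZCTA5", "COUNTY", "#e-filed returns", "_merge"]])
```

Rows marked `left_only` are zipcode-county pairs with no MFT record; their MFT columns are filled with `NaN`.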

To check that the merge didn't leave anything behind, we look for rows where the zip code is in the MFT dataset but the `United Way` field is empty. There shouldn't be any rows where this is true.

In [4]:
mft_county_rel[mft_county_rel.ZCTA5.isin(mft_df.ZCTA5) & pd.isnull(mft_county_rel['United Way'])]

Unnamed: 0,ZCTA5,STATE,COUNTY,GEOID,POPPT,HUPT,AREAPT,AREALANDPT,ZPOP,ZHU,...,COPOPPCT,COHUPCT,COAREAPCT,COAREALANDPCT,City,State,County,Org ID,United Way,#e-filed returns


Everything looks good!
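The same check can be written as an assertion, so that a re-run fails loudly instead of relying on someone reading an empty table. This is only a sketch on toy frames; the organisation name is a placeholder.

```python
import pandas as pd

# Toy merged frame: one zipcode with MFT data, one without (placeholder values).
merged = pd.DataFrame({
    "ZCTA5": ["39573", "39573", "01001"],
    "United Way": ["Placeholder United Way", "Placeholder United Way", None],
})
mft = pd.DataFrame({"ZCTA5": ["39573"]})

# Rows whose zipcode exists in the MFT data but whose MFT fields did not come through.
missing = merged[merged["ZCTA5"].isin(mft["ZCTA5"]) & merged["United Way"].isna()]

assert missing.empty, f"{len(missing)} MFT rows lost their United Way value"
```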

In [5]:
mft_county_rel.columns

Index(['ZCTA5', 'STATE', 'COUNTY', 'GEOID', 'POPPT', 'HUPT', 'AREAPT',
       'AREALANDPT', 'ZPOP', 'ZHU', 'ZAREA', 'ZAREALAND', 'COPOP', 'COHU',
       'COAREA', 'COAREALAND', 'ZPOPPCT', 'ZHUPCT', 'ZAREAPCT', 'ZAREALANDPCT',
       'COPOPPCT', 'COHUPCT', 'COAREAPCT', 'COAREALANDPCT', 'City', 'State',
       'County', 'Org ID', 'United Way', '#e-filed returns'],
      dtype='object')

### Aggregating to the county

Because zipcodes and counties are not coterminous, some zipcodes are in multiple counties. The column `ZPOPPCT` in the zipcode to county relationship file gives the percentage of the zipcode's population that lives in each county. For example, zip code `39573` stretches across 6 counties:

In [6]:
mft_county_rel[mft_county_rel.ZCTA5 == '39573'][['ZCTA5', 'City', 'State', 'County', 'STATE', 'COUNTY', 'ZPOPPCT']]

| | ZCTA5 | City | State | County | STATE | COUNTY | ZPOPPCT |
|---|---|---|---|---|---|---|---|
| 16570 | 39573 | Perkinston | MS | Stone | 28 | 39 | 3.56 |
| 16571 | 39573 | Perkinston | MS | Stone | 28 | 45 | 21.91 |
| 16572 | 39573 | Perkinston | MS | Stone | 28 | 47 | 1.11 |
| 16573 | 39573 | Perkinston | MS | Stone | 28 | 59 | 1.52 |
| 16574 | 39573 | Perkinston | MS | Stone | 28 | 109 | 1.31 |
| 16575 | 39573 | Perkinston | MS | Stone | 28 | 131 | 70.59 |


We want one county per zipcode, so we'll take only the row with the highest percentage of the zipcode's population. In the case of zipcode `39573`, that would be the last row (the one with a county FIPS code of `131`).

In [7]:
mft_county_rel = mft_county_rel.sort_values(['ZCTA5', 'COUNTY', 'ZPOPPCT']).drop_duplicates(['ZCTA5', 'COUNTY'], keep='last')
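One thing to keep in mind when reading the call above: `drop_duplicates` with the subset `['ZCTA5', 'COUNTY']` keeps one row per zipcode-county pair, whereas keeping a single county per zipcode requires `ZCTA5` alone as the subset. The sketch below contrasts the two on the `39573` rows shown earlier plus one made-up single-county zipcode (`01001` and its values are hypothetical).

```python
import pandas as pd

rel = pd.DataFrame({
    "ZCTA5": ["39573"] * 6 + ["01001"],  # 01001 row is hypothetical
    "COUNTY": [39, 45, 47, 59, 109, 131, 13],
    "ZPOPPCT": [3.56, 21.91, 1.11, 1.52, 1.31, 70.59, 100.0],
})

# Subset includes COUNTY: every (zipcode, county) pair is already unique,
# so nothing is dropped.
pair_level = (
    rel.sort_values(["ZCTA5", "COUNTY", "ZPOPPCT"])
       .drop_duplicates(["ZCTA5", "COUNTY"], keep="last")
)

# Subset is ZCTA5 only: after sorting by ZPOPPCT, keep="last" retains the
# county holding the largest share of each zipcode's population.
one_per_zcta = (
    rel.sort_values(["ZCTA5", "ZPOPPCT"])
       .drop_duplicates("ZCTA5", keep="last")
)

print(len(pair_level), len(one_per_zcta))
print(one_per_zcta)
```

An equivalent way to select the dominant county is `rel.loc[rel.groupby("ZCTA5")["ZPOPPCT"].idxmax()]`. Ties in `ZPOPPCT` would be resolved by sort order in either approach.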

In [8]:
mft_county_rel.describe()

Unnamed: 0,STATE,COUNTY,GEOID,POPPT,HUPT,AREAPT,AREALANDPT,ZPOP,ZHU,ZAREA,...,COAREALAND,ZPOPPCT,ZHUPCT,ZAREAPCT,ZAREALANDPCT,COPOPPCT,COHUPCT,COAREAPCT,COAREALANDPCT,#e-filed returns
count,44409.0,44409.0,44409.0,44409.0,44409.0,44409.0,44409.0,44409.0,44409.0,44409.0,...,44409.0,44409.0,44409.0,44409.0,44409.0,44409.0,44409.0,44409.0,44409.0,21906.0
mean,30.124142,87.833637,30211.97514,7036.027607,3002.365939,171141800.0,167330600.0,9111.244207,3901.043572,262609200.0,...,3466145000.0,74.255175,73.894835,74.578953,74.578992,7.252251,7.248556,6.705254,6.923092,12.697982
std,15.323056,85.813008,15333.111889,12341.371141,5040.099625,562642400.0,550466200.0,13235.141549,5399.305229,659746700.0,...,12200710000.0,38.601151,38.688594,35.947392,35.956118,14.184957,13.916541,10.568559,10.867822,22.525799
min,1.0,1.0,1001.0,0.0,0.0,2270.0,2270.0,0.0,0.0,5094.0,...,5176813.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,18.0,31.0,18127.0,277.0,139.0,16927360.0,16064130.0,867.0,432.0,39823430.0,...,1214167000.0,42.56,40.49,45.98,46.04,0.61,0.65,0.83,0.89,2.0
50%,29.0,71.0,29221.0,1324.0,635.0,62672030.0,60804150.0,2968.0,1390.0,124964000.0,...,1727139000.0,100.0,100.0,100.0,100.0,2.23,2.33,3.12,3.3,6.0
75%,42.0,119.0,42063.0,7345.0,3222.0,164116700.0,159608400.0,11927.0,5188.0,279448300.0,...,2562955000.0,100.0,100.0,100.0,100.0,6.55,6.72,8.12,8.39,13.0
max,72.0,840.0,72153.0,113916.0,47617.0,35108420000.0,34786100000.0,113916.0,47617.0,35108420000.0,...,376855700000.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,456.0


Because a county can contain many zipcodes, we sum `#e-filed returns` to get one total per county.

In [9]:
mft_per_county = mft_county_rel.groupby(['STATE', 'COUNTY']).agg({'#e-filed returns': 'sum'}).reset_index()
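Counties with no MFT rows have only `NaN` values in `#e-filed returns`, and pandas' `sum` skips `NaN` by default, so such counties come out with a total of `0` rather than a missing value. If the distinction between "no data" and "zero returns" matters downstream, `min_count=1` preserves it. A sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one county with two zipcodes, one county with no MFT data.
df = pd.DataFrame({
    "STATE": [28, 28, 1],
    "COUNTY": [131, 131, 1],
    "#e-filed returns": [7.0, 5.0, np.nan],
})

grouped = df.groupby(["STATE", "COUNTY"])["#e-filed returns"]

# Default: an all-NaN group sums to 0.0.
default_sum = grouped.sum().reset_index()

# min_count=1: a group needs at least one non-NaN value, otherwise the result is NaN.
strict_sum = grouped.sum(min_count=1).reset_index()

print(default_sum)
print(strict_sum)
```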

Finally, we make sure that every county in the relationship file made it to `mft_county_rel` by comparing the number of distinct (`STATE`, `COUNTY`) groups.

In [10]:
len(mft_county_rel.groupby(['STATE', 'COUNTY']).groups) == len(rel_df.groupby(['STATE', 'COUNTY']).groups)

True
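Comparing group counts confirms that the totals match but does not say which counties are missing when they don't. A possible variant, sketched here on toy frames with hypothetical values, compares the sets of (`STATE`, `COUNTY`) pairs directly.

```python
import pandas as pd

# Toy frames: `kept` is missing one county that appears in `rel`.
rel = pd.DataFrame({"STATE": [28, 28, 1], "COUNTY": [45, 131, 1]})
kept = pd.DataFrame({"STATE": [28, 1], "COUNTY": [131, 1]})

rel_keys = set(zip(rel["STATE"], rel["COUNTY"]))
kept_keys = set(zip(kept["STATE"], kept["COUNTY"]))

# Counties present in the relationship file but absent after processing.
missing = rel_keys - kept_keys
print(missing)

# The group-count comparison expressed with the ngroups attribute.
same_count = (
    rel.groupby(["STATE", "COUNTY"]).ngroups
    == kept.groupby(["STATE", "COUNTY"]).ngroups
)
print(same_count)
```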

In [11]:
mft_per_county.to_csv("./data/mft_at_county_level.csv", index=False)