In [1]:
import pandas as pd

### Merging ACS and MFT data

To merge the MFT data (recorded at the zip code level) with the IRS/ACS data (recorded at the county level), we need to convert from zip code to county as the summary level. To do this, we will use the Census Bureau's [zip code (ZCTA) to county relationship file](https://www.census.gov/geographies/reference-files/time-series/geo/relationship-files.2010.html).

In [2]:
mft_df = pd.read_csv("./data/mft_returns_2019.csv", dtype={'Zip': object})
census_rel_df = pd.read_csv("./data/zcta_county_rel_10.csv", dtype={'ZCTA5': 'object'})

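An outer merge keeps every row from both the relationship file and the MFT dataset. All MFT rows are retained, and any row without a match on the other side gets missing values in the other side's columns.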

In [3]:
merged = pd.merge(census_rel_df, mft_df, how="outer", left_on="ZCTA5", right_on="Zip")
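To see what `how="outer"` keeps compared with `how="right"`, here is a minimal sketch on toy data. The frames `rel` and `mft` and all their values are made up for illustration; only the column names `ZCTA5`, `Zip`, and `#e-filed returns` come from this notebook. The `indicator=True` option adds a `_merge` column showing which side each row came from.

```python
import pandas as pd

# Toy stand-ins for the relationship file and the MFT file (values are invented).
rel = pd.DataFrame({"ZCTA5": ["48103", "48104"], "COUNTY": ["161", "161"]})
mft = pd.DataFrame({"Zip": ["48103", "99999"], "#e-filed returns": [3, 5]})

# Outer merge: keeps unmatched rows from both sides.
outer = pd.merge(rel, mft, how="outer", left_on="ZCTA5", right_on="Zip", indicator=True)

# Right merge: keeps every MFT row, drops relationship rows with no MFT match.
right = pd.merge(rel, mft, how="right", left_on="ZCTA5", right_on="Zip", indicator=True)

# Count how many rows came from each side.
outer_counts = outer["_merge"].value_counts().to_dict()
right_counts = right["_merge"].value_counts().to_dict()
print(outer_counts)
print(right_counts)
```

In this toy case the outer merge returns three rows (one matched, one only in `rel`, one only in `mft`), while the right merge returns two.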

Sorting on zip because we'll be filling missing zip codes with their neighbors.

In [4]:
merged = merged.sort_values('Zip')

Forward-filling missing zip codes. For example, if we have two rows like

row#  | Zip     | United Way
:----|:--------|:----
1    | 48103   | United Way of Washtenaw County
2    | NA      | United Way of Washtenaw County

then the NA will become 48103. We fill only within the same United Way, so that the zip code we borrow is not geographically far away. In the code, the column being filled is `ZCTA5`, and a backward fill acts as a fallback when the first row of a group is missing.

In [5]:
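# Within each United Way, forward-fill missing ZCTA5 values, then back-fill
# any that are still missing at the start of the group.
merged["filled_zip"] = merged.groupby('United Way') \
                             .apply(lambda group: group['ZCTA5'].ffill().bfill()) \
                             .reset_index(level=0, drop=True)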
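The group-wise fill can be illustrated on a toy frame. This is a sketch, not the notebook's exact code: it uses `groupby(...).transform`, which returns a result aligned to the original index, as an alternative to `apply` followed by `reset_index`. The United Way names `A` and `B` and all zip values are invented.

```python
import pandas as pd

# Toy data: two United Ways, each with one missing ZCTA5 (values are invented).
toy = pd.DataFrame({
    "United Way": ["A", "A", "B", "B"],
    "Zip": ["48103", "48105", "10570", "10571"],
    "ZCTA5": ["48103", None, None, "10571"],
})

# Sort so that neighbouring rows have neighbouring zip codes.
toy = toy.sort_values("Zip")

# Forward-fill, then back-fill, but never across United Way boundaries.
toy["filled_zip"] = toy.groupby("United Way")["ZCTA5"].transform(
    lambda s: s.ffill().bfill()
)

# Map each original Zip to the value it was filled with.
filled = dict(zip(toy["Zip"], toy["filled_zip"]))
print(filled)
```

Here `48105` borrows `48103` by forward fill, and `10570` borrows `10571` by backward fill because its group has no earlier value to carry forward.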

Creating a list of zip codes to use for spot checking.

In [6]:
sample_zips = merged['filled_zip'].dropna().sample(5)

In [7]:
# sorting by ZCTA5 instead of Zip puts rows that have a ZCTA5 first (missing values sort last)
# we don't need anything but the zip code, because we'll be joining this to the rel file again
acs_mft_df = merged.sort_values('ZCTA5') \
                   .groupby('filled_zip') \
                   .agg({'ZCTA5': 'first',
                         'City': 'first',
                         'State': 'first',
                         'County': 'first',
                         'Org ID': 'first',
                         'United Way': 'first',
                         '#e-filed returns': 'sum'}) \
                   .reset_index(drop=True)
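As a small illustration of the aggregation step, the sketch below collapses a toy frame to one row per `filled_zip`, taking the first non-missing `ZCTA5` and summing the returns. The data values are invented; the column names match the ones used in this notebook.

```python
import pandas as pd

# Toy data: zip 48103 appears twice, once without a ZCTA5 match (values are invented).
toy = pd.DataFrame({
    "filled_zip": ["48103", "48103", "10571"],
    "ZCTA5": ["48103", None, "10571"],
    "#e-filed returns": [3.0, 2.0, 4.0],
})

collapsed = (
    toy.sort_values("ZCTA5")                 # missing ZCTA5 values sort last
       .groupby("filled_zip")                # one group per filled zip code
       .agg({"ZCTA5": "first",               # first non-missing ZCTA5 in the group
             "#e-filed returns": "sum"})     # total returns for the group
       .reset_index(drop=True)               # drop the filled_zip index
)
print(collapsed)
```

The two `48103` rows collapse into a single row with 5.0 returns, and the result is ordered by the group key.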

Spot-checking that the e-filed return numbers correctly aggregated.

In [8]:
merged[merged.filled_zip.isin(sample_zips)].groupby('filled_zip').agg({'#e-filed returns': 'sum'})

filled_zip | #e-filed returns
:----------|:----------------
10570      | 1.0
12062      | 4.0
28206      | 12.0
48662      | 2.0
49323      | 6.0


In [9]:
acs_mft_df[acs_mft_df.ZCTA5.isin(sample_zips)]

index | ZCTA5 | City | State | County | Org ID | United Way | #e-filed returns
:-----|:------|:-----|:------|:-------|:-------|:-----------|:----------------
1613  | 10570 | Pleasantville | NY | Westchester | 34645F | United Way of Westchester and Putnam, Inc. | 1.0
1908  | 12062 | East Nassau | NY | Rensselaer | 34540F | United Way of the Greater Capital Region | 4.0
4861  | 28206 | Charlotte | NC | Mecklenburg | 35110U | United Way of Central Carolinas, Inc. | 12.0
8892  | 48662 | Wheeler | MI | Gratiot | 24355F | United Way of Gratiot and Isabella Counties, Inc. | 2.0
9107  | 49323 | Dorr | MI | Allegan | 24010F | Allegan County United Way | 6.0
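The visual comparison of the two outputs could also be done programmatically. The helper below, `totals_match`, is hypothetical and not part of the original notebook; it assumes each `ZCTA5` appears at most once in the collapsed frame, and the toy frames are invented for illustration.

```python
import pandas as pd

def totals_match(merged, collapsed, zips, col="#e-filed returns"):
    """Return True if per-zip sums in `merged` equal the values in `collapsed`."""
    # Expected totals, computed directly from the un-aggregated frame.
    expected = merged[merged["filled_zip"].isin(zips)].groupby("filled_zip")[col].sum()
    # Totals as stored in the aggregated frame, keyed by zip code.
    actual = collapsed[collapsed["ZCTA5"].isin(zips)].set_index("ZCTA5")[col]
    # Align on zip code; a zip missing from `collapsed` becomes NaN and fails the check.
    actual = actual.reindex(expected.index)
    return bool((expected == actual).all())

# Toy frames standing in for `merged` and `acs_mft_df` (values are invented).
toy_merged = pd.DataFrame({
    "filled_zip": ["48103", "48103", "10571"],
    "#e-filed returns": [3.0, 2.0, 4.0],
})
toy_collapsed = pd.DataFrame({
    "ZCTA5": ["10571", "48103"],
    "#e-filed returns": [4.0, 5.0],
})

ok = totals_match(toy_merged, toy_collapsed, ["48103", "10571"])
print(ok)
```

In the notebook, a call such as `totals_match(merged, acs_mft_df, sample_zips)` would play the same role as comparing the two tables above by eye.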


In [11]:
acs_mft_df.to_csv("./data/mft_county_rel.csv", index=False)