### Validating Areal Interpolation

The purpose of this notebook is to demonstrate that our areal interpolation functions produce extensive and intensive statistics similar to known values.

In [41]:
import sys

# Others will need to change the line below to point at their local broadbandequity directory.
# This is necessary so that the Jupyter notebook can load our package.
sys.path[0] = '/Users/drewkeller/Desktop/CS/broadbandequity'

In [125]:
import matplotlib.pyplot as plt
%matplotlib inline
from data_pipeline.fetch_census_data import acs5_aggregate 
from data_pipeline import spatial_operations as so
import numpy as np
import pandas as pd
from IPython.display import display

We use 2019 ACS 5-year aggregate data for our validation. We compare against known values from CMAP (the Chicago Metropolitan Agency for Planning), which are also derived from aggregated ACS data:

"CCA values are estimated by aggregating ACS data for census tracts and block groups. Data from tracts and block groups located in multiple CCAs is allocated proportionally based on the block-level distribution of population, households or housing units (as appropriate) from the most recent Decennial Census."

Our approach:
1. Start with tract-level population data from ACS.
2. Calculate tract-level population density using tract shapefiles.
3. Aggregate tract-level population to community areas via areal-weighted sum.
4. Aggregate tract-level density to community areas via areal-weighted mean. Multiply by community-area area to get population.
5. Aggregate tract-level density to community areas via population-weighted mean. Multiply by community-area area to get population. _Note: We realized after the fact that this step is invalid and biases the result upward: density is inherently an area-weighted quantity and should not be weighted by population. Validating the population-weighted mean would require another known statistic, such as income._
6. Compare calculated populations to known values (Source: [CMAP: 2021 CDS based on 2019 population data](https://datahub.cmap.illinois.gov/dataset/community-data-snapshots-raw-data)).
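
Steps 3 and 4 can be illustrated on toy numbers. The sketch below is not the implementation of `so.aggregate`; it assumes the tract/community-area intersection areas are already known and shows only the arithmetic of the two areal aggregations. All names and values in it are made up for illustration.

```python
from collections import defaultdict

# Hypothetical intersection pieces: (tract, community area, tract population,
# tract area, area of the tract that falls inside the community area).
pieces = [
    ("t1", "A", 1000.0, 2.0, 2.0),
    ("t2", "A", 3000.0, 3.0, 1.0),  # t2 straddles A and B
    ("t2", "B", 3000.0, 3.0, 2.0),
]

areal_sum = defaultdict(float)         # step 3: extensive aggregation
weighted_density = defaultdict(float)  # step 4: numerator of the areal mean
covered_area = defaultdict(float)      # step 4: denominator of the areal mean

for tract, community, population, tract_area, overlap_area in pieces:
    # Areal-weighted sum: allocate population by the share of tract area.
    areal_sum[community] += population * overlap_area / tract_area
    # Areal-weighted mean: weight each tract's density by its overlap area.
    weighted_density[community] += (population / tract_area) * overlap_area
    covered_area[community] += overlap_area

# Multiply the mean density by the community area's area. In this toy case the
# tracts tile each community area, so its area equals the covered area.
pop_from_density = {
    c: weighted_density[c] / covered_area[c] * covered_area[c]
    for c in covered_area
}

print(dict(areal_sum))
print(pop_from_density)
```

Both routes allocate the 4,000 toy residents identically, which is the behavior we would hope to see from the real functions when tracts tile the community areas cleanly.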

In [43]:
# 1. Start with tract-level population data from ACS.
tract_data = acs5_aggregate()[["tract","estimated total population"]]
# rename rather than assign to a slice, which can trigger SettingWithCopyWarning
tract_data = tract_data.rename(columns={'estimated total population':'population'})
tract_data.head()

Unnamed: 0,tract,population
0,630200,1825
1,580700,5908
2,590600,3419
3,600700,2835
4,611900,1639


In [44]:
# 2. Calculate tract-level population density using tract shapefiles.
tract_data = so.geographize(tract_data,'tract')
tract_data["density"] = tract_data['population']/tract_data.area
tract_data.head()

Unnamed: 0,commarea,commarea_n,countyfp10,geoid10,name10,namelsad10,notes,statefp10,tract,geometry,population,area,density
0,44,44.0,31,17031842400,8424,Census Tract 8424,,17,842400,"POLYGON ((-87.62405 41.73022, -87.62405 41.730...",3082,0.000213,14464270.0
1,59,59.0,31,17031840300,8403,Census Tract 8403,,17,840300,"POLYGON ((-87.68608 41.82296, -87.68607 41.823...",3511,9e-05,38971210.0
2,34,34.0,31,17031841100,8411,Census Tract 8411,,17,841100,"POLYGON ((-87.62935 41.85280, -87.62934 41.852...",7142,0.000124,57621650.0
3,31,31.0,31,17031841200,8412,Census Tract 8412,,17,841200,"POLYGON ((-87.68813 41.85569, -87.68816 41.856...",4586,6.8e-05,67631180.0
4,32,32.0,31,17031839000,8390,Census Tract 8390,,17,839000,"POLYGON ((-87.63312 41.87449, -87.63306 41.874...",9209,5.6e-05,164418300.0


In [45]:
# 3. Aggregate tract-level population to community areas via areal-weighted sum.
community_pop = so.aggregate(tract_data,{'population' : 'areal sum'},'community_area','tract')
community_pop.head()

Unnamed: 0,community_area,population
0,ALBANY PARK,49961.187298
1,ARCHER HEIGHTS,13813.340737
2,ARMOUR SQUARE,13615.114045
3,ASHBURN,43493.698832
4,AUBURN GRESHAM,45990.552782


In [49]:
# 4. Aggregate tract-level density to community areas via areal-weighted mean. Multiply by community-area area to get population.
community_density_areal = so.geographize(so.aggregate(tract_data,{'density' : 'areal mean'},'community_area','tract'),'community_area')
community_density_areal['population'] = community_density_areal['density']*community_density_areal['area']
community_density_areal = community_density_areal[['community_area','population']]
community_density_areal.head()

Unnamed: 0,community_area,population
0,DOUGLAS,18756.276354
1,OAKLAND,4417.408055
2,FULLER PARK,2393.597361
3,GRAND BOULEVARD,22648.818032
4,KENWOOD,14178.445677


In [72]:
# 5. Aggregate tract-level density to community areas via population-weighted mean. Multiply by community-area area to get population.
community_density_pop = so.geographize(so.aggregate(tract_data,{'density' : 'pop mean'},'community_area','tract'),'community_area')
community_density_pop['population'] = community_density_pop['density']*community_density_pop['area']
community_density_pop = community_density_pop[['community_area','population']]
community_density_pop.head()

Unnamed: 0,community_area,population
0,DOUGLAS,22593.697698
1,OAKLAND,5637.476548
2,FULLER PARK,2634.166398
3,GRAND BOULEVARD,25133.248071
4,KENWOOD,17022.442742


In [157]:
# 6. Compare calculated populations to known values 

# first step: load validation data
validation_data = pd.read_csv('../data/CMAP_2019_comm_data.csv')[['GEOG','TOT_POP']]
validation_data['GEOG'] = [str(i).upper() for i in validation_data['GEOG']]
validation_data = validation_data.rename(columns={'GEOG':'community_area','TOT_POP':'known population'})
validation_data['community_area'] = validation_data['community_area'].replace({"O'HARE":"OHARE","THE LOOP": "LOOP"})
validation_data = validation_data.dropna()
validation_data.head()

Unnamed: 0,community_area,known population
0,ALBANY PARK,49805.99998
1,ARCHER HEIGHTS,13700.97018
2,ARMOUR SQUARE,13598.48056
3,ASHBURN,43355.99999
4,AUBURN GRESHAM,45909.00001


In [158]:
# second step: place calculated and known values side-by-side with errors
community_pop = community_pop.rename(columns={'population':'areal-weighted sum'})
community_density_areal = community_density_areal.rename(columns={'population':'areal-weighted mean'})
community_density_pop = community_density_pop.rename(columns={'population':'pop-weighted mean'})
validation_data = validation_data.join(community_pop.set_index('community_area'),on='community_area')
validation_data['areal-weighted sum error'] = validation_data['areal-weighted sum']-validation_data['known population']
validation_data = validation_data.join(community_density_areal.set_index('community_area'),on='community_area')
validation_data['areal-weighted mean error'] = validation_data['areal-weighted mean']-validation_data['known population']
validation_data = validation_data.join(community_density_pop.set_index('community_area'),on='community_area')
validation_data['pop-weighted mean error'] = validation_data['pop-weighted mean']-validation_data['known population']
validation_data.head()

Unnamed: 0,community_area,known population,areal-weighted sum,areal-weighted sum error,areal-weighted mean,areal-weighted mean error,pop-weighted mean,pop-weighted mean error
0,ALBANY PARK,49805.99998,49961.187298,155.187318,49961.187298,155.187318,56917.734415,7111.734435
1,ARCHER HEIGHTS,13700.97018,13813.340737,112.370557,13813.340737,112.370557,23458.726775,9757.756595
2,ARMOUR SQUARE,13598.48056,13615.114045,16.633485,13615.114045,16.633485,15600.298867,2001.818307
3,ASHBURN,43355.99999,43493.698832,137.698842,43509.988193,153.988203,49003.21233,5647.21234
4,AUBURN GRESHAM,45909.00001,45990.552782,81.552772,45990.552782,81.552772,49527.007153,3618.007143
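
Because the joins above match on the `community_area` string, any name spelled differently in the two tables silently produces `NaN` in the joined columns. A quick check along the following lines can catch such mismatches before errors are computed. The helper name and the sample names are illustrative only.

```python
def unmatched_names(left_names, right_names):
    """Return the names that appear in only one of the two collections."""
    left, right = set(left_names), set(right_names)
    return sorted(left - right), sorted(right - left)

# Illustrative names only: one table has been normalized, the other has not.
known = ["ALBANY PARK", "OHARE", "LOOP"]
calculated = ["ALBANY PARK", "O'HARE", "LOOP"]

only_known, only_calculated = unmatched_names(known, calculated)
print(only_known)       # names with no calculated value
print(only_calculated)  # names with no known value
```

In this notebook the same helper could be called with `validation_data['community_area']` and `community_pop['community_area']`. Counting missing values after the join with `validation_data.isna().sum()` is an equivalent test.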


In [159]:
# next step: add simple crosswalk for comparison

crosswalk = pd.read_csv("../data/chicago_internet.csv")[['name','total_pop']]
crosswalk['name'] = [str(i).upper() for i in crosswalk['name']]
crosswalk = crosswalk.rename(columns={'name':'community_area','total_pop':'crosswalk'})
crosswalk = crosswalk.replace({"O'HARE":"OHARE"})
validation_data = validation_data.join(crosswalk.set_index('community_area'),on='community_area')
validation_data['crosswalk error'] = validation_data['crosswalk']-validation_data['known population']

In [166]:
# final step: summary statistics for each method

methods = {
    'Areal-weighted sum': 'areal-weighted sum error',
    'Areal-weighted mean': 'areal-weighted mean error',
    'Pop-weighted mean': 'pop-weighted mean error',
    'Simple crosswalk': 'crosswalk error',
}

for label, col in methods.items():
    err = validation_data[col]
    # relative error, used to count community areas off by more than 5% and 20%
    rel = abs(err / validation_data['known population'])
    print(f'{label}:')
    print(f'Maximum error: {max(abs(err))}')
    print(f'Median error: {np.median(err)}')
    print(f'Mean error: {np.mean(err)}')
    print(f'RMS error: {np.sqrt(np.average(err**2))}')
    print(f'Community areas off by more than 5%, 20%: {(rel > 0.05).sum()},{(rel > 0.2).sum()}')
    print('')

Areal-weighted sum:
Maximum error: 11163.32187874765
Median error: -14.702800242292142
Mean error: -975.0033615755689
RMS error: 2624.3243001499955
Community areas off by more than 5%, 20%: 14,2

Areal-weighted mean:
Maximum error: 11163.321878746778
Median error: -9.460204545499892
Mean error: -924.7923049156362
RMS error: 2658.132967193856
Community areas off by more than 5%, 20%: 15,3

Pop-weighted mean:
Maximum error: 144074.93685718856
Median error: 5647.21233952263
Mean error: 10293.690862051439
RMS error: 21373.79904503689
Community areas off by more than 5%, 20%: 67,31

Simple crosswalk:
Maximum error: 6046.38276
Median error: 0.0
Mean error: 192.82476575324677
RMS error: 1053.2252118963247
Community areas off by more than 5%, 20%: 4,1


Discussion:

Areal-weighted sum and mean produce very similar results, as expected; in many cases they are exactly the same. Both are within a few hundred residents of our validation data for most neighborhoods, but each underestimates a handful of neighborhoods by thousands of residents, with a maximum absolute error of about 11,000.
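
The agreement between the two methods follows from the arithmetic, assuming `so.aggregate` weights by intersection area in the usual way. Write $P_i$ and $A_i$ for the population and area of tract $i$, $d_i = P_i / A_i$ for its density, $A_{ij}$ for the area of its intersection with community area $j$, and $A_j$ for the area of community area $j$. These symbols are introduced here for illustration and do not appear in the code.

$$
\hat{P}_j^{\text{sum}} = \sum_i P_i \frac{A_{ij}}{A_i} = \sum_i d_i A_{ij},
\qquad
\hat{P}_j^{\text{mean}} = A_j \cdot \frac{\sum_i d_i A_{ij}}{\sum_i A_{ij}}
$$

The two estimates are equal whenever $\sum_i A_{ij} = A_j$, that is, whenever the tracts cover the community area exactly. Any gap or overshoot between the tract layer and the community-area boundary makes the ratio $A_j / \sum_i A_{ij}$ differ from 1. That may explain the small differences seen in rows such as ASHBURN, though this has not been checked.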

This is cause for concern, but the explanation may be more complicated than a broken aggregation function. The low median error and the clustering of most results near the known values suggest that the function generally works. One possibility still to rule out is that the handling of overlapping tracts is faulty and that the outliers are the community areas with more overlapping tracts.

Population-weighted mean overestimates _most_ neighborhoods by several thousand residents. On reflection, this is expected: taking a population-weighted mean of population density is not valid, because it gives extra weight to the densest tracts and so biases the population estimate upward. These results are therefore consistent with the population-weighted mean function working correctly, but confirming that requires a different validation variable.
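
The upward bias can be seen with two made-up tracts. Weighting density by population gives $\sum_i P_i d_i / \sum_i P_i$, which by the Cauchy-Schwarz inequality is never smaller than the area-weighted density $\sum_i P_i / \sum_i A_i$, and equals it only when every tract has the same density. The numbers below are illustrative and not taken from the data.

```python
# Two hypothetical tracts that exactly tile one community area.
populations = [1000.0, 9000.0]
areas = [5.0, 5.0]
densities = [p / a for p, a in zip(populations, areas)]

total_area = sum(areas)
true_population = sum(populations)

# Area-weighted mean density recovers the true population.
areal_mean = sum(d * a for d, a in zip(densities, areas)) / total_area
pop_from_areal = areal_mean * total_area

# Population-weighted mean density over-represents the dense tract.
pop_mean = sum(d * p for d, p in zip(densities, populations)) / true_population
pop_from_pop_weighted = pop_mean * total_area

print(true_population, pop_from_areal, pop_from_pop_weighted)
```

Here the true population is 10,000. The area-weighted estimate reproduces it, while the population-weighted estimate comes out at 16,400.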

The simple crosswalk method, which maps each tract to a single community area, performs best by a significant margin. In fact, for a majority of community areas (48/77) it reproduces the known value exactly!

This does raise some questions. Are those 48 community areas the ones whose boundaries align exactly with tract boundaries? If so, this result would make sense, but it would then be concerning that the aggregation function did not also give exact results for them. The next step may be to inspect one of these community areas in depth to identify what is going on.
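
One way to begin that inspection is to split the community areas by whether the crosswalk reproduces the known value, then compare the areal-weighted errors in each group. The sketch below uses a few rounded rows copied from the full results table at the end of this notebook. The grouping logic, and the use of a zero rounded error as the test for "exact", are assumptions.

```python
# (community area, areal-weighted sum error, crosswalk error), rounded values
# copied from the full results table in this notebook.
rows = [
    ("ALBANY PARK", 155, 0),
    ("ARCHER HEIGHTS", 112, 25),
    ("ARMOUR SQUARE", 17, -60),
    ("ASHBURN", 138, 0),
    ("AVALON PARK", -75, 42),
    ("AVONDALE", 8, 0),
]

# Split by whether the crosswalk error is zero.
exact = [r for r in rows if r[2] == 0]
inexact = [r for r in rows if r[2] != 0]

def mean_abs_areal_error(group):
    """Mean absolute areal-weighted sum error for a group of rows."""
    return sum(abs(r[1]) for r in group) / len(group)

print("crosswalk exact:  ", [r[0] for r in exact], mean_abs_areal_error(exact))
print("crosswalk inexact:", [r[0] for r in inexact], mean_abs_areal_error(inexact))
```

If the areal-weighted error is non-zero even where the crosswalk is exact, as in these rows, the discrepancy is more likely to come from boundary geometry or overlap handling than from the source population figures.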

In [163]:
# save to CSV
validation_data.to_csv('validation_data.csv')

In [164]:
# display in full
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(validation_data.round().convert_dtypes(convert_integer=True))

Unnamed: 0,community_area,known population,areal-weighted sum,areal-weighted sum error,areal-weighted mean,areal-weighted mean error,pop-weighted mean,pop-weighted mean error,crosswalk,crosswalk error
0,ALBANY PARK,49806,49961,155,49961,155,56918,7112,49806,0
1,ARCHER HEIGHTS,13701,13813,112,13813,112,23459,9758,13726,25
2,ARMOUR SQUARE,13598,13615,17,13615,17,15600,2002,13538,-60
3,ASHBURN,43356,43494,138,43510,154,49003,5647,43356,0
4,AUBURN GRESHAM,45909,45991,82,45991,82,49527,3618,45909,0
5,AUSTIN,93727,93913,186,93923,196,118394,24667,93727,0
6,AVALON PARK,9671,9596,-75,9596,-75,9644,-26,9713,42
7,AVONDALE,38118,38126,8,38126,8,42107,3989,38118,0
8,BELMONT CRAGIN,78550,78624,74,78624,74,83727,5177,78601,51
9,BEVERLY,19791,19834,43,19834,43,20944,1153,19791,0
