This notebook contains the code used to wrangle, join, and analyze the following datasets:

+ American Community Survey (ACS) 5-year estimates (2015-2019)
    + [Aggregate](https://api.census.gov/data/2019/acs/acs5/variables.html)
    + [Profile](https://api.census.gov/data/2019/acs/acs5/profile/variables.html)

+ Chicago Community Area Census Tract Crosswalk
    + [Chicago Community Areas](https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6)
    + [Chicago Census Tracts](https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Census-Tracts-2010/5jrd-6zik)
+ [Chicago COVID-19 Community Vulnerability Index (CCVI)](https://data.cityofchicago.org/Health-Human-Services/Chicago-COVID-19-Community-Vulnerability-Index-CCV/xhc6-88s9)
+ [Hardship Index](https://data.cityofchicago.org/Health-Human-Services/hardship-index/792q-4jtu)



All CSV files can be found in the "data" folder of the [working](https://github.com/danielgrzenda/broadbandequity/tree/working) branch of our Broadband Equity GitHub repo.

###### Importing Libraries

In [278]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

###### Importing Chicago ACS aggregate and profile data

In [295]:
# ACS aggregate

acs_agg = pd.read_csv("data/acs5_aggregate.csv",index_col=0,
                      parse_dates=[0]).drop(['state', 'county'], axis=1)

In [299]:
# ACS profile

acs_pro = pd.read_csv("data/acs5_profile.csv",
                      index_col=0,
                      parse_dates=[0]).drop(
                        ['estimated total households with a computer',
                         'estimated unemployment rate',
                         'state', 'county'], axis=1)

In [321]:
# merging both ACS datasets
# renaming variables

acs_df = acs_agg.merge(acs_pro, on='tract').rename(columns={
            'estimated total population': 'total_pop',
            'estimated total households': 'total_households',
            'estimated total with internet subscription': 'hh_internet',
            'estimated total with no internet access': 'hh_no_internet',
            'estimated total has no computer': 'hh_no_computer',
            'estimated total has a computer': 'hh_computer',
            'estimated total households with broadband internet subscription': 'hh_broadband',
            'estimated total households median household income': 'hh_median_income',
            'percent estimated percent of families and people whose income in the past 12 months is below poverty level': 'hh_poverty(%)',
            'percent estimated unemployment rate': 'hh_unemployment(%)',
            'estimated total population hispanic or latino (of any race)': 'total_hispanic',
            'estimated total population Black or African American alone (non-Hispanic)': 'total_black'})
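
An inner merge on `tract` silently drops tracts that appear in only one file, and it duplicates rows if a tract repeats. The following is a minimal sketch of one way to check for both, using the `validate` and `indicator` options of `DataFrame.merge`. The frames `agg` and `pro` and their values are made-up stand-ins, not the project data.

```python
import pandas as pd

# Toy stand-ins for the two ACS tables (values are illustrative only).
agg = pd.DataFrame({"tract": [630200, 580700, 590600],
                    "total_pop": [1825, 5908, 3419]})
pro = pd.DataFrame({"tract": [630200, 580700, 600700],
                    "hh_median_income": [37422, 47000, 45294]})

# validate="one_to_one" raises a MergeError if a tract repeats in either table;
# indicator=True adds a "_merge" column recording which table each row came from.
check = agg.merge(pro, on="tract", how="outer",
                  validate="one_to_one", indicator=True)

# Tracts that an inner merge would have dropped without warning.
unmatched = check.loc[check["_merge"] != "both", ["tract", "_merge"]]
print(unmatched)
```

If `unmatched` is empty, the inner merge used in this notebook loses no tracts.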


In [303]:
# 1319 rows x 14 columns
# variables beginning with "hh" indicate they are measured at the household level
# variables beginning with "p" indicate they are recorded as percentages 

acs_df.head(5)

Unnamed: 0,total_pop,hh_internet,hh_no_internet,hh_computer,hh_no_computer,tract,total_households,hh_broadband,hh_median_income,p_poverty,total_hispanic,total_black,p_unemployed
0,1825,392,149,426,149,630200,575,392,37422,25.7,1622,0,36.2
1,5908,1242,231,1411,133,580700,1544,1242,47000,17.4,4742,161,42.3
2,3419,928,140,1068,104,590600,1172,917,46033,7.9,2119,9,30.7
3,2835,917,138,1003,81,600700,1084,917,45294,17.0,850,82,36.3
4,1639,322,245,356,218,611900,574,322,24507,55.0,438,1175,46.0


###### Importing Chicago COVID and Hardship Indices

In [207]:
# Chicago COVID-19 Community Vulnerability Index (CCVI) csv

covid = pd.read_csv("data/covid_index.csv",
                        index_col=0,
                        parse_dates=[0]).reset_index()

In [208]:
# filtering for Community Areas only

covid = covid[covid['Geography Type']=="CA"]

In [209]:
# selecting and renaming relevant columns

covid = covid[['Community Area Name', 
               'Community Area or ZIP Code', 
               'CCVI Score', 
               'CCVI Category']].rename(columns={
                        "Community Area Name": "name",
                        'Community Area or ZIP Code': "comm_num",
                        "CCVI Score": "ccvi_score",
                        'CCVI Category': "ccvi_cat"})

In [210]:
# setting index to community area names to strip the trailing asterisks

covid = covid.set_index(covid.columns[0]).rename(
    index={"Fuller Park*":"Fuller Park", "Burnside*":"Burnside"})

In [211]:
# reverting index to normal

covid=covid.reset_index()

In [213]:
# Chicago hardship index csv

hardship = pd.read_csv("data/hardship_index.csv",
                        index_col=0,
                        parse_dates=[0]).reset_index()

In [214]:
# removing the citywide "CHICAGO" row, which is not a community area

hardship = hardship[hardship['COMMUNITY AREA NAME']!="CHICAGO"]

In [215]:
# selecting and renaming relevant columns

hardship = hardship[['COMMUNITY AREA NAME', 
               'HARDSHIP INDEX']].rename(columns={
                        "COMMUNITY AREA NAME": "name",
                        'HARDSHIP INDEX': "hardship_score"})

In [229]:
# setting index to community area names to fix spelling

hardship = hardship.set_index(hardship.columns[0]).rename(
    index={"Montclaire":"Montclare", "Humboldt park":"Humboldt Park", "Washington Height":"Washington Heights"})

In [230]:
# reverting index to normal

hardship=hardship.reset_index()

In [237]:
# joining datasets

covid_hardship = covid.merge(hardship, on='name')
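
Because the join key here is a free-text name, any difference in spelling, case, or punctuation removes a community area from the result. The sketch below is one hedged way to list such mismatches before merging. The helper `normalize` and the sample names are hypothetical. Title-casing changes how some names display, so it is best used only for the comparison, and genuine misspellings such as "Montclaire" still need an explicit mapping like the one above.

```python
import pandas as pd

# Illustrative names only; the real columns come from the two CSV files.
covid_names = pd.Series(["Fuller Park*", "Burnside*", "Rogers Park"])
hardship_names = pd.Series(["Fuller Park", "Burnside", "Rogers park"])

def normalize(names: pd.Series) -> pd.Series:
    """Drop footnote asterisks, trim whitespace, and title-case each name."""
    return names.str.replace("*", "", regex=False).str.strip().str.title()

left = set(normalize(covid_names))
right = set(normalize(hardship_names))

# Names found in only one source; an empty set means every name will match.
mismatched = left ^ right
print(mismatched)
```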

In [238]:
# 77 rows x 5 columns

covid_hardship.head(5)

Unnamed: 0,name,comm_num,ccvi_score,ccvi_cat,hardship_score
0,Ashburn,70,45.1,MEDIUM,37.0
1,Rogers Park,1,30.9,LOW,39.0
2,Lake View,6,5.2,LOW,5.0
3,Jefferson Park,11,25.6,LOW,25.0
4,Archer Heights,57,53.3,HIGH,67.0


###### Importing Chicago Community Area Census Tract Crosswalk

In [239]:
# importing Census tracts mapped to community area number 
# renaming columns

tracts = pd.read_csv("data/tracts_comm_areas.csv",
                        index_col=0,
                        parse_dates=[0]).rename(columns={
                        "COMMAREA": "comm_num",          
                        "TRACTCE10": "tract"})

In [240]:
# importing community area numbers and names
# renaming columns

comm_area = pd.read_csv("data/comm_areas.csv",
                        index_col=0,
                        parse_dates=[0]).rename(columns={
                        "AREA_NUMBE": "comm_num"})

In [241]:
# merging both dataframes above to map tract with community area name

tract_area= comm_area.merge(tracts, on='comm_num')

In [242]:
# selecting the columns we need

tract_area=tract_area[['comm_num', 'tract']]

In [243]:
# 801 rows x 3 columns
# final dataframe

tract_area.head(5)

Unnamed: 0,comm_num,tract
0,35,842000
1,35,351500
2,35,839500
3,35,839200
4,35,839600


###### Joining ACS, COVID index, Hardship Index, and Community Areas

In [323]:
# merging ACS with chicago community areas, covid index, hardship index

full_df = tract_area.merge(acs_df, on='tract'
                           ).merge(covid_hardship, on='comm_num')

In [324]:
# removing tracts with zero population

full_df = full_df[full_df['total_pop']!=0]

In [325]:
# 798 rows x 19 columns

full_df.sort_values(by="name").head(5)

Unnamed: 0,comm_num,tract,total_pop,hh_internet,hh_no_internet,hh_computer,hh_no_computer,total_households,hh_broadband,hh_median_income,hh_poverty(%),total_hispanic,total_black,hh_unemployment(%),name,ccvi_score,ccvi_cat,hardship_score
108,14,140302,4189,1032,147,1188,104,1292,1032,75789,6.8,1433,228,33.0,Albany Park,38.2,MEDIUM,53.0
103,14,140702,5882,1444,390,1750,247,1997,1444,57708,14.9,3746,183,23.0,Albany Park,38.2,MEDIUM,53.0
104,14,140301,2839,579,165,723,89,812,579,50667,18.8,1724,57,34.9,Albany Park,38.2,MEDIUM,53.0
105,14,140601,2886,690,154,776,105,881,690,43988,12.1,1183,329,39.5,Albany Park,38.2,MEDIUM,53.0
106,14,140701,3028,898,140,998,75,1073,898,71125,11.0,1450,161,22.6,Albany Park,38.2,MEDIUM,53.0


###### Computer, Internet, Broadband Access

This section looks at each community area at the household level to see which households have basic access to the internet and/or a computer and which do not. We also look at households that have a broadband internet subscription.

In [317]:
# who has a computer? who has internet access? 
# selecting columns we need

internet_df = full_df[['name', 
                        'total_households',
                        'hh_no_internet', 
                        'hh_internet',
                        'hh_computer', 
                        'hh_no_computer',
                       'hh_broadband']].groupby(by="name").sum()

In [318]:
# calculating percentages of households for each variable

# percentage of households with/out internet access

internet_df['hh_no_internet(%)']=internet_df['hh_no_internet']/internet_df['total_households']*100
internet_df['hh_internet(%)']=internet_df['hh_internet']/internet_df['total_households']*100

# percentage of households with/without a computer

internet_df['hh_no_computer(%)']=internet_df['hh_no_computer']/internet_df['total_households']*100
internet_df['hh_computer(%)']=internet_df['hh_computer']/internet_df['total_households']*100

# percentage of households with/out broadband 

internet_df['hh_broadband(%)']=internet_df['hh_broadband']/internet_df['total_households']*100

In [319]:
# calculating response rates 

internet_df['internet_rr']=(internet_df['hh_internet']+internet_df['hh_no_internet'])/internet_df['total_households']*100
internet_df['computer_rr']=(internet_df['hh_computer']+internet_df['hh_no_computer'])/internet_df['total_households']*100
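
The share calculations above all follow the same pattern: a count divided by `total_households`, times 100. As a sketch, that pattern could be wrapped in a small helper. The name `add_share` and the zero-household row are hypothetical; the Burnside counts are taken from the table in this notebook.

```python
import pandas as pd

def add_share(df, count_col, total_col="total_households"):
    """Return a copy of df with a '<count_col>(%)' column added."""
    out = df.copy()
    out[f"{count_col}(%)"] = out[count_col] / out[total_col] * 100
    return out

toy = pd.DataFrame(
    {"total_households": [888, 0], "hh_broadband": [464, 0]},
    index=["Burnside", "Empty area (hypothetical)"],
)

shares = add_share(toy, "hh_broadband")
print(shares)
```

An area with zero households yields `NaN` rather than raising an error, so such rows are worth filtering out before ranking.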


In [320]:
# resulting dataframe

internet_df.sort_values(["hh_broadband(%)"],
                        ascending=True)

Unnamed: 0_level_0,total_households,hh_no_internet,hh_internet,hh_computer,hh_no_computer,hh_broadband,hh_no_internet(%),hh_internet(%),hh_no_computer(%),hh_computer(%),hh_broadband(%),internet_rr,computer_rr
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Burnside,888,375,477,557,331,464,42.229730,53.716216,37.274775,62.725225,52.252252,95.945946,100.0
Englewood,8983,3108,5083,6728,2255,5058,34.598686,56.584660,25.102972,74.897028,56.306356,91.183346,100.0
West Englewood,9483,3612,5428,6670,2813,5428,38.089212,57.239270,29.663609,70.336391,57.239270,95.328483,100.0
Fuller Park,1128,391,666,783,345,666,34.663121,59.042553,30.585106,69.414894,59.042553,93.705674,100.0
North Lawndale,11075,3321,6631,8417,2658,6603,29.986456,59.873589,24.000000,76.000000,59.620767,89.860045,100.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
Lincoln Square,18347,1324,16505,17464,883,16488,7.216439,89.960211,4.812776,95.187224,89.867553,97.176650,100.0
North Center,14093,1161,12763,13419,674,12739,8.238132,90.562691,4.782516,95.217484,90.392393,98.800823,100.0
Lincoln Park,32395,2026,29514,30706,1689,29489,6.254052,91.106652,5.213768,94.786232,91.029480,97.360704,100.0
Lake View,53480,3065,48836,51552,1928,48705,5.731114,91.316380,3.605086,96.394914,91.071429,97.047494,100.0


Based on 2015-2019 ACS data, the table above shows the percentages of households with computers, internet access, and a broadband internet subscription. The neighborhoods of Burnside, Englewood, West Englewood, and Fuller Park have the lowest percentages of both broadband subscription and internet access. The neighborhoods of Near South Side, Lake View, Lincoln Park, and North Center have the highest.

Broadband subscription and internet access numbers are extremely close, suggesting that the overwhelming majority of households with internet access get it through a broadband subscription.
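
To put a number on how close the two measures are, one option is to divide broadband subscriptions by internet subscriptions. The sketch below does this for three rows copied from the table above. The column name `broadband_share_of_internet(%)` is introduced here for illustration.

```python
import pandas as pd

# Counts copied from the community-area table above.
sample = pd.DataFrame(
    {"hh_internet": [477, 5083, 48836],
     "hh_broadband": [464, 5058, 48705]},
    index=["Burnside", "Englewood", "Lake View"],
)

# Share of internet-subscribing households whose subscription is broadband.
sample["broadband_share_of_internet(%)"] = (
    sample["hh_broadband"] / sample["hh_internet"] * 100
)
print(sample.round(1))
```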

###### Economics: Median Household Income, Poverty Rates, Unemployment Rates

In [332]:
# median household income, poverty rates, unemployment rates by community area
# taking the median of these tract-level values within each community area

income_df = full_df[['name', 
                      'hh_median_income', 
                      'hh_poverty(%)',
                    'hh_unemployment(%)']].groupby(by = "name").median().sort_values(["hh_unemployment(%)"], 
                                                                                  ascending = False)

income_df

Unnamed: 0_level_0,hh_median_income,hh_poverty(%),hh_unemployment(%)
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Fuller Park,23746.5,21.7,57.55
West Garfield Park,24001.5,36.4,48.75
West Englewood,27277.0,29.9,48.00
Englewood,23125.0,36.6,47.70
North Lawndale,29028.0,24.3,46.20
...,...,...,...
Lincoln Square,86125.0,2.9,21.20
Logan Square,79231.0,9.5,19.20
North Center,119904.5,1.8,18.20
West Town,102083.0,4.5,16.60


Median income, poverty rates, and unemployment rates are based on household-level data. For each community area, the table above reports the median across its tracts of median household income, poverty rate, and unemployment rate. The median household incomes in Chicago community areas range from Riverdale's \$15,408 to Lincoln Park's \$127,177. Riverdale also has the highest percentage of its households in poverty at 49.2%. Fuller Park has the highest unemployment rate at 57.6%.
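
A median of tract-level rates counts every tract equally, whatever its size. As a point of comparison only, the sketch below contrasts it with a household-weighted mean for the five Albany Park tracts shown in the tract-level table earlier. The weighted mean is a different technique from the one this notebook uses, and weighting a poverty rate by household counts is an assumption made for illustration.

```python
import pandas as pd

# Five Albany Park tracts, copied from the tract-level table shown earlier.
tracts = pd.DataFrame({
    "name": ["Albany Park"] * 5,
    "total_households": [1292, 1997, 812, 881, 1073],
    "hh_poverty(%)": [6.8, 14.9, 18.8, 12.1, 11.0],
})

# Median of tract rates: every tract counts equally.
median_rate = tracts.groupby("name")["hh_poverty(%)"].median()

# Household-weighted mean: larger tracts count for more.
weighted_sum = (
    tracts["hh_poverty(%)"] * tracts["total_households"]
).groupby(tracts["name"]).sum()
weighted_rate = weighted_sum / tracts.groupby("name")["total_households"].sum()

comparison = pd.DataFrame({"median_of_tracts": median_rate,
                           "household_weighted": weighted_rate})
print(comparison.round(2))
```

The two figures differ by about half a percentage point in this small sample, and the gap can be larger where tract sizes vary more.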

###### Race & Ethnicity

In [338]:
# race and ethnicity by community area

race_df = full_df[['name', 
                        'total_pop',
                        'total_hispanic', 
                        'total_black']].groupby(by = "name").sum().sort_values(["name"], 
                                                                      ascending = True)

In [339]:
# percentage of population hispanic

race_df['total_hispanic(%)']=race_df['total_hispanic']/race_df['total_pop']*100

# percentage of population black non-hispanic

race_df['total_black(%)']=race_df['total_black']/race_df['total_pop']*100

In [340]:
# final race and ethnicity dataframe

race_df.sort_values(["total_black(%)"], ascending = False)

Unnamed: 0_level_0,total_pop,total_hispanic,total_black,total_hispanic(%),total_black(%)
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Avalon Park,9713,9,9381,0.092659,96.581901
Burnside,2006,39,1931,1.944167,96.261216
Washington Heights,26742,305,25698,1.140528,96.096029
Chatham,30967,262,29625,0.846062,95.666355
Greater Grand Crossing,30149,524,28839,1.738034,95.654914
...,...,...,...,...,...
Forest Glen,19384,2933,231,15.131036,1.191704
Jefferson Park,27503,6806,293,24.746391,1.065338
Archer Heights,13726,10626,131,77.415125,0.954393
Norwood Park,43405,5859,347,13.498445,0.799447


Race and ethnicity data are based on total population counts. The Hispanic/Latino category includes people of any race; the Black/African American category is non-Hispanic/Latino only. Chicago community areas vary widely in their racial and ethnic composition. Gage Park, South Lawndale, West Elsdon, and Hermosa have the highest percentages of Hispanic/Latino residents of any race. Avalon Park, Burnside, Washington Heights, and Chatham have the highest percentages of non-Hispanic/Latino Black/African American residents.
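
Rankings like these can be read directly from the data with `DataFrame.nlargest` rather than by scanning a sorted printout. A minimal sketch, using five rows of counts copied from the table above (the frame name `race_sample` is illustrative):

```python
import pandas as pd

# Population counts copied from the race and ethnicity table above.
race_sample = pd.DataFrame(
    {"total_pop": [9713, 2006, 26742, 30967, 43405],
     "total_black": [9381, 1931, 25698, 29625, 347]},
    index=["Avalon Park", "Burnside", "Washington Heights",
           "Chatham", "Norwood Park"],
)
race_sample["total_black(%)"] = (
    race_sample["total_black"] / race_sample["total_pop"] * 100
)

# Three community areas with the largest share, highest first.
top3 = race_sample.nlargest(3, "total_black(%)")
print(top3.round(2))
```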

##### Internet Access & Demographics Combined

The table below shows all of the variables above.

In [349]:
# merging all tables by community area

final_df = internet_df.merge(income_df,
                        on='name').merge(race_df,
                                          on='name').merge(covid_hardship, on="name")

In [350]:
# reordering columns

final_df=final_df[['name', 
 'comm_num',
 'total_pop',
 'total_households', 
 'hh_no_internet', 
 'hh_no_internet(%)',
 'hh_internet',
 'hh_internet(%)',
 'internet_rr',
 'hh_computer',
 'hh_computer(%)',
 'hh_no_computer',
 'hh_no_computer(%)',
 'computer_rr',
 'hh_broadband',
 'hh_broadband(%)',
 'hh_poverty(%)',
 'hh_unemployment(%)',
 'hh_median_income',
 'total_hispanic',
 'total_hispanic(%)', 
 'total_black',  
 'total_black(%)', 
 'ccvi_score', 
 'ccvi_cat', 
 'hardship_score']]

In [351]:
final_df

Unnamed: 0,name,comm_num,total_pop,total_households,hh_no_internet,hh_no_internet(%),hh_internet,hh_internet(%),internet_rr,hh_computer,...,hh_poverty(%),hh_unemployment(%),hh_median_income,total_hispanic,total_hispanic(%),total_black,total_black(%),ccvi_score,ccvi_cat,hardship_score
0,Albany Park,14,49806,16909,2674,15.814064,13488,79.768171,95.582234,15104,...,12.10,33.00,66818.0,22399,44.972493,2461,4.941172,38.2,MEDIUM,53.0
1,Archer Heights,57,13726,3919,772,19.698903,2886,73.641235,93.340138,3207,...,10.10,33.60,48629.0,10626,77.415125,131,0.954393,53.3,HIGH,67.0
2,Armour Square,34,13538,5396,1488,27.575982,3685,68.291327,95.867309,4064,...,25.80,45.90,33333.0,585,4.321170,1135,8.383809,30.9,LOW,82.0
3,Ashburn,70,43356,13124,1840,14.020116,10449,79.617495,93.637610,11847,...,8.75,31.45,69261.0,17918,41.327613,19888,45.871390,45.1,MEDIUM,37.0
4,Auburn Gresham,71,45909,17161,5282,30.779092,10394,60.567566,91.346658,13724,...,23.60,45.60,35568.0,1001,2.180400,43791,95.386526,48.2,HIGH,74.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,West Lawn,65,31886,9272,1752,18.895600,7094,76.509922,95.405522,7803,...,12.55,32.95,52992.5,26775,83.971022,845,2.650066,48.5,HIGH,56.0
73,West Pullman,53,30020,10598,2176,20.532176,8101,76.438951,96.971127,8972,...,15.50,38.20,43143.0,1562,5.203198,27579,91.868754,49.2,HIGH,62.0
74,West Ridge,2,78466,25714,3676,14.295714,20875,81.181458,95.477172,23515,...,14.30,35.45,53153.0,14835,18.906278,9086,11.579538,36.0,MEDIUM,46.0
75,West Town,24,83757,37819,3187,8.426981,33590,88.817790,97.244771,35289,...,4.50,16.60,102083.0,18567,22.167699,5727,6.837637,18.2,LOW,10.0


In [352]:
# export to CSV file in the data folder

final_df.to_csv("data/chicago_internet.csv") 
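
By default, `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference that `index=False` makes. It writes to in-memory buffers instead of the data folder, and the two-row frame is illustrative.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Albany Park", "Archer Heights"],
                   "comm_num": [14, 57]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reread_with_index = pd.read_csv(with_index)
print(reread_with_index.columns.tolist())

# index=False: only the real columns are written.
no_index = io.StringIO()
df.to_csv(no_index, index=False)
no_index.seek(0)
round_trip = pd.read_csv(no_index)
print(round_trip.columns.tolist())
```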

Variables whose names end in (%), along with internet_rr and computer_rr, are percentages. The remaining household and population variables are counts, and hh_median_income is in dollars. All internet variables are on the household level. All demographic variables are on the community area population level.

Based on 2015-2019 ACS data, the table above shows the percentages of households with computers, internet access, and a broadband internet subscription. The neighborhoods of Burnside, Englewood, West Englewood, and Fuller Park have the lowest percentages of both broadband subscription and internet access. The neighborhoods of Near South Side, Lake View, Lincoln Park, and North Center have the highest.

Broadband subscription and internet access numbers are extremely close, suggesting that the overwhelming majority of households with internet access get it through a broadband subscription.

Median income and poverty rates are based on household-level data. For each community area, the table above reports the median across its tracts of median household income and poverty rate. The median household incomes in Chicago community areas range from Riverdale's \$15,408 to Lincoln Park's \$127,177. Riverdale also has the highest percentage of its households in poverty at 49.2%.

Race and ethnicity data are based on total population counts. The Hispanic/Latino category includes people of any race; the Black/African American category is non-Hispanic/Latino only. Chicago community areas vary widely in their racial and ethnic composition. Gage Park, South Lawndale, West Elsdon, and Hermosa have the highest percentages of Hispanic/Latino residents of any race. Avalon Park, Burnside, Washington Heights, and Chatham have the highest percentages of non-Hispanic/Latino Black/African American residents.