In [1]:
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
import fidap
from config import api_key

# instantiate api connection
fidap = fidap.fidap_client(api_key=api_key)

### Social Determinants of Health  

This document aims to create a Minimum Viable Product (MVP) that documents the constituent components of the Social Determinants of Health (SDOH). SDOH is broadly conceptualized as comprising measurements of:  
1. Crime Levels  
2. Educational Attainment  
3. Retail-Grocery Gap  
4. Environmental Factors  
5. Personal Infrastructure  
6. Climate Change  
7. Family  

The aim here is to approximate what the open data used to derive SDOH might look like.  
  
#### Minimum Viable Product (MVP)  
  
The proposed MVP should be able to cover all 7 factors listed above, grouped into 6 categories because Climate Change and Environmental Factors are amalgamated into one. For the purpose of this MVP, we primarily gather data at the spatial scale of zip codes for Chicago, IL, in 2018. [Chicago's zip codes](https://www.chicago.gov/content/dam/city/sites/covid/reports/2020-04-24/ChicagoCommunityAreaandZipcodeMap.pdf) have the distinctive feature of starting with 606. In fact, all zip codes starting with 606 relate to Chicago, which is what the queries below filter on.   
  
Much of the data can be obtained at the zip-code level, whereas data at the scale of census blocks is not always readily available. It is therefore preferable to aggregate at the scale of zip codes. Zip codes also have the benefit of being widely understood, as opposed to the rather esoteric FIPS codes of Census blocks or Census tracts. There are several known problems with using zip codes, which can be addressed in future iterations of this product. For the moment, the combined use of ZCTAs from the Census Bureau and zip codes is stable enough to provide an indicative MVP.   
  
The common identifier that can be used to connect all datasets will be zip code.  
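
As a minimal sketch of how the zip code can act as the join key, the example below merges two small per-zip tables with pandas. The table contents and the `supermarkets` column are made-up placeholders, not query results; only the joining pattern is the point.

```python
import pandas as pd

# Hypothetical per-zip tables standing in for the query results in this document.
population = pd.DataFrame({
    "zip": ["60601", "60602", "60603"],
    "total_pop": [10000, 2000, 500],      # illustrative values only
})
supermarkets = pd.DataFrame({
    "zip": ["60601", "60603"],
    "supermarkets": [2, 1],               # illustrative values only
})

# Keep zip codes as strings on both sides so the join keys share a dtype.
combined = population.merge(supermarkets, on="zip", how="left")

# Zip codes missing from the right-hand table get a count of 0.
combined["supermarkets"] = combined["supermarkets"].fillna(0).astype(int)
print(combined)
```

A left join keeps every zip code from the first table, which matters when the second dataset has no rows for some zip codes.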

#### Chicago's Zip Codes  
  
The idea here is to get the boundary of each zip code in Chicago, IL, as well as its area in square miles.   
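
The query for this step divides the result of `ST_AREA` by 2589988.1103, because BigQuery reports the area of a geography in square meters. The same conversion can be sketched in Python as follows; the function name is hypothetical and only illustrates the arithmetic.

```python
# Number of square meters in one square mile (same constant as in the query).
SQ_METERS_PER_SQ_MILE = 2589988.1103


def sq_meters_to_sq_miles(area_sq_m: float, ndigits: int = 3) -> float:
    """Convert an area in square meters to square miles, rounded like the query."""
    return round(area_sq_m / SQ_METERS_PER_SQ_MILE, ndigits)


print(sq_meters_to_sq_miles(2589988.1103))
```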

In [12]:
chicago_zip_query = fidap.sql("""
SELECT zip_code, zip_code_geom, (ROUND(ST_AREA(zip_code_geom)/2589988.1103,3)) AS zip_area_sqm
FROM bigquery-public-data.geo_us_boundaries.zip_codes
WHERE zip_code LIKE '606%'
""")

#### Family  
  
We define family broadly as basic demographic information such as the breakdown of the population by race, age group, gender, and household information.

In [5]:
age_structure_query = fidap.sql("""
SELECT geo_id AS zip, total_pop,male_pop,female_pop, median_age, male_under_5,male_5_to_9,male_10_to_14, (male_15_to_17+male_18_to_19) AS male_15_to_19, (male_20+male_21+male_22_to_24) AS male_20_to_24 ,male_25_to_29,male_30_to_34,male_35_to_39,male_40_to_44,male_45_to_49,male_50_to_54,male_55_to_59,(male_60_to_61+male_62_to_64) AS male_60_to_64, (male_65_to_66+male_67_to_69+male_70_to_74+male_75_to_79+male_80_to_84+male_85_and_over) AS male_65_and_over,female_under_5,female_5_to_9,female_10_to_14,(female_15_to_17+female_18_to_19) AS female_15_to_19, (female_20 + female_21 + female_22_to_24) AS female_20_to_24,female_25_to_29,female_30_to_34,female_35_to_39,female_40_to_44,female_45_to_49,female_50_to_54,female_55_to_59,(female_60_to_61+female_62_to_64) AS female_60_to_64,(female_65_to_66+female_67_to_69+female_70_to_74+female_75_to_79+female_80_to_84+female_85_and_over) AS female_65_and_over 
FROM bigquery-public-data.census_bureau_acs.zcta5_2018_5yr
WHERE geo_id LIKE '606%';
""")

The `age_structure_query` looks at the demographic structure of each zip code. Each zip code is broken down into its total population, total male and female populations, median age, and the population in each 5-year age bracket by sex, with everyone aged 65 and over grouped into a single top bracket. 
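
For plotting something like a population pyramid, the wide bracket columns are easier to handle in long form. The sketch below assumes the result is a pandas DataFrame with the column names from the query; the numbers are made up and only a few brackets are shown.

```python
import pandas as pd

# Hypothetical slice of the age-structure result (illustrative values only).
age = pd.DataFrame({
    "zip": ["60601"],
    "male_under_5": [30],
    "female_under_5": [28],
    "male_5_to_9": [25],
    "female_5_to_9": [27],
})

# Wide to long: one row per zip code, sex, and age bracket.
long = age.melt(id_vars="zip", var_name="column", value_name="population")

# Split "male_under_5" into sex ("male") and bracket ("under_5").
long[["sex", "age_bracket"]] = long["column"].str.split("_", n=1, expand=True)
long = long.drop(columns="column")
print(long)
```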

In [14]:
family_structure_query = fidap.sql("""
SELECT geo_id AS zip, households, married_households, (households-married_households) AS unmarried_households,
    nonfamily_households, family_households, (family_households - married_households) AS family_unmarried_households, 
    households_public_asst_or_food_stamps
FROM bigquery-public-data.census_bureau_acs.zcta5_2018_5yr
WHERE geo_id LIKE '606%';
""")

The `family_structure_query` looks at the structure of households in each zip code: of the total number of households, how many are married and how many are unmarried.  
  
There is also another way to look at households, which is by family unit: the number of households that are family and non-family households. Married households are a subset of family households, so we can also derive the number of unmarried family households by subtracting the number of married households from the number of family households.  
  
The third dimension of household structure is a socio-economic one: the number of households that receive public assistance and/or food stamps. This can be used as a proxy for the prevalence of poverty. 
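
To compare zip codes of different sizes, the public-assistance count can be turned into a share of all households. This is a minimal sketch with made-up numbers; the `share_public_asst` column is a hypothetical name, and zip codes with zero households are assumed to be left as missing rather than divided by zero.

```python
import pandas as pd

# Hypothetical slice of the family-structure result (illustrative values only).
hh = pd.DataFrame({
    "zip": ["60601", "60602"],
    "households": [400, 0],
    "households_public_asst_or_food_stamps": [40, 0],
})

# Replace a zero denominator with NaN so the share is missing, not an error.
denominator = hh["households"].where(hh["households"] > 0)
hh["share_public_asst"] = hh["households_public_asst_or_food_stamps"] / denominator
print(hh)
```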

In [45]:
race_query = fidap.sql("""
SELECT geo_id AS zip, total_pop, black_pop, asian_pop, hispanic_pop, amerindian_pop, other_race_pop, white_pop
FROM bigquery-public-data.census_bureau_acs.zcta5_2018_5yr
WHERE geo_id LIKE '606%';
""")

Finally, we can also look at the concept of family through each zip code's racial breakdown, which we showcase here in `race_query`. 

#### Education  
  
Education can be defined in terms of educational attainment of the population.  
  
Counting the number of educational establishments within each zip code is another way to do this, but an establishment does not necessarily benefit the population around it, since residents might not make use of it. Not everyone who lives around UChicago enjoys the benefit of a UChicago education. Proximity matters more at other levels of education such as K-12, as most children attend schools near their place of residence; there, the local availability of educational opportunities does matter.    

In [18]:
educational_attainment_query = fidap.sql("""
SELECT geo_id AS zip, total_pop, pop_25_years_over,
    high_school_diploma, less_one_year_college, some_college_and_associates_degree, 
    associates_degree, bachelors_degree,
    masters_degree, graduate_professional_degree 
FROM bigquery-public-data.census_bureau_acs.zcta5_2018_5yr
WHERE geo_id LIKE '606%';
""")

The `educational_attainment_query` looks at the number of people in each zip code by their highest educational attainment. 
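
The counts can be expressed as shares to make zip codes comparable. The sketch below assumes that `pop_25_years_over` is the appropriate denominator for the attainment columns; the values are made up and only two attainment columns are shown.

```python
import pandas as pd

# Hypothetical slice of the educational-attainment result (illustrative values only).
edu = pd.DataFrame({
    "zip": ["60601", "60602"],
    "pop_25_years_over": [800, 400],
    "high_school_diploma": [200, 100],
    "bachelors_degree": [240, 40],
})

count_cols = ["high_school_diploma", "bachelors_degree"]

# Divide every count column by the population aged 25 and over, row by row.
shares = edu[count_cols].div(edu["pop_25_years_over"], axis=0).add_suffix("_share")
edu = pd.concat([edu, shares], axis=1)
print(edu)
```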

#### Retail Grocery Gap  
  
What we want to measure here is the availability of fresh food, and whether the distribution of supermarkets across zip codes is equitable. To this end, we first obtain a list of supermarkets in Chicago, IL, and then group them by zip code. 

In [19]:
supermarket_query = fidap.sql("""
WITH bounding_area AS (SELECT geometry FROM bigquery-public-data.geo_openstreetmap.planet_features_multipolygons
  WHERE ('name:en', 'Chicago') IN (SELECT(key, value) FROM UNNEST(all_tags))
  AND ('boundary', 'administrative') IN (SELECT(key, value) FROM UNNEST(all_tags))
  AND ('admin_level', '8') IN  (SELECT(key, value) FROM UNNEST(all_tags))
)
SELECT pt.geometry, tags.value AS tags, tags.key AS keys
FROM bigquery-public-data.geo_openstreetmap.planet_features_points AS pt, bounding_area
JOIN UNNEST(all_tags) AS tags
WHERE (tags.key = 'name' OR tags.key = 'addr:postcode')
AND ('shop', 'supermarket') IN (SELECT(key, value) FROM UNNEST(all_tags))
AND ST_WITHIN(pt.geometry, bounding_area.geometry)
""")

In [62]:
# pivoting the table  
supermarket_df = supermarket_query.pivot(index = 'geometry', columns = 'keys', values = 'tags').reset_index()
supermarket_df = supermarket_df.rename(columns = {'addr:postcode' : 'zip_code'})

# count the number of supermarkets by zip code (unnamed stores still count)
supermarket_df['name'] = supermarket_df['name'].fillna('Unknown')
supermarket_zip_df = supermarket_df.groupby('zip_code').agg('count').drop('geometry', axis = 1).reset_index()
# note: the merges below require zip_code to have the same dtype in both tables
supermarket_zip_df['zip_code'] = supermarket_zip_df['zip_code'].astype(int)

# left join the supermarket counts onto the full list of zip codes,
# so that zip codes without a supermarket are kept with a count of 0
supermarket_zip_gdf = chicago_zip_query.merge(supermarket_zip_df, how = 'left', on = 'zip_code')
supermarket_zip_gdf['name'] = supermarket_zip_gdf['name'].fillna(0)
supermarket_zip_gdf = supermarket_zip_gdf.rename(columns = {'name' : 'count'})

# look at the availability of supermarkets per 10,000 residents
pop_zip_code = race_query.loc[:, ['zip', 'total_pop']]
pop_zip_code = pop_zip_code.assign(
    pop_10ks = pop_zip_code.total_pop/10000
)
supermarket_zip_gdf = supermarket_zip_gdf.merge(pop_zip_code, left_on = 'zip_code', right_on = 'zip')
supermarket_zip_gdf = supermarket_zip_gdf.assign(
    per_capita_supermarket = lambda x: x['count']/x['pop_10ks']
)

With this query, we can identify the zip codes that do not have a supermarket and, for the zip codes that do, count the number of supermarkets per 10,000 inhabitants. 

#### Crime Levels  
  
To account for the impact of crime, we pull data from Chicago's crime database for the year 2018, because our ACS data dates from then.   
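
Because crime records are point locations rather than per-zip rows, they have to be assigned to a zip code before they can be joined to the other tables. In practice a spatial join (for example `geopandas.sjoin`) on the zip code boundaries would do this. The dependency-free sketch below only illustrates the idea with a planar ray-casting test; the square "zip codes" and the points are hypothetical, and real longitude/latitude data would need proper geodesic handling.

```python
from collections import Counter


def point_in_polygon(x, y, ring):
    """Ray-casting test; ring is a list of (x, y) vertices of a simple polygon."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical zip code boundaries as unit squares (not real geometries).
zip_polygons = {
    "zip_A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "zip_B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
# Hypothetical crime locations as (x, y); the last one lies outside every polygon.
crime_points = [(0.5, 0.5), (0.2, 0.9), (1.5, 0.5), (3.0, 3.0)]

counts = Counter()
for x, y in crime_points:
    for zip_code, ring in zip_polygons.items():
        if point_in_polygon(x, y, ring):
            counts[zip_code] += 1
            break  # each point is assigned to at most one zip code

print(dict(counts))
```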

In [17]:
crime_query = fidap.sql("""
SELECT case_number, primary_type, description, 
    ST_GEOGPOINT(longitude, latitude) AS geom_location, STRING(date) AS updated_date
FROM bigquery-public-data.chicago_crime.crime
WHERE year = 2018
AND location IS NOT NULL;
""")

#### Personal Infrastructure  
  
Our definition of personal infrastructure refers to the quality of the housing stock, as well as public transit availability.   

In [43]:
housing_stock_query = fidap.sql("""
SELECT acs.geo_id AS zip_code, acs.median_year_structure_built, acs.percent_income_spent_on_rent, (acs.total_pop/acs.housing_units) AS housing_density
FROM bigquery-public-data.census_bureau_acs.zcta5_2018_5yr AS acs
WHERE acs.geo_id LIKE '606%'
""")

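Here, we approximate the quality of the buildings in a zip code by the median year in which its structures were built, alongside the share of income spent on rent and the number of residents per housing unit (`housing_density`). 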
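
As a minimal sketch, the median build year can be converted into a median building age relative to the 2018 reference year, and the density can be recomputed from its inputs. The values and the derived column names are made up for illustration.

```python
import pandas as pd

# Hypothetical slice of ACS housing data (illustrative values only).
acs = pd.DataFrame({
    "zip_code": ["60601", "60602"],
    "median_year_structure_built": [1990, 1950],
    "total_pop": [1000, 600],
    "housing_units": [500, 200],
})

REFERENCE_YEAR = 2018  # the ACS vintage used throughout this document

# Older housing stock gives a larger median building age.
acs["median_building_age"] = REFERENCE_YEAR - acs["median_year_structure_built"]

# Residents per housing unit, i.e. the housing_density expression in the query.
acs["persons_per_housing_unit"] = acs["total_pop"] / acs["housing_units"]
print(acs)
```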

In [32]:
personal_transportation_query = fidap.sql("""
SELECT geometry, tags.value AS stop_name
FROM bigquery-public-data.geo_openstreetmap.planet_features_points
JOIN UNNEST(all_tags) AS tags
WHERE tags.key = 'name'
AND ('operator', 'Chicago Transit Authority') IN (SELECT(key, value) FROM UNNEST(all_tags))
""")

Alternatively, we can also look at the location of public transit stops.

#### Climate and Environment  
  
In terms of the climate and the environment, we can look at it from the perspective of air quality (PM2.5) obtained from the EPA. 
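
Daily readings of this kind are usually summarized before being compared with annual, zip-level data. The sketch below assumes a pandas DataFrame with the `obs_date` and `pm25_value` columns requested by the air quality query, and assumes `obs_date` is an ISO-formatted date string; the readings are made up.

```python
import pandas as pd

# Hypothetical daily PM2.5 readings (illustrative values only).
pm = pd.DataFrame({
    "obs_date": ["2018-01-01", "2018-01-02", "2018-02-01"],
    "pm25_value": [10.0, 14.0, 6.0],
})

# Parse the date strings so they can be grouped by calendar month.
pm["obs_date"] = pd.to_datetime(pm["obs_date"])

# Mean reading per month (index is the month number) and for the whole year.
monthly = pm.groupby(pm["obs_date"].dt.month)["pm25_value"].mean()
annual_mean = pm["pm25_value"].mean()

print(monthly)
print(annual_mean)
```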

In [40]:
air_quality_query = fidap.sql("""
SELECT parameter_name, arithmetic_mean AS pm25_value, sample_duration, 
  STRING(date_local) AS obs_date,
  ST_GEOGPOINT(longitude, latitude) AS measuring_stn_geom
FROM bigquery-public-data.epa_historical_air_quality.pm25_nonfrm_daily_summary
WHERE state_name = 'Illinois' 
AND city_name = 'Chicago'
AND STRING(date_local) LIKE '2018%';
""")

In [7]:
severe_storms_query = fidap.sql("""
SELECT event_type, event_id, event_begin_time, event_end_time, damage_property, deaths_direct, injuries_direct, deaths_indirect, injuries_indirect, ST_GEOGPOINT(event_longitude, event_latitude) AS event_geom, event_range
FROM bigquery-public-data.noaa_historic_severe_storms.storms_2018
WHERE state_fips_code = '17'
AND cz_fips_code = '31'
""")