### Census 2020  
  
In this notebook, I want to look at the ways in which Fidap's datasets can complement Census 2020 data.  
  
1) College towns and cities, and how they differ from the places around them  
2) Counties with the highest case counts and what they are like  
3) Diversity: characteristics of the largest metropolitan areas in the US  

In [1]:
import pandas as pd
import datetime
import numpy as np
import altair as alt
import fidap
import config

# instantiate connection to fidap
fidap = fidap.fidap_client(api_key = config.api_key)

### Demographic Profile of Covid-19 Afflicted Counties  
  
The power of Fidap lies in its ability to perform analyses across different datasets. I want to see how that might work out here by querying for the top 10 counties that are most severely affected by Covid-19 right now, measured by the 7-day moving average of new cases per 100,000 residents since the start of August.  
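
To make the metric concrete, here is a minimal pandas sketch of a 7-day moving average of new cases per 100,000 residents. The notebook itself computes this in BigQuery SQL; the function name `moving_average_per_100k`, the cumulative counts, and the population of 50,000 are made up purely for illustration.

```python
import pandas as pd


def moving_average_per_100k(cumulative_cases, population, window=7):
    """Rolling mean of daily new cases, scaled to cases per 100,000 residents."""
    # Daily new cases are the day-over-day differences of the cumulative count.
    new_cases = cumulative_cases.diff()
    # A full window of `window` daily values is required, otherwise the result is NaN.
    rolling_mean = new_cases.rolling(window=window).mean()
    return (100_000 * rolling_mean / population).round(2)


# Made-up cumulative counts for a hypothetical county of 50,000 residents.
dates = pd.date_range("2021-08-01", periods=10, freq="D")
cumulative = pd.Series([100, 110, 125, 135, 150, 170, 185, 205, 220, 240], index=dates)

ma7 = moving_average_per_100k(cumulative, population=50_000)
print(ma7)
```

The first seven values are missing because the first difference is undefined and the window needs seven daily values. In SQL window terms, a seven-row frame that includes the current row is written `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW`.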

In [18]:
county_moving_average = fidap.sql("""
WITH covid_cases AS (
SELECT *, 
ROW_NUMBER() OVER (PARTITION BY county_fips_code ORDER BY date ASC) AS row_num,
(confirmed_cases - (LAG(confirmed_cases, 1) OVER (PARTITION BY county_fips_code ORDER BY date ASC))) AS new_cases,
(deaths - (LAG(deaths, 1) OVER (PARTITION BY county_fips_code ORDER BY date ASC))) AS new_deaths,
FROM bigquery-public-data.covid19_nyt.us_counties
WHERE date >= CAST('2021-08-01' AS DATE) 
AND confirmed_cases >= 0 
AND deaths >= 0), 

census_join AS (
SELECT CAST(c.date AS STRING) AS nyt_date, c.state_name, c.county, c.confirmed_cases, c.deaths, c.new_cases, c.new_deaths, c.row_num, c.county_fips_code,
ROUND(100000*(AVG(c.new_cases) OVER (PARTITION BY c.county_fips_code ORDER BY c.date ASC ROWS 7 PRECEDING))/CAST(census.POP100 AS INT64), 2) AS new_cases_ma7,
ROUND(100000*(AVG(c.new_deaths) OVER (PARTITION BY c.county_fips_code ORDER BY c.date ASC ROWS 7 PRECEDING))/CAST(census.POP100 AS INT64), 2) AS new_deaths_ma7,
ROUND(100*CAST(P0020002 AS INT64)/CAST(POP100 AS INT64),2) AS Hispanic_Latinx,
ROUND(100*CAST(P0020005 AS INT64)/CAST(POP100 AS INT64),2) AS White,
ROUND(100*CAST(P0020006 AS INT64)/CAST(POP100 AS INT64),2) AS Black,
ROUND(100*CAST(P0020007 AS INT64)/CAST(POP100 AS INT64),2) AS Indigenous,
ROUND(100*(CAST(P0020008 AS INT64)+CAST(P0020009 AS INT64))/CAST(POP100 AS INT64),2) AS Asian_PacificIslander
FROM covid_cases AS c
INNER JOIN fidap-301014.us_census_2020.Redistricting_Data_Complete AS census 
ON census.GEOCODE = c.county_fips_code
WHERE CHAR_LENGTH(census.GEOID) = 14)

SELECT nyt_date, state_name, county, county_fips_code, confirmed_cases, deaths, new_cases, new_deaths, 
IF(row_num < 7, NULL, new_cases_ma7) new_cases_ma7,
IF(row_num < 7, NULL, new_deaths_ma7) new_deaths_ma7,
Hispanic_Latinx, White, Black, Indigenous, Asian_PacificIslander, 
100-(Hispanic_Latinx + White + Black + Indigenous + Asian_PacificIslander) AS Others_Mixed
FROM census_join
WHERE new_deaths_ma7 IS NOT NULL
ORDER BY nyt_date DESC, new_cases_ma7 DESC
LIMIT 10 ;
""")

county_moving_average = county_moving_average.assign(
    county_name = lambda x: x.county + ", " + x.state_name
)
county_moving_average_long = pd.melt(county_moving_average, id_vars = ['county_name'], 
                                     value_vars = ['Hispanic_Latinx', 'White', 'Black', 'Indigenous', 'Asian_PacificIslander', 'Others_Mixed'])


In [19]:
national_average_racial_breakdown = fidap.sql("""
SELECT
ROUND(100*SUM(CAST(P0020002 AS INT64))/SUM(CAST(POP100 AS INT64)),2) AS Hispanic_Latinx,
ROUND(100*SUM(CAST(P0020005 AS INT64))/SUM(CAST(POP100 AS INT64)),2) AS White,
ROUND(100*SUM(CAST(P0020006 AS INT64))/SUM(CAST(POP100 AS INT64)),2) AS Black,
ROUND(100*SUM(CAST(P0020007 AS INT64))/SUM(CAST(POP100 AS INT64)),2) AS Indigenous,
ROUND(100*SUM(CAST(P0020008 AS INT64)+CAST(P0020009 AS INT64))/SUM(CAST(POP100 AS INT64)),2) AS Asian_PacificIslander
FROM fidap-301014.us_census_2020.Redistricting_Data_Complete 
WHERE CHAR_LENGTH(GEOID) = 14
AND COUNTY IS NOT NULL;
""")
national_average_racial_breakdown = national_average_racial_breakdown.assign(
    Others_Mixed = lambda x: 100-(x.Hispanic_Latinx + x.White + x.Black + x.Indigenous + x.Asian_PacificIslander),
    county_name = "National Average"
) 


national_average_racial_breakdown_long = pd.melt(national_average_racial_breakdown, 
                                                 id_vars = ['county_name'], 
                                                 value_vars = ['Hispanic_Latinx', 'White', 'Black', 'Indigenous', 'Asian_PacificIslander', 'Others_Mixed'])

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
county_moving_average_long = pd.concat([county_moving_average_long, national_average_racial_breakdown_long], ignore_index = True)

In [20]:
print("10 highest county-level 7 day moving average count of new cases per 100,000 residents")
county_moving_average[['state_name', 'county', 'new_cases_ma7']]

10 highest county-level 7 day moving average count of new cases per 100,000 residents


| | state_name | county | new_cases_ma7 |
|---|---|---|---|
| 0 | Florida | Miami-Dade | 2807.94 |
| 1 | Hawaii | Honolulu | 1710.59 |
| 2 | Florida | Palm Beach | 1294.46 |
| 3 | Florida | Orange | 1291.2 |
| 4 | Nevada | Clark | 1219.91 |
| 5 | Florida | Hillsborough | 1200.25 |
| 6 | Alabama | Jefferson | 1117.69 |
| 7 | Alabama | Mobile | 1056.77 |
| 8 | Mississippi | Harrison | 1038.09 |
| 9 | South Carolina | Charleston | 1021.04 |


In [33]:
alt.Chart(county_moving_average_long).mark_bar(
    cornerRadiusTopLeft = 3,
    cornerRadiusTopRight = 3
).encode(
    x = alt.X('county_name:O', title = "County", axis = alt.Axis(labelAngle = -70)),
    y = alt.Y('value:Q', title = "Racial Makeup (%)"),
    color = 'variable:N' 
).properties(width = 500, height = 300)

Almost all of the counties with the highest caseloads in recent days are in the South. Of the 10 counties, 7 have a greater share of racial and ethnic minorities in their population than the national average. This is particularly pronounced in Florida and Alabama. 
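
The comparison against the national average can also be done in code rather than by reading the chart. The sketch below assumes that "minority" means everyone outside the `White` column, which is my reading rather than a definition given by the Census. The helper name `above_national_minority_share` and all the numbers are placeholders, not Census figures.

```python
import pandas as pd


def above_national_minority_share(counties, national_white_pct):
    """Flag counties whose non-White share exceeds the national non-White share."""
    out = counties.copy()
    # Assumption: minority share is everything outside the White column.
    out["minority_pct"] = 100 - out["White"]
    out["above_national"] = out["minority_pct"] > (100 - national_white_pct)
    return out


# Placeholder values for illustration only.
toy = pd.DataFrame({
    "county_name": ["County A", "County B", "County C"],
    "White": [30.0, 75.0, 55.0],
})

flagged = above_national_minority_share(toy, national_white_pct=60.0)
n_above = int(flagged["above_national"].sum())
print(n_above, "of", len(flagged), "counties are above the national minority share")
```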

### Life in Extremis  
  
The Census 2020 Redistricting Data does not include age data; it only provides racial and ethnic breakdowns. Without age data, can we still identify with some confidence where college towns are? At the other end of the spectrum, can we also identify retirement villages? 

How do we even know if a place is a college town? What is a college town?   

In [56]:
college_town = fidap.sql("""
WITH college_town AS (
SELECT CAST(POP100 AS INT64) AS total, CAST(P0050008 AS INT64) AS student_housing, GEOCODE AS county_fips_code, c.county_name, STUSAB as state,
ROUND(100*SAFE_DIVIDE(CAST(P0050008 AS INT64),CAST(POP100 AS INT64)), 2) AS pct_student_housing,
c.county_geom
FROM fidap-301014.us_census_2020.Redistricting_Data_Complete
INNER JOIN bigquery-public-data.geo_us_boundaries.counties AS c
ON c.county_fips_code = GEOCODE
WHERE CHAR_LENGTH(GEOID) = 14
AND COUNTY IS NOT NULL
ORDER BY pct_student_housing DESC
LIMIT 50
)
SELECT state, county_name, county_fips_code, total, student_housing, pct_student_housing, tags.value AS tags
FROM bigquery-public-data.geo_openstreetmap.planet_features_multipolygons AS pt, college_town
LEFT JOIN UNNEST(all_tags) AS tags
WHERE (tags.key = 'name')
AND ST_WITHIN(pt.geometry, college_town.county_geom)
AND ('amenity', 'university') IN (SELECT(key, value) FROM UNNEST(all_tags));
""")

# will time out if we combine the queries into one
community_college_town = fidap.sql("""
WITH college_town AS (
SELECT CAST(POP100 AS INT64) AS total, CAST(P0050008 AS INT64) AS student_housing, GEOCODE AS county_fips_code, c.county_name, STUSAB as state,
ROUND(100*SAFE_DIVIDE(CAST(P0050008 AS INT64),CAST(POP100 AS INT64)), 2) AS pct_student_housing,
c.county_geom
FROM fidap-301014.us_census_2020.Redistricting_Data_Complete
INNER JOIN bigquery-public-data.geo_us_boundaries.counties AS c
ON c.county_fips_code = GEOCODE
WHERE CHAR_LENGTH(GEOID) = 14
AND COUNTY IS NOT NULL
ORDER BY pct_student_housing DESC
LIMIT 50
)
SELECT state, county_name, county_fips_code, total, student_housing, pct_student_housing, tags.value AS tags
FROM bigquery-public-data.geo_openstreetmap.planet_features_multipolygons AS pt, college_town
LEFT JOIN UNNEST(all_tags) AS tags
WHERE (tags.key = 'name')
AND ST_WITHIN(pt.geometry, college_town.county_geom)
AND ('amenity', 'college') IN (SELECT(key, value) FROM UNNEST(all_tags));
""")

In [80]:
top10_college_towns = college_town.sort_values("pct_student_housing", ascending = False).reset_index(drop=True).iloc[:10,]
top10_college_towns = top10_college_towns.assign(
    full_name = lambda x: x.county_name + ", " + x.state
)

alt.Chart(top10_college_towns).mark_bar().encode(
    x = alt.X('full_name', title = "County", axis = alt.Axis(labelAngle = -70)),
    y = alt.Y('pct_student_housing', title = "Pop. in Student Housing (%)")
).properties(width = 400)

So, we have identified counties where a high proportion of the population lives in student housing. Generally, we are looking at counties where this percentage is above 7.5%. 

In [57]:
# this is not perfect but better than nothing 
# counting the number of higher education facilities within such towns
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
college_town = pd.concat([college_town, community_college_town]).drop_duplicates()
college_town = college_town.groupby(['state', 'county_name', 'county_fips_code', 'total', 'student_housing', 'pct_student_housing']).agg('count')
college_town = college_town.reset_index()

In [63]:
alt.Chart(college_town).mark_bar().encode(
    x = alt.X("tags:Q", bin = True, title = "No. of Colleges"),
    y = alt.Y("count()", title = "No. of Counties")
)

Secondly, we also want to look at the number of colleges, both traditional four-year universities and community colleges, present within each county. A good number of counties have only one college, and that is fine. Then there are counties with a rather high number of colleges, as many as 7! The truth is that some of these "colleges" might actually be part of the same university or college network. They are counted as distinct entities because they are logged as such on OSM.  
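
To see why parts of the same institution can inflate the count, here is a small sketch of the same drop-duplicates-then-count step on made-up rows. The county and institution names are hypothetical; only the `county_name` and `tags` column names mirror the query result.

```python
import pandas as pd

# Hypothetical rows shaped like the query result: one row per OSM feature.
features = pd.DataFrame({
    "county_name": ["County A", "County A", "County A", "County A", "County B"],
    "tags": [
        "Example University",
        "Example University",             # exact duplicate, removed below
        "Example University Law School",  # same institution, different name
        "Example Community College",
        "Sample College",
    ],
})

# Exact duplicates collapse, but differently named parts of one
# institution still count as separate colleges.
counts = features.drop_duplicates().groupby("county_name")["tags"].count()
print(counts)
```

County A ends up with three "colleges" even though two of the names belong to the same university, which is the over-counting described above.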
  
However, this approach breaks down a little when we look at retirement homes. We can pick out counties with a high proportion of their residents living in assisted living and nursing facilities. However, we cannot count the number of nursing homes because people generally do not digitize and georeference nursing homes, much less put them up on OSM.

In [39]:
nursing_home_town = fidap.sql("""
SELECT CAST(POP100 AS INT64) AS total, CAST(P0050005 AS INT64) AS nursing_homes, GEOCODE AS county_fips_code, c.county_name, STUSAB as state,
ROUND(100*SAFE_DIVIDE(CAST(P0050005 AS INT64),CAST(POP100 AS INT64)), 2) AS pct_nursing_homes
FROM fidap-301014.us_census_2020.Redistricting_Data_Complete
INNER JOIN bigquery-public-data.geo_us_boundaries.counties AS c
ON c.county_fips_code = GEOCODE
WHERE CHAR_LENGTH(GEOID) = 14
AND COUNTY IS NOT NULL
ORDER BY pct_nursing_homes DESC
LIMIT 50;
""")

### Urban Diversity  
  
How diverse are America's cities? Which are some of the most diverse cities in America? But how exactly do we measure diversity? Within a company, HR can compute Diversity, Equity, and Inclusion statistics. But how do we do so for an entire human population?   
  
If we take the textbook answer from biology, then we know that we are just concerned about the number and relative abundance of different groups found within any organisational unit.   
  
Okay. So moving forward, how can we measure this? Taking a leaf out of biology, we can calculate Simpson's Diversity Index, which can be expressed as:  
  
$$
D = 1 - \frac{\sum n(n-1)}{N(N-1)}
$$  
  
where n is the number of organisms of a particular species, and N is the total number of organisms of all species. The value of D ranges from 0 (no diversity) to 1 (infinite diversity).  
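
As a quick sanity check of the formula, here is a minimal Python sketch. The function name `simpson_diversity` and the group counts are illustrative only. Returning `None` when there are fewer than two individuals is my choice, so that the denominator is never zero.

```python
def simpson_diversity(counts):
    """Simpson's Diversity Index: D = 1 - sum(n(n-1)) / (N(N-1))."""
    total = sum(counts)
    if total < 2:
        # N(N-1) would be zero, so the index is undefined.
        return None
    numerator = sum(n * (n - 1) for n in counts)
    return 1 - numerator / (total * (total - 1))


# Illustrative group counts, not Census figures.
single_group = simpson_diversity([100, 0])            # everyone in one group
two_equal_groups = simpson_diversity([50, 50])        # two groups of equal size
four_equal_groups = simpson_diversity([25, 25, 25, 25])

print(single_group, two_equal_groups, four_equal_groups)
```

With a fixed number of groups k and a large population, the score tops out near 1 - 1/k when all groups are the same size. For the eight categories used in this notebook, the highest score would therefore be about 0.875 rather than 1.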
  
With that in mind, we are actually able to calculate the diversity score for each of America's biggest urban counties.  

In [121]:
# counting from largest MSAs
urban_diversity = fidap.sql("""
WITH basic_count AS (
SELECT DISTINCT CBSA, COUNTY, NAME, STUSAB AS STATE, CAST(POP100 AS INT64) AS total, 
CAST(P0020002 AS INT64) AS latinx, 
CAST(P0020005 AS INT64) AS white,
CAST(P0020006 AS INT64) AS black,
CAST(P0020007 AS INT64) AS indigenous,
CAST(P0020008 AS INT64) AS asian,
CAST(P0020009 AS INT64) AS pacific_islander,
CAST(P0020010 AS INT64) AS others,
CAST(P0020011 AS INT64) AS mixed
FROM fidap-301014.us_census_2020.Redistricting_Data_Complete
WHERE CBSA IS NOT NULL
AND MEMI = '1'
AND CHAR_LENGTH(GEOCODE) = 5
),

diversity AS (
SELECT *,
1 - SAFE_DIVIDE(((latinx*(latinx - 1)) + (white*(white - 1)) + (black*(black - 1)) + (indigenous*(indigenous - 1)) + (asian*(asian - 1)) + (pacific_islander*(pacific_islander - 1)) + (others*(others - 1)) + (mixed*(mixed - 1))),(total*(total-1))) AS diversity_score
FROM basic_count
ORDER BY total DESC
LIMIT 20)

SELECT CBSA, COUNTY, diversity.NAME, STATE, c.lsad_name, diversity_score,
total, latinx, white, black, indigenous, asian, pacific_islander, others, mixed
FROM diversity
INNER JOIN bigquery-public-data.geo_us_boundaries.cbsa AS c
ON c.geo_id = diversity.CBSA;
""")