# EPA-Justice QC

This notebook runs an independent validation of the results produced in the `fetch_data_and_export.ipynb` notebook. The goals of the QC are to:


- **Validate the API:** The functions in the `fetch_data_and_export.ipynb` notebook retrieve census data through APIs. Here we instead load the same tables from downloaded CSVs and compare them against the results in `data_to_export.csv`. Agreement between the two indicates that the data gathered from the API matches the raw tabular data.

- **Check the math:** We will do an independent aggregation of the original data to check our math, and keep those original data values visible for inspection.

- **Ensure reasonable values:** Several QC checks confirm that values fall within expected ranges and assess missing data.


Note: The 2020 US Census data CSVs (DHC tables P12 and P9, and ACS 5-year tables S1810 and S2701) were downloaded from the [data.census.gov](https://data.census.gov) online application after filtering for all counties, places, and tracts in Alaska. The CDC PLACES and SDOH data CSVs for counties, places, and tracts in Alaska were downloaded from the CDC [data portal](https://data.cdc.gov/browse?category=500+Cities+%26+Places).

## Data Prep

In [161]:
import pandas as pd
import random
from utilities.luts import *

Load places list and previously generated results.

In [162]:
places = pd.read_csv('tbl/NCRPlaces_Census_04192024.csv')
places.drop(columns=['alt_name', 'region', 'country', 'latitude', 'longitude', 'type'], inplace=True)
results = pd.read_csv('tbl/data_to_export.csv')

Choose 20 random places for QC, and get corresponding results.

In [163]:
# draw 20 random place ids, then keep the matching rows from both tables
qc_ids = random.sample(places['id'].to_list(), 20)
qc_places = places[places['id'].isin(qc_ids)].copy()
qc_results = results[results['id'].isin(qc_places['id'])].copy()
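
The sample above changes on every run. If a QC run ever needs to be repeated on the same places, one option is a seeded sample. The sketch below uses `DataFrame.sample` with a fixed `random_state` on a hypothetical `toy_places` table; the seed value 42 is arbitrary.

```python
import pandas as pd

# Hypothetical places table; only the 'id' column matters for this sketch.
toy_places = pd.DataFrame({"id": [f"AK{i}" for i in range(1, 101)]})

# A fixed random_state returns the same 20 rows on every run.
sample_a = toy_places.sample(n=20, random_state=42)
sample_b = toy_places.sample(n=20, random_state=42)

print(sample_a["id"].equals(sample_b["id"]))  # True
```

Leaving the seed out restores the current behavior of drawing a fresh set of places each time.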

Load Census and CDC data from CSVs. Merge them into two main tables, one for Census and one for CDC, and keep only the columns listed in the lookup tables. Some reformatting is required before the tables can be merged.

In [164]:
# census data
dhc_p12 = pd.read_csv('qc_data/DECENNIALDHC2020.P12-Data.csv')
dhc_p9 = pd.read_csv('qc_data/DECENNIALDHC2020.P9-Data.csv')
acs5 = pd.read_csv('qc_data/ACSST5Y2020.S1810-Data.csv')

# cdc data
places_county = pd.read_csv('qc_data/PLACES__Local_Data_for_Better_Health__County_Data_2023_release_20240506.csv')
places_place = pd.read_csv('qc_data/PLACES__Local_Data_for_Better_Health__Place_Data_2023_release_20240506.csv')
places_tract = pd.read_csv('qc_data/PLACES__Local_Data_for_Better_Health__Census_Tract_Data_2023_release_20240506.csv')

sdoh_county = pd.read_csv('qc_data/SDOH_Measures_for_County__ACS_2017-2021_20240506.csv')
sdoh_place = pd.read_csv('qc_data/SDOH_Measures_for_Place__ACS_2017-2021_20240506.csv')
sdoh_tract = pd.read_csv('qc_data/SDOH_Measures_for_Census_Tract__ACS_2017-2021_20240506.csv')

# standardize columns
for df in [sdoh_county, sdoh_place, sdoh_tract]:
    df.rename(columns={'MeasureID':'MeasureId'}, inplace=True)


In [165]:
# get var lists
census_vars = list(var_dict["dhc"]["vars"].keys()) + (list(var_dict["acs5"]["vars"].keys()))
cdc_vars = list(var_dict["cdc"]["PLACES"]["vars"].keys()) + (list(var_dict["cdc"]["SDOH"]["vars"].keys()))

In [166]:
# merge census data, dropping unused cols
census_data = dhc_p12.merge(dhc_p9, how='left', on=['GEO_ID','NAME']).merge(acs5, how='left', on=['GEO_ID','NAME'])
keep_cols = [col for col in census_data.columns if col in ['GEO_ID', 'NAME'] or col in census_vars]
census_data = census_data[keep_cols]
# option: keep the descriptive labels in the first row as a second column level
#census_data.columns = pd.MultiIndex.from_arrays([census_data.columns, census_data.iloc[0].values])
# drop the first row, which holds descriptive labels rather than data
census_data = census_data.iloc[1:].copy()
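
The census CSVs appear to carry a row of descriptive labels beneath the variable codes, which is why the first row is dropped after the merge. An alternative, sketched below under that assumption, is to skip the label row at read time so that numeric columns are parsed as numbers rather than strings. The miniature CSV is made up (the label text is a placeholder); only the two county rows echo values shown in this notebook.

```python
import io
import pandas as pd

# Made-up miniature of a data.census.gov export: line 0 holds variable codes,
# line 1 holds human-readable labels, and the data starts on line 2.
raw_csv = io.StringIO(
    "GEO_ID,NAME,P12_001N\n"
    "Geography,Geographic Area Name,Total\n"
    '0500000US02013,"Aleutians East Borough, Alaska",3420\n'
    '0500000US02016,"Aleutians West Census Area, Alaska",5232\n'
)

# skiprows=[1] skips only the label row (0-based line index 1).
toy_census = pd.read_csv(raw_csv, skiprows=[1])

print(toy_census["P12_001N"].tolist())  # [3420, 5232]
print(toy_census["P12_001N"].dtype)
```

With the label row left in, pandas would read `P12_001N` as text, and any later arithmetic would need an explicit numeric conversion.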

In [167]:
# reformat cdc data, dropping unused cols and keeping only crude prevalence
# pivot and drop unused measures (vars), and merge
cdc_dfs_reformatted = []

cdc_dfs = [places_county, places_place, places_tract, sdoh_county, sdoh_place, sdoh_tract]
cdc_cols = ['LocationID', 'LocationName', 'DataValueTypeID', 'Measure', 'MeasureId', 'Data_Value']

for df in cdc_dfs:
    for col in df.columns:
        if col not in cdc_cols:
            df.drop(columns=col, inplace=True)
    df.drop(df.loc[df['DataValueTypeID'] == 'AgeAdjPrv'].index, inplace=True)
    df.drop(columns='DataValueTypeID', inplace=True)

    df_pivot = df.pivot(index=['LocationID', 'LocationName'], columns=['MeasureId', 'Measure'], values='Data_Value').reset_index()
    df_pivot = df_pivot.droplevel(level=1, axis=1)

    for col in df_pivot.columns:
        if col not in cdc_cols and col not in cdc_vars:
            df_pivot.drop(columns=col, inplace=True)
            
    cdc_dfs_reformatted.append(df_pivot)

cdc_data = pd.concat(cdc_dfs_reformatted[0:3]).merge(pd.concat(cdc_dfs_reformatted[3:]), how='left', on='LocationID')
cdc_data.columns.name = None
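
The reshaping step is easier to follow on a small example. The sketch below pivots a toy long-format table (one row per location and measure) into one row per location. It uses a single `MeasureId` column level for brevity, whereas the cell above pivots on both `MeasureId` and `Measure` and then drops the second level. The four values are borrowed from this notebook's own county output.

```python
import pandas as pd

# Toy long-format table in the shape of a CDC PLACES download.
toy_long = pd.DataFrame({
    "LocationID": [2013, 2013, 2016, 2016],
    "LocationName": ["Aleutians East", "Aleutians East",
                     "Aleutians West", "Aleutians West"],
    "MeasureId": ["STROKE", "DIABETES", "STROKE", "DIABETES"],
    "Data_Value": [3.3, 13.1, 2.5, 10.8],
})

# One row per location, one column per measure.
toy_wide = (
    toy_long
    .pivot(index=["LocationID", "LocationName"], columns="MeasureId", values="Data_Value")
    .reset_index()
)
toy_wide.columns.name = None

print(list(toy_wide.columns))  # ['LocationID', 'LocationName', 'DIABETES', 'STROKE']
```

Note that `pivot` raises an error if any location and measure pair appears more than once, which is one reason the age-adjusted rows are dropped first.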

Finally, we have some merged data that will be easier to QC.

In [168]:
census_data.head()

|   | GEO_ID | NAME | P12_001N | P12_002N | P12_003N | P12_004N | P12_005N | P12_006N | P12_020N | P12_021N | ... | P9_002N | P9_005N | P9_006N | P9_007N | P9_008N | P9_009N | P9_010N | P9_011N | S1810_C03_001E | S1810_C03_001M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0500000US02013 | Aleutians East Borough, Alaska | 3420 | 2371 | 38 | 33 | 51 | 38 | 39 | 37 | ... | 674 | 658 | 297 | 797 | 771 | 32 | 11 | 180 | 10.7 | 2.2 |
| 2 | 0500000US02016 | Aleutians West Census Area, Alaska | 5232 | 3432 | 76 | 106 | 87 | 57 | 61 | 45 | ... | 668 | 1585 | 257 | 709 | 1502 | 238 | 10 | 263 | 7.9 | 2.7 |
| 3 | 0500000US02020 | Anchorage Municipality, Alaska | 291247 | 147894 | 9746 | 9879 | 9964 | 5657 | 3232 | 3852 | ... | 26438 | 158232 | 13777 | 22480 | 27281 | 9844 | 1922 | 31273 | 11.2 | 0.6 |
| 4 | 0500000US02050 | Bethel Census Area, Alaska | 18666 | 9747 | 931 | 977 | 881 | 567 | 168 | 209 | ... | 207 | 1628 | 84 | 15580 | 201 | 2 | 19 | 945 | 11.4 | 1.0 |
| 5 | 0500000US02060 | Bristol Bay Borough, Alaska | 844 | 440 | 23 | 37 | 18 | 16 | 13 | 18 | ... | 45 | 357 | 6 | 296 | 5 | 3 | 2 | 130 | 15.9 | 4.1 |


In [169]:
cdc_data.head()

|   | LocationID | LocationName_x | STROKE | DIABETES | KIDNEY | CHD | CASTHMA | COPD | LocationName_y | REMNRTY | NOHSDP | BROAD | POV150 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | Aleutians East | 3.3 | 13.1 | 3.0 | 5.7 | 7.8 | 5.4 | Aleutians East Borough | 87.2 | 15.3 | 42.5 | 22.7 |
| 1 | 2016 | Aleutians West | 2.5 | 10.8 | 2.4 | 4.5 | 7.4 | 4.4 | Aleutians West Census Area | 77.1 | 9.0 | 23.0 | 11.3 |
| 2 | 2020 | Anchorage | 2.6 | 7.9 | 2.6 | 4.7 | 9.4 | 5.3 | Anchorage Municipality | 43.9 | 5.8 | 7.3 | 15.1 |
| 3 | 2050 | Bethel | 4.8 | 14.8 | 3.9 | 7.9 | 12.8 | 10.4 | Bethel Census Area | 90.8 | 18.0 | 25.2 | 43.9 |
| 4 | 2060 | Bristol Bay | 3.6 | 10.7 | 3.4 | 6.9 | 10.0 | 6.9 | Bristol Bay Borough | 58.8 | 5.3 | 23.5 | 8.0 |


We need to standardize the GEOIDs in order to merge all the tables together. We will drop the "US" and everything before it in all census-based GEOID columns, and add a leading zero to the CDC location ID column.

In [170]:
qc_places['GEOIDFQ'] = qc_places['GEOIDFQ'].str.split("US").str[1]
census_data['GEO_ID'] = census_data['GEO_ID'].str.split("US").str[1]
cdc_data['LocationID'] = "0" + cdc_data['LocationID'].astype(str)
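
To see why these two transformations produce matching keys, here is a small sketch on illustrative identifiers: one county and one place. The fully qualified census strings are examples of the expected format rather than values read from the QC files. Prefixing a single "0" is enough here only because Alaska's state FIPS code is 02, so it would not generalize to states with two-digit codes of 10 or higher.

```python
import pandas as pd

# Illustrative identifiers: a county (5-digit key) and a place (7-digit key).
census_ids = pd.Series(["0500000US02013", "1600000US0209600"])
cdc_ids = pd.Series([2013, 209600])  # CDC ids lose the leading zero when read as integers

# Census: keep everything after "US". CDC: restore the leading zero.
census_keys = census_ids.str.split("US").str[1]
cdc_keys = "0" + cdc_ids.astype(str)

print(census_keys.tolist())  # ['02013', '0209600']
print(cdc_keys.tolist())     # ['02013', '0209600']
```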

Now we can merge the tables. Let's view all column names to make sure we have everything we need, then do a final check of the pertinent geography columns to confirm that the rows lined up correctly during the joins.

In [171]:
df = qc_places.merge(qc_results, how='left', on="id").merge(
        census_data, how='left', left_on='GEOIDFQ', right_on='GEO_ID').merge(
            cdc_data, how='left', left_on='GEOIDFQ', right_on='LocationID')

In [172]:
df.columns

Index(['id', 'name_x', 'GEOIDFQ', 'PLACENAME', 'AREATYPE', 'COMMENT', 'name_y',
       'areatype', 'placename', 'GEOID', 'total_population', 'pct_65_plus',
       'pct_under_18', 'pct_hispanic_latino', 'pct_white',
       'pct_african_american', 'pct_amer_indian_ak_native', 'pct_asian',
       'pct_hawaiian_pacislander', 'pct_other', 'pct_multi',
       'pct_w_disability', 'moe_pct_w_disability', 'pct_insured',
       'moe_pct_insured', 'pct_uninsured', 'moe_pct_uninsured', 'pct_asthma',
       'pct_copd', 'pct_hd', 'pct_stroke', 'pct_diabetes', 'pct_kd',
       'pct_minority', 'pct_no_hsdiploma', 'pct_below_150pov', 'pct_no_bband',
       'comment', 'GEO_ID', 'NAME', 'P12_001N', 'P12_002N', 'P12_003N',
       'P12_004N', 'P12_005N', 'P12_006N', 'P12_020N', 'P12_021N', 'P12_022N',
       'P12_023N', 'P12_024N', 'P12_025N', 'P12_026N', 'P12_027N', 'P12_028N',
       'P12_029N', 'P12_030N', 'P12_044N', 'P12_045N', 'P12_046N', 'P12_047N',
       'P12_048N', 'P12_049N', 'P9_001N', 'P9_002N

In [173]:
df[['id', 'name_x', 'GEOIDFQ', 'PLACENAME', 'AREATYPE', 'name_y', 'areatype', 'placename', 'GEOID', 'GEO_ID', 'NAME']]

|   | id | name_x | GEOIDFQ | PLACENAME | AREATYPE | name_y | areatype | placename | GEOID | GEO_ID | NAME |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AK60 | Chatham | 2220 | Sitka City and Borough | County | Chatham | County | Sitka City and Borough | 220 | 2220 | Sitka City and Borough, Alaska |
| 1 | AK48 | Buckland | 209600 | Buckland city | Incorporated place | Buckland | Incorporated place | Buckland city | 9600 | 209600 | Buckland city, Alaska |
| 2 | AK495 | Chase | 212350 | Chase CDP | Census designated place | Chase | Census designated place | Chase CDP | 12350 | 212350 | Chase CDP, Alaska |
| 3 | AK499 | Crown Point | 217960 | Crown Point CDP | Census designated place | Crown Point | Census designated place | Crown Point CDP | 17960 | 217960 | Crown Point CDP, Alaska |
| 4 | AK102 | Eagle | 220380 | Eagle city | Incorporated place | Eagle | Incorporated place | Eagle city | 20380 | 220380 | Eagle city, Alaska |
| 5 | AK112 | Ekwok | 221810 | Ekwok city | Incorporated place | Ekwok | Incorporated place | Ekwok city | 21810 | 221810 | Ekwok city, Alaska |
| 6 | AK378 | Suntrana | 232150 | Healy CDP | Census designated place | Suntrana | Census designated place | Healy CDP | 32150 | 232150 | Healy CDP, Alaska |
| 7 | AK150 | Hollis | 232810 | Hollis CDP | Census designated place | Hollis | Census designated place | Hollis CDP | 32810 | 232810 | Hollis CDP, Alaska |
| 8 | AK229 | Livengood | 244580 | Livengood CDP | Census designated place | Livengood | Census designated place | Livengood CDP | 44580 | 244580 | Livengood CDP, Alaska |
| 9 | AK240 | McCarthy | 245790 | McCarthy CDP | Census designated place | McCarthy | Census designated place | McCarthy CDP | 45790 | 245790 | McCarthy CDP, Alaska |
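
The visual check above could be complemented by a programmatic one. The sketch below uses the `indicator=True` option of `DataFrame.merge` to list keys on the left that found no match on the right. The frames `left` and `right` are toy stand-ins, and the key `0299999` is invented to force a miss.

```python
import pandas as pd

# Toy stand-ins for the places table (left) and the census table (right).
left = pd.DataFrame({"GEOIDFQ": ["02220", "0209600", "0299999"]})
right = pd.DataFrame({
    "GEO_ID": ["02220", "0209600"],
    "NAME": ["Sitka City and Borough, Alaska", "Buckland city, Alaska"],
})

# indicator=True adds a '_merge' column saying where each row came from.
checked = left.merge(right, how="left", left_on="GEOIDFQ", right_on="GEO_ID", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "GEOIDFQ"].tolist()

print(unmatched)  # ['0299999']
```

An empty list means every sampled place found a matching row in the joined table.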


Next, replace the raw variable codes in the column names with the short names from the lookup table. At the same time, create a list of the short names to loop through during the QC.

In [148]:

# lookup tables, checked in the same order as before: DHC, ACS5, CDC PLACES, CDC SDOH
var_lookups = [
    var_dict["dhc"]["vars"],
    var_dict["acs5"]["vars"],
    var_dict["cdc"]["PLACES"]["vars"],
    var_dict["cdc"]["SDOH"]["vars"],
]

# map each raw variable code found in df to its short name
rename_map = {}
for col in df.columns:
    for lookup in var_lookups:
        if col in lookup:
            rename_map[col] = lookup[col]["short_name"]
            break

var_cols = list(rename_map.values())
df.rename(columns=rename_map, inplace=True)
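
For readers without access to `utilities.luts`, the sketch below shows the renaming idea on a toy lookup table. The dictionary `toy_var_dict` only imitates the nested shape that the cell above indexes into, and its short names are invented. The last line checks for duplicate column names, which may be worth doing because the exported results already contain descriptive column names.

```python
import pandas as pd

# Hypothetical lookup table in the nested shape used above; short names are invented.
toy_var_dict = {
    "dhc": {"vars": {"P12_001N": {"short_name": "total_pop_example"}}},
    "acs5": {"vars": {"S1810_C03_001E": {"short_name": "pct_disability_example"}}},
}

toy_df = pd.DataFrame({"GEO_ID": ["02013"], "P12_001N": [3420], "S1810_C03_001E": [10.7]})

# Collect code -> short name pairs for the codes that are present in the frame.
rename_map = {
    code: info["short_name"]
    for source in toy_var_dict.values()
    for code, info in source["vars"].items()
    if code in toy_df.columns
}
toy_renamed = toy_df.rename(columns=rename_map)

# Duplicate names would make later column selections ambiguous.
has_duplicates = bool(toy_renamed.columns.duplicated().any())

print(list(toy_renamed.columns))  # ['GEO_ID', 'total_pop_example', 'pct_disability_example']
print(has_duplicates)  # False
```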