# EPA-Justice QC

This notebook runs an independent validation of the results produced in the `fetch_data_and_export.ipynb` notebook. The goals of the QC are to:


- **Validate the API:** The functions in the `fetch_data_and_export.ipynb` notebook retrieve census data through APIs. Here we instead read the same data from downloaded CSVs and compare it against the results in `data_to_export.csv`. Agreement between the two shows that the data gathered from the API matches the raw tabular data.

- **Check the math:** We will do an independent aggregation of the original data to check our math, and keep those original data values visible for inspection.

- **Ensure reasonable values:** Several QC checks confirm that values fall within expected ranges and assess missing data.


Note: The 2020 US Census data CSVs (DHC tables P12 and P9, and ACS 5-year tables S1810 and S2701) were downloaded from the [data.census.gov](https://data.census.gov) online application, after filtering for all counties, places, and tracts in Alaska. The CDC PLACES and SDOH data CSVs for counties, places, and tracts in Alaska were downloaded from the CDC [data portal](https://data.cdc.gov/browse?category=500+Cities+%26+Places).

## Data Prep

In [203]:
import pandas as pd
import random
from utilities.luts import *

Load places list and previously generated results.

In [204]:
places = pd.read_csv('tbl/NCRPlaces_Census_04192024.csv')
places.drop(columns=['alt_name', 'region', 'country', 'latitude', 'longitude', 'type'], inplace=True)
results = pd.read_csv('tbl/data_to_export.csv')

Choose 20 random places for QC, and get corresponding results.

In [205]:
# boolean indexing keeps the original dtypes; .copy() avoids chained-assignment warnings later
qc_ids = random.sample(places['id'].to_list(), 20)
qc_places = places[places['id'].isin(qc_ids)].copy()
qc_results = results[results['id'].isin(qc_places['id'])].copy()
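
Because `random.sample` is not seeded here, each run of the notebook selects a different set of places. If a repeatable QC subset is wanted, one option is to fix a seed. The sketch below uses a toy table; the `toy_places` frame and the seed value are assumptions for illustration and are not part of the original workflow.

```python
import pandas as pd

# toy stand-in for the places table (hypothetical ids)
toy_places = pd.DataFrame({"id": [f"AK{i}" for i in range(1, 101)]})

SEED = 42  # arbitrary, assumed value; any fixed integer gives a repeatable draw

# DataFrame.sample with random_state returns the same rows on every run
sample_a = toy_places.sample(n=20, random_state=SEED)
sample_b = toy_places.sample(n=20, random_state=SEED)

print(sample_a["id"].tolist() == sample_b["id"].tolist())  # True
```

A fixed seed trades coverage for repeatability: rerunning with a different seed would check a different set of places.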

Load Census and CDC data from CSVs. Join them into two main tables (one for Census, one for CDC) and keep only the columns listed in the lookup tables. Some table reformatting is required here in order to accomplish the merge.

In [206]:
# census data
dhc_p12 = pd.read_csv('qc_data/DECENNIALDHC2020.P12-Data.csv')
dhc_p9 = pd.read_csv('qc_data/DECENNIALDHC2020.P9-Data.csv')
acs5 = pd.read_csv('qc_data/ACSST5Y2020.S1810-Data.csv')

# cdc data
places_county = pd.read_csv('qc_data/PLACES__Local_Data_for_Better_Health__County_Data_2023_release_20240506.csv')
places_place = pd.read_csv('qc_data/PLACES__Local_Data_for_Better_Health__Place_Data_2023_release_20240506.csv')
places_tract = pd.read_csv('qc_data/PLACES__Local_Data_for_Better_Health__Census_Tract_Data_2023_release_20240506.csv')

sdoh_county = pd.read_csv('qc_data/SDOH_Measures_for_County__ACS_2017-2021_20240506.csv')
sdoh_place = pd.read_csv('qc_data/SDOH_Measures_for_Place__ACS_2017-2021_20240506.csv')
sdoh_tract = pd.read_csv('qc_data/SDOH_Measures_for_Census_Tract__ACS_2017-2021_20240506.csv')

# standardize columns
for df in [sdoh_county, sdoh_place, sdoh_tract]:
    df.rename(columns={'MeasureID':'MeasureId'}, inplace=True)


In [207]:
# get var lists
census_vars = list(var_dict["dhc"]["vars"].keys()) + (list(var_dict["acs5"]["vars"].keys()))
cdc_vars = list(var_dict["cdc"]["PLACES"]["vars"].keys()) + (list(var_dict["cdc"]["SDOH"]["vars"].keys()))
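
The lookup tables in `utilities/luts.py` are not shown in this notebook. Judging only from how `var_dict` is indexed here, its layout is presumably similar to the sketch below. The specific entries are illustrative assumptions inferred from the column names that appear later in this notebook, not a copy of the real file.

```python
# hypothetical excerpt of var_dict; the real lookup table lives in utilities/luts.py
var_dict_example = {
    "dhc": {"vars": {"P12_001N": {"short_name": "total_population"}}},
    "acs5": {"vars": {"S1810_C03_001E": {"short_name": "pct_w_disability"}}},
    "cdc": {
        "PLACES": {"vars": {"STROKE": {"short_name": "pct_stroke"}}},
        "SDOH": {"vars": {"POV150": {"short_name": "pct_below_150pov"}}},
    },
}

# same list-building pattern as the cell above, applied to the example
census_vars_example = (
    list(var_dict_example["dhc"]["vars"]) + list(var_dict_example["acs5"]["vars"])
)
cdc_vars_example = (
    list(var_dict_example["cdc"]["PLACES"]["vars"])
    + list(var_dict_example["cdc"]["SDOH"]["vars"])
)

print(census_vars_example)  # ['P12_001N', 'S1810_C03_001E']
print(cdc_vars_example)     # ['STROKE', 'POV150']
```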

In [208]:
# merge census data, dropping unused cols
census_data = dhc_p12.merge(dhc_p9, how='left', on=['GEO_ID','NAME']).merge(acs5, how='left', on=['GEO_ID','NAME'])
for col in census_data.columns:
    if col not in ['GEO_ID', 'NAME'] and col not in census_vars:
        census_data.drop(columns=col, inplace=True)
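# the first row holds the descriptive column labels from the Census CSV header
# to keep those labels as a second column level, uncomment the next line:
#census_data.columns = pd.MultiIndex.from_arrays([census_data.columns, census_data.iloc[0].values])
# drop the label row so that only data rows remain
census_data = census_data.iloc[1:].copy()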
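
CSVs exported from data.census.gov carry a second header row of descriptive labels, which pandas reads as the first data row; this is why the first row is dropped above. A side effect is that the remaining columns are parsed as text. The sketch below, using a small made-up CSV, shows the drop followed by a numeric conversion. Whether the later QC steps need that conversion is an assumption.

```python
import io
import pandas as pd

# made-up CSV mimicking the two-row header layout (codes, then labels)
raw = io.StringIO(
    "GEO_ID,NAME,P12_001N\n"
    "Geography,Geographic Area Name,Total\n"
    "0500000US02013,Aleutians East Borough,3420\n"
    "0500000US02016,Aleutians West Census Area,5232\n"
)
toy = pd.read_csv(raw)

# row 0 holds the labels, so every column was parsed as text
toy = toy.iloc[1:].copy()

# convert the value columns to numbers; unparseable entries become NaN
for col in toy.columns:
    if col not in ("GEO_ID", "NAME"):
        toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy["P12_001N"].sum())  # 8652
```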

In [209]:
# reformat cdc data, dropping unused cols and keeping only crude prevalence
# pivot and drop unused measures (vars), and merge
cdc_dfs_reformatted = []

cdc_dfs = [places_county, places_place, places_tract, sdoh_county, sdoh_place, sdoh_tract]
cdc_cols = ['LocationID', 'LocationName', 'DataValueTypeID', 'Measure', 'MeasureId', 'Data_Value']

for df in cdc_dfs:
    for col in df.columns:
        if col not in cdc_cols:
            df.drop(columns=col, inplace=True)
    df.drop(df.loc[df['DataValueTypeID'] == 'AgeAdjPrv'].index, inplace=True)
    df.drop(columns='DataValueTypeID', inplace=True)

    df_pivot = df.pivot(index=['LocationID', 'LocationName'], columns=['MeasureId', 'Measure'], values='Data_Value').reset_index()
    df_pivot = df_pivot.droplevel(level=1, axis=1)

    for col in df_pivot.columns:
        if col not in cdc_cols and col not in cdc_vars:
            df_pivot.drop(columns=col, inplace=True)
            
    cdc_dfs_reformatted.append(df_pivot)

cdc_data = pd.concat(cdc_dfs_reformatted[0:3]).merge(pd.concat(cdc_dfs_reformatted[3:]), how='left', on='LocationID')
cdc_data.columns.name = None
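
The reshaping step above turns each CDC file from long format (one row per location and measure) into wide format (one row per location). A minimal sketch on a few toy records follows; the measure descriptions are placeholders rather than the exact CDC wording.

```python
import pandas as pd

# toy long-format records in the shape of the CDC downloads
long_df = pd.DataFrame({
    "LocationID": [2013, 2013, 2016, 2016],
    "LocationName": ["Aleutians East", "Aleutians East",
                     "Aleutians West", "Aleutians West"],
    "MeasureId": ["STROKE", "DIABETES", "STROKE", "DIABETES"],
    "Measure": ["Stroke among adults", "Diabetes among adults"] * 2,
    "Data_Value": [3.3, 13.1, 2.5, 10.8],
})

# one row per location, one column per (MeasureId, Measure) pair
wide = long_df.pivot(
    index=["LocationID", "LocationName"],
    columns=["MeasureId", "Measure"],
    values="Data_Value",
).reset_index()

# drop the long Measure text, keeping the short MeasureId as the column name
wide = wide.droplevel(level=1, axis=1)
wide.columns.name = None

print(wide)
```

Note that `pivot` raises an error if a location has two rows for the same measure, which is why the age-adjusted rows are removed before pivoting.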

Finally, we have some merged data that will be easier to QC.

In [210]:
census_data.head()

|  | GEO_ID | NAME | P12_001N | P12_002N | P12_003N | P12_004N | P12_005N | P12_006N | P12_020N | P12_021N | ... | P9_002N | P9_005N | P9_006N | P9_007N | P9_008N | P9_009N | P9_010N | P9_011N | S1810_C03_001E | S1810_C03_001M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0500000US02013 | Aleutians East Borough, Alaska | 3420 | 2371 | 38 | 33 | 51 | 38 | 39 | 37 | ... | 674 | 658 | 297 | 797 | 771 | 32 | 11 | 180 | 10.7 | 2.2 |
| 2 | 0500000US02016 | Aleutians West Census Area, Alaska | 5232 | 3432 | 76 | 106 | 87 | 57 | 61 | 45 | ... | 668 | 1585 | 257 | 709 | 1502 | 238 | 10 | 263 | 7.9 | 2.7 |
| 3 | 0500000US02020 | Anchorage Municipality, Alaska | 291247 | 147894 | 9746 | 9879 | 9964 | 5657 | 3232 | 3852 | ... | 26438 | 158232 | 13777 | 22480 | 27281 | 9844 | 1922 | 31273 | 11.2 | 0.6 |
| 4 | 0500000US02050 | Bethel Census Area, Alaska | 18666 | 9747 | 931 | 977 | 881 | 567 | 168 | 209 | ... | 207 | 1628 | 84 | 15580 | 201 | 2 | 19 | 945 | 11.4 | 1.0 |
| 5 | 0500000US02060 | Bristol Bay Borough, Alaska | 844 | 440 | 23 | 37 | 18 | 16 | 13 | 18 | ... | 45 | 357 | 6 | 296 | 5 | 3 | 2 | 130 | 15.9 | 4.1 |


In [211]:
cdc_data.head()

|  | LocationID | LocationName_x | STROKE | DIABETES | KIDNEY | CHD | CASTHMA | COPD | LocationName_y | REMNRTY | NOHSDP | BROAD | POV150 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | Aleutians East | 3.3 | 13.1 | 3.0 | 5.7 | 7.8 | 5.4 | Aleutians East Borough | 87.2 | 15.3 | 42.5 | 22.7 |
| 1 | 2016 | Aleutians West | 2.5 | 10.8 | 2.4 | 4.5 | 7.4 | 4.4 | Aleutians West Census Area | 77.1 | 9.0 | 23.0 | 11.3 |
| 2 | 2020 | Anchorage | 2.6 | 7.9 | 2.6 | 4.7 | 9.4 | 5.3 | Anchorage Municipality | 43.9 | 5.8 | 7.3 | 15.1 |
| 3 | 2050 | Bethel | 4.8 | 14.8 | 3.9 | 7.9 | 12.8 | 10.4 | Bethel Census Area | 90.8 | 18.0 | 25.2 | 43.9 |
| 4 | 2060 | Bristol Bay | 3.6 | 10.7 | 3.4 | 6.9 | 10.0 | 6.9 | Bristol Bay Borough | 58.8 | 5.3 | 23.5 | 8.0 |


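We need to standardize the GEOIDs in order to merge all the tables together. For all census-based GEOID columns, we will drop the "US" and everything before it. For the CDC location ID column, we will prepend a single leading zero: the IDs were read as integers, which strips the leading zero of Alaska's state FIPS code (02).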

In [212]:
qc_places['GEOIDFQ'] = qc_places['GEOIDFQ'].str.split("US").str[1]
census_data['GEO_ID'] = census_data['GEO_ID'].str.split("US").str[1]
cdc_data['LocationID'] = "0" + cdc_data['LocationID'].astype(str)
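
As a small illustration of the standardization above, the sketch below applies the same two operations to toy values and shows that both sources end up with the same key. Prepending a single "0" is specific to states whose FIPS code starts with zero, such as Alaska (02); a dataset covering other states would need a different padding rule. The `1600000US` prefix for places is based on the Census Bureau's usual fully qualified GEOID format and is an assumption here, since only a county-level GEO_ID is shown in this notebook.

```python
import pandas as pd

# fully qualified census GEOIDs, as in the GEO_ID column (county, then place)
geo_id = pd.Series(["0500000US02013", "1600000US0213860"])
# CDC location ids parsed as integers, so the leading zero of "02" is lost
location_id = pd.Series([2013, 213860])

# keep only the part after "US"
census_key = geo_id.str.split("US").str[1]
# restore the single stripped zero
cdc_key = "0" + location_id.astype(str)

print(census_key.tolist())  # ['02013', '0213860']
print(cdc_key.tolist())     # ['02013', '0213860']
```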

Now we can merge the tables and replace column names using our short names from the lookup table. Let's view all column names to make sure we have everything we need, then do a last check of the pertinent geography columns to make sure everything aligned correctly during the table joins.

In [213]:
df = qc_places.merge(census_data, how='left', left_on='GEOIDFQ', right_on='GEO_ID').merge(cdc_data, how='left', left_on='GEOIDFQ', right_on='LocationID')

In [214]:
# collect every lookup table, then rename all matching columns in one call
lookups = [
    var_dict["dhc"]["vars"],
    var_dict["acs5"]["vars"],
    var_dict["cdc"]["PLACES"]["vars"],
    var_dict["cdc"]["SDOH"]["vars"],
]
rename_map = {}
for lut in lookups:
    for col, info in lut.items():
        # setdefault keeps the first match if a code appears in more than one table
        rename_map.setdefault(col, info["short_name"])
df.rename(columns=rename_map, inplace=True)

In [215]:
df.columns

Index(['id', 'name', 'GEOIDFQ', 'PLACENAME', 'AREATYPE', 'COMMENT', 'GEO_ID',
       'NAME', 'total_population', 'total_male', 'm_under_5', 'm_5_to_9',
       'm_10_to_14', 'm_15_to_17', 'm_65_to_66', 'm_67_to_69', 'm_70_to_74',
       'm_75_to_79', 'm_80_to_84', 'm_85_plus', 'total_female', 'f_under_5',
       'f_5_to_9', 'f_10_to_14', 'f_15_to_17', 'f_65_to_66', 'f_67_to_69',
       'f_70_to_74', 'f_75_to_79', 'f_80_to_84', 'f_85_plus', 'total_p9',
       'hispanic_latino', 'white', 'african_american', 'amer_indian_ak_native',
       'asian', 'hawaiian_pacislander', 'other', 'multi', 'pct_w_disability',
       'moe_pct_w_disability', 'LocationID', 'LocationName_x', 'pct_stroke',
       'pct_diabetes', 'pct_kd', 'pct_hd', 'pct_asthma', 'pct_copd',
       'LocationName_y', 'pct_minority', 'pct_no_hsdiploma', 'pct_no_bband',
       'pct_below_150pov'],
      dtype='object')

In [219]:
df[['id', 'name', 'GEOIDFQ', 'PLACENAME', 'AREATYPE', 'GEO_ID', 'NAME', 'LocationID', 'LocationName_x', 'LocationName_y']]

|  | id | name | GEOIDFQ | PLACENAME | AREATYPE | GEO_ID | NAME | LocationID | LocationName_x | LocationName_y |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AK15 | Anchorage | 2020 | Anchorage Municipality | County | 2020 | Anchorage Municipality, Alaska | 2020.0 | Anchorage | Anchorage Municipality |
| 1 | BORO17 | Bristol Bay Borough | 2060 | Bristol Bay Borough | County | 2060 | Bristol Bay Borough, Alaska | 2060.0 | Bristol Bay | Bristol Bay Borough |
| 2 | BORO7 | Ketchikan Gateway Borough | 2130 | Ketchikan Gateway Borough | County | 2130 | Ketchikan Gateway Borough, Alaska | 2130.0 | Ketchikan Gateway | Ketchikan Gateway Borough |
| 3 | CENS6 | Nome Census Area | 2180 | Nome Census Area | County | 2180 | Nome Census Area, Alaska | 2180.0 | Nome | Nome Census Area |
| 4 | AK60 | Chatham | 2220 | Sitka City and Borough | County | 2220 | Sitka City and Borough, Alaska | 2220.0 | Sitka | Sitka City and Borough |
| 5 | AK70 | Chiniak | 213860 | Chiniak CDP | Census designated place | 213860 | Chiniak CDP, Alaska |  |  |  |
| 6 | AK71 | Chisana | 213890 | Chisana CDP | Census designated place | 213890 | Chisana CDP, Alaska |  |  |  |
| 7 | AK88 | Copper Center | 217300 | Copper Center CDP | Census designated place | 217300 | Copper Center CDP, Alaska | 217300.0 | Copper Center | Copper Center |
| 8 | AK252 | Morzhovoi | 224660 | False Pass city | Incorporated place | 224660 | False Pass city, Alaska |  |  |  |
| 9 | AK153 | Hoonah | 233360 | Hoonah city | Incorporated place | 233360 | Hoonah city, Alaska | 233360.0 | Hoonah | Hoonah |
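
The blank CDC columns for some places in the table above (for example Chiniak and Chisana) indicate that no CDC record matched their GEOID. Rather than spotting these by eye, the unmatched rows could be flagged programmatically. The sketch below uses the `indicator` option of `DataFrame.merge` on toy tables; the frames and the second `pct_stroke` value are hypothetical.

```python
import pandas as pd

# toy stand-ins for the standardized place and CDC tables (hypothetical rows)
toy_places = pd.DataFrame({
    "id": ["AK15", "AK70", "AK153"],
    "GEOIDFQ": ["02020", "0213860", "0233360"],
})
toy_cdc = pd.DataFrame({
    "LocationID": ["02020", "0233360"],
    "pct_stroke": [2.6, 3.1],
})

# indicator=True adds a _merge column recording where each row was found
checked = toy_places.merge(
    toy_cdc, how="left", left_on="GEOIDFQ", right_on="LocationID", indicator=True
)

# rows present only in the left table have no CDC match
unmatched = checked.loc[checked["_merge"] == "left_only", "id"].tolist()
print(unmatched)  # ['AK70']
```

Listing the unmatched ids this way makes it easier to separate places that CDC does not cover from GEOID formatting problems in the join keys.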

