# US CENSUS DATA - Access via API



A large list of the census datasets accessible via API can be found here: https://www.census.gov/data/developers/data-sets.html . Drilling down into the individual dataset pages will eventually get you to a level where you can view dataset variables and supported geographies. _Not all datasets have the same geographies!_ So defining the geographic unit of interest first will narrow down the data available for those units.

We will concentrate on zip code geography in this notebook as a demonstration. Valid Alaska zip codes for 2020 have been found by downloading the Census Gazetteer file [here](https://www.census.gov/geographies/reference-files/time-series/geo/gazetteer-files.html). 
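As a rough sketch of that step (not the notebook's actual code), the ZCTA Gazetteer file could be filtered down to Alaska by ZIP prefix. The `GEOID` column name, the tab delimiter, and the 995-999 prefix range are assumptions to verify against the downloaded file, and the sample rows below are placeholders rather than real Gazetteer values.

```python
import io
import pandas as pd

# Placeholder stand-in for the downloaded Gazetteer text file (assumed tab-delimited).
# The ALAND values are dummies for illustration only.
sample_gazetteer = "GEOID\tALAND\n00601\t1\n98101\t2\n99501\t3\n99709\t4\n"

def alaska_zctas(gazetteer_file):
    """Return ZCTA codes whose three-digit prefix falls in the assumed Alaska range."""
    # Read everything as strings so leading zeros in ZIP codes are preserved.
    df = pd.read_csv(gazetteer_file, sep="\t", dtype=str)
    # Header names may carry stray whitespace; strip to be safe.
    df.columns = df.columns.str.strip()
    is_alaska = df["GEOID"].str[:3].between("995", "999")
    return df.loc[is_alaska, "GEOID"].tolist()

ak_zip_codes_sketch = alaska_zctas(io.StringIO(sample_gazetteer))
print(ak_zip_codes_sketch)
```

In practice the file path of the downloaded Gazetteer file would be passed instead of the in-memory sample.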

We will compute the population aged 65 years and over and the population under 18 years for each geography using the DHC (described below). We will also pull a disability variable from the ACS 5-year dataset. Some other datasets are also described below for comparison and to show that there are multiple avenues for comparing zip code-level data with larger spatial units for contrast and context.


#### Decennial Census (2020)

The Demographic and Housing Characteristics File ([DHC](https://www.census.gov/data/tables/2023/dec/2020-census-dhc.html)) dataset includes zip code geography; it contains granular spatial units such as "zip code tabulation area" as well as other meaningful units like "alaska native regional corporation" and "school district", as shown in the geography section of [this table](https://api.census.gov/data/2020/dec/dhc.html).

The Detailed Demographic and Housing Characteristics File A ([DHC-A](https://www.census.gov/data/tables/2023/dec/2020-census-detailed-dhc-a.html#:~:text=2020%20Census%20DHC,detailed%20race%20and%20ethnicity%20groups.)) dataset **does not** contain zip code geography; the most granular spatial units are "place", "tract", and "american indian area/alaska native area/hawaiian home land" as shown in the geography section of [this table](https://api.census.gov/data/2020/dec/ddhca.html).

#### American Community Surveys

By definition, the American Community Survey ([ACS](https://www.census.gov/data/developers/data-sets/acs-1year.html)) 1-year and 3-year datasets only have data for geographies of "places with populations of 65,000 or more" or "20,000 or more", respectively, and therefore would not be appropriate for a zip code-level query. However, the ACS 5-year dataset does have data for granular spatial units such as "zip code tabulation area" as well as other meaningful units like "alaska native regional corporation" and "school district", as shown in [this table](https://api.census.gov/data/2022/acs/acs5/geography.html).

#### Population Estimates and Projections

The [Population Estimates and Projections](https://www.census.gov/data/developers/data-sets/popest-popproj.html) dataset is limited to national, regional, and state-level geographies as shown in the geographies section of [this table](https://api.census.gov/data/2021/pep/population.html), and therefore would not be appropriate for a zip code-level query.

### Setup

In [1]:
import pandas as pd
import requests
import json
from api_functions import * # custom helper functions for census API
from api_luts import * # ak zip codes and list of variables/names for each dataset

### Explore DHC and ACS datasets

By fetching the JSON at the base API URL for the DHC and ACS 5-year datasets, we can view the metadata for each. From that metadata we can also get links directly to other JSON files that catalogue the geographies and variables.
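The `fetch_print_json` helper used in this section lives in `api_functions.py` and is not shown here. As a minimal sketch of the same idea, the functions below fetch a JSON document and pull the two catalogue links out of the metadata. The names `fetch_json` and `dataset_links` are hypothetical, and the sketch uses only the standard library (the notebook itself imports `requests`).

```python
import json
from urllib.request import urlopen

def fetch_json(url, timeout=30):
    """Fetch a URL and return the parsed JSON body (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)

def dataset_links(metadata):
    """Pull the geography and variables links out of a dataset metadata dict."""
    entry = metadata["dataset"][0]
    return entry["c_geographyLink"], entry["c_variablesLink"]

# Offline example that mirrors the key layout of the DHC metadata.
example_metadata = {
    "dataset": [
        {
            "c_geographyLink": "http://api.census.gov/data/2020/dec/dhc/geography.json",
            "c_variablesLink": "http://api.census.gov/data/2020/dec/dhc/variables.json",
        }
    ]
}
geo_url, var_url = dataset_links(example_metadata)
print(json.dumps({"geography": geo_url, "variables": var_url}, indent=2))
```

With network access, `dataset_links(fetch_json(dhc_base + ".json"))` would be the live equivalent, assuming the metadata is served at that address.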

In [2]:
# set up variables to hold the base API call for both the DHC and the ACS 5-year
dhc_base = 'https://api.census.gov/data/2020/dec/dhc'
acs5_base = 'https://api.census.gov/data/2020/acs/acs5'

# fetch and print the DHC and ACS metadata JSON
dhc_metadata = fetch_print_json(dhc_base, "metadata")
acs5_metadata = fetch_print_json(acs5_base, "metadata")

{
  "@context": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld",
  "@id": "http://api.census.gov/data/2020/dec/dhc.json",
  "@type": "dcat:Catalog",
  "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
  "describedBy": "https://project-open-data.cio.gov/v1.1/schema/catalog.json",
  "dataset": [
    {
      "c_vintage": 2020,
      "c_dataset": [
        "dec",
        "dhc"
      ],
      "c_geographyLink": "http://api.census.gov/data/2020/dec/dhc/geography.json",
      "c_variablesLink": "http://api.census.gov/data/2020/dec/dhc/variables.json",
      "c_tagsLink": "http://api.census.gov/data/2020/dec/dhc/tags.json",
      "c_examplesLink": "http://api.census.gov/data/2020/dec/dhc/examples.json",
      "c_groupsLink": "http://api.census.gov/data/2020/dec/dhc/groups.json",
      "c_sorts_url": "http://api.census.gov/data/2020/dec/dhc/sorts.json",
      "c_documentationLink": "https://www.census.gov/developer/",
      "c_isAggregate": true,
      "c_isCube": tr

In [3]:
# grab the geography URLs and the variables URLs from the JSON, and print them
dhc_geo_url = dhc_metadata["dataset"][0]["c_geographyLink"]
dhc_var_url = dhc_metadata["dataset"][0]["c_variablesLink"]
print(dhc_geo_url)
print(dhc_var_url)

acs5_geo_url = acs5_metadata["dataset"][0]["c_geographyLink"]
acs5_var_url = acs5_metadata["dataset"][0]["c_variablesLink"]
print(acs5_geo_url)
print(acs5_var_url)

http://api.census.gov/data/2020/dec/dhc/geography.json
http://api.census.gov/data/2020/dec/dhc/variables.json
http://api.census.gov/data/2020/acs/acs5/geography.json
http://api.census.gov/data/2020/acs/acs5/variables.json


In [4]:
# fetch and print JSON for the geography and variable URLs
dhc_geo = fetch_print_json(dhc_geo_url, parse="geo")
dhc_var = fetch_print_json(dhc_var_url, parse="var")

acs5_geo = fetch_print_json(acs5_geo_url, parse="geo")
acs5_var = fetch_print_json(acs5_var_url, parse="var")

NAME: us
NAME: region
NAME: division
NAME: state
NAME: county: REQUIRES: ['state']
NAME: county subdivision: REQUIRES: ['state', 'county']
NAME: subminor civil division: REQUIRES: ['state', 'county', 'county subdivision']
NAME: place/remainder (or part): REQUIRES: ['state', 'county', 'county subdivision']
NAME: tract (or part): REQUIRES: ['state', 'county', 'county subdivision', 'place/remainder (or part)']
NAME: urban/rural: REQUIRES: ['state', 'county', 'county subdivision', 'place/remainder (or part)', 'tract (or part)']
NAME: urban/rural: REQUIRES: ['state', 'county', 'county subdivision', 'place/remainder (or part)', 'tract (or part)', 'block group (or part)']
NAME: block group (or part): REQUIRES: ['state', 'county', 'county subdivision', 'place/remainder (or part)', 'tract (or part)']
NAME: block: REQUIRES: ['state', 'county', 'tract']
NAME: tract: REQUIRES: ['state', 'county']
NAME: american indian area/alaska native area/hawaiian home land (or part): REQUIRES: ['state', 'count

The large printouts above are hard to read and contain many items we do not want to sift through. Thankfully, we already know the general geographies and variables we are looking for. Let's see which geographies are available by searching both datasets' geography lists for the "zip code" substring.

In DHC, we search for variable codes starting with "P12_" (P12 = Sex by Age for Selected Age Categories; including the "_" in the search string excludes the racial/demographic categories). We also require "Male" or "Female" in the label. This gives us the DHC variables that we can query for data. The list is still long, but it is easier to narrow down.

In ACS, we simply search variable labels where the word "disability" is present.
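Both searches follow the same pattern: filter a variables dictionary by code prefix and by label text. A small helper along these lines could cover both cases. This is a sketch only; `find_variables` is a hypothetical name, and `sample_vars` is a tiny stand-in for `dhc_var["variables"]`.

```python
def find_variables(variables, prefix="", label_terms=()):
    """Return sorted (code, label) pairs matching a code prefix and any label term."""
    matches = []
    for code, info in variables.items():
        label = info.get("label", "")
        if not code.startswith(prefix):
            continue
        # With no label terms given, every code matching the prefix is kept.
        if label_terms and not any(term in label for term in label_terms):
            continue
        matches.append((code, label))
    return sorted(matches)

# Stand-in for the real variables dictionary; codes other than
# P12_002N and P12_026N are illustrative.
sample_vars = {
    "P12_002N": {"label": " !!Total:!!Male:"},
    "P12_026N": {"label": " !!Total:!!Female:"},
    "P12_001N": {"label": " !!Total:"},
    "P12A_002N": {"label": " !!Total:!!Male:"},
}
p12_matches = find_variables(sample_vars, prefix="P12_", label_terms=("Male", "Female"))
for code, label in p12_matches:
    print(code, label)
```

The ACS search would be `find_variables(acs5_var["variables"], label_terms=("disability",))`. Matching is case-sensitive, as in the notebook's own loops.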

In [5]:
print("DHC Geographies:")
zip_code_geo = []
for name in dhc_geo["fips"]:
    if 'zip code' in name['name']:
        zip_code_geo.append(name['name'])
print(f"{len(zip_code_geo)} geographies containing 'zip code':\n")
for z in zip_code_geo: print(z)

print("\n")
print("DHC Variables in P12 table, excluding racial demographic categories:")
p12_vars = []
for var, info in dhc_var["variables"].items():
    label = info.get("label", "")
    # keep P12 codes whose label mentions Male or Female
    if var.startswith("P12_") and ("Male" in label or "Female" in label):
        p12_vars.append(f"{var}: {label}: {info['concept']}")
print(f"{len(p12_vars)} variables containing 'P12_' plus 'Male' or 'Female':\n")
p12_vars.sort()
for var in p12_vars: print(var)

DHC Geographies:
2 geographies containing 'zip code':

zip code tabulation area
zip code tabulation area (or part)


DHC Variables in P12 table, excluding racial demographic categories:
48 variables containing 'P12_' plus 'Male' or 'Female':

P12_002N:  !!Total:!!Male:: SEX BY AGE FOR SELECTED AGE CATEGORIES
P12_003N:  !!Total:!!Male:!!Under 5 years: SEX BY AGE FOR SELECTED AGE CATEGORIES
P12_004N:  !!Total:!!Male:!!5 to 9 years: SEX BY AGE FOR SELECTED AGE CATEGORIES
P12_005N:  !!Total:!!Male:!!10 to 14 years: SEX BY AGE FOR SELECTED AGE CATEGORIES
P12_006N:  !!Total:!!Male:!!15 to 17 years: SEX BY AGE FOR SELECTED AGE CATEGORIES
P12_007N:  !!Total:!!Male:!!18 and 19 years: SEX BY AGE FOR SELECTED AGE CATEGORIES
P12_008N:  !!Total:!!Male:!!20 years: SEX BY AGE FOR SELECTED AGE CATEGORIES
P12_009N:  !!Total:!!Male:!!21 years: SEX BY AGE FOR SELECTED AGE CATEGORIES
P12_010N:  !!Total:!!Male:!!22 to 24 years: SEX BY AGE FOR SELECTED AGE CATEGORIES
P12_011N:  !!Total:!!Male:!!25 to 29 yea

In [6]:
print("ACS 5yr Geographies:")
zip_code_geo = []
for name in acs5_geo["fips"]:
    if 'zip code' in name['name']:
        zip_code_geo.append(name['name'])
print(f"{len(zip_code_geo)} geographies containing 'zip code':\n")
for z in zip_code_geo: print(z)

print("\n")
print("ACS 5yr Variables containing 'disability':")
disability_vars = []
for var in acs5_var["variables"].keys():
    if ("disability" in acs5_var['variables'][var]['label']):
        disability_vars.append(f"{var}: {acs5_var['variables'][var]['label']}")
print(f"{len(disability_vars)} variables containing 'disability':\n")
disability_vars.sort()
for var in disability_vars: print(var)

ACS 5yr Geographies:
1 geographies containing 'zip code':

zip code tabulation area


ACS 5yr Variables containing 'disability':
353 variables containing 'disability':

B10052_002E: Estimate!!Total:!!With any disability:
B10052_003E: Estimate!!Total:!!With any disability:!!Grandparent responsible for own grandchildren under 18 years:
B10052_004E: Estimate!!Total:!!With any disability:!!Grandparent responsible for own grandchildren under 18 years:!!30 to 59 years
B10052_005E: Estimate!!Total:!!With any disability:!!Grandparent responsible for own grandchildren under 18 years:!!60 years and over
B10052_006E: Estimate!!Total:!!With any disability:!!Grandparent not responsible for own grandchildren under 18 years
B10052_007E: Estimate!!Total:!!No disability:
B10052_008E: Estimate!!Total:!!No disability:!!Grandparent responsible for own grandchildren under 18 years:
B10052_009E: Estimate!!Total:!!No disability:!!Grandparent responsible for own grandchildren under 18 years:!!30 to 59 years
B

### Query the DHC and ACS datasets

Picking out the variables we want manually, we end up with the following shopping lists that we will query by zip code. We've gone ahead and used these variable codes to set up dataset variable dictionaries, with the codes as keys and sub-dictionaries holding a custom long name and short name for each variable. (These are stored in `api_luts.py` and have already been loaded into the notebook.)

**DHC:**  
  
P12_002N:  !!Total:!!Male:: SEX BY AGE FOR SELECTED AGE CATEGORIES  
P12_026N:  !!Total:!!Female:: SEX BY AGE FOR SELECTED AGE CATEGORIES  
  
P12_003N:  !!Total:!!Male:!!Under 5 years: SEX BY AGE FOR SELECTED AGE CATEGORIES  
P12_004N:  !!Total:!!Male:!!5 to 9 years: SEX BY AGE FOR SELECTED AGE CATEGORIES  
P12_005N:  !!Total:!!Male:!!10 to 14 years: SEX BY AGE FOR SELECTED AGE CATEGORIES  
P12_006N:  !!Total:!!Male:!!15 to 17 years: SEX BY AGE FOR SELECTED AGE CATEGORIES  
  
P12_020N:  !!Total:!!Male:!!65 and 66 years: SEX BY AGE FOR SELECTED AGE CATEGORIES  
P12_021N:  !!Total:!!Male:!!67 to 69 years: SEX BY AGE FOR SELECTED AGE CATEGORIES  
P12_022N:  !!Total:!!Male:!!70 to 74 years: SEX BY AGE FOR SELECTED AGE CATEGORIES  
P12_023N:  !!Total:!!Male:!!75 to 79 years: SEX BY AGE FOR SELECTED AGE CATEGORIES  
P12_024N:  !!Total:!!Male:!!80 to 84 years: SEX BY AGE FOR SELECTED AGE CATEGORIES  
P12_025N:  !!Total:!!Male:!!85 years and over: SEX BY AGE FOR SELECTED AGE CATEGORIES  
  
P12_027N:  !!Total:!!Female:!!Under 5 years: SEX BY AGE FOR SELECTED AGE CATEGORIES  
P12_028N:  !!Total:!!Female:!!5 to 9 years: SEX BY AGE FOR SELECTED AGE CATEGORIES  
P12_029N:  !!Total:!!Female:!!10 to 14 years: SEX BY AGE FOR SELECTED AGE CATEGORIES  
P12_030N:  !!Total:!!Female:!!15 to 17 years: SEX BY AGE FOR SELECTED AGE CATEGORIES  
  
P12_044N:  !!Total:!!Female:!!65 and 66 years: SEX BY AGE FOR SELECTED AGE CATEGORIES  
P12_045N:  !!Total:!!Female:!!67 to 69 years: SEX BY AGE FOR SELECTED AGE CATEGORIES  
P12_046N:  !!Total:!!Female:!!70 to 74 years: SEX BY AGE FOR SELECTED AGE CATEGORIES  
P12_047N:  !!Total:!!Female:!!75 to 79 years: SEX BY AGE FOR SELECTED AGE CATEGORIES  
P12_048N:  !!Total:!!Female:!!80 to 84 years: SEX BY AGE FOR SELECTED AGE CATEGORIES  
P12_049N:  !!Total:!!Female:!!85 years and over: SEX BY AGE FOR SELECTED AGE CATEGORIES  
  
**ACS 5-year:**  
  
B18140_002E: Estimate!!Total:!!With a disability  



Taking our variable dicts and our list of Alaska zip codes, we now use another helper function to build a URL and query the census datasets. The raw result is a list of lists, where the first list contains the column names and each remaining list is a row of data. The function also renames the columns using the short names from the variable dicts, and then returns the result as a `pandas.DataFrame` for further processing.
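The helper itself (`fetch_vars_by_zip_to_df`) is defined in `api_functions.py` and is not reproduced here. The sketch below shows one way the two halves of that job could look: building the query URL and converting the list-of-lists response to a DataFrame. The function names and the `"short_name"` key are assumptions about the layout, not the notebook's actual code.

```python
from urllib.parse import quote
import pandas as pd

def build_zcta_url(base_url, var_codes, zip_codes):
    """Assemble <base>?get=VARS&for=zip code tabulation area:ZIPS."""
    geography = quote("zip code tabulation area")  # spaces become %20
    return f"{base_url}?get={','.join(var_codes)}&for={geography}:{','.join(zip_codes)}"

def rows_to_df(rows, var_dict):
    """Turn a list-of-lists response into a DataFrame with short column names."""
    header, *records = rows
    df = pd.DataFrame(records, columns=header)
    # "short_name" is an assumed key in the variable dictionaries.
    renames = {code: names["short_name"] for code, names in var_dict.items()}
    renames["zip code tabulation area"] = "ZCTA"
    df = df.rename(columns=renames)
    # Values arrive as strings here; convert the data columns to numbers.
    short_names = [names["short_name"] for names in var_dict.values()]
    df[short_names] = df[short_names].apply(pd.to_numeric)
    return df

example_vars = {
    "P12_002N": {"short_name": "total_male"},
    "P12_026N": {"short_name": "total_female"},
}
url = build_zcta_url(
    "https://api.census.gov/data/2020/dec/dhc", list(example_vars), ["99576", "99573"]
)
print(url)

# Rows shaped like an API response, reusing two values reported in this notebook.
example_rows = [
    ["P12_002N", "P12_026N", "zip code tabulation area"],
    ["1344", "1303", "99576"],
    ["514", "446", "99573"],
]
example_df = rows_to_df(example_rows, example_vars)
print(example_df)
```

With a few hundred zip codes the URL becomes very long. If a request is rejected, splitting the zip code list into batches and concatenating the resulting frames is one option.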

In [7]:
dhc_data = fetch_vars_by_zip_to_df(dhc_base, list(dhc_var_dict.keys()), dhc_var_dict, ak_zip_codes)
acs5_data = fetch_vars_by_zip_to_df(acs5_base, list(acs5_var_dict.keys()), acs5_var_dict, ak_zip_codes)

fetching data from: https://api.census.gov/data/2020/dec/dhc?get=P12_002N,P12_026N,P12_003N,P12_004N,P12_005N,P12_006N,P12_020N,P12_021N,P12_022N,P12_023N,P12_024N,P12_025N,P12_027N,P12_028N,P12_029N,P12_030N,P12_044N,P12_045N,P12_046N,P12_047N,P12_048N,P12_049N&for=zip%20code%20tabulation%20area:99501,99502,99503,99504,99505,99506,99507,99508,99509,99510,99511,99513,99514,99515,99516,99517,99518,99519,99520,99521,99522,99523,99524,99529,99530,99540,99545,99546,99547,99548,99549,99550,99551,99552,99553,99554,99555,99556,99557,99558,99559,99561,99563,99564,99565,99566,99567,99568,99569,99571,99572,99573,99574,99575,99576,99577,99578,99579,99580,99581,99583,99585,99586,99587,99588,99589,99590,99591,99599,99602,99603,99604,99605,99606,99607,99608,99609,99610,99611,99612,99613,99614,99615,99619,99620,99621,99622,99623,99624,99625,99626,99627,99628,99629,99630,99631,99632,99633,99634,99635,99636,99637,99638,99639,99640,99641,99643,99644,99645,99647,99648,99649,99650,99651,99652,99653,99654,

In [8]:
dhc_data.head()

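| | ZCTA | total_male | total_female | m_under_5 | m_5_to_9 | m_10_to_14 | m_15_to_17 | m_65_to_66 | m_67_to_69 | m_70_to_74 | ... | f_under_5 | f_5_to_9 | f_10_to_14 | f_15_to_17 | f_65_to_66 | f_67_to_69 | f_70_to_74 | f_75_to_79 | f_80_to_84 | f_85_plus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 99576 | 1344 | 1303 | 91 | 104 | 127 | 55 | 35 | 28 | 32 | ... | 112 | 125 | 106 | 53 | 29 | 33 | 38 | 27 | 6 | 9 |
| 1 | 99573 | 514 | 446 | 34 | 28 | 33 | 8 | 9 | 22 | 37 | ... | 24 | 23 | 26 | 14 | 16 | 19 | 18 | 11 | 4 | 2 |
| 2 | 99574 | 1430 | 1261 | 84 | 87 | 78 | 51 | 50 | 47 | 60 | ... | 84 | 72 | 74 | 50 | 45 | 34 | 51 | 28 | 14 | 19 |
| 3 | 99575 | 56 | 34 | 7 | 10 | 2 | 4 | 2 | 0 | 0 | ... | 2 | 2 | 4 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| 4 | 99577 | 13554 | 13393 | 966 | 1098 | 1089 | 590 | 255 | 341 | 440 | ... | 888 | 1030 | 983 | 566 | 269 | 353 | 404 | 207 | 104 | 83 |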


Compute some additional DHC columns

In [14]:
# compute some useful columns of data, and display the computed columns
dhc_data['total_population'] = dhc_data['total_male'] + dhc_data['total_female']
dhc_data['m_under_18'] = dhc_data['m_under_5'] + dhc_data['m_5_to_9'] + dhc_data['m_10_to_14'] + dhc_data['m_15_to_17']
dhc_data['f_under_18'] = dhc_data['f_under_5'] + dhc_data['f_5_to_9'] + dhc_data['f_10_to_14'] + dhc_data['f_15_to_17']
dhc_data['total_under_18'] = dhc_data['m_under_18'] + dhc_data['f_under_18']
dhc_data['m_65_plus'] = dhc_data['m_65_to_66'] + dhc_data['m_67_to_69'] + dhc_data['m_70_to_74'] + dhc_data['m_75_to_79'] + dhc_data['m_80_to_84'] + dhc_data['m_85_plus']
dhc_data['f_65_plus'] = dhc_data['f_65_to_66'] + dhc_data['f_67_to_69'] + dhc_data['f_70_to_74'] + dhc_data['f_75_to_79'] + dhc_data['f_80_to_84'] + dhc_data['f_85_plus']
dhc_data['total_65_plus'] = dhc_data['m_65_plus'] + dhc_data['f_65_plus']

dhc_data[['ZCTA', 'total_population', 'total_under_18', 'total_65_plus']]

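| | ZCTA | total_population | total_under_18 | total_65_plus |
|---|---|---|---|---|
| 0 | 99576 | 2647 | 773 | 282 |
| 1 | 99573 | 960 | 190 | 161 |
| 2 | 99574 | 2691 | 580 | 425 |
| 3 | 99575 | 90 | 32 | 6 |
| 4 | 99577 | 26947 | 7210 | 2825 |
| ... | ... | ... | ... | ... |
| 240 | 99709 | 28613 | 6339 | 3972 |
| 241 | 99712 | 13309 | 3071 | 1998 |
| 242 | 99714 | 1230 | 284 | 240 |
| 243 | 99720 | 192 | 58 | 31 |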
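The column sums in this section can also be written more compactly by building the list of column names and summing across the row axis. The sketch below is an alternative formulation rather than the notebook's own code; it reuses the under-18 values reported for two ZCTAs in this notebook, and the name `sample` is illustrative.

```python
import pandas as pd

# Under-18 age-band counts for two ZCTAs, as reported in this notebook.
sample = pd.DataFrame({
    "ZCTA": ["99576", "99573"],
    "m_under_5": [91, 34], "m_5_to_9": [104, 28],
    "m_10_to_14": [127, 33], "m_15_to_17": [55, 8],
    "f_under_5": [112, 24], "f_5_to_9": [125, 23],
    "f_10_to_14": [106, 26], "f_15_to_17": [53, 14],
})

# Build the column names from sex prefixes and age bands instead of typing each one.
age_bands = ["under_5", "5_to_9", "10_to_14", "15_to_17"]
under_18_cols = [f"{sex}_{band}" for sex in ("m", "f") for band in age_bands]

# Sum across columns (axis=1) to get one total per row.
sample["total_under_18"] = sample[under_18_cols].sum(axis=1)
print(sample[["ZCTA", "total_under_18"]])
```

The same pattern applies to the 65-and-over bands. It also makes it harder to leave out a band by accident when a long chain of `+` terms is typed by hand.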


Check out the ACS disability data. 

In [15]:
acs5_data

| | ZCTA | total_disability |
|---|---|---|
| 0 | 99693 | -666666666 |
| 1 | 99709 | 31936 |
| 2 | 99507 | 36276 |
| 3 | 99567 | -666666666 |
| 4 | 99923 | -666666666 |
| ... | ... | ... |
| 233 | 99825 | -666666666 |
| 234 | 99827 | -666666666 |
| 235 | 99929 | -666666666 |
| 236 | 99835 | 25000 |



We can see here that the ACS uses -666666666 as a no-data value, marking rows where an estimate could not be computed. We also see an apparent discrepancy for zip code 99709: the DHC total population adds up to 28,613, while the ACS value for `B18140_002E` is 31,936, which cannot be a count of people with a disability within that population. What gives? Part of the answer may be that the ACS has a different resident definition than the DHC, and that ACS numbers are estimates rather than direct counts. See [here](https://www.census.gov/content/dam/Census/library/publications/2020/acs/acs_general_handbook_2020_ch09.pdf) and [here](https://www.census.gov/programs-surveys/acs/guidance/comparing-acs-data.html) for some clues. It is also worth checking the `concept` field for table B18140 in the ACS variables JSON: that table appears to report median earnings by disability status, in which case these values would be dollar amounts rather than head counts.
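Before doing any arithmetic on the ACS column, the -666666666 placeholder should be converted to a proper missing value so that it cannot leak into sums or ratios. A minimal sketch, using four rows from the ACS output in this notebook (the names `ACS_NO_DATA`, `sample_acs`, and `cleaned` are illustrative):

```python
import pandas as pd

ACS_NO_DATA = -666666666  # placeholder value seen in the ACS output

# Four rows taken from the ACS output in this notebook.
sample_acs = pd.DataFrame({
    "ZCTA": ["99693", "99709", "99507", "99567"],
    "total_disability": [-666666666, 31936, 36276, -666666666],
})

cleaned = sample_acs.copy()
# mask() replaces values with NaN wherever the condition is True.
cleaned["total_disability"] = cleaned["total_disability"].mask(
    cleaned["total_disability"] == ACS_NO_DATA
)
usable = int(cleaned["total_disability"].notna().sum())
print(cleaned)
print(f"{usable} of {len(cleaned)} ZCTAs have a usable estimate")
```

Once the placeholder is NaN, pandas aggregations such as `mean()` skip it by default instead of being skewed by a large negative number.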



This notebook shows that the data are easily accessed via the API, provided you know which geography you are looking for and have manually combed through the data variables to determine your targets. We also see that the data are not necessarily available for every geography, and that cross-dataset comparisons or calculations need to be approached with caution and nuance.