In [1]:
import census
import pandas as pd

Running the `get_census` pipeline itself is fairly straightforward. Most of the complexity lies in defining your variables and setting up the YAML document that the DataPlan object uses to assemble the data. Earlier documentation on that can be found [here](https://github.com/NSAPH/data_requests/blob/master/request_projects/apr2020_tidy_census_confounders/readme.md).

The basic workflow for getting the data is shown below. We use state-level geography to keep the data small.

In [16]:
# The package is imported as `census` above, so census_years is called through that name.
plan = census.DataPlan("census_vars.yml", geometry="state", years=census.census_years(2000, 2019))
plan.assemble_data()
print(plan.data)

2000 hispanic_pct
2000 poverty
2000 poverty_mcare
2000 population
2000 median_house_value
2000 blk_pct
2000 white_pct
2000 native_pct
2000 asian_pct
2000 no_grad
2000 no_grad_mcare
2000 median_household_income
2000 owner_occupied
2000 median_age
2000 age_pct_0_14
2000 age_pct_15_44
2000 age_pct_45_65
2000 age_pct_65_plus
2009 hispanic_pct
2009 poverty
2009 poverty_mcare
2009 population
2009 median_house_value
2009 blk_pct
2009 white_pct
2009 native_pct
2009 asian_pct
2009 no_grad
2009 no_grad_mcare
2009 median_household_income
2009 owner_occupied
2009 median_age
2009 age_pct_0_14
2009 age_pct_15_44
2009 age_pct_45_65
2009 age_pct_65_plus
2010 hispanic_pct
2010 poverty
2010 poverty_mcare
2010 population
2010 median_house_value
2010 blk_pct
2010 white_pct
2010 native_pct
2010 asian_pct
2010 no_grad
2010 no_grad_mcare
2010 median_household_income
2010 owner_occupied
2010 median_age
2010 age_pct_0_14
2010 age_pct_15_44
2010 age_pct_45_65
2010 age_pct_65_plus
2011 hispanic_pct
2011 poverty


Census API response times vary widely. Use the code in the following blocks either to save the plan object or to load a saved plan object and resume the walkthrough after the API query step.

In [17]:
import pickle

Code to save:

In [18]:
with open("census_walkthrough_data.pkl", 'wb') as output:
    pickle.dump(plan, output, pickle.HIGHEST_PROTOCOL)

Code to load:

In [13]:
with open("census_walkthrough_data.pkl", 'rb') as in_data:
    plan = pickle.load(in_data)
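
The save and load steps can also be combined into a single helper that only runs the slow step when no cached file exists. The following is a minimal sketch, not part of the package: `load_or_build` and `slow_build` are hypothetical names, and `slow_build` stands in for a function that would create the DataPlan and call `assemble_data()`.

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Return the object pickled at `path`, or call `build()` and cache the result."""
    if os.path.exists(path):
        with open(path, "rb") as in_data:
            return pickle.load(in_data)
    obj = build()
    with open(path, "wb") as output:
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)
    return obj


# Hypothetical stand-in for the slow API query step.
calls = []


def slow_build():
    calls.append(1)
    return {"rows": 3}


cache_path = os.path.join(tempfile.mkdtemp(), "demo_cache.pkl")
first = load_or_build(cache_path, slow_build)   # runs slow_build and writes the cache
second = load_or_build(cache_path, slow_build)  # reads the cache; slow_build is not called
```

As with any pickle, only load files you created yourself, since unpickling can execute arbitrary code.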

In honor of Dorothy (and to get a better sense of the data), let's take a look at just the data for Kansas.

In [19]:
plan.data[plan.data['state'] == "20"]

Unnamed: 0,state,year,hispanic_pct,poverty,poverty_mcare,population,median_house_value,blk_pct,white_pct,native_pct,asian_pct,no_grad,no_grad_mcare,median_household_income,owner_occupied,median_age,age_pct_0_14,age_pct_15_44,age_pct_45_65,age_pct_65_plus
13,20,2000,0.070023,0.098958,0.081171,2688418,83500,0.057356,0.860708,0.009275,0.01741,0.180196,0.262679,40624,0.692465,35.2,0.218828,0.43501,0.213657,0.132505
68,20,2009,0.088059,0.122153,0.080468,2777835,118500,0.056562,0.857649,0.00818,0.021684,0.19559,0.181598,48394,0.69497,35.9,0.208556,0.410696,0.250969,0.129779
122,20,2010,0.105163,0.124319,0.08098,2853118,122600,0.068633,0.838055,0.020076,0.028496,0.193603,0.171623,49424,0.677578,36.0,0.212688,0.396993,0.258493,0.131826
172,20,2011,0.10159,0.126402,0.078475,2830985,125500,0.057599,0.852249,0.008428,0.023855,0.189131,0.162217,50594,0.689804,36.1,0.211361,0.40031,0.25691,0.131419
224,20,2012,0.104741,0.132135,0.074834,2851183,127400,0.057582,0.85412,0.008219,0.024089,0.184947,0.152831,51273,0.68216,36.0,0.211456,0.398145,0.257728,0.132671
276,20,2013,0.10743,0.137482,0.07559,2868107,128400,0.057285,0.853968,0.008353,0.024549,0.182462,0.143891,51332,0.675171,36.0,0.211326,0.397344,0.256659,0.134671
328,20,2014,0.109659,0.137901,0.075432,2882946,129400,0.058003,0.852536,0.00823,0.025171,0.179706,0.137959,51872,0.671371,36.0,0.210385,0.396973,0.255608,0.137033
380,20,2015,0.111725,0.135702,0.074327,2892987,132000,0.05821,0.85191,0.008304,0.02615,0.176474,0.128367,52205,0.666891,36.0,0.209019,0.397416,0.253814,0.13975
432,20,2016,0.11308,0.132506,0.075375,2898292,135300,0.057864,0.851854,0.008236,0.026776,0.175209,0.120945,53571,0.663474,36.2,0.207848,0.395862,0.252921,0.14337
511,20,2017,0.115317,0.128103,0.07639,2903820,139200,0.058017,0.84906,0.008094,0.027804,0.172995,0.115076,55477,0.66442,36.3,0.206511,0.39594,0.250751,0.146797


What's happening under the covers is more interesting. We can output (part of) the plan to start to get a sense:

In [4]:
for k in plan.plan.keys():
    print(k)
    for i in range(3):
        print(plan.plan[k][i])

2000
Name: hispanic_pct
Dataset: census
Num: ['P004002']
Den: ['P001001']
Name: poverty
Dataset: census
Num: ['P087002']
Den: ['P087001']
Name: poverty_mcare
Dataset: census
Num: ['P087008', 'P087009']
Den: ['P087008', 'P087009', 'P087016', 'P087017']
2009
Name: hispanic_pct
Dataset: acs
Num: ['B03003_003E']
Den: ['B03003_001E']
Name: poverty
Dataset: acs
Num: ['B17001_002E']
Den: ['B17001_001E']
Name: poverty_mcare
Dataset: acs
Num: ['B17001_015E', 'B17001_016E', 'B17001_029E', 'B17001_030E']
Den: ['B17001_015E', 'B17001_016E', 'B17001_029E', 'B17001_030E', 'B17001_044E', 'B17001_045E', 'B17001_058E', 'B17001_059E']
2010
Name: hispanic_pct
Dataset: census
Num: ['P004003']
Den: ['P004001']
Name: poverty
Dataset: acs
Num: ['B17001_002E']
Den: ['B17001_001E']
Name: poverty_mcare
Dataset: acs
Num: ['B17001_015E', 'B17001_016E', 'B17001_029E', 'B17001_030E']
Den: ['B17001_015E', 'B17001_016E', 'B17001_029E', 'B17001_030E', 'B17001_044E', 'B17001_045E', 'B17001_058E', 'B17001_059E']
2011
Na

Here we see that each year has a set of variables, each of which defines a value calculated from census variables, along with the dataset it is drawn from. The source varies by year, and years without available data are skipped. These variables are defined in the `.yml` file used when creating the DataPlan object.
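
To make the `Num` and `Den` lists concrete, here is a minimal sketch of how such a variable could be computed, assuming the value is the sum of the numerator codes divided by the sum of the denominator codes. That reading is an assumption based on the plan output above, not a confirmed description of the package internals. The `ratio` helper is hypothetical and the counts are made up.

```python
def ratio(raw, num, den):
    """Sum the numerator codes and divide by the sum of the denominator codes."""
    denominator = sum(raw[code] for code in den)
    if denominator == 0:
        return float("nan")
    return sum(raw[code] for code in num) / denominator


# Made-up counts for one geography (illustrative only, not real census values).
raw_counts = {"P087008": 30, "P087009": 20, "P087016": 250, "P087017": 200}

# Codes taken from the 2000 `poverty_mcare` entry in the plan output above.
poverty_mcare = ratio(
    raw_counts,
    num=["P087008", "P087009"],
    den=["P087008", "P087009", "P087016", "P087017"],
)
print(poverty_mcare)  # 50 / 500
```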

Looking at the Kansas data (and at the output from the data plan), we can see that the years 2001-2008 have no data. It is important that this missing data be represented explicitly, which is what the `DataPlan.create_missingness()` method does. Note that after this step the spatial identifier is `geoid` rather than `state`. The `geoid` variable stores all location information in a single variable rather than across several (as it arrives from the census). `geoid` should generally correspond to the FIPS code, although this is only ensured for single-variable codes and for county values (where an explicit case is defined).

In [20]:
plan.create_missingness()
plan.data[plan.data['geoid'] == "20"]

Unnamed: 0,year,geoid,state,hispanic_pct,poverty,poverty_mcare,population,median_house_value,blk_pct,white_pct,...,asian_pct,no_grad,no_grad_mcare,median_household_income,owner_occupied,median_age,age_pct_0_14,age_pct_15_44,age_pct_45_65,age_pct_65_plus
13,2000,20,20.0,0.070023,0.098958,0.081171,2688418.0,83500.0,0.057356,0.860708,...,0.01741,0.180196,0.262679,40624.0,0.692465,35.2,0.218828,0.43501,0.213657,0.132505
65,2001,20,,,,,,,,,...,,,,,,,,,,
117,2002,20,,,,,,,,,...,,,,,,,,,,
169,2003,20,,,,,,,,,...,,,,,,,,,,
221,2004,20,,,,,,,,,...,,,,,,,,,,
273,2005,20,,,,,,,,,...,,,,,,,,,,
325,2006,20,,,,,,,,,...,,,,,,,,,,
377,2007,20,,,,,,,,,...,,,,,,,,,,
429,2008,20,,,,,,,,,...,,,,,,,,,,
481,2009,20,20.0,0.088059,0.122153,0.080468,2777835.0,118500.0,0.056562,0.857649,...,0.021684,0.19559,0.181598,48394.0,0.69497,35.9,0.208556,0.410696,0.250969,0.129779
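
The effect of making missing years explicit can be reproduced on a toy table with plain pandas. This is a sketch of the general idea only and is not how `create_missingness()` is necessarily implemented. The data values are made up.

```python
import pandas as pd

# Toy data: one geography with a gap between 2000 and 2003 (made-up values).
df = pd.DataFrame({
    "geoid": ["20", "20", "20"],
    "year": [2000, 2003, 2004],
    "population": [100.0, 130.0, 140.0],
})

# Build every (geoid, year) combination over the full year range.
all_years = range(int(df["year"].min()), int(df["year"].max()) + 1)
full_index = pd.MultiIndex.from_product(
    [df["geoid"].unique(), all_years], names=["geoid", "year"]
)

# Reindexing adds rows for the absent years and fills their values with NaN.
explicit = df.set_index(["geoid", "year"]).reindex(full_index).reset_index()
print(explicit)
```

The rows for 2001 and 2002 now exist with `NaN` in `population`, mirroring the empty Kansas rows above.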


In [21]:
plan.write_data("uninterpolated_state_census.csv")

In [22]:
plan.interpolate()
plan.data[plan.data['geoid'] == "20"]

Interpolating owner_occupied
Interpolating median_household_income
Interpolating age_pct_15_44
Interpolating hispanic_pct
Interpolating native_pct
Interpolating no_grad_mcare
Interpolating age_pct_45_65
Interpolating median_house_value
Interpolating poverty
Interpolating blk_pct
Interpolating no_grad
Interpolating poverty_mcare
Interpolating median_age
Interpolating population
Interpolating asian_pct
Interpolating white_pct
Interpolating age_pct_0_14
Interpolating age_pct_65_plus


Unnamed: 0,year,geoid,state,hispanic_pct,poverty,poverty_mcare,population,median_house_value,blk_pct,white_pct,...,asian_pct,no_grad,no_grad_mcare,median_household_income,owner_occupied,median_age,age_pct_0_14,age_pct_15_44,age_pct_45_65,age_pct_65_plus
13,2000,20,20.0,0.070023,0.098958,0.081171,2688418.0,83500.0,0.057356,0.860708,...,0.01741,0.180196,0.262679,40624.0,0.692465,35.2,0.218828,0.43501,0.213657,0.132505
65,2001,20,,0.070163,0.099138,0.081165,2689111.0,83771.317829,0.05735,0.860685,...,0.017443,0.180315,0.262051,40684.232558,0.692484,35.205426,0.218748,0.434822,0.213946,0.132484
117,2002,20,,0.07057,0.099661,0.081149,2691128.0,84560.606061,0.057332,0.860616,...,0.01754,0.180662,0.260222,40859.454545,0.692541,35.221212,0.218516,0.434273,0.214788,0.132422
169,2003,20,,0.072027,0.101536,0.081093,2698353.0,87388.888889,0.057268,0.860368,...,0.017885,0.181906,0.25367,41487.333333,0.692743,35.277778,0.217686,0.432309,0.217803,0.132202
221,2004,20,,0.076035,0.10669,0.080936,2718224.0,95166.666667,0.057092,0.859689,...,0.018835,0.185327,0.235652,43214.0,0.6933,35.433333,0.215404,0.426905,0.226095,0.131597
273,2005,20,,0.087826,0.116896,0.080772,2774302.0,110775.0,0.059779,0.853515,...,0.022319,0.191245,0.199374,46709.0,0.689996,35.75,0.212157,0.413349,0.243522,0.130973
325,2006,20,,0.09376,0.122875,0.080638,2802929.0,119866.666667,0.060586,0.851118,...,0.023955,0.194928,0.178273,48737.333333,0.689172,35.933333,0.209933,0.406128,0.253477,0.130462
377,2007,20,,0.094879,0.123379,0.080329,2806937.0,120671.428571,0.060159,0.851279,...,0.02394,0.194099,0.175979,49002.571429,0.689263,35.957143,0.210137,0.405297,0.253968,0.130598
429,2008,20,,0.095536,0.123963,0.079963,2809887.0,121120.0,0.059988,0.851469,...,0.02395,0.193489,0.174436,49153.933333,0.688789,35.96,0.210225,0.40482,0.254218,0.130737
481,2009,20,20.0,0.088059,0.122153,0.080468,2777835.0,118500.0,0.056562,0.857649,...,0.021684,0.19559,0.181598,48394.0,0.69497,35.9,0.208556,0.410696,0.250969,0.129779
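
For comparison, the simplest way to fill gaps like these is per-geography linear interpolation in pandas, sketched below on made-up data. This is an assumption-laden illustration and not the package's method: the interpolated Kansas values above are not evenly spaced between 2000 and 2009, so `plan.interpolate()` appears to use a different scheme.

```python
import numpy as np
import pandas as pd

# Toy data: two geographies with gaps (made-up values).
df = pd.DataFrame({
    "geoid": ["20"] * 4 + ["31"] * 4,
    "year": [2000, 2001, 2002, 2003] * 2,
    "population": [100.0, np.nan, np.nan, 130.0, 10.0, np.nan, 30.0, 40.0],
})

# Sort so that each geography's years are in order, then interpolate
# within each geography so values never leak across geoids.
df = df.sort_values(["geoid", "year"])
df["population"] = df.groupby("geoid")["population"].transform(
    lambda s: s.interpolate(method="linear")
)
print(df)
```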


In [23]:
plan.write_data("interpolated_state_census.csv")
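
When reading a written file back with pandas, it can help to keep `geoid` as a string so that FIPS codes with leading zeros survive. The sketch below assumes `write_data` produces a plain CSV with a `geoid` column, which is not confirmed here. It uses a small stand-in file with made-up values rather than the real output.

```python
import os
import tempfile

import pandas as pd

# Stand-in for a file produced by plan.write_data(); values are made up.
path = os.path.join(tempfile.mkdtemp(), "demo_state_census.csv")
pd.DataFrame({
    "year": [2000, 2001],
    "geoid": ["01", "01"],
    "population": [100.0, 110.0],
}).to_csv(path, index=False)

# Without the dtype argument pandas would parse "01" as the integer 1.
reloaded = pd.read_csv(path, dtype={"geoid": str})
print(reloaded["geoid"].tolist())
```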