# Building the `geosnap` Census data

This notebook demonstrates internal functions the geosnap team uses to build its curated datasets from the U.S. Census Bureau. Most users will not need these, but we include the code here for transparency and reproducibility.

The process is straightforward, designed to do as little data manipulation as possible, and takes place in four steps:

1. download bulk data from the Census FTP Server
2. rename variables to match the detailed tables API, and reformat data into high efficiency parquet files
3. compute additional variables to match the LTDB set and attach geometries
4. upload all data to `quilt`

As a result, the raw (but efficient!) data are available [here](https://open.quiltdata.com/b/spatial-ucr/tree/census/demographic_profile/), and the convenient data with the most commonly used variables (with simple names) are available [here](https://open.quiltdata.com/b/spatial-ucr/tree/census/acs/).

In [1]:
import ogrio
from geosnap.io.util import get_census_gdb, convert_census_gdb
from geosnap.util import process_acs

The `get_census_gdb` function fetches geodatabases containing ACS demographic profile data (which contain most of the useful variables) and stores them locally for processing.

## 1. Collecting raw data from census

In [2]:
get_census_gdb?

Signature: get_census_gdb(years=None, geom_level='blockgroup', output_dir=None)
Docstring:
Fetch file geodatabases of ACS demographic profile data from the Census bureau server.

Parameters
----------
years : list, optional
    set of years to download (2010 onward), defaults to 2010-2018
geom_level : str, optional
    geographic unit to download (tract or blockgroup), by default "blockgroup"
output_dir : str, optional
    output directory to write files, by default None
File:      ~/Dropbox/projects/geosnap/geosnap/io/util.py
Type:      function


In [None]:
get_census_gdb(years=['2018'],  geom_level='blockgroup', output_dir='.')

D: 35% -  1.3GiB  /  3.6GiB  eta 1:20:255:02:42

## 2. Converting to efficient data formats

Using `ogrio` we can quickly list all the layers present in the geodatabase

In [6]:
# returns an array of arrays, with the inner = [name, geometry]
# we only need the name
[layer[0] for layer in ogrio.list_layers("ACS_2014_5YR_BG.gdb.zip")]

['ACS_2014_5YR_BG',
 'BG_METADATA_2014',
 'X00_COUNTS',
 'X01_AGE_AND_SEX',
 'X02_RACE',
 'X03_HISPANIC_OR_LATINO_ORIGIN',
 'X07_MIGRATION',
 'X08_COMMUTING',
 'X09_CHILDREN_HOUSEHOLD_RELATIONSHIP',
 'X11_HOUSEHOLD_FAMILY_SUBFAMILIES',
 'X12_MARITAL_STATUS_AND_HISTORY',
 'X14_SCHOOL_ENROLLMENT',
 'X15_EDUCATIONAL_ATTAINMENT',
 'X16_LANGUAGE_SPOKEN_AT_HOME',
 'X17_POVERTY',
 'X19_INCOME',
 'X20_EARNINGS',
 'X21_VETERAN_STATUS',
 'X22_FOOD_STAMPS',
 'X23_EMPLOYMENT_STATUS',
 'X24_INDUSTRY_OCCUPATION',
 'X27_HEALTH_INSURANCE',
 'X99_IMPUTATION',
 'X25_HOUSING_CHARACTERISTICS']
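
The listing mixes one geometry layer, one metadata layer, and many attribute tables, so it can help to split the names before converting. A minimal sketch, assuming attribute tables are the layers whose names start with `X` followed by two digits (an assumption based on the listing above, not a documented rule); `split_layers` is a hypothetical helper, not part of geosnap:

```python
import re

# Hypothetical helper (not part of geosnap): separates attribute tables from
# the remaining layers, assuming table names look like "X" + two digits + "_".
TABLE_PATTERN = re.compile(r"^X\d{2}_")


def split_layers(layer_names):
    """Return (tables, others) from an iterable of layer names."""
    tables = [name for name in layer_names if TABLE_PATTERN.match(name)]
    others = [name for name in layer_names if not TABLE_PATTERN.match(name)]
    return tables, others


# A few names copied from the listing above
names = ["ACS_2014_5YR_BG", "BG_METADATA_2014", "X00_COUNTS", "X01_AGE_AND_SEX"]
tables, others = split_layers(names)
print(tables)
print(others)
```

The `tables` list could then be passed as the `layers` argument of `convert_census_gdb`.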

File geodatabases can be convenient, but they are also painfully slow to process in Python, so the `convert_census_gdb` function converts the layers in a gdb to parquet files instead.

In [7]:
convert_census_gdb?

Signature:
convert_census_gdb(
    file,
    layers,
    year=None,
    level='bg',
    save_intermediate=True,
    output_dir='.',
)
Docstring:
Convert file geodatabases from Census into (set of) parquet files.

Parameters
----------
file : str
    path to file geodatabase
layers : list
    set of layers to extract from gdb
year : str, optional
    [description], by default None
level : str, optional
    geographic level of data ('bg' for blockgroups or 'tr' for tract), by default "bg"
save_intermediate : bool, optional
    if true, each layer will be stored 

In [8]:
convert_census_gdb(file='ACS_2014_5YR_BG.gdb.zip', save_intermediate=True, layers=['X00_COUNTS'], year='2014', level='bg')

X00_COUNTS


In [9]:
import pandas as pd

In [10]:
pd.read_parquet('acs_2014_X00_COUNTS_bg.parquet')

| GEOID | B00001_001E | B00002_001E |
| --- | --- | --- |
| 15000US020130001001 | 283.0 | 89.0 |
| 15000US020130001002 | 410.0 | 138.0 |
| 15000US020130001003 | 473.0 | 183.0 |
| 15000US020160001001 | 418.0 | 486.0 |
| 15000US020160002001 | 561.0 | 211.0 |
| ... | ... | ... |
| 15000US560459511001 | 223.0 | 124.0 |
| 15000US560459511002 | 234.0 | 63.0 |
| 15000US560459513001 | 95.0 | 46.0 |
| 15000US560459513002 | 92.0 | 46.0 |
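
The `GEOID` index in this table uses the long ACS form: a summary-level prefix ending in `US`, followed by the FIPS codes. A minimal sketch of splitting a block group identifier into its parts, assuming the standard 2-3-6-1 digit layout (state, county, tract, block group); `parse_bg_geoid` is a hypothetical helper, not part of geosnap:

```python
def parse_bg_geoid(geoid):
    """Split a long-form ACS block group GEOID into its components."""
    prefix, sep, fips = geoid.partition("US")
    if not sep or len(fips) != 12 or not fips.isdigit():
        raise ValueError(f"not a block group GEOID: {geoid!r}")
    return {
        "summary_level": prefix[:3],  # "150" denotes block groups
        "state": fips[:2],
        "county": fips[2:5],
        "tract": fips[5:11],
        "block_group": fips[11],
        "fips": fips,  # the 12-digit code without the summary-level prefix
    }


# First GEOID from the table above
parts = parse_bg_geoid("15000US020130001001")
print(parts["state"], parts["county"], parts["tract"], parts["block_group"])
```

The short 12-digit `fips` value is often the more convenient key when joining against other Census products that omit the prefix.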


## 3. Computing intermediate variables

The resulting (combined) parquet files can be processed with `process_acs` to generate the datasets described in the geosnap codebook. For each year, we merge all the files for a single geography into a single, massive dataset, then compute any variables we need and keep that subset.


Note that unless *all* layers are processed from the geodatabase, several variables will be unavailable. You can merge them all together with something like the following cell:

In [None]:
import os

data_dir = "/Users/knaaptime/Dropbox/projects/geosnap/data/census/2019/"
# files to leave out of the merge (the second one is the geometry layer)
skip = ['acs_2019_bg.parquet', 'acs_2019_ACS_2019_5YR_BG_bg.parquet']
dfs = []
for file in os.listdir(data_dir):
    if file.endswith('bg.parquet') and file not in skip:
        # pandas (imported above) is enough to read the attribute tables
        dfs.append(pd.read_parquet(data_dir + file))
df = pd.concat(dfs)
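
The file selection in the loop above can be checked in isolation. A minimal sketch using only the standard library, assuming the directory holds one parquet file per layer plus the two files that should be left out; `select_layer_files` is a hypothetical helper, and the demo file names are illustrative apart from the two excluded names taken from the cell above:

```python
import tempfile
from pathlib import Path

# Names excluded in the cell above
EXCLUDED = {"acs_2019_bg.parquet", "acs_2019_ACS_2019_5YR_BG_bg.parquet"}


def select_layer_files(data_dir, suffix="bg.parquet", excluded=EXCLUDED):
    """Return sorted paths of per-layer files, skipping the excluded names."""
    return sorted(
        path
        for path in Path(data_dir).iterdir()
        if path.name.endswith(suffix) and path.name not in excluded
    )


# Demo on empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in [
        "acs_2019_X00_COUNTS_bg.parquet",
        "acs_2019_X02_RACE_bg.parquet",
        "acs_2019_bg.parquet",
        "acs_2019_ACS_2019_5YR_BG_bg.parquet",
        "acs_2019_X00_COUNTS_tr.parquet",
    ]:
        (Path(tmp) / name).touch()
    selected = [path.name for path in select_layer_files(tmp)]

print(selected)
```

Sorting the paths keeps the merge order stable between runs, which `os.listdir` alone does not guarantee.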


In [None]:
df = process_acs(df.reset_index())

To complete the process, you need to merge the geometries (`acs_{year}_ACS_{year}_5YR_{geom_level}_{geom_level}.parquet`) with the processed variables (`df`).
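
A minimal sketch of that final merge, using plain pandas frames as stand-ins. It assumes both tables carry a `GEOID` key (the geometry table indexed by it, the processed table holding it as a column); the `geometry` strings and the `n_total_pop` column are placeholders for illustration, not real output. With real data, the left-hand table would be the geometry layer read with `geopandas.read_parquet`:

```python
import pandas as pd

# Stand-in for the geometry layer, indexed by GEOID (placeholder geometries)
geoms = pd.DataFrame(
    {"geometry": ["POLYGON A", "POLYGON B"]},
    index=pd.Index(["15000US020130001001", "15000US020130001002"], name="GEOID"),
)

# Stand-in for the processed variables, with GEOID as an ordinary column
variables = pd.DataFrame(
    {
        "GEOID": ["15000US020130001001", "15000US020130001002"],
        "n_total_pop": [283.0, 410.0],  # hypothetical column name
    }
)

# Left join keeps every geometry even if a unit has no processed variables;
# validate="one_to_one" raises if either side has duplicate GEOIDs
merged = geoms.reset_index().merge(
    variables, on="GEOID", how="left", validate="one_to_one"
)
print(merged.shape)
```

Keeping the geometry table on the left means units without processed variables still appear in the result, with missing values in the variable columns.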

## 4. Uploading to quilt

Follow the [packaging instructions](https://docs.quiltdata.com/walkthrough/uploading-a-package) from `quilt`.

To authenticate to the spatial-ucr S3 server, you need an `.aws` config file with your auth parameters in your home directory.