# IPUMS NHGIS dataset metadata API
This notebook adapts code provided by IPUMS to request metadata for identifying relevant datasets and shapefiles, which will subsequently be requested and downloaded using the IPUMS NHGIS dataset API.

For clarification, this notebook contains code to access the dataset *metadata* API. A separate notebook contains code to access the dataset API. 

**Data required**

Dataset(s): 2009 American Community Survey (ACS) 5-year estimates 

Geographic units: Block group and county level

Geographic extent: Galveston County, Texas

Fields of interest: Total population; total households; population by race and ethnicity; median household income

Shapefiles: block group and county geographies

**See the IPUMS dataset metadata API example here:**

https://developer.ipums.org/docs/workflows/explore_metadata/nhgis/datasets/

# Preliminary Operations

#### 1. Install and import packages

## 1. Install and import packages

In [1]:
## packages needed for NHGIS download
import requests # For sending HTTP requests to the NHGIS metadata API
import json # For serializing API responses to JSON strings
from pprint import pprint # For pretty-printing API responses

## general packages needed
import os   # For saving output to path
import sys  # For checking version of python for replication
import pandas as pd # For reading, writing and wrangling data
from pandas import json_normalize # For wrangling NHGIS metadata JSON

In [2]:
## Display versions being used - important information for replication

print("Python version     ", sys.version)
print("pandas version:    ", pd.__version__)
print("requests version:  ", requests.__version__)
print("json version:      ", json.__version__)

Python version      3.6.9 (default, Apr 18 2020, 01:56:04) 
[GCC 8.4.0]
pandas version:     1.0.5
requests version:   2.23.0
json version:       2.0.9


# NHGIS Dataset metadata API request

This section will request and clean several different levels of metadata from the NHGIS API. 

This review of metadata will aid in identifying the parameters for the relevant dataset(s) and table(s) of interest.

Each time a more detailed level of metadata is requested the URL must be updated with the correct value(s).

#### 1. Set API key
#### 2. Get high-level metadata for all datasets
#### 3. Get detailed metadata for a single dataset
#### 4. Get metadata for a table
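
Because each level of detail uses a longer URL, the pattern can be captured in a small helper rather than edited by hand. The following is only a sketch: the function name `metadata_url` is hypothetical and not part of the IPUMS example, while the base URL and the `version=v1` query string are copied from the requests used later in this notebook.

```python
# Hypothetical helper (not from the IPUMS example): builds the metadata URLs
# used in this notebook from a dataset name and an optional data table name.
BASE_URL = "https://api.ipums.org/metadata/nhgis/datasets"


def metadata_url(dataset=None, data_table=None, version="v1"):
    """Return the metadata URL for all datasets, one dataset, or one table."""
    if data_table is not None and dataset is None:
        raise ValueError("a data_table requires a dataset")
    parts = [BASE_URL]
    if dataset is not None:
        parts.append(dataset)
    if data_table is not None:
        parts.extend(["data_tables", data_table])
    return "/".join(parts) + "?version=" + version


print(metadata_url())                             # all datasets
print(metadata_url("2005_2009_ACS5a"))            # a single dataset
print(metadata_url("2005_2009_ACS5a", "B01003"))  # a single table
```

The three calls correspond to the three levels of metadata requested in the sections that follow.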


## 1. Set API Key
Set the variable `my_key` to your personal NHGIS API key, obtained from https://developer.ipums.org/docs/get-started/.

In [None]:
my_key = "typekeyhere"
my_headers = {"Authorization": my_key}
print("my_key is now set to:", my_key)
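
If the notebook will be shared, it may be preferable to keep the key out of the code and out of the printed output. One possible approach, sketched below, reads the key from an environment variable. The variable name `IPUMS_API_KEY` and the function `build_headers` are assumptions made for illustration; they are not defined by IPUMS.

```python
import os


def build_headers(env_var="IPUMS_API_KEY"):
    """Build the Authorization header from an environment variable.

    Assumption: the key was exported beforehand under the (hypothetical)
    name IPUMS_API_KEY, e.g. in the shell that launched the notebook.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return {"Authorization": key}
```

With the variable set, `my_headers = build_headers()` could replace the hard-coded assignment above.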

## 2. Get high-level metadata for all datasets
This API call returns a list of all available datasets. Each entry includes the dataset’s **name**, **group**, **description**, and **sequence**. The **name** field holds the unique identifier for each dataset, which is needed to retrieve details about a single dataset (see the next section).

In [4]:
my_headers = {"Authorization": my_key}
url = "https://api.ipums.org/metadata/nhgis/datasets?version=v1"
nhgis_alldatasets = requests.get(url, headers=my_headers)
pprint(nhgis_alldatasets.json())

[{'description': 'Population Data [US, States & Counties]',
  'group': '1790 Census',
  'name': '1790_cPop',
  'sequence': 101},
 {'description': 'Population Data [US, States & Counties]',
  'group': '1800 Census',
  'name': '1800_cPop',
  'sequence': 201},
 {'description': 'Population Data [US, States & Counties]',
  'group': '1810 Census',
  'name': '1810_cPop',
  'sequence': 301},
 {'description': 'Population Data [US, States & Counties]',
  'group': '1820 Census',
  'name': '1820_cPop',
  'sequence': 401},
 {'description': 'Population Data [US, States & Counties]',
  'group': '1830 Census',
  'name': '1830_cPop',
  'sequence': 501},
 {'description': 'Agriculture Data [US, States & Counties]',
  'group': '1840 Census',
  'name': '1840_cAg',
  'sequence': 601},
 {'description': 'Manufacturing Data [US, States & Counties]',
  'group': '1840 Census',
  'name': '1840_cMfg',
  'sequence': 602},
 {'description': 'Population & Other Data [US, States & Counties]',
  'group': '1840 Census',


From the preceding output, the data and parameters of interest include:



```
{'description': '5-Year Data [2005-2009, Block Groups & Larger Areas]',
   'group': '2009 American Community Survey',
   'name': '2005_2009_ACS5a',
   'sequence': 4603},
```
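
Rather than scanning the full listing by eye, the entry can also be located programmatically. The sketch below filters a list of dataset records by the text of their `group` field. The sample records are copied from the output above, and the function name `find_datasets` is hypothetical.

```python
# A few entries copied from the output above; the real response has many more.
datasets = [
    {"description": "Population Data [US, States & Counties]",
     "group": "1790 Census", "name": "1790_cPop", "sequence": 101},
    {"description": "Agriculture Data [US, States & Counties]",
     "group": "1840 Census", "name": "1840_cAg", "sequence": 601},
    {"description": "5-Year Data [2005-2009, Block Groups & Larger Areas]",
     "group": "2009 American Community Survey",
     "name": "2005_2009_ACS5a", "sequence": 4603},
]


def find_datasets(records, group_text):
    """Return entries whose 'group' contains group_text (case-insensitive)."""
    needle = group_text.lower()
    return [r for r in records if needle in r["group"].lower()]


matches = find_datasets(datasets, "2009 American Community Survey")
print([m["name"] for m in matches])
```

Against the live response, the same function could be called as `find_datasets(nhgis_alldatasets.json(), "2009 American Community Survey")`. It may return more than one dataset for a group, so the descriptions still need to be read.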



The JSON output may be difficult to sort through. We can convert the JSON output to a pandas dataframe for table formatting and easier review.

In the following codeblock, the JSON is converted to a dataframe. The total number of columns and rows are displayed along with the first 5 observations in the dataframe.

In [5]:
nhgis_alldatasets_json = json.dumps(nhgis_alldatasets.json())
nhgis_alldatasets_df = pd.read_json(nhgis_alldatasets_json)
print('Number of columns in Dataframe : ', len(nhgis_alldatasets_df.columns))
print('Number of rows in Dataframe : ', len(nhgis_alldatasets_df.index))
pd.set_option('display.max_colwidth', None)
nhgis_alldatasets_df.head()

Number of columns in Dataframe :  4
Number of rows in Dataframe :  238


| | name | group | description | sequence |
|---|---|---|---|---|
| 0 | 1790_cPop | 1790 Census | Population Data [US, States & Counties] | 101 |
| 1 | 1800_cPop | 1800 Census | Population Data [US, States & Counties] | 201 |
| 2 | 1810_cPop | 1810 Census | Population Data [US, States & Counties] | 301 |
| 3 | 1820_cPop | 1820 Census | Population Data [US, States & Counties] | 401 |
| 4 | 1830_cPop | 1830 Census | Population Data [US, States & Counties] | 501 |


Now all observations in the dataframe are displayed.

In [6]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
nhgis_alldatasets_df.head(len(nhgis_alldatasets_df.index))

| | name | group | description | sequence |
|---|---|---|---|---|
| 0 | 1790_cPop | 1790 Census | Population Data [US, States & Counties] | 101 |
| 1 | 1800_cPop | 1800 Census | Population Data [US, States & Counties] | 201 |
| 2 | 1810_cPop | 1810 Census | Population Data [US, States & Counties] | 301 |
| 3 | 1820_cPop | 1820 Census | Population Data [US, States & Counties] | 401 |
| 4 | 1830_cPop | 1830 Census | Population Data [US, States & Counties] | 501 |
| 5 | 1840_cAg | 1840 Census | Agriculture Data [US, States & Counties] | 601 |
| 6 | 1840_cMfg | 1840 Census | Manufacturing Data [US, States & Counties] | 602 |
| 7 | 1840_cPopX | 1840 Census | Population & Other Data [US, States & Counties] | 603 |
| 8 | 1850_cAg | 1850 Census | Agriculture Data [US, States & Counties] | 701 |
| 9 | 1850_cPAX | 1850 Census | Population, Agriculture & Other Data [US, States & Counties] | 702 |


## 3. Get detailed metadata for a single dataset
This API call will return the details of a single dataset, ***2005_2009_ACS5a***. 

The **name** for the dataset needs to be manually added to the request URL. 

The details include the name and description of each table within the dataset, as well as the geographic levels for which this dataset is available.

DATASET ATTRIBUTES
* ***name***: The unique identifier of the dataset.
* ***group***: The group of datasets to which this dataset belongs.
* ***description***: A short description of the dataset.
* ***sequence***: The order in which the dataset will appear in the metadata API and extracts.
* ***has_multiple_data_types***: A boolean indicating if multiple data types exist for this dataset. For example, the American Community Survey datasets have margins of error as well as estimate data types. Use the breakdown_and_data_type_layout parameter on your extract to specify how these data types are structured in your extract.
* ***data_tables***: A list of data tables available for this dataset.
* ***geog_levels***: A list of geographic levels available for this dataset.
  * ***name***: The unique identifier of the geographic level.
  * ***description***: A short description of the geographic level.
  * ***has_geog_extent_selection***: Whether or not extent selection is applied (and required) for this geography level. See geographic_instances for a list of valid extents.
* ***breakdowns***: A list of breakdowns available for this dataset.
* ***years***: (Optional) If a dataset includes data from multiple years, then this is a list of its years.
* ***geographic_instances***: (Optional) If a dataset has any geographic levels that have extent selection, then this is a list of the valid extents for this dataset.



In [7]:
### manually add 2005_2009_ACS5a to the URL

url = "https://api.ipums.org/metadata/nhgis/datasets/2005_2009_ACS5a?version=v1"
nhgis_dataset = requests.get(url, headers=my_headers)
pprint(nhgis_dataset.json())

{'breakdowns': [{'breakdown_values': [{'description': 'Total area',
                                       'name': 'bs32.ge00'},
                                      {'description': 'Urban',
                                       'name': 'bs32.ge01'},
                                      {'description': 'Urban--in urbanized '
                                                      'area',
                                       'name': 'bs32.ge04'},
                                      {'description': 'Urban--in urbanized '
                                                      'area of 5,000,000 or '
                                                      'more population',
                                       'name': 'bs32.ge05'},
                                      {'description': 'Urban--in urbanized '
                                                      'area of 2,500,000 to '
                                                      '4,999,999 population',
                         

From the preceding output, the data and parameters of interest include:

```
'breakdowns': [{'breakdown_values': [{'description': 'Total area',
                                      'name': 'bs32.ge00'}]}],
'data_tables': [{'description': 'Total Population',
                 'name': 'B01003',
                 'nhgis_code': 'RK9',
                 'sequence': 14},
                {'description': 'Household Type (Including Living Alone)',
                 'name': 'B11001',
                 'nhgis_code': 'RL4',
                 'sequence': 45},
                {'description': 'Hispanic or Latino Origin by Race',
                 'name': 'B03002',
                 'nhgis_code': 'RLI',
                 'sequence': 23},
                {'description': 'Median Household Income in the Past 12 '
                                'Months (in 2009 Inflation-Adjusted Dollars)',
                 'name': 'B19013',
                 'nhgis_code': 'RNH',
                 'sequence': 94}],
'geog_levels': [{'description': 'State--County',
                 'has_geog_extent_selection': False,
                 'name': 'county',
                 'sequence': 25},
                {'description': 'State--County--Census Tract--Block Group',
                 'has_geog_extent_selection': True,
                 'name': 'blck_grp',
                 'sequence': 83}]
```
For the block group level (`blck_grp`), **has_geog_extent_selection** is `True`, so a geographic extent from **geographic_instances** must be specified when the dataset API request is constructed. The extent for this project is Texas:


```
'geographic_instances': [{'description': 'Texas', 'name': '480'}]
```
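
The two checks described above (which of the chosen levels require an extent, and which identifier corresponds to Texas) can be sketched in plain Python. The sample values are copied from the excerpts above, while the function names `levels_needing_extent` and `extent_code` are hypothetical.

```python
# Values copied from the excerpts above.
geog_levels = [
    {"name": "county", "description": "State--County",
     "has_geog_extent_selection": False},
    {"name": "blck_grp",
     "description": "State--County--Census Tract--Block Group",
     "has_geog_extent_selection": True},
]
geographic_instances = [{"name": "480", "description": "Texas"}]


def levels_needing_extent(levels, wanted):
    """Names in `wanted` whose geographic level requires an extent."""
    return [lvl["name"] for lvl in levels
            if lvl["name"] in wanted and lvl["has_geog_extent_selection"]]


def extent_code(instances, description):
    """Look up the extent identifier for a description such as 'Texas'."""
    for inst in instances:
        if inst["description"] == description:
            return inst["name"]
    raise KeyError(description)


print(levels_needing_extent(geog_levels, {"county", "blck_grp"}))
print(extent_code(geographic_instances, "Texas"))
```

With the live response, `nhgis_dataset.json()["geog_levels"]` and `nhgis_dataset.json()["geographic_instances"]` would take the place of the two sample lists.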




### Unnesting a nested dataset JSON

The JSON output may be difficult to sort through. We can convert the JSON output to a pandas dataframe for table formatting and easier review. The dataset also contains a nested structure which will be easier to review if *unnested* into its constituent parts. 

In the following subsections, we will unnest the components: 
* **breakdowns:breakdown_values**
* **data_tables** 
* **geog_levels**
* **geographic_instances** 

For each component, the JSON is converted to a dataframe. The total number of columns and rows is displayed along with the first 5 observations in the dataframe. Then the full list of observations is displayed.
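
To make the idea of unnesting concrete before turning to pandas, here is a minimal plain-Python sketch. It flattens an abbreviated copy of the **breakdowns** structure (values taken from this notebook's outputs) into one row per breakdown value. The column name `breakdown` is an illustrative choice, not an API field.

```python
# Abbreviated copy of the 'breakdowns' structure from this notebook's outputs.
breakdowns = [
    {"name": "bs32",
     "type": "Spatial",
     "description": "Geographic Subarea (2010 Census and American Community Survey)",
     "breakdown_values": [
         {"name": "bs32.ge00", "description": "Total area"},
         {"name": "bs32.ge01", "description": "Urban"},
     ]},
]

# One flat row per breakdown value, keeping the parent breakdown's name.
rows = [
    {"breakdown": b["name"], "name": v["name"], "description": v["description"]}
    for b in breakdowns
    for v in b["breakdown_values"]
]

for row in rows:
    print(row)
```

The pandas cells below perform the same kind of flattening with `json_normalize`.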

### breakdowns:breakdown_values

**breakdown_values** is nested within **breakdowns**, as seen in the prior JSON output for the dataset and in the output table below, where each row of **breakdowns** holds a list of **breakdown_values**.

In [8]:
nhgis_breakdowns_json = json.dumps(nhgis_dataset.json()["breakdowns"])
nhgis_breakdowns_df = pd.read_json(nhgis_breakdowns_json)
print('Number of columns in Dataframe : ', len(nhgis_breakdowns_df.columns))
print('Number of rows in Dataframe : ', len(nhgis_breakdowns_df.index))
pd.set_option('display.max_colwidth', None)
nhgis_breakdowns_df.head()

Number of columns in Dataframe :  4
Number of rows in Dataframe :  1


Unnamed: 0,name,type,description,breakdown_values
0,bs32,Spatial,Geographic Subarea (2010 Census and American Community Survey),"[{'name': 'bs32.ge00', 'description': 'Total area'}, {'name': 'bs32.ge01', 'description': 'Urban'}, {'name': 'bs32.ge04', 'description': 'Urban--in urbanized area'}, {'name': 'bs32.ge05', 'description': 'Urban--in urbanized area of 5,000,000 or more population'}, {'name': 'bs32.ge06', 'description': 'Urban--in urbanized area of 2,500,000 to 4,999,999 population'}, {'name': 'bs32.ge07', 'description': 'Urban--in urbanized area of 1,000,000 to 2,499,999 population'}, {'name': 'bs32.ge08', 'description': 'Urban--in urbanized area of 500,000 to 999,999 population'}, {'name': 'bs32.ge09', 'description': 'Urban--in urbanized area of 250,000 to 499,999 population'}, {'name': 'bs32.ge10', 'description': 'Urban--in urbanized area of 100,000 to 249,999 population'}, {'name': 'bs32.ge11', 'description': 'Urban--in urbanized area of 50,000 to 99,999 population'}, {'name': 'bs32.ge28', 'description': 'Urban--in urban cluster'}, {'name': 'bs32.ge29', 'description': 'Urban--in urban cluster of 25,000 to 49,999 population'}, {'name': 'bs32.ge30', 'description': 'Urban--in urban cluster of 10,000 to 24,999 population'}, {'name': 'bs32.ge31', 'description': 'Urban--in urban cluster of 5,000 to 9,999 population'}, {'name': 'bs32.ge32', 'description': 'Urban--in urban cluster of 2,500 to 4,999 population'}, {'name': 'bs32.ge43', 'description': 'Rural'}, {'name': 'bs32.ge44', 'description': 'Rural--place'}, {'name': 'bs32.ge45', 'description': 'Rural--place of 2,500 or more population'}, {'name': 'bs32.ge46', 'description': 'Rural--place of 1,000 to 2,499 population'}, {'name': 'bs32.ge47', 'description': 'Rural--place of less than 1,000 population'}, {'name': 'bs32.ge48', 'description': 'Rural--not in place'}, {'name': 'bs32.ge49', 'description': 'Rural--farm'}, {'name': 'bs32.ge50', 'description': 'Urban portion of extended place'}, {'name': 'bs32.ge51', 'description': 'Rural portion of extended place'}, 
{'name': 'bs32.ge89', 'description': 'American Indian Reservation and Trust Land--Federal'}, {'name': 'bs32.ge90', 'description': 'American Indian Reservation and Trust Land--State'}, {'name': 'bs32.ge91', 'description': 'Oklahoma Tribal Statistical Area'}, {'name': 'bs32.ge92', 'description': 'Tribal Designated Statistical Area'}, {'name': 'bs32.ge93', 'description': 'Alaska Native Village Statistical Area'}, {'name': 'bs32.ge94', 'description': 'State Designated Tribal Statistical Area'}, {'name': 'bs32.ge95', 'description': 'Hawaiian Home Land'}, {'name': 'bs32.geA0', 'description': 'In metropolitan or micropolitan statistical area'}, {'name': 'bs32.geA1', 'description': 'In metropolitan or micropolitan statistical area--in principal city'}, {'name': 'bs32.geA2', 'description': 'In metropolitan or micropolitan statistical area--not in principal city'}, {'name': 'bs32.geA3', 'description': 'In metropolitan or micropolitan statistical area--urban'}, {'name': 'bs32.geA4', 'description': 'In metropolitan or micropolitan statistical area--urban--in urbanized area'}, {'name': 'bs32.geA5', 'description': 'In metropolitan or micropolitan statistical area--urban--in urban cluster'}, {'name': 'bs32.geA6', 'description': 'In metropolitan or micropolitan statistical area--rural'}, {'name': 'bs32.geA7', 'description': 'In metropolitan or micropolitan statistical area of 5,000,000 or more population'}, {'name': 'bs32.geA8', 'description': 'In metropolitan or micropolitan statistical area of 2,500,000 to 4,999,999 population'}, {'name': 'bs32.geA9', 'description': 'In metropolitan or micropolitan statistical area of 1,000,000 to 2,499,999 population'}, {'name': 'bs32.geAA', 'description': 'In metropolitan or micropolitan statistical area of 500,000 to 999,999 population'}, {'name': 'bs32.geAB', 'description': 'In metropolitan or micropolitan statistical area of 250,000 to 499,999 population'}, {'name': 'bs32.geAC', 'description': 'In metropolitan or micropolitan statistical 
area of 100,000 to 249,999 population'}, {'name': 'bs32.geAD', 'description': 'In metropolitan or micropolitan statistical area of 50,000 to 99,999 population'}, {'name': 'bs32.geAE', 'description': 'In metropolitan or micropolitan statistical area of 25,000 to 49,999 population'}, {'name': 'bs32.geAF', 'description': 'In metropolitan or micropolitan statistical area of less than 25,000 population'}, {'name': 'bs32.geC0', 'description': 'In metropolitan statistical area'}, {'name': 'bs32.geC1', 'description': 'In metropolitan statistical area--in principal city'}, {'name': 'bs32.geC2', 'description': 'In metropolitan statistical area--not in principal city'}, {'name': 'bs32.geC3', 'description': 'In metropolitan statistical area--urban'}, {'name': 'bs32.geC4', 'description': 'In metropolitan statistical area--urban--in urbanized area'}, {'name': 'bs32.geC5', 'description': 'In metropolitan statistical area--urban--in urban cluster'}, {'name': 'bs32.geC6', 'description': 'In metropolitan statistical area--rural'}, {'name': 'bs32.geC7', 'description': 'In metropolitan statistical area of 5,000,000 or more population'}, {'name': 'bs32.geC8', 'description': 'In metropolitan statistical area of 2,500,000 to 4,999,999 population'}, {'name': 'bs32.geC9', 'description': 'In metropolitan statistical area of 1,000,000 to 2,499,999 population'}, {'name': 'bs32.geCA', 'description': 'In metropolitan statistical area of 500,000 to 999,999 population'}, {'name': 'bs32.geCB', 'description': 'In metropolitan statistical area of 250,000 to 499,999 population'}, {'name': 'bs32.geCC', 'description': 'In metropolitan statistical area of 100,000 to 249,999 population'}, {'name': 'bs32.geCD', 'description': 'In metropolitan statistical area of less than 100,000 population'}, {'name': 'bs32.geCE', 'description': 'In metropolitan statistical area of 5,000,000 or more population--in principal city'}, {'name': 'bs32.geCF', 'description': 'In metropolitan statistical area of 5,000,000 or 
more population--not in principal city'}, {'name': 'bs32.geCG', 'description': 'In metropolitan statistical area of 2,500,000 to 4,999,999 population--in principal city'}, {'name': 'bs32.geCH', 'description': 'In metropolitan statistical area of 2,500,000 to 4,999,999 population--not in principal city'}, {'name': 'bs32.geCJ', 'description': 'In metropolitan statistical area of 1,000,000 to 2,499,999 population--in principal city'}, {'name': 'bs32.geCK', 'description': 'In metropolitan statistical area of 1,000,000 to 2,499,999 population--not in principal city'}, {'name': 'bs32.geCL', 'description': 'In metropolitan statistical area of 500,000 to 999,999 population--in principal city'}, {'name': 'bs32.geCM', 'description': 'In metropolitan statistical area of 500,000 to 999,999 population--not in principal city'}, {'name': 'bs32.geCN', 'description': 'In metropolitan statistical area of 250,000 to 499,999 population--in principal city'}, {'name': 'bs32.geCP', 'description': 'In metropolitan statistical area of 250,000 to 499,999 population--not in principal city'}, {'name': 'bs32.geCQ', 'description': 'In metropolitan statistical area of 100,000 to 249,999 population--in principal city'}, {'name': 'bs32.geCR', 'description': 'In metropolitan statistical area of 100,000 to 249,999 population--not in principal city'}, {'name': 'bs32.geCS', 'description': 'In metropolitan statistical area of less than 100,000 population--in principal city'}, {'name': 'bs32.geCT', 'description': 'In metropolitan statistical area of less than 100,000 population--not in principal city'}, {'name': 'bs32.geE0', 'description': 'In micropolitan statistical area'}, {'name': 'bs32.geE1', 'description': 'In micropolitan statistical area--in principal city'}, {'name': 'bs32.geE2', 'description': 'In micropolitan statistical area--not in principal city'}, {'name': 'bs32.geE3', 'description': 'In micropolitan statistical area--urban'}, {'name': 'bs32.geE4', 'description': 'In micropolitan 
statistical area--urban--in urbanized area'}, {'name': 'bs32.geE5', 'description': 'In micropolitan statistical area--urban--in urban cluster'}, {'name': 'bs32.geE6', 'description': 'In micropolitan statistical area--rural'}, {'name': 'bs32.geE7', 'description': 'In micropolitan statistical area of 100,000 or more population'}, {'name': 'bs32.geE8', 'description': 'In micropolitan statistical area of 50,000 to 99,999 population'}, {'name': 'bs32.geE9', 'description': 'In micropolitan statistical area of 25,000 to 49,999 population'}, {'name': 'bs32.geEA', 'description': 'In micropolitan statistical area of less than 25,000 population'}, {'name': 'bs32.geEB', 'description': 'In micropolitan statistical area of 100,000 or more population--in principal city'}, {'name': 'bs32.geEC', 'description': 'In micropolitan statistical area of 100,000 or more population--not in principal city'}, {'name': 'bs32.geED', 'description': 'In micropolitan statistical area of 50,000 to 99,999 population--in principal city'}, {'name': 'bs32.geEE', 'description': 'In micropolitan statistical area of 50,000 to 99,999 population--not in principal city'}, {'name': 'bs32.geEF', 'description': 'In micropolitan statistical area of 25,000 to 49,999 population--in principal city'}, {'name': 'bs32.geEG', 'description': 'In micropolitan statistical area of 25,000 to 49,999 population--not in principal city'}, {'name': 'bs32.geEH', 'description': 'In micropolitan statistical area of less than 25,000 population--in principal city'}, {'name': 'bs32.geEJ', 'description': 'In micropolitan statistical area of less than 25,000 population--not in principal city'}, {'name': 'bs32.geG0', 'description': 'Not in metropolitan or micropolitan statistical area'}, {'name': 'bs32.geG1', 'description': 'Not in metropolitan or micropolitan statistical area--urban'}, {'name': 'bs32.geG2', 'description': 'Not in metropolitan or micropolitan statistical area--urban--in urbanized area'}, {'name': 'bs32.geG3', 
'description': 'Not in metropolitan or micropolitan statistical area--urban--in urban cluster'}, {'name': 'bs32.geG4', 'description': 'Not in metropolitan or micropolitan statistical area--rural'}, {'name': 'bs32.geH0', 'description': 'Not in metropolitan statistical area'}, ...]"


The **breakdown_values** observations are now extracted from within **breakdowns**.

First 5 breakdown_values observations

In [9]:
nhgis_breakdown_values_json = json_normalize(data = nhgis_dataset.json()['breakdowns'], record_path = 'breakdown_values')
nhgis_breakdown_values_df = pd.DataFrame.from_dict(nhgis_breakdown_values_json)
print('Number of columns in Dataframe : ', len(nhgis_breakdown_values_df.columns))
print('Number of rows in Dataframe : ', len(nhgis_breakdown_values_df.index))
pd.set_option('display.max_colwidth', None)
nhgis_breakdown_values_df.head()

Number of columns in Dataframe :  2
Number of rows in Dataframe :  114


| | name | description |
|---|---|---|
| 0 | bs32.ge00 | Total area |
| 1 | bs32.ge01 | Urban |
| 2 | bs32.ge04 | Urban--in urbanized area |
| 3 | bs32.ge05 | Urban--in urbanized area of 5,000,000 or more population |
| 4 | bs32.ge06 | Urban--in urbanized area of 2,500,000 to 4,999,999 population |


All breakdown_values observations

In [10]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
nhgis_breakdown_values_df.head(len(nhgis_breakdown_values_df.index))

| | name | description |
|---|---|---|
| 0 | bs32.ge00 | Total area |
| 1 | bs32.ge01 | Urban |
| 2 | bs32.ge04 | Urban--in urbanized area |
| 3 | bs32.ge05 | Urban--in urbanized area of 5,000,000 or more population |
| 4 | bs32.ge06 | Urban--in urbanized area of 2,500,000 to 4,999,999 population |
| 5 | bs32.ge07 | Urban--in urbanized area of 1,000,000 to 2,499,999 population |
| 6 | bs32.ge08 | Urban--in urbanized area of 500,000 to 999,999 population |
| 7 | bs32.ge09 | Urban--in urbanized area of 250,000 to 499,999 population |
| 8 | bs32.ge10 | Urban--in urbanized area of 100,000 to 249,999 population |
| 9 | bs32.ge11 | Urban--in urbanized area of 50,000 to 99,999 population |


### data_tables

First 5 data_tables observations

In [11]:
nhgis_data_tables_json = json.dumps(nhgis_dataset.json()["data_tables"])
nhgis_data_tables_df = pd.read_json(nhgis_data_tables_json)
print('Number of columns in Dataframe : ', len(nhgis_data_tables_df.columns))
print('Number of rows in Dataframe : ', len(nhgis_data_tables_df.index))
pd.set_option('display.max_colwidth', None)
nhgis_data_tables_df.head()

Number of columns in Dataframe :  4
Number of rows in Dataframe :  341


| | name | nhgis_code | description | sequence |
|---|---|---|---|---|
| 0 | B00001 | RKW | Unweighted Sample Count of the Population | 1 |
| 1 | B00002 | RKX | Unweighted Sample Housing Units | 2 |
| 2 | B01001 | RKY | Sex by Age | 3 |
| 3 | B01002 | RKZ | Median Age by Sex | 4 |
| 4 | B01002A | RK0 | Median Age by Sex (White Alone) | 5 |


All data_tables observations

In [12]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
nhgis_data_tables_df.head(len(nhgis_data_tables_df.index))

| | name | nhgis_code | description | sequence |
|---|---|---|---|---|
| 0 | B00001 | RKW | Unweighted Sample Count of the Population | 1 |
| 1 | B00002 | RKX | Unweighted Sample Housing Units | 2 |
| 2 | B01001 | RKY | Sex by Age | 3 |
| 3 | B01002 | RKZ | Median Age by Sex | 4 |
| 4 | B01002A | RK0 | Median Age by Sex (White Alone) | 5 |
| 5 | B01002B | RK1 | Median Age by Sex (Black or African American Alone) | 6 |
| 6 | B01002C | RK2 | Median Age by Sex (American Indian and Alaska Native) | 7 |
| 7 | B01002D | RK3 | Median Age by Sex (Asian Alone) | 8 |
| 8 | B01002E | RK4 | Median Age by Sex (Native Hawaiian and Other Pacific Islander Alone) | 9 |
| 9 | B01002F | RK5 | Median Age by Sex (Some Other Race Alone) | 10 |


### geog_levels

First 5 geog_levels observations

In [13]:
nhgis_geog_levels_json = json.dumps(nhgis_dataset.json()["geog_levels"])
nhgis_geog_levels_df = pd.read_json(nhgis_geog_levels_json)
print('Number of columns in Dataframe : ', len(nhgis_geog_levels_df.columns))
print('Number of rows in Dataframe : ', len(nhgis_geog_levels_df.index))
pd.set_option('display.max_colwidth', None)
nhgis_geog_levels_df.head()

Number of columns in Dataframe :  4
Number of rows in Dataframe :  79


| | name | description | has_geog_extent_selection | sequence |
|---|---|---|---|---|
| 0 | nation | Nation | False | 1 |
| 1 | region | Region | False | 2 |
| 2 | division | Division | False | 3 |
| 3 | state | State | False | 4 |
| 4 | state_260 | American Indian Area/Alaska Native Area/Hawaiian Home Land--State | False | 5 |


All geog_levels observations

In [14]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
nhgis_geog_levels_df.head(len(nhgis_geog_levels_df.index))

| | name | description | has_geog_extent_selection | sequence |
|---|---|---|---|---|
| 0 | nation | Nation | False | 1 |
| 1 | region | Region | False | 2 |
| 2 | division | Division | False | 3 |
| 3 | state | State | False | 4 |
| 4 | state_260 | American Indian Area/Alaska Native Area/Hawaiian Home Land--State | False | 5 |
| 5 | state_290 | American Indian Area/Alaska Native Area/Hawaiian Home Land--Tribal Subdivision/Remainder--State | False | 6 |
| 6 | state_311 | Metropolitan Statistical Area/Micropolitan Statistical Area--State | False | 9 |
| 7 | state_315 | Metropolitan Statistical Area/Micropolitan Statistical Area--Metropolitan Division--State | False | 10 |
| 8 | state_331 | Combined Statistical Area--State | False | 11 |
| 9 | state_333 | Combined Statistical Area--Metropolitan Statistical Area/Micropolitan Statistical Area--State | False | 12 |


### geographic_instances

First 5 geographic_instances observations

In [15]:
nhgis_geographic_instances_json = json.dumps(nhgis_dataset.json()["geographic_instances"])
nhgis_geographic_instances_df = pd.read_json(nhgis_geographic_instances_json)
print('Number of columns in Dataframe : ', len(nhgis_geographic_instances_df.columns))
print('Number of rows in Dataframe : ', len(nhgis_geographic_instances_df.index))
pd.set_option('display.max_colwidth', None)
nhgis_geographic_instances_df.head()

Number of columns in Dataframe :  2
Number of rows in Dataframe :  52


| | name | description |
|---|---|---|
| 0 | 10 | Alabama |
| 1 | 20 | Alaska |
| 2 | 40 | Arizona |
| 3 | 50 | Arkansas |
| 4 | 60 | California |


All geographic_instances observations

In [16]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
nhgis_geographic_instances_df.head(len(nhgis_geographic_instances_df.index))

| | name | description |
|---|---|---|
| 0 | 10 | Alabama |
| 1 | 20 | Alaska |
| 2 | 40 | Arizona |
| 3 | 50 | Arkansas |
| 4 | 60 | California |
| 5 | 80 | Colorado |
| 6 | 90 | Connecticut |
| 7 | 100 | Delaware |
| 8 | 110 | District Of Columbia |
| 9 | 120 | Florida |
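
Note that the **name** values in this table display as integers (10, 20, ...), whereas the raw JSON excerpt earlier shows the Texas identifier as the string `'480'`. This suggests that `pd.read_json` inferred a numeric type for the column. If the identifiers need to remain strings for the later dataset request, one option is to build the dataframe directly from the parsed JSON. The sketch below illustrates this on two sample records. Only `'480'` appears verbatim in this notebook's output; the Alabama value `'010'` is an assumed zero-padded form used for illustration.

```python
import pandas as pd

# '480' for Texas is taken from the raw JSON excerpt in this notebook;
# '010' for Alabama is an assumed zero-padded form, for illustration only.
records = [
    {"name": "010", "description": "Alabama"},
    {"name": "480", "description": "Texas"},
]

# Building the frame from the parsed records skips read_json's type
# inference, so the identifiers remain Python strings.
instances_df = pd.DataFrame(records)
print(instances_df["name"].tolist())
```

With the live response, this would be `pd.DataFrame(nhgis_dataset.json()["geographic_instances"])`.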


## 4. Get metadata for a table
This API call will return the metadata details for a specific table. This includes the NHGIS code, a unique identifier for NHGIS, which appears in the codebook and is prepended to the variable names in the extract. The universe information is also returned, along with codes and descriptions for each variable.

There are four tables we are interested in. These were identified from the data_tables section. These tables are:
* B01003: Total Population
* B11001: Household Type (Including Living Alone)
* B03002: Hispanic or Latino Origin by Race
* B19013: Median Household Income in the Past 12 Months (in 2009 Inflation-Adjusted Dollars)

Let's create a request for each table to verify that it contains the data we are interested in. As the outputs below show, all four do.


DATA TABLE ATTRIBUTES
* ***name***: The unique identifier for the data table within its dataset.
* ***description***: A short description of the data table.
* ***universe***: The statistical population (set of entities) measured by this data table (e.g., persons, families, occupied housing units, etc.).
* ***nhgis_code***: The code for this data table that will appear in the extract.
* ***sequence***: The order in which this data table will appear in the metadata API and extracts.
* ***variables***: A list of variables within the table.

### B01003: Total Population

In [17]:
### manually add dataset and data_table name to the URL

url = "https://api.ipums.org/metadata/nhgis/datasets/2005_2009_ACS5a/data_tables/B01003?version=v1"
nhgis_metadata = requests.get(url, headers=my_headers)
pprint(nhgis_metadata.json())

{'description': 'Total Population',
 'name': 'B01003',
 'nhgis_code': 'RK9',
 'sequence': 14,
 'universe': 'Total population',
 'variables': [{'description': 'Total', 'nhgis_code': 'RK9001'}]}
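
Since the NHGIS codes are what identify variables in the extract, it can help to turn a table's **variables** list into a code-to-label lookup. The sketch below does this for the B01003 metadata shown above. The function name `variable_lookup` is hypothetical, and column names in an actual extract may not match these codes exactly (for example when a dataset has multiple data types), so treat the result as a starting point.

```python
# Metadata for table B01003, copied from the output above.
table_metadata = {
    "description": "Total Population",
    "name": "B01003",
    "nhgis_code": "RK9",
    "sequence": 14,
    "universe": "Total population",
    "variables": [{"description": "Total", "nhgis_code": "RK9001"}],
}


def variable_lookup(metadata):
    """Map each variable's NHGIS code to a 'table: variable' label."""
    return {
        var["nhgis_code"]: f"{metadata['description']}: {var['description']}"
        for var in metadata["variables"]
    }


print(variable_lookup(table_metadata))
```

The same function could be applied to `nhgis_metadata.json()` after each of the table requests in this section.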


### B11001: Household Type (Including Living Alone)

In [18]:
### manually add dataset and data_table name to the URL

url = "https://api.ipums.org/metadata/nhgis/datasets/2005_2009_ACS5a/data_tables/B11001?version=v1"
nhgis_metadata = requests.get(url, headers=my_headers)
pprint(nhgis_metadata.json())

{'description': 'Household Type (Including Living Alone)',
 'name': 'B11001',
 'nhgis_code': 'RL4',
 'sequence': 45,
 'universe': 'Households',
 'variables': [{'description': 'Total', 'nhgis_code': 'RL4001'},
               {'description': 'Family households', 'nhgis_code': 'RL4002'},
               {'description': 'Family households: Married-couple family',
                'nhgis_code': 'RL4003'},
               {'description': 'Family households: Other family',
                'nhgis_code': 'RL4004'},
               {'description': 'Family households: Other family: Male '
                               'householder, no wife present',
                'nhgis_code': 'RL4005'},
               {'description': 'Family households: Other family: Female '
                               'householder, no husband present',
                'nhgis_code': 'RL4006'},
               {'description': 'Nonfamily households', 'nhgis_code': 'RL4007'},
               {'description': 'Nonfamily households: 

### B03002: Hispanic or Latino Origin by Race

In [19]:
### manually add dataset and data_table name to the URL

url = "https://api.ipums.org/metadata/nhgis/datasets/2005_2009_ACS5a/data_tables/B03002?version=v1"
nhgis_metadata = requests.get(url, headers=my_headers)
pprint(nhgis_metadata.json())

{'description': 'Hispanic or Latino Origin by Race',
 'name': 'B03002',
 'nhgis_code': 'RLI',
 'sequence': 23,
 'universe': 'Total population',
 'variables': [{'description': 'Total', 'nhgis_code': 'RLI001'},
               {'description': 'Not Hispanic or Latino',
                'nhgis_code': 'RLI002'},
               {'description': 'Not Hispanic or Latino: White alone',
                'nhgis_code': 'RLI003'},
               {'description': 'Not Hispanic or Latino: Black or African '
                               'American alone',
                'nhgis_code': 'RLI004'},
               {'description': 'Not Hispanic or Latino: American Indian and '
                               'Alaska Native alone',
                'nhgis_code': 'RLI005'},
               {'description': 'Not Hispanic or Latino: Asian alone',
                'nhgis_code': 'RLI006'},
               {'description': 'Not Hispanic or Latino: Native Hawaiian and '
                               'Other Pacific Islander 

### B19013: Median Household Income in the Past 12 Months (in 2009 Inflation-Adjusted Dollars)

In [20]:
### manually add dataset and data_table name to the URL

url = "https://api.ipums.org/metadata/nhgis/datasets/2005_2009_ACS5a/data_tables/B19013?version=v1"
nhgis_metadata = requests.get(url, headers=my_headers)
pprint(nhgis_metadata.json())

{'description': 'Median Household Income in the Past 12 Months (in 2009 '
                'Inflation-Adjusted Dollars)',
 'name': 'B19013',
 'nhgis_code': 'RNH',
 'sequence': 94,
 'universe': 'Households',
 'variables': [{'description': 'Median household income in the past 12 months '
                               '(in 2009 inflation-adjusted dollars)',
                'nhgis_code': 'RNH001'}]}
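
The four requests above differ only in the table name, so the URL could be built by a small helper instead of being edited by hand. The following is a minimal sketch under that assumption: `data_table_url` and `variable_codes` are hypothetical helper names (they are not part of the IPUMS API), and the sample dictionary is a copy of the B01003 response shown above so that the example runs without a network call.

```python
# Hypothetical helpers (not part of the IPUMS API) for the per-table requests.
BASE_URL = "https://api.ipums.org/metadata/nhgis/datasets"


def data_table_url(dataset, data_table, version="v1"):
    """Build the metadata URL for one data table, following the URL pattern used above."""
    return f"{BASE_URL}/{dataset}/data_tables/{data_table}?version={version}"


def variable_codes(table_metadata):
    """Map each variable's nhgis_code to its description."""
    return {v["nhgis_code"]: v["description"] for v in table_metadata["variables"]}


# The four tables requested in this section
tables = ["B01003", "B11001", "B03002", "B19013"]
urls = {table: data_table_url("2005_2009_ACS5a", table) for table in tables}

# Copy of the B01003 response shown above, used here in place of a live request
sample = {
    "description": "Total Population",
    "name": "B01003",
    "nhgis_code": "RK9",
    "sequence": 14,
    "universe": "Total population",
    "variables": [{"description": "Total", "nhgis_code": "RK9001"}],
}

print(urls["B01003"])
print(variable_codes(sample))
```

In the notebook, each URL in `urls` would be passed to `requests.get(url, headers=my_headers)` exactly as in the cells above.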


# IPUMS NHGIS Shapefile Metadata API request 

An NHGIS shapefile is a geometry file for geographic information systems (GIS) in shapefile format. An in-depth introduction is available on the [NHGIS website](https://www.nhgis.org/documentation/gis-data).

See the [IPUMS developer documentation](https://developer.ipums.org/docs/workflows/explore_metadata/nhgis/shapefiles/) for the source of this code.

#### 1. Set API key (may already be completed)
#### 2. Get a list of all shapefiles

## 1.Set API key
The API key should have been set in a prior section in this notebook. See **1.Set API key** in the **NHGIS dataset metadata API request** section.

## 2.Get a list of all shapefiles
This API call will return a list of all available shapefiles.

SHAPEFILE ATTRIBUTES
* ***name***: The unique identifier of the shapefile.
* ***year***: The survey year in which the file’s represented areas were used for tabulations, which may be different than the vintage of the represented areas.
* ***geographic_level***: The geographic level of the shapefile.
* ***extent***: The geographic extent which is covered by the shapefile.
* ***basis***: The derivation source of the shapefile.
* ***sequence***: The order in which the shapefile appears in the metadata API.

In [21]:
my_headers = {"Authorization": my_key}
url = "https://api.ipums.org/metadata/nhgis/shapefiles?version=v1"
nhgis_shp_metadata = requests.get(url, headers=my_headers)
pprint(nhgis_shp_metadata.json())

Streaming output truncated to the last 5000 lines.
  'extent': 'Oklahoma',
  'geographic_level': 'Block',
  'name': '400_block_2000_tl2010',
  'sequence': 502,
  'year': '2000'},
 {'basis': '2010 TIGER/Line +',
  'extent': 'Oregon',
  'geographic_level': 'Block',
  'name': '410_block_2000_tl2010',
  'sequence': 503,
  'year': '2000'},
 {'basis': '2010 TIGER/Line +',
  'extent': 'Pennsylvania',
  'geographic_level': 'Block',
  'name': '420_block_2000_tl2010',
  'sequence': 504,
  'year': '2000'},
 {'basis': '2010 TIGER/Line +',
  'extent': 'Rhode Island',
  'geographic_level': 'Block',
  'name': '440_block_2000_tl2010',
  'sequence': 505,
  'year': '2000'},
 {'basis': '2010 TIGER/Line +',
  'extent': 'South Carolina',
  'geographic_level': 'Block',
  'name': '450_block_2000_tl2010',
  'sequence': 506,
  'year': '2000'},
 {'basis': '2010 TIGER/Line +',
  'extent': 'South Dakota',
  'geographic_level': 'Block',
  'name': '460_block_2000_tl2010',
  'sequence': 507,
  'year': 

The JSON output may be difficult to sort through. We can convert the JSON output to a pandas dataframe for table formatting and easier review.

In the following codeblock, the JSON is converted to a dataframe. The total number of columns and rows are displayed along with the first 5 observations in the dataframe.

In [22]:
nhgis_shp_metadata_json = json.dumps(nhgis_shp_metadata.json())
nhgis_shp_df = pd.read_json(nhgis_shp_metadata_json)
print('Number of columns in Dataframe : ', len(nhgis_shp_df.columns))
print('Number of rows in Dataframe : ', len(nhgis_shp_df.index))
pd.set_option('display.max_colwidth', None)
nhgis_shp_df.head()

Number of columns in Dataframe :  6
Number of rows in Dataframe :  1327


Unnamed: 0,name,year,geographic_level,extent,basis,sequence
0,us_state_1790_tl2000,1790,State,United States,2000 TIGER/Line +,1
1,us_county_1790_tl2000,1790,County,United States,2000 TIGER/Line +,2
2,us_county_1790_tl2008,1790,County,United States,2008 TIGER/Line +,3
3,us_state_1800_tl2000,1800,State,United States,2000 TIGER/Line +,4
4,us_county_1800_tl2000,1800,County,United States,2000 TIGER/Line +,5


The following codeblock displays all shapefile observations.

In [23]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
nhgis_shp_df.head(len(nhgis_shp_df.index))

Unnamed: 0,name,year,geographic_level,extent,basis,sequence
0,us_state_1790_tl2000,1790,State,United States,2000 TIGER/Line +,1
1,us_county_1790_tl2000,1790,County,United States,2000 TIGER/Line +,2
2,us_county_1790_tl2008,1790,County,United States,2008 TIGER/Line +,3
3,us_state_1800_tl2000,1800,State,United States,2000 TIGER/Line +,4
4,us_county_1800_tl2000,1800,County,United States,2000 TIGER/Line +,5
5,us_county_1800_tl2008,1800,County,United States,2008 TIGER/Line +,6
6,us_state_1810_tl2000,1810,State,United States,2000 TIGER/Line +,7
7,us_county_1810_tl2000,1810,County,United States,2000 TIGER/Line +,8
8,us_county_1810_tl2008,1810,County,United States,2008 TIGER/Line +,9
9,us_state_1820_tl2000,1820,State,United States,2000 TIGER/Line +,10


The shapefiles of interest come from the *2009 TIGER/Line +* shapefile series. We want block group geographies for the State of Texas and county geographies for the United States. The two shapefiles that fit these requirements are named:
* **480_blck_grp_2000_tl2009**
* **us_county_2009_tl2009**
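
Rather than scanning all 1,327 rows by eye, the selection could be expressed as a filter on the metadata records. The sketch below is one possible approach, not the method used by IPUMS: `find_shapefiles` is a hypothetical helper, and the sample records are copied from the JSON output earlier in this section so the example runs offline.

```python
def find_shapefiles(records, **criteria):
    """Return the names of records whose fields equal every given criterion."""
    return [
        record["name"]
        for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]


# Sample records copied from the shapefile metadata output shown earlier
records = [
    {"basis": "2010 TIGER/Line +", "extent": "Oregon", "geographic_level": "Block",
     "name": "410_block_2000_tl2010", "sequence": 503, "year": "2000"},
    {"basis": "2010 TIGER/Line +", "extent": "Pennsylvania", "geographic_level": "Block",
     "name": "420_block_2000_tl2010", "sequence": 504, "year": "2000"},
    {"basis": "2010 TIGER/Line +", "extent": "Rhode Island", "geographic_level": "Block",
     "name": "440_block_2000_tl2010", "sequence": 505, "year": "2000"},
    {"basis": "2008 TIGER/Line +", "extent": "United States", "geographic_level": "County",
     "name": "us_county_1790_tl2008", "sequence": 3, "year": "1790"},
]

matches = find_shapefiles(
    records,
    basis="2010 TIGER/Line +",
    geographic_level="Block",
    extent="Oregon",
)
print(matches)
```

Against the full list, the same call could be made on `nhgis_shp_metadata.json()` (or `nhgis_shp_df.to_dict("records")`) with `basis="2009 TIGER/Line +"` plus the relevant `geographic_level` and `extent`. The exact label strings for those fields, for example the label used for block groups, are an assumption here and should be confirmed against the metadata first.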
