# IPUMS NHGIS Data Extraction Using ipumsr

## Introduction

**From the [IPUMS NHGIS Webpage](https://www.nhgis.org):** The National Historical Geographic Information System (NHGIS) provides easy access to summary tables and time series of population, housing, agriculture, and economic data, along with GIS-compatible mapping files, for years from 1790 through the present and for all levels of U.S. census geography, including states, counties, tracts, and blocks.

This notebook will guide you through the process of exploring, selecting, and extracting population data from [IPUMS NHGIS](https://nhgis.org) using the [IPUMS API](https://developer.ipums.org/docs/v2/apiprogram) via the [ipumsr R package](https://cran.r-project.org/web/packages/ipumsr/index.html).

By working through this notebook, you will learn how to define an extraction of population data for specific geographic areas (such as tracts or counties) and download the relevant data and shapefiles for spatial analysis. This workflow is useful for researchers and analysts interested in understanding population changes over time across different regions of the United States.


#### Prerequisites

Before using this notebook, we recommend first completing the **Introduction to IPUMS and the IPUMS API** notebook.


#### Overview
This notebook includes the following sections:

1. Setup
2. NHGIS Time-Series Data Metadata Exploration
3. NHGIS Geography Shapefile Metadata Exploration
4. NHGIS Time-Series Data and Geography Shapefile Extraction Specification and Submission
5. Subset and Merge the Time-Series and Geography Data Extractions

## 1. Setup

### 1a. Package Installation

Before running this script, you will need to install and load the following packages into your R environment:

[**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) A package for data manipulation that provides a consistent set of functions to filter, arrange, summarize, and transform data. *dplyr* makes it easy to work with data frames and perform operations efficiently.  This notebook uses the following functions from *dplyr*.
* [*filter()*](https://dplyr.tidyverse.org/reference/filter.html) for subsetting a dataframe based on specified conditions
* [*select()*](https://dplyr.tidyverse.org/reference/select.html) for selecting variables in a dataframe by name
* [*rename()*](https://dplyr.tidyverse.org/reference/rename.html) for changing the names of individual variables in a dataframe
* This notebook also uses [*%>%*](https://magrittr.tidyverse.org/reference/pipe.html), referred to as the *pipe* operator, which is used to pass the output from one function directly into the next function for the purpose of creating streamlined workflows.  The *pipe* operator is a commonly used component of the [*tidyverse*](https://www.tidyverse.org).
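
As a quick illustration of these verbs, the sketch below chains *filter()*, *select()*, and *rename()* with the pipe operator. The `toy` data frame and its column names are invented for illustration and are not part of any NHGIS extract.

```r
library(dplyr)

# hypothetical toy data, invented for illustration only
toy <- data.frame(
  id    = c("a", "b", "c"),
  count = c(10, 25, 40),
  note  = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

toy_small <- toy %>%
  filter(count > 20) %>%   # keep rows where count exceeds 20
  select(id, count) %>%    # keep only the id and count columns
  rename(total = count)    # rename count to total

toy_small
```

Each step receives the output of the previous one, so the pipeline reads top to bottom in the order the operations are applied.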

[**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) A package specifically designed to interact with IPUMS datasets, including NHGIS. It allows users to define and submit data extraction requests, download data, and read it directly into R for analysis.  This notebook uses the following functions from *ipumsr*.

* *set_ipums_api_key()* for setting your IPUMS API key
* *get_metadata_nhgis()* for listing available data sources from IPUMS NHGIS
* *define_extract_nhgis()* for defining an IPUMS NHGIS extract request
* *tst_spec()* for creating a tst_spec object containing a time-series table specification
* *submit_extract()* for submitting an extract request via the IPUMS API and returning an *ipums_extract* object
* *wait_for_extract()* for waiting for an extract to finish processing
* *download_extract()* for downloading an extract's data files
* *read_nhgis()* for reading tabular data from an NHGIS extract
* *read_ipums_sf()* for reading spatial data from an IPUMS extract

[**purrr**](https://cran.r-project.org/web/packages/purrr/index.html) A functional programming toolkit that simplifies the process of working with lists and vectors. It is particularly useful for applying functions to multiple elements or data frames, making it easier to write clean, efficient code.  This notebook uses the following functions from *purrr*.

* [*map()*](https://www.rdocumentation.org/packages/purrr/versions/0.2.5/topics/map) and [*map_dfr()*](https://purrr.tidyverse.org/reference/map_dfr.html) for applying a function to each element in the given input
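
To make the difference between the two functions concrete, here is a minimal sketch on a made-up input vector. The variable names (`codes`, `code_lengths`, `code_table`) are hypothetical and chosen only for this example.

```r
library(purrr)

# hypothetical toy input, invented for illustration only
codes <- c("A00", "CL8")

# map() returns a list with one element per input
code_lengths <- map(codes, nchar)

# map_dfr() row-binds the data frames returned for each input
code_table <- map_dfr(codes, function(code) {
  data.frame(
    code = code,
    first_letter = substr(code, 1, 1),
    stringsAsFactors = FALSE
  )
})

code_table
```

Use *map()* when you want to keep one result per input in a list, and *map_dfr()* when each result is a small table that should be stacked into a single data frame.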

If you are working in the I-GUIDE environment, these packages should already be installed.  If you are working on your local machine or in another environment, you may need to install them before continuing.

In [1]:
# install.packages(c("dplyr", "ipumsr", "purrr"))
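
If you are unsure which packages are already present, one possible approach is to check before installing. The sketch below uses only base R; `missing_packages()` is a hypothetical helper written for this example, not a function from any package.

```r
# hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("dplyr", "ipumsr", "purrr"))

# uncomment to install anything that is missing
# if (length(to_install) > 0) install.packages(to_install)
to_install
```

An empty result means all three packages are available and no installation is needed.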

Load the packages into your workspace.

In [6]:
library(dplyr)
library(ipumsr)
library(purrr)

### 1b. API Setup

#### Connect your IPUMS API Key

Run the following code to enter your [IPUMS API key](https://account.ipums.org/api_keys).  Refer to the **Introduction to IPUMS and the IPUMS API** notebook for background on the IPUMS data repository and for instructions on setting up your IPUMS account and API key.

In [None]:
ipums_api_key <- readline("Please enter your IPUMS API key: ")
set_ipums_api_key(ipums_api_key, save = TRUE, overwrite = TRUE)
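
When the key is saved, *ipumsr* reports that it is stored in the `IPUMS_API_KEY` environment variable. A later session could therefore check for an existing key before prompting again. The sketch below uses only base R, and `has_ipums_key()` is a hypothetical helper name.

```r
# hypothetical helper: TRUE when a non-empty IPUMS_API_KEY is set
has_ipums_key <- function() {
  nzchar(Sys.getenv("IPUMS_API_KEY"))
}

if (!has_ipums_key()) {
  message("No IPUMS API key found; run set_ipums_api_key() first.")
}
```

The check only confirms that a value is present. It does not verify that the key is valid, which is only known once a request reaches the IPUMS API.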

## 2. NHGIS Time-Series Data Metadata Exploration

The NHGIS provides a variety of time-series tables, each representing collections of population data over different years. This section helps you identify the right datasets for your analysis by exploring the available time-series tables and filtering them based on specific criteria.

#### Steps:
1. Retrieve metadata for available time-series datasets.
2. Filter and display datasets that focus on a specific topic.
3. Identify which years and geographic levels are covered by each dataset.
4. Select a dataset for extraction.

### 2a. Retrieve Time-Series Metadata
First, we will take a look at the list of available NHGIS time-series datasets, which includes hundreds of data tables.  Running the code below will provide a snapshot of the first ten datasets in the list.

In [6]:
# get list of time-series dataset metadata
datts_meta <- get_metadata_nhgis("time_series_tables") %>% print(n = 10)

# A tibble: 389 × 7
   name  description        geographic_integration sequence time_series years   
   <chr> <chr>              <chr>                     <dbl> <list>      <list>  
 1 A00   Total Population   Nominal                    100. <tibble>    <tibble>
 2 AV0   Total Population   Nominal                    100. <tibble>    <tibble>
 3 B78   Total Population   Nominal                    100. <tibble>    <tibble>
 4 CL8   Total Population   Standardized to 2010       100. <tibble>    <tibble>
 5 A57   Persons by Urban/… Nominal                    101. <tibble>    <tibble>
 6 A59   Persons by Urban/… Nominal                    101. <tibble>    <tibble>
 7 CL9   Persons b

Note that each entry in the list includes not only the description and reference code for the dataset but also [tibbles](https://tibble.tidyverse.org) for "time_series", "years", and "geog_levels".  The information in the tibbles is not visualized in this high-level view of the data but you can imagine that for each "\<tibble>" entry there is another table of information containing additional details on the available data.  We will zoom in deeper in the following steps and you will be able to view the data contained within these tibbles.
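
To picture what a nested table inside a column looks like, the sketch below builds a tiny stand-in with one list-column. The table names `T1` and `T2` and their contents are invented; only the structure mirrors the metadata.

```r
# toy stand-in for the metadata table; names and values are illustrative only
toy_meta <- data.frame(name = c("T1", "T2"), stringsAsFactors = FALSE)
toy_meta$years <- list(
  data.frame(name = c("1990", "2000"), stringsAsFactors = FALSE),
  data.frame(name = c("2010"), stringsAsFactors = FALSE)
)

# each cell of the list-column holds its own small table
toy_meta$years[[1]]

# number of rows in each nested table
sapply(toy_meta$years, nrow)
```

Assuming the metadata object from step 2a is available as `datts_meta`, the same double-bracket indexing (for example `datts_meta$years[[1]]`) should show the nested years table for the first entry.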

This wealth of data is overwhelming and it is unlikely anyone would need it all for a single project.  So in the next step, we will programmatically filter the metadata to select only the datasets focused on a specific topic.

### 2b. Filter Metadata Based on Criteria

You will use the following code to retrieve metadata on available NHGIS time-series datasets and filter them to find datasets that focus on a specific topic.

In this example, we will explore only the datasets that focus on total population.  Therefore, we will filter the entire list of datasets to find the subset whose description includes the phrase "total population".

In [7]:
description_filter <- "total population"

In [8]:
datts_meta_filter <- datts_meta %>% filter(grepl(description_filter, description, ignore.case = T)) %>% select(name, description) %>% as.data.frame() %>% print()

  name      description
1  A00 Total Population
2  AV0 Total Population
3  B78 Total Population
4  CL8 Total Population


Using "total population" as a filter resulted in four potential datasets.  For additional detailed information on NHGIS time-series datasets, refer to the [NHGIS Time Series Tables lookup document](https://assets.nhgis.org/NHGIS_Time_Series_Tables.pdf).
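
The filter relies on *grepl()* with case-insensitive matching. The sketch below shows that behavior in isolation on a few made-up descriptions, which are hypothetical and not taken from the NHGIS metadata.

```r
# hypothetical descriptions, invented for illustration only
descriptions <- c("Total Population", "total population by sex", "Housing Units")

# ignore.case = TRUE makes the match case-insensitive
matches <- grepl("total population", descriptions, ignore.case = TRUE)

descriptions[matches]
```

Because *grepl()* matches substrings, any description that contains the phrase is kept, not only exact matches. A narrower or broader phrase in `description_filter` changes the result accordingly.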

Next, we will take a look at the metadata for this selection of datasets to determine which of them cover the time range and geographies we are interested in.

### 2c. Identify Available Years and Geographic Levels

This step will display the available years and geographic levels for the filtered datasets. This will help you decide which dataset best suits your analysis.

You can use the *get_metadata_nhgis* command to view metadata for a specific NHGIS time-series table using the table's code.  The following example shows the metadata for table "CL8".

In [128]:
get_metadata_nhgis(time_series_table = "CL8")

name,description,sequence
<chr>,<chr>,<int>
AA,Persons: Total,1

name,description,sequence
<chr>,<chr>,<int>
1990,1990,108
2000,2000,118
2010,2010,131
2020,2020,155

name,description,sequence
<chr>,<chr>,<int>
state,State,4
county,State--County,25
tract,State--County--Census Tract,66
blck_grp,State--County--Census Tract--Block Group,85
cty_sub,State--County--County Subdivision,102
place,State--Place,148
cd111th,"State--Congressional District (2007-2013, 110th-112th Congress)",217
cbsa,Metropolitan Statistical Area/Micropolitan Statistical Area,338
urb_area,Urban Area,372
zcta,5-Digit ZIP Code Tabulation Area,382


The metadata view shows that the CL8 time-series table includes total population information for 1990, 2000, 2010, and 2020 and for a variety of geographic levels.  Here we can see the information included in the "time_series", "years", and "geog_levels" tibbles which were obscured in the high-level view in step 2a.

We could repeat this process for each table from our data filtering process, but to save time, the code below takes the name, description, years, and geographic levels for each of the tables in our filtering results and presents the metadata in a simple reference table.

In [9]:
# get metadata for each time-series table
metadata_list <- map(datts_meta_filter$name, ~ get_metadata_nhgis(time_series_table = .x))

# combine into a data frame with the necessary columns
metadata_combined <- map_dfr(metadata_list, function(metadata) {
  data.frame(
    name = metadata$name,
    description = metadata$description,
    # collapse the year descriptions and the geographic level names from the nested tibbles
    years = paste(metadata$years$description, collapse = ", "),
    geog_levels = paste(metadata$geog_levels$name, collapse = ", ")
  )
})

# print the final data frame
metadata_combined

name,description,years,geog_levels
<chr>,<chr>,<chr>,<chr>
A00,Total Population,"1790, 1800, 1810, 1820, 1830, 1840, 1850, 1860, 1870, 1880, 1890, 1900, 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020","state, county"
AV0,Total Population,"1970, 1980, 1990, 2000, 2010, 2006-2010, 2007-2011, 2008-2012, 2009-2013, 2010-2014, 2011-2015, 2012-2016, 2013-2017, 2014-2018, 2015-2019, 2020, 2016-2020, 2017-2021, 2018-2022","state, county, tract, cty_sub, place"
B78,Total Population,"1980, 1990, 2000, 2010, 2006-2010, 2007-2011, 2008-2012, 2009-2013, 2010-2014, 2011-2015, 2012-2016, 2013-2017, 2014-2018, 2015-2019, 2020, 2016-2020, 2017-2021, 2018-2022","nation, region, division, state, county, tract, cty_sub, place"
CL8,Total Population,"1990, 2000, 2010, 2020","state, county, tract, blck_grp, cty_sub, place, cd111th, cbsa, urb_area, zcta"


Taking a look at these results, we can easily see the available years and geographies for each of the time-series tables we identified in our filtering process.

Note that the lists of years include both single years (e.g. "2000"), corresponding to Decennial Census population counts, and year ranges (e.g. "2008-2012"), corresponding to five-year average population estimates from the [American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs).
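
If you need to separate the two kinds of entries programmatically, one simple approach is to look for the hyphen. The sketch below uses a handful of year labels copied from the AV0 row of the reference table; the variable names are hypothetical.

```r
# a few year labels taken from the AV0 row of the reference table
year_labels <- c("1970", "2010", "2006-2010", "2020", "2018-2022")

# labels containing a hyphen are multi-year ranges
is_range <- grepl("-", year_labels, fixed = TRUE)

single_years <- year_labels[!is_range]
year_ranges  <- year_labels[is_range]
```

Here `single_years` holds the Decennial Census entries and `year_ranges` holds the ACS five-year entries.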

### 2d. Select a Dataset

Once we have decided on a specific dataset, we will save the table's code for use in our data extraction later on.

In this example, we select the 2010 harmonized dataset (CL8), which aligns data to standardized 2010 geographies.  You can change this line of code to correspond to whichever dataset you want.  You can also select multiple datasets using a character vector (e.g. *c("CL8", "A00")*).  However, if you choose to select multiple datasets, be mindful of the differences in available years and geographies among the datasets in your selection.

In [11]:
selection_datts <- "CL8"
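
If you do select more than one table, it can help to check which years and geographic levels they share. The sketch below compares A00 and CL8 using the values listed in the reference table from step 2c; the variable names are hypothetical.

```r
# years and geographic levels copied from the reference table in step 2c
a00_years <- as.character(seq(1790, 2020, by = 10))
cl8_years <- c("1990", "2000", "2010", "2020")
a00_geogs <- c("state", "county")
cl8_geogs <- c("state", "county", "tract", "blck_grp", "cty_sub",
               "place", "cd111th", "cbsa", "urb_area", "zcta")

# values available in both tables
shared_years <- intersect(a00_years, cl8_years)
shared_geogs <- intersect(a00_geogs, cl8_geogs)
```

In this case the two tables overlap only for 1990 through 2020 and only at the state and county levels, so a tract-level request would be possible for CL8 but not for A00.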

Now that we have selected our dataset, let's review its complete metadata details.  Verify that you have selected the correct dataset and that your selection meets your data needs.

In [12]:
get_metadata_nhgis(time_series_table = selection_datts)

name,description,sequence
<chr>,<chr>,<int>
AA,Persons: Total,1

name,description,sequence
<chr>,<chr>,<int>
1990,1990,108
2000,2000,118
2010,2010,131
2020,2020,155

name,description,sequence
<chr>,<chr>,<int>
state,State,4
county,State--County,25
tract,State--County--Census Tract,66
blck_grp,State--County--Census Tract--Block Group,85
cty_sub,State--County--County Subdivision,102
place,State--Place,148
cd111th,"State--Congressional District (2007-2013, 110th-112th Congress)",217
cbsa,Metropolitan Statistical Area/Micropolitan Statistical Area,338
urb_area,Urban Area,372
zcta,5-Digit ZIP Code Tabulation Area,382


## 3. NHGIS Geography Shapefile Metadata Exploration

As we saw in our metadata exploration above, the available geography levels vary based on the dataset.  If we want to extract geographic data along with our data table, we will need to review the available geographic data files and select an appropriate file for use with our data.

#### Steps:
1. Review dataset metadata.
2. Retrieve metadata for available shapefiles.
3. Filter and display shapefiles based on years and geographic level.
4. Select a shapefile for extraction.

### 3a. Review Time-Series Data Metadata

First we should review which geographies are available for our selected datatable and select a geographic level for our extraction.

In [3]:
get_metadata_nhgis(time_series_table = selection_datts)$geog_levels



Select one of the available geographies and save it for use in our data extraction later on.

In this example, we select the Census tract ("tract") geographies.  You can change this line of code to correspond to whichever geography you want.  Similar to the dataset selection in step 2d, you can also select multiple geographies using a character vector (e.g. *c("state", "county")*).

In [15]:
selection_geog <- "tract"

### 3b. Retrieve Geography Metadata

Shapefiles are a type of file format which contain geographic boundaries.  This type of file is essential for spatial analysis.  This section retrieves and filters shapefile metadata to identify shapefiles which correspond to our selected year and geography.  This filtering step ensures you have the correct geographic boundaries for the population data.

In [14]:
shp_meta <- get_metadata_nhgis("shapefiles")

### 3c. Filter Geography Metadata Based on Year and Geography

For this exercise, we will filter the shapefile metadata by your previously selected geography and by a specific year.

For this filtering step, you should also filter based on the year.  For this exercise, we are using time-series table CL8 which contains information on total population harmonized to 2010 geographies.  Therefore, we should only select a shapefile which corresponds to 2010 geographies.

In [16]:
selection_year <- "2010"

If you are unfamiliar with Census geographies, it might sound strange to include a year specification in this filtering step.  For large geographies, such as "nation" or "state", the year is relatively unimportant because the boundaries of these regions are not redrawn from year to year.  However, for smaller geographies, especially those related to the U.S. Decennial Census, such as "tract", "block", or "blck_grp" (block group), as well as those related to political districts, such as "cd" (congressional district), the boundaries can change over time.  Census tract, block group, and block boundaries are redrawn for each Decennial Census based on population numbers, and Congressional Districts are often redrawn for new congressional elections.  For this reason, it is essential to match your shapefile selection to your time-series data extraction.

Run the code below to list the available shapefiles based on your year and geography specifications.

In [17]:
shp_meta %>% filter(year == selection_year & grepl(selection_geog, geographic_level, ignore.case = T)) %>% print(n = Inf)

# A tibble: 4 × 6
  name                            year  geographic_level   extent basis sequence
  <chr>                           <chr> <chr>              <chr>  <chr>    <int>
1 us_tract_2010_tl2010            2010  Census Tract       Unite… 2010…      603
2 us_tract_cenpop_2010_cenpop2010 2010  Census Tract (Cen… Unite… 2010…      604
3 us_tract_2010_tl2020            2010  Census Tract       Unite… 2020…      605
4 us_ttract_2010_tl2010           2010  Tribal Census Tra… Unite… 2010…      641


The filtering step provides us with a list of potential shapefiles we can use for our extraction based on the year and geography criteria.

### 3d. Select a Geography Shapefile

For this exercise, we will select the 2010 Census tract dataset based on the 2010 TIGER line files (file "us_tract_2010_tl2010").  We will save this selection for use later in our data extraction step.

In [19]:
selection_shp <- "us_tract_2010_tl2010"
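
Before moving on, a lightweight sanity check can catch a mismatch between the selections. The sketch below assumes that shapefile names embed the geography and year, as the names listed in step 3c appear to do. That is a heuristic based on those names, not a documented guarantee, and `name_matches` is a hypothetical variable.

```r
selection_geog <- "tract"
selection_year <- "2010"
selection_shp  <- "us_tract_2010_tl2010"

# does the shapefile name mention both the chosen geography and year?
name_matches <- grepl(selection_geog, selection_shp, fixed = TRUE) &&
  grepl(selection_year, selection_shp, fixed = TRUE)

if (!name_matches) {
  warning("Shapefile name does not mention the selected geography and year.")
}
```

This check cannot tell apart files that differ only in their basis (for example the 2010 and 2020 TIGER/Line versions), so the `year` and `basis` columns of the shapefile metadata remain the authoritative reference.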

## 4. NHGIS Time-Series Dataset and Geography Shapefile Extraction Specification and Submission

Now that you've identified your dataset and shapefile, this section defines and submits an extraction request to the IPUMS NHGIS API. Extracting data from IPUMS NHGIS allows you to download specific datasets and geographical data directly from the IPUMS server. This method makes it easy to automate and reproduce data requests.  The extraction will include both the selected time-series data and the corresponding shapefiles.

#### Steps:
1. Define and Run the Data Extraction
2. Review the Data Extraction

### 4a. Define the Extraction Parameters and Run the Extraction

Here we will put everything together, including our time-series data table selection (selection_datts), our selected geography (selection_geog), and our selected shapefile (selection_shp).

In [5]:
extraction <- define_extract_nhgis(description = "I-GUIDE IPUMS Population Change Extraction",
                                   time_series_tables = tst_spec(name = selection_datts,
                                                                 geog_levels = selection_geog),
                                   shapefiles = selection_shp)
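
The extract definition can also be narrowed or made more explicit. The sketch below is a variation, not the request used in this notebook: it restricts the table to two years through the `years` argument of *tst_spec()* and names the table layout through `tst_layout`. It assumes your installed *ipumsr* version accepts `tst_layout = "time_by_column_layout"`; `extraction_example` is a hypothetical object name. Defining an extract does not contact the API, so nothing is submitted here.

```r
library(ipumsr)

# sketch: restrict the request to two years and state the layout explicitly
extraction_example <- define_extract_nhgis(
  description = "Example: CL8 tracts, 1990 and 2020 only",
  time_series_tables = tst_spec(
    name = "CL8",
    geog_levels = "tract",
    years = c("1990", "2020")
  ),
  tst_layout = "time_by_column_layout",  # assumed option: one column per year
  shapefiles = "us_tract_2010_tl2010"
)

extraction_example
```

Printing the object shows the request as it would be submitted, which is a convenient way to review the tables, years, and shapefiles before sending anything to IPUMS.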



Submit the extraction request and wait for it to complete, then download the resulting data.

In [1]:
# submit extraction  
extraction_submitted <- submit_extract(extraction)

# wait for completion
extraction_complete <- wait_for_extract(extraction_submitted)

# check completion
extraction_complete$status

# get extraction filepath
filepath <- download_extract(extraction_submitted, overwrite = T)



### 4b. Review the Extracted Files
If you followed along with this exercise, your data extraction and download should contain the following two files.  If you expanded your extraction to additional datasets and shapefiles, your extraction will contain additional files.

1. A dataset containing total population by Census tract (based on 2010 Census tract boundaries) for all available years in the CL8 time-series dataset (1990, 2000, 2010, and 2020).
2. A shapefile with 2010 Census tract boundaries.
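
Reading the files by position works when the download returns the tabular file first and the shapefile second. If you prefer not to rely on that order, one option is to pick each file by its name. The sketch below assumes the downloaded files end in `_csv.zip` and `_shape.zip`, a naming pattern commonly seen in NHGIS extracts; the paths shown are hypothetical.

```r
# hypothetical download paths, invented for illustration only
example_paths <- c("nhgis0001_shape.zip", "nhgis0001_csv.zip")

# pick each file by its suffix rather than by its position
data_path  <- example_paths[grepl("_csv\\.zip$", example_paths)]
shape_path <- example_paths[grepl("_shape\\.zip$", example_paths)]
```

With a real download, the same pattern matching could be applied to the `filepath` vector before calling *read_nhgis()* and *read_ipums_sf()*.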

In [50]:
# see files in extract
dat_raw <- read_nhgis(filepath[1])
shp_raw <- read_ipums_sf(filepath[2])

Use of data from NHGIS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.

Rows: 73057 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): GISJOIN, STATE, STATEA, COUNTY, COUNTYA, TRACTA
dbl (11): GEOGYEAR, CL8AA1990, CL8AA1990L, CL8AA1990U, CL8AA2000, CL8AA2000L...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [4]:
head(dat_raw)



## 5. Subset and Merge the Time-Series and Geography Data Extractions

This final section provides a few example data engineering next steps for reference.

1. First, the time-series population data is condensed to include only population counts from 1990, 2000, 2010, and 2020 and the column "GISJOIN" which contains unique codes for each Census tract.
2. Next, the population count columns are renamed.
3. Then the Census tract shapefile is condensed to include only the state FIPS code and "GISJOIN" columns.
4. Finally, the time-series population data is merged with the Census tract shapefile using the unique "GISJOIN" column as the join key.

### 5a. Subset the Time-Series and Geography Data



In [None]:
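# subset the time-series data to only necessary columns
dat <- dat_raw[c("GISJOIN", "CL8AA1990", "CL8AA2000", "CL8AA2010", "CL8AA2020")]

# rename the population count columns
dat <- dat %>% rename(pop1990 = CL8AA1990, pop2000 = CL8AA2000,
                      pop2010 = CL8AA2010, pop2020 = CL8AA2020)

# subset the shapefile to only necessary columns
shp <- shp_raw[c("GISJOIN", "STATEFP10")]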

### 5b. Merge the Time-Series and Geography Data 

In [None]:
# merge the time-series population data with the Census tract shapefile
dat <- merge(dat, shp, by = "GISJOIN")
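
By default, *merge()* keeps only rows whose key appears in both inputs, so tracts without a matching boundary would be dropped silently. The sketch below shows one way to check for unmatched keys on toy stand-ins. The first two GISJOIN codes come from the merged output shown in this section; the third code and all other values are invented for illustration.

```r
# toy stand-ins for the tabular data and the shapefile attribute table
toy_dat <- data.frame(
  GISJOIN = c("G0100010020100", "G0100010020200", "G9999999999999"),
  pop2010 = c(1912, 2170, 100),
  stringsAsFactors = FALSE
)
toy_shp <- data.frame(
  GISJOIN = c("G0100010020100", "G0100010020200"),
  STATEFP10 = c("01", "01"),
  stringsAsFactors = FALSE
)

# keys in the data with no matching boundary would be dropped by merge()
unmatched <- setdiff(toy_dat$GISJOIN, toy_shp$GISJOIN)

toy_merged <- merge(toy_dat, toy_shp, by = "GISJOIN")
nrow(toy_merged)
```

Running the same `setdiff()` on the real `dat$GISJOIN` and `shp$GISJOIN` columns before merging would show whether any tracts are about to be lost.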

The final merged dataset includes total population for 1990, 2000, 2010, and 2020 attached to the geographic boundaries of the 2010 Census tracts.  The code below provides a snapshot of the first six rows of the final merged dataset.

In [57]:
head(dat)

,GISJOIN,pop1990,pop2000,pop2010,pop2020,STATEFP10,geometry
,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<MULTIPOLYGON [m]>
1,G0100010020100,1772.67,1920.02,1912,1775,1,MULTIPOLYGON (((888438 -515...
2,G0100010020200,2031.0,1892.0,2170,2055,1,MULTIPOLYGON (((889844.1 -5...
3,G0100010020300,2952.0,3339.0,3373,3216,1,MULTIPOLYGON (((891383.8 -5...
4,G0100010020400,4401.0,4556.0,4386,4246,1,MULTIPOLYGON (((892527.3 -5...
5,G0100010020500,3120.68,6041.9,10766,11222,1,MULTIPOLYGON (((895451 -522...
6,G0100010020600,3330.0,3272.0,3668,3729,1,MULTIPOLYGON (((889098.5 -5...


## Conclusion
This script provides a step-by-step guide to extracting and preparing U.S. Census population data from the IPUMS NHGIS for spatial analysis. By following this approach, social scientists and researchers can automate data extraction tasks, save time, and ensure reproducibility.

The data and shapefile merging capabilities allow users to explore population trends across various geographic levels and time periods. Feel free to customize the script to fit your specific research needs.

## Next Steps

From here, we recommend exploring the following notebooks:

* **Exploratory Data Spatial Analysis (ESDA) with IPUMS NHGIS**

## Quick Code

Don't forget to update the code with your IPUMS API key!

In [18]:
# install necessary packages
#install.packages(c("dplyr", "ipumsr", "purrr"))

# load necessary libraries
library(dplyr)
library(ipumsr)
library(purrr)

# set IPUMS API key
ipums_api_key <- "paste your api key here"
set_ipums_api_key(ipums_api_key, save = T, overwrite = T)

# define extract specifications
selection_datts <- "CL8"                  # time-series data table
selection_geog <- "tract"                 # geographic level
selection_year <- "2010"                  # year(s)
selection_shp <- "us_tract_2010_tl2010"   # shapefile

# set up the data extraction
extraction <- define_extract_nhgis(description = "IPUMS NHGIS Data Extraction",
                                   time_series_tables = tst_spec(name = selection_datts,
                                                                 geog_levels = selection_geog,
                                                                 years = selection_year),
                                   shapefiles = selection_shp)

# submit extraction and extract the data
extraction_submitted <- submit_extract(extraction)                  # submit the extraction  
extraction_complete <- wait_for_extract(extraction_submitted)       # wait for completion
extraction_complete$status                                          # check completion
filepath <- download_extract(extraction_submitted, overwrite = T)   # get extraction filepath

# extract the files
dat_raw <- read_nhgis(filepath[1])
shp_raw <- read_ipums_sf(filepath[2])

# merge the data and geography files
dat <- merge(dat_raw, shp_raw, by = "GISJOIN")

# save as RDS file
saveRDS(dat, "dat_ipums_nhgis.rds")

Existing .Renviron file copied to /home/jovyan/.Renviron_backup for backup purposes.

The environment variable IPUMS_API_KEY has been set and saved for future sessions.



ERROR: Error in `ipums_api_request()`:
! The provided API key is either missing or invalid.
ℹ Please provide your API key to the `api_key` argument or request a key at https://account.ipums.org/api_keys
ℹ Use `set_ipums_api_key()` to save your key for future use.
