# IPUMS NHGIS Data Extraction Using ipumsr
### by [Kate Vavra-Musser](https://vavramusser.github.io) for the [R Spatial Notebook Series](https://vavramusser.github.io/r-spatial)

This notebook builds on the workflow introduced in the **[Introduction to the IPUMS API for R Users](https://tech.popdata.org/ipumsr/articles/ipums-api.html)** article on the IPUMS website.  As the author of the R Spatial Notebook series, I recognize the IPUMS article as a significant inspiration and source of information for this notebook.

## Introduction
The [IPUMS NHGIS](https://www.nhgis.org) database offers harmonized summary data and geographic boundary files from U.S. censuses and surveys, providing a resource for spatial analysis of demographic, social, and economic trends. It enables users to access aggregated data at various geographic levels, such as states, counties, and census tracts, facilitating the exploration of population dynamics and regional patterns over time. Through harmonization, IPUMS NHGIS ensures that data can be seamlessly compared across years, despite changes in geographic boundaries, variable definitions, and survey methodologies.

**From the [IPUMS NHGIS Webpage](https://www.nhgis.org):** The National Historical Geographic Information System (NHGIS) provides easy access to summary tables and time series of population, housing, agriculture, and economic data, along with GIS-compatible mapping files, for years from 1790 through the present and for all levels of U.S. census geography, including states, counties, tracts, and blocks.

#### Data Included in the IPUMS NHGIS Repository (Available Using the IPUMS API)
* [Decennial Census](https://www.census.gov/programs-surveys/decennial-census.html) data from 1790 to present
* [American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs/about.html) data from 2009 to present

#### Additional NHGIS Data Available Using the [NHGIS Online Data Finder](https://data2.nhgis.org/main)
* [Census of Agriculture](https://www.nass.usda.gov/AgCensus) data from 1850 to 1959
* [County Business Patterns (CBP)](https://www.census.gov/programs-surveys/cbp.html) data from 1970 to 2002
* Census of Religious Bodies data from 1906, 1916, 1926, and 1936
* Marriage and Divorce data from 1867 to 2010
* Natality and Mortality Data from 1915 to 2007
* 1952 Survey of Churches and Church Membership
* 1920-1936 FDIC Bank Deposit Data
* 1925 Special Census of Detroit
* 1937 Census of Unemployment

### About the Decennial Census
The United States **[Decennial Census](https://www.census.gov/programs-surveys/decennial-census.html)** is a population and housing count conducted by the [U.S. Census Bureau](https://www.census.gov) every ten years. The Census aims to count every person living in the United States and its territories, collecting basic demographic information such as age, sex, race, ethnicity, and household relationships. This data is used primarily to allocate seats in the U.S. House of Representatives, redraw congressional and legislative districts, and distribute federal funding to states and local communities.

The Decennial Census is designed to provide a comprehensive snapshot of the nation's population and housing characteristics at a specific point in time. The data collected is crucial for planning and policy-making, as well as for guiding resource allocation for schools, hospitals, infrastructure projects, and emergency response services. Unlike the more detailed American Community Survey (ACS), which samples a subset of the population annually, the Decennial Census seeks to account for the entire population in one effort, making it a critical tool for understanding population dynamics over time.

### About the American Community Survey (ACS)
The **[American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs/about.html)** is an annual survey conducted by the [U.S. Census Bureau](https://www.census.gov) that collects information on a subset of the U.S. population.  The ACS collects data on a variety of topics, including income, poverty, education, marital status, health insurance coverage, disability, occupancy, costs, tenure, and units by type.  It is a more in-depth supplement to the Decennial U.S. Census and in 2005 replaced the long-form version of the Decennial Census survey which was previously conducted every ten years.  Each year the ACS samples over 3.5 million housing units across the United States with a new sample of about 250,000 addresses drawn each month.

ACS data is available as single-year datasets as well as three- and five-year summaries of the data.  While single-year data provide a snapshot of conditions in a specific year, the three- and five-year summaries offer more stable estimates by averaging data over time, making them less susceptible to anomalies and more useful for analyzing smaller geographic areas.

### Notebook Goals
This notebook introduces the process of extracting [IPUMS NHGIS](https://www.nhgis.org) data using the [IPUMS API](https://developer.ipums.org/docs/v2/apiprogram) via the [ipumsr R package](https://cran.r-project.org/web/packages/ipumsr/index.html). Users will learn how to define, submit, and download an IPUMS NHGIS data extract, specifying desired variables, time periods, and geographic units for analysis. By the end of this notebook, users will have the skills to efficiently acquire customized IPUMS NHGIS datasets and prepare them for spatial and statistical workflows.

### ✨ Prerequisites ✨
* Complete [Introduction to IPUMS and the IPUMS API](https://platform.i-guide.io/notebooks/82d3b176-e4e6-4307-8186-318a3fe6c81a)
* Set Up Your [IPUMS Account and API Key](https://account.ipums.org/api_keys)
* Complete [Introduction to sf: Reading, Writing, and Inspecting Vector Data](https://platform.i-guide.io/notebooks/9968babe-22e4-4c3d-98e2-d8b45e9672cd)

### Notebook Overview
1. Setup
2. IPUMS NHGIS Time-Series Data Metadata Exploration
3. IPUMS NHGIS Geography Shapefile Metadata Exploration
4. IPUMS NHGIS Time-Series Data and Geography Shapefile Extraction Specification and Submission
5. Subset and Merge the Time-Series and Geography Data Extractions

## 1. Setup
This section will guide you through the process of installing essential packages and setting your IPUMS API key.

#### Required Packages

[**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) · A Grammar of Data Manipulation. This notebook uses the following functions from *dplyr*.

* [*filter*](https://rdrr.io/cran/dplyr/man/filter.html) · keep rows that match a condition
* [*select*](https://rdrr.io/cran/dplyr/man/select.html) · keep or drop columns using their names and types
* [*rename*](https://rdrr.io/cran/dplyr/man/rename.html) · rename columns
* This notebook also uses [*%>%*](https://magrittr.tidyverse.org/reference/pipe.html), referred to as the *pipe* operator.  The *pipe* operator passes the output from one function directly into the next function, which makes for streamlined workflows, and is a commonly used component of the [*tidyverse*](https://www.tidyverse.org).

[**geojsonio**](https://cran.r-project.org/web/packages/geojsonio/index.html) · Convert Data from and to *[GeoJSON](https://geojson.org)* or *[TopoJSON](https://github.com/topojson/topojson)*.

[**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) · An R Interface for Downloading, Reading, and Handling IPUMS Data.  This notebook uses the following functions from *ipumsr*.

* [*define_extract_nhgis*](https://rdrr.io/cran/ipumsr/man/define_extract_nhgis.html) · define an IPUMS NHGIS extract request
* [*download_extract*](https://rdrr.io/cran/ipumsr/man/download_extract.html) · download a completed IPUMS data extract
* [*get_metadata_nhgis*](https://rdrr.io/cran/ipumsr/man/get_metadata_nhgis.html) · list available data sources from IPUMS NHGIS
* [*read_ipums_sf*](https://rdrr.io/cran/ipumsr/man/read_ipums_sf.html) · read spatial data from an IPUMS extract
* [*read_nhgis*](https://rdrr.io/cran/ipumsr/man/read_nhgis.html) · read tabular data from an NHGIS extract
* [*set_ipums_api_key*](https://rdrr.io/cran/ipumsr/man/set_ipums_api_key.html) · set your IPUMS API key
* [*submit_extract*](https://rdrr.io/cran/ipumsr/man/submit_extract.html) · submit an extract request via the IPUMS API
* *tst_spec* · create a *tst_spec* object containing a time series table specification
* [*wait_for_extract*](https://rdrr.io/cran/ipumsr/man/wait_for_extract.html) · wait for an extract to finish processing

[**purrr**](https://cran.r-project.org/web/packages/purrr/index.html) · A complete and consistent functional programming toolkit for R. This notebook uses the following functions from *purrr*.

* [*map()*](https://rdrr.io/cran/purrr/man/map.html) and [*map_dfr()*](https://rdrr.io/cran/purrr/man/map_dfr.html) · apply a function to each element of a vector

[**sf**](https://cran.r-project.org/web/packages/sf/index.html) Support for simple features, a standardized way to encode spatial vector data. Binds to 'GDAL' for reading and writing data, to 'GEOS' for geometrical operations, and to 'PROJ' for projection conversions and datum transformations. Uses by default the 's2' package for spherical geometry operations on ellipsoidal (long/lat) coordinates.  This notebook uses the following functions from *sf*.

* [*st_write*](https://rdrr.io/cran/sf/man/st_write.html) · Write simple features object to file or database

### 1a. Install and Load Required Packages
If you have not already installed the required packages, uncomment and run the code below:

In [None]:
# install.packages(c("dplyr", "geojsonio", "ipumsr", "purrr", "sf"))
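As an alternative to reinstalling everything, the sketch below checks which of the required packages are not yet available and leaves the install step commented out, in the same spirit as the cell above. The helper name *find_missing_packages* is made up for this illustration and is not part of any package.

```r
# Packages this notebook relies on.
required_packages <- c("dplyr", "geojsonio", "ipumsr", "purrr", "sf")

# Hypothetical helper: return the subset of `pkgs` that cannot be loaded
# in the current R session.
find_missing_packages <- function(pkgs) {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_installed]
}

missing_packages <- find_missing_packages(required_packages)
print(missing_packages)

# Uncomment to install only what is missing.
# if (length(missing_packages) > 0) install.packages(missing_packages)
```

An empty result (`character(0)`) means all five packages are already installed.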

Load the packages into your workspace.

In [None]:
library(dplyr)
library(geojsonio)
library(ipumsr)
library(purrr)
library(sf)

### 1b. Set Your IPUMS API Key

Store your [IPUMS API key](https://account.ipums.org/api_keys) in your environment using the following code.

Refer to [Chapter 1.1 Introduction to IPUMS and the IPUMS API](https://platform.i-guide.io/notebooks/82d3b176-e4e6-4307-8186-318a3fe6c81a) for instructions on setting up your IPUMS account and API key.

In [None]:
ipums_api_key <- readline("Please enter your IPUMS API key: ")
set_ipums_api_key(ipums_api_key, save = TRUE, overwrite = TRUE)
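If you are unsure whether a key has already been saved, a quick check such as the following may help. This is a minimal sketch that assumes the key is stored under the *IPUMS_API_KEY* environment variable, which is where *set_ipums_api_key* places it.

```r
# Look for a key previously saved under the IPUMS_API_KEY environment variable.
existing_key <- Sys.getenv("IPUMS_API_KEY")

if (nzchar(existing_key)) {
  message("An IPUMS API key is already set for this session.")
} else {
  message("No IPUMS API key found; run the cell above to set one.")
}
```

The check only reports whether a value exists; it does not confirm that the key is still valid.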

## 2. NHGIS Time-Series Data Metadata Exploration

Before submitting an IPUMS data extraction request, it’s essential to ensure the parameters of the extraction definition are set up correctly.  The extraction definition specifies the sample, variables, geographic levels, and other options.

If this is your first time using the IPUMS API in R, or if you are setting up a new data extract for a new project, it is a good idea to start by exploring the available data which can be done using the *ipumsr* package.

Because we are extracting data from the IPUMS NHGIS repository, we will carry out the process of setting up our extraction in two phases.  First we will set up the extraction parameters for the tabular component of our data and second we will set up the extraction parameters for the spatial component of our data.

**★ Pro Tip:** The NHGIS extraction setup process in R is significantly different from the extraction process used for other IPUMS data repositories.  If this is your first time setting up an NHGIS data extraction in R, please be sure to carefully follow all steps in this notebook, even if you have previously used R to extract data from other IPUMS repositories.

### 2a. Review the List of Time-Series Datasets

First, let's take a look at the entire database of time-series datasets available from the [IPUMS NHGIS data repository](https://www.nhgis.org).  The NHGIS data available for direct extraction using the IPUMS API include the [Decennial Census](https://www.census.gov/programs-surveys/decennial-census.html) and [American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs).

**★ Pro Tip:** The [NHGIS Online Data Finder](https://data2.nhgis.org/main) provides access to other NHGIS data sources not available via API including the [Census of Agriculture](https://www.nass.usda.gov/AgCensus), [County Business Patterns (CBP)](https://www.census.gov/programs-surveys/cbp.html), and other historic data samples.

For this step, we will use the [*get_metadata_nhgis*](https://rdrr.io/cran/ipumsr/man/get_metadata_nhgis.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package.  This function will return a table of all time-series tables within the IPUMS NHGIS data repository which are available to be downloaded using the IPUMS API.  This code stores the metadata for all available time-series tables in the IPUMS NHGIS repository in the object *metadata_datts* and prints the first 10 rows of the metadata table so we can preview the results.

In [None]:
# get list of time-series dataset metadata
metadata_datts <- get_metadata_nhgis("time_series_tables") %>% print(n = 10)

From this preview, we can see that the IPUMS NHGIS metadata table has a **name**, corresponding to a sample identification code, a **description**, providing a short description or label for each sample, **geographic_integration**, providing information on the type of spatial data which can be linked to it, **sequence**, and three columns (**time_series**, **years**, and **geog_levels**) which are each represented as [*tibbles*](https://tibble.tidyverse.org).  We will need to select a sample and make note of its sample identification code (**name**) which we will use when defining our data extraction.

**★ Pro Tip:** If you are working in Jupyter Notebooks your view of the columns in the data table may be truncated.  Take a look at the bottom of the table view to see the number of rows and variables (columns) which are not included in the preview.

A [*tibble*](https://tibble.tidyverse.org) can be thought of as a version of a data.frame that includes additional functionality and metadata visibility.  It is also more compatible with the [*tidyverse*](https://www.tidyverse.org) packages, including the [*dplyr*](https://cran.r-project.org/web/packages/dplyr/index.html) package we use in this notebook.  The information in the *tibble* columns is not displayed in this high-level view of the data but you can imagine that for each *\<tibble>* entry there is another table of information containing additional details on the available data.  We will zoom in deeper in the following steps and you will be able to view the data contained within these tibbles.
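To make the idea of a nested *tibble* concrete, here is a small sketch built entirely from made-up values. The table codes *X01* and *X02* are placeholders for illustration and do not exist in NHGIS.

```r
library(tibble)

# Toy metadata table; table codes and values are made up for illustration.
toy_metadata <- tibble(
  name = c("X01", "X02"),
  description = c("Toy table one", "Toy table two"),
  # A list-column: each row holds its own small table of years.
  years = list(
    tibble(name = c("1990", "2000"), description = c("1990", "2000")),
    tibble(name = "2010", description = "2010")
  )
)

# The high-level view shows each nested table only as a <tibble> placeholder.
print(toy_metadata)

# Pull out the nested table for the first row to see its contents.
toy_metadata$years[[1]]
```

The double-bracket index `[[1]]` returns the nested table itself, whereas single brackets would return a one-element list.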

From this view we can also see that there are 389 samples available from IPUMS NHGIS.  If we consider that each *\<tibble>* entry represents an additional table with multiple options for *time_series*, *years*, and *geog_levels* for each dataset, the wealth of data becomes overwhelming.

At first glance, it might be difficult to understand what data is contained in each sample, especially if you are not used to working with US Census and ACS data.  Refer to the [Overview of NHGIS Datasets](https://www.nhgis.org/overview-nhgis-datasets) page on the IPUMS NHGIS website for a list of all IPUMS NHGIS samples and more detailed information on each sample and the [List of NHGIS Time Series Tables](https://assets.nhgis.org/NHGIS_Time_Series_Tables.pdf) for a list of time-series sample identification codes and additional information.

### 2b. Filter Metadata by Criteria

If you already know which sample you want to use, you could explore the samples list until you find the appropriate sample identification code (**name**).  Alternatively, you could use the [*filter*](https://rdrr.io/cran/dplyr/man/filter.html) function from the [**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) package in conjunction with a pattern-matching function, such as base R's *grepl* (used below) or the [*str_detect*](https://stringr.tidyverse.org/reference/str_detect.html) function from the [**stringr**](https://cran.r-project.org/web/packages/stringr/index.html) package, to filter the list of samples down to the subset which may be relevant for your project.

For this exercise, we will filter the list of sample metadata *metadata_datts* to only samples which have the phrase *total population* in their descriptions.  The code below uses the [*filter*](https://rdrr.io/cran/dplyr/man/filter.html) function in conjunction with the [*select*](https://rdrr.io/cran/dplyr/man/select.html), *as.data.frame*, and *print* functions to filter the metadata table to only the datasets which meet our criteria, keep only the *name* and *description* for the resulting datasets, and print the results as a *data.frame* rather than the *tibble* format which the *metadata_datts* object is stored in.  The *as.data.frame* and *print* steps are purely to create a clean and easy-to-read result of the filter process and show us only the most basic information we need to be able to proceed to our next step.  Because the assignment captures the output of the whole pipeline, the *metadata_datts_filter* object stores this trimmed two-column table (*name* and *description*), which is all we need for the next step.

In [None]:
metadata_datts_filter <- metadata_datts %>%
    filter(grepl("total population", description, ignore.case = T)) %>%
    select(name, description) %>%
    as.data.frame() %>%
    print()

The filtering process has returned four relevant samples which have total population information.  Next we will dig deeper into the metadata for this selection of datasets to determine which of the datasets includes information on the time range and geographies we are interested in.
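The same kind of filter can also be written with *str_detect* from **stringr**. The sketch below uses a made-up table (the codes *X01* to *X03* and their descriptions are placeholders) and keeps the full filtered table in one object while printing a trimmed view separately, which is one way to retain all metadata columns if you need them later.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy stand-in for the metadata table; codes and descriptions are made up.
toy_tables <- tibble(
  name = c("X01", "X02", "X03"),
  description = c("Total Population", "Persons by Sex",
                  "Total population by urban status"),
  sequence = c(1, 2, 3)
)

# Keep every column of the rows whose description mentions "total population".
toy_tables_filter <- toy_tables %>%
  filter(str_detect(description, regex("total population", ignore_case = TRUE)))

# Print a trimmed, easy-to-read view without changing the stored object.
toy_tables_filter %>%
  select(name, description) %>%
  as.data.frame() %>%
  print()
```

Splitting the assignment from the printing step means *toy_tables_filter* still contains the *sequence* column after the preview is shown.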

### 2c. Identify Available Years and Geographic Levels

Next we will display the available years and geographic levels for the potential datasets.  This step reveals the information which is stored as nested *tibbles* for each sample and not visible in the high-level metadata table.  We will use the [*get_metadata_nhgis*](https://rdrr.io/cran/ipumsr/man/get_metadata_nhgis.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package to view metadata details for a specific NHGIS time-series table by passing the identification code (*name*) to the function.

The following example shows the metadata for the *CL8* time-series table.

In [None]:
get_metadata_nhgis(time_series_table = "CL8")

The metadata view shows that the *CL8* time-series table has the following characteristics:

* the table includes data on **total population**
* it is **standardized to 2010 geographies**
* the data is available for the **years 1990, 2000, 2010, and 2020**
* and available at **multiple geography levels** including state, county, tract, block group, county subdivision, place, congressional district, CBSA, urban area, and ZIP code.

This view shows us all the information included in the **time_series**, **years**, and **geog_levels** *tibbles* which were obscured in the high-level view in step 2a.

We could repeat this process for each of the four time-series tables which were returned in our data filtering process, but to save us some time, the code below takes the **name**, **description**, **years**, and **geographic levels** information for each of the tables in our filtering results and presents the metadata in a simple reference table.

In [None]:
# get metadata for each time-series table
metadata_list <- map(metadata_datts_filter$name, ~ get_metadata_nhgis(time_series_table = .x))

# combine into a data frame with the necessary columns
metadata_combined <- map_dfr(metadata_list, function(metadata) {
  data.frame(
    name = metadata$name,
    description = metadata$description,
    # collapse the "description" column of "years" and the "name" column of "geog_levels" into single strings
    years = paste(metadata$years$description, collapse = ", "),
    geog_levels = paste(metadata$geog_levels$name, collapse = ", ")
  )
})

# print the final data frame
metadata_combined

Looking at these results, we can easily see the available years and geographies for each of the time-series tables we identified in our earlier filtering process.
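Once the years and geographic levels are collapsed into comma-separated strings, a small helper can check which tables cover a particular year and level. The sketch below uses a made-up reference table, and the helper name *has_item* is hypothetical. Splitting on the separator avoids partial matches, such as "county" matching a longer code that merely contains it.

```r
# Toy reference table shaped like the combined metadata; values are made up.
toy_combined <- data.frame(
  name = c("X01", "X02", "X03"),
  years = c("1990, 2000, 2010", "2000, 2008-2012", "2010, 2020"),
  geog_levels = c("state, county", "state, county, tract", "nation, state"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where `item` appears as a whole entry
# in each comma-separated string of `x`.
has_item <- function(x, item) {
  vapply(strsplit(x, ", ", fixed = TRUE),
         function(parts) item %in% parts,
         logical(1))
}

# Tables that include both the year 2010 and the county level.
keep <- has_item(toy_combined$years, "2010") &
  has_item(toy_combined$geog_levels, "county")
toy_combined$name[keep]
```

With these toy values only *X01* satisfies both conditions.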

**★ Pro Tip:** Note that the lists of year ranges include both single years (e.g. "2000") from the [Decennial Census](https://www.census.gov/programs-surveys/decennial-census.html) and year ranges (e.g. "2008-2012") corresponding to five-year population estimates from the [American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs).  For example, the *B78* table includes total population counts for 1980, 1990, 2000, 2010, and 2020 from the Decennial Census and five-year population estimates from the ACS for 2007-2011, 2008-2012, 2009-2013, 2010-2014, 2011-2015, 2012-2016, 2013-2017, 2014-2018, 2015-2019, 2020, 2016-2020, 2017-2021, 2018-2022, and 2019-2023.  Refer to the [List of NHGIS Time Series Tables](https://assets.nhgis.org/NHGIS_Time_Series_Tables.pdf) on the IPUMS NHGIS website for additional information on data sources for each of the time-series tables in the NHGIS repository.

**★ Pro Tip:** For this exercise we focused on time-series datasets with information on total population.  However, the IPUMS NHGIS repository includes data on many more topics and many other sources besides the Decennial Census and ACS.  Refer to the [Overview of NHGIS Datasets](https://www.nhgis.org/overview-nhgis-datasets) and the [List of NHGIS Time Series Tables](https://assets.nhgis.org/NHGIS_Time_Series_Tables.pdf) on the IPUMS NHGIS website for additional information on all available datasets.

### 2d. Select a Time-Series Dataset

Once we have decided on a specific dataset, we will save the table's code for use in our data extraction later on.

For this example, we will select **table CL8** which includes total population information from the Decennial Census for 1990, 2000, 2010, and 2020 harmonized to 2010 geographies.  You could simply make a mental note of the identification code *CL8* but to make things easier for us we will save the table's code to the *selection_datts* object and use this for our data extraction later on.

**★ Pro Tip:** You can request multiple datasets in a single extract (e.g. *CL8* and *A00*) by supplying one *tst_spec* object per identification code in the data extraction step.  However, if you choose to select multiple datasets, be mindful of the differences in available years and geographies for the datasets in your selection.

In [None]:
selection_datts <- "CL8"

Now that we have selected our dataset, let's review the table's complete metadata details again.  It's a good idea to verify that we have selected the correct dataset and that our selection meets the needs of our project.

In [None]:
get_metadata_nhgis(time_series_table = selection_datts)

## 3. NHGIS Geography Shapefile Metadata Exploration

Now that we have explored the metadata for the NHGIS time-series tables and made a selection, the next step is to explore the metadata for the NHGIS spatial files and make our selection for the geographic component of our extraction.  Refer to the [list of GIS Files](https://www.nhgis.org/gis-files) on the IPUMS NHGIS website for information on the spatial data available from the NHGIS repository.

GIS files in the NHGIS repository are stored as [shapefiles](https://en.wikipedia.org/wiki/Shapefile).  A shapefile is a commonly-used type of file format for geographic vector data.  Vector data is one of the main types of geographic data and can represent point locations, lines, or polygon (geographic boundary) data.  The GIS files in the NHGIS repository include geographic boundary data relevant to the Census and ACS including boundaries for states, counties, Census tracts, ZIP code areas, and many more.

As we saw in our metadata exploration above, available geographies vary based on the dataset.  If we want to extract geographic data along with our time-series data table, we will need to make sure we select an appropriate file for use with our data.

### 3a. Review Time-Series Data Metadata

First let's review which geographies are relevant for the time-series table we selected and choose a geographic level for our data extraction from this list.  As we saw in the time-series exploration steps, the available geographic levels vary based on the dataset.

In [None]:
get_metadata_nhgis(time_series_table = selection_datts)$geog_levels

For this example, let's choose the *county* geographic level for our extraction.  Let's save the code to the *selection_geog* object for use later on.

**★ Pro Tip:** Similar to the dataset selection step in 2d, you can select multiple geographic levels by passing a vector of geography codes (e.g. *c("state", "county")*) in the extraction step.

In [None]:
selection_geog <- "county"

### 3b. Retrieve and Filter Shapefile Metadata by Criteria

Next we will retrieve and filter the shapefile metadata and identify which shapefile we want to extract along with our time-series data.

In this filtering step, we will also filter the available GIS files based on the year, which will help us simplify the list of shapefiles to choose from.  Since we are using time-series table *CL8* for this exercise, which contains information on total population harmonized to 2010 geographies, we will select a shapefile which corresponds to 2010 geographies.

**★ Pro Tip:** If you are unfamiliar with Census geographies, it might sound strange to include a year specification in your filtering step.  For large geographies, such as "nation" or "state", the year is relatively unimportant because the boundaries of these regions do not change over time.  However, for smaller geographies, especially those related to the U.S. Decennial Census, such as *tract*, *block*, or *blck_grp* (block group), as well as those related to political districts, such as *cd* (congressional district), the geographic boundaries can change over time.  Census tract, block group, and block boundaries are redrawn for each Decennial Census based on population numbers, and Congressional Districts are often redrawn for new congressional elections.  For this reason, it is essential to carefully choose your shapefiles to correspond with your time-series data extraction.

The code below uses the [*get_metadata_nhgis*](https://rdrr.io/cran/ipumsr/man/get_metadata_nhgis.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package to retrieve metadata on the NHGIS shapefiles and then filters the metadata to only the files for *2010* and at our previously-selected geographic level (county).

In [None]:
metadata_shp <- get_metadata_nhgis("shapefiles") %>%
    filter(year == 2010 & grepl(selection_geog, geographic_level, ignore.case = T)) %>%
    print(n = Inf)

### 3c. Select a Geography Shapefile

The filtering process has returned five relevant shapefiles which are all from 2010.  From this list we can see that two of the shapefiles are at the *county* geographic level, two are at the *county subdivision* geographic level, and one is at the *county (centers of population)* geographic level.  All five were returned because they all matched our filtering step which searched for files with the word "county" in the *geographic_level* metadata slot.
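If you would rather match the geographic level exactly instead of by substring, a comparison on the lower-cased label is one option. The sketch below uses a made-up table; the file names are placeholders and the exact capitalization of the real *geographic_level* values is an assumption, which is why the comparison is made case-insensitive.

```r
library(dplyr)

# Toy stand-in for the shapefile metadata; file names are placeholders.
toy_shp <- data.frame(
  name = c("toy_county_file", "toy_county_subdivision_file",
           "toy_county_centers_file"),
  geographic_level = c("County", "County Subdivision",
                       "County (Centers of Population)"),
  stringsAsFactors = FALSE
)

# Keep only rows whose level is exactly "county", ignoring case.
toy_shp_exact <- toy_shp %>%
  filter(tolower(geographic_level) == "county")

print(toy_shp_exact)
```

With an exact match the subdivision and centers-of-population rows drop out, leaving only the plain county file.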

Because we are specifically interested in *county* data, we will choose from the two *county* files:

* *us_county_2010_tl2010* (2010 Tiger/Line files)
* *us_county_2010_tl2020* ([2010 boundaries based on 2020 Tiger/Line files](https://www.nhgis.org/gis-files#2010-from-2020:~:text=coastal%20water%20areas.-,2010%20Boundaries%20Based%20on%202020%20TIGER/Line%20Files,-NHGIS%20also%20provides))

**★ Pro Tip:** [TIGER/Line shapefiles](https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html) are a collection of geographic datasets that contain information about the United States. They are derived from the [U.S. Census Bureau's](https://www.census.gov) Master Address File/Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) database.

For this exercise we will use the **2010 Tiger/Line files** which are referred to using identification code **us_county_2010_tl2010**.  Again we will store the identification code so we can easily retrieve it later on.

In [None]:
selection_shp <- "us_county_2010_tl2010"

## 4. NHGIS Time-Series Dataset and Geography Shapefile Extraction Specification and Submission

Once we have reviewed the available data and decided on the time-series tables and GIS files we want, the next step is to set up a data extraction using the [*define_extract_nhgis*](https://rdrr.io/cran/ipumsr/man/define_extract_nhgis.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package.  This function requires the following minimum parameters:

* *description* · text description of the extract
* *time_series_tables* · a *tst_spec* object (or a list of *tst_spec* objects, one per table) which specifies the parameters of a time-series table and requires the following minimum parameters:
  * *name* · identification code of the time-series table to include in the extract
  * *geog_levels* · vector of geographic levels to be included in the extract; geographic levels should be specified using the geographic level identification codes
* *shapefiles* · vector of GIS shapefiles to include in the extract
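The parameters listed above can also describe a request for more than one time-series table. The following is a minimal sketch, assuming that *time_series_tables* accepts a list of *tst_spec* objects and that building the definition happens locally without contacting the API. The code *A00* is used here only as a second example table, and nothing is submitted.

```r
library(ipumsr)
library(purrr)

# Table codes to request; "A00" serves only as a second example code.
table_codes <- c("CL8", "A00")

# Build one tst_spec per table, all at the county level.
table_specs <- map(table_codes, ~ tst_spec(name = .x, geog_levels = "county"))

# Combine the table specifications and a shapefile into one definition.
multi_definition <- define_extract_nhgis(
  description = "Example extract with two time-series tables",
  time_series_tables = table_specs,
  shapefiles = "us_county_2010_tl2010"
)

# Print the definition to review it before deciding whether to submit.
multi_definition
```

Before submitting a definition like this, it is worth confirming in the metadata that every requested table is actually offered at the chosen geographic level.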

### 4a. Define the Data Extract

We already know what we will pass to the function for the *name* ("CL8") and *geog_levels* ("county") of the *tst_spec* object passed to the *time_series_tables* parameter and for the *shapefiles* ("us_county_2010_tl2010") parameter.  We are ready to define our data extract request.  In this step we will add a text description of the request which can be anything and is included to help us differentiate between requests.  For this extract we will use the simple description "I-GUIDE NHGIS Data Extraction".

Here we pass all the extraction definition information to the [*define_extract_nhgis*](https://rdrr.io/cran/ipumsr/man/define_extract_nhgis.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package and store the resulting extraction definition in the object *extract_definition*.

In [None]:
extract_definition <- define_extract_nhgis(description = "I-GUIDE NHGIS Data Extraction",
                                           time_series_tables = tst_spec(name = selection_datts,
                                                                         geog_levels = selection_geog),
                                           shapefiles = selection_shp)

Let's review the extraction definition information to make sure we have set it up the way we intended.

In [None]:
# review the extraction definition
extract_definition

Everything looks good so we will submit the extraction request, wait for it to complete, and download the resulting data.

### 4b. Submit the Extract Request

Now that the extraction definition is set up, we can submit it to the IPUMS API using the [*submit_extract*](https://rdrr.io/cran/ipumsr/man/submit_extract.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package.

For this exercise, after submitting the request we will also use the [*wait_for_extract*](https://rdrr.io/cran/ipumsr/man/wait_for_extract.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package to monitor the status of the request.  This is not a necessary step but it is helpful, especially when submitting large requests.

Finally, once the extract is complete, we can download it using the [*download_extract*](https://rdrr.io/cran/ipumsr/man/download_extract.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package and save it in the object *filepath*.

In [None]:
# submit extraction  
extraction_submitted <- submit_extract(extract_definition)

# wait for completion
extraction_complete <- wait_for_extract(extraction_submitted)

# check completion
extraction_complete$status

# get extraction filepath
filepath <- download_extract(extraction_submitted, overwrite = T)
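To make an extract easy to reproduce or share, the definition can be written to a JSON file and rebuilt later. This is a minimal sketch using the *save_extract_as_json* and *define_extract_from_json* functions from **ipumsr**. The definition is recreated here so the block stands on its own, and the temporary file path is a placeholder you would replace with a location in your project.

```r
library(ipumsr)

# Recreate the extract definition used in this notebook.
example_definition <- define_extract_nhgis(
  description = "I-GUIDE NHGIS Data Extraction",
  time_series_tables = tst_spec(name = "CL8", geog_levels = "county"),
  shapefiles = "us_county_2010_tl2010"
)

# Placeholder location; a temporary file is used so nothing is overwritten.
definition_file <- tempfile(fileext = ".json")
save_extract_as_json(example_definition, file = definition_file)

# Later, or on another machine, rebuild the definition from the file.
restored_definition <- define_extract_from_json(definition_file)
restored_definition
```

The restored definition can then be passed to *submit_extract* in the same way as the original.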

### 4c. Review the Extract

Once we have downloaded the extract, we are ready to review it and transform it to a format we can easily use.  The NHGIS data extract download will contain the following two files.

1. A data file (file extension .csv) containing the time-series tabular data.
2. A zipped GIS shapefile (file extension .zip) containing the geographic boundary data.

Next we read the data file and shapefile into formats which we can work with in R.  The final *dat* object will contain the data from our extraction in a [*tibble*](https://tibble.tidyverse.org) format and the *shp* object will contain our geographic boundaries in [*simple features* (*sf*)](https://r-spatial.github.io/sf/articles/sf1.html) format, a commonly used format for spatial data objects in R.

In [None]:
# read the tabular data and the shapefile from the extract
dat <- read_nhgis(filepath[1])
shp <- read_ipums_sf(filepath[2])
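
Indexing *filepath* by position relies on the tabular file coming first. A slightly more defensive sketch selects each file by its name instead. It assumes the downloaded file names end in `_csv.zip` and `_shape.zip`; the paths below are made up for illustration.

```r
# illustrative paths only; real paths come from download_extract()
filepath_example <- c("nhgis0001_shape.zip", "nhgis0001_csv.zip")

# pick each file by its name rather than by its position
# (assumes the "_csv" / "_shape" suffixes in the file names)
csv_path <- filepath_example[grepl("_csv\\.zip$", filepath_example)]
shp_path <- filepath_example[grepl("_shape\\.zip$", filepath_example)]

# each pattern should match exactly one file
stopifnot(length(csv_path) == 1, length(shp_path) == 1)

csv_path
shp_path
```

With a real download, `csv_path` and `shp_path` could then be passed to *read_nhgis* and *read_ipums_sf* in place of `filepath[1]` and `filepath[2]`.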

We now have a useable version of our time-series table stored in *dat*.  Let's take a look at the number of observations and variables in the data.

In [None]:
dim(dat)

The data we downloaded includes information on 16 variables for 3143 counties.

Let's take a look at the first few lines of the data.

In [None]:
head(dat)

We know that our data extraction includes total population information and these data are represented by the following variables:

**Population Variables**
* 1990 Total Population Estimate (CL8AA1990)
* 1990 Total Population Lower Bound (CL8AA1990L)
* 1990 Total Population Upper Bound (CL8AA1990U)
* 2000 Total Population Estimate (CL8AA2000)
* 2000 Total Population Lower Bound (CL8AA2000L)
* 2000 Total Population Upper Bound (CL8AA2000U)
* 2010 Total Population (CL8AA2010)
* 2020 Total Population Estimate (CL8AA2020)
* 2020 Total Population Lower Bound (CL8AA2020L)
* 2020 Total Population Upper Bound (CL8AA2020U)

In addition to the 1990, 2000, 2010, and 2020 total population values, the data also include upper and lower confidence limits for the 1990, 2000, and 2020 total population estimates.  If we remember from our data extraction specification steps, we chose a time-series table that was standardized to the 2010 geographies.  Because the 1990, 2000, and 2020 populations had to be estimated for slightly different geographies than the geographies for which the original data was collected, there is some potential error in the estimates.  The upper and lower confidence limits provide information on the possible error in the population estimates.
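
One way to get a feel for this uncertainty is to compare the width of the interval between the bounds with the estimate itself. The sketch below uses a toy data frame with the same column names as the extract; the values are invented for illustration and the derived column names are hypothetical.

```r
# toy values standing in for three counties (not real NHGIS data)
pop <- data.frame(
  GISJOIN    = c("A", "B", "C"),
  CL8AA1990  = c(1000, 5000, 200),
  CL8AA1990L = c(1000, 4900, 180),
  CL8AA1990U = c(1000, 5100, 230)
)

# width of the interval between the lower and upper bounds
pop$width_1990 <- pop$CL8AA1990U - pop$CL8AA1990L

# width relative to the estimate, as a rough measure of uncertainty
pop$rel_width_1990 <- pop$width_1990 / pop$CL8AA1990

# list the counties from most to least uncertain
pop[order(-pop$rel_width_1990), c("GISJOIN", "width_1990", "rel_width_1990")]
```

A width of zero means the bounds match the estimate exactly. Note that the ratio is undefined for counties with an estimate of zero, so those rows would need separate handling.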

In addition to the population variables, the downloaded data also includes multiple geographic variables.

**Geographic Variables**
* GIS Join Match Code (GISJOIN)
* Geography Year (GEOGYEAR)
* State Name (STATE)
* State FIPS Code (STATEA)
* County Name (COUNTY)
* County FIPS Code (COUNTYA)

Next let's take a look at the number of observations and variables in the geographic data.

In [None]:
dim(shp)

The geographic data includes information on 21 variables for 3221 counties.

Let's take a look at the columns included in the geographic data file.

In [None]:
colnames(shp)

The variable we are most interested in from the geographic data is the GIS Join Match Code (GISJOIN).  This variable corresponds to the GISJOIN variable in our time-series tabular data and we will use the two corresponding variables from the two datasets to link them.
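
Because the two datasets have different numbers of rows, it can help to compare their keys before linking them. The sketch below uses short toy vectors in place of `dat$GISJOIN` and `shp$GISJOIN`; the key values and variable names are for illustration only.

```r
# toy keys standing in for the GISJOIN columns of dat and shp
dat_keys <- c("G0100010", "G0100030", "G0100050")
shp_keys <- c("G0100010", "G0100030", "G0100050", "G7200010")

# keys present in one dataset but not the other
only_in_shp <- setdiff(shp_keys, dat_keys)
only_in_dat <- setdiff(dat_keys, shp_keys)

# duplicated keys would multiply rows in a merge
any_dup <- anyDuplicated(dat_keys) > 0 || anyDuplicated(shp_keys) > 0

only_in_shp
only_in_dat
any_dup
```

With the real objects, boundaries that appear only in the shapefile have no population values to attach, and any duplicated key is worth investigating before joining.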

### 4d. Merge the Time-Series and Geography Data

Now that we have both the time-series data table on total population for 1990, 2000, 2010, and 2020 by county and the geographic data file with county boundaries, we can merge the two to create a spatially referenced data object which includes the total population counts by county.  We will be able to use this file to create map visualizations of the population data or conduct other spatial analysis workflows.

In the previous section we identified the GIS Join Match Code (GISJOIN) variable present in both our tabular and geographic data.  We will use this common variable as the join key to merge the two files and create a new dataset which includes both population and geographic information.

In [None]:
# merge the time-series population data with the county geographic data
dat_shp <- merge(dat, shp, by = "GISJOIN")

The final merged dataset includes total population for 1990, 2000, 2010, and 2020 attached to 2010 county geographic boundaries.  The code below provides a snapshot of the first six rows in the final merged dataset showing all variables from both tables including all population variables and the *geometry* column containing geographic information.

In [None]:
head(dat_shp)

### 4e. Save the Data

Finally, let's save the data we extracted from IPUMS NHGIS.  We will save the data as a **shapefile** (*.shp*).  The shapefile format will retain the geographic metadata necessary to use our file for mapping and spatial analysis.  This type of file can be reopened in R using the [*st_read*](https://rdrr.io/cran/sf/man/st_read.html) function from the [**sf**](https://cran.r-project.org/web/packages/sf/index.html) package or read natively into traditional GIS software platforms such as QGIS and ArcGIS.

**★ Pro Tip:** Saving the geographic data as a shapefile will produce four files.  All four files together constitute the information necessary to properly draw and spatially reference the data in the shapefile and should be kept together.

* a *.shp* file containing feature geometry
* a *.dbf* file containing feature attribute information
* a *.shx* file containing the feature index
* a *.prj* file containing projection information

Because our data includes geographic data, a plain tabular format such as a comma-separated values (.csv) file is a poor fit, as it does not natively store geometry or projection information.  The R Data Serialization (RDS) format can store an *sf* object, but the resulting file can only be read in R, whereas a shapefile can also be opened in GIS software.

Since our data covers every county in the country, let's first subset it to make it a little easier to work with.  Before saving, we will subset the data to include only counties located in the state of California (state FIPS code 06).

In [None]:
# subset the data to only the state of California
dat_subset <- dat_shp[dat_shp$STATEA == "06",]

# view the dimensions of the new data table
dim(dat_subset)

Subsetting the data to only California reduces the dimensions of the data to only 58 counties, making it much easier to work with and store.
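
The filter above compares against the string `"06"` and not the number 6, which matters if the state code is stored as text with a leading zero, as that comparison suggests. A minimal sketch of building the two-character code from a numeric FIPS code, using a toy table in place of the merged data:

```r
# toy table standing in for the merged data (not real NHGIS values)
counties <- data.frame(
  STATEA = c("01", "06", "06", "48"),
  COUNTY = c("Autauga", "Alameda", "Alpine", "Anderson"),
  stringsAsFactors = FALSE
)

# build the two-character code from a numeric FIPS code
state_code <- sprintf("%02d", 6L)

# keep only rows whose state code matches
ca <- counties[counties$STATEA == state_code, ]

state_code
nrow(ca)
```

Padding the code this way makes it easy to reuse the same filter for other states without having to remember which codes need a leading zero.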

Before saving, let's also remove a few of the attributes which have very large values including land area (ALAND10), water area (AWATER10), and shape area (Shape_area).  Due to their large values these attributes will cause warnings when we save our data to shapefile format, so we will save ourselves the headache of seeing all the warnings by removing the attributes prior to saving.

In [None]:
dat_subset <- dat_subset[, !names(dat_subset) %in% c("ALAND10", "AWATER10", "Shape_area")]

We are ready to save our data.  We will use the [*st_write*](https://rdrr.io/cran/sf/man/st_write.html) function from the [**sf**](https://cran.r-project.org/web/packages/sf/index.html) package to write our data as a shapefile.

**★ Pro Tip:** Setting the *delete_dsn* argument to *TRUE* for the [*st_write*](https://rdrr.io/cran/sf/man/st_write.html) function will allow the save to overwrite the file if it already exists.

In [None]:
st_write(dat_subset, "ipums_nhgis_example.shp", driver = "ESRI Shapefile", delete_dsn = TRUE)
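
After the write, it can be worth confirming that all four component files listed earlier are present. The helper below is a hypothetical convenience function written for this notebook, not part of **sf**; it simply checks for each file extension next to the *.shp* file.

```r
# Hypothetical helper: report which shapefile component files exist.
shapefile_parts <- function(shp_path,
                            exts = c("shp", "dbf", "shx", "prj")) {
  # strip the extension, then rebuild one path per component
  base <- tools::file_path_sans_ext(shp_path)
  paths <- paste0(base, ".", exts)
  stats::setNames(file.exists(paths), exts)
}

# usage after the st_write() call above:
# shapefile_parts("ipums_nhgis_example.shp")
```

The result is a named logical vector, so any `FALSE` entry points directly at the missing component.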

At the end of this exercise we have a freshly downloaded dataset from the IPUMS NHGIS repository saved in our workspace.

---

## Next Steps

* Continue to [**Chapter 3.01.1 IPUMS NHGIS Data Extraction using ipumsr: Supplemental Exercise 1**](https://platform.i-guide.io/notebooks/a74fff96-4db5-430f-b346-958b0c5f7b38)
* Continue to [**Chapter 3.01.2 IPUMS NHGIS Data Extraction using ipumsr: Supplemental Exercise 2**](https://platform.i-guide.io/notebooks/bc79eda6-8353-42ea-8cb7-5db70aa6febf)
* Continue to [**Chapter 3.01.3 IPUMS NHGIS Data Extraction using ipumsr: Supplemental Exercise 3**](https://platform.i-guide.io/notebooks/55dd96e5-fdf6-408f-a050-7fcd006d0575)
* Move on to Chapter 5: Data Cleaning, Preparation, and Exploratory Data Analysis (EDA)
  * [**Chapter 5.02 Spatial Data Exploration and Preprocessing with IPUMS NHGIS**]()
* Return to the [**R Spatial Notebooks Project Chapter List**](https://vavramusser.github.io/r-spatial/#:~:text=Chapter%201%3A%20Data%20Sources%20and%20APIs) to view a list of all available notebooks organized in the R Spatial Notebooks chapter structure.
* Visit the [**R Spatial Notebooks Project Homepage**](https://vavramusser.github.io/r-spatial) to learn more about the project, view the list of all notebooks, and explore additional resources.
* Join the project [**Mailing List**](https://mailchi.mp/ab01e8fc8397/r-spatial-email-signup) to hear about future notebook releases and other updates.
* If you have an idea for a new notebook please submit your idea via the [**Suggestion Box**](https://us19.list-manage.com/survey?u=746bf8d366d6fbc99c699e714&id=54590a28ea&attribution=false).

---

## ★ Thank You ★

Thank you so much for engaging with this notebook and supporting the project!  The R Spatial Notebooks Project is a labor of love so if you enjoy or benefit from these notebooks, please consider [**Donating to the Project**](https://buymeacoffee.com/vavramusser).  Your support allows me to continue producing notebooks and supporting the R Spatial Notebooks community.

---

## Quick Code
A clean and simple version of the code included in this notebook (excluding the metadata exploration steps).  **Don't forget to update the code with your IPUMS API key!**