# IPUMS [NHGIS](https://www.nhgis.org) Data Extraction Using [ipumsr](https://cran.r-project.org/web/packages/ipumsr/index.html) - Supplemental Exercise 3
### by [Kate Vavra-Musser](https://vavramusser.github.io) for the [R Spatial Notebook Series](https://vavramusser.github.io/r-spatial)

## Introduction
This notebook provides an additional example of the IPUMS NHGIS data extraction process using the IPUMS API via the ipumsr R package.  This exercise is a supplement to the workflow introduced in Chapter 2.4 IPUMS NHGIS Data Extraction Using ipumsr.

### Notebook Goals
This notebook replicates the IPUMS NHGIS data extraction process and extracts an NHGIS polygon dataset on block group population with its accompanying shapefile.  The resulting data file is used in subsequent notebooks in the R Spatial Notebooks series.  The notebook provides an example of extracting spatial data limited to a user-defined extent, with attached attribute data, from the IPUMS NHGIS repository.

### ★ Prerequisites ★
* Complete [Chapter 1.1 Introduction to IPUMS and the IPUMS API](https://platform.i-guide.io/notebooks/82d3b176-e4e6-4307-8186-318a3fe6c81a)
* Set Up Your [IPUMS Account and API Key](https://account.ipums.org/api_keys)
* Complete [Chapter 2.4 IPUMS NHGIS Data Extraction Using ipumsr](https://platform.i-guide.io/notebooks/be08e56e-1c08-458e-a230-263c64d386bc)

### Notebook Overview
1. Setup
2. Extraction Workflow: Shapefiles Restricted to a Geographic Extent + Tabular Data

---

## 1. Setup
This section will guide you through the process of installing essential packages and setting your IPUMS API key.

#### Required Packages

[**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) A Grammar of Data Manipulation. This notebook uses the following functions from *dplyr*.

* [*filter*](https://rdrr.io/cran/dplyr/man/filter.html) · keep rows that match a condition

[**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) An R Interface for Downloading, Reading, and Handling IPUMS Data.  This notebook uses the following functions from *ipumsr*.

* [*define_extract_nhgis*](https://rdrr.io/cran/ipumsr/man/define_extract_nhgis.html) · define an IPUMS NHGIS extract request
* [*download_extract*](https://rdrr.io/cran/ipumsr/man/download_extract.html) · download a completed IPUMS data extract
* [*get_metadata_nhgis*](https://rdrr.io/cran/ipumsr/man/get_metadata_nhgis.html) · list available data sources from IPUMS NHGIS
* [*read_ipums_sf*](https://rdrr.io/cran/ipumsr/man/read_ipums_sf.html) · read spatial data from an IPUMS extract
* [*read_nhgis*](https://rdrr.io/cran/ipumsr/man/read_nhgis.html) · read tabular data from an NHGIS extract
* [*set_ipums_api_key*](https://rdrr.io/cran/ipumsr/man/set_ipums_api_key.html) · set your IPUMS API key
* [*submit_extract*](https://rdrr.io/cran/ipumsr/man/submit_extract.html) · submit an extract request via the IPUMS API
* *tst_spec* · create a *tst_spec* object containing a time series table specification
* [*wait_for_extract*](https://rdrr.io/cran/ipumsr/man/wait_for_extract.html) · wait for an extract to finish processing

[**purrr**](https://cran.r-project.org/web/packages/purrr/index.html) A complete and consistent functional programming toolkit for R. This notebook uses the following functions from *purrr*.

* [*map()*](https://rdrr.io/cran/purrr/man/map.html) and [*map_dfr()*](https://rdrr.io/cran/purrr/man/map_dfr.html) · apply a function to each element of a vector

[**sf**](https://cran.r-project.org/web/packages/sf/index.html) Support for simple features, a standardized way to encode spatial vector data. Binds to 'GDAL' for reading and writing data, to 'GEOS' for geometrical operations, and to 'PROJ' for projection conversions and datum transformations. Uses by default the 's2' package for spherical geometry operations on ellipsoidal (long/lat) coordinates.  This notebook uses the following functions from *sf*.

* [*st_write*](https://rdrr.io/cran/sf/man/st_write.html) · Write simple features object to file or database

### 1a. Install and Load Required Packages
If you have not already installed the required packages, uncomment and run the code below:

In [None]:
# install.packages(c("dplyr", "ipumsr", "purrr", "sf"))
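
If you are not sure which of the required packages are already installed, a check like the one below can help.  This is a minimal sketch using only base R; the object names (*required_packages*, *missing_packages*) are our own and are not part of any package.

```r
# packages this notebook relies on
required_packages <- c("dplyr", "ipumsr", "purrr", "sf")

# requireNamespace() returns TRUE when a package is installed and can be loaded
is_installed <- vapply(required_packages, requireNamespace, logical(1), quietly = TRUE)

# keep only the names of packages that are not yet available
missing_packages <- required_packages[!is_installed]
print(missing_packages)

# uncomment to install only what is missing
# if (length(missing_packages) > 0) install.packages(missing_packages)
```

An empty result means that all four packages are already available and the installation step can be skipped.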

Load the packages into your workspace.

In [None]:
library(dplyr)
library(ipumsr)
library(purrr)
library(sf)

### 1b. Set Your IPUMS API Key

Store your [IPUMS API key](https://account.ipums.org/api_keys) in your environment using the following code.

Refer to [Chapter 1.1 Introduction to IPUMS and the IPUMS API](https://platform.i-guide.io/notebooks/82d3b176-e4e6-4307-8186-318a3fe6c81a) for instructions on setting up your IPUMS account and API key.

In [None]:
ipums_api_key <- readline("Please enter your IPUMS API key: ")
set_ipums_api_key(ipums_api_key, save = TRUE, overwrite = TRUE)
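
To confirm that a key is available without printing it, you can check the environment variable directly.  This is a minimal sketch that assumes ipumsr stores the key in the *IPUMS_API_KEY* environment variable; the helper name *key_is_set* is our own.

```r
# TRUE when the IPUMS_API_KEY environment variable holds a non-empty value
key_is_set <- function() {
  nzchar(Sys.getenv("IPUMS_API_KEY"))
}

if (key_is_set()) {
  message("An IPUMS API key is set for this session.")
} else {
  message("No IPUMS API key found. Run set_ipums_api_key() before continuing.")
}
```

This only tells you that a value is present, not that the key is valid; an invalid key will surface as an error when you submit your first request.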

## 2. NHGIS Time-Series + Polygons with Geographic Extent - Method 1

### 2a. View and Filter the List of Geography Shapefiles

For this exercise, we will again work with the total population time-series table *CL8* that we used in Chapter 2.4 and Chapter 2.4a.  Therefore, we will jump directly into the list of geography shapefiles that fit our criteria.  We are looking for block group boundary shapefiles but want to restrict the extent of our query to only the state of Minnesota.  Therefore, let's filter the shapefile metadata to include only shapefiles that have the phrase "block group" in their *geographic_level* description and have *Minnesota* as their *extent*.  We will also focus on shapefiles based on the 2010 TIGER/Line files, so we will also filter on *year == 2010*.

In [None]:
metadata_shp <- get_metadata_nhgis("shapefiles") %>%
    filter(year == 2010, extent == "Minnesota", grepl("block group", geographic_level, ignore.case = T)) %>%
    print(n = Inf)

This filter returns a list of two candidate shapefiles.  Let's select the 2010 shapefile based on the 2010 TIGER/Line files (*270_blck_grp_2010_tl2010*).
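
To see how the three filter conditions combine, here is a small offline sketch that applies the same filter to a made-up metadata table.  Only the shapefile name *270_blck_grp_2010_tl2010* comes from this notebook; the other rows, and the object names *mock_metadata* and *mock_result*, are hypothetical and exist only for illustration.

```r
library(dplyr)

# made-up stand-in for the table returned by get_metadata_nhgis("shapefiles")
mock_metadata <- data.frame(
  name = c("270_blck_grp_2010_tl2010", "example_tract_mn_2010",
           "example_blck_grp_wi_2010", "example_blck_grp_mn_2000"),
  year = c(2010, 2010, 2010, 2000),
  geographic_level = c("Block Group", "Census Tract", "Block Group", "Block Group"),
  extent = c("Minnesota", "Minnesota", "Wisconsin", "Minnesota")
)

# a row is kept only when all three conditions are TRUE
mock_result <- mock_metadata %>%
  filter(year == 2010,
         extent == "Minnesota",
         grepl("block group", geographic_level, ignore.case = TRUE))

print(mock_result)
```

Each of the three discarded rows fails exactly one condition (geographic level, extent, or year), which makes it easy to see what each part of the filter contributes.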

### 2b. Shapefile Extraction Specification and Submission

Now that we have selected our shapefile (*270_blck_grp_2010_tl2010*), we are ready to define and submit our extraction request to the IPUMS API.  We already know we want to use time-series table *CL8* at the block group (*blck_grp*) geographic level.

In [None]:
extract_definition <- define_extract_nhgis(description = "I-GUIDE NHGIS Block Group Population Extraction",
                                           time_series_tables = tst_spec(name = "CL8",
                                                                         geog_levels = "blck_grp"),
                                           shapefiles = "270_blck_grp_2010_tl2010")

Submit the extraction definition object *extract_definition* to the API, wait for the extract to finish processing, and download the result.

In [None]:
extraction_submitted <- submit_extract(extract_definition)
extraction_complete <- wait_for_extract(extraction_submitted)
extraction_complete$status
filepath <- download_extract(extraction_submitted, overwrite = T)

The result of the extraction request will be two files: 1) a time-series table containing the population data and 2) the block group (polygon) geography shapefile.  The next step is to read these two files into R.

In [None]:
dat <- read_nhgis(filepath[1])
shp <- read_ipums_sf(filepath[2])

Let's take a look at the dimensions of the time-series data (*dat*) and the shapefile (*shp*).

In [None]:
dim(dat)
dim(shp)

The time-series table includes 18 variables for 217,740 block groups in the entire United States and the shapefile includes 16 attributes for the 4,108 block groups in the state of Minnesota.
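
Because the table covers the entire United States while the shapefile covers only Minnesota, it can be useful to subset the table to one state.  The sketch below assumes that NHGIS *GISJOIN* identifiers begin with "G" followed by the three-digit NHGIS state code ("270" for Minnesota, as in the shapefile name).  The identifiers, the column name *population_2010*, and the values are made up for illustration.

```r
# made-up stand-in for the national time-series table
mock_dat <- data.frame(
  GISJOIN = c("G27000107701001", "G27000107701002",
              "G01000100201001", "G55000100001001"),
  population_2010 = c(1200, 950, 1430, 800)
)

# keep only rows whose identifier starts with the assumed Minnesota prefix
mock_mn <- mock_dat[startsWith(mock_dat$GISJOIN, "G270"), ]

nrow(mock_mn)
```

If the assumption about the identifier structure holds for your data, the number of rows kept should match the number of block groups in the Minnesota shapefile.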

Let's take a look at the first few lines of the time-series table on population and the block groups shapefile.

In [None]:
head(dat)

In [None]:
head(shp)

## 2. NHGIS Time-Series + Polygons with Geographic Extent - Method 2

Alternatively, we could have set up our query using the block group shapefile for the entire United States (*us_blck_grp_2010_tl2010*) and then specified the geographic extent in the extraction definition step.  Note that in the extraction definition below we have specified *Minnesota* as the geographic extent.

In [None]:
extract_definition <- define_extract_nhgis(description = "I-GUIDE NHGIS Block Group Population Extraction",
                                           time_series_tables = tst_spec(name = "CL8",
                                                                         geog_levels = "blck_grp"),
                                           shapefiles = "us_blck_grp_2010_tl2010",
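                                           # define_extract_nhgis() has no `extent` argument; extents are
                                           # requested with `geographic_extents`
                                           # (assumption: "270" is the NHGIS code for Minnesota, as in the
                                           # shapefile name 270_blck_grp_2010_tl2010)
                                           geographic_extents = "270")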

Submit the extraction definition object *extract_definition* to the API, wait for the extract to finish processing, and download the result.

In [None]:
extraction_submitted <- submit_extract(extract_definition)
extraction_complete <- wait_for_extract(extraction_submitted)
extraction_complete$status
filepath <- download_extract(extraction_submitted, overwrite = T)

Again, the result of the extraction request will be two files: 1) a time-series table containing the population data and 2) the block group (polygon) geography shapefile.  The next step is to read these two files into R.

In [None]:
dat <- read_nhgis(filepath[1])
shp <- read_ipums_sf(filepath[2])

If we take a look at the dimensions of the time-series data (*dat*) and the shapefile (*shp*) and at the first few lines of each file, we can see that this method of setting up the data extraction resulted in the same files as the first method.

In [None]:
dim(dat)
dim(shp)

In [None]:
head(dat)

In [None]:
head(shp)

Both files share the column *GISJOIN*, which should be familiar since we used it to merge the time-series table and the shapefile in Chapter 2.4.  We will merge the two files on *GISJOIN* again here and then save the merged result as a shapefile (.shp).

Before we save, however, we should remove a few columns that require a lot of memory.  This will make it easier to save and work with this data.

In [None]:
# merge the time-series population data with the block group geographic data
dat_shp <- merge(dat, shp, by = "GISJOIN")
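
By default, *merge* keeps only the rows whose key appears in both inputs, which is why merging a national table with a Minnesota-only shapefile leaves only Minnesota block groups.  The sketch below shows this behavior on two made-up data frames; the identifiers, column names, and values are hypothetical.

```r
# made-up table covering four areas
mock_table <- data.frame(
  GISJOIN = c("A", "B", "C", "D"),
  population = c(100, 250, 80, 410)
)

# made-up geography covering only two of those areas
mock_shapes <- data.frame(
  GISJOIN = c("B", "C"),
  area_label = c("area B", "area C")
)

# default merge() is an inner join: unmatched rows are dropped
mock_merged <- merge(mock_table, mock_shapes, by = "GISJOIN")
print(mock_merged)

# identifiers present in the table but absent from the geography
setdiff(mock_table$GISJOIN, mock_shapes$GISJOIN)
```

Comparing the row count of the merged result with the row count of the smaller input is a quick way to confirm that every geographic unit found a match in the table.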

In [None]:
dat_shp <- dat_shp[, !names(dat_shp) %in% c("ALAND10", "AWATER10", "Shape_area")]
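
The same column removal can be written with *dplyr*.  This is a minimal sketch on a made-up data frame; *select* combined with *any_of* drops the listed columns that exist and silently ignores any that do not, which is convenient when a column such as *Shape_area* may be absent.

```r
library(dplyr)

# made-up stand-in for the merged data, without a Shape_area column
mock_attributes <- data.frame(
  GISJOIN = c("A", "B"),
  ALAND10 = c(1, 2),
  AWATER10 = c(0, 1),
  population = c(10, 20)
)

columns_to_drop <- c("ALAND10", "AWATER10", "Shape_area")

# any_of() tolerates names that are not present in the data
mock_trimmed <- mock_attributes %>%
  select(-any_of(columns_to_drop))

names(mock_trimmed)
```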

We are ready to save the merged file to our workspace.

In [None]:
st_write(dat_shp, "ipums_nhgis_blockgroups.shp", driver = "ESRI Shapefile", delete_dsn = T)
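
It is worth reading a saved shapefile back in to check it, because the shapefile format limits attribute field names to 10 characters and longer names are shortened on write.  The sketch below writes a small made-up layer to a temporary folder and reads it back.  It uses points rather than block group polygons to keep the example short, and the object names, column names, and values are hypothetical.

```r
library(sf)

# made-up layer with one deliberately long column name
mock_sf <- st_sf(
  GISJOIN = c("A", "B"),
  population_total_2010 = c(100, 250),
  geometry = st_sfc(st_point(c(-93.3, 45.0)),
                    st_point(c(-92.1, 46.8)),
                    crs = 4326)
)

# write to a temporary location so nothing in the workspace is overwritten
mock_path <- file.path(tempdir(), "mock_blockgroups.shp")
st_write(mock_sf, mock_path, driver = "ESRI Shapefile", delete_dsn = TRUE, quiet = TRUE)

# read the file back and inspect the (possibly shortened) field names
mock_back <- st_read(mock_path, quiet = TRUE)
names(mock_back)
```

If any names were shortened, note the new names before using the file in later notebooks.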

At the end of this notebook we have saved the time-series population data from the 1990, 2000, 2010, and 2020 Decennial Censuses, merged with the geographic data for block groups in the state of Minnesota, to the shapefile *ipums_nhgis_blockgroups.shp*.

---
## Recommended Next Steps

* **Continue with Chapter 2: IPUMS Data Acquisition and Extraction**
  * [2.4b IPUMS NHGIS Data Extraction Using ipumsr - Supplemental Exercise 2]()
  * [2.4c IPUMS NHGIS Data Extraction Using ipumsr - Supplemental Exercise 3]()

## Stay Connected
Thank you for engaging with this notebook and supporting the project!  Please visit the [**R Spatial Notebooks Project Homepage**](https://vavramusser.github.io/r-spatial) to learn more about the project and explore additional notebooks.  Don't forget to join the project [**Mailing List**](https://mailchi.mp/ab01e8fc8397/r-spatial-email-signup) to hear about future notebook releases and other updates.  If you have an idea for a new notebook I want to hear about it!  Please submit your idea via the [**Suggestion Box**](https://us19.list-manage.com/survey?u=746bf8d366d6fbc99c699e714&id=54590a28ea&attribution=false).