# IPUMS [NHGIS](https://www.nhgis.org) Data Extraction Using [ipumsr](https://cran.r-project.org/web/packages/ipumsr/index.html) - Supplemental Exercise 1
### by [Kate Vavra-Musser](https://vavramusser.github.io) for the [R Spatial Notebook Series](https://vavramusser.github.io/r-spatial)

## Introduction
This notebook provides an additional example of the IPUMS NHGIS data extraction process using the IPUMS API via the ipumsr R package.  This exercise is a supplement to the workflow introduced in Chapter 2.4 IPUMS NHGIS Data Extraction Using ipumsr.

### Notebook Goals
This notebook replicates the IPUMS NHGIS data extraction process to extract two files: an NHGIS time-series table on the population of populated places and a shapefile of their point locations.  The resulting data file is used in subsequent notebooks in the R Spatial Notebooks series.  The notebook provides an example of extracting point-based spatial data with attached attribute data from the IPUMS NHGIS repository.

### ★ Prerequisites ★
* Complete [Chapter 1.1 Introduction to IPUMS and the IPUMS API](https://platform.i-guide.io/notebooks/82d3b176-e4e6-4307-8186-318a3fe6c81a)
* Set Up Your [IPUMS Account and API Key](https://account.ipums.org/api_keys)
* Complete [Chapter 2.4 IPUMS NHGIS Data Extraction Using ipumsr](https://platform.i-guide.io/notebooks/be08e56e-1c08-458e-a230-263c64d386bc)

### Notebook Overview
1. Setup
2. Extraction Workflow: Point-Based Shapefiles + Tabular Data

---

## 1. Setup
This section will guide you through the process of installing essential packages and setting your IPUMS API key.

#### Required Packages

[**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) A Grammar of Data Manipulation. This notebook uses the following functions from *dplyr*.

* [*filter*](https://rdrr.io/cran/dplyr/man/filter.html) · keep rows that match a condition

[**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) An R Interface for Downloading, Reading, and Handling IPUMS Data.  This notebook uses the following functions from *ipumsr*.

* [*define_extract_nhgis*](https://rdrr.io/cran/ipumsr/man/define_extract_nhgis.html) · define an IPUMS NHGIS extract request
* [*download_extract*](https://rdrr.io/cran/ipumsr/man/download_extract.html) · download a completed IPUMS data extract
* [*get_metadata_nhgis*](https://rdrr.io/cran/ipumsr/man/get_metadata_nhgis.html) · list available data sources from IPUMS NHGIS
* [*read_ipums_sf*](https://rdrr.io/cran/ipumsr/man/read_ipums_sf.html) · read spatial data from an IPUMS extract
* [*read_nhgis*](https://rdrr.io/cran/ipumsr/man/read_nhgis.html) · read tabular data from an NHGIS extract
* [*set_ipums_api_key*](https://rdrr.io/cran/ipumsr/man/set_ipums_api_key.html) · set your IPUMS API key
* [*submit_extract*](https://rdrr.io/cran/ipumsr/man/submit_extract.html) · submit an extract request via the IPUMS API
* *tst_spec* · create a *tst_spec* object containing a time series table specification
* [*wait_for_extract*](https://rdrr.io/cran/ipumsr/man/wait_for_extract.html) · wait for an extract to finish processing

[**purrr**](https://cran.r-project.org/web/packages/purrr/index.html) A complete and consistent functional programming toolkit for R. This notebook uses the following functions from *purrr*.

* [*map()*](https://rdrr.io/cran/purrr/man/map.html) and [*map_dfr()*](https://rdrr.io/cran/purrr/man/map_dfr.html) · apply a function to each element of a vector

[**sf**](https://cran.r-project.org/web/packages/sf/index.html) Support for simple features, a standardized way to encode spatial vector data. Binds to 'GDAL' for reading and writing data, to 'GEOS' for geometrical operations, and to 'PROJ' for projection conversions and datum transformations. Uses by default the 's2' package for spherical geometry operations on ellipsoidal (long/lat) coordinates.  This notebook uses the following functions from *sf*.

* [*st_write*](https://rdrr.io/cran/sf/man/st_write.html) · Write simple features object to file or database

### 1a. Install and Load Required Packages
If you have not already installed the required packages, uncomment and run the code below:

In [None]:
# install.packages(c("dplyr", "ipumsr", "purrr", "sf"))
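
If some of the packages are already installed, you may prefer to install only the missing ones. The following is a minimal sketch of that idea in base R. The helper name `missing_packages` is hypothetical and not part of any package.

```r
# Packages this notebook needs
required <- c("dplyr", "ipumsr", "purrr", "sf")

# Hypothetical helper: return the required packages that are not yet installed
missing_packages <- function(required, installed = rownames(installed.packages())) {
  setdiff(required, installed)
}

to_install <- missing_packages(required)

# Install only when something is actually missing (uncomment to run)
# if (length(to_install) > 0) install.packages(to_install)
```

Note that `install.packages` expects a single character vector of package names, which is why the names are collected with `c()`.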

Load the packages into your workspace.

In [None]:
library(dplyr)
library(ipumsr)
library(purrr)
library(sf)

### 1b. Set Your IPUMS API Key

Store your [IPUMS API key](https://account.ipums.org/api_keys) in your environment using the following code.

Refer to [Chapter 1.1 Introduction to IPUMS and the IPUMS API](https://platform.i-guide.io/notebooks/82d3b176-e4e6-4307-8186-318a3fe6c81a) for instructions on setting up your IPUMS account and API key.

In [None]:
ipums_api_key <- readline("Please enter your IPUMS API key: ")
set_ipums_api_key(ipums_api_key, save = TRUE, overwrite = TRUE)
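
If you have already saved a key in a previous session, you may not need to enter it again. The sketch below checks for a stored key before prompting. It assumes that ipumsr reads the key from the `IPUMS_API_KEY` environment variable, and the helper name `has_ipums_key` is hypothetical. It also prompts only in interactive sessions, since `readline` returns an empty string when no user is present to answer.

```r
# Hypothetical helper: TRUE when a non-empty key is already present in the environment.
# Assumes ipumsr reads the key from the IPUMS_API_KEY environment variable.
has_ipums_key <- function(var = "IPUMS_API_KEY") {
  nzchar(Sys.getenv(var))
}

if (!has_ipums_key() && interactive()) {
  # Prompt only when no key is stored and a user is present to answer
  key <- readline("Please enter your IPUMS API key: ")
  # ipumsr::set_ipums_api_key(key, save = TRUE, overwrite = TRUE)
}
```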

## 2. NHGIS Time-Series + Points

### 2a. View and Filter the List of Time-Series Datasets

First we will take a look at the list of available time-series datasets in the NHGIS repository that fit our criteria.  We are looking for datasets on *total population* at the *place* geography.

In [None]:
metadata_datts_filter <- get_metadata_nhgis("time_series_tables") %>%
    filter(grepl("place", geog_levels, ignore.case = TRUE), grepl("total population", description, ignore.case = TRUE)) %>%
    select(name, description) %>%
    as.data.frame() %>%
    print()
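
To see how the two `grepl` conditions combine, here is a small sketch on a toy table in base R. The table names and descriptions are made up for illustration and are not real NHGIS metadata. In the toy table *geog_levels* is a plain string, which is a simplification.

```r
# Toy metadata table (illustrative values only, not real NHGIS tables)
toy <- data.frame(
  name = c("T01", "T02", "T03"),
  description = c("Total Population", "Persons by Sex", "Total population by Race"),
  geog_levels = c("state, county, place", "state, place", "state, county"),
  stringsAsFactors = FALSE
)

# Each condition yields one TRUE/FALSE per row; matching is case-insensitive
has_place <- grepl("place", toy$geog_levels, ignore.case = TRUE)
has_total <- grepl("total population", toy$description, ignore.case = TRUE)

# A row is kept only when both conditions are TRUE
matches <- toy$name[has_place & has_total]
matches
```

Because `grepl` matches substrings, a description such as "Total population by Race" also satisfies the second condition. It is worth reading the returned descriptions rather than relying on the pattern alone.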

The results returned three possible datasets, *AV0*, *B78*, and *CL8*.

### 2b. Identify Available Years and Geographic Levels

Next let's take a look at the available years and geographic levels for each of these three datasets.

In [None]:
# get metadata for each time-series table
metadata_list <- map(metadata_datts_filter$name, ~ get_metadata_nhgis(time_series_table = .x))

# combine into a data frame with the necessary columns
metadata_combined <- map_dfr(metadata_list, function(metadata) {
  data.frame(
    name = metadata$name,
    description = metadata$description,
    # extract the "description" column from the nested "years" tibble and the "name" column from "geog_levels"
    years = paste(metadata$years$description, collapse = ", "),
    geog_levels = paste(metadata$geog_levels$name, collapse = ", ")
  )
})

# print the final data frame
metadata_combined
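
The combine step can also be written without *map_dfr*, which purrr 1.0.0 marks as superseded. The following is a sketch in base R on toy stand-ins for the metadata objects. The values are illustrative and the structure is simplified to the four fields used above.

```r
# Toy stand-ins for two metadata objects (illustrative, not real NHGIS metadata)
toy_metadata <- list(
  list(name = "T01", description = "Total Population",
       years = data.frame(description = c("1990", "2000")),
       geog_levels = data.frame(name = c("state", "place"))),
  list(name = "T02", description = "Persons by Sex",
       years = data.frame(description = "2010"),
       geog_levels = data.frame(name = "county"))
)

# Build one row per table, collapsing the nested columns to comma-separated strings
rows <- lapply(toy_metadata, function(metadata) {
  data.frame(
    name = metadata$name,
    description = metadata$description,
    years = paste(metadata$years$description, collapse = ", "),
    geog_levels = paste(metadata$geog_levels$name, collapse = ", ")
  )
})

# Stack the per-table rows into a single data frame
toy_combined <- do.call(rbind, rows)
toy_combined
```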

Let's choose the *CL8* time-series table and move on to the next step, selecting a shapefile to go with our time-series data.

### 2c. View and Filter the List of Geography Shapefiles

Since we want our data to be at the *place (points)* geography, let's filter the shapefile metadata to only include shapefiles which include the word *points* in the description of their *geographic_level*.

In [None]:
metadata_shp <- get_metadata_nhgis("shapefiles") %>%
    filter(grepl("points", geographic_level, ignore.case = TRUE)) %>%
    print(n = Inf)

Let's select the 2010 shapefile (*us_place_point_2010_tlgnis*).

### 2d. Time-Series + Shapefile Extraction Specification and Submission

Now that we have selected our time-series dataset (*CL8*), the geographic level for our time-series dataset (*place*), and our shapefile (*us_place_point_2010_tlgnis*), we are ready to define and submit our extraction request to the IPUMS API.

In [None]:
extract_definition <- define_extract_nhgis(description = "I-GUIDE NHGIS Places Points Extraction",
                                           time_series_tables = tst_spec(name = "CL8",
                                                                         geog_levels = "place"),
                                           shapefiles = "us_place_point_2010_tlgnis")

Next we submit the extraction definition object *extract_definition* to the API, wait for the extract to finish processing, and download the results.

In [None]:
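# submit the request, then block until IPUMS finishes processing it
extraction_submitted <- submit_extract(extract_definition)
extraction_complete <- wait_for_extract(extraction_submitted)
extraction_complete$status
# download the finished extract; returns the paths to the downloaded files
filepath <- download_extract(extraction_complete, overwrite = TRUE)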
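
In a script that runs unattended, printing the status is easy to overlook. The sketch below shows a guard that could sit between *wait_for_extract* and *download_extract*. The helper name `assert_completed` is hypothetical, and the toy list only stands in for a real extract object.

```r
# Hypothetical guard: stop early unless the extract reports a "completed" status
assert_completed <- function(extract) {
  if (!identical(extract$status, "completed")) {
    stop("Extract is not ready to download; status is: ", extract$status)
  }
  invisible(extract)
}

# Toy stand-in for an extract object returned by wait_for_extract()
toy_extract <- list(status = "completed")
assert_completed(toy_extract)
```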

The result of the extraction request will be two files: 1) a time-series table containing the population data and 2) the places (points) geography shapefile.  The next step is to read these two files into R.

In [None]:
dat <- read_nhgis(filepath[1])
shp <- read_ipums_sf(filepath[2])
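
Selecting the files by position works when the paths come back in the expected order. As a hedged alternative, the sketch below picks each file by its name. It assumes that NHGIS names the table archive with a `_csv.zip` suffix and the shapefile archive with a `_shape.zip` suffix. The helper name `pick_file` is hypothetical, and the paths are illustrative only.

```r
# Illustrative paths only; real paths come from download_extract()
filepath_toy <- c("nhgis0001_shape.zip", "nhgis0001_csv.zip")

# Hypothetical helper: pick the single path whose file name matches a pattern
pick_file <- function(paths, pattern) {
  hit <- paths[grepl(pattern, basename(paths))]
  stopifnot(length(hit) == 1)
  hit
}

# Assumes the "_csv.zip" and "_shape.zip" naming convention
data_zip  <- pick_file(filepath_toy, "_csv\\.zip$")
shape_zip <- pick_file(filepath_toy, "_shape\\.zip$")
```

The selected paths could then be passed to *read_nhgis* and *read_ipums_sf* in place of `filepath[1]` and `filepath[2]`.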

Let's take a look at the dimensions of the time-series data (*dat*) and the shapefile (*shp*).

In [None]:
dim(dat)
dim(shp)

The time-series table includes 16 variables for 29,261 places and the shapefile includes 8 attributes for 29,514 places.  The number of places represented by the time-series table is slightly smaller than the number of places represented in the shapefile.  It could be that some of the places represented in the shapefile do not have population counts available in the data table.
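
One way to check this is to compare the join keys of the two files directly. The following is a minimal sketch on toy data, with made-up *GISJOIN* values, showing how `setdiff` lists the keys present in one file but not the other.

```r
# Toy stand-ins for the table and the shapefile attributes (made-up keys)
dat_toy <- data.frame(GISJOIN = c("G01", "G02", "G03"),
                      pop = c(100, 250, 75))
shp_toy <- data.frame(GISJOIN = c("G01", "G02", "G03", "G04"),
                      NAME = c("Alpha", "Beta", "Gamma", "Delta"))

# Keys found in the shapefile but missing from the table, and the reverse
only_in_shp <- setdiff(shp_toy$GISJOIN, dat_toy$GISJOIN)
only_in_dat <- setdiff(dat_toy$GISJOIN, shp_toy$GISJOIN)

only_in_shp
only_in_dat
```

Running the same two `setdiff` calls on the *GISJOIN* columns of *dat* and *shp* would show which places lack a match in either direction.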

Let's take a look at the first few lines of the time-series table on population and the places (points) shapefile.

In [None]:
head(dat)

In [None]:
head(shp)

Before we join the *dat* and *shp* files, let's remove a few data columns that are less useful for our analyses.

We will only keep the following attributes:

**From the *dat* Tabular Data**
* GIS Join Key (*GISJOIN*)
* Place (*PLACE*)
* 1990 Total Population (*CL8AA1990*)
* 2000 Total Population (*CL8AA2000*)
* 2010 Total Population (*CL8AA2010*)
* 2020 Total Population (*CL8AA2020*)

**From the *shp* Geography Data**
* GIS Join Key (*GISJOIN*)
* Name (*NAME*)

In [None]:
dat_cols <- c("GISJOIN", "PLACE", "CL8AA1990", "CL8AA2000", "CL8AA2010", "CL8AA2020")
dat <- dat[dat_cols]

shp_cols <- c("GISJOIN", "NAME")
shp <- shp[shp_cols]

We kept the *GISJOIN* join key column in both files and we will use this common key to join the two datasets using the *merge* function.

In [None]:
# merge the place (points) geography with the time-series population data;
# listing shp first keeps the result an sf object
dat_shp <- merge(shp, dat, by = "GISJOIN")
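
By default *merge* performs an inner join, so any place whose *GISJOIN* appears in only one of the two files is dropped from the result. The sketch below uses toy data frames with made-up keys to contrast the default with `all.x = TRUE`, which keeps every row of the first input.

```r
# Toy stand-ins with one key (G04) that has no population record
shp_toy <- data.frame(GISJOIN = c("G01", "G02", "G03", "G04"),
                      NAME = c("Alpha", "Beta", "Gamma", "Delta"))
dat_toy <- data.frame(GISJOIN = c("G01", "G02", "G03"),
                      pop = c(100, 250, 75))

# Default: inner join keeps only keys present in both inputs
inner <- merge(shp_toy, dat_toy, by = "GISJOIN")

# all.x = TRUE: keep every row of the first input, filling gaps with NA
left <- merge(shp_toy, dat_toy, by = "GISJOIN", all.x = TRUE)

nrow(inner)
nrow(left)
```

Which behavior is preferable depends on whether points without population counts are useful in later analyses.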

Finally we will save the data in shapefile format.

In [None]:
st_write(dat_shp, "ipums_nhgis_places.shp", driver = "ESRI Shapefile", delete_dsn = TRUE)
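
The shapefile format limits attribute field names to 10 characters, and longer names are shortened when the file is written. The sketch below checks the column names kept in this notebook against that limit. The helper name `long_fields` is hypothetical.

```r
# Attribute columns kept in this notebook
field_names <- c("GISJOIN", "PLACE", "CL8AA1990", "CL8AA2000",
                 "CL8AA2010", "CL8AA2020", "NAME")

# Hypothetical helper: return the names that exceed the field-name limit
long_fields <- function(x, limit = 10) {
  x[nchar(x) > limit]
}

# Names that would be shortened on write (none for the columns above)
long_fields(field_names)

# A longer, made-up name is flagged
long_fields("TOTAL_POPULATION_2020")
```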

At the end of this notebook we have joined the time-series table, which contains total population from the 1990, 2000, 2010, and 2020 Decennial Censuses for populated places in the United States, to the places (points) geography and saved the result to the shapefile *ipums_nhgis_places.shp*.

---
## Recommended Next Steps

* **Continue with Chapter 2: IPUMS Data Acquisition and Extraction**
  * [2.4b IPUMS NHGIS Data Extraction Using ipumsr - Supplemental Exercise 2]()
  * [2.4c IPUMS NHGIS Data Extraction Using ipumsr - Supplemental Exercise 3]()

## Stay Connected
Thank you for engaging with this notebook and supporting the project!  Please visit the [**R Spatial Notebooks Project Homepage**](https://vavramusser.github.io/r-spatial) to learn more about the project and explore additional notebooks.  Don't forget to join the project [**Mailing List**](https://mailchi.mp/ab01e8fc8397/r-spatial-email-signup) to hear about future notebook releases and other updates.  If you have an idea for a new notebook I want to hear about it!  Please submit your idea via the [**Suggestion Box**](https://us19.list-manage.com/survey?u=746bf8d366d6fbc99c699e714&id=54590a28ea&attribution=false).