# IPUMS NHGIS Data Extraction Using ipumsr

## Introduction
The [IPUMS NHGIS](https://www.nhgis.org) database offers harmonized summary data and geographic boundary files from U.S. censuses and surveys, providing a resource for spatial analysis of demographic, social, and economic trends. It enables users to access aggregated data at various geographic levels, such as states, counties, and census tracts, facilitating the exploration of population dynamics and regional patterns over time. Through harmonization, IPUMS NHGIS ensures that data can be seamlessly compared across years, despite changes in geographic boundaries, variable definitions, and survey methodologies.

**From the [IPUMS NHGIS Webpage](https://www.nhgis.org):** The National Historical Geographic Information System (NHGIS) provides easy access to summary tables and time series of population, housing, agriculture, and economic data, along with GIS-compatible mapping files, for years from 1790 through the present and for all levels of U.S. census geography, including states, counties, tracts, and blocks.

This notebook introduces the process of extracting [IPUMS NHGIS](https://www.nhgis.org) data using the [IPUMS API](https://developer.ipums.org/docs/v2/apiprogram) via the [ipumsr R package](https://cran.r-project.org/web/packages/ipumsr/index.html). Users will learn how to define, submit, and download an IPUMS NHGIS data extract, specifying desired variables, time periods, and geographic units for analysis. By the end of this notebook, users will have the skills to efficiently acquire customized IPUMS NHGIS datasets and prepare them for spatial and statistical workflows.

### ★ Prerequisites ★
* Complete Chapter 1.1: Introduction to IPUMS and the IPUMS API
* Set Up Your [IPUMS Account and API Key](https://account.ipums.org/api_keys)

### Notebook Overview
1. Setup
2. IPUMS NHGIS Time-Series Data Metadata Exploration
3. IPUMS NHGIS Geography Shapefile Metadata Exploration
4. IPUMS NHGIS Time-Series Data and Geography Shapefile Extraction Specification and Submission
5. Subset and Merge the Time-Series and Geography Data Extractions

## 1. Setup
This section will guide you through the process of installing essential packages and setting your IPUMS API key.

#### Required Packages

[**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) A Grammar of Data Manipulation. This notebook uses the following functions from *dplyr*.

* [*filter*](https://rdrr.io/cran/dplyr/man/filter.html) · keep rows that match a condition
* [*select*](https://rdrr.io/cran/dplyr/man/select.html) · keep or drop columns using their names and types
* [*rename*](https://rdrr.io/cran/dplyr/man/rename.html) · rename columns
* This notebook also uses [*%>%*](https://magrittr.tidyverse.org/reference/pipe.html), referred to as the *pipe* operator, which is used to pass the output from one function directly into the next function for the purpose of creating streamlined workflows.  The *pipe* operator is a commonly used component of the [*tidyverse*](https://www.tidyverse.org).

[**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) An R Interface for Downloading, Reading, and Handling IPUMS Data.  This notebook uses the following functions from *ipumsr*.

* [*define_extract_nhgis*](https://rdrr.io/cran/ipumsr/man/define_extract_nhgis.html) · define an IPUMS NHGIS extract request
* [*download_extract*](https://rdrr.io/cran/ipumsr/man/download_extract.html) · download a completed IPUMS data extract
* [*get_metadata_nhgis*](https://rdrr.io/cran/ipumsr/man/get_metadata_nhgis.html) · list available data sources from IPUMS NHGIS
* [*read_ipums_sf*](https://rdrr.io/cran/ipumsr/man/read_ipums_sf.html) · read spatial data from an IPUMS extract
* [*read_nhgis*](https://rdrr.io/cran/ipumsr/man/read_nhgis.html) · read tabular data from an NHGIS extract
* [*set_ipums_api_key*](https://rdrr.io/cran/ipumsr/man/set_ipums_api_key.html) · set your IPUMS API key
* [*submit_extract*](https://rdrr.io/cran/ipumsr/man/submit_extract.html) · submit an extract request via the IPUMS API
* *tst_spec* · create a *tst_spec* object containing a time series table specification
* [*wait_for_extract*](https://rdrr.io/cran/ipumsr/man/wait_for_extract.html) · wait for an extract to finish processing

[**purrr**](https://cran.r-project.org/web/packages/purrr/index.html) A complete and consistent functional programming toolkit for R. This notebook uses the following functions from *purrr*.

* [*map()*](https://rdrr.io/cran/purrr/man/map.html) and [*map_dfr()*](https://rdrr.io/cran/purrr/man/map_dfr.html) · apply a function to each element of a vector

[**sf**](https://cran.r-project.org/web/packages/sf/index.html) Support for simple features, a standardized way to encode spatial vector data. Binds to 'GDAL' for reading and writing data, to 'GEOS' for geometrical operations, and to 'PROJ' for projection conversions and datum transformations. Uses by default the 's2' package for spherical geometry operations on ellipsoidal (long/lat) coordinates.

### 1a. Install and Load Required Packages
If you have not already installed the required packages, uncomment and run the code below:

In [None]:
# install.packages(c("dplyr", "ipumsr", "purrr", "sf"))
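
If only some of the packages are missing, a small helper can report which ones still need to be installed. The sketch below is one possible approach; the function name `find_missing_packages` is made up for this example and is not part of any package.

```r
# Packages used in this notebook
required_packages <- c("dplyr", "ipumsr", "purrr", "sf")

# Hypothetical helper: return the packages that are not yet installed
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing_packages <- find_missing_packages(required_packages)

# Uncomment to install only what is missing
# if (length(missing_packages) > 0) install.packages(missing_packages)
```

Checking first avoids reinstalling packages that are already available in your library.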

Load the packages into your workspace.

In [32]:
library(dplyr)
library(ipumsr)
library(purrr)
library(sf)



### 1b. Set Your IPUMS API Key

Store your [IPUMS API key](https://account.ipums.org/api_keys) in your environment using the following code.

Refer to *Chapter 1.1: Introduction to IPUMS and the IPUMS API* for instructions on setting up your IPUMS account and API key.

In [7]:
# prompt for the key, then store it for this and future sessions
ipums_api_key <- readline("Please enter your IPUMS API key: ")
set_ipums_api_key(ipums_api_key, save = TRUE, overwrite = TRUE)

Please enter your IPUMS API key:  [API key redacted]


Existing .Renviron file copied to C:\Users\vavra\Documents/.Renviron_backup for backup purposes.

The environment variable IPUMS_API_KEY has been set and saved for future sessions.
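
The confirmation message above indicates that the key is stored in the `IPUMS_API_KEY` environment variable. A quick way to confirm this in a later session, without printing the key itself, is sketched below. The helper name `api_key_is_set` is hypothetical.

```r
# Hypothetical helper: TRUE when the IPUMS_API_KEY environment variable is non-empty
api_key_is_set <- function() {
  nzchar(Sys.getenv("IPUMS_API_KEY"))
}

if (api_key_is_set()) {
  message("IPUMS_API_KEY is set for this session.")
} else {
  message("IPUMS_API_KEY is not set; run set_ipums_api_key() first.")
}
```

Because `readline()` echoes whatever you type into the cell output, consider clearing that output before sharing the notebook.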



## 2. NHGIS Time-Series Data Metadata Exploration

The NHGIS provides a variety of time-series tables, each representing collections of population data over different years. This section helps you identify the right datasets for your analysis by exploring the available time-series tables and filtering them based on specific criteria.

#### Steps:
1. Retrieve metadata for available time-series datasets.
2. Filter and display datasets that focus on a specific topic.
3. Identify which years and geographic levels are covered by each dataset.
4. Select a dataset for extraction.

### 2a. Retrieve Time-Series Metadata
First, we will look at the list of available NHGIS time-series datasets, which includes hundreds of data tables.  Running the code below prints the first ten datasets in the list.

In [8]:
# get list of time-series dataset metadata
datts_meta <- get_metadata_nhgis("time_series_tables") %>% print(n = 10)

# A tibble: 389 × 7
   name  description        geographic_integration sequence time_series years   
   <chr> <chr>              <chr>                     <dbl> <list>      <list>  
 1 A00   Total Population   Nominal                    100. <tibble>    <tibble>
 2 AV0   Total Population   Nominal                    100. <tibble>    <tibble>
 3 B78   Total Population   Nominal                    100. <tibble>    <tibble>
 4 CL8   Total Population   Standardized to 2010       100. <tibble>    <tibble>
 5 A57   Persons by Urban/… Nominal                    101. <tibble>    <tibble>
 6 A59   Persons by Urban/… Nominal                    101. <tibble>    <tibble>
 7 CL9   Persons b

Note that each entry in the list includes not only the description and reference code for the dataset but also [tibbles](https://tibble.tidyverse.org) for "time_series", "years", and "geog_levels".  The information in the tibbles is not visualized in this high-level view of the data but you can imagine that for each "\<tibble>" entry there is another table of information containing additional details on the available data.  We will zoom in deeper in the following steps and you will be able to view the data contained within these tibbles.
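
To make the idea of a nested tibble concrete, the sketch below builds a small toy table with a list-column and then pulls one nested tibble out of it. The table names and values are invented for illustration and are not real NHGIS metadata; with the real object, the analogous call would be `datts_meta$years[[1]]`.

```r
library(dplyr)

# Toy stand-in for the metadata table: one row per table, with a nested
# tibble of years stored in a list-column (values are illustrative only)
toy_meta <- tibble(
  name = c("T01", "T02"),
  description = c("Example table one", "Example table two"),
  years = list(
    tibble(name = c("1990", "2000"), description = c("1990", "2000")),
    tibble(name = "2010", description = "2010")
  )
)

# Pull the nested tibble for the first row
toy_meta$years[[1]]

# Count how many years each table covers
toy_counts <- toy_meta %>%
  mutate(n_years = vapply(years, nrow, integer(1)))
toy_counts
```

Double square brackets (`[[1]]`) return the nested tibble itself, whereas single brackets would return a list of length one.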

This wealth of data is overwhelming and it is unlikely anyone would need it all for a single project.  So in the next step, we will programmatically filter the metadata to select only the datasets focused on a specific topic.

### 2b. Filter Metadata Based on Criteria

You will use the following code to retrieve metadata on available NHGIS time-series datasets and filter them to find datasets that focus on a specific topic.

In this example, we will explore only the datasets that focus on total population.  Therefore, we will filter the entire list of datasets to find the subset whose description includes the phrase "total population".

In [9]:
description_filter <- "total population"

In [10]:
# keep tables whose description matches the filter (case-insensitive)
datts_meta_filter <- datts_meta %>%
  filter(grepl(description_filter, description, ignore.case = TRUE)) %>%
  select(name, description) %>%
  as.data.frame() %>%
  print()

  name      description
1  A00 Total Population
2  AV0 Total Population
3  B78 Total Population
4  CL8 Total Population


Using "total population" as a filter resulted in four potential datasets.  For additional detailed information on NHGIS time-series datasets, refer to the [NHGIS Time Series Tables lookup document](https://assets.nhgis.org/NHGIS_Time_Series_Tables.pdf).

Next, we will look at the metadata for this selection of datasets to determine which of them cover the time range and geographies we are interested in.

### 2c. Identify Available Years and Geographic Levels

This step will display the available years and geographic levels for the filtered datasets. This will help you decide which dataset best suits your analysis.

You can use the *get_metadata_nhgis* command to view metadata for a specific NHGIS time-series table using the table's code.  The following example shows the metadata for table "CL8".

In [11]:
get_metadata_nhgis(time_series_table = "CL8")

name,description,sequence
<chr>,<chr>,<int>
AA,Persons: Total,1

name,description,sequence
<chr>,<chr>,<int>
1990,1990,108
2000,2000,118
2010,2010,131
2020,2020,155

name,description,sequence
<chr>,<chr>,<int>
state,State,4
county,State--County,25
tract,State--County--Census Tract,66
blck_grp,State--County--Census Tract--Block Group,85
cty_sub,State--County--County Subdivision,102
place,State--Place,148
cd111th,"State--Congressional District (2007-2013, 110th-112th Congress)",217
cbsa,Metropolitan Statistical Area/Micropolitan Statistical Area,338
urb_area,Urban Area,372
zcta,5-Digit ZIP Code Tabulation Area,382


The metadata view shows that the CL8 time-series table includes total population information for 1990, 2000, 2010, and 2020 and for a variety of geographic levels.  Here we can see the information included in the "time_series", "years", and "geog_levels" tibbles which were obscured in the high-level view in step 2a.

We could repeat this process for each table from our data filtering process, but to save time, the code below takes the name, description, years, and geographic levels for each table in our filtering results and presents the metadata in a simple reference table.

In [12]:
# get metadata for each time-series table
metadata_list <- map(datts_meta_filter$name, ~ get_metadata_nhgis(time_series_table = .x))

# combine into a data frame with the necessary columns
metadata_combined <- map_dfr(metadata_list, function(metadata) {
  data.frame(
    name = metadata$name,
    description = metadata$description,
    # Collapse the "description" values from "years" and the "name" values from "geog_levels" into single strings
    years = paste(metadata$years$description, collapse = ", "),
    geog_levels = paste(metadata$geog_levels$name, collapse = ", ")
  )
})

# print the final data frame
metadata_combined

name,description,years,geog_levels
<chr>,<chr>,<chr>,<chr>
A00,Total Population,"1790, 1800, 1810, 1820, 1830, 1840, 1850, 1860, 1870, 1880, 1890, 1900, 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020","state, county"
AV0,Total Population,"1970, 1980, 1990, 2000, 2010, 2006-2010, 2007-2011, 2008-2012, 2009-2013, 2010-2014, 2011-2015, 2012-2016, 2013-2017, 2014-2018, 2015-2019, 2020, 2016-2020, 2017-2021, 2018-2022, 2019-2023","state, county, tract, cty_sub, place"
B78,Total Population,"1980, 1990, 2000, 2010, 2006-2010, 2007-2011, 2008-2012, 2009-2013, 2010-2014, 2011-2015, 2012-2016, 2013-2017, 2014-2018, 2015-2019, 2020, 2016-2020, 2017-2021, 2018-2022, 2019-2023","nation, region, division, state, county, tract, cty_sub, place"
CL8,Total Population,"1990, 2000, 2010, 2020","state, county, tract, blck_grp, cty_sub, place, cd111th, cbsa, urb_area, zcta"


Taking a look at these results, we can easily see the available years and geographies for each of the time-series tables we identified in our filtering process.

Note that the lists of year ranges include both single years (e.g. "2000") corresponding to Decennial Census population counts and year ranges (e.g. "2008-2012") corresponding to five-year average population estimates from the [American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs).
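
The two kinds of year labels can be separated programmatically. The following sketch uses a handful of labels copied from the table above and treats any label containing a hyphen as a multi-year range; the variable names are made up for this example.

```r
# A few year labels taken from the reference table above
year_labels <- c("2000", "2010", "2006-2010", "2019-2023", "2020")

# Labels containing a hyphen are multi-year ranges; the rest are single years
is_range <- grepl("-", year_labels, fixed = TRUE)

single_years <- year_labels[!is_range]
year_ranges <- year_labels[is_range]

single_years
year_ranges
```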

### 2d. Select a Dataset

Once we have decided on a specific dataset, we will save the table's code for use in our data extraction later on.

In this example, we select table CL8, which is standardized to 2010 geographies.  You can change this line of code to whichever dataset you want.  You can also select multiple datasets using a character vector (e.g. *c("CL8", "A00")*).  However, if you select multiple datasets, be mindful of the differences in available years and geographies among them.

In [13]:
selection_datts <- "CL8"
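
When combining tables, it can help to check which years and geographic levels they share. The sketch below does this for CL8 and A00 using the values listed in the reference table from step 2c; it is a plain base R illustration and makes no API calls.

```r
# Years and geographic levels copied from the reference table in step 2c
years_a00 <- as.character(seq(1790, 2020, by = 10))
years_cl8 <- c("1990", "2000", "2010", "2020")
geogs_a00 <- c("state", "county")
geogs_cl8 <- c("state", "county", "tract", "blck_grp", "cty_sub",
               "place", "cd111th", "cbsa", "urb_area", "zcta")

# Values available in both tables
common_years <- intersect(years_a00, years_cl8)
common_geogs <- intersect(geogs_a00, geogs_cl8)

common_years
common_geogs
```

Here the two tables overlap only for 1990 through 2020 and only at the state and county levels, so a tract-level request would not be valid for A00.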

Now that we have selected our dataset, let's review its complete metadata.  Verify that you have selected the correct dataset and that the selection meets your data needs.

In [14]:
get_metadata_nhgis(time_series_table = selection_datts)

name,description,sequence
<chr>,<chr>,<int>
AA,Persons: Total,1

name,description,sequence
<chr>,<chr>,<int>
1990,1990,108
2000,2000,118
2010,2010,131
2020,2020,155

name,description,sequence
<chr>,<chr>,<int>
state,State,4
county,State--County,25
tract,State--County--Census Tract,66
blck_grp,State--County--Census Tract--Block Group,85
cty_sub,State--County--County Subdivision,102
place,State--Place,148
cd111th,"State--Congressional District (2007-2013, 110th-112th Congress)",217
cbsa,Metropolitan Statistical Area/Micropolitan Statistical Area,338
urb_area,Urban Area,372
zcta,5-Digit ZIP Code Tabulation Area,382


## 3. NHGIS Geography Shapefile Metadata Exploration

As we saw in our metadata exploration above, the available geographic levels vary by dataset.  If we want to extract geographic data along with our data table, we will need to review the available geographic data files and select an appropriate file for use with our data.

#### Steps:
1. Review dataset metadata.
2. Retrieve metadata for available shapefiles.
3. Filter and display shapefiles based on year and geographic level.
4. Select a shapefile for extraction.

### 3a. Review Time-Series Data Metadata

First we should review which geographies are available for our selected datatable and select a geographic level for our extraction.

In [15]:
get_metadata_nhgis(time_series_table = selection_datts)$geog_levels

name,description,sequence
<chr>,<chr>,<int>
state,State,4
county,State--County,25
tract,State--County--Census Tract,66
blck_grp,State--County--Census Tract--Block Group,85
cty_sub,State--County--County Subdivision,102
place,State--Place,148
cd111th,"State--Congressional District (2007-2013, 110th-112th Congress)",217
cbsa,Metropolitan Statistical Area/Micropolitan Statistical Area,338
urb_area,Urban Area,372
zcta,5-Digit ZIP Code Tabulation Area,382


Select one of the available geographies and save it for use in our data extraction later on.

In this example, we select the Census tract ("tract") geographies.  You can change this line of code to whichever geography you want.  Similar to the dataset selection in step 2d, you can also select multiple geographies using a character vector (e.g. *c("state", "county")*).

In [16]:
selection_geog <- "tract"

### 3b. Retrieve Geography Metadata

Shapefiles are a file format that stores geographic boundaries and are essential for spatial analysis.  This section retrieves and filters shapefile metadata to identify shapefiles which correspond to our selected year and geography.  This filtering step ensures you have the correct geographic boundaries for the population data.

In [17]:
shp_meta <- get_metadata_nhgis("shapefiles")

### 3c. Filter Geography Metadata Based on Year and Geography

For this exercise, we will extract a set of shapefiles at your previously-selected geography as well as a specific year.  As we saw in our metadata exploration above, the available geography levels vary based on the dataset.

For this filtering step, you should also filter based on the year.  For this exercise, we are using time-series table CL8 which contains information on total population harmonized to 2010 geographies.  Therefore, we should only select a shapefile which corresponds to 2010 geographies.

In [18]:
selection_year <- "2010"

If you are unfamiliar with Census geographies, it might seem strange to include a year in this filtering step.  For large geographies, such as "nation" or "state", the year is relatively unimportant because the boundaries of these regions are not redrawn from year to year.  However, for smaller geographies tied to the U.S. Decennial Census, such as "tract", "block", or "blck_grp" (block group), and for political districts, such as "cd" (congressional district), the boundaries can change over time.  Census tract, block group, and block boundaries are redrawn for each Decennial Census based on population numbers, and congressional districts are often redrawn ahead of new congressional elections.  For this reason, it is essential to match your shapefile selection to your time-series data extraction.

Run the code below to list the available shapefiles based on your year and geography specifications.

In [19]:
shp_meta %>% filter(year == selection_year & grepl(selection_geog, geographic_level, ignore.case = TRUE)) %>% print(n = Inf)

# A tibble: 4 × 6
  name                            year  geographic_level   extent basis sequence
  <chr>                           <chr> <chr>              <chr>  <chr>    <int>
1 us_tract_2010_tl2010            2010  Census Tract       Unite… 2010…      603
2 us_tract_cenpop_2010_cenpop2010 2010  Census Tract (Cen… Unite… 2010…      604
3 us_tract_2010_tl2020            2010  Census Tract       Unite… 2020…      605
4 us_ttract_2010_tl2010           2010  Tribal Census Tra… Unite… 2010…      641


The filtering step provides us with a list of potential shapefiles we can use for our extraction based on the year and geography criteria.
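
Because the filter matches any geographic level containing "tract", the list also includes entries other than the plain tract boundary files (rows 2 and 4). One way to narrow the list further is to match on the file name. The sketch below assumes the naming pattern visible in the output above (`us_tract_<year>_tl<basis year>`) holds; that pattern is inferred from these four names only.

```r
# Shapefile names copied from the filtered output above
shp_names <- c("us_tract_2010_tl2010",
               "us_tract_cenpop_2010_cenpop2010",
               "us_tract_2010_tl2020",
               "us_ttract_2010_tl2010")

# Keep names of the form us_tract_2010_tl<four-digit basis year>
boundary_files <- grep("^us_tract_2010_tl[0-9]{4}$", shp_names, value = TRUE)

boundary_files
```

This leaves the two 2010 tract boundary files, which differ only in the basis shown in the "basis" column.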

### 3d. Select a Geography Shapefile

For this exercise, we will select the 2010 Census tract dataset based on the 2010 TIGER/Line files (file "us_tract_2010_tl2010").  We will save this selection for use later in our data extraction step.

In [20]:
selection_shp <- "us_tract_2010_tl2010"

## 4. NHGIS Time-Series Dataset and Geography Shapefile Extraction Specification and Submission

Now that you've identified your dataset and shapefile, this section defines and submits an extraction request to the IPUMS NHGIS API. Extracting data from IPUMS NHGIS allows you to download specific datasets and geographical data directly from the IPUMS server. This method makes it easy to automate and reproduce data requests.  The extraction will include both the selected time-series data and the corresponding shapefiles.

#### Steps:
1. Define and Run the Data Extraction
2. Review the Data Extraction

### 4a. Define the Extraction Parameters and Run the Extraction

Here we will put everything together, including our time-series data table selection (selection_datts), our selected geography (selection_geog), and our selected shapefile (selection_shp).

In [21]:
extraction <- define_extract_nhgis(description = "I-GUIDE IPUMS Population Change Extraction",
                                   time_series_tables = tst_spec(name = selection_datts,
                                                                 geog_levels = selection_geog),
                                   shapefiles = selection_shp)
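
An extract definition can also be limited to a subset of the available years. The sketch below is a variation on the definition above, assuming a recent version of *ipumsr* in which `tst_spec()` accepts a `years` argument. The description text and the object name `extraction_subset` are made up for this example, and nothing is sent to the API until the definition is submitted.

```r
library(ipumsr)

# Sketch: request only two of the years available in CL8
extraction_subset <- define_extract_nhgis(
  description = "Example: CL8 tract population, 2010 and 2020 only",
  time_series_tables = tst_spec(
    name = "CL8",
    geog_levels = "tract",
    years = c("2010", "2020")
  ),
  shapefiles = "us_tract_2010_tl2010"
)

# Printing the definition shows what would be requested; nothing is submitted
extraction_subset
```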

Submit the extraction request and wait for it to complete, then download the resulting data.

In [22]:
# submit extraction
extraction_submitted <- submit_extract(extraction)

# wait for completion
extraction_complete <- wait_for_extract(extraction_submitted)

# check completion
extraction_complete$status

# download the completed extract and keep the file paths
filepath <- download_extract(extraction_complete, overwrite = TRUE)

Successfully submitted IPUMS NHGIS extract number 90

Checking extract status...

Waiting 10 seconds...

Checking extract status...

Waiting 20 seconds...

Checking extract status...

Waiting 30 seconds...

Checking extract status...

IPUMS NHGIS extract 90 is ready to download.
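
The log above shows the pause between status checks growing from 10 to 30 seconds. The general pattern of polling with increasing delays can be sketched as follows. This is a generic illustration with a mock status check, not the actual implementation inside *ipumsr*; `poll_until_ready`, `mock_check`, and the delay values are all hypothetical.

```r
# Hypothetical polling helper; check_fn stands in for any status check
poll_until_ready <- function(check_fn, delays = c(10, 20, 30), sleep_fn = Sys.sleep) {
  for (delay in delays) {
    if (check_fn()) return(TRUE)
    message("Waiting ", delay, " seconds...")
    sleep_fn(delay)
  }
  # One final check after the last wait
  check_fn()
}

# Mock status check that reports "ready" on the third call
calls <- 0
mock_check <- function() {
  calls <<- calls + 1
  calls >= 3
}

# Skip real sleeping so the demonstration runs instantly
ready <- poll_until_ready(mock_check, sleep_fn = function(seconds) invisible(NULL))
ready
```

Spacing out the checks keeps the number of status requests low while a large extract is being prepared.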






Data file saved to C:/Users/vavra/Dropbox/R Spatial/r-spatial/nhgis0090_csv.zip
Shapefile saved to C:/Users/vavra/Dropbox/R Spatial/r-spatial/nhgis0090_shape.zip



### 4b. Review the Extracted Files
If you followed along with this exercise, your data extraction and download should contain the following two files.  If you expanded your extraction to additional datasets and shapefiles, your extraction will contain additional files.

1. A dataset containing total population by Census tract (based on 2010 Census tract boundaries) for all available years in the CL8 time-series dataset (1990, 2000, 2010, and 2020).
2. A shapefile with 2010 Census tract boundaries.
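
The read step that follows picks the two files by position. An alternative that does not depend on their order is to match on the file name suffix, sketched here with the file names reported in the download messages above (directory omitted). The variable names are made up for this example.

```r
# File names as reported in the download messages above (directory omitted)
downloaded <- c("nhgis0090_csv.zip", "nhgis0090_shape.zip")

# Identify each file by its suffix instead of its position
csv_zip <- grep("_csv\\.zip$", downloaded, value = TRUE)
shp_zip <- grep("_shape\\.zip$", downloaded, value = TRUE)

csv_zip
shp_zip
```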

In [37]:
# read the tabular data and the shapefile from the downloaded extract
dat <- read_nhgis(filepath[1])
shp <- read_ipums_sf(filepath[2])

Use of data from NHGIS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.

Rows: 73057 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): GISJOIN, STATE, STATEA, COUNTY, COUNTYA, TRACTA
dbl (11): GEOGYEAR, CL8AA1990, CL8AA1990L, CL8AA1990U, CL8AA2000, CL8AA2000L...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [38]:
head(dat)

GISJOIN,GEOGYEAR,STATE,STATEA,COUNTY,COUNTYA,TRACTA,CL8AA1990,CL8AA1990L,CL8AA1990U,CL8AA2000,CL8AA2000L,CL8AA2000U,CL8AA2010,CL8AA2020,CL8AA2020L,CL8AA2020U
<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
G0100010020100,2010,Alabama,1,Autauga County,1,20100,1772.67,1750,1773,1920.02,1900,1921,1912,1775,1715,1775
G0100010020200,2010,Alabama,1,Autauga County,1,20200,2031.0,2031,2031,1892.0,1892,1892,2170,2055,2055,2115
G0100010020300,2010,Alabama,1,Autauga County,1,20300,2952.0,2952,2952,3339.0,3339,3339,3373,3216,3216,3216
G0100010020400,2010,Alabama,1,Autauga County,1,20400,4401.0,4401,4401,4556.0,4556,4556,4386,4246,4246,4246
G0100010020500,2010,Alabama,1,Autauga County,1,20500,3120.68,3119,3433,6041.9,6040,6364,10766,11222,11222,11222
G0100010020600,2010,Alabama,1,Autauga County,1,20600,3330.0,3330,3330,3272.0,3272,3272,3668,3729,3729,3729


In [39]:
colnames(dat)

**Variables**
* GIS Join Match Code (GISJOIN)
* Geography Year (GEOGYEAR)
* State Name (STATE)
* State Code (STATEA)
* County Name (COUNTY)
* County Code (COUNTYA)
* Census Tract Code (TRACTA)
* 1990 Total Persons (CL8AA1990)
* 1990 Total Persons (Lower Bound) (CL8AA1990L)
* 1990 Total Persons (Upper Bound) (CL8AA1990U)
* 2000 Total Persons (CL8AA2000)
* 2000 Total Persons (Lower Bound) (CL8AA2000L)
* 2000 Total Persons (Upper Bound) (CL8AA2000U)
* 2010 Total Persons (CL8AA2010)
* 2020 Total Persons (CL8AA2020)
* 2020 Total Persons (Lower Bound) (CL8AA2020L)
* 2020 Total Persons (Upper Bound) (CL8AA2020U)
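
The lower and upper bound columns indicate how much uncertainty surrounds a standardized estimate. The sketch below uses two rows copied from the `head(dat)` output above to flag tracts whose 1990 bounds differ and to compute the change in population between 1990 and 2020. The new column names are made up for this example.

```r
# Two rows copied from the head(dat) output above
toy_dat <- data.frame(
  GISJOIN    = c("G0100010020100", "G0100010020200"),
  CL8AA1990  = c(1772.67, 2031),
  CL8AA1990L = c(1750, 2031),
  CL8AA1990U = c(1773, 2031),
  CL8AA2020  = c(1775, 2055)
)

# Flag tracts whose 1990 estimate has a non-zero bound width
toy_dat$uncertain_1990 <- toy_dat$CL8AA1990U > toy_dat$CL8AA1990L

# Population change from 1990 to 2020
toy_dat$change_1990_2020 <- toy_dat$CL8AA2020 - toy_dat$CL8AA1990

toy_dat
```

In this toy example only the first tract has differing bounds, which suggests its 1990 value was estimated rather than counted directly for the 2010 tract boundary.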

## 5. Subset and Merge the Time-Series and Geography Data Extractions

This final section provides a few example data engineering next steps for reference.

1. First, the time-series population data is condensed to include only population counts from 1990, 2000, 2010, and 2020 and the column "GISJOIN" which contains unique codes for each Census tract.
2. Next, the population count columns are renamed.
3. Then the Census tract shapefile is condensed to include only the state FIPS code and "GISJOIN" columns.
4. Finally, the time-series population data is merged with the Census tract shapefile using the unique "GISJOIN" column as the join key.
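
The four steps listed above can be sketched on toy data as follows. The population values are copied from the `head(dat)` output in step 4b. The shapefile stand-in is an ordinary data frame without geometry, and its state FIPS column name (`STATEFP10`) is an assumption for illustration. The `pop_*` names are also made up.

```r
library(dplyr)

# Toy stand-ins for dat and shp (no geometry; STATEFP10 is an assumed column name)
dat_toy <- data.frame(
  GISJOIN    = c("G0100010020100", "G0100010020200"),
  CL8AA1990  = c(1772.67, 2031),
  CL8AA1990L = c(1750, 2031),
  CL8AA2000  = c(1920.02, 1892),
  CL8AA2010  = c(1912, 2170),
  CL8AA2020  = c(1775, 2055)
)
shp_toy <- data.frame(
  GISJOIN   = c("G0100010020100", "G0100010020200"),
  STATEFP10 = c("01", "01"),
  ALAND10   = c(9809944, 3340505)
)

# Steps 1 and 2: keep the join key and population counts, then rename
dat_subset <- dat_toy %>%
  select(GISJOIN, CL8AA1990, CL8AA2000, CL8AA2010, CL8AA2020) %>%
  rename(pop_1990 = CL8AA1990, pop_2000 = CL8AA2000,
         pop_2010 = CL8AA2010, pop_2020 = CL8AA2020)

# Step 3: keep only the state FIPS code and the join key
shp_subset <- shp_toy %>% select(GISJOIN, STATEFP10)

# Step 4: merge on the shared key
merged_toy <- merge(dat_subset, shp_subset, by = "GISJOIN")
merged_toy
```

With a real *sf* object, `select()` keeps the geometry column automatically, so the boundaries would carry through to the merged result.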

### 5a. Merge the Time-Series and Geography Data 

In [40]:
# merge the time-series population data with the Census tract shapefile
dat_shp <- merge(dat, shp, by = "GISJOIN")

The final merged dataset includes total population for 1990, 2000, 2010, and 2020 attached to the geographic boundaries of the 2010 Census tracts.  The code below provides a snapshot of the first six rows of the merged dataset.

In [41]:
head(dat_shp)

Unnamed: 0_level_0,GISJOIN,GEOGYEAR,STATE,STATEA,COUNTY,COUNTYA,TRACTA,CL8AA1990,CL8AA1990L,CL8AA1990U,⋯,NAMELSAD10,MTFCC10,FUNCSTAT10,ALAND10,AWATER10,INTPTLAT10,INTPTLON10,Shape_area,Shape_len,geometry
Unnamed: 0_level_1,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,⋯,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<MULTIPOLYGON [m]>
1,G0100010020100,2010,Alabama,1,Autauga County,1,20100,1772.67,1750,1773,⋯,Census Tract 201,G5020,S,9809944,36312,32.4771112,-86.4903033,9846259,16193.875,MULTIPOLYGON (((888438 -515...
2,G0100010020200,2010,Alabama,1,Autauga County,1,20200,2031.0,2031,2031,⋯,Census Tract 202,G5020,S,3340505,5846,32.475758,-86.4724678,3346347,9844.309,MULTIPOLYGON (((889844.1 -5...
3,G0100010020300,2010,Alabama,1,Autauga County,1,20300,2952.0,2952,2952,⋯,Census Tract 203,G5020,S,5349274,9054,32.4740243,-86.4597033,5358330,10519.641,MULTIPOLYGON (((891383.8 -5...
4,G0100010020400,2010,Alabama,1,Autauga County,1,20400,4401.0,4401,4401,⋯,Census Tract 204,G5020,S,6382705,16244,32.4710782,-86.4446805,6398946,12500.859,MULTIPOLYGON (((892527.3 -5...
5,G0100010020500,2010,Alabama,1,Autauga County,1,20500,3120.68,3119,3433,⋯,Census Tract 205,G5020,S,11397725,48412,32.4589157,-86.4218165,11446139,17113.378,MULTIPOLYGON (((895451 -522...
6,G0100010020600,2010,Alabama,1,Autauga County,1,20600,3330.0,3330,3330,⋯,Census Tract 206,G5020,S,8020366,60048,32.447347,-86.4768023,8080417,14306.062,MULTIPOLYGON (((889098.5 -5...
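
By default, `merge()` keeps only rows whose key appears in both inputs, so it can be worth checking for unmatched keys before or after joining. The sketch below shows the idea with toy key vectors standing in for `dat$GISJOIN` and `shp$GISJOIN`; the third key in each vector is deliberately different.

```r
# Toy key vectors standing in for dat$GISJOIN and shp$GISJOIN
keys_dat <- c("G0100010020100", "G0100010020200", "G0100010020300")
keys_shp <- c("G0100010020100", "G0100010020200", "G0100010020400")

# Keys present in one input but not the other are dropped by merge()
only_in_dat <- setdiff(keys_dat, keys_shp)
only_in_shp <- setdiff(keys_shp, keys_dat)

only_in_dat
only_in_shp
```

Two empty results would indicate that every tract in the table has a matching boundary and vice versa.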


### 5b. Save the Data

Next let's save two versions of our merged IPUMS NHGIS data.

* A *.rds* version of the data.  The **R Data Serialization (RDS)** format will retain metadata for the next time we want to import the file back into R.  One downside to the .rds format is that it is only usable within R.
* A *.csv* version of the data.  The [**Comma-Separated Values (CSV)**](https://en.wikipedia.org/wiki/Comma-separated_values) format is versatile and can be easily accessed in other programs.  However, the CSV file format does not include metadata such as labels for variable levels.

In [43]:
saveRDS(dat_shp, "ipums_nhgis_example.rds")
write.csv(dat_shp, "ipums_nhgis_example.csv")
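
The practical difference between the two formats shows up when the files are read back in. The sketch below round-trips a small toy data frame through temporary files; the column names and values are invented for illustration.

```r
# Toy data frame with a zero-padded code stored as text
toy <- data.frame(STATEFP = c("01", "02"), pop = c(100, 200),
                  stringsAsFactors = FALSE)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(toy, rds_path)
write.csv(toy, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

class(from_rds$STATEFP)  # character, as saved
class(from_csv$STATEFP)  # integer: the leading zeros are lost on import
```

When reading the CSV version back into R, supplying `colClasses` to `read.csv()` is one way to keep code columns as text.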

## Conclusion
This script provides a step-by-step guide to extracting and preparing U.S. Census population data from the IPUMS NHGIS for spatial analysis. By following this approach, social scientists and researchers can automate data extraction tasks, save time, and ensure reproducibility.

The data and shapefile merging capabilities allow users to explore population trends across various geographic levels and time periods. Feel free to customize the script to fit your specific research needs.

## Recommended Next Steps
* **Continue with Chapter 2: IPUMS Data Acquisition and Extraction**
  * 2.1: IPUMS USA Data Extraction Using ipumsr
  * 2.2: IPUMS CPS Data Extraction Using ipumsr
  * 2.3: IPUMS International Microdata Extraction Using ipumsr
  * 2.5: IPUMS Time Use Data Extraction Using ipumsr
  * 2.6: IPUMS Health Surveys Data Extraction Using ipumsr
  * 2.7: Reading IPUMS Global Health Data Extracts Using ipumsr
  * 2.8: Reading IPUMS Higher Education Data Extracts Using ipumsr
* **Move on to Chapter 3: Data Cleaning and Preparation**
  * 3.2: Spatial Data Preparation and Transformation with IPUMS NHGIS

## Quick Code

Don't forget to update the code with your IPUMS API key!