# IPUMS USA Data Extraction Using ipumsr
### by [Kate Vavra-Musser](https://vavramusser.github.io) for the [R Spatial Notebook Series](https://vavramusser.github.io/r-spatial)

This notebook builds on the workflow introduced in the **[Introduction to the IPUMS API for R Users](https://tech.popdata.org/ipumsr/articles/ipums-api.html)** article on the IPUMS website.  As the author of the R Spatial Notebook series, I recognize the IPUMS article as a significant inspiration and source of information for this notebook.

## Introduction
The [IPUMS USA](https://usa.ipums.org/usa) database offers harmonized microdata from the [U.S. Decennial Census](https://www.census.gov/programs-surveys/decennial-census.html), [American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs/about.html), and [Puerto Rico Community Survey (PRCS)](https://www.census.gov/programs-surveys/acs/about/puerto-rico-community-survey.html). It provides detailed, individual-level records on population demographics, economic activity, housing conditions, and social characteristics, enabling the analysis of trends in American society across time and space. Through harmonization, IPUMS USA allows data to be seamlessly compared across census years, despite changes in survey design, geographic boundaries, and variable definitions.

**From the [IPUMS USA Website](https://usa.ipums.org/usa):** IPUMS USA collects, preserves and harmonizes United States Census microdata and provides easy access to this data with enhanced documentation. Data includes Decennial Censuses from 1790 to 2010 and American Community Surveys (ACS) from 2000 to the present.

#### Data Included in the IPUMS USA Repository
* [American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs/about.html)
  * Annual data from 2001 to present
  * 3-year estimates from 2007 to 2013
  * 5-year estimates 2009 to present
  * Historic population sample data from 1850 to 2000
* [Puerto Rico Community Survey (PRCS)](https://www.census.gov/programs-surveys/acs/about/puerto-rico-community-survey.html)
  * Annual data from 2005 to present
  * 3-year estimates from 2007 to 2013
  * 5-year estimates 2009 to present
  * Historic population sample data from 1910 to 2000
* [Historic Full Count data from 1790 to 1950](https://usa.ipums.org/usa/full_count.shtml)

### About the American and Puerto Rico Community Surveys (ACS and PRCS)
The [American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs/about.html) and [Puerto Rico Community Survey (PRCS)](https://www.census.gov/programs-surveys/acs/about/puerto-rico-community-survey.html) are annual surveys conducted by the [U.S. Census Bureau](https://www.census.gov) that collect information on a subset of the U.S. population.  The ACS and PRCS collect data on a variety of topics, including income, poverty, education, marital status, health insurance coverage, disability, occupancy, costs, tenure, and units by type.  The ACS is a more in-depth supplement to the Decennial U.S. Census and in 2005 replaced the long-form version of the Decennial Census survey, which was previously conducted every ten years.  Each year the ACS samples over 3.5 million housing units across the United States, with a new sample of about 250,000 addresses drawn each month.

ACS and PRCS are available as single-year datasets as well as three- and five-year summaries of the data.  While single-year data provide a snapshot of conditions in a specific year, the three- and five-year summaries offer more stable estimates by averaging data over time, making them less susceptible to anomalies and more useful for analyzing smaller geographic areas.
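
To illustrate why pooling several years of data gives more stable estimates, here is a rough sketch of how the standard error of a sample proportion shrinks as the sample grows.  The proportion and sample size are made-up numbers, and the simple-random-sampling formula is an assumption that ignores the complex survey design of the ACS, so treat this as intuition rather than a recipe for real variance estimation.

```r
# Standard error of a sample proportion under simple random sampling.
# Illustrative only: real ACS estimates need survey weights and design-based variance.
se_proportion <- function(p, n) sqrt(p * (1 - p) / n)

p <- 0.15          # hypothetical share of residents with some characteristic
n_one_year <- 400  # hypothetical number of respondents in a small area in one year

se_1yr <- se_proportion(p, n_one_year)      # single-year sample
se_5yr <- se_proportion(p, 5 * n_one_year)  # five years of similar-sized samples pooled

# pooling five years shrinks the standard error by a factor of sqrt(5)
round(c(one_year = se_1yr, five_year = se_5yr, ratio = se_1yr / se_5yr), 4)
```

The trade-off is that a pooled estimate describes the whole multi-year period rather than any single year.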

### Historic Census Microdata
In addition to these contemporary datasets, IPUMS USA makes [historic Full Count U.S. Census microdata](https://usa.ipums.org/usa/full_count.shtml) freely available for research purposes, covering the period 1790 to 1950. This dataset includes over 800 million individual-level records from 1850 to 1940 and 7.5 million household-level records from 1790 to 1840. The microdata represent a collaborative effort between IPUMS and the genealogical organizations [Ancestry.com](https://www.ancestry.com) and [FamilySearch](https://www.familysearch.org/en/united-states), leveraging extensive digitized historical census records for scientific purposes.

### Notebook Goals
This notebook introduces the process of extracting [IPUMS USA](https://usa.ipums.org/usa) data using the [IPUMS API](https://developer.ipums.org/docs/v2/apiprogram) via the [ipumsr R package](https://cran.r-project.org/web/packages/ipumsr/index.html). Users will learn how to define, submit, and download an IPUMS USA data extract, specifying desired variables, time periods, and geographic units for analysis. By the end of this notebook, users will have the skills to efficiently acquire customized IPUMS USA datasets and prepare them for spatial and statistical workflows.

### ✨ Prerequisites ✨
* Complete [Introduction to IPUMS and the IPUMS API](https://platform.i-guide.io/notebooks/82d3b176-e4e6-4307-8186-318a3fe6c81a)
* Set Up Your [IPUMS Account and API Key](https://account.ipums.org/api_keys)

### Notebook Overview
1. Setup
2. IPUMS USA Metadata Exploration
3. IPUMS USA Data Extraction Specification and Submission

## 1. Setup
This section will guide you through the process of installing essential packages and setting your IPUMS API key.

#### Required Packages

[**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) A Grammar of Data Manipulation. This notebook uses the following function from *dplyr*.

* [*filter*](https://rdrr.io/cran/dplyr/man/filter.html) · keep rows that match a condition
* This notebook also uses [*%>%*](https://magrittr.tidyverse.org/reference/pipe.html), referred to as the *pipe* operator.  The *pipe* operator passes the output from one function directly into the next function to create streamlined workflows and is a commonly used component of the [*tidyverse*](https://www.tidyverse.org).
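
As a small sketch of how the pipe works with *filter*, the example below builds a toy table from a few rows of the IPUMS USA sample list (the same rows appear in the sample list later in this notebook) and filters it twice, once with a regular function call and once with the pipe.  The object names are illustrative only.

```r
library(dplyr)

# a few rows copied from the IPUMS USA sample list
samples <- data.frame(
  name = c("us1850a", "us1860a", "us2010a"),
  description = c("1850 1%", "1860 1%", "2010 ACS")
)

# without the pipe: the data frame is the first argument to filter()
acs_nested <- filter(samples, description == "2010 ACS")

# with the pipe: the left-hand side is passed as the first argument
acs_piped <- samples %>% filter(description == "2010 ACS")

# both approaches return the same single row
identical(acs_nested, acs_piped)
```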

[**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) An R Interface for Downloading, Reading, and Handling IPUMS Data.  This notebook uses the following functions from *ipumsr*.

* [*define_extract_micro*](https://rdrr.io/github/mnpopcenter/ripums/man/define_extract_micro.html) · define an extract request for an IPUMS microdata collection
* [*download_extract*](https://rdrr.io/cran/ipumsr/man/download_extract.html) · download a completed IPUMS data extract
* [*get_sample_info*](https://rdrr.io/cran/ipumsr/man/get_sample_info.html) · list available samples for IPUMS microdata collections
* [*read_ipums_ddi*](https://rdrr.io/cran/ipumsr/man/read_ipums_ddi.html) · read metadata about an IPUMS microdata extract from a DDI codebook (.xml) file
* [*read_ipums_micro*](https://rdrr.io/cran/ipumsr/man/read_ipums_micro.html) · read data from an IPUMS microdata extract
* [*set_ipums_api_key*](https://rdrr.io/cran/ipumsr/man/set_ipums_api_key.html) · set your IPUMS API key
* [*submit_extract*](https://rdrr.io/cran/ipumsr/man/submit_extract.html) · submit an extract request via the IPUMS API
* [*wait_for_extract*](https://rdrr.io/cran/ipumsr/man/wait_for_extract.html) · wait for an extract to finish processing

[**stringr**](https://cran.r-project.org/web/packages/stringr/index.html) Simple, Consistent Wrappers for Common String Operations.  This notebook uses the following function from *stringr*.

* [*str_detect*](https://stringr.tidyverse.org/reference/str_detect.html) · detect the presence or absence of a match
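
Here is a minimal sketch of what *str_detect* returns, using a few sample descriptions from the IPUMS USA sample list as toy input.  The vector name is illustrative.

```r
library(stringr)

# sample descriptions copied from the IPUMS USA sample list
descriptions <- c("2010 ACS", "2008-2010, ACS 3-year", "1850 1%")

# TRUE where the description contains the pattern, FALSE otherwise
is_acs <- str_detect(descriptions, "ACS")
is_acs

# conditions can be combined with & to require both patterns
is_acs_3yr <- is_acs & str_detect(descriptions, "3-year")
is_acs_3yr
```

Because the result is a logical vector with one element per input string, it can be used directly as a condition inside *filter*.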

### 1a. Install and Load Required Packages
If you have not already installed the required packages, uncomment and run the code below:

In [None]:
# install.packages(c("dplyr", "ipumsr", "stringr"))

Load the packages into your workspace.

In [1]:
library(dplyr)
library(ipumsr)
library(stringr)


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




### 1b. Set Your IPUMS API Key

Store your [IPUMS API key](https://account.ipums.org/api_keys) in your environment using the following code.

Refer to [Chapter 1.1 Introduction to IPUMS and the IPUMS API](https://platform.i-guide.io/notebooks/82d3b176-e4e6-4307-8186-318a3fe6c81a) for instructions on setting up your IPUMS account and API key.

In [2]:
ipums_api_key <- readline("Please enter your IPUMS API key: ")
set_ipums_api_key(ipums_api_key, save = TRUE, overwrite = TRUE)

Please enter your IPUMS API key:  [API key redacted]


Existing .Renviron file copied to C:\Users\vavra\Documents/.Renviron_backup for backup purposes.

The environment variable IPUMS_API_KEY has been set and saved for future sessions.
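
In later sessions it can be handy to confirm that the key is available before calling the API.  The sketch below only checks whether the `IPUMS_API_KEY` environment variable named in the message above is non-empty; it does not verify that the key itself is valid.

```r
# IPUMS_API_KEY is the environment variable named in the message above
api_key_is_set <- nzchar(Sys.getenv("IPUMS_API_KEY"))

if (api_key_is_set) {
  message("IPUMS API key found.")
} else {
  message("No IPUMS API key found; run set_ipums_api_key() before continuing.")
}
```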



## 2. IPUMS USA Metadata Exploration

Before submitting an IPUMS data extraction request, it’s essential to ensure the parameters of the extraction definition are set up correctly.  The extraction definition specifies the sample, variables, and other options.

If this is your first time using the IPUMS API in R, or if you are setting up a new data extract for a new project, it is a good idea to start by exploring the available data, which can be done using the *ipumsr* package.

### 2a. Review the List of Samples

First, let's take a look at the entire list of datasets available from the [IPUMS USA data repository](https://usa.ipums.org/usa).  The USA data available for direct extraction using the IPUMS API include the [American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs) and [Puerto Rico Community Survey (PRCS)](https://www.census.gov/programs-surveys/acs/about/puerto-rico-community-survey.html).

For this step, we will use the [*get_sample_info*](https://rdrr.io/cran/ipumsr/man/get_sample_info.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package.  This function will return a list of all datasets from the specified IPUMS data repository which are available to be downloaded using the IPUMS API.  Since we are focusing on IPUMS USA, we will specify that we want to view all available samples from the IPUMS USA repository by passing *"usa"* to the function.  This code stores the metadata from all available samples in the IPUMS USA repository to the object *metadata_usa*.

**★ Pro Tip:** You can use the [*get_sample_info*](https://rdrr.io/cran/ipumsr/man/get_sample_info.html) function to retrieve metadata for any of the available IPUMS repositories by changing the database reference code.
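
As a sketch of this tip, the code below requests the sample list for more than one collection.  The code `"cps"` (IPUMS CPS) is used here as an example of another collection code, and the guard is an added convenience so the API is only called when *ipumsr* is installed and an API key is set.

```r
# collection codes to look up; "cps" (IPUMS CPS) is used here as an example
collections <- c("usa", "cps")

# only call the API when ipumsr is installed and an API key is available
can_query <- requireNamespace("ipumsr", quietly = TRUE) &&
  nzchar(Sys.getenv("IPUMS_API_KEY"))

if (can_query) {
  # one table of samples per collection, stored in a named list
  sample_lists <- lapply(collections, ipumsr::get_sample_info)
  names(sample_lists) <- collections

  # number of samples available in each collection
  print(sapply(sample_lists, nrow))
}
```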

In [3]:
# retrieve the list of samples from the IPUMS USA database
metadata_usa <- get_sample_info("usa")

Let's take a look at the dimensions of the *metadata_usa* object.  This will give us an idea of how many samples are available from the IPUMS USA repository.

In [4]:
# view the dimensions of the list of samples
dim(metadata_usa)

The results tell us that there are 148 samples available from IPUMS USA.

Let's take a look at the first few elements in the table of samples.

In [5]:
# view the first few lines of the list of samples
head(metadata_usa)

name,description
<chr>,<chr>
us1850a,1850 1%
us1850c,1850 100% sample (Revised November 2023)
us1860a,1860 1%
us1860b,1860 1% sample with black oversample
us1860c,1860 100% sample (Revised November 2023)
us1870a,1870 1%


In [6]:
# filter the list of samples by survey and year
metadata_usa %>% filter(str_detect(description, "ACS"),      # filter description by survey
                        str_detect(description, "2010"))     # filter description by year

name,description
<chr>,<chr>
us2010a,2010 ACS
us2010c,"2008-2010, ACS 3-year"
us2010e,"2006-2010, ACS 5-year"
us2012c,"2010-2012, ACS 3-year"
us2014c,"2010-2014, ACS 5-year"


The filtering process has returned five relevant samples including the 2010 ACS, the 2008-2010 and 2010-2012 ACS 3-year summaries, and the 2006-2010 and 2010-2014 ACS 5-year summaries.  For this exercise we will use the **2010 ACS Sample** which is referred to using identification code (*name*) **us2010a**.

**★ Pro Tip:** In the IPUMS specification, ACS and PRCS multi-year summaries are referred to by the **final year** in the corresponding time range.  For example, the 2008-2010 ACS 3-year summary is referred to using the code *us2010c*.
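
The naming convention can be checked against the filtered sample list above.  This sketch copies the four multi-year rows into a toy table and compares the year embedded in each sample code with the final year of the range in its description.  The column names `code_year` and `final_year` are illustrative.

```r
# multi-year ACS samples copied from the filtered sample list above
multi_year <- data.frame(
  name = c("us2010c", "us2010e", "us2012c", "us2014c"),
  description = c("2008-2010, ACS 3-year", "2006-2010, ACS 5-year",
                  "2010-2012, ACS 3-year", "2010-2014, ACS 5-year")
)

# year embedded in the sample code, e.g. "us2010c" -> "2010"
multi_year$code_year <- substr(multi_year$name, 3, 6)

# final year of the range in the description, e.g. "2008-2010, ..." -> "2010"
multi_year$final_year <- sub("^[0-9]{4}-([0-9]{4}).*$", "\\1", multi_year$description)

# the two columns agree for every row
multi_year
all(multi_year$code_year == multi_year$final_year)
```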

## 3. IPUMS USA Data Extraction Specification and Submission

Once we have reviewed the available samples and decided on the dataset, the next step is to set up a data extraction using the [*define_extract_micro*](https://rdrr.io/github/mnpopcenter/ripums/man/define_extract_micro.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package.  This function requires the following minimum parameters:

* *collection* · the IPUMS data collection for the extract (for this exercise we are downloading from IPUMS USA so we use the code "usa")
* *description* · text description of the extract
* *samples* · vector of samples to include in the extract; samples should be specified using the sample identification codes
* *variables* · vector of variables to include in the extract

### 3a. Define the Variable List

We already know what we will pass to the function for the *collection* ("usa") and *samples* ("us2010a") parameters.  Next we will need to determine which variables we want.

If you are already familiar with IPUMS USA data extractions using their web-based data extract platforms, you might already know which variables are available for our selected sample.  If not, the best place to start is by exploring the web-based [**IPUMS USA Data Extract Platform**](https://usa.ipums.org/usa-action/variables/live_search) to see what variables are available and identify the appropriate variable codes.  Before searching for variables, be sure to click the **Select Samples** button in the top-left corner of the search platform and select the samples you are planning to use.  Since we are using the 2010 ACS sample for this example, you should select the 2010 ACS sample within the search platform.  What variables are available, and the codes used for the variables, may differ based on your selected sample, so it is important to be specific.

For this example we will use the following set of variables from the 2010 ACS.

**Variable Selection**
* [State FIPS Code (STATEFIP)](https://en.wikipedia.org/wiki/Federal_Information_Processing_Standard_state_code)
* [Public Use Microdata Area (PUMA)](https://www.census.gov/programs-surveys/geography/guidance/geo-areas/pumas.html#:~:text=Public%20Use%20Microdata%20Areas%20(PUMAs)%20are%20non%2Doverlapping%2C,%2C%20Puerto%20Rico%2C%20and%20Guam.)
* Sex (SEX)
* Age (AGE)
* Race (RACE)
* Educational Attainment (EDUC)
* Total Personal Income (INCTOT)

By default, the data extraction will include both our selected variables and a set of IPUMS preselected variables.  The preselected variables include metainformation such as identification codes and survey weights.  We will explore and list the preselected variables after completing the data extraction.

### 3b. Define the Data Extract

Now that we know the collection ("usa"), sample ("us2010a"), and list of variables (c("STATEFIP", "PUMA", "SEX", "AGE", "RACE", "EDUC", "INCTOT")), we are ready to define our data extract request.  In this step we will add a text description of the request, which can be anything and is included to help us differentiate between requests.  For this extract we will use the simple description "IPUMS USA Data Extraction".

Here we pass all the extraction definition information to the [*define_extract_micro*](https://rdrr.io/github/mnpopcenter/ripums/man/define_extract_micro.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package and store the resulting extraction definition in the object *extract_definition*.

**★ Pro Tip:** You can specify multiple samples in the same data extract by passing all sample identification codes as a character vector.  Be sure that the variables you specify are available for all of the samples!
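
As a sketch of the multi-sample tip, the definition below combines two samples in one request.  The second sample code `"us2011a"` is assumed here to be the 2011 ACS and is included only for illustration, so confirm the code in the sample list before using it.  The description and the shortened variable list are also illustrative.

```r
# sample codes to combine; "us2011a" (assumed to be the 2011 ACS) is illustrative
samples <- c("us2010a", "us2011a")
variables <- c("STATEFIP", "AGE", "SEX")

# define_extract_micro() only builds the definition locally; nothing is submitted
if (requireNamespace("ipumsr", quietly = TRUE)) {
  multi_sample_definition <- ipumsr::define_extract_micro(
    collection = "usa",
    description = "Example with two ACS samples",
    samples = samples,
    variables = variables
  )
  print(multi_sample_definition)
}
```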

In [7]:
# set up the data extraction definition
extract_definition <- define_extract_micro(collection = "usa",
                                           description = "IPUMS USA Data Extraction",
                                           samples = c("us2010a"),
                                           variables = c("STATEFIP", "PUMA", "SEX", "AGE", "RACE", "EDUC", "INCTOT"))

Let's review the extraction definition information to make sure we have set it up the way we intended.

In [8]:
# review the extraction definition
extract_definition

Everything looks good so we will submit the extraction request, wait for it to complete, and download the resulting data.

### 3c. Submit the Extract Request

Now that the extraction definition is set up, we can submit it to the IPUMS API using the [*submit_extract*](https://rdrr.io/cran/ipumsr/man/submit_extract.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package.

For this exercise, after submitting the request we will also use the [*wait_for_extract*](https://rdrr.io/cran/ipumsr/man/wait_for_extract.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package to monitor the status of the request.  This is not a necessary step but it is helpful, especially when submitting large requests.

Finally, once the extract is complete, we can download it using the [*download_extract*](https://rdrr.io/cran/ipumsr/man/download_extract.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package.  The function returns the path to the downloaded DDI codebook file, which we store in the object *filepath*.

In [9]:
# submit extraction request
extract_submitted <- submit_extract(extract_definition)

# wait for completion
extraction_complete <- wait_for_extract(extract_submitted)

# check completion status
extraction_complete$status

# download the extract and store the path to the DDI codebook file
filepath <- download_extract(extraction_complete, overwrite = TRUE)

Successfully submitted IPUMS USA extract number 30

Checking extract status...

Waiting 10 seconds...

Checking extract status...

Waiting 20 seconds...

Checking extract status...

IPUMS USA extract 30 is ready to download.






DDI codebook file saved to C:/Users/vavra/Dropbox/R Spatial/r-spatial/usa_00030.xml
Data file saved to C:/Users/vavra/Dropbox/R Spatial/r-spatial/usa_00030.dat.gz
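
If the R session is interrupted after submission, the extract does not need to be resubmitted.  The sketch below looks up an existing extract by its collection code and extract number and downloads it if it is ready.  The number 30 comes from the output above; replace it with the number of your own extract.  The guard is an added convenience so the API is only called when *ipumsr* is installed and an API key is set.

```r
# "usa:30" combines the collection code with the extract number reported above
extract_id <- "usa:30"

# only call the API when ipumsr is installed and an API key is available
can_query <- requireNamespace("ipumsr", quietly = TRUE) &&
  nzchar(Sys.getenv("IPUMS_API_KEY"))

if (can_query) {
  # look up the extract on the IPUMS servers without resubmitting it
  previous_extract <- ipumsr::get_extract_info(extract_id)

  # download only if the extract is complete and its files are still available
  if (ipumsr::is_extract_ready(previous_extract)) {
    filepath <- ipumsr::download_extract(previous_extract, overwrite = TRUE)
  }
}
```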



### 3d. Review the Extract

Once we have downloaded the extract, we are ready to review it and transform it to a format we can easily use.  The data extract download will contain the following two files.

1. A [DDI (Data Documentation Initiative)](https://ddialliance.org) codebook file (file extension .xml) containing metadata and descriptive information for the data.
2. A zipped data (.dat) file (file extension .dat.gz) containing the data.

Read the ddi and data files into a format which we can work with in R.  The final *dat* object will contain the data from our extraction in a table format which is easy to use in R.

In [10]:
ddi <- read_ipums_ddi(filepath)
dat <- read_ipums_micro(ddi)

Use of data from IPUMS USA is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.



We now have a useable version of our dataset stored in *dat*.  Let's take a look at the number of observations and variables in the data.

In [11]:
dim(dat)

The 2010 ACS data we downloaded includes information on 19 variables for about 3.1 million individuals.  This is broadly consistent with the roughly 3.5 million housing units the ACS samples each year.

Let's take a look at the first few lines of the data.

In [12]:
head(dat)

YEAR,SAMPLE,SERIAL,CBSERIAL,HHWT,CLUSTER,STATEFIP,PUMA,STRATA,GQ,PERNUM,PERWT,SEX,AGE,RACE,RACED,EDUC,EDUCD,INCTOT
<int>,<int+lbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int+lbl>,<dbl+lbl>,<dbl>,<int+lbl>,<dbl>,<dbl>,<int+lbl>,<int+lbl>,<int+lbl>,<int+lbl>,<int+lbl>,<int+lbl>,<dbl+lbl>
2010,201001,1,69,96,2010000000000.0,1,400,40001,1,1,96,2,75,1,100,6,63,7500
2010,201001,2,80,97,2010000000000.0,1,2200,220001,1,1,97,1,25,1,100,11,114,17000
2010,201001,2,80,97,2010000000000.0,1,2200,220001,1,2,128,2,26,1,100,7,71,13000
2010,201001,2,80,97,2010000000000.0,1,2200,220001,1,3,182,1,3,1,100,0,2,9999999
2010,201001,3,140,90,2010000000000.0,1,100,10001,1,1,90,2,87,1,100,7,71,29400
2010,201001,4,224,82,2010000000000.0,1,1300,130001,1,1,82,1,33,1,100,10,101,28000


Notice that this data is in [*tibble*](https://tibble.tidyverse.org) format rather than the more common *data.frame* format you might be used to as an R user.  A tibble can be thought of as a version of a data.frame that includes additional functionality and metadata visibility.  It is also more compatible with the [*tidyverse*](https://www.tidyverse.org) packages, including the [*dplyr*](https://cran.r-project.org/web/packages/dplyr/index.html) package we use in this notebook.
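
One detail in the preview above deserves attention before any analysis: the fourth row shows an INCTOT value of 9999999 for a 3-year-old.  In IPUMS USA this value is a code for "N/A" rather than a dollar amount.  The sketch below uses a toy copy of the six preview rows to show how the code distorts a summary if it is not recoded.  Special codes differ between variables, so check the codebook for each variable you use.

```r
# AGE and INCTOT for the six rows shown in the preview above
preview <- data.frame(
  AGE = c(75, 25, 26, 3, 87, 33),
  INCTOT = c(7500, 17000, 13000, 9999999, 29400, 28000)
)

# treating the N/A code as a dollar amount badly distorts the mean
mean(preview$INCTOT)

# recode the N/A code to a missing value before summarizing
preview$INCTOT_clean <- ifelse(preview$INCTOT == 9999999, NA, preview$INCTOT)
mean(preview$INCTOT_clean, na.rm = TRUE)
```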

As mentioned above, IPUMS includes a set of preselected variables in data extractions, along with the variables selected by the user.  We only selected 7 variables for the extraction but the resulting download includes 19 variables.  Let's take a look at the list of column names.

In [13]:
colnames(dat)

This list includes the 7 variables we originally selected as well as detailed supplemental variables for 2 of our selected variables (RACED for the variable RACE and EDUCD for the variable EDUC).  We also see that there are 10 additional IPUMS preselected variables which mainly include metadata such as identification codes, weights, and other metainformation.

**Variable Selection**
* [State FIPS Code (STATEFIP)](https://en.wikipedia.org/wiki/Federal_Information_Processing_Standard_state_code)
* [Public Use Microdata Area (PUMA)](https://www.census.gov/programs-surveys/geography/guidance/geo-areas/pumas.html#:~:text=Public%20Use%20Microdata%20Areas%20(PUMAs)%20are%20non%2Doverlapping%2C,%2C%20Puerto%20Rico%2C%20and%20Guam.)
* Sex (SEX)
* Age (AGE)
* Race (RACE)
* Educational Attainment (EDUC)
* Total Personal Income (INCTOT)

**Detailed Supplements for Selected Variables**
* Race (detailed) (RACED)
* Education (detailed) (EDUCD)

**IPUMS Preselected Variables**
* Census Year (YEAR)
* IPUMS Sample Identifier (SAMPLE)
* Household Serial Number (SERIAL)
* Original Census Bureau Household Serial Number (CBSERIAL)
* Household Weight (HHWT)
* Household Cluster for Variance Estimation (CLUSTER)
* Household Strata for Variance Estimation (STRATA)
* Group Quarters Status (GQ)
* Person Number in Sample Unit (PERNUM)
* Person Weight (PERWT)
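
The person weight matters because each record in the sample stands in for many people in the population.  As a rough sketch, the toy table below copies AGE and PERWT for the six person records shown in the earlier preview and compares an unweighted mean with a weighted one.  Six records are far too few for a real estimate; the point is only to show how PERWT enters the calculation.

```r
# AGE and PERWT for the six person records shown in the earlier preview
people <- data.frame(
  AGE = c(75, 25, 26, 3, 87, 33),
  PERWT = c(96, 97, 128, 182, 90, 82)
)

# each record stands in for PERWT people, so the weights sum to the
# number of people these six records represent
sum(people$PERWT)

# unweighted mean treats every record equally
mean(people$AGE)

# weighted mean lets each record count in proportion to PERWT
weighted.mean(people$AGE, w = people$PERWT)
```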

### 3e. Save the Data

Finally, let's save the data we extracted from IPUMS USA.  We will save the data in the following two formats:

* A *.rds* version of the data.  The **R Data Serialization (RDS)** format will retain metadata for the next time we want to import the file back into R.  One downside to the .rds format is it is only useable within R.
* A *.csv* version of the data.  The [**Comma-Separated Values (CSV)**](https://en.wikipedia.org/wiki/Comma-separated_values) format is versatile and can be easily accessed in other programs.  However, the CSV file format does not include metadata such as labels for variable levels.
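
The difference in metadata handling can be seen with a small round trip.  This sketch attaches a descriptive label to a column of a toy data frame as a plain attribute (a simplified stand-in for the labelled columns in the IPUMS data), writes it to temporary files in both formats, and reads each back.

```r
# toy data with a descriptive label attached to one column as an attribute
toy <- data.frame(SEX = c(1, 2, 2))
attr(toy$SEX, "label") <- "Sex"

# temporary files keep the example from cluttering the working directory
rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(toy, rds_path)
write.csv(toy, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

attr(from_rds$SEX, "label")  # the label attribute is preserved
attr(from_csv$SEX, "label")  # NULL: the label did not survive the CSV round trip
```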

Since our data is very large (the .csv version of the saved data will be about 300 MB), let's first subset it to make it a little easier to work with.  Before saving, we will subset the data to include only individuals located in the state of Michigan (FIPS code 26).

In [14]:
# subset the data to only the state of Michigan
dat_subset <- dat[dat$STATEFIP == 26,]

# view the dimensions of the new data table
dim(dat_subset)

Subsetting the data to only Michigan reduces the data to 98,973 individuals, making it much easier to work with and store.

In [15]:
saveRDS(dat_subset, "ipums_usa_example.rds")
write.csv(dat_subset, "ipums_usa_example.csv")

At the end of this exercise we have a freshly downloaded dataset from the IPUMS USA repository saved in our working directory.

---

## Next Steps

* Continue to [**Chapter 3.02 IPUMS NHGIS Data Extraction using ipumsr**](https://platform.i-guide.io/notebooks/be08e56e-1c08-458e-a230-263c64d386bc)
* Move on to Chapter 5: Data Cleaning, Preparation, and Exploratory Data Analysis (EDA)
  * [**Chapter 5.01 Data Cleaning and Preparation with IPUMS USA**]()
  * [**Chapter 5.03 Exploratory Data Analysis with IPUMS USA**]()
* Return to the [**R Spatial Notebooks Project Chapter List**](https://vavramusser.github.io/r-spatial/#:~:text=Chapter%201%3A%20Data%20Sources%20and%20APIs) to view a list of all available notebooks organized in the R Spatial Notebooks chapter structure.
* Visit the [**R Spatial Notebooks Project Homepage**](https://vavramusser.github.io/r-spatial) to learn more about the project, view the list of all notebooks, and explore additional resources.
* Join the project [**Mailing List**](https://mailchi.mp/ab01e8fc8397/r-spatial-email-signup) to hear about future notebook releases and other updates.
* If you have an idea for a new notebook please submit your idea via the [**Suggestion Box**](https://us19.list-manage.com/survey?u=746bf8d366d6fbc99c699e714&id=54590a28ea&attribution=false).

---

## ★ Thank You ★

Thank you so much for engaging with this notebook and supporting the project!  The R Spatial Notebooks Project is a labor of love so if you enjoy or benefit from these notebooks, please consider [**Donating to the Project**](https://buymeacoffee.com/vavramusser).  Your support allows me to continue producing notebooks and supporting the R Spatial Notebooks community.

---

## Quick Code
A clean and simple version of the code included in this notebook (excluding the metadata exploration steps).  **Don't forget to update the code with your IPUMS API key!**