# IPUMS USA Data Extraction Using ipumsr

## Introduction
The [IPUMS USA](https://usa.ipums.org/usa) database offers harmonized microdata from the [U.S. Decennial Census](https://www.census.gov/programs-surveys/decennial-census.html) and the [American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs/about.html). It provides detailed, individual-level records on population demographics, economic activity, housing conditions, and social characteristics, enabling the analysis of trends in American society across time and space. Through harmonization, IPUMS USA allows data to be seamlessly compared across census years, despite changes in survey design, geographic boundaries, and variable definitions.

**From the [IPUMS USA Website](https://usa.ipums.org/usa):** IPUMS USA collects, preserves and harmonizes United States Census microdata and provides easy access to this data with enhanced documentation. Data includes Decennial Censuses from 1790 to 2010 and American Community Surveys (ACS) from 2000 to the present.

This notebook introduces the process of extracting [IPUMS USA](https://usa.ipums.org/usa) data using the [IPUMS API](https://developer.ipums.org/docs/v2/apiprogram) via the [ipumsr R package](https://cran.r-project.org/web/packages/ipumsr/index.html). Users will learn how to define, submit, and download an IPUMS USA data extract, specifying desired variables, time periods, and geographic units for analysis. By the end of this notebook, users will have the skills to efficiently acquire customized IPUMS USA datasets and prepare them for spatial and statistical workflows.

### ★ Prerequisites ★
* Complete Chapter 1.1: Introduction to IPUMS and the IPUMS API
* Set Up Your [IPUMS Account and API Key](https://account.ipums.org/api_keys)

### Notebook Overview
1. Setup
2. IPUMS USA Metadata Exploration
3. IPUMS USA Data Extraction Specification and Submission

## 1. Setup
This section will guide you through the process of installing essential packages and setting your IPUMS API key.

#### Required Packages

[**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) A Grammar of Data Manipulation. This notebook uses the following function from *dplyr*.

* [*filter*](https://rdrr.io/cran/dplyr/man/filter.html) · keep rows that match a condition
* This notebook also uses [*%>%*](https://magrittr.tidyverse.org/reference/pipe.html), referred to as the *pipe* operator, which is used to pass the output from one function directly into the next function for the purpose of creating streamlined workflows.  The *pipe* operator is a commonly used component of the [*tidyverse*](https://www.tidyverse.org).

[**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) An R Interface for Downloading, Reading, and Handling IPUMS Data.  This notebook uses the following functions from *ipumsr*.

* [*define_extract_micro*](https://rdrr.io/github/mnpopcenter/ripums/man/define_extract_micro.html) · define an extract request for an IPUMS microdata collection
* [*download_extract*](https://rdrr.io/cran/ipumsr/man/download_extract.html) · download a completed IPUMS data extract
* [*get_sample_info*](https://rdrr.io/cran/ipumsr/man/get_sample_info.html) · list available samples for IPUMS microdata collections
* [*read_ipums_ddi*](https://rdrr.io/cran/ipumsr/man/read_ipums_ddi.html) · read metadata about an IPUMS microdata extract from a DDI codebook (.xml) file
* [*read_ipums_micro*](https://rdrr.io/cran/ipumsr/man/read_ipums_micro.html) · read data from an IPUMS microdata extract
* [*set_ipums_api_key*](https://rdrr.io/cran/ipumsr/man/set_ipums_api_key.html) · set your IPUMS API key
* [*submit_extract*](https://rdrr.io/cran/ipumsr/man/submit_extract.html) · submit an extract request via the IPUMS API
* [*wait_for_extract*](https://rdrr.io/cran/ipumsr/man/wait_for_extract.html) · wait for an extract to finish processing

[**stringr**](https://cran.r-project.org/web/packages/stringr/index.html) Simple, Consistent Wrappers for Common String Operations.  This notebook uses the following function from *stringr*.

* [*str_detect*](https://stringr.tidyverse.org/reference/str_detect.html) · detect the presence or absence of a match

### 1a. Install and Load Required Packages
If you have not already installed the required packages, uncomment and run the code below:

In [None]:
# install.packages(c("dplyr", "ipumsr", "stringr"))
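
If you would rather install only the packages that are missing, a small helper can check each one first. The following is a minimal sketch; the helper name `missing_packages` is hypothetical and is not part of any package.

```r
# Hypothetical helper: return the names of packages that are not yet installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("dplyr", "ipumsr", "stringr")
to_install <- missing_packages(required)
print(to_install)

# Uncomment to install only the missing packages
# if (length(to_install) > 0) install.packages(to_install)
```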

Load the packages into your workspace.

In [None]:
library(dplyr)
library(ipumsr)
library(stringr)

### 1b. Set Your IPUMS API Key

Store your [IPUMS API key](https://account.ipums.org/api_keys) in your environment using the following code.

Refer to *Chapter 1.1: Introduction to IPUMS and the IPUMS API* for instructions on setting up your IPUMS account and API key.

In [None]:
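# prompt for the API key (readline only prompts in an interactive session)
ipums_api_key <- readline("Please enter your IPUMS API key: ")
# save the key to your .Renviron file so that future sessions can use it
set_ipums_api_key(ipums_api_key, save = TRUE, overwrite = TRUE)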
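
Before moving on, it can be useful to confirm that a key is actually available, especially when the notebook is run non-interactively. The sketch below assumes that *ipumsr* keeps the key in the `IPUMS_API_KEY` environment variable; the helper `has_ipums_api_key` is hypothetical and only checks that the variable is non-empty, not that the key is valid.

```r
# Hypothetical helper: TRUE when a non-empty IPUMS API key is available.
# Assumes the key is stored in the IPUMS_API_KEY environment variable.
has_ipums_api_key <- function(var = "IPUMS_API_KEY") {
  nzchar(Sys.getenv(var, unset = ""))
}

if (has_ipums_api_key()) {
  message("IPUMS API key found.")
} else {
  message("No IPUMS API key found; run set_ipums_api_key() first.")
}
```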

## 2. IPUMS USA Metadata Exploration

### 2a. Review the List of Samples

In [None]:
# retrieve the list of samples from the IPUMS USA database
metadata_usa <- get_sample_info("usa")

# view the dimensions of the list of samples
dim(metadata_usa)

In [None]:
# view the first few lines of the list of samples
head(metadata_usa)

Refer to the [Descriptions of IPUMS USA Samples](https://usa.ipums.org/usa/sampdesc.shtml) page on the IPUMS USA website for details on each sample.

In [None]:
# filter the list of samples by survey and year
metadata_usa %>% filter(str_detect(description, "ACS"),      # filter description by survey
                        str_detect(description, "2010"))     # filter description by year
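
To see how the two conditions combine without calling the API, here is a minimal sketch on a toy table. The rows of `toy_samples` are illustrative stand-ins for the output of *get_sample_info*, not actual API output. Note that matching on "2010" also keeps multi-year samples whose description mentions 2010.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the table returned by get_sample_info("usa")
# (illustrative rows only; the real table is much longer)
toy_samples <- data.frame(
  name = c("us2010a", "us2010c", "us2011a", "us2000a"),
  description = c("2010 ACS", "2006-2010, ACS 5-year", "2011 ACS", "2000 5%"),
  stringsAsFactors = FALSE
)

# A row is kept only when both conditions are TRUE
acs_2010 <- toy_samples %>%
  filter(str_detect(description, "ACS"),
         str_detect(description, "2010"))

acs_2010$name
```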

## 3. IPUMS USA Data Extraction Specification and Submission

Once we know which sample and variables we want, we can define our data extract using the *define_extract_micro* function from the *ipumsr* package.  The parameters this function requires are described in Section 3a below, after the sample and variable selection.

### 3a. Define the Data Extract

For this example we will use the 2010 ACS sample.

**Variable Selection**
* Public Use Microdata Area (PUMA)
* Sex (SEX)
* Age (AGE)
* Race (RACE)
* Educational Attainment (EDUC)
* Total Personal Income (INCTOT)

By default, the data extract will also include a number of IPUMS preselected variables.  These variables include metadata such as identification codes and survey weights.  We will explore and list the preselected variables after completing the data extraction.

The *define_extract_micro* function takes the following parameters:

* **collection** Code for the IPUMS collection represented by this extract request.  In our case we are downloading from IPUMS USA so we use the code "usa".
* **description** Text description of the extract.
* **samples** Vector of samples to include in the extract request.  In our case we are downloading the 2010 ACS data (us2010a).
* **variables** Vector of variable names or a list of detailed variable specifications to include in the extract request.

In [None]:
# set up the data extraction definition
extract_definition <- define_extract_micro(collection = "usa",
                                           description = "IPUMS USA Data Extraction",
                                           samples = c("us2010a"),
                                           variables = c("PUMA", "SEX", "AGE", "RACE", "EDUC", "INCTOT"))

In [None]:
# review the extraction definition
extract_definition
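
The **variables** parameter can also take a list of detailed variable specifications instead of plain names. The sketch below assumes a version of *ipumsr* that provides `var_spec()` with a `case_selections` argument, and it assumes that code 2 of SEX means "Female"; check both against the package documentation and the IPUMS USA codebook before relying on them.

```r
library(ipumsr)

# Sketch: restrict the extract to records where SEX == 2
# (assumed here to be the code for "Female"; verify in the IPUMS USA codebook)
extract_females <- define_extract_micro(
  collection = "usa",
  description = "IPUMS USA extract with a case selection on SEX",
  samples = "us2010a",
  variables = list(
    var_spec("SEX", case_selections = "2"),
    var_spec("AGE"),
    var_spec("INCTOT")
  )
)

# Building the definition only creates an R object; nothing is submitted yet
extract_females
```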

### 3b. Submit the Extract Request

In [None]:
# submit extraction request
extract_submitted <- submit_extract(extract_definition)

# wait for completion
extraction_complete <- wait_for_extract(extract_submitted)

# check completion status
extraction_complete$status

# download the extract and store the path to its DDI codebook file
filepath <- download_extract(extract_submitted, overwrite = TRUE)

### 3c. Review the Extract

The data extract download will contain the following two files.

1. A [DDI (Data Documentation Initiative)](https://ddialliance.org) codebook file (file extension .xml) containing metadata and descriptive information for your data.
2. A zipped data (.dat) file (file extension .dat.gz) containing your data.

Read the DDI codebook and the data file into R objects that we can work with.

In [None]:
ddi <- read_ipums_ddi(filepath)
dat <- read_ipums_micro(ddi)
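
The same two-step read can be tried offline with the example extract bundled with *ipumsr*. This is a sketch that assumes a recent *ipumsr* version shipping the file `cps_00157.xml`; it is an IPUMS CPS example rather than IPUMS USA, so the variables differ from those in our extract.

```r
library(ipumsr)

# Locate the example DDI codebook bundled with ipumsr
# (assumes the file name "cps_00157.xml", which may differ across versions)
example_ddi_path <- ipums_example("cps_00157.xml")

# Step 1: read the codebook; step 2: use it to read the data file
example_ddi <- read_ipums_ddi(example_ddi_path)
example_dat <- read_ipums_micro(example_ddi, verbose = FALSE)

# Variable names and labels stored in the codebook
ipums_var_info(example_ddi)[, c("var_name", "var_label")]
```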

In [None]:
dim(dat)

In [None]:
head(dat)

In [None]:
colnames(dat)

**Variable Selection**
* Public Use Microdata Area (PUMA)
* Sex (SEX)
* Age (AGE)
* Race (RACE)
* Educational Attainment (EDUC)
* Total Personal Income (INCTOT)

**Detailed Supplements for Selected Variables**
* Race (detailed) (RACED)
* Education (detailed) (EDUCD)

**IPUMS Preselected Variables**
* Census Year (YEAR)
* IPUMS Sample Identifier (SAMPLE)
* Household Serial Number (SERIAL)
* Original Census Bureau Household Serial Number (CBSERIAL)
* Household Weight (HHWT)
* Household Cluster for Variance Estimation (CLUSTER)
* Household Strata for Variance Estimation (STRATA)
* Group Quarters Status (GQ)
* Person Number in Sample Unit (PERNUM)
* Person Weight (PERWT)
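
The weights in this list matter for estimation: each person record stands for PERWT people in the population. The following is a minimal sketch on made-up records (not real IPUMS data) showing how a weighted estimate differs from an unweighted one.

```r
# Made-up person records; PERWT is the number of people each record represents
toy <- data.frame(
  AGE   = c(30, 50, 40),
  PERWT = c(100, 300, 100)
)

# Unweighted mean treats every record equally
unweighted_age <- mean(toy$AGE)

# Weighted mean uses PERWT so the estimate reflects the population
weighted_age <- weighted.mean(toy$AGE, w = toy$PERWT)

# Estimated population size is the sum of the person weights
population <- sum(toy$PERWT)

c(unweighted = unweighted_age, weighted = weighted_age, population = population)
```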

### 3d. Save the Data

Next let's save a couple versions of our IPUMS ACS data file.

* A *.rds* version of the data.  The **R Data Serialization (RDS)** format will retain metadata for the next time we want to import the file back into R.  One downside to the .rds format is that it is only usable within R.
* A *.csv* version of the data.  The [**Comma-Separated Values (CSV)**](https://en.wikipedia.org/wiki/Comma-separated_values) format is versatile and can be easily accessed in other programs.  However, the CSV file format does not include metadata such as labels for variable levels.

In [None]:
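# save as .rds (keeps IPUMS metadata such as value labels)
saveRDS(dat, "ipums_usa_example.rds")
# save as .csv (values only; row.names = FALSE omits the row-number column)
write.csv(dat, "ipums_usa_example.csv", row.names = FALSE)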
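
The difference between the two formats can be checked with a small round trip. This sketch uses a toy data frame with a `label` attribute standing in for IPUMS metadata, and writes to temporary files so it does not touch the files saved above.

```r
# Toy data frame with a label attribute standing in for IPUMS metadata
toy <- data.frame(AGE = c(25L, 40L))
attr(toy$AGE, "label") <- "Age"

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(toy, rds_path)
write.csv(toy, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

attr(from_rds$AGE, "label")  # the label survives the RDS round trip
attr(from_csv$AGE, "label")  # NULL: the CSV keeps the values only
```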

## 2. IPUMS USA Metadata Exploration

First, let's take a look at the entire list of datasets available from the [IPUMS USA data repository](https://usa.ipums.org/usa).  The USA data available for direct extraction using the IPUMS API include the [American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs) and [Puerto Rico Community Survey (PRCS)](https://www.census.gov/programs-surveys/acs/about/puerto-rico-community-survey.html).

The ACS and PRCS are annual surveys conducted by the U.S. Census Bureau that collect information on a subset of the U.S. population.  The ACS collects data on a variety of topics, including income, poverty, education, marital status, health insurance coverage, disability, occupancy, costs, tenure, and units by type.  It is a more in-depth supplement to the Decennial U.S. Census and in 2005 replaced the long-form version of the Decennial Census survey which was previously conducted every ten years.  Each year the ACS samples over 3.5 million housing units across the United States with a new sample of about 250,000 addresses drawn each month.

ACS and PRCS are available as single-year datasets as well as three- and five-year summaries of the data.  The three- and five-year summary data are often used in lieu of the single-year data as they are less susceptible to anomalies.

The IPUMS Data also includes population counts and samples dating back to 1850 but, for this exercise, we will focus on the ACS sample data which has been the standard USA population sample survey since 2005.  Refer to the [Descriptions of IPUMS Samples](https://usa.ipums.org/usa/sampdesc.shtml) page on the IPUMS USA website for a list of all IPUMS USA data sets and their descriptions.

The *get_sample_info* function from the *ipumsr* package returns a list of all datasets from the specified IPUMS data repository which are available to be downloaded using the IPUMS API.  We will request the list of datasets from the USA (usa) repository and print the full list.

For this exercise, we will work with the 2022 five-year ACS data (us2022c).  In the IPUMS specification, ACS and PRCS multi-year summaries are referred to by the final year in the corresponding time range.  So the 2022 five-year ACS is a summary of the 2018, 2019, 2020, 2021, and 2022 ACS surveys.
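
The naming rule can be expressed as simple arithmetic: the first year covered is the final year minus the number of years, plus one. A minimal sketch, using a hypothetical helper named `acs_years_covered`:

```r
# Hypothetical helper: years covered by a multi-year ACS sample,
# given the final year and the number of years in the summary
acs_years_covered <- function(final_year, n_years) {
  seq(final_year - n_years + 1, final_year)
}

# Years covered by the 2022 five-year ACS
acs_years_covered(2022, 5)
```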

The data extraction process does allow us to download multiple datasets, but for this exercise we will only download the us2022c ACS data.

As mentioned, the ACS includes a wide range of variables on many topics.  Here we will focus on the following selection of demographic variables.

And include a couple geography reference variables.

Review the extraction definition to make sure we have set it up the way we intended.

Everything looks good so we will submit the extraction request, wait for it to complete, and download the resulting data.

We now have a useable version of our dataset stored in *dat*.  Let's take a look at the number of observations and variables in the data.

The 2022 5-year ACS data includes information on 20 variables for 15,721,123 individuals.  This is plausible, since the ACS samples about 3.5 million addresses each year and our dataset corresponds to five years of ACS data.

Let's take a look at the first few lines of the data file.

Notice that this data is in ["tibble"](https://tibble.tidyverse.org) format rather than the more common "data.frame" format you might be used to as an R user.  A tibble can be thought of as a version of a data.frame that includes additional functionality and metadata visibility.  It is also more compatible with the tidyverse packages, including the dplyr package we use in this notebook.

We also appear to have a lot more columns than the set we requested from IPUMS.  The view above truncates the dataset to a subset of the columns for easier viewing.  Let's take a quick look at the list of column names so we can see all the variables included in this dataset.

The extract returned through the IPUMS API includes both the variables we asked for and some additional variables.

We have the demographic variables we requested:

Along with more descriptive supplementary versions of some of our demographic variables:

And the geographic variables we requested:

IPUMS has also included a set of variables which we did not specifically request but which are always included in the ACS data downloads:

At the end of this exercise we have a freshly downloaded dataset from the IPUMS USA repository saved in our workspace.

## Recommended Next Steps
* **Continue with Chapter 2: IPUMS Data Acquisition and Extraction**
  * 2.1: IPUMS USA Data Extraction Using ipumsr
  * 2.2: IPUMS CPS Data Extraction Using ipumsr
  * 2.3: IPUMS International Microdata Extraction Using ipumsr
  * 2.4: IPUMS NHGIS Data Extraction Using ipumsr
  * 2.5: IPUMS Time Use Data Extraction Using ipumsr
  * 2.6: IPUMS Health Surveys Data Extraction Using ipumsr
  * 2.7: Reading IPUMS Global Health Data Extracts Using ipumsr
  * 2.8: Reading IPUMS Higher Education Data Extracts Using ipumsr
* **Move on to Chapter 3: Data Cleaning and Preparation**
  * 3.1: Data Preparation and Transformation with IPUMS ACS

## Quick Code
Don't forget to update the code with your IPUMS API key!
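
The steps from this notebook can be gathered into one function. This is a consolidated sketch, not a definitive implementation: `run_usa_extract` is a hypothetical wrapper, the defaults repeat the sample and variables used above, and running it requires a valid API key and an internet connection.

```r
# Consolidated sketch of the steps in this notebook.
# `run_usa_extract` is a hypothetical wrapper, not an ipumsr function.
run_usa_extract <- function(api_key,
                            samples = "us2010a",
                            variables = c("PUMA", "SEX", "AGE",
                                          "RACE", "EDUC", "INCTOT")) {
  # store the API key for the current session
  ipumsr::set_ipums_api_key(api_key)

  # define, submit, wait for, and download the extract
  extract <- ipumsr::define_extract_micro(
    collection = "usa",
    description = "IPUMS USA Data Extraction",
    samples = samples,
    variables = variables
  )
  submitted <- ipumsr::submit_extract(extract)
  completed <- ipumsr::wait_for_extract(submitted)
  filepath <- ipumsr::download_extract(completed, overwrite = TRUE)

  # read the DDI codebook, then the data it describes
  ddi <- ipumsr::read_ipums_ddi(filepath)
  ipumsr::read_ipums_micro(ddi)
}

# Replace the placeholder with your own key before running
# dat <- run_usa_extract("YOUR_IPUMS_API_KEY")
# saveRDS(dat, "ipums_usa_example.rds")
```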