# IPUMS Current Population Survey (CPS) Data Extraction Using ipumsr
### by [Kate Vavra-Musser](https://vavramusser.github.io) for the [R Spatial Notebook Series](https://vavramusser.github.io/r-spatial)

## Introduction
The [IPUMS CPS](https://cps.ipums.org/cps) database offers harmonized microdata from the [Current Population Survey (CPS)](https://www.census.gov/programs-surveys/cps.html), a key source of information on labor force participation, employment, income, and related demographic characteristics in the United States. It provides detailed, individual-level records that enable the analysis of trends in the labor market and socioeconomic conditions over time. Through harmonization, IPUMS CPS allows data to be seamlessly compared across survey years, despite changes in questionnaire design, geographic classifications, and variable definitions.

**From the [IPUMS CPS Website](https://cps.ipums.org/cps):** IPUMS CPS harmonizes microdata from the monthly U.S. labor force survey, the Current Population Survey (CPS), covering the period 1962 to the present. Data include demographic information, rich employment data, program participation and supplemental data on topics such as fertility, tobacco use, volunteer activities, voter registration, computer and internet use, food security, and more.

#### Data Included in the IPUMS CPS Repository
* Monthly [Current Population Survey (CPS)](https://www.census.gov/programs-surveys/cps.html) data from 1976 to present
* Annual [Social and Economic Supplement (ASEC)](https://www.census.gov/programs-surveys/saipe/guidance/model-input-data/cpsasec.html) data from 1962 to present

### About the CPS and ASEC
The [**Current Population Survey (CPS)**](https://www.census.gov/programs-surveys/cps.html) and [**Annual Social and Economic Supplement (ASEC)**](https://www.census.gov/programs-surveys/saipe/guidance/model-input-data/cpsasec.html) are vital surveys conducted by the U.S. [Census Bureau](https://www.census.gov) and the [Bureau of Labor Statistics](https://www.bls.gov). The CPS is a monthly survey that collects labor force data, including employment, unemployment, and workforce participation, as well as demographic information. It serves as the primary source of data for calculating the U.S. unemployment rate and analyzing labor market trends.

The ASEC is a supplement to the CPS, conducted annually in March. It expands on the core CPS topics by collecting detailed information on income, poverty, health insurance coverage, program participation, and household composition. The ASEC is a key source for evaluating the social and economic well-being of the U.S. population and is widely used for policymaking, academic research, and public understanding of economic conditions. Together, these surveys provide comprehensive insights into the labor market and socioeconomic conditions across the United States.

### Notebook Goals
This notebook introduces the process of extracting [IPUMS CPS](https://cps.ipums.org/cps) data using the [IPUMS API](https://developer.ipums.org/docs/v2/apiprogram) via the [ipumsr R package](https://cran.r-project.org/web/packages/ipumsr/index.html). Users will learn how to define, submit, and download an IPUMS CPS data extract, specifying desired variables, time periods, and geographic units for analysis. By the end of this notebook, users will have the skills to efficiently acquire customized IPUMS CPS datasets and prepare them for spatial and statistical workflows.

### ★ Prerequisites ★
* Complete [Section 1.1 Introduction to IPUMS and the IPUMS API](https://platform.i-guide.io/notebooks/82d3b176-e4e6-4307-8186-318a3fe6c81a)
* Set Up Your [IPUMS Account and API Key](https://account.ipums.org/api_keys)

### Notebook Overview
1. Setup
2. IPUMS CPS Metadata Exploration
3. IPUMS CPS Data Extraction Specification and Submission

## 1. Setup
This section will guide you through the process of installing essential packages and setting your IPUMS API key.

#### Required Packages

[**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) A Grammar of Data Manipulation. This notebook uses the following function from *dplyr*.

* [*filter*](https://rdrr.io/cran/dplyr/man/filter.html) · keep rows that match a condition
* This notebook also uses [*%>%*](https://magrittr.tidyverse.org/reference/pipe.html), referred to as the *pipe* operator.  The *pipe* operator passes the output from one function directly into the next function to create streamlined workflows and is a commonly used component of the [*tidyverse*](https://www.tidyverse.org).

[**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) An R Interface for Downloading, Reading, and Handling IPUMS Data.  This notebook uses the following functions from *ipumsr*.

* [*define_extract_micro*](https://rdrr.io/github/mnpopcenter/ripums/man/define_extract_micro.html) · define an extract request for an IPUMS microdata collection
* [*download_extract*](https://rdrr.io/cran/ipumsr/man/download_extract.html) · download a completed IPUMS data extract
* [*get_sample_info*](https://rdrr.io/cran/ipumsr/man/get_sample_info.html) · list available samples for IPUMS microdata collections
* [*read_ipums_ddi*](https://rdrr.io/cran/ipumsr/man/read_ipums_ddi.html) · read metadata about an IPUMS microdata extract from a DDI codebook (.xml) file
* [*read_ipums_micro*](https://rdrr.io/cran/ipumsr/man/read_ipums_micro.html) · read data from an IPUMS microdata extract
* [*set_ipums_api_key*](https://rdrr.io/cran/ipumsr/man/set_ipums_api_key.html) · set your IPUMS API key
* [*submit_extract*](https://rdrr.io/cran/ipumsr/man/submit_extract.html) · submit an extract request via the IPUMS API
* [*wait_for_extract*](https://rdrr.io/cran/ipumsr/man/wait_for_extract.html) · wait for an extract to finish processing

[**stringr**](https://cran.r-project.org/web/packages/stringr/index.html) Simple, Consistent Wrappers for Common String Operations.  This notebook uses the following function from *stringr*.

* [*str_detect*](https://stringr.tidyverse.org/reference/str_detect.html) · detect the presence or absence of a match

### 1a. Install and Load Required Packages
If you have not already installed the required packages, uncomment and run the code below:

In [None]:
# install.packages(c("dplyr", "ipumsr", "stringr"))
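As an alternative to installing everything unconditionally, the sketch below checks which of the required packages are missing before you install anything. The helper name `missing_packages` is hypothetical and introduced here only for illustration; it relies on base R's `requireNamespace`.

```r
# Return the subset of `pkgs` that is not yet installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("dplyr", "ipumsr", "stringr")
to_install <- missing_packages(required)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install only what is missing
}
```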

Load the packages into your workspace.

In [None]:
library(dplyr)
library(ipumsr)
library(stringr)

### 1b. Set Your IPUMS API Key

Store your [IPUMS API key](https://account.ipums.org/api_keys) in your environment using the following code.

Refer to [Chapter 1.1 Introduction to IPUMS and the IPUMS API](https://platform.i-guide.io/notebooks/82d3b176-e4e6-4307-8186-318a3fe6c81a) for instructions on setting up your IPUMS account and API key.

In [None]:
# prompt for the API key and save it so that ipumsr can find it in future sessions
ipums_api_key <- readline("Please enter your IPUMS API key: ")
set_ipums_api_key(ipums_api_key, save = TRUE, overwrite = TRUE)
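The *ipumsr* API functions look up the key in the `IPUMS_API_KEY` environment variable by default. A minimal sketch for confirming that the key is visible to the current session, without printing it, might look like the following. The helper name `api_key_is_set` is hypothetical.

```r
# TRUE when a non-empty IPUMS_API_KEY is visible to the current R session.
api_key_is_set <- function() {
  nzchar(Sys.getenv("IPUMS_API_KEY"))
}

if (api_key_is_set()) {
  message("IPUMS API key found in the environment.")
} else {
  message("No IPUMS API key found; run set_ipums_api_key() first.")
}
```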

## 2. IPUMS CPS Metadata Exploration

Before submitting an IPUMS data extraction request, it’s essential to ensure the parameters of the extraction definition are set up correctly.  The extraction definition specifies the sample, variables, and other options.

If this is your first time using the IPUMS API in R, or if you are setting up a new data extract for a new project, it is a good idea to start by exploring the available data using the *ipumsr* package.

### 2a. Review the List of Samples

First, let's take a look at the entire list of datasets available from the [IPUMS Current Population Survey (CPS) data repository](https://cps.ipums.org/cps).  The CPS data available for direct extraction using the IPUMS API include the monthly [Current Population Survey (CPS)](https://www.census.gov/programs-surveys/cps.html) from 1976 to present and the annual [CPS Annual Social and Economic Supplement (ASEC)](https://www.census.gov/programs-surveys/saipe/guidance/model-input-data/cpsasec.html) from 1962 to present.

For this step, we will use the [*get_sample_info*](https://rdrr.io/cran/ipumsr/man/get_sample_info.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package.  This function returns a list of all samples from the specified IPUMS data repository which are available to be downloaded using the IPUMS API.  Since we are focusing on IPUMS CPS, we pass *"cps"* to the function.  This code stores the metadata for all available samples in the IPUMS CPS repository in the object *metadata_cps*.

**★ Pro Tip:** You can use the [*get_sample_info*](https://rdrr.io/cran/ipumsr/man/get_sample_info.html) function to retrieve metadata for any of the available IPUMS repositories by changing the database reference code.

In [None]:
# retrieve the list of samples from the IPUMS CPS database
metadata_cps <- get_sample_info("cps")
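Following the Pro Tip above, one way to find the reference code for another repository is the *ipums_data_collections* function from *ipumsr*, which returns a table of IPUMS collections. This is a small sketch rather than part of the main workflow; the assumption here is that the table includes a `code_for_api` column holding the codes to pass to functions such as *get_sample_info*.

```r
library(ipumsr)

# table of IPUMS collections, including the code used to reference each one in the API
collections <- ipums_data_collections()
collections
```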

Let's take a look at the dimensions of the *metadata_cps* object.  This will give us an idea of how many samples are available from the IPUMS CPS repository.

In [None]:
# view the dimensions of the list of samples
dim(metadata_cps)

The results tell us that there are 650 samples available from IPUMS CPS.

Let's take a look at the first few elements in the table of samples.

In [None]:
# view the first few lines of the list of samples
head(metadata_cps)

From this preview, we can see that the IPUMS CPS metadata table has 1) a **name**, corresponding to a sample identification code, and 2) a **description**, providing a short description or label for each sample.  We will need to select a sample and make note of its sample identification code (**name**), which we will use when defining our data extraction.

At first glance, it might be difficult to understand what data is contained in each sample, especially if you are not used to working with CPS data.  Refer to the [Descriptions of IPUMS CPS Samples](https://cps.ipums.org/cps/samples.shtml) page on the IPUMS CPS website for a list of all IPUMS CPS samples and more detailed information on each sample.  IPUMS also provides the [list of sample identification codes](https://cps.ipums.org/cps-action/samples/sample_ids) on their website.

If you already know which sample you want to use, you could browse the samples list until you find the appropriate sample identification code (**name**).  Alternatively, you could use the [*filter*](https://rdrr.io/cran/dplyr/man/filter.html) function from the [**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) package in conjunction with the [*str_detect*](https://stringr.tidyverse.org/reference/str_detect.html) function from the [**stringr**](https://cran.r-project.org/web/packages/stringr/index.html) package to filter the list of samples down to the subset which may be relevant for your project.

For this exercise, we will filter the list of sample metadata *metadata_cps* to only samples which have *2020* in their descriptions.

In [None]:
# filter the list of samples by year
metadata_cps %>% filter(str_detect(description, "2020"))      # filter description by year

The filtering process has returned 13 relevant samples including the 12 monthly CPS samples and the annual ASEC sample.  For this exercise we will use the **2020 ASEC Sample** which is referred to using identification code (*name*) **cps2020_03s**.
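If you want to narrow the list further, conditions can be combined inside *filter*. The sketch below uses a small stand-in table, not real API output: the monthly sample names and all description strings are illustrative assumptions, and only the ASEC code **cps2020_03s** comes from this notebook. In practice you would apply the same filter to *metadata_cps*.

```r
library(dplyr)
library(stringr)

# illustrative stand-in for a few rows of metadata_cps (assumed values)
samples_demo <- tibble(
  name = c("cps2020_02s", "cps2020_03b", "cps2020_03s"),
  description = c("IPUMS-CPS, February 2020",
                  "IPUMS-CPS, March 2020",
                  "IPUMS-CPS, ASEC 2020")
)

# keep only rows whose description mentions both ASEC and 2020
asec_2020 <- samples_demo %>%
  filter(str_detect(description, "ASEC"), str_detect(description, "2020"))

asec_2020$name
```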

## 3. IPUMS CPS Data Extraction Specification and Submission

Once we have reviewed the available samples and decided on the dataset, the next step is to set up a data extraction using the [*define_extract_micro*](https://rdrr.io/github/mnpopcenter/ripums/man/define_extract_micro.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package.  This function requires the following minimum parameters:

* *collection* · the IPUMS data collection for the extract (for this exercise we are downloading from IPUMS CPS so we use the code "cps")
* *description* · text description of the extract
* *samples* · vector of samples to include in the extract; samples should be specified using the sample identification codes
* *variables* · vector of variables to include in the extract

### 3a. Define the Variable List

We already know what we will pass to the function for the *collection* ("cps") and *samples* ("cps2020_03s") parameters.  Next we will need to determine which variables we want.

If you are already familiar with IPUMS CPS data extractions using their web-based data extract platforms, you might already know which variables are available for our selected sample.  If not, the best place to start is by exploring the web-based [**IPUMS CPS Data Extract Platform**](https://cps.ipums.org/cps-action/variables/group) to see what variables are available and identify the appropriate variable codes.  Before searching for variables, be sure to click the **Select Samples** button in the top-left corner of the search platform and select the samples you are planning to use.  Since we are using the 2020 ASEC sample for this example, you should select the 2020 ASEC sample within the search platform.  What variables are available, and the codes used for the variables, may differ based on your selected sample, so it is important to be specific.

For this example we will use the following set of variables from the 2020 ASEC:

**Variable Selection**
* [County FIPS Code (COUNTY)](https://en.wikipedia.org/wiki/Federal_Information_Processing_Standard_state_code)
* Age (AGE)
* Sex (SEX)
* Quit Job or Retired for Health Reasons (QUITSICK)
* Covered by Military Health Insurance in the Last Year (HICHAMP)
* Veteran's Most Recent Period of Service (VETLAST)

By default, the data extraction will include both our selected variables and a set of IPUMS preselected variables.  The preselected variables include metainformation such as identification codes and survey weights.  We will explore and list the preselected variables after completing the data extraction.

### 3b. Define the Data Extract

Now that we know the collection ("cps"), sample ("cps2020_03s"), and list of variables (c("COUNTY", "AGE", "SEX", "QUITSICK", "HICHAMP", "VETLAST")), we are ready to define our data extract request.  In this step we will add a text description of the request, which can be anything and is included to help us differentiate between requests.  For this extract we will use the simple description "IPUMS CPS Data Extraction".

Here we pass all the extraction definition information to the [*define_extract_micro*](https://rdrr.io/github/mnpopcenter/ripums/man/define_extract_micro.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package and store the resulting extraction definition in the object *extract_definition*.

**★ Pro Tip:** You can specify multiple samples in the same data extract by passing all of the sample identification codes in a single character vector.  Be sure that the variables you specify are available for all of the samples!

In [None]:
# set up the data extraction definition
extract_definition <- define_extract_micro(collection = "cps",
                                           description = "IPUMS CPS Data Extraction",
                                           samples = c("cps2020_03s"),
                                           variables = c("COUNTY", "AGE", "SEX", "QUITSICK", "HICHAMP", "VETLAST"))

Let's review the extraction definition information to make sure we have set it up the way we intended.

In [None]:
# review the extraction definition
extract_definition
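To illustrate the Pro Tip above, here is a sketch of a definition that requests two ASEC samples at once. The 2019 code **cps2019_03s** is assumed by analogy with the 2020 code used in this notebook, so confirm it against the samples list before submitting. Defining an extract does not contact the IPUMS API; nothing is requested until the definition is submitted.

```r
library(ipumsr)

# define (but do not submit) an extract covering two ASEC samples
extract_two_years <- define_extract_micro(
  collection = "cps",
  description = "IPUMS CPS ASEC 2019 and 2020",
  samples = c("cps2019_03s", "cps2020_03s"),   # 2019 code is an assumption
  variables = c("AGE", "SEX")
)

extract_two_years
```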

Everything looks good so we will submit the extraction request, wait for it to complete, and download the resulting data.

### 3c. Submit the Extract Request

Now that the extraction definition is set up, we can submit it to the IPUMS API using the [*submit_extract*](https://rdrr.io/cran/ipumsr/man/submit_extract.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package.

For this exercise, after submitting the request we will also use the [*wait_for_extract*](https://rdrr.io/cran/ipumsr/man/wait_for_extract.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package to monitor the status of the request.  This is not a necessary step but it is helpful, especially when submitting large requests.

Finally, once the extract is complete, we can download it using the [*download_extract*](https://rdrr.io/cran/ipumsr/man/download_extract.html) function from the [**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) package and store the path to the downloaded codebook file in the object *filepath*.

In [None]:
# submit extraction request
extract_submitted <- submit_extract(extract_definition)

# wait for completion
extraction_complete <- wait_for_extract(extract_submitted)

# check completion status
extraction_complete$status

# download the extract and store the path to the downloaded codebook file
filepath <- download_extract(extract_submitted, overwrite = TRUE)
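If you would rather not block your session while a large extract is processing, you can check its status on demand instead of waiting. The sketch below wraps the *is_extract_ready* and *download_extract* functions from *ipumsr* in a hypothetical helper, `download_if_ready`, which downloads only when IPUMS reports that the extract is ready. It needs a submitted extract and a valid API key to run.

```r
library(ipumsr)

# Download the extract only if IPUMS reports it as ready; otherwise return NULL.
download_if_ready <- function(extract, download_dir = getwd()) {
  if (!is_extract_ready(extract)) {
    message("Extract is still processing; try again later.")
    return(invisible(NULL))
  }
  download_extract(extract, download_dir = download_dir, overwrite = TRUE)
}

# Usage (requires a submitted extract and a valid API key):
# filepath <- download_if_ready(extract_submitted)
```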

### 3d. Review the Extract

Once we have downloaded the extract, we are ready to review it and transform it to a format we can easily use.  The data extract download will contain the following two files.

1. A [DDI (Data Documentation Initiative)](https://ddialliance.org) codebook file (file extension .xml) containing metadata and descriptive information for the data.
2. A zipped data (.dat) file (file extension .dat.gz) containing the data.

Read the DDI and data files into a format which we can work with in R.  The final *dat* object will contain the data from our extraction in a table format which is easy to use in R.

In [None]:
ddi <- read_ipums_ddi(filepath)
dat <- read_ipums_micro(ddi)

We now have a useable version of our dataset stored in *dat*.  Let's take a look at the number of observations and variables in the data.

In [None]:
dim(dat)

The 2020 ASEC data we downloaded includes information on 16 variables for 157,959 individuals.

Let's take a look at the first few lines of the data.

In [None]:
head(dat)

Notice that this data is in [*tibble*](https://tibble.tidyverse.org) format rather than the more common *data.frame* format you might be used to as an R user.  A tibble can be thought of as a version of a data.frame that includes additional functionality and metadata visibility.  It is also more compatible with the [*tidyverse*](https://www.tidyverse.org) packages, including the [*dplyr*](https://cran.r-project.org/web/packages/dplyr/index.html) package we use in this notebook.

As mentioned above, IPUMS includes a set of preselected variables in data extractions, along with the variables selected by the user.  We only selected 6 variables for the extraction but the resulting download includes 16 variables.  Let's take a look at the list of column names.

In [None]:
colnames(dat)

This list includes the 6 variables we originally selected and 10 additional IPUMS preselected variables which mainly include metadata such as identification codes, weights, and other metainformation.
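One quick way to separate the preselected variables from the ones we requested is to compare the column names against our original variable list. The sketch below hard-codes the 16 column names described in this section so that it runs on its own; the column order is an assumption, and in practice you would use `colnames(dat)` in place of `extract_cols`.

```r
# variables we requested in the extract definition
selected_vars <- c("COUNTY", "AGE", "SEX", "QUITSICK", "HICHAMP", "VETLAST")

# stand-in for colnames(dat); order is assumed
extract_cols <- c("YEAR", "SERIAL", "MONTH", "CPSID", "ASECFLAG", "ASECWTH",
                  "COUNTY", "PERNUM", "CPSIDP", "CPSIDV", "ASECWT",
                  "AGE", "SEX", "QUITSICK", "HICHAMP", "VETLAST")

# everything in the extract that we did not ask for explicitly
preselected_vars <- setdiff(extract_cols, selected_vars)
preselected_vars
```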

**Variable Selection**
* [County FIPS Code (COUNTY)](https://en.wikipedia.org/wiki/Federal_Information_Processing_Standard_state_code)
* Age (AGE)
* Sex (SEX)
* Quit Job or Retired for Health Reasons (QUITSICK)
* Covered by Military Health Insurance in the Last Year (HICHAMP)
* Veteran's Most Recent Period of Service (VETLAST)

**IPUMS Preselected Variables**
* Survey Year (YEAR)
* Household Serial Number (SERIAL)
* Month (MONTH)
* CPSID, Household Record (CPSID)
* Flag for ASEC (ASECFLAG)
* Annual Social and Economic Supplement Household Weight (ASECWTH)
* Person Number in Sample Unit (PERNUM)
* CPSID, Person Record (CPSIDP)
* Validated Longitudinal Identifier (CPSIDV)
* Annual Social and Economic Supplement Weight (ASECWT)
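The labels in the lists above do not need to be looked up by hand. The DDI codebook carries them, and the *ipums_var_info* function from *ipumsr* returns them as a table. The helper below, with the hypothetical name `variable_labels`, is a minimal sketch that keeps only the variable name and label columns.

```r
library(ipumsr)
library(dplyr)

# Return a two-column table of variable names and labels from a DDI object.
variable_labels <- function(ddi) {
  ipums_var_info(ddi) %>%
    select(var_name, var_label)
}

# Usage (assumes `ddi` was created with read_ipums_ddi() as shown earlier):
# variable_labels(ddi)
```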

### 3e. Save the Data

Finally, let's save the data we extracted from IPUMS CPS.  We will save the data in the following two formats:

* A *.rds* version of the data.  The **R Data Serialization (RDS)** format will retain metadata for the next time we want to import the file back into R.  One downside to the .rds format is it is only useable within R.
* A *.csv* version of the data.  The [**Comma-Separated Values (CSV)**](https://en.wikipedia.org/wiki/Comma-separated_values) format is versatile and can be easily accessed in other programs.  However, the CSV file format does not include metadata such as labels for variable levels.

In [None]:
saveRDS(dat, "ipums_cps_example.rds")
write.csv(dat, "ipums_cps_example.csv")
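Because the CSV format drops value labels, one option is to convert labelled columns to factors before writing, so that the CSV stores text labels rather than numeric codes. The sketch below demonstrates the idea on a small stand-in table with illustrative values, using the *haven* package. The assumption is that *haven* is available because it is installed alongside *ipumsr*, and the output file name is hypothetical.

```r
library(dplyr)

# small stand-in for dat: SEX stored as labelled numeric codes (illustrative values)
dat_demo <- tibble(
  AGE = c(34, 51, 27),
  SEX = haven::labelled(c(1, 2, 1), labels = c(Male = 1, Female = 2))
)

# replace every labelled column with a factor built from its value labels
dat_labels <- dat_demo %>%
  mutate(across(where(haven::is.labelled), haven::as_factor))

# write the labelled version to a temporary location
write.csv(dat_labels,
          file.path(tempdir(), "ipums_cps_labels_demo.csv"),
          row.names = FALSE)
```

Variables that are mostly numeric but carry labels for only a few special codes may be better left as numbers, so you may prefer to convert selected columns rather than all labelled columns.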

At the end of this exercise we have a freshly downloaded dataset from the IPUMS CPS repository saved in our working directory.

## Recommended Next Steps
* **Continue with Chapter 2: IPUMS Data Acquisition and Extraction**
  * [2.1 IPUMS USA Data Extraction Using ipumsr](https://platform.i-guide.io/notebooks/ab5cad39-6d00-43d2-bc51-17fd4e6b98f2)
  * [2.3 IPUMS International Microdata Extraction Using ipumsr](https://platform.i-guide.io/notebooks/71bcc1a6-8d43-405d-a8c3-adceaf5b785d)
  * [2.4 IPUMS NHGIS Data Extraction Using ipumsr](https://platform.i-guide.io/notebooks/be08e56e-1c08-458e-a230-263c64d386bc)
  * [2.5 IPUMS Time Use Data Extraction Using ipumsr](https://platform.i-guide.io/datasets/db169417-ceb7-4a98-965c-096873edbf6a)
  * [2.6 IPUMS Health Surveys Data Extraction Using ipumsr](https://platform.i-guide.io/notebooks/4b366bd1-b23f-4f47-9c7f-2a060a9135a5)
* **Move on to Chapter 3: Data Cleaning and Preparation**
  * [3.1 Data Preparation and Transformation with IPUMS USA](https://platform.i-guide.io/notebooks/b4b29b13-d832-495d-8db7-1545a30549f1)

## Quick Code
A clean and simple version of the code included in this notebook (excluding the metadata exploration steps).  **Don't forget to update the code with your IPUMS API key!**