# IPUMS Current Population Survey (CPS) Data Extraction Using ipumsr

## Introduction
The [IPUMS CPS](https://cps.ipums.org/cps) database offers harmonized microdata from the [Current Population Survey (CPS)](https://www.census.gov/programs-surveys/cps.html), a key source of information on labor force participation, employment, income, and related demographic characteristics in the United States. It provides detailed, individual-level records that enable the analysis of trends in the labor market and socioeconomic conditions over time. Through harmonization, IPUMS CPS allows data to be seamlessly compared across survey years, despite changes in questionnaire design, geographic classifications, and variable definitions.

**From the [IPUMS CPS Website](https://cps.ipums.org/cps):** IPUMS CPS harmonizes microdata from the monthly U.S. labor force survey, the Current Population Survey (CPS), covering the period 1962 to the present. Data include demographic information, rich employment data, program participation and supplemental data on topics such as fertility, tobacco use, volunteer activities, voter registration, computer and internet use, food security, and more.

This notebook introduces the process of extracting [IPUMS CPS](https://cps.ipums.org/cps) data using the [IPUMS API](https://developer.ipums.org/docs/v2/apiprogram) via the [ipumsr R package](https://cran.r-project.org/web/packages/ipumsr/index.html). Users will learn how to define, submit, and download an IPUMS CPS data extract, specifying desired variables, time periods, and geographic units for analysis. By the end of this notebook, users will have the skills to efficiently acquire customized IPUMS CPS datasets and prepare them for spatial and statistical workflows.

### ★ Prerequisites ★
* Complete Chapter 1.1: Introduction to IPUMS and the IPUMS API
* Set Up Your [IPUMS Account and API Key](https://account.ipums.org/api_keys)

### Notebook Overview
1. Setup
2. IPUMS CPS Metadata Exploration
3. IPUMS CPS Data Extraction Specification and Submission

## 1. Setup
This section will guide you through the process of installing essential packages and setting your IPUMS API key.

#### Required Packages

[**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) A Grammar of Data Manipulation. This notebook uses the following function from *dplyr*.

* [*filter*](https://rdrr.io/cran/dplyr/man/filter.html) · keep rows that match a condition
* This notebook also uses [*%>%*](https://magrittr.tidyverse.org/reference/pipe.html), referred to as the *pipe* operator, which is used to pass the output from one function directly into the next function for the purpose of creating streamlined workflows.  The *pipe* operator is a commonly used component of the [*tidyverse*](https://www.tidyverse.org).

[**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) An R Interface for Downloading, Reading, and Handling IPUMS Data.  This notebook uses the following functions from *ipumsr*.

* [*define_extract_micro*](https://rdrr.io/github/mnpopcenter/ripums/man/define_extract_micro.html) · define an extract request for an IPUMS microdata collection
* [*download_extract*](https://rdrr.io/cran/ipumsr/man/download_extract.html) · download a completed IPUMS data extract
* [*get_sample_info*](https://rdrr.io/cran/ipumsr/man/get_sample_info.html) · list available samples for IPUMS microdata collections
* [*read_ipums_ddi*](https://rdrr.io/cran/ipumsr/man/read_ipums_ddi.html) · read metadata about an IPUMS microdata extract from a DDI codebook (.xml) file
* [*read_ipums_micro*](https://rdrr.io/cran/ipumsr/man/read_ipums_micro.html) · read data from an IPUMS microdata extract
* [*set_ipums_api_key*](https://rdrr.io/cran/ipumsr/man/set_ipums_api_key.html) · set your IPUMS API key
* [*submit_extract*](https://rdrr.io/cran/ipumsr/man/submit_extract.html) · submit an extract request via the IPUMS API
* [*wait_for_extract*](https://rdrr.io/cran/ipumsr/man/wait_for_extract.html) · wait for an extract to finish processing

[**stringr**](https://cran.r-project.org/web/packages/stringr/index.html) Simple, Consistent Wrappers for Common String Operations.  This notebook uses the following function from *stringr*.

* [*str_detect*](https://stringr.tidyverse.org/reference/str_detect.html) · detect the presence or absence of a match

### 1a. Install and Load Required Packages
If you have not already installed the required packages, uncomment and run the code below:

In [55]:
# install.packages(c("dplyr", "ipumsr", "stringr"))
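
As an alternative to uncommenting the line above, the following is a minimal sketch that checks which of the required packages are missing before installing anything. The helper name `missing_packages` is hypothetical (it is not part of any package), and the install call is left commented out so that nothing is installed without your say-so.

```r
# Packages this notebook relies on
required_packages <- c("dplyr", "ipumsr", "stringr")

# Hypothetical helper: return the subset of `pkgs` that is not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required_packages)

if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install the missing packages
} else {
  message("All required packages are already installed.")
}
```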

Load the packages into your workspace.

In [17]:
library(dplyr)
library(ipumsr)
library(stringr)


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




### 1b. Set Your IPUMS API Key

Store your [IPUMS API key](https://account.ipums.org/api_keys) in your environment using the following code.

Refer to *Chapter 1.1: Introduction to IPUMS and the IPUMS API* for instructions on setting up your IPUMS account and API key.

In [11]:
# prompt for the API key, then store it in .Renviron for future sessions
ipums_api_key <- readline("Please enter your IPUMS API key: ")
set_ipums_api_key(ipums_api_key, save = TRUE, overwrite = TRUE)

Please enter your IPUMS API key:  [API key redacted]


Existing .Renviron file copied to C:\Users\vavra\Documents/.Renviron_backup for backup purposes.

The environment variable IPUMS_API_KEY has been set and saved for future sessions.
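
To confirm that the key is available without printing it, one option is to check the `IPUMS_API_KEY` environment variable named in the message above. This is a minimal base R sketch; the helper `has_env_value` is a hypothetical name introduced here for illustration.

```r
# Hypothetical helper: TRUE when an environment variable holds a non-empty value
has_env_value <- function(name) {
  nzchar(Sys.getenv(name, unset = ""))
}

if (has_env_value("IPUMS_API_KEY")) {
  message("IPUMS_API_KEY is set for this session.")
} else {
  message("IPUMS_API_KEY is not set; run set_ipums_api_key() first.")
}
```

Checking for the key this way avoids echoing the key itself into notebook output, which matters if the notebook is later shared.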



## 2. IPUMS CPS Metadata Exploration

### 2a. Review the List of Samples

In [12]:
# retrieve the list of samples from the IPUMS CPS database
metadata_cps <- get_sample_info("cps")

# view the dimensions of the list of samples
dim(metadata_cps)

In [14]:
# view the first few lines of the list of samples
head(metadata_cps)

name,description
<chr>,<chr>
cps1962_03s,"IPUMS-CPS, ASEC 1962"
cps1963_03s,"IPUMS-CPS, ASEC 1963"
cps1964_03s,"IPUMS-CPS, ASEC 1964"
cps1965_03s,"IPUMS-CPS, ASEC 1965"
cps1966_03s,"IPUMS-CPS, ASEC 1966"
cps1967_03s,"IPUMS-CPS, ASEC 1967"


For details on each sample, refer to the [Descriptions of IPUMS CPS Samples](https://cps.ipums.org/cps/samples.shtml) page on the IPUMS CPS website.

In [18]:
# filter the list of samples by year
metadata_cps %>% filter(str_detect(description, "2020"))      # filter description by year

name,description
<chr>,<chr>
cps2020_01s,"IPUMS-CPS, January 2020"
cps2020_02s,"IPUMS-CPS, February 2020"
cps2020_03b,"IPUMS-CPS, March 2020"
cps2020_04b,"IPUMS-CPS, April 2020"
cps2020_05b,"IPUMS-CPS, May 2020"
cps2020_06s,"IPUMS-CPS, June 2020"
cps2020_07b,"IPUMS-CPS, July 2020"
cps2020_08s,"IPUMS-CPS, August 2020"
cps2020_03s,"IPUMS-CPS, ASEC 2020"
cps2020_09b,"IPUMS-CPS, September 2020"


## 3. IPUMS CPS Data Extraction Specification and Submission

Once we know which samples and variables we want, we can define our data extract using the *define_extract_micro* function from the *ipumsr* package.  The parameters this function requires are described in section 3a.

### 3a. Define the Data Extract

For this example we will use the 2020 Annual Social and Economic Supplement (ASEC) sample (cps2020_03s).

**Variable Selection**
* County (FIPS Code) (COUNTY)
* Age (AGE)
* Sex (SEX)
* Quit Job or Retired for Health Reasons (QUITSICK)
* Covered by Military Health Insurance in the Last Year (HICHAMP)
* Veteran's Most Recent Period of Service (VETLAST)

By default, the data extract will also include a number of IPUMS preselected variables.  These variables include metainformation such as identification codes and survey weights.  We will explore and list the preselected variables after completing the data extraction.

The *define_extract_micro* function takes the following parameters:

* **collection** Code for the IPUMS collection represented by this extract request.  In our case we are downloading from IPUMS CPS so we use the code "cps".
* **description** Description of the extract.
* **samples** Vector of samples to include in the extract request.  In our case we are downloading the 2020 ASEC data (cps2020_03s).
* **variables** Vector of variable names or a list of detailed variable specifications to include in the extract request.

In [46]:
# set up the data extraction definition
extract_definition <- define_extract_micro(collection = "cps",
                                           description = "IPUMS CPS Data Extraction",
                                           samples = c("cps2020_03s"),
                                           variables = c("COUNTY", "AGE", "SEX", "QUITSICK", "HICHAMP", "VETLAST"))

In [47]:
# review the extraction definition
extract_definition
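
The **variables** parameter also accepts a list of detailed variable specifications. The following is a hedged sketch using the *var_spec* function from *ipumsr* to restrict the extract to records where SEX equals code 2 (Female in the IPUMS CPS codebook). It assumes a recent version of *ipumsr* that provides *var_spec*; the object name `extract_detailed` and the description text are illustrative only. Defining an extract does not contact the API, so nothing is submitted here.

```r
library(ipumsr)

# Define (but do not submit) an extract with a case selection on SEX
extract_detailed <- define_extract_micro(
  collection = "cps",
  description = "IPUMS CPS extract with a case selection on SEX",
  samples = "cps2020_03s",
  variables = list(
    var_spec("AGE"),
    var_spec("SEX", case_selections = "2")  # "2" = Female
  )
)

# Review the definition before deciding whether to submit it
extract_detailed
```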

### 3b. Submit the Extract Request

In [48]:
# submit extraction request
extract_submitted <- submit_extract(extract_definition)

# wait for completion
extraction_complete <- wait_for_extract(extract_submitted)

# check completion status
extraction_complete$status

# get the extract filepath
filepath <- download_extract(extract_submitted, overwrite = TRUE)

Successfully submitted IPUMS CPS extract number 1

Checking extract status...

Waiting 10 seconds...

Checking extract status...

IPUMS CPS extract 1 is ready to download.






DDI codebook file saved to C:/Users/vavra/Dropbox/R Spatial/r-spatial/cps_00001.xml
Data file saved to C:/Users/vavra/Dropbox/R Spatial/r-spatial/cps_00001.dat.gz
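
The "Checking extract status... Waiting 10 seconds..." messages above reflect a polling pattern: check, wait, and check again until the extract is ready. The sketch below illustrates that general pattern in base R. It is not the *ipumsr* implementation; `poll_until_ready`, its arguments, and the fake status check are all hypothetical names used for illustration.

```r
# Hypothetical helper: call `is_ready()` repeatedly, doubling the wait
# between checks (up to `max_delay`) until it returns TRUE or `timeout`
# seconds of waiting have accumulated
poll_until_ready <- function(is_ready, initial_delay = 10,
                             max_delay = 300, timeout = 3600) {
  waited <- 0
  delay <- initial_delay
  repeat {
    if (is_ready()) return(TRUE)
    if (waited >= timeout) return(FALSE)
    Sys.sleep(delay)
    waited <- waited + delay
    delay <- min(delay * 2, max_delay)
  }
}

# Stand-in for a real status check: reports "ready" on the third call
attempts <- 0
fake_is_ready <- function() {
  attempts <<- attempts + 1
  attempts >= 3
}

result <- poll_until_ready(fake_is_ready, initial_delay = 0.01, max_delay = 0.05)
result
```

In practice *wait_for_extract* handles this loop for you; the sketch is only meant to show why the output alternates between checking and waiting.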



### 3c. Review the Extract

The data extract download will contain the following two files.

1. A [DDI (Data Documentation Initiative)](https://ddialliance.org) codebook file (file extension .xml) containing metadata and descriptive information for your data.
2. A gzip-compressed data (.dat) file (file extension .dat.gz) containing your data.

Read the DDI and data files into a format that we can work with in R.

In [49]:
ddi <- read_ipums_ddi(filepath)
dat <- read_ipums_micro(ddi)

Use of data from IPUMS CPS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.



In [50]:
dim(dat)

In [51]:
head(dat)

YEAR,SERIAL,MONTH,CPSID,ASECFLAG,ASECWTH,COUNTY,PERNUM,CPSIDP,CPSIDV,ASECWT,AGE,SEX,QUITSICK,HICHAMP,VETLAST
<dbl>,<dbl>,<int+lbl>,<dbl>,<int+lbl>,<dbl>,<dbl+lbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int+lbl>,<int+lbl>,<int+lbl>,<int+lbl>,<int+lbl>
2020,1,3,20190300000000.0,1,1560.3756,0,1,20190300000000.0,201903000000000.0,1560.3756,63,2,1,1,0
2020,1,3,20190300000000.0,1,1560.3756,0,2,20190300000000.0,201903000000000.0,1560.3756,67,1,1,1,0
2020,2,3,20181200000000.0,1,986.5948,0,1,20181200000000.0,201812000000000.0,986.5948,64,1,1,2,11
2020,2,3,20181200000000.0,1,986.5948,0,2,20181200000000.0,201812000000000.0,986.5948,71,2,1,2,0
2020,3,3,20190200000000.0,1,1519.0704,0,1,20190200000000.0,201902000000000.0,1519.0704,54,2,1,1,0
2020,4,3,20190300000000.0,1,1423.5779,0,1,20190300000000.0,201903000000000.0,1423.5779,74,2,1,1,0


In [52]:
colnames(dat)

**Variable Selection**
* County (FIPS Code) (COUNTY)
* Age (AGE)
* Sex (SEX)
* Quit Job or Retired for Health Reasons (QUITSICK)
* Covered by Military Health Insurance in the Last Year (HICHAMP)
* Veteran's Most Recent Period of Service (VETLAST)

**IPUMS Preselected Variables**
* Survey Year (YEAR)
* Household Serial Number (SERIAL)
* Month (MONTH)
* CPSID, Household Record (CPSID)
* Flag for ASEC (ASECFLAG)
* Annual Social and Economic Supplement Household Weight (ASECWTH)
* Person Number in Sample Unit (PERNUM)
* CPSID, Person Record (CPSIDP)
* Validated Longitudinal Identifier (CPSIDV)
* Annual Social and Economic Supplement Weight (ASECWT)

### 3d. Save the Data

Next let's save two versions of our IPUMS CPS data file.

* A *.rds* version of the data.  The **R Data Serialization (RDS)** format will retain metadata for the next time we want to import the file back into R.  One downside to the .rds format is that it is only usable within R.
* A *.csv* version of the data.  The [**Comma-Separated Values (CSV)**](https://en.wikipedia.org/wiki/Comma-separated_values) format is versatile and can be easily accessed in other programs.  However, the CSV file format does not include metadata such as labels for variable levels.

In [54]:
saveRDS(dat, "ipums_cps_example.rds")
write.csv(dat, "ipums_cps_example.csv")
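
The difference between the two formats can be seen with a small round trip. The sketch below uses a toy data frame with a hand-attached `label` attribute as a stand-in for the richer metadata carried by the extract, and writes to temporary files so that it does not touch the files saved above. All object names are illustrative.

```r
# Toy data: one coded column with a variable label attached as an attribute
example_data <- data.frame(SEX = c(2L, 1L, 1L))
attr(example_data$SEX, "label") <- "Sex"

# Write both formats to temporary files
rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")
saveRDS(example_data, rds_path)
write.csv(example_data, csv_path, row.names = FALSE)

# Read both back in
from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

attr(from_rds$SEX, "label")  # the label survives the RDS round trip
attr(from_csv$SEX, "label")  # NULL: the CSV keeps only the values
```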

## Recommended Next Steps
* **Continue with Chapter 2: IPUMS Data Acquisition and Extraction**
  * 2.1: IPUMS USA Data Extraction Using ipumsr
  * 2.3: IPUMS International Microdata Extraction Using ipumsr
  * 2.4: IPUMS NHGIS Data Extraction Using ipumsr
  * 2.5: IPUMS Time Use Data Extraction Using ipumsr
  * 2.6: IPUMS Health Surveys Data Extraction Using ipumsr
  * 2.7: Reading IPUMS Global Health Data Extracts Using ipumsr
  * 2.8: Reading IPUMS Higher Education Data Extracts Using ipumsr

## Quick Code
Don't forget to update the code with your IPUMS API key!