# IPUMS USA Data Extraction Using ipumsr

## Introduction
This notebook guides you through exploring, selecting, and extracting population data from [IPUMS USA](https://usa.ipums.org/usa) using the [R ipumsr package](https://cran.r-project.org/web/packages/ipumsr/index.html).

The IPUMS USA data repository includes data from the [American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs) and [Puerto Rico Community Survey (PRCS)](https://www.census.gov/programs-surveys/acs/about/puerto-rico-community-survey.html).

By working through this notebook, you will learn how to define an extraction of population data from the IPUMS USA repository and download the relevant data for analysis.

#### Overview
This notebook includes the following sections:

1. Setup
2. Metadata Exploration
3. Extraction Specification and Submission

## 1. Setup

Before running this notebook, you will need to install and load the *ipumsr* package into your R environment:

[**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) A package specifically designed to interact with IPUMS datasets, including NHGIS. It allows users to define and submit data extraction requests, download data, and read it directly into R for analysis.  This notebook uses the following functions from *ipumsr*:

* *set_ipums_api_key()* for setting your IPUMS API key
* *get_sample_info()* for retrieving sample identification codes and descriptions for IPUMS microdata collections
* *define_extract_micro()* for defining the parameters of an IPUMS microdata extract request to be submitted via the IPUMS API
* *submit_extract()* for submitting an extract request via the IPUMS API and returning an *ipums_extract* object
* *wait_for_extract()* for waiting until an extract has finished processing
* *download_extract()* for downloading an extract's data files
* *read_ipums_ddi()* for reading metadata about an IPUMS microdata extract from a DDI codebook (.xml) file
* *read_ipums_micro()* for reading data from an IPUMS microdata extract

If you are working in the I-GUIDE environment, the *ipumsr* package should already be installed.  However, you will still need to load the package into your workspace using the base R *library* function.

In [1]:
library(ipumsr)

Run the following code to enter your [IPUMS API key](https://account.ipums.org/api_keys).

In [4]:
# prompt for the key, then save it to .Renviron for future sessions
my_ipums_api_key <- readline("Please enter your IPUMS API key: ")
set_ipums_api_key(my_ipums_api_key, save = TRUE, overwrite = TRUE)

Please enter your IPUMS API key:  [API key redacted]


Existing .Renviron file copied to /home/jovyan/.Renviron_backup for backup purposes.

The environment variable IPUMS_API_KEY has been set and saved for future sessions.
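
Because the key is saved as the environment variable `IPUMS_API_KEY` (per the message above), a later session can check for it before prompting again, which also avoids echoing the key into the notebook output. The following is a minimal base R sketch; the helper name `api_key_is_set` is made up for this illustration and is not part of *ipumsr*.

```r
# Hypothetical helper (not an ipumsr function): TRUE when the named
# environment variable holds a non-empty value.
api_key_is_set <- function(var = "IPUMS_API_KEY") {
  nzchar(Sys.getenv(var, unset = ""))
}

if (api_key_is_set()) {
  message("IPUMS_API_KEY found; no need to enter it again.")
} else {
  message("IPUMS_API_KEY not set; run set_ipums_api_key() first.")
}
```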



## 2. Metadata Exploration

Next, let's take a look at the entire set of available datasets within the USA data repository.  The USA data available for direct extraction using the IPUMS API primarily focus on the [American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs) and [Puerto Rico Community Survey (PRCS)](https://www.census.gov/programs-surveys/acs/about/puerto-rico-community-survey.html).

The ACS and PRCS are annual surveys conducted by the U.S. Census Bureau that collect information on a sample of the U.S. population.  The ACS collects data on a variety of topics, including income, poverty, education, marital status, health insurance coverage, disability, and housing characteristics such as occupancy, costs, tenure, and units by type.  It is a more in-depth supplement to the Decennial U.S. Census and in 2005 replaced the long-form version of the Decennial Census survey, which was previously conducted every ten years.  Each year the ACS samples over 3.5 million housing units across the United States, with a new sample of about 250,000 addresses drawn each month.

ACS and PRCS are available as single-year datasets as well as three- and five-year summaries of the data.  The three- and five-year summary data are often used in lieu of the single-year data as they are less susceptible to anomalies.

IPUMS USA also includes population counts and samples dating back to 1850, but for this exercise we will focus on the ACS sample data, which has been the standard USA population sample survey since 2005.  Refer to the [Descriptions of IPUMS Samples](https://usa.ipums.org/usa/sampdesc.shtml) page on the IPUMS USA website for a list of all IPUMS USA datasets and their descriptions.

The *get_sample_info* function from the ipumsr package returns a list of all datasets in the specified IPUMS data repository that are available for download through the IPUMS API.  We will request the list of datasets from the USA (usa) repository and print the full list.

In [9]:
# retrieve the sample list, then print every row (no pipe operator needed)
metadata <- get_sample_info("usa")
print(metadata, n = Inf)

# A tibble: 146 × 2
    name    description
    <chr>   <chr>
  1 us1850a 1850 1%
  2 us1850c 1850 100% sample (Revised November 2023)
  3 us1860a 1860 1%
  4 us1860b 1860 1% sample with black oversample
  5 us1860c 1860 100% sample (Revised November 2023)
  6 us1870a 1870 1%
  7 us1870b 1870 1% sample with black oversample
  8 us1870c 1870 100% sample (Revised November 2023)
  9 us1880a 1880 1%
 10 us1880d 1880 10%
 11 us1880e 1880 100% database (Revised November 2023)
 12 us1900k 1900 1%
 13 us1900j 1900 5%

For this exercise, we will work with the 2022 five-year ACS data (us2022c).  In the IPUMS specification, ACS and PRCS multi-year summaries are referred to by the final year in the corresponding time range, so the 2022 five-year ACS is a summary of the 2018, 2019, 2020, 2021, and 2022 ACS surveys.
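
The naming rule above (a multi-year sample is labeled by its final year) can be written as a one-line calculation. This is only an illustrative sketch; `acs_survey_years` is a hypothetical helper name, not an *ipumsr* function.

```r
# Hypothetical helper: survey years pooled in a multi-year ACS sample that is
# named after its final year.
acs_survey_years <- function(final_year, span = 5) {
  seq(final_year - span + 1, final_year)
}

acs_survey_years(2022)
# 2018 2019 2020 2021 2022
```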

The data extraction process does allow us to download multiple datasets, but for this exercise we will only download the us2022c ACS data.
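
To locate a sample code such as us2022c in a listing of well over a hundred rows, one option is to filter the table on its description column. The sketch below uses a small stand-in data frame so that it runs without an API call: the first two rows are copied from the listing above, while the description given for us2022c is a placeholder and not the real IPUMS wording. The helper name `find_samples` is also hypothetical.

```r
# Stand-in for the table returned by get_sample_info("usa"). The us2022c
# description is a placeholder, not the actual IPUMS text.
samples <- data.frame(
  name = c("us1850a", "us1860b", "us2022c"),
  description = c(
    "1850 1%",
    "1860 1% sample with black oversample",
    "placeholder: 2022 ACS 5-year"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose description matches a pattern.
find_samples <- function(samples, pattern) {
  samples[grepl(pattern, samples$description, ignore.case = TRUE), ]
}

find_samples(samples, "5-year")
```

With the real listing, the same call would be made on the object returned by `get_sample_info("usa")`, using whatever wording appears in its descriptions.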

As mentioned, the ACS includes a wide range of variables on many topics.  Here we will focus on the following selection of demographic variables.

1. sex (SEX)
2. age (AGE)
3. race (RACE)
4. educational attainment (EDUC)
5. total income (INCTOT)

We will also include a couple of geographic reference variables:

6. state FIPS code (STATEFIP)
7. county FIPS code (COUNTYFIP)

## 3. Extraction Specification and Submission

Once we know the dataset and variable selection we want, we can define our data extraction using the *define_extract_micro* function from the *ipumsr* package.  This function requires the following parameters:

* **collection** Code for the IPUMS collection represented by this extract request.  In our case we are downloading from IPUMS USA, so we use the code "usa".  Other microdata collections include CPS (cps), International (ipumsi), Time Use (atus, ahtus, or mtus), and Health Surveys (nhis or meps).
* **description** Description of the extract.
* **samples** Vector of samples to include in the extract request.  In our case we are downloading the ACS 2022 5-year summary data (us2022c).
* **variables** Vector of variable names or a list of detailed variable specifications to include in the extract request.

For additional information on *define_extract_micro* and other ipumsr functions, refer to [the CRAN ipumsr reference manual](https://cran.r-project.org/web/packages/ipumsr/ipumsr.pdf).

In [11]:
extract_definition <- define_extract_micro(
  collection = "usa",
  description = "Example ACS extract",
  samples = c("us2022c"),
  variables = c("STATEFIP", "COUNTYFIP", "SEX", "AGE", "RACE", "EDUC", "INCTOT")
)
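
As noted in the parameter list, `variables` can also take detailed variable specifications instead of plain names. The sketch below assumes a recent *ipumsr* release that provides `var_spec()` with a `case_selections` argument, and assumes that "17" (the FIPS code for Illinois) is an accepted case-selection value for STATEFIP; check the package documentation before relying on either. It only builds a definition locally and submits nothing.

```r
library(ipumsr)

# Restrict the extract to records with STATEFIP code 17 (Illinois).
# Assumption: var_spec() and case_selections behave as described in recent
# ipumsr documentation.
extract_illinois <- define_extract_micro(
  collection = "usa",
  description = "Example ACS extract, Illinois only",
  samples = "us2022c",
  variables = list(
    var_spec("STATEFIP", case_selections = "17"),
    var_spec("COUNTYFIP"),
    var_spec("SEX"),
    var_spec("AGE"),
    var_spec("RACE"),
    var_spec("EDUC"),
    var_spec("INCTOT")
  )
)

extract_illinois
```

Selecting cases at extraction time keeps the download smaller than filtering the full national file afterwards. The rest of this notebook continues with the unrestricted `extract_definition`.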

Review the extraction definition to make sure we have set it up the way we intended.

In [12]:
extract_definition

Everything looks good so we will submit the extraction request, wait for it to complete, and download the resulting data.

In [13]:
# submit the extract request
extract_submitted <- submit_extract(extract_definition)

# wait for completion
extraction_complete <- wait_for_extract(extract_submitted)

# check completion status
extraction_complete$status

# download the files and keep the path to the DDI codebook
filepath <- download_extract(extraction_complete, overwrite = TRUE)

Successfully submitted IPUMS USA extract number 17

Checking extract status...

Waiting 10 seconds...

Checking extract status...

Waiting 20 seconds...

Checking extract status...

Waiting 30 seconds...

Checking extract status...

Waiting 40 seconds...

Checking extract status...

Waiting 50 seconds...

Checking extract status...

IPUMS USA extract 17 is ready to download.





DDI codebook file saved to /home/jovyan/pipelines/1 - Data Acquisition and Extraction /usa_00017.xml
Data file saved to /home/jovyan/pipelines/1 - Data Acquisition and Extraction /usa_00017.dat.gz



Your data extraction download will contain the following two files.

1. A [DDI (Data Documentation Initiative)](https://ddialliance.org) codebook file (file extension .xml) containing metadata and descriptive information for your data.
2. A gzip-compressed data (.dat) file (file extension .dat.gz) containing your data.
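
Before reading the extract, it can be worth confirming that both files are present. The base R sketch below assumes the two files share a base name, as they do in the download messages above (usa_00017.xml and usa_00017.dat.gz); `data_path_from_ddi` is a hypothetical helper name, not an *ipumsr* function.

```r
# Hypothetical helper: given the DDI (.xml) path, derive the expected path of
# the compressed data file, assuming both share the same base name.
data_path_from_ddi <- function(ddi_path) {
  sub("\\.xml$", ".dat.gz", ddi_path)
}

ddi_path <- "usa_00017.xml"   # file name taken from the download messages
data_path <- data_path_from_ddi(ddi_path)

# Report whether each file is present in the working directory.
file.exists(c(ddi_path, data_path))
```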

We need to read the DDI and data files into a format that we can work with in R.

In [14]:
ddi <- read_ipums_ddi(filepath)
dat <- read_ipums_micro(ddi)

Use of data from IPUMS USA is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.



We now have a usable version of our dataset stored in *dat*.  Let's take a look at the number of observations and variables in the data.

In [15]:
dim(dat)

The 2022 5-year ACS data includes information on 20 variables for 15,721,123 individuals.  This is the right order of magnitude: as noted earlier, the ACS samples over 3.5 million housing units each year, and our dataset pools five years of ACS data.
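
A quick arithmetic check of the row count quoted above: dividing it by the five pooled survey years gives the average number of person records per year.

```r
n_records <- 15721123   # person records reported by dim(dat)
n_years <- 5            # survey years pooled in the five-year sample

records_per_year <- n_records / n_years
format(round(records_per_year), big.mark = ",")
# "3,144,225"
```

That is roughly 3.1 million person records per survey year.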

Let's take a look at the first few lines of the data file.

In [16]:
head(dat)

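| YEAR | MULTYEAR | SAMPLE | SERIAL | CBSERIAL | HHWT | CLUSTER | STATEFIP | COUNTYFIP | STRATA | GQ | PERNUM | PERWT | SEX | AGE | RACE | RACED | EDUC | EDUCD | INCTOT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<int>` | `<dbl>` | `<int+lbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int+lbl>` | `<dbl+lbl>` | `<dbl>` | `<int+lbl>` | `<dbl>` | `<dbl>` | `<int+lbl>` | `<int+lbl>` | `<int+lbl>` | `<int+lbl>` | `<int+lbl>` | `<int+lbl>` | `<dbl+lbl>` |
| 2022 | 2018 | 202203 | 1 | 2018010000000.0 | 17 | 2022000000000.0 | 1 | 0 | 160001 | 4 | 1 | 17 | 2 | 19 | 1 | 100 | 6 | 65 | -1754 |
| 2022 | 2018 | 202203 | 2 | 2018010000000.0 | 17 | 2022000000000.0 | 1 | 81 | 190001 | 4 | 1 | 17 | 2 | 18 | 2 | 200 | 6 | 65 | 1870 |
| 2022 | 2018 | 202203 | 3 | 2018010000000.0 | 24 | 2022000000000.0 | 1 | 0 | 200001 | 3 | 1 | 24 | 1 | 53 | 1 | 100 | 6 | 64 | 11691 |
| 2022 | 2018 | 202203 | 4 | 2018010000000.0 | 8 | 2022000000000.0 | 1 | 0 | 240001 | 3 | 1 | 8 | 1 | 28 | 1 | 100 | 7 | 71 | 0 |
| 2022 | 2018 | 202203 | 5 | 2018010000000.0 | 3 | 2022000000000.0 | 1 | 97 | 270101 | 3 | 1 | 3 | 2 | 25 | 1 | 100 | 3 | 30 | 0 |
| 2022 | 2018 | 202203 | 6 | 2018010000000.0 | 4 | 2022000000000.0 | 1 | 0 | 240001 | 3 | 1 | 4 | 2 | 30 | 1 | 100 | 6 | 63 | 0 |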


Notice that this data is in ["tibble"](https://tibble.tidyverse.org) format rather than the more common "data.frame" format you might be used to as an R user.  A tibble can be thought of as a version of a data.frame with a more informative printed display, such as the column types shown above.  It is also more compatible with tidyverse packages such as dplyr.
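
Columns typed `<int+lbl>` in the preview store numeric codes with value labels attached. To illustrate the idea without the real data, the base R sketch below maps the six SEX codes from the preview to labels, assuming the usual IPUMS coding of 1 = Male and 2 = Female. With the actual extract, the `as_factor()` function that *ipumsr* makes available is intended for this conversion and reads the labels from the data itself; it is not exercised here.

```r
sex_codes <- c(2, 2, 1, 1, 2, 2)   # SEX column from the head(dat) output

# Assumed coding: 1 = Male, 2 = Female.
sex_labelled <- factor(
  sex_codes,
  levels = c(1, 2),
  labels = c("Male", "Female")
)

table(sex_labelled)
```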

We also appear to have more columns than the set we requested from IPUMS.  Depending on your display, the view above may truncate the dataset to a subset of the columns for easier viewing.  Let's take a quick look at the list of column names so we can see all the variables included in this dataset.

In [17]:
colnames(dat)

The IPUMS extract includes both the variables we asked for and some additional variables.

We have the demographic variables we requested:

1. sex (SEX)
2. age (AGE)
3. race (RACE)
4. educational attainment (EDUC)
5. total income (INCTOT)

Along with more detailed versions of some of our demographic variables:

6. detailed race (RACED)
7. detailed educational attainment (EDUCD)

And the geographic variables we requested:

8. state FIPS code (STATEFIP)
9. county FIPS code (COUNTYFIP)

IPUMS has also included a set of variables which we did not specifically request but which are always included in the ACS data downloads:

10. five-year summary reference year (YEAR) (i.e., 2022 for this data)
11. survey year (MULTYEAR) (i.e., 2018, 2019, 2020, 2021, or 2022 for this data)
12. sample identifier (SAMPLE)
13. unique household identifier (SERIAL)
14. Census Bureau unique household identifier (CBSERIAL)
15. household survey weight (HHWT)
16. primary sampling unit or cluster (CLUSTER)
17. stratification code (STRATA)
18. group quarters code (GQ)
19. person number within the household (PERNUM)
20. person weight (PERWT)
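
The split between requested and automatically added variables can be checked with `setdiff()`. The sketch below types in the column names from the `head(dat)` output so that it runs on its own; with the data loaded, `colnames(dat)` would supply the same vector.

```r
requested <- c("STATEFIP", "COUNTYFIP", "SEX", "AGE", "RACE", "EDUC", "INCTOT")

# Column names as shown in the head(dat) output.
returned <- c(
  "YEAR", "MULTYEAR", "SAMPLE", "SERIAL", "CBSERIAL", "HHWT", "CLUSTER",
  "STATEFIP", "COUNTYFIP", "STRATA", "GQ", "PERNUM", "PERWT", "SEX", "AGE",
  "RACE", "RACED", "EDUC", "EDUCD", "INCTOT"
)

# Columns present in the extract beyond the seven that were requested.
added <- setdiff(returned, requested)
added
length(added)
# 13
```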

Next let's save a couple versions of our IPUMS ACS data file.

* A *.rds* version of the data.  The **R Data Serialization (RDS)** format will retain metadata for the next time we want to import the file back into R.  One downside of the .rds format is that it is only usable within R.
* A *.csv* version of the data.  The [**Comma-Separated Values (CSV)**](https://en.wikipedia.org/wiki/Comma-separated_values) format is versatile and can be easily accessed in other programs.  However, the CSV file format does not include metadata such as labels for variable levels.

In [18]:
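# RDS keeps R-specific metadata such as value labels
saveRDS(dat, "IPUMS_ACS_5y_2022.rds")

# CSV is portable; row.names = FALSE avoids writing an extra index column
write.csv(dat, "IPUMS_ACS_5y_2022.csv", row.names = FALSE)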
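
The difference between the two formats can be seen on a toy example. The sketch below attaches a `label` attribute to a column of a two-row data frame as a stand-in for the metadata carried by the real extract, writes it to temporary files in both formats, and reads it back. It illustrates the general behavior of `saveRDS()` and `write.csv()` and does not use the IPUMS data.

```r
# Toy data frame with a metadata attribute on one column.
toy <- data.frame(SEX = c(2L, 1L))
attr(toy$SEX, "label") <- "Sex"

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(toy, rds_path)
write.csv(toy, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

attr(from_rds$SEX, "label")   # attribute survives the RDS round trip
attr(from_csv$SEX, "label")   # NULL: the CSV round trip drops it
```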

At the end of this exercise we have a freshly downloaded dataset from the IPUMS USA repository saved in our workspace.

## Next Steps

From here, we recommend exploring the following notebooks:

* **Data Cleaning with IPUMS USA**
* **IPUMS NHGIS Data Extraction**