# IPUMS International Data Extraction Using ipumsr

## Introduction
The [IPUMS International](https://international.ipums.org/international) database offers harmonized microdata from population censuses and surveys across multiple countries. It provides detailed, individual-level records on demographics, education, employment, housing, and household characteristics, enabling the analysis of global trends in population and socioeconomic conditions across time and space. Through harmonization, IPUMS International ensures data can be seamlessly compared across countries and census years, overcoming challenges posed by differences in survey design, geographic boundaries, and variable definitions.

**From the [IPUMS International Website](https://international.ipums.org/international):** IPUMS International is dedicated to collecting and distributing census microdata from around the world. The project goals are to collect and preserve data and documentation, harmonize data, and disseminate the harmonized data free of charge.

This notebook introduces the process of extracting [IPUMS International](https://international.ipums.org/international) data using the [IPUMS API](https://developer.ipums.org/docs/v2/apiprogram) via the [ipumsr R package](https://cran.r-project.org/web/packages/ipumsr/index.html). Users will learn how to define, submit, and download an IPUMS International data extract, specifying desired variables, time periods, and geographic units for analysis. By the end of this notebook, users will have the skills to efficiently acquire customized IPUMS International datasets and prepare them for spatial and statistical workflows.

### ★ Prerequisites ★
* Complete Chapter 1.1: Introduction to IPUMS and the IPUMS API
* Set Up Your [IPUMS Account and API Key](https://account.ipums.org/api_keys)

### Notebook Overview
1. Setup
2. IPUMS International Metadata Exploration
3. IPUMS International Data Extraction Specification and Submission

## 1. Setup
This section will guide you through the process of installing essential packages and setting your IPUMS API key.

#### Required Packages

[**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) A Grammar of Data Manipulation. This notebook uses the following function from *dplyr*.

* [*filter*](https://rdrr.io/cran/dplyr/man/filter.html) · keep rows that match a condition
* This notebook also uses [*%>%*](https://magrittr.tidyverse.org/reference/pipe.html), referred to as the *pipe* operator, which is used to pass the output from one function directly into the next function for the purpose of creating streamlined workflows.  The *pipe* operator is a commonly used component of the [*tidyverse*](https://www.tidyverse.org).
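
As a minimal sketch of how the pipe works, the two calls below do the same thing. The data frame `samples_toy` is a hypothetical stand-in that reuses a few sample names appearing later in this notebook; it is not the real metadata table.

```r
library(dplyr)

# hypothetical toy table; the real one comes from get_sample_info()
samples_toy <- data.frame(
  name = c("ar1970a", "za2011a", "za2016a"),
  description = c("Argentina 1970", "South Africa 2011", "South Africa 2016")
)

# without the pipe: the data frame is passed as the first argument
nested_result <- filter(samples_toy, name == "za2016a")

# with the pipe: the left-hand side becomes the first argument of filter()
piped_result <- samples_toy %>% filter(name == "za2016a")

# both approaches should give the same rows
identical(nested_result, piped_result)
```

The piped form tends to read more naturally once several steps are chained together.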

[**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) An R Interface for Downloading, Reading, and Handling IPUMS Data.  This notebook uses the following functions from *ipumsr*.

* [*define_extract_micro*](https://rdrr.io/github/mnpopcenter/ripums/man/define_extract_micro.html) · define an extract request for an IPUMS microdata collection
* [*download_extract*](https://rdrr.io/cran/ipumsr/man/download_extract.html) · download a completed IPUMS data extract
* [*get_sample_info*](https://rdrr.io/cran/ipumsr/man/get_sample_info.html) · list available samples for IPUMS microdata collections
* [*read_ipums_ddi*](https://rdrr.io/cran/ipumsr/man/read_ipums_ddi.html) · read metadata about an IPUMS microdata extract from a DDI codebook (.xml) file
* [*read_ipums_micro*](https://rdrr.io/cran/ipumsr/man/read_ipums_micro.html) · read data from an IPUMS microdata extract
* [*set_ipums_api_key*](https://rdrr.io/cran/ipumsr/man/set_ipums_api_key.html) · set your IPUMS API key
* [*submit_extract*](https://rdrr.io/cran/ipumsr/man/submit_extract.html) · submit an extract request via the IPUMS API
* [*wait_for_extract*](https://rdrr.io/cran/ipumsr/man/wait_for_extract.html) · wait for an extract to finish processing

[**stringr**](https://cran.r-project.org/web/packages/stringr/index.html) Simple, Consistent Wrappers for Common String Operations.  This notebook uses the following function from *stringr*.

* [*str_detect*](https://stringr.tidyverse.org/reference/str_detect.html) · detect the presence or absence of a match

### 1a. Install and Load Required Packages
If you have not already installed the required packages, uncomment and run the code below:

In [None]:
# install.packages(c("dplyr", "ipumsr", "stringr"))

Load the packages into your workspace.

In [2]:
library(dplyr)
library(ipumsr)
library(stringr)


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




### 1b. Set Your IPUMS API Key

Store your [IPUMS API key](https://account.ipums.org/api_keys) in your environment using the following code.

Refer to *Chapter 1.1: Introduction to IPUMS and the IPUMS API* for instructions on setting up your IPUMS account and API key.

In [3]:
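# prompt for the key so it is not hard-coded in the notebook
ipums_api_key <- readline("Please enter your IPUMS API key: ")
# save the key to .Renviron so it is available in future sessions
set_ipums_api_key(ipums_api_key, save = TRUE, overwrite = TRUE)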

Please enter your IPUMS API key:  [API key redacted]


Existing .Renviron file copied to C:\Users\vavra\Documents/.Renviron_backup for backup purposes.

The environment variable IPUMS_API_KEY has been set and saved for future sessions.
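
To confirm that the key is available in the current session without printing it, one option is to check the `IPUMS_API_KEY` environment variable named in the message above. This is a small sketch using base R only; the helper name `has_ipums_key` is hypothetical and not part of *ipumsr*.

```r
# hypothetical helper: TRUE if the IPUMS_API_KEY environment variable is non-empty
has_ipums_key <- function() {
  nzchar(Sys.getenv("IPUMS_API_KEY"))
}

if (has_ipums_key()) {
  message("IPUMS API key found.")
} else {
  message("No IPUMS API key found; run set_ipums_api_key() first.")
}
```

Because the key is a credential, consider clearing any cell output that echoes it before sharing a notebook.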



## 2. IPUMS International Metadata Exploration

### 2a. Review the List of Samples

In [6]:
# retrieve the list of samples from the IPUMS International database
metadata_int <- get_sample_info("ipumsi")

# view the dimensions of the list of samples
dim(metadata_int)

In [7]:
# view the first few lines of the list of samples
head(metadata_int)

name,description
<chr>,<chr>
ar1970a,Argentina 1970
ar1980a,Argentina 1980
ar1991a,Argentina 1991
ar2001a,Argentina 2001
at1971a,Austria 1971
at1981a,Austria 1981


For details about each sample, refer to the [Descriptions of IPUMS International Samples](https://international.ipums.org/international-action/sample_details) page on the IPUMS International website.

In [9]:
# filter the list of samples by country
metadata_int %>% filter(str_detect(description, "South Africa"))

name,description
<chr>,<chr>
za1996a,South Africa 1996
za2001a,South Africa 2001
za2007a,South Africa 2007
za2011a,South Africa 2011
za2016a,South Africa 2016
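
The same filtering pattern can be extended to pick out the most recent sample for a country. The sketch below assumes, based only on the listing above, that characters 3 to 6 of a sample name hold the four-digit year. It rebuilds the South Africa rows by hand so that it runs offline; in the notebook, `metadata_int` would take the place of the hypothetical `za_samples`. It also uses `mutate` and `str_sub`, which are not in the function lists in Section 1.

```r
library(dplyr)
library(stringr)

# copied from the South Africa listing above so the sketch runs offline
za_samples <- data.frame(
  name = c("za1996a", "za2001a", "za2007a", "za2011a", "za2016a"),
  description = c("South Africa 1996", "South Africa 2001", "South Africa 2007",
                  "South Africa 2011", "South Africa 2016")
)

# assumption: characters 3-6 of the sample name hold the four-digit year
latest_sample <- za_samples %>%
  mutate(year = as.integer(str_sub(name, 3, 6))) %>%
  filter(year == max(year))

# the name of the most recent sample, usable in the `samples` argument later
latest_sample$name
```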


## 3. IPUMS International Data Extraction Specification and Submission

Once we know which samples and variables we want, we can define our data extract using the *define_extract_micro* function from the *ipumsr* package.  The parameters this function requires are described in Section 3a.

### 3a. Define the Data Extract

For this example we will use the 2016 South Africa sample.

**Variable Selection**
* South Africa, Local Municipality 2016 (Level 3 GIS) (GEO3_ZA2016)
* Electricity (ELECTRIC)
* Water Supply (WATSUP)
* Sewage (SEWAGE)
* Cooking Fuel (FUELCOOK)
* Toilet (TOILET)

By default, the data extract will also include a number of variables preselected by IPUMS.  These variables carry metadata such as identification codes and survey weights.  We will list the preselected variables after reading the extract into R.

The *define_extract_micro* function takes the following parameters:

* **collection** Code for the IPUMS collection represented by this extract request.  In our case we are downloading from IPUMS International, so we use the code "ipumsi".
* **description** Text description of the extract.
* **samples** Vector of samples to include in the extract request.  In our case we are requesting the 2016 South Africa sample (za2016a).
* **variables** Vector of variable names or a list of detailed variable specifications to include in the extract request.

In [13]:
# set up the data extraction definition
extract_definition <- define_extract_micro(collection = "ipumsi",
                                           description = "IPUMS International Data Extraction",
                                           samples = c("za2016a"),
                                           variables = c("GEO3_ZA2016", "ELECTRIC", "WATSUP", "SEWAGE", "FUELCOOK", "TOILET"))

In [14]:
# review the extraction definition
extract_definition
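
The **variables** parameter also accepts a list of detailed variable specifications. The sketch below uses `var_spec()` from *ipumsr* to request only records with a given ELECTRIC code. The code `"1"` is chosen purely for illustration and its meaning should be checked against the IPUMS International codebook; the object name and description are hypothetical, and whether case selections suit a particular analysis is left to the reader.

```r
library(ipumsr)

# hypothetical variant of the extract definition with a case selection
extract_definition_electric <- define_extract_micro(
  collection = "ipumsi",
  description = "South Africa 2016, records with ELECTRIC code 1",
  samples = "za2016a",
  variables = list(
    # keep only records whose ELECTRIC code is "1" (check the codebook for its meaning)
    var_spec("ELECTRIC", case_selections = "1"),
    var_spec("GEO3_ZA2016"),
    var_spec("WATSUP")
  )
)

# printing shows the definition; nothing is sent to IPUMS until submit_extract()
extract_definition_electric
```

Defining an extract only builds a local object, so a definition can be inspected and revised before it is submitted.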

### 3b. Submit the Extract Request

In [15]:
# submit extraction request
extract_submitted <- submit_extract(extract_definition)

# wait for completion
extraction_complete <- wait_for_extract(extract_submitted)

# check completion status
extraction_complete$status

# download the extract files and store the path to the DDI codebook (.xml) file
filepath <- download_extract(extract_submitted, overwrite = TRUE)

Successfully submitted IPUMS International extract number 1

Checking extract status...

Waiting 10 seconds...

Checking extract status...

Waiting 20 seconds...

Checking extract status...

Waiting 30 seconds...

Checking extract status...

Waiting 40 seconds...

Checking extract status...

IPUMS International extract 1 is ready to download.






DDI codebook file saved to C:/Users/vavra/Dropbox/R Spatial/r-spatial/ipumsi_00001.xml
Data file saved to C:/Users/vavra/Dropbox/R Spatial/r-spatial/ipumsi_00001.dat.gz
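
If the R session ends while an extract is still processing, the request does not necessarily have to be resubmitted. The sketch below wraps `get_last_extract_info()` and `is_extract_ready()` from *ipumsr* (neither is in the function list in Section 1) in a hypothetical helper, `download_last_extract`. It needs a valid API key and network access, so the call itself is left commented out.

```r
library(ipumsr)

# hypothetical helper: fetch the most recent extract for a collection
# and download it once it is ready (requires an API key and network access)
download_last_extract <- function(collection = "ipumsi", download_dir = getwd()) {
  last_extract <- get_last_extract_info(collection)

  # wait only if the extract has not finished processing yet
  if (!is_extract_ready(last_extract)) {
    last_extract <- wait_for_extract(last_extract)
  }

  # download_extract() returns the path to the DDI codebook (.xml) file
  download_extract(last_extract, download_dir = download_dir, overwrite = TRUE)
}

# usage (not run here because it contacts the IPUMS API):
# filepath <- download_last_extract("ipumsi")
```

If the extract files are no longer available on the IPUMS servers, the definition would need to be resubmitted instead.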



### 3c. Review the Extract

The data extract download will contain the following two files.

1. A [DDI (Data Documentation Initiative)](https://ddialliance.org) codebook file (file extension .xml) containing metadata and descriptive information for your data.
2. A gzip-compressed data file (file extension .dat.gz) containing your data.

Read the DDI codebook and the data file into R so that we can work with them.

In [16]:
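# read the DDI codebook (.xml), which describes the layout of the data file
ddi <- read_ipums_ddi(filepath)
# read the microdata using the metadata in the codebook
dat <- read_ipums_micro(ddi)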

Use of data from IPUMS International is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.



In [17]:
dim(dat)

In [18]:
head(dat)

COUNTRY,YEAR,SAMPLE,SERIAL,HHWT,GEO3_ZA2016,ELECTRIC,WATSUP,SEWAGE,FUELCOOK,TOILET
<int+lbl>,<int>,<int+lbl>,<dbl>,<dbl>,<int+lbl>,<int+lbl>,<int+lbl>,<int+lbl>,<int+lbl>,<int+lbl>
710,2016,710201601,1000,19.71,6037013,1,16,11,20,21
710,2016,710201601,1000,19.71,6037013,1,16,11,20,21
710,2016,710201601,1000,19.71,6037013,1,16,11,20,21
710,2016,710201601,1000,19.71,6037013,1,16,11,20,21
710,2016,710201601,1000,19.71,6037013,1,16,11,20,21
710,2016,710201601,1000,19.71,6037013,1,16,11,20,21


In [19]:
colnames(dat)

**Variable Selection**
* South Africa, Local Municipality 2016 (Level 3 GIS) (GEO3_ZA2016)
* Electricity (ELECTRIC)
* Water Supply (WATSUP)
* Sewage (SEWAGE)
* Cooking Fuel (FUELCOOK)
* Toilet (TOILET)

**IPUMS Preselected Variables**
* Country (COUNTRY)
* Survey Year (YEAR)
* IPUMS Sample Identifier (SAMPLE)
* Household Serial Number (SERIAL)
* Household Weight (HHWT)
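
The `<int+lbl>` column type in the output above marks labelled columns: integer codes with value labels attached. The sketch below shows, on a toy vector, how such codes can be converted to their labels or stripped back to plain integers. The labels `Yes` and `No` are hypothetical placeholders, not taken from the ELECTRIC codebook, and the *haven* package is assumed to be installed (it is a dependency of *ipumsr*).

```r
library(haven)

# toy labelled vector; the value labels here are illustrative placeholders
toy_codes <- labelled(c(1L, 1L, 2L), labels = c(Yes = 1L, No = 2L))

# replace the codes with their labels (returns a factor)
toy_factor <- as_factor(toy_codes)

# or drop the labels and keep the plain integer codes
toy_plain <- zap_labels(toy_codes)

levels(toy_factor)
```

With the real extract, the same idea would apply to a column such as `dat$ELECTRIC`, and `ipums_val_labels(ddi, ELECTRIC)` should list the codes and labels recorded in the codebook.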

### 3d. Save the Data

Next, let's save two versions of our IPUMS International data file.

* A *.rds* version of the data.  The **R Data Serialization (RDS)** format will retain metadata for the next time we want to import the file back into R.  One downside of the .rds format is that it is only usable within R.
* A *.csv* version of the data.  The [**Comma-Separated Values (CSV)**](https://en.wikipedia.org/wiki/Comma-separated_values) format is versatile and can be easily accessed in other programs.  However, the CSV file format does not include metadata such as labels for variable levels.

In [None]:
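# save as .rds to keep labels and other metadata
saveRDS(dat, "ipums_international_example.rds")
# save as .csv without an extra column of row numbers
write.csv(dat, "ipums_international_example.csv", row.names = FALSE)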
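
The trade-off between the two formats can be seen with a small round trip. This is a sketch in base R on a toy data frame written to temporary files; the `label` attribute stands in for the richer metadata that an IPUMS extract carries.

```r
# toy data frame with a variable label stored as an attribute
toy <- data.frame(ELECTRIC = c(1L, 2L))
attr(toy$ELECTRIC, "label") <- "Electricity"

# write both formats to temporary files
rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")
saveRDS(toy, rds_path)
write.csv(toy, csv_path, row.names = FALSE)

# read both files back in
from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

attr(from_rds$ELECTRIC, "label")  # the attribute survives the round trip
attr(from_csv$ELECTRIC, "label")  # NULL: the CSV stores values only
```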

## Recommended Next Steps
* **Continue with Chapter 2: IPUMS Data Acquisition and Extraction**
  * 2.1: IPUMS USA Data Extraction Using ipumsr
  * 2.2: IPUMS CPS Data Extraction Using ipumsr
  * 2.4: IPUMS NHGIS Data Extraction Using ipumsr
  * 2.5: IPUMS Time Use Data Extraction Using ipumsr
  * 2.6: IPUMS Health Surveys Data Extraction Using ipumsr
  * 2.7: Reading IPUMS Global Health Data Extracts Using ipumsr
  * 2.8: Reading IPUMS Higher Education Data Extracts Using ipumsr

## Quick Code
Don't forget to update the code with your IPUMS API key!