# IPUMS Health Surveys Data Extraction Using ipumsr
### by [Kate Vavra-Musser](https://vavramusser.github.io) for the [R Spatial Notebook Series](https://vavramusser.github.io/r-spatial)

This notebook builds on the workflow introduced in the **[Introduction to the IPUMS API for R Users](https://tech.popdata.org/ipumsr/articles/ipums-api.html)** article on the IPUMS website.  As the author of the R Spatial Notebook series, I recognize the IPUMS article as a significant inspiration and source of information for this notebook.

## Introduction

The [IPUMS Health Surveys](https://healthsurveys.ipums.org) database offers harmonized microdata from the [National Health Interview Survey (NHIS)](https://www.cdc.gov/nchs/nhis/index.html) and the [Medical Expenditure Panel Survey (MEPS)](https://meps.ahrq.gov/mepsweb), providing detailed information on population health, healthcare access, and medical expenditures. It enables researchers to analyze trends in health outcomes, insurance coverage, and healthcare utilization across time and demographic groups. Through harmonization, IPUMS Health Surveys ensures data can be seamlessly compared across survey years, addressing changes in survey design, variable definitions, and geographic classifications.

**From the [IPUMS Health Surveys Website](https://healthsurveys.ipums.org):** IPUMS Health Surveys provide free individual-level survey data for research purposes from two leading sources of self-reported health and health care access information: the National Health Interview Survey (NHIS) and the Medical Expenditure Panel Survey (MEPS).

The IPUMS Health Surveys database is organized into the following two subsections, based on data source:

* [National Health Interview Survey (NHIS)](https://nhis.ipums.org/nhis): The National Health Interview Survey (NHIS) provides harmonized annual microdata from the 1960s to the present.
* [Medical Expenditure Panel Survey (MEPS)](https://meps.ipums.org/meps): The Medical Expenditure Panel Survey (MEPS) provides harmonized microdata from the longitudinal survey of U.S. health care expenditures and utilization, covering the period 1996 to the present.

This notebook introduces the process of extracting [IPUMS Health Surveys](https://healthsurveys.ipums.org) data using the [IPUMS API](https://developer.ipums.org/docs/v2/apiprogram) via the [ipumsr R package](https://cran.r-project.org/web/packages/ipumsr/index.html). Users will learn how to define, submit, and download an IPUMS Health Surveys data extract, specifying desired variables, time periods, and geographic units for analysis. By the end of this notebook, users will have the skills to efficiently acquire customized IPUMS Health Surveys datasets and prepare them for spatial and statistical workflows.

### ★ Prerequisites ★
* Complete [Section 1.1 Introduction to IPUMS and the IPUMS API](https://platform.i-guide.io/notebooks/82d3b176-e4e6-4307-8186-318a3fe6c81a)
* Set Up Your [IPUMS Account and API Key](https://account.ipums.org/api_keys)

### Notebook Overview
1. Setup
2. IPUMS National Health Interview Survey (NHIS) Metadata Exploration
3. IPUMS National Health Interview Survey (NHIS) Data Extraction Specification and Submission
4. IPUMS Medical Expenditure Panel Survey (MEPS) Metadata Exploration
5. IPUMS Medical Expenditure Panel Survey (MEPS) Data Extraction Specification and Submission

## 1. Setup
This section will guide you through the process of installing essential packages and setting your IPUMS API key.

#### Required Packages

[**dplyr**](https://cran.r-project.org/web/packages/dplyr/index.html) A Grammar of Data Manipulation. This notebook uses the following function from *dplyr*.

* [*filter*](https://rdrr.io/cran/dplyr/man/filter.html) · keep rows that match a condition
* This notebook also uses [*%>%*](https://magrittr.tidyverse.org/reference/pipe.html), referred to as the *pipe* operator, which is used to pass the output from one function directly into the next function for the purpose of creating streamlined workflows.  The *pipe* operator is a commonly used component of the [*tidyverse*](https://www.tidyverse.org).

[**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) An R Interface for Downloading, Reading, and Handling IPUMS Data.  This notebook uses the following functions from *ipumsr*.

* [*define_extract_micro*](https://rdrr.io/github/mnpopcenter/ripums/man/define_extract_micro.html) · define an extract request for an IPUMS microdata collection
* [*download_extract*](https://rdrr.io/cran/ipumsr/man/download_extract.html) · download a completed IPUMS data extract
* [*get_sample_info*](https://rdrr.io/cran/ipumsr/man/get_sample_info.html) · list available samples for IPUMS microdata collections
* [*read_ipums_ddi*](https://rdrr.io/cran/ipumsr/man/read_ipums_ddi.html) · read metadata about an IPUMS microdata extract from a DDI codebook (.xml) file
* [*read_ipums_micro*](https://rdrr.io/cran/ipumsr/man/read_ipums_micro.html) · read data from an IPUMS microdata extract
* [*set_ipums_api_key*](https://rdrr.io/cran/ipumsr/man/set_ipums_api_key.html) · set your IPUMS API key
* [*submit_extract*](https://rdrr.io/cran/ipumsr/man/submit_extract.html) · submit an extract request via the IPUMS API
* [*wait_for_extract*](https://rdrr.io/cran/ipumsr/man/wait_for_extract.html) · wait for an extract to finish processing

[**stringr**](https://cran.r-project.org/web/packages/stringr/index.html) Simple, Consistent Wrappers for Common String Operations.  This notebook uses the following function from *stringr*.

* [*str_detect*](https://stringr.tidyverse.org/reference/str_detect.html) · detect the presence or absence of a match

### 1a. Install and Load Required Packages
If you have not already installed the required packages, uncomment and run the code below:

In [None]:
# install.packages(c("dplyr", "ipumsr", "stringr"))

Load the packages into your workspace.

In [None]:
library(dplyr)
library(ipumsr)
library(stringr)

### 1b. Set Your IPUMS API Key

Store your [IPUMS API key](https://account.ipums.org/api_keys) in your environment using the following code.

Refer to *Chapter 1.1: Introduction to IPUMS and the IPUMS API* for instructions on setting up your IPUMS account and API key.

In [None]:
ipums_api_key <- readline("Please enter your IPUMS API key: ")
set_ipums_api_key(ipums_api_key, save = TRUE, overwrite = TRUE)
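
To confirm that the key was stored, you can check the IPUMS_API_KEY environment variable, which is where *ipumsr* keeps the key. The sketch below is a minimal check; the helper name *api_key_is_set* is hypothetical and not part of *ipumsr*.

```r
# hypothetical helper (not part of ipumsr): TRUE if the IPUMS_API_KEY
# environment variable holds a non-empty value
api_key_is_set <- function() {
  nzchar(Sys.getenv("IPUMS_API_KEY"))
}

# report whether a key is available without printing the key itself
if (api_key_is_set()) {
  message("IPUMS API key found.")
} else {
  message("No IPUMS API key found; run set_ipums_api_key() first.")
}
```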


## 2. IPUMS National Health Interview Survey (NHIS) Metadata Exploration

From the [**IPUMS National Health Interview Survey (NHIS) Webpage**](https://nhis.ipums.org/nhis): The National Health Interview Survey is a survey collecting information on the health, health care access, and health behaviors of the civilian, non-institutionalized U.S. population, with digital data files available from 1963 to present. IPUMS Health Surveys harmonizes these data and allows users to create custom NHIS data extracts for analysis.

### 2a. Review the List of Samples

In [None]:
# retrieve the list of samples from the IPUMS NHIS database
metadata_nhis <- get_sample_info("nhis")

# view the dimensions of the list of samples
dim(metadata_nhis)

In [None]:
# view the first few lines of the list of samples
head(metadata_nhis)

Refer to the [Descriptions of IPUMS NHIS Samples](https://nhis.ipums.org/nhis/surveys.shtml) page on the IPUMS NHIS website.

In [None]:
# filter the list of samples by year
metadata_nhis %>% filter(str_detect(description, "2000"))
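
The sample code needed for the extract definition is stored in the *name* column of the sample list. As a minimal sketch, the code below filters a small stand-in table and pulls out the matching code with *pull* from *dplyr*. The *samples_example* table and its description text are hypothetical placeholders; in this notebook you would use *metadata_nhis* instead.

```r
library(dplyr)
library(stringr)

# hypothetical stand-in for the table returned by get_sample_info("nhis");
# the real table has the same two columns, name and description
samples_example <- data.frame(
  name = c("ih1999", "ih2000", "ih2001"),
  description = c("1999 NHIS", "2000 NHIS", "2001 NHIS"),
  stringsAsFactors = FALSE
)

# keep rows whose description mentions 2000, then extract the sample code
sample_code <- samples_example %>%
  filter(str_detect(description, "2000")) %>%
  pull(name)

sample_code
```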

## 3. IPUMS NHIS Data Extraction Specification and Submission

Once we have chosen a sample and a set of variables, we can define the data extract using the *define_extract_micro* function from the *ipumsr* package.  The function's parameters are described in Section 3a.

### 3a. Define the Data Extract

For this example we will use the 2000 NHIS sample (ih2000).

**Variable Selection**
* Region of Residence (REGION)
* Health Status (HEALTH)
* Health Insurance Coverage Status (HINOTCOVE)
* Ever Told Had Hypertension on 2+ Visits (HYP2TIME)
* Ever Smoked 100 Cigarettes in Life (SMOKEV)
* Frequency of Vigorous Activity 10+ Minutes: Times per Week (VIG10FWK)

By default, the data extraction will also include a number of IPUMS preselected variables.  These variables include metainformation such as identification codes and survey weights.  We will explore and list the preselected variables after completing the data extraction.

The *define_extract_micro* function takes the following parameters.

* **collection** Code for the IPUMS collection represented by this extract request.  In our case we are downloading from IPUMS NHIS so we use the code "nhis".
* **description** Text description of the extract.
* **samples** Vector of samples to include in the extract request.  In our case we are downloading the 2000 NHIS sample (ih2000).
* **variables** Vector of variable names or a list of detailed variable specifications to include in the extract request.

In [None]:
# set up the data extraction definition
extract_definition <- define_extract_micro(collection = "nhis",
                                           description = "IPUMS NHIS Data Extraction",
                                           samples = c("ih2000"),
                                           variables = c("REGION", "HEALTH", "HINOTCOVE", "HYP2TIME", "SMOKEV", "VIG10FWK"))

In [None]:
# review the extraction definition
extract_definition
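
An extract definition can also be saved to disk so that the same request can be rebuilt later. The sketch below assumes the *save_extract_as_json* and *define_extract_from_json* functions from *ipumsr*, which are not otherwise used in this notebook; the file is written to a temporary path chosen only for illustration.

```r
library(ipumsr)

# same definition as above, repeated here so the example stands alone
extract_example <- define_extract_micro(
  collection = "nhis",
  description = "IPUMS NHIS Data Extraction",
  samples = c("ih2000"),
  variables = c("REGION", "HEALTH", "HINOTCOVE", "HYP2TIME", "SMOKEV", "VIG10FWK")
)

# write the definition to a JSON file (temporary path, for illustration only)
json_path <- tempfile(fileext = ".json")
save_extract_as_json(extract_example, file = json_path)

# rebuild the definition from the saved file
extract_restored <- define_extract_from_json(json_path)
extract_restored
```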

### 3b. Submit the Extract Request

In [None]:
# submit extraction request
extract_submitted <- submit_extract(extract_definition)

# wait for completion
extraction_complete <- wait_for_extract(extract_submitted)

# check completion status
extraction_complete$status

# download the extract and store the path to the DDI codebook file
filepath <- download_extract(extract_submitted, overwrite = TRUE)

### 3c. Review the Extract

The data extract download will contain the following two files.

1. A [DDI (Data Documentation Initiative)](https://ddialliance.org) codebook file (file extension .xml) containing metadata and descriptive information for your data.
2. A zipped data (.dat) file (file extension .dat.gz) containing your data.

Read the DDI and data files into a format that we can work with in R.

In [None]:
ddi <- read_ipums_ddi(filepath)
dat <- read_ipums_micro(ddi)
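
The DDI object holds the variable labels and descriptions for the extract. As a sketch of how to inspect them, the code below uses *ipums_var_info* from *ipumsr* on an example CPS codebook. It assumes the file cps_00157.xml ships with your installed version of *ipumsr* (calling *ipums_example* with no arguments lists the bundled files). With the extract downloaded in this notebook you would pass *ddi* instead.

```r
library(ipumsr)

# example codebook bundled with ipumsr (assumed to be available in your version)
ddi_example <- read_ipums_ddi(ipums_example("cps_00157.xml"))

# one row per variable, including its name, label, and description
var_info <- ipums_var_info(ddi_example)
var_info[, c("var_name", "var_label")]
```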

In [None]:
dim(dat)

In [None]:
head(dat)

In [None]:
colnames(dat)

**Variable Selection**
* Region of Residence (REGION)
* Health Status (HEALTH)
* Health Insurance Coverage Status (HINOTCOVE)
* Ever Told Had Hypertension on 2+ Visits (HYP2TIME)
* Ever Smoked 100 Cigarettes in Life (SMOKEV)
* Frequency of Vigorous Activity 10+ Minutes: Times per Week (VIG10FWK)

**IPUMS Preselected Variables**
* Survey Year (YEAR)
* Sequential Serial Number, Household Record (SERIAL)
* Stratum for Variance Estimation (STRATA)
* Primary Sampling Unit (PSU) for Variance Estimation (PSU)
* NHIS Unique Identifier, Household (NHISHID)
* Household Weight, Final Annual (HHWEIGHT)
* Person Number within Family/Household (from reformatting) (PERNUM)
* NHIS Unique Identifier, Person (NHISPID)
* Household Number (from NHIS) (HHX)
* Family Number (from NHIS) (FMX)
* Person Number of Respondent (from NHIS) (PX)
* Final Basic Annual Weight (PERWEIGHT)
* Sample Person Weight (SAMPWEIGHT)
* Final Annual Family Weight (FWEIGHT)
* Sample Adult Flag (ASTATFLG)
* Sample Child Flag (CSTATFLG)
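
One way to separate the preselected variables from the requested ones is to compare the column names against the vector of requested variables. The sketch below hard-codes the column names listed above so that it runs on its own; with the extract loaded you would use *colnames(dat)* in place of *extract_columns*.

```r
# variables requested in the extract definition
requested <- c("REGION", "HEALTH", "HINOTCOVE", "HYP2TIME", "SMOKEV", "VIG10FWK")

# column names as listed above (stand-in for colnames(dat))
extract_columns <- c(
  "YEAR", "SERIAL", "STRATA", "PSU", "NHISHID", "HHWEIGHT", "PERNUM",
  "NHISPID", "HHX", "FMX", "PX", "PERWEIGHT", "SAMPWEIGHT", "FWEIGHT",
  "ASTATFLG", "CSTATFLG", requested
)

# columns that were added automatically by IPUMS
preselected <- setdiff(extract_columns, requested)
preselected
```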

### 3d. Save the Data

Next let's save a couple versions of our IPUMS NHIS data file.

* A *.rds* version of the data.  The **R Data Serialization (RDS)** format will retain metadata for the next time we want to import the file back into R.  One downside to the .rds format is that it is only usable within R.
* A *.csv* version of the data.  The [**Comma-Separated Values (CSV)**](https://en.wikipedia.org/wiki/Comma-separated_values) format is versatile and can be easily accessed in other programs.  However, the CSV file format does not include metadata such as labels for variable levels.

In [None]:
saveRDS(dat, "ipums_nhis_example.rds")
write.csv(dat, "ipums_nhis_example.csv", row.names = FALSE)
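
The difference between the two formats can be seen in a small round trip. The sketch below uses a toy data frame with a hypothetical *label* attribute standing in for IPUMS metadata, written to temporary files. It only illustrates that attributes survive in RDS but not in CSV.

```r
# toy data frame with a hypothetical label attribute (stand-in for IPUMS metadata)
toy <- data.frame(HEALTH = c(1, 2, 3))
attr(toy$HEALTH, "label") <- "Health status"

# write both formats to temporary files
rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")
saveRDS(toy, rds_path)
write.csv(toy, csv_path, row.names = FALSE)

# read both files back in
from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

# the attribute is kept by RDS and dropped by CSV
attr(from_rds$HEALTH, "label")
attr(from_csv$HEALTH, "label")
```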

## 4. IPUMS Medical Expenditure Panel Survey (MEPS) Metadata Exploration

From the [**IPUMS Medical Expenditure Panel Survey (MEPS) Webpage**](https://meps.ipums.org/meps): MEPS provides nationally representative, longitudinal data from 1996 to the present on health status, medical conditions, healthcare utilization, and healthcare expenditures for the U.S. civilian, non-institutionalized population. IPUMS MEPS harmonizes these data and allows users to create customized data extracts for analysis.

### 4a. Review the List of Samples

In [None]:
# retrieve the list of samples from the IPUMS MEPS database
metadata_meps <- get_sample_info("meps")

# view the dimensions of the list of samples
dim(metadata_meps)

In [None]:
# view the first few lines of the list of samples
head(metadata_meps)

In [None]:
# filter the list of samples by year
metadata_meps %>% filter(str_detect(description, "2020"))

## 5. IPUMS Medical Expenditure Panel Survey (MEPS) Data Extraction Specification and Submission

Once we have chosen a sample and a set of variables, we can define the data extract using the *define_extract_micro* function from the *ipumsr* package.  The function's parameters are described in Section 5a.

### 5a. Define the Data Extract

For this example we will use the 2020 MEPS sample.

**Variable Selection**
* Health Status (HEALTH)
* Health Insurance Coverage Type (Hierarchy) (COVERTYPE)
* Annual Total of Direct Health Care Payments (EXPTOT)
* Annual Total Number of Visits Made to Office-Based Medical Providers (OBTOTVIS)
* Respondent Has Been Told They Have Diabetes (DCSDIABDX)

By default, the data extraction will also include a number of IPUMS preselected variables.  These variables include metainformation such as identification codes and survey weights.  We will explore and list the preselected variables after completing the data extraction.

The *define_extract_micro* function takes the following parameters.

* **collection** Code for the IPUMS collection represented by this extract request.  In our case we are downloading from IPUMS MEPS so we use the code "meps".
* **description** Text description of the extract.
* **samples** Vector of samples to include in the extract request.  In our case we are downloading the 2020 MEPS sample (mp2020).
* **variables** Vector of variable names or a list of detailed variable specifications to include in the extract request.

In [None]:
# set up the data extraction definition
extract_definition <- define_extract_micro(collection = "meps",
                                           description = "IPUMS MEPS Data Extraction",
                                           samples = c("mp2020"),
                                           variables = c("HEALTH", "COVERTYPE", "EXPTOT", "OBTOTVIS", "DCSDIABDX"))

In [None]:
# review the extraction definition
extract_definition

### 5b. Submit the Extract Request

In [None]:
# submit extraction request
extract_submitted <- submit_extract(extract_definition)

# wait for completion
extraction_complete <- wait_for_extract(extract_submitted)

# check completion status
extraction_complete$status

# download the extract and store the path to the DDI codebook file
filepath <- download_extract(extract_submitted, overwrite = TRUE)

### 5c. Review the Extract

The data extract download will contain the following two files.

1. A [DDI (Data Documentation Initiative)](https://ddialliance.org) codebook file (file extension .xml) containing metadata and descriptive information for your data.
2. A zipped data (.dat) file (file extension .dat.gz) containing your data.

Read the DDI and data files into a format that we can work with in R.

In [None]:
ddi <- read_ipums_ddi(filepath)
dat <- read_ipums_micro(ddi)

In [None]:
dim(dat)

In [None]:
head(dat)

In [None]:
colnames(dat)

**Variable Selection**
* Health Status (HEALTH)
* Health Insurance Coverage Type (Hierarchy) (COVERTYPE)
* Annual Total of Direct Health Care Payments (EXPTOT)
* Annual Total Number of Visits Made to Office-Based Medical Providers (OBTOTVIS)
* Respondent Has Been Told They Have Diabetes (DCSDIABDX)

**IPUMS Preselected Variables**
* Survey Year (YEAR)
* Person Number within Family/Household (from reformatting) (PERNUM)
* Dwelling Unit ID (DUID)
* Person Number (PID)
* MEPS Unique Identifier (IPUMS Generated) (MEPSID)
* Panel (PANEL)
* Annual Primary Sampling Unit (PSU) for Variance Estimation (PSUANN)
* Annual Stratum for Variance Estimation (STRATANN)
* Pooled Primary Sampling Unit (PSU) for Variance Estimation (PSUPLD)
* Pooled Variance Stratum (STRATAPLD)
* Year Entered MEPS (PANELYR)
* Relative Year 1 or 2 in Panel (RELYR)
* Final Basic Annual Weight (PERWEIGHT)
* Self-Administered Questionnaire Weight (SAQWEIGHT)
* Diabetes Care Weight (DIABWEIGHT)
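
The weight variables matter when summarizing the data. As a minimal illustration, the sketch below compares an unweighted and a weighted mean using *weighted.mean* from base R. The three rows are made-up toy values, not MEPS data, and the example computes a point estimate only; it does not use the stratum or PSU variables that variance estimation would need.

```r
# toy values for illustration only (not real MEPS records)
toy <- data.frame(
  EXPTOT = c(0, 1000, 5000),
  PERWEIGHT = c(10000, 20000, 5000)
)

# unweighted mean treats every respondent equally
mean_unweighted <- mean(toy$EXPTOT)

# weighted mean scales each respondent by the person weight
mean_weighted <- weighted.mean(toy$EXPTOT, w = toy$PERWEIGHT)

c(unweighted = mean_unweighted, weighted = mean_weighted)
```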

### 5d. Save the Data

Next let's save a couple versions of our IPUMS MEPS data file.

* A *.rds* version of the data.  The **R Data Serialization (RDS)** format will retain metadata for the next time we want to import the file back into R.  One downside to the .rds format is that it is only usable within R.
* A *.csv* version of the data.  The [**Comma-Separated Values (CSV)**](https://en.wikipedia.org/wiki/Comma-separated_values) format is versatile and can be easily accessed in other programs.  However, the CSV file format does not include metadata such as labels for variable levels.

In [None]:
saveRDS(dat, "ipums_meps_example.rds")
write.csv(dat, "ipums_meps_example.csv", row.names = FALSE)

## Recommended Next Steps
* **Continue with Chapter 2: IPUMS Data Acquisition and Extraction**
  * 2.1: IPUMS USA Data Extraction Using ipumsr
  * 2.2: IPUMS CPS Data Extraction Using ipumsr
  * 2.3: IPUMS International Microdata Extraction Using ipumsr
  * 2.4: IPUMS NHGIS Data Extraction Using ipumsr
  * 2.5: IPUMS Time Use Data Extraction Using ipumsr
  * 2.7: Reading IPUMS Global Health Data Extracts Using ipumsr
  * 2.8: Reading IPUMS Higher Education Data Extracts Using ipumsr