# IPUMS Health Surveys Data Extraction Using ipumsr

## Introduction

The [IPUMS Health Surveys](https://healthsurveys.ipums.org) database offers harmonized microdata from the [National Health Interview Survey (NHIS)](https://www.cdc.gov/nchs/nhis/index.html) and the [Medical Expenditure Panel Survey (MEPS)](https://meps.ahrq.gov/mepsweb), providing detailed information on population health, healthcare access, and medical expenditures. It enables researchers to analyze trends in health outcomes, insurance coverage, and healthcare utilization across time and demographic groups. Through harmonization, IPUMS Health Surveys ensures data can be seamlessly compared across survey years, addressing changes in survey design, variable definitions, and geographic classifications.

**From the [IPUMS Health Surveys Website](https://healthsurveys.ipums.org):** IPUMS Health Surveys provide free individual-level survey data for research purposes from two leading sources of self-reported health and health care access information: the National Health Interview Survey (NHIS) and the Medical Expenditure Panel Survey (MEPS).

The IPUMS Health Surveys database is organized into the following two subsections, based on data source:

* [National Health Interview Survey (NHIS)](https://nhis.ipums.org/nhis): The National Health Interview Survey (NHIS) provides harmonized annual microdata from the 1960s to the present.
* [Medical Expenditure Panel Survey (MEPS)](https://meps.ipums.org/meps): The Medical Expenditure Panel Survey (MEPS) provides harmonized microdata from the longitudinal survey of U.S. health care expenditures and utilization, covering the period 1996 to the present.

This notebook introduces the process of extracting [IPUMS Health Surveys](https://healthsurveys.ipums.org) data using the [IPUMS API](https://developer.ipums.org/docs/v2/apiprogram) via the [ipumsr R package](https://cran.r-project.org/web/packages/ipumsr/index.html). Users will learn how to define, submit, and download an IPUMS Health Surveys data extract, specifying desired variables, time periods, and geographic units for analysis. By the end of this notebook, users will have the skills to efficiently acquire customized IPUMS Health Surveys datasets and prepare them for spatial and statistical workflows.

### ★ Prerequisites ★
* Complete Chapter 1.1: Introduction to IPUMS and the IPUMS API
* Set Up Your [IPUMS Account and API Key](https://account.ipums.org/api_keys)

### Notebook Overview
1. Setup
2. IPUMS National Health Interview Survey (NHIS) Metadata Exploration
3. IPUMS National Health Interview Survey (NHIS) Data Extraction Specification and Submission
4. IPUMS Medical Expenditure Panel Survey (MEPS) Metadata Exploration
5. IPUMS Medical Expenditure Panel Survey (MEPS) Data Extraction Specification and Submission

## 1. Setup
This section will guide you through the process of installing essential packages and setting your IPUMS API key.

##### Required Packages

[**ipumsr**](https://cran.r-project.org/web/packages/ipumsr/index.html) A package for interacting with IPUMS datasets and the IPUMS API. It allows users to define and submit data extract requests, download data, and read it directly into R for analysis. This notebook uses the following functions from *ipumsr*:

* *set_ipums_api_key()* for setting your IPUMS API key
* *get_sample_info()* for retrieving sample identification codes and descriptions for IPUMS microdata collections
* *define_extract_micro()* for defining the parameters of an IPUMS microdata extract request to be submitted via the IPUMS API
* *submit_extract()* for submitting an extract request via the IPUMS API and returning an *ipums_extract* object
* *wait_for_extract()* for waiting until an extract has finished processing
* *download_extract()* for downloading an extract's data files
* *read_ipums_ddi()* for reading metadata about an IPUMS microdata extract from a DDI codebook (.xml) file
* *read_ipums_micro()* for reading data from an IPUMS microdata extract

The notebook also uses *filter()* from **dplyr** and *str_detect()* from **stringr** to search the sample lists.

### 1a. Install and Load Required Packages
If you have not already installed the required packages, uncomment and run the code below:

In [None]:
# install.packages(c("dplyr", "ipumsr", "stringr"))

Load the packages into your workspace.

In [8]:
library(dplyr)
library(ipumsr)
library(stringr)


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




### 1b. Set Your IPUMS API Key

Store your [IPUMS API key](https://account.ipums.org/api_keys) in your environment using the following code.

Refer to *Chapter 1.1: Introduction to IPUMS and the IPUMS API* for instructions on setting up your IPUMS account and API key.

In [2]:
# prompt for the API key
ipums_api_key <- readline("Please enter your IPUMS API key: ")
# store the key in .Renviron so it is available in future sessions
set_ipums_api_key(ipums_api_key, save = TRUE, overwrite = TRUE)

Please enter your IPUMS API key:  [API key redacted]


Existing .Renviron file copied to C:\Users\vavra\Documents/.Renviron_backup for backup purposes.

The environment variable IPUMS_API_KEY has been set and saved for future sessions.
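
Before moving on, it can help to confirm that the key is visible to the current R session. The sketch below uses only base R. `has_ipums_key()` is a hypothetical helper name, and the only detail taken from the output above is that the key is stored in the `IPUMS_API_KEY` environment variable.

```r
# Hypothetical helper: TRUE if the IPUMS_API_KEY environment variable
# holds a non-empty value in the current session.
has_ipums_key <- function() {
  nzchar(Sys.getenv("IPUMS_API_KEY"))
}

if (has_ipums_key()) {
  message("IPUMS API key found.")
} else {
  message("No IPUMS API key found; run set_ipums_api_key() first.")
}
```

Checking for the variable, rather than printing its value, keeps the key itself out of the notebook output.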




## 2. IPUMS National Health Interview Survey (NHIS) Metadata Exploration

From the [**IPUMS National Health Interview Survey (NHIS) Webpage**](https://nhis.ipums.org/nhis): The National Health Interview Survey is a survey collecting information on the health, health care access, and health behaviors of the civilian, non-institutionalized U.S. population, with digital data files available from 1963 to present. IPUMS Health Surveys harmonizes these data and allows users to create custom NHIS data extracts for analysis.

### 2a. Review the List of Samples

In [4]:
# retrieve and view the list of samples from the IPUMS NHIS database
metadata_nhis <- get_sample_info("nhis")

# view the dimensions of the list of samples
dim(metadata_nhis)

In [5]:
# view the first few lines of the list of samples
head(metadata_nhis)

name,description
<chr>,<chr>
ih1968,1968 NHIS
ih1969,1969 NHIS
ih1970,1970 NHIS
ih1971,1971 NHIS
ih1972,1972 NHIS
ih1973,1973 NHIS


For a description of each sample, refer to the [Descriptions of IPUMS NHIS Samples](https://nhis.ipums.org/nhis/surveys.shtml) page on the IPUMS NHIS website.

In [9]:
# filter the list of samples by year
metadata_nhis %>% filter(str_detect(description, "2000"))

name,description
<chr>,<chr>
ih2000,2000 NHIS
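
Matching on a single year string works for one sample. To pick a span of years, it is often easier to parse the year out of the sample name. The sketch below uses base R on a small hand-built table that copies a few rows from the output above instead of calling the API. The names `samples_demo` and `selected`, and the 1970 to 1972 window, are illustrative choices.

```r
# A few rows copied from the sample list shown above (not the full list).
samples_demo <- data.frame(
  name = c("ih1968", "ih1969", "ih1970", "ih1971", "ih1972", "ih1973", "ih2000"),
  description = c("1968 NHIS", "1969 NHIS", "1970 NHIS", "1971 NHIS",
                  "1972 NHIS", "1973 NHIS", "2000 NHIS"),
  stringsAsFactors = FALSE
)

# Parse the four-digit year from the sample name (e.g. "ih1970" -> 1970).
samples_demo$year <- as.integer(sub("^ih", "", samples_demo$name))

# Keep the samples that fall inside an illustrative window of years.
selected <- samples_demo[samples_demo$year >= 1970 & samples_demo$year <= 1972, ]
selected$name
```

The same filter should apply to `metadata_nhis`, assuming its `name` column follows the `ih` plus year pattern seen above.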


## 3. IPUMS NHIS Data Extraction Specification and Submission

Once we know which samples and variables we want, we can define our data extract using the *define_extract_micro()* function from the *ipumsr* package. The subsection below lists the selected variables and describes the function's main parameters.

### 3a. Define the Data Extract

For this example we will use the 2000 NHIS sample.

**Variable Selection**
* Region of Residence (REGION)
* Health Status (HEALTH)
* Health Insurance Coverage Status (HINOTCOVE)
* Ever Told Had Hypertension on 2+ Visits (HYP2TIME)
* Ever Smoked 100 Cigarettes in Life (SMOKEV)
* Frequency of Vigorous Activity 10+ Minutes: Times per Week (VIG10FWK)

By default, the data extraction will also include a number of IPUMS preselected variables.  These variables include metainformation such as identification codes and survey weights.  We will explore and list the preselected variables after completing the data extraction.

The *define_extract_micro()* function takes the following parameters:

* **collection** Code for the IPUMS collection represented by this extract request. In our case we are downloading from IPUMS NHIS, so we use the code "nhis".
* **description** Text description of the extract.
* **samples** Vector of samples to include in the extract request. In our case we are downloading the 2000 NHIS sample (ih2000).
* **variables** Vector of variable names or a list of detailed variable specifications to include in the extract request.

In [10]:
# set up the data extraction definition
extract_definition <- define_extract_micro(collection = "nhis",
                                           description = "IPUMS NHIS Data Extraction",
                                           samples = c("ih2000"),
                                           variables = c("REGION", "HEALTH", "HINOTCOVE", "HYP2TIME", "SMOKEV", "VIG10FWK"))
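
The **variables** argument also accepts detailed specifications created with *var_spec()*. The sketch below is built offline (nothing is submitted) and restricts the extract to a single region through a case selection. The code "01" for REGION is an assumption used purely for illustration, so confirm the actual codes on the IPUMS NHIS variable page before relying on it. The object names `region_spec` and `extract_region` are also illustrative.

```r
library(ipumsr)

# Detailed specification for REGION. The case selection code "01" is an
# illustrative assumption, not a value taken from this notebook's output.
region_spec <- var_spec("REGION", case_selections = "01")

# Build a definition from detailed variable specifications.
extract_region <- define_extract_micro(
  collection = "nhis",
  description = "IPUMS NHIS extract with a REGION case selection",
  samples = "ih2000",
  variables = list(region_spec, var_spec("HEALTH"), var_spec("HINOTCOVE"))
)

# Names of the variables included in the definition
names(extract_region$variables)
```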

In [11]:
# review the extraction definition
extract_definition
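
An extract definition can also be written to disk so that the same request can be rebuilt or shared later. The sketch below assumes the *ipumsr* functions *save_extract_as_json()* and *define_extract_from_json()*. It recreates the definition from section 3a so that it runs on its own, and it writes to a temporary path chosen only for illustration.

```r
library(ipumsr)

# Rebuild the definition from section 3a (defined offline; nothing is submitted).
extract_definition <- define_extract_micro(
  collection = "nhis",
  description = "IPUMS NHIS Data Extraction",
  samples = "ih2000",
  variables = c("REGION", "HEALTH", "HINOTCOVE", "HYP2TIME", "SMOKEV", "VIG10FWK")
)

# Write the definition to a JSON file (a temporary path, for illustration).
json_path <- file.path(tempdir(), "nhis_extract_definition.json")
save_extract_as_json(extract_definition, file = json_path, overwrite = TRUE)

# Recreate the definition from the file, for example in a later session.
extract_from_json <- define_extract_from_json(json_path)
```

The recreated object can be passed to *submit_extract()* in the same way as the original definition.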

### 3b. Submit the Extract Request

In [13]:
# submit extraction request
extract_submitted <- submit_extract(extract_definition)

# wait for completion
extraction_complete <- wait_for_extract(extract_submitted)

# check completion status
extraction_complete$status

# download the extract files and store the path to the DDI codebook (.xml)
filepath <- download_extract(extract_submitted, overwrite = TRUE)

Successfully submitted IPUMS NHIS extract number 1

Checking extract status...

Waiting 10 seconds...

Checking extract status...

Waiting 20 seconds...

Checking extract status...

IPUMS NHIS extract 1 is ready to download.






DDI codebook file saved to C:/Users/vavra/Dropbox/R Spatial/r-spatial/nhis_00001.xml
Data file saved to C:/Users/vavra/Dropbox/R Spatial/r-spatial/nhis_00001.dat.gz
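
If the R session ends before an extract finishes, the extract does not need to be resubmitted. It can be looked up again by its collection and number, such as "nhis:1" for the extract submitted above. The sketch below wraps that workflow in a hypothetical helper named `fetch_extract()`. It needs network access and a stored API key, so the call itself is left commented out.

```r
library(ipumsr)

# Hypothetical helper: download an already-submitted extract by its ID,
# e.g. "nhis:1" for IPUMS NHIS extract number 1.
fetch_extract <- function(extract_id, download_dir = getwd()) {
  # look up the current definition and status of the extract
  info <- get_extract_info(extract_id)

  # if processing has not finished, poll until it does
  if (!is_extract_ready(info)) {
    info <- wait_for_extract(info)
  }

  # download the files and return the path to the DDI codebook
  download_extract(info, download_dir = download_dir, overwrite = TRUE)
}

# Requires network access and a stored API key:
# filepath <- fetch_extract("nhis:1")
```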



### 3c. Review the Extract

The data extract download will contain the following two files.

1. A [DDI (Data Documentation Initiative)](https://ddialliance.org) codebook file (file extension .xml) containing metadata and descriptive information for your data.
2. A compressed fixed-width data file (file extension .dat.gz) containing your data.

Read the DDI codebook and data files into R. *read_ipums_ddi()* parses the codebook, and *read_ipums_micro()* uses it to read the data file.

In [20]:
ddi <- read_ipums_ddi(filepath)
dat <- read_ipums_micro(ddi)

Use of data from IPUMS NHIS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
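
The DDI codebook can be inspected on its own, without loading the data file. The sketch below assumes the *ipumsr* helpers *ipums_var_info()* and *ipums_val_labels()*, and it assumes the codebook was saved as `nhis_00001.xml` in the working directory (adjust `ddi_path` to the path returned by *download_extract()*). The block does nothing if the file is not found.

```r
library(ipumsr)

# Assumed location of the codebook; adjust to your own file path.
ddi_path <- "nhis_00001.xml"

if (file.exists(ddi_path)) {
  ddi <- read_ipums_ddi(ddi_path)

  # one row per variable, including its name and label
  var_info <- ipums_var_info(ddi)
  print(var_info[, c("var_name", "var_label")])

  # value labels for a single variable
  print(ipums_val_labels(ddi, "HEALTH"))
}
```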



In [21]:
dim(dat)

In [22]:
head(dat)

YEAR,SERIAL,STRATA,PSU,NHISHID,HHWEIGHT,REGION,PERNUM,NHISPID,HHX,⋯,PERWEIGHT,SAMPWEIGHT,FWEIGHT,ASTATFLG,CSTATFLG,HEALTH,HINOTCOVE,HYP2TIME,SMOKEV,VIG10FWK
<dbl>,<dbl>,<dbl+lbl>,<dbl+lbl>,<chr>,<dbl>,<int+lbl>,<dbl>,<chr>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<int+lbl>,<int+lbl>,<int+lbl>,<int+lbl>,<int+lbl>,<int+lbl>,<dbl+lbl>
2000,1,5061,1,2000000001,2944,1,1,20000000010101,1,⋯,3568,0,3160,3,0,1,1,0,0,0
2000,1,5061,1,2000000001,2944,1,2,20000000010102,1,⋯,3372,12409,3160,1,0,1,1,0,2,3
2000,1,5061,1,2000000001,2944,1,3,20000000010103,1,⋯,3343,0,3160,0,3,1,1,0,0,0
2000,1,5061,1,2000000001,2944,1,4,20000000010104,1,⋯,3160,0,3160,0,3,1,1,0,0,0
2000,1,5061,1,2000000001,2944,1,5,20000000010105,1,⋯,3267,15986,3160,0,1,1,1,0,0,0
2000,1,5061,1,2000000001,2944,1,6,20000000010106,1,⋯,3569,0,3160,0,3,1,1,0,0,0


In [23]:
colnames(dat)

**Variable Selection**
* Region of Residence (REGION)
* Health Status (HEALTH)
* Health Insurance Coverage Status (HINOTCOVE)
* Ever Told Had Hypertension on 2+ Visits (HYP2TIME)
* Ever Smoked 100 Cigarettes in Life (SMOKEV)
* Frequency of Vigorous Activity 10+ Minutes: Times per Week (VIG10FWK)

**IPUMS Preselected Variables**
* Survey Year (YEAR)
* Sequential Serial Number, Household Record (SERIAL)
* Stratum for Variance Estimation (STRATA)
* Primary Sampling Unit (PSU) for Variance Estimation (PSU)
* NHIS Unique Identifier, Household (NHISHID)
* Household Weight, Final Annual (HHWEIGHT)
* Person Number within Family/Household (from reformatting) (PERNUM)
* NHIS Unique Identifier, Person (NHISPID)
* Household Number (from NHIS) (HHX)
* Family Number (from NHIS) (FMX)
* Person Number of Respondent (from NHIS) (PX)
* Final Basic Annual Weight (PERWEIGHT)
* Sample Person Weight (SAMPWEIGHT)
* Final Annual Family Weight (FWEIGHT)
* Sample Adult Flag (ASTATFLG)
* Sample Child Flag (CSTATFLG)
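
The split between requested and preselected variables can be recovered programmatically by comparing the extract's column names with the variables that were requested. The sketch below uses base R. The `extract_cols` vector is transcribed from the two lists above so that the block runs on its own. In a live session, `colnames(dat)` would take its place.

```r
# Variables requested in section 3a
requested <- c("REGION", "HEALTH", "HINOTCOVE", "HYP2TIME", "SMOKEV", "VIG10FWK")

# Column names of the extract, transcribed from the lists above;
# in a live session use colnames(dat) instead.
extract_cols <- c("YEAR", "SERIAL", "STRATA", "PSU", "NHISHID", "HHWEIGHT",
                  "REGION", "PERNUM", "NHISPID", "HHX", "FMX", "PX",
                  "PERWEIGHT", "SAMPWEIGHT", "FWEIGHT", "ASTATFLG", "CSTATFLG",
                  "HEALTH", "HINOTCOVE", "HYP2TIME", "SMOKEV", "VIG10FWK")

# Anything in the extract that was not requested was added by IPUMS.
preselected <- setdiff(extract_cols, requested)
preselected
```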

### 3d. Save the Data

Next let's save a couple versions of our IPUMS NHIS data file.

* A *.rds* version of the data. The **R Data Serialization (RDS)** format retains metadata, such as variable and value labels, for the next time we import the file into R. One downside of the .rds format is that it is only usable within R.
* A *.csv* version of the data. The [**Comma-Separated Values (CSV)**](https://en.wikipedia.org/wiki/Comma-separated_values) format is versatile and can be easily opened in other programs. However, the CSV file format does not include metadata such as labels for variable levels.

In [24]:
saveRDS(dat, "ipums_nhis_example.rds")
write.csv(dat, "ipums_nhis_example.csv", row.names = FALSE)
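
The difference between the two formats can be seen with a small round trip. The sketch below uses base R and a stand-in data frame named `demo` rather than the full extract. It attaches a `label` attribute to one column, on the assumption that this mirrors how variable labels are stored on the extract's columns, and then checks what survives each format.

```r
# Small stand-in for `dat`; the values come from the first rows shown above.
demo <- data.frame(YEAR = c(2000, 2000, 2000), HEALTH = c(1L, 1L, 1L))
attr(demo$HEALTH, "label") <- "Health Status"

# Save both versions to temporary files.
rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")
saveRDS(demo, rds_path)
write.csv(demo, csv_path, row.names = FALSE)

# Read both versions back in.
from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

attr(from_rds$HEALTH, "label")  # the label survives the RDS round trip
attr(from_csv$HEALTH, "label")  # NULL: the CSV stores values only
```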

## 4. IPUMS Medical Expenditure Panel Survey (MEPS) Metadata Exploration

From the [**IPUMS Medical Expenditure Panel Survey (MEPS) Webpage**](https://meps.ipums.org/meps): MEPS provides nationally representative, longitudinal data from 1996 to the present on health status, medical conditions, healthcare utilization, and healthcare expenditures for the U.S. civilian, non-institutionalized population. IPUMS MEPS harmonizes these data and allows users to create customized data extracts for analysis.

### 4a. Review the List of Samples

In [26]:
# retrieve and view the list of samples from the IPUMS MEPS database
metadata_meps <- get_sample_info("meps")

# view the dimensions of the list of samples
dim(metadata_meps)

In [27]:
# view the first few lines of the list of samples
head(metadata_meps)

name,description
<chr>,<chr>
mp1996,1996 MEPS
mp1997,1997 MEPS
mp1998,1998 MEPS
mp1999,1999 MEPS
mp2000,2000 MEPS
mp2001,2001 MEPS


In [29]:
# filter the list of samples by year
metadata_meps %>% filter(str_detect(description, "2020"))

name,description
<chr>,<chr>
mp2020,2020 MEPS


## 5. IPUMS Medical Expenditure Panel Survey (MEPS) Data Extraction Specification and Submission

Once we know which samples and variables we want, we can define our data extract using the *define_extract_micro()* function from the *ipumsr* package. The subsection below lists the selected variables and describes the function's main parameters.

### 5a. Define the Data Extract

For this example we will use the 2020 MEPS sample.

**Variable Selection**
* Health Status (HEALTH)
* Health Insurance Coverage Type (Hierarchy) (COVERTYPE)
* Annual Total of Direct Health Care Payments (EXPTOT)
* Annual Total Number of Visits Made to Office-Based Medical Providers (OBTOTVIS)
* Respondent Has Been Told They Have Diabetes (DCSDIABDX)

By default, the data extraction will also include a number of IPUMS preselected variables.  These variables include metainformation such as identification codes and survey weights.  We will explore and list the preselected variables after completing the data extraction.

The *define_extract_micro()* function takes the following parameters:

* **collection** Code for the IPUMS collection represented by this extract request. In our case we are downloading from IPUMS MEPS, so we use the code "meps".
* **description** Text description of the extract.
* **samples** Vector of samples to include in the extract request. In our case we are downloading the 2020 MEPS sample (mp2020).
* **variables** Vector of variable names or a list of detailed variable specifications to include in the extract request.

In [30]:
# set up the data extraction definition
extract_definition <- define_extract_micro(collection = "meps",
                                           description = "IPUMS MEPS Data Extraction",
                                           samples = c("mp2020"),
                                           variables = c("HEALTH", "COVERTYPE", "EXPTOT", "OBTOTVIS", "DCSDIABDX"))

In [31]:
# review the extraction definition
extract_definition

### 5b. Submit the Extract Request

In [33]:
# submit extraction request
extract_submitted <- submit_extract(extract_definition)

# wait for completion
extraction_complete <- wait_for_extract(extract_submitted)

# check completion status
extraction_complete$status

# download the extract files and store the path to the DDI codebook (.xml)
filepath <- download_extract(extract_submitted, overwrite = TRUE)

Successfully submitted IPUMS MEPS extract number 1

Checking extract status...

Waiting 10 seconds...

Checking extract status...

IPUMS MEPS extract 1 is ready to download.






DDI codebook file saved to C:/Users/vavra/Dropbox/R Spatial/r-spatial/meps_00001.xml
Data file saved to C:/Users/vavra/Dropbox/R Spatial/r-spatial/meps_00001.dat.gz



### 5c. Review the Extract

The data extract download will contain the following two files.

1. A [DDI (Data Documentation Initiative)](https://ddialliance.org) codebook file (file extension .xml) containing metadata and descriptive information for your data.
2. A compressed fixed-width data file (file extension .dat.gz) containing your data.

Read the DDI codebook and data files into R. *read_ipums_ddi()* parses the codebook, and *read_ipums_micro()* uses it to read the data file.

In [34]:
ddi <- read_ipums_ddi(filepath)
dat <- read_ipums_micro(ddi)

Use of data from IPUMS MEPS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
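
For a quick look at a large extract, it is not necessary to read every column and record. The sketch below uses the `vars` and `n_max` arguments of *read_ipums_micro()* to load a few columns and the first 1,000 records. It assumes the codebook was saved as `meps_00001.xml` in the working directory, with the data file alongside it, and it does nothing if the file is not found. The name `dat_preview` is an illustrative choice.

```r
library(ipumsr)

# Assumed location of the codebook; adjust to the path returned by download_extract().
ddi_path <- "meps_00001.xml"

if (file.exists(ddi_path)) {
  ddi <- read_ipums_ddi(ddi_path)

  # read only selected columns and the first 1,000 records
  dat_preview <- read_ipums_micro(
    ddi,
    vars = c("MEPSID", "PERWEIGHT", "HEALTH", "EXPTOT"),
    n_max = 1000
  )
  print(dim(dat_preview))
}
```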



In [35]:
dim(dat)

In [36]:
head(dat)

YEAR,PERNUM,DUID,PID,MEPSID,PANEL,PSUANN,STRATANN,PSUPLD,STRATAPLD,PANELYR,RELYR,PERWEIGHT,SAQWEIGHT,DIABWEIGHT,HEALTH,COVERTYPE,EXPTOT,OBTOTVIS,DCSDIABDX
<dbl>,<dbl>,<dbl>,<chr>,<chr>,<int+lbl>,<dbl+lbl>,<dbl>,<dbl+lbl>,<dbl+lbl>,<int>,<int+lbl>,<dbl>,<dbl>,<dbl>,<int+lbl>,<int+lbl>,<dbl>,<dbl>,<int+lbl>
2020,1,2320005,101,2320005101,23,1,20202079,1,2079,2018,3,8418.417,0.0,0.0,0,2,459,4,0
2020,2,2320005,102,2320005102,23,1,20202079,1,2079,2018,3,5199.932,0.0,0.0,0,2,564,0,0
2020,1,2320006,101,2320006101,23,1,20202028,1,2028,2018,3,2139.84,0.0,0.0,0,4,140,1,0
2020,2,2320006,102,2320006102,23,1,20202028,1,2028,2018,3,2216.009,4082.83,0.0,1,4,4673,0,0
2020,3,2320006,103,2320006103,23,1,20202028,1,2028,2018,3,4157.286,0.0,0.0,0,2,410,0,0
2020,1,2320012,102,2320012102,23,2,20202069,2,2069,2018,3,1960.941,2308.142,2950.122,2,2,2726,8,2


In [37]:
colnames(dat)

**Variable Selection**
* Health Status (HEALTH)
* Health Insurance Coverage Type (Hierarchy) (COVERTYPE)
* Annual Total of Direct Health Care Payments (EXPTOT)
* Annual Total Number of Visits Made to Office-Based Medical Providers (OBTOTVIS)
* Respondent Has Been Told They Have Diabetes (DCSDIABDX)

**IPUMS Preselected Variables**
* Survey Year (YEAR)
* Person Number within Family/Household (from reformatting) (PERNUM)
* Dwelling Unit ID (DUID)
* Person Number (PID)
* MEPS Unique Identifier (IPUMS Generated) (MEPSID)
* Panel (PANEL)
* Annual Primary Sampling Unit (PSU) for Variance Estimation (PSUANN)
* Annual Stratum for Variance Estimation (STRATANN)
* Pooled Primary Sampling Unit (PSU) for Variance Estimation (PSUPLD)
* Pooled Variance Stratum (STRATAPLD)
* Year Entered MEPS (PANELYR)
* Relative Year 1 or 2 in Panel (RELYR)
* Final Basic Annual Weight (PERWEIGHT)
* Self-Administered Questionnaire Weight (SAQWEIGHT)
* Diabetes Care Weight (DIABWEIGHT)
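
The preselected weights matter as soon as the data are summarized. As a rough illustration, the sketch below compares an unweighted and a weighted mean of EXPTOT using base R. It uses only the six records printed above, so the numbers are not meaningful estimates. The names `demo`, `mean_unweighted`, and `mean_weighted` are illustrative.

```r
# First six records from the output above (not a representative sample).
demo <- data.frame(
  PERWEIGHT = c(8418.417, 5199.932, 2139.840, 2216.009, 4157.286, 1960.941),
  EXPTOT    = c(459, 564, 140, 4673, 410, 2726)
)

# The unweighted mean treats every respondent equally.
mean_unweighted <- mean(demo$EXPTOT)

# The weighted mean counts each respondent in proportion to PERWEIGHT.
mean_weighted <- weighted.mean(demo$EXPTOT, w = demo$PERWEIGHT)

c(unweighted = mean_unweighted, weighted = mean_weighted)
```

Standard errors would additionally draw on the variance estimation variables listed above, such as STRATANN and PSUANN. That step is beyond the scope of this notebook.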

### 5d. Save the Data

Next let's save a couple versions of our IPUMS MEPS data file.

* A *.rds* version of the data. The **R Data Serialization (RDS)** format retains metadata, such as variable and value labels, for the next time we import the file into R. One downside of the .rds format is that it is only usable within R.
* A *.csv* version of the data. The [**Comma-Separated Values (CSV)**](https://en.wikipedia.org/wiki/Comma-separated_values) format is versatile and can be easily opened in other programs. However, the CSV file format does not include metadata such as labels for variable levels.

In [39]:
saveRDS(dat, "ipums_meps_example.rds")
write.csv(dat, "ipums_meps_example.csv", row.names = FALSE)

## Recommended Next Steps
* **Continue with Chapter 2: IPUMS Data Acquisition and Extraction**
  * 2.1: IPUMS USA Data Extraction Using ipumsr
  * 2.2: IPUMS CPS Data Extraction Using ipumsr
  * 2.3: IPUMS International Microdata Extraction Using ipumsr
  * 2.4: IPUMS NHGIS Data Extraction Using ipumsr
  * 2.5: IPUMS Time Use Data Extraction Using ipumsr
  * 2.7: Reading IPUMS Global Health Data Extracts Using ipumsr
  * 2.8: Reading IPUMS Higher Education Data Extracts Using ipumsr