## Creating an IPUMS CPS extract using the IPUMS Microdata Extract API and ipumspy

This notebook contains sample code for creating, submitting, downloading, and reading an IPUMS CPS extract via the IPUMS Microdata Extract API using the `ipumspy` Python library.

To be able to run the code in this notebook, you will need 1) an [IPUMS CPS user account](https://cps.ipums.org/cps-action/menu), 2) an [IPUMS API key](https://account.ipums.org/api_keys), and 3) to have version 0.2.1 of `ipumspy` installed.

First, we will import all the necessary libraries.

In [1]:
import os

import pandas as pd

from ipumspy import IpumsApiClient, CpsExtract, readers

Next, we will pass our API key to the `IpumsApiClient`. In this example, I have stored my API key as an environment variable in my conda environment. This is considered best practice, but you may instead replace `my_api_key` in the second line of the next code block with your API key string in quotes.

In [2]:
my_api_key = os.getenv("IPUMS_MICRODATA_API_KEY")
ipums = IpumsApiClient(my_api_key)
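`os.getenv` returns `None` when the variable is not set, so a missing key would be handed to the client unnoticed. The sketch below is one way to fail early with a readable message instead. The helper name `load_api_key` is hypothetical and not part of `ipumspy`.

```python
import os


def load_api_key(env_var="IPUMS_MICRODATA_API_KEY"):
    """Return the API key stored in `env_var`, or raise a clear error.

    Hypothetical helper for this notebook; it is not part of ipumspy.
    """
    key = os.getenv(env_var)
    if not key:
        # Covers both an unset variable (None) and an empty string.
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "set it to your IPUMS API key before running the notebook."
        )
    return key
```

With this helper, the client could be created as `ipums = IpumsApiClient(load_api_key())`.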

Next, we will define an extract. For the purposes of this demonstration, we will choose only one sample and a small number of variables. An extract is defined by an IPUMS collection ID, a list of sample IDs, and a list of variable names. You may also specify an extract description; this is not required, but it is highly recommended! In the code chunk below, using the `CpsExtract` class indicates that we wish to make an IPUMS CPS extract. Note that IPUMS does not currently offer a metadata API, so if you do not know the sample IDs and variable names that you want to include, you can find them by browsing the list of IPUMS CPS sample IDs and the list of available IPUMS CPS variables.

In [3]:
extract = CpsExtract(["cps2022_05b"],
                     ["AGE", "SEX", "RACE"],
                     description="My first API extract!")
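Because an extract is defined by plain lists of sample IDs and variable names, it can help to keep those lists in named variables so that later additions are a one-line change. The following is a minimal sketch. The helper `with_additions` and the placeholder variable name `"HYPOTHETICAL_VAR"` are illustrative assumptions, not IPUMS or `ipumspy` names.

```python
def with_additions(existing, additions):
    """Return `existing` plus any new items, keeping order and skipping duplicates."""
    merged = list(existing)
    for item in additions:
        if item not in merged:
            merged.append(item)
    return merged


# Sample ID and variable names taken from the extract definition above.
samples = ["cps2022_05b"]
variables = ["AGE", "SEX", "RACE"]

# "HYPOTHETICAL_VAR" is a placeholder, not a real IPUMS CPS variable name.
# "AGE" is already present, so it is not added a second time.
variables = with_additions(variables, ["HYPOTHETICAL_VAR", "AGE"])

# The lists can then be passed in the same positions as before, e.g.:
# extract = CpsExtract(samples, variables, description="My first API extract!")
```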

The next step is to submit our `extract` object to the IPUMS Microdata Extract API using the `ipums` API client instance.

In [4]:
submitted_extract = ipums.submit_extract(extract)

Now that the extract has been submitted to the IPUMS extract system, we can check its status and see that it has been received and is in line to be processed.

In [5]:
ipums.extract_status(submitted_extract)

'queued'

To save ourselves a few lines of code, we can use the `wait_for_extract()` method to let us know when the extract has been completed and is ready for download. Once the extract is completed, we can use the `download_extract()` method to download the data file and its accompanying DDI codebook into our current working directory. The code block below also demonstrates the `collection` and `extract_id` attributes of our IPUMS CPS extract. Each extract a user submits is assigned its own unique ID number by the IPUMS extract system. This number can be used to easily read in your downloaded files, or to re-download the extract at a later time if needed.

In [6]:
ipums.wait_for_extract(submitted_extract)
print(f"{submitted_extract.collection} number {submitted_extract.extract_id} is complete!")
ipums.download_extract(submitted_extract)

cps number 4 is complete!
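To illustrate the lines of code that `wait_for_extract()` saves, here is a rough sketch of a hand-written polling loop built on a status check. It is not how `ipumspy` implements waiting. The function name `poll_until_done` is hypothetical, and the status strings `"completed"` and `"failed"` are assumptions, since only `'queued'` appears in this notebook.

```python
import time


def poll_until_done(get_status, done_states=("completed",), failed_states=("failed",),
                    interval=10.0, timeout=600.0, sleep=time.sleep):
    """Call `get_status()` repeatedly until it reports a done or failed state.

    The default state names are assumptions; adjust them to the status
    strings the API actually returns.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status in done_states:
            return status
        if status in failed_states:
            raise RuntimeError(f"extract ended with status {status!r}")
        if waited >= timeout:
            raise TimeoutError(f"still {status!r} after {waited:.0f} seconds")
        # Pause before asking again so the API is not flooded with requests.
        sleep(interval)
        waited += interval
```

In this notebook it could be called as `poll_until_done(lambda: ipums.extract_status(submitted_extract))`. Passing the status check in as a function keeps the loop easy to try out without a network connection.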


Now that the necessary files are downloaded, we can use the `readers` module to parse the DDI codebook and read the extract data file into a pandas DataFrame. The first line uses the `submitted_extract` object's attributes to build the correct file name without requiring any intermediate steps. It follows the naming convention for IPUMS extract files: `[lowercase collection name]_[extract ID number, left-padded to 5 digits]`.

In [7]:
extract_file_name = f"{submitted_extract.collection}_{str(submitted_extract.extract_id).zfill(5)}"

ddi = readers.read_ipums_ddi(f"{extract_file_name}.xml")

df = readers.read_microdata(ddi, f"{extract_file_name}.dat.gz")

df.head()

See the `ipums_conditions` attribute of this codebook for terms of use.
See the `ipums_citation` attribute of this codebook for the appropriate citation.


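|   | YEAR | SERIAL | MONTH | HWTFINL | CPSID | PERNUM | WTFINL | CPSIDP | AGE | SEX | RACE |
|---|------|--------|-------|---------|-------|--------|--------|--------|-----|-----|------|
| 0 | 2022 | 1 | 5 | 1892.0459 | 20220500000100 | 1 | 1892.0459 | 20220500000101 | 49 | 2 | 100 |
| 1 | 2022 | 3 | 5 | 1962.718 | 20220500000300 | 1 | 1962.718 | 20220500000301 | 29 | 1 | 100 |
| 2 | 2022 | 3 | 5 | 1962.718 | 20220500000300 | 2 | 2450.4601 | 20220500000302 | 25 | 2 | 100 |
| 3 | 2022 | 3 | 5 | 1962.718 | 20220500000300 | 3 | 1885.9137 | 20220500000303 | 30 | 1 | 100 |
| 4 | 2022 | 6 | 5 | 1785.298 | 20220300001000 | 1 | 1630.787 | 20220300001001 | 80 | 1 | 200 |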
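The file naming convention described above can also be wrapped in a small function, which is handy for re-reading a previously downloaded extract by its ID number. This is a minimal sketch. The helper name `extract_file_stem` is hypothetical, and the `.xml` and `.dat.gz` suffixes are the ones used in the cell above.

```python
def extract_file_stem(collection, extract_id, width=5):
    """Build the base file name: lowercase collection, underscore, zero-padded ID."""
    return f"{collection.lower()}_{str(extract_id).zfill(width)}"


# Extract number 4 from the cps collection, as in the output above.
stem = extract_file_stem("cps", 4)
ddi_path = f"{stem}.xml"       # DDI codebook
data_path = f"{stem}.dat.gz"   # compressed data file
```

Here `ddi_path` and `data_path` could be passed to `readers.read_ipums_ddi()` and `readers.read_microdata()` in the same way as in the cell above.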


Ta Da! Now we have an IPUMS CPS extract all ready for analysis in Python! Forgot a variable? Just add it to the cell where the extract is defined and re-run the notebook! Want to do the same analysis next month when the most recent data is available? Just add the newest sample ID to that same cell and re-run! As you can imagine, getting IPUMS extracts via API opens up lots of interesting possibilities for efficient and reproducible workflows!