Interpolating Population Distributions using Public-use Data: An Application to Income Segregation using American Community Survey Data
This Github repository is a companion to the paper given by the title above. It contains code and instructions for reproducing all of the key results in the paper, including instructions for using the code to fit the models discussed in the paper to other data.
We include instructions to reproduce all analyses performed in the paper. This can be broken down into the following major steps:
- Download and clean the data
- Create the synthetic population
- Create the synthetic datasets
- Run the simulation study
- Compile the results of the simulation study
- Run the simulation study
- Create the synthetic datasets
- Fit models to the ACS PUMA-level data
- Exploratory analysis
- Compile the results of fitting models to ACS PUMA-level data
- Fit models to the ACS Metro-level data and construct the income
segregation indicies
- Fit EIV regressions using income segregation indicies
- Create the synthetic population
Steps that are indented below a previous step depend on that step. So every step below Download and clean the data depends on the that step being performed first, but once the data has been cleaned it does not matter whether Exploratory Analysis is performed first.
The following subsections will explain in detail how to perform each of these steps. Each subsection also lists all dependencies needed to complete it, including software dependencies and dependencies on previous steps, and contains an estimate for how long it will take to run that step. Most of these steps can be run on a moderately powerful laptop or desktop, though one step requires a powerful compute node.
Most of the data used in the paper is downloaded from the Census Bureau using the tidycensus R package, and cleaned using the tidyverse R package. This is done in three files:
/code/acsclean.R
: Download and clean the data for the PUMA-level example in Section 3.2/code/metroclean.R
: Download and clean the data for the metro-level example in Section 4./code/metroraceclean.R
: Download and clean the data for the metro-level by race example in Section 4. Each script will save several files in the/code/data
director. The metro scripts take some time to run, be patient. To use these script, you need three things:- A Census API key, which may be obtained here
- A file that maps Census tracts to PUMAs, available here from the Census Bureau, but also included in this repository:
/code/data/tract2puma2010.txt
. - A file that maps Census tracts to metro areas, available here from the Census Bureah, but also included in this repository:
/code/data/metro_deliniation_4-2018.csv
Dependencies:
- The file
/code/clean_functions.R
. R
version 4.1.0 and the following packages (and all dependencies of these packages):tidyverse
version 1.3.1tidycensus
version 1.1 Newer versions of these packages should work.
Time to complete: 5 minutes for acsclean.R
. Several hours for each of the other scripts.
To complete this step, the R
script /code/simulation/sim_setup.R
needs to be run. This script creates the synthetic population used in the simulation study. The synthetic population is saved as /code/simulation/population.RData
. Our R script will not reproduce our population in R
version 4.1, porbably due to the change to the random number generator in version 3.6. Normally, the line RNGkind(sample.kind = "Rounding")
would solve the problem, but it does not for our script. This is probably due to how packages we rely on use the random number generator. For this reason as well as convenience we include the population.RData
file in this Github repository. Additionally, this script creates the plots in Figure A.2.
Dependencies:
- previous steps
- Download and clean the data
- Download the 2014 Missouri household PUMS from here, and place in
/code/data/MO/
. This file is not included in the repository because it is too large. Additionally the tidycensus R functionget_pums
can download PUMS files, though we had trouble using it to download this specific file. - Download the 2014 Missouri shapefiles from here, and place in
/code/data/MO/shapefile/
. We have included these files in the repository for convenience, and because the names of these files may be different than what is expected by our scripts. R
version 4.1.0 and the following packages (and all dependencies of these packages):tidyverse
version 1.3.1tidycensus
version 1.1sf
version 0.9.8sp
version 1.4-5maptools
version 1.1-1reldist
version 1.6-6gridExtra
version 2.3ggthemes
version 4.2.4broom
version 0.7.7
The above versions are sufficient to run the script, but to replicate population.RData
, R version < 3.6 and corresponding packages are needed. See the apragraph at the beginning of this section. Unfortunately, we do not know the precise versions of the other packages necessary to completely reproduce the population, but we include the population in the repository for convenience.
Time to complete: 5 minutes.
To complete this step, the R
script /code/simulation/sim_create_sample.R
needs to be run. This script repeatedly does the following 1000 times:
- Sample from the synthetic population using a stratified random sample.
- Create synthetic tract-level estimates and their associated standard errors.
- Subsample the initial sample to create a synthetic PUMS.
The results of each iteration of this process are compiled and stored in the list
samples.RData
which is saved in this directory. This file is too large to store on Github, so it is not included in this repository. Additionally, the datasets are processed into the correct form for the Stan model files, and stored in the liststandats.RData
. This file is also too large to store on Github, so it is not included in this repository.
Dependencies:
- previous steps
- Download the data
- Clean the data
- Create the synthetic population
R
version 4.1.0 and the following packages packages (and all dependencies of these packages):rstan
version 2.21.2sp
version 1.4.5maptools
version 1.1-1reldist
version 1.6-6survey
version 4.0Hmisc
version 4.5-0dplyr
version 1.0.6
Time to complete: about 30 minutes.
Warning: This step is very computationally intensive and is intended to use multiple cores.
The code used to run this section is split into three files that are intended to be run simultaneously on a 12 core machine, each using 4 cores: fit1.R
, fit2.R
, and fit3.R
, all in /code/simulation
. The outputted files are stat.out1.RData
, stat.out2.RData
, and stat.out3.RData
, all of which are too large to be included in the Github repository.
Dependencies:
- previous steps
- Download and clean the data
- Create the synthetic population
- Create the synthetic datasets
R
version 4.10 and the following packages packages (and all dependencies of these packages):rstan
version 2.21.2reldist
version 1.6-6parallel
base packageMASS
version 7.3-54data.table
version 1.14.0abind
version 1.4-5tidyverse
version 1.3.1
Time to complete: about 2 weeks on a moderately powerful desktop computer with 6 physical cores (12 logical cores) - running fit1.R
, fit2.R
, and fit3.R
simultaneously.
This step compiles consists of several parts: 1) fit the orignal PRLN to the simulation study, then 2) compile the results of the sumulation study. The R
script /code/simulation/fitprln.R
fits the original PRLN to each sample from the population and saves results into prln.out.RData
. This file is not included in this repository. The R
script /code/simulation/postprocess.R
then compiles the results and creates tables 1, 2, and 3. Tables are created by xtable
and are manually copy/pasted into the appropriate .tex
file with minor editing.
Dependencies for fitprln.R
:
- previous steps
- Download and clean the data
- Create the synthetic population
- Create the synthetic datasets
- Run the simulation study
R
version 4.1 (no packages needed)
Time to complete: about 5 minutes.
Dependencies for postprocess.R
:
- previous steps
- Download and clean the data
- Create the synthetic population
- Create the synthetic datasets
- Run the simulation study
- Run
fitprln.R
, earlier this step.
R
version 4.1 and the following packages packages (and all dependencies of these packages):sp
version 1.4-5maptools
version 1.1-1Hmisc
version 4.5-0survey
version 4.0reldist
version 1.6-6dplyr
version 1.0.6MASS
version 7.3-54data.table
version 1.14.0rstan
version 2.21.2abind
version 1.4-5tidyverse
version 1.3.1parallel
base packagextable
version 1.8-4reshape2
verseion 1.4.4
Time to complete: about 5 minutes.
To complete this step, only one R
script needs to be run: /code/acspuma/explore.R
. This creates several of the tables and figures in Appendix A. Tables are created by xtable
and are manually copy/pasted into the appropriate .tex
file with minor editing. The plots are saved to /doc
. These plots are included in the Github repository for convenience.
Dependencies:
- previous steps
- Download and clean the data
R
version 4.1.0 and the following packages packages (and all dependencies of these packages):tidyverse
version 1.3.1sf
version 0.9.8ggthemes
version 4.2.4xtable
version 1.8-4
Time to complete: 1 minute.
The following scripts fit the tract-level models to each PUMA:
/code/acspuma/fitco.R
/code/acspuma/fitil.R
/code/acspuma/fitmo.R
/code/acspuma/fitmt.R
/code/acspuma/fitny.R
Each model is fit using 4 cores, one per Markov chain. The MCMC draws are saved separately for each each PUMA in the folder/code/acspuma
. These files are too large to include in the Github repository. Each model takes a about 5-10 minutes to fit.
Dependencies:
- previous steps
- Download and clean the data
R
version 4.1 and the following packages packages (and all dependencies of these packages):rstan
version 2.21.2tidyverse
version 1.3.1
Time to complete: < 1 hour. On a machine with many cores, the scripts can be run simultaneously to save time.
This section compiles the results of fitting all of the models to the ACS data, fits the original Pareto Linear procedure to the ACS data, then puts it all into the appropriate tables. Tables D.1-D.5 are created by xtable
and are manually copy/pasted into the appropriate .tex
file with minor editing. The script /code/acs_puma/postprocess.R
accomplishes all of this.
Dependencies:
- previous steps
- Download and clean the data
- Fit models to the ACS PUMA-level data
R
version 4.1 and the following packages packages (and all dependencies of these packages):rstan
version 2.21.2tidyverse
version 1.3.1sf
version 0.9.8xtable
version 1.8-4sp
version 1.4-5maptools
version 1.1-1reldist
version 1.6-6survey
version 4.0Hmisc
version 4.5-0parallel
base packageMASS
version 7.3-54data.table
version 1.14.0abind
version 1.4-5reshape2
verseion 1.4.4
Time to complete: 2-3 hours. On a non-windows machine much can be parallelized using the ncore
parameter to save time.
This section fits all of the tract-level models to each tract for each metro area, then constructs the information theory and divergence indicies for the analysis in Section 4. This is performace for 1) all households, 2) black households only, and 3) white households only, for each tract in each metro area. This process is very memory and computation intensive. We divided this into 12 distinct jobs that could be run in parallel, each job with its own compute node with 28 cores and 8GB of memory per core. The shell scripts to schedule the jobs are included in the repository, though the details of how to set these up will depend on the particulars of the server you have access to:
code/metrorace/metro_fit_2018_black_1.sh
code/metrorace/metro_fit_2018_black_2.sh
code/metrorace/metro_fit_2018_black_3.sh
code/metrorace/metro_fit_2018_black_4.sh
code/metrorace/metro_fit_2018_combined_1.sh
code/metrorace/metro_fit_2018_combined_2.sh
code/metrorace/metro_fit_2018_combined_3.sh
code/metrorace/metro_fit_2018_combined_4.sh
code/metrorace/metro_fit_2018_white_1.sh
code/metrorace/metro_fit_2018_white_2.sh
code/metrorace/metro_fit_2018_white_3.sh
code/metrorace/metro_fit_2018_white_4.sh
Each job corresponds to a particularR
script:code/metrorace/metro_fit_2018_black_1.R
code/metrorace/metro_fit_2018_black_2.R
code/metrorace/metro_fit_2018_black_3.R
code/metrorace/metro_fit_2018_black_4.R
code/metrorace/metro_fit_2018_combined_1.R
code/metrorace/metro_fit_2018_combined_2.R
code/metrorace/metro_fit_2018_combined_3.R
code/metrorace/metro_fit_2018_combined_4.R
code/metrorace/metro_fit_2018_white_1.R
code/metrorace/metro_fit_2018_white_2.R
code/metrorace/metro_fit_2018_white_3.R
code/metrorace/metro_fit_2018_white_4.R
Each script runs a single chain for each tract in each metro area for each group of households. Then all the chains for the tracts in a particular metro area for a particular group of households are jointly used to contruct the information theory and divergence indicies along with their standard errors for each iteration of the chains. These are saved in the following filescode/metrorace/metro_kls_2018_black_1.RData
code/metrorace/metro_kls_2018_black_2.RData
code/metrorace/metro_kls_2018_black_3.RData
code/metrorace/metro_kls_2018_black_4.RData
code/metrorace/metro_kls_2018_combined_1.RData
code/metrorace/metro_kls_2018_combined_2.RData
code/metrorace/metro_kls_2018_combined_3.RData
code/metrorace/metro_kls_2018_combined_4.RData
code/metrorace/metro_kls_2018_white_1.RData
code/metrorace/metro_kls_2018_white_2.RData
code/metrorace/metro_kls_2018_white_3.RData
code/metrorace/metro_kls_2018_white_4.RData
These files are too large to include in the repository.
Dependencies:
- Note the computational requirements in the description above
- previous steps
- Download and clean the data
R
version 4.1 and the following packages packages (and all dependencies of these packages):rstan
version 2.21.2tidyverse
version 1.3.1foreach
version 1.5.1doParallel
version 1.016
Time to complete: Each job takes up to 7 days to complete on a powerful compute node with 28 cores and >= 8GB memory per core, though the jobs can be run in parallel.
This section, 1) creates figures H.4, H.5, and H.6, then 2) estimates gini coefficients using latent PRLN for black households only and white households only for each tract in each metro area, then 3) fits the EIV regressions from section 4 for all households, black households only, and white households only, and finally 4) uses the results to construct tables H.1, H.2, H.3, H.4, H.5, and H.6, and tables 4 and 5. Fitting the EIV regressions is memory intensive, requiring >=32 GB of memory.
Dependencies:
- 32 GB of memory
- previous steps
- Download and clean the data
- Fit models to the ACS Metro-level data and construct the income segregation indicies
R
version 4.1 and the following packages packages (and all dependencies of these packages):rstan
version 2.21.2tidyverse
version 1.3.1sf
version 0.9.8xtable
version 1.8-4reldist
version 1.6-6
Time to complete: About 24 hours and a moderately powerful desktop computer (with 32 GB of memory).