# I. Introduction
# II. Preparing Data
# III. Estimates
# IV. Standard Error of Estimates
# V. Applications

# I. Introduction

## a. ACS and PUMS

The American Community Survey (ACS), produced by the U.S. Census Bureau, is an incredibly popular data product with a wide range of applications. The ACS surveys U.S. residents on a rolling basis, aiming to sample at least 1% of the U.S. population each year.

Individual survey responses to the ACS are aggregated into particular geographies. The smallest geographies for which ACS estimates are published are Census Block Groups and Census Tracts. Block Groups generally contain between 600 and 3,000 people; Census Tracts are made up of Block Groups and generally contain between 1,200 and 8,000 people.

Importantly, the ACS does not primarily provide characteristics of people; it provides characteristics of geographies. 

The Public Use Microdata Sample (PUMS) dataset, on the other hand, is a disaggregated form of the ACS. While the ACS provides characteristics of geographies (such as the median income or racial composition of the geography), PUMS provides the actual, individual responses to the ACS (with some limitations, described below).

The responses represented in PUMS are identified only by a geography known as a Public Use Microdata Area (PUMA). Each PUMA contains at least 100,000 people.

## b. Advantages of PUMS

Because PUMS represents actual survey responses, it is hypothetically possible to build cross-tabulations that are not possible with normal ACS data. For instance, with normal ACS data one is able to identify the median income and racial composition of a geography, but it is not necessarily possible to derive the median income by race for that geography.

Because PUMS are actual survey responses, one can build any number of cross-tabulations of individual characteristics by PUMA. For example, PUMS may allow an estimate of the median income by race and age for a PUMA.
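
As a minimal sketch of such a cross-tabulation, the toy example below estimates population by race within each PUMA by summing person weights. The column names (`PUMA10`, `RAC1P`, `PWGTP`) follow the PUMS data dictionary, but the records and values are invented for illustration; a real estimate would use the full person file.

```r
# Toy person-level records (all values invented for illustration)
toy_pums <- data.frame(
  PUMA10 = c(5201, 5201, 5201, 5202, 5202),  # PUMA code
  RAC1P  = c(1, 1, 2, 1, 2),                 # recoded race category
  PWGTP  = c(10, 20, 15, 12, 8)              # person weight
)

# Weighted population estimate for each PUMA-by-race cell:
# sum the person weights of the records that fall in that cell
pop_by_puma_race <- xtabs(PWGTP ~ PUMA10 + RAC1P, data = toy_pums)
print(pop_by_puma_race)
```

Each cell is a weighted count rather than a count of survey records, because each response stands in for several people in the population.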

## c. Limitations of PUMS

There are three primary limitations of PUMS: (1) privacy protection measures limit the utility of the data; (2) the PUMS data file is cumbersome and not easy to interpret; and (3) deriving useful information from PUMS requires complex calculations due to sampling methodologies. Each of these limitations is discussed below.

### (1) Privacy limitations

The Census Bureau takes two primary measures to protect respondent identities: top-coding or bottom-coding survey responses, and aggregating responses to the large PUMA geographies.
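
To illustrate the idea of top-coding, the sketch below caps a toy income vector at a threshold. The threshold and incomes are hypothetical, and this is a simplification: the Bureau's actual thresholds and replacement values are documented with each PUMS release and may differ from a simple cap.

```r
# Hypothetical top-code threshold (not an actual Census Bureau value)
top_code <- 100000

# Toy reported incomes (invented for illustration)
reported_income <- c(25000, 80000, 150000, 400000)

# Any value above the threshold is replaced by the threshold itself,
# so very high (and potentially identifying) incomes are hidden
published_income <- pmin(reported_income, top_code)
print(published_income)
```

Bottom-coding works the same way in the other direction and could be sketched with `pmax()` and a lower threshold.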



## d. Other notes on PUMS data

PUMS data is divided into two categories: household and person data. Household data contains characteristics of the survey respondents' living quarters (such as rent/mortgage costs, utility costs, number of bathrooms, etc.). Person data contains characteristics of each person in each household.
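
The two record types can be linked through the household serial number. The sketch below joins toy household and person tables on `SERIALNO`; the variable names follow the PUMS data dictionary, while the rows are invented for illustration.

```r
# Toy household records: one row per housing unit
toy_households <- data.frame(
  SERIALNO = c(1001, 1002),
  RNTP     = c(950, NA)      # monthly rent (NA for a unit that is not rented)
)

# Toy person records: one row per person, keyed to a household
toy_persons <- data.frame(
  SERIALNO = c(1001, 1001, 1002),
  SPORDER  = c(1, 2, 1),     # person number within the household
  AGEP     = c(34, 6, 71)    # age
)

# Attach household characteristics to every person in that household
persons_with_housing <- merge(toy_persons, toy_households, by = "SERIALNO")
print(persons_with_housing)
```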


# II. Preparing Data

## a. Downloading Data

The following is a methodology for downloading PUMS data. This downloads the 2015 5-year person-level data for Texas. The URL in URL.PUMS.PTX can be changed for different years or surveys (such as 1-year or 3-year surveys).

In [None]:
#Set URL for PUMS data
URL.PUMS.PTX <- "https://www2.census.gov/programs-surveys/acs/data/pums/2015/5-Year/csv_ptx.zip" 

#Set download destination 
destfile.PUMS.PTX <- "csv_ptx.zip"

#Download PUMS to destination (in working directory)
download.file(URL.PUMS.PTX, destfile.PUMS.PTX)
print("PUMS data downloaded")

#Unzip file
unzip(destfile.PUMS.PTX)

#List to identify .csv name
unzip(destfile.PUMS.PTX, list = TRUE)

#Read csv into data frame

PUMS.TX15 <- read.csv(file = "ss15ptx.csv", header = TRUE)
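
To switch years, survey periods, or states without editing the URL by hand, the URL can be assembled from its parts. The helper below, `pums_url`, is hypothetical, and its path pattern is inferred from the single URL used above; whether the same pattern holds for other years or surveys is an assumption that should be checked against the Census Bureau's file listing.

```r
# Hypothetical helper: builds a PUMS download URL from its parts.
# The path pattern is inferred from the one URL used in this section.
pums_url <- function(year, period, record_type, state_abbr) {
  sprintf(
    "https://www2.census.gov/programs-surveys/acs/data/pums/%d/%s/csv_%s%s.zip",
    year, period, record_type, tolower(state_abbr)
  )
}

# Person-level ("p") 2015 5-year file for Texas
url_tx_2015 <- pums_url(2015, "5-Year", "p", "TX")
print(url_tx_2015)
```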

[TO BE COMPLETED LATER] The following methodology is for downloading PUMA shapefiles. 

In [None]:
#Set URL for PUMA shapefiles
URL.PUMA.TX <- "ftp://ftp2.census.gov/geo/tiger/TIGER2015/PUMA/tl_2015_48_puma10.zip"

#Create directory
dir.create("tl_2015_48_puma10")

#Set download destination 
destfile.PUMA.TX <- "tl_2015_48_puma10.zip"

#Download PUMA to destination (in working directory)
download.file(URL.PUMA.TX, destfile.PUMA.TX)
print("PUMA shapefile downloaded")

#Unzip file
unzip(destfile.PUMA.TX, exdir = "tl_2015_48_puma10")

## b. Identifying Relevant Geographies

[NOTE: this section assumes you have already downloaded and unzipped the PUMA shapefiles into a folder titled "tl_2015_48_puma10" in your working directory. Also, credit to John Gates for introducing and writing much of this code.]

Install necessary packages, load libraries:

In [None]:
install.packages('rgdal')
install.packages('sp')
install.packages('leaflet')
library(rgdal)
library(sp)
library(leaflet)

Read data:
(For ease of use, this example uses the State of Texas, though the entire U.S. could in principle be used.)

In [None]:
shp <- readOGR(dsn="tl_2015_48_puma10") #note: folder not .shp file
tx_shp <- subset(shp, shp$STATEFP10 %in% c("48"))

Transform shapefile into appropriate projection:

In [None]:
my.projection <- sp::CRS('+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0')
tx_shp_t <- sp::spTransform(tx_shp, my.projection)

Save the shapefile to your local machine, then load it:

In [None]:
save(tx_shp_t, file = "tx_shp_t.Rdata")
load("tx_shp_t.Rdata")
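
As a small illustration of how `save()` and `load()` behave, the sketch below round-trips a toy object through a temporary file. The object and file name are invented for illustration; the point is that `load()` needs the full file name and restores the object under its original name.

```r
# Toy object standing in for the transformed shapefile
toy_object <- data.frame(PUMACE10 = c("05201", "05202"))

# Save to a temporary file, then remove the object from the workspace
rdata_path <- file.path(tempdir(), "toy_object.Rdata")
save(toy_object, file = rdata_path)
rm(toy_object)

# load() restores the object and returns the names of what it restored
restored_names <- load(rdata_path)
print(restored_names)
```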

Plot map:

In [None]:
leaflet(tx_shp_t) %>%
  addTiles() %>%
  addPolygons(
    stroke = TRUE,
    fillColor = "transparent",
    label = tx_shp_t$PUMACE10, # PUMA code shown as the label on cursor hover
    labelOptions = labelOptions(noHide = TRUE) # noHide = TRUE also keeps labels visible without hovering
  )

If the plot is working correctly, you should be able to hover your cursor over geographies of interest to identify their PUMA number. We will use this in the next section to aggregate geographies of interest.

## c. Preparing Geographies for Analysis

### Option 1: Geographies of Interest

I am interested in the PUMAs that are contained within the jurisdiction of the Capital Area Metropolitan Planning Organization (CAMPO). CAMPO oversees a six-county region including Williamson, Travis, Hays, Bastrop, Burnet and Caldwell counties. Using the map plotted in the previous section (and knowledge of CAMPO boundaries), I identified the relevant PUMA codes.

Install and load the necessary library:

In [None]:
install.packages('dplyr')
library(dplyr)

Set geography, filter PUMS data based on that geography:

In [None]:
campo_pumas <- c("5201", "5202", "5203","5204", 
                 "5305", "5302", "5301", "5306",
                 "5303", "5308", "5307", "5309",
                 "5304", "5400")

campo_pums <- PUMS.TX15 %>%
  filter(PUMA10 %in% campo_pumas | PUMA00 %in% campo_pumas)

Note: PUMA geographies are subject to revision after each U.S. decennial census. PUMS data takes this into account, noting the year in which the respondent's survey was completed.

The PUMS file used in this exercise covers a 5-year span in which each survey was completed under either the pre- or post-2010 PUMA geography. The PUMS data reflects this with two variables, PUMA00 and PUMA10, which identify whether the survey was submitted within the 2000 or the 2010 PUMA. The PUMA identifiers stay the same but appear in one variable or the other depending upon when the survey was completed. Hence, we filter on PUMA10 or PUMA00.
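
The effect of filtering on either variable can be seen on a toy table. The sketch below uses base R so that it runs without extra packages, and it applies the same condition as the dplyr filter above. The rows are invented, and the use of -9 as the "not applicable" code in the unused variable is an assumption to check against the PUMS data dictionary.

```r
# Toy records: each row carries its PUMA code in exactly one of the two
# variables; -9 is assumed here to mark "not applicable"
toy_pums <- data.frame(
  SERIALNO = c(1, 2, 3, 4),
  PUMA00   = c(5201, -9, -9, 4600),
  PUMA10   = c(-9, 5201, 5302, -9)
)

# Hypothetical subset of PUMAs of interest
toy_pumas <- c("5201", "5302")

# Keep a record if either variable matches a PUMA of interest
in_region <- toy_pums$PUMA10 %in% toy_pumas | toy_pums$PUMA00 %in% toy_pumas
toy_region <- toy_pums[in_region, ]
print(toy_region)
```

Filtering on PUMA10 alone would drop the first record, which was collected under the 2000 geography.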

### Option 2: Test Geography

There is still a lot of data in the campo_pums subset derived above. If you would prefer to use a test geography, simply follow the steps in Option 1 above, but include only one PUMA, as follows:

In [None]:
one_puma <- c("5201")

one_pums <- PUMS.TX15 %>%
  filter(PUMA10 %in% one_puma | PUMA00 %in% one_puma)

The PUMS data within this geography can be further reduced to just one survey response, again for ease of analysis:

In [None]:
one_row <- one_pums[1,]