## PART 1: Get the 2 RCT data sets
The website with the data is here: http://users.nber.org/~rdehejia/data/nswdata2.html

### For the original Lalonde data
- STEP 1: Download the two text files, nsw_control.txt and nsw_treated.txt (or read them directly from their URLs, as the code below does)
- STEP 2: If you downloaded them, ensure that they are in R's working directory
- STEP 3: Read them into your R workspace
- STEP 4: Bind the treated and control groups together into one data frame

In [3]:
# Read the treated and control groups of the original Lalonde sample.
nsw_treated <- read.table("https://users.nber.org/~rdehejia/data/nsw_treated.txt")
nsw_controls <- read.table("https://users.nber.org/~rdehejia/data/nsw_control.txt")

# Stack both groups, then prepend a column that labels the data set.
nsw_data <- rbind(nsw_treated, nsw_controls)
additional_column_to_label_data_set <- rep("Original Lalonde Sample", nrow(nsw_data))
nsw_data <- cbind(additional_column_to_label_data_set, nsw_data)
names(nsw_data) <- c("data_id", "treat", "age", "educ", "black", "hisp",
                     "married", "nodegr", "re75", "re78")
head(nsw_data)

data_id,treat,age,educ,black,hisp,married,nodegr,re75,re78
Original Lalonde Sample,1,37,11,1,0,1,1,0,9930.046
Original Lalonde Sample,1,22,9,0,1,0,1,0,3595.894
Original Lalonde Sample,1,30,12,1,0,0,0,0,24909.45
Original Lalonde Sample,1,27,11,1,0,0,1,0,7506.146
Original Lalonde Sample,1,33,8,1,0,0,1,0,289.7899
Original Lalonde Sample,1,22,9,1,0,0,1,0,4056.494


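### For the Dehejia-Wahba data
The Stata file nsw_dw.dta must be in R's working directory.

In [12]:
library(foreign)  # provides read.dta() for Stata files
DW_data <- read.dta("nsw_dw.dta")
head(DW_data)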

data_id,treat,age,education,black,hispanic,married,nodegree,re74,re75,re78
Dehejia-Wahba Sample,1,37,11,1,0,1,1,0,0,9930.0459
Dehejia-Wahba Sample,1,22,9,0,1,0,1,0,0,3595.894
Dehejia-Wahba Sample,1,30,12,1,0,0,0,0,0,24909.4492
Dehejia-Wahba Sample,1,27,11,1,0,0,1,0,0,7506.146
Dehejia-Wahba Sample,1,33,8,1,0,0,1,0,0,289.7899
Dehejia-Wahba Sample,1,22,9,1,0,0,1,0,0,4056.4939


## PART 2: Create the fake observational data sets

Now create the 2 simulated observational data sets. Each one combines the
treatment group from one of the experimental data sets above with the CPS-1 survey data, which serves as the comparison group.
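
The construction described above can be sketched on toy data before running it on the real files. This is a minimal illustration, not the notebook's actual data: every value is made up, the `toy_*` names are hypothetical, and only a few of the real columns are included.

```r
# Toy stand-ins for an experimental sample and a survey comparison group.
# All values are made up; only the column layout resembles the real data.
toy_rct <- data.frame(
  data_id = "Toy RCT",
  treat   = c(1, 1, 0, 0),
  age     = c(37, 22, 30, 27),
  re75    = c(0, 0, 1200, 0),
  re78    = c(9930, 3596, 2400, 7506),
  stringsAsFactors = FALSE
)
toy_cps <- data.frame(
  data_id = "Toy CPS",
  treat   = 0,
  age     = c(45, 21, 38),
  re75    = c(25244, 5853, 25131),
  re78    = c(25565, 13496, 25565),
  stringsAsFactors = FALSE
)

# Keep only the experimental treated units.
toy_treated <- toy_rct[toy_rct$treat == 1, ]

# Stack the treated units on top of the survey controls. rbind() on data
# frames matches columns by name, so both frames must share the same names.
toy_observational <- rbind(toy_treated, toy_cps)

# Cross-tabulate source against treatment status as a quick check.
table(toy_observational$data_id, toy_observational$treat)
```

In the result, every treated row comes from the experiment and every control row comes from the survey, which is what makes the combined data set "observational".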

### A. First, with the original Lalonde RCT sample

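In [13]:
cps_controls <- read.dta("cps_controls.dta")
head(cps_controls)

# The original Lalonde sample has no re74, so drop it (column 9) from the CPS controls,
# then reuse the Lalonde column names so that rbind() can match the columns.
cps_controls_without_re74 <- cps_controls[, names(cps_controls) != "re74"]
names(cps_controls_without_re74) <- names(nsw_data)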

data_id,treat,age,education,black,hispanic,married,nodegree,re74,re75,re78
CPS1,0,45,11,0,0,1,1,21516.67,25243.551,25564.67
CPS1,0,21,14,0,0,0,0,3175.971,5852.565,13496.08
CPS1,0,38,12,0,0,1,0,23039.02,25130.76,25564.67
CPS1,0,48,6,0,0,1,1,24994.369,25243.551,25564.67
CPS1,0,18,8,0,0,1,1,1669.295,10727.61,9860.869
CPS1,0,22,11,0,0,1,1,16365.76,18449.27,25564.67


In [15]:
# Keep only the treated units, then append the CPS-1 controls.
nsw_data_nocontrols <- nsw_data[nsw_data$treat == 1, ]
nsw_treated_data_with_CPS <- rbind(nsw_data_nocontrols, cps_controls_without_re74)
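
One caveat when filtering rows down to the treated units: removing rows with a negative `which()` index behaves unexpectedly when no row matches the condition. The sketch below shows this on a made-up toy data frame (the names `toy`, `dropped_with_which`, and `kept_with_logical` are hypothetical).

```r
# Toy data with no control units at all (values are made up).
toy <- data.frame(id = 1:3, treat = c(1, 1, 1))

# Negative indexing with which(): which() returns integer(0) here, and
# toy[-integer(0), ] selects zero rows instead of leaving the data untouched.
dropped_with_which <- toy[-which(toy$treat == 0), ]

# Logical subsetting keeps exactly the rows where the condition holds.
kept_with_logical <- toy[toy$treat == 1, ]

nrow(dropped_with_which)  # 0
nrow(kept_with_logical)   # 3
```

With the real data there are always control rows to drop, so both forms give the same result, but logical subsetting stays correct in the edge case.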

### B. Second, with Dehejia's experimental sample
This sample includes re74, so it provides 2 years of pre-treatment earnings. Dehejia thought it was necessary
to control for more than 1 year of pre-treatment earnings.
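
Because this sample keeps re74, the CPS controls can be stacked onto it without dropping a column, as long as the column names line up. A minimal sketch of such a check follows, assuming a hypothetical helper `compare_columns()` and made-up toy values; it is not part of the original analysis.

```r
# Hypothetical helper: report which columns differ between two data frames.
compare_columns <- function(a, b) {
  list(
    only_in_first  = setdiff(names(a), names(b)),
    only_in_second = setdiff(names(b), names(a)),
    same_order     = identical(names(a), names(b))
  )
}

# Toy frames with made-up values: one with re74, one without.
with_re74    <- data.frame(treat = 1, re74 = 0, re75 = 0, re78 = 9930)
without_re74 <- data.frame(treat = 1, re75 = 0, re78 = 9930)

# Shows that re74 is present only in the first frame.
compare_columns(with_re74, without_re74)
```

Running the same helper on `DW_data` and `cps_controls` before binding them would show whether any renaming is needed.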

In [16]:
cps_controls <- read.dta("cps_controls.dta")
head(cps_controls)


data_id,treat,age,education,black,hispanic,married,nodegree,re74,re75,re78
CPS1,0,45,11,0,0,1,1,21516.67,25243.551,25564.67
CPS1,0,21,14,0,0,0,0,3175.971,5852.565,13496.08
CPS1,0,38,12,0,0,1,0,23039.02,25130.76,25564.67
CPS1,0,48,6,0,0,1,1,24994.369,25243.551,25564.67
CPS1,0,18,8,0,0,1,1,1669.295,10727.61,9860.869
CPS1,0,22,11,0,0,1,1,16365.76,18449.27,25564.67


In [17]:
cps_controls_new_names <- cps_controls
names(cps_controls_new_names) <- names(DW_data)

In [18]:
# Keep only the treated units of the Dehejia-Wahba sample.
DW_data_nocontrols <- DW_data[DW_data$treat == 1, ]

In [19]:
DW_treated_data_with_CPS <- rbind(DW_data_nocontrols, cps_controls_new_names)