**Applied Empirical Analysis (HS 2020)**

**Conny Wunsch, Ulrike Unterhofer and Véra Zabrodina** -- University of Basel

***

# Lab Session 1 - Selection on Observables

***


## Application: What Did All the Money Do? On the General Ineffectiveness of Recent West German Labour Market Programmes
**Conny Wunsch and Michael Lechner, Kyklos 2008**


***
## 1. Introduction


* What is the research question?

* Why is this question of interest?

* What are the treatments and the outcomes? 

* Why can you not just regress the outcome on the treatment, i.e. what is the endogeneity problem? 


We use a simulated data set and a similar setting to evaluate a training program for unemployed workers.

***

## 2. Setup and Data

* Evaluate the effect of an **online application coaching program for unemployed workers** on their employment outcomes
* This full-time course can only be started within the first month of unemployment, and lasts for **1 month**.
* Any worker can participate voluntarily, in agreement with their randomly assigned caseworker.
* Participation is, however, mandatory for all workers older than 60.
* Caseworkers are instructed to encourage older workers, as well as women with lower education, to participate.
* The program's capacity is limited each month.

* We have representative data on unemployment spells that started in 20XX, with a rich set of observable individual characteristics.

***

## 3. Identification strategy and assumptions

## Today we stay in the parametric world

### Notation


* $Y^*_{0i}$ ... potential outcome under no participation
* $Y^*_{1i}$ ... potential outcome under participation
* $Y_{0i}$, $Y_{1i}$  ... observed outcomes

Effects of interest:

* ATE = ${\rm E} [Y^*_{1} - Y^*_{0}]$

* ATET = ${\rm E} [Y^*_{1} - Y^*_{0}|D=1]$

* ATENT = ${\rm E} [Y^*_{1} - Y^*_{0}|D=0]$
 



We estimate the effect of participating in the training program $(D=1)$ by:

$Y_i=\alpha+\delta D_i+\sum_{k=1}^K\beta_kX_{i,k}+U_i$

What does $\delta$ claim to measure?



### Discussion of assumptions

* What do these assumptions mean in words?

* What could invalidate them? Think of concrete examples or mechanisms.

* Which arguments or evidence can you provide to support that they hold?


**A1 Stable unit treatment value assumption (SUTVA):**

$Y_i=D_iY^*_{1,i}+(1-D_i)Y^*_{0,i}$


**A2 Zero conditional mean error:**

$E[U_i|D=D_i,X=X_i]=0$ 

This implies that, conditional on $X$, no unobserved determinants of the outcome (captured by $U_i$) are correlated with the treatment.
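To see what A2 rules out, consider a minimal simulated sketch. The data-generating process, variable names (`older`, `d`, `y`) and parameter values are assumptions chosen for illustration; they are not taken from the lab data. Older workers are both more likely to participate and have longer unemployment spells, so leaving age out of the regression pushes it into the error term and violates A2.

```r
# Illustrative simulation (assumed data-generating process, not the lab data)
set.seed(123)
n <- 10000

# Observed confounder: indicator for being an older worker
older <- rbinom(n, 1, 0.4)

# Selection on observables: older workers participate more often
d <- rbinom(n, 1, ifelse(older == 1, 0.7, 0.3))

# Homogeneous true effect of -2 months on unemployment duration
delta <- -2
y <- 10 + delta * d + 4 * older + rnorm(n)

# Naive regression: 'older' is left in the error term, so A2 fails
naive <- coef(lm(y ~ d))["d"]

# Conditioning on the confounder restores A2
adjusted <- coef(lm(y ~ d + older))["d"]

round(c(naive = unname(naive), adjusted = unname(adjusted)), 2)
```

In this design the naive coefficient on `d` should lie well above the true value of -2, because participants are disproportionately older workers with longer spells, while the adjusted coefficient should be close to -2.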


**A3 Linearity:**

$E[Y_i|D=D_i,X=X_i]=\alpha+\delta D_i+\sum_{k=1}^K\beta_kX_{i,k}$


**A4 Effect homogeneity:**

$E[Y^*_{1,i}-Y^*_{0,i}]=\delta\quad\forall\ i$

This implies $\delta = ATE = ATET = ATENT$.
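
A sketch of the standard argument behind this statement, in the notation above. It reads A4 as a constant individual effect and assumes that the no-treatment potential outcome is linear in $X$; both are simplifying assumptions made for this illustration.

$$
\begin{aligned}
Y_i &= Y^*_{0,i} + D_i\,(Y^*_{1,i}-Y^*_{0,i}) && \text{(A1, rearranged)}\\
    &= Y^*_{0,i} + \delta D_i && \text{(A4)}\\
    &= \alpha + \delta D_i + \sum_{k=1}^K \beta_k X_{i,k} + U_i && \text{(linear } Y^*_{0,i}\text{)}\\
E[Y_i \mid D=D_i, X=X_i] &= \alpha + \delta D_i + \sum_{k=1}^K \beta_k X_{i,k} && \text{(A2, giving A3)}
\end{aligned}
$$

Because $Y^*_{1,i}-Y^*_{0,i}=\delta$ does not vary across individuals, averaging it over everyone, over participants only, or over non-participants only gives the same number $\delta$.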

***
## 4. Empirical Analysis

### Load Packages and set directory

In [1]:
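# Remove old objects from the R workspace
rm(list=ls())

# Load packages (sandwich provides vcovHC(), used in the balance checks below)
packages_vector <- c("tidyverse", "Hmisc", "dplyr", "fastDummies",
                     "tidyr", "knitr", "xtable", "lubridate" ,
                     "data.table", "stargazer", "sandwich", "mfx", "jtools")

# require() returns FALSE instead of stopping, so report packages that failed to load
loaded <- sapply(packages_vector, require, character.only = TRUE)
if (!all(loaded)) {
    warning("Not loaded: ", paste(packages_vector[!loaded], collapse = ", "))
}

# Function for table display
repr_html.xtable <- function(obj, ...){
    paste(capture.output(print(obj, type = 'html')), collapse="", sep="")
}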




### Load the data

In [2]:
# Data for descriptives
load("data_desc.RData")
attach(data_desc)

# Data for regression
load("data_reg.RData")
attach(data_reg)

The following objects are masked from data_desc:

    agegr_2, agegr_3, agegr_4, contr_5y, duration, educ_0, educ_2,
    educ_3, educ_99, employed1y, full_time_0, insured_earn,
    lastj_fct_0, lastj_fct_1, lastj_fct_3, lastj_fct_99, lastj_occpt_1,
    lastj_occpt_10, lastj_occpt_11, lastj_occpt_12, lastj_occpt_14,
    lastj_occpt_15, lastj_occpt_16, lastj_occpt_17, lastj_occpt_18,
    lastj_occpt_19, lastj_occpt_2, lastj_occpt_20, lastj_occpt_21,
    lastj_occpt_22, lastj_occpt_23, lastj_occpt_3, lastj_occpt_4,
    lastj_occpt_5, lastj_occpt_6, lastj_occpt_7, lastj_occpt_8,
    lastj_occpt_9, marits_1, region_1, region_2, region_3, region_4,
    region_5, region_7, sex_1, swiss_0, treat, unempl_r




### Application specific cleaning

In [3]:
# Create application-specific variables that might be needed

# Quarter of entry into unemployment 
data_reg$quarter <- as.numeric(quarter(data_reg$date_start))

data_reg<-dummy_cols(data_reg, select_columns = c("quarter"), 
                     remove_most_frequent_dummy = TRUE)

# save final data set
save(data_reg, xcat_names_reg, xcont_names, file="data_reg_final.RData")

# Group variables, adding newly created ones - need matrix!

y1 <- as.matrix(data_reg$duration)

y2 <- as.matrix(data_reg$employed1y)

treat <- as.matrix(data_reg$treat)

x1 <- as.matrix(dplyr::select(data_reg, 
                              all_of(xcat_names_reg), 
                              all_of(xcont_names), 
                              starts_with("quarter_")))

print("Dimensions of outcome and input vectors")
dim(y1)
dim(y2)
dim(x1)




[1] "Dimensions of outcome and input vectors"
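
The quarter dummies created above can be pictured with a small base R sketch. The dates below are hypothetical and the code only mimics the idea of `quarter()` plus `dummy_cols(..., remove_most_frequent_dummy = TRUE)`; it is not how `fastDummies` is implemented.

```r
# Hypothetical spell start dates (not from the lab data)
date_start <- as.Date(c("2020-01-15", "2020-02-03", "2020-05-20",
                        "2020-08-11", "2020-11-30", "2020-03-09"))

# Calendar quarter derived from the month
qtr <- (as.integer(format(date_start, "%m")) - 1L) %/% 3L + 1L

# One indicator column per quarter observed in the data
levels_q <- sort(unique(qtr))
dummies <- sapply(levels_q, function(q) as.integer(qtr == q))
colnames(dummies) <- paste0("quarter_", levels_q)

# Drop the most frequent quarter so that it becomes the reference category
ref <- names(which.max(table(qtr)))
dummies <- dummies[, colnames(dummies) != paste0("quarter_", ref), drop = FALSE]
dummies
```

Dropping one category avoids perfect collinearity with the intercept in the regressions that follow; the coefficients on the remaining dummies are then differences relative to the omitted quarter.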


### Descriptive Statistics

#### Differences in characteristics and outcomes by treatment

In [4]:
# All covariates (everything except treatment and outcomes)
x_desc <- dplyr::select(data_desc, -treat, -employed1y, -duration)

# A selection, taken explicitly from data_desc rather than from attached objects.
# unclass() + as.numeric() drops value labels so that all columns are plain numeric.
strip_labels <- function(v) as.numeric(unclass(v))
x_desc_names <- c("insured_earn", "contr_5y", "unempl_r", "sex_1",
                  "marits_1", "swiss_1", "agegr_1",
                  "agegr_2", "agegr_3", "agegr_4",
                  "educ_0", "educ_1", "educ_2",
                  "educ_3", "full_time_1")
x_desc <- sapply(as.data.frame(data_desc)[x_desc_names], strip_labels)
d_desc <- strip_labels(data_desc$treat)

# Define a function estimating the differences in variables across D
balance_check.model <- function(x){
    
    # Conditional means
    mean_d0 <- mean(x[d_desc == 0])
    mean_d1 <- mean(x[d_desc == 1])
    
    # Difference in means with heteroskedasticity-robust (White) standard errors
    diff_d <- lm(x ~ d_desc)
    cov <- vcovHC(diff_d, type = "HC")
    robust.se <- sqrt(diag(cov))
    
    list(mean_d0 = mean_d0, mean_d1 = mean_d1,
        coeff = diff_d$coefficients[2], 
        robust.se = robust.se[2], 
        pval = 2*pnorm(-abs(diff_d$coefficients[2]/robust.se[2])) )             
}

diff_output <- apply(x_desc, 2, balance_check.model)

# Convert list to a data frame (data.table objects do not keep row names)
diff_output <- as.data.frame(rbindlist(diff_output))
rownames(diff_output) <- x_desc_names
colnames(diff_output) <- c("E(X|D=0)", "E(X|D=1)", "Difference", "s.e.", "p-value")

# Display table
print("Difference in means by treatment status")
xtable(diff_output, digits=3)


### Regression Analysis

We estimate the effect of program participation on the *unemployment duration* and *individual employment one year after program start* using OLS and probit.

**Effect of program participation on the unemployment duration**

In [None]:
# Linear model (OLS)
lm.model.y1 <- lm(y1 ~ treat + x1)

# summ command in the jtools package directly calculates robust SEs
lm.model.out.y1 <-summ(lm.model.y1, robust = "HC1")
lm.model.out.y1

# store standard error
robust.se.y1 <- lm.model.out.y1$coeftable[,2]
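
For intuition on what `robust = "HC1"` asks for, the sandwich formula can be sketched in base R on simulated data. The data-generating process and all variable names below are assumptions for illustration; the sketch does not use the lab data and is not meant to reproduce the internals of `summ()`.

```r
# Simulated data with heteroskedastic errors (illustrative assumption)
set.seed(1)
n <- 500
x <- rnorm(n)
d <- rbinom(n, 1, 0.5)
y <- 1 + 0.5 * d + x + rnorm(n, sd = exp(0.5 * x))  # error variance depends on x

fit <- lm(y ~ d + x)
X <- model.matrix(fit)   # n x k regressor matrix including the intercept
e <- residuals(fit)
k <- ncol(X)

bread <- solve(crossprod(X))   # (X'X)^{-1}
meat  <- crossprod(X * e)      # X' diag(e^2) X: each row of X scaled by its residual

# HC1: White's estimator with the degrees-of-freedom factor n / (n - k)
vcov_hc1 <- n / (n - k) * bread %*% meat %*% bread

se_hc1 <- sqrt(diag(vcov_hc1))
se_ols <- sqrt(diag(vcov(fit)))
round(cbind(classical = se_ols, HC1 = se_hc1), 3)
```

If the `sandwich` package is installed, `sqrt(diag(sandwich::vcovHC(fit, type = "HC1")))` should give the same standard errors. In this simulated design the robust standard error on `x` should exceed the classical one, since the error variance rises with `x`.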


**Effect of program participation on individual employment one year after program start**

In [None]:
# Linear model - effect on employment in year 1
lm.model.y2 <- lm(y2 ~ treat + x1)

lm.model.out.y2 <- summ(lm.model.y2, robust = "HC1")
robust.se.y2 <- lm.model.out.y2$coeftable[,2]

# Probit model - effect on employment in year 1
probit.model <- glm(y2 ~ treat + x1, 
                    family = binomial(link = "probit") )

probit.model.out <- summ(probit.model, robust = "HC1")
robust.se.probit <- probit.model.out$coeftable[,2]

# Marginal effect
probit.model.2 <- probitmfx(y2 ~ treat + x1, data=data_reg, 
                            atmean = TRUE, robust = TRUE)
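
Probit coefficients are not on the probability scale, which is why marginal effects are computed. The following base R sketch shows the idea behind effects evaluated at the regressor means (`atmean = TRUE`) on simulated data: a derivative for a continuous regressor and a discrete change for a binary treatment. The data and names are illustrative assumptions, and the code is a hand calculation, not the `mfx` implementation.

```r
# Simulated binary outcome (illustrative assumption, not the lab data)
set.seed(42)
n <- 2000
x <- rnorm(n)
d <- rbinom(n, 1, 0.5)
y <- as.numeric(0.3 * d + 0.8 * x + rnorm(n) > 0)

probit <- glm(y ~ d + x, family = binomial(link = "probit"))
b <- coef(probit)

# Linear index evaluated at the sample means of all regressors
xbar <- colMeans(model.matrix(probit))
index_mean <- sum(xbar * b)

# Continuous regressor: derivative of the probability, phi(xbar'b) * b_x
me_x <- dnorm(index_mean) * b["x"]

# Binary treatment: change in the probability when d switches from 0 to 1,
# holding x at its mean
index_d0 <- b["(Intercept)"] + b["x"] * xbar["x"]
me_d <- pnorm(index_d0 + b["d"]) - pnorm(index_d0)

c(me_d = unname(me_d), me_x = unname(me_x))
```

Both effects are smaller in absolute value than the corresponding coefficients, because the normal density is at most about 0.4. This is why the probit coefficient and the OLS coefficient on `treat` are not directly comparable, whereas the marginal effect and the OLS coefficient are.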


#### Estimated Coefficients

In [None]:
stargazer(lm.model.y1,lm.model.y2,probit.model, 
            se=list(robust.se.y1, robust.se.y2, robust.se.probit), 
            keep=c("treat"), 
            keep.stat = c("n", "rsq"), 
            type ="text", 
            column.labels=c("Duration","Emply1", "Emply1"), 
            align=TRUE, 
            dep.var.labels.include = FALSE)


#### Marginal Effects

In [None]:
stargazer(lm.model.y1, lm.model.y2, probit.model.2$fit, 
          coef = list(NULL, NULL, probit.model.2$mfxest[,1]), 
          se=list(robust.se.y1, robust.se.y2 , probit.model.2$mfxest[,2]), 
          keep=c("treat"), 
          keep.stat = c("n", "rsq"), 
          type ="text", 
          
          column.labels=c("Duration","Emply1", "Emply1"), 
          align=TRUE, 
          dep.var.labels.include = FALSE)