Skip to content

R package to bound omitted variable bias using auxiliary data

Notifications You must be signed in to change notification settings

yujunghwang/bndovb

Repository files navigation

bndovb

Citation

Please cite the following paper if you use this package.

Introduction

The R package bndovb implements a Hwang(2021) estimator to bound omitted variable bias using auxiliary data. The basic assumption is that the main data includes a dependent variable and every regressor but one omitted variable. So there is an omitted variable bias in the OLS result from the main data. However, if there is auxiliary data that includes every right-hand side regressor (or its noisy proxies), it can bound the omitted variable bias. Hwang(2021) provides a more general estimator for when there is more than one omitted variable and when the auxiliary data does not contain every right-hand side regressor (or its noisy proxies).

This package implements a simple estimator when the number of omitted variables is just one, and the auxiliary data contains every right-hand side regressor but the only omitted variable. This package provides two different functions.

The function ‘bndovb’ can be used when the auxiliary data contain every right-hand side variable without measurement errors. The function ‘bndovbme’ can be used when noisy proxies for the omitted variable exist in the auxiliary data. Other regressors in the auxiliary data are assumed measurement error-free. The function requires another R package, ‘factormodel,’ written by the same author.

When using ‘bndovb,’ a user should specify a method for density estimation as part of an estimation procedure, either 1 (parametric normal density assumption) or 2 (nonparametric kernel density estimation). In general, it is strongly recommended to use method 1 (parametric normal density assumption), particularly when data is large or the regression model is large. Method 2 calls an R package “np” (Li and Racine, 2008; Li, Lin and Racine, 2013) but this method is very slow, as emphasized in their vignette file. I recommend using their method only when (i) data is small and (ii) there is only one common variable, which makes the conditional CDF and quantile function univariate. The default method is 1, using parametric normal density assumption.

When using ‘bndovbme’, the auxiliary data must contain noisy proxy variables for the omitted variable. A user should set the type of the proxy variables, ptype, to either 1 (continuous) or 2 (discrete). When proxy variables are continuous, the auxiliary data must contain at least 2 proxy variables. When proxy variables are discrete, the auxiliary data must contain at least 3 proxy variables.

To compute the equal-tailed confidence interval, set the option ‘ci=TRUE’ when using the ‘bndovb’ or ‘bndovbme’ function. The functions compute the confidence interval using the Hong and Li (2018) numerical delta method. The default nominal coverage rate is (1-tau) = 0.95 and the default number of bootstrap draws is 100. The default tuning parameter for the numerical delta method is set to ‘scale=-1/2’, which makes the bootstrap standard one. You can change any of these by choosing a different option. See the documentation for ‘bndovb’ and ‘bndovbme’ function.

To choose the optimal tuning parameter ‘scale’, use ‘bndovb_tuning’ and ‘bndovbme_tuning’ function. These functions implement the double bootstrap described in Hong and Li (2020). The default scale grid to search for an optimal scale is c(-1/2,-1/3,-1/4,-1/5,-1/6), which you can change by setting a different option. The default nominal coverage rate is (1-tau)=0.95 and the default number of bootstrap draws is 100. In double bootstrap, this means the number of the first draws is 100 and the number of second draws conditional on each first draw is 100. By default, the double bootstrap will be performed in parallel. If you want to compute serially, choose ‘parallel=FALSE’. See the documentation for ‘bndovb_tuning’ and ‘bndovbme_tuning’ function.

Installation

You can install a package bndovb using either CRAN or github.

install.packages("bndovb")

or

# install.packages("devtools")
devtools::install_github("yujunghwang/bndovb")

Example 1 : bndovb

The below example shows how to use a function ‘bndovb’ when the auxiliary data contain the omitted variable from main data without any measurement error. The code first simulates fake data using the same DGP in both main data and auxiliary data. Next, the main data omits one variable. The function ‘bndovb’ provides a bound on regression coefficients by using both main data and auxiliary data.

library(MASS)
library(bndovb)

# sample size
Nm <- 5000 # main data
Na <- 5000 # auxiliary data

# use same DGP in maindat and auxdat
maindat <- mvrnorm(Nm,mu=c(2,3,1),Sigma=rbind(c(2,1,1),c(1,2,1),c(1,1,2)))
auxdat  <- mvrnorm(Na,mu=c(2,3,1),Sigma=rbind(c(2,1,1),c(1,2,1),c(1,1,2)))

maindat <- as.data.frame(maindat)
auxdat <- as.data.frame(auxdat)

colnames(maindat) <- c("x1","x2","x3")
colnames(auxdat) <- c("x1","x2","x3")

# this is a true parameter which we try to get bounds on
truebeta <- matrix(c(2,1,3,2),ncol=1)

# generate a dependent variable
maindat$y <- as.matrix(cbind(rep(1,Nm),maindat[,c("x1","x2","x3")]))%*%truebeta

# main data misses one omitted variable "x1"
maindat <- maindat[,c("x2","x3","y")]

# use "bndovb" function assuming parametric "normal" distribution for the CDF and quantile function (set method=1)
# see Hwang(2021) for further details
oout <- bndovb(maindat=maindat,auxdat=auxdat,depvar="y",ovar="x1",comvar=c("x2","x3"),method=1)
#> [1] "If you want to compute an equal-tailed confidence interval using a numerical delta method, set ci=TRUE instead. Default is 95% CI. If you want a different coverage, set a different tau."
print(oout)
#> $hat_beta_l
#>       con        x1        x2        x3 
#>  2.159940 -1.097541  3.051552  1.818566 
#> 
#> $hat_beta_u
#>       con        x1        x2        x3 
#> 3.4754586 0.9482145 3.7202134 2.5578865 
#> 
#> $mu_l
#> [1] 34.17696
#> 
#> $mu_u
#> [1] 36.82752
#> 
#> $hat_beta_l_cil
#> NULL
#> 
#> $hat_beta_l_ciu
#> NULL
#> 
#> $hat_beta_u_cil
#> NULL
#> 
#> $hat_beta_u_ciu
#> NULL
#> 
#> $mu_l_cil
#> NULL
#> 
#> $mu_l_ciu
#> NULL
#> 
#> $mu_u_cil
#> NULL
#> 
#> $mu_u_ciu
#> NULL

# use "bndovb" function using nonparametric estimation of the CDF and quantile function (set method=2)
# for nonparametric density estimator, the R package "np" was used. See Hayfield and Racine (2008), Li and Racine (2008), Li, Lin and Racine (2013)
#### The next line takes very long because of large sample size. You can try using a smaller sample and run the next line.
#oout <- bndovb(maindat=maindat,auxdat=auxdat,depvar="y",ovar="x1",comvar=c("x2","x3"),method=2)
#print(oout)

Example 2 : bndovbme (continuous proxy variables)

The code below shows how to use a function ‘bndovbme’ when the auxiliary data does not contain the omitted variable but contain continuous proxy variables for the omitted variable. The code may take some time to run.

library(MASS)
library(pracma)
library(bndovb)

set.seed(210413)

### continuous proxy variables

# set DGP
nu      <- 0.5   # sd of measurement errors in proxy variables
beta    <- c(0,1,1,1) # true parameters in a regression model
gamma   <- c(0,1,1) # parameters to generate correlation between covariates
samsize <- c(6000)  # sample size
mu      <- c(0,0,0,0) # average of covariates
sigma   <- eye(4)

#### simulate data
A <- rbind( c(1,0,0,0), c(0,1,0,0), c(gamma[2],gamma[3],1,0), c(beta[3]+beta[2]*gamma[2],beta[4]+beta[2]*gamma[3], beta[2],1))
B <- c(0,0,gamma[1],beta[1])
mu2    <- A%*%mu + B
sigma2 <- A%*%sigma%*%t(A)
Sim     = 100         ;  # number of Monte Carlo simulations

n=6000;na=3000;nb=3000
simdata <- mvrnorm(n,mu=mu2,Sigma=sigma2)

w1<-simdata[,1]
w2<-simdata[,2]
x <-simdata[,3]
y <-simdata[,4]

# main data
w1_a <- w1[1:na]
w2_a <- w2[1:na]
x_a  <- x[ 1:na]
y_a  <- y[ 1:na]

# auxiliary data
w1_b <- w1[(na+1):n]
w2_b <- w2[(na+1):n]
x_b  <- x[ (na+1):n]
y_b  <- y[ (na+1):n]

# generate continuous proxies
z_b <- w2_b + cbind(rnorm(n-na,mean=0,sd=nu), rnorm(n-na,mean=0,sd=nu), rnorm(n-na,mean=0,sd=nu))

# main data does not include a variable w2
maindat <- data.frame(y=y_a,x=x_a,w1=w1_a)

# auxiliary data does not include a dependent variable y 
# auxiliary data contain three proxy variables for the omitted variable w2
auxdat <- data.frame(x=x_b,w1=w1_b,z1=z_b[,1],z2=z_b[,2],z3=z_b[,3])


# use 'bndovbme' function
oout <- bndovbme(maindat=maindat,auxdat=auxdat,depvar=c("y"),pvar=c("z1","z2","z3"),ptype=1,comvar=c("x","w1"),normalize=FALSE)
#> [1] "If you want to compute an equal-tailed confidence interval using a numerical delta method, set ci=TRUE instead. Default is 95% CI. If you want a different coverage, set a different tau."
print(oout)
#> $hat_beta_l
#>        con       ovar          x         w1 
#> -0.1088779 -1.8499939  0.5519592 -0.4152109 
#> 
#> $hat_beta_u
#>        con       ovar          x         w1 
#> 0.01428739 1.72932693 2.40020851 1.40518506 
#> 
#> $mu_l
#> [1] 0.5594916
#> 
#> $mu_u
#> [1] 2.309199
#> 
#> $hat_beta_l_cil
#> NULL
#> 
#> $hat_beta_l_ciu
#> NULL
#> 
#> $hat_beta_u_cil
#> NULL
#> 
#> $hat_beta_u_ciu
#> NULL
#> 
#> $mu_l_cil
#> NULL
#> 
#> $mu_l_ciu
#> NULL
#> 
#> $mu_u_cil
#> NULL
#> 
#> $mu_u_ciu
#> NULL

Example 3 : bndovbme (discrete proxy variables)

The code below is similar to the Example 2 but assumes the proxy variables in auxiliary data are discrete. The code may take some time to run.

library(MASS)
library(pracma)
library(bndovb)

set.seed(210413)

### discrete proxy variables
# set DGP

n=6000;na=3000;nb=3000
mu2 <- c(0,0,0)
Sigma2 <- rbind(c(1,0.5,0.5),c(0.5,1,0.5),c(0.5,0.5,1))

simdata <- mvrnorm(n,mu=mu2,Sigma=Sigma2)
# discretize
simdata[,2] <- (simdata[,2]>0)+1

beta    <- c(0,1,1,1) # true parameters to get bounds on

# simulate a dependent variable
y <- cbind(rep(1,n),simdata)%*%as.matrix(beta) + rnorm(n)

w1<-simdata[,1]
w2<-simdata[,2]
x <-simdata[,3]

# main data
w1_a <- w1[1:na]
w2_a <- w2[1:na]
x_a  <- x[ 1:na]
y_a  <- y[ 1:na]

# auxiliary data
w1_b <- w1[(na+1):n]
w2_b <- w2[(na+1):n]
x_b  <- x[ (na+1):n]
y_b  <- y[ (na+1):n]

# set measurement matrices for discrete proxy variables
M_param <- list()
M_param[[1]] <- rbind(c(0.9,0.1),c(0.1,0.9))
M_param[[2]] <- rbind(c(0.9,0.1),c(0.1,0.9))
M_param[[3]] <- rbind(c(0.9,0.1),c(0.1,0.9))

CM_param <- list()
CM_param[[1]] <- t(apply(M_param[[1]],1,cumsum))
CM_param[[2]] <- t(apply(M_param[[2]],1,cumsum))
CM_param[[3]] <- t(apply(M_param[[3]],1,cumsum))

# simulate proxy variables
z_b <- matrix(NA,nrow=nb,ncol=3)
for (k in 1:nb){
  for (l in 1:3){
    z_b[k,l] <- which(runif(1)<CM_param[[l]][w2_b[k],])[1]
  }
}


# main data
maindat <- data.frame(y=y_a,x=x_a,w1=w1_a)
# auxiliary data
auxdat <- data.frame(x=x_b,w1=w1_b,z1=z_b[,1],z2=z_b[,2],z3=z_b[,3])

# use 'bndovbme' function
oout <- bndovbme(maindat=maindat,auxdat=auxdat,depvar=c("y"),pvar=c("z1","z2","z3"),ptype=2,comvar=c("x","w1"),sbar=2,normalize=FALSE)
#> [1] "If you want to compute an equal-tailed confidence interval using a numerical delta method, set ci=TRUE instead. Default is 95% CI. If you want a different coverage, set a different tau."
print(oout)
#> $hat_beta_l
#>        con       ovar          x         w1 
#> -1.1624776 -1.9791250  0.9362842  0.7931693 
#> 
#> $hat_beta_u
#>      con     ovar        x       w1 
#> 4.453464 1.766742 1.426545 1.269087 
#> 
#> $mu_l
#> [1] 2.240954
#> 
#> $mu_u
#> [1] 2.99594
#> 
#> $hat_beta_l_cil
#> NULL
#> 
#> $hat_beta_l_ciu
#> NULL
#> 
#> $hat_beta_u_cil
#> NULL
#> 
#> $hat_beta_u_ciu
#> NULL
#> 
#> $mu_l_cil
#> NULL
#> 
#> $mu_l_ciu
#> NULL
#> 
#> $mu_u_cil
#> NULL
#> 
#> $mu_u_ciu
#> NULL

Conclusion

This vignette showed how to use functions in `bndovb’ R package.

References

About

R package to bound omitted variable bias using auxiliary data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages