# Group Assignment 2

## Main Assignment
For Friday I would like you to prepare a visualization comparing the accuracy of confidence intervals generated by:
* A bootstrap over the data
* Standard inference (using the Delta method for transformations)
## Specifics
For this exercise I would like you to look at GLM estimates for a Poisson outcome with a mean given by:
$$\lambda_i=\text{exp}\left\{\beta_0+\beta_1\delta_{i,1}+\beta_2x_{i,2}\right\}$$

where:
* $\delta_{i,1}$ is a dummy variable taking the value 1 with probability 0.6 and 0 with probability 0.4
    * Think of this as a factor characteristic
* $x_{i,2}$ is a draw from a chi-squared distribution, however:
    * If $\delta_{i,1}$ is one, it is drawn from a distribution with 4 degrees of freedom
    * If $\delta_{i,1}$ is zero, it is drawn from a distribution with 2 degrees of freedom
    * Think of this as an economic measure that is correlated with the factor.
    
We will set the true values of the parameters as:
* $\beta _0=-1$
* $\beta_1=-1$
* $\beta_2=\tfrac{1}{2}$ 

Load in libraries:

In [None]:
library(boot)
library(MASS)

Function to create a new draw of the data:

In [None]:
draw.data <- function(N, beta=c(-1,-1,1/2)) {
    x1 <- ifelse(runif(N)<0.4,0,1) # Draw 0 with 40% chance, 1 otherwise
    x2.0 <- rchisq(N,df=2) # Draw a Chi-squared 2
    x2.1 <- rchisq(N,df=4) # Draw a Chi-squared 4
    x2 <- ifelse(x1==1,x2.1,x2.0) 
    const <- rep(1,N)
    Xmat <- cbind(const,delta1=x1,x2=x2) # Make the data, [column of ones, delta1, x2]
    eta <- Xmat%*% beta
    lambda <- exp(eta)
    # Draw a Poisson from the relevant distribution
    generateY <- function(lambda.i) rpois(1,lambda=lambda.i)
    y <- sapply(lambda,generateY)
    data.frame(cbind(y=y,Xmat[,2:3]))
}
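
As a quick sanity check of the design described above, the sketch below redraws the two regressors in a compact, vectorized form and compares their sample moments with what the specification implies. It is a re-statement for illustration only, not the `draw.data` function itself; the seed and the sample size are arbitrary assumptions. Since a chi-squared variable has mean equal to its degrees of freedom, the group means of $x_{i,2}$ should be close to 4 and 2.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
N <- 1e5                          # large N so sample moments are close to population values
delta1 <- rbinom(N, 1, 0.6)       # 1 with probability 0.6, 0 otherwise
x2 <- rchisq(N, df = ifelse(delta1 == 1, 4, 2))  # df depends on the factor

share.one   <- mean(delta1)                # should be near 0.6
group.means <- tapply(x2, delta1, mean)    # should be near 2 (delta1 = 0) and 4 (delta1 = 1)
rho         <- cor(delta1, x2)             # positive: x2 is correlated with the factor

share.one
group.means
rho
```

This also shows why $x=(1,4)$ is a natural evaluation point later on: 4 is the mean of $x_{i,2}$ when $\delta_{i,1}=1$.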

For the transformation, we want to estimate the probability that a Poisson outcome is 0 or 1. For a mean of $\lambda$ this probability is given by:
$$h(\lambda)=\Pr\left\{y<2\right\}=\Pr\left\{y=0\right\}+\Pr\left\{y=1\right\}= e^{-\lambda}+e^{-\lambda}\lambda=e^{-\lambda}(1+\lambda)$$
The derivative with respect to $\lambda$ is therefore:
$$\frac{\partial h(\lambda)}{\partial \lambda}= -e^{-\lambda}(1+\lambda) +e^{-\lambda}=e^{-\lambda}\left( 1-1-\lambda \right) =-\lambda  e^{-\lambda}$$
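
Both formulas can be checked numerically. The sketch below compares $h(\lambda)$ with R's Poisson CDF (`ppois(1, lambda)` is $\Pr\{y\leq 1\}$) and compares the analytic derivative with a central finite difference. The grid of $\lambda$ values and the step size `eps` are arbitrary choices for illustration.

```r
lambda <- c(0.25, 1, 3)                    # a few illustrative means
h <- function(l) exp(-l) * (1 + l)         # closed form for Pr{y < 2}

closed.form <- h(lambda)
from.cdf    <- ppois(1, lambda)            # Pr{y <= 1} from the Poisson CDF

eps <- 1e-6                                # finite-difference step (arbitrary)
dh.numeric  <- (h(lambda + eps) - h(lambda - eps)) / (2 * eps)
dh.analytic <- -lambda * exp(-lambda)      # the derivative derived above

max(abs(closed.form - from.cdf))           # should be numerically zero
max(abs(dh.numeric - dh.analytic))         # should be tiny
```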

We evaluate the model at $\delta_1=1$ and $x_2=4$, so $\lambda(\beta_0,\beta_1,\beta_2)=\exp\left(\beta_0+\beta_1+4\cdot\beta_2\right)$. For the Delta method we need the partial derivatives of $h$ with respect to each of the parameters, which follow from the chain rule. For $\beta_0$, for example:
$$\frac{\partial h( \lambda(\beta_0,\beta_1,\beta_2) )}{\partial \beta_0}=\frac{\partial h(\lambda)}{\partial \lambda} \frac{\partial \lambda(\beta_0,\beta_1,\beta_2)}{\partial \beta_0}$$

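Because $\lambda$ is the exponential of a linear index, its partial derivatives are (with $\delta_1=1$ and $x_2=4$ at the evaluation point):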
$$\frac{\partial \lambda(\beta_0,\beta_1,\beta_2)}{\partial \beta_0}=\lambda(\beta_0,\beta_1,\beta_2)$$
$$\frac{\partial \lambda(\beta_0,\beta_1,\beta_2)}{\partial \beta_1}=\delta_1\lambda(\beta_0,\beta_1,\beta_2)$$
$$\frac{\partial \lambda(\beta_0,\beta_1,\beta_2)}{\partial \beta_2}=x_2\lambda(\beta_0,\beta_1,\beta_2)$$
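
Putting the pieces together gives the full gradient that the Delta method needs. The display below is just the chain rule applied to the three derivatives above; $\hat{\Sigma}$ denotes the estimated variance-covariance matrix of $\hat{\beta}$ and $z_{1-\alpha/2}$ the standard normal quantile, which is notation assumed here for the standard Delta-method formula.

$$\nabla_\beta h=\frac{\partial h(\lambda)}{\partial \lambda}\begin{pmatrix}\lambda\\ \delta_1\lambda\\ x_2\lambda\end{pmatrix}=-\lambda^2 e^{-\lambda}\begin{pmatrix}1\\ \delta_1\\ x_2\end{pmatrix}$$

$$\widehat{\text{se}}\left(h(\hat{\beta})\right)\approx\sqrt{\nabla_\beta h^{\prime}\,\hat{\Sigma}\,\nabla_\beta h},\qquad h(\hat{\beta})\pm z_{1-\alpha/2}\,\widehat{\text{se}}\left(h(\hat{\beta})\right)$$

with $\lambda$ and the gradient evaluated at $\hat{\beta}$.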

Make the function for the probability of being 0 or 1, evaluated at $\delta_1=1$ and $x_2=4$, with the argument being the coefficient vector $(\hat{\beta}_0,\hat{\beta}_1,\hat{\beta}_2)$:

In [None]:
prob.0.1 <- function(beta.hat) {
    lambda <- exp( beta.hat[1]+beta.hat[2]+4*beta.hat[3])
    g <- unname( (1+lambda)*exp(-lambda) )
    return( c("Prob.0.1"=g))
}
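
For reference, the target value implied by the true parameters can be worked out directly. The short sketch below repeats the arithmetic inline rather than calling `prob.0.1`, so it runs on its own; the variable names are illustrative.

```r
beta.true   <- c(-1, -1, 1/2)                        # true parameters from the setup
x.at        <- c(1, 1, 4)                            # constant, delta1 = 1, x2 = 4
lambda.true <- exp(sum(x.at * beta.true))            # exp(-1 - 1 + 2) = 1
p.true      <- (1 + lambda.true) * exp(-lambda.true) # 2 * exp(-1)

lambda.true   # 1
p.true        # 0.7357589
```

So the confidence intervals for the transformation should cover roughly 0.736 at the nominal rate.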

Generate the Delta-method confidence intervals. The arguments are the fitted model (from which we extract the estimates $\hat{\beta}=(\hat{\beta}_0,\hat{\beta}_1,\hat{\beta}_2)$ and the variance-covariance matrix estimate $\hat{\Sigma}$), the point we want to assess the model at, $x=(x_1,x_2)$, and the confidence levels.

In [None]:
Delta.Method.pois <- function(model,xAt=c(1,4),confAt=c(0.8,0.9,0.95)){
    Sigma <- vcov(model)
    beta.hat <- coef(model)
    # Here I use the above to just write the formula directly.
    lambda <- exp(  t(c(1,xAt) )%*%beta.hat)[1,1]
    # Point estimate at xAt (prob.0.1 is hard-coded at x=(1,4), so compute it from lambda here)
    g.est <- c("Prob.0.1"=(1+lambda)*exp(-lambda))
    dH <-  -lambda* exp(-lambda) # dh/dlambda
    dG <- dH*lambda*c( 1 , xAt ) # chain rule: gradient of h with respect to beta
    se <- sqrt(t(dG) %*% Sigma %*% dG)[1,1] 
    critVals <- qnorm(1-(1-confAt)/2) # two-sided normal critical values
    # Return a matrix across the confidence intervals for upper and lower
    return(cbind(g.lower=g.est-critVals*se,g.upper=g.est+critVals*se))
}
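
One way to gain confidence in the hand-derived gradient is to compare it with a numerical gradient on a fitted model. The sketch below simulates one data set from the design above (in compact form, with an arbitrary seed and sample size), fits the Poisson GLM, and checks the analytic gradient at $x=(1,4)$ against central finite differences. The helper `h.of.beta` and all variable names are illustrative assumptions, not part of the assignment code.

```r
set.seed(1)                                    # arbitrary seed
N <- 500                                       # arbitrary sample size
delta1 <- rbinom(N, 1, 0.6)
x2 <- rchisq(N, df = ifelse(delta1 == 1, 4, 2))
y  <- rpois(N, exp(-1 - delta1 + 0.5 * x2))
fit <- glm(y ~ delta1 + x2, family = poisson)

x.at  <- c(1, 1, 4)                            # constant, delta1 = 1, x2 = 4
b.hat <- unname(coef(fit))

# Transformation as a function of the coefficient vector (hypothetical helper)
h.of.beta <- function(b) {
  lam <- exp(sum(x.at * b))
  (1 + lam) * exp(-lam)
}

# Analytic gradient: -lambda^2 * exp(-lambda) * (1, delta1, x2)
lam.hat <- exp(sum(x.at * b.hat))
grad.analytic <- -lam.hat^2 * exp(-lam.hat) * x.at

# Central finite-difference gradient, one coefficient at a time
eps <- 1e-6
grad.numeric <- sapply(1:3, function(j) {
  step <- replace(numeric(3), j, eps)
  (h.of.beta(b.hat + step) - h.of.beta(b.hat - step)) / (2 * eps)
})

max(abs(grad.analytic - grad.numeric))         # should be tiny
se <- sqrt(drop(t(grad.analytic) %*% vcov(fit) %*% grad.analytic))
se                                             # Delta-method standard error
```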

Standard analysis: profile-likelihood intervals for the coefficients (via `confint`) and Delta-method intervals for the transformed probability:

In [None]:
standard.analysis <- function(data){
    est.model.glm <- glm(y~delta1+x2,data=data,family="poisson")
    beta.hat <- unname(coef(est.model.glm))
    c80 <- confint(profile(est.model.glm),level=0.80)
    c90 <- confint(profile(est.model.glm),level=0.90)
    c95 <- confint(profile(est.model.glm),level=0.95)
    g <- Delta.Method.pois( est.model.glm , xAt=c(1,4) )
    g.hat <- prob.0.1(beta.hat)
    outMat <- rbind(   # join these rows together
        c(id=1,estimate=beta.hat[1],conf.80.lower=c80[1,1],conf.80.upper=c80[1,2],conf.90.lower=c90[1,1],conf.90.upper=c90[1,2],conf.95.lower=c95[1,1],conf.95.upper=c95[1,2]),
        c(id=2,estimate=beta.hat[2],conf.80.lower=c80[2,1],conf.80.upper=c80[2,2],conf.90.lower=c90[2,1],conf.90.upper=c90[2,2],conf.95.lower=c95[2,1],conf.95.upper=c95[2,2]),
        c(id=3,estimate=beta.hat[3],conf.80.lower=c80[3,1],conf.80.upper=c80[3,2],conf.90.lower=c90[3,1],conf.90.upper=c90[3,2],conf.95.lower=c95[3,1],conf.95.upper=c95[3,2]),
        c(id=4,estimate=g.hat      ,conf.80.lower=  g[1,1],conf.80.upper=  g[1,2],conf.90.lower=  g[2,1],conf.90.upper=  g[2,2],conf.95.lower=  g[3,1],conf.95.upper=  g[3,2])
        )
    return(outMat)
}

# Bootstraps

Function to return bootstraps over the three coefficients, and the transformed probability:

In [None]:
pois.bs <- function(formula, data, indices) {
  d <- data[indices,]
  pois.fit <- glm(formula, data=d,family="poisson")
  # Return the coefficients of the glm model and the Prob of a 0 or 1
  return( c( coef(pois.fit), prob.0.1(coef(pois.fit))[1] )  ) 
}

In [None]:
bootstrap.analysis <- function(data){
    # Estimate the model
    est.model.glm <- glm(y~delta1+x2,data=data,family="poisson")
    beta.hat <- unname(coef(est.model.glm))
    # figure out the transformation
    g.hat <- prob.0.1(beta.hat)
    # Bootstrap the drawn data 250 times
    bs.result <- boot(data=data, statistic=pois.bs,R=250, formula=y~ delta1+x2)
    # assemble confidence intervals from the data using a normal fit
    b0 <- boot.ci(bs.result,conf=c(0.8,0.9,0.95),type="norm",index=1)$normal
    b1 <- boot.ci(bs.result,conf=c(0.8,0.9,0.95),type="norm",index=2)$normal
    b2 <- boot.ci(bs.result,conf=c(0.8,0.9,0.95),type="norm",index=3)$normal
    g <- boot.ci(bs.result,conf=c(0.8,0.9,0.95),type="norm",index=4)$normal
    # Output the data in an organized way!
    outMat <- rbind( # join these rows together
            c(id=1,estimate=beta.hat[1],conf.80.lower=b0[1,2],conf.80.upper=b0[1,3],conf.90.lower=b0[2,2],conf.90.upper=b0[2,3],conf.95.lower=b0[3,2],conf.95.upper=b0[3,3]),
            c(id=2,estimate=beta.hat[2],conf.80.lower=b1[1,2],conf.80.upper=b1[1,3],conf.90.lower=b1[2,2],conf.90.upper=b1[2,3],conf.95.lower=b1[3,2],conf.95.upper=b1[3,3]),
            c(id=3,estimate=beta.hat[3],conf.80.lower=b2[1,2],conf.80.upper=b2[1,3],conf.90.lower=b2[2,2],conf.90.upper=b2[2,3],conf.95.lower=b2[3,2],conf.95.upper=b2[3,3]),
            c(id=4,estimate=g.hat      ,conf.80.lower= g[1,2],conf.80.upper= g[1,3],conf.90.lower= g[2,2],conf.90.upper= g[2,3],conf.95.lower= g[3,2],conf.95.upper= g[3,3])
            )
    return(outMat)
}
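
The indexing `[ ,2]` and `[ ,3]` above relies on the layout of the `$normal` component returned by `boot.ci`: one row per confidence level, with the level in column 1 and the lower and upper endpoints in columns 2 and 3. The sketch below shows this on a single simulated data set. The seed, sample size, number of replications, and the name `coef.stat` are arbitrary choices for illustration.

```r
library(boot)

set.seed(123)                                  # arbitrary seed
N <- 200
delta1 <- rbinom(N, 1, 0.6)
x2 <- rchisq(N, df = ifelse(delta1 == 1, 4, 2))
d  <- data.frame(y = rpois(N, exp(-1 - delta1 + 0.5 * x2)),
                 delta1 = delta1, x2 = x2)

# Statistic for boot(): refit the GLM on the resampled rows, return coefficients
coef.stat <- function(data, indices) {
  coef(glm(y ~ delta1 + x2, data = data[indices, ], family = poisson))
}

bs <- boot(data = d, statistic = coef.stat, R = 200)

# Normal-approximation intervals for the third coefficient (beta2)
ci <- boot.ci(bs, conf = c(0.8, 0.9, 0.95), type = "norm", index = 3)$normal
ci   # columns: confidence level, lower endpoint, upper endpoint
```

Note that the normal interval from `boot.ci` is centered on the bias-corrected estimate, so it need not be symmetric around the original point estimate.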

Overall simulation function:

In [None]:
conductSims <- function(N,Nsim=10000) {
    # initialize the matrices
    std.matrix <-matrix(0,nrow=Nsim*4,ncol=8)
    bs.matrix <- matrix(0,nrow=Nsim*4,ncol=8)
    # set the column names
    colnames(std.matrix) <- c('id','estimate','conf.80.lower','conf.80.upper','conf.90.lower','conf.90.upper','conf.95.lower','conf.95.upper')
    colnames(bs.matrix) <-  c('id','estimate','conf.80.lower','conf.80.upper','conf.90.lower','conf.90.upper','conf.95.lower','conf.95.upper')
    for (i in 1:Nsim){
        i.lower <- 4*(i-1)+1 # Get indices in bigger matrix for this simulation as multiple of 4 (1 for i=1, 5 for i=2, etc)
        i.upper <- i.lower+3 # Upper index is i.lower+3 (4 for i=1, 8 for i=2, etc.)
        data <- draw.data(N) # draw a new set of data
        std.matrix[i.lower:i.upper, ] <- standard.analysis(data)  # run the standard analysis on the data
        bs.matrix[i.lower:i.upper, ]  <- bootstrap.analysis(data) # run the bootstrap analysis on the data
        ## If running in RStudio you can uncomment this to get a progress message every 100 repetitions.
        #if (i%%100 ==0){
        #    print(paste0("Completed ",toString(i)," repetitions."))
        #}
    }
    # Now convert the two matrices into data frames
    df.std <- data.frame(std.matrix)
    df.std["type"] <- "standard"
    df.bs <- data.frame(bs.matrix)
    df.bs["type"] <- "bs"
    # and append them together (where we can subset on type, etc later)
    df <- rbind(df.bs,df.std)
    # Now that it is a data frame it can include strings. Here is the ID-to-metric link
    key <- data.frame(id=c(1:4), metric=c("beta0","beta1","beta2","Prob.0.1"))
    # create a metric column
    df['metric'] <- key[ match(df[['id']], key[['id']] ) , 'metric']
    # Output the dataframe in the order (metric, type, data columns 2:8)
    return(df[ ,c(10,9,2:8)])
}

Load the previously saved results:

In [None]:
load(file='comp50.rdata')
load(file='comp100.rdata')
load(file='comp200.rdata')
load(file='comp400.rdata')
load(file='comp800.rdata')
load(file='comp1600.rdata')
load(file='comp3200.rdata')

In [None]:
compare.ci.50["N"]<- 50
compare.ci.100["N"]<- 100
compare.ci.200["N"]<- 200
compare.ci.400["N"]<- 400
compare.ci.800["N"]<- 800
compare.ci.1600["N"]<- 1600
compare.ci.3200["N"]<- 3200

In [None]:
sim.data<- rbind(compare.ci.50,compare.ci.100,compare.ci.200,compare.ci.400,compare.ci.800,compare.ci.1600,compare.ci.3200)

In [None]:
# Define colors
Pitt.Blue<- "#003594"
Pitt.Gold<-"#FFB81C"
Pitt.DGray <- "#75787B"
Pitt.Gray <- "#97999B"
Pitt.LGray <- "#C8C9C7"
# ggplot preferences
library("ggplot2")
library("repr")
options(repr.plot.width=10, repr.plot.height=10/1.68)
Pitt.Theme<-theme( panel.background = element_rect(fill = "white", size = 0.5, linetype = "solid"),
  panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour =Pitt.Gray), 
  panel.grid.minor = element_line(size = 0.25, linetype = 'solid', colour = "white")
  )
base<- ggplot() +aes()+ Pitt.Theme

In [None]:
ggplot(data=subset(sim.data,metric=="Prob.0.1" & type=="bs" & N==3200), aes(x=estimate))+Pitt.Theme+
geom_histogram(binwidth=0.001,color=Pitt.Blue,fill=Pitt.Gold, size=2,aes(y=..density..))+xlab("Outcome Prob")+ylab("Density")+
stat_function(fun = dnorm, args = list(mean = 1, sd = 0.2425328978796))

In [None]:
Nlist=c(50,100,200,400,800,1600)

In [None]:
metricList <- c('beta0','beta1','beta2','Prob.0.1')
actual.sample <- data.frame()
for (m in metricList){ 
    SampleDist <- matrix(0, ncol=7,nrow=length(Nlist))
    for (ii in 1:length(Nlist)) {
        Nn <- Nlist[ii]
       SampleDist[ii , ] <- c(N=Nn,quantile(subset(sim.data,metric==m & type=="bs" & N==Nn)$estimate,c(0.025,0.05,0.1,0.9,0.95,0.975)))
    }
    colnames(SampleDist) <- c("N",'q.025','q.05','q.1','q.9','q.95','q.975')
    df0 <- data.frame(SampleDist)
    df0['metric'] <- m
    actual.sample <- rbind(actual.sample,df0)
}

In [None]:
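ActualInt <- function(stat,N,lower,upper){
    # Copy the argument: inside subset(), N refers to the column N, so N==N would always be TRUE
    N.target <- N
    quantile(subset(sim.data,metric==stat & type=="bs" & N==N.target)$estimate,c(lower,upper) )
}
library(stats)
CoverageInt <- function(stat,N,lower,upper){
    # Same copy as above to avoid the column N masking the argument N
    N.target <- N
    Fn <- ecdf(subset(sim.data,metric==stat & type=="bs" & N==N.target)$estimate)
    Fn(upper)-Fn(lower)
}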
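
One thing to watch for with helpers like these: inside `subset()`, names are looked up in the data frame first, so a function argument that shares its name with a column is masked by that column. The toy example below (made-up data, hypothetical function names) shows the effect.

```r
toy <- data.frame(N = c(50, 50, 100), estimate = c(1, 2, 3))

# Argument N is masked by the column N: the condition compares the column with itself
masked <- function(N) nrow(subset(toy, N == N))

# Copying the argument to a name that is not a column avoids the masking
unmasked <- function(N) {
  N.target <- N
  nrow(subset(toy, N == N.target))
}

masked(100)     # 3: every row is kept
unmasked(100)   # 1: only the row with N equal to 100
```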

# Parameter: $\beta_0$

In [None]:
Type1.beta0 <- subset( sim.data,metric=="beta0" )
for (conf in c(80,90,95)) {
    contStr <- toString(conf)
    contStr.up <- paste0("conf.",toString(conf),".upper")
    contStr.low <-  paste0("conf.",toString(conf),".lower")
    namecol <- paste0("tI.",toString(conf))
    Type1.beta0[namecol] <- ifelse( -1>=Type1.beta0[,contStr.low] & -1<=Type1.beta0[,contStr.up] ,TRUE,FALSE)
}

In [None]:
Error.beta0 <- data.frame(matrix( 0, ncol = 4, nrow = 12))
names(Error.beta0) <- c("N",'tI.80','tI.90','tI.95')
Error.beta0["type"] <- "empty"
ii <- 0
for (typei in c("bs","standard")) {
    for (Nj in Nlist){
        ii <- ii+1
        Error.beta0[ii, c("N",'tI.80','tI.90','tI.95','type')] <- list(Nj, 
                 round(100*(mean( subset(Type1.beta0, type==typei & N==Nj)$tI.80 )),4) , 
                 round(100*(mean( subset(Type1.beta0, type==typei & N==Nj)$tI.90 )),4) ,
                 round(100*(mean( subset(Type1.beta0, type==typei & N==Nj)$tI.95 )),4) ,typei)             
    }
}
Error.beta0['metric'] <- 'beta0'
Error.beta0
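
The same kind of coverage table can also be built in one step with `aggregate`, which may be easier to reuse across metrics. This is only a sketch on a made-up stand-in for the simulation output: the data frame `toy`, its values, and the fixed interval half-width are assumptions for illustration, not results from the simulations.

```r
set.seed(7)                                    # arbitrary seed
# Toy stand-in for simulation output (all values invented for illustration)
toy <- data.frame(
  type = rep(c("bs", "standard"), each = 200),
  N    = rep(c(50, 100), times = 200),
  est  = rnorm(400, mean = -1, sd = 0.3)
)
toy$conf.95.lower <- toy$est - 1.96 * 0.3
toy$conf.95.upper <- toy$est + 1.96 * 0.3

truth <- -1                                    # true value of the parameter
# TRUE when the interval covers the true value
toy$covered <- toy$conf.95.lower <= truth & truth <= toy$conf.95.upper

# Percentage of intervals covering the truth, by method and sample size
coverage <- aggregate(covered ~ type + N, data = toy,
                      FUN = function(z) 100 * mean(z))
coverage
```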

# Parameter: $\beta_1$

In [None]:
Type1.beta1 <- subset( sim.data,metric=="beta1" )
for (conf in c(80,90,95)) {
    contStr <- toString(conf)
    contStr.up <- paste0("conf.",toString(conf),".upper")
    contStr.low <-  paste0("conf.",toString(conf),".lower")
    namecol <- paste0("tI.",toString(conf))
    Type1.beta1[namecol] <- ifelse( -1>=Type1.beta1[,contStr.low] & -1<=Type1.beta1[,contStr.up] ,TRUE,FALSE)
}

In [None]:
Error.beta1 <- data.frame(matrix( 0, ncol = 4, nrow = 12))
names(Error.beta1) <- c("N",'tI.80','tI.90','tI.95')
Error.beta1["type"] <- "empty"
ii <- 0
for (typei in c("bs","standard")) {
    for (Nj in Nlist){
        ii <- ii+1
        Error.beta1[ii, c("N",'tI.80','tI.90','tI.95','type')] <- list(Nj, 
                 round(100*(mean( subset(Type1.beta1, type==typei & N==Nj)$tI.80 )),4) , 
                 round(100*(mean( subset(Type1.beta1, type==typei & N==Nj)$tI.90 )),4) ,
                 round(100*(mean( subset(Type1.beta1, type==typei & N==Nj)$tI.95 )),4) ,typei)             
    }
}
Error.beta1['metric'] <- 'beta1'
Error.beta1

# Parameter: $\beta_2$

In [None]:
Type1.beta2 <- subset( sim.data,metric=="beta2" )
for (conf in c(80,90,95)) {
    contStr <- toString(conf)
    contStr.up <- paste0("conf.",toString(conf),".upper")
    contStr.low <-  paste0("conf.",toString(conf),".lower")
    namecol <- paste0("tI.",toString(conf))
    Type1.beta2[namecol] <- ifelse( 1/2 >=Type1.beta2[ ,contStr.low] & 1/2 <=Type1.beta2[ ,contStr.up] ,TRUE,FALSE)
}

In [None]:
Error.beta2 <- data.frame(matrix( 0, ncol = 4, nrow = 12))
names(Error.beta2) <- c("N",'tI.80','tI.90','tI.95')
Error.beta2["type"] <- "empty"
ii <- 0
for (typei in c("bs","standard") )  {
    for (Nj in Nlist){
        ii <- ii+1
        Error.beta2[ii, c("N",'tI.80','tI.90','tI.95','type')] <- list(Nj, 
                 round(100*(mean( subset(Type1.beta2, type==typei & N==Nj)$tI.80 )),4) , 
                 round(100*(mean( subset(Type1.beta2, type==typei & N==Nj)$tI.90 )),4) ,
                 round(100*(mean( subset(Type1.beta2, type==typei & N==Nj)$tI.95 )),4) ,typei)             
    }
}
Error.beta2['metric'] <- 'beta2'
Error.beta2

# Parameter: $\Pr\left\{y\leq1\right\}$

In [None]:
Type1.prob.0.1 <- subset( sim.data,metric=="Prob.0.1" )
true.p <- prob.0.1(c(-1,-1,1/2))
for (conf in c(80,90,95)) {
    contStr <- toString(conf)
    contStr.up <- paste0("conf.",toString(conf),".upper")
    contStr.low <-  paste0("conf.",toString(conf),".lower")
    namecol <- paste0("tI.",toString(conf))
    Type1.prob.0.1[namecol] <- ifelse( true.p>=Type1.prob.0.1[,contStr.low] & true.p<=Type1.prob.0.1[,contStr.up] ,TRUE,FALSE)
}

In [None]:
Error.prob.0.1 <- data.frame(matrix( 0, ncol = 4, nrow = 12))
names(Error.prob.0.1) <- c("N",'tI.80','tI.90','tI.95')
Error.prob.0.1["type"] <- "empty"
ii <- 0
for (typei in c("bs","standard")) {
    for (Nj in Nlist){
        ii <- ii+1
        Error.prob.0.1[ii, c("N",'tI.80','tI.90','tI.95','type')] <- list(Nj, 
                 round(100*(mean( subset(Type1.prob.0.1, type==typei & N==Nj)$tI.80 )),4) , 
                 round(100*(mean( subset(Type1.prob.0.1, type==typei & N==Nj)$tI.90 )),4) ,
                 round(100*(mean( subset(Type1.prob.0.1, type==typei & N==Nj)$tI.95 )),4) ,typei)             
    }
}
Error.prob.0.1['metric'] <- 'prob.0.1'
Error.prob.0.1

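## Bring the results together
Put the results into one table. The `tI.*` columns hold the percentage of simulations in which the interval covers the true value, so `100/(100-tI)` is the inverse of the empirical Type I error rate.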

In [None]:
allResults <- rbind(Error.beta0,Error.beta1,Error.beta2,Error.prob.0.1)
allResults$tI.80ratio <-100/(100-allResults$tI.80)
allResults$tI.90ratio <-(100)/(100-allResults$tI.90)
allResults$tI.95ratio <-(100)/(100-allResults$tI.95)
subset(allResults,metric=="beta0")
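
To help read the ratio columns, the sketch below converts a few coverage percentages into the inverse error rate and gives a rough sense of the Monte Carlo noise around the nominal 95 percent level. The value `Nsim = 10000` is an assumption taken from the default argument of `conductSims`; the saved results may have used a different number of repetitions.

```r
# Coverage (in percent) mapped to the inverse of the Type I error rate
coverage <- c(94, 95, 96)
inv.rate <- 100 / (100 - coverage)
inv.rate                                  # approximately 16.67, 20, 25

# Monte Carlo standard error of an estimated coverage, in percentage points
Nsim  <- 10000                            # assumed number of simulations
mc.se <- 100 * sqrt(0.95 * 0.05 / Nsim)

# Inverse error rates at nominal coverage plus/minus two standard errors
band <- 100 / (100 - (95 + c(-2, 2) * mc.se))
band
```

Because the ratio is nonlinear in the coverage, small differences in coverage near 95 percent translate into visibly different points on the plots, so deviations within `band` should not be over-interpreted.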

$\beta_0$ 90 percent, where I present the result as the inverse of the Type I error rate (out of how many draws you expect one false rejection of the null; the nominal value is 10):

In [None]:
ggplot(data=subset(allResults,metric=="beta0"), aes(x=N, y=tI.90ratio, color=type)  )+geom_point(size=5) +
theme(legend.position='top')+ Pitt.Theme + 
geom_hline(aes(yintercept=10,color="Nominal"),size=1) +
xlab("N")+ylab("Type I - inverse odds ratio ") +scale_x_continuous(trans='log10') +
scale_color_manual(name='',values=c("bs"= Pitt.Blue,"standard"= Pitt.Gold,"Nominal"="#DC582A") )

$\beta_0$ 95 percent:

In [None]:
ggplot(data=subset(allResults,metric=="beta0"), aes(x=N, y=tI.95ratio, color=type)  )+geom_point(size=5) +
theme(legend.position='top')+ Pitt.Theme + 
geom_hline(aes(yintercept=20,color="Nominal"),size=1) +
xlab("N")+ylab("Type I - inverse odds ratio") +scale_x_continuous(trans='log10') +
scale_color_manual(name='',values=c("bs"= Pitt.Blue,"standard"= Pitt.Gold,"Nominal"="#DC582A") )

$\beta_1$:

In [None]:
ggplot(data=subset(allResults,metric=="beta1"), aes(x=N, y=tI.95ratio, color=type)  )+geom_point(size=5) +
theme(legend.position='top')+ Pitt.Theme + 
geom_hline(aes(yintercept=20,color="Nominal"),size=1) +
xlab("N")+ylab("Type I - inverse odds ratio") +scale_x_continuous(trans='log10') +
scale_color_manual(name='',values=c("bs"= Pitt.Blue,"standard"= Pitt.Gold,"Nominal"="#DC582A") )

$\beta_2$:

In [None]:
ggplot(data=subset(allResults,metric=="beta2"), aes(x=N, y=tI.95ratio, color=type)  )+geom_point(size=5) +
theme(legend.position='top')+ Pitt.Theme + 
geom_hline(aes(yintercept=20,color="Nominal"),size=1) +
xlab("N")+ylab("Type I - inverse odds ratio") +scale_x_continuous(trans='log10') +
scale_color_manual(name='',values=c("bs"= Pitt.Blue,"standard"= Pitt.Gold,"Nominal"="#DC582A") )

Draw the graph for the probability comparison:

In [None]:
ggplot(data=subset(allResults,metric=="prob.0.1"), aes(x=N, y=tI.95ratio, color=type)  )+geom_point(size=5) +
theme(legend.position='top')+ Pitt.Theme + 
geom_hline(aes(yintercept=20,color="Nominal"),size=1) +
xlab("N")+ylab("Type I - inverse odds ratio") +scale_x_continuous(trans='log10') +
scale_color_manual(name='',values=c("bs"= Pitt.Blue,"standard"= Pitt.Gold,"Nominal"="#DC582A") )