---
title: "MRP with rstanarm"
author: "Lauren Kennedy and Jonah Gabry"
date: "`r Sys.Date()`"
output:
  html_vignette:
    toc: yes
bibliography: mrp-files/mrp.bib
---
<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{MRP with rstanarm}
-->
```{r, child="children/SETTINGS-knitr.txt"}
```

```{r packages-1, message=FALSE}
library(rstanarm)
library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default())
# options(mc.cores = 4)
```

```{r packages-2, eval=FALSE, message=FALSE}
library(dplyr)
library(tidyr)
```
Inference about the population is one of the main aims of statistical methodology. Multilevel regression and post-stratification (MRP) [@little1993post; @lax2009should; @park2004bayesian] has been shown to be an effective method of adjusting the sample to be more representative of the population for a set of key variables. Recent work has demonstrated the effectiveness of MRP when there are a number of suspected interactions between these variables [@ghitza2013deep], replicated by @lei20172008. While @ghitza2013deep use approximate marginal maximum likelihood estimates, @lei20172008 implement a fully Bayesian approach through Stan.

The **rstanarm** package allows the user to conduct complicated regression analyses in Stan with the simplicity of standard formula notation in R. The purpose of this vignette is to demonstrate the utility of **rstanarm** when conducting MRP analyses. We will not delve into the details of conducting logistic regression with **rstanarm**, as this is already covered in [other vignettes](https://mc-stan.org/rstanarm/articles/).

Most of the code for data manipulation and plotting is not shown in the text but is available in the R markdown [source code on GitHub](https://github.com/stan-dev/rstanarm/blob/master/vignettes/mrp.Rmd).
```{r, include=FALSE, collapse=TRUE}
simulate_mrp_data <- function(n) {
  J <- c(2, 3, 7, 3, 50) # male or not, eth, age, income level, state
  poststrat <- as.data.frame(array(NA, c(prod(J), length(J) + 1))) # columns of post-strat matrix, plus one for size
  colnames(poststrat) <- c("male", "eth", "age", "income", "state", "N")
  count <- 0
  for (i1 in 1:J[1]) {
    for (i2 in 1:J[2]) {
      for (i3 in 1:J[3]) {
        for (i4 in 1:J[4]) {
          for (i5 in 1:J[5]) {
            count <- count + 1
            # Fill them in so we know what category we are referring to
            poststrat[count, 1:5] <- c(i1 - 1, i2, i3, i4, i5)
          }
        }
      }
    }
  }
  # Proportion in each category in the population
  p_male <- c(0.52, 0.48)
  p_eth <- c(0.5, 0.2, 0.3)
  p_age <- c(0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.1)
  p_income <- c(0.50, 0.35, 0.15)
  p_state_tmp <- runif(50, 10, 20)
  p_state <- p_state_tmp / sum(p_state_tmp)
  poststrat$N <- 0
  for (j in 1:prod(J)) {
    # N is the population count in each poststratification cell
    poststrat$N[j] <- round(250e6 * p_male[poststrat[j, 1] + 1] * p_eth[poststrat[j, 2]] *
                              p_age[poststrat[j, 3]] * p_income[poststrat[j, 4]] * p_state[poststrat[j, 5]])
  }
  # Now let's adjust for the probability of response
  p_response_baseline <- 0.01
  p_response_male <- c(2, 0.8) / 2.8
  p_response_eth <- c(1, 1.2, 2.5) / 4.7
  p_response_age <- c(1, 0.4, 1, 1.5, 3, 5, 7) / 18.9
  p_response_inc <- c(1, 0.9, 0.8) / 2.7
  p_response_state <- rbeta(50, 1, 1)
  p_response_state <- p_response_state / sum(p_response_state)
  p_response <- rep(NA, prod(J))
  for (j in 1:prod(J)) {
    p_response[j] <-
      p_response_baseline * p_response_male[poststrat[j, 1] + 1] *
      p_response_eth[poststrat[j, 2]] * p_response_age[poststrat[j, 3]] *
      p_response_inc[poststrat[j, 4]] * p_response_state[poststrat[j, 5]]
  }
  people <- sample(prod(J), n, replace = TRUE, prob = poststrat$N * p_response)
  ## For respondent i, people[i] is that person's poststratification cell,
  ## some number between 1 and prod(J)
  n_cell <- rep(NA, prod(J))
  for (j in 1:prod(J)) {
    n_cell[j] <- sum(people == j)
  }
  coef_male <- c(0, -0.3)
  coef_eth <- c(0, 0.6, 0.9)
  coef_age <- c(0, -0.2, -0.3, 0.4, 0.5, 0.7, 0.8, 0.9)
  coef_income <- c(0, -0.2, 0.6)
  coef_state <- c(0, round(rnorm(49, 0, 1), 1))
  coef_age_male <- t(cbind(c(0, .1, .23, .3, .43, .5, .6),
                           c(0, -.1, -.23, -.5, -.43, -.5, -.6)))
  true_popn <- data.frame(poststrat[, 1:5], cat_pref = rep(NA, prod(J)))
  for (j in 1:prod(J)) {
    true_popn$cat_pref[j] <- plogis(
      coef_male[poststrat[j, 1] + 1] +
        coef_eth[poststrat[j, 2]] + coef_age[poststrat[j, 3]] +
        coef_income[poststrat[j, 4]] + coef_state[poststrat[j, 5]] +
        coef_age_male[poststrat[j, 1] + 1, poststrat[j, 3]]
    )
  }
  # male or not, eth, age, income level, state
  y <- rbinom(n, 1, true_popn$cat_pref[people])
  male <- poststrat[people, 1]
  eth <- poststrat[people, 2]
  age <- poststrat[people, 3]
  income <- poststrat[people, 4]
  state <- poststrat[people, 5]
  sample <- data.frame(cat_pref = y,
                       male, age, eth, income, state,
                       id = 1:length(people))
  # Make all columns numeric
  for (i in 1:ncol(poststrat)) {
    poststrat[, i] <- as.numeric(poststrat[, i])
  }
  for (i in 1:ncol(true_popn)) {
    true_popn[, i] <- as.numeric(true_popn[, i])
  }
  for (i in 1:ncol(sample)) {
    sample[, i] <- as.numeric(sample[, i])
  }
  list(
    sample = sample,
    poststrat = poststrat,
    true_popn = true_popn
  )
}
```
# The Data

Three data sets are simulated by the function `simulate_mrp_data()`, which is defined in the [source code](https://github.com/stan-dev/rstanarm/blob/master/vignettes/mrp.Rmd) for this R markdown document (and printed in the appendix). The first, `sample`, contains $n$ observations from the individuals that form our sample (i.e., $n$ rows). For each individual we have their age (recorded as membership within a specific age bracket), ethnicity, income level (recorded as membership within a specific bracket), and gender. Participants were also randomly sampled from a state.

MRP is often used for dichotomous fixed-choice questions (e.g., McCain's share of the two-party vote [@ghitza2013deep], support for George W. Bush [@park2004bayesian], or support for the death penalty [@shirley2015hierarchical]), so we will use a binary variable as the outcome in this vignette. However, MRP can also be used if there are more than two categories or if the outcome is continuous.

As this is a simple toy example, we will describe the proportion of the population who would choose to adopt a cat over a dog, given the opportunity. We will simulate data using a function that is included in the appendix of this document. The `simulate_mrp_data()` function simulates a sample from a much larger population. It returns a list including the sample, the population poststratification matrix, and the true population preference for cats.
```{r include=FALSE, eval=FALSE}
mrp_sim <- simulate_mrp_data(n = 1200)
save(mrp_sim, file = "mrp-files/mrp_sim.rda", version = 2)
```

```{r eval=FALSE}
mrp_sim <- simulate_mrp_data(n = 1200)
str(mrp_sim)
```

```{r, echo=FALSE}
load("mrp-files/mrp_sim.rda")
str(mrp_sim)
```

```{r, message=FALSE}
sample <- mrp_sim[["sample"]]
rbind(head(sample), tail(sample))
```
The variables describing the individual (age, ethnicity, income level and gender) will be used to match the sample to the population of interest. To do this we will need to form a post-stratification table, which contains the number of people in each possible combination of the post-stratification variables. We have 4 variables with 2 (male), 7 (age), 3 (ethnicity) and 3 (income) levels, so there are 2 x 7 x 3 x 3 = 126 different combinations. Participants are also selected from one of 50 states, which increases the number of possible combinations to $6300$.
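
As a quick arithmetic check, the implied number of poststratification cells can be computed directly. This is a small illustrative chunk (not evaluated when the vignette is built), not part of the original analysis:

```{r, eval=FALSE}
# gender (2) x age (7) x ethnicity (3) x income (3) x state (50)
prod(c(2, 7, 3, 3, 50))  # 6300, one row per cell in the poststratification table
```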
To make inference about the population, we will also need the proportion of individuals in each post-stratification cell at the *population* level. We will use this information to update the estimate of our outcome variable from the sample so that it is more representative of the population. This is particularly helpful if there is reason to believe that the sample is biased (e.g., a greater proportion of women responded than men), and that the bias affects the outcome variable (e.g., maybe women are more likely to pick a cat than men). For each possible combination of factors, the post-stratification table gives the proportion/number of the *population* in that cell (rather than the proportion/number of the sample in that cell).

Below we extract the poststratification table from our simulated data list.
```{r message=FALSE}
poststrat <- mrp_sim[["poststrat"]]
rbind(head(poststrat), tail(poststrat))
```
One of the benefits of using a simulated data set for this example is that the actual population-level probability of cat preference is known for each post-stratification cell. In real-world data analysis we don't have this luxury, but we will use it later in this case study to check the predictions of the model. Details regarding the simulation of these data are available in the appendix.
```{r message=FALSE}
true_popn <- mrp_sim[["true_popn"]]
rbind(head(true_popn), tail(true_popn))
```
# Exploring Graphically

Before we begin with the MRP analysis, we first explore the data set with some basic visualizations.

## Comparing sample to population

The aim of this analysis is to obtain a *population* estimate of cat preference given our sample of $1200$. The following plot shows the difference in proportions between the sample and the population. Horizontal panels represent each variable. Points and lines show the proportion of the sample (solid) and population (dashed) in each category (x-axis). For ease of viewing, we order the states by the proportion of the sample observed in each state. We will continue this formatting choice throughout this vignette.
```{r order-states}
sample$state <- factor(sample$state, levels = 1:50)
sample$state <- with(sample, factor(state, levels = order(table(state))))
true_popn$state <- factor(true_popn$state, levels = levels(sample$state))
poststrat$state <- factor(poststrat$state, levels = levels(sample$state))
```
```{r state-and-pop-data-for-plots, eval=FALSE, include=FALSE}
# not evaluated to avoid tidyverse dependency
income_popn <- poststrat %>%
  group_by(income) %>%
  summarize(Num = sum(N)) %>%
  mutate(PROP = Num / sum(Num), TYPE = 'Popn', VAR = 'Income', CAT = income) %>%
  ungroup()
income_data <- sample %>%
  group_by(income) %>%
  summarise(Num = n()) %>%
  mutate(PROP = Num / sum(Num), TYPE = 'Sample', VAR = 'Income', CAT = income) %>%
  ungroup()
income <- rbind(income_data[, 2:6], income_popn[, 2:6])
age_popn <- poststrat %>%
  group_by(age) %>%
  summarize(Num = sum(N)) %>%
  mutate(PROP = Num / sum(Num), TYPE = 'Popn', VAR = 'Age', CAT = age) %>%
  ungroup()
age_data <- sample %>%
  group_by(age) %>%
  summarise(Num = n()) %>%
  mutate(PROP = Num / sum(Num), TYPE = 'Sample', VAR = 'Age', CAT = age) %>%
  ungroup()
age <- rbind(age_data[, 2:6], age_popn[, 2:6])
eth_popn <- poststrat %>%
  group_by(eth) %>%
  summarize(Num = sum(N)) %>%
  mutate(PROP = Num / sum(Num), TYPE = 'Popn', VAR = 'Ethnicity', CAT = eth) %>%
  ungroup()
eth_data <- sample %>%
  group_by(eth) %>%
  summarise(Num = n()) %>%
  mutate(PROP = Num / sum(Num), TYPE = 'Sample', VAR = 'Ethnicity', CAT = eth) %>%
  ungroup()
eth <- rbind(eth_data[, 2:6], eth_popn[, 2:6])
male_popn <- poststrat %>%
  group_by(male) %>%
  summarize(Num = sum(N)) %>%
  mutate(PROP = Num / sum(Num), TYPE = 'Popn', VAR = 'Male', CAT = male) %>%
  ungroup()
male_data <- sample %>%
  group_by(male) %>%
  summarise(Num = n()) %>%
  mutate(PROP = Num / sum(Num), TYPE = 'Sample', VAR = 'Male', CAT = male) %>%
  ungroup()
male <- rbind(male_data[, 2:6], male_popn[, 2:6])
state_popn <- poststrat %>%
  group_by(state) %>%
  summarize(Num = sum(N)) %>%
  mutate(PROP = Num / sum(poststrat$N), TYPE = 'Popn', VAR = 'State', CAT = state) %>%
  ungroup()
state_plot_data <- sample %>%
  group_by(state) %>%
  summarise(Num = n()) %>%
  mutate(PROP = Num / nrow(sample), TYPE = 'Sample', VAR = 'State', CAT = state) %>%
  ungroup()
state_plot_data <- rbind(state_plot_data[, 2:6], state_popn[, 2:6])
state_plot_data$TYPE <- factor(state_plot_data$TYPE, levels = c("Sample", "Popn"))
plot_data <- rbind(male, eth, age, income)
plot_data$TYPE <- factor(plot_data$TYPE, levels = c("Sample", "Popn"))
save(state_plot_data, file = "mrp-files/state_plot_data.rda", version = 2)
save(plot_data, file = "mrp-files/plot_data.rda", version = 2)
```
```{r plot-data, echo=FALSE, fig.height = 4, fig.width = 7, fig.align = "center"}
load("mrp-files/plot_data.rda") # created in previous chunk
ggplot(data = plot_data, aes(x = as.factor(CAT), y = PROP, group = as.factor(TYPE), linetype = as.factor(TYPE))) +
  geom_point(stat = "identity", colour = 'black') +
  geom_line() +
  facet_wrap(~ VAR, scales = "free", nrow = 1, ncol = 5) +
  theme_bw() +
  scale_fill_manual(values = c('#1f78b4', '#33a02c',
                               '#e31a1c', '#ff7f00', '#8856a7'), guide = FALSE) +
  scale_y_continuous(breaks = c(0, .25, .5, .75, 1), labels = c('0%', '25%', "50%", "75%", "100%")) +
  scale_alpha_manual(values = c(1, .3)) +
  ylab('Proportion') +
  labs(alpha = '') +
  theme(legend.position = "bottom",
        axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        legend.title = element_blank(),
        legend.text = element_text(size = 10),
        axis.text = element_text(size = 10),
        strip.text = element_text(size = 10),
        strip.background = element_rect(fill = 'grey92'))

load("mrp-files/state_plot_data.rda") # created in previous chunk
ggplot(data = state_plot_data, aes(x = as.factor(CAT), y = PROP, group = as.factor(TYPE), linetype = as.factor(TYPE))) +
  geom_point(stat = "identity", colour = 'black') +
  geom_line() +
  facet_wrap(~ VAR) +
  theme_bw() +
  scale_fill_manual(values = c('#1f78b4', '#33a02c',
                               '#e31a1c', '#ff7f00', '#8856a7'), guide = FALSE) +
  scale_y_continuous(breaks = c(0, .025, .05, 1), labels = c('0%', '2.5%', "5%", "100%"), expand = c(0, 0), limits = c(0, .06)) +
  scale_alpha_manual(values = c(1, .3)) +
  ylab('Proportion') +
  labs(alpha = '') +
  theme(legend.position = "bottom",
        axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        legend.title = element_blank(),
        legend.text = element_text(size = 10),
        axis.text.y = element_text(size = 10),
        axis.text.x = element_text(size = 8, angle = 90),
        strip.text = element_text(size = 10),
        strip.background = element_rect(fill = 'grey92'))
```
## Effect of the post-stratification variables on preference for cats

Secondly, we consider the evidence for different proportions across different levels of a post-stratification variable, which we should consider for each of the post-stratification variables. Here we break down the proportion of individuals who would prefer a cat (*y-axis*) by different levels (*x-axis*) of each post-stratification variable (*horizontal panels*). We can see from this figure that there appear to be differences in cat preference across the levels of the post-stratification variables. Given the previous figure, which suggested that the sample differed from the population in the share of these levels, this suggests that using the raw sample to estimate cat preference may not give accurate estimates of cat preference in the population.
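
The summaries behind this figure are computed with tidyverse code (shown in the source but not evaluated). For readers who want a quick check without that dependency, a rough base-R equivalent for a single variable might look like the following sketch; it reproduces the cell means and Wald standard errors for one variable, not the full figure:

```{r, eval=FALSE}
# Proportion preferring cats by age bracket, with a rough Wald standard error
pref_by_age <- aggregate(cat_pref ~ age, data = sample, FUN = function(x) {
  c(y_mean = mean(x), y_sd = sqrt(mean(x) * (1 - mean(x)) / length(x)))
})
pref_by_age
```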
```{r, eval=FALSE, echo=FALSE}
# not evaluated to avoid dependency on tidyverse
# Summarise
summary_by_poststrat_var <- sample %>%
  gather(variable, category, c("income", "eth", "age", "male")) %>%
  group_by(variable, category) %>%
  # Wald confidence interval
  summarise(y_mean = mean(cat_pref), y_sd = sqrt(mean(cat_pref) * (1 - mean(cat_pref)) / n())) %>%
  ungroup()
summary_by_poststrat_var$variable <- as.factor(summary_by_poststrat_var$variable)
levels(summary_by_poststrat_var$variable) <- list('Age' = 'age', 'Ethnicity' = 'eth', 'Income' = 'income', 'Male' = 'male')
save(summary_by_poststrat_var, file = "mrp-files/summary_by_poststrat_var.rda",
     version = 2)
```
```{r plot-summary-by-poststrat-var, echo=FALSE, fig.height = 4, fig.width = 7, fig.align = "center"}
load("mrp-files/summary_by_poststrat_var.rda") # created in previous chunk
ggplot(data = summary_by_poststrat_var, aes(x = as.factor(category), y = y_mean, group = 1)) +
  geom_errorbar(aes(ymin = y_mean - y_sd, ymax = y_mean + y_sd), width = 0) +
  geom_line() +
  geom_point() +
  scale_colour_manual(values = c('#1f78b4', '#33a02c', '#e31a1c', '#ff7f00',
                                 '#8856a7')) +
  theme_bw() +
  facet_wrap(~ variable, scales = "free_x", nrow = 1, ncol = 5) +
  scale_y_continuous(breaks = c(.5, .75, 1), labels = c("50%", "75%", "100%"),
                     limits = c(0.4 - .4 * .05, .9), expand = c(0, 0)) +
  labs(x = "", y = "Cat preference") +
  theme(legend.position = "none",
        axis.title.y = element_text(size = 10),
        axis.title.x = element_blank(),
        axis.text = element_text(size = 10),
        strip.text = element_text(size = 10),
        strip.background = element_rect(fill = 'grey92'))
```
## Interaction effect

Thirdly, we demonstrate visually that there is an interaction between age and gender, and compare it to a case where there is no interaction. Here a simulated interaction effect between age (*x-axis*) and gender (*colour*), shown in the right panel, is contrasted with the absence of an interaction effect (*left panel*). While both panels show a difference between the genders on the outcome variable (*y-axis*), only the right panel shows this difference changing with the variable on the x-axis.
```{r interaction-summary, eval=FALSE, echo=FALSE}
# not evaluated to avoid dependency on tidyverse
# Summarise
interaction <- sample %>%
  gather(variable, category, c("age", "eth")) %>%
  group_by(variable, category, male) %>%
  summarise(y_mean = mean(cat_pref),
            y_sd = sqrt(mean(cat_pref) * (1 - mean(cat_pref)) / n())) %>%
  ungroup()
# Tidy for nice facet labels
interaction$variable <- as.factor(interaction$variable)
levels(interaction$variable) <- list('Ethnicity' = 'eth', 'Age' = 'age')
save(interaction, file = "mrp-files/interaction.rda", version = 2)
```
```{r plot-interaction, echo=FALSE, fig.height = 4, fig.width = 7, fig.align = "center"}
load("mrp-files/interaction.rda") # created in previous chunk
ggplot(data = interaction, aes(x = as.factor(category), y = y_mean, colour = as.factor(male), group = as.factor(male))) +
  geom_errorbar(aes(ymin = y_mean - y_sd, ymax = y_mean + y_sd), width = 0) +
  geom_line(aes(x = as.factor(category), y = y_mean, colour = as.factor(male))) +
  geom_point() +
  facet_wrap(~ variable, scales = "free_x", nrow = 1, ncol = 2) +
  labs(x = "", y = "Cat preference", colour = 'Gender') +
  scale_y_continuous(breaks = c(0, .25, .5, .75, 1),
                     labels = c("0%", '25%', "50%", "75%", "100%"),
                     limits = c(0, 1), expand = c(0, 0)) +
  scale_colour_manual(values = c('#4575b4', '#d73027')) +
  theme_bw() +
  theme(axis.title = element_text(size = 10),
        axis.text = element_text(size = 10),
        legend.position = 'none',
        strip.text = element_text(size = 10),
        strip.background = element_rect(fill = 'grey92'))
```
## Design effect

Lastly, we look at the difference in cat preference between states, which will form the basis for the multilevel component of our analysis. Participants were randomly selected from particular states. Plotting the state (*x-axis*) against the overall proportion of participants who prefer cats (*y-axis*) demonstrates the state-level differences. The downward slope is because we ordered the x-axis by the proportion preferring cats, for ease of viewing. We also include a second plot with a horizontal line representing the overall preference for cats in the total population, according to the sample.
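
Before any modelling, the raw state-level variation discussed above can also be eyeballed with a one-liner (an illustrative sketch, not evaluated here; it reproduces the point estimates in the plot, not the figure itself):

```{r, eval=FALSE}
# Raw cat preference by state, ordered from lowest to highest
sort(tapply(sample$cat_pref, sample$state, mean))
```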
```{r, eval=FALSE, echo=FALSE}
# not evaluated to avoid dependency on tidyverse
# Summarise by state
preference_by_state <- sample %>%
  group_by(state) %>%
  summarise(y_mean = mean(cat_pref),
            y_sd = sqrt(mean(cat_pref) * (1 - mean(cat_pref)) / n())) %>%
  ungroup()
save(preference_by_state, file = "mrp-files/preference_by_state.rda", version = 2)
```
```{r, echo=FALSE, fig.height = 4, fig.width = 8, fig.align = "center"}
load("mrp-files/preference_by_state.rda")
compare <- ggplot(data = preference_by_state, aes(x = state, y = y_mean, group = 1)) +
  geom_ribbon(aes(ymin = y_mean - y_sd, ymax = y_mean + y_sd, x = state), fill = 'lightgrey', alpha = .7) +
  geom_line(aes(x = state, y = y_mean)) +
  geom_point() +
  scale_y_continuous(breaks = c(0, .25, .5, .75, 1),
                     labels = c("0%", "25%", "50%", "75%", "100%"),
                     limits = c(0, 1), expand = c(0, 0)) +
  scale_x_discrete(drop = FALSE) +
  scale_colour_manual(values = c('#1f78b4', '#33a02c', '#e31a1c', '#ff7f00',
                                 '#8856a7')) +
  theme_bw() +
  labs(x = "States", y = "Cat preference") +
  theme(legend.position = "none",
        axis.title = element_text(size = 10),
        axis.text.y = element_text(size = 10),
        axis.text.x = element_text(angle = 90, size = 8),
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10))

compare2 <- ggplot() +
  geom_hline(yintercept = mean(sample$cat_pref), size = .8) +
  geom_text(aes(x = 5.2, y = mean(sample$cat_pref) + .025, label = "Sample")) +
  scale_y_continuous(breaks = c(0, .25, .5, .75, 1),
                     labels = c("0%", "25%", "50%", "75%", "100%"),
                     limits = c(-0.25, 1.25), expand = c(0, 0)) +
  theme_bw() +
  labs(x = "Popn", y = "") +
  theme(legend.position = "none",
        axis.title.y = element_blank(),
        axis.title.x = element_text(size = 10),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10))

bayesplot_grid(compare, compare2,
               grid_args = list(nrow = 1, widths = c(8, 1)))
```
# MRP with rstanarm

From visual inspection, it appears that different levels of the post-stratification variables have different preferences for cats. Our survey also appears to have sampling bias, indicating that some groups were over- or under-sampled relative to the population. The net effect of this is that we cannot make good population-level estimates of cat preference straight from our sample. Our aim is to infer the preference for cats in the *population* using the post-stratification variables to account for systematic differences between the sample and the population. With rstanarm, this becomes a simple procedure.
The first step is to use a multilevel logistic regression model to predict preference for cats in the sample given the variables that we will use to post-stratify. Note that we actually have more rows in the post-stratification matrix than we have observed units, so there are some cells in the poststrat matrix that we don't observe. We can use a multilevel model to partially pool information across the different levels within each variable to assist with this. In the model described below, we use a fixed intercept for gender and hierarchically modeled varying intercepts for each of the other factors.
Let $\theta_{j}$ denote the preference for cats in the $j$th poststratification cell. The non-hierarchical part of the model can be written as

$$\theta_j = \mathrm{logit}^{-1}(X_{j}\beta),$$

where here $X$ only contains an indicator for male or female and an interaction term with age.
Adding the varying intercepts for the other variables, the model becomes

$$
\theta_j = \mathrm{logit}^{-1}(
X_{j}\beta
+ \alpha_{\rm state[j]}^{\rm state}
+ \alpha_{\rm age[j]}^{\rm age}
+ \alpha_{\rm eth[j]}^{\rm eth}
+ \alpha_{\rm inc[j]}^{\rm inc}
)
$$

with

$$
\begin{align*}
\alpha_{\rm state[j]}^{\rm state} & \sim N(0, \sigma^{\rm state}) \\
\alpha_{\rm age[j]}^{\rm age} & \sim N(0, \sigma^{\rm age}) \\
\alpha_{\rm eth[j]}^{\rm eth} & \sim N(0, \sigma^{\rm eth}) \\
\alpha_{\rm inc[j]}^{\rm inc} & \sim N(0, \sigma^{\rm inc})
\end{align*}
$$
Each of $\sigma^{\rm state}$, $\sigma^{\rm age}$, $\sigma^{\rm eth}$, and $\sigma^{\rm inc}$ is estimated from the data (in this case using rstanarm's default priors). This is beneficial because it means we share information between the levels of each variable and can prevent levels with less data from being overly sensitive to the few observed values. It also helps with the levels we don't observe at all: for those, the model borrows information from the levels that we do observe. For more on the benefits of this type of model, see @gelman2005analysis, and see @ghitza2013deep and @si2017bayesian for more complicated extensions that involve deep interactions and structured prior distributions.
Here is the model specified using the `stan_glmer()` function in rstanarm, which uses the same formula syntax as the `glmer()` function from the lme4 package:

```{r, message=FALSE, warning=FALSE, results='hide'}
fit <- stan_glmer(
  cat_pref ~ factor(male) + factor(male) * factor(age) +
    (1 | state) + (1 | age) + (1 | eth) + (1 | income),
  family = binomial(link = "logit"),
  data = sample
)
```

```{r}
print(fit)
```
As a first pass to check whether the model is performing well, note that there are no warnings about divergences, failure to converge, or exceeding the maximum tree depth. If these warnings do occur, more information on how to address them is provided [here](https://mc-stan.org/rstanarm/articles/rstanarm.html#step-3-criticize-the-model).
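
Beyond the absence of warnings, a couple of quick checks can be run directly on the fitted object. The following is a minimal sketch (not evaluated here) using standard rstanarm functions: `summary()` reports the MCMC diagnostics and `prior_summary()` shows the priors that were actually used.

```{r, eval=FALSE}
summary(fit, digits = 2)  # includes mcse, Rhat, and n_eff for each parameter
prior_summary(fit)        # the (default) priors used for this fit
```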
## Population Estimate

From this we get a summary of the baseline log odds of cat preference at the first level of each factor (i.e., male = 0, age = 1) for each state, plus estimates of the variability of the intercepts for state, ethnicity, age, and income. While this is interesting, so far all we have achieved is a model that predicts cat preference given a number of factor-type predictors in a sample. What we would like to do is estimate cat preference in the population by accounting for differences between our sample and the population. We use the `posterior_linpred()` function to obtain posterior estimates for cat preference given the proportion of people in the *population* in each level of the factors included in the model.
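
The poststratification step implemented in the next chunk is just a population-weighted average of the cell-level posterior estimates: writing $N_j$ for the population count in cell $j$,

$$
\theta^{\rm popn} = \frac{\sum_{j} N_j \theta_j}{\sum_{j} N_j},
$$

computed separately for each posterior draw, which is exactly what the matrix multiplication by `poststrat$N` below does.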
```{r, message=FALSE}
posterior_prob <- posterior_linpred(fit, transform = TRUE, newdata = poststrat)
poststrat_prob <- posterior_prob %*% poststrat$N / sum(poststrat$N)
model_popn_pref <- c(mean = mean(poststrat_prob), sd = sd(poststrat_prob))
round(model_popn_pref, 3)
```
We can compare this to the estimate we would have made if we had just used the sample:

```{r, message=FALSE}
sample_popn_pref <- mean(sample$cat_pref)
round(sample_popn_pref, 3)
```
We can also add it to the last figure to graphically represent the difference between the sample and population estimates.

```{r, message=FALSE, fig.height = 4, fig.width = 8, fig.align = "center"}
compare2 <- compare2 +
  geom_hline(yintercept = model_popn_pref[1], colour = '#2ca25f', size = 1) +
  geom_text(aes(x = 5.2, y = model_popn_pref[1] + .025), label = "MRP", colour = '#2ca25f')
bayesplot_grid(compare, compare2,
               grid_args = list(nrow = 1, widths = c(8, 1)))
```
As this is simulated data, we can look directly at the true population preference for cats that we simulated from, to see how good our estimate is.

```{r, message=FALSE}
true_popn_pref <- sum(true_popn$cat_pref * poststrat$N) / sum(poststrat$N)
round(true_popn_pref, 3)
```

We also add this to the figure.

```{r, echo=FALSE, message=FALSE, fig.height = 4, fig.width = 8, fig.align = "center"}
compare2 <- compare2 +
  geom_hline(yintercept = mean(true_popn_pref), linetype = 'dashed', size = .8) +
  geom_text(aes(x = 5.2, y = mean(true_popn_pref) - .025), label = "True")
bayesplot_grid(compare, compare2,
               grid_args = list(nrow = 1, widths = c(8, 1)))
```
Our MRP estimate is only slightly off, while our raw sample estimate is off by more than 10 percentage points. This indicates that using MRP yields population estimates from our sample that are considerably more accurate.
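
To put numbers on this comparison, we can compute the absolute error of each estimate relative to the truth (a short sketch reusing the quantities computed above; not evaluated here):

```{r, eval=FALSE}
# Absolute error of the two population-level estimates, in percentage points
round(100 * c(
  MRP    = abs(model_popn_pref[["mean"]] - true_popn_pref),
  sample = abs(sample_popn_pref - true_popn_pref)
), 1)
```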
## Estimates for states

One of the nice benefits of using MRP to make inference about the population is that we can change the population of interest. In the previous section we inferred the preference for cats in the whole population. We can also infer the preference for cats in a single state. In the following code we post-stratify for each state in turn. Note that we can reuse the predictive model from the previous step and simply update it with different population demographics. This is particularly useful for complicated models or large data sets where the model takes some time to fit.

As before, we use the proportion of the population in each combination of the post-stratification groups to estimate the proportion of people who prefer cats, only in this case the population of interest is a single state.
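
In other words, the weighted average from before is simply restricted to the cells belonging to state $s$:

$$
\theta_s^{\rm state} = \frac{\sum_{j \in s} N_j \theta_j}{\sum_{j \in s} N_j}.
$$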
```{r, message=FALSE}
state_df <- data.frame(
  State = 1:50,
  model_state_sd = rep(-1, 50),
  model_state_pref = rep(-1, 50),
  sample_state_pref = rep(-1, 50),
  true_state_pref = rep(-1, 50),
  N = rep(-1, 50)
)
for (i in 1:length(levels(as.factor(poststrat$state)))) {
  poststrat_state <- poststrat[poststrat$state == i, ]
  posterior_prob_state <- posterior_linpred(
    fit,
    transform = TRUE,
    draws = 1000,
    newdata = as.data.frame(poststrat_state)
  )
  poststrat_prob_state <- (posterior_prob_state %*% poststrat_state$N) / sum(poststrat_state$N)
  # This is the MRP estimate for the population in the state:
  state_df$model_state_pref[i] <- round(mean(poststrat_prob_state), 4)
  state_df$model_state_sd[i] <- round(sd(poststrat_prob_state), 4)
  # This is the estimate from the sample alone:
  state_df$sample_state_pref[i] <- round(mean(sample$cat_pref[sample$state == i]), 4)
  # And this is the true population value:
  state_df$true_state_pref[i] <-
    round(sum(true_popn$cat_pref[true_popn$state == i] * poststrat_state$N) /
            sum(poststrat_state$N), digits = 4)
  state_df$N[i] <- length(sample$cat_pref[sample$state == i])
}
state_df[c(1, 3:6)]
state_df$State <- factor(state_df$State, levels = levels(sample$state))
```
Here we see similar findings to when we considered the population as a whole. While estimates of cat preference (in percentage points) based on the raw sample are off by

```{r}
round(100 * c(
  mean = mean(abs(state_df$sample_state_pref - state_df$true_state_pref), na.rm = TRUE),
  max = max(abs(state_df$sample_state_pref - state_df$true_state_pref), na.rm = TRUE)
))
```

the MRP-based estimates are much closer to the actual percentages,

```{r}
round(100 * c(
  mean = mean(abs(state_df$model_state_pref - state_df$true_state_pref)),
  max = max(abs(state_df$model_state_pref - state_df$true_state_pref))
))
```

especially when the sample size for that state is relatively small.
This is easier to see graphically, so we will continue to add additional layers to the previous figure. Here we add the model estimates, represented by the green points, line, and shaded uncertainty band, and the true population cat preference, represented by transparent circles and a dashed line.

```{r, message=FALSE, echo=FALSE, fig.height = 4, fig.width = 8, fig.align = "center", warning=FALSE}
# Summarise by state
compare <- compare +
  geom_point(data = state_df, mapping = aes(x = State, y = model_state_pref),
             inherit.aes = TRUE, colour = '#238b45') +
  geom_line(data = state_df, mapping = aes(x = State, y = model_state_pref, group = 1),
            inherit.aes = TRUE, colour = '#238b45') +
  geom_ribbon(data = state_df, mapping = aes(x = State, ymin = model_state_pref - model_state_sd,
                                             ymax = model_state_pref + model_state_sd, group = 1),
              inherit.aes = FALSE, fill = '#2ca25f', alpha = .3) +
  geom_point(data = state_df, mapping = aes(x = State, y = true_state_pref),
             alpha = .5, inherit.aes = TRUE) +
  geom_line(data = state_df, mapping = aes(x = State, y = true_state_pref),
            inherit.aes = TRUE, linetype = 'dashed')
bayesplot_grid(compare, compare2,
               grid_args = list(nrow = 1, widths = c(8, 1)))
```
# Other formats

## Alternate methods of modelling

Previously we used a binary outcome variable. An alternative form of this model is to aggregate the data to the poststratification cell level and model the number of successes (endorsements of cat preference, in this case) out of the total number of people in that cell. To do this we need to create two new variables: `N_cat_pref`, the number of people in each cell who prefer cats, and `N`, the number of respondents in each cell.
```{r, eval=FALSE}
# not evaluated to avoid dependency on tidyverse
sample_alt <- sample %>%
  group_by(male, age, income, state, eth) %>%
  summarise(N_cat_pref = sum(cat_pref), N = n()) %>%
  ungroup()
```
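
Since the chunk above is not evaluated in this vignette, here is a base-R version of the same aggregation for readers avoiding the tidyverse (a sketch producing the same `N_cat_pref` and `N` columns; the object names `agg` and `sample_alt_base` are only illustrative):

```{r, eval=FALSE}
# Aggregate the binary outcome to one row per observed poststratification cell
agg <- aggregate(cat_pref ~ male + age + income + state + eth, data = sample,
                 FUN = function(x) c(N_cat_pref = sum(x), N = length(x)))
sample_alt_base <- cbind(agg[, 1:5], agg$cat_pref)
head(sample_alt_base)
```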
<!-- ```{r, include=FALSE, eval=FALSE} -->
<!-- # not evaluated to avoid dependency on tidyverse -->
<!-- sample_alt <- sample %>% -->
<!--   group_by(male, age, income, state, eth) %>% -->
<!--   summarise(N_cat_pref = sum(cat_pref), N = n()) %>% -->
<!--   ungroup() -->
<!-- save(sample_alt, file = "mrp-files/sample_alt.rda", version = 2) -->
<!-- ``` -->

```{r, include=FALSE}
load("mrp-files/sample_alt.rda")
```
We can then use these two outcome variables to model the data using the binomial distribution.

```{r, message=FALSE, warning=FALSE, results='hide'}
fit2 <- stan_glmer(
  cbind(N_cat_pref, N - N_cat_pref) ~ factor(male) + factor(male) * factor(age) +
    (1 | state) + (1 | age) + (1 | eth) + (1 | income),
  family = binomial("logit"),
  data = sample_alt,
  refresh = 0
)
```

```{r}
print(fit2)
```
Like before, we can use the `posterior_linpred()` function to obtain an estimate of the preference for cats in the population.

```{r, message=FALSE}
posterior_prob_alt <- posterior_linpred(fit2, transform = TRUE, newdata = poststrat)
poststrat_prob_alt <- posterior_prob_alt %*% poststrat$N / sum(poststrat$N)
model_popn_pref_alt <- c(mean = mean(poststrat_prob_alt), sd = sd(poststrat_prob_alt))
round(model_popn_pref_alt, 3)
```
As expected, we get essentially the same answer as when we fit the model using the binary outcome. The two forms are equivalent, so we can use whichever is most convenient for the data at hand. More details on these two forms of binomial models are available [here](https://mc-stan.org/rstanarm/articles/binomial.html).
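
A direct side-by-side comparison of the two population-level estimates (a small sketch reusing the objects computed earlier; not evaluated here):

```{r, eval=FALSE}
round(rbind(binary = model_popn_pref, aggregated = model_popn_pref_alt), 3)
```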
# Appendix

### Examples of other formulas

The formulas for fitting so-called "mixed-effects" models in **rstanarm** are the same as those in the **lme4** package. A table of examples can be found in Table 2 of the vignette for the **lme4** package, available [here](https://CRAN.R-project.org/package=lme4/vignettes/lmer.pdf).
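
For orientation, a few common patterns of the kind listed in that table look like this (illustrative formulas only; `y`, `x`, and the grouping factors are placeholders, not variables from this vignette):

```{r, eval=FALSE}
y ~ x + (1 | state)              # varying intercept by state
y ~ x + (1 + x | state)          # varying intercept and slope of x by state
y ~ x + (1 | state) + (1 | age)  # independent varying intercepts for two factors
```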
### Code to simulate the data

Here is the source code for the `simulate_mrp_data()` function, which is based on code provided by Aki Vehtari.

```{r}
print(simulate_mrp_data)
```
# References

<!-- mrp.bib specified at the top -->