vignettes/session_lecture.Rmd

---
title: 'Session 5: loglinear regression part 2'
author: "Levi Waldron"
clean: false
output:
  beamer_presentation:
    colortheme: dove
    df_print: paged
    fonttheme: structurebold
    slide_level: 2
    theme: Hannover
  slidy_presentation: default
  md_document:
    preserve_yaml: false
always_allow_html: true
institute: CUNY SPH Biostatistics 2
---

# Learning objectives and outline

## Learning objectives

1. Define and identify over-dispersion in count data
2. Define the negative binomial (NB) distribution and identify applications for it
3. Define zero-inflated count models
4. Fit and interpret Poisson and NB, with and without zero inflation

## Outline

1. Review of log-linear Poisson glm
2. Review of diagnostics and interpretation of coefficients
3. Over-dispersion
    + Negative Binomial distribution
4. Zero-inflated models

Resources:

* Vittinghoff sections 8.1-8.3
* Short tutorials on regression in R (and Stata, SAS, SPSS, Mplus)
    + https://stats.idre.ucla.edu/other/dae/


# Review

## Components of GLM

* **Random component** specifies the conditional distribution of the response variable. It need not be normal; it can be any distribution in the exponential family
* **Systematic component** specifies a linear function of the predictors (the linear predictor)
* **Link** [denoted by g(.)] specifies the relationship between the expected value of the random component and the systematic component; it can be linear (identity) or nonlinear (e.g. log)


## Motivating example: Choice of Distribution

* Count data are often modeled as Poisson distributed:
    + mean $\lambda$ is greater than 0
    + variance is also $\lambda$
    + Probability mass function $P(k; \lambda) = \frac{\lambda^k}{k!} e^{-\lambda}$, for $k = 0, 1, 2, \ldots$
       
```{r, echo=FALSE, fig.height=6}
##par(cex=2)  #increase size of type and axes
plot(x=0:10, y=dpois(0:10, lambda=1), 
     type="b", lwd=2,
     xlab="Counts (k)", ylab="Probability density")
lines(x=0:10, y=dpois(0:10, lambda=2), 
      type="b", lwd=2, lty=2, pch=2)
lines(x=0:10, dpois(0:10, lambda=4), 
      type="b", lwd=2, lty=3, pch=3)
legend("topright", lwd=2, lty=1:3, pch=1:3,
       legend=c(expression(paste(lambda, "=1")),
                expression(paste(lambda, "=2")),
                expression(paste(lambda, "=4"))))
```
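
As a quick numerical check of the mean = variance property, the sketch below computes both moments directly from `dpois()`. The names `pois_k`, `pois_lambda`, and `pois_p` are made up for this illustration, and the support is truncated at 100, which is an assumption that is harmless for small $\lambda$.

```{r}
# Illustrative check: mean and variance of a Poisson(lambda = 2) distribution
pois_k <- 0:100                 # truncated support (tail mass is negligible)
pois_lambda <- 2
pois_p <- dpois(pois_k, lambda = pois_lambda)

pois_mean <- sum(pois_k * pois_p)                  # E[k]
pois_var <- sum((pois_k - pois_mean)^2 * pois_p)   # Var[k]
c(mean = pois_mean, variance = pois_var)
```

Both values should agree with $\lambda$ up to floating-point error.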

## Poisson model: the GLM

The **systematic part** of the GLM is:
$$
log(\lambda_i) = \beta_0 + \beta_1 \textrm{RACE}_i + \beta_2 \textrm{TRT}_i + \beta_3 \textrm{ALCH}_i + \beta_4 \textrm{DRUG}_i
$$
Or alternatively:
$$
\lambda_i = exp \left( \beta_0 + \beta_1 \textrm{RACE}_i + \beta_2 \textrm{TRT}_i + \beta_3 \textrm{ALCH}_i + \beta_4 \textrm{DRUG}_i \right)
$$

The **random part** is (recall that $\lambda_i$ is both the mean and the variance of a Poisson distribution):
$$
y_i \sim Poisson(\lambda_i)
$$
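
To make the two parts concrete, here is a minimal simulation sketch. It assumes a single hypothetical binary predictor (`sim_x`) and made-up true coefficients rather than the RACE/TRT/ALCH/DRUG variables above; it generates counts from the model and then checks that `glm()` recovers the coefficients.

```{r}
# Simulate from a log-linear Poisson model with one hypothetical predictor
set.seed(1)
sim_n <- 5000
sim_x <- rbinom(sim_n, size = 1, prob = 0.5)   # hypothetical binary predictor
sim_beta <- c(0.5, 0.7)                        # assumed true beta_0, beta_1

sim_lambda <- exp(sim_beta[1] + sim_beta[2] * sim_x)  # systematic part
sim_y <- rpois(sim_n, lambda = sim_lambda)            # random part

sim_fit <- glm(sim_y ~ sim_x, family = poisson(link = "log"))
coef(sim_fit)        # estimates on the log scale
exp(coef(sim_fit))   # exponentiated: baseline rate and rate ratio
```

The exponentiated slope is a rate ratio: the factor by which the expected count is multiplied when `sim_x` goes from 0 to 1.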


## Example: Risky Drug Use Behavior

* Outcome is # times the drug user shared a syringe in the past month (`shared_syr`)
* Predictors: `sex`, `ethn`, `homeless`
    + filtered to `sex` "M" or "F", `ethn` "White", "AA", "Hispanic"

\small
```{r loadandfilter, echo=FALSE, message=FALSE}
suppressPackageStartupMessages({
  library(readr)
  library(dplyr)
})
needledat <- readr::read_csv("needle_sharing.csv")
needledat2 <- needledat %>%
  dplyr::filter(sex %in% c("M", "F") &
                  ethn %in% c("White", "AA", "Hispanic") &
                  !is.na(homeless)) %>%
  mutate(
    homeless = recode(homeless, "0" = "no", "1" = "yes"),
    hiv = recode(
      hivstat,
      "0" = "negative",
      "1" = "positive",
      "2" = "yes"
    )
  )
```

```{r}
summary(needledat2$shared_syr)
var(needledat2$shared_syr, na.rm = TRUE)
```

## Example: Risky Drug Use Behavior

Exploratory plots

```{r, echo=FALSE}
par(mfrow = c(1, 2))
##par(cex=2)
hist(needledat2$shared_syr, main = "")
plot(
  sort(needledat2$shared_syr),
  pch = ".",
  xlab = "count",
  ylab = "# participants"
)
```

* There are a _lot_ of zeros and variance is much greater than mean
    + Poisson model is probably not a good fit

## Fitting a Poisson model


```{r, results='hide'}
fit.pois <- glm(shared_syr ~ sex + ethn + homeless,
                data = needledat2,
                family = poisson(link = "log"))
```

## Residuals plots

```{r, echo=FALSE, warning=FALSE}
par(mfrow = c(2, 2))
plot(fit.pois)
```
* Poisson model is definitely not a good fit.
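
Besides residual plots, one common numerical check (not part of the original slides) is the sum of squared Pearson residuals divided by the residual degrees of freedom; values well above 1 suggest over-dispersion. The sketch below uses simulated data with made-up names (`od_x`, `od_y`, `od_fit`) so that it runs on its own.

```{r}
# Simulated over-dispersed counts, deliberately fitted with a Poisson model
set.seed(2)
od_x <- rnorm(500)
od_y <- rnbinom(500, mu = exp(1 + 0.3 * od_x), size = 0.5)  # NB, not Poisson

od_fit <- glm(od_y ~ od_x, family = poisson(link = "log"))

# Pearson dispersion statistic: roughly 1 if the Poisson variance is right
od_ratio <- sum(residuals(od_fit, type = "pearson")^2) / df.residual(od_fit)
od_ratio
```

The same two lines at the end could be applied to any fitted Poisson `glm` object.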

# Over-dispersion

## When the Poisson model doesn't fit

1. Variance > mean (over-dispersion)
    + Can use the negative binomial distribution
2. Excess zeros (zero inflation)
    + Can use a zero-inflated model

## Negative binomial distribution

* The binomial distribution describes the number of successes in n independent trials:
    + Roll a die ten times, how many times do you see a 6?
* The negative binomial distribution describes the number of successes observed before the r-th failure:
    + How many non-6 rolls do you see before rolling a 6 for the tenth time?
    + Note that the number of rolls is no longer fixed.
    + In this example, r=10, a 6 is a "failure", and p=5/6 is the probability of a "success" (a non-6)
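
The dice example can be checked with `dnbinom()`. Note that R labels the outcomes the other way around: `dnbinom(x, size, prob)` counts `x` "failures" before `size` "successes", where `prob` is the probability of the event being counted up to `size`. Here that event is rolling a 6, so `prob = 1/6`. The variable names are made up for this sketch.

```{r}
# Number of non-6 rolls before the tenth 6
dice_size <- 10       # stop at the tenth 6
dice_prob <- 1 / 6    # probability of rolling a 6
dice_k <- 0:2000      # truncated support (tail mass is negligible)

dice_pmf <- dnbinom(dice_k, size = dice_size, prob = dice_prob)
dice_mean <- sum(dice_k * dice_pmf)

dice_mean               # expected non-6 rolls: size * (1 - prob) / prob
dice_mean + dice_size   # expected total number of rolls
```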

## Negative binomial GLM

*One way* to parametrize a NB model is with a **systematic part** equivalent to the Poisson model:
$$
log(\lambda_i) = \beta_0 + \beta_1 \textrm{RACE}_i + \beta_2 \textrm{TRT}_i + \beta_3 \textrm{ALCH}_i + \beta_4 \textrm{DRUG}_i
$$
Or:
$$
\lambda_i = exp \left( \beta_0 + \beta_1 \textrm{RACE}_i + \beta_2 \textrm{TRT}_i + \beta_3 \textrm{ALCH}_i + \beta_4 \textrm{DRUG}_i \right)
$$

And a **random part**:
$$
y_i \sim NB(\lambda_i, \theta)
$$

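* $\theta$ is a **dispersion parameter** that is estimated
* The variance is $\lambda_i + \lambda_i^2 / \theta$, so the model approaches the Poisson model as $\theta \rightarrow \infty$
* `MASS::glm.nb()` uses this parametrization; `dnbinom()` uses `size` and `prob` by default, but also accepts `mu` together with `size` $= \theta$
* The Poisson model can be considered **nested** within the Negative Binomial model, as a limiting case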
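
The mean/dispersion parametrization can be explored with `dnbinom(mu = , size = )`, where `size` plays the role of $\theta$. Under this parametrization the variance is $\mu + \mu^2/\theta$, so the distribution approaches the Poisson as $\theta$ grows. The sketch below checks both statements numerically; all variable names are made up for illustration.

```{r}
nb_k <- 0:5000   # truncated support (tail mass is negligible here)
nb_mu <- 4

# Variance of NB(mu, theta), computed directly from the probabilities
nb_variance <- function(theta) {
  p <- dnbinom(nb_k, mu = nb_mu, size = theta)
  sum((nb_k - nb_mu)^2 * p)
}

nb_theta <- c(1, 10, 1000)
rbind(numeric = sapply(nb_theta, nb_variance),
      formula = nb_mu + nb_mu^2 / nb_theta)

# For a very large theta the probabilities are close to Poisson(mu)
nb_gap <- max(abs(dnbinom(0:20, mu = nb_mu, size = 1e6) -
                    dpois(0:20, lambda = nb_mu)))
nb_gap
```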

## Negative Binomial Random Distribution

```{r, echo=FALSE}
plot(
  x = 0:40,
  y = dnbinom(0:40, size = 10, prob = 0.5),
  type = "b",
  lwd = 2,
  ylim = c(0, 0.2),
  xlab = "Counts (k)",
  ylab = "Probability density"
)
lines(
  x = 0:40,
  y = dnbinom(0:40, size = 20, prob = 0.5),
  type = "b",
  lwd = 2,
  lty = 2,
  pch = 2
)
lines(
  x = 0:40,
  y = dnbinom(0:40, size = 10, prob = 0.3),
  type = "b",
  lwd = 2,
  lty = 3,
  pch = 3
)
legend(
  "topright",
  lwd = 2,
  lty = 1:3,
  pch = 1:3,
  legend = c("n=10, p=0.5", "n=20, p=0.5", "n=10, p=0.3")
)
```

## Compare Poisson vs. Negative Binomial

The Negative Binomial distribution, as parametrized by `dnbinom()`, has two parameters: the target number of successes n (`size`) and the probability of success p (`prob`)

```{r, echo=FALSE}
plot(
  x = 0:40,
  y = dnbinom(0:40, size = 10, prob = 0.5),
  type = "b",
  lwd = 2,
  ylim = c(0, 0.15),
  xlab = "Counts (k)",
  ylab = "Probability density"
)
lines(
  x = 0:40,
  y = dnbinom(0:40, size = 20, prob = 0.5),
  type = "b",
  lwd = 2,
  lty = 2,
  pch = 2
)
lines(
  x = 0:40,
  y = dnbinom(0:40, size = 10, prob = 0.3),
  type = "b",
  lwd = 2,
  lty = 3,
  pch = 3
)
lines(x = 0:40,
      y = dpois(0:40, lambda = 9),
      col = "red")
lines(x = 0:40,
      y = dpois(0:40, lambda = 20),
      col = "red")
legend(
  "topright",
  lwd = c(2, 2, 2, 1),
  lty = c(1:3, 1),
  pch = c(1:3, -1),
  col = c(rep("black", 3), "red"),
  legend = c("n=10, p=0.5", "n=20, p=0.5", "n=10, p=0.3", "Poisson")
)
```

## Negative Binomial Regression

```{r, echo=TRUE, message=FALSE, warning=FALSE}
library(MASS)
fit.negbin <- MASS::glm.nb(shared_syr ~ sex +
                             ethn + homeless,
                           data = needledat2)
```

## 

\tiny
```{r}
summary(fit.negbin)
```

## Likelihood ratio test 

Basis: Under $H_0$ (no improvement in fit by the more complex model), the difference in model residual deviances is approximately $\chi^2$-distributed, with degrees of freedom equal to the number of additional parameters.

Deviance: $\Delta D = -2 \times \Delta (\textrm{log likelihood})$

```{r}
(ll.negbin <- logLik(fit.negbin))
(ll.pois <- logLik(fit.pois))
pchisq(2 * (ll.negbin - ll.pois), df=1, 
       lower.tail=FALSE)
```
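
The same test can be tried on simulated data where the truth is known. This is a sketch under stated assumptions: the data are generated from an NB model with made-up parameters, and the names (`lrt_x`, `lrt_y`, and so on) are hypothetical. Because the Poisson model sits on the boundary of the NB parameter space, the `df = 1` p-value is usually regarded as conservative.

```{r}
set.seed(3)
lrt_x <- rnorm(400)
lrt_y <- rnbinom(400, mu = exp(1 + 0.5 * lrt_x), size = 1)  # true theta = 1

lrt_pois <- glm(lrt_y ~ lrt_x, family = poisson(link = "log"))
lrt_nb <- MASS::glm.nb(lrt_y ~ lrt_x)

# Test statistic: twice the difference in log likelihoods
lrt_stat <- as.numeric(2 * (logLik(lrt_nb) - logLik(lrt_pois)))
lrt_p <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)
c(statistic = lrt_stat, p.value = lrt_p, theta = lrt_nb$theta)
```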


## NB regression residuals plots

```{r, echo=FALSE, warning=FALSE}
par(mfrow = c(2, 2))
plot(fit.negbin)
```

# Zero Inflation

## Zero inflated "two-step" models

**Step 1**: logistic model to determine whether a count is an "excess" zero or is drawn from the Poisson/NB distribution

**Step 2**: Poisson or NB regression for the $y_i$ not set to zero in Step 1 (these can still be zero)
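
The two steps combine into a mixture distribution. As a minimal sketch for the Poisson case, assuming a constant excess-zero probability `pi0` (a name chosen here for illustration), the probability of a zero is `pi0 + (1 - pi0) * exp(-lambda)` and every other count is down-weighted by `1 - pi0`. The helper `dzip_sketch()` is hypothetical and written in base R only.

```{r}
# Zero-inflated Poisson probabilities written out by hand
dzip_sketch <- function(k, lambda, pi0) {
  pi0 * (k == 0) + (1 - pi0) * dpois(k, lambda = lambda)
}

zip_k <- 0:100   # truncated support (tail mass is negligible)
zip_p <- dzip_sketch(zip_k, lambda = 2, pi0 = 0.2)

zip_mean <- sum(zip_k * zip_p)                 # (1 - pi0) * lambda
zip_var <- sum((zip_k - zip_mean)^2 * zip_p)   # larger than the mean
c(total = sum(zip_p), mean = zip_mean, variance = zip_var)
```

Even with a Poisson count component, the excess zeros push the variance above the mean.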

## Poisson Distribution with Zero Inflation

```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(gamlss)
##par(cex=2)  #increase size of type and axes
plot(
  x = 0:10,
  y = dpois(0:10, lambda = 2),
  type = "b",
  lwd = 2,
  ylim = c(0, 0.5),
  xlab = "Counts (k)",
  ylab = "Probability density"
)
lines(
  x = 0:10,
  y = dZIP(0:10, mu = 2, sigma = 0.2),
  type = "b",
  lwd = 2,
  lty = 2,
  pch = 2
)
lines(
  x = 0:10,
  y = dZIP(0:10, mu = 2, sigma = 0.4),
  type = "b",
  lwd = 2,
  lty = 3,
  pch = 3
)
legend(
  "topright",
  lwd = 2,
  lty = 1:3,
  pch = 1:3,
  legend = c(expression(paste(lambda, "=2")),
             expression(
               paste("ZIP: ", lambda, "=2, ", sigma, "=0.2")
             ),
             expression(
               paste("ZIP: ", lambda, "=2, ", sigma, "=0.4")
             ))
)
```

## Zero-inflated Poisson regression

```{r, echo=TRUE, results='hide', message=FALSE, warning=FALSE}
library(pscl)
fit.ZIpois <-
  pscl::zeroinfl(shared_syr~sex+ethn+homeless,
                 dist = "poisson",
                 data = needledat2)
```

## 

\tiny
```{r}
summary(fit.ZIpois)
```


## Zero-inflated Negative Binomial regression

```{r, echo=TRUE, results='hide', message=FALSE}
fit.ZInegbin <- 
  pscl::zeroinfl(shared_syr~sex+ethn+homeless,
                         dist = "negbin",
                         data = needledat2)
```

* *NOTE*: zero-inflation model can include any of your variables as predictors
* *WARNING*: the default in the `zeroinfl()` function is to use _all_ variables as predictors in the logistic model

## 

\tiny
```{r}
summary(fit.ZInegbin)
```

## Zero-inflated NB - simplified

* Model is much more interpretable if the exposure of interest is _not_ included in the zero-inflation model.
* E.g. with HIV status as the only predictor in zero-inflation model:

```{r, echo=TRUE, results='hide', message=FALSE}
fit.ZInb2 <- pscl::zeroinfl(shared_syr ~ sex + ethn + 
                        homeless + hiv | hiv,
                      dist = "negbin",
                      data = needledat2)
```

## 

\tiny
```{r}
summary(fit.ZInb2)
```


## Intercept-only ZI model

```{r}
fit.ZInb3 <-
  pscl::zeroinfl(shared_syr~sex+ethn+homeless|1,
           dist = "negbin",
           data = needledat2)
```
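
To see what the intercept-only specification estimates, the following sketch fits the same kind of model to simulated zero-inflated Poisson data. The data-generating values (30% excess zeros, slope 0.4) and all `zi_` names are assumptions made for illustration.

```{r, message=FALSE}
set.seed(4)
zi_n <- 2000
zi_x <- rnorm(zi_n)
zi_excess <- rbinom(zi_n, size = 1, prob = 0.3)       # 1 = excess zero
zi_count <- rpois(zi_n, lambda = exp(0.5 + 0.4 * zi_x))
zi_dat <- data.frame(zi_y = ifelse(zi_excess == 1, 0, zi_count),
                     zi_x = zi_x)

# Count model uses zi_x; zero-inflation model is intercept-only ("| 1")
zi_fit <- pscl::zeroinfl(zi_y ~ zi_x | 1, dist = "poisson", data = zi_dat)

zi_fit$coefficients$count                 # log-scale count coefficients
zi_pi <- plogis(zi_fit$coefficients$zero) # estimated excess-zero probability
zi_pi
```

The zero-inflation intercept is on the logit scale, so `plogis()` converts it back to a probability.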

## 

\tiny
```{r}
summary(fit.ZInb3)
```


## Confidence intervals

Use the `confint()` function for all these models. Call it as plain `confint()` rather than with a `package::` prefix, so that R dispatches to the right method for each model class. E.g.:

```{r}
confint(fit.ZInb3)
```
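
Because these are log-link models, intervals are often easier to read after exponentiating. A minimal sketch on simulated Poisson data follows; the names `ci_x`, `ci_y`, `ci_fit`, and `ci_tab` are hypothetical.

```{r, message=FALSE}
set.seed(6)
ci_x <- rbinom(1000, size = 1, prob = 0.5)
ci_y <- rpois(1000, lambda = exp(0.3 + 0.5 * ci_x))
ci_fit <- glm(ci_y ~ ci_x, family = poisson(link = "log"))

# Exponentiate estimates and interval limits together
ci_tab <- exp(cbind(estimate = coef(ci_fit), confint(ci_fit)))
ci_tab
```

The `ci_x` row is then a rate ratio with its confidence interval.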


## Residuals vs. fitted values

The functions `plotpanel1` and `plotpanel2` are defined in a hidden chunk and work for all of these model types (see lab). They use Pearson residuals.

```{r, echo=FALSE}
plotpanel1 <- function(fit, ...) {
  plot(
    x = predict(fit),
    y = residuals(fit, type = "pearson"),
    xlab = "Predicted Values",
    ylab = "Pearson Residuals",
    ...
  )
  abline(h = 0, lty = 3)
  lines(lowess(x = predict(fit), y = resid(fit, type = "pearson")),
        col = "red")
}
plotpanel2 <- function(fit, ...) {
  resids <- scale(residuals(fit, type = "pearson"))
  qqnorm(resids, ylab = "Std Pearson resid.", ...)
  qqline(resids)
}
```

```{r, echo=FALSE, warning=FALSE}
par(mfrow = c(2, 2))
plotpanel1(fit.pois, main = "Residuals vs. Fitted\n Poisson")
plotpanel1(fit.negbin, main = "Residuals vs. Fitted\n Negative Binomial")
plotpanel1(fit.ZIpois, main = "Residuals vs. Fitted\n Zero-inflated Poisson")
plotpanel1(fit.ZInegbin, main = "Residuals vs. Fitted\n Zero-inflated Negative Binomial")
```
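
For reference, a Pearson residual is the raw residual divided by the standard deviation the model implies. For a Poisson model that is $(y_i - \hat\lambda_i) / \sqrt{\hat\lambda_i}$. The sketch below, on simulated data with made-up names, checks this against `residuals(type = "pearson")`.

```{r}
set.seed(5)
pr_x <- rnorm(200)
pr_y <- rpois(200, lambda = exp(0.2 + 0.5 * pr_x))
pr_fit <- glm(pr_y ~ pr_x, family = poisson(link = "log"))

pr_mu <- fitted(pr_fit)                     # fitted means
pr_manual <- (pr_y - pr_mu) / sqrt(pr_mu)   # Poisson: variance = mean

pr_match <- all.equal(unname(residuals(pr_fit, type = "pearson")),
                      unname(pr_manual))
pr_match
```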


## Quantile-quantile plots for residuals

```{r, echo=FALSE, warning=FALSE}
par(mfrow = c(2, 2))
plotpanel2(fit.pois, main = "Normal Q-Q Plot\n Poisson")
plotpanel2(fit.negbin, main = "Normal Q-Q Plot\n Negative Binomial")
plotpanel2(fit.ZIpois, main = "Normal Q-Q Plot\n Zero-inflated Poisson")
plotpanel2(fit.ZInegbin, main = "Normal Q-Q Plot\n Zero-inflated Negative Binomial")
```
_still_ over-dispersed - ideas?

## Summary / Conclusions

* These are multiplicative models: exponentiated coefficients multiply the expected count
* Fitting zero-inflated models can be problematic (convergence, over-complicated default models), especially for small samples
* Use QQ and residuals plots to assess model fit
* Can use LRT to compare nested models