## 16. Inference about Independence

This chapter addresses two questions:

1. How do we test if two random variables are independent?
2. How do we estimate the strength of dependence between two random variables?

Recall we write $Y \text{ ⫫ } Z$ to mean that $Y$ and $Z$ are independent.

When $Y$ and $Z$ are not independent, we say they are **dependent** or **associated** or **related**.

Note that dependence does not imply causation:
- Smoking is related to heart disease, and quitting smoking will reduce the chance of heart disease.
- Owning a TV is related to a lower chance of starvation, but giving a starving person a TV does not relieve their hunger.

### 16.1 Two Binary Variables

Suppose that both $Y$ and $Z$ are binary.  Consider a data set $(Y_1, Z_1), \dots, (Y_n, Z_n)$.  Represent the data as a two-by-two table:

$$
\begin{array}{c|cc|c} 
      & Y = 0  & Y = 1 & \\
\hline
Z = 0 & X_{00} & X_{01} & X_{0\text{·}}\\
Z = 1 & X_{10} & X_{11} & X_{1\text{·}}\\
 \hline
      & X_{\text{·}0} & X_{\text{·}1} & n = X_{\text{··}}
\end{array}
$$

where $X_{ij}$ represents the number of observations where $(Z_k, Y_k) = (i, j)$.  The dotted subscripts denote sums, e.g. $X_{i\text{·}} = \sum_j X_{ij}$.  Denote the corresponding probabilities by:

$$
\begin{array}{c|cc|c} 
      & Y = 0  & Y = 1 & \\
\hline
Z = 0 & p_{00} & p_{01} & p_{0\text{·}}\\
Z = 1 & p_{10} & p_{11} & p_{1\text{·}}\\
 \hline
      & p_{\text{·}0} & p_{\text{·}1} & 1
\end{array}
$$

where $p_{ij} = \mathbb{P}(Z = i, Y = j)$.  Let $X = (X_{00}, X_{01}, X_{10}, X_{11})$ denote the vector of counts.  Then $X \sim \text{Multinomial}(n, p)$ where $p = (p_{00}, p_{01}, p_{10}, p_{11})$.

The **odds ratio** is defined to be

$$ \psi = \frac{p_{00} p_{11}}{p_{01} p_{10}}$$

The **log odds ratio** is defined to be

$$ \gamma = \log \psi$$

**Theorem 16.2**.  The following statements are equivalent:

1. $Y \text{ ⫫ } Z$
2. $\psi = 1$
3. $\gamma = 0$
4. For $i, j \in \{ 0, 1 \}$, $p_{ij} = p_{i\text{·}} p_{\text{·}j}$

Now consider testing

$$
H_0: Y \text{ ⫫ } Z
\quad \text{versus} \quad
H_1: \text{not} (Y \text{ ⫫ } Z)
$$

First consider the likelihood ratio test.  Under $H_1$, $X \sim \text{Multinomial}(n, p)$ and the MLE is $\hat{p} = X / n$.  Under $H_0$, again $X \sim \text{Multinomial}(n, p)$ but $p$ is subject to the constraint $p_{ij} = p_{i\text{·}} p_{\text{·}j}$.  This leads to the following test.

**Theorem 16.3 (Likelihood Ratio Test for Independence in a 2-by-2 table)**. 
Let

$$ T = 2 \sum_{i=0}^1 \sum_{j=0}^1 X_{ij} \log \left( \frac{X_{ij} X_{\text{··}}}{X_{i\text{·}} X_{\text{·}j}} \right)$$

Under $H_0$, $T \leadsto \chi_1^2$.  Thus, an approximate level $\alpha$ test is obtained by rejecting $H_0$ when $T > \chi_{1, \alpha}^2$.

**Theorem 16.4 (Pearson's $\chi^2$ test for Independence in a 2-by-2 table)**. Let

$$ U = \sum_{i=0}^1 \sum_{j=0}^1 \frac{(X_{ij} - E_{ij})^2}{E_{ij}} $$

where

$$ E_{ij} = \frac{X_{i\text{·}} X_{\text{·}j}}{n}$$

Under $H_0$, $U \leadsto \chi_1^2$.  Thus, an approximate level $\alpha$ test is obtained by rejecting $H_0$ when $U > \chi_{1, \alpha}^2$.

Here's the intuition for the Pearson test: Under $H_0$, $p_{ij} = p_{i\text{·}} p_{\text{·}j}$, so the MLE of $p_{ij}$ is $\hat{p}_{ij} = \hat{p}_{i\text{·}} \hat{p}_{\text{·}j} = \frac{X_{i\text{·}}}{n} \frac{X_{\text{·}j}}{n}$.  Thus, the expected number of observations in the $(i, j)$ cell is $E_{ij} = n \hat{p}_{ij} = \frac{X_{i\text{·}} X_{\text{·}j}}{n}$.  The statistic $U$ compares the observed and expected counts.
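As a rough illustration, the statistics $T$ and $U$ from Theorems 16.3 and 16.4 can be computed as in the sketch below. The function name `independence_tests` and the example counts are made up for illustration and are not data from the text. The sketch assumes every row and column total is positive, and it computes p-values with the identity $\mathbb{P}(\chi_1^2 > t) = \text{erfc}(\sqrt{t/2})$, which is specific to one degree of freedom.

```python
import math


def independence_tests(table):
    """Likelihood ratio (T) and Pearson (U) statistics for a 2-by-2 table.

    `table` is [[X00, X01], [X10, X11]] with rows indexed by Z and
    columns indexed by Y. Returns (T, U, p_T, p_U).
    Assumes all row and column totals are positive.
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]                     # X_{i.}
    col_tot = [table[0][j] + table[1][j] for j in range(2)]   # X_{.j}

    T = 0.0
    U = 0.0
    for i in range(2):
        for j in range(2):
            x = table[i][j]
            e = row_tot[i] * col_tot[j] / n   # expected count E_ij under H0
            if x > 0:                         # convention: 0 * log(0) = 0
                T += 2 * x * math.log(x / e)
            U += (x - e) ** 2 / e

    T = max(T, 0.0)  # guard against tiny negative rounding error
    # Survival function of chi-squared with 1 degree of freedom.
    p_T = math.erfc(math.sqrt(T / 2))
    p_U = math.erfc(math.sqrt(U / 2))
    return T, U, p_T, p_U


# Hypothetical counts, chosen only to exercise the function.
T, U, p_T, p_U = independence_tests([[30, 20], [15, 35]])
print(f"T = {T:.3f} (p = {p_T:.4f}), U = {U:.3f} (p = {p_U:.4f})")
```

The two statistics are usually close to each other when all expected counts are reasonably large, which is also when the $\chi_1^2$ approximation is most trustworthy.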

**Theorem 16.6**. The MLEs of $\psi$ and $\gamma$ are

$$
\hat{\psi} = \frac{X_{00} X_{11}}{X_{01} X_{10}}
, \quad
\hat{\gamma} = \log \hat{\psi}
$$

The asymptotic standard errors (computed from the delta method) are

$$
\begin{align}
\hat{\text{se}}(\hat{\gamma}) &= \sqrt{\frac{1}{X_{00}} + \frac{1}{X_{01}} + \frac{1}{X_{10}} + \frac{1}{X_{11}}}\\
\hat{\text{se}}(\hat{\psi}) &= \hat{\psi} \, \hat{\text{se}}(\hat{\gamma})
\end{align}
$$

Yet another test of independence is the Wald test for $\gamma = 0$ given by $W = (\hat{\gamma} - 0) / \hat{\text{se}}(\hat{\gamma})$; we reject $H_0$ at approximate level $\alpha$ when $|W| > z_{\alpha/2}$.

A $1 - \alpha$ confidence interval for $\gamma$ is $\hat{\gamma} \pm z_{\alpha/2} \hat{\text{se}}(\hat{\gamma})$.

A $1 - \alpha$ confidence interval for $\psi$ can be obtained in two ways.  First, we could use $\hat{\psi} \pm z_{\alpha/2} \hat{\text{se}}(\hat{\psi})$.  Second, since $\psi = e^{\gamma}$ we could use  $\exp \{\hat{\gamma} \pm z_{\alpha/2} \hat{\text{se}}(\hat{\gamma})\}$.  This second method is usually more accurate.
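The estimates, the Wald statistic, and both confidence intervals for $\psi$ can be sketched as follows. This is a minimal sketch that assumes all four cell counts are positive; it uses the delta-method standard errors $\hat{\text{se}}(\hat{\gamma}) = \sqrt{\sum_{i,j} 1 / X_{ij}}$ and $\hat{\text{se}}(\hat{\psi}) = \hat{\psi} \, \hat{\text{se}}(\hat{\gamma})$. The function name, the dictionary keys, and the example counts are illustrative choices, not part of the text.

```python
import math
from statistics import NormalDist


def odds_ratio_inference(x00, x01, x10, x11, alpha=0.05):
    """Point estimates, Wald statistic, and 1 - alpha intervals.

    Assumes every count is strictly positive; otherwise the odds ratio
    or its standard error is undefined.
    """
    psi_hat = (x00 * x11) / (x01 * x10)    # MLE of the odds ratio
    gamma_hat = math.log(psi_hat)          # MLE of the log odds ratio
    se_gamma = math.sqrt(1 / x00 + 1 / x01 + 1 / x10 + 1 / x11)
    se_psi = psi_hat * se_gamma            # delta method

    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 quantile
    wald = gamma_hat / se_gamma               # tests gamma = 0

    ci_gamma = (gamma_hat - z * se_gamma, gamma_hat + z * se_gamma)
    ci_psi_direct = (psi_hat - z * se_psi, psi_hat + z * se_psi)
    ci_psi_exp = (math.exp(ci_gamma[0]), math.exp(ci_gamma[1]))

    return {
        "psi_hat": psi_hat,
        "gamma_hat": gamma_hat,
        "se_gamma": se_gamma,
        "se_psi": se_psi,
        "wald": wald,
        "ci_gamma": ci_gamma,
        "ci_psi_direct": ci_psi_direct,
        "ci_psi_exp": ci_psi_exp,
    }


# Hypothetical counts (X00, X01, X10, X11), for illustration only.
result = odds_ratio_inference(30, 20, 15, 35)
for name, value in result.items():
    print(name, value)
```

One practical difference between the two intervals for $\psi$: the exponentiated interval always stays positive, whereas the direct interval $\hat{\psi} \pm z_{\alpha/2} \hat{\text{se}}(\hat{\psi})$ can extend below zero even though $\psi > 0$.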

### 16.2 Interpreting the Odds Ratios

Suppose event $A$ has probability $\mathbb{P}(A)$.  The odds of $A$ are defined as

$$\text{odds}(A) = \frac{\mathbb{P}(A)}{1 - \mathbb{P}(A)}$$

It follows that

$$\mathbb{P}(A) = \frac{\text{odds}(A)}{1 + \text{odds}(A)}$$
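The two conversions are inverses of each other, as the small sketch below checks. The helper names `odds` and `prob_from_odds` are made up here for illustration.

```python
def odds(p):
    """Odds of an event with probability p, for 0 <= p < 1."""
    return p / (1 - p)


def prob_from_odds(o):
    """Probability of an event whose odds are o, for o >= 0."""
    return o / (1 + o)


# A probability of 3/4 corresponds to odds of 3 (that is, 3 to 1).
print(odds(0.75))            # 3.0
print(prob_from_odds(3.0))   # 0.75
```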

Let $E$ be the event that someone is exposed to something (smoking, radiation, etc.) and let $D$ be the event that they get a disease.  The odds of getting the disease given exposure are:

$$\text{odds}(D | E) = \frac{\mathbb{P}(D | E)}{1 - \mathbb{P}(D | E)}$$

and the odds of getting the disease given non-exposure are:

$$\text{odds}(D | E^c) = \frac{\mathbb{P}(D | E^c)}{1 - \mathbb{P}(D | E^c)}$$

The **odds ratio** is defined to be

$$\psi = \frac{\text{odds}(D | E)}{\text{odds}(D | E^c)}$$

If $\psi = 1$ then the disease probability is the same for exposed and unexposed; this implies these events are independent.  Recall that the log-odds ratio is defined as $\gamma = \log \psi$.  Independence corresponds to $\gamma = 0$.

Consider this table of probabilities:

$$
\begin{array}{c|cc|c} 
      & D^c    & D      & \\
\hline
E^c   & p_{00} & p_{01} & p_{0\text{·}}\\
E     & p_{10} & p_{11} & p_{1\text{·}}\\
 \hline
      & p_{\text{·}0} & p_{\text{·}1} & 1
\end{array}
$$

Denote the data by

$$
\begin{array}{c|cc|c} 
      & D^c    & D      & \\
\hline
E^c   & X_{00} & X_{01} & X_{0\text{·}}\\
E     & X_{10} & X_{11} & X_{1\text{·}}\\
 \hline
      & X_{\text{·}0} & X_{\text{·}1} & X_{\text{··}}
\end{array}
$$

Now

$$
\mathbb{P}(D | E) = \frac{p_{11}}{p_{10} + p_{11}}
\quad \text{and} \quad
\mathbb{P}(D | E^c) = \frac{p_{01}}{p_{00} + p_{01}}
$$

and so

$$
\text{odds}(D | E) = \frac{p_{11}}{p_{10}}
\quad \text{and} \quad
\text{odds}(D | E^c) = \frac{p_{01}}{p_{00}}
$$

and therefore

$$ \psi = \frac{p_{11}p_{00}}{p_{01}p_{10}}$$

To estimate the parameters, we have to consider how the data were collected.  There are three methods.

**Multinomial Sampling**.  We draw a sample from the population and, for each sampled person, record their exposure and disease status.  In this case, $X = (X_{00}, X_{01}, X_{10}, X_{11}) \sim \text{Multinomial}(n, p)$.  We then estimate the probabilities in the table by $\hat{p}_{ij} = X_{ij} / n$ and

$$ \hat{\psi} = \frac{\hat{p}_{11} \hat{p}_{00}}{\hat{p}_{01} \hat{p}_{10}} = \frac{X_{11} X_{00}}{X_{01} X_{10}}$$

**Prospective Sampling (Cohort Sampling)**.  We get some exposed and unexposed people and count the number with disease within each group.  Thus,

$$
X_{01} \sim \text{Binomial}(X_{0\text{·}}, \mathbb{P}(D | E^c))
\quad \text{and} \quad
X_{11} \sim \text{Binomial}(X_{1\text{·}}, \mathbb{P}(D | E))
$$

In this case we should write small letters $x_{0\text{·}},  x_{1\text{·}}$ instead of capital letters $ X_{0\text{·}},  X_{1\text{·}}$ since they are fixed and not random, but we'll keep using capital letters for notational simplicity.

We can estimate $\mathbb{P}(D | E)$ and $\mathbb{P}(D | E^c)$ but we cannot estimate all probabilities in the table.  Still, we can estimate $\psi$.  Now:

$$\hat{\mathbb{P}}(D | E) = \frac{X_{11}}{X_{1\text{·}}}
\quad \text{and} \quad
\hat{\mathbb{P}}(D | E^c) = \frac{X_{01}}{X_{0\text{·}}}
$$

Thus,

$$ \hat{\psi} = \frac{X_{11} X_{00}}{X_{01} X_{10}}$$

as before.

**Case-Control (Retrospective Sampling)**.  Here we get some diseased and non-diseased people and we observe how many are exposed, which is much more efficient if the disease is rare.  Under this design,

$$
X_{10} \sim \text{Binomial}(X_{\text{·}0}, \mathbb{P}(E | D^c))
\quad \text{and} \quad
X_{11} \sim \text{Binomial}(X_{\text{·}1}, \mathbb{P}(E | D))
$$

From these data we can estimate $\mathbb{P}(E | D)$ and $\mathbb{P}(E | D^c)$.  Surprisingly, we can still estimate $\psi$.  To understand why, note that

$$
\mathbb{P}(E | D) = \frac{p_{11}}{p_{01} + p_{11}},
\quad 1 - \mathbb{P}(E | D) = \frac{p_{01}}{p_{01} + p_{11}},
\quad \text{odds}(E | D) = \frac{p_{11}}{p_{01}}
$$

By a similar argument,

$$\text{odds}(E | D^c) = \frac{p_{10}}{p_{00}}$$

Hence,

$$\frac{\text{odds}(E | D)}{\text{odds}(E | D^c)} = \frac{p_{11} p_{00}}{p_{01} p_{10}} = \psi$$

Therefore,

$$\hat{\psi} = \frac{X_{11} X_{00}}{X_{01} X_{10}}$$

In all three methods, the estimate of $\psi$ turns out to be the same.
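The reason the estimate is the same is that the odds ratio is symmetric in the roles of exposure and disease. The sketch below checks this numerically for one hypothetical table of joint probabilities (the values are invented for illustration): the ratio of disease odds across exposure groups, the ratio of exposure odds across disease groups, and the cross-product ratio all coincide.

```python
import math


def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)


# Hypothetical joint probabilities p[(e, d)] = P(E = e, D = d); they sum to 1.
p = {(0, 0): 0.60, (0, 1): 0.10, (1, 0): 0.20, (1, 1): 0.10}

# Prospective view: condition on exposure.
p_d_given_e = p[(1, 1)] / (p[(1, 0)] + p[(1, 1)])
p_d_given_ec = p[(0, 1)] / (p[(0, 0)] + p[(0, 1)])
psi_prospective = odds(p_d_given_e) / odds(p_d_given_ec)

# Retrospective (case-control) view: condition on disease status.
p_e_given_d = p[(1, 1)] / (p[(0, 1)] + p[(1, 1)])
p_e_given_dc = p[(1, 0)] / (p[(0, 0)] + p[(1, 0)])
psi_retrospective = odds(p_e_given_d) / odds(p_e_given_dc)

# Cross-product ratio from the joint table.
psi_cross = (p[(1, 1)] * p[(0, 0)]) / (p[(0, 1)] * p[(1, 0)])

print(psi_prospective, psi_retrospective, psi_cross)
assert math.isclose(psi_prospective, psi_retrospective)
assert math.isclose(psi_prospective, psi_cross)
```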

It is tempting to try to estimate $\mathbb{P}(D | E) - \mathbb{P}(D | E^c)$.  In a case-control design, this quantity is not estimable.  To see this, we apply Bayes' theorem to get

$$\mathbb{P}(D | E) - \mathbb{P}(D | E^c) = \frac{\mathbb{P}(E | D) \mathbb{P}(D)}{\mathbb{P}(E)} - \frac{\mathbb{P}(E^c | D) \mathbb{P}(D)}{\mathbb{P}(E^c)}$$

Because of the way we obtained the data, $\mathbb{P}(D)$ is not estimable from the data.

However, we can estimate $\xi = \mathbb{P}(D | E) / \mathbb{P}(D | E^c)$, which is called the **relative risk**, under the **rare disease assumption**.

**Theorem 16.9**.  Let $\xi = \mathbb{P}(D | E) / \mathbb{P}(D | E^c)$.  Then

$$ \frac{\psi}{\xi} \rightarrow 1$$

as $\mathbb{P}(D) \rightarrow 0$.

Thus, under the rare disease assumption, the relative risk is approximately the same as the odds ratio, which we can estimate.
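To see how quickly the approximation takes hold, note that the definitions give $\psi / \xi = (1 - \mathbb{P}(D | E^c)) / (1 - \mathbb{P}(D | E))$, which is close to 1 whenever both conditional disease probabilities are small. The sketch below evaluates this ratio for a fixed relative risk and a shrinking baseline risk; the function name and the chosen values are hypothetical.

```python
def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)


def odds_ratio_over_relative_risk(p_d_unexposed, xi):
    """Return psi / xi when P(D | E^c) = p_d_unexposed and the relative risk is xi.

    Requires xi * p_d_unexposed < 1 so that P(D | E) is a valid probability.
    """
    p_d_exposed = xi * p_d_unexposed            # P(D | E)
    psi = odds(p_d_exposed) / odds(p_d_unexposed)
    return psi / xi


# Relative risk fixed at 2; the baseline risk shrinks by a factor of 10 each step.
ratios = [odds_ratio_over_relative_risk(q, xi=2.0) for q in (0.1, 0.01, 0.001)]
for q, r in zip((0.1, 0.01, 0.001), ratios):
    print(f"P(D | E^c) = {q}: psi / xi = {r:.4f}")
```

In this example the odds ratio overstates the relative risk by 12.5% at a baseline risk of 0.1, and the gap shrinks steadily as the disease becomes rarer.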

In a randomized experiment, we can interpret an association, that is $\psi \neq 1$, as a causal relationship.  In an observational (non-randomized) study, the association can be due to other unobserved **confounding** variables.  We'll discuss causation in more detail later.