# Simpson's paradox

## Example from Causality Primer

Table 1.1 of the [Primer](http://bayes.cs.ucla.edu/PRIMER/primer-ch1.pdf) reads:

Subpopulations | No Treatment | Treatment 
------------ | ------------- | ------
Male |  234 of 270 recover (87%) | 81 of 87 recover (93%) 
Female | 55 of 80 recover (69%) | 192 of 263 recover (73%)
Total | 289 of 350 recover (83%) | 273 of 350 recover (78%)


We swap the rows because the fake-data-for-learning package requires keys to be in alphabetical order (see [this issue](https://github.com/munichpavel/fake-data-for-learning/issues/8)):

Subpopulations | No Treatment | Treatment 
------------ | ------------- | ------
Female | 55 of 80 recover (69%) | 192 of 263 recover (73%)
Male |  234 of 270 recover (87%) | 81 of 87 recover (93%) 
Total | 289 of 350 recover (83%) | 273 of 350 recover (78%)



## Equations for Simpson paradox

Continuing with the example above, we regard the counts as the entries $u_{ijk}$ of a $2 \times 2 \times 2$ table in $\mathbb{N}^{2 \times 2 \times 2}$, indexed along the dimensions $(\mathrm{RECOVERED}, \mathrm{GENDER}, \mathrm{TREATED})$, with GENDER coded as 0 for female and 1 for male:

$$\begin{align*}
u_{000} &= \textrm{Count of non-recovered females who received no treatment} \\
u_{100} &= \textrm{Count of recovered females who received no treatment} \\
u_{010} &= \textrm{Count of non-recovered males who received no treatment} \\
u_{110} &= \textrm{Count of recovered males who received no treatment} \\
u_{001} &= \textrm{Count of non-recovered females who received treatment} \\
u_{101} &= \textrm{Count of recovered females who received treatment} \\
u_{011} &= \textrm{Count of non-recovered males who received treatment} \\
u_{111} &= \textrm{Count of recovered males who received treatment}
\end{align*}$$

To denote sums along one or more of the count dimensions RECOVERED, GENDER, TREATED, we replace the index of each summed dimension by a $+$. For example, $u_{10+} = u_{100} + u_{101} = 55 + 192 = 247$ is the count of recovered females, regardless of treatment. Hence we can rewrite the above table as

Subpopulations | No Treatment | Treatment 
:----------: | :-----------: | :----: 
Female | $u_{100}$ of $u_{+00}$ recover (ratio $\frac{u_{100}}{u_{+00}}$) | $u_{101}$ of $u_{+01}$ recover (ratio $\frac{u_{101}}{u_{+01}}$)
Male | $u_{110}$ of $u_{+10}$ recover (ratio $\frac{u_{110}}{u_{+10}}$) | $u_{111}$ of $u_{+11}$ recover (ratio $\frac{u_{111}}{u_{+11}}$)
Total | $u_{1+0}$ of $u_{++0}$ recover (ratio $\frac{u_{1+0}}{u_{++0}}$) | $u_{1+1}$ of $u_{++1}$ recover (ratio $\frac{u_{1+1}}{u_{++1}}$)
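
As a worked instance of this notation, here is a sketch that substitutes only the counts from the table above. The reversal in the example can be written as:

$$\begin{align*}
\frac{u_{101}}{u_{+01}} = \frac{192}{263} \approx 0.73 &> \frac{u_{100}}{u_{+00}} = \frac{55}{80} \approx 0.69 && \textrm{(female)} \\
\frac{u_{111}}{u_{+11}} = \frac{81}{87} \approx 0.93 &> \frac{u_{110}}{u_{+10}} = \frac{234}{270} \approx 0.87 && \textrm{(male)} \\
\frac{u_{1+1}}{u_{++1}} = \frac{273}{350} = 0.78 &< \frac{u_{1+0}}{u_{++0}} = \frac{289}{350} \approx 0.83 && \textrm{(total)}
\end{align*}$$

Treatment goes with a higher recovery rate within each gender subgroup, yet with a lower one in the total population.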



## Example: Contingency tables

For computation, the $u_{ijk}$ lend themselves to representing experiment outcomes grouped by sub-populations as contingency tables; see [Lectures on Algebraic Statistics](http://math.berkeley.edu/~bernd/owl.pdf), Section 1.1.

In this example, the contingency table is a 2x2x2 array of event counts. Using [xarray](http://xarray.pydata.org/en/stable/), we can define this array in a form that is convenient for accessing entries and performing operations (e.g. the marginal sums of the $+$ notation).

In [2]:
import os
from pathlib import Path

import numpy as np
import pandas as pd
import xarray as xr

from fake_data_for_learning import utils as ut

In [2]:
# Counts u_{ijk}, indexed as (recovered, gender, treated)
U = [
    [   # recovered = 0
        [25, 71],    # female: untreated, treated
        [36, 6],     # male: untreated, treated
    ],
    [   # recovered = 1
        [55, 192],   # female: untreated, treated
        [234, 81],   # male: untreated, treated
    ],
]
gender_vals = ['female', 'male']
treated_vals = [0, 1]
recovered_vals = [0, 1]
contingency = xr.DataArray(
    U,
    dims=('recovered', 'gender', 'treated'),
    coords=[recovered_vals, gender_vals, treated_vals]
)

print("Check against counts from contingency table above:\n")
events = [
    dict(recovered=1, gender='female', treated=0),
    dict(recovered=1, gender='female', treated=1),
    # Omitting 'treated' selects both treatment arms at once
    dict(recovered=1, gender='male'),
]

for event in events:
    marginal_counts = contingency.sel(**event)
    print(f'Counts for event(s) {event}: \n{marginal_counts.values}')

Check against counts from contingency table above:

Counts for event(s) {'recovered': 1, 'gender': 'female', 'treated': 0}: 
55
Counts for event(s) {'recovered': 1, 'gender': 'female', 'treated': 1}: 
192
Counts for event(s) {'recovered': 1, 'gender': 'male'}: 
[234  81]
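
The marginal sums of the $+$ notation can be sketched with `DataArray.sum`. The block below rebuilds the same array so that it runs on its own; the variable names (`group_sizes`, `recovered_females`, and so on) are illustrative choices, not part of any package.

```python
import xarray as xr

# Counts u_{ijk}, indexed as (recovered, gender, treated), copied from the cell above.
U = [
    [[25, 71], [36, 6]],      # recovered = 0: female, male
    [[55, 192], [234, 81]],   # recovered = 1: female, male
]
contingency = xr.DataArray(
    U,
    dims=("recovered", "gender", "treated"),
    coords=[[0, 1], ["female", "male"], [0, 1]],
)

# u_{+jk}: summing over 'recovered' gives the gender x treatment group sizes.
group_sizes = contingency.sum(dim="recovered")

# u_{10+}: recovered females, regardless of treatment.
recovered_females = contingency.sel(recovered=1, gender="female").sum().item()

# u_{1+k} and u_{++k}: recovered counts and group sizes per treatment arm.
recovered_by_treatment = contingency.sel(recovered=1).sum(dim="gender")
total_by_treatment = contingency.sum(dim=["recovered", "gender"])

print(group_sizes.values)             # [[ 80 263] [270  87]]
print(recovered_females)              # 247
print(recovered_by_treatment.values)  # [289 273]
print(total_by_treatment.values)      # [350 350]
```

Each `+` in an index corresponds to one dimension name passed to `sum(dim=...)`, so the table entries can be read off without manual index arithmetic.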


## Exercise: Recovery rates with `xarray`

Difficulty: (*)

Determine which subpopulation in each of the following pairs has the higher recovery rate $\left(=\frac{\textrm{number recovered in subpopulation}}{\textrm{subpopulation size}}\right)$:

1. females vs males
1. females who did not receive treatment vs males who did not receive treatment

## Exercise: No-Simpson

Difficulty: (**)

Show that if the counts $u_{ijk}$ of the $(\mathrm{RECOVERED}, \mathrm{GENDER}, \mathrm{TREATED})$ table above depend only on $i$, that is, $u_{ijk} = u_{i} > 0$ for all $j,k$, then Simpson's paradox cannot occur.


## Exercise: Simple linear coordinate change about a Simpson point

Difficulty: (**)

Assume $p_{ijk}$ as before ($i,j,k \in \{0,1\}$) such that the sub-population and total population (in)equalities are 

$$\begin{align}
p_{101}p_{+00} -p_{100}p_{+01} > 0 \\
p_{111}p_{+10} -p_{110}p_{+11} = 0 
\end{align}$$

$$\begin{equation}
p_{1+1}p_{++0} - p_{1+0}p_{++1}  = 0
\end{equation}$$

For $\epsilon > 0$, consider the coordinate change

$$\begin{align}
p_{101} \mapsto p_{101} + \epsilon \\
p_{100} \mapsto p_{100} - \epsilon
\end{align}$$

where $\epsilon$ is chosen small enough that $p_{101} + \epsilon < 1$ and $p_{100} - \epsilon > 0$.

This change moves the left-hand side of the first subpopulation inequality further from zero, potentially increasing the "paradox," as the treated female subpopulation fares even better.

What happens to the other two (in-)equalities? Give your answer in terms of the original $p_{ijk}$ and $\epsilon$.

## Exercise: Simpson relative-risk in terms of counts

Difficulty: (*)

Recall that the relative risk for two sub-population groups is the ratio of the probability of an outcome in an exposed group (=treated) to the probability of the outcome in an unexposed group (=untreated).

In terms of relative risk, Simpson's paradox occurs, roughly, when the relative risk in one or more subpopulations is less than (resp. greater than) 1, while in the total population the trend is reversed or absent (relative risk equal to 1).

Assuming that all terms we need to divide by are non-zero, show that the relative risks for the female, male and total groups, respectively, with outcome event $\mathrm{RECOVERED}=1$ and exposure event $\mathrm{TREATED}=1$, are

$$\begin{align}
rr_{gender=0} &= \frac{ u_{101} u_{+00} }{ u_{100} u_{+01} } \\
rr_{gender=1} &= \frac{ u_{111} u_{+10} }{ u_{110} u_{+11} } \\
rr &= \frac{ u_{1+1} u_{++0} }{ u_{1+0} u_{++1} }
\end{align}$$
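
As a numerical sanity check (not a proof), the relative risks for the example counts can be computed directly from the definition. This is a minimal sketch using plain NumPy; the helper `relative_risk` is a hypothetical name introduced here for illustration.

```python
import numpy as np

# u[i, j, k] with axes (recovered, gender, treated); gender 0 = female, 1 = male.
u = np.array([
    [[25, 71], [36, 6]],
    [[55, 192], [234, 81]],
])


def relative_risk(counts):
    """Relative risk of recovery, treated vs untreated.

    `counts` has shape (2, 2) with axes (recovered, treated).
    """
    recovered = counts[1]           # recovered counts per treatment arm
    arm_sizes = counts.sum(axis=0)  # group size per treatment arm
    rates = recovered / arm_sizes   # recovery rate per treatment arm
    return rates[1] / rates[0]


rr_female = relative_risk(u[:, 0, :])
rr_male = relative_risk(u[:, 1, :])
rr_total = relative_risk(u.sum(axis=1))  # sum over gender

print(f"female: {rr_female:.3f}, male: {rr_male:.3f}, total: {rr_total:.3f}")
# female: 1.062, male: 1.074, total: 0.945
```

Both subgroup relative risks exceed 1 while the total relative risk is below 1, which is the reversal described above.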

## Exercise: more extreme Simpson case of female subpopulation examples above

Difficulty: (***)

Consider the count transformation

$$\begin{align}
\phi_{101}: u_{101} &\mapsto u_{101} + \epsilon \\
\phi_{111}: u_{111} &\mapsto u_{111} + \nu
\end{align}$$

Assume that the gender x treatment sub-population counts remain fixed, i.e. the $u_{+jk}$ do not change. This is achieved by adjusting the corresponding non-recovery counts, e.g. $u_{001} \mapsto u_{001} - \epsilon$.

Find a $\phi_{110}: u_{110} \mapsto \phi_{110}(u_{110})$ in terms of $\nu$ that preserves the relative-risk $rr_{gender=1}$. Hint: finding a solution with the concrete numbers of the example "Better recovery with treatment for female subpopulation, all else the same recovery" may help.

What constraints are needed on $\epsilon$ and $\nu$ for the transformation to be valid? Recall our assumption that sub-population counts $u_{+jk}$ remain fixed.
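
The assumption that the $u_{+jk}$ stay fixed can be illustrated numerically. The sketch below moves counts from non-recovered to recovered within a single gender x treatment cell and confirms that the group sizes are unchanged. The helper `shift_to_recovered` and the value chosen for `epsilon` are hypothetical, used only for illustration; the sketch does not solve the exercise.

```python
import numpy as np

# u[i, j, k] with axes (recovered, gender, treated); gender 0 = female, 1 = male.
u = np.array([
    [[25, 71], [36, 6]],
    [[55, 192], [234, 81]],
])


def shift_to_recovered(counts, gender, treated, amount):
    """Move `amount` counts from non-recovered to recovered in one cell.

    The group size u_{+jk} of that cell is unchanged by construction.
    """
    shifted = counts.copy()
    shifted[1, gender, treated] += amount
    shifted[0, gender, treated] -= amount
    if (shifted < 0).any():
        raise ValueError("shift would produce a negative count")
    return shifted


epsilon = 5  # illustrative value only
shifted = shift_to_recovered(u, gender=0, treated=1, amount=epsilon)

print(shifted[1, 0, 1], shifted[0, 0, 1])             # 197 66
print((shifted.sum(axis=0) == u.sum(axis=0)).all())   # True
```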

## Exercise: more extreme Simpson case of female subpopulation examples above, continued

Difficulty: (**)

Consider the count transformation

$$\begin{align}
\phi_{101}: u_{101} &\mapsto u_{101} + \epsilon \\
\phi_{111}: u_{111} &\mapsto u_{111} + \nu \\
\phi_{110}: u_{110} &\mapsto \phi_{110}(u_{110}) \\
\phi_{100}: u_{100} &\mapsto u_{100} - \delta
\end{align}$$

where we assume $\epsilon, \delta > 0$, and $\phi_{110}$ is from your solution to the previous exercise. We also keep the assumption that the $u_{+jk}$ are fixed under all transformations.

Determine conditions on $\delta$ in terms of $\epsilon$ and $\nu$ to ensure that the total population relative-risk $rr$ is preserved.
