## Berkson's Paradox

Berkson's paradox is the observation that two desirable traits can appear negatively correlated in a population, even when the traits are unrelated: members who have one trait seem to lack the other. The apparent correlation arises when the sample is selected on the traits themselves, for example when we ignore members of the population who have neither trait. Among the members that remain, most have only one of the two traits rather than both, and this produces the negative correlation.

The book's simulation of the newsworthiness and trustworthiness of scientific articles illustrates the paradox. It appears once we consider only those articles whose _combined_ newsworthiness and trustworthiness is high: in this sub-population, most articles score highly on one of the two characteristics but not on both.

In [1]:
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns

### Code 6.1

In [2]:
from scipy.stats import norm


N = 200
p = 0.1

nw = norm.rvs(loc=0, scale=1, size=N)
tw = norm.rvs(loc=0, scale=1, size=N)

score = nw + tw
# select the top 10% of the combined scores (p = 0.1)
q = np.quantile(score, 1 - p)

selected = score >= q  # boolean mask; the comparison already yields True/False

np.corrcoef(nw[selected], tw[selected])

array([[ 1.        , -0.75685876],
       [-0.75685876,  1.        ]])
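
To check that the negative correlation comes from the selection step alone, it can help to compare the correlation in the full population with the correlation in the selected subset. The following is a minimal sketch under stated assumptions, not the book's code: the function name `selection_correlation`, the larger sample size, and the fixed seed are choices made here for illustration and reproducibility.

```python
import numpy as np


def selection_correlation(n=10_000, p=0.1, seed=0):
    """Return (corr in the full population, corr among the top-p fraction).

    Hypothetical helper for illustration; n and seed are arbitrary choices.
    """
    rng = np.random.default_rng(seed)
    nw = rng.normal(size=n)  # newsworthiness, drawn independently of tw
    tw = rng.normal(size=n)  # trustworthiness

    score = nw + tw
    cutoff = np.quantile(score, 1 - p)  # threshold for the top p fraction
    selected = score >= cutoff

    r_all = np.corrcoef(nw, tw)[0, 1]
    r_selected = np.corrcoef(nw[selected], tw[selected])[0, 1]
    return r_all, r_selected


r_all, r_selected = selection_correlation()
print(f"full population:  {r_all:+.2f}")
print(f"selected top 10%: {r_selected:+.2f}")
```

Because the two traits are simulated independently, the full-population correlation should sit near zero, while the correlation among the selected articles should be clearly negative.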

## Multicollinearity

### Code 6.2

In [3]:
from scipy.stats import uniform


N = 100

height = norm.rvs(loc=165, scale=10, size=N)
leg_prop = uniform.rvs(loc=0.4, scale=0.1, size=N)

# independent measurement noise for each leg, sd 0.1 (about 1 mm if height is in cm)
leg_left = height * leg_prop + norm.rvs(loc=0.0, scale=0.1, size=N)
leg_right = height * leg_prop + norm.rvs(loc=0.0, scale=0.1, size=N)

df = pd.DataFrame({'height': height, 'leg_left': leg_left, 'leg_right': leg_right})
df.head()

|   | height | leg_left | leg_right |
|---|---|---|---|
| 0 | 169.733439 | 81.345882 | 81.428369 |
| 1 | 153.792855 | 63.273094 | 63.489442 |
| 2 | 181.47458 | 75.498856 | 75.457561 |
| 3 | 161.649857 | 74.336097 | 74.451645 |
| 4 | 161.566315 | 67.176114 | 67.337097 |


### Code 6.3

In [4]:
with pm.Model() as m_6_1:
    a = pm.Normal('a', mu=80, sigma=100)
    b_l = pm.Normal('b_l', mu=2.2, sigma=10)
    b_r = pm.Normal('b_r', mu=2.2, sigma=10)
    
    sigma = pm.Exponential('sigma', lam=1)
    
    mu = pm.Deterministic('mu', a + b_l * df['leg_left'] + b_r * df['leg_right'])
    height = pm.Normal('height', mu=mu, sigma=sigma, observed=df['height'])
    
    trace_6_1 = pm.sample(2000, tune=2000)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, b_r, b_l, a]


Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 508 seconds.
There were 5 divergences after tuning. Increase `target_accept` or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
There was 1 divergence after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.8824096967009342, but should be close to 0.8. Try to increase the number of tuning steps.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
There were 35 divergences after tuning. Increase `target_accept` or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
There were 70 divergences after tuning. Increase `target_accept` or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accep

In [5]:
az.summary(trace_6_1, hdi_prob=0.89, var_names=['a', 'b_l', 'b_r'])



|   | mean | sd | hdi_5.5% | hdi_94.5% | mcse_mean | mcse_sd | ess_mean | ess_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| a | 93.29 | 8.601 | 79.521 | 106.708 | 0.149 | 0.106 | 3334.0 | 3323.0 | 3329.0 | 3880.0 | 1.0 |
| b_l | 3.434 | 4.181 | -3.498 | 9.783 | 0.1 | 0.07 | 1762.0 | 1762.0 | 1755.0 | 2327.0 | 1.0 |
| b_r | -2.456 | 4.16 | -8.893 | 4.347 | 0.099 | 0.07 | 1782.0 | 1782.0 | 1776.0 | 2278.0 | 1.0 |
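
In the summary, `b_l` and `b_r` each have a wide interval spanning zero, yet their posterior means add up to about 0.98 (3.434 + (-2.456)). A rough way to see why is to check how strongly the two leg lengths are correlated, and to look at the sum of the two slopes rather than each slope alone. The sketch below re-simulates the data with NumPy and fits ordinary least squares in place of the PyMC3 model. This is a swapped-in, non-Bayesian check chosen for speed, and the seed and variable names are assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 100

# Same data-generating process as Code 6.2
height = rng.normal(165, 10, size=n)
leg_prop = rng.uniform(0.4, 0.5, size=n)
leg_left = height * leg_prop + rng.normal(0, 0.1, size=n)
leg_right = height * leg_prop + rng.normal(0, 0.1, size=n)

# The two predictors carry almost the same information
leg_corr = np.corrcoef(leg_left, leg_right)[0, 1]

# Ordinary least squares: height ~ a + b_l * leg_left + b_r * leg_right
X = np.column_stack([np.ones(n), leg_left, leg_right])
coef, *_ = np.linalg.lstsq(X, height, rcond=None)
a_hat, b_l_hat, b_r_hat = coef

print(f"corr(leg_left, leg_right) = {leg_corr:.4f}")
print(f"b_l = {b_l_hat:.2f}, b_r = {b_r_hat:.2f}, sum = {b_l_hat + b_r_hat:.2f}")
```

With predictors this strongly correlated, the data constrain the sum of the two slopes far better than either slope alone, which is consistent with the wide, offsetting intervals for `b_l` and `b_r` in the summary table.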
