# Case study 1: flu shot encouragement (kernel)
This notebook reproduces the experiments from Case Study 1 of the paper *Multi-Source Causal Inference Using Control Variates*. Specifically, it uses kernel smoothing to estimate the odds ratios.

We use flu shot data from Section 8.1 of [Ding and Lu 2016](https://www.dropbox.com/s/jxk76wk8ckxx4m3/Ding_et_al-2017%20JRSSB%20Principal%20stratification%20analysis%20using%20principal%20scores.pdf?dl=0). The original dataset fludata.txt can be downloaded at https://rss.onlinelibrary.wiley.com/hub/journal/14679868/series-b-datasets/79_3a

The variables are:

- Z: the binary randomized encouragement to get the flu shot.
- Y: the binary outcome of flu-related hospitalization.
- X: all covariates, most of which are binary.

In [None]:
import numpy as np
import pandas as pd
from importlib import reload

import data_sampler
import bootstrap

In [2]:
df = pd.read_csv('fludata.txt', sep=" ")

In [6]:
Y_COLUMN = 'outcome'
Z_COLUMN = 'assign'
X_COLUMNS = ['age', 'copd', 'dm', 'heartd', 'race', 'renal', 'sex', 'liverd']

# Data generation using logistic regression model with interaction terms

In this section, we assume that the data-generating outcome model is

$$P(Y=1 | Z = z, X = x) = \frac{e^{\beta_0 + \beta_1 z + \beta_2 ^T x + \beta_3 ^T xz}}{1 + e^{\beta_0 + \beta_1 z + \beta_2^T x + \beta_3 ^T xz}}$$

This allows for treatment effects that vary linearly with $x$ (linear heterogeneous effects).
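As a sanity check on the model above, the following minimal sketch evaluates the outcome probability and the conditional odds ratio it implies. The coefficient values and the helper names `outcome_probability` and `conditional_odds_ratio` are illustrative assumptions, not the fitted values or the `data_sampler` API. Under this model the log odds at $z=1$ and $z=0$ differ by $\beta_1 + \beta_3^T x$, so the conditional odds ratio is $e^{\beta_1 + \beta_3^T x}$.

```python
import numpy as np

def outcome_probability(z, x, beta0, beta1, beta2, beta3):
    """P(Y=1 | Z=z, X=x) under the interaction logistic model."""
    x = np.asarray(x, dtype=float)
    logit = beta0 + beta1 * z + x @ beta2 + z * (x @ beta3)
    return 1.0 / (1.0 + np.exp(-logit))

def conditional_odds_ratio(x, beta1, beta3):
    """Odds ratio of Y between Z=1 and Z=0 at covariates x."""
    return np.exp(beta1 + np.asarray(x, dtype=float) @ beta3)

# Hypothetical coefficients for two covariates (illustration only).
beta0, beta1 = -2.0, 0.5
beta2 = np.array([0.3, -0.1])
beta3 = np.array([-0.2, 0.4])
x = np.array([1.0, 0.0])

p1 = outcome_probability(1, x, beta0, beta1, beta2, beta3)
p0 = outcome_probability(0, x, beta0, beta1, beta2, beta3)

# The odds ratio computed from the two probabilities should match
# the closed form exp(beta1 + beta3^T x).
odds_ratio_from_probs = (p1 / (1 - p1)) / (p0 / (1 - p0))
odds_ratio_closed_form = conditional_odds_ratio(x, beta1, beta3)
print(odds_ratio_from_probs, odds_ratio_closed_form)
```

The interaction term $\beta_3^T xz$ is what lets the odds ratio change with $x$; with $\beta_3 = 0$ it would be the constant $e^{\beta_1}$.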

## Fit model to get $P(Y = 1 | Z = z, X = x)$

In [8]:
data_sampler_interaction_logistic = data_sampler.DataSamplerInteractionLogistic(Z_COLUMN, X_COLUMNS, Y_COLUMN)
data_sampler_interaction_logistic.fit_outcome(df, print_results=True)

Accuracy for outcome model: 0.915065
AUC for outcome model: 0.665919
Coefficients for outcome model: [[ 1.05670447 -0.00555082  0.44484032  0.32466448  0.83032499  0.01765058
   1.39359229  0.09913    -2.47969443 -0.0050041  -0.2109543   0.28471807
  -0.35065923 -0.55507202  0.33194963 -0.65973364  3.59850653]]


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=10000,
                   multi_class='auto', n_jobs=None, penalty='none',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Fit model to get $P(Z = 1 | X = x)$ (propensity score)

We assume that the propensity score comes from a simple logistic regression model: 

$$P(Z = 1 | X = x) = \frac{e^{a_0 + a_1^Tx}}{ 1 + e^{a_0 + a_1^Tx}}$$

We fit $a_0, a_1$ from the data.
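To make the fitting step concrete, here is a minimal sketch of an unpenalized logistic regression fit on synthetic data. The notebook itself fits the propensity model through `data_sampler`; this sketch swaps in a plain Newton's method in NumPy so that it is self-contained. The function name `fit_logistic_newton`, the synthetic covariates, and the "true" coefficients are all assumptions made for illustration.

```python
import numpy as np

def fit_logistic_newton(X, z, n_iter=50, tol=1e-10):
    """Unpenalized logistic regression via Newton's method.

    Returns the coefficient vector with the intercept a_0 first,
    followed by one slope per column of X (the vector a_1).
    """
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    a = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X1 @ a)))
        gradient = X1.T @ (z - p)  # score of the log-likelihood
        information = (X1 * (p * (1.0 - p))[:, None]).T @ X1
        step = np.linalg.solve(information, gradient)
        a = a + step
        if np.max(np.abs(step)) < tol:
            break
    return a

# Synthetic data with known (hypothetical) propensity coefficients.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20000, 2))
a_true = np.array([0.2, 0.5, -0.3])
p_true = 1.0 / (1.0 + np.exp(-(a_true[0] + X_demo @ a_true[1:])))
z_demo = (rng.random(20000) < p_true).astype(float)

a_hat = fit_logistic_newton(X_demo, z_demo)
print("estimated [a_0, a_1]:", a_hat)
```

With a sample of this size the estimates should land close to `a_true`, which is a quick way to check that the fitting routine is wired up correctly before applying it to real data.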

In [9]:
data_sampler_interaction_logistic.fit_propensity(df, print_results=True)

Training accuracy for propensity model: 0.526389


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=10000,
                   multi_class='auto', n_jobs=None, penalty='none',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Generate case-control data

In [10]:
selection_biased_samples = data_sampler_interaction_logistic.generate_selection_biased_data(df, num_samples=10000)
selection_biased_samples.describe()

Generated 100000 samples before selection bias
Filtered to 17110 samples after selection bias; only returning the requested 10000


| | age | copd | dm | heartd | race | renal | sex | liverd | assign | outcome |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 64.8548 | 0.3066 | 0.3168 | 0.6398 | 0.6321 | 0.0312 | 0.6478 | 0.0037 | 0.4915 | 0.4622 |
| std | 12.711722 | 0.461105 | 0.465252 | 0.480082 | 0.482258 | 0.173867 | 0.47768 | 0.060718 | 0.499953 | 0.498594 |
| min | 14.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 59.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 67.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 75% | 73.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| max | 100.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
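The log lines above indicate a generate-then-filter scheme: a large sample is drawn first and then thinned by a selection step. The exact selection rule inside `generate_selection_biased_data` is not shown in this notebook, so the sketch below only illustrates the general idea of case-control style selection, assuming that units are retained with a probability that depends on the outcome alone. The outcome rate and retention probabilities are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_generated = 100_000

# Hypothetical population in which the outcome is rare.
y_pop = rng.random(n_generated) < 0.09

# Assumed selection rule: keep cases (y=1) far more often than controls (y=0).
keep_prob = np.where(y_pop, 0.9, 0.1)
selected = rng.random(n_generated) < keep_prob
y_selected = y_pop[selected]

print("generated:", n_generated)
print("kept after selection:", int(selected.sum()))
print("outcome rate before selection:", y_pop.mean())
print("outcome rate after selection:", y_selected.mean())
```

Selection of this kind inflates the outcome rate in the retained sample, so quantities such as the ATE cannot be read off the biased sample directly.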


# Compute ATE estimates with and without the control variate, using the kernel estimator

In [None]:
def ATE_estimator_fn_interaction(df_input):
    data_sampler_interaction_logistic = data_sampler.DataSamplerInteractionLogistic(Z_COLUMN, X_COLUMNS, Y_COLUMN)
    data_sampler_interaction_logistic.fit_outcome(df_input)
    return data_sampler_interaction_logistic.get_ATE_estimate(df_input)

OR_xs = df.sample(20, replace=False)[X_COLUMNS]  # sample a few strata
def CV_estimator_kernel(df_input_obs, df_input_bias, bandwidth=10, n_OR_samples=20):
    data_sampler_interaction_logistic = data_sampler.DataSamplerInteractionLogistic(Z_COLUMN, X_COLUMNS, Y_COLUMN)
    # Estimate OR from observational dataset
    OR_obs = np.mean(data_sampler_interaction_logistic.get_conditional_OR_estimates_kernel(input_df=df_input_obs, x_inputs=OR_xs, bandwidth=bandwidth))
    # Estimate OR from selection bias dataset
    OR_bias = np.mean(data_sampler_interaction_logistic.get_conditional_OR_estimates_kernel(input_df=df_input_bias, x_inputs=OR_xs, bandwidth=bandwidth))
    return OR_obs - OR_bias

CV_samples, ATE_hat_samples, _ = bootstrap.run_bootstrap_df(df_obs=df, 
              df_bias=selection_biased_samples, 
              n_replicates=300, 
              ATE_estimator_fn=ATE_estimator_fn_interaction,
              CV_estimator_fn=CV_estimator_kernel,
             )
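The implementation of `get_conditional_OR_estimates_kernel` is not shown in this notebook. As a rough sketch of what a kernel-smoothed conditional odds ratio can look like, the code below assumes Gaussian kernel weights on the Euclidean distance to a query point and computes the cross-product ratio of the weighted 2x2 table of $(Z, Y)$. The function name, the kernel choice, and the synthetic data are assumptions. The example also illustrates the property that makes the odds ratio usable as a control variate: it is unchanged by selection that depends only on the outcome, so estimates from the two datasets target the same quantity and their difference is centered near zero.

```python
import numpy as np

def kernel_conditional_odds_ratio(X, z, y, x0, bandwidth=10.0):
    """Kernel-weighted odds ratio of Y between Z=1 and Z=0 near x0."""
    X = np.asarray(X, dtype=float)
    sq_dist = np.sum((X - np.asarray(x0, dtype=float)) ** 2, axis=1)
    w = np.exp(-0.5 * sq_dist / bandwidth ** 2)  # Gaussian kernel weights
    # Weighted cell counts of the 2x2 table of (Z, Y).
    n11 = np.sum(w * (z == 1) * (y == 1))
    n10 = np.sum(w * (z == 1) * (y == 0))
    n01 = np.sum(w * (z == 0) * (y == 1))
    n00 = np.sum(w * (z == 0) * (y == 0))
    return (n11 * n00) / (n10 * n01)

# Synthetic data with a true odds ratio of exp(1.0) at every x.
rng = np.random.default_rng(0)
n = 100_000
X_demo = rng.normal(60.0, 10.0, size=(n, 1))
z_demo = rng.integers(0, 2, size=n)
p_y = 1.0 / (1.0 + np.exp(-(-1.0 + 1.0 * z_demo)))
y_demo = (rng.random(n) < p_y).astype(int)

or_full = kernel_conditional_odds_ratio(X_demo, z_demo, y_demo, x0=[60.0])

# Outcome-dependent selection: keep every case, but only 30% of controls.
keep = (y_demo == 1) | (rng.random(n) < 0.3)
or_biased = kernel_conditional_odds_ratio(
    X_demo[keep], z_demo[keep], y_demo[keep], x0=[60.0]
)
print(or_full, or_biased, np.exp(1.0))
```

A larger bandwidth averages over more of the covariate space and lowers variance, at the cost of smoothing away any real dependence of the odds ratio on $x$.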

In [64]:
sample_cov = np.cov(np.array([ATE_hat_samples, CV_samples]), ddof=1)

# Get optimal control variates coefficient
cov_ATE_CV = sample_cov[0][1]
var_CV = sample_cov[1][1]
optimal_CV_coeff = cov_ATE_CV / var_CV
print("optimal CV coefficient:", optimal_CV_coeff)

optimal CV coefficient: 0.07205267541055566
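The coefficient computed above is the standard variance-minimizing choice for a control variate: for an adjusted estimator $\hat{\tau} - c \cdot CV$, the variance is minimized at $c^* = \mathrm{Cov}(\hat{\tau}, CV) / \mathrm{Var}(CV)$, and the resulting variance is $(1 - \rho^2)\,\mathrm{Var}(\hat{\tau})$, where $\rho$ is the correlation between $\hat{\tau}$ and $CV$. The sketch below checks this on synthetic draws, not on the notebook's bootstrap samples. How `bootstrap.run_bootstrap_df` applies the coefficient internally is not shown here; the subtraction form used below is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_replicates = 5000

# Synthetic stand-ins: a zero-mean control variate correlated with the estimator.
cv = rng.normal(0.0, 1.0, n_replicates)
ate_hat = 0.1 + 0.05 * cv + rng.normal(0.0, 0.05, n_replicates)

# Same computation as in the notebook cell above.
sample_cov = np.cov(np.array([ate_hat, cv]), ddof=1)
c_star = sample_cov[0][1] / sample_cov[1][1]

# Assumed form of the adjusted estimator.
ate_hat_cv = ate_hat - c_star * cv

var_plain = np.var(ate_hat, ddof=1)
var_adjusted = np.var(ate_hat_cv, ddof=1)
rho = np.corrcoef(ate_hat, cv)[0, 1]

print("variance without CV:", var_plain)
print("variance with CV:", var_adjusted)
print("(1 - rho^2) * variance without CV:", (1 - rho ** 2) * var_plain)
```

The variance identity holds exactly in-sample when the coefficient and the variances are computed from the same draws. The mean of the adjusted estimator is unchanged only if the control variate has mean zero.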


In [65]:
# Get variance/bias of ATE estimators with and without CV.
CV_samples, ATE_hat_samples, ATE_hat_CV_samples = bootstrap.run_bootstrap_df(
    df_obs=df, 
    df_bias=selection_biased_samples, 
    n_replicates=300, # Try increasing this
    ATE_estimator_fn=ATE_estimator_fn_interaction,
    CV_estimator_fn=CV_estimator_kernel,
    optimal_CV_coeff=optimal_CV_coeff)

ATE_var = np.var(np.array(ATE_hat_samples), ddof=1)
print(">>> Variance of ATE estimator:", ATE_var)

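# NOTE: ATE_estimate is not defined in the cells shown in this notebook.
# We assume it is the reference ATE from the outcome model fitted on the full dataset above.
ATE_estimate = data_sampler_interaction_logistic.get_ATE_estimate(df)
ATE_bias = np.mean(np.array(ATE_hat_samples)) - ATE_estimate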
print(">>> Bias of ATE estimator:", ATE_bias)

ATE_CV_var = np.var(np.array(ATE_hat_CV_samples), ddof=1)
print(">>> Variance of ATE estimator with CV:", ATE_CV_var)

ATE_CV_bias = np.mean(np.array(ATE_hat_CV_samples)) - ATE_estimate
print(">>> Bias of ATE estimator with CV:", ATE_CV_bias)

>>> Variance of ATE estimator: 0.00010715788098051833
>>> Bias of ATE estimator: 0.000538436939324833
>>> Variance of ATE estimator with CV: 2.4004733006866436e-05
>>> Bias of ATE estimator with CV: -0.004464371090199419
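To put the printed numbers side by side, the short calculation below derives the variance ratio and a bias-inclusive summary (mean squared error, computed as variance plus squared bias) from the four values above. The variable names are illustrative, and MSE is an added summary that the notebook itself does not report.

```python
# Values copied from the output above.
ate_var = 0.00010715788098051833
ate_bias = 0.000538436939324833
ate_cv_var = 2.4004733006866436e-05
ate_cv_bias = -0.004464371090199419

# Fraction of the original variance that remains after applying the CV.
variance_ratio = ate_cv_var / ate_var
variance_reduction = 1.0 - variance_ratio

# Mean squared error = variance + bias^2.
mse_plain = ate_var + ate_bias ** 2
mse_cv = ate_cv_var + ate_cv_bias ** 2

print("variance ratio (with CV / without CV):", variance_ratio)
print("variance reduction:", variance_reduction)
print("MSE without CV:", mse_plain)
print("MSE with CV:", mse_cv)
```

The control variate cuts the variance by roughly three quarters in this run, while the bias grows in magnitude from about 0.0005 to about 0.0045, so comparing MSE rather than variance alone gives a fairer picture of the trade-off.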
