## Example: (nonlinear) IV causal inference (with invalid IVs)
> Below is an example that demonstrates the usage of `ts_twas` in `nl_causal`.

## Simulate Data

- **library:** `nl_causal.base.sim`
- **Two-stage datasets:** the two stages use two independent datasets, and **2SLS** and **2SIR** require different types of data:
  * For 2SLS:
    + Stage 1. LD matrix (`np.dot(Z1.T, Z1)`) + XZ_sum (`np.dot(Z1.T, X1)`)
    + Stage 2. ZY_sum (GWAS summary) (`np.dot(Z2.T, y2)`)
  * For 2SIR:
    + Stage 1. individual-level data `Z1` and `X1`
    + Stage 2. ZY_sum (GWAS summary) (`np.dot(Z2.T, y2)`)
- **Remarks:** In terms of data, the advantage of 2SLS is that it requires only summary statistics (the LD matrix, XZ and YZ) in both Stages 1 and 2, whereas 2SIR needs individual-level data in Stage 1.
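
To illustrate why summary statistics are enough for a linear Stage 1, the sketch below recovers the least-squares estimate of `theta` from the LD matrix and XZ_sum alone and compares it with a fit on the individual-level data. It is a minimal NumPy illustration on toy data, not the internals of `nl_causal`; the toy sizes and noise scale are assumptions chosen only for the demo.

```python
import numpy as np

# Toy Stage-1 data (sizes and noise level are illustrative assumptions)
rng = np.random.default_rng(0)
n, p = 500, 5
Z1 = rng.normal(size=(n, p))
theta = np.ones(p) / np.sqrt(p)
X1 = Z1 @ theta + rng.normal(scale=0.5, size=n)

# Stage-1 summary statistics, as described in the list above
LD_Z1 = Z1.T @ Z1      # LD matrix
cov_ZX1 = Z1.T @ X1    # XZ_sum

# Least squares from summary statistics: solve (Z'Z) theta = Z'X
theta_summary = np.linalg.solve(LD_Z1, cov_ZX1)

# Least squares from individual-level data, for comparison
theta_individual, *_ = np.linalg.lstsq(Z1, X1, rcond=None)

print(np.allclose(theta_summary, theta_individual))
```

The two estimates agree because the normal equations depend on the data only through `Z1.T @ Z1` and `Z1.T @ X1`.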

In [2]:
## import libraries
import numpy as np
from nl_causal.base import sim
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

## simulate a dataset
np.random.seed(1)
n, p = 2000, 50
beta0 = 0.10
theta0 = np.ones(p) / np.sqrt(p)
## simulate invalid IVs
alpha0 = np.zeros(p)
alpha0[:5] = 1.

Z, X, y, phi = sim(n, p, theta0, beta0, alpha0=alpha0, case='inverse', feat='AP-normal')

## normalize the dataset
center = StandardScaler(with_std=False)
mean_X, mean_y = X.mean(), y.mean()
Z, X, y = center.fit_transform(Z), X - mean_X, y - mean_y
y_scale = y.std()
y = y / y_scale

## generate two-stage dataset
Z1, Z2, X1, X2, y1, y2 = train_test_split(Z, X, y, test_size=0.5, random_state=42)
n1, n2 = len(Z1), len(Z2)
LD_Z1, cov_ZX1 = np.dot(Z1.T, Z1), np.dot(Z1.T, X1)
LD_Z2, cov_ZY2 = np.dot(Z2.T, Z2), np.dot(Z2.T, y2)

## Models
- **library:** `nl_causal.ts_models._2SLS` and `nl_causal.ts_models._2SIR`
- **Methods:** [2SLS](https://doi.org/10.1080/01621459.2014.994705) and [2SIR](https://openreview.net/pdf?id=cylRvJYxYI)
- **Sparse regression:**
    + `sparse_reg=None`: assume all IVs are valid.
    + pass a sparse regression model from `nl_causal.sparse_reg`, such as `SCAD`, as `sparse_reg` to detect invalid IVs.
- **Remarks:** 2SIR circumvents the linearity assumption in the standard 2SLS, and includes 2SLS as a special case.
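
For reference, the two models can be written side by side, using the same notation that the `summary()` outputs in this example print:

$$
\begin{aligned}
\text{2SLS:}\quad x &= z^\top \theta + \omega, & y &= \beta\, x + z^\top \alpha + \epsilon, \\
\text{2SIR:}\quad \psi(x) &= z^\top \theta + \omega, & y &= \beta\, \psi(x) + z^\top \alpha + \epsilon.
\end{aligned}
$$

Reading $\psi$ as a transformation of the exposure that is not assumed to be linear (an interpretation, not something the printed output states), the choice $\psi(x) = x$ gives back the 2SLS model. A nonzero entry $\alpha_j$ marks IV $j$ as invalid, which matches `alpha0[:5] = 1.` in the simulation.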

In [3]:
from nl_causal.ts_models import _2SLS, _2SIR

In [7]:
## 2SLS

# specify a sparse regression model to detect invalid IVs
from nl_causal.sparse_reg import L0_IC

Ks = range(int(p/2))
reg_model = L0_IC(fit_intercept=False, alphas=10**np.arange(-1, 3, .3),
                  Ks=Ks, max_iter=10000, refit=False, find_best=False)
LS = _2SLS(sparse_reg=reg_model)
## Stage-1 fit theta
LS.fit_theta(LD_Z1, cov_ZX1)
## Stage-2 fit beta
LS.fit_beta(LD_Z2, cov_ZY2, n2)
LS.selection_summary()

| | candidate_model | criteria | mse |
|---|---|---|---|
| 0 | [0, 1, 2, 3, 4, 5, 6, 7, 10, 15, 17, 18, 19, 2... | -1.44748 | 0.199236 |
| 1 | [0, 1, 2, 3, 4, 5, 36, 33, 43, 45, 18, 21, 23,... | -1.509465 | 0.200654 |
| 2 | [0, 1, 2, 3, 4, 5, 6, 7, 10, 15, 17, 18, 20, 2... | -1.457821 | 0.198553 |
| 3 | [1, 2, 3, 50] | -1.146681 | 0.309031 |
| 4 | [0, 1, 2, 3, 4, 5, 6, 7, 40, 10, 17, 18, 23, 50] | -1.505995 | 0.201351 |
| 5 | [0, 1, 2, 3, 4, 5, 6, 7, 40, 10, 43, 45, 17, 1... | -1.492272 | 0.201333 |
| 6 | [0, 1, 2, 3, 4, 18, 23, 50] | -1.54317 | 0.202213 |
| 7 | [0, 1, 2, 3, 4, 6, 40, 10, 18, 23, 50] | -1.526101 | 0.201476 |
| 8 | [0, 1, 2, 3, 4, 6, 18, 23, 50] | -1.536274 | 0.202211 |
| 9 | [0, 1, 2, 3, 4, 5, 6, 7, 40, 41, 10, 43, 45, 1... | -1.485722 | 0.201261 |
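
One way to read the table above is to pick the row with the lowest `criteria` value. This assumes `criteria` is an information-criterion-style score where smaller is better; the printed output does not say so, and the library may select differently. The sketch copies the `criteria` column by hand.

```python
# `criteria` column copied from the selection summary, rows 0 to 9
criteria = [-1.44748, -1.509465, -1.457821, -1.146681, -1.505995,
            -1.492272, -1.54317, -1.526101, -1.536274, -1.485722]

# Assumption: a smaller criterion value indicates a better candidate model
best_row = min(range(len(criteria)), key=criteria.__getitem__)
print(best_row, criteria[best_row])
```

Under that assumption the lowest score belongs to row 6, whose candidate model `[0, 1, 2, 3, 4, 18, 23, 50]` contains the indices 0 to 4 that the simulation set as invalid IVs via `alpha0[:5] = 1.`.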


In [6]:
## produce p-value and CI for beta
LS.test_effect(n2, LD_Z2, cov_ZY2)
LS.CI_beta(n1, n2, Z1, X1, LD_Z2, cov_ZY2)
LS.summary()

╔═══════════════════════════════════════════════╗
║ 2SLS                                          ║
║ ----                                          ║
║ x = z^T theta + omega;                        ║
║ y = beta x + z^T alpha + epsilon.             ║
║ ---                                           ║
║ beta: causal effect from x to y.              ║
║ ---                                           ║
║ Est beta (CI): 0.023 (CI: [-0.0414  0.0877])  ║
║ p-value: 0.2461, -log10(p): 0.6088            ║
╚═══════════════════════════════════════════════╝


In [4]:
## 2SIR
SIR = _2SIR(sparse_reg=None)
## Stage-1 fit theta
SIR.fit_theta(Z1, X1)
## Stage-2 fit beta
SIR.fit_beta(LD_Z2, cov_ZY2, n2)
## produce p-value and CI for beta
SIR.test_effect(n2, LD_Z2, cov_ZY2)
SIR.CI_beta(n1, n2, Z1, X1, LD_Z2, cov_ZY2)
SIR.summary()

╔═════════════════════════════════════════════╗
║ 2SIR                                        ║
║ ----                                        ║
║ ψ(x) = z^T theta + omega;                   ║
║ y = beta ψ(x) + z^T alpha + epsilon.        ║
║ ---                                         ║
║ beta: causal effect from x to y.            ║
║ ---                                         ║
║ Est beta (CI): 0.084 (CI: [0.0219 0.1462])  ║
║ p-value: 0.0074, -log10(p): 2.1314          ║
╚═════════════════════════════════════════════╝


## Results

In the simulated data, the true causal effect is `beta0 = 0.10`. 

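- 2SLS gives a CI of [-0.0414, 0.0877], which misses the true effect, and a p-value of 0.2461, so it fails to reject the null hypothesis `H0: beta = 0`.
- 2SIR gives a CI of [0.0219, 0.1462], which covers the true effect, and a p-value of 0.0074, so it rejects the null hypothesis.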
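
The comparison above can be checked mechanically. The sketch below copies the intervals and p-values from the two `summary()` outputs and tests coverage of `beta0` and rejection of `H0`. The 0.05 significance level is an assumption made for illustration; the outputs do not state a level.

```python
beta0 = 0.10   # true causal effect used in the simulation
level = 0.05   # assumed significance level (not stated in the outputs)

# Values copied from the printed summaries
intervals = {"2SLS": (-0.0414, 0.0877), "2SIR": (0.0219, 0.1462)}
p_values = {"2SLS": 0.2461, "2SIR": 0.0074}

results = {}
for name, (lower, upper) in intervals.items():
    covers = lower <= beta0 <= upper      # does the CI contain beta0?
    rejects = p_values[name] < level      # is H0: beta = 0 rejected?
    results[name] = (covers, rejects)
    print(f"{name}: covers beta0 = {covers}, rejects H0 = {rejects}")
```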