generalize branch #80

Closed
pzivich opened this issue Mar 3, 2019 · 7 comments
Assignees
Labels
Causal inference Updates for the causal inference branch enhancement Short-term Issues/additions that will be completed in the near future

Comments

pzivich commented Mar 3, 2019

In the future, I think a zepid.causal.generalize branch would be a useful addition. This branch would contain generalizability and transportability tools: specifically, the g-transport formula, inverse probability of sampling weights, inverse odds of sampling weights, and doubly robust generalizability estimators.

Generally, I think I can repurpose a fair bit of the existing code. I need to consider how best to handle the distinction between generalizability (sample drawn from the target population) and transportability (sample not drawn from the target population). I imagine the user would pass in two data sets: one to estimate the model on, and another to generalize to.

As far as I know, these tools are largely lacking in all other commonly used epi software. Plus, this is becoming an increasingly hot topic in epi (and I think it will catch on more widely once people recognize you can go from a biased sample to a full population under an additional set of assumptions).

Resources:

Notes:
Some record of Issa's presentation at CIRG. This is the more difficult doubly robust estimator. It applies when only a subset of the cohort has some measure needed for transportability. Rather than throwing out the individuals who don't have X2 measured, you can use the process in the arXiv paper. For the nested-nested population, the robust estimator has three parts. b_a(X_1, S) is harder to estimate, but you can use the following algorithm:

  1. Model Y as a function of X={X1, X2} among S=1 and A=a
  2. Predict \hat{Y} among those with D=1
  3. Model \hat{Y} as a function of X1 and S in D=1
  4. Predict \hat{Y*} in D={0, 1}

Also hard to estimate Pr(S=1|X) because X2 is only observed for a subset. Can use an m-estimator to obtain it. Can do this by a weighted regression with weight 1 for D=1 & S=1 and 1/C(X1, S) for D=1 & S=0. This is a little less obvious to me, but seems doable.
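The four regression steps above can be sketched with plain regressions. This is a hypothetical illustration only: the data-generating process, variable distributions, and use of statsmodels are my own assumptions, not from the arXiv paper.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({'X1': rng.binomial(1, 0.5, n),
                   'S': rng.binomial(1, 0.6, n),
                   'A': rng.binomial(1, 0.5, n),
                   'D': rng.binomial(1, 0.5, n)})   # D=1: subset with X2 measured
df['X2'] = rng.binomial(1, 0.4, n)  # in practice, X2 would be missing when D=0
df['Y'] = 1 + df['A'] + 0.5 * df['X1'] + 0.3 * df['X2'] + rng.normal(size=n)

a = 1  # treatment level of interest

# 1. Model Y as a function of X={X1, X2} among S=1 and A=a
step1 = smf.ols('Y ~ X1 + X2', data=df[(df['S'] == 1) & (df['A'] == a)]).fit()

# 2. Predict Y-hat among those with D=1 (the only place X2 is available)
d1 = df[df['D'] == 1].copy()
d1['Yhat'] = step1.predict(d1)

# 3. Model Y-hat as a function of X1 and S in D=1
step3 = smf.ols('Yhat ~ X1 + S', data=d1).fit()

# 4. Predict Y-hat* for everyone, D={0, 1}; no X2 needed at this stage
df['Yhat_star'] = step3.predict(df)
```

The point of steps 3-4 is that the final prediction model only involves X1 and S, so it can be applied to the individuals without X2 measured.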

@pzivich pzivich added enhancement Long-term Issues/additions that are longer term goals labels Mar 3, 2019
@pzivich pzivich self-assigned this Mar 3, 2019
@pzivich pzivich mentioned this issue Mar 7, 2019
pzivich commented Mar 11, 2019

Marking to look at later: https://tech.wayfair.com/2018/10/pylift-a-fast-python-package-for-uplift-modeling/

Does some similar stuff, but finds the ideal customer to target. Might be a useful approach to consider. Slightly different from the generalizability/transportability problem, but closely related.

pzivich commented Mar 12, 2019

How g-transport should work:

  1. Fit g-formula to the sample with observed exposure/outcome
  2. Use fitted model to predict to full sample
  3. Take mean to generate marginal outcome
  • If generalize, take marginal of all
  • If transport, take marginal of outside sample only
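A minimal hand-rolled sketch of those three steps (simulated data; the data-generating mechanism, variable names, and use of statsmodels here are illustrative assumptions, not zepid's API):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5000
df = pd.DataFrame({'L': rng.binomial(1, 0.35, n)})            # effect modifier
df['A'] = rng.binomial(1, 0.5, n).astype(float)
df['S'] = rng.binomial(1, 0.67 - 0.4 * df['L'], n)            # S=1: in sample
df['Y'] = (0.2 + 0.1 * df['A'] + 0.05 * df['L']
           - 0.08 * df['A'] * df['L'] + rng.normal(0, 0.1, n))
df.loc[df['S'] == 0, ['A', 'Y']] = np.nan  # exposure/outcome only observed in sample

# 1. Fit the g-formula outcome model to the sample with observed exposure/outcome
fit = smf.ols('Y ~ A + L + A:L', data=df[df['S'] == 1]).fit()

# 2. Use the fitted model to predict for the full sample under A=1 and A=0
ya1 = fit.predict(df.assign(A=1.0))
ya0 = fit.predict(df.assign(A=0.0))

# 3. Take the mean to generate the marginal risk difference
rd_generalize = (ya1 - ya0).mean()               # generalize: marginal of all
rd_transport = (ya1 - ya0)[df['S'] == 0].mean()  # transport: outside sample only
print(rd_generalize, rd_transport)
```

Because selection depends on the modifier L, the generalized and transported risk differences differ: they marginalize the same predictions over different distributions of L.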

pzivich commented Mar 13, 2019

As an interesting aside, I think IPSW is going to be broadly more useful. The reason is that you don't need the confounders measured in the target sample (you only need the modifiers). For the g-transport formula or the doubly robust estimator, you need both.

This only applies to observational data, where you would need to account for confounding. However, that's still a major point in favor of IPSW.
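To make the point concrete, here is a rough, hypothetical sketch of the sampling-weight model: it only involves the modifier L, which must be measured in both samples (simulated data; this is my own illustration, not zepid's API).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 4000
df = pd.DataFrame({'L': rng.binomial(1, 0.35, n)})  # L: effect modifier
df['S'] = rng.binomial(1, 0.67 - 0.4 * df['L'], n)  # S=1: in the study sample

# Sampling model Pr(S=1 | L): needs only the modifier measured in both samples;
# confounders would only need to be measured where S=1.
ps = smf.logit('S ~ L', data=df).fit(disp=0).predict(df)
df['ipsw'] = 1 / ps  # generalizability weights (use the odds (1-p)/p to transport)
```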

@pzivich pzivich added Short-term Issues/additions that will be completed in the near future and removed Long-term Issues/additions that are longer term goals labels Mar 13, 2019
pzivich commented Mar 13, 2019

As a reminder for myself, the g-transport formula and IPSW give somewhat different estimates in sampled data. These differences are just a result of the particular sample taken. I performed some simulations to check that both are approximately unbiased. Below is a quick simulation to demonstrate that:

import numpy as np
import pandas as pd
from scipy.stats import logistic
from zepid.causal.generalize import IPSW, GTransportFormula

true_rd = 0.09  # true marginal risk difference under this data-generating mechanism

g_gen = []  # g-transport formula bias
i_gen = []  # IPSW bias

for i in range(200):
    n = 1000
    df = pd.DataFrame()
    df['L'] = np.random.binomial(1, p=0.35, size=n)
    df['A'] = np.random.binomial(1, p=0.5, size=n)
    # potential outcomes under A=1 and A=0, with L as an effect modifier
    df['Y1'] = np.random.binomial(1, p=logistic.cdf(-0.5 + 0.75 * 1 + 0.3 * df['L'] - 1.1 * 1 * df['L']), size=n)
    df['Y0'] = np.random.binomial(1, p=logistic.cdf(-0.5 + 0.75 * 0 + 0.3 * df['L'] - 1.1 * 0 * df['L']), size=n)
    df['Y'] = df['Y1'] * df['A'] + df['Y0'] * (1 - df['A'])
    # selection into the sample depends on the modifier L
    df['S'] = np.random.binomial(1, p=0.67 - 0.4 * df['L'], size=n)

    ipsw = IPSW(df, exposure='A', outcome='Y', selection='S')
    ipsw.regression_models('L', print_results=False)
    ipsw.fit()
    i_gen.append(ipsw.risk_difference - true_rd)

    gtransport = GTransportFormula(df, exposure='A', outcome='Y', selection='S', generalize=True)
    gtransport.outcome_model('A + L + L:A', print_results=False)
    gtransport.fit()
    g_gen.append(gtransport.risk_difference - true_rd)

print('Bias - IPSW:', np.mean(i_gen))
print('Bias - G:', np.mean(g_gen))

As I would expect, IPSW does slightly better (bias approaches zero at a faster rate). That makes sense, since IPSW has less to model than the g-transport formula.

Results (2019/3/13)

Bias - IPSW: -0.002591761051227205
Bias - G: -0.0026239656430772397

pzivich commented Mar 14, 2019

I am having trouble getting AIPSW to be a consistent estimator in the above example when the g-transport part is misspecified (i.e., Y ~ A + L). It does better than the g-transport formula, but AIPSW doesn't get closer to zero when I bump the sample size from 10,000 to 1,000,000.

Maybe an edge case, since L is a strong modifier? Either way, I need to dig deeper into this.

UPDATE: based on some simulations, the g-transport component seems to dominate AIPSW's behavior. I suppose that makes sense based on the formula? I think the above is an edge case. At least it seems like that when I adapt the above code to:

from zepid.causal.generalize import AIPSW  # in addition to the imports above

i_gen = []
g_gen = []
a_gen = []

true = 0.03817  # true risk difference in the external (S=0) population

for i in range(200):
    n = 1000
    df = pd.DataFrame()
    df['C'] = np.random.binomial(1, p=0.4, size=n)
    df['L'] = np.random.binomial(1, p=0.8*df['C'] + 0.2*(1-df['C']), size=n)
    df['A'] = np.random.binomial(1, p=0.5, size=n)
    df['Y1'] = np.random.binomial(1, p=logistic.cdf(-0.5 + 0.75 * 1 + 0.3 * df['L'] - 1.1 * 1 * df['L']), size=n)
    df['Y0'] = np.random.binomial(1, p=logistic.cdf(-0.5 + 0.75 * 0 + 0.3 * df['L'] - 1.1 * 0 * df['L']), size=n)
    df['Y'] = df['Y1'] * df['A'] + df['Y0'] * (1 - df['A'])
    df['S'] = np.random.binomial(1, p=0.67 - 0.4 * df['L'] + df['C']*0.07, size=n)

    ipsw = IPSW(df, exposure='A', outcome='Y', selection='S', generalize=False)
    ipsw.regression_models('L', print_results=False)
    ipsw.fit()
    i_gen.append(ipsw.risk_difference - true)

    # deliberately misspecified outcome model (Y depends on L, not C)
    gtransport = GTransportFormula(df, exposure='A', outcome='Y', selection='S', generalize=False)
    gtransport.outcome_model('A + C + C:A', print_results=False)
    gtransport.fit()
    g_gen.append(gtransport.risk_difference - true)

    aipw = AIPSW(df, exposure='A', outcome='Y', selection='S', generalize=False)
    aipw.weight_model('L', print_results=False)
    aipw.outcome_model('A + C + C:A', print_results=False)
    aipw.fit()
    a_gen.append(aipw.risk_difference - true)

Gives me

Bias - IPSW: -0.017543562941460363
Bias - G: 0.0503045192696528
Bias - AIPSW: 0.017478819696215078

which is more of what I would have expected. Additionally, bumping the sample up to 1 million for a single run has AIPSW looking like a consistent estimator. I think I just may have found an example that "breaks" AIPSW, merely because there is so little variation in the estimation. It is unlikely to ever happen in practice, since there is often a multitude of modifiers in real studies. Furthermore, it only "breaks" in that it didn't approach zero bias that quickly (it still did so faster than the misspecified g-transport formula, so it is still the better estimator).

pzivich commented Mar 31, 2019

The generalizability estimators released in v0.6.0 are unbiased in the simulations I have run.

The issue will remain open. I still need to implement the 3-part doubly robust estimator. It will be on the back-burner for me, since the utility of that estimator is not as straightforward as the others.

@pzivich pzivich added the Causal inference Updates for the causal inference branch label Apr 18, 2019
pzivich commented Jul 11, 2019

The 3-part estimator has a narrower use case and is more complicated to implement. As a result, I will delay implementation. May re-open at some point (or if users want this implemented).

@pzivich pzivich closed this as completed Jul 11, 2019