ENH: added intervening_variable.py and tests. Updated docs accordingly #8733

Open
wants to merge 2 commits into
base: main

codydance

Added intervening variable analysis: the classical Sobel method and bootstrapping.

Notes:

  • It is essential that you add a test when making code changes. Tests are not
    needed for doc changes.
  • When adding a new function, test values should usually be verified in another package (e.g., R/SAS/Stata).
  • When fixing a bug, you must add a test that would produce the bug in main and
    then show that it is fixed with the new code.
  • New code additions must be well formatted. Changes should pass flake8. If on Linux or OSX, you can
    verify your changes are well formatted by running
    git diff upstream/main -u -- "*.py" | flake8 --diff --isolated
    
    assuming flake8 is installed. This command is also available on Windows
    using the Windows Subsystem for Linux once flake8 is installed in the
    local Linux environment. While passing this test is not required, it is good practice and it helps
    improve code quality in statsmodels.
  • Docstring additions must render correctly, including escapes and LaTeX.

:class:`~statsmodels.stats.intervening_variable.InterveningVariables` creates
confidence intervals for the indirect effect using Sobel's classical method
as well as bootstrapping techniques.

Member

this file is for 0.13 which was already released

This PR is a candidate for 0.15

Author

I don't see any release notes after 0.13. Should I add a document?

Member

The release notes are mostly autogenerated. So we don't add them until a release.

You could park this section at the bottom of the top comment of the PR. Then we can copy it over when we have the release note file.

(We should find some better way to add to release notes, but don't have one yet)

@josef-pkt
Member

Hi, thanks for the pull request

I'm not really familiar with details in this area.

How much does this differ from the current stats.mediation analysis?
Is there a possibility of common interface, code sharing, ... with it?

The mediation module has around 400 lines, so it would be possible to add it to that module, or at least make that the public access to this.

@kshedden Do you have time to look at this or comment on the background for this?

@codydance
Author

codydance commented Mar 15, 2023

Hi @josef-pkt , thanks for checking-in.

The intent of these techniques is the same as that of the techniques in stats.Mediation, though the implementation and background assumptions are very different. Confusingly, both are often referred to as 'Mediation Analysis'; however, I followed the convention of MacKinnon et al. (2002) and reserved the name 'Mediation Analysis' for the causal inference methods implemented in stats.Mediation and the name 'Intervening Variable Analysis' for the product-of-coefficients methods I implemented. I prefer this differentiation because they really are different techniques with different applications, and there is minimal code sharing between the two implementations. In addition, stats.Mediation can be significantly expanded to include non-binary treatment and mediation variables, as well as sensitivity parameter analysis. My proposed module could also be significantly expanded to include ~14 other intervening variable techniques. In my view, it makes sense to keep them as separate modules.

Also, do I need to do anything about the statsmodels.statsmodels failure below? I'm not sure what that is indicating.

Cody

@josef-pkt
Member

you have 3 failures like

        cur_dir = os.path.dirname(os.path.abspath(__file__))
>       data = pd.read_csv(os.path.join(cur_dir, 'results', "mackinnon2008.csv",
                                        index_col='id'))
E       TypeError: join() got an unexpected keyword argument 'index_col'

code should be robust to several versions of pandas. I'm not familiar with the error here.
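
For reference, the traceback looks less like a pandas version issue and more like a misplaced closing parenthesis: `index_col` is being passed to `os.path.join` rather than to `pd.read_csv`. A minimal sketch of the likely fix follows. It writes a small stand-in file because the real `results/mackinnon2008.csv` is not reproduced in this thread; the column names other than `id` and the sample values are assumptions made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the test data directory; columns other than "id" are hypothetical.
cur_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(cur_dir, "results"))
csv_path = os.path.join(cur_dir, "results", "mackinnon2008.csv")
with open(csv_path, "w") as fh:
    fh.write("id,x,m,y\n1,0.5,1.2,3.4\n2,0.7,1.9,4.1\n")

# Close os.path.join(...) first, then pass index_col to read_csv.
data = pd.read_csv(
    os.path.join(cur_dir, "results", "mackinnon2008.csv"),
    index_col="id",
)
```

With the parenthesis moved, `index_col` is an ordinary `read_csv` keyword and should behave the same way across pandas versions.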

Also you have style pep-8 violations, mainly trailing whitespaces and lines too long
https://dev.azure.com/statsmodels/statsmodels-testing/_build/results?buildId=5061&view=logs&j=cee8c96f-7e65-5602-f593-266823630fd5&t=0bf77771-d04a-5c5f-6285-ba31ad6bef7d

On the content:
I will have to read up to understand more of the background, and the distinction between the two approaches.
I recently downloaded some mediation articles to get an overview, but did not get around to reading them yet.

@josef-pkt
Member

ok, based on a very quick look

mediation is related to average treatment effect literature (statsmodels.treatment)
while the intervening variable analysis comes more from the system of equation, linear structural equations, multivariate modelling, path analysis literature.

Rubin versus Pearl?

It's better to keep them separate, but I'm not sure yet how we want to frame this.

@codydance
Author

Yes, that's right.

Pandas error and pep-8 are easy fixes. I'll update the code in the next several days.

Thanks...

…way indirect proportion value is calculated to more align with R packages
@josef-pkt
Member

more general thought, independently of merging this PR

I don't know yet where in the statsmodels structure this should go. I always have problems finding names and categories for new folders.

Mediation analysis, traditional or treatment effect, seems to become much more popular in recent years.
eventually we need to get this out of stats and into a more dedicated folder.

I just did a quick google search for "mediation with instrumental variables" and there are a pretty large number of recent articles.
There is also a literature for mediation analysis for Poisson, Logit or similar.

roughly related areas that we have missing

  • SEM, system of equation (not a high priority)
  • IV parametric, currently minimal IV in sandbox for continuous treatment, nothing yet for binary treatment
  • IV parametric, equivalent for Poisson and other nonlinear (control function) (nothing yet)
  • treatment effect, non-parametric with ignorability basic linear outcome, binary treatment in treatment
  • IV-treatment effect non-parametric (nothing, Stata has it)
  • mediation what's here and in stats.mediation
  • IV-mediation ???

I don't know how to group and where to put those. #7691 for cases with IV or both treatment and outcome model.

(Aside: I'm against using the word "causal" only in the Rubin tradition. OLS and 2SLS are also "causal" with appropriate assumptions. "causal" != "non-parametrically identified average effect under ignorable unobserved heterogeneity" :)

@codydance
Author

codydance commented Mar 16, 2023

I agree about 'causal'; Rubin followers owning the word feels a bit elitist. There does seem to be a lot of literature coming out on the subject and it is quite technical.

I'll leave naming the folders to you :), though 'causal inference' seems plausible.

Anything else I need to do to merge this PR? statsmodels.statsmodels error doesn't seem to be from me.

@josef-pkt
Member

I have not looked much at your code yet. Overall looks good, but I have not checked the details yet.
(I have a big problem reading "black"ened parentheses and indentation. Skimming code takes much longer.)

@codydance
Author

Yeah, I don't like the black parentheses either, but the other code linter I have doesn't remove trailing whitespace and I was feeling lazy.

Please reach out if you have any questions!

@josef-pkt
Member

I always remove trailing whitespace with my code editor, when I see those in my git gui. (I usually just set a key shortcut to F12 )

@josef-pkt
Member

your non-black first commit looks much easier to read.

@codydance
Author

Sorry. I can go back and redo it without the black, if you like.

@josef-pkt
Member

josef-pkt commented Mar 17, 2023

I'm trying to think whether we want statsmodels.causal as an umbrella folder.
I wanted to avoid the word "causal" but it would be a useful umbrella for anything with ignorable or not ignorable endogeneity, both traditional (parametric) and ATE/Rubin style, from Heckman to treatment, mostly 2 equation or 2 stage models.
including IVPoisson or PoissonGMM.
(not sure whether this should also include endogenous missing, i.e. missing not completely at random. I guess yes.)
Throwing everything into causal would also work against having the name completely taken over by the Rubin tradition.
(I would have liked statsmodels.endogenity but we already got too many complaints about endog and exog)

(pure SEM, system of equations would still be outside of this, for now system of equation is only in linearmodels package)

I still don't like the word "causal" because it depends on identifying assumptions that we don't (or the model doesn't) know whether they hold, i.e. causality is an interpretation of the model or estimated effects. It does not really describe an algorithm or estimator. (The proper name for methods will be more explicit about the underlying assumptions on the data generating process and how it is handled, e.g. IVPoisson (parametric with endogenous regressors) or "non-parametrically identified ate under endogeneity with conditional independence or ignorability")
The only reason to use the name causal is name recognition.

@josef-pkt
Member

I'm still browsing literature (on various "causal" issues)

I think fit here should get a cov_type option, so we don't need to compute bootstrap standard errors unless requested.
Most likely there will be other options for how to compute standard errors, e.g. is it possible to compute heteroskedasticity or correlation robust standard errors?
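
As a rough illustration of what a heteroskedasticity-robust option could compute for one of the regressions involved, here is a NumPy sketch of the HC0 sandwich covariance next to the classical one. The simulated data, the coefficient values, and the variable names are assumptions for illustration only. The sketch covers a single equation's coefficients; whether robust standard errors can simply be plugged into the product-of-coefficients formulas is the open question above and is not settled here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
# Error spread grows with |x|, so the errors are heteroskedastic by construction.
m = 0.5 * x + rng.normal(size=n) * (0.5 + np.abs(x))

X = np.column_stack([np.ones(n), x])
xtx_inv = np.linalg.inv(X.T @ X)
coef = xtx_inv @ X.T @ m
resid = m - X @ coef

# Classical covariance: sigma^2 * (X'X)^-1
cov_classical = (resid @ resid / (n - X.shape[1])) * xtx_inv

# HC0 sandwich: (X'X)^-1 X' diag(e^2) X (X'X)^-1
meat = X.T @ (X * resid[:, None] ** 2)
cov_hc0 = xtx_inv @ meat @ xtx_inv

se_classical = np.sqrt(np.diag(cov_classical))
se_hc0 = np.sqrt(np.diag(cov_hc0))
```

In statsmodels itself, `OLS.fit(cov_type="HC0")` already provides this kind of covariance for a single regression, so a `cov_type` argument here could plausibly forward to the underlying fits.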

@codydance
Author

codydance commented Mar 17, 2023

Regarding names: I would also be hesitant to have this module and other pathway analysis stuff in a folder labelled 'causal' for the same reasons you mentioned. They are not causal analyses without further assumptions.

Regarding standard errors: yes, there are many, many ways to compute the standard error in this context. See the MacKinnon et al. (2002) paper referenced above for a good overview. However, in my view, the main point of this module is the confidence intervals, not the standard error. The bootstrap is central in this regard because it makes no assumptions about the distribution of the product of coefficients, while many of the other methods (including Sobel) assume that the product is normally distributed (which isn't true in general).

I think that if this model were totally built out to include 14+ ways for computing the confidence interval for the indirect effect, then I would agree that fit() should take an argument specifying the method. For the time being, however, I think it's useful to be able to see all the information together. For what it's worth, the 3 R packages I have seen display all the information in a single table.
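
To make the Sobel versus bootstrap contrast above concrete, here is a minimal NumPy sketch. It is not the code from this PR: the helper names `ols` and `indirect_effect`, the simulated data, and the true coefficients (a = 0.5, b = 0.4) are all assumptions for illustration. It fits the two regressions m ~ x and y ~ x + m, forms the indirect effect a*b, and then builds a normal-theory Sobel interval and a percentile bootstrap interval.

```python
import numpy as np


def ols(X, y):
    """Return OLS coefficients and their classical standard errors."""
    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = X.shape[0] - X.shape[1]
    cov = (resid @ resid / dof) * np.linalg.inv(X.T @ X)
    return coef, np.sqrt(np.diag(cov))


def indirect_effect(x, m, y):
    """Estimate a (x -> m) and b (m -> y given x) with standard errors."""
    ones = np.ones_like(x)
    coef_m, se_m = ols(np.column_stack([ones, x]), m)
    coef_y, se_y = ols(np.column_stack([ones, x, m]), y)
    return coef_m[1], coef_y[2], se_m[1], se_y[2]


# Simulated data (hypothetical): true indirect effect is 0.5 * 0.4 = 0.2.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)
y = 0.3 * x + 0.4 * m + rng.normal(size=n)

a, b, se_a, se_b = indirect_effect(x, m, y)
ab = a * b

# Sobel: first-order delta-method standard error with a normal-theory interval.
se_sobel = np.sqrt(a**2 * se_b**2 + b**2 * se_a**2)
ci_sobel = (ab - 1.96 * se_sobel, ab + 1.96 * se_sobel)

# Percentile bootstrap: resample rows with replacement and re-estimate a*b.
n_boot = 2000
boot = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, n, size=n)
    a_i, b_i, _, _ = indirect_effect(x[idx], m[idx], y[idx])
    boot[i] = a_i * b_i
ci_boot = tuple(np.percentile(boot, [2.5, 97.5]))
```

The Sobel interval is symmetric around a*b by construction, while the percentile interval follows whatever shape the resampled products take, which is why the two can disagree when the product is skewed.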

@josef-pkt
Member

I'm getting more in favor of statsmodels.causal as umbrella
I was just reading parts of Guido Imbens' Nobel Prize lecture and Jim Heckman on Haavelmo

We need "causal analysis", however that's defined. :)

@codydance
Author

codydance commented Mar 17, 2023 via email

@codydance
Author

codydance commented Apr 5, 2023 via email

@codydance
Author

codydance commented May 12, 2023 via email
