Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: GLM, discrete with endogenous regressors, control functions #7689

Open
josef-pkt opened this issue Sep 4, 2021 · 13 comments
Open

ENH: GLM, discrete with endogenous regressors, control functions #7689

josef-pkt opened this issue Sep 4, 2021 · 13 comments

Comments

@josef-pkt
Copy link
Member

josef-pkt commented Sep 4, 2021

(back to econometrics)

#1742 ivpoisson, GMMPoisson

control function approach with linear function for endogenous regressor doesn't look so scary.
https://stats.stackexchange.com/questions/281848/control-function-for-iv-poisson-regression
https://www.stata.com/features/overview/poisson-regression-with-endogenous-variables/

Cameron Trivedi count book 2nd edition, p. 402ff has docvis example that compares Poisson-CF, NB2-CF and additive and multiplicative version of GMMPoisson.

main caveat: They recommend bootstrap standard errors, because it is a two step estimation with generated regressor in second stage.
(maybe there is a Murphy/Topel or GMM asymptotic distribution for the two step procedure, see next comment)

related also fully parametric bivariate/multivariate models.
Cameron Trivedi also include a case with an endogenous categorical regressor modelled through simulated MLE for mixed multinomial.

IIRC, I have seen some time ago somewhere the use of splines for the control functions, to have less restrictive parametric assumptions.

partially related
#7636 heckman procedure with endogenous binary treatment but linear model at final stage.

@josef-pkt
Copy link
Member Author

josef-pkt commented Sep 4, 2021

AFAICS, without reading much

y1 ~ Poisson(f(x b + y2 a)) where f is inverse link, exp
y2 = z b2 + e (linear model)

second stage: use e_hat = y - z b2_hat
y1 ~ Poisson(f(x b + y2 a + e_hat))

joined moment conditions as function of (b, a, b2):
[ z (y2 - z b2), poisson_score(f(x b + y2 a + (y2 - z b2)))

I guess we can get the asy cov_params from GMM on the joined moment conditions. (exactly identified ?)
Derivatives of moment conditions will also mostly be by reusing existing code.

using generic two-stage:
We need derivative poisson_score(x b + y2 a + (y2 - z b2) w.r.t. b2, i.e. parameter of first stage.

@josef-pkt
Copy link
Member Author

josef-pkt commented Sep 4, 2021

It looks like my GMMPoisson has good test coverage against Stata
https://github.com/statsmodels/statsmodels/blob/main/statsmodels/sandbox/regression/tests/test_gmm_poisson.py
https://github.com/statsmodels/statsmodels/blob/main/statsmodels/sandbox/regression/tests/results_gmm_poisson.py

The stata results where created using the gmm command, and not ivpoisson

@josef-pkt
Copy link
Member Author

josef-pkt commented Mar 27, 2023

We might want to add some helper functions to compute cov_params for exactly identified recursive, triangular system like 2-stage estimators.

AFAIR, we don't have computational shortcut for GMM when it is exactly identified

#7436 (comment)

disadvantage compared to using GMM directly is that a helper function does not inherit robust cov_types (besides generic sandwich, HC )

slides showing how to use GMM for standard errors for control function 2-stage
https://www.stata.com/meeting/uk20/slides/UK20_Pinzon.pdf

@josef-pkt josef-pkt modified the milestones: 0.14, 0.15 Mar 27, 2023
@josef-pkt
Copy link
Member Author

eteffects in Stata uses simple control function, include Probit residuals in second stage.
standard errors by stacking moment conditions similar to teffects.
implements only regression adjustment
seems to report only POM and ATE/ATET/... . I don't see whether it's possible to get the full estimation results from eteffects
#8121

It should be possible to base unit tests for control function approach on this.

@josef-pkt
Copy link
Member Author

What do we use for predict?
#8762

The estimating linear predictor is x b + cf where cf are control functions.
We have and in-sample estimate of cf_i, but none for out of sample prediction.
What does Stata use for POM in etreatment?

(open question, I have not looked at this.)

@josef-pkt
Copy link
Member Author

josef-pkt commented Mar 31, 2023

implementation detail for control functions, API design

which resid, which functions to use as control functions?

maybe "control_kwds"
control_kwds = {"resid": "inv_mills_ratio", "polynomial": degree, "interaction": covars}

resid might depend on context, e.g inverse mills ratio for probit 1st stage or generalized residual or resid_outcome or resid_pearson.

everything gets more complicated if we have more than one endogenous variable with different 1st stage models, e.g. binary (Logit, Probit) or OLS/multivariateLS (as in IV2SLS), or multinomial, or ...
automatic choice could be based on type information about endogenous explanatory variable in second stage, if we could get that information from the design info.

partitioned formulas
"y1 ~ y2 + y3 + x | y2 ~ Probit(x3, x4) | y3 ~ OLS(x3, x4, x5)"
"y1 ~ y2 + y3 + x | y2 + y3 ~ OLS(x3, x4, x5)"

in general using a formula would be good, e.g. for interactions terms in control functions "~ r + I(r**2) + r:x" (without main effect of x, I guess "nested" in patsy)

another option
make user build 1st stage model(s) (if not linear, continuous) and use it as __init__ keyword, similar to current treatment effect model where both outcome and participation models are user provided or None.

like extended regression in Stata
etreatment is simpler because only Probit for 1st stage is supported and only one endogenous variable in 1st stage.

update
let the user specify a one-sided formula for functions of resid to include as control functions.
This would be useful if the endogenous regressor is interacted with other exog or transformed, e.g. polynomials.
I need to check Wooldridge again for the details of these extra control functions.

@josef-pkt
Copy link
Member Author

I'm trying out an example with Poisson outcome model and a linear endogenous variable as regressors
The MonteCarlo is pretty slow for 1000 replications, over 7 minutes with true residuals as cf. (One speedup in MC is to use start_params.)
Bootstrap standard errors will be slow, and we need asymptotic cov_params.

However it works, both with "true" first stage error, and with residuals from linear projection on instruments.
from statsmodels.stats.multivariate_tools import partial_project

Poisson with cf estimates the slope parameters without bias (my nobs=5000 for the MC).
However, the constant is larger than the DGP constant. This possibly compensates for the random noise term in the linear predictor (Jensen's inequality).
I need poisson-normal or poisson-lognormal mixture distribution to verify.

@josef-pkt
Copy link
Member Author

another implementation issue

which results class?
If we return the results class of the outcome model (GLM, Poisson, ...), then we might have results methods and post-estimation results that are not appropriate even if cov_params and similar are correct.

Wald inference will be supported, but other postestimation score_test, influence (e.g. in Poisson) might not be correct.
Some might be correct or can be partially overridden, but we would have to verify all of those (or disable them).

predict and get_prediction methods in results should be generic, and we only need to change the model predict.
Maybe not, we have the extra exog from the control functions, which would need to be handled, e.g. in the formula transform.

If we return a plain LikelihoodModelResults, then we loose out on methods that could work, or need to add/copy them individually.

@josef-pkt
Copy link
Member Author

missing first stage model MultivariateLS

for the 2-stage parameter estimates with linear 1st stage I can use from statsmodels.stats.multivariate_tools import partial_project
However, to compute cov_params of the second stage, I also need cov_params (or score_obs and Hessian for sandwich) of first stage flattened parameters vect(params_first).

Aside: I guess I don't need scale in least squares score_obs and hessian, if I use a standard sandwich, but will need to add scale for nonrobust cov_params.

@josef-pkt
Copy link
Member Author

josef-pkt commented Apr 19, 2023

success for residual inclusion against R OneSampleMR, in their docstring baby example
using GLM-gaussian, GLM-binomial and GMM for cov_params
using default GMM, without explicit hessian/momcond_deriv GMM.gradient_momcond uses approx_fprime

> s$smry$coefficients[1:5, "Std. Error"]
Z(Intercept)           ZZ  (Intercept)            X          res 
  0.04123383   0.06300569   0.41121194   0.82008325   0.87404599 

res_gmm.bse
0.04123383   0.06300569   0.41121194   0.82008325   0.87404599

agreement looks very high, precision in optimization and numdiff is not necessarily this high.

@josef-pkt
Copy link
Member Author

flexibility versus efficient or more precise computation for derivatives

For response residual inclusion we could get the analytical cross-derivative instead of using numdiff.
e.g. model _deriv_mean_dparams ( and variable addition derivative (score_factor, score_test) ???)
(or explicit computation for linear first stage, OLS, MultivariateLS, which should not be to difficult to add)

This will be more difficult if we use other residuals (e.g. for Probit first stage) or use functions (formula) of the residuals.

We would have to switch methods for cases where analytical derivative is not available.

@josef-pkt
Copy link
Member Author

api: which __init__ arguments for explanatory variables, endogenous explanatory variables, exogenous variables and instruments?

With control functions we don't need to replace endogenous explanatory variables by their 1st stage estimate, we only need to append the control functions.

We can allow for terms of the endogenous variables other than simple main effect, e.g. polynomials and interaction.
For this it would be simpler and more flexible, if user can specify a formula or provide design matrix with arbitrary functions of the endogenous variable.

If explanvar includes endogenous explanatory variables, then we need to know which are the exogenous variables or require that those are included in the instruments (as I have currently in GMM, IV2SLS)

If explanvar includes functions of endogenous variables, then control functions should include something similar.
This would work if we have a partial formula for just the terms that include the endogenous explanatory variables (substituting resid for endogenous explanatory variable, eev).
Aside: we would have to check that control functions created by formulas do not have a constant or other variables that make the control functions collinear with the explanatory variables.

(formula handling in the difficult cases will be a pain. We usually don't get enough information in statsmodels.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant