ENH: GLM, discrete with endogenous regressors, control functions #7689

josef-pkt · 2021-09-04T18:45:36Z

(back to econometrics)

#1742 ivpoisson, GMMPoisson

control function approach with linear function for endogenous regressor doesn't look so scary.
https://stats.stackexchange.com/questions/281848/control-function-for-iv-poisson-regression
https://www.stata.com/features/overview/poisson-regression-with-endogenous-variables/

Cameron Trivedi count book 2nd edition, p. 402ff has docvis example that compares Poisson-CF, NB2-CF and additive and multiplicative version of GMMPoisson.

main caveat: They recommend bootstrap standard errors, because it is a two step estimation with generated regressor in second stage.
(maybe there is a Murphy/Topel or GMM asymptotic distribution for the two step procedure, see next comment)

related also fully parametric bivariate/multivariate models.
Cameron Trivedi also include a case with an endogenous categorical regressor modelled through simulated MLE for mixed multinomial.

IIRC, I have seen some time ago somewhere the use of splines for the control functions, to have less restrictive parametric assumptions.

partially related
#7636 heckman procedure with endogenous binary treatment but linear model at final stage.

josef-pkt · 2021-09-04T18:55:11Z

AFAICS, without reading much

y1 ~ Poisson(f(x b + y2 a)) where f is inverse link, exp
y2 = z b2 + e (linear model)

second stage: use e_hat = y - z b2_hat
y1 ~ Poisson(f(x b + y2 a + e_hat))

joined moment conditions as function of (b, a, b2):
[ z (y2 - z b2), poisson_score(f(x b + y2 a + (y2 - z b2)))

I guess we can get the asy cov_params from GMM on the joined moment conditions. (exactly identified ?)
Derivatives of moment conditions will also mostly be by reusing existing code.

using generic two-stage:
We need derivative poisson_score(x b + y2 a + (y2 - z b2) w.r.t. b2, i.e. parameter of first stage.

josef-pkt · 2021-09-04T19:16:21Z

It looks like my GMMPoisson has good test coverage against Stata
https://github.com/statsmodels/statsmodels/blob/main/statsmodels/sandbox/regression/tests/test_gmm_poisson.py
https://github.com/statsmodels/statsmodels/blob/main/statsmodels/sandbox/regression/tests/results_gmm_poisson.py

The stata results where created using the gmm command, and not ivpoisson

josef-pkt · 2023-03-27T19:28:53Z

We might want to add some helper functions to compute cov_params for exactly identified recursive, triangular system like 2-stage estimators.

AFAIR, we don't have computational shortcut for GMM when it is exactly identified

#7436 (comment)

disadvantage compared to using GMM directly is that a helper function does not inherit robust cov_types (besides generic sandwich, HC )

slides showing how to use GMM for standard errors for control function 2-stage
https://www.stata.com/meeting/uk20/slides/UK20_Pinzon.pdf

josef-pkt · 2023-03-28T16:17:27Z

eteffects in Stata uses simple control function, include Probit residuals in second stage.
standard errors by stacking moment conditions similar to teffects.
implements only regression adjustment
seems to report only POM and ATE/ATET/... . I don't see whether it's possible to get the full estimation results from eteffects
#8121

It should be possible to base unit tests for control function approach on this.

josef-pkt · 2023-03-28T18:55:01Z

https://stats.stackexchange.com/questions/521808/generalised-residuals-estimated-value-of-the-density-at-that-point-for-observa

josef-pkt · 2023-03-29T15:19:18Z

What do we use for predict?
#8762

The estimating linear predictor is x b + cf where cf are control functions.
We have and in-sample estimate of cf_i, but none for out of sample prediction.
What does Stata use for POM in etreatment?

(open question, I have not looked at this.)

josef-pkt · 2023-03-31T18:24:37Z

implementation detail for control functions, API design

which resid, which functions to use as control functions?

maybe "control_kwds"
control_kwds = {"resid": "inv_mills_ratio", "polynomial": degree, "interaction": covars}

resid might depend on context, e.g inverse mills ratio for probit 1st stage or generalized residual or resid_outcome or resid_pearson.

everything gets more complicated if we have more than one endogenous variable with different 1st stage models, e.g. binary (Logit, Probit) or OLS/multivariateLS (as in IV2SLS), or multinomial, or ...
automatic choice could be based on type information about endogenous explanatory variable in second stage, if we could get that information from the design info.

partitioned formulas
"y1 ~ y2 + y3 + x | y2 ~ Probit(x3, x4) | y3 ~ OLS(x3, x4, x5)"
"y1 ~ y2 + y3 + x | y2 + y3 ~ OLS(x3, x4, x5)"

in general using a formula would be good, e.g. for interactions terms in control functions "~ r + I(r**2) + r:x" (without main effect of x, I guess "nested" in patsy)

another option
make user build 1st stage model(s) (if not linear, continuous) and use it as __init__ keyword, similar to current treatment effect model where both outcome and participation models are user provided or None.

like extended regression in Stata
etreatment is simpler because only Probit for 1st stage is supported and only one endogenous variable in 1st stage.

update
let the user specify a one-sided formula for functions of resid to include as control functions.
This would be useful if the endogenous regressor is interacted with other exog or transformed, e.g. polynomials.
I need to check Wooldridge again for the details of these extra control functions.

josef-pkt · 2023-04-17T17:17:07Z

I'm trying out an example with Poisson outcome model and a linear endogenous variable as regressors
The MonteCarlo is pretty slow for 1000 replications, over 7 minutes with true residuals as cf. (One speedup in MC is to use start_params.)
Bootstrap standard errors will be slow, and we need asymptotic cov_params.

However it works, both with "true" first stage error, and with residuals from linear projection on instruments.
from statsmodels.stats.multivariate_tools import partial_project

Poisson with cf estimates the slope parameters without bias (my nobs=5000 for the MC).
However, the constant is larger than the DGP constant. This possibly compensates for the random noise term in the linear predictor (Jensen's inequality).
I need poisson-normal or poisson-lognormal mixture distribution to verify.

josef-pkt · 2023-04-17T18:43:41Z

another implementation issue

which results class?
If we return the results class of the outcome model (GLM, Poisson, ...), then we might have results methods and post-estimation results that are not appropriate even if cov_params and similar are correct.

Wald inference will be supported, but other postestimation score_test, influence (e.g. in Poisson) might not be correct.
Some might be correct or can be partially overridden, but we would have to verify all of those (or disable them).

predict and get_prediction methods in results should be generic, and we only need to change the model predict.
Maybe not, we have the extra exog from the control functions, which would need to be handled, e.g. in the formula transform.

If we return a plain LikelihoodModelResults, then we loose out on methods that could work, or need to add/copy them individually.

josef-pkt · 2023-04-19T18:32:42Z

missing first stage model MultivariateLS

for the 2-stage parameter estimates with linear 1st stage I can use from statsmodels.stats.multivariate_tools import partial_project
However, to compute cov_params of the second stage, I also need cov_params (or score_obs and Hessian for sandwich) of first stage flattened parameters vect(params_first).

Aside: I guess I don't need scale in least squares score_obs and hessian, if I use a standard sandwich, but will need to add scale for nonrobust cov_params.

josef-pkt · 2023-04-19T20:19:02Z

success for residual inclusion against R OneSampleMR, in their docstring baby example
using GLM-gaussian, GLM-binomial and GMM for cov_params
using default GMM, without explicit hessian/momcond_deriv GMM.gradient_momcond uses approx_fprime

> s$smry$coefficients[1:5, "Std. Error"]
Z(Intercept)           ZZ  (Intercept)            X          res 
  0.04123383   0.06300569   0.41121194   0.82008325   0.87404599 

res_gmm.bse
0.04123383   0.06300569   0.41121194   0.82008325   0.87404599

agreement looks very high, precision in optimization and numdiff is not necessarily this high.

josef-pkt · 2023-04-20T13:38:17Z

flexibility versus efficient or more precise computation for derivatives

For response residual inclusion we could get the analytical cross-derivative instead of using numdiff.
e.g. model _deriv_mean_dparams ( and variable addition derivative (score_factor, score_test) ???)
(or explicit computation for linear first stage, OLS, MultivariateLS, which should not be to difficult to add)

This will be more difficult if we use other residuals (e.g. for Probit first stage) or use functions (formula) of the residuals.

We would have to switch methods for cases where analytical derivative is not available.

josef-pkt · 2023-04-20T13:59:31Z

api: which __init__ arguments for explanatory variables, endogenous explanatory variables, exogenous variables and instruments?

With control functions we don't need to replace endogenous explanatory variables by their 1st stage estimate, we only need to append the control functions.

We can allow for terms of the endogenous variables other than simple main effect, e.g. polynomials and interaction.
For this it would be simpler and more flexible, if user can specify a formula or provide design matrix with arbitrary functions of the endogenous variable.

If explanvar includes endogenous explanatory variables, then we need to know which are the exogenous variables or require that those are included in the instruments (as I have currently in GMM, IV2SLS)

If explanvar includes functions of endogenous variables, then control functions should include something similar.
This would work if we have a partial formula for just the terms that include the endogenous explanatory variables (substituting resid for endogenous explanatory variable, eev).
Aside: we would have to check that control functions created by formulas do not have a constant or other variables that make the control functions collinear with the explanatory variables.

(formula handling in the difficult cases will be a pain. We usually don't get enough information in statsmodels.)

josef-pkt added type-enh comp-genmod comp-discrete labels Sep 4, 2021

josef-pkt added this to the 0.14 milestone Sep 4, 2021

josef-pkt mentioned this issue Sep 5, 2021

ENH/Design module structure for IV, GMM, endogeneity #7691

Open

This was referenced Nov 24, 2021

ENH/Design: post-estimation for instrumental variables IV and two-stage estimation #7903

Open

ENH: endogeneity test with control function, instruments #7905

Open

josef-pkt modified the milestones: 0.14, 0.15 Mar 27, 2023

josef-pkt mentioned this issue Mar 28, 2023

SUMM: roadmap for 0.15 josef #8217

Open

6 tasks

This was referenced Apr 20, 2023

ENH: add GMMIVCF control function for endogenous explanatory variable #8812

Open

ENH: tools: computing in blocks for matrix algebra, cov for multivariateLS and multi-equation or multi-stage models #8827

Open

ENH: summary for multiple models #8832

Open

josef-pkt mentioned this issue Oct 27, 2023

ENH: control function IV with multinomial treatment #9043

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: GLM, discrete with endogenous regressors, control functions #7689

ENH: GLM, discrete with endogenous regressors, control functions #7689

josef-pkt commented Sep 4, 2021 •

edited

josef-pkt commented Sep 4, 2021 •

edited

josef-pkt commented Sep 4, 2021 •

edited

josef-pkt commented Mar 27, 2023 •

edited

josef-pkt commented Mar 28, 2023

josef-pkt commented Mar 28, 2023

josef-pkt commented Mar 29, 2023

josef-pkt commented Mar 31, 2023 •

edited

josef-pkt commented Apr 17, 2023

josef-pkt commented Apr 17, 2023

josef-pkt commented Apr 19, 2023

josef-pkt commented Apr 19, 2023 •

edited

josef-pkt commented Apr 20, 2023

josef-pkt commented Apr 20, 2023

ENH: GLM, discrete with endogenous regressors, control functions #7689

ENH: GLM, discrete with endogenous regressors, control functions #7689

Comments

josef-pkt commented Sep 4, 2021 • edited

josef-pkt commented Sep 4, 2021 • edited

josef-pkt commented Sep 4, 2021 • edited

josef-pkt commented Mar 27, 2023 • edited

josef-pkt commented Mar 28, 2023

josef-pkt commented Mar 28, 2023

josef-pkt commented Mar 29, 2023

josef-pkt commented Mar 31, 2023 • edited

josef-pkt commented Apr 17, 2023

josef-pkt commented Apr 17, 2023

josef-pkt commented Apr 19, 2023

josef-pkt commented Apr 19, 2023 • edited

josef-pkt commented Apr 20, 2023

josef-pkt commented Apr 20, 2023

josef-pkt commented Sep 4, 2021 •

edited

josef-pkt commented Sep 4, 2021 •

edited

josef-pkt commented Sep 4, 2021 •

edited

josef-pkt commented Mar 27, 2023 •

edited

josef-pkt commented Mar 31, 2023 •

edited

josef-pkt commented Apr 19, 2023 •

edited