Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add multilink models and distribution.dfamilies #7793

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

josef-pkt
Copy link
Member

@josef-pkt josef-pkt commented Oct 12, 2021

closes #7778

This will be a continuation of #7778

now starting with distribution dfamilies
similar to genmod families but focused on MLE with scipy optimizers

I'm planning to add cleaner parts of #7778 based on MultiLinkModel to this PR

@pep8speaks
Copy link

pep8speaks commented Oct 12, 2021

Hello @josef-pkt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-10-14 17:42:06 UTC

@lgtm-com
Copy link

lgtm-com bot commented Oct 12, 2021

This pull request introduces 1 alert when merging 20f8bc0 into a5ec3cb - view on LGTM.com

new alerts:

  • 1 for Unreachable code

@lgtm-com
Copy link

lgtm-com bot commented Oct 13, 2021

This pull request introduces 1 alert when merging 438f79c into a5ec3cb - view on LGTM.com

new alerts:

  • 1 for Unreachable code

@josef-pkt
Copy link
Member Author

josef-pkt commented Oct 13, 2021

older scipy don't have stats.betabinom, e.g. scipy==1.3.3
based on github PR in scipy, betabinom was added in scipy 1.4.0
Fails on import of statsmodels/distributions/dfamilies/_discrete.py with AttributeError, distribution is a class attribute.

Not worth fixing with compat, we can increase min scipy version soon.

solution for now: add attribute to scipy stats.betabinom = None

otherwise on pep-8 style check failure

@josef-pkt
Copy link
Member Author

all green here
pre fails in statespace test_dynamic_factor_mq

@lgtm-com
Copy link

lgtm-com bot commented Oct 13, 2021

This pull request introduces 1 alert when merging a523aad into a5ec3cb - view on LGTM.com

new alerts:

  • 1 for Unreachable code

@josef-pkt
Copy link
Member Author

funny idea: constant like, ilink/transformation function that returns a constant.

That would be one way of encoding args with fixed value. But there are no params in that case, i.e. linpred would be empty.
I'm not sure that will get less messy than transform_params handling as in current TModel.

@josef-pkt
Copy link
Member Author

We need a PredictMixin with fittedvalues, resid and related, that we can use for models where the first parameter is the mean.
Symmetric distributions on R, GLM type distribution with mean-dispersion parameterization, and other models that have a mean parameterization like Beta and BetaBinomial.

For other models we have to decide whether to provide a predicted mean. Not too difficult for scipy distributions.
Very difficult for flexible distributions on R+ that don't have simple expressions for mean and are focused on quantiles.
Asymmetric distributions like johnsonsu might also be difficult (I don't remember details for that.). That will often apply to flexible distributions generated by transformations.

@lgtm-com
Copy link

lgtm-com bot commented Oct 13, 2021

This pull request introduces 1 alert when merging 2c30dea into 123cca1 - view on LGTM.com

new alerts:

  • 1 for Unreachable code

@josef-pkt
Copy link
Member Author

josef-pkt commented Oct 13, 2021

I copied most parts over from #7778, not included is the Het base model. This is mostly obsolete with the more flexible MultiLinkModel. We might create a subclass for MultiLinkModel for loc/mean - scale HET models, mainly for name recognition and start_params.
(depends on whether MultiLinkModel will be refactored to have only one explicit link/exog. That's still open)

This still includes the full THet model because it's the only one with a fixed arg

adjustments are for changed paths and subclassing MultiLinkModel instead of Het Model

@josef-pkt
Copy link
Member Author

a unit test on one machine fails statsmodels\regression\tests\test_quantile_regression.py:319
No seed for random numbers in test_nontrivial_singular_matrix

@josef-pkt
Copy link
Member Author

josef-pkt commented Oct 13, 2021

Stata has hetregress, MLE and 2 step FGLS, only since version 15. I only have version 14
My results agree with their first doc example at print precision.
But I cannot check the other things that are not in the summary.
hetregress has vce(robust) option

@josef-pkt
Copy link
Member Author

In R: it looks like https://search.r-project.org/CRAN/refmans/crch/html/crch.html looks useful
censored regression (Tobit) with heteroscedasticity
uses two part formula y ~ x1 | x2

estimate by MLE or another method that I never heard about.

@josef-pkt
Copy link
Member Author

large sample results compared to GLSHet are in #647 (comment)
agreement is good.

@josef-pkt
Copy link
Member Author

parking example dataset for two-way anova (while shutting down windows)

10 39 0-26 5 6 0-83 8 16 0-50 3 12 0-25 23 62 0-37 53 74 0-72 10 30 0-33 22 41 0-54 23 81 0-28 55 72 0-76 8 28 0-29 15 30 0-50 26 51 0-51 32 51 0-63 23 45 0.51 32 51 0-63 17 39 0-44 46 79 0*58 0 4 0-00 3 7 0-43 10 13 0-77

from Martin J. Crowder, 1978, Beta-Binomial Anova for Proportions

# TODO: here or in __init__
self.k_vars = self.exog.shape[1]
self.k_params = (self.exog.shape[1] + self.exog_scale.shape[1] +
self.k_extra)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this assumes we don't have exog for the extra

@lgtm-com
Copy link

lgtm-com bot commented Oct 21, 2021

This pull request introduces 2 alerts when merging 76e9b75 into 63a32e9 - view on LGTM.com

new alerts:

  • 2 for Unreachable code

@josef-pkt
Copy link
Member Author

josef-pkt commented Oct 25, 2021

If we want to generically support a Binomial count MLE model, then we would need that the MultiLinkModel supports also the case when we have only a single darg, link and exog.

Aside Zipf/Zipfian distribution is also a one parameter count distribution, with possibly finite support 0, ... n

@josef-pkt
Copy link
Member Author

josef-pkt commented Dec 2, 2021

decision: (i)links go into the dfamilies
two reasons:

  • reduce signature of model.__init__, no *args or list of links needed as separate argument (exogs and offsets make the signature already messy enough), _init_keys only needs dfamily
  • easier to define default links that are specific to family or groups of families. (e.g. NBP needs exp (log-link) for dispersion parameter, GPP can use identity link)

I might still add a model.links attribute as shortcut and consistency with models that don't have (d)families.

I'm still not sure how to handle a variable number of exogs and offsets. If I put them in a list or tuple, then super in base datahandling will not handle missing values across data arrays.
If I put them all in **kwargs, then the subclass needs to define a list or tuple of exog.
I used this already for score_test in multi parameter (dargs) models.

As reference implementation:
Stata's parametric survival models (like weibull) allow exog_xxx for the additional parameters, i.e. they are multi-exog models.
Has predefined links, no link option.

@josef-pkt
Copy link
Member Author

I think the way forward here is to keep this PR, and make new PRs with refactored versions.
And then decide which version to merge.

@josef-pkt
Copy link
Member Author

postponed for now, likely for 0.15

(I had applied for a small grant, but my proposal wasn't good enough to get it)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants