-
-
Notifications
You must be signed in to change notification settings - Fork 5.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: stats: fit: function for fitting discrete and continuous distributions #15436
Conversation
… plot horizontal axis
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some self-review.
scipy/stats/_continuous_distns.py
Outdated
@@ -5658,6 +5661,9 @@ class nakagami_gen(rv_continuous): | |||
%(example)s | |||
|
|||
""" | |||
def _shape_info(self): | |||
return [ShapeInfo("nu", False, (0, np.inf), (False, True))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the sort of thing we need to add for all remaining continuous distributions.
- name of parameter
- whether domain is discretized
- lower and upper bounds of the domain
- whether the bounds are inclusive or not
We could also consider adding examples of valid parameters here instead of in _distr_params.py
. This information could be used for several other purposes:
- automatic documentation of distribution parameters
- default _argcheck
- automatic generation of invalid (out of bounds) parameters (so we don't need that list in
_distr_params.py
, either)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks quite interesting and fixes a long-standing issue (or feature request) for discrete distributions. Thanks @mdhaber!
Can you amend the PR description with stats.test()
and stats.test('full')
timings before/after? These tests tend to be quite expensive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wooo thanks @mdhaber! I did a first quick look for now.
Mainly, I would move this new functionality in a separate file and as Ralf suggested maybe do another PR for the nnlf
part.
There is another important point, ShapeInfo
. I am just unsure about the design, if we are doing this like that because of the current architecture or because it's really what we want. I think before doing anything more involved we should really discuss how we want the new distribution interface to look like. If we don't think about this now, we might block or add churn for future work.
@@ -468,6 +476,7 @@ | |||
from ._rvs_sampling import rvs_ratio_uniforms, NumericalInverseHermite # noqa | |||
from ._page_trend_test import page_trend_test | |||
from ._mannwhitneyu import mannwhitneyu | |||
from ._distn_infrastructure import fit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we start with something fresh and the end goal is to put have this part of the larger refactoring on distributions, I would have a separate file to make things clear/clean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that was something I meant to ask about. I put it there too write the code and figured we'd discuss. Also, I didn't think it would be so many lines. I'm happy with a new file, but it's been discouraged before so I don't jump into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving to a new file will erase a lot of comment history. Let me know if your other comments have been resolved and I'll move this to _fit
. Er, maybe we should do that at the end right before we're ready to merge? I don't want my self-review to be erased if others are going to review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine with me to move at the end if you prefer.
@rgommers Sure, but before I do that, maybe this will help us determine which tests to include and which to skip.
I don't think it's important to have all distributions pass Can you suggest a time limit for regular tests and a time limit for |
I wonder from time to time how to figure out if there's an improved way of detecting convergence with Can the objective functions be vectorised (see #14528)? If so, then that might be something to experiment with, it may provide a speed-up. |
Yes, easy. I'll do that. Although in tests it typically doesn't help much because the underlying calculation is evaluating the PDF at all the data points (usually many), and that is already vectorized. Looking towards the future, though, good to write in a vectorized way to make use of e.g CuPy arrays. |
I'd use timeit first, not sure how much it'll change the situation. |
Oops by "tests" I didn't mean unit tests; I meant "tests" in the generic sense using Specifically, before I submitted the Vectorizing DE (and Here's a good example of the point: from scipy import stats
import numpy as np
p0 = 0.5
dist = stats.geom
data = dist.rvs(p0, size=10000)
def objective_not_vectorized(p):
return -np.log(dist._pmf(data, p)).sum()
def objective_vectorized_with_looping(p):
res = np.empty_like(p)
for k, pk in enumerate(p):
res[k] = objective_not_vectorized(pk)
return res
def objective_natively_vectorized(p):
return -np.log(dist._pmf(data, p)).sum(axis=1)
p = np.linspace(0, 0.999, 75)
res1 = objective_natively_vectorized(p[:, None])
res2 = objective_vectorized_with_looping(p)
np.testing.assert_allclose(res1, res2) Timing:
So in this case, native vectorization is slower. |
scipy/stats/_distn_infrastructure.py
Outdated
distribution. | ||
objective_val : float | ||
The value of the negative log likelihood function evaluated with | ||
the fitted shapes, i.e. ``result.nllf(result.params)``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a followup, we can add confidence intervals.
How about >1 sec for |
I went ahead and made the |
Doc looks good (https://43309-1460385-gh.circle-artifacts.com/0/html-scipyorg/reference/generated/scipy.stats.fit.html) and CI failures are not related. Thanks again everyone! 😃 |
Thanks @tupui, @chrisb83. Added this to the release notes. Let me know if that looks appropriate. I thought I'd use this as the tracker for additional features:
|
I can work on it this weekend: as a start, I would try to work up to line 3349 in continuous_distns.py, i.e., up to class gumbel_r_gen(rv_continuous) |
Perfect! I can help in real-time, if you ever need it. Should be quick to review, too. |
@mdhaber was the |
@rgommers a message was send to the ML #15436 (comment) https://mail.python.org/archives/list/scipy-dev@python.org/thread/4AMSQGLUAUYNM3VDRT52WRPMK53PK3E4/ |
Hmm, I missed that and it didn't have any replies. Ah well, a bit late now perhaps to do anything about it. |
We still have a few days. If you have a better name I can refactor this quickly. |
An obvious one that comes to mind is Extending the |
Ok thanks, I can prepare this. @mdhaber can you confirm the name? |
I can do it, @tupui. I think |
Also something I wanted to confirm with others before release: the default |
I too prefer |
Some thoughts on my side: I was fine with It's true that it's not very descriptive though. But I don't think someone could have mixed it up with something from optim for instance. The only issue I see is that yes some people import everything from a namespace and |
On the optimization side. I prefer to also force users to be explicit about bounds on shape/loc since these have strong effects on the output. It's too hard otherwise for the optimisers to play along with second order effect params. IIRC I suggested a 2 step process to mitigate this (which you can do with our API, just call twice the function). |
After looking at #11806 and #10908, fixing the location parameter sounds good (context for others: #11806 (comment)). Are there other issues reporting this odd behavior? |
Those are the issues I was thinking of. I can't think of others, but I don't imagine the problem is limited to those distributions. For instance, I think |
Methods are quite different, because they're already attached to the thing you provide.
You can argue that for Anyway, it was proposed so you can keep it if y'all agree. I should have flagged this before merging. |
I have the same feelings as @rgommers on this. I'd far prefer a |
Reference issue
Closes gh-11948
What does this implement/fix?
As discussed in gh-11948, this adds a function
scipy.stats.fit(dist, data, ...)
to fit discrete and continuous distributions to data. Thanks to the addition ofintegrality
constraints to differential evolution (thanks @andyfaff!), the algorithmic part is extremely simple. Almost all of this is input validation, distribution domain info, and tests.Example:
Suggested code for test driving:
Additional information
Currently, all discrete distributions and two continuous distributions (
norm
andnakagami
) are supported. The rest are easy to add, but it's important to take some time for each one to make sure there are no surprises. I think that adding the remaining distributions can and should be crowdsourced to get more people testing out the code. While we're at it, we can make this function dispatch to analytical fits (e.g. fromnorm.fit
and new_fit
methods for discrete distributions) where available.In the future, this function could power a
fit
method of discrete distributions, but for now, a new function with a cleaner interface and codebase thanrv_continuous.fit
can be useful for both continuous and discrete distributions.Of course, users can immediately fit custom distributions by subclassing
rv_continuous
orrv_discrete
. We could add support for simpler custom distributions, but I'd prefer to start with the basic functionality here. There are many other features (e.g. for confidence intervals - see gh-15491 and mdhaber#70, censored data, etc.) to be added in follow-ups!