
ENH: stats: fit: function for fitting discrete and continuous distributions #15436

Merged: 35 commits, Feb 17, 2022

Conversation

@mdhaber (Contributor, author) commented Jan 19, 2022:

Reference issue

Closes gh-11948

What does this implement/fix?

As discussed in gh-11948, this adds a function scipy.stats.fit(dist, data, ...) to fit discrete and continuous distributions to data. Thanks to the addition of integrality constraints to differential evolution (thanks @andyfaff!), the algorithmic part is extremely simple. Almost all of this is input validation, distribution domain info, and tests.

Example:

import numpy as np
from scipy import stats
rng = np.random.default_rng()
dist = stats.nbinom
shapes = (5, 0.5)
data = dist.rvs(*shapes, size=1000, random_state=rng)
shape_bounds = {'n': (0, 30)}
res = stats.fit(dist, data, shape_bounds)
res.params  # FitShapes(n=5.0, p=0.4969793457498593, loc=0.0)
res.plot()

[image: output of res.plot()]
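For context, the integrality support in differential_evolution that makes the discrete case work can be used directly like this (a minimal standalone sketch, not code from this PR):

from scipy.optimize import differential_evolution

def objective(x):
    # toy objective whose unconstrained optimum has a non-integer first component
    n, p = x
    return (n - 5.3)**2 + (p - 0.5)**2

# integrality=[True, False] restricts the first variable to integer values
res = differential_evolution(objective, bounds=[(0, 30), (0, 1)],
                             integrality=[True, False], seed=1)
print(res.x)  # e.g. [5. , 0.5], with the first component an integer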

Suggested code for test driving:

import numpy as np
from scipy import stats
from scipy.optimize import differential_evolution
from scipy.stats._distr_params import distdiscrete, distcont

# for now, works for all discrete distributions plus norm and nakagami
# ['bernoulli', 'betabinom', 'binom', 'boltzmann', 'dlaplace', 'geom',
#  'hypergeom', 'nhypergeom', 'nchypergeom_fisher', 'nchypergeom_wallenius',
#  'logser', 'nbinom', 'planck', 'poisson', 'randint', 'skellam',
#  'zipf', 'zipfian', 'yulesimon', 'nakagami', 'norm']
dist_name = 'nakagami'
N = 5000
rng = np.random.default_rng(65146251968)

# get information about distribution
dist_data = dict(distdiscrete + distcont)  # example parameters
dist = getattr(stats, dist_name)
shapes = dist_data[dist_name]
shape_names = dist.shapes.split(', ')

# automatically prepare some bounds; feel free to adjust
bounds = np.empty((len(shapes), 2), dtype=np.float64)
bounds[:, 0] = np.array(shapes)/10  # essentially all shapes are > 0
bounds[:, 1] = np.array(shapes)*10

# bounds can be specified as either an array_like or a dictionary of tuples
bounds_dict = {param: bound.tolist()
               for param, bound in zip(shape_names, bounds)}

# generate some data
data = dist.rvs(*shapes, size=N, random_state=rng)

# not required; just demonstrating how a custom optimizer / optimizer options
# can be configured
def opt(fun, bounds, integrality=None, x0=None):
    return differential_evolution(fun, bounds, integrality=integrality, x0=x0,
                                  seed=rng)

res = stats.fit(dist, data, bounds, optimizer=opt)  # or bounds_dict
res.plot()
print(res)

Additional information

Currently, all discrete distributions and two continuous distributions (norm and nakagami) are supported. The rest are easy to add, but it's important to take some time for each one to make sure there are no surprises. I think that adding the remaining distributions can and should be crowdsourced to get more people testing out the code. While we're at it, we can make this function dispatch to analytical fits (e.g. from norm.fit and new _fit methods for discrete distributions) where available.

In the future, this function could power a fit method of discrete distributions, but for now, a new function with a cleaner interface and codebase than rv_continuous.fit can be useful for both continuous and discrete distributions.

Of course, users can immediately fit custom distributions by subclassing rv_continuous or rv_discrete. We could add support for simpler custom distributions, but I'd prefer to start with the basic functionality here. There are many other features (e.g. for confidence intervals - see gh-15491 and mdhaber#70, censored data, etc.) to be added in follow-ups!
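As a rough illustration of the subclassing route (a hedged sketch: the toy distribution, its parameter, and the private ShapeInfo import location are illustrative and may not match the merged layout):

import numpy as np
from scipy import stats
from scipy.stats._distn_infrastructure import ShapeInfo  # private helper used by this PR

class powerlaw_like_gen(stats.rv_continuous):
    # toy one-shape distribution on [0, 1] with pdf(x; c) = c * x**(c - 1)
    def _pdf(self, x, c):
        return c * x**(c - 1.0)

    def _shape_info(self):
        # name, integrality, domain, and whether the endpoints are inclusive
        return [ShapeInfo("c", False, (0, np.inf), (False, False))]

powerlaw_like = powerlaw_like_gen(a=0.0, b=1.0, name='powerlaw_like')
rng = np.random.default_rng(12345)
data = powerlaw_like.rvs(2.5, size=200, random_state=rng)
res = stats.fit(powerlaw_like, data, {'c': (0.1, 10)})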

@mdhaber added the scipy.stats and enhancement (A new feature or improvement) labels on Jan 19, 2022.
@mdhaber (Contributor, author) left a comment:

Some self-review.

scipy/stats/_continuous_distns.py (review thread, resolved):
@@ -5658,6 +5661,9 @@ class nakagami_gen(rv_continuous):
    %(example)s

    """
    def _shape_info(self):
        return [ShapeInfo("nu", False, (0, np.inf), (False, True))]
@mdhaber commented Jan 19, 2022:
This is the sort of thing we need to add for all remaining continuous distributions.

  • name of parameter
  • whether domain is discretized
  • lower and upper bounds of the domain
  • whether the bounds are inclusive or not
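
For example, following that pattern, binom's two shapes might be described like this (an illustrative sketch; the exact inclusivity flags for each distribution need to be checked individually):

def _shape_info(self):
    # n: integer-valued count, n >= 0; p: real probability, 0 <= p <= 1
    return [ShapeInfo("n", True, (0, np.inf), (True, False)),
            ShapeInfo("p", False, (0, 1), (True, True))]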

We could also consider adding examples of valid parameters here instead of in _distr_params.py. This information could be used for several other purposes:

  • automatic documentation of distribution parameters
  • default _argcheck
  • automatic generation of invalid (out of bounds) parameters (so we don't need that list in _distr_params.py, either)

Further review threads (resolved): scipy/stats/_distn_infrastructure.py, scipy/stats/tests/test_fit.py.
@rgommers (Member) left a comment:
This looks quite interesting and fixes a long-standing issue (or feature request) for discrete distributions. Thanks @mdhaber!

Can you amend the PR description with stats.test() and stats.test('full') timings before/after? These tests tend to be quite expensive.

@tupui (Member) left a comment:
Wooo thanks @mdhaber! I did a first quick pass for now.

Mainly, I would move this new functionality into a separate file, and, as Ralf suggested, maybe do another PR for the nnlf part.

There is another important point: ShapeInfo. I am just unsure about the design: are we doing it like this because of the current architecture, or because it's really what we want? I think before doing anything more involved we should really discuss what we want the new distribution interface to look like. If we don't think about this now, we might block or add churn to future work.

@@ -468,6 +476,7 @@
from ._rvs_sampling import rvs_ratio_uniforms, NumericalInverseHermite # noqa
from ._page_trend_test import page_trend_test
from ._mannwhitneyu import mannwhitneyu
from ._distn_infrastructure import fit
(Member) commented:
If we start with something fresh, and the end goal is to make this part of the larger refactoring of distributions, I would use a separate file to keep things clear/clean.

@mdhaber (Contributor, author) replied:
Yeah, that was something I meant to ask about. I put it there to write the code and figured we'd discuss. Also, I didn't think it would be so many lines. I'm happy with a new file, but it's been discouraged before, so I didn't jump into it.

@mdhaber commented Jan 21, 2022:
Moving to a new file will erase a lot of comment history. Let me know if your other comments have been resolved and I'll move this to _fit. Er, maybe we should do that at the end right before we're ready to merge? I don't want my self-review to be erased if others are going to review.

(Member) replied:
Fine with me to move at the end if you prefer.

More review threads on scipy/stats/_distn_infrastructure.py (resolved).
@mdhaber (Contributor, author) commented Jan 20, 2022:

> Can you amend the PR description with stats.test() and stats.test('full') timings before/after? These tests tend to be quite expensive.

@rgommers Sure, but before I do that, maybe this will help us determine which tests to include and which to skip.

7.03s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[hypergeom]
4.95s call     scipy/stats/tests/test_fit.py::TestFit::test_nchypergeom_wallenius
4.56s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[skellam]
4.03s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[zipfian]
2.60s call     scipy/stats/tests/test_fit.py::TestFit::test_nchypergeom_fisher
1.58s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[betabinom]
0.58s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[binom]
0.56s call     scipy/stats/tests/test_fit.py::TestFit::test_nhypergeom
0.41s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[zipf]
0.37s call     scipy/stats/tests/test_fit.py::TestFit::test_nbinom
0.22s call     scipy/stats/tests/test_fit.py::TestFit::test_randint
0.19s call     scipy/stats/tests/test_fit.py::TestFit::test_missing_shape_bounds
0.17s call     scipy/stats/tests/test_fit.py::TestFit::test_boltzmann
0.14s call     scipy/stats/tests/test_fit.py::TestFit::test_yulesimon
0.11s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[planck]
0.10s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[bernoulli]
0.10s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[poisson]
0.10s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[geom]
0.09s call     scipy/stats/tests/test_fit.py::TestFit::test_scale_bounds_iv
0.08s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[logser]
0.08s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[dlaplace]
0.07s call     scipy/stats/tests/test_fit.py::TestFit::test_shape_bounds_iv

I don't think it's important to have all distributions pass test_basic. The overwhelming majority of test failures will just mean that the optimization didn't converge to the global optimum. That's not a bug in fit or differential_evolution; we just can't guarantee that we will find the global MLE with arbitrary bounds and random state. So I'm happy with skipping lots of them to keep the time down.

Can you suggest a time limit for regular tests and a time limit for slow tests, and I'll mark the rest as xslow?
Or, how much time would be acceptable to add to the regular and slow test suites? I can figure out the details to get us there.

@andyfaff (Contributor) commented:

I wonder from time to time whether there's an improved way of detecting convergence in differential_evolution. At the moment it decides the population has converged by dividing the stdev of the population energies by their mean, stopping if the ratio is less than tol. I suspect that in some situations there could be wasted iterations because outliers prevent halting. (For me, wasted iterations don't matter so much; I just need to know that the optimum is a good one.) I reckon for these tests you could experiment with relaxing tol and seeing if they converge more quickly.
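Roughly, that halting rule amounts to the following (a paraphrase of differential_evolution's internals, not the actual source):

import numpy as np

def population_converged(energies, tol=0.01):
    # halt when the spread of the population's objective values is small
    # relative to their mean; outliers inflate the std and delay halting
    return np.std(energies) <= tol * np.abs(np.mean(energies))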

Can the objective functions be vectorised (see #14528)? If so, then that might be something to experiment with, it may provide a speed-up.

@mdhaber (Contributor, author) commented Jan 20, 2022:

> Can the objective functions be vectorised (see #14528)?

Yes, easy. I'll do that. Although in tests it typically doesn't help much because the underlying calculation is evaluating the PDF at all the data points (usually many), and that is already vectorized.

Looking toward the future, though, it's good to write it in a vectorized way to make use of, e.g., CuPy arrays.

@andyfaff (Contributor) commented:

I'd use timeit first, not sure how much it'll change the situation.

@mdhaber (Contributor, author) commented Jan 20, 2022:

Oops, by "tests" I didn't mean unit tests; I meant "tests" in the generic sense, using timeit.

Specifically, before I submitted the integrality PR for brute, I wrote my own version of vectorized brute and used it in the zooming/iterative brute strategy for solving these problems. It was not really any faster, and I think the reason is that each objective function evaluation involves calculating the PDF of a distribution at N (1000-10000) data points, and that calculation is already vectorized. At that point, vectorizing w.r.t. the decision variables mostly starts to eat up memory without making the calculation much faster. I imagine the same would be true for DE.

Vectorizing DE (and brute) is a great idea in general; I just don't expect it to do much for large/slow MLE problems (where it matters) on a CPU.

Here's a good example of the point:

import numpy as np
from scipy import stats

p0 = 0.5
dist = stats.geom
data = dist.rvs(p0, size=10000)

def objective_not_vectorized(p):
    return -np.log(dist._pmf(data, p)).sum()

def objective_vectorized_with_looping(p):
    res = np.empty_like(p)
    for k, pk in enumerate(p):
        res[k] = objective_not_vectorized(pk)
    return res

def objective_natively_vectorized(p):
    return -np.log(dist._pmf(data, p)).sum(axis=1)

p = np.linspace(0, 0.999, 75)

res1 = objective_natively_vectorized(p[:, None])
res2 = objective_vectorized_with_looping(p)
np.testing.assert_allclose(res1, res2)

Timing:

>>> %timeit objective_natively_vectorized(p[:, None])
17.5 ms ± 97.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit objective_vectorized_with_looping(p)
14.9 ms ± 28.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So in this case, native vectorization is slower.
If data has fewer elements, then native vectorization can be a lot faster.
I suppose all I should say is that it's problem-dependent. The best thing to do is what we did in permutation_test and bootstrap: vectorize the calculation, but allow the user to control the "batch" size.
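A minimal sketch of that batching idea (a hypothetical wrapper, not the permutation_test/bootstrap implementation):

import numpy as np

def batched(vectorized_objective, batch_size=10):
    # evaluate a vectorized objective on batch_size parameter vectors at a
    # time, trading some speed for bounded peak memory
    def wrapper(params_2d):
        params_2d = np.atleast_2d(params_2d)
        chunks = [vectorized_objective(params_2d[i:i + batch_size])
                  for i in range(0, len(params_2d), batch_size)]
        return np.concatenate(chunks)
    return wrapper

# e.g., reusing the example above:
# batched(objective_natively_vectorized, batch_size=25)(p[:, None])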

From the FitResult docstring under review:

objective_val : float
    The value of the negative log likelihood function evaluated with
    the fitted shapes, i.e. ``result.nllf(result.params)``.
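In other words (a quick sketch reusing res from the nbinom example in the PR description):

import numpy as np
# objective_val stores the NLLF evaluated at the fitted parameters
np.isclose(res.objective_val, res.nllf(res.params))  # True, per the docstring above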
@mdhaber (Contributor, author) commented:

In a follow-up, we can add confidence intervals.

@rgommers (Member) commented:

> Can you suggest a time limit for regular tests and a time limit for slow tests, and I'll mark the rest as xslow?
> Or, how much time would be acceptable to add to the regular and slow test suites? I can figure out the details to get us there.

How about >1 sec for xslow and >0.25 sec for slow? There are a lot of distributions, so I think setting the limits a bit more strictly than for a single test makes sense.
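(Illustratively, those thresholds would map onto scipy's test markers like this; a sketch, not code from the thread:)

import pytest

@pytest.mark.slow   # runtime above 0.25 s
def test_basic_fit_betabinom():
    ...

@pytest.mark.xslow  # runtime above 1 s
def test_basic_fit_hypergeom():
    ...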

@mdhaber (Contributor, author) commented Feb 17, 2022:

I went ahead and made the FitResult _dist and _data attributes private. We can always make them public later if desired, but it's tough to go the other way around. Also, I think the FitResult class documentation will render now.

@tupui (Member) commented Feb 17, 2022:

Doc looks good (https://43309-1460385-gh.circle-artifacts.com/0/html-scipyorg/reference/generated/scipy.stats.fit.html) and CI failures are not related. Thanks again everyone! 😃

@tupui tupui merged commit fca3775 into scipy:main Feb 17, 2022
@mdhaber (Contributor, author) commented Feb 17, 2022:

Thanks @tupui, @chrisb83. Added this to the release notes. Let me know if that looks appropriate.

I thought I'd use this as the tracker for additional features:

@mdhaber (Contributor, author) commented Feb 18, 2022:

@chrisb83 @tupui I think it would be good to add _shape_info to the remaining distributions first. Would you prefer to write some of this code and I'll review or vice-versa?

@chrisb83 (Member) commented Mar 4, 2022:

I can work on it this weekend. As a start, I would try to work up to line 3349 in _continuous_distns.py, i.e., up to class gumbel_r_gen(rv_continuous).

@mdhaber (Contributor, author) commented Mar 4, 2022:

Perfect! I can help in real-time, if you ever need it. Should be quick to review, too.

@rgommers (Member) commented:

@mdhaber was the stats.fit name discussed on the mailing list? I just noticed it; it is too generic, I'd say. There are a number of things that do fitting in stats, optimize, and other submodules, and nothing is called just fit.

@tupui (Member) commented May 24, 2022: (linked the mailing-list thread where the name had been brought up)

@rgommers (Member) replied:

Hmm, I missed that and it didn't have any replies. Ah well, a bit late now perhaps to do anything about it.

@tupui (Member) commented May 24, 2022:

We still have a few days. If you have a better name I can refactor this quickly.

@rgommers (Member) commented May 24, 2022:

An obvious one that comes to mind is fit_distribution.

Extending the See Also section would also be good: for most users, I suspect "fit" means things like stats.linregress and optimize.curve_fit.

@tupui (Member) commented May 24, 2022:

Ok thanks, I can prepare this. @mdhaber can you confirm the name?

@mdhaber (Contributor, author) commented May 24, 2022:

I can do it, @tupui. I think fit_distribution was suggested, but I preferred fit. Is the concern that people will read through the list of available functions and try to use it for the wrong purpose without reading the first line of the documentation? I suppose there is some other fitting. I thought distribution fitting was the most fundamental fitting in stats, and the name is familiar to users of distributions' fit method, but fit_distribution is ok if need be.

@mdhaber (Contributor, author) commented May 24, 2022:

Also, something I wanted to confirm with others before release: the defaults loc=0 and scale=1 are fixed, so only shape parameters are fit by default. This was a reaction to the problems reported with fitting Nakagami and Weibull, where there is a fair amount of overlap between the roles of the location and shape parameters, so leaving the location free appears to cause issues. I thought this default would force the user to be more explicit and avoid these problems.
Do we want those to be free or fixed by default? (Also, should we change the name as suggested above?) Thought I'd bring you in, @tirthasheshpatel @chrisb83 @rlucas7.
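(For context, a sketch of how a user opts in to fitting loc under this default; the bounds values are illustrative:)

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = stats.nakagami.rvs(1.5, size=500, random_state=rng)
# loc stays fixed at its default of 0 unless it is given bounds explicitly
res = stats.fit(stats.nakagami, data, {'nu': (0.1, 10), 'loc': (-1, 1)})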

@tirthasheshpatel (Member) commented:

> Also, should we change the name as suggested above?

I too prefer fit over fit_distribution. Optimize users might get confused, but since we are already putting this in the stats module, a more verbose name like fit_distribution doesn't seem necessary. Anyway, users who want to import it under another name can do so easily using the as keyword.

@tupui (Member) commented May 24, 2022:

Some thoughts on my side:

I was fine with fit when I looked at this initially because, in sklearn for instance, almost everything has a fit method, and all of these do very different things for the given algorithm. So context matters.

It's true that it's not very descriptive, though. But I don't think someone could mix it up with something from optimize, for instance.

The only issue I see is that some people import everything from a namespace, and fit is common enough that there could be conflicts.

@tupui (Member) commented May 24, 2022:

On the optimization side: I prefer to force users to be explicit about bounds on shape/loc, since these have strong effects on the output. Otherwise it's too hard for the optimizers to play along with second-order-effect parameters. IIRC I suggested a two-step process to mitigate this (which you can do with our API; just call the function twice), as sketched below.
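(A sketch of that two-step idea with the current API, reusing the nakagami data from the sketch above; the bounds are illustrative:)

# step 1: fit the shape with loc fixed at its default of 0
res1 = stats.fit(stats.nakagami, data, {'nu': (0.1, 10)})
nu = res1.params.nu
# step 2: refit with loc free, with shape bounds tightened around step 1
res2 = stats.fit(stats.nakagami, data,
                 {'nu': (nu / 2, nu * 2), 'loc': (-1, 1)})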

@tirthasheshpatel (Member) commented:

> Do we want those to be free or fixed by default?

After looking at #11806 and #10908, fixing the location parameter sounds good (context for others: #11806 (comment)). Are there other issues reporting this odd behavior?

@mdhaber (Contributor, author) commented May 24, 2022:

Those are the issues I was thinking of. I can't think of others, but I don't imagine the problem is limited to those distributions. For instance, I think loc being free for discrete distributions could also be surprising.

@rgommers (Member) commented:

> I was fine with fit when I looked at this initially because, in sklearn for instance, almost everything has a fit method, and all of these do very different things for the given algorithm. So context matters.

Methods are quite different, because they're already attached to the thing you provide. fit as a function just breaks descriptive API naming rules; it's always fit_<what> (or estimate_<what> or calc_<what>).

> I suppose there is some other fitting. I thought distribution fitting was the most fundamental fitting in stats, and the name is familiar to users of distributions' fit method, but fit_distribution is ok if need be.

You can argue that for scipy.stats as it is today, indeed, but I'd think not for statistics in general. Regressions like https://www.statsmodels.org/stable/user-guide.html#regression-and-linear-models are a lot more common. So if you keep fit, I guess you're looking at scipy.stats as "lots of statistical distributions, plus some other assorted things" as the scope (and it won't change over time)?

Anyway, it was proposed so you can keep it if y'all agree. I should have flagged this before merging.

@rkern (Member) commented May 24, 2022:

I have the same feelings as @rgommers on this. I'd far prefer a fit_distribution name for the reasons given, but with the timing, I won't insist on it.

Labels: scipy.stats, enhancement (A new feature or improvement)
Closes: fitting discrete distributions (gh-11948)
7 participants