
ENH: stats: fit: function for fitting discrete and continuous distributions #15436

Merged: 35 commits, Feb 17, 2022

Conversation

@mdhaber (Contributor, author) commented Jan 19, 2022:

Reference issue

Closes gh-11948

What does this implement/fix?

As discussed in gh-11948, this adds a function scipy.stats.fit(dist, data, ...) to fit discrete and continuous distributions to data. Thanks to the addition of integrality constraints to differential evolution (thanks @andyfaff!), the algorithmic part is extremely simple. Almost all of this is input validation, distribution domain info, and tests.

Example:

import numpy as np
from scipy import stats
rng = np.random.default_rng()
dist = stats.nbinom
shapes = (5, 0.5)
data = dist.rvs(*shapes, size=1000, random_state=rng)
shape_bounds = {'n': (0, 30)}
res = stats.fit(dist, data, shape_bounds)
res.params  # FitShapes(n=5.0, p=0.4969793457498593, loc=0.0)
res.plot()

[image: output of res.plot()]
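For context, the integrality support in differential_evolution that makes the discrete case work can be used directly like this (a minimal standalone sketch, not code from this PR):

from scipy.optimize import differential_evolution

def objective(x):
    # toy objective whose unconstrained optimum has a non-integer first component
    n, p = x
    return (n - 5.3)**2 + (p - 0.5)**2

# integrality=[True, False] restricts the first variable to integer values
res = differential_evolution(objective, bounds=[(0, 30), (0, 1)],
                             integrality=[True, False], seed=1)
print(res.x)  # e.g. [5. , 0.5], with the first component an integer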

Suggested code for test driving:

import numpy as np
from scipy import stats
from scipy.optimize import differential_evolution
from scipy.stats._distr_params import distdiscrete, distcont

# for now, works for all discrete distributions plus norm and nakagami
# ['bernoulli', 'betabinom', 'binom', 'boltzmann', 'dlaplace', 'geom',
#  'hypergeom', 'nhypergeom', 'nchypergeom_fisher', 'nchypergeom_wallenius',
#  'logser', 'nbinom', 'planck', 'poisson', 'randint', 'skellam',
#  'zipf', 'zipfian', 'yulesimon', 'nakagami', 'norm']
dist_name = 'nakagami'
N = 5000
rng = np.random.default_rng(65146251968)

# get information about distribution
dist_data = dict(distdiscrete + distcont)  # example parameters
dist = getattr(stats, dist_name)
shapes = dist_data[dist_name]
shape_names = dist.shapes.split(', ')

# automatically prepare some bounds; feel free to adjust
bounds = np.empty((len(shapes), 2), dtype=np.float64)
bounds[:, 0] = np.array(shapes)/10  # essentially all shapes are > 0
bounds[:, 1] = np.array(shapes)*10

# bounds can be specified as either an array_like or a dictionary of tuples
bounds_dict = {param: bound.tolist()
               for param, bound in zip(shape_names, bounds)}

# generate some data
data = dist.rvs(*shapes, size=N, random_state=rng)

# not required; just demonstrating how a custom optimizer / optimizer options
# can be configured
def opt(fun, bounds, integrality=None, x0=None):
    return differential_evolution(fun, bounds, integrality=integrality, x0=x0,
                                  seed=rng)

res = stats.fit(dist, data, bounds, optimizer=opt)  # or bounds_dict
res.plot()
print(res)

Additional information

Currently, all discrete distributions and two continuous distributions (norm and nakagami) are supported. The rest are easy to add, but it's important to take some time for each one to make sure there are no surprises. I think that adding the remaining distributions can and should be crowdsourced to get more people testing out the code. While we're at it, we can make this function dispatch to analytical fits (e.g. from norm.fit and new _fit methods for discrete distributions) where available.

In the future, this function could power a fit method of discrete distributions, but for now, a new function with a cleaner interface and codebase than rv_continuous.fit can be useful for both continuous and discrete distributions.

Of course, users can immediately fit custom distributions by subclassing rv_continuous or rv_discrete. We could add support for simpler custom distributions, but I'd prefer to start with the basic functionality here. There are many other features (e.g. for confidence intervals - see gh-15491 and mdhaber#70, censored data, etc.) to be added in follow-ups!
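As a rough illustration of the subclassing route (a hedged sketch: the toy distribution, its parameter, and the private ShapeInfo import location are illustrative and may not match the merged layout):

import numpy as np
from scipy import stats
from scipy.stats._distn_infrastructure import ShapeInfo  # private helper used by this PR

class powerlaw_like_gen(stats.rv_continuous):
    # toy one-shape distribution on [0, 1] with pdf(x; c) = c * x**(c - 1)
    def _pdf(self, x, c):
        return c * x**(c - 1.0)

    def _shape_info(self):
        # name, integrality, domain, and whether the endpoints are inclusive
        return [ShapeInfo("c", False, (0, np.inf), (False, False))]

powerlaw_like = powerlaw_like_gen(a=0.0, b=1.0, name='powerlaw_like')
rng = np.random.default_rng(12345)
data = powerlaw_like.rvs(2.5, size=200, random_state=rng)
res = stats.fit(powerlaw_like, data, {'c': (0.1, 10)})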

@mdhaber added the scipy.stats and enhancement (A new feature or improvement) labels on Jan 19, 2022.
@mdhaber (Contributor, author) left a comment:

Some self-review.

scipy/stats/_continuous_distns.py (review thread, resolved):
@@ -5658,6 +5661,9 @@ class nakagami_gen(rv_continuous):
    %(example)s

    """
    def _shape_info(self):
        return [ShapeInfo("nu", False, (0, np.inf), (False, True))]
@mdhaber commented Jan 19, 2022:
This is the sort of thing we need to add for all remaining continuous distributions.

  • name of parameter
  • whether domain is discretized
  • lower and upper bounds of the domain
  • whether the bounds are inclusive or not
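
For example, following that pattern, binom's two shapes might be described like this (an illustrative sketch; the exact inclusivity flags for each distribution need to be checked individually):

def _shape_info(self):
    # n: integer-valued count, n >= 0; p: real probability, 0 <= p <= 1
    return [ShapeInfo("n", True, (0, np.inf), (True, False)),
            ShapeInfo("p", False, (0, 1), (True, True))]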

We could also consider adding examples of valid parameters here instead of in _distr_params.py. This information could be used for several other purposes:

  • automatic documentation of distribution parameters
  • default _argcheck
  • automatic generation of invalid (out of bounds) parameters (so we don't need that list in _distr_params.py, either)

Further review threads (resolved): scipy/stats/_distn_infrastructure.py, scipy/stats/tests/test_fit.py.
@rgommers (Member) left a comment:
This looks quite interesting and fixes a long-standing issue (or feature request) for discrete distributions. Thanks @mdhaber!

Can you amend the PR description with stats.test() and stats.test('full') timings before/after? These tests tend to be quite expensive.

@tupui (Member) left a comment:
Wooo thanks @mdhaber! I did a first quick pass for now.

Mainly, I would move this new functionality into a separate file, and, as Ralf suggested, maybe do another PR for the nnlf part.

There is another important point: ShapeInfo. I am just unsure about the design: are we doing it like this because of the current architecture, or because it's really what we want? I think before doing anything more involved we should really discuss what we want the new distribution interface to look like. If we don't think about this now, we might block or add churn to future work.

@@ -468,6 +476,7 @@
from ._rvs_sampling import rvs_ratio_uniforms, NumericalInverseHermite # noqa
from ._page_trend_test import page_trend_test
from ._mannwhitneyu import mannwhitneyu
from ._distn_infrastructure import fit
(Member) commented:
If we start with something fresh, and the end goal is to make this part of the larger refactoring of distributions, I would use a separate file to keep things clear/clean.

@mdhaber (Contributor, author) replied:
Yeah, that was something I meant to ask about. I put it there to write the code and figured we'd discuss. Also, I didn't think it would be so many lines. I'm happy with a new file, but it's been discouraged before, so I didn't jump into it.

@mdhaber commented Jan 21, 2022:
Moving to a new file will erase a lot of comment history. Let me know if your other comments have been resolved and I'll move this to _fit. Er, maybe we should do that at the end right before we're ready to merge? I don't want my self-review to be erased if others are going to review.

(Member) replied:
Fine with me to move at the end if you prefer.

More review threads on scipy/stats/_distn_infrastructure.py (resolved).
@mdhaber (Contributor, author) commented Jan 20, 2022:

> Can you amend the PR description with stats.test() and stats.test('full') timings before/after? These tests tend to be quite expensive.

@rgommers Sure, but before I do that, maybe this will help us determine which tests to include and which to skip.

7.03s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[hypergeom]
4.95s call     scipy/stats/tests/test_fit.py::TestFit::test_nchypergeom_wallenius
4.56s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[skellam]
4.03s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[zipfian]
2.60s call     scipy/stats/tests/test_fit.py::TestFit::test_nchypergeom_fisher
1.58s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[betabinom]
0.58s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[binom]
0.56s call     scipy/stats/tests/test_fit.py::TestFit::test_nhypergeom
0.41s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[zipf]
0.37s call     scipy/stats/tests/test_fit.py::TestFit::test_nbinom
0.22s call     scipy/stats/tests/test_fit.py::TestFit::test_randint
0.19s call     scipy/stats/tests/test_fit.py::TestFit::test_missing_shape_bounds
0.17s call     scipy/stats/tests/test_fit.py::TestFit::test_boltzmann
0.14s call     scipy/stats/tests/test_fit.py::TestFit::test_yulesimon
0.11s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[planck]
0.10s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[bernoulli]
0.10s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[poisson]
0.10s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[geom]
0.09s call     scipy/stats/tests/test_fit.py::TestFit::test_scale_bounds_iv
0.08s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[logser]
0.08s call     scipy/stats/tests/test_fit.py::TestFit::test_basic_fit[dlaplace]
0.07s call     scipy/stats/tests/test_fit.py::TestFit::test_shape_bounds_iv

I don't think it's important to have all distributions pass test_basic. The overwhelming majority of test failures will just mean that the optimization didn't converge to the global optimum. That's not a bug in fit or differential_evolution; we just can't guarantee that we will find the global MLE with arbitrary bounds and random state. So I'm happy with skipping lots of them to keep the time down.

Can you suggest a time limit for regular tests and a time limit for slow tests, and I'll mark the rest as xslow?
Or, how much time would be acceptable to add to the regular and slow test suites? I can figure out the details to get us there.

@andyfaff (Contributor) commented:

I wonder from time to time whether there's an improved way of detecting convergence in differential_evolution. At the moment it decides the population has converged by dividing the stdev of the population energies by their mean, stopping if the ratio is less than tol. I suspect that in some situations there could be wasted iterations because outliers prevent halting. (For me, wasted iterations don't matter so much; I just need to know that the optimum is a good one.) I reckon for these tests you could experiment with relaxing tol and seeing if they converge more quickly.
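Roughly, that halting rule amounts to the following (a paraphrase of differential_evolution's internals, not the actual source):

import numpy as np

def population_converged(energies, tol=0.01):
    # halt when the spread of the population's objective values is small
    # relative to their mean; outliers inflate the std and delay halting
    return np.std(energies) <= tol * np.abs(np.mean(energies))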

Can the objective functions be vectorised (see #14528)? If so, then that might be something to experiment with, it may provide a speed-up.

@mdhaber (Contributor, author) commented Jan 20, 2022:

> Can the objective functions be vectorised (see #14528)?

Yes, easy. I'll do that. Although in tests it typically doesn't help much because the underlying calculation is evaluating the PDF at all the data points (usually many), and that is already vectorized.

Looking toward the future, though, it's good to write it in a vectorized way to make use of, e.g., CuPy arrays.

@andyfaff (Contributor) commented:

I'd use timeit first, not sure how much it'll change the situation.

@mdhaber (Contributor, author) commented Jan 20, 2022:

Oops, by "tests" I didn't mean unit tests; I meant "tests" in the generic sense, using timeit.

Specifically, before I submitted the integrality PR for brute, I wrote my own version of vectorized brute and used it in the zooming/iterative brute strategy for solving these problems. It was not really any faster, and I think the reason is that each objective function evaluation involves calculating the PDF of a distribution at N (1000-10000) data points, and that calculation is already vectorized. At that point, vectorizing w.r.t. the decision variables mostly starts to eat up memory without making the calculation much faster. I imagine the same would be true for DE.

Vectorizing DE (and brute) is a great idea in general; I just don't expect it to do much for large/slow MLE problems (where it matters) on a CPU.

Here's a good example of the point:

import numpy as np
from scipy import stats

p0 = 0.5
dist = stats.geom
data = dist.rvs(p0, size=10000)

def objective_not_vectorized(p):
    return -np.log(dist._pmf(data, p)).sum()

def objective_vectorized_with_looping(p):
    res = np.empty_like(p)
    for k, pk in enumerate(p):
        res[k] = objective_not_vectorized(pk)
    return res

def objective_natively_vectorized(p):
    return -np.log(dist._pmf(data, p)).sum(axis=1)

p = np.linspace(0, 0.999, 75)

res1 = objective_natively_vectorized(p[:, None])
res2 = objective_vectorized_with_looping(p)
np.testing.assert_allclose(res1, res2)

Timing:

>>> %timeit objective_natively_vectorized(p[:, None])
17.5 ms ± 97.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit objective_vectorized_with_looping(p)
14.9 ms ± 28.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So in this case, native vectorization is slower.
If data has fewer elements, then native vectorization can be a lot faster.
I suppose all I should say is that it's problem-dependent. The best thing to do is what we did in permutation_test and bootstrap: vectorize the calculation, but allow the user to control the "batch" size.
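A minimal sketch of that batching idea (a hypothetical wrapper, not the permutation_test/bootstrap implementation):

import numpy as np

def batched(vectorized_objective, batch_size=10):
    # evaluate a vectorized objective on batch_size parameter vectors at a
    # time, trading some speed for bounded peak memory
    def wrapper(params_2d):
        params_2d = np.atleast_2d(params_2d)
        chunks = [vectorized_objective(params_2d[i:i + batch_size])
                  for i in range(0, len(params_2d), batch_size)]
        return np.concatenate(chunks)
    return wrapper

# e.g., reusing the example above:
# batched(objective_natively_vectorized, batch_size=25)(p[:, None])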

From the FitResult docstring under review:

objective_val : float
    The value of the negative log likelihood function evaluated with
    the fitted shapes, i.e. ``result.nllf(result.params)``.
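In other words (a quick sketch reusing res from the nbinom example in the PR description):

import numpy as np
# objective_val stores the NLLF evaluated at the fitted parameters
np.isclose(res.objective_val, res.nllf(res.params))  # True, per the docstring above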
@mdhaber (Contributor, author) commented:

In a follow-up, we can add confidence intervals.

@rgommers (Member) commented:

> Can you suggest a time limit for regular tests and a time limit for slow tests, and I'll mark the rest as xslow?
> Or, how much time would be acceptable to add to the regular and slow test suites? I can figure out the details to get us there.

How about >1 sec for xslow and >0.25 sec for slow? There are a lot of distributions, so I think setting the limits a bit more strictly than for a single test makes sense.
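(Illustratively, those thresholds would map onto scipy's test markers like this; a sketch, not code from the thread:)

import pytest

@pytest.mark.slow   # runtime above 0.25 s
def test_basic_fit_betabinom():
    ...

@pytest.mark.xslow  # runtime above 1 s
def test_basic_fit_hypergeom():
    ...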

@mdhaber (Contributor, author) commented Feb 17, 2022:

I went ahead and made the FitResult _dist and _data attributes private. We can always make them public later if desired, but it's tough to go the other way around. Also, I think the FitResult class documentation will render now.

@tupui (Member) commented Feb 17, 2022:

Doc looks good (https://43309-1460385-gh.circle-artifacts.com/0/html-scipyorg/reference/generated/scipy.stats.fit.html) and CI failures are not related. Thanks again everyone! 😃

@tupui tupui merged commit fca3775 into scipy:main Feb 17, 2022
@mdhaber (Contributor, author) commented Feb 17, 2022:

Thanks @tupui, @chrisb83. Added this to the release notes. Let me know if that looks appropriate.

I thought I'd use this as the tracker for additional features:

@mdhaber (Contributor, author) commented Feb 18, 2022:

@chrisb83 @tupui I think it would be good to add _shape_info to the remaining distributions first. Would you prefer to write some of this code and I'll review or vice-versa?

@chrisb83 (Member) commented Mar 4, 2022:

I can work on it this weekend. As a start, I would try to work up to line 3349 in _continuous_distns.py, i.e., up to class gumbel_r_gen(rv_continuous).

@mdhaber (Contributor, author) commented Mar 4, 2022:

Perfect! I can help in real-time, if you ever need it. Should be quick to review, too.

@rgommers (Member) commented:

@mdhaber was the stats.fit name discussed on the mailing list? I just noticed it; it is too generic, I'd say. There are a number of things that do fitting in stats, optimize, and other submodules, and nothing is called just fit.

@tupui (Member) commented May 24, 2022: (linked the mailing-list thread where the name had been brought up)

@rgommers (Member) replied:

Hmm, I missed that and it didn't have any replies. Ah well, a bit late now perhaps to do anything about it.

@tupui (Member) commented May 24, 2022:

We still have a few days. If you have a better name I can refactor this quickly.

@rgommers (Member) commented May 24, 2022:

An obvious one that comes to mind is fit_distribution.

Extending the See Also section would also be good: for most users, I suspect "fit" means things like stats.linregress and optimize.curve_fit.

@tupui (Member) commented May 24, 2022:

Ok thanks, I can prepare this. @mdhaber can you confirm the name?

@mdhaber (Contributor, author) commented May 24, 2022:

I can do it, @tupui. I think fit_distribution was suggested, but I preferred fit. Is the concern that people will read through the list of available functions and try to use it for the wrong purpose without reading the first line of the documentation? I suppose there is some other fitting. I thought distribution fitting was the most fundamental fitting in stats, and the name is familiar to users of distributions' fit method, but fit_distribution is ok if need be.

@mdhaber (Contributor, author) commented May 24, 2022:

Also, something I wanted to confirm with others before release: the defaults loc=0 and scale=1 are fixed, so only shape parameters are fit by default. This was a reaction to the problems reported with fitting Nakagami and Weibull, where there is a fair amount of overlap between the roles of the location and shape parameters, so leaving the location free appears to cause issues. I thought this default would force the user to be more explicit and avoid these problems.
Do we want those to be free or fixed by default? (Also, should we change the name as suggested above?) Thought I'd bring you in, @tirthasheshpatel @chrisb83 @rlucas7.
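(For context, a sketch of how a user opts in to fitting loc under this default; the bounds values are illustrative:)

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = stats.nakagami.rvs(1.5, size=500, random_state=rng)
# loc stays fixed at its default of 0 unless it is given bounds explicitly
res = stats.fit(stats.nakagami, data, {'nu': (0.1, 10), 'loc': (-1, 1)})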

@tirthasheshpatel (Member) commented:

> Also, should we change the name as suggested above?

I too prefer fit over fit_distribution. Optimize users might get confused, but since we are already putting this in the stats module, a more verbose name like fit_distribution doesn't seem necessary. Anyway, users who want to import it under another name can do so easily using the as keyword.

@tupui (Member) commented May 24, 2022:

Some thoughts on my side:

I was fine with fit when I looked at this initially because, in sklearn for instance, almost everything has a fit method, and all of these do very different things for the given algorithm. So context matters.

It's true that it's not very descriptive, though. But I don't think someone could mix it up with something from optimize, for instance.

The only issue I see is that some people import everything from a namespace, and fit is common enough that there could be conflicts.

@tupui (Member) commented May 24, 2022:

On the optimization side: I prefer to force users to be explicit about bounds on shape/loc, since these have strong effects on the output. Otherwise it's too hard for the optimizers to play along with second-order-effect parameters. IIRC I suggested a two-step process to mitigate this (which you can do with our API; just call the function twice), as sketched below.
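(A sketch of that two-step idea with the current API, reusing the nakagami data from the sketch above; the bounds are illustrative:)

# step 1: fit the shape with loc fixed at its default of 0
res1 = stats.fit(stats.nakagami, data, {'nu': (0.1, 10)})
nu = res1.params.nu
# step 2: refit with loc free, with shape bounds tightened around step 1
res2 = stats.fit(stats.nakagami, data,
                 {'nu': (nu / 2, nu * 2), 'loc': (-1, 1)})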

@tirthasheshpatel (Member) commented:

> Do we want those to be free or fixed by default?

After looking at #11806 and #10908, fixing the location parameter sounds good (context for others: #11806 (comment)). Are there other issues reporting this odd behavior?

@mdhaber (Contributor, author) commented May 24, 2022:

Those are the issues I was thinking of. I can't think of others, but I don't imagine the problem is limited to those distributions. For instance, I think loc being free for discrete distributions could also be surprising.

@rgommers (Member) commented:

> I was fine with fit when I looked at this initially because, in sklearn for instance, almost everything has a fit method, and all of these do very different things for the given algorithm. So context matters.

Methods are quite different, because they're already attached to the thing you provide. fit as a function just breaks descriptive API naming rules; it's always fit_<what> (or estimate_<what> or calc_<what>).

> I suppose there is some other fitting. I thought distribution fitting was the most fundamental fitting in stats, and the name is familiar to users of distributions' fit method, but fit_distribution is ok if need be.

You can argue that for scipy.stats as it is today, indeed, but I'd think not for statistics in general. Regressions like https://www.statsmodels.org/stable/user-guide.html#regression-and-linear-models are a lot more common. So if you keep fit, I guess you're looking at scipy.stats as "lots of statistical distributions, plus some other assorted things" as the scope (and it won't change over time)?

Anyway, it was proposed so you can keep it if y'all agree. I should have flagged this before merging.

@rkern (Member) commented May 24, 2022:

I have the same feelings as @rgommers on this. I'd far prefer a fit_distribution name for the reasons given, but with the timing, I won't insist on it.

Labels: scipy.stats, enhancement (A new feature or improvement)
Closes: fitting discrete distributions (gh-11948)
7 participants