
RFC: stats: univariate distribution infrastructure #15928

Open · mdhaber opened this issue Apr 3, 2022 · 65 comments
Labels: enhancement (A new feature or improvement) · RFC (Request for Comments; typically used to gather feedback for a substantial change proposal) · scipy.stats

@mdhaber
Contributor

mdhaber commented Apr 3, 2022

The existing univariate distribution infrastructure is an extremely valuable feature of SciPy, and it offers two main ways of interacting with distributions.

  1. stats.norm.pdf(2, loc=0, scale=1) calls the pdf method of the norm object that is instantiated at import time.
  2. stats.norm(loc=0, scale=1).pdf(2) uses the __call__ method of the norm object to create a "frozen" instance of the distribution, then calls that object's pdf method.

There are two major advantages of approach 1 (and, to be fair, I'll be happy to add others as they come up in the comments below).

  • It is a bit more natural than approach 2 for very simple invocations. For example, it may be more natural to think of the standard normal PDF as a function and call stats.norm.pdf(2) rather than thinking of (and instantiating) a standard normal object and invoking its pdf method (stats.norm().pdf(2)).
  • It is the more common approach in existing code.

However, there are several important disadvantages.

These could easily be avoided if norm were simply a class (rather than an instance of the norm_gen class) and the user were required to instantiate that class (e.g. norm()) before invoking methods. This has been noted before; for example, see gh-12133 for an earlier discussion of the history and pros/cons of the two approaches.

There are also several issues that have been reported over the years that may not indicate inherent disadvantages of the existing infrastructure, but do suggest that some upgrades are desired. Given certain limitations of the existing infrastructure (e.g. inflexibility w.r.t. adding/removing method arguments), it may be easier to start fresh, reimagining the infrastructure with these enhancements in mind rather than adding patches to the existing framework.

Starting fresh, we would also ensure that everything is designed with vectorization in mind, as we still occasionally discover issues like gh-13504, gh-13581, gh-12192, gh-11746, #18093 (comment), gh-18919.

This enhancement proposal is for a new univariate distribution infrastructure. To be clear, I don't mean that we'd discard the old infrastructure. A lot of the existing algorithmic code would likely be retained (perhaps modernized), and the new infrastructure/distributions would generally need to pass the old tests (with updated syntax). Hopefully, the transition would be easy for users: code involving frozen distributions (approach 2 above) would need minimal changes, and the rest just needs to add some parentheses and pass arguments into the distribution initializer instead of the method itself. But we would probably rewrite parts that define methods by executing strings, for instance. (I understand that was needed at the time, but we could avoid it with a new structure.)

Of course, we'd want to discuss and study other possibilities to make sure we get things right, but it might look something like this.

  • Distribution families are represented by classes. The names are typically the same as existing distribution objects except in CamelCase (norm -> Norm or Normal, loguniform -> LogUniform).
  • Distributions are instantiated before using methods. The distribution class documentation describes distribution shape parameters, which are passed as keyword-only arguments to the class __init__ method (only). Naturally, this makes alternative parameterizations easy to support. Default support for loc and scale is provided, but distributions can disable them as needed. (Contrast against MAINT: stats: fix overparameterized loguniform distribution #15889.) A toy sketch of this usage pattern follows this list.
  • For the most part, public methods include those that are currently available with minor changes (e.g. the name of nnlf is corrected to nllf).
  • The shape parameters used to initialize the distribution can be used as guesses for the fit method, which modifies the shape parameters of the distribution. Methods are available to extract the shape parameters (e.g. see Easy access to scipy.stat.<distributionName> frozen distribution arguments #9321).
  • There are new methods for calculating the mode, partial derivatives of the PDF (useful for random variate generation and fitting parameters to data), and generating the distributions of order statistics.
  • As in the existing infrastructure, public methods are rarely overridden by subclasses; instead, subclasses almost exclusively override private methods. However, the new infrastructure is more strict about this distinction (e.g. there is a new _fit method so that fit overrides do not have to repeat boilerplate code) and there are different versions of private methods that can be overridden depending on (for example) whether the developer is able to vectorize with respect to shape parameters or would prefer to rely on a default loop.
  • Methods have parameters that enable the user to control tradeoffs between precision/reliability and speed. For example, the user can opt out of expensive data validation checks. The user can opt out of special accuracy improvements (e.g. distributions cdf, sf evaluation in the wrong tail #2509) by specifying the desired absolute and/or relative tolerances, which can be passed into the quadrature, minimization, and root finding routines used by the generic implementations of distribution functions. The user can access accuracy estimates.
  • The quadrature, minimization, and root finding routines are natively vectorized rather than requiring looping in Python.
  • The infrastructure supports non-NumPy arrays (e.g. CuPy) and NumPy arrays that support nonstandard numerical types (e.g. mpmath).
  • The infrastructure provides a standard mechanism for switching to special case implementations (e.g. t distribution to normal distribution for df=inf) rather than requiring developers to use np.where/np.lazywhere.
  • Distributions "know" more about their parameters (e.g. like _shape_info, which was recently added to the existing infrastructure). Based on this information, parameter documentation and input validation are generated automatically.
  • Formulae used in overridden methods are documented.
  • The infrastructure is documented (e.g. how to create new distributions, How to define new distributions? #12133).
  • The infrastructure natively supports truncated support, wrapped domains, folded distributions, and complex inputs (e.g. ENH: stats.vonmises.rvs should support scale #17592).
  • The infrastructure favors Pythonic techniques for introspection and avoids self-modification (e.g. adding methods with exec).
  • Distribution-specific tests (TST: stats.rv_continuous: standards for distribution-specific tests #17807 ) assure us that the distribution object represents the distribution it is intended to represent, property-based tests (e.g. see BUG: fix moments method to support arrays and list #12197 (comment)) ensure that the methods are consistent with one another and the standards for SciPy distributions, and benchmarks track the speed and accuracy (especially for numerically challenging arguments).
  • The infrastructure includes a standard framework for investigating the accuracy of each method of the distribution by comparing itself with an mpmath backend. See notes below.
  • The documentation specifies the accuracy the user can expect when using methods within a specified range of inputs (e.g. accurate to X digits for 99.Y% of parameter combinations within hypercube Z).
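
A purely illustrative toy of the instantiation-first pattern mentioned above; the class name, parameter names, and method set here are placeholders, not a proposed final API:

import numpy as np

class Normal:
    """Toy stand-in for an instantiation-first distribution class."""
    def __init__(self, *, mu=0.0, sigma=1.0):  # shape parameters are keyword-only
        self.mu = mu
        self.sigma = sigma

    def pdf(self, x):
        z = (np.asarray(x) - self.mu) / self.sigma
        return np.exp(-z**2 / 2) / (self.sigma * np.sqrt(2 * np.pi))

X = Normal(mu=0.0, sigma=1.0)
print(X.pdf(2.0))  # cf. the current stats.norm.pdf(2, loc=0, scale=1)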

The new infrastructure and distributions would be widely advertised, and documentation and existing code would transition to the new infrastructure. Existing infrastructure/distributions would be supported without deprecation warnings until two releases before SciPy 2.0, at which point they would be moved to a new package that is released once (without planned maintenance).

@tupui @ev-br @tirthasheshpatel @chrisb83

Notes on framework for investigating accuracy:
The infrastructure will accept as arguments arrays containing mpmath.mpf objects. Following the approach of gh-19023, special functions will dispatch to mparray versions where possible. This will allow for end-to-end calculation with arbitrary precision arithmetic and, therefore, automatic generation of reference values. By comparing numerical results computed using floating point arithmetic against these reference values, we can identify shortcomings in the implementation (e.g. ranges of parameter values where catastrophic cancellation causes loss of precision).
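
For illustration, the kind of end-to-end arbitrary-precision evaluation this enables can already be emulated with NumPy object arrays; this is a rough sketch in which np.vectorize stands in for the special-function dispatch described above:

import numpy as np
import mpmath

mpmath.mp.dps = 50  # 50 significant digits
x = np.array([mpmath.mpf('0.1'), mpmath.mpf('2.5')], dtype=object)
# Arithmetic operators broadcast element-wise over object arrays of mpf values;
# transcendental functions need an mpmath dispatch, emulated here with vectorize.
mp_exp = np.vectorize(mpmath.exp, otypes=[object])
pdf = mp_exp(-x**2 / 2) / mpmath.sqrt(2 * mpmath.pi)  # standard normal PDF
print(pdf)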

We cannot hope to test the accuracy for all possible floating point inputs, so we resort to random or quasi-random sampling from possible floating point inputs. From one perspective, it would be ideal if the distribution of floating point inputs tested would be representative of the distribution of floating point inputs used by users. This way, we focus limited resources on parameter ranges experienced in practice. However, this distribution is not known, so we should not make assumptions about it. Instead, I would suggest sampling parameter values uniformly from the set of all valid floating point inputs:

import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(2184358452652)

types = {np.float16: np.int16, np.float32: np.int32, np.float64: np.int64,
         np.int16: np.float16, np.int32: np.float32, np.int64: np.float64}

def float2int(xf):
    """Convert the ith floating point number from zero to integer i"""
    # Adapted from numpy.testing._private.utils.integer_repr
    xf = np.asarray(xf)
    itype = types[xf.dtype.type]
    xi = xf.view(itype).copy()
    comp = np.asarray(np.iinfo(itype).min, dtype=itype)[()]
    xi[xi < 0] = comp - xi[xi < 0]
    return xi

def int2float(xi):
    """Convert integer i to the ith floating point number from zero"""
    # Adapted from numpy.testing._private.utils.integer_repr
    xi = np.asarray(xi).copy()
    comp = np.asarray(np.iinfo(xi.dtype).min, dtype=xi.dtype)[()]
    xi[xi < 0] = comp - xi[xi < 0]
    ftype = types[xi.dtype.type]
    xf = xi.view(ftype).copy()
    return xf

def sample(a, b, size, rng):
    """Sample uniformly from the set of floating point numbers between a and b"""
    int_range = float2int(np.asarray([a, b]))
    ints = rng.integers(*int_range, size=size)
    return int2float(ints)

a, b = 1e-2, 1e2
floats = sample(a, b, size=10000, rng=rng)
plt.hist(np.log10(floats))
plt.show()

(figure: histogram of log10(floats), approximately uniform over [log10(a), log10(b)] = [-2, 2])

As we might expect, the floating point numbers between positive a and b are approximately log-uniformly spaced. However, the log-uniform distribution is only defined over positive support, whereas a and b in sample above may also be negative.
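
A quick sanity check of the log-uniform spacing, reusing float2int from the snippet above: each binade [2**k, 2**(k+1)) contains exactly 2**52 float64 values.

ints = float2int(np.asarray([1.0, 2.0]))
print(ints[1] - ints[0] == 2**52)  # True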

@tupui
Member

tupui commented May 20, 2022

I am +100 on this proposal. There are too many issues with the current infrastructure and it's very hard to maintain. Starting off with a clean slate will definitely be a good thing. (E.g. It would be a good opportunity to make use of UNU.RAN and QMC)

@chrisb83
Member

Thanks for writing this clear proposal. I will take a closer look at gh-12133 to learn more about the background of this topic. Overall it makes a lot of sense to me.

At this point, I have just one comment:

There are new methods for calculating the mode and partial derivatives of the PDF (useful for random variate generation and fitting parameters to data).

Maybe dlogpdf (derivative of logpdf) could be considered, too: potentially it is useful for expressing nllf, and I think some of the UNU.RAN methods use it (I need to check in more detail). So not very important at this stage, but let's keep it in mind.
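
For reference, even before any distribution overrides it, a dlogpdf could fall back to finite differences. A minimal sketch against the current API; dlogpdf itself is hypothetical:

import numpy as np
from scipy import stats

def dlogpdf_wrt_x(frozen, x, eps=1e-6):
    """Central-difference derivative of logpdf with respect to x."""
    return (frozen.logpdf(x + eps) - frozen.logpdf(x - eps)) / (2 * eps)

frozen = stats.norm(loc=1.0, scale=2.0)
print(dlogpdf_wrt_x(frozen, np.array([0.5, 1.5])))  # analytically -(x - loc)/scale**2 -> [0.125, -0.125]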

@josef-pkt
Member

I just saw this.

This will be difficult.
Many of the problems are orthogonal to class versus instance, e.g. vectorizing when support depends on parameters is never trivial.

Main design issue: stateless classes and instances versus (some) fixed parameters (fixed attributes of a class defined in __init__).
Both are convenient and very useful for different cases and distributions.
Allowing both in a generic way looks to me very difficult.

@tupui
Member

tupui commented May 23, 2022

@josef-pkt:

Allowing both in a generic way looks to me very difficult.

One of the main points of this proposal is to remove the concept of frozen distributions. There would be only one way to play with distributions: using a class instance. (Or a factory, or something else; we need to take some time to play with different API architectures and investigate them with various distributions.) The current duality is too problematic/buggy and complicates the user API (and maintenance). We want a single point of entry for shape parameters, the init of the class, and then you just do what you want with the distribution.

The point is that we only provide a single Pythonic and clean way to do things on the user side. For users already using the class-based approach, not much should change. But in the background, things would be extremely different, with a more manageable infrastructure which can evolve (e.g. right now adding a QMC sampler would be a pain), be documented properly, etc.

@tirthasheshpatel
Member

I too am +1 for overhauling the frozen distribution infrastructure. To minimize breaking code, it would be good to keep all the distributions in the stats module as-is and declare them legacy.

Main design issue: stateless classes and instances versus (some) fixed parameters (fixed attributes of a class defined in __init__.

@josef-pkt I think simply initializing a new class for different parameters to achieve the stateless class behavior isn't too bad. We can also add static methods in the new distribution classes that offer the stateless classes syntax. But not having it at all also doesn't seem like an unreasonable trade-off.

@tupui
Member

tupui commented May 24, 2022

@josef-pkt I think simply initializing a new class for different parameters to achieve the stateless class behavior isn't too bad. We can also add static methods in the new distribution classes that offer the stateless classes syntax. But not having it at all also doesn't seem like an unreasonable trade-off.

Just to be clear. The goal is to start with a clean slate, meaning the API can break things. We should only consider adding things if they make sense in terms of architecture. Not to comply with old behaviours. I would not add any mechanism to have a stateless distribution.

@josef-pkt
Member

Instantiating a new class for different parameters sounds like a large overhead.
What's the overhead of creating a new instance each time when I simply want to call norm.pdf(loc, scale), or calling logpdf inside fit?

On the other hand, there are some distributions that would benefit from precomputing some constants for method calls with the same distribution parameters.

(In statsmodels we use the equivalent of stateless, semi-frozen (some parameters fixed in __init__) and frozen distributions.)
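
The overhead in question is measurable with the current API; a rough micro-benchmark (results machine-dependent):

import timeit
from scipy import stats

# Direct method call vs. freezing (creating an instance) on every call.
t_direct = timeit.timeit(lambda: stats.norm.pdf(1.5, loc=0.0, scale=1.0), number=10_000)
t_frozen = timeit.timeit(lambda: stats.norm(loc=0.0, scale=1.0).pdf(1.5), number=10_000)
print(f"direct: {t_direct:.3f} s, frozen: {t_frozen:.3f} s")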

@mdhaber
Contributor Author

mdhaber commented May 25, 2022

We would be able to change the parameters of an existing object with a method call, which is something we'd want for fit anyway. In typical cases, only the parameters that change would need to be re-processed (like your semi-frozen distributions). In some cases, this is about as much overhead as we have now; in all other cases, this is less overhead than what we have now. Distribution objects could be wrapped to behave statelessly at the expense of, in the worst case, one method call (to change the parameters) relative to what we have now. I'd expect the efficiency gains from cleaning up the code would more than make up for this. Perhaps the old distribution objects could be replaced with these wrappers.

@josef-pkt
Member

I will see it when you have a prototype

tupui didn't like it when I moved most parameters to the methods in copulas so we can use it as a semi-frozen or unfrozen distribution. But that looks like the better design to me for building models on top of the distributions.

In statsmodels, all (almost all) parameters that are estimated in models are arguments in model methods. In some cases it gets messy when we need to change additional model attributes that are not method arguments.
(But our models are a lot more complicated than a scipy distribution.)

One pattern that I use is to allow both, set some parameters in __init__ as attributes, and allow method arguments to override them. However, this is fragile if there are extra model attributes that are computed from the changed attributes.

@josef-pkt
Member

just one more usecase that I remembered

Finite difference derivative approximations need to compute the logpdf at different distribution parameters in one function.
Extreme case:
A model in statsmodels that directly uses the scipy distribution for logpdf (in most cases we have our own functions for that, but not for all).
Using a Newton optimizer requires logpdf and its first and second derivatives (score and hessian). If those are not available in explicit form, then score and hessian are computed using numerical derivatives (finite difference or complex step).
So, each iteration of the optimizer will call the logpdf/loglike many times with different parameter values.

On the other hand, when we have estimated the model, then we have a get_distribution method that transforms the estimated parameters into the parameterization of the scipy distribution and returns a frozen distribution to the user, which then can be used to compute pdf, cdf, rvs based on the estimated model and parameters.
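
A minimal sketch of the pattern described above, using the current API: a central-difference score requires many loglike evaluations at nearby parameter values on every optimizer iteration.

import numpy as np
from scipy import stats

data = stats.norm.rvs(loc=2.0, scale=1.5, size=100, random_state=123)

def loglike(params):
    loc, scale = params
    return np.sum(stats.norm.logpdf(data, loc=loc, scale=scale))

def score(params, eps=1e-6):
    """Central-difference gradient: 2 * len(params) loglike evaluations."""
    params = np.asarray(params, dtype=float)
    grad = np.empty_like(params)
    for i in range(len(params)):
        step = np.zeros_like(params)
        step[i] = eps
        grad[i] = (loglike(params + step) - loglike(params - step)) / (2 * eps)
    return grad

print(score([2.0, 1.5]))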

@tupui
Member

tupui commented May 27, 2022

Following your example. You could write a small wrapper that would re-use a given class instance and just vary attributes (shape parameters). It might even be faster than instantiating a frozen distribution every time you want to update some parameters.

I don't see any issues there.

tupui didn't like it when I moved most parameters to the methods in copulas

What I did not like at the time was the use of *args, **kwargs, which IMO is not a good design as it prevents easy inspection of method options/params, documentation, etc. It seems like the situation has improved a bit, from my perspective, with more descriptive args, but it could still be more explicit in most cases.

@josef-pkt
Member

You could write a small wrapper that would re-use a given class instance and just vary attributes (shape parameters). It might even be faster than instantiating a frozen distribution every time you want to update some parameters.

But that's not needed right now because the scipy.stats distributions have stateless methods, i.e. we don't need to create a new instance when changing parameters, we always use the same instance.

If we lose that, then we cannot use any scipy distribution in the estimation models without fully wrapping it, and we have to carry around our own case-specific instances.
And users that want to use our GenericLikelihoodModel will also need to have a wrapper for the scipy distributions.

@tupui
Member

tupui commented May 27, 2022

I don't believe that having to modify the shape parameters of a distribution is the common use of distributions. I am not saying it is not useful, but it does not represent the majority of usages. I do not think we should maintain such a mechanism, as it's quite straightforward to do on the user/downstream side.

dist = toto(a, b)

for ai, bi in ...:
    dist.a, dist.b = ai, bi
    dist.pdf(x)

    # instead of
    # toto.pdf(x, ai, bi)

@josef-pkt
Member

josef-pkt commented May 27, 2022

dist.a, dist.b = ai, bi
dist.pdf(x)

If you allow that, then you don't gain anything at all over the current implementation.
If it is allowed, then the distributions still need to be stateless except for the distribution args, i.e. no precomputed attributes, or you have to wipe and update them each time.

For users, this is pain without gain.

I don't believe that having to modify the shape parameters of a distribution is the common use of distributions.

Shape parameters, but also loc and scale.
It's one of the main usecases in statsmodels.

Other examples:

  • confidence intervals for parameters that sometimes need rootfinding, and similarly profile likelihood (confint)
  • power and sample size computation where scale or shape parameters change (generic sample size computation again requires rootfinding for different parameters; in statsmodels those functions directly use scipy distributions, mainly norm, t, f, chi2 and their non-central versions, but also beta, binomial, poisson, gamma, ... for count models)
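
A minimal sketch of that root-finding pattern with the current API; the target value 0.05 and the bracket are illustrative only:

from scipy import optimize, stats

# Solve a cdf/sf constraint for a shape parameter: each brentq iteration
# evaluates stats.t.sf with a different df.
df = optimize.brentq(lambda df: stats.t.sf(2.0, df) - 0.05, 2, 100)
print(df, stats.t.sf(2.0, df))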

statsmodels uses frozen distributions only as user convenience. Users don't have to carry around the distribution args when providing or receiving a distribution instance.

Bayesian or Monte Carlo applications require mixture distributions where parameters change. Very often those cannot be fully vectorized and we cannot avoid many calls to distribution methods with different parameters (shape, loc, scale).

Aside:
statsmodels has 1664 matches for "stats.". I guess more than three quarters of those are for scipy.stats, and the large majority of those are calls to scipy distributions.
Most of the distribution calls just treat the distribution as a function, as a convenience wrapper around scipy.special;
for example pval = stats.chi2.sf(chi2stat, k_constraints)

@tirthasheshpatel
Member

@mdhaber Can you also add gh-11026 to the potential changes we want in the new API?

@rlucas7
Member

rlucas7 commented May 9, 2024

Thanks everyone who worked on this behind the scenes and to those who've provided comments. My 2 cents

On moments:

  • Add cumulant support, since going between them and moments is sometimes useful generally and is essentially required to compute moments in closed form for sum distributions

Cumulants and moments (e.g. mapping between the two) are useful for both continuous and discrete distributions in applications. Cornish-Fisher expansions come to mind, as do saddlepoint expansions, which are sometimes formulated this way.

For discrete distributions (and maybe sometimes continuous ones) it is often easiest to write a distribution's moments by working in the factorial basis and converting to raw moments.
Two methods which could do this would use the Stirling numbers of the first and second kinds:

$$ x^n = \sum_{k=1}^{n}{n \brace k} (x)_k $$

and

$$(x)_{n}=\sum_{k=1}^{n} {n \brack k} x^{k}$$

so converting back and forth is a change of basis (but can make computations easier).
If we can leverage these change of basis it might be a nice use case.

Edit: Reading closer into the PR, it looks like the two changes of basis above would be something like the "transform" method in @mdhaber's PR, but the existing nomenclature ("transform") conflates change of basis in general with cumulants (one specific basis representation). It seems like the above change of basis might be another, alternate "transform" method; perhaps we want more transform methods, or a modifier on the name? (Whatever's simpler.)

I have a WIP PR for Stirling numbers of the first kind, and the second kind was merged a release or so ago. These codes are in scipy.special and support broadcasting.

Example common distributions where this is handy: Poisson, binomial, and negative binomial. All three have representations of arbitrary positive-integer moments in the factorial basis that are much cleaner than the raw moments.
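
For instance, a sketch using scipy.special.stirling2 (the second-kind routine mentioned above), taking the Poisson case, where the k-th factorial moment is lam**k:

import numpy as np
from scipy import stats
from scipy.special import stirling2

def poisson_raw_moment(n, lam):
    """E[X^n] for X ~ Poisson(lam) via the factorial basis:
    E[X^n] = sum_k S(n, k) * lam**k."""
    k = np.arange(1, n + 1)
    return np.sum(stirling2(n, k) * float(lam)**k)

print(poisson_raw_moment(3, 2.5))    # 36.875
print(stats.poisson(2.5).moment(3))  # matches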

  • Mixed continuous and discrete distributions are rare but do exist (e.g. the Tweedie distribution, https://en.wikipedia.org/wiki/Tweedie_distribution). It'd make more sense to me for the distribution infrastructure to be as support-independent as possible, even down to being generic over it.

IIUC @mdhaber has noted this is planned; at least I'm assuming that's what is meant by mixture distributions (and not variance scale-mixture families, which are a wholly different class of distributions).

  • Type annotations, it's confusing to tell what function params actually mean when implementing them

+1 these would be nice to have.

One thing I haven't noticed mentioned (though it's possible I missed it) is: which methods are required and which aren't when implementing a new distribution?

There are some distributions which have inverse CDFs easily represented but not CDFs, and some which have nice algorithms for sampling but not for other things like easy CDFs (an example is the Dickman or Goncharov distribution, which has some nice CFTP sampling approaches). Tweedie is maybe another one in this domain for certain values of the p parameter, and Polya-Gamma is another.

These are maybe less important to the generalist scipy user, but for infrastructure we might consider them, if nothing else at least to determine and document whether they, or distributions like them, would or would not be implementable using the planned infrastructure.

(apologies in advance if the 'public' interface aspect is noted somewhere in thread above and I missed)

Edit: it looks like some methods raise NotImplementedError, which seems like a good design to me. If we can keep that and be OK in cases where certain methods aren't workable (i.e. raise the exception/error), that would work for these distributions. Still, it would be good to document which methods are truly the minimal necessary set.

-thanks again for all the hard work here-

@tupui
Member

tupui commented Jun 3, 2024

Thanks everyone for your comments and support, we really appreciate you taking the time.

I am trying in the following to address the comments. Feel free to add more comments if I forgot anything.

Based on the overall positive response, thank you again, @mdhaber is going to open a PR. I will post another message on the forum for it too.


Have you considered making the distribution naming a little bit more consistent

We did consider alternative naming schemes (we ran a poll among ourselves, too), and the API we have now is the solution we came up with. Being slightly more descriptive with complementary_cdf would be nice, but then for consistency we would need inverse_log_complementary_cdf, which seems too long for us. We also try to avoid adding abbreviations like inv or comp, as we have had issues in the past with acronyms becoming slurs. Just a precaution here.

Which methods use the RNG? Right now, there's only sample? Would it make sense to just pass in the RNG into those methods?

We had quite some back and forth there too and are still discussing this point. Some of us think we should not allow passing an RNG in methods and should only allow setting it at init. Right now it does both. This is still a point open for discussion; we will have a second look.

Also, I suggest moving the fit function out of the distribution and making it a free function. After all, it doesn't seem to rely on private members, so by OO principles, it should be a free function.

Yes and no. Things like staticmethod, for instance, exist to group methods under a given "namespace" for logical grouping. We will think about this once more.

But yes, after discussing this, we are going to remove the fit method from the initial API, and we will think about how to follow up.

Have you considered adding a base Distribution class ... It would serve as a type annotation for generic functions that accept distributions

Typing is not going to be part of this first iteration. But it's definitely something that we are going to look into in follow ups. (I am myself very much interested into typing. There are just a lot of initiatives right now happening on that front and waiting a little bit more might help us get there.)

distribution object immutable

In the previous design, there were two ways to create a distribution: a classical class instance, and a frozen one. This design led to a lot of confusion in the community, and this is precisely one of the pain points this new design addresses.

We want to allow class mutability to be able to change parameters (in a controlled manner with proper checks). This will be important for some downstream libraries.

detach tol, cache, and rng from the object

See above for RNG, otherwise we think it's a better API that way.

support mixed continuous and discrete distribution

Mixed and Discrete distribution are definitely on the table and will be a follow up.

The notebook example suggests that a distribution object may be instantiated with vector parameters. But the documentation gives no clue of this

More documentation will come. Now we are focusing on the basics.

Normal._pdf_formula ... remove self and make the method a classmethod or staticmethod, or remove the distribution parameters and only access them via self

Thanks, we will have another look and consider it.

sf to ccdf

We had some discussions here when we added features around survival analysis, and the sf naming is less familiar to non-experts than the complement of the CDF.

as compound Poisson Gamma (Tweedie).

A general note about distributions: we don't want to go back to the current status where we have more than 100 distributions. Instead, we want to make an API which allows anyone to add their custom distributions.

We are actually thinking about making a dedicated library to host user-defined distributions. We don't need to bloat the library itself and, most importantly, add to the maintenance pile more than we have to.

I'd also want a way to select the distribution's sampling strategy. It may be useful to force inverse-transform, or ziggurat, or Box-Muller for the Normal distribution in particular.

There is a method parameter to do basic method selection. But we are going to work on this more. The end goal is to use methods which are Array API compatible so that sampling can be fast and library-independent.

Something else we will look into is how we can integrate the work done in stats.sampling.FastGeneratorInversion with UNU.RAN. Here as well, we just need to see how it will fit with Array API support.
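
For reference, the existing machinery referenced here can be used today; a small sketch:

import numpy as np
from scipy import stats
from scipy.stats.sampling import FastGeneratorInversion

# UNU.RAN-backed inverse-transform sampling wrapped around a frozen distribution.
gen = FastGeneratorInversion(stats.norm(), random_state=np.random.default_rng(123))
sample = gen.rvs(1000)
print(sample.mean(), sample.std())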

cumulant support,

We can consider that in a follow up.

One thing I haven't noticed mentioned (though it's possible I missed it) is: which methods are required and which aren't when implementing a new distribution?

We have a diagram for all the logic but we need to do some small polishing. But yes, it will be in the doc.

@josef-pkt
Member

I think dropping the fit method is not a good idea:

It's the main use case where we need to call methods with different parameters, and you need it for the design decisions.

Current distributions have many distribution-specific fit methods that are useful and took a long time to add and improve.
It's very difficult to fit some of those distributions that are not well behaved at boundaries or because of changing support.

@tupui
Member

tupui commented Jun 3, 2024

Moving the fit method out for now does not mean that (i) it will not be added back, or (ii) we cannot use any specialisation. We are considering all of that, don't worry.

@josef-pkt
Member

I gave up worrying and just hope for a loooong deprecation period

@WarrenWeckesser
Member

WarrenWeckesser commented Jun 3, 2024

+1 for immutable (fewer headaches in the long run).
+1 for implementing fit as a class method, where the method generates a new instance of the class.
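
A toy sketch of fit-as-classmethod returning a new instance; the names are illustrative, and the moment-matching "fit" is a stand-in for a real estimator:

import numpy as np

class Normal:
    def __init__(self, *, mu, sigma):
        self.mu, self.sigma = mu, sigma

    @classmethod
    def fit(cls, data):
        data = np.asarray(data)
        return cls(mu=data.mean(), sigma=data.std())  # new instance; nothing mutated

X = Normal.fit([1.0, 2.0, 3.0])
print(X.mu, X.sigma)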

@josef-pkt
Member

josef-pkt commented Jun 3, 2024

Immutable without args in the methods kills it for any fitting application; otherwise we have to create a new instance each time during fitting.

(Models in statsmodels are supposed to be immutable (by the user), but not having some extra args in the methods makes that very annoying in some use cases. We have some backdoor mutating in internal applications, but it looks dirty.
I just created a set of classes that have a set_params method, but I don't feel comfortable with that either. It's too easy to mutate the wrong instance.
)

Depends also on the cost of creating a new instance. We create many instances all the time that are cheap.

@tupui
Member

tupui commented Jun 3, 2024

Yes, immutability is off the table. As I said, we want to support changing parameters for things like fitting.

@NeilGirdhar
Contributor

NeilGirdhar commented Jun 3, 2024

we have to create a new instance each time during fitting.

Why would that be a problem? Do you have an example of a program that is limited by that?

This is how I do fitting in efax. (Admittedly, the situation is a bit different for exponential family distributions.)

Would it be possible to clarify why fitting needs mutable distributions?

It also might help to see exactly how you expect a fitting example to work. I know that a "joint distribution transformer" is on the list, and it would be nice to see what you expect to happen in that case. I had to redesign my approach to make fitting work with joint distributions.

@tupui
Member

tupui commented Jun 3, 2024

Before having more meta-discussions around that, please also have a look at the timings Matt posted above. You really need to look at this with fresh eyes.

@josef-pkt
Member

Would it be possible to clarify why fitting needs mutable distributions?

The main reason is overhead of instance creation, and convenience of calling a method if it has a parameter argument.

@NeilGirdhar
Contributor

NeilGirdhar commented Jun 3, 2024

The main reason is overhead of instance creation,

Would you happen to have a program that illustrates why this overhead might be problematic? The reason I ask is twofold. First, with the broadcasting interface, if you're doing a lot of fitting in parallel, then you can fit them all at once, and only one instance is created for all of the random variables. And second, nowadays, there are plenty of mathematical JITs that would erase all of the overhead. So, if you could show an example, we could explore if a JIT would solve your problem.

I think it would be a shame to complicate the interface for a problem that may no longer be a problem.

and convenience of calling a method if it has a parameter argument.

I'm not sure why calling a method is any nicer than instantiating and calling a method on the instance? I find the Python principle of there being "one way to do things" more motivating, personally.

Please also have a look at the timings Matt put above as well. You really need to look at this with fresh eyes.

I read (and loved) Matt's comment when I hearted it. I see that you're referring to the timings. I just don't see how creating instances is problematic in practice, but maybe my applications are different than other people's.

@tupui
Member

tupui commented Jun 3, 2024

I read (and loved) Matt's comment when I hearted it. I see that you're referring to the timings. I just don't see how creating instances is problematic in practice, but maybe my applications are different than other people's.

Oh I was saying that for everyone here. Plus we had a talk with you so I know you have more context from us.

What I am saying here is that if we really want to have a meaningful discussion, then everyone needs to be on the same page by pulling Matt's branch locally and playing with it. Otherwise it's easy to think that it works a certain way when it actually does something else entirely. Talking about overhead and such without trying it out does not make sense to me. The new API is very different from the previous one.

@fancidev
Contributor

fancidev commented Jun 3, 2024

For brainstorming purposes, Java and C# address the mutable/immutable dilemma (for String) by keeping String immutable but having a separate, mutable StringBuilder class. Do whatever you like with StringBuilder, and once you're happy with the result, create a String from it for later use.

I feel this model may be an overkill for Python, so I don’t intend to propose it as a solution. Just for your info.

@josef-pkt
Member

Actually, it's difficult to find an example in statsmodels that uses a scipy distribution in fit with variable parameters.

For most distributions we have explicit formulas for the loglike and its derivatives.
For loc-scale distributions without shape parameter we can just use the standard distribution on the transformed data.

The only example I currently find:
For the t distribution model with estimated df, the loglikelihood uses
llike = - stats.t._logpdf(errorsest/scale, df) + np_log(scale)
which is then used with scipy.optimize.

(Models with other distributions that have shape parameters, e.g. for failure time and extreme values, are still missing in statsmodels; NumFOCUS thought they were not important.)

So, maybe I don't need to care.

The only other time-consuming loops that I remember right now are rootfinding for power and sample size computation, which in many cases have to solve a cdf constraint for a shape parameter.

Most other application of scipy distribution in statsmodels are one time calls where overhead would be unimportant most of the time.
(I have a recent case with distribution.expect numerical integration inside rootfinding. But that is so slow that it's not usable for users anyway. :)

You can also time the fit method in scipy to see whether creating instances has a non-negligible overhead.

@tupui
Member

tupui commented Jun 3, 2024

For brainstorming purposes, Java and C# address the mutable/immutable dilemma (for String) by keeping String immutable but having a separate, mutable StringBuilder class. Do whatever you like with StringBuilder, and once you're happy with the result, create a String from it for later use.

I feel this model may be an overkill for Python, so I don’t intend to propose it as a solution. Just for your info.

To be clear, this ship has sailed. We are not going back to a state where we can have frozen distributions.

@NeilGirdhar
Contributor

NeilGirdhar commented Jun 3, 2024

@josef-pkt

Thanks for explaining your cases: optimize and root finding. Have you considered using Jax? Here's an illustrative example program:

from collections.abc import Callable, Generator
from contextlib import contextmanager
from time import perf_counter

import jax
import jax.numpy as jnp
import numpy as np
from efax import NormalVP
from jax.experimental import enable_x64
from jax.scipy.optimize import minimize as jminimize
from scipy.optimize import minimize
from scipy.stats import norm

with enable_x64():
    data = norm(loc=2.3, scale=1.5).rvs(size=10000)
    jax_data = jnp.asarray(data)
    guess = np.asarray([1.0, 2.0])
    jax_guess = jnp.asarray(guess)

    @contextmanager
    def catchtime() -> Generator[Callable[[], float], None, None]:
        t1 = t2 = perf_counter()
        yield lambda: t2 - t1
        t2 = perf_counter()

    def f(mv: np.ndarray) -> float:
        v = np.exp(mv[1])
        return -np.sum(norm(loc=mv[0], scale=np.sqrt(v)).logpdf(data))  # pyright: ignore

    def g(mv: np.ndarray) -> float:
        v = np.exp(mv[1])
        return -np.sum(norm.logpdf(data, loc=mv[0], scale=np.sqrt(v)))

    def h(mv: jax.Array) -> jax.Array:
        v = jnp.exp(mv[1])
        return -jnp.sum(NormalVP(mean=mv[0], variance=v).log_pdf(jax_data))

    @jax.jit
    def mh(g: jax.Array) -> jax.Array:
        return jminimize(h, g, method='BFGS').x

    mh(jax_guess)  # Compile.

    with catchtime() as t:
        minimize(f, guess)

    with catchtime() as u:
        minimize(g, guess)

    with catchtime() as v:
        mh(jax_guess)

    print(f"construct: {t():.1e}, no construct: {u():.1e}, jax: {v():.1e}")

gives:

construct: 2.2e-02, no construct: 6.6e-03, jax: 7.1e-05

So, at least on this problem, Jax is about two orders of magnitude faster than the fastest SciPy solution.

@josef-pkt
Member

@NeilGirdhar

Switching backend or supporting backends other than numpy/scipy/pandas is currently not planned for statsmodels.
The priority is still on adding statistics and econometrics that is currently not available in Python, with reasonable performance given current tools and not on finding the fastest implementation.
(There are several other Python packages that have parts of statsmodels' functionality but oriented on speed and performance.)

@rkern
Member

rkern commented Jun 4, 2024

We don't want to go back to the current status where we have more than 100 distributions.

Is there a record of these policy decisions somewhere? I don't recall this being a part of the roadmap when the work got started.

@NeilGirdhar
Contributor

@josef-pkt I understand. The reason I wanted to illustrate the Jax solution is to support my argument that I don't think that speed should be a significant consideration when making design decisions. Whether you need to construct objects or not is totally irrelevant if you care about speed because if you care about speed, you should just use a JIT.

And just so that we're clear: I'm not sure if you meant "array library" when you said "backend"? Because the Jax solution is a different array library, but it's the same backend (CPU).

In short, my personal preference is for design decisions to be focused on ergonomics over efficiency since JITs provided by Array API libraries should solve the efficiency problem.

@tupui
Member

tupui commented Jun 4, 2024

We don't want to go back to the current status where we have more than 100 distributions.

Is there a record of these policy decisions somewhere? I don't recall this being a part of the roadmap when the work got started.

We did not make a public official announcement specific to that yet, no.

While we iterated on this RFC with the other SciPy stats maintainers, it became clear to us that we no longer wanted to leave the door open to everything. The maintenance cost is just too high, and none of us wants to review or maintain more than we must.

We feel that a community effort around that would be a more sustainable solution. The new API will make it simpler to add any distribution, and this does not need to live in SciPy. Hence we will push for another repository proposing a collection of distributions. I plan to kick-start this repository later this year, but in the end, I and the other SciPy stats maintainers will not provide any guarantees for that repo.

And just so that we're clear: I'm not sure if you meant "array library" when you said "backend"? Because the Jax solution is a different array library, but it's the same backend (CPU).

In short, my personal preference is for design decisions to be focused on ergonomics over efficiency since JITs provided by Array API libraries should solve the efficiency problem.

Support for Array API will be added right after. The team has been quite busy adding support to stats and preparing the ground to enable a full support with the new distribution infrastructure.

@rkern
Member

rkern commented Jun 4, 2024

While we iterated on this RFC with the other SciPy stats maintainers, it became clear to us that we no longer wanted to leave the door open to everything. The maintenance cost is just too high, and none of us wants to review or maintain more than we must.

Where did this discussion take place? Not on the mailing list or the Discourse.

@rgommers
Member

rgommers commented Jun 4, 2024

Is there a record of these policy decisions somewhere? I don't recall this being a part of the roadmap when the work got started.

We did not make a public official announcement specific to that yet no.

While we iterated on this RFC with the other SciPy stats maintainers, it became clear to us that we no longer wanted to leave the door open to everything. The maintenance cost is just too high, and none of us wants to review or maintain more than we must. [...]

That's kinda not how this works. Decisions get made in public, with clear rationales. I suggest to focus on getting the new infrastructure merged without making such definitive statements on what can and cannot be contributed in the future. Deprecations/removals of existing API surface are the same. It can be mixed into one huge discussion, but I'd advise keeping it separate. It's not at all clear to me what the best way forward is for the old (very widely used) infra.

The new framework seems to be really good. Let's get that completed, used, and then worry about the rest.

@tupui
Member

tupui commented Jun 4, 2024

That's kinda not how this works. Decisions get made in public, with clear rationales. I suggest to focus on getting the new infrastructure merged without making such definitive statements on what can and cannot be contributed in the future.

To be clear, this is not a done deal, in the sense that this RFC is not about that. This is just me saying that some of the current stats folks (@mdhaber, @steppi, @tirthasheshpatel, @ev-br, @chrisb83, @dschmitz89 and myself) had six meetings together, and we felt that we did not want to go back to this state and spend our time on this in the future.

@fancidev
Contributor

fancidev commented Jun 4, 2024

We feel that a community effort around that would be a more sustainable solution. The new API will make it simpler to add any distributions and this does not need to live in SciPy. Hence we will push for another repository proposing a collection of distribution. I plan to quick start this repository later this year, but in the end, I and other SciPy stats maintainers will not provide any guarantee on that repo.

I like this idea. I'm impressed by the large, comprehensive set of distributions in SciPy. There could be some room for improvement, as I find some distributions less polished than others. A community effort could help in this regard, I think.

@lucascolley
Member

For those who are following this issue but may have missed the Discourse post, see gh-21050 for a first PR for the new infra.
