RFC: stats: univariate distribution infrastructure #15928
I am +100 on this proposal. There are too many issues with the current infrastructure and it's very hard to maintain. Starting off with a clean slate will definitely be a good thing. (E.g. it would be a good opportunity to make use of UNU.RAN and QMC.)
Thanks for writing this clear proposal. I will take a closer look at issue 12133 to learn more about the background of this topic. Overall it makes a lot of sense to me. At this point, I have just one comment:
Maybe
I just saw this. This will be difficult. Main design issue: stateless classes and instances versus (some) fixed parameters (fixed attributes of a class defined in
@josef-pkt to
One of the main points of this proposal is to remove the concept of frozen distributions. There would be only one way to play with distributions: using a class instance. (Or a factory, or something else. We need to take some time to play with different API architectures and investigate that with various distributions.) The current duality is too problematic/buggy and complicates the user API (and maintenance). We want to have a single point of entry for shape parameters, the init of the class, and then you just do what you want with the distribution. The point is that we only provide a single Pythonic and clean way to do things user side. For users already using the class-based approach, it should not deviate much. But in the background things would be extremely different, with a more manageable infrastructure which can be evolved (e.g. right now adding a QMC sampler would be a pain), documented properly, etc.
I too am +1 for overhauling the frozen distribution infrastructure. To minimize breaking code, it would be good to keep all the distributions in the
@josef-pkt I think simply initializing a new class for different parameters to achieve the stateless class behavior isn't too bad. We could also add static methods in the new distribution classes that offer the stateless-class syntax. But not having it at all also doesn't seem like an unreasonable trade-off.
Just to be clear: the goal is to start with a clean slate, meaning the API can break things. We should only consider adding things if they make sense in terms of architecture, not to comply with old behaviours. I would not add any mechanism to have a stateless distribution.
Instantiating a new class for different parameters sounds like a large overhead. On the other hand, there are some distributions that would benefit from precomputing some constants for method calls with the same distribution parameters. (In statsmodels we use the equivalent of stateless, semi-frozen (some parameters fixed in
We would be able to change the parameters of an existing object with a method call, which is something we'd want for
I will see it when you have a prototype. tupui didn't like it when I moved most parameters to the methods in copulas so we can use it as a semi-frozen or unfrozen distribution. But that looks like the better design to me for building models on top of the distributions. In statsmodels, (almost) all parameters that are estimated in models are arguments in model methods. In some cases it gets messy when we need to change additional model attributes that are not method arguments. One pattern that I use is to allow both: set some parameters in
Just one more use case that I remembered: finite-difference derivative approximations need to compute the logpdf at different distribution parameters in one function. On the other hand, when we have estimated the model, then we have a
Following your example, you could write a small wrapper that would re-use a given class instance and just vary attributes (shape parameters). It might even be faster than instantiating a frozen distribution every time you want to update some parameters. I don't see any issues there.
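A minimal sketch of such a wrapper, assuming the new-style distribution exposes its shape parameters as plain mutable attributes. `ToyNormal` and `ReusingWrapper` are hypothetical stand-ins for illustration, not the actual SciPy API:

```python
import math

class ToyNormal:
    """Stand-in for a new-style distribution object with mutable parameters."""
    def __init__(self, loc=0.0, scale=1.0):
        self.loc, self.scale = loc, scale

    def logpdf(self, x):
        z = (x - self.loc) / self.scale
        return -0.5 * z * z - math.log(self.scale) - 0.5 * math.log(2 * math.pi)

class ReusingWrapper:
    """Re-use one instance, updating its attributes before each call."""
    def __init__(self, dist):
        self.dist = dist

    def logpdf(self, x, **params):
        for name, value in params.items():
            setattr(self.dist, name, value)  # vary attributes in place
        return self.dist.logpdf(x)

w = ReusingWrapper(ToyNormal())
# Stateless-looking calls without creating a new instance each time:
vals = [w.logpdf(1.0, loc=m, scale=1.0) for m in (0.0, 0.5, 1.0)]
```

This gives downstream code a method-call syntax with per-call parameters while the underlying object is created only once.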
What I did not like at the time was the use of
But that's not needed right now because the scipy.stats distributions have stateless methods, i.e. we don't need to create a new instance when changing parameters; we always use the same instance. If we lose that, then we cannot use any scipy distribution without fully wrapping it for the estimation models, and carrying around our own case-specific instances.
I don't believe that having to modify the shape parameters of a distribution is the common use of distributions. I am not saying this is not useful, but that it does not represent the majority of usages. I do not think we should maintain such a mechanism as it's quite straightforward to do on the user/downstream side.

```python
dist = toto(a, b)
for ai, bi in ...:
    dist.a, dist.b = ai, bi
    dist.pdf(x)

# instead of
toto.pdf(x, ai, bi)
```
If you allow that, then you don't gain anything at all over the current implementation. For users, this is pain without gain.
Shape parameters, but also loc and scale. Other examples: statsmodels uses frozen distributions only as a user convenience, so users don't have to carry around the distribution args when providing or receiving a distribution instance. Bayesian or Monte Carlo applications require mixture distributions where parameters change. Very often those cannot be fully vectorized, and we cannot avoid many calls to distribution methods with different parameters (shape, loc, scale). Aside:
Thanks everyone who worked on this behind the scenes and to those who've provided comments. My 2 cents on moments:
Cumulants and moments (e.g. mapping between the two) are useful for both continuous and discrete distributions in applications. Cornish-Fisher comes to mind, as do saddlepoint expansions, which are sometimes formulated this way. For discrete distributions (and maybe sometimes continuous ones) it is often easiest to write a distribution's moments by working in the factorial basis and converting to raw and central moments (for whatever reason the GitHub LaTeX isn't rendering the latter change of basis correctly), so converting back and forth is a change of basis (but can make computations easier). Edit: I read closer into the PR; it looks like the two changes of basis above would be something like the
I have a WIP PR for first-kind Stirling numbers; second-kind Stirling was merged a release or so ago. These codes are in scipy.special and support broadcasting. Example common distributions where this is handy (Poisson, binomial, negative binomial): all three have simpler arbitrary positive-integer moment representations that are much cleaner than the raw moments.
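To illustrate the change of basis: raw moments can be recovered from factorial moments via Stirling numbers of the second kind, E[X^n] = Σ_k S(n, k) E[(X)_k]. A pure-Python sketch (SciPy's broadcasting `stirling2` in `scipy.special` would replace the recurrence here), using the Poisson distribution, whose k-th factorial moment is simply λ^k:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling numbers of the second kind via the standard recurrence."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def poisson_raw_moment(n, lam):
    """E[X^n] for X ~ Poisson(lam), using factorial moment E[(X)_k] = lam**k."""
    return sum(stirling2(n, k) * lam**k for k in range(n + 1))

# Known raw moments: E[X] = lam, E[X^2] = lam + lam^2, E[X^3] = lam + 3 lam^2 + lam^3
print(poisson_raw_moment(3, 2.0))  # 2 + 3*4 + 8 = 22.0
```

The factorial-basis expression (one term per k) is much more compact than expanding the raw moments directly.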
IIUC @mdhaber has noted this is planned; at least I'm assuming that's what is meant by mixture distributions (and not variance scale-mixture families, which are a wholly different class of distributions).
+1, these would be nice to have. One thing I haven't noticed mentioned (possible I missed it though) is which methods are required and which aren't (when implementing a new distribution)? There are some distributions which have inverse CDFs easily represented but not CDFs, and some which have nice algorithms for sampling but not other things like easy CDFs (an example is the Dickman or Goncharov distribution; this one has some nice CFTP sampling approaches). Tweedie is maybe another one in this domain for certain parameter values. These are maybe less important to the generalist scipy user, but for infrastructure we might consider them, if nothing else at least to determine and document whether they or distributions like them would or would not be implementable using the planned infrastructure. (Apologies in advance if the 'public' interface aspect is noted somewhere in the thread above and I missed it.) Edit: it looks like some methods are raising. Thanks again for all the hard work here.
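The inverse-CDF case is easy to make concrete: if only the quantile function (`ppf`) is available, `rvs` can still be supported via inverse-transform sampling. A sketch using the exponential distribution, whose quantile function is closed-form (names here are illustrative, not the proposed API):

```python
import math
import random

def exp_ppf(q, lam=1.0):
    """Quantile function of the exponential distribution: -log(1 - q) / lam."""
    return -math.log1p(-q) / lam

def rvs_from_ppf(ppf, n, rng=random.Random(0)):
    """Inverse-transform sampling: only the quantile function is required."""
    return [ppf(rng.random()) for _ in range(n)]

samples = rvs_from_ppf(exp_ppf, 10_000)
mean = sum(samples) / len(samples)  # should be close to 1/lam = 1
```

An infrastructure that only *requires* `ppf` (or only a sampler) and derives the rest numerically would accommodate distributions like Dickman.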
Thanks everyone for your comments and support, we really appreciate you taking the time. I am trying in the following to address the comments. Feel free to add more comments if I forgot anything. Based on the overall positive response, thank you again, @mdhaber is going to open a PR. I will post another message on the forum for it too.
We did consider alternative naming schemes (we did a poll among ourselves too) and the API we have now is the solution we came up with. Being slightly more descriptive with
We had quite some back and forth there too and are still discussing this point. Some of us think we should not allow passing an RNG in methods and should just allow having it at init. Right now it does both. This is still a point open for discussion; we will have a second look.
Yes and no. Things like staticmethod, for instance, are there to group methods under a given "namespace" to have a logical grouping. We will think about this once more. But yes, after discussing this, we are going to remove the fit method from the initial API, and we will think about how to follow up.
Typing is not going to be part of this first iteration, but it's definitely something that we are going to look into in follow-ups. (I am myself very much interested in typing. There are just a lot of initiatives happening on that front right now, and waiting a little bit longer might help us get there.)
In the previous design, there were two ways to create a distribution: a classical class instance, and a frozen one. This design led to a lot of confusion in the community, and this is precisely one of the pain points this new design addresses. We want to allow class mutability to be able to change parameters (in a controlled manner with proper checks). This will be important for some downstream libraries.
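A sketch of what "controlled mutability" could mean in practice (a hypothetical mechanism for illustration; the actual implementation may differ): validation runs on every assignment via a property setter, so the object can never hold an invalid parameter.

```python
class MutableScaleDist:
    """Toy distribution whose scale can be changed, but only through a checked setter."""
    def __init__(self, scale=1.0):
        self.scale = scale  # routed through the property setter below

    @property
    def scale(self):
        return self._scale

    @scale.setter
    def scale(self, value):
        if not (value > 0):  # also rejects NaN, since NaN > 0 is False
            raise ValueError(f"scale must be positive, got {value!r}")
        self._scale = float(value)

d = MutableScaleDist(2.0)
d.scale = 0.5          # fine: re-validated on assignment
try:
    d.scale = -1.0     # rejected with a clear error
except ValueError as e:
    msg = str(e)
```

This is the middle ground between immutability and unchecked attribute mutation: parameters can change, but every change passes through the same checks as `__init__`.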
See above for the RNG; otherwise, we think it's a better API that way.
Mixed and discrete distributions are definitely on the table and will be a follow-up.
More documentation will come; for now we are focusing on the basics.
Thanks, we will have another look and consider it.
We had some discussions here when we added features around survival analysis, and this naming is less common for non-experts than the complement of the CDF.
A general note about distributions: we don't want to go back to the current status where we have more than 100 distributions. Instead, we want to make an API which allows anyone to add their own custom distributions. We are actually thinking about making a dedicated library to host user-defined distributions. We don't need to bloat the library itself and, most importantly, add to the maintenance pile more than we have to.
There is a method parameter to do basic method selection, but we are going to work on this more. The end goal will be to use methods which are Array API compatible so that sampling can be fast and library independent. Something else we will look into is how we can integrate the work done in
We can consider that in a follow up.
We have a diagram for all the logic but we need to do some small polishing. But yes, it will be in the docs.
I think dropping the
It's the main use case where we need to call methods with different parameters, and you need it for the design decisions. Current distributions have many distribution-specific fit methods that are useful and took a long time to add and improve.
Moving the
I gave up worrying and just hope for a loooong deprecation period
+1 for immutable (fewer headaches in the long run).
Immutable without args in the methods kills it for any fitting applications, or we have to create a new instance each time during fitting. (Models in statsmodels are supposed to be immutable (by the user), but not having some extra args in the methods makes that very annoying in some use cases. We have some backdoor mutating in internal applications, but it looks dirty.) It also depends on the cost of creating a new instance: we create many cheap instances all the time.
Yes, immutability is off the table. As I said, we want to support changing parameters for things like fitting.
Why would that be a problem? Do you have an example of a program that is limited by that? This is how I do fitting in efax. (Admittedly, the situation is a bit different for exponential family distributions.) Would it be possible to clarify why fitting needs mutable distributions? It also might help to see exactly how you expect a fitting example to work. I know that a "joint distribution transformer" is on the list, and it would be nice to see what you expect to happen in that case. I had to redesign my approach to make fitting work with joint distributions.
Before doing more meta-discussions around that, please also have a look at the timings Matt put above. You really need to look at this with fresh eyes.
The main reason is the overhead of instance creation, and the convenience of calling a method if it has a parameter argument.
Would you happen to have a program that illustrates why this overhead might be problematic? The reason I ask is twofold. First, with the broadcasting interface, if you're doing a lot of fitting in parallel, then you can fit them all at once, and only one instance is created for all of the random variables. And second, nowadays, there are plenty of mathematical JITs that would erase all of the overhead. So, if you could show an example, we could explore if a JIT would solve your problem. I think it would be a shame to complicate the interface for a problem that may no longer be a problem.
I'm not sure why calling a method is any nicer than instantiating and calling a method on the instance? I find the Python principle of there being "1 way to do things" to be more motivating, personally.
I read (and loved) Matt's comment when I hearted it. I see that you're referring to the timings. I just don't see how creating instances is problematic in practice, but maybe my applications are different than other people's.
Oh, I was saying that for everyone here. Plus we had a talk with you, so I know you have more context from us. What I am saying here is that if we really want to have a meaningful discussion then everyone needs to be on the same page by pulling Matt's branch locally and playing with it. Otherwise it's easy to think that it works in certain ways even though it does something else entirely. Talking about overhead and such without trying it out does not make sense to me otherwise. The new API is very different from the previous one.
For brainstorming purposes: Java and C# address the mutable/immutable dilemma (for String) by keeping String immutable but having a separate, mutable StringBuilder class. Do whatever you like on StringBuilder, and once you're happy with the result, create a String from it for later use. I feel this model may be overkill for Python, so I don't intend to propose it as a solution. Just for your info.
Actually, it's difficult to find an example in statsmodels that uses a scipy distribution in fit with variable parameters. For most distributions we have explicit formulas for the loglike and its derivatives. The only example I currently find: (models with other distributions that have shape parameters, e.g. for failure time and extreme values, are still missing in statsmodels; Numfocus thought they are not important). So, maybe I don't need to care. The only other time-consuming loops that I remember right now are rootfinding for power and sample size computation, which in many cases have to solve a cdf constraint for a shape parameter. Most other applications of scipy distributions in statsmodels are one-time calls where overhead would be unimportant most of the time. You can also time the
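As a concrete sketch of that kind of loop: solving a CDF constraint for a distribution parameter by bisection. Here we find the rate λ of an exponential such that F(x0; λ) = p, which has the closed form λ = −ln(1 − p)/x0 to check against. Each iteration is one distribution-method call with a different parameter value, which is why per-call overhead matters in these loops. (The exponential is chosen only for illustration.)

```python
import math

def exp_cdf(x, lam):
    """CDF of the exponential distribution with rate lam."""
    return 1.0 - math.exp(-lam * x)

def solve_rate(x0, p, lo=1e-9, hi=100.0, tol=1e-12):
    """Bisection on lam: exp_cdf(x0, lam) is increasing in lam for fixed x0 > 0."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if exp_cdf(x0, mid) < p:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

lam = solve_rate(x0=2.0, p=0.9)
# closed form for comparison: -log(1 - 0.9) / 2.0
```

With a stateless method the solver just passes `lam` per call; with a parameters-at-init design, each iteration would either mutate the instance or create a new one.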
To be clear, this ship has sailed. We are not going back to a state where we can have frozen distributions.
Thanks for explaining your cases: optimize and root finding. Have you considered using Jax? Here's an illustrative example program:

```python
from collections.abc import Callable, Generator
from contextlib import contextmanager
from time import perf_counter

import jax
import jax.numpy as jnp
import numpy as np
from efax import NormalVP
from jax.experimental import enable_x64
from jax.scipy.optimize import minimize as jminimize
from scipy.optimize import minimize
from scipy.stats import norm

with enable_x64():
    data = norm(loc=2.3, scale=1.5).rvs(size=10000)
    jax_data = jnp.asarray(data)
    guess = np.asarray([1.0, 2.0])
    jax_guess = jnp.asarray(guess)

    @contextmanager
    def catchtime() -> Generator[Callable[[], float], None, None]:
        t1 = t2 = perf_counter()
        yield lambda: t2 - t1
        t2 = perf_counter()

    def f(mv: np.ndarray) -> float:
        v = np.exp(mv[1])
        return -np.sum(norm(loc=mv[0], scale=np.sqrt(v)).logpdf(data))  # pyright: ignore

    def g(mv: np.ndarray) -> float:
        v = np.exp(mv[1])
        return -np.sum(norm.logpdf(data, loc=mv[0], scale=np.sqrt(v)))

    def h(mv: jax.Array) -> jax.Array:
        v = jnp.exp(mv[1])
        return -jnp.sum(NormalVP(mean=mv[0], variance=v).log_pdf(jax_data))

    @jax.jit
    def mh(g: jax.Array) -> jax.Array:
        return jminimize(h, g, method='BFGS').x

    mh(jax_guess)  # Compile.

    with catchtime() as t:
        minimize(f, guess)
    with catchtime() as u:
        minimize(g, guess)
    with catchtime() as v:
        mh(jax_guess)

    print(f"construct: {t():.1e}, no construct: {u():.1e}, jax: {v():.1e}")
```

So, at least on this problem, Jax is about two orders of magnitude faster than the fastest SciPy solution.
Switching backend or supporting backends other than numpy/scipy/pandas is currently not planned for statsmodels.
Is there a record of these policy decisions somewhere? I don't recall this being a part of the roadmap when the work got started. |
@josef-pkt I understand. The reason I wanted to illustrate the Jax solution is to support my argument that I don't think that speed should be a significant consideration when making design decisions. Whether you need to construct objects or not is totally irrelevant if you care about speed, because if you care about speed, you should just use a JIT. And just so that we're clear: I'm not sure if you meant "array library" when you said "backend"? Because the Jax solution is a different array library, but it's the same backend (CPU). In short, my personal preference is for design decisions to be focused on ergonomics over efficiency, since JITs provided by Array API libraries should solve the efficiency problem.
We did not make a public official announcement specific to that yet, no. While we iterated on this RFC with the other SciPy stats maintainers, it became clear to us that we did not want to leave the door open anymore to just anything. The maintenance cost is just too high, and none of us want to review or maintain more than what we must. We feel that a community effort around that would be a more sustainable solution. The new API will make it simpler to add any distribution, and this does not need to live in SciPy. Hence we will push for another repository proposing a collection of distributions. I plan to kick-start this repository later this year, but in the end, I and the other SciPy stats maintainers will not provide any guarantee on that repo.
Support for the Array API will be added right after. The team has been quite busy adding support to stats and preparing the ground to enable full support with the new distribution infrastructure.
Where did this discussion take place? Not on the mailing list or the Discourse. |
That's kinda not how this works. Decisions get made in public, with clear rationales. I suggest focusing on getting the new infrastructure merged without making such definitive statements on what can and cannot be contributed in the future. Deprecations/removals of existing API surface are the same. It can be mixed into one huge discussion, but I'd advise keeping it separate. It's not at all clear to me what the best way forward is for the old (very widely used) infra. The new framework seems to be really good. Let's get that completed and used, and then worry about the rest.
To be clear, this is not a done deal in the sense that this RFC is not about that. This is just me saying that some of the current stats folks (@mdhaber, @steppi, @tirthasheshpatel, @ev-br, @chrisb83, @dschmitz89 and myself) had 6 meetings together and we felt that we did not want to go back to this state and spend our time on this in the future.
I like this idea. I'm impressed by the large, comprehensive set of distributions in SciPy. There could be some room for improvement, as I find some distributions less polished than others. A community effort could help in this regard, I think.
For those who are following this issue but may have missed the Discourse post, see gh-21050 for a first PR for the new infra.
The existing univariate distribution infrastructure is an extremely valuable feature of SciPy, and it offers two main ways of interacting with distributions.

1. `stats.norm.pdf(2, loc=0, scale=1)` calls the `pdf` method of the `norm` object that is instantiated at import time.
2. `stats.norm(loc=0, scale=1).pdf(2)` uses the `__call__` method of the `norm` object to create a "frozen" instance of the distribution, then calls that object's `pdf` method.

There are two major advantages of approach 1 (and, to be fair, I'll be happy to add others as they come up in the comments below): for instance, users can write `stats.norm.pdf(2)` rather than thinking of (and instantiating) a standard normal object and invoking its `pdf` method (`stats.norm().pdf(2)`).

However, there are several important disadvantages.
These could easily be avoided if `norm` were simply a class (rather than an instance of the `norm_gen` class) and the user were required to instantiate that class (e.g. `norm()`) before invoking methods. This has been noted before; for example, see gh-12133 for an earlier discussion of the history and pros/cons of the two approaches.

There are also several issues that have been reported over the years that may not indicate inherent disadvantages of the existing infrastructure, but do suggest that some upgrades are desired. There are several features and limitations of the existing infrastructure (inflexibility w.r.t. adding/removing method arguments) that suggest that it may be easier to start fresh, reimagining the infrastructure with these enhancements in mind rather than adding patches to the existing framework.
- `values` with non-integer xk #3758 - discrete distributions should not be limited to integral support.
- `c` and `scale` from mean and standard deviation #12134, DOC: scipy.stats.lognorm — SciPy v1.12.0.dev Manual #18984. I don't think it's possible to offer alternate parameterizations within the existing infrastructure without creating an entirely separate distribution class/object. We should embrace the need to offer multiple parameterizations as we design the new infrastructure.
- `fit` method to override it. The `fit` method needs a `nan_policy`/`inf_policy`.
- With the `loc`/`scale` parameterization, it is essentially impossible for users to fit truncated distributions with fixed truncation bounds. The new distribution interface should fix this.
- Use `stats.sampling` to generate random variates rather than always using inverse transform sampling (even when `ppf` is implemented numerically).
- `ppf(0) = -1` and the definition of `interval` for discrete distributions should be reconsidered when planning the new interface.
- (`logcdf`).
- `_distn_infrastructure.py` - e.g. hazard function, probability sparsity function.
- `relative_entropy` and `cross_entropy` in the new interface.
- `rvs` raises an error while others return NaNs (`ValueError: Integers to negative integer powers are not allowed.`) when they are used.
- `rv_discrete` fails when support is unbounded below #18564 - discrete distributions should be able to be unbounded below and above.
- `b` parameter #18782, BUG: stats.zipf: incorrect pmf values #20692 - the distribution infrastructure should take care of converting parameters to the appropriate data type to avoid these sorts of gotchas.

Starting fresh, we would also ensure that everything is designed with vectorization in mind, as we still occasionally discover issues like gh-13504, gh-13581, gh-12192, gh-11746, #18093 (comment), gh-18919.
This enhancement proposal is for a new univariate distribution infrastructure. To be clear, I don't mean that we'd discard the old infrastructure. A lot of the existing algorithmic code would likely be retained (perhaps modernized), and the new infrastructure/distributions would generally need to pass the old tests (with updated syntax). Hopefully, the transition would be easy for users: code involving frozen distributions (approach 2 above) would need minimal changes, and the rest just needs to add some parentheses and pass arguments into the distribution initializer instead of the method itself. But we would probably rewrite parts that define methods by `exec`uting strings, for instance. (I understand it was needed at the time, but we could avoid it with a new structure.)

Of course, we'd want to discuss and study other possibilities to make sure we get things right, but it might look something like this.
- `norm` -> `Norm` or `Normal`, `loguniform` -> `LogUniform`).
- Shape parameters are passed to the `__init__` method (only). Naturally, this makes alternative parameterizations easy to support. Default support for `loc` and `scale` is provided, but distributions can disable them as needed. (Contrast against MAINT: stats: fix overparameterized `loguniform` distribution #15889.)
- `nnlf` is corrected to `nllf`).
- A `fit` method, which modifies the shape parameters of the distribution. Methods are available to extract the shape parameters (e.g. see Easy access to scipy.stat.<distributionName> frozen distribution arguments #9321).
- `mode`, partial derivatives of the PDF (useful for random variate generation and fitting parameters to data), and generating the distributions of order statistics.
- (e.g. a `_fit` method so that fit overrides do not have to repeat boilerplate code), and there are different versions of private methods that can be overridden depending on (for example) whether the developer is able to vectorize with respect to shape parameters or would prefer to rely on a default loop.
- (`mpmath`).
- `np.where`/`np.lazywhere`.
- (like `_shape_info`, which was recently added to the existing infrastructure). Based on this information, parameter documentation and input validation are generated automatically.
- `scale` #17592).
- (no `exec`).
- An `mpmath` backend. See notes below.

The new infrastructure and distributions would be widely advertised, and documentation and existing code would transition to the new infrastructure. Existing infrastructure/distributions would be supported without deprecation warnings until two releases before SciPy 2.0, at which point they would be moved to a new package that is released once (without planned maintenance).
@tupui @ev-br @tirthasheshpatel @chrisb83
Notes on framework for investigating accuracy:

The infrastructure will accept as arguments arrays containing `mpmath.mpf` objects. Following the approach of gh-19023, special functions will dispatch to `mparray` versions where possible. This will allow for end-to-end calculation with arbitrary-precision arithmetic and, therefore, automatic generation of reference values. By comparing numerical results computed using floating point arithmetic against these reference values, we can identify shortcomings in the implementation (e.g. ranges of parameter values where catastrophic cancellation causes loss of precision).

We cannot hope to test the accuracy for all possible floating point inputs, so we resort to random or quasi-random sampling from possible floating point inputs. From one perspective, it would be ideal if the distribution of floating point inputs tested were representative of the distribution of floating point inputs used by users. This way, we focus limited resources on parameter ranges experienced in practice. However, this distribution is not known, so we should not make assumptions about it. Instead, I would suggest sampling parameter values uniformly from the set of all valid floating point inputs:
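One way to sample uniformly from the set of representable doubles (rather than uniformly from the reals) exploits the fact that, for positive IEEE-754 doubles, the unsigned-integer view of the bit pattern is monotonic in the float value. Drawing integers uniformly between the bit patterns of `a` and `b` therefore draws uniformly over the representable floats in `[a, b]`. A sketch for positive `a` and `b` only, consistent with the caveat below about negative reals (function names are my own):

```python
import random
import struct

def float_to_bits(x: float) -> int:
    """Unsigned 64-bit integer view of a double's bit pattern."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bits_to_float(n: int) -> float:
    return struct.unpack("<d", struct.pack("<Q", n))[0]

def uniform_over_floats(a: float, b: float, rng=random.Random(0)) -> float:
    """Uniform over representable doubles in [a, b], for 0 < a <= b."""
    lo, hi = float_to_bits(a), float_to_bits(b)
    return bits_to_float(rng.randint(lo, hi))

# On a linear scale the samples cluster near a: floats are ~log-uniformly spaced.
xs = [uniform_over_floats(1e-300, 1e300) for _ in range(1000)]
```

Because the exponent field dominates the bit pattern, this sampling is approximately log-uniform over the magnitude, which matches the observation below.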
As we might expect, the floating point numbers between positive `a` and `b` are approximately log-uniformly spaced. However, the log-uniform distribution is only defined over positive support, whereas the domain of `a` and `b` above includes negative reals.