ENH: Added rectangular integral to multivariate_normal #9072
The multivariate_normal's CDF function originally only gives you the lower tail CDF. Now, it has been extended to calculate the CDF between lower and upper bounds, also known as the rectangular integral of the pdf between the lower and upper bounds.
Just to make my initial comment a bit clearer: The multivariate normal already has a method called "cdf", which calculates the n-dimensional integral of the PDF from (-infty, -infty, ..., -infty) to (x1, x2, ..., xn), sometimes called the "lower tail" CDF. The problem is that sometimes you want to calculate the n-dimensional rectangular CDF between a lower bound (z1, z2, ..., zn) and an upper bound (x1, x2, ..., xn), and trying to do so with a function that only gives the "lower tail" CDF is a huge hassle. I just tweaked things around so that the rectangular CDF can be calculated using the already implemented mvn.mvnun function. I hope that cleared things up a bit.
Hi @phildias, thanks for the contribution and idea. In principle this functionality makes sense. However, I'm not sure I'm in favor of adding new methods to only one distribution. This will make the distributions framework less consistent. I think we should consider this either for all or for none of the distributions. |
Hi @rgommers! Thanks for the comment =) Well, it seems like none of the other multivariate distributions in this file have a CDF associated to them. And this idea of "rectangular integral" doesn't really apply to univariate CDFs. In those cases, you can just take the CDF(upper) - CDF(lower). See what I mean? So given that the multivariate_normal already is the odd one out by having CDF functionality, would it really be too odd to add this rectangular CDF feature? |
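For comparison, the univariate case mentioned above needs no special method. A small sketch using the standard normal (the interval endpoints are arbitrary illustration values):

```python
from scipy.stats import norm

lower, upper = -1.0, 2.0

# Probability that a standard normal variable falls in (lower, upper].
p_interval = norm.cdf(upper) - norm.cdf(lower)
print(p_interval)
```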
In my opinion, the multivariate distributions are all somewhat unique creatures; many of them are defined over more interesting spaces that are imbued with special structure (multivariate normal lives up to its name and is the most boring). There isn't really an underlying system like the univariate distributions. Heck, some of them don't even define a
The purpose of including the multivariate distributions in
With respect to this PR, I'm not a fan of referring to this as evaluating a "CDF". I'm not sure anyone actually thinks of defining a CDF as a function that one then evaluates at the lower and upper bounding vectors. Rather, I just think of this method as an operation that integrates the PDF over a given domain, one that happens to be easy to specify, but isn't particularly special compared to other domains that are likely more interesting. Not entirely sure what a better, concise name would be, but avoiding "CDF" might help avoid the feeling that this is something that all multivariate distributions ought to have.
no need, I think i remember that.
good point, that is what triggered my initial comment. It's more like integrate_box in |
@phildias, sounds like this would be useful but that there was some uncertainty about the method name. Are you still interested in working on this? |
@tupui @tirthasheshpatel @steppi @rkern this does look like it would be useful, and it seems to have been held up mostly because of the name. The quantity calculated here is the integral of the multivariate normal PDF over a rectangular region. The equivalent quantity is calculated by the method
In Matlab, the multivariate CDF function
Other thoughts? Would anyone like to take this over, or if I finish it, would you like to review?
I agree that this sounds useful and should be part of CDF itself. It makes sense to add a lower bound parameter as the current parameter is specifying the upper bound. Noting here that this could be seen as somehow similar to
I am happy to help you there.
This could also be a useful feature for the univariate distributions, where the "hyperrectangle" is just an interval. It is common to want the probability of an interval |
If we're thinking of adding something equivalent on the univariate side, I'm even more in favor of a separate method rather than glomming more arguments into |
I would not add it (or anything else, for that matter) on the univariate side in the current framework. I have considered it in the context of gh-15928. Since methods would no longer need to accept shape parameters as arguments, I think keeping it in
This would sometimes fail because the former could be overridden to be more accurate. If someone comes up with a killer name for a separate method, I'd be ok with it, but I haven't seen it yet. |
The issue with MATLAB's cognate is that it has the
I'm comfortable with
I had in mind the (upper) / (lower, upper) variation in arguments that we all dislike : )
I think I dislike that only when the keywords are named like
I suppose I feel a mild irritation at switching the meaning of
Yeah, it's more natural in MATLAB than in Python because they have different conventions for dealing with different argument structures. I prefer keeping to Python's standard semantics when we can. It's much easier to document for folks. |
Ok so what do we do? 2 options from this discussion regarding multivariate distributions:
For |
def cdf(self, x, mean=None, cov=1, allow_singular=False, maxpts=None, abseps=1e-5, releps=1e-5):
I'd suggest we tack
I was not clear, yes only talking about multivariate here. And for |
Add |
And my vote is for not adding an extra method and use |
scipy/stats/_multivariate.py
Outdated
maxpts, abseps, releps)[0]
out = np.apply_along_axis(func1d, -1, x)
out = np.apply_along_axis(func1d, -1, limits)
apply_along_axis applies func1d to every slice along the last axis. So we concatenate both lower and upper integration limits along the last axis and split them into separate lower and upper limits within func1d.
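The concatenate-then-split pattern described in this review comment can be sketched in isolation. This is an illustration under stated assumptions, not the code in the diff: the function name `box_volume` is hypothetical, and the box volume stands in for the real integrator call so the example runs without it.

```python
import numpy as np


def box_volume(lower, upper):
    """Apply a per-box computation over batches of (lower, upper) limits."""
    lower, upper = np.broadcast_arrays(lower, upper)
    n = lower.shape[-1]
    # Join the limits along the last axis so each 1-D slice holds one full box.
    limits = np.concatenate((lower, upper), axis=-1)

    def func1d(limits_1d):
        # Split the slice back into its lower and upper halves.
        lo, hi = limits_1d[:n], limits_1d[n:]
        # Stand-in for the integrator: volume of the box.
        return np.prod(hi - lo)

    return np.apply_along_axis(func1d, -1, limits)


out = box_volume([[0, 0], [1, 1]], [[1, 2], [3, 4]])
print(out)
```

Because `apply_along_axis` passes only one 1-D slice to `func1d`, packing both limit vectors into that slice is a simple way to hand the function two arguments per box.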
@tupui is this what you had in mind? There is a weirdness, though. In 2d:
import numpy as np
from scipy.stats import multivariate_normal
rng = np.random.default_rng(2408071309372769818)
ndim = 2
mean = np.zeros(ndim)
cov = np.eye(ndim)
multivariate_normal.cdf([0, 0], mean, cov, lower_limit=[1, 1]) # 0.11651623566859809
multivariate_normal.cdf([1, 1], mean, cov, lower_limit=[0, 0]) # 0.11651623566859809
multivariate_normal.cdf([0, 1], mean, cov, lower_limit=[1, 0]) # -0.11651623566859809
multivariate_normal.cdf([1, 0], mean, cov, lower_limit=[0, 1]) # -0.11651623566859809
whereas in 3D and higher dimensions (and 1D, too, actually):
rng = np.random.default_rng(2408071309372769818)
ndim = 3
mean = np.zeros(ndim)
cov = np.eye(ndim)
multivariate_normal.cdf([1, 1, 1], mean, cov, lower_limit=[0, 0, 0]) # 0.03977220487716015
multivariate_normal.cdf([0, 0, 0], mean, cov, lower_limit=[1, 1, 1]) # 0.0
multivariate_normal.cdf([1, 0, 1], mean, cov, lower_limit=[0, 1, 0]) # 0.0
if any integration limits are reversed, the integral is zero. I guess three options are:
Thanks @mdhaber. Yes, the API corresponds to what I had in mind. For the corner cases, I would not try to calculate anything if the lower value is above the requested value. Maybe even error out. Otherwise I find it strange to define an integral with the bounds swapped. (I am not sure how this case is defined in maths.)
It is defined. |
Right indeed, I went back to my textbooks... Sorry about that 🤦 In that case yes I agree with your fix. |
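For reference, the usual calculus convention is that swapping the limits in one coordinate flips the sign of the integral, so a box with k reversed coordinates picks up a factor of (-1)**k. The sketch below checks this for the identity-covariance case only, where the box integral factors into a product of 1-D terms; the helper names are hypothetical and only the standard library is used.

```python
import math


def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def signed_box_prob(lower, upper):
    """Signed box integral for independent standard normals (identity covariance)."""
    p = 1.0
    for a, b in zip(lower, upper):
        # Each factor is negative when that coordinate's limits are reversed.
        p *= phi(b) - phi(a)
    return p


print(signed_box_prob([0, 0], [1, 1]))        # no limits reversed
print(signed_box_prob([1, 0], [0, 1]))        # one reversed: negative
print(signed_box_prob([1, 1], [0, 0]))        # two reversed: positive again
print(signed_box_prob([1, 1, 1], [0, 0, 0]))  # three reversed: negative
```

Under this convention the 2-D values quoted above are reproduced, while the fully reversed 3-D case comes out negative rather than the 0.0 reported for three and higher dimensions.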
The multivariate_normal's CDF function originally only gives you the lower tail CDF, i.e. the integral of the pdf from -infty to x. Now, I've created a new method that calculates the CDF between lower and upper bounds, also known as the rectangular integral of the pdf between the lower and upper bounds.