
ENH: Added rectangular integral to multivariate_normal #9072

Merged (11 commits) on Aug 19, 2022

Conversation

@phildias (Contributor) commented Jul 25, 2018

The multivariate_normal's CDF function originally only gives you the lower tail CDF, i.e. the integral of the pdf from -infty to x. Now, I've created a new method that calculates the CDF between lower and upper bounds, also known as the rectangular integral of the pdf between the lower and upper bounds.

@phildias changed the title from "Added rectangular integral to multivariate_normal" to "ENH: Added rectangular integral to multivariate_normal" on Jul 25, 2018
@chrisb83 added the scipy.stats and enhancement labels on Jul 26, 2018
@phildias (Contributor, Author) commented:

Just to make my initial comment a bit clearer:

The multivariate normal already has a method called "cdf", which calculates the n-dimensional integral of the PDF from (-infty, -infty, ..., -infty) to (x1, x2, ..., xn), sometimes called the "lower tail" CDF.

The problem is that sometimes you want to calculate the n-dimensional rectangular CDF between a lower bound (z1, z2, ..., zn) and an upper bound (x1, x2, ..., xn), and trying to do so with a function that only gives the "lower tail" CDF is a huge hassle.

I just tweaked things around so that the rectangular CDF can be calculated using the already implemented mvn.mvnun function.

I hope that cleared things up a bit.
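To illustrate the "huge hassle": with only a lower-tail CDF, even a 2-D rectangular probability takes four CDF calls via inclusion-exclusion, and the count grows as 2**n with dimension. A minimal sketch, using the standard bivariate normal and an arbitrary box chosen for this example:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Standard bivariate normal, integrated over the box [0, 1] x [0, 1]
mvn = multivariate_normal(mean=np.zeros(2), cov=np.eye(2))
a, b = np.array([0.0, 0.0]), np.array([1.0, 1.0])

# 2-D inclusion-exclusion over the four corners of the rectangle;
# in n dimensions this requires 2**n lower-tail CDF evaluations.
p = (mvn.cdf(b)
     - mvn.cdf([a[0], b[1]])
     - mvn.cdf([b[0], a[1]])
     + mvn.cdf(a))
# p ≈ 0.1165 for this box
```

Beyond the bookkeeping, each corner term carries its own numerical error from the underlying quadrature, which is part of why a direct rectangular integral is attractive.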

@rgommers (Member) commented:

Hi @phildias, thanks for the contribution and idea.

In principle this functionality makes sense. However, I'm not sure I'm in favor of adding new methods to only one distribution. This will make the distributions framework less consistent. I think we should consider this either for all or for none of the distributions.

@phildias (Contributor, Author) commented Aug 1, 2018

Hi @rgommers! Thanks for the comment =)

Well, it seems like none of the other multivariate distributions in this file have a CDF associated with them. And this idea of a "rectangular integral" doesn't really apply to univariate CDFs; in those cases, you can just take CDF(upper) - CDF(lower). See what I mean?

So given that the multivariate_normal already is the odd one out by having CDF functionality, would it really be too odd to add this rectangular CDF feature?

@rkern (Member) commented Aug 1, 2018

In my opinion, the multivariate distributions are all somewhat unique creatures; many of them are defined over more interesting spaces that are imbued with special structure (multivariate normal lives up to its name and is the most boring). There isn't really an underlying system like the univariate distributions. Heck, some of them don't even define a pdf(). I'm not opposed to adding distribution-specific methods on principle. By their varied nature, there are a lot of interesting things one would like to do with any given distribution that just don't even make sense for another one.

The purpose of including the multivariate distributions in scipy.stats was never to provide a generic system where one can write generic code that accepts "a distribution object" and chugs away without needing to know what the distribution is, as was the purpose of the univariate distributions design. We explicitly disclaimed that when we added them in (well, I argued for that, at least; I'll have to search through the archives to dredge up the conversations). Rather, the purpose of including the multivariate distributions is that they are useful, and we needed to put them somewhere!

With respect to this PR, I'm not a fan of referring to this as evaluating a "CDF". I'm not sure anyone actually thinks of defining a CDF as a function that one then evaluates at the lower and upper bounding vectors. Rather, I just think of this method as an operation that integrates the PDF over a given domain, one that happens to be easy to specify, but isn't particularly special compared to other domains that are likely more interesting. Not entirely sure what a better, concise name would be, but avoiding "CDF" might help avoid the feeling that this is something that all multivariate distributions ought to have.

@rgommers (Member) commented Aug 1, 2018

> well, I argued for that, at least; I'll have to search through the archives to dredge up the conversations

No need, I think I remember that.

> With respect to this PR, I'm not a fan of referring to this as evaluating a "CDF". [...]

Good point; that is what triggered my initial comment. It's more like integrate_box in gaussian_kde - would that be a useful name here?

@mdhaber (Contributor) commented Feb 27, 2022

@phildias, sounds like this would be useful but that there was some uncertainty about the method name. Are you still interested in working on this?

@mdhaber (Contributor) commented Jul 24, 2022

@tupui @tirthasheshpatel @steppi @rkern this does look like it would be useful, and it seems to have been held up mostly because of the name.

The quantity calculated here is the integral of the multivariate normal PDF over a rectangular region. The equivalent quantity is calculated by the method integrate_box in gaussian_kde, but I don't see other software that uses that name for this quantity, and I would have expected a method with that name to integrate a more general function over a box, like the univariate method expect does. I haven't found a concise, standard name for the idea in the math literature either (and I looked in other contexts, too - e.g. the difference between the CDF at two points for univariate distributions).

In Matlab, the multivariate CDF function mvncdf accepts lower and upper limits, as does pmvnorm in R. That's probably because the underlying Fortran code mvndst, which we use, does this. I don't see a need to break this convention here: it seems very natural to me for a CDF function to accept lower and upper bounds because the CDF is often seen as an integral of the PDF, and the lower bound of integrals is not always minus infinity. I'd suggest we follow precedent and implement this feature by adding another argument to the cdf method of multivariate_normal that specifies the lower bound of the hyperrectangle. (The default is essentially a vector of negative infinities). It is not ideal that this argument has to go at the end of the signature, but there are other things I'd change about the multivariate distribution interfaces (e.g. see gh-16278), too, and I think this is still preferable to adding a new method.

Other thoughts? Would anyone like to take this over, or if I finish it, would you like to review?

@tupui (Member) commented Jul 25, 2022

I agree that this sounds useful and should be part of cdf itself. It makes sense to add a lower bound parameter, as the current parameter specifies the upper bound. Noting here that this could be seen as somewhat similar to NumPy's Generator.integers, but I personally don't like that the lower bound parameter corresponds to the upper bound when the upper bound is not specified; here we should try not to do that and just add a lower bound parameter.

I am happy to help you there.
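For reference, the NumPy pattern being criticized (assuming the comparison is to numpy.random.Generator.integers) looks like this: when the second argument is omitted, the first one silently becomes the exclusive upper bound.

```python
import numpy as np

rng = np.random.default_rng(12345)

# With `high` omitted, `low` is reinterpreted as the exclusive upper bound:
x = rng.integers(5)      # draws from [0, 5)
# With both given, the arguments mean what their names say:
y = rng.integers(2, 5)   # draws from [2, 5)
```

This is the argument-meaning flip the discussion wants to avoid for cdf.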

@WarrenWeckesser (Member) commented:

This could also be a useful feature for the univariate distributions, where the "hyperrectangle" is just an interval. It is common to want the probability of an interval a <= x <= b, and it is quite easy to lose a lot of precision unnecessarily by using the naive implementation cdf(b) - cdf(a). (At a minimum, an implementation should make the smart choice between cdf(b) - cdf(a) and sf(a) - sf(b).) I've always envisioned a separate method (admittedly with awkward ideas for names, such as delta_cdf or interval_prob), but adding a parameter to the CDF method also seems like a reasonable API.
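A minimal sketch of the precision point, using the univariate normal with an interval deep in the right tail (the interval is chosen here for illustration): both CDF values round to 1.0 in double precision, so the naive difference cancels to zero, while the survival-function form retains the tail mass.

```python
from scipy.stats import norm

a, b = 10.0, 11.0
naive = norm.cdf(b) - norm.cdf(a)   # both terms round to 1.0, so this is 0.0
smart = norm.sf(a) - norm.sf(b)     # survival functions keep the tiny tail mass
```

Here smart is on the order of 1e-24, which the naive form cannot represent at all; a built-in interval probability could make this choice automatically.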

@rkern (Member) commented Jul 25, 2022

If we're thinking of adding something equivalent on the univariate side, I'm even more in favor of a separate method rather than glomming more arguments into cdf(). I'd be okay with it having cdf in the name of it, though.

@mdhaber (Contributor) commented Jul 25, 2022

I would not add it (or anything else, for that matter) on the univariate side in the current framework. I have considered it in the context of gh-15928. Since methods would no longer need to accept shape parameters as arguments, I think keeping it in cdf would be reasonable, since these would be the only two arguments.

dist = stats.normal()
assert dist.cdf(x, y) == dist.cdf(y) - dist.cdf(x)

This would sometimes fail because the former could be overridden to be more accurate.

If someone comes up with a killer name for a separate method, I'd be ok with it, but I haven't seen it yet.

@rkern (Member) commented Jul 25, 2022

The issue with MATLAB's cognate is that it has the (upper) / (lower, upper) variation in its arguments that we all dislike. R's cognate always requires both lower and upper (reasonable for multivariate, but much less so for univariate CDFs, and even R doesn't try; CDFs are unary functions). Inserting that into our current implementation of cdf() forces us into one of those two choices. Or have a flipped argument structure with (upper, lower). Which are you going to choose?

I'm comfortable with cdf_rect(), or cdf_interval() if you want to be forward-looking towards using the same name over in univariate-land (under the existing or future framework).

@mdhaber (Contributor) commented Jul 25, 2022

> Which are you going to choose?

I had in mind the (upper) / (lower, upper) variation in arguments that we all dislike : )

I think I dislike that only when the keywords are named like low/high or similar, because then low is actually the upper end of the interval when only one argument is passed. But with keyword names x and y, it would make sense to me from a user standpoint. That is, I'm comfortable taking an integral from -inf to x or over the interval x to y. For implementation, I think it would be good to dispatch to different private methods depending on whether one or both arguments are passed.

I suppose I feel a mild irritation at switching the meaning of x and y depending on whether one or both are passed, and usually I'd disagree with that sort of thing. But for some reason that actually resonates with me in this case, and I'd guess it's what users would typically expect (without thinking about it carefully).

@rkern (Member) commented Jul 25, 2022

Yeah, it's more natural in MATLAB than in Python because they have different conventions for dealing with different argument structures. I prefer keeping to Python's standard semantics when we can. It's much easier to document for folks.

@tupui (Member) commented Jul 29, 2022

Ok so what do we do? 2 options from this discussion regarding multivariate distributions:

  1. dist.cdf(x, ..., *, y) (or something else like lower)
  2. dist.cdf_box(*, lower, upper) or dist.cdf_interval(*, lower, upper)

For dist.cdf_box, I am not that concerned about flipping arguments. The method being new, we can make both the upper and lower bounds keyword-only arguments.

@mdhaber (Contributor) commented Jul 29, 2022

cdf(x, y) is not an option because cdf already has this signature:

def cdf(self, x, mean=None, cov=1, allow_singular=False, maxpts=None, abseps=1e-5, releps=1e-5):

I'd suggest we tack lower on to the end of the list as a keyword-only argument of cdf and think about univariate distributions later (because the existing multivariate distribution interface should not be a model for a new univariate distribution interface).

@tupui (Member) commented Jul 29, 2022

> cdf(x, y) is not an option because cdf already has this signature:
>
> def cdf(self, x, mean=None, cov=1, allow_singular=False, maxpts=None, abseps=1e-5, releps=1e-5):
>
> I'd suggest we tack lower on to the end of the list as a keyword-only argument of cdf and think about univariate distributions later (because the existing multivariate distribution interface should not be a model for a new univariate distribution interface).

I was not clear; yes, I am only talking about multivariate here. And for y, yes, it can be named anything, such as lower, and it will be keyword-only. I will update above.

@rkern (Member) commented Jul 29, 2022

Add cdf_interval(lower, upper) and be done with it. Leave cdf() alone.

@tupui (Member) commented Jul 29, 2022

And my vote is for not adding an extra method and using cdf(x, ..., *, lower).

A review comment on the diff:

    maxpts, abseps, releps)[0]
    out = np.apply_along_axis(func1d, -1, x)
    out = np.apply_along_axis(func1d, -1, limits)

apply_along_axis applies func1d to every slice along the last axis. So we concatenate both lower and upper integration limits along the last axis and split them into separate lower and upper limits within func1d.
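The concatenate-then-split pattern described above can be sketched as follows, with a toy func1d standing in for the real mvn.mvnun call:

```python
import numpy as np

lower = np.array([[0.0, 0.0], [1.0, 1.0]])   # (..., ndim) lower limits
upper = np.array([[1.0, 2.0], [3.0, 4.0]])   # (..., ndim) upper limits

# Concatenate both sets of limits along the last axis...
limits = np.concatenate([lower, upper], axis=-1)

ndim = lower.shape[-1]

def func1d(row):
    # ...and split them back into lower/upper inside the 1-D worker,
    # which sees one slice along the last axis at a time.
    lo, hi = row[:ndim], row[ndim:]
    return np.sum(hi - lo)   # toy stand-in for mvn.mvnun(lo, hi, ...)

out = np.apply_along_axis(func1d, -1, limits)   # one scalar per limit pair
```

This keeps the broadcasting of lower against upper in NumPy while the 1-D worker only ever handles one pair of limit vectors.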

@mdhaber (Contributor) commented Aug 18, 2022

@tupui is this what you had in mind?

There is a weirdness, though. In 2d:

import numpy as np
from scipy.stats import multivariate_normal
rng = np.random.default_rng(2408071309372769818)
ndim = 2
mean = np.zeros(ndim)
cov = np.eye(ndim)
multivariate_normal.cdf([0, 0], mean, cov, lower_limit=[1, 1])  #  0.11651623566859809
multivariate_normal.cdf([1, 1], mean, cov, lower_limit=[0, 0])  #  0.11651623566859809
multivariate_normal.cdf([0, 1], mean, cov, lower_limit=[1, 0])  # -0.11651623566859809
multivariate_normal.cdf([1, 0], mean, cov, lower_limit=[0, 1])  # -0.11651623566859809

whereas in 3D and higher dimensions (and 1D, too, actually):

rng = np.random.default_rng(2408071309372769818)
ndim = 3
mean = np.zeros(ndim)
cov = np.eye(ndim)
multivariate_normal.cdf([1, 1, 1], mean, cov, lower_limit=[0, 0, 0])  # 0.03977220487716015
multivariate_normal.cdf([0, 0, 0], mean, cov, lower_limit=[1, 1, 1])  # 0.0
multivariate_normal.cdf([1, 0, 1], mean, cov, lower_limit=[0, 1, 0])  # 0.0

if any integration limits are reversed, the integral is zero.

I guess three options are:

  • Leave it as-is and document
  • Check the order of limits, and return 0 (even in 2d) whenever np.any(lower > x)
  • Check the order of limits, reorder them so that np.all(lower < x) holds when calling mvn, and determine the sign ourselves

@tupui (Member) commented Aug 18, 2022

Thanks @mdhaber. Yes the API corresponds to what I had in mind.

For the corner cases, I would not try to calculate anything if the lower value is above the requested value; maybe even error out. I find it strange to define an integral with swapped bounds otherwise. (I am not sure how this case is defined in maths.)

@mdhaber (Contributor) commented Aug 18, 2022

It is defined:
$\int_b^a f(x)\,dx \equiv -\int_a^b f(x)\,dx$
I think the behavior of the 2d case is correct, so personally, I think the best option is the third, and it's not hard to implement the fix. If (b < a).sum(axis=-1) is odd, the result should be negative; otherwise positive. I'll go ahead and do that when I'm back at a computer, if that sounds OK, because 0 is definitely not correct, and I don't think it's much harder to correct it than to error out.
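That sign rule can be sketched like this. signed_box_prob is a hypothetical helper, not SciPy API; it uses the lower_limit keyword that this PR ultimately added to multivariate_normal.cdf (SciPy >= 1.10):

```python
import numpy as np
from scipy.stats import multivariate_normal

mean, cov = np.zeros(2), np.eye(2)

def signed_box_prob(lower, upper):
    # Count reversed limit pairs; each swap flips the sign of the integral.
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    swaps = int((upper < lower).sum())
    lo, hi = np.minimum(lower, upper), np.maximum(lower, upper)
    # Integrate over the corrected box (lower_limit was added by this PR).
    p = multivariate_normal.cdf(hi, mean, cov, lower_limit=lo)
    return -p if swaps % 2 else p
```

With the limits of one coordinate reversed, signed_box_prob returns the negative of the properly ordered box probability, matching the 2d behavior shown above instead of returning 0.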

@tupui (Member) commented Aug 18, 2022

> It is defined.
> I think the behavior of the 2d case is correct, so personally, I think the best option is the third, and it's not hard to implement the fix. If (b < a).sum(axis=-1) is odd, the result should be negative, else positive. I'll go ahead and do that when I'm back at a computer, if that sounds OK, because 0 is definitely not correct, and I don't think it's much harder to correct it than to error out.

Right indeed, I went back to my textbooks... Sorry about that 🤦 In that case yes I agree with your fix.

@tupui (Member) left a review:

LGTM, let's get this in now. We always have time to iterate if there is something we missed.

Thanks @mdhaber for getting this to the finish line, @phildias for the initial work and everyone for the discussions and review.

@tupui tupui merged commit 23b0814 into scipy:main Aug 19, 2022
@tupui tupui added this to the 1.10.0 milestone Aug 19, 2022