
ENH: Added rectangular integral to multivariate_normal #9072

Merged (11 commits) on Aug 19, 2022

Conversation

@phildias (Contributor) commented Jul 25, 2018

The multivariate_normal's CDF function originally only gives you the lower tail CDF, i.e. the integral of the pdf from -infty to x. Now, I've created a new method that calculates the CDF between lower and upper bounds, also known as the rectangular integral of the pdf between the lower and upper bounds.

@phildias changed the title from "Added rectangular integral to multivariate_normal" to "ENH: Added rectangular integral to multivariate_normal" on Jul 25, 2018
@chrisb83 added the scipy.stats and enhancement labels on Jul 26, 2018
@phildias (Contributor, Author) commented:

Just to make my initial comment a bit clearer:

The multivariate normal already has a method called "cdf", which calculates the n-dimensional integral of the PDF from (-infty, -infty, ..., -infty) to (x1, x2, ..., xn), sometimes called the "lower tail" CDF.

The problem is that sometimes you want to calculate the n-dimensional rectangular CDF between a lower bound (z1, z2, ..., zn) and an upper bound (x1, x2, ..., xn), and trying to do so with a function that only gives the "lower tail" CDF is a huge hassle.

I just tweaked things around so that the rectangular CDF can be calculated using the already implemented mvn.mvnun function.

I hope that cleared things up a bit.
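To illustrate the "huge hassle": with only a lower-tail CDF, even a 2-D rectangular probability takes four CDF calls via inclusion-exclusion, and the count grows as 2**n with dimension. A minimal sketch, using the standard bivariate normal and an arbitrary box chosen for this example:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Standard bivariate normal, integrated over the box [0, 1] x [0, 1]
mvn = multivariate_normal(mean=np.zeros(2), cov=np.eye(2))
a, b = np.array([0.0, 0.0]), np.array([1.0, 1.0])

# 2-D inclusion-exclusion over the four corners of the rectangle;
# in n dimensions this requires 2**n lower-tail CDF evaluations.
p = (mvn.cdf(b)
     - mvn.cdf([a[0], b[1]])
     - mvn.cdf([b[0], a[1]])
     + mvn.cdf(a))
# p ≈ 0.1165 for this box
```

Beyond the bookkeeping, each corner term carries its own numerical error from the underlying quadrature, which is part of why a direct rectangular integral is attractive.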

@rgommers (Member) commented:

Hi @phildias, thanks for the contribution and idea.

In principle this functionality makes sense. However, I'm not sure I'm in favor of adding new methods to only one distribution. This will make the distributions framework less consistent. I think we should consider this either for all or for none of the distributions.

@phildias (Contributor, Author) commented Aug 1, 2018

Hi @rgommers! Thanks for the comment =)

Well, it seems like none of the other multivariate distributions in this file have a CDF associated with them. And this idea of a "rectangular integral" doesn't really apply to univariate CDFs; in those cases, you can just take CDF(upper) - CDF(lower). See what I mean?

So given that the multivariate_normal already is the odd one out by having CDF functionality, would it really be too odd to add this rectangular CDF feature?

@rkern (Member) commented Aug 1, 2018

In my opinion, the multivariate distributions are all somewhat unique creatures; many of them are defined over more interesting spaces that are imbued with special structure (multivariate normal lives up to its name and is the most boring). There isn't really an underlying system like the univariate distributions. Heck, some of them don't even define a pdf(). I'm not opposed to adding distribution-specific methods on principle. By their varied nature, there are a lot of interesting things one would like to do with any given distribution that just don't even make sense for another one.

The purpose of including the multivariate distributions in scipy.stats was never to provide a generic system where one can write generic code that accepts "a distribution object" and chugs away without needing to know what the distribution is, as was the purpose of the univariate distributions design. We explicitly disclaimed that when we added them in (well, I argued for that, at least; I'll have to search through the archives to dredge up the conversations). Rather, the purpose of including the multivariate distributions is that they are useful, and we needed to put them somewhere!

With respect to this PR, I'm not a fan of referring to this as evaluating a "CDF". I'm not sure anyone actually thinks of defining a CDF as a function that one then evaluates at the lower and upper bounding vectors. Rather, I just think of this method as an operation that integrates the PDF over a given domain, one that happens to be easy to specify, but isn't particularly special compared to other domains that are likely more interesting. Not entirely sure what a better, concise name would be, but avoiding "CDF" might help avoid the feeling that this is something that all multivariate distributions ought to have.

@rgommers (Member) commented Aug 1, 2018

> well, I argued for that, at least; I'll have to search through the archives to dredge up the conversations

No need, I think I remember that.

> With respect to this PR, I'm not a fan of referring to this as evaluating a "CDF". [...]

Good point; that is what triggered my initial comment. It's more like integrate_box in gaussian_kde - would that be a useful name here?

@mdhaber (Contributor) commented Feb 27, 2022

@phildias, sounds like this would be useful but that there was some uncertainty about the method name. Are you still interested in working on this?

@mdhaber (Contributor) commented Jul 24, 2022

@tupui @tirthasheshpatel @steppi @rkern this does look like it would be useful, and it seems to have been held up mostly because of the name.

The quantity calculated here is the integral of the multivariate normal PDF over a rectangular region. The equivalent quantity is calculated by the method integrate_box in gaussian_kde, but I don't see other software that uses that name for this quantity, and I would have expected a method with that name to integrate a more general function over a box, like the univariate method expect does. I haven't found a concise, standard name for the idea in the math literature either (and I looked in other contexts, too - e.g. the difference between the CDF at two points for univariate distributions).

In Matlab, the multivariate CDF function mvncdf accepts lower and upper limits, as does pmvnorm in R. That's probably because the underlying Fortran code mvndst, which we use, does this. I don't see a need to break this convention here: it seems very natural to me for a CDF function to accept lower and upper bounds because the CDF is often seen as an integral of the PDF, and the lower bound of integrals is not always minus infinity. I'd suggest we follow precedent and implement this feature by adding another argument to the cdf method of multivariate_normal that specifies the lower bound of the hyperrectangle. (The default is essentially a vector of negative infinities). It is not ideal that this argument has to go at the end of the signature, but there are other things I'd change about the multivariate distribution interfaces (e.g. see gh-16278), too, and I think this is still preferable to adding a new method.

Other thoughts? Would anyone like to take this over, or if I finish it, would you like to review?

@tupui (Member) commented Jul 25, 2022

I agree that this sounds useful and should be part of cdf itself. It makes sense to add a lower bound parameter, as the current parameter specifies the upper bound. Noting here that this could be seen as somewhat similar to NumPy's Generator.integers, but I personally don't like that the lower bound parameter corresponds to the upper bound when the upper bound is not specified; here we should try not to do that and just add a lower bound parameter.

I am happy to help you there.
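For reference, the NumPy pattern being criticized (assuming the comparison is to numpy.random.Generator.integers) looks like this: when the second argument is omitted, the first one silently becomes the exclusive upper bound.

```python
import numpy as np

rng = np.random.default_rng(12345)

# With `high` omitted, `low` is reinterpreted as the exclusive upper bound:
x = rng.integers(5)      # draws from [0, 5)
# With both given, the arguments mean what their names say:
y = rng.integers(2, 5)   # draws from [2, 5)
```

This is the argument-meaning flip the discussion wants to avoid for cdf.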

@WarrenWeckesser (Member) commented:

This could also be a useful feature for the univariate distributions, where the "hyperrectangle" is just an interval. It is common to want the probability of an interval a <= x <= b, and it is quite easy to lose a lot of precision unnecessarily by using the naive implementation cdf(b) - cdf(a). (At a minimum, an implementation should make the smart choice between cdf(b) - cdf(a) and sf(a) - sf(b).) I've always envisioned a separate method (admittedly with awkward ideas for names, such as delta_cdf or interval_prob), but adding a parameter to the CDF method also seems like a reasonable API.
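A minimal sketch of the precision point, using the univariate normal with an interval deep in the right tail (the interval is chosen here for illustration): both CDF values round to 1.0 in double precision, so the naive difference cancels to zero, while the survival-function form retains the tail mass.

```python
from scipy.stats import norm

a, b = 10.0, 11.0
naive = norm.cdf(b) - norm.cdf(a)   # both terms round to 1.0, so this is 0.0
smart = norm.sf(a) - norm.sf(b)     # survival functions keep the tiny tail mass
```

Here smart is on the order of 1e-24, which the naive form cannot represent at all; a built-in interval probability could make this choice automatically.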

@rkern (Member) commented Jul 25, 2022

If we're thinking of adding something equivalent on the univariate side, I'm even more in favor of a separate method rather than glomming more arguments into cdf(). I'd be okay with it having cdf in the name of it, though.

@mdhaber (Contributor) commented Jul 25, 2022

I would not add it (or anything else, for that matter) on the univariate side in the current framework. I have considered it in the context of gh-15928. Since methods would no longer need to accept shape parameters as arguments, I think keeping it in cdf would be reasonable, since these would be the only two arguments.

dist = stats.normal()
assert dist.cdf(x, y) == dist.cdf(y) - dist.cdf(x)

This would sometimes fail because the former could be overridden to be more accurate.

If someone comes up with a killer name for a separate method, I'd be ok with it, but I haven't seen it yet.

@rkern (Member) commented Jul 25, 2022

The issue with MATLAB's cognate is that it has the (upper) / (lower, upper) variation in its arguments that we all dislike. R's cognate always requires both lower and upper (reasonable for multivariate, but much less so for univariate CDFs, and even R doesn't try; CDFs are unary functions). Inserting that into our current implementation of cdf() forces us into one of those two choices. Or have a flipped argument structure with (upper, lower). Which are you going to choose?

I'm comfortable with cdf_rect(), or cdf_interval() if you want to be forward-looking towards using the same name over in univariate-land (under the existing or future framework).

@mdhaber (Contributor) commented Jul 25, 2022

> Which are you going to choose?

I had in mind the (upper) / (lower, upper) variation in arguments that we all dislike : )

I think I dislike that only when the keywords are named like low/high or similar, because then low is actually the upper end of the interval when only one argument is passed. But with keyword names x and y, it would make sense to me from a user standpoint. That is, I'm comfortable taking an integral from -inf to x or over the interval x to y. For implementation, I think it would be good to dispatch to different private methods depending on whether one or both arguments are passed.

I suppose I feel a mild irritation at switching the meaning of x and y depending on whether one or both are passed, and usually I'd disagree with that sort of thing. But for some reason that actually resonates with me in this case, and I'd guess it's what users would typically expect (without thinking about it carefully).

@rkern (Member) commented Jul 25, 2022

Yeah, it's more natural in MATLAB than in Python because they have different conventions for dealing with different argument structures. I prefer keeping to Python's standard semantics when we can. It's much easier to document for folks.

@tupui (Member) commented Jul 29, 2022

Ok so what do we do? 2 options from this discussion regarding multivariate distributions:

  1. dist.cdf(x, ..., *, y) (or something else like lower)
  2. dist.cdf_box(*, lower, upper) or dist.cdf_interval(*, lower, upper)

For dist.cdf_box, I am not that concerned about flipping arguments. The method being new, we can make both the upper and lower bounds keyword-only arguments.

@mdhaber (Contributor) commented Jul 29, 2022

cdf(x, y) is not an option because cdf already has this signature:

def cdf(self, x, mean=None, cov=1, allow_singular=False, maxpts=None, abseps=1e-5, releps=1e-5):

I'd suggest we tack lower on to the end of the list as a keyword-only argument of cdf and think about univariate distributions later (because the existing multivariate distribution interface should not be a model for a new univariate distribution interface).

@tupui (Member) commented Jul 29, 2022

> cdf(x, y) is not an option because cdf already has this signature:
>
> def cdf(self, x, mean=None, cov=1, allow_singular=False, maxpts=None, abseps=1e-5, releps=1e-5):
>
> I'd suggest we tack lower on to the end of the list as a keyword-only argument of cdf and think about univariate distributions later (because the existing multivariate distribution interface should not be a model for a new univariate distribution interface).

I was not clear; yes, I am only talking about multivariate here. And for y, yes, it can be named anything, such as lower, and it will be keyword-only. I will update above.

@rkern (Member) commented Jul 29, 2022

Add cdf_interval(lower, upper) and be done with it. Leave cdf() alone.

@tupui (Member) commented Jul 29, 2022

And my vote is for not adding an extra method and using cdf(x, ..., *, lower).

A review comment on the diff:

    maxpts, abseps, releps)[0]
    out = np.apply_along_axis(func1d, -1, x)
    out = np.apply_along_axis(func1d, -1, limits)

apply_along_axis applies func1d to every slice along the last axis. So we concatenate both lower and upper integration limits along the last axis and split them into separate lower and upper limits within func1d.
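The concatenate-then-split pattern described above can be sketched as follows, with a toy func1d standing in for the real mvn.mvnun call:

```python
import numpy as np

lower = np.array([[0.0, 0.0], [1.0, 1.0]])   # (..., ndim) lower limits
upper = np.array([[1.0, 2.0], [3.0, 4.0]])   # (..., ndim) upper limits

# Concatenate both sets of limits along the last axis...
limits = np.concatenate([lower, upper], axis=-1)

ndim = lower.shape[-1]

def func1d(row):
    # ...and split them back into lower/upper inside the 1-D worker,
    # which sees one slice along the last axis at a time.
    lo, hi = row[:ndim], row[ndim:]
    return np.sum(hi - lo)   # toy stand-in for mvn.mvnun(lo, hi, ...)

out = np.apply_along_axis(func1d, -1, limits)   # one scalar per limit pair
```

This keeps the broadcasting of lower against upper in NumPy while the 1-D worker only ever handles one pair of limit vectors.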

@mdhaber (Contributor) commented Aug 18, 2022

@tupui is this what you had in mind?

There is a weirdness, though. In 2d:

import numpy as np
from scipy.stats import multivariate_normal
rng = np.random.default_rng(2408071309372769818)
ndim = 2
mean = np.zeros(ndim)
cov = np.eye(ndim)
multivariate_normal.cdf([0, 0], mean, cov, lower_limit=[1, 1])  #  0.11651623566859809
multivariate_normal.cdf([1, 1], mean, cov, lower_limit=[0, 0])  #  0.11651623566859809
multivariate_normal.cdf([0, 1], mean, cov, lower_limit=[1, 0])  # -0.11651623566859809
multivariate_normal.cdf([1, 0], mean, cov, lower_limit=[0, 1])  # -0.11651623566859809

whereas in 3D and higher dimensions (and 1D, too, actually):

rng = np.random.default_rng(2408071309372769818)
ndim = 3
mean = np.zeros(ndim)
cov = np.eye(ndim)
multivariate_normal.cdf([1, 1, 1], mean, cov, lower_limit=[0, 0, 0])  # 0.03977220487716015
multivariate_normal.cdf([0, 0, 0], mean, cov, lower_limit=[1, 1, 1])  # 0.0
multivariate_normal.cdf([1, 0, 1], mean, cov, lower_limit=[0, 1, 0])  # 0.0

if any integration limits are reversed, the integral is zero.

I guess three options are:

  • Leave it as-is and document
  • Check the order of limits, and return 0 (even in 2d) whenever np.any(lower > x)
  • Check the order of limits, reorder them so that np.all(lower < x) holds when calling mvn, and determine the sign ourselves

@tupui (Member) commented Aug 18, 2022

Thanks @mdhaber. Yes the API corresponds to what I had in mind.

For the corner cases, I would not try to calculate anything if the lower value is above the requested value; maybe even error out. I find it strange to define an integral with swapped bounds otherwise. (I am not sure how this case is defined in maths.)

@mdhaber (Contributor) commented Aug 18, 2022

It is defined:
$\int_b^a f(x)\,dx \equiv -\int_a^b f(x)\,dx$
I think the behavior of the 2d case is correct, so personally, I think the best option is the third, and it's not hard to implement the fix. If (b < a).sum(axis=-1) is odd, the result should be negative; otherwise positive. I'll go ahead and do that when I'm back at a computer, if that sounds OK, because 0 is definitely not correct, and I don't think it's much harder to correct it than to error out.
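That sign rule can be sketched like this. signed_box_prob is a hypothetical helper, not SciPy API; it uses the lower_limit keyword that this PR ultimately added to multivariate_normal.cdf (SciPy >= 1.10):

```python
import numpy as np
from scipy.stats import multivariate_normal

mean, cov = np.zeros(2), np.eye(2)

def signed_box_prob(lower, upper):
    # Count reversed limit pairs; each swap flips the sign of the integral.
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    swaps = int((upper < lower).sum())
    lo, hi = np.minimum(lower, upper), np.maximum(lower, upper)
    # Integrate over the corrected box (lower_limit was added by this PR).
    p = multivariate_normal.cdf(hi, mean, cov, lower_limit=lo)
    return -p if swaps % 2 else p
```

With the limits of one coordinate reversed, signed_box_prob returns the negative of the properly ordered box probability, matching the 2d behavior shown above instead of returning 0.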

@tupui (Member) commented Aug 18, 2022

> It is defined.
> I think the behavior of the 2d case is correct, so personally, I think the best option is the third, and it's not hard to implement the fix. If (b < a).sum(axis=-1) is odd, the result should be negative, else positive. I'll go ahead and do that when I'm back at a computer, if that sounds OK, because 0 is definitely not correct, and I don't think it's much harder to correct it than to error out.

Right indeed, I went back to my textbooks... Sorry about that 🤦 In that case yes I agree with your fix.

@tupui (Member) left a review:

LGTM, let's get this in now. We always have time to iterate if there is something we missed.

Thanks @mdhaber for getting this to the finish line, @phildias for the initial work and everyone for the discussions and review.

@tupui tupui merged commit 23b0814 into scipy:main Aug 19, 2022
@tupui tupui added this to the 1.10.0 milestone Aug 19, 2022