
ENH: Added analytical formula for truncnorm entropy (#17748) #17874

Merged · 13 commits · Feb 14, 2023

Conversation

MatteoRaso
Contributor

Reference issue

What does this implement/fix?

Additional information

@dschmitz89 dschmitz89 self-requested a review January 28, 2023 08:59
Contributor

@dschmitz89 dschmitz89 left a comment

Hi again @MatteoRaso, thanks for all those entropy-related PRs!

Please have a look at the formula again; I think it is not quite right at the moment.

Regarding tests, we should add some with extreme values for a and b (a sketch follows below):

  • close bounds (e.g. b = a + 1e-5)
  • very large range

Regarding the lint errors: I struggled with the linter for some time too when starting SciPy development, as I was not familiar with these concepts. I trained myself to always run python dev.py lint before pushing; its output will tell you exactly what needs to be fixed.
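A rough sketch of what such extreme-value tests could look like (the parameter values, tolerance and mpmath reference below are illustrative, not necessarily what ends up in the PR):

import numpy as np
import pytest
from mpmath import mp
from scipy import stats

def _entropy_reference(a, b):
    # High-precision reference: integrate -pdf * log(pdf) over [a, b] with mpmath.
    mp.dps = 50
    a, b = mp.mpf(a), mp.mpf(b)
    Z = mp.ncdf(b) - mp.ncdf(a)
    def integrand(x):
        p = mp.npdf(x) / Z
        return -p * mp.log(p)
    return float(mp.quad(integrand, (a, b)))

@pytest.mark.parametrize("a, b", [(1.0, 1.0 + 1e-5),   # nearly coincident bounds
                                  (-20.0, 20.0)])      # very wide range
def test_truncnorm_entropy_extreme(a, b):
    np.testing.assert_allclose(stats.truncnorm.entropy(a, b),
                               _entropy_reference(a, b), rtol=1e-10)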

Comment on lines 8706 to 8711
def _entropy(self, a, b):
A = 0.5 * (1 + sc.erf(a / np.sqrt(2)))
B = 0.5 * (1 + sc.erf(b / np.sqrt(2)))
Z = B - A
C = (np.log(np.pi) + np.log(2) + 1) / 2
D = np.log(Z)
Contributor

I do not understand how this is the same as the formula found, for example, on Wikipedia:

[image: Wikipedia's entropy formula for the standard truncated normal, $h = \ln\!\left(\sqrt{2\pi e}\, Z\right) + \frac{a\,\varphi(a) - b\,\varphi(b)}{2Z}$, where $\varphi$ is the standard normal PDF and $Z = \Phi(b) - \Phi(a)$]

The second term using the PDF of the standard normal distribution is missing.

With $C = \ldots$ I guess you are taking the first logarithmic term apart; this is a good idea.

Please compare this with your closed PR, where I had suggested a formulation that reuses many existing functions. Throughout this file the standard normal CDF is available as _norm_cdf and the standard normal PDF as _norm_pdf.
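For reference, a minimal sketch of that kind of formulation for standard-normal arguments (loc=0, scale=1), reusing those private helpers; this is just an illustration, not the exact code suggested in the closed PR:

import numpy as np
from scipy.stats._continuous_distns import _norm_cdf, _norm_pdf

def _entropy_sketch(a, b):
    # Wikipedia's entropy formula for the standard normal truncated to [a, b].
    Z = _norm_cdf(b) - _norm_cdf(a)
    return (np.log(np.sqrt(2 * np.pi * np.e) * Z)
            + (a * _norm_pdf(a) - b * _norm_pdf(b)) / (2 * Z))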

Contributor Author

I do not understand how this is the same as the formula found for example in wikipedia

That's because it's not the same formula. The Wikipedia formula is wrong, so I derived the true formula by hand. I considered putting a comment explaining this, but I'm planning on changing the Wikipedia page once this PR is pulled, so it wouldn't make sense for me to do that.

Contributor

@mdhaber mdhaber Jan 29, 2023

Please double-check before changing the Wikipedia entry. It's possible that the first three implementations here are incorrect, but I would be very surprised if that were the case.

import numpy as np
from scipy import stats
from scipy import special as sc
from mpmath import mp
mp.dps = 50

def entropy_wikipedia(a, b):
    Z = stats.norm.cdf(b) - stats.norm.cdf(a)
    return (np.log(np.sqrt(2 * np.pi * np.e) * Z)
            + (a * stats.norm.pdf(a) - b * stats.norm.pdf(b))/(2*Z))

def entropy_mpmath(a, b):
    Z = mp.ncdf(b) - mp.ncdf(a)
    def pdf(x):
        return mp.npdf(x)/Z
    def dentropy(x):
        y = pdf(x)
        return -mp.log(y) * y
    return mp.quad(dentropy, (a, b))

def entropy_pr(a, b):
    A = 0.5 * (1 + sc.erf(a / np.sqrt(2)))
    B = 0.5 * (1 + sc.erf(b / np.sqrt(2)))
    Z = B - A
    C = (np.log(np.pi) + np.log(2) + 1) / 2
    D = np.log(Z)
    h = (C + D) / (2 * Z)
    return h

a, b = -1, 3
print(entropy_wikipedia(a, b))  # 1.0926338726407065
print(entropy_mpmath(a, b))  # 1.0926338726407065848966922217503260384942743385556
print(stats.truncnorm.entropy(a, b))  # 1.0926338726407065
print(entropy_pr(a, b))  # 0.7408253846201

Contributor Author

Please double-check before changing the Wikipedia entry. It's possible that the first three implementations here are incorrect, but I would be very surprised if that were the case.

You're right. I was using the formula for 0 to infinity, when I was actually supposed to be using the formula from a to b. Fixed it.
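(For reference: in the $[0, \infty)$ case the correction term really does vanish, since $0 \cdot \varphi(0) = 0$ and $b\,\varphi(b) \to 0$ as $b \to \infty$, which is why that simpler formula has no $\varphi$ terms; for a general interval $[a, b]$ it does not.)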

@dschmitz89 dschmitz89 added scipy.stats enhancement A new feature or improvement labels Jan 28, 2023
Contributor

@dschmitz89 dschmitz89 left a comment

There is also one styling issue: in Python, capitalized variables are usually used for constants. I am not sure how closely scipy follows these rules but it might make sense to use z instead of Z etc. Will let maintainers decide though.

I know this might seem nitpicky but all these conventions come from some bad experiences somewhere. For the fun of it, compare it to the famous "There's an interesting story behind every sign" scene from "How I met your mother".

A = sc.ndtr(a)
B = sc.ndtr(b)
Z = B - A
C = np.log(np.sqrt(2 * np.pi * np.e) * Z)
Contributor Author

Is this for readability? It does look slightly more readable with this change, but it doesn't seem right not to keep Z with the other term inside one instance of log.

Contributor

Theoretically, the argument of $\ln(\sqrt{2\pi e}\,Z)$ can overflow. Taking the logarithm apart as $\ln(\sqrt{2\pi e}) + \ln Z$ solves that:

import numpy as np
np.log(np.sqrt(2 * np.pi * np.e) * np.finfo(np.float64).max)  # RuntimeWarning: overflow encountered in double_scalars
np.log(np.sqrt(2 * np.pi * np.e)) + np.log(np.finfo(np.float64).max)  # 711.2016514265887

Of course this might seem like an artificial case, but taking logarithms apart is good practice anyway, and this is low-hanging fruit.

Contributor Author

The flip side to that is that taking the logarithm apart puts us at risk of underflow. I'm also not so sure that taking logarithms apart is best practice. It's a very small difference, but calling np.log twice instead of once is slower. It's not a huge deal, but I'd rather not split the logarithm.

Contributor

The flip side to that is that taking the logarithm apart puts us at risk of underflow

I don't see that.

It's a very small difference, but calling np.log twice instead of once is slower.

Note that even if it saves an evaluation of log, the current implementation adds a sqrt. On my machine, these tend to take the same amount of time. If we're going for speed, Z**.5 seems faster on my machine.

It's not a huge deal

Agreed if it's between these options, but if _log_gauss_mass were used, adding logZ would be the better choice.
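If anyone wants to check the timings on their own machine (purely illustrative; results will vary):

import timeit
import numpy as np

Z = 0.84
# one log of a product vs. a split log, and np.sqrt vs. ** 0.5
print(timeit.timeit(lambda: np.log(np.sqrt(2 * np.pi * np.e) * Z), number=100_000))
print(timeit.timeit(lambda: np.log(np.sqrt(2 * np.pi * np.e)) + np.log(Z), number=100_000))
print(timeit.timeit(lambda: np.sqrt(Z), number=100_000))
print(timeit.timeit(lambda: Z ** 0.5, number=100_000))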

Contributor Author

I don't see that.

I didn't mean as a realistic risk, just as a hypothetical risk, in the same way that we might hypothetically have an overflow if Z ends up being extremely large.

Contributor

@mdhaber mdhaber Feb 17, 2023

I think there is a slight difference in that np.log(np.sqrt(2 * np.pi * np.e)) + np.log(Z) cannot underflow for real Z. AFAICT, the closest it can come is one of:

import numpy as np
np.log(np.nextafter(1, -np.inf))  # -1.1102230246251565e-16
np.log(np.nextafter(1, np.inf))  # 2.2204460492503128e-16

There could be catastrophic cancellation and the result of the addition could be zero even if the true value is not exactly zero, but that's a different thing, and doing the multiplication instead doesn't really avoid it. I'd be curious to know if there's something I'm not thinking of, though.

It is possible for the imaginary component for complex Z to underflow (e.g. np.log(1e100 + 1e-3001j)), but I think we're only interested in real Z.

@dschmitz89 dschmitz89 added this to the 1.11.0 milestone Jan 29, 2023
MatteoRaso and others added 2 commits February 8, 2023 02:06
Co-authored-by: Daniel Schmitz <40656107+dschmitz89@users.noreply.github.com>
Co-authored-by: Daniel Schmitz <40656107+dschmitz89@users.noreply.github.com>
@MatteoRaso
Contributor Author

capitalized variables are usually used for constants

True, but the information for the distribution gives it as a capital Z, so I think it's best to leave it like that. Even though it's not exactly correct from a style point of view, I know there are other parts of SciPy where capitalized variables are used, so it's acceptable.


Comment on lines 1074 to 1080
# def cdf(x):
#     return 0.5 * (1 + mp.erf(x / mp.sqrt(2)))
#
# Z = cdf(b) - cdf(a)
#
# def pdf_standard_norm(x):
#     return (mp.exp(-0.5 * x ** 2) / mp.sqrt(2 * mp.pi))
Contributor

mpmath has functions available for the standard normal pdf and cdf: https://mpmath.org/doc/current/functions/expintegrals.html#npdf

They could save you some work here.
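For example (using one of the test cases above; illustrative only):

from mpmath import mp

mp.dps = 50
a, b = mp.mpf("0.6"), mp.mpf("0.7")
Z = mp.ncdf(b) - mp.ncdf(a)              # standard normal CDF difference
phi_a, phi_b = mp.npdf(a), mp.npdf(b)    # standard normal PDF values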

Contributor Author

The code was already run, so it doesn't matter. Thanks for the heads up, though.

Comment on lines 1068 to 1070
[(1, 1.1, -2.3030441048144877733524473299116314416365215401594087),
(0.6, 0.7, -2.3027610681852571124367653023002056177369149733744606),
(1, 3, 0.29622333134332169527554642294689301750721849032204733)])
Contributor

Suggested change
[(1, 1.1, -2.3030441048144877733524473299116314416365215401594087),
(0.6, 0.7, -2.3027610681852571124367653023002056177369149733744606),
(1, 3, 0.29622333134332169527554642294689301750721849032204733)])
[(0, 100, 0.7257913526447274),
(0.6, 0.7, -2.3027610681852573),
(1e-6, 2e-6, -13.815510557964274)])

Contributor Author

I added the input values, but I didn't shorten the length of the output, the reason being that I want the outputs to be perfectly reproducible if somebody wants to run the code I used to get them.

Contributor

To make the outputs reproducible, you have already provided the actual code that generated them. These types of unit tests compare two double-precision values; that's why, throughout the test_distributions.py file, multiprecision values are converted to doubles first (as far as I can see, at least). Let us please stay in line with standard scipy testing procedures here. For a big project like scipy, this type of standardization is important even if it might feel annoying sometimes.
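For example, the usual pattern is to compute the reference at high precision and then store it as a plain double (illustrative snippet, not taken from the test file):

from mpmath import mp

mp.dps = 50
ref = float(mp.log(mp.mpf(2)))   # 0.6931471805599453, stored as a double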

Contributor Author

Fair enough. I've made the changes you requested.

@dschmitz89 dschmitz89 self-requested a review February 12, 2023 08:21
Contributor

@dschmitz89 dschmitz89 left a comment

This looks good to me now.

@MatteoRaso
Contributor Author

@mdhaber Does this look good enough now?

@mdhaber
Contributor

mdhaber commented Feb 12, 2023

The rest of truncnorm was rewritten very carefully (twice) because we received bug reports about extreme values of the parameters. (The rewrite in gh-10104 helped, but gh-16586 rewrote it again because we received several new bug reports.) This implementation isn't really consistent with all of that; e.g.

        A = _norm_cdf(a)
        B = _norm_cdf(b)
        Z = B - A

could be:

        logZ = _log_gauss_mass(a, b)

which avoids underflows and follows @dschmitz89's suggestion to work in the log space.

That said, we haven't received any complaints about entropy in particular (for any distribution, really), so maybe a basic speed enhancement like this is OK. Is this at least as robust as the default implementation?

@MatteoRaso
Contributor Author

Is this at least as robust as the default implementation?

It seems to be a fair bit more robust, actually. I tested the function on values that the default implementation under/overflows on, and it worked fine.

Contributor

@dschmitz89 dschmitz89 left a comment

Let's add @mdhaber's suggestion. It should improve precision for small ranges of $[a,\ b]$. This suggestion is the last one from my side.

For large ranges, the standard normal entropy appears again asymptotically but finding a general formula would be difficult. Let's not overengineer this too much and get it in soon.
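For a very wide $[a,\ b]$, $Z \to 1$ and the $\varphi$-terms vanish, so the entropy tends to the standard normal value:

import numpy as np
0.5 * np.log(2 * np.pi * np.e)   # 1.4189385332046727, the standard normal entropy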

Comment on lines +8707 to +8710
A = _norm_cdf(a)
B = _norm_cdf(b)
Z = B - A
C = np.log(np.sqrt(2 * np.pi * np.e) * Z)
Contributor

Suggested change
A = _norm_cdf(a)
B = _norm_cdf(b)
Z = B - A
C = np.log(np.sqrt(2 * np.pi * np.e) * Z)
logZ = _log_gauss_mass(a, b)
return (np.log(np.sqrt(2 * np.pi * np.e)) + logZ +
        (a * _norm_pdf(a) - b * _norm_pdf(b)) / (2 * np.exp(logZ)))

Contributor Author

@MatteoRaso MatteoRaso Feb 13, 2023

Should improve precision for small ranges of [a, b]

Doesn't seem to be the case, actually.

scipy/stats/tests/test_distributions.py:1114: in test_extreme_entropy
    assert_allclose(stats.truncnorm.entropy(a, b), ref, rtol=1e-14)
        a          = 1e-11
        b          = 1.0001e-11
        ref        = -34.8181418495907
        self       = <scipy.stats.tests.test_distributions.TestTruncnorm object at 0x7f97edd412d0>
/usr/lib/python3.10/contextlib.py:79: in inner
    return func(*args, **kwds)
E   AssertionError: 
E   Not equal to tolerance rtol=1e-14, atol=0
E   
E   Mismatched elements: 1 / 1 (100%)
E   Max absolute difference: 0.99805745
E   Max relative difference: 0.02866487
E    x: array(-33.820084)
E    y: array(-34.818142)

This doesn't happen when using the current formula.

Contributor

@dschmitz89 dschmitz89 Feb 13, 2023

I get slightly different reference values with the following snippet (for the last test case -34.53877639491019):

from mpmath import mp

def truncnorm_mpmath_entropy(a, b):
    a = mp.mpf(a)
    b = mp.mpf(b)
    mp.dps = 500
    A = mp.ncdf(a)
    B = mp.ncdf(b)
    Z = B - A
    C = mp.log(mp.sqrt(2 * mp.pi * mp.e)) + mp.log(Z)
    D = (a * mp.npdf(a) - b * mp.npdf(b))/(2 * Z)
    return float(C + D)

The crucial point is the conversion of the inputs to multiprecision floats via mp.mpf(). I think I had pointed to another post regarding mpmath in the first entropy-related PR, @MatteoRaso (this one); sorry for missing it here before. The reference values for the tests unfortunately need to be recomputed.
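For the failing case above, this gives:

truncnorm_mpmath_entropy(1e-11, 1.0001e-11)   # -34.53877639491019, the corrected reference value quoted above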

Contributor

@mdhaber mdhaber Feb 13, 2023

The tricky thing is when a and b are both deep in the same tail. _log_gauss_mass is optimized for this case. It might be slightly better to use the naive implementation here when a and b are close to one another and right in the heart of the distribution, but either way you only get one digit of precision due to catastrophic cancellation:

import numpy as np
rng = np.random.default_rng(1638083107694713882823079058616272161)

from scipy.stats._continuous_distns import _log_gauss_mass, _norm_cdf

from mpmath import mp
mp.dps = 100
a, b = 1e-11, 1.0001e-11

def gauss_mass_mpmath(a, b):
    a, b = mp.mpf(a), mp.mpf(b)
    return np.float64(mp.ncdf(b) - mp.ncdf(a))

def gauss_mass_scipy(a, b):
    return _norm_cdf(b) - _norm_cdf(a)

ref = gauss_mass_mpmath(a, b)
res = gauss_mass_scipy(a, b)
res2 = np.exp(_log_gauss_mass(a, b))

(ref - res)/ref  # 0.16512507259888784
(ref - res2)/ref  # array([-0.67678942])  # worse, but both are terrible

The tests in test_extreme_entropy right now are essentially the same as for the half-normal and regular normal distributions, because one of the truncation points is either near 0 or deep in the opposite tail. I don't think they expose the shortcomings of the current approach (which is fine; we don't need to add tests that fail unless our goal is to improve the implementation to make them pass).

But again, I'm not saying that any adjustments are required; I just meant to make an observation. It sounds like this PR is better than the default implementation, and it might be too much work to go deeper. I don't know that simply using _log_gauss_mass for Z is enough; the rest of the formula would probably need to be logarithmized, too.
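For what it's worth, one possible direction for logarithmizing the rest (a rough sketch only, not what was merged, and it does not by itself cure the cancellation discussed above; it assumes the private helpers _log_gauss_mass and _norm_logpdf):

import numpy as np
from scipy import special as sc
from scipy.stats._continuous_distns import _log_gauss_mass, _norm_logpdf

def entropy_logspace_sketch(a, b):
    logZ = _log_gauss_mass(a, b)
    with np.errstate(divide='ignore'):   # log(0) if a or b is exactly 0
        terms = np.array([np.log(np.abs(a)) + _norm_logpdf(a),
                          np.log(np.abs(b)) + _norm_logpdf(b)])
    signs = np.array([np.sign(a), -np.sign(b)])
    # a*phi(a) - b*phi(b), assembled from log-space pieces with explicit signs
    log_mag, sign = sc.logsumexp(terms, b=signs, return_sign=True)
    correction = sign * np.exp(log_mag - np.log(2) - logZ)
    return np.log(np.sqrt(2 * np.pi * np.e)) + logZ + correction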

Contributor

@mdhaber mdhaber Feb 13, 2023

This doesn't happen when using the current formula.

Tests tend to pass when the reference values are produced using the implementation that is being tested.

My IDE is doing funny things. I think the problem was that import mpmath as mp needs to be from mpmath import mp.

Contributor

@mdhaber mdhaber left a comment

There are enough things to adjust that I'll submit a PR to @MatteoRaso's branch.

Update: see MatteoRaso#2. The 1e-11, 1.0001e-11 test is removed because it doesn't pass with the correct reference value. I didn't update to use _log_gauss_mass because that's only going to fix part of the problem; might as well make it obvious that this would need future work to be robust.

MAINT: stats.truncnorm.entropy: fix PEP8, tests
Contributor

@mdhaber mdhaber left a comment

Merging as an improvement. Perhaps someone can explore whether it makes sense to logarithmize the rest in a follow-up PR.
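Assembled from the hunks quoted in this thread, the merged _entropy is along these lines (a reconstruction for readability, not a verbatim copy of the committed code):

def _entropy(self, a, b):
    A = _norm_cdf(a)
    B = _norm_cdf(b)
    Z = B - A
    C = np.log(np.sqrt(2 * np.pi * np.e) * Z)
    D = (a * _norm_pdf(a) - b * _norm_pdf(b)) / (2 * Z)
    return C + D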

