ENH: Added analytical formula for truncnorm entropy (#17748) #17874
Conversation
Hi again @MatteoRaso, thanks for all those entropy-related PRs!
Please have a look at the formula again; I think it is not quite right at the moment.
Regarding tests, we should add some with extreme values for a and b:
- close bounds (e.g. b = a + 1e-5)
- a very large range

Regarding the lint errors: I struggled with the linter for some time too when starting SciPy development, as I was not familiar with these concepts. I tried to train muscle memory to always run `python dev.py lint` before pushing. That output will tell you exactly what needs to be fixed.
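As a sketch of what such extreme-value checks might look like (the bounds here are illustrative, not the PR's actual test cases):

```python
import numpy as np
from scipy import stats

# Illustrative extreme cases (assumed values, not from the PR's test suite):
cases = [
    (1.0, 1.0 + 1e-5),  # close bounds: b = a + 1e-5
    (-40.0, 40.0),      # very large range
]
for a, b in cases:
    h = stats.truncnorm.entropy(a, b)
    assert np.isfinite(h)
    print(f"a={a}, b={b}: entropy={h}")

# For a very wide range, the entropy should approach that of the
# standard normal distribution, 0.5 * log(2 * pi * e) ~ 1.4189.
```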
scipy/stats/_continuous_distns.py
```python
def _entropy(self, a, b):
    A = 0.5 * (1 + sc.erf(a / np.sqrt(2)))
    B = 0.5 * (1 + sc.erf(b / np.sqrt(2)))
    Z = B - A
    C = (np.log(np.pi) + np.log(2) + 1) / 2
    D = np.log(Z)
```
I do not understand how this is the same as the formula found, for example, on Wikipedia:
The second term using the PDF of the standard normal distribution is missing.
Please compare this with your closed PR, where I had suggested a formulation, as many existing functions can be reused here. Throughout this file, the standard normal CDF is used as `_norm_cdf` and the standard normal PDF as `_norm_pdf`.
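For reference, the Wikipedia formula being discussed (for a truncated normal with standardized bounds α and β, φ the standard normal PDF, Φ its CDF, and Z = Φ(β) − Φ(α)) can be reconstructed as:

```latex
H = \ln\!\left(\sqrt{2\pi e}\,\sigma Z\right)
    + \frac{\alpha\,\varphi(\alpha) - \beta\,\varphi(\beta)}{2Z}
```

Note the second term involving the standard normal PDF φ, which is the term said to be missing from the implementation.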
> I do not understand how this is the same as the formula found for example in wikipedia
That's because it's not the same formula. The Wikipedia formula is wrong, so I derived the true formula by hand. I considered putting a comment explaining this, but I'm planning on changing the Wikipedia page once this PR is pulled, so it wouldn't make sense for me to do that.
Please double-check before changing the Wikipedia entry. It's possible that the first three implementations here are incorrect, but I would be very surprised if that were the case.
```python
import numpy as np
from scipy import stats
from scipy import special as sc
from mpmath import mp

mp.dps = 50

def entropy_wikipedia(a, b):
    Z = stats.norm.cdf(b) - stats.norm.cdf(a)
    return (np.log(np.sqrt(2 * np.pi * np.e) * Z)
            + (a * stats.norm.pdf(a) - b * stats.norm.pdf(b)) / (2 * Z))

def entropy_mpmath(a, b):
    Z = mp.ncdf(b) - mp.ncdf(a)
    def pdf(x):
        return mp.npdf(x) / Z
    def dentropy(x):
        y = pdf(x)
        return -mp.log(y) * y
    return mp.quad(dentropy, (a, b))

def entropy_pr(a, b):
    A = 0.5 * (1 + sc.erf(a / np.sqrt(2)))
    B = 0.5 * (1 + sc.erf(b / np.sqrt(2)))
    Z = B - A
    C = (np.log(np.pi) + np.log(2) + 1) / 2
    D = np.log(Z)
    h = (C + D) / (2 * Z)
    return h

a, b = -1, 3
print(entropy_wikipedia(a, b))        # 1.0926338726407065
print(entropy_mpmath(a, b))           # 1.0926338726407065848966922217503260384942743385556
print(stats.truncnorm.entropy(a, b))  # 1.0926338726407065
print(entropy_pr(a, b))               # 0.7408253846201
```
> Please double-check before changing the Wikipedia entry. It's possible that the first three implementations here are incorrect, but I would be very surprised if that were the case.
You're right. I was using the formula for 0 to infinity, when I was actually supposed to be using the formula from a to b. Fixed it.
There is also one styling issue: in Python, capitalized variables are usually reserved for constants. I am not sure how closely SciPy follows these rules, but it might make sense to use `z` instead of `Z`, etc. Will let the maintainers decide, though.
I know this might seem nitpicky, but all these conventions come from some bad experience somewhere. For the fun of it, compare it to the famous "There's an interesting story behind every sign" scene from "How I Met Your Mother".
```python
A = sc.ndtr(a)
B = sc.ndtr(b)
Z = B - A
C = np.log(np.sqrt(2 * np.pi * np.e) * Z)
```
Is this for readability? It does look slightly more readable with this change, but it doesn't seem right not to keep Z with the other term inside one instance of log.
Theoretically, it is possible for the product to overflow even though the sum of the logarithms is perfectly representable:

```python
np.log(np.sqrt(2 * np.pi * np.e) * np.finfo(np.float64).max)  # RuntimeWarning: overflow encountered in double_scalars
np.log(np.sqrt(2 * np.pi * np.e)) + np.log(np.finfo(np.float64).max)  # 711.2016514265887
```

Of course this might seem like an artificial case, but taking logarithms apart is good practice anyway, and this is low-hanging fruit.
The flip side to that is that taking the logarithm apart puts us at risk of underflow. I'm also not too sure about taking logarithms apart being best practice. It's a very small difference, but calling `np.log` twice instead of once is slower. It's not a huge deal, but I'd rather not split the logarithm.
> The flip side to that is that taking the logarithm apart puts us at risk of underflow

I don't see that.

> It's a very small difference, but calling np.log twice instead of once is slower.

Note that even if it saves an evaluation of `log`, the current implementation adds a `sqrt`. On my machine, these tend to take the same amount of time. If we're going for speed, `Z**.5` seems faster on my machine.

> It's not a huge deal

Agreed if it's between these options, but if `_log_gauss_mass` were used, adding `logZ` would be the better choice.
> I don't see that.

I didn't mean as a realistic risk, just as a hypothetical risk, in the same way that we might hypothetically have an overflow if Z ends up being extremely large.
I think there is a slight difference in that `np.log(np.sqrt(2 * np.pi * np.e)) + np.log(Z)` cannot underflow for real `Z`. AFAICT, the closest it can come is one of:

```python
import numpy as np
np.log(np.nextafter(1, -np.inf))  # -1.1102230246251565e-16
np.log(np.nextafter(1, np.inf))   # 2.2204460492503128e-16
```

There could be catastrophic cancellation, and the result of the addition could be zero even if the true value is not exactly zero, but that's a different thing, and doing the multiplication instead doesn't really avoid it. I'd be curious to know if there's something I'm not thinking of, though.
It is possible for the imaginary component of complex `Z` to underflow (e.g. `np.log(1e100 + 1e-3001j)`), but I think we're only interested in real `Z`.
Co-authored-by: Daniel Schmitz <40656107+dschmitz89@users.noreply.github.com>
True, but the information for the distribution gives it as a capital Z, so I think it's best to leave it like that. Even though it's not exactly correct from a style point of view, there are other parts of SciPy where capitalized variables are used, so it's acceptable.
```python
# def cdf(x):
#     return 0.5 * (1 + mp.erf(x / mp.sqrt(2)))
#
# Z = cdf(b) - cdf(a)
#
# def pdf_standard_norm(x):
#     return (mp.exp(-0.5 * x ** 2) / mp.sqrt(2 * mp.pi))
```
mpmath has functions available for the standard normal pdf and cdf: https://mpmath.org/doc/current/functions/expintegrals.html#npdf
They could save you some work here.
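To illustrate, the commented-out helpers in the snippet above appear to agree with mpmath's built-ins; a quick sanity check (assuming mpmath is installed):

```python
from mpmath import mp

mp.dps = 50
x = mp.mpf("0.7")

# Manual versions, matching the commented-out helpers:
cdf_manual = 0.5 * (1 + mp.erf(x / mp.sqrt(2)))
pdf_manual = mp.exp(-0.5 * x ** 2) / mp.sqrt(2 * mp.pi)

# mpmath built-ins: mp.ncdf (standard normal CDF), mp.npdf (standard normal PDF)
print(mp.ncdf(x) - cdf_manual)  # should be ~0 at 50 digits
print(mp.npdf(x) - pdf_manual)  # should be ~0 at 50 digits
```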
The code was already run, so it doesn't matter. Thanks for the heads up, though.
```python
[(1, 1.1, -2.3030441048144877733524473299116314416365215401594087),
 (0.6, 0.7, -2.3027610681852571124367653023002056177369149733744606),
 (1, 3, 0.29622333134332169527554642294689301750721849032204733)])
```
Suggested change:

```python
[(0, 100, 0.7257913526447274),
 (0.6, 0.7, -2.3027610681852573),
 (1e-6, 2e-6, -13.815510557964274)])
```
I added the input values, but I didn't shorten the length of the output, the reason being that I want the outputs to be perfectly reproducible if somebody wants to run the code I used to get them.
To make the outputs reproducible, you have already provided the actual code that generated them. These unit tests compare two double-precision values; that's why, throughout the test_distributions.py file, multiprecision values are converted to doubles first (as far as I can see, at least). Let us please stay in line with standard SciPy testing procedures here. For a big project like SciPy, this type of standardization is important, even if it might feel annoying sometimes.
Fair enough. I've made the changes you requested.
Co-authored-by: Daniel Schmitz <40656107+dschmitz89@users.noreply.github.com>
This looks good to me now.
@mdhaber Does this look good enough now?
The rest of

```python
A = _norm_cdf(a)
B = _norm_cdf(b)
Z = B - A
```

could be:

```python
logZ = _log_gauss_mass(a, b)
```

which avoids underflows and follows @dschmitz89's suggestion to work in the log space. That said, we haven't received any complaints about it.
Seems to be a fair bit more. I tested the function on values that the default implementation under/overflows on, and it worked fine.
Let's add @mdhaber's suggestion. It should improve precision for small ranges of [a, b].
For large ranges, the standard normal entropy appears again asymptotically, but finding a general formula would be difficult. Let's not overengineer this too much and get it in soon.
```python
A = _norm_cdf(a)
B = _norm_cdf(b)
Z = B - A
C = np.log(np.sqrt(2 * np.pi * np.e) * Z)
```
Suggested change:

```python
logZ = _log_gauss_mass(a, b)
return (np.log(np.sqrt(2 * np.pi * np.e)) + logZ +
        (a * _norm_pdf(a) - b * _norm_pdf(b)) / (2 * np.exp(logZ)))
```
> Should improve precision for small ranges of [a, b]

Doesn't seem to be the case, actually:

```
scipy/stats/tests/test_distributions.py:1114: in test_extreme_entropy
    assert_allclose(stats.truncnorm.entropy(a, b), ref, rtol=1e-14)
        a = 1e-11
        b = 1.0001e-11
        ref = -34.8181418495907
        self = <scipy.stats.tests.test_distributions.TestTruncnorm object at 0x7f97edd412d0>
/usr/lib/python3.10/contextlib.py:79: in inner
    return func(*args, **kwds)
E   AssertionError:
E   Not equal to tolerance rtol=1e-14, atol=0
E
E   Mismatched elements: 1 / 1 (100%)
E   Max absolute difference: 0.99805745
E   Max relative difference: 0.02866487
E    x: array(-33.820084)
E    y: array(-34.818142)
```

This doesn't happen when using the current formula.
I get slightly different reference values with the following snippet (for the last test case, -34.53877639491019):

```python
def truncnorm_mpmath_entropy(a, b):
    a = mp.mpf(a)
    b = mp.mpf(b)
    mp.dps = 500
    A = mp.ncdf(a)
    B = mp.ncdf(b)
    Z = B - A
    C = mp.log(mp.sqrt(2 * mp.pi * mp.e)) + mp.log(Z)
    D = (a * mp.npdf(a) - b * mp.npdf(b)) / (2 * Z)
    return float(C + D)
```

The crucial point is the conversion of the inputs to multiprecision floats via `mp.mpf()`. I think I had pointed to another post here regarding mpmath in the first entropy-related PR, @MatteoRaso (this one); sorry for missing it here before. The reference values for the tests need to be recomputed, unfortunately.
The tricky thing is when `a` and `b` are both deep in the same tail. `_log_gauss_mass` is optimized for this case. It might be slightly better to use the naive implementation here when `a` and `b` are close to one another and right in the heart of the distribution, but either way you only get one digit of precision due to catastrophic cancellation:

```python
import numpy as np
rng = np.random.default_rng(1638083107694713882823079058616272161)
from scipy.stats._continuous_distns import _log_gauss_mass, _norm_cdf
from mpmath import mp
mp.dps = 100

a, b = 1e-11, 1.0001e-11

def gauss_mass_mpmath(a, b):
    a, b = mp.mpf(a), mp.mpf(b)
    return np.float64(mp.ncdf(b) - mp.ncdf(a))

def gauss_mass_scipy(a, b):
    return _norm_cdf(b) - _norm_cdf(a)

ref = gauss_mass_mpmath(a, b)
res = gauss_mass_scipy(a, b)
res2 = np.exp(_log_gauss_mass(a, b))
(ref - res)/ref   # 0.16512507259888784
(ref - res2)/ref  # array([-0.67678942])  # worse, but both are terrible
```

The tests in `test_extreme_entropy` right now are essentially the same as the half-normal and regular normal distribution, because one of the truncation points is either near 0 or deep in the opposite tail. I don't think they expose the shortcomings of the current approach (which is fine; we don't need to add tests that fail unless our goal is to improve the implementation to make them pass).

But again, I'm not saying that any adjustments are required; I just meant to make an observation. It sounds like this PR is better than the default implementation, and it might be too much work to go deeper. I don't know that simply using `_log_gauss_mass` for `Z` is enough; the rest of the formula would probably need to be logarithmized, too.
> This doesn't happen when using the current formula.

Tests tend to pass when the reference values are produced using the implementation that is being tested.

My IDE is doing funny things. I think the problem was that `import mpmath as mp` needs to be `from mpmath import mp`.
There are enough things to adjust that I'll submit a PR to M.R.'s branch.
Update: see MatteoRaso#2. The `1e-11, 1.0001e-11` test is removed because it doesn't pass with the correct reference value. I didn't update to use `_log_gauss_mass` because that's only going to fix part of the problem; might as well make it obvious that this would need future work to be robust.
MAINT: stats.truncnorm.entropy: fix PEP8, tests
Merging as an improvement. Perhaps someone can explore whether it makes sense to logarithmize the rest in a follow-up PR.