
ENH: Added analytical formula for truncnorm entropy (#17748) #17874

Merged · 13 commits · Feb 14, 2023

Conversation

MatteoRaso
Contributor

Reference issue

What does this implement/fix?

Additional information

@dschmitz89 dschmitz89 self-requested a review January 28, 2023 08:59
Contributor

@dschmitz89 dschmitz89 left a comment

Hi again @MatteoRaso, thanks for all those entropy-related PRs!

Please have a look at the formula again; I think it is not quite right at the moment.

Regarding tests, we should add some with extreme values for a and b (a sketch follows below):

  • close bounds (e.g. b = a + 1e-5)
  • very large range

Regarding the lint errors: I struggled with the linter for some time too when starting SciPy development, as I was not familiar with these concepts. I trained myself to always run python dev.py lint before pushing; its output will tell you exactly what needs to be fixed.
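A rough sketch of what such extreme-value tests could look like (the parameter values, tolerance and mpmath reference below are illustrative, not necessarily what ends up in the PR):

import numpy as np
import pytest
from mpmath import mp
from scipy import stats

def _entropy_reference(a, b):
    # High-precision reference: integrate -pdf * log(pdf) over [a, b] with mpmath.
    mp.dps = 50
    a, b = mp.mpf(a), mp.mpf(b)
    Z = mp.ncdf(b) - mp.ncdf(a)
    def integrand(x):
        p = mp.npdf(x) / Z
        return -p * mp.log(p)
    return float(mp.quad(integrand, (a, b)))

@pytest.mark.parametrize("a, b", [(1.0, 1.0 + 1e-5),   # nearly coincident bounds
                                  (-20.0, 20.0)])      # very wide range
def test_truncnorm_entropy_extreme(a, b):
    np.testing.assert_allclose(stats.truncnorm.entropy(a, b),
                               _entropy_reference(a, b), rtol=1e-10)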

Comment on lines 8706 to 8711
def _entropy(self, a, b):
A = 0.5 * (1 + sc.erf(a / np.sqrt(2)))
B = 0.5 * (1 + sc.erf(b / np.sqrt(2)))
Z = B - A
C = (np.log(np.pi) + np.log(2) + 1) / 2
D = np.log(Z)
Contributor

I do not understand how this is the same as the formula found, for example, on Wikipedia:

[image: Wikipedia's entropy formula for the standard truncated normal, $h = \ln\!\left(\sqrt{2\pi e}\, Z\right) + \frac{a\,\varphi(a) - b\,\varphi(b)}{2Z}$, where $\varphi$ is the standard normal PDF and $Z = \Phi(b) - \Phi(a)$]

The second term using the PDF of the standard normal distribution is missing.

With $C = \ldots$ I guess you are taking the first logarithmic term apart; this is a good idea.

Please compare this with your closed PR, where I had suggested a formulation that reuses many existing functions. Throughout this file the standard normal CDF is available as _norm_cdf and the standard normal PDF as _norm_pdf.
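For reference, a minimal sketch of that kind of formulation for standard-normal arguments (loc=0, scale=1), reusing those private helpers; this is just an illustration, not the exact code suggested in the closed PR:

import numpy as np
from scipy.stats._continuous_distns import _norm_cdf, _norm_pdf

def _entropy_sketch(a, b):
    # Wikipedia's entropy formula for the standard normal truncated to [a, b].
    Z = _norm_cdf(b) - _norm_cdf(a)
    return (np.log(np.sqrt(2 * np.pi * np.e) * Z)
            + (a * _norm_pdf(a) - b * _norm_pdf(b)) / (2 * Z))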

Contributor Author

I do not understand how this is the same as the formula found for example in wikipedia

That's because it's not the same formula. The Wikipedia formula is wrong, so I derived the true formula by hand. I considered putting a comment explaining this, but I'm planning on changing the Wikipedia page once this PR is pulled, so it wouldn't make sense for me to do that.

Contributor

@mdhaber mdhaber Jan 29, 2023

Please double-check before changing the Wikipedia entry. It's possible that the first three implementations here are incorrect, but I would be very surprised if that were the case.

import numpy as np
from scipy import stats
from scipy import special as sc
from mpmath import mp
mp.dps = 50

def entropy_wikipedia(a, b):
    Z = stats.norm.cdf(b) - stats.norm.cdf(a)
    return (np.log(np.sqrt(2 * np.pi * np.e) * Z)
            + (a * stats.norm.pdf(a) - b * stats.norm.pdf(b))/(2*Z))

def entropy_mpmath(a, b):
    Z = mp.ncdf(b) - mp.ncdf(a)
    def pdf(x):
        return mp.npdf(x)/Z
    def dentropy(x):
        y = pdf(x)
        return -mp.log(y) * y
    return mp.quad(dentropy, (a, b))

def entropy_pr(a, b):
    A = 0.5 * (1 + sc.erf(a / np.sqrt(2)))
    B = 0.5 * (1 + sc.erf(b / np.sqrt(2)))
    Z = B - A
    C = (np.log(np.pi) + np.log(2) + 1) / 2
    D = np.log(Z)
    h = (C + D) / (2 * Z)
    return h

a, b = -1, 3
print(entropy_wikipedia(a, b))  # 1.0926338726407065
print(entropy_mpmath(a, b))  # 1.0926338726407065848966922217503260384942743385556
print(stats.truncnorm.entropy(a, b))  # 1.0926338726407065
print(entropy_pr(a, b))  # 0.7408253846201

Contributor Author

Please double-check before changing the Wikipedia entry. It's possible that the first three implementations here are incorrect, but I would be very surprised if that were the case.

You're right. I was using the formula for 0 to infinity, when I was actually supposed to be using the formula from a to b. Fixed it.
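(For reference: in the $[0, \infty)$ case the correction term really does vanish, since $0 \cdot \varphi(0) = 0$ and $b\,\varphi(b) \to 0$ as $b \to \infty$, which is why that simpler formula has no $\varphi$ terms; for a general interval $[a, b]$ it does not.)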

@dschmitz89 dschmitz89 added scipy.stats enhancement A new feature or improvement labels Jan 28, 2023
Contributor

@dschmitz89 dschmitz89 left a comment

There is also one styling issue: in Python, capitalized variables are usually used for constants. I am not sure how closely scipy follows these rules but it might make sense to use z instead of Z etc. Will let maintainers decide though.

I know this might seem nitpicky but all these conventions come from some bad experiences somewhere. For the fun of it, compare it to the famous "There's an interesting story behind every sign" scene from "How I met your mother".

A = sc.ndtr(a)
B = sc.ndtr(b)
Z = B - A
C = np.log(np.sqrt(2 * np.pi * np.e) * Z)
Contributor Author

Is this for readability? It does look slightly more readable with this change, but it doesn't seem right not to keep Z with the other term inside one instance of log.

Contributor

Theoretically, the argument of $\ln(\sqrt{2\pi e}\,Z)$ can overflow. Taking the logarithm apart as $\ln(\sqrt{2\pi e}) + \ln Z$ solves that:

import numpy as np
np.log(np.sqrt(2 * np.pi * np.e) * np.finfo(np.float64).max)  # RuntimeWarning: overflow encountered in double_scalars
np.log(np.sqrt(2 * np.pi * np.e)) + np.log(np.finfo(np.float64).max)  # 711.2016514265887

Of course this might seem like an artificial case, but taking logarithms apart is good practice anyway, and this is low-hanging fruit.

Contributor Author

The flip side to that is that taking the logarithm apart puts us at risk of underflow. I'm also not so sure that taking logarithms apart is best practice. It's a very small difference, but calling np.log twice instead of once is slower. It's not a huge deal, but I'd rather not split the logarithm.

Contributor

The flip side to that is that taking the logarithm apart puts us at risk of underflow

I don't see that.

It's a very small difference, but calling np.log twice instead of once is slower.

Note that even if it saves an evaluation of log, the current implementation adds a sqrt. On my machine, these tend to take the same amount of time. If we're going for speed, Z**.5 seems faster on my machine.

It's not a huge deal

Agreed if it's between these options, but if _log_gauss_mass were used, adding logZ would be the better choice.
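If anyone wants to check the timings on their own machine (purely illustrative; results will vary):

import timeit
import numpy as np

Z = 0.84
# one log of a product vs. a split log, and np.sqrt vs. ** 0.5
print(timeit.timeit(lambda: np.log(np.sqrt(2 * np.pi * np.e) * Z), number=100_000))
print(timeit.timeit(lambda: np.log(np.sqrt(2 * np.pi * np.e)) + np.log(Z), number=100_000))
print(timeit.timeit(lambda: np.sqrt(Z), number=100_000))
print(timeit.timeit(lambda: Z ** 0.5, number=100_000))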

Contributor Author

I don't see that.

I didn't mean as a realistic risk, just as a hypothetical risk, in the same way that we might hypothetically have an overflow if Z ends up being extremely large.

Contributor

@mdhaber mdhaber Feb 17, 2023

I think there is a slight difference in that np.log(np.sqrt(2 * np.pi * np.e)) + np.log(Z) cannot underflow for real Z. AFAICT, the closest it can come is one of:

import numpy as np
np.log(np.nextafter(1, -np.inf))  # -1.1102230246251565e-16
np.log(np.nextafter(1, np.inf))  # 2.2204460492503128e-16

There could be catastrophic cancellation and the result of the addition could be zero even if the true value is not exactly zero, but that's a different thing, and doing the multiplication instead doesn't really avoid it. I'd be curious to know if there's something I'm not thinking of, though.

It is possible for the imaginary component for complex Z to underflow (e.g. np.log(1e100 + 1e-3001j)), but I think we're only interested in real Z.

@dschmitz89 dschmitz89 added this to the 1.11.0 milestone Jan 29, 2023
MatteoRaso and others added 2 commits February 8, 2023 02:06
Co-authored-by: Daniel Schmitz <40656107+dschmitz89@users.noreply.github.com>
Co-authored-by: Daniel Schmitz <40656107+dschmitz89@users.noreply.github.com>
@MatteoRaso
Contributor Author

capitalized variables are usually used for constants

True, but the information for the distribution gives it as a capital Z, so I think it's best to leave it like that. Even though it's not exactly correct from a style point of view, I know there are other parts of SciPy where capitalized variables are used, so it's acceptable.


Comment on lines 1074 to 1080
# def cdf(x):
#     return 0.5 * (1 + mp.erf(x / mp.sqrt(2)))
#
# Z = cdf(b) - cdf(a)
#
# def pdf_standard_norm(x):
#     return (mp.exp(-0.5 * x ** 2) / mp.sqrt(2 * mp.pi))
Contributor

mpmath has functions available for the standard normal pdf and cdf: https://mpmath.org/doc/current/functions/expintegrals.html#npdf

They could save you some work here.
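For example (using one of the test cases above; illustrative only):

from mpmath import mp

mp.dps = 50
a, b = mp.mpf("0.6"), mp.mpf("0.7")
Z = mp.ncdf(b) - mp.ncdf(a)              # standard normal CDF difference
phi_a, phi_b = mp.npdf(a), mp.npdf(b)    # standard normal PDF values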

Contributor Author

The code was already run, so it doesn't matter. Thanks for the heads up, though.

Comment on lines 1068 to 1070
[(1, 1.1, -2.3030441048144877733524473299116314416365215401594087),
(0.6, 0.7, -2.3027610681852571124367653023002056177369149733744606),
(1, 3, 0.29622333134332169527554642294689301750721849032204733)])
Contributor

Suggested change
[(1, 1.1, -2.3030441048144877733524473299116314416365215401594087),
(0.6, 0.7, -2.3027610681852571124367653023002056177369149733744606),
(1, 3, 0.29622333134332169527554642294689301750721849032204733)])
[(0, 100, 0.7257913526447274),
(0.6, 0.7, -2.3027610681852573),
(1e-6, 2e-6, -13.815510557964274)])

Contributor Author

I added the input values, but I didn't shorten the length of the output, the reason being that I want the outputs to be perfectly reproducible if somebody wants to run the code I used to get them.

Contributor

To make the outputs reproducible, you have already provided the actual code that generated them. These types of unit tests compare two double-precision values; that's why, throughout the test_distributions.py file, multiprecision values are converted to doubles first (as far as I can see, at least). Let us please stay in line with standard scipy testing procedures here. For a big project like scipy, this type of standardization is important even if it might feel annoying sometimes.
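For example, the usual pattern is to compute the reference at high precision and then store it as a plain double (illustrative snippet, not taken from the test file):

from mpmath import mp

mp.dps = 50
ref = float(mp.log(mp.mpf(2)))   # 0.6931471805599453, stored as a double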

Contributor Author

Fair enough. I've made the changes you requested.

@dschmitz89 dschmitz89 self-requested a review February 12, 2023 08:21
Contributor

@dschmitz89 dschmitz89 left a comment

This looks good to me now.

@MatteoRaso
Contributor Author

@mdhaber Does this look good enough now?

@mdhaber
Contributor

mdhaber commented Feb 12, 2023

The rest of truncnorm was rewritten very carefully (twice) because we received bug reports about extreme values of the parameters. (The rewrite in gh-10104 helped, but gh-16586 rewrote it again because we received several new bug reports.) This implementation isn't really consistent with all of that; e.g.

        A = _norm_cdf(a)
        B = _norm_cdf(b)
        Z = B - A

could be:

        logZ = _log_gauss_mass(a, b)

which avoids underflows and follows @dschmitz89's suggestion to work in the log space.

That said, we haven't received any complaints about entropy in particular (for any distribution, really), so maybe a basic speed enhancement like this is OK. Is this at least as robust as the default implementation?

@MatteoRaso
Contributor Author

Is this at least as robust as the default implementation?

It seems to be a fair bit more robust, actually. I tested the function on values that the default implementation under/overflows on, and it worked fine.

Contributor

@dschmitz89 dschmitz89 left a comment

Let's add @mdhaber's suggestion. It should improve precision for small ranges of $[a,\ b]$. This suggestion is the last one from my side.

For large ranges, the standard normal entropy appears again asymptotically but finding a general formula would be difficult. Let's not overengineer this too much and get it in soon.
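For a very wide $[a,\ b]$, $Z \to 1$ and the $\varphi$-terms vanish, so the entropy tends to the standard normal value:

import numpy as np
0.5 * np.log(2 * np.pi * np.e)   # 1.4189385332046727, the standard normal entropy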

Comment on lines +8707 to +8710
A = _norm_cdf(a)
B = _norm_cdf(b)
Z = B - A
C = np.log(np.sqrt(2 * np.pi * np.e) * Z)
Contributor

Suggested change
A = _norm_cdf(a)
B = _norm_cdf(b)
Z = B - A
C = np.log(np.sqrt(2 * np.pi * np.e) * Z)
logZ = _log_gauss_mass(a, b)
return (np.log(np.sqrt(2 * np.pi * np.e)) + logZ +
        (a * _norm_pdf(a) - b * _norm_pdf(b)) / (2 * np.exp(logZ)))

Contributor Author

@MatteoRaso MatteoRaso Feb 13, 2023

Should improve precision for small ranges of [a, b]

Doesn't seem to be the case, actually.

scipy/stats/tests/test_distributions.py:1114: in test_extreme_entropy
    assert_allclose(stats.truncnorm.entropy(a, b), ref, rtol=1e-14)
        a          = 1e-11
        b          = 1.0001e-11
        ref        = -34.8181418495907
        self       = <scipy.stats.tests.test_distributions.TestTruncnorm object at 0x7f97edd412d0>
/usr/lib/python3.10/contextlib.py:79: in inner
    return func(*args, **kwds)
E   AssertionError: 
E   Not equal to tolerance rtol=1e-14, atol=0
E   
E   Mismatched elements: 1 / 1 (100%)
E   Max absolute difference: 0.99805745
E   Max relative difference: 0.02866487
E    x: array(-33.820084)
E    y: array(-34.818142)

This doesn't happen when using the current formula.

Contributor

@dschmitz89 dschmitz89 Feb 13, 2023

I get slightly different reference values with the following snippet (for the last test case -34.53877639491019):

from mpmath import mp

def truncnorm_mpmath_entropy(a, b):
    a = mp.mpf(a)
    b = mp.mpf(b)
    mp.dps = 500
    A = mp.ncdf(a)
    B = mp.ncdf(b)
    Z = B - A
    C = mp.log(mp.sqrt(2 * mp.pi * mp.e)) + mp.log(Z)
    D = (a * mp.npdf(a) - b * mp.npdf(b))/(2 * Z)
    return float(C + D)

The crucial point is the conversion of the inputs to multiprecision floats via mp.mpf(). I think I had pointed to another post regarding mpmath in the first entropy-related PR, @MatteoRaso (this one); sorry for missing it here before. The reference values for the tests unfortunately need to be recomputed.
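For the failing case above, this gives:

truncnorm_mpmath_entropy(1e-11, 1.0001e-11)   # -34.53877639491019, the corrected reference value quoted above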

Contributor

@mdhaber mdhaber Feb 13, 2023

The tricky thing is when a and b are both deep in the same tail. _log_gauss_mass is optimized for this case. It might be slightly better to use the naive implementation here when a and b are close to one another and right in the heart of the distribution, but either way you only get one digit of precision due to catastrophic cancellation:

import numpy as np
rng = np.random.default_rng(1638083107694713882823079058616272161)

from scipy.stats._continuous_distns import _log_gauss_mass, _norm_cdf

from mpmath import mp
mp.dps = 100
a, b = 1e-11, 1.0001e-11

def gauss_mass_mpmath(a, b):
    a, b = mp.mpf(a), mp.mpf(b)
    return np.float64(mp.ncdf(b) - mp.ncdf(a))

def gauss_mass_scipy(a, b):
    return _norm_cdf(b) - _norm_cdf(a)

ref = gauss_mass_mpmath(a, b)
res = gauss_mass_scipy(a, b)
res2 = np.exp(_log_gauss_mass(a, b))

(ref - res)/ref  # 0.16512507259888784
(ref - res2)/ref  # array([-0.67678942])  # worse, but both are terrible

The tests in test_extreme_entropy right now are essentially the same as for the half-normal and regular normal distributions, because one of the truncation points is either near 0 or deep in the opposite tail. I don't think they expose the shortcomings of the current approach (which is fine; we don't need to add tests that fail unless our goal is to improve the implementation to make them pass).

But again, I'm not saying that any adjustments are required; I just meant to make an observation. It sounds like this PR is better than the default implementation, and it might be too much work to go deeper. I don't know that simply using _log_gauss_mass for Z is enough; the rest of the formula would probably need to be logarithmized, too.
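For what it's worth, one possible direction for logarithmizing the rest (a rough sketch only, not what was merged, and it does not by itself cure the cancellation discussed above; it assumes the private helpers _log_gauss_mass and _norm_logpdf):

import numpy as np
from scipy import special as sc
from scipy.stats._continuous_distns import _log_gauss_mass, _norm_logpdf

def entropy_logspace_sketch(a, b):
    logZ = _log_gauss_mass(a, b)
    with np.errstate(divide='ignore'):   # log(0) if a or b is exactly 0
        terms = np.array([np.log(np.abs(a)) + _norm_logpdf(a),
                          np.log(np.abs(b)) + _norm_logpdf(b)])
    signs = np.array([np.sign(a), -np.sign(b)])
    # a*phi(a) - b*phi(b), assembled from log-space pieces with explicit signs
    log_mag, sign = sc.logsumexp(terms, b=signs, return_sign=True)
    correction = sign * np.exp(log_mag - np.log(2) - logZ)
    return np.log(np.sqrt(2 * np.pi * np.e)) + logZ + correction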

Contributor

@mdhaber mdhaber Feb 13, 2023

This doesn't happen when using the current formula.

Tests tend to pass when the reference values are produced using the implementation that is being tested.

My IDE is doing funny things. I think the problem was that import mpmath as mp needs to be from mpmath import mp.

Contributor

@mdhaber mdhaber left a comment

There are enough things to adjust that I'll submit a PR to @MatteoRaso's branch.

Update: see MatteoRaso#2. The 1e-11, 1.0001e-11 test is removed because it doesn't pass with the correct reference value. I didn't update to use _log_gauss_mass because that's only going to fix part of the problem; might as well make it obvious that this would need future work to be robust.

MAINT: stats.truncnorm.entropy: fix PEP8, tests
Contributor

@mdhaber mdhaber left a comment

Merging as an improvement. Perhaps someone can explore whether it makes sense to logarithmize the rest in a follow-up PR.
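Assembled from the hunks quoted in this thread, the merged _entropy is along these lines (a reconstruction for readability, not a verbatim copy of the committed code):

def _entropy(self, a, b):
    A = _norm_cdf(a)
    B = _norm_cdf(b)
    Z = B - A
    C = np.log(np.sqrt(2 * np.pi * np.e) * Z)
    D = (a * _norm_pdf(a) - b * _norm_pdf(b)) / (2 * Z)
    return C + D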

