
ENH: Let users set KDE bandwidth use a user-defined bandwidth function #6988

Closed · dbivolaru opened this issue Aug 25, 2020 · 5 comments · Fixed by #7002

Comments
@dbivolaru
Contributor

dbivolaru commented Aug 25, 2020

Describe the bug

Unexpected exception raised by KDE fit() when the input data yields an exactly zero bandwidth.

Code Sample (copy-paste works)

from statsmodels.nonparametric.kde import KDEUnivariate
import numpy as np

kde = KDEUnivariate(np.array([0.1, 0.1]))
kde.fit()

I checked whether there is a similar open bug and couldn't find one.
I also looked at the source code on master; I did not run it, but the behavior appears to be present there as well.

Expected Output

When running statsmodels in a pipeline, the user data might be anything, and one would expect it to be handled gracefully instead of raising an exception. This would let the user see a reasonable graph in e.g. seaborn instead of an error for a very simple case.

In [1]: print(kde.density)
Out[1]: array([ ... ])

In [2]: print(kde.support)
Out[2]: array([ ... ])

Also, the behavior for [0.1, 0.1] should be very similar to that for [0.1, 0.1000000000001] (consider details in points 1, 2, 3 below):

from statsmodels.nonparametric.kde import KDEUnivariate
import numpy as np

kde = KDEUnivariate(np.array([0.1, 0.1000000000001]))
kde.fit()
  1. The == 0 in nonparametric/bandwidths.py:171 select_bandwidth(), comparing with a float, is not good numerical-methods practice; the comparison should normally use a very small error term eps.

  2. The silverman bandwidth estimate might seem to be correct (== 0) in a situation where two observations are identical. The bandwidth of a Gaussian is actually not zero, but it is also not trivial to calculate.
    Closest reference I could find is this.
    Net, in this case the standard deviation estimator std() is not the best linear unbiased estimator of the standard deviation, due to the low number of samples and the uncertainty eps.
    Based on this approach, I would suggest using A = max(min(std(X), IQR/1.34), eps * scipy.stats.norm.ppf(q=1-eps/mu)) as input to the silverman or scott bandwidth estimators. If eps is not specified, just use 90% confidence directly; however, the user should also be allowed to specify it.
    The logic is quite simple: the uncertainty eps, compared to the average mu, gives a certain confidence level (90% in my example if we take eps=0.01, i.e. one digit lower), which translates to an equivalent standard deviation based on that uncertainty, i.e. about 1.28 * eps.
    TL;DR: eps acts as a lower bound on the standard deviation for cases with a low number of observations.

  3. While raising an exception in this case might seem pythonic, it's still not warranted: one could still produce a result here using a fitted Gaussian with avg=0.1 and std=0.012815515655446004 (based on point 2 above). Think about what a normal user has to do: handle the exception by monkey patching the KDE results so they can be used further down the code.
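The proposal in point 2 can be sketched as a small helper; this is my illustrative sketch of the suggested formula, not existing statsmodels code, and `lower_bounded_scale` and its default `eps` are hypothetical names:

```python
import numpy as np
from scipy import stats

def lower_bounded_scale(x, eps=0.01):
    """Sketch of the proposed scale A from point 2 (not a statsmodels API).

    A = max(min(std(X), IQR/1.34), eps * norm.ppf(1 - eps/mu)), so the scale
    fed to the scott/silverman rules can never collapse to exactly zero.
    """
    x = np.asarray(x, dtype=float)
    std = x.std(ddof=1)                              # sample standard deviation
    iqr = np.subtract(*np.percentile(x, [75, 25]))   # interquartile range
    mu = x.mean()
    # Floor derived from the measurement uncertainty eps relative to mu.
    floor = eps * stats.norm.ppf(q=1 - eps / mu)
    return max(min(std, iqr / 1.34), floor)

# Two identical observations: std() is exactly 0, but the floor keeps A > 0.
print(lower_bounded_scale([0.1, 0.1]))  # ≈ 0.0128155..., matching point 3
```

For [0.1, 0.1] with eps=0.01 this reproduces the std=0.012815515655446004 quoted in point 3, since eps/mu = 0.1 gives the 90% quantile norm.ppf(0.9) ≈ 1.2816.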

Output of import statsmodels.api as sm; sm.show_versions()

INSTALLED VERSIONS

Python: 3.8.5.final.0
OS: Linux 5.7.16-200.fc32.x86_64 #1 SMP Wed Aug 19 16:58:53 UTC 2020 x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

statsmodels

Installed: 0.11.0 (/usr/lib64/python3.8/site-packages/statsmodels)

Required Dependencies

cython: Not installed
numpy: 1.18.4 (/usr/lib64/python3.8/site-packages/numpy)
scipy: 1.4.1 (/usr/lib64/python3.8/site-packages/scipy)
pandas: 0.25.3 (/usr/lib64/python3.8/site-packages/pandas)
dateutil: 2.8.0 (/usr/lib/python3.8/site-packages/dateutil)
patsy: 0.5.1 (/usr/lib/python3.8/site-packages/patsy)

Optional Dependencies

matplotlib: 3.2.2 (/usr/lib64/python3.8/site-packages/matplotlib)
backend: Qt5Agg
cvxopt: Not installed
joblib: 0.13.2 (/usr/lib/python3.8/site-packages/joblib)

Developer Tools

IPython: 7.12.0 (/usr/lib/python3.8/site-packages/IPython)
jinja2: 2.11.2 (/usr/lib/python3.8/site-packages/jinja2)
sphinx: 2.2.2 (/usr/lib/python3.8/site-packages/sphinx)
pygments: 2.4.2 (/usr/lib/python3.8/site-packages/pygments)
pytest: 4.6.11 (/usr/lib/python3.8/site-packages)
virtualenv: Not installed

@bashtage
Member

In master you get

RuntimeError: Selected KDE bandwidth is 0. Cannot estimate density. Either provide the bandwidth during initialization or use an alternative method.

This is the expected outcome. It is then up to downstream users to try catch if they think they might pass unsuitable data.

@bashtage
Member

All you have to do is

from statsmodels.nonparametric.kde import KDEUnivariate
import numpy as np

kde = KDEUnivariate(np.array([0.1, 0.1]))
try:
    kde.fit()
except RuntimeError:
    kde.fit(bw=MY_PERSONAL_FAVORITE_BANDWIDTH_ESTIMATOR)

@dbivolaru
Contributor Author

Thanks for taking the time to look at it. That is exactly what I am doing; however, the behavior is not consistent.
With that workaround, one path raises an exception and falls back to the other bw estimator, while in the normal case I go with the default bw, resulting in numerically different results for numerically close inputs.

My point is KDEUnivariate(np.array([0.1, 0.1000000000001])) and KDEUnivariate(np.array([0.1, 0.1])) should provide similar results.

@bashtage
Member

Even in that case, one would raise an Exception and get the other bw estimator.

We try to not do magic things; if the function call asks for something it should either succeed, or if it cannot succeed, raise a meaningful error. This is why we will never override the option provided.

My point is KDEUnivariate(np.array([0.1, 0.1000000000001])) and KDEUnivariate(np.array([0.1, 0.1])) should provide similar results.

Those arrays are numerically different within the precision of any double precision IEEE floating point compliant computer, so I don't see why they should provide the same behavior.

a = np.array([0.1, 0.1])
b = np.array([0.1, 0.1000000000001])

a==b
Out[5]: array([ True, False])

There are many other places where the output is not continuous as values approach numeric limits, for example, anything that performs a matrix rank check and raises on singular arrays will be discontinuous local to the tolerance of the array's eigenvalues.
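As an illustration of that kind of discontinuity (my example, not from the thread): a matrix rank check flips as soon as a perturbation crosses the tolerance derived from the singular values:

```python
import numpy as np

# Two identical rows: one singular value is exactly 0, so rank is 1.
exact = np.array([[1.0, 1.0],
                  [1.0, 1.0]])

# Perturb one entry by 1e-8, well above matrix_rank's default
# SVD-based tolerance (~ max(shape) * eps * largest singular value).
perturbed = np.array([[1.0, 1.0],
                      [1.0, 1.0 + 1e-8]])

print(np.linalg.matrix_rank(exact))      # 1
print(np.linalg.matrix_rank(perturbed))  # 2
```

Two inputs that differ by 1e-8 give discontinuously different ranks, just as the two nearly identical arrays above give discontinuously different bandwidths.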

Ways forward

  1. IMO a better idea would be to let bw accept a Callable[[ndarray], float] so that a user could provide their favorite bandwidth function that handles edge cases the way they want, rather than some ad-hoc approach.
  2. Add a new string method that implements some minimum bandwidth.

As to your original suggestion,

A = max(min(std(X), IQR/1.34), eps * scipy.stats.norm.ppf(q=1-eps/mu))

this is problematic if mu is even modestly large, in which case you have eps*inf. In general, it is hard to set an absolute lower bound since KDE should work reasonably well for both

x = np.random.standard_normal(1000)
kde = KDEUnivariate(x)
# OMG, why doesn't fit return self?!?!?
kde.fit()

eps = np.finfo(float).eps
kde2 = KDEUnivariate(eps * x)
kde2.fit()

In [14]: kde2.bw / kde.bw
Out[14]: 2.220446049250313e-16

In [15]: eps
Out[15]: 2.220446049250313e-16

@bashtage bashtage changed the title BUG: exception for zero bandwidth ENH: Let users set KDE bandwidth use a user-defined bandwidth function Aug 25, 2020
@dbivolaru
Contributor Author

Excellent point with the eps / mu division.

The suggestions for 1 & 2 seem right initially; here is my input:

  1. A callable would be super flexible - this would be my main go-to. But then again, what would prevent someone from doing this even today:
     bw = callable(X)
     kde.fit(bw=bw)
  2. 'min_normal_reference', 'min_silverman' and 'min_scott' might be confusing to people, and I'm not sure everyone needs a min-bandwidth estimator. It would work though, I agree.
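The "do it yourself today" pattern from point 1 works because fit() already accepts a numeric bw; my_bw below is a hypothetical stand-in for whatever estimator the user prefers, with an arbitrary floor of 1e-3 chosen only for illustration:

```python
import numpy as np
from statsmodels.nonparametric.kde import KDEUnivariate

def my_bw(x):
    """Hypothetical user bandwidth: Silverman-style rule with a hard floor."""
    x = np.asarray(x, dtype=float)
    scale = max(x.std(ddof=1), 1e-3)       # never return a zero bandwidth
    return 0.9 * scale * len(x) ** (-1 / 5)

x = np.array([0.1, 0.1])
kde = KDEUnivariate(x)
kde.fit(bw=my_bw(x))   # a precomputed float, so select_bandwidth is bypassed
print(kde.bw)
```

Since the bandwidth is computed outside fit(), the degenerate-data case never reaches the zero-bandwidth check, and the RuntimeError from the earlier comment never fires.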

I did the most sensible thing in this case, got the book from Silverman, and looked it up. He does mention that the approximation is only valid in asymptotic cases (ref. p. 103-104, transcribed here under fair use) in the context of nearest-neighbour estimation. I believe his statement is equally valid for the normal "rule of thumb" case, also from p. 45.

If we presume that the likely values of f" are proportional to those of f (i.e. lower values of f" in the tails) then (5.9) suggests that the ideal local bandwidth will be proportional to f^(-1/5). This is a rather heuristic argument depending on asymptotic calculations and possibly rash assumptions.

I hope I'm not that rusty, but for a Gaussian:
[formula image omitted]

Considering [formula image omitted], that would make the bandwidth:
[formula image omitted]
Note that in his p. 45 exposition he considers the integral of f''^2, which is then asymptotic (tails at -inf and +inf); as such he only uses the last part of the term, with sigma to the 5th power. In my opinion the key is in the first part as well.

So for asymptotic cases x >> mu (Silverman's approximation from p. 45):
[formula image omitted]

For non-asymptotic cases x ~ mu:
[formula image omitted]

On top of that, the expansion of the exponential part of f comes into play, which for the non-asymptotic case we can approximate easily around zero as 1 - 0.5*(x-mu)^2/sigma^2, so another -2 in the exponent for our initial calculation.

This finally gives the non-asymptotic case
[formula image omitted]
which is exactly the reverse of the asymptotic case for the first part, plus some term.

Net, we also need to distinguish between [0.1, 0.1] and [0.1, ..., 0.1] with 200 elements. The former should have a high bandwidth, while the latter should have a very small, but still nonzero, bandwidth.

Based on this I would propose implementing a new string bandwidth 'non_asymptotic', as suggested in proposal 2, using the approach above. We still need to figure out the constants that go in front.

What I like about the new 'non_asymptotic' case is that if you had an array with two values that are quite far apart but that you know come from a normal distribution, the bandwidth would be quite big, which is what one would expect, instead of a graph with two bumps.
