
ENH: Let users set KDE bandwidth use a user-defined bandwidth function #6988

Closed · dbivolaru opened this issue Aug 25, 2020 · 5 comments · Fixed by #7002

Comments
@dbivolaru
Contributor

dbivolaru commented Aug 25, 2020

Describe the bug

Unexpected exception raised by KDE fit() when the input data yields an exactly zero bandwidth.

Code Sample (copy-paste works)

from statsmodels.nonparametric.kde import KDEUnivariate
import numpy as np

kde = KDEUnivariate(np.array([0.1, 0.1]))
kde.fit()

I checked whether there is a similar open bug and couldn't find one.
I also looked at the source code on master; I did not run it, but the behavior appears to be present there as well.

Expected Output

When running statsmodels in a pipeline, the user data might be anything, and one would expect it to be handled gracefully instead of raising an exception. This would let the user see a reasonable graph in e.g. seaborn instead of an error for a very simple case.

In [1]: print(kde.density)
Out[1]: array([ ... ])

In [2]: print(kde.support)
Out[2]: array([ ... ])

Also, the behavior for [0.1, 0.1] should be very similar to that for [0.1, 0.1000000000001] (consider details in points 1, 2, 3 below):

from statsmodels.nonparametric.kde import KDEUnivariate
import numpy as np

kde = KDEUnivariate(np.array([0.1, 0.1000000000001]))
kde.fit()
  1. The == 0 in nonparametric/bandwidths.py:171 select_bandwidth(), comparing with a float, is not good numerical-methods practice; the comparison should normally use a very small error term eps.

  2. The silverman bandwidth estimate might seem to be correct (== 0) in a situation where two observations are identical. The bandwidth of a Gaussian is actually not zero, but it is also not trivial to calculate.
    Closest reference I could find is this.
    Net, in this case the standard deviation estimator std() is not the best linear unbiased estimator of the standard deviation, due to the low number of samples and the uncertainty eps.
    Based on this approach, I would suggest using A = max(min(std(X), IQR/1.34), eps * scipy.stats.norm.ppf(q=1-eps/mu)) as input to the silverman or scott bandwidth estimators. If eps is not specified, just use 90% confidence directly; however, the user should also be allowed to specify it.
    The logic is quite simple: the uncertainty eps, compared to the average mu, gives a certain confidence level (90% in my example if we take eps=0.01, i.e. one digit lower), which translates to an equivalent standard deviation based on that uncertainty, i.e. about 1.28 * eps.
    TL;DR: eps acts as a lower bound on the standard deviation for cases with a low number of observations.

  3. While raising an exception in this case might seem pythonic, it's still not warranted: one could still produce a result here using a fitted Gaussian with avg=0.1 and std=0.012815515655446004 (based on point 2 above). Think about what a normal user has to do: handle the exception by monkey patching the KDE results so they can be used further down the code.
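The proposal in point 2 can be sketched as a small helper; this is my illustrative sketch of the suggested formula, not existing statsmodels code, and `lower_bounded_scale` and its default `eps` are hypothetical names:

```python
import numpy as np
from scipy import stats

def lower_bounded_scale(x, eps=0.01):
    """Sketch of the proposed scale A from point 2 (not a statsmodels API).

    A = max(min(std(X), IQR/1.34), eps * norm.ppf(1 - eps/mu)), so the scale
    fed to the scott/silverman rules can never collapse to exactly zero.
    """
    x = np.asarray(x, dtype=float)
    std = x.std(ddof=1)                              # sample standard deviation
    iqr = np.subtract(*np.percentile(x, [75, 25]))   # interquartile range
    mu = x.mean()
    # Floor derived from the measurement uncertainty eps relative to mu.
    floor = eps * stats.norm.ppf(q=1 - eps / mu)
    return max(min(std, iqr / 1.34), floor)

# Two identical observations: std() is exactly 0, but the floor keeps A > 0.
print(lower_bounded_scale([0.1, 0.1]))  # ≈ 0.0128155..., matching point 3
```

For [0.1, 0.1] with eps=0.01 this reproduces the std=0.012815515655446004 quoted in point 3, since eps/mu = 0.1 gives the 90% quantile norm.ppf(0.9) ≈ 1.2816.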

Output of import statsmodels.api as sm; sm.show_versions()

INSTALLED VERSIONS

Python: 3.8.5.final.0
OS: Linux 5.7.16-200.fc32.x86_64 #1 SMP Wed Aug 19 16:58:53 UTC 2020 x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

statsmodels

Installed: 0.11.0 (/usr/lib64/python3.8/site-packages/statsmodels)

Required Dependencies

cython: Not installed
numpy: 1.18.4 (/usr/lib64/python3.8/site-packages/numpy)
scipy: 1.4.1 (/usr/lib64/python3.8/site-packages/scipy)
pandas: 0.25.3 (/usr/lib64/python3.8/site-packages/pandas)
dateutil: 2.8.0 (/usr/lib/python3.8/site-packages/dateutil)
patsy: 0.5.1 (/usr/lib/python3.8/site-packages/patsy)

Optional Dependencies

matplotlib: 3.2.2 (/usr/lib64/python3.8/site-packages/matplotlib)
backend: Qt5Agg
cvxopt: Not installed
joblib: 0.13.2 (/usr/lib/python3.8/site-packages/joblib)

Developer Tools

IPython: 7.12.0 (/usr/lib/python3.8/site-packages/IPython)
jinja2: 2.11.2 (/usr/lib/python3.8/site-packages/jinja2)
sphinx: 2.2.2 (/usr/lib/python3.8/site-packages/sphinx)
pygments: 2.4.2 (/usr/lib/python3.8/site-packages/pygments)
pytest: 4.6.11 (/usr/lib/python3.8/site-packages)
virtualenv: Not installed

@bashtage
Member

In master you get

RuntimeError: Selected KDE bandwidth is 0. Cannot estimate density. Either provide the bandwidth during initialization or use an alternative method.

This is the expected outcome. It is then up to downstream users to try catch if they think they might pass unsuitable data.

@bashtage
Member

All you have to do is

from statsmodels.nonparametric.kde import KDEUnivariate
import numpy as np

kde = KDEUnivariate(np.array([0.1, 0.1]))
try:
    kde.fit()
except RuntimeError:
    kde.fit(bw=MY_PERSONAL_FAVORITE_BANDWIDTH_ESTIMATOR)

@dbivolaru
Contributor Author

Thanks for taking the time to look at it. That is exactly what I am doing; however, the behavior is not consistent.
With that workaround, one path raises an exception and falls back to the other bw estimator, while in the normal case I go with the default bw, resulting in numerically different results for numerically close inputs.

My point is KDEUnivariate(np.array([0.1, 0.1000000000001])) and KDEUnivariate(np.array([0.1, 0.1])) should provide similar results.

@bashtage
Member

Even in that case, one would raise an Exception and get the other bw estimator.

We try to not do magic things; if the function call asks for something it should either succeed, or if it cannot succeed, raise a meaningful error. This is why we will never override the option provided.

My point is KDEUnivariate(np.array([0.1, 0.1000000000001])) and KDEUnivariate(np.array([0.1, 0.1])) should provide similar results.

Those arrays are numerically different within the precision of any double precision IEEE floating point compliant computer, so I don't see why they should provide the same behavior.

a = np.array([0.1, 0.1])
b = np.array([0.1, 0.1000000000001])

a==b
Out[5]: array([ True, False])

There are many other places where the output is not continuous as values approach numeric limits, for example, anything that performs a matrix rank check and raises on singular arrays will be discontinuous local to the tolerance of the array's eigenvalues.
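As an illustration of that kind of discontinuity (my example, not from the thread): a matrix rank check flips as soon as a perturbation crosses the tolerance derived from the singular values:

```python
import numpy as np

# Two identical rows: one singular value is exactly 0, so rank is 1.
exact = np.array([[1.0, 1.0],
                  [1.0, 1.0]])

# Perturb one entry by 1e-8, well above matrix_rank's default
# SVD-based tolerance (~ max(shape) * eps * largest singular value).
perturbed = np.array([[1.0, 1.0],
                      [1.0, 1.0 + 1e-8]])

print(np.linalg.matrix_rank(exact))      # 1
print(np.linalg.matrix_rank(perturbed))  # 2
```

Two inputs that differ by 1e-8 give discontinuously different ranks, just as the two nearly identical arrays above give discontinuously different bandwidths.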

Ways forward

  1. IMO a better idea would be to let bw accept a Callable[[ndarray], float] so that a user could provide their favorite bandwidth function that handles edge cases the way they want, rather than some ad-hoc approach.
  2. Add a new string method that implements some minimum bandwidth.

As to your original suggestion,

A = max(min(std(X), IQR/1.34), eps * scipy.stats.norm.ppf(q=1-eps/mu))

this is problematic if mu is even modestly large, in which case you have eps*inf. In general, it is hard to set an absolute lower bound since KDE should work reasonably well for both

x = np.random.standard_normal(1000)
kde = KDEUnivariate(x)
# OMG, why doesn't fit return self?!?!?
kde.fit()

eps = np.finfo(float).eps
kde2 = KDEUnivariate(eps * x)
kde2.fit()

In [14]: kde2.bw / kde.bw
Out[14]: 2.220446049250313e-16

In [15]: eps
Out[15]: 2.220446049250313e-16

@bashtage bashtage changed the title BUG: exception for zero bandwidth ENH: Let users set KDE bandwidth use a user-defined bandwidth function Aug 25, 2020
@dbivolaru
Contributor Author

Excellent point with the eps / mu division.

The suggestions for 1 & 2 seem right initially; here is my input:

  1. A callable would be super flexible - this would be my main go-to. But then again, what would prevent someone from doing this even today:
     bw = callable(X)
     kde.fit(bw=bw)
  2. 'min_normal_reference', 'min_silverman' and 'min_scott' might be confusing to people, and I'm not sure everyone needs a min-bandwidth estimator. It would work though, I agree.
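The "do it yourself today" pattern from point 1 works because fit() already accepts a numeric bw; my_bw below is a hypothetical stand-in for whatever estimator the user prefers, with an arbitrary floor of 1e-3 chosen only for illustration:

```python
import numpy as np
from statsmodels.nonparametric.kde import KDEUnivariate

def my_bw(x):
    """Hypothetical user bandwidth: Silverman-style rule with a hard floor."""
    x = np.asarray(x, dtype=float)
    scale = max(x.std(ddof=1), 1e-3)       # never return a zero bandwidth
    return 0.9 * scale * len(x) ** (-1 / 5)

x = np.array([0.1, 0.1])
kde = KDEUnivariate(x)
kde.fit(bw=my_bw(x))   # a precomputed float, so select_bandwidth is bypassed
print(kde.bw)
```

Since the bandwidth is computed outside fit(), the degenerate-data case never reaches the zero-bandwidth check, and the RuntimeError from the earlier comment never fires.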

I did the most sensible thing in this case, got the book from Silverman, and looked it up. He does mention that the approximation is only valid in asymptotic cases (ref. p. 103-104, transcribed here under fair use) in the context of nearest-neighbour estimation. I believe his statement is equally valid for the normal "rule of thumb" case, also from p. 45.

If we presume that the likely values of f" are proportional to those of f (i.e. lower values of f" in the tails) then (5.9) suggests that the ideal local bandwidth will be proportional to f^(-1/5). This is a rather heuristic argument depending on asymptotic calculations and possibly rash assumptions.

I hope I'm not that rusty, but for a Gaussian:
[formula image omitted]

Considering [formula image omitted], that would make the bandwidth:
[formula image omitted]
Note that in his p. 45 exposition he considers the integral of f''^2, which is then asymptotic (tails at -inf and +inf); as such he only uses the last part of the term, with sigma to the 5th power. In my opinion the key is in the first part as well.

So for asymptotic cases x >> mu (Silverman's approximation from p. 45):
[formula image omitted]

For non-asymptotic cases x ~ mu:
[formula image omitted]

On top of that, the expansion of the exponential part of f comes into play, which for the non-asymptotic case we can approximate easily around zero as 1 - 0.5*(x-mu)^2/sigma^2, so another -2 in the exponent for our initial calculation.

This finally gives the non-asymptotic case
[formula image omitted]
which is exactly the reverse of the asymptotic case for the first part, plus some term.

Net, we also need to distinguish between [0.1, 0.1] and [0.1, ..., 0.1] with 200 elements. The former should have a high bandwidth, while the latter should have a very small, but still nonzero, bandwidth.

Based on this I would propose implementing a new string bandwidth 'non_asymptotic', as suggested in proposal 2, using the approach above. We still need to figure out the constants that go in front.

What I like about the new 'non_asymptotic' case is that if you had an array with two values that are quite far apart but that you know come from a normal distribution, the bandwidth would be quite big, which is what one would expect, instead of a graph with two bumps.
