ENH: Let users set KDE bandwidth using a user-defined bandwidth function #6988
Comments
In master you get
This is the expected outcome. It is then up to downstream users to wrap the call in try/except if they think they might pass unsuitable data.
All you have to do is
Thanks for taking the time to look at it. That is exactly what I am doing; however, the behavior is not consistent. My point is
We try to not do magic things; if the function call asks for something it should either succeed, or if it cannot succeed, raise a meaningful error. This is why we will never override the option provided.
Those arrays are numerically different within the precision of any double precision IEEE floating point compliant computer, so I don't see why they should provide the same behavior.
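A minimal sketch of this point, using only the Python standard library. The variable names are illustrative, and `statistics.pstdev` is an assumed stand-in for whatever spread estimate a bandwidth selector uses; it is not the statsmodels code path. The two arrays from the report differ by far more than double-precision resolution, so one has exactly zero spread and the other does not.

```python
import statistics
import sys

a = [0.1, 0.1]
b = [0.1, 0.1000000000001]

# Relative gap between the two values in b, compared with machine epsilon
# (the relative spacing of doubles near 1.0).
relative_gap = (b[1] - b[0]) / b[0]
machine_eps = sys.float_info.epsilon

# Population standard deviation as a stand-in spread estimate (assumption).
spread_a = statistics.pstdev(a)
spread_b = statistics.pstdev(b)

print(relative_gap > machine_eps)  # the two values in b are distinct doubles
print(spread_a == 0.0)             # identical observations give zero spread
print(spread_b > 0.0)              # distinct observations give nonzero spread
```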
There are many other places where the output is not continuous as values approach numeric limits, for example, anything that performs a matrix rank check and raises on singular arrays will be discontinuous local to the tolerance of the array's eigenvalues.
Ways forward
As to your original suggestion,
this is problematic if
Describe the bug
Unexpected exception raised for KDE fit() when input data has exactly zero bandwidth.
Code Sample (copy-paste works)
I checked whether there is a similar open bug and couldn't find one.
I looked at the source code on master but did not run it; the behavior seems to be there as well.
Expected Output
If statsmodels runs in a pipeline, the user data might be anything, and I would expect it to handle this gracefully instead of raising an exception. This would let the user see a nice graph in, e.g., seaborn instead of an error for a very simple case.
Also, the behavior of `[0.1, 0.1]` should be very similar to that of `[0.1, 0.1000000000001]` (consider the detailed points 1, 2, 3 below):

1. The `== 0` in `nonparametric/bandwidths.py:171 select_bandwidth()` compares against a float, which is not kosher numerical-methods practice; usually the comparison should use a very small error term `eps`.

2. The Silverman bandwidth estimate might seem to be correct (`== 0`) in a situation where two observations are the same. However, the bandwidth of a Gaussian is actually not zero, though it is also not trivial to calculate.

   Closest reference I could find is this.

   Net, in this case the standard deviation estimator `std()` is not the best linear unbiased estimator for the standard deviation, due to the low number of samples and the uncertainty `eps`.

   Based on this approach, I would suggest using in this case

   `A == max(min(std(X), IQR/1.34), eps * scipy.stats.norm.ppf(q=1-eps/mu))`

   as input to the Silverman or Scott bandwidth estimators. If `eps` is not specified, just use 90% confidence directly; however, the user should be allowed to specify it as well.

   The logic is quite simple: the uncertainty `eps`, when compared to the average `mu`, gives a certain confidence level (90% in my example if we take `eps=0.01`, i.e. one digit lower), which translates to an equivalent standard deviation based on that uncertainty, i.e. about `1.28 * eps`.

   TL;DR: `eps` acts as a lower bound on the standard deviation for cases with a low number of observations.

3. While raising an exception in this case might seem pythonic, it's still not warranted, as one could still get a result out of this using a fitted Gaussian with `avg=0.1` and `std=0.012815515655446004` (based on point 2 above). Think about what a normal user has to do: handle the exception by monkey patching the KDE results so they can be used further down the code.

Output of `import statsmodels.api as sm; sm.show_versions()`
INSTALLED VERSIONS
Python: 3.8.5.final.0
OS: Linux 5.7.16-200.fc32.x86_64 #1 SMP Wed Aug 19 16:58:53 UTC 2020 x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
statsmodels
Installed: 0.11.0 (/usr/lib64/python3.8/site-packages/statsmodels)
Required Dependencies
cython: Not installed
numpy: 1.18.4 (/usr/lib64/python3.8/site-packages/numpy)
scipy: 1.4.1 (/usr/lib64/python3.8/site-packages/scipy)
pandas: 0.25.3 (/usr/lib64/python3.8/site-packages/pandas)
dateutil: 2.8.0 (/usr/lib/python3.8/site-packages/dateutil)
patsy: 0.5.1 (/usr/lib/python3.8/site-packages/patsy)
Optional Dependencies
matplotlib: 3.2.2 (/usr/lib64/python3.8/site-packages/matplotlib)
backend: Qt5Agg
cvxopt: Not installed
joblib: 0.13.2 (/usr/lib/python3.8/site-packages/joblib)
Developer Tools
IPython: 7.12.0 (/usr/lib/python3.8/site-packages/IPython)
jinja2: 2.11.2 (/usr/lib/python3.8/site-packages/jinja2)
sphinx: 2.2.2 (/usr/lib/python3.8/site-packages/sphinx)
pygments: 2.4.2 (/usr/lib/python3.8/site-packages/pygments)
pytest: 4.6.11 (/usr/lib/python3.8/site-packages)
virtualenv: Not installed
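
The lower bound on the standard deviation proposed in this report, `max(min(std(X), IQR/1.34), eps * norm.ppf(1 - eps/mu))`, can be sketched as below. This is only an illustration under stated assumptions, not statsmodels code: the helper name `sigma_lower_bound` is hypothetical, `statistics.NormalDist().inv_cdf` is swapped in for `scipy.stats.norm.ppf` so that the example needs only the standard library, and the sample standard deviation and inclusive quartiles are assumed choices. The guard on `eps / mu` is also an addition, since the formula is undefined when the mean is zero.

```python
import statistics
from statistics import NormalDist


def sigma_lower_bound(x, eps=0.01):
    """Hypothetical helper: spread estimate floored at eps * z, per the report's formula."""
    mu = statistics.fmean(x)
    if mu == 0:
        # eps / mu is undefined for a zero mean (guard added for this sketch).
        raise ValueError("mean of x must be nonzero")
    ratio = eps / mu
    if not 0.0 < ratio < 1.0:
        # The normal quantile is only defined for probabilities in (0, 1).
        raise ValueError("eps / mu must lie strictly between 0 and 1")

    std = statistics.stdev(x)  # sample standard deviation (assumed choice)
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1

    # Stand-in for scipy.stats.norm.ppf(q=1 - eps/mu).
    z = NormalDist().inv_cdf(1.0 - ratio)
    return max(min(std, iqr / 1.34), eps * z)


result = sigma_lower_bound([0.1, 0.1])
print(result)  # about 0.0128
```

For `[0.1, 0.1]` with `eps=0.01`, both `std` and `IQR` are zero, so the floor `eps * z` is returned, which is roughly the `std=0.012815515655446004` quoted in the report. Inputs with a zero mean are rejected rather than handled.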