KernelDensity incorrect handling of bandwidth #25623
One can visually see how broken the current implementation is by running the code below; the resulting plot is shown at the bottom. I'm using "silverman" here, but a similar result is obtained with "scott" too.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.neighbors import KernelDensity
N = 200
np.random.seed(1)
X = np.concatenate(
    (np.random.normal(0, 5, int(0.3 * N)), np.random.normal(15, 5, int(0.7 * N)))
)[:, np.newaxis]
X_plot = np.linspace(-15, 30, 1000)[:, np.newaxis]
true_dens = 0.3 * norm(loc=0, scale=5).pdf(X_plot[:, 0]) + 0.7 * norm(loc=15, scale=5).pdf(X_plot[:, 0])
fig, ax = plt.subplots()
ax.fill(X_plot[:, 0], true_dens, fc="black", alpha=0.2, label="input distribution")
colors = ["navy", "cornflowerblue", "darkorange"]
kernels = ["gaussian", "tophat", "epanechnikov"]
lw = 2
for color, kernel in zip(colors, kernels):
    kde = KernelDensity(kernel=kernel, bandwidth="silverman").fit(X)
    log_dens = kde.score_samples(X_plot)
    ax.plot(
        X_plot[:, 0],
        np.exp(log_dens),
        color=color,
        lw=lw,
        linestyle="-",
        label="kernel = '{0}'".format(kernel),
    )
ax.legend(loc="upper left")
ax.plot(X[:, 0], -0.005 - 0.01 * np.random.random(X.shape[0]), "+k")
plt.show()
```

Update: I replaced a link to the plot with direct embedding for better readability.
Thank you for opening this! I agree this is a bug. Here is the same plot using SciPy's `gaussian_kde`, which estimates the correct bandwidth:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy import stats
N = 200
rng = np.random.default_rng(42)
X = np.concatenate(
    (rng.normal(0, 5, int(0.3 * N)), rng.normal(15, 5, int(0.7 * N)))
)
X_plot = np.linspace(-15, 30, 1000)
true_dens = 0.3 * norm(loc=0, scale=5).pdf(X_plot) + 0.7 * norm(loc=15, scale=5).pdf(X_plot)
fig, ax = plt.subplots()
ax.fill(X_plot, true_dens, fc="black", alpha=0.2, label="input distribution")
colors = ["navy", "red"]
methods = ["silverman", "scott"]
lw = 2
for color, method in zip(colors, methods):
    kde = stats.gaussian_kde(X, bw_method=method)
    density = kde(X_plot)
    ax.plot(
        X_plot,
        density,
        color=color,
        lw=lw,
        linestyle="-",
        label="method = '{0}'".format(method),
    )
ax.legend(loc="upper left") |
@thomasjpfan could you report the
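The question above is cut off, but presumably it asks for SciPy's fitted bandwidth quantities. As a sketch (my addition, using `gaussian_kde`'s public `factor` and `covariance` attributes on the same example data), they can be reported like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
N = 200
X = np.concatenate(
    (rng.normal(0, 5, int(0.3 * N)), rng.normal(15, 5, int(0.7 * N)))
)

kde = stats.gaussian_kde(X, bw_method="silverman")
print(kde.factor)               # dimensionless Silverman factor, depends only on n and d
print(kde.covariance)           # factor**2 times the data covariance
print(np.sqrt(kde.covariance))  # effective bandwidth, in the units of the data
```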
SciPy's factor and

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy import stats
from sklearn.neighbors import KernelDensity
N = 200
rng = np.random.default_rng(42)
X = np.concatenate(
    (rng.normal(0, 5, int(0.3 * N)), rng.normal(15, 5, int(0.7 * N)))
)
X_plot = np.linspace(-15, 30, 1000)
fig, ax = plt.subplots()
ax.hist(X, fc="black", alpha=0.2, label="input dist", bins=20, density=True)
kde_sklearn = KernelDensity(bandwidth="silverman").fit(X[:, None])
density_sklearn = np.exp(kde_sklearn.score_samples(X_plot[:, None]))
kde_scipy = stats.gaussian_kde(X, bw_method="silverman")
density_scipy = kde_scipy(X_plot)
for (name, density) in [("sklearn", density_sklearn), ("scipy", density_scipy)]:
    ax.plot(
        X_plot,
        density,
        lw=2,
        linestyle="-",
        label=f"{name} density",
    )
ax.legend(loc="upper left") Concretely SciPy computes the factor and cholesky decomposition of the covariance here and use the decomposition to compute the density here. I suspect transforming the data before passing it into |
Thank you so much for the nice figures :-) @thomasjpfan In addition to the data scale issue, I'm also concerned with possible anisotropy of the data.

```python
import numpy as np
import matplotlib.pyplot as plt

mean = [0, 0]
cov = [[4000, 0], [0, 1]]
np.random.seed(12)
data = np.random.multivariate_normal(mean, cov, 220) @ np.array([[1., 0.04], [-0.04, 1.]])
plt.tick_params(labelsize=17)
plt.scatter(data[:, 0], data[:, 1], marker='^', s=33, c='cyan', edgecolors='grey')
plt.show()
```
ref #26658 (comment). May the maintainers take a look?

@thomasjpfan Transforming the data seems reasonable, though maybe it should be done in the binary tree code when calling

@TaTKSM At the bottom of the linked comment above there is a 2D example. Though different from yours, I think it is a similar case to #25623 (comment). Also, I think the covariance matrix approach can be used for more than just the Gaussian kernel: in general one can use $K_H(u) = |H|^{-1/2} K(H^{-1/2} u)$, where $H$ is a symmetric positive-definite bandwidth matrix and $K$ is any kernel.
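To make the "transform the data" idea concrete, here is a hypothetical whitening workaround (my own sketch, not code from this thread or from scikit-learn): whiten the data with the Cholesky factor of its covariance, fit `KernelDensity` in the whitened space, and undo the transform with the Jacobian when evaluating the density.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[4000, 0], [0, 1]], 500)

# Whiten: after this transform the sample covariance is ~identity,
# so a single scalar bandwidth is appropriate in the whitened space.
L = np.linalg.cholesky(np.cov(X.T))
L_inv = np.linalg.inv(L)
X_white = X @ L_inv.T

kde = KernelDensity(bandwidth="silverman").fit(X_white)

def density(points):
    # Change of variables: p_X(x) = p_white(L^{-1} x) / |det L|
    log_dens = kde.score_samples(points @ L_inv.T)
    return np.exp(log_dens) / abs(np.linalg.det(L))

print(density(np.array([[0.0, 0.0]])))
```

With this transform the anisotropic example above is handled by a single scalar bandwidth, which is essentially what SciPy's covariance scaling achieves internally.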
Describe the bug
I was using the kernel density estimator
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html
with 'silverman' or 'scott' as the bandwidth argument. I found that the bandwidth automatically chosen by the algorithm is independent of the actual scale of the dataset. In fact, I was shocked to find that the bandwidth calculation in https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/neighbors/_kde.py for 'silverman' and 'scott' does not take the scale of the data into account at all.
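For concreteness, the rules there reduce to the following (a paraphrase, not the exact source; note that only the shape of `X` enters, never its values):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(1000, 2))  # any data; only its shape matters
n, d = X.shape
scott = n ** (-1.0 / (d + 4))                        # bandwidth="scott"
silverman = (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))  # bandwidth="silverman"
print(scott, silverman)  # identical for X and for 20 * X
```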
Suppose I fit the model `kde` to some 2D data `X` and get the bandwidth as `kde.bandwidth_`. Next, I fit the model `kde` to the same 2D data `X` but with all elements multiplied by, say, 20, and again get the bandwidth as `kde.bandwidth_`. I found that these two values of `kde.bandwidth_` are equal (the bandwidth is calculated from the shape of `X` alone, see the source code), but obviously they should differ by a factor of 20 if the bandwidth were really computed in a truly adaptive manner.

For your reference, I want to mention that SciPy's KDE https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html calculates the covariance of the data to extract its scale. I think this is the right thing to do.
Note that if the bandwidth is incorrect, everything else is incorrect too, including the probabilities of samples, etc.
Steps/Code to Reproduce
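The reproduction snippet did not survive here; the following minimal reconstruction (assuming 1000 two-dimensional samples, which matches the values under Actual Results) exhibits the bug:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))

# The fitted bandwidth is identical for X and for a 20x-rescaled copy of X.
print(KernelDensity(bandwidth="silverman").fit(X).bandwidth_)
print(KernelDensity(bandwidth="silverman").fit(20 * X).bandwidth_)
```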
Expected Results
Different bandwidths for data sets with different scales.
Actual Results
```
0.31622776601683794
0.31622776601683794
```
Versions