
KernelDensity incorrect handling of bandwidth #25623

Open
TaTKSM opened this issue Feb 16, 2023 · 6 comments


TaTKSM commented Feb 16, 2023

Describe the bug

I was using the kernel density estimator
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html
with 'silverman' or 'scott' as the bandwidth argument. I found that the bandwidth chosen automatically by the algorithm is independent of the actual scale of the dataset. In fact, I was shocked to find that the bandwidth calculation for 'silverman' and 'scott' in https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/neighbors/_kde.py does not look at the scale of the data at all.

Suppose I fit the model kde to some 2D data X and read off the bandwidth as kde.bandwidth_.
Next, I fit the model kde to the same 2D data X but with all elements multiplied by, say, 20, and again read off kde.bandwidth_.
These two values of kde.bandwidth_ are equal (the bandwidth is calculated from the shape of X alone, see the source code). But they should obviously differ by a factor of 20 if the bandwidth were computed in a truly data-adaptive manner.

For reference, SciPy's KDE https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html computes the covariance of the data to capture its scale. I think this is the right thing to do.
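
As a quick check (my own snippet, relying on gaussian_kde's public factor and covariance attributes), the dimensionless factor stays the same when the data is rescaled, but the effective bandwidth SciPy actually applies follows the scale:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.standard_normal(1000)

for scale in (1.0, 20.0):
    kde = stats.gaussian_kde(X * scale, bw_method="scott")
    # kde.factor is the dimensionless Scott factor n**(-1/(d + 4));
    # kde.covariance is factor**2 times the data covariance, so the
    # effective bandwidth sqrt(kde.covariance) grows with the data scale.
    print(kde.factor, np.sqrt(kde.covariance[0, 0]))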

Note that if the bandwidth is incorrect, everything derived from it is incorrect too, including the estimated probabilities of samples.

Steps/Code to Reproduce

import numpy as np
from sklearn.neighbors import KernelDensity
X = np.random.randn(1000, 2)
kde = KernelDensity(bandwidth='scott')
kde.fit(X)
print(kde.bandwidth_)

kde.fit(X * 20)
print(kde.bandwidth_)

Expected Results

Different bandwidths for data sets with different scales.

Actual Results

0.31622776601683794
0.31622776601683794

Versions

scikit-learn 1.2.1
TaTKSM added the Bug and Needs Triage labels on Feb 16, 2023

TaTKSM commented Feb 21, 2023

One can see how pathological the current implementation is by running the code below; the resulting plot is shown at the bottom. I'm using "silverman" here, but a similar result is obtained with "scott".
@glemaitre @jovan-stojanovic

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.neighbors import KernelDensity

N = 200
np.random.seed(1)

X = np.concatenate(
    (np.random.normal(0, 5, int(0.3 * N)), np.random.normal(15, 5, int(0.7 * N)))
)[:, np.newaxis]

X_plot = np.linspace(-15, 30, 1000)[:, np.newaxis]

true_dens = 0.3 * norm(loc=0, scale=5).pdf(X_plot[:, 0]) + 0.7 * norm(loc=15, scale=5).pdf(X_plot[:, 0])

fig, ax = plt.subplots()
ax.fill(X_plot[:, 0], true_dens, fc="black", alpha=0.2, label="input distribution")

colors = ["navy", "cornflowerblue", "darkorange"]
kernels = ["gaussian", "tophat", "epanechnikov"]
lw = 2

for color, kernel in zip(colors, kernels):
    kde = KernelDensity(kernel=kernel, bandwidth="silverman").fit(X)
    log_dens = kde.score_samples(X_plot)
    ax.plot(
        X_plot[:, 0],
        np.exp(log_dens),
        color=color,
        lw=lw,
        linestyle="-",
        label="kernel = '{0}'".format(kernel),
    )

ax.legend(loc="upper left")
ax.plot(X[:, 0], -0.005 - 0.01 * np.random.random(X.shape[0]), "+k")

plt.show()

Update: I replaced a link to the plot with direct embedding for better readability.


thomasjpfan commented Feb 24, 2023

Thank you for opening this! I agree this is a bug. Here is the same plot using SciPy's gaussian_kde, which estimates the bandwidth correctly:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy import stats

N = 200
rng = np.random.default_rng(42)

X = np.concatenate(
    (rng.normal(0, 5, int(0.3 * N)), rng.normal(15, 5, int(0.7 * N)))
)

X_plot = np.linspace(-15, 30, 1000)
true_dens = 0.3 * norm(loc=0, scale=5).pdf(X_plot) + 0.7 * norm(loc=15, scale=5).pdf(X_plot)

fig, ax = plt.subplots()
ax.fill(X_plot, true_dens, fc="black", alpha=0.2, label="input distribution")

colors = ["navy", "red"]
methods = ["silverman", "scott"]
lw = 2

for color, method in zip(colors, methods):
    kde = stats.gaussian_kde(X, bw_method=method)
    density = kde(X_plot)
    ax.plot(
        X_plot,
        density,
        color=color,
        lw=lw,
        linestyle="-",
        label="method = '{0}'".format(method),
    )

ax.legend(loc="upper left")

(Screenshot: SciPy gaussian_kde density estimates with 'silverman' and 'scott' plotted against the input distribution.)

thomasjpfan added the module:neighbors label and removed the Needs Triage label on Feb 24, 2023
@glemaitre (Member) commented:

@thomasjpfan could you report the factor given by SciPy together with our kde.bandwidth_?
I think they will be the same. I am not sure that we should alter the bandwidth directly, but rather scale it when passing it as the parameter h to the tree_?
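
In the meantime, scaling the rule-of-thumb factor by the data's spread can be done on the user side. A rough sketch (not a proposed API, assuming 1-D data and the 'scott' rule):

import numpy as np
from sklearn.neighbors import KernelDensity

X = 20 * np.random.randn(1000, 1)

n, d = X.shape
factor = n ** (-1.0 / (d + 4))       # the value bandwidth_ currently returns for "scott"
bandwidth = factor * X.std(ddof=1)   # scale it by the data's standard deviation
kde = KernelDensity(bandwidth=bandwidth).fit(X)
print(bandwidth)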

@thomasjpfan (Member) commented:

SciPy's factor and KernelDensity.bandwidth_ are the same, but SciPy also takes the covariance of the data into account when computing the density. Here is a simpler snippet comparing just 'silverman' between SciPy and scikit-learn:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy import stats
from sklearn.neighbors import KernelDensity

N = 200
rng = np.random.default_rng(42)

X = np.concatenate(
    (rng.normal(0, 5, int(0.3 * N)), rng.normal(15, 5, int(0.7 * N)))
)
X_plot = np.linspace(-15, 30, 1000)

fig, ax = plt.subplots()
ax.hist(X, fc="black", alpha=0.2, label="input dist", bins=20, density=True)

kde_sklearn = KernelDensity(bandwidth="silverman").fit(X[:, None])
density_sklearn = np.exp(kde_sklearn.score_samples(X_plot[:, None]))

kde_scipy = stats.gaussian_kde(X, bw_method="silverman")
density_scipy = kde_scipy(X_plot)

for (name, density) in [("sklearn", density_sklearn), ("scipy", density_scipy)]:
    ax.plot(
        X_plot,
        density,
        lw=2,
        linestyle="-",
        label=f"{name} density",
    )

ax.legend(loc="upper left")

(Screenshot: histogram of the data with the scikit-learn and SciPy 'silverman' density estimates overlaid.)

Concretely, SciPy computes the factor and the Cholesky decomposition of the covariance here and uses that decomposition to compute the density here.

I suspect transforming the data before passing it into tree_ would be easier than adjusting the bandwidth.
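
To make the transform idea concrete, here is a rough sketch (mine, not a proposed implementation): whiten the samples with the Cholesky factor of their covariance, fit an isotropic KDE with the dimensionless Silverman factor, and correct the density by the Jacobian of the transform.

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(42)
N = 200
X = np.concatenate(
    (rng.normal(0, 5, int(0.3 * N)), rng.normal(15, 5, int(0.7 * N)))
)[:, None]

n, d = X.shape
factor = (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))                # Silverman factor
L = np.linalg.cholesky(np.atleast_2d(np.cov(X, rowvar=False)))  # Sigma = L @ L.T
L_inv = np.linalg.inv(L)

# fit an isotropic KDE on the whitened samples
kde = KernelDensity(bandwidth=factor).fit(X @ L_inv.T)

def whitened_density(points):
    # change of variables: evaluate in the whitened space, divide by |det L|
    return np.exp(kde.score_samples(points @ L_inv.T)) / abs(np.linalg.det(L))

X_plot = np.linspace(-15, 30, 1000)[:, None]
density = whitened_density(X_plot)  # should closely match stats.gaussian_kde(X.ravel(), "silverman")(X_plot.ravel())

This is essentially what SciPy does internally, just expressed as a preprocessing step around KernelDensity.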


TaTKSM commented Feb 24, 2023

Thank you so much for the nice figures :-) @thomasjpfan

In addition to the data scale issue, I'm also concerned about possible anisotropy of the data.
Currently, sklearn's KernelDensity implementation seems to assume an isotropic kernel, while SciPy's Gaussian kernel seems to allow for an anisotropic covariance matrix (please correct me if I'm wrong).
In fact, data points (in 2D, for example) can have wildly different scales of nearest-neighbor distance along the X axis and the Y axis, as shown in the sample data below.
So I suppose the desiderata are to reflect the actual spread of the data points in the bandwidth, and to allow for anisotropic kernels (at least for the Gaussian kernel).

import numpy as np
import matplotlib.pyplot as plt

# strongly anisotropic 2D sample: variance 4000 along x, 1 along y, slightly sheared
mean = [0, 0]
cov = [[4000, 0], [0, 1]]
np.random.seed(12)
data = np.random.multivariate_normal(mean, cov, 220) @ np.array([[1., 0.04], [-0.04, 1.]])
plt.tick_params(labelsize=17)
plt.scatter(data[:, 0], data[:, 1], marker='^', s=33, c='cyan', edgecolors='grey')
plt.show()


Charlie-XIAO commented Jul 4, 2023

Ref #26658 (comment); could the maintainers take a look? @thomasjpfan Transforming the data seems reasonable, though maybe it should be done in the binary tree code when calling kernel_density, because that is a public method.

@TaTKSM at the bottom of the linked comment above there is a 2D example. Though different from yours, I think it is a similar case to #25623 (comment). Also, I think the covariance matrix approach can be used for kernels other than the Gaussian. In general, the estimator can be written as

$$ \frac{1}{N|H|}\sum_{i=1}^N\mathcal{K}\left(H^{-1}(x-x^i)\right), $$

where $H$ is the bandwidth matrix, chosen proportional to $\Sigma^{1/2}$, the square root of the covariance matrix. $\mathcal{K}$ can be any kernel here, I believe.
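
With $H = h\,\Sigma^{1/2}$ (my restatement), this reduces to the whitening trick discussed above, since

$$ H^{-1}(x - x^i) = \frac{1}{h}\,\Sigma^{-1/2}(x - x^i), \qquad |H| = h^d\,|\Sigma|^{1/2}, $$

so evaluating any isotropic kernel on the whitened data with scalar bandwidth $h$ and dividing the result by $|\Sigma|^{1/2}$ reproduces the estimator above.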
