
KernelDensity incorrect handling of bandwidth #25623

Open
TaTKSM opened this issue Feb 16, 2023 · 6 comments


TaTKSM commented Feb 16, 2023

Describe the bug

I was using the kernel density estimator
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html
with 'silverman' or 'scott' as the bandwidth argument. I found that the bandwidth chosen automatically by the algorithm is independent of the actual scale of the dataset. In fact, I was shocked to find that the bandwidth calculation for 'silverman' and 'scott' in https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/neighbors/_kde.py does not look at the scale of the data at all.

Suppose I fit the model kde to some 2D data X and read off the bandwidth as kde.bandwidth_.
Next, I fit the model kde to the same 2D data X but with all elements multiplied by, say, 20, and again read off kde.bandwidth_.
These two values of kde.bandwidth_ are equal (the bandwidth is calculated from the shape of X alone, see the source code). But they should obviously differ by a factor of 20 if the bandwidth were computed in a truly data-adaptive manner.

For reference, SciPy's KDE https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html computes the covariance of the data to capture its scale. I think this is the right thing to do.
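
As a quick check (my own snippet, relying on gaussian_kde's public factor and covariance attributes), the dimensionless factor stays the same when the data is rescaled, but the effective bandwidth SciPy actually applies follows the scale:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.standard_normal(1000)

for scale in (1.0, 20.0):
    kde = stats.gaussian_kde(X * scale, bw_method="scott")
    # kde.factor is the dimensionless Scott factor n**(-1/(d + 4));
    # kde.covariance is factor**2 times the data covariance, so the
    # effective bandwidth sqrt(kde.covariance) grows with the data scale.
    print(kde.factor, np.sqrt(kde.covariance[0, 0]))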

Note that if the bandwidth is incorrect, everything derived from it is incorrect too, including the estimated probabilities of samples.

Steps/Code to Reproduce

import numpy as np
from sklearn.neighbors import KernelDensity
X = np.random.randn(1000, 2)
kde = KernelDensity(bandwidth='scott')
kde.fit(X)
print(kde.bandwidth_)

kde.fit(X * 20)
print(kde.bandwidth_)

Expected Results

Different bandwidths for data sets with different scales.

Actual Results

0.31622776601683794
0.31622776601683794

Versions

scikit-learn 1.2.1
TaTKSM added the Bug and Needs Triage labels on Feb 16, 2023

TaTKSM commented Feb 21, 2023

One can see how pathological the current implementation is by running the code below; the resulting plot is shown at the bottom. I'm using "silverman" here, but a similar result is obtained with "scott".
@glemaitre @jovan-stojanovic

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.neighbors import KernelDensity

N = 200
np.random.seed(1)

X = np.concatenate(
    (np.random.normal(0, 5, int(0.3 * N)), np.random.normal(15, 5, int(0.7 * N)))
)[:, np.newaxis]

X_plot = np.linspace(-15, 30, 1000)[:, np.newaxis]

true_dens = 0.3 * norm(loc=0, scale=5).pdf(X_plot[:, 0]) + 0.7 * norm(loc=15, scale=5).pdf(X_plot[:, 0])

fig, ax = plt.subplots()
ax.fill(X_plot[:, 0], true_dens, fc="black", alpha=0.2, label="input distribution")

colors = ["navy", "cornflowerblue", "darkorange"]
kernels = ["gaussian", "tophat", "epanechnikov"]
lw = 2

for color, kernel in zip(colors, kernels):
    kde = KernelDensity(kernel=kernel, bandwidth="silverman").fit(X)
    log_dens = kde.score_samples(X_plot)
    ax.plot(
        X_plot[:, 0],
        np.exp(log_dens),
        color=color,
        lw=lw,
        linestyle="-",
        label="kernel = '{0}'".format(kernel),
    )

ax.legend(loc="upper left")
ax.plot(X[:, 0], -0.005 - 0.01 * np.random.random(X.shape[0]), "+k")

plt.show()

Update: I replaced a link to the plot with direct embedding for better readability.


thomasjpfan commented Feb 24, 2023

Thank you for opening this! I agree this is a bug. Here is the same plot using SciPy's gaussian_kde, which estimates the bandwidth correctly:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy import stats

N = 200
rng = np.random.default_rng(42)

X = np.concatenate(
    (rng.normal(0, 5, int(0.3 * N)), rng.normal(15, 5, int(0.7 * N)))
)

X_plot = np.linspace(-15, 30, 1000)
true_dens = 0.3 * norm(loc=0, scale=5).pdf(X_plot) + 0.7 * norm(loc=15, scale=5).pdf(X_plot)

fig, ax = plt.subplots()
ax.fill(X_plot, true_dens, fc="black", alpha=0.2, label="input distribution")

colors = ["navy", "red"]
methods = ["silverman", "scott"]
lw = 2

for color, method in zip(colors, methods):
    kde = stats.gaussian_kde(X, bw_method=method)
    density = kde(X_plot)
    ax.plot(
        X_plot,
        density,
        color=color,
        lw=lw,
        linestyle="-",
        label="method = '{0}'".format(method),
    )

ax.legend(loc="upper left")

(Screenshot: SciPy gaussian_kde density estimates with 'silverman' and 'scott' plotted against the input distribution.)

thomasjpfan added the module:neighbors label and removed the Needs Triage label on Feb 24, 2023
@glemaitre (Member) commented:

@thomasjpfan could you report the factor given by SciPy together with our kde.bandwidth_?
I think they will be the same. I am not sure that we should alter the bandwidth directly, but rather scale it when passing it as the parameter h to the tree_?
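
In the meantime, scaling the rule-of-thumb factor by the data's spread can be done on the user side. A rough sketch (not a proposed API, assuming 1-D data and the 'scott' rule):

import numpy as np
from sklearn.neighbors import KernelDensity

X = 20 * np.random.randn(1000, 1)

n, d = X.shape
factor = n ** (-1.0 / (d + 4))       # the value bandwidth_ currently returns for "scott"
bandwidth = factor * X.std(ddof=1)   # scale it by the data's standard deviation
kde = KernelDensity(bandwidth=bandwidth).fit(X)
print(bandwidth)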

@thomasjpfan (Member) commented:

SciPy's factor and KernelDensity.bandwidth_ are the same, but SciPy also takes the covariance of the data into account when computing the density. Here is a simpler snippet comparing just 'silverman' between SciPy and scikit-learn:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy import stats
from sklearn.neighbors import KernelDensity

N = 200
rng = np.random.default_rng(42)

X = np.concatenate(
    (rng.normal(0, 5, int(0.3 * N)), rng.normal(15, 5, int(0.7 * N)))
)
X_plot = np.linspace(-15, 30, 1000)

fig, ax = plt.subplots()
ax.hist(X, fc="black", alpha=0.2, label="input dist", bins=20, density=True)

kde_sklearn = KernelDensity(bandwidth="silverman").fit(X[:, None])
density_sklearn = np.exp(kde_sklearn.score_samples(X_plot[:, None]))

kde_scipy = stats.gaussian_kde(X, bw_method="silverman")
density_scipy = kde_scipy(X_plot)

for (name, density) in [("sklearn", density_sklearn), ("scipy", density_scipy)]:
    ax.plot(
        X_plot,
        density,
        lw=2,
        linestyle="-",
        label=f"{name} density",
    )

ax.legend(loc="upper left")

(Screenshot: histogram of the data with the scikit-learn and SciPy 'silverman' density estimates overlaid.)

Concretely, SciPy computes the factor and the Cholesky decomposition of the covariance here and uses that decomposition to compute the density here.

I suspect transforming the data before passing it into tree_ would be easier than adjusting the bandwidth.
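
To make the transform idea concrete, here is a rough sketch (mine, not a proposed implementation): whiten the samples with the Cholesky factor of their covariance, fit an isotropic KDE with the dimensionless Silverman factor, and correct the density by the Jacobian of the transform.

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(42)
N = 200
X = np.concatenate(
    (rng.normal(0, 5, int(0.3 * N)), rng.normal(15, 5, int(0.7 * N)))
)[:, None]

n, d = X.shape
factor = (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))                # Silverman factor
L = np.linalg.cholesky(np.atleast_2d(np.cov(X, rowvar=False)))  # Sigma = L @ L.T
L_inv = np.linalg.inv(L)

# fit an isotropic KDE on the whitened samples
kde = KernelDensity(bandwidth=factor).fit(X @ L_inv.T)

def whitened_density(points):
    # change of variables: evaluate in the whitened space, divide by |det L|
    return np.exp(kde.score_samples(points @ L_inv.T)) / abs(np.linalg.det(L))

X_plot = np.linspace(-15, 30, 1000)[:, None]
density = whitened_density(X_plot)  # should closely match stats.gaussian_kde(X.ravel(), "silverman")(X_plot.ravel())

This is essentially what SciPy does internally, just expressed as a preprocessing step around KernelDensity.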


TaTKSM commented Feb 24, 2023

Thank you so much for the nice figures :-) @thomasjpfan

In addition to the data scale issue, I'm also concerned about possible anisotropy of the data.
Currently, sklearn's KernelDensity implementation seems to assume an isotropic kernel, while SciPy's Gaussian kernel seems to allow for an anisotropic covariance matrix (please correct me if I'm wrong).
In fact, data points (in 2D, for example) can have wildly different scales of nearest-neighbor distance along the X axis and the Y axis, as shown in the sample data below.
So I suppose the desiderata are to reflect the actual spread of the data points in the bandwidth, and to allow for anisotropic kernels (at least for the Gaussian kernel).

import numpy as np
import matplotlib.pyplot as plt

# strongly anisotropic 2D sample: variance 4000 along x, 1 along y, slightly sheared
mean = [0, 0]
cov = [[4000, 0], [0, 1]]
np.random.seed(12)
data = np.random.multivariate_normal(mean, cov, 220) @ np.array([[1., 0.04], [-0.04, 1.]])
plt.tick_params(labelsize=17)
plt.scatter(data[:, 0], data[:, 1], marker='^', s=33, c='cyan', edgecolors='grey')
plt.show()


Charlie-XIAO commented Jul 4, 2023

Ref #26658 (comment); could the maintainers take a look? @thomasjpfan Transforming the data seems reasonable, though maybe it should be done in the binary tree code when calling kernel_density, because that is a public method.

@TaTKSM at the bottom of the linked comment above there is a 2D example. Though different from yours, I think it is a similar case to #25623 (comment). Also, I think the covariance matrix approach can be used for kernels other than the Gaussian. In general, the estimator can be written as

$$ \frac{1}{N|H|}\sum_{i=1}^N\mathcal{K}\left(H^{-1}(x-x^i)\right), $$

where $H$ is the bandwidth matrix, chosen proportional to $\Sigma^{1/2}$, the square root of the covariance matrix. $\mathcal{K}$ can be any kernel here, I believe.
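
With $H = h\,\Sigma^{1/2}$ (my restatement), this reduces to the whitening trick discussed above, since

$$ H^{-1}(x - x^i) = \frac{1}{h}\,\Sigma^{-1/2}(x - x^i), \qquad |H| = h^d\,|\Sigma|^{1/2}, $$

so evaluating any isotropic kernel on the whitened data with scalar bandwidth $h$ and dividing the result by $|\Sigma|^{1/2}$ reproduces the estimator above.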
