
Add kernel density estimation to the statistics module #115532

Closed

rhettinger opened this issue Feb 15, 2024 · 2 comments

Labels: 3.13 (bugs and security fixes), type-feature (a feature request or enhancement)

rhettinger (Contributor) commented Feb 15, 2024

Feature

Proposal:

I propose promoting the KDE statistics recipe to be an actual part of the statistics module.
See: https://docs.python.org/3.13/library/statistics.html#kernel-density-estimation

My thought is that it is an essential part of statistical reasoning about sample data.
See: https://www.itm-conferences.org/articles/itmconf/pdf/2018/08/itmconf_sam2018_00037.pdf
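
For context, the estimator computed by the recipe (and implemented by the working draft below) is the standard kernel density estimate

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

where K is the kernel function and h is the bandwidth.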

Discussion of this feature:

This was discussed and reviewed with the module author, who gave it his full support:

This looks really great, both as a concept and the code itself. It's a
good showcase for match, and an important statistics function.

Thank you for working on this.

Working Draft:

from collections.abc import Callable, Sequence
from bisect import bisect_left, bisect_right
from math import sqrt, exp, cos, pi, cosh

def kde(sample: Sequence[float],
        h: float,
        kernel_function: str = 'gauss',
    ) -> Callable[[float], float]:
    """Kernel Density Estimation.  Creates a continuous probability
    density function from discrete samples.

    The basic idea is to smooth the data using a kernel function
    to help draw inferences about a population from a sample.

    The degree of smoothing is controlled by the scaling parameter h,
    which is called the bandwidth.  Smaller values emphasize local
    features while larger values give smoother results.

    Kernel functions that give some weight to every sample point:

       gauss or normal
       logistic
       sigmoid

    Kernel functions that only give weight to sample points within
    the bandwidth:

       rectangular or uniform
       triangular
       epanechnikov or parabolic
       biweight or quartic
       triweight
       cosine

    Example
    -------

    Given a sample of six data points, estimate the
    probability density at evenly spaced points from -6 to 10:

        >>> sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
        >>> f_hat = kde(sample, h=1.5)
        >>> total = 0.0
        >>> for x in range(-6, 11):
        ...     density = f_hat(x)
        ...     total += density
        ...     plot = ' ' * int(density * 400) + 'x'
        ...     print(f'{x:2}: {density:.3f} {plot}')
        ...
        -6: 0.002 x
        -5: 0.009    x
        -4: 0.031             x
        -3: 0.070                             x
        -2: 0.111                                             x
        -1: 0.125                                                   x
         0: 0.110                                            x
         1: 0.086                                   x
         2: 0.068                            x
         3: 0.059                        x
         4: 0.066                           x
         5: 0.082                                 x
         6: 0.082                                 x
         7: 0.058                        x
         8: 0.028            x
         9: 0.009    x
        10: 0.002 x
        >>> round(total, 3)
        0.999

    References
    ----------

    Kernel density estimation and its application:
    https://www.itm-conferences.org/articles/itmconf/pdf/2018/08/itmconf_sam2018_00037.pdf

    Kernel functions in common use:
    https://en.wikipedia.org/wiki/Kernel_(statistics)#Kernel_functions_in_common_use

    Interactive graphical demonstration and exploration:
    https://demonstrations.wolfram.com/KernelDensityEstimation/

    """

    kernel: Callable[[float], float]
    support: float | None

    match kernel_function:

        case 'gauss' | 'normal':
            c = 1 / sqrt(2 * pi)
            kernel = lambda t: c * exp(-1/2 * t * t)
            support = None

        case 'logistic':
            # 1.0 / (exp(t) + 2.0 + exp(-t))
            kernel = lambda t: 1/2 / (1.0 + cosh(t))
            support = None

        case 'sigmoid':
            # (2/pi) / (exp(t) + exp(-t))
            c = 1 / pi
            kernel = lambda t: c / cosh(t)
            support = None

        case 'rectangular' | 'uniform':
            kernel = lambda t: 1/2
            support = 1.0

        case 'triangular':
            kernel = lambda t: 1.0 - abs(t)
            support = 1.0

        case 'epanechnikov' | 'parabolic':
            kernel = lambda t: 3/4 * (1.0 - t * t)
            support = 1.0

        case 'biweight' | 'quartic':
            kernel = lambda t: 15/16 * (1.0 - t * t) ** 2
            support = 1.0

        case 'triweight':
            kernel = lambda t: 35/32 * (1.0 - t * t) ** 3
            support = 1.0

        case 'cosine':
            c1 = pi / 4
            c2 = pi / 2
            kernel = lambda t: c1 * cos(c2 * t)
            support = 1.0

        case _:
            raise ValueError(f'Unknown kernel function: {kernel_function!r}')

    n = len(sample)

    if support is None:

        def pdf(x: float) -> float:
            return sum(kernel((x - x_i) / h) for x_i in sample) / (n * h)

    else:

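        # Kernels with bounded support are zero outside [-support, support],
        # so sort the sample once and use bisect to limit each pdf(x)
        # evaluation to the points within h * support of x.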
        sample = sorted(sample)
        bandwidth = h * support

        def pdf(x: float) -> float:
            i = bisect_left(sample, x - bandwidth)
            j = bisect_right(sample, x + bandwidth)
            supported = sample[i : j]
            return sum(kernel((x - x_i) / h) for x_i in supported) / (n * h)

    return pdf


def _test() -> None:
    from statistics import NormalDist

    def kde_normal(sample: Sequence[float], h: float) -> Callable[[float], float]:
        "Create a continuous probability density function from a sample."
        # Smooth the sample with a normal distribution kernel scaled by h.
        kernel_h = NormalDist(0.0, h).pdf
        n = len(sample)
        def pdf(x: float) -> float:
            return sum(kernel_h(x - x_i) for x_i in sample) / n
        return pdf

    D = NormalDist(250, 50)
    data = D.samples(100)
    h = 30
    pd = D.pdf

    fg = kde(data, h, 'gauss')
    fg2 = kde_normal(data, h)
    fr = kde(data, h, 'rectangular')
    ft = kde(data, h, 'triangular')
    fb = kde(data, h, 'biweight')
    fe = kde(data, h, 'epanechnikov')
    fc = kde(data, h, 'cosine')
    fl = kde(data, h, 'logistic')
    fs = kde(data, h, 'sigmoid')

    def show(x: float) -> None:
        for func in (pd, fg, fg2, fr, ft, fb, fe, fc, fl, fs):
            print(func(x))
        print()

    show(197)
    show(255)
    show(320)

    def integrate(func: Callable[[float], float],
                  low: float, high: float, steps: int = 1000) -> float:
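        # Approximate the integral with a left Riemann sum.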
        width = (high - low) / steps
        xarr = (low + width * i for i in range(steps))
        return sum(map(func, xarr)) * width

    print('\nIntegrals:')
    for func in (pd, fg, fg2, fr, ft, fb, fe, fc, fl, fs):
        print(integrate(func, 0, 500, steps=10_000))


if __name__ == '__main__':
    import doctest

    _test()
    print(doctest.testmod())

gpshead (Member) commented Feb 17, 2024

This looks like a pretty clean, useful addition. Nice draft!

Some things to consider (better explored in draft PR form than inline in an issue): kernel_function= being a str is perhaps unusual. Could that instead be a direct reference to a Callable[[float], float] that also carries a .support: float | None attribute? Those could be pre-defined by name, perhaps within a statistics.kernel-ish namespace, to avoid the match/case and the on-the-fly definitions entirely. This would also allow people to bring their own kernel rather than only use our predefined common ones.
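
A minimal sketch of what that alternative API might look like (the Kernel class and the predefined gauss object below are hypothetical illustrations, not part of the draft):

from collections.abc import Callable, Sequence
from dataclasses import dataclass
from math import exp, pi, sqrt

@dataclass(frozen=True)
class Kernel:
    # Hypothetical carrier pairing a weight function with its support.
    func: Callable[[float], float]
    support: float | None = None        # None means unbounded support

    def __call__(self, t: float) -> float:
        return self.func(t)

# A predefined kernel, as might live in a statistics.kernel-ish namespace:
gauss = Kernel(lambda t: exp(-t * t / 2) / sqrt(2 * pi))

def kde(sample: Sequence[float], h: float,
        kernel: Kernel = gauss) -> Callable[[float], float]:
    n = len(sample)
    def pdf(x: float) -> float:
        return sum(kernel((x - x_i) / h) for x_i in sample) / (n * h)
    return pdf

Users could then pass their own Kernel instance instead of one of the predefined names.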

rhettinger (Contributor, Author) commented Feb 21, 2024

Thanks for the compliment on the draft. I've been iterating on it for a few weeks.

FWIW, having a string kernel name is consistent with the API style for quantiles() and correlation() elsewhere in the module. I also like having the kde() function fully self-contained, which makes it easier to use. The listed kernels are the ones consistently mentioned in the KDE literature, and SciPy only implements the normal kernel since it dominates the others in practice. So I think supporting custom kernels would be an over-generalization beyond what people actually use.

Side note: while the code volume makes it look like the kernel is the most important part, the referenced paper says that changing kernels makes only minor differences in the output. The significant parameter by far is h, which has dramatic effects on the final result.
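
The effect is easy to demonstrate with the kde() function and sample from the working draft above (an illustrative sketch, not part of the draft):

sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
for h in (0.5, 1.5, 4.0):
    f_hat = kde(sample, h)                     # kernel defaults to 'gauss'
    print(f'h={h}:', ' '.join(f'{f_hat(x):.3f}' for x in range(-6, 11, 4)))
# Small h gives spiky, local estimates; large h smooths everything toward
# one broad bump.  Switching kernels changes the curve far less than this.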
