KDEMultivariate for unordered variables output depends on the numerical value of the unordered variable #3790

ijmbarr · 2017-06-27T14:26:55Z

Version: '0.8.0'

When using the default bandwidth selection with unordered variables in KDEMultivariate, the result depends on the numeric values of the training data.

Compare:

import numpy as np
import statsmodels.api as sm

x = np.array([0,0,0,0,1])
kde_x = sm.nonparametric.KDEMultivariate(data=x, var_type="u")
print(kde_x.pdf([0]))
# output: 0.61561605356

With

import numpy as np
import statsmodels.api as sm

x = np.array([0,0,0,0,1]) * 1000
kde_x = sm.nonparametric.KDEMultivariate(data=x, var_type="u")
print(kde_x.pdf([0]))
# output: -183.58394644

This happens because the bandwidth is estimated as if the values were continuous. I suggest that this behavior be changed, or at the very least documented.
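To illustrate the scaling, here is a minimal sketch of a normal-reference (rule-of-thumb) bandwidth, assuming the form 1.06 * std * n ** (-1 / (4 + k)) with k the number of variables. The helper name `normal_reference_bw` is hypothetical and this is not the library's code; the formula is an assumption, though it is consistent with the outputs reported in this issue.

```python
import numpy as np

def normal_reference_bw(data):
    """Hypothetical helper: rule-of-thumb bandwidth per variable.

    Assumes bw = 1.06 * std * n ** (-1 / (4 + k)), k = number of variables.
    """
    data = np.asarray(data, dtype=float)
    if data.ndim == 1:
        data = data[:, None]  # treat a 1-D array as a single variable
    nobs, k_vars = data.shape
    return 1.06 * data.std(axis=0) * nobs ** (-1.0 / (4 + k_vars))

x = np.array([0, 0, 0, 0, 1])
bw_small = normal_reference_bw(x)         # labels coded as 0 / 1
bw_large = normal_reference_bw(x * 1000)  # same labels coded as 0 / 1000
print(bw_small, bw_large)                 # roughly 0.307 and 307.3
```

Because the standard deviation scales with the numeric coding of the labels, so does the bandwidth, even though the categories themselves are unchanged.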

One suggestion for the default might be to set the bandwidth to 1. This way the probability estimate would coincide with the MLE.

I'm happy to submit a pull request for this if people are ok with the default value.


josef-pkt · 2017-10-07T20:34:44Z

Sorry, this issue got lost and no labels were added (it came in during my vacation).

This is a bug, or at least a weird result (example slightly adjusted, with more results).
And how can the pdf be negative? (Does it need clipping?)

import numpy as np
from statsmodels.nonparametric.api import KDEMultivariate

x = np.array([0,0,0,0,1])
kde_x = KDEMultivariate(data=x, var_type="u")
print(kde_x.pdf([0]))
# output: 0.61561605356
print(kde_x.pdf([0, 1]))
# output: [ 0.61561605  0.38438395]
print(kde_x.pdf([0, 0.5, 1]))
# output: [ 0.61561605  0.30730658  0.38438395]


x = np.array([0,0,0,0,1]) * 1000
kde_x = KDEMultivariate(data=x, var_type="u")
print(kde_x.pdf([0]))
# output: -183.58394644

x = np.array([0,0,0,0,1]) * 1000

kde_x = KDEMultivariate(data=x, var_type="u", bw=[1.])
print(kde_x.pdf([0, 1]))
# output: [ 0.2  1. ]
print(kde_x.pdf([0, 0.5, 1]))
# output: [ 0.2  1.   1. ]

josef-pkt · 2018-04-19T04:39:27Z

The default bandwidth is the normal reference rule, which is applied independently of the variable type.
AFAICS, a bandwidth of h=0 would put weight only on observations with the same category in aitchison_aitken.

I don't see an option to specify normal reference for some variables, e.g. the continuous ones, and a fixed bw for the others.

I'm puzzled by the previous result with bw=1. What is a density when we only have a categorical variable?
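The h=0 remark and the bw=1. result can both be checked with a small standalone sketch of an Aitchison-Aitken style kernel. It assumes a weight of 1 - h for observations in the same category and h / (c - 1) otherwise, where c is the number of distinct values in the data. The function name `aitchison_aitken_pdf` is hypothetical and this is not the statsmodels implementation, but under these assumptions it reproduces the numbers posted above.

```python
import numpy as np

def aitchison_aitken_pdf(data, points, h):
    """Hypothetical sketch: average kernel weight at each evaluation point.

    Weight is 1 - h where the observation equals the point, h / (c - 1)
    otherwise; c is the number of distinct values in `data` (needs c >= 2).
    """
    data = np.asarray(data)
    n_levels = np.unique(data).size
    estimates = []
    for p in np.atleast_1d(points):
        weights = np.where(data == p, 1.0 - h, h / (n_levels - 1))
        estimates.append(weights.mean())
    return np.array(estimates)

x = np.array([0, 0, 0, 0, 1])

# h from the rule-of-thumb value discussed in this thread
pdf_default = aitchison_aitken_pdf(x, [0, 0.5, 1], h=0.30730658)

# same data coded as 0 / 1000: h is 1000 times larger, so 1 - h is negative
pdf_scaled = aitchison_aitken_pdf(x * 1000, [0], h=307.30658)

# h = 0: only matching observations count, i.e. the sample frequencies
pdf_h0 = aitchison_aitken_pdf(x, [0, 0.5, 1], h=0.0)

# h = 1 with two levels: matching observations get weight 0, all others weight 1
pdf_h1 = aitchison_aitken_pdf(x * 1000, [0, 0.5, 1], h=1.0)

print(pdf_default, pdf_scaled, pdf_h0, pdf_h1)
```

In this sketch the estimate only stays inside [0, 1] when h lies between 0 and (c - 1) / c. That would account for the negative value with the scaled data. With bw=1. and two levels the weights are inverted, so points that match no observation get an estimate of 1.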

josef-pkt added comp-nonparametric type-bug type-enh labels Oct 7, 2017

josef-pkt added this to the 0.9 milestone Oct 7, 2017

josef-pkt added this to bugs in 0.9 Oct 8, 2017

josef-pkt modified the milestones: 0.9, 0.10 Apr 19, 2018
