KDEMultivariate for unordered variables output depends on the numerical value of the unordered variable #3790

Open
ijmbarr opened this issue Jun 27, 2017 · 2 comments

ijmbarr commented Jun 27, 2017

Version: '0.8.0'

When using the default bandwidth selection with unordered variables in KDEMultivariate, the result depends on the numeric values used to encode the categories in the training data.

Compare:

import numpy as np
import statsmodels.api as sm

x = np.array([0, 0, 0, 0, 1])
kde_x = sm.nonparametric.KDEMultivariate(data=x, var_type="u")
print(kde_x.pdf([0]))
# output: 0.61561605356

with:

import numpy as np
import statsmodels.api as sm

# same categories, but encoded as 0 and 1000 instead of 0 and 1
x = np.array([0, 0, 0, 0, 1]) * 1000
kde_x = sm.nonparametric.KDEMultivariate(data=x, var_type="u")
print(kde_x.pdf([0]))
# output: -183.58394644

This happens because the bandwidth is estimated as if the values were continuous, so it scales with the numeric encoding of the categories. I suggest that this behavior be changed, or at the very least documented.
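The two outputs above can be reproduced by hand. Below is a minimal, dependency-free sketch, assuming the default bandwidth is the normal-reference rule of thumb `1.06 * std * n**(-1 / (4 + k))` and that the unordered kernel is Aitchison-Aitken (weight `1 - h` on a matching category, `h / (c - 1)` otherwise). The helper names are hypothetical and are not statsmodels API.

```python
from statistics import pstdev


def normal_reference_bw(data, n_vars=1):
    """Rule-of-thumb bandwidth (assumed form): 1.06 * std * n**(-1 / (4 + n_vars))."""
    n = len(data)
    return 1.06 * pstdev(data) * n ** (-1.0 / (4 + n_vars))


def aitchison_aitken_pdf(data, x, h):
    """Average Aitchison-Aitken kernel weight of the sample at point x.

    Assumes at least two distinct categories in `data`.
    """
    c = len(set(data))  # number of observed categories
    weights = [(1.0 - h) if xi == x else h / (c - 1) for xi in data]
    return sum(weights) / len(data)


for scale in (1, 1000):
    data = [v * scale for v in (0, 0, 0, 0, 1)]
    h = normal_reference_bw(data)
    # h grows linearly with the encoding, so 1 - h goes negative for scale=1000
    print(scale, h, aitchison_aitken_pdf(data, 0, h))
```

Under these assumptions the bandwidth is about 0.307 for the 0/1 encoding and about 307 for the 0/1000 encoding, which is where the negative value comes from.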

One suggestion for default might be to set the bandwidth to 1. This way the probability estimate overlaps with the MLE.

I'm happy to submit a pull request for this if people are ok with the default value.

@josef-pkt
Member

Sorry, this issue got lost and no labels were added (it came in during my vacation).

This is a bug, or at least a weird result (example slightly adjusted, with more results).
And how can the pdf be negative? (Does it need clipping?)

import numpy as np
from statsmodels.nonparametric.api import KDEMultivariate

x = np.array([0,0,0,0,1])
kde_x = KDEMultivariate(data=x, var_type="u")
print(kde_x.pdf([0]))
# output: 0.61561605356
print(kde_x.pdf([0, 1]))
# output: [ 0.61561605  0.38438395]
print(kde_x.pdf([0, 0.5, 1]))
# output: [ 0.61561605  0.30730658  0.38438395]


x = np.array([0,0,0,0,1]) * 1000
kde_x = KDEMultivariate(data=x, var_type="u")
print(kde_x.pdf([0]))
# output: -183.58394644

x = np.array([0,0,0,0,1]) * 1000

kde_x = KDEMultivariate(data=x, var_type="u", bw=[1.])
print(kde_x.pdf([0, 1]))
# output: [ 0.2  1. ]
print(kde_x.pdf([0, 0.5, 1]))
# output: [ 0.2  1.   1. ]
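On the clipping question, one possible guard is to clip the bandwidth rather than the pdf. The sketch below assumes the usual admissible range for an Aitchison-Aitken bandwidth, `0 <= h <= (c - 1) / c` with `c` the number of categories. Both helpers are hypothetical, not statsmodels functions, and the raw bandwidth is an illustrative value of the size implied by the -183.58 output.

```python
def clip_unordered_bw(h, n_levels):
    """Clamp h to [0, (c - 1) / c] (assumed admissible range)."""
    upper = (n_levels - 1) / n_levels
    return min(max(h, 0.0), upper)


def aitchison_aitken_pdf(data, x, h):
    """Average Aitchison-Aitken kernel weight of the sample at point x."""
    c = len(set(data))  # assumes at least two distinct categories
    weights = [(1.0 - h) if xi == x else h / (c - 1) for xi in data]
    return sum(weights) / len(data)


data = [0, 0, 0, 0, 1000]
h_raw = 307.3  # illustrative value, far above the admissible range
h = clip_unordered_bw(h_raw, n_levels=len(set(data)))
print(h)                                                      # 0.5
print([aitchison_aitken_pdf(data, x, h) for x in (0, 1000)])  # [0.5, 0.5]
```

Clipping removes the negative values, but at the upper bound every category gets the same weight, so the estimate collapses to a uniform distribution. It is a guard, not a substitute for a sensible default.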

josef-pkt added this to bugs in 0.9 on Oct 8, 2017
@josef-pkt
Member

The default is the normal reference rule, and it is applied independently of the variable type.
AFAICS, in aitchison_aitken a bandwidth of h=0 would put weight only on observations with the same category.
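The h=0 case can be checked directly. This is a small sketch, again assuming the Aitchison-Aitken form (`1 - h` on a match, `h / (c - 1)` otherwise) and using a hypothetical helper rather than the statsmodels function. With h=0 the estimate should reduce to the relative frequency of each category.

```python
from collections import Counter


def aitchison_aitken_pdf(data, x, h):
    """Average Aitchison-Aitken kernel weight of the sample at point x."""
    c = len(set(data))  # assumes at least two distinct categories
    weights = [(1.0 - h) if xi == x else h / (c - 1) for xi in data]
    return sum(weights) / len(data)


data = [0, 0, 0, 0, 1]
counts = Counter(data)
for level in sorted(counts):
    freq = counts[level] / len(data)                 # relative frequency
    est = aitchison_aitken_pdf(data, level, h=0.0)   # kernel estimate at h=0
    print(level, freq, est)
```

For this sample both columns come out as 0.8 for category 0 and 0.2 for category 1.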

I don't see an option to specify normal reference for some variables, e.g. on continuous variables, and a fixed bw on other variables.

I'm puzzled by the previous result with bw=1. What is a density when we only have a categorical variable?
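One way to read the bw=1 output: for a purely categorical variable the estimate behaves like a probability mass function over the observed levels, and the 1.0 values appear at points that are not observed levels. The sketch below assumes the Aitchison-Aitken form with a hypothetical helper. It evaluates the estimate at the training levels (0 and 1000) and then at 1, which is not among them.

```python
def aitchison_aitken_pdf(data, x, h):
    """Average Aitchison-Aitken kernel weight of the sample at point x."""
    c = len(set(data))  # assumes at least two distinct categories
    weights = [(1.0 - h) if xi == x else h / (c - 1) for xi in data]
    return sum(weights) / len(data)


data = [0, 0, 0, 0, 1000]
levels = sorted(set(data))

# Over the observed levels the estimates sum to 1 for each h tried here.
for h in (0.0, 0.25, 1.0):
    probs = [aitchison_aitken_pdf(data, x, h) for x in levels]
    print(h, probs, sum(probs))

# 1 is not an observed level: every observation gets the off-category
# weight h / (c - 1), which is 1.0 when h = 1 and c = 2.
print(aitchison_aitken_pdf(data, 1, 1.0))
```

With h=1 this gives 0.2 at level 0 and 0.8 at level 1000, so the mass is swapped relative to the observed frequencies rather than matching them.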

josef-pkt modified the milestones: 0.9, 0.10 on Apr 19, 2018