weighted KDE #4394

Closed
cbonnett opened this Issue Mar 14, 2015 · 14 comments

@cbonnett

cbonnett commented Mar 14, 2015

Not sure this is the correct place, but I would very much appreciate the ability to
pass a weight for each sample in kernel density estimation.

There exists an adapted version of scipy.stats.gaussian_kde:
http://stackoverflow.com/questions/27623919/weighted-gaussian-kernel-density-estimation-in-python

@amueller


Member

amueller commented Mar 16, 2015

I think that wouldn't be too hard to add but @jakevdp knows better.

@cbonnett


cbonnett commented Mar 18, 2015

That's good news.
I would use it for an astronomy project, so @jakevdp's help/advice would be welcome.
I hope to be able to work on it after paper deadlines, but can't promise anything.

@jakevdp


Member

jakevdp commented Mar 18, 2015

It's actually not trivial, because of the fast tree-based KDE that sklearn uses. Currently, nodes are ranked by distance and the local estimate is updated until it can be shown that the desired tolerance has been reached. With non-uniform weights, the ranking procedure would have to be based on a combination of minimum distance and maximum weight in each node, which would require a slightly different KD-tree/Ball tree traversal algorithm, along with an updated node data structure to store those weights.

It would be relatively easy to add a slower brute-force version of KDE which supports weighted points, however.
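As a rough sketch of what such a brute-force weighted estimator could look like (the function name and signature here are illustrative, not sklearn API): each query point's density is a weight-normalized sum of Gaussian kernels centered on the training points, at O(n_train × n_query) cost.

```python
import numpy as np

def weighted_kde_brute(X, weights, X_query, bandwidth=1.0):
    """Brute-force weighted Gaussian KDE (illustrative sketch).

    Evaluates sum_i w_i * N(x | X_i, h^2 I) at each query point,
    with the weights normalized to sum to 1.
    """
    X = np.atleast_2d(X)
    X_query = np.atleast_2d(X_query)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    n_features = X.shape[1]
    # Squared distances between every query point and every training point.
    d2 = ((X_query[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    norm = (2 * np.pi * bandwidth ** 2) ** (-n_features / 2)
    return norm * (np.exp(-0.5 * d2 / bandwidth ** 2) * w).sum(axis=1)
```

With uniform weights this reduces to a standard KDE, and duplicating a sample is equivalent to doubling its weight, which is a convenient sanity check.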

@amueller


Member

amueller commented Mar 18, 2015

Hmm, for some reason I thought the trees did support weights. I guess I was confused by the weighting in KNN, which is much easier to implement.

@jakevdp


Member

jakevdp commented May 5, 2015

Quick question – I've heard a number of requests for this feature. Though it would be difficult to implement for the tree-based KDE, it would be relatively straightforward to add an algorithm='brute' option to KernelDensity which could support a weights or similar attribute for the class.

Do you think that would be a worthwhile contribution?

@cbonnett


cbonnett commented May 6, 2015

I think it would. In practice it would only be practical for small-ish data sets, of course, but I don't see that as a good reason not to implement it.
Furthermore, if it proves popular, it might lead to someone developing a fast version.
Just my 2 cents.

@Padarn


Padarn commented May 20, 2015

Just a comment - for low dimensional data sets statsmodels already has a weighted KDE.

@raimon-fa


raimon-fa commented Nov 2, 2017

It would also be extremely convenient for me if there were a version of the algorithm that accepted weights. I think it's a very important feature, and surprisingly almost none of the Python libraries have it. Statsmodels does have it, but only for univariate KDE; for multivariate KDE the feature is missing.

@raimon-fa


raimon-fa commented Nov 2, 2017

2 years have passed since this issue was opened and it hasn't been solved yet

@jnothman jnothman added the help wanted label Nov 2, 2017

@jnothman


Member

jnothman commented Nov 2, 2017

Do you want to contribute it? Go ahead!

@iosonofabio


iosonofabio commented Dec 31, 2017

Hi, I'm interested in this too. What about this?

https://gist.github.com/afrendeiro/9ab8a1ea379030d10f17

I can ask and try to integrate this into sklearn if you think it's fine.

@samronsin


Contributor

samronsin commented Mar 13, 2018

Hi, I've been working on this lately.
I ended up implementing the "slightly different KD-tree/Ball tree traversal algorithm, along with an updated node data structure to store those weights" @jakevdp mentioned.

@mllobera


mllobera commented Oct 31, 2018

(scikit-learn v0.20.0) Using the score_samples() function after fitting a kernel density with sample_weight in a Jupyter notebook forces the kernel to restart constantly (it does not produce any output). I used a numpy array with shape (2305, 2) for the training dataset and a numpy array with shape (2305,) for the weights. I am able to get results (see image below) when using the same function without a sample_weight array. I assume this is a known bug?

[image: KDE output produced without sample_weight]
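For comparison, a minimal weighted fit on synthetic data (assuming scikit-learn ≥ 0.20, where KernelDensity.fit gained sample_weight) runs fine in a plain interpreter, which suggests the crash above may be environment-specific rather than a general bug:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
sample_weight = rng.rand(100)  # must be non-negative, one weight per sample

kde = KernelDensity(kernel="gaussian", bandwidth=0.5)
kde.fit(X, sample_weight=sample_weight)

# score_samples returns log-density values at the query points.
log_density = kde.score_samples(X[:5])
```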

@jnothman


Member

jnothman commented Oct 31, 2018
