
weighted KDE #4394

Closed
cbonnett opened this issue Mar 14, 2015 · 14 comments

Comments

@cbonnett commented Mar 14, 2015

Not sure this is the correct place, but I would very much appreciate the ability to
pass a weight for each sample in kernel density estimation.

There exists an adapted version of scipy.stats.gaussian_kde:
http://stackoverflow.com/questions/27623919/weighted-gaussian-kernel-density-estimation-in-python
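(For context: SciPy releases 1.2 and later added a `weights` argument to `scipy.stats.gaussian_kde` itself, so the adaptation in the linked answer is only needed on older versions. A minimal sketch on a recent SciPy:)

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = rng.normal(size=500)
w = rng.uniform(0.1, 1.0, size=500)  # one non-negative weight per sample

# SciPy >= 1.2: weights are accepted directly and normalized internally
kde = gaussian_kde(x, weights=w)
density = kde.evaluate(np.linspace(-3, 3, 50))
```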

@amueller (Member) commented Mar 16, 2015

I think that wouldn't be too hard to add but @jakevdp knows better.

@cbonnett (Author) commented Mar 18, 2015

That's good news.
Well, I would use it for an astronomy project, so @jakevdp's help/advice would be welcome.
I hope to be able to work on it after paper deadlines, but can't promise anything.

@jakevdp (Member) commented Mar 18, 2015

It's actually not trivial, because of the fast tree-based KDE that sklearn uses. Currently, nodes are ranked by distance and the local estimate is updated until it can be shown that the desired tolerance has been reached. With non-uniform weights, the ranking procedure would have to be based on a combination of minimum distance and maximum weight in each node, which would require a slightly different KD-tree/Ball tree traversal algorithm, along with an updated node data structure to store those weights.

It would be relatively easy to add a slower brute-force version of KDE which supports weighted points, however.
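(A minimal sketch of such a brute-force weighted estimator, with a Gaussian kernel; the function name and signature here are hypothetical, not sklearn API:)

```python
import numpy as np
from scipy.special import logsumexp

def weighted_kde_logpdf(X_train, weights, X_query, bandwidth):
    """Brute-force weighted Gaussian KDE, O(n_train * n_query).

    `weights` need not sum to one; they are normalized here.
    Returns the log-density at each query point.
    """
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    if X_train.ndim == 1:
        X_train = X_train[:, None]
    if X_query.ndim == 1:
        X_query = X_query[:, None]
    d = X_train.shape[1]
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    # Pairwise squared distances, shape (n_query, n_train)
    sq = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    log_norm = -0.5 * d * np.log(2 * np.pi * bandwidth ** 2)
    # Weighted log-sum-exp over training points keeps this numerically stable
    return logsumexp(-0.5 * sq / bandwidth ** 2, axis=1, b=w) + log_norm
```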

@amueller (Member) commented Mar 18, 2015

Hmm, for some reason I thought the trees did support weights. I guess I was confused by the weighting in KNN, which is much easier to implement.

@jakevdp (Member) commented May 5, 2015

Quick question – I've heard a number of requests for this feature. Though it would be difficult to implement for the tree-based KDE, it would be relatively straightforward to add an algorithm='brute' option to KernelDensity which could support a weights or similar attribute for the class.

Do you think that would be a worthwhile contribution?

@cbonnett (Author) commented May 6, 2015

I think it would. In practice it would only be practical for small-ish data sets, of course, but I don't see that as a reason not to implement it.
Furthermore, if it proves popular, it might lead to someone developing a fast version.
Just my 2 cents.

@Padarn commented May 20, 2015

Just a comment - for low-dimensional data sets, statsmodels already has a weighted KDE.

@raimon-fa commented Nov 2, 2017

It would also be extremely convenient for me if there were a version of the algorithm that accepted weights. I think it's a very important feature, and surprisingly almost none of the Python libraries have it. Statsmodels does have it, but only for univariate KDE; for multivariate KDE the feature is also missing.

@raimon-fa commented Nov 2, 2017

Two years have passed since this issue was opened and it hasn't been solved yet.

@jnothman jnothman added the help wanted label Nov 2, 2017
@jnothman (Member) commented Nov 2, 2017

Do you want to contribute it? Go ahead!

@iosonofabio commented Dec 31, 2017

Hi, I'm interested in this too. What about this?

https://gist.github.com/afrendeiro/9ab8a1ea379030d10f17

I can ask and try to integrate this into sklearn if you think it's suitable.

@samronsin (Contributor) commented Mar 13, 2018

Hi, I've been working on this lately.
I ended up implementing the "slightly different KD-tree/Ball tree traversal algorithm, along with an updated node data structure to store those weights" @jakevdp mentioned.
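(For readers arriving later: this work shipped, and `KernelDensity.fit` accepts a `sample_weight` argument as of scikit-learn 0.20, which the next comment also refers to. A minimal usage sketch:)

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w = rng.uniform(0.1, 1.0, size=200)  # per-sample weights

kde = KernelDensity(kernel="gaussian", bandwidth=0.5)
kde.fit(X, sample_weight=w)          # sample_weight: scikit-learn >= 0.20
log_density = kde.score_samples(X[:5])
```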

@mllobera commented Oct 31, 2018

(scikit-learn v0.20.0) Using the 'score_samples()' method after fitting a kernel density with 'sample_weight' in a Jupyter notebook forces the kernel to restart constantly (it does not produce any output). I used a numpy array with shape (2305, 2) for the training data and a numpy array with shape (2305,) for the weights. I am able to get results (see image below) when using the same method without a sample_weight array. I assume this is a known bug!??

[image: KDE results obtained without sample_weight]

@jnothman (Member) commented Oct 31, 2018

9 participants