[WIP] Sample weighting feature implementation #165
Conversation
A bulk commit to pass sample weights wherever needed using explicitly named arguments and some reformatting.
When a weighted point is big enough to form a cluster, the max_lambda for this cluster is infinite, so we return a membership value of 1.
Using weights at this point seems to be a dead end (it is not necessary).
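In pseudocode, that special case amounts to something like the sketch below (illustrative only; the helper name and signature are made up, not the actual code in this PR):

```python
import numpy as np

def membership_in_cluster(point_lambda, cluster_max_lambda):
    # Illustrative sketch: when a single heavily weighted point is large
    # enough to form its own cluster, that cluster's max_lambda is infinite,
    # so we short-circuit to full membership.
    if np.isinf(cluster_max_lambda):
        return 1.0
    return min(point_lambda, cluster_max_lambda) / cluster_max_lambda
```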
Hello @m-dz, thank you for updating!
Oh, set the line length to 120... Will fix those 1-2 character overflows after discussing the actual PR.
Left it in the code for further discussion of the weighting approach.
Had to restart Travis, something is not working as expected. Also, 2 or 3 commits are not really related to this feature (e.g. the KD-Tree vs. BallTree unification); we should have removed them before making the PR, apologies for that.
I'll try to get some time to review this properly soon. I have, at least, reconfigured pep8speaks to be more forgiving over line length silliness.
Please take your time, it's not a minor amendment (although it's mostly just passing the weights around). Yeah, with current screen sizes a line length of <80 characters is slightly silly, even though a lot of other widely used modules like numpy or pandas stick to it. There is also this longish commented-out code which I would like to discuss, as it brings some really interesting behaviour, but might be considered "weird". Will comment more on it later.
I am interested in this PR. @m-dz, is there any reason (in your sample code) why the weights are integers, with most of them equal to 0? (I had no problems running your code, by the way.)
@ihincks, glad to see it compiled and ran! The weights' distribution is based on our use case, no other reason. The weights are integers, as this was how the tree was originally built (all nodes starting with size 1 and joined with their weights summed, see the tree construction in `_hdbscan_linkage.pyx`).
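Roughly, the role the weights play there can be illustrated with a toy sketch (this is only the idea of summing sizes on merges, not the actual Cython code in the PR):

```python
class WeightedUnionFind:
    """Toy sketch: each point starts with its (integer) sample weight as its
    size, and sizes are summed whenever two components are joined."""

    def __init__(self, weights):
        self.parent = list(range(len(weights)))
        self.size = list(weights)  # starting sizes are the sample weights

    def find(self, x):
        # Path-halving find.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra
            self.size[ra] += self.size[rb]  # weights are summed on merge
```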
Hi @lmcinnes, is there something we could do to help you make the decision? I'll try to find some time for tests, but if you could suggest what should be tested it would simplify the process. I thought about:
Hi, I have noticed @jc-healy is assigned to this PR, are you willing to do some work on it to merge it into master? Commit fdd2c36 broke the Travis CI, but it was needed for us to use it - I am going to separate this PR and our "internal" copy within the next few days, so all should be fine again (but I would strongly advise using a more recent Cython, as some parts of the old code are a bit redundant). Also, I would like to help a bit with refactoring to fewer functions (I think I have a skeleton for that ready) after finishing this PR.
Sorry, this completely slipped through the cracks. I can't promise when we'll have time for a proper review, but we'll do what we can.
No worries, I'll try to clean this up and confirm it still works as intended in the next week or two.
Hi all, wondering if there are still plans to move this enhancement forward? I am interested in using weighting to process a large dataset in two stages - first compress with BIRCH, then send weighted clusters to hdbscan for fine-tuning. If weights aren't available, I'll unfortunately have to use regular DBSCAN. Please advise. Depending on the complexity of the enhancement, I might be able to take a crack at it, if that'd help.
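In sketch form, stage 1 of the workflow I have in mind looks something like this (the BIRCH threshold and variable names are illustrative, and `X` is assumed to hold the full dataset); the weighted stage 2 is exactly what this PR would enable:

```python
import numpy as np
from sklearn.cluster import Birch

# Stage 1: compress ~500K points into a few thousand subclusters with BIRCH.
# The threshold is data-dependent and chosen here only for illustration.
birch = Birch(threshold=0.5, n_clusters=None)
labels = birch.fit_predict(X)                          # X: the full (500K, n_features) array
centers = birch.subcluster_centers_
counts = np.bincount(labels, minlength=len(centers))   # membership count per subcluster

# Stage 2 would then cluster `centers` with hdbscan, using `counts` as the
# sample weights this PR proposes to support.
```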
There's a workaround you can do while this feature is pending; I describe it in my reply below. P.S. I'm also waiting for this feature, but as my ML knowledge is not enough yet to help on this WIP, that's the most I can help with.
Thanks for the tip - would this approach entail changing the size of the dataset, though? The workflow I'm trying to execute is: 500K data points --> compress to 10K clusters using BIRCH --> cluster the 10K clusters in hdbscan using membership counts as weights. I think what you are suggesting would essentially entail reinflating the 10K clusters back to the 500K dataset? That unfortunately wouldn't work in my case, as my hardware conks out at around 16K points. Am I understanding the suggestion correctly?
You could still use your 10K dataset, but you'd have to repeat each point according to its amplitude (which would increase the number of points), apply HDBSCAN and then come back to 10K. I asked about it on Stack Overflow, someone helped me and maybe that can give you an insight: link to the question
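A rough sketch of that repeat trick, assuming integer weights of at least 1 and enough memory for the inflated array (variable names are illustrative):

```python
import numpy as np
import hdbscan

# centers: (10000, n_features) compressed points; counts: integer weight per point (all >= 1)
inflated = np.repeat(centers, counts, axis=0)    # repeat each point counts[i] times
inflated_labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(inflated)

# Map labels back to the 10K points by taking the label of the first copy of each.
first_copy = np.concatenate(([0], np.cumsum(counts)[:-1]))
labels = inflated_labels[first_copy]
```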
It looks like not enough of us who are involved in this have sufficient time to make this happen. I'm closing it for now.
As discussed, a sample weighting implementation PR. "Tested" with all 6 algorithm options, but lacking any formal tests (on a roadmap...).
Weights are basically used as starting sizes for tree creation in `_hdbscan_linkage.pyx`; the rest (e.g. core distance calculation) is (hopefully) untouched. As you can see in the commit log, initially weights were passed down to the core distance calculation, but this did not seem to be a good idea (it was causing all zero-weight points to be virtually "excluded" from clustering; we believe the current approach is much more appropriate). A quick demonstration (please uncomment plots if feeling adventurous):
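A minimal sketch along those lines, assuming the weights end up being passed as a `sample_weights` keyword to `fit` (the exact argument name and signature in the PR may differ):

```python
import numpy as np
import hdbscan

# Toy data: two blobs plus arbitrary integer weights (the distribution is illustrative only).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), rng.normal(5, 1, size=(200, 2))])
weights = rng.randint(0, 5, size=X.shape[0])

clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
clusterer.fit(X, sample_weights=weights)   # `sample_weights` is this PR's proposed argument
labels = clusterer.labels_

# import matplotlib.pyplot as plt
# plt.scatter(X[:, 0], X[:, 1], c=labels, s=5 + 5 * weights)
# plt.show()
```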
We are VERY open to discussion.