
[WIP] Sample weighting feature implementation #165

Closed
m-dz wants to merge 18 commits into scikit-learn-contrib:master from m-dz:WIP_sample_weighting

Conversation

m-dz
Contributor

@m-dz m-dz commented Jan 18, 2018

As discussed, this is a sample weighting implementation PR. "Tested" with all 6 algorithm options, but lacking any formal tests (on the roadmap...).

Weights are basically used as starting sizes for tree creation in _hdbscan_linkage.pyx; the rest (e.g. core distance calculation) is (hopefully) untouched. As you can see in the commit log, weights were initially passed down to the core distance calculation as well, but this did not seem to be a good idea (it caused all zero-weight points to be virtually "excluded" from clustering); we believe the current approach is much more appropriate.

A quick demonstration (please uncomment plots if feeling adventurous):

'''
Example of weighted HDBSCAN.
'''

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import make_blobs
from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler

from hdbscan import HDBSCAN

plot_kwds = {'linewidths':0}
palette = sns.color_palette("hls", 20)


# Generate data
np.random.seed(1)
X, y = make_blobs(n_samples=400, cluster_std=3, shuffle=True, random_state=10)
X, y = shuffle(X, y, random_state=7)
X = StandardScaler().fit_transform(X)

sample_weights = np.floor(np.random.standard_gamma(shape=0.5, size=len(X))) * 4

sizes = ((sample_weights+1)*10).astype(int)
alphas = np.fromiter((0.5 if s < 1 else 0.75 for s in sample_weights), dtype=float, count=len(X)).reshape(-1, 1)

algorithm = 'best'
# algorithm = 'generic'
# algorithm = 'prims_kdtree'
# algorithm = 'prims_balltree'
# algorithm = 'boruvka_kdtree'
# algorithm = 'boruvka_balltree'

min_cluster_size, min_samples = 8, 1


# (Unweighted) HDBSCAN
np.random.seed(1)
hdb_unweighted = HDBSCAN(
    min_cluster_size=min_cluster_size, min_samples=min_samples, gen_min_span_tree=True, allow_single_cluster=False,
    algorithm=algorithm, prediction_data=True)
hdb_unweighted.fit(X)

cluster_colours = np.asarray([palette[lab] if lab >= 0 else (0.0, 0.0, 0.0) for lab in hdb_unweighted.labels_])
cluster_colours = np.hstack([cluster_colours, alphas])

fig = plt.figure()
plt.scatter(X.T[0], X.T[1], s=sizes, c=cluster_colours, **plot_kwds)
fig.suptitle('Unweighted HDBSCAN'); plt.show()

## Lots of plots, all working!
fig = plt.figure()
hdb_unweighted.minimum_spanning_tree_.plot(edge_cmap='viridis', edge_alpha=0.5, node_size=40, edge_linewidth=2)
fig.suptitle('Unweighted HDBSCAN minimum spanning tree'); plt.show()
fig = plt.figure()
hdb_unweighted.single_linkage_tree_.plot(cmap='viridis', colorbar=True)
fig.suptitle('Unweighted HDBSCAN single linkage tree'); plt.show()
fig = plt.figure()
hdb_unweighted.condensed_tree_.plot(select_clusters=True, selection_palette=palette, log_size=True)
fig.suptitle('Unweighted HDBSCAN condensed tree plot'); plt.show()


# (Weighted) HDBSCAN
np.random.seed(1)
hdb_weighted = HDBSCAN(
    min_cluster_size=min_cluster_size, min_samples=min_samples, gen_min_span_tree=True, allow_single_cluster=False,
    algorithm=algorithm, prediction_data=True)
hdb_weighted.fit(X, sample_weights=sample_weights)

cluster_colours = np.asarray([palette[lab] if lab >= 0 else (0.0, 0.0, 0.0) for lab in hdb_weighted.labels_])
cluster_colours = np.hstack([cluster_colours, alphas])

fig = plt.figure()
plt.scatter(X.T[0], X.T[1], s=sizes, c=cluster_colours, **plot_kwds)
fig.suptitle('Weighted HDBSCAN'); plt.show()

## Lots of plots, all working (except one or two warnings...)!
fig = plt.figure()
hdb_weighted.minimum_spanning_tree_.plot(edge_cmap='viridis', edge_alpha=0.5, node_size=40, edge_linewidth=2)
fig.suptitle('Weighted HDBSCAN minimum spanning tree'); plt.show()
# single_linkage_tree throws "plots.py:563: RuntimeWarning: divide by zero encountered in log2" but works.
fig = plt.figure()
hdb_weighted.single_linkage_tree_.plot(cmap='viridis', colorbar=True)
fig.suptitle('Weighted HDBSCAN single linkage tree'); plt.show()
fig = plt.figure()
hdb_weighted.condensed_tree_.plot(select_clusters=True, selection_palette=palette, log_size=True)
fig.suptitle('Weighted HDBSCAN condensed tree plot'); plt.show()

## Check weights by cluster:
unweighted_weights = np.asarray([[lab, sum(sample_weights[hdb_unweighted.labels_ == lab])] for lab in set(hdb_unweighted.labels_)])
weighted_weights = np.asarray([[lab, sum(sample_weights[hdb_weighted.labels_ == lab])] for lab in set(hdb_weighted.labels_)])
print(unweighted_weights)
print(weighted_weights)

We are VERY open to discussion.

@pep8speaks

pep8speaks commented Jan 18, 2018

Hello @m-dz, thank you for updating!

Line 46:1: E402 module level import not at top of file
Line 48:1: E302 expected 2 blank lines, found 1
Line 181:25: E128 continuation line under-indented for visual indent
Line 517:101: E501 line too long (101 > 100 characters)
Line 537:101: E501 line too long (101 > 100 characters)
Line 542:101: E501 line too long (103 > 100 characters)
Line 552:101: E501 line too long (103 > 100 characters)
Line 557:101: E501 line too long (105 > 100 characters)
Line 569:13: E127 continuation line over-indented for visual indent
Line 913:18: E128 continuation line under-indented for visual indent
Line 928:18: E128 continuation line under-indented for visual indent
Line 929:18: E128 continuation line under-indented for visual indent
Line 940:18: E128 continuation line under-indented for visual indent
Line 941:18: E128 continuation line under-indented for visual indent

Line 114:48: W503 line break before binary operator
Line 129:50: W503 line break before binary operator
Line 247:25: E126 continuation line over-indented for hanging indent
Line 248:33: E127 continuation line over-indented for visual indent
Line 250:48: W503 line break before binary operator
Line 309:25: E126 continuation line over-indented for hanging indent
Line 310:33: E128 continuation line under-indented for visual indent
Line 310:33: W503 line break before binary operator
Line 312:56: W503 line break before binary operator

Line 279:57: E127 continuation line over-indented for visual indent
Line 280:57: E127 continuation line over-indented for visual indent
Line 281:57: E127 continuation line over-indented for visual indent
Line 448:18: E128 continuation line under-indented for visual indent

Comment last updated on April 11, 2018 at 10:21 Hours UTC

@m-dz
Contributor Author

m-dz commented Jan 18, 2018

Oh, I set the line length to 120... Will fix those 1-2 character overflows after discussing the actual PR.

Left in code for further discussion of weighting approach.
@m-dz m-dz closed this Jan 18, 2018
@m-dz m-dz reopened this Jan 18, 2018
@coveralls

coveralls commented Jan 18, 2018


Coverage decreased (-0.3%) to 90.779% when pulling 2a0434e on m-dz:WIP_sample_weighting into f3769a5 on scikit-learn-contrib:master.

@m-dz
Contributor Author

m-dz commented Jan 18, 2018

Had to restart Travis, something is not working as expected.

Also, 2 or 3 commits are not really related to this feature (e.g. the KD-Tree vs Balltree unification); we should have removed them before making the PR, apologies for that.

@lmcinnes
Collaborator

I'll try to get some time to review this properly soon. I have, at least, reconfigured pep8speaks to be more forgiving over line length silliness.

@m-dz
Contributor Author

m-dz commented Jan 19, 2018

Please take your time, it's not a minor amendment (although it's mostly just passing sample_weights around and initialising node sizes with it). I'll write some validation and performance tests over the weekend as well.

Yeah, with current screen sizes a line length below 80 characters is slightly silly, even though a lot of other widely used packages like numpy or pandas still keep it.

There is also this longish commented-out code which I would like to discuss, as it produces some really interesting behaviour but might be considered "weird". Will comment more on it later.

@ihincks

ihincks commented Jan 23, 2018

I am interested in this PR. @m-dz , is there any reason (in your sample code) why the weights are integers, with most of them equal to 0? (I had no problems running your code, by the way)

@m-dz
Contributor Author

m-dz commented Jan 24, 2018

@ihincks , glad to see it compiled and ran!

The weights' distribution is based on our use case, no other reason.

The weights are integers, as this is how the tree was originally built (all nodes start with size 1 and, when joined, their sizes are summed; see the __init__ method of the UnionFind class). It should be pretty straightforward to implement this using floats, but I would rather wait until this is merged.
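
To illustrate what I mean, here is a plain-Python sketch only, assuming the general shape of the Cython UnionFind in _hdbscan_linkage.pyx; attribute and method names are illustrative, not the actual patch:

import numpy as np

class UnionFindSketch:
    """Illustrative sketch of a weighted UnionFind; not the real Cython class."""
    def __init__(self, N, sample_weights=None):
        # A binary merge tree over N leaves has at most 2*N - 1 nodes.
        self.parent = np.full(2 * N - 1, -1, dtype=np.intp)
        self.next_label = N
        # Unweighted behaviour: every leaf starts with size 1.
        # Weighted behaviour: every leaf starts with its sample weight instead.
        self.size = np.zeros(2 * N - 1)
        self.size[:N] = 1.0 if sample_weights is None else sample_weights

    def fast_find(self, n):
        # Walk up the parent pointers to the root of n's component.
        while self.parent[n] != -1:
            n = self.parent[n]
        return n

    def union(self, m, n):
        # The merged node's size is the sum of its children's (weighted) sizes,
        # exactly as in the unweighted case where it is a plain point count.
        self.size[self.next_label] = self.size[m] + self.size[n]
        self.parent[m] = self.next_label
        self.parent[n] = self.next_label
        self.next_label += 1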

@m-dz
Contributor Author

m-dz commented Feb 9, 2018

Hi @lmcinnes , is there something we could do to help you make the decision? I'll try to find some time for tests, but if you could suggest what should be tested it would simplify the process. I thought about:

  1. A runtime "offline" test ("old" vs "new" unweighted clustering, reported here);
  2. Unweighted clustering vs. all weights == 1, results should be equal (a rough sketch follows this list);
  3. Some "standard" tests of weighted cases to ensure everything runs without errors and produces reproducible results.
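
For point 2, a pytest-style sketch of what I have in mind (the sample_weights keyword to fit is the one this PR adds, so this only runs against this branch; names and parameters are placeholders):

import numpy as np
from sklearn.datasets import make_blobs

from hdbscan import HDBSCAN


def test_all_ones_weights_match_unweighted():
    # All-ones weights should reproduce the unweighted clustering exactly.
    X, _ = make_blobs(n_samples=200, cluster_std=3, random_state=10)
    unweighted = HDBSCAN(min_cluster_size=8).fit(X)
    weighted = HDBSCAN(min_cluster_size=8).fit(X, sample_weights=np.ones(len(X)))
    np.testing.assert_array_equal(unweighted.labels_, weighted.labels_)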

@m-dz
Contributor Author

m-dz commented Apr 4, 2018

Hi, I have noticed @jc-healy is assigned to this PR; are you willing to do some work on it to merge it into master? Commit fdd2c36 broke the Travis CI build, but it was needed for us to use it. I am going to separate this PR from our "internal" copy within the next few days, so all should be fine again (but I would strongly advise using a more recent Cython, as some parts of the old code are a bit redundant, like using with gil blocks inside functions that already have with gil in their headers, if I remember right).

Also, I would like to help a bit with refactoring to fewer functions (I think I have a skeleton for that ready) after finishing this PR.

@m-dz
Contributor Author

m-dz commented Jul 17, 2019

It will take some time to refresh my memory of how all of this worked, but I am happy to do some work and get this PR into the repo. @lmcinnes , @jc-healy , any comments from your side? Would you have time to review?

@lmcinnes
Collaborator

Sorry, this completely slipped through the cracks. I can't promise when we'll have time for a proper review, but we'll do what we can.

@m-dz
Contributor Author

m-dz commented Jul 17, 2019

No worries, I'll try to clean this up and confirm it still works as intended in the next week or two.

@ddifranco

ddifranco commented Aug 19, 2021

Hi all, wondering if there are still plans to move this enhancement forward? I am interested in using weighting to process a large dataset in two stages - first compress with BIRCH, then send weighted clusters to hdbscan for fine tuning. If weights aren't available, I'll unfortunately have to use regular dbscan.

Pls advise. Depending on the complexity of the enhancement, I might be able to take a crack at it; if that'd help.

@edugmes

edugmes commented Aug 19, 2021

> Hi all, wondering if there are still plans to move this enhancement forward? I am interested in using weighting to process a large dataset in two stages - first compress with BIRCH, then send weighted clusters to hdbscan for fine tuning. If weights aren't available, I'll unfortunately have to use regular dbscan.
>
> Pls advise. Depending on the complexity of the enhancement, I might be able to take a crack at it; if that'd help.

There's a workaround you can use while this feature is pending:
you can use the amplitude of your signal/data as weights by repeating each point n-amplitude times.

The tricky part is to find the min_cluster_size and min_samples parameters that best represent the data.

P.S. I'm also waiting for this feature, but as my ML knowledge is not yet enough to help with this WIP, that's the most I can offer.

@ddifranco

>> Hi all, wondering if there are still plans to move this enhancement forward? I am interested in using weighting to process a large dataset in two stages - first compress with BIRCH, then send weighted clusters to hdbscan for fine tuning. If weights aren't available, I'll unfortunately have to use regular dbscan.
>> Pls advise. Depending on the complexity of the enhancement, I might be able to take a crack at it; if that'd help.
>
> There's a workaround you can do while this feature is pending.
> You can use the amplitude of your signal/data as weights by repeating the points by n-amplitude times.
>
> The tricky part is to find a min_cluster and min_samples parameters that best represents the data.
>
> p.s. I'm also waiting for this feature, but as my ML knowledge is not enough yet to help on this WIP, that's the most I can help

Thanks for the tip - would this approach entail a change in the size of the dataset, though? The workflow I'm trying to execute is: 500K data points --> compress to 10K clusters using BIRCH --> cluster the 10K clusters in hdbscan using membership counts as weights. I think what you are suggesting would essentially entail reinflating the 10K clusters back to the 500K dataset? That unfortunately wouldn't work in my case, as my hardware conks out at around 16K points.

Am I understanding the suggestion correctly?

@edugmes

edugmes commented Aug 19, 2021

>>> Hi all, wondering if there are still plans to move this enhancement forward? I am interested in using weighting to process a large dataset in two stages - first compress with BIRCH, then send weighted clusters to hdbscan for fine tuning. If weights aren't available, I'll unfortunately have to use regular dbscan.
>>> Pls advise. Depending on the complexity of the enhancement, I might be able to take a crack at it; if that'd help.
>>
>> There's a workaround you can do while this feature is pending.
>> You can use the amplitude of your signal/data as weights by repeating the points by n-amplitude times.
>> The tricky part is to find a min_cluster and min_samples parameters that best represents the data.
>> p.s. I'm also waiting for this feature, but as my ML knowledge is not enough yet to help on this WIP, that's the most I can help
>
> Thanks for the tip - would this approach entail a change the size of the dataset though? The workflow I'm trying to execute is: 500K data --> compress to 10K clusters using BIRCH --> cluster the 10K clusters in hdbscan using membership counts as weights. I think what you are suggesting would essentially entail reinflating the 10K cluster back to the 500K dataset? That unfortunately wouldn't work in my case, as my hardware conks out at around 16K points.
>
> Am I understanding the suggestion correctly?

You could still use your 10K dataset, but you'd have to repeat each point according to its amplitude (which would increase the number of points), apply HDBSCAN, and then come back to 10K. I asked about it on Stack Overflow, someone helped me, and maybe that can give you an insight: link to the question
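
For completeness, a rough sketch of that repeat-and-map-back workaround (the function name and variable names are made up here; it assumes integer counts, and memory still grows with the sum of the counts, which is exactly the limitation discussed above):

import numpy as np

from hdbscan import HDBSCAN


def hdbscan_by_repetition(X, counts, **hdbscan_kwargs):
    # Approximate weighted clustering on the released hdbscan by repeating
    # each point counts[i] times, clustering, and mapping labels back.
    counts = np.asarray(counts, dtype=int)
    X_repeated = np.repeat(X, counts, axis=0)
    labels_repeated = HDBSCAN(**hdbscan_kwargs).fit_predict(X_repeated)
    # Each original point takes the label of its first copy; points with a
    # zero count get no copies and stay labelled as noise (-1).
    first_copy = np.concatenate(([0], np.cumsum(counts)[:-1]))
    labels = np.full(len(X), -1)
    labels[counts > 0] = labels_repeated[first_copy[counts > 0]]
    return labels


# Hypothetical usage, e.g.:
# labels = hdbscan_by_repetition(X_10k, birch_counts, min_cluster_size=8)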

@jc-healy jc-healy removed their assignment Oct 8, 2021
@jc-healy
Collaborator

jc-healy commented Oct 8, 2021

It looks like not enough of us who are involved in this have sufficient time to make this happen. I'm closing it for now.

@jc-healy jc-healy closed this Oct 8, 2021