
[WIP] Sample weighting feature implementation #165

Closed
m-dz wants to merge 18 commits into scikit-learn-contrib:master from m-dz:WIP_sample_weighting

Conversation

m-dz
Contributor

@m-dz m-dz commented Jan 18, 2018

As discussed, this is a sample weighting implementation PR. "Tested" with all 6 algorithm options, but lacking any formal tests (on the roadmap...).

Weights are basically used as starting sizes for tree creation in _hdbscan_linkage.pyx; the rest (e.g. core distance calculation) is (hopefully) untouched. As you can see in the commit log, weights were initially passed down to the core distance calculation as well, but this did not seem to be a good idea (it caused all zero-weight points to be virtually "excluded" from clustering); we believe the current approach is much more appropriate.

A quick demonstration (please uncomment plots if feeling adventurous):

'''
Example of weighted HDBSCAN.
'''

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import make_blobs
from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler

from hdbscan import HDBSCAN

plot_kwds = {'linewidths':0}
palette = sns.color_palette("hls", 20)


# Generate data
np.random.seed(1)
X, y = make_blobs(n_samples=400, cluster_std=3, shuffle=True, random_state=10)
X, y = shuffle(X, y, random_state=7)
X = StandardScaler().fit_transform(X)

sample_weights = np.floor(np.random.standard_gamma(shape=0.5, size=len(X))) * 4

sizes = ((sample_weights+1)*10).astype(int)
alphas = np.fromiter((0.5 if s < 1 else 0.75 for s in sample_weights), dtype=float, count=len(X)).reshape(-1, 1)

algorithm = 'best'
# algorithm = 'generic'
# algorithm = 'prims_kdtree'
# algorithm = 'prims_balltree'
# algorithm = 'boruvka_kdtree'
# algorithm = 'boruvka_balltree'

min_cluster_size, min_samples = 8, 1


# (Unweighted) HDBSCAN
np.random.seed(1)
hdb_unweighted = HDBSCAN(
    min_cluster_size=min_cluster_size, min_samples=min_samples, gen_min_span_tree=True, allow_single_cluster=False,
    algorithm=algorithm, prediction_data=True)
hdb_unweighted.fit(X)

cluster_colours = np.asarray([palette[lab] if lab >= 0 else (0.0, 0.0, 0.0) for lab in hdb_unweighted.labels_])
cluster_colours = np.hstack([cluster_colours, alphas])

fig = plt.figure()
plt.scatter(X.T[0], X.T[1], s=sizes, c=cluster_colours, **plot_kwds)
fig.suptitle('Unweighted HDBSCAN'); plt.show()

## Lots of plots, all working!
fig = plt.figure()
hdb_unweighted.minimum_spanning_tree_.plot(edge_cmap='viridis', edge_alpha=0.5, node_size=40, edge_linewidth=2)
fig.suptitle('Unweighted HDBSCAN minimum spanning tree'); plt.show()
fig = plt.figure()
hdb_unweighted.single_linkage_tree_.plot(cmap='viridis', colorbar=True)
fig.suptitle('Unweighted HDBSCAN single linkage tree'); plt.show()
fig = plt.figure()
hdb_unweighted.condensed_tree_.plot(select_clusters=True, selection_palette=palette, log_size=True)
fig.suptitle('Unweighted HDBSCAN condensed tree plot'); plt.show()


# (Weighted) HDBSCAN
np.random.seed(1)
hdb_weighted = HDBSCAN(
    min_cluster_size=min_cluster_size, min_samples=min_samples, gen_min_span_tree=True, allow_single_cluster=False,
    algorithm=algorithm, prediction_data=True)
hdb_weighted.fit(X, sample_weights=sample_weights)

cluster_colours = np.asarray([palette[lab] if lab >= 0 else (0.0, 0.0, 0.0) for lab in hdb_weighted.labels_])
cluster_colours = np.hstack([cluster_colours, alphas])

fig = plt.figure()
plt.scatter(X.T[0], X.T[1], s=sizes, c=cluster_colours, **plot_kwds)
fig.suptitle('Weighted HDBSCAN'); plt.show()

## Lots of plots, all working (except one or two warnings...)!
fig = plt.figure()
hdb_weighted.minimum_spanning_tree_.plot(edge_cmap='viridis', edge_alpha=0.5, node_size=40, edge_linewidth=2)
fig.suptitle('Weighted HDBSCAN minimum spanning tree'); plt.show()
# single_linkage_tree throws "plots.py:563: RuntimeWarning: divide by zero encountered in log2" but works.
fig = plt.figure()
hdb_weighted.single_linkage_tree_.plot(cmap='viridis', colorbar=True)
fig.suptitle('Weighted HDBSCAN single linkage tree'); plt.show()
fig = plt.figure()
hdb_weighted.condensed_tree_.plot(select_clusters=True, selection_palette=palette, log_size=True)
fig.suptitle('Weighted HDBSCAN condensed tree plot'); plt.show()

## Check weights by cluster:
unweighted_weights = np.asarray([[lab, sum(sample_weights[hdb_unweighted.labels_ == lab])] for lab in set(hdb_unweighted.labels_)])
weighted_weights = np.asarray([[lab, sum(sample_weights[hdb_weighted.labels_ == lab])] for lab in set(hdb_weighted.labels_)])
print(unweighted_weights)
print(weighted_weights)

We are VERY open to discussion.

@pep8speaks

pep8speaks commented Jan 18, 2018

Hello @m-dz, thank you for updating!

Line 46:1: E402 module level import not at top of file
Line 48:1: E302 expected 2 blank lines, found 1
Line 181:25: E128 continuation line under-indented for visual indent
Line 517:101: E501 line too long (101 > 100 characters)
Line 537:101: E501 line too long (101 > 100 characters)
Line 542:101: E501 line too long (103 > 100 characters)
Line 552:101: E501 line too long (103 > 100 characters)
Line 557:101: E501 line too long (105 > 100 characters)
Line 569:13: E127 continuation line over-indented for visual indent
Line 913:18: E128 continuation line under-indented for visual indent
Line 928:18: E128 continuation line under-indented for visual indent
Line 929:18: E128 continuation line under-indented for visual indent
Line 940:18: E128 continuation line under-indented for visual indent
Line 941:18: E128 continuation line under-indented for visual indent

Line 114:48: W503 line break before binary operator
Line 129:50: W503 line break before binary operator
Line 247:25: E126 continuation line over-indented for hanging indent
Line 248:33: E127 continuation line over-indented for visual indent
Line 250:48: W503 line break before binary operator
Line 309:25: E126 continuation line over-indented for hanging indent
Line 310:33: E128 continuation line under-indented for visual indent
Line 310:33: W503 line break before binary operator
Line 312:56: W503 line break before binary operator

Line 279:57: E127 continuation line over-indented for visual indent
Line 280:57: E127 continuation line over-indented for visual indent
Line 281:57: E127 continuation line over-indented for visual indent
Line 448:18: E128 continuation line under-indented for visual indent

Comment last updated on April 11, 2018 at 10:21 Hours UTC

@m-dz
Contributor Author

m-dz commented Jan 18, 2018

Oh, I set the line length to 120... Will fix those 1-2 character overflows after discussing the actual PR.

Left in code for further discussion of weighting approach.
@m-dz m-dz closed this Jan 18, 2018
@m-dz m-dz reopened this Jan 18, 2018
@coveralls

coveralls commented Jan 18, 2018


Coverage decreased (-0.3%) to 90.779% when pulling 2a0434e on m-dz:WIP_sample_weighting into f3769a5 on scikit-learn-contrib:master.

@m-dz
Contributor Author

m-dz commented Jan 18, 2018

Had to restart Travis, something is not working as expected.

Also, 2 or 3 commits are not really related to this feature (e.g. the KD-Tree vs Balltree unification); we should have removed them before making the PR, apologies for that.

@lmcinnes
Collaborator

I'll try to get some time to review this properly soon. I have, at least, reconfigured pep8speaks to be more forgiving over line length silliness.

@m-dz
Contributor Author

m-dz commented Jan 19, 2018

Please take your time, it's not a minor amendment (although it's mostly just passing sample_weights around and initialising node sizes with it). I'll write some validation and performance tests over the weekend as well.

Yeah, with current screen sizes a line length below 80 characters is slightly silly, even though a lot of other widely used packages like numpy or pandas still keep it.

There is also this longish commented-out code which I would like to discuss, as it produces some really interesting behaviour but might be considered "weird". Will comment more on it later.

@ihincks

ihincks commented Jan 23, 2018

I am interested in this PR. @m-dz , is there any reason (in your sample code) why the weights are integers, with most of them equal to 0? (I had no problems running your code, by the way)

@m-dz
Contributor Author

m-dz commented Jan 24, 2018

@ihincks , glad to see it compiled and ran!

The weights' distribution is based on our use case, no other reason.

The weights are integers, as this is how the tree was originally built (all nodes start with size 1 and, when joined, their sizes are summed; see the __init__ method of the UnionFind class). It should be pretty straightforward to implement this using floats, but I would rather wait until this is merged.
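
To illustrate what I mean, here is a plain-Python sketch only, assuming the general shape of the Cython UnionFind in _hdbscan_linkage.pyx; attribute and method names are illustrative, not the actual patch:

import numpy as np

class UnionFindSketch:
    """Illustrative sketch of a weighted UnionFind; not the real Cython class."""
    def __init__(self, N, sample_weights=None):
        # A binary merge tree over N leaves has at most 2*N - 1 nodes.
        self.parent = np.full(2 * N - 1, -1, dtype=np.intp)
        self.next_label = N
        # Unweighted behaviour: every leaf starts with size 1.
        # Weighted behaviour: every leaf starts with its sample weight instead.
        self.size = np.zeros(2 * N - 1)
        self.size[:N] = 1.0 if sample_weights is None else sample_weights

    def fast_find(self, n):
        # Walk up the parent pointers to the root of n's component.
        while self.parent[n] != -1:
            n = self.parent[n]
        return n

    def union(self, m, n):
        # The merged node's size is the sum of its children's (weighted) sizes,
        # exactly as in the unweighted case where it is a plain point count.
        self.size[self.next_label] = self.size[m] + self.size[n]
        self.parent[m] = self.next_label
        self.parent[n] = self.next_label
        self.next_label += 1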

@m-dz
Contributor Author

m-dz commented Feb 9, 2018

Hi @lmcinnes , is there something we could do to help you make the decision? I'll try to find some time for tests, but if you could suggest what should be tested it would simplify the process. I thought about:

  1. A runtime "offline" test ("old" vs "new" unweighted clustering, reported here);
  2. Unweighted clustering vs. all weights == 1, results should be equal (a rough sketch follows this list);
  3. Some "standard" tests of weighted cases to ensure everything runs without errors and produces reproducible results.
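
For point 2, a pytest-style sketch of what I have in mind (the sample_weights keyword to fit is the one this PR adds, so this only runs against this branch; names and parameters are placeholders):

import numpy as np
from sklearn.datasets import make_blobs

from hdbscan import HDBSCAN


def test_all_ones_weights_match_unweighted():
    # All-ones weights should reproduce the unweighted clustering exactly.
    X, _ = make_blobs(n_samples=200, cluster_std=3, random_state=10)
    unweighted = HDBSCAN(min_cluster_size=8).fit(X)
    weighted = HDBSCAN(min_cluster_size=8).fit(X, sample_weights=np.ones(len(X)))
    np.testing.assert_array_equal(unweighted.labels_, weighted.labels_)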

@m-dz
Contributor Author

m-dz commented Apr 4, 2018

Hi, I have noticed @jc-healy is assigned to this PR; are you willing to do some work on it to merge it into master? Commit fdd2c36 broke the Travis CI build, but it was needed for us to use it. I am going to separate this PR from our "internal" copy within the next few days, so all should be fine again (but I would strongly advise using a more recent Cython, as some parts of the old code are a bit redundant, like using with gil blocks inside functions that already have with gil in their headers, if I remember right).

Also, I would like to help a bit with refactoring to fewer functions (I think I have a skeleton for that ready) after finishing this PR.

@m-dz
Contributor Author

m-dz commented Jul 17, 2019

It will take some time to refresh my memory of how all of this worked, but I am happy to do some work and get this PR into the repo. @lmcinnes , @jc-healy , any comments from your side? Would you have time to review?

@lmcinnes
Collaborator

Sorry, this completely slipped through the cracks. I can't promise when we'll have time for a proper review, but we'll do what we can.

@m-dz
Contributor Author

m-dz commented Jul 17, 2019

No worries, I'll try to clean this up and confirm it still works as intended in the next week or two.

@ddifranco

ddifranco commented Aug 19, 2021

Hi all, wondering if there are still plans to move this enhancement forward? I am interested in using weighting to process a large dataset in two stages - first compress with BIRCH, then send weighted clusters to hdbscan for fine tuning. If weights aren't available, I'll unfortunately have to use regular dbscan.

Pls advise. Depending on the complexity of the enhancement, I might be able to take a crack at it; if that'd help.

@edugmes

edugmes commented Aug 19, 2021

> Hi all, wondering if there are still plans to move this enhancement forward? I am interested in using weighting to process a large dataset in two stages - first compress with BIRCH, then send weighted clusters to hdbscan for fine tuning. If weights aren't available, I'll unfortunately have to use regular dbscan.
>
> Pls advise. Depending on the complexity of the enhancement, I might be able to take a crack at it; if that'd help.

There's a workaround you can use while this feature is pending:
you can use the amplitude of your signal/data as weights by repeating each point n-amplitude times.

The tricky part is to find the min_cluster_size and min_samples parameters that best represent the data.

P.S. I'm also waiting for this feature, but as my ML knowledge is not yet enough to help with this WIP, that's the most I can offer.

@ddifranco

>> Hi all, wondering if there are still plans to move this enhancement forward? I am interested in using weighting to process a large dataset in two stages - first compress with BIRCH, then send weighted clusters to hdbscan for fine tuning. If weights aren't available, I'll unfortunately have to use regular dbscan.
>> Pls advise. Depending on the complexity of the enhancement, I might be able to take a crack at it; if that'd help.
>
> There's a workaround you can do while this feature is pending.
> You can use the amplitude of your signal/data as weights by repeating the points by n-amplitude times.
>
> The tricky part is to find a min_cluster and min_samples parameters that best represents the data.
>
> p.s. I'm also waiting for this feature, but as my ML knowledge is not enough yet to help on this WIP, that's the most I can help

Thanks for the tip - would this approach entail a change in the size of the dataset, though? The workflow I'm trying to execute is: 500K data points --> compress to 10K clusters using BIRCH --> cluster the 10K clusters in hdbscan using membership counts as weights. I think what you are suggesting would essentially entail reinflating the 10K clusters back to the 500K dataset? That unfortunately wouldn't work in my case, as my hardware conks out at around 16K points.

Am I understanding the suggestion correctly?

@edugmes

edugmes commented Aug 19, 2021

>>> Hi all, wondering if there are still plans to move this enhancement forward? I am interested in using weighting to process a large dataset in two stages - first compress with BIRCH, then send weighted clusters to hdbscan for fine tuning. If weights aren't available, I'll unfortunately have to use regular dbscan.
>>> Pls advise. Depending on the complexity of the enhancement, I might be able to take a crack at it; if that'd help.
>>
>> There's a workaround you can do while this feature is pending.
>> You can use the amplitude of your signal/data as weights by repeating the points by n-amplitude times.
>> The tricky part is to find a min_cluster and min_samples parameters that best represents the data.
>> p.s. I'm also waiting for this feature, but as my ML knowledge is not enough yet to help on this WIP, that's the most I can help
>
> Thanks for the tip - would this approach entail a change the size of the dataset though? The workflow I'm trying to execute is: 500K data --> compress to 10K clusters using BIRCH --> cluster the 10K clusters in hdbscan using membership counts as weights. I think what you are suggesting would essentially entail reinflating the 10K cluster back to the 500K dataset? That unfortunately wouldn't work in my case, as my hardware conks out at around 16K points.
>
> Am I understanding the suggestion correctly?

You could still use your 10K dataset, but you'd have to repeat each point according to its amplitude (which would increase the number of points), apply HDBSCAN, and then come back to 10K. I asked about it on Stack Overflow, someone helped me, and maybe that can give you an insight: link to the question
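
For completeness, a rough sketch of that repeat-and-map-back workaround (the function name and variable names are made up here; it assumes integer counts, and memory still grows with the sum of the counts, which is exactly the limitation discussed above):

import numpy as np

from hdbscan import HDBSCAN


def hdbscan_by_repetition(X, counts, **hdbscan_kwargs):
    # Approximate weighted clustering on the released hdbscan by repeating
    # each point counts[i] times, clustering, and mapping labels back.
    counts = np.asarray(counts, dtype=int)
    X_repeated = np.repeat(X, counts, axis=0)
    labels_repeated = HDBSCAN(**hdbscan_kwargs).fit_predict(X_repeated)
    # Each original point takes the label of its first copy; points with a
    # zero count get no copies and stay labelled as noise (-1).
    first_copy = np.concatenate(([0], np.cumsum(counts)[:-1]))
    labels = np.full(len(X), -1)
    labels[counts > 0] = labels_repeated[first_copy[counts > 0]]
    return labels


# Hypothetical usage, e.g.:
# labels = hdbscan_by_repetition(X_10k, birch_counts, min_cluster_size=8)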

@jc-healy jc-healy removed their assignment Oct 8, 2021
@jc-healy
Collaborator

jc-healy commented Oct 8, 2021

It looks like not enough of us who are involved in this have sufficient time to make this happen. I'm closing it for now.

@jc-healy jc-healy closed this Oct 8, 2021