
Efficient clustering (unknown number of clusters; optionally online) by greedily picking exemplars #22117

Open
page200 opened this issue Jan 3, 2022 · 11 comments

Comments

@page200

page200 commented Jan 3, 2022

Describe the workflow you want to enable

Cluster a dataset (many samples) where distances are small within clusters and large between clusters. Memory requirements and runtime should be low.

Describe your proposed solution

If a sample's distance to all exemplars defined so far is above a user-defined threshold, append that sample to the exemplars. Then continue with the next samples.

The runtime seems to be O(n_samples * n_clusters), and the memory requirement O(n_clusters).

In practice this seemed to be by far the fastest and most memory-efficient clustering algorithm for my dataset (as said: many samples, well-separated clusters).

Here's the code:

import numpy as np
import scipy.spatial.distance

DISTANCE_THRESHOLD = 100

# Initialize the array of exemplars with the first sample, in order to avoid checking
# in each loop iteration whether the exemplar array is empty.
exemplars = samples[0:1, :]

# `samples` is the full dataset (n_samples, n_features); `batches` yields chunks of its rows.
for batch in batches:
    # If one of the new samples is far away from all exemplars, concatenate it to the exemplars.
    while (distances_to_nearest_exemplars := (distance_matrix := scipy.spatial.distance.cdist(batch, exemplars)).min(axis=1)).max() > DISTANCE_THRESHOLD:
        new_exemplar_index = distances_to_nearest_exemplars.argmax()
        exemplars = np.concatenate([exemplars, batch[new_exemplar_index:new_exemplar_index + 1, :]])
        # TODO: check whether keeping exemplars in a list rather than a NumPy array is faster.
        # `distance_matrix.argmin(axis=1)` can be used to compute cluster assignments for the samples.
        # Deleting the sample from the batch is optional because the sample will be below the threshold
        #   after recomputing the distances (which the `while` loop recomputes anyway in order to avoid
        #   adding several similar samples from the same batch).
        # `scipy.spatial.distance.cdist` seems faster than `scipy.spatial.distance_matrix`.

Optionally, exemplars (centroids) can be updated whenever desired by averaging the samples assigned to their cluster so far.
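
For concreteness, a minimal sketch of that optional update, assuming `samples` and `exemplars` as in the snippet above:

# Sketch of the optional centroid update: assign every sample seen so far to its
# nearest exemplar, then replace each exemplar by the mean of its assigned samples.
# Assumes `samples` and `exemplars` as in the snippet above.
import numpy as np
import scipy.spatial.distance

labels = scipy.spatial.distance.cdist(samples, exemplars).argmin(axis=1)
exemplars = np.stack([samples[labels == k].mean(axis=0) for k in range(len(exemplars))])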

Describe alternatives you've considered, if relevant

I tried various clustering algorithms from sklearn for (very) large n_samples, but they were too memory-hungry and slow.

Additional context

As a bonus, the algorithm is applicable to online clustering.

Knowing the number of clusters is not necessary.

If a specific number of clusters is desired, this can be achieved as described below.

At the end, centroids can be updated (see above) and iteratively merged if a cluster hierarchy (or a smaller number of clusters) is desired; a sketch follows below. (To start from a larger number of clusters, the threshold can be set low.)
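
One possible sketch of that merging step, using scikit-learn's AgglomerativeClustering on the exemplars themselves (the final cluster count is illustrative):

# Sketch: merge the (possibly many) exemplars hierarchically into a desired number
# of final clusters. `exemplars` is the array built by the snippet above; the value
# of N_FINAL_CLUSTERS is illustrative.
from sklearn.cluster import AgglomerativeClustering

N_FINAL_CLUSTERS = 10
merger = AgglomerativeClustering(n_clusters=N_FINAL_CLUSTERS, linkage="average")
exemplar_labels = merger.fit_predict(exemplars)
# A sample's final label is exemplar_labels[<index of its nearest exemplar>].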

@page200 added the Needs Triage and New Feature labels on Jan 3, 2022
@glemaitre added the Needs Decision label and removed the Needs Triage label on Jan 4, 2022
@glemaitre
Member

glemaitre commented Jan 4, 2022

Could you provide us with the literature related to this algorithm?

Please refer to the inclusion criterion in order to propose a new feature: https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms

Did you have a look at MiniBatchKMeans, which should already answer some of these needs?
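
(Illustrative sketch of the suggested mini-batch workflow; the parameter values are placeholders, and `batches` refers to the iterable from the snippet in the issue description.)

# Sketch of the MiniBatchKMeans workflow suggested above; note that n_clusters
# must be chosen up front (values here are illustrative).
from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024)
for batch in batches:  # `batches` as in the snippet from the issue description
    mbk.partial_fit(batch)
centroids = mbk.cluster_centers_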

@page200
Author

page200 commented Jan 4, 2022

MiniBatchKMeans requires n_clusters as input. This algorithm is for cases where we don't know the number of clusters.

It seems to be a special (fast and simple) case of the Canopy clustering algorithm (https://en.wikipedia.org/wiki/Canopy_clustering_algorithm, McCallum et al., 2000), obtained by setting one of its two hyperparameters (thresholds) to its extreme value, thus minimizing the number of required iterations and checks.

@Micky774
Contributor

Micky774 commented Jan 7, 2022

I think it would be helpful if you could provide some use cases, outside of your personal dataset, which exemplify the value this algorithm would add. If the main selling point is its speed, this should be fairly straightforward to demonstrate.

@page200
Author

page200 commented Jan 10, 2022

What are the criteria for accepting a dataset as a use case? Is any dataset OK?

@Micky774
Contributor

I'm not an official maintainer, so I don't know if there are any such criteria (I don't think there are). I'm mainly asking informally as an interested contributor.

You could also quickly build a synthetic dataset by constructing a Gaussian Mixture Model, with varying means/variances, to demonstrate its efficacy.

In any case, try to establish some simple benchmark process to compare the Greedy-Canopy clustering you're proposing with the "next-best" built-in clustering methods (something that should be reasonably effective on whatever dataset you're using).

@ogrisel
Member

ogrisel commented Jan 12, 2022

Have you tried to play with BIRCH with n_clusters=None and then use a radius-based clustering method such as DBSCAN or OPTICS as a global clustering finalization step that does not set the number of clusters to find a priori?

See the user guide for more details: https://scikit-learn.org/stable/modules/clustering.html#birch (in particular the last section on the partial_fit API).
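
(Rough sketch of that workflow; the threshold/eps values are illustrative, and `batches`/`samples` refer to the variables from the snippet in the issue description.)

# Sketch of the suggested BIRCH workflow: stream the data with partial_fit, then run
# a radius-based method (here DBSCAN) on the subcluster centroids as the global step.
# The threshold/eps values are illustrative.
from sklearn.cluster import Birch, DBSCAN

birch = Birch(threshold=0.5, n_clusters=None)
for batch in batches:            # `batches` as in the snippet from the issue description
    birch.partial_fit(batch)
birch.set_params(n_clusters=DBSCAN(eps=1.0, min_samples=1))
birch.partial_fit()              # no data: only runs the global clustering step
labels = birch.predict(samples)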

@ogrisel
Member

ogrisel commented Jan 12, 2022

BTW, I would also be interested in the results of the study proposed by @Micky774.

@page200
Author

page200 commented Jan 26, 2022

Gaussians overlap, whereas this method is for clusters that are well separated.

A toy dataset with separate clusters is for example np.stack((np.random.rand(N,1),np.random.rand(N,1)+3)) (maybe transposed, depending on the conventions). When N is sufficiently large and the hyperparameters (batch size and threshold) are optimized, this method seems to outperform all methods from sklearn in terms of speed and in terms of the speed/memory trade-off.

The hyperparameters are easy to tune: for example, here the threshold can be 1.5 and the batch size can be whatever still fits into memory.
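
For concreteness, one possible reading of this toy setup (shapes, batch size, and the shuffle are illustrative):

# Sketch of the toy dataset: 2*N one-dimensional points in [0, 1) and [3, 4),
# i.e. two well-separated clusters, fed to the loop from the issue description.
# N and BATCH_SIZE are illustrative.
import numpy as np

N = 1_000_000
samples = np.concatenate([np.random.rand(N, 1), np.random.rand(N, 1) + 3])
np.random.shuffle(samples)

DISTANCE_THRESHOLD = 1.5
BATCH_SIZE = 100_000
batches = (samples[i:i + BATCH_SIZE] for i in range(0, len(samples), BATCH_SIZE))
# With this threshold, the greedy loop should end up with exactly two exemplars,
# one per cluster.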

@Micky774
Contributor

Micky774 commented Jan 26, 2022

> Gaussians overlap, whereas this method is for clusters that are well separated.

Gaussians don't necessarily overlap, or at least not that strongly. See below.

> A toy dataset with separate clusters is for example np.stack((np.random.rand(N,1),np.random.rand(N,1)+3)) (maybe transposed, depending on the conventions).

np.random.randn pulls from a standard normal, i.e. Gaussian, distribution. My bad, I misread your message! Still, it would work with randn. The randn version is along the lines of what I meant by constructing a toy dataset out of a Gaussian mixture model. Gaussian distributions with means placed sufficiently far apart (e.g. 3 standard deviations) are usually "well separated".

> When N is sufficiently large and the hyperparameters (batch size and threshold) are optimized, this method seems to outperform all methods from sklearn in terms of speed and in terms of the speed/memory trade-off.

If you could provide some concrete numbers -- or better yet, graphs -- it would greatly help. There are plenty of questions here, such as "At what number of points does the algorithm begin to overtake other methods?" or "How much is the speedup?", etc. All of these can first be demonstrated on a toy, well-separated dataset. If you have questions about such benchmarks, let me know and I can try to offer advice.

Edit: addressed my mistake in reading your original reply -- thanks for bringing that to my attention!

@page200
Author

page200 commented Jan 26, 2022

> A toy dataset with separate clusters is for example np.stack((np.random.rand(N,1),np.random.rand(N,1)+3)) (maybe transposed, depending on the conventions).

> np.random.randn pulls from a standard normal, i.e. Gaussian, distribution.

I proposed rand, not randn.

> Gaussian distributions with means placed sufficiently far apart (e.g. 3 standard deviations) are usually "well separated".

Edit: With Gaussians, the probability of placing an exemplar at a point between clusters is tiny, but not zero. To lower this probability further, the algorithm can be run several times with different (e.g. random) orders of samples. But there's no need for that, and no problem, if the dataset has gaps between clusters.

> If you could provide some concrete numbers -- or better yet, graphs -- it would greatly help. There are plenty of questions here, such as "At what number of points does the algorithm begin to overtake other methods?" or "How much is the speedup?", etc. All of these can first be demonstrated on a toy, well-separated dataset. If you have questions about such benchmarks, let me know and I can try to offer advice.

I don't know when I'll get to this, so anyone interested is welcome to do it. With a very large number of points, the other algorithms fill my RAM and pretty much freeze the computer, probably for hours (so measuring the runtime is not feasible), whereas the proposed algorithm takes much less than a minute.

@Micky774
Contributor

> I proposed rand, not randn.

My bad, I misread that the first time and have updated my comment to reflect it. Thanks for pointing it out!
