
Efficient clustering (unknown number of clusters; optionally online) by greedily picking exemplars #22117

Open
page200 opened this issue Jan 3, 2022 · 11 comments

Comments

@page200

page200 commented Jan 3, 2022

Describe the workflow you want to enable

Cluster a dataset (many samples) where distances are small within clusters and large between clusters. Memory requirements and runtime should be low.

Describe your proposed solution

If a sample's distance to all exemplars defined so far is above a user-defined threshold, append that sample to the exemplars. Then continue with the next samples.

The runtime seems to be O(n_samples * n_clusters), and the memory requirement O(n_clusters).

In practice this seemed to be by far the fastest and most memory-efficient clustering algorithm for my dataset (as said: many samples, well-separated clusters).

Here's the code:

import numpy as np
import scipy.spatial.distance

DISTANCE_THRESHOLD = 100

# Initialize the array of exemplars with the first sample, in order to avoid checking
# in each loop iteration whether the exemplar array is empty.
exemplars = samples[0:1, :]

# `samples` is the full dataset (n_samples, n_features); `batches` yields chunks of its rows.
for batch in batches:
    # If one of the new samples is far away from all exemplars, concatenate it to the exemplars.
    while (distances_to_nearest_exemplars := (distance_matrix := scipy.spatial.distance.cdist(batch, exemplars)).min(axis=1)).max() > DISTANCE_THRESHOLD:
        new_exemplar_index = distances_to_nearest_exemplars.argmax()
        exemplars = np.concatenate([exemplars, batch[new_exemplar_index:new_exemplar_index + 1, :]])
        # TODO: check whether keeping exemplars in a list rather than a NumPy array is faster.
        # `distance_matrix.argmin(axis=1)` can be used to compute cluster assignments for the samples.
        # Deleting the sample from the batch is optional because the sample will be below the threshold
        #   after recomputing the distances (which the `while` loop recomputes anyway in order to avoid
        #   adding several similar samples from the same batch).
        # `scipy.spatial.distance.cdist` seems faster than `scipy.spatial.distance_matrix`.

Optionally, exemplars (centroids) can be updated whenever desired by averaging the samples assigned to their cluster so far.
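
For concreteness, a minimal sketch of that optional update, assuming `samples` and `exemplars` as in the snippet above:

# Sketch of the optional centroid update: assign every sample seen so far to its
# nearest exemplar, then replace each exemplar by the mean of its assigned samples.
# Assumes `samples` and `exemplars` as in the snippet above.
import numpy as np
import scipy.spatial.distance

labels = scipy.spatial.distance.cdist(samples, exemplars).argmin(axis=1)
exemplars = np.stack([samples[labels == k].mean(axis=0) for k in range(len(exemplars))])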

Describe alternatives you've considered, if relevant

I tried various clustering algorithms from sklearn for (very) large n_samples, but they were too memory-hungry and slow.

Additional context

As a bonus, the algorithm is applicable to online clustering.

Knowing the number of clusters is not necessary.

If a specific number of clusters is desired, this can be achieved as described below.

At the end, centroids can be updated (see above) and iteratively merged if a cluster hierarchy (or a smaller number of clusters) is desired; a sketch follows below. (To start from a larger number of clusters, the threshold can be set low.)
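
One possible sketch of that merging step, using scikit-learn's AgglomerativeClustering on the exemplars themselves (the final cluster count is illustrative):

# Sketch: merge the (possibly many) exemplars hierarchically into a desired number
# of final clusters. `exemplars` is the array built by the snippet above; the value
# of N_FINAL_CLUSTERS is illustrative.
from sklearn.cluster import AgglomerativeClustering

N_FINAL_CLUSTERS = 10
merger = AgglomerativeClustering(n_clusters=N_FINAL_CLUSTERS, linkage="average")
exemplar_labels = merger.fit_predict(exemplars)
# A sample's final label is exemplar_labels[<index of its nearest exemplar>].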

@page200 added the Needs Triage and New Feature labels on Jan 3, 2022
@glemaitre added the Needs Decision label and removed the Needs Triage label on Jan 4, 2022
@glemaitre
Member

glemaitre commented Jan 4, 2022

Could you provide us with the literature related to this algorithm?

Please refer to the inclusion criterion in order to propose a new feature: https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms

Did you have a look at MiniBatchKMeans, which should already answer some of these needs?
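
(Illustrative sketch of the suggested mini-batch workflow; the parameter values are placeholders, and `batches` refers to the iterable from the snippet in the issue description.)

# Sketch of the MiniBatchKMeans workflow suggested above; note that n_clusters
# must be chosen up front (values here are illustrative).
from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024)
for batch in batches:  # `batches` as in the snippet from the issue description
    mbk.partial_fit(batch)
centroids = mbk.cluster_centers_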

@page200
Author

page200 commented Jan 4, 2022

MiniBatchKMeans requires n_clusters as input. This algorithm is for cases where we don't know the number of clusters.

It seems to be a special (fast and simple) case of the Canopy clustering algorithm (https://en.wikipedia.org/wiki/Canopy_clustering_algorithm, McCallum et al., 2000), obtained by setting one of its two hyperparameters (thresholds) to its extreme value, thus minimizing the number of required iterations and checks.

@Micky774
Contributor

Micky774 commented Jan 7, 2022

I think it would be helpful if you could provide some use cases, outside of your personal dataset, which exemplify the value this algorithm would add. If the main selling point is its speed, this should be fairly straightforward to demonstrate.

@page200
Author

page200 commented Jan 10, 2022

What are the criteria for accepting a dataset as a use case? Is any dataset OK?

@Micky774
Contributor

I'm not an official maintainer, so I don't know if there are any such criteria (I don't think there are). I'm mainly asking informally as an interested contributor.

You could also quickly build a synthetic dataset by constructing a Gaussian Mixture Model, with varying means/variances, to demonstrate its efficacy.

In any case, try to establish some simple benchmark process to compare the Greedy-Canopy clustering you're proposing with the "next-best" built-in clustering methods (something that should be reasonably effective on whatever dataset you're using).

@ogrisel
Member

ogrisel commented Jan 12, 2022

Have you tried to play with BIRCH with n_clusters=None and then use a radius-based clustering method such as DBSCAN or OPTICS as a global clustering finalization step that does not set the number of clusters to find a priori?

See the user guide for more details: https://scikit-learn.org/stable/modules/clustering.html#birch (in particular the last section on the partial_fit API).
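
(Rough sketch of that workflow; the threshold/eps values are illustrative, and `batches`/`samples` refer to the variables from the snippet in the issue description.)

# Sketch of the suggested BIRCH workflow: stream the data with partial_fit, then run
# a radius-based method (here DBSCAN) on the subcluster centroids as the global step.
# The threshold/eps values are illustrative.
from sklearn.cluster import Birch, DBSCAN

birch = Birch(threshold=0.5, n_clusters=None)
for batch in batches:            # `batches` as in the snippet from the issue description
    birch.partial_fit(batch)
birch.set_params(n_clusters=DBSCAN(eps=1.0, min_samples=1))
birch.partial_fit()              # no data: only runs the global clustering step
labels = birch.predict(samples)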

@ogrisel
Member

ogrisel commented Jan 12, 2022

BTW, I would also be interested in the results of the study proposed by @Micky774.

@page200
Author

page200 commented Jan 26, 2022

Gaussians overlap, whereas this method is for clusters that are well separated.

A toy dataset with separate clusters is for example np.stack((np.random.rand(N,1),np.random.rand(N,1)+3)) (maybe transposed, depending on the conventions). When N is sufficiently large and the hyperparameters (batch size and threshold) are optimized, this method seems to outperform all methods from sklearn in terms of speed and in terms of the speed/memory trade-off.

The hyperparameters are easy to tune: for example, here the threshold can be 1.5 and the batch size can be whatever still fits into memory.
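
For concreteness, one possible reading of this toy setup (shapes, batch size, and the shuffle are illustrative):

# Sketch of the toy dataset: 2*N one-dimensional points in [0, 1) and [3, 4),
# i.e. two well-separated clusters, fed to the loop from the issue description.
# N and BATCH_SIZE are illustrative.
import numpy as np

N = 1_000_000
samples = np.concatenate([np.random.rand(N, 1), np.random.rand(N, 1) + 3])
np.random.shuffle(samples)

DISTANCE_THRESHOLD = 1.5
BATCH_SIZE = 100_000
batches = (samples[i:i + BATCH_SIZE] for i in range(0, len(samples), BATCH_SIZE))
# With this threshold, the greedy loop should end up with exactly two exemplars,
# one per cluster.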

@Micky774
Contributor

Micky774 commented Jan 26, 2022

> Gaussians overlap, whereas this method is for clusters that are well separated.

Gaussians don't necessarily overlap, or at least not that strongly. See below.

> A toy dataset with separate clusters is for example np.stack((np.random.rand(N,1),np.random.rand(N,1)+3)) (maybe transposed, depending on the conventions).

np.random.randn pulls from a standard normal, i.e. Gaussian, distribution. My bad, I misread your message! Still, it would work with randn. The randn version is along the lines of what I meant by constructing a toy dataset out of a Gaussian mixture model. Gaussian distributions with means placed sufficiently far apart (e.g. 3 standard deviations) are usually "well separated".

> When N is sufficiently large and the hyperparameters (batch size and threshold) are optimized, this method seems to outperform all methods from sklearn in terms of speed and in terms of the speed/memory trade-off.

If you could provide some concrete numbers -- or better yet, graphs -- it would greatly help. There are plenty of questions here, such as "At what number of points does the algorithm begin to overtake other methods?" or "How much is the speedup?", etc. All of these can first be demonstrated on a toy, well-separated dataset. If you have questions about such benchmarks, let me know and I can try to offer advice.

Edit: addressed my mistake in reading your original reply -- thanks for bringing that to my attention!

@page200
Author

page200 commented Jan 26, 2022

> A toy dataset with separate clusters is for example np.stack((np.random.rand(N,1),np.random.rand(N,1)+3)) (maybe transposed, depending on the conventions).

> np.random.randn pulls from a standard normal, i.e. Gaussian, distribution.

I proposed rand, not randn.

> Gaussian distributions with means placed sufficiently far apart (e.g. 3 standard deviations) are usually "well separated".

Edit: With Gaussians, the probability of placing an exemplar at a point between clusters is tiny, but not zero. To lower this probability further, the algorithm can be run several times with different (e.g. random) orders of samples. But there's no need for that, and no problem, if the dataset has gaps between clusters.

> If you could provide some concrete numbers -- or better yet, graphs -- it would greatly help. There are plenty of questions here, such as "At what number of points does the algorithm begin to overtake other methods?" or "How much is the speedup?", etc. All of these can first be demonstrated on a toy, well-separated dataset. If you have questions about such benchmarks, let me know and I can try to offer advice.

I don't know when I'll get to this, so anyone interested is welcome to do it. With a very large number of points, the other algorithms fill my RAM and pretty much freeze the computer, probably for hours (so measuring the runtime is not feasible), whereas the proposed algorithm takes much less than a minute.

@Micky774
Contributor

> I proposed rand, not randn.

My bad, I misread that the first time and have updated my comment to reflect it. Thanks for pointing it out!
