Efficient clustering (unknown number of clusters; optionally online) by greedily picking exemplars #22117
Comments
Could you provide us with the literature related to this algorithm? Please refer to the inclusion criteria in order to propose a new feature: https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms Did you have a look at …
It seems to be a special (fast and simple) case of https://en.wikipedia.org/wiki/Canopy_clustering_algorithm (McCallum et al., 2000) by setting one of the two hyperparameters (thresholds) to its extreme value, thus minimizing the number of required iterations and checks.
I think it would be helpful if you could provide some use-cases, outside of your personal dataset, which exemplify the value this algorithm would add. If the main selling point is its speed, this should be fairly straightforward to demonstrate.
What are the criteria for accepting a dataset as a use case? Is any dataset OK?
I'm not an official maintainer, so I don't know if there are any such criteria (I don't think there are). I'm mainly asking informally as an interested contributor. You could also quickly build a synthetic dataset by constructing a Gaussian Mixture Model, with varying means/variances, to demonstrate its efficacy. In any case, try to establish some simple benchmark process to compare the Greedy-Canopy clustering you're proposing with the "next-best" built-in clustering methods (something that should be reasonably effective on whatever dataset you're using).
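A minimal benchmark sketch along those lines (the dataset construction via `make_blobs` and the choice of `MiniBatchKMeans` as the baseline are illustrative assumptions, not part of the proposal):

```python
from time import perf_counter

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic Gaussian mixture with varying spreads.
X, y_true = make_blobs(n_samples=200_000, centers=30,
                       cluster_std=[0.5 + 0.1 * k for k in range(30)],
                       center_box=(-500, 500), random_state=0)

# Time a built-in baseline; repeat the same measurement for the proposed
# greedy exemplar method and compare runtime, memory, and label quality.
t0 = perf_counter()
labels = MiniBatchKMeans(n_clusters=30, random_state=0).fit_predict(X)
print(f"MiniBatchKMeans: {perf_counter() - t0:.2f}s, "
      f"ARI={adjusted_rand_score(y_true, labels):.3f}")
```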
Have you tried to play with BIRCH? See the user guide for more details: https://scikit-learn.org/stable/modules/clustering.html#birch (in particular the last section).
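For instance (a rough sketch; the dataset and the `threshold` value are illustrative guesses):

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=50, cluster_std=1.0,
                  center_box=(-1000, 1000), random_state=0)

# With n_clusters=None, Birch skips the final global clustering step and
# returns its subclusters directly, so the number of clusters does not
# have to be known in advance.
labels = Birch(threshold=10.0, n_clusters=None).fit_predict(X)
```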
BTW, I would also be interested in the results of the study proposed by @Micky774.
Gaussians overlap, whereas this method is for clusters that are well separated; a toy dataset with separate clusters is sketched below, for example. The hyperparameters are easy to tune: for example, the threshold can be anything between the typical within-cluster distance and the (much larger) distance between clusters.
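One way to build such a toy dataset (an illustrative assumption, not necessarily the example originally intended):

```python
import numpy as np
from sklearn.datasets import make_blobs

# 50 tight blobs scattered over a huge box: within-cluster distances are a
# few units, while cluster centers are typically hundreds of units apart.
X, y, centers = make_blobs(n_samples=100_000, centers=50, cluster_std=1.0,
                           center_box=(-1000, 1000), random_state=0,
                           return_centers=True)

# Any threshold between those two scales (e.g. 10) separates the clusters.
d = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
print("smallest inter-center distance:", d[d > 0].min())
```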
Gaussians don't necessarily overlap, or at least not that strongly. See below.
If you could provide some concrete numbers -- or better yet, graphs -- it would greatly help. There are plenty of questions here, such as "At what number of points does the algorithm begin to overtake other methods?", or "How much is the speedup?", etc. All of these can first be demonstrated on a toy, well-separated dataset. If you have questions about such benchmarks, let me know and I can try to offer advice. Edit: addressed my mistake in reading your original reply -- thanks for bringing that to my attention!
I proposed …
Edit: With Gaussians, the probability of setting an exemplar to a point in between clusters is tiny, but not zero. To lower this probability further, the algorithm can be run several times with different (e.g. random) orders of samples. But there's no need for that, and no problems arise, if the dataset has gaps between clusters.
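A sketch of that re-running idea (assuming the `greedy_exemplar_clustering` function sketched further down in the issue body; keeping the run with the fewest exemplars is a hypothetical selection criterion):

```python
import numpy as np

rng = np.random.default_rng(0)
best_exemplars, best_labels = None, None
for _ in range(5):
    order = rng.permutation(len(X))
    exemplars, labels = greedy_exemplar_clustering(X[order], threshold=10.0)
    if best_exemplars is None or len(exemplars) < len(best_exemplars):
        # undo the shuffle so the labels line up with the original X
        best_exemplars, best_labels = exemplars, labels[np.argsort(order)]
```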
I don't know when I'll do this, so anyone interested is welcome to take it up. With a very large number of points, the other algorithms fill my RAM and pretty much freeze the computer for probably hours (so measuring the runtime is not feasible), whereas the proposed algorithm takes much less than a minute.
My bad, I misread that the first time. Updated my comment to reflect that. Ty for pointing it out!
Describe the workflow you want to enable
Cluster a dataset (many samples) where distances are small within clusters and large between clusters. Memory requirements and runtime should be low.
Describe your proposed solution
If a sample's distance to all exemplars defined so far is above a user-defined threshold, add that sample to the exemplars; otherwise, assign it to the cluster of its nearest exemplar. Continue with the next samples.
The runtime requirements seem to be in O(`n_samples * n_clusters`), and the memory requirements in O(`n_clusters`). In practice this seemed to be by far the fastest and most memory-economical clustering algorithm for my dataset (as said: many samples, well-separated clusters).
Here's the code:
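A minimal sketch of the described procedure (illustrative only; the function name, interface, and use of Euclidean distance are placeholders, not a definitive implementation):

```python
import numpy as np

def greedy_exemplar_clustering(X, threshold):
    # A sample becomes a new exemplar if it is farther than `threshold`
    # from every existing exemplar; otherwise it joins the cluster of its
    # nearest exemplar. Runtime O(n_samples * n_clusters), memory O(n_clusters).
    X = np.asarray(X, dtype=float)
    exemplars = []
    labels = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        if exemplars:
            dists = np.linalg.norm(np.asarray(exemplars) - x, axis=1)
            nearest = int(np.argmin(dists))
            if dists[nearest] <= threshold:
                labels[i] = nearest
                continue
        exemplars.append(x)          # x becomes a new exemplar
        labels[i] = len(exemplars) - 1
    return np.asarray(exemplars), labels

# e.g. exemplars, labels = greedy_exemplar_clustering(X, threshold=10.0)
```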
Optionally, exemplars (centroids) can be updated whenever desired by averaging the samples assigned to their cluster so far.
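For example (continuing the sketch above, with `X`, `exemplars`, and `labels` as produced there):

```python
import numpy as np

# Refresh each exemplar as the mean of the samples assigned to it so far.
centroids = np.array([X[labels == k].mean(axis=0)
                      for k in range(len(exemplars))])
```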
Describe alternatives you've considered, if relevant
I tried various clustering algorithms from sklearn for (very) large `n_samples`, but they were too memory-hungry and slow.

Additional context
As a bonus, the algorithm is applicable to online clustering.
Knowing the number of clusters is not necessary.
If a specific number of clusters is desired, this can be achieved as described below.
Centroids can be updated in the end (see above) and iteratively merged, if a cluster hierarchy (or a smaller number of clusters) is desired. (To get a larger number of clusters first, the threshold can be set low.)
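A sketch of that merging step (assuming the `centroids` and per-sample `labels` from the snippets above; using AgglomerativeClustering for the merge is my own choice for illustration):

```python
from sklearn.cluster import AgglomerativeClustering

# Merge the many fine-grained centroids down to a desired number of
# clusters; each sample inherits the merged label of its centroid.
merge = AgglomerativeClustering(n_clusters=10).fit(centroids)
merged_labels = merge.labels_[labels]
```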