MRG: Evidence Accumulation Clustering #1830

Open
wants to merge 55 commits

8 participants

@robertlayton
scikit-learn member

Evidence accumulation clustering (EAC), an ensemble-based clustering framework:
Fred, Ana L. N., and Anil K. Jain. "Data clustering using evidence accumulation." Proceedings of the 16th International Conference on Pattern Recognition (ICPR), Vol. 4, IEEE, 2002.

Basic overview of algorithm:

  1. Cluster the data many times using a clustering algorithm with randomly (within reason) selected parameters.
  2. Create a co-association matrix, which records the number of times each pair of instances were clustered together.
  3. Cluster this matrix.

This seems to work really well, like a kernel method, making the clustering "easier" than it was on the original dataset.

The defaults of the algorithm are set up to follow those used by Fred and Jain (2002): the clustering in step 1 is k-means with k selected randomly between 10 and 30, and the clustering in step 3 is the MST algorithm, which I have yet to implement (will do in this PR).
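As a concrete illustration, here is a minimal sketch of the three steps above, assuming scikit-learn's KMeans (the name eac_sketch and its signature are illustrative only, not this PR's API; the PR's default final clusterer is the MST algorithm, so the sketch stops at the derived distance matrix):

    import numpy as np
    from sklearn.cluster import KMeans

    def eac_sketch(X, n_runs=150, k_low=10, k_high=30, random_state=None):
        rng = np.random.RandomState(random_state)
        n_samples = X.shape[0]
        # Step 2's tally: C[i, j] counts how often samples i and j co-cluster
        C = np.zeros((n_samples, n_samples))
        for _ in range(n_runs):
            k = rng.randint(k_low, k_high + 1)
            labels = KMeans(n_clusters=k).fit(X).labels_   # step 1
            C += labels[:, None] == labels[None, :]        # step 2
        C /= n_runs
        # Step 3 clusters this matrix; 1 - C turns co-association into distance
        return 1.0 - C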

After initial feedback, I think people are happy with the API.

TODO:

  • MST algorithm from the paper, which was used as the final clusterer. Completed in PR #1991
  • A published improvement to the speed of the algorithm (I don't have the paper on hand) should be incorporated (will be done in a later PR)
  • Examples/Usage
  • Narrative documentation
  • Revert test_clustering, line 508, to only check for SpectralClustering
  • Use a sparse matrix for the co-association matrix
bob and others added some commits Feb 28, 2013
bob First draft of new mini-batch k-means 5d2cba0
bob Updates to documentation wording 7c72986
@robertlayton robertlayton Updated docs to clarify mini-batches f99a3d4
@robertlayton robertlayton Note to view the reference for an empirical result a94d3e6
bob Initial commit -- algorithm is mostly there, except for final clusterer.
The algorithm works, but isn't very fast or accurate.
Not fast because I haven't optimised, not accurate due to the poor final clusterer (I think)
ca4be3a
@robertlayton robertlayton final_clusterer no longer updated on training (wrong?) 7942708
@robertlayton robertlayton Fixed a bug, but performance is still not good enough, indicating another bug somewhere.

The common clustering test now tells you which clustering algorithm failed (if one does).
b21cbe5
@robertlayton robertlayton Changed the final clusterer to SpectralClustering to improve accuracy until I finish the MST algorithm.

This required a change to the test, which should be removed after the change.
d1728ea
@jaquesgrobler
scikit-learn member

I had a read through. Looks very interesting. The API makes sense to me so far 👍
Seems clear enough and isn't hard to follow.
Nice work :)

@satra
scikit-learn member

this looks rather interesting - i have two questions before reading the papers:

  • could this be used in general across any set of clusters/clustering algorithms?
  • could this be used in some ways to do online learning along these lines (http://arxiv.org/pdf/1209.0237v1.pdf)?
@satra
scikit-learn member

to clarify the "across any set of clusters" comment: currently the api is given a single X and many clustering algorithms. what if it were given a single algorithm but many Xs? In principle, it seems that should work as well.

@larsmans larsmans commented on an outdated diff Apr 19, 2013
sklearn/cluster/eac.py
+ -------
+ final_model: model (extends ClusterMixin)
+ The model given as `final_clusterer`, fitted with the evidence
+ accumulated through this process.
+
+ Notes
+ -----
+ See examples/plot_eac.py for an example.
+
+ References
+ ----------
+ Fred, Ana LN, and Anil K. Jain. "Data clustering using evidence
+ accumulation." Pattern Recognition, 2002. Proceedings. 16th International
+ Conference on. Vol. 4. IEEE, 2002.
+ """
+ X = np.asarray(X)
@larsmans
scikit-learn member
larsmans added a line comment Apr 19, 2013

If you're using k-means, then sparse matrix support can be added quite easily by doing atleast2d_or_csr here.
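(In other words, something like this sketch in place of the np.asarray call, using the sklearn.utils helper of the era:)

    from sklearn.utils import atleast2d_or_csr

    X = atleast2d_or_csr(X)  # accepts dense arrays or CSR sparse matrices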

@larsmans larsmans and 2 others commented on an outdated diff Apr 19, 2013
sklearn/cluster/eac.py
+ ----------
+ Fred, Ana LN, and Anil K. Jain. "Data clustering using evidence
+ accumulation." Pattern Recognition, 2002. Proceedings. 16th International
+ Conference on. Vol. 4. IEEE, 2002.
+ """
+ if 'n_clusters' in kmeans_args:
+ error_msg = "n_clusters cannot be assigned for the default clusterers."
+ raise ValueError(error_msg)
+ random_state = check_random_state(random_state)
+ num_iterations = 150
+ k_low, k_high = (10, 30)
+ if n_samples < k_high:
+ k_high = n_samples
+ k_low = min(k_low, int(k_high / 2))
+ k_values = random_state.randint(k_low, high=k_high, size=num_iterations)
+ return (KMeans(n_clusters=k, **kmeans_args) for k in k_values)
@larsmans
scikit-learn member
larsmans added a line comment Apr 19, 2013

It might be better to let the user pass an object to be used here, then clone that. That way, the user can set parameters (other than n_clusters) on the KMeans estimator. I'm not sure, though, since that means a non-k-means estimator can be passed...
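A sketch of that suggestion (the base_estimator parameter is hypothetical, not part of this PR):

    from sklearn.base import clone
    from sklearn.cluster import KMeans

    def _kmeans_random_k(n_samples, random_state, base_estimator=None):
        # Clone a user-supplied estimator and vary only n_clusters, so any
        # other parameters (init, n_init, ...) the user set are preserved.
        if base_estimator is None:
            base_estimator = KMeans()
        k_values = random_state.randint(10, high=30, size=150)
        return (clone(base_estimator).set_params(n_clusters=k)
                for k in k_values)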

@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a line comment Apr 19, 2013
@larsmans
scikit-learn member
larsmans added a line comment Apr 19, 2013

Sure, and a KMeans with default settings would do fine for that purpose.

@robertlayton
scikit-learn member
robertlayton added a line comment Apr 20, 2013

I was hoping to get around that due to the use of the initial_clusterers parameter in the calling function. If it is None, it uses this function, which gives the "default" initial clusterers as used in the reference.

The variation in k is also needed -- k varies between 10 and 30 (unless that doesn't make sense), forcing different clusterings in most cases.

@larsmans larsmans commented on an outdated diff Apr 19, 2013
sklearn/cluster/eac.py
+ Array of distances between samples, or a feature array.
+ The array is treated as a feature array unless the metric is given as
+ 'precomputed'.
+ initial_clusterers: iterable, or None
+ The clusterers used in the first step of the process. If an iterable is
+ given, then each one is called. If None is given (default), 150 runs of
+ k-means with k randomly selected between 10 and 30 are used.
+ final_clusterer: model (extends ClusterMixin), or None
+ The clusterer to apply to the final clustering matrix. The method must
+ be able to take a coassociation matrix as input, which is an array of
+ size [n_samples, n_samples].
+ If None, the default model is used, which is MST.
+ use_distance: boolean, or callable
+ If True, convert the coassociation matrix to distance using
+ `D=1./(C + 1)`. If callable, the function is called with the
+ coassication matrix as input. If False (default), then the matrix is
@larsmans
scikit-learn member
larsmans added a line comment Apr 19, 2013

typo: coassociation

@larsmans larsmans and 1 other commented on an outdated diff Apr 19, 2013
sklearn/cluster/eac.py
+ num_initial_clusterers = 0
+ for model in initial_clusterers:
+ num_initial_clusterers += 1
+ # Update random state
+ # Fit model to X
+ model.fit(X)
+ # Calculate new coassociation matrix and add that to the tally
+ C = update_coassociation_matrix(C, model.labels_)
+ C /= num_initial_clusterers
+ if use_distance:
+ if use_distance is True:
+ # Turn into a distance matrix
+ C = 1. - C
+ elif callable(use_distance): # If a callable
+ C = use_distance(C)
+ np.savetxt(open("/home/bob/test_eac_data.txt", 'w'), C, fmt='%.3f')
@larsmans
scikit-learn member
larsmans added a line comment Apr 19, 2013

You're not getting an account on my workstation :p

@robertlayton
scikit-learn member
robertlayton added a line comment Apr 20, 2013

hmmm, not sure how that got there. Thanks!

@larsmans larsmans and 1 other commented on an outdated diff Apr 19, 2013
sklearn/cluster/eac.py
@@ -0,0 +1,228 @@
+# -*- coding: utf-8 -*-
+"""
+EAC: Evidence Accumulation Clustering
+"""
+
+# Author: Robert Layton <robertlayton@gmail.com>
+#
+# License: BSD
@larsmans
scikit-learn member
larsmans added a line comment Apr 19, 2013

3-clause BSD!

@robertlayton
scikit-learn member
robertlayton added a line comment Apr 20, 2013

OK, but I copied this from another file. I'll do a grep and post results.

@robertlayton
scikit-learn member

@satra Thanks for your comments. You are right on the multiple X question -- I've used that myself, but (1) I can't think of a very clear way to do it and (2) it isn't the "base" algorithm. If you have an idea for solving (1) I'm happy to include it.

@everyone_else, thanks for your comments, I'll finish up the PR.

@satra
scikit-learn member

@robertlayton: how about having a function that takes C, X and clustering_algo and updates C and returns it? i believe you already have it inside the eac function.
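That is, something along these lines (the function name is illustrative):

    def update_evidence(C, X, clusterer):
        # Fit one clusterer on one X and fold its labels into the
        # running co-association tally C.
        labels = clusterer.fit(X).labels_
        C += labels[:, None] == labels[None, :]
        return C

Callers could then accumulate evidence over many Xs with a single algorithm, or over one X with many algorithms.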

@jnothman jnothman and 1 other commented on an outdated diff Apr 20, 2013
sklearn/cluster/eac.py
+ """
+ X = np.asarray(X)
+ n_samples = X.shape[0]
+ # If index order not given, create random order.
+ random_state = check_random_state(random_state)
+ # If initial_clusterers is None, it is k-means 150 times with randomly
+ # initialised k values (as per original paper).
+ if initial_clusterers is None:
+ initial_clusterers = _kmeans_random_k(n_samples, random_state)
+ # If the final_clusterer is None, create the default model
+ if final_clusterer is None:
+ final_clusterer = create_default_final_clusterer(random_state)
+ # Co-association matrix, originally zeros everywhere
+ C = np.zeros((n_samples, n_samples), dtype='float')
+ num_initial_clusterers = 0
+ for model in initial_clusterers:
@jnothman
scikit-learn member
jnothman added a line comment Apr 20, 2013

Is it worth doing this fitting in parallel?

@robertlayton
scikit-learn member
robertlayton added a line comment Apr 30, 2013

Definitely. I was going with "get it right, then optimise".

@jnothman jnothman and 1 other commented on an outdated diff Apr 20, 2013
sklearn/cluster/eac.py
+ C /= num_initial_clusterers
+ if use_distance:
+ if use_distance is True:
+ # Turn into a distance matrix
+ C = 1. - C
+ elif callable(use_distance): # If a callable
+ C = use_distance(C)
+ np.savetxt(open("/home/bob/test_eac_data.txt", 'w'), C, fmt='%.3f')
+ final_clusterer.fit(C)
+ return final_clusterer
+
+
+def update_coassociation_matrix(C, labels):
+ """Updates a co-association matrix from an array of labels.
+ """
+ labels = np.asarray(labels)
@jnothman
scikit-learn member
jnothman added a line comment Apr 20, 2013

A vectorised implementation (though perhaps not the simplest):

    C += np.repeat([labels], labels.size, axis=0).T == labels

(The upper triangle of this result should be the same as what you calculate.)

@jnothman
scikit-learn member
jnothman added a line comment Apr 20, 2013

The similarity of this algorithm to a parameter search makes me wonder whether coassociation should be found in sklearn.metrics.

@jnothman
scikit-learn member
jnothman added a line comment Apr 20, 2013

Another vectorised implementation that works with an intermediate boolean array of n_clusters x n_samples rather than n_samples x n_samples:

    cluster_assignments = np.repeat([np.arange(np.max(labels) + 1)], labels.size, axis=0).T == labels
    C += cluster_assignments[labels]

This has the benefit of allowing you to first convert cluster_assignments to a sparse matrix, which may be very worthwhile for updating C. But this is all premature optimisation on my part. An even more optimised version would use this trick: http://stackoverflow.com/questions/5564098/repeat-numpy-array-without-replicating-data
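A sketch of the sparse variant mentioned there (the function name is illustrative):

    import numpy as np
    from scipy import sparse

    def coassociation_update(labels):
        n = labels.size
        # n_clusters x n_samples indicator matrix, stored sparse
        A = sparse.csr_matrix((np.ones(n), (labels, np.arange(n))))
        # entry (i, j) is nonzero exactly when labels[i] == labels[j]
        return A.T * A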

@robertlayton
scikit-learn member
robertlayton added a line comment Apr 30, 2013

Great, thanks!

@amueller
scikit-learn member

Question without reading paper or code: is the MST clustering just single-link agglomerative?
If so, can code be reused / refactored from Ward?

Sounds like a very interesting algorithm btw :)
Are there any links to Buhmann's work?

@GaelVaroquaux
scikit-learn member
@robertlayton
scikit-learn member

For calculating the minimum spanning tree, I see three options:

  1. This version in scipy here, but it requires v0.11, which is higher than scikit-learn's current dependency. (Currently it is 0.7, which is probably a little low, but there hasn't been a need for anything higher so far, I believe.)
  2. Use @GaelVaroquaux's code from here
  3. Use @amueller's code from here

Thoughts on the best option? I'd rather not reimplement it myself -- it's tricky to optimise properly, and pointless if others have already solved this problem.
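For reference, the scipy >= 0.11 API from option 1 looks like this (a toy usage sketch on a small dense distance matrix):

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree

    D = np.array([[0., 2., 9.],
                  [2., 0., 3.],
                  [9., 3., 0.]])
    mst = minimum_spanning_tree(D)  # sparse matrix of the MST edge weights
    print(mst.toarray())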

@amueller
scikit-learn member

I haven't looked at @GaelVaroquaux's, but I'd vote for backporting scipy's.

@amueller
scikit-learn member

Do you only need the euclidean case? For high-dims, that shouldn't really make a (practical) difference, though...

@robertlayton
scikit-learn member

I think it would be better to use the "proper" method, even if the euclidean case works practically.

@robertlayton
scikit-learn member

What would be the process of backporting from scipy? Any examples I could use?

robertlayton added some commits Jun 5, 2013
@robertlayton robertlayton Merge branch 'master' of git://github.com/scikit-learn/scikit-learn into eac 61908bf
@robertlayton robertlayton In broken state: Most of the algorithm is there and working, but the tests are not running yet d76c243
@robertlayton robertlayton Trying new csgraph 439bae0
@robertlayton robertlayton Using scipy's connected_components for now, which will be backported next fb12eac
@robertlayton robertlayton Merge branch 'master' of git://github.com/scikit-learn/scikit-learn into eac 18bd327
@robertlayton robertlayton MST algorithm, in broken state until I get the imports working (next commit) bd6abff
@robertlayton robertlayton EAC now works, with test running as well 4e12586
@robertlayton robertlayton Merge branch 'master' of https://github.com/scikit-learn/scikit-learn into eac

Conflicts:
	sklearn/cluster/__init__.py
62a3a3b
@robertlayton robertlayton random_state now used for k-means, makes the algorithm more stable 9fc6689
@robertlayton robertlayton New form of spanning tree, should be working now.
Not sure why the results are so low though (only just gets over the 0.4 limit!)
82be0fa
@robertlayton robertlayton Added an example. Doesn't look right yet, but it works 5c04e9d
@robertlayton robertlayton Different parameter values to ensure consistent testing 9696294
@robertlayton robertlayton Example showing the problem/solution much better. Still a long way to go with it.

Update to documentation too, unfinished, but nearly there.
bad4508
@robertlayton robertlayton Doc draft is finished, as is the example. It doesn't yet really drive home the point I'm trying to make (that the intuition behind the parameter in EAC is easier to understand than MST's).
d9752b1
@robertlayton robertlayton Discussion about the intuitive nature of the parameter cba494f
@robertlayton robertlayton Reduced the number of subplots in example, much clearer now. Needs descriptions
35a13a6
@robertlayton robertlayton Example finished. 9e4017b
@robertlayton robertlayton Added documentation for MSTCluster (example to come) and what's new.
MSTCluster can also now take a feature matrix, but this is untested (that's next).
574ef93
@robertlayton robertlayton Forgot to set self.metric. It works now. 4d37eec
@robertlayton robertlayton Added example for MST clustering 5b69547
@robertlayton robertlayton Tests require that X not be a precomputed distance matrix. Have adjusted
the defaults and the docs.

In addition, import clustering algorithms only when needed for EAC.
2afa378
@robertlayton robertlayton Adding +1 to the creation of the co-association matrix is needed.
Because it's sparse, without the +1 the zero distances are removed, which leads to poor spanning trees.
a6394a9
@robertlayton robertlayton Merge branch 'master' of https://github.com/scikit-learn/scikit-learn into eac

Conflicts:
	doc/whats_new.rst
7175fd7
@robertlayton robertlayton Fixed 5-month-old typo 192b9fa
@ogrisel ogrisel commented on an outdated diff Sep 3, 2013
doc/modules/clustering.rst
@@ -571,6 +571,123 @@ by black points below.
In Proceedings of the 2nd International Conference on Knowledge Discovery
and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996
+
+
+Minimum Spanning Tree Clustering
+================================
+The :class:`MSTCluster` algorithm forms a minimum spann tree over the data, and
@ogrisel
scikit-learn member
ogrisel added a line comment Sep 3, 2013

spann => spanning?

@ogrisel
scikit-learn member

I am not a big fan of acronyms as class names. What do people think about renaming the EAC class to EvidenceAccumulationClustering?

The same remark holds for the MSTCluster. Maybe we could just call it MinimumSpanningTree?

@ogrisel
scikit-learn member

@GaelVaroquaux how does the MSTCluster class overlap with your work on single linkage clustering? Do you also use the minimum spanning tree implementation of scipy when you don't have connectivity constraints?

@ogrisel ogrisel commented on an outdated diff Sep 3, 2013
examples/cluster/plot_eac.py
+ax.scatter(X[:, 0], X[:,1], color=colors[y_pred].tolist(), s=10)
+ax.set_title("EAC")
+
+# Plot distribution of scores (from main_metric)
+ax = fig.add_subplot(3, 1, 3)
+# k-means
+ax.plot(k_values, km_ami_means)
+ax.errorbar(k_values, km_ami_means, yerr=km_ami_std, fmt='ro', label='k-means')
+
+# MST
+ax.plot(mst_values, mst_ami_means)
+ax.errorbar(mst_values, mst_ami_means, fmt='g*', label='MST')
+score = main_metric(y_true, y_pred)
+ax.scatter([n_clusters_,], [score,], label='EAC', s=40)
+ax.legend()
+ax.set_title("V-measure comparison")
@ogrisel
scikit-learn member
ogrisel added a line comment Sep 3, 2013

I think it's better to use individual plot figures rather than several axes in the same figure. Here is an example that uses multiple plots and how it's rendered on the sphinx website: http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html

@ogrisel ogrisel and 1 other commented on an outdated diff Sep 3, 2013
sklearn/cluster/eac.py
+ The array is treated as a feature array unless the metric is given as
+ 'precomputed'.
+ initial_clusterers: iterable, or None
+ The clusterers used in the first step of the process. If an iterable is
+ given, then each one is called. If None is given (default), 150 runs of
+ k-means with k randomly selected between 10 and 30 are used.
+ final_clusterer: model (extends ClusterMixin), or None
+ The clusterer to apply to the final clustering matrix. The method must
+ be able to take a coassociation matrix as input, which is an array of
+ size [n_samples, n_samples].
+ If None, the default model is used, which is MST.
+ use_distance: boolean, or callable
+ If True, convert the coassociation matrix to distance using
+ `D=1./(C + 1)`. If callable, the function is called with the
+ coassociation matrix as input. If False (default), then the matrix is
+ given as input to the `final_clusterer`.
@ogrisel
scikit-learn member
ogrisel added a line comment Sep 3, 2013

This parameter list does not match the constructor params of the class.

@robertlayton
scikit-learn member
robertlayton added a line comment Sep 5, 2013

Ah, forgot about that.

@ogrisel ogrisel commented on an outdated diff Sep 3, 2013
sklearn/cluster/eac.py
+ 'precomputed'.
+ initial_clusterers: iterable, or None
+ The clusterers used in the first step of the process. If an iterable is
+ given, then each one is called. If None is given (default), 150 runs of
+ k-means with k randomly selected between 10 and 30 are used.
+ final_clusterer: model (extends ClusterMixin), or None
+ The clusterer to apply to the final clustering matrix. The method must
+ be able to take a coassociation matrix as input, which is an array of
+ size [n_samples, n_samples].
+ If None, the default model is used, which is MST.
+ use_distance: boolean, or callable
+ If True, convert the coassociation matrix to distance using
+ `D=1./(C + 1)`. If callable, the function is called with the
+ coassociation matrix as input. If False (default), then the matrix is
+ given as input to the `final_clusterer`.
+ random_state: numpy.RandomState, optional
@ogrisel
scikit-learn member
ogrisel added a line comment Sep 3, 2013

or int

@GaelVaroquaux
scikit-learn member
@ogrisel
scikit-learn member

So maybe the MSTCluster class could be renamed SingleLinkageClustering. WDYT?

@robertlayton
scikit-learn member

I don't mind the change to EvidenceAccumulationClustering, not sure which way to go on MSTCluster though. Thoughts?

@robertlayton
scikit-learn member

All issues fixed, except for renaming MSTCluster.

Which option did you think was best?

@ogrisel
scikit-learn member

I think I still like SingleLinkageClustering better. But please ask on the ML. Also, this PR needs to be rebased on master.

@ogrisel ogrisel commented on an outdated diff Sep 6, 2013
doc/modules/classes.rst
@@ -29,9 +29,11 @@ Classes
cluster.AffinityPropagation
cluster.DBSCAN
+ cluster.EAC
@ogrisel
scikit-learn member
ogrisel added a line comment Sep 6, 2013

This needs an update.

@jakevdp
scikit-learn member

MST is also known as "Friends of Friends" clustering in some circles (notably astrophysical N-body simulation groups). That might also be a naming option, though I'm not sure how widespread the usage is. In astroML, I dubbed a similar estimator HierarchicalClustering, which may be a clearer name.

@robertlayton
scikit-learn member
@larsmans larsmans and 1 other commented on an outdated diff Sep 7, 2013
doc/modules/clustering.rst
@@ -571,6 +571,134 @@ by black points below.
In Proceedings of the 2nd International Conference on Knowledge Discovery
and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996
+
+
+Minimum Spanning Tree Clustering
+================================
+The :class:`MSTCluster` algorithm forms a minimum spanning tree over the data,
+and then cuts any edges that have a weight higher than some predefined
+threshold. This separates the data into connected components, each representing
+a separate git cocluster.
@larsmans
scikit-learn member
larsmans added a line comment Sep 7, 2013

git cocluster?

@robertlayton
scikit-learn member
robertlayton added a line comment Sep 9, 2013
@robertlayton
scikit-learn member

OK, I think I've addressed all of the issues, except for merging with upstream.

I'll leave it here if anyone wants to review, otherwise I'll self review early next week and update this comment.

@larsmans larsmans commented on the diff Sep 15, 2013
sklearn/utils/__init__.py
@@ -14,7 +14,8 @@
atleast2d_or_csr, warn_if_not_float,
check_random_state, column_or_1d)
from .class_weight import compute_class_weight
-from sklearn.utils.sparsetools import minimum_spanning_tree
+from sklearn.utils.sparsetools import (minimum_spanning_tree,
+ connected_components)
@larsmans
scikit-learn member
larsmans added a line comment Sep 15, 2013

Where is connected_components being used?

@robertlayton
scikit-learn member
robertlayton added a line comment Sep 18, 2013

spectral_embedding.py uses it, but imports it directly from .sparsetools. Same in a few other places. That said, single_linkage uses it.

@larsmans
scikit-learn member
larsmans added a line comment Sep 18, 2013

Oh, right, this is an __init__. Sorry, hadn't seen that.

@robertlayton
scikit-learn member

OK, I've reviewed this PR, I believe it's ready to go.

@larsmans larsmans commented on the diff Sep 18, 2013
sklearn/cluster/single_linkage.py
+ -----
+ See examples/plot_single_linkage.py for a visualisation of the threshold
+ cutting algorithm on a small dataset.
+
+ See examples/plot_eac.py for an example. The Evidence Accumulation
+ Clustering (EAC) algorithm uses this clusterer in its final clustering
+ step.
+
+ References
+ ----------
+ Fred, Ana LN, and Anil K. Jain. "Data clustering using evidence
+ accumulation." Pattern Recognition, 2002. Proceedings. 16th International
+ Conference on. Vol. 4. IEEE, 2002.
+ """
+ X = atleast2d_or_csr(X)
+ X = pairwise_distances(X, metric=metric)
@larsmans
scikit-learn member
larsmans added a line comment Sep 18, 2013

This takes O(n²) time. With an explicit, sparse similarity matrix instead of distances, it can be done in O(E lg E) time, where E is the number of edges of the similarity graph, i.e. the non-zero similarities. In the optimal algorithm, you don't even need to build the spanning tree. By promising to return the spanning tree and working with metrics instead of similarities, we're tying ourselves down to a suboptimal algorithm.

@jakevdp
scikit-learn member
jakevdp added a line comment Sep 19, 2013

One easy (but admittedly ad-hoc) trick is to instead use an approximate minimum spanning tree by building a graph only over the $k$ nearest neighbors. For most datasets, given a reasonable choice of $k$, the result is likely to be the same as in the exact case.
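A sketch of that trick using existing scikit-learn/scipy primitives (n_neighbors=20 follows the 15-20 figure reported later in this thread):

    from scipy.sparse.csgraph import minimum_spanning_tree
    from sklearn.neighbors import kneighbors_graph

    # Span the k-nearest-neighbor graph instead of the full distance matrix;
    # the result is exact whenever every true MST edge links k-neighbors.
    G = kneighbors_graph(X, n_neighbors=20, mode='distance')
    mst = minimum_spanning_tree(G)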

@ogrisel
scikit-learn member
ogrisel added a line comment Sep 19, 2013
@jakevdp
scikit-learn member
jakevdp added a line comment Sep 19, 2013

I think the best solution would be to build the graph over the edges given by scipy's QHULL wrapper. Then you get a true MST in O(N log N). From my recollection, though, the qhull wrapper in scipy doesn't provide direct access to the edges...

@robertlayton
scikit-learn member
robertlayton added a line comment Sep 20, 2013
@larsmans
scikit-learn member

To be sure, a faster algorithm to do this (in pure Python) is:

import heapq

def flat_single_linkage(X):            # X is a coo_matrix of similarities
    # negate so heapq's min-heap pops the highest similarity first
    heap = zip(-X.data, X.col, X.row)  # linear time
    heapq.heapify(heap)                # linear time

    # Kruskal's algorithm with early stopping
    disjsets = DisjointSetForest(X.shape[0])     # linear time
    while heap:
        neg_sim, i, j = heapq.heappop(heap)
        if -neg_sim < threshold:       # undo the negation before comparing
            break
        if disjsets.find(i) != disjsets.find(j):  # O(α(n)) time
            disjsets.union(i, j)                  # O(α(n)) time

    return [disjsets.find(i) for i in xrange(X.shape[0])]

A further optimized version would look only at the bottom half of X and skip the find calls in the loop, because they'll be re-done by union anyway. This algorithm can be easily modified to put a maximum on the number of clusters.

@robertlayton
scikit-learn member

Where is DisjointSetForest from?

@larsmans
scikit-learn member

It's not in the stdlib, but it's a standard data structure for handling connected components. I have an implementation in C++ that is easily translated to Cython. (The heap operations will take more work).
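For the curious, a minimal pure-Python version of that structure (path compression plus union by rank; a sketch, not the C++ implementation referred to above):

    class DisjointSetForest(object):
        def __init__(self, n):
            self.parent = list(range(n))
            self.rank = [0] * n

        def find(self, i):
            # walk to the root, halving the path as we go
            while self.parent[i] != i:
                self.parent[i] = self.parent[self.parent[i]]
                i = self.parent[i]
            return i

        def union(self, i, j):
            # attach the shallower root under the deeper one
            ri, rj = self.find(i), self.find(j)
            if ri == rj:
                return
            if self.rank[ri] < self.rank[rj]:
                ri, rj = rj, ri
            self.parent[rj] = ri
            if self.rank[ri] == self.rank[rj]:
                self.rank[ri] += 1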

@jakevdp
scikit-learn member

This all looks really excellent -- really nice work Robert. Regarding the above discussion, I'd be in favor of addressing the order N^2 issue before merging: that way it's more likely to be addressed! It sounds like there are three main avenues:

  1. @amueller's route of using a k-neighbors-based approximate MST
  2. @larsmans's suggestion of avoiding the MST phase through the specialized data structure
  3. the QHULL route of using edges of a tessellation to quickly compute a true MST

I'm not sure which is the best at this point. (1) is certainly the easiest, (3) is IMHO the best option (though I don't know whether it's currently possible) and (2) seems like it would require a fair bit of work to implement. Any thoughts?

@robertlayton
scikit-learn member
@larsmans
scikit-learn member

If (1) is easily implementable, then I say we go that route. (2) isn't terribly hard to code up (I have a ready pure Python implementation that is scalable but slow), but it needs a different API to be really efficient, with similarities/affinities rather than distances. We could do (1) now, add (3) as an option, and (2) as a separate estimator, so I'd say go ahead with (1) now.

(The problem, I found out, has yet a third name: max-spacing clustering.)

@jakevdp
scikit-learn member

(1) is very easy. I believe all it takes is to replace the pairwise_distances line with a suitable kneighbors_graph call. I took a similar approach in astroML: https://github.com/astroML/astroML/blob/master/astroML/clustering/mst_clustering.py#L78

Also need to add an n_neighbors parameter to the class constructor.

@robertlayton
scikit-learn member
@jakevdp
scikit-learn member

I did some experiments on a variety of datasets and found that 15-20 neighbors is usually sufficient to recover the true minimum spanning tree.

@robertlayton
scikit-learn member

OK, I just took a stab at (1). I'm not sure if I'm missing something obvious, but I can't do a nearest-neighbour search on a sparse distance matrix (i.e. n_samples by n_samples). Thoughts?

e.g. This won't work (after I put 'precomputed' in the list of valid metrics):

    X = atleast2d_or_csr(X)
    X = pairwise_distances(X, metric=metric)
    clf = NearestNeighbors(n_neighbors, algorithm='brute', metric='precomputed')
    X = clf.fit(X).kneighbors_graph(X._fit_X, n_neighbors, mode='distance')

Trying to fix this leads me down a path of updating nearest neighbours and so on (argsort is used in neighbors/base.py line 301, which is invalid for a sparse matrix).

Am I missing something obvious?

@jakevdp
scikit-learn member

Hi,
I think something like this should work for either sparse or dense input. For dense input, though, it would be better not to use brute force.

    from sklearn.neighbors import NearestNeighbors
    from sklearn.utils import atleast2d_or_csr

    X = atleast2d_or_csr(X)
    clf = NearestNeighbors(5, algorithm='brute')
    G = clf.fit(X).kneighbors_graph(X, mode='distance')
@robertlayton
scikit-learn member

That will work if X is n_samples by n_features, and will even work in the case of n_samples by n_samples (i.e. a distance matrix) when n_samples is small. The problem is that nearest neighbours will interpret this as n_samples by n_features, and that could be infeasible for large n_samples.

I need to accept n_samples by n_samples, as that is what is returned by the evidence_accumulation_clustering algorithm.

@jakevdp
scikit-learn member

With a dense n_samples x n_samples distance matrix, you could compute the graph with something like this:

    # X is n_samples x n_samples, k is the number of neighbors
    ranks = np.argsort(np.argsort(X, axis=1), axis=1)  # rank of each entry within its row
    G = np.where(ranks <= k, X, 0)  # keep each row's k nearest (rank 0 is the self-distance)

There might be a clever way to do this in the sparse case as well

@robertlayton
scikit-learn member