MRG: Evidence Accumulation Clustering #1830

Open
wants to merge 55 commits

9 participants

@robertlayton
Owner

Evidence accumulation clustering (EAC), an ensemble-based clustering framework:
Fred, Ana LN, and Anil K. Jain. "Data clustering using evidence
accumulation." Pattern Recognition, 2002. Proceedings. 16th International
Conference on. Vol. 4. IEEE, 2002.

Basic overview of algorithm:

  1. Cluster the data many times using a clustering algorithm with randomly (within reason) selected parameters.
  2. Create a co-association matrix, which records the number of times each pair of instances were clustered together.
  3. Cluster this matrix.

This seems to work really well, like a kernel method, making the clustering "easier" than it was on the original dataset.

The defaults of the algorithm are set up to follow the defaults used by Fred and Jain (2002), whereby the clustering in step 1 is k-means with k selected randomly between 10 and 30. The clustering in step 3 is the MST algorithm, which I have yet to implement (will do in this PR).
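
A rough, self-contained sketch of those three steps (illustrative only, not the API proposed in this PR; SpectralClustering stands in for the not-yet-implemented MST step, and the number of k-means runs is reduced for speed):

import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05)
n_samples = X.shape[0]
rng = np.random.RandomState(0)

# Steps 1 and 2: many k-means runs with random k, tallied into a
# co-association matrix C (C[i, j] = fraction of runs clustering i with j).
C = np.zeros((n_samples, n_samples))
n_runs = 50
for k in rng.randint(10, 31, size=n_runs):
    labels = KMeans(n_clusters=k, n_init=1, random_state=rng).fit(X).labels_
    C += labels[:, None] == labels[None, :]
C /= n_runs

# Step 3: cluster the co-association matrix, here treated as an affinity.
y_pred = SpectralClustering(n_clusters=2, affinity='precomputed').fit(C).labels_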

After initial feedback, I think people are happy with the API.

TODO:

  • MST algorithm from the paper, which was used as the final clusterer. Completed in PR #1991
  • A published improvement to the speed of the algorithm (I don't have the paper on hand) should be incorporated (will be done in a later PR)
  • Examples/Usage
  • Narrative documentation
  • Revert test_clustering, line 508, to only check for SpectralClustering
  • Use a sparse matrix for the co-association matrix
bob and others added some commits
bob First draft of new mini-batch k-means 5d2cba0
bob Updates to documentation wording 7c72986
@robertlayton robertlayton Updated docs to clarrify mini-batches f99a3d4
@robertlayton robertlayton Note to view the reference for an empircal result a94d3e6
bob Initial commit -- algorithm is mostly there, except for final clusterer.
The algorithm works, but isn't very fast or accurate.
Not fast because I haven't optimised, not accurate due to the poor final clusterer (I think)
ca4be3a
@robertlayton robertlayton final_clusterer no longer updated on training (wrong?) 7942708
@robertlayton robertlayton Fixed a bug, but performance is still not good enough, indicating ano…
…ther bug somewhere.

The common clustering test now tells you which clustering algorithm failed (if one does).
b21cbe5
@robertlayton robertlayton Changed the final clusterer to SpectralClustering to improve accuracy…
… until I finish the MST algorithm.

This required a changed to the test, which should be removed after the change.X
d1728ea
@jaquesgrobler

I had a read through. Looks very interesting. The API makes sense to me so far :+1:
Seems clear enough and isn't hard to follow.
Nice work :)

@satra
Owner

this looks rather interesting - i have two questions before reading the papers:

  • could this be used in general across any set of clusters/clustering algorithms?
  • could this be used in some ways to do online learning along these lines (http://arxiv.org/pdf/1209.0237v1.pdf)?
@satra
Owner

to clarify the across any set of clusters comment. currently the api is given a single X and many clustering algorithms. what if it was given a single algorithm but many Xs. In principle, it seems that should work as well.

sklearn/cluster/eac.py
((47 lines not shown))
+ -------
+ final_model: model (extends ClusterMixin)
+ The model given as `final_clusterer`, fitted with the evidence
+ accumulated through this process.
+
+ Notes
+ -----
+ See examples/plot_eac.py for an example.
+
+ References
+ ----------
+ Fred, Ana LN, and Anil K. Jain. "Data clustering using evidence
+ accumulation." Pattern Recognition, 2002. Proceedings. 16th International
+ Conference on. Vol. 4. IEEE, 2002.
+ """
+ X = np.asarray(X)
@larsmans Owner

If you're using k-means, then sparse matrix support can be added quite easily by doing atleast2d_or_csr here.

sklearn/cluster/eac.py
((129 lines not shown))
+ ----------
+ Fred, Ana LN, and Anil K. Jain. "Data clustering using evidence
+ accumulation." Pattern Recognition, 2002. Proceedings. 16th International
+ Conference on. Vol. 4. IEEE, 2002.
+ """
+ if 'n_clusters' in kmeans_args:
+ error_msg = "n_clusters cannot be assigned for the default clusterers."
+ raise ValueError(error_msg)
+ random_state = check_random_state(random_state)
+ num_iterations = 150
+ k_low, k_high = (10, 30)
+ if n_samples < k_high:
+ k_high = n_samples
+ k_low = min(k_low, int(k_high / 2))
+ k_values = random_state.randint(k_low, high=k_high, size=num_iterations)
+ return (KMeans(n_clusters=k, **kmeans_args) for k in k_values)
@larsmans Owner

It might be better to let the user pass an object to be used here, then clone that. That way, the user can set parameters (other than n_clusters) on the KMeans estimator. I'm not sure, though, since that means a non-k-means estimator can be passed...

@GaelVaroquaux Owner
@larsmans Owner

Sure, and a KMeans with default settings would do fine for that purpose.

@robertlayton Owner

I was hoping to get around that due to the use of the initial_clusterers parameter in the calling function. If it is None, it uses this function, which gives the "default" initial clusters as used in the reference.

The variation in k is also needed -- k varies between 10 and 30 (unless that doesn't make sense), forcing different clusters in most cases.
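
A minimal sketch of the clone-based approach being discussed (illustrative only; the names base_estimator and _random_k_clusterers are hypothetical and not part of this PR). The user-supplied template is cloned and only n_clusters is overridden, so other parameters are preserved and k still varies per run:

import numpy as np
from sklearn.base import clone
from sklearn.cluster import KMeans

def _random_k_clusterers(base_estimator, random_state, n_runs=150,
                         k_low=10, k_high=30):
    for k in random_state.randint(k_low, k_high + 1, size=n_runs):
        est = clone(base_estimator)     # copy of the user's template estimator
        est.set_params(n_clusters=k)    # only n_clusters is overridden
        yield est

# usage: list(_random_k_clusterers(KMeans(n_init=1), np.random.RandomState(42)))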

sklearn/cluster/eac.py
((167 lines not shown))
+ Array of distances between samples, or a feature array.
+ The array is treated as a feature array unless the metric is given as
+ 'precomputed'.
+ initial_clusterers: iterable, or None
+ The clusterers used in the first step of the process. If an iterable is
+ given, then each one is called. If None is given (default), 150 runs of
+ k-means with k randomly selected between 10 and 30 are used.
+ final_clusterer: model (extends ClusterMixin), or None
+ The clusterer to apply to the final clustering matrix. The method must
+ be able to take a coassociation matrix as input, which is an array of
+ size [n_samples, n_samples].
+ If None, the default model is used, which is MST.
+ use_distance: boolean, or callable
+ If True, convert the coassociation matrix to distance using
+ `D=1./(C + 1)`. If callable, the function is called with the
+ coassication matrix as input. If False (default), then the matrix is
@larsmans Owner

typo: coassociation

sklearn/cluster/eac.py
((75 lines not shown))
+ num_initial_clusterers = 0
+ for model in initial_clusterers:
+ num_initial_clusterers += 1
+ # Update random state
+ # Fit model to X
+ model.fit(X)
+ # Calculate new coassociation matrix and add that to the tally
+ C = update_coassociation_matrix(C, model.labels_)
+ C /= num_initial_clusterers
+ if use_distance:
+ if use_distance is True:
+ # Turn into a distance matrix
+ C = 1. - C
+ elif callable(use_distance): # If a callable
+ C = use_distance(C)
+ np.savetxt(open("/home/bob/test_eac_data.txt", 'w'), C, fmt='%.3f')
@larsmans Owner

You're not getting an account on my workstation :p

@robertlayton Owner

hmmm, not sure how that got there. Thanks!

sklearn/cluster/eac.py
@@ -0,0 +1,228 @@
+# -*- coding: utf-8 -*-
+"""
+EAC: Evidence Accumulation Clustering
+"""
+
+# Author: Robert Layton <robertlayton@gmail.com>
+#
+# License: BSD
@larsmans Owner

3-clause BSD!

@robertlayton Owner

OK, but I copied this from another file. I'll do a grep and post results.

@robertlayton

@satra Thanks for your comments. You are right on the multiple X question -- I've used that myself, but (1) I can't think of a very clear way to do it and (2) it isn't the "base" algorithm. If you have an idea for solving (1) I'm happy to include it.

@everyone_else, thanks for your comments, I'll finish up the PR.

@satra
Owner

@robertlayton: how about having a function that takes C, X and clustering_algo and updates C and returns it? i believe you already have it inside the eac function.
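
A sketch of that suggestion (names are illustrative, not from this PR): a helper that fits one clustering estimator on one X and folds the result into the co-association matrix C, so it could be called repeatedly, for example once per feature representation of the same samples:

import numpy as np

def accumulate_evidence(C, X, clustering_algo):
    # Fit one clusterer on one view of the samples and add its
    # co-clustered pairs to the running co-association matrix.
    labels = clustering_algo.fit(X).labels_
    C += labels[:, None] == labels[None, :]
    return C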

sklearn/cluster/eac.py
((61 lines not shown))
+ """
+ X = np.asarray(X)
+ n_samples = X.shape[0]
+ # If index order not given, create random order.
+ random_state = check_random_state(random_state)
+ # If initial_clusterers is None, it is k-means 150 times with randomly
+ # initialised k values (as per original paper).
+ if initial_clusterers is None:
+ initial_clusterers = _kmeans_random_k(n_samples, random_state)
+ # If the final_clusterer is None, create the default model
+ if final_clusterer is None:
+ final_clusterer = create_default_final_clusterer(random_state)
+ # Co-association matrix, originally zeros everywhere
+ C = np.zeros((n_samples, n_samples), dtype='float')
+ num_initial_clusterers = 0
+ for model in initial_clusterers:
@jnothman Owner

Is it worth doing this fitting in parallel?

@robertlayton Owner

Definitely. I was going with "get it right, then optimise".
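
A rough sketch of what a joblib-parallel version of the initial fits could look like (illustrative only, not part of the PR; it assumes each initial clusterer exposes fit() and labels_ as in eac.py):

import numpy as np
from sklearn.externals.joblib import Parallel, delayed  # or: from joblib import Parallel, delayed

def _fit_labels(model, X):
    return model.fit(X).labels_

def accumulate_parallel(X, initial_clusterers, n_jobs=-1):
    # Run the initial fits in parallel, then tally co-associations serially.
    all_labels = Parallel(n_jobs=n_jobs)(
        delayed(_fit_labels)(model, X) for model in initial_clusterers)
    n_samples = X.shape[0]
    C = np.zeros((n_samples, n_samples))
    for labels in all_labels:
        C += labels[:, None] == labels[None, :]
    return C / len(all_labels)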

sklearn/cluster/eac.py
((83 lines not shown))
+ C /= num_initial_clusterers
+ if use_distance:
+ if use_distance is True:
+ # Turn into a distance matrix
+ C = 1. - C
+ elif callable(use_distance): # If a callable
+ C = use_distance(C)
+ np.savetxt(open("/home/bob/test_eac_data.txt", 'w'), C, fmt='%.3f')
+ final_clusterer.fit(C)
+ return final_clusterer
+
+
+def update_coassociation_matrix(C, labels):
+ """Updates a co-association matrix from an array of labels.
+ """
+ labels = np.asarray(labels)
@jnothman Owner

A vectorised implementation (though perhaps not the simplest):
C += np.repeat([labels], labels.size, axis=0).T == labels
(this result .triu() should be the same as what you calculate.)

@jnothman Owner

The similarity of this algorithm to a parameter search makes me wonder whether coassociation should be found in sklearn.metrics.

@jnothman Owner

[Another vectorised implementation that works with an intermediate boolean array of n_clusters x n_samples rather than n_samples x n_samples:

cluster_assignments = np.repeat([np.arange(np.max(labels) + 1)], labels.size, axis=0).T == labels
C += cluster_assignments[labels]

This has the benefit of allowing you to first convert cluster_assignments to a sparse matrix which may be very worthwhile for C updating. But this is all premature optimisation on my part. Even more optimised would use this trick: http://stackoverflow.com/questions/5564098/repeat-numpy-array-without-replicating-data]

@robertlayton Owner

Great, thanks!
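
For reference, a runnable (dense) form of the vectorised update suggested above; the indicator matrix could be converted to a sparse matrix as noted:

import numpy as np

def update_coassociation_matrix(C, labels):
    labels = np.asarray(labels)
    # indicator[c, j] is True when sample j belongs to cluster c
    indicator = np.arange(labels.max() + 1)[:, None] == labels
    # row i of indicator[labels] marks the samples co-clustered with sample i
    C += indicator[labels]
    return C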

@amueller
Owner

Question without reading paper or code: is the MST clustering just single-link agglomerative?
If so, can code be reused / refactored from WARD?

Sounds like a very interesting algorithm btw :)
Are there any links to Buhmann's work?

@GaelVaroquaux
@robertlayton

For calculating the minimum spanning tree, I see three options:

  1. This version in scipy here, but it requires v0.11, which is higher than scikit-learn's current dependency. (Currently it is 0.7, which is probably a little low, but there hasn't been a need for higher so far I believe.)
  2. Use @GaelVaroquaux's code from here
  3. Use @amueller's code from here

Thoughts on the best option? I'd rather not reimplement it myself -- it's tricky to optimise properly and pointless if others have already solved this problem.

@amueller
Owner

I haven't looked at @GaelVaroquaux's, but I'd vote for backporting scipy.

@amueller
Owner

Do you only need the euclidean case? For high-dims, that shouldn't really make a (practical) difference, though...

@robertlayton

I think it would be better to use the "proper" method, even if the euclidean case works practically.

@robertlayton

What would be the process of backporting from scipy? Any examples I could use?

robertlayton added some commits
@robertlayton robertlayton Merge branch 'master' of git://github.com/scikit-learn/scikit-learn i…
…nto eac
61908bf
@robertlayton robertlayton In broken state: Most of the algorithm is there and working, but the …
…tests are not running yet
d76c243
@robertlayton robertlayton Trying new csgraph 439bae0
@robertlayton robertlayton Using scipy's connected_components for now, which will be backported …
…next
fb12eac
@robertlayton robertlayton Merge branch 'master' of git://github.com/scikit-learn/scikit-learn i…
…nto eac
18bd327
@robertlayton robertlayton MST algorithm. in broken state until I get the imports working (next …
…commit)
bd6abff
@robertlayton robertlayton EAC now works, with test running as well 4e12586
@robertlayton robertlayton Merge branch 'master' of https://github.com/scikit-learn/scikit-learn
…into eac

Conflicts:
	sklearn/cluster/__init__.py
62a3a3b
@robertlayton robertlayton random_state now used for k-means, makes the algorithm more stable 9fc6689
@robertlayton robertlayton New form of spanning tree, should be working now.
Not sure why teh results are so low though (only just gets over the 0.4 limit!)
82be0fa
@robertlayton robertlayton Added an example. Doesn't look right yet, but it works 5c04e9d
@robertlayton robertlayton Different parameter values to ensure consistent testing 9696294
@robertlayton robertlayton Example showing the problem/solution much better. Still a long way to…
… go with it.

Update to documentation too, unfinished, but nearly there.
bad4508
@robertlayton robertlayton Doc draft is finished, as is the example. It doesn't yet really drive…
… home the point I'm trying to make (that the intuition behind the parameter in EAC is more easy to understand than MST).
d9752b1
@robertlayton robertlayton Discussion about teh intuitive nature of the parameter cba494f
@robertlayton robertlayton Reduced the number of subplots in example, much clearer now. Needs de…
…scriptions
35a13a6
@robertlayton robertlayton Example finished. 9e4017b
@robertlayton robertlayton Added documentation for MSTCluster (example to come) and what's new.
MSTCluster can also now take a feature matrix, but this is untested (that's next).
574ef93
@robertlayton robertlayton Forgot to set self.metric. It works now. 4d37eec
@robertlayton robertlayton Added example for MST clustering 5b69547
@robertlayton robertlayton Tests require that X not be a precomputed distance matrix. Have adjusted
the defaults and the docs.

In addition, import clustering algorithms only when needed for EAC.
2afa378
@robertlayton robertlayton Adding +1 to the creation of the co-assocation matrix is needed.
Because it's sparse, without the +1 the zero distances are removed, which leads to poor span trees.
a6394a9
@robertlayton robertlayton Merge branch 'master' of https://github.com/scikit-learn/scikit-learn
…into eac

Conflicts:
	doc/whats_new.rst
7175fd7
@robertlayton robertlayton Fixed 5 month old typo 192b9fa
doc/modules/clustering.rst
@@ -571,6 +571,123 @@ by black points below.
In Proceedings of the 2nd International Conference on Knowledge Discovery
and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996
+
+
+Minimum Spanning Tree Clustering
+================================
+The :class:`MSTCluster` algorithm forms a minimum spann tree over the data, and
@ogrisel Owner

spann => spanning?

@ogrisel
Owner

I am not a big fan of acronyms as class names. What do people think about renaming the EAC class to EvidenceAccumulationClustering?

The same remark holds for the MSTCluster. Maybe we could just call it MinimumSpanningTree?

@ogrisel
Owner

@GaelVaroquaux how does the MSTCluster class overlap with your work on single linkage clustering? Do you also use the minimum spanning tree implementation of scipy when you don't have connectivity constraints?

examples/cluster/plot_eac.py
((121 lines not shown))
+ax.scatter(X[:, 0], X[:,1], color=colors[y_pred].tolist(), s=10)
+ax.set_title("EAC")
+
+# Plot distribution of scores (from main_metric)
+ax = fig.add_subplot(3, 1, 3)
+# k-means
+ax.plot(k_values, km_ami_means)
+ax.errorbar(k_values, km_ami_means, yerr=km_ami_std, fmt='ro', label='k-means')
+
+# MST
+ax.plot(mst_values, mst_ami_means)
+ax.errorbar(mst_values, mst_ami_means, fmt='g*', label='MST')
+score = main_metric(y_true, y_pred)
+ax.scatter([n_clusters_,], [score,], label='EAC', s=40)
+ax.legend()
+ax.set_title("V-measure comparison")
@ogrisel Owner

I think it's better to use individual plot figures rather than several axis in the same figure. Here is an example that uses multiple plots and how it's rendered in the sphinx website: http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html

sklearn/cluster/eac.py
((172 lines not shown))
+ The array is treated as a feature array unless the metric is given as
+ 'precomputed'.
+ initial_clusterers: iterable, or None
+ The clusterers used in the first step of the process. If an iterable is
+ given, then each one is called. If None is given (default), 150 runs of
+ k-means with k randomly selected between 10 and 30 are used.
+ final_clusterer: model (extends ClusterMixin), or None
+ The clusterer to apply to the final clustering matrix. The method must
+ be able to take a coassociation matrix as input, which is an array of
+ size [n_samples, n_samples].
+ If None, the default model is used, which is MST.
+ use_distance: boolean, or callable
+ If True, convert the coassociation matrix to distance using
+ `D=1./(C + 1)`. If callable, the function is called with the
+ coassociation matrix as input. If False (default), then the matrix is
+ given as input to the `final_clusterer`.
@ogrisel Owner

This parameter list does not match the constructor params of the class.

@robertlayton Owner

Ah, forgot about that.

sklearn/cluster/eac.py
((173 lines not shown))
+ 'precomputed'.
+ initial_clusterers: iterable, or None
+ The clusterers used in the first step of the process. If an iterable is
+ given, then each one is called. If None is given (default), 150 runs of
+ k-means with k randomly selected between 10 and 30 are used.
+ final_clusterer: model (extends ClusterMixin), or None
+ The clusterer to apply to the final clustering matrix. The method must
+ be able to take a coassociation matrix as input, which is an array of
+ size [n_samples, n_samples].
+ If None, the default model is used, which is MST.
+ use_distance: boolean, or callable
+ If True, convert the coassociation matrix to distance using
+ `D=1./(C + 1)`. If callable, the function is called with the
+ coassociation matrix as input. If False (default), then the matrix is
+ given as input to the `final_clusterer`.
+ random_state: numpy.RandomState, optional
@ogrisel Owner

or int

@GaelVaroquaux
@ogrisel
Owner

So maybe the MSTCluster class could be renamed SingleLinkageClustering. WDYT?

@robertlayton
Owner

I don't mind the change to EvidenceAccumulationClustering, not sure which way to go on MSTCluster though. Thoughts?

@robertlayton
Owner

All issues fixed, except for renaming MSTCluster.

Which option did you think was best?

@ogrisel
Owner

I think I still like SingleLinkageClustering better. But please ask on the ML. Also this PR needs to be rebased on master.

doc/modules/classes.rst
@@ -29,9 +29,11 @@ Classes
cluster.AffinityPropagation
cluster.DBSCAN
+ cluster.EAC
@ogrisel Owner

This needs an update.

@jakevdp
Collaborator

MST is also known as "Friends of Friends" clustering in some circles (notably astrophysical N-body simulation groups). That might also be a naming option, though I'm not sure how widespread the usage is. In astroML, I dubbed a similar estimator HierarchicalClustering, which may be a clearer name.

@robertlayton
Owner
doc/modules/clustering.rst
@@ -571,6 +571,134 @@ by black points below.
In Proceedings of the 2nd International Conference on Knowledge Discovery
and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996
+
+
+Minimum Spanning Tree Clustering
+================================
+The :class:`MSTCluster` algorithm forms a minimum spanning tree over the data,
+and then cuts any edges that have a weight higher than some predefined
+threshold. This separates the data into connected components, each representing
+a separate git cocluster.
@larsmans Owner

git cocluster?

@robertlayton Owner
@robertlayton

OK, I think I've addressed all of the issues, except for merging with upstream.

I'll leave it here if anyone wants to review, otherwise I'll self review early next week and update this comment.

@larsmans larsmans commented on the diff
sklearn/utils/__init__.py
@@ -14,7 +14,8 @@
atleast2d_or_csr, warn_if_not_float,
check_random_state, column_or_1d)
from .class_weight import compute_class_weight
-from sklearn.utils.sparsetools import minimum_spanning_tree
+from sklearn.utils.sparsetools import (minimum_spanning_tree,
+ connected_components)
@larsmans Owner

Where is connected_components being used?

@robertlayton Owner

spectral_embedding.py uses it, but imports it directly from .sparsetools. Same in a few other places. That said, single_linkage uses it.

@larsmans Owner

Oh, right, this is an __init__. Sorry, hadn't seen that.

@robertlayton

OK, I've reviewed this PR, I believe it's ready to go.

@larsmans larsmans commented on the diff
sklearn/cluster/single_linkage.py
((64 lines not shown))
+ -----
+ See examples/plot_single_linkage.py for a visualisation of the threshold
+ cutting algorithm on a small dataset.
+
+ See examples/plot_eac.py for an example. The Evidence Accumulation
+ Clustering (EAC) algorithm uses this clusterer in its final clustering
+ step.
+
+ References
+ ----------
+ Fred, Ana LN, and Anil K. Jain. "Data clustering using evidence
+ accumulation." Pattern Recognition, 2002. Proceedings. 16th International
+ Conference on. Vol. 4. IEEE, 2002.
+ """
+ X = atleast2d_or_csr(X)
+ X = pairwise_distances(X, metric=metric)
@larsmans Owner

This takes O(n²) time. With an explicit, sparse similarity matrix instead of distances, it can be done in O(E lg(E)) time where E are the edges of the similarity graph, i.e. the non-zero similarities. In the optimal algorithm, you don't even need to build the spanning tree. By promising to return the spanning tree and working with metrics instead of similarities, we're tying ourselves down to a suboptimal algorithm.

@jakevdp Collaborator

One easy (but admittedly ad-hoc) trick is to instead use an approximate minimum spanning tree by building a graph only over the $k$ nearest neighbors. For most datasets, given a reasonable choice of $k$, the result is likely to be the same as in the exact case.

@ogrisel Owner
@jakevdp Collaborator

I think the best solution would be to build the graph over the edges given by scipy's QHULL wrapper. Then you get a true MST in order NlogN. From my recollection, though, I think the qhull wrapper in scipy doesn't provide direct access to the edges...

@robertlayton Owner
@larsmans
Owner

To be sure, a faster algorithm to do this (in pure Python) is:

import heapq

def flat_single_linkage(X, threshold):     # X is a coo_matrix of similarities
    # Push negated similarities so heappop yields the most similar pair first
    heap = zip(-X.data, X.col, X.row)      # linear time
    heapq.heapify(heap)                    # linear time

    # Kruskal's algorithm with early stopping
    disjsets = DisjointSetForest(X.shape[0])     # linear time
    while heap:
        neg_sim, i, j = heapq.heappop(heap)
        if -neg_sim < threshold:
            break
        if disjsets.find(i) != disjsets.find(j):  # O(α(n)) time
            disjsets.union(i, j)                  # O(α(n)) time

    return [disjsets.find(i) for i in xrange(X.shape[0])]

A further optimized version would look only at the bottom half of X and skip the find calls in the loop because they'll be re-done by union anyway. This algorithm can be easily modified to put a maximum on the number of clusters.

@robertlayton

Where is DisjointSetForest from?

@larsmans
Owner

It's not in the stdlib, but it's a standard data structure for handling connected components. I have an implementation in C++ that is easily translated to Cython. (The heap operations will take more work).
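
For completeness, a minimal pure-Python union-find ("disjoint set forest") with path compression and union by rank, of the kind referred to above; it is only meant to make the flat_single_linkage sketch self-contained, not to stand in for the C++/Cython version mentioned:

class DisjointSetForest(object):
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, i):
        # Path halving: point each visited node at its grandparent
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return
        # Union by rank: attach the shallower tree under the deeper one
        if self.rank[ri] < self.rank[rj]:
            ri, rj = rj, ri
        self.parent[rj] = ri
        if self.rank[ri] == self.rank[rj]:
            self.rank[ri] += 1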

@jakevdp
Collaborator

This all looks really excellent -- really nice work Robert. Regarding the above discussion, I'd be in favor of addressing the order N^2 issue before merging: that way it's more likely to be addressed! It sounds like there are three main avenues:

  1. @amueller's route of using a k-neighbors-based approximate MST
  2. @larsmans's suggestion of avoiding the MST phase through the specialized data structure
  3. the QHULL route of using edges of a tessellation to quickly compute a true MST

I'm not sure which is the best at this point. (1) is certainly the easiest, (3) is IMHO the best option (though I don't know whether it's currently possible) and (2) seems like it would require a fair bit of work to implement. Any thoughts?

@robertlayton
@larsmans
Owner

If (1) is easily implementable, then I say we go that route. (2) isn't terribly hard to code up (I have a ready pure Python implementation that is scalable but slow), but it needs a different API to be really efficient, with similarities/affinities rather than distances. We could do (1) now, add (3) as an option, and (2) as a separate estimator, so I'd say go ahead with (1) now.

(The problem, I found out, has yet a third name: max-spacing clustering.)

@jakevdp
Collaborator

(1) is very easy. I believe all it takes is to replace the pairwise_distances line with a suitable kneighbors_graph call. I took a similar approach in astroML: https://github.com/astroML/astroML/blob/master/astroML/clustering/mst_clustering.py#L78

Also need to add a n_neighbors parameter to the class constructor.
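
A hedged sketch of how the pieces could fit together (illustrative only; it uses scipy's csgraph routines, which this PR backports into sklearn.utils.sparsetools, and it ignores the exact-zero-distance corner case that the PR handles with a +1 offset):

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from sklearn.neighbors import kneighbors_graph

def approx_single_linkage(X, threshold, n_neighbors=20):
    # Build the (approximate) MST over a k-neighbors graph rather than the
    # full n_samples x n_samples distance matrix.
    G = kneighbors_graph(X, n_neighbors, mode='distance')
    span_tree = minimum_spanning_tree(G)
    # Cut edges longer than the threshold and label the remaining
    # connected components.
    span_tree.data[span_tree.data > threshold] = 0
    span_tree.eliminate_zeros()
    n_components, labels = connected_components(span_tree, directed=False)
    return labels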

@robertlayton
@jakevdp
Collaborator

I did some experiments on a variety of datasets and found that 15-20 neighbors is usually sufficient to recover the true minimum spanning tree.

@robertlayton
Owner

OK, I just took a stab at #1. I'm not sure if I'm missing something obvious but I can't do a nearest neighbour from a sparse distance matrix (i.e. n_samples by n_samples). Thoughts?

e.g. This won't work (after I put 'precomputed' in the list of valid distance matrices)

    X = atleast2d_or_csr(X)
    X = pairwise_distances(X, metric=metric)
    clf = NearestNeighbors(n_neighbors, algorithm='brute', metric='precomputed')
    X = clf.fit(X).kneighbors_graph(X._fit_X, n_neighbors, mode='distance')

Trying to fix this leads me down a path of updating nearest neighbours and so on. (argsort is used in neighbors/base.py line 301, which is invalid for a sparse matrix)

Am I missing something obvious?

@jakevdp
Collaborator

Hi,
I think something like this should work for either sparse or dense input. For dense input, though, it would be better to not use brute force.

X = atleast2d_or_csr(X)
clf = NearestNeighbors(5, algorithm='brute')
G = clf.fit(X).kneighbors_graph(X, mode='distance')
@robertlayton

That will work if X is n_samples by n_features, and will even work in the case of n_samples by n_samples (i.e. a distance matrix), when n_samples is small. The problem is that nearest neighbours will interpret this as n_samples by n_features, and that could be infeasible for large n_samples.

I need to accept n_samples by n_samples, as that is what is returned by the evidence_accumulation_clustering algorithm.

@jakevdp
Collaborator

With a dense n_samples x n_samples distance matrix, you could compute the graph with something like this:

# X is n_samples x n_samples
# k is number of neighbors
ranks = np.argsort(np.argsort(X, 1), 1)  # rank of each distance within its row
G = np.where(ranks <= k, X, 0)           # keep only each row's k nearest distances

There might be a clever way to do this in the sparse case as well
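
One possible (illustrative, unoptimised) approach for the sparse case, keeping only the k smallest stored distances in each row of a CSR matrix:

import numpy as np
from scipy import sparse

def knn_from_sparse_distances(D, k):
    D = sparse.csr_matrix(D)
    rows, cols, vals = [], [], []
    for i in range(D.shape[0]):
        start, end = D.indptr[i], D.indptr[i + 1]
        row_vals = D.data[start:end]
        row_cols = D.indices[start:end]
        if row_vals.size > k:
            # indices of the k smallest stored distances in this row
            keep = np.argpartition(row_vals, k)[:k]
        else:
            keep = np.arange(row_vals.size)
        rows.extend([i] * len(keep))
        cols.extend(row_cols[keep])
        vals.extend(row_vals[keep])
    return sparse.csr_matrix((vals, (rows, cols)), shape=D.shape)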

@robertlayton
Commits

  • Feb 28, 2013 (bob): First draft of new mini-batch k-means
  • Mar 1, 2013 (bob): Updates to documentation wording
  • Mar 5, 2013 (robertlayton): 2 commits
  • Mar 24, 2013 (bob): Initial commit -- algorithm is mostly there, except for final clusterer. The algorithm works, but isn't very fast or accurate. Not fast because I haven't optimised, not accurate due to the poor final clusterer (I think)
  • Mar 28, 2013 (robertlayton): 1 commit
  • Mar 29, 2013 (robertlayton): Fixed a bug, but performance is still not good enough, indicating another bug somewhere. The common clustering test now tells you which clustering algorithm failed (if one does).
  • Apr 2, 2013 (robertlayton): Changed the final clusterer to SpectralClustering to improve accuracy until I finish the MST algorithm. This required a change to the test, which should be removed after the change.
  • Apr 26, 2013 (robertlayton): 1 commit
  • Apr 29, 2013 (robertlayton): 1 commit
  • Apr 30, 2013 (robertlayton): 1 commit
  • Jun 5, 2013 (robertlayton): 2 commits
  • Jun 6, 2013 (robertlayton): Trying new csgraph; 1 further commit
  • Jul 22, 2013 (robertlayton): 2 commits
  • Aug 22, 2013 (robertlayton): 1 commit; Merge branch 'master' of https://github.com/scikit-learn/scikit-learn into eac (conflicts: sklearn/cluster/__init__.py)
  • Aug 26, 2013 (robertlayton): 1 commit
  • Aug 27, 2013 (robertlayton): New form of spanning tree, should be working now. Not sure why the results are so low though (only just gets over the 0.4 limit!)
  • Aug 28, 2013 (robertlayton): 1 commit
  • Aug 29, 2013 (robertlayton): 1 commit
  • Aug 30, 2013 (robertlayton): Example showing the problem/solution much better. Still a long way to go with it. Update to documentation too, unfinished, but nearly there; Doc draft is finished, as is the example. It doesn't yet really drive home the point I'm trying to make (that the intuition behind the parameter in EAC is more easy to understand than MST); 1 further commit
  • Aug 31, 2013 (robertlayton): 1 commit
  • Sep 2, 2013 (robertlayton): Example finished; Added documentation for MSTCluster (example to come) and what's new. MSTCluster can also now take a feature matrix, but this is untested (that's next); 2 further commits; Tests require that X not be a precomputed distance matrix. Have adjusted the defaults and the docs. In addition, import clustering algorithms only when needed for EAC; Adding +1 to the creation of the co-association matrix is needed. Because it's sparse, without the +1 the zero distances are removed, which leads to poor span trees
  • Sep 3, 2013 (robertlayton): Merge branch 'master' of https://github.com/scikit-learn/scikit-learn into eac (conflicts: doc/whats_new.rst); 1 further commit
  • Sep 5, 2013 (robertlayton): Fixed typo; Rebalance lines; 3 further commits
  • Sep 6, 2013 (robertlayton): 1 commit; pep8
  • Sep 7, 2013 (robertlayton): 1 commit
  • Sep 9, 2013 (robertlayton): Typo
  • Sep 14, 2013 (robertlayton): MSTCluster -> SingleLinkageCluster, all tests passing; Updates to documentation; Hanging MST reference; Merge branch 'eac' of github.com:robertlayton/scikit-learn into eac (conflicts: doc/modules/clustering.rst, examples/cluster/plot_eac.py, sklearn/cluster/__init__.py, sklearn/cluster/eac.py); Docs build correctly now
  • Sep 17, 2013 (robertlayton): 1 commit; Spelling
  • Sep 18, 2013 (robertlayton): Spelling; Merge branch 'master' of https://github.com/scikit-learn/scikit-learn into eac (conflicts: doc/whats_new.rst); 2 further commits
9 doc/modules/classes.rst
@@ -29,9 +29,11 @@ Classes
cluster.AffinityPropagation
cluster.DBSCAN
+ cluster.EvidenceAccumulationClustering
cluster.KMeans
cluster.MiniBatchKMeans
cluster.MeanShift
+ cluster.SingleLinkageCluster
cluster.SpectralClustering
cluster.Ward
@@ -41,13 +43,14 @@ Functions
:toctree: generated/
:template: function.rst
- cluster.estimate_bandwidth
- cluster.k_means
- cluster.ward_tree
cluster.affinity_propagation
cluster.dbscan
+ cluster.eac
+ cluster.estimate_bandwidth
+ cluster.k_means
cluster.mean_shift
cluster.spectral_clustering
+ cluster.ward_tree
.. _bicluster_ref:
142 doc/modules/clustering.rst
@@ -578,6 +578,148 @@ by black points below.
In Proceedings of the 2nd International Conference on Knowledge Discovery
and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996
+
+
+Single Linkage Clustering
+=========================
+In single linkage hierarchical clustering, each sample begins in its own
+cluster. At each step, the two nearest clusters are merged (agglomerative
+clustering). The distance between two clusters is determined by finding the
+minimum distance between samples in each cluster. This makes it different
+from other forms of hierarchical clustering, such as complete linkage, which
+uses the maximum distance between a pair of points, one from each cluster.
+
+The :class:`SingleLinkageCluster` algorithm forms a minimum spanning tree over
+the data, and
+then cuts any edges that have a weight higher than some predefined threshold.
+This separates the data into connected components, each representing a separate
+cluster.
+This algorithm is equivalent to performing single linkage clustering in the way
+outlined above.
+
+In graph theory, a minimum spanning tree is a set of connections between nodes
+that link all nodes together with the minimum sum of weights. In the
+implementation in scikit-learn, a graph is formed by taking the samples to be
+nodes and the edge weight to be the distance between them. Therefore, the
+clusters are formed through maintaining a linkage of samples with a low distance
+to each other.
+
+The :class:`SingleLinkageCluster` takes a distance matrix as input, rather
+than a feature
+matrix. A feature matrix can be used by setting the `metric` parameter, in
+which case the `pairwise_distances` function in the `metrics` module will be
+called to create the distance matrix. This parameter can be set to the string
+`precomputed`, meaning the matrix will be interpreted as being a precomputed
+distance matrix. If running batch jobs, precomputing the distance matrix can
+dramatically speed up computation.
+
+The example below uses :class:`SingleLinkageCluster` to cluster a small
+dataset, showing
+also the edges that compose the minimum spanning tree. Red edges are cut while
+green edges are maintained, forming the final clusters which are color coded.
+
+
+.. |sl_results| image:: ../auto_examples/cluster/images/plot_single_linkage_1.png
+ :target: ../auto_examples/cluster/plot_single_linkage.html
+ :scale: 50
+
+.. centered:: |sl_results|
+
+.. topic:: Examples:
+
+ * :ref:`example_cluster_plot_single_linkage.py`
+
+.. topic:: References:
+
+ * "Data clustering using evidence accumulation." Fred, A., and Anil J.
+ Pattern Recognition, 2002. Proceedings. 16th International Conference on.
+ Vol. 4. IEEE, 2002.
+
+
+Evidence Accumulation Clustering (EAC)
+======================================
+
+The :class:`EvidenceAccumulationClustering` algorithm is an ensemble clustering
+framework that is able to discover clusters of an arbitrary shape. This occurs
+through a remapping of the new points into a kernel-like space through the
+notion of a *co-association matrix*. There are three steps; the initial
+clustering, the creation of the co-association matrix and the final clustering
+step.
+
+In the initial clustering step, a low-level clustering algorithm is used, with
+random parameter values, to cluster the data many times. In the default
+parameters implemented in scikit-learn, the :class:`KMeans` algorithm is run with
+`n_clusters` varying randomly from 10 to 30 inclusive. This can be changed by
+the user by setting the `default_initial_clusterers` parameter to be a list of
+clustering algorithms.
+
+The next step is to create a co-association matrix, `C`. This matrix has shape
+`n_samples` by `n_samples`, where `C[i][j]` is the frequency that samples `i`
+and `j` were clustered together in the initial clustering step. This creates a
+remapping of the data onto a new space. Intuitively, samples that have a low
+distance to each other will have a high value in `C`.
+
+The final step is to apply a final clustering algorithm onto `C` to form the
+final "definitive" labels of the dataset. This step can use any clustering
+algorithm, but works best on algorithms that take a distance or similarity
+matrix as input. The default clustering algorithm is the
+:class:`SingleLinkageCluster` algorithm.
+
+The algorithm needs no parameters, but works better if the threshold for the
+final clustering step is given (see the example for how to do this). This
+parameter is more intuitive than many others, and can be summarised as
+"put items in the same cluster if they are clustered together (1-t)% of the
+time", where t is the threshold parameter. Given the randomness of the k-means
+clustering, this allows us to approximately say that k-means considers the
+items to be (1-t)% similar. This is opposed to saying "these items have a
+distance of x", which may not be as intuitive in high dimension contexts.
+
+In the figure below, we use the rings dataset to show the utility of this
+framework. While the initial clustering algorithms (K-means) are not able to
+find appropriate clusters in this dataset, by combining their outputs into the
+co-association matrix, the final clustering is able to. In addition, the
+V-measure score is significantly higher than K-means.
+
+.. |eac_results1| image:: ../auto_examples/cluster/images/plot_eac_1.png
+ :target: ../auto_examples/cluster/plot_eac.html
+ :scale: 50
+
+.. |eac_results2| image:: ../auto_examples/cluster/images/plot_eac_2.png
+ :target: ../auto_examples/cluster/plot_eac.html
+ :scale: 50
+
+.. |eac_results3| image:: ../auto_examples/cluster/images/plot_eac_3.png
+ :target: ../auto_examples/cluster/plot_eac.html
+ :scale: 50
+
+.. centered:: |eac_results1| |eac_results2|
+
+.. centered:: |eac_results3|
+
+
+.. topic:: Examples:
+
+ * :ref:`example_cluster_plot_eac.py`
+
+.. topic:: Implementation
+
+ The Evidence Accumulation Clustering algorithm has been designed for
+ scikit-learn to be very modular. The parameters allow for the selection
+ of both the initial clustering algorithms and the final clusterer, which
+ can be any clustering model following scikit-learn's API. Meanwhile, the
+ default parameters match the design originally proposed by Fred and Jain
+ (2002). This allows the algorithm to be run without parameters, which
+ should achieve reasonable results on a wide variety of datasets. See the
+ example listed above for usage information on changing the final clustering
+ step.
+
+.. topic:: References:
+
+ * "Data clustering using evidence accumulation." Fred, A., and Anil J.
+ Pattern Recognition, 2002. Proceedings. 16th International Conference on.
+ Vol. 4. IEEE, 2002.
+
+
.. _clustering_evaluation:
Clustering performance evaluation
19 doc/whats_new.rst
@@ -13,7 +13,7 @@ Changelog
any kind of base estimator. See the :ref:`Bagging <bagging>` section of
the user guide for details and examples. By `Gilles Louppe`_.
- - Added :func:`metrics.pairwise_distances_argmin_min`, by Philippe Gervais.
+ - Added :func:`metrics.pairwise_distances_argmin_min`, by Philippe Gervais.
- Added predict method to :class:`cluster.AffinityPropagation` and
:class:`cluster.MeanShift`, by `Mathieu Blondel`_.
@@ -27,12 +27,27 @@ Changelog
:class:`feature_selection.VarianceThreshold`, by `Lars Buitinck`_.
- Precision-recall and ROC examples now use train_test_split, and have more
- explanation of why these metrics are useful. By `Kyle Kastner`_
+ explanation of why these metrics are useful. By `Kyle Kastner`_.
- The training algorithm for :class:`decomposition.NMF` is faster for
sparse matrices and has much lower memory complexity, meaning it will
scale up gracefully to large datasets. By `Lars Buitinck`_.
+ - :class:`cluster.SingleLinkageCluster`, which uses a threshold-based
+ parameter to cut edges, forming clusters from the connected components. By
+ `Robert Layton`_.
+
+ - :class:`cluster.EvidenceAccumulationClustering`, which runs K-means many
+ times to form a co-association matrix, then uses
+ :class:`cluster.SingleLinkageCluster`
+ to form final clusters. By `Robert Layton`_.
+
+ - Upgraded the sparsetools utility functions (which are backported from
+ scipy). This was done to give access to the minimum spanning tree function
+ which was used for the SingleLinkageCluster algorithm.
+ By `Robert Layton`_.
+
- Added svd_method option with default value to "randomized" to
:class:`decomposition.factor_analysis.FactorAnalysis` to save memory and
significantly speedup computation by `Denis Engemann`_, and
140 examples/cluster/plot_eac.py
@@ -0,0 +1,140 @@
+# -*- coding: utf-8 -*-
+"""
+===========================================================
+Demo of EvidenceAccumulationClustering clustering algorithm
+===========================================================
+
+Uses many iterations of k-means with random values for k to create a
+"co-association matrix", where C[i][j] is the frequency of times that instances
+i and j are clustered together. The SingleLinkageCluster algorithm is run on
+this co-association matrix to form the final clusters.
+
+"""
+print(__doc__)
+
+from collections import defaultdict
+from operator import itemgetter
+
+import numpy as np
+import pylab as pl
+from scipy.cluster.hierarchy import dendrogram
+from sklearn.cluster import EvidenceAccumulationClustering
+from sklearn.cluster import KMeans
+from sklearn.cluster import SingleLinkageCluster
+from sklearn import metrics
+from sklearn import datasets
+from sklearn.preprocessing import StandardScaler
+from sklearn.metrics import pairwise_distances
+
+
+##############################################################################
+# Generate sample data
+n_samples = 750
+noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5,
+ noise=.05)
+X, y_true = noisy_circles
+main_metric = metrics.adjusted_mutual_info_score
+
+
+##############################################################################
+# Compute EAC
+print("Running Evidence Accumulation Clustering Algorithm")
+# Create a final clustering, allowing us to set the threshold value ourselves.
+threshold = 0.8
+final_clusterer = SingleLinkageCluster(threshold=threshold,
+ metric='precomputed')
+model = EvidenceAccumulationClustering(default_final_clusterer=final_clusterer,
+ random_state=42).fit(X)
+y_pred = model.labels_
+span_tree = model.final_clusterer.span_tree
+
+print("Threshold: {:.4f}".format(threshold))
+# Number of clusters in labels, ignoring noise if present.
+n_clusters_ = len(set(y_pred)) - (1 if -1 in y_pred else 0)
+print('Estimated number of clusters: %d' % n_clusters_)
+print("Homogeneity: %0.3f" % metrics.homogeneity_score(y_true, y_pred))
+print("Completeness: %0.3f" % metrics.completeness_score(y_true, y_pred))
+print("V-measure: %0.3f" % metrics.v_measure_score(y_true, y_pred))
+print("Adjusted Rand Index: %0.3f"
+ % metrics.adjusted_rand_score(y_true, y_pred))
+print("Adjusted Mutual Information: %0.3f"
+ % metrics.adjusted_mutual_info_score(y_true, y_pred))
+print("")
+
+##############################################################################
+# Run K-Means many times with different k-values and record the distribution
+# of adjusted mutual information scores.
+
+print("Running k-means many times")
+num_iterations = 10
+km_ami_means = []
+km_ami_std = []
+k_values = list(range(2, 30))
+for k in k_values:
+ predictions = (KMeans(n_clusters=k).fit(X).labels_
+ for i in range(num_iterations))
+ ami_scores = [main_metric(y_true, labels)
+ for labels in predictions]
+ km_ami_means.append(np.mean(ami_scores))
+ km_ami_std.append(np.std(ami_scores))
+
+
+# Example k-means
+km_labels = KMeans(n_clusters=2).fit(X).labels_
+
+
+##############################################################################
+# Run SingleLinkageCluster many times with different threshold values
+
+print("Running SingleLinkageCluster many times")
+num_iterations = 50
+sl_scores = defaultdict(list)
+D = pairwise_distances(X, metric="euclidean", n_jobs=1)
+d_min = np.min(D)
+d_max = np.max(D)
+for threshold in np.arange(d_min, d_max, (d_max - d_min) / 100.):
+ model = SingleLinkageCluster(threshold=threshold, metric='euclidean')
+ predictions = model.fit(D).labels_
+ score = main_metric(y_true, predictions)
+ n_clusters = len(set(predictions))
+ sl_scores[n_clusters].append(score)
+for key in sl_scores:
+ sl_scores[key] = np.mean(sl_scores[key])
+xx = sorted(sl_scores.items(), key=itemgetter(0))
+sl_values, sl_ami_means = zip(*xx)
+
+
+##############################################################################
+# Plot results
+
+colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
+colors = np.hstack([colors] * 100)
+# Subplot showing distribution of scores
+
+# Plot two examples of k-means
+pl.figure()
+pl.scatter(X[:, 0], X[:,1], color=colors[km_labels].tolist(), s=10)
+pl.title("K-means")
+
+# Plot EvidenceAccumulationClustering labels
+pl.figure()
+pl.scatter(X[:, 0], X[:,1], color=colors[y_pred].tolist(), s=10)
+pl.title("Evidence Accumulation Clustering")
+
+# Plot distribution of scores (from main_metric)
+pl.figure()
+# k-means
+pl.plot(k_values, km_ami_means)
+pl.errorbar(k_values, km_ami_means, yerr=km_ami_std, fmt='ro', label='k-means')
+
+# SingleLinkageCluster
+pl.plot(sl_values, sl_ami_means)
+pl.errorbar(sl_values, sl_ami_means, fmt='g*', label='Single Linkage')
+score = main_metric(y_true, y_pred)
+pl.scatter([n_clusters_,], [score,], label='EAC', s=40)
+pl.legend()
+pl.title("V-measure comparison")
+
+
+
+pl.show()
67 examples/cluster/plot_single_linkage.py
@@ -0,0 +1,67 @@
+# -*- coding: utf-8 -*-
+"""
+======================================
+Demo of SingleLinkageCluster algorithm
+======================================
+
+In single linkage hierarchical clustering, each element begins in its own
+cluster. At each step, the two nearest clusters are merged (agglomerative
+clustering).
+
+Creates a minimum spanning tree of the data, then cuts edges with a weight
+higher than a given threshold. The remaining connected components form the
+clusters of the data.
+
+This example shows this for a small dataset. The minimum spanning tree is
+computed and shown using edges connecting the points. Green edges are under
+the threshold and therefore are kept. Red edges are over the threshold and are
+cut. The colours of the remaining points are the final clusters found by the
+algorithm.
+
+"""
+print(__doc__)
+
+import numpy as np
+import pylab as pl
+from sklearn.cluster import SingleLinkageCluster
+from sklearn.utils.sparsetools import minimum_spanning_tree
+from sklearn.metrics import pairwise_distances
+from sklearn import datasets
+
+threshold = 5.0
+n_samples = 50
+X, y = datasets.make_blobs(n_samples=n_samples, random_state=8)
+D = pairwise_distances(X, metric='euclidean')
+
+# Need to compute the spanning tree from the original data, not the model
+# This is due to the model automatically removing those edges.
+span_tree = minimum_spanning_tree(D)
+rows, cols = span_tree.nonzero()
+
+
+# We could set metric to be 'precomputed' here, and fit(D) on the next line.
+# This example shows how to set the metric parameter and fit on X instead.
+model = SingleLinkageCluster(threshold=threshold, metric='euclidean')
+labels = model.fit(X).labels_
+
+fig = pl.figure()
+fig.suptitle('Minimum Spanning Tree Clustering')
+
+
+ax = fig.add_subplot(111)
+colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
+colors = np.hstack([colors] * 20)
+
+for sample1, sample2 in zip(rows, cols):
+ p = zip(X[sample1], X[sample2])
+ if D[sample1,sample2] < threshold:
+ color = 'g'
+ else:
+ color = 'r'
+ ax.plot(p[0], p[1], color=color, linewidth=2, zorder=1)
+
+
+ax.scatter(X[:, 0], X[:,1], color=colors[labels].tolist(), s=80, zorder=2)
+
+pl.show()
+
6 sklearn/cluster/__init__.py
@@ -10,6 +10,9 @@
from .hierarchical import ward_tree, Ward, WardAgglomeration
from .k_means_ import k_means, KMeans, MiniBatchKMeans
from .dbscan_ import dbscan, DBSCAN
+from .eac import evidence_accumulation_clustering
+from .eac import EvidenceAccumulationClustering
+from .single_linkage import SingleLinkageCluster
from .bicluster import SpectralBiclustering, SpectralCoclustering
@@ -22,6 +25,9 @@
'Ward',
'WardAgglomeration',
'affinity_propagation',
+ 'SingleLinkageCluster',
+ 'evidence_accumulation_clustering',
+ 'EvidenceAccumulationClustering',
'dbscan',
'estimate_bandwidth',
'get_bin_seeds',
236 sklearn/cluster/eac.py
@@ -0,0 +1,236 @@
+# -*- coding: utf-8 -*-
+"""
+EAC: Evidence Accumulation Clustering
+"""
+
+# Author: Robert Layton <robertlayton@gmail.com>
+#
+# License: 3-clause BSD.
+
+import numpy as np
+from scipy import sparse
+from collections import defaultdict
+
+# Import .cluster.SingleLinkageCluster if final_clusterer = None in an EAC call
+# Import .cluster.Kmeans if initial_clusterers = None in any EAC call.
+from ..base import BaseEstimator, ClusterMixin
+from ..utils import check_random_state, atleast2d_or_csr
+
+
+def evidence_accumulation_clustering(X, initial_clusterers=None,
+ final_clusterer=None,
+ random_state=None):
+ """Perform Evidence Accumulation Clustering clustering on a dataset.
+
+ Evidence Accumulation Clustering (EAC) is an ensemble clustering method that uses
+ many iterations of k-means with randomly chosen k values (``n_clusters``)
+ each time. The number of times two instances are clustered together is
+ given in a co-association matrix, which is then clustered a final time to
+ produce the 'final clustering'. In practice, this gives a more easily
+ separable set of attributes than the original attributes.
+
+ Parameters
+ ----------
+ X: array [n_samples, n_samples] or [n_samples, n_features]
+ Array of distances between samples, or a feature array.
+ The array is treated as a feature array unless the metric is given as
+ 'precomputed'.
+ initial_clusterers: iterable, or None
+ The clusterers used in the first step of the process. If an iterable is
+ given, then each one is called. If None is given (default), 150 runs of
+ k-means with k randomly selected between 10 and 30 are used.
+ final_clusterer: model (extends ClusterMixin), or None
+ The clusterer to apply to the final clustering matrix. The method must
+ be able to take a coassociation matrix as input, which is an array of
+ size [n_samples, n_samples].
+ If None, the default model is used, which is SingleLinkageCluster.
+ random_state: numpy.RandomState or int, optional
+ The generator used to initialize the initial_clusterers.
+ Defaults to numpy.random.
+
+ Returns
+ -------
+ final_model: model (extends ClusterMixin)
+ The model given as `final_clusterer`, fitted with the evidence
+ accumulated through this process.
+
+ Notes
+ -----
+ See examples/plot_eac.py for an example.
+
+ References
+ ----------
+ Fred, Ana LN, and Anil K. Jain. "Data clustering using evidence
+ accumulation." Pattern Recognition, 2002. Proceedings. 16th International
+ Conference on. Vol. 4. IEEE, 2002.
+ """
+ X = atleast2d_or_csr(X)
+ n_samples = X.shape[0]
+ # If index order not given, create random order.
+ random_state = check_random_state(random_state)
+ # If initial_clusterers is None, it is k-means 150 times with randomly
+ # initialised k values (as per original paper).
+ if initial_clusterers is None:
+ initial_clusterers = _kmeans_random_k(n_samples, random_state)
+ # If the final_clusterer is None, create the default model
+ if final_clusterer is None:
+ from ..cluster import SingleLinkageCluster
+ final_clusterer = SingleLinkageCluster(metric='precomputed')
+ # Co-association matrix, originally zeros everywhere
+ C = defaultdict(float)
+ # initial_clusterers could be a generator, so we do not know in advance
+ # how many clusterers there will be.
+ num_initial_clusterers = 0
+ for model in initial_clusterers:
+ num_initial_clusterers += 1
+ # Update random state
+ # Fit model to X
+ model.fit(X)
+ # Calculate new coassociation matrix and add that to the tally
+ C = _update_coassociation_matrix(C, model.labels_)
+ # Convert the defaultdict C into a sparse distance matrix
+ row, col = list(), list()
+ data = list()
+ for key in C.keys():
+ row.append(key[0])
+ col.append(key[1])
+ data.append(C[key])
+ # Normalise data, and turn into a distance matrix
+ data = 1.0 - (np.array(data, dtype='float') / (num_initial_clusterers + 1))
+ D = sparse.csr_matrix((data, (row, col)), shape=(n_samples, n_samples))
+ final_clusterer.fit(D)
+ return final_clusterer
+
+
+def _update_coassociation_matrix(C, labels):
+ """Updates a co-association defaultdict from an array of labels.
+ """
+ labels = np.asarray(labels)
+ for i in range(len(labels)):
+ indices = np.where(labels[i:] == labels[i])[0] + i
+ for idx in indices:
+ C[(i, idx)] += 1.
+ return C
+
+
+def _kmeans_random_k(n_samples, random_state=None, **kmeans_args):
+ """Returns a generator for the default initial clustering for EAC
+
+ This initial clustering is k-means, initialised randomly with k values
+ chosen randomly between 10 and 30 inclusive.
+
+ Parameters
+ ----------
+ random_state: numpy.RandomState, optional
+ The generator used to initialize the initial_clusterers.
+ Defaults to numpy.random.
+
+ kmeans_args: other keywords
+ Any additional arguments to this function are passed onto the
+ initialiser for `sklearn.cluster.KMeans` with the exception of the
+ `n_clusters` argument, which cannot be given (an error is raised if
+ it is).
+
+ Returns
+ -------
+ models: generator of KMeans instances
+ Length will be 150, each instance initialised randomly using the
+ supplied random_state.
+
+ References
+ ----------
+ Fred, Ana LN, and Anil K. Jain. "Data clustering using evidence
+ accumulation." Pattern Recognition, 2002. Proceedings. 16th International
+ Conference on. Vol. 4. IEEE, 2002.
+ """
+ from ..cluster import KMeans
+ if 'n_clusters' in kmeans_args:
+ error_msg = "n_clusters cannot be assigned for the default clusterers."
+ raise ValueError(error_msg)
+ random_state = check_random_state(random_state)
+ num_iterations = 150
+ k_low, k_high = (10, 30)
+ if n_samples < k_high:
+ k_high = n_samples
+ k_low = min(k_low, int(k_high / 2))
+ k_values = random_state.randint(k_low, high=k_high, size=num_iterations)
+ return (KMeans(n_clusters=k, random_state=random_state, **kmeans_args)
+ for k in k_values)
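For reference, a small sketch of how the default ensemble behaves (assuming the private helper is imported from sklearn.cluster.eac): it yields 150 KMeans instances, and any extra keyword arguments are forwarded to each one:

from sklearn.cluster.eac import _kmeans_random_k

models = _kmeans_random_k(n_samples=500, random_state=0, n_init=1, max_iter=50)
first = next(models)
print(first.n_clusters)            # some k drawn from [10, 30)
print(1 + sum(1 for _ in models))  # 150 models in total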
+
+
+class EvidenceAccumulationClustering(BaseEstimator, ClusterMixin):
+ """Perform Evidence Accumulation Clustering clustering on a dataset.
+
+ Evidence Accumulation Clustering (EAC) is an ensemble cluster that uses
+ many iterations of k-means with randomly chosen k values (``n_clusters``)
+ each time. The number of times two instances are clustered together is
+ given in a co-association matrix, which is then clustered a final time to
+ produce the 'final clustering'. In practice, this gives a more easily
+ separable set of attributes that the original attributes.
+
+
+    Parameters
+    ----------
+    default_initial_clusterers: iterable, or None
+        The clusterers used in the first step of the process. If an iterable
+        is given, each clusterer in it is fitted to the data. If None
+        (default), 150 runs of k-means with k randomly selected between 10
+        and 30 are used.
+    default_final_clusterer: model (extends ClusterMixin), or None
+        The clusterer applied to the final co-association matrix. The model
+        must be able to take a precomputed co-association matrix as input,
+        which is an array of size [n_samples, n_samples].
+        If None, the default model, SingleLinkageCluster, is used.
+    random_state: numpy.RandomState or int, optional
+        The generator used to initialize the initial_clusterers.
+        Defaults to numpy.random.
+
+ Attributes
+ ----------
+    `final_clusterer`: model (same type as default_final_clusterer)
+        The final clusterer, fitted to the evidence accumulated during
+        learning.
+
+ `labels_` : array, shape = [n_samples]
+ Cluster labels for each point in the dataset given to fit().
+ Same as the self.final_clusterer.labels_
+
+ Notes
+ -----
+ See examples/plot_eac.py for an example.
+
+ References
+ ----------
+ Fred, Ana LN, and Anil K. Jain. "Data clustering using evidence
+ accumulation." Pattern Recognition, 2002. Proceedings. 16th International
+ Conference on. Vol. 4. IEEE, 2002.
+ """
+
+ def __init__(self, default_initial_clusterers=None,
+ default_final_clusterer=None,
+ random_state=None):
+ self.default_initial_clusterers = default_initial_clusterers
+ self.default_final_clusterer = default_final_clusterer
+ self.random_state = random_state
+
+ def fit(self, X):
+ """Perform EAC clustering from vector array or distance matrix.
+
+ Parameters
+ ----------
+ X: array [n_samples, n_samples] or [n_samples, n_features]
+ Array of distances between samples, or a feature array.
+ The array is treated as a feature array unless the metric is
+ given as 'precomputed'.
+ """
+ model = evidence_accumulation_clustering(
+ X,
+ initial_clusterers=self.default_initial_clusterers,
+ final_clusterer=self.default_final_clusterer,
+ random_state=self.random_state)
+ self.final_clusterer = model
+ self.labels_ = model.labels_
+ return self
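A minimal end-to-end sketch of the estimator API proposed here (again assuming this branch is installed; the dataset and scoring are only for illustration):

from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.cluster.eac import EvidenceAccumulationClustering

X, y = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
eac = EvidenceAccumulationClustering(random_state=42)
labels = eac.fit(X).labels_
print(adjusted_rand_score(y, labels))  # agreement with the generating blobs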
162 sklearn/cluster/single_linkage.py
@@ -0,0 +1,162 @@
+# -*- coding: utf-8 -*-
+"""
+SingleLinkageCluster:
+
+Compute the minimum spanning tree, then cut weak links using a threshold. This
+is equivalent to single linkage hierarchical clustering.
+"""
+
+# Author: Robert Layton <robertlayton@gmail.com>
+#
+# License: 3-clause BSD.
+
+from scipy.sparse import csr_matrix
+
+from ..base import BaseEstimator, ClusterMixin
+from ..metrics import pairwise_distances
+from ..utils import atleast2d_or_csr
+from ..utils import minimum_spanning_tree
+from ..utils import connected_components
+
+
+def single_linkage_cluster(X, threshold=0.85, metric='euclidean'):
+ """Perform clustering using the single linkage, with a threshold cut.
+
+ In single linkage hierarchical clustering, each sample begins in its own
+ cluster. At each step, the two nearest clusters are merged (agglomerative
+    clustering). The distance between two clusters is the minimum distance
+    between a pair of samples, one from each cluster. This makes it different
+    from other forms of hierarchical clustering, such as complete linkage,
+    which uses the maximum distance between a pair of points, one from each
+    cluster.
+
+ The computation used here is an equivalent but different method. A minimum
+ spanning tree is created from the input data (interpreted as a graph) and
+ then weak links are cut. The threshold parameter determines whether a link
+ is cut.
+
+ Parameters
+ ----------
+    X: array [n_samples, n_samples] or [n_samples, n_features]
+        Array of distances between samples, or a feature array.
+        The array is treated as a feature array unless the metric is given as
+        'precomputed'.
+
+    threshold: float (default 0.85)
+        The threshold used to cut weak links: edges of the minimum spanning
+        tree with a weight greater than or equal to this value are removed.
+
+ metric: string, or callable
+ The metric to use when calculating distance between instances in a
+ feature array. If metric is a string or callable, it must be one of
+        the options allowed by metrics.pairwise.pairwise_distances for its
+ metric parameter.
+ If metric is "precomputed", X is assumed to be a distance matrix and
+ must be square.
+
+ Returns
+ -------
+    `labels_` : array, shape = [n_samples]
+        Cluster labels for each point in the dataset.
+
+    `span_tree_`: csr_matrix, shape=[n_samples, n_samples]
+        A sparse matrix representing the minimum spanning tree of the given
+        input matrix, with edges whose weight is greater than or equal to the
+        threshold removed.
+
+ Notes
+ -----
+ See examples/plot_single_linkage.py for a visualisation of the threshold
+ cutting algorithm on a small dataset.
+
+ See examples/plot_eac.py for an example. The Evidence Accumulation
+ Clustering (EAC) algorithm uses this clusterer in its final clustering
+ step.
+
+ References
+ ----------
+ Fred, Ana LN, and Anil K. Jain. "Data clustering using evidence
+ accumulation." Pattern Recognition, 2002. Proceedings. 16th International
+ Conference on. Vol. 4. IEEE, 2002.
+ """
+ X = atleast2d_or_csr(X)
+ X = pairwise_distances(X, metric=metric)
@larsmans Owner

This takes O(n²) time. With an explicit, sparse similarity matrix instead of distances, it can be done in O(E lg(E)) time where E are the edges of the similarity graph, i.e. the non-zero similarities. In the optimal algorithm, you don't even need to build the spanning tree. By promising to return the spanning tree and working with metrics instead of similarities, we're tying ourselves down to a suboptimal algorithm.

@jakevdp Collaborator
jakevdp added a note

One easy (but admittedly ad-hoc) trick is to instead use an approximate minimum spanning tree by building a graph only over the $k$ nearest neighbors. For most datasets, given a reasonable choice of $k$, the result is likely to be the same as in the exact case.
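For concreteness, a rough sketch of that k-NN shortcut (the helper name and the choice of n_neighbors are hypothetical, not part of this PR): build a sparse k-neighbors distance graph, take its spanning tree, and cut weak links, so only about n * k edges are ever materialised rather than the full n x n matrix:

from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from sklearn.neighbors import kneighbors_graph

def approx_single_linkage(X, threshold, n_neighbors=10):
    # k-NN graph with edge weights equal to distances (sparse, ~n * k edges)
    graph = kneighbors_graph(X, n_neighbors=n_neighbors, mode='distance')
    tree = minimum_spanning_tree(graph)
    tree.data[tree.data >= threshold] = 0   # cut weak links
    tree.eliminate_zeros()
    _, labels = connected_components(tree, directed=False)
    return labels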

@ogrisel Owner
ogrisel added a note
@jakevdp Collaborator
jakevdp added a note

I think the best solution would be to build the graph over the edges given by scipy's QHULL wrapper. Then you get a true MST in O(N log N). From my recollection, though, I think the qhull wrapper in scipy doesn't provide direct access to the edges...

+ assert X.shape[0] == X.shape[1]
+ span_tree = minimum_spanning_tree(X)
+ idx = span_tree.data < threshold
+ data = span_tree.data[idx]
+ rows, cols = span_tree.nonzero()
+ rows = rows[idx]
+ cols = cols[idx]
+    # Compute clusters by finding connected components of the thresholded spanning tree
+ new_data = (data, (rows, cols))
+ span_tree = csr_matrix(new_data, shape=span_tree.shape)
+ n_components, labels_ = connected_components(span_tree, directed=False)
+ return labels_, span_tree
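To make the threshold cut concrete, here is the small graph from the scipy docs (the same matrix used in test_trivial_mst further down): the MST keeps edges 1-2 (weight 2), 0-3 (weight 3) and 1-3 (weight 5); cutting edges with weight >= 4.9 removes 1-3, leaving components {0, 3} and {1, 2}. A sketch, assuming this branch is installed:

from scipy import sparse
from sklearn.cluster.single_linkage import single_linkage_cluster

G = sparse.csr_matrix([[0, 8, 0, 3],
                       [0, 0, 2, 5],
                       [0, 0, 0, 6],
                       [0, 0, 0, 0]])
labels, tree = single_linkage_cluster(G, threshold=4.9, metric='precomputed')
print(labels)  # e.g. [0, 1, 1, 0]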
+
+
+class SingleLinkageCluster(BaseEstimator, ClusterMixin):
+ """Perform clustering using the single linkage, with a threshold cut.
+
+ In single linkage hierarchical clustering, each sample begins in its own
+ cluster. At each step, the two nearest clusters are merged (agglomerative
+    clustering). The distance between two clusters is the minimum distance
+    between a pair of samples, one from each cluster. This makes it different
+    from other forms of hierarchical clustering, such as complete linkage,
+    which uses the maximum distance between a pair of points, one from each
+    cluster.
+
+ The computation used here is an equivalent but different method. A minimum
+ spanning tree is created from the input data (interpreted as a graph) and
+ then weak links are cut. The threshold parameter determines whether a link
+ is cut.
+
+ Parameters
+ ----------
+    X: array [n_samples, n_samples] or [n_samples, n_features]
+        Array of distances between samples, or a feature array.
+        The array is treated as a feature array unless the metric is given as
+        'precomputed'.
+
+    threshold: float (default 0.85)
+        The threshold used to cut weak links: edges of the minimum spanning
+        tree with a weight greater than or equal to this value are removed.
+
+ metric: string, or callable
+ The metric to use when calculating distance between instances in a
+ feature array. If metric is a string or callable, it must be one of
+        the options allowed by metrics.pairwise.pairwise_distances for its
+ metric parameter.
+ If metric is "precomputed", X is assumed to be a distance matrix and
+ must be square.
+
+ Attributes
+ ----------
+ `labels_` : array, shape = [n_samples]
+ Cluster labels for each point in the dataset given to fit().
+
+ `span_tree_`: csr_matrix, shape=[n_samples, n_samples]
+ A sparse matrix representing the minimum spanning tree of the given
+        input matrix, with edges whose weight is greater than or equal to
+        the threshold removed.
+
+ Notes
+ -----
+ See examples/plot_eac.py for an example. The Evidence Accumulation
+ Clustering (EAC) algorithm uses this clusterer in its final clustering
+ step.
+
+ References
+ ----------
+ Fred, Ana LN, and Anil K. Jain. "Data clustering using evidence
+ accumulation." Pattern Recognition, 2002. Proceedings. 16th International
+ Conference on. Vol. 4. IEEE, 2002.
+ """
+
+ def __init__(self, threshold=0.85, metric='euclidean'):
+ self.threshold = threshold
+ self.metric = metric
+
+ def fit(self, X):
+ """Perform Single Linkage clustering from a similarity matrix.
+
+ Parameters
+ ----------
+        X: array (dense or sparse), [n_samples, n_samples] or [n_samples, n_features]
+            Array of distances between samples, or a feature array; treated
+            as a feature array unless the metric is 'precomputed'.
+ """
+        self.labels_, self.span_tree_ = single_linkage_cluster(
+            X, threshold=self.threshold, metric=self.metric)
+ return self
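And a short sketch of the estimator on a precomputed distance matrix, mirroring how EAC applies it to the co-association distances (illustrative only; assumes this branch is installed):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances
from sklearn.cluster import SingleLinkageCluster

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
D = pairwise_distances(X)
model = SingleLinkageCluster(threshold=2.0, metric='precomputed').fit(D)
print(np.bincount(model.labels_))  # cluster sizes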
36 sklearn/cluster/tests/test_eac.py
@@ -0,0 +1,36 @@
+"""
+Tests components of the EvidenceAccumulationClustering
+
+The algorithm is quite expensive, even on smaller datasets, so a full
+run is not performed as a unit test.
+"""
+
+import pickle
+
+import numpy as np
+from collections import defaultdict
+
+from sklearn.utils.testing import assert_equal
+from sklearn.cluster.eac import _update_coassociation_matrix
+
+
+def test_coassociation_matrix_building():
+ """Tests that the coassociation matrix builds properly."""
+ C = defaultdict(int)
+ n_samples = 4
+ test_labels = np.array([[0, 0, 1, 1],
+ [1, 1, 0, 0],
+ [0, 1, 1, 0],
+ [1, 1, 1, 1]])
+ for labels in test_labels:
+ C = _update_coassociation_matrix(C, labels)
+    for i in range(n_samples):
+        C[(i, i)] = len(test_labels)  # each sample always co-occurs with itself
+ C_expected = defaultdict(int)
+ # format: (i, j, C[i][j])
+ data = [(0, 0, 4), (0, 1, 3), (0, 2, 1), (0, 3, 2), (1, 1, 4), (1, 2, 2),
+ (1, 3, 1), (2, 2, 4), (2, 3, 3), (3, 3, 4)]
+ for x, y, v in data:
+ C_expected[(x, y)] = v
+ assert_equal(C, C_expected)
+
58 sklearn/cluster/tests/test_single_linkage_cluster.py
@@ -0,0 +1,58 @@
+"""
+Tests for SingleLinkageCluster clustering algorithm
+"""
+
+import pickle
+
+import numpy as np
+from scipy import sparse
+from collections import defaultdict
+
+from sklearn.utils.testing import assert_array_equal
+from sklearn.cluster import SingleLinkageCluster
+from sklearn.datasets.samples_generator import make_blobs
+from sklearn.metrics import pairwise_distances
+
+
+def test_single_linkage_cluster_cluster():
+ """
+ Tests that SingleLinkageCluster has same results for different input.
+ """
+ n_samples = 150
+ threshold = 2.0
+ centers = np.array([
+ [0.0, 5.0, 0.0, 0.0, 0.0],
+ [1.0, 1.0, 4.0, 0.0, 0.0],
+ [1.0, 0.0, 0.0, 5.0, 1.0],
+ ])
+ X, true_labels = make_blobs(n_samples=n_samples, centers=centers,
+ cluster_std=1., random_state=42)
+ D = pairwise_distances(X, metric='euclidean')
+ # Using precomputed distance matrix
+ model_p = SingleLinkageCluster(threshold=threshold, metric='precomputed')
+ labels_p = model_p.fit(D).labels_
+ # Computing distance matrix
+ model_d = SingleLinkageCluster(threshold=threshold)
+ labels_d = model_d.fit(X).labels_
+ assert_array_equal(labels_p, labels_d)
+ # Using a sparse distance matrix
+ Dsp = sparse.csr_matrix(D)
+ model_sp = SingleLinkageCluster(threshold=threshold, metric='precomputed')
+ labels_sp = model_sp.fit(Dsp).labels_
+ assert_array_equal(labels_p, labels_sp)
+
+
+def test_trivial_mst():
+ """Tests a trivial SingleLinkageCluster example (from scipy docs)."""
+ X = sparse.csr_matrix([[0, 8, 0, 3],
+ [0, 0, 2, 5],
+ [0, 0, 0, 6],
+ [0, 0, 0, 0]])
+ model = SingleLinkageCluster(threshold=4.9, metric='precomputed')
+ y_pred = model.fit(X).labels_
+ y_true = np.array([0, 1, 1, 0], dtype='int32')
+ assert_array_equal(y_pred, y_true)
+
+
5 sklearn/tests/test_common.py
@@ -513,7 +513,10 @@ def test_clustering():
assert_equal(alg.labels_.shape, (n_samples,))
pred = alg.labels_
- assert_greater(adjusted_rand_score(pred, y), 0.4)
+ score = adjusted_rand_score(pred, y)
+ error_message = "{} failed with score {} (<0.4 benchmark) {}".format(name,
+ score, alg)
+ assert_greater(score, 0.4, error_message)
# fit another time with ``fit_predict`` and compare results
if name is 'SpectralClustering':
# there is no way to make Spectral clustering deterministic :(
3  sklearn/utils/__init__.py
@@ -14,7 +14,8 @@
atleast2d_or_csr, warn_if_not_float,
check_random_state, column_or_1d)
from .class_weight import compute_class_weight
-from sklearn.utils.sparsetools import minimum_spanning_tree
+from sklearn.utils.sparsetools import (minimum_spanning_tree,
+ connected_components)
@larsmans Owner

Where is connected_components being used?

@robertlayton Owner

spectral_embedding.py uses it, but imports it directly from .sparsetools. Same in a few other places. That said, single_linkage uses it.

@larsmans Owner

Oh, right, this is an __init__. Sorry, hadn't seen that.

__all__ = ["murmurhash3_32", "as_float_array", "check_arrays", "safe_asarray",