New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+2] Single linkage clustering #9372

Merged
merged 61 commits into from Jan 22, 2018

Conversation

Projects
None yet
9 participants
@lmcinnes
Contributor

lmcinnes commented Jul 15, 2017

Work regarding issue #4103 -- addition of single linkage as a linkage option for hierarchical clustering.

This is the minimal changes required to get single linkage working. We take advantage of scipy.cluster for the basic single linkage, and scipy.sparse.csgraph for the connectivity-constrained case. Both of these could be improved upon at some point in the future if required.

I believe I caught most cases in the documentation where linkage strategies are mentioned and single linkage is relevant, but I admit I may not have caught them all.

@amueller

needs tests ;)

@@ -579,7 +579,9 @@ Agglomerative cluster has a "rich get richer" behavior that leads to
uneven cluster sizes. In this regard, complete linkage is the worst
strategy, and Ward gives the most regular sizes. However, the affinity
(or distance used in clustering) cannot be varied with Ward, thus for non
Euclidean metrics, average linkage is a good alternative.
Euclidean metrics, average linkage is a good alternative. Single linkage,
while not robust to noisy data, can computed very efficiently and can

This comment has been minimized.

@amueller

amueller Jul 15, 2017

Member

be computed

mst_array = mst_array[np.argsort(mst_array.T[2]),:]
# Convert edge list into standard hierarchical clustering format
single_linkage_tree = _hierarchical.single_linkage_label(mst_array)

This comment has been minimized.

@amueller

amueller Jul 15, 2017

Member

is this faster then connected_components?

This comment has been minimized.

@amueller

amueller Jul 15, 2017

Member

Oh, I guess we're computing the linkage tree here? Are we using this later? I feel like we usually are ok with just the clustering, but I don't use these algorithms often.

This comment has been minimized.

@lmcinnes

lmcinnes Jul 15, 2017

Contributor

This converts the MST into scipy.cluster.hierarchy format. It just uses a union find so is pretty fast, and I already had code that does this (and was well tested), so it seemed the most efficient way to get something working. I am open to other options.

# Compute parents
parent = np.arange(n_nodes, dtype=np.intp)
for i, (left, right) in enumerate(children_, n_samples):

This comment has been minimized.

@amueller

amueller Jul 15, 2017

Member

Isn't this slow?

This comment has been minimized.

@lmcinnes

lmcinnes Jul 15, 2017

Contributor

Relatively speaking, no. I can cythonize it if you are concerned.

This comment has been minimized.

@amueller

amueller Jul 15, 2017

Member

if no, then please don't ;) I'd rather avoid cython if it's not necessary.
Can you maybe post a benchmark of this against the other linkage criteria?

This comment has been minimized.

@lmcinnes

lmcinnes Jul 16, 2017

Contributor

Here is a scaling performance comparison for the sparse matrix case:
image
Most of the time is single linkage is actually spent fixing the connectivity matrix -- I'm factoring that out of the test and will post that soon, but it already looks pretty good.

The dense case is simply scipy.cluster.hierarchy, so I presume the performance of single linkage there is already demonstrated.

This comment has been minimized.

@lmcinnes

lmcinnes Jul 16, 2017

Contributor

Here we are with the connectivity factored out (performed beforehand so it doesn't need to occur in the fit method).

image

This comment has been minimized.

@amueller

amueller Nov 20, 2017

Member

sorry can you explain the difference between the two graphs again? The cythonization of the tree creation?

This comment has been minimized.

@lmcinnes

lmcinnes Nov 23, 2017

Contributor

Both graphs use the same code in hierarchy_.py, the difference is in the data passed to in. In the first graph the sparse matrix that is passed in is not guaranteed to have a single connected component, which then requires some pre-processing to "fix" that by adding extra entries to the matrix. This pre-processing is common to all the linkage approaches, but isn't part of the algorithms per se (but the fit method will do it if it is required). The second graph simple ensures that the required pre-processing is done before the fit method is called, and the timing is thus only for the actual linkage methods, and no longer includes the common pre-processing step (which can be time consuming).

This comment has been minimized.

@jnothman

jnothman Nov 28, 2017

Member

The iteration and tuple unpacking is likely to be slow relative to using children_.tolist(), but it's unimportant

@amueller

This comment has been minimized.

Member

amueller commented Jul 15, 2017

I guess this is more in line with how the other algorithms are implemented, so maybe @GaelVaroquaux has a more informed opinion. Are we supporting a pre-specified graph here?

@lmcinnes

This comment has been minimized.

Contributor

lmcinnes commented Jul 15, 2017

With regard to tests: I modified the existing test suite to exercise single linkage in all the general linkage tests, so it is definitely getting tested. Did you want some new tests specific to single linkage? What did you have in mind?

With regard to pre-specified graphs: yes those are supported, presuming I am interpreting what you mean by that correctly (a connectivity (sparse) matrix constraining things).

################################################################################
# Efficient labelling/conversion of MSTs to single linkage hierarchies
cdef class UnionFind (object):

This comment has been minimized.

@GaelVaroquaux

GaelVaroquaux Jul 15, 2017

Member

To respect pep8, I'd rather avoid the space before the "(" above.

This comment has been minimized.

@lmcinnes

lmcinnes Jul 15, 2017

Contributor

Done. Sorry about that.

@@ -556,10 +556,10 @@ considers at each step all the possible merges.
number of features. It is a dimensionality reduction tool, see
:ref:`data_reduction`.
Different linkage type: Ward, complete and average linkage
Different linkage type: Ward, complete, average and single linkage
-----------------------------------------------------------

This comment has been minimized.

@GaelVaroquaux

GaelVaroquaux Jul 15, 2017

Member

You need to update the length of the line below.

This comment has been minimized.

@lmcinnes

lmcinnes Jul 15, 2017

Contributor

Done. Thanks!

lmcinnes added some commits Jul 15, 2017

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Jul 15, 2017

@amueller

This comment has been minimized.

Member

amueller commented Jul 15, 2017

I'm not sure what the current tests for the linkage critera are. It would be nice if there was a test with an example that has obvious different solutions for the different linkage critera. There's also an example for them, I think it would be nice to add single linkage here.

@GaelVaroquaux the question was whether we want to create a linkage tree, but I suppose the answer is yes.

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Jul 15, 2017

@lmcinnes

This comment has been minimized.

Contributor

lmcinnes commented Jul 15, 2017

I'll see if I can craft a couple of test examples that are suitable. I'll also see If I can write a simple example demonstrating the effects of the different linkages (I already added single linkage to the existing examples where it made sense).

@lmcinnes

This comment has been minimized.

Contributor

lmcinnes commented Jul 15, 2017

If I make an example using the current cluster comparison on toy datasets framework demonstrating the results under different linkage methods would that be a suitable example?

@amueller

This comment has been minimized.

Member

amueller commented Jul 15, 2017

sure, sounds good. I imagine the example that's already there doesn't show the differences that clearly?

@lmcinnes

This comment has been minimized.

Contributor

lmcinnes commented Jul 15, 2017

The current example is on a 2D embedding of the digits dataset. That isn't terrible, but it doesn't highlight the different particularly. In fact because it is a noisy embedding single linkage just picks out the noise points on the fringe and calls everything else one cluster. It certainly demonstrates the shortcomings, but fails to demonstrate the cases where it can be useful.

lmcinnes added some commits Jul 15, 2017

Provide a more complete comparison of the different linkage methods, …
…highlighting the relative strengths and weaknesses.
@jakevdp

This comment has been minimized.

Member

jakevdp commented Jan 18, 2018

OK, sounds good.

@lmcinnes

This comment has been minimized.

Contributor

lmcinnes commented Jan 18, 2018

Thanks for all the help @jakevdp and @jnothman, it was greatly appreciated, and I gained a lot from working with both of you in ironing out the many fine details.

raise ValueError("Input MST array is not a validly formatted MST array")
if not np.all(np.sort(L[:, 2]) == L[:, 2]):
raise ValueError("Input MST array must be sorted by weight")

This comment has been minimized.

@jakevdp

jakevdp Jan 18, 2018

Member

Sorry, one more comment: I'd change this to

is_sorted = lambda x: np.all(x[:-1] <= x[1:])
if not is_sorted(L[:, 2]):
   raise ...

no need to actually perform a full (expensive) sort to check if the array is sorted.

This comment has been minimized.

@jakevdp

jakevdp Jan 18, 2018

Member

You could do it all in one line, of course, but I think using the lambda function makes it easier for someone to skim the code and understand what's going on.

@jakevdp

This comment has been minimized.

Member

jakevdp commented Jan 18, 2018

The last remaining issue is that if the connectivity matrix has integer types, our strategy for creating epsilon values will lead to weird results. If that's ever a possibility, we should proactively convert the connectivity matrix to float64, because that's what's eventually done within the minimum_spanning_tree function anyway.

@lmcinnes

This comment has been minimized.

Contributor

lmcinnes commented Jan 18, 2018

Assuming that _single_linkage_tree is only called internally then I believe this is resolved already in that connectivity data is either populated with the results of paired_distances which is float64, or with an explicit cast to float64 of the content of X (in the case that affinity='precomputed'. I'm happy to add a cast in _single_linkage_tree to ensure safety if it gets called any other way in the future.

@jakevdp

This comment has been minimized.

Member

jakevdp commented Jan 18, 2018

Makes sense. Thanks for all the great work on this @lmcinnes!

@jakevdp jakevdp changed the title from [MRG+1] Single linkage clustering to [MRG+2] Single linkage clustering Jan 18, 2018

@@ -567,30 +567,24 @@ considers at each step all the possible merges.
number of features. It is a dimensionality reduction tool, see
:ref:`data_reduction`.
Different linkage type: Ward, complete and average linkage
-----------------------------------------------------------
Different linkage type: Ward, complete, average and single linkage

This comment has been minimized.

@GaelVaroquaux

GaelVaroquaux Jan 19, 2018

Member

Nitpick: Oxford coma "... average, and single".

# ============
plt.figure(figsize=(9 * 2 + 3, 12.5))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05,
hspace=.01)

This comment has been minimized.

@GaelVaroquaux
# ============
# Generate datasets. We choose the size big enough to see the scalability
# of the algorithms, but not too big to avoid too long running times
# ============

This comment has been minimized.

@GaelVaroquaux

GaelVaroquaux Jan 19, 2018

Member

Should we rather use sphinx-gallery separators here: insert a line of continuous "#" longer than 70 chars before the block and only before. Sphinx-gallery will use this to create html rendering and Jupyter notebooks.

@@ -79,7 +81,7 @@ def plot_clustering(X_red, X, labels, title=None):
from sklearn.cluster import AgglomerativeClustering
for linkage in ('ward', 'average', 'complete'):
for linkage in ('ward', 'average', 'complete', 'single'):

This comment has been minimized.

@GaelVaroquaux

GaelVaroquaux Jan 19, 2018

Member

With the added subplot, the figure got a bit more narrow and the titles are not well separated. I think that it would be useful to add a "\n" in the title between the name of the linkage and the timing.

This comment has been minimized.

@lmcinnes

lmcinnes Jan 20, 2018

Contributor

Sorry, but I am not quite clear exactly what you would like (certainly the titles on the plots are a little tight). I've made some adjustments, but would welcome any further clarification as I suspect I am missing something here.

Edit: Ah -- you are referring to examples/cluster/plot_agglomerative_clustering.py I suspect. I can certainly fix that.

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Jan 19, 2018

Comments as I go (sorry, as I can get interrupted any time, I will paste the comments here in separate messages):

In the documentation, in the clustering.rst, in the warning box "Connectivity constraints with average and complete linkage", "single" should be added to the list of linkages that are brittle.

In the corresponding example, "plot_agglomerative_clustering.py", there also needs to add "single" to each enumeration containing complete and average. It would be useful to stress that single linkage is even more brittle than complete and average linkage.

return result_arr
def single_linkage_label(L):

This comment has been minimized.

@GaelVaroquaux

GaelVaroquaux Jan 19, 2018

Member

This seems very much like a pure Python function (it has not typing information). Any reason to have it in a Cython file?

This comment has been minimized.

@lmcinnes

lmcinnes Jan 19, 2018

Contributor

I was keeping it together with the private cythonized function above -- this is the future-proof wrapper in case future users wish to use the routine safely (i.e. with appropriate checks). It is not currently called at all. I am certainly happy to move it, but it is not clear to me what the appropriate place is at this time.

from scipy.sparse.csgraph import minimum_spanning_tree
# explicitly cast connectivity to ensure safety
connectivity = connectivity.astype('float64')

This comment has been minimized.

@GaelVaroquaux

GaelVaroquaux Jan 19, 2018

Member

Would it be possible to use fused type and support both float64 and float32, in the interest of memory? We've been slowly but surely trying to make it so that in scikit-learn there is support of 32 and 64 bit in the interest of memory. It also tends to make code faster (data fits more in CPU cache).

This comment has been minimized.

@lmcinnes

lmcinnes Jan 19, 2018

Contributor

I would certainly be willing to look into that (I don't currently know how, but presume it can't be too hard). As @jakevdp pointed out, however, the first thing the minimum_spanning_tree will do is convert to float64 so ultimately it is going to end up converted in a few lines anyway; by converting here we can do the finessing of zero distances without getting into type specific difficulties (in particular integer types). Is it worth using the fused type here and then letting scipy do the conversion in minimum_spanning_tree?

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Jan 19, 2018

Awesome PR!!! Thanks you so much.

I made a bunch of comments, but I finished reviewing it. Most of them are minor. The comment on the fused type is the only non trivial one.

I am +1 for merge after those comments have been addressed.

@lmcinnes

This comment has been minimized.

Contributor

lmcinnes commented Jan 19, 2018

I've commented on points of discussion; the rest are straightforward and I'll try to get to them tonight or tomorrow.

Most of them were addressed easily, but clarification of a few points would be helpful. Thanks for all the feedback.

@@ -567,7 +567,7 @@ considers at each step all the possible merges.
number of features. It is a dimensionality reduction tool, see
:ref:`data_reduction`.
Different linkage type: Ward, complete, average and single linkage
Different linkage type: Ward, complete, average, and single linkage
------------------------------------------------------------------

This comment has been minimized.

@GaelVaroquaux

GaelVaroquaux Jan 22, 2018

Member

Sorry to always come back with a comment, but haven't you forgotten to extend the line below?

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Jan 22, 2018

@lmcinnes : point taken on the suggestion of cython fused types.

I'll address the last cosmetic comments myself and merge. We have enough +1s on this one. Thanks a lot, this is a big deal!!

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Jan 22, 2018

I've pushed the cosmetic changes. I'll merge once travis is green.

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Jan 22, 2018

All checks have passed. Merging! Hurray!

@GaelVaroquaux GaelVaroquaux merged commit 8233829 into scikit-learn:master Jan 22, 2018

8 checks passed

ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: python2 Your tests passed on CircleCI!
Details
ci/circleci: python3 Your tests passed on CircleCI!
Details
codecov/patch 100% of diff hit (target 96.19%)
Details
codecov/project 96.5% (+0.31%) compared to 229183b
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
lgtm analysis: Python No alert changes
Details
@amueller

This comment has been minimized.

Member

amueller commented Jan 22, 2018

OH YEAH!!!

@GaelVaroquaux

This comment has been minimized.

Member

GaelVaroquaux commented Jan 22, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment