Clustering with leidenalg #350

ktpolanski · 2018-11-07T18:38:32Z

Hello,

It would appear that louvain-igraph has been obsoleted in favour of leidenalg, and the author makes a persuasive case as to the superiority of the new approach. To my untrained eye, the algorithm is conceptually similar to the Louvain modification used by Seurat, but introduces an extra collapsed network refinement step.

it should be easy to support this in Scanpy - the syntax appears to be identical to the old louvain innards, and I was able to construct a very minimal dummy function for testing by taking the key bits of sc.tl.louvain() and replacing louvain. with leidenalg.:

import leidenalg
import numpy as np
import pandas as pd
from scanpy import utils
from natsort import natsorted

def leiden(adata, use_weights=False, resolution=1, iterations=-1):
	g = utils.get_igraph_from_adjacency(adata.uns['neighbors']['connectivities'], directed=True)
	weights = None
	if use_weights:
		weights = np.array(g.es["weight"]).astype(np.float64)
	part = leidenalg.find_partition(
		g, leidenalg.RBConfigurationVertexPartition, 
		resolution_parameter = resolution, weights = weights, 
		n_iterations = iterations,
	)
	groups = np.array(part.membership)
	adata.obs['louvain'] = pd.Categorical(
		values=groups.astype('U'),
		categories=natsorted(np.unique(groups).astype('U')),
	)

As such, replacing any louvain. with leidenalg. in sc.tl.louvain() would do most of the work. Probably the only new thing that would need support would the the n_iterations parameter in leidenalg.find_partition(). The default value is 2, positive values control how many passes of the algorithm are performed. -1 just makes it run until it fails to improve the clustering.

The text was updated successfully, but these errors were encountered:

fidelram · 2018-11-08T15:02:24Z

Thanks for the information. Very interesting read. From the paper it seems that the Leiden method indeed is superior to the louvain method. However, instead of replacing the louvain function I think is better to explicitly create a leiden clustering function as you have done and you can add the n_iterations parameter.

I would suggest to add a PR with this function as part of tools, but because of the new dependency required I am not sure. @falexwolf any suggestions?

Meanwhile I will give it a try.

ktpolanski · 2018-11-08T15:22:40Z

In that case, it'd probably be easiest to just add a 'leiden' flavour to sc.tl.louvain(). Due to the overlap of the syntax, you may even get away with explicitly recycling the existing Louvain code in the function and just swap the package behind the scenes:

if flavor == 'vtraag':
	import louvain
elif flavor ==  'leiden':
	import leidenalg as louvain

flying-sheep · 2018-11-08T15:37:38Z

I mean, @vtraag is is the person I’d believe when asked which algorithm is superior, so we could

add sc.tl.leiden as an alternative that doesn’t have a flavour argument.
make leidenalg a dependency and louvain-igraph an optional one.
when calling sc.tl.louvain (no matter the flavor used), emit a DeprecationWarning('We recommend to use `sc.tool.leiden` instead. Refer to its documentation for details')

This meets the following goals:

education: people will learn why we recommend the new function
ease of use: no weird errors pop up suddenly

reproducibility: If louvain-igraph is installed, the code works exactly as before (with an added warning), else it crashes. we could do the following within sc.tl.louvain to help users:

try:
    import louvain
except ImportError:
    raise ImportError(
        'The package “louvain-igraph“ is not installed. '
        'Try using `sc.tl.leiden` in case you do not need '
        'to reproduce results produced using `sc.tl.louvain`'
    )

fidelram · 2018-11-08T15:55:54Z

I just did a quick comparison between louvain and leiden algorithms using the pbmc68k_reduced dataset:

adata = sc.datasets.pbmc68k_reduced()
sc.pp.neighbors(adata)
leiden(adata,  use_weights=True)
sc.tl.louvain(adata, use_weights=True)
sc.pl.umap(adata, color=['louvain', 'leiden'], s=50, alpha=0.6, ncols=2)

The results are almost identical. However, while in the louvain results some cells appear in the wrong cluster (red circle) this is not the case for the leiden method.

I should note that the installation of leidenalg didn't go smooth and took me a while to set it up.

vtraag · 2018-11-08T16:41:32Z

Sorry to hear it took you some time to set it up. I've created a recipe for conda-forge channel (conda-forge/staged-recipes#6911), once merged that should hopefully simplify some of the installation.

wflynny · 2018-11-08T16:53:26Z

@vtraag FWIW, pip install leidenalg worked without a hitch for me (CentOS 6.5, Python 3.6.6 in a relatively empty conda env: scanpy + scikit stack).

ktpolanski · 2018-11-08T16:56:03Z

Likewise, I just ran pip install leidenalg on an OSX machine which already had scanpy and louvain on it, and it set up effortlessly.

vtraag · 2018-11-08T17:03:06Z

Good to hear also positive installation results!

falexwolf · 2018-11-12T01:54:39Z

I'd definitely make a new function tl.leiden; then @vtraag 's package can also be taken credit of in an appropriate way, using the reference to the recent arxiv.

Also, backwards compat is guaranteed and @flying-sheep's two other points (eduction and ease-of-use) are met, too.

So, @ktpolanski, would you make a PR for a function tl.leiden? Of course, it would be nice if didn't duplicate all code in the louvain function, but that's up to you.

I would not yet make the leidenalg package a at this stage, but transition to that either in Scanpy 1.4 or later. People can definitely achieve decent results with the current setup and we don't want everyone to change everything. After the PR, those who want can slowly transition to the new clustering algorithm. After a major release, we can broadly advertise the package.

@vtraag: Great to see your new preprint and package. I thought that your Louvain implementation already yielded very well-connected communities and even made a remark on that here in the first version more than a year ago. But great, in hard cases, I'd expect better results using your new algorithm for partitioning the graph...

LuckyMD · 2018-11-12T09:28:19Z

I think the disconnected communities in Louvain should have less of an effect in KNN graphs as the degree distribution is a lot more regular. This issue appears to occur a lot more frequently when the node degrees in a community are quite different (or at least this is what I found on PPI networks). Nonetheless, it's a good idea to upgrade I reckon.

ktpolanski · 2018-11-12T12:57:41Z

Sure thing, I can take a shot at a function.

vtraag · 2018-11-13T08:55:45Z

Thanks @falexwolf ! Indeed, @LuckyMD, it is reasonable to expect that graphs with a more regular degree suffer less from this problem.

LuckyMD · 2018-11-13T15:32:19Z

@vtraag I believe we emailed about this briefly in 2014 when I noticed I was getting disconnected communities with louvain ;).

vtraag · 2018-11-15T16:39:35Z

@LuckyMD Yes, I was recently reminded of that by Mason Porter. I saw you also wrote a small piece on it in your thesis. I had already forgotten about our email exchange again. But it's nice we can continue the conversation 4 years later here :)

The approach I took back then was quite straightforward: simply separate the different connected components. This of course ignores the larger issue that even when communities are not completely disconnected, they may still be quite badly connected. The approach taken in the Leiden algorithm is quite different, and has much nicer properties I think.

LuckyMD · 2018-11-15T17:06:02Z

@vtraag The Leiden algorithm is definitely a much better solution and the preprint explains nicely how it's not just about disconnecting communities but a more general problem. I needed a quick fix at the time, which worked well enough. Thanks again for that ;).

I did spend some more time on trying to understand the process of getting these disconnected communities back in 2014, and hence put a bit in my thesis. As it was a bit off-topic for me, that section was shortened by revisions unfortunately. I'm surprised you found it actually.... although I'm glad someone read it :).

Also, I thought I saw an abstract of yours in some conference proceedings a couple years ago about the Leiden algorithm already. It seems to have been in the workings for quite some time. Congratulations on getting it out!

vtraag · 2018-11-15T20:59:48Z

@LuckyMD Thanks! Yes, the algorithm has been in development for quite some time. I presented it already in 2016 at a conference in Amsterdam, and after that in several other places. It kept changing in relatively minor ways, although that also affected the exact guarantees it could offer.

Unfortunate that the section on disconnected communities got trimmed down! Would you have more extensive results described somewhere else?

LuckyMD · 2018-11-15T22:02:24Z

@vtraag I have quite a few more examples of communities that were disconnected that I went through in my notes (and I should have one in an earlier draft of the thesis)... including a community that can only be reconnected with 2 intermediate nodes (though this was a rare case). The general trend was that the "removed" nodes were always of a higher degree than any node remaining in the disconnected community. Just by the number of alternative local merger options for hub nodes that would make sense I guess. If you're interested I can see if I can find the older versions.

falexwolf · 2018-11-19T01:39:13Z

Ah, I was offline from this.... Great that you got together! Just because I should have mentioned this before: the comment in the paga manuscript was based on a remark by @LuckyMD, which is also acknowledged in the paper, but I should have mentioned it above, too, of course... 🙈 I typed this in a hurry...

falexwolf · 2018-11-19T01:41:28Z

These days, I can only make it by going through 70 github issues in a row after the twins finally sleep... there will be times with more thoughtful comments from my side, again, I hope...

vtraag · 2018-11-19T07:40:40Z

Interesting example, @LuckyMD, indeed I'd be interested to reading more about it (you can email me if you want). Thank you for letting me know @falexwolf that the disconnected comunities are also mentioned in the paper. I'll be sure to point to both your thesis @LuckyMD and the paper @falexwolf in a revised version.

Interestingly, in an earlier implementation of the Louvain algorithm (https://launchpad.net/louvain), I indeed included this splitting of disconnected communities in their connected components before the aggregation (although it seems I never pushed the code to launchpad and only shared with @LuckyMD). However, I had forgotten about that problem when developing the new package https://github.com/vtraag/louvain-igraph, so I never split communities in that package. You did that yourself in that implementation? Or at least, that would be my guess, if I read the paper correctly.

P.S. I sympathise @falexwolf, I never had twins, but single babies already take up quite some of your time. Good luck! Surely it will get less stressful over time...

falexwolf · 2018-11-20T20:29:20Z

Maybe I didn't get what you wrote above and you're not talking about the PAGA paper. But in that, I mention the louvain algorithm (and cite your louvain-igraph package in particular) as the primary candidate for a clustering algorithm that produces connected clusters in the knn-graph representation of the data (which was originally represented in some feature space). I made this statement to highlight the difference to non-graph based algorithms like k-means, dbscan, etc. Already quite more than a year ago, after one of my talks, Malte highlighted the fact that your package does indeed a good job of producing connected communities and that he had had some communication with you about this. So, I thought, that's worth a note in the supplement. But it's good to know that the louvain-igraph never actually had this fix. In practice though, it definitely does a very good job for us.

Still, really cool that this has been thoroughly addressed in leiden. So, we can all transition to that at some point.

falexwolf · 2018-11-20T20:44:16Z

P.S. And thank you for the kind words about the babies, @vtraag! Yes, it already got a lot better after the first couple of months in the nights... but at daytime, you still don't get one single quiet moment... I expect this to go on like this for another year or so... let's see 😄

vtraag · 2018-11-23T11:02:22Z

What I specifically referred to is this part:

The original implementation of the Louvain algorithm could lead to disconnected communities when nodes were assigned to a common community with a single node connecting two or more parts of this community. When the central node was reconsidered by the algorithm and moved to a different community, a disconnected community remained. This unexpected behaviour could be fixed by splitting disconnected communities before each community aggregation step in the implementation of [47].

Here [47] refers to the louvain-igraph package. My point was that in that package I do not split disconnected communities before aggregation, contrary to the code that I shared with @LuckyMD at one point.

Anyway, good to hear the louvain-igraph package already did a good job for you! Looking forward to seeing the Leiden algorithm implemented in the scanpy package.

flying-sheep · 2018-11-23T11:08:44Z

Well, “implemented”… I’d say “wrapped”.

vtraag · 2018-11-23T11:33:58Z

Ah well, as long as it works :)

falexwolf · 2018-11-25T21:19:44Z

My point was that in that package I do not split disconnected communities before aggregation, contrary to the code that I shared with @LuckyMD at one point.

@vtraag Yes, I understood that from your previous post, that's why wrote:

But it's good to know that the louvain-igraph never actually had this fix.

So, yes, the note you pulled out from the paper (the revised version of it was worded by @LuckyMD) gives louvain-igraph wrong credit. So, if I'm able to make another revision, I'll replace it with the reference to the leiden preprint. Nonetheless, louvain-igraph remains an awesome package and I'll maintain all other references to it. 😄

LuckyMD · 2018-11-26T09:57:37Z

Yeah... sorry about that @falexwolf. I assumed the the python implementation had the same fix as the C++ version I got from you @vtraag.

vtraag · 2018-11-26T21:40:05Z

@falexwolf thanks! And no worries about it, I just wanted to make clear that louvain-igraph didn't contain that fix.

Indeed @LuckyMD, I would have assumed the same myself 😄. But alas, the problem got lost in the mists of time (or well, my mist of time: I forgot about it), until it resurfaced in another context.

vtraag mentioned this issue Nov 15, 2018

Do not know how to install louvain vtraag/louvain-igraph#46

Closed

olgabot mentioned this issue Nov 15, 2018

readme updates chanzuckerberg/cellxgene#436

Merged

gokceneraslan closed this as completed Sep 7, 2019

AnnaChristina mentioned this issue Sep 13, 2021

Preprocessing & visualization introduction theislab/single-cell-best-practices#10

Closed

Zethson mentioned this issue Oct 15, 2021

Clustering theislab/single-cell-best-practices#31

Closed

ivirshup mentioned this issue Feb 15, 2022

Leiden now included in python-igraph #1053

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clustering with leidenalg #350

Clustering with leidenalg #350

ktpolanski commented Nov 7, 2018 •

edited by flying-sheep

fidelram commented Nov 8, 2018

ktpolanski commented Nov 8, 2018 •

edited

flying-sheep commented Nov 8, 2018 •

edited

fidelram commented Nov 8, 2018

vtraag commented Nov 8, 2018 •

edited

wflynny commented Nov 8, 2018

ktpolanski commented Nov 8, 2018

vtraag commented Nov 8, 2018

falexwolf commented Nov 12, 2018

LuckyMD commented Nov 12, 2018

ktpolanski commented Nov 12, 2018

vtraag commented Nov 13, 2018

LuckyMD commented Nov 13, 2018

vtraag commented Nov 15, 2018

LuckyMD commented Nov 15, 2018

vtraag commented Nov 15, 2018

LuckyMD commented Nov 15, 2018 •

edited

falexwolf commented Nov 19, 2018

falexwolf commented Nov 19, 2018

vtraag commented Nov 19, 2018

falexwolf commented Nov 20, 2018 •

edited

falexwolf commented Nov 20, 2018

vtraag commented Nov 23, 2018

flying-sheep commented Nov 23, 2018

vtraag commented Nov 23, 2018

falexwolf commented Nov 25, 2018

LuckyMD commented Nov 26, 2018

vtraag commented Nov 26, 2018

Clustering with leidenalg #350

Clustering with leidenalg #350

Comments

ktpolanski commented Nov 7, 2018 • edited by flying-sheep

fidelram commented Nov 8, 2018

ktpolanski commented Nov 8, 2018 • edited

flying-sheep commented Nov 8, 2018 • edited

fidelram commented Nov 8, 2018

vtraag commented Nov 8, 2018 • edited

wflynny commented Nov 8, 2018

ktpolanski commented Nov 8, 2018

vtraag commented Nov 8, 2018

falexwolf commented Nov 12, 2018

LuckyMD commented Nov 12, 2018

ktpolanski commented Nov 12, 2018

vtraag commented Nov 13, 2018

LuckyMD commented Nov 13, 2018

vtraag commented Nov 15, 2018

LuckyMD commented Nov 15, 2018

vtraag commented Nov 15, 2018

LuckyMD commented Nov 15, 2018 • edited

falexwolf commented Nov 19, 2018

falexwolf commented Nov 19, 2018

vtraag commented Nov 19, 2018

falexwolf commented Nov 20, 2018 • edited

falexwolf commented Nov 20, 2018

vtraag commented Nov 23, 2018

flying-sheep commented Nov 23, 2018

vtraag commented Nov 23, 2018

falexwolf commented Nov 25, 2018

LuckyMD commented Nov 26, 2018

vtraag commented Nov 26, 2018

ktpolanski commented Nov 7, 2018 •

edited by flying-sheep

ktpolanski commented Nov 8, 2018 •

edited

flying-sheep commented Nov 8, 2018 •

edited

vtraag commented Nov 8, 2018 •

edited

LuckyMD commented Nov 15, 2018 •

edited

falexwolf commented Nov 20, 2018 •

edited