Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clustering with leidenalg #350

Closed
ktpolanski opened this issue Nov 7, 2018 · 28 comments
Closed

Clustering with leidenalg #350

ktpolanski opened this issue Nov 7, 2018 · 28 comments

Comments

@ktpolanski
Copy link
Contributor

ktpolanski commented Nov 7, 2018

Hello,

It would appear that louvain-igraph has been obsoleted in favour of leidenalg, and the author makes a persuasive case as to the superiority of the new approach. To my untrained eye, the algorithm is conceptually similar to the Louvain modification used by Seurat, but introduces an extra collapsed network refinement step.

it should be easy to support this in Scanpy - the syntax appears to be identical to the old louvain innards, and I was able to construct a very minimal dummy function for testing by taking the key bits of sc.tl.louvain() and replacing louvain. with leidenalg.:

import leidenalg
import numpy as np
import pandas as pd
from scanpy import utils
from natsort import natsorted

def leiden(adata, use_weights=False, resolution=1, iterations=-1):
	g = utils.get_igraph_from_adjacency(adata.uns['neighbors']['connectivities'], directed=True)
	weights = None
	if use_weights:
		weights = np.array(g.es["weight"]).astype(np.float64)
	part = leidenalg.find_partition(
		g, leidenalg.RBConfigurationVertexPartition, 
		resolution_parameter = resolution, weights = weights, 
		n_iterations = iterations,
	)
	groups = np.array(part.membership)
	adata.obs['louvain'] = pd.Categorical(
		values=groups.astype('U'),
		categories=natsorted(np.unique(groups).astype('U')),
	)

As such, replacing any louvain. with leidenalg. in sc.tl.louvain() would do most of the work. Probably the only new thing that would need support would the the n_iterations parameter in leidenalg.find_partition(). The default value is 2, positive values control how many passes of the algorithm are performed. -1 just makes it run until it fails to improve the clustering.

@fidelram
Copy link
Collaborator

fidelram commented Nov 8, 2018

Thanks for the information. Very interesting read. From the paper it seems that the Leiden method indeed is superior to the louvain method. However, instead of replacing the louvain function I think is better to explicitly create a leiden clustering function as you have done and you can add the n_iterations parameter.

I would suggest to add a PR with this function as part of tools, but because of the new dependency required I am not sure. @falexwolf any suggestions?

Meanwhile I will give it a try.

@ktpolanski
Copy link
Contributor Author

ktpolanski commented Nov 8, 2018

In that case, it'd probably be easiest to just add a 'leiden' flavour to sc.tl.louvain(). Due to the overlap of the syntax, you may even get away with explicitly recycling the existing Louvain code in the function and just swap the package behind the scenes:

if flavor == 'vtraag':
	import louvain
elif flavor ==  'leiden':
	import leidenalg as louvain

@flying-sheep
Copy link
Member

flying-sheep commented Nov 8, 2018

I mean, @vtraag is is the person I’d believe when asked which algorithm is superior, so we could

  1. add sc.tl.leiden as an alternative that doesn’t have a flavour argument.
  2. make leidenalg a dependency and louvain-igraph an optional one.
  3. when calling sc.tl.louvain (no matter the flavor used), emit a DeprecationWarning('We recommend to use `sc.tool.leiden` instead. Refer to its documentation for details')

This meets the following goals:

  • education: people will learn why we recommend the new function

  • ease of use: no weird errors pop up suddenly

  • reproducibility: If louvain-igraph is installed, the code works exactly as before (with an added warning), else it crashes. we could do the following within sc.tl.louvain to help users:

    try:
        import louvain
    except ImportError:
        raise ImportError(
            'The package “louvain-igraph“ is not installed. '
            'Try using `sc.tl.leiden` in case you do not need '
            'to reproduce results produced using `sc.tl.louvain`'
        )

@fidelram
Copy link
Collaborator

fidelram commented Nov 8, 2018

I just did a quick comparison between louvain and leiden algorithms using the pbmc68k_reduced dataset:

adata = sc.datasets.pbmc68k_reduced()
sc.pp.neighbors(adata)
leiden(adata,  use_weights=True)
sc.tl.louvain(adata, use_weights=True)
sc.pl.umap(adata, color=['louvain', 'leiden'], s=50, alpha=0.6, ncols=2)

image

The results are almost identical. However, while in the louvain results some cells appear in the wrong cluster (red circle) this is not the case for the leiden method.

I should note that the installation of leidenalg didn't go smooth and took me a while to set it up.

@vtraag
Copy link

vtraag commented Nov 8, 2018

Sorry to hear it took you some time to set it up. I've created a recipe for conda-forge channel (conda-forge/staged-recipes#6911), once merged that should hopefully simplify some of the installation.

@wflynny
Copy link

wflynny commented Nov 8, 2018

@vtraag FWIW, pip install leidenalg worked without a hitch for me (CentOS 6.5, Python 3.6.6 in a relatively empty conda env: scanpy + scikit stack).

@ktpolanski
Copy link
Contributor Author

Likewise, I just ran pip install leidenalg on an OSX machine which already had scanpy and louvain on it, and it set up effortlessly.

@vtraag
Copy link

vtraag commented Nov 8, 2018

Good to hear also positive installation results!

@falexwolf
Copy link
Member

I'd definitely make a new function tl.leiden; then @vtraag 's package can also be taken credit of in an appropriate way, using the reference to the recent arxiv.

Also, backwards compat is guaranteed and @flying-sheep's two other points (eduction and ease-of-use) are met, too.

So, @ktpolanski, would you make a PR for a function tl.leiden? Of course, it would be nice if didn't duplicate all code in the louvain function, but that's up to you.

I would not yet make the leidenalg package a at this stage, but transition to that either in Scanpy 1.4 or later. People can definitely achieve decent results with the current setup and we don't want everyone to change everything. After the PR, those who want can slowly transition to the new clustering algorithm. After a major release, we can broadly advertise the package.

@vtraag: Great to see your new preprint and package. I thought that your Louvain implementation already yielded very well-connected communities and even made a remark on that here in the first version more than a year ago. But great, in hard cases, I'd expect better results using your new algorithm for partitioning the graph...

@LuckyMD
Copy link
Contributor

LuckyMD commented Nov 12, 2018

I think the disconnected communities in Louvain should have less of an effect in KNN graphs as the degree distribution is a lot more regular. This issue appears to occur a lot more frequently when the node degrees in a community are quite different (or at least this is what I found on PPI networks). Nonetheless, it's a good idea to upgrade I reckon.

@ktpolanski
Copy link
Contributor Author

Sure thing, I can take a shot at a function.

@vtraag
Copy link

vtraag commented Nov 13, 2018

Thanks @falexwolf ! Indeed, @LuckyMD, it is reasonable to expect that graphs with a more regular degree suffer less from this problem.

@LuckyMD
Copy link
Contributor

LuckyMD commented Nov 13, 2018

@vtraag I believe we emailed about this briefly in 2014 when I noticed I was getting disconnected communities with louvain ;).

@vtraag
Copy link

vtraag commented Nov 15, 2018

@LuckyMD Yes, I was recently reminded of that by Mason Porter. I saw you also wrote a small piece on it in your thesis. I had already forgotten about our email exchange again. But it's nice we can continue the conversation 4 years later here :)

The approach I took back then was quite straightforward: simply separate the different connected components. This of course ignores the larger issue that even when communities are not completely disconnected, they may still be quite badly connected. The approach taken in the Leiden algorithm is quite different, and has much nicer properties I think.

@LuckyMD
Copy link
Contributor

LuckyMD commented Nov 15, 2018

@vtraag The Leiden algorithm is definitely a much better solution and the preprint explains nicely how it's not just about disconnecting communities but a more general problem. I needed a quick fix at the time, which worked well enough. Thanks again for that ;).

I did spend some more time on trying to understand the process of getting these disconnected communities back in 2014, and hence put a bit in my thesis. As it was a bit off-topic for me, that section was shortened by revisions unfortunately. I'm surprised you found it actually.... although I'm glad someone read it :).

Also, I thought I saw an abstract of yours in some conference proceedings a couple years ago about the Leiden algorithm already. It seems to have been in the workings for quite some time. Congratulations on getting it out!

@vtraag
Copy link

vtraag commented Nov 15, 2018

@LuckyMD Thanks! Yes, the algorithm has been in development for quite some time. I presented it already in 2016 at a conference in Amsterdam, and after that in several other places. It kept changing in relatively minor ways, although that also affected the exact guarantees it could offer.

Unfortunate that the section on disconnected communities got trimmed down! Would you have more extensive results described somewhere else?

@LuckyMD
Copy link
Contributor

LuckyMD commented Nov 15, 2018

@vtraag I have quite a few more examples of communities that were disconnected that I went through in my notes (and I should have one in an earlier draft of the thesis)... including a community that can only be reconnected with 2 intermediate nodes (though this was a rare case). The general trend was that the "removed" nodes were always of a higher degree than any node remaining in the disconnected community. Just by the number of alternative local merger options for hub nodes that would make sense I guess. If you're interested I can see if I can find the older versions.

@falexwolf
Copy link
Member

Ah, I was offline from this.... Great that you got together! Just because I should have mentioned this before: the comment in the paga manuscript was based on a remark by @LuckyMD, which is also acknowledged in the paper, but I should have mentioned it above, too, of course... 🙈 I typed this in a hurry...

@falexwolf
Copy link
Member

These days, I can only make it by going through 70 github issues in a row after the twins finally sleep... there will be times with more thoughtful comments from my side, again, I hope...

@vtraag
Copy link

vtraag commented Nov 19, 2018

Interesting example, @LuckyMD, indeed I'd be interested to reading more about it (you can email me if you want). Thank you for letting me know @falexwolf that the disconnected comunities are also mentioned in the paper. I'll be sure to point to both your thesis @LuckyMD and the paper @falexwolf in a revised version.

Interestingly, in an earlier implementation of the Louvain algorithm (https://launchpad.net/louvain), I indeed included this splitting of disconnected communities in their connected components before the aggregation (although it seems I never pushed the code to launchpad and only shared with @LuckyMD). However, I had forgotten about that problem when developing the new package https://github.com/vtraag/louvain-igraph, so I never split communities in that package. You did that yourself in that implementation? Or at least, that would be my guess, if I read the paper correctly.

P.S. I sympathise @falexwolf, I never had twins, but single babies already take up quite some of your time. Good luck! Surely it will get less stressful over time...

@falexwolf
Copy link
Member

falexwolf commented Nov 20, 2018

Maybe I didn't get what you wrote above and you're not talking about the PAGA paper. But in that, I mention the louvain algorithm (and cite your louvain-igraph package in particular) as the primary candidate for a clustering algorithm that produces connected clusters in the knn-graph representation of the data (which was originally represented in some feature space). I made this statement to highlight the difference to non-graph based algorithms like k-means, dbscan, etc. Already quite more than a year ago, after one of my talks, Malte highlighted the fact that your package does indeed a good job of producing connected communities and that he had had some communication with you about this. So, I thought, that's worth a note in the supplement. But it's good to know that the louvain-igraph never actually had this fix. In practice though, it definitely does a very good job for us.

Still, really cool that this has been thoroughly addressed in leiden. So, we can all transition to that at some point.

@falexwolf
Copy link
Member

P.S. And thank you for the kind words about the babies, @vtraag! Yes, it already got a lot better after the first couple of months in the nights... but at daytime, you still don't get one single quiet moment... I expect this to go on like this for another year or so... let's see 😄

@vtraag
Copy link

vtraag commented Nov 23, 2018

What I specifically referred to is this part:

The original implementation of the Louvain algorithm could lead to disconnected communities when nodes were assigned to a common community with a single node connecting two or more parts of this community. When the central node was reconsidered by the algorithm and moved to a different community, a disconnected community remained. This unexpected behaviour could be fixed by splitting disconnected communities before each community aggregation step in the implementation of [47].

Here [47] refers to the louvain-igraph package. My point was that in that package I do not split disconnected communities before aggregation, contrary to the code that I shared with @LuckyMD at one point.

Anyway, good to hear the louvain-igraph package already did a good job for you! Looking forward to seeing the Leiden algorithm implemented in the scanpy package.

@flying-sheep
Copy link
Member

Well, “implemented”… I’d say “wrapped”.

@vtraag
Copy link

vtraag commented Nov 23, 2018

Ah well, as long as it works :)

@falexwolf
Copy link
Member

My point was that in that package I do not split disconnected communities before aggregation, contrary to the code that I shared with @LuckyMD at one point.

@vtraag Yes, I understood that from your previous post, that's why wrote:

But it's good to know that the louvain-igraph never actually had this fix.

So, yes, the note you pulled out from the paper (the revised version of it was worded by @LuckyMD) gives louvain-igraph wrong credit. So, if I'm able to make another revision, I'll replace it with the reference to the leiden preprint. Nonetheless, louvain-igraph remains an awesome package and I'll maintain all other references to it. 😄

@LuckyMD
Copy link
Contributor

LuckyMD commented Nov 26, 2018

Yeah... sorry about that @falexwolf. I assumed the the python implementation had the same fix as the C++ version I got from you @vtraag.

@vtraag
Copy link

vtraag commented Nov 26, 2018

@falexwolf thanks! And no worries about it, I just wanted to make clear that louvain-igraph didn't contain that fix.

Indeed @LuckyMD, I would have assumed the same myself 😄. But alas, the problem got lost in the mists of time (or well, my mist of time: I forgot about it), until it resurfaced in another context.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants