Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Leiden now included in python-igraph #1053

Closed
vtraag opened this issue Feb 15, 2020 · 29 comments · Fixed by #2815
Closed

Leiden now included in python-igraph #1053

vtraag opened this issue Feb 15, 2020 · 29 comments · Fixed by #2815
Assignees
Milestone

Comments

@vtraag
Copy link

vtraag commented Feb 15, 2020

The Leiden algorithm is now included in the latest release of python-igraph, version 0.8.0. I believe this alleviates the need to depend on the leidenalg packages. The Leiden algorithm provided in python-igraph is substantially faster than the leidenalg package. It is simpler though, providing fewer options, but I believe the more extensive options of the leidenalg package are not necessarily needed for the purposes of scanpy. We provide binary wheels on PyPI and binaries for conda are available from the conda-forge channel, also for Windows.

@ivirshup
Copy link
Member

Thanks for the update! I'm not sure if we'll be able to migrate very easily though. We allow users to choose the quality function, and use the leidenalg.RBConfigurationVertexPartition as the default. We've also been considering using the multiplex partitioning methods.

  • Do you think the performance improvements will also be implemented in leidenalg?
  • Is modularity with a resolution parameter equivalent to leidenalg.RBConfigurationVertexPartition?

@vtraag
Copy link
Author

vtraag commented Feb 16, 2020

The Leiden algorithm from igraph only implements two quality functions: CPM and modularity. Both indeed use resolution parameters, and are the equivalent of CPMVertexPartition and RBConfigurationVertexPartition. The reason the igraph implementation is more performant than leidenalg is exactly because it does not provide other quality functions or a multiplex approach. If you need the other quality functions (such as significance or surprise), you will need to stick to leidenalg. So if you need those other quality functions, or a multiplex approach, you might still want to stick to the leidenalg package after all.

@ivirshup
Copy link
Member

If we can cleanly switch to the igraph implementation for modularity with weights, it could make sense for that to be the default. Any chance you could point me to some benchmarks on performance? An initial test looks very impressive!

@vtraag
Copy link
Author

vtraag commented Feb 17, 2020

I haven't performed an in-depth benchmark comparison. But results from a single run of modularity detection on an example (a Facebook graph) is sufficiently revealing I think:

Running Leiden 0.7.0.post1+71.g14ba1e4.dirty
Running igraph 0.8.0
Read graph (n=63731,m=817035), starting community detection.
leidenalg: t=8.048258741036989, m=0.6175825273363675
igraph community_leiden: t=1.159165252931416, m=0.6298702028415605

This is only a relatively small graph, and the difference is likely to be even bigger for larger graphs. Perhaps the igraph Leiden algorithm can indeed be the default, with leidenalg being an optional choice or something?

@mezwick
Copy link

mezwick commented Feb 14, 2022

I have a large single cell set, where i'd love to speed up the Leiden clustering.

Is it the igraph version of leiden implemented in scanpy? Just wondering if this might be a potential area for speed gains.

Thanks!

@ivirshup
Copy link
Member

This hasn't been implemented yet, but a pull request would be welcome.

There would also have to be documentation about changing results and how to get previous behavior. Some benchmarks of performance would also be great.

@mezwick
Copy link

mezwick commented Feb 14, 2022

So there are definite speed gains to be had with the igraph implementation of Leiden clustering. But the results are not exactly the same.

I have run igraph straight from adata by piggy backing scanpy utility functions with the following code...

adjacency = sc._utils._choose_graph(adata, obsp=None, neighbors_key=None)
g = sc._utils.get_igraph_from_adjacency(adjacency)
clustering = g.community_leiden(objective_function='modularity', resolution_parameter=0.15)
adata_analyse.obs['leiden_igraph'] = pd.Series(clustering.membership, dtype='category', index=adata_analyse.obs.index)

On my spatial data (only mention this to explain the poor cell-cell separation in the results!) igraph is 5.86x faster at clustering a 185,000 cell dataset vs. sc.tl.leiden(adata, resolution=0.15). The methods output different cluster numbers, but the results broadly make sense for both outputs biologically speaking.
image

On the PBMC 3k dataset from scanpy tutorials, there is no practically speed difference (too small a dateset to matter). The cluster number output from both methods is the same for both methods. There are a few differences in cell cluster assignment though.
image

@mezwick
Copy link

mezwick commented Feb 15, 2022

The Leiden algorithm is now included in the latest release of python-igraph, version 0.8.0. I believe this alleviates the need to depend on the leidenalg packages. The Leiden algorithm provided in python-igraph is substantially faster than the leidenalg package. It is simpler though, providing fewer options, but I believe the more extensive options of the leidenalg package are not necessarily needed for the purposes of scanpy. We provide binary wheels on PyPI and binaries for conda are available from the conda-forge channel, also for Windows.

I have now done a speed comparison with adata object of 1.85 million cells.

igraph on adata as implemented above ran in 33 minutes vs sc.tl.leiden() which took ~14 hours

@mezwick
Copy link

mezwick commented Feb 15, 2022

@vtraag @ivirshup I accidentally ran the igraph_community_leiden in a for loop, 4 times with the same settings and noticed the leiden output differed slightly each time. Is there a parameter to set to ensure the same output each time? randomseed maybe?

More generally, any advice on which settings to implement with the igraph_community_leiden call to replicate sc.tl.leiden? It is not clear to me what some of the sc.tl.leiden default calls are to leidenalg.

Thanks!

Thanks!

@vtraag
Copy link
Author

vtraag commented Feb 15, 2022

Good to see the large speed gains! Do note that g.community_leiden(objective_function='modularity', resolution_parameter=0.15) won't use any edge weights by default. Possibly, sc.tl.leiden(), might use edge weights? This might explain some of the discrepancies you might see. Another difference is that community_leiden only runs on undirected graphs, while leidenalg will also work on directed graphs, which makes a difference when using modularity. This could perhaps also explain some of the discrepancies.

@vtraag
Copy link
Author

vtraag commented Feb 15, 2022

I accidentally ran the igraph_community_leiden in a for loop, 4 times with the same settings and noticed the leiden output differed slightly each time. Is there a parameter to set to ensure the same output each time? randomseed maybe?

Yes, you can set a random seed. You can set the RNG in igraph using set_random_number_generator. When simply passing the standard random Python library, you can set the seed using random.seed.

@LuckyMD
Copy link
Contributor

LuckyMD commented Feb 15, 2022

I'm not sure the objective should be to get the same output from two runs. I recall louvain having a low average NMI (below 0.3 or sth like that?) for multiple runs at different random seeds. In the end this is an NP-hard problem with a good heuristic solution that is affected by the random ordering of nodes. Maybe it would be more useful to look at multiple runs of sc.tl.leiden and multiple runs of leiden from python-igraph and compare the outputs in terms of NMI to one another or to some baseline label assignment?

@vtraag
Copy link
Author

vtraag commented Feb 15, 2022

Getting identical output from two runs without using the same seed is unlikely in many cases indeed. In some cases, you really do want to get identical output, so then setting a seed could be useful.

In addition, it would be good to ensure that at least the approach is conceptually the same for both the python-igraph implementation and sc.tl.leiden. But the suggestion of taking multiple runs and compare them makes sense!

@ivirshup
Copy link
Member

@vtraag, could you comment on what about this implementation makes it faster? I'm wondering how much the speed gains could be from number of iterations.

Some other notes:

  • The typical "connectivity graph" we get from UMAP is symmetric, so not being directed shouldn't be an issue
  • We do use weights by default, which seems to give more stable results (especially when number of neighbors changes)

@vtraag
Copy link
Author

vtraag commented Feb 15, 2022

The leidenalg package was originally built for flexibility, and you can easily plugin new quality functions. As a result, some of the admin stuff is being done less efficiently than it could be done. In igraph, there is less flexibility, so that the implementation can be made more efficient.

Additionally, some of the iteration over neighbours in the leidenalg package is less efficient than how it is implemented in igraph at the moment. This could be made more efficient though, but it is something that requires quite some rewriting, for which I would first need to find the time. I'm not sure how large the speed gains of this would be immediately.

@mezwick
Copy link

mezwick commented Feb 15, 2022

For the 3k test dataset, introducing edge weights with

import random
random.seed(1234) 

adjacency = sc._utils._choose_graph(adata, obsp=None, neighbors_key=None)
g = sc._utils.get_igraph_from_adjacency(adjacency)
clustering = g.community_leiden(objective_function='modularity', weights='weight')
adata.obs['leiden_igraph_edge_weighted'] = pd.Series(clustering.membership, dtype='category', index=adata.obs.index)

Leads to a more similar clustering to sc.tl.leiden.

Setting the seed makes all reproducible.
image

Including igraph edge weights does not seem to impact run times on my larger datasets vs. igraph without weights.

@ivirshup
Copy link
Member

I suspect a lot of the runtime increase may come from the number of iterations.

Using the default value, n_interations=2, g.community_leiden("modularity", weights="weight") took ~2.5 minutes.

Using n_iterations=-1, which is what scanpy does, g.community_leiden("modularity", weights="weight", n_iterations=-1) took ~1.5 hours.

I am unsure why we set n_iterations=-1, since that's not what leidenalg does by default. I'm not seeing much discussion of why it's done this way in the issue (#350) or PR (#361)...

@vtraag
Copy link
Author

vtraag commented Feb 15, 2022

Ah right, you meant those iterations, of course. Yes, that definitely can make a large difference! When comparing the speed of both implementations, indeed the number of iterations should of course be identical.

@dawe
Copy link
Contributor

dawe commented Feb 15, 2022

@vtraag @ivirshup I accidentally ran the igraph_community_leiden in a for loop, 4 times with the same settings and noticed the leiden output differed slightly each time. Is there a parameter to set to ensure the same output each time? randomseed maybe?

More generally, any advice on which settings to implement with the igraph_community_leiden call to replicate sc.tl.leiden? It is not clear to me what some of the sc.tl.leiden default calls are to leidenalg.

Thanks!

Thanks!

As mentioned below, you should set the RNG. AFAIK scanpy does that by default.

@ivirshup
Copy link
Member

It would probably good to see how important the n_iterations parameter is for our data. Hopefully after the first couple iterations it's only a few points of the million shuffling around per iteration. I'll try to take a closer when I can, but basically need to try something like:

def iterativley_cluster(
    g: igraph.Graph,
    *,
    n_iterations: int = 10,
    random_state: int = 0,
    leiden_kwargs: dict = {}
) -> list:
    import random
    random.seed(random_state)

    _leiden_kwargs = {"objective_function": "modularity", "weights": "weight"}
    _leiden_kwargs.update(leiden_kwargs)

    partition = g.community_leiden(n_iterations=1, **_leiden_kwargs)

    steps = [partition]
    for _ in range(n_iterations-1):
        partition = g.community_leiden(n_iterations=1, initial_membership=partition.membership, **_leiden_kwargs)
        steps.append(partition)

    return steps

My suspicion (and hope) would be that unstable clusters / points are the ones that drag on the optimization process. E.g. groups that aren't maintained when you change the random seed also aren't maintained through later iterations.

@mezwick
Copy link

mezwick commented Feb 17, 2022

Setting n_iterations=-1 in g.community_leiden certainly impacts run time (vs. default n_iterations=2), making runtimes more similar to sc.tl.leiden(). For large datasets though, run times with g.comunity_leiden still appear faster.

The average of 4 leiden runs on my 185,000 cell subsampled dataset:
sc.tl.leiden, 11.5 minutes
g.community_leiden, 9.5 minutes

1 leiden run on my 1,850,000 cell subsampled dataset:
sc.tl.leiden, 11 hours, 26 minutes
g.community_leiden, 7 hours, 30 minutes

@vtraag
Copy link
Author

vtraag commented Feb 22, 2022

@mezwick, I don't fully understand something. The timings you report seem to be lower for sc.tl.leiden than for g.community_leiden, suggesting that g.community_leiden is slower than sc.tl.leiden, but in your comments you say that g.community_leiden appears faster. Am I misunderstanding something? Could you perhaps clarify?

@mezwick
Copy link

mezwick commented Feb 22, 2022

Sorry, i made a critical typo in the time reports, where i listed the functions the wrong way round. I have updated the comment to correct this.

To be clear. g.community_leiden is faster than sc.tl.leiden in my case, particulalrly for large datasets.

Setting n_iterations=-1 in g.community_leiden certainly impacts run time (vs. default n_iterations=2), making runtimes more similar to sc.tl.leiden(). For large datasets though, run times with g.comunity_leiden still appear faster.

The average of 4 leiden runs on my 185,000 cell subsampled dataset: sc.tl.leiden, 11.5 minutes g.community_leiden, 9.5 minutes

1 leiden run on my 1,850,000 cell subsampled dataset: sc.tl.leiden, 11 hours, 26 minutes g.community_leiden, 7 hours, 30 minutes

@mezwick
Copy link

mezwick commented Apr 14, 2022

Hi @vtraag @ivirshup,

I have been unable to get the leiden clustering to run on a large 18.5 million cell dataset with either sc.tl.leiden() or the python-igraph implementation outlined above. Smaller subsets of the data run fine. I think the problem lies somewhere with weights.

When i run

obs_name = f"leiden {leiden_suffix}"
sc.tl.leiden(adata, resolution=0.1, n_iterations=2, key_added=obs_name)

I get the following traceback

Traceback (most recent call last):
  File "/opt/notebooks/output/../scripts/sc_cluster.py", line 92, in <module>
    adjacency = sc._utils._choose_graph(adata, obsp=None, neighbors_key=None)
  File "/opt/conda/envs/scanpy_py3pt9/lib/python3.9/site-packages/scanpy/_utils/__init__.py", line 219, in get_igraph_from_adjacency
    g.es['weight'] = weights
SystemError: /opt/conda/conda-bld/python-split_1649141344976/work/Objects/listobject.c:138: bad argument to internal function

When i run the python-igraph leiden implementation like so...

obs_name = f"leiden {leiden_suffix}"
g = sc._utils.get_igraph_from_adjacency(adjacency)
clustering = g.community_leiden(
    objective_function="modularity", 
    resolution_parameter=0.1,            
    weights = 'weight',
    n_iterations=2
    )
adata.obs[obs_name] = (
    pd.Series(
        clustering.membership, 
        dtype='category', 
        index=adata.obs.index
        )
    )

I get the following, similar traceback

UMAP leiden clustering resolution 0.1 commenced 12:27:06.619473
running Leiden clustering
Traceback (most recent call last):
  File "/opt/notebooks/output/../scripts/sc_cluster.py", line 88, in <module>
    obs_name = f"leiden {leiden_suffix}"
  File "/opt/conda/envs/scanpy_py3pt9/lib/python3.9/site-packages/scanpy/tools/_leiden.py", line 129, in leiden
    g = _utils.get_igraph_from_adjacency(adjacency, directed=directed)
  File "/opt/conda/envs/scanpy_py3pt9/lib/python3.9/site-packages/scanpy/_utils/__init__.py", line 219, in get_igraph_from_adjacency
    g.es['weight'] = weights
SystemError: /opt/conda/conda-bld/python-split_1649141344976/work/Objects/listobject.c:138: bad argument to internal function

The script runs completely fine when I subsample the adata with sc.pp.neighbours. So far, i have managed to run 0.5 fraction of the cells (9.25 million cells). On the cluster i am trying to run this on, memory and processing power are not limiting.

Any advice on what might be going wrong here?

Thanks!

@mezwick
Copy link

mezwick commented Apr 19, 2022

This issue appears due to a memory overflow on the number of edges in my graph. Fixing will probably need to wait till igraph 0.10, which should handle 64-bit integers, and so be able to handle far larger graphs
igraph/python-igraph#531 (comment)

@vtraag
Copy link
Author

vtraag commented Apr 20, 2022

When the C core igraph version 0.10 is released, including a new release of the Python interface building on this new version, i.e. including 64-bit support, I will also update the leidenalg implementation to follow suit. In principle, leidenalg is already working with 64-bit integers, but since it builds on igraph, it is limited by that.

@adamgayoso
Copy link
Member

@ivirshup if it's still faster with the same value of n_iterations, then why not make the switch?

@ilan-gold
Copy link
Contributor

ilan-gold commented Jan 11, 2024

Could also look at using https://igraph.org/c/doc/igraph-Layout.html#igraph_layout_umap_compute_weights while we're at it with python-igraph

@ilan-gold
Copy link
Contributor

ilan-gold commented Jan 15, 2024

@ivirshup The code diff is here but I'll explain the logic of the changes.

  1. igraph's implementation does not allow directed graphs; we throw a ValueError if someone tries to do directed + igraph.
  2. igraph's default resolution parameter is 2, so that changed as well. I don't think we should swap it back to -1 for leidenalg.
  3. Of course use_igraph is now True (and a new argument).

I'll look into some larger datasets. Also there is no plotting test from what I see. Should we add that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging a pull request may close this issue.

7 participants