New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Leiden now included in python-igraph #1053
Comments
Thanks for the update! I'm not sure if we'll be able to migrate very easily though. We allow users to choose the quality function, and use the
|
The Leiden algorithm from |
If we can cleanly switch to the igraph implementation for modularity with weights, it could make sense for that to be the default. Any chance you could point me to some benchmarks on performance? An initial test looks very impressive! |
I haven't performed an in-depth benchmark comparison. But results from a single run of modularity detection on an example (a Facebook graph) is sufficiently revealing I think:
This is only a relatively small graph, and the difference is likely to be even bigger for larger graphs. Perhaps the |
I have a large single cell set, where i'd love to speed up the Leiden clustering. Is it the igraph version of leiden implemented in scanpy? Just wondering if this might be a potential area for speed gains. Thanks! |
This hasn't been implemented yet, but a pull request would be welcome. There would also have to be documentation about changing results and how to get previous behavior. Some benchmarks of performance would also be great. |
So there are definite speed gains to be had with the igraph implementation of Leiden clustering. But the results are not exactly the same. I have run igraph straight from adata by piggy backing scanpy utility functions with the following code...
On my spatial data (only mention this to explain the poor cell-cell separation in the results!) igraph is 5.86x faster at clustering a 185,000 cell dataset vs. On the PBMC 3k dataset from scanpy tutorials, there is no practically speed difference (too small a dateset to matter). The cluster number output from both methods is the same for both methods. There are a few differences in cell cluster assignment though. |
I have now done a speed comparison with adata object of 1.85 million cells. igraph on adata as implemented above ran in 33 minutes vs |
@vtraag @ivirshup I accidentally ran the More generally, any advice on which settings to implement with the Thanks! Thanks! |
Good to see the large speed gains! Do note that |
Yes, you can set a random seed. You can set the RNG in |
I'm not sure the objective should be to get the same output from two runs. I recall louvain having a low average NMI (below 0.3 or sth like that?) for multiple runs at different random seeds. In the end this is an NP-hard problem with a good heuristic solution that is affected by the random ordering of nodes. Maybe it would be more useful to look at multiple runs of |
Getting identical output from two runs without using the same seed is unlikely in many cases indeed. In some cases, you really do want to get identical output, so then setting a seed could be useful. In addition, it would be good to ensure that at least the approach is conceptually the same for both the |
@vtraag, could you comment on what about this implementation makes it faster? I'm wondering how much the speed gains could be from number of iterations. Some other notes:
|
The Additionally, some of the iteration over neighbours in the |
I suspect a lot of the runtime increase may come from the number of iterations. Using the default value, Using I am unsure why we set |
Ah right, you meant those iterations, of course. Yes, that definitely can make a large difference! When comparing the speed of both implementations, indeed the number of iterations should of course be identical. |
As mentioned below, you should set the RNG. AFAIK scanpy does that by default. |
It would probably good to see how important the def iterativley_cluster(
g: igraph.Graph,
*,
n_iterations: int = 10,
random_state: int = 0,
leiden_kwargs: dict = {}
) -> list:
import random
random.seed(random_state)
_leiden_kwargs = {"objective_function": "modularity", "weights": "weight"}
_leiden_kwargs.update(leiden_kwargs)
partition = g.community_leiden(n_iterations=1, **_leiden_kwargs)
steps = [partition]
for _ in range(n_iterations-1):
partition = g.community_leiden(n_iterations=1, initial_membership=partition.membership, **_leiden_kwargs)
steps.append(partition)
return steps My suspicion (and hope) would be that unstable clusters / points are the ones that drag on the optimization process. E.g. groups that aren't maintained when you change the random seed also aren't maintained through later iterations. |
Setting The average of 4 leiden runs on my 185,000 cell subsampled dataset: 1 leiden run on my 1,850,000 cell subsampled dataset: |
@mezwick, I don't fully understand something. The timings you report seem to be lower for |
Sorry, i made a critical typo in the time reports, where i listed the functions the wrong way round. I have updated the comment to correct this. To be clear.
|
I have been unable to get the leiden clustering to run on a large 18.5 million cell dataset with either When i run
I get the following traceback
When i run the
I get the following, similar traceback
The script runs completely fine when I subsample the adata with Any advice on what might be going wrong here? Thanks! |
This issue appears due to a memory overflow on the number of edges in my graph. Fixing will probably need to wait till igraph 0.10, which should handle 64-bit integers, and so be able to handle far larger graphs |
When the C core |
@ivirshup if it's still faster with the same value of |
Could also look at using https://igraph.org/c/doc/igraph-Layout.html#igraph_layout_umap_compute_weights while we're at it with |
@ivirshup The code diff is here but I'll explain the logic of the changes.
I'll look into some larger datasets. |
The Leiden algorithm is now included in the latest release of
python-igraph
, version 0.8.0. I believe this alleviates the need to depend on theleidenalg
packages. The Leiden algorithm provided inpython-igraph
is substantially faster than theleidenalg
package. It is simpler though, providing fewer options, but I believe the more extensive options of theleidenalg
package are not necessarily needed for the purposes ofscanpy
. We provide binary wheels on PyPI and binaries for conda are available from the conda-forge channel, also for Windows.The text was updated successfully, but these errors were encountered: