speed up define_clonotypes #368

Open
grst opened this issue Oct 9, 2022 · 0 comments · May be fixed by #470
grst commented Oct 9, 2022

Description of feature

The define_clonotypes function scales badly. There are two problems with it:

  • it could be faster (while it relies heavily on numpy, some parts are implemented in pure Python)
  • parallelization doesn't work properly with large data. Due to how multiprocessing is implemented in Python, parallelization involves a lot of copying. If parallelization worked properly, throwing enough cores at the problem would at least keep the runtime bearable.
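One way to avoid the copying overhead is to place the large read-only arrays in shared memory so workers attach by name instead of receiving pickled copies. A minimal sketch with Python's multiprocessing.shared_memory (all names and the matrix itself are illustrative; the worker logic here just sums the array to stand in for real neighbor computations):

```python
import numpy as np
from multiprocessing import shared_memory

# Hypothetical large distance matrix that every worker needs read access to.
dist = np.random.default_rng(0).random((1000, 1000)).astype(np.float32)

# Place the array in a shared memory block once; workers can then attach by
# name instead of receiving a pickled copy through multiprocessing's IPC.
shm = shared_memory.SharedMemory(create=True, size=dist.nbytes)
shared = np.ndarray(dist.shape, dtype=dist.dtype, buffer=shm.buf)
shared[:] = dist

def worker(name, shape, dtype):
    """What each worker process would run: attach to the existing block
    (zero-copy) and do some read-only work on it."""
    existing = shared_memory.SharedMemory(name=name)
    view = np.ndarray(shape, dtype=dtype, buffer=existing.buf)
    total = float(view.sum())  # placeholder for the real neighbor search
    del view                   # release the buffer before closing the block
    existing.close()
    return total

result = worker(shm.name, dist.shape, dist.dtype)
assert np.isclose(result, float(dist.sum()))

del shared  # drop the last view so the block can be closed cleanly
shm.close()
shm.unlink()
```

In a real setup, `worker` would be dispatched to a process pool with only the block's name, shape, and dtype as arguments, so no array data crosses process boundaries.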

Where's the bottleneck of the function?

INPUT:

  • 2 distance matrices, one for unique VJ sequences, one for unique VDJ sequences

OUTPUT:

  • a clonotype id for each cell

CURRENT IMPLEMENTATION:

  1. compute unique receptor configurations (i.e. combining cells with the same sequences into a single entry) (fast)
  2. build a lookup table from which the neighbors of each cell can be retrieved (fast enough)
  3. loop through all unique receptor configurations and find neighbors (SLOW)
  4. build a distance matrix (fast)
  5. graph partition using igraph (fast)
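The five steps above can be sketched end to end on toy data. This is not scirpy's actual implementation: the data, the neighbor lookups, and the use of scipy's connected_components in place of igraph partitioning are all illustrative simplifications.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Step 1: collapse cells into unique receptor configurations. Toy data:
# each cell carries one VJ and one VDJ sequence, encoded as "VJ|VDJ".
cells = ["A|X", "A|X", "B|Y", "C|Z"]
unique_keys, cell_to_config = np.unique(cells, return_inverse=True)
unique_configs = [k.split("|") for k in unique_keys]
n = len(unique_configs)

# Step 2: lookup tables for sequence neighbors. In scirpy these would be
# derived from the ir_dist distance matrices; here they are hard-coded.
vj_neighbors = {"A": {"A", "B"}, "B": {"A", "B"}, "C": {"C"}}
vdj_neighbors = {"X": {"X", "Y"}, "Y": {"X", "Y"}, "Z": {"Z"}}

# Step 3: loop over unique configurations and find those whose VJ AND VDJ
# sequences are both within the cutoff (this loop is the bottleneck).
rows, cols = [], []
for i, (vj_i, vdj_i) in enumerate(unique_configs):
    for j, (vj_j, vdj_j) in enumerate(unique_configs):
        if vj_j in vj_neighbors[vj_i] and vdj_j in vdj_neighbors[vdj_i]:
            rows.append(i)
            cols.append(j)

# Step 4: assemble the sparse adjacency matrix between configurations.
adj = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))

# Step 5: partition the graph (connected components stand in for the
# igraph partitioning) and map labels back to cells.
_, config_labels = connected_components(adj, directed=False)
clonotype_ids = config_labels[cell_to_config]
print(clonotype_ids)  # cells 0-2 form one clonotype, cell 3 another
```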

ALTERNATIVE IMPLEMENTATIONS I considered but discarded

  • Reindex the sequence distance matrices such that they match the table of unique receptor configurations, then perform matrix operations to combine the primary/secondary and TRA/TRB matrices.
  • The problem with this approach: large dense blocks can arise in the sparse matrices if many unique receptors share the same sequence (e.g. same TRA but different TRB).
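The dense-block problem can be demonstrated with a toy example: if many unique receptor configurations share one sequence, a single nonzero entry in the sequence-level matrix expands into a large dense block after reindexing (the numbers below are made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Sequence-level distance matrix for 2 unique TRA sequences: just the
# diagonal (each sequence is only a neighbor of itself).
seq_dist = csr_matrix(np.eye(2))

# Suppose 1000 unique receptor configurations all use TRA sequence 0
# (same TRA, different TRB) and a single one uses sequence 1.
index = np.array([0] * 1000 + [1])

# Reindex to the configuration level via fancy-indexing rows and columns.
reindexed = seq_dist[index][:, index]

# The single diagonal entry for sequence 0 has become a dense 1000x1000
# block: 2 stored entries blow up to over a million.
print(seq_dist.nnz, reindexed.nnz)
```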

Possible solutions

  • fix parallelization (shared memory)
  • reimplement using jax/numba (this may also solve the parallelization and provide GPU support)
  • Combine steps 2–4 into a single step (maybe possible with sequence embedding -- see Autoencoder-based sequence embedding #369). Note that this would be an alternative route and wouldn't completely replace ir_dist/define_clonotypes.
  • Special-casing: for omniscope data (which only has TRB chains), the problem simplifies to reindexing a sparse matrix. When using only one pair of sequences per cell, the problem is likely also simpler.
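For the single-chain special case, clonotype calling reduces to reindexing the sequence-level adjacency to cells and partitioning. A toy sketch (the adjacency values and cell assignments are invented, and connected components again stand in for the igraph partitioning):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Hypothetical TRB sequence-level adjacency (3 unique sequences; 0 and 1
# are within the distance cutoff of each other, 2 is isolated).
seq_adj = csr_matrix(np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]))

# With only TRB chains, each cell carries exactly one sequence.
cell_to_seq = np.array([0, 1, 1, 2])

# The cell-level adjacency is just the reindexed sequence matrix...
cell_adj = seq_adj[cell_to_seq][:, cell_to_seq]

# ...and clonotypes fall out of the graph partition directly.
_, labels = connected_components(cell_adj, directed=False)
print(labels)  # cells 0-2 share a clonotype, cell 3 is on its own
```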
@grst grst added this to ToDo in scirpy-dev Oct 9, 2022
@grst grst moved this from ToDo to In progress in scirpy-dev Dec 26, 2023
@grst grst linked a pull request Jan 9, 2024 that will close this issue
@grst grst moved this from In progress to On Hold in scirpy-dev Jan 23, 2024