Optimize TCRdist metric #509

Closed
grst opened this issue Apr 21, 2024 · 1 comment

@grst
Collaborator

grst commented Apr 21, 2024

The "tcrdist" metric was added by @felixpetschko in #502.

It's already quite fast, but memory usage is excessive with more than ~200-300k sequences. @felixpetschko already mentioned that he has some ideas to improve on that.

I also believe that the way this is parallelized could be optimized by adapting the code from here (using joblib.parallel):

problem_size = len(seqs) * len(seqs2) if seqs2 is not None else len(seqs) ** 2
# dynamically adjust the block size such that there are ~1000 blocks, within a range of 50 to 5000
block_size = int(np.ceil(min(max(np.sqrt(problem_size / 1000), 50), 5000)))
logging.info(f"block size set to {block_size}")
# precompute blocks as a list to know the total number of blocks for the progress bar
blocks = list(self._block_iter(seqs, seqs2, block_size=block_size))
block_results = _parallelize_with_joblib(
    (joblib.delayed(self._compute_block)(*block) for block in blocks),
    total=len(blocks),
    n_jobs=self.n_jobs,
)
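
To make the block-size heuristic in the snippet concrete, here is a minimal standalone sketch of the same formula. The function name `choose_block_size` is hypothetical (it is not part of scirpy), and it uses `math` instead of `np` only to stay dependency-free.

```python
import math
from typing import Optional


def choose_block_size(n_seqs: int, n_seqs2: Optional[int] = None) -> int:
    """Hypothetical standalone version of the block-size formula quoted above."""
    # Number of pairwise comparisons: rectangular if a second set is given, else square.
    problem_size = n_seqs * n_seqs2 if n_seqs2 is not None else n_seqs**2
    # Aim for roughly 1000 blocks, but clamp the block edge length to [50, 5000].
    return int(math.ceil(min(max(math.sqrt(problem_size / 1000), 50), 5000)))


print(choose_block_size(1_000))    # small input: lower bound of 50 applies
print(choose_block_size(100_000))  # mid-sized input: ~1000 blocks
print(choose_block_size(300_000))  # large input: upper bound of 5000 applies
```

With 300k sequences the upper bound applies, so the matrix is split into 60 x 60 tiles of 5000 x 5000 rather than ~1000 blocks.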

There are two main advantages of using joblib over the multiprocessing module:

  • it is more robust (avoiding random issues like "ir_dist alignment stuck", #468)
  • you get out-of-machine parallelization via dask for free - which makes using this metric feasible beyond 1M sequences.
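
For illustration, the blockwise pattern could look roughly like the sketch below. This is not the scirpy implementation: `toy_distance` is a placeholder metric (not TCRdist), and `block_iter`, `compute_block` and `pairwise_blocked` are hypothetical stand-ins for `_block_iter`, `_compute_block` and the surrounding method. Only `joblib.Parallel` and `joblib.delayed` are real joblib calls.

```python
import joblib


def toy_distance(a: str, b: str) -> int:
    # Placeholder metric (NOT TCRdist): mismatches over the shared prefix
    # plus the difference in length.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def block_iter(seqs, seqs2, block_size):
    # Yield (row_offset, col_offset, row_block, col_block) tiles covering the full matrix.
    for i in range(0, len(seqs), block_size):
        for j in range(0, len(seqs2), block_size):
            yield i, j, seqs[i:i + block_size], seqs2[j:j + block_size]


def compute_block(i, j, rows, cols):
    # Return (row, col, distance) triples for one tile, using global indices.
    return [
        (i + r, j + c, toy_distance(a, b))
        for r, a in enumerate(rows)
        for c, b in enumerate(cols)
    ]


def pairwise_blocked(seqs, seqs2=None, block_size=2, n_jobs=2):
    seqs2 = seqs if seqs2 is None else seqs2
    # Materialize the blocks so their total number is known up front.
    blocks = list(block_iter(seqs, seqs2, block_size))
    results = joblib.Parallel(n_jobs=n_jobs)(
        joblib.delayed(compute_block)(*block) for block in blocks
    )
    # Flatten the per-block lists into a dict keyed by (row, col).
    return {(r, c): d for block in results for r, c, d in block}
```

Because each block is an independent task, the same code can in principle be dispatched to a cluster: joblib documents a dask backend that is selected with `joblib.parallel_backend("dask")` once a `dask.distributed.Client` is running. That part is not shown or tested here.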

@ShihanL fyi

@grst
Collaborator Author

grst commented Apr 29, 2024

Closed in #511

@grst grst closed this as completed Apr 29, 2024
@grst grst added this to Done in scirpy-dev Apr 29, 2024