
Multithreading for scanpy.tl.rank_genes_group? #2390

Open · 1 task done
evanbiederstedt opened this issue Dec 29, 2022 · 4 comments

@evanbiederstedt commented Dec 29, 2022

  • Additional function parameters / changed functionality / changed defaults?

Hi ScanPy team

I emailed @ivirshup but others should be involved I think.

This function would be useful if we could specify the number of threads to use: https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.rank_genes_groups.html

Based on the number of items in the "groupby" field, we could use a basic split-merge approach here: each thread would take several of these items (the calculations are entirely independent of one another), and once each one completes we would join and concatenate the results.
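A minimal sketch of this split-merge idea, assuming the public scanpy API (`sc.tl.rank_genes_groups` with its `groups=` argument and `sc.get.rank_genes_groups_df`); the function name `rank_genes_groups_parallel` and the worker count are hypothetical, not something scanpy provides:

```python
# Hypothetical sketch: run each "group vs rest" comparison in its own worker
# process and concatenate the per-group result tables at the end.
from concurrent.futures import ProcessPoolExecutor

import pandas as pd
import scanpy as sc


def _rank_one_group(adata, groupby, group):
    # Each worker operates on its own copy of adata, so the in-place write is local.
    sc.tl.rank_genes_groups(adata, groupby, groups=[group], method="wilcoxon")
    df = sc.get.rank_genes_groups_df(adata, group=group)
    df["group"] = group
    return df


def rank_genes_groups_parallel(adata, groupby, n_workers=4):
    groups = list(adata.obs[groupby].unique())
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(_rank_one_group, adata, groupby, g) for g in groups]
        results = [f.result() for f in futures]
    return pd.concat(results, ignore_index=True)
```

Note that each worker process receives its own copy of `adata`, which is exactly the memory trade-off discussed further down in this thread.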

I'm happy to help write up a PR (or participate), but I'd like to hear if this is something you'd be willing to prioritize. (It's related to a project where Fabian is the PI.)

Best, Evan

@ivirshup (Member)

cc: @Zethson @grst

Hey,

In principle this sounds good, but I'd like to hear a little bit more about the use case.

For context on our side, there are some other paths available for speeding up DE (probably some form of calculating statistics via scverse/anndata#564). There's also increased momentum on more featureful DE in the scverse ecosystem.

If you are specifically looking for faster scanpy DE, this makes sense, though there may be some easier paths forward (at least to me).

If you need anything fancier or even just different, it could be good to check in with other efforts, e.g. pertpy.

@evanbiederstedt (Author)

Hi @ivirshup

Thanks for the help

In terms of the use cases here:

(1) Any user doing data processing or interactive analysis could benefit from multithreading here. Consider the two big for-loops which iterate through all of the genes compared between the samples, and the for-loop which automatically does this for each "group" in the ScanPy object.

I'm a bit confused why Seurat or ScanPy never did this... but then I realized that Pagoda2 didn't either: https://github.com/kharchenkolab/pagoda2/blob/main/R/Pagoda2.R#L900

(There's a bit of multithreading there at the end...)

Given the file sizes nowadays and the number of "groups", this is getting fairly computationally intensive. It's one of those simple things your biologists will love ("this is so fast now!").

(2) In terms of our use case, running DE interactively via the client is too slow. We've just started to implement the above ourselves.

RE: pertpy

How does this relate to @davidsebfischer and diffxpy?

Best, Evan

@grst (Contributor) commented Jan 19, 2023

> Given the file sizes nowadays and the number of "groups", this is getting fairly computationally intensive. It's one of those simple things your biologists will love ("this is so fast now!").

I agree it doesn't hurt to have rank_genes_groups parallelized (given that it should be straightforward to implement).
What @ivirshup was referring to, though, is that rank_genes_groups on single cells is in general no longer seen as best practice for DE analysis, because it doesn't account for pseudoreplication bias. Please take a look at @Zethson's book chapter.
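As a rough illustration of the pseudoreplication point (this is not something proposed in this thread): one common remedy is to aggregate cells into per-sample pseudobulk profiles before testing, so that the replicates are samples rather than cells. The column names "sample" and "cell_type" below are assumptions, and `adata.X` is assumed to hold raw counts:

```python
# Hypothetical pseudobulk aggregation; "sample" and "cell_type" are assumed
# columns of adata.obs, not names taken from this discussion.
import numpy as np
import pandas as pd
from scipy.sparse import issparse


def pseudobulk(adata, sample_key="sample", group_key="cell_type"):
    X = adata.X.toarray() if issparse(adata.X) else np.asarray(adata.X)
    counts = pd.DataFrame(X, index=adata.obs_names, columns=adata.var_names)
    # Sum counts per (sample, group): one row per biological replicate.
    keys = [adata.obs[sample_key], adata.obs[group_key]]
    return counts.groupby(keys).sum()
```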

> RE: pertpy
>
> How does this relate to @davidsebfischer and diffxpy?

Diffxpy is currently being reimplemented. Once it is released, it would likely be included in pertpy as an additional method. I.e. pertpy is more general and strives to provide a consistent interface to multiple methods.

@ivirshup (Member)

> (1) Any user doing data processing or interactive analysis could benefit from multithreading here. Consider the two big for-loops which iterate through all of the genes compared between the samples, and the for-loop which automatically does this for each "group" in the ScanPy object.

My concern is that there will be issues if you keep the current calculations, but parallelize over the groups. Within that loop, I believe large amounts of memory can be allocated. If it's "group vs rest", at least one X worth of memory is allocated per loop from matrix subsetting – since there's an X_group = X[mask]; X_rest = X[~mask].

If you parallelize over groups, now the max memory usage can be ~ min(n_procs, n_groups) * X as opposed to ~2X. For large X (probably where you want to see the speed up most), this can make you run out of memory.

```python
X_rest = self.X[mask_rest]
self.means[self.ireference], self.vars[self.ireference] = _get_mean_var(
    X_rest
)
# deleting the next line causes a memory leak for some reason
del X_rest

if issparse(self.X):
    get_nonzeros = lambda X: X.getnnz(axis=0)
else:
    get_nonzeros = lambda X: np.count_nonzero(X, axis=0)

for imask, mask in enumerate(self.groups_masks):
    X_mask = self.X[mask]
```
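To make that scaling concrete, here is a back-of-envelope calculation with hypothetical dataset sizes (none of these numbers come from the thread; they only illustrate the `min(n_procs, n_groups) * X` vs `~2X` comparison):

```python
# Hypothetical sizes, purely for illustration of the memory scaling above.
n_cells, n_genes = 1_000_000, 20_000
bytes_per_value = 4  # float32, dense for simplicity
x_gb = n_cells * n_genes * bytes_per_value / 1e9

print(f"one dense copy of X: ~{x_gb:.0f} GB")              # ~80 GB
print(f"serial 'group vs rest' peak: ~{2 * x_gb:.0f} GB")  # ~160 GB
print(f"8 workers over groups: ~{8 * x_gb:.0f} GB")        # ~640 GB
```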

Another memory-related concern comes from multiprocessing (mentioned in your email). I think there's recently been some improvement here, but my impression is that it's difficult or impossible to share memory with multiprocessing – so everything that goes into or out of a subprocess has to be copied.

So while I think we can absolutely make use of more processing power here, I think we need to consider the approach.

  • The link I mentioned above should reduce copies, and could potentially use a parallelized BLAS for compute
  • Many of the loops over "all of the genes compared between the samples" are already in compiled code, but could be parallelized (see the sketch below)
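As a hedged sketch of what parallelizing such a compiled per-gene loop could look like (illustrative only, not scanpy's actual implementation; it assumes a dense array `X` and a boolean cell mask):

```python
# Illustrative only: per-gene mean/variance for the cells selected by `mask`,
# parallelized over genes with numba's prange. Assumes a dense float array X.
import numpy as np
from numba import njit, prange


@njit(parallel=True)
def group_mean_var(X, mask):
    n_cells, n_genes = X.shape
    n = 0
    for i in range(n_cells):
        if mask[i]:
            n += 1
    means = np.zeros(n_genes)
    variances = np.zeros(n_genes)
    for g in prange(n_genes):
        s = 0.0
        ss = 0.0
        for i in range(n_cells):
            if mask[i]:
                v = X[i, g]
                s += v
                ss += v * v
        m = s / n
        means[g] = m
        variances[g] = ss / n - m * m
    return means, variances
```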

> In terms of our use case, running DE interactively via the client is too slow.

What is the interface here? Scanpy computes results for all groups at once, but in most interfaces I've used you can only really "ask" for one comparison at a time. This could also be much faster if you can just reduce the total computation.
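For that kind of single-comparison workflow, the existing `groups=` argument already avoids computing everything; a minimal usage example (the `groupby` key "cell_type" and group name "B cells" are placeholders for your own annotation):

```python
import scanpy as sc

# Compute DE only for one group vs. the rest, instead of all groups at once.
sc.tl.rank_genes_groups(adata, "cell_type", groups=["B cells"], reference="rest")
df = sc.get.rank_genes_groups_df(adata, group="B cells")
```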


> What @ivirshup was referring to, though, is that rank_genes_groups on single cells is in general no longer seen as best practice for DE analysis, because it doesn't account for pseudoreplication bias.

Partially, I'm not sure what comparisons are actually being run. I was also wondering if you'd benefit from something fancy like a covariate.

> Diffxpy is currently being reimplemented.

As a heads up, I'm unaware of a timeline here.
