diffxpy integration #1955

Open
3 of 5 tasks
pinin4fjords opened this issue Jul 22, 2021 · 14 comments
Labels
Area – Differential Expression Differential expression

Comments

@pinin4fjords
Contributor

pinin4fjords commented Jul 22, 2021

  • Additional function parameters / changed functionality / changed defaults?
  • New analysis tool: A simple analysis tool you have been using and are missing in sc.tools?
  • New plotting function: A kind of plot you would like to see in sc.pl?
  • External tools: Do you know an existing package that should go into sc.external.*?
  • Other?

I'd really like a smarter d/e method to be accessible easily from Scanpy, one that allows proper treatment of covariates etc. MAST is obviously very popular, but fiddly to integrate from R. I see diffxpy mentioned about the place here, and see it's an in-house tool of yours. Is there a reason it's not been integrated already? If nobody's working on it, shall I take a crack at it?

@LuckyMD
Contributor

LuckyMD commented Jul 22, 2021

@davidsebfischer

@davidsebfischer

Hi @pinin4fjords! I understand that by integration you mean access under the scanpy API. We try to advance the scanpy environment through modular extensions: packages with their own API that also work on adata instances. This is currently what diffxpy is, and there are no plans to collect all scanpy-related packages under sc.* as far as I am aware.

@pinin4fjords
Contributor Author

pinin4fjords commented Jul 22, 2021

@davidsebfischer ahh, maybe what I need is already there.

Basically I'd like to be able to treat diffxpy as a drop-in replacement for rank_genes_groups(), such that the same slots (as far as makes sense) are populated in the anndata object, similar plots can be generated etc. My admittedly superficial skim of diffxpy suggested that anndata is accepted as an input format, but that results are not stored in the anndata object. Is that incorrect?

@davidsebfischer

Ok, you are raising a valid point, which is centred around a unified output structure for differential expression. Currently, diffxpy yields custom objects that can do a number of things, but we could also populate adata with similar entries as rank_genes_groups does. @ivirshup do you know of a roadmap for the rank_genes_groups output signature, or is that meant to stay as it is now?

@pinin4fjords
Contributor Author

We have our CLI layer for Scanpy, and I could put this integration there, but it'd be a shame to silo code that might be useful to other Scanpy users, so happy to contribute to something in the external API if you guys are willing.

@ivirshup
Member

@Koncopd has looked at refactoring the rank_genes_groups methods, but in the big picture we don't really love the output format that rank_genes_groups uses.

Maybe an easier path forward would be to be able to directly pass values into the various plotting functions? You can already generate mostly similar plots from sc.pl.rank_genes_groups_{plot_func} and sc.pl.{plot_func} apart from using logfc and pvalues. If we allowed passing those in, it would be simple enough to make the same plots/ add a wrapper that generates the plots into diffxpy.

@pinin4fjords
Contributor Author

What we're really after is the more general ability to execute different d/e tools without too much extra work, and have the results stored consistently in the AnnData object for whatever downstream applications (plotting or otherwise), or just so that they're available for consumers of our AnnData objects.

But maybe if it's something you guys aren't keen on we can just code it up in our own software layer.

@ivirshup added the "Area – Differential Expression" label on Jul 26, 2021
@ivirshup
Member

That sounds nice! I can't say I recommend the use of recarrays to store the results of differential expression, and would suggest using dataframes more directly. If you develop a better storage model for differential expression results, I think we could be interested in adopting it (would be good to hear thoughts from @Koncopd around this).

Some previous discussions that are relevant here: #562 (comment), #723 (comment), #1156
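
To make the recarray versus dataframe point concrete, here is a minimal sketch using only numpy and pandas. The toy arrays imitate the rank_genes_groups layout (one structured array per statistic, one field per group); the gene names, values and the helper `to_long_frame` are made up for illustration and are not part of scanpy.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the rank_genes_groups layout: one structured array per
# statistic, with one field per group and rows ordered by rank within a group.
groups = ["B", "T"]
names = np.rec.fromarrays(
    [np.array(["CD79A", "MS4A1"]), np.array(["CD3D", "CD3E"])], names=groups
)
pvals = np.rec.fromarrays(
    [np.array([1e-5, 2e-3]), np.array([3e-6, 4e-2])], names=groups
)


def to_long_frame(names, pvals, groups):
    """Hypothetical helper: stack per-group fields into one tidy table."""
    frames = [
        pd.DataFrame({"group": g, "names": names[g], "pvals": pvals[g]})
        for g in groups
    ]
    return pd.concat(frames, ignore_index=True)


df = to_long_frame(names, pvals, groups)
print(df)
```

With the long table, filtering or exporting one group is a single boolean mask, whereas the recarray form requires indexing every statistic separately by group name.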

@davidsebfischer

I find .varm to be the ideal output destination of traditional DE analyses, ie gene-indexed tables. We could streamline the name and a few column names of a varm element?
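
One way to read "streamline a few column names" is a small mapping layer in front of a shared schema. The sketch below is only an illustration: the common schema (`gene`, `log2fc`, `pval`, `qval`) and the helper `harmonise` are hypothetical, the scanpy column names follow the table returned by sc.get.rank_genes_groups_df, and the diffxpy column names are an assumption about its summary table rather than a documented contract.

```python
import pandas as pd

# Hypothetical common schema; nothing in this thread has agreed on these names.
COMMON = ["gene", "log2fc", "pval", "qval"]

# Per-tool column mappings onto the common schema.
RENAMES = {
    # Column names as produced by sc.get.rank_genes_groups_df.
    "scanpy": {
        "names": "gene",
        "logfoldchanges": "log2fc",
        "pvals": "pval",
        "pvals_adj": "qval",
    },
    # Assumed diffxpy summary column names (already matching the schema).
    "diffxpy": {"gene": "gene", "log2fc": "log2fc", "pval": "pval", "qval": "qval"},
}


def harmonise(table, tool):
    """Rename tool-specific columns and return a gene-indexed table."""
    out = table.rename(columns=RENAMES[tool])
    missing = [c for c in COMMON if c not in out.columns]
    if missing:
        raise KeyError(f"{tool} table lacks columns: {missing}")
    return out[COMMON].set_index("gene")


scanpy_like = pd.DataFrame(
    {
        "names": ["CD3D", "CD79A"],
        "scores": [12.1, -3.4],
        "logfoldchanges": [2.5, -1.0],
        "pvals": [1e-6, 0.01],
        "pvals_adj": [1e-5, 0.05],
    }
)
result = harmonise(scanpy_like, "scanpy")
print(result)
```

Columns outside the shared schema (here `scores`) are dropped in this sketch; a real design would need to decide whether tool-specific extras are kept alongside the common ones.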

@pinin4fjords
Contributor Author

Sure, this all sounds good, and .varm sounds sensible. I may need some solution that preserves compatibility with the existing structures so I don't break plotting: our CLI tools are used for training etc., where those plotting functions are used.

We already do some dataframe conversions for exporting tables, so maybe we'll just repurpose that code.

@ivirshup
Member

ivirshup commented Jul 26, 2021

> We already do some dataframe conversions for exporting tables, so maybe we'll just repurpose that code.

You may be interested in sc.get.rank_genes_groups_df

> I find .varm to be the ideal output destination of traditional DE analyses, ie gene-indexed tables. We could streamline the name and a few column names of a varm element?

I think this data fits in varm, but I'd be a little worried about how many entries you're adding, and knowing which ones came from which DE call (if there are multiple). This will conflict with computing differential expression on .raw, since the var dimension can differ.
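
Both concerns can be illustrated with plain pandas. In this sketch a dict stands in for adata.varm, and the key names, gene names, values and the helper `store_de` are all hypothetical. Each DE call gets its own key, and each table is aligned to the filtered var_names before storage.

```python
import pandas as pd

# Genes kept after filtering; a dict stands in for adata.varm.
var_names = pd.Index(["CD3D", "CD79A", "MS4A1"])
varm = {}


def store_de(varm, key, table, var_names):
    """Hypothetical helper: store one DE table per key, aligned to var_names."""
    if key in varm:
        raise KeyError(f"{key!r} already holds a DE result")
    # Genes that were not tested become NaN rows; genes that are not in
    # var_names (for example genes only present in .raw) are dropped.
    varm[key] = table.reindex(var_names)


de_filtered = pd.DataFrame(
    {"log2fc": [1.2, -0.4], "pval": [1e-4, 0.3]}, index=["CD3D", "CD79A"]
)
de_raw = pd.DataFrame(
    {"log2fc": [1.1, -0.5, 2.0, 0.1], "pval": [2e-4, 0.2, 1e-6, 0.9]},
    index=["CD3D", "CD79A", "MS4A1", "GAPDH"],
)

store_de(varm, "de_T_vs_rest", de_filtered, var_names)
store_de(varm, "de_T_vs_rest_raw", de_raw, var_names)
print(varm["de_T_vs_rest_raw"])
```

The result computed on the larger gene set loses its GAPDH row once it is forced onto the filtered var dimension, which is the .raw conflict described above, and the number of keys grows with every DE call.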

@pinin4fjords
Contributor Author

Thanks for pointing out that method, @ivirshup. Our version was written before I was maintainer, and maybe that function wasn't available at the time.

Happy to adopt whatever general approach you recommend.

@ivirshup
Member

> Happy to adopt whatever general approach you recommend.

I think you and @davidsebfischer would be better able to figure this out, since you're going to be more familiar with the use cases that need to be accommodated. My main points would be:

  • I do not recommend doing storage the way we do it currently. Dataframes or xarray objects are almost certainly going to be easier/ closer to what a user wants to end up with.
  • There has been a lot of discussion of this in the past in the issues I linked above. I'd recommend looking through those to get some more context.

@pinin4fjords
Contributor Author

Yep, reading #562 especially was enlightening, but it does seem like a bit of a can o' worms. For example I'm thoroughly out of my depth when we get into HDF5 and how the choice of data structure impacts on that.

I may have to do something naive internally for our immediate purposes and help out with a more satisfying solution in the longer term.

4 participants