-
Notifications
You must be signed in to change notification settings - Fork 600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add some data access helpers to utils #619
Conversation
This looks very good! No, we don't yet have centralized docs for |
Thanks! I've updated the docs, but it turned out not much was actually shared. Where should I put these in the api docs? A new |
scanpy/utils.py
Outdated
Key differential expression groups were stored under. | ||
pval_cutoff | ||
Minimum adjusted pval to return. | ||
logfc_min |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some discussions, we thought that log2FC
is the most unambiguous and most commonly used name, and we would rename the slot this way in v2.0
(#453 (comment)). Do you agree? Would you adapt the parameter names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! No strong opinion on this, as long as we're definitely calculating a log2 fold change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we're definitely calculating log2. I think this is an established convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checked to make sure I did it right, but turns out I didn't...
Still: log2fc
or log2FC
? My pinkies would prefer less caps, but I'd probably also tab complete it most times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with log2fc
but I think diffxpy uses log2FC
and if you think about an axes label, the capped version might be more appealing. Your call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But, please update #453 (comment). 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like diffxpy uses "log2fc"
, at least as of theislab/diffxpy@0054f90, so I think I'll go with that
scanpy/utils.py
Outdated
|
||
# Would an array be faster? | ||
@doc_params(raw_layer_params=_docs.doc_raw_layers) | ||
def obs_values( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some duplicatation with https://github.com/theislab/anndata/blob/34f4eb63710628fbc15e7050e5efcac1d7806062/anndata/base.py#L1464?
I think we could have a public function in AnnData for this purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your implementation in combination with obs_values_df
is definitely better. But I really think it should go into AnnData, next to .to_df()
which right now just gives the data matrix, but one could give it the options you have below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely duplication, I hadn't realized we had that function in AnnData and was just going off how we assign colors for scatter plots. I agree it's better to have one function for this, and AnnData makes sense for where to put it.
Do all the current argument make sense for AnnData? For example, what about gene_symbols
? If AnnData is meant to be agnostic to datatype, I could handle resolving gene_symbols in obs_values_df
?
Also, these functions have different conventions for layers
. Which one should we standardize on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, gene_symbols
doesn't make sense for AnnData! Your suggestion makes sense.
Maybe Fidel didn't use the original https://github.com/theislab/scanpy/blob/9b522f54e0f839e1a0c9874ca658400bfe79a894/scanpy/plotting/_anndata.py#L311 in his functions? Unfortunate if I didn't notice. I don't have strong opinion on the convention for layers as long as it's there. We could also have a Slack discussion with Fidel, if you think there might be issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I was looking at this function: https://github.com/theislab/scanpy/blob/9e4e5ee02e04cf618872d9b098e24f0542e8b227/scanpy/plotting/_tools/scatterplots.py#L651-L736
I don't think there's an issue per-say, I just think it'd be easier for me to follow/ debug the plotting functions if they were a little more standardized. We pretty frequently want a dataframe (or at least aligned arrays) of values per cell or gene, but this is done in a variety of ways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On layers, it looks like scanpy uses layer=None
as the default and anndata does layer='X'
(via find . -name "*.py" -exec grep -n "layer=" {} +
).
I prefer the scanpy
style, since if someone specifies layer='X'
and they actually have a layer named 'X'
they probably want to use that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yes, I also prefer the Scanpy style. @Koncopd built most of the layers for AnnData, any reason to move away from 'X'? Any reason why you chose to do it that way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's an issue per-say, I just think it'd be easier for me to follow/ debug the plotting functions if they were a little more standardized. We pretty frequently want a dataframe (or at least aligned arrays) of values per cell or gene, but this is done in a variety of ways.
100% agreed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ivirshup Could you please give an example where we have layer='X'
, i don't see it.
@falexwolf not sure, maybe i had loom as a model, there is a similar thimg there if i remember correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, i see. I'm not sure i wrote these.
Since the Edit: Actually, this isn't the case since we'll need backwards compatibility anyways, nvm |
Does |
I think It's no problem at all to make the next Scanpy release depend on the current AnnData release, both in the requirements and the minimal version check upon importing Scanpy. |
To me, |
|
Should we merge this? Or is this going to partly end up in AnnData? :) |
Kinda? This is waiting on scverse/anndata#144 and a following AnnData point release. But I think that PR is ready to go. |
83c069e
to
27ac65a
Compare
Getting back around to this, I think it's pretty close to ready. Two last things to consider:
|
Also, should this go in |
* Now uses `AnnData.obs_vector` * Bump required version of AnnData to `0.6.21` to allow this * No longer supports raw
fe26185
to
faa8d07
Compare
I think this functionality is ready to go. I'm going to try using this a bit before deciding if it needs an argument for |
@@ -12,6 +12,7 @@ Post v1.4 :small:`May 13, 2019` | |||
|
|||
New functionality: | |||
|
|||
- New module :ref:`sc.get<module-get>` adds helper functions for extracting data in convenient formats :pr:`619` :smaller:`thanks to I Virshup` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not :mod:`scanpy.get`
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should link to the docs, which :mod:`scanpy.get`
didn't seem to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could definitely get changed to :ref:`scanpy.get<module-get>`
, though I'm not sure if how the styling can be fixed to look like a module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can’t. We should fix that :mod:`scanpy.get`
doesn’t link to the proper location.
I guess that could be done by creating the module entry via .. module:: scanpy.get
directly before using .. autosummary::
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed it! 0485cb8
I just put the module link targets at the correct position and immediately reset the “current module” to scanpy each time, so that relative links still work!
The docs look great! I just wonder about the above: In the release notes, we refer to everything as |
Do we have some kind of tutorial around the new |
Add data access helpers to new module `sc.get`.
This adds two new convenience functions to
utils
.obs_values_df
Basically does the data access part of the scatter plots (actually copied the core of the code from there). Basically, lets you get a data frame of values from obs, obsm, and expression matrix back as a dataframe. I'd planned on this being the data access part of
ridge_plot
PR, but I've found it generally useful for data access. Also finding a feature-ful KDE that isn't buggy has been an issue for the ridge plots.This uses the obsm access I had suggested to @gokceneraslan in #613.
I'm also open to adding a
var_values_df
to this PR, I just haven't had a use case yet.rank_genes_groups_df
Returns a dataframe of differential expression results, because accessing DE results right now is a pain. This was a part of #467, but I can just remove it from there.
Whats left to do.
Docs, but it's boilerplate. Do we have centralized docstrings for things like
gene_symbols
,use_raw
,layers
, andadata
?