Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add some data access helpers to utils #619

Merged
merged 15 commits into from
Jun 24, 2019

Conversation

ivirshup
Copy link
Member

@ivirshup ivirshup commented Apr 25, 2019

This adds two new convenience functions to utils.

obs_values_df

Basically does the data access part of the scatter plots (actually copied the core of the code from there). Basically, lets you get a data frame of values from obs, obsm, and expression matrix back as a dataframe. I'd planned on this being the data access part of ridge_plot PR, but I've found it generally useful for data access. Also finding a feature-ful KDE that isn't buggy has been an issue for the ridge plots.

This uses the obsm access I had suggested to @gokceneraslan in #613.

I'm also open to adding a var_values_df to this PR, I just haven't had a use case yet.

rank_genes_groups_df

Returns a dataframe of differential expression results, because accessing DE results right now is a pain. This was a part of #467, but I can just remove it from there.

Whats left to do.

Docs, but it's boilerplate. Do we have centralized docstrings for things like gene_symbols, use_raw, layers, and adata?

@falexwolf
Copy link
Member

This looks very good!

No, we don't yet have centralized docs for gene_symbols, adata etc. but of course, we should have them. Maybe even just in the root scanpy/_docs.py. What do you think?

@ivirshup
Copy link
Member Author

Thanks!

I've updated the docs, but it turned out not much was actually shared. Where should I put these in the api docs? A new utils section, or something under Further modules? I'm thinking I'd just include rank_genes_groups_df and obs_values_df on the site.

scanpy/utils.py Outdated
Key differential expression groups were stored under.
pval_cutoff
Minimum adjusted pval to return.
logfc_min
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some discussions, we thought that log2FC is the most unambiguous and most commonly used name, and we would rename the slot this way in v2.0 (#453 (comment)). Do you agree? Would you adapt the parameter names?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! No strong opinion on this, as long as we're definitely calculating a log2 fold change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we're definitely calculating log2. I think this is an established convention.

Copy link
Member Author

@ivirshup ivirshup Apr 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checked to make sure I did it right, but turns out I didn't...

Still: log2fc or log2FC? My pinkies would prefer less caps, but I'd probably also tab complete it most times.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with log2fc but I think diffxpy uses log2FC and if you think about an axes label, the capped version might be more appealing. Your call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, please update #453 (comment). 🙂

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like diffxpy uses "log2fc", at least as of theislab/diffxpy@0054f90, so I think I'll go with that

scanpy/utils.py Outdated

# Would an array be faster?
@doc_params(raw_layer_params=_docs.doc_raw_layers)
def obs_values(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some duplicatation with https://github.com/theislab/anndata/blob/34f4eb63710628fbc15e7050e5efcac1d7806062/anndata/base.py#L1464?

I think we could have a public function in AnnData for this purpose.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your implementation in combination with obs_values_df is definitely better. But I really think it should go into AnnData, next to .to_df() which right now just gives the data matrix, but one could give it the options you have below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely duplication, I hadn't realized we had that function in AnnData and was just going off how we assign colors for scatter plots. I agree it's better to have one function for this, and AnnData makes sense for where to put it.

Do all the current argument make sense for AnnData? For example, what about gene_symbols? If AnnData is meant to be agnostic to datatype, I could handle resolving gene_symbols in obs_values_df?

Also, these functions have different conventions for layers. Which one should we standardize on?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, gene_symbols doesn't make sense for AnnData! Your suggestion makes sense.

Maybe Fidel didn't use the original https://github.com/theislab/scanpy/blob/9b522f54e0f839e1a0c9874ca658400bfe79a894/scanpy/plotting/_anndata.py#L311 in his functions? Unfortunate if I didn't notice. I don't have strong opinion on the convention for layers as long as it's there. We could also have a Slack discussion with Fidel, if you think there might be issues.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was looking at this function: https://github.com/theislab/scanpy/blob/9e4e5ee02e04cf618872d9b098e24f0542e8b227/scanpy/plotting/_tools/scatterplots.py#L651-L736

I don't think there's an issue per-say, I just think it'd be easier for me to follow/ debug the plotting functions if they were a little more standardized. We pretty frequently want a dataframe (or at least aligned arrays) of values per cell or gene, but this is done in a variety of ways.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On layers, it looks like scanpy uses layer=None as the default and anndata does layer='X' (via find . -name "*.py" -exec grep -n "layer=" {} +).

I prefer the scanpy style, since if someone specifies layer='X' and they actually have a layer named 'X' they probably want to use that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes, I also prefer the Scanpy style. @Koncopd built most of the layers for AnnData, any reason to move away from 'X'? Any reason why you chose to do it that way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's an issue per-say, I just think it'd be easier for me to follow/ debug the plotting functions if they were a little more standardized. We pretty frequently want a dataframe (or at least aligned arrays) of values per cell or gene, but this is done in a variety of ways.

100% agreed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivirshup Could you please give an example where we have layer='X', i don't see it.
@falexwolf not sure, maybe i had loom as a model, there is a similar thimg there if i remember correctly.

Copy link
Member

@Koncopd Koncopd Apr 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, i see. I'm not sure i wrote these.

@ivirshup
Copy link
Member Author

ivirshup commented Apr 30, 2019

Since the obs_values_df will now depend on an updated version of AnnData, I'm thinking I'll move this version of rank_genes_groups_df over to #467 so that can get merged.

Edit: Actually, this isn't the case since we'll need backwards compatibility anyways, nvm

@ivirshup
Copy link
Member Author

Does obs_values_df need the _df, or could it be obs_values?

@falexwolf
Copy link
Member

I think obs_values is fine. But maybe, aggregate_obs is even better, as this describes what it does (aggregating annotations of observations with partial (projections of) observations).

It's no problem at all to make the next Scanpy release depend on the current AnnData release, both in the requirements and the minimal version check upon importing Scanpy.

@ivirshup
Copy link
Member Author

ivirshup commented May 1, 2019

To me, aggregate_obs implies a reduction, but that might mostly be because pd.DataFrame.aggregate. I could go for collect_obs, but if this goes into sc.get there's already a verb. I think sc.get.obs_df would work (though kind of ambiguous with adata.obs) as would sc.get.obs_values.

@falexwolf
Copy link
Member

sc.get.select_obs (meaning selected_obs)? You're right about aggregate. obs_values is fine, too (it's really just that it reminds of .values).

@falexwolf
Copy link
Member

Should we merge this? Or is this going to partly end up in AnnData? :)

@ivirshup
Copy link
Member Author

ivirshup commented May 16, 2019

Kinda? This is waiting on scverse/anndata#144 and a following AnnData point release.

But I think that PR is ready to go.

@ivirshup
Copy link
Member Author

Getting back around to this, I think it's pretty close to ready.

Two last things to consider:

  • Name change of obs_values_df. I likeobs_df since it fits with obs_vector and has a nice short name.
  • Support for .raw, I've dropped it at the moment, but maybe should add it back in.

@ivirshup
Copy link
Member Author

Also, should this go in sc.get?

@ivirshup
Copy link
Member Author

I think this functionality is ready to go. I'm going to try using this a bit before deciding if it needs an argument for raw. I'm going to merge this so I can use the sc.get.rank_genes_groups for #467. @flying-sheep, let me know if I messed up any docs and I'll fix it separately.

@ivirshup ivirshup merged commit a5e9806 into scverse:master Jun 24, 2019
@@ -12,6 +12,7 @@ Post v1.4 :small:`May 13, 2019`

New functionality:

- New module :ref:`sc.get<module-get>` adds helper functions for extracting data in convenient formats :pr:`619` :smaller:`thanks to I Virshup`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not :mod:`scanpy.get`?

Copy link
Member Author

@ivirshup ivirshup Jun 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should link to the docs, which :mod:`scanpy.get` didn't seem to do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could definitely get changed to :ref:`scanpy.get<module-get>` , though I'm not sure if how the styling can be fixed to look like a module.

Copy link
Member

@flying-sheep flying-sheep Jun 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can’t. We should fix that :mod:`scanpy.get` doesn’t link to the proper location.

I guess that could be done by creating the module entry via .. module:: scanpy.get directly before using .. autosummary::

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it! 0485cb8

I just put the module link targets at the correct position and immediately reset the “current module” to scanpy each time, so that relative links still work!

@flying-sheep
Copy link
Member

The docs look great! I just wonder about the above: In the release notes, we refer to everything as scanpy.*, not sc.*

@LuckyMD
Copy link
Contributor

LuckyMD commented Jun 24, 2019

Do we have some kind of tutorial around the new sc.get module?

awnimo pushed a commit to dpeerlab/scanpy that referenced this pull request Dec 17, 2019
Add data access helpers to new module `sc.get`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants