Add some data access helpers to utils #619

ivirshup · 2019-04-25T04:50:43Z

This adds two new convenience functions to utils.

`obs_values_df`

This does the data access part of the scatter plots (the core of the code is actually copied from there). It lets you get values from obs, obsm, and the expression matrix back as a single dataframe. I'd planned on this being the data access part of the ridge_plot PR, but I've found it generally useful on its own. Also, finding a feature-ful KDE that isn't buggy has been an issue for the ridge plots.
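A minimal sketch of the idea, assuming plain pandas and NumPy inputs rather than an AnnData object. The function name `obs_values_df_sketch`, its signature, and the `name-column` labelling of obsm columns are hypothetical and are not the implementation in this PR.

```python
import numpy as np
import pandas as pd


def obs_values_df_sketch(obs, X, var_names, obsm, keys=(), obsm_keys=()):
    """Collect per-observation values into one dataframe (illustrative only).

    obs: DataFrame of observation annotations
    X: 2D array of shape (n_obs, n_vars)
    var_names: names for the columns of X
    obsm: dict mapping a name to a 2D array with n_obs rows
    keys: column names of obs, or entries of var_names
    obsm_keys: (name, column_index) pairs into obsm
    """
    var_index = pd.Index(var_names)
    columns = {}
    for key in keys:
        if key in obs.columns:
            # Annotation column, taken as-is
            columns[key] = obs[key].to_numpy()
        elif key in var_index:
            # Expression values for one variable
            columns[key] = np.asarray(X[:, var_index.get_loc(key)]).ravel()
        else:
            raise KeyError(f"{key!r} not found in obs columns or var_names")
    for name, col in obsm_keys:
        columns[f"{name}-{col}"] = np.asarray(obsm[name][:, col]).ravel()
    return pd.DataFrame(columns, index=obs.index)


obs = pd.DataFrame({"louvain": ["0", "1", "0"]}, index=["c1", "c2", "c3"])
X = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
obsm = {"X_umap": np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])}
df = obs_values_df_sketch(
    obs, X, ["GeneA", "GeneB"], obsm,
    keys=["louvain", "GeneB"], obsm_keys=[("X_umap", 0)],
)
```

Here `df` has one row per observation and one column per requested key, which is the shape the plotting code wants.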

This uses the obsm access I had suggested to @gokceneraslan in #613.

I'm also open to adding a var_values_df to this PR, I just haven't had a use case yet.

`rank_genes_groups_df`

Returns a dataframe of differential expression results, because accessing DE results right now is a pain. This was a part of #467, but I can just remove it from there.
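To illustrate why a helper is convenient here, the sketch below tidies a results dict shaped like `adata.uns['rank_genes_groups']`, where each field is a structured array with one named column per group. The function name, the toy values, and the filter directions (keep adjusted p-values below `pval_cutoff`, keep fold changes above `log2fc_min`) are assumptions for illustration, not the code in this PR.

```python
import numpy as np
import pandas as pd


def make_field(rows, dtype):
    # One structured-array column per group; groups are named "0" and "1" here
    return np.array(rows, dtype=[("0", dtype), ("1", dtype)])


# Toy results: two genes per group, already ranked
result = {
    "names": make_field([("GeneA", "GeneC"), ("GeneB", "GeneD")], "U10"),
    "scores": make_field([(5.0, 4.0), (1.0, 0.5)], "f8"),
    "logfoldchanges": make_field([(2.0, 1.5), (0.2, 0.1)], "f8"),
    "pvals": make_field([(0.001, 0.01), (0.4, 0.6)], "f8"),
    "pvals_adj": make_field([(0.002, 0.02), (0.4, 0.6)], "f8"),
}


def rank_genes_groups_df_sketch(result, group, pval_cutoff=None, log2fc_min=None):
    """Return one group's results as a dataframe (hypothetical helper)."""
    fields = ["names", "scores", "logfoldchanges", "pvals", "pvals_adj"]
    df = pd.DataFrame({field: result[field][group] for field in fields})
    if pval_cutoff is not None:
        # Assumption: keep rows with adjusted p-value below the cutoff
        df = df[df["pvals_adj"] < pval_cutoff]
    if log2fc_min is not None:
        # Assumption: keep rows with fold change above the minimum
        df = df[df["logfoldchanges"] > log2fc_min]
    return df.reset_index(drop=True)


de = rank_genes_groups_df_sketch(result, "0", pval_cutoff=0.05)
```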

What's left to do.

Docs, but it's boilerplate. Do we have centralized docstrings for things like gene_symbols, use_raw, layers, and adata?

falexwolf · 2019-04-26T11:41:10Z

This looks very good!

No, we don't yet have centralized docs for gene_symbols, adata etc. but of course, we should have them. Maybe even just in the root scanpy/_docs.py. What do you think?

ivirshup · 2019-04-27T07:37:22Z

Thanks!

I've updated the docs, but it turned out not much was actually shared. Where should I put these in the api docs? A new utils section, or something under Further modules? I'm thinking I'd just include rank_genes_groups_df and obs_values_df on the site.

falexwolf · 2019-04-28T18:33:41Z

scanpy/utils.py

+        Key differential expression groups were stored under.
+    pval_cutoff
+        Minimum adjusted pval to return.
+    logfc_min


After some discussions, we thought that log2FC is the most unambiguous and most commonly used name, and we would rename the slot this way in v2.0 (#453 (comment)). Do you agree? Would you adapt the parameter names?

Sure! No strong opinion on this, as long as we're definitely calculating a log2 fold change.

Yes, we're definitely calculating log2. I think this is an established convention.

Just checked to make sure I did it right, but turns out I didn't...

Still: log2fc or log2FC? My pinkies would prefer less caps, but I'd probably also tab complete it most times.

I'm fine with log2fc but I think diffxpy uses log2FC and if you think about an axes label, the capped version might be more appealing. Your call.

But, please update #453 (comment). 🙂

It looks like diffxpy uses "log2fc", at least as of theislab/diffxpy@0054f90, so I think I'll go with that
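For reference, a log2 fold change in its simplest form is the base-2 log of a ratio of group means. The sketch below assumes the inputs are on a linear (not log-transformed) scale and adds a small pseudocount to avoid division by zero. The function name and the pseudocount value are illustrative assumptions, not necessarily what Scanpy computes.

```python
import numpy as np


def log2_fold_change(group_values, rest_values, pseudocount=1e-9):
    """log2 of the ratio of means between one group and the rest (illustrative)."""
    mean_group = np.mean(group_values)
    mean_rest = np.mean(rest_values)
    return np.log2((mean_group + pseudocount) / (mean_rest + pseudocount))


# A four-fold difference in means corresponds to a log2 fold change of about 2
lfc = log2_fold_change([4.0, 4.0], [1.0, 1.0])
```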

falexwolf · 2019-04-28T18:37:34Z

scanpy/utils.py

+
+# Would an array be faster?
+@doc_params(raw_layer_params=_docs.doc_raw_layers)
+def obs_values(


Is there some duplication with https://github.com/theislab/anndata/blob/34f4eb63710628fbc15e7050e5efcac1d7806062/anndata/base.py#L1464?

I think we could have a public function in AnnData for this purpose.

Your implementation in combination with obs_values_df is definitely better. But I really think it should go into AnnData, next to .to_df() which right now just gives the data matrix, but one could give it the options you have below.

Definitely duplication, I hadn't realized we had that function in AnnData and was just going off how we assign colors for scatter plots. I agree it's better to have one function for this, and AnnData makes sense for where to put it.

Do all the current arguments make sense for AnnData? For example, what about gene_symbols? If AnnData is meant to be agnostic to datatype, I could handle resolving gene_symbols in obs_values_df?

Also, these functions have different conventions for layers. Which one should we standardize on?

Right, gene_symbols doesn't make sense for AnnData! Your suggestion makes sense.
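One way the split could look: the Scanpy-side helper translates gene symbols to variable names first, and only then asks AnnData for values. The sketch below is a hypothetical illustration using a plain `var` dataframe; the function name and the made-up identifiers are assumptions.

```python
import pandas as pd


def resolve_gene_symbols(var, keys, gene_symbols):
    """Replace keys that match var[gene_symbols] with the matching var index entry.

    Keys that are not gene symbols (for example obs column names) pass through.
    """
    symbols = var[gene_symbols]
    resolved = []
    for key in keys:
        matches = var.index[(symbols == key).to_numpy()]
        if len(matches) > 1:
            raise ValueError(f"Gene symbol {key!r} is not unique")
        resolved.append(matches[0] if len(matches) == 1 else key)
    return resolved


# Made-up identifiers, for illustration only
var = pd.DataFrame({"symbol": ["CD3E", "MS4A1"]}, index=["gene_id_1", "gene_id_2"])
resolved = resolve_gene_symbols(var, ["louvain", "MS4A1"], gene_symbols="symbol")
```

With this in place, the AnnData accessor only ever sees entries of `var_names` and stays agnostic to what the variables are.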

Maybe Fidel didn't use the original https://github.com/theislab/scanpy/blob/9b522f54e0f839e1a0c9874ca658400bfe79a894/scanpy/plotting/_anndata.py#L311 in his functions? Unfortunate if I didn't notice. I don't have a strong opinion on the convention for layers as long as it's there. We could also have a Slack discussion with Fidel, if you think there might be issues.

Ah, I was looking at this function: https://github.com/theislab/scanpy/blob/9e4e5ee02e04cf618872d9b098e24f0542e8b227/scanpy/plotting/_tools/scatterplots.py#L651-L736

I don't think there's an issue per se, I just think it'd be easier for me to follow/debug the plotting functions if they were a little more standardized. We pretty frequently want a dataframe (or at least aligned arrays) of values per cell or gene, but this is done in a variety of ways.

On layers, it looks like scanpy uses layer=None as the default and anndata does layer='X' (via find . -name "*.py" -exec grep -n "layer=" {} +).

I prefer the scanpy style, since if someone specifies layer='X' and they actually have a layer named 'X' they probably want to use that.
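The ambiguity can be shown with two toy selectors, one per convention. Both function names are hypothetical and the "matrices" are just strings, so this only illustrates the argument and is not code from either library.

```python
def choose_matrix_none_default(X, layers, layer=None):
    """layer=None means the main matrix; any string is looked up in layers."""
    if layer is None:
        return X
    return layers[layer]


def choose_matrix_x_sentinel(X, layers, layer="X"):
    """layer='X' is a reserved name for the main matrix."""
    if layer == "X":
        return X
    return layers[layer]


X = "main matrix"
layers = {"X": "user layer named X", "counts": "raw counts"}

# With a None default, a layer that happens to be called 'X' is still reachable
from_none_style = choose_matrix_none_default(X, layers, layer="X")
# With the sentinel, the same request silently returns the main matrix
from_sentinel_style = choose_matrix_x_sentinel(X, layers, layer="X")
```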

Oh, yes, I also prefer the Scanpy style. @Koncopd built most of the layers for AnnData, any reason to move away from 'X'? Any reason why you chose to do it that way?

> I don't think there's an issue per se, I just think it'd be easier for me to follow/debug the plotting functions if they were a little more standardized. We pretty frequently want a dataframe (or at least aligned arrays) of values per cell or gene, but this is done in a variety of ways.

100% agreed.

@ivirshup Could you please give an example where we have layer='X'? I don't see it.
@falexwolf Not sure, maybe I had loom as a model; there is a similar thing there, if I remember correctly.

Ah, I see. I'm not sure I wrote these.

ivirshup · 2019-04-30T03:32:17Z

Since the obs_values_df will now depend on an updated version of AnnData, I'm thinking I'll move this version of rank_genes_groups_df over to #467 so that can get merged.

Edit: Actually, this isn't the case since we'll need backwards compatibility anyways, nvm

ivirshup · 2019-04-30T04:49:41Z

Does obs_values_df need the _df, or could it be obs_values?

falexwolf · 2019-04-30T11:26:56Z

I think obs_values is fine. But maybe, aggregate_obs is even better, as this describes what it does (aggregating annotations of observations with partial (projections of) observations).

It's no problem at all to make the next Scanpy release depend on the current AnnData release, both in the requirements and the minimal version check upon importing Scanpy.
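A minimal sketch of what an import-time minimum version check could look like, using only the standard library. The helper names and the version strings are illustrative assumptions; a real package would more likely rely on `packaging.version` for parsing.

```python
def parse_version(version):
    """Parse the leading numeric components, e.g. '0.6.22.post1' -> (0, 6, 22)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if digits == "":
            # Stop at the first non-numeric component such as 'post1'
            break
        parts.append(int(digits))
    return tuple(parts)


def check_min_version(name, installed, minimum):
    """Raise ImportError if the installed version is older than the minimum."""
    if parse_version(installed) < parse_version(minimum):
        raise ImportError(
            f"{name} >= {minimum} is required, but {installed} is installed"
        )


# Passes silently when the requirement is met
check_min_version("anndata", "0.6.22.post1", "0.6.21")
```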

ivirshup · 2019-05-01T02:06:51Z

To me, aggregate_obs implies a reduction, but that might mostly be because of pd.DataFrame.aggregate. I could go for collect_obs, but if this goes into sc.get there's already a verb. I think sc.get.obs_df would work (though it's kind of ambiguous with adata.obs), as would sc.get.obs_values.
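The pandas behaviour behind that association, as a small example with made-up column names: `DataFrame.aggregate` with a single function collapses the rows, returning one value per column.

```python
import pandas as pd

df = pd.DataFrame(
    {"n_genes": [100, 200, 300], "percent_mito": [0.1, 0.2, 0.3]}
)

# Three rows go in, a Series with one entry per column comes out
reduced = df.aggregate("mean")
```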

falexwolf · 2019-05-06T09:09:26Z

sc.get.select_obs (meaning selected_obs)? You're right about aggregate. obs_values is fine, too (it's really just that it reminds me of .values).

falexwolf · 2019-05-16T09:58:01Z

Should we merge this? Or is this going to partly end up in AnnData? :)

ivirshup · 2019-05-16T10:33:30Z

Kinda? This is waiting on scverse/anndata#144 and a following AnnData point release.

But I think that PR is ready to go.

ivirshup · 2019-06-10T06:08:49Z

Getting back around to this, I think it's pretty close to ready.

Two last things to consider:

* Name change of obs_values_df. I like obs_df since it fits with obs_vector and has a nice short name.
* Support for .raw. I've dropped it for the moment, but maybe I should add it back in.

ivirshup · 2019-06-17T09:48:17Z

Also, should this go in sc.get?

ivirshup · 2019-06-24T04:22:34Z

I think this functionality is ready to go. I'm going to try using this a bit before deciding if it needs an argument for raw. I'm going to merge this so I can use the sc.get.rank_genes_groups for #467. @flying-sheep, let me know if I messed up any docs and I'll fix it separately.

flying-sheep · 2019-06-24T07:14:13Z

docs/release_notes.rst

@@ -12,6 +12,7 @@ Post v1.4 :small:`May 13, 2019`

 New functionality:

+- New module :ref:`sc.get<module-get>` adds helper functions for extracting data in convenient formats :pr:`619` :smaller:`thanks to I Virshup`


why not :mod:`scanpy.get`?

I think it should link to the docs, which :mod:`scanpy.get` didn't seem to do.

It could definitely get changed to :ref:`scanpy.get<module-get>`, though I'm not sure how the styling can be fixed to look like a module.

It can’t. We should fix that :mod:`scanpy.get` doesn’t link to the proper location.

I guess that could be done by creating the module entry via .. module:: scanpy.get directly before using .. autosummary::

Fixed it! 0485cb8

I just put the module link targets at the correct position and immediately reset the “current module” to scanpy each time, so that relative links still work!

flying-sheep · 2019-06-24T07:16:52Z

The docs look great! I just wonder about the above: In the release notes, we refer to everything as scanpy.*, not sc.*

LuckyMD · 2019-06-24T11:47:46Z

Do we have some kind of tutorial around the new sc.get module?

Add data access helpers to new module `sc.get`.

ivirshup mentioned this pull request Apr 25, 2019

t-tests fails when variance of both groups is 0 #620

Closed

falexwolf reviewed Apr 28, 2019

falexwolf mentioned this pull request Apr 28, 2019

Dotplot where sizes are proportional to p-value and the color to log2-fold change? #562

Open

ivirshup mentioned this pull request Apr 30, 2019

Add obs_array and var_array functions scverse/anndata#144

Merged

ivirshup mentioned this pull request Apr 30, 2019

inconsistent slicing of .X vs .layer scverse/anndata#145

Closed

LuckyMD mentioned this pull request May 22, 2019

Linking patient data with cells #658

Open

ivirshup force-pushed the plotting_helpers branch from 83c069e to 27ac65a Compare June 10, 2019 05:42

ivirshup added 11 commits June 24, 2019 13:32

First draft of plotting helper funcs

9d76eba

Added obsm_keys arg and docs to obs_values_df

4eb7e00

Update rank_genes_groups_df docs

49ea341

Update rank_genes_group_df and test it

0ad19dc

Update rank_genes_groups_df and obs_values_df docs.

f832e98

Switch out obs_values for adata._get_obs_array

cf3d618

Changed logfc to log2fc

6c40438

Update obs_values_df

9876f6a

* Now uses `AnnData.obs_vector`
* Bump required version of AnnData to `0.6.21` to allow this
* No longer supports raw

Update obs_values_df tests

3f229dd

Minor fixup and formatting

389499e

Rename to obs_df, add var_df

4e9b81c

ivirshup added 4 commits June 24, 2019 13:33

Improve support for array-likes in obs_df and var_df

6797e42

Moved data accesors to new module sc.get

6176db3

Based on discussion from in: scverse#562

Update docs for sc.get

dcaad1c

Update release notes for sc.get

faa8d07

ivirshup force-pushed the plotting_helpers branch from fe26185 to faa8d07 Compare June 24, 2019 03:35

ivirshup merged commit a5e9806 into scverse:master Jun 24, 2019

flying-sheep reviewed Jun 24, 2019

awnimo pushed a commit to dpeerlab/scanpy that referenced this pull request Dec 17, 2019

Merge pull request scverse#619 from ivirshup/plotting_helpers

9fd9804

Add data access helpers to new module `sc.get`.
