Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add some data access helpers to utils #619

Merged
merged 15 commits into from Jun 24, 2019

Conversation

@ivirshup
Copy link
Collaborator

commented Apr 25, 2019

This adds two new convenience functions to utils.

obs_values_df

Basically does the data access part of the scatter plots (actually copied the core of the code from there). Basically, lets you get a data frame of values from obs, obsm, and expression matrix back as a dataframe. I'd planned on this being the data access part of ridge_plot PR, but I've found it generally useful for data access. Also finding a feature-ful KDE that isn't buggy has been an issue for the ridge plots.

This uses the obsm access I had suggested to @gokceneraslan in #613.

I'm also open to adding a var_values_df to this PR, I just haven't had a use case yet.

rank_genes_groups_df

Returns a dataframe of differential expression results, because accessing DE results right now is a pain. This was a part of #467, but I can just remove it from there.

Whats left to do.

Docs, but it's boilerplate. Do we have centralized docstrings for things like gene_symbols, use_raw, layers, and adata?

@falexwolf

This comment has been minimized.

Copy link
Member

commented Apr 26, 2019

This looks very good!

No, we don't yet have centralized docs for gene_symbols, adata etc. but of course, we should have them. Maybe even just in the root scanpy/_docs.py. What do you think?

@ivirshup

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 27, 2019

Thanks!

I've updated the docs, but it turned out not much was actually shared. Where should I put these in the api docs? A new utils section, or something under Further modules? I'm thinking I'd just include rank_genes_groups_df and obs_values_df on the site.

Key differential expression groups were stored under.
pval_cutoff
Minimum adjusted pval to return.
logfc_min

This comment has been minimized.

Copy link
@falexwolf

falexwolf Apr 28, 2019

Member

After some discussions, we thought that log2FC is the most unambiguous and most commonly used name, and we would rename the slot this way in v2.0 (#453 (comment)). Do you agree? Would you adapt the parameter names?

This comment has been minimized.

Copy link
@ivirshup

ivirshup Apr 29, 2019

Author Collaborator

Sure! No strong opinion on this, as long as we're definitely calculating a log2 fold change.

This comment has been minimized.

Copy link
@falexwolf

falexwolf Apr 29, 2019

Member

Yes, we're definitely calculating log2. I think this is an established convention.

This comment has been minimized.

Copy link
@ivirshup

ivirshup Apr 30, 2019

Author Collaborator

Just checked to make sure I did it right, but turns out I didn't...

Still: log2fc or log2FC? My pinkies would prefer less caps, but I'd probably also tab complete it most times.

This comment has been minimized.

Copy link
@falexwolf

falexwolf Apr 30, 2019

Member

I'm fine with log2fc but I think diffxpy uses log2FC and if you think about an axes label, the capped version might be more appealing. Your call.

This comment has been minimized.

Copy link
@falexwolf

falexwolf Apr 30, 2019

Member

But, please update #453 (comment). 🙂

This comment has been minimized.

Copy link
@ivirshup

ivirshup Apr 30, 2019

Author Collaborator

It looks like diffxpy uses "log2fc", at least as of theislab/diffxpy@0054f90, so I think I'll go with that


# Would an array be faster?
@doc_params(raw_layer_params=_docs.doc_raw_layers)
def obs_values(

This comment has been minimized.

Copy link
@falexwolf

falexwolf Apr 28, 2019

Member

Is there some duplicatation with https://github.com/theislab/anndata/blob/34f4eb63710628fbc15e7050e5efcac1d7806062/anndata/base.py#L1464?

I think we could have a public function in AnnData for this purpose.

This comment has been minimized.

Copy link
@falexwolf

falexwolf Apr 28, 2019

Member

Your implementation in combination with obs_values_df is definitely better. But I really think it should go into AnnData, next to .to_df() which right now just gives the data matrix, but one could give it the options you have below.

This comment has been minimized.

Copy link
@ivirshup

ivirshup Apr 29, 2019

Author Collaborator

Definitely duplication, I hadn't realized we had that function in AnnData and was just going off how we assign colors for scatter plots. I agree it's better to have one function for this, and AnnData makes sense for where to put it.

Do all the current argument make sense for AnnData? For example, what about gene_symbols? If AnnData is meant to be agnostic to datatype, I could handle resolving gene_symbols in obs_values_df?

Also, these functions have different conventions for layers. Which one should we standardize on?

This comment has been minimized.

Copy link
@falexwolf

falexwolf Apr 29, 2019

Member

Right, gene_symbols doesn't make sense for AnnData! Your suggestion makes sense.

Maybe Fidel didn't use the original

x_arr = adata._get_obs_array(x, use_raw=use_raw, layer=layers[0])
in his functions? Unfortunate if I didn't notice. I don't have strong opinion on the convention for layers as long as it's there. We could also have a Slack discussion with Fidel, if you think there might be issues.

This comment has been minimized.

Copy link
@ivirshup

ivirshup Apr 30, 2019

Author Collaborator

Ah, I was looking at this function:

def _get_color_values(adata, value_to_plot, groups=None, palette=None, use_raw=False,
gene_symbols=None, layer=None):
"""
Returns the value or color associated to each data point.
For categorical data, the return value is list of colors taken
from the category palette or from the given `palette` value.
For non-categorical data, the values are returned
"""
###
# when plotting, the color of the dots is determined for each plot
# the data is either categorical or continuous and the data could be in
# 'obs' or in 'var'
categorical = False
if value_to_plot is None:
color_vector = 'lightgray'
# check if value to plot is in obs
elif value_to_plot in adata.obs.columns:
if is_categorical_dtype(adata.obs[value_to_plot]):
categorical = True
if palette:
# use category colors base on given palette
_set_colors_for_categorical_obs(adata, value_to_plot, palette)
else:
if value_to_plot + '_colors' not in adata.uns or \
len(adata.uns[value_to_plot + '_colors']) < len(adata.obs[value_to_plot].cat.categories):
# set a default palette in case that no colors or few colors are found
_set_default_colors_for_categorical_obs(adata, value_to_plot)
else:
# check that the colors in 'uns' are valid
_palette = []
for color in adata.uns[value_to_plot + '_colors']:
if not is_color_like(color):
# check if the color is a valid R color and translate it
# to a valid hex color value
if color in utils.additional_colors:
color = utils.additional_colors[color]
else:
logg.warn("The following color value found in adata.uns['{}'] "
" is not valid: '{}'. Default colors are used.".format(value_to_plot + '_colors', color))
_set_default_colors_for_categorical_obs(adata, value_to_plot)
_palette = None
break
_palette.append(color)
if _palette is not None:
adata.uns[value_to_plot + '_colors'] = _palette
# for categorical data, colors should be
# stored in adata.uns[value_to_plot + '_colors']
# Obtain color vector by converting every category
# into its respective color
color_vector = [adata.uns[value_to_plot + '_colors'][x] for x in adata.obs[value_to_plot].cat.codes]
if groups is not None:
if isinstance(groups, str):
groups = [groups]
color_vector = np.array(color_vector, dtype='<U15')
# set color to 'light gray' for all values
# that are not in the groups
color_vector[~adata.obs[value_to_plot].isin(groups)] = "lightgray"
else:
color_vector = adata.obs[value_to_plot].values
# when value_to_plot is not in adata.obs
else:
if gene_symbols is not None and gene_symbols in adata.var.columns:
if value_to_plot not in adata.var[gene_symbols].values:
logg.error("Gene symbol {!r} not found in given gene_symbols "
"column: {!r}".format(value_to_plot, gene_symbols))
return
value_to_plot = adata.var[adata.var[gene_symbols] == value_to_plot].index[0]
if layer is not None and value_to_plot in adata.var_names:
if layer not in adata.layers.keys():
raise KeyError('Selected layer: {} is not in the layers list. The list of '
'valid layers is: {}'.format(layer, adata.layers.keys()))
color_vector = adata[:, value_to_plot].layers[layer]
elif use_raw and value_to_plot in adata.raw.var_names:
color_vector = adata.raw[:, value_to_plot].X
elif value_to_plot in adata.var_names:
color_vector = adata[:, value_to_plot].X
else:
raise ValueError("The passed `color` {} is not a valid observation annotation "
"or variable name. Valid observation annotation keys are: {}"
.format(value_to_plot, adata.obs.columns))
return color_vector, categorical

I don't think there's an issue per-say, I just think it'd be easier for me to follow/ debug the plotting functions if they were a little more standardized. We pretty frequently want a dataframe (or at least aligned arrays) of values per cell or gene, but this is done in a variety of ways.

This comment has been minimized.

Copy link
@ivirshup

ivirshup Apr 30, 2019

Author Collaborator

On layers, it looks like scanpy uses layer=None as the default and anndata does layer='X' (via find . -name "*.py" -exec grep -n "layer=" {} +).

I prefer the scanpy style, since if someone specifies layer='X' and they actually have a layer named 'X' they probably want to use that.

This comment has been minimized.

Copy link
@falexwolf

falexwolf Apr 30, 2019

Member

Oh, yes, I also prefer the Scanpy style. @Koncopd built most of the layers for AnnData, any reason to move away from 'X'? Any reason why you chose to do it that way?

This comment has been minimized.

Copy link
@falexwolf

falexwolf Apr 30, 2019

Member

I don't think there's an issue per-say, I just think it'd be easier for me to follow/ debug the plotting functions if they were a little more standardized. We pretty frequently want a dataframe (or at least aligned arrays) of values per cell or gene, but this is done in a variety of ways.

100% agreed.

This comment has been minimized.

Copy link
@Koncopd

Koncopd Apr 30, 2019

Member

@ivirshup Could you please give an example where we have layer='X', i don't see it.
@falexwolf not sure, maybe i had loom as a model, there is a similar thimg there if i remember correctly.

This comment has been minimized.

Copy link
@Koncopd

Koncopd Apr 30, 2019

Member

Ah, i see. I'm not sure i wrote these.

@ivirshup

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 30, 2019

Since the obs_values_df will now depend on an updated version of AnnData, I'm thinking I'll move this version of rank_genes_groups_df over to #467 so that can get merged.

Edit: Actually, this isn't the case since we'll need backwards compatibility anyways, nvm

@ivirshup

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 30, 2019

Does obs_values_df need the _df, or could it be obs_values?

@falexwolf

This comment has been minimized.

Copy link
Member

commented Apr 30, 2019

I think obs_values is fine. But maybe, aggregate_obs is even better, as this describes what it does (aggregating annotations of observations with partial (projections of) observations).

It's no problem at all to make the next Scanpy release depend on the current AnnData release, both in the requirements and the minimal version check upon importing Scanpy.

@ivirshup

This comment has been minimized.

Copy link
Collaborator Author

commented May 1, 2019

To me, aggregate_obs implies a reduction, but that might mostly be because pd.DataFrame.aggregate. I could go for collect_obs, but if this goes into sc.get there's already a verb. I think sc.get.obs_df would work (though kind of ambiguous with adata.obs) as would sc.get.obs_values.

@falexwolf

This comment has been minimized.

Copy link
Member

commented May 6, 2019

sc.get.select_obs (meaning selected_obs)? You're right about aggregate. obs_values is fine, too (it's really just that it reminds of .values).

@falexwolf

This comment has been minimized.

Copy link
Member

commented May 16, 2019

Should we merge this? Or is this going to partly end up in AnnData? :)

@ivirshup

This comment has been minimized.

Copy link
Collaborator Author

commented May 16, 2019

Kinda? This is waiting on theislab/anndata#144 and a following AnnData point release.

But I think that PR is ready to go.

@ivirshup ivirshup force-pushed the ivirshup:plotting_helpers branch from 83c069e to 27ac65a Jun 10, 2019

@ivirshup

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 10, 2019

Getting back around to this, I think it's pretty close to ready.

Two last things to consider:

  • Name change of obs_values_df. I likeobs_df since it fits with obs_vector and has a nice short name.
  • Support for .raw, I've dropped it at the moment, but maybe should add it back in.
@ivirshup

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 17, 2019

Also, should this go in sc.get?

@ivirshup ivirshup force-pushed the ivirshup:plotting_helpers branch from fe26185 to faa8d07 Jun 24, 2019

@ivirshup

This comment has been minimized.

Copy link
Collaborator Author

commented Jun 24, 2019

I think this functionality is ready to go. I'm going to try using this a bit before deciding if it needs an argument for raw. I'm going to merge this so I can use the sc.get.rank_genes_groups for #467. @flying-sheep, let me know if I messed up any docs and I'll fix it separately.

@ivirshup ivirshup merged commit a5e9806 into theislab:master Jun 24, 2019

2 checks passed

Codacy/PR Quality Review Up to standards. A positive pull request.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@@ -12,6 +12,7 @@ Post v1.4 :small:`May 13, 2019`

New functionality:

- New module :ref:`sc.get<module-get>` adds helper functions for extracting data in convenient formats :pr:`619` :smaller:`thanks to I Virshup`

This comment has been minimized.

Copy link
@flying-sheep

flying-sheep Jun 24, 2019

Member

why not :mod:`scanpy.get`?

This comment has been minimized.

Copy link
@ivirshup

ivirshup Jun 24, 2019

Author Collaborator

I think it should link to the docs, which :mod:`scanpy.get` didn't seem to do.

This comment has been minimized.

Copy link
@ivirshup

ivirshup Jun 24, 2019

Author Collaborator

It could definitely get changed to :ref:`scanpy.get<module-get>` , though I'm not sure if how the styling can be fixed to look like a module.

This comment has been minimized.

Copy link
@flying-sheep

flying-sheep Jun 24, 2019

Member

It can’t. We should fix that :mod:`scanpy.get` doesn’t link to the proper location.

I guess that could be done by creating the module entry via .. module:: scanpy.get directly before using .. autosummary::

This comment has been minimized.

Copy link
@flying-sheep

flying-sheep Jun 24, 2019

Member

Fixed it! 0485cb8

I just put the module link targets at the correct position and immediately reset the “current module” to scanpy each time, so that relative links still work!

@flying-sheep

This comment has been minimized.

Copy link
Member

commented Jun 24, 2019

The docs look great! I just wonder about the above: In the release notes, we refer to everything as scanpy.*, not sc.*

@LuckyMD

This comment has been minimized.

Copy link
Contributor

commented Jun 24, 2019

Do we have some kind of tutorial around the new sc.get module?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.