Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dotplot where sizes are proportional to p-value and the color to log2-fold change? #562

Open
falexwolf opened this issue Mar 26, 2019 · 21 comments

Comments

@falexwolf
Copy link
Member

@fidelram, as discussed today, could we adopt pl.rank_genes_groups_dotplot so that it reads this information from .uns['rank_genes_groups']?

Maybe just a simple switch? Or having arguments color and size be a choice from a selection {pvals, pvals_adj, log2FC, expression, frac-genes-expressed}.

@fidelram
Copy link
Collaborator

fidelram commented Mar 27, 2019 via email

@gokceneraslan
Copy link
Collaborator

gokceneraslan commented Apr 5, 2019

@fidelram, as discussed today, could we adopt pl.rank_genes_groups_dotplot so that it reads this information from .uns['rank_genes_groups']?

Maybe just a simple switch? Or having arguments color and size be a choice from a selection {pvals, pvals_adj, log2FC, expression, frac-genes-expressed}.

I would also love that actually 😄

rank_genes_groups results (LFC, p-val etc) and things like mean-expression, frac-genes-expressed are all cluster-specific features, which reminds me of an obs-like structure with clusters in rows. Right now, mean expression and fractions are calculated in a private function (_prepare_dataframe) in plotting/_anndata.py but we can move this to utils.py or so, call it in sc.tl.rank_genes_groups() and store the resulting data frame in ad.uns?

@fidelram
Copy link
Collaborator

fidelram commented Apr 8, 2019

@gokceneraslan the idea of a obs-like structure for clusters or in general for categorical data is interesting. Currently, in .uns we are using numpy structured arrays for ranked genes which are not so easy to navigate. A obs-like object using pandas could facilitate data manipulation and visualization.

@falexwolf should we give it a try or do you see problem with this?

@LuckyMD
Copy link
Contributor

LuckyMD commented Apr 15, 2019

If this is stored in .uns we would have to wait for pandas support for .uns.

@flying-sheep
Copy link
Member

Which is tracked in #134 I think.

@ivirshup
Copy link
Member

Why is it that .obs, .var, and .uns don't have data frames in them? np.recarray don't seem like a very popular data structure elsewhere.

Also, I'd like to suggest that storing all differential expression within the anndata object might get complicated, and deserve it's own class. It'd be nice if it could be easy to tell what cells and genes were compared, what exactly was being tested, and which direction is "up". That said, the results should definitely be easily accessible as a data frame.

@flying-sheep
Copy link
Member

Huh? obs and var have been DataFrames for a long time now. and uns is a dict.

The reason why they haven’t originally been that is that DataFrames are super complicated and have a giant API. We thought having something simpler would be easier to work with, but in the end convenience won.

@ivirshup
Copy link
Member

Ah, totally miswrote that. I meant obsm and varm.

@flying-sheep
Copy link
Member

I think enabling that .obsm['X_pca'] by subclassing pandas DataFrames would have been a clusterfuck.

@ivirshup
Copy link
Member

subclassing pandas DataFrames would have been a clusterfuck.

Definitely would be. I'd thought the alternative would be to subclass something like a dict, but have the setters check that the values have the right size. Also taking subsets across all the values when subsetting an AnnData, but that's just adding a loop.

@flying-sheep
Copy link
Member

Sure, but I think it’s about the same effort to subclass a mapping while adding array features compared to the other way round.

@ivirshup
Copy link
Member

Back on the topic of getting dot plots a bit more flexible, I've been working on an approach that could work. You can check it out in this binder environment, but it's based on two main ideas:

  1. It'd be nice if there were an easy way to get aggregated values for groups, so I've added a crude groupby to AnnData
  2. Our differential expression results are like a 3d array, with axes ["genes", "group", "values"] where values are things like p-values and mean expression.

Here's a quick example of the output:

image

@LuckyMD
Copy link
Contributor

LuckyMD commented Apr 23, 2019

This looks really cool... but then I haven't used dot plots much before, so not sure what this is replacing...

I just wonder if you can put different thresholds on the pvals_adj colour bar, such that every p-value > 0.05 is grey or something like that... Or even better would be to put the colour bar on a log scale.

@falexwolf
Copy link
Member Author

obs-like structure with clusters in rows

Completely agreed!

  1. agreed with @ivirshup that there should be a more comprehensive object (which can possibly simply be stored as a dataframe and params in .uns['rank_genes_groups'], that clarify what the reference for the test was, but that might be not powerful enough)... your latest suggestion, @ivirshup, representing things as in 3d array sounds very promising, too... how to make an intuitive object? represent the 3d array in a long-form dataframe where two axes are accessible from one multi-index? or store an actual 3d array in AnnData, which can be cast into a convenient object, through a casting namespace... the logic being sc.tl.... computes some complicated annotation, sc.pl... visualizes this annotation and sc.ex.... extracts and casts annotation into more easily manageable objects. One example is sc.Neigbors (which should go into sc.ex...) which takes the weird annotation that sc.pp.neighbors writes and casts them into an object that allows accessing things...
  2. Related, but really independent of rank_genes_groups: I had implemented a draft of a .collapse() function, which is very similar to the .groupby() function that @ivirshup suggests, but much less elegant (I would also never have put it into the main repo...). You take a summary metric like .mean() or .std() and collapse the object by that (in pandas, would be df.groupby('louvain').mean().

Why is it that .obs, .var, and .uns don't have data frames in them? np.recarray don't seem like a very popular data structure elsewhere.

We just did only allow rec arrays in .uns as they are natively supported by hdf5 and dataframes aren't. It was really just that reason, nothing else. As mentioned in anndata, I'd love to completely move away from rec arrays as a means of aggregating return values.

@ivirshup
Copy link
Member

  1. I've been playing around with xarray and finding the Dataset objects fairly intuitive for storing multidimensional arrays. I think it makes sense to store calculated values like this, but give easy access to a long (/tidy) dataframe (similar to that binder notebook). I think representing it internally as a tidy dataframe could be inefficient, since that's pretty close to 100% dense COO matrix. My impression is this is broadly similar to how diffxpy is representing it's results.

I think there are also two separate problems here, which are "what's a better way to store differential expression results" and "what's a good api for differential expression".

I'm interested in the sc.ex module you're suggesting. Would you mind elaborating a bit more on that, particularly on some functions that would be there?

  1. I'd really like to get a generalized version of this implemented. Right now, I think the biggest thing holding it back is being smart about how sparse matrices are handled, but otherwise xarray has a good model for the semantics.

@falexwolf
Copy link
Member Author

Sounds great!

Re tidy: Storing things internally in tidy format also seems inefficient to me... I remember a long discussion with Philipp more than 2 years ago... 😄

Re diffxpy: If you say that diffxpy has a good solution, why should we build a new one? Can't we just use their solution?

I think there are also two separate problems here, which are "what's a better way to store differential expression results" and "what's a good API for differential expression".

Completely agreed.

I'm interested in the sc.ex module you're suggesting. Would you mind elaborating a bit more on that, particularly on some functions that would be there?

Re sc.extract

One of the core ideas of Scanpy (as opposed to, say, scikit learn) was to have this model of taking the burden of bookkeeping from the user as much as possible. This design messed up, in particular, the return values of rank_genes_groups. I would have loved to return a collection of dataframes, but I didn't want to mess this up. Also, the return values of pp.neighbors or pl.paga aren't great.

There is a trade-off between having nice APIs and return values (such as dataframes) and a transparent and efficient on-disk representation in terms of HDF5, zarr or another format. These days, I'd even consider simply pickling things, which would have saved us a lot of work; but I thought that we'd need established compression facilities, concatenation possibilities, some way to manually "look into" an on-disk object (both from R and from the command line) so that it's maximally transparent and then the widely established, cross-language, but old-school and not entirely scalable HDF5 seemed the best. The Human Cell Atlas decided in favor of zarr meanwhile. But that's not a drama, because Scanpy only writes "storage-friendly" values to AnnData, that is, arrays and dicts. HDF5 knows how to handle them and zarr also. If one uses xarray or dataframes, one has to think about how this gets written to disk.

That being said: it's likely that we'll continue to choose representations for on-disk (and in-memory) storage that aren't convenient (rec arrays, for instance), a three-dimensional xarray and dicts.

A general solution for this problem would be the mentioned sc.extract API, similar to sc.plotting (which also completely hides the complexity of the object from the user), but not for returning visualizations, but nice objects.

The first function in that namespace should be sc.ex.neighbors, which should return an instance of sc.Neighbors (which can then disappear from the root API). Similarly, when sc.pp.neighbors is called with inplace=False, one should directly get a Neighbors object returned.

Now, we can apply this logic to every single function that doesn't have a simple return value. Upon calling the function with inplace=False, you'll get a "nice" object that is convenient to handle. If you call a function sc.tl.function in a pipeline with inplace=True but later on, you'll want this nice object, you'd call sc.ex.function.

I think DataFrames (a case like tl.marker_gene_overlap) should definitely be handled within AnnData and no extract function is necessary. But the differential expression result is a prime example for such a case. I think a function rank_genes_groups that returns a RankGenesGroups object, which then has .to_df() function (e.g. the function rank_genes_groups from (#619) could immediately go into that namespace. Maybe we can even borrow a diffxpy object for that. The good thing is, we can keep the current rec arrays as they are very efficient and basic data types, which will work with hdf5 and zarr and xarray and everything else that might come in the future. And: Fidel wrote a ton of plotting functions around them already, which we don't want to simply rewrite... We don't have to as users won't see the recarrays anymore...

Other possible names for the API would be sc.cast or sc.object (sc.ob), less conflicting with sc.external. I think sc.ob makes sense as it really makes clear that Scanpy's main API is for writing convenient scripts for compute-heavy stuff in a functional way. If one wants to transition to more light-weight "post-analysis", one can transition to objects that are designed for specific tasks.

PS: I'd love to move away from the name rank_genes_groups at some point, and simply have something like difftest or DiffTest... I always thought that we might have differential expression tests for longitudinal data at some point (like Monocle), otherwise the function would be rank_genes but I don't think this is gonna happen soon, and if, it will be in the external API... A minimal difftest API should though continue be in the core of Scanpy, with at its heart, a scalable Wilcoxon rank (much more scalable than scipy's or diffxpy's), the t test and the scikit learn logreg approach. diffxpy with it's tensorflow dependency can then handle very complex cases...

@LuckyMD
Copy link
Contributor

LuckyMD commented Apr 29, 2019

I really like the sc.extract idea (or sc.cast/sc.object). It would be pretty cool if that went both ways though. For example, if I want to generate differential expression results with diffxpy, store everything in my AnnData object, and visualize later with sc.pl.rank_genes_groups_violin(), this could be done by some kind of sc.read.diffxpy() function. Maybe you could just extend sc.read and sc.write and make them into larger modules? Or rename both to sc.io?

@ivirshup
Copy link
Member

I like the idea for the "nice data accessors module"! Maybe sc.get? sc.get.obs_values(adata, ...), sc.get.neighbors(adata, ...). We'd definitely have to be sure we're returning a nice object.

| If you say that diffxpy has a good solution, why should we build a new one? Can't we just use their solution?

I think this would involve throwing away recarrays, unless someone wants to write a converter (not me). I'm also not so sure how mature/ stable diffxpy is, but Theis lab people might have a better sense of that?

| That being said: it's likely that we'll continue to choose representations for on-disk

I like that the current representations are pretty easy to read in other languages as they're mostly standard hdf5 types. I think there are definitely cases where it make sense to break cross-compat, like complicated datastructures for a specific package (an index, for example).

| If one uses xarray or dataframes, one has to think about how this gets written to disk

My impression is xarray were designed to be similar to netCDF files, which are a subset of hdf5. pandas, on the other hand, has a pretty opaque hdf5 representation.

@falexwolf
Copy link
Member Author

sc.get is a good suggestion, too! I'd be fine with it.

diffxpy

@davidsebfischer: do you feel you have a mature solution for storing simple difftest results that could be reused for rank_genes_groups? If yes, can you point us to it? It might be that you don't as you have these relatively powerful objects that do a lot more than what we want in the context of a simple Wilcoxon Rank group-vs-reference comparison.

My impression is xarray were designed to be similar to netCDF files, which are a subset of hdf5. pandas, on the other hand, has a pretty opaque hdf5 representation.

If xarray does everything we want (sparse and categorical data), that would be great, of course. I was investigating pandas hdf5 early on and decided against it as it was very opaque (e.g., I couldn't see how to easily implement on-disk concatenation on it) and it didn't seem to offer performance gains.

@ivirshup
Copy link
Member

xarray doesn't do sparse :(. They're also holding off for csd/ csf formats in pydata/sparse I believe.

@flying-sheep
Copy link
Member

flying-sheep commented Jun 20, 2019

OK, we have those alternatives:

Alternative Pro Con
Keep everything as it is People will have the best unterstanding of its structure and not treat it as a black box Unwieldy
Subclass AnnData in scanpy and add accessor methods/attrs Nice API
  • Everyone would start using Scanpy’s AnnData subclass instead of the generic container that I think is a great design choice for extensibility
  • Hides AnnData structure
sc.get
  • Nice API
  • Separates Scanpy-specific API from AnnData API
Hides AnnData structure

I think sc.get is the best option here!

ivirshup added a commit to ivirshup/scanpy that referenced this issue Jun 24, 2019
ivirshup added a commit to ivirshup/scanpy that referenced this issue Jun 24, 2019
awnimo pushed a commit to dpeerlab/scanpy that referenced this issue Dec 17, 2019
fidelram added a commit that referenced this issue Jun 29, 2020
 * Unifies common code between `dotplot`, `matrixplot` and `stacked_violin` plots while adding flexibility to the plots. sc.pl.dotplot`, `sc.pl.matrixplot` and `sc.pl.stacked_violin` methods had been transformed into wrappers for the new `DotPlot`, `MatrixPlot` and `StackedVioling` classes. Accessing the new classes directly allows further fine tuning of the plots. 

* The new plot classes are all descendants of `BasePlot` class that captures the common code. The design of the classes follows the method chaining (as found in Pandas or Altair). This allows the addition of independent features (via well documented methods) to the plot without increasing the number parameters of a single function. This was first suggested here #956. All objects have consistent functions for `legend`, to set up titles and width, `style()` to set visual parameters specific to each plot like colormap, edge color, linewidth. `swap_axes` to transpose the figure, `add_dendrogram` with options to change the with of the dendrogran and `add_total` tho show a bar plot of the total number of cells per category. Also includes options to sort the categories.

 * Previous functionality is maintained but plots will look slightly different.

 *  This commit addresses issues from #979 and #1103 related to `sc.pl.dotplot` 

 * Now is possible to plot fold changes, log fold changes and p-values from `sc.tl.rank_genes_groups` as suggested in #562

Specific changes:
**all figures**:
 * Set a title to the image. 
 * Pass an `axe` where to plot the image.
 * Return a dictionary of axes for further manipulation
 * using `return_fig` the plot object can be used to adjust the proportions to the legend and other visual aspects can be fine tuned.
 * a bar plot with the totals per category can be added. This will align at the left of the image if the categories are plotted in rows or at the top if they are plotted in columns.
 * legend can be removed
 * `groupby` can be a list of categories. 

**dotplot**
 * Improved the colorbar and size legend for dotplots. Now the colorbar and size have titles, which can be modified using the `colorbar_title` and `size_title` arguments. They also align at the bottom of the image and do not shrink if the dotplot image is smaller.
 * Plot genes in rows and categories in columns (swap_axes).
 * Using the DotPlot object the dot_edge_color and line width can be set up, a grid added as well as several other features
 * `sc.pl.rank_genes_groups_dotplot` can now plot `pvals` and `log fold changes`

**matrixplot**
 * added title for colorbar and positioned as in dotplot
 * `sc.pl.rank_genes_groups_matrixplot` can now plot `pvals` and `log fold changes`

**stacked_violin**
 *  violin colors can be colored based on average gene expression as in dotplots
 * made the linewidth of the violin plots smaller.
 * removed the tics for the y axis as they tend to overlap with each other. Using the `style` method they can be visualized.

**other**
* `sc.pl.heatmap` and `sc.pl.trackplots` now return a dictionary of axes when `show=False` as for the other plots.
* now 'interpolation' can be passed as parameter for `sc.pl.heatmap`
@ivirshup ivirshup mentioned this issue Jul 26, 2021
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants