dotplot with x axis being one variable and y axis being another variable #1876

Open
1 of 5 tasks
zhangguy opened this issue Jun 15, 2021 · 5 comments

Comments

@zhangguy

  • Additional function parameters / changed functionality / changed defaults?
  • New analysis tool: A simple analysis tool you have been using and are missing in sc.tools?
  • New plotting function: A kind of plot you would like to see in sc.pl?
  • External tools: Do you know an existing package that should go into sc.external.*?
  • Other?

Hi,
I'm wondering whether a new feature could be added to sc.pl.dotplot, if it is not too much work. Say I'm interested in just one gene and want to plot its expression across two conditions. I understand that this can currently be done with groupby = ['var1', 'var2'], but the result has only one column, and the conditions are coerced into var1_var2. Could the plotting function get an option that changes this behavior? I want var1 on the x axis and var2 on the y axis.

Thank you very much!
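
To make the request concrete, here is a minimal pandas-only sketch on made-up data. The column names gene, var1 and var2 are placeholders, and the underscore join only mimics the combined label described above; it is not meant to reproduce scanpy's internals.

```python
import pandas as pd

# Toy per-cell table: one gene plus two grouping variables (all values made up).
df = pd.DataFrame({
    "gene": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "var1": ["a", "a", "b", "b", "a", "a", "b", "b"],
    "var2": ["x", "x", "x", "x", "y", "y", "y", "y"],
})

# Current behaviour, roughly: both keys collapse into one combined label,
# so the result is a single column with one row per var1_var2 pair.
combined = df["var1"] + "_" + df["var2"]
one_column = df.groupby(combined)["gene"].mean()

# Requested behaviour: var2 indexes the rows (y axis), var1 the columns (x axis).
grid = df.pivot_table(index="var2", columns="var1", values="gene", aggfunc="mean")
```

The second table holds the same four means as the first, just arranged as a 2 x 2 grid instead of a single column.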

@ivirshup
Member

That's an interesting idea and I see how it would be useful. I don't think it's going to be easy to implement, since I believe our code is heavily based around having groups of observations on one axis and groups of variables on the other.

Definitely something to keep in mind for a refactor though.

@ivirshup
Member

ivirshup commented Dec 6, 2021

@zhangguy, adding on to some thoughts from your PR #2055 (comment)

From my reading of that PR, you added a boolean argument groupby_expand which, when True, assumes groupby has two values: a grouping variable for the rows of the plot and a grouping variable for the columns of the plot. It also assumes var_names is a single variable, which is used to fill the cells in the plot. As an example:

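import numpy as np
import scanpy as sc

pbmc = sc.datasets.pbmc3k_processed().raw.to_adata()
pbmc.obs["sampleid"] = np.repeat(["s1", "s2"], pbmc.n_obs // 2)  # repeat count must be an integer

# groupby_expand is the argument proposed in PR #2055
sc.pl.dotplot(pbmc, var_names='LDHB', groupby=['louvain', 'sampleid'], groupby_expand=True)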

image

Instead of having an argument which changes the interpretation of the earlier arguments, I would prefer more orthogonal arguments.

I think you'd be able to get an output close to what you would currently like with:

import numpy as np
import scanpy as sc

pbmc = sc.datasets.pbmc3k_processed().raw.to_adata()
pbmc.obs["sampleid"] = np.repeat(["s1", "s2"], pbmc.n_obs // 2)  # repeat count must be an integer
df = sc.get.obs_df(pbmc, ["LDHB", "louvain", "sampleid"])

# Mean expression and number of nonzero cells per (louvain, sampleid) pair
summarized = df.pivot_table(
    index=["louvain", "sampleid"],
    values="LDHB",
    aggfunc=[np.mean, np.count_nonzero]
)
# Reshape so that louvain indexes the rows and sampleid the columns
color_df = summarized["mean"].unstack()
size_df = summarized["count_nonzero"].unstack()

# I don't think the var_names or groupby variables are actually important here
sc.pl.DotPlot(
    pbmc,
    var_names="LDHB",  groupby=["louvain", "sampleid"],  # Just here so it doesn't error
    dot_color_df=color_df, dot_size_df=size_df,
).style(cmap="Reds").show()
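
As a small illustration of what the two dataframes contain, here is a sketch on made-up data that does not need scanpy. It uses the fraction of nonzero cells for the size table instead of the raw count, on the assumption that dot sizes are meant to be fractions between 0 and 1; the column names simply mirror the example above.

```python
import pandas as pd

# Made-up expression values for one gene across two grouping variables.
df = pd.DataFrame({
    "LDHB": [0.0, 2.0, 0.0, 0.0, 1.0, 3.0, 4.0, 0.0],
    "louvain": ["B", "B", "B", "B", "T", "T", "T", "T"],
    "sampleid": ["s1", "s1", "s2", "s2", "s1", "s1", "s2", "s2"],
})

grouped = df.groupby(["louvain", "sampleid"])["LDHB"]

# Mean expression per (louvain, sampleid) pair, reshaped to louvain x sampleid.
color_df = grouped.mean().unstack()

# Fraction of cells with nonzero expression, same shape as color_df.
size_df = grouped.agg(lambda x: (x != 0).mean()).unstack()
```

Both tables end up with louvain on the rows and sampleid on the columns, which is the layout the dot color and dot size inputs need to share.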

I think this functionality could be more generic, taking inspiration from the pd.pivot_table function. This could end up looking like:

# Imaginary implementation:
sc.pl.heatmap(
    pbmc,
    var_names="LDHB",
    row_groups="louvain",
    col_groups="sampleid"
)

image

sc.pl.heatmap(
    pbmc,
    var_names=["LDHB", "LYZ", "CD79A"],
    row_groups="louvain",
    col_groups="sampleid"
)

image

What do you think about that?
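
One way to picture the row_groups/col_groups idea with plain pandas, as a sketch only: grouped_means below is a hypothetical helper, not a scanpy function, and the data are made up.

```python
import pandas as pd

def grouped_means(df, var_names, row_groups, col_groups):
    """Hypothetical helper: mean of each variable per (row group, column group)."""
    if isinstance(var_names, str):
        var_names = [var_names]
    # Rows follow row_groups; columns become a (variable, col_groups) MultiIndex.
    return df.pivot_table(
        index=row_groups, columns=col_groups, values=var_names, aggfunc="mean"
    )

df = pd.DataFrame({
    "LDHB": [1.0, 3.0, 5.0, 7.0],
    "LYZ": [2.0, 4.0, 6.0, 8.0],
    "louvain": ["B", "B", "T", "T"],
    "sampleid": ["s1", "s2", "s1", "s2"],
})

one_gene = grouped_means(df, "LDHB", "louvain", "sampleid")
two_genes = grouped_means(df, ["LDHB", "LYZ"], "louvain", "sampleid")
```

With several variables, each one contributes its own block of columns, one per column group, which matches the layout of the second imaginary call.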

@zhangguy
Author

zhangguy commented Dec 7, 2021

Thanks @ivirshup!

I like the lines you suggested; perhaps I can adopt them to make creating color_df/size_df more elegant:

import numpy as np
import scanpy as sc

pbmc = sc.datasets.pbmc3k_processed().raw.to_adata()
pbmc.obs["sampleid"] = np.repeat(["s1", "s2"], pbmc.n_obs // 2)  # repeat count must be an integer
df = sc.get.obs_df(pbmc, ["LDHB", "louvain", "sampleid"])

# Mean expression and number of nonzero cells per (louvain, sampleid) pair
summarized = df.pivot_table(
    index=["louvain", "sampleid"],
    values="LDHB",
    aggfunc=[np.mean, np.count_nonzero]
)
# Reshape so that louvain indexes the rows and sampleid the columns
color_df = summarized["mean"].unstack()
size_df = summarized["count_nonzero"].unstack()

# I don't think the var_names or groupby variables are actually important here
sc.pl.DotPlot(
    pbmc,
    var_names="LDHB",  groupby=["louvain", "sampleid"],  # Just here so it doesn't error
    dot_color_df=color_df, dot_size_df=size_df,
).style(cmap="Reds").show()

This is the output:
image
Some work is needed to adjust the grid/axis size, legend, and scale. This is actually why I built on top of the _dotplot and _baseplot functions/classes to implement the solution: to keep the plots in the same style as the scanpy dotplot without doing too much work on the cosmetics.

But I can certainly change groupby_expand from a bool to an actual variable, group_cols, as you suggested in #2055. Or should we call it col_groups, as you did in your sc.pl.heatmap pseudocode?
I'd be more than happy to make it more general, i.e., extend it to sc.pl.heatmap, but I may need some time to understand sc.pl.heatmap first. The plotting functions are getting really complex; it took me some time to understand _dotplot and _baseplot :)

Thanks

@ivirshup
Member

ivirshup commented Dec 8, 2021

> Or should we call it col_groups as you did in your sc.pl.heatmap pseudo code?

That could be up to you. Which name makes more sense depends on what the user is trying to achieve. For instance, I'm not sure if it makes sense to allow splitting the columns by both variables and groups, or if that's the wrong abstraction.

> I'd be more than happy to make it more generalized, i.e., to sc.pl.heatmap, but I may need some time to understand sc.pl.heatmap first. The plotting functions are getting really complex- it took me some time to understand _dotplot and _baseplot :)

This code could definitely be a lot simpler, and I would appreciate help here! I think some of the concepts used in seaborn could be quite useful, though it looks like they're under heavy refactoring at the moment (relevant seaborn branch).

Maybe a good first step would be to fix the dotplot so that it looks right when the user provides the dot size and dot color dataframes? That would make these plots possible and give us an interface for trying later approaches.
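
As a rough idea of the kind of pre-processing such a fix might involve, here is a sketch under my own assumptions: prepare_dot_frames is a hypothetical helper, not existing scanpy code, and rescaling sizes to the range 0 to 1 is only one possible way to handle user-supplied counts.

```python
import pandas as pd

def prepare_dot_frames(color_df, size_df):
    """Hypothetical pre-check for user-supplied dot color and dot size tables."""
    # Both tables must describe the same grid of dots.
    if not (color_df.index.equals(size_df.index)
            and color_df.columns.equals(size_df.columns)):
        raise ValueError("color_df and size_df must share index and columns")
    # Rescale sizes to [0, 1] so counts and fractions are drawn on one scale.
    max_size = size_df.to_numpy().max()
    scaled = size_df / max_size if max_size > 0 else size_df.astype(float)
    return color_df, scaled

# Made-up tables: mean expression and nonzero-cell counts per group.
color_df = pd.DataFrame({"s1": [1.0, 2.0], "s2": [0.0, 2.0]}, index=["B", "T"])
size_df = pd.DataFrame({"s1": [1, 2], "s2": [0, 1]}, index=["B", "T"])
_, scaled_size_df = prepare_dot_frames(color_df, size_df)
```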

@zhangguy
Author

zhangguy commented Dec 23, 2021

Hi @ivirshup,
I made some updates to PR #2055. The column grouping argument is now a string/list argument, 'col_groups'.
A few examples:

import numpy as np
import scanpy as sc

pbmc = sc.datasets.pbmc3k_processed().raw.to_adata()
pbmc.obs["sampleid"] = np.repeat(["s1", "s2"], pbmc.n_obs // 2)
pbmc.obs["condition"] = np.tile(["c1", "c2"], pbmc.n_obs // 2)

## plot one gene, one column grouping variable
sc.pl.dotplot(pbmc, var_names='C1QA', groupby='louvain', col_groups='sampleid')

image

## plot two genes, one column grouping variable
sc.pl.dotplot(pbmc, var_names=['C1QA', 'CD19'], groupby='louvain', col_groups='sampleid')

image

## plot two genes, two column grouping variables
sc.pl.dotplot(pbmc, var_names=['C1QA', 'CD19'], groupby='louvain', col_groups=['sampleid', 'condition'])

image

## or we could use the same variables on the y axis
sc.pl.dotplot(pbmc, var_names=['C1QA', 'CD19'], groupby=['sampleid', 'condition'], col_groups='louvain')

image

For the heatmap, I think you were referring to sc.pl.matrixplot. sc.pl.heatmap is a different function, which plots each cell as a row and each gene as a column. col_groups was also added to sc.pl.matrixplot:

## plot two genes, two column grouping variables
sc.pl.matrixplot(pbmc, var_names=['C1QA', 'CD19'], groupby='louvain', col_groups=['sampleid', 'condition'])

image
The row_groups you proposed in your hypothetical sc.pl.heatmap implementation is equivalent to the current groupby argument in sc.pl.dotplot/sc.pl.matrixplot. I think it might be good to keep it as is for now; a change like this would be better done as a coordinated update across all plotting functions, because quite a few of them use the groupby argument.
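
For readers without the PR branch, the layout that several column grouping variables produce can be sketched with plain pandas on made-up data. This only illustrates the table shape; it does not use the col_groups argument itself.

```python
import numpy as np
import pandas as pd

n = 8
df = pd.DataFrame({
    "C1QA": np.arange(n, dtype=float),
    "louvain": np.repeat(["B", "T"], n // 2),
    "sampleid": np.tile(np.repeat(["s1", "s2"], 2), 2),
    "condition": np.tile(["c1", "c2"], n // 2),
})

# Two column grouping variables give a (sampleid, condition) MultiIndex on the columns.
wide = df.pivot_table(
    index="louvain", columns=["sampleid", "condition"], values="C1QA", aggfunc="mean"
)

# Swapping the roles of the grouping variables gives the transposed table.
swapped = df.pivot_table(
    index=["sampleid", "condition"], columns="louvain", values="C1QA", aggfunc="mean"
)
```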
