# Order of plotting datapoints with categorical colouring should be amended #1263
This is the categorical equivalent of the …
Yes, exactly :). I would make random the default. I don't think it should be too hard either. However, this will probably break a few tests.
Sounds good to me. How are you thinking of handling reproducibility w.r.t. random seeds?

To me, the best solution here is to make it easy to do small multiples for categorical plots like this, but that's a big change in the kind of plot being made. As an aside, I've also tried coloring each pixel by the group that showed up most under it, but this can look weird (less so if density is used to calculate the alpha level).

Snippet to reproduce:

```python
import datashader as ds
from datashader import transfer_functions as tf
import scanpy as sc
import numpy as np
import xarray as xr

# Load your AnnData here; I was using a preprocessed set of 1.3 million mouse brain cells
df = sc.get.obs_df(
    adata,
    ["Sox17", "louvain"],
    obsm_keys=[("X_umap", 0), ("X_umap", 1)],
)
louvain_colors = dict(
    zip(
        adata.obs["louvain"].cat.categories,
        adata.uns["louvain_colors"],
    )
)
pts = (
    ds.Canvas(500, 500)
    .points(df, "X_umap-0", "X_umap-1", agg=ds.count_cat("louvain"))
)

# Keep only the most common category under each pixel, but give it the pixel's full count
newpts = xr.zeros_like(pts)
newpts[:, :, pts.argmax(dim="louvain")] = pts.sum(dim="louvain")
tf.shade(newpts, color_key=louvain_colors)
```

What datashader does by default is take the average of the RGB values for the categories under a pixel, weighted by the number of samples, and calculate an alpha level based on the number of samples present. Addendum to the previous snippet, for plotting this:

```python
tf.shade(pts, color_key=louvain_colors)
```

I've also been wondering if there's a good way to show "colors cannot be trusted in this region". This could be done like how cameras do zebra stripes, where a texture is overlaid on the viewfinder for the sensor pixels that are saturated with light.
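The weighted-average behaviour described above can be sketched in plain numpy. The counts and colors here are hypothetical, and the alpha formula is an illustrative stand-in, not datashader's actual one:

```python
import numpy as np

# Hypothetical counts per category under a single pixel
counts = np.array([8, 2])            # 8 cells of cluster A, 2 of cluster B
colors = np.array([[255, 0, 0],      # cluster A: red
                   [0, 0, 255]])     # cluster B: blue

# Average of the RGB values, weighted by the number of samples per category
rgb = (counts[:, None] * colors).sum(axis=0) / counts.sum()

# Alpha grows with the number of samples present (illustrative log scaling)
alpha = int(255 * min(1.0, np.log1p(counts.sum()) / np.log1p(100)))
print(rgb, alpha)   # the pixel comes out mostly red, partially transparent
```

With an 8:2 split the minority cluster contributes only a fifth of the pixel's color, which is why rare categories become invisible in dense regions.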
I was thinking of just setting the seed to 0 by default, which would be consistent with other Scanpy code. I don't actually mind the version of the plot without accounting for sample density. I do think that randomization would result in something similar to the …
As an additional example, I was thinking about using zebra stripes (like a camera) to show when information is hidden. Not sure if it's quite there yet, but it's something:

```python
import datashader as ds
from datashader import transfer_functions as tf
import numpy as np
import pandas as pd
from scipy import sparse
import xarray as xr
import scanpy as sc


def diagonal_bands_like(arr, width=3):
    assert arr.ndim == 2
    a = np.zeros_like(arr, dtype=bool)
    step = a.shape[1] + 1
    # Not sure why end isn't making a difference
    # (end=None already slices to the end of the flattened array, and for a
    # square array shape[1] * shape[1] is that same endpoint)
    end = None
    # end = a.shape[1] * a.shape[1]
    fill = True
    for i in range(arr.shape[0]):
        if (i + width // 2) % width == 0:
            fill = not fill
        if fill:
            a.flat[i:end:step] = True
    return a


# Setup
adata = sc.read("/Users/isaac/data/10x_mouse_13MM_processed.h5ad", backed="r")
df = sc.get.obs_df(
    adata,
    ["Sox17", "louvain"],
    obsm_keys=[("X_umap", 0), ("X_umap", 1)],
)
louvain_colors = dict(
    zip(
        adata.obs["louvain"].cat.categories,
        adata.uns["louvain_colors"],
    )
)
pts = (
    ds.Canvas(1000, 1000)
    .points(df, "X_umap-0", "X_umap-1", agg=ds.count_cat("louvain"))
)

# Make images
pts_ncats = (pts != 0).sum(axis=2)   # number of categories present per pixel
overlap_idx = pts_ncats == 1
zebra_source = xr.DataArray(
    diagonal_bands_like(overlap_idx, 13),
    coords=overlap_idx.coords,
)
color_by_cluster = tf.shade(pts, color_key=louvain_colors)
tf.Images(
    color_by_cluster,
    tf.stack(
        tf.Image(xr.where(pts_ncats == 1, color_by_cluster, 0)),
        tf.Image(tf.shade(xr.where(pts_ncats > 1, zebra_source, False), cmap="black")),
    ),
    tf.stack(
        color_by_cluster,
        tf.Image(tf.shade(xr.where(pts_ncats > 1, zebra_source, False), cmap="black")),
    ),
)
```

I wonder how either of these is affected by the number of points. Say you have two cell types (A and B) in an overlapping region.
I think bin size would be helpful here. Additionally, datashader has methods for exaggerating points in less dense regions so they stay visible. This could be worth looking into. Update: turns out …
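The datashader helpers referred to here are its spreading transfer functions (`tf.spread` / `tf.dynspread`). A dependency-free sketch of the basic idea, dilating each nonzero pixel so isolated points stay visible, might look like:

```python
import numpy as np

def spread(img, px=1):
    # Dilate every nonzero pixel by `px` pixels in each direction so that
    # isolated points in sparse regions remain visible after aggregation.
    # NB: np.roll wraps around the array edges; a real implementation
    # (e.g. datashader's tf.spread) pads instead of wrapping.
    out = np.zeros_like(img)
    for dy in range(-px, px + 1):
        for dx in range(-px, px + 1):
            out = np.maximum(out, np.roll(np.roll(img, dy, axis=0), dx, axis=1))
    return out

img = np.zeros((5, 5), dtype=int)
img[2, 2] = 1
print(spread(img))   # the single center pixel becomes a 3x3 block
```

`tf.dynspread` goes further by choosing the spread radius adaptively based on how dense the image is, which is what makes it useful for mixed sparse/dense embeddings.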
We should make dynamic 3D plots ;-)

If I remember correctly, in the past we had the issue that the categorical colors were given by the adata.obs order, and we changed them so that they follow the order of the categories. Yet, I agree that a good mix of categorical colors is sometimes good. To address this issue I think that we can simply randomize the order if …

Isaac's solution looks great for dealing with lots of cells, something that I imagine will become more frequent. I think we should have a 'cookbook' where we can keep this and other information. I find this better than adding more and more functionality to the scatter plots.
I would argue that this would be fair. In the end it's about showing which cells are represented per pixel/pixel bin, and rare cell types shouldn't be up-weighted in an unbiased representation (if there is such a thing). In general I do like the idea of density being linked to transparency, though. We could do a quick fix based on random order for now, and then look into transparency for a larger update that would have to do with scaling scanpy plotting to larger cell numbers?
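A minimal sketch of that quick fix, shuffling row order once before plotting with a fixed seed for reproducibility (the DataFrame here is synthetic, standing in for concatenated batches):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, consistent with the seed-0 suggestion above

# Synthetic stand-in for two concatenated batches: all "B" rows come last,
# so they would be drawn on top of every "A" row
df = pd.DataFrame({
    "x": rng.normal(size=1000),
    "y": rng.normal(size=1000),
    "group": ["A"] * 500 + ["B"] * 500,
})

# Shuffle the rows before plotting so neither group systematically overdraws the other
shuffled = df.iloc[rng.permutation(len(df))]
print(shuffled["group"].head().tolist())
```

Passing `shuffled` instead of `df` to the scatter call is the whole fix; the draw order is then independent of concatenation order.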
I've been trying to be organized about keeping notebooks around for this (here). Of course, I rarely get the notebooks clean enough to push 😆.

Is it fair if you're coloring by batch and one dataset has fewer samples? Wouldn't you want to know that multiple batches were showing up in this region? I'm fairly convinced there is no good way to show this in one plot, other than telling users some information is hidden.

I'm trying to think of the simplest way to implement this. I would like to keep the behaviour of …

I think this might be worth a separate package, at least to start out. At least with how I'm handling it now, there would be a large number of dependencies. Plus, I think overplotting like this is an unsolved problem, so freedom to experiment is important.
@ivirshup I like your notebooks for a cookbook. Does it need to be super organized to add it to the readthedocs page? Regarding the options, I like the … The …
Another advantage is that it could be user-specified per plot when there are multiple plots. I think there is another issue, which is that …

Docstrings for these arguments would look something like:

```python
order_continuous: Literal["current", "random", "ascending", "descending"] = "ascending"
    How to order points in plots colored by continuous values. Options include:

    * "current": use the current ordering of the AnnData object
    * "random": randomize the order
    * "ascending": points with the highest value are plotted on top
    * "descending": points with the lowest value are plotted on top

order_categorical: Literal["current", "random", "ascending", "descending"] = "random"
    How to order non-null categorical points in the plot. Uses the same
    options as order_continuous. In this case, …
```

Potential extensions: …

Possible issues:

*Vectorization could be complicated.* Vectorization of the argument is unclear, and maybe not possible. That is, what if I want the same variable twice, but ordered differently? This would look like:

```python
sc.pl.umap(adata, color=["CD8", "CD8"], order_continuous=["ascending", "descending"])
```

Now what if I wanted to also plot a categorical value? Is this:

```python
sc.pl.umap(adata, color=["CD8", "CD8", "leiden"], order_continuous=["ascending", "descending", None])
```

*Null values.* This solution assumes we still want null values plotted on the bottom. Should there be control over that?

Some references for other libraries: …
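A hypothetical helper (not scanpy API, just a sketch of the proposal) showing how these options could translate into a draw order, where later indices are drawn on top:

```python
import numpy as np

def plotting_order(values, order="random", rng=None):
    """Return an index array giving the order to draw points in.

    Hypothetical helper mirroring the proposed order_continuous /
    order_categorical arguments; not part of scanpy.
    """
    rng = np.random.default_rng(0) if rng is None else rng  # seed 0 by default
    values = np.asarray(values)
    if order == "current":
        return np.arange(len(values))
    if order == "random":
        return rng.permutation(len(values))
    if order == "ascending":
        # drawn in ascending order, so the highest values land on top
        return np.argsort(values, kind="stable")
    if order == "descending":
        # reversed, so the lowest values land on top
        return np.argsort(values, kind="stable")[::-1]
    raise ValueError(f"Unknown order: {order!r}")

vals = np.array([3.0, 1.0, 2.0])
print(plotting_order(vals, "ascending"))   # → [1 2 0]
```

A plotting function would then index its coordinates and colors with this array before calling scatter, which keeps the ordering logic in one testable place.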
At the moment, when we are plotting data points in e.g. `sc.pl.umap()` with `color='covariate'`, we determine the plotting order in two ways:

1. If `'covariate'` is continuous, the highest values are plotted on top, to showcase the peaks of the distribution.
2. If `'covariate'` is a categorical variable, the order of `adata.obs_names` is used (I believe).

As we often concatenate datasets after integration or after loading from multiple sources, the covariates we plot are usually not randomly ordered here. I think the first case is fine (and it can be turned off), but we should probably not be doing case 2. Instead, it would be good if the default was to plot in a random order, unless the covariate is ordered internally (I believe this is already taken into account, but I'm not sure).

I have come across this issue several times now, and we're not solving this in a good way imo. Fabian has mentioned this to me several times as well. What do you think @fidelram @ivirshup?
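To illustrate why case 2 matters, here is a self-contained simulation (synthetic data, no scanpy dependency): two equally sized "batches" from the same distribution are rasterized onto a grid, and the last point drawn into each cell wins, mimicking scatter overdraw. In obs order the second batch hides the first; a random order roughly balances them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Two equally sized synthetic "batches" from the same distribution,
# concatenated the way observations would be after dataset concatenation
x = np.concatenate([rng.normal(size=n), rng.normal(size=n)])
y = np.concatenate([rng.normal(size=n), rng.normal(size=n)])
batch = np.array(["A"] * n + ["B"] * n)

def visible_fraction(order):
    # Rasterize onto a 50x50 grid; the last point drawn into a cell "wins",
    # mimicking scatter overdraw. Returns the fraction of occupied cells
    # whose visible point belongs to batch B.
    px = np.clip(((x[order] + 4) / 8 * 50).astype(int), 0, 49)
    py = np.clip(((y[order] + 4) / 8 * 50).astype(int), 0, 49)
    grid = np.full((50, 50), "", dtype=object)
    for xi, yi, bi in zip(px, py, batch[order]):
        grid[yi, xi] = bi
    shown = grid[grid != ""]
    return (shown == "B").mean()

print(visible_fraction(np.arange(2 * n)))        # obs order: batch B dominates
print(visible_fraction(rng.permutation(2 * n)))  # random order: roughly balanced
```

Even though both batches are identical in size and distribution, the obs-ordered plot shows batch B in the large majority of occupied cells, which is exactly the misleading pattern this issue describes.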