
Order of plotting datapoints with categorical colouring should be amended #1263

Open
1 task done
LuckyMD opened this issue Jun 2, 2020 · 10 comments · May be fixed by #2998
Comments

@LuckyMD
Contributor

LuckyMD commented Jun 2, 2020

  • Additional function parameters / changed functionality / changed defaults?

At the moment, when we plot data points in e.g. sc.pl.umap() with color='covariate', we determine the plotting order in two ways:

  1. if 'covariate' is continuous, the highest values are plotted on top, to showcase the peaks of the distribution;
  2. if 'covariate' is a categorical variable, the order of adata.obs_names is used (I believe). As we often concatenate datasets after integration or when loading from multiple sources, the covariates we plot are usually not randomly ordered here.

I think the first case is fine (and it can be turned off), but we should probably not be doing case 2. Instead, it would be good if the default was to plot in a random order, unless the covariate is internally ordered (I believe this is already taken into account, but I'm not sure). I have come across this issue several times now, and we're not solving it in a good way, in my opinion. Fabian has mentioned this to me several times as well. What do you think @fidelram @ivirshup ?
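
For illustration, a minimal sketch of the proposed behaviour, assuming a seeded shuffle (seed 0, as elsewhere in Scanpy) applied before plotting; pbmc68k_reduced and bulk_labels here just stand in for any concatenated dataset and categorical covariate:

import numpy as np
import scanpy as sc

adata = sc.datasets.pbmc68k_reduced()  # stand-in for a concatenated dataset

# Shuffle the plotting order with a fixed seed so categories interleave,
# rather than being drawn in adata.obs order (later batches on top)
rng = np.random.default_rng(0)
sc.pl.umap(adata[rng.permutation(adata.n_obs)], color="bulk_labels")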

@gokceneraslan
Collaborator

This is basically the categorical equivalent of the sort_order parameter, right? I like the idea.

@LuckyMD
Contributor Author

LuckyMD commented Jun 2, 2020

Yes, exactly :). I would make random order the default. I don't think it should be too hard either. However, this will probably break a few tests.

@ivirshup
Member

ivirshup commented Jun 3, 2020

Sounds good to me. How are you thinking of handling reproducibility w.r.t. random seeds?

To me, the best solution here is to make it easy to do small multiples for categorical plots like this, but that's a big change in the kind of plot being made.

As an aside, I've also tried coloring each pixel by whichever group showed up the most under it, but this can look weird (less so if density is used to calculate the alpha level).

[image]

Example without accounting for density

[image]

Snippet to reproduce
import datashader as ds
from datashader import transfer_functions as tf
import scanpy as sc
import numpy as np
import xarray as xr

# Load your AnnData here; I was using a preprocessed set of 1.3 million mouse brain cells

df = sc.get.obs_df(
    adata,
    ["Sox17", "louvain"],
    obsm_keys=[("X_umap", 0), ("X_umap", 1)]
)
louvain_colors = dict(
    zip(
        adata.obs["louvain"].cat.categories, 
        adata.uns["louvain_colors"]
    )
)

pts = (
    ds.Canvas(500, 500)
    .points(df, "X_umap-0", "X_umap-1", agg=ds.count_cat("louvain"))
)

# Keep only the most frequent category per pixel, assigning it the pixel's total count
newpts = xr.zeros_like(pts)
newpts[:, :, pts.argmax(dim="louvain")] = pts.sum(dim="louvain")
tf.shade(newpts, color_key=louvain_colors)

What datashader does by default is take the average of the RGB values for the categories under a pixel, weighted by the number of samples, and calculate an alpha level based on the number of samples present. This looks like:

[image]

Addendum to previous snippet for plotting this
tf.shade(pts, color_key=louvain_colors)

I've also been wondering if there's a good way to show "colors cannot be trusted in this region". This could be done like how cameras do zebra stripes, where a texture is overlaid on the viewfinder for sensor pixels that are saturated with light.

@LuckyMD
Contributor Author

LuckyMD commented Jun 3, 2020

Sounds good to me. How are you thinking of handling reproducibility w.r.t. random seeds?

I was thinking of just setting the seed to 0 by default, which would be consistent with other Scanpy code.

I don't actually mind the version of the plot without accounting for sample density. I do think that randomization would result in something similar to the datashader example you show, though, except that it wouldn't change alphas by density. I'm not sure that is so helpful, as it can make small clusters in less dense regions of the plot hard to see. These clusters (potentially rare cell types) are often the interesting features in such a plot.

@ivirshup
Member

ivirshup commented Jan 15, 2021

As an additional example, I was thinking about using zebra stripes (like a camera) for showing when information was hidden. Not sure if it's quite there yet, but it's something:

[image]

Code
import datashader as ds
from datashader import transfer_functions as tf

import numpy as np
import pandas as pd
from scipy import sparse
import xarray as xr

import scanpy as sc

def diagonal_bands_like(arr, width=3):
    """Return a boolean mask of diagonal stripes with the same shape as `arr`."""
    assert arr.ndim == 2
    a = np.zeros_like(arr, dtype=bool)
    step = a.shape[1] + 1
    # Not sure why end isn't making a difference
    end = None
#     end = a.shape[1] * a.shape[1]
    fill = True
    for i in range(arr.shape[0]):
        if (i + width // 2) % width == 0:
            fill = not fill
        if fill:
            a.flat[i:end:step] = True
    return a

# Setup
adata = sc.read("/Users/isaac/data/10x_mouse_13MM_processed.h5ad", backed="r")
df = sc.get.obs_df(
    adata,
    ["Sox17", "louvain"],
    obsm_keys=[("X_umap", 0), ("X_umap", 1)]
)
louvain_colors = dict(
    zip(
        adata.obs["louvain"].cat.categories, 
        adata.uns["louvain_colors"]
    )
)
pts = (
    ds.Canvas(1000, 1000)
    .points(df, "X_umap-0", "X_umap-1", agg=ds.count_cat("louvain"))
)

# Make images
pts_ncats = (pts != 0).sum(axis=2)  # number of categories under each pixel
overlap_idx = pts_ncats == 1  # pixels covered by exactly one category
zebra_source = xr.DataArray(
    diagonal_bands_like(overlap_idx, 13),
    coords=overlap_idx.coords
)

color_by_cluster = tf.shade(pts, color_key=louvain_colors)
tf.Images(
    color_by_cluster,
    tf.stack(
        tf.Image(xr.where(pts_ncats == 1, color_by_cluster, 0)),
        tf.Image(tf.shade(xr.where(pts_ncats > 1, zebra_source, False), cmap="black"))
    ),
    tf.stack(
        color_by_cluster,
        tf.Image(tf.shade(xr.where(pts_ncats > 1, zebra_source, False), cmap="black"))
    ),
)

I do think that randomization would result in sth similar to the datashader example you show though, except that it wouldn't change alphas by density.

I wonder how either of these is affected by the number of points. Say you have two cell types (A and B) in an overlapping region: A has 10x the representation of B in this region, but that's only 10% of the A in this dataset, while this region contains all of B. What's the fair way to color this? If it were random, or purely by count, this would look mostly like A.

I'm not sure if doing that is so helpful as it can lead to hardly being able to see small clusters in less dense regions of the plot

I think bin size would be helpful here. Additionally, datashader has methods for exaggerating points in less dense regions so they stay visible. This could be worth looking into.

Update: it turns out dynspread uses global density, not local. The spread operators could still be of help here. Also, minimum alpha values can be set. Overall, I do like that there is a sense of density with the alpha levels, and wouldn't want to miss out on it.
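
For reference, a rough sketch of those knobs, reusing pts and louvain_colors from the snippet above; min_alpha and dynspread are existing datashader options, but the parameter values here are just guesses:

# Raise the alpha floor so sparse regions remain visible
img = tf.shade(pts, color_key=louvain_colors, min_alpha=80)
# Grow isolated pixels; note that dynspread's threshold acts on global density
img = tf.dynspread(img, threshold=0.5, max_px=3)
img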

@fidelram
Collaborator

We should make dynamic 3D plots ;-)

If I remember correctly, in the past we had the issue that the categorical colors were given by the adata.obs order, and we changed them so that they follow the order of the categories. Yet I agree that a good mix of categorical colors is sometimes desirable. To address this issue, I think we can simply randomize the order if sort_order=False, to avoid adding any new parameters.

Isaac's solution looks great for dealing with lots of cells, something that I imagine will become more frequent. I think we should have a 'cookbook' where we can keep this and other information. I find this better than adding more and more functionality to the scatter plots.

@LuckyMD
Contributor Author

LuckyMD commented Jan 16, 2021

What's the fair way to color this? If it were random, or purely by count, this would look mostly like A.

I would argue that this would be fair. In the end, it's about showing which cells are represented per pixel/pixel bin, and rare cell types shouldn't be up-weighted in an unbiased representation (if there is such a thing).

In general, I do like the idea of density being linked to transparency, though. We could do a quick fix based on random order for now, and then look into transparency as part of a larger update that deals with scaling scanpy plotting to larger cell numbers?

@ivirshup
Member

I think we should have a 'cookbook' where we can keep this and other information.

I've been trying to be organized about keeping notebooks around for this (here). Of course, I rarely get the notebooks clean enough to push 😆.


I would argue that this would be fair. In the end it's about showing which cells are represented per pixel/pixel bin.

Is it fair if coloring by batch and one dataset had fewer samples? Wouldn't you want to know that multiple batches were showing up in this region? I'm fairly convinced there is no good way to show this in one plot, other than telling users some information is hidden.

We could do a quick fix based on random order for now

I'm trying to think of the simplest way to implement this. I would like to keep the behaviour of sort_order=False just using the order from the anndata object. Some options (see the sketch after this list):

  • sort_order="random": this would make the order random, but we might need to add a seed argument. Also, do we still plot over null values?
  • sort_order=order_array, where order_array is a 1-D integer array. Basically, the user can pass whatever order they like. For a random order it would be np.random.choice(adata.n_obs, adata.n_obs, replace=False). This is pretty flexible, since it allows whatever order you want to be used without sorting the object.
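
A hypothetical sketch of how the plotting code might resolve either option into an index array (resolve_plot_order is a made-up name, not scanpy API; points are drawn in the returned order, so the last index lands on top):

import numpy as np

def resolve_plot_order(n_obs, sort_order, seed=0):
    # Hypothetical helper: turn a sort_order option into a drawing order
    if isinstance(sort_order, str) and sort_order == "random":
        return np.random.default_rng(seed).permutation(n_obs)
    if isinstance(sort_order, np.ndarray):
        return sort_order  # user-supplied 1-D integer order
    return np.arange(n_obs)  # fall back to the current adata order

# e.g. coords = coords[resolve_plot_order(adata.n_obs, "random")]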

larger update that would have to do with updating scanpy plotting to larger cell numbers?

I think this might be worth a separate package, at least to start out. At least with how I'm handling it now, there would be a large number of dependencies. Plus, I think overplotting like this is an unsolved problem, so freedom to experiment is important.

@fidelram
Collaborator

@ivirshup I like your notebooks for a cookbook. Does it need to be super organized to add it to the readthedocs page?

Regarding the options, I like the sort_order='random'.

The order_array option would be equivalent to sc.pl.embedding(adata[adata.obs.sample(frac=1).index], ...) for random or any other sorting of the adata object. The point is that user-defined sorting is already covered. The only advantage of sort_order=order_array is that it is explicit for the user.

@ivirshup
Member

ivirshup commented Feb 10, 2021

The only advantage of sort_order=order_array is that it is explicit for the user.

Another advantage is that it could be user specified per plot when there are multiple plots.


I think there is another issue, which is that sort_order currently applies only to numeric values, while here we are trying to deal with issues around categorical values. To me this suggests a need for separate arguments for the two cases (order_categorical, order_continuous), though this raises issues with "vectorizing" the argument.

Docstrings for these arguments would look something like:

order_continuous: Literal["current", "random", "ascending", "descending"] = "ascending"
    How to order points in plots colored by continuous values. Options include:
    * "current": use current ordering of AnnData object
    * "random": randomize the order
    * "ascending": points with the highest value are plotted on top
    * "descending": points with lowest value are plotted on top
order_categorical: Literal["current", "random", "ascending", "descending"] = "random"
    How to order non-null categorical points in the plot. Uses same options as order_continuous.

In this case, sort_order would be deprecated, and tell the user to use order_continuous instead.
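
As a sketch only (order_points is a made-up name, and null handling, i.e. drawing NA on the bottom, is left out), the dispatch for these options could look something like:

import numpy as np
import pandas as pd

def order_points(values, how="current", seed=0):
    # Returns an index array; points are plotted in this order,
    # so the last index ends up on top
    if how == "current":
        return np.arange(len(values))
    if how == "random":
        return np.random.default_rng(seed).permutation(len(values))
    s = pd.Series(values)
    # Sort categoricals by category code, numerics by value
    keys = s.cat.codes if isinstance(s.dtype, pd.CategoricalDtype) else s
    order = np.argsort(keys.to_numpy(), kind="stable")
    return order if how == "ascending" else order[::-1]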

Potential extensions

  • We could also allow users to pass a Callable[Vector, Vector[int]] (i.e., a function which takes the color vector and returns a vector of integers) as the argument.

Possible issues

Vectorization could be complicated

Vectorization of the argument is unclear, and maybe not possible. That is, what if I want the same variable twice, but ordered differently? This would look like:

sc.pl.umap(adata, color=["CD8", "CD8"], order_continuous=["ascending", "descending"])

Now, what if I also wanted to plot a categorical value? Would it be this:

sc.pl.umap(adata, color=["CD8", "CD8", "leiden"], order_continuous=["ascending", "descending", None])

Null values

This solution assumes we still want null values plotted on bottom. Should there be control over that?

Some references for other libraries:

  • altair.Sort
  • (I'm actually not sure if other libraries do this; datashader does max/min/mean, which is sort of similar?)
