
Order of plotting datapoints with categorical colouring should be amended #1263

Open
1 task done
LuckyMD opened this issue Jun 2, 2020 · 10 comments · May be fixed by #2998
Comments

@LuckyMD
Contributor

LuckyMD commented Jun 2, 2020

  • Additional function parameters / changed functionality / changed defaults?

At the moment, when we plot data points in e.g. sc.pl.umap() with color='covariate', we determine the plotting order in two ways:

  1. if 'covariate' is continuous, the highest values are plotted on top, to showcase the peaks of the distribution;
  2. if 'covariate' is a categorical variable, the order of adata.obs_names is used (I believe). As we often concatenate datasets after integration or when loading from multiple sources, the covariates we plot are usually not randomly ordered here.

I think the first case is fine (and it can be turned off), but we should probably not be doing case 2. Instead, it would be good if the default was to plot in a random order, unless the covariate is internally ordered (I believe this is already taken into account, but I'm not sure). I have come across this issue several times now, and we're not solving it in a good way, in my opinion. Fabian has mentioned this to me several times as well. What do you think @fidelram @ivirshup ?
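
For illustration, a minimal sketch of the proposed behaviour, assuming a seeded shuffle (seed 0, as elsewhere in Scanpy) applied before plotting; pbmc68k_reduced and bulk_labels here just stand in for any concatenated dataset and categorical covariate:

import numpy as np
import scanpy as sc

adata = sc.datasets.pbmc68k_reduced()  # stand-in for a concatenated dataset

# Shuffle the plotting order with a fixed seed so categories interleave,
# rather than being drawn in adata.obs order (later batches on top)
rng = np.random.default_rng(0)
sc.pl.umap(adata[rng.permutation(adata.n_obs)], color="bulk_labels")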

@gokceneraslan
Collaborator

This is basically the categorical equivalent of the sort_order parameter, right? I like the idea.

@LuckyMD
Contributor Author

LuckyMD commented Jun 2, 2020

Yes, exactly :). I would make random order the default. I don't think it should be too hard either. However, this will probably break a few tests.

@ivirshup
Member

ivirshup commented Jun 3, 2020

Sounds good to me. How are you thinking of handling reproducibility w.r.t. random seeds?

To me, the best solution here is to make it easy to do small multiples for categorical plots like this, but that's a big change in the kind of plot being made.

As an aside, I've also tried coloring each pixel by whichever group showed up the most under it, but this can look weird (less so if density is used to calculate the alpha level).

[image]

Example without accounting for density

[image]

Snippet to reproduce
import datashader as ds
from datashader import transfer_functions as tf
import scanpy as sc
import numpy as np
import xarray as xr

# Load your AnnData here; I was using a preprocessed set of 1.3 million mouse brain cells

df = sc.get.obs_df(
    adata,
    ["Sox17", "louvain"],
    obsm_keys=[("X_umap", 0), ("X_umap", 1)]
)
louvain_colors = dict(
    zip(
        adata.obs["louvain"].cat.categories, 
        adata.uns["louvain_colors"]
    )
)

pts = (
    ds.Canvas(500, 500)
    .points(df, "X_umap-0", "X_umap-1", agg=ds.count_cat("louvain"))
)

# Keep only the most frequent category per pixel, assigning it the pixel's total count
newpts = xr.zeros_like(pts)
newpts[:, :, pts.argmax(dim="louvain")] = pts.sum(dim="louvain")
tf.shade(newpts, color_key=louvain_colors)

What datashader does by default is take the average of the RGB values for the categories under a pixel, weighted by the number of samples, and calculate an alpha level based on the number of samples present. This looks like:

[image]

Addendum to previous snippet for plotting this
tf.shade(pts, color_key=louvain_colors)

I've also been wondering if there's a good way to show "colors cannot be trusted in this region". This could be done like how cameras do zebra stripes, where a texture is overlaid on the viewfinder for sensor pixels that are saturated with light.

@LuckyMD
Contributor Author

LuckyMD commented Jun 3, 2020

Sounds good to me. How are you thinking of handling reproducibility w.r.t. random seeds?

I was thinking of just setting the seed to 0 by default, which would be consistent with other Scanpy code.

I don't actually mind the version of the plot without accounting for sample density. I do think that randomization would result in something similar to the datashader example you show, though, except that it wouldn't change alphas by density. I'm not sure that is so helpful, as it can make small clusters in less dense regions of the plot hard to see. These clusters (potentially rare cell types) are often the interesting features in such a plot.

@ivirshup
Member

ivirshup commented Jan 15, 2021

As an additional example, I was thinking about using zebra stripes (like a camera) for showing when information was hidden. Not sure if it's quite there yet, but it's something:

[image]

Code
import datashader as ds
from datashader import transfer_functions as tf

import numpy as np
import pandas as pd
from scipy import sparse
import xarray as xr

import scanpy as sc

def diagonal_bands_like(arr, width=3):
    """Return a boolean mask of diagonal stripes with the same shape as `arr`."""
    assert arr.ndim == 2
    a = np.zeros_like(arr, dtype=bool)
    step = a.shape[1] + 1
    # Not sure why end isn't making a difference
    end = None
#     end = a.shape[1] * a.shape[1]
    fill = True
    for i in range(arr.shape[0]):
        if (i + width // 2) % width == 0:
            fill = not fill
        if fill:
            a.flat[i:end:step] = True
    return a

# Setup
adata = sc.read("/Users/isaac/data/10x_mouse_13MM_processed.h5ad", backed="r")
df = sc.get.obs_df(
    adata,
    ["Sox17", "louvain"],
    obsm_keys=[("X_umap", 0), ("X_umap", 1)]
)
louvain_colors = dict(
    zip(
        adata.obs["louvain"].cat.categories, 
        adata.uns["louvain_colors"]
    )
)
pts = (
    ds.Canvas(1000, 1000)
    .points(df, "X_umap-0", "X_umap-1", agg=ds.count_cat("louvain"))
)

# Make images
pts_ncats = (pts != 0).sum(axis=2)  # number of categories under each pixel
overlap_idx = pts_ncats == 1  # pixels covered by exactly one category
zebra_source = xr.DataArray(
    diagonal_bands_like(overlap_idx, 13),
    coords=overlap_idx.coords
)

color_by_cluster = tf.shade(pts, color_key=louvain_colors)
tf.Images(
    color_by_cluster,
    tf.stack(
        tf.Image(xr.where(pts_ncats == 1, color_by_cluster, 0)),
        tf.Image(tf.shade(xr.where(pts_ncats > 1, zebra_source, False), cmap="black"))
    ),
    tf.stack(
        color_by_cluster,
        tf.Image(tf.shade(xr.where(pts_ncats > 1, zebra_source, False), cmap="black"))
    ),
)

I do think that randomization would result in sth similar to the datashader example you show though, except that it wouldn't change alphas by density.

I wonder how either of these is affected by the number of points. Say you have two cell types (A and B) in an overlapping region: A has 10x the representation of B in this region, but that's only 10% of the A in this dataset, while this region contains all of B. What's the fair way to color this? If it were random, or purely by count, this would look mostly like A.

I'm not sure if doing that is so helpful as it can lead to hardly being able to see small clusters in less dense regions of the plot

I think bin size would be helpful here. Additionally, datashader has methods for exaggerating points in less dense regions so they stay visible. This could be worth looking into.

Update: it turns out dynspread uses global density, not local. The spread operators could still be of help here. Also, minimum alpha values can be set. Overall, I do like that there is a sense of density with the alpha levels, and wouldn't want to miss out on it.
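
For reference, a rough sketch of those knobs, reusing pts and louvain_colors from the snippet above; min_alpha and dynspread are existing datashader options, but the parameter values here are just guesses:

# Raise the alpha floor so sparse regions remain visible
img = tf.shade(pts, color_key=louvain_colors, min_alpha=80)
# Grow isolated pixels; note that dynspread's threshold acts on global density
img = tf.dynspread(img, threshold=0.5, max_px=3)
img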

@fidelram
Collaborator

We should make dynamic 3D plots ;-)

If I remember correctly, in the past we had the issue that the categorical colors were given by the adata.obs order, and we changed them so that they follow the order of the categories. Yet I agree that a good mix of categorical colors is sometimes desirable. To address this issue, I think we can simply randomize the order if sort_order=False, to avoid adding any new parameters.

Isaac's solution looks great for dealing with lots of cells, something that I imagine will become more frequent. I think we should have a 'cookbook' where we can keep this and other information. I find this better than adding more and more functionality to the scatter plots.

@LuckyMD
Contributor Author

LuckyMD commented Jan 16, 2021

What's the fair way to color this? If it were random, or purely by count, this would look mostly like A.

I would argue that this would be fair. In the end, it's about showing which cells are represented per pixel/pixel bin, and rare cell types shouldn't be up-weighted in an unbiased representation (if there is such a thing).

In general, I do like the idea of density being linked to transparency, though. We could do a quick fix based on random order for now, and then look into transparency as part of a larger update that deals with scaling scanpy plotting to larger cell numbers?

@ivirshup
Member

I think we should have a 'cookbook' where we can keep this and other information.

I've been trying to be organized about keeping notebooks around for this (here). Of course, I rarely get the notebooks clean enough to push 😆.


I would argue that this would be fair. In the end it's about showing which cells are represented per pixel/pixel bin.

Is it fair if coloring by batch and one dataset had fewer samples? Wouldn't you want to know that multiple batches were showing up in this region? I'm fairly convinced there is no good way to show this in one plot, other than telling users some information is hidden.

We could do a quick fix based on random order for now

I'm trying to think of the simplest way to implement this. I would like to keep the behaviour of sort_order=False just using the order from the anndata object. Some options (see the sketch after this list):

  • sort_order="random": this would make the order random, but we might need to add a seed argument. Also, do we still plot over null values?
  • sort_order=order_array, where order_array is a 1-D integer array. Basically, the user can pass whatever order they like. For a random order it would be np.random.choice(adata.n_obs, adata.n_obs, replace=False). This is pretty flexible, since it allows whatever order you want to be used without sorting the object.
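
A hypothetical sketch of how the plotting code might resolve either option into an index array (resolve_plot_order is a made-up name, not scanpy API; points are drawn in the returned order, so the last index lands on top):

import numpy as np

def resolve_plot_order(n_obs, sort_order, seed=0):
    # Hypothetical helper: turn a sort_order option into a drawing order
    if isinstance(sort_order, str) and sort_order == "random":
        return np.random.default_rng(seed).permutation(n_obs)
    if isinstance(sort_order, np.ndarray):
        return sort_order  # user-supplied 1-D integer order
    return np.arange(n_obs)  # fall back to the current adata order

# e.g. coords = coords[resolve_plot_order(adata.n_obs, "random")]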

larger update that would have to do with updating scanpy plotting to larger cell numbers?

I think this might be worth a separate package, at least to start out. At least with how I'm handling it now, there would be a large number of dependencies. Plus, I think overplotting like this is an unsolved problem, so freedom to experiment is important.

@fidelram
Collaborator

@ivirshup I like your notebooks for a cookbook. Does it need to be super organized to add it to the readthedocs page?

Regarding the options, I like the sort_order='random'.

The order_array option would be equivalent to sc.pl.embedding(adata[adata.obs.sample(frac=1).index], ...) for random or any other sorting of the adata object. The point is that user-defined sorting is already covered. The only advantage of sort_order=order_array is that it is explicit for the user.

@ivirshup
Member

ivirshup commented Feb 10, 2021

The only advantage of sort_order=order_array is that it is explicit for the user.

Another advantage is that it could be user specified per plot when there are multiple plots.


I think there is another issue, which is that sort_order currently applies only to numeric values, while here we are trying to deal with issues around categorical values. To me this suggests a need for separate arguments for the two cases (order_categorical, order_continuous), though this raises issues with "vectorizing" the argument.

Docstrings for these arguments would look something like:

order_continuous: Literal["current", "random", "ascending", "descending"] = "ascending"
    How to order points in plots colored by continuous values. Options include:
    * "current": use current ordering of AnnData object
    * "random": randomize the order
    * "ascending": points with the highest value are plotted on top
    * "descending": points with lowest value are plotted on top
order_categorical: Literal["current", "random", "ascending", "descending"] = "random"
    How to order non-null categorical points in the plot. Uses same options as order_continuous.

In this case, sort_order would be deprecated, and tell the user to use order_continuous instead.
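
As a sketch only (order_points is a made-up name, and null handling, i.e. drawing NA on the bottom, is left out), the dispatch for these options could look something like:

import numpy as np
import pandas as pd

def order_points(values, how="current", seed=0):
    # Returns an index array; points are plotted in this order,
    # so the last index ends up on top
    if how == "current":
        return np.arange(len(values))
    if how == "random":
        return np.random.default_rng(seed).permutation(len(values))
    s = pd.Series(values)
    # Sort categoricals by category code, numerics by value
    keys = s.cat.codes if isinstance(s.dtype, pd.CategoricalDtype) else s
    order = np.argsort(keys.to_numpy(), kind="stable")
    return order if how == "ascending" else order[::-1]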

Potential extensions

  • We could also allow users to pass a Callable[Vector, Vector[int]] (i.e., a function which takes the color vector and returns a vector of integers) as the argument.

Possible issues

Vectorization could be complicated

Vectorization of the argument is unclear, and maybe not possible. That is, what if I want the same variable twice, but ordered differently? This would look like:

sc.pl.umap(adata, color=["CD8", "CD8"], order_continuous=["ascending", "descending"])

Now, what if I also wanted to plot a categorical value? Would it be this:

sc.pl.umap(adata, color=["CD8", "CD8", "leiden"], order_continuous=["ascending", "descending", None])

Null values

This solution assumes we still want null values plotted on bottom. Should there be control over that?

Some references for other libraries:

  • altair.Sort
  • (I'm actually not sure if other libraries do this; datashader does max/min/mean, which is sort of similar?)
