Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concat API lacks semantic dimension identifier #1245

Open
3 tasks done
flying-sheep opened this issue Nov 28, 2023 · 14 comments · May be fixed by #1244
Open
3 tasks done

Concat API lacks semantic dimension identifier #1245

flying-sheep opened this issue Nov 28, 2023 · 14 comments · May be fixed by #1244
Assignees

Comments

@flying-sheep
Copy link
Member

flying-sheep commented Nov 28, 2023

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of anndata.
  • (optional) I have confirmed this bug exists on the master branch of anndata.

Report

AnnData was designed around semantic dimension names instead of implementation details.

The concat API lacks support for this basic design goal.

While it’s of course not entirely possible to avoid numerical axes like that, all APIs should support semantic axis names.

Code:

anndata.concat([...], dim="var")

Traceback:

TypeError: Function `concat` has no parameter named `dim`

Versions

0.10.3
@flying-sheep flying-sheep linked a pull request Nov 28, 2023 that will close this issue
3 tasks
@ivirshup
Copy link
Member

ivirshup commented Dec 12, 2023

I'm unsure about this.

We let people put names on their dimensions. E.g. adata.obs_names.name = "cell", adata.var_names.name = "gene". If we are able to move towards compatibility with xarray, this will become even more important. In their case dim is used as the keyword argument for selecting dimension, but will refer to the name assigned.

I don't know that we should have dim only refer to "obs", "var" while those dimensions can be assigned names dynamically. I also don't think we should consider dropping the functionality of naming the dimensions.

Right now, the obs, var convention almost entirely exists in attribute space. Does it need to move to "string" space?

@ivirshup
Copy link
Member

@ilan-gold, do you have thoughts on this w.r.t. the xarray API?

@gtca, do you have thoughts on this w.r.t. MuData? I think this is one of the strongest cases for allowing dynamic naming of dimensions is for multimodal data, e.g. something like:

mudata.mod["rna"].var_names.name == "gene", mudata.mod["atac"].var_names.name == "genomic_bin"

@ilan-gold
Copy link
Contributor

This is a tough point, actually. If your dimension name in xarray is something like obs_names, I don't think there's a second name you get to put on top of that (you could put a name on the underlying array as you do here, but this is a function of that array and not xarray). I am still getting familiar with xarray though so this may be wrong. But even if you tried to do that it would not work.

An example that hopefully illustrates I know what I'm talking about:

num_obs = 10
obs_names = np.arange(0, num_obs)
col1 = xr.DataArray(np.random.rand(num_obs,), coords=[obs_names], dims=["obs_names"])
col2 = xr.DataArray(np.random.rand(num_obs,), coords=[obs_names], dims=["obs_names"])
obs =  xr.Dataset(dict(col1=col1, col2=col2))
# <xarray.Dataset>
# Dimensions:    (obs_names: 10)
# Coordinates:
#   * obs_names  (obs_names) int64 0 1 2 3 4 5 6 7 8 9
# Data variables:
#     col1       (obs_names) float64 0.3043 0.4391 0.7602 ... 0.309 0.8719 0.6947
#     col2       (obs_names) float64 0.2787 0.5073 0.572 ... 0.6495 0.9579 0.8732
obs['obs_names']
# <xarray.DataArray 'obs_names' (obs_names: 10)>
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
# Coordinates:
#   * obs_names  (obs_names) int64 0 1 2 3 4 5 6 7 8 9
obs['obs_names'].name
# obs_names
obs['obs_names'].name = 'other_name'
obs['obs_names'].name
# obs_names

However, that brings up the point of whether or not things like obs_names should be standardized, or if we should always just assume 1 dimension of indexing, and that dimension can be called anything. This is sort of a sub-question of this since it is the same question but nested within the obs object (names vs some sort of known numeric indexing).

Right now, the obs, var convention almost entirely exists in attribute space. Does it need to move to "string" space?

What does this mean?

@flying-sheep
Copy link
Member Author

flying-sheep commented Dec 12, 2023

The core design philosophy of AnnData is “a lot of data has two main dimensions, obs and var”.

Of course it makes sense to have metadata describing these axes (e.g. “observations are cells, variables are transcripts”), but obs and var are how AnnData refers to them.

What does this mean?

Isaac means that adata.obs{,m,p} are attributes, whereas api(dim="obs") specifies the dimension as string name.

Of course the exact same argument applies to axis. In AnnData, apart from concat, it only exists as the position in indexing syntax, not as int.

@gtca
Copy link

gtca commented Dec 12, 2023

@ivirshup

I think this is one of the strongest cases for allowing dynamic naming of dimensions is for multimodal data

I think I am not convinced about dynamic dim naming yet. In the multimodal case, there are already modality names, and a single AnnData object is already concerned with a particular feature set.

In the context of this thread, while it makes sense to define operations like concatenation through axes (0, 1) or their names (obs, var — as our way of labelling "samples" and "features"), I am not sure how the user-defined names are going to be helpful. E.g. if the obs axis has a different .name defined in different AnnData objects, it only makes sense to talk about concatenation along 0 / "obs" rather than along a set of names.

In addition to that, there's a concern that codebases would lose some generality that they have now if we allow for dynamic naming. I.e. methods that work on "cells" will suddenly not be applicable to objects with "samples", some dimensionality reduction with the hardcoded "genes" dimension will not run on "peaks", etc.

@ivirshup
Copy link
Member

Compatibility with xarray

@ilan-gold, @flying-sheep I've put together a short demo of where I think there will be conflicts with xarray.

import anndata as ad, pandas as pd, numpy as np, xarray as xr

M, N  = (3, 2)

adata = ad.AnnData(
    np.arange(M * N).reshape((M, N)),
    obs=pd.DataFrame(
        index=pd.Index(["cell_{}".format(i) for i in range(M)], name="cell")
    ),
    var=pd.DataFrame(
        index=pd.Index(["gene_{}".format(i) for i in range(N)], name="gene")
    ),
)

# Creating a dataarray using the same indices
da = xr.DataArray(
    adata.X,
    coords=(adata.obs_names, adata.var_names),
)

# Xarray semantics:

da.sum(dim="cell")
# <xarray.DataArray (gene: 2)>
# array([6, 9])
# Coordinates:
#   * gene     (gene) object 'gene_0' 'gene_1'
xr.concat([da, da], dim="cell")
# <xarray.DataArray (cell: 6, gene: 2)>
# array([[0, 1],
#        [2, 3],
#        [4, 5],
#        [0, 1],
#        [2, 3],
#        [4, 5]])
# Coordinates:
#   * cell     (cell) object 'cell_0' 'cell_1' 'cell_2' 'cell_0' 'cell_1' 'cell_2'
#   * gene     (gene) object 'gene_0' 'gene_1'
xr.concat([da,da], dim="gene")
# <xarray.DataArray (cell: 3, gene: 4)>
# array([[0, 1, 0, 1],
#        [2, 3, 2, 3],
#        [4, 5, 4, 5]])
# Coordinates:
#   * cell     (cell) object 'cell_0' 'cell_1' 'cell_2'
#   * gene     (gene) object 'gene_0' 'gene_1' 'gene_0' 'gene_1'

For the sake of interoperability I don't think we should have a keyword argument dim that takes ["obs", "var"] as values. Some cases this would make weird:

adata.X = da

adata.X.sum(dim=???)
ad.concat([adata, adata], dim=???)

ds = adata.to_xarray()  # Theoretical method that returns a `xr.Dataset` object
xr.concat([ds, ds], dim=???)

So what are the options here?

  1. AnnData always sets the dimension names to "obs", "var"

We basically delete this information out of old files. Also makes operations like reset_index less useful, and will break user code that relied on this. Also we can't actually enforce this at runtime, since we don't control what happens when a user does: adata.obs.index.name = "..."

  1. The dim names are not used by scverse functions. Only "obs", "var".

This is basically the case above. I really don't like:

adata.X = da
ad.concat([adata, adata], dim="obs").X.sum(dim="cells")

Especially since using any other permutation of those strings would not work.

  1. scverse function use the assigned dim names as arguments for dim

I think this makes the most sense. However we do not enforce that adata.obs_names.name, adata.var_names.name dimension names are unique. In fact, they are quite frequently indistinct.

Since there is data written which specifies this we'd need to figure out what happens there. Probaly we error when it's ambiguous.

This requires that there's a way to reference these things unambiguously, which would be axis.

  1. Sorta both?

We could allow "obs", "var" and assigned dimension names.

This increases the number of ambiguous cases.

Another thought: using user assigned names can lead to repeated dimensions names. Xarray is currently not capable of handling these, so we'd need to figure out how to handle that.


MuData

@gtca, when we've talked before about how we can use MuData to represent not just same-dataset different-modality and even collections of datasets we got into "how do you indicate which sample set or modality a particular anndata object is part of". I believe one of the approaches that came up was naming the dimensions.

I would think that it could be nice to do something like:

.mod = {
    AnnData{dataset_1, genes}
    AnnData{dataset_1, atac}
    AnnData{dataset_2, genes}
    AnnData{dataset_2, immune_receptor}
    AnnData{dataset_3, atac}
}

Though I wonder if instead of dataset_{n} they should all be cells since they are the same kind of observation. Or

.mod = {
    AnnData{slide_1_cells, immuno_fluorescence},
    AnnData{slice_1_tissues, immuno_flourescence},
    AnnData{slice_2_cells, immuno_fluorescence},
    AnnData{slice_2_cells, FISH},
}

It looks like you've started to address this with the MuData(axis=) argument. I would be interested in how an dim argument could be used here. A few questions:

It looks like users have the option of having a MuData object represent:

  • Joint observations, multiple modalities
  • Disjoint observations, single modality
  • Joint observations, subsets of one modality

Is this correct? And can MuData represent say, multiple sets of observations each of which has a variable selection of modalities?


Spatialdata?

@scverse/spatialdata would also be nice to know if I'm forgetting anything about any implications for SpatialData here.

@giovp
Copy link
Member

giovp commented Dec 14, 2023

I don't see implications for spatialdata atm.

But, unrequested feedback, I think I understand the dynamic naming potentials, and am excited by xarray compatibility, but I must say I don't think I've ever used the current dynamic names and am not familiar with any API that uses it.

Therefore I guess I see the potential and don't see a real barrier to it, as adding the dimension semantic identifier would encourage users and developers to use it/think about it.

Therefore, I would go for this

  1. scverse function use the assigned dim names as arguments for dim

I also think that API(dim="obs") is also not very intuitive, as I also think of them as attributes.

@flying-sheep
Copy link
Member Author

flying-sheep commented Dec 14, 2023

My intuition comes from the central idea I designed AnnData around: AnnData has two main dimensions that represent observations and variables (while we iterated, “samples” and “features” also came up, but we settled on the former). the way I designed AnnData, and the way it has been from the first commit until today is as follows: The identifiers of AnnData’s two dimensions are obs and var. If we allow other equally valid identifiers to alias obs and var, that already introduces ambiguity and should therefore be avoided.

Our amount of support for adata.obs.index.name allows users to treat it as the obs dimension’s label, just like ax.xaxis.get_label() can get you the x axis’s label. I think it would be pretty gross to have an API that expects users to pass that string in to refer to the x axis in some way. Likewise I don’t think we should elevate Index.name to anything more than it is right now. If other APIs like xarray start treating pandas index names as identifiers, we should clarify that in AnnData, it’s not a dimension identifier and maybe consider removing support for it to minimize confusion.

@ivirshup for your examples, this should be the API:

ds = adata.to_xarray()  # Theoretical method that returns a `xr.Dataset` object
xr.concat([ds, ds], dim='obs')  # or 'var'

I really don't like:

adata.X = da
ad.concat([adata, adata], dim="obs").X.sum(dim="cells")

Me neither. So we need to make sure that the first line fails with ValueError: dimension name mismatch or so. Any API call for storing an xarray in AnnData or creating AnnData(some_xarray) needs to have a check for matching names or an explicit mapping between named dimension and obs/var, similarly to when we check .obs_names and .var_names for consistency when putting DataFrames into AnnData.

Also we can't actually enforce this at runtime, since we don't control what happens when a user does: adata.obs.index.name = "..."

We have this problem in multiple places. That’s why I didn’t like adding support for adata.X = pd.DataFrame(...) without solving that first.

So maybe we should just go around to solving that before we allow using xarray in AnnData.

@flying-sheep flying-sheep self-assigned this Jan 9, 2024
@flying-sheep
Copy link
Member Author

We could also do axis: Literal[0, 1, 'obs', 'var'] and leave dim for a future xarray/… compatible API

@ivirshup
Copy link
Member

axis: Literal[0, 1, 'obs', 'var']

This makes more sense to me.

For internal variables I would always make axis be an integer, then use axis_name (or similar) for "obs", "var" since many of our dependencies are going to expect that axis is an integer.

@flying-sheep
Copy link
Member Author

Great! I updated #1244 accordingly

@ivirshup
Copy link
Member

@gtca, is this something you would adopt in MuData? Any thoughts on how this interacts with your axis=-1 argument?

@flying-sheep
Copy link
Member Author

flying-sheep commented Jan 30, 2024

I just saw that this matches pandas’ API:

axis : {0/'index', 1/'columns'}, default 0

@gtca
Copy link

gtca commented Feb 1, 2024

axis: Literal[0, 1, 'obs', 'var']

Makes sense, here's one to track for MuData: scverse/mudata#64.

For axis=-1, it's mainly relevant only when instantiating a MuData object, and I would just keep the (shared) axis argument to Literal[0, 1, 'obs', 'var', -1] though there's an option to add 'both' or some other identifier to it.


@ivirshup, if it helps to clarify the comment above #1245 (comment), the way to go there is nested MuData objects, e.g. see scverse/mudata#44 with an example for a nested object with one multimodal dataset:

  root MuData (axis=0) (5000 x 20100)
  ├── protein AnnData (5000 x 100)
  └── rna MuData (axis=-1) (5000 x 20000)
      ├── raw AnnData (5000 x 20000)
      ├── quality-filtered AnnData (3000 x 20000)
      └── hvg-filtered AnnData (3000 x 4000)

You should be able to nest multiple datasets like that and so on.

I guess the main point here is that this is intentionally not a 2D format (unlike e.g. PostData) so you have to structure your data storage.

Supporting deeply nested objects by tools is another great question of course. 😃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants