Feature request - var_names/obs_names as fixed-sized types (integer or bytes) #777

ilibarra · 2022-05-31T06:36:33Z

Following issue #35, using integers in obs_names/var_names is allowed, yet slicing the objects is not possible after that. Could solutions for this particular exceptional case be discussed/added to the codebase?

Broadly speaking, if this is solved, it would help in integer- and bit-based representation of biological sequences as k-mers, and would play a role not only in sequence-based analyses of genomics data but additionally proteomics, RNA biology, etc.

Others who are interested in this and could maybe join in the discussion about memory and implementation considerations are @gtca @olgabot. Please tag others.

Thank you.

ivirshup · 2022-05-31T11:32:15Z

I'm all for this, but think it requires a new API for indexing.

How do we make sure the result of adata[[1,2,3]] is unambiguous if the labels are integer valued? xarray and pandas support this with indexing attributes: df.iloc[...] vs df.loc[...] for pandas and da[...] vs da.loc[...] for xarray.

The hard part here is backwards compatibility.

gtca · 2022-05-31T11:41:29Z

We would also need this for single nucleotide resolution data. Tagging @mffrank here.

@ivirshup, maybe one way to do that would be to only allow a range index, this way e.g. adata[np.array(0, 2)] will mean getting the first and the third obs irrespective of whether you think those are names or indices.
As range indices were not possible before, that should be ok with backwards compatibility.

The idea behind this is not that we need to use integers for indexing but rather that we don't want to have string indices in some use cases.

ivirshup · 2022-05-31T12:33:51Z

Previous discussion: #311

gtca · 2022-05-31T12:50:39Z

Thanks, @ivirshup, I was just going to write that it seems like one can even use RangeIndex already, and we just need to figure out (a) places where the unnecessary RangeIndex -> str conversion should be avoided and (b) (de)serialisation (should be easier than before since #555).

import numpy as np
import pandas as pd
import anndata

df = pd.DataFrame(np.random.normal(size=(100,10)))
adata = AnnData(x)
# => anndata/_core/anndata.py:120: ImplicitModificationWarning: Transforming to str index.
# =>   warnings.warn("Transforming to str index.", ImplicitModificationWarning)

adata.obs_names = df.index
adata.obs_names
# => RangeIndex(start=0, stop=100, step=1)

adata[[0,9,99]]
# => View of AnnData object with n_obs × n_vars = 3 × 10

ivirshup · 2022-10-01T14:10:12Z

@LucaMarconato, this is relevant for speeding up reading of points for FISH-like data

ivirshup · 2022-12-06T23:02:12Z

My current thinking here is just to copy the API of xarray as much as possible.

The idea is to eventually move all label based indexing to a .loc attribute, and keep [] for positional.

So, we deprecate indexing with [] when it's label based. However, this would not yet allow label based indexing by integers, as any current calls to [] that used positional indexing but was passed an AnnData with integer indexes would have incorrect behavior.

The most important goal is to do this in a way that causes the fewest bugs possible. However, we would also like to quickly get to allowing integer based indexes.

Solution 1: time

We could have a significant period of time where using labels when indexing with [] throws an error. This should give downstream packages enough time to update.

Then we start allowing integer indices, assuming all label based indexing code is using .loc.

Solution 2: `iloc`

At the same time as loc we could introduce iloc for positional indexing. This would get us to integer indices much faster.

For any ambiguous cases (e.g. where axes are integer valued) we could throw errors for [].

Permanently

We then need to decide if we want to make this permanent. In this case we have effectively made [] indexing something to avoid, as whether it works is dependent on the contents of the AnnData.

Temporarily

After some period of time, we could then start allowing positional indexing via [] again, and deprecate .iloc. While I don't love the idea of an intentionally temporary API, removing iloc will be a very easy find and replace.

Questions

How much time is needed here? I think it has to be signifigant. Possibly a year between first deprecation and last removal.
Are there less disruptive ways to do this?

gtca · 2022-12-06T23:26:35Z

@ivirshup, my current thinking is to try to stay away from .loc and explore other alternatives.
While .loc could help to make it more pandas-like, it doesn't seem to me like that should be a goal here.

With the goal that you mentioned in mind, one alternative discussed recently might be to introduce integer index and a new index class that will allow to use it instead of string-based indices. I.e. adata[[0, 1, 2]] would still be subsetting by row numbers while adata[Index(3, 4)] would refer to the integer index.
Something like this will also require significantly less time to roll out and gather feedback. This will be a backward-compatible change.

A hybrid between the suggested approached might be an addition of a new method (.loc-like) for the integer/range index while using integers in [] to refer to row numbers without any ambiguity. (For strings there is no ambiguity to start with.) That also makes sense in practice as integer index essentially means that we do not care about feature names per se (but we still need some IDs in order to keep track). This is also a backward-compatible change.

I'd be up to discuss more and maybe prototype some things!

ivirshup · 2022-12-07T00:13:22Z

Thanks for the suggestions!

As a minor point:

While .loc could help to make it more pandas-like, it doesn't seem to me like that should be a goal here.

I'm aiming much more at xarray than pandas here. A big upside of removing this ambiguity is that we could also start using the xarray index dtypes.

Index type

I broadly like this idea. I think DimensionalData.jl takes a similar approach, where you just use different types for different kinds of indexing. I think there are a couple downsides though:

Not fully backwards compatible

idx = adata.obs.query("celltype == 'a'").index
adata[idx] # or adata[idx.value]

If obs_names were strings, this would be label based indexing, but if it were integer valued this would be positional. So any function that uses this kind of selection logic (I'm sure this exists in scanpy for example) would have incorrect behavior if passed an AnnData with integer labels.

Familiar/ compatible with element types

I would like it more for python if there were popular libraries that behaved like this. I would also like to have indexing expressions that work as adata[idx] to work on adata.obsm["a"][idx] where possible.

I'm not sure I get the "hybrid" option

A hybrid between the suggested approached might be an addition of a new method (.loc-like) for the integer/range index while using integers in [] to refer to row numbers without any ambiguity.

I'm a little confused by this proposal. How is this different from my suggested .loc option? And would statements that make the index type not backwards compatible be an issue here as well?

Cheap labels (another alternative)

Not so much a solution, as just addressing the performance problem a different way: we could have a different form of cheap labels. Maybe just fixed length 64bit values (or 128 if we want them to be uuids).

While I don't believe pandas has a fixed length dtype, this could be done with the ArrowExtensionArray dtype and arrays with pyarrow.binary(nbytes) labels.

Idea for obs_names as 128 bit values

import pandas as pd, numpy as np, pyarrow as pa

N = 40_000_000

np_bytes = np.arange(N * 2).reshape((N, 2)).view("S16")[:, 0]
pa_bytes = pa.array(np_bytes, type=pa.binary(16))
obs_names = pd.Index(pd.arrays.ArrowExtensionArray(pa_bytes))

- Add scverse/anndata#777 (non-string indices) - Fix the link to the tutorials to be rendered

flying-sheep · 2023-07-11T14:49:25Z

[…] I think I'm leaning towards non-numeric, but efficient, datatypes. Like a UUID as a fixed length byte-string. The previous big blocker to this seems to have been removed in pandas:

PERF: Indexes with arrow arrays covert the arrays to numpy object arrays, and that's very slow pandas-dev/pandas#50121 (comment)

– @ivirshup in #199 (comment)

I fully agree. We already support integer slicing, with the semantic meaning of “index”. So I’d much prefer adding support for a newtype pattern using fixed size types and using that.

That way we could support genomic ranges, k-mers/cell hashes, or UUIDs.

I’m editing the title of this issue to match the original description which captured this. I’m also tentatively adding the “breaking change” label, but we might come up with an approach that isn’t breaking any assumptions.

ivirshup · 2023-07-11T15:42:30Z

The nice thing about allowing non-integer types for the index is that it isn't a breaking change (well, maybe if edge cases were relying on the string conversion for non-integer types).

Though apparently this would cause problems for R inter-op because they also only allow string rownames/ column names.

flying-sheep · 2023-08-31T14:04:37Z

Here’s a proof of concept for an UUID array type:

https://gist.github.com/flying-sheep/99f2ceafdc494f97424222611b4f9474

mruffalo · 2024-05-08T16:02:06Z

We also have a use case in HuBMAP for storing annotated feature matrices for imaging data, with summary statistics and annotations for cells and nuclei identified in some image.

In this case, the identifiers for cells or nuclei fundamentally are integers, with object i composed of all pixels in the segmentation mask image that have value i. Additionally, the index for these not only has to start from 1 (due to the convention of pixel value 0 meaning "background", not part of any cell or nucleus or other type of object), but needs to be non-contiguous, so as I understand it we couldn't use a RangeIndex.

(The non-contiguous case occurs whenever cells/nuclei touch the border of the image; it isn't very meaningful to compute total or mean protein expression or cell shape when half of the cell might be cut off.)

For the moment we'll have to work around this by storing the object IDs as strings, but this is wrong -- the type of that identifier is "integer, starting from 1, with arbitrary portions of the range missing".

ivirshup added enhancement topic: indexing labels May 31, 2022

ivirshup added help wanted 🙏 performance 🐌 topic: compatibility labels Dec 6, 2022

gtca added a commit to scverse/governance that referenced this issue Jan 10, 2023

Update the roadmap

c45c757

- Add scverse/anndata#777 (non-string indices) - Fix the link to the tutorials to be rendered

rcannood mentioned this issue May 6, 2023

obs_names and var_names as character/str or not scverse/anndataR#73

Open

ivirshup mentioned this issue Jul 11, 2023

Slowdown when initializing from a large sparse matrix #199

Closed

This was referenced Jul 11, 2023

[bug] Support integers as column names #31

Closed

Is logging "Transforming to str index." necessary? #311

Closed

flying-sheep changed the title ~~Feature request - var_names/obs_names as numeric types (integer)~~ Feature request - var_names/obs_names as fixed-sized types (integer or bytes) Jul 11, 2023

flying-sheep added the breaking change ‼️ A breaking change that needs a major version update label Jul 11, 2023

ivirshup mentioned this issue Aug 29, 2023

Make checking var_names and obs_names for uniqueness optional #1112

Closed

rcannood mentioned this issue Jul 8, 2024

var_names/obs_names as fixed-size types scverse/anndataR#172

Open

LucaMarconato mentioned this issue Jul 10, 2024

ImageTilesDataset throws RuntimeError during validation when instantiated scverse/spatialdata#628

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature request - var_names/obs_names as fixed-sized types (integer or bytes) #777

Feature request - var_names/obs_names as fixed-sized types (integer or bytes) #777

ilibarra commented May 31, 2022

ivirshup commented May 31, 2022

gtca commented May 31, 2022 •

edited

Loading

ivirshup commented May 31, 2022

gtca commented May 31, 2022

ivirshup commented Oct 1, 2022

ivirshup commented Dec 6, 2022 •

edited

Loading

gtca commented Dec 6, 2022 •

edited

Loading

ivirshup commented Dec 7, 2022 •

edited

Loading

flying-sheep commented Jul 11, 2023 •

edited

Loading

ivirshup commented Jul 11, 2023

flying-sheep commented Aug 31, 2023

mruffalo commented May 8, 2024

Feature request - var_names/obs_names as fixed-sized types (integer or bytes) #777

Feature request - var_names/obs_names as fixed-sized types (integer or bytes) #777

Comments

ilibarra commented May 31, 2022

ivirshup commented May 31, 2022

gtca commented May 31, 2022 • edited Loading

ivirshup commented May 31, 2022

gtca commented May 31, 2022

ivirshup commented Oct 1, 2022

ivirshup commented Dec 6, 2022 • edited Loading

Solution 1: time

Solution 2: iloc

Permanently

Temporarily

Questions

gtca commented Dec 6, 2022 • edited Loading

ivirshup commented Dec 7, 2022 • edited Loading

Index type

Not fully backwards compatible

Familiar/ compatible with element types

I'm not sure I get the "hybrid" option

Cheap labels (another alternative)

flying-sheep commented Jul 11, 2023 • edited Loading

ivirshup commented Jul 11, 2023

flying-sheep commented Aug 31, 2023

mruffalo commented May 8, 2024

gtca commented May 31, 2022 •

edited

Loading

ivirshup commented Dec 6, 2022 •

edited

Loading

Solution 2: `iloc`

gtca commented Dec 6, 2022 •

edited

Loading

ivirshup commented Dec 7, 2022 •

edited

Loading

flying-sheep commented Jul 11, 2023 •

edited

Loading