Feature request - var_names/obs_names as fixed-sized types (integer or bytes) #777
Comments
I'm all for this, but think it requires a new API for indexing. How do we make sure the result of The hard part here is backwards compatibility. |
We would also need this for single nucleotide resolution data. Tagging @mffrank here. @ivirshup, maybe one way to do that would be to only allow a range index, this way e.g. The idea behind this is not that we need to use integers for indexing but rather that we don't want to have string indices in some use cases. |
Previous discussion: #311 |
Thanks, @ivirshup, I was just going to write that it seems like one can even use RangeIndex already, and we just need to figure out (a) places where the unnecessary
import numpy as np
import pandas as pd
from anndata import AnnData

df = pd.DataFrame(np.random.normal(size=(100, 10)))
adata = AnnData(df)
# => anndata/_core/anndata.py:120: ImplicitModificationWarning: Transforming to str index.
# => warnings.warn("Transforming to str index.", ImplicitModificationWarning)

# Assign the integer index back after construction
adata.obs_names = df.index
adata.obs_names
# => RangeIndex(start=0, stop=100, step=1)

# Integer indexing still works
adata[[0, 9, 99]]
# => View of AnnData object with n_obs × n_vars = 3 × 10 |
@LucaMarconato, this is relevant for speeding up reading of points for FISH-like data |
My current thinking here is just to copy the API of xarray as much as possible. The idea is to eventually move all label based indexing to a So, we deprecate indexing with The most important goal is to do this in a way that causes the fewest bugs possible. However, we would also like to quickly get to allowing integer based indexes.

Solution 1: time

We could have a significant period of time where using labels when indexing with Then we start allowing integer indices, assuming all label based indexing code is using

Solution 2:
|
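To make the split between positional and label-based indexing concrete, here is a toy sketch. It is not anndata's API: `ToyAxis` is a hypothetical class, its `isel` and `sel` methods only borrow their names from xarray, and the labels are made up. It shows why the same integer list is ambiguous once labels are integers.

```python
import numpy as np


class ToyAxis:
    """Hypothetical stand-in for one axis that carries integer labels."""

    def __init__(self, labels):
        self.labels = np.asarray(labels)
        # Map each label to its position so label lookups are cheap.
        self._pos = {label: i for i, label in enumerate(self.labels.tolist())}

    def isel(self, positions):
        """Positional indexing: 0-based offsets into the axis."""
        return self.labels[np.asarray(positions)]

    def sel(self, labels):
        """Label-based indexing: look up by the stored identifiers."""
        positions = [self._pos[label] for label in labels]
        return self.labels[positions]


axis = ToyAxis([10, 3, 0, 7])
by_position = axis.isel([0, 3])  # elements at offsets 0 and 3
by_label = axis.sel([0, 3])      # elements whose labels are 0 and 3
```

With separate entry points, `[0, 3]` has one meaning per method, so a caller never has to guess whether integers are offsets or identifiers.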
@ivirshup, my current thinking is to try to stay away from With the goal that you mentioned in mind, one alternative discussed recently might be to introduce an integer index and a new index class that allows using it instead of string-based indices. I.e. A hybrid between the suggested approaches might be the addition of a new method ( I'd be up to discuss more and maybe prototype some things! |
Thanks for the suggestions! As a minor point:
I'm aiming much more at

Index type

I broadly like this idea. I think DimensionalData.jl takes a similar approach, where you just use different types for different kinds of indexing. I think there are a couple downsides though:

Not fully backwards compatible

idx = adata.obs.query("celltype == 'a'").index
adata[idx]  # or adata[idx.value]

If

Familiar/ compatible with element types

I would like it more for python if there were popular libraries that behaved like this. I would also like to have indexing expressions that work as

I'm not sure I get the "hybrid" option
I'm a little confused by this proposal. How is this different from my suggested

Cheap labels (another alternative)

Not so much a solution, as just addressing the performance problem a different way: we could have a different form of cheap labels. Maybe just fixed length 64bit values (or 128 if we want them to be uuids). While I don't believe pandas has a fixed length dtype, this could be done with the

Idea for obs_names as 128 bit values

import numpy as np
import pandas as pd
import pyarrow as pa

N = 40_000_000
# Assuming 8-byte integers: two per row, viewed as one 16-byte value each
np_bytes = np.arange(N * 2).reshape((N, 2)).view("S16")[:, 0]
pa_bytes = pa.array(np_bytes, type=pa.binary(16))  # fixed-size binary
obs_names = pd.Index(pd.arrays.ArrowExtensionArray(pa_bytes)) |
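As a rough illustration of the "128 if we want them to be uuids" idea, the sketch below packs UUIDs into fixed 16-byte values using only NumPy and the standard library. The identifiers are randomly generated for the example, and the choice of the `"V16"` dtype is an assumption of this sketch, not something proposed in the thread.

```python
import uuid

import numpy as np

# Made-up identifiers for illustration only.
ids = [uuid.uuid4() for _ in range(1_000)]

# Pack each UUID into exactly 16 raw bytes. The void dtype "V16" keeps all
# 16 bytes, whereas "S16" drops trailing null bytes on element access.
fixed = np.frombuffer(b"".join(u.bytes for u in ids), dtype="V16")

# The canonical 36-character string form, as a fixed-width unicode array.
as_text = np.array([str(u) for u in ids])

bytes_per_fixed_label = fixed.itemsize
bytes_per_text_label = as_text.itemsize

# Round-trip one element back to a UUID object.
restored = uuid.UUID(bytes=fixed[0].tobytes())
```

Here each fixed label occupies 16 bytes, while NumPy's fixed-width unicode form of the same UUID occupies 144 bytes (36 characters at 4 bytes each).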
I fully agree. We already support integer slicing, with the semantic meaning of “index”. So I’d much prefer adding support for a newtype pattern using fixed size types and using that. That way we could support genomic ranges, k-mers/cell hashes, or UUIDs. I’m editing the title of this issue to match the original description which captured this. I’m also tentatively adding the “breaking change” label, but we might come up with an approach that isn’t breaking any assumptions. |
The nice thing about allowing non-integer types for the index is that it isn't a breaking change (well, maybe if edge cases were relying on the string conversion for non-integer types). Though apparently this would cause problems for R inter-op because they also only allow string rownames/ column names. |
Here’s a proof of concept for an UUID array type: https://gist.github.com/flying-sheep/99f2ceafdc494f97424222611b4f9474 |
We also have a use case in HuBMAP for storing annotated feature matrices for imaging data, with summary statistics and annotations for cells and nuclei identified in some image. In this case, the identifiers for cells or nuclei fundamentally are integers, with object i composed of all pixels in the segmentation mask image that have value i. Additionally, the index for these not only has to start from 1 (due to the convention of pixel value 0 meaning "background", not part of any cell or nucleus or other type of object), but needs to be non-contiguous, so as I understand it we couldn't use a (The non-contiguous case occurs whenever cells/nuclei touch the border of the image; it isn't very meaningful to compute total or mean protein expression or cell shape when half of the cell might be cut off.) For the moment we'll have to work around this by storing the object IDs as strings, but this is wrong -- the type of that identifier is "integer, starting from 1, with arbitrary portions of the range missing". |
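The shape of such an index can be sketched with a tiny, made-up segmentation mask. The mask values and variable names below are assumptions for illustration, not HuBMAP data or code. The result is an integer identifier set that starts at 1, has gaps, and does not line up with row positions.

```python
import numpy as np

# Made-up 5x5 segmentation mask: 0 is background, other values are object IDs.
mask = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 2, 0],
    [0, 0, 0, 2, 0],
    [0, 4, 4, 0, 0],
    [0, 0, 0, 0, 0],
])

all_ids = np.unique(mask)
all_ids = all_ids[all_ids != 0]  # drop background

# Objects with any pixel on the image border may be cut off, so exclude them.
border = np.concatenate([mask[0, :], mask[-1, :], mask[:, 0], mask[:, -1]])
kept_ids = np.setdiff1d(all_ids, border)

# Row position of each kept ID in a feature matrix built from kept_ids.
row_of = {int(obj_id): row for row, obj_id in enumerate(kept_ids)}
```

Object 1 touches the border and is dropped, so the remaining identifiers are 2 and 4, stored at rows 0 and 1. A range index cannot express this, and treating the identifiers as positions would select the wrong rows.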
Following issue #35, using integers in obs_names/var_names is allowed, yet slicing the objects is not possible after that. Could solutions for this particular exceptional case be discussed/added to the codebase?
Broadly speaking, if this is solved, it would help in integer- and bit-based representation of biological sequences as k-mers, and would play a role not only in sequence-based analyses of genomics data but additionally proteomics, RNA biology, etc.
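For readers unfamiliar with bit-based k-mers, a minimal sketch of one common encoding follows. The two-bit mapping and the function names `encode_kmer` and `decode_kmer` are illustrative choices, not part of anndata or of this request.

```python
# Two bits per base; this particular mapping is an arbitrary choice.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_kmer(kmer: str) -> int:
    """Pack a DNA k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def decode_kmer(value: int, k: int) -> str:
    """Unpack an integer produced by encode_kmer back into k bases."""
    bases = []
    for _ in range(k):
        bases.append(BITS_TO_BASE[value & 3])
        value >>= 2
    return "".join(reversed(bases))


code = encode_kmer("GATTACA")
```

With two bits per base, any k-mer with k up to 32 fits in a single 64-bit integer, which is what makes a fixed-size integer index attractive for this kind of data.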
Others who are interested in this and could maybe join in the discussion about memory and implementation considerations are @gtca @olgabot. Please tag others.
Thank you.