Storing sub-obs (variable length per observation data) #609

keller-mark · 2021-08-23T18:59:40Z

Hi,
We have a use case related to #237 but slightly different.

We would like to store a second obs array ("sub-obs", where an observation is an individual transcript in a MERFISH experiment), but related to the first obs array (where an observation is an individual cell).

Has your team thought about how to deal with this use case?

I am thinking something like this:

where the transcript ID and cell ID columns in the sub-obs can be like foreign keys into the main obs dataframe.
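A minimal sketch of this layout, assuming plain pandas dataframes stand in for the two tables. All IDs, gene names, and column names below are made up for illustration and are not part of any AnnData API.

```python
import pandas as pd

# Main obs table: one row per cell (hypothetical IDs and columns).
obs = pd.DataFrame(
    {"cell_type": ["neuron", "astrocyte", "neuron"]},
    index=pd.Index(["cell_0", "cell_1", "cell_2"], name="cell_id"),
)

# Sub-obs table: one row per transcript.
# The `cell_id` column acts as a foreign key into the obs index.
sub_obs = pd.DataFrame(
    {
        "transcript_id": ["t0", "t1", "t2", "t3", "t4"],
        "cell_id": ["cell_0", "cell_0", "cell_1", "cell_2", "cell_2"],
        "gene": ["Gad1", "Slc17a7", "Gfap", "Gad1", "Gad1"],
        "x": [1.0, 1.5, 7.2, 3.3, 3.9],
        "y": [0.2, 0.4, 5.1, 8.0, 8.3],
    }
).set_index("transcript_id")

# Every foreign key should resolve to a row in obs.
assert sub_obs["cell_id"].isin(obs.index).all()

# Resolving the key pulls cell-level annotations down to the transcript level.
annotated = sub_obs.join(obs, on="cell_id")

# The number of transcripts per cell varies, so sub-obs cannot share the shape of obs.
transcripts_per_cell = sub_obs.groupby("cell_id").size()
```

Here the two tables have different lengths (3 cells, 5 transcripts), which is the shape mismatch the question is about.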

Is it possible to add a differently-shaped obs and obsm to the same AnnData store?

cc @ilan-gold


ilan-gold · 2021-08-23T19:06:17Z

Yes @keller-mark! We really want a way to store this spot-level data in AnnData but are just not sure how to go about it. I think this proposal is really solid. We would definitely also be interested in crossover with squidpy if that would be of interest.

ivirshup · 2021-10-20T15:02:02Z

Hey, I have been talking about this a bit with @hspitzer and @giovp. I was wondering if something like Awkward Array would be useful here, i.e. a data structure for ragged arrays that allows variable-length arrays per observation? This should be easy-ish to support since they've already got tooling for serialization.

ilan-gold · 2021-10-20T15:16:25Z

@ivirshup Do you know how this would work with zarr? Would each observation have its own zarr array (as opposed to one shared one)?

ivirshup · 2021-10-20T15:33:53Z

Zarr allows for ragged arrays which is how I would probably try to do it, if I were doing it from scratch.

There have been implementations relying on Awkward's to_buffers and from_buffers, which I like because it looks very easy to implement. It's been a while since I looked at how these worked, so I can't comment on exactly what they do.
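To illustrate why a buffer-based approach is easy to serialize: a ragged array can be represented by flat, rectangular buffers (a content array plus an offsets array). The sketch below is a hand-rolled NumPy illustration of that general idea only. It is not Awkward's API or its exact buffer layout, and the helper names are hypothetical.

```python
import numpy as np

# Variable-length data: molecule indices assigned to each of three cells (made-up values).
ragged = [[0, 1, 4], [], [2, 3]]


def to_flat_buffers(lists):
    """Flatten a list of lists into (content, offsets). Hypothetical helper."""
    lengths = np.array([len(sub) for sub in lists], dtype=np.int64)
    # offsets[i]:offsets[i + 1] is the slice of `content` belonging to entry i.
    offsets = np.concatenate([[0], np.cumsum(lengths)]).astype(np.int64)
    content = np.array([v for sub in lists for v in sub], dtype=np.int64)
    return content, offsets


def from_flat_buffers(content, offsets):
    """Rebuild the list of lists from the two flat buffers. Hypothetical helper."""
    return [
        content[start:stop].tolist()
        for start, stop in zip(offsets[:-1], offsets[1:])
    ]


content, offsets = to_flat_buffers(ragged)
# Both buffers are ordinary 1D arrays, so any array store could hold them.
restored = from_flat_buffers(content, offsets)
```

Because both buffers are regular 1D arrays, they could in principle be written to Zarr or HDF5 like any other array, with the offsets recovering the per-observation structure on read.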

ilan-gold · 2021-10-20T15:39:21Z

> Zarr allows for ragged arrays which is how I would probably try to do it, if I were doing it from scratch. I believe there have been implementations relying on awkwards to_buffers and from_buffers, which I like because it looks very easy to implement.

🤯 I had no idea!! This looks extremely cool!

giovp · 2021-10-20T15:53:10Z

@ilan-gold thanks a lot for pointing me to ome/ngff#64 (review) as well as the hackathon, looks like really interesting work!

From what I gathered, the effort over at ome/ngff is to use anndata for tabular representations of FISH-based spatial data. In that case, observations are measured molecules instead of cells. Besides being an interesting approach, and definitely useful for storing annotations such as coordinates of the decoded molecules (alongside the image in zarr), I think what I have in mind here is slightly different. I might be completely wrong, so I'll explain:

What we are missing in the current anndata/squidpy analysis toolkit is a way to represent the original (processed) data from FISH-based assays in the cell-level anndata representation. The original data, after processing/decoding, is essentially in the form described before (and in the PR). However, there is no direct way to index/slice/subset the decoded molecules by cell-level observation. From my understanding of the problem, the cell-level observation is the basic unit for downstream analysis, and the one most useful for analysts. However, it could be desirable for EDA to eventually go back to the original molecule-level representation (e.g. given a clustering result, visualize all molecules of gene X in cell type Y across the tissue).
-- small detour: actually, how to go from the molecule-level to the cell-level representation seems to be very challenging, and a variety of approaches, very different from each other, are still being proposed.

For this, we essentially need a lookup table between cells (obs) and molecules (sub-obs). We already have something like this working but it's really ugly (essentially we store the sub-obs info as a pandas Series of lists, see this section of the tangram tutorial https://squidpy.readthedocs.io/en/latest/external_tutorials/tutorial_tangram.html#Deconvolution-and-mapping ). This lookup table can then be used to index/slice/subset the sub-obs annotation table (which we could actually store in adata.uns for now) and create the molecule-level anndata on the spot during plotting or any downstream analysis task. The Awkward Array proposed by @ivirshup (or just sparse matrices) is an interesting way to build such a lookup table. We could then save that lookup table in adata.obsm and access the sub-obs annotation in adata.uns on the go when needed.
Again, this is a different use case, orthogonal to the OME/ngff effort; would love to hear your thoughts on this.
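The lookup-table idea can be sketched roughly as follows, assuming plain pandas objects stand in for adata.obs and for the molecule table that might live in adata.uns. The names, cluster labels, and the `molecules_for` helper are hypothetical, and nothing here is anndata or squidpy API.

```python
import pandas as pd

# Cell-level table with a clustering result (hypothetical labels).
obs = pd.DataFrame(
    {"cluster": ["Y", "Z", "Y"]},
    index=["cell_0", "cell_1", "cell_2"],
)

# Molecule-level annotation table (the sub-obs).
molecules = pd.DataFrame(
    {
        "cell_id": ["cell_0", "cell_0", "cell_1", "cell_2", "cell_2"],
        "gene": ["X", "W", "X", "X", "X"],
        "x": [1.0, 1.5, 7.2, 3.3, 3.9],
        "y": [0.2, 0.4, 5.1, 8.0, 8.3],
    }
)

# Lookup table: for each cell in obs, the row positions of its molecules.
# Cells without any molecule get an empty list, so the lengths are ragged.
positions = molecules.groupby("cell_id").indices
lookup = {
    cell: positions[cell].tolist() if cell in positions else []
    for cell in obs.index
}


def molecules_for(cells, gene):
    """Return the molecules of `gene` that belong to the given cells."""
    rows = [i for cell in cells for i in lookup[cell]]
    subset = molecules.iloc[rows]
    return subset[subset["gene"] == gene]


# "All molecules of gene X in cell type Y": subset cells first, then follow the lookup.
cells_in_y = obs.index[obs["cluster"] == "Y"]
hits = molecules_for(cells_in_y, "X")
```

The per-cell lists in `lookup` have different lengths, which is where a ragged-array structure (or a sparse cell-by-molecule matrix) would replace the plain dict of lists.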

Ciao!

P.S. I'll reply to the rest of the email later on.

ilan-gold · 2021-10-20T16:10:01Z

@giovp You're totally right about this, I got a little carried away. And there's no rush on the email - I need to start coding and stop emailing you all so much 😄 Let me collect my thoughts on this and I'll post soon!

ivirshup · 2023-02-07T19:49:52Z

Closed by #647

keller-mark mentioned this issue Sep 9, 2021

Molecules via AnnData/Zarr, point cloud for 3D molecules vitessce/vitessce#1043

Draft

2 tasks

ivirshup changed the title ~~Storing sub-obs~~ Storing sub-obs (variable length per observation data) Oct 21, 2021

giovp mentioned this issue Nov 14, 2021

first attempt to support awkward arrays #647

Merged

17 tasks

ivirshup mentioned this issue Dec 10, 2021

Meta issue: new data types #662

Open

8 tasks

ivirshup mentioned this issue Feb 24, 2022

Support ragged arrays in AnnData.obs #712

Closed

ivirshup mentioned this issue Jan 31, 2023

TypeError when storing variable-length list of annotations for each row in obs and save to h5ad #888

Closed

ivirshup closed this as completed Feb 7, 2023
