Adapt the new scverse AIRR data structure #261

grst · 2023-04-07T07:26:02Z

Is your feature request related to a problem?

the new scverse AIRR data structure is out in scirpy v0.13 RC and ready to be experimented with.

As discussed previously, it would be fantastic to get dandelion aligned to it and I'm opening this issue to track this challenge.

Describe the solution you'd like

In an ideal world, dandelion could directly work on the new data structure -- with the potential to get rid of the separate Dandelion class and the h5ddl format.

Describe alternatives you've considered

This could turn out to be a tremendous amount of work to transfer to a new data structure (I experienced this myself with scirpy). Therefore an easier solution could be to just support import and export from the new data structure.

Additional context

No response

The text was updated successfully, but these errors were encountered:

zktuong · 2023-04-10T23:43:48Z

Thanks @grst.

yes indeed it probably will break lots of things but nevertheless i am committed to implementing this somehow.

At the first instance, I'm thinking I can probably get rid of Query class

dandelion/dandelion/utilities/_core.py

Lines 1470 to 1481 in 878c1a0

    
           class Query: 
        
               """Query class""" 
        
               def __init__(self, data, verbose=False): 
        
                   self.data = data.copy() 
        
                   self.Cell = Tree() 
        
                   for contig, row in tqdm( 
        
                       data.iterrows(), 
        
                       desc="Setting up data", 
        
                       disable=not verbose, 
        
                   ): 
        
                       self.Cell[row["cell_id"]][contig].update(row)

and use the awkward array you use.

Will keep you updated on the progress here (while i try to potentially find a student to work on this !)

DennisCambridge · 2023-04-19T11:16:16Z

Hi @grst and @zktuong, after upgrading scirpy with pip install --upgrade --pre scirpy, it takes much longer for ddl.from_scirpy(irdata) to complete.

grst · 2023-04-19T11:30:43Z

Hi @DennisCambridge,

i have an idea why (scverse/scirpy#399).

Can you try that branch?

pip install git+https://github.com/scverse/scirpy.git@speed-up-to-airr-cells

DennisCambridge · 2023-04-19T11:37:22Z

Hi Gregor @grst ,

Yes, it seems to have been restored after I installed this branch.

grst · 2023-04-19T11:38:22Z

perfect, thanks for your feedback! I'll merge it as soon as the checks ran through.

zktuong · 2023-04-28T04:31:20Z

some initial scalability test

scalability_test.pdf

zktuong · 2023-05-02T03:05:14Z

incorporate scverse-airr https://github.com/zktuong/dandelion/issues/261

zktuong · 2023-05-05T04:44:42Z

@grst am i right to think that the new structure essentially holds off on the generation of the cell level .obs data until necessary i.e. if the function needs cell level VDJ/VJ locus information, then it will call it within the function?

Or when the user wants to populate that they can just use adata.obs['VJ_1_locus'] = ir.get.airr(adata, "locus", "VJ_1")? I noticed that if i do this on mudata, the mudata.obs doesn't show it.

grst · 2023-05-05T06:12:16Z

am i right to think that the new structure essentially holds off on the generation of the cell level .obs data until necessary i.e. if the function needs cell level VDJ/VJ locus information, then it will call it within the function?
Or when the user wants to populate that they can just use adata.obs['VJ_1_locus'] = ir.get.airr(adata, "locus", "VJ_1")?

Exactly! There's also a neat context manager to just temporarily add a column for e.g. plotting and automatically remove it again:

with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m:
    mu.pl.umap(mdata, color="VJ_1_v_call")

I noticed that if i do this on mudata, the mudata.obs doesn't show it.

Mudata doesn't propagate changes to obs of one modality automatically. Essentially think of mdata.obs as a separate static data frame that is updated on-demand.

You can either assign columns to mdata.obs directly:

mdata.obs["airr:vj_1_locus"] = ir.get.airr(adata, "locus", "VJ_1")

or use the update_obs to propagate changes:

adata = mdata["airr"]
adata.obs["vj_1_locus"] =  ir.get.airr(adata, "locus", "VJ_1")
mdata.update_obs()

I made scirpy functions add to both adata.obs and mdata.obs by default: See DataHandler class.

zktuong · 2023-05-05T07:30:12Z

Gotcha!

with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m:
mu.pl.umap(mdata, color="VJ_1_v_call")

This didn't work for me unfortunately i had to do

with ir.get.airr_context(mdata["airr"], "v_call", chain=["VJ_1", "VDJ_1"]) as m:
    mu.pl.umap(mdata["airr"], color="VJ_1_v_call")

Anyway - ok now i know what the the new datastructure is doing and i've put in a couple of issues/requests

incomplete airr format scverse/scirpy#405
* AirrCell to retrieve chain field attribute values directly like a dictionary of nested values scverse/scirpy#406
to try out some ideas.

The roadmap that i have in mind at the moment is for dandelion to work using adata.obsm['airr'] as a start. This would require its own function similar to ir.get.airr.

zktuong · 2023-05-05T07:40:35Z

and support mudata too

grst · 2023-05-05T07:57:16Z

This didn't work for me unfortunately i had to do

Sorry, my bad. When using mdata directly, the key is color="airr:VJ_1_v_call".

This would require its own function similar to ir.get.airr.

Just for clarification: Would it need its own function because you don't want to have scirpy as a heavy dependency, or because it really needs to work differently? If it's the former, there's always the option to factor that out in a helper package, see also scverse/scirpy#385

zktuong · 2023-05-05T08:06:17Z

at the moment i'm wanting to access beyond VJ_1/2 and VDJ_1/2 in case there's more than 4 contigs - this is currently the main reason why.

still not working for mdata

for completeness this is my code:

mdata = ir.datasets.wu2020_3k()
mdata["gex"].obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]
mdata["airr"].obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]

with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m:
    mu.pl.umap(mdata, color="airr:VJ_1_v_call")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143), in embedding(data, basis, color, use_raw, layer, **kwargs)
    142 try:
--> 143     mod, basis_mod = basis.split(":")
    144 except ValueError:

ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[14], line 2
      1 with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m:
----> 2     mu.pl.umap(mdata, color="airr:VJ_1_v_call")

File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273), in umap(mdata, **kwargs)
    267 def umap(mdata: MuData, **kwargs) -> Union[Axes, List[Axes], None]:
    268     """
    269     UMAP Scatter plot
    270 
    271     See :func:`muon.pl.embedding` for details.
    272     """
--> 273     return embedding(mdata, basis="umap", **kwargs)

File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145), in embedding(data, basis, color, use_raw, layer, **kwargs)
    143     mod, basis_mod = basis.split(":")
    144 except ValueError:
--> 145     raise ValueError(f"Basis {basis} is not present in the MuData object (.obsm)")
    147 if mod not in data.mod:
    148     raise ValueError(
    149         f"Modality {mod} is not present in the MuData object with modalities {', '.join(data.mod)}"
    150     )

ValueError: Basis umap is not present in the MuData object (.obsm)

with ir.get.airr_context(mdata["airr"], "v_call", chain=["VJ_1", "VDJ_1"]) as m:
    mu.pl.umap(mdata, color="airr:VJ_1_v_call")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143), in embedding(data, basis, color, use_raw, layer, **kwargs)
    142 try:
--> 143     mod, basis_mod = basis.split(":")
    144 except ValueError:

ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[17], line 2
      1 with ir.get.airr_context(mdata["airr"], "v_call", chain=["VJ_1", "VDJ_1"]) as m:
----> 2     mu.pl.umap(mdata, color="airr:VJ_1_v_call")

File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273), in umap(mdata, **kwargs)
    267 def umap(mdata: MuData, **kwargs) -> Union[Axes, List[Axes], None]:
    268     """
    269     UMAP Scatter plot
    270 
    271     See :func:`muon.pl.embedding` for details.
    272     """
--> 273     return embedding(mdata, basis="umap", **kwargs)

File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145), in embedding(data, basis, color, use_raw, layer, **kwargs)
    143     mod, basis_mod = basis.split(":")
    144 except ValueError:
--> 145     raise ValueError(f"Basis {basis} is not present in the MuData object (.obsm)")
    147 if mod not in data.mod:
    148     raise ValueError(
    149         f"Modality {mod} is not present in the MuData object with modalities {', '.join(data.mod)}"
    150     )

ValueError: Basis umap is not present in the MuData object (.obsm)

grst · 2023-05-05T08:15:33Z

at the moment i'm wanting to access beyond VJ_1/2 and VDJ_1/2 in case there's more than 4 contigs - this is currently the main reason why.

fair enough. the scirpy.get.airr function is of course somewhat specific to the scirpy receptor model, so makes sense that dandelion has a different one.

still not working for mdata

mu.pl.umap looks for X_umap in the MuData object, not in the anndata objects. In general, whenever you use mdata.obs{m}, it will look in the "global" version of MuData. If you want to access a value from a modality, you need the airr: or gex: prefix.

So what should work here is either

mdata.obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]
mu.pl.umap(mdata, color="airr:VJ_1_v_call")

or

mdata["gex"].obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]
mdata.update() # <-- not sure if this is even needed
mu.pl.embedding(mdata, basis="gex:X_umap", color="airr:VJ_1_v_call")

ok preliminary commit to use scverse-airr structure as per #261 TODO: - wherever *.data is mentioned in dandelion, this needs to handle the awkward arrays - create a slot/function that stores the cell_id - update the tutorials - create a new to_scirpy/from_scirpy function that basically transfers the .obs, .obsm and .uns - create a to/from_ak function so that the rearrangement data can still be repopulated as a pandas dataframe. - would be cool if i can also make this dask compatible - only issue is dask.dataframe.DataFrame.compute is always slow..? need to learn more about dask. vaex is also cool but not really compatible for contig->cell level wrangling.

zktuong · 2023-05-09T07:27:19Z

OK i've made some preliminary progress over at (d5fb09f).

There's a list of TODOs that are probably quite trivial to work through - might get a few grad students to work on them over the next quarter.

I also had to deactivate the cell level (.obs equivalent) data generation because it takes a long time to load, compared to just operating from pandas dataframe natively.

zktuong · 2024-01-09T03:27:19Z

implemented in #342

grst · 2024-01-09T06:36:26Z

Ok, then I should probably change the scirpy functions io.from_dandelion and io.to_dandelion to be aliases of your functions once this is released?

zktuong · 2024-01-09T06:40:53Z

yes. perhaps @amoschoomy can do in the PR on scirpy's side and you can do a quick check?

grst · 2024-01-09T06:43:23Z

that would be fab!

grst added the enhancement New feature or request label Apr 7, 2023

zktuong mentioned this issue Dec 21, 2023

Awkward to pandas #342

Merged

zktuong closed this as completed Jan 9, 2024

grst mentioned this issue Jan 9, 2024

Dandelion interoperability scverse/scirpy#475

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adapt the new scverse AIRR data structure #261

Adapt the new scverse AIRR data structure #261

grst commented Apr 7, 2023

zktuong commented Apr 10, 2023 •

edited

Loading

DennisCambridge commented Apr 19, 2023

grst commented Apr 19, 2023

DennisCambridge commented Apr 19, 2023

grst commented Apr 19, 2023

zktuong commented Apr 28, 2023

zktuong commented May 2, 2023

zktuong commented May 5, 2023 •

edited

Loading

grst commented May 5, 2023

zktuong commented May 5, 2023 •

edited

Loading

zktuong commented May 5, 2023

grst commented May 5, 2023

zktuong commented May 5, 2023 •

edited

Loading

grst commented May 5, 2023

zktuong commented May 9, 2023

zktuong commented Jan 9, 2024

grst commented Jan 9, 2024

zktuong commented Jan 9, 2024

grst commented Jan 9, 2024

Adapt the new scverse AIRR data structure #261

Adapt the new scverse AIRR data structure #261

Comments

grst commented Apr 7, 2023

Is your feature request related to a problem?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

zktuong commented Apr 10, 2023 • edited Loading

DennisCambridge commented Apr 19, 2023

grst commented Apr 19, 2023

DennisCambridge commented Apr 19, 2023

grst commented Apr 19, 2023

zktuong commented Apr 28, 2023

zktuong commented May 2, 2023

zktuong commented May 5, 2023 • edited Loading

grst commented May 5, 2023

zktuong commented May 5, 2023 • edited Loading

zktuong commented May 5, 2023

grst commented May 5, 2023

zktuong commented May 5, 2023 • edited Loading

grst commented May 5, 2023

zktuong commented May 9, 2023

zktuong commented Jan 9, 2024

grst commented Jan 9, 2024

zktuong commented Jan 9, 2024

grst commented Jan 9, 2024

zktuong commented Apr 10, 2023 •

edited

Loading

zktuong commented May 5, 2023 •

edited

Loading

zktuong commented May 5, 2023 •

edited

Loading

zktuong commented May 5, 2023 •

edited

Loading