Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adapt the new scverse AIRR data structure #261

Closed
grst opened this issue Apr 7, 2023 · 19 comments
Closed

Adapt the new scverse AIRR data structure #261

grst opened this issue Apr 7, 2023 · 19 comments
Labels
enhancement New feature or request

Comments

@grst
Copy link
Contributor

grst commented Apr 7, 2023

Is your feature request related to a problem?

Hi @zktuong and @DennisCambridge,

the new scverse AIRR data structure is out in scirpy v0.13 RC and ready to be experimented with.

As discussed previously, it would be fantastic to get dandelion aligned to it and I'm opening this issue to track this challenge.

Describe the solution you'd like

In an ideal world, dandelion could directly work on the new data structure -- with the potential to get rid of the separate Dandelion class and the h5ddl format.

Describe alternatives you've considered

This could turn out to be a tremendous amount of work to transfer to a new data structure (I experienced this myself with scirpy). Therefore an easier solution could be to just support import and export from the new data structure.

Additional context

No response

@grst grst added the enhancement New feature or request label Apr 7, 2023
@zktuong
Copy link
Owner

zktuong commented Apr 10, 2023

Thanks @grst.

yes indeed it probably will break lots of things but nevertheless i am committed to implementing this somehow.

At the first instance, I'm thinking I can probably get rid of Query class

class Query:
"""Query class"""
def __init__(self, data, verbose=False):
self.data = data.copy()
self.Cell = Tree()
for contig, row in tqdm(
data.iterrows(),
desc="Setting up data",
disable=not verbose,
):
self.Cell[row["cell_id"]][contig].update(row)
and use the awkward array you use.

Will keep you updated on the progress here (while i try to potentially find a student to work on this !)

@DennisCambridge
Copy link

Hi @grst and @zktuong, after upgrading scirpy with pip install --upgrade --pre scirpy, it takes much longer for ddl.from_scirpy(irdata) to complete.

@grst
Copy link
Contributor Author

grst commented Apr 19, 2023

Hi @DennisCambridge,

i have an idea why (scverse/scirpy#399).

Can you try that branch?

pip install git+https://github.com/scverse/scirpy.git@speed-up-to-airr-cells

@DennisCambridge
Copy link

Hi Gregor @grst ,

Yes, it seems to have been restored after I installed this branch.

@grst
Copy link
Contributor Author

grst commented Apr 19, 2023

perfect, thanks for your feedback! I'll merge it as soon as the checks ran through.

@zktuong
Copy link
Owner

zktuong commented Apr 28, 2023

some initial scalability test

scalability_test.pdf

@zktuong
Copy link
Owner

zktuong commented May 2, 2023

@zktuong
Copy link
Owner

zktuong commented May 5, 2023

@grst am i right to think that the new structure essentially holds off on the generation of the cell level .obs data until necessary i.e. if the function needs cell level VDJ/VJ locus information, then it will call it within the function?

Or when the user wants to populate that they can just use adata.obs['VJ_1_locus'] = ir.get.airr(adata, "locus", "VJ_1")? I noticed that if i do this on mudata, the mudata.obs doesn't show it.

@grst
Copy link
Contributor Author

grst commented May 5, 2023

am i right to think that the new structure essentially holds off on the generation of the cell level .obs data until necessary i.e. if the function needs cell level VDJ/VJ locus information, then it will call it within the function?
Or when the user wants to populate that they can just use adata.obs['VJ_1_locus'] = ir.get.airr(adata, "locus", "VJ_1")?

Exactly! There's also a neat context manager to just temporarily add a column for e.g. plotting and automatically remove it again:

with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m:
    mu.pl.umap(mdata, color="VJ_1_v_call")

I noticed that if i do this on mudata, the mudata.obs doesn't show it.

Mudata doesn't propagate changes to obs of one modality automatically. Essentially think of mdata.obs as a separate static data frame that is updated on-demand.

You can either assign columns to mdata.obs directly:

mdata.obs["airr:vj_1_locus"] = ir.get.airr(adata, "locus", "VJ_1")

or use the update_obs to propagate changes:

adata = mdata["airr"]
adata.obs["vj_1_locus"] =  ir.get.airr(adata, "locus", "VJ_1")
mdata.update_obs()

I made scirpy functions add to both adata.obs and mdata.obs by default: See DataHandler class.

@zktuong
Copy link
Owner

zktuong commented May 5, 2023

Gotcha!

with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m:
mu.pl.umap(mdata, color="VJ_1_v_call")

This didn't work for me unfortunately i had to do

with ir.get.airr_context(mdata["airr"], "v_call", chain=["VJ_1", "VDJ_1"]) as m:
    mu.pl.umap(mdata["airr"], color="VJ_1_v_call")

Anyway - ok now i know what the the new datastructure is doing and i've put in a couple of issues/requests

The roadmap that i have in mind at the moment is for dandelion to work using adata.obsm['airr'] as a start. This would require its own function similar to ir.get.airr.

@zktuong
Copy link
Owner

zktuong commented May 5, 2023

and support mudata too

@grst
Copy link
Contributor Author

grst commented May 5, 2023

This didn't work for me unfortunately i had to do

Sorry, my bad. When using mdata directly, the key is color="airr:VJ_1_v_call".

This would require its own function similar to ir.get.airr.

Just for clarification: Would it need its own function because you don't want to have scirpy as a heavy dependency, or because it really needs to work differently? If it's the former, there's always the option to factor that out in a helper package, see also scverse/scirpy#385

@zktuong
Copy link
Owner

zktuong commented May 5, 2023

at the moment i'm wanting to access beyond VJ_1/2 and VDJ_1/2 in case there's more than 4 contigs - this is currently the main reason why.

still not working for mdata

for completeness this is my code:

mdata = ir.datasets.wu2020_3k()
mdata["gex"].obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]
mdata["airr"].obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]

with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m:
    mu.pl.umap(mdata, color="airr:VJ_1_v_call")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143), in embedding(data, basis, color, use_raw, layer, **kwargs)
    142 try:
--> 143     mod, basis_mod = basis.split(":")
    144 except ValueError:

ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[14], line 2
      1 with ir.get.airr_context(mdata, "v_call", chain=["VJ_1", "VDJ_1"]) as m:
----> 2     mu.pl.umap(mdata, color="airr:VJ_1_v_call")

File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273), in umap(mdata, **kwargs)
    267 def umap(mdata: MuData, **kwargs) -> Union[Axes, List[Axes], None]:
    268     """
    269     UMAP Scatter plot
    270 
    271     See :func:`muon.pl.embedding` for details.
    272     """
--> 273     return embedding(mdata, basis="umap", **kwargs)

File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145), in embedding(data, basis, color, use_raw, layer, **kwargs)
    143     mod, basis_mod = basis.split(":")
    144 except ValueError:
--> 145     raise ValueError(f"Basis {basis} is not present in the MuData object (.obsm)")
    147 if mod not in data.mod:
    148     raise ValueError(
    149         f"Modality {mod} is not present in the MuData object with modalities {', '.join(data.mod)}"
    150     )

ValueError: Basis umap is not present in the MuData object (.obsm)
with ir.get.airr_context(mdata["airr"], "v_call", chain=["VJ_1", "VDJ_1"]) as m:
    mu.pl.umap(mdata, color="airr:VJ_1_v_call")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:143), in embedding(data, basis, color, use_raw, layer, **kwargs)
    142 try:
--> 143     mod, basis_mod = basis.split(":")
    144 except ValueError:

ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[17], line 2
      1 with ir.get.airr_context(mdata["airr"], "v_call", chain=["VJ_1", "VDJ_1"]) as m:
----> 2     mu.pl.umap(mdata, color="airr:VJ_1_v_call")

File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:273), in umap(mdata, **kwargs)
    267 def umap(mdata: MuData, **kwargs) -> Union[Axes, List[Axes], None]:
    268     """
    269     UMAP Scatter plot
    270 
    271     See :func:`muon.pl.embedding` for details.
    272     """
--> 273     return embedding(mdata, basis="umap", **kwargs)

File [~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145](https://file+.vscode-resource.vscode-cdn.net/Users/uqztuong/Documents/Projects/dandelion/~/miniconda3/envs/dandelion/lib/python3.10/site-packages/muon/_core/plot.py:145), in embedding(data, basis, color, use_raw, layer, **kwargs)
    143     mod, basis_mod = basis.split(":")
    144 except ValueError:
--> 145     raise ValueError(f"Basis {basis} is not present in the MuData object (.obsm)")
    147 if mod not in data.mod:
    148     raise ValueError(
    149         f"Modality {mod} is not present in the MuData object with modalities {', '.join(data.mod)}"
    150     )

ValueError: Basis umap is not present in the MuData object (.obsm)

@grst
Copy link
Contributor Author

grst commented May 5, 2023

at the moment i'm wanting to access beyond VJ_1/2 and VDJ_1/2 in case there's more than 4 contigs - this is currently the main reason why.

fair enough. the scirpy.get.airr function is of course somewhat specific to the scirpy receptor model, so makes sense that dandelion has a different one.

still not working for mdata

mu.pl.umap looks for X_umap in the MuData object, not in the anndata objects. In general, whenever you use mdata.obs{m}, it will look in the "global" version of MuData. If you want to access a value from a modality, you need the airr: or gex: prefix.

So what should work here is either

mdata.obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]
mu.pl.umap(mdata, color="airr:VJ_1_v_call")

or

mdata["gex"].obsm["X_umap"] = mdata["gex"].obsm["X_umap_orig"]
mdata.update() # <-- not sure if this is even needed
mu.pl.embedding(mdata, basis="gex:X_umap", color="airr:VJ_1_v_call")

zktuong added a commit that referenced this issue May 9, 2023
ok preliminary commit to use scverse-airr structure as per #261

TODO:

- wherever *.data is mentioned in dandelion, this needs to handle the awkward arrays
- create a slot/function that stores the cell_id
- update the tutorials
- create a new to_scirpy/from_scirpy function that basically transfers the .obs, .obsm and .uns
- create a to/from_ak function so that the rearrangement data can still be repopulated as a pandas dataframe.
- would be cool if i can also make this dask compatible - only issue is dask.dataframe.DataFrame.compute is always slow..? need to learn more about dask. vaex is also cool but not really compatible for contig->cell level wrangling.
@zktuong
Copy link
Owner

zktuong commented May 9, 2023

OK i've made some preliminary progress over at (d5fb09f).

There's a list of TODOs that are probably quite trivial to work through - might get a few grad students to work on them over the next quarter.

I also had to deactivate the cell level (.obs equivalent) data generation because it takes a long time to load, compared to just operating from pandas dataframe natively.

@zktuong
Copy link
Owner

zktuong commented Jan 9, 2024

implemented in #342

@zktuong zktuong closed this as completed Jan 9, 2024
@grst
Copy link
Contributor Author

grst commented Jan 9, 2024

Ok, then I should probably change the scirpy functions io.from_dandelion and io.to_dandelion to be aliases of your functions once this is released?

@zktuong
Copy link
Owner

zktuong commented Jan 9, 2024

yes. perhaps @amoschoomy can do in the PR on scirpy's side and you can do a quick check?

@grst
Copy link
Contributor Author

grst commented Jan 9, 2024

that would be fab!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants