Older versions of anndata throw unintuitive errors when trying to read newer formats #698

ghost · 2022-02-01T15:56:33Z

Hi,

I wanted to get help with an error reading h5ad files created by anndata 0.8.0rc1. In my experience, h5ad files created with 0.8.0rc1 cannot be opened with older anndata versions.

How to reproduce

In an environment with 0.8.0rc1 installed:

import scanpy as sc
adata = sc.datasets.pbmc3k()
adata.write_h5ad("adata.0.8.h5ad")

In an environment with 0.7.* installed (tested with 0.7.6 and 0.7.8):

from anndata import read_h5ad
ad = read_h5ad("adata.0.8.h5ad")

You get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/opt/conda/envs/py37good/lib/python3.7/site-packages/anndata/_io/utils.py in func_wrapper(elem, *args, **kwargs)
    176         try:
--> 177             return func(elem, *args, **kwargs)
    178         except Exception as e:

/opt/conda/envs/py37good/lib/python3.7/site-packages/anndata/_io/h5ad.py in read_group(group)
    526     if encoding_type:
--> 527         EncodingVersions[encoding_type].check(
    528             group.name, group.attrs["encoding-version"]

/opt/conda/envs/py37good/lib/python3.7/enum.py in __getitem__(cls, name)
    356     def __getitem__(cls, name):
--> 357         return cls._member_map_[name]
    358 

KeyError: 'dict'

During handling of the above exception, another exception occurred:

AnnDataReadError                          Traceback (most recent call last)
~/tmp/ipykernel_17002/906833588.py in <module>
----> 1 ad = read_h5ad("adata.0.8.h5ad")

/opt/conda/envs/py37good/lib/python3.7/site-packages/anndata/_io/h5ad.py in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size)
    419                 d[k] = read_dataframe(f[k])
    420             else:  # Base case
--> 421                 d[k] = read_attribute(f[k])
    422 
    423         d["raw"] = _read_raw(f, as_sparse, rdasp)

/opt/conda/envs/py37good/lib/python3.7/functools.py in wrapper(*args, **kw)
    838                             '1 positional argument')
    839 
--> 840         return dispatch(args[0].__class__)(*args, **kw)
    841 
    842     funcname = getattr(func, '__name__', 'singledispatch function')

/opt/conda/envs/py37good/lib/python3.7/site-packages/anndata/_io/utils.py in func_wrapper(elem, *args, **kwargs)
    182                 parent = _get_parent(elem)
    183                 raise AnnDataReadError(
--> 184                     f"Above error raised while reading key {elem.name!r} of "
    185                     f"type {type(elem)} from {parent}."
    186                 )

AnnDataReadError: Above error raised while reading key '/layers' of type <class 'h5py._hl.group.Group'> from /.

I'm using h5py==3.6.0. Let me know if you need me to list anything else about my environment.
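
To see which element an older reader is tripping over, it can help to list the encoding metadata stored on each top-level element of the file. The sketch below assumes h5py is installed and that elements carry `encoding-type` and `encoding-version` attributes (the names the traceback above reads). The helper names `summarize_encodings` and `print_file_encodings` are hypothetical and not part of anndata.

```python
def summarize_encodings(group):
    """Map each child's name to its (encoding-type, encoding-version) attributes.

    `group` is anything shaped like an h5py.Group: `.items()` yields
    (name, child) pairs and each child has a dict-like `.attrs`.
    Missing attributes are reported as None.
    """
    summary = {}
    for name, child in group.items():
        attrs = child.attrs
        summary[name] = (attrs.get("encoding-type"), attrs.get("encoding-version"))
    return summary


def print_file_encodings(path):
    """Print the encoding metadata of every top-level element in an h5ad file."""
    import h5py  # imported here so summarize_encodings stays dependency-free

    with h5py.File(path, "r") as f:
        for name, (enc_type, enc_version) in sorted(summarize_encodings(f).items()):
            print(f"/{name}: encoding-type={enc_type!r}, encoding-version={enc_version!r}")
```

Running `print_file_encodings("adata.0.8.h5ad")` on the file from the steps above should show how each element is tagged. The `KeyError: 'dict'` in the traceback suggests that `/layers` carries an encoding type the 0.7 reader has no entry for.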

Thanks!

ivirshup · 2022-02-01T17:10:45Z

Hey, this is expected. What you're looking for would be forward compatibility.

Sometimes we update the format of AnnData objects stored on disk. We can't really make older versions of the library know how to deal with this. We've actually added some internal features in the new version which should make having some form of forward compatibility easier in the future (even if it's just writing older versions of the schema).

Is there a reason you'd need to keep using older versions of the library once this is released?

Worst case we could make another release in the 0.7.x series with smaller forward compatible changes, but I'd need to know it's needed first.

nspies-celsiustx · 2022-02-01T18:42:31Z

Hey, this is expected. What you're looking for would be forward compatibility.

Sometimes we update the format of AnnData objects stored on disk. We can't really make older versions of the library know how to deal with this. We've actually added some internal features in the new version which should make having some form of forward compatibility easier in the future (even if it's just writing older versions of the schema).

Is there a reason you'd need to keep using older versions of the library once this is released?

Worst case we could make another release in the 0.7.x series with smaller forward compatible changes, but I'd need to know it's needed first.

Thanks for your thoughtful response. This is indeed a big concern for us. We have a substantial amount of infrastructure that's using h5ads and we can't always upgrade everything in tandem. In addition, changes to the h5ad file format can break external tools, e.g. R code that is reading from these files using R hdf5 libraries.

I understand it's useful to make changes to the h5ad file format periodically to make it better, but I'd suggest a few things to make sure doing so doesn't break the whole ecosystem:

- Embed a file format version that would be surfaced during any reading errors -- it should be possible to warn users that they're using an outdated anndata version.
- Only make breaking changes in major version upgrades (I suppose 0.8 would be a major version).
- Carefully document any potentially breaking changes to the file format in the version notes. While the current version documentation indicates some file format changes, it's hard to see how the above error about layers relates to the version notes about file formats. I would explicitly state in the IO Specification section of the release notes that files written by anndata>=0.8.0 won't be readable by anndata<0.8.0. (Right now it says "Internal handling of IO has been overhauled." which suggests the file format is consistent while read/write logic has changed.)
- Write out a full document spec, e.g. what h5 slots have what in them (I know this is a heavy lift).

Moving forward, I'd recommend:

- Adding an explicit file format version to h5ad
- Cutting a 0.7.9 release that's backwards compatible but also capable of reading the version string and generating errors when new file formats are being read.
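
One way the proposed check could behave is sketched below: before dispatching on an element's encoding type, the reader compares it against the encodings it knows and, on a mismatch, raises an error that names the element and suggests upgrading. The registry contents, the `UnsupportedEncodingError` class, and the `check_encoding` function are hypothetical illustrations, not anndata's actual API.

```python
class UnsupportedEncodingError(Exception):
    """Raised when an element uses an encoding this reader does not know."""


# Hypothetical registry: encoding type -> highest encoding version understood.
KNOWN_ENCODINGS = {
    "dataframe": (0, 1, 0),
    "csr_matrix": (0, 1, 0),
}


def parse_version(text):
    """Turn '0.1.0' into (0, 1, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def check_encoding(name, encoding_type, encoding_version, known=KNOWN_ENCODINGS):
    """Raise a descriptive error if this reader cannot handle the element."""
    if encoding_type not in known:
        raise UnsupportedEncodingError(
            f"Element {name!r} has unknown encoding type {encoding_type!r}. "
            "The file was probably written by a newer anndata; try upgrading."
        )
    if parse_version(encoding_version) > known[encoding_type]:
        raise UnsupportedEncodingError(
            f"Element {name!r} uses {encoding_type!r} version {encoding_version}, "
            "which is newer than this reader supports; try upgrading anndata."
        )
```

With a check like this, the `/layers` case from the traceback would fail with a message that mentions the element and the unknown `'dict'` encoding type instead of a bare `KeyError`.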

cc @gdesmarais-ctx

ivirshup · 2022-02-02T17:53:39Z

Thanks for all the information!

A number of the issues you raise are actually topics we're trying to address right now (and this release provides some solutions for), but it's very useful to get feedback on our approach.

In addition, changes to the h5ad file format can break external tools, e.g. R code that is reading from these files using R hdf5 libraries.

Very aware of this. We're going for a fairly long release candidate cycle (1 month at least) to make sure downstream packages have time to fix compatibility, or at least pin dependencies or error gracefully.

Moving forward, we're looking at having a selected set of tools to run integration tests against, but this will take some time and resources.
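
For a downstream package, erroring gracefully could be as simple as wrapping the read call and re-raising with a hint about a possible version mismatch. This is a minimal sketch: `read_with_hint` and `H5adVersionHint` are hypothetical names, and `reader` stands in for whatever function the package uses to load files (for example `anndata.read_h5ad`).

```python
class H5adVersionHint(RuntimeError):
    """Read failure annotated with a hint about a possible version mismatch."""


def read_with_hint(reader, path, installed_version="unknown"):
    """Call `reader(path)`; on failure, re-raise with a version-mismatch hint.

    The original exception is chained, so the full traceback stays available.
    """
    try:
        return reader(path)
    except Exception as err:
        raise H5adVersionHint(
            f"Could not read {path!r} with anndata {installed_version}. "
            "If the file was written by a newer anndata release, "
            "upgrade anndata or recreate the file with a matching version."
        ) from err
```

A call might look like `read_with_hint(anndata.read_h5ad, "adata.0.8.h5ad", anndata.__version__)`. Catching every `Exception` is deliberately broad here, since the sketch cannot know which error type a given anndata version raises.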

Embed a file format version that would be surfaced during any reading errors -- it should be possible to warn users that they're using an outdated anndata version.

The file format version is something that's new this version!

How and when to warn users is an interesting issue though. This version throws a warning for very old anndata versions where we still have to "just know" how each element should be read in. But how old does a file need to be before we warn (and how loudly)?

We can be more explicit about this: #699

Write out a full document spec, e.g. what h5 slots have what in them (I know this is a heavy lift).

I am interested in having something more formal here. Possibly a bike shed schema?

At the moment we have the on-disk format page in the docs. This does have information about every current encoding type (and has been updated for this release), but I haven't figured out a good way to present past encoding information. Recommendations welcome!

Cutting a 0.7.9 release

Will look into this. Another thing that changed during this release cycle is that we set up a system for having feature and bug fix branches. So, there may be unforeseen difficulties doing anything other than a commit off the last release.

nspies-celsiustx · 2022-02-02T17:59:54Z

Thanks, @ivirshup for your detailed response!

I should have started with: I know you're mostly single-handedly holding down the fort on anndata and we greatly appreciate your continued development here.

I don't think we have a lot of spare capacity right now to help with implementation but would be happy to provide feedback on planning and PRs.

cc @ryan-williams

jeffquinn-msk · 2022-08-10T21:04:48Z

Hey there, I recently noticed this issue. It's unfortunate, and I think it will stand in the way of wider AnnData adoption if not properly addressed, which would be a shame. I really like the AnnData abstraction and I'd like to see it stick around. Let me know if there's anything I can do. I'm a software engineer and I have some bandwidth to help contribute.

brianraymor · 2022-08-10T22:54:29Z

@ivirshup - you wrote in an earlier comment:

The file format version is something that's new this version!

Other than introspecting on the encoding versions in the on-disk file format, is there a file format version that can be inspected for anndata 0.8? We want to be able to enforce anndata 0.8 for datasets being submitted to the cellxgene data portal in a future release.

flying-sheep · 2022-08-11T10:49:28Z

It’s addressed here: #734

@ivirshup, could you please respond to me there?

gtca · 2022-09-07T12:39:18Z

@ivirshup, one more consideration is adopting "encoder" and "encoder-version" that mudata has.

github-actions · 2023-06-21T02:23:15Z

This issue has been automatically marked as stale because it has not had recent activity.
Please add a comment if you want to keep the issue open. Thank you for your contributions!

ivirshup · 2023-06-21T15:24:10Z

I believe this has now been addressed for future versions of anndata through our encoding mechanism, so will close this.

flying-sheep · 2023-06-22T08:26:53Z

Not only future ones: #734 ended up in 0.9.0.

ivirshup changed the title ~~Older versions of anndata cannot read h5ads created by anndata==0.8.0rc1~~ Older versions of anndata throw unintuitive errors when trying to read newer formats Mar 18, 2022

ivirshup mentioned this issue Mar 18, 2022

Unintuitive error message for loading v0.8 h5ad with v0.7.8 anndata #739

Closed

ivirshup added topic: io topic: compatibility labels Mar 18, 2022

christinedien referenced this issue in settylab/atac_metacell_utilities Mar 22, 2022

downgrade anndata to 0.7.8 for compatibility

086820f

XUEbaogai0101 mentioned this issue Apr 10, 2022

Some troubles with Loading data aristoteleo/dynamo-release#332

Closed

gtca mentioned this issue May 24, 2022

Fix reticulate-anndata categoricals (HDF5.Group), AnnData Base.show i… scverse/Muon.jl#12

Closed

rimelof mentioned this issue Jul 18, 2022

Python 3.7 mexchy1000/CellDART#6

Closed

gtca mentioned this issue Jul 28, 2022

can't read mudata created with muon (python) PMBio/MuDataSeurat#9

Closed

brianraymor mentioned this issue Aug 9, 2022

cellxgene-schema must enforce the AnnData encoding version required by the schema chanzuckerberg/single-cell-curation#215

Closed

sjspielman mentioned this issue Aug 19, 2022

Re-opening post process script for H5 AlexsLemonade/sc-data-integration#102

Merged

AlexanderAivazidis mentioned this issue Dec 15, 2022

requirement of anndata = "0.7.5" makes it impossible to load anndata saved with newer version (e.g. 0.8.0) pinellolab/pyrovelocity#42

Closed

QiangShiPKU mentioned this issue Feb 7, 2023

error when input with h5ad ventolab/CellphoneDB#29

Closed

github-actions bot added the stale label Jun 21, 2023

ivirshup closed this as completed Jun 21, 2023
