
reimplemented incremental io #501

Open · LucaMarconato wants to merge 26 commits into main

Conversation

@LucaMarconato (Member) commented Mar 21, 2024

Closes #186
Closes #496
Closes #498

Support for incremental IO operations.

New features:

  • ability to save additional elements to disk after the SpatialData object is created (see the sketch after this list)
  • ability to remove previously saved elements from disk
  • ability to see which elements are present only in-memory and not in the Zarr store, and vice versa
  • refactored saving of metadata:
    • transformations
    • consolidated metadata
    • laid the groundwork (not implemented yet; e.g. empty tests and TODOs noting what's missing) for the remaining metadata: table.uns['spatialdata_attrs'], points.attrs['spatialdata_attrs'] and the OMERO metadata for image channel names
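
A minimal sketch of the workflow enabled by the first two bullets, based on the API discussed later in this thread (the element name and array are hypothetical):

import dask.array as da
from spatialdata import SpatialData
from spatialdata.models import Image2DModel

# Create an empty SpatialData object and back it with a Zarr store.
sdata = SpatialData()
sdata.write("sdata.zarr")

# Add an element after the object has been created and save it incrementally.
image = Image2DModel.parse(da.random.random((1, 100, 100), chunks=(1, 50, 50)))
sdata.images["my_image"] = image
sdata.write_element("my_image")

# Remove the previously saved element from the Zarr store.
sdata.delete_element_from_disk("my_image")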

Robustness:

  • refactored the write function to make it more robust
  • improved error messages for users, with actionable advice
  • new concept of "self-contained" SpatialData objects and "self-contained" elements, useful for users to understand the implications of file backing
  • added info on Dask-backed files for non-self-contained elements to __repr__()

Testing:

  • improved existing tests for IO
  • extensive testing for modular IO
  • improved testing for comparison of metadata after IO and after a deepcopy

Other:

This PR also lays the groundwork for (not implemented here) the ability to fully load Dask-backed objects into memory.

codecov bot commented Mar 22, 2024

Codecov Report

Attention: Patch coverage is 85.90078%, with 54 lines in your changes missing coverage. Please review.

Project coverage is 92.15%. Comparing base (b73f3a8) to head (582622f).

❗ Current head 582622f differs from the pull request's most recent head f2bea77. Consider uploading reports for the commit f2bea77 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #501      +/-   ##
==========================================
- Coverage   92.53%   92.15%   -0.38%     
==========================================
  Files          43       43              
  Lines        6003     6211     +208     
==========================================
+ Hits         5555     5724     +169     
- Misses        448      487      +39     
Files Coverage Δ
src/spatialdata/_core/_deepcopy.py 98.41% <100.00%> (+0.02%) ⬆️
src/spatialdata/_core/_elements.py 92.47% <100.00%> (+0.80%) ⬆️
src/spatialdata/_io/format.py 87.38% <100.00%> (ø)
src/spatialdata/_io/io_zarr.py 88.37% <100.00%> (ø)
src/spatialdata/dataloader/datasets.py 90.68% <ø> (ø)
src/spatialdata/models/__init__.py 100.00% <ø> (ø)
src/spatialdata/models/models.py 88.30% <100.00%> (+0.31%) ⬆️
src/spatialdata/testing.py 98.24% <87.50%> (-1.76%) ⬇️
src/spatialdata/models/_utils.py 91.30% <87.50%> (-0.30%) ⬇️
src/spatialdata/_io/_utils.py 88.52% <75.00%> (-2.33%) ⬇️
... and 2 more

... and 1 file with indirect coverage changes

@LucaMarconato LucaMarconato changed the title implemented incremental io; tests missing reimplemented incremental io Mar 22, 2024
@LucaMarconato (Member Author) commented Mar 23, 2024

@ArneDefauw @aeisenbarth tagging you because you each opened an issue at some point regarding incremental IO. In this PR incremental IO is implemented; happy to receive feedback in case you want to play around with it 😊

I will make a notebook to showcase it, but in short, to save an element (labels, table, etc.) you can use the new sdata.write_element('element_name'). If the element already exists in storage, an exception will be raised. You can work around the exception, for instance with the strategies shown here:

# workaround 1, mostly safe (untested for Windows platform, network drives, multi-threaded

Please note that these strategies are not guaranteed to work in all scenarios (multi-threaded applications, network storage, etc.), so please use them with care. A concrete sketch of the pattern is shown below.
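
To make the pattern concrete, here is a minimal sketch of the delete-then-rewrite approach used in the workarounds (the element name is hypothetical, and new_image is assumed to be an already parsed image model):

# Assumes `sdata` is backed by a Zarr store and already contains an
# element named "my_image" both in memory and on disk.

# Writing over an existing element raises an exception, so first delete
# the stale copies...
del sdata.images["my_image"]                # remove the in-memory element
sdata.delete_element_from_disk("my_image")  # remove it from the Zarr store

# ...then add the new version and write it.
sdata.images["my_image"] = new_image
sdata.write_element("my_image")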

@LucaMarconato (Member Author)

Currently the whole table needs to be replaced and the whole table needs to be kept in memory, but recent progress in anndata + Dask will also be used in spatialdata to allow lazy loading and the replacement of particular parts (such as adding a single obs column). This PR cleans up the previous code and is a step in that direction.

@kevinyamauchi (Collaborator) left a comment

Thanks @LucaMarconato! I left a review with some minor points below. I think it looks good, but I didn't have time for a super in-depth review. Given that this is a big change, I think the approval should be given by somebody who can look more closely.

@melonora (Collaborator)

Thanks @LucaMarconato! I left a review with some minor points below. I think it looks good, but I didn't have time for a super in-depth review. Given that this is a big change, I think the approval should be given by somebody who can look more closely.

Thanks for the review @kevinyamauchi. I will have a look at this PR later today as well.

@ArneDefauw commented Mar 25, 2024

@ArneDefauw @aeisenbarth tagging you because you each opened an issue at some point regarding incremental IO. In this PR incremental IO is implemented; happy to receive feedback in case you want to play around with it 😊

I will make a notebook to showcase it, but in short, to save an element (labels, table, etc.) you can use the new sdata.write_element('element_name'). If the element already exists in storage, an exception will be raised. You can work around the exception, for instance with the strategies shown here:

# workaround 1, mostly safe (untested for Windows platform, network drives, multi-threaded

Please note that these strategies are not guaranteed to work in all scenarios (multi-threaded applications, network storage, etc.), so please use them with care.

Thanks for the quick response and fix!

I've tested the incremental IO for my use case, and so far everything seems to work as expected, except for one thing. If I follow the approach suggested here:

# workaround 1, mostly safe (untested for Windows platform, network drives, multi-threaded

I get a ValueError when I load my SpatialData object back from the Zarr store and try to overwrite it:
ValueError: The file path specified is a parent directory of one or more files used for backing for one or more elements in the SpatialData object. Deleting the data would corrupt the SpatialData object.

The fix was to first delete the attribute from the SpatialData object, and then remove the element on disk. Below is a minimal example of a typical workflow in my image processing pipelines:

import os

import dask.array as da
import spatialdata
from spatialdata import SpatialData, read_zarr

path = "."  # directory where the Zarr store will be created
img_layer = "test_image"

sdata = SpatialData()
sdata.write(os.path.join(path, "sdata.zarr"))

dummy_array = da.random.random(size=(1, 10000, 10000), chunks=(1, 1000, 1000))

se = spatialdata.models.Image2DModel.parse(data=dummy_array)

sdata.images[img_layer] = se

if sdata.is_backed():
    sdata.write_element(img_layer, overwrite=True)

# need to read back from the Zarr store, otherwise the graph in the in-memory
# sdata would not be executed
sdata = read_zarr(sdata.path)

# now overwrite; here I needed to first delete the in-memory attribute
element_type = sdata._element_type_from_element_name(img_layer)
del getattr(sdata, element_type)[img_layer]
# then delete the element on disk
if sdata.is_backed():
    sdata.delete_element_from_disk(img_layer)

sdata.images[img_layer] = se

if sdata.is_backed():
    sdata.write_element(img_layer, overwrite=True)

sdata = read_zarr(sdata.path)

I think what the unit test you referred to lacks is reading back from the Zarr store after an element has been written to it.

In version 0.0.15 of SpatialData, when sdata.add_image(...) was executed, it was not necessary to read back from the Zarr store. I understand that the current implementation allows for more control, but the in-place update of the SpatialData object was quite convenient.

Edit:
I added a pull request, to illustrate the issue a little bit more: #515

@LucaMarconato (Member Author)

Thank you @ArneDefauw for trying the code and for the explanation, I will now look into your PR.

In version 0.0.15 of SpatialData, when sdata.add_image(...) was executed, it was not necessary to read back from the Zarr store. I understand that the current implementation allows for more control, but the in-place update of the SpatialData object was quite convenient.

The reason we refactored this part is that with add_image(), if the user had an in-memory image and wrote it to disk, the image would then be immediately lazy-loaded. This is ergonomic if the image only needs to be written once, but if the user tried to write the image again (for instance in a notebook, where a cell may get manually executed twice), it would have led to an error.

@LucaMarconato (Member Author) commented Mar 27, 2024

Thanks for the reviews. I addressed the points from @kevinyamauchi and from @ArneDefauw (in particular, I merged his PR here). @giovp, when you have time, could you also please give this a pass?

@giovp (Member) left a comment

minor things


@pytest.mark.parametrize("dask_backed", [True, False])
@pytest.mark.parametrize("workaround", [1, 2])
def test_incremental_io_on_disk(

this should 💯 be a tutorial

A SpatialData object is said to be self-contained if all its SpatialElements or AnnData tables are
self-contained. A SpatialElement or AnnData table is said to be self-contained when it does not depend on a
Dask computational graph (i.e. it is not "lazy") or when it is Dask-backed and each file that is read in the
Dask computational graph is contained within the Zarr store associated with the SpatialElement.

could this description of self-containedness also be included in elements_are_self_contained above?

"different location. "
)
WORKAROUND = (
"\nWorkaround: please see discussion here https://github.com/scverse/spatialdata/discussions/520."

this link is dead, I think because of the trailing period.

Comment on lines +1289 to +1292
raise ValueError(
f"Element {element_name} is found in the Zarr store as a {disk_element_type}, but it is found "
f"in-memory as a {element_type}. The in-memory object should have a different name."
)

I think this is a bit tricky: if a user sees this error message, what should they do?

@ArneDefauw commented Mar 29, 2024

Thank you @ArneDefauw for trying the code and for the explanation, I will now look into your PR.

In version 0.0.15 of SpatialData, when sdata.add_image(...) was executed, it was not necessary to read back from the Zarr store. I understand that the current implementation allows for more control, but the in-place update of the SpatialData object was quite convenient.

The reason we refactored this part is that with add_image(), if the user had an in-memory image and wrote it to disk, the image would then be immediately lazy-loaded. This is ergonomic if the image only needs to be written once, but if the user tried to write the image again (for instance in a notebook, where a cell may get manually executed twice), it would have led to an error.

Hi @LucaMarconato,
thanks for the reply and the fixes!
I've tested your suggestions (test_incremental_io_on_disk) for my use cases and everything seems to work fine (for images, labels, points, and shapes).

Workaround 1 looks rather safe in most scenarios. If I understand correctly, it covers the following scenario:
having "x" in sdata, doing something with "x" (i.e. defining a Dask graph), and then writing back to "x".

How I would usually work is:
having "x" in sdata, doing something with "x", and writing to "y" (where "y" already exists).

The latter feels less dangerous and looks pretty standard in image processing pipelines, e.g. when tuning hyperparameters for image cleaning or segmentation.

In the latter case, I guess the following would be sufficient:


arr=sdata["x"].data
arr=arr*2
spatial_element=spatialdata.models.Image2DModel.parse(
                arr,
            )
del sdata["y"]
sdata.delete_element_from_disk("y")
sdata["y"]=spatial_element
sdata.write_element("y")
sdata=read_zarr( sdata.path )

@LucaMarconato (Member Author)

Yes, I agree that the approach you described is generally good practice when processing data, and safe, since the original data is not modified.

The use cases I described are instead for when the data itself needs to be replaced. I will add a note in the comments that this approach should be avoided when possible, and clarify that the workarounds I described should only be used if really needed.

@namsaraeva

Thank you for this PR, I am using it right now. One question: would it be possible to pass a list of strings to write_element() instead of just one element? @LucaMarconato

@LucaMarconato (Member Author)

@namsaraeva thanks for the suggestion, it's indeed handier to pass a list of names. I have added support for this to write_element() and delete_element_from_disk(), as sketched below.
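
A minimal sketch of the list-based calls (the element names are hypothetical):

# Save several elements in one call instead of looping over write_element().
sdata.write_element(["image1", "labels1", "table"])

# Likewise, delete several previously saved elements from the Zarr store.
sdata.delete_element_from_disk(["image1", "labels1"])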

@melonora (Collaborator)

Personally, I don't see any blockers for this PR at the moment.
