
issues with dask arrays returned by nd2: depend on ND2File being open, can't be pickled #19

Closed
VolkerH opened this issue Nov 10, 2021 · 17 comments


VolkerH commented Nov 10, 2021

Hi Talley,

a couple of observations regarding the dask arrays returned by nd2. None of this is terribly surprising (so you are probably aware of it), and I am not expecting it can really be fixed, but a few lines in the README could save users some headaches.

keep ND2File object in open() state when working with dask array

For my use case I created a little helper function (pseudo-code):

import nd2

def provide_tiles_as_daskarray_from_nd2(filepath):
    with nd2.ND2File(filepath) as f:
        arr = f.to_dask()
        # do a few more things on the array: cropping, sorting dimensions
        return arr  # note: f is closed as soon as the with-block is left

I have similar functions for other file types, e.g. something like provide_tiles_as_daskarray_from_folder_of_imgs(filepath).

At first sight, this works. I used this function from a Jupyter notebook and it seemed to return a working dask array, until you do anything that forces a compute(), at which point the Python kernel crashes without a traceback.
The issue is that f is closed when returning out of the context manager. The same thing happens if you don't use a context manager and call f.close() before returning. Again, not terribly surprising. When running under the VS Code debugger I actually got a traceback and saw that there is a segmentation fault when the nd2 library tries to read data.

My workaround is to not close f, which is a bit dirty as open file handles will be left hanging around, possibly causing memory leaks.
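
To illustrate the failure mode (a sketch only; the file name is a placeholder):

arr = provide_tiles_as_daskarray_from_nd2("tiles.nd2")
arr.shape         # metadata still looks fine
arr[0].compute()  # forces a read through the already-closed ND2File -> segfault / dead kernel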

cloudpickle

The dask arrays returned by nd2 can't be serialized using cloudpickle, because the underlying ND2File object is not pickleable:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-8a76360da592> in <module>
----> 1 cloudpickle.dump(array,f)

~/miniconda3/envs/napari_latest/lib/python3.9/site-packages/cloudpickle/cloudpickle_fast.py in dump(obj, file, protocol, buffer_callback)
     53         compatibility with older versions of Python.
     54         """
---> 55         CloudPickler(
     56             file, protocol=protocol, buffer_callback=buffer_callback
     57         ).dump(obj)

~/miniconda3/envs/napari_latest/lib/python3.9/site-packages/cloudpickle/cloudpickle_fast.py in dump(self, obj)
    561     def dump(self, obj):
    562         try:
--> 563             return Pickler.dump(self, obj)
    564         except RuntimeError as e:
    565             if "recursion" in e.args[0]:

~/miniconda3/envs/napari_latest/lib/python3.9/site-packages/nd2/_sdk/latest.cpython-39-x86_64-linux-gnu.so in nd2._sdk.latest.ND2Reader.__reduce_cython__()

TypeError: no default __reduce__ due to non-trivial __cinit__

Again, not terribly surprising. However, the implication is (I haven't tried it, but am fairly certain this is the case) that the dask arrays returned by nd2 cannot be used in a dask.distributed context, where the pickled array needs to be sent to the various worker processes.

A workaround for both would be to create a temporary .zarr file (pseudo-code):

import dask.array
import nd2

def provide_tiles_as_daskarray_from_nd2(filepath):
    with nd2.ND2File(filepath) as f:
        arr = f.to_dask()
        # do a few more things on the array: cropping, sorting dimensions
        arr.to_zarr("_tmpfile.zarr")  # materialize while the file is still open
    return dask.array.from_zarr("_tmpfile.zarr")

with the obvious drawback of duplicating the required storage space.

I don't know whether there is much that can be done about this, but it may be useful to add a note of caution to the README.


VolkerH commented Nov 10, 2021

Actually, I just looked a bit into un/pickling and it seems that adding these two methods to the ND2File class solves the pickling problem:

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_rdr"]  # the underlying reader object cannot be pickled
        return state

    def __setstate__(self, d):
        self.__dict__ = d
        self._rdr = get_reader(self._path)  # recreate the reader on unpickle
        if not self._closed:
            self.open()

I can now pickle the dask arrays and when I unpickle them I can run compute().
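
A quick round-trip check (a sketch, assuming the two methods above are added to ND2File; the file name is a placeholder):

import pickle
import nd2

f = nd2.ND2File("tiles.nd2")
arr = f.to_dask()

restored = pickle.loads(pickle.dumps(arr))  # used to raise the TypeError shown above
restored[0].compute()                       # works after unpickling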

I will submit a PR.

tlambert03 (Owner) commented:

keep ND2File object in open() state when working with dask array

yeah, this is a tricky one... I actually wrote a wrapper over at aicsimageio for just this purpose:
https://github.com/AllenCellModeling/aicsimageio/blob/main/aicsimageio/utils/dask_proxy.py

could use it here as well?

Actually, I just looked a bit into un/pickling and it seems that adding these two methods to the ND2File class solves the pickling problem:

awesome, thank you!


VolkerH commented Nov 10, 2021

yeah, this is a tricky one... I actually wrote a wrapper over at aicsimageio for just this purpose: https://github.com/AllenCellModeling/aicsimageio/blob/main/aicsimageio/utils/dask_proxy.py

could use it here as well?

Ugh. I would have to do a bit more reading to fully understand how that proxy works, so I'm not sure whether this would work here. However, as aicsimageio also incorporates nd2, maybe I should just use aicsimageio instead of nd2 (I'd prefer not to have that dependency, though).

tlambert03 (Owner) commented:

the gist of that proxy is essentially these two methods:

    def compute(self, **kwargs: Any) -> np.ndarray:
        with self.__wrapped__._ctx_:
            return self.__wrapped__.compute(**kwargs)

    def __array__(self, dtype: str = None, **kwargs: Any) -> np.ndarray:
        with self.__wrapped__._ctx_:
            return self.__wrapped__.__array__(dtype, **kwargs)

that is, it's a dask array that, whenever you try to call compute or np.asarray on it, will re-open the underlying file (with self.__wrapped__._ctx_ is essentially just with ND2File(...):).
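
For reference, a minimal sketch of such a proxy (assuming a wrapt.ObjectProxy-style wrapper; the names here are illustrative, not the actual aicsimageio API):

from typing import Any

import numpy as np
from wrapt import ObjectProxy


class DaskArrayProxy(ObjectProxy):
    """A dask array wrapper that re-opens the backing file for any compute."""

    def __init__(self, wrapped, ctx) -> None:
        super().__init__(wrapped)
        # attributes stored on the proxy itself must be prefixed with _self_
        self._self_ctx = ctx  # a re-enterable context manager, e.g. the ND2File

    def compute(self, **kwargs: Any) -> np.ndarray:
        with self._self_ctx:
            return self.__wrapped__.compute(**kwargs)

    def __array__(self, dtype=None, **kwargs: Any) -> np.ndarray:
        with self._self_ctx:
            return self.__wrapped__.__array__(dtype, **kwargs)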

It looks a tad bit risky at first, but I haven't run into any issues with it yet. In any case, I suspect the issue of trying to use a dask array after closing the file is far more common than whatever hidden issues there are with this proxy. I'm inclined to try it.


VolkerH commented Nov 10, 2021

It looks a tad bit risky at first, but I haven't run into any issues with it yet. In any case, I suspect the issue of trying to use a dask array after closing the file is far more common than whatever hidden issues there are with this proxy.

I agree.
Also, if you do try it, you currently just get a crash, most likely without any usable error message.

So if I understand you correctly, the idea would be that ND2File.to_dask() returns such an ObjectProxy instead of the dask.array?

tlambert03 (Owner) commented:

So if I understand you correctly, the idea would be that ND2File.to_dask() returns such an ObjectProxy instead of the dask.array?

yeah, exactly.

tlambert03 (Owner) commented:

I think #27 and #26 have improved the situation here. Closing until we get another reproducible bug.

I tried on my Ubuntu box to reproduce the segfault you showed, but am still failing (even running it a few hundred times in case it's a race condition)... but let's keep an eye out for it.


VolkerH commented Nov 16, 2021

Hi @tlambert03,

I didn't get around to it yesterday, but I installed the latest version, including #27, from the main branch just now.
Unfortunately, on my machine this has not been an improvement: the segfault when returning the dask array from the context manager still persists.

In addition, I now also get dying kernels (probably a segfault, but I am running this in a notebook) for some more complex computations on the dask array that previously worked, as long as I avoided returning from inside a context manager.

I will try to provide a reproducible example, but probably won't get around to it today.
For the time being I'm holding on to an older state of the repo (VolkerH@294e07a), where everything except returning from the context manager works for me.

tlambert03 reopened this Nov 16, 2021
tlambert03 (Owner) commented:

Ugh 😔 I'm sorry, this is a really frustrating one.

I can't get it to segfault on any of my computers (tried mac, win, ubuntu20)... maybe it's a race condition that depends on dataset size?

It's certainly an issue though, since it looks like aicsimageio is also having CI issues with the new release. Will keep digging... but let me know if you hit on anything (and send files if necessary)



VolkerH commented Nov 26, 2021

OK, I started debugging this. The problem appears to be that pickling doesn't work correctly if there are additional operations in the compute graph. I don't have a stand-alone example yet, but maybe the following already gives you enough information:

I have the following function that loads an nd2 file and re-sorts the dimensions (using np.einsum) before returning it:

from os import PathLike
from pathlib import Path

import dask.array as da
from nd2 import ND2File


def load_tiles_nd2(file_path: PathLike) -> da.Array:
    file_path = Path(file_path)
    if file_path.suffix.lower() != ".nd2":
        raise ValueError("Can only read nd2 files.")
    f = ND2File(file_path)
    tile_array = f.to_dask()
    dim_order_tuple = f.sizes.keys()
    dim_order = "".join(dim_order_tuple).lower()
    # to_tczyx (defined elsewhere) is basically np.einsum(f"{dim_order}->pcyx")
    ordered_tile_array = to_tczyx(tile_array, in_order=dim_order, out_order="pcyx")
    return ordered_tile_array

So when I use this as in

arr = load_tiles_nd2("samplefile.nd2")
arr[..., :10].compute()

there is no problem.

However, when I do

tmp = load_tiles_nd2("samplefile.nd2")
arr = pickle.loads(pickle.dumps(tmp))
arr[..., :10].compute()

I get a segmentation fault.

When I look at the dask graph of the ResourceBackedArray that hasn't been pickled, it looks like this:

[screenshot: dask task graph of the array before pickling]

In contrast, the dask task graph for the ResourceBackedArray that has been pickled and regenerated by unpickling looks like this:

[screenshot: dask task graph of the array after a pickle/unpickle round trip]

So the layers 0 to 2 lose important attributes such as .shape and .chunksize when pickling/unpickling.
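
One way to poke at this without the notebook graph view is to compare the two arrays directly (a sketch; samplefile.nd2 is a placeholder and load_tiles_nd2 is the function above):

import pickle

tmp = load_tiles_nd2("samplefile.nd2")
roundtripped = pickle.loads(pickle.dumps(tmp))

# what survives the round trip on the array itself
print(tmp.shape, tmp.chunksize)
print(roundtripped.shape, roundtripped.chunksize)

# and the names of the graph layers on both sides
print(list(tmp.dask.layers))
print(list(roundtripped.dask.layers))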


VolkerH commented Nov 26, 2021

Argh. Now I can't reliably reproduce the segfault with a simple [...,:10].compute() anymore; I can only reproduce it with a more complicated map_blocks call. I will try to narrow it down further, but maybe the differences in the compute graph layers are relevant.


VolkerH commented Nov 26, 2021

So the segfault seems to depend on the slicing (a minimal sketch follows the list):

  • tile_arr[..., :10].compute() works. This slice covers multiple source chunks, if I am not mistaken.
  • tile_arr[0, 0, ...].compute() causes a segfault; this slice only covers a single source chunk.
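
For reference, the two cases side by side (a sketch following the pickle round trip from the comments above; samplefile.nd2 is a placeholder):

import pickle

tile_arr = load_tiles_nd2("samplefile.nd2")
tile_arr = pickle.loads(pickle.dumps(tile_arr))

tile_arr[..., :10].compute()   # spans multiple source chunks: works
tile_arr[0, 0, ...].compute()  # touches a single source chunk: segfaults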

tlambert03 (Owner) commented:

Thanks for digging!
By the way, I got a good response on SO about the general problem of adding tasks to a dask array graph, if you're curious; I might try it here:

https://stackoverflow.com/a/70012999


VolkerH commented Nov 26, 2021

Thanks, interesting; I didn't even think about graph optimization steps.
I sent you a screen recording of a debug session in action through the napari zulip chat. Unfortunately I can't attach .webm files to GitHub.


tlambert03 commented Jun 6, 2022

I think this has been solved over the iterations. Currently:

In [1]: import nd2

In [2]: f = nd2.ND2File('tests/data/dims_p1z5t3c2y32x32.nd2')

In [3]: d = f.to_dask()

In [4]: f.close()

In [5]: import cloudpickle

In [6]: p = cloudpickle.dumps(d)  # no problem
