async zarr #1104

Closed · martindurant opened this issue Aug 2, 2022 · 9 comments
@martindurant (Member) commented Aug 2, 2022

A quick sketch of how we can couple zarr with async code. This is aimed slightly at pyscript, but can be useful in its own right: for instance, I asked a while ago what it would take to fetch chunks concurrently not just from one array but, say, one chunk each from multiple arrays in a dataset.
This sketch is only for reading...

Outline:

  • we subclass Group, so that __getitem__ produces an AsyncArray
  • we subclass Array as AsyncArray and override everything from _get_selection (which calls IO) up to __getitem__ (the user-facing API)
  • we have three stores:
    • a synchronous HTTP one for the dataset metadata. This can be based on requests for standard python or pyfetch under pyodide. Note that sync calls in pyodide are limited to text, which is perfect for this use case.
    • a fake synchronous store which merely records the paths that are attempted, but reports FileNotFound for all of them
    • a fake synchronous store in which we have prefilled all the keys it will ever need, i.e., this can be a simple dict
  • The flow goes as follows (see the sketch after this list):
    • A zarr AsyncGroup is made by reading JSON files synchronously
    • When we attempt to get data, we make a coroutine in which we first use the fake store and zarr's existing machinery to record all the keys that will be needed (this will temporarily make an array of NaN); then we fetch all these keys concurrently; finally we populate a dict and have the existing zarr machinery read from the dict
  • For interest, there is an fsspec async filesystem for pyodide; we don't need anything that verbose for zarr.
  • Note that in the browser, no fetches can ever be done without considering CORS, but any dataset known to work with zarr.js will work for this case too.
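
Here is a minimal sketch of that two-pass flow against the zarr v2 API. `KeyRecordingStore` and `async_getitem` are illustrative names (not part of zarr-python), and `fetch` stands in for whatever coroutine returns the bytes for a key (pyfetch under pyodide, aiohttp elsewhere):

```python
import asyncio
import zarr


class KeyRecordingStore(dict):
    """Holds the already-fetched metadata; records every chunk key
    that is requested and reports it as missing."""

    def __init__(self, meta):
        super().__init__(meta)  # e.g. {".zarray": b"...", ".zattrs": b"..."}
        self.requested = []

    def __getitem__(self, key):
        if key in self.keys():  # metadata entries are present
            return super().__getitem__(key)
        self.requested.append(key)
        raise KeyError(key)  # zarr fills the output with fill_value


async def async_getitem(meta, fetch, selection):
    # Pass 1: dry run; zarr walks the selection, we only keep the keys.
    recorder = KeyRecordingStore(meta)
    zarr.Array(recorder)[selection]  # result (all fill_value) is discarded
    # Fetch every needed chunk concurrently.
    blobs = await asyncio.gather(*(fetch(key) for key in recorder.requested))
    # Pass 2: a plain dict with metadata plus chunk bytes; zarr's normal
    # machinery now finds everything it needs locally.
    filled = {**meta, **dict(zip(recorder.requested, blobs))}
    return zarr.Array(filled)[selection]
```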
@martindurant (Member Author) commented:

(I'm not sure this needs any specs discussion: it provides an alternative user API, but doesn't actually change what zarr does or what the metadata looks like.)

@jbms commented Aug 2, 2022

I think this might be more appropriate to discuss under the zarr-python repository, since it is just about the zarr-python API.

You might find it interesting to look at the tensorstore python API for ideas, as tensorstore provides an async API.
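
For reference, tensorstore read operations return Futures that can be consumed either synchronously or with await. A rough sketch (the URL and array shape are made up for illustration):

```python
import tensorstore as ts

# Open a zarr array over HTTP; ts.open returns a tensorstore Future.
dataset = ts.open({
    "driver": "zarr",
    "kvstore": {"driver": "http", "base_url": "https://example.com/data.zarr"},
}).result()

# .read() also returns a Future: block on it from sync code...
data = dataset[:100, :100].read().result()

# ...or await the very same kind of Future from async code.
async def load():
    return await dataset[:100, :100].read()
```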

An alternative to consider would be to just address the limitation in pyscript directly:

I think with the help of a separate webworker thread it is possible to emulate sync fetch requests.

I think there are also options for compiling c code to "async" web assembly, though that does hurt performance.

@martindurant (Member Author) commented:

I am happy to have this in zarr-python instead.

How do you imagine using the tensorstore model? The problem we are facing is being forced to call zarr's synchronous code in an async context, so adding another Futures abstraction sounds like even more complexity.

> I think with the help of a separate webworker thread it is possible to emulate sync fetch requests.

In general, getting IO to work well in pyscript is an unsolved problem, and webworkers-as-threads might be part of the solution. Certainly, that's the only way the browser allows binary sync connections. To be sure, though: we do not want sync requests, paying the latency cost for every single chunk.

> I think there are also options for compiling c code to "async" web assembly, though that does hurt performance.

We are stuck with the sync python API being called from an async context, so this is a python programming problem. Anything lower level will not help us.

As I said at the start though, pyscript is not the only reason to want this.

@jbms commented Aug 2, 2022

> I am happy to have this in zarr-python instead.
>
> How do you imagine using the tensorstore model? The problem we are facing is being forced to call zarr's synchronous code in an async context, so adding another Futures abstraction sounds like even more complexity.

I think from an API perspective futures are the most natural choice.

You can always create an async API on top of a sync API using a thread pool. In general it might be best to gradually add async APIs to zarr-python from the top down, using thread pools as needed to convert lower-level components from sync to async. Ultimately we would want to add async store implementations so that there are no sync I/O components left. The codecs are pure computation and don't need to be converted --- they would just always require a thread pool.
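
As a sketch of what that top-down bridging could look like (names are illustrative, and asyncio.to_thread needs Python 3.9+):

```python
import asyncio
import zarr

async def read_async(array: zarr.Array, selection):
    # Run the blocking __getitem__ (sync I/O plus codecs) in a worker
    # thread, so that several reads can be in flight at once.
    return await asyncio.to_thread(array.__getitem__, selection)

async def read_many(pairs):
    # e.g. one selection from each of several arrays, concurrently
    return await asyncio.gather(*(read_async(a, s) for a, s in pairs))
```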

> > I think with the help of a separate webworker thread it is possible to emulate sync fetch requests.
>
> In general, getting IO to work well in pyscript is an unsolved problem, and webworkers-as-threads might be part of the solution. Certainly, that's the only way the browser allows binary sync connections. To be sure, though: we do not want sync requests, paying the latency cost for every single chunk.
>
> > I think there are also options for compiling c code to "async" web assembly, though that does hurt performance.
>
> We are stuck with the sync python API being called from an async context, so this is a python programming problem. Anything lower level will not help us.

My understanding is that pyscript is built by compiling cpython to webassembly, where sync Python corresponds to sync webassembly/javascript. I was proposing that it could instead be compiled such that sync Python corresponds to async JavaScript. Thinking about it more, though, I realize that in addition to being a major re-architecting of pyscript, it would also come with major restrictions on "re-entering" python during other operations, and therefore wouldn't really be practical.

> As I said at the start though, pyscript is not the only reason to want this.

@martindurant (Member Author) commented:

> You can always create an async API on top of a sync API using a thread pool. In general it might be best to gradually add async APIs to zarr-python from the top down, using thread pools as needed to convert lower-level components from sync to async.

The fsspec store and indeed the JS HTTP fetch methods are async, so we already have this at the bottom of the stack. Making the compute part "concurrent" isn't useful; it's the IO that matters. Are you advocating a completely async alternate codepath all the way through zarr? I am trying to make use of zarr's simplicity to implement something that works quickly, without changing the core.
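
For concreteness, the async machinery is already there in fsspec: one call can gather many keys concurrently. A sketch (the URLs are made up):

```python
import fsspec

async def fetch_chunks():
    fs = fsspec.filesystem("http", asynchronous=True)
    session = await fs.set_session()  # required in asynchronous mode
    # _cat takes a list of paths and fetches them concurrently,
    # returning a dict of {url: bytes}
    blobs = await fs._cat([
        "https://example.com/data.zarr/x/0.0",
        "https://example.com/data.zarr/x/0.1",
    ])
    await session.close()
    return blobs
```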

Question to everyone: what does the zarr.js API look like; is there any async there? I would assume there must be.

@joshmoore transferred this issue from zarr-developers/zarr-specs Aug 2, 2022
@jakirkham (Member) commented:

AIUI, PR #534 was exploring the concurrent.futures.Executor approach.

@joshmoore (Member) commented:

@martindurant (Member Author) commented:

I have started writing a blog post about my implementation; it might be out this afternoon. It won't say anything new to people already on this thread, but might attract more general interest. Specifically for pyodide/pyscript, I think it's still fair to say that the IO story is very far from solved for typical pydata libraries.

@jhamman (Member) commented Dec 7, 2023

This was a great discussion. Pointing folks to the continuation of this idea slated for v3: #1583
