Low-latency storage backends #237

tomwhite · 2023-06-28T10:41:55Z

So far we've only used cloud storage (S3 and GCS) for storing intermediate Zarr data in Cubed. It would be interesting to try other storage backends that have lower latency.

Here are some examples, I'm sure there are more:

Note that Google Cloud Memorystore is not serverless, so you'd need to start a cluster before running your computation, and shut it down afterwards.

We'd also want to be more aggressive in deleting intermediate data that is no longer needed in Cubed. Currently we just let it get deleted automatically in the background, well after the computation has completed. Low-latency storage has smaller capacity and is a lot more expensive, so we'd want to delete data that is no longer a direct dependency of any node in the DAG immediately as the computation progresses.

cc @TomNicholas @rabernat

TomNicholas · 2023-06-28T14:22:09Z

This would be very interesting.

Low-latency storage has smaller capacity

I would be concerned about the maximum size of the low-latency storage limiting scaling. For example I noticed:

Amazon EFS supports up to 25,000 connections per file system.

With 100MB chunks, this means the maximum size Zarr array you could process in one step before losing parallelism would be 2.5TB.

tomwhite · 2023-07-01T11:28:00Z

Jacob Tomlinson on twitter suggested momento, Dragonfly, and ElastiCache as possibilities too.

tomwhite · 2023-08-22T14:47:38Z

Also relevant to this issue is the work going on in Zarr to improve performance.

Discussion thread: Zarr-Python → Benchmarking and Performance zarr-developers/zarr-python#1479
Cubed use case (by @TomNicholas): Please share your use-cases for Zarr (to help inform benchmarking) zarr-developers/zarr-python#1486 (comment)

hammer · 2023-12-03T15:27:32Z

Amazon S3 Express One Zone might have solved this problem.

TomNicholas · 2023-12-04T03:38:21Z

Amazon S3 Express One Zone might have solved this problem.

This looks very interesting to me! It would be great to test it.

Some clarifying questions:

The blog post talks about how this new storage class brings the most performance benefits for smaller objects, as latency is a higher fraction of the total time to download a small object. Would our ~100MB chunks be considered "small" in this context?

combined with request costs that are 50% lower than for the S3 Standard storage class

Is this sentence from the post referring to some monetary cost or to computational cost?
The post mentions co-locating this storage with various compute services, but AWS lambda isn't explicitly listed. Do we know if we can use this with serverless compute? EDIT: AWS Lambda is explicitly listed as compatible on this docs page.
Is fsspec + s3fs + Zarr likely to "just work" with this new storage class? It says it implements most of the same S3 API...

tomwhite · 2023-12-07T16:24:50Z

Is fsspec + s3fs + Zarr likely to "just work" with this new storage class? It says it implements most of the same S3 API...

Not yet, by the look of it. I had a quick dig and found that fsspec uses aiobotocore, which in turn depends on botocore. The latest version of aiobotocore is pinned to botocore < 1.33.2, and support for Amazon S3 Express One Zone was added in botocore 1.33.2.

Also, quoting from aio-libs/aiobotocore#1057 (comment):

botocore>=1.33.2 includes many new codepaths and will likely be a major piece of work for us.

hammer · 2023-12-07T17:06:25Z

Yikes! Reading a bit more to see why botocore itself doesn't support async io, in boto/botocore#458 I was amused to find a comment from scanpy's Philipp Angerer. We're all hitting the same issues!

I went ahead and filed aio-libs/aiobotocore#1065 to have a place to track the S3 Express One Zone support work in aiobotocore.

rabernat · 2023-12-07T17:21:56Z

Let's just create python bindings for Rust's object store crate and start using that instead. Looks very powerful and performant.

tomwhite · 2023-12-07T17:45:25Z

Let's just create python bindings for Rust's object store crate and start using that instead. Looks very powerful and performant.

More performance would be welcome! It doesn't support S3 Express One Zone yet either: apache/arrow-rs#5140

hammer · 2023-12-07T18:07:59Z

Let's just create python bindings for Rust's object store crate and start using that instead. Looks very powerful and performant.

Ha, was just looking at that: fastlmm/bed-reader#22. There is in fact a Python interface already: https://github.com/roeap/object-store-python. It's a bit stale.

I have been thinking for a while that the fsspec abstraction is too broad and deep (e.g. caching) and most of our use cases would be better served by a more focused library. Perhaps this could be refactored out of fsspec, or perhaps built on top of another library...

martindurant · 2023-12-07T19:01:56Z

the fsspec abstraction is too broad and deep (e.g. caching)

To defend fsspec: what do you think could possibly be getting in the way? Methods like fs.cat([list of paths]) are very minimal and all that zarr actually needs. Caching and such are totally optional. It would be better if you can affect optimisations there, perhaps hooking into a rust implementation if you like.

Does object-store-python support anything async?

I again refer readers to rfsspec: rust/tokio does not magically make things much faster, it may help a little.

mrocklin · 2023-12-07T19:28:40Z

A month ago I was doing comparative benchmarks between Spark/Dask/DuckDB/Polars on cloud data. My observations were that, as long as the projects don't do anything dumb (a big assumption) the game was entirely about optimizing S3 access. I starting pushing on Arrow timings (see apache/arrow#38389) and found that it has much less to do with the language that is used, but rather due to how gently one handles S3. What I found at the time was that S3 liked ...

Reads of 5-15 MB in size
Plenty of concurrency (2x-3x threads than cores)

On an m6i.xlarge (4 cores) I could get about 60 MB/s from a single read (regardless of whether or not it was coming from boto, fsspec, or arrow (C++ backed)), or about 500 MB/s if I fully saturated the machine.

This was all done before the new S3 single-zone stuff was launched.

I came to the following conclusions:

The framework orchestrating downloads (zarr, dask, xarray, ...) needs to add in some concurrency
It's nice if the downloading library (fsspec, boto, arrow) is pretty simple

All this is to say, I wouldn't expect "Just reimplement everything in Rust" to solve the problem. I think that you'll need to be more clever than that here. However I also don't think it would hurt. I found myself leaning more towards Arrow over fsspec in profiling because things felt simpler to profile. I was pretty confident that there wasn't some background IOLoop or GIL issue slowing me down (although I never had solid evidence that there was).

In Cubed's case I'd recommend trying to build some concurrency in to a single function call.

After talking to devs in other projects, the 2-3x threads over cores approach is common in Dask (or will be soon), DuckDB, and Polars.

martindurant · 2023-12-07T19:56:47Z

Just a note that concurrency X threads is normal and should be expected in zarr-on-dask workflows. fsspec supports concurrency, so this achieved by setting the dask partition to be a few (4-20) zarr chunks, depending on memory.

build some concurrency in to a single function call

That of course is the tricky thing, in whichever language. If you need a sync gate at all (because the whole application is not async), then you need to decide where that gate is.

martindurant · 2023-12-07T20:03:41Z

I never had solid evidence that there was

This is quite opposite to claims that were made before. IF we can reach near equivalent performance in python, the simplicity of the code and speed of development are big benefits. My experiments in rfsspec showed there is probably little difference.

What I found at the time was that S3 liked ...
Reads of 5-15 MB in size
Plenty of concurrency (2x-3x threads than cores)

These are interesting findings that are hard to tease out from the linked very long thread. Perhaps there can be some sort of summary of findings?

mrocklin · 2023-12-07T20:05:55Z

A summary of the finding sounds great. If someone wants to do that that would be welcome.

mrocklin · 2023-12-07T20:07:33Z

This notebook might also be of use https://gist.github.com/mrocklin/c1fd89575b40c055a9be77b2a47894df

JackKelly · 2023-12-08T10:58:48Z

Batched function calls

build some concurrency in to a single function call

Yeah, I agree: batched async provides a bunch of benefits. On the design for Zarr-Python v3.0, there's some discussion of implementing an async Store.get_items() method. This would allow the Store to perform a bunch of optimisations.

Compiled languages and millions of IOPS

On the topic of implementing IO libraries in compiled languages (like Rust)... There are use-cases which would benefit from being able to read millions of small chunks per second to a single machine (for example, here's an overly long blog (!) post I wrote about my main use-case: training large ML models from multi-dimensional data). To achieve that kind of performance, it would appear necessary to implement at least some of the code in a compiled language. (And to use all the tricks discussed in the thread above: especially using native async IO that modern operating systems provide. I'm tinkering with some ideas in Rust in this repo).

But I recognise that this current GitHub discussion is about low-latency cloud storage buckets (using cubed). And my main focus is somewhat different: my main focus is on local, modern SSDs (which, today, can do up to 3 million IOPS. And the performance is rapidly improving). So - I apologise - I'm a bit off-topic. But I wanted to make the point that some use-cases would benefit from implementing IO in a compiled language.

As far as I can tell, it's not yet possible to perform millions of IOPS to a single machine from a cloud storage bucket (although Amazon S3 Express One Zone gets us to 100,000 IOPS... although I assume that performance is only possible when talking to the bucket from many VMs). And, of course, if you're using large chunks sizes (100 MB) then you're probably only doing on the order of 10 IOPS per machine, and so your performance is dominated by the network bandwidth, and no one cares if you can shave a few milliseconds off your code's runtime per IO operation!

But there are folks (like me) who require on-prem hardware and fast, local storage. Amazing things are happening in the storage industry right now: I've heard it said that there's been more innovation in storage in the last two years than in the last twenty years! Folks like me need code which can keep up with these innovations in hardware!

And - who knows - maybe cloud storage buckets will soon deliver millions of IOPS (to a single machine). But - wait - is it even possible to do that over a network? Well... even the lowly 1 Gbps NIC in your laptop can do almost 1.5 million packets per second (PPS)... and 400 Gbps NICs are available, which can do almost 600 million PPS. And, in the future, it'd be nice if cloud storage APIs allowed us to submit multiple disk operations per packet... you know... like HTTP has been capable of since 1996 🙂.

tomwhite · 2023-12-08T11:33:43Z

Thanks for all the comments! Really interesting.

I starting pushing on Arrow timings (see apache/arrow#38389) and found that it has much less to do with the language that is used, but rather due to how gently one handles S3.

That's a really good point @mrocklin. There is a PR in aiobotocore to support Amazon S3 Express One Zone, so hopefully we'll be able to try it out (via fsspec) soon. That would be the first thing I would try to address this issue, which is particularly acute in Cubed as it uses cloud storage for intermediate results (i.e. it doesn't use S3 very gently).

In Cubed's case I'd recommend trying to build some concurrency in to a single function call.

Agreed. When we start reading multiple chunks per task in more sophisticated ways (e.g. for tree-reduce in #284, #331) then we'll need the ability to load N chunks concurrently - sometimes in a streaming fashion - and no more than N chunks to meet Cubed's memory guarantees. It sounds like fsspec can already do that, which is great, also there's active discussion about how to improve this more generally (like the Zarr V3 discussion @JackKelly linked to).

TomNicholas · 2023-12-13T19:37:06Z

aiobotocore is pinned to botocore < 1.33.2, and support for Amazon S3 Express One Zone was added in botocore 1.33.2.

The newest release of aibotocore seems to have fixed this now right? So we can now try out using fsspec with S3 Express One Zone?

martindurant · 2023-12-13T20:31:38Z

botocore>=1.33.2,<1.33.14 is what I see now, so give it a go

JackKelly · 2023-12-14T14:03:45Z

I assume not... But, just to check: would cubed be happy with high latency if it got high bandwidth and a large number of IO operations per second?

tomwhite added runtime zarr storage labels Jun 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Low-latency storage backends #237

Low-latency storage backends #237

tomwhite commented Jun 28, 2023

TomNicholas commented Jun 28, 2023

tomwhite commented Jul 1, 2023

tomwhite commented Aug 22, 2023

hammer commented Dec 3, 2023

TomNicholas commented Dec 4, 2023 •

edited

tomwhite commented Dec 7, 2023

hammer commented Dec 7, 2023

rabernat commented Dec 7, 2023

tomwhite commented Dec 7, 2023

hammer commented Dec 7, 2023

martindurant commented Dec 7, 2023

mrocklin commented Dec 7, 2023

martindurant commented Dec 7, 2023

martindurant commented Dec 7, 2023

mrocklin commented Dec 7, 2023

mrocklin commented Dec 7, 2023

JackKelly commented Dec 8, 2023 •

edited

tomwhite commented Dec 8, 2023

TomNicholas commented Dec 13, 2023

martindurant commented Dec 13, 2023

JackKelly commented Dec 14, 2023

Low-latency storage backends #237

Low-latency storage backends #237

Comments

tomwhite commented Jun 28, 2023

TomNicholas commented Jun 28, 2023

tomwhite commented Jul 1, 2023

tomwhite commented Aug 22, 2023

hammer commented Dec 3, 2023

TomNicholas commented Dec 4, 2023 • edited

tomwhite commented Dec 7, 2023

hammer commented Dec 7, 2023

rabernat commented Dec 7, 2023

tomwhite commented Dec 7, 2023

hammer commented Dec 7, 2023

martindurant commented Dec 7, 2023

mrocklin commented Dec 7, 2023

martindurant commented Dec 7, 2023

martindurant commented Dec 7, 2023

mrocklin commented Dec 7, 2023

mrocklin commented Dec 7, 2023

JackKelly commented Dec 8, 2023 • edited

Batched function calls

Compiled languages and millions of IOPS

tomwhite commented Dec 8, 2023

TomNicholas commented Dec 13, 2023

martindurant commented Dec 13, 2023

JackKelly commented Dec 14, 2023

TomNicholas commented Dec 4, 2023 •

edited

JackKelly commented Dec 8, 2023 •

edited