Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Low-latency storage backends #237

Open
tomwhite opened this issue Jun 28, 2023 · 21 comments
Open

Low-latency storage backends #237

tomwhite opened this issue Jun 28, 2023 · 21 comments

Comments

@tomwhite
Copy link
Member

So far we've only used cloud storage (S3 and GCS) for storing intermediate Zarr data in Cubed. It would be interesting to try other storage backends that have lower latency.

Here are some examples, I'm sure there are more:

Note that Google Cloud Memorystore is not serverless, so you'd need to start a cluster before running your computation, and shut it down afterwards.

We'd also want to be more aggressive in deleting intermediate data that is no longer needed in Cubed. Currently we just let it get deleted automatically in the background, well after the computation has completed. Low-latency storage has smaller capacity and is a lot more expensive, so we'd want to delete data that is no longer a direct dependency of any node in the DAG immediately as the computation progresses.

cc @TomNicholas @rabernat

@TomNicholas
Copy link
Collaborator

This would be very interesting.

Low-latency storage has smaller capacity

I would be concerned about the maximum size of the low-latency storage limiting scaling. For example I noticed:

Amazon EFS supports up to 25,000 connections per file system.

With 100MB chunks, this means the maximum size Zarr array you could process in one step before losing parallelism would be 2.5TB.

@tomwhite
Copy link
Member Author

tomwhite commented Jul 1, 2023

Jacob Tomlinson on twitter suggested momento, Dragonfly, and ElastiCache as possibilities too.

@tomwhite
Copy link
Member Author

Also relevant to this issue is the work going on in Zarr to improve performance.

@hammer
Copy link

hammer commented Dec 3, 2023

Amazon S3 Express One Zone might have solved this problem.

@TomNicholas
Copy link
Collaborator

TomNicholas commented Dec 4, 2023

Amazon S3 Express One Zone might have solved this problem.

This looks very interesting to me! It would be great to test it.

Some clarifying questions:

  • The blog post talks about how this new storage class brings the most performance benefits for smaller objects, as latency is a higher fraction of the total time to download a small object. Would our ~100MB chunks be considered "small" in this context?

combined with request costs that are 50% lower than for the S3 Standard storage class

  • Is this sentence from the post referring to some monetary cost or to computational cost?

  • The post mentions co-locating this storage with various compute services, but AWS lambda isn't explicitly listed. Do we know if we can use this with serverless compute? EDIT: AWS Lambda is explicitly listed as compatible on this docs page.

  • Is fsspec + s3fs + Zarr likely to "just work" with this new storage class? It says it implements most of the same S3 API...

@tomwhite
Copy link
Member Author

tomwhite commented Dec 7, 2023

  • Is fsspec + s3fs + Zarr likely to "just work" with this new storage class? It says it implements most of the same S3 API...

Not yet, by the look of it. I had a quick dig and found that fsspec uses aiobotocore, which in turn depends on botocore. The latest version of aiobotocore is pinned to botocore < 1.33.2, and support for Amazon S3 Express One Zone was added in botocore 1.33.2.

Also, quoting from aio-libs/aiobotocore#1057 (comment):

botocore>=1.33.2 includes many new codepaths and will likely be a major piece of work for us.

@hammer
Copy link

hammer commented Dec 7, 2023

Yikes! Reading a bit more to see why botocore itself doesn't support async io, in boto/botocore#458 I was amused to find a comment from scanpy's Philipp Angerer. We're all hitting the same issues!

I went ahead and filed aio-libs/aiobotocore#1065 to have a place to track the S3 Express One Zone support work in aiobotocore.

@rabernat
Copy link

rabernat commented Dec 7, 2023

Let's just create python bindings for Rust's object store crate and start using that instead. Looks very powerful and performant.

@tomwhite
Copy link
Member Author

tomwhite commented Dec 7, 2023

Let's just create python bindings for Rust's object store crate and start using that instead. Looks very powerful and performant.

More performance would be welcome! It doesn't support S3 Express One Zone yet either: apache/arrow-rs#5140

@hammer
Copy link

hammer commented Dec 7, 2023

Let's just create python bindings for Rust's object store crate and start using that instead. Looks very powerful and performant.

Ha, was just looking at that: fastlmm/bed-reader#22. There is in fact a Python interface already: https://github.com/roeap/object-store-python. It's a bit stale.

I have been thinking for a while that the fsspec abstraction is too broad and deep (e.g. caching) and most of our use cases would be better served by a more focused library. Perhaps this could be refactored out of fsspec, or perhaps built on top of another library...

@martindurant
Copy link

the fsspec abstraction is too broad and deep (e.g. caching)

To defend fsspec: what do you think could possibly be getting in the way? Methods like fs.cat([list of paths]) are very minimal and all that zarr actually needs. Caching and such are totally optional. It would be better if you can affect optimisations there, perhaps hooking into a rust implementation if you like.

Does object-store-python support anything async?

I again refer readers to rfsspec: rust/tokio does not magically make things much faster, it may help a little.

@mrocklin
Copy link

mrocklin commented Dec 7, 2023

A month ago I was doing comparative benchmarks between Spark/Dask/DuckDB/Polars on cloud data. My observations were that, as long as the projects don't do anything dumb (a big assumption) the game was entirely about optimizing S3 access. I starting pushing on Arrow timings (see apache/arrow#38389) and found that it has much less to do with the language that is used, but rather due to how gently one handles S3. What I found at the time was that S3 liked ...

  • Reads of 5-15 MB in size
  • Plenty of concurrency (2x-3x threads than cores)

On an m6i.xlarge (4 cores) I could get about 60 MB/s from a single read (regardless of whether or not it was coming from boto, fsspec, or arrow (C++ backed)), or about 500 MB/s if I fully saturated the machine.

This was all done before the new S3 single-zone stuff was launched.

I came to the following conclusions:

  1. The framework orchestrating downloads (zarr, dask, xarray, ...) needs to add in some concurrency
  2. It's nice if the downloading library (fsspec, boto, arrow) is pretty simple

All this is to say, I wouldn't expect "Just reimplement everything in Rust" to solve the problem. I think that you'll need to be more clever than that here. However I also don't think it would hurt. I found myself leaning more towards Arrow over fsspec in profiling because things felt simpler to profile. I was pretty confident that there wasn't some background IOLoop or GIL issue slowing me down (although I never had solid evidence that there was).

In Cubed's case I'd recommend trying to build some concurrency in to a single function call.

After talking to devs in other projects, the 2-3x threads over cores approach is common in Dask (or will be soon), DuckDB, and Polars.

@martindurant
Copy link

Just a note that concurrency X threads is normal and should be expected in zarr-on-dask workflows. fsspec supports concurrency, so this achieved by setting the dask partition to be a few (4-20) zarr chunks, depending on memory.

build some concurrency in to a single function call

That of course is the tricky thing, in whichever language. If you need a sync gate at all (because the whole application is not async), then you need to decide where that gate is.

@martindurant
Copy link

I never had solid evidence that there was

This is quite opposite to claims that were made before. IF we can reach near equivalent performance in python, the simplicity of the code and speed of development are big benefits. My experiments in rfsspec showed there is probably little difference.

What I found at the time was that S3 liked ...
Reads of 5-15 MB in size
Plenty of concurrency (2x-3x threads than cores)

These are interesting findings that are hard to tease out from the linked very long thread. Perhaps there can be some sort of summary of findings?

@mrocklin
Copy link

mrocklin commented Dec 7, 2023

A summary of the finding sounds great. If someone wants to do that that would be welcome.

@mrocklin
Copy link

mrocklin commented Dec 7, 2023

This notebook might also be of use https://gist.github.com/mrocklin/c1fd89575b40c055a9be77b2a47894df

@JackKelly
Copy link

JackKelly commented Dec 8, 2023

Batched function calls

build some concurrency in to a single function call

Yeah, I agree: batched async provides a bunch of benefits. On the design for Zarr-Python v3.0, there's some discussion of implementing an async Store.get_items() method. This would allow the Store to perform a bunch of optimisations.

Compiled languages and millions of IOPS

On the topic of implementing IO libraries in compiled languages (like Rust)... There are use-cases which would benefit from being able to read millions of small chunks per second to a single machine (for example, here's an overly long blog (!) post I wrote about my main use-case: training large ML models from multi-dimensional data). To achieve that kind of performance, it would appear necessary to implement at least some of the code in a compiled language. (And to use all the tricks discussed in the thread above: especially using native async IO that modern operating systems provide. I'm tinkering with some ideas in Rust in this repo).

But I recognise that this current GitHub discussion is about low-latency cloud storage buckets (using cubed). And my main focus is somewhat different: my main focus is on local, modern SSDs (which, today, can do up to 3 million IOPS. And the performance is rapidly improving). So - I apologise - I'm a bit off-topic. But I wanted to make the point that some use-cases would benefit from implementing IO in a compiled language.

As far as I can tell, it's not yet possible to perform millions of IOPS to a single machine from a cloud storage bucket (although Amazon S3 Express One Zone gets us to 100,000 IOPS... although I assume that performance is only possible when talking to the bucket from many VMs). And, of course, if you're using large chunks sizes (100 MB) then you're probably only doing on the order of 10 IOPS per machine, and so your performance is dominated by the network bandwidth, and no one cares if you can shave a few milliseconds off your code's runtime per IO operation!

But there are folks (like me) who require on-prem hardware and fast, local storage. Amazing things are happening in the storage industry right now: I've heard it said that there's been more innovation in storage in the last two years than in the last twenty years! Folks like me need code which can keep up with these innovations in hardware!

And - who knows - maybe cloud storage buckets will soon deliver millions of IOPS (to a single machine). But - wait - is it even possible to do that over a network? Well... even the lowly 1 Gbps NIC in your laptop can do almost 1.5 million packets per second (PPS)... and 400 Gbps NICs are available, which can do almost 600 million PPS. And, in the future, it'd be nice if cloud storage APIs allowed us to submit multiple disk operations per packet... you know... like HTTP has been capable of since 1996 🙂.

@tomwhite
Copy link
Member Author

tomwhite commented Dec 8, 2023

Thanks for all the comments! Really interesting.

I starting pushing on Arrow timings (see apache/arrow#38389) and found that it has much less to do with the language that is used, but rather due to how gently one handles S3.

That's a really good point @mrocklin. There is a PR in aiobotocore to support Amazon S3 Express One Zone, so hopefully we'll be able to try it out (via fsspec) soon. That would be the first thing I would try to address this issue, which is particularly acute in Cubed as it uses cloud storage for intermediate results (i.e. it doesn't use S3 very gently).

In Cubed's case I'd recommend trying to build some concurrency in to a single function call.

Agreed. When we start reading multiple chunks per task in more sophisticated ways (e.g. for tree-reduce in #284, #331) then we'll need the ability to load N chunks concurrently - sometimes in a streaming fashion - and no more than N chunks to meet Cubed's memory guarantees. It sounds like fsspec can already do that, which is great, also there's active discussion about how to improve this more generally (like the Zarr V3 discussion @JackKelly linked to).

@TomNicholas
Copy link
Collaborator

aiobotocore is pinned to botocore < 1.33.2, and support for Amazon S3 Express One Zone was added in botocore 1.33.2.

The newest release of aibotocore seems to have fixed this now right? So we can now try out using fsspec with S3 Express One Zone?

@martindurant
Copy link

botocore>=1.33.2,<1.33.14 is what I see now, so give it a go

@JackKelly
Copy link

I assume not... But, just to check: would cubed be happy with high latency if it got high bandwidth and a large number of IO operations per second?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants