Performance Comparison between native AWSSDK and FSSpec (boto3) based DataPipes #500
Comments
Thanks @NivekT and @ejguan for the context in #847! I've run some preliminary benchmarks on our datasets comparing s3io (aka the s3 plugin, the pybind-wrapped C++ client) versus fsspec. TL;DR: at a high enough batch size + num_workers (DataLoader workers) the throughput is comparable (although s3io is still ~16% faster), at ~2.4M samples/sec versus ~2.0M samples/sec. Where the difference really shows is when you strip away all the parallelism gimmicks; in that case s3io is ~2x faster than fsspec. Below are the benchmark results.

Experiment Parameters:

Notes:
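For context, a minimal sketch of what the two pipeline variants being compared could look like (this is not the exact pipeline used above; the bucket/prefix is a placeholder, s3io requires torchdata built with the AWSSDK extension, and fsspec requires the `s3fs` package):

```python
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

S3_PREFIX = "s3://my-bucket/my-text-dataset/"  # hypothetical location of the text shards


def fsspec_pipeline():
    # fsspec/s3fs-backed chain: list shards, shard across workers, open, read lines
    dp = IterableWrapper([S3_PREFIX]).list_files_by_fsspec().sharding_filter()
    return dp.open_files_by_fsspec(mode="rb").readlines()


def s3io_pipeline():
    # native AWSSDK (s3io)-backed equivalent
    dp = IterableWrapper([S3_PREFIX]).list_files_by_s3().sharding_filter()
    return dp.load_files_by_s3().readlines()


# Throughput was compared by driving either pipeline through a DataLoader while
# sweeping batch_size and num_workers.
loader = DataLoader(fsspec_pipeline(), batch_size=64, num_workers=4)
```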
@kiukchung Thank you so much for helping us benchmark the text use case! This seems contradictory to the benchmarking result we carried out previously (I am more focusing on

I have a noob question on the benchmarking settings. Does 1000 batches mean you would only read data from a single shard, since a shard has 100k lines, with smaller batch sizes like 16, 32, and 64? And, since only one shard has been read, the low P0 value seems weird to me. Do you mind helping us to test those two implementations with higher
@ejguan thanks for taking a look, and for your insights/questions. I'm on PTO today, so I will update with the answers to your questions on Mon/Tue. I will also clean up the benchmark code and post it here so that you can run it on your end (I need to remove some dependencies on our internal tools that read the dataset manifest containing the s3 URLs of the shards).

RE: P0 being super low. I should've been clearer: yes, you are correct that for low batch sizes the benchmark will only read one shard. However, the first source datapipe in my chain is a "ListShardsFromManifest", a custom iter datapipe I implemented that simply queries the manifest file (a json file in s3) given the dataset name, branch, and region. I believe the low P0 qps comes from the fact that to read the first batch, we first read the manifest (a list + read s3 operation). The manifest file itself is pretty small (no more than 100kb), so most of that latency comes from making those two s3 requests cross-region (the datasets are in us-east-1 and my desktop is in us-west-2). I'll try to run the benchmarks in the same region to see if that improves the P0 numbers.
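The "ListShardsFromManifest" datapipe is internal, but a source datapipe of roughly that shape could look like the hypothetical sketch below (the manifest schema, URL, and keys are made up for illustration):

```python
import json

import fsspec
from torchdata.datapipes.iter import IterDataPipe


class ListShardsFromManifest(IterDataPipe):
    """Hypothetical sketch: yields shard URLs listed in a small JSON manifest stored in S3."""

    def __init__(self, manifest_url: str, dataset: str, branch: str, region: str):
        self.manifest_url = manifest_url
        self.dataset = dataset
        self.branch = branch
        self.region = region

    def __iter__(self):
        # One extra S3 round-trip happens before the first shard can be read,
        # which is why the very first batch (P0) pays additional latency.
        with fsspec.open(self.manifest_url, "r") as f:
            manifest = json.load(f)
        yield from manifest[self.dataset][self.branch][self.region]  # list of shard URLs
```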
Sorry it took me longer than expected. Here's the benchmarking script:
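(The script itself didn't survive this capture; below is only a rough, hedged sketch of what a comparable throughput harness could look like, assuming pipeline factories like the ones sketched earlier and hypothetical parameter values.)

```python
import time

from torch.utils.data import DataLoader


def benchmark(pipeline, batch_size: int, num_workers: int, num_batches: int = 1000):
    """Measure aggregate samples/sec and per-batch latencies for a datapipe."""
    loader = DataLoader(pipeline, batch_size=batch_size, num_workers=num_workers)
    it = iter(loader)
    latencies = []
    start = time.perf_counter()
    for _ in range(num_batches):
        t0 = time.perf_counter()
        next(it)  # assumes the dataset yields at least num_batches batches
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    samples_per_sec = num_batches * batch_size / elapsed
    return samples_per_sec, latencies  # latencies[0] ~ first-batch ("P0") cost
```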
Did some more digging and here are some observations:
So what is the bottom line?
@kiukchung This is amazing! Thank you for providing such a detailed benchmarking result and analysis.

Even without

All the bottom lines you mentioned are super useful for users. They definitely deserve to be written into our documentation! cc: @NivekT
@kiukchung Thanks for looking into this and sharing the results! Your findings are very helpful and we should incorporate them into our doc. A few questions:
dp.open_files_by_fsspec(mode="rb", anon=True).load_from_tar(mode="r|")
# Note that `mode="r|"` for streaming
Haven't tried it on tarballs, but "block-streaming" works for text files. I'd assume that since tarballs can be stream-opened, this also works for tars, with a caveat (see the end of this paragraph). Will put up a PR for an
Yeah, so I was only benchmarking pure reads. In practice the dataloader bottleneck will be in the pipes towards the end of the chain that do a lot of data transforms (e.g. tokenization for NLP). For our use case, we have most of the data pre-processed in S3, ready to be fed into the forward() method as soon as it is read (i.e. no pre-processing in the trainer), hence my recommendation for 4 workers. I mentioned this above, but reiterating here: the
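To make the "block-streaming" point concrete, here is a small hedged sketch (placeholder URL, and assuming `s3fs` is installed): fsspec hands back a file-like stream, so the pipeline can pull bytes or lines incrementally instead of downloading the whole object first.

```python
from torchdata.datapipes.iter import IterableWrapper

# placeholder object; any text shard in S3 would do
dp = IterableWrapper(["s3://my-bucket/shard-00000.txt"]).open_files_by_fsspec(mode="rb", anon=True)

for path, stream in dp:
    first_block = stream.read(1024 * 1024)  # reads ~1 MiB, not the entire object
    print(path, len(first_block))
```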
@kiukchung Not yet. Do you want to open a PR to add them? We are currently tied up working on certain features prior to the branch cut.
🐛 Describe the bug

After `AWSSDK` is integrated with TorchData, we now have two categories of `DataPipe`s to access and load data from an AWS S3 bucket:
- `DataPipe` using `fsspec`: It relies on the `s3fs` module to list/load data from an S3 bucket.
- `DataPipe` using `AWSSDK`: It relies on pybind from the `AWSSDK_CPP` module.

And, I want to carry out a performance comparison of `Lister` and `Opener`/`Loader` between these two ways.

For `Lister`s, I was using the same root path of `"s3://ai2-public-datasets/charades"` and validated that they returned the same values during iteration.

Testing script
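(The testing script is not included in this capture; the following is only a hedged reconstruction of what the Lister comparison could look like, using the public prefix mentioned above. Whether the lister forwards extra kwargs such as `anon` to fsspec is an assumption.)

```python
import time

from torchdata.datapipes.iter import FSSpecFileLister, IterableWrapper, S3FileLister

ROOT = "s3://ai2-public-datasets/charades"


def time_lister(dp, label):
    start = time.perf_counter()
    urls = list(dp)
    print(f"{label}: listed {len(urls)} objects in {time.perf_counter() - start:.2f}s")


time_lister(FSSpecFileLister(ROOT, anon=True), "FSSpecFileLister")  # anon kwarg assumed
time_lister(S3FileLister(IterableWrapper([ROOT])), "S3FileLister")
```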
And the result is:
The `FSSpecFileLister` performs 10x better than `S3FileLister`.

For `S3FileLoader` and `FSSpecFileOpener`, besides iterating over these two `DataPipe`s, I also carried out an extra experiment by adding a `read` from the file returned by these `DataPipe`s. And, I only used two datasets hosted on the S3 bucket for testing, simply to save time running tests.

Testing script
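(Again, the original script is not included; this is only a hedged sketch of the comparison with an explicit `read()` attached, so both datapipes actually pull bytes from S3 instead of merely creating file handles. The object URLs are placeholders.)

```python
import time

from torchdata.datapipes.iter import FSSpecFileOpener, IterableWrapper, S3FileLoader

FILE_URLS = ["s3://ai2-public-datasets/charades/<some-object>"]  # placeholder URLs


def time_reads(dp, label):
    start = time.perf_counter()
    total = 0
    for _, stream in dp:
        total += len(stream.read())
    print(f"{label}: read {total} bytes in {time.perf_counter() - start:.2f}s")


time_reads(S3FileLoader(IterableWrapper(FILE_URLS)), "S3FileLoader")
time_reads(FSSpecFileOpener(IterableWrapper(FILE_URLS), mode="rb", anon=True), "FSSpecFileOpener")
```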
And the result is:
By comparing the results without `read`, I believe `S3FileLoader` would trigger loading data but `FSSpecFileOpener` won't read data from remote. So, it makes more sense to compare these two `DataPipe`s with the `read` operation attached. The `FSSpecFileOpener` still beats `S3FileLoader` by about 25% performance-wise.

Due to the performance regression with `AWSSDK`, it becomes hard for me to recommend users to use the native `S3FileLister` or `S3FileLoader`.

cc: @ydaiming
Versions
main branch
I only executed these scripts on my Mac, as our AWS cluster doesn't allow me to access S3.