[Data] Retry on OSError: AWS Error NETWORK_CONNECTION during GetObject operation: curlCode: 28, Timeout was reached
#43803
Comments
Hey @jennifgcrl, could you share a full traceback? Also, what does your cluster look like (type and number of nodes), and how did you choose between |
Hi there, I am seeing the same issue and it is a major blocker. I am using Sometimes it's an error getting information for a key, sometimes it's "libcurl was given bad argument", and sometimes it's a timeout like the one above. In every case, though, the cause is a transient network error that needs to be retried. I see that some fixes for similar problems have been applied in the past, but there are still many gaps. Some previous related work:
The closest fix is this one: #42027 The error I am seeing is
A bit of custom hackery into the codebase has improved the reliability greatly for me, although my solutions are too messy to PR. These code references should hopefully be helpful enough to indicate gaps in the current retry strategies though. |
Here's an example stacktrace with some stuff ***'d out:
|
@bveeramani May I know when we can expect a fix for this? I'm facing the same issue when trying to read around 42k parquet files from an S3 directory |
We are seeing similar issues when reading ~1M images from an S3 bucket with
|
Is there any update on fixing this? Also running into this and it is of high severity for me as well |
@murthyn - this is high priority for us but we're a little swamped; it's funded as part of our planning through May. Balaji will provide more details as he has them. |
re-reading this - @meltzerpete do you think you'd be game to contribute a PR; we can pair with someone to help shepherd through on the Anyscale side. Your breakdown of the problem and whereabouts to solve is quite spectacular :) cc @c21 |
hi @jennifgcrl , looks like this is a transient error. |
tag @murthyn @ronyw7 @Sri-nidhi @meltzerpete as well see @raulchen above^ |
Thanks for the updates. Hopefully this has been solved. I'm unavailable until Tuesday, but will test this then. Failing that, I'm happy to try to support with a PR if there's someone to pair with. |
I'm pretty sure I tried ray 2.10 when it was first released and found the issue persisted, however, I have tested today with ray 20.0.0 and can confirm the issue appears to be solved. I am seeing no transient errors at all, thanks so much! |
Ah in fact perhaps I spoke too soon 😬 I am seeing this:
I'm seeing this several times, but this looks like the only place it's erroring. Before there were many. |
Is there any update? I'm facing the same issue... |
What happened + What you expected to happen
ray.data.read_parquet_bulk crashes on parquet_base_datasource's pq.read_table with
This error should be retried
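For illustration only, here is a minimal sketch of the kind of retry being asked for. It is a generic wrapper, not Ray's actual retry logic. The names `TRANSIENT_ERROR_FRAGMENTS`, `is_transient`, and `call_with_retry` are hypothetical, and the matched message fragments are simply copied from the error in this issue's title.

```python
import random
import time

# Hypothetical list of message fragments treated as transient. These are
# copied from the error in this issue's title, not from Ray's configuration.
TRANSIENT_ERROR_FRAGMENTS = (
    "AWS Error NETWORK_CONNECTION",
    "Timeout was reached",
)


def is_transient(exc):
    """Return True if exc is an OSError whose message matches a known fragment."""
    return isinstance(exc, OSError) and any(
        fragment in str(exc) for fragment in TRANSIENT_ERROR_FRAGMENTS
    )


def call_with_retry(fn, max_attempts=5, base_delay_s=1.0, max_delay_s=32.0,
                    sleep=time.sleep):
    """Call fn(), retrying transient OSErrors with jittered exponential backoff.

    Non-transient errors, and the error from the final attempt, are re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except OSError as exc:
            if not is_transient(exc) or attempt == max_attempts - 1:
                raise
            # Double the delay on each attempt, cap it, then add jitter so
            # many readers do not retry in lockstep.
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
```

A read could then be wrapped as `call_with_retry(lambda: pq.read_table(path))`, where `path` stands for whichever file is being read. Matching on message text is brittle, so this is a stopgap at best compared with a fix inside the datasource itself.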
Versions / Dependencies
ray==2.9.2
Reproduction script
Issue Severity
High: It blocks me from completing my task.