Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Better handling of redis errors #33744

Closed
jjyao opened this issue Mar 27, 2023 · 8 comments · Fixed by #34108
Closed

[Core] Better handling of redis errors #33744

jjyao opened this issue Mar 27, 2023 · 8 comments · Fixed by #34108
Assignees
Labels
bug Something that is supposed to be working; but isn't core Issues that should be addressed in Ray Core release-blocker P0 Issue that blocks the release

Comments

@jjyao
Copy link
Collaborator

jjyao commented Mar 27, 2023

What happened + What you expected to happen

When we use ray with external redis, sometimes redis returns REDIS_REPLY_ERROR and we are not handle it properly.

Versions / Dependencies

master

Reproduction script

#33733

Issue Severity

None

@jjyao jjyao added bug Something that is supposed to be working; but isn't release-blocker P0 Issue that blocks the release core Issues that should be addressed in Ray Core labels Mar 27, 2023
@jjyao jjyao self-assigned this Mar 27, 2023
@jjyao
Copy link
Collaborator Author

jjyao commented Mar 28, 2023

For 2.4, we should audit all redis calls and retry REDIS_REPLY_ERROR

@fishbone
Copy link
Contributor

Close since duplicate with this #32541

@jjyao
Copy link
Collaborator Author

jjyao commented Mar 28, 2023

Are they the same?

@edoakes
Copy link
Contributor

edoakes commented Mar 29, 2023

@iycheng these issues not the same (but somewhat related). This one is about how the GCS handles errors and status codes in all requests to redis. The other is about surfacing a better error to the user when we fail to connect on startup.

@edoakes edoakes reopened this Mar 29, 2023
@jjyao jjyao added P1 Issue that should be fixed within a few weeks and removed release-blocker P0 Issue that blocks the release labels Mar 31, 2023
@jjyao
Copy link
Collaborator Author

jjyao commented Apr 3, 2023

The printed out error message is redis_context.cc:67: Check failed: false Got an error in redis reply: MOVED 11267 10.0.18.181:6379

@jjyao jjyao added release-blocker P0 Issue that blocks the release and removed P1 Issue that should be fixed within a few weeks labels Apr 3, 2023
@fishbone
Copy link
Contributor

fishbone commented Apr 3, 2023

@edoakes I think for Redis failures, right now, we don't do anything to fix this since it's just not easy. Blindly retry won't work and is worse IMO. (refer to #34014 for details).

@fishbone fishbone linked a pull request Apr 5, 2023 that will close this issue
8 tasks
@edoakes
Copy link
Contributor

edoakes commented Apr 10, 2023

@iycheng thanks for the leader election fix!

I think we should still leave this issue open and prioritize improving how we handle reply codes. Not having any form of retries or gracefully handling error codes makes the system unstable. What do you think?

@fishbone
Copy link
Contributor

@edoakes it's tracked here #34014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that is supposed to be working; but isn't core Issues that should be addressed in Ray Core release-blocker P0 Issue that blocks the release
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants