kill 1 rss which is serving shuffle read/write, application will be re-run by yarn with new attempt id #48
Comments
@wangyi2021 Currently, multiple replicas with LocalFile storage is not ready. We have already done some refactoring of the shuffle read path, and multiple replicas should be supported next, building on that refactoring.
Case 4 was tested with the following configuration; task 195 error log (shuffle fetch for partition 94 data):
It retries 3 times, then the task succeeds.
In the rss-abnormal server log, there is no log about this reduce task for partition 94.
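To make the observed behavior concrete, here is a minimal sketch of a retry-across-replicas shuffle fetch, matching the "retry 3 times, then the task succeeds" pattern reported above. `fetchPartitionFromServer` and `ShuffleServerInfo` are hypothetical names for illustration only, not the project's actual API.

```java
import java.util.List;

public class ReplicaReadSketch {
    static final int MAX_RETRIES = 3;

    byte[] readPartition(int partitionId, List<ShuffleServerInfo> replicas) {
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            // Rotate through replicas so a dead server is skipped on retry.
            ShuffleServerInfo server = replicas.get(attempt % replicas.size());
            try {
                return fetchPartitionFromServer(server, partitionId);
            } catch (Exception fetchFailure) {
                // Log and fall through to the next attempt / replica.
                System.err.println("fetch of partition " + partitionId
                    + " from " + server + " failed, retrying: " + fetchFailure);
            }
        }
        throw new RuntimeException("partition " + partitionId
            + " unreachable after " + MAX_RETRIES + " attempts");
    }

    // Hypothetical transport call; stands in for the real client RPC.
    byte[] fetchPartitionFromServer(ShuffleServerInfo server, int partitionId) {
        throw new UnsupportedOperationException("sketch only");
    }

    static class ShuffleServerInfo {}
}
```

With this shape, a fetch that fails against the abnormal server simply moves to a healthy replica on the next attempt, which is why the reduce task can succeed without the abnormal server ever logging the request.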
@wangyi2021 For the shuffle write phase, we implemented marking a task as successful if it writes its data to any shuffle server successfully, but it was reverted because many situations must be considered to make sure the data is correct. We already have some ideas for better replica support, and I hope it can be available next month.
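A minimal sketch of the reverted approach described above, assuming the task is treated as successful when a write to any replica server succeeds. `writeToServer` is a hypothetical stand-in for the real client write path, not the project's actual API.

```java
import java.util.List;

public class AnyReplicaWriteSketch {
    // Returns true if the block reaches at least one replica server.
    boolean writeBlocks(byte[] data, List<String> replicaServers) {
        boolean anySucceeded = false;
        for (String server : replicaServers) {
            try {
                writeToServer(server, data);
                anySucceeded = true; // one successful replica is enough
            } catch (Exception writeFailure) {
                // Tolerate individual replica failures; other replicas may succeed.
            }
        }
        return anySucceeded;
    }

    // Hypothetical transport call; stands in for the real client write RPC.
    void writeToServer(String server, byte[] data) {
        throw new UnsupportedOperationException("sketch only");
    }
}
```

The correctness concern mentioned in the comment is visible here: readers must then cope with partitions whose replicas hold different subsets of the data, which is why the change was reverted.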
Tested with a Terasort job.
If an application needs to run for 5 hours and one RSS server becomes abnormal halfway through, a re-run wastes a lot of time.
Four cases were tested.
Shuffle read can proceed normally by configuring 2 replicas, as in the sketch below. But for shuffle write, there is currently no way to avoid an application re-run.
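A hedged example of enabling 2 replicas on the Spark client side. The key name `spark.rss.data.replica` follows the RSS client's config convention; verify the exact key against your client version.

```java
import org.apache.spark.SparkConf;

public class ReplicaConfExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("terasort-with-rss-replicas")
            // Write each shuffle block to 2 shuffle servers so that
            // shuffle read survives the loss of a single server.
            .set("spark.rss.data.replica", "2");
        System.out.println(conf.get("spark.rss.data.replica"));
    }
}
```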
For shuffle write, is it possible to make the driver re-register the shuffle to get a new reachable RSS server?
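A sketch of the idea in that question: when writes to an assigned shuffle server keep failing, the driver asks the coordinator for a fresh assignment and re-registers the shuffle there. `CoordinatorClient`, `requestShuffleAssignment`, and `registerShuffle` are hypothetical names used only to illustrate the proposal, not an existing API.

```java
public class ReRegisterSketch {
    interface CoordinatorClient {
        String requestShuffleAssignment(int shuffleId);      // returns a reachable server
        void registerShuffle(int shuffleId, String server);  // (re-)register shuffle metadata
    }

    void recoverShuffleWrite(CoordinatorClient coordinator, int shuffleId) {
        // Ask the coordinator to exclude the dead server and hand back a
        // healthy one, then register the shuffle against it so subsequent
        // map tasks write to the new server instead of failing the app.
        String newServer = coordinator.requestShuffleAssignment(shuffleId);
        coordinator.registerShuffle(shuffleId, newServer);
    }
}
```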