-
Notifications
You must be signed in to change notification settings - Fork 13.6k
[FLINK-37701][flink-runtime] Fix AdaptiveScheduler ignoring checkpoint states sizes for local recovery adjustment. #26663
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix @Izeren !
I've left a couple of comments, PTAL.
Were you able to reproduce the failure reliably without fix - and success with the fix?
...in/java/org/apache/flink/runtime/scheduler/adaptive/allocator/StateLocalitySlotAssigner.java
Outdated
Show resolved
Hide resolved
.../src/main/java/org/apache/flink/runtime/scheduler/adaptive/allocator/StateSizeEstimates.java
Outdated
Show resolved
Hide resolved
The ci is failed. |
Yes, it fails in intellij more than it doesn't, but you need to provide VM options: |
|
@dmvk, I have added 2 tests:
|
I am still looking into few other test failures in AdaptiveScheduler |
…for local recovery adjustment.
@Izeren Hi, can we push this fix before code freeze time? |
@dmvk, would you have time to have a second look today? |
What is the purpose of the change
Address local recovery issues when Adaptive scheduler is enabled.
StateSizeEstimates
(that is needed because execution graph goes through cancelling/cancelled state and checkpoint coordinator isnulled
by the time we run calculations).managedKeyedState
is present, but local recovery semantics doesn'trequire
state presence).Context: When job can be recovered locally, we should keep slot allocation after restart to maintain
Verifying this change
LocalRecoveryITCase#testRecoverLocallyFromProcessCrashWithWorkingDirectory
now passes whenAdaptiveScheduler
is enabled.Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: (yes / no)Documentation