[FLINK-37701][flink-runtime] Fix AdaptiveScheduler ignoring checkpoint states sizes for local recovery adjustment. #26663

Izeren · 2025-06-10T18:25:53Z

What is the purpose of the change

Address local recovery issues when the AdaptiveScheduler is enabled.

Pass the latest completed checkpoint in addition to the execution graph to StateSizeEstimates (this is needed because the execution graph goes through the cancelling/cancelled state and the checkpoint coordinator is nulled by the time we run the calculations).
Assign a positive priority score to allocations that have overlapping key groups even when the state size is zero (currently we only give a priority score if managedKeyedState is present, but local recovery semantics don't require state presence).
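
To illustrate the second point, here is a minimal sketch of the scoring rule. It is not the actual StateLocalitySlotAssigner code: the class LocalityScoreSketch, its method names, and the per-key-group size estimate are hypothetical and only serve to show the idea that any key-group overlap yields a score of at least 1, even when the estimated state size is zero.

```java
/** Hypothetical sketch of the scoring idea; not Flink's real implementation. */
final class LocalityScoreSketch {

    private LocalityScoreSketch() {}

    /** Number of key groups shared by two inclusive ranges [start, end]. */
    static int overlap(int prevStart, int prevEnd, int newStart, int newEnd) {
        int lo = Math.max(prevStart, newStart);
        int hi = Math.min(prevEnd, newEnd);
        return Math.max(0, hi - lo + 1);
    }

    /**
     * Scores a previous allocation for a subtask that wants [newStart, newEnd].
     * Returns 0 when no key groups overlap. Otherwise returns at least 1, so an
     * overlapping allocation is still preferred when the estimated size is 0.
     */
    static long score(
            int prevStart, int prevEnd, int newStart, int newEnd, long bytesPerKeyGroup) {
        int shared = overlap(prevStart, prevEnd, newStart, newEnd);
        if (shared == 0) {
            return 0L;
        }
        return Math.max(1L, shared * bytesPerKeyGroup);
    }
}
```

With this rule, an allocation that previously held key groups 0 to 63 still gets a positive score for a subtask that needs the same range when the size estimate is 0, while an allocation with no overlapping key groups scores 0 regardless of size.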

Context: When job can be recovered locally, we should keep slot allocation after restart to maintain

Verifying this change

LocalRecoveryITCase#testRecoverLocallyFromProcessCrashWithWorkingDirectory now passes when AdaptiveScheduler is enabled.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
The serializers: (yes / no / don't know)
The runtime per-record code paths (performance sensitive): (yes / no / don't know)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
The S3 file system connector: (yes / no / don't know)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

flinkbot · 2025-06-10T18:33:19Z

CI report:

042e752 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

rkhachatryan

Thanks for the fix @Izeren !
I've left a couple of comments, PTAL.
Were you able to reproduce the failure reliably without fix - and success with the fix?

...in/java/org/apache/flink/runtime/scheduler/adaptive/allocator/StateLocalitySlotAssigner.java

.../src/main/java/org/apache/flink/runtime/scheduler/adaptive/allocator/StateSizeEstimates.java

lsyldliu · 2025-06-11T01:53:11Z

The CI failed.

dmvk · 2025-06-11T07:58:42Z

@Izeren rebasing the PR to include 374fedb should fix the CI

Izeren · 2025-06-11T09:05:33Z

> @Izeren rebasing the PR to include 374fedb should fix the CI

Thank you, will do

Izeren · 2025-06-11T09:24:37Z

> Thanks for the fix @Izeren ! I've left a couple of comments, PTAL. Were you able to reproduce the failure reliably without fix - and success with the fix?

Yes, it fails in IntelliJ more often than not, but you need to provide the VM option -Dflink.tests.enable-adaptive-scheduler=true to ensure that the adaptive scheduler is used. With the fix it is consistently successful.

dmvk · 2025-06-11T12:11:19Z

We should add a test case to AdaptiveScheduler with a custom implementation of SlotAssigner that acts as a regression test.
We should add a test case to the SlotAssigner implementation that verifies how the SlotAssigner behaves in non-rescaling scenarios, when we simply want to reuse previously known allocations.

Izeren · 2025-06-13T15:56:43Z

> We should add test case to AdaptiveScheduler with custom implementation of SlotAssigner that acts as a regression test

> We should add test case to SlotAssigner implementation that verifies how SA behaves in non-rescaling scenarios, when we simply want to reuse previously known allocations

@dmvk, I have added 2 tests:

One for AdaptiveScheduler, to ensure that SlotAllocator receives data from the checkpoint and uses it for distribution. (I couldn't check the checkpoints themselves, as AllocationInformation is calculated through a static call, so I verify afterwards that the state made it into SlotAllocator.)
One for SlotAllocator, to ensure that it preserves allocations according to the state distribution and retains them across a job restart.
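
As a rough illustration of the property the second test checks (this is not the test code from the PR), the sketch below shows an assigner that reuses previously known allocations when nothing is rescaled. The class AllocationRetentionSketch, the assign method, and the use of plain strings as allocation ids are all hypothetical simplifications.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of allocation reuse on restart; not Flink's SlotAssigner. */
final class AllocationRetentionSketch {

    private AllocationRetentionSketch() {}

    /**
     * Maps each subtask index to an allocation id, preferring the allocation
     * the subtask used before the restart when it is still free.
     */
    static Map<Integer, String> assign(
            Map<Integer, String> previous, Set<String> freeSlots, int parallelism) {
        Set<String> remaining = new LinkedHashSet<>(freeSlots);
        Map<Integer, String> result = new HashMap<>();

        // First pass: keep every previous allocation that is still available.
        for (int subtask = 0; subtask < parallelism; subtask++) {
            String prev = previous.get(subtask);
            if (prev != null && remaining.remove(prev)) {
                result.put(subtask, prev);
            }
        }

        // Second pass: give any leftover free slot to subtasks without a match.
        Iterator<String> leftovers = remaining.iterator();
        for (int subtask = 0; subtask < parallelism; subtask++) {
            if (!result.containsKey(subtask)) {
                if (!leftovers.hasNext()) {
                    throw new IllegalStateException("Not enough free slots");
                }
                result.put(subtask, leftovers.next());
            }
        }
        return result;
    }
}
```

In a non-rescaling restart where the same slots are offered again, the returned mapping equals the previous one, which is the behaviour a regression test of this kind would assert.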

Izeren · 2025-06-16T08:16:34Z

I am still looking into a few other test failures in AdaptiveScheduler.

lsyldliu · 2025-06-18T01:48:59Z

@Izeren Hi, can we push this fix before code freeze time?

Izeren · 2025-06-18T09:41:35Z

@dmvk, would you have time to have a second look today?

rkhachatryan reviewed Jun 10, 2025

dmvk self-requested a review June 11, 2025 07:56

Izeren closed this Jun 11, 2025

Izeren force-pushed the master branch from bfc9dfa to 0195f00 Compare June 11, 2025 09:06

Izeren reopened this Jun 11, 2025

Izeren closed this Jun 13, 2025

Izeren force-pushed the master branch from 53b459d to cefdee3 Compare June 13, 2025 15:45

Izeren reopened this Jun 13, 2025

Izeren force-pushed the master branch from 21f87b1 to b191940 Compare June 13, 2025 16:21

Izeren closed this Jun 16, 2025

Izeren force-pushed the master branch from b191940 to 9fbcc0f Compare June 16, 2025 09:21

[FLINK-37701] Fix AdaptiveScheduler ignoring checkpoint states sizes for local recovery adjustment. (042e752)

Izeren reopened this Jun 16, 2025
