[FLINK-37867][state/forst] Ensure files of half-uploaded checkpoints are cleaned when using path copying. #26696

Open
wants to merge 1 commit into
base: master

AlexYinHan
Contributor

What is the purpose of the change

This pull request makes the ForSt StateBackend register uploaded state handles with tmpResourcesRegistry when path copying is used for checkpointing, so that the uploaded files of a half-uploaded checkpoint are cleaned up as expected.

Brief change log

  • CopyDataTransferStrategy registers the uploaded state handles to tmpResourcesRegistry
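
The idea behind this change can be illustrated with a minimal, self-contained sketch. It does not use Flink classes: FakeCheckpointStorage, TmpRegistry, PathCopySketch and their methods are hypothetical stand-ins, and closing the registry only when the checkpoint does not complete is an assumption about how the cleanup is triggered, not a description of the actual ForSt code.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory stand-in for checkpoint storage; not a Flink class. */
class FakeCheckpointStorage {
    final Set<String> files = new HashSet<>();

    /** Simulates path copying: the file appears in checkpoint storage. */
    String copy(String sourcePath) {
        String target = "chk/" + sourcePath;
        files.add(target);
        return target;
    }

    void delete(String path) {
        files.remove(path);
    }
}

/** Simplified registry: closing it runs every registered cleanup action. */
class TmpRegistry implements Closeable {
    private final List<Closeable> closeables = new ArrayList<>();

    void registerCloseable(Closeable closeable) {
        closeables.add(closeable);
    }

    @Override
    public void close() throws IOException {
        for (Closeable closeable : closeables) {
            closeable.close();
        }
        closeables.clear();
    }
}

class PathCopySketch {
    /** Copies a file and immediately registers a cleanup action for the copy. */
    static String copyAndRegister(
            FakeCheckpointStorage storage, String source, TmpRegistry registry) {
        String uploaded = storage.copy(source);
        registry.registerCloseable(() -> storage.delete(uploaded));
        return uploaded;
    }

    /** Runs a two-file "checkpoint" and returns how many files remain in storage. */
    static int runCheckpoint(boolean failMidway) {
        FakeCheckpointStorage storage = new FakeCheckpointStorage();
        TmpRegistry registry = new TmpRegistry();
        boolean completed = false;
        try {
            copyAndRegister(storage, "a.sst", registry);
            if (failMidway) {
                throw new IllegalStateException("simulated failure between uploads");
            }
            copyAndRegister(storage, "b.sst", registry);
            completed = true;
        } catch (IllegalStateException e) {
            // The checkpoint is abandoned; cleanup happens in the finally block.
        } finally {
            if (!completed) {
                try {
                    registry.close(); // removes every file uploaded so far
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        }
        return storage.files.size();
    }

    public static void main(String[] args) {
        System.out.println("files left after failed checkpoint: " + runCheckpoint(true));
        System.out.println("files left after completed checkpoint: " + runCheckpoint(false));
    }
}
```

If the copied file were not registered, the failed run in this sketch would leave chk/a.sst behind, which is the kind of leftover this pull request is meant to prevent.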

Verifying this change

This change added tests and can be verified as follows:

  • DataTransferStrategyTest#testUncompletedCheckpoint checks that the files of an uncompleted checkpoint are cleaned up

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)

@flinkbot
Collaborator

flinkbot commented Jun 18, 2025

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@@ -153,7 +157,8 @@ private HandleAndLocalPath copyFileToCheckpoint(
     private @Nullable StreamStateHandle tryPathCopyingToCheckpoint(
             @Nonnull StreamStateHandle sourceHandle,
             CheckpointStreamFactory checkpointStreamFactory,
-            CheckpointedStateScope stateScope) {
+            CheckpointedStateScope stateScope,
+            CloseableRegistry tmpResourcesRegistry) {
Contributor

Please add the new parameter to the Javadoc.

@@ -163,7 +168,10 @@ private HandleAndLocalPath copyFileToCheckpoint(
             List<StreamStateHandle> result =
                     checkpointStreamFactory.duplicate(
                             Collections.singletonList(sourceHandle), stateScope);
-            return result.get(0);
+            StreamStateHandle resultStateHandle = result.get(0);
+            tmpResourcesRegistry.registerCloseable(
Contributor

davidradl commented Jun 18, 2025

I think it would be useful to add comments or Javadoc around this logic. It is not obvious to me why we are discarding state quietly in a copy method; some comments detailing the thinking here would be good.

I think we are copying information for a failure case. Is it worth adding at least a debug log around this cleanup?

The Jira says: clean up half-uploaded checkpoints. How do we get half-uploaded checkpoints? Is there a case for fixing this at the source, to prevent these half-uploaded files from occurring in the first place?
