[tune] Checkpoints are only partially transferred thus cannot be resumed #6226

Closed
wjaskowski opened this issue Nov 21, 2019 · 7 comments · Fixed by #6376

wjaskowski commented Nov 21, 2019

I have been fighting with checkpointing in a scenario where spot instances get interrupted. I have found many problems and (hopefully) some workarounds.

The first is the "--remove-source-files" option you apply to rsync. Apparently, there is some race between making checkpoints and rsync'ing. The result is that if I leave this option in place, my checkpoint files get truncated and, obviously, I cannot resume from a checkpoint. For now, I have disabled this option by setting sync_to_driver=log_sync_template(), but this is a poor workaround given limited disk space.

What is more, with this workaround the setting keep_checkpoints_num=3 causes the files to be deleted for a moment and then rsynced back again, which makes no sense. It seems that you are syncing the whole experiment directory, which might be the root of all these problems.

The other problems:

  1. When a spot instance dies, a new one needs to be launched. This takes time and can easily lead to timeouts in RayTrialExecutor, so I nulled out all the timeouts in restore and fetch_result.
  2. In RayTrialExecutor._setup_remote_runner.logger_creator, there is a race condition when checking whether the log dir exists (it actually happened to me), so this should be changed to os.makedirs(remote_logdir, exist_ok=True) (see the sketch below).
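
A minimal sketch of the fix suggested in point 2, assuming a simplified logger_creator; the function body and the config key are illustrative, not Tune's actual implementation:

    import os

    def logger_creator(config):
        # Illustrative path; in Tune the remote_logdir comes from the trial setup.
        remote_logdir = config.get("logdir", "/tmp/ray_results/trial")
        # Racy pattern: two processes can both see the directory as missing,
        # and the later os.makedirs() call then raises FileExistsError:
        #   if not os.path.exists(remote_logdir):
        #       os.makedirs(remote_logdir)
        # Idempotent version suggested above:
        os.makedirs(remote_logdir, exist_ok=True)
        return remote_logdir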

EDIT: I realized that the current design is still not fully robust against an instant death of a spot instance, since rsync can be interrupted at any point during copying. To obtain full robustness, checkpoints should probably be marked as "ready" only once they are fully copied (md5 checksum?), and resume should happen only from "ready" checkpoints.
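
As a rough illustration of the "ready" annotation idea (these helper functions and the READY.json marker name are hypothetical, not part of Tune): write an md5 manifest only after every checkpoint file is fully written, and accept a checkpoint on resume only if the manifest verifies.

    import hashlib
    import json
    import os

    READY_MARKER = "READY.json"

    def mark_checkpoint_ready(checkpoint_dir):
        # Written last, after every checkpoint file has been flushed and closed.
        digests = {}
        for name in os.listdir(checkpoint_dir):
            if name == READY_MARKER:
                continue
            with open(os.path.join(checkpoint_dir, name), "rb") as f:
                digests[name] = hashlib.md5(f.read()).hexdigest()
        with open(os.path.join(checkpoint_dir, READY_MARKER), "w") as f:
            json.dump(digests, f)

    def checkpoint_is_ready(checkpoint_dir):
        # A checkpoint is usable only if the marker exists and all digests match.
        marker = os.path.join(checkpoint_dir, READY_MARKER)
        if not os.path.exists(marker):
            return False
        with open(marker) as f:
            digests = json.load(f)
        for name, digest in digests.items():
            path = os.path.join(checkpoint_dir, name)
            if not os.path.exists(path):
                return False
            with open(path, "rb") as f:
                if hashlib.md5(f.read()).hexdigest() != digest:
                    return False
        return True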

@ujvl ujvl changed the title checkpoints are only partially transferred thus cannot be resumed [tune] Checkpoints are only partially transferred thus cannot be resumed Nov 21, 2019
@ujvl ujvl self-assigned this Nov 21, 2019

ujvl commented Nov 21, 2019

Apparently, there is some race between making checkpoints and rsync'ing. The result is that if I leave this option in place, my checkpoint files get truncated and, obviously, I cannot resume from a checkpoint.

Just to be clear, this "truncation" is happening only if the instance dies during the rsync, correct? Do you see a Sync error logged? If so, the corrupted checkpoint isn't used by tune for restoration (are you trying to restore from it manually somehow?).

I don't think the flag has anything to do with truncating the checkpoint on the driver's side, since it only enables deletion of the worker's copy.

It seems that you are syncing the whole experiment directory, which might be the root of all these problems.

Yeah, thanks for pointing this out; it's a known problem we intend to fix soon. In this case, what you're seeing happens because the source files are no longer deleted (due to the removal of --remove-source-files).

When a spot instance dies, a new one needs to be launched. This takes time and can easily lead to timeouts in RayTrialExecutor, so I nulled out all the timeouts in restore and fetch_result.

Can you describe:

  • the failure scenario,
  • what your cluster looks like,
  • whether you're using the autoscaler, and
  • your tune.run line with its params?

Were no workers available to start the restored trial? If so, tune should just hang until there are resources available. However, if a worker is killed during restoration, the timeout will likely occur and it should fall back to restarting the trial.

To obtain full robustness, checkpoints should probably be marked as "ready" only once they are fully copied (md5 checksum?), and resume should happen only from "ready" checkpoints.

Even without a checksum, the resume shouldn't be happening with corrupted checkpoints. Instead, if rsync is interrupted you should see an error logged (Error handling checkpoint ... Sync error ({code}): {error message}) and the checkpoint remains unused.

wjaskowski commented

Just to be clear, this "truncation" is happening only if the instance dies during the rsync, correct? Do you see a Sync error logged? If so, the corrupted checkpoint isn't used by tune for restoration (are you trying to restore from it manually somehow?).

I am sure (double-checked) that this truncation only happens when --remove-source-files is enabled.

No sync errors in those cases (*). I also inspected sync_log.log and everything was fine there, i.e., it declared sending as many bytes as there were in the destination (which is fewer than there should be). I was using automatic ray restore (and also tried restoring manually, to make sure that the files were corrupt). What is more, it was not only the last checkpoint that was corrupt: nearly all of them were.

By a race I meant (as a hypothesis) that rsync starts syncing as soon as the checkpoint file begins to be written, before it is finished. Before writing completes, rsync already sees the file, copies it, and deletes it (when "--remove-source-files" is on). Maybe I experienced this only because I use gzip.open for creating checkpoints, which builds the file incrementally. A workaround would be to write the checkpoint to a tmp location first and then copy it to the target location in one go, although if it really is a race it could happen in that case as well.

(*) Previously, I was experiencing sync errors when I was checkpointing very often. I think this might have been a manifestation of the same problem (a combination of --remove-source-files and keep_checkpoints_num).

Can you describe the failure scenario

I create an autoscaler cluster on AWS with 1 head node (CPU) and 1 worker (GPU), and have a single job (for testing purposes) that requires a GPU. After a while (several checkpoints in), I terminate the node from the EC2 console and wait for another node to come up.

    ray.tune.run(AbusedTrainable, num_samples=1, resources_per_trial={"cpu": 1, "gpu": 0.5},
                 config={
                     'env_name': ray.tune.grid_search(ENVS),
                     'seed': ray.tune.grid_search([123])
                 },
                 verbose=2,
                 queue_trials=True,
                 max_failures=20,
                 checkpoint_freq=1000,    # Must be >0 but does not matter since I use should_checkpoint myself
                 # keep_checkpoints_num=3,  # Currently, makes no sense
                 checkpoint_at_end=True,
                 local_dir=RAY_OUT_DIR,
                 trial_executor=WaitRayTrialExecutor(queue_trials=True, reuse_actors=False, ray_auto_init=False),  # My version without timeouts
                 sync_to_driver=log_sync_template()  # This way we prevent the problem with "--remove-source-files"
                 )

Were no workers available to start the restored trial?

Not immediately.

If so, tune should just hang until there are resources available.

Yep. This is what happens.

However, if a worker is killed during restoration, the timeout will likely occur and it should fall back to restarting the trial.

I did not observe a worker being killed during restoration. I observed that newly created instances are quite slow right after they spawn, and I think this is what causes the timeouts, which I had to set to None or a very high value. The timeouts lead to restarting the job from scratch.

Even without a checksum, the resume shouldn't be happening with corrupted checkpoints

Thx. Good to know.


ujvl commented Nov 22, 2019

By a race I meant (as a hypothesis) that rsync starts syncing as soon as the checkpoint file begins to be written, before it is finished. Before writing completes, rsync already sees the file, copies it, and deletes it (when "--remove-source-files" is on).

Ah, I think I understand what's going on; thanks for flagging this. I'll try to push a fix within the next couple of days.

You're probably right, there seems to be a race condition between the log rsync (not the checkpoint rsync) and the checkpoint writes. Sharing the same directory is definitely problematic.

I did not observe a worker being killed during restoration. I observed that newly created instances are quite slow right after they spawn, and I think this is what causes the timeouts, which I had to set to None or a very high value. The timeouts lead to restarting the job from scratch.

Got it, this should be a straightforward fix.

wjaskowski commented

You're probably right, there seems to be a race condition between the log rsync (not the checkpoint rsync) and the checkpoint writes. Sharing the same directory is definitely problematic.

Well, I don't fully understand how Ray works under the hood, so I am not sure, but let me add something: I changed my checkpointing behavior so that I first save my checkpoint to a temp file and, when it is ready, move it to the target location (this should be an atomic operation on Linux). This solves the problem (no truncations anymore, as far as I can tell). So this is further evidence that rsync tries to copy the file before writing to it has finished (before the file is closed).
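
A minimal sketch of this write-then-rename pattern, assuming a gzip-pickled checkpoint (the helper name is mine, not Tune's API): the checkpoint is written to a temporary file in the same directory and only renamed onto the final name once it is complete, so rsync never sees a half-written file under the target name.

    import gzip
    import os
    import pickle
    import tempfile

    def save_checkpoint_atomically(state, final_path):
        checkpoint_dir = os.path.dirname(final_path)
        fd, tmp_path = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
        try:
            with os.fdopen(fd, "wb") as raw, gzip.GzipFile(fileobj=raw, mode="wb") as gz:
                pickle.dump(state, gz)
            # Atomic on Linux when source and target are on the same filesystem.
            os.replace(tmp_path, final_path)
        except BaseException:
            os.remove(tmp_path)
            raise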


ujvl commented Nov 22, 2019

Yeah, that would definitely work (although the mv needs to be done on the whole checkpoint dir, since there are multiple files), but longer-term we should fix the full-directory sync issue regardless.
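
A rough sketch of the whole-directory variant (hypothetical helper, not part of Tune): write all checkpoint files into a temporary sibling directory, then rename the directory into place in one step.

    import os
    import tempfile

    def save_checkpoint_dir_atomically(write_files, final_dir):
        parent = os.path.dirname(os.path.normpath(final_dir))
        tmp_dir = tempfile.mkdtemp(dir=parent, suffix=".tmp")
        write_files(tmp_dir)  # caller writes every checkpoint file into tmp_dir
        # Atomic on the same filesystem; final_dir must not exist yet.
        os.rename(tmp_dir, final_dir)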


ericl commented Dec 6, 2019

Hmm, why not disable --remove-source-files? I don't think disk space is so scarce a resource that we should go to the trouble of structuring the log files around this.


ujvl commented Jan 3, 2020

@wjaskowski this should be fixed now, but let me know if you run into any other errors.

@richardliaw richardliaw added the tune Tune-related issues label Mar 5, 2020