[tune] Checkpoints are only partially transferred thus cannot be resumed #6226

Closed
wjaskowski opened this issue Nov 21, 2019 · 7 comments · Fixed by #6376

wjaskowski commented Nov 21, 2019

I have been fighting with checkpointing in a scenario where spot instances get interrupted. I have found many problems and (hopefully) some workarounds.

The first is the "--remove-source-files" option you apply to rsync. Apparently, there is some race between making checkpoints and rsync'ing. The result is that if I leave this option in place, my checkpoint files get truncated and, obviously, I cannot resume from a checkpoint. For now, I have disabled this option by setting sync_to_driver=log_sync_template(), but this is a poor workaround given limited disk space.

What is more, with this workaround the setting keep_checkpoints_num=3 causes the files to be deleted for a moment and then rsynced back again, which makes no sense. It seems that you are syncing the whole experiment directory, which might be the root of all these problems.

The other problems:

  1. When a spot instance dies, a new one needs to be launched. This takes time and can easily lead to timeouts in RayTrialExecutor, so I nulled out all the timeouts in restore and fetch_result.
  2. In RayTrialExecutor._setup_remote_runner.logger_creator, there is a race condition when checking whether the log dir exists (it actually happened to me), so this should be changed to os.makedirs(remote_logdir, exist_ok=True) (see the sketch below).
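
A minimal sketch of the fix suggested in point 2, assuming a simplified logger_creator; the function body and the config key are illustrative, not Tune's actual implementation:

    import os

    def logger_creator(config):
        # Illustrative path; in Tune the remote_logdir comes from the trial setup.
        remote_logdir = config.get("logdir", "/tmp/ray_results/trial")
        # Racy pattern: two processes can both see the directory as missing,
        # and the later os.makedirs() call then raises FileExistsError:
        #   if not os.path.exists(remote_logdir):
        #       os.makedirs(remote_logdir)
        # Idempotent version suggested above:
        os.makedirs(remote_logdir, exist_ok=True)
        return remote_logdir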

EDIT: I realized that the current design is still not fully robust against an instant death of a spot instance, since rsync can be interrupted at any point during copying. To obtain full robustness, checkpoints should probably be marked as "ready" only once they are fully copied (md5 checksum?), and resume should happen only from "ready" checkpoints.
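
As a rough illustration of the "ready" annotation idea (these helper functions and the READY.json marker name are hypothetical, not part of Tune): write an md5 manifest only after every checkpoint file is fully written, and accept a checkpoint on resume only if the manifest verifies.

    import hashlib
    import json
    import os

    READY_MARKER = "READY.json"

    def mark_checkpoint_ready(checkpoint_dir):
        # Written last, after every checkpoint file has been flushed and closed.
        digests = {}
        for name in os.listdir(checkpoint_dir):
            if name == READY_MARKER:
                continue
            with open(os.path.join(checkpoint_dir, name), "rb") as f:
                digests[name] = hashlib.md5(f.read()).hexdigest()
        with open(os.path.join(checkpoint_dir, READY_MARKER), "w") as f:
            json.dump(digests, f)

    def checkpoint_is_ready(checkpoint_dir):
        # A checkpoint is usable only if the marker exists and all digests match.
        marker = os.path.join(checkpoint_dir, READY_MARKER)
        if not os.path.exists(marker):
            return False
        with open(marker) as f:
            digests = json.load(f)
        for name, digest in digests.items():
            path = os.path.join(checkpoint_dir, name)
            if not os.path.exists(path):
                return False
            with open(path, "rb") as f:
                if hashlib.md5(f.read()).hexdigest() != digest:
                    return False
        return True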

@ujvl ujvl changed the title checkpoints are only partially transferred thus cannot be resumed [tune] Checkpoints are only partially transferred thus cannot be resumed Nov 21, 2019
@ujvl ujvl self-assigned this Nov 21, 2019

ujvl commented Nov 21, 2019

Apparently, there is some race between making checkpoints and rsync'ing. The result is that if I leave this option in place, my checkpoint files get truncated and, obviously, I cannot resume from a checkpoint.

Just to be clear, this "truncation" is happening only if the instance dies during the rsync, correct? Do you see a Sync error logged? If so, the corrupted checkpoint isn't used by tune for restoration (are you trying to restore from it manually somehow?).

I don't think the flag has anything to do with truncating the checkpoint on the driver's side, since it only enables deletion of the worker's copy.

It seems that you are syncing the whole experiment directory, which might be the root of all these problems.

Yeah, thanks for pointing this out; it's a known problem we intend to fix soon. In this case, what you're seeing happens because the source files are no longer deleted (due to the removal of --remove-source-files).

When a spot instance dies, a new one needs to be launched. This takes time and can easily lead to timeouts in RayTrialExecutor, so I nulled out all the timeouts in restore and fetch_result.

Can you describe:

  • the failure scenario,
  • what your cluster looks like,
  • whether you're using the autoscaler, and
  • your tune.run line with its params?

Were no workers available to start the restored trial? If so, tune should just hang until there are resources available. However, if a worker is killed during restoration, the timeout will likely occur and it should fall back to restarting the trial.

To obtain full robustness, checkpoints should probably be marked as "ready" only once they are fully copied (md5 checksum?), and resume should happen only from "ready" checkpoints.

Even without a checksum, the resume shouldn't be happening with corrupted checkpoints. Instead, if rsync is interrupted you should see an error logged (Error handling checkpoint ... Sync error ({code}): {error message}) and the checkpoint remains unused.

wjaskowski commented

Just to be clear, this "truncation" is happening only if the instance dies during the rsync, correct? Do you see a Sync error logged? If so, the corrupted checkpoint isn't used by tune for restoration (are you trying to restore from it manually somehow?).

I am sure (double-checked) that this truncation only happens when --remove-source-files is enabled.

No sync errors in those cases (*). I also inspected sync_log.log and everything was fine there, i.e., it declared sending as many bytes as there were in the destination (which is fewer than there should be). I was using automatic ray restore (and also tried restoring manually, to make sure that the files were corrupt). What is more, it was not only the last checkpoint that was corrupt: nearly all of them were.

By a race I meant (as a hypothesis) that rsync starts syncing as soon as the checkpoint file begins to be written, before it is finished. Before writing completes, rsync already sees the file, copies it, and deletes it (when "--remove-source-files" is on). Maybe I experienced this only because I use gzip.open for creating checkpoints, which builds the file incrementally. A workaround would be to write the checkpoint to a tmp location first and then copy it to the target location in one go, although if it really is a race it could happen in that case as well.

(*) Previously, I was experiencing sync errors when I was checkpointing very often. I think this might have been a manifestation of the same problem (a combination of --remove-source-files and keep_checkpoints_num).

Can you describe the failure scenario

I create an autoscaler cluster on AWS with 1 head node (CPU) and 1 worker (GPU), and have a single job (for testing purposes) that requires a GPU. After a while (several checkpoints in), I terminate the node from the EC2 console and wait for another node to come up.

    ray.tune.run(AbusedTrainable, num_samples=1, resources_per_trial={"cpu": 1, "gpu": 0.5},
                 config={
                     'env_name': ray.tune.grid_search(ENVS),
                     'seed': ray.tune.grid_search([123])
                 },
                 verbose=2,
                 queue_trials=True,
                 max_failures=20,
                 checkpoint_freq=1000,    # Must be >0 but does not matter since I use should_checkpoint myself
                 # keep_checkpoints_num=3,  # Currently, makes no sense
                 checkpoint_at_end=True,
                 local_dir=RAY_OUT_DIR,
                 trial_executor=WaitRayTrialExecutor(queue_trials=True, reuse_actors=False, ray_auto_init=False),  # My version without timeouts
                 sync_to_driver=log_sync_template()  # This way we prevent the problem with "--remove-source-files"
                 )

Were no workers available to start the restored trial?

Not immediately.

If so, tune should just hang until there are resources available.

Yep. This is what happens.

However, if a worker is killed during restoration, the timeout will likely occur and it should fall back to restarting the trial.

I did not observe a worker being killed during restoration. I observed that newly created instances are quite slow right after they spawn, and I think this is what causes the timeouts, which I had to set to None or a very high value. The timeouts lead to restarting the job from scratch.

Even without a checksum, the resume shouldn't be happening with corrupted checkpoints

Thx. Good to know.


ujvl commented Nov 22, 2019

By a race I meant (as a hypothesis) that rsync starts syncing as soon as the checkpoint file begins to be written, before it is finished. Before writing completes, rsync already sees the file, copies it, and deletes it (when "--remove-source-files" is on).

Ah, I think I understand what's going on; thanks for flagging this. I'll try to push a fix within the next couple of days.

You're probably right, there seems to be a race condition between the log rsync (not the checkpoint rsync) and the checkpoint writes. Sharing the same directory is definitely problematic.

I did not observe a worker being killed during restoration. I observed that newly created instances are quite slow right after they spawn, and I think this is what causes the timeouts, which I had to set to None or a very high value. The timeouts lead to restarting the job from scratch.

Got it, this should be a straightforward fix.

wjaskowski commented

You're probably right, there seems to be a race condition between the log rsync (not the checkpoint rsync) and the checkpoint writes. Sharing the same directory is definitely problematic.

Well, I don't fully understand how Ray works under the hood, so I am not sure, but let me add something: I changed my checkpointing behavior so that I first save my checkpoint to a temp file and, when it is ready, move it to the target location (this should be an atomic operation on Linux). This solves the problem (no truncations anymore, as far as I can tell). So this is further evidence that rsync tries to copy the file before writing to it has finished (before the file is closed).
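
A minimal sketch of this write-then-rename pattern, assuming a gzip-pickled checkpoint (the helper name is mine, not Tune's API): the checkpoint is written to a temporary file in the same directory and only renamed onto the final name once it is complete, so rsync never sees a half-written file under the target name.

    import gzip
    import os
    import pickle
    import tempfile

    def save_checkpoint_atomically(state, final_path):
        checkpoint_dir = os.path.dirname(final_path)
        fd, tmp_path = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
        try:
            with os.fdopen(fd, "wb") as raw, gzip.GzipFile(fileobj=raw, mode="wb") as gz:
                pickle.dump(state, gz)
            # Atomic on Linux when source and target are on the same filesystem.
            os.replace(tmp_path, final_path)
        except BaseException:
            os.remove(tmp_path)
            raise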


ujvl commented Nov 22, 2019

Yeah, that would definitely work (although the mv needs to be done on the whole checkpoint dir, since there are multiple files), but longer-term we should fix the full-directory sync issue regardless.
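
A rough sketch of the whole-directory variant (hypothetical helper, not part of Tune): write all checkpoint files into a temporary sibling directory, then rename the directory into place in one step.

    import os
    import tempfile

    def save_checkpoint_dir_atomically(write_files, final_dir):
        parent = os.path.dirname(os.path.normpath(final_dir))
        tmp_dir = tempfile.mkdtemp(dir=parent, suffix=".tmp")
        write_files(tmp_dir)  # caller writes every checkpoint file into tmp_dir
        # Atomic on the same filesystem; final_dir must not exist yet.
        os.rename(tmp_dir, final_dir)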


ericl commented Dec 6, 2019

Hmm, why not disable --remove-source-files? I don't think disk space is so scarce a resource that we should go to the trouble of structuring the log files around this.


ujvl commented Jan 3, 2020

@wjaskowski this should be fixed now, but let me know if you run into any other errors.

@richardliaw richardliaw added the tune Tune-related issues label Mar 5, 2020