[tune] Checkpoints are only partially transferred thus cannot be resumed #6226
Comments
Just to be clear, this "truncation" is happening only if the instance dies during the rsync, correct? Do you see a sync error in the logs? I don't think the flag has anything to do with truncating the checkpoint on the driver's side, since it only enables deletion of the worker's copy.
Yeah, thanks for pointing this out, it's a known problem we intend to fix soon. In this case, what you're seeing happens because the source files are no longer deleted (due to the removal of `--remove-source-files`).
Can you describe:

- Were no workers available to start the restored trial? If so, Tune should just hang until there are resources available. However, if a worker is killed during restoration, the timeout will likely occur, and it should fall back to restarting the trial.
- Even without using a checksum, the resume shouldn't be happening with corrupted checkpoints. Instead, if the rsync is interrupted, you should see an error logged.
I am sure (double-checked) that this truncation only happens when `--remove-source-files` is used; there were no sync errors in those cases (*). I also inspected the checkpoint files themselves.

(*) Previously, I was experiencing sync errors when I was checkpointing very often. I think this might have been a manifestation of the same problem.
I create an autoscaler cluster on AWS with 1 head (CPU) and 1 worker (GPU), and have a single job (for testing purposes) that requires a GPU. After a while (several checkpoints done), I terminate the node from the EC2 console and wait for another node to go up.

```python
ray.tune.run(
    AbusedTrainable,
    num_samples=1,
    resources_per_trial={"cpu": 1, "gpu": 0.5},
    config={
        'env_name': ray.tune.grid_search(ENVS),
        'seed': ray.tune.grid_search([123]),
    },
    verbose=2,
    queue_trials=True,
    max_failures=20,
    checkpoint_freq=1000,  # Must be >0 but does not matter since I use should_checkpoint myself
    # keep_checkpoints_num=3,  # Currently, makes no sense
    checkpoint_at_end=True,
    local_dir=RAY_OUT_DIR,
    trial_executor=WaitRayTrialExecutor(  # My version without timeouts
        queue_trials=True, reuse_actors=False, ray_auto_init=False),
    sync_to_driver=log_sync_template(),  # This way we prevent the problem with "--remove-source-files"
)
```
Not immediately.
Yep. This is what happens.
I did not observe a worker being killed during restoration. I did observe that newly created instances are quite slow right after they spin up, and I think this is what causes the timeouts, which I had to set to None or a very high value. The timeouts then lead to restarting the job from scratch.
Thx. Good to know.
Ah, I think I understand what's going on; thanks for flagging this. I'll try to push a fix within the next couple of days. You're probably right: there seems to be a race condition between the log rsync (not the checkpoint rsync) and the checkpoint writes. Sharing the same directory is definitely problematic.
Got it, this should be a straightforward fix.
Well, I don't fully understand how Ray works under the hood, so I am not sure, but let me add something: I changed my checkpointing behavior so that I first save my checkpoint to a temp file and, when it is ready, move it to the target location (this should be an atomic operation in Linux). And this solves the problem (no truncations anymore, as far as I can tell). So this is more evidence that rsync tries to copy the file before writing to it has finished (before the file is closed).
Yeah, that would definitely work (although the mv needs to be done on the whole checkpoint dir, since there are multiple files). Longer-term, we should fix the full-directory sync issue regardless.
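For reference, a minimal sketch of the temp-then-rename trick discussed above, extended to a whole checkpoint directory. The function and variable names here are illustrative, not part of Tune's API:

```python
import os
import tempfile

def save_checkpoint_atomically(write_files, final_dir):
    """Write checkpoint files into a temp dir, then rename it into place.

    os.rename is atomic on POSIX when source and destination are on the
    same filesystem, so a concurrent rsync never sees a half-written
    checkpoint under the final name. Assumes final_dir does not exist
    yet (e.g. a fresh checkpoint_00042 directory).
    """
    parent = os.path.dirname(os.path.abspath(final_dir))
    tmp_dir = tempfile.mkdtemp(dir=parent, prefix=".ckpt_tmp_")
    write_files(tmp_dir)           # caller writes and closes all checkpoint files
    os.rename(tmp_dir, final_dir)  # atomic swap into the synced location
```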
Hmm, why not just disable `--remove-source-files`? I don't think disk space is that scarce a resource that we should go to the trouble of structuring the log files around it.
@wjaskowski this should be fixed now, but let me know if you run into any other errors.
System information
I have been fighting with checkpointing in a scenario where spot instances get interrupted. I have found many problems and (hopefully) some workarounds.
The first is the `--remove-source-files` option you apply to `rsync`. Apparently, there is some race between making checkpoints and rsync'ing: if I don't remove this option, my checkpoint files are truncated and, obviously, I cannot resume from a checkpoint. Currently, I disable this option by setting `sync_to_driver=log_sync_template()`, but this is a poor workaround given limited disk space. What is more, the setting `keep_checkpoints_num=3` now makes the files get deleted for a second but then rsynced back again, which makes no sense. It seems that you are syncing the whole experiment directory, which might be the root of all those problems.

The other problems:
- `RayTrialExecutor` uses timeouts that freshly spawned (and initially slow) instances fail to meet, thus I nulled all the timeouts in `restore` and `fetch_result`.
- In `RayTrialExecutor._setup_remote_runner.logger_creator`, there is a race condition (actually, it happened to me) when checking for the log dir's existence, so this should be changed to `os.makedirs(remote_logdir, exist_ok=True)` (see the sketch below).
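To illustrate the race being described, a minimal sketch with a hypothetical `remote_logdir` path; the check-then-create pattern can fail when two processes pass the check at the same time:

```python
import os

remote_logdir = "/tmp/ray_results/trial_0"  # hypothetical path, for illustration

# Racy check-then-create: two workers can both see the dir as missing,
# and the slower one's makedirs() then raises FileExistsError.
if not os.path.exists(remote_logdir):
    os.makedirs(remote_logdir)

# Race-free version suggested above: creation is idempotent.
os.makedirs(remote_logdir, exist_ok=True)
```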
EDIT: I realized that the current design is still not fully robust against the instant death of a spot instance, since rsync can be interrupted at any time during copying. To obtain full robustness, checkpoints should probably be annotated as "ready" once they are fully copied (md5?), and resume should happen using only "ready" checkpoints.
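One possible shape of this "ready" annotation, sketched here with hypothetical names (not Tune's API): write a marker file containing a content hash last, and treat only checkpoints with a matching marker as resumable.

```python
import hashlib
import os

READY_MARKER = ".ready"  # hypothetical marker name, not a Tune convention

def _dir_md5(path):
    """Hash the contents of all regular files in the dir, in a stable order."""
    digest = hashlib.md5()
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if name == READY_MARKER or not os.path.isfile(full):
            continue
        with open(full, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()

def mark_ready(checkpoint_dir):
    # Written last, only after every checkpoint file is fully flushed to disk.
    with open(os.path.join(checkpoint_dir, READY_MARKER), "w") as f:
        f.write(_dir_md5(checkpoint_dir))

def is_ready(checkpoint_dir):
    # Resume logic would skip any checkpoint dir that fails this test,
    # e.g. one whose transfer was cut short by an instance death.
    marker = os.path.join(checkpoint_dir, READY_MARKER)
    if not os.path.exists(marker):
        return False
    with open(marker) as f:
        return f.read() == _dir_md5(checkpoint_dir)
```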