[Storage] Adding new storage mode CSYNC #2336
Conversation
merge master
Thanks for adding this feature @landscapepainter! This is very exciting. Could you merge the latest master branch? I will try this PR with our Vicuna examples. Left several questions.
sky/data/skystorage.py
Outdated
def set_s3_sync_cmd(src_path: str, bucketname: str, num_threads: int,
                    delete: bool, no_follow_symlinks: bool):
    """Builds sync command for aws s3"""
    config_cmd = ('aws configure set default.s3.max_concurrent_requests '
This seems to apply not only to the following commands but to other s3-related commands as well. For example, it may affect the mount mode and copy mode. Is there a way to limit the change to the current command only?
In COPY mode, this method remains unaffected because _execute_storage_csync is executed after all files required for COPY mode have been downloaded to the remote node. Our MOUNT mode for s3 uses goofys, and goofys does not use the aws cli config; it only shares the credentials.
Unfortunately, the aws cli does not support a direct argument for increasing concurrency, so I set a safeguard at the run_sync method to set the config back to the default value.
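A minimal sketch of scoping the AWS CLI concurrency setting to a single sync invocation: raise it, run the sync, and always restore the default afterwards. The function name and parameters here are illustrative, not from the PR; the default of 10 for `max_concurrent_requests` is the documented AWS CLI default.

```python
# Hypothetical sketch: build one shell command that raises the AWS CLI
# s3 concurrency, runs the sync, and restores the default even on failure.
def build_scoped_s3_sync_cmd(src_path: str, bucket_name: str,
                             num_threads: int) -> str:
    set_cmd = ('aws configure set default.s3.max_concurrent_requests '
               f'{num_threads}')
    sync_cmd = f'aws s3 sync {src_path} s3://{bucket_name}'
    # 10 is the AWS CLI default for max_concurrent_requests.
    reset_cmd = 'aws configure set default.s3.max_concurrent_requests 10'
    # `;` (not `&&`) before the reset, so the default is restored even
    # when the sync itself fails; other s3 commands (mount/copy mode) on
    # the same machine are then unaffected afterwards.
    return f'{set_cmd} && ({sync_cmd}); {reset_cmd}'
```

The same effect could be achieved in Python with a `try`/`finally` around two `subprocess.run` calls; building one shell string keeps it usable inside a single remote command.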
> In COPY mode, this method remains unaffected because _execute_storage_csync is executed after all files required for the COPY mode have been downloaded to the remote node.

This is not true if a user runs sky launch on the same cluster twice, right? Resetting it after every run_sync sounds good to me.
What will happen if two such commands run simultaneously? IIUC, the second will overwrite the modification made by the first command?
@cblmemo You are right. If there is more than one CSYNC for S3 running in the cluster, and those two happen to have different numbers of threads set and run at the same time, such that the config for the second run is set before the sync is run for the first, the second one's modification will overwrite the first one's. But I don't think there would be any good reason for users to set different numbers of threads for different daemons running in the cluster. Perhaps I can document the expected behavior?
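One way to rule the race out entirely (not part of this PR; purely a sketch) is to serialize the configure-then-sync pair across daemons on the same node with an exclusive file lock, so one daemon's `max_concurrent_requests` setting cannot be overwritten between its `aws configure set` and its `aws s3 sync`. The lock path and function name below are assumptions.

```python
import fcntl
import subprocess

# Assumed lock path; any node-local path shared by the daemons would do.
LOCK_PATH = '/tmp/skypilot_s3_sync.lock'

def run_sync_with_lock(set_cmd: str, sync_cmd: str) -> None:
    """Run `set_cmd` then `sync_cmd` under an exclusive inter-process lock."""
    with open(LOCK_PATH, 'w') as lock_file:
        # Blocks until no other daemon holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            subprocess.run(set_cmd, shell=True, check=True)
            subprocess.run(sync_cmd, shell=True, check=True)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

The trade-off is that concurrent CSYNC daemons would sync one at a time, which may be acceptable given they already compete for uplink bandwidth.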
sky/data/skystorage.py
Outdated
except subprocess.CalledProcessError:
    if max_retries > 0:
        # TODO: display the error with remaining # of retries
        wait_time = interval / 2
Why is the wait time set to interval / 2?
This error handling originates from the error gsutil rsync raises when the training script is writing to the mounted directory while gsutil rsync is syncing at the same time. Because the size of the file gsutil rsync is uploading changes during the upload, the error occurs. It is simply resolved by rerunning the sync command. Since the error usually occurs after some progress on the sync, setting wait_time to interval would be too long, so I set it to half of that.
It sounds like a heuristic; can we add some comments in the code here? Also, is it better to use a normal backoff for the retry, or interval / 2?
Ideally, interval should be set to the amount of time it would take for the sync tool to upload the complete set of checkpoints, which keeps the storage and the node synced as closely as possible without overlapping launches of the sync commands. For example, it took ~900 seconds for gsutil rsync to sync 60GB of checkpoints from our original Vicuna example, so it would be nice to keep the interval at 900s in this case.
As the error we are trying to handle with the retry occurs when the training script happens to be writing to the mounted directory while gsutil rsync is syncing, it can take some time for the already running sync process to end. So I updated the code to use a Backoff object from common_utils with interval/2 as the initial backoff.
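A rough sketch of the retry-with-backoff idea discussed above. The `Backoff` class here is a minimal stand-in (I am not reproducing the exact API of sky's `common_utils.Backoff`); the initial delay of `interval / 2` mirrors the comment, and the multiplier of 2 is an assumption.

```python
import subprocess
import time

class Backoff:
    """Minimal exponential backoff: initial delay, doubled on each call."""
    def __init__(self, initial: float, multiplier: float = 2.0):
        self._delay = initial
        self._multiplier = multiplier

    def next_delay(self) -> float:
        delay = self._delay
        self._delay *= self._multiplier
        return delay

def run_sync_with_retries(sync_cmd: str, interval: float,
                          max_retries: int = 3) -> None:
    """Run the sync command, retrying with backoff on transient failures."""
    backoff = Backoff(initial=interval / 2)
    for attempt in range(max_retries + 1):
        try:
            subprocess.run(sync_cmd, shell=True, check=True)
            return
        except subprocess.CalledProcessError:
            if attempt == max_retries:
                raise
            # The transient gsutil rsync error (file changed mid-upload)
            # usually clears once the in-flight write finishes.
            time.sleep(backoff.next_delay())
```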
@cblmemo Finally, this is ready for a look! Majority of the logic updates are added to
For launching, on top of the original sync daemon, there are two FUSE processes launched: the storage mount on the csync_read_path, and
For terminating, we originally checked whether there was a sync process running, and terminated the sync daemon if there wasn't any. Now, on top of that, the termination process sequentially unmounts the FUSE processes, and then terminates the sync daemon.
Update: e2e testing with the checksum vicuna example, #2432, runs correctly.
Thanks for this amazing work!! Left some comments, mostly nits. Will take a look at https://github.com/landscapepainter/libfuse/ soon.
    return storage_mount_script


def _handle_fuse_process(fuse_cmd: str) -> Tuple[int, int, str]:
Add a docstring, especially for the return value.
Actually, why not use sky/utils/subprocess_utils.py::run?
It doesn't seem to have a good way to obtain the pid if used with subprocess.run().
Actually, I'm a little bit confused about why we need the pid here. IIRC, the with subprocess.Popen block will wait for the process to finish and then enter the following statements. Why do we need the pid of a process that has already finished?
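For reference, a small sketch of the `Popen` behavior under discussion: inside the `with` block the process is still running, so its pid is available and usable (e.g. for recording which FUSE process to unmount later); the wait only happens when the block is exited (or when `communicate()` is called). The `sleep` command stands in for a real FUSE mount command.

```python
import subprocess

# Popen exposes the pid immediately after spawn, before the process exits.
with subprocess.Popen(['sleep', '0.1'],
                      stdout=subprocess.PIPE,
                      stderr=subprocess.PIPE) as proc:
    pid = proc.pid                       # valid while the process runs
    stdout, stderr = proc.communicate()  # now explicitly wait for it
returncode = proc.returncode
```

With `subprocess.run()` there is no handle to the child until it has already exited, which is presumably why `Popen` was chosen here.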
Thanks for adding the fuse file!! Left some comments.
sky/data/csync/redirect-fuse.c
Outdated
free(read_path);
free(write_path);
set_destroy(&set);
Why do we need this set? Is it possible that under one dir there are two files with the same name?
It is possible to have two files with the same name, one in each directory.
Sorry, can you elaborate a little bit?
So the xmp_readdir file operation is called when ls is run at the mountpoint. We designed this in a way that ls displays all the files/directories from both the read directory (mounted from cloud storage) and the write directory (the write cache). If there is a file with an identical name in both directories, we only want to display it once.
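A Python analogue of that merge (illustrative only; the PR implements this in C inside xmp_readdir): walk both listings and use a set to emit each name once, which is the role the `set` freed by `set_destroy(&set)` plays.

```python
from typing import Iterable, List

def merged_listing(read_dir_entries: Iterable[str],
                   write_dir_entries: Iterable[str]) -> List[str]:
    """Merge two directory listings, keeping order, deduplicating names."""
    seen = set()           # same role as the set in redirect-fuse.c
    merged = []
    for name in list(read_dir_entries) + list(write_dir_entries):
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return merged
```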
Why not only display the read dir? IIUC it should include every file in the write dir?
Oh is this for the case when there are some new write files in the write dir that have not been synced to the cloud?
This PR is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This PR was closed because it has been stalled for 10 days with no activity.
This closes #1862
When using a managed spot job for training, it's necessary to save the checkpoints in cloud storage, as we want the training to recover from the checkpoint when preempted. And this requires the application to store the checkpoints in the cloud storage. However, writing the checkpoints to a cloud-storage-mounted directory takes a long time, especially for large models like LLMs, and this stalls the training, becoming the bottleneck.
To mitigate this issue, this PR adds a new storage mode, CSYNC, which runs a daemon to run sync commands periodically at the specified remote path. This way, the training application can write the checkpoints to an unmounted directory and the checkpoints get synced asynchronously, which makes the process much quicker. When a cluster running a sync command launched by csync is attempted to be stopped with sky stop/sky down, it waits until the sync is completed before proceeding to stop/down the cluster.

Note:
- backend_utils.wait_and_terminate_csync() within cli.py/_down_or_stop() can be implemented smoothly with storage info in the cluster table's metadata. Will implement this in a separate issue.
- backend_utils.wait_and_terminate_csync() does not support autostop/autodown yet. Will be implemented after this PR is merged.

TODO:
- _execute_storage_syncs to set up mode: CSYNC
- sky stop/down/autostop/autodown with cluster's metadata
- mode: CSYNC
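The CSYNC behavior described above can be sketched as a small daemon loop. This is a simplified illustration, not the PR's implementation (which lives in sky/data/skystorage.py); the function name and the stop-flag mechanism are assumptions.

```python
import subprocess
import time
from typing import Callable

def csync_daemon(sync_cmd: str, interval: float,
                 should_stop: Callable[[], bool]) -> None:
    """Periodically run `sync_cmd` until `should_stop()` returns True."""
    while not should_stop():
        # Each sync runs to completion; the stop flag is only checked
        # between syncs, so sky stop / sky down never interrupts an
        # in-flight sync of the checkpoints.
        subprocess.run(sync_cmd, shell=True, check=True)
        time.sleep(interval)
```

The key property is that the stop check sits outside the sync call: stopping the cluster waits for the current sync to finish, matching the behavior the description promises for sky stop/sky down.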
Tested (run the relevant ones):
- bash format.sh
- CSYNC mode
- CSYNC mode on a subdirectory of a bucket as a source
- sky stop on a cluster running a sync command launched by csync
- CSYNC mode to confirm if training recovers correctly from checkpoints
- pytest tests/test_smoke.py::TestStorageWithCredentials
- pytest tests/test_smoke.py::test_aws_storage_mounts_with_stop --aws
- pytest tests/test_smoke.py::test_gcp_storage_mounts_with_stop --gcp