Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Storage] Adding new storage mode CSYNC #2336

Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
155 commits
Select commit Hold shift + click to select a range
181513c
mode:sync prototype
landscapepainter Jun 24, 2023
e1b25b6
skystorage implementation
landscapepainter Jul 10, 2023
207665f
update
landscapepainter Jul 11, 2023
7370ea4
update interval support
landscapepainter Jul 12, 2023
05cec9a
nit
landscapepainter Jul 12, 2023
c2dcc05
nit
landscapepainter Jul 14, 2023
15d0228
cluster's storage metadata update
landscapepainter Jul 14, 2023
5047b8b
update without metadata
landscapepainter Jul 15, 2023
4044c45
wait_and_terminate_csync
landscapepainter Jul 15, 2023
f461df4
Merge branch 'master' into continuous-sync-alpha
landscapepainter Jul 15, 2023
50db103
nit
landscapepainter Jul 15, 2023
bf19970
fix spot job not respecting 'interval' from yaml
landscapepainter Jul 15, 2023
5b31c13
nit
landscapepainter Jul 16, 2023
f217f28
support for sky stop/down
landscapepainter Jul 18, 2023
00e1a30
format/comments
landscapepainter Jul 18, 2023
82c224b
Merge branch 'master' into continuous-sync-alpha
landscapepainter Jul 18, 2023
0d2f52a
nit
landscapepainter Jul 19, 2023
e861c5d
smoke test
landscapepainter Jul 20, 2023
72879f7
smoke test update
landscapepainter Jul 20, 2023
91f1a89
format
landscapepainter Jul 20, 2023
bc2a9c0
nit
landscapepainter Jul 20, 2023
c00a553
nit
landscapepainter Jul 20, 2023
1ce54ff
fix csync self-terminating error
landscapepainter Aug 1, 2023
66ff8b6
csync log support
landscapepainter Aug 2, 2023
0245e33
Merge branch 'master' into continuous-sync-alpha-1
landscapepainter Aug 17, 2023
a85b502
nit updates
landscapepainter Aug 18, 2023
234e5b5
Merge branch 'master' into continuous-sync-alpha-1
landscapepainter Aug 18, 2023
d97e998
fix edge case: fails to run csync without interval
landscapepainter Aug 18, 2023
5b99f02
Update sky/backends/backend_utils.py
landscapepainter Aug 19, 2023
a23fe2b
error hadling for _execute_storage_csync
landscapepainter Aug 19, 2023
6de26da
merge csync_utils.py into mounting_utils.py
landscapepainter Aug 19, 2023
4408be3
nit
landscapepainter Aug 20, 2023
b157eaf
additional column to cluster table for csync check
landscapepainter Aug 22, 2023
035ad7e
Merge branch 'master' into continuous-sync-alpha-1
landscapepainter Aug 22, 2023
66909bf
nit
landscapepainter Aug 22, 2023
c6e8b42
nit
landscapepainter Aug 22, 2023
85eb79a
nit
landscapepainter Aug 22, 2023
8a2d383
nit
landscapepainter Aug 23, 2023
b339a1d
run wait_and_terminate_csync conditionally
landscapepainter Aug 24, 2023
d7695ee
wait_and_terminate_csync logging and error handlin
landscapepainter Aug 24, 2023
4b8dc9c
nit
landscapepainter Aug 25, 2023
0710ce0
add Process() to retrieve pid for terminate
landscapepainter Aug 27, 2023
b1d15e1
update skystorage csync/terminate
landscapepainter Aug 27, 2023
33e2e4d
Update sky/backends/backend_utils.py
landscapepainter Aug 31, 2023
1503461
resolve comments
landscapepainter Sep 1, 2023
5ffc8b0
update local db for skystorage csync/terminate
landscapepainter Sep 2, 2023
4db85e0
update for when source is already CSYNC-mounted
landscapepainter Sep 2, 2023
ead2621
Merge branch 'master' into continuous-sync-alpha-1
landscapepainter Sep 2, 2023
8a4bcab
update terminate and remove multiprocessing
landscapepainter Sep 4, 2023
2402f41
update comments
landscapepainter Sep 4, 2023
5517272
nit
landscapepainter Sep 4, 2023
9e63b1c
nit
landscapepainter Sep 4, 2023
7761bb3
nit
landscapepainter Sep 4, 2023
611ff12
update error command exception handling
landscapepainter Sep 6, 2023
4ece00e
add Storage.from_metadata and raise heredoc debug
landscapepainter Sep 6, 2023
59534d7
Update sky/data/skystorage.py
landscapepainter Sep 7, 2023
05c2d32
added boot_time and @db decorator
landscapepainter Sep 7, 2023
c9574f9
update Storage.__init__ and Storage.from_metadata
landscapepainter Sep 7, 2023
4ab6349
fix backward compatibility issue
landscapepainter Sep 8, 2023
cd11938
update filelocks
landscapepainter Sep 8, 2023
4b2e11d
Update sky/backends/backend_utils.py
landscapepainter Sep 16, 2023
95bc037
Update sky/data/skystorage.py
landscapepainter Sep 16, 2023
ec11a15
nit
landscapepainter Sep 16, 2023
785e770
Merge branch 'continuous-sync-alpha-1' of https://github.com/landscap…
landscapepainter Sep 16, 2023
2bec928
nit and merging _execute_storage_mounts and _csync
landscapepainter Sep 17, 2023
6e1e9f2
updates
landscapepainter Sep 17, 2023
7cbf729
Merge branch 'master' into continuous-sync-alpha-1
landscapepainter Sep 17, 2023
cdd6954
Merge branch 'master' into continuous-sync-alpha-1
landscapepainter Sep 17, 2023
c21317f
log path update
landscapepainter Sep 17, 2023
269f7be
update source path filtering
landscapepainter Sep 18, 2023
f0bae8a
nit
landscapepainter Sep 18, 2023
bcb6f85
nit
landscapepainter Sep 18, 2023
b541968
nit
landscapepainter Sep 18, 2023
4bd3683
nit
landscapepainter Sep 18, 2023
36d31ce
nit
landscapepainter Sep 18, 2023
381b333
resolve nit comments
landscapepainter Sep 21, 2023
147715e
nit
landscapepainter Sep 21, 2023
7f89e7a
nit updates
landscapepainter Sep 23, 2023
4278077
update get_storetype_upload_cmd from sky_csync
landscapepainter Sep 24, 2023
1d9720a
add interval_seconds check _validate_storage_spec
landscapepainter Sep 25, 2023
a702ab7
check non-CSYNC supported stores
landscapepainter Sep 25, 2023
a8d1300
refactor sky_csync get_upload_cmd functions
landscapepainter Sep 25, 2023
c2af5f2
update csync_command to reuse common code
landscapepainter Sep 25, 2023
cc58f2a
refactored run_sync and get_upload_cmd
landscapepainter Sep 25, 2023
84e1b65
nit
landscapepainter Sep 25, 2023
1e7fc79
nit
landscapepainter Sep 25, 2023
415a634
nit
landscapepainter Sep 26, 2023
1dce95d
nit
landscapepainter Sep 27, 2023
13a2eeb
lint
landscapepainter Sep 27, 2023
079caff
testing images
landscapepainter Oct 8, 2023
a98a6b4
back
landscapepainter Oct 18, 2023
704b7b8
Merge branch 'master' of https://github.com/landscapepainter/skypilot
landscapepainter Nov 7, 2023
4db1bea
Merge branch 'master' of https://github.com/landscapepainter/skypilot
landscapepainter Nov 9, 2023
12e3f89
Merge branch 'master' into continuous-sync-alpha-1
landscapepainter Nov 9, 2023
edaf59b
format and merge clean up
landscapepainter Nov 9, 2023
8e88eee
nit
landscapepainter Nov 11, 2023
db767dd
nit
landscapepainter Nov 11, 2023
32654b0
nit
landscapepainter Nov 11, 2023
35c4f5d
nit
landscapepainter Nov 11, 2023
26d7988
adding split_*_path for csync_command
landscapepainter Nov 12, 2023
893f596
nit
landscapepainter Nov 12, 2023
566e09a
Update sky/data/sky_csync.py
landscapepainter Nov 12, 2023
941c123
format
landscapepainter Nov 12, 2023
04ba965
nit
landscapepainter Nov 12, 2023
24d3265
Merge branch 'skypilot-org:master' into master
landscapepainter Nov 12, 2023
c3d34d3
update Storage.from_metadata
landscapepainter Nov 17, 2023
3e7e33f
Fix wordings for CSYNC to not use the term 'mount'
landscapepainter Nov 17, 2023
b401b79
Merge branch 'skypilot-org:master' into master
landscapepainter Nov 17, 2023
c9a2dc0
Merge branch 'master' into continuous-sync-alpha-1
landscapepainter Nov 17, 2023
d6aee0d
format
landscapepainter Nov 17, 2023
4fea827
csync-fuse updated
landscapepainter Dec 11, 2023
9c797c8
nit
landscapepainter Dec 11, 2023
e77b0b0
set full_path to write first
landscapepainter Dec 21, 2023
0e6b3ae
tmp change
landscapepainter Dec 26, 2023
d28fe87
format and fix to run FUSE as part of CSYNC
landscapepainter Dec 27, 2023
a93aa43
Merge branch 'master' into csync-fuse
landscapepainter Dec 27, 2023
c3d2a98
Merge branch 'master' into continuous-sync-alpha-1
landscapepainter Dec 27, 2023
f625c66
FUSE enabled CSYNC prototype
landscapepainter Dec 27, 2023
ec82a5f
organize csync setup
landscapepainter Dec 28, 2023
a16da8d
keep on track of PIDs for newly added processes
landscapepainter Jan 13, 2024
3816c12
update csync _terminate for newly added processes
landscapepainter Jan 14, 2024
a4b5ae0
update path constant
landscapepainter Jan 15, 2024
705d9e7
constant
landscapepainter Jan 15, 2024
5a84473
refactor mounting command construction
landscapepainter Jan 15, 2024
b2042d7
handle error message from terminate
landscapepainter Jan 16, 2024
e45064c
err handling nit
landscapepainter Jan 17, 2024
df52517
Merge branch 'master' into continuous-sync-alpha-1
landscapepainter Jan 17, 2024
8514354
merge master mounting_utils
landscapepainter Jan 17, 2024
4abca55
nit
landscapepainter Jan 18, 2024
08f7f59
update mount_path check if empty for csync
landscapepainter Jan 20, 2024
718c57d
update redirection fuse with C
landscapepainter Jan 20, 2024
14dafd0
format
landscapepainter Jan 20, 2024
d82c28f
nit
landscapepainter Jan 20, 2024
7cf942b
refactor handling fuse processes
landscapepainter Jan 20, 2024
90917ca
error handling for mount
landscapepainter Jan 21, 2024
25d5679
update sync commands with exclude list
landscapepainter Jan 22, 2024
6e8d581
update redirec fuse binary installation path
landscapepainter Jan 22, 2024
26368b8
refactor redirection fuse installation
landscapepainter Jan 22, 2024
657a21a
remove delete from csync command
landscapepainter Jan 22, 2024
bb3a8d1
mounting process failure error handling
landscapepainter Jan 22, 2024
904149d
nit
landscapepainter Jan 22, 2024
58c21fb
Merge branch 'master' into continuous-sync-alpha-1
landscapepainter Jan 22, 2024
337e7d7
format and error handle nit fix
landscapepainter Jan 22, 2024
0fd26eb
nit
landscapepainter Jan 22, 2024
d238986
nit
landscapepainter Jan 22, 2024
94586f4
Merge branch 'master' into continuous-sync-alpha-1
landscapepainter Jan 22, 2024
d1ae0b2
add redirect-fuse.c to repo
landscapepainter Jan 29, 2024
8cf468f
nit
landscapepainter Jan 29, 2024
17ca346
nit fixes
landscapepainter Jan 31, 2024
ff2729e
combine calls to install_cmd from mounting_utils
landscapepainter Jan 31, 2024
f4aeddc
nit fixes
landscapepainter Jan 31, 2024
11e1fa2
nit fixes
landscapepainter Feb 1, 2024
ebc0d6b
reedirect-fuse.c indentation fix
landscapepainter Feb 1, 2024
0ee5b78
update csync_full_path logic
landscapepainter Feb 1, 2024
9d4ad20
nit fixes
landscapepainter Feb 1, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
49 changes: 49 additions & 0 deletions sky/backends/backend_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -104,6 +104,7 @@
# Filelocks for updating cluster's file_mounts.
CLUSTER_FILE_MOUNTS_LOCK_PATH = os.path.expanduser(
'~/.sky/.{}_file_mounts.lock')

CLUSTER_FILE_MOUNTS_LOCK_TIMEOUT_SECONDS = 10

# Remote dir that holds our runtime files.
Expand Down Expand Up @@ -2678,3 +2679,51 @@ def check_stale_runtime_on_remote(returncode: int, stderr: str,
f'not interrupted): {colorama.Style.BRIGHT}sky start -f -y '
f'{cluster_name}{colorama.Style.RESET_ALL}'
f'\n--- Details ---\n{stderr.strip()}\n')


def wait_and_terminate_csync(cluster_name: str) -> None:
"""Terminates all the CSYNC process running in each node.

Before terminating the CSYNC daemon, it waits until the sync process
launched by CSYNC is completed if there are any.

Args:
cluster_name: Cluster name (see `sky status`)
"""
record = global_user_state.get_cluster_from_name(cluster_name)
assert record is not None, cluster_name
handle = record['handle']
assert isinstance(handle, backends.CloudVmRayResourceHandle)
try:
ip_list = handle.external_ips()
# When cluster is in INIT status, attempt to fetch IP fails raising an error
except exceptions.FetchIPError as e:
logger.error('Failed to fetch IP while checking for '
'CSYNC termination.\n'
f'{common_utils.format_exception(e)}')
return
landscapepainter marked this conversation as resolved.
Show resolved Hide resolved
port_list = handle.external_ssh_ports()
ssh_credentials = ssh_credential_from_yaml(handle.cluster_yaml,
handle.docker_user)
cblmemo marked this conversation as resolved.
Show resolved Hide resolved
runners = command_runner.SSHCommandRunner.make_runner_list(
ip_list, port_list=port_list, **ssh_credentials)
csync_terminate_cmd = constants.CSYNC_TERMINATION_CMD

def _run_csync_terminate(runner: command_runner.SSHCommandRunner) -> None:
rc, stdout, stderr = runner.run(csync_terminate_cmd,
stream_logs=False,
require_outputs=True)
stderr = stdout
if rc:
landscapepainter marked this conversation as resolved.
Show resolved Hide resolved
# TODO(Doyoung): following message will interrupt the progress
# bar UI until #2504 is resolved. Remove this comment after the
# issue is resolved.
logger.error(
f'CSYNC: failed to terminate the CSYNC on {runner.ip}. '
f'Details: {stderr}')

# TODO(Doyoung): Set the following to 'info' when #2504 is resolved
logger.debug(f'CSYNC termination initiated for {cluster_name}. If a '
'sync process is currently running, CSYNC will terminate '
'after it completes.\n')
subprocess_utils.run_in_parallel(_run_csync_terminate, runners)
83 changes: 65 additions & 18 deletions sky/backends/cloud_vm_ray_backend.py
cblmemo marked this conversation as resolved.
Show resolved Hide resolved
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,7 @@
from sky.clouds.utils import gcp_utils
from sky.data import data_utils
from sky.data import storage as storage_lib
from sky.data import storage_utils
from sky.provision import common as provision_common
from sky.provision import instance_setup
from sky.provision import metadata_utils
Expand All @@ -54,6 +55,7 @@
from sky.utils import command_runner
from sky.utils import common_utils
from sky.utils import controller_utils
from sky.utils import env_options
from sky.utils import log_utils
from sky.utils import resources_utils
from sky.utils import rich_utils
Expand Down Expand Up @@ -3051,7 +3053,10 @@ def _sync_file_mounts(
controller_utils.replace_skypilot_config_path_in_file_mounts(
handle.launched_resources.cloud, all_file_mounts)
self._execute_file_mounts(handle, all_file_mounts)
self._execute_storage_mounts(handle, storage_mounts)
self._execute_storage_mounts(handle, storage_mounts,
storage_utils.StorageMode.MOUNT)
self._execute_storage_mounts(handle, storage_mounts,
storage_utils.StorageMode.CSYNC)
cblmemo marked this conversation as resolved.
Show resolved Hide resolved
self._set_storage_mounts_metadata(handle.cluster_name, storage_mounts)

def _setup(self, handle: CloudVmRayResourceHandle, task: task_lib.Task,
Expand Down Expand Up @@ -3786,6 +3791,10 @@ def teardown_no_lock(self,
f'Cluster {handle.cluster_name!r} is already terminated. '
'Skipped.')
return
# Check if the cluster includes storage with CSYNC mode and terminates
# the CSYNC process after the syncing is completed if it was running.
if self._cluster_has_csync_storage(handle.cluster_name):
backend_utils.wait_and_terminate_csync(handle.cluster_name)
log_path = os.path.join(os.path.expanduser(self.log_dir),
'teardown.log')
log_abs_path = os.path.abspath(log_path)
Expand Down Expand Up @@ -4515,12 +4524,13 @@ def _symlink_node(runner: command_runner.SSHCommandRunner):

def _execute_storage_mounts(
self, handle: CloudVmRayResourceHandle,
storage_mounts: Optional[Dict[Path, storage_lib.Storage]]):
storage_mounts: Optional[Dict[Path, storage_lib.Storage]],
mount_mode: storage_utils.StorageMode):
"""Executes storage mounts: installing mounting tools and mounting."""
# Handle cases where `storage_mounts` is None. This occurs when users
# initiate a 'sky start' command from a Skypilot version that predates
# the introduction of the `storage_mounts_metadata` feature.
if not storage_mounts:
if storage_mounts is None:
return

# Process only mount mode objects here. COPY mode objects have been
Expand All @@ -4529,7 +4539,7 @@ def _execute_storage_mounts(
storage_mounts = {
path: storage_mount
for path, storage_mount in storage_mounts.items()
if storage_mount.mode == storage_lib.StorageMode.MOUNT
if storage_mount.mode == mount_mode
}

# Handle cases when there aren't any Storages with MOUNT mode.
Expand All @@ -4544,11 +4554,18 @@ def _execute_storage_mounts(
f'mounting. No action will be taken.{colorama.Style.RESET_ALL}')
return

if mount_mode == storage_utils.StorageMode.MOUNT:
mode_str = 'mount'
action_message = 'Mounting'
else: # CSYNC mdoe
mode_str = 'csync'
action_message = 'Setting up CSYNC'

fore = colorama.Fore
style = colorama.Style
plural = 's' if len(storage_mounts) > 1 else ''
logger.info(f'{fore.CYAN}Processing {len(storage_mounts)} '
f'storage mount{plural}.{style.RESET_ALL}')
f'storage {mode_str}{plural}.{style.RESET_ALL}')
start = time.time()
ip_list = handle.external_ips()
port_list = handle.external_ssh_ports()
Expand All @@ -4557,7 +4574,7 @@ def _execute_storage_mounts(
handle.cluster_yaml, handle.docker_user)
runners = command_runner.SSHCommandRunner.make_runner_list(
ip_list, port_list=port_list, **ssh_credentials)
log_path = os.path.join(self.log_dir, 'storage_mounts.log')
log_path = os.path.join(self.log_dir, f'storage_{mode_str}s.log')

for dst, storage_obj in storage_mounts.items():
if not os.path.isabs(dst) and not dst.startswith('~/'):
Expand All @@ -4573,7 +4590,11 @@ def _execute_storage_mounts(
'successfully without mounting the bucket.')
# Get the first store and use it to mount
store = list(storage_obj.stores.values())[0]
mount_cmd = store.mount_command(dst)
if mount_mode == storage_utils.StorageMode.MOUNT:
mount_cmd = store.mount_command(dst)
else: # CSYNC mode
mount_cmd = store.csync_command(dst,
storage_obj.interval_seconds)
src_print = (storage_obj.source
if storage_obj.source else storage_obj.name)
if isinstance(src_print, list):
Expand All @@ -4585,30 +4606,56 @@ def _execute_storage_mounts(
target=dst,
cmd=mount_cmd,
run_rsync=False,
action_message='Mounting',
action_message=action_message,
log_path=log_path,
)
except exceptions.CommandError as e:
mount_path = (f'{colorama.Fore.RED}'
f'{colorama.Style.BRIGHT}{dst}'
f'{colorama.Style.RESET_ALL}')
if e.returncode == exceptions.MOUNT_PATH_NON_EMPTY_CODE:
mount_path = (f'{colorama.Fore.RED}'
f'{colorama.Style.BRIGHT}{dst}'
f'{colorama.Style.RESET_ALL}')
error_msg = (f'Mount path {mount_path} is non-empty.'
f' {mount_path} may be a standard unix '
f'path or may contain files from a previous'
f' task. To fix, change the mount path'
f' to an empty or non-existent path.')
raise RuntimeError(error_msg) from None
elif e.returncode == exceptions.CSYNC_FUSE_MOUNT_FAILURE_CODE:
error_msg = ('Failed to run CSYNC related FUSE processes '
f'at {mount_path}. Please check the CSYNC '
'log file located at ~/.sky for any error '
'message.')
raise RuntimeError(error_msg) from None
else:
# Strip the command (a big heredoc) from the exception
raise exceptions.CommandError(
e.returncode,
command='to mount',
error_msg=e.error_msg,
detailed_reason=e.detailed_reason) from None
# By default, raising an error caused from mounting_utils
# shows a big heredoc as part of it. Here, we want to
# conditionally show the heredoc only if SKYPILOT_DEBUG
# is set
if env_options.Options.SHOW_DEBUG_INFO.get():
cblmemo marked this conversation as resolved.
Show resolved Hide resolved
raise exceptions.CommandError(
e.returncode,
command=f'to {mode_str}',
error_msg=e.error_msg,
detailed_reason=e.detailed_reason)
else:
# Strip the command (a big heredoc) from the exception
raise exceptions.CommandError(
e.returncode,
command=f'to {mode_str}',
error_msg=e.error_msg,
detailed_reason=e.detailed_reason) from None

end = time.time()
logger.debug(f'Storage mount sync took {end - start} seconds.')
logger.debug(f'Setting storage {mode_str} took {end - start} seconds.')

def _cluster_has_csync_storage(self, cluster_name: str) -> bool:
"""Checks if there are CSYNC mode storages within the cluster."""
storage_mounts = self.get_storage_mounts_metadata(cluster_name)
if storage_mounts is not None:
for _, storage_obj in storage_mounts.items():
if storage_obj.mode == storage_utils.StorageMode.CSYNC:
return True
return False

def _set_storage_mounts_metadata(
self, cluster_name: str,
Expand Down
2 changes: 1 addition & 1 deletion sky/data/__init__.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
"""Sky Data."""
from sky.data.storage import Storage
from sky.data.storage import StorageMode
from sky.data.storage import StoreType
from sky.data.storage_utils import StorageMode

__all__ = ['Storage', 'StorageMode', 'StoreType']
Loading