[AWS] Bring-your-own-VPC that disables public IPs for all SkyPilot nodes. #1512

Merged 60 commits on Jan 4, 2023

Changes from 53 commits

Commits
994e622
Minor: sky logs hint
concretevitamin Dec 12, 2022
dee5590
Minor: add a FIXME in authentication.py.
concretevitamin Dec 12, 2022
e1f0c1b
New module: sky_config
concretevitamin Dec 12, 2022
3f48372
Backend changes for SSH proxy command support.
concretevitamin Dec 12, 2022
3e3c66e
spot_launch(): sync up config; pop any proxy command.
concretevitamin Dec 12, 2022
7c8f943
AutostopEvent: monkey patch SSHOptions.
concretevitamin Dec 12, 2022
ff224ba
aws/config.py: support vpc_name and new use_internal_ips semantics.
concretevitamin Dec 12, 2022
6a529c3
Make failover catch our 'ERROR' messages from AWS node provider.
concretevitamin Dec 12, 2022
fc8dd03
.j2 changes.
concretevitamin Dec 12, 2022
f03f27b
Fix local launch hash for workers: must pop ssh_proxy_command.
concretevitamin Dec 12, 2022
cecc25d
Fix pylint.
concretevitamin Dec 12, 2022
0328d4d
typo
concretevitamin Dec 12, 2022
22a6481
smoke: make printf usage safe.
concretevitamin Dec 12, 2022
b467b03
Use SKYPILOT_ERROR as logging prefix.
concretevitamin Dec 13, 2022
2456775
Fix Resources.__repr__().
concretevitamin Dec 15, 2022
e496137
Avoid printing unnecessary termination errors for VPC-not-found.
concretevitamin Dec 15, 2022
7d80337
Fix a syntax error in codegen.
concretevitamin Dec 16, 2022
0b982cd
Read from SKYPILOT_CONFIG env var to permit dynamic generation.
concretevitamin Dec 16, 2022
2d9e205
Fix smoke test name.
concretevitamin Dec 16, 2022
85387ec
Fix another test name
concretevitamin Dec 16, 2022
ea26a89
Merge branch 'master' into jump
concretevitamin Dec 16, 2022
410ed6a
Revert "Read from SKYPILOT_CONFIG env var to permit dynamic generation."
concretevitamin Dec 17, 2022
07e0145
Fix head_node_launch_requested log line.
concretevitamin Dec 19, 2022
f4f285f
Merge branch 'master' into jump
concretevitamin Dec 19, 2022
0b6ee4f
Merge branch 'master' into jump
concretevitamin Dec 20, 2022
d620435
Optional: env var to read configs for spot, for better isolation.
concretevitamin Dec 20, 2022
f5850ea
Make query_head_ip_with_retries() robust to extra output.
concretevitamin Dec 20, 2022
2c27757
aws/config.py: reword comments
concretevitamin Dec 20, 2022
5cfd700
events.py: restart_only=True
concretevitamin Dec 20, 2022
7b136d5
Fix Resources.__repr__ to handle None fields.
concretevitamin Dec 20, 2022
af5eb26
Use SKYPILOT_ERROR_NO_NODES_LAUNCHED
concretevitamin Dec 20, 2022
af9e2b5
rstrip() for ssh config entries.
concretevitamin Dec 20, 2022
c73555b
authentication.py: reword comment
concretevitamin Dec 20, 2022
4c56607
pylint
concretevitamin Dec 20, 2022
f5d628b
Fix logging
concretevitamin Dec 20, 2022
de9280f
Try using reties for handle.{internal,external}_ips().
concretevitamin Dec 20, 2022
08bef71
Merge branch 'master' into jump
concretevitamin Jan 2, 2023
cf527bd
Address some easy comments
concretevitamin Jan 2, 2023
3bbeb4c
Typo
concretevitamin Jan 2, 2023
0fe08cb
backend_utils: fix worker IPs fetch; fix >80-col lines.
concretevitamin Jan 2, 2023
797c7d2
Fix test_minimal.
concretevitamin Jan 2, 2023
4ce221e
test_smoke: printf -> echo
concretevitamin Jan 3, 2023
18cd6da
Query IPs once.
concretevitamin Jan 3, 2023
ac0883d
Merge branch 'master' into jump
concretevitamin Jan 3, 2023
7252fd3
Drop ssh_proxy_command in launch hash when provisioning.
concretevitamin Jan 3, 2023
229f0a5
Typo
concretevitamin Jan 3, 2023
0155cc3
Typo
concretevitamin Jan 3, 2023
3e5e4b8
Add comment
concretevitamin Jan 3, 2023
4d89c38
sky/sky_config.py -> sky/skypilot_config.py
concretevitamin Jan 3, 2023
ff955e9
Add: sky/backends/monkey_patches/
concretevitamin Jan 3, 2023
594ce70
Remove logging
concretevitamin Jan 3, 2023
510fb6c
pylint
concretevitamin Jan 3, 2023
71a0bd7
MANIFEST should add monkey patch file
concretevitamin Jan 3, 2023
e4398ab
tests/test_smoke.py: fix extra \n
concretevitamin Jan 3, 2023
5f14af5
Fix monkey patching bug.
concretevitamin Jan 3, 2023
7b73a44
Remove AutostopEvent monkey patching.
concretevitamin Jan 3, 2023
db32513
_ray_launch_hash: pop ssh proxy command for head & workers
concretevitamin Jan 3, 2023
e82c4b4
Make another 'ray up' use patched launch hash fn.
concretevitamin Jan 4, 2023
649d805
Fix smoke tests.
concretevitamin Jan 4, 2023
0113e58
Fix smoke: K80 VMs could be non-ssh-able (and are more costly).
concretevitamin Jan 4, 2023
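
The commit trail above introduces a user-level config module (sky_config, later renamed sky/skypilot_config.py) and teaches the AWS node provider about vpc_name and use_internal_ips. The aws/config.py diff itself is not rendered on this page, so the Python sketch below only illustrates how such lookups could be wired up; get_nested() and the exact key paths are assumptions, not taken from this diff (only loaded() and pop_nested() appear below):

from sky import skypilot_config

# Hypothetical lookups; key names inferred from the commit message
# 'aws/config.py: support vpc_name and new use_internal_ips semantics'.
vpc_name = skypilot_config.get_nested(('aws', 'vpc_name'), None)
use_internal_ips = skypilot_config.get_nested(('aws', 'use_internal_ips'), False)
if use_internal_ips:
    # Nodes get no public IPs; the client reaches them through an SSH
    # ProxyCommand instead (see the backend changes further down).
    pass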
7 changes: 6 additions & 1 deletion sky/authentication.py
@@ -77,7 +77,12 @@ def get_or_generate_keys() -> Tuple[str, str]:
_save_key_pair(private_key_path, public_key_path, private_key,
public_key)
else:
assert os.path.exists(public_key_path)
# FIXME(skypilot): ran into failing this assert once, but forgot the
# reproduction (has private key; but has not generated public key).
# AssertionError: /home/ubuntu/.ssh/sky-key.pub
assert os.path.exists(public_key_path), (
'Private key found, but associated public key '
f'{public_key_path} does not exist.')
return private_key_path, public_key_path


272 changes: 179 additions & 93 deletions sky/backends/backend_utils.py

Large diffs are not rendered by default.

229 changes: 160 additions & 69 deletions sky/backends/cloud_vm_ray_backend.py

Large diffs are not rendered by default.
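
Neither of these two large diffs is rendered here, but the commit messages ("Backend changes for SSH proxy command support.", "Drop ssh_proxy_command in launch hash when provisioning.") indicate that the backends learn to wrap SSH with an optional ProxyCommand so nodes that only have private IPs stay reachable. A minimal, self-contained sketch of that idea; the function and parameter names are illustrative and not taken from the actual diff:

from typing import List, Optional

def build_ssh_command(ip: str,
                      ssh_user: str,
                      ssh_private_key: str,
                      ssh_proxy_command: Optional[str] = None) -> List[str]:
    """Builds an ssh invocation, optionally tunneling through a jump host."""
    cmd = ['ssh', '-i', ssh_private_key, '-o', 'StrictHostKeyChecking=no']
    if ssh_proxy_command is not None:
        # With use_internal_ips, `ip` is a private address; the ProxyCommand
        # (e.g., 'ssh -W %h:%p user@jump-host') provides the path to it.
        cmd += ['-o', f'ProxyCommand={ssh_proxy_command}']
    return cmd + [f'{ssh_user}@{ip}']

print(' '.join(build_ssh_command('10.0.1.23', 'ubuntu', '~/.ssh/sky-key',
                                 'ssh -W %h:%p ubuntu@jump.example.com')))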

44 additions & 0 deletions — new file (filename not shown; evidently the file under sky/backends/monkey_patches/ referenced by the 'Add: sky/backends/monkey_patches/' commit and the MANIFEST.in change below)
@@ -0,0 +1,44 @@
"""Runs `ray up` while not using ssh_proxy_command in launch hash.

This monkey patches the hash_launch_conf() function inside Ray autoscaler to
exclude any ssh_proxy_command in hash calculation.

Reasons:
- In the future, we want to support changing the ssh_proxy_command field for
an existing cluster. If the launch hash included this field, then this would
mean upon such a change a new cluster would've been launched, causing
leakage.
- With our patch, ssh_proxy_command will be excluded from the launch hash when
a cluster is first created. This then makes it possible for us to support
changing the proxy command in the future.
"""
import hashlib
import json
import os

from ray.autoscaler._private import util as ray_autoscaler_private_util
from ray.autoscaler import sdk


# Ref: https://github.com/ray-project/ray/blob/releases/2.2.0/python/ray/autoscaler/_private/util.py#L392-L404
def monkey_patch_hash_launch_conf(node_conf, auth):
hasher = hashlib.sha1()
# For hashing, we replace the path to the key with the key
# itself. This is to make sure the hashes are the same even if keys
# live at different locations on different machines.
full_auth = auth.copy()
full_auth.pop('ssh_proxy_command', None) # NOTE: skypilot changes.
for key_type in ['ssh_private_key', 'ssh_public_key']:
if key_type in auth:
with open(os.path.expanduser(auth[key_type])) as key:
full_auth[key_type] = key.read()
hasher.update(
json.dumps([node_conf, full_auth], sort_keys=True).encode('utf-8'))
print('Hashed without ssh_proxy_command:', hasher.hexdigest())
print(node_conf)
print(full_auth)
return hasher.hexdigest()


ray_autoscaler_private_util.hash_launch_conf = monkey_patch_hash_launch_conf
sdk.create_or_update_cluster({ray_yaml_path}, **{kwargs})
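
This file is written as a code template ({ray_yaml_path} and {kwargs} are presumably filled in before it is executed to run `ray up`). To make the effect of the patch concrete, here is a small self-contained sketch (not part of the PR) showing that two auth configs differing only in ssh_proxy_command produce the same launch hash once the key is dropped; unlike the real hash_launch_conf(), it skips inlining the key files:

import hashlib
import json

def launch_hash(node_conf, auth):
    # Mirrors the patched hash above: drop ssh_proxy_command before hashing
    # (key-file inlining omitted for brevity).
    auth = dict(auth)
    auth.pop('ssh_proxy_command', None)
    payload = json.dumps([node_conf, auth], sort_keys=True).encode('utf-8')
    return hashlib.sha1(payload).hexdigest()

node_conf = {'InstanceType': 'm6i.large'}
with_proxy = {'ssh_user': 'ubuntu',
              'ssh_proxy_command': 'ssh -W %h:%p ubuntu@jump.example.com'}
without_proxy = {'ssh_user': 'ubuntu'}
assert launch_hash(node_conf, with_proxy) == launch_hash(node_conf, without_proxy)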
4 changes: 2 additions & 2 deletions sky/cli.py
@@ -1558,8 +1558,8 @@ def logs(

if len(job_ids) > 1 and not sync_down:
raise click.UsageError(
f'Cannot stream logs of multiple jobs {job_ids}. '
'Set --sync-down to download them.')
f'Cannot stream logs of multiple jobs (IDs: {", ".join(job_ids)}).'
'\nPass -s/--sync-down to download the logs instead.')

job_ids = None if not job_ids else job_ids

66 changes: 54 additions & 12 deletions sky/execution.py
@@ -26,6 +26,7 @@
from sky import exceptions
from sky import global_user_state
from sky import optimizer
from sky import skypilot_config
from sky import sky_logging
from sky import spot
from sky import task as task_lib
@@ -512,19 +513,60 @@ def spot_launch(
common_utils.dump_yaml(f.name, task_config)

controller_name = spot.SPOT_CONTROLLER_NAME
vars_to_fill = {
'remote_user_yaml_prefix': spot.SPOT_TASK_YAML_PREFIX,
'user_yaml_path': f.name,
'user_config_path': None,
'spot_controller': controller_name,
'cluster_name': name,
'gcloud_installation_commands': gcp.GCLOUD_INSTALLATION_COMMAND,
'is_dev': env_options.Options.IS_DEVELOPER.get(),
'disable_logging': env_options.Options.DISABLE_LOGGING.get(),
'logging_user_hash': common_utils.get_user_hash(),
'retry_until_up': retry_until_up,
'user': os.environ.get('USER', None),
}
if skypilot_config.loaded():
# Look up the contents of the already loaded configs via the
# 'skypilot_config' module. Don't simply read the on-disk file as
# it may have changed since this process started.
#
# Pop any proxy command, because the controller would've been
# launched behind the proxy, and in general any nodes we launch may
# not have or need the proxy setup. (If the controller needs to
# launch spot clusters in another region/VPC, the user should
# properly set up VPC peering, which will allow the
# cross-region/VPC communication. The proxy command is orthogonal
# to this scenario.)
#
# This file will be uploaded to the controller node and will be
# used throughout the spot job's recovery attempts (i.e., if it
# relaunches due to preemption, we make sure the same config is
# used).
#
# NOTE: suppose that we have a controller in old VPC, then user
# changes 'vpc_name' in the config and does a 'spot launch'. In
# general, the old controller may not successfully launch the job
# in the new VPC. This happens if the two VPCs don’t have peering
# set up. Like other places in the code, we assume properly setting
# up networking is user's responsibilities.
# TODO(zongheng): consider adding a basic check that checks
# controller VPC (or name) == the spot job's VPC (or name). It may
# not be a sufficient check (as it's always possible that peering
# is not set up), but it may catch some obvious errors.
config_dict = skypilot_config.pop_nested(
('auth', 'ssh_proxy_command'))
with tempfile.NamedTemporaryFile(mode='w', delete=False) as tmpfile:
common_utils.dump_yaml(tmpfile.name, config_dict)
vars_to_fill.update({
'user_config_path': tmpfile.name,
'env_var_skypilot_config':
skypilot_config.ENV_VAR_SKYPILOT_CONFIG,
})

yaml_path = backend_utils.fill_template(
spot.SPOT_CONTROLLER_TEMPLATE, {
'remote_user_yaml_prefix': spot.SPOT_TASK_YAML_PREFIX,
'user_yaml_path': f.name,
'spot_controller': controller_name,
'cluster_name': name,
'gcloud_installation_commands': gcp.GCLOUD_INSTALLATION_COMMAND,
'is_dev': env_options.Options.IS_DEVELOPER.get(),
'disable_logging': env_options.Options.DISABLE_LOGGING.get(),
'logging_user_hash': common_utils.get_user_hash(),
'retry_until_up': retry_until_up,
'user': os.environ.get('USER', None),
},
spot.SPOT_CONTROLLER_TEMPLATE,
vars_to_fill,
output_prefix=spot.SPOT_CONTROLLER_YAML_PREFIX)
controller_task = task_lib.Task.from_yaml(yaml_path)
controller_task.spot_task = task
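
skypilot_config.pop_nested() itself is not shown in this diff (sky/skypilot_config.py is a new module added by the PR). Judging from the call site above, it returns a copy of the loaded config with the given nested key path removed; in the real module it presumably reads the module-level loaded config and takes only the key tuple. The following rough sketch passes the config explicitly for self-containment and is a hypothetical illustration, not the actual implementation:

import copy
from typing import Any, Dict, Tuple

def pop_nested(config: Dict[str, Any], keys: Tuple[str, ...]) -> Dict[str, Any]:
    """Returns a deep copy of `config` with the nested key path removed."""
    result = copy.deepcopy(config)
    node = result
    for key in keys[:-1]:
        node = node.get(key)
        if not isinstance(node, dict):
            return result  # Path does not exist; nothing to pop.
    node.pop(keys[-1], None)
    return result

config = {'auth': {'ssh_user': 'ubuntu',
                   'ssh_proxy_command': 'ssh -W %h:%p ubuntu@jump.example.com'}}
pruned = pop_nested(config, ('auth', 'ssh_proxy_command'))
assert 'ssh_proxy_command' not in pruned['auth']
assert 'ssh_proxy_command' in config['auth']  # Original left untouched.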
43 changes: 41 additions & 2 deletions sky/resources.py
@@ -98,6 +98,29 @@ def __init__(
self._try_validate_image_id()

def __repr__(self) -> str:
"""Returns a string representation for display.

Examples:

>>> sky.Resources(accelerators='V100')
<Cloud>({'V100': 1})

>>> sky.Resources(accelerators='V100', use_spot=True)
<Cloud>([Spot], {'V100': 1})

>>> sky.Resources(accelerators='V100',
... use_spot=True, instance_type='p3.2xlarge')
AWS(p3.2xlarge[Spot], {'V100': 1})

>>> sky.Resources(accelerators='V100', instance_type='p3.2xlarge')
AWS(p3.2xlarge, {'V100': 1})

>>> sky.Resources(instance_type='p3.2xlarge')
AWS(p3.2xlarge, {'V100': 1})

>>> sky.Resources(disk_size=100)
<Cloud>(disk_size=100)
"""
accelerators = ''
accelerator_args = ''
if self.accelerators is not None:
@@ -123,8 +146,24 @@ def __repr__(self) -> str:
if self.disk_size != _DEFAULT_DISK_SIZE_GB:
disk_size = f', disk_size={self.disk_size}'

return (f'{self.cloud}({self._instance_type}{use_spot}'
f'{accelerators}{accelerator_args}{image_id}{disk_size})')
if self._instance_type is not None:
instance_type = f'{self._instance_type}'
else:
instance_type = ''

hardware_str = (
f'{instance_type}{use_spot}'
f'{accelerators}{accelerator_args}{image_id}{disk_size}')
# It may have leading ',' (for example, instance_type not set) or empty
# spaces. Remove them.
while hardware_str and hardware_str[0] in (',', ' '):
hardware_str = hardware_str[1:]

cloud_str = '<Cloud>'
if self.cloud is not None:
cloud_str = f'{self.cloud}'

return f'{cloud_str}({hardware_str})'

@property
def cloud(self):
1 change: 1 addition & 0 deletions sky/setup_files/MANIFEST.in
@@ -1,3 +1,4 @@
include sky/backends/monkey_patches/*.py
include sky/skylet/*.sh
include sky/skylet/providers/aws/*
include sky/skylet/providers/aws/cloudwatch/*
75 changes: 70 additions & 5 deletions sky/skylet/events.py
@@ -126,22 +126,87 @@ def _run(self):
self._stop_cluster(autostop_config)

def _stop_cluster(self, autostop_config):

def _ray_up_to_reset_upscaling_params():
from ray.autoscaler import sdk # pylint: disable=import-outside-toplevel
from ray.autoscaler._private import command_runner # pylint: disable=import-outside-toplevel

# Monkey patch. We must do this, otherwise with ssh_proxy_command
# still under 'auth:' `ray up ~/.sky/sky_ray.yaml` on the head node
# will fail (in general, the clusters do not need or have the proxy
# set up).
#
# Note also that we can't simply drop ssh_proxy_command from that
# yaml: this is because then the `ray up` command would calculate a
# different launch hash, prompting the autoscaler to stop the head
# node and launch a new one.
#
# Ref:
# 2.2.0: https://github.com/ray-project/ray/blame/releases/2.2.0/python/ray/autoscaler/_private/command_runner.py#L114-L143 # pylint: disable=line-too-long
# which has not changed for 3 years, so it covers all local Ray
# versions we support inside setup.py.
#
# TODO(zongheng): it seems we could skip monkey patching this, if
# we monkey patch hash_launch_conf() to drop ssh_proxy_command as
# is done in the provision code path. I tried it and could not get
# it to work: (1) no outputs are shown (tried both subprocess.run()
# and log_lib.run_with_log()) (2) this first `ray up
# --restart-only` somehow calculates a different launch hash than
# the one used when the head node was created. To clean up in the
# future.
def monkey_patch_init(self, ssh_key, control_path=None, **kwargs):
"""SSHOptions.__init__(), but pops 'ProxyCommand'."""
self.ssh_key = ssh_key
self.arg_dict = {
# Suppresses initial fingerprint verification.
'StrictHostKeyChecking': 'no',
# SSH IP and fingerprint pairs no longer added to
# known_hosts. This is to remove a 'REMOTE HOST
# IDENTIFICATION HAS CHANGED' warning if a new node has the
# same IP as a previously deleted node, because the
# fingerprints will not match in that case.
'UserKnownHostsFile': os.devnull,
# Try fewer extraneous key pairs.
'IdentitiesOnly': 'yes',
# Abort if port forwarding fails (instead of just printing
# to stderr).
'ExitOnForwardFailure': 'yes',
# Quickly kill the connection if network connection breaks
# (as opposed to hanging/blocking).
'ServerAliveInterval': 5,
'ServerAliveCountMax': 3,
}
if control_path:
self.arg_dict.update({
'ControlMaster': 'auto',
'ControlPath': '{}/%C'.format(control_path),
'ControlPersist': '10s',
})
# NOTE(skypilot): pops ProxyCommand. This is the only change.
kwargs.pop('ProxyCommand', None)
self.arg_dict.update(kwargs)

command_runner.SSHOptions.__init__ = monkey_patch_init
sdk.create_or_update_cluster(self._ray_yaml_path, restart_only=True)

if (autostop_config.backend ==
cloud_vm_ray_backend.CloudVmRayBackend.NAME):
self._replace_yaml_for_stopping(self._ray_yaml_path,
autostop_config.down)

# `ray up` is required to reset the upscaling speed and min/max
# workers. Otherwise, `ray down --workers-only` will continuously
# scale down and up.
subprocess.run([
'ray', 'up', '-y', '--restart-only', '--disable-usage-stats',
self._ray_yaml_path
],
check=True)
logger.info('Running ray up.')
_ray_up_to_reset_upscaling_params()

logger.info('Running ray down.')
# Stop the workers first to avoid orphan workers.
subprocess.run(
['ray', 'down', '-y', '--workers-only', self._ray_yaml_path],
check=True)

logger.info('Running final ray down.')
subprocess.run(['ray', 'down', '-y', self._ray_yaml_path],
check=True)
else: