[AWS] Bring-your-own-VPC that disables public IPs for all SkyPilot nodes. #1512

Merged · 60 commits · Jan 4, 2023
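For context, here is a minimal sketch of the user-facing knobs this PR wires through. The key names are taken from the `sky_config.get_nested(...)` calls in the diff below; the file location, loader behavior, and all values are assumptions for illustration, not confirmed by this page:

```python
# Sketch of the nested config this PR reads, expressed as the dict a YAML
# config file (e.g., one pointed to by SKYPILOT_CONFIG) would load to.
# Key names mirror the get_nested() calls in the diff; values illustrative.
from typing import Any, Dict, Tuple

example_config: Dict[str, Any] = {
    'aws': {
        'vpc_name': 'my-existing-vpc',   # bring-your-own VPC
        'use_internal_ips': True,        # no public IPs on SkyPilot nodes
        'security_group_name': 'my-sg',  # optional override
    },
    'auth': {
        # Reach private-IP nodes through a jump host (illustrative value):
        'ssh_proxy_command': 'ssh -W %h:%p -i ~/.ssh/sky-key ubuntu@jump-host',
    },
}

def get_nested(keys: Tuple[str, ...], default: Any) -> Any:
    """Plausible semantics of sky_config.get_nested(): walk the nested
    dict and fall back to `default` at any missing level."""
    cur: Any = example_config
    for k in keys:
        if not isinstance(cur, dict) or k not in cur:
            return default
        cur = cur[k]
    return cur

assert get_nested(('aws', 'use_internal_ips'), False) is True
assert get_nested(('gcp', 'vpc_name'), None) is None  # absent -> default
```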
Changes shown below are from 22 of the 60 commits.

Commits
994e622
Minor: sky logs hint
concretevitamin Dec 12, 2022
dee5590
Minor: add a FIXME in authentication.py.
concretevitamin Dec 12, 2022
e1f0c1b
New module: sky_config
concretevitamin Dec 12, 2022
3f48372
Backend changes for SSH proxy command support.
concretevitamin Dec 12, 2022
3e3c66e
spot_launch(): sync up config; pop any proxy command.
concretevitamin Dec 12, 2022
7c8f943
AutostopEvent: monkey patch SSHOptions.
concretevitamin Dec 12, 2022
ff224ba
aws/config.py: support vpc_name and new use_internal_ips semantics.
concretevitamin Dec 12, 2022
6a529c3
Make failover catch our 'ERROR' messages from AWS node provider.
concretevitamin Dec 12, 2022
fc8dd03
.j2 changes.
concretevitamin Dec 12, 2022
f03f27b
Fix local launch hash for workers: must pop ssh_proxy_command.
concretevitamin Dec 12, 2022
cecc25d
Fix pylint.
concretevitamin Dec 12, 2022
0328d4d
typo
concretevitamin Dec 12, 2022
22a6481
smoke: make printf usage safe.
concretevitamin Dec 12, 2022
b467b03
Use SKYPILOT_ERROR as logging prefix.
concretevitamin Dec 13, 2022
2456775
Fix Resources.__repr__().
concretevitamin Dec 15, 2022
e496137
Avoid printing unnecessary termination errors for VPC-not-found.
concretevitamin Dec 15, 2022
7d80337
Fix a syntax error in codegen.
concretevitamin Dec 16, 2022
0b982cd
Read from SKYPILOT_CONFIG env var to permit dynamic generation.
concretevitamin Dec 16, 2022
2d9e205
Fix smoke test name.
concretevitamin Dec 16, 2022
85387ec
Fix another test name
concretevitamin Dec 16, 2022
ea26a89
Merge branch 'master' into jump
concretevitamin Dec 16, 2022
410ed6a
Revert "Read from SKYPILOT_CONFIG env var to permit dynamic generation."
concretevitamin Dec 17, 2022
07e0145
Fix head_node_launch_requested log line.
concretevitamin Dec 19, 2022
f4f285f
Merge branch 'master' into jump
concretevitamin Dec 19, 2022
0b6ee4f
Merge branch 'master' into jump
concretevitamin Dec 20, 2022
d620435
Optional: env var to read configs for spot, for better isolation.
concretevitamin Dec 20, 2022
f5850ea
Make query_head_ip_with_retries() robust to extra output.
concretevitamin Dec 20, 2022
2c27757
aws/config.py: reword comments
concretevitamin Dec 20, 2022
5cfd700
events.py: restart_only=True
concretevitamin Dec 20, 2022
7b136d5
Fix Resources.__repr__ to handle None fields.
concretevitamin Dec 20, 2022
af5eb26
Use SKYPILOT_ERROR_NO_NODES_LAUNCHED
concretevitamin Dec 20, 2022
af9e2b5
rstrip() for ssh config entries.
concretevitamin Dec 20, 2022
c73555b
authentication.py: reword comment
concretevitamin Dec 20, 2022
4c56607
pylint
concretevitamin Dec 20, 2022
f5d628b
Fix logging
concretevitamin Dec 20, 2022
de9280f
Try using retries for handle.{internal,external}_ips().
concretevitamin Dec 20, 2022
08bef71
Merge branch 'master' into jump
concretevitamin Jan 2, 2023
cf527bd
Address some easy comments
concretevitamin Jan 2, 2023
3bbeb4c
Typo
concretevitamin Jan 2, 2023
0fe08cb
backend_utils: fix worker IPs fetch; fix >80-col lines.
concretevitamin Jan 2, 2023
797c7d2
Fix test_minimal.
concretevitamin Jan 2, 2023
4ce221e
test_smoke: printf -> echo
concretevitamin Jan 3, 2023
18cd6da
Query IPs once.
concretevitamin Jan 3, 2023
ac0883d
Merge branch 'master' into jump
concretevitamin Jan 3, 2023
7252fd3
Drop ssh_proxy_command in launch hash when provisioning.
concretevitamin Jan 3, 2023
229f0a5
Typo
concretevitamin Jan 3, 2023
0155cc3
Typo
concretevitamin Jan 3, 2023
3e5e4b8
Add comment
concretevitamin Jan 3, 2023
4d89c38
sky/sky_config.py -> sky/skypilot_config.py
concretevitamin Jan 3, 2023
ff955e9
Add: sky/backends/monkey_patches/
concretevitamin Jan 3, 2023
594ce70
Remove logging
concretevitamin Jan 3, 2023
510fb6c
pylint
concretevitamin Jan 3, 2023
71a0bd7
MANIFEST should add monkey patch file
concretevitamin Jan 3, 2023
e4398ab
tests/test_smoke.py: fix extra \n
concretevitamin Jan 3, 2023
5f14af5
Fix monkey patching bug.
concretevitamin Jan 3, 2023
7b73a44
Remove AutostopEvent monkey patching.
concretevitamin Jan 3, 2023
db32513
_ray_launch_hash: pop ssh proxy command for head & workers
concretevitamin Jan 3, 2023
e82c4b4
Make another 'ray up' use patched launch hash fn.
concretevitamin Jan 4, 2023
649d805
Fix smoke tests.
concretevitamin Jan 4, 2023
0113e58
Fix smoke: K80 VMs could be non-ssh-able (and are more costly).
concretevitamin Jan 4, 2023
5 changes: 4 additions & 1 deletion sky/authentication.py
@@ -77,7 +77,10 @@ def get_or_generate_keys() -> Tuple[str, str]:
_save_key_pair(private_key_path, public_key_path, private_key,
public_key)
else:
assert os.path.exists(public_key_path)
# FIXME(skypilot): sky-launched instances fail this assert (has private
# key; but has not generated public key).
# AssertionError: /home/ubuntu/.ssh/sky-key.pub
assert os.path.exists(public_key_path), public_key_path
return private_key_path, public_key_path
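As an aside on the FIXME above, one plausible remedy (a sketch under stated assumptions, not this PR's change) is to derive the missing public key from the private key rather than asserting; `ssh-keygen -y` does exactly that:

```python
# Hypothetical fix for the FIXME: regenerate the missing .pub from the
# private key instead of asserting. Not what this PR does.
import os
import subprocess

def ensure_public_key(private_key_path: str, public_key_path: str) -> None:
    if not os.path.exists(public_key_path):
        # `ssh-keygen -y -f <privkey>` prints the matching public key.
        public_key = subprocess.check_output(
            ['ssh-keygen', '-y', '-f', private_key_path])
        with open(public_key_path, 'wb') as f:
            f.write(public_key)
        os.chmod(public_key_path, 0o644)
```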


88 changes: 64 additions & 24 deletions sky/backends/backend_utils.py
@@ -26,7 +26,7 @@
from requests import adapters
from requests.packages.urllib3.util import retry as retry_lib
from ray.autoscaler._private import commands as ray_commands
from ray.autoscaler._private import util as ray_util
from ray.autoscaler._private import util as ray_autoscaler_private_util
import rich.console as rich_console
import rich.progress as rich_progress
import yaml
@@ -38,6 +38,7 @@
from sky import clouds
from sky import exceptions
from sky import global_user_state
from sky import sky_config
from sky import sky_logging
from sky import spot as spot_lib
from sky.backends import onprem_utils
@@ -361,7 +362,12 @@ class SSHConfigHelper(object):

@classmethod
def _get_generated_config(cls, autogen_comment: str, host_name: str,
ip: str, username: str, ssh_key_path: str):
ip: str, username: str, ssh_key_path: str,
proxy_command: Optional[str]):
if proxy_command is not None:
proxy = f'ProxyCommand {proxy_command}'
else:
proxy = ''
codegen = textwrap.dedent(f"""\
{autogen_comment}
Host {host_name}
@@ -372,6 +378,7 @@ def _get_generated_config(cls, autogen_comment: str, host_name: str,
ForwardAgent yes
StrictHostKeyChecking no
Port 22
{proxy}
""")
return codegen
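To make the effect concrete, here is a standalone sketch of the entry `_get_generated_config()` now emits. The collapsed middle of the template is assumed to contain the usual HostName/User/IdentityFile lines, and the proxy command value is illustrative:

```python
import textwrap
from typing import Optional

def generated_entry(host_name: str, ip: str, username: str,
                    ssh_key_path: str,
                    proxy_command: Optional[str]) -> str:
    # Mirrors the diff: emit a ProxyCommand line only when one is set.
    proxy = (f'ProxyCommand {proxy_command}'
             if proxy_command is not None else '')
    return textwrap.dedent(f"""\
        Host {host_name}
          HostName {ip}
          User {username}
          IdentityFile {ssh_key_path}
          ForwardAgent yes
          StrictHostKeyChecking no
          Port 22
          {proxy}
        """)

print(generated_entry('sky-cluster', '10.0.0.12', 'ubuntu',
                      '~/.ssh/sky-key',
                      'ssh -W %h:%p -i ~/.ssh/sky-key ubuntu@jump-host'))
```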

@@ -437,8 +444,9 @@ def add_cluster(
f.writelines(config)
os.chmod(config_path, 0o644)

proxy_command = auth_config.get('ssh_proxy_command', None)
codegen = cls._get_generated_config(sky_autogen_comment, host_name, ip,
username, key_path)
username, key_path, proxy_command)

# Add (or overwrite) the new config.
if overwrite:
@@ -524,6 +532,8 @@ def _add_multinode_config(
with open(config_path) as f:
config = f.readlines()

proxy_command = auth_config.get('ssh_proxy_command', None)

# Check if ~/.ssh/config contains existing names
host_lines = [f'Host {c_name}' for c_name in worker_names]
for i, line in enumerate(config):
@@ -536,7 +546,7 @@
logger.warning(f'Using {host_name} to identify host instead.')
codegens[idx] = cls._get_generated_config(
sky_autogen_comment, host_name, external_worker_ips[idx],
username, key_path)
username, key_path, proxy_command)

# All workers go to SKY_USER_FILE_PATH/ssh/{cluster_name}
for i, line in enumerate(extra_config):
@@ -549,14 +559,14 @@
overwrite_begin_idxs[idx] = i - 1
codegens[idx] = cls._get_generated_config(
sky_autogen_comment, host_name, external_worker_ips[idx],
username, key_path)
username, key_path, proxy_command)

# This checks if all codegens have been created.
for idx, ip in enumerate(external_worker_ips):
if not codegens[idx]:
codegens[idx] = cls._get_generated_config(
sky_autogen_comment, worker_names[idx], ip, username,
key_path)
key_path, proxy_command)

for idx in range(len(external_worker_ips)):
# Add (or overwrite) the new config.
@@ -789,14 +799,25 @@ def write_cluster_config(
# See https://github.com/skypilot-org/skypilot/pull/742.
# Generate the name of the security group we're looking for.
# (username, last 4 chars of hash of hostname): for uniquefying
# users on shared-account cloud providers. Using uuid.getnode()
# is incorrect; observed to collide on Macs.
'security_group': f'sky-sg-{common_utils.user_and_hostname_hash()}',
# Azure only.
# users on shared-account scenarios.
'security_group': sky_config.get_nested(
('aws', 'security_group_name'),
f'sky-sg-{common_utils.user_and_hostname_hash()}'),
'vpc_name': sky_config.get_nested(('aws', 'vpc_name'), None),
'use_internal_ips': sky_config.get_nested(
('aws', 'use_internal_ips'), False),
# Not exactly AWS only, but we only test it's supported on AWS
# for now:
'ssh_proxy_command': sky_config.get_nested(
('auth', 'ssh_proxy_command'), None),

# Azure only:
'azure_subscription_id': azure_subscription_id,
'resource_group': f'{cluster_name}-{region_name}',
# GCP only.

# GCP only:
'gcp_project_id': gcp_project_id,

# Ray version.
'ray_version': constants.SKY_REMOTE_RAY_VERSION,
# Cloud credentials for cloud storage.
@@ -1032,10 +1053,12 @@ def ssh_credential_from_yaml(cluster_yaml: str) -> Dict[str, str]:
ssh_user = auth_section['ssh_user'].strip()
ssh_private_key = auth_section.get('ssh_private_key')
ssh_control_name = config.get('cluster_name', '__default__')
ssh_proxy_command = auth_section.get('ssh_proxy_command')
return {
'ssh_user': ssh_user,
'ssh_private_key': ssh_private_key,
'ssh_control_name': ssh_control_name
'ssh_control_name': ssh_control_name,
'ssh_proxy_command': ssh_proxy_command,
}
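A small round-trip sketch of what this function now returns, given an auth section like the one SkyPilot writes into cluster YAMLs (field names follow the diff; values are illustrative):

```python
import yaml

cluster_yaml = """\
cluster_name: my-cluster
auth:
  ssh_user: ubuntu
  ssh_private_key: ~/.ssh/sky-key
  ssh_proxy_command: ssh -W %h:%p -i ~/.ssh/sky-key ubuntu@jump-host
"""

config = yaml.safe_load(cluster_yaml)
auth_section = config['auth']
credentials = {
    'ssh_user': auth_section['ssh_user'].strip(),
    'ssh_private_key': auth_section.get('ssh_private_key'),
    'ssh_control_name': config.get('cluster_name', '__default__'),
    # New in this PR: thread the proxy command through (None if unset).
    'ssh_proxy_command': auth_section.get('ssh_proxy_command'),
}
assert credentials['ssh_proxy_command'].startswith('ssh -W')
```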


@@ -1473,7 +1496,7 @@ def _ray_launch_hash(cluster_name: str, ray_config: Dict[str, Any]) -> Set[str]:
assert metadata is not None, cluster_name
ray_launch_hashes = metadata.get('ray_launch_hashes', None)
if ray_launch_hashes is not None:
logger.debug('Using cached launch_caches')
logger.debug('Using cached launch_hashes.')
return set(ray_launch_hashes)
with ux_utils.suppress_output():
ray_config = ray_commands._bootstrap_config(ray_config) # pylint: disable=protected-access
@@ -1484,14 +1507,27 @@
for node_type, node_config in ray_config['available_node_types'].items():
if node_type == head_node_type:
launch_config = ray_config.get('head_node', {})
auth_config = ray_config['auth']
else:
launch_config = ray_config.get('worker_nodes', {})
auth_config = dict(ray_config['auth'])
# Ray uses the head node to launch worker nodes. On the head node,
# ~/ray_bootstrap_config.yaml, which has any ssh_proxy_command
# field removed (see Ray's autoscaler/_private/commands.py), is
# passed to the autoscaler. Therefore when Ray calculates the hash
# for workers, ssh_proxy_command is not included. Here we follow
# this (otherwise our hash here would not match with what's on the
# console for workers).
#
# Note that head node is launched from the local client, whose
# launch hash *does* include ssh_proxy_command.
auth_config.pop('ssh_proxy_command', None)
launch_config = copy.deepcopy(launch_config)

launch_config.update(node_config['node_config'])
with ux_utils.suppress_output():
current_hash = ray_util.hash_launch_conf(launch_config,
ray_config['auth'])
current_hash = ray_autoscaler_private_util.hash_launch_conf(
launch_config, auth_config)
launch_hashes.add(current_hash)
# Cache the launch hashes for the cluster.
metadata['ray_launch_hashes'] = list(launch_hashes)
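The head/worker asymmetry described in the comment above can be demonstrated in isolation. In this sketch, `hash_launch_conf` is a simple stand-in for Ray's `autoscaler._private.util.hash_launch_conf`, not its actual implementation:

```python
# Why ssh_proxy_command must be popped for workers: Ray strips it from
# ~/ray_bootstrap_config.yaml on the head node, so worker hashes computed
# there exclude it, while the head's locally computed hash includes it.
import copy
import hashlib
import json

def hash_launch_conf(node_conf: dict, auth: dict) -> str:
    # Stand-in hash: any deterministic digest of both dicts will do.
    payload = json.dumps((node_conf, auth), sort_keys=True).encode()
    return hashlib.sha1(payload).hexdigest()

auth = {'ssh_user': 'ubuntu',
        'ssh_proxy_command': 'ssh -W %h:%p jump-host'}
node_conf = {'InstanceType': 'm5.large'}

# Head node: launched from the local client, whose config still carries
# ssh_proxy_command, so its hash includes the field.
head_hash = hash_launch_conf(node_conf, auth)

# Workers: hashed on the head node from a config with the field removed.
# To reproduce the same hash locally, pop it before hashing.
worker_auth = copy.deepcopy(auth)
worker_auth.pop('ssh_proxy_command', None)
worker_hash = hash_launch_conf(node_conf, worker_auth)

assert head_hash != worker_hash  # the mismatch if the pop were skipped
```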
@@ -1690,18 +1726,22 @@ def _update_cluster_status_no_lock(
# where the cluster is terminated by the user manually through the UI.
to_terminate = not node_statuses

# A cluster is considered "abnormal", if not all nodes are TERMINATED or not all
# nodes are STOPPED. We check that with the following logic:
# * not all nodes are terminated and there's at least one node terminated; or
# * any of the non-TERMINATED nodes is in a non-STOPPED status.
# A cluster is considered "abnormal", if not all nodes are TERMINATED or
# not all nodes are STOPPED. We check that with the following logic:
# * Not all nodes are terminated and there's at least one node
# terminated; or
# * Any of the non-TERMINATED nodes is in a non-STOPPED status.
#
# This includes these special cases:
# All stopped are considered normal and will be cleaned up at the end of the function.
# Some of the nodes UP should be considered abnormal, because the ray cluster is
# probably down.
# The cluster is partially terminated or stopped should be considered abnormal.
# * All stopped are considered normal and will be cleaned up at the end
# of the function.
# * Some of the nodes UP should be considered abnormal, because the ray
# cluster is probably down.
# * The cluster is partially terminated or stopped should be considered
# abnormal.
#
# An abnormal cluster will transition to INIT and have any autostop setting reset.
# An abnormal cluster will transition to INIT and have any autostop setting
# reset.
is_abnormal = ((0 < len(node_statuses) < handle.launched_nodes) or
any(status != global_user_state.ClusterStatus.STOPPED
for status in node_statuses))
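A quick, self-contained check of the bullet cases above (the node count and statuses are illustrative, and `ClusterStatus` here is a stand-in for `global_user_state.ClusterStatus`):

```python
from enum import Enum
from typing import List

class ClusterStatus(Enum):  # stand-in for global_user_state.ClusterStatus
    STOPPED = 'STOPPED'
    UP = 'UP'

launched_nodes = 2  # how many nodes the cluster was launched with

def is_abnormal(node_statuses: List[ClusterStatus]) -> bool:
    # Same expression as the diff: partially terminated, or any
    # non-TERMINATED node that is not STOPPED.
    return (0 < len(node_statuses) < launched_nodes) or any(
        status != ClusterStatus.STOPPED for status in node_statuses)

# All nodes stopped -> normal (cleaned up later in the function).
assert not is_abnormal([ClusterStatus.STOPPED, ClusterStatus.STOPPED])
# Partially terminated (one node gone) -> abnormal.
assert is_abnormal([ClusterStatus.STOPPED])
# A node still UP here means the ray cluster is probably down -> abnormal.
assert is_abnormal([ClusterStatus.UP, ClusterStatus.STOPPED])
```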