Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCP: Fix subnet filtering. #2854

Merged
merged 3 commits into from
Dec 11, 2023
Merged

GCP: Fix subnet filtering. #2854

merged 3 commits into from
Dec 11, 2023

Conversation

concretevitamin
Copy link
Collaborator

Previously we were filtering by subnet name (name), but rather we should be filtering by vpc name (network). We did not hit this bug because our subnets were named the same as the VPC:

» gcloud compute networks subnets list --network="skypilot-vpc" 

NAME          REGION                   NETWORK       RANGE          STACK_TYPE  IPV6_ACCESS_TYPE  INTERNAL_IPV6_PREFIX  EXTERNAL_IPV6_PREFIX
skypilot-vpc  us-central1              skypilot-vpc  10.128.0.0/20  IPV4_ONLY
..

while a user hit this bug because they are named differently:

NAME           REGION    NETWORK   RANGE          STACK_TYPE  IPV6_ACCESS_TYPE  INTERNAL_IPV6_PREFIX  EXTERNAL_IPV6_PREFIX
priv-subnet-1  us-west1  skypilot  10.10.72.0/22  IPV4_ONLY
pub-subnet-1   us-west1  skypilot  10.10.68.0/22  IPV4_ONLY

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • No gcp.vpc_name config: sky launch -c no-gcp-config --cpus 2 --use-spot --cloud gcp -i0 --down -y
    • With gcp.vpc_name config: sky launch -c skypilot-vpc-gcp-config --cpus 2 --use-spot --cloud gcp -i0 --down -y
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch @concretevitamin ! Thanks for fixing this. LGTM.

matched_items = []
for item in items:
# 'https://www.googleapis.com/compute/v1/projects/<project_id>/global/networks/<network_name>'
if network == item["network"].split("/")[-1]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use _network_interface_to_vpc_name?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, done!

)
.execute()
)

return response["items"] if "items" in response else []
items = response["items"] if "items" in response else []
if network is None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, is it possible to do the network filtering with the filter field? If not, maybe we can have a comment here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment:

    # Filter by network (VPC) name.
    #
    # Note we do not directly use the filter (network=<...>) arg of the list()
    # call above, because it'd involve constructing a long URL of the following
    # format and passing it as the filter value:
    # 'https://www.googleapis.com/compute/v1/projects/<project_id>/global/networks/<network_name>'

@concretevitamin concretevitamin merged commit 9c19b77 into master Dec 11, 2023
19 checks passed
@concretevitamin concretevitamin deleted the gcp-subnet-filter branch December 11, 2023 03:18
Michaelvll added a commit that referenced this pull request Dec 22, 2023
remyleone pushed a commit to remyleone/skypilot that referenced this pull request Dec 26, 2023
* Fix GCP _list_subnets() filtering.

* Format

* updates
Michaelvll added a commit that referenced this pull request Jan 1, 2024
* init

* remove ray

* update config

* update

* update

* update

* complete bootstrapping

* add start instance

* fix

* fix

* fix

* update

* wait stopping instances

* support normal gcp tpus first

* fix gcp

* support get cluster info

* fix

* update

* wait for instance starting

* rename

* hide gcp package import

* fix

* fix

* update constants

* fix comments

* remove unused methods

* fix comments

* sync 'config' & 'constants' with upstream, Nov 16

* sync 'instace_utils' with the upstream, Nov 16

* fix typing

* parallelize provisioning

* Fix TPU node

* Fix TPU NAME env for tpu node

* implement bulk provision

* refactor selflink

* format

* reduce the sleep time for autostop

* provisioner version refactoring

* refactor

* Add logging

* avoid saving the provisioner version

* format

* format

* Fix scheduling field in config

* format

* fix public key content

* Fix provisioner version for azure

* Use ray port from head node for workers

* format

* fix ray_port

* fix smoke tests

* shorter sleep time

* refactor status refresh version

* [Provisioner] Support reserved instances in GCP (#2824)

* Support reserved instances

* remove min max count

* remove unecessary fields

* Add todo

* Add todo

* remove unused reseravation config

* Fix config.yaml tests

* format

* sync with the upstream (Dec 05, 23)

* set timeout and retries

* handle GCP creation errors

* Fix provisioning errors and improve error handling

* update blocklist for GCP

* refactor code for linting issues

* fix

* show instance status during assertion error

* Refactor error handling for failover

* adopt changes in #2854

* format

* retry for wait operation

* format

* fix typo

* fix interface

* more robust zone to region

* Fix tpu vm external IP setup

* Fix get node

* format

* revert for TPU VM pod

* Fix get_cluster_info call

* fix tab

* Fix timeout case

* remvoe \t

* GCP query statuses with new provisioner

* format

* fix import

* refactor query status

* fix stopped status

* Fix stopped status

* Add head ray start command

* Add back keys

* add workers

* Fix non stopped states

* Add more logs for autostop

* format

* increase job_docker job time

* better logging

* shorter time for recovering

* fix conflicting var

* change to V1

* fix comments

* refactor constants

* refactoring

* typo

* Fix max retry

* longer sleep time for job

* add detach setup

* revert --detach-setup

* shorter time for recovering

* more retries

* Update sky/provision/instance_setup.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* Update sky/backends/cloud_vm_ray_backend.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* format

* [Provisioner] New provisioner for GCP TPU VM (#2898)

* init

* test

* test ins_type

* fix

* format..

* wip

* remove TPU config

* fix node ips

* Fix TPU VM pod

* format

* use TPU VM as default

* Fix example for TPU VM

* format

* fix optimizer random dag

* set TPU-VM

* accelerator_args False

* backward compatibility

* add tpu filter for tests

* fix

* Fix

* fix status refresh for tpu VM pod

* Support autodown for TPU VM pod

* Allow multi-node TPU VM pod

* Allow multi-node TPU VM pod

* fix

* add execute for operation

* avoid from

* Wait for pending before set_labels

* format

* refactor constants

* Fix for API changes

* remove GCP failover handler v1

* format

* remove TPU VM pod specific codes as they have been moved to new provisioner

* Add error handling for TPU pod case

* fix

* fix multiple node calculation

* refactor tpu_utils to gcp_utils

* shorter time for recovering

* format

---------

Co-authored-by: Wei-Lin Chiang <weichiang@berkeley.edu>

* better error logging

* Fix logging for TPU VM

* Fix logging

* Add insufficientCapacity to error handler

* Avoid adding duplicated resources to blocked_resources

* Fix blocked resources

* address comment

* add comment

* Add comments

* format

* Fix num_node_ips

* format

* fix smoke test for preinstalled package

* shorter wait time for recovering

* Fix TPU VM pod stop

* format

* Update sky/provision/gcp/instance_utils.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* update

* format

* Add debug message

* revert version for handle

* disable tpu name set

---------

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Wei-Lin Chiang <weichiang@berkeley.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants