Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GCP] Refactor the reserved instances cache #2836

Merged
merged 6 commits into from
Dec 4, 2023

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Dec 3, 2023

The previous implementation of GCP's reserved instance cache has different cache for multiple GCP() cloud objects. This can cause long optimization time if multiple resources are specified with different GCP() objects.

This PR refactors the cache out to a module to avoid querying the reservation list multiple times.

It reduces the time for optimization from 13 seconds to 4 seconds for the following yaml:

resources:
    # accelerators: A100:8
    any_of:
      # AWS:
      - region: us-east-1
      - region: us-east-2
      - region: us-west-1
      - region: us-west-2
      # GCP
      - region: us-central1
      - region: us-east1
      - region: us-east4
      - region: us-west1
      - region: us-west2
      - region: us-west3
      - region: us-west4

sky launch -c test test.yaml

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch --cloud gcp -t n2-standard-2 echo hi with reservation for n2-standard-2
    • sky launch -c test-gcp --cloud gcp echo hi on master
    • sky exec test-gcp echo hi on current PR
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Copy link
Collaborator

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll - LGTM.

tests/unit_tests/test_gcp.py Outdated Show resolved Hide resolved
sky/clouds/utils/gcp_utils.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
@@ -225,11 +161,6 @@ class GCP(clouds.Cloud):
'https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#google-cloud-platform-gcp' # pylint: disable=line-too-long
)

def __init__(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May need to test back-compat.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tested excuting new job and launching again on existing cluster. It works correctly. Added the tests in the PR description. Thanks!

return [r for r in reservations if r.zone.endswith(f'/{zone}')]


@cachetools.cached(cache=cachetools.TTLCache(maxsize=1,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any quick test on whether the cache is effective within a process? Asking because the args passed are slightly different than before.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I added a logging in L104 and tried to remove the decorator. With the decorator the logging will only be shown once, but without the decorator it will show the output multiple times.

Also, tried to change the ttl to 0.3, and it shows the output three times during the optimization.

@concretevitamin
Copy link
Collaborator

We may want to add a comment to class Cloud docstr, something like:

Lightweight stateless objects. SkyPilot may create multiple such objects; therefore, subclasses should take care to make methods inexpensive to call, and should not store heavy state. If state needs to be queried from the cloud provider and cached, create a module in sky/clouds/utils/ so they can be reused across cloud object creation.

@Michaelvll Michaelvll merged commit d063af8 into master Dec 4, 2023
19 checks passed
@Michaelvll Michaelvll deleted the refactor-gcp-reserved-cache branch December 4, 2023 04:17
remyleone pushed a commit to remyleone/skypilot that referenced this pull request Dec 26, 2023
* Refactor the reserved instances cache for GCP

* fix tests

* Address comments

* Add comments

* fix test name

* fix test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants