
[Spot] Fix OOM for long running spot controller #2675

Merged: 31 commits merged into master from fix-oom-for-spot on Oct 9, 2023

Conversation

Collaborator
@Michaelvll Michaelvll commented Oct 7, 2023

Fixes #2668.

Previously, we added a new aws.resources() result into the _local.resources cache on every call. This keeps increasing the memory usage, because the config object is different every time aws.resources() is called, so no call ever reuses a cached entry, eventually causing OOM on the spot controller.
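
To illustrate the leak, here is a simplified sketch (hypothetical names and structure, not the exact code in this repo) of a cache whose key involves a freshly constructed config object on every call, so no call ever hits the cache and the thread-local dict grows without bound:

import threading

import boto3
from botocore import config as botocore_config

_local = threading.local()


def resource(service_name, **kwargs):
    # Simplified sketch of the old, leaky behavior.
    if not hasattr(_local, 'resources'):
        _local.resources = {}
    # A new Config object is built on every call; it compares/hashes by
    # identity, so 'key' is never equal to a previous key and every call
    # adds one more boto3 resource to the cache.
    conf = botocore_config.Config(retries={'max_attempts': 20})
    key = (service_name, conf)
    if key not in _local.resources:
        _local.resources[key] = boto3.session.Session().resource(
            service_name, config=conf, **kwargs)
    return _local.resources[key]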

TODO:

  • backward compatibility: for an existing spot job, after sky start -f sky-spot-controller-xxx and sky spot cancel -a, the spot cluster may fail to be terminated, as adaptors.aws may have been loaded into memory earlier.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky spot launch -n test-oom --cloud aws --cpus 2 --num-nodes 16 sleep 1000000000000000 and check the memory consumption with htop
  • All smoke tests: pytest tests/test_smoke.py --aws
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh
    • master: sky spot launch -n test-aws sleep 100000; check out the current PR; sky start -f sky-spot-xxxx; sky spot logs --controller

Collaborator
@concretevitamin concretevitamin left a comment

Awesome work tracking this down @Michaelvll! Mostly LGTM, some questions. cc @suquark too

sky/provision/aws/instance.py (resolved)
sky/adaptors/aws.py (outdated, resolved)
sky/adaptors/aws.py (resolved)
sky/provision/aws/instance.py (resolved)
sky/adaptors/aws.py (outdated, resolved)
    # NOTE: we need the lock here to avoid thread-safety issues when
    # creating the resource, because Python module is a shared object,
    # and we are not sure if the code inside 'session().resource()'
    # is thread-safe.
    _local.resource[key] = session().resource(service_name, **kwargs)
Collaborator

We should set a maximum cache size (like an LRU cache) to prevent further OOM, but I do not know how hard it is to implement. Maybe we can look into Python's LRU cache implementation (or can we just use lru_cache after all these changes?)

Collaborator Author

Switched back to functools.lru_cache now. Please take another look @suquark
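
For reference, a minimal sketch of an lru_cache-based version (hypothetical names and defaults, not the exact code in this PR): keying the cache on hashable arguments only lets repeated calls reuse a single resource, and maxsize bounds how much can accumulate.

import functools

import boto3
from botocore import config as botocore_config


@functools.lru_cache(maxsize=32)
def _cached_resource(service_name: str, max_attempts: int = 20):
    # Arguments are plain hashable values, so identical calls hit the
    # cache instead of building a new Config and boto3 resource each time.
    # maxsize caps the number of distinct entries kept alive.
    conf = botocore_config.Config(retries={'max_attempts': max_attempts})
    return boto3.session.Session().resource(service_name, config=conf)

Note that lru_cache guards its own bookkeeping with a lock, but the wrapped function can still be entered more than once for the same key under concurrent cache misses, which matters if resource creation is not thread-safe.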

Collaborator
@suquark suquark left a comment

LGTM

Collaborator
@concretevitamin concretevitamin left a comment

Some thoughts about lru cache vs thread local.

Collaborator

Q: does this test, applied to the master branch, pass or fail the CI? Just wondering if the CI having no AWS credentials affects the creation of Config and/or the memory usage (probably not).

Collaborator

Also, a lot of tests/*.py are already unit tests, so it's a bit odd to add a new dir. How about just making it flatter by placing this file under tests/?

Collaborator Author

It passes the CI as shown below. Good question about there being no AWS credentials in the CI, but the test seems to pass both locally and in the CI.

How about we do a refactoring in a future PR to separate the integration tests from the unit tests? We should move the unit tests currently in tests/*.py into the unit test directory and keep the integration tests in another place.
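
For context, the commit list below mentions a unit test for the memory leak built on memory_profiler; a rough sketch of such a check (hypothetical call pattern and threshold, not the exact contents of tests/unit_tests/sky/clouds/test_aws.py) could look like:

from memory_profiler import memory_usage

from sky.adaptors import aws


def test_aws_resource_memory_is_bounded():

    def create_many():
        # Repeatedly request the same resource; with the fix these calls
        # should hit the cache instead of piling up new boto3 objects.
        # (Passing region_name through as a hashable kwarg is an assumption.)
        for _ in range(2000):
            aws.resource('ec2', region_name='us-east-1')

    # memory_usage samples the process memory (in MiB) while running the
    # callable; take the peak of the samples.
    peak = max(memory_usage((create_many, (), {})))
    # Hypothetical threshold; the real test's limit may differ.
    assert peak < 1024, f'Peak memory {peak:.0f} MiB suggests a leak'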

tests/unit_tests/sky/clouds/test_aws.py (outdated, resolved)
tests/unit_tests/sky/clouds/test_aws.py (outdated, resolved)
sky/adaptors/aws.py (outdated, resolved)
sky/adaptors/aws.py (outdated, resolved)
sky/adaptors/aws.py (outdated, resolved)
def _default_ec2_resource(region: str) -> Any:
    if not hasattr(aws, 'version'):
        # For backward compatibility, reload the module if the aws module was
        # imported before and staled.
Collaborator

Suggested change:
-        # imported before and staled.
+        # imported before and stale. Used for, e.g., a live spot controller running an older version and a new version gets installed by `sky spot launch`.

Q: actually, this reload() won't affect any existing controller.py processes, will it? A bit confused about why we need backward compatibility here.

Collaborator Author

Previously, the spot controller's backward compatibility was guaranteed by the fact that Python caches every imported module in memory, so even if the skypilot package is updated on the controller VM, the existing controller process keeps using the old code, i.e., there is no backward compatibility issue.
However, with this adaptor-style implementation the modules are only imported when they are used, so any update to aws.instance will be picked up by the old controller process. That is, adaptors.aws is on the old version while aws.instance is on the new version, causing a mismatch between the two implementations.
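
To make the mechanism concrete, here is a simplified sketch of a lazy-import adaptor (hypothetical shape, not the exact SkyPilot implementation). A plain `import boto3` at process start pins the version held in sys.modules for the life of the process, whereas a lazy adaptor defers the import to first use, so a long-running controller can end up importing code that was upgraded on disk after the process started:

import importlib


class LazyImport:
    """Defers the real import until the module is first used."""

    def __init__(self, module_name: str):
        self._module_name = module_name
        self._module = None

    def __getattr__(self, name):
        if self._module is None:
            # The import happens here, at first attribute access -- possibly
            # long after the process started, by which time the package on
            # disk may have been upgraded to a newer, incompatible version.
            self._module = importlib.import_module(self._module_name)
        return getattr(self._module, name)


boto3 = LazyImport('boto3')  # nothing is imported yet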

Collaborator Author

This is only a temporary solution. I think we should get rid of the adaptors and only use the provision APIs, as the two-level import and reloading can cause a lot of issues, e.g., older code requesting the older adaptors.aws while the newly imported aws.instance requires the newer adaptors.aws.

Collaborator
@concretevitamin concretevitamin left a comment

Thanks, some comments.

sky/adaptors/aws.py (outdated, resolved)
def resource(service_name: str, **kwargs):
    """Create an AWS resource of a certain service.

    Args:
        service_name: AWS resource name (e.g., 's3').
-        kwargs: Other options.
+        kwargs: Other options. We add max_attempts to the kwargs instead of
Collaborator

The "We add ..." part probably belongs to below, ~L140?

Collaborator Author

I feel like it would be better to have it here, so the caller can see it in the docstring?

Here, the kwargs are the same as those for session().resource(), except for max_attempts, which is added by us, so we should let the caller know not to pass their own config, without making them click into the function implementation.
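
As a rough sketch of what "adding max_attempts to the kwargs" could look like (hypothetical signature; the real function may differ), the retry config is built inside the function from a plain integer, so callers never construct a botocore Config themselves:

import boto3
from botocore import config as botocore_config


def resource(service_name: str, max_attempts: int = 20, **kwargs):
    # Build the retry config here from the plain integer, so callers do not
    # need to know about botocore's Config object and the arguments stay
    # simple (and hashable, which matters for caching).
    conf = botocore_config.Config(retries={'max_attempts': max_attempts})
    return boto3.session.Session().resource(service_name, config=conf,
                                            **kwargs)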

sky/adaptors/aws.py (outdated, resolved)
sky/adaptors/aws.py (outdated, resolved)
sky/provision/aws/instance.py (resolved)
Michaelvll and others added 2 commits October 8, 2023 22:27
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Collaborator
@concretevitamin concretevitamin left a comment

LGTM, thanks @Michaelvll! Just a note on the comment & the unit test.

sky/provision/aws/instance.py (outdated, resolved)
@Michaelvll Michaelvll merged commit 470db0c into master Oct 9, 2023
18 checks passed
@Michaelvll Michaelvll deleted the fix-oom-for-spot branch October 9, 2023 18:56
jc9123 pushed a commit to jc9123/skypilot that referenced this pull request Oct 11, 2023
* Fix caching error for aws resources

* Add todo for local cache

* backward compatibility

* Use lru cache instead

* format

* add unit test for aws resources memory leakage

* pytest larger memory limit

* Fix unit test

* fix comment

* add requirement for memory profiler

* install memory profiler in pytest

* fix ci

* Update sky/provision/aws/instance.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* Address comments

* Add new line

* backward compatibility

* Use thread_local LRU instead

* init

* Update sky/adaptors/aws.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* make private

* Add comments

* fail early for memory exceeding

* Less frequent memory test

* shorter period

* Update sky/provision/aws/instance.py

* refactor

* Address comments

* fix test

* Add comment to modules

---------

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

Successfully merging this pull request may close these issues.

[Spot] OOM on spot controller when 4 spot jobs run concurrently for more than 5 days
3 participants