
[Spot] Fix OOM for long running spot controller #2675

Merged: 31 commits merged into master from fix-oom-for-spot on Oct 9, 2023

Conversation

Collaborator
@Michaelvll Michaelvll commented Oct 7, 2023

Fixes #2668.

Previously, we added a new aws.resources() result into the _local.resources cache on every call. This keeps increasing the memory usage, because the config object is different every time aws.resources() is called, so no call ever reuses a cached entry, eventually causing OOM on the spot controller.
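
To illustrate the leak, here is a simplified sketch (hypothetical names and structure, not the exact code in this repo) of a cache whose key involves a freshly constructed config object on every call, so no call ever hits the cache and the thread-local dict grows without bound:

import threading

import boto3
from botocore import config as botocore_config

_local = threading.local()


def resource(service_name, **kwargs):
    # Simplified sketch of the old, leaky behavior.
    if not hasattr(_local, 'resources'):
        _local.resources = {}
    # A new Config object is built on every call; it compares/hashes by
    # identity, so 'key' is never equal to a previous key and every call
    # adds one more boto3 resource to the cache.
    conf = botocore_config.Config(retries={'max_attempts': 20})
    key = (service_name, conf)
    if key not in _local.resources:
        _local.resources[key] = boto3.session.Session().resource(
            service_name, config=conf, **kwargs)
    return _local.resources[key]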

TODO:

  • backward compatibility: for an existing spot job, after sky start -f sky-spot-controller-xxx and sky spot cancel -a, the spot cluster may fail to be terminated, as adaptors.aws may have been loaded into memory earlier.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky spot launch -n test-oom --cloud aws --cpus 2 --num-nodes 16 sleep 1000000000000000 and check the memory consumption with htop
  • All smoke tests: pytest tests/test_smoke.py --aws
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh
    • master: sky spot launch -n test-aws sleep 100000; check out the current PR; sky start -f sky-spot-xxxx; sky spot logs --controller

Collaborator
@concretevitamin concretevitamin left a comment

Awesome work tracking this down @Michaelvll! Mostly LGTM, some questions. cc @suquark too

sky/provision/aws/instance.py (resolved)
sky/adaptors/aws.py (outdated, resolved)
sky/adaptors/aws.py (resolved)
sky/provision/aws/instance.py (resolved)
sky/adaptors/aws.py (outdated, resolved)
    # NOTE: we need the lock here to avoid thread-safety issues when
    # creating the resource, because Python module is a shared object,
    # and we are not sure if the code inside 'session().resource()'
    # is thread-safe.
    _local.resource[key] = session().resource(service_name, **kwargs)
Collaborator

We should set a maximum cache size (like an LRU cache) to prevent further OOM, but I do not know how hard it is to implement. Maybe we can look into Python's LRU cache implementation (or can we just use lru_cache after all these changes?)

Collaborator Author

Switched back to functools.lru_cache now. Please take another look @suquark
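
For reference, a minimal sketch of an lru_cache-based version (hypothetical names and defaults, not the exact code in this PR): keying the cache on hashable arguments only lets repeated calls reuse a single resource, and maxsize bounds how much can accumulate.

import functools

import boto3
from botocore import config as botocore_config


@functools.lru_cache(maxsize=32)
def _cached_resource(service_name: str, max_attempts: int = 20):
    # Arguments are plain hashable values, so identical calls hit the
    # cache instead of building a new Config and boto3 resource each time.
    # maxsize caps the number of distinct entries kept alive.
    conf = botocore_config.Config(retries={'max_attempts': max_attempts})
    return boto3.session.Session().resource(service_name, config=conf)

Note that lru_cache guards its own bookkeeping with a lock, but the wrapped function can still be entered more than once for the same key under concurrent cache misses, which matters if resource creation is not thread-safe.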

Collaborator
@suquark suquark left a comment

LGTM

Collaborator
@concretevitamin concretevitamin left a comment

Some thoughts about lru cache vs thread local.

Collaborator

Q: does this test, applied to the master branch, pass or fail the CI? Just wondering if the CI having no AWS credentials affects the creation of Config and/or the memory usage (probably not).

Collaborator

Also, a lot of tests/*.py are already unit tests, so it's a bit odd to add a new dir. How about just making it flatter by placing this file under tests/?

Collaborator Author

It passes the CI as shown below. Good question about there being no AWS credentials in the CI, but the test seems to pass both locally and in the CI.

How about we do a refactoring in a future PR to separate the integration tests from the unit tests? We should move the unit tests currently in tests/*.py into the unit test directory and keep the integration tests in another place.
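
For context, the commit list below mentions a unit test for the memory leak built on memory_profiler; a rough sketch of such a check (hypothetical call pattern and threshold, not the exact contents of tests/unit_tests/sky/clouds/test_aws.py) could look like:

from memory_profiler import memory_usage

from sky.adaptors import aws


def test_aws_resource_memory_is_bounded():

    def create_many():
        # Repeatedly request the same resource; with the fix these calls
        # should hit the cache instead of piling up new boto3 objects.
        # (Passing region_name through as a hashable kwarg is an assumption.)
        for _ in range(2000):
            aws.resource('ec2', region_name='us-east-1')

    # memory_usage samples the process memory (in MiB) while running the
    # callable; take the peak of the samples.
    peak = max(memory_usage((create_many, (), {})))
    # Hypothetical threshold; the real test's limit may differ.
    assert peak < 1024, f'Peak memory {peak:.0f} MiB suggests a leak'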

tests/unit_tests/sky/clouds/test_aws.py (outdated, resolved)
tests/unit_tests/sky/clouds/test_aws.py (outdated, resolved)
sky/adaptors/aws.py (outdated, resolved)
sky/adaptors/aws.py (outdated, resolved)
sky/adaptors/aws.py (outdated, resolved)
def _default_ec2_resource(region: str) -> Any:
    if not hasattr(aws, 'version'):
        # For backward compatibility, reload the module if the aws module was
        # imported before and staled.
Collaborator

Suggested change:
-        # imported before and staled.
+        # imported before and stale. Used for, e.g., a live spot controller running an older version and a new version gets installed by `sky spot launch`.

Q: actually, this reload() won't affect any existing controller.py processes, will it? A bit confused about why we need backward compatibility here.

Collaborator Author

Previously, the spot controller's backward compatibility was guaranteed by the fact that Python caches every imported module in memory, so even if the skypilot package is updated on the controller VM, the existing controller process keeps using the old code, i.e., there is no backward compatibility issue.
However, with this adaptor-style implementation the modules are only imported when they are used, so any update to aws.instance will be picked up by the old controller process. That is, adaptors.aws is on the old version while aws.instance is on the new version, causing a mismatch between the two implementations.
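
To make the mechanism concrete, here is a simplified sketch of a lazy-import adaptor (hypothetical shape, not the exact SkyPilot implementation). A plain `import boto3` at process start pins the version held in sys.modules for the life of the process, whereas a lazy adaptor defers the import to first use, so a long-running controller can end up importing code that was upgraded on disk after the process started:

import importlib


class LazyImport:
    """Defers the real import until the module is first used."""

    def __init__(self, module_name: str):
        self._module_name = module_name
        self._module = None

    def __getattr__(self, name):
        if self._module is None:
            # The import happens here, at first attribute access -- possibly
            # long after the process started, by which time the package on
            # disk may have been upgraded to a newer, incompatible version.
            self._module = importlib.import_module(self._module_name)
        return getattr(self._module, name)


boto3 = LazyImport('boto3')  # nothing is imported yet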

Collaborator Author

This is only a temporary solution. I think we should get rid of the adaptors and only use the provision APIs, as the two-level import and reloading can cause a lot of issues, e.g., older code requesting the older adaptors.aws while the newly imported aws.instance requires the newer adaptors.aws.

Collaborator
@concretevitamin concretevitamin left a comment

Thanks, some comments.

sky/adaptors/aws.py (outdated, resolved)
def resource(service_name: str, **kwargs):
    """Create an AWS resource of a certain service.

    Args:
        service_name: AWS resource name (e.g., 's3').
-        kwargs: Other options.
+        kwargs: Other options. We add max_attempts to the kwargs instead of
Collaborator

The "We add ..." part probably belongs to below, ~L140?

Collaborator Author

I feel like it would be better to have it here, so the caller can see it in the docstring?

Here, the kwargs are the same as those for session().resource(), except for max_attempts, which is added by us, so we should let the caller know not to pass their own config, without making them click into the function implementation.
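
As a rough sketch of what "adding max_attempts to the kwargs" could look like (hypothetical signature; the real function may differ), the retry config is built inside the function from a plain integer, so callers never construct a botocore Config themselves:

import boto3
from botocore import config as botocore_config


def resource(service_name: str, max_attempts: int = 20, **kwargs):
    # Build the retry config here from the plain integer, so callers do not
    # need to know about botocore's Config object and the arguments stay
    # simple (and hashable, which matters for caching).
    conf = botocore_config.Config(retries={'max_attempts': max_attempts})
    return boto3.session.Session().resource(service_name, config=conf,
                                            **kwargs)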

sky/adaptors/aws.py (outdated, resolved)
sky/adaptors/aws.py (outdated, resolved)
sky/provision/aws/instance.py (resolved)
Michaelvll and others added 2 commits October 8, 2023 22:27
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Collaborator
@concretevitamin concretevitamin left a comment

LGTM, thanks @Michaelvll! Just a note on the comment & the unit test.

sky/provision/aws/instance.py (outdated, resolved)
@Michaelvll Michaelvll merged commit 470db0c into master Oct 9, 2023
18 checks passed
@Michaelvll Michaelvll deleted the fix-oom-for-spot branch October 9, 2023 18:56
jc9123 pushed a commit to jc9123/skypilot that referenced this pull request Oct 11, 2023
* Fix caching error for aws resources

* Add todo for local cache

* backward compatibility

* Use lru cache instead

* format

* add unit test for aws resources memory leakage

* pytest larger memory limit

* Fix unit test

* fix comment

* add requirement for memory profiler

* install memory profiler in pytest

* fix ci

* Update sky/provision/aws/instance.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* Address comments

* Add new line

* backward compatibility

* Use thread_local LRU instead

* init

* Update sky/adaptors/aws.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* make private

* Add comments

* fail early for memory exceeding

* Less frequent memory test

* shorter period

* Update sky/provision/aws/instance.py

* refactor

* Address comments

* fix test

* Add comment to modules

---------

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

Successfully merging this pull request may close these issues.

[Spot] OOM on spot controller when 4 spot jobs run concurrently for more than 5 days
3 participants