[Spot] Fix OOM for long running spot controller #2675
Conversation
Awesome work tracking this down @Michaelvll! Mostly LGTM, some questions. cc @suquark too
sky/adaptors/aws.py
Outdated
# NOTE: we need the lock here to avoid thread-safety issues when
# creating the resource, because Python module is a shared object,
# and we are not sure if the code inside 'session().resource()'
# is thread-safe.
_local.resource[key] = session().resource(service_name, **kwargs)
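For context, a minimal sketch of why a cache keyed this way can grow without bound (a hypothetical simplification, assuming a fresh botocore `Config` is built on every call, as the PR description below explains; not the PR's actual code):

import threading

import boto3
from botocore import config

_local = threading.local()

def resource(service_name: str, **kwargs):
    if not hasattr(_local, 'resource'):
        _local.resource = {}
    # A brand-new Config object is created on every call. Config defines no
    # custom __eq__/__hash__, so it hashes by identity and `key` never
    # matches a previous entry: the dict grows on every call, eventually
    # OOMing a long-running process such as the spot controller.
    retry_config = config.Config(retries={'max_attempts': 20})
    key = (service_name, retry_config) + tuple(sorted(kwargs.items()))
    if key not in _local.resource:
        session = boto3.session.Session()
        _local.resource[key] = session.resource(
            service_name, config=retry_config, **kwargs)
    return _local.resource[key]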
we should set a maximum cache size (like an LRU cache) to prevent further OOM, but I do not know how hard it is to implement. Maybe we can look into Python's LRU cache implementation (or can we just use `lru_cache` after all these changes?)
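A minimal sketch of the `functools.lru_cache` direction suggested here (illustrative, not the PR's final code; it assumes plain hashable arguments such as an integer `max_attempts` are passed instead of a `Config` object, so identical calls map to one cache entry):

import functools

import boto3
from botocore import config

@functools.lru_cache()
def session():
    # One shared Session per process, created lazily on first use.
    return boto3.session.Session()

@functools.lru_cache(maxsize=32)
def resource(service_name: str, max_attempts: int = 20, **kwargs):
    # lru_cache keys on the call arguments, so calls with the same service
    # and options reuse one resource object; maxsize bounds the cache so a
    # long-running controller cannot grow it indefinitely.
    retry_config = config.Config(retries={'max_attempts': max_attempts})
    return session().resource(service_name, config=retry_config, **kwargs)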
Switched back to `functools.lru_cache` now. Please take another look @suquark
LGTM
Some thoughts about lru cache vs thread local.
Q: does this test, applied to the master branch, pass or fail the CI? Just wondering if the CI having no AWS credentials affects the creation of `Config` and/or the memory usage (probably not).
Also, a lot of `tests/*.py` are already unit tests, so it's a bit odd to add a new dir. How about just making it flatter by placing this file under `tests/`?
It passes the CI as shown below. Good question about the lack of AWS credentials, but the test seems to pass both locally and in CI.
How about we do a refactoring in a future PR to separate the integration tests and unit tests? We should move the unit tests currently in `tests/*.py` into the unit-test suite and keep the integration tests in another place.
sky/provision/aws/instance.py
Outdated
def _default_ec2_resource(region: str) -> Any:
    if not hasattr(aws, 'version'):
        # For backward compatibility, reload the module if the aws module was
        # imported before and staled.
Suggested change:
- # imported before and staled.
+ # imported before and stale. Used for, e.g., a live spot controller
+ # running an older version while a new version gets installed by
+ # `sky spot launch`.
Q: actually, this `reload()` won't affect any existing `controller.py` processes? A bit confused why we need back-compat here.
Previously, the spot controller's backward compatibility was guaranteed by the fact that Python caches every imported module in memory: even if the skypilot package is updated on the controller VM, the existing controller process keeps using the old code, i.e., there is no backward-compatibility issue.
However, since we are using this adaptor-style implementation, the modules are only imported when used, so any update to `aws.instance` will be picked up by the old controller process. That is, `adaptors.aws` stays at the old version while `aws.instance` is at the new version, causing a mismatch in the implementation.
This is only a temporary solution. I think we should get rid of the `adaptors` and only use the `provision` APIs, as the two-level import and reloading can cause a lot of issues, e.g., the older code requesting the older `adaptors.aws` while the newly imported `aws.instance` requires the newer `adaptors.aws`.
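A hypothetical sketch of the lazy-import pattern behind this mismatch (simplified; the real adaptor code differs):

import importlib

class LazyModule:
    """Defers the real import until an attribute is first accessed.

    Because the import happens at use time, a long-lived process can import
    code that was upgraded on disk *after* the process started, mixing old
    and new module versions in one interpreter.
    """

    def __init__(self, module_name: str):
        self._module_name = module_name
        self._module = None

    def __getattr__(self, name):
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return getattr(self._module, name)

# The back-compat check above follows the same idea: if an attribute added
# in the new version (e.g., `version`) is missing from the cached module
# object, that object predates the upgrade, so `importlib.reload()` it to
# bring both sides back in sync.
aws = LazyModule('sky.adaptors.aws')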
Thanks, some comments.
def resource(service_name: str, **kwargs):
    """Create an AWS resource of a certain service.

    Args:
        service_name: AWS resource name (e.g., 's3').
        kwargs: Other options. We add max_attempts to the kwargs instead of
The "We add ..." part probably belongs to below, ~L140?
I feel like it would be better to have it here, so the caller can see it in the docstr?
Here, the `kwargs` is the same as the one for `session().resource()`, except for `max_attempts`, which is added by us, so we should let the caller know not to pass `config`, without requiring them to click into the function implementation.
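A hedged usage example of what the docstring implies (assuming the adaptor turns `max_attempts` into a botocore retry `Config` internally; parameter handling may differ from the actual code):

from sky.adaptors import aws

# Callers pass plain kwargs; the adaptor picks out `max_attempts` and builds
# the botocore Config itself, so callers should not pass `config` directly.
ec2 = aws.resource('ec2', region_name='us-east-1', max_attempts=3)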
LGTM, thanks @Michaelvll! Just a note on the comment & the unit test.
* Fix caching error for aws resources
* Add todo for local cache
* backward compatibility
* Use lru cache instead
* format
* add unit test for aws resources memory leakage
* pytest larger memory limit
* Fix unit test
* fix comment
* add requirement for memory profiler
* install memory profiler in pytest
* fix ci
* Update sky/provision/aws/instance.py
  Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
* Address comments
* Add new line
* backward compatibility
* Use thread_local LRU instead
* init
* Update sky/adaptors/aws.py
  Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
* make private
* Add comments
* fail early for memory exceeding
* Less frequent memory test
* shorter period
* Update sky/provision/aws/instance.py
* refactor
* Address comments
* fix test
* Add comment to modules
---------
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Fixes #2668.
Previously, we added new `aws.resources()` objects into the `_local.resources` cache. This increases the memory usage, as the `config` object will be different every time `aws.resources` is called, causing OOM on the spot controller.

TODO:
- After `sky start -f sky-spot-controller-xxx` and `sky spot cancel -a`, the spot cluster may fail to be terminated, as the `adaptor.aws` may be loaded in memory earlier.

Tested (run the relevant ones):
- `bash format.sh`
- `sky spot launch -n test-oom --cloud aws --cpus 2 --num-nodes 16 sleep 1000000000000000` and check the memory consumption with `htop`
- `pytest tests/test_smoke.py --aws`
- `pytest tests/test_smoke.py::test_fill_in_the_name`
- `bash tests/backward_comaptibility_tests.sh`
- `sky spot launch -n test-aws sleep 100000`; checkout to the current PR; `sky start -f sky-spot-xxxx`; `sky spot logs --controller`
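For reference, a hypothetical sketch of the kind of memory-leak unit test the commit list mentions ("add unit test for aws resources memory leakage"); the function name, region, and threshold are illustrative, not the PR's actual test:

import memory_profiler

def test_aws_resource_memory():
    def loop():
        from sky.adaptors import aws
        # Repeatedly request the "same" resource; with the lru_cache fix the
        # calls should hit one cache entry instead of allocating new objects.
        for _ in range(1000):
            aws.resource('ec2', region_name='us-east-1', max_attempts=3)

    # memory_usage samples the process RSS (in MiB) while running the callable.
    samples = memory_profiler.memory_usage((loop, (), {}), interval=0.1)
    growth = max(samples) - min(samples)
    assert growth < 50, f'memory grew by {growth:.1f} MiB; cache may leak'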