[Spot] Fix AWS NoCredentialError caused by credential rotation #2695
…edential-issue-with-aws
Nice @Michaelvll! Some comments while tests/repro are underway.
Since this touches inner loops, we may need to run AWS smoke tests as well.
@@ -37,6 +38,11 @@
version = 1
Do we need to bump this and/or do the importlib reload hack?
There are no API changes in this file, so it should be fine not to bump the version.
LGTM @Michaelvll. We should run AWS smoke tests before merging?
Fixes #2697
We fixed the cache for `session.resource` in #2675. However, that change causes a user using SSO to get `NoCredentialError` when running spot jobs, which might be because the credential used by the cached `session.resource` gets rotated. We now avoid using the cached `resource` and instead create a new one to trigger a refresh of the credential.
Tested (run the relevant ones):
bash format.sh
sky spot launch --cloud aws --num-nodes 2 --cpus 2+ sleep 10000000; run with no static credential on spot controller for 11 hours with no issue.
pytest tests/test_smoke.py --aws
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh
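For readers following the description above, here is a toy sketch of the suspected failure mode: a cached object keeps a credential snapshot that goes stale after rotation, while a freshly created one picks up the current credential. It does not use boto3 and does not claim to model how boto3 handles credentials. `RotatingCredentials`, `FakeResource`, `cached_resource`, and `fresh_resource` are all hypothetical names invented for illustration, and the actual root cause is only suspected in this PR.

```python
import functools
import itertools


class RotatingCredentials:
    """Hypothetical stand-in for a credential source that rotates (e.g. SSO)."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.current = next(self._counter)

    def rotate(self):
        # Simulate the provider issuing a new credential.
        self.current = next(self._counter)


class FakeResource:
    """Hypothetical resource that snapshots the credential at creation time."""

    def __init__(self, creds):
        self._creds = creds
        self._token = creds.current

    def call(self):
        # Fail if the snapshot no longer matches the live credential.
        if self._token != self._creds.current:
            raise PermissionError('stale credential')
        return 'ok'


CREDS = RotatingCredentials()


@functools.lru_cache(maxsize=None)
def cached_resource(name):
    # Cached: the same FakeResource is returned on every call for this name.
    return FakeResource(CREDS)


def fresh_resource(name):
    # Not cached: a new FakeResource is built from the current credential.
    return FakeResource(CREDS)
```

In this toy model, calling `CREDS.rotate()` makes `cached_resource('ec2').call()` raise, while `fresh_resource('ec2').call()` keeps working. The trade-off is the cost of rebuilding the object on each use, which is why the review asks for AWS smoke tests on the inner loops.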