Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SSO/Spot] Fix spot controller failed to launch spot cluster when using SSO #1817

Merged
merged 5 commits into from
Mar 28, 2023

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Mar 28, 2023

A user reported that when using AWS SSO account, the sky spot launch will fail due to the controller does not have any cloud access enabled. That is a bug with our current SSO implementation, as the controller is assigned with an assumed IAM-role, which has the access to create the spot cluster and will not have static credential files, but our cloud access checking will fail when it finds there is no credential files.
Another problem with the spot controller is: the controller will keep retrying even if the all cloud disabled exception is raised, causing the spot job in STARTING status forever.

To reproduce:

  1. Run in a new docker container
  2. log in with SSO account
  3. sky spot launch -n test echo hi --retry-until-up

This PR fixes:

  1. The spot controller will now be able to launch the spot cluster when the user is using AWS SSO
  2. Spot controller will be able to catch the all cloud disabled exception and fail early. (This is for any potential future identity problem with that exception)

Tested (run the relevant ones):

  • Any manual or new tests for this PR (please specify below)
    • The reproducible script process.

@Michaelvll Michaelvll marked this pull request as ready for review March 28, 2023 05:30
Copy link
Collaborator

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix @Michaelvll! Mostly nits. Feel free to merge after.

sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/exceptions.py Outdated Show resolved Hide resolved
sky/spot/recovery_strategy.py Show resolved Hide resolved
sky/clouds/aws.py Outdated Show resolved Hide resolved
sky/clouds/aws.py Outdated Show resolved Hide resolved
sky/clouds/aws.py Show resolved Hide resolved
sky/clouds/aws.py Outdated Show resolved Hide resolved
@Michaelvll Michaelvll merged commit 42d6169 into master Mar 28, 2023
@Michaelvll Michaelvll deleted the fix-sso-spot branch March 28, 2023 07:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants