Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spot] Allow 2x spot jobs on a spot controller #3191

Merged
merged 13 commits into from
Feb 21, 2024
Merged

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Feb 19, 2024

Fixes #3187
We reduce the CPU requirement for each spot controller process to allow 2x more spot jobs to be submitted to a spot controller.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Launch 32 spot jobs and manually kill all the instance to check if there is any OOM issue on the controller.
    • sky spot cancel -a
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@Michaelvll Michaelvll marked this pull request as ready for review February 19, 2024 07:36
Copy link
Collaborator

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, quick look.

sky/backends/backend_utils.py Show resolved Hide resolved
# We use 50 GB disk size to reduce the cost.
CONTROLLER_RESOURCES = {'disk_size': 50}
CONTROLLER_RESOURCES = {'memory': '3x', 'disk_size': 50}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Forgot about this syntax; does it mean >= 3x (if so 3x+ may be more intuitive, for the future)?

Q: why does code say 3x but comment say "allow 4x .."?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is 3x+. This 3x is not exposed to the user, but indeed we should make it more clear in the future.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Not sure I understand why the comment (each spot controller process use 0.25 CPU cores) connects to the explanation of memory?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I further rephrase it to: each CPU core can have 4 spot controller processes as we set the CPU requirement to 4, and 3 GB is barely enough for 4 spot processes
Does that make it more clear?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's clear now! Can we be consistent to say vCPU vs. CPU core (which is typically 2 vCPUs) everywhere?

Comment on lines 12 to 13
# Based on profiling, the memory should be at least 3x than the CPU cores (we
# allow 4x CPUs spot controller processes on the spot controller) to avoid OOM.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Based on profiling, the memory should be at least 3x than the CPU cores (we
# allow 4x CPUs spot controller processes on the spot controller) to avoid OOM.
# Based on profiling, memory should be at least 3x (in GB) as num vCPUs to avoid OOM (we
# set the max number of spot controller processes to be "4x vCPUs" on the spot controller)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not following how the statement in the parenthesis connects to the memory?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rephrased it. PTAL : )

sky/skylet/constants.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @Michaelvll.

# We use 50 GB disk size to reduce the cost.
CONTROLLER_RESOURCES = {'disk_size': 50}
CONTROLLER_RESOURCES = {'memory': '3x', 'disk_size': 50}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Not sure I understand why the comment (each spot controller process use 0.25 CPU cores) connects to the explanation of memory?

# We use 50 GB disk size to reduce the cost.
CONTROLLER_RESOURCES = {'disk_size': 50}
CONTROLLER_RESOURCES = {'memory': '3x', 'disk_size': 50}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's clear now! Can we be consistent to say vCPU vs. CPU core (which is typically 2 vCPUs) everywhere?

@Michaelvll Michaelvll merged commit 40fa245 into master Feb 21, 2024
19 checks passed
@Michaelvll Michaelvll deleted the allow-more-spot-jobs branch February 21, 2024 06:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Spot] Allow more spot jobs on a controller
2 participants