[Spot] Allow 2x spot jobs on a spot controller #3191
Thanks, quick look.
# We use 50 GB disk size to reduce the cost.
- CONTROLLER_RESOURCES = {'disk_size': 50}
+ CONTROLLER_RESOURCES = {'memory': '3x', 'disk_size': 50}
Q: Forgot about this syntax; does it mean >= 3x (if so, 3x+ may be more intuitive, for the future)?
Q: why does code say 3x but comment say "allow 4x .."?
Yes, it is 3x+. This 3x is not exposed to the user, but indeed we should make it clearer in the future.
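To make the reading above concrete, here is a minimal sketch of how a multiplier such as `3x` (or the suggested `3x+`) could be turned into a minimum memory size. The helper `min_memory_gb` is hypothetical and written only for illustration; it is not SkyPilot's parser, and it assumes the multiplier is applied to the number of vCPUs.

```python
def min_memory_gb(memory_spec: str, num_vcpus: int) -> float:
    """Return the minimum memory (in GB) implied by a spec like '3x' or '3x+'.

    Hypothetical helper for illustration only; not part of SkyPilot.
    Both '3x' and '3x+' are read as "at least 3 GB per vCPU".
    """
    # Drop an optional trailing '+', since both forms mean "at least".
    spec = memory_spec.strip().rstrip('+')
    if not spec.endswith('x'):
        raise ValueError(f'Expected a multiplier like "3x", got {memory_spec!r}')
    multiplier = float(spec[:-1])
    return multiplier * num_vcpus


print(min_memory_gb('3x', 4))   # 12.0
print(min_memory_gb('3x+', 8))  # 24.0
```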
Got it. Not sure I understand how the comment (each spot controller process use 0.25 CPU cores) connects to the explanation of memory?
Ahh, I further rephrased it to: each CPU core can have 4 spot controller processes as we set the CPU requirement to 4, and 3 GB is barely enough for 4 spot processes.
Does that make it more clear?
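The arithmetic behind that rephrasing can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the constants come from the figures quoted in this thread (0.25 CPU per spot controller process, 3 GB of memory per CPU), the function name `controller_capacity` is hypothetical, and "CPU" is taken to mean vCPU, as the later suggested comment wording does.

```python
# Figures quoted in this review thread; the names are hypothetical.
CPU_PER_PROCESS = 0.25   # each spot controller process uses 0.25 CPU
MEMORY_GB_PER_VCPU = 3   # the '3x' memory multiplier


def controller_capacity(num_vcpus: int) -> dict:
    """Estimate process capacity and memory needs for a spot controller."""
    # 0.25 CPU per process means 4 processes fit on each vCPU.
    max_processes = int(num_vcpus / CPU_PER_PROCESS)
    # '3x' means at least 3 GB of memory for every vCPU.
    min_memory_gb = MEMORY_GB_PER_VCPU * num_vcpus
    return {
        'max_processes': max_processes,
        'min_memory_gb': min_memory_gb,
        'memory_gb_per_process': min_memory_gb / max_processes,
    }


print(controller_capacity(8))
# {'max_processes': 32, 'min_memory_gb': 24, 'memory_gb_per_process': 0.75}
```

Under these assumptions the memory budget per process is 0.75 GB regardless of controller size, which is the link between the CPU figure and the memory multiplier.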
It's clear now! Can we be consistent to say vCPU vs. CPU core (which is typically 2 vCPUs) everywhere?
sky/spot/constants.py
Outdated
# Based on profiling, the memory should be at least 3x than the CPU cores (we
# allow 4x CPUs spot controller processes on the spot controller) to avoid OOM.
- # Based on profiling, the memory should be at least 3x than the CPU cores (we
- # allow 4x CPUs spot controller processes on the spot controller) to avoid OOM.
+ # Based on profiling, memory should be at least 3x (in GB) as num vCPUs to avoid OOM (we
+ # set the max number of spot controller processes to be "4x vCPUs" on the spot controller)
Not following how the statement in the parenthesis connects to the memory?
Rephrased it. PTAL : )
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
…ot into allow-more-spot-jobs
LGTM, thanks @Michaelvll.
Fixes #3187
We reduce the CPU requirement for each spot controller process to allow 2x more spot jobs to be submitted to a spot controller.
Tested (run the relevant ones):
bash format.sh
sky spot cancel -a
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh