-
Notifications
You must be signed in to change notification settings - Fork 415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SkyServe] Rolling Update #2935
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!! This is amazing 🚀 Just done a pass and left some comments. One thing to notice is that on controller, we might want to store as less information in memory as possible and store in database instead.
sky/cli.py
Outdated
# which incorrectly recognizes the help string as a docstring. | ||
# pylint: disable=bad-docstring-quotes | ||
@click.option( | ||
'--mixed-replica-version', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a shortcut (say -m
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, should we consider a more widely-adopted name? e.g. blue&green
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a shortcut. We should discuss the name to use, blue&green
seems too specific?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the current name for now and discuss more w/ other group members.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on offline discussion with @Michaelvll , let's not expose the flag to user, and set the default behavior to not mixing the replicas, and leave a Todo.
sky/cli.py
Outdated
'version become ready, direct traffic only to previous replicas ' | ||
'with same version and after that, direct traffic only to the ' | ||
'new replicas.')) | ||
def serve_update( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add resources override for this CLI too after #2979 is merged.
…o serve-update-new
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Final comments; All of them are nits. It would be great if we could have some smoke tests and attach manual tests you've run to the PR description ;) Besides that it looks great on my side! 🙌🏻 @Michaelvll could you also take a look when you got time?
sky/serve/autoscalers.py
Outdated
@@ -67,8 +67,19 @@ def __init__(self, spec: 'service_spec.SkyServiceSpec') -> None: | |||
""" | |||
self.min_replicas: int = spec.min_replicas | |||
self.max_replicas: int = spec.max_replicas or spec.min_replicas | |||
# Target number of replicas is initialized to min replicas. | |||
# TODO(MaoZiming): add init replica numbers in SkyServe spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feeling like updating init replica is an important feature; and the spot PR still takes some time to ship to master. cc @Michaelvll for a look
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this important feature @MaoZiming and thank you for the review @cblmemo! The code looks mostly good to me except some concurrency concerns and some questions. : )
sky/serve/autoscalers.py
Outdated
@@ -67,8 +67,19 @@ def __init__(self, spec: 'service_spec.SkyServiceSpec') -> None: | |||
""" | |||
self.min_replicas: int = spec.min_replicas | |||
self.max_replicas: int = spec.max_replicas or spec.min_replicas | |||
# Target number of replicas is initialized to min replicas. | |||
# TODO(MaoZiming): add init replica numbers in SkyServe spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is unclear to me what's difference between init_replicas
vs min_replicas
. I would prefer to reduce the specs in the arguments. Could we elaborate a bit for the TODO comment here for what the init_replicas
is about and why we need it?
sky/serve/core.py
Outdated
with ux_utils.print_exception_no_traceback(): | ||
raise RuntimeError(prompt) | ||
|
||
version = service_record['version'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can cause concurrency issue, e.g., if there are two updates called, they will have the same version number.
sky serve update
- Before the first step updates the database on the controller, call
sky serve update
again.
We probably want to refer to the implementation of the job_id
, where we first ask for the latest version+1
on the controller by using the atomic increment of the version in the database, so that multiple updates will have different version number and our update on the controller should only execute the update with the version number that is >=
the version number in the database.
skypilot/sky/backends/cloud_vm_ray_backend.py
Lines 3332 to 3333 in c1f28bc
code = job_lib.JobLibCodeGen.add_job(job_name, username, | |
self.run_timestamp, resources_str) |
skypilot/sky/skylet/job_lib.py
Lines 274 to 288 in c1f28bc
def add_job(job_name: str, username: str, run_timestamp: str, | |
resources_str: str) -> int: | |
"""Atomically reserve the next available job id for the user.""" | |
job_submitted_at = time.time() | |
# job_id will autoincrement with the null value | |
_CURSOR.execute('INSERT INTO jobs VALUES (null, ?, ?, ?, ?, ?, ?, null, ?)', | |
(job_name, username, job_submitted_at, JobStatus.INIT.value, | |
run_timestamp, None, resources_str)) | |
_CONN.commit() | |
rows = _CURSOR.execute('SELECT job_id FROM jobs WHERE run_timestamp=(?)', | |
(run_timestamp,)) | |
for row in rows: | |
job_id = row[0] | |
assert job_id is not None | |
return job_id |
skypilot/sky/skylet/job_lib.py
Line 60 in c1f28bc
job_id INTEGER PRIMARY KEY AUTOINCREMENT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Added
sky/serve/serve_state.py
Outdated
# Check if the entry with the specified service_name and version exists | ||
_DB.cursor.execute( | ||
"""SELECT * FROM versions WHERE service_name=? AND version=?""", | ||
(service_name, version)) | ||
existing_entry = _DB.cursor.fetchone() | ||
|
||
if existing_entry: | ||
_DB.cursor.execute( | ||
"""\ | ||
UPDATE versions SET spec=? | ||
WHERE service_name=? AND version=?""", ( | ||
pickle.dumps(spec), | ||
service_name, | ||
version, | ||
)) | ||
else: | ||
_DB.cursor.execute( | ||
"""\ | ||
INSERT INTO versions | ||
(service_name, version, spec) | ||
VALUES (?, ?, ?)""", (service_name, version, pickle.dumps(spec))) | ||
_DB.conn.commit() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use INSERT or REPLACE
?
skypilot/sky/global_user_state.py
Line 189 in f78a16e
'INSERT or REPLACE INTO clusters' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, if we always call add_version
before add the specs, why should we have the INSERT
part?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. The reason is that when service first initializes, we also need to add_version
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @MaoZiming! It mostly looks good to me with some concerns with the concurrency. Please check the comments below
@Michaelvll Thanks for the review! Addressed all the comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @MaoZiming! This is awesome. It should be good to go once the tests passed. Also, can we add at least one smoke test for the functionality?
@Michaelvll Thanks! added a new smoke test |
Based on discussion with @cblmemo, rewrite sky serve update based on #2581
This PR introduces
sky serve update
CLI for sky serve.Usage:
The autoscaler will perform a rolling update, gradually change your replicas to latest version.
Note to myself:
The current PR will launch a new replica even if the task yaml is the same. The reason is that the user workdir or the executable can change, and that might require re-launching the same program.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py -k 'test_skyserve_update'