[SkyServe] Rolling Update #2935

Merged
MaoZiming merged 47 commits into master from serve-update-new
Jan 24, 2024

Conversation

Collaborator

@MaoZiming MaoZiming commented Jan 3, 2024

Based on discussion with @cblmemo, rewrite sky serve update based on #2581

This PR introduces the sky serve update CLI command for SkyServe.

Usage:

sky serve update <service-name> <new-task-yaml>

The autoscaler performs a rolling update, gradually replacing your replicas with the latest version.
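
As a rough illustration of what one step of such a rolling update could look like, here is a minimal sketch. It is not the autoscaler code from this PR: `Replica`, `plan_rolling_update`, and the policy itself (launch new-version replicas up to the target, retire one old replica for each new replica that is ready) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Replica:
    """Hypothetical replica record; not SkyServe's actual data model."""
    replica_id: int
    version: int
    ready: bool


def plan_rolling_update(replicas: List[Replica], latest_version: int,
                        target_num: int) -> Tuple[int, List[int]]:
    """Return (number of new replicas to launch, ids of old replicas to stop).

    Assumed policy: keep launching latest-version replicas until there are
    `target_num` of them, and stop one old replica for every latest-version
    replica that is already ready.
    """
    new = [r for r in replicas if r.version == latest_version]
    old = sorted((r for r in replicas if r.version < latest_version),
                 key=lambda r: r.replica_id)
    num_to_launch = max(0, target_num - len(new))
    num_new_ready = sum(1 for r in new if r.ready)
    # Keep enough old replicas so that total ready capacity stays at target.
    num_old_to_keep = max(0, target_num - num_new_ready)
    num_old_to_stop = max(0, len(old) - num_old_to_keep)
    to_stop = [r.replica_id for r in old[:num_old_to_stop]]
    return num_to_launch, to_stop
```

Calling this repeatedly as replicas become ready would drain the old version one replica at a time; a real autoscaler would also need to handle failed launches and replicas that never become ready.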

Note to myself:

The current PR will launch a new replica even if the task yaml is the same. The reason is that the user workdir or the executable can change, and that might require re-launching the same program.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • pytest tests/test_smoke.py -k 'test_skyserve_update'

@MaoZiming MaoZiming requested a review from cblmemo January 3, 2024 15:49
Collaborator

@cblmemo cblmemo left a comment

Thanks!! This is amazing 🚀 Just did a pass and left some comments. One thing to note: on the controller, we might want to keep as little information in memory as possible and store it in the database instead.

sky/cli.py Outdated
# which incorrectly recognizes the help string as a docstring.
# pylint: disable=bad-docstring-quotes
@click.option(
    '--mixed-replica-version',
Collaborator

add a shortcut (say -m)?

Collaborator

Also, should we consider a more widely-adopted name? e.g. blue&green

Collaborator Author

Added a shortcut. We should discuss which name to use; blue&green seems too specific?

Collaborator

Let's keep the current name for now and discuss more w/ other group members.

Collaborator Author

sg

Collaborator Author

Based on offline discussion with @Michaelvll, let's not expose the flag to users, make not mixing replica versions the default behavior, and leave a TODO.
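
To make the "not mixing" default concrete, here is a minimal sketch of how traffic targets could be chosen. `select_serving_replicas` and the `min_ready_new` threshold are hypothetical names for illustration only; the actual logic in this PR may differ.

```python
from typing import Dict, List


def select_serving_replicas(ready_versions: Dict[str, int],
                            latest_version: int,
                            min_ready_new: int) -> List[str]:
    """Pick the replica URLs that receive traffic, never mixing versions.

    `ready_versions` maps the URL of each ready replica to its version.
    Until at least `min_ready_new` latest-version replicas are ready,
    traffic goes only to the previous replicas that share the newest old
    version; after that, traffic goes only to latest-version replicas.
    """
    new = sorted(url for url, v in ready_versions.items()
                 if v == latest_version)
    if len(new) >= min_ready_new:
        return new
    old = {url: v for url, v in ready_versions.items() if v < latest_version}
    if not old:
        # Nothing older is available, so serve from whatever is ready.
        return new
    newest_old_version = max(old.values())
    return sorted(url for url, v in old.items() if v == newest_old_version)
```

With this shape, a request is always served by replicas of a single version, at the cost of new replicas sitting idle until enough of them are ready.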

sky/cli.py Outdated
'version become ready, direct traffic only to previous replicas '
'with same version and after that, direct traffic only to the '
'new replicas.'))
def serve_update(
Collaborator

We should add resources override for this CLI too after #2979 is merged.

Collaborator

@cblmemo cblmemo left a comment

Final comments; all of them are nits. It would be great if we could have some smoke tests and attach the manual tests you've run to the PR description ;) Besides that, it looks great on my side! 🙌🏻 @Michaelvll could you also take a look when you get time?

@@ -67,8 +67,19 @@ def __init__(self, spec: 'service_spec.SkyServiceSpec') -> None:
        """
        self.min_replicas: int = spec.min_replicas
        self.max_replicas: int = spec.max_replicas or spec.min_replicas
        # Target number of replicas is initialized to min replicas.
        # TODO(MaoZiming): add init replica numbers in SkyServe spec.
Collaborator

I feel that updating the initial replica count is an important feature, and the spot PR will still take some time to ship to master. cc @Michaelvll for a look

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for adding this important feature @MaoZiming and thank you for the review @cblmemo! The code looks mostly good to me except some concurrency concerns and some questions. : )

@@ -67,8 +67,19 @@ def __init__(self, spec: 'service_spec.SkyServiceSpec') -> None:
        """
        self.min_replicas: int = spec.min_replicas
        self.max_replicas: int = spec.max_replicas or spec.min_replicas
        # Target number of replicas is initialized to min replicas.
        # TODO(MaoZiming): add init replica numbers in SkyServe spec.
Collaborator

It is unclear to me what the difference is between init_replicas and min_replicas. I would prefer to reduce the number of specs in the arguments. Could we elaborate a bit in the TODO comment here on what init_replicas is about and why we need it?

with ux_utils.print_exception_no_traceback():
    raise RuntimeError(prompt)

version = service_record['version']
Collaborator

This can cause a concurrency issue: if two updates are called at nearly the same time, they will get the same version number. For example:

  1. Call sky serve update.
  2. Before the first call has updated the database on the controller, call sky serve update again.

We probably want to follow the implementation of job_id: first ask the controller for the next version (latest version + 1) using an atomic increment of the version in the database, so that concurrent updates get different version numbers. The controller should then only execute an update whose version number is >= the version number in the database.

code = job_lib.JobLibCodeGen.add_job(job_name, username,
                                     self.run_timestamp, resources_str)

def add_job(job_name: str, username: str, run_timestamp: str,
            resources_str: str) -> int:
    """Atomically reserve the next available job id for the user."""
    job_submitted_at = time.time()
    # job_id will autoincrement with the null value
    _CURSOR.execute('INSERT INTO jobs VALUES (null, ?, ?, ?, ?, ?, ?, null, ?)',
                    (job_name, username, job_submitted_at, JobStatus.INIT.value,
                     run_timestamp, None, resources_str))
    _CONN.commit()
    rows = _CURSOR.execute('SELECT job_id FROM jobs WHERE run_timestamp=(?)',
                           (run_timestamp,))
    for row in rows:
        job_id = row[0]
    assert job_id is not None
    return job_id

job_id INTEGER PRIMARY KEY AUTOINCREMENT,
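
Applied to service versions, the same idea could look roughly like the sketch below. The `service_versions` table and the `reserve_next_version` helper are hypothetical and only illustrate the atomic increment; they are not the schema or code added in this PR.

```python
import sqlite3


def create_tables(conn: sqlite3.Connection) -> None:
    # Hypothetical table: one row per service holding its latest version.
    conn.execute('CREATE TABLE IF NOT EXISTS service_versions ('
                 'service_name TEXT PRIMARY KEY, '
                 'current_version INTEGER NOT NULL)')
    conn.commit()


def reserve_next_version(conn: sqlite3.Connection, service_name: str) -> int:
    """Atomically reserve and return the next version for a service."""
    # `with conn` wraps the statements in one transaction: it commits on
    # success and rolls back if an exception is raised.
    with conn:
        conn.execute(
            'INSERT OR IGNORE INTO service_versions VALUES (?, 0)',
            (service_name,))
        conn.execute(
            'UPDATE service_versions '
            'SET current_version = current_version + 1 '
            'WHERE service_name = ?', (service_name,))
        row = conn.execute(
            'SELECT current_version FROM service_versions '
            'WHERE service_name = ?', (service_name,)).fetchone()
    return row[0]
```

Because the increment and the read happen inside one write transaction, two concurrent sky serve update calls would each get a distinct version number instead of both reading the same value and adding one.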

Collaborator Author

I see. Added

sky/serve/serve_state.py Outdated
Comment on lines 440 to 461
# Check if the entry with the specified service_name and version exists
_DB.cursor.execute(
    """SELECT * FROM versions WHERE service_name=? AND version=?""",
    (service_name, version))
existing_entry = _DB.cursor.fetchone()

if existing_entry:
    _DB.cursor.execute(
        """\
        UPDATE versions SET spec=?
        WHERE service_name=? AND version=?""", (
            pickle.dumps(spec),
            service_name,
            version,
        ))
else:
    _DB.cursor.execute(
        """\
        INSERT INTO versions
        (service_name, version, spec)
        VALUES (?, ?, ?)""", (service_name, version, pickle.dumps(spec)))
_DB.conn.commit()
Collaborator

Can we use INSERT OR REPLACE?

'INSERT or REPLACE INTO clusters'

Collaborator

Actually, if we always call add_version before adding the specs, why do we need the INSERT part?

Collaborator Author

Updated. The reason is that when the service first initializes, we also need to call add_version.
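
For reference, the INSERT OR REPLACE suggestion could be sketched as below. The composite primary key on (service_name, version) and the helper names `add_or_update_version` and `get_spec` are assumptions for illustration; the real versions table in serve_state.py may be defined differently.

```python
import pickle
import sqlite3
from typing import Any, Optional


def create_versions_table(conn: sqlite3.Connection) -> None:
    # Assumed schema: INSERT OR REPLACE only replaces a row when a
    # PRIMARY KEY or UNIQUE constraint conflicts.
    conn.execute('CREATE TABLE IF NOT EXISTS versions ('
                 'service_name TEXT, version INTEGER, spec BLOB, '
                 'PRIMARY KEY (service_name, version))')
    conn.commit()


def add_or_update_version(conn: sqlite3.Connection, service_name: str,
                          version: int, spec: Any) -> None:
    """Insert the spec for (service_name, version), or overwrite it."""
    conn.execute(
        'INSERT OR REPLACE INTO versions (service_name, version, spec) '
        'VALUES (?, ?, ?)', (service_name, version, pickle.dumps(spec)))
    conn.commit()


def get_spec(conn: sqlite3.Connection, service_name: str,
             version: int) -> Optional[Any]:
    """Return the stored spec, or None if the version does not exist."""
    row = conn.execute(
        'SELECT spec FROM versions WHERE service_name=? AND version=?',
        (service_name, version)).fetchone()
    return None if row is None else pickle.loads(row[0])
```

This collapses the SELECT-then-UPDATE-or-INSERT branch into one statement. Without a key or unique constraint covering (service_name, version), the statement would insert duplicate rows instead of replacing.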

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the update @MaoZiming! It mostly looks good to me, with some concerns about concurrency. Please check the comments below.

@MaoZiming
Collaborator Author

@Michaelvll Thanks for the review! Addressed all the comments.

Collaborator

@Michaelvll Michaelvll left a comment

Thanks @MaoZiming! This is awesome. It should be good to go once the tests pass. Also, can we add at least one smoke test for the functionality?

@MaoZiming
Collaborator Author

@Michaelvll Thanks! Added a new smoke test.

@MaoZiming MaoZiming merged commit 3765f03 into master Jan 24, 2024
19 checks passed
@MaoZiming MaoZiming deleted the serve-update-new branch January 24, 2024 06:14