-
Notifications
You must be signed in to change notification settings - Fork 280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CLIPPER-113] Implement model active version tracking and updates #115
Conversation
Note that in order to test this with the revised tutorial, you will need to build the |
Test FAILed. |
@nishadsingh1 Can you make a version of the deploy sklearn model image from the tutorial for the TensorFlow model? In other words, delete the sklearn box from this image. With this new model versioning PR, the TF model replaces the sklearn model and I would like to illustrate that. Also, what did you use to make those images? Can you add whatever powerpoint/google slides file you used to the Clipper Google Drive? |
jenkins test this please |
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests pass and functionality seems to be mostly complete. Comments address a couple of concerns involving behavior with deploy_model()
in conjunction with set_model_version
. Additionally:
-
Line 491 of clipper_manager.py should refer to
Clipper.deploy_model()
instead ofClipper.add_model()
. -
What’s the intent of the “Warning” image in the tutorial_part_2 notebook? Seems unclear to me.
policy, latency_slo_micros, input_type); | ||
std::vector<VersionedModelId> versioned_models; | ||
{ | ||
std::unique_lock<std::mutex> l(current_model_versions_mutex_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to hold this lock for the entire query processing and response creation procedure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. The lock is in a nested block so it will go out of scope and be released on line 171 before the prediction is made.
models, policy, input_type); | ||
std::vector<VersionedModelId> versioned_models; | ||
{ | ||
std::unique_lock<std::mutex> l(current_model_versions_mutex_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, do we need to hold this lock for the entire update procedure or just until the versioned_models
map is populated? If the answer is the latter, releasing the lock earlier may avoid resource contention between the update
and predict
functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lock goes out of scope on line 210.
/** | ||
* Subscribes to changes in model versions. | ||
* | ||
* The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: check documentation formatting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -1,3 +1,4 @@ | |||
#include <algorithm> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this include necessary? (Maybe gcc is throwing an error where clang isn't?, etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::sort
which we use in this file is defined in <algorithm>
. I prefer to explicitly include the header dependencies for a file as much as possible and I noticed this one was missing.
src/libclipper/src/redis.cpp
Outdated
std::string key = gen_model_current_version_key(model_name); | ||
auto result = send_cmd_with_reply<string>(redis, {"GET", key}); | ||
if (result) { | ||
return std::stoi(*result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we verify that result
is a nonnegative integer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good idea.
src/libclipper/src/redis.cpp
Outdated
std::move(callback)); | ||
} | ||
|
||
void subscribe_to_container_changes( | ||
Subscriber& subscriber, | ||
std::function<void(const std::string&, const std::string&)> callback) { | ||
subscribe_to_keyspace_changes(REDIS_CONTAINER_DB_NUM, subscriber, | ||
subscribe_to_keyspace_changes(REDIS_CONTAINER_DB_NUM, "", subscriber, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: For clarity, defining const std::string NO_PREFIX = "";
and using it in place of ""
is clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I actually disagree here. Defining NO_PREFIX as a constant seems like treating this as a special case. I'll define a local variable std::string prefix = ""
in each method instead.
management/clipper_manager.py
Outdated
@@ -555,6 +557,37 @@ def inspect_instance(self): | |||
s = r.text | |||
return s | |||
|
|||
def set_model_version(model_name, model_version, num_containers=0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make this the only point through which model rollback / rollforward can be achieved. Currently, we can also call deploy_model()
while specifying a previously-added version of a model. However, deploy_model()
has no checks in place to ensure that the model being added is actually equivalent to the old model corresponding to the given version. Hypothetically, version x of a model can therefore change throughout time, and this seems to defeat the purpose of versioning.
We can fix this by adding checks to deploy_model()
that fail if a model is deployed with a previously-specified version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is a really good point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added checks to the management frontend's add_application
and add_model
methods to reject duplicate adds.
container_name, model_path4)); | ||
|
||
ASSERT_FALSE(rh_.set_model_version("m", 11)); | ||
ASSERT_EQ(get_current_model_version(*redis_, "m"), -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't we expect get_current_model_version()
to return 4
, corresponding to the last successful call to add_model()
with model4
? Is attempting to set an invalid model version supposed to change the model version to -1
? If so, we should consistently maintain the previous version in the face of an update failure, if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was adding the model directly to Redis so set_model_version
wasn't being called. I fixed this by calling the managament frontend's RequestHandler::add_model
with JSON and it now should reflect the expected behavior.
management/clipper_manager.py
Outdated
@@ -555,6 +557,37 @@ def inspect_instance(self): | |||
s = r.text | |||
return s | |||
|
|||
def set_model_version(model_name, model_version, num_containers=0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing self
in parameter list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@Corey-Zumar I addressed your comments. I also added a check to prevent duplicate adding of applications or model versions. This fixes #112. |
Test FAILed. |
Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, LGTM. Will merge after jenkins is able to build successfully.
@Corey-Zumar BTW the warning in the tutorial is because @nishadsingh1 said he accidentally ran cleanup too early the first time he ran the tutorial. The idea was to have a visual separation from the tutorial code and the cleanup. It's been there for awhile, I just moved the location around slightly. |
Test PASSed. |
@Corey-Zumar Hold off on merging this until the tutorial is updated. Should be sometime today. |
@Corey-Zumar Tutorial is updated. This PR is good to go once Jenkins passes it. |
Test PASSed. |
This PR associated a
current_model_version
with each model and makes applications specify candidate model names but not versions. Instead, the current version of a model is deployed. Model versions are automatically updated when a newer model version is deployed. I also implemented aset_model_version
REST call in the management frontend that allows users to manually set the current model version to any version that has previously been deployed. This allows for model rollback and roll-forward.I removed the feedback section of the selection policy and the tutorial wording still needs a little bit of work.