Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CLIPPER-113] Implement model active version tracking and updates #115

Merged
merged 12 commits into from
Apr 12, 2017

Conversation

dcrankshaw
Copy link
Contributor

This PR associated a current_model_version with each model and makes applications specify candidate model names but not versions. Instead, the current version of a model is deployed. Model versions are automatically updated when a newer model version is deployed. I also implemented a set_model_version REST call in the management frontend that allows users to manually set the current model version to any version that has previously been deployed. This allows for model rollback and roll-forward.

I removed the feedback section of the selection policy and the tutorial wording still needs a little bit of work.

@dcrankshaw
Copy link
Contributor Author

Note that in order to test this with the revised tutorial, you will need to build the query_frontend and management_frontend containers locally. The ones deployed to DockerHub are from the develop branch.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/127/
Test FAILed.

@dcrankshaw
Copy link
Contributor Author

@nishadsingh1 Can you make a version of the deploy sklearn model image from the tutorial for the TensorFlow model? In other words, delete the sklearn box from this image. With this new model versioning PR, the TF model replaces the sklearn model and I would like to illustrate that.

Also, what did you use to make those images? Can you add whatever powerpoint/google slides file you used to the Clipper Google Drive?

@dcrankshaw
Copy link
Contributor Author

jenkins test this please

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/128/
Test PASSed.

Copy link
Contributor

@Corey-Zumar Corey-Zumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests pass and functionality seems to be mostly complete. Comments address a couple of concerns involving behavior with deploy_model() in conjunction with set_model_version. Additionally:

  • Line 491 of clipper_manager.py should refer to Clipper.deploy_model() instead of Clipper.add_model().

  • What’s the intent of the “Warning” image in the tutorial_part_2 notebook? Seems unclear to me.

policy, latency_slo_micros, input_type);
std::vector<VersionedModelId> versioned_models;
{
std::unique_lock<std::mutex> l(current_model_versions_mutex_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to hold this lock for the entire query processing and response creation procedure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. The lock is in a nested block so it will go out of scope and be released on line 171 before the prediction is made.

models, policy, input_type);
std::vector<VersionedModelId> versioned_models;
{
std::unique_lock<std::mutex> l(current_model_versions_mutex_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, do we need to hold this lock for the entire update procedure or just until the versioned_models map is populated? If the answer is the latter, releasing the lock earlier may avoid resource contention between the update and predict functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lock goes out of scope on line 210.

/**
* Subscribes to changes in model versions.
*
* The
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: check documentation formatting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -1,3 +1,4 @@
#include <algorithm>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this include necessary? (Maybe gcc is throwing an error where clang isn't?, etc)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::sort which we use in this file is defined in <algorithm>. I prefer to explicitly include the header dependencies for a file as much as possible and I noticed this one was missing.

std::string key = gen_model_current_version_key(model_name);
auto result = send_cmd_with_reply<string>(redis, {"GET", key});
if (result) {
return std::stoi(*result);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we verify that result is a nonnegative integer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good idea.

std::move(callback));
}

void subscribe_to_container_changes(
Subscriber& subscriber,
std::function<void(const std::string&, const std::string&)> callback) {
subscribe_to_keyspace_changes(REDIS_CONTAINER_DB_NUM, subscriber,
subscribe_to_keyspace_changes(REDIS_CONTAINER_DB_NUM, "", subscriber,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: For clarity, defining const std::string NO_PREFIX = ""; and using it in place of "" is clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I actually disagree here. Defining NO_PREFIX as a constant seems like treating this as a special case. I'll define a local variable std::string prefix = "" in each method instead.

@@ -555,6 +557,37 @@ def inspect_instance(self):
s = r.text
return s

def set_model_version(model_name, model_version, num_containers=0):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make this the only point through which model rollback / rollforward can be achieved. Currently, we can also call deploy_model() while specifying a previously-added version of a model. However, deploy_model() has no checks in place to ensure that the model being added is actually equivalent to the old model corresponding to the given version. Hypothetically, version x of a model can therefore change throughout time, and this seems to defeat the purpose of versioning.

We can fix this by adding checks to deploy_model() that fail if a model is deployed with a previously-specified version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is a really good point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added checks to the management frontend's add_application and add_model methods to reject duplicate adds.

container_name, model_path4));

ASSERT_FALSE(rh_.set_model_version("m", 11));
ASSERT_EQ(get_current_model_version(*redis_, "m"), -1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we expect get_current_model_version() to return 4, corresponding to the last successful call to add_model() with model4? Is attempting to set an invalid model version supposed to change the model version to -1? If so, we should consistently maintain the previous version in the face of an update failure, if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was adding the model directly to Redis so set_model_version wasn't being called. I fixed this by calling the managament frontend's RequestHandler::add_model with JSON and it now should reflect the expected behavior.

@@ -555,6 +557,37 @@ def inspect_instance(self):
s = r.text
return s

def set_model_version(model_name, model_version, num_containers=0):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing self in parameter list

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@dcrankshaw
Copy link
Contributor Author

@Corey-Zumar I addressed your comments. I also added a check to prevent duplicate adding of applications or model versions. This fixes #112.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/136/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/137/
Test FAILed.

Copy link
Contributor

@Corey-Zumar Corey-Zumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, LGTM. Will merge after jenkins is able to build successfully.

@dcrankshaw
Copy link
Contributor Author

@Corey-Zumar BTW the warning in the tutorial is because @nishadsingh1 said he accidentally ran cleanup too early the first time he ran the tutorial. The idea was to have a visual separation from the tutorial code and the cleanup. It's been there for awhile, I just moved the location around slightly.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/140/
Test PASSed.

@dcrankshaw
Copy link
Contributor Author

@Corey-Zumar Hold off on merging this until the tutorial is updated. Should be sometime today.

@dcrankshaw
Copy link
Contributor Author

@Corey-Zumar Tutorial is updated. This PR is good to go once Jenkins passes it.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/141/
Test PASSed.

@Corey-Zumar Corey-Zumar merged commit 1648285 into ucbrise:develop Apr 12, 2017
@dcrankshaw dcrankshaw deleted the model_versioning branch May 17, 2017 06:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants