
Clipper returns default predictions when a new version of a model is deployed but not connected #152

Closed
dcrankshaw opened this issue May 12, 2017 · 7 comments

@dcrankshaw
Contributor

When a user updates a model version, Clipper is informed that the model version has changed and it immediately starts to query the new version. However, it can take several minutes for the container for that new version to initialize and connect to Clipper. In that intervening period, Clipper attempts to query the latest version, cannot find any containers for that model version, and instead returns the default prediction.

Instead, it may be desirable for Clipper to wait until the new container has finished initializing and connects to Clipper before switching to the new version.

@rmdort
Contributor

rmdort commented May 13, 2017

Maybe clipper_manager could have functions to undeploy old models. I was thinking that when a new model is successfully deployed, Clipper could automatically stop the old containers/models.

Alternatively, undeploy could be called manually from the clipper manager.

Maybe a few new APIs:

  1. Undeploy a model
  2. Remove application
  3. Pause application

@withsmilo
Collaborator

@dcrankshaw
This issue is critical for our production deployment system. We resolved it by applying a 'blue-green' deployment policy. I implemented my own SwarmContainerManager, inspired by DockerContainerManager.

  • we call python_deployer.deploy_python_closure().
    • [python_deployer] calls build_and_deploy_model() of clipper_admin.
      • [clipper_admin] calls self.build_model().
      • [clipper_admin] calls self.deploy_model().
      • [clipper_admin] calls deploy_model() of SwarmContainerManager.
        • [SwarmContainerManager] creates a new swarm service.
        • [SwarmContainerManager] checks whether
          • all tasks of the new swarm service are in the 'running' state, and
          • the number of running replicas matches the predefined replica count.
        • [SwarmContainerManager] adds it to the metric config.
      • [clipper_admin] calls self.register_model() to register the new model.

In my experience, the time to initialize a new swarm service varies widely, so I think Clipper needs a routine to check a new model's status before registering it with Clipper.
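The readiness check in the flow above could be factored into a small polling helper. This is only a rough sketch, not Clipper code: the `get_running_count` callable (e.g. something that lists the swarm service's tasks via the Docker SDK and counts those in the 'running' state) and the timeout values are assumptions.

```python
import time


def wait_for_replicas(get_running_count, expected_replicas,
                      timeout=300, interval=5):
    """Poll until a service reports the expected number of running
    replicas, or give up after `timeout` seconds.

    `get_running_count` is any zero-argument callable returning the
    number of tasks currently in the 'running' state.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_running_count() >= expected_replicas:
            return True
        time.sleep(interval)
    return False
```

Because the swarm-service startup time is so variable, a generous timeout with a bounded retry loop seems safer than a fixed sleep.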

@dcrankshaw
Contributor Author

Yeah, agreed. Have you found that Swarm's container status is sufficient to indicate whether a container is running yet? We've found that for several deep learning models, especially when running on a GPU, there is a non-trivial amount of time after a container has started before it finishes initializing the model and connects to Clipper. Using the underlying container manager to detect when the containers are ready, as you do, would be relatively simple to implement, but I was worried that it was not sufficient.

@dcrankshaw
Contributor Author

@chester-leung For a first version of a fix, you should modify the container_manager.deploy_model function to block until the container is actually running. Right now, we start the container then return immediately, rather than waiting until the container is fully running. You'll need to implement this for both the Kubernetes container manager and the Docker container manager.
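A minimal sketch of what that blocking behavior could look like, assuming docker-py's Container objects (which expose `reload()` and `status`); the helper name and timeout values are made up, and any stub with the same shape works in place of a real container:

```python
import time


def block_until_running(container, timeout=120, interval=2):
    """Block until `container` reports status 'running'.

    `container` is expected to expose `reload()` and `status`, as
    docker-py's Container objects do. Raises TimeoutError if the
    container does not come up in time.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        container.reload()  # refresh cached state from the daemon
        if container.status == "running":
            return
        time.sleep(interval)
    raise TimeoutError("container did not reach 'running' before timeout")
```

`deploy_model` could call this after starting each replica, so it only returns once every container is up; the Kubernetes manager would need an analogous check against pod phases.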

@withsmilo
Collaborator

@dcrankshaw
Thank you for your advice; I agree. How about using Docker's HEALTHCHECK option in the Dockerfile, and then checking the health status in our *ContainerManager to decide whether a container is running?

According to the reference,

When a container has a healthcheck specified, it has a health status in addition to its normal status. This status is initially starting. Whenever a health check passes, it becomes healthy (whatever state it was previously in). After a certain number of consecutive failures, it becomes unhealthy.
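A sketch of how a container manager could consume that health status, assuming the `State` dict shape returned by `docker inspect` (and exposed as `container.attrs` in docker-py). Containers built without a HEALTHCHECK have no `Health` key, so the check falls back to the plain container status:

```python
def is_healthy(attrs):
    """Return True once a container should be considered ready.

    `attrs` is the inspect dict for a container. If a healthcheck is
    configured, wait for 'healthy'; otherwise fall back to requiring
    the plain 'running' status.
    """
    state = attrs.get("State", {})
    health = state.get("Health")
    if health is not None:
        return health.get("Status") == "healthy"
    return state.get("Status") == "running"
```

The fallback matters because the health status starts at 'starting', so a naive check for 'running' alone would still race with model initialization on healthcheck-enabled images.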

@dcrankshaw
Contributor Author

That's a good idea. @chester-leung for a first step, let's get a version working that just looks at the container state. As a second step, we can modify the RPC implementation to write a file somewhere once the container has connected to Clipper as the healthcheck.
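A sketch of that file-based healthcheck idea, with hypothetical paths and function names: the RPC layer touches a sentinel file once it has registered with Clipper, and a Dockerfile healthcheck along the lines of `HEALTHCHECK CMD test -f /model/.clipper_connected` probes for it.

```python
from pathlib import Path

# Hypothetical sentinel location; the real path would be baked into
# the model container image and its HEALTHCHECK command.
DEFAULT_SENTINEL = Path("/model/.clipper_connected")


def mark_connected(sentinel=DEFAULT_SENTINEL):
    """Called by the RPC layer once the container has successfully
    connected to and registered with Clipper."""
    Path(sentinel).touch()


def healthcheck(sentinel=DEFAULT_SENTINEL):
    """Equivalent of `test -f <sentinel>`: healthy once the RPC layer
    has written the sentinel file."""
    return Path(sentinel).exists()
```

This moves the readiness signal from "container process started" to "model connected to Clipper", which is the condition that actually matters for routing queries.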

@dcrankshaw dcrankshaw moved this from Backlog to In progress in 0.3 Release Feb 22, 2018
@chester-leung
Member

So far, I've implemented a fix to force the container manager to sleep until all added containers are deemed ready. This should fix the problem of querying a model in a new container before the new container is fully functional.
