
Clipper returns default predictions when a new version of a model is deployed but not connected #152

Closed
dcrankshaw opened this issue May 12, 2017 · 7 comments

@dcrankshaw
Contributor

When a user updates a model version, Clipper is informed that the model version has changed and it immediately starts to query the new version. However, it can take several minutes for the container for that new version to initialize and connect to Clipper. In that intervening period, Clipper attempts to query the latest version, cannot find any containers for that model version, and instead returns the default prediction.

Instead, it may be desirable for Clipper to wait until the new container has finished initializing and connects to Clipper before switching to the new version.

@rmdort
Contributor

rmdort commented May 13, 2017

Maybe clipper_manager could have functions to undeploy old models. I was thinking that when a new model is successfully deployed, Clipper could automatically stop the old containers/models.

Alternatively, undeploy could be called manually from the clipper manager.

Maybe a few new APIs:

  1. Undeploy a model
  2. Remove application
  3. Pause application

@withsmilo
Collaborator

@dcrankshaw
This issue is critical for our production deployment system. We resolved it by applying a 'blue-green' deployment policy. I implemented my own SwarmContainerManager, inspired by DockerContainerManager.

  • we call python_deployer.deploy_python_closure().
    • [python_deployer] calls build_and_deploy_model() of clipper_admin.
      • [clipper_admin] calls self.build_model().
      • [clipper_admin] calls self.deploy_model().
      • [clipper_admin] calls deploy_model() of SwarmContainerManager.
        • [SwarmContainerManager] creates a new swarm service.
        • [SwarmContainerManager] checks whether
          • all tasks of the new swarm service are in the 'running' state, and
          • the number of running replicas matches the predefined replica count.
        • [SwarmContainerManager] adds it to the metric config.
      • [clipper_admin] calls self.register_model() to register the new model.

In my experience, the time to initialize a new swarm service varies widely, so I think Clipper needs a routine to check a new model's status before registering it with Clipper.
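The readiness check in the flow above could be factored into a small polling helper. This is only a rough sketch, not Clipper code: the `get_running_count` callable (e.g. something that lists the swarm service's tasks via the Docker SDK and counts those in the 'running' state) and the timeout values are assumptions.

```python
import time


def wait_for_replicas(get_running_count, expected_replicas,
                      timeout=300, interval=5):
    """Poll until a service reports the expected number of running
    replicas, or give up after `timeout` seconds.

    `get_running_count` is any zero-argument callable returning the
    number of tasks currently in the 'running' state.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_running_count() >= expected_replicas:
            return True
        time.sleep(interval)
    return False
```

Because the swarm-service startup time is so variable, a generous timeout with a bounded retry loop seems safer than a fixed sleep.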

@dcrankshaw
Contributor Author

Yeah, agreed. Have you found that Swarm's container status is sufficient to indicate whether a container is running yet? We've found that for several deep learning models, especially when running on a GPU, there is a non-trivial amount of time after a container has started before it finishes initializing the model and connects to Clipper. Using the underlying container manager to detect when the containers are ready, as you do, would be relatively simple to implement, but I was worried that it was not sufficient.

@dcrankshaw
Contributor Author

@chester-leung For a first version of a fix, you should modify the container_manager.deploy_model function to block until the container is actually running. Right now, we start the container then return immediately, rather than waiting until the container is fully running. You'll need to implement this for both the Kubernetes container manager and the Docker container manager.
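A minimal sketch of what that blocking behavior could look like, assuming docker-py's Container objects (which expose `reload()` and `status`); the helper name and timeout values are made up, and any stub with the same shape works in place of a real container:

```python
import time


def block_until_running(container, timeout=120, interval=2):
    """Block until `container` reports status 'running'.

    `container` is expected to expose `reload()` and `status`, as
    docker-py's Container objects do. Raises TimeoutError if the
    container does not come up in time.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        container.reload()  # refresh cached state from the daemon
        if container.status == "running":
            return
        time.sleep(interval)
    raise TimeoutError("container did not reach 'running' before timeout")
```

`deploy_model` could call this after starting each replica, so it only returns once every container is up; the Kubernetes manager would need an analogous check against pod phases.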

@withsmilo
Collaborator

@dcrankshaw
Thank you for your advice; I agree. How about using Docker's HEALTHCHECK option in the Dockerfile, and then checking the health status in our *ContainerManager to decide whether a container is running?

According to the reference,

When a container has a healthcheck specified, it has a health status in addition to its normal status. This status is initially starting. Whenever a health check passes, it becomes healthy (whatever state it was previously in). After a certain number of consecutive failures, it becomes unhealthy.
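A sketch of how a container manager could consume that health status, assuming the `State` dict shape returned by `docker inspect` (and exposed as `container.attrs` in docker-py). Containers built without a HEALTHCHECK have no `Health` key, so the check falls back to the plain container status:

```python
def is_healthy(attrs):
    """Return True once a container should be considered ready.

    `attrs` is the inspect dict for a container. If a healthcheck is
    configured, wait for 'healthy'; otherwise fall back to requiring
    the plain 'running' status.
    """
    state = attrs.get("State", {})
    health = state.get("Health")
    if health is not None:
        return health.get("Status") == "healthy"
    return state.get("Status") == "running"
```

The fallback matters because the health status starts at 'starting', so a naive check for 'running' alone would still race with model initialization on healthcheck-enabled images.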

@dcrankshaw
Contributor Author

That's a good idea. @chester-leung for a first step, let's get a version working that just looks at the container state. As a second step, we can modify the RPC implementation to write a file somewhere once the container has connected to Clipper as the healthcheck.
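A sketch of that file-based healthcheck idea, with hypothetical paths and function names: the RPC layer touches a sentinel file once it has registered with Clipper, and a Dockerfile healthcheck along the lines of `HEALTHCHECK CMD test -f /model/.clipper_connected` probes for it.

```python
from pathlib import Path

# Hypothetical sentinel location; the real path would be baked into
# the model container image and its HEALTHCHECK command.
DEFAULT_SENTINEL = Path("/model/.clipper_connected")


def mark_connected(sentinel=DEFAULT_SENTINEL):
    """Called by the RPC layer once the container has successfully
    connected to and registered with Clipper."""
    Path(sentinel).touch()


def healthcheck(sentinel=DEFAULT_SENTINEL):
    """Equivalent of `test -f <sentinel>`: healthy once the RPC layer
    has written the sentinel file."""
    return Path(sentinel).exists()
```

This moves the readiness signal from "container process started" to "model connected to Clipper", which is the condition that actually matters for routing queries.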

@dcrankshaw dcrankshaw moved this from Backlog to In progress in 0.3 Release Feb 22, 2018
@chester-leung
Member

So far, I've implemented a fix to force the container manager to sleep until all added containers are deemed ready. This should fix the problem of querying a model in a new container before the new container is fully functional.
