
Review model API design #259

Closed · 1 task
bfirsh opened this issue Sep 23, 2021 · 3 comments · Fixed by #378
Comments

@bfirsh
Member

bfirsh commented Sep 23, 2021

The current model API was hastily written, and we have now learned many things that could be incorporated into its design.

User data

  • Replicate! We currently use an undocumented Redis queue consumer.
  • An industry user wants a gRPC API with support for non-blocking requests for long-running models.
  • An industry user wants an AMQP RPC API. It is unclear whether the model itself needs to interact over this API, or whether a sidecar could speak AMQP on the model's behalf.

Requirements

Essential

  • The primary API for Cog should be the same thing we use for Replicate.
  • Make predictions via a standard blocking REST API: JSON over HTTP. (A rough sketch of a call follows this list.)
  • Make it possible to use with a queuing system, either built in, as a sidecar at the container level, or as an extension at the Python level.
  • Pass files as either URLs or base64 encoding, depending on whether efficiency is important. (Interdependent with Design type system & signature #205.)
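
To make the blocking-API and file-passing bullets above concrete, here is a minimal sketch of what a prediction call with a base64-encoded file might look like. The endpoint path, input field names, and response shape are all assumptions for illustration, not a settled design:

```python
# Hypothetical sketch of a blocking JSON-over-HTTP prediction call.
# The endpoint path, input field names, and response shape are assumptions.
import base64

import requests

# Inline a small input file as base64 (the alternative is passing a URL).
with open("input.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    "http://localhost:5000/predict",  # illustrative endpoint
    json={"input": {"image": f"data:image/jpeg;base64,{image_b64}"}},
)
resp.raise_for_status()
print(resp.json())  # e.g. {"status": "succeeded", "output": ...}
```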

Future, to design in context

  • gRPC interface. This is harder to use, so it shouldn't be the primary interface, but it seems to be gaining wide adoption in ML.
  • Other serving platforms, like AI Platform or KServe.
  • Non-blocking requests for long-running models (one possible shape is sketched after this list).
  • Queuing for GPUs on HTTP server #230
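
For the non-blocking case, one possible shape is create-then-poll. This is purely illustrative; the paths, fields, and polling approach are assumptions, not a proposal:

```python
# Hypothetical create-then-poll shape for a non-blocking prediction API.
# Endpoint paths and response fields are assumptions, not a settled design.
import time

import requests

base = "http://localhost:5000"

# Start a long-running prediction; the server returns immediately with an id.
created = requests.post(f"{base}/predictions", json={"input": {"prompt": "hello"}})
created.raise_for_status()
prediction_id = created.json()["id"]

# Poll until the prediction reaches a terminal state.
while True:
    status = requests.get(f"{base}/predictions/{prediction_id}").json()
    if status["status"] in ("succeeded", "failed"):
        break
    time.sleep(1)

print(status.get("output"))
```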

Prior art

We don't need to reinvent the wheel.

Off the shelf

Real world

Areas for discussion

  • What are the trade-offs between an HTTP based API and a queue based API?
  • How do we support queuing systems?
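
On the second question, one concrete option is the sidecar from the requirements above: a small worker that bridges a queue to the model's blocking HTTP API, so the model code stays queue-agnostic. A minimal sketch, assuming Redis lists for transport (echoing the undocumented Redis queue consumer Replicate uses today); the queue names, message shape, and endpoint are all assumptions:

```python
# Sketch of a queue sidecar: pop prediction requests from a Redis list and
# forward them to the model's blocking HTTP API. All names are illustrative.
import json

import redis
import requests

r = redis.Redis(host="localhost", port=6379)

while True:
    # Block until a request arrives on the (hypothetical) input queue.
    _, raw = r.blpop("predictions:input")
    request = json.loads(raw)

    # Delegate the actual prediction to the model's HTTP server.
    resp = requests.post(
        "http://localhost:5000/predict",  # illustrative endpoint
        json={"input": request["input"]},
    )

    # Push the result onto a per-request (hypothetical) response queue.
    r.rpush(f"predictions:output:{request['id']}", json.dumps(resp.json()))
```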

Potential solutions

Next steps

  • Use default Cog server for Replicate

See also

(Maintainers -- please edit this and consider it a wiki! Edited by @bfirsh, ...)

@andreasjansson
Member

How is it done at Spotify @andreasjansson?

A bunch of different ways. I built a couple of custom queue-based batch processing pipelines using RabbitMQ. We also used AI Platform for some workloads, Cloud Dataflow with Klio for large scale batch processing, KubeFlow for many of the non-deep learning models, real-time systems where features were precomputed in batch and retrieved from fast feature stores, etc.

I've tended towards queue-based systems, because I've worked mostly with deep learning models where latency wasn't critical, and models consumed lots of resources, limiting possible concurrency. But I can also see use cases for a bundled real-time HTTP API.

Two trade-offs I'm thinking about:

First, how can we design Cog such that it can be used in heterogeneous environments? Ideally the same Cog model should be deployable on Replicate, AI Platform, SageMaker, Cloud Dataflow, Seldon, custom GKE setups, and so on. The sidecar pattern would help in some of these environments, but not on the managed serving platforms.

Can we make a core Cog server that can be wrapped with adapters for various environments? Perhaps we provide well-documented HTTP and AMQP APIs out of the box, and maintain adapters for different platforms together with the community.
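
A minimal sketch of that adapter idea, assuming RabbitMQ (via pika) on the outside and the blocking HTTP API on the inside, using the classic AMQP RPC reply-to pattern. The queue name, endpoint path, and message shape are all assumptions:

```python
# Sketch of an AMQP-facing adapter: consume prediction requests from a queue
# and forward them to the core Cog HTTP server, so the model itself never
# has to speak AMQP. All names here are illustrative.
import json

import pika
import requests

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="cog-predictions")  # hypothetical queue name

def on_request(ch, method, properties, body):
    # Forward the queued request to the model's blocking HTTP API.
    resp = requests.post(
        "http://localhost:5000/predict",  # illustrative endpoint
        json=json.loads(body),
    )
    # Reply on the queue named by the client (AMQP RPC reply-to pattern).
    ch.basic_publish(
        exchange="",
        routing_key=properties.reply_to,
        properties=pika.BasicProperties(correlation_id=properties.correlation_id),
        body=json.dumps(resp.json()),
    )
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="cog-predictions", on_message_callback=on_request)
channel.start_consuming()
```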

Second, how do we keep the APIs simple? Both KServe and Seldon are great, but, like k8s, they have large surface areas and steep learning curves. Can we be as opinionated in our serving APIs as we are with Cog, and still be deployable in different environments?

bfirsh added a commit to bfirsh/cog that referenced this issue Dec 22, 2021
This is a quick fix to get GPU models serving correctly.

The real fix is being incorporated into replicate#259 and replicate#343.

Signed-off-by: Ben Firshman <ben@firshman.co.uk>
bfirsh added a commit that referenced this issue Dec 22, 2021
This is a quick fix to get GPU models serving correctly.

The real fix is being incorporated into #259 and #343.

Signed-off-by: Ben Firshman <ben@firshman.co.uk>
@zeke zeke closed this as completed Mar 14, 2022
@zeke zeke reopened this Mar 14, 2022
@bfirsh
Member Author

bfirsh commented Apr 1, 2022

Hmm, I wonder whether we can consider this superseded by #443?

@bfirsh
Member Author

bfirsh commented Jun 10, 2022

I think we can...

@bfirsh bfirsh closed this as completed Jun 10, 2022