
Review model API design #259

Closed · 1 task
bfirsh opened this issue Sep 23, 2021 · 3 comments · Fixed by #378
Comments

@bfirsh
Member

bfirsh commented Sep 23, 2021

The current model API was hastily written, and we have now learned many things that could be incorporated into its design.

User data

  • Replicate! We currently use an undocumented Redis queue consumer.
  • An industry user wants a gRPC API with support for non-blocking requests for long-running models.
  • An industry user wants an AMQP RPC API. It is unclear whether the model itself needs to interact over this API, or whether a sidecar could speak AMQP on the model's behalf.

Requirements

Essential

  • The primary API for Cog should be the same thing we use for Replicate.
  • Make predictions via a standard blocking REST API: JSON over HTTP. (A rough sketch of a call follows this list.)
  • Make it possible to use with a queuing system, either built in, as a sidecar at the container level, or as an extension at the Python level.
  • Pass files as either URLs or base64 encoding, depending on whether efficiency is important. (Interdependent with Design type system & signature #205.)
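
To make the blocking-API and file-passing bullets above concrete, here is a minimal sketch of what a prediction call with a base64-encoded file might look like. The endpoint path, input field names, and response shape are all assumptions for illustration, not a settled design:

```python
# Hypothetical sketch of a blocking JSON-over-HTTP prediction call.
# The endpoint path, input field names, and response shape are assumptions.
import base64

import requests

# Inline a small input file as base64 (the alternative is passing a URL).
with open("input.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    "http://localhost:5000/predict",  # illustrative endpoint
    json={"input": {"image": f"data:image/jpeg;base64,{image_b64}"}},
)
resp.raise_for_status()
print(resp.json())  # e.g. {"status": "succeeded", "output": ...}
```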

Future, to design in context

  • gRPC interface. This is harder to use, so it shouldn't be the primary interface, but it seems to be gaining wide adoption in ML.
  • Other serving platforms, like AI Platform or KServe.
  • Non-blocking requests for long-running models (one possible shape is sketched after this list).
  • Queuing for GPUs on HTTP server #230
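
For the non-blocking case, one possible shape is create-then-poll. This is purely illustrative; the paths, fields, and polling approach are assumptions, not a proposal:

```python
# Hypothetical create-then-poll shape for a non-blocking prediction API.
# Endpoint paths and response fields are assumptions, not a settled design.
import time

import requests

base = "http://localhost:5000"

# Start a long-running prediction; the server returns immediately with an id.
created = requests.post(f"{base}/predictions", json={"input": {"prompt": "hello"}})
created.raise_for_status()
prediction_id = created.json()["id"]

# Poll until the prediction reaches a terminal state.
while True:
    status = requests.get(f"{base}/predictions/{prediction_id}").json()
    if status["status"] in ("succeeded", "failed"):
        break
    time.sleep(1)

print(status.get("output"))
```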

Prior art

We don't need to reinvent the wheel.

Off the shelf

Real world

Areas for discussion

  • What are the trade-offs between an HTTP based API and a queue based API?
  • How do we support queuing systems?
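
On the second question, one concrete option is the sidecar from the requirements above: a small worker that bridges a queue to the model's blocking HTTP API, so the model code stays queue-agnostic. A minimal sketch, assuming Redis lists for transport (echoing the undocumented Redis queue consumer Replicate uses today); the queue names, message shape, and endpoint are all assumptions:

```python
# Sketch of a queue sidecar: pop prediction requests from a Redis list and
# forward them to the model's blocking HTTP API. All names are illustrative.
import json

import redis
import requests

r = redis.Redis(host="localhost", port=6379)

while True:
    # Block until a request arrives on the (hypothetical) input queue.
    _, raw = r.blpop("predictions:input")
    request = json.loads(raw)

    # Delegate the actual prediction to the model's HTTP server.
    resp = requests.post(
        "http://localhost:5000/predict",  # illustrative endpoint
        json={"input": request["input"]},
    )

    # Push the result onto a per-request (hypothetical) response queue.
    r.rpush(f"predictions:output:{request['id']}", json.dumps(resp.json()))
```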

Potential solutions

Next steps

  • Use default Cog server for Replicate

See also

(Maintainers -- please edit this and consider it a wiki! Edited by @bfirsh, ...)

@andreasjansson
Member

How is it done at Spotify @andreasjansson?

A bunch of different ways. I built a couple of custom queue-based batch processing pipelines using RabbitMQ. We also used AI Platform for some workloads, Cloud Dataflow with Klio for large scale batch processing, KubeFlow for many of the non-deep learning models, real-time systems where features were precomputed in batch and retrieved from fast feature stores, etc.

I've tended towards queue-based systems, because I've worked mostly with deep learning models where latency wasn't critical, and models consumed lots of resources, limiting possible concurrency. But I can also see use cases for a bundled real-time HTTP API.

Two trade-offs I'm thinking about:

First, how can we design Cog such that it can be used in heterogeneous environments? Ideally the same Cog model should be deployable on Replicate, AI Platform, SageMaker, Cloud Dataflow, Seldon, custom GKE setups, and so on. The sidecar pattern would help in some of these environments, but not on the managed serving platforms.

Can we make a core Cog server that can be wrapped with adapters for various environments? Perhaps we provide well-documented HTTP and AMQP APIs out of the box, and maintain adapters for different platforms together with the community.
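
A minimal sketch of that adapter idea, assuming RabbitMQ (via pika) on the outside and the blocking HTTP API on the inside, using the classic AMQP RPC reply-to pattern. The queue name, endpoint path, and message shape are all assumptions:

```python
# Sketch of an AMQP-facing adapter: consume prediction requests from a queue
# and forward them to the core Cog HTTP server, so the model itself never
# has to speak AMQP. All names here are illustrative.
import json

import pika
import requests

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="cog-predictions")  # hypothetical queue name

def on_request(ch, method, properties, body):
    # Forward the queued request to the model's blocking HTTP API.
    resp = requests.post(
        "http://localhost:5000/predict",  # illustrative endpoint
        json=json.loads(body),
    )
    # Reply on the queue named by the client (AMQP RPC reply-to pattern).
    ch.basic_publish(
        exchange="",
        routing_key=properties.reply_to,
        properties=pika.BasicProperties(correlation_id=properties.correlation_id),
        body=json.dumps(resp.json()),
    )
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="cog-predictions", on_message_callback=on_request)
channel.start_consuming()
```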

Second, how do we keep the APIs simple? Both KServe and Seldon are great, but, like k8s, they have large surface areas and steep learning curves. Can we be as opinionated in our serving APIs as we are with Cog, and still be deployable in different environments?

bfirsh added a commit to bfirsh/cog that referenced this issue Dec 22, 2021
This is a quick fix to get GPU models serving correctly.

The real fix is being incorporated into replicate#259 and replicate#343.

Signed-off-by: Ben Firshman <ben@firshman.co.uk>
bfirsh added a commit that referenced this issue Dec 22, 2021
This is a quick fix to get GPU models serving correctly.

The real fix is being incorporated into #259 and #343.

Signed-off-by: Ben Firshman <ben@firshman.co.uk>
@zeke zeke closed this as completed Mar 14, 2022
@zeke zeke reopened this Mar 14, 2022
@bfirsh
Member Author

bfirsh commented Apr 1, 2022

Hmm, I wonder whether we can consider this superseded by #443?

@bfirsh
Member Author

bfirsh commented Jun 10, 2022

I think we can...

@bfirsh bfirsh closed this as completed Jun 10, 2022