# Online Serving With NVIDIA Triton Server

In this workflow, we'll use [NVIDIA Triton Server](https://developer.nvidia.com/triton-inference-server) to deploy models for online serving. Before we dive into the details, let's outline what we'll accomplish and briefly introduce Triton Server.

**What This Workflow Does**

- Work with BERT models to generate text embeddings.
- Work with multiple models in different formats, including multiple versions of the same model.
- Send requests to a default model, or direct them to a specific model and version.
- Use an ensemble construct to process inputs through multiple models and provide a combined response.
- Use this approach in three ways:
    - Locally with Docker for testing
    - With [Vertex AI Endpoints](https://cloud.google.com/vertex-ai/docs/predictions/using-nvidia-triton)
    - With Cloud Run

**The TL;DR on NVIDIA Triton Inference Server**

It's open-source inference serving software, usually run as a container: you start it with a path to a specifically formatted folder of models (a "model repository") and then send it inference requests.
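
As a rough sketch of that startup step, the Python helper below assembles a `docker run` command that mounts a local model repository and starts the server. The helper name, the local path, and the image tag placeholder are assumptions for illustration (pick a real tag from NVIDIA's registry); ports 8000, 8001, and 8002 are Triton's default HTTP, gRPC, and metrics ports.

```python
from pathlib import Path


def triton_docker_command(model_repo: str, image: str) -> list[str]:
    """Build a `docker run` argv that serves the given model repository.

    Hypothetical helper for illustration; it only builds the command,
    it does not run it.
    """
    repo = Path(model_repo).resolve()  # Docker volume mounts need an absolute path
    return [
        "docker", "run", "--rm",
        "-p", "8000:8000",  # HTTP/REST endpoint
        "-p", "8001:8001",  # gRPC endpoint
        "-p", "8002:8002",  # metrics endpoint
        "-v", f"{repo}:/models",  # mount the local repository into the container
        image,
        "tritonserver", "--model-repository=/models",
    ]


# Assumed values: replace the path and the <xx.yy> tag placeholder with your own.
cmd = triton_docker_command(
    "./model_repository",
    "nvcr.io/nvidia/tritonserver:<xx.yy>-py3",
)
print(" ".join(cmd))
```

Building the command as a list keeps it easy to pass to `subprocess.run` later without shell quoting issues.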

**What is NVIDIA Triton Inference Server?**

NVIDIA Triton Inference Server is an open-source software solution that simplifies the deployment and management of AI models in production environments. It acts as a high-performance inference server that exposes your models to client applications over HTTP/REST and gRPC. Here's a closer look at its key features:

- **Versatile Model Support:** Triton Server can handle models from diverse frameworks like TensorFlow, PyTorch, ONNX Runtime, and TensorRT, and can be extended with custom backends. This eliminates the need for framework-specific serving solutions.
- **Optimized Performance:** Triton is designed for optimal inference performance. It employs techniques like dynamic batching, concurrent model execution, and model pipelines to maximize throughput and minimize latency.
- **Flexible Deployment:** Deploy Triton Server on various platforms, including cloud (Vertex AI, AWS, Azure), on-premises data centers, edge devices, and embedded systems.
- **Model Management:** Triton Server loads models from a "model repository," a structured directory in which each model has its own folder with numbered version subfolders. This allows for easy organization, versioning, and A/B testing of your models.
- **Dynamic Request Routing:** Route inference requests to specific models and versions based on your needs. This enables you to experiment with different model versions or serve specialized models for particular tasks.
- **Ensemble Modeling:** Triton Server supports ensemble models, allowing you to chain multiple models together and include custom pre- and post-processing steps written in Python.
- **Monitoring and Metrics:** Gain insights into server performance through built-in metrics, including GPU utilization, throughput, and latency.
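
To make the model repository and request routing ideas above more concrete, here is a minimal Python sketch. It creates an empty repository skeleton in a temporary directory and builds the HTTP inference URLs for a default and a version-specific request. The model names (`bert_pytorch`, `bert_tf`), the input name `TEXT`, and both helper functions are hypothetical; the placeholder `config.pbtxt` is not a complete configuration, and the URL shape assumes Triton's HTTP endpoint on its default port 8000.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def make_model_repository(root: Path, models: dict) -> None:
    """Create <root>/<model>/config.pbtxt and <root>/<model>/<version>/ folders."""
    for name, versions in models.items():
        model_dir = root / name
        model_dir.mkdir(parents=True, exist_ok=True)
        # Placeholder only: a real config.pbtxt also declares the backend
        # or platform and the model's inputs and outputs.
        (model_dir / "config.pbtxt").write_text(f'name: "{name}"\n')
        for version in versions:
            # Version folders are numeric; the model file itself goes inside.
            (model_dir / str(version)).mkdir(exist_ok=True)


def infer_url(base: str, model: str, version: Optional[int] = None) -> str:
    """Build an inference URL; omit the version to let the server choose one."""
    path = f"/v2/models/{model}"
    if version is not None:
        path += f"/versions/{version}"
    return base.rstrip("/") + path + "/infer"


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "model_repository"
    # Hypothetical models: two versions of one encoder, one version of another.
    make_model_repository(repo, {"bert_pytorch": [1, 2], "bert_tf": [1]})
    layout = sorted(p.relative_to(repo).as_posix() for p in repo.rglob("*"))

print("\n".join(layout))

default_url = infer_url("http://localhost:8000", "bert_pytorch")
pinned_url = infer_url("http://localhost:8000", "bert_pytorch", version=2)
print(default_url)
print(pinned_url)

# Request body sketch: the input name, shape, and datatype must match the
# model's actual config; the values here are assumed for illustration.
payload = {
    "inputs": [
        {"name": "TEXT", "shape": [1, 1], "datatype": "BYTES", "data": ["hello world"]}
    ]
}
print(json.dumps(payload))
```

Sending `payload` as JSON in a POST to either URL is the remaining step once a server is running with real model files in place.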

**Why use Triton Server?**

- **Simplified Deployment:** Streamlines the deployment process across different environments and hardware.
- **Improved Performance:** Optimizes inference throughput and latency for demanding applications.
- **Scalability:** Easily scale your inference infrastructure to handle increasing workloads.
- **Versatility:** Supports a wide range of models and frameworks.
- **Production-Ready:** Provides features essential for production environments, such as model management, dynamic routing, and monitoring.

By incorporating Triton Server into your MLOps workflow on Vertex AI or Cloud Run, you can efficiently deploy and manage your models, ensuring high performance and scalability for your AI applications.

---
## TODO
- Set up environment
- Models
    - Get PyTorch and TensorFlow versions of BERT encoders
    - Show functionality locally with notebook
- Triton
    - Prepare container: from NVIDIA to Artifact Registry (AR)
    - Prepare model repository: models and versions to start
    - Prepare ensemble: note that this is optional
- Deploy
    - Locally for testing
    - Vertex AI
    - Cloud Run

Notes as we go:
- https://huggingface.co/docs/transformers/en/index
- https://www.kaggle.com/models/tensorflow/bert


---