# Lab 07 - Model Prediction Speed

During this lab we will explore the prediction speed of different models and model serving
frameworks. This is important for understanding how model architecture, deployment environment,
and other factors affect real-time inference performance and scalability.

There are multiple approaches to serving machine learning models in production environments. We can
imagine the following scenarios (they may overlap):

- batch prediction: where we predict on a large number of samples at once, e.g., offline predictions
  done on a daily basis
- real-time prediction: where we need to make predictions as soon as new data arrives, e.g., online
  predictions for user recommendations or fraud detection systems
- streaming prediction: where we predict on data streams, e.g., financial market predictions or IoT
  sensor data
- low-latency prediction: where we need to minimize the time between receiving input and producing
  output
- edge deployment: where models are deployed on devices with limited computational resources
- embedded model: where models are integrated into applications without a separate serving layer,
  e.g., a mobile app that uses a pre-trained model for image classification

The above approaches can be addressed using custom-built solutions or by leveraging existing
libraries, services, or frameworks.

For now, we will focus mainly on **real-time** and **low-latency predictions**. The training time is
out of scope for this lab.


## 1. Model Serving

There are multiple ways to serve machine learning models. Some popular options include:

- building a custom REST API using Flask or FastAPI frameworks, etc.
- using model serving frameworks specific to a particular ML ecosystem, e.g., TensorFlow Serving,
  TorchServe, etc.
- general-purpose or multi-framework model serving platforms, e.g., MLflow, BentoML, MLServer,
  Seldon Core, KServe, NVIDIA Triton Inference Server, ONNX Runtime, OpenVINO, Ray Serve, etc. Some
  of these use other serving runtimes under the hood.
- frameworks supporting advanced inference logic and ML workflows, such as inference graphs,
  ensembles, or workflow pipelines, e.g., Seldon Tempo SDK, Kubeflow Pipelines, KServe, etc.

Sometimes, the above options may overlap.

There exist some standards and API protocols designed to facilitate model serving, e.g.:
- KServe V1 Protocol based on the TensorFlow Serving API:
    - https://kserve.github.io/website/docs/concepts/architecture/data-plane/v1-protocol
    - https://kserve.github.io/website/docs/concepts/architecture/data-plane
- Open Inference Protocol (KServe V2 Protocol), also endorsed by NVIDIA Triton Inference Server,
  TensorFlow Serving, and TorchServe:
    - https://kserve.github.io/website/docs/concepts/architecture/data-plane/v2-protocol
    - https://kserve.github.io/website/docs/concepts/architecture/data-plane
- API protocols based on the OpenAI API specification for large language models (LLM) inference
- etc.
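
To make the Open Inference Protocol (V2) more concrete, here is a minimal sketch of how a REST inference request body can be assembled. The helper names `build_v2_request` and `infer_path`, the input name `input-0`, and the model name `iris` are placeholders chosen for illustration; check the protocol documentation linked above for the full set of fields.

```python
import json


def build_v2_request(rows, input_name="input-0", datatype="FP64"):
    """Build a V2 inference request body from a list of feature rows.

    `input_name` is a placeholder; use the name your model server expects.
    """
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    # Tensor data is sent here as a flat, row-major list alongside its shape.
    flat = [value for row in rows for value in row]
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": [n_rows, n_cols],
                "datatype": datatype,
                "data": flat,
            }
        ]
    }


def infer_path(model_name):
    """Return the V2 REST path for inference on a given model."""
    return f"/v2/models/{model_name}/infer"


payload = build_v2_request([[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3]])
print(infer_path("iris"))
print(json.dumps(payload, indent=2))
```

The same body can then be sent with any HTTP client as a `POST` request to the server's host and port followed by the path above.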

Your first task during this lab is to train an XGBoost or scikit-learn model on a dataset of your
choice and serve it using at least two different model serving options: MLServer and a custom
FastAPI endpoint. If you prefer, you may choose a different model or different serving frameworks
instead of the suggested ones.

Then, try requesting predictions from both serving options. In the case of MLServer, try to use
both the REST and gRPC endpoints.

```python
# write your code here
```


## 2. Measure Inference Performance

We are interested in the performance of our serving setups. Because they are exposed as services
through HTTP REST or gRPC endpoints, we can use general-purpose tools for load testing and
benchmarking web services. Some popular options include:
- Locust
- k6
- Apache JMeter
- Vegeta
- etc.

Your second task during this lab is to measure the inference performance of your serving setups from
the previous exercise using Locust, k6, or any other tool of your choice.
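
To build intuition for what such tools do, the sketch below fires concurrent requests from a thread pool and records the duration of each one. It is a simplified illustration using only the standard library, not a replacement for Locust or k6. The function `fake_request` is a hypothetical stand-in that just sleeps; in practice you would replace it with a call to your own endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(send_request):
    """Run one request and return (latency in ms, success flag)."""
    start = time.perf_counter()
    try:
        send_request()
        ok = True
    except Exception:
        ok = False
    return (time.perf_counter() - start) * 1000.0, ok


def run_load(send_request, n_requests=100, concurrency=8):
    """Send n_requests using `concurrency` worker threads."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(
            pool.map(lambda _: timed_call(send_request), range(n_requests))
        )
    wall_seconds = time.perf_counter() - wall_start
    latencies_ms = [latency for latency, _ in results]
    n_failed = sum(1 for _, ok in results if not ok)
    return latencies_ms, n_failed, wall_seconds


def fake_request():
    # Hypothetical stand-in for a real HTTP or gRPC call to your service.
    time.sleep(0.001)


latencies_ms, n_failed, wall_seconds = run_load(fake_request, n_requests=50)
print(f"{len(latencies_ms)} requests, {n_failed} failed, {wall_seconds:.3f} s")
```

Dedicated tools add features this sketch lacks, such as ramp-up schedules, distributed load generation, and live reporting.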

We are mainly interested in the following metrics:
- Latency (response time) - average, median, p95, p99
- Throughput (requests per second)
- Error rate
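
If your tool only exports raw per-request timings, the metrics above can be computed by hand. The following is a minimal sketch, assuming you have a list of latencies in milliseconds for all completed requests, the number of failed requests, and the total wall-clock duration of the test. The names `LoadTestSummary`, `percentile`, and `summarize` are illustrative, and the percentiles use the simple nearest-rank method, so values may differ slightly from those reported by tools that interpolate.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass
class LoadTestSummary:
    mean_ms: float
    median_ms: float
    p95_ms: float
    p99_ms: float
    throughput_rps: float
    error_rate: float


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list, with p in (0, 100]."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarize(latencies_ms, n_failed, wall_seconds):
    """Summarize a load test from raw per-request measurements."""
    ordered = sorted(latencies_ms)
    n_total = len(ordered)
    return LoadTestSummary(
        mean_ms=statistics.mean(ordered),
        median_ms=statistics.median(ordered),
        p95_ms=percentile(ordered, 95),
        p99_ms=percentile(ordered, 99),
        throughput_rps=n_total / wall_seconds,
        error_rate=n_failed / n_total,
    )


# Synthetic example: 100 requests with latencies of 1..100 ms over 4 seconds.
summary = summarize([float(i) for i in range(1, 101)], n_failed=2, wall_seconds=4.0)
print(summary)
```

Tail percentiles such as p95 and p99 are worth reporting alongside the average because a small number of slow requests can dominate the experience of real users while barely moving the mean.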

## 3. Experiment with Models and Serving Options

