# Lab 07 - Model Prediction Speed

During this lab we will explore the prediction speed of different models and model serving
frameworks. This is important for understanding how model architecture, deployment environment,
and other factors affect real-time inference performance and scalability.

There are multiple approaches to serving machine learning models in production environments. We can
imagine the following scenarios (they may overlap):

- batch prediction: where we predict on a large number of samples at once, e.g., offline predictions
  done on a daily basis
- real-time prediction: where we need to make predictions as soon as new data arrives, e.g., online
  predictions for user recommendations or fraud detection systems
- streaming prediction: where we predict on data streams, e.g., financial market predictions or IoT
  sensor data
- low-latency prediction: where we need to minimize the time between receiving input and producing
  output
- edge deployment: where models are deployed on devices with limited computational resources
- embedded model: where models are integrated into applications without a separate serving layer,
  e.g., a mobile app that uses a pre-trained model for image classification

The above approaches can be addressed using custom-built solutions or by leveraging existing
libraries, services, or frameworks.

For now, we will focus mainly on **real-time** and **low-latency** predictions. Training time is
out of scope for this lab.


## 1. Model Serving

There are multiple ways to serve machine learning models. Some popular options include:

- in-process model serving within an application, e.g., loading a model directly in Python code
  and using it for predictions
- building a custom REST API with a web framework such as Flask or FastAPI
- using model serving frameworks specific to a particular ML ecosystem, e.g., TensorFlow Serving,
  TorchServe, etc.
- general-purpose or multi-framework model serving platforms, e.g., MLflow, BentoML, MLServer,
  Seldon Core, KServe, NVIDIA Triton Inference Server, ONNX Runtime, OpenVINO, Ray Serve, etc. Some
  of these use other serving runtimes under the hood.
- frameworks supporting advanced inference logic and ML workflows, such as inference graphs,
  ensembles, or workflow pipelines, e.g., Seldon Tempo SDK, Kubeflow Pipelines, KServe, etc.

Sometimes, the above options may overlap.

There exist some standards and API protocols designed to facilitate model serving, e.g.:
- KServe V1 Protocol based on the TensorFlow Serving API:
    - https://kserve.github.io/website/docs/concepts/architecture/data-plane/v1-protocol
    - https://kserve.github.io/website/docs/concepts/architecture/data-plane
- Open Inference Protocol (KServe V2 Protocol), also endorsed by NVIDIA Triton Inference Server,
  TensorFlow Serving and TorchServe
    - https://kserve.github.io/website/docs/concepts/architecture/data-plane/v2-protocol
    - https://kserve.github.io/website/docs/concepts/architecture/data-plane
- API protocols based on the OpenAI API specification for large language model (LLM) inference
- etc.
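
To make the difference between the two KServe protocols more concrete, here is a minimal sketch of
the JSON bodies a REST client would send: V1 requests go to `/v1/models/<model_name>:predict` and
V2 requests to `/v2/models/<model_name>/infer`. The helper names, the tensor name `input-0`, the
`FP64` datatype, and the sample values are illustrative assumptions; check what your server and
model actually expect.

```python
import json


def build_v1_request(rows):
    """KServe V1 body: a list of instances, one entry per sample."""
    return {"instances": rows}


def build_v2_request(rows, input_name="input-0", datatype="FP64"):
    """Open Inference Protocol (V2) body: one named tensor with an explicit shape.

    `input_name` and `datatype` are placeholder defaults, not values required by the protocol.
    """
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    # Tensor contents flattened in row-major order
    flat = [value for row in rows for value in row]
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": [n_rows, n_cols],
                "datatype": datatype,
                "data": flat,
            }
        ]
    }


rows = [[5.1, 3.5, 1.4], [6.2, 2.9, 4.3]]  # two made-up samples, three features each
v1_body = build_v1_request(rows)
v2_body = build_v2_request(rows)
print(json.dumps(v1_body))
print(json.dumps(v2_body))
```

A single-sample request is simply the same structure with one row, so the same helpers can be reused
when comparing single-sample and batched predictions.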

Your first task during this lab is to train an XGBoost or scikit-learn model on a dataset of your
choice. If train and test splits are not provided, please create them.

Suggested datasets:
- https://www.openml.org/search?type=data&sort=runs&id=150&status=active

  ```python
  from sklearn.datasets import fetch_openml
  bunch = fetch_openml("Covertype", return_X_y=False, version=3)
  print(bunch.data.shape, bunch.target.shape)
  ```
- https://www.kaggle.com/competitions/ieee-fraud-detection/data
- https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/data

Then, store the model in a suitable format and serve it using at least two different model serving
options, e.g., MLServer and a custom FastAPI endpoint. If you prefer, you may choose a different
dataset (of similar size and complexity), model, or serving framework instead
of the suggested ones.
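
As a starting point, the train/split/store steps might look like the sketch below. It assumes
scikit-learn and `joblib` are installed, uses a small synthetic dataset from `make_classification`
as a stand-in for a real one, and picks a random forest and a temporary file path purely for
illustration. Check which model formats your chosen serving framework accepts before settling on
one.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch runs offline; swap in your real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Illustrative model and hyperparameters, not a recommendation
model = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Persist the fitted model; the path is a throwaway temporary location
model_path = Path(tempfile.mkdtemp()) / "model.joblib"
joblib.dump(model, model_path)

# Reload and check that the stored model predicts the same labels
reloaded = joblib.load(model_path)
same_predictions = bool((reloaded.predict(X_test) == model.predict(X_test)).all())
print("reloaded model agrees:", same_predictions)
```

Keeping `X_test` around is useful later, since the held-out rows can serve as request payloads.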

Remark: At the time of writing, there is an open MLServer issue:
https://github.com/SeldonIO/MLServer/issues/2286. If you encounter any problems related to
`uvloop`, try downgrading this package.

Then, try requesting predictions from both serving options. With MLServer, try both the REST and
gRPC endpoints. Try batched (if supported) and single-sample predictions.

# write your code here


## 2. Measure Inference Performance

We are interested in the performance of our serving setups. Because they are deployed as services
exposing HTTP REST or gRPC endpoints, we can use general-purpose tools for load testing and
benchmarking web services. Some popular options include:
- `Locust` (https://locust.io/)
- `k6` (https://k6.io/)
- `Apache JMeter` (https://jmeter.apache.org/)
- `Vegeta` (https://github.com/tsenart/vegeta)
- etc.

Your second task during this lab is to measure the inference speed of the serving setups from the
previous exercise using `Locust` or any other tool of your choice. Use the test subset of your data
to generate requests.

This may not be the best possible benchmarking setup, as it runs load generation on the same machine
as the model server, but it should be sufficient for learning purposes. Get familiar with both
tools: the serving framework and the load testing tool you chose. There are many caveats to
properly benchmarking served models. They are out of scope for this lab, but be aware of some of
them:
- https://docs.locust.io/en/stable/increasing-request-rate.html#concurrency
- https://docs.locust.io/en/stable/increasing-request-rate.html#load-generation-performance
- https://docs.locust.io/en/stable/increasing-request-rate.html#actual-issues-with-the-system-under-test
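
To see what a load test does under the hood, the sketch below sends concurrent requests and records
the latency and outcome of each one. It is a standard-library stand-in, not `Locust`: the
`StubHandler` server, the `/predict` path, the payload, and the concurrency level are all
hypothetical and only exist so the example runs on its own. For real measurements, point a proper
load testing tool at your actual model server.

```python
import json
import threading
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StubHandler(BaseHTTPRequestHandler):
    """Stand-in for a model server: replies with a fixed fake prediction."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # consume the request body
        body = json.dumps({"predictions": [0]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the console quiet


# Opener that ignores proxy environment variables, since the target is local
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def timed_post(url, payload):
    """Send one request; return (latency in seconds, success flag)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    try:
        with opener.open(request, timeout=5) as response:
            response.read()
            ok = response.status == 200
    except OSError:  # URLError and timeouts are OSError subclasses
        ok = False
    return time.perf_counter() - start, ok


def run_load(url, payloads, concurrency=4):
    """Send all payloads using `concurrency` worker threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda p: timed_post(url, p), payloads))
    return results, time.perf_counter() - start


server = ThreadingHTTPServer(("127.0.0.1", 0), StubHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/predict"

payloads = [{"instances": [[0.1, 0.2, 0.3]]} for _ in range(40)]
results, elapsed = run_load(url, payloads)
server.shutdown()
server.server_close()

n_ok = sum(ok for _, ok in results)
print(f"{n_ok}/{len(results)} succeeded, {len(results) / elapsed:.1f} requests/s")
```

Because the client threads and the stub server share one Python process, the numbers printed here
say little about a real deployment; the point is only the structure of the measurement.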

We are mainly interested in the following metrics:
- Latency (response time) - e.g., average, median, p95, p99
- Throughput (requests per second)
- Error rate
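
If you collect raw per-request measurements yourself, these metrics can be computed with the
standard library, as in the sketch below. The `summarize` helper and the input values are made up
for illustration, and percentile definitions differ slightly between tools, so small discrepancies
with the numbers reported by your load testing tool are expected.

```python
import statistics


def summarize(latencies_ms, n_errors, duration_s):
    """Summarize one load-test run.

    latencies_ms: response times of all completed requests, in milliseconds
    n_errors: how many of those requests failed
    duration_s: wall-clock length of the run, in seconds
    """
    # quantiles(n=100) returns the 99 cut points p1..p99
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    total = len(latencies_ms)
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "throughput_rps": total / duration_s,
        "error_rate": n_errors / total,
    }


# Made-up measurements: 1 ms, 2 ms, ..., 101 ms over a 10-second run with 2 failures
latencies = [float(ms) for ms in range(1, 102)]
summary = summarize(latencies, n_errors=2, duration_s=10.0)
print(summary)
```

Tail percentiles such as p95 and p99 are often more informative than the mean for real-time
serving, because a few slow requests can hide behind a good average.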

## 3. Experiment with Models and Serving Options

The last task during this lab is to experiment with different model hyperparameters, architectures,
and sizes. Try settings that affect the size and complexity of XGBoost models (if you decided to use
it), e.g., depth of decision trees, number of estimators, etc. We want to see if these changes
impact inference speed.
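
Before measuring, it can help to estimate how much work a tree ensemble does per prediction. The
sketch below computes rough upper bounds under the assumption that every tree is a full binary tree
of the maximum depth; trained trees are usually smaller, and measured latency also includes
serialization and network overhead, so treat the output as an intuition aid, not a prediction. The
function name and the example settings are hypothetical.

```python
def tree_ensemble_bounds(n_estimators, max_depth, n_classes=1):
    """Upper bounds for an ensemble of binary decision trees.

    Assumes every tree is a full binary tree of depth `max_depth`. `n_classes` models
    setups that build one tree per class in each boosting round (use 1 for binary
    classification or regression).
    """
    n_trees = n_estimators * n_classes
    nodes_per_tree = 2 ** (max_depth + 1) - 1  # internal nodes plus leaves
    return {
        "trees": n_trees,
        "max_nodes": n_trees * nodes_per_tree,
        # One root-to-leaf walk per tree, at most max_depth comparisons each
        "max_comparisons_per_sample": n_trees * max_depth,
    }


small = tree_ensemble_bounds(n_estimators=100, max_depth=3)
large = tree_ensemble_bounds(n_estimators=1000, max_depth=10)
print(small)
print(large)
```

Comparing such estimates with your measured latencies is one way to check whether model
computation or serving overhead dominates the response time.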

Experiment with different serving options: number of workers/replicas, single-sample vs. batch
prediction (when using batches of different sizes, take this into account when comparing results,
e.g., normalize latency or throughput per sample), protocol type (REST vs gRPC), etc. Try to
formulate some conclusions based on your observations: if/how specific factors affect inference
speed metrics. Is a larger model slower? Can you explain why? What is the maximum speed (requests
per second) you can achieve on your hardware? Does batching affect latency and throughput? If
possible, provide some plots to visualize your findings. You can export data from `Locust` for
further analysis:
https://docs.locust.io/en/stable/retrieving-stats.html#retrieve-test-statistics-in-csv-format.
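
When comparing runs with different batch sizes, the per-sample normalization mentioned above can be
sketched as follows. The helper name and all numbers are hypothetical placeholders, not measured
results; replace them with the values from your own runs.

```python
def per_sample_metrics(batch_size, requests_per_s, latency_ms):
    """Convert per-request numbers into per-sample numbers for a given batch size."""
    return {
        "batch_size": batch_size,
        "samples_per_s": requests_per_s * batch_size,
        "latency_ms_per_sample": latency_ms / batch_size,
    }


# Hypothetical runs: (batch size, requests/s, median latency per request in ms)
runs = [(1, 800.0, 4.0), (32, 200.0, 16.0), (256, 40.0, 80.0)]
normalized = [per_sample_metrics(*run) for run in runs]
for row in normalized:
    print(row)
```

Keep in mind that the per-sample latency is an amortized figure: each caller still waits for the
full request latency, so report both values when discussing batching.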