# Lab 08 - Model Prediction Speed of Deep Neural Networks

During this lab, we will continue exploring the prediction speed of models. This time we will focus
on deep neural networks.

## 1. Model Serving

*Below, there is a proposed scenario for you to implement. However, if you have another idea in
mind, feel free to use a different dataset or choose another task of comparable complexity.*

We will focus on an image classification task. Building and training a model is not the main
focus of this lab, so we want this step to be as straightforward as possible. The goal is to
prepare a model that can distinguish between ants and bees using the "Hymenoptera" dataset. You
can use the PyTorch tutorial [Transfer Learning for Computer Vision Tutorial](
https://docs.pytorch.org/tutorials/beginner/transfer_learning_tutorial.html#transfer-learning-for-computer-vision-tutorial)
as inspiration and a starting point (or even follow most of its code). That page also links to
the dataset download.

Follow the tutorial to fine-tune a pre-trained model for the task, save it appropriately, and
serve it using a model-serving framework of your choice.

There are many options available, for example:
- use NVIDIA Triton Inference Server:
    - https://github.com/triton-inference-server/server
    - https://github.com/triton-inference-server/tutorials
- use [Ray Serve](https://docs.ray.io/en/latest/serve/index.html) - see the "PyTorch" tab
- use [LitServe](https://github.com/Lightning-AI/LitServe) - you can also test it locally
- use [MLServer](https://github.com/SeldonIO/MLServer) - see "Serving a custom model"
- implement a custom Flask or FastAPI application that serves the model over a REST API
    
In short, you should serve the model and be able to send requests with image representations (there
are different options, such as raw images, preprocessed tensors, etc., depending on the design of
your serving solution). The server should respond with the predicted class (ant or bee) and/or the
associated probabilities.
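
As an illustration of the response side, the sketch below turns raw model outputs (logits) into a JSON-friendly payload with the predicted class and the class probabilities. It uses only the standard library. The names `CLASS_NAMES`, `softmax`, and `build_response`, as well as the payload layout, are assumptions made for this example and are not prescribed by any serving framework.

```python
import math

# Assumed label order: torchvision's ImageFolder sorts class folders
# alphabetically, so "ants" comes before "bees". Verify this for your model.
CLASS_NAMES = ["ants", "bees"]


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def build_response(batch_logits):
    """Build a response payload for a batch of per-image logits."""
    predictions = []
    for logits in batch_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        predictions.append(
            {
                "label": CLASS_NAMES[best],
                "probabilities": dict(zip(CLASS_NAMES, probs)),
            }
        )
    return {"predictions": predictions}


# Two images in one batch; the logits here are arbitrary example values.
response = build_response([[2.0, 0.0], [-1.0, 3.0]])
print(response)
```

Returning one entry per image keeps the response aligned with the order of images in a batched request.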

Remark 1: Ensure that your serving solution can accept batched requests (you may enforce an upper
limit on the batch size) and that the images in a batch may have different resolutions. Depending
on the solution you choose, you may need to use features such as dynamic input shapes or implement
custom preprocessing logic to handle varying image sizes. A resource that may help:
- https://docs.pytorch.org/docs/2.9/export.html#expressing-dynamism
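
One possible way to handle Remark 1 with custom preprocessing is to resize every image to a common size on the server and stack the results into a single batch. The sketch below assumes images arrive as float tensors of shape `(3, H, W)` with values in `[0, 1]`. `MAX_BATCH_SIZE` and `collate_images` are hypothetical names, and the limit of 16 is arbitrary. The normalization constants are the ImageNet values used in the transfer learning tutorial.

```python
import torch
import torch.nn.functional as F

MAX_BATCH_SIZE = 16       # hypothetical upper limit on the batch size
TARGET_SIZE = (224, 224)  # matches the 224x224 crops used in the tutorial
MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)


def collate_images(images):
    """Resize images of differing resolution and stack them into one batch.

    `images` is a list of float tensors shaped (3, H, W) with values in [0, 1].
    Note: resizing straight to a square ignores the aspect ratio; the tutorial
    instead resizes to 256 and center-crops to 224.
    """
    if not 1 <= len(images) <= MAX_BATCH_SIZE:
        raise ValueError(f"batch size must be between 1 and {MAX_BATCH_SIZE}")
    processed = []
    for image in images:
        resized = F.interpolate(
            image.unsqueeze(0),  # interpolate expects a batch dimension
            size=TARGET_SIZE,
            mode="bilinear",
            align_corners=False,
        ).squeeze(0)
        processed.append((resized - MEAN) / STD)
    return torch.stack(processed)


# Two random "images" with different resolutions.
batch = collate_images([torch.rand(3, 100, 150), torch.rand(3, 300, 200)])
print(batch.shape)
```

Resizing on the server keeps the model input shape fixed, so only the batch dimension needs to be dynamic.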

Remark 2: You may need to study some documentation or tutorials to make the chosen serving solution
work more efficiently, e.g., you might use a static computation graph instead of eager mode. Some
resources that may help:
- https://docs.pytorch.org/tutorials/beginner/onnx/export_simple_model_to_onnx_tutorial.html
- https://onnxruntime.ai/docs/
- https://onnx.ai/
- https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/tutorials/Quick_Deploy/PyTorch/README.html

## 2. Measure Inference Speed

We are interested in the performance of the serving setup. Similar to the previous lab, we can use a
general-purpose tool for load testing and benchmarking web services, for example,
[Locust](https://locust.io/).

Measure the inference speed, including the latency and the number of requests per second (RPS)
that your serving solution can handle. Try to estimate the maximum RPS your setup can sustain.
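
Before running a full load test, it can help to see what these metrics mean for a single sequential client. The sketch below times repeated calls and reports throughput together with mean, median, and 95th-percentile latency, using only the standard library. `benchmark` and `fake_request` are hypothetical names; `fake_request` merely sleeps and should be replaced by a real call to your server.

```python
import statistics
import time


def benchmark(send_request, num_requests=100):
    """Call `send_request` sequentially and summarize latency and throughput."""
    latencies = []
    start = time.perf_counter()
    for _ in range(num_requests):
        t0 = time.perf_counter()
        send_request()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    # quantiles(n=100) returns the 99 cut points between percentiles.
    cuts = statistics.quantiles(latencies, n=100)
    return {
        "rps": num_requests / elapsed,
        "mean_ms": 1000 * statistics.fmean(latencies),
        "p50_ms": 1000 * cuts[49],
        "p95_ms": 1000 * cuts[94],
    }


def fake_request():
    """Stand-in for an HTTP request to the model server (assumption)."""
    time.sleep(0.002)


stats = benchmark(fake_request, num_requests=50)
print(stats)
```

A single sequential client only gives a lower bound on throughput. To find the RPS limit, the load has to come from many concurrent clients, which is what a tool such as Locust provides.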

## 3. Experiment with Models and Serving Options

Experiment with different model architectures. In particular, try other available pre-trained
models - see https://docs.pytorch.org/vision/main/models.html#classification for some examples.
Choose models with different sizes and computational requirements.

Also experiment with different serving options. Try to draw conclusions from the results. Can you
observe any difference in the inference speed? Does batching influence the results? If possible,
provide plots to visualize your findings. You can obtain raw data from the load-testing tool
`Locust` for further analysis -
https://docs.locust.io/en/stable/retrieving-stats.html#retrieve-test-statistics-in-csv-format.
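
A minimal sketch of reading such a CSV file for further analysis, using only the standard library, might look as follows. The column names reflect the header of the `*_stats.csv` file as written by recent Locust versions, to the best of my knowledge; check the header of your own file. `SAMPLE_STATS_CSV` contains made-up numbers and only a subset of the columns, purely for illustration, and `load_stats` is a hypothetical helper.

```python
import csv
import io

# Made-up sample, NOT real measurements; a real file has more columns.
SAMPLE_STATS_CSV = """Type,Name,Request Count,Failure Count,Median Response Time,Requests/s,95%
POST,/predict,1200,3,41,118.5,80
,Aggregated,1200,3,41,118.5,80
"""


def load_stats(csv_file):
    """Parse rows of a Locust stats CSV into dictionaries with numeric values."""
    rows = []
    for row in csv.DictReader(csv_file):
        rows.append(
            {
                "name": row["Name"],
                "requests": int(row["Request Count"]),
                "failures": int(row["Failure Count"]),
                "median_ms": float(row["Median Response Time"]),
                "rps": float(row["Requests/s"]),
                "p95_ms": float(row["95%"]),
            }
        )
    return rows


rows = load_stats(io.StringIO(SAMPLE_STATS_CSV))
for row in rows:
    print(row["name"], row["rps"], row["p95_ms"])
```

With a real run started with `--csv=results`, you would open `results_stats.csv` with `open(..., newline="")` and pass the file object to `load_stats`. Collecting one such file per model and serving option gives the data points for the plots.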