# Chapter 11: Serving Models with TorchServe
Installation Notes:
To run this notebook on Google Colab, you will need to install the following libraries: torch-model-archiver, torchserve, captum, and pyngrok.

In Google Colab, you can run the following command to install these libraries:

In [None]:
!pip install torch-model-archiver torchserve captum pyngrok

## 11.2 Learning Objectives

By the end of this chapter, you should be able to:
- understand, build, and assemble the necessary components into a model archive
- serve a trained model locally using TorchServe

## 11.3 Archiving and Serving Models

It is surely fun to train models but, just like the expression "pics or it didn't happen", in deep learning it is "deployed or it didn't happen." We're not going as far as developing a mobile app for users to classify images of fruits and vegetables, but we will neatly package all the necessary files in a model archive, and use TorchServe to serve this model on Google Colab.

Before proceeding, let's load the model trained in the previous chapter first. You're free to use your own saved checkpoint, but we also provide our own for your convenience. You can download the fomo_model.pth file from the following link:

https://github.com/lftraining/LFD273-code/releases/download/model/fomo_model.pth

If you're using Google Colab, you can just run the command below to download it:

In [None]:
!wget https://github.com/dvgodoy/assets/releases/download/model/fomo_model.pth

Once the file is downloaded, we can load the trained model:

In [None]:
import torch
import torch.nn as nn

repo = 'pytorch/vision:v0.15.2'
model = torch.hub.load(repo, 'resnet18', weights=None)
model.fc = nn.Linear(512, 4)

state = torch.load('fomo_model.pth', map_location='cpu')
model.load_state_dict(state)

Downloading: "https://github.com/pytorch/vision/zipball/main" to /root/.cache/torch/hub/main.zip

<All keys matched successfully>

### 11.3.1 Model Archiver

Let's start with the model archive (`.mar`) file, a collection of files and folders zipped together that contains:
- a `MAR-INF` folder with a `MANIFEST.json` file inside that describes the contents of the model archive itself, such as model and archiver versions, and the files that make up the archive
- a serialized file containing the model's weights/state (`--serialized-file` argument)
- a Python file containing only one class definition of our model's class inherited from `nn.Module` (only required if the model isn't scripted - more on that later) (`--model-file` argument)
- an optional Python file containing one class definition of the handler's class inherited from `ts.torch_handler.BaseHandler` that performs the necessary transformations for pre- and post-processing  OR the name of a predefined handler (`--handler` argument)
- an optional extra file `index_to_name.json` for mapping predicted class indices to their corresponding category names (automatically used by some predefined handlers) (`--extra-files` argument)

It is typical to assemble the model archive file through the command line interface:

```
torch-model-archiver --model-name <your_model_name> \
                     --version <your_model_version> \
                     --model-file <your_model_file>.py \
                     --serialized-file <your_model_name>.pth \
                     --handler <handler-script OR name> \
                     --extra-files ./index_to_name.json
```

However, let's take a closer look at each one of its components and assemble it ourselves instead.

### 11.3.2 Model File

The model file must contain a single class definition (inherited from nn.Module) corresponding to our model's architecture and forward pass. Ours is a slightly modified ResNet18 model, but how can we define its class without having to write down ResNet18 architecture from scratch?

We need to:
- define our own class
- create an instance of an untrained ResNet18 model
- replace its head (`fc` layer) with our own
- update our own class internal dictionary with the entries from ResNet's dictionary
- set ResNet's forward pass to our own class using `setattr`

It looks like this:

In [None]:
from torchvision.models import resnet18

class FOMONet(nn.Module):
    def __init__(self):
        super().__init__()

        # Create an instance of an untrained ResNet18
        resnet = resnet18(weights=None)
        # Modifies the architecture to our task
        resnet.fc = nn.Linear(512, 4)

        # Replicate ResNet's modified architecture to FOMONet
        self.__dict__.update(resnet.__dict__)
        # Replicate Resnet's forward method to FOMONet
        setattr(self, 'forward', resnet.forward)

It is hacky for sure, but it works. Let's make sure of it by loading our modified model's state dictionary into an instance of our FOMONet.

In [None]:
fomo = FOMONet()
fomo.load_state_dict(model.state_dict())

<All keys matched successfully>

All keys matched! Time to try out the forward pass:

In [None]:
fomo.eval()
model.eval()

torch.manual_seed(32)
x = torch.randn(1, 3, 224, 224)

fomo(x), model.cpu()(x)

(tensor([[ 0.2412, -2.8556, -1.1869,  0.8597]], grad_fn=<AddmmBackward0>),
 tensor([[ 0.2412, -2.8556, -1.1869,  0.8597]], grad_fn=<AddmmBackward0>))

That's also a match!

We have patched together our own model class. Now, we only need to write it to a Python file:

In [None]:
model_file_script = """
import torch.nn as nn
from torchvision.models import resnet18

class FOMONet(nn.Module):
    def __init__(self):
        super().__init__()

        # Create an instance of an untrained ResNet18
        resnet = resnet18(weights=None)
        # Modifies the architecture to our task
        resnet.fc = nn.Linear(512, 4)

        # Replicate ResNet's modified architecture to FOMONet
        self.__dict__.update(resnet.__dict__)
        # Replicate Resnet's forward method to FOMONet
        setattr(self, 'forward', resnet.forward)
"""

with open('model_file.py', 'w') as fp:
    fp.write(model_file_script)

Does it feel too hacky for you? Don't worry, the idea here was to prove that there's a (hacky) way of coming up with our own model file - if needed - without having to go through the whole ResNet's architecture.

But, what if we did not need a model file at all?

### 11.3.3 Scripted Models

We briefly touched upon the topic of scripting models when we discussed data augmentation and transformations that aren't "scriptable". Now, let's talk about TorchScript and what scripting a model actually means.

"*TorchScript is a way to create serializable and optimizable models from PyTorch code. Any TorchScript program can be saved from a Python process and loaded in a process where there is no Python dependency.*"

Source: [TorchScript](https://pytorch.org/docs/stable/jit.html)

The key element here is "*no Python dependency*", meaning the model can be run in a standalone C++ program, for example. This preserves the best of both worlds: the ease and friendliness of the Python language for development, and the speed and reliability of the C++ language for deploying in production.

We're not going into details here. We're just showing you an example of using PyTorch JIT, an optimized compiler for PyTorch programs, to script our model:

In [None]:
# once it is scripted, there is no need for the model class def anymore
scripted_model = torch.jit.script(model)

The torch.jit.script() function inspects the source code of our model, compiles it to TorchScript code, and returns a ScriptModule (a wrapper around a C++ module).

If you'd like to learn more about it, check the [Introduction to TorchScript tutorial](https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html).

### 11.3.4 Serialized File

The serialized file has the model state/weights saved to disk. It may be the model in eager mode, saving it in the typical way we use for saving checkpoints, or it may be the scripted version of the model, which we can save using its own save() method instead:

In [None]:
# We already saved the model to disk in the previous chapter
# eager mode version
torch.save(model.state_dict(), 'fomo_model.pth')

# scripted version
scripted_model.save("fomo_model.pt")

Although there are no enforced rules for naming the files, TorchServe assumes internally (in the BaseHandler code) that the .pt file extension corresponds to a saved scripted model, so we're adhering to this convention, and we're saving eager models using .pth as file extension.

### 11.3.5 Inference Handler

The handler takes care of pre- and post-processing the inputs and outputs, respectively, sending the former to the model and returning the latter to the user.

There are several [default handlers](https://pytorch.org/serve/default_handlers.html) implemented in TorchServe:
- `image_classifier`
- `object_detector`
- `text_classifier`
- `image_segmenter`

The first three handlers also map the predicted class to its corresponding name/category using a standard `index_to_name.json` extra file.

Although the default handlers can seamlessly integrate pretrained models into TorchServe, more often than not you'll need to tweak them to accommodate changes in pre- or post-processing, or the fact that your task may have only two or a few classes (instead of the typical 1,000 classes of ImageNet).

Therefore, we're taking a closer look at the internals of a handler, so you can more easily adjust them should the need arise. Each method is briefly introduced, followed by an abbreviated version of its implementation in TorchServe.

#### 11.3.5.1 Initialize
The initialize() method mainly takes care of loading the pretrained model and setting it to evaluation mode. Notice that it assumes that, if a model file (.py) that defines the model class is available, the model is in eager mode, and it will load its saved state. Otherwise, if the file extension is .pt, it assumes it's dealing with a scripted model. You probably won't need to modify this method.
```python
def initialize(self, context):
    """Initialize function loads the model.pt file and initialized the model object.
       First try to load torchscript else load eager mode state_dict based model.
    """
    model_file = self.manifest["model"].get("modelFile", "")
    if model_file:
        self.model = self._load_pickled_model(model_dir, model_file, self.model_pt_path)
        self.model.to(self.device)
        self.model.eval()
    elif self.model_pt_path.endswith(".pt"):
        self.model = self._load_torchscript_model(self.model_pt_path)
        self.model.eval()
```

#### 11.3.5.2 Handle
The handle() method manages the flow of information inside the handler. It calls the other methods in order: pre-processing, inference, and post-processing. You probably won't need to modify this method either.
```python
def handle(self, data, context):
    """Entry point for default handler. It takes the data from the input request and returns
       the predicted outcome for the input.
    """
    data_preprocess = self.preprocess(data)
    output = self.inference(data_preprocess)
    output = self.postprocess(output)

    return output
```
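To make this flow concrete, here is a minimal sketch that mimics the same three-step pipeline in plain Python. `ToyHandler` and its methods are hypothetical stand-ins written for illustration (they are not TorchServe classes), and the "model" is simply a function that doubles its input.

```python
class ToyHandler:
    """Hypothetical stand-in for a handler; not part of TorchServe."""

    def __init__(self, model):
        self.model = model

    def preprocess(self, data):
        # Turn each request row into a number the "model" understands
        return [float(row["body"]) for row in data]

    def inference(self, inputs):
        # Run the "model" on every preprocessed input
        return [self.model(x) for x in inputs]

    def postprocess(self, outputs):
        # Return one JSON-friendly response per request in the batch
        return [{"prediction": y} for y in outputs]

    def handle(self, data, context=None):
        # Same order as the handle() method above:
        # pre-processing, inference, and post-processing
        return self.postprocess(self.inference(self.preprocess(data)))


handler = ToyHandler(model=lambda x: 2 * x)
responses = handler.handle([{"body": "1.5"}, {"body": "4"}])
print(responses)
```

Since `handle()` only orchestrates the other methods, a subclass that overrides `preprocess()` or `postprocess()` changes the behavior without touching the flow itself.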

#### 11.3.5.3 Preprocess
The preprocess() method takes input data, as sent by the user, and turns it into a PyTorch-appropriate format, namely, tensors.

In the VisionHandler class, this method extracts data from the HTTP request's body, applies the required transformations (image_processing() method) to normalize the data according to ImageNet statistics for mean and standard deviation, and stacks the results together.

You may need to override this method in your own handler class if your images do not follow typical ImageNet statistics, or if you're using a pretrained model that does not require this kind of preprocessing, as we've already discussed in the previous chapter.
```python
def preprocess(self, data):
    """
    Preprocess function to convert the request input to a tensor(Torchserve supported format).
    The user needs to override to customize the pre-processing
    """
    images = []

    for row in data:
        # Compat layer: normally the envelope should just return the data
        # directly, but older versions of Torchserve didn't have envelope.
        image = row.get("data") or row.get("body")
        if isinstance(image, str):
            # if the image is a string of bytesarray.
            image = base64.b64decode(image)

        # If the image is sent as bytesarray
        if isinstance(image, (bytearray, bytes)):
            image = Image.open(io.BytesIO(image))
            image = self.image_processing(image)
        else:
            # if the image is a list
            image = torch.FloatTensor(image)

        images.append(image)

    return torch.stack(images).to(self.device)
```
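The string branch in the method above expects base64-encoded bytes. As a minimal sketch of that round trip, the snippet below uses only the standard library and a few made-up bytes in place of a real image file, so the payload itself is an assumption for illustration.

```python
import base64

# Made-up bytes standing in for the contents of an image file
raw_bytes = b"\x89PNG\r\n\x1a\n fake image payload"

# Client side: encode the bytes as text, e.g. to embed them in a JSON body
encoded = base64.b64encode(raw_bytes).decode("utf-8")

# Handler side: the same check and decoding used in preprocess()
image = encoded
if isinstance(image, str):
    image = base64.b64decode(image)

# After decoding, the payload is bytes again and matches the original
print(isinstance(image, (bytearray, bytes)), image == raw_bytes)
```

Once decoded, the payload follows the same path as an image sent directly as bytes.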

Let's take a quick look at the `image_processing()` function that's called by the `preprocess()` method:

In [None]:
from ts.torch_handler.image_classifier import ImageClassifier

ImageClassifier.image_processing



Compose(
    Resize(size=256, interpolation=bilinear, max_size=None, antialias=warn)
    CenterCrop(size=(224, 224))
    ToTensor()
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
)

Now, compare to the prescribed transformations when using our ResNet18 model:

In [None]:
from torchvision.models import get_weight

weights = get_weight('ResNet18_Weights.DEFAULT')
weights.transforms()

ImageClassification(
    crop_size=[224]
    resize_size=[256]
    mean=[0.485, 0.456, 0.406]
    std=[0.229, 0.224, 0.225]
    interpolation=InterpolationMode.BILINEAR
)

They look similar, but do they do exactly the same thing? Let's throw the same image at both of them and compare the transformed results. You can download one of the fig images from the link below:

https://raw.githubusercontent.com/lftraining/LFD273-code/main/images/ch9/fig_0_100.jpg

If you're using Google Colab, you can simply run the command below to download the image:

In [None]:
!wget https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch9/fig_0_100.jpg

Once the image is downloaded, let's transform it:

In [None]:
from PIL import Image

img = Image.open('./fig_0_100.jpg')

(ImageClassifier.image_processing(img) == weights.transforms()(img)).all()

tensor(True)

#### 11.3.5.4 Inference
The inference() method moves the data to the appropriate device and feeds it to the model to get predictions, with gradient tracking disabled. You probably won't have to modify this method.
```python
def inference(self, data, *args, **kwargs):
    """
    The Inference Function is used to make a prediction call on the given input request.
    The user needs to override the inference function to customize it.
    """
    with torch.no_grad():
        marshalled_data = data.to(self.device)
        results = self.model(marshalled_data, *args, **kwargs)
    return results
```

#### 11.3.5.5 Postprocess
The postprocess() method takes the results of the inference() method and converts them back to a list so it can be returned to the user. In the ImageClassifier handler, the method converts logits into probabilities using the softmax function, and takes the top-K most likely classes to return to the user. The number of returned classes, indicated by the topk class attribute, is set to five by default.

You may need to override this method in your own handler class to suit your needs when it comes to the output expected by your users.
```python
def postprocess(self, data):
    """
    The post process function makes use of the output from the inference and converts into a
    Torchserve supported response output.
    """
    ps = F.softmax(data, dim=1)
    probs, classes = torch.topk(ps, self.topk, dim=1)
    probs = probs.tolist()
    classes = classes.tolist()
    return map_class_to_label(probs, self.mapping, classes)
```
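To make the arithmetic explicit, here is a rough sketch of the same two steps (softmax followed by top-K) in plain Python, without PyTorch. The `softmax()` and `topk()` helpers are hypothetical re-implementations for illustration only, and the logits are the ones produced by the forward pass on a random input earlier in this chapter.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def topk(probs, k):
    # Like torch.topk, refuse a k larger than the number of classes
    if k > len(probs):
        raise ValueError(f"k={k} is out of range for {len(probs)} classes")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [probs[i] for i in ranked], ranked

# Logits for a single input over our four classes
logits = [0.2412, -2.8556, -1.1869, 0.8597]
probs = softmax(logits)
top_probs, top_classes = topk(probs, k=4)
print(top_classes)  # [3, 0, 2, 1]
```

Asking for `k=5` here raises an error because there are only four classes, which is the situation the default top-5 setting runs into with a four-class model.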

#### 11.3.5.6 Custom Handler
In our example, the default value of topk is inconvenient (we only have four classes in total) and will raise an error if left like that. Fortunately, the ImageClassifier class also implements a set_max_result_classes() method, and we can leverage it to tweak the topk parameter in our very own handler class that inherits from it. We don't need to implement/modify any of the methods, other than the constructor:

In [None]:
handler_file_script = """
from ts.torch_handler.image_classifier import ImageClassifier

class FOMOHandler(ImageClassifier):
    def __init__(self):
        super().__init__()

        # By default, ImageClassifier uses top-5 classes
        # but our task has only 4, so we need to tweak it
        self.set_max_result_classes(4)
"""

with open('handler_file.py', 'w') as fp:
    fp.write(handler_file_script)

### 11.3.6 Extra Files

The extra files include anything you may need to transform your users' requests into proper model inputs, or model outputs into proper responses to those requests. The typical extra file, which the default handlers already support, is the index_to_name.json file: it allows handlers inherited from the VisionHandler class to map the predicted class indices back to their category names. The appropriate file for models pretrained on the ImageNet dataset can be found here, and it looks like this:

{
 "0": ["n01440764", "tench"],
 "1": ["n01443537", "goldfish"],
 ...
 "998": ["n13133613", "ear"],
 "999": ["n15075141", "toilet_tissue"]
}

For our own FOMO dataset, we could reverse keys and values from the dataset's class_to_idx attribute.

In [None]:
# We didn't load the dataset in this chapter, so we're building the dict manually
# class_to_idx = datasets['train'].class_to_idx

class_to_idx = {'Fig': 0, 'Mandarine': 1, 'Onion White': 2, 'Orange': 3}

In [None]:
index_to_name = {v: k for k, v in class_to_idx.items()}
index_to_name

{0: 'Fig', 1: 'Mandarine', 2: 'Onion White', 3: 'Orange'}

Now, let's save this information to the corresponding file:

In [None]:
import json

with open('index_to_name.json', 'w') as f:
    json.dump(index_to_name, f)
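One detail worth noticing: JSON object keys are always strings, so the integer indices are written as "0", "1", and so on, which matches the format of the ImageNet file shown above. A quick round trip through the json module (using strings in memory rather than the file, purely for illustration) shows the conversion:

```python
import json

index_to_name = {0: 'Fig', 1: 'Mandarine', 2: 'Onion White', 3: 'Orange'}

# Serialize and parse again, as happens when the file is written and later read
restored = json.loads(json.dumps(index_to_name))

print(restored)
# {'0': 'Fig', '1': 'Mandarine', '2': 'Onion White', '3': 'Orange'}
```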

### 11.3.7 Packaging
At this point, we could go back to the CLI and run the following command to assemble our .mar file containing everything we'll need to serve our model:
```
torch-model-archiver --model-name FOMO \
                     --version 1.0 \
                     --model-file ./model_file.py \
                     --serialized-file fomo_model.pth \
                     --handler ./handler_file.py \
                     --extra-files ./index_to_name.json
```

However, we can also call the generate_model_archive() function directly, as if we were passing the arguments using the command line, and it will build the model archive for us.

First, though, let's create a folder to store our models: model_store.

If you're in Google Colab, you can create a folder running the command below:

In [None]:
!mkdir ./model_store

Once the folder is created, we can generate the archive file that will be placed inside it:

In [None]:
import sys
from model_archiver.model_packaging import generate_model_archive

sys.argv = ['',
            '--model-name', 'FOMO',
            '--version', '1.0',
            '--model-file', 'model_file.py',
            '--serialized-file', 'fomo_model.pth',
            '--handler', 'handler_file.py',
            '--extra-files', 'index_to_name.json',
            '--export-path', './model_store',
            '--force']

generate_model_archive()
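Overwriting `sys.argv` affects the whole Python process, so the fake arguments stay in place after the call. If that bothers you, one possible approach is to swap the arguments only temporarily. The sketch below assumes a hypothetical helper, `patched_argv()`, and uses a small argparse-based function, `parse_cli()`, as a stand-in for any entry point that reads its arguments from `sys.argv`.

```python
import argparse
import sys
from contextlib import contextmanager

@contextmanager
def patched_argv(*args):
    """Hypothetical helper: temporarily replace sys.argv, then restore it."""
    original = sys.argv
    sys.argv = ['', *args]
    try:
        yield
    finally:
        sys.argv = original

def parse_cli():
    # Stand-in for a CLI entry point that parses sys.argv
    parser = argparse.ArgumentParser()
    parser.add_argument('--model-name')
    parser.add_argument('--force', action='store_true')
    return parser.parse_args()

argv_before = list(sys.argv)

with patched_argv('--model-name', 'FOMO', '--force'):
    args = parse_cli()

# The arguments were parsed, and sys.argv is back to its previous value
print(args.model_name, args.force, sys.argv == argv_before)
```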

All set! We have a model archive now, one we can easily serve using TorchServe.
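Since the model archive is a collection of zipped files, we can peek inside it using Python's zipfile module. The sketch below assumes the archive was created in the default (zip) format; `describe_archive()` is a hypothetical helper, and the inspection only runs if the file exists at the path used above.

```python
import json
import os
import zipfile

def describe_archive(path):
    """Return the file names and the parsed manifest of a .mar (zip) file."""
    with zipfile.ZipFile(path) as mar:
        names = mar.namelist()
        manifest = json.loads(mar.read('MAR-INF/MANIFEST.json'))
    return names, manifest

# Only runs if the archive was actually generated in the previous step
mar_path = './model_store/FOMO.mar'
if os.path.exists(mar_path):
    names, manifest = describe_archive(mar_path)
    print(names)
    print(manifest['model'])
```

The `model` entry of the manifest is the same dictionary the handler's initialize() method consults when it looks for a model file.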

## 11.4 TorchServe

[TorchServe](https://pytorch.org/serve/) is a flexible and easy to use tool for serving and scaling PyTorch eager mode and scripted models in production. It offers APIs for querying, managing, and analyzing the performance of its served models (by default, they are only accessible from localhost):

- [Inference API](https://github.com/pytorch/serve/blob/master/docs/inference_api.md): it listens to port 8080, and it offers the following services
  - description (`OPTIONS /`)
  - health check (`GET /ping`)
  - predictions (`POST /predictions/{model_name}`)
  - explanations (`POST /explanations/{model_name}`)
  - kserve (`/v1/models/{model_name}:predict`)
  - kserve explanations (`/v1/models/{model_name}:explain`)
  
- [Management API](https://github.com/pytorch/serve/blob/master/docs/management_api.md): it listens to port 8081, and it offers the following services
  - description (`OPTIONS /`)
  - list models (`GET /models`)
  - describe a model (`GET /models/{model_name}`)
  - register a model (`POST /models`)
  - scale workers (`PUT /models/{model_name}`)
  - set default version (`PUT /models/{model_name}/{version}/set-default`)
  - unregister a model (`DELETE /models/{model_name}/{version}`)
  
- [Metrics API](https://github.com/pytorch/serve/blob/master/docs/metrics_api.md): it listens to port 8082, and it returns Prometheus-formatted frontend and backend metrics, such as number of requests, CPU and memory utilization, handler and prediction time, and many more.

In this course, we're only illustrating the basic functionalities of TorchServe, so we're focusing on the /predictions service of the Inference API only.

It is typical to run TorchServe through the command line interface. To start TorchServe, serving our FOMO model (archived as FOMO.mar in the model_store folder) through a custom port (as in the config.properties file), we would run the following command:
```
torchserve --start \
           --disable-token-auth \
           --model-store ./model_store \
           --models fomo=FOMO.mar \
           --ts-config config.properties
```

But, since you're likely running this notebook on Google Colab or some other platform, it is probably more convenient to start TorchServe by calling the appropriate Python function instead.

First, though, let's configure it in such a way that it uses a different port for inference:

In [None]:
config_properties = """
inference_address=http://127.0.0.1:7777
"""

with open('config.properties', 'w') as fp:
    fp.write(config_properties)

This will allow you to locally submit requests to your model. If you'd like to bind the inference API to all network interfaces, you should use 0.0.0.0 as the address instead. It won't work on Google Colab, though.
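The other two APIs can be moved to different ports in the same way, using the `management_address` and `metrics_address` keys. Here is a small sketch that renders such a configuration from a dictionary; the `to_properties()` helper is hypothetical, the port numbers other than 7777 are arbitrary examples, and the result is only printed (not written to disk), so the file created above is left untouched.

```python
def to_properties(settings):
    """Render a dict as key=value lines, the format used by config.properties."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"

# Port numbers other than 7777 are arbitrary examples
settings = {
    "inference_address": "http://127.0.0.1:7777",
    "management_address": "http://127.0.0.1:7778",
    "metrics_address": "http://127.0.0.1:7779",
}

config_text = to_properties(settings)
print(config_text)
```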

Now, we can call the start() function as if we were passing arguments using the command line:

In [None]:
from ts.model_server import start

sys.argv = ['',
            '--start',
            '--disable-token-auth',
            '--model-store', './model_store',
            '--models', 'fomo=FOMO.mar',
            '--ts-config', 'config.properties']
start()
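Instead of guessing how long the server takes to come up, one option is to poll the health check service (`GET /ping`) until it answers. The function below is only a sketch using the standard library: `wait_until_ready()` is a hypothetical helper, the timeout values are arbitrary, and the address assumes the inference port configured above.

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(url, timeout=30.0, interval=1.0):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not accepting connections yet; try again shortly
            pass
        time.sleep(interval)
    return False

# Health check on the inference port configured in config.properties
# ready = wait_until_ready('http://127.0.0.1:7777/ping')
```

The call is left commented out so the cell is safe to run even when no server is listening; uncomment it after calling start().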

After a few seconds (don't rush into running the following cell or you may get a connection refused error), your server should be up and running, and you can submit requests to it using your local address (127.0.0.1) and the inference API port (7777), invoking your model (as named in the models argument, so it's fomo in our case) together with the data (an image):

In [None]:
import requests

with open('./fig_0_100.jpg', 'rb') as f:
    data = f.read()

response = requests.put('http://127.0.0.1:7777/predictions/fomo', data=data)
response.json()

{'Fig': 0.9934685230255127,
 'Orange': 0.004324017558246851,
 'Onion White': 0.0012627042597159743,
 'Mandarine': 0.0009447108022868633}

Congratulations! You deployed a model in production using TorchServe!

Once we're done with it, we can stop the server by, somewhat ironically, calling the start() function once again with the --stop argument:

In [None]:
#!torchserve --stop
sys.argv = ['', '--stop']
start()

TorchServe has stopped.


It is really cool to see your own code up and running in Google Colab, being able to send HTTP requests to it and getting predictions back, but wouldn't it be even cooler to be able to show it to your friends or colleagues?

### 11.4.1 Ngrok (optional)

"*Online in One Line*" reads the [ngrok](https://ngrok.com/) website. It is an easy and convenient way of serving your model through a tunnel, thus allowing it to handle incoming requests from the outside world in your own Jupyter Notebook.

***
**DISCLAIMER**: You should NOT use Google Colab notebooks as backend for your deployed models. This is just a proof-of-concept, and a way to make your model available to the world for a brief amount of time, so you can showcase it to your family, friends, or colleagues.
***

If you want to try the code below, you'll need to [sign up](https://dashboard.ngrok.com/signup) for a free account on [ngrok](https://ngrok.com/). The [pyngrok](https://pypi.org/project/pyngrok/) package, which we installed at the beginning of this chapter, takes care of downloading and installing ngrok.

You'll need to copy your [authorization token](https://dashboard.ngrok.com/get-started/your-authtoken) and paste it in the appropriate command below:

***
**DISCLAIMER**: The responsibility for keeping your credentials and/or authorization tokens safe and private is your own. Make sure to remove any credentials and/or authorization tokens from your notebook before saving or pushing it to public repositories, such as GitHub.
***

In [None]:
# Option 1
# You can call ngrok with your token
# Uncomment the line below and replace ... with your token
# !ngrok authtoken ...

# Option 2
# Or you can save it to a configuration file
# Uncomment the line below and replace ... with your token
# !echo "authtoken: ..." >> /root/.ngrok2/ngrok.yml

Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml


Once ngrok is set up, let's start TorchServe once again with a few modifications in the `config.properties` file:

***
**DISCLAIMER**: CORS stands for cross-origin resource sharing, and the configuration below makes TorchServe wide open to requests from anywhere. You SHOULD NOT use these configuration parameters in production as they're not safe. The responsibility for ensuring the security of your application, model, and data is your own.
***

In [None]:
config_properties = """
inference_address=http://127.0.0.1:7777
cors_allowed_origin=*
cors_allowed_methods=GET, POST, PUT, OPTIONS
"""

with open('config_cors.properties', 'w') as fp:
    fp.write(config_properties)

In [None]:
sys.argv = ['',
            '--start',
            '--disable-token-auth',
            '--model-store', './model_store',
            '--models', 'fomo=FOMO.mar',
            '--ts-config', 'config_cors.properties']
start()

TorchServe should be up and running already, so we can use ngrok to build a tunnel and forward external requests to it. Since we're using a non-standard 7777 port, we need to specify it as the port that's handling HTTP requests:

In [None]:
from pyngrok import ngrok

# <NgrokTunnel: "http://<public_sub>.ngrok.io" -> "http://localhost:7777">
http_tunnel = ngrok.connect(7777, "http")



The tunnel's public URL can be found in the public_url attribute:

In [None]:
http_tunnel.public_url

'https://f295-35-202-252-169.ngrok-free.app'

Now, you (or anyone else) can send requests to your model, provided they know the public URL and the name of your model (fomo, in our example).

To make predictions, we need to send a PUT request with the image data, as shown below:

In [None]:
with open('./fig_0_100.jpg', 'rb') as f:
    data = f.read()

response = requests.put(f'{http_tunnel.public_url}/predictions/fomo', data=data)
response.json()

{'Fig': 0.9934685230255127,
 'Orange': 0.004324017558246851,
 'Onion White': 0.0012627042597159743,
 'Mandarine': 0.0009447108022868633}

The response, in JSON, contains the predictions of our served model! Congratulations! Now you can showcase your model to your family, friends, and colleagues!

Once you're done, you can disconnect the tunnel and stop TorchServe:

In [None]:
ngrok.disconnect(http_tunnel.public_url)

In [None]:
sys.argv = ['', '--stop']
start()

TorchServe has stopped.
