# Effortless model deployment with MLflow

## Packaging an NLP text review classifier from HuggingFace with MLflow

This example demonstrates how to use MLflow to package models that require multiple assets to be loaded at inference time. To showcase this, I will try to stay as close as possible to real life: let's save an NLP classifier created with the popular `transformers` library from HuggingFace. This model will classify reviews according to a 5-star ranking: 1, 2, 3, 4 or 5. We will create the model and then show how you can save it in MLflow format to achieve our so-called effortless deployment.

## Using a pretrained model from HuggingFace

Let's create an NLP classifier that assigns a number of stars to a given text representing a product review. We are going to borrow a model already trained to perform this task from HuggingFace. HuggingFace🤗 is one of the most robust AI communities out there, with a wide range of solutions from models to datasets, built on top of open source principles, so let's take advantage of it.

In this case we will use [`nlptown/bert-base-multilingual-uncased-sentiment`](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment?text=I+like+you.+I+love+you). This is a bert-base-multilingual-uncased model finetuned for sentiment analysis on product reviews in six languages: English, Dutch, German, French, Spanish and Italian. It predicts the sentiment of the review as a number of stars (between 1 and 5).

The model can be used directly as a sentiment analysis model for product reviews in any of the six languages, or further fine-tuned on related sentiment analysis tasks. To keep the example small, we won't do any fine-tuning with our own data this time.

In [None]:
from transformers.models.auto import AutoConfig, AutoModelForSequenceClassification
from transformers.models.auto.tokenization_auto import AutoTokenizer

### Loading the model

Let's start by loading our model configuration. To do that we will use the `transformers` library, which provides a convenient way to pull a model from the HuggingFace repository just by using its repository name:

In [None]:
model_uri = 'nlptown/bert-base-multilingual-uncased-sentiment'
config = AutoConfig.from_pretrained(model_uri)

Here we can see some interesting properties of the model:

In [None]:
print('Architecture:', config.architectures)
print('Classes:', config.label2id.keys())

One of the aspects that makes BERT-based models perform well is the use of well-designed tokenizers. A tokenizer transforms text from a sequence of characters into a sequence of words or tokens (BERT actually uses a WordPiece tokenizer, so it returns sequences of parts of words). Tokenizers are an important concept because you have to make sure you use the same tokenizer your model was trained with. Fortunately, `transformers` has a convenient way to pull the tokenizer associated with a given model. The cell below pulls the tokenizer and then the model itself:

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_uri)
model = AutoModelForSequenceClassification.from_pretrained(model_uri, config=config)
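
To get a feeling for what "parts of words" means, here is a toy sketch of the greedy longest-match-first lookup that WordPiece tokenizers apply to each word. The vocabulary is made up for illustration and is not the vocabulary of the model above; the function name `wordpiece` is also hypothetical.

```python
# Toy illustration only: this vocabulary is invented and is NOT the
# vocabulary of bert-base-multilingual-uncased.
TOY_VOCAB = {"un", "##afford", "##able", "good", "##ness", "[UNK]"}

def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Split one lowercase word with greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first and shrink until it matches.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matches: the whole word is treated as unknown
        pieces.append(piece)
        start = end
    return pieces

print(wordpiece("unaffordable"))  # ['un', '##afford', '##able']
print(wordpiece("goodness"))      # ['good', '##ness']
print(wordpiece("xyz"))           # ['[UNK]']
```

The real tokenizer also normalizes the text and adds special tokens, which is one more reason to always load the tokenizer that ships with the model.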

Let's check for GPUs:

In [None]:
import torch

if torch.cuda.is_available():
    print("Switching model to CUDA device")
    model = model.cuda()
else:
    print("No CUDA device found. Using CPU.")

We won't do any further training, so it is important to switch our model to evaluation mode so we get reproducible predictions:

In [None]:
_ = model.eval()

Let's try the model with some sample data. To do that, we can create a sample text to send to the model:

In [None]:
import pandas as pd 

In [None]:
sample = pd.DataFrame({ 'text': ['good enough',
                                 'The overall quality is good, but there are certain aspects of the product that made it hard to use']})
sample

Let's run our model. Our model can't handle text directly, which is why we need a tokenizer. It will convert the text to tensors representing the text. Then we can pass those representations to our model:

In [None]:
inputs = tokenizer(list(sample['text'].values), padding=True, return_tensors='pt')

if model.device.index is not None:
    print("Model is on a different device than the inputs. Moving inputs to device:", model.device.index)
    for key in inputs.keys():
        inputs[key] = inputs[key].to(model.device.index)
    
predictions = model(**inputs)

Our model actually returns logits (raw, unnormalized scores) rather than probabilities, so we need to apply a softmax to change the domain:

In [None]:
import torch
probs = torch.nn.Softmax(dim=1)(predictions.logits)
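
If the softmax step feels opaque, the following dependency-free sketch shows what it computes for a single review. The logits are invented values, one per star rating, and the helper name `softmax` is only for illustration; in the notebook itself `torch.nn.Softmax` does this work.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one review, one score per star rating (1 to 5).
logits = [-1.2, -0.3, 0.8, 2.1, 0.4]
probs = softmax(logits)

# The predicted class is the position of the largest probability.
best = max(range(len(probs)), key=probs.__getitem__)
print(f"sum = {sum(probs):.3f}, predicted class index = {best}")
```

Softmax keeps the ordering of the scores, so the class with the largest logit is also the class with the largest probability.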

We are using the PyTorch backend with `transformers`, so the results are tensors that live on the inference device. To manipulate them easily, we can convert them to a NumPy array:

In [None]:
probs = probs.detach().cpu().numpy()

Let's see our results:

In [None]:
classes = probs.argmax(axis=1)
confidences = probs.max(axis=1)

In [None]:
outputs = pd.DataFrame({ 'rating': [config.id2label[c] for c in classes], 'confidence': confidences })
outputs

Great, our model seems to work well. It would be nice to have a validation dataset to actually measure how well our model performs. I will leave that as an exercise for the reader.
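
As a possible starting point for that exercise, here is a minimal sketch that computes exact accuracy and "off by one star" accuracy from two lists of labels. It assumes the labels look like `'4 stars'` (a number followed by a word); check `config.id2label` to confirm the format. The labels below are made up, and the helper names are hypothetical.

```python
def stars(label):
    """Extract the leading integer from a label such as '4 stars'."""
    return int(label.split()[0])

def evaluate(true_labels, predicted_labels):
    """Return (exact accuracy, off-by-one accuracy) for two label lists."""
    pairs = [(stars(t), stars(p)) for t, p in zip(true_labels, predicted_labels)]
    exact = sum(t == p for t, p in pairs) / len(pairs)
    off_by_one = sum(abs(t - p) <= 1 for t, p in pairs) / len(pairs)
    return exact, off_by_one

# Made-up labels for illustration; real ones would come from a labeled
# dataset and from the 'rating' column of the outputs DataFrame.
y_true = ["5 stars", "3 stars", "1 star", "4 stars"]
y_pred = ["4 stars", "3 stars", "1 star", "2 stars"]

exact, off_by_one = evaluate(y_true, y_pred)
print(f"exact: {exact:.2f}, off by one: {off_by_one:.2f}")
```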

### Saving the model with MLflow

Now that we are happy with the model we got, it's time to save it. As usual, the first step is to create the model signature. Let's see what the inputs and outputs of this model are:

In [None]:
from mlflow.models.signature import infer_signature

signature = infer_signature(sample, outputs)
signature

#### Saving a HuggingFace model with MLflow

MLflow didn't directly support HuggingFace models when this example was written (newer releases include a `transformers` flavor), so we use the `pyfunc` flavor to save it. As we did in the previous example with the recommender model, we can create a Python class that inherits from `PythonModel` and then place everything we need there. Something like this:

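```python
import mlflow

class BertTextClassifier(mlflow.pyfunc.PythonModel):
    def __init__(self, model, tokenizer):
        # Both objects become attributes, so they get pickled with the class
        self.model = model
        self.tokenizer = tokenizer

    def predict(self, context: mlflow.pyfunc.PythonModelContext, data):
        ...  # tokenize `data`, run `self.model`, and return the predictions
```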

Although it works, doing so would have some limitations:

- `model` and `tokenizer` will get serialized in the object, but PyTorch has more efficient ways to store models.
- `model` contains references to the training device, and hence those will get serialized too.
- `model` is a big object, so persisting it will generate a big `Pickle` file.

However, MLflow provides another way to deal with artifacts that your model may need to operate but that you don't want to serialize in a Python object. That is done by indicating `artifacts`.
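
The size concern is easy to reproduce without any model at all. In the sketch below, a block of zero bytes stands in for the model weights, and two plain dictionaries stand in for the state of a wrapper object; all names and sizes are invented for illustration.

```python
import pickle

fake_weights = bytes(10_000_000)  # stand-in for roughly 10 MB of model weights

# State of a wrapper that keeps the model as an attribute.
eager_state = {"model": fake_weights, "max_length": 128}

# State of a wrapper that only remembers small settings; the weights
# would be read from an artifact file when the model is loaded.
lazy_state = {"max_length": 128}

eager_size = len(pickle.dumps(eager_state))
lazy_size = len(pickle.dumps(lazy_state))
print(f"eager pickle: {eager_size:,} bytes, lazy pickle: {lazy_size:,} bytes")
```

Everything reachable from the pickled object ends up in the pickle, which is why keeping heavy assets in separate files keeps the serialized Python object small.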

#### Artifacts in MLflow

We didn't mention it before, but if you take a closer look at the signature of the method `mlflow.pyfunc.log_model` you will find an argument called `artifacts`. This parameter can be used to indicate any artifact (meaning, any file) that needs to be packaged in the model package. It can be 1) any number of files and 2) of any type. Whatever you indicate there will be persisted and packaged along with the model object.

Artifacts are indicated using a dictionary whose keys are the names of the artifacts and whose values are the paths in the local file system where each artifact is currently placed. **Any file indicated in this dictionary will be copied and packaged inside the package along with the model.** Note that in this example every artifact is a path to a file, not to a directory.

The `transformers` library provides a convenient way to store all the artifacts of a given model: the `save_pretrained` method of the model.

In [None]:
model_path = 'rating_classifier'
model.save_pretrained(model_path)

This will write the model configuration (`config.json`) and a weights file called `pytorch_model.bin` (recent versions of `transformers` may write `model.safetensors` instead). However, remember that in order to run the model we also need its corresponding tokenizer. The same `save_pretrained` method is available for the tokenizer, which will generate another set of files:

In [None]:
tokenizer.save_pretrained(model_path)

Here we can actually see all the files the tokenizer needs in order to operate. Let's tell MLflow that we need all these files to run the model. First, we need to create the dictionary I mentioned before:

In [None]:
import os, pathlib

artifacts = { pathlib.Path(file).stem: os.path.join(model_path, file) 
             for file in os.listdir(model_path) 
             if not os.path.basename(file).startswith('.') }

> **What is this code doing?** It creates a dictionary with the name of each file (without the extension) as the key and the full path as the value. Files that start with a dot (.) are not included, since these files are usually hidden.

Let's see the output:

In [None]:
artifacts

Great! So `artifacts` is now a dictionary that contains all the elements we need to run the model.
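
The dictionary comprehension can also be tried in isolation on a throwaway folder. The file names below are only examples of what `save_pretrained` may write (they depend on the `transformers` version), and `artifacts_demo` is a name made up for this sketch.

```python
import os
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create empty placeholder files, including a hidden one.
    for name in ["config.json", "vocab.txt", "tokenizer_config.json", ".gitignore"]:
        pathlib.Path(folder, name).touch()

    # Same logic as above: file name without extension -> full path.
    artifacts_demo = {pathlib.Path(file).stem: os.path.join(folder, file)
                      for file in os.listdir(folder)
                      if not file.startswith('.')}

print(sorted(artifacts_demo))  # ['config', 'tokenizer_config', 'vocab']
```

One thing to keep in mind with this approach: two files that differ only in their extension would map to the same key, and only one of them would survive in the dictionary.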

#### How will these artifacts be loaded?

We now need to tell MLflow how to load these artifacts at inference time. When we introduced the class `PythonModel` from MLflow, we mentioned the existence of the method `load_context` but didn't say much more than that, and we didn't implement it in the Python wrapper we created. However, this method gives the model builder a chance to load any artifacts that the model may need. Such artifacts are located inside the model package and can be accessed directly.

In our case, we need to load the BERT model and the tokenizer. The `transformers` library has a method, `from_pretrained`, that can handle models stored locally. We are going to use it inside `load_context`.

In [None]:
from mlflow.pyfunc import PythonModel, PythonModelContext
from typing import Dict

class BertTextClassifier(PythonModel):
    def load_context(self, context: PythonModelContext):
        import os
        import torch
        from transformers.models.auto import AutoConfig, AutoModelForSequenceClassification
        from transformers.models.auto.tokenization_auto import AutoTokenizer
        
        config_file = os.path.dirname(context.artifacts["config"])
        self.config = AutoConfig.from_pretrained(config_file)
        self.tokenizer = AutoTokenizer.from_pretrained(config_file)
        self.model = AutoModelForSequenceClassification.from_pretrained(config_file, config=self.config)
        
        if torch.cuda.is_available():
            print('[INFO] Model is being sent to CUDA device as GPU is available')
            self.model = self.model.cuda()
        else:
            print('[INFO] Model will use CPU runtime')
        
        _ = self.model.eval()
        
    def _predict_batch(self, data):
        import torch
        import pandas as pd
        
        with torch.no_grad():
            inputs = self.tokenizer(list(data['text'].values), padding=True, return_tensors='pt', truncation=True)
        
            if self.model.device.index is not None:
                torch.cuda.empty_cache()
                for key in inputs.keys():
                    inputs[key] = inputs[key].to(self.model.device.index)

            predictions = self.model(**inputs)
            probs = torch.nn.Softmax(dim=1)(predictions.logits)
            probs = probs.detach().cpu().numpy()

            classes = probs.argmax(axis=1)
            confidences = probs.max(axis=1)

            return classes, confidences
        
    def predict(self, context: PythonModelContext, data: pd.DataFrame) -> pd.DataFrame:
        import math
        import numpy as np
        import pandas as pd
        
        batch_size = 64
        sample_size = len(data)
        
        # Integer dtype so the class ids can be used as keys of id2label
        classes = np.zeros(sample_size, dtype=int)
        confidences = np.zeros(sample_size)

        for batch_idx in range(0, math.ceil(sample_size / batch_size)):
            bfrom = batch_idx * batch_size
            bto = bfrom + batch_size
            
            c, p = self._predict_batch(data.iloc[bfrom:bto])
            classes[bfrom:bto] = c
            confidences[bfrom:bto] = p
            
        return pd.DataFrame({'rating': [self.config.id2label[c] for c in classes], 
                             'confidence': confidences })  
        

A few things to note here:
- `context.artifacts` contains a dictionary similar to the one we created before, where each value is the path, **now inside the MLflow package**, where the artifact named by the key is located. So we can access any file directly. In this case, we access the file `config`.
- The `transformers` library can load a model, tokenizer and config directly from a folder, since it then loads each of the required files. This is why we use only the `artifacts['config']` path and extract the folder that contains it, although the paths of the rest of the files are also available (`artifacts['tokenizer']`, `artifacts['vocab']`, etc.) in case you need to access each file individually.
- `BertTextClassifier` doesn't have a constructor. A constructor is optional, and since we are not using it I removed it. Use constructor parameters for values that you want to persist with your model but that you don't have in an artifact: for instance, the maximum length of the supported sequence, error messages, or any other piece of data that you may need.
- Imports are always done inside the methods that need them, such as `load_context` or `predict`.
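
The batching arithmetic in `predict` can be checked on its own. The sketch below reproduces the index computation with a hypothetical helper, `batch_bounds`, which is not part of the class above.

```python
import math

def batch_bounds(sample_size, batch_size=64):
    """Yield (start, stop) row positions, one pair per batch."""
    for batch_idx in range(math.ceil(sample_size / batch_size)):
        start = batch_idx * batch_size
        # Slicing past the end is safe for iloc and NumPy, so the model code
        # does not clamp; we clamp here only to make the sizes explicit.
        stop = min(start + batch_size, sample_size)
        yield start, stop

print(list(batch_bounds(150)))  # [(0, 64), (64, 128), (128, 150)]
```

With 150 rows and a batch size of 64, the model is called three times and the last batch holds the remaining 22 rows.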

Now that we have all the pieces, it's time to log the model:

In [None]:
import mlflow

mlflow.set_experiment('bert-classification')

with mlflow.start_run():
    mlflow.pyfunc.log_model('classifier', 
                            python_model=BertTextClassifier(), 
                            artifacts=artifacts, 
                            signature=signature,
                            registered_model_name='bert-rating-classification')

### Testing the MLflow model

We can load the model in code using the following line. In this case we are assuming the model was registered under the name `bert-rating-classification`. We are also retrieving its latest version.

In [None]:
import mlflow

model = mlflow.pyfunc.load_model('models:/bert-rating-classification/latest')

Running the `predict` function:

In [None]:
model.predict(sample)

### Serving the model locally

We can run the model in an inference server on our local machine. Again, this lets us check that our deployment strategy will work.

To do so, let's serve our model using MLflow:

```bash
mlflow models serve -m models:/bert-rating-classification/latest
```

Creating a sample request

In [None]:
import json

with open("sample.json", "w") as f:
    f.write(sample.to_json(orient='split', index=False))

> Note how the model inputs are specified. MLflow requires the inputs to be submitted in `JSON` format, and multiple specifications are supported. In the Cats vs Dogs sample we saw before, we used the TensorFlow Serving specification. Now, since we are using tabular data, we can use the Pandas columnar (split-oriented) format.
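
For reference, this is the structure that the split-oriented format produces, built here by hand with the standard library. The helper `to_split_payload` is hypothetical and only mirrors what `to_json(orient='split', index=False)` emits. The wrapped variant at the end reflects my understanding that MLflow 2.x scoring servers expect the frame under a `dataframe_split` key; treat that as an assumption and check it against the MLflow version you have installed.

```python
import json

def to_split_payload(columns, rows):
    """Build a column-names-plus-rows structure (Pandas 'split' orientation)."""
    return {"columns": list(columns), "data": [list(row) for row in rows]}

payload = to_split_payload(["text"], [["good enough"], ["hard to use"]])
print(json.dumps(payload))

# Assumption: newer MLflow scoring servers expect the frame under a
# 'dataframe_split' key instead of at the top level.
wrapped = {"dataframe_split": payload}
print(json.dumps(wrapped))
```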

Sending the request

In [None]:
!cat sample.json | curl http://127.0.0.1:5000/invocations \
                        --request POST \
                        --header 'Content-Type: application/json' \
                        --data-binary @-

### Deploying to Azure ML

#### ACI

In [None]:
from mlflow.deployments import get_deploy_client

In [None]:
client = get_deploy_client(os.environ['MLFLOW_TRACKING_URI'])

In [None]:
import json

deploy_config = {
  "computeType": "aci",
  "containerResourceRequirements": 
  {
    "cpu": 2,
    "memoryInGB": 4 
  }
}

deployment_config_path = "deployment_config.json"
with open(deployment_config_path, "w") as outfile:
    outfile.write(json.dumps(deploy_config))

In [None]:
webservice = client.create_deployment(model_uri='models:/bert-rating-classification/latest',
                                      name="bert-rating-classification",
                                      config={'deploy-config-file': deployment_config_path})

Creating a sample request

In [None]:
import json

with open("sample.json", "w") as f:
    f.write('{ "input_data": ' + sample.to_json(orient='split') + '}')

> Note how the model inputs are specified. The payload is the same Pandas split-oriented `JSON` as before, this time wrapped in an `input_data` key.

Sending the request

In [None]:
!cat sample.json | curl http://f378d2e3-f044-44bd-9009-6a35abe4d78d.eastus.azurecontainer.io/score \
                    --request POST \
                    --header 'Content-Type: application/json' \
                    --header 'Authorization: Bearer <TOKEN>' \
                    --data-binary @-

## Extra: Logging the model using a model loader

As we saw, `artifacts` provide a convenient way to tell the model exactly what we need to run it. However, it is worth mentioning another alternative for models that require a couple of files to run, when we are fine having all of them in a folder and then loading the entire directory with everything inside it.

This is the case for the `transformers` model we are working with, because we can place all the files (tokenizer, model, vocab) in a folder and the library will just load what it needs. If this is the case, we can use model loaders (similar to what we did in example #2 of the blog series). Just a couple of things would need to be changed:

In [None]:
%%writefile huggingface_model_loader.py

import torch
import pandas as pd
from transformers.models.auto import AutoConfig, AutoModelForSequenceClassification
from transformers.models.auto.tokenization_auto import AutoTokenizer

class BertTextClassifier:
    def __init__(self, baseline_model: str, tokenizer = None):
        self.baseline_model = baseline_model
        self.config = AutoConfig.from_pretrained(baseline_model)
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer or baseline_model)
        self.model = AutoModelForSequenceClassification.from_pretrained(baseline_model, config=self.config)
        
    def predict(self, data: pd.DataFrame) -> pd.DataFrame:
        inputs = self.tokenizer(list(data['text'].values), padding=True, return_tensors='pt')
        predictions = self.model(**inputs)
        probs = torch.nn.Softmax(dim=1)(predictions.logits)
        probs = probs.detach().numpy()
        
        classes = probs.argmax(axis=1)
        confidences = probs.max(axis=1)
        
        return pd.DataFrame({'rating': [self.config.id2label[c] for c in classes], 
                             'confidence': confidences })
        
        
def _load_pyfunc(path):
    import os
    return BertTextClassifier(os.path.abspath(path))

In [None]:
import mlflow

mlflow.set_experiment('bert-classification')

with mlflow.start_run():
    mlflow.pyfunc.log_model("classifier", 
                            data_path=model_path, 
                            code_path=["./huggingface_model_loader.py"], 
                            loader_module="huggingface_model_loader", 
                            registered_model_name="bert-rating-classification", 
                            signature=signature)

> Both implementations are equally capable. We can decide which one is simpler depending on the scenario and requirements.