# Introduction to Model Serving

## Table of Contents

1. Overview
2. ML Serving Process
3. Serving Your First Model
4. Introduction to MLServer
5. Serving Classic ML Models
6. Multi-Model Serving
7. Serving Custom Models
8. Batch Inference
9. Packaging
10. Deployment

## 1. Overview

In this workshop, we will walk through how to serve machine learning 
models, at a pace that works for beginners and experts alike. 
We will cover the essential components, best practices, and 
practical strategies for packaging and serving.

We will start by going over the machine learning deployment lifecycle. Then we will train our own 
model and serve it in several different ways, step by step.

We will use the following tools:

- `scikit-learn`
- `fastapi`
- `uvicorn`
- `mlserver`
- `mlserver_sklearn`
- `pydantic`
- `joblib`
- `requests`
- `llama-cpp-python`

## 2. The Process

You can think of the machine learning deployment lifecycle as a 6-step process 
that starts once you have collected data and trained and evaluated a model. Here are 
the steps.


1. Serialize and Package the Model:
   - Serialize the trained model into a format suitable for deployment (e.g., pickle, ONNX, TensorFlow SavedModel).
   - Package the serialized model along with any necessary dependencies and configurations.

2. Choose a Deployment Architecture:
   - Select an appropriate deployment architecture based on the requirements (e.g., RESTful API, microservices, serverless).
   - Consider factors such as scalability, latency, and resource utilization.

3. Containerize the Model:
   - Create a container (e.g., Docker) that encapsulates the model and its dependencies.
   - Configure the container to expose the necessary endpoints for model inference.

4. Deploy the Model:
   - Choose a suitable platform for deploying the containerized model (e.g., Kubernetes, AWS, GCP, Azure).
   - Set up the necessary infrastructure and configurations for deployment.
   - Deploy the model container to the chosen platform.

5. Expose the Model Endpoint:
   - Create an API endpoint that accepts input data and returns model predictions.
   - Handle request/response formatting and any necessary data transformations.

6. Monitor and Maintain:
   - Implement monitoring and logging to track the model's performance and health.
   - Set up alerts and notifications for any anomalies or errors.
   - Regularly update and retrain the model as new data becomes available.
   - Handle model versioning and deployment updates as needed.

Here's the process expressed as a mermaid diagram:

```mermaid
graph LR
    A[Collect Data] --> B[Engineer Features]
    B --> C[Train the Model]
    C --> D[Evaluate Model]
    D --> B
    D --> E[Serialize and Package the Model]
    D --> F[Choose a Deployment Architecture]
    E --> G[Containerize the Model]
    F --> G
    G --> H[Deploy the Model]
    H --> I[Expose the Model Endpoint]
    I --> J[Monitor and Maintain]
    J --> B
```

This diagram illustrates the high-level steps of the machine learning 
lifecycle, from data collection and training through deployment, exposure, and maintenance. 
The arrows back to feature engineering show that both evaluation and monitoring can 
send us back to an earlier step. Each step plays a role in making sure the model is 
served effectively and can be accessed by its intended consumers.
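
Step 1 of the process (serialize the model) can be tried out with the standard library alone. The sketch below is a minimal illustration, not the model used in this workshop: `params` is a hypothetical linear model kept as a plain dictionary, and `score`, `save_model`, and `load_model` are helpers invented for this example. It writes the model to disk with `pickle`, loads it back, and checks that the restored copy behaves like the original.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical "trained" model: a linear scorer kept as plain data.
params = {"weights": [0.5, -1.0, 2.0], "bias": 0.25}


def score(model: dict, row: list) -> float:
    """Return the weighted sum of the row plus the bias."""
    return sum(w * x for w, x in zip(model["weights"], row)) + model["bias"]


def save_model(model: dict, path: Path) -> None:
    """Serialize the model to disk."""
    with path.open("wb") as f:
        pickle.dump(model, f)


def load_model(path: Path) -> dict:
    """Load a model from disk; only unpickle files you trust."""
    with path.open("rb") as f:
        return pickle.load(f)


row = [1.0, 2.0, 3.0]
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "model.pkl"
    save_model(params, artifact)
    restored = load_model(artifact)

# The restored model must behave exactly like the original one.
same_output = score(restored, row) == score(params, row)
print("round trip ok:", same_output)
```

A round-trip check like this is cheap to run right after saving, and it catches a broken artifact before it gets packaged into a container.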

## 3. Serving Your First Model

For our first model, we will train a classifier on the Wine dataset from 
scikit-learn and serve it using FastAPI. The Wine dataset is a classic 
dataset that contains the chemical properties of different wine samples 
and the corresponding wine class.

In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [None]:
wine = load_wine(as_frame=True)
wine.data.head(10)

In [None]:
X, y = wine.data.values, wine.target.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train[:5, :]

In [None]:
model = RandomForestClassifier()
model.fit(X_train, y_train)

In [None]:
score = model.score(X_test, y_test)
print("Model score on test set:", score)

In [None]:
model.predict(X_test[:3]), y_test[:3]

In [None]:
import joblib

In [None]:
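joblib.dump(model, 'first_deployment/my_model.joblib')  # loaded by the FastAPI app
joblib.dump(model, 'models/my_model.joblib')  # loaded by the MLServer examples later on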

Next, we write our first model API to `first_deployment/server.py`.

In [None]:
%%writefile first_deployment/server.py

from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()

class InputData(BaseModel):
    alcohol: float
    malic_acid: float
    ash: float
    alcalinity_of_ash: float
    magnesium: float
    total_phenols: float
    flavanoids: float
    nonflavanoid_phenols: float
    proanthocyanins: float
    color_intensity: float
    hue: float
    od280_od315_of_diluted_wines: float
    proline: float


def load_model():
    return joblib.load("my_model.joblib")


model = load_model()

@app.post("/predict")
def predict(data: InputData):
    # Convert input data to a 2D array
    features = [[
        data.alcohol, data.malic_acid, data.ash, data.alcalinity_of_ash,
        data.magnesium, data.total_phenols, data.flavanoids,
        data.nonflavanoid_phenols, data.proanthocyanins, data.color_intensity,
        data.hue, data.od280_od315_of_diluted_wines, data.proline
    ]]
    
    # Make predictions using the loaded model
    prediction = model.predict(features)
    
    # Return the predicted class
    return {"class": prediction.tolist()}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

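That's it! You have successfully served a machine learning model using FastAPI. To start the server, run `python server.py` from inside the `first_deployment` folder, so that the relative path to `my_model.joblib` resolves. The `/predict` endpoint accepts the thirteen wine features as JSON, makes a prediction with the trained random forest classifier, and returns the predicted class. We can now call it from Python.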

In [None]:
import requests

In [None]:
endpoint = "http://localhost:8000/predict"
data = {
    "alcohol": 12.85,
    "malic_acid": 1.6,
    "ash": 2.52,
    "alcalinity_of_ash": 17.8,
    "magnesium": 95,
    "total_phenols": 2.48,
    "flavanoids": 2.37,
    "nonflavanoid_phenols": 0.26,
    "proanthocyanins": 1.46,
    "color_intensity": 3.93,
    "hue": 1.09,
    "od280_od315_of_diluted_wines": 2.81,
    "proline": 625
}
results = requests.post(endpoint, json=data)

In [None]:
results

In [None]:
results.json()
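
The model was trained on a plain array, so it only sees positions, not names: the order in which `predict` lists the fields has to match the column order used for training. Below is a minimal sketch of one way to make that order explicit. `FEATURE_ORDER` and `to_feature_row` are names made up for this example and are not part of `server.py`.

```python
from typing import Dict, List

# Column order of the Wine dataset, matching the fields of `InputData` above.
FEATURE_ORDER = [
    "alcohol", "malic_acid", "ash", "alcalinity_of_ash", "magnesium",
    "total_phenols", "flavanoids", "nonflavanoid_phenols", "proanthocyanins",
    "color_intensity", "hue", "od280_od315_of_diluted_wines", "proline",
]


def to_feature_row(payload: Dict[str, float]) -> List[List[float]]:
    """Turn a request payload into a single-row 2D list in training order."""
    missing = [name for name in FEATURE_ORDER if name not in payload]
    if missing:
        raise ValueError(f"missing features: {missing}")
    return [[float(payload[name]) for name in FEATURE_ORDER]]


# Keys arrive in arbitrary order; the row still comes out in training order.
payload = {name: float(i) for i, name in enumerate(reversed(FEATURE_ORDER))}
row = to_feature_row(payload)
print(row[0][:3])
```

Keeping the order in a single list means a reordered or renamed field fails loudly instead of silently shifting every value by one position.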

## 4. Intro to MLServer

MLServer is an open-source framework that simplifies the deployment of machine learning 
models as production-ready microservices. It provides a scalable and efficient solution 
for serving models, making it easier to integrate them into applications and workflows.

MLServer offers several benefits, such as automatic API documentation, request validation, and support for various deployment scenarios, including containerization with Docker and orchestration with Kubernetes.

With MLServer, we can serve our trained wine classification model as a microservice without writing any server code. All it needs is a `model-settings.json` file that gives the model a name, picks the runtime that loads it (here `mlserver_sklearn.SKLearnModel`), and points to the serialized model file.

In [None]:
%%writefile second_deployment/model-settings.json
{
    "name": "wine-classifier",
    "implementation": "mlserver_sklearn.SKLearnModel",
    "parameters": {
        "uri": "../models/my_model.joblib"
    }
}

In [None]:
!mlserver start second_deployment/

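Because `mlserver start` keeps running until you stop it, you may prefer to launch it from a separate terminal. Once the server is up, you can check out the docs for all of its endpoints at `http://0.0.0.0:8080/v2/docs`.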

![open api specs](./images/openapi.png)
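
MLServer exposes its models through the V2 inference protocol (also known as the Open Inference Protocol), which is why every URL in this section starts with `/v2/`. The sketch below builds the handful of URLs we rely on. `BASE_URL`, `model_url`, and `health_url` are helper names invented for this example, and the base URL assumes MLServer is running locally on its default HTTP port.

```python
from typing import Optional

BASE_URL = "http://localhost:8080"  # assumption: MLServer running locally on its default HTTP port


def model_url(name: str, action: str = "infer", version: Optional[str] = None,
              base_url: str = BASE_URL) -> str:
    """Build a V2 inference protocol URL for a model-level action."""
    if action not in {"infer", "ready"}:
        raise ValueError(f"unsupported action: {action}")
    version_part = f"/versions/{version}" if version else ""
    return f"{base_url}/v2/models/{name}{version_part}/{action}"


def health_url(kind: str = "ready", base_url: str = BASE_URL) -> str:
    """Build a server-level health URL ('live' or 'ready')."""
    if kind not in {"live", "ready"}:
        raise ValueError(f"unsupported health check: {kind}")
    return f"{base_url}/v2/health/{kind}"


print(model_url("wine-classifier"))
print(model_url("wine-classifier", action="ready"))
print(health_url("live"))
```

The `infer` URL takes a `POST` request, while the `ready` and health URLs are plain `GET` checks that are handy before sending the first prediction.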

Time to test it.

In [None]:
X_test[0, None]

In [None]:
input_request = {
    "inputs": [{
        "name": "my-input",
        "datatype": "FP64",  # the wine features are floats, not integers
        "shape": X_test[0, None].shape,
        "data": X_test[0].tolist()
    }]
}
input_request

In [None]:
endpoint = "http://0.0.0.0:8080/v2/models/wine-classifier/infer"
results = requests.post(endpoint, json=input_request)
results.json()
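
The request above has three parts that must agree with each other: the `shape`, the `datatype`, and the flat `data` list. The sketch below shows one way to derive all three from a list of rows. `build_infer_request` is a helper invented for this example, and the rows use only three made-up feature values to keep the output short (the wine model expects thirteen).

```python
import json
from math import prod
from typing import List


def build_infer_request(name: str, rows: List[List[float]], datatype: str = "FP64") -> dict:
    """Build a V2 inference request from a rectangular list of rows."""
    if not rows or any(len(r) != len(rows[0]) for r in rows):
        raise ValueError("rows must be non-empty and all of the same length")
    shape = [len(rows), len(rows[0])]
    # Flatten row by row, so that `data` lines up with `shape`.
    data = [float(value) for row in rows for value in row]
    if len(data) != prod(shape):
        raise ValueError("data length does not match shape")
    return {"inputs": [{"name": name, "shape": shape, "datatype": datatype, "data": data}]}


request = build_infer_request("my-input", [[12.85, 1.6, 2.52], [13.2, 1.78, 2.14]])
print(json.dumps(request, indent=2))
```

Deriving `shape` from the rows themselves avoids the most common mistake with this format, which is sending a `data` list whose length does not match the declared shape.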

## 6. Multi-Model Serving

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression

In [None]:
california = fetch_california_housing(as_frame=True)
california.data.head(10)

In [None]:
X, y = california.data.values, california.target.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)

In [None]:
score = model.score(X_test, y_test)
print("Model score on test set:", score)

In [None]:
joblib.dump(model, "./models/california_housing_model.joblib")
print("Model saved as california_housing_model.joblib")

In [None]:
%%writefile third_deployment/model-settings.json
{
    "name": "cali_model",
    "implementation": "mlserver_sklearn.SKLearnModel",
    "parameters": {
        "uri": "../models/california_housing_model.joblib",
        "version": "v0.1.0"
    }
}

Check that it works.

In [None]:
!mlserver start third_deployment/

In [None]:
from mlserver.codecs import NumpyCodec

In [None]:
NumpyCodec.encode_input(name="predict", payload=X_test[0, None]).dict()

In [None]:
input_request = {
    "inputs": [NumpyCodec.encode_input(name="predict", payload=X_test[0, None]).dict()]
}

In [None]:
endpoint = "http://0.0.0.0:8080/v2/models/cali_model/infer"
results = requests.post(endpoint, json=input_request)
results.json()
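
The response comes back in the same style as the request: each entry in `outputs` carries a `shape` and a flat `data` list. Here is a minimal sketch of turning that back into rows. `output_rows` is a helper made up for this example, and `sample_response` is a hypothetical response body whose numbers are invented for illustration.

```python
from typing import List


def output_rows(response: dict, output_index: int = 0) -> List[list]:
    """Reshape the flat `data` of one output into rows using its `shape`."""
    output = response["outputs"][output_index]
    shape, data = output["shape"], output["data"]
    n_rows = shape[0]
    width = len(data) // n_rows if n_rows else 0
    return [data[i * width:(i + 1) * width] for i in range(n_rows)]


# Hypothetical response body; the numbers are made up for illustration.
sample_response = {
    "model_name": "cali_model",
    "outputs": [
        {"name": "predict", "shape": [2, 1], "datatype": "FP64", "data": [0.72, 1.76]}
    ],
}
print(output_rows(sample_response))
```

With a real server, you would pass `results.json()` to the helper instead of the sample dictionary.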

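To serve both models at the same time, stop the running server and start MLServer from a folder that contains both deployment folders (for example, `mlserver start .` from the project root). MLServer loads every `model-settings.json` it finds under that folder, so each model gets its own endpoint.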

In [None]:
model_name = "cali_model"
endpoint = f"http://0.0.0.0:8080/v2/models/{model_name}/infer"
results = requests.post(endpoint, json=input_request)
results.json()

In [None]:
wine = load_wine().data[0, None]
wine

In [None]:
model_name = "wine-classifier"
input_request = {
    "inputs": [NumpyCodec.encode_input(name="predict", payload=wine).dict()]
}
endpoint = f"http://0.0.0.0:8080/v2/models/{model_name}/infer"
results = requests.post(endpoint, json=input_request)
results.json()
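
With two models behind the same server, it is easy to send a row to the wrong one. The sketch below is a thin client that checks the number of features before sending anything. `EXPECTED_FEATURES`, `infer`, and `fake_post` are names invented for this example, the base URL assumes a local MLServer on port 8080, and the transport is injected so that the example runs without a server.

```python
from typing import Callable, Dict, List

# Number of features each model expects (13 for Wine, 8 for California housing).
EXPECTED_FEATURES: Dict[str, int] = {"wine-classifier": 13, "cali_model": 8}


def infer(model_name: str, row: List[float], post: Callable[[str, dict], dict],
          base_url: str = "http://localhost:8080") -> dict:
    """Validate the row for the chosen model, then send it with `post`."""
    expected = EXPECTED_FEATURES[model_name]
    if len(row) != expected:
        raise ValueError(f"{model_name} expects {expected} features, got {len(row)}")
    payload = {
        "inputs": [{
            "name": "predict",
            "shape": [1, len(row)],
            "datatype": "FP64",
            "data": [float(v) for v in row],
        }]
    }
    return post(f"{base_url}/v2/models/{model_name}/infer", payload)


def fake_post(url: str, payload: dict) -> dict:
    """Stand-in transport that echoes the request instead of calling a server."""
    return {"url": url, "n_values": len(payload["inputs"][0]["data"])}


print(infer("cali_model", [0.0] * 8, post=fake_post))
```

Against a running server, the transport could be `lambda url, payload: requests.post(url, json=payload).json()`.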

## 7. Serving Custom Models

In [None]:
from llama_cpp import Llama

In [None]:
llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen1.5-0.5B-Chat-GGUF",
    filename="*q8_0.gguf",
    verbose=False
)

In [None]:
result = llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are an assistant who is an expert in geography and fun facts."},
          {
              "role": "user",
              "content": "What can you tell me about the capital of the Dominican Republic?"
          }
      ]
)
result

In [None]:
result['choices'][0]['message']['content']

In [None]:
llm.create_chat_completion??

In [None]:
%%writefile fourth_deployment/qwen_model.py
from mlserver import MLModel
from mlserver.codecs import decode_args
from typing import List

from llama_cpp import Llama

class MyKulModel(MLModel):

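    async def load(self) -> bool:
        # Download (or reuse the cached copy of) the quantized Qwen model once, at startup
        self.llm = Llama.from_pretrained(
            repo_id="Qwen/Qwen1.5-0.5B-Chat-GGUF",
            filename="*q8_0.gguf",
            verbose=False
        )
        # Tell MLServer that the model is ready to receive requests
        return True

    @decode_args
    async def predict(self, system: List[str], user: List[str]) -> List[str]:
        # Only the first system/user prompt pair of the request is used
        completion = self.llm.create_chat_completion(
            messages=[
                {"role": "system", "content": system[0]},
                {"role": "user", "content": user[0]}
            ]
        )
        # Return the generated text as a one-element list so it can be encoded as a string output
        return [completion['choices'][0]['message']['content']]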

In [None]:
%%writefile fourth_deployment/model-settings.json
{
    "name": "llama_qwen",
    "implementation": "qwen_model.MyKulModel"
}

In [None]:
%%writefile fourth_deployment/settings.json
{
    "http_port": 7070,
    "grpc_port": 6070
}

In [None]:
from mlserver.codecs import StringCodec

In [None]:
model_name = "llama_qwen"
system_prompt = ["You are a helpful assistant that is also an expert in data science."]
user_prompt = ["What is Analytics Vidhya and what do you know about it?"]

input_request = {
    "inputs": [
        StringCodec.encode_input(name="system", payload=system_prompt, use_bytes=False).dict(),
        StringCodec.encode_input(name="user", payload=user_prompt, use_bytes=False).dict()
    ]
}
endpoint = f"http://0.0.0.0:7070/v2/models/{model_name}/infer"
results = requests.post(endpoint, json=input_request)
results.json()
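
The `predict` method above reads only `system[0]` and `user[0]`, so any extra prompts in a request are ignored. If you wanted to answer several prompt pairs per request, a first step could look like the sketch below. `build_conversations` is a helper invented for this example and is not part of `qwen_model.py`.

```python
from typing import Dict, List


def build_conversations(system: List[str], user: List[str]) -> List[List[Dict[str, str]]]:
    """Pair each system prompt with its user prompt, one conversation per pair."""
    if len(system) != len(user):
        raise ValueError("system and user must have the same number of prompts")
    return [
        [{"role": "system", "content": s}, {"role": "user", "content": u}]
        for s, u in zip(system, user)
    ]


conversations = build_conversations(
    system=["You are a geography expert.", "You are a data science expert."],
    user=["Name one capital city.", "What is overfitting?"],
)
print(len(conversations), conversations[1][1]["content"])
```

Each inner list has the shape that `create_chat_completion` takes as `messages`, so the model could loop over the conversations and return one answer per pair.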

## 9. Packaging

In [None]:
%%writefile fourth_deployment/requirements.txt
llama-cpp-python
mlserver

In [None]:
!mlserver build fourth_deployment/ -t ramonprz/mymodel

In [None]:
!docker run -it --rm -p 7070:7070 ramonprz/mymodel
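
Before building an image, it can help to confirm that the model folder holds the files this workshop relies on. The sketch below is a small pre-flight check under that assumption. `check_model_folder` is a helper invented for this example, and the fields and files it looks for mirror this workshop's layout rather than a complete list of what MLServer accepts.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def check_model_folder(folder: Path) -> List[str]:
    """Return a list of problems found in a model folder (empty if none)."""
    problems = []
    settings_path = folder / "model-settings.json"
    if not settings_path.is_file():
        return ["model-settings.json is missing"]
    settings = json.loads(settings_path.read_text())
    for key in ("name", "implementation"):
        if key not in settings:
            problems.append(f"model-settings.json has no '{key}' field")
    if not (folder / "requirements.txt").is_file():
        problems.append("requirements.txt is missing")
    return problems


# Try the check on a throwaway folder, before and after adding requirements.txt.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "model-settings.json").write_text(
        json.dumps({"name": "llama_qwen", "implementation": "qwen_model.MyKulModel"})
    )
    before = check_model_folder(folder)
    (folder / "requirements.txt").write_text("llama-cpp-python\nmlserver\n")
    after = check_model_folder(folder)

print(before, after)
```

In the notebook, the same function could be pointed at `Path("fourth_deployment")` right before the build cell.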