
## 7.3 Deploying Your Prediction Service

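- `bentoml models list`: list the models in the local model store
- `bentoml models get <tag>`: show the details of a saved model
- `vim bentofile.yaml`: create the build configuration for the bento
- `bentoml build`: build the bento
- `bentoml containerize <tag>`: build a Docker image from the bento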

We can inspect the built bento by changing into its directory, for example `cd ~/bentoml/bentos/credit_risk_classifier/kdelkqsqms4i2b6d/`, and looking at the file structure.


## 7.4 Sending, Receiving and Validating Data


What if a client sends data with different or unknown fields? The service will not fail on its own, but we *want* it to fail in such cases. The **pydantic** library exists for this purpose.
- `pip install pydantic`
- `from pydantic import BaseModel`




In [None]:
# Create pydantic base class to create data schema for validation
class CreditApplication(BaseModel):
    seniority: int
    home: str
    time: int
    age: int
    marital: str
    records: str
    job: str
    expenses: int
    income: float
    assets: float
    debt: float
    amount: int
    price: int

The `BaseModel` subclass ensures that we always receive these 13 features for the model prediction.
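
To see what this validation does in isolation, here is a minimal sketch that runs outside BentoML. The helper `is_valid` and the two bad payloads are hypothetical and exist only for illustration; the schema repeats the class above so the snippet is self-contained.

```python
from pydantic import BaseModel, ValidationError


class CreditApplication(BaseModel):
    seniority: int
    home: str
    time: int
    age: int
    marital: str
    records: str
    job: str
    expenses: int
    income: float
    assets: float
    debt: float
    amount: int
    price: int


# A complete, well-typed payload
valid_payload = {
    "seniority": 3, "home": "owner", "time": 36, "age": 26,
    "marital": "single", "records": "no", "job": "freelance",
    "expenses": 35, "income": 0.0, "assets": 60000.0,
    "debt": 3000.0, "amount": 800, "price": 1000,
}


def is_valid(payload: dict) -> bool:
    """Hypothetical helper: True if the payload passes schema validation."""
    try:
        CreditApplication(**payload)
        return True
    except ValidationError:
        return False


# Illustrative bad payloads: one drops a field, one has a wrong type
missing_field = {k: v for k, v in valid_payload.items() if k != "price"}
wrong_type = {**valid_payload, "age": "twenty-six"}

print(is_valid(valid_payload))
print(is_valid(missing_field))
print(is_valid(wrong_type))
```

Note that pydantic's default behaviour is to ignore extra keys rather than reject them, so a missing or wrongly typed field fails validation while an additional unknown field does not, unless the model is configured to forbid extras.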

In [None]:
# Pass the pydantic class to the JSON input descriptor
@svc.api(input=JSON(pydantic_model=CreditApplication), output=JSON())  # JSON input and JSON output
def classify(credit_application):
    # Convert the validated pydantic object to a dict of key-value pairs
    application = credit_application.dict()
    # Transform the client data with the DictVectorizer
    vector = dv.transform(application)
    # Predict with 'runner.predict.run(input)' instead of 'model.predict'
    prediction = model_runner.predict.run(vector)

Besides `JSON()`, BentoML provides other IO descriptors for the input and output specification of a service API, for example `NumpyNdarray()`, `PandasDataFrame()`, `Text()`, and many more.

## 7.5 High-Performance Serving

**Locust** - library for load testing
- `pip install locust`


In [None]:
# locustfile.py
from locust import HttpUser, between, task


# Sample data to send
sample = {"seniority": 3,
 "home": "owner",
 "time": 36,
 "age": 26,
 "marital": "single",
 "records": "no",
 "job": "freelance",
 "expenses": 35,
 "income": 0.0,
 "assets": 60000.0,
 "debt": 3000.0,
 "amount": 800,
 "price": 1000
 }

# Define a simulated user by inheriting from locust's HttpUser
class CreditRiskTestUser(HttpUser):
    """
    Usage:
        Start the locust load testing client with:
            locust -H http://localhost:3000
        If all requests fail, pass the locustfile explicitly:
            locust -H http://localhost:3000 -f locustfile.py

        Open a browser at http://0.0.0.0:8089, set the desired number of users and
        the spawn rate for the load test in the Web UI, and start swarming.
    """

    # Create a method with the task decorator to send the request
    @task
    def classify(self):
        # POST the sample as JSON to the 'classify' endpoint
        self.client.post("/classify", json=sample)

    # Wait a random time between 0.01 and 2 seconds between tasks
    wait_time = between(0.01, 2)

### 1. **async** optimization
Make the endpoint asynchronous so that requests are processed concurrently instead of one at a time, which lets the model runner work on several predictions at once.

In [None]:
# Define an endpoint on the BentoML service
# and pass the pydantic class to the JSON input descriptor
@svc.api(input=JSON(pydantic_model=CreditApplication), output=JSON())  # JSON input and JSON output
async def classify(credit_application):  # parallelized requests at endpoint level (async)
    # Convert the validated pydantic object to a dict of key-value pairs
    application = credit_application.dict()
    # Transform the client data with the DictVectorizer
    vector = dv.transform(application)
    # Predict with 'runner.predict.async_run(input)' instead of 'model.predict'
    prediction = await model_runner.predict.async_run(vector)  # bentoml inference level parallelization (async_run)
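
The effect of `async`/`await` can be sketched with the standard library alone. This is not BentoML code: `fake_predict`, `handle_request` and `serve_all` are hypothetical names, and `asyncio.sleep` merely stands in for waiting on the model runner.

```python
import asyncio
import time


async def fake_predict(vector):
    """Hypothetical stand-in for an awaitable inference call."""
    await asyncio.sleep(0.1)  # simulate waiting on the model runner
    return sum(vector)


async def handle_request(vector):
    # While this coroutine awaits, the event loop can serve other requests
    return await fake_predict(vector)


async def serve_all(vectors):
    # Run all handlers concurrently; results come back in input order
    return await asyncio.gather(*(handle_request(v) for v in vectors))


request_vectors = [[1, 2], [3, 4], [5, 6]]
start = time.perf_counter()
results = asyncio.run(serve_all(request_vectors))
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s")
```

Because the three simulated requests wait at the same time, the total elapsed time should be close to a single 0.1 s wait rather than the sum of all three.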

### 2. **micro-batching** optimization
Combine the data arriving from multiple users into one array, which is then split into smaller batches when the model's prediction is called.

In [None]:
# Save the model with batchable settings for production efficiency
bentoml.xgboost.save_model(
    'credit_risk_model',
    model,
    custom_objects={'DictVectorizer': dv},
    signatures={  # model signatures for runner inference
        'predict': {
            'batchable': True,
            'batch_dim': 0,  # '0' means bentoml will concatenate request arrays along the first dimension
        }
    },
)

- `bentoml serve --production`: serve the batchable model; the `--production` flag enables more than one process for the web workers

`bentoconfiguration.yaml`:
```
# Config file controls the attributes of the runner
runners:
  batching:
    enabled: true
    max_batch_size: 100
    max_latency_ms: 500
```
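
The concatenate, predict, and split flow of micro-batching can be sketched in plain Python. This is only an illustration of the idea, not BentoML's actual dispatcher: `predict_batch` is a hypothetical stand-in model, `micro_batch_predict` is a made-up helper, and plain lists are used in place of arrays.

```python
from typing import Callable, List

Row = List[float]


def predict_batch(rows: List[Row]) -> List[float]:
    """Hypothetical stand-in model: one score per input row."""
    return [sum(row) for row in rows]


def micro_batch_predict(
    requests: List[List[Row]],
    predict: Callable[[List[Row]], List[float]],
    max_batch_size: int,
) -> List[List[float]]:
    """Merge request arrays along dim 0, predict in chunks, split back."""
    # 1. Concatenate all request arrays along the first dimension
    merged = [row for req in requests for row in req]

    # 2. Call the model on chunks of at most max_batch_size rows
    outputs: List[float] = []
    for start in range(0, len(merged), max_batch_size):
        outputs.extend(predict(merged[start:start + max_batch_size]))

    # 3. Slice the outputs so each request gets back its own rows
    results, offset = [], 0
    for req in requests:
        results.append(outputs[offset:offset + len(req)])
        offset += len(req)
    return results


# Three clients: two send a single row, one sends two rows
incoming = [[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]], [[7.0, 8.0]]]
print(micro_batch_predict(incoming, predict_batch, max_batch_size=3))
```

Here four rows from three clients are served with two model calls instead of three, and each client still receives only the predictions for its own rows. A real dispatcher would additionally bound how long it waits to fill a batch, which is the role `max_latency_ms` appears to play in the configuration above.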
