# Model deployment patterns
#### *Credits: This tutorial is adapted from the course materials of "[Full Stack Deep Learning 2022](https://fullstackdeeplearning.com/course/2022/lecture-5-deployment/)" organized by UC Berkeley*.

First, let's set ML models aside and consider how a software application might work. An abstract diagram of a deployed software application can look like this:

![](./images/software-deployment.jpg)

In this diagram, the client is a device (e.g., a mobile phone or a laptop) that interacts with your application. The server is where your application runs. Upon receiving a request, the application fetches data from a database and returns a response to the client.

Now let's bring an ML model into the picture. Depending on your use case, inference can be offline or online.

## Pattern 1: Offline inference
In offline inference, user requests (i.e., the inputs) are stored, and your ML model is run over all the stored requests, typically on a schedule such as once an hour or once a day. The results are then saved to a database, and the application retrieves these results from the database.

![](./images/offline-inference.jpg)

Implementing offline inference is relatively straightforward, often requiring the execution of a script at regular intervals to call the model to make predictions. However, this approach falls short in use cases where real-time results are required.
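
As a rough illustration of the flow described above, the following sketch runs a model over all stored requests that have not been scored yet and writes the results back to a database. Everything in it is an assumption made for the example: the `requests` and `predictions` tables, their columns, and `DummyModel`, which merely stands in for a real trained model. In practice a scheduler (for example, a cron job) would call `run_batch_inference` at the chosen interval.

```python
import sqlite3


class DummyModel:
    """Hypothetical stand-in for a trained model: returns the mean of each row."""

    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


def setup_demo_db():
    """Create an in-memory database with two stored requests (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE requests (id INTEGER PRIMARY KEY, feature_a REAL, feature_b REAL)"
    )
    conn.execute(
        "CREATE TABLE predictions (request_id INTEGER PRIMARY KEY, score REAL)"
    )
    conn.executemany(
        "INSERT INTO requests (feature_a, feature_b) VALUES (?, ?)",
        [(1.0, 3.0), (2.0, 6.0)],
    )
    conn.commit()
    return conn


def run_batch_inference(conn, model):
    """Score every stored request without a prediction and save the results."""
    pending = conn.execute(
        "SELECT r.id, r.feature_a, r.feature_b FROM requests r "
        "LEFT JOIN predictions p ON p.request_id = r.id "
        "WHERE p.request_id IS NULL ORDER BY r.id"
    ).fetchall()
    if not pending:
        return 0
    scores = model.predict([(a, b) for _, a, b in pending])
    conn.executemany(
        "INSERT INTO predictions (request_id, score) VALUES (?, ?)",
        [(row[0], score) for row, score in zip(pending, scores)],
    )
    conn.commit()
    return len(pending)


def get_prediction(conn, request_id):
    """What the application does at request time: look up a precomputed result."""
    row = conn.execute(
        "SELECT score FROM predictions WHERE request_id = ?", (request_id,)
    ).fetchone()
    return None if row is None else row[0]
```

Note that the application never calls the model itself. It only reads from the `predictions` table, so a request that arrives between two scheduled runs has no result until the next run completes.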

## Pattern 2: Online inference
In contrast to offline inference, where predictions are made in batches, online inference makes a prediction in real time, as each request arrives.

#### Pattern 2a: Model in a service
A straightforward way to implement online inference is to embed the model directly in the application, so that the application loads the model and calls it in-process.

![](./images/model-in-svc.jpg)

[This notebook](./1_1_model_in_svc/model_in_svc.ipynb) provides a concrete example of embedding a model in an application.

#### Pattern 2b: Model as a service
Another way to implement online inference is to deploy the model as a separate inference service that exposes predictions through endpoints. Upon receiving a client request, the application sends a request to this inference service to obtain a prediction.

Recall that in the previous example of the model-in-service deployment pattern, our `predict` function directly calls the model to make a prediction.
```python
@app.get("/predict/{product_id}")
def predict(product_id: int):
    chemical_attrs = get_chemical_attributes(product_id)
    pred = model.predict(chemical_attrs)
    return {"predicted score": pred[0]}
```
When using the model-as-a-service deployment pattern, the model is running in a separate inference service, so our function needs to forward the chemical attributes to that inference service's endpoint. 
```python
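import requests

@app.get("/predict/{product_id}")
def predict(product_id: int):
    chemical_attrs = get_chemical_attributes(product_id)
    # Suppose the inference service's URL is http://localhost:8080/predict
    # The payload key below is illustrative and assumes chemical_attrs is
    # JSON-serializable; use whatever schema the inference service expects
    pred_response = requests.post(
        url="http://localhost:8080/predict",
        json={"chemical_attributes": chemical_attrs},
    )
    pred_response.raise_for_status()  # fail loudly if the inference service returns an error
    pred = pred_response.json()  # decode the response received from the inference service
    return {"predicted score": pred}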
```

[This notebook](./1_2_kserve.ipynb) elaborates on how to use KServe to deploy a model as a separate inference service. 
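
To make the other side of that call more concrete, here is a minimal sketch of what a standalone inference service could look like. It is not how KServe works internally; it uses only Python's standard library (`http.server` and `urllib`) instead of FastAPI and `requests`, so that it runs without extra dependencies. The `DummyModel`, the `chemical_attributes` payload key, and the `prediction` response key are all assumptions made for the example.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DummyModel:
    """Hypothetical stand-in for a trained model: returns the mean of each row."""

    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


model = DummyModel()  # loaded once when the inference service starts


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        # Read and decode the JSON payload sent by the application
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        pred = model.predict([payload["chemical_attributes"]])
        body = json.dumps({"prediction": pred[0]}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet; a real service would log requests


def start_inference_service(port=0):
    """Start the service in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), PredictHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def request_prediction(url, chemical_attrs):
    """What the application does: forward the attributes and decode the reply."""
    data = json.dumps({"chemical_attributes": chemical_attrs}).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    # Bypass any system-wide proxy settings for this local call
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(req, timeout=5) as resp:
        return json.loads(resp.read())["prediction"]
```

Because the application and the inference service only share an HTTP contract, either side can be redeployed, rewritten in another language, or scaled without touching the other, as long as the payload format stays the same.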

![](./images/model-as-svc.jpg)

#### Comparison
The model-in-a-service deployment pattern is less complex than the model-as-a-service one because it does not require developers to provision additional infrastructure to run a separate service. However, it also has several drawbacks. If the application is written in a different language from the one used to build the model, the model may not be compatible with the application. The model and the application may need to be updated at different frequencies, causing integration problems. The model and the application may also scale differently, and the server where the application runs may lack the hardware that models need, e.g., GPUs.

On the other hand, the model-as-a-service pattern adds infrastructure complexity, as you need to manage an additional inference service and the components required to run it. Luckily, tools like KServe and [Seldon Core](https://www.seldon.io/solutions/open-source-projects/core) can simplify this management process.

---
# Next step
Now that you've seen different model deployment patterns, you can proceed to [the next tutorial](./2_canary_deployment.ipynb), where you'll learn about canary deployment, a commonly used model deployment strategy.

