# KServe
Recall that KServe is responsible for serving models. To deploy a model as an inference service with KServe, we need to specify the following information:
- the name of the inference service
- the model format (TensorFlow, PyTorch, etc.)
- the location of the model artifact

After the information above is specified, KServe will perform the following tasks to deploy the model:
1. Choose a model server based on the specified model format and create a Kubernetes service that runs the chosen model server's Docker image. KServe supports multiple types of model servers, such as Triton, TFServing, TorchServe, and SKLearn MLServer. KServe also lets users build custom model servers with the KServe SDK, but in this course we will mainly use pre-built model servers.
2. Download the model artifact from the specified location.
3. Configure networking resources to expose the inference service to clients.
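Step 1 can be pictured as a lookup from model format to a serving runtime. The sketch below is a simplified illustration, not KServe's actual selection logic: KServe matches the format against the supported formats declared by the serving runtimes installed in the cluster. The runtime names in the table are assumptions based on the default runtimes shipped with KServe 0.10 and may differ in your cluster.

```python
# Simplified, hypothetical lookup table. Real KServe resolves the runtime from
# the ServingRuntime/ClusterServingRuntime resources installed in the cluster;
# these runtime names are assumptions based on KServe 0.10's defaults.
DEFAULT_RUNTIMES = {
    "sklearn": "kserve-sklearnserver",
    "xgboost": "kserve-xgbserver",
    "tensorflow": "kserve-tensorflow-serving",
    "pytorch": "kserve-torchserve",
    "onnx": "kserve-tritonserver",
}


def pick_runtime(model_format: str) -> str:
    """Return the runtime name for a model format (case-insensitive)."""
    key = model_format.strip().lower()
    if key not in DEFAULT_RUNTIMES:
        raise ValueError(f"no runtime registered for model format {model_format!r}")
    return DEFAULT_RUNTIMES[key]


print(pick_runtime("sklearn"))  # kserve-sklearnserver
```

The point of the lookup is that you only declare *what* the model is; choosing and starting the server image is KServe's job.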

*More reading material: [KServe docs](https://kserve.github.io/website/0.10/).*

## KServe example
This example shows how to deploy the red wine model you trained in the first week's MLflow tutorial to KServe. 

Before starting, we need to find where the model is stored. Recall that the model was uploaded to MLflow.

As in the model-in-service deployment pattern example, navigate to the MLflow service at [http://mlflow-server.local](http://mlflow-server.local) and open the "mlflow-minio-test" experiment. Click the MLflow run (in the "Start Time" column) that produced an ElasticNet model for wine quality prediction, and note down the full S3 URI of the model.
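Before pasting the URI anywhere, you can sanity-check its shape. This is a minimal sketch using only the standard library: it splits an `s3://bucket/key` URI into bucket and object key, but it does not contact MinIO, so it cannot tell whether the artifact actually exists. The example URI is the one used in Approach 2 below; yours will differ.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    # Require the s3 scheme, a bucket name, and a non-empty object key.
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not a valid S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://mlflow/12/9d75a172ed7543cd9619cb6ab5589258/artifacts/model")
print(bucket)  # mlflow
print(key)     # 12/9d75a172ed7543cd9619cb6ab5589258/artifacts/model
```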

### Deploy a model to KServe
There are two ways to deploy a model to KServe: applying a YAML file with kubectl, or using the KServe Python SDK.

#### Approach 1: Applying a YAML file using kubectl

First, open [manifests/redwine-model.yaml](./manifests/redwine-model.yaml) and change `storageUri` to the S3 URI of the model artifact that you noted from the MLflow run, e.g.,
```yaml
spec:
  predictor:
    serviceAccountName: kserve-sa 
    model:
      modelFormat: 
        name: sklearn
      storageUri: <the-s3-uri-of-your-model-artifact>
```
**Note**: Change the `storageUri` in the YAML file, not in this cell.
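Editing the file by hand is all that's needed. If you prefer to do it programmatically, here is a minimal sketch that rewrites the value of the `storageUri:` line with a regular expression while keeping its indentation. It assumes the URI sits on the same line as the key, as in the snippet above; `set_storage_uri` is a hypothetical helper, not part of KServe.

```python
import re

# A trimmed copy of the manifest's predictor section, used as example input.
EXAMPLE_MANIFEST = """\
spec:
  predictor:
    serviceAccountName: kserve-sa
    model:
      modelFormat:
        name: sklearn
      storageUri: <the-s3-uri-of-your-model-artifact>
"""


def set_storage_uri(manifest: str, uri: str) -> str:
    """Replace the value of every `storageUri:` line, keeping its indentation."""
    new_text, count = re.subn(
        r"(?m)^([ \t]*storageUri:).*$",
        lambda m: f"{m.group(1)} {uri}",  # a function avoids escaping issues in the URI
        manifest,
    )
    if count == 0:
        raise ValueError("no storageUri field found in the manifest")
    return new_text


print(set_storage_uri(EXAMPLE_MANIFEST, "s3://mlflow/12/9d75a172ed7543cd9619cb6ab5589258/artifacts/model"))
```

To update the real file, you could read `manifests/redwine-model.yaml` with `pathlib.Path.read_text()`, pass the text through `set_storage_uri`, and write it back with `write_text()`. A text substitution keeps any comments and formatting in the file, which a load-and-dump round trip through PyYAML would drop.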

Let's take a closer look at the full YAML file:
```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "redwine-week4"
  namespace: kserve-inference
spec:
  predictor:
    serviceAccountName: kserve-sa 
    model:
      modelFormat: 
        name: sklearn
      storageUri: <the-s3-uri-of-your-model-artifact>
```
At a high level, this YAML file defines an InferenceService object in KServe. An InferenceService is a Kubernetes resource that represents a deployed model.

Now let's break the file down:
* `apiVersion: "serving.kserve.io/v1beta1"`: This field specifies the KServe API version used to define the InferenceService. It tells Kubernetes how to interpret the content of the YAML file.
* `kind: "InferenceService"`: This field specifies the type of object being deployed to K8s. In this case, we want to deploy an InferenceService object.
* `metadata`: This section contains metadata about the InferenceService.
  * `name: "redwine-week4"`: This is the user-defined name that identifies the object.
  * `namespace: kserve-inference`: This is the namespace where the InferenceService object is deployed. Namespaces are used to isolate resources and avoid naming conflicts.
* `spec`: This section defines the specification of the InferenceService, which includes the model configuration and serving details.
  * `predictor`: This section configures the predictor, i.e., the component that loads the model and serves predictions.
    * `serviceAccountName: kserve-sa`: This is the name of the Kubernetes service account used by the InferenceService. Service accounts provide the permissions and access control an object needs to interact with other resources in the cluster. In our setup, the service account is associated with the MinIO credentials (username and password) that KServe uses to download model artifacts from the MinIO storage service. The service account was configured during the course environment setup.
    * `model`: This section describes the model to serve.
      * `modelFormat.name: sklearn`: This field tells KServe that the model is a scikit-learn model, so that KServe can prepare the appropriate runtime environment for it.
      * `storageUri: ...`: This is the storage URI of the model artifact. It tells KServe where to download the model artifact from.
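The nesting described above can also be written as a plain Python dictionary, which makes the parent/child relationships explicit. The sketch below rebuilds the same manifest in that form; `build_isvc` is a hypothetical helper, and the default values mirror the YAML file. Since `kubectl apply -f` also accepts JSON, saving the printed output to a `.json` file and applying it should be equivalent, although this notebook uses the YAML file.

```python
import json


def build_isvc(name: str, namespace: str, storage_uri: str,
               model_format: str = "sklearn",
               service_account: str = "kserve-sa") -> dict:
    """Build an InferenceService manifest as a dict mirroring the YAML file."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "predictor": {
                "serviceAccountName": service_account,
                "model": {
                    "modelFormat": {"name": model_format},
                    "storageUri": storage_uri,
                },
            }
        },
    }


manifest = build_isvc(
    "redwine-week4",
    "kserve-inference",
    "s3://mlflow/12/9d75a172ed7543cd9619cb6ab5589258/artifacts/model",
)
print(json.dumps(manifest, indent=2))
```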

Deploy the model

In [None]:
!kubectl apply -f manifests/redwine-model.yaml

Expected output: 

```text
inferenceservice.serving.kserve.io/redwine-week4 created
```


Check if the inference service was deployed correctly. 

In [None]:
!kubectl get isvc redwine-week4 -n kserve-inference

Expected output:

```text
NAME            URL                                                 READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                     AGE
redwine-week4   http://redwine-week4.kserve-inference.example.com   True           100                              redwine-week4-predictor-default-00001   115s
```
**Note**: 
- The value in the "AGE" field may vary depending on how long the inference service has been running. 
- It may take a few minutes for the inference service to become ready. If READY is `Unknown` (i.e., not `True`), wait a while and rerun the command above. Alternatively, use the `-w` option to continuously watch the status of the inference service (`kubectl get isvc redwine-week4 -n kserve-inference -w`) and interrupt the cell once the inference service is ready.
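Instead of rerunning the cell by hand, you can poll the READY status in a loop. This is a minimal sketch: `kubectl_ready_status` reads the `Ready` condition with kubectl's JSONPath output, and `wait_until_ready` takes any status-reading function, so the polling logic can be tried without a cluster. Both function names, the timeout, and the polling interval are illustrative choices.

```python
import subprocess
import time
from typing import Callable


def kubectl_ready_status(name: str, namespace: str) -> str:
    """Return the Ready condition status ("True", "False", "Unknown") of an InferenceService."""
    jsonpath = '{.status.conditions[?(@.type=="Ready")].status}'
    out = subprocess.run(
        ["kubectl", "get", "isvc", name, "-n", namespace, "-o", f"jsonpath={jsonpath}"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()  # empty string if the condition is not reported yet


def wait_until_ready(get_status: Callable[[], str],
                     timeout_s: float = 300, interval_s: float = 5) -> bool:
    """Poll get_status() until it returns "True"; give up after timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == "True":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage in the notebook:
# ready = wait_until_ready(lambda: kubectl_ready_status("redwine-week4", "kserve-inference"))
```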

You can also check that a pod is serving the red wine model:

In [None]:
# Use "-l" to select pods with a specific label. Here we select the pods running for the "redwine-week4" inference service.
# The label is added automatically by KServe.
!kubectl get pods -n kserve-inference -l serving.kserve.io/inferenceservice=redwine-week4

You should see there is a pod whose name is "redwine-week4-predictor-..." running, e.g., 
```text
NAME                                                              READY   STATUS    RESTARTS   AGE
redwine-week4-predictor-default-00001-deployment-8cff9c9f-6hs8f   2/2     Running   0          8m15s
```
As with the inference service, the AGE of the pod may vary depending on how long the pod has been running. The suffix (8cff9c9f-6hs8f in the example) also varies because it is a random string assigned by K8s.

As the output shows, the redwine-week4-predictor pod has two containers (READY is 2/2): 1) an application container that provides the runtime for serving the model (i.e., the container where model inference happens) and 2) a proxy container that forwards user traffic to the application container. The proxy container also collects statistics, such as the number of requests the model receives, which can be used for monitoring. Recall that in the `spec` field of manifests/redwine-model.yaml, we specified that the model is an sklearn model:
```yaml
spec:
  predictor:
    model:
      modelFormat: 
        name: sklearn
```
Based on this model format, KServe chooses the appropriate runtime to serve the model.

### Test the inference service by sending requests
Now it's time to send prediction requests to your inference service running in KServe.

KServe uses an ingress gateway to route incoming requests to the appropriate inference services. In our setup, the ingress gateway listens on http://kserve-gateway.local:30200. (If you open this URL in a browser, you will see a "page can't be found" error; this is expected.)

The runtime that KServe uses to serve our sklearn model follows the v1 inference protocol. An inference protocol can be seen as a set of specifications that define, for example, how a model is exposed behind an endpoint and what format requests must follow. Under the v1 protocol, the model's predictions are exposed at an endpoint with the following format:
```
http://{gateway_host}:{gateway_port}/v1/models/{model_name}:predict
```
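The URL is built from the gateway address and the model name, while the `Host` header tells the ingress gateway which inference service should receive the request. A minimal sketch of both pieces, assuming (as in the request cells later in this notebook) that the model name in the path equals the inference service name and that the cluster uses the `example.com` domain seen in the `kubectl get isvc` output:

```python
def predict_url(gateway_host: str, gateway_port: int, model_name: str) -> str:
    """Build a v1 prediction URL."""
    return f"http://{gateway_host}:{gateway_port}/v1/models/{model_name}:predict"


def host_header(isvc_name: str, namespace: str, domain: str = "example.com") -> dict:
    """Build the Host header the ingress gateway uses for routing."""
    return {"Host": f"{isvc_name}.{namespace}.{domain}"}


print(predict_url("kserve-gateway.local", 30200, "redwine-week4"))
# http://kserve-gateway.local:30200/v1/models/redwine-week4:predict
print(host_header("redwine-week4", "kserve-inference"))
# {'Host': 'redwine-week4.kserve-inference.example.com'}
```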

Requests must follow the format:
```
{
  "instances": <value>|<(nested)list>|<list-of-objects>
}
```
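For our model, the nested-list form is the natural fit: each inner list is one wine sample. A minimal sketch that wraps rows in the `instances` envelope and serializes them to JSON; the check that every row has 11 features is an assumption specific to the red wine model (it must match the columns the model was trained on) and is not part of the protocol.

```python
import json
from typing import List

N_FEATURES = 11  # red wine model: 11 physicochemical features per sample (assumption)


def v1_request_body(rows: List[List[float]]) -> str:
    """Wrap feature rows in the v1 `instances` envelope and serialize to JSON."""
    for i, row in enumerate(rows):
        if len(row) != N_FEATURES:
            raise ValueError(f"row {i} has {len(row)} features, expected {N_FEATURES}")
    return json.dumps({"instances": rows})


body = v1_request_body([
    [7.8, 0.58, 0.02, 2, 0.073, 9, 18, 0.9968, 3.36, 0.57, 9.5],
    [8.9, 0.22, 0.48, 1.8, 0.077, 29, 60, 0.9968, 3.39, 0.53, 9.4],
])
print(json.loads(body)["instances"][1][0])  # 8.9
```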
Besides the v1 inference protocol, some of KServe's runtimes use the v2 inference protocol. You'll see how the v2 inference protocol works in this week's assignment; you don't need the details of either protocol to complete it. If you're interested, more details can be found at the links below.
- [v1 inference protocol](https://kserve.github.io/website/0.10/modelserving/data_plane/v1_protocol/)
- [v2 inference protocol](https://docs.seldon.io/projects/seldon-core/en/latest/reference/apis/v2-protocol.html)
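For contrast, the sketch below lays out the same two samples as a v2 request, where inputs are named tensors with an explicit shape and datatype and the endpoint path is `/v2/models/{model_name}/infer`. It only builds the request; the tensor name `input-0` is an arbitrary choice, and whether a particular runtime accepts exactly this payload should be checked against the v2 documentation linked above. The request cells in this notebook use v1.

```python
rows = [
    [7.8, 0.58, 0.02, 2, 0.073, 9, 18, 0.9968, 3.36, 0.57, 9.5],
    [8.9, 0.22, 0.48, 1.8, 0.077, 29, 60, 0.9968, 3.39, 0.53, 9.4],
]

v2_request = {
    "inputs": [
        {
            "name": "input-0",                    # tensor name: arbitrary label (assumption)
            "shape": [len(rows), len(rows[0])],   # [batch size, number of features]
            "datatype": "FP64",                   # 64-bit floating point
            "data": [float(x) for row in rows for x in row],  # flattened in row-major order
        }
    ]
}

model_name = "redwine-week4"
v2_url = f"http://kserve-gateway.local:30200/v2/models/{model_name}/infer"
print(v2_url)
print(v2_request["inputs"][0]["shape"])  # [2, 11]
```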

Let's send a request.

In [None]:
import requests

input_sample = [
        [7.8, 0.58, 0.02, 2, 0.073, 9, 18, 0.9968, 3.36, 0.57, 9.5],
        [8.9, 0.22, 0.48, 1.8, 0.077, 29, 60, 0.9968, 3.39, 0.53, 9.4]
    ]
model_name = "redwine-week4"
req_data = {
    "instances": input_sample
}
headers = {}

# Define Host in the request so that the ingress gateway knows how to forward the request 
# to the correct inference service
headers["Host"] = f"{model_name}.kserve-inference.example.com"
url = f"http://kserve-gateway.local:30200/v1/models/{model_name}:predict"
result = requests.post(url, json=req_data, headers=headers)
print(result.json())

Expected output:
```text
{'predictions': [5.657319539336507, 5.529618438168187]}
```
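In practice you may want to validate the response before using it. A minimal sketch: in the notebook you would first call `result.raise_for_status()` (which raises for HTTP 4xx/5xx responses) and then pass `result.json()` and `len(input_sample)` to the hypothetical helper below. The one-prediction-per-instance check is an assumption that holds for this regression model, which returns one predicted quality score per wine sample.

```python
from typing import Any, Dict, List


def parse_predictions(response_json: Dict[str, Any], n_instances: int) -> List[float]:
    """Extract v1 predictions and check that there is one per input instance."""
    if "predictions" not in response_json:
        raise KeyError(f"no 'predictions' key in response: {response_json}")
    preds = response_json["predictions"]
    if len(preds) != n_instances:
        raise ValueError(f"expected {n_instances} predictions, got {len(preds)}")
    return preds


# The expected output shown above, used here as sample input.
sample_response = {'predictions': [5.657319539336507, 5.529618438168187]}
print([round(p, 2) for p in parse_predictions(sample_response, n_instances=2)])  # [5.66, 5.53]
```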

In [None]:
# delete the inference service
!kubectl delete isvc redwine-week4 -n kserve-inference

Expected output:

```text
inferenceservice.serving.kserve.io "redwine-week4" deleted
```

#### Approach 2: Using the KServe Python SDK
KServe provides a Python client SDK. 

*More information on how to use the SDK can be found [here](https://kserve.github.io/website/0.10/sdk_docs/sdk_doc/#documentation-for-client-api).*

Remember to replace `model_uri` with your own model artifact's S3 URI before running the next code cell.

The code snippet below can be matched to the YAML file used in Approach 1, as shown in the comments. 

In [None]:
from kubernetes import client
from kserve import KServeClient
from kserve import constants
from kserve import V1beta1InferenceService
from kserve import V1beta1InferenceServiceSpec
from kserve import V1beta1PredictorSpec
from kserve import V1beta1SKLearnSpec
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def main():
    model_name = "redwine-week4-2"
    
    # replace this with your own red wine model S3 URI
    model_uri = "s3://mlflow/12/9d75a172ed7543cd9619cb6ab5589258/artifacts/model"
    
    namespace = "kserve-inference"
    kserve_version = "v1beta1"
    api_version = constants.KSERVE_GROUP + "/" + kserve_version

    isvc = V1beta1InferenceService(
        # apiVersion: "serving.kserve.io/v1beta1"
        # kind: "InferenceService"
        api_version=api_version,
        kind=constants.KSERVE_KIND,

        # metadata:
        #   name: "redwine-week4-2"
        #   namespace: kserve-inference
        metadata=client.V1ObjectMeta(
            name=model_name,
            namespace=namespace,
        ),

        # spec:
        spec=V1beta1InferenceServiceSpec(
            # predictor
            predictor=V1beta1PredictorSpec(
                # serviceAccountName
                service_account_name="kserve-sa",
                # model format: sklearn (sets the framework-specific predictor.sklearn field)
                sklearn=V1beta1SKLearnSpec(
                    # storageUri
                    storage_uri=model_uri
                )
            )
        )
    )
    kserve = KServeClient()

    # When a YAML file is applied with kubectl (Approach 1), a new InferenceService is created if it doesn't exist yet;
    # otherwise, the existing InferenceService is patched (modified).
    # With the KServe SDK, different methods must be called depending on whether the InferenceService already exists.
    try:
        kserve.create(inferenceservice=isvc)
    except RuntimeError:
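        # create() raises RuntimeError when the API call fails; here we assume the failure
        # means an inference service with the same name already exists, so we patch it instead.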
        kserve.patch(name=model_name, inferenceservice=isvc, namespace=namespace)
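The keyword arguments in the SDK code above are snake_case versions of the camelCase keys in the YAML file: `service_account_name` corresponds to `serviceAccountName`, and `storage_uri` to `storageUri`. The sketch below spells out that naming convention; it illustrates the pattern and is not the SDK's own conversion code. One difference the comments gloss over: `sklearn=V1beta1SKLearnSpec(...)` sets the framework-specific `predictor.sklearn` field rather than the `model.modelFormat` form used in the YAML file, although both describe an sklearn model to KServe.

```python
def snake_to_camel(name: str) -> str:
    """Convert a snake_case identifier to lowerCamelCase."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


for arg in ["api_version", "service_account_name", "storage_uri"]:
    print(f"{arg} -> {snake_to_camel(arg)}")
# api_version -> apiVersion
# service_account_name -> serviceAccountName
# storage_uri -> storageUri
```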


In [None]:
logger.info('Start deploying an inference service.')
main()
logger.info('The inference service has been deployed.')

You can check the inference service by running the `kubectl get isvc` command. 

In [None]:
!kubectl get isvc redwine-week4-2 -n kserve-inference

Expected output:

```text
NAME              URL                                                   READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                       AGE
redwine-week4-2   http://redwine-week4-2.kserve-inference.example.com   True           100                              redwine-week4-2-predictor-default-00001   2m1s
```

Let's test our "redwine-week4-2" inference service by sending a request to it. 

In [None]:
import requests

input_sample = [
        [7.8, 0.58, 0.02, 2, 0.073, 9, 18, 0.9968, 3.36, 0.57, 9.5],
        [8.9, 0.22, 0.48, 1.8, 0.077, 29, 60, 0.9968, 3.39, 0.53, 9.4]
    ]
model_name = "redwine-week4-2"
req_data = {
    "instances": input_sample
}
headers = {}

# Define Host in the request so that the ingress gateway knows how to forward the request 
# to the correct inference service
headers["Host"] = f"{model_name}.kserve-inference.example.com"
url = f"http://kserve-gateway.local:30200/v1/models/{model_name}:predict"
result = requests.post(url, json=req_data, headers=headers)
print(result.json())

Expected output:  
```text
{'predictions': [5.657319539336507, 5.529618438168187]}
```

In [None]:
# Delete the "redwine-week4-2" inference service
!kubectl delete isvc redwine-week4-2 -n kserve-inference

Expected output:
```text
inferenceservice.serving.kserve.io "redwine-week4-2" deleted
```

---
You've learned how to deploy a model as an independent inference service. To wrap up the section on model deployment patterns, you can return to [1_deployment_patterns.ipynb](./1_deployment_patterns.ipynb) to explore the pros and cons of both the "model-in-service" and "model-as-service" deployment patterns.