**Note**: Several YAML files are used to run the examples in this tutorial. As before, if you want to run the examples yourself, you need to change the "storageUri" in these YAML files to the S3 URI of one of your own red wine models saved in MLflow.

# Horizontal scaling
The capacity of a single server to handle computational tasks is limited. As user traffic surges, it becomes necessary to distribute this traffic among multiple replicas that run your model on different servers. 

**Horizontal scaling** means adding more servers to ensure that your inference service can respond to increasing user requests.

In our K8s setup, horizontal scaling in practice means increasing the number of pods running for an inference service. In [manifests/redwine-model-scale.yaml](./manifests/redwine-model-scale.yaml), by adding the `minReplicas` field, we specify the minimum number of pods running for the "redwine-week4" inference service.

In [1]:
# Create the "redwine-week4" inference service served by 3 pods.
!kubectl apply -f manifests/redwine-model-scale.yaml

inferenceservice.serving.kserve.io/redwine-week4 created


Expected output:
```text
inferenceservice.serving.kserve.io/redwine-week4 created
```

In [2]:
# Check the "redwine-week4" inference service to make sure it's ready
!kubectl get isvc redwine-week4 -n kserve-inference

NAME            URL                                                 READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                     AGE
redwine-week4   http://redwine-week4.kserve-inference.example.com   True           100                              redwine-week4-predictor-default-00001   94s


Expected output:
```text
NAME            URL                                                 READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                     AGE
redwine-week4   http://redwine-week4.kserve-inference.example.com   True           100                              redwine-week4-predictor-default-00001   28s
```

In [3]:
# Check the pods running for the "redwine-week4" inference service
!kubectl -n kserve-inference get pods -l serving.kserve.io/inferenceservice=redwine-week4 -o wide

NAME                                                              READY   STATUS    RESTARTS   AGE    IP            NODE              NOMINATED NODE   READINESS GATES
redwine-week4-predictor-default-00001-deployment-747fbd846l26kx   2/2     Running   0          119s   10.244.2.29   kind-ep-worker    <none>           <none>
redwine-week4-predictor-default-00001-deployment-747fbd846l2wf7   2/2     Running   0          118s   10.244.1.26   kind-ep-worker2   <none>           <none>
redwine-week4-predictor-default-00001-deployment-747fbd846qh4cl   2/2     Running   0          118s   10.244.2.30   kind-ep-worker    <none>           <none>


Expected output:
```text
NAME                                                              READY   STATUS    RESTARTS   AGE     IP             NODE              NOMINATED NODE   READINESS GATES
redwine-week4-predictor-default-00002-deployment-5744cfb98j88x2   2/2     Running   0          4m27s   10.244.2.106   kind-ep-worker2   <none>           <none>
redwine-week4-predictor-default-00002-deployment-5744cfb98qrzqq   2/2     Running   0          4m27s   10.244.2.107   kind-ep-worker2   <none>           <none>
redwine-week4-predictor-default-00002-deployment-5744cfb98z5lvb   2/2     Running   0          4m27s   10.244.1.120   kind-ep-worker    <none>           <none>
```
The "NODE" column shows that these pods are spread across different cluster nodes. The IPs in the "IP" column can vary.
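
If you want to tally the pods per node instead of reading the table by eye, the following is a minimal sketch. It assumes the default column layout of `kubectl get pods -o wide` (NODE as the seventh whitespace-separated field) and uses the rows from the expected output above as sample data. The helper name `pods_per_node` is hypothetical.

```python
from collections import Counter

# Data rows copied from the expected output above (header line omitted).
sample = """\
redwine-week4-predictor-default-00002-deployment-5744cfb98j88x2   2/2     Running   0          4m27s   10.244.2.106   kind-ep-worker2   <none>           <none>
redwine-week4-predictor-default-00002-deployment-5744cfb98qrzqq   2/2     Running   0          4m27s   10.244.2.107   kind-ep-worker2   <none>           <none>
redwine-week4-predictor-default-00002-deployment-5744cfb98z5lvb   2/2     Running   0          4m27s   10.244.1.120   kind-ep-worker    <none>           <none>
"""


def pods_per_node(rows: str) -> Counter:
    """Count pods per node from `kubectl get pods -o wide` data rows."""
    counts = Counter()
    for line in rows.splitlines():
        fields = line.split()
        if len(fields) >= 7:
            # Assumption: NODE is the seventh column in the default layout.
            counts[fields[6]] += 1
    return counts


for node, count in sorted(pods_per_node(sample).items()):
    print(node, count)
```

With the sample rows, two pods land on `kind-ep-worker2` and one on `kind-ep-worker`.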

In [4]:
# Clean up
!kubectl -n kserve-inference delete isvc redwine-week4

inferenceservice.serving.kserve.io "redwine-week4" deleted


Expected output:
```text
inferenceservice.serving.kserve.io "redwine-week4" deleted
```

### Horizontal autoscaling
User traffic fluctuates in many use cases. Manually adding servers when traffic spikes and removing them when traffic drops is inefficient. This is where horizontal autoscaling comes in: new servers are launched automatically when user traffic increases, and excess servers are terminated when it decreases.

Autoscaling can bring the following benefits:
- It allows your inference service to automatically respond to changes in demand, minimizing human intervention.
- Servers are allocated based on user demand, which reduces the risk of overprovisioning (i.e., launching more servers than actually needed) and saves costs during periods of low demand.

#### Horizontal autoscaling in KServe
KServe offers horizontal autoscaling by default. Let's look at an example in [manifests/redwine-model-autoscale.yaml](./manifests/redwine-model-autoscale.yaml). The content looks familiar, except for the following fields:
```yaml
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 8
    scaleTarget: 1
    scaleMetric: concurrency
```
- `minReplicas`: Minimum number of pods running for an inference service.
- `maxReplicas`: Maximum number of pods for autoscaling.
- `scaleMetric`: The scaling metric. It's used to decide when to scale pods. In this example, the metric is concurrency, which refers to the number of in-flight requests per pod at any given time.
- `scaleTarget`: An integer that specifies the target value of `scaleMetric` that autoscaling tries to maintain. Note that this target may not always be achieved because the maximum number of pods and the computational resources in the cluster are limited.
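
To build intuition for how these four fields interact, here is a deliberately simplified sketch in Python. It is not KServe's actual algorithm: the real autoscaler measures load over time and applies additional heuristics, so observed pod counts can differ. The function name `desired_replicas` and its parameters are hypothetical and only mirror the fields above.

```python
import math


def desired_replicas(in_flight: int, scale_target: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Simplified estimate of the pod count an autoscaler aims for."""
    if scale_target <= 0:
        raise ValueError("scale_target must be positive")
    # Enough pods so that each one handles at most `scale_target` requests.
    wanted = math.ceil(in_flight / scale_target)
    # Never go below minReplicas or above maxReplicas.
    return max(min_replicas, min(wanted, max_replicas))


# Values from the manifest above: minReplicas=1, maxReplicas=8, scaleTarget=1.
for load in (0, 5, 20):
    print(load, desired_replicas(load, scale_target=1,
                                 min_replicas=1, max_replicas=8))
```

Under this model, no traffic keeps one pod (the `minReplicas` floor), 5 in-flight requests call for 5 pods, and 20 in-flight requests are capped at 8 pods by `maxReplicas`.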

In [5]:
# Create the "redwine-week4" inference service with horizontal autoscaling enabled
!kubectl apply -f manifests/redwine-model-autoscale.yaml

inferenceservice.serving.kserve.io/redwine-week4 created


Expected output:
```text
inferenceservice.serving.kserve.io/redwine-week4 created
```

In [7]:
# Make sure the "redwine-week4" inference service is ready
!kubectl get isvc redwine-week4 -n kserve-inference

NAME            URL                                                 READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                     AGE
redwine-week4   http://redwine-week4.kserve-inference.example.com   True           100                              redwine-week4-predictor-default-00001   27s


Expected output:
```text
NAME            URL                                                 READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                     AGE
redwine-week4   http://redwine-week4.kserve-inference.example.com   True           100                              redwine-week4-predictor-default-00001   36s
```

In [8]:
# Check the pods running for the "redwine-week4" inference service
!kubectl -n kserve-inference get pods -l serving.kserve.io/inferenceservice=redwine-week4

NAME                                                              READY   STATUS    RESTARTS   AGE
redwine-week4-predictor-default-00001-deployment-7d767d9dcr9mvd   2/2     Running   0          36s


Expected output:
```text
NAME                                                              READY   STATUS    RESTARTS   AGE
redwine-week4-predictor-default-00001-deployment-5ddb7c85b9sdcf   2/2     Running   0          68s
```
Only one pod is running for the "redwine-week4" inference service now.

Next, let's send some requests to the service using a tool named "hey". Hey is a small program that sends traffic in a controllable way.

We will use the following command to simulate user requests:
```bash
hey -z 10s -c 5 -m POST -host ${host} -D ${input_path} ${url}
```
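Running this command sends POST requests for 10 seconds (`-z 10s`) while keeping 5 requests in flight at a time (`-c 5`). The `-D` flag reads the request body from a file, and `-host` sets the HTTP Host header.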

In [9]:
%%bash

model_name=redwine-week4
input_path=redwine-input.json
host=${model_name}.kserve-inference.example.com
url=http://kserve-gateway.local:30200/v1/models/${model_name}:predict

# Send POST requests for 10 seconds, keeping 5 requests in flight. The request body is read from redwine-input.json
hey -z 10s -c 5 -m POST -host ${host} -D ${input_path} ${url}


Summary:
  Total:	10.1390 secs
  Slowest:	0.2430 secs
  Fastest:	0.0207 secs
  Average:	0.0857 secs
  Requests/sec:	58.0925
  
  Total data:	31217 bytes
  Size/request:	53 bytes

Response time histogram:
  0.021 [1]	|
  0.043 [63]	|■■■■■■■■■■■
  0.065 [60]	|■■■■■■■■■■■
  0.087 [222]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.110 [156]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.132 [44]	|■■■■■■■■
  0.154 [18]	|■■■
  0.176 [9]	|■■
  0.199 [3]	|■
  0.221 [5]	|■
  0.243 [8]	|■


Latency distribution:
  10% in 0.0423 secs
  25% in 0.0700 secs
  50% in 0.0834 secs
  75% in 0.0986 secs
  90% in 0.1230 secs
  95% in 0.1461 secs
  99% in 0.2278 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0001 secs, 0.0207 secs, 0.2430 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0028 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0011 secs
  resp wait:	0.0853 secs, 0.0206 secs, 0.2429 secs
  resp read:	0.0001 secs, 0.0000 secs, 0.0055 secs

Status code distribution:
  [200]	589 responses
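
As a quick sanity check on the summary above, the reported throughput can be cross-checked in two ways. The sketch below uses only figures copied from the output; the second estimate relies on Little's law (throughput ≈ concurrency / average latency), which is an addition here and not part of the hey output.

```python
# Figures copied from the hey summary above.
concurrency = 5          # the -c flag
avg_latency_s = 0.0857   # "Average"
total_s = 10.1390        # "Total"
responses = 589          # status code distribution

# Throughput measured directly: completed requests divided by elapsed time.
measured_rps = responses / total_s
# Throughput predicted from concurrency and latency (Little's law).
predicted_rps = concurrency / avg_latency_s

print(f"measured:  {measured_rps:.2f} req/s")
print(f"predicted: {predicted_rps:.2f} req/s")
```

Both values come out at roughly 58 requests per second, in line with the "Requests/sec" figure reported by hey.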





In [10]:
# Check the pods running for the "redwine-week4" inference service again
# You need to run this command immediately after the hey command is completed
!kubectl -n kserve-inference get pods -l serving.kserve.io/inferenceservice=redwine-week4

NAME                                                              READY   STATUS        RESTARTS   AGE
redwine-week4-predictor-default-00001-deployment-7d767d9dc6t7kx   2/2     Terminating   0          98s
redwine-week4-predictor-default-00001-deployment-7d767d9dchs2fr   2/2     Terminating   0          101s
redwine-week4-predictor-default-00001-deployment-7d767d9dcr9mvd   2/2     Terminating   0          3m42s
redwine-week4-predictor-default-00001-deployment-7d767d9dcsn8lr   2/2     Terminating   0          100s
redwine-week4-predictor-default-00001-deployment-7d767d9dcvkc6l   2/2     Running       0          102s


Example output:
```text
NAME                                                              READY   STATUS    RESTARTS   AGE
redwine-week4-predictor-default-00001-deployment-5ddb7c85b5gzxw   2/2     Running   0          24s
redwine-week4-predictor-default-00001-deployment-5ddb7c85b9sdcf   2/2     Running   0          70s
redwine-week4-predictor-default-00001-deployment-5ddb7c85bhtvfx   2/2     Running   0          22s
redwine-week4-predictor-default-00001-deployment-5ddb7c85brm5xb   2/2     Running   0          22s
redwine-week4-predictor-default-00001-deployment-5ddb7c85bxpnp5   2/2     Running   0          20s
```
Recall that we set `scaleTarget` to 1 in manifests/redwine-model-autoscale.yaml, which means KServe tries to scale the pods so that each pod handles one request at a time. Since 5 concurrent requests were sent, five pods were needed in total. As a result, KServe launched four more pods to handle the requests.

Note that the number of additional pods KServe creates can vary depending on how the traffic arrives at the inference service. It's also OK if you see three or five additional pods created.

If you wait for a few minutes and check the pods again (by running the previous code cell), you will see that four of the pods are being terminated. In other words, the inference service is scaled down because there is no more traffic.
```text
NAME                                                              READY   STATUS        RESTARTS   AGE
redwine-week4-predictor-default-00001-deployment-5ddb7c85b5gzxw   2/2     Terminating   0          76s
redwine-week4-predictor-default-00001-deployment-5ddb7c85b9sdcf   2/2     Running       0          2m2s
redwine-week4-predictor-default-00001-deployment-5ddb7c85bhtvfx   2/2     Terminating   0          74s
redwine-week4-predictor-default-00001-deployment-5ddb7c85brm5xb   2/2     Terminating   0          74s
redwine-week4-predictor-default-00001-deployment-5ddb7c85bxpnp5   2/2     Terminating   0          72s
```
Finally, only one pod remains:
```text
NAME                                                              READY   STATUS    RESTARTS   AGE
redwine-week4-predictor-default-00001-deployment-5ddb7c85b9sdcf   2/2     Running   0          4m27s
```

In [11]:
# Clean up
!kubectl -n kserve-inference delete isvc redwine-week4

inferenceservice.serving.kserve.io "redwine-week4" deleted


Expected output:
```text
inferenceservice.serving.kserve.io "redwine-week4" deleted
```

*Credits: The examples are based on [this KServe doc](https://kserve.github.io/website/0.10/modelserving/autoscaling/autoscaling/).*

*If you are interested in seeing how horizontal autoscaling can maintain the performance of an inference service, take a look at [this experiment](./autoscaling-exp.ipynb). (You don't need to run the experiment yourself. It illustrates how horizontal autoscaling maintains the performance of an inference service during traffic spikes.)*

# Next step
You've learned how to horizontally scale an inference service using KServe. Now you can go to [the next tutorial](./4_inference_graph.ipynb) and learn about inference graphs.