# Serverless Llama 3 deployment on GKE using vLLM and KEDA: Scaling Kubernetes to Zero (And Back)

**(with an OpenAI-compatible drop-in replacement API)**

## Overview

This notebook demonstrates deploying Llama 3 Instruct with vLLM on GKE, loading the model weights from a Google Cloud Storage bucket. vLLM is deployed and served on GPUs. The guide uses L4 GPUs, but it should also work with A100 (40 GB), A100 (80 GB), and H100 (80 GB) GPUs.


### Objective

Deploy an LLM with vLLM on GPUs, run inference against it, and scale the deployment to zero when it is idle.

### GPUs

GPUs let you accelerate specific workloads running on your nodes such as machine learning and data processing. GKE provides a range of machine type options for node configuration, including machine types with NVIDIA H100, L4, and A100 GPUs.

### vLLM

vLLM is a highly optimized open-source LLM serving framework that can increase serving throughput on GPUs. vLLM includes features such as:

- Optimized transformer implementation with PagedAttention
- Continuous batching to improve the overall serving throughput
- Tensor parallelism and distributed serving on multiple GPUs
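
To make the PagedAttention item concrete, here is a minimal sketch of the bookkeeping idea behind it: the KV cache is split into fixed-size blocks that are handed out as a sequence grows, so at most one partially filled block is wasted per sequence. The block size of 16 tokens, the 2048-token reservation, and the function names are assumptions chosen for illustration; this is not vLLM's internal API.

```python
import math

BLOCK_SIZE = 16  # assumed number of tokens per KV-cache block (illustrative)


def blocks_needed(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of fixed-size blocks required to hold num_tokens."""
    return math.ceil(num_tokens / block_size)


def wasted_slots_paged(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Unused token slots when blocks are allocated on demand."""
    return blocks_needed(num_tokens, block_size) * block_size - num_tokens


def wasted_slots_preallocated(num_tokens: int, max_len: int) -> int:
    """Unused token slots when max_len slots are reserved up front."""
    return max_len - num_tokens


if __name__ == "__main__":
    tokens = 100
    print(blocks_needed(tokens))                    # 7
    print(wasted_slots_paged(tokens))               # 12
    print(wasted_slots_preallocated(tokens, 2048))  # 1948
```

Under these assumptions, a 100-token sequence leaves 12 slots unused with paging, compared with 1948 when a 2048-token buffer is reserved in advance.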

### KEDA

KEDA is the Kubernetes Event-Driven Autoscaler.

Kubernetes offers the Horizontal Pod Autoscaler (HPA) as a controller to increase and decrease replicas dynamically.


Unfortunately, the HPA has a few drawbacks:

1. It doesn’t work out of the box: you need to install a Metrics Server to aggregate and expose the metrics.

2. It doesn’t scale to zero replicas.

3. It scales replicas based on metrics and doesn’t intercept HTTP traffic.

Fortunately, you don’t have to use the official autoscaler; you can use KEDA instead.
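
The scale-to-zero limitation listed above follows from the scaling rule in the Kubernetes HPA documentation, which multiplies the current replica count by the ratio of the observed metric to its target. The sketch below reproduces only that core formula (tolerances, stabilization windows, and min/max bounds are left out), and the function name is hypothetical.

```python
import math


def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float) -> int:
    """Core HPA rule: ceil(current_replicas * current_metric / target_metric)."""
    return math.ceil(current_replicas * (current_metric / target_metric))


if __name__ == "__main__":
    # Two replicas, each seeing 150 in-flight requests against a target of 100.
    print(hpa_desired_replicas(2, 150, 100))  # 3
    # With zero replicas the product is zero, whatever the demand is.
    print(hpa_desired_replicas(0, 500, 100))  # 0
```

Because the result is proportional to the current replica count, a workload at zero replicas stays at zero; something outside this formula has to observe incoming demand and trigger the first replica.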


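KEDA is an autoscaler made of three components:

1. A Scaler, which connects to an event source and collects its metrics.

2. A Metrics Adapter, which exposes those metrics to the Kubernetes HPA.

3. A Controller, which creates the HPA and handles scaling between zero and one replica.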

### Prerequisites
Install gcloud

## Run the notebook

In [7]:
# Set your Google Cloud project ID.
PROJECT_ID = "demo-project"

# Set the region in which the cluster is created.
REGION = "demo-location"

NAMESPACE="vllm"

CLUSTER_NAME=f"vllm-cluster-{PROJECT_ID}-3"

LLM_MODEL_ID="b0581263-c45a-4851-9e4b-b47e612a750e"
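
Because the cluster name is derived from the project ID, a long project ID can produce a name that GKE rejects. The check below is a minimal sketch that assumes GKE cluster names must be at most 40 characters, start with a lowercase letter, contain only lowercase letters, digits, and hyphens, and end with a letter or digit; verify these limits against the current GKE documentation. The helper name is hypothetical.

```python
import re

MAX_CLUSTER_NAME_LEN = 40  # assumed GKE limit; confirm in the GKE docs
_NAME_RE = re.compile(r"^[a-z]([-a-z0-9]*[a-z0-9])?$")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if name satisfies the assumed GKE naming rules."""
    return len(name) <= MAX_CLUSTER_NAME_LEN and bool(_NAME_RE.match(name))


if __name__ == "__main__":
    # Same pattern as CLUSTER_NAME above, with the demo project ID.
    print(is_valid_cluster_name("vllm-cluster-demo-project-3"))  # True
    # A 30-character project ID pushes the derived name to 45 characters.
    print(is_valid_cluster_name("vllm-cluster-" + "a" * 30 + "-3"))  # False
```

If the check fails for your project, shortening the fixed prefix in CLUSTER_NAME is the simplest fix.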

Install the necessary libraries

In [12]:
# Install the openai client
! pip install openai

# Set up gcloud.
! gcloud config set project {PROJECT_ID}
! gcloud services enable container.googleapis.com
! sudo apt-get install -y kubectl
! sudo apt-get install -y google-cloud-sdk-gke-gcloud-auth-plugin

Updated property [core/project].
E: dpkg was interrupted, you must manually run 'sudo dpkg --configure -a' to correct the problem. 
E: dpkg was interrupted, you must manually run 'sudo dpkg --configure -a' to correct the problem. 


Create an Autopilot cluster

In [None]:
! gcloud container clusters create-auto {CLUSTER_NAME} \
    --project={PROJECT_ID} \
    --region={REGION} \
    --release-channel=rapid \
    --cluster-version=1.28 \
    --scopes=cloud-platform,storage-rw,cloud-source-repos \
    --create-subnetwork ""

Log in to the cluster

In [None]:
! gcloud container clusters get-credentials {CLUSTER_NAME} --project {PROJECT_ID} --region {REGION}

Grant the Compute Engine default service account the necessary roles, create a namespace and a Kubernetes service account, and bind the two accounts through Workload Identity.

In [None]:
# This line captures the output of the gcloud command into a list
project_number = ! gcloud projects describe {PROJECT_ID} --format='value(projectNumber)'

# Since the output is a list with the project number as its first element, access it with [0]
gce_sa = f"{project_number[0]}-compute@developer.gserviceaccount.com"

# List of roles you want to assign
roles = ["monitoring.metricWriter", "stackdriver.resourceMetadata.writer"]

# Loop over the roles and add IAM policy binding for each
for role in roles:
    !gcloud projects add-iam-policy-binding {PROJECT_ID} --member=serviceAccount:{gce_sa} --role=roles/{role}

# Create a namespace in Kubernetes
!kubectl create ns {NAMESPACE}

# Create a service account in the newly created namespace
!kubectl create serviceaccount {NAMESPACE} --namespace {NAMESPACE}

# Add IAM policy binding to the GCE service account
!gcloud iam service-accounts add-iam-policy-binding {gce_sa} \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:{PROJECT_ID}.svc.id.goog[{NAMESPACE}/{NAMESPACE}]"

# Annotate the Kubernetes service account with the GCE service account
!kubectl annotate serviceaccount {NAMESPACE} \
    --namespace {NAMESPACE} \
    iam.gke.io/gcp-service-account={gce_sa}
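
The two identity strings used in the cell above are easy to mistype, so it can help to build them in one place. The helpers below are a small sketch with hypothetical names; they only format strings in the same shape as the commands above and do not call any Google Cloud API. The project number in the demo is made up.

```python
def default_compute_sa(project_number: str) -> str:
    """Email of the Compute Engine default service account for a project number."""
    return f"{project_number}-compute@developer.gserviceaccount.com"


def workload_identity_member(project_id: str, namespace: str, ksa_name: str) -> str:
    """IAM member string that lets a Kubernetes service account act as a Google one."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


if __name__ == "__main__":
    # "123456789012" is a placeholder project number, not a real one.
    print(default_compute_sa("123456789012"))
    # In this notebook the Kubernetes service account shares the namespace's name.
    print(workload_identity_member("demo-project", "vllm", "vllm"))
```

In this notebook both the namespace and the Kubernetes service account are named after NAMESPACE, which is why the member string ends in [vllm/vllm].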

Install Helm Binaries

In [None]:
! curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
! chmod 700 get_helm.sh
! ./get_helm.sh

Install KEDA core

In [None]:
! helm repo add kedacore https://kedacore.github.io/charts
! helm repo update
! helm install keda kedacore/keda --namespace {NAMESPACE} --create-namespace

Install the KEDA HTTP add-on

In [106]:
! helm install http-add-on kedacore/keda-add-ons-http --namespace {NAMESPACE}

NAME: http-add-on
LAST DEPLOYED: Sun May 12 17:01:00 2024
NAMESPACE: vllm
STATUS: deployed
REVISION: 1
TEST SUITE: None


Install NGINX Ingress Controller

In [None]:
! helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
! helm repo update
! helm install ingress-nginx ingress-nginx/ingress-nginx -n {NAMESPACE}

Raise the interceptor timeouts so that requests are held open while the deployment scales up from zero

In [151]:
! helm upgrade http-add-on kedacore/keda-add-ons-http --namespace {NAMESPACE} \
  --set interceptor.replicas.waitTimeout=1000s \
  --set interceptor.responseHeaderTimeout=1000s \
  --set interceptor.expectContinueTimeout=1000s \
  --set interceptor.tcpConnectTimeout=1000s

Release "http-add-on" has been upgraded. Happy Helming!
NAME: http-add-on
LAST DEPLOYED: Sun May 12 19:20:10 2024
NAMESPACE: vllm
STATUS: deployed
REVISION: 3
TEST SUITE: None


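Modify KEDA_RESPONSE_HEADER_TIMEOUT, which is the sixth environment variable of the interceptor container in the chart version used here (index 5 in the JSON patch below). Set it to at least the duration of your longest batched (non-streaming) response.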

In [153]:
! kubectl patch deployment keda-add-ons-http-interceptor -n vllm --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/env/5/value", "value":"100s"}]'


deployment.apps/keda-add-ons-http-interceptor patched
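
The hard-coded index 5 breaks if another chart version orders the environment variables differently. As a hedged alternative, the sketch below builds the same JSON patch by looking the variable up by name in the Deployment object (for example the parsed output of kubectl get deployment keda-add-ons-http-interceptor -n vllm -o json). The function name and the placeholder variable names in the demo are hypothetical.

```python
import json


def env_patch(deployment: dict, env_name: str, new_value: str,
              container_index: int = 0) -> str:
    """Build a JSON patch that replaces the value of env_name, located by name."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    env = containers[container_index]["env"]
    for i, entry in enumerate(env):
        if entry["name"] == env_name:
            path = f"/spec/template/spec/containers/{container_index}/env/{i}/value"
            return json.dumps([{"op": "replace", "path": path, "value": new_value}])
    raise KeyError(f"{env_name} not found in container {container_index}")


if __name__ == "__main__":
    # Stand-in for the real Deployment JSON; PLACEHOLDER_* names are invented.
    env = [{"name": f"PLACEHOLDER_{i}", "value": "x"} for i in range(5)]
    env.append({"name": "KEDA_RESPONSE_HEADER_TIMEOUT", "value": "500ms"})
    fake = {"spec": {"template": {"spec": {"containers": [{"env": env}]}}}}
    print(env_patch(fake, "KEDA_RESPONSE_HEADER_TIMEOUT", "100s"))
```

The returned string can be passed to kubectl patch with --type=json. The command kubectl set env deployment/keda-add-ons-http-interceptor KEDA_RESPONSE_HEADER_TIMEOUT=100s -n vllm should achieve the same result without any index.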


Create the deployment

In [8]:
# @title Deploy VLLM
model_location = f"/gcs-mount/models/{LLM_MODEL_ID}/model"
# @markdown This section deploys on VLLM.

K8S_YAML = f"""
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
        ai.gke.io/model: llm-7b
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
      annotations:
        kubectl.kubernetes.io/default-container: vllm-server
        gke-gcsfuse/volumes: "true"
        gke-gcsfuse/cpu-limit: "10"
        gke-gcsfuse/memory-limit: 10Gi
        gke-gcsfuse/ephemeral-storage-limit: 1Ti
        gke-gcsfuse/cpu-request: 500m
        gke-gcsfuse/memory-request: 1Gi
        gke-gcsfuse/ephemeral-storage-request: 50Gi
    spec:
      serviceAccountName: vllm
      containers:
      - name: vllm-server
        #image: us-docker.pkg.dev/command-center-alpha/deployment/vllm:032
        image: vllm/vllm-openai:latest
        resources:
          requests:
            cpu: "2"
            memory: "25Gi"
            ephemeral-storage: "25Gi"
            nvidia.com/gpu: 1
          limits:
            cpu: "2"
            memory: "25Gi"
            ephemeral-storage: "25Gi"
            nvidia.com/gpu: 1  # GPU limit must equal the request; --tensor-parallel-size=1 uses one GPU
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model={model_location}
        - --gpu-memory-utilization=0.9
        - --swap-space=0
        - --dtype=half
        - --quantization=gptq
        - --tensor-parallel-size=1
        - --port=8080
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 2
          successThreshold: 1
          failureThreshold: 3
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - mountPath: /gcs-mount
          name: gcs-fuse-csi-ephemeral
          readOnly: true
      volumes:
      - name: dshm
        emptyDir:
            medium: Memory
      - name: gcs-fuse-csi-ephemeral
        csi:
          driver: gcsfuse.csi.storage.gke.io
          volumeAttributes:
            bucketName: {PROJECT_ID}
            mountOptions: "implicit-dirs"
            fileCacheCapacity: "10Gi"
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
"""

with open("vllm-deployment.yaml", "w") as f:
    f.write(K8S_YAML)

! kubectl delete -f vllm-deployment.yaml -n {NAMESPACE} --ignore-not-found
! kubectl apply -f vllm-deployment.yaml -n {NAMESPACE}


deployment.apps "vllm-deployment" deleted
deployment.apps/vllm-deployment created
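
To get a feel for what --gpu-memory-utilization=0.9 means on a single L4, the arithmetic below estimates the memory vLLM may claim and how much of it the weights take. It is a rough sketch under stated assumptions: 24 GiB of L4 memory, roughly 8 billion parameters, and 4-bit GPTQ weights with quantization metadata ignored. The function names are hypothetical.

```python
GIB = 1024 ** 3


def vllm_memory_budget_gib(gpu_memory_gib: float, utilization: float) -> float:
    """GPU memory vLLM may use for weights, activations, and KV cache."""
    return gpu_memory_gib * utilization


def weight_size_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate weight footprint, ignoring quantization scales and zero points."""
    return num_params * bits_per_param / 8 / GIB


if __name__ == "__main__":
    budget = vllm_memory_budget_gib(24, 0.9)  # assumed L4 memory size
    weights_4bit = weight_size_gib(8e9, 4)    # assumed parameter count
    weights_fp16 = weight_size_gib(8e9, 16)
    print(f"budget:       {budget:.1f} GiB")
    print(f"4-bit weights: {weights_4bit:.1f} GiB")
    print(f"fp16 weights:  {weights_fp16:.1f} GiB")
    print(f"left for KV cache and activations (4-bit): {budget - weights_4bit:.1f} GiB")
```

Under these assumptions the quantized weights use only a small share of the budget, which leaves most of the card for the KV cache.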


Create the Service, which selects the pods labeled app: vllm-server created by the previous Deployment

In [154]:
K8S_YAML = f"""
apiVersion: v1
kind: Service
metadata:
 name: vllm-server
spec:
 ports:
   - port: 8080
     targetPort: 8080
 selector:
   app: vllm-server
"""
with open("vllm-service.yaml", "w") as f:
    f.write(K8S_YAML)

!kubectl replace --force -f vllm-service.yaml -n {NAMESPACE}

service "vllm-server" deleted
service/vllm-server replaced


Create the HTTPScaledObject, which references the Deployment ("vllm-deployment") and the Service ("vllm-server") created previously

In [161]:
K8S_YAML = f"""
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: vllm-server
  namespace: vllm
spec:
  hosts: 
  - vllm.com
  pathPrefixes:
  - /v1
  scaleTargetRef:
    deployment: vllm-deployment
    service: vllm-server
    port: 8080
  replicas:
    min: 0
    max: 3
  scaledownPeriod: 300
  scalingMetric: # requestRate and concurrency are mutually exclusive
    concurrency:
        targetValue: 100
"""
with open("vllm-keda.yaml", "w") as f:
    f.write(K8S_YAML)

!kubectl replace --force -f vllm-keda.yaml -n {NAMESPACE}

httpscaledobject.http.keda.sh "vllm-server" deleted
httpscaledobject.http.keda.sh/vllm-server replaced
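
With a concurrency target of 100 and replicas bounded between 0 and 3, the scaling behaviour of this HTTPScaledObject can be approximated as follows. This is a simplified model rather than KEDA's implementation: it ignores the 300-second scaledownPeriod and any stabilization applied by the underlying HPA, and the function name is hypothetical.

```python
import math


def desired_replicas(pending_requests: int, target_value: int = 100,
                     min_replicas: int = 0, max_replicas: int = 3) -> int:
    """Approximate replica count for a number of in-flight requests."""
    if pending_requests <= 0:
        # No traffic: fall back to the minimum (zero in this notebook).
        return min_replicas
    wanted = max(1, math.ceil(pending_requests / target_value))
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for pending in (0, 1, 100, 101, 250, 1000):
        print(pending, desired_replicas(pending))
```

A single pending request is enough to bring the deployment back from zero, and the count is capped at three however many requests are queued.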


Create the Ingress, which routes traffic to the KEDA interceptor and sets the upstream host to vllm.com so that requests match the HTTPScaledObject

In [162]:
K8S_YAML = f"""
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-server
  namespace: vllm
  annotations:
    nginx.ingress.kubernetes.io/upstream-vhost: vllm.com
    nginx.ingress.kubernetes.io/proxy-read-timeout: "1000"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "1000"
spec:
  ingressClassName: nginx
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: keda-add-ons-http-interceptor-proxy
            port:
              number: 8080
"""
with open("ingress-keda.yaml", "w") as f:
    f.write(K8S_YAML)

!kubectl replace --force -f ingress-keda.yaml -n {NAMESPACE}

ingress.networking.k8s.io "vllm-server" deleted
ingress.networking.k8s.io/vllm-server replaced
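
The upstream-vhost annotation matters because the interceptor decides where to send a request from its Host header and path. The sketch below models that decision as an exact host match plus a path-prefix match, using the hosts and pathPrefixes from the HTTPScaledObject. The real add-on may apply additional rules, so treat this as an illustration; the function name and the example IP address are hypothetical.

```python
def matches_scaled_object(host: str, path: str,
                          hosts=("vllm.com",),
                          path_prefixes=("/v1",)) -> bool:
    """Return True if a request would be routed to the scaled deployment."""
    hostname = host.split(":")[0].lower()  # drop an optional port
    return hostname in hosts and any(path.startswith(p) for p in path_prefixes)


if __name__ == "__main__":
    # Host rewritten to vllm.com by the Ingress annotation.
    print(matches_scaled_object("vllm.com", "/v1/chat/completions"))      # True
    # Raw load balancer IP as Host, as sent by a client without the rewrite.
    print(matches_scaled_object("203.0.113.10", "/v1/chat/completions"))  # False
    # Path outside the /v1 prefix.
    print(matches_scaled_object("vllm.com", "/health"))                   # False
```

Clients in this notebook call the load balancer by IP address, so without the host rewrite their requests would fall into the second case.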


Run inference on the deployed model

Initialize the OpenAI client

In [9]:
ip_address = !  kubectl get services -l "app.kubernetes.io/component=controller" -o jsonpath="{.items[0].status.loadBalancer.ingress[0].ip}" -n vllm

# @title Prediction

from openai import OpenAI

model_location = f"/gcs-mount/models/{LLM_MODEL_ID}/model"

client = OpenAI(
    base_url=f"http://{ip_address[0]}/v1/",
    api_key="llama3",
)
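
The first request after a scale-down to zero has to wait for a pod to start and load the model, so it may time out or fail before the server is ready. The helper below is a minimal, library-independent sketch of retrying with exponential backoff; the name call_with_retries and its defaults are assumptions for illustration. The OpenAI client constructor also accepts timeout and max_retries arguments, which may be enough on their own.

```python
import time


def call_with_retries(fn, attempts: int = 5, base_delay: float = 2.0,
                      sleep=time.sleep):
    """Call fn(), retrying with exponential backoff if it raises."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # narrow this to connection/timeout errors in real use
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Hypothetical usage with the client defined above:
# completion = call_with_retries(
#     lambda: client.chat.completions.create(
#         model=model_location,
#         messages=[{"role": "user", "content": "ping"}],
#     )
# )
```

The sleep parameter is injectable so the backoff schedule can be checked without actually waiting.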

Batched (non-streaming) response

In [10]:
completion = client.chat.completions.create(
  model=f"{model_location}",
  messages=[
    {"role": "user", "content": "Give me a 5 day workout routine"},
  ],
  temperature=0,
)

print(completion.choices[0].message) 


ChatCompletionMessage(content='Here is a 5-day workout routine that targets different muscle groups and includes a mix of cardio and strength training exercises:\n\n**Day 1: Chest and Triceps**\n\n1. Warm-up: 5-10 minutes of cardio (jogging, jumping jacks, etc.)\n2. Barbell Bench Press: 3 sets of 8-12 reps\n3. Incline Dumbbell Press: 3 sets of 10-15 reps\n4. Tricep Pushdown: 3 sets of 12-15 reps\n5. Tricep Dips: 3 sets of 12-15 reps\n6. Cool-down: 5-10 minutes of stretching\n\n**Day 2: Back and Biceps**\n\n1. Warm-up: 5-10 minutes of cardio\n2. Pull-ups: 3 sets of 8-12 reps (or Assisted Pull-ups if needed)\n3. Barbell Rows: 3 sets of 8-12 reps\n4. Dumbbell Bicep Curls: 3 sets of 10-15 reps\n5. Hammer Curls: 3 sets of 10-15 reps\n6. Cool-down: 5-10 minutes of stretching\n\n**Day 3: Legs**\n\n1. Warm-up: 5-10 minutes of cardio\n2. Squats: 3 sets of 8-12 reps\n3. Leg Press: 3 sets of 10-15 reps\n4. Lunges: 3 sets of 10-15 reps (per leg)\n5. Leg Extensions: 3 sets of 12-15 reps\n6. Cool-do

Streaming response

In [11]:

stream = client.chat.completions.create(
  model=f"{model_location}",
  messages=[
    {"role": "user", "content": "Give me a 5 day workout routine"},
  ],
  temperature=0,
  stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)


Here is a 5-day workout routine that targets different muscle groups and includes a mix of cardio and strength training exercises:

**Day 1: Chest and Triceps**

1. Warm-up: 5-10 minutes of cardio (jogging, jumping jacks, etc.)
2. Barbell Bench Press: 3 sets of 8-12 reps
3. Incline Dumbbell Press: 3 sets of 10-15 reps
4. Tricep Pushdown: 3 sets of 12-15 reps
5. Tricep Dips: 3 sets of 12-15 reps
6. Cool-down: 5-10 minutes of stretching

**Day 2: Back and Biceps**

1. Warm-up: 5-10 minutes of cardio
2. Pull-ups: 3 sets of 8-12 reps (or Assisted Pull-ups if needed)
3. Barbell Rows: 3 sets of 8-12 reps
4. Dumbbell Bicep Curls: 3 sets of 10-15 reps
5. Hammer Curls: 3 sets of 10-15 reps
6. Cool-down: 5-10 minutes of stretching

**Day 3: Legs**

1. Warm-up: 5-10 minutes of cardio
2. Squats: 3 sets of 8-12 reps
3. Leg Press: 3 sets of 10-15 reps
4. Lunges: 3 sets of 10-15 reps (per leg)
5. Leg Extensions: 3 sets of 12-15 reps
6. Cool-down: 5-10 minutes of stretching

**Day 4: Shoulders and Abs

Clean up

In [None]:
# Uninstall all Helm releases (the HTTP add-on first, since it depends on KEDA).
# "helm delete" is an alias of "helm uninstall", so one call per release is enough.
! helm uninstall http-add-on -n {NAMESPACE}
! helm uninstall ingress-nginx -n {NAMESPACE}
! helm uninstall keda -n {NAMESPACE}

# Delete all resources in the namespace 
! kubectl delete all --all -n {NAMESPACE}

# Delete the cluster
! gcloud container clusters delete {CLUSTER_NAME} --region={REGION} --quiet 