KServe gpt-fast example (#2895)
* KServe gpt-fast example

* KServe gpt-fast example

* KServe gpt-fast example

* typo
agunapal committed Jan 12, 2024
1 parent 43e6740 commit 0d11f4c
Showing 5 changed files with 164 additions and 1 deletion.
3 changes: 2 additions & 1 deletion examples/large_models/gpt_fast/handler.py
@@ -118,7 +118,8 @@ def preprocess(self, requests):
         if isinstance(input_data, (bytes, bytearray)):
             input_data = input_data.decode("utf-8")
 
-        input_data = json.loads(input_data)
+        if isinstance(input_data, str):
+            input_data = json.loads(input_data)
 
         prompt = input_data["prompt"]
5 changes: 5 additions & 0 deletions kubernetes/kserve/examples/gpt_fast/Dockerfile
@@ -0,0 +1,5 @@
FROM pytorch/torchserve-kfs-nightly:latest-gpu
USER root
RUN pip uninstall torchtext torchdata torch torchvision torchaudio -y
RUN pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121 --ignore-installed && pip install sentencepiece chardet requests
USER model-server
133 changes: 133 additions & 0 deletions kubernetes/kserve/examples/gpt_fast/README.md
@@ -0,0 +1,133 @@
# Text generation with GPT Fast using KServe

[GPT Fast](https://github.com/pytorch-labs/gpt-fast) is a PyTorch-native implementation of optimized GPT models. We are using the GPT Fast version of `llama2-7b-chat-hf`.
In this example, we show how to serve the GPT Fast version of Llama 2 with KServe.
We will be serving the model with KServe deployed on a single instance using [minikube](https://minikube.sigs.k8s.io/docs/start/). The same solution can be extended to the Kubernetes offerings of various cloud providers.

The inference service returns the text generated for the given prompt.

## KServe Image

Before we set up the infrastructure, we need the correct Docker image for running this example.
Currently, GPT-Fast needs a PyTorch >= 2.2 nightly build to run, which the nightly image published by TorchServe does not include. Hence, we need to build a custom image.

#### Build custom KServe image

The Dockerfile takes the nightly TorchServe KServe image and installs PyTorch nightlies on top of that.

If your Docker Hub username is `abc`, use the following command:
```
docker build . -t abc/torchserve-kfs-nightly:latest-gpu
```
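
After the build completes, you can optionally sanity-check that the image really picked up a PyTorch >= 2.2 nightly. A minimal, illustrative check (not part of the example itself) is to run something like the following inside the built image, for instance via `docker run --rm abc/torchserve-kfs-nightly:latest-gpu python -c "import torch; print(torch.__version__)"`:

```python
# Illustrative sanity check for the custom image; run inside the container.
import torch

# Nightly builds report versions such as 2.2.0.devYYYYMMDD+cu121.
print("torch version:", torch.__version__)

# GPT-Fast requires PyTorch >= 2.2.
major, minor = (int(x) for x in torch.__version__.split(".")[:2])
assert (major, minor) >= (2, 2), "expected a PyTorch >= 2.2 nightly build"
```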

#### Publish KServe image

Make sure you have logged in to your account using `docker login`.

```
docker push abc/torchserve-kfs-nightly:latest-gpu
```

### GPT-Fast model archive

You can refer to the following [link](https://github.com/pytorch/serve/blob/master/examples/large_models/gpt_fast/README.md) to create the TorchServe model archive for GPT-Fast.

You will need to publish the config and the model store to an accessible bucket.
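
As a rough illustration of that step: with the KServe TorchServe layout, the bucket path pointed to by `storageUri` typically contains a `config/config.properties` file and a `model-store/` directory holding the `.mar` archive. Below is a hedged sketch of uploading such a layout to a GCS bucket with the `google-cloud-storage` client (the bucket name, prefix, and local paths are placeholders, not values from this example):

```python
# Illustrative upload sketch; requires `pip install google-cloud-storage` and GCP credentials.
from google.cloud import storage

BUCKET = "my-gpt-fast-bucket"  # placeholder bucket name
PREFIX = "gpt_fast/v1"         # storageUri would then be gs://my-gpt-fast-bucket/gpt_fast/v1

# Remote object path -> local file path (placeholders).
FILES = {
    "config/config.properties": "config/config.properties",
    "model-store/gpt_fast.mar": "model_store/gpt_fast.mar",
}

client = storage.Client()
bucket = client.bucket(BUCKET)
for remote_path, local_path in FILES.items():
    bucket.blob(f"{PREFIX}/{remote_path}").upload_from_filename(local_path)
    print(f"uploaded {local_path} -> gs://{BUCKET}/{PREFIX}/{remote_path}")
```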

Now we are ready to start deploying the published model.

## Install KServe

Start minikube cluster

```
minikube start --gpus all
```

Install KServe locally.
```
curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.11/hack/quick_install.sh" | bash
```

Make sure KServe is installed on the minikube cluster using

```
kubectl get pods -n kserve
```

This should result in
```
NAME READY STATUS RESTARTS AGE
kserve-controller-manager-57574b4878-rnsjn 2/2 Running 0 17s
```

TorchServe supports the KServe V1 and V2 protocols. We show how to deploy GPT-Fast with the V1 protocol.

## KServe V1 protocol

Deploy an `InferenceService` with the KServe V1 protocol:

```
kubectl apply -f llama2_v1_gpu.yaml
```

This results in

```
inferenceservice.serving.kserve.io/torchserve created
```

We need to wait until the pod is up:

```
kubectl get pods
NAME READY STATUS RESTARTS AGE
torchserve-predictor-00001-deployment-8d66f9c-dkdhr 2/2 Running 0 8m19s
```

We need to set the following

```
MODEL_NAME=gpt_fast
SERVICE_HOSTNAME=$(kubectl get inferenceservice torchserve -o jsonpath='{.status.url}' | cut -d "/" -f 3)
```

```
export INGRESS_HOST=localhost
export INGRESS_PORT=8080
```

```
INGRESS_GATEWAY_SERVICE=$(kubectl get svc --namespace istio-system --selector="app=istio-ingressgateway" --output jsonpath='{.items[0].metadata.name}')
kubectl port-forward --namespace istio-system svc/${INGRESS_GATEWAY_SERVICE} 8080:80 &
```

#### Model loading

Once the pod is up, loading the model can still take some time for large models. We can monitor the `ready` state to determine when the model is loaded.
You can use the following command to get the `ready` state:

```
curl -v -H "Content-Type: application/json" -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}
```
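
If you prefer to script the wait, here is a minimal polling sketch. It assumes the environment variables set above, the `requests` package, and that the ready-check response is JSON with a `ready` field (as in the KServe V1 protocol):

```python
# Illustrative readiness poll; assumes MODEL_NAME, SERVICE_HOSTNAME, INGRESS_HOST and
# INGRESS_PORT are exported as in the steps above, and that `requests` is installed.
import os
import time

import requests

host = os.environ.get("INGRESS_HOST", "localhost")
port = os.environ.get("INGRESS_PORT", "8080")
model = os.environ.get("MODEL_NAME", "gpt_fast")
service_hostname = os.environ["SERVICE_HOSTNAME"]

url = f"http://{host}:{port}/v1/models/{model}"
headers = {"Host": service_hostname, "Content-Type": "application/json"}

while True:
    try:
        body = requests.get(url, headers=headers, timeout=10).json()
    except (requests.RequestException, ValueError) as err:
        body = {"error": str(err)}
    if body.get("ready") in (True, "true", "True"):
        print("model is ready:", body)
        break
    print("not ready yet:", body)
    time.sleep(10)
```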

#### Inference request
Make an inference request

```
curl -H "Content-Type: application/json" -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict -d @./sample_text.json
```

Expected output is

```
{"predictions":["is Paris. It is located in the northern central part of the country and is known for its stunning architecture, art museums, fashion, and historical landmarks. The city is home to many famous landmarks such as the Eiffel Tower"]}
```
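
The same request can also be made from Python. A minimal sketch, assuming the `requests` package and the environment variables set earlier (the payload mirrors `sample_text.json`):

```python
# Illustrative Python client for the V1 :predict endpoint; assumes the environment
# variables from the steps above and that `requests` is installed.
import os

import requests

host = os.environ.get("INGRESS_HOST", "localhost")
port = os.environ.get("INGRESS_PORT", "8080")
model = os.environ.get("MODEL_NAME", "gpt_fast")
service_hostname = os.environ["SERVICE_HOSTNAME"]

url = f"http://{host}:{port}/v1/models/{model}:predict"
headers = {"Host": service_hostname, "Content-Type": "application/json"}

# Same payload as sample_text.json.
payload = {
    "instances": [
        {"data": {"prompt": "The capital of France", "max_new_tokens": 50}}
    ]
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
print(response.json()["predictions"][0])
```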


## Stop and delete the cluster

```
minikube stop
minikube delete
```
13 changes: 13 additions & 0 deletions kubernetes/kserve/examples/gpt_fast/llama2_v1_gpu.yaml
@@ -0,0 +1,13 @@
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: "torchserve"
spec:
  predictor:
    pytorch:
      storageUri: gs://<gpt-fast-url>/v1
      image: abc/torchserve-kfs-nightly:latest-gpu
      resources:
        limits:
          memory: 20Gi
          nvidia.com/gpu: "0"
11 changes: 11 additions & 0 deletions kubernetes/kserve/examples/gpt_fast/sample_text.json
@@ -0,0 +1,11 @@
{
  "instances": [
    {
      "data": {
        "prompt": "The capital of France",
        "max_new_tokens": 50
      }
    }
  ]
}
