
KServe gpt-fast example #2895

Merged · 6 commits · Jan 12, 2024
Changes from 5 commits
3 changes: 2 additions & 1 deletion examples/large_models/gpt_fast/handler.py
@@ -118,7 +118,8 @@ def preprocess(self, requests):
         if isinstance(input_data, (bytes, bytearray)):
             input_data = input_data.decode("utf-8")

-        input_data = json.loads(input_data)
+        if isinstance(input_data, str):
+            input_data = json.loads(input_data)

Comment on lines +121 to 123
Collaborator:
I don't think this part is needed. Could you please check whether the client passes the content type as JSON?

Collaborator Author:
It doesn't work without this because KServe does some pre-processing and reads the "data" part as a dict.

         prompt = input_data["prompt"]

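To make the thread above concrete, here is a minimal runnable sketch of the behavior being discussed; the payload shapes are assumptions based on the comments (KServe's V1 wrapper hands the handler a dict, while a direct TorchServe call may hand it bytes or a serialized JSON string):

```
import json

def normalize_input(input_data):
    # Accept the shapes discussed above: KServe's V1 wrapper delivers a
    # dict, while a direct TorchServe call may deliver bytes or a
    # serialized JSON string.
    if isinstance(input_data, (bytes, bytearray)):
        input_data = input_data.decode("utf-8")
    if isinstance(input_data, str):
        # Only parse when the body actually arrived as a JSON string.
        input_data = json.loads(input_data)
    return input_data

# Both call styles yield the same dict:
assert normalize_input(b'{"prompt": "hi"}') == {"prompt": "hi"}
assert normalize_input({"prompt": "hi"}) == {"prompt": "hi"}
```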
5 changes: 5 additions & 0 deletions kubernetes/kserve/examples/gpt_fast/Dockerfile
@@ -0,0 +1,5 @@
FROM pytorch/torchserve-kfs-nightly:latest-gpu
USER root
RUN pip uninstall torchtext torchdata torch torchvision torchaudio -y
RUN pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121 --ignore-installed && pip install sentencepiece chardet requests
USER model-server
133 changes: 133 additions & 0 deletions kubernetes/kserve/examples/gpt_fast/README.md
@@ -0,0 +1,133 @@
# Text generation with GPT Fast using KServe

[GPT Fast](https://github.com/pytorch-labs/gpt-fast) is a PyTorch-native solution for optimized GPT models. We are using the GPT Fast version of `llama2-7b-chat-hf`.
In this example, we show how to serve the GPT Fast version of Llama 2 with KServe.
We will be serving the model using KServe deployed with [minikube](https://minikube.sigs.k8s.io/docs/start/) on a single instance. The same solution can be extended to the Kubernetes offerings of various cloud providers.

The inference service returns the text generated for the given prompt.

## KServe Image

Before we set up the infrastructure, we need the correct Docker image for running this example.
Currently, GPT-Fast needs PyTorch >= 2.2 nightlies to run, which the nightly image published by TorchServe doesn't include. Hence, we need to build a custom image.

#### Build custom KServe image

The Dockerfile takes the nightly TorchServe KServe image and installs the PyTorch nightlies on top of it.

If your Docker Hub username is `abc`, use the following command:
```
docker build . -t abc/torchserve-kfs-nightly:latest-gpu
```
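
To sanity-check the image before publishing, you can print the PyTorch version baked into it (a quick verification sketch using the tag built above):

```
docker run --rm abc/torchserve-kfs-nightly:latest-gpu python -c "import torch; print(torch.__version__)"
```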

#### Publish KServe image

Make sure you have logged in to your account using `docker login`.

```
docker push abc/torchserve-kfs-nightly:latest-gpu
```

### GPT-Fast model archive

You can refer to the following [link](https://github.com/pytorch/serve/blob/master/examples/large_models/gpt_fast/README.md) to create a TorchServe model archive of GPT-Fast.

You will need to publish the config and the model store to an accessible bucket.
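
For example, assuming the archive steps above produced local `config/` and `model-store/` directories, and that `gs://gpt-fast-models` is a bucket you control (both names are hypothetical), you could publish them with `gsutil`:

```
gsutil cp -r config gs://gpt-fast-models/v1/config
gsutil cp -r model-store gs://gpt-fast-models/v1/model-store
```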

Now we are ready to start deploying the published model.

## Install KServe

Start the minikube cluster:

```
minikube start --gpus all
```

Install KServe locally:
```
curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.11/hack/quick_install.sh" | bash
```

Make sure KServe is installed on the minikube cluster using:

```
kubectl get pods -n kserve
```

This should result in
```
NAME READY STATUS RESTARTS AGE
kserve-controller-manager-57574b4878-rnsjn 2/2 Running 0 17s
```

TorchServe supports the KServe V1 and V2 protocols. We show how to deploy GPT-Fast with the V1 protocol.

## KServe V1 protocol

Deploy the `InferenceService` with the KServe V1 protocol:

```
kubectl apply -f llama2_v1_gpu.yaml
```

This results in:

```
inferenceservice.serving.kserve.io/torchserve created
```

We need to wait until the pod is up:

```
kubectl get pods
NAME READY STATUS RESTARTS AGE
torchserve-predictor-00001-deployment-8d66f9c-dkdhr 2/2 Running 0 8m19s
```

We need to set the following environment variables:

```
MODEL_NAME=gpt_fast
SERVICE_HOSTNAME=$(kubectl get inferenceservice torchserve -o jsonpath='{.status.url}' | cut -d "/" -f 3)
```

```
export INGRESS_HOST=localhost
export INGRESS_PORT=8080
```

```
INGRESS_GATEWAY_SERVICE=$(kubectl get svc --namespace istio-system --selector="app=istio-ingressgateway" --output jsonpath='{.items[0].metadata.name}')
kubectl port-forward --namespace istio-system svc/${INGRESS_GATEWAY_SERVICE} 8080:80 &
```

#### Model loading

Once the pod is up, model loading can take some time for large models. We can monitor the `ready` state to determine when the model is loaded.
You can use the following command to get the `ready` state:

```
curl -v -H "Content-Type: application/json" -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}
```
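
Once the model is loaded, the response should look roughly like the following (the KServe V1 protocol reports the model name and its readiness):

```
{"name": "gpt_fast", "ready": true}
```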

#### Inference request
Make an inference request:

```
curl -H "Content-Type: application/json" -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict -d @./sample_text.json
```

Expected output is

```
{"predictions":["is Paris. It is located in the northern central part of the country and is known for its stunning architecture, art museums, fashion, and historical landmarks. The city is home to many famous landmarks such as the Eiffel Tower"]}
```
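
Equivalently, here is a minimal Python sketch of the same request using the `requests` library; the hostname value is hypothetical (use the output of the `SERVICE_HOSTNAME` command above), and `localhost:8080` is the port-forwarded ingress from the earlier step:

```
import json

import requests

MODEL_NAME = "gpt_fast"
# Hypothetical value; substitute the actual $SERVICE_HOSTNAME.
SERVICE_HOSTNAME = "torchserve.default.example.com"

with open("sample_text.json") as f:
    payload = json.load(f)

# KServe routes on the Host header when going through the ingress gateway.
response = requests.post(
    f"http://localhost:8080/v1/models/{MODEL_NAME}:predict",
    json=payload,
    headers={"Host": SERVICE_HOSTNAME},
)
print(response.json())
```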


## Stop and Delete the cluster

```
minikube stop
minikube delete
```
13 changes: 13 additions & 0 deletions kubernetes/kserve/examples/gpt_fast/llama2_v1_gpu.yaml
@@ -0,0 +1,13 @@
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: "torchserve"
spec:
  predictor:
    pytorch:
      storageUri: gs://<gpt-fast-url>/v1
      image: abc/torchserve-kfs-nightly:latest-gpu
      resources:
        limits:
          memory: 20Gi
          nvidia.com/gpu: "1"
11 changes: 11 additions & 0 deletions kubernetes/kserve/examples/gpt_fast/sample_text.json
@@ -0,0 +1,11 @@
{
  "instances": [
    {
      "data": {
        "prompt": "The capital of France",
        "max_new_tokens": 50
      }
    }
  ]
}
Comment on lines +1 to +11

We should move to using the generate endpoint from the Open Inference Protocol once it is out.

Collaborator Author:

Noted. If there's an example, please point to it so we can update all our examples and tests.