KServe gpt-fast example #2895
Changes from 5 commits
f4bd3cf
a5a64bf
baf0dd6
166fddd
b420623
2b6bc6e
@@ -0,0 +1,5 @@
FROM pytorch/torchserve-kfs-nightly:latest-gpu
USER root
RUN pip uninstall torchtext torchdata torch torchvision torchaudio -y
RUN pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121 --ignore-installed && pip install sentencepiece chardet requests
USER model-server
@@ -0,0 +1,133 @@
# Text generation with GPT Fast using KServe

[GPT Fast](https://github.com/pytorch-labs/gpt-fast) is a PyTorch-native solution for running optimized GPT models. We are using the GPT Fast version of `llama2-7b-chat-hf`.
In this example, we show how to serve the GPT Fast version of Llama 2 with KServe.
We will serve the model with KServe deployed on a single instance using [minikube](https://minikube.sigs.k8s.io/docs/start/). The same solution can be extended to the Kubernetes offerings of the various cloud providers.

The inference service returns the text generated for the given prompt.

## KServe Image

Before we set up the infrastructure, we need the correct docker image for running this example.
Currently, GPT-Fast needs PyTorch >= 2.2 nightlies to run. The nightly image published by TorchServe doesn't include this, so we need to build a custom image.

#### Build custom KServe image

The Dockerfile takes the nightly TorchServe KServe image and installs the PyTorch nightlies on top of it.

If your Docker Hub username is `abc`, build the image with the following command

```
docker build . -t abc/torchserve-kfs-nightly:latest-gpu
```
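
Optionally, you can sanity-check that the custom image actually picked up a PyTorch nightly build before publishing it. This is only a sketch and assumes `python3` is available on the image's PATH:

```
# Print the PyTorch version baked into the custom image (assumes python3 is on PATH)
docker run --rm --entrypoint python3 abc/torchserve-kfs-nightly:latest-gpu -c "import torch; print(torch.__version__)"
```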

#### Publish KServe image

Make sure you have logged in to your account using `docker login`

```
docker push abc/torchserve-kfs-nightly:latest-gpu
```

### GPT-Fast model archive

You can refer to the following [link](https://github.com/pytorch/serve/blob/master/examples/large_models/gpt_fast/README.md) to create a TorchServe model archive of GPT-Fast.
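
As a rough sketch only, the archive is built with `torch-model-archiver`; the actual handler, model config, and extra files are described in the linked README, and the bracketed names below are placeholders rather than the real file names:

```
# Placeholder sketch -- follow the linked gpt_fast README for the real arguments
torch-model-archiver --model-name gpt_fast --version 1.0 \
  --handler <handler from the gpt_fast example> \
  --config-file <model config yaml from the gpt_fast example> \
  --archive-format no-archive
```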

You will need to publish the config and the model store to an accessible bucket.
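
For example, with a GCS bucket the upload could look like the sketch below; the bucket name is a placeholder, and the `config/` and `model-store/` layout is assumed to match what `storageUri` in `llama2_v1_gpu.yaml` points to:

```
# Placeholder bucket name; the v1 prefix should match the storageUri in llama2_v1_gpu.yaml
gsutil cp -r config gs://<your-bucket>/v1/
gsutil cp -r model-store gs://<your-bucket>/v1/
```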

Now we are ready to start deploying the published model.

## Install KServe

Start a minikube cluster

```
minikube start --gpus all
```

Install KServe locally.
```
curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.11/hack/quick_install.sh" | bash
```

Make sure KServe is installed on the minikube cluster using

```
kubectl get pods -n kserve
```

This should result in
```
NAME                                         READY   STATUS    RESTARTS   AGE
kserve-controller-manager-57574b4878-rnsjn   2/2     Running   0          17s
```

TorchServe supports the KServe V1 and V2 protocols. We show how to deploy GPT-Fast with the V1 protocol.

## KServe V1 protocol

Deploy the `InferenceService` with the KServe V1 protocol

```
kubectl apply -f llama2_v1_gpu.yaml
```

which results in

```
inferenceservice.serving.kserve.io/torchserve created
```
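
You can also check the overall status of the `InferenceService`; once the predictor and the model are available, its `READY` column should show `True`:

```
# Shows URL and readiness of the InferenceService created above
kubectl get inferenceservice torchserve
```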

We need to wait until the pod is up

```
kubectl get pods
NAME                                                  READY   STATUS    RESTARTS   AGE
torchserve-predictor-00001-deployment-8d66f9c-dkdhr   2/2     Running   0          8m19s
```

We need to set the following environment variables

```
MODEL_NAME=gpt_fast
SERVICE_HOSTNAME=$(kubectl get inferenceservice torchserve -o jsonpath='{.status.url}' | cut -d "/" -f 3)
```

```
export INGRESS_HOST=localhost
export INGRESS_PORT=8080
```

We port-forward the Istio ingress gateway so that the service is reachable at `localhost:8080`

```
INGRESS_GATEWAY_SERVICE=$(kubectl get svc --namespace istio-system --selector="app=istio-ingressgateway" --output jsonpath='{.items[0].metadata.name}')
kubectl port-forward --namespace istio-system svc/${INGRESS_GATEWAY_SERVICE} 8080:80 &
```

#### Model loading

Once the pod is up, model loading can take some time for large models. We can monitor the `ready` state to determine when the model is loaded.
You can use the following command to get the `ready` state.

```
curl -v -H "Content-Type: application/json" -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}
```
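
If you want to block until the model is loaded, a small polling loop like the sketch below works; it assumes the response contains a `"ready": true` field once loading is complete:

```
# Poll the readiness endpoint every 10s until the model reports ready
until curl -s -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME} | grep -qiE '"ready" *: *"?true'; do
  echo "Waiting for ${MODEL_NAME} to load..."
  sleep 10
done
```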

#### Inference request
Make an inference request

```
curl -H "Content-Type: application/json" -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict -d @./sample_text.json
```
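
Equivalently, the payload from `sample_text.json` can be passed inline:

```
# Same request with the JSON payload passed inline instead of from a file
curl -H "Content-Type: application/json" -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict \
  -d '{"instances": [{"data": {"prompt": "The capital of France", "max_new_tokens": 50}}]}'
```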

Expected output is

```
{"predictions":["is Paris. It is located in the northern central part of the country and is known for its stunning architecture, art museums, fashion, and historical landmarks. The city is home to many famous landmarks such as the Eiffel Tower"]}
```

## Stop and Delete the cluster

```
minikube stop
minikube delete
```
@@ -0,0 +1,13 @@
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: "torchserve"
spec:
  predictor:
    pytorch:
      storageUri: gs://<gpt-fast-url>/v1
      image: abc/torchserve-kfs-nightly:latest-gpu
      resources:
        limits:
          memory: 20Gi
          nvidia.com/gpu: "0"
@@ -0,0 +1,11 @@
{
    "instances": [
        {
            "data":
            {
                "prompt": "The capital of France",
                "max_new_tokens": 50
            }
        }
    ]
}
Comment on lines +1 to +11

We should move to use the generate endpoint from the open inference protocol once it is out.

Noted. If there's an example, please point to it so we can update all our examples and tests.
I don't think this part is needed. Could you please check if the client passes the content type as JSON?

It doesn't work without this because KServe does some pre-processing and reads the "data" part as a dict.