KServe gpt-fast example #2895

Merged
6 commits merged into master from examples/kserve_llama on Jan 12, 2024

Conversation

agunapal
Collaborator

Description

This PR demonstrates how to serve large language models with KServe.

  • It also shows how to build a custom KServe image for GPT-Fast.

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

  • Model loading
Defaulted container "kserve-container" out of: kserve-container, queue-proxy, storage-initializer (init)
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-01-12T18:25:14,992 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2024-01-12T18:25:14,996 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2024-01-12T18:25:15,164 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
2024-01-12T18:25:15,346 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.9.0
TS Home: /home/venv/lib/python3.9/site-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Metrics config path: /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
Number of GPUs: 1
Number of CPUs: 1
Max heap size: 4949 M
Python executable: /home/venv/bin/python
Config file: /mnt/models/config/config.properties
Inference address: http://0.0.0.0:8085
Management address: http://0.0.0.0:8085
Metrics address: http://0.0.0.0:8082
Model Store: /mnt/models/model-store
Initial Models: N/A
Log dir: /home/model-server/logs
Metrics dir: /home/model-server/logs
Netty threads: 4
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: false
Workflow Store: /mnt/models/model-store
Model config: N/A
2024-01-12T18:25:15,375 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2024-01-12T18:25:15,397 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Started restoring models from snapshot {"name":"startup.cfg","modelCount":1,"models":{"gpt_fast":{"1.0":{"defaultVersion":true,"marName":"gpt_fast","minWorkers":1,"maxWorkers":5,"batchSize":1,"maxBatchDelay":10,"responseTimeout":300}}}}
2024-01-12T18:25:15,405 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Validating snapshot startup.cfg
2024-01-12T18:25:15,406 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Snapshot startup.cfg validated successfully
2024-01-12T18:25:15,424 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createTempDir /home/model-server/tmp/models/87d3cb1dfb3d4022befaf7e2e87876e9
2024-01-12T18:25:15,424 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createSymbolicDir /home/model-server/tmp/models/87d3cb1dfb3d4022befaf7e2e87876e9/gpt_fast
2024-01-12T18:25:15,434 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model gpt_fast
2024-01-12T18:25:15,434 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model gpt_fast
2024-01-12T18:25:15,435 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model gpt_fast
2024-01-12T18:25:15,435 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model gpt_fast loaded.
2024-01-12T18:25:15,435 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: gpt_fast, count: 1
2024-01-12T18:25:15,467 [DEBUG] W-9000-gpt_fast_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-01-12T18:25:15,472 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2024-01-12T18:25:15,673 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8085
2024-01-12T18:25:15,674 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2024-01-12T18:25:15,675 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082
Model server started.
2024-01-12T18:25:16,505 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2024-01-12T18:25:17,792 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
2024-01-12T18:25:17,794 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:29.50983428955078|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
2024-01-12T18:25:17,794 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:261.04111099243164|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
2024-01-12T18:25:17,794 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:89.8|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
2024-01-12T18:25:17,795 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.0|#Level:Host,DeviceId:0|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
2024-01-12T18:25:17,795 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:0.0|#Level:Host,DeviceId:0|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
2024-01-12T18:25:17,795 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:0|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
2024-01-12T18:25:17,796 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:28758.0859375|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
2024-01-12T18:25:17,796 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:2494.82421875|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
2024-01-12T18:25:17,796 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:9.4|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
INFO:root:Wrapper : Model names ['gpt_fast'], inference address http://0.0.0.0:8085, management address http://0.0.0.0:8085, grpc_inference_address, 0.0.0.0:7070, model store /mnt/models/model-store
INFO:root:Predict URL set to 0.0.0.0:8085
INFO:root:Explain URL set to 0.0.0.0:8085
INFO:root:Protocol version is v1
INFO:root:Copying contents of /mnt/models/model-store to local
INFO:root:Loading gpt_fast .. 1 of 10 tries..
2024-01-12T18:25:19,105 [INFO ] epollEventLoopGroup-3-1 ACCESS_LOG - /127.0.0.1:45702 "GET /models/gpt_fast?customized=false HTTP/1.1" 200 31
2024-01-12T18:25:19,106 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083919
INFO:root:Sleep 30 seconds for load gpt_fast..
2024-01-12T18:25:19,959 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=34
2024-01-12T18:25:19,960 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2024-01-12T18:25:19,968 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2024-01-12T18:25:19,968 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - [PID]34
2024-01-12T18:25:19,968 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Torch worker started.
2024-01-12T18:25:19,969 [DEBUG] W-9000-gpt_fast_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-gpt_fast_1.0 State change null -> WORKER_STARTED
2024-01-12T18:25:19,969 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Python runtime: 3.9.18
2024-01-12T18:25:19,972 [INFO ] W-9000-gpt_fast_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2024-01-12T18:25:19,978 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2024-01-12T18:25:19,980 [DEBUG] W-9000-gpt_fast_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD repeats 1 to backend at: 1705083919980
2024-01-12T18:25:19,982 [INFO ] W-9000-gpt_fast_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1705083919982
2024-01-12T18:25:19,997 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - model_name: gpt_fast, batchSize: 1
2024-01-12T18:25:20,980 [WARN ] W-9000-gpt_fast_1.0-stderr MODEL_LOG - /home/venv/lib/python3.9/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
2024-01-12T18:25:20,981 [WARN ] W-9000-gpt_fast_1.0-stderr MODEL_LOG -   _torch_pytree._register_pytree_node(
2024-01-12T18:25:24,609 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Enabled tensor cores
2024-01-12T18:25:24,610 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - proceeding without onnxruntime
2024-01-12T18:25:24,610 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2024-01-12T18:25:24,614 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Loading model ...
INFO:root:Loading gpt_fast .. 2 of 10 tries..
2024-01-12T18:25:49,284 [INFO ] epollEventLoopGroup-3-2 ACCESS_LOG - /127.0.0.1:34458 "GET /models/gpt_fast?customized=false HTTP/1.1" 200 111
2024-01-12T18:25:49,285 [INFO ] epollEventLoopGroup-3-2 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083949
INFO:root:Sleep 30 seconds for load gpt_fast..
2024-01-12T18:26:18,993 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:33.3|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083978
2024-01-12T18:26:18,996 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:29.509563446044922|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083978
2024-01-12T18:26:18,996 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:261.0413818359375|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083978
2024-01-12T18:26:19,060 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:89.8|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083979
2024-01-12T18:26:19,060 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:56.487754038561754|#Level:Host,DeviceId:0|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083979
2024-01-12T18:26:19,060 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:13008.0|#Level:Host,DeviceId:0|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083979
2024-01-12T18:26:19,061 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:0|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083979
2024-01-12T18:26:19,061 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:26425.05859375|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083979
2024-01-12T18:26:19,061 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:4818.03515625|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083979
2024-01-12T18:26:19,061 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:16.7|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083979
INFO:root:Loading gpt_fast .. 3 of 10 tries..
2024-01-12T18:26:19,387 [INFO ] epollEventLoopGroup-3-3 ACCESS_LOG - /127.0.0.1:51504 "GET /models/gpt_fast?customized=false HTTP/1.1" 200 25
2024-01-12T18:26:19,389 [INFO ] epollEventLoopGroup-3-3 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083979
INFO:root:Sleep 30 seconds for load gpt_fast..
2024-01-12T18:26:22,694 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Time to load model: 58.08 seconds
2024-01-12T18:26:22,731 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Stream is False
2024-01-12T18:26:22,733 [INFO ] W-9000-gpt_fast_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 62751
2024-01-12T18:26:22,734 [DEBUG] W-9000-gpt_fast_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-gpt_fast_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2024-01-12T18:26:22,734 [INFO ] W-9000-gpt_fast_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:67271.0|#WorkerName:W-9000-gpt_fast_1.0,Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083982
2024-01-12T18:26:22,734 [INFO ] W-9000-gpt_fast_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:3.0|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083982
INFO:root:Loading gpt_fast .. 4 of 10 tries..
2024-01-12T18:26:49,445 [INFO ] epollEventLoopGroup-3-4 ACCESS_LOG - /127.0.0.1:47110 "GET /models/gpt_fast?customized=false HTTP/1.1" 200 23
2024-01-12T18:26:49,446 [INFO ] epollEventLoopGroup-3-4 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084009
INFO:root:Sleep 30 seconds for load gpt_fast..
2024-01-12T18:27:17,123 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
2024-01-12T18:27:17,124 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:29.509357452392578|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
2024-01-12T18:27:17,124 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:261.04158782958984|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
2024-01-12T18:27:17,124 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:89.8|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
2024-01-12T18:27:17,125 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:59.24960917144346|#Level:Host,DeviceId:0|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
2024-01-12T18:27:17,125 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:13644.0|#Level:Host,DeviceId:0|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
2024-01-12T18:27:17,125 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:0|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
2024-01-12T18:27:17,125 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:26458.15234375|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
2024-01-12T18:27:17,125 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:4784.8046875|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
2024-01-12T18:27:17,126 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:16.6|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
INFO:root:The model gpt_fast is ready
INFO:root:TSModelRepo is initialized
INFO:kserve:Registering model: gpt_fast
INFO:kserve:Setting max asyncio worker threads as 12
INFO:kserve:Starting uvicorn with 1 workers
2024-01-12 18:27:19.546 uvicorn.error INFO:     Started server process [9]
2024-01-12 18:27:19.546 uvicorn.error INFO:     Waiting for application startup.
2024-01-12 18:27:19.585 9 kserve INFO [start():62] Starting gRPC server on [::]:8081
2024-01-12 18:27:19.585 uvicorn.error INFO:     Application startup complete.
2024-01-12 18:27:19.586 uvicorn.error INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
  • Inference
curl -v -H "Content-Type: application/json" -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict -d @./sample_text.json
*   Trying 127.0.0.1:8080...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> POST /v1/models/gpt_fast:predict HTTP/1.1
> Host: torchserve.default.example.com
> User-Agent: curl/7.81.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 129
> 
Handling connection for 8080
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-length: 246
< content-type: application/json
< date: Fri, 12 Jan 2024 18:33:25 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 1315
< 
* Connection #0 to host localhost left intact
{"predictions":["is Paris. It is located in the northern central part of the country and is known for its stunning architecture, art museums, fashion, and historical landmarks. The city is home to many famous landmarks such as the Eiffel Tower"]}

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

Comment on lines +1 to +11 of sample_text.json
{
  "instances": [
    {
      "data":
        {
          "prompt": "The capital of France",
          "max_new_tokens": 50
        }
    }
  ]
}

We should move to using the generate endpoint from the open inference protocol once it is out.

agunapal (Collaborator, Author):

Noted. If there's an example, please point to it so we can update all our examples and tests.

@lxning (Collaborator) left a comment:

Did you check whether streaming responses work with KServe?

Comment on lines +121 to 123
if isinstance(input_data, str):
    input_data = json.loads(input_data)

Collaborator:

I don't think this part is needed. Could you please check whether the client passes the content type as JSON?

agunapal (Collaborator, Author):

It doesn't work without this because KServe does some pre-processing and reads the "data" part as a dict.
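For context, this is a minimal sketch of the guard being discussed, not the PR's actual handler code; the helper name and the per-row access pattern are assumptions made for illustration.

import json

def _extract_prompt_config(row):
    # Hypothetical helper. In TorchServe handlers the payload for each request
    # row is usually found under "data" (or "body"); after KServe's
    # pre-processing it may already be a parsed dict, but it can also arrive
    # as raw bytes or a JSON string, so normalize it before indexing into it.
    input_data = row.get("data") or row.get("body")
    if isinstance(input_data, (bytes, bytearray)):
        input_data = input_data.decode("utf-8")
    if isinstance(input_data, str):
        input_data = json.loads(input_data)
    # input_data now looks like {"prompt": "...", "max_new_tokens": 50}
    return input_data["prompt"], input_data.get("max_new_tokens", 50)

With that check in place the handler works whether the "data" field arrives as a parsed dict or as a serialized JSON string.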

@agunapal (Collaborator, Author):

> Did you check whether streaming responses work with KServe?

I tried this, but it didn't work for me. I will track it separately.

@mreso (Collaborator) left a comment:

LGTM

kubernetes/kserve/examples/gpt_fast/README.md (review comment resolved; outdated)
@agunapal added this pull request to the merge queue on Jan 12, 2024
Merged via the queue into master with commit 0d11f4c on Jan 12, 2024
13 checks passed
@agunapal deleted the examples/kserve_llama branch on January 13, 2024 at 00:32
@chauhang added this to the v0.10.0 milestone on Feb 27, 2024