KServe gpt-fast example #2895

Merged
6 commits merged into master from examples/kserve_llama on Jan 12, 2024

Conversation

agunapal
Collaborator

Description

This PR demonstrates how to serve large language models with KServe.

  • It also shows how to build a custom KServe image for GPT-Fast.

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

  • Model loading
Defaulted container "kserve-container" out of: kserve-container, queue-proxy, storage-initializer (init)
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-01-12T18:25:14,992 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2024-01-12T18:25:14,996 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2024-01-12T18:25:15,164 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
2024-01-12T18:25:15,346 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.9.0
TS Home: /home/venv/lib/python3.9/site-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Metrics config path: /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
Number of GPUs: 1
Number of CPUs: 1
Max heap size: 4949 M
Python executable: /home/venv/bin/python
Config file: /mnt/models/config/config.properties
Inference address: http://0.0.0.0:8085
Management address: http://0.0.0.0:8085
Metrics address: http://0.0.0.0:8082
Model Store: /mnt/models/model-store
Initial Models: N/A
Log dir: /home/model-server/logs
Metrics dir: /home/model-server/logs
Netty threads: 4
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: false
Workflow Store: /mnt/models/model-store
Model config: N/A
2024-01-12T18:25:15,375 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2024-01-12T18:25:15,397 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Started restoring models from snapshot {"name":"startup.cfg","modelCount":1,"models":{"gpt_fast":{"1.0":{"defaultVersion":true,"marName":"gpt_fast","minWorkers":1,"maxWorkers":5,"batchSize":1,"maxBatchDelay":10,"responseTimeout":300}}}}
2024-01-12T18:25:15,405 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Validating snapshot startup.cfg
2024-01-12T18:25:15,406 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Snapshot startup.cfg validated successfully
2024-01-12T18:25:15,424 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createTempDir /home/model-server/tmp/models/87d3cb1dfb3d4022befaf7e2e87876e9
2024-01-12T18:25:15,424 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createSymbolicDir /home/model-server/tmp/models/87d3cb1dfb3d4022befaf7e2e87876e9/gpt_fast
2024-01-12T18:25:15,434 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model gpt_fast
2024-01-12T18:25:15,434 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model gpt_fast
2024-01-12T18:25:15,435 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model gpt_fast
2024-01-12T18:25:15,435 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model gpt_fast loaded.
2024-01-12T18:25:15,435 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: gpt_fast, count: 1
2024-01-12T18:25:15,467 [DEBUG] W-9000-gpt_fast_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-01-12T18:25:15,472 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2024-01-12T18:25:15,673 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8085
2024-01-12T18:25:15,674 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2024-01-12T18:25:15,675 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082
Model server started.
2024-01-12T18:25:16,505 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2024-01-12T18:25:17,792 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
2024-01-12T18:25:17,794 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:29.50983428955078|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
2024-01-12T18:25:17,794 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:261.04111099243164|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
2024-01-12T18:25:17,794 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:89.8|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
2024-01-12T18:25:17,795 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.0|#Level:Host,DeviceId:0|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
2024-01-12T18:25:17,795 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:0.0|#Level:Host,DeviceId:0|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
2024-01-12T18:25:17,795 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:0|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
2024-01-12T18:25:17,796 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:28758.0859375|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
2024-01-12T18:25:17,796 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:2494.82421875|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
2024-01-12T18:25:17,796 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:9.4|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083917
INFO:root:Wrapper : Model names ['gpt_fast'], inference address http://0.0.0.0:8085, management address http://0.0.0.0:8085, grpc_inference_address, 0.0.0.0:7070, model store /mnt/models/model-store
INFO:root:Predict URL set to 0.0.0.0:8085
INFO:root:Explain URL set to 0.0.0.0:8085
INFO:root:Protocol version is v1
INFO:root:Copying contents of /mnt/models/model-store to local
INFO:root:Loading gpt_fast .. 1 of 10 tries..
2024-01-12T18:25:19,105 [INFO ] epollEventLoopGroup-3-1 ACCESS_LOG - /127.0.0.1:45702 "GET /models/gpt_fast?customized=false HTTP/1.1" 200 31
2024-01-12T18:25:19,106 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083919
INFO:root:Sleep 30 seconds for load gpt_fast..
2024-01-12T18:25:19,959 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=34
2024-01-12T18:25:19,960 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2024-01-12T18:25:19,968 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2024-01-12T18:25:19,968 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - [PID]34
2024-01-12T18:25:19,968 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Torch worker started.
2024-01-12T18:25:19,969 [DEBUG] W-9000-gpt_fast_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-gpt_fast_1.0 State change null -> WORKER_STARTED
2024-01-12T18:25:19,969 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Python runtime: 3.9.18
2024-01-12T18:25:19,972 [INFO ] W-9000-gpt_fast_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2024-01-12T18:25:19,978 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2024-01-12T18:25:19,980 [DEBUG] W-9000-gpt_fast_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD repeats 1 to backend at: 1705083919980
2024-01-12T18:25:19,982 [INFO ] W-9000-gpt_fast_1.0 org.pytorch.serve.wlm.WorkerThread - Looping backend response at: 1705083919982
2024-01-12T18:25:19,997 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - model_name: gpt_fast, batchSize: 1
2024-01-12T18:25:20,980 [WARN ] W-9000-gpt_fast_1.0-stderr MODEL_LOG - /home/venv/lib/python3.9/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
2024-01-12T18:25:20,981 [WARN ] W-9000-gpt_fast_1.0-stderr MODEL_LOG -   _torch_pytree._register_pytree_node(
2024-01-12T18:25:24,609 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Enabled tensor cores
2024-01-12T18:25:24,610 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - proceeding without onnxruntime
2024-01-12T18:25:24,610 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2024-01-12T18:25:24,614 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Loading model ...
INFO:root:Loading gpt_fast .. 2 of 10 tries..
2024-01-12T18:25:49,284 [INFO ] epollEventLoopGroup-3-2 ACCESS_LOG - /127.0.0.1:34458 "GET /models/gpt_fast?customized=false HTTP/1.1" 200 111
2024-01-12T18:25:49,285 [INFO ] epollEventLoopGroup-3-2 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083949
INFO:root:Sleep 30 seconds for load gpt_fast..
2024-01-12T18:26:18,993 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:33.3|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083978
2024-01-12T18:26:18,996 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:29.509563446044922|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083978
2024-01-12T18:26:18,996 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:261.0413818359375|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083978
2024-01-12T18:26:19,060 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:89.8|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083979
2024-01-12T18:26:19,060 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:56.487754038561754|#Level:Host,DeviceId:0|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083979
2024-01-12T18:26:19,060 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:13008.0|#Level:Host,DeviceId:0|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083979
2024-01-12T18:26:19,061 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:0|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083979
2024-01-12T18:26:19,061 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:26425.05859375|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083979
2024-01-12T18:26:19,061 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:4818.03515625|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083979
2024-01-12T18:26:19,061 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:16.7|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083979
INFO:root:Loading gpt_fast .. 3 of 10 tries..
2024-01-12T18:26:19,387 [INFO ] epollEventLoopGroup-3-3 ACCESS_LOG - /127.0.0.1:51504 "GET /models/gpt_fast?customized=false HTTP/1.1" 200 25
2024-01-12T18:26:19,389 [INFO ] epollEventLoopGroup-3-3 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083979
INFO:root:Sleep 30 seconds for load gpt_fast..
2024-01-12T18:26:22,694 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Time to load model: 58.08 seconds
2024-01-12T18:26:22,731 [INFO ] W-9000-gpt_fast_1.0-stdout MODEL_LOG - Stream is False
2024-01-12T18:26:22,733 [INFO ] W-9000-gpt_fast_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 62751
2024-01-12T18:26:22,734 [DEBUG] W-9000-gpt_fast_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-gpt_fast_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2024-01-12T18:26:22,734 [INFO ] W-9000-gpt_fast_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:67271.0|#WorkerName:W-9000-gpt_fast_1.0,Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083982
2024-01-12T18:26:22,734 [INFO ] W-9000-gpt_fast_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:3.0|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705083982
INFO:root:Loading gpt_fast .. 4 of 10 tries..
2024-01-12T18:26:49,445 [INFO ] epollEventLoopGroup-3-4 ACCESS_LOG - /127.0.0.1:47110 "GET /models/gpt_fast?customized=false HTTP/1.1" 200 23
2024-01-12T18:26:49,446 [INFO ] epollEventLoopGroup-3-4 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084009
INFO:root:Sleep 30 seconds for load gpt_fast..
2024-01-12T18:27:17,123 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
2024-01-12T18:27:17,124 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:29.509357452392578|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
2024-01-12T18:27:17,124 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:261.04158782958984|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
2024-01-12T18:27:17,124 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:89.8|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
2024-01-12T18:27:17,125 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:59.24960917144346|#Level:Host,DeviceId:0|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
2024-01-12T18:27:17,125 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:13644.0|#Level:Host,DeviceId:0|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
2024-01-12T18:27:17,125 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:0|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
2024-01-12T18:27:17,125 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:26458.15234375|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
2024-01-12T18:27:17,125 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:4784.8046875|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
2024-01-12T18:27:17,126 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:16.6|#Level:Host|#hostname:torchserve-predictor-00001-deployment-5b5b7874fc-7ttk2,timestamp:1705084037
INFO:root:The model gpt_fast is ready
INFO:root:TSModelRepo is initialized
INFO:kserve:Registering model: gpt_fast
INFO:kserve:Setting max asyncio worker threads as 12
INFO:kserve:Starting uvicorn with 1 workers
2024-01-12 18:27:19.546 uvicorn.error INFO:     Started server process [9]
2024-01-12 18:27:19.546 uvicorn.error INFO:     Waiting for application startup.
2024-01-12 18:27:19.585 9 kserve INFO [start():62] Starting gRPC server on [::]:8081
2024-01-12 18:27:19.585 uvicorn.error INFO:     Application startup complete.
2024-01-12 18:27:19.586 uvicorn.error INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
  • Inference
curl -v -H "Content-Type: application/json" -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict -d @./sample_text.json
*   Trying 127.0.0.1:8080...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> POST /v1/models/gpt_fast:predict HTTP/1.1
> Host: torchserve.default.example.com
> User-Agent: curl/7.81.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 129
> 
Handling connection for 8080
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-length: 246
< content-type: application/json
< date: Fri, 12 Jan 2024 18:33:25 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 1315
< 
* Connection #0 to host localhost left intact
{"predictions":["is Paris. It is located in the northern central part of the country and is known for its stunning architecture, art museums, fashion, and historical landmarks. The city is home to many famous landmarks such as the Eiffel Tower"]}

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

Comment on lines +1 to +11 of sample_text.json
{
  "instances": [
    {
      "data":
        {
          "prompt": "The capital of France",
          "max_new_tokens": 50
        }
    }
  ]
}

We should move to using the generate endpoint from the open inference protocol once it is out.

agunapal (Collaborator, Author):

Noted. If there's an example, please point to it so we can update all our examples and tests.

@lxning (Collaborator) left a comment:

Did you check whether streaming responses work with KServe?

Comment on lines +121 to 123
if isinstance(input_data, str):
    input_data = json.loads(input_data)

Collaborator:

I don't think this part is needed. Could you please check whether the client passes the content type as JSON?

agunapal (Collaborator, Author):

It doesn't work without this because KServe does some pre-processing and reads the "data" part as a dict.
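For context, this is a minimal sketch of the guard being discussed, not the PR's actual handler code; the helper name and the per-row access pattern are assumptions made for illustration.

import json

def _extract_prompt_config(row):
    # Hypothetical helper. In TorchServe handlers the payload for each request
    # row is usually found under "data" (or "body"); after KServe's
    # pre-processing it may already be a parsed dict, but it can also arrive
    # as raw bytes or a JSON string, so normalize it before indexing into it.
    input_data = row.get("data") or row.get("body")
    if isinstance(input_data, (bytes, bytearray)):
        input_data = input_data.decode("utf-8")
    if isinstance(input_data, str):
        input_data = json.loads(input_data)
    # input_data now looks like {"prompt": "...", "max_new_tokens": 50}
    return input_data["prompt"], input_data.get("max_new_tokens", 50)

With that check in place the handler works whether the "data" field arrives as a parsed dict or as a serialized JSON string.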

@agunapal (Collaborator, Author):

> Did you check whether streaming responses work with KServe?

I tried this, but it didn't work for me. I will track it separately.

@mreso (Collaborator) left a comment:

LGTM

kubernetes/kserve/examples/gpt_fast/README.md (review comment resolved; outdated)
@agunapal added this pull request to the merge queue on Jan 12, 2024
Merged via the queue into master with commit 0d11f4c on Jan 12, 2024
13 checks passed
@agunapal deleted the examples/kserve_llama branch on January 13, 2024 at 00:32
@chauhang added this to the v0.10.0 milestone on Feb 27, 2024