Quickstart vLLM examples do not work as expected #3365
Description
🐛 Describe the bug
tl;dr: the Quickstart examples do not work as expected.
Our team is currently evaluating TorchServe to serve various LLMs, one of which is meta-llama/Llama-3.1-8B-Instruct.
The GPUs we are relying on are:
22:16 $ nvidia-smi -L
GPU 0: Tesla V100-PCIE-32GB (UUID: <UUID-1>)
GPU 1: Tesla V100-PCIE-32GB (UUID: <UUID-2>)
We have only just begun exploring TorchServe, and I'd like to report issues with the vLLM examples specified in the Quickstart section.
While trying out the quickstart examples to deploy the Llama models, we encountered the following issues:
1. ts.llm_launcher crashes with a ValueError
ts.llm_launcher was started like so:
python -m ts.llm_launcher --model_id meta-llama/Llama-3.2-3B-Instruct --disable_token_auth
This resulted in the server raising the following exception:
2024-11-18T21:24:57,141 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2024-11-18T21:24:57,141 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/privatecircle/torchenv/lib/python3.10/site-packages/ts/model_service_worker.py", line 301, in <module>
2024-11-18T21:24:57,141 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - worker.run_server()
2024-11-18T21:24:57,141 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/privatecircle/torchenv/lib/python3.10/site-packages/ts/model_service_worker.py", line 266, in run_server
2024-11-18T21:24:57,141 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self.handle_connection_async(cl_socket)
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/privatecircle/torchenv/lib/python3.10/site-packages/ts/model_service_worker.py", line 220, in handle_connection_async
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service, result, code = self.load_model(msg)
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/privatecircle/torchenv/lib/python3.10/site-packages/ts/model_service_worker.py", line 133, in load_model
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service = model_loader.load(
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/privatecircle/torchenv/lib/python3.10/site-packages/ts/model_loader.py", line 143, in load
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - initialize_fn(service.context)
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/privatecircle/torchenv/lib/python3.10/site-packages/ts/torch_handler/vllm_handler.py", line 47, in initialize
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self.vllm_engine = AsyncLLMEngine.from_engine_args(vllm_engine_config)
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 573, in from_engine_args
2024-11-18T21:24:57,142 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - engine = cls(
2024-11-18T21:24:57,143 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 473, in __init__
2024-11-18T21:24:57,143 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self.engine = self._engine_class(*args, **kwargs)
2024-11-18T21:24:57,143 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 257, in __init__
2024-11-18T21:24:57,143 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - super().__init__(*args, **kwargs)
2024-11-18T21:24:57,143 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 317, in __init__
2024-11-18T21:24:57,143 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self.model_executor = executor_class(
2024-11-18T21:24:57,143 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 222, in __init__
2024-11-18T21:24:57,143 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - super().__init__(*args, **kwargs)
2024-11-18T21:24:57,144 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
2024-11-18T21:24:57,144 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - super().__init__(*args, **kwargs)
2024-11-18T21:24:57,144 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
2024-11-18T21:24:57,144 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self._init_executor()
2024-11-18T21:24:57,144 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 124, in _init_executor
2024-11-18T21:24:57,144 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self._run_workers("init_device")
2024-11-18T21:24:57,144 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 199, in _run_workers
2024-11-18T21:24:57,144 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - driver_worker_output = driver_worker_method(*args, **kwargs)
2024-11-18T21:24:57,145 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/worker/worker.py", line 168, in init_device
2024-11-18T21:24:57,145 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - _check_if_gpu_supports_dtype(self.model_config.dtype)
2024-11-18T21:24:57,145 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/privatecircle/torchenv/lib/python3.10/site-packages/vllm/worker/worker.py", line 461, in _check_if_gpu_supports_dtype
2024-11-18T21:24:57,145 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - raise ValueError(
2024-11-18T21:24:57,145 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-PCIE-32GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.
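For reference, the compute-capability check that triggers this error can be reproduced directly with PyTorch (a minimal sketch, independent of TorchServe and vLLM):

```python
import torch

# Tesla V100 reports compute capability (7, 0); vLLM requires >= (8, 0) for bfloat16.
for idx in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(idx)
    name = torch.cuda.get_device_name(idx)
    print(f"GPU {idx}: {name} -> compute capability {major}.{minor}, "
          f"bf16 supported: {(major, minor) >= (8, 0)}")
```

On the V100s above this prints compute capability 7.0, which is why vLLM raises the error unless a non-bfloat16 dtype is explicitly requested.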
2. When --dtype is set to half, ts.llm_launcher does not honor the flag
This time, ts.llm_launcher was invoked like so:
python -m ts.llm_launcher --model_id meta-llama/Llama-3.2-3B-Instruct --disable_token_auth --dtype=half
ts.llm_launcher most definitely accepts a --dtype option (although the possible values for dtype are not documented at all), as the help output below shows; see also the sketch after the help output:
22:05 $ python -m ts.llm_launcher --help
usage: llm_launcher.py [-h] [--model_name MODEL_NAME] [--model_store MODEL_STORE] [--model_id MODEL_ID] [--disable_token_auth] [--vllm_engine.max_num_seqs VLLM_ENGINE.MAX_NUM_SEQS]
[--vllm_engine.max_model_len VLLM_ENGINE.MAX_MODEL_LEN] [--vllm_engine.download_dir VLLM_ENGINE.DOWNLOAD_DIR] [--startup_timeout STARTUP_TIMEOUT] [--engine ENGINE]
[--dtype DTYPE] [--trt_llm_engine.max_batch_size TRT_LLM_ENGINE.MAX_BATCH_SIZE]
[--trt_llm_engine.kv_cache_free_gpu_memory_fraction TRT_LLM_ENGINE.KV_CACHE_FREE_GPU_MEMORY_FRACTION]
options:
-h, --help show this help message and exit
--model_name MODEL_NAME
Model name
--model_store MODEL_STORE
Model store
--model_id MODEL_ID Model id
--disable_token_auth Disable token authentication
--vllm_engine.max_num_seqs VLLM_ENGINE.MAX_NUM_SEQS
Max sequences in vllm engine
--vllm_engine.max_model_len VLLM_ENGINE.MAX_MODEL_LEN
Model context length
--vllm_engine.download_dir VLLM_ENGINE.DOWNLOAD_DIR
Cache dir
--startup_timeout STARTUP_TIMEOUT
Model startup timeout in seconds
--engine ENGINE LLM engine
--dtype DTYPE        Data type   # This is the flag I am talking about
--trt_llm_engine.max_batch_size TRT_LLM_ENGINE.MAX_BATCH_SIZE
Max batch size
--trt_llm_engine.kv_cache_free_gpu_memory_fraction TRT_LLM_ENGINE.KV_CACHE_FREE_GPU_MEMORY_FRACTION
KV Cache free gpu memory fraction
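To isolate whether the problem is in vLLM itself or in how ts.llm_launcher forwards the flag, the dtype override can be exercised against vLLM directly, bypassing the launcher. A minimal sketch using vLLM's synchronous LLM entry point (rather than the AsyncLLMEngine the handler uses); it assumes vLLM is installed in the same environment and that Hugging Face access to the gated Llama repo is already configured:

```python
from vllm import LLM, SamplingParams

# dtype="half" (float16) is accepted on compute capability 7.0 GPUs such as the V100,
# so if this loads while the launcher still crashes, --dtype is not being forwarded.
llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct", dtype="half")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```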
3. The Quickstart LLM with Docker example points to a model that doesn't exist
The docker run snippet provided looks like this:
docker run --rm -ti --shm-size 10g --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/vllm --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token_auth
Executing this produces the following (abridged) stack trace:
2024-11-18T16:43:45,496 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - get_hf_file_metadata(url, token=token)
2024-11-18T16:43:45,497 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/venv/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
2024-11-18T16:43:45,497 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - return fn(*args, **kwargs)
2024-11-18T16:43:45,497 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/venv/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1296, in get_hf_file_metadata
2024-11-18T16:43:45,497 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - r = _request_wrapper(
2024-11-18T16:43:45,498 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/venv/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 277, in _request_wrapper
2024-11-18T16:43:45,498 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - response = _request_wrapper(
2024-11-18T16:43:45,498 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/venv/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 301, in _request_wrapper
2024-11-18T16:43:45,499 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - hf_raise_for_status(response)
2024-11-18T16:43:45,499 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/home/venv/lib/python3.9/site-packages/huggingface_hub/utils/_http.py", line 423, in hf_raise_for_status
2024-11-18T16:43:45,499 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - raise _format(GatedRepoError, message, response) from e
2024-11-18T16:43:45,500 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - huggingface_hub.errors.GatedRepoError: 403 Client Error. (Request ID: Root=1-673b6ec1-716b9ee4567a8483100930bb;77b39a73-4aed-4ccd-9b6b-3fdf1c8cd81c)
2024-11-18T16:43:45,500 [INFO ] W-9000-model_1.0-stdout MODEL_LOG -
2024-11-18T16:43:45,500 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Cannot access gated repo for url https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/resolve/main/config.json.
2024-11-18T16:43:45,501 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Access to model meta-llama/Meta-Llama-3-8B-Instruct is restricted and you are not in the authorized list. Visit https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct to ask for access.
On searching for this model on HuggingFace, I see this, indicating that this model probably doesn't exist.
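The 403 can also be reproduced outside the container to check whether the token actually has access to the gated repo, independent of TorchServe. A minimal sketch with huggingface_hub (the repo id is the one from the error above; assumes HUGGING_FACE_HUB_TOKEN is set as in the docker command):

```python
import os

from huggingface_hub import HfApi
from huggingface_hub.errors import GatedRepoError

repo_id = "meta-llama/Meta-Llama-3-8B-Instruct"
token = os.environ.get("HUGGING_FACE_HUB_TOKEN")

try:
    # Raises GatedRepoError (HTTP 403) if the token has not been granted access.
    info = HfApi().model_info(repo_id, token=token)
    print(f"Access OK: {info.id}")
except GatedRepoError as err:
    print(f"No access to {repo_id}: {err}")
```

If this also fails with GatedRepoError, the issue lies with the Hugging Face access grant (or with the model id used in the quickstart) rather than with the container itself.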
Error logs
When running both with and without --dtype:
(Identical traceback to the one shown under issue 1 above, ending in the Bfloat16 ValueError.)
When starting the Docker container:
(Identical GatedRepoError traceback to the one shown under issue 3 above.)
Installation instructions
Install torchserve from Source: No
Clone this repo as-is: Yes
Docker image built like so (as specified in the Quickstart):
docker build --pull . -f docker/Dockerfile.vllm -t ts/vllm
Model Packaging
We do not know if calling ts.llm_launcher packages the model for us.
config.properties
No response
Versions
------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch:
torchserve==0.12.0
torch-model-archiver==0.12.0
Python version: 3.10 (64-bit runtime)
Python executable: /home/<USER>/torchenv/bin/python
Versions of relevant python libraries:
captum==0.6.0
numpy==1.24.3
nvgpu==0.10.0
pillow==10.3.0
psutil==5.9.8
requests==2.32.0
sentencepiece==0.2.0
torch==2.4.0+cu121
torch-model-archiver==0.12.0
torch-workflow-archiver==0.2.15
torchaudio==2.4.0+cu121
torchserve==0.12.0
torchvision==0.19.0+cu121
transformers==4.46.2
wheel==0.42.0
torch==2.4.0+cu121
**Warning: torchtext not present ..
torchvision==0.19.0+cu121
torchaudio==2.4.0+cu121
Java Version:
OS: Ubuntu 22.04.5 LTS
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: N/A
CMake version: N/A
Is CUDA available: Yes
CUDA runtime version: N/A
GPU models and configuration:
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB
Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn.so.8.9.5
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.9.5
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.9.5
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.9.5
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.9.5
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.9.5
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.9.5
Environment:
library_path (LD_/DYLD_):
This script is unable to locate java, hence:
22:19 $ command -v java
/usr/bin/java
(torchenv) ✔ ~/workspace/github/serve [master|…1]
22:21 $
Repro instructions
- Clone this repo
- Checkout master
- Execute:
python ./ts_scripts/install_dependencies.py --cuda=cu121
- For Quickstart LLM Deployment:
python -m ts.llm_launcher --model_id meta-llama/Llama-3.2-3B-Instruct --disable_token_auth
- For Quickstart LLM Deployment with Docker:
docker run --rm -ti --shm-size 10g --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/vllm --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token_auth
Possible Solution
We don't know what the solution to these issues could be, and as a result cannot propose one.
