Description
Your current environment
The output of commands above
🐛 Describe the bug
Setting the VLLM_TORCH_PROFILER_DIR environment variable to a Google Cloud Storage (GCS) bucket path (e.g., gs://my-bucket/profiles) does not work as expected. Instead of saving the profiling data to the GCS bucket, vLLM creates a local directory with a malformed name (e.g., ./gs:/...).
This issue stems from the base vllm library, which processes the VLLM_TORCH_PROFILER_DIR variable before it is used by tpu-inference. The base library unconditionally applies os.path.abspath() to the path, which is not compatible with GCS URIs and incorrectly converts it to a local path.
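A minimal sketch of the failure, assuming a POSIX system: the helper `upstream_resolve` is a hypothetical name that mirrors the `abspath(expanduser(...))` processing described above. It is not part of vllm. Path normalization collapses the `//` after the scheme and prepends the current working directory, which is how the malformed local directory appears.

```python
import os


def upstream_resolve(raw):
    """Hypothetical helper mirroring the abspath(expanduser(...)) processing."""
    if raw is None:
        return None
    return os.path.abspath(os.path.expanduser(raw))


# The URI is treated as a relative local path: the current working
# directory is prepended and "gs://" is normalized to "gs:/".
mangled = upstream_resolve("gs://my-bucket/profiles")
print(mangled)
```

On POSIX the printed value is the current working directory followed by `/gs:/my-bucket/profiles`, so the original URI can no longer be recovered reliably by downstream code.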
The problematic upstream code: https://github.com/vllm-project/vllm/blob/5e0c1fe69c516fe4796965185c7d7ca503e44e92/vllm/envs.py#L821-L827
# vllm/vllm/envs.py
"VLLM_TORCH_PROFILER_DIR": lambda: (
    None
    if os.getenv("VLLM_TORCH_PROFILER_DIR", None) is None
    else os.path.abspath(
        os.path.expanduser(os.getenv("VLLM_TORCH_PROFILER_DIR", "."))
    )
),
I'd suggest using a different environment variable for profiling. Since we are actually using the JAX profiler, it doesn't make sense to use an environment variable with TORCH in the name, though the old variable could be phased out gradually. The vllm profiling docs could be updated to mention the new variable.
Something like VLLM_JAX_PROFILER_DIR would:
- Avoid the naming confusion with the PyTorch profiler.
- Allow tpu-inference to handle the path logic correctly without being affected by the base library's processing.
- Provide a clear separation of concerns between the GPU and TPU profiling.
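One way the path handling on the tpu-inference side could look is sketched below. This is an illustration under stated assumptions, not the project's implementation: `VLLM_JAX_PROFILER_DIR` is the variable proposed in this issue and does not exist yet, and `resolve_profiler_dir` and `get_jax_profiler_dir` are hypothetical helper names. The idea is to pass anything that looks like a URI (`scheme://...`) through untouched and apply the local-path normalization only to plain paths.

```python
import os
import re

# Matches a leading URI scheme such as "gs://" or "s3://".
_URI_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def resolve_profiler_dir(raw):
    """Return URIs unchanged; normalize only local filesystem paths."""
    if raw is None:
        return None
    if _URI_SCHEME.match(raw):
        return raw  # e.g. gs://my-bucket/profiles stays intact
    return os.path.abspath(os.path.expanduser(raw))


def get_jax_profiler_dir(environ=os.environ):
    """Read the proposed (hypothetical) VLLM_JAX_PROFILER_DIR variable."""
    return resolve_profiler_dir(environ.get("VLLM_JAX_PROFILER_DIR"))


print(resolve_profiler_dir("gs://my-bucket/profiles"))
```

Because the new variable would be read directly by tpu-inference, its value never passes through the base library's `os.path.abspath()` call, and local paths still get the same normalization as before.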