[New model support] Support Tarsier2 #19887

Merged · 5 commits · Jun 21, 2025

Conversation

@princepride (Contributor) commented on Jun 20, 2025

Purpose

Add Tarsier2 model support: #18985 (comment)

Test Plan

python examples/offline_inference/vision_language.py -m tarsier2 --modality image
python examples/offline_inference/vision_language.py -m tarsier2 --modality video
python examples/offline_inference/vision_language_multi_image.py -m tarsier2
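For context, a minimal offline-inference sketch of what these example scripts exercise is shown below. The checkpoint name (omni-research/Tarsier2-Recap-7b), the Qwen2-VL-style prompt template, and the image path are assumptions for illustration, not values taken from this PR:

from PIL import Image
from vllm import LLM, SamplingParams

# Assumed Tarsier2 checkpoint and Qwen2-VL-style chat template; adjust both
# to the actual repository and template before running.
llm = LLM(model="omni-research/Tarsier2-Recap-7b", max_model_len=4096)

prompt = ("<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
          "Describe this image.<|im_end|>\n<|im_start|>assistant\n")
image = Image.open("example.jpg")  # hypothetical local test image

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)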

Test Result

Image Inference:

(screenshot of image inference output)

Video Inference:

(screenshot of video inference output)

Multi Image Inference:

(screenshot of multi-image inference output)

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small, essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (bot) left a comment


Summary of Changes

Hello @princepride, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces full support for the Tarsier2 multi-modal model. The changes encompass integrating its specific processing pipeline, updating the documentation to reflect its capabilities, and providing concrete examples for its usage across various modalities (image and video). This allows users to seamlessly run Tarsier2 models within the system.

Highlights

  • New Model Integration: I've added comprehensive support for the Tarsier2 multi-modal model, enabling it to process both image and video inputs. This integration leverages and extends existing Qwen2-VL processing components.
  • Example Usage and Testing: New example scripts have been included for offline inference with Tarsier2, covering single-image, multi-image, and video modalities. The model has also been added to relevant test suites to ensure processing correctness and online availability.
  • Documentation Update: The supported_models.md documentation has been updated to reflect Tarsier2's capabilities, explicitly noting its support for Text (T), Image (I), and Video (V) modalities.
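To make the Qwen2-VL reuse concrete, here is a small, self-contained sketch of the kind of checkpoint-name remapping such an integration typically needs so that Tarsier2 weights load into the existing Qwen2-VL module layout. The prefixes below (vision_tower., model., lm_head.) are illustrative assumptions, not copied from this PR's code:

# Hypothetical mapping from Tarsier2-style HF checkpoint prefixes onto the
# module names the reused Qwen2-VL implementation expects.
HF_TO_VLLM_PREFIXES = {
    "vision_tower.": "visual.",
    "model.": "language_model.model.",
    "lm_head.": "language_model.lm_head.",
}

def remap_weight_name(hf_name: str) -> str:
    """Return the target parameter name for a source checkpoint name."""
    for old, new in HF_TO_VLLM_PREFIXES.items():
        if hf_name.startswith(old):
            return new + hf_name[len(old):]
    return hf_name  # names without a known prefix pass through unchanged

# remap_weight_name("vision_tower.blocks.0.attn.qkv.weight")
# -> "visual.blocks.0.attn.qkv.weight"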

@mergify bot added the documentation (Improvements or additions to documentation), multi-modality (Related to multi-modality, #4194), and qwen (Related to Qwen models) labels on Jun 20, 2025
@gemini-code-assist (bot) left a comment


Code Review

The pull request successfully adds support for the Tarsier2 model by extending the Qwen2VL implementation. Key changes involve adapting the image processor configuration, defining new model classes with appropriate inheritance and weight mapping, and updating example scripts, documentation, and test configurations. The approach of reusing existing Qwen2VL infrastructure is good for maintainability. I've identified a couple of areas where robustness could be improved, particularly in handling potentially None or malformed configuration dictionaries in Tarsier2ImageProcessor and ensuring that placeholder variables in example scripts are always initialized. These are detailed in the specific comments.
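To illustrate the robustness point, a small, hypothetical helper like the one below could normalize a possibly-None or partially specified size configuration before Tarsier2ImageProcessor uses it. The key names follow the Qwen2-VL min_pixels/max_pixels convention, and the default budgets are assumptions, not values from this PR:

from typing import Optional

# Illustrative defaults only; the real processor may use different budgets.
_DEFAULT_MIN_PIXELS = 56 * 56
_DEFAULT_MAX_PIXELS = 28 * 28 * 1280

def normalize_size_config(size: Optional[dict]) -> dict:
    """Return a size dict that always carries valid min/max pixel budgets."""
    size = dict(size) if isinstance(size, dict) else {}
    size.setdefault("min_pixels", _DEFAULT_MIN_PIXELS)
    size.setdefault("max_pixels", _DEFAULT_MAX_PIXELS)
    if size["min_pixels"] > size["max_pixels"]:
        raise ValueError("min_pixels must not exceed max_pixels")
    return size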

Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
@princepride requested a review from Isotr0py on June 20, 2025 04:41
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
@princepride requested a review from Isotr0py on June 20, 2025 13:26
@Isotr0py (Collaborator) left a comment


LGTM now!

@Isotr0py enabled auto-merge (squash) on June 20, 2025 16:36
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Jun 20, 2025
@princepride (Contributor, Author)

@Isotr0py
One of the unit tests failed because downloading the BigBuckBunny.mp4 video took longer than the default download timeout, resulting in a TimeoutError:

  | [2025-06-20T18:15:10Z]   File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/utils.py", line 255, in fetch_video_async
  | [2025-06-20T18:15:10Z]     return await self.load_from_url_async(
  | [2025-06-20T18:15:10Z]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  | [2025-06-20T18:15:10Z]   File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/utils.py", line 132, in load_from_url_async
  | [2025-06-20T18:15:10Z]     data = await connection.async_get_bytes(url, timeout=fetch_timeout)
  | [2025-06-20T18:15:10Z]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
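As a quick sanity check that this is a network issue rather than a bug in the model code, the download can be probed in isolation with a generous budget; the 120-second timeout below is an arbitrary assumption, and raising the fetch timeout vLLM uses (e.g. via a VLLM_VIDEO_FETCH_TIMEOUT environment variable, if available in this version) should have the same effect in the test run:

import asyncio
import time

import aiohttp

URL = ("http://commondatastorage.googleapis.com/gtv-videos-bucket/"
       "sample/BigBuckBunny.mp4")

async def probe(timeout_s: float = 120.0) -> None:
    """Download the test video once and report size and elapsed time."""
    start = time.monotonic()
    timeout = aiohttp.ClientTimeout(total=timeout_s)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(URL) as resp:
            data = await resp.read()
    print(f"{len(data) / 1e6:.1f} MB in {time.monotonic() - start:.1f} s")

asyncio.run(probe())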

@princepride (Contributor, Author)

I ran the command
pytest -v -s entrypoints/openai/test_video.py -k "test_single_chat_session_video and llava-onevision-qwen2-0.5b-ov-hf"
and got the same error in my local environment:

========================================================== test session starts ===========================================================
platform linux -- Python 3.10.12, pytest-8.4.1, pluggy-1.6.0 -- /usr/bin/python
cachedir: .pytest_cache
rootdir: /workspace/vllm
configfile: pyproject.toml
plugins: timeout-2.4.0, shard-0.1.2, rerunfailures-15.1, forked-1.6.0, asyncio-1.0.0, anyio-4.0.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 26 items / 10 deselected / 16 selected                                                                                         
Running 16 items in this shard: tests/entrypoints/openai/test_video.py::test_single_chat_session_video[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf], tests/entrypoints/openai/test_video.py::test_single_chat_session_video[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ElephantsDream.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf], tests/entrypoints/openai/test_video.py::test_single_chat_session_video[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerBlazes.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf], tests/entrypoints/openai/test_video.py::test_single_chat_session_video[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf], tests/entrypoints/openai/test_video.py::test_single_chat_session_video_beamsearch[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf], tests/entrypoints/openai/test_video.py::test_single_chat_session_video_beamsearch[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ElephantsDream.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf], tests/entrypoints/openai/test_video.py::test_single_chat_session_video_beamsearch[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerBlazes.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf], tests/entrypoints/openai/test_video.py::test_single_chat_session_video_beamsearch[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf], tests/entrypoints/openai/test_video.py::test_single_chat_session_video_base64encoded[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf], tests/entrypoints/openai/test_video.py::test_single_chat_session_video_base64encoded[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ElephantsDream.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf], tests/entrypoints/openai/test_video.py::test_single_chat_session_video_base64encoded[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerBlazes.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf], tests/entrypoints/openai/test_video.py::test_single_chat_session_video_base64encoded[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf], tests/entrypoints/openai/test_video.py::test_single_chat_session_video_base64encoded_beamsearch[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf], tests/entrypoints/openai/test_video.py::test_single_chat_session_video_base64encoded_beamsearch[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ElephantsDream.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf], tests/entrypoints/openai/test_video.py::test_single_chat_session_video_base64encoded_beamsearch[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerBlazes.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf], tests/entrypoints/openai/test_video.py::test_single_chat_session_video_base64encoded_beamsearch[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf]

tests/entrypoints/openai/test_video.py::test_single_chat_session_video[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/Biconfig.json: 100%|███████████████████████████████████████████████████████████████████████████████████| 2.59k/2.59k [00:00<00:00, 20.7MB/s]
preprocessor_config.json: 100%|██████████████████████████████████████████████████████████████████████| 1.73k/1.73k [00:00<00:00, 14.6MB/s]
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████| 1.80k/1.80k [00:00<00:00, 15.8MB/s]
INFO 06-21 00:19:16 [config.py:1444] Using max model len 32768
INFO 06-21 00:19:16 [weight_utils.py:292] Using model weights format ['*.safetensors']
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████| 1.79G/1.79G [00:53<00:00, 33.6MB/s]
INFO 06-21 00:20:10 [weight_utils.py:308] Time spent downloading weights for llava-hf/llava-onevision-qwen2-0.5b-ov-hf: 53.579840 seconds
INFO 06-21 00:20:10 [weight_utils.py:345] No model.safetensors.index.json found in remote.
INFO 06-21 00:20:13 [__init__.py:244] Automatically detected platform cuda.
INFO 06-21 00:20:20 [api_server.py:1287] vLLM API server version 0.1.dev7214+g4719460
INFO 06-21 00:20:21 [cli_args.py:309] non-default args: {'port': 43793, 'model': 'llava-hf/llava-onevision-qwen2-0.5b-ov-hf', 'task': 'generate', 'trust_remote_code': True, 'seed': 0, 'max_model_len': 32768, 'enforce_eager': True, 'limit_mm_per_prompt': {'video': 4}, 'max_num_seqs': 2}
INFO 06-21 00:20:31 [config.py:1444] Using max model len 32768
INFO 06-21 00:20:31 [config.py:2197] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 06-21 00:20:31 [cuda.py:91] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
vocab.json: 100%|████████████████████████████████████████████████████████████████████████████████████| 2.78M/2.78M [00:00<00:00, 12.4MB/s]
merges.txt: 100%|████████████████████████████████████████████████████████████████████████████████████| 1.67M/1.67M [00:00<00:00, 8.73MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████| 7.03M/7.03M [00:00<00:00, 15.7MB/s]
added_tokens.json: 100%|█████████████████████████████████████████████████████████████████████████████████| 122/122 [00:00<00:00, 1.11MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████| 367/367 [00:00<00:00, 2.59MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████| 126/126 [00:00<00:00, 2.10MB/s]
INFO 06-21 00:20:38 [__init__.py:244] Automatically detected platform cuda.
INFO 06-21 00:20:41 [core.py:459] Waiting for init message from front-end.
INFO 06-21 00:20:41 [core.py:69] Initializing a V1 LLM engine (v0.1.dev7214+g4719460) with config: model='llava-hf/llava-onevision-qwen2-0.5b-ov-hf', speculative_config=None, tokenizer='llava-hf/llava-onevision-qwen2-0.5b-ov-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=llava-hf/llava-onevision-qwen2-0.5b-ov-hf, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
WARNING 06-21 00:20:42 [utils.py:2756] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x71e630df6170>
INFO 06-21 00:20:43 [parallel_state.py:1072] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
video_preprocessor_config.json: 100%|████████████████████████████████████████████████████████████████████| 621/621 [00:00<00:00, 9.51MB/s]
processor_config.json: 100%|█████████████████████████████████████████████████████████████████████████████| 178/178 [00:00<00:00, 1.58MB/s]
chat_template.json: 100%|████████████████████████████████████████████████████████████████████████████████| 826/826 [00:00<00:00, 7.14MB/s]
Unused or unrecognized kwargs: return_tensors.
WARNING 06-21 00:20:48 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 06-21 00:20:48 [gpu_model_runner.py:1691] Starting to load model llava-hf/llava-onevision-qwen2-0.5b-ov-hf...
INFO 06-21 00:20:48 [gpu_model_runner.py:1696] Loading model from scratch...
INFO 06-21 00:20:48 [cuda.py:259] Using Flash Attention backend on V1 engine.
INFO 06-21 00:20:49 [weight_utils.py:292] Using model weights format ['*.safetensors']
INFO 06-21 00:20:49 [weight_utils.py:345] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.76it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.76it/s]

INFO 06-21 00:20:49 [default_loader.py:272] Loading weights took 0.36 seconds
INFO 06-21 00:20:49 [gpu_model_runner.py:1720] Model loading took 1.6818 GiB and 0.792465 seconds
INFO 06-21 00:20:50 [gpu_model_runner.py:2117] Encoder cache will be initialized with a budget of 8748 tokens, and profiled with 1 image items of the maximum feature size.
INFO 06-21 00:20:52 [gpu_worker.py:232] Available KV cache memory: 19.22 GiB
INFO 06-21 00:20:52 [kv_cache_utils.py:716] GPU KV cache size: 1,679,376 tokens
INFO 06-21 00:20:52 [kv_cache_utils.py:720] Maximum concurrency for 32,768 tokens per request: 51.25x
WARNING 06-21 00:20:52 [utils.py:101] Unable to detect current VLLM config. Defaulting to NHD kv cache layout.
INFO 06-21 00:20:52 [core.py:172] init engine (profile, create kv cache, warmup model) took 2.42 seconds
INFO 06-21 00:20:53 [loggers.py:137] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 104961
INFO 06-21 00:20:53 [api_server.py:1349] Starting vLLM API server 0 on http://0.0.0.0:43793
INFO 06-21 00:20:53 [launcher.py:29] Available routes are:
INFO 06-21 00:20:53 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
INFO 06-21 00:20:53 [launcher.py:37] Route: /docs, Methods: HEAD, GET
INFO 06-21 00:20:53 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 06-21 00:20:53 [launcher.py:37] Route: /redoc, Methods: HEAD, GET
INFO 06-21 00:20:53 [launcher.py:37] Route: /health, Methods: GET
INFO 06-21 00:20:53 [launcher.py:37] Route: /load, Methods: GET
INFO 06-21 00:20:53 [launcher.py:37] Route: /ping, Methods: POST
INFO 06-21 00:20:53 [launcher.py:37] Route: /ping, Methods: GET
INFO 06-21 00:20:53 [launcher.py:37] Route: /tokenize, Methods: POST
INFO 06-21 00:20:53 [launcher.py:37] Route: /detokenize, Methods: POST
INFO 06-21 00:20:53 [launcher.py:37] Route: /v1/models, Methods: GET
INFO 06-21 00:20:53 [launcher.py:37] Route: /version, Methods: GET
INFO 06-21 00:20:53 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
INFO 06-21 00:20:53 [launcher.py:37] Route: /v1/completions, Methods: POST
INFO 06-21 00:20:53 [launcher.py:37] Route: /v1/embeddings, Methods: POST
INFO 06-21 00:20:53 [launcher.py:37] Route: /pooling, Methods: POST
INFO 06-21 00:20:53 [launcher.py:37] Route: /classify, Methods: POST
INFO 06-21 00:20:53 [launcher.py:37] Route: /score, Methods: POST
INFO 06-21 00:20:53 [launcher.py:37] Route: /v1/score, Methods: POST
INFO 06-21 00:20:53 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
INFO 06-21 00:20:53 [launcher.py:37] Route: /rerank, Methods: POST
INFO 06-21 00:20:53 [launcher.py:37] Route: /v1/rerank, Methods: POST
INFO 06-21 00:20:53 [launcher.py:37] Route: /v2/rerank, Methods: POST
INFO 06-21 00:20:53 [launcher.py:37] Route: /invocations, Methods: POST
INFO 06-21 00:20:53 [launcher.py:37] Route: /metrics, Methods: GET
INFO:     Started server process [2006]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     127.0.0.1:53064 - "GET /health HTTP/1.1" 200 OK
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
INFO 06-21 00:20:56 [chat_utils.py:420] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
INFO:     127.0.0.1:53066 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 347, in _wait
    await waiter
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/prometheus_fastapi_instrumentator/middleware.py", line 177, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/prometheus_fastapi_instrumentator/middleware.py", line 175, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 714, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 734, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/workspace/vllm/vllm/entrypoints/utils.py", line 78, in wrapper
    return handler_task.result()
  File "/workspace/vllm/vllm/entrypoints/utils.py", line 100, in wrapper
    return await func(*args, **kwargs)
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 554, in create_chat_completion
    generator = await handler.create_chat_completion(request, raw_request)
  File "/workspace/vllm/vllm/entrypoints/openai/serving_chat.py", line 183, in create_chat_completion
    ) = await self._preprocess_chat(
  File "/workspace/vllm/vllm/entrypoints/openai/serving_engine.py", line 818, in _preprocess_chat
    mm_data = await mm_data_future
  File "/workspace/vllm/vllm/entrypoints/chat_utils.py", line 648, in all_mm_data
    items_by_modality = {
  File "/workspace/vllm/vllm/entrypoints/chat_utils.py", line 649, in <dictcomp>
    modality: await asyncio.gather(*items)
  File "/workspace/vllm/vllm/multimodal/utils.py", line 255, in fetch_video_async
    return await self.load_from_url_async(
  File "/workspace/vllm/vllm/multimodal/utils.py", line 132, in load_from_url_async
    data = await connection.async_get_bytes(url, timeout=fetch_timeout)
  File "/workspace/vllm/vllm/connections.py", line 99, in async_get_bytes
    return await r.read()
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/client_reqrep.py", line 686, in read
    self._body = await self.content.read()
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 418, in read
    block = await self.readany()
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 440, in readany
    await self._wait("readany")
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 346, in _wait
    with self._timer:
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/helpers.py", line 685, in __exit__
    raise asyncio.TimeoutError from exc_val
asyncio.exceptions.TimeoutError
FAILED
tests/entrypoints/openai/test_video.py::test_single_chat_session_video[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ElephantsDream.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf] INFO 06-21 00:21:45 [logger.py:43] Received request chatcmpl-ed3f48097d2d4b4a95eefb7923c8654d: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=5, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
Unused or unrecognized kwargs: return_tensors.
INFO 06-21 00:21:47 [async_llm.py:270] Added request chatcmpl-ed3f48097d2d4b4a95eefb7923c8654d.
INFO:     127.0.0.1:54992 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-21 00:21:55 [loggers.py:118] Engine 000: Avg prompt throughput: 628.7 tokens/s, Avg generation throughput: 1.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 06-21 00:22:05 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 06-21 00:22:28 [logger.py:43] Received request chatcmpl-2589a77389ed4768a43aa12cdc5e45c4: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant \nThe video contains a series of images depicting a futuristic<|im_end|><|im_start|>user \nexpress your result in json<|im_end|><|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-21 00:22:28 [async_llm.py:270] Added request chatcmpl-2589a77389ed4768a43aa12cdc5e45c4.
INFO:     127.0.0.1:54992 - "POST /v1/chat/completions HTTP/1.1" 200 OK
PASSED
tests/entrypoints/openai/test_video.py::test_single_chat_session_video[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerBlazes.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf] INFO 06-21 00:22:29 [logger.py:43] Received request chatcmpl-9680e68f09d04e6d9db053fa7f0690eb: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=5, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
Unused or unrecognized kwargs: return_tensors.
INFO 06-21 00:22:30 [async_llm.py:270] Added request chatcmpl-9680e68f09d04e6d9db053fa7f0690eb.
INFO:     127.0.0.1:47804 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-21 00:22:31 [logger.py:43] Received request chatcmpl-63c0c60be82f4245b7de363b5b7132be: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant \nThe video features a tablet displaying a scene from the<|im_end|><|im_start|>user \nexpress your result in json<|im_end|><|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-21 00:22:31 [async_llm.py:270] Added request chatcmpl-63c0c60be82f4245b7de363b5b7132be.
INFO:     127.0.0.1:47804 - "POST /v1/chat/completions HTTP/1.1" 200 OK
PASSED
tests/entrypoints/openai/test_video.py::test_single_chat_session_video[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf] INFO 06-21 00:22:34 [logger.py:43] Received request chatcmpl-c5c3fd7da98c413fb935cf8bc2b0bfd1: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=5, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
Unused or unrecognized kwargs: return_tensors.
INFO 06-21 00:22:35 [async_llm.py:270] Added request chatcmpl-c5c3fd7da98c413fb935cf8bc2b0bfd1.
INFO:     127.0.0.1:47820 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-21 00:22:37 [logger.py:43] Received request chatcmpl-f5be18ca53f0499c8cdb4fcc84be4cb8: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant \nThe video features a person interacting with a black USB<|im_end|><|im_start|>user \nexpress your result in json<|im_end|><|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-21 00:22:38 [async_llm.py:270] Added request chatcmpl-f5be18ca53f0499c8cdb4fcc84be4cb8.
INFO 06-21 00:22:38 [loggers.py:118] Engine 000: Avg prompt throughput: 2505.2 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 39.8%
INFO:     127.0.0.1:47820 - "POST /v1/chat/completions HTTP/1.1" 200 OK
PASSED
tests/entrypoints/openai/test_video.py::test_single_chat_session_video_beamsearch[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf] INFO 06-21 00:22:52 [loggers.py:118] Engine 000: Avg prompt throughput: 442.6 tokens/s, Avg generation throughput: 0.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 49.8%
INFO 06-21 00:22:52 [logger.py:43] Received request chatcmpl-2c1c9732247442bbb554e21baa05ad8f: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant\n", params: BeamSearchParams(beam_width=2, max_tokens=10, ignore_eos=False, temperature=1.0, length_penalty=1.0, include_stop_str_in_output=False), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
Unused or unrecognized kwargs: return_tensors.
INFO 06-21 00:22:52 [async_llm.py:270] Added request beam_search-42c20f0373484cdf8bc57766c68a4b70-0.
INFO 06-21 00:22:53 [async_llm.py:270] Added request beam_search-201e2125ee3e4f53a41c596c1e23d11d-0.
INFO 06-21 00:22:53 [async_llm.py:270] Added request beam_search-201e2125ee3e4f53a41c596c1e23d11d-1.
INFO 06-21 00:22:53 [async_llm.py:270] Added request beam_search-b0adc3b399b5400e866e42e032b40f2d-0.
INFO 06-21 00:22:53 [async_llm.py:270] Added request beam_search-b0adc3b399b5400e866e42e032b40f2d-1.
INFO 06-21 00:22:53 [async_llm.py:270] Added request beam_search-64bd42fe2dac48cfb029ea0156b81f14-0.
INFO 06-21 00:22:53 [async_llm.py:270] Added request beam_search-64bd42fe2dac48cfb029ea0156b81f14-1.
INFO 06-21 00:22:53 [async_llm.py:270] Added request beam_search-c6942577e8be40328ba6544336d526a5-0.
INFO 06-21 00:22:53 [async_llm.py:270] Added request beam_search-c6942577e8be40328ba6544336d526a5-1.
INFO 06-21 00:22:53 [async_llm.py:270] Added request beam_search-3328e87b3b844db98faa987cfb42838a-0.
INFO 06-21 00:22:53 [async_llm.py:270] Added request beam_search-3328e87b3b844db98faa987cfb42838a-1.
INFO 06-21 00:22:53 [async_llm.py:270] Added request beam_search-6fa5fe64d5854bd2ba38c0b2e875d4f7-0.
INFO 06-21 00:22:53 [async_llm.py:270] Added request beam_search-6fa5fe64d5854bd2ba38c0b2e875d4f7-1.
INFO 06-21 00:22:53 [async_llm.py:270] Added request beam_search-a441f325d8f14564b6c9d4d4d5241bc5-0.
INFO 06-21 00:22:53 [async_llm.py:270] Added request beam_search-a441f325d8f14564b6c9d4d4d5241bc5-1.
INFO 06-21 00:22:53 [async_llm.py:270] Added request beam_search-4d0cf0bb8bfa436d9a80e431faa36464-0.
INFO 06-21 00:22:53 [async_llm.py:270] Added request beam_search-4d0cf0bb8bfa436d9a80e431faa36464-1.
INFO 06-21 00:22:53 [async_llm.py:270] Added request beam_search-a7f989330dcb43f497aa28d30a76e0b8-0.
INFO 06-21 00:22:53 [async_llm.py:270] Added request beam_search-a7f989330dcb43f497aa28d30a76e0b8-1.
INFO:     127.0.0.1:47696 - "POST /v1/chat/completions HTTP/1.1" 200 OK
PASSED
tests/entrypoints/openai/test_video.py::test_single_chat_session_video_beamsearch[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ElephantsDream.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf] INFO 06-21 00:23:07 [loggers.py:118] Engine 000: Avg prompt throughput: 7964.7 tokens/s, Avg generation throughput: 1.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 83.9%
INFO 06-21 00:23:07 [logger.py:43] Received request chatcmpl-6d784f0440c04dab85142e6395e23959: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant\n", params: BeamSearchParams(beam_width=2, max_tokens=10, ignore_eos=False, temperature=1.0, length_penalty=1.0, include_stop_str_in_output=False), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-21 00:23:07 [async_llm.py:270] Added request beam_search-036557308073418b97452cb8484aa91d-0.
INFO 06-21 00:23:07 [async_llm.py:270] Added request beam_search-9a0b89da0c2e409cb0abc1c030434942-0.
INFO 06-21 00:23:07 [async_llm.py:270] Added request beam_search-9a0b89da0c2e409cb0abc1c030434942-1.
INFO 06-21 00:23:07 [async_llm.py:270] Added request beam_search-af723cfe96404428990b7a52c1fca5cb-0.
INFO 06-21 00:23:07 [async_llm.py:270] Added request beam_search-af723cfe96404428990b7a52c1fca5cb-1.
INFO 06-21 00:23:07 [async_llm.py:270] Added request beam_search-bd48f90c051c40a785d2e4a00b2babd1-0.
INFO 06-21 00:23:07 [async_llm.py:270] Added request beam_search-bd48f90c051c40a785d2e4a00b2babd1-1.
INFO 06-21 00:23:07 [async_llm.py:270] Added request beam_search-8bbcda4668d04222868efa3d1dd2a8c6-0.
INFO 06-21 00:23:07 [async_llm.py:270] Added request beam_search-8bbcda4668d04222868efa3d1dd2a8c6-1.
INFO 06-21 00:23:07 [async_llm.py:270] Added request beam_search-7583c3ee815a4b3e83cd0c53eb99ceff-0.
INFO 06-21 00:23:07 [async_llm.py:270] Added request beam_search-7583c3ee815a4b3e83cd0c53eb99ceff-1.
INFO 06-21 00:23:07 [async_llm.py:270] Added request beam_search-2ad52ad3556d42d986c7effb531bd6d8-0.
INFO 06-21 00:23:07 [async_llm.py:270] Added request beam_search-2ad52ad3556d42d986c7effb531bd6d8-1.
INFO 06-21 00:23:07 [async_llm.py:270] Added request beam_search-99e3c888dcfc4533be29037ccc1fdd15-0.
INFO 06-21 00:23:07 [async_llm.py:270] Added request beam_search-99e3c888dcfc4533be29037ccc1fdd15-1.
INFO 06-21 00:23:07 [async_llm.py:270] Added request beam_search-0693100e251f4be3a1949db635e57146-0.
INFO 06-21 00:23:07 [async_llm.py:270] Added request beam_search-0693100e251f4be3a1949db635e57146-1.
INFO 06-21 00:23:07 [async_llm.py:270] Added request beam_search-b655f44ddcdb4a6b823a44a4edd18653-0.
INFO 06-21 00:23:07 [async_llm.py:270] Added request beam_search-b655f44ddcdb4a6b823a44a4edd18653-1.
INFO:     127.0.0.1:46982 - "POST /v1/chat/completions HTTP/1.1" 200 OK
PASSED
tests/entrypoints/openai/test_video.py::test_single_chat_session_video_beamsearch[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerBlazes.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf] INFO 06-21 00:23:08 [logger.py:43] Received request chatcmpl-6d63f87ccb4b4587adf59227f972e934: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant\n", params: BeamSearchParams(beam_width=2, max_tokens=10, ignore_eos=False, temperature=1.0, length_penalty=1.0, include_stop_str_in_output=False), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-21 00:23:08 [async_llm.py:270] Added request beam_search-77ed9518819a4bcbb3a220a78be43abb-0.
INFO 06-21 00:23:08 [async_llm.py:270] Added request beam_search-67fad00240df4fffac7696662080b558-0.
INFO 06-21 00:23:08 [async_llm.py:270] Added request beam_search-67fad00240df4fffac7696662080b558-1.
INFO 06-21 00:23:08 [async_llm.py:270] Added request beam_search-340e4bcd1c0348e7888b96745f799da5-0.
INFO 06-21 00:23:08 [async_llm.py:270] Added request beam_search-340e4bcd1c0348e7888b96745f799da5-1.
INFO 06-21 00:23:08 [async_llm.py:270] Added request beam_search-afbca5fa78d64343bb982c572be77066-0.
INFO 06-21 00:23:08 [async_llm.py:270] Added request beam_search-afbca5fa78d64343bb982c572be77066-1.
INFO 06-21 00:23:08 [async_llm.py:270] Added request beam_search-f443e1a1d0ab401ea948833996d5ff83-0.
INFO 06-21 00:23:08 [async_llm.py:270] Added request beam_search-f443e1a1d0ab401ea948833996d5ff83-1.
INFO 06-21 00:23:08 [async_llm.py:270] Added request beam_search-1aec22abbbdf476aa26bd4b6c26b831f-0.
INFO 06-21 00:23:08 [async_llm.py:270] Added request beam_search-1aec22abbbdf476aa26bd4b6c26b831f-1.
INFO 06-21 00:23:09 [async_llm.py:270] Added request beam_search-77a9f8a85eb74d73ae3ffec3d98f5a26-0.
INFO 06-21 00:23:09 [async_llm.py:270] Added request beam_search-77a9f8a85eb74d73ae3ffec3d98f5a26-1.
INFO 06-21 00:23:09 [async_llm.py:270] Added request beam_search-320f060ff096486c8baab3fd0d97beaf-0.
INFO 06-21 00:23:09 [async_llm.py:270] Added request beam_search-320f060ff096486c8baab3fd0d97beaf-1.
INFO 06-21 00:23:09 [async_llm.py:270] Added request beam_search-4d743d39de6e4c89a23da01c39d80066-0.
INFO 06-21 00:23:09 [async_llm.py:270] Added request beam_search-4d743d39de6e4c89a23da01c39d80066-1.
INFO 06-21 00:23:09 [async_llm.py:270] Added request beam_search-83d921701c434ab29427b36cb3224e3a-0.
INFO 06-21 00:23:09 [async_llm.py:270] Added request beam_search-83d921701c434ab29427b36cb3224e3a-1.
INFO:     127.0.0.1:42342 - "POST /v1/chat/completions HTTP/1.1" 200 OK
PASSED
tests/entrypoints/openai/test_video.py::test_single_chat_session_video_beamsearch[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf] INFO 06-21 00:23:10 [logger.py:43] Received request chatcmpl-56f2f23823004221bebf354e764ddbfd: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant\n", params: BeamSearchParams(beam_width=2, max_tokens=10, ignore_eos=False, temperature=1.0, length_penalty=1.0, include_stop_str_in_output=False), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-21 00:23:10 [async_llm.py:270] Added request beam_search-fd4083ba65464adbbeeff7cb539a924a-0.
INFO 06-21 00:23:10 [async_llm.py:270] Added request beam_search-07cf50e305bd44a7a28051eb0ba407f2-0.
INFO 06-21 00:23:10 [async_llm.py:270] Added request beam_search-07cf50e305bd44a7a28051eb0ba407f2-1.
INFO 06-21 00:23:10 [async_llm.py:270] Added request beam_search-de6c9e47020c405fa8ae8d0f3ef0f29f-0.
INFO 06-21 00:23:10 [async_llm.py:270] Added request beam_search-de6c9e47020c405fa8ae8d0f3ef0f29f-1.
INFO 06-21 00:23:10 [async_llm.py:270] Added request beam_search-0d7c93dbd5624e7eb0ab8e5fc31ad449-0.
INFO 06-21 00:23:10 [async_llm.py:270] Added request beam_search-0d7c93dbd5624e7eb0ab8e5fc31ad449-1.
INFO 06-21 00:23:11 [async_llm.py:270] Added request beam_search-1723e044b6964417a2258c927e460b1e-0.
INFO 06-21 00:23:11 [async_llm.py:270] Added request beam_search-1723e044b6964417a2258c927e460b1e-1.
INFO 06-21 00:23:11 [async_llm.py:270] Added request beam_search-8a97312df0dc48a6839a7528160312d9-0.
INFO 06-21 00:23:11 [async_llm.py:270] Added request beam_search-8a97312df0dc48a6839a7528160312d9-1.
INFO 06-21 00:23:11 [async_llm.py:270] Added request beam_search-5ede7ea2a4bd4d459c97b57b7cf71b70-0.
INFO 06-21 00:23:11 [async_llm.py:270] Added request beam_search-5ede7ea2a4bd4d459c97b57b7cf71b70-1.
INFO 06-21 00:23:11 [async_llm.py:270] Added request beam_search-9a1c48c27aba40e1aeb45356dc173f21-0.
INFO 06-21 00:23:11 [async_llm.py:270] Added request beam_search-9a1c48c27aba40e1aeb45356dc173f21-1.
INFO 06-21 00:23:11 [async_llm.py:270] Added request beam_search-0b30acde38794181807fcac17c41fe35-0.
INFO 06-21 00:23:11 [async_llm.py:270] Added request beam_search-0b30acde38794181807fcac17c41fe35-1.
INFO 06-21 00:23:11 [async_llm.py:270] Added request beam_search-7f23ac5e27534456812d00458f4346f9-0.
INFO 06-21 00:23:11 [async_llm.py:270] Added request beam_search-7f23ac5e27534456812d00458f4346f9-1.
INFO:     127.0.0.1:45946 - "POST /v1/chat/completions HTTP/1.1" 200 OK
PASSED
tests/entrypoints/openai/test_video.py::test_single_chat_session_video_base64encoded[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf] INFO 06-21 00:23:17 [loggers.py:118] Engine 000: Avg prompt throughput: 35860.5 tokens/s, Avg generation throughput: 5.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 95.0%
INFO 06-21 00:23:27 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 95.0%
INFO 06-21 00:23:40 [logger.py:43] Received request chatcmpl-620ab608144048099aeb588277d44abd: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=5, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
Unused or unrecognized kwargs: return_tensors.
INFO 06-21 00:23:40 [async_llm.py:270] Added request chatcmpl-620ab608144048099aeb588277d44abd.
INFO:     127.0.0.1:58906 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-21 00:23:41 [logger.py:43] Received request chatcmpl-371e30c2956848ba91913a67a857abee: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant \nThe video features a variety of animated characters in a<|im_end|><|im_start|>user \nexpress your result in json<|im_end|><|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-21 00:23:41 [async_llm.py:270] Added request chatcmpl-371e30c2956848ba91913a67a857abee.
INFO:     127.0.0.1:58906 - "POST /v1/chat/completions HTTP/1.1" 200 OK
PASSED
tests/entrypoints/openai/test_video.py::test_single_chat_session_video_base64encoded[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ElephantsDream.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf] INFO 06-21 00:23:42 [logger.py:43] Received request chatcmpl-5c51c7c4f4d6455183dbc2adc31d30f6: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=5, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
Unused or unrecognized kwargs: return_tensors.
INFO 06-21 00:23:42 [async_llm.py:270] Added request chatcmpl-5c51c7c4f4d6455183dbc2adc31d30f6.
INFO:     127.0.0.1:58918 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-21 00:23:43 [logger.py:43] Received request chatcmpl-a4d8f2d0db094dadb6a0715d26fd312b: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant \nThe video contains a series of images depicting a futuristic<|im_end|><|im_start|>user \nexpress your result in json<|im_end|><|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-21 00:23:43 [async_llm.py:270] Added request chatcmpl-a4d8f2d0db094dadb6a0715d26fd312b.
INFO:     127.0.0.1:58918 - "POST /v1/chat/completions HTTP/1.1" 200 OK
PASSED
tests/entrypoints/openai/test_video.py::test_single_chat_session_video_base64encoded[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerBlazes.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf] INFO 06-21 00:23:44 [logger.py:43] Received request chatcmpl-11c1b02b56d344c7b8c440c5a560b0cf: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=5, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
Unused or unrecognized kwargs: return_tensors.
INFO 06-21 00:23:45 [async_llm.py:270] Added request chatcmpl-11c1b02b56d344c7b8c440c5a560b0cf.
INFO:     127.0.0.1:58928 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-21 00:23:46 [logger.py:43] Received request chatcmpl-6806fbd71e634e3499e1aed210d2a76b: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant \nThe video features a tablet displaying a scene from a<|im_end|><|im_start|>user \nexpress your result in json<|im_end|><|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-21 00:23:46 [async_llm.py:270] Added request chatcmpl-6806fbd71e634e3499e1aed210d2a76b.
INFO:     127.0.0.1:58928 - "POST /v1/chat/completions HTTP/1.1" 200 OK
PASSED
tests/entrypoints/openai/test_video.py::test_single_chat_session_video_base64encoded[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf] INFO 06-21 00:23:47 [logger.py:43] Received request chatcmpl-695fa7795d80439c8900dc6b412bffcf: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=5, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
Unused or unrecognized kwargs: return_tensors.
INFO 06-21 00:23:47 [async_llm.py:270] Added request chatcmpl-695fa7795d80439c8900dc6b412bffcf.
INFO 06-21 00:23:47 [loggers.py:118] Engine 000: Avg prompt throughput: 3777.8 tokens/s, Avg generation throughput: 6.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 91.9%
INFO:     127.0.0.1:58934 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 06-21 00:23:48 [logger.py:43] Received request chatcmpl-f80656832f8f46f0879eace5ffde1581: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant \nThe video features a person interacting with a computer,<|im_end|><|im_start|>user \nexpress your result in json<|im_end|><|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=10, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-21 00:23:48 [async_llm.py:270] Added request chatcmpl-f80656832f8f46f0879eace5ffde1581.
INFO:     127.0.0.1:58934 - "POST /v1/chat/completions HTTP/1.1" 200 OK
PASSED
tests/entrypoints/openai/test_video.py::test_single_chat_session_video_base64encoded_beamsearch[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf] INFO 06-21 00:23:49 [logger.py:43] Received request chatcmpl-f0eef1eb98a8453aafa7ba8652baaf32: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant\n", params: BeamSearchParams(beam_width=2, max_tokens=10, ignore_eos=False, temperature=1.0, length_penalty=1.0, include_stop_str_in_output=False), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-21 00:23:49 [async_llm.py:270] Added request beam_search-4dfb5fce0f8a4398bc82cae1275081c9-0.
INFO 06-21 00:23:49 [async_llm.py:270] Added request beam_search-d980050670494eb4a38caaf469ad2733-0.
INFO 06-21 00:23:49 [async_llm.py:270] Added request beam_search-d980050670494eb4a38caaf469ad2733-1.
INFO 06-21 00:23:49 [async_llm.py:270] Added request beam_search-92305724cd0042ef999fd54d97586798-0.
INFO 06-21 00:23:49 [async_llm.py:270] Added request beam_search-92305724cd0042ef999fd54d97586798-1.
INFO 06-21 00:23:49 [async_llm.py:270] Added request beam_search-b4bccb1319e246a6a9372465c00a09f1-0.
INFO 06-21 00:23:49 [async_llm.py:270] Added request beam_search-b4bccb1319e246a6a9372465c00a09f1-1.
INFO 06-21 00:23:49 [async_llm.py:270] Added request beam_search-77223f30a2484dfca79999a4084d1919-0.
INFO 06-21 00:23:49 [async_llm.py:270] Added request beam_search-77223f30a2484dfca79999a4084d1919-1.
INFO 06-21 00:23:49 [async_llm.py:270] Added request beam_search-5eff527f403747f48ccb7db144075f61-0.
INFO 06-21 00:23:49 [async_llm.py:270] Added request beam_search-5eff527f403747f48ccb7db144075f61-1.
INFO 06-21 00:23:49 [async_llm.py:270] Added request beam_search-f48b1ee23f1b49d4a0b935a9d24bd9c1-0.
INFO 06-21 00:23:49 [async_llm.py:270] Added request beam_search-f48b1ee23f1b49d4a0b935a9d24bd9c1-1.
INFO 06-21 00:23:49 [async_llm.py:270] Added request beam_search-213c48e3bd2244c88839459c6a40fbef-0.
INFO 06-21 00:23:49 [async_llm.py:270] Added request beam_search-213c48e3bd2244c88839459c6a40fbef-1.
INFO 06-21 00:23:49 [async_llm.py:270] Added request beam_search-bed36d1883224a86b52092e683d8711b-0.
INFO 06-21 00:23:49 [async_llm.py:270] Added request beam_search-bed36d1883224a86b52092e683d8711b-1.
INFO 06-21 00:23:49 [async_llm.py:270] Added request beam_search-f22e9b682713418d9d3de82ada2dc2da-0.
INFO 06-21 00:23:49 [async_llm.py:270] Added request beam_search-f22e9b682713418d9d3de82ada2dc2da-1.
INFO:     127.0.0.1:51168 - "POST /v1/chat/completions HTTP/1.1" 200 OK
PASSED
tests/entrypoints/openai/test_video.py::test_single_chat_session_video_base64encoded_beamsearch[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ElephantsDream.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf] INFO 06-21 00:23:50 [logger.py:43] Received request chatcmpl-632a653f00034a6a9ebada07acf440c0: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant\n", params: BeamSearchParams(beam_width=2, max_tokens=10, ignore_eos=False, temperature=1.0, length_penalty=1.0, include_stop_str_in_output=False), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-21 00:23:50 [async_llm.py:270] Added request beam_search-a737f418958948bdbe124ec6c6e7fbd2-0.
INFO 06-21 00:23:50 [async_llm.py:270] Added request beam_search-e910d381b15b4b4e81486d6ae8074bfe-0.
INFO 06-21 00:23:50 [async_llm.py:270] Added request beam_search-e910d381b15b4b4e81486d6ae8074bfe-1.
INFO 06-21 00:23:50 [async_llm.py:270] Added request beam_search-8e2cdf87e722482880b7ba273cba20eb-0.
INFO 06-21 00:23:50 [async_llm.py:270] Added request beam_search-8e2cdf87e722482880b7ba273cba20eb-1.
INFO 06-21 00:23:50 [async_llm.py:270] Added request beam_search-5c858325abe44ca3bac56bc9f70f705b-0.
INFO 06-21 00:23:50 [async_llm.py:270] Added request beam_search-5c858325abe44ca3bac56bc9f70f705b-1.
INFO 06-21 00:23:50 [async_llm.py:270] Added request beam_search-5ce8af284db64bfdb9eb3f10f9e7043e-0.
INFO 06-21 00:23:50 [async_llm.py:270] Added request beam_search-5ce8af284db64bfdb9eb3f10f9e7043e-1.
INFO 06-21 00:23:50 [async_llm.py:270] Added request beam_search-7fa1b41ff3954989a548d01b25d39088-0.
INFO 06-21 00:23:50 [async_llm.py:270] Added request beam_search-7fa1b41ff3954989a548d01b25d39088-1.
INFO 06-21 00:23:50 [async_llm.py:270] Added request beam_search-2ae0e7953ebc40509481fbbea85e1e5b-0.
INFO 06-21 00:23:50 [async_llm.py:270] Added request beam_search-2ae0e7953ebc40509481fbbea85e1e5b-1.
INFO 06-21 00:23:50 [async_llm.py:270] Added request beam_search-fce9f27dda104e588ad4597d7d891dae-0.
INFO 06-21 00:23:50 [async_llm.py:270] Added request beam_search-fce9f27dda104e588ad4597d7d891dae-1.
INFO 06-21 00:23:50 [async_llm.py:270] Added request beam_search-3fee2939ba404285a6b06fc2e934369d-0.
INFO 06-21 00:23:50 [async_llm.py:270] Added request beam_search-3fee2939ba404285a6b06fc2e934369d-1.
INFO 06-21 00:23:50 [async_llm.py:270] Added request beam_search-aef50b4a184d4821a9171527965201d6-0.
INFO 06-21 00:23:50 [async_llm.py:270] Added request beam_search-aef50b4a184d4821a9171527965201d6-1.
INFO:     127.0.0.1:51178 - "POST /v1/chat/completions HTTP/1.1" 200 OK
PASSED
tests/entrypoints/openai/test_video.py::test_single_chat_session_video_base64encoded_beamsearch[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerBlazes.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf] INFO 06-21 00:23:51 [logger.py:43] Received request chatcmpl-38e7d937ec6249e4b66a86608d4358d1: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant\n", params: BeamSearchParams(beam_width=2, max_tokens=10, ignore_eos=False, temperature=1.0, length_penalty=1.0, include_stop_str_in_output=False), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-21 00:23:51 [async_llm.py:270] Added request beam_search-6f1dfa51d6e5434a887c496d5a15196e-0.
INFO 06-21 00:23:51 [async_llm.py:270] Added request beam_search-dc5ec9cdf6624a88aabb1c2848d1ddbe-0.
INFO 06-21 00:23:51 [async_llm.py:270] Added request beam_search-dc5ec9cdf6624a88aabb1c2848d1ddbe-1.
INFO 06-21 00:23:51 [async_llm.py:270] Added request beam_search-bb56f0225a59452fb75e71c9cb5285b2-0.
INFO 06-21 00:23:51 [async_llm.py:270] Added request beam_search-bb56f0225a59452fb75e71c9cb5285b2-1.
INFO 06-21 00:23:51 [async_llm.py:270] Added request beam_search-82b1d66fa4654ca9879a937c4a7985d5-0.
INFO 06-21 00:23:51 [async_llm.py:270] Added request beam_search-82b1d66fa4654ca9879a937c4a7985d5-1.
INFO 06-21 00:23:51 [async_llm.py:270] Added request beam_search-08627eb3906c4f2d8c428821f9ca53e9-0.
INFO 06-21 00:23:51 [async_llm.py:270] Added request beam_search-08627eb3906c4f2d8c428821f9ca53e9-1.
INFO 06-21 00:23:51 [async_llm.py:270] Added request beam_search-a42c5e0310284dacb1d6ceb3a6f0bf63-0.
INFO 06-21 00:23:51 [async_llm.py:270] Added request beam_search-a42c5e0310284dacb1d6ceb3a6f0bf63-1.
INFO 06-21 00:23:51 [async_llm.py:270] Added request beam_search-7a4fd6803cec403eb082aeca61b41d61-0.
INFO 06-21 00:23:51 [async_llm.py:270] Added request beam_search-7a4fd6803cec403eb082aeca61b41d61-1.
INFO 06-21 00:23:51 [async_llm.py:270] Added request beam_search-5b7bae1f51144d02b171cdf915efb810-0.
INFO 06-21 00:23:51 [async_llm.py:270] Added request beam_search-5b7bae1f51144d02b171cdf915efb810-1.
INFO 06-21 00:23:51 [async_llm.py:270] Added request beam_search-bedef3d4f9054a2aad26896cd8930daf-0.
INFO 06-21 00:23:51 [async_llm.py:270] Added request beam_search-bedef3d4f9054a2aad26896cd8930daf-1.
INFO 06-21 00:23:51 [async_llm.py:270] Added request beam_search-ad101561a41d40ecbddbb3b60a7ab806-0.
INFO 06-21 00:23:51 [async_llm.py:270] Added request beam_search-ad101561a41d40ecbddbb3b60a7ab806-1.
INFO:     127.0.0.1:51190 - "POST /v1/chat/completions HTTP/1.1" 200 OK
PASSED
tests/entrypoints/openai/test_video.py::test_single_chat_session_video_base64encoded_beamsearch[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf] INFO 06-21 00:23:52 [logger.py:43] Received request chatcmpl-214fe379cbab4b8ca83d9510019426f0: prompt: "<|im_start|>user <video>\nWhat's in this video?<|im_end|><|im_start|>assistant\n", params: BeamSearchParams(beam_width=2, max_tokens=10, ignore_eos=False, temperature=1.0, length_penalty=1.0, include_stop_str_in_output=False), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-21 00:23:52 [async_llm.py:270] Added request beam_search-655337157772484f83831b0ffad55560-0.
INFO 06-21 00:23:52 [async_llm.py:270] Added request beam_search-d809265dfec247a495f11e56a7478020-0.
INFO 06-21 00:23:52 [async_llm.py:270] Added request beam_search-d809265dfec247a495f11e56a7478020-1.
INFO 06-21 00:23:52 [async_llm.py:270] Added request beam_search-857787f3286d4946a4880e0a0ab5f003-0.
INFO 06-21 00:23:52 [async_llm.py:270] Added request beam_search-857787f3286d4946a4880e0a0ab5f003-1.
INFO 06-21 00:23:52 [async_llm.py:270] Added request beam_search-5d30b0c25c5540578bd7a0f58cd81f21-0.
INFO 06-21 00:23:52 [async_llm.py:270] Added request beam_search-5d30b0c25c5540578bd7a0f58cd81f21-1.
INFO 06-21 00:23:52 [async_llm.py:270] Added request beam_search-39178a6ab78f4f059ba7f0db32f7671e-0.
INFO 06-21 00:23:52 [async_llm.py:270] Added request beam_search-39178a6ab78f4f059ba7f0db32f7671e-1.
INFO 06-21 00:23:52 [async_llm.py:270] Added request beam_search-31aa4cdc4a904e65b7bfc990ffca37b6-0.
INFO 06-21 00:23:52 [async_llm.py:270] Added request beam_search-31aa4cdc4a904e65b7bfc990ffca37b6-1.
INFO 06-21 00:23:52 [async_llm.py:270] Added request beam_search-860c1dcb8960488b8c2db741ccd4c7ab-0.
INFO 06-21 00:23:52 [async_llm.py:270] Added request beam_search-860c1dcb8960488b8c2db741ccd4c7ab-1.
INFO 06-21 00:23:52 [async_llm.py:270] Added request beam_search-847169454f094a51ba151063c3ca4e75-0.
INFO 06-21 00:23:52 [async_llm.py:270] Added request beam_search-847169454f094a51ba151063c3ca4e75-1.
INFO 06-21 00:23:52 [async_llm.py:270] Added request beam_search-c3513c235ac1492293826c32892475e5-0.
INFO 06-21 00:23:52 [async_llm.py:270] Added request beam_search-c3513c235ac1492293826c32892475e5-1.
INFO 06-21 00:23:52 [async_llm.py:270] Added request beam_search-0493cebca2b14ddfaf500b4f9b3bcaed-0.
INFO 06-21 00:23:52 [async_llm.py:270] Added request beam_search-0493cebca2b14ddfaf500b4f9b3bcaed-1.
INFO:     127.0.0.1:51192 - "POST /v1/chat/completions HTTP/1.1" 200 OK
PASSED
INFO 06-21 00:23:53 [launcher.py:80] Shutting down FastAPI HTTP server.
[rank0]:[W621 00:23:53.094456904 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.


================================================================ FAILURES ================================================================
_ test_single_chat_session_video[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf] _

client = <openai.AsyncOpenAI object at 0x71fc5f9c6c80>, model_name = 'llava-hf/llava-onevision-qwen2-0.5b-ov-hf'
video_url = 'http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4'

    @pytest.mark.asyncio
    @pytest.mark.parametrize("model_name", [MODEL_NAME])
    @pytest.mark.parametrize("video_url", TEST_VIDEO_URLS)
    async def test_single_chat_session_video(client: openai.AsyncOpenAI,
                                             model_name: str, video_url: str):
        messages = [{
            "role":
            "user",
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": video_url
                    }
                },
                {
                    "type": "text",
                    "text": "What's in this video?"
                },
            ],
        }]
    
        # test single completion
>       chat_completion = await client.chat.completions.create(
            model=model_name,
            messages=messages,
            max_completion_tokens=10,
            logprobs=True,
            temperature=0.0,
            top_logprobs=5)

tests/entrypoints/openai/test_video.py:81: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/local/lib/python3.10/dist-packages/openai/resources/chat/completions/completions.py:2028: in create
    return await self._post(
/usr/local/lib/python3.10/dist-packages/openai/_base_client.py:1784: in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <openai.AsyncOpenAI object at 0x71fc5f9c6c80>, cast_to = <class 'openai.types.chat.chat_completion.ChatCompletion'>
options = FinalRequestOptions(method='post', url='/chat/completions', params={}, headers=NOT_GIVEN, max_retries=NOT_GIVEN, timeo...n2-0.5b-ov-hf', 'logprobs': True, 'max_completion_tokens': 10, 'temperature': 0.0, 'top_logprobs': 5}, extra_json=None)

    async def request(
        self,
        cast_to: Type[ResponseT],
        options: FinalRequestOptions,
        *,
        stream: bool = False,
        stream_cls: type[_AsyncStreamT] | None = None,
    ) -> ResponseT | _AsyncStreamT:
        if self._platform is None:
            # `get_platform` can make blocking IO calls so we
            # execute it earlier while we are in an async context
            self._platform = await asyncify(get_platform)()
    
        cast_to = self._maybe_override_cast_to(cast_to, options)
    
        # create a copy of the options we were given so that if the
        # options are mutated later & we then retry, the retries are
        # given the original options
        input_options = model_copy(options)
        if input_options.idempotency_key is None and input_options.method.lower() != "get":
            # ensure the idempotency key is reused between requests
            input_options.idempotency_key = self._idempotency_key()
    
        response: httpx.Response | None = None
        max_retries = input_options.get_max_retries(self.max_retries)
    
        retries_taken = 0
        for retries_taken in range(max_retries + 1):
            options = model_copy(input_options)
            options = await self._prepare_options(options)
    
            remaining_retries = max_retries - retries_taken
            request = self._build_request(options, retries_taken=retries_taken)
            await self._prepare_request(request)
    
            kwargs: HttpxSendArgs = {}
            if self.custom_auth is not None:
                kwargs["auth"] = self.custom_auth
    
            if options.follow_redirects is not None:
                kwargs["follow_redirects"] = options.follow_redirects
    
            log.debug("Sending HTTP Request: %s %s", request.method, request.url)
    
            response = None
            try:
                response = await self._client.send(
                    request,
                    stream=stream or self._should_stream_response_body(request=request),
                    **kwargs,
                )
            except httpx.TimeoutException as err:
                log.debug("Encountered httpx.TimeoutException", exc_info=True)
    
                if remaining_retries > 0:
                    await self._sleep_for_retry(
                        retries_taken=retries_taken,
                        max_retries=max_retries,
                        options=input_options,
                        response=None,
                    )
                    continue
    
                log.debug("Raising timeout error")
                raise APITimeoutError(request=request) from err
            except Exception as err:
                log.debug("Encountered Exception", exc_info=True)
    
                if remaining_retries > 0:
                    await self._sleep_for_retry(
                        retries_taken=retries_taken,
                        max_retries=max_retries,
                        options=input_options,
                        response=None,
                    )
                    continue
    
                log.debug("Raising connection error")
                raise APIConnectionError(request=request) from err
    
            log.debug(
                'HTTP Response: %s %s "%i %s" %s',
                request.method,
                request.url,
                response.status_code,
                response.reason_phrase,
                response.headers,
            )
            log.debug("request_id: %s", response.headers.get("x-request-id"))
    
            try:
                response.raise_for_status()
            except httpx.HTTPStatusError as err:  # thrown on 4xx and 5xx status code
                log.debug("Encountered httpx.HTTPStatusError", exc_info=True)
    
                if remaining_retries > 0 and self._should_retry(err.response):
                    await err.response.aclose()
                    await self._sleep_for_retry(
                        retries_taken=retries_taken,
                        max_retries=max_retries,
                        options=input_options,
                        response=response,
                    )
                    continue
    
                # If the response is streamed then we need to explicitly read the response
                # to completion before attempting to access the response text.
                if not err.response.is_closed:
                    await err.response.aread()
    
                log.debug("Re-raising status error")
>               raise self._make_status_error_from_response(err.response) from None
E               openai.InternalServerError: Internal Server Error

/usr/local/lib/python3.10/dist-packages/openai/_base_client.py:1584: InternalServerError
======================================================== short test summary info =========================================================
FAILED tests/entrypoints/openai/test_video.py::test_single_chat_session_video[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf] - openai.InternalServerError: Internal Server Error
======================================== 1 failed, 15 passed, 10 deselected in 291.17s (0:04:51) =========================================
root@62dc5feff87a:/workspace/vllm# 
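
For anyone triaging this locally, here is a minimal sketch (not from the PR itself) for re-running only the failed case; the node ID is copied verbatim from the short test summary above, and the script assumes it is run from a vLLM checkout with the test dependencies installed:

```python
# Re-run only the failed parametrized case; the node ID below is copied
# verbatim from the short test summary above.
import sys

import pytest

FAILED_NODE = (
    "tests/entrypoints/openai/test_video.py::test_single_chat_session_video"
    "[http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/"
    "BigBuckBunny.mp4-llava-hf/llava-onevision-qwen2-0.5b-ov-hf]"
)

if __name__ == "__main__":
    # -x stops at the first failure, -v echoes the full node ID.
    sys.exit(pytest.main(["-x", "-v", FAILED_NODE]))
```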

@Isotr0py (Collaborator)

Seems it's just a network timeout when fetching the test video; retrying.
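
A quick way to sanity-check the timeout theory is to probe the test video URL a few times; below is a minimal standard-library sketch (not part of this PR) that does just that:

```python
# Probe the test video URL a few times to distinguish a transient network
# timeout from a persistently unreachable host. Standard library only.
import time
import urllib.request

VIDEO_URL = ("http://commondatastorage.googleapis.com/gtv-videos-bucket/"
             "sample/BigBuckBunny.mp4")

for attempt in range(1, 4):
    try:
        with urllib.request.urlopen(VIDEO_URL, timeout=10) as resp:
            resp.read(1024)  # a small read is enough to confirm reachability
            print(f"attempt {attempt}: reachable (HTTP {resp.status})")
        break
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        print(f"attempt {attempt}: failed ({exc!r})")
        time.sleep(2)
```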

@Isotr0py Isotr0py merged commit c3bf9ba into vllm-project:main Jun 21, 2025
80 checks passed
Labels: documentation, multi-modality, qwen, ready