
feat(vlm): add streaming response handling for OpenAI VLM #740

Merged
MaojiaSheng merged 1 commit into volcengine:main from KorenKrita:feat/openai-vlm-streaming-handler
Mar 18, 2026

Conversation

@KorenKrita
Contributor

Description

Some OpenAI-compatible APIs (e.g., certain third-party proxies or gateways) return SSE streaming responses even when stream=False is requested. This causes the existing OpenAIVLM backend to crash with AttributeError when it tries to directly access response.choices[0].message.content.

This change adds automatic detection and adaptation to OpenAIVLM: when the API returns a streaming response regardless of the requested format, the backend now transparently consumes the stream, concatenates the content, and correctly extracts token usage. Normal non-streaming responses are unaffected — no caller changes required.
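The failure mode and the adaptive fix described above can be sketched in isolation. This is a hedged illustration only: looks_like_stream, extract_content, and fake_chunk are hypothetical names standing in for the PR's _is_streaming_response()/_extract_content_and_usage() helpers, and the chunk shapes mimic, but are not, the real OpenAI SDK objects.

```python
from types import SimpleNamespace


def looks_like_stream(response):
    """Heuristic: a streaming response is iterable but has no `choices` attribute."""
    if isinstance(response, (str, bytes, list, dict)):
        return False  # basic containers iterate too, but are never SSE streams
    return hasattr(response, "__iter__") and not hasattr(response, "choices")


def extract_content(response):
    """Return the full message content, consuming the stream when needed."""
    if not looks_like_stream(response):
        # Normal path: a ChatCompletion-like object with .choices populated.
        # This is the attribute access that used to raise AttributeError
        # when a gateway returned a stream despite stream=False.
        return response.choices[0].message.content
    # Misbehaving-gateway path: concatenate delta content from every chunk.
    # NOTE: this consumes the stream; it cannot be iterated again afterwards.
    parts = []
    for chunk in response:
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)


def fake_chunk(text):
    """Build a minimal SSE-style chunk object for demonstration."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )
```

Callers do not change: both a plain completion object and a chunk iterator go through the same entry point, which is what makes the adaptation transparent.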

Related Issue

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

  • New feature (non-breaking change that adds functionality)

  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

  • Documentation update

  • Refactoring (no functional changes)

  • Performance improvement

  • Test update

Changes Made

  • Add _is_streaming_response() and _is_async_streaming_response() to distinguish streaming from non-streaming responses by checking for __iter__/__aiter__ and a choices attribute, while excluding basic types (str/list/dict)
  • Add _extract_content_from_chunk() and _extract_usage_from_chunk() helpers to extract content and token usage from individual SSE chunks
  • Add _process_streaming_chunks() sync method that consumes all chunks, concatenates content, and records the last non-zero usage
  • Add _extract_content_and_usage() / _extract_content_and_usage_async() unified entry points that auto-select streaming or non-streaming processing based on response type
  • Add _handle_response() / _handle_response_async() and _finalize_response() for common post-processing (empty response warnings, token usage update)
  • Refactor get_completion, get_completion_async, get_vision_completion, get_vision_completion_async to use the new response handling pipeline instead of direct attribute access
  • Add tests/unit/test_openai_vlm_streaming.py with 5 test classes and 14 test cases covering:
    • Streaming/non-streaming detection (including __iter__/__aiter__/_iterator/basic type exclusion)
    • Chunk content and usage extraction
    • Sync and async streaming consumption with content concatenation
    • Full Mock OpenAI client integration tests (text completion / vision completion / token usage tracking)
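The chunk-consumption step in the list above (concatenate content deltas, keep the last non-zero token usage) can be illustrated with a minimal standalone sketch. The name mirrors the PR's _process_streaming_chunks(), but the body here is an assumption, not the merged implementation.

```python
from types import SimpleNamespace


def process_streaming_chunks(chunks):
    """Consume all chunks; return (concatenated_content, last_nonzero_usage)."""
    parts, usage = [], None
    for chunk in chunks:
        # Content deltas: most chunks carry a small piece of the message.
        choices = getattr(chunk, "choices", None)
        if choices:
            delta_content = getattr(choices[0].delta, "content", None)
            if delta_content:
                parts.append(delta_content)
        # Token usage: typically only the final chunk carries non-zero usage,
        # so remember the last non-empty value seen.
        chunk_usage = getattr(chunk, "usage", None)
        if chunk_usage and getattr(chunk_usage, "total_tokens", 0):
            usage = chunk_usage
    return "".join(parts), usage


def demo_chunk(text=None, total_tokens=0):
    """Minimal stand-in for an OpenAI SSE chunk (illustrative only)."""
    choices = []
    if text is not None:
        choices = [SimpleNamespace(delta=SimpleNamespace(content=text))]
    usage = SimpleNamespace(total_tokens=total_tokens) if total_tokens else None
    return SimpleNamespace(choices=choices, usage=usage)
```

Keeping the last non-zero usage rather than summing is the safe choice here, since providers usually report cumulative totals only in the final chunk.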

Testing

  • I have added tests that prove my fix is effective or that my feature works

  • New and existing unit tests pass locally with my changes

  • I have tested this on the following platforms:

    • Linux

    • macOS

    • Windows

Checklist

  • My code follows the project's coding style

  • I have performed a self-review of my code

  • I have commented my code, particularly in hard-to-understand areas

  • I have made corresponding changes to the documentation

  • My changes generate no new warnings

  • Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Notes

Add support for handling SSE streaming responses from APIs that force
streaming format even when stream=False is requested.

Changes:
- Add _extract_content_and_usage() for sync responses
- Add _extract_content_and_usage_async() for async responses
- Add _handle_response() and _handle_response_async() for response handling
- Add _finalize_response() to eliminate duplicate post-processing logic
- Add _process_streaming_chunks() to reduce code duplication
- Add _extract_content_from_chunk() and _extract_usage_from_chunk() helpers
- Add response type detection with basic type filtering (str/list/dict/bytes)
- Update all completion methods to use new handlers
- Remove redundant _update_token_usage_from_response calls
- Add warning log for empty responses
- Add comprehensive unit tests

Refinements:
- Add choices attribute check to _iterator detection (avoid false positives)
- Add docstring warning that streaming response is consumed
- Add comment explaining async version doesn't reuse _process_streaming_chunks
- Fix test assertions to use correct get_token_usage_summary() method

Co-Authored-By: KorenKrita <KorenKrita@gmail.com>
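One refinement above notes that the async version cannot reuse _process_streaming_chunks. The reason is mechanical: an async stream must be drained with `async for`, which a synchronous helper cannot do. A hedged sketch, with illustrative names only:

```python
import asyncio
from types import SimpleNamespace


async def process_streaming_chunks_async(chunks):
    """Async twin of the sync helper: drain the stream with `async for`."""
    parts = []
    async for chunk in chunks:  # only valid inside an async function
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)


async def fake_stream(texts):
    """Minimal async generator standing in for an async SSE response."""
    for text in texts:
        yield SimpleNamespace(
            choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
        )
```

The duplication between the sync and async loops is the price of supporting both client flavors; a comment in the code (per the refinements list) documents why it is intentional.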
@MaojiaSheng MaojiaSheng merged commit 247293b into volcengine:main Mar 18, 2026
6 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 18, 2026
KorenKrita added a commit to KorenKrita/OpenViking that referenced this pull request Mar 18, 2026
qin-ctx pushed a commit that referenced this pull request Mar 18, 2026
chethanuk added a commit to chethanuk/OpenViking that referenced this pull request Mar 19, 2026
- Add .pr_agent.toml with 15 repo-specific review rules derived from real
  bug history (PRs volcengine#505, volcengine#728, volcengine#749, volcengine#740/volcengine#745, volcengine#754, volcengine#735, volcengine#767)
- Rules structured as WHEN/THEN/BECAUSE for deterministic enforcement
- Add 8 custom labels (memory-pipeline, async-change, api-breaking, etc.)
- Add ignore patterns for lock files, third_party, build artifacts
- Enable score review, TODO scan, split-PR detection, security audit
- Configure improve tool with quality threshold and extended mode
- Configure describe tool with PR diagrams and semantic file types
- Update workflow: ark-code-latest model, checkout step for .pr_agent.toml,
  move all config from inline YAML to .pr_agent.toml (single source of truth)
qin-ctx pushed a commit that referenced this pull request Mar 19, 2026
…#780)


Labels

None yet

Projects

Status: Done

2 participants