⚡️ Speed up method Qwen3VLBlockV1.run by 45% in PR #1968 (remote-exec-for-all-models) #1971
Open
codeflash-ai[bot] wants to merge 1 commit into `remote-exec-for-all-models` from `codeflash/optimize-pr1968-2026-02-04T20.51.48`
⚡️ This pull request contains optimizations for PR #1968
If you approve this dependent PR, these changes will be merged into the original PR branch `remote-exec-for-all-models`.

📄 45% (0.45x) speedup for `Qwen3VLBlockV1.run` in `inference/core/workflows/core_steps/models/foundation/qwen3vl/v1.py`

⏱️ Runtime: 5.64 milliseconds → 3.90 milliseconds (best of 71 runs)

📝 Explanation and details
The optimized code achieves a 44% speedup (5.64ms → 3.90ms) primarily by batching remote HTTP calls to reduce network overhead when processing multiple images.
## Key Optimization: Request Batching in `run_remotely()`

**What changed:**

- **Original**: Looped through images and made **one HTTP request per image** via `client.infer_lmm()` (the line profiler shows 89.2% of time spent in this loop)
- **Optimized**: Collects all `base64_image` values into a list and makes **a single batched HTTP call** for multiple images, with special handling for single-image cases

**Why this is faster** (a minimal sketch of the change follows this list):

1. **Network latency reduction**: Each HTTP request incurs connection overhead, SSL handshake, and round-trip time. Batching N images into one request eliminates (N-1) network round-trips
2. **Line profiler evidence**: The original's `client.infer_lmm` calls consumed 68.7ms (89.2% of the 77ms total), while the optimized version spends only 1.37ms (21.6% of the 6.3ms total) on HTTP calls
3. **The InferenceHTTPClient supports batch inputs**: The `infer_lmm` method accepts `List[ImagesReference]` and returns `List[dict]`, enabling efficient batch processing server-side
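To make the before/after concrete, here is a minimal sketch of the two code paths. It assumes a simplified `infer_lmm` call; the keyword argument name `inference_input` and the two helper names are illustrative, not the block's actual signature:

```python
from typing import Any, Dict, List

# Before: one HTTP round-trip per image; the line profiler attributed
# 89.2% of total runtime to this loop.
def run_remotely_per_image(client, images: List[Dict[str, Any]], prompt: str) -> List[dict]:
    predictions = []
    for image in images:
        # Every call pays connection setup, SSL handshake, and round-trip time.
        predictions.append(client.infer_lmm(inference_input=image["base64_image"], prompt=prompt))
    return predictions

# After: gather the base64 payloads first, then issue a single batched call.
def run_remotely_batched(client, images: List[Dict[str, Any]], prompt: str) -> List[dict]:
    if not images:
        return []  # empty input returns [] without any HTTP traffic
    payloads = [image["base64_image"] for image in images]
    if len(payloads) == 1:
        # Single-image case keeps a separate code path for compatibility.
        return [client.infer_lmm(inference_input=payloads[0], prompt=prompt)]
    # infer_lmm accepts List[ImagesReference] and returns List[dict],
    # so N images cost one round-trip instead of N.
    return client.infer_lmm(inference_input=payloads, prompt=prompt)
```

The empty-list and single-image branches mirror the compatibility notes in the impact assessment below.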
## Secondary Optimization: Local Path Efficiency

**What changed:**

- Eliminated the `prompts` list construction: `prompts = [combined_prompt] * len(inference_images)`
- Now reuses the single `combined_prompt` string directly in the loop

**Why this helps** (a toy sketch follows this list):

- Avoids allocating a list containing N duplicate string references
- Reduces memory allocations and list iteration overhead
- Line profiler shows a minor improvement in the local path (138.8ms → 137.8ms)
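A toy sketch isolating just the allocation difference on the local path; the `render_requests_*` helpers are hypothetical stand-ins for the loop body, not functions from the PR:

```python
from typing import List, Tuple

def render_requests_original(inference_images: List[bytes], combined_prompt: str) -> List[Tuple[bytes, str]]:
    # Builds a throwaway list holding N duplicate references to one string...
    prompts = [combined_prompt] * len(inference_images)
    # ...then iterates both lists in lockstep.
    return [(image, prompt) for image, prompt in zip(inference_images, prompts)]

def render_requests_optimized(inference_images: List[bytes], combined_prompt: str) -> List[Tuple[bytes, str]]:
    # Reuses the single prompt string directly; no intermediate list, no zip.
    return [(image, combined_prompt) for image in inference_images]
```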
## Test Results Analysis
The annotated tests show the optimization excels when:
- **Multiple images processed remotely** (`test_large_scale_remote_many_images`): The batching dramatically reduces overhead for 200 images (see the back-of-envelope sketch after this list)
- **Single remote image**: Still benefits from the cleaner code path (no loop overhead)
- **Local inference**: Minor gains from the eliminated list allocation (2-5% improvement in local tests)
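As a rough back-of-envelope check on the 200-image case (the 50 ms round-trip time is an assumed figure for illustration, not a measurement from the PR):

```python
RTT_MS = 50.0    # assumed per-request round-trip latency
N_IMAGES = 200   # matches test_large_scale_remote_many_images

per_image_total = N_IMAGES * RTT_MS      # one request per image
batched_total = 1 * RTT_MS               # a single batched request
saved = per_image_total - batched_total  # (N - 1) round-trips eliminated

# per-image: 10000 ms, batched: 50 ms, saved: 9950 ms
print(f"per-image: {per_image_total:.0f} ms, batched: {batched_total:.0f} ms, saved: {saved:.0f} ms")
```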
## Impact Assessment

Without `function_references`, we cannot definitively determine hot-path usage, but the optimization is **universally beneficial** for remote inference workloads:

- Any workflow processing multiple images remotely will see a significant speedup
- No breaking changes to API or behavior (an empty image list correctly returns `[]`)
- Single-image cases maintain compatibility with a separate code path

✅ Correctness verification report:
🌀 Generated Regression Tests (collapsed)
To edit these changes, `git checkout codeflash/optimize-pr1968-2026-02-04T20.51.48` and push.