update vaka exp by fujiajie666 · Pull Request #2147 · volcengine/OpenViking

fujiajie666 · 2026-05-20T09:34:35Z

Description

Update the Vaka benchmark script

Related Issue

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactoring (no functional changes)
Performance improvement
Test update

Changes Made

Testing

I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have tested this on the following platforms:
- Linux
- macOS
- Windows

Checklist

My code follows the project's coding style
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Notes

github-actions · 2026-05-20T09:36:06Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 75
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes
- Sub-PR theme: Update import script with ingest modes and tool parts
  Relevant files: benchmark/vaka/vikingbot/import_to_ov.py
- Sub-PR theme: Update judge with ensemble mode and Azure support
  Relevant files: benchmark/vaka/vikingbot/judge.py
- Sub-PR theme: Rewrite eval pipeline to use /bot/v1/chat API
  Relevant files: benchmark/vaka/vikingbot/run_eval.py, benchmark/vaka/vikingbot/README.md
⚡ Recommended focus areas for review

Bug: Fragile tool_call/tool_result pairing
The event pairing logic for tool_call and tool_result assumes strictly alternating events. If multiple tool_call events appear consecutively, the first tool_result will be mispaired with the second tool_call, leading to incorrect memory extraction.

```python
if isinstance(events, list):
    # Pair tool_call + tool_result by order
    pending_tool_name: str | None = None
    for event in events:
        if not isinstance(event, dict):
            continue
        event_type = event.get("type")
        event_data = event.get("data", "")
        if event_type == "tool_call":
            # tool_call data format: "tool_name({args})"
            if isinstance(event_data, str):
                pending_tool_name = event_data.split("(", 1)[0].strip()
            else:
                pending_tool_name = None
        elif event_type == "tool_result":
            if pending_tool_name == "openviking_multi_read" and isinstance(event_data, str):
                llm_memories.extend(_extract_multi_read_memories(event_data))
            else:
                llm_memories.extend(_extract_memories_from_payload(event_data))
            pending_tool_name = None
```

Documentation: Missing referenced script
Step 3 references `clean_failed_eval_rows.py`, but this file is not present in the PR. Users will encounter an error when trying to run this step.

````markdown
### step 3 - Clean up lines that failed (optional, execute before Judge).
If there are rows in `vaka_qa_result.csv` where `response_input_tokens` is 0, it means that although these issues were written to the results file, the actual bot call failed or there were no valid token statistics. These failed rows should be cleaned up and rerun the run_eval.py script before using the Judge function.

```bash
uv run python benchmark/vaka/vikingbot/clean_failed_eval_rows.py --input benchmark/vaka/vikingbot/result/vaka_qa_result.csv
```
````

Suggestion: Unintuitive default model override (https://github.com/volcengine/OpenViking/pull/2147/files#diff-fa5e21ba5d499cc1d5b994ad41e6474e5bd8457d16bc4e82e7d36b37a9e34308R331-R332)
The logic to override models in single mode compares against an old default model list. This is brittle and may not behave as expected if the user provides a custom model list that partially overlaps with the old default.

```python
if not args.models or args.models == [
    "ep-20260423162207-qfqr8",
    "ep-20260501104936-72vfz",
    "ep-20260501105042-9kp5v",
]:
    args.models = ["gpt-5.4-2026-03-05"]
```

Suggestion: Unhandled ValueError for --question-index
Parsing --question-index does not handle non-integer values gracefully, which will crash the script instead of showing a user-friendly error message.

```python
if args.question_index is not None:
    indices = [int(x) for x in args.question_index.split(",")]
    indices_set = set(indices)
    qa_list = [(i, qa) for i, qa in enumerate(qa_list) if i in indices_set]
    if not qa_list:
        print(f"No questions matched --question-index={args.question_index}")
        return
    print(f"Filtered to {len(qa_list)} question(s) by --question-index={args.question_index}")
```
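
One possible way to address the tool_call/tool_result pairing concern in the reviewer guide is to queue pending tool names instead of keeping a single `pending_tool_name`. The sketch below is not taken from the PR: `pair_tool_events` is a hypothetical helper, and it assumes that results arrive in the same order as their calls and that `tool_call` data has the `"tool_name({args})"` form described in the excerpt. If the events carry a call identifier, matching on that identifier would likely be more robust than ordering.

```python
from __future__ import annotations

from collections import deque
from typing import Any


def pair_tool_events(events: list[Any]) -> list[tuple[str | None, Any]]:
    """Pair each tool_result with the oldest unmatched tool_call (FIFO).

    Hypothetical helper: assumes results are emitted in call order.
    """
    pending: deque[str | None] = deque()
    pairs: list[tuple[str | None, Any]] = []
    for event in events:
        if not isinstance(event, dict):
            continue  # skip malformed entries, as the original loop does
        event_type = event.get("type")
        data = event.get("data", "")
        if event_type == "tool_call":
            # Assumed data format: "tool_name({args})"
            name = data.split("(", 1)[0].strip() if isinstance(data, str) else None
            pending.append(name)
        elif event_type == "tool_result":
            # A result with no pending call is kept with an unknown tool name.
            name = pending.popleft() if pending else None
            pairs.append((name, data))
    return pairs
```

With two consecutive calls followed by two results, each result is attributed to its own call rather than to the most recent one. In the example, `other_tool` is a placeholder name:

```python
events = [
    {"type": "tool_call", "data": "openviking_multi_read({})"},
    {"type": "tool_call", "data": "other_tool({})"},
    {"type": "tool_result", "data": "first result"},
    {"type": "tool_result", "data": "second result"},
]
pairs = pair_tool_events(events)
# pairs == [("openviking_multi_read", "first result"), ("other_tool", "second result")]
```

The caller could then dispatch on the paired name to choose between `_extract_multi_read_memories` and `_extract_memories_from_payload`.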

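For the --question-index suggestion in the reviewer guide above, one option is to validate the value at argument-parsing time so that bad input produces a normal argparse usage error instead of a traceback. This is a minimal sketch, not the PR's code: `parse_question_indices` and `build_parser` are hypothetical names, and rejecting negative indices and empty lists is an assumption about the intended behavior.

```python
from __future__ import annotations

import argparse


def parse_question_indices(raw: str) -> set[int]:
    """Parse a comma-separated list such as "0,3,7" into a set of ints."""
    indices: set[int] = set()
    for part in raw.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray or trailing commas
        try:
            value = int(part)
        except ValueError:
            raise argparse.ArgumentTypeError(
                f"invalid question index {part!r}; expected comma-separated integers"
            ) from None
        if value < 0:
            # Assumption: indices are zero-based positions in the QA list.
            raise argparse.ArgumentTypeError(
                f"question index must be non-negative, got {value}"
            )
        indices.add(value)
    if not indices:
        raise argparse.ArgumentTypeError("no question indices given")
    return indices


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # argparse turns ArgumentTypeError into a usage message and exit code 2.
    parser.add_argument("--question-index", type=parse_question_indices, default=None)
    return parser
```

With this in place, `args.question_index` is already a set of integers (or `None`), so the filtering step can test `i in args.question_index` directly without calling `int()` itself.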
github-actions · 2026-05-20T09:37:56Z

PR Code Suggestions ✨

No code suggestions found for the PR.

update vaka exp

95df391

fujiajie666 requested a review from chenjw May 20, 2026 09:34

github-project-automation Bot added this to OpenViking project May 20, 2026

github-project-automation Bot moved this to Backlog in OpenViking project May 20, 2026

chenjw approved these changes May 20, 2026

github-actions Bot added the Review effort 4/5 label May 20, 2026

chenjw merged commit cfb680a into volcengine:main May 20, 2026
4 of 5 checks passed

github-project-automation Bot moved this from Backlog to Done in OpenViking project May 20, 2026

chenjw merged 1 commit into volcengine:main from fujiajie666:vaka_exp_update
