
update vaka exp #2147

Merged
chenjw merged 1 commit into volcengine:main from fujiajie666:vaka_exp_update on May 20, 2026

@fujiajie666
Collaborator

Description

Update the Vaka benchmark script

Related Issue

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Notes

@chenjw chenjw merged commit cfb680a into volcengine:main May 20, 2026
4 of 5 checks passed
@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 75
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Update import script with ingest modes and tool parts

Relevant files:

  • benchmark/vaka/vikingbot/import_to_ov.py

Sub-PR theme: Update judge with ensemble mode and Azure support

Relevant files:

  • benchmark/vaka/vikingbot/judge.py

Sub-PR theme: Rewrite eval pipeline to use /bot/v1/chat API

Relevant files:

  • benchmark/vaka/vikingbot/run_eval.py
  • benchmark/vaka/vikingbot/README.md

⚡ Recommended focus areas for review

Bug: Fragile tool_call/tool_result pairing

The event pairing logic for tool_call and tool_result assumes strictly alternating events. If multiple tool_call events appear consecutively, the first tool_result will be mispaired with the second tool_call, leading to incorrect memory extraction.

if isinstance(events, list):
    # Pair tool_call + tool_result by order
    pending_tool_name: str | None = None
    for event in events:
        if not isinstance(event, dict):
            continue
        event_type = event.get("type")
        event_data = event.get("data", "")

        if event_type == "tool_call":
            # tool_call data format: "tool_name({args})"
            if isinstance(event_data, str):
                pending_tool_name = event_data.split("(", 1)[0].strip()
            else:
                pending_tool_name = None

        elif event_type == "tool_result":
            if pending_tool_name == "openviking_multi_read" and isinstance(event_data, str):
                llm_memories.extend(_extract_multi_read_memories(event_data))
            else:
                llm_memories.extend(_extract_memories_from_payload(event_data))
            pending_tool_name = None
Documentation: Missing referenced script

Step 3 references clean_failed_eval_rows.py, but this file is not present in the PR. Users will encounter an error when trying to run this step.

### step 3 - Clean up failed rows (optional, run before Judge)

If `vaka_qa_result.csv` contains rows where `response_input_tokens` is 0,
those rows were written to the results file even though the bot call
failed or returned no valid token statistics.
Remove these failed rows and rerun the `run_eval.py` script before using the
Judge function.

```bash
uv run python benchmark/vaka/vikingbot/clean_failed_eval_rows.py --input benchmark/vaka/vikingbot/result/vaka_qa_result.csv
```

Suggestion: Unintuitive default model override (https://github.com/volcengine/OpenViking/pull/2147/files#diff-fa5e21ba5d499cc1d5b994ad41e6474e5bd8457d16bc4e82e7d36b37a9e34308R331-R332)

The logic to override models in single mode compares against an old default model list. This is brittle and may not behave as expected if the user provides a custom model list that partially overlaps with the old default.

if not args.models or args.models == ["ep-20260423162207-qfqr8", "ep-20260501104936-72vfz", "ep-20260501105042-9kp5v"]:
    args.models = ["gpt-5.4-2026-03-05"]
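
A less brittle alternative might be to leave the `--models` default as `None` and choose the per-mode default only when the flag was omitted, so an explicit user list is never silently replaced. This is a sketch under assumptions: the `--mode` flag, its `single`/`ensemble` choices, the `resolve_models` helper, and the ensemble placeholder names are hypothetical and not taken from the PR. Only the single-mode model name comes from the snippet above.

```python
import argparse

# Single-mode default taken from the snippet above.
SINGLE_MODE_DEFAULT = ["gpt-5.4-2026-03-05"]
# Placeholder names for illustration only; not real endpoints.
ENSEMBLE_MODE_DEFAULT = ["model-a", "model-b", "model-c"]


def resolve_models(argv: list[str]) -> list[str]:
    """Return the model list, applying a per-mode default only if --models is absent."""
    parser = argparse.ArgumentParser()
    # Hypothetical flag; the real script may select the mode differently.
    parser.add_argument("--mode", choices=["single", "ensemble"], default="single")
    # default=None distinguishes "flag omitted" from "user passed a list".
    parser.add_argument("--models", nargs="+", default=None)
    args = parser.parse_args(argv)
    if args.models is None:
        if args.mode == "single":
            return list(SINGLE_MODE_DEFAULT)
        return list(ENSEMBLE_MODE_DEFAULT)
    return args.models


print(resolve_models([]))
print(resolve_models(["--models", "my-model"]))
```

Because the check is `args.models is None`, no comparison against a hard-coded list of old defaults is needed.
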
Suggestion: Unhandled ValueError for --question-index

Parsing --question-index does not handle non-integer values gracefully, which will crash the script instead of showing a user-friendly error message.

if args.question_index is not None:
    indices = [int(x) for x in args.question_index.split(",")]
    indices_set = set(indices)
    qa_list = [(i, qa) for i, qa in enumerate(qa_list) if i in indices_set]
    if not qa_list:
        print(f"No questions matched --question-index={args.question_index}")
        return
    print(f"Filtered to {len(qa_list)} question(s) by --question-index={args.question_index}")
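
One possible way to handle this is to move the parsing into an argparse `type=` callable that raises `argparse.ArgumentTypeError`, so argparse prints a usage error and exits instead of surfacing a traceback. This is a minimal sketch: `parse_index_list` is a hypothetical helper, and how the real script declares `--question-index` is not shown in the PR excerpt.

```python
import argparse


def parse_index_list(value: str) -> list[int]:
    """Parse a comma-separated list of integers such as "0,3,7"."""
    try:
        return [int(part) for part in value.split(",")]
    except ValueError:
        # argparse turns this into a usage message and exits with status 2.
        raise argparse.ArgumentTypeError(
            f"expected comma-separated integers, got {value!r}"
        ) from None


parser = argparse.ArgumentParser()
parser.add_argument("--question-index", type=parse_index_list, default=None)

args = parser.parse_args(["--question-index", "0, 3,7"])
print(args.question_index)
```

With this in place, `args.question_index` is already a list of integers (or `None`), so the filtering code can build `indices_set` directly without calling `int()` itself.
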

@github-project-automation github-project-automation Bot moved this from Backlog to Done in OpenViking project May 20, 2026
@github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.
