
feat: Add LoCoMo benchmark evaluation scripts for the Bailian memory library #1664

Merged
yeshion23333 merged 1 commit into volcengine:main from yangxinxin-7:benchmark/bailian-memory
Apr 23, 2026

Conversation

@yangxinxin-7
Collaborator

@yangxinxin-7 yangxinxin-7 commented Apr 23, 2026

Summary

  • Add the benchmark/locomo/bailian_memory/ directory for evaluating Alibaba Cloud Bailian (ModelStudio) Memory on the LoCoMo long-term conversation memory dataset
  • ingest.py: ingest LoCoMo conversations into Bailian memory library, with resume support and per-sample/session filtering
  • eval.py: run QA evaluation via SearchMemory retrieval + Qwen LLM, with concurrent threading and resume support
  • delete_user.py: clean up memory nodes for specified users
  • README.md: full setup guide, usage instructions, and notes on why user profile extraction is not recommended for the LoCoMo scenario

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🏅 Score: 80
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Performance: Quadratic time in judge-only mode

In run_judge_only, each grade_one thread rewrites the entire CSV file every time it grades a single row. With k ungraded rows and m total rows, that is O(k*m) row writes overall, which can be slow for large datasets. Because the rewrite happens while holding file_lock, it also serializes the worker threads. Consider batching writes or updating only the necessary rows.

def grade_one(idx: int) -> None:
    row = rows[idx]
    label, reasoning = judge_answer(
        row.get("question", ""),
        row.get("answer", ""),
        row.get("response", ""),
        args.judge_base_url,
        judge_token,
        args.judge_model,
    )
    row["result"] = label
    row["reasoning"] = reasoning
    with file_lock:
        tmp = args.output + ".tmp"
        with open(tmp, "w", encoding="utf-8", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(rows)
        os.replace(tmp, args.output)
    print(f"  Graded {row.get('question_id', '?')}: {label}", file=sys.stderr)
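One way to act on the batching suggestion is sketched below. It is not the PR's code: `write_csv_atomic`, `grade_all`, the `judge` callable, and the `flush_every` parameter are hypothetical names introduced for illustration. The sketch keeps the temp-file-plus-`os.replace` write from the excerpt above but flushes only after every `flush_every` graded rows, plus once at the end, so the number of full-file writes drops from k to roughly k / flush_every.

```python
import csv
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def write_csv_atomic(path, rows, fieldnames):
    """Write all rows to a temp file, then swap it into place with os.replace."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    os.replace(tmp, path)


def grade_all(rows, fieldnames, output, judge, flush_every=50, threads=8):
    """Grade ungraded rows concurrently; return the number of full-file writes.

    `judge` is a hypothetical callable taking a row dict and returning
    (label, reasoning). It stands in for the judge_answer call in the excerpt.
    """
    lock = threading.Lock()
    state = {"pending": 0, "writes": 0}

    def grade_one(idx):
        row = rows[idx]
        label, reasoning = judge(row)  # slow network call, done outside the lock
        with lock:
            row["result"] = label
            row["reasoning"] = reasoning
            state["pending"] += 1
            # Flush only once enough new results have accumulated.
            if state["pending"] >= flush_every:
                write_csv_atomic(output, rows, fieldnames)
                state["pending"] = 0
                state["writes"] += 1

    todo = [i for i, r in enumerate(rows) if not r.get("result")]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(grade_one, todo))  # list() surfaces worker exceptions

    # Final flush for any results not yet on disk.
    if state["pending"]:
        write_csv_atomic(output, rows, fieldnames)
        state["writes"] += 1
    return state["writes"]
```

The trade-off is that a crash can lose up to `flush_every - 1` graded rows, which a resume run would then re-grade. A smaller `flush_every` narrows that window at the cost of more writes.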
Documentation: Missing parameters

The README documents --model and --top-k parameters for eval.py that are not implemented in the current code, which will cause user confusion.

# Specify the model and the number of retrieved items
python eval.py --model qwen-max --top-k 15 --threads 20
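If the fix is to implement the flags rather than remove them from the README, the argument parser could look roughly like the sketch below. This is an assumption about how eval.py might be wired, not its actual code: `build_parser`, the default values, and the help strings are all hypothetical. Only the flag names `--model`, `--top-k`, and `--threads` come from the README command above.

```python
import argparse


def build_parser():
    """Hypothetical parser exposing the flags shown in the README example."""
    parser = argparse.ArgumentParser(description="LoCoMo QA evaluation (sketch)")
    # Defaults below are placeholders, not values taken from the PR.
    parser.add_argument("--model", default="qwen-max",
                        help="LLM used to answer questions")
    parser.add_argument("--top-k", type=int, default=10,
                        help="number of memory items to retrieve per question")
    parser.add_argument("--threads", type=int, default=10,
                        help="number of concurrent worker threads")
    return parser


# Parse the exact arguments from the README example.
args = build_parser().parse_args(
    ["--model", "qwen-max", "--top-k", "15", "--threads", "20"]
)
print(args.model, args.top_k, args.threads)
```

argparse maps `--top-k` to the attribute `args.top_k`, so the retrieval call would read the value from there.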


@github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.

@yeshion23333 yeshion23333 merged commit e706a8c into volcengine:main Apr 23, 2026
5 of 6 checks passed
@github-project-automation github-project-automation Bot moved this from Backlog to Done in OpenViking project Apr 23, 2026
yeshion23333 pushed a commit that referenced this pull request Apr 23, 2026

Projects

Status: Done


2 participants