fix(parse): prevent merged filename collision on duplicate headings #1005

Merged

qin-ctx merged 1 commit into volcengine:main from deepakdevp:fix/markdown-merged-filename-collision on Mar 26, 2026

Conversation

@deepakdevp
Contributor

Summary

  • Fix content loss when importing Markdown files with duplicate top-level headings
  • _generate_merged_filename() now always appends a content-based hash suffix (8-char SHA-256 of name:index pairs) to guarantee uniqueness
  • Previously, two merge groups with the same first heading and count would produce identical filenames, causing the second write to overwrite the first

Root Cause

_generate_merged_filename() generated filenames like {first_heading}_{count}more. The hash suffix only triggered when len(name) > MAX_MERGED_FILENAME_LENGTH. Two merge groups sharing the same first heading name and same section count produced identical filenames → file overwrite → content loss.
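The collision described above can be reproduced with a minimal sketch. This is not the project's actual code: the function name `old_merged_filename`, the value of `MAX_MERGED_FILENAME_LENGTH`, the `(heading, section_index)` tuple layout, the meaning of `count`, and the way the long-name branch builds its hash are all assumptions made for illustration.

```python
import hashlib

# Hypothetical value; the real constant is not shown in this PR.
MAX_MERGED_FILENAME_LENGTH = 32


def old_merged_filename(sections):
    """Illustrative pre-fix naming.

    `sections` is assumed to be a list of (heading, section_index) tuples.
    """
    first_heading = sections[0][0]
    # Assumption: count is the number of sections after the first one.
    count = len(sections) - 1
    name = f"{first_heading}_{count}more"
    # The hash suffix is only added when the name is too long,
    # so short names never get a distinguishing suffix.
    if len(name) > MAX_MERGED_FILENAME_LENGTH:
        digest = hashlib.sha256(name.encode("utf-8")).hexdigest()[:8]
        name = f"{name[:MAX_MERGED_FILENAME_LENGTH - 9]}_{digest}"
    return name


# Two different merge groups that start with the same heading
# and contain the same number of sections.
group_a = [("Overview", 0), ("Setup", 1), ("Usage", 2)]
group_b = [("Overview", 3), ("Install", 4), ("Notes", 5)]

# Both map to the same filename, so the second write overwrites the first.
assert old_merged_filename(group_a) == old_merged_filename(group_b)
```

In this sketch both groups resolve to `Overview_2more`, which matches the overwrite scenario reported in #1004.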

Fix

Always compute and append the hash. Hash input uses name:section_index pairs (not just names) because duplicate headings share names — the section index is what differentiates them.
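A rough sketch of the always-hash approach, under stated assumptions: the function name `merged_filename_with_hash`, the `|` separator between pairs, and the `(heading, section_index)` tuple layout are hypothetical and may differ from the merged implementation. Only the 8-character SHA-256 suffix over `name:index` pairs comes from the PR description.

```python
import hashlib


def merged_filename_with_hash(sections):
    """Illustrative post-fix naming: always append a content-based suffix.

    `sections` is assumed to be a list of (heading, section_index) tuples.
    """
    first_heading = sections[0][0]
    # Assumption: count is the number of sections after the first one.
    count = len(sections) - 1
    # Include the section index in the hash input: duplicate headings share
    # names, so names alone would not tell two groups apart.
    hash_input = "|".join(f"{name}:{index}" for name, index in sections)
    digest = hashlib.sha256(hash_input.encode("utf-8")).hexdigest()[:8]
    return f"{first_heading}_{count}more_{digest}"


# Same heading names in both groups; only the section indices differ.
group_a = [("Intro", 0), ("Details", 1)]
group_b = [("Intro", 2), ("Details", 3)]

name_a = merged_filename_with_hash(group_a)
name_b = merged_filename_with_hash(group_b)
```

Because the indices feed the hash, `name_a` and `name_b` get different suffixes even though every heading name matches. An 8-character truncated digest makes a collision very unlikely rather than strictly impossible.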

Fixes #1004.

Test plan

  • 3 new tests pass (pytest tests/parse/test_markdown_filename_collision.py)
  • 23 existing markdown tests pass — no regressions (pytest tests/parse/test_markdown_char_limit.py)
  • Ruff clean

_generate_merged_filename() now always appends a content-based hash
suffix derived from all section names and indices. This prevents
filename collisions when multiple merge groups share the same first
heading name and count, which previously caused file overwrites and
content loss during markdown resource ingestion.

Fixes volcengine#1004.
@github-actions

Failed to generate code suggestions for PR

@qin-ctx merged commit 99bc8e7 into volcengine:main on Mar 26, 2026
7 checks passed
@github-project-automation bot moved this from Backlog to Done in OpenViking project on Mar 26, 2026
zeattacker pushed a commit to zeattacker/OpenViking that referenced this pull request Mar 27, 2026
…olcengine#1005)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: MarkdownParser: duplicate top-level headings cause merged filename collisions and overwrites, so ingested documents are missing content

2 participants