feat(parse): rewrite markdown relative links on directory ingest by KCHENPENGFEI · Pull Request #2452 · volcengine/OpenViking

KCHENPENGFEI · 2026-06-05T06:14:11Z

Summary / 概述

目录入库 markdown 时，MarkdownParser 会把每个 .md 按标题拆成目录结构，导致 [x](./other.md)、![](./img.png) 等相对链接失效——目标已不在原路径。本 PR 在写入每个拆分 section 前重写相对链接，补偿入库引入的路径变换（源文件→目录、目标 .md→目录或文件）。

When a markdown directory is ingested, MarkdownParser splits each .md into a directory structure, breaking relative links like [x](./other.md) and ![](./img.png) — their targets no longer live at the original paths. This PR rewrites relative links before writing each split section, compensating for the path transformations ingest introduces.

Key design / 核心设计

磁盘坐标系 / Disk-coordinate：relpath 用原始磁盘绝对路径计算，对 --to 目标位置免疫。
保守精确 / Conservative：仅重写磁盘存在且落在 import_root 子树内的目标；外链 / 页内锚点 / 绝对路径 / 缺失 / 越界一律原样保留。单文件入库不重写（仅 DirectoryParser 触发）。
对入库零假设 / Zero ingest assumptions：目标 .md 的入库布局通过用同一个 MarkdownParser 实跑解析到内存 FS 得到（_target_split_files/_inmemory_split_files）；落点是文件还是目录（_doc_landing）、目录名、章节文件、内容全部来自 parser 本身。无论 parse_content 今后怎么改，跑一遍 in-memory parse 即得最终结果，与真实入库一致，不复刻命名/拆分规则、也不假设 .md 一定目录化。

The target's ingest layout is obtained by running the SAME MarkdownParser into an in-memory FS, so whether the landing is a file or a directory, its name and the section files all come from the parser itself — nothing about ingest is reimplemented or assumed, and the rewrite follows parse_content automatically.

Test plan / 测试

tests/parse/test_markdown_link_rewrite.py — 22 passed：端到端覆盖目录链接、小文件锚点/查询、大文件章节定位、图片、裸目录、外链/绝对路径/越界、以及"未来小 .md 不拆目录"的前瞻场景。
ruff check clean。

🤖 Generated with Claude Code

目录入库 markdown 时，MarkdownParser 会把每个 .md 按标题拆成目录结构，导致 [x](./other.md)、![](./img.png) 等相对链接失效——目标已不在原路径。本改动在写入每个拆分 section 前重写相对链接，补偿入库引入的路径变换（源文件→目录、目标 .md→ 目录或文件）。 When a markdown directory is ingested, MarkdownParser splits each .md into a directory structure, breaking relative links like [x](./other.md) and ![](./img.png) — their targets no longer live at the original paths. This rewrites relative links before writing each split section, compensating for the path transformations ingest introduces. 核心设计 / Key properties: - 磁盘坐标系 / Disk-coordinate: relpath 用原始磁盘路径计算，对 --to 免疫。 - 保守精确 / Conservative: 仅重写磁盘存在且落在 import_root 子树内的目标；外链 / 页内锚点 / 绝对路径 / 缺失 / 越界一律原样保留。 - 对入库零假设 / Zero ingest assumptions: 目标 .md 的入库布局通过用同一个 MarkdownParser 实跑解析到内存 FS 得到（_target_split_files/_inmemory_split_files）；落点是文件还是目录（_doc_landing）、目录名、章节文件、章节内容全部来自 parser 本身。无论 parse_content 今后怎么改，跑一遍 in-memory parse 即得最终结果，重写自动跟随、与真实入库一致，不复刻命名/拆分规则，也不假设 .md 一定目录化。 The target's ingest layout is obtained by running the SAME MarkdownParser into an in-memory FS: whether the landing is a file or a directory, its name, the section files and their content all come from the parser, so nothing about ingest is reimplemented or assumed and the rewrite follows parse_content automatically. 带锚点的 .md 经 in-memory parse 精确定位到章节文件（GitHub-slug 匹配标题）；单文件文档（含未来小 .md 不拆目录）保留后缀指向该文件；图片 / 裸目录保持路径仅调相对深度。重写链路为 async，且仅由 DirectoryParser 触发——单文件入库不重写。新增 tests/parse/test_markdown_link_rewrite.py，22 passed。 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-06-05T06:15:48Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 90
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

github-actions · 2026-06-05T06:20:47Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category Suggestion Impact

Possible issue

Fix race condition with shared rewrite context

Fix race condition from shared self._rewrite_ctx instance variable in async
parse_content. When the same MarkdownParser instance processes multiple files
concurrently, _rewrite_ctx gets overwritten, causing incorrect link rewriting. Pass
rewrite context through method parameters instead.

openviking/parse/parsers/markdown.py [165-671]

-    # Relative-link rewrite context, set per parse_content() call.
-    self._rewrite_ctx = None
-
 async def parse_content(
     self,
     content: str,
     source_path: Optional[Union[str, Path]] = None,
     instruction: str = "",
     **kwargs,
 ) -> ParseResult:
     ...
         # Set up relative-link rewrite context consumed by _write_section.
         # Link rewriting is opt-in via kwargs (DirectoryParser sets these);
         # parse_content keeps the base **kwargs signature rather than adding
         # dedicated parameters, so read them from kwargs here.
-        self._rewrite_ctx = {
+        rewrite_ctx = {
             "enabled": bool(kwargs.get("enable_link_rewrite", False)),
             "source_path": source_path,
             "doc_name": self._sanitize_for_path(doc_title),
             "root_dir": root_dir,
             # Rewrite is driven only by DirectoryParser, which passes the ingest
             # root. Single-file ingestion never sets this, so it never rewrites.
             "import_root": kwargs.get("link_rewrite_root"),
+            "_split_cache": {},  # Cache for _target_split_files per parse call
         }
 
         # Find all headings
         headings = self._find_headings(content)
         logger.info(f"[MarkdownParser] Found {len(headings)} headings")
 
         # Parse and create directory structure
         await self._parse_and_create_structure(
+            content,
+            headings,
+            root_dir,
+            doc_title,
+            source_path,
+            temp_uri,
+            rewrite_ctx,  # Pass rewrite context through
+        )
     ...
-    finally:
-        # Rewrite context lives exactly one parse_content call.
-        self._rewrite_ctx = None
 
-async def _write_section(self, uri: str, content: str) -> None:
+async def _write_section(self, uri: str, content: str, rewrite_ctx: Optional[Dict] = None) -> None:
     """Write a markdown section file, rewriting relative links when enabled."""
-    ctx = self._rewrite_ctx
-    if ctx and ctx.get("enabled") and ctx.get("source_path") and ctx.get("import_root"):
+    ctx = rewrite_ctx or {}
+    if ctx.get("enabled") and ctx.get("source_path") and ctx.get("import_root"):
         content = await self._rewrite_relative_links(
             content,
             source_path=ctx["source_path"],
             doc_name=ctx["doc_name"],
             section_subpath=self._section_subpath(uri, ctx["root_dir"]),
             import_root=ctx["import_root"],
+            rewrite_ctx=ctx,  # Pass to _rewrite_relative_links if needed
         )
     await self._get_viking_fs().write_file(uri, content)
 
-async def _target_split_files(self, target_path: str) -> Optional[Dict[str, str]]:
+async def _target_split_files(self, target_path: str, rewrite_ctx: Optional[Dict] = None) -> Optional[Dict[str, str]]:
     """The target's real ingest layout {"<doc_dir>/<section...>": text} via an
     in-memory parse, cached per parse_content() call. None on parse failure."""
-    ctx = self._rewrite_ctx or {}
+    ctx = rewrite_ctx or {}
     cache = ctx.setdefault("_split_cache", {})
 
+# Update _parse_and_create_structure, _save_section, _save_merged, _split_content
+# to accept and propagate rewrite_ctx parameter to _write_section and _target_split_files
+

Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a race condition in async code where a shared instance variable self._rewrite_ctx could be overwritten by concurrent parse_content calls. This is a critical issue that could lead to incorrect link rewriting. The proposed fix of passing the context through method parameters is sound and addresses the concurrency problem.

Medium

github-project-automation Bot added this to OpenViking project Jun 5, 2026

github-project-automation Bot moved this to Backlog in OpenViking project Jun 5, 2026

github-actions Bot added the Review effort 4/5 label Jun 5, 2026

MaojiaSheng approved these changes Jun 5, 2026

View reviewed changes

qin-ctx merged commit 960d04e into volcengine:main Jun 5, 2026
3 of 5 checks passed

github-project-automation Bot moved this from Backlog to Done in OpenViking project Jun 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(parse): rewrite markdown relative links on directory ingest#2452

feat(parse): rewrite markdown relative links on directory ingest#2452
qin-ctx merged 1 commit into
volcengine:mainfrom
KCHENPENGFEI:feat/markdown-link-rewrite

KCHENPENGFEI commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

KCHENPENGFEI commented Jun 5, 2026

Summary / 概述

Key design / 核心设计

Test plan / 测试

Uh oh!

github-actions Bot commented Jun 5, 2026

PR Reviewer Guide 🔍

Uh oh!

github-actions Bot commented Jun 5, 2026

PR Code Suggestions ✨

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants