Skip to content

feat(parse): rewrite markdown relative links on directory ingest#2452

Merged
qin-ctx merged 1 commit into
volcengine:mainfrom
KCHENPENGFEI:feat/markdown-link-rewrite
Jun 5, 2026
Merged

feat(parse): rewrite markdown relative links on directory ingest#2452
qin-ctx merged 1 commit into
volcengine:mainfrom
KCHENPENGFEI:feat/markdown-link-rewrite

Conversation

@KCHENPENGFEI
Copy link
Copy Markdown
Contributor

Summary / 概述

目录入库 markdown 时,MarkdownParser 会把每个 .md 按标题拆成目录结构,导致 [x](./other.md)![](./img.png) 等相对链接失效——目标已不在原路径。本 PR 在写入每个拆分 section 前重写相对链接,补偿入库引入的路径变换(源文件→目录、目标 .md→目录或文件)。

When a markdown directory is ingested, MarkdownParser splits each .md into a directory structure, breaking relative links like [x](./other.md) and ![](./img.png) — their targets no longer live at the original paths. This PR rewrites relative links before writing each split section, compensating for the path transformations ingest introduces.

Key design / 核心设计

  • 磁盘坐标系 / Disk-coordinate:relpath 用原始磁盘绝对路径计算,对 --to 目标位置免疫。
  • 保守精确 / Conservative:仅重写磁盘存在且落在 import_root 子树内的目标;外链 / 页内锚点 / 绝对路径 / 缺失 / 越界一律原样保留。单文件入库不重写(仅 DirectoryParser 触发)。
  • 对入库零假设 / Zero ingest assumptions:目标 .md 的入库布局通过用同一个 MarkdownParser 实跑解析到内存 FS 得到(_target_split_files/_inmemory_split_files);落点是文件还是目录(_doc_landing)、目录名、章节文件、内容全部来自 parser 本身。无论 parse_content 今后怎么改,跑一遍 in-memory parse 即得最终结果,与真实入库一致,不复刻命名/拆分规则、也不假设 .md 一定目录化。

The target's ingest layout is obtained by running the SAME MarkdownParser into an in-memory FS, so whether the landing is a file or a directory, its name and the section files all come from the parser itself — nothing about ingest is reimplemented or assumed, and the rewrite follows parse_content automatically.

Test plan / 测试

  • tests/parse/test_markdown_link_rewrite.py22 passed:端到端覆盖目录链接、小文件锚点/查询、大文件章节定位、图片、裸目录、外链/绝对路径/越界、以及"未来小 .md 不拆目录"的前瞻场景。
  • ruff check clean。

🤖 Generated with Claude Code

目录入库 markdown 时,MarkdownParser 会把每个 .md 按标题拆成目录结构,导致
[x](./other.md)、![](./img.png) 等相对链接失效——目标已不在原路径。本改动在写入
每个拆分 section 前重写相对链接,补偿入库引入的路径变换(源文件→目录、目标 .md→
目录或文件)。

When a markdown directory is ingested, MarkdownParser splits each .md into a
directory structure, breaking relative links like [x](./other.md) and
![](./img.png) — their targets no longer live at the original paths. This rewrites
relative links before writing each split section, compensating for the path
transformations ingest introduces.

核心设计 / Key properties:
- 磁盘坐标系 / Disk-coordinate: relpath 用原始磁盘路径计算,对 --to 免疫。
- 保守精确 / Conservative: 仅重写磁盘存在且落在 import_root 子树内的目标;外链 /
  页内锚点 / 绝对路径 / 缺失 / 越界一律原样保留。
- 对入库零假设 / Zero ingest assumptions: 目标 .md 的入库布局通过用同一个
  MarkdownParser 实跑解析到内存 FS 得到(_target_split_files/_inmemory_split_files);
  落点是文件还是目录(_doc_landing)、目录名、章节文件、章节内容全部来自 parser
  本身。无论 parse_content 今后怎么改,跑一遍 in-memory parse 即得最终结果,重写
  自动跟随、与真实入库一致,不复刻命名/拆分规则,也不假设 .md 一定目录化。

The target's ingest layout is obtained by running the SAME MarkdownParser into an
in-memory FS: whether the landing is a file or a directory, its name, the section
files and their content all come from the parser, so nothing about ingest is
reimplemented or assumed and the rewrite follows parse_content automatically.

带锚点的 .md 经 in-memory parse 精确定位到章节文件(GitHub-slug 匹配标题);单文件
文档(含未来小 .md 不拆目录)保留后缀指向该文件;图片 / 裸目录保持路径仅调相对深度。
重写链路为 async,且仅由 DirectoryParser 触发——单文件入库不重写。

新增 tests/parse/test_markdown_link_rewrite.py,22 passed。

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 5, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 90
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 5, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix race condition with shared rewrite context

Fix race condition from shared self._rewrite_ctx instance variable in async
parse_content. When the same MarkdownParser instance processes multiple files
concurrently, _rewrite_ctx gets overwritten, causing incorrect link rewriting. Pass
rewrite context through method parameters instead.

openviking/parse/parsers/markdown.py [165-671]

-    # Relative-link rewrite context, set per parse_content() call.
-    self._rewrite_ctx = None
-
 async def parse_content(
     self,
     content: str,
     source_path: Optional[Union[str, Path]] = None,
     instruction: str = "",
     **kwargs,
 ) -> ParseResult:
     ...
         # Set up relative-link rewrite context consumed by _write_section.
         # Link rewriting is opt-in via kwargs (DirectoryParser sets these);
         # parse_content keeps the base **kwargs signature rather than adding
         # dedicated parameters, so read them from kwargs here.
-        self._rewrite_ctx = {
+        rewrite_ctx = {
             "enabled": bool(kwargs.get("enable_link_rewrite", False)),
             "source_path": source_path,
             "doc_name": self._sanitize_for_path(doc_title),
             "root_dir": root_dir,
             # Rewrite is driven only by DirectoryParser, which passes the ingest
             # root. Single-file ingestion never sets this, so it never rewrites.
             "import_root": kwargs.get("link_rewrite_root"),
+            "_split_cache": {},  # Cache for _target_split_files per parse call
         }
 
         # Find all headings
         headings = self._find_headings(content)
         logger.info(f"[MarkdownParser] Found {len(headings)} headings")
 
         # Parse and create directory structure
         await self._parse_and_create_structure(
+            content,
+            headings,
+            root_dir,
+            doc_title,
+            source_path,
+            temp_uri,
+            rewrite_ctx,  # Pass rewrite context through
+        )
     ...
-    finally:
-        # Rewrite context lives exactly one parse_content call.
-        self._rewrite_ctx = None
 
-async def _write_section(self, uri: str, content: str) -> None:
+async def _write_section(self, uri: str, content: str, rewrite_ctx: Optional[Dict] = None) -> None:
     """Write a markdown section file, rewriting relative links when enabled."""
-    ctx = self._rewrite_ctx
-    if ctx and ctx.get("enabled") and ctx.get("source_path") and ctx.get("import_root"):
+    ctx = rewrite_ctx or {}
+    if ctx.get("enabled") and ctx.get("source_path") and ctx.get("import_root"):
         content = await self._rewrite_relative_links(
             content,
             source_path=ctx["source_path"],
             doc_name=ctx["doc_name"],
             section_subpath=self._section_subpath(uri, ctx["root_dir"]),
             import_root=ctx["import_root"],
+            rewrite_ctx=ctx,  # Pass to _rewrite_relative_links if needed
         )
     await self._get_viking_fs().write_file(uri, content)
 
-async def _target_split_files(self, target_path: str) -> Optional[Dict[str, str]]:
+async def _target_split_files(self, target_path: str, rewrite_ctx: Optional[Dict] = None) -> Optional[Dict[str, str]]:
     """The target's real ingest layout {"<doc_dir>/<section...>": text} via an
     in-memory parse, cached per parse_content() call. None on parse failure."""
-    ctx = self._rewrite_ctx or {}
+    ctx = rewrite_ctx or {}
     cache = ctx.setdefault("_split_cache", {})
 
+# Update _parse_and_create_structure, _save_section, _save_merged, _split_content
+# to accept and propagate rewrite_ctx parameter to _write_section and _target_split_files
+
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a race condition in async code where a shared instance variable self._rewrite_ctx could be overwritten by concurrent parse_content calls. This is a critical issue that could lead to incorrect link rewriting. The proposed fix of passing the context through method parameters is sound and addresses the concurrency problem.

Medium

@qin-ctx qin-ctx merged commit 960d04e into volcengine:main Jun 5, 2026
3 of 5 checks passed
@github-project-automation github-project-automation Bot moved this from Backlog to Done in OpenViking project Jun 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants