confluence-mdx: Fix nested list text collapse bug (autojunk=False) #806
Merged
Conversation
Fixes text collapse caused by SequenceMatcher's autojunk=True (the default) skipping local matches on Korean text containing repeated patterns (e.g. "700MB를 초과"), which produces large insert/delete pairs.

- text_transfer.py: pass autojunk=False to the SequenceMatcher in transfer_text_changes()
- xhtml_patcher.py: pass autojunk=False to the SequenceMatcher in _apply_text_changes()
- Add a unit test for long text containing repeated patterns
- Change expected_status of test case 544383693 to pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
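The change described in the bullets above amounts to passing autojunk=False at each SequenceMatcher construction site. A minimal sketch of the pattern (the function and variable names here are illustrative, not the actual confluence-mdx code):

```python
from difflib import SequenceMatcher

def diff_opcodes(old: str, new: str):
    # Disable the autojunk heuristic so long Korean text with
    # repeated phrases still gets fine-grained local matches
    # instead of one large insert/delete pair.
    sm = SequenceMatcher(None, old, new, autojunk=False)
    return sm.get_opcodes()
```

autojunk only changes behavior for sequences of 200+ characters, so short-text diffs are unaffected by the flag.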
Summary
Fixes a bug where autojunk=True (the default) skips local matching on Korean text with repeated patterns, generating large insert/delete pairs that collapse nested list text. Applies autojunk=False to the SequenceMatcher in both text_transfer.py and xhtml_patcher.py.

Root Cause
difflib.SequenceMatcher's autojunk feature treats characters whose frequency exceeds 1% in sequences of 200 or more characters as "junk". In Korean text containing a repeated pattern (e.g. "700MB를 초과" appearing twice), common Hangul characters get classified as junk, so accurate local matching becomes impossible and the matcher emits a large insert (327 chars) + delete (326 chars) pair, collapsing the text.

Test plan
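The heuristic is easy to reproduce outside this project. In the sketch below (a synthetic example, not the actual Confluence text), a 200-character run makes 'x' "popular", so the default matcher refuses to anchor on it and the two nearly identical strings score as almost entirely different:

```python
from difflib import SequenceMatcher

# 'x' occupies far more than 1% of a 200+ char sequence,
# so the autojunk heuristic marks it as junk.
a = "ab" + "x" * 200
b = "x" * 200 + "ab"

default = SequenceMatcher(None, a, b)                 # autojunk=True
fixed = SequenceMatcher(None, a, b, autojunk=False)

# With autojunk the 200-char run of 'x' can never start a match,
# so only "ab" matches; without it the run matches in full.
print(round(default.ratio(), 3))  # near 0.01
print(round(fixed.ratio(), 3))    # near 0.99
```

The same mechanism applies to Hangul: in a long Korean paragraph, a handful of common syllables easily cross the 1% threshold, which is why the fix disables the heuristic rather than tuning it.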
🤖 Generated with Claude Code