[v1.3] Enhance text decoding #1166
      const uint8Array = new Uint8Array(arrayBuffer);
      if (isText(uint8Array)) {
    -   resource.content = await data.text();
    +   resource.content = await readBlobContent(data, contentType);
Pull request overview
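This PR enhances ScriptCat's text decoding. It mainly addresses scripts downloaded from the network whose encoding is not UTF-8 and whose charset is not set correctly. By implementing BOM detection and a heuristic based on null-byte patterns, it can automatically identify UTF-8, UTF-16LE/BE, and UTF-32LE/BE.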
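BOM detection as described in the overview can be sketched as follows. This is a minimal illustration, not the PR's detectBOM: the name `sniffBOM` and its return labels are assumptions made for the example.

```typescript
// Hypothetical helper: returns an encoding label when the bytes start with a
// known BOM, or "" when none is present. Not the PR's detectBOM.
type BomEncoding = "utf-8" | "utf-16le" | "utf-16be" | "utf-32le" | "utf-32be" | "";

function sniffBOM(bytes: Uint8Array): BomEncoding {
  const [b0, b1, b2, b3] = [bytes[0], bytes[1], bytes[2], bytes[3]];
  // Check the 4-byte BOMs first: the UTF-32LE BOM (FF FE 00 00) begins with
  // the UTF-16LE BOM (FF FE), so the longer pattern has to win.
  if (bytes.length >= 4) {
    if (b0 === 0xff && b1 === 0xfe && b2 === 0x00 && b3 === 0x00) return "utf-32le";
    if (b0 === 0x00 && b1 === 0x00 && b2 === 0xfe && b3 === 0xff) return "utf-32be";
  }
  if (bytes.length >= 3 && b0 === 0xef && b1 === 0xbb && b2 === 0xbf) return "utf-8";
  if (bytes.length >= 2) {
    if (b0 === 0xff && b1 === 0xfe) return "utf-16le";
    if (b0 === 0xfe && b1 === 0xff) return "utf-16be";
  }
  return "";
}
```

Note that FF FE 00 00 is inherently ambiguous (it is also a UTF-16LE BOM followed by U+0000); the sketch prefers UTF-32LE, which is a judgment call rather than something the PR text confirms.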
Changes:
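- Added readBlobContent, which implements smart encoding detection and tries, in priority order: Content-Type charset → BOM detection → null-pattern heuristic → UTF-8 validation → windows-1252 fallback
- Added the detectBOM helper to recognize the various BOM markers
- Added the guessByNullPattern helper, which identifies UTF-16/32 encodings from the distribution of null bytes
- Optimized parseCharsetFromContentType to return an empty string instead of null, simplifying the logic
- Optimized decodeUTF32 performance by creating a DataView only in the BE branch
- Applied the new encoding detection logic to script fetching, resource loading, favicon fetching, and similar scenarios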
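The null-pattern heuristic named in the list above relies on the fact that mostly-Latin text encoded as UTF-16 or UTF-32 has zero bytes at predictable positions. The sketch below illustrates the idea only; the name `guessFromNulls` and both thresholds are assumptions, and the PR's guessByNullPattern may work differently.

```typescript
// Hypothetical sketch; the PR's guessByNullPattern may use other rules.
type WideGuess = "utf-16le" | "utf-16be" | "utf-32le" | "utf-32be" | "";

function guessFromNulls(bytes: Uint8Array, limit: number = bytes.length): WideGuess {
  const n = Math.min(bytes.length, limit);
  if (n < 4) return "";
  // Count null bytes by their position within each 4-byte group.
  let c0 = 0, c1 = 0, c2 = 0, c3 = 0;
  for (let i = 0; i < n; i++) {
    if (bytes[i] !== 0) continue;
    switch (i & 3) {
      case 0: c0++; break;
      case 1: c1++; break;
      case 2: c2++; break;
      default: c3++;
    }
  }
  const perSlot = Math.floor(n / 4);
  const high = perSlot * 0.7; // "mostly null" (assumed threshold)
  const low = perSlot * 0.1;  // "almost never null" (assumed threshold)

  // UTF-32: the two high-order bytes of nearly every code point are zero.
  if (c2 >= high && c3 >= high && c0 <= low) return "utf-32le";
  if (c0 >= high && c1 >= high && c3 <= low) return "utf-32be";
  // UTF-16 with mostly Latin text: the high byte of each code unit is zero.
  if (c1 + c3 >= 2 * high && c0 + c2 <= 2 * low) return "utf-16le";
  if (c0 + c2 >= 2 * high && c1 + c3 <= 2 * low) return "utf-16be";
  return "";
}
```

A heuristic like this only works when the text is dominated by ASCII-range characters, which is typical for source code; UTF-16 text that is mostly CJK contains few null bytes and would fall through to the later steps.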
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
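| src/pkg/utils/encoding.ts | Core implementation: adds the smart encoding detection function and helper algorithms, and optimizes existing encoding functions |
| src/pkg/utils/encoding.test.ts | Test update: changes the parseCharsetFromContentType return-value assertions from null to an empty string |
| src/pkg/utils/script.ts | Integration: uses readBlobContent in fetchScriptBody instead of resp.text() |
| src/pages/store/favicons.ts | Integration: uses readBlobContent in fetchIconByDomain to handle HTML content |
| src/app/service/service_worker/resource.ts | Integration: uses readBlobContent in loadByUrl to handle text resources, and tidies up extraction of the contentType variable |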
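To illustrate why returning an empty string instead of null can simplify callers, here is a hedged sketch of a charset parser. `charsetFromContentType` is a hypothetical stand-in; the real parseCharsetFromContentType is not shown on this page and may parse differently.

```typescript
// Hypothetical stand-in for parseCharsetFromContentType.
function charsetFromContentType(contentType: string | null): string {
  if (!contentType) return "";
  // Matches charset=xyz or charset="xyz", case-insensitively.
  const match = /charset\s*=\s*("?)([^";\s]+)\1/i.exec(contentType);
  return match?.[2]?.toLowerCase() ?? "";
}

// With "" as the "not found" value, a single truthiness check covers a null
// header, a header without a charset parameter, and an empty charset value.
const charset = charsetFromContentType("text/javascript; charset=GBK");
if (charset) {
  console.log(`decode with ${charset}`);
}
```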
    export const readBlobContent = async (blob: Blob | File | Response, contentType: string | null): Promise<string> => {
      const buffer = await blob.arrayBuffer();
      const uint8 = new Uint8Array(buffer);

      if (uint8.length === 0) {
        return "";
      }

      // Try the charset from the Content-Type header first
      const headerCharset = parseCharsetFromContentType(contentType);
      if (headerCharset) {
        try {
          // Verify that the charset is valid
          return bytesDecode(headerCharset, uint8);
        } catch (e: any) {
          console.warn(`Invalid charset from Content-Type header: ${headerCharset}, error: ${e.message}`);
        }
      }

      // BOM detection (highest priority)
      const bomEncoding = detectBOM(uint8);
      if (bomEncoding) return bytesDecode(bomEncoding, uint8);

      const checkSize = Math.min(uint8.length, 16 * 1024);

      if (uint8.length % 2 === 0) {
        // Heuristic detection (first 16 KB)
        const heuristicEncoding = guessByNullPattern(uint8, checkSize);
        if (heuristicEncoding) return bytesDecode(heuristicEncoding, uint8);
      }

      // UTF-8 validation → legacy fallback
      let encoding = "utf-8";
      try {
        // Strict mode – throws on invalid sequences
        new TextDecoder("utf-8", { fatal: true }).decode(uint8.subarray(0, checkSize));
      } catch {
        // Invalid UTF-8 → most common real-world fallback
        encoding = "windows-1252"; // OR detectEncoding(uint8, null)
      }

      return bytesDecode(encoding, uint8);
    };
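One detail that may be worth checking in the UTF-8 validation step above: it validates only a prefix (the first 16 KB). With the standard TextDecoder, a non-streaming fatal decode treats an incomplete trailing sequence as an error, so a cut that lands inside a multi-byte character could reject otherwise valid UTF-8. The sketch below shows one possible way to tolerate that using the `stream` option; `prefixIsValidUtf8` is a hypothetical name and this is not part of the PR.

```typescript
// Hypothetical helper: reports whether the first `limit` bytes look like valid UTF-8.
function prefixIsValidUtf8(bytes: Uint8Array, limit: number): boolean {
  const prefix = bytes.subarray(0, Math.min(bytes.length, limit));
  // Only treat the cut as "mid-stream" when bytes were actually left out.
  const truncated = prefix.length < bytes.length;
  try {
    // fatal: true makes decode() throw on malformed input.
    // stream: true signals that more input may follow, so an incomplete
    // trailing sequence is buffered instead of being reported as an error.
    new TextDecoder("utf-8", { fatal: true }).decode(prefix, { stream: truncated });
    return true;
  } catch {
    return false;
  }
}
```

The trade-off is that a genuinely malformed lead byte at the very end of the prefix goes unnoticed, which seems acceptable for a sniffing step whose result only selects a decoder.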
Copilot
AI
Jan 31, 2026
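The new readBlobContent function is the core of this PR, but it lacks dedicated unit tests. Since it chains several encoding detection steps (Content-Type header → BOM → null-pattern heuristic → UTF-8 validation → fallback), consider adding test cases that cover the following scenarios:
- A valid Content-Type charset
- An invalid Content-Type charset
- Each BOM (UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE)
- UTF-16/32 input without a BOM that is detectable by null pattern
- UTF-8 text (no BOM)
- Invalid UTF-8 falling back to windows-1252
- An empty Blob
- Short text under 64 bytes
The other utility functions in the same directory all have corresponding test files, and encoding.test.ts already contains 510 lines of test code, which suggests the project expects high test coverage.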
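As a starting point for the scenarios listed above, byte fixtures can be built without any external tooling. The sketch below is an assumption-laden illustration: `encodeUtf16` and the fixture names are hypothetical, and no particular test runner is assumed because the project's test setup is not shown here.

```typescript
// Hypothetical fixture builder for encoding tests; not part of the PR.
function encodeUtf16(text: string, littleEndian: boolean, withBOM: boolean): Uint8Array {
  const units = withBOM ? "\uFEFF" + text : text;
  const out = new Uint8Array(units.length * 2);
  const view = new DataView(out.buffer);
  for (let i = 0; i < units.length; i++) {
    // charCodeAt returns one UTF-16 code unit, so surrogate pairs are preserved.
    view.setUint16(i * 2, units.charCodeAt(i), littleEndian);
  }
  return out;
}

// Inputs for several of the scenarios in the list.
const sample = "// ==UserScript==\nalert('脚本');";
const fixtures = {
  utf8NoBOM: new TextEncoder().encode(sample),
  utf16leWithBOM: encodeUtf16(sample, true, true),
  utf16beNoBOM: encodeUtf16(sample, false, false),
  invalidUtf8: new Uint8Array([0x63, 0x61, 0x66, 0xe9]), // "café" in windows-1252
  empty: new Uint8Array(0),
};
```

Each fixture could then be wrapped as `new Blob([bytes])` and passed to readBlobContent with a null content type, asserting that the decoded string equals the expected text (`sample`, "café", or "").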
@copilot Please open a PR to add the required unit tests
@CodFrm I can't instruct Copilot...
@CodFrm I can't instruct Copilot...
It seems to be because this is cross-repository? I can't do it either
@copilot open a new pull request to apply changes based on this feedback
      new TextDecoder("utf-8", { fatal: true }).decode(uint8.subarray(0, checkSize));
    } catch {
      // Invalid UTF-8 → most common real-world fallback
      encoding = "windows-1252"; // OR detectEncoding(uint8, null)
Copilot
AI
Jan 31, 2026
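On line 208, the comment // OR detectEncoding(uint8, null) points to an alternative. The current implementation falls back directly to windows-1252, the most common encoding for Western European languages.
To support a wider range of encodings (such as those used for Asian languages), consider using detectEncoding(uint8, null) to detect the encoding automatically via chardet, at the cost of additional performance overhead.
The current implementation favors performance and simplicity, which is a reasonable choice. If reports about non-Western-European encodings come in later, switching to detectEncoding can be considered. The comment can stay as a pointer for future improvement.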
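The trade-off discussed above can be seen with the standard TextDecoder alone. The sketch below is illustrative only: `decodeUtf8OrFallback` is a hypothetical helper, it does not use the project's bytesDecode, detectEncoding, or chardet, and it assumes a runtime whose TextDecoder supports the windows-1252 label.

```typescript
// Hypothetical helper mirroring the "strict UTF-8, else legacy fallback" idea.
function decodeUtf8OrFallback(bytes: Uint8Array, fallback: string = "windows-1252"): string {
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    // windows-1252 assigns a character to every byte, so this branch does not
    // fail on odd input, but non-Western text comes out as mojibake.
    return new TextDecoder(fallback).decode(bytes);
  }
}

const latin = new Uint8Array([0x63, 0x61, 0x66, 0xe9]); // "café" in windows-1252
const gbk = new Uint8Array([0xd6, 0xd0, 0xce, 0xc4]);   // "中文" in GBK
console.log(decodeUtf8OrFallback(latin)); // Western European bytes decode as intended
console.log(decodeUtf8OrFallback(gbk));   // GBK bytes decode without error, but as "ÖÐÎÄ"
```

The second call shows the case the review comment has in mind: the fallback never throws, so wrong guesses are silent, which is where a statistical detector would help if such reports appear.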
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Descriptions
close #1163
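Mainly handles code downloaded from the web whose encoding is not UTF-8 but whose charset is not set
Uses BOM detection or null guessing to choose among utf-8 / utf-16le / utf-16be / utf-32le / utf-32be
Does not touch cloud sync related code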
Changes
Screenshots