@cyfung1031
Collaborator

Description

close #1163

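Mainly handles code downloaded from the web whose encoding is not UTF-8 and whose charset is not set.
BOM detection or null-byte guessing is used to decide between utf-8 / utf-16le / utf-16be / utf-32le / utf-32be.

Code related to cloud sync is not handled.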
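
The BOM check mentioned above could be sketched roughly as follows. This is a minimal illustration, not the PR's actual detectBOM: the name detectBomSketch, its return type, and the order of the checks are assumptions. The UTF-32LE signature (FF FE 00 00) begins with the UTF-16LE signature (FF FE), so the 4-byte patterns are tested first.

```typescript
// Hypothetical sketch of BOM sniffing; the PR's real detectBOM may differ.
type BomEncoding = "utf-8" | "utf-16le" | "utf-16be" | "utf-32le" | "utf-32be";

function detectBomSketch(bytes: Uint8Array): BomEncoding | null {
  const n = bytes.length;
  // 4-byte signatures first: FF FE 00 00 would otherwise match UTF-16LE.
  if (n >= 4 && bytes[0] === 0xff && bytes[1] === 0xfe && bytes[2] === 0x00 && bytes[3] === 0x00) {
    return "utf-32le";
  }
  if (n >= 4 && bytes[0] === 0x00 && bytes[1] === 0x00 && bytes[2] === 0xfe && bytes[3] === 0xff) {
    return "utf-32be";
  }
  // 3-byte UTF-8 signature.
  if (n >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return "utf-8";
  }
  // 2-byte UTF-16 signatures.
  if (n >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) return "utf-16le";
  if (n >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) return "utf-16be";
  return null; // no BOM found
}
```

One caveat of this ordering: a UTF-16LE file whose first character after the BOM is U+0000 would be read as UTF-32LE. That input is rare in script sources, so preferring the longer signature is the usual trade-off.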

Changes

Screenshots

@cyfung1031 cyfung1031 changed the title from "Enhance text decoding" to "[v1.3] Enhance text decoding" Jan 31, 2026
  const uint8Array = new Uint8Array(arrayBuffer);
  if (isText(uint8Array)) {
-   resource.content = await data.text();
+   resource.content = await readBlobContent(data, contentType);

Member

@resource should be kept exactly as-is.

@require and @require-css are converted into code and executed, so they can be decoded.

Contributor

Copilot AI left a comment

Pull request overview

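This PR strengthens ScriptCat's text decoding. It mainly addresses scripts downloaded from the network whose encoding is not UTF-8 and whose charset is not set correctly. By combining BOM detection with a heuristic based on null-byte patterns, it automatically identifies encodings such as UTF-8, UTF-16LE/BE, and UTF-32LE/BE.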
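
To illustrate the null-byte heuristic: ASCII-heavy text encoded as UTF-16 or UTF-32 leaves zero bytes at predictable positions inside each code unit. The sketch below counts zeros per position within 4-byte groups. It is only an assumption of how such a heuristic could look; the name guessByNullPatternSketch and the 90% / 10% thresholds are made up for illustration and are not taken from the PR.

```typescript
// Hypothetical sketch of a null-pattern heuristic; thresholds are illustrative.
type GuessedEncoding = "utf-16le" | "utf-16be" | "utf-32le" | "utf-32be";

function guessByNullPatternSketch(bytes: Uint8Array, sampleSize: number): GuessedEncoding | null {
  // Only look at whole 4-byte groups inside the sample window.
  const n = Math.min(bytes.length, sampleSize) & ~3;
  if (n < 4) return null;

  // zeros[k] = number of zero bytes found at offset k within a 4-byte group.
  const zeros = [0, 0, 0, 0];
  for (let i = 0; i < n; i++) {
    if (bytes[i] === 0) zeros[i & 3] += 1;
  }

  const groups = n / 4;
  const mostly = (count: number): boolean => count >= groups * 0.9;
  const rarely = (count: number): boolean => count <= groups * 0.1;

  // UTF-32: three of the four bytes in each group are zero for ASCII text.
  if (mostly(zeros[1]) && mostly(zeros[2]) && mostly(zeros[3]) && rarely(zeros[0])) return "utf-32le";
  if (mostly(zeros[0]) && mostly(zeros[1]) && mostly(zeros[2]) && rarely(zeros[3])) return "utf-32be";
  // UTF-16: every other byte is zero for ASCII text.
  if (mostly(zeros[1]) && mostly(zeros[3]) && rarely(zeros[0]) && rarely(zeros[2])) return "utf-16le";
  if (mostly(zeros[0]) && mostly(zeros[2]) && rarely(zeros[1]) && rarely(zeros[3])) return "utf-16be";
  return null;
}
```

A heuristic like this only works when the sample is dominated by ASCII-range characters, which is typical for script source code. Text that is mostly CJK in UTF-16 contains few zero bytes and would fall through to the later steps.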

Changes:

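  • Adds the readBlobContent function, which detects the encoding by trying, in priority order: Content-Type charset → BOM detection → null-pattern heuristic → UTF-8 validation → windows-1252 fallback
  • Adds the detectBOM helper for recognizing the various BOM markers
  • Adds the guessByNullPattern helper, which identifies UTF-16/32 encodings from the distribution of null bytes
  • Refines parseCharsetFromContentType to return an empty string instead of null, simplifying the logic
  • Improves decodeUTF32 performance by creating a DataView only in the BE branch
  • Applies the new encoding detection when fetching scripts, loading resources, and fetching favicons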
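
As a rough illustration of the parseCharsetFromContentType change listed above, a parser that returns an empty string instead of null lets callers use a plain truthiness check. The sketch below is an assumption, not the PR's code: the name parseCharsetSketch, the regular expression, and the lowercasing are all hypothetical.

```typescript
// Hypothetical sketch: extract the charset parameter from a Content-Type value.
// Returns "" (never null) when no charset is present.
function parseCharsetSketch(contentType: string | null | undefined): string {
  if (!contentType) return "";
  // Matches `charset=utf-8` and `charset="utf-8"` as a parameter of the header value.
  const match = /(?:^|;)\s*charset\s*=\s*("?)([^";\s]+)\1/i.exec(contentType);
  return match?.[2]?.toLowerCase() ?? "";
}

// Caller side: an empty string is falsy, so no separate null check is needed.
function describeCharset(contentType: string | null): string {
  const charset = parseCharsetSketch(contentType);
  return charset ? `declared charset: ${charset}` : "no charset declared";
}
```

With a string-only return type, the result can also be passed straight to string APIs without narrowing, which is presumably the simplification the change aims for.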

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

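| File | Description |
| --- | --- |
| src/pkg/utils/encoding.ts | Core implementation: adds the encoding detection function and helper algorithms, and optimizes the existing encoding functions |
| src/pkg/utils/encoding.test.ts | Test update: changes the parseCharsetFromContentType return-value assertions from null to an empty string |
| src/pkg/utils/script.ts | Integration: uses readBlobContent instead of resp.text() in fetchScriptBody |
| src/pages/store/favicons.ts | Integration: uses readBlobContent to process HTML content in fetchIconByDomain |
| src/app/service/service_worker/resource.ts | Integration: uses readBlobContent to process text resources in loadByUrl, and tidies up how the contentType variable is extracted |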

Comment on lines 170 to 212
export const readBlobContent = async (blob: Blob | File | Response, contentType: string | null): Promise<string> => {
  const buffer = await blob.arrayBuffer();
  const uint8 = new Uint8Array(buffer);

  if (uint8.length === 0) {
    return "";
  }

  // Try the charset from the Content-Type header first
  const headerCharset = parseCharsetFromContentType(contentType);
  if (headerCharset) {
    try {
      // Verify that the charset is valid
      return bytesDecode(headerCharset, uint8);
    } catch (e: any) {
      console.warn(`Invalid charset from Content-Type header: ${headerCharset}, error: ${e.message}`);
    }
  }

  // BOM detection (highest priority)
  const bomEncoding = detectBOM(uint8);
  if (bomEncoding) return bytesDecode(bomEncoding, uint8);

  const checkSize = Math.min(uint8.length, 16 * 1024);

  if (uint8.length % 2 === 0) {
    // Heuristic detection (first 16 KB)
    const heuristicEncoding = guessByNullPattern(uint8, checkSize);
    if (heuristicEncoding) return bytesDecode(heuristicEncoding, uint8);
  }

  // UTF-8 validation → legacy fallback
  let encoding = "utf-8";
  try {
    // Strict mode – throws on invalid sequences
    new TextDecoder("utf-8", { fatal: true }).decode(uint8.subarray(0, checkSize));
  } catch {
    // Invalid UTF-8 → most common real-world fallback
    encoding = "windows-1252"; // OR detectEncoding(uint8, null)
  }

  return bytesDecode(encoding, uint8);
};

Copilot AI Jan 31, 2026

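The new readBlobContent function is the core of this PR, but it has no dedicated unit tests. Because it chains several encoding detection steps (Content-Type header → BOM → null-pattern heuristic → UTF-8 validation → fallback), consider adding test cases for the following scenarios:

  1. A valid Content-Type charset
  2. An invalid Content-Type charset
  3. Each BOM (UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE)
  4. UTF-16/32 without a BOM that can be detected through the null pattern
  5. UTF-8 text (no BOM)
  6. Invalid UTF-8 falling back to windows-1252
  7. An empty Blob
  8. Short text under 64 bytes

The other utility functions in this directory all have corresponding test files, and encoding.test.ts already contains 510 lines of tests, which suggests the project expects high test coverage.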
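
A table-driven layout is one way to cover the scenarios listed above. The sketch below is framework-free so it stands alone, and it runs against a simplified stand-in (decodeSketch) rather than the real readBlobContent. The stand-in, the case table, and runDecodeCases are all hypothetical; the stand-in omits the null-pattern heuristic and UTF-32, and it assumes a runtime whose TextDecoder supports the utf-16le and windows-1252 labels.

```typescript
// Simplified, hypothetical stand-in for the function under test (not the PR's code).
function decodeSketch(bytes: Uint8Array, contentType: string | null): string {
  if (bytes.length === 0) return "";
  const charset = /charset\s*=\s*"?([^";\s]+)/i.exec(contentType ?? "")?.[1];
  if (charset) {
    try {
      return new TextDecoder(charset).decode(bytes);
    } catch {
      // Unknown label: the TextDecoder constructor throws, so fall through.
    }
  }
  if (bytes[0] === 0xff && bytes[1] === 0xfe) return new TextDecoder("utf-16le").decode(bytes);
  try {
    // Strict decode; a leading UTF-8 BOM is stripped by default.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    return new TextDecoder("windows-1252").decode(bytes);
  }
}

interface DecodeCase {
  name: string;
  bytes: number[];
  contentType: string | null;
  expected: string;
}

const decodeCases: DecodeCase[] = [
  { name: "valid header charset", bytes: [0xe9], contentType: "text/plain; charset=iso-8859-1", expected: "\u00e9" },
  { name: "invalid header charset", bytes: [0x6f, 0x6b], contentType: "text/plain; charset=not-a-charset", expected: "ok" },
  { name: "UTF-8 BOM", bytes: [0xef, 0xbb, 0xbf, 0x68, 0x69], contentType: null, expected: "hi" },
  { name: "UTF-16LE BOM", bytes: [0xff, 0xfe, 0x68, 0x00, 0x69, 0x00], contentType: null, expected: "hi" },
  { name: "plain UTF-8", bytes: [0xe4, 0xbd, 0xa0], contentType: null, expected: "\u4f60" },
  { name: "invalid UTF-8 falls back", bytes: [0x63, 0x61, 0x66, 0xe9], contentType: null, expected: "caf\u00e9" },
  { name: "empty input", bytes: [], contentType: null, expected: "" },
];

// Returns the names of failing cases; an empty array means every case passed.
function runDecodeCases(): string[] {
  return decodeCases
    .filter((c) => decodeSketch(new Uint8Array(c.bytes), c.contentType) !== c.expected)
    .map((c) => c.name);
}
```

In a real test file each table entry would map onto one test in the project's test runner, with the bytes wrapped in a Blob and passed to readBlobContent instead of the stand-in.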

Collaborator Author

@copilot please open a PR that adds the required unit tests

Collaborator Author

@CodFrm I can't instruct copilot...

Member

@CodFrm CodFrm Jan 31, 2026

> @CodFrm I can't instruct copilot...

Maybe because it's cross-repository? I can't either.

Member

@copilot open a new pull request to apply changes based on this feedback

    new TextDecoder("utf-8", { fatal: true }).decode(uint8.subarray(0, checkSize));
  } catch {
    // Invalid UTF-8 → most common real-world fallback
    encoding = "windows-1252"; // OR detectEncoding(uint8, null)

Copilot AI Jan 31, 2026

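At line 208, the comment // OR detectEncoding(uint8, null) points to an alternative. The current implementation falls back directly to windows-1252, the most common encoding for Western European languages.

To support a wider range of encodings (such as those used for Asian languages), detectEncoding(uint8, null) could be used to auto-detect via chardet, at the cost of extra performance overhead.

The current implementation favors performance and simplicity, which is a reasonable choice. If issues about non-Western-European encodings are reported in the future, switching to detectEncoding can be considered. The comment can stay as a pointer for that future improvement.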
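
The validate-then-fall-back step discussed here can be sketched in isolation. One detail worth noting when only a prefix of the bytes is validated: cutting a buffer at a fixed size can split a multi-byte UTF-8 sequence, and a strict one-shot decode would then reject otherwise valid text. Passing `stream: true` for a truncated sample tells TextDecoder to hold an incomplete trailing sequence instead of throwing. The helper names and the legacyFallback parameter below are hypothetical and are not part of the PR.

```typescript
// Hypothetical sketch: strict UTF-8 validation of a prefix, tolerant of a
// multi-byte sequence that is cut off by the sample boundary.
function isValidUtf8Prefix(bytes: Uint8Array, limit: number): boolean {
  const truncated = limit < bytes.length;
  const sample = truncated ? bytes.subarray(0, limit) : bytes;
  try {
    // fatal: true throws on invalid sequences; stream: true (only when the
    // sample is truncated) keeps an incomplete trailing sequence pending.
    new TextDecoder("utf-8", { fatal: true }).decode(sample, { stream: truncated });
    return true;
  } catch {
    return false;
  }
}

// Picks "utf-8" when the sample validates, otherwise the legacy fallback.
function pickFallbackEncoding(
  bytes: Uint8Array,
  limit: number,
  legacyFallback: string = "windows-1252"
): string {
  return isValidUtf8Prefix(bytes, limit) ? "utf-8" : legacyFallback;
}
```

The legacyFallback parameter is where a detector-based choice (such as the detectEncoding alternative mentioned in the comment) could be plugged in later without changing the validation logic. A limitation of this sketch: an invalid lead byte that sits exactly at the end of a truncated sample is not caught, since it looks like the start of an incomplete sequence.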

Collaborator Author

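> The current implementation favors performance and simplicity, which is a reasonable choice. If issues about non-Western-European encodings are reported in the future, switching to detectEncoding can be considered. The comment can stay as a pointer for that future improvement.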

cyfung1031 and others added 4 commits January 31, 2026 12:40
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@CodFrm CodFrm merged commit f4ac909 into scriptscat:release/v1.3 Feb 1, 2026
1 of 2 checks passed
@cyfung1031 cyfung1031 deleted the pr-text-decode-101 branch February 1, 2026 07:13