-
-
Notifications
You must be signed in to change notification settings - Fork 4.8k
Safe UTF-8 Extraction and Robust <style> Block Detection in Upgrade Codemods #19410
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Safe UTF-8 Extraction and Robust <style> Block Detection in Upgrade Codemods #19410
Conversation
Replaced unsafe UTF-8 conversion with safe handling to filter invalid candidates.
Refactor style block detection to use HTML parser for improved robustness.
WalkthroughThis pull request modifies two files: the scanner module in oxide replaces unsafe UTF-8 conversion with safe conversion that filters invalid sequences; the is-safe-migration template file replaces manual HTML scanning with a node-html-parser library implementation and adds error handling. Function signatures remain unchanged, and overall collection flows are preserved in both cases. Pre-merge checks✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
crates/oxide/src/scanner/mod.rs(1 hunks)packages/@tailwindcss-upgrade/src/codemods/template/is-safe-migration.ts(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
packages/@tailwindcss-upgrade/src/codemods/template/is-safe-migration.ts (1)
packages/tailwindcss/src/utils/default-map.ts (1)
DefaultMap(5-20)
🔇 Additional comments (5)
crates/oxide/src/scanner/mod.rs (3)
556-563: Excellent safety improvement.Replacing
from_utf8_uncheckedwith safefrom_utf8and filtering invalid UTF-8 candidates prevents undefined behavior and potential security issues. The approach to skip rather than panic on invalid UTF-8 is appropriate for a scanner that processes arbitrary file content.
566-566: LGTM!The updated comment accurately explains why unstable sort is correct for unique candidates, and appropriately removes the outdated SAFETY justification.
444-452: Verify the safety of remainingfrom_utf8_uncheckedusage.The
get_candidates_with_positionsmethod usesfrom_utf8_uncheckedat lines 444-446 and 452 with SAFETY comments claiming the extractor guarantees valid UTF-8. This assumption should be verified against the extractor implementation to ensure it's sound. If the safer UTF-8 handling approach is applied elsewhere in the codebase (as suggested at lines 556-563), consider applying the same pattern here for consistency and to eliminate all unsafe UTF-8 conversions.packages/@tailwindcss-upgrade/src/codemods/template/is-safe-migration.ts (2)
2-2: Verify dependency installation.Ensure that
node-html-parserhas been added to thedependenciesinpackages/@tailwindcss-upgrade/package.json. The import statement is appropriate for HTML parsing, but the dependency must be properly declared for the code to function.
195-195: The parser options{ lowerCaseTagName: false, comment: false, blockTextElements: { style: true } }are all valid and correctly used according tonode-html-parserdocumentation. The configuration appropriately disables tag name lowercasing, excludes HTML comments, and preserves raw text content within<style>tags—which is the intended behavior for template parsing.
| // Robustly locates all <style> blocks (with or without attributes) using an HTML parser. | ||
| const styleBlockRanges = new DefaultMap((source: string) => { | ||
| let ranges: number[] = [] | ||
| let offset = 0 | ||
|
|
||
| while (true) { | ||
| let startTag = source.indexOf('<style', offset) | ||
| if (startTag === -1) return ranges | ||
|
|
||
| offset = startTag + 1 | ||
|
|
||
| // Ensure the style looks like: | ||
| // - `<style>` (closed) | ||
| // - `<style …>` (with attributes) | ||
| if (!source[startTag + 6].match(/[>\s]/)) continue | ||
|
|
||
| let endTag = source.indexOf('</style>', offset) | ||
| if (endTag === -1) return ranges | ||
| offset = endTag + 1 | ||
|
|
||
| ranges.push(startTag, endTag) | ||
| const ranges: number[] = [] | ||
| try { | ||
| const root = parseHtml(source, { lowerCaseTagName: false, comment: false, blockTextElements: { style: true } }) | ||
| const styleNodes = root.querySelectorAll('style') | ||
| styleNodes.forEach(node => { | ||
| const nodeHtml = node.toString() | ||
| const start = typeof node.range === 'object' && node.range !== null | ||
| ? node.range[0] | ||
| : source.indexOf(nodeHtml) | ||
| const end = start + nodeHtml.length | ||
| ranges.push(start, end) | ||
| }) | ||
| } catch (_) { | ||
| // fallback: do nothing if parser fails | ||
| } | ||
| return ranges | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Robust HTML parser implementation with minor fallback concern.
The replacement of manual string scanning with node-html-parser is a significant improvement that correctly handles style tags with attributes, nesting, and complex HTML structures. The parser options are appropriate:
blockTextElements: { style: true }ensures content inside<style>tags isn't parsed as HTMLlowerCaseTagName: falsepreserves case sensitivity for accuracy
However, the fallback on line 201 (source.indexOf(nodeHtml)) could match the wrong occurrence if the source contains multiple identical <style> blocks. This is an edge case but could lead to incorrect range detection.
The silent error handling (line 205-207) is reasonable for graceful degradation, but consider logging errors in development mode to aid debugging.
Optional improvement for the indexOf fallback:
- const start = typeof node.range === 'object' && node.range !== null
- ? node.range[0]
- : source.indexOf(nodeHtml)
+ // Only use indexOf as last resort; range should be available in most cases
+ const start = typeof node.range === 'object' && node.range !== null && node.range[0] !== undefined
+ ? node.range[0]
+ : source.indexOf(nodeHtml)🤖 Prompt for AI Agents
In packages/@tailwindcss-upgrade/src/codemods/template/is-safe-migration.ts
around lines 191 to 209, the fallback that uses source.indexOf(nodeHtml) can
pick the wrong identical <style> occurrence; replace the fallback with a
deterministic search by tracking a moving search offset (e.g., maintain a
lastIndex variable and call source.indexOf(nodeHtml, lastIndex) then advance
lastIndex to end) or prefer parser-provided node.range/startIndex if available,
and ensure ranges are pushed in document order; additionally, change the silent
catch to optionally emit a development-only debug log (or rethrow in dev) rather
than completely swallowing the parser error.
|
|
||
| // SAFETY: Unstable sort is faster and in this scenario it's also safe because we are | ||
| // guaranteed to have unique candidates. | ||
| // Unstable sort is performant & remains correct with unique candidates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While not a traditional safety message this is 100% intentional because behavior would be incorrect in the face of non-unique candidates + an unstable sort.
| // eprintln!("Skipped invalid UTF-8 candidate, error: {:?}", _e); | ||
| None | ||
| } | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an intentional performance optimization. String::from_utf8 does a validation pass over the string which should be uncnessary here. The extractor's implementation should guarantee that the resulting candidates are already UTF-8.
If you have a test case for this that crashes when scanning files please open a separate issue and I will take a look.
|
Thanks for the clarification! That makes sense—I wasn't aware the extractor always guarantees valid UTF-8 candidates. I'll leave the existing performance-oriented implementation as is. If I ever encounter a crash or edge case where invalid candidates slip through, I'll open a separate issue with a reproducible example. |
This PR improves the safety and robustness of Tailwind's candidate extraction and codemod migration logic in two critical areas:
unsafe { String::from_utf8_unchecked(...) }with safeString::from_utf8(...)in candidate extraction.node-html-parserfor detecting<style>blocks in codemod logic.<style>tags (including those with attributes and complex nesting), so migrations never accidentally affect style blocks.node-html-parseras a dependency to the@tailwindcss-upgradepackage.These changes modernize the affected modules, prevent potential runtime errors from malformed files, and support robust migrations across real-world HTML/template structures.
Test plan
1. UTF-8 Candidate Extraction
cargo testto ensure the scanner skips bad candidates without panicking2. Codemod
<style>Block Handling<style>tags for multiple template syntaxes<style>blocks are not migrated