Safe UTF-8 Extraction and Robust <style> Block Detection in Upgrade Codemods #19410

Saniya777-byte · 2025-12-04T16:45:57Z

This PR improves the safety and robustness of Tailwind's candidate extraction and codemod migration logic in two critical areas:

Rust UTF-8 Extraction
- Replaces unsafe { String::from_utf8_unchecked(...) } with safe String::from_utf8(...) in candidate extraction.
- Skips any invalid UTF-8 candidates rather than panicking, providing a safer and more resilient extraction process.
TypeScript Style Block Handling
- Switches from manual string-index logic to parsing using node-html-parser for detecting <style> blocks in codemod logic.
- Ensures accurate detection of all <style> tags (including those with attributes and complex nesting), so migrations never accidentally affect style blocks.
- Adds node-html-parser as a dependency to the @tailwindcss-upgrade package.
pnpm Workspace Guidance
- Includes documentation for installing dependencies within the correct workspace package, avoiding root install issues.

These changes modernize the affected modules, prevent potential runtime errors from malformed files, and support robust migrations across real-world HTML/template structures.

Test plan

1. UTF-8 Candidate Extraction

Added test cases using invalid and valid UTF-8 byte candidates
Ran cargo test to ensure the scanner skips bad candidates without panicking

2. Codemod <style> Block Handling

Added migration examples with plain, nested, and attribute-rich <style> tags for multiple template syntaxes
Confirmed classes inside <style> blocks are not migrated

Ran:

cd packages/@tailwindcss-upgrade
pnpm install
pnpm test

Replaced unsafe UTF-8 conversion with safe handling to filter invalid candidates.

Refactor style block detection to use HTML parser for improved robustness.

coderabbitai · 2025-12-04T16:49:56Z

Walkthrough

This pull request modifies two files: the scanner module in oxide replaces unsafe UTF-8 conversion with safe conversion that filters invalid sequences; the is-safe-migration template file replaces manual HTML scanning with a node-html-parser library implementation and adds error handling. Function signatures remain unchanged, and overall collection flows are preserved in both cases.

Pre-merge checks

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Safe UTF-8 Extraction and Robust <style> Block Detection in Upgrade Codemods' accurately summarizes the two main changes in the changeset: safe UTF-8 handling in Rust and improved style block detection in TypeScript codemods.
Description check	✅ Passed	The description comprehensively relates to the changeset, covering both major changes (UTF-8 extraction and style block detection), explaining the rationale, listing dependencies added, and detailing test coverage performed.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b4f3d17 and bb60cc4.

📒 Files selected for processing (2)

crates/oxide/src/scanner/mod.rs (1 hunks)
packages/@tailwindcss-upgrade/src/codemods/template/is-safe-migration.ts (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

packages/@tailwindcss-upgrade/src/codemods/template/is-safe-migration.ts (1)

packages/tailwindcss/src/utils/default-map.ts (1)

DefaultMap (5-20)

🔇 Additional comments (5)

crates/oxide/src/scanner/mod.rs (3)

556-563: Excellent safety improvement.

Replacing from_utf8_unchecked with safe from_utf8 and filtering invalid UTF-8 candidates prevents undefined behavior and potential security issues. The approach to skip rather than panic on invalid UTF-8 is appropriate for a scanner that processes arbitrary file content.

566-566: LGTM!

The updated comment accurately explains why unstable sort is correct for unique candidates, and appropriately removes the outdated SAFETY justification.

444-452: Verify the safety of remaining from_utf8_unchecked usage.

The get_candidates_with_positions method uses from_utf8_unchecked at lines 444-446 and 452 with SAFETY comments claiming the extractor guarantees valid UTF-8. This assumption should be verified against the extractor implementation to ensure it's sound. If the safer UTF-8 handling approach is applied elsewhere in the codebase (as suggested at lines 556-563), consider applying the same pattern here for consistency and to eliminate all unsafe UTF-8 conversions.

packages/@tailwindcss-upgrade/src/codemods/template/is-safe-migration.ts (2)

2-2: Verify dependency installation.

Ensure that node-html-parser has been added to the dependencies in packages/@tailwindcss-upgrade/package.json. The import statement is appropriate for HTML parsing, but the dependency must be properly declared for the code to function.

195-195: The parser options { lowerCaseTagName: false, comment: false, blockTextElements: { style: true } } are all valid and correctly used according to node-html-parser documentation. The configuration appropriately disables tag name lowercasing, excludes HTML comments, and preserves raw text content within <style> tags—which is the intended behavior for template parsing.

coderabbitai · 2025-12-04T16:49:59Z

packages/@tailwindcss-upgrade/src/codemods/template/is-safe-migration.ts

+// Robustly locates all <style> blocks (with or without attributes) using an HTML parser.
 const styleBlockRanges = new DefaultMap((source: string) => {
-  let ranges: number[] = []
-  let offset = 0
-
-  while (true) {
-    let startTag = source.indexOf('<style', offset)
-    if (startTag === -1) return ranges
-
-    offset = startTag + 1
-
-    // Ensure the style looks like:
-    // - `<style>`   (closed)
-    // - `<style …>` (with attributes)
-    if (!source[startTag + 6].match(/[>\s]/)) continue
-
-    let endTag = source.indexOf('</style>', offset)
-    if (endTag === -1) return ranges
-    offset = endTag + 1
-
-    ranges.push(startTag, endTag)
+  const ranges: number[] = []
+  try {
+    const root = parseHtml(source, { lowerCaseTagName: false, comment: false, blockTextElements: { style: true } })
+    const styleNodes = root.querySelectorAll('style')
+    styleNodes.forEach(node => {
+      const nodeHtml = node.toString()
+      const start = typeof node.range === 'object' && node.range !== null
+        ? node.range[0]
+        : source.indexOf(nodeHtml)
+      const end = start + nodeHtml.length
+      ranges.push(start, end)
+    })
+  } catch (_) {
+    // fallback: do nothing if parser fails
  }
+  return ranges
 })


⚠️ Potential issue | 🟡 Minor

Robust HTML parser implementation with minor fallback concern.

The replacement of manual string scanning with node-html-parser is a significant improvement that correctly handles style tags with attributes, nesting, and complex HTML structures. The parser options are appropriate:

blockTextElements: { style: true } ensures content inside <style> tags isn't parsed as HTML

lowerCaseTagName: false preserves case sensitivity for accuracy

However, the fallback on line 201 (source.indexOf(nodeHtml)) could match the wrong occurrence if the source contains multiple identical <style> blocks. This is an edge case but could lead to incorrect range detection.

The silent error handling (line 205-207) is reasonable for graceful degradation, but consider logging errors in development mode to aid debugging.

Optional improvement for the indexOf fallback:

- const start = typeof node.range === 'object' && node.range !== null - ? node.range[0] - : source.indexOf(nodeHtml) + // Only use indexOf as last resort; range should be available in most cases + const start = typeof node.range === 'object' && node.range !== null && node.range[0] !== undefined + ? node.range[0] + : source.indexOf(nodeHtml)

🤖 Prompt for AI Agents

In packages/@tailwindcss-upgrade/src/codemods/template/is-safe-migration.ts around lines 191 to 209, the fallback that uses source.indexOf(nodeHtml) can pick the wrong identical <style> occurrence; replace the fallback with a deterministic search by tracking a moving search offset (e.g., maintain a lastIndex variable and call source.indexOf(nodeHtml, lastIndex) then advance lastIndex to end) or prefer parser-provided node.range/startIndex if available, and ensure ranges are pushed in document order; additionally, change the silent catch to optionally emit a development-only debug log (or rethrow in dev) rather than completely swallowing the parser error.

thecrypticace · 2025-12-04T17:05:02Z

crates/oxide/src/scanner/mod.rs


-    // SAFETY: Unstable sort is faster and in this scenario it's also safe because we are
-    //         guaranteed to have unique candidates.
+    // Unstable sort is performant & remains correct with unique candidates


While not a traditional safety message this is 100% intentional because behavior would be incorrect in the face of non-unique candidates + an unstable sort.

thecrypticace · 2025-12-04T17:07:40Z

crates/oxide/src/scanner/mod.rs

+                // eprintln!("Skipped invalid UTF-8 candidate, error: {:?}", _e);
+                None
+            }
+        })


This is an intentional performance optimization. String::from_utf8 does a validation pass over the string which should be uncnessary here. The extractor's implementation should guarantee that the resulting candidates are already UTF-8.

If you have a test case for this that crashes when scanning files please open a separate issue and I will take a look.

Saniya777-byte · 2025-12-04T17:39:43Z

Thanks for the clarification!

That makes sense—I wasn't aware the extractor always guarantees valid UTF-8 candidates. I'll leave the existing performance-oriented implementation as is. If I ever encounter a crash or edge case where invalid candidates slip through, I'll open a separate issue with a reproducible example.

Saniya777-byte added 2 commits December 4, 2025 22:04

Improve UTF-8 handling in scanner module

71d1609

Replaced unsafe UTF-8 conversion with safe handling to filter invalid candidates.

Enhance style block detection logic

bb60cc4

Refactor style block detection to use HTML parser for improved robustness.

Saniya777-byte requested a review from a team as a code owner December 4, 2025 16:45

coderabbitai bot reviewed Dec 4, 2025

View reviewed changes

thecrypticace reviewed Dec 4, 2025

View reviewed changes

thecrypticace closed this Dec 4, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Safe UTF-8 Extraction and Robust <style> Block Detection in Upgrade Codemods #19410

Safe UTF-8 Extraction and Robust <style> Block Detection in Upgrade Codemods #19410

Saniya777-byte commented Dec 4, 2025

Uh oh!

coderabbitai bot commented Dec 4, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Dec 4, 2025

Uh oh!

thecrypticace Dec 4, 2025

Uh oh!

thecrypticace Dec 4, 2025

Uh oh!

Saniya777-byte commented Dec 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Safe UTF-8 Extraction and Robust <style> Block Detection in Upgrade Codemods #19410

Safe UTF-8 Extraction and Robust <style> Block Detection in Upgrade Codemods #19410

Conversation

Saniya777-byte commented Dec 4, 2025

Test plan

Uh oh!

coderabbitai bot commented Dec 4, 2025

Walkthrough

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

thecrypticace Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

thecrypticace Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Saniya777-byte commented Dec 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants