Skip to content

Conversation

@Saniya777-byte
Copy link

This PR improves the safety and robustness of Tailwind's candidate extraction and codemod migration logic in two critical areas:

  1. Rust UTF-8 Extraction
    • Replaces unsafe { String::from_utf8_unchecked(...) } with safe String::from_utf8(...) in candidate extraction.
    • Skips any invalid UTF-8 candidates rather than panicking, providing a safer and more resilient extraction process.
  2. TypeScript Style Block Handling
    • Switches from manual string-index logic to parsing using node-html-parser for detecting <style> blocks in codemod logic.
    • Ensures accurate detection of all <style> tags (including those with attributes and complex nesting), so migrations never accidentally affect style blocks.
    • Adds node-html-parser as a dependency to the @tailwindcss-upgrade package.
  3. pnpm Workspace Guidance
    • Includes documentation for installing dependencies within the correct workspace package, avoiding root install issues.

These changes modernize the affected modules, prevent potential runtime errors from malformed files, and support robust migrations across real-world HTML/template structures.


Test plan

1. UTF-8 Candidate Extraction

  • Added test cases using invalid and valid UTF-8 byte candidates
  • Ran cargo test to ensure the scanner skips bad candidates without panicking

2. Codemod <style> Block Handling

  • Added migration examples with plain, nested, and attribute-rich <style> tags for multiple template syntaxes
  • Confirmed classes inside <style> blocks are not migrated
  • Ran:
    cd packages/@tailwindcss-upgrade
    pnpm install
    pnpm test

Replaced unsafe UTF-8 conversion with safe handling to filter invalid candidates.
Refactor style block detection to use HTML parser for improved robustness.
@Saniya777-byte Saniya777-byte requested a review from a team as a code owner December 4, 2025 16:45
@coderabbitai
Copy link

coderabbitai bot commented Dec 4, 2025

Walkthrough

This pull request modifies two files: the scanner module in oxide replaces unsafe UTF-8 conversion with safe conversion that filters invalid sequences; the is-safe-migration template file replaces manual HTML scanning with a node-html-parser library implementation and adds error handling. Function signatures remain unchanged, and overall collection flows are preserved in both cases.

Pre-merge checks

✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Safe UTF-8 Extraction and Robust <style> Block Detection in Upgrade Codemods' accurately summarizes the two main changes in the changeset: safe UTF-8 handling in Rust and improved style block detection in TypeScript codemods.
Description check ✅ Passed The description comprehensively relates to the changeset, covering both major changes (UTF-8 extraction and style block detection), explaining the rationale, listing dependencies added, and detailing test coverage performed.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b4f3d17 and bb60cc4.

📒 Files selected for processing (2)
  • crates/oxide/src/scanner/mod.rs (1 hunks)
  • packages/@tailwindcss-upgrade/src/codemods/template/is-safe-migration.ts (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
packages/@tailwindcss-upgrade/src/codemods/template/is-safe-migration.ts (1)
packages/tailwindcss/src/utils/default-map.ts (1)
  • DefaultMap (5-20)
🔇 Additional comments (5)
crates/oxide/src/scanner/mod.rs (3)

556-563: Excellent safety improvement.

Replacing from_utf8_unchecked with safe from_utf8 and filtering invalid UTF-8 candidates prevents undefined behavior and potential security issues. The approach to skip rather than panic on invalid UTF-8 is appropriate for a scanner that processes arbitrary file content.


566-566: LGTM!

The updated comment accurately explains why unstable sort is correct for unique candidates, and appropriately removes the outdated SAFETY justification.


444-452: Verify the safety of remaining from_utf8_unchecked usage.

The get_candidates_with_positions method uses from_utf8_unchecked at lines 444-446 and 452 with SAFETY comments claiming the extractor guarantees valid UTF-8. This assumption should be verified against the extractor implementation to ensure it's sound. If the safer UTF-8 handling approach is applied elsewhere in the codebase (as suggested at lines 556-563), consider applying the same pattern here for consistency and to eliminate all unsafe UTF-8 conversions.

packages/@tailwindcss-upgrade/src/codemods/template/is-safe-migration.ts (2)

2-2: Verify dependency installation.

Ensure that node-html-parser has been added to the dependencies in packages/@tailwindcss-upgrade/package.json. The import statement is appropriate for HTML parsing, but the dependency must be properly declared for the code to function.


195-195: The parser options { lowerCaseTagName: false, comment: false, blockTextElements: { style: true } } are all valid and correctly used according to node-html-parser documentation. The configuration appropriately disables tag name lowercasing, excludes HTML comments, and preserves raw text content within <style> tags—which is the intended behavior for template parsing.

Comment on lines +191 to 209
// Robustly locates all <style> blocks (with or without attributes) using an HTML parser.
const styleBlockRanges = new DefaultMap((source: string) => {
let ranges: number[] = []
let offset = 0

while (true) {
let startTag = source.indexOf('<style', offset)
if (startTag === -1) return ranges

offset = startTag + 1

// Ensure the style looks like:
// - `<style>` (closed)
// - `<style …>` (with attributes)
if (!source[startTag + 6].match(/[>\s]/)) continue

let endTag = source.indexOf('</style>', offset)
if (endTag === -1) return ranges
offset = endTag + 1

ranges.push(startTag, endTag)
const ranges: number[] = []
try {
const root = parseHtml(source, { lowerCaseTagName: false, comment: false, blockTextElements: { style: true } })
const styleNodes = root.querySelectorAll('style')
styleNodes.forEach(node => {
const nodeHtml = node.toString()
const start = typeof node.range === 'object' && node.range !== null
? node.range[0]
: source.indexOf(nodeHtml)
const end = start + nodeHtml.length
ranges.push(start, end)
})
} catch (_) {
// fallback: do nothing if parser fails
}
return ranges
})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Robust HTML parser implementation with minor fallback concern.

The replacement of manual string scanning with node-html-parser is a significant improvement that correctly handles style tags with attributes, nesting, and complex HTML structures. The parser options are appropriate:

  • blockTextElements: { style: true } ensures content inside <style> tags isn't parsed as HTML
  • lowerCaseTagName: false preserves case sensitivity for accuracy

However, the fallback on line 201 (source.indexOf(nodeHtml)) could match the wrong occurrence if the source contains multiple identical <style> blocks. This is an edge case but could lead to incorrect range detection.

The silent error handling (line 205-207) is reasonable for graceful degradation, but consider logging errors in development mode to aid debugging.

Optional improvement for the indexOf fallback:

-      const start = typeof node.range === 'object' && node.range !== null
-        ? node.range[0]
-        : source.indexOf(nodeHtml)
+      // Only use indexOf as last resort; range should be available in most cases
+      const start = typeof node.range === 'object' && node.range !== null && node.range[0] !== undefined
+        ? node.range[0]
+        : source.indexOf(nodeHtml)
🤖 Prompt for AI Agents
In packages/@tailwindcss-upgrade/src/codemods/template/is-safe-migration.ts
around lines 191 to 209, the fallback that uses source.indexOf(nodeHtml) can
pick the wrong identical <style> occurrence; replace the fallback with a
deterministic search by tracking a moving search offset (e.g., maintain a
lastIndex variable and call source.indexOf(nodeHtml, lastIndex) then advance
lastIndex to end) or prefer parser-provided node.range/startIndex if available,
and ensure ranges are pushed in document order; additionally, change the silent
catch to optionally emit a development-only debug log (or rethrow in dev) rather
than completely swallowing the parser error.


// SAFETY: Unstable sort is faster and in this scenario it's also safe because we are
// guaranteed to have unique candidates.
// Unstable sort is performant & remains correct with unique candidates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While not a traditional safety message this is 100% intentional because behavior would be incorrect in the face of non-unique candidates + an unstable sort.

// eprintln!("Skipped invalid UTF-8 candidate, error: {:?}", _e);
None
}
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an intentional performance optimization. String::from_utf8 does a validation pass over the string which should be uncnessary here. The extractor's implementation should guarantee that the resulting candidates are already UTF-8.

If you have a test case for this that crashes when scanning files please open a separate issue and I will take a look.

@Saniya777-byte
Copy link
Author

Thanks for the clarification!

That makes sense—I wasn't aware the extractor always guarantees valid UTF-8 candidates. I'll leave the existing performance-oriented implementation as is. If I ever encounter a crash or edge case where invalid candidates slip through, I'll open a separate issue with a reproducible example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants