Skip to content

fix(taxonomy): use utf8.DecodeRuneInString in GroupByLetter to handle non-ASCII names#96

Merged
greynewell merged 3 commits intomainfrom
fix-taxonomy-group-by-letter
Apr 9, 2026
Merged

fix(taxonomy): use utf8.DecodeRuneInString in GroupByLetter to handle non-ASCII names#96
greynewell merged 3 commits intomainfrom
fix-taxonomy-group-by-letter

Conversation

@greynewell
Copy link
Copy Markdown
Contributor

@greynewell greynewell commented Apr 9, 2026

Summary

GroupByLetter was calling rune(entry.Name[0]) to get the first character for alphabetical grouping. entry.Name[0] is a byte, not a rune — for any name that starts with a multi-byte UTF-8 character (é, ñ, ü, ā, etc.) the first byte is misread as a different Latin-1 character. For example:

Entry First byte (0x…) Misread as Correct letter
Étoile 0xC3 Ã É
Ñoño 0xC3 Ã Ñ
Über 0xC3 Ã Ü

All three entries would be silently grouped under 'Ã' instead of their correct letters, corrupting A–Z navigation pages.

Fix: replace rune(entry.Name[0]) with utf8.DecodeRuneInString(entry.Name) which correctly decodes the first Unicode code point regardless of byte length.

Test plan

  • TestGroupByLetterNonASCII: entries starting with É, Ñ, Ü are grouped under their correct Unicode letters, not under 'Ã'.
  • TestGroupByLetterASCII: ASCII entries still group correctly.
  • TestGroupByLetterEmpty: empty names are skipped.
  • TestTopEntries: TopEntries returns correct order without mutating the original.
  • go test ./... — all existing tests pass.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Fixed slug collision handling to prevent duplicate documentation page identifiers.
    • Corrected line count calculation for generated documentation pages.
    • Improved alphabetical grouping to properly handle non-ASCII characters (é, ñ, ü, etc.).
  • Tests

    • Added comprehensive test coverage for slug collision resolution and line counting behavior.

greynewell and others added 2 commits April 9, 2026 13:47
…r Function/Class/Type

Slug collision: when a base slug (e.g. "fn-handler-go-run") was already
taken and a "-2" suffix was appended, the new slug was never recorded in
usedSlugs. A later node naturally producing "fn-handler-go-run-2" would
pass the uniqueness check and get the same slug, causing one output file
to silently overwrite the other. Fixed by marking the resolved slug used.

line_count: Function, Class, and Type frontmatter computed
endLine - startLine + 1 without guarding against startLine == 0 (the
sentinel for "not provided"). This produced line_count = endLine + 1
instead of endLine. File nodes already defaulted startLine to 1;
the three symbol node types now do the same.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… non-ASCII names

rune(entry.Name[0]) reads only the first byte of the entry name and
misinterprets it as a Unicode code point. For any name starting with a
multi-byte UTF-8 character (é, ñ, ü, etc.) the first byte (e.g. 0xC3)
is decoded as a different Latin-1 character (Ã), so those entries land
in the wrong letter group in the A–Z navigation.

Replace with utf8.DecodeRuneInString to correctly extract the first rune.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 9, 2026

Warning

Rate limit exceeded

@greynewell has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 6 minutes and 51 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 6 minutes and 51 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b6c1db83-89b2-4361-a400-0771b88abb3f

📥 Commits

Reviewing files that changed from the base of the PR and between 51e6cfa and 1e01cd3.

📒 Files selected for processing (1)
  • internal/archdocs/pssg/taxonomy/taxonomy_test.go

Walkthrough

This PR fixes three distinct bugs in documentation generation: preventing slug collisions when multiple nodes resolve to the same slug path, correcting line count calculations when documentation source lines lack start information, and improving UTF-8 character support in taxonomy grouping. Test coverage is added for all changes.

Changes

Cohort / File(s) Summary
Slug Collision & Line Count Fixes
internal/archdocs/graph2md/graph2md.go, internal/archdocs/graph2md/graph2md_test.go
Fixed slug collision handling by registering disambiguated slugs to prevent reuse. Corrected line count calculation to clamp start lines to 1 when missing, using endLine - effectiveStart + 1 instead of endLine - startLine + 1. Added two test cases covering collision resolution and missing start line scenarios.
UTF-8 Taxonomy Handling
internal/archdocs/pssg/taxonomy/taxonomy.go, internal/archdocs/pssg/taxonomy/taxonomy_test.go
Upgraded character classification to use utf8.DecodeRuneInString instead of raw byte indexing, enabling correct grouping of non-ASCII characters (É, Ñ, Ü) under their proper Unicode letters rather than misclassifying multi-byte sequences. Added comprehensive test coverage for ASCII/non-ASCII grouping and ranking behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • jonathanpopham

Poem

🐛 Three bugs went down the documentation trail,
Slugs colliding, line counts off-scale,
UTF-8 runes caught by byte-blind eyes—
Now tested, fixed, and Unicode-wise! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main fix in the changeset: replacing byte-based character extraction with proper UTF-8 decoding in the GroupByLetter function.
Description check ✅ Passed The description covers the what (the bug and fix with clear examples), why (broken navigation), and includes a test plan with specific test cases, though it doesn't explicitly mention 'make test' and 'make lint' passing.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-taxonomy-group-by-letter

Comment @coderabbitai help to get the list of available commands and usage tips.

Remove extra spaces used to manually align inline comments; gofmt uses
tab-based alignment, not space-based, so the extra spaces failed linting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/archdocs/graph2md/graph2md_test.go`:
- Around line 124-127: The test currently calls os.ReadDir(outDir) and discards
the error; update the test in graph2md_test.go to capture and check the error
returned by os.ReadDir (e.g., entries, err := os.ReadDir(outDir)) and call
t.Fatalf or t.Fatalf-like assertion if err != nil, including the err text; keep
the subsequent length check (len(entries) != 1) but only after ensuring err is
nil so failures show the real ReadDir error rather than hiding it.

In `@internal/archdocs/graph2md/graph2md.go`:
- Around line 373-377: The current disambiguation only appends "-{n+1}" once and
can collide if that suffixed slug already exists; update the logic around the
usedSlugs map and slug variable so that when a clash is detected you loop:
compute a candidate = fmt.Sprintf("%s-%d", baseSlug, next), increment next until
candidate is not present in usedSlugs, then register that candidate in
usedSlugs; ensure you preserve/increment the counter for the original base key
so future collisions continue from the correct next value. Reference the
usedSlugs map and the slug/baseSlug variables in the collision branch and
replace the single-assignment change with a while/for loop that finds a free
slug before assigning usedSlugs[candidate]=1.

In `@internal/archdocs/pssg/taxonomy/taxonomy_test.go`:
- Around line 43-46: Run goimports (or gofmt -w followed by goimports) on the
test file to fix formatting/import ordering so golangci-lint passes;
specifically reformat the block containing the struct literals with Name/Slug
(the entries "Étoile", "Ñoño", "Über", "English") so spacing/indentation and
import grouping conform to goimports, then stage the updated file and push the
change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7bc30a93-b826-4596-b798-bf5c5ac85690

📥 Commits

Reviewing files that changed from the base of the PR and between 5c9e737 and 51e6cfa.

📒 Files selected for processing (4)
  • internal/archdocs/graph2md/graph2md.go
  • internal/archdocs/graph2md/graph2md_test.go
  • internal/archdocs/pssg/taxonomy/taxonomy.go
  • internal/archdocs/pssg/taxonomy/taxonomy_test.go

Comment on lines +124 to +127
entries, _ := os.ReadDir(outDir)
if len(entries) != 1 {
t.Fatalf("expected 1 output file, got %d", len(entries))
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Don’t ignore ReadDir errors in the test.

At Line 124, swallowing the error can hide the real failure and make the test flaky to debug.

Proposed fix
-	entries, _ := os.ReadDir(outDir)
+	entries, err := os.ReadDir(outDir)
+	if err != nil {
+		t.Fatalf("ReadDir: %v", err)
+	}
 	if len(entries) != 1 {
 		t.Fatalf("expected 1 output file, got %d", len(entries))
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
entries, _ := os.ReadDir(outDir)
if len(entries) != 1 {
t.Fatalf("expected 1 output file, got %d", len(entries))
}
entries, err := os.ReadDir(outDir)
if err != nil {
t.Fatalf("ReadDir: %v", err)
}
if len(entries) != 1 {
t.Fatalf("expected 1 output file, got %d", len(entries))
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/archdocs/graph2md/graph2md_test.go` around lines 124 - 127, The test
currently calls os.ReadDir(outDir) and discards the error; update the test in
graph2md_test.go to capture and check the error returned by os.ReadDir (e.g.,
entries, err := os.ReadDir(outDir)) and call t.Fatalf or t.Fatalf-like assertion
if err != nil, including the err text; keep the subsequent length check
(len(entries) != 1) but only after ensuring err is nil so failures show the real
ReadDir error rather than hiding it.

Comment on lines 373 to 377
if n, ok := usedSlugs[slug]; ok {
usedSlugs[slug] = n + 1
slug = fmt.Sprintf("%s-%d", slug, n+1)
usedSlugs[slug] = 1 // register the disambiguated slug so a later node can't claim it
} else {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Slug disambiguation can still collide with pre-existing suffixed slugs.

Nice improvement, but there’s still a correctness gap at Line 373–377: if fn-x-3 already exists naturally, and a later collision on fn-x generates fn-x-3, you’ll still get duplicate output paths.

Proposed fix (keep incrementing until an unused slug is found)
-		// Handle slug collisions
-		if n, ok := usedSlugs[slug]; ok {
-			usedSlugs[slug] = n + 1
-			slug = fmt.Sprintf("%s-%d", slug, n+1)
-			usedSlugs[slug] = 1 // register the disambiguated slug so a later node can't claim it
-		} else {
-			usedSlugs[slug] = 1
-		}
+		// Handle slug collisions
+		baseSlug := slug
+		if _, exists := usedSlugs[baseSlug]; !exists {
+			usedSlugs[baseSlug] = 1
+		} else {
+			next := usedSlugs[baseSlug] + 1
+			for {
+				candidate := fmt.Sprintf("%s-%d", baseSlug, next)
+				if _, taken := usedSlugs[candidate]; !taken {
+					slug = candidate
+					usedSlugs[baseSlug] = next
+					usedSlugs[slug] = 1
+					break
+				}
+				next++
+			}
+		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/archdocs/graph2md/graph2md.go` around lines 373 - 377, The current
disambiguation only appends "-{n+1}" once and can collide if that suffixed slug
already exists; update the logic around the usedSlugs map and slug variable so
that when a clash is detected you loop: compute a candidate =
fmt.Sprintf("%s-%d", baseSlug, next), increment next until candidate is not
present in usedSlugs, then register that candidate in usedSlugs; ensure you
preserve/increment the counter for the original base key so future collisions
continue from the correct next value. Reference the usedSlugs map and the
slug/baseSlug variables in the collision branch and replace the
single-assignment change with a while/for loop that finds a free slug before
assigning usedSlugs[candidate]=1.

@greynewell greynewell merged commit 6267347 into main Apr 9, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant