fix(taxonomy): use utf8.DecodeRuneInString in GroupByLetter to handle non-ASCII names by greynewell · Pull Request #96 · supermodeltools/cli

greynewell · 2026-04-09T17:53:42Z

Summary

GroupByLetter was calling rune(entry.Name[0]) to get the first character for alphabetical grouping. entry.Name[0] is a byte, not a rune — for any name that starts with a multi-byte UTF-8 character (é, ñ, ü, ā, etc.) the first byte is misread as a different Latin-1 character. For example:

Entry	First byte (0x…)	Misread as	Correct letter
Étoile	0xC3	Ã	É
Ñoño	0xC3	Ã	Ñ
Über	0xC3	Ã	Ü

All three entries would be silently grouped under 'Ã' instead of their correct letters, corrupting A–Z navigation pages.

Fix: replace rune(entry.Name[0]) with utf8.DecodeRuneInString(entry.Name) which correctly decodes the first Unicode code point regardless of byte length.

Test plan

TestGroupByLetterNonASCII: entries starting with É, Ñ, Ü are grouped under their correct Unicode letters, not under 'Ã'.
TestGroupByLetterASCII: ASCII entries still group correctly.
TestGroupByLetterEmpty: empty names are skipped.
TestTopEntries: TopEntries returns correct order without mutating the original.
go test ./... — all existing tests pass.

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Fixed slug collision handling to prevent duplicate documentation page identifiers.
- Corrected line count calculation for generated documentation pages.
- Improved alphabetical grouping to properly handle non-ASCII characters (é, ñ, ü, etc.).
Tests
- Added comprehensive test coverage for slug collision resolution and line counting behavior.

…r Function/Class/Type Slug collision: when a base slug (e.g. "fn-handler-go-run") was already taken and a "-2" suffix was appended, the new slug was never recorded in usedSlugs. A later node naturally producing "fn-handler-go-run-2" would pass the uniqueness check and get the same slug, causing one output file to silently overwrite the other. Fixed by marking the resolved slug used. line_count: Function, Class, and Type frontmatter computed endLine - startLine + 1 without guarding against startLine == 0 (the sentinel for "not provided"). This produced line_count = endLine + 1 instead of endLine. File nodes already defaulted startLine to 1; the three symbol node types now do the same. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… non-ASCII names rune(entry.Name[0]) reads only the first byte of the entry name and misinterprets it as a Unicode code point. For any name starting with a multi-byte UTF-8 character (é, ñ, ü, etc.) the first byte (e.g. 0xC3) is decoded as a different Latin-1 character (Ã), so those entries land in the wrong letter group in the A–Z navigation. Replace with utf8.DecodeRuneInString to correctly extract the first rune. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai · 2026-04-09T17:53:55Z

Warning

Rate limit exceeded

@greynewell has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 6 minutes and 51 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 6 minutes and 51 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b6c1db83-89b2-4361-a400-0771b88abb3f

📥 Commits

Reviewing files that changed from the base of the PR and between 51e6cfa and 1e01cd3.

📒 Files selected for processing (1)

internal/archdocs/pssg/taxonomy/taxonomy_test.go

Walkthrough

This PR fixes three distinct bugs in documentation generation: preventing slug collisions when multiple nodes resolve to the same slug path, correcting line count calculations when documentation source lines lack start information, and improving UTF-8 character support in taxonomy grouping. Test coverage is added for all changes.

Changes

Cohort / File(s)	Summary
Slug Collision & Line Count Fixes `internal/archdocs/graph2md/graph2md.go`, `internal/archdocs/graph2md/graph2md_test.go`	Fixed slug collision handling by registering disambiguated slugs to prevent reuse. Corrected line count calculation to clamp start lines to 1 when missing, using `endLine - effectiveStart + 1` instead of `endLine - startLine + 1`. Added two test cases covering collision resolution and missing start line scenarios.
UTF-8 Taxonomy Handling `internal/archdocs/pssg/taxonomy/taxonomy.go`, `internal/archdocs/pssg/taxonomy/taxonomy_test.go`	Upgraded character classification to use `utf8.DecodeRuneInString` instead of raw byte indexing, enabling correct grouping of non-ASCII characters (É, Ñ, Ü) under their proper Unicode letters rather than misclassifying multi-byte sequences. Added comprehensive test coverage for ASCII/non-ASCII grouping and ranking behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

fix(graph2md): register disambiguated slugs and correct line_count for Function/Class/Type #95: Implements identical code-level fixes to slug collision registration and line count calculation in the same graph2md.go file.

Suggested reviewers

jonathanpopham

Poem

🐛 Three bugs went down the documentation trail,
Slugs colliding, line counts off-scale,
UTF-8 runes caught by byte-blind eyes—
Now tested, fixed, and Unicode-wise! ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main fix in the changeset: replacing byte-based character extraction with proper UTF-8 decoding in the GroupByLetter function.
Description check	✅ Passed	The description covers the what (the bug and fix with clear examples), why (broken navigation), and includes a test plan with specific test cases, though it doesn't explicitly mention 'make test' and 'make lint' passing.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix-taxonomy-group-by-letter

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Remove extra spaces used to manually align inline comments; gofmt uses tab-based alignment, not space-based, so the extra spaces failed linting. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/archdocs/graph2md/graph2md_test.go`:
- Around line 124-127: The test currently calls os.ReadDir(outDir) and discards
the error; update the test in graph2md_test.go to capture and check the error
returned by os.ReadDir (e.g., entries, err := os.ReadDir(outDir)) and call
t.Fatalf or t.Fatalf-like assertion if err != nil, including the err text; keep
the subsequent length check (len(entries) != 1) but only after ensuring err is
nil so failures show the real ReadDir error rather than hiding it.

In `@internal/archdocs/graph2md/graph2md.go`:
- Around line 373-377: The current disambiguation only appends "-{n+1}" once and
can collide if that suffixed slug already exists; update the logic around the
usedSlugs map and slug variable so that when a clash is detected you loop:
compute a candidate = fmt.Sprintf("%s-%d", baseSlug, next), increment next until
candidate is not present in usedSlugs, then register that candidate in
usedSlugs; ensure you preserve/increment the counter for the original base key
so future collisions continue from the correct next value. Reference the
usedSlugs map and the slug/baseSlug variables in the collision branch and
replace the single-assignment change with a while/for loop that finds a free
slug before assigning usedSlugs[candidate]=1.

In `@internal/archdocs/pssg/taxonomy/taxonomy_test.go`:
- Around line 43-46: Run goimports (or gofmt -w followed by goimports) on the
test file to fix formatting/import ordering so golangci-lint passes;
specifically reformat the block containing the struct literals with Name/Slug
(the entries "Étoile", "Ñoño", "Über", "English") so spacing/indentation and
import grouping conform to goimports, then stage the updated file and push the
change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7bc30a93-b826-4596-b798-bf5c5ac85690

📥 Commits

Reviewing files that changed from the base of the PR and between 5c9e737 and 51e6cfa.

📒 Files selected for processing (4)

internal/archdocs/graph2md/graph2md.go
internal/archdocs/graph2md/graph2md_test.go
internal/archdocs/pssg/taxonomy/taxonomy.go
internal/archdocs/pssg/taxonomy/taxonomy_test.go

coderabbitai · 2026-04-09T17:57:49Z

internal/archdocs/graph2md/graph2md_test.go

+	entries, _ := os.ReadDir(outDir)
+	if len(entries) != 1 {
+		t.Fatalf("expected 1 output file, got %d", len(entries))
+	}


⚠️ Potential issue | 🟡 Minor

Don’t ignore ReadDir errors in the test.

At Line 124, swallowing the error can hide the real failure and make the test flaky to debug.

Proposed fix

- entries, _ := os.ReadDir(outDir) + entries, err := os.ReadDir(outDir) + if err != nil { + t.Fatalf("ReadDir: %v", err) + } if len(entries) != 1 { t.Fatalf("expected 1 output file, got %d", len(entries)) }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

entries, _ := os.ReadDir(outDir)

if len(entries) != 1 {

t.Fatalf("expected 1 output file, got %d", len(entries))

}

entries, err := os.ReadDir(outDir)

if err != nil {

t.Fatalf("ReadDir: %v", err)

}

if len(entries) != 1 {

t.Fatalf("expected 1 output file, got %d", len(entries))

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal/archdocs/graph2md/graph2md_test.go` around lines 124 - 127, The test currently calls os.ReadDir(outDir) and discards the error; update the test in graph2md_test.go to capture and check the error returned by os.ReadDir (e.g., entries, err := os.ReadDir(outDir)) and call t.Fatalf or t.Fatalf-like assertion if err != nil, including the err text; keep the subsequent length check (len(entries) != 1) but only after ensuring err is nil so failures show the real ReadDir error rather than hiding it.

coderabbitai · 2026-04-09T17:57:49Z

internal/archdocs/graph2md/graph2md.go

 		if n, ok := usedSlugs[slug]; ok {
 			usedSlugs[slug] = n + 1
 			slug = fmt.Sprintf("%s-%d", slug, n+1)
+			usedSlugs[slug] = 1 // register the disambiguated slug so a later node can't claim it
 		} else {


⚠️ Potential issue | 🟠 Major

Slug disambiguation can still collide with pre-existing suffixed slugs.

Nice improvement, but there’s still a correctness gap at Line 373–377: if fn-x-3 already exists naturally, and a later collision on fn-x generates fn-x-3, you’ll still get duplicate output paths.

Proposed fix (keep incrementing until an unused slug is found)

- // Handle slug collisions - if n, ok := usedSlugs[slug]; ok { - usedSlugs[slug] = n + 1 - slug = fmt.Sprintf("%s-%d", slug, n+1) - usedSlugs[slug] = 1 // register the disambiguated slug so a later node can't claim it - } else { - usedSlugs[slug] = 1 - } + // Handle slug collisions + baseSlug := slug + if _, exists := usedSlugs[baseSlug]; !exists { + usedSlugs[baseSlug] = 1 + } else { + next := usedSlugs[baseSlug] + 1 + for { + candidate := fmt.Sprintf("%s-%d", baseSlug, next) + if _, taken := usedSlugs[candidate]; !taken { + slug = candidate + usedSlugs[baseSlug] = next + usedSlugs[slug] = 1 + break + } + next++ + } + }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal/archdocs/graph2md/graph2md.go` around lines 373 - 377, The current disambiguation only appends "-{n+1}" once and can collide if that suffixed slug already exists; update the logic around the usedSlugs map and slug variable so that when a clash is detected you loop: compute a candidate = fmt.Sprintf("%s-%d", baseSlug, next), increment next until candidate is not present in usedSlugs, then register that candidate in usedSlugs; ensure you preserve/increment the counter for the original base key so future collisions continue from the correct next value. Reference the usedSlugs map and the slug/baseSlug variables in the collision branch and replace the single-assignment change with a while/for loop that finds a free slug before assigning usedSlugs[candidate]=1.

internal/archdocs/pssg/taxonomy/taxonomy_test.go

greynewell and others added 2 commits April 9, 2026 13:47

greynewell requested a review from jonathanpopham as a code owner April 9, 2026 17:53

Fix goimports formatting in taxonomy_test.go

1e01cd3

Remove extra spaces used to manually align inline comments; gofmt uses tab-based alignment, not space-based, so the extra spaces failed linting. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai bot reviewed Apr 9, 2026

View reviewed changes

greynewell merged commit 6267347 into main Apr 9, 2026
7 checks passed

coderabbitai bot mentioned this pull request Apr 10, 2026

fix: rune-safe byte slicing in token estimator, error snippets, and maskKey #112

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(taxonomy): use utf8.DecodeRuneInString in GroupByLetter to handle non-ASCII names#96

fix(taxonomy): use utf8.DecodeRuneInString in GroupByLetter to handle non-ASCII names#96
greynewell merged 3 commits intomainfrom
fix-taxonomy-group-by-letter

greynewell commented Apr 9, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Apr 9, 2026 •

edited

Loading

Rate limit exceeded

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Apr 9, 2026

Uh oh!

coderabbitai bot Apr 9, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

greynewell commented Apr 9, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greynewell commented Apr 9, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 9, 2026 •

edited

Loading