Commit 0759d72

i18n: move reference-pile docs into the repo; add the fan-out-sweep playbook
The reference-pile DOCS (what's collected, the authority tiers, provenance, and the per-source mining recipes) now live tracked at docs/i18n/reference-pile/ (README.md, inventory.md, how-to-mine.md). The ~3 GB of macOS/Microsoft/GNOME/Xfce reference DATA stays gitignored under _ignored/i18n/ with a thin pointer README. docs/guides/i18n-translation.md links the new docs. Also adds a "Fan-out sweeps" section to docs/guides/multi-agent-refactors.md capturing this effort's learnings: cap concurrency at ~3 subagents (Anthropic API rate limits), give absolute write paths (subagents inherit the orchestrator cwd), terse returns with artifacts written to files, guardrails baked into the prompt, and a batched meta-check round after.
1 parent 45b6a7d commit 0759d72

5 files changed

Lines changed: 239 additions & 7 deletions


docs/guides/i18n-translation.md

Lines changed: 6 additions & 7 deletions
@@ -28,13 +28,12 @@ Set a translator (human or agent) up for excellence with three inputs, never mix
2828

2929
## Researching terms: the reference pile
3030

31-
Don't guess terms, and don't figure out the sources from scratch. A local reference pile at the repo root,
32-
`_ignored/i18n/`, holds authoritative localizations keyed by language: `_ignored/i18n/<tag>/` gathers, for that one
33-
language, the real macOS UI strings, the Microsoft terminology and localization style guide, and the GNOME/Xfce
34-
file-manager catalogs. Read `_ignored/i18n/CLAUDE.md` for what's there and the authority tiers, and
35-
`_ignored/i18n/HOW-TO-MINE.md` for tested per-source recipes (greps, jq, `msggrep`, `pdftotext`). (The pile is
36-
gitignored, so those paths are plain references, not links; in a linked worktree it's symlinked in, so the same paths
37-
resolve.)
31+
Don't guess terms, and don't figure out the sources from scratch. A reference pile holds authoritative localizations
32+
keyed by language: the ~3 GB of macOS, Microsoft, and GNOME/Xfce data sits gitignored at `_ignored/i18n/<tag>/` (one
33+
folder per language), and the docs explaining it are tracked in the repo. Read
34+
[`reference-pile/README.md`](../i18n/reference-pile/README.md) for what's there and the authority tiers, and
35+
[`reference-pile/how-to-mine.md`](../i18n/reference-pile/how-to-mine.md) for tested per-source recipes (greps, jq,
36+
`msggrep`, `pdftotext`).
3837

3938
For each term or convention: triangulate across every source the language has (macOS is highest authority, then
4039
Microsoft, then GNOME/Xfce), pick the most native-sounding fit for Cmdr's voice, then record it in the style guide's

docs/guides/multi-agent-refactors.md

Lines changed: 18 additions & 0 deletions
@@ -23,6 +23,24 @@ entry points; this is the playbook they assume.
2323
5. **Close out.** An end-of-phase adversarial conformance review agent (against the invariants register) plus a
2424
docs-audit agent that reads ALL commit bodies and verifies every touched `CLAUDE.md`.
2525

26+
## Fan-out sweeps (parallel research or generation over a list)
27+
28+
A different shape from the refactor loop above: many independent agents each handle a slice of a long list (e.g. one
29+
doc per language). Hard-won rules:
30+
31+
- **Cap concurrency at ~3 subagents.** The Anthropic API rate-limits aggressively; a 14-agent burst gets throttled and
32+
agents silently finish with partial output. Keep ~3 in flight; queue the rest. Batch size per agent (3 vs 10 items) is separate
33+
and can stay large once concurrency is capped.
34+
- **Give ABSOLUTE write paths.** Subagents inherit the orchestrator's cwd (the main clone, not a worktree), so a
35+
relative `docs/x.md` write scatters into the wrong tree. Pass the full worktree path.
36+
- **Terse returns, artifacts to files.** Tell agents to write their output to files and reply with one line per item
37+
(plus flags), not a full report. Verbose returns blow the orchestrator's context over many waves. Persist a
38+
progress/plan file so the loop survives a context compaction.
39+
- **Bake guardrails into the prompt.** Anti-hallucination (ground claims in local sources, mark web-only as unverified),
40+
format rules, and "don't emit literal tool-call tags" prevent a cleanup pass later.
41+
- **Batched meta-check round after.** Independent reviewer agents (alphabet thirds) spot-check claims against ground
42+
truth and flag stubs; this is what catches the fabrication-class errors the generators miss.
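
The concurrency cap, absolute write paths, and terse returns above can be sketched in plain POSIX shell. This is a minimal illustration, not the orchestrator's actual mechanism: `run_agent` and `fan_out` are hypothetical names, `run_agent` is a stand-in that only writes a file, and the sketch drains whole waves of 3 rather than keeping a sliding window of 3 in flight.

```sh
# Hypothetical stand-in for one subagent: writes its artifact to a file under an
# absolute output directory and returns a single terse line on stdout.
run_agent() {
  item=$1
  out_dir=$2
  printf 'done %s\n' "$item" > "$out_dir/$item.txt"   # artifact goes to a file
  printf '%s ok\n' "$item"                            # one-line return
}

# Run one agent per item, never more than 3 at once (wave-based cap).
fan_out() {
  out_dir=$1
  shift
  max=3
  running=0
  for item in "$@"; do
    run_agent "$item" "$out_dir" &
    running=$((running + 1))
    if [ "$running" -ge "$max" ]; then
      wait          # drain the current wave before starting the next
      running=0
    fi
  done
  wait              # drain the final, possibly partial, wave
}

# Usage (the directory must be an absolute path):
#   fan_out /absolute/path/to/worktree/out de sv hu fr
```

Waves are simpler than a true queue but leave slots idle while the slowest agent in a wave finishes; a sliding window would match "keep ~3 in flight" more closely.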
43+
2644
## Why it works
2745

2846
The review rounds catch data-corruption-class design errors before implementation (a per-pane-vs-global counter that

docs/i18n/reference-pile/README.md

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
1+
# i18n terminology evidence pile
2+
3+
Authoritative reference data for choosing translation terms with confidence, feeding the per-language translation style
4+
guides at [`/docs/i18n/`](..) (`<tag>-style.md`). Goal: every term we pick can cite what a real localized
5+
OS or an official vendor glossary actually says, so choices match user expectations instead of an agent's guess. Full
6+
inventory, provenance, layout rules, and open items: [inventory.md](inventory.md). Process and confidence model:
7+
[`/docs/guides/i18n-translation.md`](../../guides/i18n-translation.md).
8+
9+
## Must-knows
10+
11+
- **Language-first: one folder per locale gathers every source.** To research a language, point a translator at
12+
`_ignored/i18n/<tag>/` (e.g. `_ignored/i18n/fr/`); inside are `macOS/`, `microsoft-terminology/`, `microsoft-style-guides/`,
13+
`gnome-nautilus/`, `xfce-thunar/` for that language. de, sv, hu, fr and every plain base language have all five.
14+
- **Mining recipes per source** (tested greps, jq, `msggrep`, `pdftotext`): [how-to-mine.md](how-to-mine.md). Use them;
15+
don't reinvent the search per term.
16+
- **Reference for picking terms, never strings to copy.** Apple's and Microsoft's strings are copyrighted; the
17+
GNOME/Xfce catalogs are GPL. We read them to decide what term matches user expectations, then write Cmdr's own catalog
18+
value. Don't paste any vendor or upstream string verbatim into `apps/desktop/src/lib/intl/messages/`.
19+
- **Locale key = BCP-47, lossless, base-preferred.** Tags are normalized from each source's native code (punctuation +
20+
script modifiers only: `pt_BR`→`pt-BR`, `sr@latin`→`sr-Latn`, legacy `no`→`nb`), with NO region↔script remapping. So
21+
multi-script/region languages stay as separate sibling folders (`zh-Hans` vs `zh-CN`, `sr-Cyrl` vs `sr-Latn`, `nb` vs
22+
`nb-NO`); those carry a `_see-also.txt` pointing to their siblings. This matches Cmdr's own `docs/i18n/<tag>-style.md`
23+
tag convention.
24+
- **Gitignored, lives in the main clone.** `_ignored/` is untracked (`.gitignore` line 9), so this ~3 GB pile stays
25+
local, isn't subject to the doc-system checks, and belongs in the main clone, not a worktree (worktrees get cleaned).
26+
- **Authority tiers** (how much a source proves "user expectation"): 1 = the real installed OS (macOS; strongest),
27+
2 = vendor terminology + style guides (Microsoft), 3 = broad open-source corpora (GNOME/Xfce; cross-language parity),
28+
4 = native human review (the only thing that makes a term "confirmed"; out of budget for now).
29+
30+
## Structure
31+
32+
```
33+
<tag>/<source>/… one folder per BCP-47 locale (201 of them); see inventory.md for the sources
34+
_extract/macos-extract/ reproducible macOS extractor — emits the <tag>/macOS/… layout (go run main.go)
35+
_extract/reorg/ one-shot source-first → language-first restructure (already run; kept for reference)
36+
_downloads/ raw MS Terminology zip, kept for re-extraction
37+
```
docs/i18n/reference-pile/how-to-mine.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
1+
# How to mine the reference pile
2+
3+
Tested recipes for extracting a term or convention from each source in `<tag>/`. The point: don't read whole files or
4+
reinvent the search each time. Triangulate a term across every source the language has, then record your choice with
5+
sources and a confidence in the per-language style guide (`docs/i18n/<tag>-style.md`). Structure and provenance of the
6+
pile: [README.md](README.md) and [inventory.md](inventory.md).
7+
8+
First check which sources your language has: `ls <tag>/` (a source is absent if its subdir is missing). Run everything
9+
below from `_ignored/i18n/`.
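
The source check can be made explicit with a small helper. This is a sketch only: `list_sources` is a hypothetical name, not part of the pile's tooling, and the five subdirectory names are the ones listed in README.md.

```sh
# Report which of the five documented sources a locale folder has.
# A source counts as absent when its subdirectory is missing.
list_sources() {
  tag_dir=$1
  for src in macOS microsoft-terminology microsoft-style-guides gnome-nautilus xfce-thunar; do
    if [ -d "$tag_dir/$src" ]; then
      printf '%s\tpresent\n' "$src"
    else
      printf '%s\tabsent\n' "$src"
    fi
  done
}

# Usage, from _ignored/i18n/:
#   list_sources sv
```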
10+
11+
## macOS (Tier 1, strongest) — `<tag>/macOS/<bundle>/*.json`
12+
13+
Flat `key: value` JSON per bundle (`Finder`, `AppKit`, `CoreTypes`, `SystemSettings`). Keys are stable across
14+
languages, so cross-reference English→target by key. The `en/` folder is the English side.
15+
16+
```sh
17+
# 1. Find the key for an English term (which bundle/file, and the key):
18+
grep -rl '"Eject"' en/macOS/ # which file has it
19+
grep -i 'eject' en/macOS/AppKit/AccessibilityImageDescriptions.json # -> "NSNavEjectButton": "eject"
20+
# 2. Read the target translation of that same key:
21+
grep '"NSNavEjectButton"' sv/macOS/AppKit/AccessibilityImageDescriptions.json # -> "mata ut"
22+
```
23+
24+
Or with jq, search value strings and print key+value:
25+
```sh
26+
jq -r 'to_entries[]|select((.value|type)=="string" and (.value|test("eject";"i")))|"\(.key)\t\(.value)"' \
27+
sv/macOS/Finder/*.json
28+
```
29+
macOS is the highest-authority source: it's what a user literally sees in Finder. Prefer it when sources disagree.
30+
31+
## Microsoft terminology (Tier 2) — `<tag>/microsoft-terminology/<LANG>.tbx`
32+
33+
Pretty-printed TBX XML, no namespace. Each `<termEntry>` has two `<langSet>`: `en-US` first, then the target. So in a
34+
window after the English `<term>`, the next `<term>` is the translation. `termNote type="geographicalUsage"` flags
35+
region (e.g. `AUT, DEU, CHE` for German), and `descrip type="definition"` gives the sense.
36+
37+
```sh
38+
# English -> target (read the second <term> in the window, and any geographicalUsage):
39+
grep -i -A14 '<term[^>]*>folder</term>' de/microsoft-terminology/GERMAN.tbx | grep -iE '<term|xml:lang|geographicalUsage'
40+
# -> folder ... <langSet xml:lang="de"> ... Ordner ... AUT, DEU, CHE, LUX
41+
42+
# Validate a candidate target term exists, and see its English source (-B = lines before):
43+
grep -i -B14 '<term[^>]*>Ordner</term>' de/microsoft-terminology/GERMAN.tbx | grep -iE '<term'
44+
```
45+
Files are large; grep (streaming) beats loading them. `xmllint --xpath` works too but reads the whole doc into memory.
46+
47+
## Microsoft style guide (Tier 2) — `<tag>/microsoft-style-guides/StyleGuide.pdf`
48+
49+
Use for tone, formality (how to address the user), capitalization, and grammar conventions — not single terms. Extract
50+
text once, then grep; or open sections with the Read tool (it renders PDF pages).
51+
52+
```sh
53+
pdftotext de/microsoft-style-guides/StyleGuide.pdf - | grep -iE -A3 'addressing the user|formal|du-form|anrede|tilltal'
54+
```
55+
The high-value sections are the early style/tone/grammar chapters and the "addressing the user" / formality section.
56+
57+
## GNOME / Xfce (Tier 3, cross-language parity) — `<tag>/gnome-nautilus/nautilus.po`, `<tag>/xfce-thunar/thunar.po`
58+
59+
gettext catalogs (`msgid` English, `msgstr` translation). Exactly the file-manager domain. Use `msggrep` (cleaner than
60+
grep for multi-line and plural entries):
61+
62+
```sh
63+
msggrep --msgid -e 'Eject' sv/gnome-nautilus/nautilus.po # entries whose msgid matches
64+
msggrep --msgstr -e 'papperskorg' sv/gnome-nautilus/nautilus.po # reverse: find by target word
65+
grep -A2 'Plural-Forms' sv/gnome-nautilus/nautilus.po # the language's plural rule
66+
```
67+
Plural entries use `msgid`/`msgid_plural` with `msgstr[0]`, `msgstr[1]`, … — good evidence for how a real catalog
68+
phrases counted strings in your language.
69+
70+
## Confidence rubric (record this per term)
71+
72+
- **confirmed**: David or a native reviewer signed off. Use freely.
73+
- **high**: macOS and/or Microsoft agree (cite which). Safe to use, still review-gated.
74+
- **tentative**: sources conflict or none had it; your best judgment. Push it to the style guide's "Decisions to confirm
75+
with David" section rather than burying it.
76+
77+
When sources disagree, weight by tier (macOS > Microsoft > GNOME/Xfce) but note the disagreement — it's often a
78+
macOS-vs-Windows split worth recording for the translator who comes after you.
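
One way to apply the tier weighting mechanically is sketched below. It is an illustration under assumptions: `pick_term` is a hypothetical helper, the tab-separated `tier`, `source`, `term` input format is invented for this example, and the terms in the usage line are placeholders rather than real findings. It picks the term from the lowest-numbered tier and flags any disagreement so it can be recorded.

```sh
# Input on stdin: one "tier<TAB>source<TAB>term" line per source that had the term.
# Output: "term<TAB>winning-source<TAB>sources-agree|sources-disagree".
pick_term() {
  sort -n -k1,1 | awk -F '\t' '
    NR == 1    { best = $3; src = $2 }   # lowest tier number wins
    $3 != best { conflict = 1 }          # any other source with a different term
    END {
      if (NR) printf "%s\t%s\t%s\n", best, src, (conflict ? "sources-disagree" : "sources-agree")
    }
  '
}

# Usage (placeholder terms):
#   printf '3\tgnome\tterm-b\n1\tmacOS\tterm-a\n2\tmicrosoft\tterm-a\n' | pick_term
```

The output is only a starting point: the confidence level still follows the rubric above, and a flagged disagreement belongs in the style guide next to the chosen term.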
docs/i18n/reference-pile/inventory.md

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
1+
# i18n terminology evidence pile — details
2+
3+
Full inventory, provenance, layout rules, and open items for the reference pile. Must-knows and the top-level structure
4+
map: [README.md](README.md).
5+
6+
## Layout and locale keys
7+
8+
The pile is language-first: `_ignored/i18n/<tag>/<source>/…`, 201 locale folders. Inside each `<tag>` the sources are
9+
subdirs (`macOS/`, `microsoft-terminology/`, `microsoft-style-guides/`, `gnome-nautilus/`, `xfce-thunar/`), present only
10+
where that source has the language.
11+
12+
`<tag>` is a BCP-47 locale, derived losslessly from each source's native code — punctuation and script-modifier
13+
normalization only, never region↔script remapping:
14+
15+
- **macOS / GNOME / Xfce**: `_`→`-` (`pt_BR`→`pt-BR`, `en_GB`→`en-GB`); `@mod`→`-Subtag` (`sr@latin`→`sr-Latn`,
16+
`ca@valencia`→`ca-valencia`, `uz@cyrillic`→`uz-Cyrl`); legacy macOS `no`→`nb`.
17+
- **microsoft-terminology**: the TBX's authoritative internal `xml:lang` (already BCP-47: `zh-Hans`, `pt-BR`, `sr-Latn`).
18+
- **microsoft-style-guides**: the slug→tag table in `_extract/reorg/main.go` — derived from terminology's codes by
19+
name-matching, plus an override map for Microsoft's regional/script splits that its terminology lumps together (`french-canada`→
20+
`fr-CA`, `spanish-mexico`→`es-MX`, `english-uk`→`en-GB`, `azerbaijani`→`az-Latn`, etc.). Unspecified-script slugs map
21+
to base (`punjabi`→`pa`, `uzbek`→`uz`, `sanskrit`→`sa`).
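
The punctuation and script-modifier rules for the macOS / GNOME / Xfce codes can be sketched as a small shell function. This is an illustration, not the code in `_extract/reorg/main.go`: `normalize_tag` is a hypothetical name, the modifier table covers only the examples given above, and the legacy `no` rule (documented for macOS) is applied unconditionally here for brevity.

```sh
# Normalize a source-native locale code to a BCP-47-style tag.
# Only punctuation and @modifiers change; region and script are never remapped,
# and subtags are not reordered.
normalize_tag() {
  code=$1
  base=${code%%@*}                         # part before any @modifier
  mod=
  case $code in
    *@*) mod=${code#*@} ;;                 # part after the @, if present
  esac
  base=$(printf '%s' "$base" | tr '_' '-') # pt_BR -> pt-BR
  if [ "$base" = "no" ]; then
    base=nb                                # legacy code for Norwegian Bokmal
  fi
  case $mod in
    '')       suffix= ;;
    latin)    suffix=-Latn ;;
    cyrillic) suffix=-Cyrl ;;
    *)        suffix=-$mod ;;              # variants such as valencia pass through
  esac
  printf '%s%s\n' "$base" "$suffix"
}

normalize_tag pt_BR        # -> pt-BR
normalize_tag sr@latin     # -> sr-Latn
normalize_tag ca@valencia  # -> ca-valencia
```

Because nothing is remapped, `zh_CN` becomes `zh-CN` and is never folded into `zh-Hans`, which is what produces the sibling folders described in the next section.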
22+
23+
### Lossless sibling families
24+
25+
Because we don't force region↔script merges, a few languages whose sources slice them differently end up as separate
26+
sibling folders, each with a `_see-also.txt` listing the set:
27+
28+
- **Chinese**: `zh-Hans`/`zh-Hant` (Microsoft, script) vs `zh-CN`/`zh-TW`/`zh-HK` (macOS/GNOME, region).
29+
- **Serbian**: `sr-Cyrl`/`sr-Latn`/`sr-Cyrl-BA` (Microsoft) vs `sr`/`sr-Latn`/`sr-ije` (GNOME).
30+
- **Norwegian**: `nb`/`nn` (macOS/GNOME) vs `nb-NO`/`nn-NO` (Microsoft).
31+
32+
For de, sv, hu, and every plain base language there are no such splits: one clean folder each.
33+
34+
### Scripts
35+
36+
- `_extract/macos-extract/` (`go run main.go`): harvests this Mac's bundles into `<tag>/macOS/<source>/…`. Re-runnable —
37+
clears every `<tag>/macOS` subtree and rewrites it, leaving other sources untouched.
38+
- `_extract/reorg/` (`go run main.go`): the one-shot source-first→language-first restructure. Already run (it consumes
39+
the flat source dirs). Kept as the documented, reproducible mapping.
40+
41+
## What's collected
42+
43+
### macOS (Tier 1)
44+
45+
- **What**: localized UI strings from this Mac's system bundles, per language as JSON.
46+
- **Layout**: `<tag>/macOS/<source>/<file>.json`. Sources harvested: `Finder`, `CoreTypes` (kind names like folder,
47+
volume), `AppKit` (standard buttons/menus: Cancel, Open, Eject, Move to Trash), `SystemSettings`.
48+
- **Coverage**: 42 languages, 6,174 JSON files, ~32 MB. Includes sv, de, hu plus the full macOS language set.
49+
- **Scope caveat**: curated to file-manager + standard-UI bundles, NOT every `.loctable` on the OS. Broaden by adding
50+
entries to the `sources` list in `_extract/macos-extract/main.go` and re-running.
51+
- **Provenance**: extracted from `/System/…` on this machine via `plutil -convert json`, 2026-06-19. Re-run any time to
52+
refresh against the current macOS build.
53+
54+
### microsoft-terminology (Tier 2)
55+
56+
- **What**: Microsoft Terminology Collection, the full per-language TBX glossaries.
57+
- **Layout**: `<tag>/microsoft-terminology/<LANGUAGE>.tbx` (e.g. `fr/microsoft-terminology/FRENCH.tbx`).
58+
- **Coverage**: 111 languages, ~2.6 GB. Pretty-printed XML (`xmllint --format --huge`, 111/111) so it's browsable.
59+
- **Provenance**: `https://download.microsoft.com/download/b/2/d/b2db7a7c-8d33-47f3-b2c1-ee5e6445cf45/MicrosoftTermCollection.zip`,
60+
downloaded 2026-06-19; upstream files dated 2024-11-06. The source zip is kept in `_downloads/` for re-extraction
61+
(note: re-extraction yields the original single-line TBX; re-run the `xmllint --format` pass after).
62+
- **License**: Microsoft Terminology license (reference use; see the usage rule in README.md).
63+
64+
### microsoft-style-guides (Tier 2)
65+
66+
- **What**: Microsoft Localization Style Guides (tone, formality, conventions, do/don't) per language.
67+
- **Layout**: `<tag>/microsoft-style-guides/StyleGuide.pdf`.
68+
- **Coverage**: all 102 available languages, ~82 MB, 0 download failures. German (82 pp), Swedish (58 pp), Hungarian
69+
(62 pp) among them.
70+
- **Provenance**: `https://aka.ms/<language>-styleguide` redirects, downloaded 2026-06-19. Language list from
71+
https://learn.microsoft.com/en-us/globalization/reference/microsoft-style-guides.
72+
73+
### gnome-nautilus, xfce-thunar (Tier 3)
74+
75+
- **What**: translation catalogs (`.po`) for the two GTK file managers — exactly the file-manager domain, across many
76+
languages — the cross-language parity source (equal depth for languages David speaks and ones he doesn't).
77+
- **Layout**: `<tag>/gnome-nautilus/nautilus.po`, `<tag>/xfce-thunar/thunar.po`.
78+
- **Coverage**: Nautilus 123 languages (~28 MB), Thunar 67 languages (~10 MB).
79+
- **License**: GPL (reference use; don't copy strings verbatim, same rule as the vendor sources).
80+
- **Provenance**: shallow `git clone` on 2026-06-19, `po/*.po` copied out, clones then removed:
81+
- Nautilus `https://gitlab.gnome.org/GNOME/nautilus.git` @ `c4658b913a21740b874a4c955f51ff4494b8417b` (2026-06-19).
82+
- Thunar `https://gitlab.xfce.org/xfce/thunar.git` @ `7410dc9b93a6c56b39ad2d0c6e29ccfbe1a76862` (2026-06-18).
83+
- Re-clone to refresh.
84+
85+
## Decisions made / open items
86+
87+
- **Lossless siblings, separate regional/script folders (2026-06-19).** Chosen over collapsing to base or force-merging
88+
scripts; the cost is that CJK/Serbian/Norwegian reference is spread across siblings (`_see-also.txt` bridges them), and
89+
the gain is zero data loss and no opinionated remap. Irrelevant to de/sv/hu.
90+
- **Windows (Tier 1) — skipped, by decision (2026-06-19).** Reading the UTM VM's filesystem from here isn't feasible,
91+
and Microsoft Terminology + Style Guides (Tier 2) already capture Windows terminology authoritatively. Revisit only if
92+
a specific term needs the live Windows wording; then share a folder out of the VM (or mount its disk image) and
93+
harvest the `.mui` resource strings.
94+
- **KDE Dolphin (Tier 3) — not collected.** KDE keeps translations in per-language l10n repos rather than the app repo,
95+
so harvesting Dolphin across languages is more work than the clean `po/`-dir clones used for Nautilus and Thunar.
96+
GNOME + Xfce already give broad file-manager parity; add Dolphin later if a term needs a third Linux data point.
97+
98+
## Notes
99+
100+
- (scratch space for term-by-term findings, conflicts, and rulings as the glossary work proceeds)
