| 1 | +# i18n terminology evidence pile — details |
| 2 | + |
| 3 | +Full inventory, provenance, layout rules, and open items for the reference pile. Must-knows and the top-level structure |
| 4 | +map: [README.md](README.md). |
| 5 | + |
| 6 | +## Layout and locale keys |
| 7 | + |
| 8 | +The pile is language-first: `_ignored/i18n/<tag>/<source>/…`, 201 locale folders. Inside each `<tag>` the sources are |
| 9 | +subdirs (`macOS/`, `microsoft-terminology/`, `microsoft-style-guides/`, `gnome-nautilus/`, `xfce-thunar/`), present only |
| 10 | +where that source has the language. |
| 11 | + |
| 12 | +`<tag>` is a BCP-47 locale, derived losslessly from each source's native code — punctuation and script-modifier |
| 13 | +normalization only, never region↔script remapping: |
| 14 | + |
| 15 | +- **macOS / GNOME / Xfce**: `_`→`-` (`pt_BR`→`pt-BR`, `en_GB`→`en-GB`); `@mod`→`-Subtag` (`sr@latin`→`sr-Latn`, |
| 16 | + `ca@valencia`→`ca-valencia`, `uz@cyrillic`→`uz-Cyrl`); legacy macOS `no`→`nb`. |
| 17 | +- **microsoft-terminology**: the TBX's authoritative internal `xml:lang` (already BCP-47: `zh-Hans`, `pt-BR`, `sr-Latn`). |
| 18 | +- **microsoft-style-guides**: the slug→tag table in `_extract/reorg/main.go` — derived from terminology's codes by |
| 19 | +  name-matching, plus an override map for the regional/script splits that the terminology lumps together (`french-canada`→ |
| 20 | + `fr-CA`, `spanish-mexico`→`es-MX`, `english-uk`→`en-GB`, `azerbaijani`→`az-Latn`, etc.). Unspecified-script slugs map |
| 21 | + to base (`punjabi`→`pa`, `uzbek`→`uz`, `sanskrit`→`sa`). |
| 22 | + |
| 23 | +### Lossless sibling families |
| 24 | + |
| 25 | +Because we don't force region↔script merges, a few languages whose sources slice them differently end up as separate |
| 26 | +sibling folders, each with a `_see-also.txt` listing the set: |
| 27 | + |
| 28 | +- **Chinese**: `zh-Hans`/`zh-Hant` (Microsoft, script) vs `zh-CN`/`zh-TW`/`zh-HK` (macOS/GNOME, region). |
| 29 | +- **Serbian**: `sr-Cyrl`/`sr-Latn`/`sr-Cyrl-BA` (Microsoft) vs `sr`/`sr-Latn`/`sr-ije` (GNOME). |
| 30 | +- **Norwegian**: `nb`/`nn` (macOS/GNOME) vs `nb-NO`/`nn-NO` (Microsoft). |
| 31 | + |
| 32 | +For de, sv, hu, and every plain base language there are no such splits: one clean folder each. |
| 33 | + |
| 34 | +### Scripts |
| 35 | + |
| 36 | +- `_extract/macos-extract/` (`go run main.go`): harvests this Mac's bundles into `<tag>/macOS/<source>/…`. Re-runnable — |
| 37 | + clears every `<tag>/macOS` subtree and rewrites it, leaving other sources untouched. |
| 38 | +- `_extract/reorg/` (`go run main.go`): the one-shot source-first→language-first restructure. Already run (it consumes |
| 39 | + the flat source dirs). Kept as the documented, reproducible mapping. |
| 40 | + |
| 41 | +## What's collected |
| 42 | + |
| 43 | +### macOS (Tier 1) |
| 44 | + |
| 45 | +- **What**: localized UI strings from this Mac's system bundles, per language as JSON. |
| 46 | +- **Layout**: `<tag>/macOS/<source>/<file>.json`. Sources harvested: `Finder`, `CoreTypes` (kind names like folder, |
| 47 | + volume), `AppKit` (standard buttons/menus: Cancel, Open, Eject, Move to Trash), `SystemSettings`. |
| 48 | +- **Coverage**: 42 languages, 6,174 JSON files, ~32 MB. Includes sv, de, hu plus the full macOS language set. |
| 49 | +- **Scope caveat**: curated to file-manager + standard-UI bundles, NOT every `.loctable` on the OS. Broaden by adding |
| 50 | + entries to the `sources` list in `_extract/macos-extract/main.go` and re-running. |
| 51 | +- **Provenance**: extracted from `/System/…` on this machine via `plutil -convert json`, 2026-06-19. Re-run any time to |
| 52 | + refresh against the current macOS build. |
| 53 | + |
| 54 | +### microsoft-terminology (Tier 2) |
| 55 | + |
| 56 | +- **What**: Microsoft Terminology Collection, the full per-language TBX glossaries. |
| 57 | +- **Layout**: `<tag>/microsoft-terminology/<LANGUAGE>.tbx` (e.g. `fr/microsoft-terminology/FRENCH.tbx`). |
| 58 | +- **Coverage**: 111 languages, ~2.6 GB. Pretty-printed XML (`xmllint --format --huge`, 111/111) so it's browsable. |
| 59 | +- **Provenance**: `https://download.microsoft.com/download/b/2/d/b2db7a7c-8d33-47f3-b2c1-ee5e6445cf45/MicrosoftTermCollection.zip`, |
| 60 | + downloaded 2026-06-19; upstream files dated 2024-11-06. The source zip is kept in `_downloads/` for re-extraction |
| 61 | + (note: re-extraction yields the original single-line TBX; re-run the `xmllint --format` pass after). |
| 62 | +- **License**: Microsoft Terminology license (reference use; see the usage rule in README.md). |
| 63 | + |
| 64 | +### microsoft-style-guides (Tier 2) |
| 65 | + |
| 66 | +- **What**: Microsoft Localization Style Guides (tone, formality, conventions, do/don't) per language. |
| 67 | +- **Layout**: `<tag>/microsoft-style-guides/StyleGuide.pdf`. |
| 68 | +- **Coverage**: all 102 available languages, ~82 MB, 0 download failures. German (82 pp), Swedish (58 pp), Hungarian |
| 69 | + (62 pp) among them. |
| 70 | +- **Provenance**: `https://aka.ms/<language>-styleguide` redirects, downloaded 2026-06-19. Language list from |
| 71 | + https://learn.microsoft.com/en-us/globalization/reference/microsoft-style-guides. |
| 72 | + |
| 73 | +### gnome-nautilus, xfce-thunar (Tier 3) |
| 74 | + |
| 75 | +- **What**: translation catalogs (`.po`) for the two GTK file managers: exactly the file-manager domain, across many |
| 76 | +  languages. This is the cross-language parity source (equal depth for languages David speaks and ones he doesn't). |
| 77 | +- **Layout**: `<tag>/gnome-nautilus/nautilus.po`, `<tag>/xfce-thunar/thunar.po`. |
| 78 | +- **Coverage**: Nautilus 123 languages (~28 MB), Thunar 67 languages (~10 MB). |
| 79 | +- **License**: GPL (reference use; don't copy strings verbatim, same rule as the vendor sources). |
| 80 | +- **Provenance**: shallow `git clone` on 2026-06-19, `po/*.po` copied out, clones then removed: |
| 81 | + - Nautilus `https://gitlab.gnome.org/GNOME/nautilus.git` @ `c4658b913a21740b874a4c955f51ff4494b8417b` (2026-06-19). |
| 82 | + - Thunar `https://gitlab.xfce.org/xfce/thunar.git` @ `7410dc9b93a6c56b39ad2d0c6e29ccfbe1a76862` (2026-06-18). |
| 83 | + - Re-clone to refresh. |
| 84 | + |
| 85 | +## Decisions made / open items |
| 86 | + |
| 87 | +- **Lossless siblings, separate regional/script folders (2026-06-19).** Chosen over collapsing to base or force-merging |
| 88 | +  scripts; the cost is that Chinese/Serbian/Norwegian reference is spread across siblings (`_see-also.txt` bridges them), and |
| 89 | + the gain is zero data loss and no opinionated remap. Irrelevant to de/sv/hu. |
| 90 | +- **Windows (Tier 1) — skipped, by decision (2026-06-19).** Reading the UTM VM's filesystem from here isn't feasible, |
| 91 | + and Microsoft Terminology + Style Guides (Tier 2) already capture Windows terminology authoritatively. Revisit only if |
| 92 | + a specific term needs the live Windows wording; then share a folder out of the VM (or mount its disk image) and |
| 93 | + harvest the `.mui` resource strings. |
| 94 | +- **KDE Dolphin (Tier 3) — not collected.** KDE keeps translations in per-language l10n repos rather than the app repo, |
| 95 | + so harvesting Dolphin across languages is more work than the clean `po/`-dir clones used for Nautilus and Thunar. |
| 96 | + GNOME + Xfce already give broad file-manager parity; add Dolphin later if a term needs a third Linux data point. |
| 97 | + |
| 98 | +## Notes |
| 99 | + |
| 100 | +- (scratch space for term-by-term findings, conflicts, and rulings as the glossary work proceeds) |
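
The macOS / GNOME / Xfce rule under "Layout and locale keys" (`_`→`-`, `@mod`→`-Subtag`, legacy `no`→`nb`) can be illustrated with a minimal Go sketch. This is not the code in `_extract/`; the function name `normalizeTag` and both lookup tables are hypothetical, and the tables contain only the examples quoted in this document.

```go
package main

import (
	"fmt"
	"strings"
)

// scriptModifiers maps POSIX-style @modifiers to BCP-47 subtags.
// Hypothetical table: it holds only the three examples quoted in the document.
var scriptModifiers = map[string]string{
	"latin":    "Latn",
	"cyrillic": "Cyrl",
	"valencia": "valencia",
}

// legacyCodes holds whole-tag renames. The document names a single one,
// the legacy macOS "no" -> "nb"; this sketch applies it to any input.
var legacyCodes = map[string]string{"no": "nb"}

// normalizeTag rewrites a native locale code (e.g. "pt_BR", "sr@latin")
// using punctuation and modifier normalization only. It never remaps a
// region to a script or the reverse.
func normalizeTag(native string) string {
	base, mod, hasMod := strings.Cut(native, "@")
	base = strings.ReplaceAll(base, "_", "-")
	if renamed, ok := legacyCodes[base]; ok {
		base = renamed
	}
	if !hasMod {
		return base
	}
	if subtag, ok := scriptModifiers[mod]; ok {
		return base + "-" + subtag
	}
	// Unknown modifier: keep it verbatim rather than guess a mapping.
	return base + "-" + mod
}

func main() {
	natives := []string{"pt_BR", "en_GB", "sr@latin", "ca@valencia", "uz@cyrillic", "no"}
	for _, native := range natives {
		fmt.Printf("%s -> %s\n", native, normalizeTag(native))
	}
}
```

The sketch appends the modifier subtag after everything else, so a code carrying both a region and a script modifier would come out in region-then-script order, which is not canonical BCP-47. None of the examples in this document combine the two, so that case is left unhandled here.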