feat(accent): local UniDic + POS-driven patches #53
Open
torrid-fish wants to merge 29 commits into
wade00754
requested changes
May 30, 2026
The greedy aligner had two failure modes that cascaded across whole sentences: a numeric anchor that over-consumed when Yahoo and OJAD disagreed on phrase boundary, and a +1 fallback path that turned a single mismatch into type-0 fallback for every downstream token. Replaces it with a global DP over (yahoo_token, ojad_entry) pairs: each Yahoo token consumes k ∈ [0, K_MAX] contiguous OJAD entries, with per-token cost computed via shape (punct/numeric/kana) and edit distance over rendaku-folded strings for kana tokens. Sub cost (0.4) is lower than ins/del (1.0) so the DP prefers same-length spans with substitutions over shorter spans with deletions — fixes the case where OJAD's `う` from `等→とう` leaked onto the next token. Adds a voicing-fold table so Yahoo's dictionary-form readings (ふんかん) align against OJAD's pronounced readings with rendaku (ぷんかん). All comparisons under this fold; ぱ/ば/ぷ/ぶ all alias to は/ふ. Refs #47.
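The per-token cost described above can be illustrated with a minimal sketch. It assumes a hand-written fold table and a plain weighted Levenshtein distance; the names `FOLD`, `fold`, and `weighted_edit_distance` are hypothetical and not taken from `align.py`. Only the 0.4 / 1.0 weights come from the commit message.

```python
# Hypothetical voicing-fold table: voiced and semi-voiced kana map to their
# unvoiced base so rendaku variants (ふんかん / ぷんかん) compare equal.
_VOICED = "がぎぐげござじずぜぞだぢづでどばびぶべぼぱぴぷぺぽ"
_BASE = "かきくけこさしすせそたちつてとはひふへほはひふへほ"
FOLD = str.maketrans(_VOICED, _BASE)

SUB_COST = 0.4    # substitution weight quoted in the commit message
INDEL_COST = 1.0  # insertion / deletion weight quoted in the commit message


def fold(reading: str) -> str:
    """Return the reading with voicing removed."""
    return reading.translate(FOLD)


def weighted_edit_distance(a: str, b: str) -> float:
    """Levenshtein distance over folded strings with cheap substitutions."""
    a, b = fold(a), fold(b)
    prev = [j * INDEL_COST for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * INDEL_COST]
        for j, cb in enumerate(b, 1):
            sub = prev[j - 1] + (0.0 if ca == cb else SUB_COST)
            cur.append(min(sub, prev[j] + INDEL_COST, cur[j - 1] + INDEL_COST))
        prev = cur
    return prev[-1]
```

Because a substitution is cheaper than an insertion or deletion, a candidate span of the same length as the token reading tends to win over a shorter span, which is the behaviour the commit relies on.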
Add api/accent/reading_overrides.py, a context-blind correction layer sitting between Yahoo Furigana and OJAD alignment. Each override is a regex on the concatenated surface text plus the replacement tokens that should appear instead. Covers:
- 曜日 brackets: (月)/(月) → げつ, (土) → ど, etc. for all 7 weekdays.
- All 31 day-of-month readings: 1日 → ついたち (atamadaka), 5日 → いつか, 14日 → じゅうよっか, 20日 → はつか, etc.
- N日間 durations 1-31: 1日間 → いちにちかん (NOT ついたちかん, since the 1st-of-month reading is impossible for a duration), 7日間 → しちにちかん (modern technical writing preference over なのかかん).
- 20歳 / 二十歳 / 20才 → はたち (the only irregular age reading).
Patterns accept arabic / full-width / kanji numeral variants of the same N, so `3月5日(土)` / `3月5日(土)` / `三月五日(土)` all trigger the same overrides. Order of overrides matters: the duration list precedes the date list so `N日間` wins over `N日` at the same start (longer match breaks ties in _collect_matches).
apply_furigana_overrides runs BEFORE align_accent so merged spans like `5日→いつか` reach OJAD as a single token whose furigana matches OJAD's phrase reading (the numeric-anchor logic in align_accent otherwise cascade-fails because numeric tokens lack any Yahoo furigana). apply_accent_overrides runs AFTER align to re-stamp both furigana and accent on the same matched spans, so the response is consistent.
Adds URL preprocessing: each https?:// URL is swapped for the placeholder "URLPLACEHOLDER" before the pipeline runs (Yahoo fragments URLs across several alphabet tokens; OJAD's phrasing scraper produces noise for Latin punctuation runs; both drag alignment off-rail). Placeholders are walked back to the originals in order after alignment. URL body stops at whitespace, any Japanese char, or `,()<>[]"'` so embedded URLs strip cleanly.
Adds a non-Japanese short-circuit: if (after URL stripping) the chunk contains no hiragana / katakana / CJK ideograph, skip Yahoo + OJAD entirely and echo the chunk back as a single token. Lets pure-URL / pure-English lines stream through cheaply.
Also adds stream_accent_chunks() to pipeline.py as a helper used by the streaming endpoint added in the next commit. Splits the input on \n then on full-width sentence terminators (。!?.): long paragraphs degrade OJAD's phrasing predictor, and parallelising across sentences caps the latency. In-flight work is bounded by a semaphore (concurrency=4) because OJAD's u-tokyo backend falls over with 30+ parallel scrapes.
main.py docstring updated to reflect /MarkAccent/stream/.
Refs #47.
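The URL placeholder round-trip described in this commit might look roughly like the sketch below. The function names `strip_urls` and `restore_urls`, the exact character ranges treated as "Japanese", and the token-surface list shape are assumptions for illustration; only the `URLPLACEHOLDER` string and the stop-character set come from the commit message.

```python
import re

PLACEHOLDER = "URLPLACEHOLDER"

# URL body stops at whitespace, the listed ASCII delimiters, or a Japanese
# character (assumed here: CJK punctuation, kana, CJK ideographs, full-width forms).
URL_RE = re.compile(
    r"""https?://[^\s,()<>\[\]"'\u3000-\u30ff\u4e00-\u9fff\uff00-\uffef]+"""
)


def strip_urls(text: str) -> tuple[str, list[str]]:
    """Swap each URL for the placeholder and remember the originals in order."""
    urls: list[str] = []

    def _swap(match: re.Match) -> str:
        urls.append(match.group(0))
        return PLACEHOLDER

    return URL_RE.sub(_swap, text), urls


def restore_urls(surfaces: list[str], urls: list[str]) -> list[str]:
    """Walk placeholders back to the original URLs, first placeholder first."""
    remaining = iter(urls)
    return [next(remaining, s) if s == PLACEHOLDER else s for s in surfaces]
```

Keeping the originals in a list rather than a dict means two identical URLs in one chunk are restored positionally, which matches the "in order" wording of the commit.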
Add a streaming variant of /MarkAccent/ that processes the input as a
sequence of (line, sentence) chunks and emits one NDJSON object per
chunk in input order. Each line carries `{"chunk": line_idx,
"subchunk": sub_idx, ...AccentResponse}` so clients can render output
incrementally while keeping document position. Underlying chunk-fanout
and concurrency limiting live in pipeline.stream_accent_chunks; the
route is a thin StreamingResponse wrapper.
Streaming benefits compound: OJAD's phrasing predictor degrades on
long inputs (a single misaligned mora cascades across the paragraph),
so per-sentence chunks both stay short enough for OJAD to handle and
fan out under the bounded semaphore.
Also adds test.sh — a small bash smoke-test helper that POSTs a sample
text to either /MarkAccent/ or /MarkFurigana/ and pretty-prints the
per-moji (surface|furigana|accent_marking_type) rows. STREAM=1 switches
to the streaming endpoint, ENDPOINT= picks which router. Useful while
iterating on overrides; not wired into CI.
.gitignore adds data/ and output/ for ad-hoc test fixtures we don't
want committed.
Refs #47.
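A minimal sketch of the fan-out the streaming commit describes: chunks are processed concurrently under a bounded semaphore but emitted in input order as NDJSON lines. The `stream_chunks` name, the `(line_idx, sub_idx, text)` tuple shape, and the `process` callback are hypothetical stand-ins for `pipeline.stream_accent_chunks` and the real per-chunk worker.

```python
import asyncio
import json
from typing import AsyncIterator, Awaitable, Callable


async def stream_chunks(
    chunks: list[tuple[int, int, str]],
    process: Callable[[str], Awaitable[dict]],
    concurrency: int = 4,
) -> AsyncIterator[str]:
    """Yield one NDJSON line per chunk, in input order."""
    sem = asyncio.Semaphore(concurrency)

    async def _run(text: str) -> dict:
        # At most `concurrency` chunks hit the upstream service at once.
        async with sem:
            return await process(text)

    # Start every chunk immediately; the semaphore does the throttling.
    tasks = [asyncio.create_task(_run(text)) for _, _, text in chunks]

    # Await in input order so document position is preserved even when a
    # later chunk finishes first.
    for (line_idx, sub_idx, _), task in zip(chunks, tasks):
        result = await task
        payload = {"chunk": line_idx, "subchunk": sub_idx, **result}
        yield json.dumps(payload, ensure_ascii=False) + "\n"
```

Awaiting tasks in input order trades a little latency on out-of-order completions for a simpler client, which never has to reorder lines.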
Replace the Yahoo Furigana HTTP path with in-process fugashi + NINJAL
UniDic 3.1.0. The migration adds three new layers inside the `api/accent/`
package plus a sentence-level chunked streaming endpoint:
* `tokenizer.py` — singleton `fugashi.Tagger`; maps UniDic features into
the existing `WordResult` shape plus new strong-mode fields
`lexical_kernel` / `lexical_kernel_alts` (parsed from `aType`).
* `preprocess.py` — pre-alignment text rewrites (URL strip, western-
grouped thousands `1,234`→`1234`, `\d×\d`→`\d/\d`), `has_japanese`
short-circuit gate, sentence splitting, readable-symbol (`2%`, `15℃`)
pre-merge.
* `postprocess.py` — rendering passes: heiban-particle accent flatten
(の/な/は/が after a 平板調 word), pure-punct furigana suppression,
English / katakana toggle handling, 助詞 furigana suppression.
* `reading_overrides.py` — moved into the package; regex overrides for
日付/N日間/20歳/曜日 plus the POS-driven `apply_accent_patches` rule
for ます / たい first-mora FALL.
`align.py` is upgraded to a Needleman-Wunsch DP over (token, OJAD-entry)
pairs with weighted edit distance, rendaku voicing fold, an OJAD-punct
guard, and a numeric tiebreaker that fixes the `19×19` 1+7-split bug.
`models.py` extends `Request` with `render_english_furigana` /
`render_katakana_furigana` toggles, adds POS metadata fields (excluded
from serialization) plus strong-mode lexical-accent fields exposed in
JSON, and drops the standalone `FuriganaResponse`.
`routes.py` exposes `/api/MarkAccent/` (collected) and
`/api/MarkAccent/stream/` (NDJSON per chunk); both share
`pipeline.build_chunks` + `pipeline.schedule_chunks` for byte-identical
per-chunk results. The standalone MarkFurigana endpoint is removed —
there is no in-process equivalent for the Yahoo Furigana service.
`main.py` drops the slowapi rate limiter, CORS, trusted-host, and
X-API-KEY middleware — the service is now expected to run behind the
parent backend on a private network. `config/settings.py` is reduced to
just `load_dotenv()` accordingly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
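The pre-alignment rewrites listed for `preprocess.py` could be sketched as follows. The regexes and function names are assumptions written for illustration, not the module's actual patterns; they cover only the thousands-separator strip, the `×` swap, and the `has_japanese` gate.

```python
import re

# "1,234" or "1,234,567", but not "1,23" or a fragment of a longer digit run.
_THOUSANDS_RE = re.compile(r"(?<![\d,])\d{1,3}(?:,\d{3})+(?![\d,])")
# A multiplication sign sitting directly between two digits.
_TIMES_RE = re.compile(r"(?<=\d)×(?=\d)")
# Hiragana, katakana, or a CJK ideograph.
_JAPANESE_RE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def strip_thousands(text: str) -> str:
    """Drop western thousands separators so the number reads as one value."""
    return _THOUSANDS_RE.sub(lambda m: m.group(0).replace(",", ""), text)


def swap_times(text: str) -> str:
    """Rewrite digit×digit as digit/digit before the pipeline runs."""
    return _TIMES_RE.sub("/", text)


def has_japanese(text: str) -> bool:
    """Gate for the short-circuit: False means tokenizer and OJAD can be skipped."""
    return _JAPANESE_RE.search(text) is not None
```

A caller would apply `strip_thousands` and `swap_times` to the text sent for alignment and keep the original string around so the `×` can be put back on the surface afterwards.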
Refresh the data-flow diagram, file-responsibilities table, and
dependency graph to reflect the new layout (tokenizer / preprocess /
postprocess / reading_overrides). Update the alignment-algorithm
section to describe the DP / `_match_cost` / voicing fold / OJAD-punct
guard / numeric tiebreaker that replaced the old greedy implementation.
Append three new sections documenting layers that didn't exist when the
README was first written:
* **Surface overrides + POS patches** — regex `OVERRIDES` list shape,
apply_furigana_overrides vs apply_accent_overrides, POS-driven
`_is_masu_auxiliary` / `_is_tai_auxiliary` predicates.
* **Postprocess passes** — the four idempotent passes that run after
align + overrides + patches, with the rationale for their order.
* **Local UniDic tokeniser** — feature → WordResult mapping,
`feat.kana` vs `feat.pron` choice, `*` null handling, Field(
exclude=True) on POS metadata.
Also drops the MarkFurigana row from the endpoint table (the endpoint
was removed in the local-UniDic migration) and updates the "Adding
endpoints / overrides" section to reference the new file names.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Snapshot the spike's investigation artefacts so future readers can
reconstruct the GO/NO-GO decision and the test cases that drove the
DP-aligner and POS-patch design:
* docs/spike-local-unidic.md — phased measurement report (verb forms,
て-form, long sentences) culminating in the GO recommendation.
* docs/spike-local-unidic-runbook.md — runbook for replaying the
spike with `uv run scripts/spike_local_unidic.py`.
* scripts/spike_local_unidic.py — end-to-end Yahoo-vs-UniDic
comparison harness against the existing OJAD pipeline.
* scripts/probe_verb_forms.py — generates verb-form coverage
matrices for the DP-aligner regression suite.
* scripts/probe_te_and_long.py — exercises te-form chains and long
sentences where OJAD's CRF was most likely to absorb kernels.
* scripts/smoke_test_partial.py — minimal in-process smoke test
against `_process_accent_chunk` for quick iteration.
These scripts still reference the pre-refactor `api.accent_marker`
monolith paths intentionally — they were the artefacts the spike
produced, and rewriting them would lose the audit trail. Rerunning
them in the new layout would require trivial import updates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fugashi's split of G2P / PSP-1000 / Wifi.7 / 12.5 into single-piece tokens lets the DP shuffle OJAD's morae onto the wrong child: `ーにピ` floating above `2` in G2P, `てん` leaking onto `5` in `12.5`, and the accent CRF collapsing on everything after `Wifi.7`. Fuse those runs into one token before alignment so each kind flows through one branch.
- tokenizer.tag_local: glue contiguous (alpha|digit) runs, bridging `-` / `_` / `.` between alpha/digit pieces via look-ahead. Letter-less runs whose joined surface matches NUMERIC_PATTERN (`12.5`, `0.5`) get a decimal merge instead. fugashi's `white_space` attribute gates the merge so `Hello world` and `API key` stay split.
- align: new `is_english_compound` free-consume branch in `_match_cost`, reordered ahead of the OJAD-punct guard so a merged acronym can swallow the `。` OJAD inserts when it normalises `.`. `_build_word_result` filters those punct entries from the rendered accent so they don't surface as ruby when the English toggle is on.
- preprocess.strip_acronym_dots_for_ojad: OJAD-only strip. OJAD's `.` → `。` normalisation collapses its prosody CRF on the rest of the sentence, so the OJAD query gets `Wifi7` while fugashi keeps the original `.` and the tokenizer merge preserves the user-visible `Wifi.7` surface.
- postprocess._is_pure_english_surface: accept `-` / `_` / `.` so the toggle wipe agrees with the aligner on which fused surfaces qualify.
- pipeline: thread the OJAD-only stripped text into `get_ojad_result`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
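The run-merging step in `tokenizer.tag_local` can be approximated with the sketch below. It works on plain `(surface, white_space)` pairs, where `white_space` is assumed to be the whitespace preceding the token, instead of real fugashi nodes. The name `merge_alnum_runs` is hypothetical, and the sketch does not distinguish the decimal merge from the general alphanumeric merge.

```python
import re

_ALNUM_RE = re.compile(r"[A-Za-z0-9]+")
_BRIDGES = {"-", "_", "."}


def merge_alnum_runs(tokens: list[tuple[str, str]]) -> list[str]:
    """Fuse adjacent ASCII alpha/digit pieces, bridging - _ . via look-ahead."""
    merged: list[str] = []
    in_run = False  # True while merged[-1] is an open alphanumeric run
    i = 0
    while i < len(tokens):
        surface, white_space = tokens[i]
        is_alnum = _ALNUM_RE.fullmatch(surface) is not None
        if in_run and not white_space:
            if is_alnum:
                merged[-1] += surface
                i += 1
                continue
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            # Bridge only when an alphanumeric piece follows with no whitespace.
            if (
                surface in _BRIDGES
                and nxt is not None
                and not nxt[1]
                and _ALNUM_RE.fullmatch(nxt[0])
            ):
                merged[-1] += surface + nxt[0]
                i += 2
                continue
        merged.append(surface)
        in_run = is_alnum
        i += 1
    return merged
```

The look-ahead keeps a sentence-final `.` out of the run, and the whitespace check is what leaves `Hello world` as two tokens.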
OJAD silently elides English when interleaved with kana (probe in spike-only scripts: Whisper inside `ふりがなWhisper`, satochin inside `深掘りライターsatochin氏`, URLPLACEHOLDER after strip_urls all come back with 0 OJAD morae). The aligner charged _FALLBACK_COST for an english_compound token taking k=0, so the cheapest DP path was to steal 1 mora from the neighbouring kana token to dodge the 3.0 penalty, paying ~1.0 edit-distance on the kana side instead. That cascade left ふりがな missing trailing な (test_1 ×7), コメント empty-spanned and falling through to the collapsed single-entry fallback (test_0), ライター missing the trailing chōon ー (test_0), and テスト missing the leading テ after a URL token (test_0).
Lower the k=0 cost to 0.0 in the english_compound branch. Spelled-out cases (`G2P` → ジーツーピー) still align correctly because forcing those katakana morae onto a neighbouring kana token costs more edit-distance than letting the english token absorb them at k≥1 for cost 0.
Verified end-to-end against all 30 fixtures: 0 under-mora anomalies remaining (previously 10 across test_0 and test_1).
Also adds scripts/run_10_tests.sh as a kept regression harness driving the full corpus via a TESTS env override.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the local fugashi + UniDic migration, the Yahoo-era scaffolding
is fully unreferenced:
- config/settings.py loaded YAHOO_API_KEY via dotenv but nothing
imports it. config/__init__.py is empty.
- .env / .env.example only carried YAHOO_API_KEY (and an unread
API_TOOLS_PORT). The application reads neither.
- scripts/spike_local_unidic.py was the Yahoo↔local comparison
spike; the comparison is the merged work itself.
- scripts/probe_*.py and scripts/smoke_test_partial.py were
spike-only debug tools.
- docs/spike-local-unidic*.md narrate work now landed.
Dockerfile drops `config` from the compileall/COPY lines.
docker-compose.yml drops the `env_file: .env` block (the
${API_TOOLS_PORT:-8000} fallback still works from shell env).
README.md trims the false "obtain a Yahoo API key" paragraph.
scripts/run_10_tests.sh stays — it's the 30-fixture regression
harness committed in the previous commit, not a spike artefact.
Verified post-prune: server reloads cleanly (HTTP 200 on a fresh
MarkAccent POST) and test_0 / test_15 / test_29 all pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four follow-on features and a doc refresh on top of the spike fix.
1. SYMBOL_READINGS table in preprocess.py, consumed by tokenizer.tag_local. Standalone symbols (#, %, @, &, +, =, $, ¥, €, ℃, °, *, ~, §, plus full-width siblings) now get their spoken katakana reading instead of an empty furigana. The aligner's edit-distance branch matches the OJAD span at cost 0 rather than refusing it; the `#病` cascade that stole one mora from the next particle (test_0 idx 1411) is gone. suppress_punct_furigana also learns to skip these surfaces so the symbol's furigana + accent survive the post-alignment scrub.
2. split_okurigana in postprocess.py populates WordResult.subword when a token mixes kanji and kana. `聞き分け` → subword=[(聞,き),(き,""),(分,わ),(け,"")]. Top-level surface, furigana, and accent are unchanged, so clients that ignore subword get the previous behaviour bit-for-bit. Irregular readings that can't be aligned against the surface kana fall back to no subword (no garbled segments). Across the 30-fixture corpus, 1045 tokens in 30/30 files gain segments.
3. New `script` request arg: hiragana (default), katakana, or romaji. convert_furigana_script in postprocess.py rewrites every furigana field (top-level + per-mora + subword) before serialisation. Internal alignment stays hiragana. Default "hiragana" also normalises per-mora morae that OJAD echoed back as katakana (e.g. `ラ`/`イ` on ライター's accent[]), so the per-mora script is now consistent across surface types.
4. README rewritten in English: covers all five live endpoints (MarkAccent + UsageQuery + DictQuery + SentenceQuery), the full MarkAccent request body with the three new fields, response shape, examples, the regression harness, and the four known UniDic-vs-OJAD reading-mismatch tokens.
Re-profiled against the 30-fixture corpus after the changes: 0 under-mora anomalies (was 0 after the spike fix), 4 over-mora cases (was 5; the `#病→と` leak is fixed by the symbol table). The remaining four over-mora cases are pre-existing UniDic context-reading mismatches (世, 本当, 他, 寺) unrelated to this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
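One way to get the `聞き分け` segmentation from item 2 is to turn the surface into a regex in which each kanji run becomes a lazy capture group and each kana run a literal. This is only a sketch under that assumption; the real `split_okurigana` may work differently. It also assumes hiragana okurigana, and it groups adjacent kanji into one segment.

```python
import re


def _is_kanji(ch: str) -> bool:
    return "\u4e00" <= ch <= "\u9fff" or ch == "々"


def split_okurigana(surface: str, furigana: str):
    """Return [(segment_surface, segment_furigana), ...] or None if unalignable."""
    # Group the surface into alternating kanji / non-kanji runs.
    runs: list[list] = []
    for ch in surface:
        kind = _is_kanji(ch)
        if runs and runs[-1][0] == kind:
            runs[-1][1] += ch
        else:
            runs.append([kind, ch])

    # Only tokens that mix kanji and kana get segments.
    if {kind for kind, _ in runs} != {True, False}:
        return None

    pattern = "".join("(.+?)" if kind else re.escape(text) for kind, text in runs)
    match = re.fullmatch(pattern, furigana)
    if match is None:
        return None  # irregular reading: fall back to no subword

    groups = iter(match.groups())
    return [(text, next(groups) if kind else "") for kind, text in runs]
```

Returning `None` on a failed match mirrors the commit's choice to emit no subword at all instead of a garbled partial split.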
The merge step was added before standalone symbols had a reading. It glued `(2, %)` into one token whose surface (`2%`) matched OJAD's phrase boundary so the パーセント morae wouldn't leak onto the digit. Side-effect: the merged token's furigana came out as `ごじゅうてんさんぱーせんと` for `50.3%`, with no way for a client to render ruby specifically over `%`.
After the SYMBOL_READINGS work in the previous commit, `%` (and its siblings `@`, `&`, `+`, `$`, `¥`, `€`, `℃`, `°`, …) already carry their spoken katakana reading. The DP aligner matches each symbol's furigana against the OJAD span at edit-distance 0, so the パーセント morae no longer leak and the merge is redundant. Removing it gives the user's preferred shape:
`50.3%とは` → [50.3|ごじゅうてんさん] [%|ぱーせんと] [と] [は]
`READABLE_COMPOUND_RE` and the `is_readable_compound` branch in align.py stay in place. Nothing wired produces a compound surface any more, but the dead branches are harmless and reading_overrides could in principle still synthesise one.
Re-profiled all 30 fixtures: row counts unchanged, under-mora anomalies 0, over-mora 4 (same UniDic context-reading mismatches as before).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`apply_furigana_toggles` was clearing both furigana AND accent on
any surface matching `_is_pure_english_surface` — but fugashi+UniDic
hand back proper Japanese readings for unit compounds whose surface
happens to look ASCII (`53mm` → みりめーとる, `33m/s` →
めーとるまいびょう, `3kg` → きろぐらむ). With `render_english_furigana`
off (default), those unit tokens came back with empty furigana and
empty accent — the user saw `53mm` "escaped" entirely.
Skip the english wipe when the token's furigana already contains
any hiragana/katakana char. UniDic only fills a kana reading when
the surface IS a recognised Japanese unit / loanword token, so
truly foreign english (`Whisper`, `G2P`, `Apple`) still has
furigana==surface (no kana) and continues to be cleared.
Verified:
- `53mm` → surface=`53mm`, furi=`53みりめーとる`,
accent=[ご,じゅ,う,さ,ん,み,り,め,ー,と,る] with marks
- `m/s` → surface=`m/s`, furi=`めーとるまいびょう`, full accent
- `Whisper`, `G2P` → still wiped (no kana in furigana)
30-fixture regression: 30/30 HTTP 200, under-mora 0, over-mora 4
(same pre-existing UniDic context-reading mismatches). 7 fixtures
gained rows where unit tokens previously were stripped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
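The wipe rule from this commit reduces to a small predicate, sketched here. The name `should_wipe_english`, the surface character class (including `/` so that `m/s` qualifies), and the kana ranges are assumptions; the actual checks live in `apply_furigana_toggles` and `_is_pure_english_surface`.

```python
import re

# Hiragana, katakana, and the long-vowel mark.
_KANA_RE = re.compile(r"[\u3041-\u3096\u30a1-\u30fa\u30fc]")
# Assumed shape of a "pure English" surface: ASCII letters, digits, - _ . /
_ENGLISH_SURFACE_RE = re.compile(r"[A-Za-z0-9\-_./]+")


def should_wipe_english(
    surface: str, furigana: str, render_english_furigana: bool = False
) -> bool:
    """True when furigana and accent should be cleared for this token."""
    if render_english_furigana:
        return False
    if not _ENGLISH_SURFACE_RE.fullmatch(surface):
        return False
    # A kana reading means UniDic recognised a unit or loanword: keep it.
    return _KANA_RE.search(furigana) is None
```

The reading itself acts as the signal, so no separate list of unit suffixes has to be maintained.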
Some inputs hit deterministic upstream quirks the local pipeline can't fix (OJAD reading `33m/s` as `さんじゅうみっめーとるまいびょう`, with a stray `みっ` from a CRF sound-change; UniDic giving a kanji its lemma reading instead of the contextual one). Rather than chase each with bespoke align/postprocess logic, give the caller a maintenance file they can grow over time.
`api/accent/user_patches.py` exposes USER_PATCHES: a dict of literal-match surface fragments to a tuple of (segment_surface, segment_furigana) pairs. `reading_overrides._user_patch_overrides` compiles those into FuriganaOverride entries appended to the existing OVERRIDES list, so both the pre-OJAD furigana pass and the post-alignment accent pass pick them up; the second pass rewrites the contour with the prescribed reading. Accent defaults to heiban via a new `_mora_seq` helper that splits the reading into actual morae (so じゅ stays one entry, not two). Power users can drop full FuriganaOverride objects into the existing OVERRIDES section for atamadaka / per-mora custom marks.
Seeded with one entry for `33m/s` as a working example. Edit the dict and re-run `./scripts/run_10_tests.sh` after each addition.
Verified: `33m/s` now comes back as [33|さんじゅうさん] [m/s|めーとるまいびょう] instead of `33|さんじゅうみっ` + `m/s|めーとるまいびょう`. 30-fixture regression: 30/30 HTTP 200, under-mora 0, over-mora 4 (same pre-existing UniDic-context mismatches, addressable by adding USER_PATCHES entries case-by-case).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend the patch schema with an optional third element per segment so
users can prescribe a non-heiban contour:
(surf, furi) → heiban (default)
(surf, furi, "heiban") → heiban (explicit)
(surf, furi, "atamadaka") → first-mora FALL, rest LOW
(surf, furi, "low") → all-LOW
(surf, furi, (0, 1, 2)) → explicit per-mora types
The shape names live in `_accent_from_spec` in reading_overrides.py;
unknown specs warn and fall back to heiban. `_split_morae` is now
factored out so both `_mora_seq` and the new helper share the same
小さな仮名-attach mora splitter.
Seeded three patches for the pre-existing UniDic-vs-OJAD context-
reading mismatches in the 30-fixture corpus:
- `本当の` → ほんとう / の (heiban)
- `他の` → ほか (atamadaka) / の
- `世にも` → よ / に / も (heiban; demonstrates flatten-after-heiban
naturally drops the trailing に to LOW)
Re-profile: under-mora 0 (unchanged), **over-mora 4 → 1**. The
remaining `寺` case (test_16, after `永昌寺という`) involves a
compound-boundary mis-tokenisation, not addressable by a simple
literal patch — left for a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous schema accepted 2-tuples (heiban default) plus three named shapes ("heiban" / "atamadaka" / "low") as the accent_spec. Convenience came at the cost of one rule per shape and a wall of docs explaining which shape maps to what. Drop all of that: every segment is now exactly `(surface, furigana, accent_ints)` with the int tuple required and one entry per mora. The shapes are trivially expressible as tuples:
heiban → (1, 1, 1, ...)
atamadaka → (2, 0, 0, ...)
low → (0, 0, 0, ...)
`_accent_from_spec` now returns `None` on any malformed spec and the caller skips the whole patch entry (no per-segment fallback). All four seeded patches are rewritten in the strict form.
Re-profile: under-mora 0, over-mora 1 (same `寺` compound boundary case remains).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
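A minimal sketch of how the strict 3-tuple schema could be validated, assuming a mora splitter that attaches small kana to the preceding character. `split_morae` and `validate_segment` are hypothetical names standing in for `_split_morae` and `_accent_from_spec`, and the set of allowed accent values (0, 1, 2) is inferred from the shapes listed above.

```python
# Small kana that attach to the preceding character to form one mora.
# The sokuon っ and the long mark ー are assumed to count as their own morae.
_SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")
_ALLOWED_TYPES = {0, 1, 2}


def split_morae(reading: str) -> list[str]:
    """Split a kana reading into morae (じゅ stays one entry)."""
    morae: list[str] = []
    for ch in reading:
        if ch in _SMALL_KANA and morae:
            morae[-1] += ch
        else:
            morae.append(ch)
    return morae


def validate_segment(segment):
    """Return [(mora, accent_int), ...] or None when the spec is malformed."""
    if not (isinstance(segment, tuple) and len(segment) == 3):
        return None
    _surface, furigana, accent_ints = segment
    if not isinstance(accent_ints, tuple):
        return None
    if not all(isinstance(a, int) and a in _ALLOWED_TYPES for a in accent_ints):
        return None
    morae = split_morae(furigana)
    if len(morae) != len(accent_ints):
        return None  # one accent entry per mora is required
    return list(zip(morae, accent_ints))
```

A caller following the commit's rule would drop the entire patch entry as soon as any one of its segments validates to `None`.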
`_age_overrides` merges `20歳` into a single WordResult with the
prescribed furigana `はたち` (3 morae), but OJAD still pronounces
the same surface as `にじゅっさい` (5 morae). The DP aligner's
kana branch couldn't grant the merged token 5 morae cheaply (the
edit-distance between `はたち` and `にじゅっさい` is huge), so it
allocated only 3 morae and the leftover `さい` cascaded onto the
following kana tokens — `20歳の私達へ` ended up with `の` getting
acc=[さ] and `私` getting acc=[い,の,わ,た,し].
Override-merged tokens carry no UniDic backing (both `base` and
`pos` are None — `ReplacementToken` doesn't set MA metadata).
Detect that combination in `_match_cost` and give the same
free-consume treatment as numeric / readable_compound: k=0 returns
_FALLBACK_COST so the DP prefers absorption, k≥1 returns 0 up to
a generous upper. `apply_accent_overrides` rewrites the accent
post-align so whatever DP picked up from OJAD is discarded.
Verified `20歳の私達へ`:
20歳 → はたち [(は,2),(た,0),(ち,0)]
の → の [(の,0)]
私 → わたくし [(わ,0),(た,1),(し,2)]
達 → たち [(た,0),(ち,0)]
へ → へ [(へ,0)]
30-fixture regression: 30/30 HTTP 200, under-mora 0, over-mora 1
(same `寺` compound boundary case).
Also refreshes `api/accent/README.md`:
- Adds `user_patches.py` to file map + new section documenting
the strict 3-tuple schema and accent_ints shapes.
- Documents the new synthesized branch in `_match_cost`.
- Adds Request toggle table (render_english/katakana_furigana,
script) and the unit-compound exception for english toggle.
- Adds `split_okurigana` + `convert_furigana_script` to the
postprocess pass list and updates the data-flow diagram.
- Removes references to merge_readable_symbol_compounds (gone
since the SYMBOL_READINGS refactor).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously `apply_furigana_toggles` cleared the top-level `furigana` on pure-katakana tokens when `render_katakana_furigana=False`, but left every `AccentInfo.furigana` populated (with hiragana morae). Clients that draw ruby from the per-mora field rendered hiragana copies (`ふ・ら・ん・つ`) on top of katakana surfaces (`フランツ`) despite the toggle saying "no furigana" — the user-visible symptom on inputs like `フランツ・ヨーゼフ・ハイドン` was katakana names gaining unwanted ruby. Clear every `AccentInfo.furigana` to `""` for those tokens while keeping `accent_marking_type` and `length` intact, so clients that draw pitch overlay against the surface chars can still do so (length-aware iteration handles small kana like `ァ` / `ェ`). `render_katakana_furigana=True` is unaffected — both top-level and per-mora furigana flow through normally. 30-fixture regression: 30/30 HTTP 200, anomalies unchanged (one false-positive in the heuristic dropped because the cleared per-mora field stops triggering the "collapsed entry" check). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Full translation of the package-level documentation from Traditional Chinese to English. Structure and section ordering preserved; same mermaid data-flow diagram, same tables. Also folds in the changes since the last refresh:
- Request toggle table documents the per-mora-furigana clear for the katakana toggle (the フランツ/Franz ruby-on-katakana fix).
- _match_cost branches list now includes the synthesized free-consume rule (override-merged 20歳 → はたち) alongside the english-compound k=0=0 rule.
- Postprocess pass list calls out unit-compound exemption from the english toggle wipe (53mm, 33m/s, 3kg keep their reading).
- User-patches section uses the strict 3-tuple schema.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `unidic` pip package ships the loader but not the ~770MB dicdir, so fugashi.Tagger() failed at runtime (missing mecabrc) and /api/MarkAccent/ returned 500. Run `unidic download` in the builder stage; the venv copy into the final image carries the dict along. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The version-selectable dict script (01fcd71) switched the UniDic download to curl, but python:3.11-slim ships without it — the docker build died with exit 127 at the download step. Add curl next to unzip in the builder-stage apt install (multi-stage, so the runtime image is unchanged). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The collected endpoint rebuilds AccentResponse from per-chunk results and silently dropped the new `warning` field (#60), so OJAD-degraded responses looked like full results. Keep the first chunk warning, mirroring the first_error convention. The stream endpoint already passes it through via model_dump(). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The global+if-None lazy init let two concurrent first requests each see _TAGGER as None and build their own fugashi.Tagger(), reloading the ~1.3GB UniDic dictionary twice (raised in PR #53 review). functools.lru_cache(maxsize=1) makes the lazy init atomic so only one tagger is ever constructed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
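A sketch of the cached lazy init, with a stand-in class in place of `fugashi.Tagger` so it runs without the dictionary. The Python `functools` documentation notes that a cached function may still be called more than once if a second thread calls it before the first call has completed, so this sketch adds a lock as an extra guard. Whether the service needs that depends on whether the tagger can first be touched from several threads. All names here are hypothetical.

```python
import threading
from functools import lru_cache


class FakeTagger:
    """Stand-in for fugashi.Tagger; counts how many times it is built."""

    instances = 0

    def __init__(self) -> None:
        FakeTagger.instances += 1


_init_lock = threading.Lock()


@lru_cache(maxsize=1)
def _build_tagger() -> FakeTagger:
    # Expensive in the real service: loads the UniDic dictionary.
    return FakeTagger()


def get_tagger() -> FakeTagger:
    """Return the process-wide tagger, constructing it at most once."""
    # The lock serialises first-time construction across threads; after
    # that the cached instance is returned immediately.
    with _init_lock:
        return _build_tagger()
```

Taking the lock on every call costs a little; if that matters, the tagger could instead be built eagerly at application startup.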
A leftover '# extra う onto the following の' line was sitting between the all-LOW and nakadaka rows of the accent-tuple table in the module docstring (flagged in PR #53 review). Remove it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
f-strings build the message eagerly even when the debug level is disabled; pass the value as a lazy %-arg instead (PR #53 review). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
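The difference can be seen in a small sketch using only the standard `logging` module. The logger name and the `Expensive` helper are made up for the demonstration.

```python
import logging

logger = logging.getLogger("accent.align.demo")


class Expensive:
    """Object whose string form is costly; counts how often it is rendered."""

    calls = 0

    def __str__(self) -> str:
        Expensive.calls += 1
        return "alignment dump"


def log_eager(value: Expensive) -> None:
    # The f-string renders `value` before logging checks the level.
    logger.debug(f"alignment result: {value}")


def log_lazy(value: Expensive) -> None:
    # The %-style argument is only rendered if a handler emits the record.
    logger.debug("alignment result: %s", value)
```

With the logger at INFO, `log_eager` still pays for `__str__` while `log_lazy` does not, which is the saving the commit is after.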
schedule_chunks created detached asyncio tasks that kept scraping OJAD even after the client went away (PR #53 review): on the streaming endpoint a disconnect just stopped consuming the generator, and on the collected endpoint a cancelled handler orphaned its tasks. Add a shared cancel_pending helper and call it from a finally in both endpoints, so a disconnect (GeneratorExit into the stream, or the collected handler being cancelled) tears down any still-pending chunk. A TaskGroup would scope the tasks automatically, but async with TaskGroup() inside the streaming async generator wraps the aclose() GeneratorExit into a BaseExceptionGroup, so explicit cancellation is the only shape that closes the stream cleanly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
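The explicit-cancellation shape described above might look like the following sketch. `cancel_pending` matches the helper name in the commit message, but its body, and the `stream_results` generator around it, are assumptions written for illustration.

```python
import asyncio
from typing import AsyncIterator


async def cancel_pending(tasks: list[asyncio.Task]) -> None:
    """Cancel every unfinished task and wait for the cancellations to land."""
    pending = [t for t in tasks if not t.done()]
    for task in pending:
        task.cancel()
    # return_exceptions=True swallows the resulting CancelledError objects.
    await asyncio.gather(*pending, return_exceptions=True)


async def stream_results(tasks: list[asyncio.Task]) -> AsyncIterator[object]:
    """Yield task results in order; tear down leftovers when the consumer stops."""
    try:
        for task in tasks:
            yield await task
    finally:
        # Runs on normal completion, on an error, and when aclose() throws
        # GeneratorExit into the generator after a client disconnect.
        await cancel_pending(tasks)
```

Awaiting inside the `finally` is fine as long as the generator does not yield again, so the stream closes cleanly without a `TaskGroup`.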
🛡️ PR Quality Check Summary
✅ PR Title: Passed (Length: 47/75, Format: OK).
🎉 All checks passed!
Purpose
Closes #45, Closes #48, Closes #50, Closes #57, Closes #58.
Supersedes (now-closed) PRs #47 and #49.
Replaces #51 (same branch, renamed `spike/local-unidic` → `feat/local-unidic`; the rename closed #51, so this is its continuation).
Issue #50 was explicitly framed as an alternative architecture to #48 ("two architectures targeting the same goal. Production will likely pick ONE"). This PR is the GO outcome: local fugashi + NINJAL UniDic CWJ 2025-12-31 in-process replaces the Yahoo MA HTTP round-trip, while keeping OJAD for surface-level phrase pitch and keeping every POS-driven rule originally introduced in PR #49.
This is the single, complete migration PR. It targets `main` directly and folds in the foundation that was previously proposed as the standalone PR #47 (regex reading-override layer, Needleman-Wunsch DP aligner, `/MarkAccent/stream/` NDJSON endpoint). Since the Yahoo-backed intermediate from #47 is a stepping stone that won't ship independently, #47 was closed and its commits are the base of this branch rather than a separate merge.
PR structure (19 commits, base `ci/cd-ghcr` → `main`)
Stacked on top of PR #55 (`ci/cd-ghcr` → `main`), which carries the generic CI/CD basework (GHCR workflow, Node 24 bumps, non-root Dockerfile). Once #55 merges, this PR's base auto-retargets back to `main` and the diff stays the same: only the foundation + UniDic + UniDic-specific Docker tweak remain here.
`ci/cd-ghcr` → foundation (3) → UniDic migration (15) → UniDic Docker tweak (1):
- Foundation (was #47): Needleman-Wunsch DP aligner (+ rendaku fold), regex reading-override layer + URL/non-JP preprocessing, and the `/MarkAccent/stream/` NDJSON endpoint.
- UniDic migration (#50): swap Yahoo MA → fugashi+UniDic, drop Yahoo config, add tokenizer / preprocess / postprocess / user_patches modules, UniDic strong-mode fields, English README.
- UniDic Docker tweak: `scripts/download_unidic.sh` at build time (depends on the `unidic` pip dep that lands in this PR, which is why it can't live in #55).
Architecture
    flowchart LR
        text([text input])
        fugashi["fugashi (MeCab) + UniDic CWJ 2025-12-31<br/>surface · kana · lemma<br/>pos · conjugation · aType"]
        ojad["OJAD scrape (suzukikun)<br/>per-mora surface pitch contour"]
        align["DP alignment<br/><code>align_accent</code>"]
        patches["<code>apply_accent_patches</code><br/>POS-driven rules"]
        out([WordAccentResult])
        text --> fugashi
        text --> ojad
        fugashi -- tokens + POS --> align
        ojad -- per-mora pitch --> align
        align --> patches
        fugashi -. POS metadata .-> patches
        patches --> out
Yahoo MA HTTP endpoint fully removed (`api/accent/furigana.py` and the `/MarkFurigana/` route deleted). Everything #49 introduced (POS-driven `apply_accent_patches`, `pos_match` override predicate, 5 POS columns on `WordResult` / `WordAccentResult`) carries over to UniDic identically: same UniDic schema, only the upstream changed.
Package layout
- `api/accent/tokenizer.py`
- `api/accent/preprocess.py`
- `api/accent/postprocess.py`
- `api/accent/reading_overrides.py` + `api/accent/user_patches.py`
- `api/accent/models.py`
- … (`_build_chunks`, `_schedule_chunks`) → `api/accent/pipeline.py`
Local UniDic replaces Yahoo MA
- `_fetch_yahoo_raw` → `_tag_local` using singleton `fugashi.Tagger()` + UniDic CWJ 2025-12-31
- … → `pos`, feat.pos2 → `pos1`, feat.cType → `conjugation_type`, feat.cForm → `conjugation_form`, feat.lemma → `base` (with `-gloss` suffix stripped), feat.aType → `lexical_kernel` (parsed for multi-reading via `_parse_atype`)
- `feat.kana` over `feat.pron`: UniDic stores 忙しい as kana=イソガシイ (matches OJAD's ortho-kana), while pron=イソガシー with chōonpu would never align
Strong-mode fields
- `lexical_kernel: int | None`: aType primary (0=heiban, N≥1=kernel on mora N)
- `lexical_kernel_alts: list[int] | None`: multi-reading alternates (e.g. aType="2,0" → [2, 0])
- `kernel_absorbed: bool`: UniDic says kernel exists but OJAD has no FALL in this word's range (connected-speech sandhi case, e.g. 忙しい inside お忙しい中)
Dictionary variant selection
NINJAL publishes two UniDic variants, and the download script supports both:
- `cwj-2025-12-31`
- `csj-2025-12-31`
CWJ is the default because this service primarily processes written input. CSJ may be preferable when processing conversational or transcribed speech. The Dockerfile defaults to CWJ; override by changing the script argument in the builder stage.
Hide internal MA metadata from response
`base` / `pos` / `pos1` / `conjugation_type` / `conjugation_form` marked `Field(exclude=True)`: kept on the model for in-pipeline use by `apply_accent_patches`, excluded from JSON serialization. Cleaner client contract; nothing functional changed.
Unify chunking between two endpoints
`_build_chunks` / `_schedule_chunks` shared by `/MarkAccent/` and `/MarkAccent/stream/`. Both endpoints emit byte-identical per-chunk word lists; only delivery shape differs (collected vs streamed NDJSON).
Rendering polish
- `render_katakana_furigana=False` keeps `accent` (only clears `furigana`)
- … (`pos=助詞`) tokens clear their redundant top-level `furigana`
- `\d × \d` swapped → `\d/\d` pre-pipeline, `×` restored on surface post-alignment: … `19×19` → 1919 (千九百十九); swap forces independent reading per number
- `_match_cost` adds tiny per-empty-OJAD-entry cost: … `19/19` as 1+7 morae instead of 4+4 because all cost-0 ties picked an arbitrary path
- … (`user_patches.py`) with explicit per-mora int tuples
Verified locally
- `ruff check`, `ruff format --check` pass (17 files); `import main` + all accent submodules import clean
- `学校に行きます。今日は寒いですね。19×19の格子。`: heiban-particle flatten (に stays 1, は/が/の/な flatten after heiban), 助詞 furigana cleared, 19×19 splits 4+0+4 not 1+7
- … `base` / `pos` / `pos1` / `conjugation_type` / `conjugation_form` (`Field(exclude=True)`)
- `/MarkAccent/stream/` returns NDJSON with `{chunk, subchunk, status, result, error}`
Disambiguation probe
All 11 cases from #49's table still pass: POS gates fire identically because the UniDic schema is the same upstream of the rule layer.
UniDic Docker tweak (the only Docker/CI commit still in this PR)
The `unidic` pip package ships only the loader, not the ~1.3GB `dicdir`, so `fugashi.Tagger()` would fail at runtime without the dictionary. `Dockerfile` now runs `scripts/download_unidic.sh` in the builder stage as its own cache layer → the image is self-contained, no runtime download.
The CD workflow, Node 24 bumps, and non-root runtime are in #55 (the base of this stacked PR).
Out of scope
Notes
🤖 Generated with Claude Code