refactor: absorb 3 reranker heuristics into compile-time dictionary costs #209
Merged
…y costs

Move two reranker heuristics into compile-time dictionary cost adjustments:
- person_name (role 6): +2000 cost offset at compile time
- pronoun (role 5): -3500 cost offset at compile time

This eliminates two post-hoc reranker passes, so the Viterbi search sees more accurate costs during beam search. 216,105 entries adjusted.

Changes:
- dict compile: add --id-def option for role-based cost adjustment
- reranker: remove pronoun_bonus() and person_name_penalty()
- settings: remove pronoun_cost_bonus and person_name_penalty params
- explain: remove pronoun_bonus from cost breakdown display

Accuracy: 61/61 pass (4 skip), history 6/6; identical to baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move non-independent kanji penalty to compile-time: entries with role=NON_INDEPENDENT and kanji surface get +1500 cost offset. 889 additional entries adjusted (216,994 total with person_name/pronoun).

Removes non_independent_kanji_penalty from reranker, settings, and explain output. te_form_kanji_penalty remains (context-dependent, requires role expansion to absorb).

Accuracy: 61/61 pass (4 skip), history 6/6; identical to baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
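The role-based offsets described in these two commits can be pictured with a minimal sketch. This is not the code from dict_ops.rs: the function names, the `i16` cost type, the saturating clamp, the "surface contains at least one kanji" test, and the numeric value used for NON_INDEPENDENT are all assumptions made for illustration. Only the role numbers 5 and 6 and the three offset values come from the commit messages.

```rust
// Role ids. 5 (pronoun) and 6 (person_name) are stated in the commit
// messages; the value for NON_INDEPENDENT is NOT given in the PR, so 4
// is a placeholder chosen for this sketch.
const ROLE_NON_INDEPENDENT: u8 = 4; // assumption
const ROLE_PRONOUN: u8 = 5;
const ROLE_PERSON_NAME: u8 = 6;

/// Rough kanji check: CJK Unified Ideographs block only (an assumption;
/// the real check may cover more ranges).
fn is_kanji(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

/// Offset applied once at dictionary compile time, keyed on the entry's role.
fn cost_offset(role: u8, surface: &str) -> i32 {
    match role {
        ROLE_PERSON_NAME => 2000,
        ROLE_PRONOUN => -3500,
        // Assumes "kanji surface" means "contains at least one kanji".
        ROLE_NON_INDEPENDENT if surface.chars().any(is_kanji) => 1500,
        _ => 0,
    }
}

/// Adds the offset in i32 and clamps back into i16 so a large offset
/// cannot wrap around (the clamp is an assumption of this sketch).
fn adjust_cost(cost: i16, role: u8, surface: &str) -> i16 {
    let adjusted = i32::from(cost) + cost_offset(role, surface);
    adjusted.clamp(i32::from(i16::MIN), i32::from(i16::MAX)) as i16
}
```

Because the offset is baked into the stored cost, the lattice search compares adjusted costs directly and no per-candidate pass is needed afterwards for these three cases.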
Pull request overview
This PR moves 3 reranker heuristics (person-name penalty, pronoun bonus, non-independent-kanji penalty) into dictionary compile-time cost offsets using Mozc id.def role information, reducing runtime reranker complexity while keeping the remaining context-dependent heuristics in the reranker.
Changes:
- Add `dictool compile --id-def` support and apply compile-time cost offsets based on morpheme roles.
- Remove the corresponding 3 settings knobs and runtime reranker/explain accounting for those heuristics.
- Update the `mise` dictionary build task to pass `--id-def`.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| mise.toml | Update dictionary build task to pass --id-def to dictool compile. |
| engine/crates/lex-core/src/settings.rs | Remove 3 reranker settings parameters and validation/tests for them. |
| engine/crates/lex-core/src/default_settings.toml | Remove the 3 deleted reranker parameters from defaults. |
| engine/crates/lex-core/src/converter/reranker.rs | Remove 3 heuristics from runtime reranking and delete associated tests. |
| engine/crates/lex-core/src/converter/explain.rs | Remove display/breakdown fields for the deleted heuristics. |
| engine/crates/lex-cli/src/commands/dict_ops.rs | Add compile-time cost adjustment logic driven by id.def roles. |
| engine/crates/lex-cli/src/bin/dictool.rs | Add --id-def option to the compile subcommand and plumb through. |
Address Copilot review:
- Auto-detect id.def in input_dir when --id-def is not specified
- Validate left_id against roles table size instead of silently defaulting to role 0
- Add id.def to mise.toml dict-mozc sources for rebuild tracking

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
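The first two review fixes above can be sketched as follows. The helper names `resolve_id_def` and `role_for`, the `u8`/`u16` types, and the error message are hypothetical; the PR only states the behavior (fall back to id.def inside input_dir, and reject an out-of-range left_id rather than treating it as role 0).

```rust
use std::path::{Path, PathBuf};

/// Prefer an explicit --id-def path; otherwise look for `id.def`
/// inside the input directory. Returns None if neither is available.
fn resolve_id_def(explicit: Option<PathBuf>, input_dir: &Path) -> Option<PathBuf> {
    explicit.or_else(|| {
        let candidate = input_dir.join("id.def");
        candidate.is_file().then_some(candidate)
    })
}

/// Look up the role for a left_id, returning an error for ids outside
/// the table instead of silently falling back to role 0.
fn role_for(roles: &[u8], left_id: u16) -> Result<u8, String> {
    roles.get(usize::from(left_id)).copied().ok_or_else(|| {
        format!(
            "left_id {} is out of range for a roles table of size {}",
            left_id,
            roles.len()
        )
    })
}
```

Returning an error here surfaces a mismatch between the dictionary and id.def at build time, where a silent default would instead skip the intended cost offset for the affected entries.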
- Update stale reranker comment (removed heuristics still mentioned)
- Use PathBuf instead of String for id_def resolution
- Remove dead code: is_non_independent(), is_pronoun(), is_person_name() on ConnectionMatrix (no callers after compile-time cost absorption)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
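Moves three of the reranker's five heuristics into dictionary compile-time cost adjustments, reducing runtime code.

Adds an `--id-def` option to `dictool compile`, which uses the role information in id.def to adjust costs. 216,994 entries are affected. The remaining two heuristics (`te_form_kanji_penalty`, `single_char_kanji_penalty`) stay in the reranker because they are context-dependent.

Changes
- `dict_ops.rs`: compile-time cost adjustment (`--id-def` option)
- `reranker.rs`: remove 3 functions (`pronoun_bonus`, `person_name_penalty`, `non_independent_kanji_penalty`)
- `settings.rs` / `default_settings.toml`: remove the 3 corresponding parameters
- `explain.rs`: remove the deleted penalties from the cost breakdown display
- `mise.toml`: add `--id-def` to `dictool compile`

Test plan
- `cargo fmt` / `clippy` / `test` all pass (323 + 68 + 20 tests)
- `mise run accuracy`: 61/61 pass (4 skip), identical to baseline
- `mise run accuracy-history`: 6/6 pass, identical to baseline

🤖 Generated with Claude Code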