fix: improve structure_cost_filter to keep valid multi-segment paths #205
今ですね (今|です|ね) was being filtered out because:

1. Single-segment paths (sc=0) set min_sc too low
2. Prefix POS transitions (e.g. 今[prefix]→デスネ, conn=256) dragged the baseline down further

Changes:

- Raise structure_cost_filter from 4000 to 6000
- Impute single-segment paths with prefix_floor for min_sc computation so 0-transition paths don't set an artificially low baseline
- Floor prefix POS transitions at filter/2 to prevent anomalously cheap connections from skewing the threshold
- Cap script_cost scale at min(reading_chars, 2) to reduce excessive kanji bonuses on long compound readings
- Add いまですね regression test case (accuracy: 61/61)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
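The filtering rule described in this commit can be sketched roughly as follows. This is a simplified illustration and not the code in `reranker.rs`: the `Path` type, the function names, and the drop rule `sc > min_sc + filter` are assumptions made for the example, and the per-transition cap is left out. Only the floor at `filter / 2`, the single-segment imputation, and the values 4000, 6000, and 256 come from the commit message.

```rust
/// Hypothetical, simplified candidate path: one entry per segment boundary.
/// Each entry is (connection cost, whether the segment on the left is a prefix POS).
struct Path {
    transitions: Vec<(i64, bool)>,
}

/// Sum of transition costs, with transitions out of a prefix raised to `prefix_floor`.
fn structure_cost(path: &Path, prefix_floor: i64) -> i64 {
    path.transitions
        .iter()
        .map(|&(cost, from_prefix)| if from_prefix { cost.max(prefix_floor) } else { cost })
        .sum()
}

/// Returns one flag per path: true if the path survives the hard filter.
/// Assumed rule: a path is dropped when its cost exceeds `min_sc + filter`.
fn keep_mask(paths: &[Path], filter: i64) -> Vec<bool> {
    let prefix_floor = filter / 2;
    let costs: Vec<i64> = paths.iter().map(|p| structure_cost(p, prefix_floor)).collect();
    // Single-segment paths have no transitions (sc = 0). For the baseline only,
    // impute the floor so they cannot pull `min_sc` down to zero.
    let min_sc = paths
        .iter()
        .zip(&costs)
        .map(|(p, &sc)| if p.transitions.is_empty() { prefix_floor } else { sc })
        .min()
        .unwrap_or(0);
    costs.iter().map(|&sc| sc <= min_sc + filter).collect()
}
```

With an illustrative set of three paths (a single-segment path, a path starting with a prefix transition of cost 256, and a three-segment path with made-up costs 4000 and 4500), the three-segment path survives at `filter = 6000` but is dropped at `filter = 4000` under these assumptions.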
Pull request overview
This PR adjusts reranker heuristics and default settings to prevent valid multi-segment candidates from being dropped by the structure-cost hard filter (notably for inputs like 「いまですね」→「今ですね」), and updates tests/corpus to lock in the regression fix.
Changes:
- Raise `structure_cost_filter` default from 4000 → 6000 and update parsing tests/fixtures accordingly.
- Modify structure-cost computation/filtering to apply a prefix-transition floor and to avoid single-segment paths setting an overly low `min_sc`.
- Reduce mixed/pure-kanji script bonus scaling cap (reading length cap 3 → 2) and update reranker unit tests.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| engine/testcorpus/accuracy-corpus.toml | Adds a regression corpus case ensuring 「いまですね」 converts to 「今ですね」. |
| engine/crates/lex-core/src/settings.rs | Updates settings parsing tests and embedded TOML examples for the new structure_cost_filter default. |
| engine/crates/lex-core/src/default_settings.toml | Bumps default structure_cost_filter to 6000. |
| engine/crates/lex-core/src/converter/tests/reranker.rs | Updates unit tests to reflect new structure-cost filtering logic and script-cost scaling. |
| engine/crates/lex-core/src/converter/reranker.rs | Implements prefix-transition flooring and single-segment baseline imputation in structure-cost filtering. |
| engine/crates/lex-core/src/converter/cost.rs | Changes script-cost scaling cap from 3 to 2. |
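The effect of lowering the script-cost scaling cap can be illustrated with a minimal sketch. The function names and the idea that the bonus is a per-unit value multiplied by the capped reading length are assumptions for this example; the actual formula in `cost.rs` is not shown in this PR page. Only the cap change from 3 to 2 and the `min(reading_chars, 2)` form come from the source.

```rust
/// Scale factor for the script bonus: reading length in chars, capped at `cap`.
fn script_scale(reading: &str, cap: usize) -> i64 {
    reading.chars().count().min(cap) as i64
}

/// Hypothetical bonus: a per-unit value (negative = cheaper) times the capped scale.
fn script_bonus(reading: &str, per_unit: i64, cap: usize) -> i64 {
    per_unit * script_scale(reading, cap)
}
```

Under this model, any reading of three or more characters gets a smaller bonus with a cap of 2 than with a cap of 3, while one- and two-character readings are unaffected.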
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Updates the lex-core reranking heuristics to avoid incorrectly filtering out valid multi-segment conversions (notably around prefix-driven low transition costs), and adjusts scoring defaults/tests accordingly.
Changes:
- Increase `structure_cost_filter` default from 4000 → 6000 (settings + default TOML).
- Add a “prefix transition floor” and single-segment baseline imputation to the reranker’s hard structure-cost filter.
- Reduce mixed/pure-kanji script-cost scaling cap (reading length cap 3 → 2) and update affected tests; add an accuracy-corpus regression case.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| engine/testcorpus/accuracy-corpus.toml | Adds a regression case for “いまですね” → “今ですね”. |
| engine/crates/lex-core/src/settings.rs | Updates default-setting expectations and embedded TOML samples for structure_cost_filter = 6000. |
| engine/crates/lex-core/src/default_settings.toml | Bumps default structure_cost_filter to 6000. |
| engine/crates/lex-core/src/converter/tests/reranker.rs | Updates test expectations for new script-cost scaling and revised structure-cost filtering behavior. |
| engine/crates/lex-core/src/converter/reranker.rs | Implements prefix transition floor + adjusted min baseline to prevent over-filtering correct paths. |
| engine/crates/lex-core/src/converter/cost.rs | Caps script-cost scaling at 2 reading chars instead of 3. |
Verify that the is_prefix() floor logic is exercised by using from_text_with_roles to build a ConnectionMatrix with a prefix POS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Adjusts lex-core reranking heuristics to avoid over-filtering correct multi-segment paths (notably cases involving prefixes) and to moderate script-based bonuses, with corresponding updates to defaults and regression coverage.
Changes:
- Raised `structure_cost_filter` default from 4000 → 6000 (settings + tests).
- Updated reranker structure-cost filtering with a prefix-transition floor and single-segment baseline imputation.
- Reduced `script_cost` reading-length scaling cap (3 → 2) and updated reranker tests + accuracy corpus.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| engine/testcorpus/accuracy-corpus.toml | Adds a regression case for “いまですね” → “今ですね”. |
| engine/crates/lex-core/src/settings.rs | Updates settings parsing tests and embedded TOML fixtures for new filter default. |
| engine/crates/lex-core/src/default_settings.toml | Bumps default structure_cost_filter to 6000. |
| engine/crates/lex-core/src/converter/tests/reranker.rs | Updates expected reranker behavior and adds a new prefix-floor-related test. |
| engine/crates/lex-core/src/converter/reranker.rs | Implements prefix-transition flooring + revised min_sc baseline computation. |
| engine/crates/lex-core/src/converter/cost.rs | Caps script-cost scaling at 2 reading chars (was 3). |
- Clamp prefix_floor to min(filter/2, cap) so the floor remains effective when structure_cost_transition_cap is lower than the floor.
- Rewrite test_prefix_floor_prevents_low_baseline so that path B would be dropped without the floor but survives with it, ensuring the test actually validates the flooring logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
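A minimal sketch of the clamp described in this commit, assuming integer costs. The function names and the order of operations (cap first, then floor) are assumptions for illustration; only `min(filter/2, cap)` comes from the commit message.

```rust
/// Floor applied to transitions out of a prefix POS, clamped so that it
/// never exceeds the per-transition cap.
fn prefix_floor(filter: i64, transition_cap: i64) -> i64 {
    (filter / 2).min(transition_cap)
}

/// Cost of a transition out of a prefix POS.
/// Assumed order: apply the cap to the raw cost, then raise it to the floor.
fn prefix_transition_cost(raw: i64, filter: i64, transition_cap: i64) -> i64 {
    raw.min(transition_cap).max(prefix_floor(filter, transition_cap))
}
```

Because the floor is clamped to the cap, the result always lies between the floor and the cap, regardless of which of `filter / 2` and `transition_cap` is smaller.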
Pull request overview
Adjusts reranker heuristics and defaults to avoid over-filtering correct multi-segment candidates (notably around prefix transitions), and updates regression coverage to lock in the behavior.
Changes:
- Raise `structure_cost_filter` default from `4000` to `6000` across settings + fixtures.
- Add prefix-transition “floor” logic and min-baseline imputation to the reranker’s hard structure-cost filter, plus targeted tests.
- Cap `script_cost` scaling at 2 reading chars (from 3) and update affected test expectations; add an accuracy corpus regression case.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| engine/testcorpus/accuracy-corpus.toml | Adds a regression case for “いまですね” → “今ですね”. |
| engine/crates/lex-core/src/settings.rs | Updates default/fixture settings and assertions for new filter value. |
| engine/crates/lex-core/src/default_settings.toml | Bumps structure_cost_filter to 6000. |
| engine/crates/lex-core/src/converter/tests/reranker.rs | Updates filter-related test scenarios and adds a new prefix-floor regression test. |
| engine/crates/lex-core/src/converter/reranker.rs | Implements prefix-transition floor + single-segment baseline imputation for structure-cost filtering. |
| engine/crates/lex-core/src/converter/cost.rs | Changes script_cost scaling cap from 3 to 2. |
The transition cost is conn_cost(prev.right_id, next.left_id), so the prefix check should use right_id (the outgoing POS) rather than left_id (the incoming POS) of the previous segment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
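The point of this fix can be shown with a small sketch. The `Segment` struct, the closure-based `conn_cost` and `is_prefix` parameters, and the function name are hypothetical stand-ins, not the lex-core API; the only facts taken from the commit are that the transition cost is looked up with `prev.right_id` and `next.left_id`, and that the prefix check uses `prev.right_id`.

```rust
/// Hypothetical segment carrying only its POS connection ids.
struct Segment {
    left_id: u16,
    right_id: u16,
}

/// Transition cost between two adjacent segments, floored when the
/// previous segment's outgoing POS (`right_id`) is a prefix.
fn floored_transition(
    prev: &Segment,
    next: &Segment,
    conn_cost: impl Fn(u16, u16) -> i64,
    is_prefix: impl Fn(u16) -> bool,
    prefix_floor: i64,
) -> i64 {
    let cost = conn_cost(prev.right_id, next.left_id);
    // Check the same id that feeds the cost lookup, not `prev.left_id`.
    if is_prefix(prev.right_id) {
        cost.max(prefix_floor)
    } else {
        cost
    }
}
```

Checking `prev.left_id` instead would miss segments whose incoming and outgoing POS ids differ, and could apply the floor to transitions that do not actually leave a prefix.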
Pull request overview
This PR adjusts the reranker’s heuristics to avoid incorrect hard-filter drops caused by very low prefix transition costs, and updates defaults/tests/corpus to reflect the new behavior.
Changes:
- Raise `structure_cost_filter` default from 4000 → 6000 and update settings tests accordingly.
- Update structure-cost filtering to apply a prefix-transition floor and to avoid single-segment paths setting an artificially low baseline.
- Reduce `script_cost` length scaling cap (3 → 2) and update reranker tests; add a new accuracy regression case for 「いまですね」→「今ですね」.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| engine/testcorpus/accuracy-corpus.toml | Adds regression case covering prefix/particle path being incorrectly filtered out. |
| engine/crates/lex-core/src/settings.rs | Updates default/settings parsing tests for the new structure_cost_filter value. |
| engine/crates/lex-core/src/default_settings.toml | Changes default structure_cost_filter to 6000. |
| engine/crates/lex-core/src/converter/tests/reranker.rs | Updates existing reranker expectations and adds a new test for the prefix-floor behavior. |
| engine/crates/lex-core/src/converter/reranker.rs | Implements prefix-transition floor + single-segment min baseline imputation for hard filtering. |
| engine/crates/lex-core/src/converter/cost.rs | Caps script-cost scaling at 2 characters instead of 3. |
Summary

- Raise `structure_cost_filter` threshold from 4000 to 6000 to prevent valid multi-segment paths (e.g. 今|です|ね) from being filtered out
- Impute `min_sc` for single-segment paths (which have `sc=0`) to prevent them from dragging down the threshold
- Reduce `script_cost` scale cap from 3 to 2 to limit over-bonus for long compound readings

Test plan

- `mise run accuracy` passes (100%)
- `mise run accuracy-history` passes
- いまですね → 今ですね

🤖 Generated with Claude Code