feat(semantic): add output_language_override to pin summary/overview language #1607
…language

Content-based language detection (volcengine#1076) works well for monolingual corpora but flips to the detected script whenever non-dominant content is present. For mixed-corpus users, this produces overviews in a language they don't read and breaks downstream agents that rely on consistent language in .overview.md / .abstract.md.

Add `output_language_override` to OpenVikingConfig. When non-empty, it bypasses `_detect_language_from_text` and forces the configured language for all semantic summary and overview generation, as well as memory extraction. When empty (default), behavior is unchanged.

Changes:
- Add `output_language_override` config field
- Add `resolve_output_language` and `resolve_output_language_from_conversation` helpers in memory/utils/language
- Wire the helpers into semantic_processor (file summary + overview generation), memory_extractor, and session_extract_context_provider
- Add TestOutputLanguageOverride covering override set/unset paths for both text and conversation resolvers
…ide primitive

Extract the override-plus-fallback pattern into a single public helper so the three call sites (text, conversation, message-based) share one source of truth and cannot drift.

- Add `resolve_with_override(config, detect_with_fallback)` as the canonical primitive. It reads `output_language_override`, returns early if set, otherwise invokes a caller-supplied detector with the resolved fallback language.
- Rewrite `resolve_output_language` and `resolve_output_language_from_conversation` as thin wrappers.
- Update `memory_extractor` to go through the primitive instead of inlining its own override branch, keeping the specialized `_detect_output_language` (user-message extraction with thresholds).
- Move `get_openviking_config` to a top-level import in `session/memory/utils/language.py` (consistent with the existing `openviking_cli.utils.get_logger` import).
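The primitive described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `ConfigStandIn` is a hypothetical stand-in for `OpenVikingConfig`, the `"en"` default for `language_fallback` is an assumption, and `recording_detector` exists only to show that the detector is skipped when the override is set.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ConfigStandIn:
    """Hypothetical stand-in for OpenVikingConfig (only the fields used here)."""

    output_language_override: str = ""
    language_fallback: str = "en"  # assumed default, not confirmed by the PR text


def resolve_with_override(
    config: ConfigStandIn,
    detect_with_fallback: Callable[[str], str],
) -> str:
    """Return the pinned language if set, else defer to the caller's detector."""
    if config.output_language_override:
        # Override set: return early, the detector is never invoked.
        return config.output_language_override
    # Override empty: the caller-supplied detector receives the fallback language.
    return detect_with_fallback(config.language_fallback)


calls: List[str] = []


def recording_detector(fallback: str) -> str:
    """Illustrative detector that records each call and always reports Japanese."""
    calls.append(fallback)
    return "ja"


auto = resolve_with_override(ConfigStandIn(), recording_detector)
pinned = resolve_with_override(
    ConfigStandIn(output_language_override="en"), recording_detector
)
print(auto, pinned, calls)  # ja en ['en']
```

The detector runs once (for the unset case) and is not called at all when the override is present, which is the property that lets the three call sites share one override check.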
…language

Adds config field output_language_override and resolve_with_override primitive used by semantic file summaries, directory overviews, and memory extraction. When non-empty, bypasses content-based language detection and forces the configured language. Default "" preserves existing auto-detect behavior.

Fixes mixed-corpus case where any kana/han fragment in an English-primary directory flips the entire overview to ja/zh-CN.

Upstream PR: volcengine#1607
qin-ctx
left a comment
Thanks for the contribution. I think we should deprecate `language_fallback`. If this field is set, we should log a warning.
To avoid redundant logic, I think this part can be refactored a bit. If the output language is configured, we should always use that language. If it is not configured, we should use the detected language instead. We no longer need `language_fallback` as an additional fallback.
Per review feedback (volcengine#1607), simplify language resolution to the model:

- if output_language_override is set, use it
- otherwise use the detected language

`language_fallback` is no longer consulted in the detection chain. The config field stays for backwards compatibility but is marked deprecated; a warning is logged when a non-default value is loaded from user config. Detection's internal "no script detected" branch now falls back to hardcoded "en" directly, which is the prior behavior for the default config anyway.

- openviking/session/memory/utils/language.py: resolve_with_override now takes a zero-arg detect callable; inline "en" fallback at the detection sites.
- openviking_cli/utils/config/open_viking_config.py: deprecate language_fallback field via model_validator that logs a warning when non-default.
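A rough sketch of the simplified model after this commit, under stated assumptions: `DeprecatingConfig` is a hypothetical dataclass standing in for the pydantic-based `OpenVikingConfig`, `__post_init__` stands in for the `model_validator` the commit mentions, the default `"en"` is assumed, and the warning text is illustrative rather than the message the PR logs.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("language_sketch")

DEFAULT_LANGUAGE_FALLBACK = "en"  # assumed default value of the deprecated field


@dataclass
class DeprecatingConfig:
    """Hypothetical stand-in for OpenVikingConfig after the deprecation."""

    output_language_override: str = ""
    language_fallback: str = DEFAULT_LANGUAGE_FALLBACK  # kept for compatibility only

    def __post_init__(self) -> None:
        # Stand-in for the model_validator: warn only on a non-default value.
        if self.language_fallback != DEFAULT_LANGUAGE_FALLBACK:
            logger.warning(
                "language_fallback is deprecated and no longer consulted; "
                "set output_language_override to pin the output language"
            )


def resolve_with_override(config: DeprecatingConfig, detect: Callable[[], str]) -> str:
    """Override wins; otherwise the zero-arg detector decides on its own."""
    return config.output_language_override or detect()


# The deprecated field no longer influences the result; only a warning is logged.
legacy = DeprecatingConfig(language_fallback="fr")
detected = resolve_with_override(legacy, lambda: "en")
forced = resolve_with_override(
    DeprecatingConfig(output_language_override="zh-CN"), lambda: "en"
)
```

Because the detector no longer receives a fallback argument, any "no script detected" default has to live inside the detector itself, which matches the inlined `"en"` described in the commit.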
Description
Adds a new config key `output_language_override` that, when set, bypasses content-based language detection in semantic summary/overview generation and memory extraction. When empty (default), behavior is unchanged.

Content-based detection (added in #1076 to resolve #934, strengthened in open #1521) works well when a directory's content is monolingual, but fails on mixed-language corpora: `_detect_language_from_text` latches onto minority-script content and flips the entire directory's `.overview.md` / `.abstract.md` to a language the user does not read. `language_fallback` does not help because detection runs regardless.

Symmetric impact: an English-primary deployment with some imported Japanese/Chinese content, or a Chinese-primary deployment with embedded English documentation, now has a single knob to pin output language.
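To make the failure mode concrete, here is a toy presence-based detector. It is an illustration only and not the project's `_detect_language_from_text` (whose actual rules are not shown in this PR); `toy_detect_language` and the sample strings are hypothetical.

```python
def toy_detect_language(text: str) -> str:
    """Toy detector: any kana means 'ja', any han means 'zh-CN', else 'en'."""
    if any("\u3040" <= ch <= "\u30ff" for ch in text):  # hiragana and katakana
        return "ja"
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):  # CJK unified ideographs
        return "zh-CN"
    return "en"


english_only = "Deployment guide for the ingestion service."
# The same English text with a short imported Japanese note appended.
mixed = english_only + " 参考: インポートされたメモ"

print(toy_detect_language(english_only))  # en
print(toy_detect_language(mixed))  # ja
```

A single short fragment is enough to change the result for the whole input, which is the kind of flip that pinning the output language avoids.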
Related Issue
Related to #934 (closed by #1076), #1067, and #1521. Complements, and does not conflict with, the source-language-following direction: the override is opt-in and defaults off.
Type of Change
Changes Made
- Add `output_language_override: str = ""` field to `OpenVikingConfig`
- Add `resolve_with_override(config, detect_with_fallback)` as the canonical primitive and two thin convenience wrappers (`resolve_output_language`, `resolve_output_language_from_conversation`) in `openviking/session/memory/utils/language.py`
- Wire the helpers into `SemanticProcessor` (file summary + overview generation), `MemoryExtractor`, and `SessionExtractContextProvider` so all three language-resolution paths share one override source of truth
- Add `TestOutputLanguageOverride` covering override set/unset paths for both text and conversation resolvers (7 new tests)
Testing
Checklist
Example
Before (English-primary corpus with minority non-English content):

Reproducible deterministically via `ov reindex --regenerate --wait`.

After, with `output_language_override: en`:

The detector is skipped, prompts receive `Output Language: en`, and the overview is generated in English regardless of in-corpus non-English fragments. The same mechanism works for any target language, e.g. `output_language_override: zh-CN` pins summaries to Chinese on an English-heavy subdirectory.
Pseudocode of the change in `semantic_processor.py`:

Additional Notes
The override is strictly additive: when unset (the default), behavior is identical to current `main`, so this does not interact with the open PR #1521. They can be merged in either order.

No docs update included; happy to add a note under the semantic summaries section of `docs/en/` and `docs/zh/` if maintainers prefer that in-PR vs a follow-up.