
feat(semantic): add output_language_override to pin summary/overview language #1607

Merged
qin-ctx merged 4 commits into volcengine:main from 0xble:feat/output-language-override
Apr 23, 2026

Conversation

@0xble
Contributor

@0xble 0xble commented Apr 21, 2026

Description

Adds a new config key output_language_override that, when set, bypasses
content-based language detection in semantic summary/overview generation and
memory extraction. When empty (default), behavior is unchanged.

Content-based detection (added in #1076 to resolve #934, strengthened in
open #1521) works well when a directory's content is monolingual, but
fails on mixed-language corpora — _detect_language_from_text latches
onto minority-script content and flips the entire directory's
.overview.md / .abstract.md to a language the user does not read.
language_fallback does not help because detection runs regardless.
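
To make the failure mode concrete, here is a deliberately simplified stand-in for a script-based detector. `toy_detect_language` and its regexes are hypothetical and are not the project's `_detect_language_from_text`, which may apply thresholds or other heuristics; the sketch only illustrates how a single minority-script fragment can decide the output language for an otherwise English corpus.

```python
import re

# Hypothetical, simplified script checks (not the project's real detector).
KANA = re.compile(r"[\u3040-\u30ff]")  # hiragana + katakana
HAN = re.compile(r"[\u4e00-\u9fff]")   # common CJK ideographs


def toy_detect_language(text: str, fallback: str = "en") -> str:
    """Return a language code based on the first script that matches."""
    if KANA.search(text):
        return "ja"
    if HAN.search(text):
        return "zh-CN"
    return fallback


# An English-heavy corpus with one short Japanese fragment.
corpus = "Weekly ops notes. " * 50 + "OpenClawの削除"
print(toy_detect_language(corpus))           # ja
print(toy_detect_language("plain English"))  # en
```

Under this assumption, the fallback value is only consulted when no script matches at all, which is why adjusting `language_fallback` alone cannot pin the output language.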

Symmetric impact: an English-primary deployment with some imported
Japanese/Chinese content, or a Chinese-primary deployment with embedded
English documentation, now has a single knob to pin output language.

Related Issue

Related to #934 (closed by #1076), #1067, and #1521. Complements, does
not conflict with, the source-language-following direction — override is
opt-in and defaults off.

Type of Change

  • New feature (non-breaking change that adds functionality)

Changes Made

  • Add output_language_override: str = "" field to OpenVikingConfig
  • Add resolve_with_override(config, detect_with_fallback) as the
    canonical primitive and two thin convenience wrappers
    (resolve_output_language, resolve_output_language_from_conversation)
    in openviking/session/memory/utils/language.py
  • Wire the helpers into SemanticProcessor (file summary + overview
    generation), MemoryExtractor, and SessionExtractContextProvider so
    all three language-resolution paths share one override source of truth
  • Add TestOutputLanguageOverride covering override set/unset paths for
    both text and conversation resolvers (7 new tests)
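
A minimal sketch of how the primitive and one wrapper could fit together, based only on the description above. `SketchConfig`, `fake_detector`, and the explicit `detector` parameter are assumptions added to keep the example self-contained; the real helpers live in openviking/session/memory/utils/language.py and may differ in signature and detail.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SketchConfig:
    """Hypothetical stand-in for OpenVikingConfig (only the two relevant fields)."""
    output_language_override: str = ""
    language_fallback: str = "en"


def resolve_with_override(config: SketchConfig,
                          detect_with_fallback: Callable[[str], str]) -> str:
    """Return the override if set; otherwise run the caller-supplied detector."""
    override = (config.output_language_override or "").strip()
    if override:
        return override  # the detector is never invoked
    fallback = (config.language_fallback or "en").strip() or "en"
    return detect_with_fallback(fallback)


def resolve_output_language(content: str, config: SketchConfig,
                            detector: Callable[[str, str], str]) -> str:
    """Thin wrapper; the explicit `detector` argument is an assumption of this sketch."""
    return resolve_with_override(config, lambda fb: detector(content, fb))


def fake_detector(text: str, fallback: str) -> str:
    # Placeholder detector: reports Japanese if any hiragana/katakana is present.
    return "ja" if any("\u3040" <= ch <= "\u30ff" for ch in text) else fallback


pinned = SketchConfig(output_language_override="en")
unset = SketchConfig()
print(resolve_output_language("notes の memo", pinned, fake_detector))  # en
print(resolve_output_language("notes の memo", unset, fake_detector))   # ja
```

Routing every call site through one function means the override check exists in exactly one place, so the text, conversation, and message-based paths cannot disagree about precedence.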

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows
$ pytest tests/storage/test_semantic_processor_language.py::TestOutputLanguageOverride \
        tests/storage/test_semantic_processor_language.py::TestLanguageDetection \
        tests/storage/test_semantic_processor_language.py::TestLanguageFlow \
        tests/storage/test_semantic_processor_language.py::TestOverviewGenerationFlow -v
======================= 25 passed, 11 warnings in 0.02s =======================

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Example

Before (English-primary corpus with minority non-English content):

$ ov overview viking://user/<me>/memories/events
# events

このディレクトリは、システム管理、ソフトウェア開発、プロジェクト管理、および個人的な業務記録を網羅した...

## Quick Navigation

*   **システムメンテナンスとクリーンアップについて知りたい**
    *   OpenClawの削除と監査 → [mem_928970ea-...], [mem_fcd7d860-...]

Reproducible deterministically via ov reindex --regenerate --wait.

After, with output_language_override: en:

# ov.conf
semantic:
  language_fallback: en
  output_language_override: en

The detector is skipped, prompts receive Output Language: en, and the
overview is generated in English regardless of in-corpus non-English
fragments. Same mechanism works for any target language — e.g.,
output_language_override: zh-CN pins summaries to Chinese on an
English-heavy subdirectory.

Pseudocode of the change in semantic_processor.py:

# Before
fallback_language = (config.language_fallback or "en").strip() or "en"
output_language = _detect_language_from_text(content, fallback_language)

# After
output_language = resolve_output_language(content, config=config)
# where resolve_output_language honors config.output_language_override first,
# then falls back to _detect_language_from_text when override is empty.

Additional Notes

The override is strictly additive: when unset (the default), behavior is
identical to current main, so this does not interact with the open PR
#1521. They can be merged in either order.

No docs update included — happy to add a note under the semantic
summaries section of docs/en/ and docs/zh/ if maintainers prefer
that in-PR vs a follow-up.

0xble added 2 commits April 20, 2026 20:35
…language

Content-based language detection (volcengine#1076) works well for monolingual
corpora but flips to the detected script whenever non-dominant content
is present. For mixed-corpus users, this produces overviews in a
language they don't read and breaks downstream agents that rely on
consistent language in .overview.md / .abstract.md.

Add `output_language_override` to OpenVikingConfig. When non-empty, it
bypasses `_detect_language_from_text` and forces the configured
language for all semantic summary and overview generation, as well as
memory extraction. When empty (default), behavior is unchanged.

Changes:
- Add `output_language_override` config field
- Add `resolve_output_language` and `resolve_output_language_from_conversation`
  helpers in memory/utils/language
- Wire the helpers into semantic_processor (file summary + overview
  generation), memory_extractor, and session_extract_context_provider
- Add TestOutputLanguageOverride covering override set/unset paths
  for both text and conversation resolvers
…ide primitive

Extract the override-plus-fallback pattern into a single public helper
so the three call sites (text, conversation, message-based) share one
source of truth and cannot drift.

- Add `resolve_with_override(config, detect_with_fallback)` as the
  canonical primitive. It reads `output_language_override`, returns
  early if set, otherwise invokes a caller-supplied detector with the
  resolved fallback language.
- Rewrite `resolve_output_language` and
  `resolve_output_language_from_conversation` as thin wrappers.
- Update `memory_extractor` to go through the primitive instead of
  inlining its own override branch, keeping the specialized
  `_detect_output_language` (user-message extraction with thresholds).
- Move `get_openviking_config` to a top-level import in
  `session/memory/utils/language.py` (consistent with the existing
  `openviking_cli.utils.get_logger` import).
@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

934 - Partially compliant

Compliant requirements:

  • Add output_language_override config to pin summary/overview language

Non-compliant requirements:

Requires further human verification:

1067 - Partially compliant

Compliant requirements:

  • Add output_language_override config to bypass content-based detection

Non-compliant requirements:

Requires further human verification:

1521 - Partially compliant

Compliant requirements:

  • Add output_language_override config as an optional complement to source-language following

Non-compliant requirements:

Requires further human verification:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 92
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.

0xble added a commit to 0xble/OpenViking that referenced this pull request Apr 21, 2026
…language

Adds config field output_language_override and resolve_with_override
primitive used by semantic file summaries, directory overviews, and
memory extraction. When non-empty, bypasses content-based language
detection and forces the configured language. Default "" preserves
existing auto-detect behavior.

Fixes mixed-corpus case where any kana/han fragment in an
English-primary directory flips the entire overview to ja/zh-CN.

Upstream PR: volcengine#1607
@qin-ctx qin-ctx self-requested a review April 21, 2026 04:05
@qin-ctx qin-ctx self-assigned this Apr 21, 2026
Collaborator

@qin-ctx qin-ctx left a comment

Thanks for the contribution. I think we should deprecate language_fallback. If this field is set, we should log a warning.

To avoid redundant logic, I think this part can be refactored a bit. If an output language is configured, we should always use that language. If it is not configured, we should use the detected language instead. We no longer need language_fallback as an additional fallback.

Per review feedback (volcengine#1607), simplify language resolution to the model:
  if output_language_override is set, use it
  otherwise use the detected language

`language_fallback` is no longer consulted in the detection chain.
The config field stays for backwards compatibility but is marked
deprecated; a warning is logged when a non-default value is loaded
from user config. Detection's internal "no script detected" branch
now falls back to hardcoded "en" directly, which is the prior
behavior for the default config anyway.

- openviking/session/memory/utils/language.py: resolve_with_override
  now takes a zero-arg detect callable; inline "en" fallback at the
  detection sites.
- openviking_cli/utils/config/open_viking_config.py: deprecate
  language_fallback field via model_validator that logs a warning
  when non-default.
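
A rough sketch of the simplified model after review, under stated assumptions. The commit uses a pydantic model_validator on the real config class; to stay dependency-free this sketch substitutes a dataclass `__post_init__`. `SketchConfig`, the logger name, the warning text, and the assumption that the default value of `language_fallback` is "en" are all illustrative rather than taken from the code.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("openviking.sketch")  # hypothetical logger name


@dataclass
class SketchConfig:
    """Hypothetical stand-in for OpenVikingConfig."""
    output_language_override: str = ""
    language_fallback: str = "en"  # deprecated; kept for backwards compatibility

    def __post_init__(self) -> None:
        # Warn only when the user supplied a non-default value (default assumed "en").
        if (self.language_fallback or "").strip() not in ("", "en"):
            logger.warning(
                "language_fallback is deprecated and no longer consulted; "
                "use output_language_override instead"
            )


def resolve_with_override(config: SketchConfig, detect: Callable[[], str]) -> str:
    """Override first, otherwise whatever the zero-arg detector returns."""
    override = (config.output_language_override or "").strip()
    if override:
        return override
    return detect()


def detect_or_en(text: str) -> str:
    # Placeholder detector with the hardcoded "en" fallback inlined at the detection site.
    return "ja" if any("\u3040" <= ch <= "\u30ff" for ch in text) else "en"


cfg = SketchConfig(language_fallback="ja")  # logs a deprecation warning
print(resolve_with_override(cfg, lambda: detect_or_en("plain notes")))  # en
```

Because the detector takes no arguments, the primitive no longer needs to know about any fallback value, and `language_fallback` has no effect on the result even when it is set.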
@0xble
Contributor Author

0xble commented Apr 22, 2026

Thanks @qin-ctx — addressed in 0725fa3d. language_fallback is no longer consulted in the chain; it stays as a deprecated config field with a warning logged when a non-default value is loaded. Detection falls back to hardcoded en directly. The resolution is now strictly override -> detected -> 'en'.

@qin-ctx qin-ctx merged commit 7e347f2 into volcengine:main Apr 23, 2026
6 checks passed
@github-project-automation github-project-automation Bot moved this from Backlog to Done in OpenViking project Apr 23, 2026

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Feature]: If you want to summarize and overview, you can configure it to generate Chinese

2 participants