fix(embedder): auto-retry with smaller chunks when input exceeds model limit#736

Open
deepakdevp wants to merge 2 commits into volcengine:main from deepakdevp:fix/embedder-respect-model-max-tokens
Conversation

@deepakdevp
Contributor

Summary

  • When the embedding API rejects input as too large, automatically retry with chunking at half the current max_tokens instead of crashing
  • Log a warning guiding users to set embedding.dense.max_tokens in ov.conf to match their model's actual limit

Fixes #731

Root Cause

The default max_tokens is 8000 (designed for OpenAI models). Users with models that have smaller limits (e.g., bce-embedding-base_v1 with 512 max tokens) hit errors because:

  1. Token estimation uses len(text) // 3 for non-OpenAI models (no tiktoken available), which underestimates token counts for CJK text
  2. _chunk_and_embed() creates chunks of max_tokens size, but each chunk still exceeds the model's actual limit
  3. The API returns "input too large" and the embedder crashes
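The arithmetic behind point 1 can be illustrated with a hypothetical CJK passage (the one-token-per-character ratio is an assumption for illustration; real tokenizers vary):

```python
# Many embedding tokenizers emit roughly one token per CJK character,
# while the fallback heuristic divides the character count by three.
text = "深" * 600              # 600-character CJK input
estimated = len(text) // 3     # heuristic used when tiktoken is unavailable
actual_approx = len(text)      # assumed ~1 token per CJK char (illustrative)

model_limit = 512              # e.g. bce-embedding-base_v1

# The heuristic says the text fits, so no chunking happens...
assert estimated <= model_limit
# ...but the real token count exceeds the limit, and the API rejects it.
assert actual_approx > model_limit
```

With a 600-character input the heuristic reports 200 tokens, comfortably under the 512-token limit, while the real count is roughly triple that.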

Fix

In OpenAIDenseEmbedder.embed(), catch "too large/too long/maximum context length" errors from _embed_single() and retry with _chunk_and_embed() at half the max_tokens. This handles both:

  • Text that passes the estimation check but fails at the API
  • Models with smaller limits than the 8000 default
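A minimal sketch of the retry path described above, reflecting the final shape of the PR after review (method and helper names follow the PR description; the bodies here are stand-ins, not the real implementation):

```python
import logging

logger = logging.getLogger(__name__)

class OpenAIDenseEmbedder:
    def __init__(self, max_tokens: int = 8000):
        self.max_tokens = max_tokens

    def _embed_single(self, text: str, is_query: bool):
        # Stand-in: the real method calls the embedding API, which may
        # reject over-long input.
        raise RuntimeError("input is too large for this model")

    def _chunk_and_embed(self, text, is_query=False, override_max_tokens=None):
        # Stand-in: the real method splits text into chunks and embeds each.
        max_tok = override_max_tokens if override_max_tokens is not None else self.max_tokens
        return f"embedded with chunk size {max_tok}"

    def embed(self, text: str, is_query: bool = False):
        try:
            return self._embed_single(text, is_query)
        except RuntimeError as e:
            msg = str(e).lower()
            if ("too large" in msg or "too long" in msg
                    or "maximum context length" in msg):
                reduced = max(self.max_tokens // 2, 128)
                logger.warning(
                    "Embedding input exceeded the model limit; retrying with "
                    "chunk size %d. Set embedding.dense.max_tokens in ov.conf.",
                    reduced,
                )
                return self._chunk_and_embed(
                    text, is_query=is_query, override_max_tokens=reduced
                )
            raise
```

Passing override_max_tokens instead of mutating shared state keeps the retry local to the call, so concurrent embed() calls are unaffected.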

Changes Made

  • openviking/models/embedder/openai_embedders.py: Added error-driven retry in embed() with reduced chunk size and user-facing warning

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Testing

  • Unit tests pass locally (11/11 embedder tests)
  • Tested on macOS

fix(embedder): auto-retry with smaller chunks when input exceeds model limit

When the embedding API rejects input as "too large" (common with
non-OpenAI models where token estimation is inaccurate), retry
with chunking at half the current max_tokens instead of crashing.

Also logs a warning guiding users to set embedding.dense.max_tokens
in ov.conf to match their model's actual limit.

Fixes volcengine#731.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@qin-ptr left a comment


Review Summary

This PR fixes a crash when embedding input exceeds model limits by adding automatic retry with smaller chunks. However, there is a critical concurrency bug that must be fixed before merging.

See inline comments for details.

Blocking issue: the implementation mutates the instance attribute self._max_tokens, which is not thread-safe when embed() is called concurrently via asyncio.to_thread() (see collection_schemas.py:251).

reduced = max(self.max_tokens // 2, 128)
logger.warning(
    f"Embedding failed due to input length. "
    f"Retrying with chunk size {reduced} tokens. "
    f"Set embedding.dense.max_tokens in ov.conf to your model's actual limit."
)
Contributor


[Bug] (blocking)

Concurrency Safety Issue: Directly modifying the instance attribute self._max_tokens is not thread-safe.

From collection_schemas.py:251-252, embed() is called via asyncio.to_thread() which allows concurrent execution in a thread pool. If multiple threads call embed() simultaneously:

  • Thread A sets self._max_tokens = reduced
  • Thread B reads self.max_tokens (in _chunk_and_embed() or _chunk_text()) and gets the wrong value
  • Thread A restores self._max_tokens in finally, potentially overwriting Thread B's modification

Suggested Fix: Pass max_tokens as a parameter instead of modifying the instance attribute.

Option 1: Add optional parameter to _chunk_and_embed():

def _chunk_and_embed(self, text: str, is_query: bool = False, override_max_tokens: Optional[int] = None) -> EmbedResult:
    max_tok = override_max_tokens if override_max_tokens is not None else self.max_tokens
    # Use max_tok instead of self.max_tokens

# In retry logic:
reduced = max(self.max_tokens // 2, 128)
return self._chunk_and_embed(text, is_query=is_query, override_max_tokens=reduced)

This approach is thread-safe because each thread uses its own local variable instead of modifying shared state.
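The interleaving described above can be reproduced deterministically with threading events (minimal stand-in class, not the real embedder):

```python
import threading

class SharedStateEmbedder:
    """Stand-in reproducing the buggy mutate-and-restore pattern."""
    def __init__(self):
        self._max_tokens = 8000

    def retry_with_reduced(self, reduced, mid_retry, may_restore):
        old = self._max_tokens
        self._max_tokens = reduced   # Thread A mutates shared state
        mid_retry.set()              # signal: retry is now in progress
        may_restore.wait()           # hold the reduced value briefly
        self._max_tokens = old       # restore (the finally-style cleanup)

mid_retry, may_restore = threading.Event(), threading.Event()
emb = SharedStateEmbedder()
t = threading.Thread(target=emb.retry_with_reduced,
                     args=(4000, mid_retry, may_restore))
t.start()
mid_retry.wait()
# Thread B reads while Thread A's retry is mid-flight: it sees 4000,
# even though B never asked for a reduced chunk size.
observed_by_b = emb._max_tokens
may_restore.set()
t.join()
```

Here observed_by_b ends up as 4000, not the configured 8000, which is exactly the cross-thread leakage the parameter-passing fix avoids.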

Contributor Author


Fixed. Added override_max_tokens parameter to _chunk_and_embed() and _chunk_text() in base.py. The retry now passes override_max_tokens=reduced instead of mutating self._max_tokens. Thread-safe.

except RuntimeError as e:
    error_msg = str(e).lower()
    if (
        "too large" in error_msg
        or "too long" in error_msg
        or "maximum context length" in error_msg
    ):
Contributor


[Suggestion] (non-blocking)

Error Message Matching May Be Too Broad: The current check for "too large", "too long", or "maximum context length" might match unrelated errors:

  • "Request body too large" (not a token length issue)
  • "File too large" (not an embedding input issue)

Consider more precise matching:

if (
    ("input" in error_msg and "too large" in error_msg)
    or ("token" in error_msg and ("too long" in error_msg or "too many" in error_msg))
    or "context length" in error_msg
):

Or check for specific API error codes if available.
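The suggested tighter predicate can be sanity-checked against sample messages (the error strings below are illustrative, not real API responses):

```python
def is_input_length_error(error_msg: str) -> bool:
    """Tighter match for token-length errors, as suggested above."""
    error_msg = error_msg.lower()
    return (
        ("input" in error_msg and "too large" in error_msg)
        or ("token" in error_msg
            and ("too long" in error_msg or "too many" in error_msg))
        or "context length" in error_msg
    )

# Token-length errors are still caught...
assert is_input_length_error("Input is too large for this model")
assert is_input_length_error("This model's maximum context length is 512 tokens")
# ...while unrelated size errors no longer trigger a chunked retry.
assert not is_input_length_error("Request body too large")
assert not is_input_length_error("File too large")
```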

Contributor Author


Fixed. Tightened error matching to require 'input' + 'too large', 'token' + 'too long/many', or 'context length'. Won't match unrelated 'Request body too large' errors.

…ching

- Add override_max_tokens parameter to _chunk_and_embed() and
  _chunk_text() instead of mutating self._max_tokens (thread-safe)
- Tighten error message matching to require specific patterns
  ("input" + "too large", "token" + "too long/many", "context length")

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues:

  • [Bug]: Input sequence length exceeds the max input length of embedding model.