[Serve][LLM] Replace SGLang format_messages_to_prompt with _build_chat_messages#61372
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Code Review
This pull request fixes a crash in the SGLang example caused by an undefined function call. The fix correctly replaces the call with existing helper methods from the same class. My review includes one suggestion to adjust the prompt formatting for embedding requests to ensure the resulting embedding is not unintentionally skewed by a generation-specific token.
```python
prompt = self._render_fallback_prompt(
    self._build_chat_messages(request.messages)
)
```
The `_render_fallback_prompt` method appends `assistant:` to the prompt, which is suitable for generation tasks because it cues the model to produce a response. For embedding requests, however, the goal is to get a representation of the existing conversation history, and including this generation cue may skew the resulting embedding vector.
To ensure the embedding reflects only the provided conversation, consider building the prompt from the chat messages directly, without appending the `assistant:` token.
```diff
-prompt = self._render_fallback_prompt(
-    self._build_chat_messages(request.messages)
-)
+chat_messages = self._build_chat_messages(request.messages)
+prompt_lines = []
+for message in chat_messages:
+    role = str(message.get("role", "user"))
+    content = message.get("content", "") or ""
+    prompt_lines.append(f"{role}: {content}")
+prompt = "\n".join(prompt_lines)
```
```diff
-prompt = format_messages_to_prompt(request.messages)
+prompt = self._render_fallback_prompt(
+    self._build_chat_messages(request.messages)
+)
```
Fallback prompt appends `assistant:`, which is unsuitable for embeddings
Medium Severity
`_render_fallback_prompt` unconditionally appends `assistant:` at the end of the rendered prompt (line 153). This is designed for text generation, where it cues the model to produce the next assistant turn. For embedding requests, the trailing `assistant:` contaminates the text being encoded, producing embeddings that don't faithfully represent the actual conversation content. The embedding code path now always uses this generation-oriented renderer rather than a plain message concatenation.
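To illustrate the distinction both reviewers raise, here is a minimal sketch of the two rendering paths. The function names and message shape are assumptions based on the helpers mentioned in the review, not the actual implementation:

```python
# Hypothetical sketch: generation-style vs. embedding-safe rendering.
# Names are illustrative; the real helpers live on the SGLang engine class.

def render_fallback_prompt(messages):
    """Generation-style: trails with 'assistant:' to cue the next turn."""
    lines = [
        f"{m.get('role', 'user')}: {m.get('content', '') or ''}" for m in messages
    ]
    lines.append("assistant:")  # generation cue; skews an embedding of the text
    return "\n".join(lines)

def render_embedding_prompt(messages):
    """Embedding-safe: encode only the conversation content, no trailing cue."""
    return "\n".join(
        f"{m.get('role', 'user')}: {m.get('content', '') or ''}" for m in messages
    )

msgs = [{"role": "user", "content": "hello"}]
print(render_fallback_prompt(msgs))   # "user: hello\nassistant:"
print(render_embedding_prompt(msgs))  # "user: hello"
```

The only difference is the trailing `assistant:` line, which is exactly what the suggested change strips from the embedding path.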