TODO(andy): Prompt logprobs combined with chunked prefill can cause the engine core to return an output for a partial prefill (in order to send back partial prompt logprobs). This breaks the invariant that process_outputs only operates on engine core outputs associated with non-partial completions. Currently this is handled by having is_prefilling in OutputProcessor check for new decoded tokens, whose presence indicates that the completion is not partial.
A follow-up PR should aggregate partial prompt logprobs in the EngineCore.
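The check described above can be sketched roughly as follows. This is a minimal illustration, not the actual vLLM code: CoreOutput, RequestState, and process_output are hypothetical stand-ins invented for this example, and the only assumption carried over from the note is that an output with no new decoded tokens belongs to a partial prefill.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CoreOutput:
    """Hypothetical stand-in for one engine core output (not vLLM's real type)."""
    request_id: str
    new_token_ids: List[int] = field(default_factory=list)
    # Logprobs for the prompt chunk processed in this step, if any were requested.
    prompt_logprobs: Optional[List[float]] = None


@dataclass
class RequestState:
    """Hypothetical per-request state kept by the output processor."""
    is_prefilling: bool = True
    prompt_logprobs: List[float] = field(default_factory=list)
    output_token_ids: List[int] = field(default_factory=list)


def process_output(state: RequestState, out: CoreOutput) -> bool:
    """Fold one core output into the request state.

    Returns True if the output carried decoded tokens, and False if it was
    a partial-prefill output that only carried prompt logprobs.
    """
    # Prompt logprobs may arrive in chunks, so accumulate them in order.
    if out.prompt_logprobs:
        state.prompt_logprobs.extend(out.prompt_logprobs)

    # No new decoded tokens: assume the prefill is still partial.
    if not out.new_token_ids:
        return False

    # Decoded tokens are present, so the prefill has completed.
    state.is_prefilling = False
    state.output_token_ids.extend(out.new_token_ids)
    return True
```

In this sketch, a first output carrying only prompt logprobs leaves is_prefilling set, and a later output carrying a decoded token clears it. Aggregating the chunks inside the engine core, as the follow-up proposes, would make the early return unnecessary.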