Feature Description
When using `LlamaChatSession.prompt()` to generate a response, the model internally computes a
full probability distribution (via softmax) for every token it samples. This data exists at the
native layer but is currently discarded before surfacing to JavaScript.
Applications that need per-token logprobs (e.g. for confidence visualization, uncertainty
estimation, or OpenAI-compatible logprobs API emulation) therefore have no choice but to
replay the entire output sequence on a second `LlamaContextSequence` using
`controlledEvaluate` with `generateNext: { probabilities: true }`. This doubles the inference
cost:
- Pass 1 (main generation): `LlamaChatSession.prompt()` on `sequence`
- Pass 2 (logprob replay): `controlledEvaluate` on a dedicated `replaySequence`
Because the second sequence is needed simultaneously, the context must be created with
`sequences: 2`, which also splits the KV cache budget between the two sequences.
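Whichever pass produces the distribution, the post-processing is the same: take the top-k entries and convert probabilities to log-probabilities. A minimal sketch of that step (the `Map<string, number>` token-to-probability shape here is illustrative, not node-llama-cpp's actual API):

```typescript
// Illustrative only: extract the top-k log-probabilities from a softmax
// distribution. The token->probability Map is an assumed shape, not the
// data structure node-llama-cpp actually returns.
function topLogprobs(
    distribution: Map<string, number>, // token -> probability, sums to ~1
    k: number
): { token: string; logprob: number }[] {
    return [...distribution.entries()]
        .sort((a, b) => b[1] - a[1]) // highest probability first
        .slice(0, k)
        .map(([token, p]) => ({ token, logprob: Math.log(p) }));
}

const dist = new Map([["the", 0.6], ["a", 0.3], ["an", 0.1]]);
console.log(topLogprobs(dist, 2));
// [{ token: "the", logprob: ln(0.6) }, { token: "a", logprob: ln(0.3) }]
```

The point of the feature request is that this data already exists natively during pass 1, so pass 2 exists only to recompute it.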
The Solution
Add a `logprobs` / `topLogprobs` option to `LlamaChatSession.prompt()` (and ideally to
`LlamaCompletion` as well) that captures the probability distribution during the original
generation pass and returns it alongside the text, similar to the OpenAI API:
```ts
const result = await session.prompt(input, {
    logprobs: true,
    topLogprobs: 5,
    // ...existing options
});

// result.logprobs.content[i] = { token, logprob, top_logprobs: [...] }
```
This would eliminate the replay pass entirely and halve the inference cost for any consumer
that needs token-level probabilities.
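For reference, the result could carry an OpenAI-compatible shape, consumed for example to derive a per-response confidence score. The interface names below are suggestions, not an agreed design:

```typescript
// Suggested result shape, mirroring OpenAI's chat.completions logprobs
// payload. The names PromptLogprobs and TokenLogprob are hypothetical.
interface TokenLogprob {
    token: string;
    logprob: number;
    top_logprobs: { token: string; logprob: number }[];
}
interface PromptLogprobs {
    content: TokenLogprob[];
}

// Example consumer: a confidence score from the mean per-token logprob,
// i.e. the geometric mean of the sampled tokens' probabilities.
function meanConfidence(logprobs: PromptLogprobs): number {
    const sum = logprobs.content.reduce((acc, t) => acc + t.logprob, 0);
    return Math.exp(sum / logprobs.content.length);
}

const example: PromptLogprobs = {
    content: [
        { token: "Hello", logprob: Math.log(0.9), top_logprobs: [] },
        { token: "!", logprob: Math.log(0.8), top_logprobs: [] },
    ],
};
console.log(meanConfidence(example)); // geometric mean of 0.9 and 0.8 ≈ 0.849
```

With this shape, mapping to an OpenAI-compatible HTTP response is a direct field copy, which is exactly the emulation use case described above.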
Considered Alternatives
The only current workaround is to use `controlledEvaluate` on a parallel `replaySequence` after
(or concurrently with) the main generation. A reference implementation of this approach can be
found here:
https://github.com/steve02081504/fount/blob/master/src/public/parts/serviceGenerators/AI/local/src/localLogprobs.mjs
Additional Context
- node-llama-cpp version: 3.18.1
- Use case: OpenAI-compatible `logprobs` visualization in a local GGUF inference service
Related Features to This Feature Request
- Metal support
- CUDA support
- Vulkan support
- Grammar
- Function calling
Are you willing to resolve this issue by submitting a Pull Request?
No, I don’t have the time, and I’m okay with waiting for the community / maintainers to resolve this issue.