Feature Description
When using `LlamaChatSession.prompt()` to generate a response, the model internally computes a
full probability distribution (via softmax) for every token it samples. This data exists at the
native layer but is currently discarded before surfacing to JavaScript.
Applications that need per-token logprobs (e.g. for confidence visualization, uncertainty
estimation, or OpenAI-compatible logprobs API emulation) therefore have no choice but to
replay the entire output sequence on a second `LlamaContextSequence` using
`controlledEvaluate` with `generateNext: { probabilities: true }`. This doubles the inference
cost:
- Pass 1 (main generation): `LlamaChatSession.prompt()` on `sequence`
- Pass 2 (logprob replay): `controlledEvaluate` on a dedicated `replaySequence`
Because the second sequence is needed simultaneously, the context must be created with
`sequences: 2`, which also splits the KV cache budget between the two sequences.
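Whichever pass produces the distribution, the post-processing is the same: take the top-k entries and convert probabilities to log-probabilities. A minimal sketch of that step (the `Map<string, number>` token-to-probability shape here is illustrative, not node-llama-cpp's actual API):

```typescript
// Illustrative only: extract the top-k log-probabilities from a softmax
// distribution. The token->probability Map is an assumed shape, not the
// data structure node-llama-cpp actually returns.
function topLogprobs(
    distribution: Map<string, number>, // token -> probability, sums to ~1
    k: number
): { token: string; logprob: number }[] {
    return [...distribution.entries()]
        .sort((a, b) => b[1] - a[1]) // highest probability first
        .slice(0, k)
        .map(([token, p]) => ({ token, logprob: Math.log(p) }));
}

const dist = new Map([["the", 0.6], ["a", 0.3], ["an", 0.1]]);
console.log(topLogprobs(dist, 2));
// [{ token: "the", logprob: ln(0.6) }, { token: "a", logprob: ln(0.3) }]
```

The point of the feature request is that this data already exists natively during pass 1, so pass 2 exists only to recompute it.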
The Solution
Add a `logprobs` / `topLogprobs` option to `LlamaChatSession.prompt()` (and ideally to
`LlamaCompletion` as well) that captures the probability distribution during the original
generation pass and returns it alongside the text, similar to the OpenAI API:
```ts
const result = await session.prompt(input, {
    logprobs: true,
    topLogprobs: 5,
    // ...existing options
});

// result.logprobs.content[i] = { token, logprob, top_logprobs: [...] }
```
This would eliminate the replay pass entirely and halve the inference cost for any consumer
that needs token-level probabilities.
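For reference, the result could carry an OpenAI-compatible shape, consumed for example to derive a per-response confidence score. The interface names below are suggestions, not an agreed design:

```typescript
// Suggested result shape, mirroring OpenAI's chat.completions logprobs
// payload. The names PromptLogprobs and TokenLogprob are hypothetical.
interface TokenLogprob {
    token: string;
    logprob: number;
    top_logprobs: { token: string; logprob: number }[];
}
interface PromptLogprobs {
    content: TokenLogprob[];
}

// Example consumer: a confidence score from the mean per-token logprob,
// i.e. the geometric mean of the sampled tokens' probabilities.
function meanConfidence(logprobs: PromptLogprobs): number {
    const sum = logprobs.content.reduce((acc, t) => acc + t.logprob, 0);
    return Math.exp(sum / logprobs.content.length);
}

const example: PromptLogprobs = {
    content: [
        { token: "Hello", logprob: Math.log(0.9), top_logprobs: [] },
        { token: "!", logprob: Math.log(0.8), top_logprobs: [] },
    ],
};
console.log(meanConfidence(example)); // geometric mean of 0.9 and 0.8 ≈ 0.849
```

With this shape, mapping to an OpenAI-compatible HTTP response is a direct field copy, which is exactly the emulation use case described above.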
Considered Alternatives
The only current workaround is to use `controlledEvaluate` on a parallel `replaySequence` after
(or concurrently with) the main generation. A reference implementation of this approach can be
found here:
https://github.com/steve02081504/fount/blob/master/src/public/parts/serviceGenerators/AI/local/src/localLogprobs.mjs
Additional Context
- node-llama-cpp version: 3.18.1
- Use case: OpenAI-compatible `logprobs` visualization in a local GGUF inference service
Related Features to This Feature Request
- Metal support
- CUDA support
- Vulkan support
- Grammar
- Function calling
Are you willing to resolve this issue by submitting a Pull Request?
No, I don’t have the time, and I’m okay with waiting for the community / maintainers to resolve this issue.