Warning: this is a vibe-coded library. It was built with AI assistance, has no test suite, and has seen limited real-world use. The API may change. Use in production at your own risk.
Node.js native addon that runs llama.cpp inference directly in the Node.js/Electron process — no child processes, no HTTP, no IPC overhead. Model weights stay loaded in memory across calls.
- Node.js 18+ or Electron 35+
- CMake 3.15+
- C++17 compiler (Xcode CLT on macOS,
build-essentialon Linux) - A
.ggufmodel file
git clone --recurse-submodules <repo-url>
npm installThe vendor/llama.cpp submodule is built automatically. On macOS, Metal acceleration is enabled by default. No extra flags needed.
const { LlamaModel } = require('llama-cpp-node-api');
const model = new LlamaModel('/path/to/model.gguf', { nGpuLayers: 99 });
for await (const token of model.generate(prompt, { nPredict: 256 })) {
process.stdout.write(token);
}
model.dispose();Models embed a Jinja2-style chat template in their metadata. Use applyChatTemplate() to format messages instead of hand-crafting prompts:
const formatted = model.applyChatTemplate([
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'Hello!' },
]);
for await (const token of model.generate(formatted, { nPredict: 256 })) {
process.stdout.write(token);
}You can also inspect the raw template string via model.chatTemplate.
| Option | Default | Description |
|---|---|---|
nGpuLayers |
99 |
Layers to offload to GPU. 0 = CPU only. |
nCtx |
2048 |
Context window size in tokens. |
Returns an AsyncGenerator<string> that yields token text pieces.
Concurrent calls on the same model are allowed — they queue internally and execute one at a time (llama.cpp's context is not thread-safe). The JS event loop is never blocked while a generation is running, so timers, I/O, and abort signals keep firing.
| Option | Default | Description |
|---|---|---|
nPredict |
256 |
Max tokens. 0 or negative = unlimited. |
temperature |
0.8 |
Sampling temperature. 0 = greedy. |
topP |
0.95 |
Nucleus sampling cutoff. 1 to disable. |
topK |
40 |
Top-k sampling. 0 to disable. |
minP |
0 |
Min-p sampling threshold. Typical: 0.05. |
repeatPenalty |
1.0 |
Repetition penalty. 1.0 = disabled. |
repeatLastN |
64 |
Token window for repeat penalty. |
grammar |
— | GBNF grammar string to constrain output. |
grammarFile |
— | Path to a .gbnf file. |
stop |
[] |
Stop sequences. Generation halts on first match; the sequence is not included in output. |
nCtx |
— | Override context size for this call. |
resetContext |
false |
Clear KV cache before generating (start fresh). |
signal |
— | AbortSignal that cancels this call. The generator throws an AbortError on the next iteration. Other concurrent calls are unaffected. |
// Cancel a single call:
const ac = new AbortController();
setTimeout(() => ac.abort(), 5_000);
try {
for await (const t of model.generate(prompt, { signal: ac.signal })) {
process.stdout.write(t);
}
} catch (e) {
if (e.name !== 'AbortError') throw e;
}
// Cancel every in-flight and queued call on this model:
model.abort();The Jinja2 chat template string embedded in the model's metadata, or null if the model doesn't include one.
Formats an array of { role, content } messages using the model's built-in template. Returns the formatted prompt string.
| Option | Default | Description |
|---|---|---|
addAssistant |
true |
Append the assistant turn prefix so the model continues naturally. |
Cancels every currently running and queued generation on this model. Each call stops at its next token boundary. For cancelling a single call only, use opts.signal on that call (see above).
Frees model weights and KV cache. Must be called when done.
Number of token slots in the current context window. 0 before the first generate() call.
Manages multiple named models, loading each lazily on first use.
const { LlamaModelPool } = require('llama-cpp-node-api');
const pool = new LlamaModelPool();
pool.register('fast', '/models/phi-3-mini.gguf', { nGpuLayers: 99 });
pool.register('smart', '/models/llama-3-8b.gguf', { nGpuLayers: 99 });
for await (const token of pool.generate('fast', prompt, { nPredict: 128 })) {
process.stdout.write(token);
}
pool.dispose();Requantizes a GGUF file to a different ftype. Runs on a libuv worker thread and returns a Promise<void> that resolves once the output has been written. Progress is printed to stderr by llama.cpp.
const { quantize, quantizeFtypes } = require('llama-cpp-node-api');
console.log(quantizeFtypes()); // → ['F16', 'Q4_K_M', 'Q8_0', ...]
await quantize('/models/model-f16.gguf',
'/models/model-q4_k_m.gguf',
{ ftype: 'Q4_K_M' });| Option | Default | Description |
|---|---|---|
ftype |
required | Target quantization. Accepts a string name (e.g. 'Q4_K_M') or the raw llama_ftype enum value. Use quantizeFtypes() to list all accepted names. |
nthread |
hardware concurrency | Threads to use. |
allowRequantize |
false |
Allow re-quantizing tensors that are not f32/f16. |
quantizeOutputTensor |
true |
Quantize output.weight too. |
onlyCopy |
false |
Copy tensors as-is (useful for shard repacking). |
pure |
false |
Quantize every tensor to the default type (no per-tensor overrides). |
keepSplit |
false |
Preserve the input's shard count. |
dryRun |
false |
Compute and report final size without writing the output. |
Note: this wraps llama_model_quantize (C API). Converting from safetensors / Hugging Face formats to GGUF is not included — that path lives in llama.cpp's convert_hf_to_gguf.py and requires a Python environment with torch, safetensors, etc.
npm run build # default (Metal on macOS)
npm run build:cpu # disable Metal
npm run build:cuda # enable CUDA
npm run build:debug # debug symbolsMIT. The bundled vendor/llama.cpp is MIT licensed by its authors.
