The llm-go package is a Go wrapper for llama.cpp derived from an early version of ollama; it is intended to be easier to maintain and use in Go projects.
This package is licensed under the MIT license. See LICENSE
for more information.
This package comes with a command line utility, cmd/llm, that demonstrates how to use the llm-go package to predict text. (This utility also includes a number of other subcommands for development, such as the bump subcommand explained below; see cmd/llm/dev.go for more information.)
$ go run github.com/swdunlop/llm-go/cmd/llm predict -m OpenRP.q5_K_M.gguf Once upon a time.. 2>/dev/null
The story of the little girl who was born with a silver spoon in her mouth is well known. But what about the little boy who was born with a golden one? This is his story.
In a small village, there lived a poor family. The father worked as a blacksmith and the mother took care of their only son, named Jack. They were happy despite their poverty because they had each other's love and support.
One day, while working at his forge, the blacksmith found an old chest buried under some rubble. Inside it was a golden spoon. He knew that this discovery would change their lives forever.
The next morning, he woke up early to tell his wife about the treasure they had found. She couldn't believe her eyes when she saw the shiny object. They decided to keep it hidden from everyone else until they figured out how to use it wisely.
...
The llm-go package is relatively simple to use:
// Like llama.cpp, the llm-go package loads its model from GGUF format files.
m, _ := llm.Load("OpenRP.q5_K_M.gguf")
defer m.Close() // Ensures we release the memory associated with the model.
input := m.Encode("Once upon a time..")
s, _ := m.Predict(input)
defer s.Close() // Ensures we release the memory associated with the prediction context.
var output []llm.Token
for {
// Next produces one predicted token at a time. You can add more tokens to the context if you want, this is
// helpful for implementing things like Guidance.
next, err := s.Next(nil)
if err != nil {
break
}
// Each call to Next returns a single token or an error. The error is usually either io.EOF
// when the model has reached the end of its prediction (by returning the end of stream (EOS) token) or
// llm.Overflow when the context is full and the model cannot predict any more tokens.
output = append(output, next)
}
fmt.Println(m.Decode(output))
The cmd/llm
command can also be used to run a server that exposes the llm-go
package over HTTP. This
command is largely meant as a proof of concept and will only allow one client to run a prediction at a time.
$ go run github.com/swdunlop/llm-go/cmd/llm service
While working on this package, it is useful to run the server using the dev subcommand, which runs the service as
a child process and restarts it if any Go files change.
$ go run github.com/swdunlop/llm-go/cmd/llm dev
A client can POST a JSON object to the /v0/predict endpoint to request a prediction, and the service will block until
the prediction is complete. The JSON object must have the following structure:
{
// Model identifies the GGUF model to use for the prediction; if omitted, the default model will be used if one was
// specified when the service was started. (If no default model was specified, the service will return an error.)
"model": "OpenRP.q5_K_M.gguf",
// Text provides the text before the prediction; if this is omitted, the service will expect the client to provide
// a list of tokens using the "tokens" field.
"text": "Once upon a time..",
// Tokens provides a list of tokens to use for prediction, as an alternative to providing text. This cannot be
// combined with the "text" field.
"tokens": [7138, 338, 3081],
// Settings is a set of Model and Prediction settings; see examples/server in the llama.cpp project for
// documentation.
"settings": {
"temperature": 0.0
},
// Stop provides a list of regular expressions that, when found, will cause the prediction to stop. This defaults to
// "\n" (a newline) if omitted. The supported regular expression syntax is that of the Go standard library.
"stop": ["\n"],
}
The settings field provides the same options as examples/server from the llama.cpp project, with the exception of options like stream
that affect how the prediction is returned.
The response will have the following structure:
{
// If there was an error during prediction, Error will be set to an English string describing it. If this
// is set, other fields may be omitted.
"error": "",
// Text is the text that was generated by the model.
"text": "The more you know, the better off you are.",
// Tokens is a list of tokens that were generated by the model.
"tokens": [7138, 338, 3081],
}
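As a rough sketch of a client, the following Go program posts a prediction request and prints the result. The request and response structs simply mirror the JSON described above; the service address (http://localhost:8080) is an assumption, so point it at wherever your service is actually listening.

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

// predictRequest and predictResponse mirror the request and response JSON described above.
type predictRequest struct {
    Model string   `json:"model,omitempty"`
    Text  string   `json:"text,omitempty"`
    Stop  []string `json:"stop,omitempty"`
}

type predictResponse struct {
    Error  string `json:"error,omitempty"`
    Text   string `json:"text"`
    Tokens []int  `json:"tokens"`
}

func main() {
    body, _ := json.Marshal(predictRequest{
        Model: "OpenRP.q5_K_M.gguf",
        Text:  "Once upon a time..",
        Stop:  []string{"\n"},
    })
    // The address below is an assumption; adjust it to wherever the service is listening.
    rsp, err := http.Post("http://localhost:8080/v0/predict", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer rsp.Body.Close()
    var out predictResponse
    if err := json.NewDecoder(rsp.Body).Decode(&out); err != nil {
        panic(err)
    }
    if out.Error != "" {
        panic(out.Error)
    }
    fmt.Println(out.Text)
}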
A client can also use GET to establish a WebSocket connection to the /v0/predict endpoint, send a request, and receive
a prediction stream. This is useful for clients that want to stream the prediction as it is generated. The first request must
have the same structure as the POST /v0/predict request described above.
Responses from the prediction API have the same structure as the POST /v0/predict response described above, with the
addition of a boolean field, end, that will be true when the prediction is complete; the final response carries the
complete prediction, including all of the generated text and tokens.
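The sketch below shows that flow using the gorilla/websocket package; the choice of client library, the localhost address, and the way intermediate messages are handled are all assumptions rather than part of llm-go.

package main

import (
    "fmt"

    "github.com/gorilla/websocket"
)

// streamResponse mirrors the POST /v0/predict response with the additional "end" field.
type streamResponse struct {
    Error  string `json:"error,omitempty"`
    Text   string `json:"text"`
    Tokens []int  `json:"tokens"`
    End    bool   `json:"end"`
}

func main() {
    // The address below is an assumption; adjust it to wherever the service is listening.
    conn, _, err := websocket.DefaultDialer.Dial("ws://localhost:8080/v0/predict", nil)
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // The first message has the same structure as the POST /v0/predict request.
    err = conn.WriteJSON(map[string]any{
        "model": "OpenRP.q5_K_M.gguf",
        "text":  "Once upon a time..",
    })
    if err != nil {
        panic(err)
    }

    for {
        var msg streamResponse
        if err := conn.ReadJSON(&msg); err != nil {
            panic(err)
        }
        if msg.Error != "" {
            panic(msg.Error)
        }
        if msg.End {
            // The final message carries the complete prediction.
            fmt.Println(msg.Text)
            break
        }
        // Intermediate messages stream output as it is generated; handle them as needed.
    }
}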
The GET /v0/models
endpoint returns a list of the GGUF models that are available to the service from the model
directory and will identify the default model.
{
// Default identifies the default model, if any.
"default": "OpenRP.q5_K_M.gguf",
// Models is a list of the models that are available to the service.
"models": [
{"model": "OpenRP.q5_K_M.gguf"}
]
}
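A minimal client for this endpoint might look like the sketch below; as before, the localhost address is an assumption.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    // The address below is an assumption; adjust it to wherever the service is listening.
    rsp, err := http.Get("http://localhost:8080/v0/models")
    if err != nil {
        panic(err)
    }
    defer rsp.Body.Close()
    var out struct {
        Default string `json:"default"`
        Models  []struct {
            Model string `json:"model"`
        } `json:"models"`
    }
    if err := json.NewDecoder(rsp.Body).Decode(&out); err != nil {
        panic(err)
    }
    fmt.Println("default:", out.Default)
    for _, m := range out.Models {
        fmt.Println("model:", m.Model)
    }
}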
The POST /v0/encode
API encodes a string of text into a list of tokens. This is useful for clients that operate on
tokens directly. The request has the following structure:
{
// Model identifies the GGUF model to use for encoding; if omitted, the default model will be used if one was
// specified when the service was started. (If no default model was specified, the service will return an error.)
"model": "OpenRP.q5_K_M.gguf",
// Text provides the text to encode.
"text": "Once upon a time.."
}
The response will have the following structure:
{
// If there was an error during encoding, Error will be set to an English string describing it. If this
// is set, other fields may be omitted.
"error": "",
// Tokens is the list of tokens that encode the provided text.
"tokens": [9038, 2501, 263, 931, 636]
}
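As a sketch, assuming the service is reachable at http://localhost:8080, a client could encode text like this:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    body, _ := json.Marshal(map[string]string{
        "model": "OpenRP.q5_K_M.gguf",
        "text":  "Once upon a time..",
    })
    // The address below is an assumption; adjust it to wherever the service is listening.
    rsp, err := http.Post("http://localhost:8080/v0/encode", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer rsp.Body.Close()
    var out struct {
        Error  string `json:"error"`
        Tokens []int  `json:"tokens"`
    }
    if err := json.NewDecoder(rsp.Body).Decode(&out); err != nil {
        panic(err)
    }
    if out.Error != "" {
        panic(out.Error)
    }
    fmt.Println(out.Tokens) // e.g. [9038 2501 263 931 636]
}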
The POST /v0/decode
API decodes a list of tokens into a string of text. Like the POST /v0/encode
API, this is
meant for clients that operate on tokens directly. The request has the following structure:
{
// Model identifies the GGUF model to use for decoding; if omitted, the default model will be used if one was
// specified when the service was started. (If no default model was specified, the service will return an error.)
"model": "OpenRP.q5_K_M.gguf",
// Tokens is a list of tokens to decode.
"tokens": [9038, 2501, 263, 931, 636]
}
The response will have the following structure:
{
// If there was an error during decoding, Error will be set to an English string describing it. If this
// is set, other fields may be omitted.
"error": "",
// Text is the text decoded from the provided tokens.
"text": " Once upon a time.."
// Note that this isn't exactly the same text as the previous example -- LLaMA's tokens often include a leading
// space.
}
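A matching decode sketch, under the same localhost assumption, round-trips the tokens from the encode example back into text:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    body, _ := json.Marshal(map[string]any{
        "model":  "OpenRP.q5_K_M.gguf",
        "tokens": []int{9038, 2501, 263, 931, 636},
    })
    // The address below is an assumption; adjust it to wherever the service is listening.
    rsp, err := http.Post("http://localhost:8080/v0/decode", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer rsp.Body.Close()
    var out struct {
        Error string `json:"error"`
        Text  string `json:"text"`
    }
    if err := json.NewDecoder(rsp.Body).Decode(&out); err != nil {
        panic(err)
    }
    if out.Error != "" {
        panic(out.Error)
    }
    fmt.Printf("%q\n", out.Text) // prints " Once upon a time.." with the leading space noted above
}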
This package embeds a copy of llama.cpp (which in turn embeds GGML) in
internal/llama
. To update this copy, run the bump
subcommand with a commitish from the
upstream repository.
$ go run github.com/swdunlop/llm-go/cmd/llm bump -c bc34dd4f5b5a7c10ae3ed85a265ce6f2ed2fab79
Frequently, this will cause issues with a patch in this repository for macOS that embeds the content of
ggml-metal.metal into the package so it can be loaded from memory. (The upstream llama.cpp project expects you
to provide this file at runtime.) If this happens, you can run the bump subcommand with the --no-patch flag so
that it does not update the ggml-metal.m file, and then recreate the patch.