Rust based AI LLM inference service

This repository contains the code to run a simple LLM, such as Mistral 7B (probably the best model to run locally at the moment), for inference. It includes basic RAG functionality. Most importantly, it exposes metrics on how long it took to create a response and how long it took to generate the tokens.

Currently, it uses LLM and candle, although I hope to replace the latter over time for performance reasons.

rusting llama being observed in mistral winds.


This is for testing only; use at your own risk! The main purpose is to learn hands-on how this stuff works and to instrument and characterize the behaviour of LLMs.


The following key metrics are exposed through Prometheus:

  • token_creation_duration - Histogram for the time it took to generate the tokens.
  • inference_response_duration - Histogram for the time it took to generate the full response (includes tokenization and embedding additional context).
  • embedding_duration - Histogram for the time it took to create a vector representation of the query and lookup contextual information in the knowledge base.
  • TODO: add more such as time it took to tokenize, read from KV store etc; also check if we can add tracing.
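
To make the histogram idea a bit more concrete, here is a minimal, std-only sketch of how a duration can be measured and sorted into cumulative buckets, in the way Prometheus histograms do it. It is not the code used by this service (which exports its metrics through Prometheus); `DurationHistogram`, `time_it` and the bucket bounds are hypothetical names and values chosen for illustration.

```rust
use std::time::{Duration, Instant};

/// Hypothetical, simplified stand-in for a Prometheus histogram.
pub struct DurationHistogram {
    /// Upper bounds of the buckets in seconds, sorted ascending.
    bounds: Vec<f64>,
    /// counts[i] = observations <= bounds[i]; the last entry is the +Inf bucket.
    counts: Vec<u64>,
    /// Sum of all observed durations in seconds.
    sum: f64,
}

impl DurationHistogram {
    pub fn new(bounds: Vec<f64>) -> Self {
        let counts = vec![0; bounds.len() + 1];
        Self { bounds, counts, sum: 0.0 }
    }

    /// Record one observation; buckets are cumulative, as in Prometheus.
    pub fn observe(&mut self, d: Duration) {
        let secs = d.as_secs_f64();
        self.sum += secs;
        for (i, bound) in self.bounds.iter().enumerate() {
            if secs <= *bound {
                self.counts[i] += 1;
            }
        }
        // Every observation falls into the +Inf bucket.
        *self.counts.last_mut().unwrap() += 1;
    }

    /// Total number of observations.
    pub fn count(&self) -> u64 {
        *self.counts.last().unwrap()
    }

    /// Sum of all observations in seconds.
    pub fn sum(&self) -> f64 {
        self.sum
    }

    /// Cumulative bucket counts, ending with the +Inf bucket.
    pub fn cumulative_counts(&self) -> &[u64] {
        &self.counts
    }
}

/// Run `f`, record how long it took in `hist`, and return its result.
pub fn time_it<T>(hist: &mut DurationHistogram, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    hist.observe(start.elapsed());
    out
}
```

Wrapping the token generation step in something like `time_it(&mut token_hist, || generate(...))` (with `generate` standing in for the actual inference call) is roughly what a metric such as token_creation_duration boils down to.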

Here is an example dashboard that captures the metrics described above, as well as some host metrics such as power and CPU utilisation:



You will need to download a model & matching tokenizer file:

Best to put both files into a model/ folder as model.gguf and tokenizer.json.


This service can be configured through environment variables. The following variables are supported:

| Environment variable | Description | Example/Default |
|---|---|---|
| DATA_PATH | Directory path from which to read text files into the knowledge base. | data |
| EMBEDDING_MODEL | Name of the embedding model to use (will be downloaded using HF). | "BAAI/bge-small-en-v1.5" |
| HTTP_ADDRESS | Bind address to use. | |
| HTTP_WORKERS | Number of threads to run with the HTTP server. | 1 |
| MODEL_BATCH_SIZE | Batch size to use. | 8 |
| MODEL_GPU_LAYERS | Number of layers to offload to GPU. | 0 |
| MODEL_MAX_TOKEN | Maximum number of tokens to generate. | 128 |
| MODEL_PATH | Full path to the gguf file of the model. | model/model.gguf |
| MODEL_TOKENIZER | Full path to the json tokenizer file. | model/tokenizer.json |
| MODEL_THREADS | Number of threads we'll use for inference. | 6 |
| PROMETHEUS_HTTP_ADDRESS | Bind address to use for Prometheus. | |

Other environment variables such as RUST_LOG can also be used.
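
As an illustration of how such settings can be picked up, here is a small std-only sketch that reads a few of the variables listed above and falls back to a default when a variable is unset or cannot be parsed. The `Settings` struct, `parse_or` and `from_lookup` are hypothetical names, and treating the Example/Default values as defaults is an assumption; the actual service may load its configuration differently.

```rust
use std::env;
use std::str::FromStr;

/// Hypothetical settings struct; fields mirror some of the variables above.
#[derive(Debug, PartialEq)]
pub struct Settings {
    pub data_path: String,
    pub model_path: String,
    pub model_tokenizer: String,
    pub model_max_token: usize,
    pub model_threads: usize,
    pub model_gpu_layers: u32,
}

/// Look up `key` and parse it; use `default` if it is unset or unparsable.
fn parse_or<T: FromStr>(
    lookup: &impl Fn(&str) -> Option<String>,
    key: &str,
    default: T,
) -> T {
    lookup(key).and_then(|v| v.parse().ok()).unwrap_or(default)
}

impl Settings {
    /// Build settings from any key/value source (handy for testing).
    pub fn from_lookup<F: Fn(&str) -> Option<String>>(lookup: F) -> Self {
        Settings {
            data_path: parse_or(&lookup, "DATA_PATH", "data".to_string()),
            model_path: parse_or(&lookup, "MODEL_PATH", "model/model.gguf".to_string()),
            model_tokenizer: parse_or(
                &lookup,
                "MODEL_TOKENIZER",
                "model/tokenizer.json".to_string(),
            ),
            model_max_token: parse_or(&lookup, "MODEL_MAX_TOKEN", 128),
            model_threads: parse_or(&lookup, "MODEL_THREADS", 6),
            model_gpu_layers: parse_or(&lookup, "MODEL_GPU_LAYERS", 0),
        }
    }

    /// Build settings from the process environment.
    pub fn from_env() -> Self {
        Self::from_lookup(|key| env::var(key).ok())
    }
}
```

Taking a lookup closure instead of calling `env::var` directly keeps the parsing logic testable without touching the real process environment.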


The following curl commands show the format the service understands:

$ curl -X POST localhost:8080/query -d '{"query": "Who was Albert Einstein?"}' -H "Content-Type: application/json"
{"response":"[INST]Using this information: [] answer the Question: Who was Albert Einstein?[/INST] Albert Einstein 
(14 March 1879 – 18 April 1955) was a German-born theoretical physicist [...]"}
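
The response above echoes the prompt that was sent to the model, which shows how the query and the retrieved context are combined. Here is a small sketch that reproduces that prompt shape. It assumes the context is a list of text snippets rendered with Rust's debug formatting (which would explain the `[]` for an empty knowledge base); `build_prompt` and `strip_prompt` are hypothetical helpers, not functions taken from this repository.

```rust
/// Build an instruction prompt in the shape seen in the example response.
/// Assumption: the context is a list of snippets printed with `{:?}`.
pub fn build_prompt(context: &[String], query: &str) -> String {
    format!(
        "[INST]Using this information: {:?} answer the Question: {}[/INST]",
        context, query
    )
}

/// Return only the generated answer, i.e. the text after the closing tag.
/// If the tag is missing, the response is returned unchanged.
pub fn strip_prompt(response: &str) -> &str {
    match response.split_once("[/INST]") {
        Some((_, answer)) => answer.trim_start(),
        None => response,
    }
}
```

A client that only wants the answer could run the `response` field through something like `strip_prompt` before displaying it.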

You can also test if the RAG works by running the following query - notice how easy it is to trick these word prediction machines:

$ curl -X POST localhost:8080/query -d '{"query": "Who was thom Rhubarb?"}' -H "Content-Type: application/json"
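
To give an idea of what the lookup in the knowledge base involves, here is a toy sketch of retrieval by cosine similarity over embedding vectors. This is an assumption about the general technique, not a description of this service's implementation: the real embeddings come from the configured embedding model, while the vectors, `cosine_similarity` and `top_k` below are made up for illustration.

```rust
/// Cosine similarity between two vectors; 0.0 if either has zero length.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Return the `k` knowledge base texts most similar to the query embedding.
/// `kb` is a hypothetical in-memory store of (text, embedding) pairs.
pub fn top_k<'a>(query: &[f32], kb: &'a [(String, Vec<f32>)], k: usize) -> Vec<&'a str> {
    let mut scored: Vec<(f32, &'a str)> = kb
        .iter()
        .map(|(text, emb)| (cosine_similarity(query, emb), text.as_str()))
        .collect();
    // Sort by similarity, highest first.
    scored.sort_by(|a, b| b.0.total_cmp(&a.0));
    scored.into_iter().take(k).map(|(_, text)| text).collect()
}
```

Whatever `top_k` returns would then be placed into the prompt as context. That is also why a made-up entry in the data directory can steer the answer to a query like the one above.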

Kubernetes based deployment

The Dockerfile is best built on a machine with the same CPU as the one where the image will be deployed, as it uses the target-cpu=native flag. Note that it can also optionally include options to build with CLBlast for GPU support.

Use the following example manifest to deploy this application:

kubectl apply -f k8s_deployment.yaml

Note: make sure to adapt the docker image & paths - the manifest above uses hostPaths!

Further reading

Some of the following links can be useful: