Torchchat provides powerful capabilities for running large language models (LLMs) locally. This guide focuses on utilizing local copies of model checkpoints or models in GGUF format to create a chat application. It also highlights relevant options for advanced users.
To work with local models, you need:
- Model Weights: A checkpoint file (e.g., `.pth`, `.pt`) or a GGUF file (e.g., `.gguf`).
- Tokenizer: A tokenizer model file. This can be in either SentencePiece or TikToken format, depending on the tokenizer used with the model.
- Parameter File: (a) a custom parameter file in JSON format, (b) a pre-existing parameter file selected with `--params-path` or `--params-table`, or (c) a pathname that is matched against known models by longest substring in the configuration name, using the same algorithm as GPT-fast.
Ensure the tokenizer and parameter files are in the same directory as the checkpoint or GGUF file for automatic detection. Let’s use a local download of the stories15M tinyllama model as an example:
mkdir stories15M
cd stories15M
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.pt
wget https://github.com/karpathy/llama2.c/raw/refs/heads/master/tokenizer.model
cp ../torchchat/model_params/stories15M.json model.json
cd ..
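After these commands, the stories15M directory should contain the checkpoint, the tokenizer, and the parameter file that torchchat needs. A quick check (the listing shown reflects the files created above):

ls stories15M
# model.json  stories15M.pt  tokenizer.model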
Torchchat provides the CLI flag `--checkpoint-path` for specifying local model weights. The tokenizer is loaded from the same directory as the checkpoint under the name `tokenizer.model` unless specified separately. This example obtains the model parameters by name matching against known models, because `stories15M` is one of the models known to torchchat, with a configuration stored in `torchchat/model_params`:
python3 torchchat.py generate \
--checkpoint-path stories15M/stories15M.pt \
--prompt "Hello, my name is"
The following is an example of how to specify a local model checkpoint, the model architecture, and a tokenizer file:
python3 torchchat.py generate \
--prompt "Once upon a time" \
--checkpoint-path stories15M/stories15M.pt \
--params-path stories15M/model.json \
--tokenizer-path stories15M/tokenizer.model
Alternatively, for known models we can select the architecture configuration with `--params-table`, which names a particular configuration in `torchchat/model_params`:
python3 torchchat.py generate \
--prompt "Once upon a time" \
--checkpoint-path stories15M/stories15M.pt \
--params-table stories15M \
--tokenizer-path stories15M/tokenizer.model
Torchchat supports loading models in GGUF format using the `--gguf-file` flag. Refer to GGUF.md for additional documentation about using GGUF files in torchchat.

The GGUF format is compatible with several quantization levels, such as F16, F32, Q4_0, and Q6_K. Model configuration information is obtained directly from the GGUF file, simplifying setup and obviating the need for a separate `model.json` model architecture specification.
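As an illustration, a generate run against a local GGUF file might look like the following. The file paths are placeholders for your own download, and the other flags mirror the checkpoint examples above; see GGUF.md for the authoritative set of options.

python3 torchchat.py generate \
  --gguf-file path/to/model.gguf \
  --tokenizer-path path/to/tokenizer.model \
  --prompt "Once upon a time"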
Torchchat supports all commands, such as chat, browser, server, and export, with local models. (In fact, known models simply download and populate the same parameters that you would otherwise specify for a local model.) Here is an example setup for running a server with a local model:
python3 torchchat.py server --checkpoint-path stories15M/stories15M.pt
[shell default]: python3 torchchat.py server --checkpoint-path stories15M/stories15M.pt & server_pid=$! ; sleep 90 # wait for server to be ready to accept requests
In another terminal, query the server using `curl`. Depending on the model configuration, this query might take a few minutes to respond.
Note: Since this feature is under active development, not every parameter is consumed. See api/api.pyi for details on which request parameters are implemented. If you encounter any issues, please comment on the tracking GitHub issue.
Example Query
Setting `stream` to "true" in the request emits the response in chunks. If `stream` is unset or not "true", the client will await the full response from the server.

Example: using the server

A model server used with a local model works like any other torchchat server. You can test it by sending a request with `curl`:
curl http://127.0.0.1:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1",
"stream": "true",
"max_tokens": 200,
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
]
}'
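For comparison, omitting `stream` (or setting it to anything other than "true") returns the complete response in a single reply, as described above. The request is otherwise identical:

curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "max_tokens": 200,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ]
  }'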
[shell default]: kill ${server_pid}
For more information about using different commands, see the root README.md and refer to the Advanced Users Guide for further details on advanced configurations and parameter tuning.