Forked from mdrokz/rust-llama.cpp

Main changes from the forked version:

This fork has the changes in development on the 'dev' branch, which will be merged into 'master' once tested well enough.

Behavior of the original repo isn't guaranteed to stay the same! Any deviations should be mentioned in the above list. HOWEVER ... if there's unit tests for a method, you can be sure some attention has been paid to at least try to get it working in a reasonable manner. The version numbers for this fork will also no longer keep parity with the original repo and will change as needed for this fork.

Notes:

Git's --recurse-submodules parameter is your friend! Don't forget to use it or things will get out of sync between the vendored llama.cpp version this project wraps.
Setting ModelOptions::context_size to zero will cause memory errors currently as that is what is used to create the buffers to send through the FFI.
PLATFORMS: Windows 11, MacOS and Linux are tested without any features enabled. MacOS with the metal feature has been tested and should be functional. Windows 11 and Linux have been tested with cuda and should work.
MAC: If some of the integration tests crash, consider the possibility that the context_size used in the test sends your mac out of memory. For example, my MBA M1 with 8gb of memory cannot handle a context_size of 4096 with 7B_Q4_K_S models and I have to drop the context down to 2048. Use the RUST_LLAMA_CTX_LEN env variable as described in the section below.
WINDOWS: Make sure to install an llvm package to compile the bindings. I use scoop, so it's as easy as running scoop install llvm. VS 2022 and Cuda 11.8 were also installed in addition to the rust toolchain (msvc version, the default) and the cargo commands were issued from the VS Developer Command Prompt. Note: The logfile feature is currently broken under Windows MSVC builds.
If you're not using full GPU offloading, be mindful of the threads and batch values in your PredictOptions object that you're sending to predict(). CPU will work better with thread counts <= [number of actual CPU cores]. For example, a Macbook Air M3 will choke on the predict_options_test (see below) with a thread count of 16 when the metal feature isn't enabled, but does just fine with threads set to 4. On my machine you can even see it start to misstep with a value of 8.

Running tests

To run the tests, the library will need a GGUF model to load and use. The path for this model is hardcoded to models/model.gguf. On a unix system you should be able to create a symbolic link named model.gguf in that directory to the GGUF model you wish to test with. FWIW, the test prompts use vicuna style prompts.

Any of the 'default' parameters for the integration tests should be modified in the tests/common/mod.rs file.

The recommended way to run the tests involves using the correct feature for your hardware acceleration. The following example is for CUDA device.

cargo test --release --features cuda --test "*" -- --nocapture --test-threads 1

With --nocapture, you'll be able to see the generated output. If it seems like nothing is happening, make sure you're using the right feature for your system. You also may wish to use the --release flag as well to speed up the tests.

Environment variables can be used to customize the test harness for a few parameters:

RUST_LLAMA_MODEL: The relative filepath for the model to load (default is "models/model.gguf")
RUST_LLAMA_GPU_LAYERS: The number of layers to offload to the gpu (default is 100)
RUST_LLAMA_CTX_LEN: The context length to use for the test (default is 4096)

Original README.md content below:

rust_llama.cpp

LLama.cpp rust bindings.

The rust bindings are mostly based on https://github.com/go-skynet/go-llama.cpp/

Building Locally

Note: This repository uses git submodules to keep track of LLama.cpp.

Clone the repository locally:

git clone --recurse-submodules https://github.com/mdrokz/rust-llama.cpp

cargo build

Usage

[dependencies]
llama_cpp_rs = "0.3.0"

use llama_cpp_rs::{
    options::{ModelOptions, PredictOptions},
    LLama,
};

fn main() {
    let model_options = ModelOptions::default();

    let llama = LLama::new(
        "../wizard-vicuna-13B.ggmlv3.q4_0.bin".into(),
        &model_options,
    )
    .unwrap();

    let predict_options = PredictOptions {
        token_callback: Some(Box::new(|token| {
            println!("token1: {}", token);

            true
        })),
        ..Default::default()
    };

    llama
        .predict(
            "what are the national animals of india".into(),
             predict_options,
        )
        .unwrap();
}

Examples

The examples contain dockerfiles to run them

see examples

TODO

Implement support for cublas,openBLAS & OpenCL #7
Implement support for GPU (Metal)
Add some test cases
Support for fetching models through http & S3
Sync with latest master & support GGUF
Add some proper examples mdrokz#7

LICENSE

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 140 Commits
examples		examples
include_shims		include_shims
llama.cpp @ 1613ef8		llama.cpp @ 1613ef8
src		src
tests		tests
.dockerignore		.dockerignore
.gitignore		.gitignore
.gitmodules		.gitmodules
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
binding.cpp		binding.cpp
binding.h		binding.h
build.rs		build.rs

License

tbogdala/rust-llama.cpp

Folders and files

Latest commit

History

Repository files navigation

Forked from mdrokz/rust-llama.cpp

Notes:

Running tests

rust_llama.cpp

Building Locally

Usage

Examples

TODO

LICENSE

About

Resources

License

Stars

Watchers

Forks

Languages