All-in-one LLM utilities for Rust — plug OpenAI / Azure settings straight into clap, track spend with built-in billing, and replay every request when things go wrong.
The harness layer gives you a concrete in-memory `Agent` that can hold conversation state, expose tools to the model, and run a full user turn through the tool-call loop. A minimal coding agent only needs a system prompt, an LLM, and a `ToolBox` with the tools you want to expose.

The example below builds a basic agent that can read files, list directories, and search for files by glob pattern:

```toml
[dependencies]
clap = { version = "4", features = ["derive"] }
llmy = "0.7"
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
```

```rust
use std::path::PathBuf;
use clap::Parser;
use llmy::agent::tool::ToolBox;
use llmy::agent::tools::files::{FindFileTool, ListDirectoryTool, ReadFileTool};
use llmy::clap::OpenAISetup;
use llmy::harness::Agent;
#[derive(Parser)]
struct Cli {
#[command(flatten)]
llm: OpenAISetup,
#[arg(long, default_value = ".")]
root: PathBuf,
}
#[tokio::main]
async fn main() -> Result<(), llmy::LLMYError> {
let cli = Cli::parse();
let settings = cli.llm.settings();
let llm = cli.llm.to_llm();
let mut tools = ToolBox::new();
tools.add_tool(ReadFileTool::new(cli.root.clone()));
tools.add_tool(ListDirectoryTool::new_root(cli.root.clone()));
tools.add_tool(FindFileTool::new(cli.root.clone()));
let mut agent = Agent::new(
"You are a coding assistant. Use the available file tools whenever you need to inspect the workspace.".to_string(),
tools,
"readme-basic-agent".to_string(),
);
let result = agent
.loop_step_user(
"List the root directory, find Rust files under src, and then read Cargo.toml."
.to_string(),
&llm,
Some("readme-basic-agent"),
Some(settings),
)
.await?;
if let Some(message) = result.assistant_message() {
println!("{message}");
}
Ok(())
}
```

Run it with your OpenAI settings:

```bash
OPENAI_API_KEY=sk-... cargo run -- --model gpt-4o --root .
```

Install the command-line tool:

```bash
cargo install llmy-cli
```

```bash
OPENAI_API_KEY=sk-... llmy chat --model gpt-4o
```

```text
You: Explain async Rust in one sentence.
Assistant: Async Rust uses futures and an executor to let you write non-blocking,
concurrent code with zero-cost abstractions at compile time.
```
Supports `--system` for a custom system prompt. Reads from stdin when not a TTY.

```bash
echo "Hello, world!" | llmy tokenizer --model openai/gpt-4o
# 4
llmy tokenizer --encoding cl100k_base --input my_prompt.txt --verbose
# 0 9906 "Hello"
# 1 11 ","
# 2 1917 " world"
# 3 0 "!"
# 4
```

| Model | Input (per 1M) | Output (per 1M) | Max Input | Max Output | Encoding |
|---|---|---|---|---|---|
| anthropic/claude-sonnet-4 | $3.00 | $15.00 | 136000 | 64000 | claude |
| google/gemini-2.5-flash | $0.30 | $2.50 | 936000 | 64000 | o200k_base |
| google/gemini-2.5-pro | $1.25 | $10.00 | 983040 | 65536 | o200k_base |
| openai/gpt-4.1 | $2.00 | $8.00 | 1014808 | 32768 | o200k_base |
| openai/gpt-4o | $2.50 | $10.00 | 111616 | 16384 | o200k_base |
| openai/gpt-4o-mini | $0.15 | $0.60 | 111616 | 16384 | o200k_base |
| openai/o1 | $15.00 | $60.00 | 100000 | 100000 | o200k_base |
| openai/o3 | $2.00 | $8.00 | 100000 | 100000 | o200k_base |
| openai/o4-mini | $1.10 | $4.40 | 100000 | 100000 | o200k_base |

… (112 models total)
Add the dependency (the root crate re-exports everything):

```toml
[dependencies]
llmy = "0.7"llmy-clap provides three generated arg structs (OpenAISetup, OptOpenAISetup, OptOptOpenAISetup) so you can wire one, two, or three LLMs into any clap-based CLI with zero boilerplate. Each slot is controlled by its own set of env-vars / flags, and can be converted to the core LLM client in one call.
use clap::Parser;
use llmy::clap::OpenAISetup; // primary
use llmy::clap::OptOpenAISetup; // optional secondary
#[derive(Parser)]
struct Cli {
#[command(flatten)]
llm: OpenAISetup,
#[command(flatten)]
fallback_llm: OptOpenAISetup,
}
#[tokio::main]
async fn main() {
let cli = Cli::parse();
// One-liner: clap args → ready-to-use async LLM client
let llm = cli.llm.to_llm();
let resp = llm
.prompt_once_with_retry(
"You are a helpful assistant.",
"Explain async Rust in one sentence.",
None,
None,
None,
)
.await
.unwrap();
println!("{}", resp.choices[0].message.content.as_deref().unwrap_or(""));
}
```

Run it:

```bash
# OpenAI
OPENAI_API_KEY=sk-... cargo run -- --model gpt-4o
# Azure
OPENAI_API_KEY=... cargo run -- \
--azure-openai-endpoint https://my.openai.azure.com \
--azure-deployment gpt-4o \
--model gpt-4o
```

Every setting (temperature, timeout, retries, max tokens, reasoning effort, tool choice, …) is exposed as a flag and an env-var:

| Flag | Env var | Default |
|---|---|---|
| `--model` | `OPENAI_API_MODEL` | `o1` |
| `--llm-temperature` | `LLM_TEMPERATURE` | — |
| `--llm-presence-penalty` | `LLM_PRESENCE_PENALTY` | — |
| `--llm-max-completion-tokens` | `LLM_MAX_COMPLETION_TOKENS` | — |
| `--top-p` | `LLM_TOP_P` | — |
| `--llm-retry` | `LLM_RETRY` | 5 |
| `--llm-prompt-timeout` | `LLM_PROMPT_TIMEOUT` | 1200 (s) |
| `--llm-stream` | `LLM_STREAM` | false |
| `--reasoning-effort` | `LLM_REASONING_EFFORT` | — |
The second and third slots use the prefixes `OPT_` and `OPT_OPT_` for their env-vars (e.g. `OPT_OPENAI_API_KEY`, `OPT_OPT_OPENAI_API_MODEL`).
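For reference, here is a minimal sketch of a CLI that wires all three slots side by side (this assumes `OptOptOpenAISetup` is re-exported from `llmy::clap` alongside the other two; the field names are arbitrary):

```rust
use clap::Parser;
// Assumption: all three generated setup structs live under `llmy::clap`.
use llmy::clap::{OpenAISetup, OptOpenAISetup, OptOptOpenAISetup};

#[derive(Parser)]
struct Cli {
    // Primary slot, configured via the unprefixed flags / env-vars above.
    #[command(flatten)]
    primary: OpenAISetup,
    // Second slot, configured via OPT_-prefixed env-vars.
    #[command(flatten)]
    secondary: OptOpenAISetup,
    // Third slot, configured via OPT_OPT_-prefixed env-vars.
    #[command(flatten)]
    tertiary: OptOptOpenAISetup,
}

fn main() {
    let _cli = Cli::parse();
}
```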
Point LLM_DEBUG at a directory and every LLM round-trip is saved as an XML-like .xml (not strict XML — just an easy-to-skim tagged format) and a raw .json — perfect for post-mortem debugging or dataset building.
```bash
LLM_DEBUG=./debug_logs OPENAI_API_KEY=sk-... cargo run
```

This creates a per-process subfolder with numbered files:

```text
debug_logs/
└── 48291-0-main/
├── llm-000000000001.xml
├── llm-000000000001.json
├── llm-000000000002.xml
└── llm-000000000002.json
```

The .xml file looks like:

```text
=====================
<Request>
<SYSTEM>
You are a helpful assistant.
</SYSTEM>
<USER>
Explain async Rust in one sentence.
</USER>
<tool name="search", description="Search the web", strict=false>
{
"type": "object",
"properties": { "query": { "type": "string" } }
}
</tool>
</Request>
=====================
=====================
<Response>
<ASSISTANT>
Async Rust lets you write concurrent code ...
</ASSISTANT>
</Response>
=====================
```

The .json companion contains the full serialised `CreateChatCompletionRequest` / `CreateChatCompletionResponse` objects for programmatic analysis.
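As a quick sketch of that kind of programmatic analysis, you can load a capture with `serde_json` and inspect it as a generic `Value` (this assumes only that the file is valid JSON; the path is copied from the example listing above and the exact field layout isn't shown here):

```rust
use serde_json::Value;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Path copied from the example directory listing above.
    let raw = std::fs::read_to_string("debug_logs/48291-0-main/llm-000000000001.json")?;
    // Parse into a generic Value so the snippet doesn't depend on the
    // concrete request/response types, then pretty-print for skimming.
    let capture: Value = serde_json::from_str(&raw)?;
    println!("{}", serde_json::to_string_pretty(&capture)?);
    Ok(())
}
```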
llmy ships with up-to-date per-token pricing for 110+ models (GPT-4o, o1, o3, the GPT-5 family, Claude, Gemini, …). Token usage is tracked in real time, including cached-input and reasoning-token discounts. When spend exceeds the budget cap the client returns `LLMYError::Billing` immediately — no more surprise bills.

```rust
use llmy::client::{LLM, SupportedConfig};
use llmy::client::settings::LLMSettings;
let settings = LLMSettings::default();
let model = "gpt-4o".parse().unwrap();
let llm = LLM::new(
SupportedConfig::new("https://api.openai.com/v1", "sk-..."),
model,
5.0, // budget cap in USD
settings,
None,
None,
);
match llm.prompt_once("system", "user", None, None, None).await {
Ok(resp) => { /* … */ }
Err(llmy::LLMYError::Billing(cap, current)) => {
eprintln!("Budget exceeded: ${:.4} / ${:.2}", current, cap);
}
Err(e) => { eprintln!("Error: {e}"); }
}
```

Via clap the cap defaults to $10 and can be overridden:

```bash
cargo run -- --billing-cap 2.5 --model gpt-4o-mini
```

For models not in the built-in list, pass pricing inline:

```bash
cargo run -- --model "my-custom-model,1.0,4.0,0.5"
# name, in, out, cached
```

llmy includes a built-in tokenizer with fast, offline BPE token estimation for 110+ models across OpenAI, Anthropic, Google, and more. Encodings and model metadata are baked into the binary at compile time — no network calls, no data files to ship.
Four encodings are supported: `cl100k_base`, `o200k_base`, `p50k_base` (OpenAI / tiktoken) and `claude` (Anthropic).

```rust
use llmy::tokenizer::{encode, count_tokens, count_tokens_for_model, Encoding};
// Encode text into token IDs
let tokens: Vec<u32> = encode("Hello, world!", Encoding::O200kBase);
// Count tokens directly
let n = count_tokens("Hello, world!", Encoding::Cl100kBase);
// Or let the library resolve the encoding from a model ID
let n = count_tokens_for_model("Hello, world!", "openai/gpt-4o"); // Some(4)
let n = count_tokens_for_model("Hello, world!", "anthropic/claude-sonnet-4"); // Some(4)
```

The model registry is generated from the same source-of-truth JSON used by the billing system, so model look-ups, pricing, and token counts always stay in sync.
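To see how the tokenizer and the pricing data fit together, here is a small sketch that turns a token count into a rough input-cost estimate. The $2.50-per-1M figure is copied from the pricing table above and hard-coded purely for illustration; the crate's registry, not this constant, is the source of truth:

```rust
use llmy::tokenizer::count_tokens_for_model;

// Rough input-cost estimate for a prompt sent to openai/gpt-4o.
// Returns None if the model is unknown to the registry.
fn estimate_input_cost_usd(prompt: &str) -> Option<f64> {
    let tokens = count_tokens_for_model(prompt, "openai/gpt-4o")?;
    // $2.50 per 1M input tokens, taken from the pricing table above.
    Some(tokens as f64 * 2.50 / 1_000_000.0)
}

fn main() {
    if let Some(cost) = estimate_input_cost_usd("Hello, world!") {
        println!("estimated input cost: ${cost:.8}");
    }
}
```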
llmy-agent models callable tools as a Rust trait. A tool has a strongly typed argument struct, a stable tool name, an optional description, and an async invoke method that returns Result<String, LLMYError>.
You can depend on either the focused crate pair:

```toml
[dependencies]
llmy-agent = "0.5"
llmy-agent-derive = "0.7"or the root crate plus the derive crate:
[dependencies]
llmy = "0.7"
llmy-agent-derive = "0.7"The trait contract is:
use std::future::Future;
use llmy_agent::LLMYError;
use schemars::JsonSchema;
use serde::de::DeserializeOwned;
pub trait Tool: Send + Sync + std::fmt::Debug {
type ARGUMENTS: DeserializeOwned + JsonSchema + Send;
const NAME: &str;
const DESCRIPTION: Option<&str>;
fn invoke(
&self,
arguments: Self::ARGUMENTS,
) -> impl Future<Output = Result<String, LLMYError>> + Send;
}
```

In practice you usually write the typed arguments and the async method, then let llmy-agent-derive generate the `impl Tool` for you:

```rust
use std::path::PathBuf;
use llmy::agent::LLMYError;
use llmy_agent_derive::tool;
use schemars::JsonSchema;
use serde::Deserialize;
#[derive(Deserialize, JsonSchema, Default)]
pub struct ReadFileArgs {
/// The path of the file to read
pub file_path: PathBuf,
}
#[derive(Debug, Clone)]
#[tool(
arguments = ReadFileArgs,
invoke = read_file,
description = "Read file contents from `file_path`.",
name = "read_file",
)]
pub struct ReadFileTool {
pub cwd: PathBuf,
}
impl ReadFileTool {
pub async fn read_file(&self, args: ReadFileArgs) -> Result<String, LLMYError> {
let path = self.cwd.join(args.file_path);
Ok(tokio::fs::read_to_string(path).await?)
}
}
```

Notes:
- `arguments` and `invoke` are required in `#[tool(...)]`.
- `description` is optional.
- `name` is optional; if omitted, the struct name is converted to `snake_case`, for example `ReadFileTool -> read_file_tool`.
- The generated impl works with either `llmy_agent::Tool` or `llmy::agent::Tool`.
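If you'd rather not use the derive macro, a hand-written implementation following the trait contract above might look roughly like this (a sketch using a hypothetical `WordCountTool`, not code taken from the crate):

```rust
use std::future::Future;
use llmy_agent::{LLMYError, Tool};
use schemars::JsonSchema;
use serde::Deserialize;

#[derive(Deserialize, JsonSchema)]
pub struct WordCountArgs {
    /// The text whose words should be counted
    pub text: String,
}

#[derive(Debug, Clone)]
pub struct WordCountTool;

impl Tool for WordCountTool {
    type ARGUMENTS = WordCountArgs;
    const NAME: &str = "word_count";
    const DESCRIPTION: Option<&str> = Some("Count the words in `text`.");

    fn invoke(
        &self,
        arguments: Self::ARGUMENTS,
    ) -> impl Future<Output = Result<String, LLMYError>> + Send {
        // No I/O here: just split the input on whitespace and report the count.
        async move { Ok(arguments.text.split_whitespace().count().to_string()) }
    }
}
```

This is the kind of boilerplate the `#[tool(...)]` derive saves you from writing.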
MIT