ThriftLM

Stop paying for the same LLM call twice just because users phrased it differently.

0.2.0 adds plan caching on top of response caching — repeated agent workflows now skip the planner entirely.


pip install thriftlm

What ThriftLM is

ThriftLM is a two-layer caching system for LLM applications.

V1 — response cache (thriftlm==0.1.x, stable) Same query → same answer. Intercepts repeated or semantically similar LLM calls before they reach your provider. Three-tier stack: Redis exact hash → local numpy cosine index → Supabase pgvector HNSW.

V2 — plan cache (thriftlm==0.2.0, new) Same job → same execution plan, filled with fresh context. Intercepts agent tasks before planning. If a semantically similar task was planned before, V2 returns a validated, slot-filled FilledPlan — no planner call, no LLM call. If it misses, your planner runs and the result can be stored for next time.

Both layers can run together. V1 sits underneath V2 to cache repeated leaf LLM calls inside agent steps.
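
A minimal sketch of that composition, with the executor's leaf LLM calls wrapped in the V1 cache while V2 supplies the plan (the run_leaf_step helper is hypothetical; SemanticCache.get_or_call is shown in the V1 quickstart below):

from thriftlm import SemanticCache

v1 = SemanticCache(threshold=0.85, api_key="your-key")

def run_leaf_step(prompt: str, llm_fn) -> str:
    # Inside each agent step, repeated or near-duplicate prompts hit the
    # V1 response cache instead of the provider, even when the plan itself
    # came from the V2 plan cache.
    return v1.get_or_call(prompt, llm_fn)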


What's new in 0.2.0

  • Plan-level cache (thriftlm.v2) — reuse reasoning skeletons across task families, not just identical queries
  • Intent canonicalization — tasks are routed to deterministic buckets via a structured IntentKey (gpt-4o-mini, cached 1h in Redis)
  • Composite candidate reranking — 0.7 × semantic_similarity + 0.3 × structural_score over fetched candidates
  • Slot filling + 7-stage validation — plans are filled with caller context and validated before being returned; bad fills are silently discarded
  • Automatic plan extraction — after a planner runs on a miss, extract_plan_template() generalizes the trace into a reusable template (deterministic, no LLM)
  • scripts/ — seed, smoke-test, and extract-and-store helpers for developer workflow
  • 364 tests passing

Architecture

V1 — response cache

query
  │
  ▼
┌─────────────────┐   HIT → return (~0.5ms)
│  Redis          │   exact embedding hash
└────────┬────────┘
         │ MISS
         ▼
┌─────────────────┐   HIT → Supabase PK fetch → return (~50ms)
│  Local Numpy    │   cosine similarity matmul
│  Index          │
└────────┬────────┘
         │ MISS
         ▼
┌─────────────────┐
│  Your LLM fn    │   llm_fn() called here
└────────┬────────┘
         │
         ▼
   PII scrub (Presidio, responses only) → store → return
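
The same flow in code, as an illustrative sketch (the tier objects and method names are hypothetical stand-ins, not the library's internal API):

def cached_call(query, llm_fn, redis_tier, local_index, supabase, scrub_pii, store):
    # Tier 1: exact embedding-hash lookup in Redis (hit in ~0.5 ms)
    cached = redis_tier.get(query)
    if cached is not None:
        return cached
    # Tier 2: cosine-similarity matmul over the local numpy index; on a
    # match, fetch the stored response from Supabase by primary key (~50 ms)
    pk = local_index.nearest(query)
    if pk is not None:
        return supabase.fetch(pk)
    # Miss: call the wrapped LLM, scrub PII (responses only), store, return
    response = scrub_pii(llm_fn(query))
    store(query, response)
    return response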

V2 — plan cache

Lookup path:

task + context + runtime_caps
  │
  ▼
canonicalize(task)          → IntentKey + intent_bucket_hash
  └── Redis 1h TTL          (no second OpenAI call on repeat tasks)
  │
  ▼
bucket fetch (Supabase)     → candidates matching intent_bucket_hash
  │
  ▼
composite rerank            → 0.7 × sem_sim + 0.3 × structural_score
  │
  ▼
adapt_plan()                → fill SlotSpecs from context + transforms
  │
  ▼
validate_plan()             → 7-stage pipeline; discard + try next on fail
  │
  ├── HIT  → return FilledPlan (planner never ran)
  └── MISS → return miss signal → caller runs planner

Miss → extract → store path:

caller planner runs → execution trace
  │
  ▼
extract_plan_template()     → generalize trace to PlanTemplate (deterministic)
  │
  ▼
POST /v2/plan/store         → server verifies bucket hash → stores in Supabase
  │
  ▼
next similar task hits the plan cache

Installation

pip install thriftlm

Prerequisites:

  • Python 3.10+
  • Supabase project with pgvector (supabase/setup.sql to provision tables)
  • Redis (local or Upstash)
  • OPENAI_API_KEY for V2 canonicalization (gpt-4o-mini)

Environment variables (e.g. in .env):

SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your-anon-key
REDIS_URL=redis://localhost:6379
OPENAI_API_KEY=sk-...

V1 Quickstart — response cache

from thriftlm import SemanticCache
import openai

cache = SemanticCache(threshold=0.85, api_key="your-key")

def call_llm(query: str) -> str:
    resp = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
    )
    return resp.choices[0].message.content

# Cache check + LLM fallback in one call
response = cache.get_or_call("Explain semantic caching", call_llm)

# Near-duplicate → instant cache hit, no LLM called
response2 = cache.get_or_call("What is semantic caching?", call_llm)

That's the entire integration. No architecture changes — wrap the existing LLM call.


V2 Quickstart — plan cache

Start the V2 server:

py -m uvicorn thriftlm.v2._server:app --port 8000

Use ThriftLMPlanCache in your agent:

from thriftlm.v2.adapters.generic import ThriftLMPlanCache

cache = ThriftLMPlanCache(
    api_key="tlm_xxx",
    base_url="http://localhost:8000",
    timeout=30.0,   # first call does OpenAI canonicalization, allow extra time
)

task = "summarize open PRs for org/myrepo"
context = {"repo": "org/myrepo"}
runtime_caps = {"tool_families": ["github"], "allow_side_effects": False}

result = cache.lookup(task=task, context=context, runtime_caps=runtime_caps)

if result["status"] == "hit":
    # Planner skipped entirely — use the validated, slot-filled plan
    filled_plan = result["filled_plan"]
    executor.run(filled_plan, context)

else:
    # Cache miss — run your planner, then store the trace for next time
    planner_output = my_planner(task, context)
    executor.run(planner_output, context)

    # Optional: extract and store for future reuse
    from scripts.extract_and_store import extract_and_store
    extract_and_store(
        task=task,
        context=context,
        execution_trace=planner_output["trace"],
        canonicalization_result=result.get("canonicalization_result"),
        api_key="tlm_xxx",
        base_url="http://localhost:8000",
    )

On the second call with a semantically similar task (same intent, different repo), V2 returns a hit — slots are filled with the new context, validation passes, planner never runs.


Core V2 concepts

IntentKey — structured decomposition of a task into action, target, goal, time_scope, and optional metadata fields (domain, format, audience, tool_family). Produced by the canonicalizer (gpt-4o-mini at temperature=0).

intent_bucket_hash — 16-char SHA-256 of the 4 core fields only (action, target, goal, time_scope). Optional fields are excluded from the hash intentionally: LLMs vary them across invocations even for the same task. The hash is the routing key for plan lookup.
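
A minimal sketch of that derivation, assuming canonical JSON serialization of the four core fields (the library's exact serialization may differ):

import hashlib
import json

def bucket_hash(action: str, target: str, goal: str, time_scope: str) -> str:
    # Only the 4 core fields are hashed; optional fields are excluded
    # because LLMs vary them across invocations for the same task.
    core = {"action": action, "target": target, "goal": goal, "time_scope": time_scope}
    payload = json.dumps(core, sort_keys=True, separators=(",", ":"))
    # The first 16 hex chars of the SHA-256 digest form the routing key.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]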

PlanTemplate — a stored execution skeleton: ordered steps with typed I/O, SlotSpec declarations for caller-supplied values, output schema, and version metadata. Retrieved from Supabase by bucket hash.

FilledPlan — a PlanTemplate with all SlotSpec values resolved from the caller's context. Step inputs referencing {slot_name} are substituted. Prior-step output references ({prs}, {grouped}) are left for the executor.
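
The substitution rule in a few lines; this is a behavioral sketch, not the library's adapter (slot names absent from context are treated as prior-step references and left intact):

import re

def fill_slots(step_inputs: dict, context: dict) -> dict:
    def substitute(value):
        if isinstance(value, str):
            # Replace {name} only when "name" is a context key; otherwise
            # leave the placeholder for the executor to resolve at runtime.
            return re.sub(
                r"\{(\w+)\}",
                lambda m: str(context[m.group(1)]) if m.group(1) in context else m.group(0),
                value,
            )
        return value
    return {k: substitute(v) for k, v in step_inputs.items()}

fill_slots({"query": "open PRs in {repo}", "items": "{prs}"}, {"repo": "org/myrepo"})
# → {"query": "open PRs in org/myrepo", "items": "{prs}"}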

Structural scoring — composite score used to rank candidates within a bucket:

final_score = 0.7 × semantic_similarity + 0.3 × structural_score

structural_score =
    0.35 × slot_overlap      (required context keys present)
  + 0.25 × tool_family_match (plan needs tools the runtime has)
  + 0.20 × format_audience   (format/audience fields match)
  + 0.20 × side_effect_compat (side-effecting steps allowed?)
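
Spelled out as arithmetic, assuming each component arrives normalized to [0, 1]:

def composite_score(sem_sim: float, slot_overlap: float, tool_family_match: float,
                    format_audience: float, side_effect_compat: float) -> float:
    structural = (0.35 * slot_overlap
                  + 0.25 * tool_family_match
                  + 0.20 * format_audience
                  + 0.20 * side_effect_compat)
    return 0.7 * sem_sim + 0.3 * structural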

Validation — 7 ordered stages:

Stage | What it checks
------|------------------------------------------------------------------
1     | All required slots resolved from context
2     | Slot values match declared type hints
3     | All step inputs satisfied by prior outputs or slots
4     | Required tool_family values present in runtime_caps
5     | No unsubstituted {placeholder} strings remain
6     | Every non-optional output schema field has a producing step
7     | Side-effecting steps permitted by runtime_caps.allow_side_effects

A candidate that fails any stage is discarded silently. The next ranked candidate is tried. If all top_k candidates fail, V2 returns a miss.
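
The control flow that behavior implies, sketched with hypothetical adapt_plan / validate_plan signatures (the real module interfaces may differ):

from typing import Any, Callable, Optional

def lookup_from_candidates(
    ranked: list,                                  # sorted by composite score
    context: dict,
    runtime_caps: dict,
    adapt_plan: Callable[[Any, dict], Any],
    validate_plan: Callable[[Any, dict], bool],
) -> Optional[Any]:
    for template in ranked:
        filled = adapt_plan(template, context)     # fill SlotSpecs from context
        if validate_plan(filled, runtime_caps):    # 7-stage pipeline
            return filled                          # HIT: planner never runs
        # Any stage failure: discard silently, try the next candidate.
    return None                                    # MISS: caller runs its planner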


Safety and invariants

  • V2 never executes plans. It returns a validated FilledPlan. The caller owns execution.
  • Bucket hash is recomputed server-side on store. Caller-supplied intent_bucket_hash is not trusted — a mismatch returns 400 hash_mismatch.
  • Plans are tenant-isolated. Every plan is scoped to api_key. No cross-tenant reads or writes.
  • Extractor is deterministic. extract_plan_template() calls no LLM and makes no network requests. It generalizes a trace using reverse context mapping (see the sketch after this list). It will refuse extraction (return ok=False) if the trace has fewer than 2 steps, all steps are side-effecting with no slots extracted, or extraction confidence is below 0.5.
  • Canonicalization is cached. Once a task string is canonicalized, the result is stored in Redis for 1 hour. The same task never triggers two OpenAI calls within that window.
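
Reverse context mapping reduces to a small, deterministic transformation. A sketch under the "exact top-level value" rule described in the limitations below (the real extractor additionally enforces the minimum-step, side-effect, and confidence checks listed above):

def generalize_trace(trace_steps: list[dict], context: dict) -> list[dict]:
    # Any step input whose value exactly equals a top-level context value
    # becomes a {key} placeholder; everything else is kept verbatim.
    value_to_slot = {str(v): k for k, v in context.items()}
    generalized = []
    for step in trace_steps:
        inputs = {}
        for name, value in step.get("inputs", {}).items():
            slot = value_to_slot.get(str(value))
            inputs[name] = "{" + slot + "}" if slot is not None else value
        generalized.append({**step, "inputs": inputs})
    return generalized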

Current scope and limitations

  • Text-first. V2 is designed for text-input agent tasks. Multimodal support (EvidenceProfile) is designed but not yet built.
  • Shallow slot extraction. The extractor handles exact top-level context value → placeholder substitution. Nested placeholder extraction and fuzzy abstraction are not supported in v0.1.
  • No benchmark yet. V2 hit rate and latency benchmarks across diverse task families are planned for Phase 3.
  • plan_threshold = 0.60. SBERT cosine similarity between short task strings and plan descriptions typically lands in 0.50–0.70. The threshold may need tuning as your plan bank grows.
  • seed_task vs description split is future polish. Currently seed_v2_plans.py canonicalizes the plan description string, which works but conflates routing vocabulary with reranking text. Not a blocker.

Developer scripts

scripts/seed_v2_plans.py --api-key tlm_xxx
    Seeds Supabase with canonical plan templates. Calls canonicalize() on each description to get live bucket hashes — no hardcoded intent keys.

scripts/smoke_v2_lookup.py --api-key tlm_xxx --base-url http://localhost:8000 --task "..." --context '{}' --timeout 30
    Fires a single lookup and prints the full JSON response. Use --timeout 30 on cold starts.

scripts/extract_and_store.py --api-key tlm_xxx --base-url http://localhost:8000 --task "..." --context '{}' --trace trace.json --canon canon.json
    Extracts a PlanTemplate from an execution trace and stores it via /v2/plan/store.

scripts/debug_v2_lookup.py --api-key tlm_xxx --bucket <hash> --task "..." --context '{}'
    Fetches raw DB rows for a bucket, scores them, and runs adapt_plan + validate_plan — useful for diagnosing misses.

V2 API endpoints

Method + Path                       | Description
------------------------------------|---------------------------------------------------------------
POST /v2/plan/lookup                | Main entry: task + context → FilledPlan or miss
POST /v2/plan/store                 | Store a template (server recomputes and verifies bucket hash)
GET /v2/plan/bucket/:hash           | List templates for a bucket
DELETE /v2/plan/:id                 | Evict a single plan
DELETE /v2/plan/bucket/:hash        | Evict an entire bucket
POST /v2/plan/invalidate-by-version | Bulk soft-invalidate by version string
GET /v2/metrics                     | Server health + version
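
For reference, the lookup endpoint called over plain HTTP; the auth header and body schema here are assumptions inferred from the client above, so check the server code for the authoritative contract:

import requests

resp = requests.post(
    "http://localhost:8000/v2/plan/lookup",
    headers={"Authorization": "Bearer tlm_xxx"},  # header name is an assumption
    json={
        "task": "summarize open PRs for org/myrepo",
        "context": {"repo": "org/myrepo"},
        "runtime_caps": {"tool_families": ["github"], "allow_side_effects": False},
    },
    timeout=30,
)
print(resp.json())  # {"status": "hit", "filled_plan": {...}} or a miss signal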

V1 metrics dashboard

thriftlm serve --api-key your-key
# → http://localhost:8000  (opens automatically)

Shows hit rate, tokens saved, estimated cost saved, and top cached queries. Reads directly from your Supabase.


V1 benchmark

Threshold | Hit Rate | Hits / 200
----------|----------|------------
0.70      |  92.5%   |   185
0.75      |  86.0%   |   172
0.80      |  78.0%   |   156
0.82      |  73.5%   |   147   ← recommended
0.85      |  62.5%   |   125   (default)
0.90      |  40.0%   |    80

Model: all-MiniLM-L6-v2  ·  Dataset: Quora Question Pairs (200 pairs)
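
Adopting the recommended operating point is a one-line change to the V1 quickstart constructor:

cache = SemanticCache(threshold=0.82, api_key="your-key")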

Project structure

ThriftLM/
├── thriftlm/
│   ├── cache.py                 # V1 SemanticCache
│   ├── embedder.py              # SBERT all-MiniLM-L6-v2
│   ├── privacy.py               # Presidio PII scrubbing
│   ├── _server.py               # V1 FastAPI (thriftlm serve)
│   ├── cli.py                   # CLI entry point
│   ├── backends/
│   │   ├── local_index.py       # Numpy cosine index
│   │   ├── redis_backend.py     # Exact hash cache
│   │   └── supabase_backend.py  # pgvector HNSW store
│   └── v2/
│       ├── schemas.py           # TypedDicts: IntentKey, PlanTemplate, FilledPlan, …
│       ├── intent.py            # canonicalize() → IntentKey + bucket hash
│       ├── canonicalization_cache.py  # Redis cache for canonicalization results
│       ├── plan_cache.py        # bucket fetch + composite rerank
│       ├── adapter.py           # slot filling + TransformRegistry
│       ├── validator.py         # 7-stage validation pipeline
│       ├── extractor.py         # trace → PlanTemplate (deterministic)
│       ├── _server.py           # V2 FastAPI endpoints
│       └── adapters/
│           ├── base.py          # BasePlanCache ABC
│           └── generic.py       # ThriftLMPlanCache HTTP client
├── scripts/
│   ├── seed_v2_plans.py
│   ├── smoke_v2_lookup.py
│   ├── extract_and_store.py
│   └── debug_v2_lookup.py
├── tests/                       # 364 passing
├── supabase/setup.sql
├── api/                         # Multi-tenant self-hosted backend
└── pyproject.toml

Roadmap

Item                                          | Status
----------------------------------------------|-------------------
V1 response cache                             | Shipped (0.1.x)
V2 plan cache                                 | Shipped (0.2.0)
V2 benchmark (200 tasks, 5 intent buckets)    | Phase 3
Fly.io deploy + hosted endpoint               | Phase 3
Claude Code MCP adapter / Codex CLI hook      | Roadmap
seed_task vs description split in seed script | Post-0.2.0 polish
V2.5 multimodal EvidenceProfile               | Future

Development

git clone https://github.com/samujure/ThriftLM
cd ThriftLM
pip install -e ".[dev]"
cp .env.example .env      # fill in SUPABASE_URL, SUPABASE_KEY, REDIS_URL, OPENAI_API_KEY
docker compose up -d      # local Redis
pytest tests/ -q          # 364 tests
py scripts/seed_v2_plans.py --api-key tlm_test
py scripts/smoke_v2_lookup.py --api-key tlm_test --base-url http://localhost:8000 \
  --task "summarize open PRs for org/myrepo" --context '{"repo":"org/myrepo"}' \
  --runtime-caps '{"tool_families":["github"],"allow_side_effects":false}' --timeout 30

This README reflects thriftlm==0.2.0.


Built by Srivamsi Amujure & Ivan Thomas Shen
