Run SQL agents, code assistants, or research bots on any LLM. Bring your own framework. Memory, safety, and failure recovery come included.
Think about what actually happens when you run an AI agent in production. The LLM call needs to work. It needs to not cost $500 a day. It needs to not loop forever when the API is slow. It needs to remember context from three messages ago. It needs to not crash your app when one provider goes down.
HarnessAgent handles all of that. You write the task. It handles the rest.
| What you see | What happens under the hood |
|---|---|
| AI answers your question | Picks the healthiest LLM, checks the budget, falls back if the provider fails |
| AI runs a SQL query | Validates the input schema, checks safety rules, executes, logs the result |
| AI remembers past context | Short-term in Redis, long-term in a vector DB |
| AI finds relevant info fast | Graph RAG: entity extraction plus BFS traversal, 83% fewer tokens than naive vector search |
| AI gets better after failures | Hermes loop: samples errors, proposes a prompt fix, evaluates it, applies if the score clears 70% |
| One provider goes down | Circuit breaker opens after 5 failures, auto-recovers after 60 seconds |
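The failure-recovery row above can be sketched as a small state machine. This is an illustrative Python sketch of the open-after-5-failures, probe-after-60-seconds pattern, not HarnessAgent's actual implementation (the real one lives under `src/harness/core/`):

```python
import time


class CircuitBreaker:
    """Sketch: closed -> open after N failures -> half-open after a timeout."""

    def __init__(self, failure_threshold=5, recovery_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        # Half-open: once the recovery timeout passes, let a probe through.
        if now - self.opened_at >= self.recovery_timeout:
            return True
        return False

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # trip the breaker

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close the circuit again
```

A caller wraps each provider call in `allow_request` / `record_failure` / `record_success`; while the breaker is open, the router skips that provider and falls back to the next one.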
```mermaid
graph TB
    subgraph CLIENT["Client Layer"]
        UI[Web / CLI / SDK]
        API[REST API POST /runs]
    end
    subgraph HARNESS["HarnessAgent Core"]
        RUNNER[AgentRunner lifecycle manager]
        subgraph AGENTS["Agent Layer"]
            BASE[BaseAgent run loop]
            SQL[SQLAgent]
            CODE[CodeAgent]
            LG[LangGraph Adapter]
            AG[AutoGen Adapter]
            CR[CrewAI Adapter]
        end
        subgraph MEMORY["Memory System"]
            STM[Short-Term Redis]
            LTM[Long-Term Qdrant / Chroma / Weaviate]
            GRAPH[Knowledge Graph NetworkX / Neo4j]
            RAG[Graph RAG Engine]
        end
        subgraph LLM["LLM Router"]
            ROUTER[Health-aware Circuit-broken Router]
            ANT[Claude]
            OAI[GPT-4o / GPT-5]
            LOCAL[vLLM / SGLang / llama.cpp]
        end
        subgraph TOOLS["Tool System"]
            REG[Tool Registry]
            MCP[MCP Servers]
            SQL2[SQL Tools]
            CODE2[Code Sandbox]
            FILE[File Tools]
        end
        subgraph SAFETY["Safety"]
            GUARD[Guardrail Pipeline]
            HITL[Human-in-the-Loop]
            RATE[Rate Limiter]
            CB[Circuit Breaker]
        end
    end
    subgraph OBS["Observability"]
        MLFLOW[MLflow Traces]
        OTEL[OpenTelemetry]
        PROM[Prometheus]
        GRAFANA[Grafana Dashboard]
    end
    subgraph IMPROVE["Self-Improvement"]
        HERMES[Hermes Loop]
        ERR[Error Collector]
        PATCH[Patch Generator]
        EVAL[Evaluator]
    end
    UI --> API --> RUNNER --> BASE
    BASE --> LLM & MEMORY & TOOLS & SAFETY
    BASE --> OBS
    BASE -.->|failures| IMPROVE
    IMPROVE -.->|better prompts| AGENTS
    ROUTER --> ANT & OAI & LOCAL
    STM & LTM & GRAPH --> RAG
    REG --> MCP & SQL2 & CODE2 & FILE
```
| Feature | Description |
|---|---|
| LLM Routing | Claude, GPT-5, o4-mini, vLLM, SGLang, llama.cpp with automatic health-aware fallback |
| 3-Tier Memory | Redis (hot), then vector DB (warm), then knowledge graph (structured) |
| Graph RAG | 83% token reduction via multi-hop graph traversal vs naive vector search |
| Framework Adapters | LangGraph, AutoGen, CrewAI plug in without rewriting your agents |
| Safety Pipeline | PII redaction, injection detection, tool policy, loop detection, budget enforcement |
| Hermes Loop | Analyzes failures, proposes prompt patches, evaluates them, applies if the score is good |
| Human-in-the-Loop | Agent pauses on risky actions, waits for approval, then continues or stops |
| Circuit Breaker | Opens after 5 failures, self-heals after 60 seconds |
| Cost Tracking | Per-run, per-tenant USD cost with hard monthly caps |
| Code Sandbox | Docker-isolated execution for code agents, 256MB limit, no network |
| Observability | MLflow agent traces, OTel infra spans, Prometheus metrics, Grafana dashboards |
| MCP | Connect any MCP server over stdio or SSE |
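The Graph RAG row deserves a concrete picture. A minimal sketch of the idea, under the assumption that the knowledge graph is exposed as an adjacency map of `entity -> [(relation, neighbor)]` edges (the real engine uses NetworkX or Neo4j, and this function name is hypothetical):

```python
from collections import deque


def graph_rag_context(graph, seed_entities, max_hops=2):
    """BFS outward from entities found in the query and return only the
    traversed facts as context, instead of stuffing whole documents into
    the prompt. This is what drives the token reduction claim."""
    seen = set(seed_entities)
    facts = []
    queue = deque((entity, 0) for entity in seed_entities)
    while queue:
        entity, depth = queue.popleft()
        if depth == max_hops:
            continue  # stop expanding past the hop limit
        for relation, neighbor in graph.get(entity, []):
            facts.append(f"{entity} {relation} {neighbor}")
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return facts
```

Retrieval then becomes "which entities does the query mention, and what is within two hops of them", which is far smaller than a top-k chunk dump.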
| Provider | Models | Tool Calling | Prompt Caching | Cost per 1M input tokens |
|---|---|---|---|---|
| Anthropic | Sonnet 4.6, Haiku 4.5, Opus 4.7 | Native | Yes | $0.25 to $15 |
| OpenAI | GPT-4o, GPT-4o-mini, GPT-5, GPT-5-mini, o1, o3, o4-mini | Native | Auto | $0.15 to $75 |
| vLLM | Any HuggingFace model | Native | No | Free (self-hosted) |
| SGLang | Any HuggingFace model | Native | No | Free (self-hosted) |
| llama.cpp | Any GGUF quantized model | ReAct text injection | No | Free (CPU / Metal) |
| Ollama | Any Ollama model | Native | No | Free (local) |
No GPU? llama.cpp runs on any Mac or CPU machine. Tool calling works through ReAct text injection when native function calling is not available.
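To make "ReAct text injection" concrete: tools are described in the prompt as plain text, and the model's reply is parsed for an action line. The prompt wording and the `Action: tool[input]` format below are illustrative assumptions, not HarnessAgent's exact protocol:

```python
import re


def build_react_prompt(task, tools):
    """Describe the available tools in plain text for a model that has no
    native function-calling API (e.g. llama.cpp with a GGUF model)."""
    tool_lines = "\n".join(f"- {name}: {desc}" for name, desc in tools.items())
    return (
        f"You can use these tools:\n{tool_lines}\n\n"
        "To call one, reply with a line like:\n"
        "Action: tool_name[input]\n\n"
        f"Task: {task}"
    )


def parse_action(model_output):
    """Extract (tool_name, tool_input) from an 'Action: name[input]' line,
    or return None if the model gave a final answer instead."""
    match = re.search(r"Action:\s*(\w+)\[(.*?)\]", model_output)
    if not match:
        return None
    return match.group(1), match.group(2)
```

The run loop alternates between sending the prompt, parsing an action, executing the tool, and appending the observation, until no action line is emitted.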
```bash
# 1. Clone and install
git clone https://github.com/thepradip/HarnessAgent.git
cd HarnessAgent
poetry install

# 2. Configure (set at least one API key, or a local model URL)
cp .env.example .env

# 3. Start infrastructure (Redis, Qdrant, Neo4j, MLflow, Prometheus, Grafana)
docker compose up -d

# 4. Start the API and worker
make api     # terminal 1, FastAPI on port 8000
make worker  # terminal 2, async agent worker

# 5. Run your first agent
curl -X POST http://localhost:8000/runs \
  -H "Content-Type: application/json" \
  -d '{"agent_type": "sql", "task": "How many users signed up this week?"}'

# Watch steps in real time
curl http://localhost:8000/runs/{run_id}/steps
```

No API key? Use llama.cpp locally:

```bash
# Put a GGUF model in ./models/ then:
docker compose --profile local-cpu up -d llamacpp

# Add to .env:
# LLAMACPP_BASE_URL=http://localhost:8080
```

**SQL Data Agent:** Ask business questions in plain English. The agent reads your schema into a knowledge graph, writes safe SELECT queries, and returns formatted results with PII redacted.
**Code Assistant:** Give it a ticket or a spec. It reads your workspace, writes the code, lints it, runs it in a Docker sandbox, and fixes errors until it passes.

**Research Agent:** Feed it documents or URLs. It ingests them into the vector store and knowledge graph, then answers multi-hop questions with citations.

**Multi-Agent Pipeline:** Chain specialists through the planner: a researcher feeds a coder, which feeds a reviewer. All agents share the same memory pool.

**Existing Framework:** Already using LangGraph, AutoGen, or CrewAI? Drop your graph or crew into the adapter. You get traces, cost tracking, circuit breaking, and safety without changing a line of your agent logic.
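If you'd rather drive the run API from Python than curl, a minimal stdlib client might look like this. The endpoint and payload mirror the quick-start curl example; the response shape is not documented here, so the code only decodes whatever JSON comes back:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default port from the quick start


def build_run_request(agent_type, task, base_url=BASE_URL):
    """Build the POST /runs request shown in the quick start."""
    payload = json.dumps({"agent_type": agent_type, "task": task}).encode()
    return urllib.request.Request(
        f"{base_url}/runs",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_run(agent_type, task, base_url=BASE_URL):
    """Submit a run and return the decoded JSON response.
    Requires the API from `make api` to be running."""
    with urllib.request.urlopen(build_run_request(agent_type, task, base_url)) as resp:
        return json.loads(resp.read())
```

Splitting request construction from submission keeps the network-free part testable; `submit_run("sql", "How many users signed up this week?")` is the Python equivalent of the curl call above.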
```
HarnessAgent/
├── src/harness/
│   ├── agents/          # BaseAgent loop, SQLAgent, CodeAgent
│   ├── adapters/        # LangGraph, AutoGen, CrewAI wrappers
│   ├── api/             # FastAPI routes, JWT auth, SSE streaming
│   ├── core/            # Config, circuit breaker, cost tracker, rate limiter
│   ├── eval/            # Datasets, runners, scorers for Hermes evaluation
│   ├── filesystem/      # Isolated workspaces, Docker sandbox, checkpoints
│   ├── improvement/     # Hermes loop, error collector, patch generator
│   ├── ingestion/       # PDF/HTML/MD loaders, chunker, knowledge graph extraction
│   ├── llm/             # Anthropic, OpenAI, local providers, router, factory
│   ├── memory/          # Redis, vector backends, graph, Graph RAG engine
│   ├── messaging/       # Redis Streams inter-agent bus
│   ├── observability/   # MLflow tracer, OTel spans, Prometheus metrics, audit log
│   ├── orchestrator/    # AgentRunner, HITL manager, planner, scheduler
│   ├── prompts/         # Versioned prompt store, patch application
│   ├── safety/          # Guardrail pipeline factory and per-tenant policies
│   ├── tools/           # Tool registry, MCP client, SQL / code / file tools
│   └── workers/         # RQ agent worker, Hermes background scheduler
├── configs/             # Model capabilities, MCP server definitions
├── docs/                # Architecture diagrams and full reference docs
├── infra/               # Prometheus scrape config, OTel collector, Grafana
├── tests/               # 96 unit tests, 2 integration test suites
├── docker-compose.yml   # Full infrastructure: Redis, Qdrant, Neo4j, MLflow, Grafana
├── Dockerfile           # Multi-stage: api, worker, hermes targets
├── Makefile             # install, test, lint, api, worker, hermes, docker-up/down
└── pyproject.toml       # Poetry dependencies and tooling
```
| Layer | Technology | Notes |
|---|---|---|
| API | FastAPI + uvicorn | Async by default, SSE for step streaming |
| LLM | anthropic + openai SDKs | Both support streaming and native tool calling |
| Short-term memory | Redis | Conversation history, pub/sub, task queue |
| Long-term memory | Qdrant / ChromaDB / Weaviate | Chroma for dev (zero infra), Qdrant for prod |
| Knowledge graph | NetworkX / Neo4j | NetworkX in-process for dev, Neo4j for production |
| Agent tracing | MLflow | LLM-native spans, experiment tracking, eval metrics |
| Infra tracing | OpenTelemetry | Vendor-neutral, exports to Jaeger or Tempo |
| Metrics | Prometheus + Grafana | 15 pre-defined metrics, pre-built dashboard |
| Safety | Guardrail | 3-stage pipeline: input, intermediate, output |
| Workers | RQ + Redis | Same Redis connection, no extra broker needed |
| Deployment | Docker Compose | Scale workers independently with replicas |
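The rate limiter in the safety layer is a classic token-bucket problem: refill tokens proportionally to elapsed time, spend one per request. This is an illustrative sketch sized from a requests-per-minute figure like `RATE_LIMIT_RPM`, not the shipped implementation in `src/harness/core/`:

```python
class TokenBucket:
    """Sketch of per-minute rate limiting via a token bucket."""

    def __init__(self, rpm=60):
        self.capacity = rpm            # burst size: at most rpm in-flight at once
        self.tokens = float(rpm)       # start full
        self.refill_per_sec = rpm / 60.0
        self.last = 0.0                # timestamp of the previous call

    def try_acquire(self, now):
        """Return True and consume a token if the request is allowed."""
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Passing `now` explicitly (rather than calling `time.monotonic()` inside) keeps the limiter deterministic and easy to unit-test.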
Once `docker compose up -d` is running:
| Dashboard | URL | Credentials |
|---|---|---|
| MLflow Traces | http://localhost:5000 | none |
| Grafana | http://localhost:3000 | admin / harness_admin |
| Prometheus | http://localhost:9090 | none |
| Qdrant UI | http://localhost:6333/dashboard | none |
| Neo4j Browser | http://localhost:7474 | neo4j / harnesspassword |
Everything goes in .env. Copy .env.example and set what you need.
```bash
# Cloud LLMs
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
OPENAI_MODELS=gpt-4o-mini          # comma-separated, e.g. gpt-4o-mini,gpt-4o

# Local LLMs (no API key needed)
VLLM_BASE_URL=http://localhost:8000
LLAMACPP_BASE_URL=http://localhost:8080

# Memory backends (chroma is default, zero setup)
VECTOR_BACKEND=chroma              # chroma | qdrant | weaviate
GRAPH_BACKEND=networkx             # networkx | neo4j

# Hermes self-improvement
HERMES_AUTO_APPLY=false            # keep this off until you trust it
HERMES_PATCH_SCORE_THRESHOLD=0.7

# Cost and safety
COST_BUDGET_USD_PER_TENANT=100.0
RATE_LIMIT_RPM=60
```

Full reference: docs/guides/CONFIGURATION.md
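The hard cap behind `COST_BUDGET_USD_PER_TENANT` amounts to: accumulate per-tenant spend and refuse the charge that would cross the budget. A hypothetical sketch of that interface (the real tracker lives under `src/harness/core/` and its API may differ):

```python
class CostTracker:
    """Sketch of per-tenant USD budget enforcement."""

    def __init__(self, budget_usd=100.0):
        self.budget_usd = budget_usd
        self.spent = {}  # tenant_id -> USD spent this period

    def charge(self, tenant_id, usd):
        """Record a charge, or raise before recording if it would
        push the tenant over budget. Returns remaining headroom."""
        spent = self.spent.get(tenant_id, 0.0) + usd
        if spent > self.budget_usd:
            raise RuntimeError(
                f"tenant {tenant_id} would exceed ${self.budget_usd} budget"
            )
        self.spent[tenant_id] = spent
        return self.budget_usd - spent
```

Rejecting before recording means a blocked run never eats into the budget, so the tenant can still afford smaller tasks afterward.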
```bash
# Run unit tests
PYTHONPATH=src python3 -m pytest tests/unit/

# Run integration tests (needs SQLite, no Docker required)
PYTHONPATH=src python3 -m pytest tests/integration/

# With coverage
PYTHONPATH=src python3 -m pytest tests/ --cov=src/harness --cov-report=term-missing
```

Current: 96 unit tests passing, 0 failures.
- Architecture Overview: C4 diagrams, every system flow
- Quick Start Guide: three setup paths (cloud, local, production)
- Configuration Reference: every env var explained
- Deployment Guide: Docker Compose to Kubernetes
- Component Reference: all 17 components documented
- Code Walkthrough: follow a request through the actual code
- Troubleshooting: common issues and fixes
- HTML Docs: open in browser, click Export PDF for a printable version
Planned improvements focused on making HarnessAgent more efficient at scale.
| Area | Feature | Expected Impact |
|---|---|---|
| Token Efficiency | Adaptive context compression: summarize stale history with a small model before appending to new prompts | 40-60% token reduction on long sessions |
| Cost Optimization | Semantic response caching: skip LLM calls when a sufficiently similar query was answered recently | Up to 30% cost savings on repetitive workloads |
| Cost Optimization | Batch inference mode: route low-urgency tasks through Anthropic/OpenAI Batch APIs at 50% list price | 50% cost reduction for async pipelines |
| Routing | ML-based predictive model selection: learn per-task-type patterns to auto-select the cheapest sufficient model | Eliminates over-provisioned Opus/GPT-5 usage |
| Memory | Differential re-indexing: re-embed only modified chunks on ingestion, not the full corpus | Faster incremental ingestion at scale |
| Parallelism | Streaming pipeline overlap: start tool execution while the LLM is still generating | Lower end-to-end agent step latency |
| Multi-Agent | Shared tool execution pool: deduplicate identical tool calls across concurrent agents in the same run | Fewer redundant DB and API round-trips |
| Hermes | Cost-aware patch targeting: rank prompt candidates by token spend, optimize the most expensive patterns first | Better ROI from self-improvement cycles |
| Scheduling | Fair-share multi-tenant scheduler: priority queues and resource caps to prevent noisy-neighbor budget spikes | Predictable per-tenant cost and latency |
| Extensibility | Plugin SDK: first-class API for registering custom LLM providers, memory backends, and tool namespaces | Faster integration of new models and datastores |
| Observability | Automated cost anomaly alerts: Prometheus rule + Grafana annotation when a run exceeds a per-step cost threshold | Catch runaway agents before they exhaust budgets |
| Safety | Streaming guardrail evaluation: evaluate guardrail rules token-by-token instead of waiting for full output | Interrupt unsafe responses earlier, reduce wasted tokens |
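The semantic response cache in the roadmap can be sketched in a few lines: embed each query, and on lookup return a cached answer whose embedding is close enough. The `embed` callable here is a stand-in for a real embedding model, and the threshold is an assumption; this illustrates the planned idea, not an existing HarnessAgent API:

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    """Sketch: reuse an answer when a new query embeds close to a cached one."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # query -> vector
        self.threshold = threshold  # minimum similarity for a cache hit
        self.entries = []           # list of (embedding, response)

    def get(self, query):
        qv = self.embed(query)
        for vec, response in self.entries:
            if cosine(qv, vec) >= self.threshold:
                return response  # cache hit: skip the LLM call entirely
        return None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

A production version would bound the entry count and scope the cache per tenant, since two tenants' "similar" queries must never share answers.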
Fork, branch off main, write tests for anything new, open a PR.
```bash
git checkout -b feat/your-feature
PYTHONPATH=src python3 -m pytest tests/unit/
ruff check src/ tests/
```

Things that would be useful: new LLM provider adapters, additional vector backends, more tool integrations, a Kubernetes Helm chart, and examples for specific use cases.
MIT. See LICENSE.
Architecture | Quick Start | Components | Issues