A production-grade dashboard for evaluating Large Language Models (LLMs) across multiple providers (OpenAI, Anthropic, etc.), focusing on accuracy, latency, and cost.
This project utilizes the OpenClaw paradigm for standardized model interfacing.
- Multi-Provider Support: Track OpenAI, Anthropic, and local models via OpenClaw.
- Metric Tracking:
  - Accuracy: Semantic similarity and exact match.
  - Latency: Time to first token and total response time.
  - Cost: Per-token pricing calculations.
- Dashboard: Interactive data visualization of model performance.
- Automated Benchmarking: Run suites of prompts against multiple models simultaneously.
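The metrics listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the `Pricing` class, the function names, and the per-1,000-token price unit are all assumptions made for the example. Semantic similarity is left out because it depends on an embedding model that this README does not specify.

```python
import time
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Pricing:
    """Hypothetical price table, in USD per 1,000 tokens."""
    input_per_1k: float
    output_per_1k: float


def exact_match(prediction: str, reference: str) -> bool:
    """Compare two strings, ignoring case and extra whitespace."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return normalize(prediction) == normalize(reference)


def cost_usd(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Per-token cost: input and output tokens are priced separately."""
    return (input_tokens / 1000) * pricing.input_per_1k + (
        output_tokens / 1000
    ) * pricing.output_per_1k


def measure_latency(stream: Iterable[str]) -> Tuple[str, Optional[float], float]:
    """Consume a token stream and time it.

    Returns (full_text, time_to_first_token, total_time), in seconds.
    time_to_first_token is None if the stream yields nothing.
    """
    start = time.perf_counter()
    first_token_at: Optional[float] = None
    parts = []
    for token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        parts.append(token)
    total = time.perf_counter() - start
    return "".join(parts), first_token_at, total
```

Here `measure_latency` accepts any iterable of string chunks, so a provider's streaming response could be passed in once it has been adapted to yield text pieces.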
- Evaluator Core: Python-based engine that sends prompts and collects telemetry.
- OpenClaw Wrapper: Standardized interface for model interactions.
- Database: SQLite/JSON backend for storing historical run data.
- API: FastAPI backend to serve metrics.
- Frontend: Streamlit-based dashboard for visualization.
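As a rough sketch of how the evaluator core and the SQLite backend might fit together, the example below sends every prompt to every model concurrently and stores one row per run. It is an assumption-laden illustration: `ModelFn` is a hypothetical stand-in for an OpenClaw-wrapped model (a plain callable from prompt to response), and the `runs` table schema and function names are invented for the example rather than taken from the project.

```python
import sqlite3
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an OpenClaw-wrapped model: prompt in, response out.
ModelFn = Callable[[str], str]


def run_one(name: str, model: ModelFn, prompt: str) -> Tuple[str, str, str, float]:
    """Send one prompt to one model and record the total response time."""
    start = time.perf_counter()
    response = model(prompt)
    return name, prompt, response, time.perf_counter() - start


def run_suite(
    models: Dict[str, ModelFn], prompts: List[str], conn: sqlite3.Connection
) -> int:
    """Run every prompt against every model in parallel and store the results.

    Returns the number of rows written.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs "
        "(model TEXT, prompt TEXT, response TEXT, latency_s REAL)"
    )
    jobs = [(name, fn, prompt) for name, fn in models.items() for prompt in prompts]
    # Model calls are I/O-bound, so threads are sufficient for concurrency.
    with ThreadPoolExecutor(max_workers=4) as pool:
        rows = list(pool.map(lambda job: run_one(*job), jobs))
    # All database writes happen on the calling thread.
    conn.executemany("INSERT INTO runs VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Keeping the database writes on the calling thread avoids sharing a SQLite connection across worker threads; only the model calls run in the pool.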
# Clone the repository
git clone https://github.com/username/llm-eval-dashboard.git
cd llm-eval-dashboard
# Install dependencies
pip install -r requirements.txt
# Set up environment variables
cp .env.example .env
# Edit .env with your API keys

Execute a prompt suite against configured models:

python scripts/run_benchmark.py --suite generic_tasks

Visualize the results in your browser:

streamlit run app/dashboard.py

- core/: Evaluation logic and metric calculations.
- models/: OpenClaw model definitions.
- data/: Local storage for benchmark results.
- app/: Streamlit dashboard code.
- tests/: Unit tests for evaluation logic.

Please see CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
MIT