Panopticon is a microservices-based system designed for the systematic evaluation, comparison, and monitoring of Large Language Model (LLM) performance.
In the rapidly evolving landscape of LLMs, understanding how different models perform on your specific tasks is crucial. Panopticon provides a framework to:
- Define Custom Evaluations: Move beyond generic benchmarks. Create queries (prompts) relevant to your domain and define specific evaluation criteria (metrics/evaluation prompts) to measure what matters most to you.
- Compare Models & Providers: Systematically run the same queries and evaluations across different LLMs (e.g., GPT-4 vs. Claude 3 vs. Gemini Pro) or different versions of the same model.
- Monitor Performance Over Time: Track how model performance changes as models are updated or as you refine your prompts and evaluation strategies.
- Visualize Results: Gain insights through an integrated dashboard showing trends, comparisons, and detailed results.
Panopticon aims to provide an objective, configurable, and centralized platform for your LLM quality assurance and monitoring needs.
- Microservice Architecture: Scalable and maintainable design using FastAPI and Docker.
- Configurable Evaluation: Define your own queries (prompts) and evaluation metrics (judge prompts).
- Multi-Provider Support: Integrates with various LLM providers via LiteLLM and custom adapters (OpenAI, Google Gemini, Anthropic included by default).
- Centralized Model Registry: Manage available models and their configurations.
- Judge-Based Scoring: Utilizes a powerful LLM (e.g., GPT-4) to score responses based on your criteria.
- Vector Storage & Search: Stores queries and metrics with vector embeddings (using `sentence-transformers` and `pgvector`) for semantic similarity search.
- Data Persistence: Uses PostgreSQL to store queries, metrics, model configurations, and detailed evaluation results.
- Grafana Dashboard: Integrated Grafana dashboards to explore evaluation trends, compare models, analyze themes, and view detailed results.
- API Gateway: A central entry point (`main-app`) for interacting with the system.
Panopticon employs a microservice architecture:
- `main-app` (API Gateway): The front door. Receives API requests and routes them to the appropriate backend service. Handles authentication.
- `item-storage-queries`: Stores user-defined input queries/prompts, categorized by theme. Includes vector embeddings for search.
- `item-storage-metrics`: Stores user-defined evaluation prompts/criteria, categorized by type. Includes vector embeddings for search.
- `judge-service`: The evaluation engine. Fetches queries and metrics, interacts with the `model-registry` to get LLM responses, uses a judge model for scoring, and stores results in the database.
- `model-registry`: Manages LLM providers and models. Provides a unified interface for generating text completions via adapters.
- `grafana`: Grafana service for data visualization. Connects directly to PostgreSQL to query and visualize evaluation data.
- `postgres`: Shared PostgreSQL database with the `pgvector` extension, storing all persistent data except the embeddings managed within the item-storage services.
- `migrations`: Alembic setup for managing database schema evolution.
```mermaid
graph TD
    User -->|API Request| GW(main-app API Gateway :8000)
    User -->|View Dashboard| Grafana(Grafana :3000)
    subgraph Panopticon System
        GW -->|Forward Request| ISQ(item-storage-queries :8001)
        GW -->|Forward Request| ISM(item-storage-metrics :8002)
        GW -->|Forward Request| Judge(judge-service :8003)
        GW -->|Forward Request| MR(model-registry :8005)
        ISQ -->|Store/Fetch Queries| DB[(Postgres DB :5432)]
        ISM -->|Store/Fetch Metrics| DB
        Judge -->|Fetch Queries| ISQ
        Judge -->|Fetch Metrics| ISM
        Judge -->|Run Query/Evaluate| MR
        Judge -->|Store Results| DB
        MR -->|Store/Fetch Models/Providers| DB
        MR -->|External API Call| LLMAPI[External LLM APIs]
        Grafana -->|Query Data| DB
    end
    style GW fill:#f9f,stroke:#333,stroke-width:2px
    style DB fill:#ccf,stroke:#333,stroke-width:2px
    style Grafana fill:#f96,stroke:#333,stroke-width:2px
```
(This diagram shows the primary request flows. GitHub automatically renders Mermaid diagrams).
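The topology above maps naturally onto Docker Compose. The excerpt below is illustrative only (build contexts, image tags, and environment wiring are assumptions; the ports and service names come from the diagram) — see the repository's actual `docker-compose.yml` for the real definitions:

```yaml
# Illustrative sketch, not the shipped compose file.
services:
  main-app:
    build: ./main-app            # hypothetical build context
    ports: ["8000:8000"]
    environment:
      - API_KEY=${API_KEY}       # shared key from .env
    depends_on: [postgres]
  judge-service:
    build: ./judge-service       # hypothetical build context
    ports: ["8003:8003"]
    depends_on: [postgres]
  postgres:
    image: pgvector/pgvector:pg16  # any Postgres image bundling the pgvector extension
    ports: ["5432:5432"]
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]
    depends_on: [postgres]
```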
- Docker: Install Docker
- Docker Compose: Usually included with Docker Desktop.
- Git: To clone the repository.
- Python 3.12+: (Optional, for local development or running scripts)
- LLM API Keys: Obtain API keys for the providers you want to use (OpenAI, Google Gemini, Anthropic, etc.).
- curl: Command-line tool for making HTTP requests.
- Clone the Repository:

  ```bash
  git clone https://github.com/yourusername/panopticon.git  # Replace with your repo URL
  cd panopticon
  ```
- Configure Environment Variables:
  - Copy the example environment file:

    ```bash
    cp .env.example .env
    ```

  - Edit the `.env` file:
    - Set a secure `API_KEY` for internal service communication and external access. Replace `your_api_key_here` below with this value.
    - Add your LLM API keys (`LITELLM_API_KEY`, `GOOGLE_API_KEY`, `ANTHROPIC_API_KEY`, etc.). `LITELLM_API_KEY` is often used for OpenAI by default in LiteLLM, but check provider configs.
    - Review `DATABASE_URL` and other PostgreSQL settings if you're not using the default Docker setup.
- Build and Run with Docker Compose:

  ```bash
  docker-compose up --build -d
  ```

  - `--build`: Forces Docker to build the images from the Dockerfiles.
  - `-d`: Runs the containers in detached mode (in the background).
- Verify Services:
  - Check container logs: `docker-compose logs -f` (press `Ctrl+C` to exit).
  - Wait for services to become healthy (check `docker-compose ps`). Health checks are configured.
  - Test the main API gateway: `curl -H "X-API-Key: dev_api_key_for_testing" http://localhost:8000/health`
  - Test individual services (e.g., `curl http://localhost:8001/health`).
- Access the Dashboard:
  - Open your web browser and navigate to `http://localhost:3000`.
  - Log in with the credentials configured in your `.env` file (`GRAFANA_ADMIN_USER` and `GRAFANA_ADMIN_PASSWORD`).
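For reference, a filled-in `.env` might look like the following. All values are illustrative placeholders (the `DATABASE_URL` host, credentials, and database name are assumptions); consult `.env.example` for the authoritative variable list and formats:

```bash
API_KEY=your_api_key_here
LITELLM_API_KEY=sk-...
GOOGLE_API_KEY=...
ANTHROPIC_API_KEY=...
DATABASE_URL=postgresql://user:password@postgres:5432/dbname
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=change_me
```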
The primary way to interact with Panopticon is through the `main-app` API gateway running on port 8000. An API key (`X-API-Key` header) matching the one in your `.env` file is required for most endpoints.
- `GET /`: Basic info about the Panopticon system.
- `GET /health`: Health check for the API gateway.
- `GET /api/services`: Lists the available backend services and their primary endpoints.
- `POST /api/queries`: Store a new query (prompt). (Targets `item-storage-queries`.)
- `GET /api/queries/...`: Retrieve or search queries. (Targets `item-storage-queries`.)
- `POST /api/metrics`: Store a new evaluation metric (prompt). (Targets `item-storage-metrics`.)
- `GET /api/metrics/...`: Retrieve or search metrics. (Targets `item-storage-metrics`.)
- `POST /api/judge/evaluate/query`: Evaluate a single query against a model using specified metrics. (Targets `judge-service`.)
- `POST /api/judge/evaluate/theme`: Evaluate all queries of a specific theme against a model. (Targets `judge-service`.)
- `GET /api/judge/results`: Get detailed evaluation results stored by the judge service.
- `GET /api/models`: List models registered in the `model-registry`.
- `GET /api/providers`: List LLM providers registered in the `model-registry`.
- `POST /api/completion`: Directly generate text using a registered model (via `model-registry`).
Refer to the OpenAPI documentation available at `/api/docs` on the running `main-app` (`http://localhost:8000/api/docs`) for detailed request/response schemas.
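If you prefer scripting over raw `curl`, the endpoints above can be wrapped in a small client. This is a sketch, not shipped code: the endpoint paths and `X-API-Key` header come from the list above, while the class name and defaults are invented for illustration.

```python
import json
import urllib.request


class PanopticonClient:
    """Minimal sketch of a client for the main-app gateway (hypothetical helper)."""

    def __init__(self, base_url="http://localhost:8000", api_key="dev_api_key_for_testing"):
        self.base_url = base_url.rstrip("/")
        self.api_key = api_key

    def _build(self, method, path, payload=None):
        # Every request carries the X-API-Key header the gateway expects.
        data = json.dumps(payload).encode("utf-8") if payload is not None else None
        return urllib.request.Request(
            self.base_url + path,
            data=data,
            method=method,
            headers={"X-API-Key": self.api_key, "Content-Type": "application/json"},
        )

    def call(self, method, path, payload=None):
        # Send the request and decode the JSON response body.
        with urllib.request.urlopen(self._build(method, path, payload)) as resp:
            return json.loads(resp.read().decode("utf-8"))


# Example usage (requires a running stack):
# client = PanopticonClient(api_key="your_api_key_here")
# print(client.call("GET", "/health"))
```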
Here's a typical workflow for using Panopticon:
- Define Your Queries:
  - Identify the tasks or prompts you want to evaluate (e.g., "Summarize the following text...", "Write Python code to...", "Explain this concept...").
  - Group related queries under a `theme` (e.g., "summarization", "coding", "qa").
  - Action: Send `POST` requests to `/api/queries` for each query:

    ```bash
    curl -X POST http://localhost:8000/api/queries \
      -H "Content-Type: application/json" \
      -H "X-API-Key: dev_api_key_for_testing" \
      -d '{
        "item": "Summarize the provided article about renewable energy trends.",
        "type": "summarization",
        "metadata": { "source": "tech_crunch_article_123", "difficulty": "medium" }
      }'
    ```

  - Best Practice: Use consistent and descriptive themes. Add relevant metadata for later filtering.
- Define Your Evaluation Metrics (Judge Prompts):
  - Decide how you want to score the LLM's responses. Create prompts for the judge model.
  - Action: Send `POST` requests to `/api/metrics` for each evaluation criterion:

    ```bash
    curl -X POST http://localhost:8000/api/metrics \
      -H "Content-Type: application/json" \
      -H "X-API-Key: dev_api_key_for_testing" \
      -d '{
        "item": "Evaluate the summary based on conciseness (1-10) and factual accuracy compared to the original text (1-10). Respond with ONLY a single score from 1 to 10, averaging the two criteria.",
        "type": "summary_quality",
        "metadata": { "version": "1.1", "author": "eval_team" }
      }'
    ```

  - Best Practice: Write clear, objective evaluation prompts for the judge. Ensure the prompt asks for a specific output format (like a single number). Keep track of the `id` returned in the response; you'll need it for evaluation.
- Run Evaluations:
  - Choose the model(s) you want to test (e.g., `gpt-4o`, `claude-3-opus-20240229`) and the evaluation metric IDs from step 2.
  - Action (Single Query): Send a `POST` request to `/api/judge/evaluate/query`:

    ```bash
    curl -X POST http://localhost:8000/api/judge/evaluate/query \
      -H "Content-Type: application/json" \
      -H "X-API-Key: dev_api_key_for_testing" \
      -d '{
        "query": "Summarize the provided article about renewable energy trends.",
        "model_id": "gpt-4o",
        "theme": "summarization",
        "evaluation_prompt_ids": ["<metric_id_from_step_2>"],
        "judge_model": "gpt-4"
      }'
    ```

  - Action (Entire Theme): Send a `POST` request to `/api/judge/evaluate/theme`:

    ```bash
    curl -X POST http://localhost:8000/api/judge/evaluate/theme \
      -H "Content-Type: application/json" \
      -H "X-API-Key: dev_api_key_for_testing" \
      -d '{
        "theme": "summarization",
        "model_id": "claude-3-opus-20240229",
        "evaluation_prompt_ids": ["<metric_id_from_step_2>"],
        "judge_model": "gpt-4",
        "limit": 50
      }'
    ```

  - Recommendation: Start by evaluating single queries or small theme batches (`limit`) to confirm prompts and metrics work as expected before running large-scale evaluations. Repeat for different models.
- Analyze Results:
  - Action: Open Grafana at `http://localhost:3000` and log in.
  - Explore the different dashboards:
    - Summary Dashboard: Overall statistics and trends.
    - Model Comparison: Side-by-side performance using bar and radar charts.
    - Theme Analysis: Heatmap showing model strengths/weaknesses across themes.
    - Detailed Results: A filterable table view of individual evaluation records.
  - Action (Programmatic): Use `GET /api/judge/results` with filters to fetch raw data for custom analysis (using `curl` or another HTTP client). Example:

    ```bash
    # Get the first 10 results for the 'summarization' theme by model 'gpt-4o'
    curl -G http://localhost:8000/api/judge/results \
      -H "X-API-Key: dev_api_key_for_testing" \
      --data-urlencode "theme=summarization" \
      --data-urlencode "model_id=gpt-4o" \
      --data-urlencode "limit=10"
    ```

  - Best Practice: Use the dashboard for high-level insights and trend spotting. Use the API or direct database queries for deep dives or specific statistical analysis.
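Once `GET /api/judge/results` has returned records, a common follow-up is aggregating scores per model. The response schema isn't documented here, so the field names below (`model_id`, `score`) are assumptions; adjust them to match what your instance actually returns.

```python
from collections import defaultdict


def mean_score_by_model(results):
    """Average judge scores per model.

    Assumes each record has a 'model_id' and a numeric 'score' field
    (hypothetical schema; check your /api/judge/results payload).
    """
    totals = defaultdict(lambda: [0.0, 0])  # model_id -> [sum, count]
    for rec in results:
        agg = totals[rec["model_id"]]
        agg[0] += rec["score"]
        agg[1] += 1
    return {model: total / count for model, (total, count) in totals.items()}


# Example with mock records:
records = [
    {"model_id": "gpt-4o", "score": 8},
    {"model_id": "gpt-4o", "score": 6},
    {"model_id": "claude-3-opus-20240229", "score": 9},
]
print(mean_score_by_model(records))
# {'gpt-4o': 7.0, 'claude-3-opus-20240229': 9.0}
```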
- Backend: Python, FastAPI
- Visualization: Grafana
- Database: PostgreSQL, pgvector
- LLM Interaction: LiteLLM, sentence-transformers
- Containerization: Docker, Docker Compose
- Database Migrations: Alembic
- Async: `asyncio`, `aiohttp`, `asyncpg`
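The async stack is what lets the judge service fan work out, e.g. evaluating many queries of a theme concurrently. Here is a stripped-down sketch of that pattern; the coroutine names are invented, and a placeholder stands in for the real HTTP/LLM call (which would use `aiohttp` in practice):

```python
import asyncio


async def evaluate_one(query: str) -> dict:
    # Placeholder for a real model call plus judge scoring.
    await asyncio.sleep(0)  # simulate I/O
    return {"query": query, "score": len(query) % 10}  # dummy score


async def evaluate_theme(queries: list[str], concurrency: int = 5) -> list[dict]:
    # Bound concurrency with a semaphore so provider APIs aren't hammered.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(q: str) -> dict:
        async with sem:
            return await evaluate_one(q)

    return await asyncio.gather(*(bounded(q) for q in queries))


results = asyncio.run(evaluate_theme(["q1", "q2", "q3"]))
```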
Contributions are welcome! Please follow the standard forking workflow:
- Fork the repository.
- Create a new branch (`git checkout -b feature/your-feature-name`).
- Make your changes.
- Commit your changes (`git commit -am 'Add some feature'`).
- Push to the branch (`git push origin feature/your-feature-name`).
- Create a new Pull Request.
Please ensure your code follows the style guidelines (Black, Ruff, isort) and includes tests where applicable.
This project is licensed under the MIT License - see the LICENSE file for details.
Happy Evaluating!