A local LLM evaluation lab for comparing prompting strategies. Run benchmarks across Zero-shot, Few-shot, Chain-of-thought, and Role-based prompting on extraction, classification, and summarisation tasks.
- Runs your chosen prompting strategies against a local Ollama model (or OpenAI if you have a key)
- Evaluates outputs using an LLM-as-judge with multi-judge consensus and position randomisation to reduce bias
- Streams live results to the frontend via SSE
- Logs everything to MLflow so you can compare runs over time
- Works completely offline with Ollama
- Backend: FastAPI + async SSE streaming
- Frontend: React + Recharts
- LLM: Ollama (local) or OpenAI
- Evaluation: custom LLM judge with bias reduction
- Tracking: MLflow
- Ollama installed and running
- At least one model pulled: `ollama pull llama3`
- Python 3.10+
- Node.js 18+
cd PromptPlayground
pip install -r backend/requirements.txt
# start MLflow (optional but recommended)
mlflow ui --port 5000
# start the API
uvicorn backend.app:app --reload --port 8000

cd PromptPlayground/frontend
npm install
npm run dev

- Select a task (Extraction, Classification, or Summarisation)
- Pick which prompting strategies to compare
- Choose your model and sample size
- Hit Run - results stream in live
- Check MLflow at http://localhost:5000 for historical comparisons
Demo mode works without any backend - useful for showing the UI or testing the frontend.
| Strategy | Description |
|---|---|
| Zero-shot | Just the task, no examples |
| Few-shot | 2-3 examples included in the prompt |
| Chain-of-thought | Ask the model to reason step by step |
| Role-based | Give the model a persona ("You are an expert...") |
Each generated output is scored by an LLM judge (the same local model) on a 0-10 scale. To reduce bias:
- Multi-judge consensus: the judge runs 3 times and scores are averaged
- Position randomisation: which answer is "Answer 1" vs "Answer 2" is randomised each time
For extraction tasks, entity F1 is also computed; for classification tasks, exact match is used.
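
As a rough sketch of these two metrics: the version below treats entities as a set of exact strings and normalises labels by trimming and lowercasing. Both choices are assumptions, and `metrics.py` may match entities or labels differently.

```python
def entity_f1(predicted: set[str], expected: set[str]) -> float:
    """Set-based F1 over exact-match entity strings."""
    if not predicted and not expected:
        return 1.0  # nothing to find and nothing predicted
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


def exact_match(predicted: str, expected: str) -> bool:
    """Compare labels after trimming whitespace and lowercasing
    (an assumed normalisation)."""
    return predicted.strip().lower() == expected.strip().lower()


# Two of three predicted entities are correct, and two of three expected
# entities are found, so precision and recall are both 2/3.
f1 = entity_f1({"Alice", "Paris", "Acme"}, {"Alice", "Paris", "Berlin"})
print(round(f1, 3), exact_match(" Positive ", "positive"))
```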
If you don't want the web UI, you can run benchmarks directly:
cd PromptPlayground
python main.py

This uses the local HuggingFace model (Qwen2.5-0.5B by default) instead of Ollama.
Upload your own CSV or JSONL files through the UI. Required columns:
CSV:
id,text,expected
1,"The quick brown fox...","positive"
JSONL:
{"id": "1", "text": "The quick brown fox...", "expected": "positive"}

Minimum 5 samples. The app auto-detects task type from the expected field format.
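
Before uploading, it can help to check a file against the requirements above. The following is a minimal sketch of such a check; `load_samples` and its error messages are hypothetical and do not mirror `task_upload.py`. Task-type auto-detection is not shown, since its rules are not described here.

```python
import csv
import io
import json

REQUIRED = {"id", "text", "expected"}
MIN_SAMPLES = 5


def load_samples(content: str, fmt: str) -> list[dict]:
    """Parse CSV or JSONL text and check required columns and minimum size."""
    if fmt == "csv":
        rows = list(csv.DictReader(io.StringIO(content)))
    elif fmt == "jsonl":
        rows = [json.loads(line) for line in content.splitlines() if line.strip()]
    else:
        raise ValueError(f"unsupported format: {fmt}")
    for i, row in enumerate(rows, start=1):
        missing = REQUIRED - set(row)
        if missing:
            raise ValueError(f"row {i} is missing {sorted(missing)}")
    if len(rows) < MIN_SAMPLES:
        raise ValueError(f"need at least {MIN_SAMPLES} samples, got {len(rows)}")
    return rows


sample_csv = "id,text,expected\n" + "".join(
    f'{i},"Sample text {i}","positive"\n' for i in range(1, 6)
)
rows = load_samples(sample_csv, "csv")
print(len(rows), rows[0])
```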
PromptPlayground/
├── backend/
│   ├── app.py              # FastAPI server + SSE streaming
│   ├── llm_engine.py       # Ollama async client
│   ├── model_router.py     # unified Ollama/OpenAI/local interface
│   ├── prompt_store.py     # versioned prompt storage (SQLite)
│   ├── database.py         # run history storage
│   ├── task_upload.py      # custom task upload handling
│   └── evaluator/
│       ├── llm_judge.py    # LLM-as-judge
│       ├── bias_reduction.py
│       └── metrics.py      # F1, exact match
├── frontend/
│   └── src/
│       └── App.jsx         # main UI
├── prompts/                # prompt builders
├── tasks/                  # built-in datasets
├── evaluator/              # shared evaluator (used by main.py)
├── llm_engine.py           # local HuggingFace engine
├── main.py                 # CLI benchmark runner
└── mlruns/                 # MLflow data (gitignored)
OPENAI_API_KEY= # optional, enables OpenAI models
OLLAMA_BASE_URL=http://localhost:11434
MLFLOW_TRACKING_URI=http://localhost:5000

- The LLM judge uses the same model you're evaluating, which introduces some bias. Using a stronger judge model (GPT-4o) gives more reliable scores.
- Self-consistency voting (3 runs, majority vote) is applied automatically for Chain-of-thought and Role-based on structured tasks.
- MLflow run data is stored in `mlruns/`. This directory is gitignored, but you can point `MLFLOW_TRACKING_URI` to a remote server.
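
The self-consistency voting mentioned in the notes above can be sketched in a few lines. The `generate` callable, the answer normalisation, and the canned stub are assumptions for illustration; the app's own voting logic is not shown here.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(generate: Callable[[str], str], prompt: str,
                           runs: int = 3) -> str:
    """Sample the model several times and return the most common answer."""
    # Normalise so that "Positive " and "positive" count as the same vote.
    answers = [generate(prompt).strip().lower() for _ in range(runs)]
    # most_common(1) returns the top answer; ties go to the one seen first.
    return Counter(answers).most_common(1)[0][0]


# Stub generator standing in for a model call: it replays canned outputs.
canned = iter(["positive", "negative", "Positive "])
winner = self_consistent_answer(lambda _prompt: next(canned),
                                "Classify the sentiment: I loved it.")
print(winner)
```

Majority voting only makes sense when answers are discrete labels or short structured values, which fits the note that it applies to structured tasks rather than free-form summaries.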
