PromptPlayground

A local LLM evaluation lab for comparing prompting strategies. Run benchmarks across Zero-shot, Few-shot, Chain-of-thought, and Role-based prompting on extraction, classification, and summarisation tasks.

What it does

Runs your chosen prompting strategies against a local Ollama model (or OpenAI if you have a key)
Evaluates outputs using an LLM-as-judge with multi-judge consensus and position randomisation to reduce bias
Streams live results to the frontend via SSE
Logs everything to MLflow so you can compare runs over time
Works completely offline with Ollama

Stack

Backend: FastAPI + async SSE streaming
Frontend: React + Recharts
LLM: Ollama (local) or OpenAI
Evaluation: custom LLM judge with bias reduction
Tracking: MLflow

Prerequisites

Ollama installed and running
At least one model pulled: ollama pull llama3
Python 3.10+
Node.js 18+

Setup

Backend

cd PromptPlayground
pip install -r backend/requirements.txt

# start MLflow (optional but recommended)
mlflow ui --port 5000

# start the API
uvicorn backend.app:app --reload --port 8000

Frontend

cd PromptPlayground/frontend
npm install
npm run dev

Open http://localhost:5173

Usage

Select a task (Extraction, Classification, or Summarisation)
Pick which prompting strategies to compare
Choose your model and sample size
Hit Run - results stream in live
Check MLflow at http://localhost:5000 for historical comparisons

Demo mode works without any backend - useful for showing the UI or testing the frontend.

Prompting strategies

Strategy	Description
Zero-shot	Just the task, no examples
Few-shot	2-3 examples included in the prompt
Chain-of-thought	Ask the model to reason step by step
Role-based	Give the model a persona ("You are an expert...")
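
As a rough illustration of how the four strategies differ, here is a minimal sketch of prompt builders. The function names, the `Input:`/`Output:` layout, and the default role string are assumptions for illustration only; they are not taken from the project's `prompts/` package.

```python
from typing import List, Tuple

Example = Tuple[str, str]  # (input text, expected output)


def zero_shot(task: str, text: str) -> str:
    """Task instruction plus the input, with no examples."""
    return f"{task}\n\nInput: {text}\nOutput:"


def few_shot(task: str, text: str, examples: List[Example]) -> str:
    """Prepend a few worked examples (typically 2-3) before the real input."""
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{task}\n\n{shots}\n\nInput: {text}\nOutput:"


def chain_of_thought(task: str, text: str) -> str:
    """Ask the model to reason step by step before answering."""
    return (
        f"{task}\n\nInput: {text}\n"
        "Think step by step, then give the final answer "
        "on a line starting with 'Answer:'."
    )


def role_based(task: str, text: str, role: str = "an expert annotator") -> str:
    """Give the model a persona before stating the task (role is hypothetical)."""
    return f"You are {role}.\n\n{task}\n\nInput: {text}\nOutput:"
```

Each builder returns a plain string, so the same task and input can be sent through every strategy and the outputs compared side by side.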

Evaluation

Each generated output is scored by an LLM judge (the same local model) on a 0-10 scale. To reduce bias:

Multi-judge consensus: the judge runs 3 times and scores are averaged
Position randomisation: which answer is "Answer 1" vs "Answer 2" is randomised each time
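
The two bias-reduction steps can be sketched as follows. This is a simplified illustration, not the code in `backend/evaluator/bias_reduction.py`: the `Judge` signature (two answers in, one 0-10 score per slot out) and the name `consensus_score` are assumptions.

```python
import random
from statistics import mean
from typing import Callable, Optional, Tuple

# Hypothetical judge signature: receives (answer_1, answer_2) and returns a
# 0-10 score for each slot. In the real app this would be an LLM call.
Judge = Callable[[str, str], Tuple[float, float]]


def consensus_score(
    judge: Judge,
    candidate: str,
    other: str,
    runs: int = 3,
    rng: Optional[random.Random] = None,
) -> float:
    """Average the candidate's score over several judge runs,
    randomising which slot the candidate occupies on each run."""
    rng = rng or random.Random()
    scores = []
    for _ in range(runs):
        if rng.random() < 0.5:
            # Candidate shown as "Answer 1".
            cand_score, _unused = judge(candidate, other)
        else:
            # Candidate shown as "Answer 2".
            _unused, cand_score = judge(other, candidate)
        scores.append(cand_score)
    return mean(scores)
```

Randomising the slot means that a judge with a preference for whichever answer appears first affects both answers roughly equally, and averaging over runs smooths out run-to-run variation in the judge's scores.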

For extraction tasks, entity F1 is also computed. For classification tasks, exact match against the expected label is computed.
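
For reference, a minimal sketch of the two metrics is shown below. It assumes entities are compared as sets of case-insensitive, whitespace-trimmed strings; the project's `metrics.py` may normalise or match entities differently.

```python
from typing import Iterable


def _norm(s: str) -> str:
    """Normalise a string for comparison (assumed: trim and lowercase)."""
    return s.strip().lower()


def entity_f1(predicted: Iterable[str], expected: Iterable[str]) -> float:
    """Set-based F1 between predicted and expected entities."""
    pred = {_norm(e) for e in predicted}
    gold = {_norm(e) for e in expected}
    if not pred and not gold:
        return 1.0  # nothing to find and nothing predicted
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def exact_match(predicted: str, expected: str) -> bool:
    """True when the predicted label equals the expected label."""
    return _norm(predicted) == _norm(expected)
```

For example, predicting `["Paris", "Berlin"]` against expected `["paris", "Rome"]` gives one true positive, so precision and recall are both 0.5 and F1 is 0.5.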

Running the CLI benchmark

If you don't want the web UI, you can run benchmarks directly:

cd PromptPlayground
python main.py

This uses a local Hugging Face model (Qwen2.5-0.5B by default) instead of Ollama.

Custom tasks

Upload your own CSV or JSONL files through the UI. Required columns:

CSV:

id,text,expected
1,"The quick brown fox...","positive"

JSONL:

{"id": "1", "text": "The quick brown fox...", "expected": "positive"}

Each file needs at least 5 samples. The app auto-detects the task type from the format of the expected field.
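
If you want to check a file before uploading it, the rules above can be sketched as a small validator. This is an illustrative helper, not the logic in `backend/task_upload.py`; the function name `load_samples` and the error messages are made up for this example.

```python
import csv
import io
import json
from typing import Dict, List

REQUIRED = ("id", "text", "expected")
MIN_SAMPLES = 5


def load_samples(content: str, fmt: str) -> List[Dict[str, str]]:
    """Parse CSV or JSONL text and check required fields and minimum size."""
    if fmt == "csv":
        rows = list(csv.DictReader(io.StringIO(content)))
    elif fmt == "jsonl":
        # One JSON object per non-empty line.
        rows = [json.loads(line) for line in content.splitlines() if line.strip()]
    else:
        raise ValueError(f"unsupported format: {fmt}")

    for i, row in enumerate(rows, start=1):
        missing = [k for k in REQUIRED if k not in row or row[k] in (None, "")]
        if missing:
            raise ValueError(f"row {i} is missing: {', '.join(missing)}")

    if len(rows) < MIN_SAMPLES:
        raise ValueError(f"need at least {MIN_SAMPLES} samples, got {len(rows)}")
    return rows
```

Calling `load_samples(open("my_task.csv").read(), "csv")` either returns the parsed rows or raises a `ValueError` describing the first problem found (`my_task.csv` is a placeholder file name).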

Project structure

PromptPlayground/
├── backend/
│   ├── app.py               # FastAPI server + SSE streaming
│   ├── llm_engine.py        # Ollama async client
│   ├── model_router.py      # unified Ollama/OpenAI/local interface
│   ├── prompt_store.py      # versioned prompt storage (SQLite)
│   ├── database.py          # run history storage
│   ├── task_upload.py       # custom task upload handling
│   └── evaluator/
│       ├── llm_judge.py     # LLM-as-judge
│       ├── bias_reduction.py
│       └── metrics.py       # F1, exact match
├── frontend/
│   └── src/
│       └── App.jsx          # main UI
├── prompts/                 # prompt builders
├── tasks/                   # built-in datasets
├── evaluator/               # shared evaluator (used by main.py)
├── llm_engine.py            # local HuggingFace engine
├── main.py                  # CLI benchmark runner
└── mlruns/                  # MLflow data (gitignored)

Environment variables

OPENAI_API_KEY=          # optional, enables OpenAI models
OLLAMA_BASE_URL=http://localhost:11434
MLFLOW_TRACKING_URI=http://localhost:5000
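
A minimal sketch of reading these variables in Python is shown below. The `Settings` class and `load_settings` function are hypothetical, and treating the URLs above as fallback defaults is an assumption; the backend may load its configuration differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: Optional[str]
    ollama_base_url: str
    mlflow_tracking_uri: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from the environment, with assumed local defaults."""
    return Settings(
        # An empty or unset key means OpenAI models stay disabled.
        openai_api_key=env.get("OPENAI_API_KEY") or None,
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        mlflow_tracking_uri=env.get("MLFLOW_TRACKING_URI", "http://localhost:5000"),
    )
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise in tests without touching the real environment.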

Notes

The LLM judge uses the same model you're evaluating, which introduces some bias. Using a stronger judge model (GPT-4o) gives more reliable scores.
Self-consistency voting (3 runs, majority vote) is applied automatically for Chain-of-thought and Role-based on structured tasks.
MLflow run data is stored in mlruns/, which is gitignored. To keep runs elsewhere, point MLFLOW_TRACKING_URI at a remote server.
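
The self-consistency voting mentioned in the notes can be sketched as a simple majority vote over repeated generations. This is an assumption-laden illustration: `generate` stands in for whatever call produces one model answer, and the lowercase/trim normalisation is a guess at how answers are compared.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(generate: Callable[[], str], runs: int = 3) -> str:
    """Sample the model several times and return the most common answer."""
    answers = [generate().strip().lower() for _ in range(runs)]
    # On a tie, most_common keeps the answer that was seen first.
    return Counter(answers).most_common(1)[0][0]
```

With 3 runs, any answer that appears at least twice wins; if all three differ, the first answer is returned.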
