rutuforai/PromptPlayground

PromptPlayground

A local LLM evaluation lab for comparing prompting strategies. Run benchmarks across Zero-shot, Few-shot, Chain-of-thought, and Role-based prompting on extraction, classification, and summarisation tasks.

PromptPlayground UI

What it does

  • Runs your chosen prompting strategies against a local Ollama model (or an OpenAI model if you have an API key)
  • Evaluates outputs using an LLM-as-judge with multi-judge consensus and position randomisation to reduce bias
  • Streams live results to the frontend via SSE
  • Logs everything to MLflow so you can compare runs over time
  • Works completely offline with Ollama
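As a rough illustration of the SSE streaming mentioned above, the sketch below frames result dictionaries as Server-Sent Events. The function names, the event names ("result", "done"), and the payload fields are assumptions made for this example, not the actual code in backend/app.py.

```python
import json
from typing import Iterable, Iterator, Optional


def format_sse(data: dict, event: Optional[str] = None) -> str:
    """Frame one payload as a Server-Sent Event (hypothetical helper)."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    # json.dumps emits a single line by default, so one data: field is enough.
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def stream_results(results: Iterable[dict]) -> Iterator[str]:
    """Yield one SSE frame per result, then a final 'done' frame (assumed protocol)."""
    for result in results:
        yield format_sse(result, event="result")
    yield format_sse({"status": "done"}, event="done")


if __name__ == "__main__":
    for frame in stream_results([{"strategy": "zero_shot", "score": 7}]):
        print(frame, end="")
```

In a FastAPI app, a generator like this could plausibly be wrapped in a StreamingResponse with media_type="text/event-stream"; whether the project does it this way is not confirmed here.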

Stack

  • Backend: FastAPI + async SSE streaming
  • Frontend: React + Recharts
  • LLM: Ollama (local) or OpenAI
  • Evaluation: custom LLM judge with bias reduction
  • Tracking: MLflow

Prerequisites

  • Ollama installed and running
  • At least one model pulled: ollama pull llama3
  • Python 3.10+
  • Node.js 18+

Setup

Backend

cd PromptPlayground
pip install -r backend/requirements.txt

# start MLflow in a separate terminal (optional but recommended); it runs in the foreground
mlflow ui --port 5000

# start the API
uvicorn backend.app:app --reload --port 8000

Frontend

cd PromptPlayground/frontend
npm install
npm run dev

Open http://localhost:5173

Usage

  1. Select a task (Extraction, Classification, or Summarisation)
  2. Pick which prompting strategies to compare
  3. Choose your model and sample size
  4. Click Run; results stream in live
  5. Check MLflow at http://localhost:5000 for historical comparisons

Demo mode works without any backend, which is useful for showing the UI or testing the frontend.

Prompting strategies

Strategy          Description
Zero-shot         Just the task, no examples
Few-shot          2-3 examples included in the prompt
Chain-of-thought  Ask the model to reason step by step
Role-based        Give the model a persona ("You are an expert...")
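To make the four strategies concrete, here is a minimal sketch of how one input could be turned into a prompt under each of them. The strategy keys, the template wording, and the few-shot examples are all hypothetical; the builders in prompts/ may look quite different.

```python
# Hypothetical few-shot examples; the real ones live in the repository.
FEW_SHOT_EXAMPLES = [
    ("I loved this film.", "positive"),
    ("The service was terrible.", "negative"),
]


def build_prompt(strategy: str, task_instruction: str, text: str) -> str:
    """Return a prompt for one input under the named strategy (illustrative only)."""
    question = f"Input: {text}\nAnswer:"
    if strategy == "zero_shot":
        # Just the task, no examples.
        return f"{task_instruction}\n\n{question}"
    if strategy == "few_shot":
        # Prepend worked examples in the same Input/Answer layout.
        shots = "\n\n".join(f"Input: {x}\nAnswer: {y}" for x, y in FEW_SHOT_EXAMPLES)
        return f"{task_instruction}\n\n{shots}\n\n{question}"
    if strategy == "chain_of_thought":
        # Ask for step-by-step reasoning before the answer.
        return (
            f"{task_instruction}\n"
            "Think step by step, then give the final answer.\n\n"
            f"{question}"
        )
    if strategy == "role_based":
        # Open with a persona.
        return f"You are an expert annotator.\n{task_instruction}\n\n{question}"
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    print(build_prompt("few_shot", "Classify the sentiment.", "Great phone."))
```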

Evaluation

Each generated output is scored by an LLM judge (the same local model) on a 0-10 scale. To reduce bias:

  • Multi-judge consensus: the judge runs 3 times and scores are averaged
  • Position randomisation: which answer is "Answer 1" vs "Answer 2" is randomised each time
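The two bias-reduction steps above can be sketched as follows. This is an illustration under assumptions, not the code in backend/evaluator/bias_reduction.py: the judge is modelled as a callable that takes two answers in presentation order and returns a 0-10 score for each, and the names consensus_scores and JudgeFn are made up for the example.

```python
import random
from statistics import mean
from typing import Callable, Optional, Tuple

# Assumed judge interface: (answer_1, answer_2) -> (score_1, score_2).
JudgeFn = Callable[[str, str], Tuple[float, float]]


def consensus_scores(
    answer_a: str,
    answer_b: str,
    judge: JudgeFn,
    runs: int = 3,
    rng: Optional[random.Random] = None,
) -> Tuple[float, float]:
    """Average judge scores over several runs with randomised answer order."""
    rng = rng or random.Random()
    scores_a, scores_b = [], []
    for _ in range(runs):
        # Randomly decide which answer is shown as "Answer 1".
        swapped = rng.random() < 0.5
        first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
        score_first, score_second = judge(first, second)
        # Map positional scores back to the answers they belong to.
        if swapped:
            score_first, score_second = score_second, score_first
        scores_a.append(score_first)
        scores_b.append(score_second)
    return mean(scores_a), mean(scores_b)
```

Averaging over randomised positions means a judge that systematically favours "Answer 1" spreads that preference across both answers rather than always rewarding the same one.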

For extraction tasks, entity F1 is also computed. For classification tasks, exact match is also computed.
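A minimal sketch of these two metrics, assuming entities are compared as exact strings in a set and that exact match ignores case and surrounding whitespace. Both normalisation choices are assumptions; backend/evaluator/metrics.py may define them differently.

```python
from typing import Iterable


def exact_match(predicted: str, expected: str) -> bool:
    """True if the labels agree after trimming and lowercasing (assumed normalisation)."""
    return predicted.strip().lower() == expected.strip().lower()


def entity_f1(predicted: Iterable[str], expected: Iterable[str]) -> float:
    """Set-based F1 between predicted and expected entities."""
    pred, gold = set(predicted), set(expected)
    if not pred and not gold:
        # Nothing to find and nothing predicted: treat as a perfect score.
        return 1.0
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(entity_f1({"Paris", "Acme"}, {"Paris", "Bob"}))
    print(exact_match(" Positive", "positive"))
```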

Running the CLI benchmark

If you don't want the web UI, you can run benchmarks directly:

cd PromptPlayground
python main.py

This runs a local Hugging Face model (Qwen2.5-0.5B by default) instead of Ollama.

Custom tasks

Upload your own CSV or JSONL files through the UI. Required columns:

CSV:

id,text,expected
1,"The quick brown fox...","positive"

JSONL:

{"id": "1", "text": "The quick brown fox...", "expected": "positive"}

Minimum 5 samples. The app auto-detects task type from the expected field format.
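The format rules above could be checked before upload with something like the sketch below. It is not the logic in backend/task_upload.py; the function name, the error messages, and the choice to dispatch on file extension are assumptions, and task-type auto-detection is not covered.

```python
import csv
import io
import json

REQUIRED_FIELDS = ("id", "text", "expected")
MIN_SAMPLES = 5


def parse_task_file(content: str, filename: str) -> list[dict]:
    """Parse CSV or JSONL task data and enforce the documented requirements."""
    name = filename.lower()
    if name.endswith(".jsonl"):
        # One JSON object per non-empty line.
        rows = [json.loads(line) for line in content.splitlines() if line.strip()]
    elif name.endswith(".csv"):
        # The first line is the header row.
        rows = list(csv.DictReader(io.StringIO(content)))
    else:
        raise ValueError("expected a .csv or .jsonl file")

    for index, row in enumerate(rows, start=1):
        if not isinstance(row, dict):
            raise ValueError(f"row {index} is not an object")
        missing = [f for f in REQUIRED_FIELDS if f not in row or row[f] in (None, "")]
        if missing:
            raise ValueError(f"row {index} is missing: {', '.join(missing)}")

    if len(rows) < MIN_SAMPLES:
        raise ValueError(f"need at least {MIN_SAMPLES} samples, got {len(rows)}")
    return rows
```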

Project structure

PromptPlayground/
├── backend/
│   ├── app.py               # FastAPI server + SSE streaming
│   ├── llm_engine.py        # Ollama async client
│   ├── model_router.py      # unified Ollama/OpenAI/local interface
│   ├── prompt_store.py      # versioned prompt storage (SQLite)
│   ├── database.py          # run history storage
│   ├── task_upload.py       # custom task upload handling
│   └── evaluator/
│       ├── llm_judge.py     # LLM-as-judge
│       ├── bias_reduction.py
│       └── metrics.py       # F1, exact match
├── frontend/
│   └── src/
│       └── App.jsx          # main UI
├── prompts/                 # prompt builders
├── tasks/                   # built-in datasets
├── evaluator/               # shared evaluator (used by main.py)
├── llm_engine.py            # local HuggingFace engine
├── main.py                  # CLI benchmark runner
└── mlruns/                  # MLflow data (gitignored)

Environment variables

OPENAI_API_KEY=          # optional, enables OpenAI models
OLLAMA_BASE_URL=http://localhost:11434
MLFLOW_TRACKING_URI=http://localhost:5000
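For reference, a small sketch of reading these variables in Python. The Settings class and load_settings function are hypothetical, and using the two URLs shown above as fallback defaults is an assumption about how the backend behaves.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: Optional[str]  # None means OpenAI models stay disabled
    ollama_base_url: str
    mlflow_tracking_uri: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the three variables, falling back to the local defaults (assumed)."""
    return Settings(
        # Treat an empty value the same as an unset one.
        openai_api_key=env.get("OPENAI_API_KEY") or None,
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        mlflow_tracking_uri=env.get("MLFLOW_TRACKING_URI", "http://localhost:5000"),
    )


if __name__ == "__main__":
    print(load_settings())
```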

Notes

  • The LLM judge uses the same model you're evaluating, which introduces some bias. A stronger judge model (for example GPT-4o) gives more reliable scores.
  • Self-consistency voting (3 runs, majority vote) is applied automatically to Chain-of-thought and Role-based prompting on structured tasks.
  • MLflow run data is stored in mlruns/, which is gitignored. To use a remote server instead, point MLFLOW_TRACKING_URI at it.
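The self-consistency voting mentioned in the notes above amounts to a majority vote over repeated answers. A minimal sketch, assuming answers are short labels that can be compared after trimming and lowercasing (the function name and the normalisation are not taken from the repository):

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer after light normalisation."""
    if not answers:
        raise ValueError("need at least one answer to vote on")
    normalised = [answer.strip().lower() for answer in answers]
    # On a tie, Counter keeps the answer that was seen first.
    winner, _count = Counter(normalised).most_common(1)[0]
    return winner


if __name__ == "__main__":
    # Three runs of the same prompt, as in the 3-run setup described above.
    print(majority_vote(["Positive", "negative", " positive "]))
```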
