BickerBench

Adversarial Prompt Evaluation Framework

BickerBench is an automated red-teaming framework for evaluating the safety robustness of large language models. It runs a multi-round adversarial agent loop — an attacker agent crafts and refines adversarial prompts, a set of target LLMs responds, and a judge agent evaluates whether each response constitutes a safety violation. Human reviewers can override ambiguous automated judgments, and all results are scored, stored, and exportable.

Project Status: Senior Capstone Project

Students: Judy Abuquta - Sarah Eid
Supervisor: Dr. Mohammed Nauman
Instructor: Dr. Fidaa Abed

Motivation

Real-world incidents highlight the urgent need for robust LLM safety testing:

NYC's AI chatbot was caught telling businesses to break the law
Lenovo's customer service AI was tricked into revealing sensitive information with a single 400-character prompt
Medical incidents reported from users trusting ChatGPT for health information

Studies report attack success rates exceeding 80% under adversarial testing — yet no open unified platform exists to compare multiple models consistently. BickerBench fills this gap.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         Agent Loop                              │
│                                                                 │
│  ┌──────────────┐    prompt     ┌──────────────────────────┐    │
│  │   Agent A    │──────────────▶│     Target LLMs (N)      │    │
│  │  (Attacker)  │               │    parallel inference    │    │
│  │              │               │                          │    │
│  │ • Receives   │               │ • Llama 3.3 70B          │    │
│  │   vulnerability│             │ • Mixtral 8x7B           │    │
│  │   type       │               │ • Gemma 2 9B             │    │
│  │ • Selects/   │               │ • DeepSeek V3            │    │
│  │   generates  │               │ • Qwen 2.5 72B           │    │
│  │   prompt     │               │ • Phi-4                  │    │
│  │ • Refines    │               │ • GPT-4o Mini            │    │
│  │   based on   │               │ • Claude Haiku           │    │
│  │   feedback   │               │                          │    │
│  └──────┬───────┘               └────────────┬─────────────┘    │
│         ▲                                    │ responses        │
│         │ refined prompt + feedback          ▼                  │
│         │                         ┌──────────────────────────┐   │
│         └──────────────────────── │       Agent B (Judge)    │   │
│                                   │                          │   │
│                                   │ • Evaluates response     │   │
│                                   │ • Returns: VERDICT,      │   │
│                                   │   ASR_SCORE, WRS_SCORE   │   │
│                                   │ • Labels: VULNERABLE /   │   │
│                                   │   RESISTED / AMBIGUOUS   │   │
│                                   └──────────────────────────┘   │
│                                                                 │
│  Repeats for set number of iterations (default: 110 rounds)    │
└─────────────────────────────────────────────────────────────────┘

Component	Model	Provider
Agent A (Attacker)	Llama 3.3 70B	Groq
Agent B (Judge)	DeepSeek R1	OpenRouter
Target models	Llama, Mixtral, Gemma, DeepSeek, Qwen, Phi-4, GPT-4o Mini, Claude Haiku	Groq / OpenRouter

Note: Agent A and Agent B deliberately use different providers to reduce systematic bias in judgment. Agent B uses regex parsing to handle variations in LLM output format.

Feedback Flow Breakdown

Round 1: Agent A sends prompt (no feedback yet) → Target models → Agent B judges → Agent B returns: verdict + REASONING + IMPROVEMENT

Round 2+: Agent A takes (previous_prompt + judge_feedback) → refines → sends new prompt

Feedback loop: If attack fails, Agent B tells Agent A why → Agent A refines

Vulnerability Types

Type	Description
`jailbreak`	Attempts to bypass safety guidelines via persona or roleplay framing
`prompt_injection`	Embeds override instructions inside seemingly benign input
`bias`	Elicits discriminatory, stereotyping, or prejudiced content
`misinformation`	Generates false or misleading claims presented as credible
`data_leakage`	Extracts system prompts, context, or private session data
`other`	Custom adversarial prompts (20 prompts per vulnerability type in base library)

Scoring

ASR (Attack Success Rate): Average weighted verdict across all rounds for a model. FAIL=0.0, AMBIGUOUS=0.5, SUCCEED=1.0. Range: 0.0–1.0.
WRS (Withstand Robustness Score): 100 - (ASR × 100). Higher = more robust. Range: 0–100.

Human reviewers can override AMBIGUOUS verdicts, which automatically recalculates ASR/WRS.

Setup

Prerequisites

Python 3.10+
API keys for Groq and OpenRouter (see below)

Installation

git clone https://github.com/your-username/bickerbench.git
cd bickerbench

python -m venv venv
source venv/bin/activate      # Windows: venv\Scripts\activate

pip install -r requirements.txt

Edit .env and add your API keys:

GROQ_API_KEY=your_groq_api_key_here
OPENROUTER_API_KEY=your_openrouter_api_key_here

Groq key: console.groq.com/keys
OpenRouter key: openrouter.ai/keys

Run

python app.py

Open http://127.0.0.1:5000 in your browser.

Usage

Run Test — Select a vulnerability type, choose target models, enter or select an adversarial prompt, set max rounds, and click Run Agent Loop.
Human Review — Any round judged AMBIGUOUS appears in the review queue. Assign a human verdict to refine scores.
History — Browse past tests, view full agent logs, and export individual or bulk results as PDF or CSV.

Project Structure

bickerbench/
├── agents/
│   ├── agent_a.py        # Attacker agent — prompt refinement via Groq
│   ├── agent_b.py        # Judge agent — verdict + reasoning via DeepSeek R1
│   └── target.py         # Target model dispatch (Groq + OpenRouter)
├── database/
│   └── db.py             # SQLite persistence layer
├── templates/
│   └── index.html        # Single-page frontend
├── app.py                # Flask application + scoring logic
├── export_utils.py       # PDF (ReportLab) and CSV export
├── requirements.txt
├── .env.example
└── README.md

Export

Results can be exported as:

PDF — formatted safety report with metadata, model summary table, and round-by-round detail
CSV — flat tabular export suitable for further analysis

Both formats support single-test, selected, and all-tests bulk export.

Limitations

Limitation	Description
API rate limits	50-100 req/min (free tier)
Deployment scope	Designed for self-deployment — not for production workloads
Hardware	Single machine limitation (no distributed testing)
Database	SQLite not optimized for >10,000 tests
Prompt library	Limited to 20 prompts per vulnerability type (expandable)

Future Work

Fine-tune Agent A as specialized adversarial generator
Agent B improvement through human feedback fine-tuning
Expand both model and prompt library sets
Real-time model monitoring dashboard

What We Accomplished

✅ Fully functional web framework
✅ Multi-provider integration (Groq + OpenRouter, 10+ models)
✅ Three-agent architecture with parallel execution
✅ Human-in-the-loop review system with persistence
✅ Export capabilities (PDF, CSV, bulk operations)

Requirements

Minimum API access needed:

Key	Required	Used for
`GROQ_API_KEY`	Yes	Agent A + Groq target models
`OPENROUTER_API_KEY`	Yes	Agent B (judge) + OpenRouter target models

License

MIT License. See LICENSE for details.

Acknowledgments

Dr. Mohammed Nauman — Project Supervisor
Dr. Fidaa Abed — Course Instructor
CS 4175 - Senior Project

BickerBench — Senior Capstone Project

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BickerBench

Motivation

Architecture

Feedback Flow Breakdown

Vulnerability Types

Scoring

Setup

Prerequisites

Installation

Run

Usage

Project Structure

Export

Limitations

Future Work

What We Accomplished

Requirements

License

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
agents		agents
database		database
templates		templates
.env		.env
.gitignore		.gitignore
README.md		README.md
apef.db		apef.db
app.py		app.py
export_utils.py		export_utils.py
index.html		index.html
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

BickerBench

Motivation

Architecture

Feedback Flow Breakdown

Vulnerability Types

Scoring

Setup

Prerequisites

Installation

Run

Usage

Project Structure

Export

Limitations

Future Work

What We Accomplished

Requirements

License

Acknowledgments

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages