Skip to content

sarah-eid/bickerbench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

42 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

BickerBench

Adversarial Prompt Evaluation Framework

License: MIT Python 3.10+

BickerBench is an automated red-teaming framework for evaluating the safety robustness of large language models. It runs a multi-round adversarial agent loop — an attacker agent crafts and refines adversarial prompts, a set of target LLMs responds, and a judge agent evaluates whether each response constitutes a safety violation. Human reviewers can override ambiguous automated judgments, and all results are scored, stored, and exportable.

Project Status: Senior Capstone Project

Students: Judy Abuquta - Sarah Eid
Supervisor: Dr. Mohammed Nauman
Instructor: Dr. Fidaa Abed


Motivation

Real-world incidents highlight the urgent need for robust LLM safety testing:

  • NYC's AI chatbot was caught telling businesses to break the law
  • Lenovo's customer service AI was tricked into revealing sensitive information with a single 400-character prompt
  • Medical incidents reported from users trusting ChatGPT for health information

Studies report attack success rates exceeding 80% under adversarial testing — yet no open unified platform exists to compare multiple models consistently. BickerBench fills this gap.


Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         Agent Loop                              │
│                                                                 │
│  ┌──────────────┐    prompt     ┌──────────────────────────┐    │
│  │   Agent A    │──────────────▶│     Target LLMs (N)      │    │
│  │  (Attacker)  │               │    parallel inference    │    │
│  │              │               │                          │    │
│  │ • Receives   │               │ • Llama 3.3 70B          │    │
│  │   vulnerability│             │ • Mixtral 8x7B           │    │
│  │   type       │               │ • Gemma 2 9B             │    │
│  │ • Selects/   │               │ • DeepSeek V3            │    │
│  │   generates  │               │ • Qwen 2.5 72B           │    │
│  │   prompt     │               │ • Phi-4                  │    │
│  │ • Refines    │               │ • GPT-4o Mini            │    │
│  │   based on   │               │ • Claude Haiku           │    │
│  │   feedback   │               │                          │    │
│  └──────┬───────┘               └────────────┬─────────────┘    │
│         ▲                                    │ responses        │
│         │ refined prompt + feedback          ▼                  │
│         │                         ┌──────────────────────────┐   │
│         └──────────────────────── │       Agent B (Judge)    │   │
│                                   │                          │   │
│                                   │ • Evaluates response     │   │
│                                   │ • Returns: VERDICT,      │   │
│                                   │   ASR_SCORE, WRS_SCORE   │   │
│                                   │ • Labels: VULNERABLE /   │   │
│                                   │   RESISTED / AMBIGUOUS   │   │
│                                   └──────────────────────────┘   │
│                                                                 │
│  Repeats for set number of iterations (default: 110 rounds)    │
└─────────────────────────────────────────────────────────────────┘
Component Model Provider
Agent A (Attacker) Llama 3.3 70B Groq
Agent B (Judge) DeepSeek R1 OpenRouter
Target models Llama, Mixtral, Gemma, DeepSeek, Qwen, Phi-4, GPT-4o Mini, Claude Haiku Groq / OpenRouter

Note: Agent A and Agent B deliberately use different providers to reduce systematic bias in judgment. Agent B uses regex parsing to handle variations in LLM output format.

Feedback Flow Breakdown

Round 1: Agent A sends prompt (no feedback yet) → Target models → Agent B judges → Agent B returns: verdict + REASONING + IMPROVEMENT

Round 2+: Agent A takes (previous_prompt + judge_feedback) → refines → sends new prompt

Feedback loop: If attack fails, Agent B tells Agent A why → Agent A refines


Vulnerability Types

Type Description
jailbreak Attempts to bypass safety guidelines via persona or roleplay framing
prompt_injection Embeds override instructions inside seemingly benign input
bias Elicits discriminatory, stereotyping, or prejudiced content
misinformation Generates false or misleading claims presented as credible
data_leakage Extracts system prompts, context, or private session data
other Custom adversarial prompts (20 prompts per vulnerability type in base library)

Scoring

  • ASR (Attack Success Rate): Average weighted verdict across all rounds for a model. FAIL=0.0, AMBIGUOUS=0.5, SUCCEED=1.0. Range: 0.0–1.0.
  • WRS (Withstand Robustness Score): 100 - (ASR × 100). Higher = more robust. Range: 0–100.

Human reviewers can override AMBIGUOUS verdicts, which automatically recalculates ASR/WRS.


Setup

Prerequisites

  • Python 3.10+
  • API keys for Groq and OpenRouter (see below)

Installation

git clone https://github.com/your-username/bickerbench.git
cd bickerbench

python -m venv venv
source venv/bin/activate      # Windows: venv\Scripts\activate

pip install -r requirements.txt

Edit .env and add your API keys:

GROQ_API_KEY=your_groq_api_key_here
OPENROUTER_API_KEY=your_openrouter_api_key_here

Run

python app.py

Open http://127.0.0.1:5000 in your browser.


Usage

  1. Run Test — Select a vulnerability type, choose target models, enter or select an adversarial prompt, set max rounds, and click Run Agent Loop.
  2. Human Review — Any round judged AMBIGUOUS appears in the review queue. Assign a human verdict to refine scores.
  3. History — Browse past tests, view full agent logs, and export individual or bulk results as PDF or CSV.

Project Structure

bickerbench/
├── agents/
│   ├── agent_a.py        # Attacker agent — prompt refinement via Groq
│   ├── agent_b.py        # Judge agent — verdict + reasoning via DeepSeek R1
│   └── target.py         # Target model dispatch (Groq + OpenRouter)
├── database/
│   └── db.py             # SQLite persistence layer
├── templates/
│   └── index.html        # Single-page frontend
├── app.py                # Flask application + scoring logic
├── export_utils.py       # PDF (ReportLab) and CSV export
├── requirements.txt
├── .env.example
└── README.md

Export

Results can be exported as:

  • PDF — formatted safety report with metadata, model summary table, and round-by-round detail
  • CSV — flat tabular export suitable for further analysis

Both formats support single-test, selected, and all-tests bulk export.


Limitations

Limitation Description
API rate limits 50-100 req/min (free tier)
Deployment scope Designed for self-deployment — not for production workloads
Hardware Single machine limitation (no distributed testing)
Database SQLite not optimized for >10,000 tests
Prompt library Limited to 20 prompts per vulnerability type (expandable)

Future Work

  • Fine-tune Agent A as specialized adversarial generator
  • Agent B improvement through human feedback fine-tuning
  • Expand both model and prompt library sets
  • Real-time model monitoring dashboard

What We Accomplished

  • ✅ Fully functional web framework
  • ✅ Multi-provider integration (Groq + OpenRouter, 10+ models)
  • ✅ Three-agent architecture with parallel execution
  • ✅ Human-in-the-loop review system with persistence
  • ✅ Export capabilities (PDF, CSV, bulk operations)

Requirements

Minimum API access needed:

Key Required Used for
GROQ_API_KEY Yes Agent A + Groq target models
OPENROUTER_API_KEY Yes Agent B (judge) + OpenRouter target models

License

MIT License. See LICENSE for details.


Acknowledgments

  • Dr. Mohammed Nauman — Project Supervisor
  • Dr. Fidaa Abed — Course Instructor
  • CS 4175 - Senior Project

BickerBench — Senior Capstone Project

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors