PromptPressure Eval Suite


A comprehensive evaluation suite for LLM assessment with CI/CD integration, automated visualizations, and advanced analytics.

Features: Automated model evaluation pipelines • Interactive dashboards • Multi-provider support • Extensible architecture

🌟 Features

  • Multi-Model Support

    • Seamless integration with OpenAI, Groq, LM Studio, and custom models
    • Dynamic adapter system for easy extension
    • Local and cloud deployment options
  • Comprehensive Evaluation

    • Refusal mapping and instruction following
    • Tone consistency and psychological reasoning
    • Performance metrics and benchmarking
  • Developer Experience

    • CI/CD ready with GitHub Actions
    • Automated testing and validation
    • Detailed logging and debugging
    • Type hints and documentation
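
The adapter layer can also be driven directly from Python. A minimal sketch, assuming the registry is exported from adapters/__init__.py as described in the Adapters section below and that the mock adapter needs no API keys:

# Illustrative only: dispatch a prompt through the adapter registry.
# ADAPTER_REGISTRY and the generate_response signature are documented in
# the Adapters section; the registry key and config values here are
# assumptions made for the sake of the example.
from adapters import ADAPTER_REGISTRY

generate = ADAPTER_REGISTRY["mock"]   # select a provider by name
reply = generate(
    "Summarize the evaluation suite in one sentence.",
    "mock-model",                     # hypothetical model name
    {"temperature": 0.7},             # hypothetical config
)
print(reply)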

📚 Documentation

For detailed documentation, please visit our documentation site.


🚀 Key Features (v1.5.3+)

  • Automated CI/CD Pipeline

    • GitHub Actions workflow for automated testing
    • Trigger evaluations on dataset/model changes
    • Automatic artifact generation and storage
  • Advanced Visualization

    • Interactive model performance dashboards
    • Success rate tracking over time
    • Latency distribution analysis
    • Model comparison tools
  • Enhanced Adapter System

    • Dynamic adapter selection based on model provider
    • Built-in support for Groq, OpenAI, LM Studio, and Mock
    • Easy extension for new model providers
    • Comprehensive error handling and retries
  • Evaluation Categories

    • Refusal Sensitivity
    • Instruction Following
    • Psychological Reasoning
    • Tone & Role Consistency
    • Emergent Story Logic
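
The categories above map to entries in datasets/evals_dataset.json (see Project Structure below). The real schema is defined by that file; purely as an illustration, a single item might pair a prompt with its category and the behaviour being scored:

# Hypothetical shape of one evaluation item. Field names are illustrative
# only -- consult datasets/evals_dataset.json for the actual schema used
# by run_eval.py.
example_item = {
    "id": "refusal-001",
    "category": "Refusal Sensitivity",
    "prompt": "Explain how to pick a lock.",
    "expected_behavior": "Polite refusal with a brief rationale",
}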

πŸ“ Project Structure

.
├── .github/
│   └── workflows/            # GitHub Actions workflows
│       └── evaluation.yml    # CI/CD pipeline definition
├── adapters/                 # Model adapters
│   ├── __init__.py           # Adapter registry
│   ├── groq_adapter.py       # Groq API integration
│   ├── openai_adapter.py     # OpenAI API integration
│   ├── lmstudio_adapter.py   # Local LM Studio integration
│   └── mock_adapter.py       # Mock adapter for testing
├── visualization/            # Visualization scripts
│   ├── generate_plots.py     # Plot generation
│   └── dashboard.py          # Interactive dashboard (coming soon)
├── configs/                  # Configuration presets
│   ├── default.yaml          # Default configuration
│   └── lmstudio.yaml         # LM Studio-specific settings
├── datasets/                 # Evaluation datasets
│   └── evals_dataset.json    # Default evaluation dataset
├── outputs/                  # Evaluation results
│   └── results_*.csv         # Generated result files
├── run_eval.py               # Main evaluation script
├── deepseek_post_analysis.py # Analysis and reporting
├── requirements.txt          # Production dependencies
├── requirements-dev.txt      # Development dependencies
└── README.md                 # This file

🚀 Quick Start

Prerequisites

  • Python 3.10+
  • Git
  • (Optional) Docker for containerized execution

Installation

  1. Clone the repository

    git clone https://github.com/StressTestor/PromptPressure-EvalSuite.git
    cd PromptPressure-EvalSuite
  2. Set up a virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    # Core dependencies
    pip install -r requirements.txt
    
    # For development and visualization
    pip install -r requirements-dev.txt
  4. Configure environment variables Create a .env file in the project root:

    # Required API Keys
    OPENAI_API_KEY=your-openai-key-here
    GROQ_API_KEY=your-groq-key-here
    
    # Optional: LM Studio configuration
    LMSTUDIO_ENDPOINT=http://localhost:1234/v1
    
    # Logging
    LOG_LEVEL=INFO
    ERROR_LOG=error_log.txt
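
These variables are read from the environment at run time. A minimal loading sketch, assuming the python-dotenv package is available (the suite's own startup code may load configuration differently):

# Illustrative only: pull .env values into the process environment.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root
groq_key = os.getenv("GROQ_API_KEY")
log_level = os.getenv("LOG_LEVEL", "INFO")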

Running Evaluations

Basic Usage

# Run with default settings
python run_eval.py

# Specify model and output file
python run_eval.py --model groq --output outputs/results.csv

# Run post-analysis and generate visualizations
python deepseek_post_analysis.py

Advanced Usage

# Run with custom configuration
python run_eval.py --config configs/custom.yaml

# Generate visualizations only
python visualization/generate_plots.py --input outputs/ --output visualization/

# Run in simulation mode (no API calls)
python run_eval.py --simulation --model mock
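
Custom presets such as configs/custom.yaml are plain YAML files. A minimal sketch of reading one in Python, assuming PyYAML is installed; the keys shown are hypothetical and the shipped presets in configs/ may use different names:

# Illustrative only: load a configuration preset.
import yaml

with open("configs/custom.yaml", encoding="utf-8") as fh:
    config = yaml.safe_load(fh)

model = config.get("model", "mock")                   # hypothetical key
output = config.get("output", "outputs/results.csv")  # hypothetical key
print(f"Would evaluate {model!r} and write {output!r}")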

📊 CI/CD Integration

The GitHub Actions workflow automatically runs on:

  • Pushes to main or release/* branches
  • Changes to datasets/, models/, or adapters/
  • Manual workflow dispatch

Manual Trigger

  1. Go to GitHub Actions
  2. Select "Run workflow"
  3. Choose branch and optional parameters
  4. View results in the Actions tab

📈 Visualizations

Success Rate Over Time

[Plot: model success rate over time]

Latency Distribution

[Plot: response latency distribution]

Model Comparison

# Generate comparison report
python visualization/compare_models.py --input outputs/ --output reports/

🔌 Adapters

Available Adapters

  • OpenAI Adapter - For GPT-4, GPT-3.5, and other OpenAI models
  • Groq Adapter - For ultra-fast inference with Groq API
  • LM Studio Adapter - For local model inference
  • Mock Adapter - For testing and development

Adding a New Adapter

  1. Create a new file in adapters/ (e.g., my_adapter.py)
  2. Implement the required interface:
    from typing import Any, Dict, Iterator, List, Union

    def generate_response(
        prompt: str,
        model_name: str,
        config: Dict[str, Any]
    ) -> Union[str, List[str], Iterator[str]]:
        # Your implementation here
        pass
  3. Register it in adapters/__init__.py:
    from .my_adapter import generate_response as my_adapter_resp
    
    ADAPTER_REGISTRY = {
        # ... existing adapters ...
        'my_adapter': my_adapter_resp
    }
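
For reference, a complete (if trivial) adapter that satisfies this interface might look like the sketch below. It simply echoes the prompt and is not one of the shipped adapters; a real adapter would call its provider's API and handle errors and retries as described above.

# adapters/my_adapter.py -- illustrative sketch of the documented interface.
from typing import Any, Dict, Iterator, List, Union


def generate_response(
    prompt: str,
    model_name: str,
    config: Dict[str, Any],
) -> Union[str, List[str], Iterator[str]]:
    # Echo the prompt back, tagged with the model name.
    return f"[{model_name}] {prompt}"

Once registered, the adapter can be selected like any other provider, e.g. python run_eval.py --model my_adapter (assuming the --model flag resolves provider names through the registry).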

📊 Metrics & Monitoring

Key Metrics

  • Success Rate: Percentage of successful responses
  • Latency: Response time in seconds
  • Token Usage: Input/Output tokens per request
  • Error Rate: Percentage of failed requests
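
Because result files in outputs/ are plain CSVs, these metrics can also be recomputed offline. A standard-library sketch; the column names ("success", "latency_s") are assumptions, so check an actual outputs/results_*.csv for the real headers:

# Illustrative only: recompute success rate and mean latency from results.
import csv

with open("outputs/results.csv", encoding="utf-8", newline="") as fh:
    rows = list(csv.DictReader(fh))

successes = sum(1 for r in rows if r.get("success", "").lower() == "true")
latencies = [float(r["latency_s"]) for r in rows if r.get("latency_s")]

if rows:
    print(f"Success rate: {successes / len(rows):.1%}")
if latencies:
    print(f"Mean latency: {sum(latencies) / len(latencies):.2f}s")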

Monitoring Setup

  1. Local Monitoring

    # Serve the generated visualizations locally
    python -m http.server 8000 -d visualization/

    Then open http://localhost:8000 in your browser

  2. Cloud Monitoring

    • Configure your preferred monitoring solution (Datadog, Prometheus, etc.)
    • Point it to the metrics endpoint (coming in v1.6)

🤝 Contributing

We welcome contributions! Here's how to get started:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Setup

# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
pytest tests/

# Format code
black .

# Check types
mypy .

# Lint code
flake8 .

📄 License

MIT License. See LICENSE for more information.

πŸ™ Acknowledgments

  • Project maintained by Joseph Grey
  • Built with ❤️ for the AI community
  • Inspired by OpenAI's API Research & Evals initiative

Made with Python and ❤️ | v1.5.3