
A comprehensive evaluation suite for LLM assessment with CI/CD integration, automated visualizations, and advanced analytics.
Features: Automated model evaluation pipelines • Interactive dashboards • Multi-provider support • Extensible architecture
- Multi-Model Support
  - Seamless integration with OpenAI, Groq, LM Studio, and custom models
  - Dynamic adapter system for easy extension
  - Local and cloud deployment options
- Comprehensive Evaluation
  - Refusal mapping and instruction following
  - Tone consistency and psychological reasoning
  - Performance metrics and benchmarking
- Developer Experience
  - CI/CD ready with GitHub Actions
  - Automated testing and validation
  - Detailed logging and debugging
  - Type hints and documentation
- Python 3.10 or higher
- Poetry (recommended) or pip
- Git
- Clone the repository

  git clone https://github.com/StressTestor/PromptPressure-EvalSuite.git
  cd PromptPressure-EvalSuite
- Set up the environment

  # Install dependencies
  pip install -r requirements.txt

  # For development
  pip install -r requirements-dev.txt
- Configure environment variables

  Create a `.env` file in the project root:

  # Required API Keys
  OPENAI_API_KEY=your-openai-key-here
  GROQ_API_KEY=your-groq-key-here

  # Optional: LM Studio configuration
  LMSTUDIO_ENDPOINT=http://localhost:1234/v1
# Run evaluation with default settings
python run_eval.py
# Run with specific model
python run_eval.py --model groq --output results.csv
# Generate visualizations
python visualization/generate_plots.py
# Run tests
pytest
# Format code
black .
# Check types
mypy .
# Lint code
flake8 .
For detailed documentation, please visit our documentation site.
- Automated CI/CD Pipeline
  - GitHub Actions workflow for automated testing
  - Trigger evaluations on dataset/model changes
  - Automatic artifact generation and storage
- Advanced Visualization
  - Interactive model performance dashboards
  - Success rate tracking over time
  - Latency distribution analysis
  - Model comparison tools
- Enhanced Adapter System
  - Dynamic adapter selection based on model provider (see the sketch after this list)
  - Built-in support for Groq, OpenAI, LM Studio, and Mock
  - Easy extension for new model providers
  - Comprehensive error handling and retries
- Evaluation Categories
  - Refusal Sensitivity
  - Instruction Following
  - Psychological Reasoning
  - Tone & Role Consistency
  - Emergent Story Logic
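To make the dynamic adapter selection concrete, here is a short sketch. It is not the suite's actual dispatch code; it only illustrates how an adapter could be looked up from `ADAPTER_REGISTRY` (described in the custom-adapter section later in this README) and wrapped with a simple retry. The `call_with_retries` helper is hypothetical.

```python
# Illustrative sketch only: registry lookup plus a simple retry wrapper.
# ADAPTER_REGISTRY and generate_response() are documented later in this README;
# call_with_retries() is a hypothetical helper, not part of the suite.
import time
from typing import Any, Dict

from adapters import ADAPTER_REGISTRY


def call_with_retries(provider: str, prompt: str, model_name: str,
                      config: Dict[str, Any], max_attempts: int = 3):
    generate = ADAPTER_REGISTRY[provider]      # dynamic adapter selection
    for attempt in range(1, max_attempts + 1):
        try:
            return generate(prompt, model_name, config)
        except Exception:                      # real code should catch narrower errors
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)           # simple exponential backoff


# e.g. response = call_with_retries("mock", "Say hello.", "mock-model", {})
```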
.
├── .github/
│   └── workflows/                 # GitHub Actions workflows
│       └── evaluation.yml         # CI/CD pipeline definition
├── adapters/                      # Model adapters
│   ├── __init__.py                # Adapter registry
│   ├── groq_adapter.py            # Groq API integration
│   ├── openai_adapter.py          # OpenAI API integration
│   ├── lmstudio_adapter.py        # Local LM Studio integration
│   └── mock_adapter.py            # Mock adapter for testing
├── visualization/                 # Visualization scripts
│   ├── generate_plots.py          # Plot generation
│   └── dashboard.py               # Interactive dashboard (coming soon)
├── configs/                       # Configuration presets
│   ├── default.yaml               # Default configuration
│   └── lmstudio.yaml              # LM Studio specific settings
├── datasets/                      # Evaluation datasets
│   └── evals_dataset.json         # Default evaluation dataset
├── outputs/                       # Evaluation results
│   └── results_*.csv              # Generated result files
├── run_eval.py                    # Main evaluation script
├── deepseek_post_analysis.py      # Analysis and reporting
├── requirements.txt               # Production dependencies
├── requirements-dev.txt           # Development dependencies
└── README.md                      # This file
- Python 3.10+
- Git
- (Optional) Docker for containerized execution
- Clone the repository

  git clone https://github.com/StressTestor/PromptPressure-EvalSuite.git
  cd PromptPressure-EvalSuite
- Set up a virtual environment

  python -m venv venv
  source venv/bin/activate   # On Windows: venv\Scripts\activate
- Install dependencies

  # Core dependencies
  pip install -r requirements.txt

  # For development and visualization
  pip install -r requirements-dev.txt
- Configure environment variables

  Create a `.env` file in the project root:

  # Required API Keys
  OPENAI_API_KEY=your-openai-key-here
  GROQ_API_KEY=your-groq-key-here

  # Optional: LM Studio configuration
  LMSTUDIO_ENDPOINT=http://localhost:1234/v1

  # Logging
  LOG_LEVEL=INFO
  ERROR_LOG=error_log.txt
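For reference, these values can be pulled into your own scripts once the `.env` file exists. This is a minimal sketch using the python-dotenv package and `os.environ`; how the suite itself loads its configuration is not shown here and may differ.

```python
# Minimal sketch: load .env into the environment and read the configured keys.
# PromptPressure's own configuration loading may work differently.
import os

from dotenv import load_dotenv  # from the python-dotenv package

load_dotenv()  # reads .env in the current working directory

openai_key = os.environ.get("OPENAI_API_KEY")
groq_key = os.environ.get("GROQ_API_KEY")
lmstudio_endpoint = os.environ.get("LMSTUDIO_ENDPOINT", "http://localhost:1234/v1")
log_level = os.environ.get("LOG_LEVEL", "INFO")

if not (openai_key or groq_key):
    raise SystemExit("Set at least one API key in .env before running evaluations.")
```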
# Run with default settings
python run_eval.py
# Specify model and output file
python run_eval.py --model groq --output outputs/results.csv
# Run post-analysis and generate visualizations
python deepseek_post_analysis.py
# Run with custom configuration
python run_eval.py --config configs/custom.yaml
# Generate visualizations only
python visualization/generate_plots.py --input outputs/ --output visualization/
# Run in simulation mode (no API calls)
python run_eval.py --simulation --model mock
The GitHub Actions workflow automatically runs on:
- Pushes to `main` or `release/*` branches
- Changes to `datasets/`, `models/`, or `adapters/`
- Manual workflow dispatch
- Go to GitHub Actions
- Select "Run workflow"
- Choose branch and optional parameters
- View results in the Actions tab
# Generate comparison report
python visualization/compare_models.py --input outputs/ --output reports/
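For a quick one-off chart outside the bundled scripts, the result CSVs can also be plotted directly. A minimal sketch with pandas and matplotlib, assuming a numeric `latency_s` column; check the column names in your own `outputs/` files, as they may differ.

```python
# Hypothetical one-off latency plot; the "latency_s" column name is an assumption.
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

frames = [pd.read_csv(p) for p in Path("outputs").glob("results_*.csv")]
results = pd.concat(frames, ignore_index=True)

results["latency_s"].plot.hist(bins=30)
plt.xlabel("Latency (s)")
plt.ylabel("Responses")
plt.title("Latency distribution across runs")

Path("reports").mkdir(exist_ok=True)
plt.savefig("reports/latency_distribution.png")
```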
- OpenAI Adapter - For GPT-4, GPT-3.5, and other OpenAI models
- Groq Adapter - For ultra-fast inference with Groq API
- LM Studio Adapter - For local model inference
- Mock Adapter - For testing and development
- Create a new file in `adapters/` (e.g., `my_adapter.py`)
- Implement the required interface:

  from typing import Any, Dict, Iterator, List, Union

  def generate_response(
      prompt: str,
      model_name: str,
      config: Dict[str, Any],
  ) -> Union[str, List[str], Iterator[str]]:
      # Your implementation here
      pass

- Register it in `adapters/__init__.py`:

  from .my_adapter import generate_response as my_adapter_resp

  ADAPTER_REGISTRY = {
      # ... existing adapters ...
      'my_adapter': my_adapter_resp
  }
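As a concrete, purely illustrative example, a minimal adapter that never calls a real API could look like the following; a production adapter would call the provider's SDK or HTTP endpoint and map `config` keys such as temperature onto the request.

```python
# adapters/my_adapter.py -- hypothetical echo adapter, for illustration only.
from typing import Any, Dict, Iterator, List, Union


def generate_response(
    prompt: str,
    model_name: str,
    config: Dict[str, Any],
) -> Union[str, List[str], Iterator[str]]:
    """Return a canned reply instead of calling a provider API."""
    temperature = config.get("temperature", 0.7)  # illustrative config lookup
    return f"[{model_name} @ T={temperature}] You said: {prompt}"
```

Once registered, the adapter should be selectable like the built-in ones (for example via `--model my_adapter`, assuming `run_eval.py` maps `--model` values onto registry keys).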
- Success Rate: Percentage of successful responses
- Latency: Response time in seconds
- Token Usage: Input/Output tokens per request
- Error Rate: Percentage of failed requests
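These metrics can also be recomputed offline from the result CSVs. A minimal sketch, assuming a boolean `success` column and a numeric `latency_s` column; adjust the names to match your actual output schema.

```python
# Hypothetical offline metrics summary; column names are assumptions.
from pathlib import Path

import pandas as pd

frames = [pd.read_csv(p) for p in Path("outputs").glob("results_*.csv")]
results = pd.concat(frames, ignore_index=True)

success_rate = results["success"].mean() * 100   # % successful responses
error_rate = 100 - success_rate                  # % failed requests
p50, p95 = results["latency_s"].quantile([0.5, 0.95])

print(f"Success rate: {success_rate:.1f}%  |  Error rate: {error_rate:.1f}%")
print(f"Latency p50: {p50:.2f}s  |  p95: {p95:.2f}s")
```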
- Local Monitoring

  # Start monitoring dashboard
  python -m http.server 8000 -d visualization/

  Then open http://localhost:8000 in your browser.

- Cloud Monitoring
  - Configure your preferred monitoring solution (Datadog, Prometheus, etc.)
  - Point it to the metrics endpoint (coming in v1.6)
We welcome contributions! Here's how to get started:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
# Install development dependencies
pip install -r requirements-dev.txt
# Run tests
pytest tests/
# Format code
black .
# Check types
mypy .
# Lint code
flake8 .
MIT License. See LICENSE for more information.
- Project maintained by Joseph Grey
- Built with ❤️ for the AI community
- Inspired by OpenAI's API Research & Evals initiative