
A comprehensive evaluation suite for LLM assessment with CI/CD integration, automated visualizations, and advanced analytics.
Features: Automated model evaluation pipelines • Interactive dashboards • Multi-provider support • Extensible architecture
- Multi-Model Support
  - Seamless integration with OpenAI, Groq, LM Studio, and custom models
  - Dynamic adapter system for easy extension
  - Local and cloud deployment options
- Comprehensive Evaluation
  - Refusal mapping and instruction following
  - Tone consistency and psychological reasoning
  - Performance metrics and benchmarking
- Developer Experience
  - CI/CD ready with GitHub Actions
  - Automated testing and validation
  - Detailed logging and debugging
  - Type hints and documentation
- Python 3.10 or higher
- Poetry (recommended) or pip
- Git
- Clone the repository

  git clone https://github.com/StressTestor/PromptPressure-EvalSuite.git
  cd PromptPressure-EvalSuite
- Set up the environment

  # Install dependencies
  pip install -r requirements.txt

  # For development
  pip install -r requirements-dev.txt
- Configure environment variables

  Create a `.env` file in the project root:

  # Required API Keys
  OPENAI_API_KEY=your-openai-key-here
  GROQ_API_KEY=your-groq-key-here

  # Optional: LM Studio configuration
  LMSTUDIO_ENDPOINT=http://localhost:1234/v1
# Run evaluation with default settings
python run_eval.py
# Run with specific model
python run_eval.py --model groq --output results.csv
# Generate visualizations
python visualization/generate_plots.py
# Run tests
pytest
# Format code
black .
# Check types
mypy .
# Lint code
flake8 .
For detailed documentation, please visit our documentation site.
- Automated CI/CD Pipeline
  - GitHub Actions workflow for automated testing
  - Trigger evaluations on dataset/model changes
  - Automatic artifact generation and storage
- Advanced Visualization
  - Interactive model performance dashboards
  - Success rate tracking over time
  - Latency distribution analysis
  - Model comparison tools
- Enhanced Adapter System
  - Dynamic adapter selection based on model provider (see the sketch after this list)
  - Built-in support for Groq, OpenAI, LM Studio, and Mock
  - Easy extension for new model providers
  - Comprehensive error handling and retries
- Evaluation Categories
  - Refusal Sensitivity
  - Instruction Following
  - Psychological Reasoning
  - Tone & Role Consistency
  - Emergent Story Logic
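To make the dynamic adapter selection concrete, here is a short sketch. It is not the suite's actual dispatch code; it only illustrates how an adapter could be looked up from `ADAPTER_REGISTRY` (described in the custom-adapter section later in this README) and wrapped with a simple retry. The `call_with_retries` helper is hypothetical.

```python
# Illustrative sketch only: registry lookup plus a simple retry wrapper.
# ADAPTER_REGISTRY and generate_response() are documented later in this README;
# call_with_retries() is a hypothetical helper, not part of the suite.
import time
from typing import Any, Dict

from adapters import ADAPTER_REGISTRY


def call_with_retries(provider: str, prompt: str, model_name: str,
                      config: Dict[str, Any], max_attempts: int = 3):
    generate = ADAPTER_REGISTRY[provider]      # dynamic adapter selection
    for attempt in range(1, max_attempts + 1):
        try:
            return generate(prompt, model_name, config)
        except Exception:                      # real code should catch narrower errors
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)           # simple exponential backoff


# e.g. response = call_with_retries("mock", "Say hello.", "mock-model", {})
```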
.
├── .github/
│   └── workflows/                 # GitHub Actions workflows
│       └── evaluation.yml         # CI/CD pipeline definition
├── adapters/                      # Model adapters
│   ├── __init__.py                # Adapter registry
│   ├── groq_adapter.py            # Groq API integration
│   ├── openai_adapter.py          # OpenAI API integration
│   ├── lmstudio_adapter.py        # Local LM Studio integration
│   └── mock_adapter.py            # Mock adapter for testing
├── visualization/                 # Visualization scripts
│   ├── generate_plots.py          # Plot generation
│   └── dashboard.py               # Interactive dashboard (coming soon)
├── configs/                       # Configuration presets
│   ├── default.yaml               # Default configuration
│   └── lmstudio.yaml              # LM Studio specific settings
├── datasets/                      # Evaluation datasets
│   └── evals_dataset.json         # Default evaluation dataset
├── outputs/                       # Evaluation results
│   └── results_*.csv              # Generated result files
├── run_eval.py                    # Main evaluation script
├── deepseek_post_analysis.py      # Analysis and reporting
├── requirements.txt               # Production dependencies
├── requirements-dev.txt           # Development dependencies
└── README.md                      # This file
- Python 3.10+
- Git
- (Optional) Docker for containerized execution
- Clone the repository

  git clone https://github.com/StressTestor/PromptPressure-EvalSuite.git
  cd PromptPressure-EvalSuite
- Set up a virtual environment

  python -m venv venv
  source venv/bin/activate   # On Windows: venv\Scripts\activate
- Install dependencies

  # Core dependencies
  pip install -r requirements.txt

  # For development and visualization
  pip install -r requirements-dev.txt
- Configure environment variables

  Create a `.env` file in the project root:

  # Required API Keys
  OPENAI_API_KEY=your-openai-key-here
  GROQ_API_KEY=your-groq-key-here

  # Optional: LM Studio configuration
  LMSTUDIO_ENDPOINT=http://localhost:1234/v1

  # Logging
  LOG_LEVEL=INFO
  ERROR_LOG=error_log.txt
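For reference, these values can be pulled into your own scripts once the `.env` file exists. This is a minimal sketch using the python-dotenv package and `os.environ`; how the suite itself loads its configuration is not shown here and may differ.

```python
# Minimal sketch: load .env into the environment and read the configured keys.
# PromptPressure's own configuration loading may work differently.
import os

from dotenv import load_dotenv  # from the python-dotenv package

load_dotenv()  # reads .env in the current working directory

openai_key = os.environ.get("OPENAI_API_KEY")
groq_key = os.environ.get("GROQ_API_KEY")
lmstudio_endpoint = os.environ.get("LMSTUDIO_ENDPOINT", "http://localhost:1234/v1")
log_level = os.environ.get("LOG_LEVEL", "INFO")

if not (openai_key or groq_key):
    raise SystemExit("Set at least one API key in .env before running evaluations.")
```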
# Run with default settings
python run_eval.py
# Specify model and output file
python run_eval.py --model groq --output outputs/results.csv
# Run post-analysis and generate visualizations
python deepseek_post_analysis.py
# Run with custom configuration
python run_eval.py --config configs/custom.yaml
# Generate visualizations only
python visualization/generate_plots.py --input outputs/ --output visualization/
# Run in simulation mode (no API calls)
python run_eval.py --simulation --model mock
The GitHub Actions workflow automatically runs on:
- Pushes to `main` or `release/*` branches
- Changes to `datasets/`, `models/`, or `adapters/`
- Manual workflow dispatch
- Go to GitHub Actions
- Select "Run workflow"
- Choose branch and optional parameters
- View results in the Actions tab
# Generate comparison report
python visualization/compare_models.py --input outputs/ --output reports/
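For a quick one-off chart outside the bundled scripts, the result CSVs can also be plotted directly. A minimal sketch with pandas and matplotlib, assuming a numeric `latency_s` column; check the column names in your own `outputs/` files, as they may differ.

```python
# Hypothetical one-off latency plot; the "latency_s" column name is an assumption.
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

frames = [pd.read_csv(p) for p in Path("outputs").glob("results_*.csv")]
results = pd.concat(frames, ignore_index=True)

results["latency_s"].plot.hist(bins=30)
plt.xlabel("Latency (s)")
plt.ylabel("Responses")
plt.title("Latency distribution across runs")

Path("reports").mkdir(exist_ok=True)
plt.savefig("reports/latency_distribution.png")
```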
- OpenAI Adapter - For GPT-4, GPT-3.5, and other OpenAI models
- Groq Adapter - For ultra-fast inference with Groq API
- LM Studio Adapter - For local model inference
- Mock Adapter - For testing and development
- Create a new file in `adapters/` (e.g., `my_adapter.py`)
- Implement the required interface:

  from typing import Any, Dict, Iterator, List, Union

  def generate_response(
      prompt: str,
      model_name: str,
      config: Dict[str, Any],
  ) -> Union[str, List[str], Iterator[str]]:
      # Your implementation here
      pass

- Register it in `adapters/__init__.py`:

  from .my_adapter import generate_response as my_adapter_resp

  ADAPTER_REGISTRY = {
      # ... existing adapters ...
      'my_adapter': my_adapter_resp
  }
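As a concrete, purely illustrative example, a minimal adapter that never calls a real API could look like the following; a production adapter would call the provider's SDK or HTTP endpoint and map `config` keys such as temperature onto the request.

```python
# adapters/my_adapter.py -- hypothetical echo adapter, for illustration only.
from typing import Any, Dict, Iterator, List, Union


def generate_response(
    prompt: str,
    model_name: str,
    config: Dict[str, Any],
) -> Union[str, List[str], Iterator[str]]:
    """Return a canned reply instead of calling a provider API."""
    temperature = config.get("temperature", 0.7)  # illustrative config lookup
    return f"[{model_name} @ T={temperature}] You said: {prompt}"
```

Once registered, the adapter should be selectable like the built-in ones (for example via `--model my_adapter`, assuming `run_eval.py` maps `--model` values onto registry keys).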
- Success Rate: Percentage of successful responses
- Latency: Response time in seconds
- Token Usage: Input/Output tokens per request
- Error Rate: Percentage of failed requests
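These metrics can also be recomputed offline from the result CSVs. A minimal sketch, assuming a boolean `success` column and a numeric `latency_s` column; adjust the names to match your actual output schema.

```python
# Hypothetical offline metrics summary; column names are assumptions.
from pathlib import Path

import pandas as pd

frames = [pd.read_csv(p) for p in Path("outputs").glob("results_*.csv")]
results = pd.concat(frames, ignore_index=True)

success_rate = results["success"].mean() * 100   # % successful responses
error_rate = 100 - success_rate                  # % failed requests
p50, p95 = results["latency_s"].quantile([0.5, 0.95])

print(f"Success rate: {success_rate:.1f}%  |  Error rate: {error_rate:.1f}%")
print(f"Latency p50: {p50:.2f}s  |  p95: {p95:.2f}s")
```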
- Local Monitoring

  # Start monitoring dashboard
  python -m http.server 8000 -d visualization/

  Then open http://localhost:8000 in your browser.

- Cloud Monitoring
  - Configure your preferred monitoring solution (Datadog, Prometheus, etc.)
  - Point it to the metrics endpoint (coming in v1.6)
We welcome contributions! Here's how to get started:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
# Install development dependencies
pip install -r requirements-dev.txt
# Run tests
pytest tests/
# Format code
black .
# Check types
mypy .
# Lint code
flake8 .
MIT License. See LICENSE for more information.
- Project maintained by Joseph Grey
- Built with ❤️ for the AI community
- Inspired by OpenAI's API Research & Evals initiative