LLM-Evals
[ACL 2024 Demo] Official GitHub repo for UltraEval: An open-source framework for evaluating foundation models.
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Supercharge Your LLM Application Evaluations 🚀
The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] (a multiple-choice scoring sketch follows after this list)
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform…
🐢 Open-Source Evaluation & Testing library for LLM Agents
Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with co…
An open-source visual programming environment for battle-testing prompts to LLMs.
Attribute (or cite) statements generated by LLMs back to in-context information.
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
BABILong is a benchmark for LLM evaluation using the needle-in-a-haystack approach (a minimal sketch of this setup follows after this list).
Tests for long context window evaluation
Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718
Open-source observability for your GenAI or LLM application, based on OpenTelemetry
A benchmark for role-playing language models
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
A unified evaluation framework for large language models
The repository for the paper "Distance between Relevant Information Pieces Causes Bias in Long-Context LLMs"
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
Official code for "Large Language Models as Optimizers"
Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"
URS Benchmark: Evaluating LLMs on User Reported Scenarios
Evaluating the faithfulness of long-context language models
VLM Evaluation: Benchmark for VLMs, spanning text generation tasks from VQA to Captioning
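
For the multiple-choice benchmarks above (e.g. the MMLU-Pro entry), the core scoring loop is roughly the sketch below. This is an illustrative Python sketch, not any benchmark's official harness: `ask_model` is a hypothetical stand-in for whatever model API is being evaluated, and the single-letter answer-extraction regex is an assumed convention.

```python
import re
from dataclasses import dataclass


@dataclass
class MCQuestion:
    prompt: str
    options: list[str]   # up to ten options, labelled A..J as in MMLU-Pro-style sets
    answer: str          # gold letter, e.g. "B"


def format_prompt(q: MCQuestion) -> str:
    letters = "ABCDEFGHIJ"
    lines = [q.prompt] + [f"{letters[i]}. {opt}" for i, opt in enumerate(q.options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def extract_letter(response: str) -> str | None:
    # Assumed convention: the first standalone capital letter A-J is the answer.
    match = re.search(r"\b([A-J])\b", response)
    return match.group(1) if match else None


def score(questions: list[MCQuestion], ask_model) -> float:
    # `ask_model` is a hypothetical prompt -> completion callable.
    correct = sum(extract_letter(ask_model(format_prompt(q))) == q.answer
                  for q in questions)
    return correct / len(questions)


if __name__ == "__main__":
    qs = [MCQuestion("2 + 2 = ?", ["3", "4", "5"], "B")]
    print(score(qs, ask_model=lambda p: "The answer is B."))  # dummy model -> 1.0
```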
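
The long-context entries above (BABILong, the needle-in-a-haystack tests, ∞Bench) all revolve around burying a fact in filler text and asking the model to retrieve it. The sketch below shows the general idea under stated assumptions: `query_model` is a hypothetical callable standing in for a real chat API, and the needle, filler, and scoring rule are illustrative, not taken from any of the listed benchmarks.

```python
import random

NEEDLE = "The secret passcode is 7421."
QUESTION = "What is the secret passcode?"
EXPECTED = "7421"


def build_haystack(filler_sentences: list[str], context_chars: int, depth: float) -> str:
    """Pad with filler until `context_chars`, burying the needle at a relative depth (0.0-1.0)."""
    filler = []
    while sum(len(s) for s in filler) < context_chars:
        filler.append(random.choice(filler_sentences))
    insert_at = int(len(filler) * depth)
    return " ".join(filler[:insert_at] + [NEEDLE] + filler[insert_at:])


def run_trial(query_model, filler_sentences, context_chars, depth) -> bool:
    """Return True if the model retrieves the needle from the padded context."""
    context = build_haystack(filler_sentences, context_chars, depth)
    answer = query_model(f"{context}\n\nQuestion: {QUESTION}")
    return EXPECTED in answer


if __name__ == "__main__":
    filler = [
        "The sky was a pale shade of grey that morning.",
        "Nothing of note happened in the village that day.",
    ]
    dummy_model = lambda prompt: "The passcode is 7421."  # stand-in for a real API call
    hits = sum(run_trial(dummy_model, filler, context_chars=8_000, depth=d / 10)
               for d in range(11))
    print(f"retrieval accuracy across insertion depths: {hits}/11")
```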