rmusser01 · Stars

LLM-Evals

38 repositories
Python · 46 stars · 4 forks · Updated Oct 28, 2025

Visual Novel Translation Benchmark

Python · 1 star · Updated Oct 7, 2024

The Abstraction and Reasoning Corpus

JavaScript · 4,726 stars · 704 forks · Updated Apr 4, 2025

[ACL 2024 Demo] Official GitHub repo for UltraEval: an open-source framework for evaluating foundation models.

Python · 256 stars · 22 forks · Updated Oct 30, 2024

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Python · 17,991 stars · 2,908 forks · Updated Nov 3, 2025

Supercharge Your LLM Application Evaluations 🚀

Python · 12,880 stars · 1,286 forks · Updated Feb 24, 2026

The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]

Python · 347 stars · 54 forks · Updated Feb 20, 2026

UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform…

Python · 2,337 stars · 202 forks · Updated Aug 18, 2024

🐢 Open-Source Evaluation & Testing library for LLM Agents

Python · 5,152 stars · 405 forks · Updated Mar 10, 2026

Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with co…

TypeScript · 11,584 stars · 1,065 forks · Updated Mar 10, 2026

An open-source visual programming environment for battle-testing prompts to LLMs.

TypeScript · 2,957 stars · 252 forks · Updated Jan 2, 2026

Attribute (or cite) statements generated by LLMs back to in-context information.

Jupyter Notebook · 325 stars · 25 forks · Updated Oct 8, 2024

OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA 2, Qwen, GLM, Claude, etc.) over 100+ datasets.

Python · 6,740 stars · 742 forks · Updated Mar 10, 2026

BABILong is a benchmark for LLM evaluation using the needle-in-a-haystack approach.

Jupyter Notebook · 241 stars · 21 forks · Updated Sep 2, 2025
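As context for this entry, the general needle-in-a-haystack setup that benchmarks like BABILong build on can be sketched in a few lines: a known fact (the "needle") is inserted at varying depths into long filler text, and retrieval accuracy is measured as a function of depth and context length. The function names, filler text, and scoring below are illustrative assumptions, not BABILong's actual pipeline (which embeds bAbI-style facts in natural background text).

```python
def build_haystack_prompt(needle: str, depth: float, n_filler: int = 200) -> str:
    """Embed a 'needle' sentence at a relative depth (0.0 = start of the
    context, 1.0 = end) inside repetitive filler text, then ask about it."""
    filler = ["The sky was clear and the town was quiet that day."] * n_filler
    pos = int(depth * len(filler))
    context = " ".join(filler[:pos] + [needle] + filler[pos:])
    question = "What is the secret number mentioned in the text above?"
    return f"{context}\n\n{question}"

def needle_retrieved(model_answer: str, expected: str) -> bool:
    # Exact containment check; real harnesses often use fuzzy matching
    # or an LLM judge to score the retrieval instead.
    return expected in model_answer

# Sweep insertion depths to map where retrieval degrades
# (the actual model call is omitted from this sketch).
prompts = {d: build_haystack_prompt("The secret number is 41297.", d)
           for d in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

A harness would send each prompt to the model under test and plot retrieval success against depth and total context length.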

Tests for long context window evaluation

Python · 10 stars · 1 fork · Updated Jul 8, 2024

Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718

Python · 378 stars · 32 forks · Updated Sep 25, 2024

Open-source observability for your GenAI or LLM application, based on OpenTelemetry

Python · 6,893 stars · 896 forks · Updated Mar 8, 2026

A benchmark for role-playing language models

Python · 116 stars · 11 forks · Updated May 25, 2025

A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.

Python · 179 stars · 36 forks · Updated Feb 26, 2026

A unified evaluation framework for large language models

Python · 2,784 stars · 219 forks · Updated Feb 20, 2026

The HELMET Benchmark

Jupyter Notebook · 203 stars · 39 forks · Updated Feb 26, 2026

The repository for the paper "Distance between Relevant Information Pieces Causes Bias in Long-Context LLMs"

Python · 14 stars · 1 fork · Updated Dec 16, 2024

Python · 131 stars · 6 forks · Updated Feb 9, 2026

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

Python · 2,331 stars · 437 forks · Updated Mar 9, 2026

Official code for "Large Language Models as Optimizers"

Python · 713 stars · 88 forks · Updated Dec 4, 2024

Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"

1,825 stars · 154 forks · Updated Jun 17, 2025

URS Benchmark: Evaluating LLMs on User Reported Scenarios

Python · 30 stars · 1 fork · Updated May 30, 2025

Python · 215 stars · 17 forks · Updated Apr 2, 2025

Evaluating the faithfulness of long-context language models

Python · 30 stars · 2 forks · Updated Oct 21, 2024

VLM Evaluation: a benchmark for VLMs spanning text-generation tasks from VQA to captioning

Python · 136 stars · 18 forks · Updated Sep 17, 2024