LLM-Evals
[ACL 2024 Demo] Official GitHub repo for UltraEval: An open-source framework for evaluating foundation models.
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Supercharge Your LLM Application Evaluations 🚀
The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] (a multiple-choice scoring sketch follows after this list)
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform…
🐢 Open-Source Evaluation & Testing library for LLM Agents
Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with co…
An open-source visual programming environment for battle-testing prompts to LLMs.
Attribute (or cite) statements generated by LLMs back to in-context information.
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
BABILong is a benchmark for LLM evaluation using the needle-in-a-haystack approach (a minimal sketch of this setup follows after this list).
Tests for long context window evaluation
Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718
Open-source observability for your GenAI or LLM application, based on OpenTelemetry
A benchmark for role-playing language models
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
A unified evaluation framework for large language models
The repository for the paper "Distance between Relevant Information Pieces Causes Bias in Long-Context LLMs"
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
Official code for "Large Language Models as Optimizers"
Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"
URS Benchmark: Evaluating LLMs on User Reported Scenarios
Evaluating the faithfulness of long-context language models
VLM Evaluation: Benchmark for VLMs, spanning text generation tasks from VQA to Captioning
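
For the multiple-choice benchmarks above (e.g. the MMLU-Pro entry), the core scoring loop is roughly the sketch below. This is an illustrative Python sketch, not any benchmark's official harness: `ask_model` is a hypothetical stand-in for whatever model API is being evaluated, and the single-letter answer-extraction regex is an assumed convention.

```python
import re
from dataclasses import dataclass


@dataclass
class MCQuestion:
    prompt: str
    options: list[str]   # up to ten options, labelled A..J as in MMLU-Pro-style sets
    answer: str          # gold letter, e.g. "B"


def format_prompt(q: MCQuestion) -> str:
    letters = "ABCDEFGHIJ"
    lines = [q.prompt] + [f"{letters[i]}. {opt}" for i, opt in enumerate(q.options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def extract_letter(response: str) -> str | None:
    # Assumed convention: the first standalone capital letter A-J is the answer.
    match = re.search(r"\b([A-J])\b", response)
    return match.group(1) if match else None


def score(questions: list[MCQuestion], ask_model) -> float:
    # `ask_model` is a hypothetical prompt -> completion callable.
    correct = sum(extract_letter(ask_model(format_prompt(q))) == q.answer
                  for q in questions)
    return correct / len(questions)


if __name__ == "__main__":
    qs = [MCQuestion("2 + 2 = ?", ["3", "4", "5"], "B")]
    print(score(qs, ask_model=lambda p: "The answer is B."))  # dummy model -> 1.0
```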
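
The long-context entries above (BABILong, the needle-in-a-haystack tests, ∞Bench) all revolve around burying a fact in filler text and asking the model to retrieve it. The sketch below shows the general idea under stated assumptions: `query_model` is a hypothetical callable standing in for a real chat API, and the needle, filler, and scoring rule are illustrative, not taken from any of the listed benchmarks.

```python
import random

NEEDLE = "The secret passcode is 7421."
QUESTION = "What is the secret passcode?"
EXPECTED = "7421"


def build_haystack(filler_sentences: list[str], context_chars: int, depth: float) -> str:
    """Pad with filler until `context_chars`, burying the needle at a relative depth (0.0-1.0)."""
    filler = []
    while sum(len(s) for s in filler) < context_chars:
        filler.append(random.choice(filler_sentences))
    insert_at = int(len(filler) * depth)
    return " ".join(filler[:insert_at] + [NEEDLE] + filler[insert_at:])


def run_trial(query_model, filler_sentences, context_chars, depth) -> bool:
    """Return True if the model retrieves the needle from the padded context."""
    context = build_haystack(filler_sentences, context_chars, depth)
    answer = query_model(f"{context}\n\nQuestion: {QUESTION}")
    return EXPECTED in answer


if __name__ == "__main__":
    filler = [
        "The sky was a pale shade of grey that morning.",
        "Nothing of note happened in the village that day.",
    ]
    dummy_model = lambda prompt: "The passcode is 7421."  # stand-in for a real API call
    hits = sum(run_trial(dummy_model, filler, context_chars=8_000, depth=d / 10)
               for d in range(11))
    print(f"retrieval accuracy across insertion depths: {hits}/11")
```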