The LLM Evaluation Framework
The one-stop repository for large language model (LLM) unlearning. Supports TOFU, MUSE, WMDP, and many unlearning methods. All components (benchmarks, methods, evaluations, models, etc.) are easily extensible.
LangFair is a Python library for conducting use-case-level LLM bias and fairness assessments.
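As a rough sketch of what a use-case-level check can look like (a generic illustration, not LangFair's actual API; the function and record layout are hypothetical), one can compare a simple outcome rate, such as refusal rate, across demographic groups of prompts:

```python
# Hypothetical sketch of a use-case-level fairness check (not LangFair's API):
# compare a simple outcome rate (e.g., refusal rate) across demographic groups.
from collections import defaultdict

def refusal_rate_by_group(records):
    """records: iterable of (group, was_refused) pairs collected from your LLM app."""
    counts = defaultdict(lambda: [0, 0])  # group -> [refusals, total]
    for group, was_refused in records:
        counts[group][0] += int(was_refused)
        counts[group][1] += 1
    return {g: refusals / total for g, (refusals, total) in counts.items()}

records = [("group_a", True), ("group_a", False), ("group_b", False), ("group_b", False)]
rates = refusal_rate_by_group(records)
# Disparity ratio: min rate over max rate; values closer to 1.0 indicate more parity.
disparity = min(rates.values()) / max(rates.values()) if max(rates.values()) > 0 else 1.0
print(rates, disparity)
```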
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
Multirun - Run a prompt against all, or some, of your models running on Ollama. Creates web pages with the output, performance statistics and model info. All in a single Bash shell script.
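A minimal Python sketch of the same idea, assuming a local Ollama server on its default port and using its REST endpoints for listing models and generating a completion (the prompt and timeout are placeholders):

```python
# Minimal sketch: run one prompt against every model served by a local Ollama
# instance and report the output plus wall-clock time. Uses Ollama's REST API
# (GET /api/tags to list models, POST /api/generate to run a prompt).
import time
import requests

OLLAMA = "http://localhost:11434"
PROMPT = "Explain retrieval-augmented generation in one sentence."  # placeholder prompt

models = [m["name"] for m in requests.get(f"{OLLAMA}/api/tags").json()["models"]]
for model in models:
    start = time.time()
    resp = requests.post(
        f"{OLLAMA}/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=300,
    ).json()
    print(f"--- {model} ({time.time() - start:.1f}s) ---")
    print(resp.get("response", "").strip())
```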
Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.
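A minimal sketch of how such a check might sit inside a pytest suite, assuming a hypothetical generate_answer function stands in for your app's LLM call and a keyword check stands in for a real metric:

```python
# Sketch of an LLM evaluation folded into a pytest suite. generate_answer is a
# placeholder for your app's LLM call; swap the keyword check for whatever
# metric (exact match, similarity, LLM-as-judge) your app actually needs.
import pytest

def generate_answer(question: str) -> str:
    # Placeholder: replace with a call into your LLM-backed application.
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "Name a sorting algorithm with O(n log n) average time.": "Merge sort is one example.",
    }
    return canned.get(question, "")

EVAL_CASES = [
    ("What is the capital of France?", ["paris"]),
    ("Name a sorting algorithm with O(n log n) average time.", ["merge", "quick", "heap"]),
]

@pytest.mark.parametrize("question,expected_keywords", EVAL_CASES)
def test_answer_mentions_expected_keyword(question, expected_keywords):
    answer = generate_answer(question).lower()
    assert any(k in answer for k in expected_keywords), f"unexpected answer: {answer}"
```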
A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.
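One common auxiliary signal of this kind is self-consistency: sample several answers to the same question and treat the majority answer's share of the samples as an estimated confidence. A minimal sketch, with sample_answer as a hypothetical callable into the model:

```python
# Self-consistency sketch: sample several answers to the same question and use
# the majority answer's share of the samples as an estimated confidence score.
from collections import Counter
from typing import Callable, Tuple

def self_consistency(sample_answer: Callable[[], str], n: int = 10) -> Tuple[str, float]:
    answers = [sample_answer().strip().lower() for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n

# Usage (hypothetical): answer, confidence = self_consistency(lambda: llm("17 * 24 = ?"), n=10)
```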
Evaluates LLM responses and computes their accuracy.
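In its simplest form this is exact-match accuracy over paired responses and reference answers; a minimal sketch:

```python
# Exact-match accuracy over paired LLM responses and reference answers.
def exact_match_accuracy(responses, references):
    assert len(responses) == len(references)
    hits = sum(r.strip().lower() == ref.strip().lower() for r, ref in zip(responses, references))
    return hits / len(references) if references else 0.0

print(exact_match_accuracy(["Paris", "42"], ["paris", "41"]))  # -> 0.5
```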
A Streamlit application that provides a user-friendly interface for evaluating large language models (LLMs) using the beyondllm package.
Tools for systematic large language model evaluations