The LLM Evaluation Framework
The one-stop repository for large language model (LLM) unlearning. Supports TOFU, MUSE, WMDP, and many unlearning methods. All components (benchmarks, methods, evaluations, models, etc.) are easily extensible.
LangFair is a Python library for conducting use-case-level LLM bias and fairness assessments.
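As a rough sketch of what a use-case-level check can look like (a generic illustration, not LangFair's actual API; the function and record layout are hypothetical), one can compare a simple outcome rate, such as refusal rate, across demographic groups of prompts:

```python
# Hypothetical sketch of a use-case-level fairness check (not LangFair's API):
# compare a simple outcome rate (e.g., refusal rate) across demographic groups.
from collections import defaultdict

def refusal_rate_by_group(records):
    """records: iterable of (group, was_refused) pairs collected from your LLM app."""
    counts = defaultdict(lambda: [0, 0])  # group -> [refusals, total]
    for group, was_refused in records:
        counts[group][0] += int(was_refused)
        counts[group][1] += 1
    return {g: refusals / total for g, (refusals, total) in counts.items()}

records = [("group_a", True), ("group_a", False), ("group_b", False), ("group_b", False)]
rates = refusal_rate_by_group(records)
# Disparity ratio: min rate over max rate; values closer to 1.0 indicate more parity.
disparity = min(rates.values()) / max(rates.values()) if max(rates.values()) > 0 else 1.0
print(rates, disparity)
```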
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
Multirun - Run a prompt against all, or some, of your models running on Ollama. Creates web pages with the output, performance statistics and model info. All in a single Bash shell script.
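A minimal Python sketch of the same idea, assuming a local Ollama server on its default port and using its REST endpoints for listing models and generating a completion (the prompt and timeout are placeholders):

```python
# Minimal sketch: run one prompt against every model served by a local Ollama
# instance and report the output plus wall-clock time. Uses Ollama's REST API
# (GET /api/tags to list models, POST /api/generate to run a prompt).
import time
import requests

OLLAMA = "http://localhost:11434"
PROMPT = "Explain retrieval-augmented generation in one sentence."  # placeholder prompt

models = [m["name"] for m in requests.get(f"{OLLAMA}/api/tags").json()["models"]]
for model in models:
    start = time.time()
    resp = requests.post(
        f"{OLLAMA}/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=300,
    ).json()
    print(f"--- {model} ({time.time() - start:.1f}s) ---")
    print(resp.get("response", "").strip())
```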
Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.
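A minimal sketch of how such a check might sit inside a pytest suite, assuming a hypothetical generate_answer function stands in for your app's LLM call and a keyword check stands in for a real metric:

```python
# Sketch of an LLM evaluation folded into a pytest suite. generate_answer is a
# placeholder for your app's LLM call; swap the keyword check for whatever
# metric (exact match, similarity, LLM-as-judge) your app actually needs.
import pytest

def generate_answer(question: str) -> str:
    # Placeholder: replace with a call into your LLM-backed application.
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "Name a sorting algorithm with O(n log n) average time.": "Merge sort is one example.",
    }
    return canned.get(question, "")

EVAL_CASES = [
    ("What is the capital of France?", ["paris"]),
    ("Name a sorting algorithm with O(n log n) average time.", ["merge", "quick", "heap"]),
]

@pytest.mark.parametrize("question,expected_keywords", EVAL_CASES)
def test_answer_mentions_expected_keyword(question, expected_keywords):
    answer = generate_answer(question).lower()
    assert any(k in answer for k in expected_keywords), f"unexpected answer: {answer}"
```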
A set of auxiliary systems designed to provide a measure of estimated confidence for the outputs generated by Large Language Models.
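One common auxiliary signal of this kind is self-consistency: sample several answers to the same question and treat the majority answer's share of the samples as an estimated confidence. A minimal sketch, with sample_answer as a hypothetical callable into the model:

```python
# Self-consistency sketch: sample several answers to the same question and use
# the majority answer's share of the samples as an estimated confidence score.
from collections import Counter
from typing import Callable, Tuple

def self_consistency(sample_answer: Callable[[], str], n: int = 10) -> Tuple[str, float]:
    answers = [sample_answer().strip().lower() for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n

# Usage (hypothetical): answer, confidence = self_consistency(lambda: llm("17 * 24 = ?"), n=10)
```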
Evaluates LLM responses and computes their accuracy.
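In its simplest form this is exact-match accuracy over paired responses and reference answers; a minimal sketch:

```python
# Exact-match accuracy over paired LLM responses and reference answers.
def exact_match_accuracy(responses, references):
    assert len(responses) == len(references)
    hits = sum(r.strip().lower() == ref.strip().lower() for r, ref in zip(responses, references))
    return hits / len(references) if references else 0.0

print(exact_match_accuracy(["Paris", "42"], ["paris", "41"]))  # -> 0.5
```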
A Streamlit application that provides a user-friendly interface for evaluating large language models (LLMs) using the beyondllm package.
Tools for systematic large language model evaluations