The LLM Evaluation Framework
🐢 Open-Source Evaluation & Testing for ML models & LLMs
The all-in-one LLM developer platform: prompt management, evaluation, human feedback, and deployment.
A practical guide to LLMs: from the fundamentals to deploying advanced LLM and RAG apps on AWS using LLMOps best practices
Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.
The official evaluation suite and dynamic data release for MixEval.
Data-Driven Evaluation for LLM-Powered Applications
Python SDK for running evaluations on LLM generated responses
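As a rough sketch of how SDKs of this kind are typically driven (every name below, such as EvalCase and run_eval, is hypothetical and not the API of any repo listed here):

```python
# Hypothetical evaluation-SDK usage: score model responses against references.
# All names (EvalCase, exact_match, run_eval) are illustrative, not a real API.
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    response: str       # LLM-generated answer under test
    reference: str      # gold answer to compare against

def exact_match(case: EvalCase) -> float:
    """1.0 if the response matches the reference (case-insensitive), else 0.0."""
    return float(case.response.strip().lower() == case.reference.strip().lower())

def run_eval(cases: list[EvalCase], metric) -> float:
    """Average a per-case metric over the whole test set."""
    return sum(metric(c) for c in cases) / len(cases)

cases = [
    EvalCase("Capital of France?", "Paris", "Paris"),
    EvalCase("2 + 2?", "5", "4"),
]
print(f"exact_match: {run_eval(cases, exact_match):.2f}")  # -> 0.50
```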
Connect agents to live web environments for evaluation.
Framework for LLM evaluation, guardrails and security
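To make the guardrail idea concrete, here is a deliberately toy input filter; real guardrail frameworks use trained classifiers and configurable policies, and the blocklist here is purely illustrative:

```python
# Toy input guardrail: reject suspicious messages before they reach the model.
# Entirely illustrative; not the interface of any framework listed on this page.
BLOCKLIST = {"ignore previous instructions", "reveal your system prompt"}

def guard_input(user_message: str) -> str:
    """Raise before the message is ever sent to the LLM."""
    lowered = user_message.lower()
    if any(phrase in lowered for phrase in BLOCKLIST):
        raise ValueError("potential prompt-injection attempt blocked")
    return user_message

print(guard_input("Summarize this article for me."))  # passes through
```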
A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
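A minimal LLM-as-judge sketch of multi-aspect scoring, assuming the official openai Python SDK and an OPENAI_API_KEY in the environment; the rubric aspects, prompt wording, and model name are assumptions, not this tool's actual interface:

```python
# Minimal multi-aspect LLM-as-judge sketch (not the tool's actual code).
import json
from openai import OpenAI

client = OpenAI()
ASPECTS = ["fluency", "relevance", "factuality"]  # assumed rubric

def judge(question: str, answer: str) -> dict:
    prompt = (
        "Rate the answer to the question on each aspect from 1 (poor) to 5 (excellent).\n"
        f"Question: {question}\nAnswer: {answer}\n"
        f"Reply with a JSON object whose keys are: {', '.join(ASPECTS)}."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",          # assumed model; any chat model works
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

print(judge("What causes ocean tides?", "Mostly the Moon's gravitational pull."))
```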
Official repository for the paper "ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming"
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your data, tasks, and prompts
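As an illustration of the underlying idea (hypothetical scores and schema, not FM-Leaderboard-er's own format), ranking models by mean score over your tasks reduces to a small pandas aggregation:

```python
# Turn per-example eval scores into a leaderboard; data and schema are made up.
import pandas as pd

scores = pd.DataFrame({
    "model": ["gpt-x", "gpt-x", "llama-y", "llama-y", "mistral-z", "mistral-z"],
    "task":  ["qa", "summarize"] * 3,
    "score": [0.82, 0.74, 0.79, 0.81, 0.70, 0.68],
})

# Rank models by mean score across tasks, best first.
leaderboard = (
    scores.groupby("model")["score"]
    .mean()
    .sort_values(ascending=False)
    .rename("mean_score")
    .reset_index()
)
print(leaderboard)
```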
A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks
Evaluating LLMs with CommonGen-Lite
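CommonGen-style tasks ask a model to weave a given set of concepts into one coherent sentence, so a basic sanity metric is concept coverage; this sketch checks only surface word presence, whereas the actual benchmark's scoring is more careful:

```python
# Illustrative concept-coverage check in the spirit of CommonGen-style evaluation.
import string

def concept_coverage(concepts: list[str], generation: str) -> float:
    """Fraction of required concepts that appear as words in the generation."""
    words = {w.strip(string.punctuation) for w in generation.lower().split()}
    present = sum(1 for c in concepts if c.lower() in words)
    return present / len(concepts)

print(concept_coverage(["dog", "frisbee", "catch"],
                       "A dog leaps to catch the frisbee."))  # -> 1.0
```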
Open-LLM-Leaderboard: Open-Style Question Evaluation. Paper at https://arxiv.org/abs/2406.07545
[ICLR 2024] Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., & Yang, Y. Compressing LLMs: The truth is rarely pure and never simple.
DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models
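The divide-conquer-reasoning idea can be sketched as: split a long answer into sentences (divide), judge each sentence against a reference (conquer), then aggregate the verdicts (reason). The is_consistent callback below stands in for an LLM judge, and the toy word-overlap judge is purely illustrative, not DCR-Consistency's actual pipeline:

```python
# Sketch of divide-conquer-reasoning for consistency scoring.
import re
from typing import Callable

def dcr_consistency(answer: str, reference: str,
                    is_consistent: Callable[[str, str], bool]) -> float:
    # Divide: naive sentence split on ., !, ? boundaries.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    # Conquer: judge each sentence in isolation against the reference.
    verdicts = [is_consistent(s, reference) for s in sentences]
    # Reason: aggregate into the fraction of consistent sentences.
    return sum(verdicts) / len(verdicts)

# Toy judge based on word overlap; a real judge would be an LLM call.
def toy_judge(sentence: str, ref: str) -> bool:
    words = lambda t: {w.strip(".,!?").lower() for w in t.split()}
    return bool(words(sentence) & words(ref))

print(dcr_consistency(
    "Paris is in France. It hosts the Olympics every year.",
    "Paris, the capital of France, hosted the 2024 Olympics.",
    toy_judge,
))
```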
[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models