🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) across 100+ datasets.
🤖 Build AI applications with confidence ✅ Understand how your users are using your LLM app ✅ Get a full picture of the quality and performance of your LLM app ✅ Collaborate with your stakeholders in ONE platform ✅ Iterate towards the most valuable & reliable LLM app.
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting GPT-4V, Gemini, QwenVLPlus, 40+ HF models, and 20+ benchmarks.
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
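For context, automatic evaluators of this kind typically ask a strong LLM to judge which of two candidate outputs better follows an instruction. Below is a minimal, generic sketch of that pattern using the OpenAI Python SDK; the judge model name, prompt wording, and answer parsing are illustrative assumptions, not AlpacaEval's actual implementation.

```python
# Generic LLM-as-judge sketch (illustrative; not AlpacaEval's own code).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge(instruction: str, output_a: str, output_b: str) -> str:
    """Ask a judge model which output better follows the instruction.

    Returns "A" or "B". Model choice and prompt wording are assumptions.
    """
    prompt = (
        "You are comparing two responses to the same instruction.\n"
        f"Instruction: {instruction}\n\n"
        f"Response A: {output_a}\n\n"
        f"Response B: {output_b}\n\n"
        "Answer with a single letter, A or B, for the better response."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = resp.choices[0].message.content.strip().upper()
    return "A" if answer.startswith("A") else "B"
```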
This hosts the code and appendix of the SIGIR 2024 full paper "Can We Trust Recommender System Fairness Evaluation: The Role of Fairness and Relevance"
This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"
Unlock the potential of AI-driven solutions and delve into the world of Large Language Models. Explore cutting-edge concepts, real-world applications, and best practices to build powerful systems with these state-of-the-art models.
📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥
Longitudinal Evaluation of LLMs via Data Compression
Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.
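As a rough illustration of the kind of regression check such tools automate in CI/CD, here is a minimal Python sketch that runs fixed prompts against an OpenAI-compatible model and asserts on the outputs; the model name, prompts, and expected substrings are assumptions, and this is not promptfoo's own configuration format.

```python
# Minimal prompt regression check (illustrative; not promptfoo's config format).
import sys
from openai import OpenAI

client = OpenAI()

CASES = [
    # (user prompt, substring the answer is expected to contain) -- assumed examples
    ("What is the capital of France? Answer in one word.", "Paris"),
    ("Translate 'good morning' to Spanish.", "Buenos"),
]


def run_case(prompt: str, expected: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model under test
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = resp.choices[0].message.content
    return expected.lower() in text.lower()


if __name__ == "__main__":
    failures = [p for p, e in CASES if not run_case(p, e)]
    if failures:
        print("Regressions detected:", failures)
        sys.exit(1)  # non-zero exit fails the CI job
    print("All prompt checks passed.")
```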
Python client for Kolena's machine learning testing platform
LangSmith Client SDK Implementations
Programming Language Selector based on language metadata and user-specified values.
(Windows/Linux) Local WebUI for finetuning, evaluation, and generation of neural network models (LLM and Stable Diffusion) in Python (Gradio interface).
The fore client package
The production toolkit for LLMs. Observability, prompt management and evaluations.
C# Eval Expression | Evaluate, compile, and execute C# code and expressions at runtime.
The RAG Experiment Accelerator is a versatile tool designed to expedite and simplify experiments and evaluations using Azure Cognitive Search and the RAG pattern.