🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) across 100+ datasets.
🤖 Build AI applications with confidence ✅ Understand how your users are using your LLM app ✅ Get a full picture of the quality and performance of your LLM app ✅ Collaborate with your stakeholders in ONE platform ✅ Iterate towards the most valuable & reliable LLM app.
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting GPT-4V, Gemini, QwenVLPlus, 40+ HF models, and 20+ benchmarks.
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
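For context, automatic evaluators of this kind typically ask a strong LLM to judge which of two candidate outputs better follows an instruction. Below is a minimal, generic sketch of that pattern using the OpenAI Python SDK; the judge model name, prompt wording, and answer parsing are illustrative assumptions, not AlpacaEval's actual implementation.

```python
# Generic LLM-as-judge sketch (illustrative; not AlpacaEval's own code).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge(instruction: str, output_a: str, output_b: str) -> str:
    """Ask a judge model which output better follows the instruction.

    Returns "A" or "B". Model choice and prompt wording are assumptions.
    """
    prompt = (
        "You are comparing two responses to the same instruction.\n"
        f"Instruction: {instruction}\n\n"
        f"Response A: {output_a}\n\n"
        f"Response B: {output_b}\n\n"
        "Answer with a single letter, A or B, for the better response."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = resp.choices[0].message.content.strip().upper()
    return "A" if answer.startswith("A") else "B"
```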
This hosts the code and appendix of the SIGIR 2024 full paper "Can We Trust Recommender System Fairness Evaluation: The Role of Fairness and Relevance"
This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"
Unlock the potential of AI-driven solutions and delve into the world of Large Language Models. Explore cutting-edge concepts, real-world applications, and best practices to build powerful systems with these state-of-the-art models.
📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥
Longitudinal Evaluation of LLMs via Data Compression
Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.
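As a rough illustration of the kind of regression check such tools automate in CI/CD, here is a minimal Python sketch that runs fixed prompts against an OpenAI-compatible model and asserts on the outputs; the model name, prompts, and expected substrings are assumptions, and this is not promptfoo's own configuration format.

```python
# Minimal prompt regression check (illustrative; not promptfoo's config format).
import sys
from openai import OpenAI

client = OpenAI()

CASES = [
    # (user prompt, substring the answer is expected to contain) -- assumed examples
    ("What is the capital of France? Answer in one word.", "Paris"),
    ("Translate 'good morning' to Spanish.", "Buenos"),
]


def run_case(prompt: str, expected: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model under test
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = resp.choices[0].message.content
    return expected.lower() in text.lower()


if __name__ == "__main__":
    failures = [p for p, e in CASES if not run_case(p, e)]
    if failures:
        print("Regressions detected:", failures)
        sys.exit(1)  # non-zero exit fails the CI job
    print("All prompt checks passed.")
```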
Python client for Kolena's machine learning testing platform
LangSmith Client SDK Implementations
Programming Language Selector based on language metadata and user-specified values.
(Windows/Linux) Local WebUI for finetuning, evaluation, and generation of neural network models (LLM and Stable Diffusion) in Python (Gradio interface).
The fore client package
The production toolkit for LLMs. Observability, prompt management and evaluations.
C# Eval Expression | Evaluate, compile, and execute C# code and expressions at runtime.
The RAG Experiment Accelerator is a versatile tool designed to expedite and simplify experiments and evaluations using Azure Cognitive Search and the RAG pattern.