The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS.
Updated Jun 19, 2024 - TypeScript
Test your prompts, agents, and RAG pipelines. Use LLM evals to improve your app's quality and catch problems. Compare the performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration.
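The declarative-config approach described above can be sketched as a minimal YAML file. This is an illustrative example only — the field names (`prompts`, `providers`, `tests`, `assertions`) are assumptions in the general style of such tools, not any specific tool's actual schema:

```yaml
# Hypothetical eval config (illustrative field names, not a specific tool's schema).
# Idea: run one prompt across several models, then check each output
# with both deterministic and LLM-graded assertions.
prompts:
  - "Summarize the following support ticket in one sentence: {{ticket}}"

providers:
  - openai:gpt-4o
  - anthropic:claude-3-5-sonnet

tests:
  - vars:
      ticket: "My March invoice was charged twice."
    assertions:
      - type: contains        # simple string check
        value: "invoice"
      - type: llm-rubric      # LLM-as-judge check
        value: "The summary is one sentence and mentions a duplicate charge."
```

A config like this would typically be run from the command line or a CI/CD step, producing a pass/fail matrix of models × test cases.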
Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
Official implementation for the paper *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*
Open-Source Evaluation for GenAI Application Pipelines
Superpipe - optimized LLM pipelines for structured data
The LLM Evaluation Framework
The all-in-one LLM developer platform: prompt management, evaluation, human feedback, and deployment all in one place.
Python SDK for running evaluations on LLM generated responses
A framework for building scenario-simulation projects in which human and LLM-based agents can participate, with a user-friendly web UI to visualize simulations and support for automatic evaluation at the agent-action level.
The prompt engineering, prompt management, and prompt evaluation tool for Python
TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Awesome papers involving LLMs in Social Science.
The prompt engineering, prompt management, and prompt evaluation tool for C# and .NET
The prompt engineering, prompt management, and prompt evaluation tool for Ruby.
The prompt engineering, prompt management, and prompt evaluation tool for Kotlin.
The prompt engineering, prompt management, and prompt evaluation tool for Java.