Python SDK for running evaluations on LLM generated responses
Data from the paper BioPlanner: Automatic Evaluation of LLMs on Protocol Planning in Biology
Python SDK for agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks like CrewAI, Langchain, and Autogen
The LLM Evaluation Framework
Valor is a centralized evaluation store which makes it easy to measure, explore, and rank model performance.
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.
Python client for Kolena's machine learning testing platform
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
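As a rough illustration of what such RAG metrics look like, here is a toy, purely lexical stand-in for a context-recall style score. The function name, tokenizer, and scoring rule are hypothetical; real RAG evaluation metrics are typically LLM-judged rather than token-overlap based.

```python
import re

def _tokens(text: str) -> set[str]:
    """Lowercase word tokens; a deliberately simple tokenizer for illustration."""
    return set(re.findall(r"\w+", text.lower()))

def context_recall(ground_truth: str, retrieved_contexts: list[str]) -> float:
    """Toy stand-in for context recall: fraction of ground-truth answer tokens
    that appear anywhere in the retrieved contexts."""
    truth = _tokens(ground_truth)
    context = _tokens(" ".join(retrieved_contexts))
    return len(truth & context) / len(truth) if truth else 0.0

print(context_recall(
    "The Eiffel Tower is in Paris",
    ["Paris is home to the Eiffel Tower.", "It was completed in 1889."],
))  # 1.0 — every answer token is covered by the retrieved contexts
```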
TISE: Bag of Metrics for Text-to-Image Synthesis Evaluation (ECCV 2022)
OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
Utilities for easy use of custom losses in CatBoost, LightGBM, XGBoost.
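For context, supplying a custom loss to these libraries usually means providing the per-row gradient and hessian of the objective. A minimal sketch using XGBoost's standard custom-objective hook; the toy data and the squared-error objective are illustrative only, not this package's API:

```python
import numpy as np
import xgboost as xgb

def squared_error_obj(predt, dtrain):
    """Custom objective: gradient and hessian of 0.5 * (pred - y)^2 per row."""
    y = dtrain.get_label()
    grad = predt - y
    hess = np.ones_like(predt)
    return grad, hess

# Hypothetical toy regression data; any DMatrix works the same way.
X = np.random.rand(100, 5)
y = X @ np.array([1.0, 2.0, 0.5, 0.0, -1.0])
dtrain = xgb.DMatrix(X, label=y)

booster = xgb.train({"tree_method": "hist"}, dtrain,
                    num_boost_round=20, obj=squared_error_obj)
```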
GREEN: n-gram F-score for Grammatical Error Correction
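As a rough sketch of what an n-gram F-score looks like in general (not necessarily GREEN's exact definition; the whitespace tokenization and beta value below are assumptions), a precision-weighted F-beta over n-gram overlap can be computed like this:

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_fscore(hypothesis: str, reference: str, n: int = 2, beta: float = 0.5) -> float:
    """Generic n-gram F-beta between a corrected sentence and a reference.
    beta < 1 weights precision more heavily, as is common in GEC scoring."""
    hyp, ref = ngrams(hypothesis.split(), n), ngrams(reference.split(), n)
    overlap = sum((hyp & ref).values())
    p = overlap / max(sum(hyp.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(ngram_fscore("he go to school yesterday", "he went to school yesterday"))
```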
Awesome diffusion Video-to-Video (V2V): a collection of papers on diffusion model-based video editing, a.k.a. video-to-video (V2V) translation, plus video editing benchmark code.
A conversational agent for customer support queries, built with a React.js frontend and a Python (Flask) backend communicating over a RESTful API. The OpenAI Assistant API manages multiple conversations, using file search over stored FAQs, function calls, and a NoSQL database for order statuses. Includes evaluation scripts.
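A minimal sketch of what such a Flask relay to the OpenAI Assistants API could look like, assuming an assistant with FAQ files has already been created. The /chat route, the ASSISTANT_ID variable, and the polling loop are illustrative assumptions, not the project's actual code, and function-call handling is omitted:

```python
import os
import time

from flask import Flask, request, jsonify
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()  # reads OPENAI_API_KEY from the environment
ASSISTANT_ID = os.environ["ASSISTANT_ID"]  # assistant created beforehand, with FAQ files attached

@app.post("/chat")  # hypothetical route name
def chat():
    payload = request.get_json()
    user_message = payload["message"]
    # Reuse an existing thread id so each customer keeps one conversation.
    thread_id = payload.get("thread_id") or client.beta.threads.create().id

    client.beta.threads.messages.create(
        thread_id=thread_id, role="user", content=user_message
    )
    run = client.beta.threads.runs.create(
        thread_id=thread_id, assistant_id=ASSISTANT_ID
    )
    # Poll until the run finishes (tool/function-call handling omitted for brevity).
    while run.status in ("queued", "in_progress"):
        time.sleep(0.5)
        run = client.beta.threads.runs.retrieve(thread_id=thread_id, run_id=run.id)

    latest = client.beta.threads.messages.list(thread_id=thread_id).data[0]
    return jsonify({"thread_id": thread_id, "reply": latest.content[0].text.value})
```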
Official code for TopP&R: Robust Support Estimation Approach for Evaluating Fidelity and Diversity in Generative Models (NeurIPS 2023)
The most comprehensive Python package for evaluating survival analysis models.
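For reference, the concordance index is one of the standard metrics such packages report. A small, self-contained version (not this package's implementation; the toy data is hypothetical) might look like:

```python
import numpy as np

def concordance_index(times, events, risk_scores):
    """Harrell's C-index: share of comparable pairs where the model assigns the
    higher risk to the subject whose event occurs earlier.
    times: observed times; events: 1 if the event occurred, 0 if censored."""
    concordant, ties, comparable = 0.0, 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable if subject i has an observed event before time j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    ties += 1
    return (concordant + 0.5 * ties) / comparable

times = np.array([5.0, 8.0, 3.0, 10.0])
events = np.array([1, 0, 1, 1])
risk = np.array([0.9, 0.2, 0.8, 0.1])
print(concordance_index(times, events, risk))  # 0.8 on this toy example
```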
📈 Implementation of eight evaluation metrics to assess the similarity between two images. The eight metrics are as follows: RMSE, PSNR, SSIM, ISSM, FSIM, SRE, SAM, and UIQ.
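Two of these metrics, RMSE and PSNR, are simple enough to sketch directly with NumPy; the helper names below are illustrative, not this package's API:

```python
import numpy as np

def rmse(a, b):
    """Root-mean-square error between two images of equal shape; lower is more similar."""
    return float(np.sqrt(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)))

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher is more similar."""
    err = rmse(a, b)
    return float("inf") if err == 0 else 20.0 * np.log10(max_val / err)

img1 = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
img2 = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
print(rmse(img1, img2), psnr(img1, img2))
```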
Production-Grade Evaluation for LLM-Powered Applications
Evaluates neuron segmentations in terms of statistics related to the number of splits and merges
VELOCITI Benchmark Evaluation and Visualisation Code