A framework for few-shot evaluation of language models.
🐢 Open-Source Evaluation & Testing for ML & LLM systems
The LLM Evaluation Framework
Repository for our RecSys 2019 article "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches" and several follow-up studies.
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
Data-Driven Evaluation for LLM-Powered Applications
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications.
The official evaluation suite and dynamic data release for MixEval.
Python SDK for running evaluations on LLM-generated responses
A research library for automating experiments on Deep Graph Networks
Moonshot - A simple and modular tool to evaluate and red-team any LLM application.
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
Evaluation suite for large-scale language models.
Optical Flow Dataset and Benchmark for Visual Crowd Analysis
Multilingual Large Language Models Evaluation Benchmark
BIRL: Benchmark on Image Registration methods with Landmark validations
LiDAR SLAM comparison and evaluation framework
Official repository of RankEval: An Evaluation and Analysis Framework for Learning-to-Rank Solutions.
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
Industrial-level evaluation benchmarks for Coding LLMs across the full life-cycle of AI-native software development (an enterprise-grade code LLM evaluation suite, continuously being expanded)
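These projects differ in domain (LLMs, recommenders, SLAM, vision), but most share the same core loop: run a system over a task set, score each output with a metric, and aggregate the scores into a report. The sketch below illustrates that shape in Python; it is a minimal, hypothetical example, and every name in it (EvalSample, exact_match, run_eval) is illustrative rather than the API of any project listed above.

```python
# Minimal, hypothetical sketch of the common evaluation-framework pattern:
# run a model over a small task set, score each response, aggregate a report.
# Names here are illustrative only, not tied to any listed project.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class EvalSample:
    prompt: str
    reference: str


def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 when the normalized prediction equals the reference."""
    return float(prediction.strip().lower() == reference.strip().lower())


def run_eval(
    model: Callable[[str], str],
    samples: Iterable[EvalSample],
    metric: Callable[[str, str], float] = exact_match,
) -> dict:
    """Run the model over every sample and aggregate the metric."""
    scores = [metric(model(s.prompt), s.reference) for s in samples]
    return {"n": len(scores), "mean_score": sum(scores) / len(scores) if scores else 0.0}


if __name__ == "__main__":
    # Stand-in "model": a lookup table; a real harness would call an LLM backend.
    canned = {"Capital of France?": "Paris", "2 + 2 =": "4"}
    report = run_eval(
        model=lambda prompt: canned.get(prompt, ""),
        samples=[
            EvalSample("Capital of France?", "Paris"),
            EvalSample("2 + 2 =", "5"),  # mismatched reference to show a scored miss
        ],
    )
    print(report)  # {'n': 2, 'mean_score': 0.5}
```

Real frameworks layer task registries, model backends, batching, and richer metrics (LLM-as-judge, faithfulness, landmark error, trajectory drift) on top of this same loop.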