benchmark

open-compass / opencompass

OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) across 100+ datasets.

benchmark evaluation openai llm chatgpt large-language-model llama2 llama3

Updated Nov 15, 2024
Python

baichuan-inc / Baichuan2

A series of large language models developed by Baichuan Intelligent Technology

benchmark natural-language-processing artificial-intelligence chinese gpt huggingface ceval gpt-4 large-language-models chatgpt mmlu llama2

Updated Nov 8, 2024
Python

CLUEbenchmark / CLUE

Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus, and leaderboard

benchmark tensorflow nlu glue corpus transformers pytorch dataset chinese pretrained-models language-model albert bert roberta chineseglue

Updated May 23, 2024
Python

MichaelGrupp / evo

Python package for the evaluation of odometry and SLAM

benchmark robotics tum mapping metrics evaluation ros slam trajectory-analysis odometry trajectory ros2 kitti euroc trajectory-evaluation

Updated Oct 31, 2024
Python

baichuan-inc / Baichuan-13B

A 13B large language model developed by Baichuan Intelligent Technology

benchmark natural-language-processing artificial-intelligence chinese huggingface ceval gpt-4 large-language-models chatgpt mmlu

Updated Sep 6, 2023
Python

microsoft / promptbench

A unified evaluation framework for large language models

benchmark evaluation prompt robustness adversarial-attacks large-language-models prompt-engineering chatgpt

Updated Oct 28, 2024
Python

princeton-nlp / SWE-bench

[ICLR 2024] SWE-bench: Can Language Models Resolve Real-world GitHub Issues?

benchmark software-engineering language-model

Updated Nov 17, 2024
Python

beir-cellar / beir

A heterogeneous benchmark for information retrieval. Easy to use: evaluate your models across 15+ diverse IR datasets.

nlp elasticsearch benchmark information-retrieval deep-learning retrieval pytorch dataset bert dpr passage-retrieval question-generation sentence-transformers sbert zero-shot-retrieval colbert retrieval-models ance use-qa

Updated Jul 28, 2024
Python

mlcommons / training

Reference implementations of MLPerf™ training benchmarks

benchmark machine-learning

Updated Oct 17, 2024
Python

logpai / logparser

A machine learning toolkit for log parsing [ICSE'19, DSN'16]

benchmark log-analysis log log-parser log-mining anomaly-detection log-parsing

Updated Jan 28, 2024
Python

OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding

benchmark action-recognition video-understanding video-data self-supervised multimodal video-dataset open-set-recognition video-retrieval video-question-answering masked-autoencoder temporal-action-localization contrastive-learning spatio-temporal-action-localization zero-shot-retrieval video-clip vision-transformer zero-shot-classification foundation-models instruction-tuning

Updated Nov 17, 2024
Python

xlang-ai / OSWorld

[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

agent cli benchmark natural-language-processing gui reinforcement-learning artificial-intelligence code-generation language-model vlm rpa multimodal llm large-action-model

Updated Nov 15, 2024
Python

cheind / py-motmetrics

📊 Benchmark multiple object trackers (MOT) in Python

tracker benchmark metrics object-detection object-tracking mot clear-mot-metrics mot-challenge

Updated Oct 30, 2024
Python

IntelLabs / fastRAG

Efficient Retrieval Augmentation and Generation Framework

nlp benchmark information-retrieval transformers knowledge-graph question-answering summarization multi-modal semantic-search diffusion sentence-transformers colbert llm generative-ai

Updated Nov 11, 2024
Python

RUC-NLPIR / FlashRAG

⚡FlashRAG: A Python Toolkit for Efficient RAG Research

benchmark datasets large-language-models retrieval-augmented-generation

Updated Nov 17, 2024
Python

kengz / SLM-Lab

Modular Deep Reinforcement Learning framework in PyTorch. Companion library of the book "Foundations of Deep Reinforcement Learning".

benchmark reinforcement-learning deep-reinforcement-learning pytorch dqn policy-gradient a3c sac ppo a2c

Updated Aug 26, 2022
Python

benchmark

Here are 1,062 public repositories matching this topic...

zalandoresearch / fashion-mnist

open-mmlab / mmpose

erikbern / ann-benchmarks

open-mmlab / mmaction2

open-compass / opencompass