OpenCompass Website · OpenCompass Toolkit
 


What is OpenCompass? OpenCompass is a platform focused on the understanding of AGI, covering Large Language Models and Multi-modality Models.

We aim to:

  • develop high-quality libraries to reduce the difficulty of evaluation
  • provide convincing leaderboards to improve the understanding of large models
  • create powerful toolchains targeting a variety of abilities and tasks
  • build solid benchmarks to support large model research
  • research the inference of large models (analysis, reasoning, prompt engineering, etc.)

Toolkit

OpenCompass: an LLM evaluation platform supporting a wide range of models over 100+ datasets (a usage sketch follows below).

VLMEvalKit: an open-source evaluation toolkit for large multi-modality models (LMMs).
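
A typical OpenCompass evaluation is driven by a Python config that pairs model definitions with dataset definitions. The sketch below follows the pattern from the project's quick-start documentation; the specific config paths (`.datasets.siqa.siqa_gen`, `.models.opt.hf_opt_125m`) and the `run.py` entry point are assumptions taken from that documentation and may differ between OpenCompass versions.

```python
# eval_demo.py -- a minimal OpenCompass config sketch.
# The referenced dataset/model config modules are illustrative and may have
# moved or been renamed in newer OpenCompass releases.
from mmengine.config import read_base

with read_base():
    # Pull in prebuilt dataset and model configs shipped with OpenCompass.
    from .datasets.siqa.siqa_gen import siqa_datasets
    from .models.opt.hf_opt_125m import opt125m

# OpenCompass evaluates every model in `models` on every dataset in `datasets`.
datasets = [*siqa_datasets]
models = [opt125m]
```

Launched from an OpenCompass checkout with `python run.py configs/eval_demo.py` (again, assuming the documented entry point), this evaluates the listed models on each configured dataset and writes per-dataset scores to the output directory.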

Benchmarks and Methods

| Project | Topic | Paper |
| --- | --- | --- |
| DevBench | Automated Software Development | DevBench: Towards LLMs based Automated Software Development |
| CriticBench | Critic Reasoning | CriticBench: Evaluating Large Language Models as Critic |
| ANAH | Hallucination Annotation | ANAH: Analytical Annotation of Hallucinations in Large Language Models |
| MathBench | Mathematical Reasoning | MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark |
| T-Eval | Tool Utilization | T-Eval: Evaluating the Tool Utilization Capability Step by Step |
| MMBench | Multi-Modality | MMBench: Is Your Multi-modal Model an All-around Player? |
| BotChat | Subjective Evaluation | BotChat: Evaluating LLMs’ Capabilities of Having Multi-Turn Dialogues |
| LawBench | Domain Evaluation | LawBench: Benchmarking Legal Knowledge of Large Language Models |

Pinned

  1. opencompass Public

    OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.

    Python · 4.9k stars · 518 forks

  2. VLMEvalKit Public

    Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.

    Python · 2k stars · 287 forks

  3. CompassJudger Public

    87 stars · 5 forks

  4. LawBench Public

    Benchmarking Legal Knowledge of Large Language Models

    Python · 303 stars · 50 forks

  5. T-Eval Public

    [ACL 2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step

    Python · 261 stars · 15 forks

  6. GAOKAO-Eval Public

    Jupyter Notebook · 102 stars · 5 forks

Repositories

Showing 10 of 30 repositories
  • VLMEvalKit Public

    Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.

    Python · 1,973 stars · Apache-2.0 · 287 forks · 71 issues · 12 PRs · Updated Mar 10, 2025
  • opencompass Public

    OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.

    Python · 4,874 stars · Apache-2.0 · 518 forks · 267 issues (1 needs help) · 36 PRs · Updated Mar 7, 2025
  • ANAH Public

    [ACL 2024] ANAH & [NeurIPS 2024] ANAH-v2 & [ICLR 2025] Mask-DPO

    Python · 38 stars · Apache-2.0 · 3 forks · 1 issue · 0 PRs · Updated Mar 6, 2025
  • GPassK Public

    Official repository of "Are Your LLMs Capable of Stable Reasoning?"

    Python · 22 stars · 1 fork · 2 issues · 0 PRs · Updated Mar 3, 2025
  • CompassJudger Public
    87 stars · 5 forks · 0 issues · 0 PRs · Updated Feb 25, 2025
  • GTA Public

    [NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents

    Python · 75 stars · Apache-2.0 · 6 forks · 1 issue · 0 PRs · Updated Feb 13, 2025
  • GAOKAO-Eval Public
    Jupyter Notebook · 102 stars · 5 forks · 5 issues · 0 PRs · Updated Dec 16, 2024
  • CriticEval Public

    [NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs

    Python · 39 stars · Apache-2.0 · 2 forks · 0 issues · 0 PRs · Updated Nov 29, 2024
  • ProSA Public

    [EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs

    Python · 24 stars · Apache-2.0 · 2 forks · 0 issues · 0 PRs · Updated Oct 22, 2024