llms-benchmarking

The MERIT Dataset is a fully synthetic, labeled dataset created for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, an area in which we are actively working. This repository is actively maintained, and new features are continuously being added.

biases synthetic-dataset-generation layoutlm synthetic-dataset layoutxlm token-classification layoutlmv3 layoutlmv2 llms-benchmarking

Updated May 29, 2024
Python

s2e-lab / RegexEval

Source code for the paper accepted at ICSE-NIER'24: Re(gEx|DoS)Eval: Evaluating Generated Regular Expressions and their Proneness to DoS Attacks.

regex code-generation benchmark-framework redos-checker redos-detector llms-benchmarking

Updated Mar 13, 2024
Python

melvinebenezer / Liah-Lie_in_a_haystack

needle in a haystack for LLMs

needle-in-haystack llm long-context llm-inference llms-benchmarking

Updated Apr 15, 2024
Python

stair-lab / villm-eval

Evaluation of Language Models in Non-English Languages

llms-benchmarking llm-evaluation-framework

Updated Jun 7, 2024
Python

microsoft / private-benchmarking

A platform that enables users to perform private benchmarking of machine learning models. The platform facilitates the evaluation of models based on different trust levels between the model owners and the dataset owners.

benchmarking inference secure private mpc large-language-models llms-benchmarking private-benchmarking ezpc

Updated Jun 11, 2024
Python

aflah02 / Humans-v-s-LLM-Benchmarks

LLM benchmarks play a crucial role in assessing the performance of Large Language Models (LLMs). However, it is essential to recognize that these benchmarks have their own limitations. This interactive tool engages users in a quiz game based on popular LLM benchmarks, offering an insightful way to explore and understand them.

streamlit llms llms-benchmarking

Updated Jan 1, 2024
Python

EvilPsyCHo / Open-LLM-Benchmark

Evaluate open-source language models on agent, formatted-output, instruction-following, long-text, multilingual, coding, and custom-task capabilities.

openai evaluation-framework huggingface large-language-models llamacpp vllm llm-agent llms-benchmarking

Updated May 10, 2024
Python

amit-sarker / ICL-Analysis-NLP-685

sentiment-analysis huggingface in-context-learning cerebras llama2 mistral-7b llms-benchmarking btlms mamba-state-space-models arithemtic-tasks

Updated May 18, 2024
Python

llms-benchmarking

Here are 12 public repositories matching this topic...

parea-ai / parea-sdk-py

epfl-dlab / cc_flows

declare-lab / resta

Paulescu / text-embedding-evaluation

nachoDRT / MERIT-Dataset

s2e-lab / RegexEval

melvinebenezer / Liah-Lie_in_a_haystack

stair-lab / villm-eval

microsoft / private-benchmarking

aflah02 / Humans-v-s-LLM-Benchmarks

EvilPsyCHo / Open-LLM-Benchmark

amit-sarker / ICL-Analysis-NLP-685
