Examining how large language models (LLMs) perform across various synthetic regression tasks when given (input, output) examples in their context, without any parameter update
Updated Sep 10, 2024 - Python
A benchmark for prompt injection detection systems.
Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
A collection of LLM-related papers, theses, tools, datasets, courses, open source models, and benchmarks
FM-Leaderboard-er allows you to create a leaderboard to find the best LLM/prompt for your own business use case based on your data, tasks, and prompts
LLM-KG-Bench is a framework and task collection for automated benchmarking of Large Language Models (LLMs) on Knowledge Graph (KG) related tasks.
An app and set of methodologies designed to evaluate the performance of various Large Language Models (LLMs) on the text-to-SQL task. Our goal is to offer a standardized way to measure how well these models can generate SQL queries from natural language descriptions.
RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects - IEEE LAD'24
This is a series of Python scripts for zero-shot and chain-of-thought LLM scripting
Human readability judgments as a benchmark for LLMs
Framework to benchmark LLM performance on domain categorization, done as part of my internship at iQ Global.
Benchmark LLMs' abilities to plan, strategize, and reason by making them play chess against each other.
Python code for the paper "LLMs are zero-shot next-location predictors" by Beneduce et al.