Repo of DebugBench

Overview

Implementation for the paper "DebugBench: Evaluating Debugging Capability of Large Language Models", including datasets, prompts, and model outputs.

Benchmark

To use the benchmark, please refer to the Hugging Face Dataset for the data source and evaluation script.

DebugBench is a Large Language Model (LLM) debugging benchmark introduced in the paper "DebugBench: Evaluating Debugging Capability of Large Language Models" [url]. We collect code snippets from the LeetCode community and implant bugs into source data with GPT-4.

It consists of 4,253 instances.
It covers four major bug categories and 18 minor types.
It includes C++, Java, and Python instances.
It contains three difficulty levels: easy, medium, and hard.
All the instances were released after June 2022.
Please refer to the article [url] for more details.

Repo Content

This repository contains the implementation for benchmark construction and evaluation.

The benchmark directory contains the 51 JSON shards of the benchmark, split by language and bug type.
The dataset_construction directory contains the implementation for implanting bugs into solution code via LLMs.
The evaluation directory contains the implementation for evaluating the debugging capabilities of LLMs via API.
The evalution_result directory contains the model outputs of gpt-4-0613, gpt-3.5-turbo-0613, and CodeLlama-34b-instruct under different scenarios.
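
As a quick sanity check on a local clone, the JSON shards in the benchmark directory could be loaded and counted with a short script. This is only a minimal sketch: it assumes each shard is a JSON file whose top level is a list of instances, which this README does not confirm, and the helper names load_shards and count_instances are hypothetical rather than part of the repository.

```python
import json
from collections import Counter
from pathlib import Path


def load_shards(benchmark_dir):
    """Map each shard's path (relative to benchmark_dir) to its parsed JSON.

    Assumption: shards are *.json files located anywhere under benchmark_dir.
    """
    root = Path(benchmark_dir)
    shards = {}
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            shards[str(path.relative_to(root))] = json.load(f)
    return shards


def count_instances(shards):
    """Count instances per shard.

    Assumption: the top level of every shard is a JSON list of instances.
    """
    return Counter({name: len(items) for name, items in shards.items()})
```

Calling load_shards("benchmark") from the repository root and summing count_instances(...).values() should, if the assumptions above hold, give the number of shards and the total number of instances to compare against the figures stated in this README.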

More elements will be added to the repository soon.

Citations

Please cite the paper and star the repo if you use DebugBench and find it helpful.

Feel free to contact trc20@mails.tsinghua.edu.cn or open an issue if you have any questions.

@misc{tian2024debugbench,
      title={DebugBench: Evaluating Debugging Capability of Large Language Models}, 
      author={Runchu Tian and Yining Ye and Yujia Qin and Xin Cong and Yankai Lin and Zhiyuan Liu and Maosong Sun},
      year={2024},
      eprint={2401.04621},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}
