This is the accompanying repository for the paper "Is Self-Repair a Silver Bullet for Code Generation?", presented at the Twelfth International Conference on Learning Representations (Vienna, May 2024). It contains the source code used to run the experiments, the resulting data, and scripts to replicate the data analysis and figures from the paper.
To install the libraries needed to run the code and analysis scripts, run pip install -r requirements.txt.
All figures in the paper can be replicated by running cd paper && make figures. This uses pre-computed results of the data analysis and places the figures in paper/figures/.
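If you instead want to redo all of the data analysis from scratch, run cd paper && APPS_DIR=<path to my APPS directory> make all; note that this requires having APPS installed locally. (The variable assignment must directly precede make; placed before cd, it would apply only to the cd command.)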
N.B.: This repository does not contain the data collected during the human study, due to IRB policies.
@inproceedings{olausson2024repair,
title = {Is Self-Repair a Silver Bullet for Code Generation?},
author = {Theo X. Olausson and Jeevana Priya Inala and Chenglong Wang and Jianfeng Gao and Armando Solar-Lezama},
year = 2024,
booktitle = {International Conference on Learning Representations (ICLR)}
}
Note: the below only applies if you want to use this code base to run new self-repair experiments on HumanEval yourself. You do not need to worry about this if you are merely interested in replicating the figures and results from this paper.
This code base uses a modified version of HumanEval, in which it is easier to extract error messages from failed assertions. This can be downloaded from people.csail.mit.edu/theoxo/data/HumanEval_with_assertion_messages.jsonl.gz.gpg; you can then decrypt it with gpg -d using the password theoxoiclr2024 and unpack it with gunzip, after which it can be used as a drop-in replacement for HumanEval.jsonl in your local installation of HumanEval.
Note: the below only applies if you want to use this code base to run new self-repair experiments on APPS yourself. You do not need to worry about this if you are merely interested in replicating the figures and results from this paper.
Due to dependencies on an internal project, one function (exec_sample) has been left unimplemented in src/apps/apps.py. If you want to make use of the APPS part of the source code, you must implement this function; see the doc-string for pointers.
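As a starting point for thinking about such an implementation, here is a minimal sketch of running a candidate program against one stdin-based test case. The function name run_candidate, its parameters, and its return value are hypothetical and do not reflect the signature that exec_sample is expected to have; the doc-string in src/apps/apps.py remains the authority. The sketch also does no sandboxing, which you would want before executing model-generated code.

```python
import subprocess
import sys


def run_candidate(code, stdin_text, timeout=10.0):
    """Run `code` in a fresh Python process, feeding it `stdin_text`.

    Hypothetical helper for illustration only; NOT the interface of exec_sample.
    No sandboxing is performed, so only use it on code you trust.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising.
        return {"stdout": "", "stderr": "", "returncode": None, "timed_out": True}
    return {
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "returncode": proc.returncode,
        "timed_out": False,
    }
```

A caller could then compare the captured stdout against the expected output of the test case, and treat a timeout or a non-zero return code as a failure whose stderr serves as the error message.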
src/: source code used to run the experiments.
    apps/: source code for experiments on APPS.
    humaneval/: source code for experiments on HumanEval.
paper/: data and scripts used to analyze and plot the results of the experiments.
    Makefile: makefile to reproduce figures (make figures), run the analysis scripts (make analysis) or both (make all).
    analysis/sample-and-estimate.py: Python script to generate bootstrapped estimates of pass rates at various budgets.
    data/:
        calculate-token-counts.py: Python script to add counts for how many tokens were used to generate the programs/feedback/repairs. Used for pass@t metrics in Appendix A.
        apps/: data from APPS experiments, with bash scripts to analyze the data and plot the results.
        humaneval/: data from HumanEval experiments, with bash scripts to analyze the data and plot the results.
    plotting/: Python scripts to generate the types of figures used in the paper.
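To give a feel for what a bootstrapped pass-rate estimate looks like, here is a generic sketch. It is not the procedure implemented in sample-and-estimate.py; the function name, the default number of trials, and the notion of a "budget" as a plain sample count k are assumptions made for illustration.

```python
import random


def bootstrap_pass_rate(outcomes, k, n_trials=1000, seed=0):
    """Estimate P(at least one of k samples passes) for a single task.

    `outcomes` is a list of booleans (one per generated program). Each trial
    draws k outcomes with replacement and records whether any of them passed.
    Illustrative only; not the repository's estimator.
    """
    if not outcomes:
        raise ValueError("need at least one outcome to resample from")
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        draw = rng.choices(outcomes, k=k)
        hits += any(draw)
    return hits / n_trials


def pass_rates_by_budget(outcomes, budgets, n_trials=1000, seed=0):
    """Map each budget k to its bootstrapped pass-rate estimate."""
    return {k: bootstrap_pass_rate(outcomes, k, n_trials, seed) for k in budgets}
```

Averaging such per-task estimates over all tasks would give a dataset-level pass rate at each budget.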
The data generated by the models can be found by decompressing the tarballs paper/data/apps/apps-data.tar.bz2 and paper/data/humaneval/humaneval-data.tar.bz2. Data files are in .jsonl format: each line is a valid JSON serialization.
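The tarballs can be unpacked with any standard tool; as one option, the sketch below uses only the Python standard library to extract an archive and iterate over the records in a .jsonl file. The helper names extract_tarball and read_jsonl are made up for this example, and the names of the files inside the archives are not assumed.

```python
import json
import tarfile
from pathlib import Path


def extract_tarball(tarball, dest):
    """Unpack a .tar.bz2 archive into `dest`; return paths of extracted files."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball, "r:bz2") as tar:
        names = [m.name for m in tar.getmembers() if m.isfile()]
        tar.extractall(dest)
    return [dest / name for name in names]


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of `path`."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, extract_tarball("paper/data/humaneval/humaneval-data.tar.bz2", "unpacked") returns the extracted file paths, each of which can then be passed to read_jsonl.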
The data contains the following fields:
prob_path / task_id (for APPS and HumanEval, respectively): the identifier for the particular problem/task.
completions: a list of items, each with the following fields:
    original_completion: the completion before any processing.
    executed_completion: the completion after processing/execution.
    tokens_generated: the total number of tokens for the (executed) completion.
    binary: (poorly named) boolean flag for whether this completion passed the tests or not.
    fault: passed if the completion passed, otherwise the error message received.
    errors: a list of execution results for each unit test, in order (APPS only).
    repairs: if the completion passed, null. Otherwise, a list of items much like the completions, except also equipped with an explanation field (which, in the case of modelX+modelY results, is generated separately by modelX). Note that for repairs, tokens_generated counts both the program and the explanation (all text preceding it).
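The field layout above can be walked as in the following sketch, which tallies passing completions and passing repairs for one parsed record. It assumes that binary is stored as a JSON boolean and that each repair item carries its own binary field (suggested by "much like the completions", but not verified here); the function name summarize_record is hypothetical.

```python
def summarize_record(record):
    """Count initial passes and successful repairs in one parsed data record.

    Assumes `binary` is a boolean on both completions and repairs.
    """
    completions = record["completions"]
    n_passed = sum(1 for c in completions if c["binary"])
    n_repairs = 0
    n_repairs_passed = 0
    for c in completions:
        # `repairs` is null (None in Python) when the completion itself passed.
        for r in c.get("repairs") or []:
            n_repairs += 1
            if r["binary"]:
                n_repairs_passed += 1
    return {
        # APPS records use prob_path, HumanEval records use task_id.
        "id": record.get("task_id", record.get("prob_path")),
        "completions": len(completions),
        "completions_passed": n_passed,
        "repairs": n_repairs,
        "repairs_passed": n_repairs_passed,
    }
```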
In addition to these tarballs, there are also tarballs with the -raw suffix. These are identical to the above, but also contain some auxiliary fields which are irrelevant for the final analysis but were used during debugging and while running these large experiments. Any auxiliary fields present in these raw tarballs should be considered legacy and possibly inaccurate.