This repository contains the code, scripts, and data needed to reproduce the results of the paper "The Fact Selection Problem in LLM-Based Program Repair".
Before installing the project, ensure you have the following prerequisites installed on your system:
- Python version 3.10 or higher.
Follow these steps to install and set up the project on your local machine:
cd maniple
python3 -m pip install .
The project is organized into several directories, each serving a specific purpose:
data/ # Training and testing datasets
BGP32/ # Sampled 32 bugs from the BugsInPy dataset
black/ # The bug project folder
10/ # The bug ID folder
100000001/ # The bitvector used for prompting
prompt.md # The prompt used for this bitvector
response_1.md # The response from the model
response_1.json # The response in JSON format
response_1.patch # The response in patch format
result_1.json # Testing result
...
BGP314/ # 314 bugs from the BugsInPy dataset
maniple/ # Scripts for extracting facts and generating prompts
strata_based/ # Scripts for generating prompts
utils/ # Utility functions
tests/ # Test scripts
metrics/ # Scripts for calculating metrics for dataset
experiment.ipynb # Jupyter notebook for training models
experiment-initialization-resources/ # Contains raw facts for each bug
bug-data/ # Raw facts for each bug
ansible/ # Bug project folder
5/ # Bug ID folder
bug-info.json # Metadata for the bug
facts_in_prompt.json # Facts used in the prompt
processed_facts.json # Processed facts
external_facts.json # GitHub issues for this bug
static-dynamic-facts.json # Static and dynamic facts
...
datasets-list/ # Subsets from BugsInPy dataset
strata-bitvector/ # Debugging information for bitvectors
Due to its large size, `BGP314` is not stored in this repository; it is available on Zenodo: https://zenodo.org/records/10853003.
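The layout above nests results as `<project>/<bug ID>/<bitvector>/`. As a minimal sketch of how that layout could be traversed, the helper below yields one record per bitvector directory. The names `Run` and `iter_runs` are illustrative and are not part of this repository; the sketch assumes only the three-level nesting shown in the directory listing.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class Run(NamedTuple):
    project: str    # e.g. "black"
    bug_id: str     # e.g. "10"
    bitvector: str  # e.g. "100000001"
    path: Path      # directory holding prompt.md, response_*.md, ...


def iter_runs(dataset_dir: Path) -> Iterator[Run]:
    """Yield one Run per <project>/<bug_id>/<bitvector>/ directory."""
    for bitvector_dir in sorted(dataset_dir.glob("*/*/*")):
        # Skip stray files that happen to sit three levels deep.
        if not bitvector_dir.is_dir():
            continue
        project, bug_id, bitvector = bitvector_dir.relative_to(dataset_dir).parts
        yield Run(project, bug_id, bitvector, bitvector_dir)
```

For example, `list(iter_runs(Path("data/BGP32")))` would list every bug and bitvector combination once the dataset is unpacked in that location.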
Please follow the steps below in order to reproduce the experiments on the 314 bugs from BugsInPy with our bitvector-based prompts.
First, ensure that the `python3.7` command is available globally on your system. If it is not, install Python 3.7 manually with the following commands:
cd /tmp/
wget https://www.python.org/ftp/python/3.7.17/Python-3.7.17.tgz
tar xzf Python-3.7.17.tgz
cd Python-3.7.17
sudo ./configure --prefix=/opt/python/3.7.17/ --enable-optimizations --with-lto --with-computed-gotos --with-system-ffi
sudo make -j "$(nproc)"
sudo make altinstall
sudo rm /tmp/Python-3.7.17.tgz
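Because the build above is configured with `--prefix=/opt/python/3.7.17/`, `make altinstall` places the interpreter under that prefix, which may not be on your `PATH`. The following sketch checks both locations. The helper name `find_python37` is hypothetical, and the fallback path is inferred from the `--prefix` value above rather than taken from this repository.

```python
import shutil
from pathlib import Path
from typing import Optional

# Location implied by the --prefix used in the build commands (an assumption).
PREFIX_BINARY = Path("/opt/python/3.7.17/bin/python3.7")


def find_python37(fallback: Path = PREFIX_BINARY) -> Optional[str]:
    """Return a path to a python3.7 executable, or None if none is found."""
    on_path = shutil.which("python3.7")
    if on_path is not None:
        return on_path
    # Not on PATH: fall back to the install prefix, if the binary exists there.
    if fallback.is_file():
        return str(fallback)
    return None
```

If the function only finds the fallback, adding `/opt/python/3.7.17/bin` to `PATH` (or symlinking the binary) should make the `python3.7` command globally available.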
Then, you can install the required dependencies by running the following command:
The CLI scripts under the `maniple` directory provide useful commands to download and prepare environments for each bug.
To download and prepare the environment for each bug, use the `prep` command:
```sh
bgp update_bug_records
maniple prep --dataset experiment-initialization-resources/datasets-list/314-dataset.json --envs-dir ~/Documents/maniple-env --bugdata-dir ~/Documents/maniple-bugsdata
```
This script automatically downloads all 314 bugs from GitHub, creates a virtual environment for each bug, and installs the necessary dependencies.
You can then extract facts from the bug data using the `extract` command as follows:
maniple extract --dataset 314-dataset --output-dir data/BGP314
This script will extract facts from the bug data and save them in the specified output directory.
You can find all extracted facts under the `experiment-initialization-resources/bug-data` directory.
First, generate the bitvectors for the facts. The 128 bitvectors used in our paper can be generated with the following command:
python3 -m maniple.strata_based.fact_bitvector_generator
You can also define your own bitvectors; place them under the `experiment-initialization-resources/strata-bitvectors` directory. Refer to the bitvectors used for our paper as an example of the format.
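To illustrate how a bitvector such as `100000001` can be read, here is a minimal sketch that treats each character as an on/off switch for one fact. Which fact each position refers to is defined by this repository and is not reproduced here, so the helper names and any fact names passed in are placeholders.

```python
from typing import List


def selected_positions(bitvector: str) -> List[int]:
    """Return the 0-based positions whose bit is '1'."""
    if not bitvector or set(bitvector) - {"0", "1"}:
        raise ValueError(f"not a bitvector: {bitvector!r}")
    return [i for i, bit in enumerate(bitvector) if bit == "1"]


def select_facts(bitvector: str, fact_names: List[str]) -> List[str]:
    """Keep the facts whose bit is set; names and order are placeholders."""
    if len(bitvector) != len(fact_names):
        raise ValueError("bitvector length must match the number of facts")
    return [fact_names[i] for i in selected_positions(bitvector)]
```

Under this reading, `selected_positions("100000001")` returns `[0, 8]`, meaning only the first and last facts would be included in the prompt.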
To reproduce the prompts and responses from our experiments, use the commands below, replacing `<YOUR_OPENAI_KEY>` with your own key.
On Linux/macOS:
# if you want to use OpenAI backend
export OPENAI_API_KEY=<YOUR_OPENAI_KEY>
# if you want to use Ollama backend
export USE_OLLAMA=true
# run LLM query
python3 -m maniple.strata_based.prompt_generator --database BGP314 --partition 10 --trial 15 --model "gpt-3.5-turbo-0125"
You can also build your own prompts from custom bitvectors using our extracted facts; the commands above only reproduce our prompts and responses.
This script generates prompts and responses for all 314 bugs in the dataset by enumerating all possible bitvectors according to the current strata design, which is specified in `maniple/strata_based/fact_strata_table.json`. The `--trial 15` option makes the script generate 15 responses for each prompt, `--partition 10` starts 10 threads to speed up the process, and `--model` selects the LLM by name.
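The interplay of the trial count and the thread count can be sketched as follows. This is only an illustration of the general pattern (several responses per prompt, spread over a fixed pool of threads); it is not the implementation in `maniple.strata_based.prompt_generator`. The function `run_trials` and the `query` callable, which stands in for the LLM call, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def run_trials(
    prompts: Dict[str, str],
    query: Callable[[str], str],
    trials: int = 15,
    partitions: int = 10,
) -> Dict[str, List[str]]:
    """Collect `trials` responses per prompt using `partitions` worker threads."""

    def worker(item: Tuple[str, str]) -> Tuple[str, List[str]]:
        name, prompt = item
        # Each prompt is queried `trials` times by a single worker.
        return name, [query(prompt) for _ in range(trials)]

    with ThreadPoolExecutor(max_workers=partitions) as pool:
        return dict(pool.map(worker, prompts.items()))
```

With the defaults above, each prompt yields 15 responses and at most 10 prompts are processed concurrently, mirroring `--trial 15` and `--partition 10`.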
To validate the generated patches, use the following command:
maniple validate --output-dir data/BGP314
This script validates the generated patches for the specified bug and saves the results in the specified output directory. The tests come from the developer's fix commit.
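After validation, it can be useful to check that every patch has a matching result file. The sketch below relies only on the file names shown in the directory listing (`response_1.patch`, `result_1.json`) and assumes that the number in both names identifies the same trial. The helper `missing_results` is illustrative and not part of the `maniple` CLI.

```python
import re
from pathlib import Path
from typing import List

_RESPONSE_PATCH = re.compile(r"response_(\d+)\.patch$")


def missing_results(bitvector_dir: Path) -> List[int]:
    """Return trial numbers that have response_<n>.patch but no result_<n>.json."""
    missing = []
    for patch in bitvector_dir.glob("response_*.patch"):
        match = _RESPONSE_PATCH.match(patch.name)
        if match is None:
            continue  # ignore files that do not follow the numbered pattern
        trial = int(match.group(1))
        if not (bitvector_dir / f"result_{trial}.json").exists():
            missing.append(trial)
    return sorted(missing)
```

An empty list for a bitvector directory indicates that each patch found there has a corresponding testing result.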
Contributions to this project are welcome! Please submit a PR if you find any bugs or have any suggestions.
This project is licensed under the MIT License; see the LICENSE file for details.