LANTERN

This is the artifact for the paper "Unlocking LLM Repair Capabilities Through Cross-Language Translation and Multi-Agent Refinement", accepted at ICSE 2026.

Project Structure

.
├── analyzer                    # reason about the optimal target language 
│   ├── decide.py
│   └── decision.py
├── config                      # configuration files for different strategies
├── dataset                     # APR evaluation benchmark of xCodeEval
│   └── apr.tar.gz
├── evaluator                   # evaluate the repaired code and calculate metrics
│   ├── eval_apr.py
│   └── get_result.py
├── logs                        # log records of each execution
├── middleware                  # coordination, historical storage and retrieval, prompt construction, etc.
│   ├── coordinator.py
│   ├── history.py
│   ├── prompt.py
│   ├── repair_retrieval.py
│   └── retrieval.py
├── repairer                    # program repair
│   ├── gen_apr.py
│   └── re_gen.py
├── translator                  # bug translation and code back-translation
│   ├── back_translate.py
│   ├── initilize.py
│   └── translate.py
└── main.py                     # the main entry of the pipeline

Dependency

Docker Engine

Install the Docker engine by following the Docker-CE documentation.

Python Environment

Install Python environment with necessary packages.

conda create -n lantern python=3.9.2
conda activate lantern
cd LANTERN
pip install -r requirements.txt

ExecEval

Install ExecEval, the execution engine of xCodeEval, and start its Docker server.

git clone https://github.com/ntunlp/ExecEval
cd ExecEval
docker build . -t exec-eval:1.0
docker run -it -p 5000:5000 -e NUM_WORKERS=37 exec-eval:1.0
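
Before launching the pipeline, you can confirm that the ExecEval container is reachable. The snippet below is a generic TCP check on the mapped host port, not an ExecEval-specific API call:

import socket

# Generic check that the ExecEval container is listening on the mapped host port 5000.
with socket.create_connection(("localhost", 5000), timeout=5):
    print("ExecEval server is reachable on localhost:5000")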

Pipeline Configuration

Below is a template of the config file.

base_dir: /root/my/data/xCodeEval/evaluation/tr_reasoning   # the execution directory where all outcomes are produced
dataset_path: /root/my/data/xCodeEval/apr                   # the benchmark path
dry_run: 0                      
gen:
  nattempt: 20                                              # number of samples generated for each problem
  nsample: 1
  temperature: 1.0                                          # LLM temperature
hist_top_k: 15                                              # number of top-k historical feedback entries
langs:                                                      # programming language scope
- C
- C#
- C++
- Go
- Java
- Javascript
- Kotlin
- PHP
- Python
- Ruby
- Rust
log_dir: logs                                               # log directory
name: reasoning trans-repair v3 lt                          # name of this run
num_proc: 17                                                # number of parallel processes
repair:
  mode: vanilla                                             # repair mode [vanilla/cmp]
result:
  k: 20                                                     # compute metrics from Pass@1 up to Pass@k
state:                                                      # current state of the pipeline
  action: save_history                                      # last finished action
  it: 11                                                    # current iteration
termination:                                                # termination condition
  max_it: 11                                                # maximal number of iterations
translate:
  mode: reasoning                                           # translation mode [greedy/random/reasoning/notrans/nohist]
unfixed_k: 0                                                
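
The result.k field sets the largest k for which Pass@k is reported, with gen.nattempt samples generated per problem. For reference, the standard unbiased Pass@k estimator (Chen et al., 2021) is sketched below; whether evaluator/get_result.py computes it in exactly this form is an assumption.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for n samples of which c pass (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with gen.nattempt = 20 samples, 3 of them passing:
# pass_at_k(20, 3, 1) = 0.15 and pass_at_k(20, 3, 20) = 1.0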

Experiments

Decompress the dataset:

tar -xzvf dataset/apr.tar.gz -C dataset

Set base_dir, dataset_path, and the other necessary fields in the YAML config files.

Set the API configuration of your LLM:

export API_KEY=your_api_key
export API_BASE=your_api_base
export MODEL_NAME=your_model_name
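
These variables presumably point LANTERN at an OpenAI-compatible endpoint. As a quick sanity check of your credentials, the snippet below shows how such a configuration is typically consumed with the openai Python client; the exact wiring inside LANTERN may differ.

import os
from openai import OpenAI  # openai>=1.0 client style

# Reuse the variables exported above; base_url accepts any OpenAI-compatible endpoint.
client = OpenAI(api_key=os.environ["API_KEY"], base_url=os.environ["API_BASE"])

resp = client.chat.completions.create(
    model=os.environ["MODEL_NAME"],
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
    temperature=1.0,  # matches gen.temperature in the config template above
)
print(resp.choices[0].message.content)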

Greedy strategy

python main.py --config config/tr_greedy.yaml

Random strategy

python main.py --config config/tr_random.yaml

Reasoning strategy

python main.py --config config/tr_reasoning.yaml

w/o translation

python main.py --config config/tr_cmp.yaml

w/o historical feedback

python main.py --config config/tr_cmp_nohist.yaml

Approach Comparison

  • ChatRepair
cd LANTERN

python main.py --config config/add/tr_chatreapir.yaml
  • Self-Planning
export API_KEY=your_api_key
export API_BASE=your_api_base
export MODEL_NAME=your_model_name

cd baseline/self-planning

python planning.py --base-dir <result directory> --num-proc <number of processes> --dataset-path <xcodeeval_dataset>

python implementation.py --base-dir <result directory> --num-proc <number of processes>
  • Self-Collaboration
export API_KEY=your_api_key
export API_BASE=your_api_base
export MODEL_NAME=your_model_name

cd baseline/Self-collaboration-Code-Generation

bash run.sh

bash evaluate.sh

Real-world Generalizability

1. SWE-Bench Lite

Prerequisite

Install the SWE-Bench framework for evaluation:

cd baseline/SWE-bench
pip install -e .

In baseline/Agentless:

Install Agentless according to its documentation at Agentless.

In advance, please download the repository structure from repo_structure and the prior generations produced by Agentless (used for bug context extraction) from swe-bench-lite.

Unzip the compressed repository structure file in baseline/Agentless.

Export the structure location:

export PROJECT_FILE_LOC={xxx/Agentless/repo_structure/repo_structures}

Create a results folder in baseline/Agentless.

Unzip the agentless_swebench_lite.zip in results.

The final structure should be:

Agentless
├── repo_structure
│   └── repo_structures
│       ├── astropy__astropy-6938.json
│       └── ...
└── results
    └── swe-bench-lite
        ├── edit_location_individual
        └── ...

Next, please set the OpenAI configurations in Agentless/script/api_key.sh.

Then run the script to repair:

cd baseline/Agentless

bash script/run_trans.sh

Finally, get the result:

python script/cmp_all.py ../SWE-bench

2. Defects4J

ChatRepair:

export API_KEY=your_api_key
export API_BASE=your_api_base
export MODEL_NAME=your_model_name

cd baseline/FSE_ChatRepair/code/Generation

python repair.py --folder Results/1.2f --lang java --dataset defects4j-1.2-function --few_shot 1 --chain_length 3 --total_tries 11 --assertion_line

python repair.py --folder Results/1.2sh --lang java --dataset defects4j-1.2-single-hunk --few_shot 1 --chain_length 3 --total_tries 11 --assertion_line

python repair.py --folder Results/1.2sl --lang java --dataset defects4j-1.2-single-line --few_shot 1 --chain_length 3 --total_tries 11 --assertion_line

python repair.py --folder Results/2.0 --lang java --dataset defects4j-2.0-single-line --few_shot 1 --chain_length 3 --total_tries 11 --assertion_line

Combine the three scenarios for D4J 1.2 to count the solved bugs:

python myutil/count_num_proj.py Results/1.2f

python myutil/count_num_proj.py Results/1.2sh

python myutil/count_num_proj.py Results/1.2sl

python myutil/combine.py Results/CR_combine

Count the solved bugs on D4J 2.0:

python myutil/count_num.py Results/2.0

LANTERN:

export API_KEY=your_api_key
export API_BASE=your_api_base
export MODEL_NAME=your_model_name

cd baseline/ChatRepair_LANTERN/code/Generation

bash run12.sh

bash run20.sh

Count the solved bugs:

python myutil/count_num_proj.py Results/1.2f

python myutil/count_num.py Results/2.0

Model Generalizability

(Please set the corresponding API configuration before running the scripts.)

  • Claude 3.5 Sonnet
python main.py --config config/add/tr_reasoning_claude.yaml
  • Qwen2.5-72B-Instruct
python main.py --config config/add/tr_reasoning_qwen.yaml

Implementation Details

See prompt design & settings.
