This project was developed as part of the research lab "Interactive Learning" at the Karlsruhe Institute of Technology (KIT) in the summer of 2024.
The scope of this project is to develop an autonomous coding agent similar to Devin, SWE-Agent and AutoCodeRover.
All these agents have in common that they use GPT-4 or other high-end (and thus expensive) LLMs as a backbone. In this project, we want to test the feasibility of using smaller models as agents.
The ambitious goal of this project is to get onto the leaderboard of SWE-bench, a by-now well-established benchmark for coding agents.
This project has three distinct parts:
- SmolCoder, which can be found on the master branch, is loosely based on SWE-Agent.
- Agentless, which can be found on the agentless branch, is based on the paper of the same name, Agentless.
- InteractiveLearning, which can be found in the evaluation notebook.
This project is a work in progress:
- Writing an Eval Pipeline for SWE-bench ✅
- Creating the Coding Agent Framework ✅
- Defining and programming the Tools that the Agent will use ✅ (a hypothetical sketch follows below)
- Creating an interface between the Agent and the Computer ✅
Evaluating:
- Phi3 out-of-the-box
- Phi3 as coding agent
- Phi3 finetuned on code
- Phi3 as coding agent, finetuned on code/tool-use
- Phi3 as coding agent, finetuned on code/tool-use with human interaction
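To make the tools item above concrete, here is a minimal, purely hypothetical sketch of what a tool definition in an SWE-Agent-style framework could look like; the `Tool` class and the `list_files` tool are illustrations and do not mirror SmolCoder's actual code.

```python
# Hypothetical sketch of a tool interface for an SWE-Agent-style coding agent.
# Names and signatures are illustrative only, not SmolCoder's real interface.
import os
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str                   # how the LLM refers to the tool in its output
    description: str            # short usage hint included in the system prompt
    run: Callable[[str], str]   # takes the tool argument, returns an observation

def list_files(path: str) -> str:
    """Return the entries of a directory as a single observation string."""
    return "\n".join(sorted(os.listdir(path or ".")))

TOOLS = {
    "list_files": Tool(
        name="list_files",
        description="List the files in a directory, e.g. list_files(src)",
        run=list_files,
    ),
}

# The agent loop would look up the tool the model requested and feed the
# returned observation back into the next prompt.
print(TOOLS["list_files"].run("."))
```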
To run the evaluation for agentless or interactive-learning, check out the evaluation notebook.
Python 3.11 or newer is required, and Docker needs to be installed and running as a daemon.
- Navigate inside the folder and make sure the requirements are installed:

```bash
cd Evaluation
cd SWE-bench
pip install -e .
```
- Test the installation:

```bash
python -m swebench.harness.run_evaluation \
    --predictions_path gold \
    --max_workers 1 \
    --instance_ids sympy__sympy-20590 \
    --run_id validate-gold
```
- Install the required Python packages; I would recommend doing it with conda:

```bash
# create the environment from requirements.txt and activate it
conda create --name <env> --file requirements.txt
conda activate <env>
```
- Get your predictions by running the appropriate part of `Evaluation.ipynb`; make sure to choose the correct dataset (either `swe-bench.json` for the full dataset or `swe-bench-lite.json` for a smaller version). Alternatively, you can also run `evaluation.py` inside the `Evaluation` folder. A sketch of the expected prediction format is shown below.
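The harness consumes a JSON list of predictions. The keys below (`instance_id`, `model_name_or_path`, `model_patch`) follow the upstream SWE-bench convention; the instance id, model name, and patch text are placeholders, and the actual file is produced by the notebook or `evaluation.py`.

```python
# Sketch: writing a predictions file in the shape the SWE-bench harness expects.
# The values are placeholders, not real output of this project.
import json

predictions = [
    {
        "instance_id": "sympy__sympy-20590",            # which benchmark instance is patched
        "model_name_or_path": "gemma2",                  # model name reported in the results
        "model_patch": "diff --git a/f.py b/f.py\n...",  # unified diff generated by the model
    }
]

with open("prediction.json", "w") as f:
    json.dump(predictions, f, indent=2)
```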
- To evaluate the predictions, navigate to the `SWE-bench` folder:
```bash
cd Evaluation
cd SWE-bench
```
- Run the evaluation with the following command; you may need to customize it:

```bash
python -m swebench.harness.run_evaluation \
    --predictions_path ../prediction.json \
    --max_workers 1 \
    --dataset_name ../swe-bench-lite.json \
    --run_id YOUR_ID
```
- You should find a `json` report listing the evaluation result of your predictions, with `YOUR_ID` in its file name, inside the `SWE-bench` directory. A short sketch for inspecting it is shown below.
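A quick way to sanity-check the run is to load the report and look at its fields. The file name and the `resolved_instances` / `total_instances` keys below are assumptions about the harness output; printing the keys first shows what your version actually wrote.

```python
# Sketch: inspecting the evaluation report written by run_evaluation.
# File name and key names are assumptions; adjust them to your actual report.
import json
from pathlib import Path

report_path = Path("gemma2.YOUR_ID.json")  # hypothetical report file name
report = json.loads(report_path.read_text())

print("report keys:", sorted(report.keys()))
print("resolved:", report.get("resolved_instances"), "of", report.get("total_instances"))
```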
- Connect and log in to your server.
- Clone this repository and put an `ollama` binary into the folder.
- Create a new file (`vi evaluate.sh`) with the following content:
```bash
#!/bin/bash
#SBATCH --job-name=evaluate_gemma_2
#SBATCH --mem 20000
#SBATCH --nodes 1
#SBATCH --time 5
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

module load devel/miniconda

# start the ollama server in the background (CPU-only build)
OLLAMA_LLM_LIBRARY="cpu_avx2" ./ollama-linux-amd64 serve &

# give the server time to start and download the model (see the notes below)
sleep 60

python evaluate.py --logging_enabled=True --model_name="gemma2" --output_file="prediction_gemma2B.json"
```
- Set the memory depending on the model, e.g. a 2B model needs memory <= 10 GB, an 8B model memory <= 20 GB.
- Remove the line setting `CUDA_VISIBLE_DEVICES` and remove `OLLAMA_LLM_LIBRARY="cpu_avx2"` if you want to use CUDA.
- Modify `job-name`, `model_name`, and `output_file` as needed.
- The `sleep 60` is there because downloading the model takes some time (a sketch of how `evaluate.py` might then query the served model is given at the end of this section).
- Queue the job with `sbatch -p single evaluate.sh`.
- To check the progress, run `squeue`.
For more on Slurm jobs, check this website out.
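`evaluate.py` itself is not reproduced in this README. As a purely illustrative sketch, a script run this way could talk to the ollama server started by `evaluate.sh` through Ollama's HTTP API on its default port 11434; the model name and prompt below are placeholders.

```python
# Hypothetical sketch: querying the locally running ollama server,
# roughly the way an evaluation script could. Model name and prompt
# are placeholders; only the /api/generate endpoint is Ollama's real API.
import json
import urllib.request

def generate(model: str, prompt: str) -> str:
    """Send a non-streaming generation request to the local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(generate("gemma2", "Summarize what a unified diff is in one sentence."))
```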
Very Relevant Papers:
Less Relevant Papers:
Misc: