SmolCoder: An Open Source LLM-based coding agent that works with human interaction (WIP)

This project was developed as part of the research lab "Interactive Learning" at the Karlsruhe Institute of Technology (KIT) in the summer of 2024.

Description

The scope of this project is to develop an autonomous coding agent similar to Devin, SWE-Agent and AutoCodeRover.

All of these agents rely on GPT-4 or some other high-end (and thus expensive) LLM as their backbone. In this project, we want to test the feasibility of using smaller models as agents.

The ambitious goal of this project is to get onto the leaderboard of SWE-bench, a by-now well-established benchmark for coding agents.

This project has three distinct parts:

  • SmolCoder, which can be found on the master branch, is loosely based on SWE-Agent.
  • Agentless, which can be found on the agentless branch, is based on the paper of the same name, Agentless.
  • InteractiveLearning, which can be found in the evaluation notebook.
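
The SmolCoder and Agentless variants live on separate branches, so switching between them is a matter of checking out the corresponding branch (branch names as listed above):

# Switch to the SWE-Agent-style implementation
git checkout master
# ...or to the Agentless-based implementation
git checkout agentless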

This project is a work-in-progress.

Roadmap

  • Writing an Eval Pipeline for SWE-Bench ✅
  • Creating the Coding Agent Framework ✅
  • Defining and programming the Tools that the Agent will use ✅
  • Creating an interface between the Agent and the Computer ✅

Evaluating:

  • Phi3 out-of-the-box
  • Phi3 as coding agent
  • Phi3 finetuned on code
  • Phi3 as coding agent, finetuned on code/tool-use
  • Phi3 as coding agent, finetuned on code/tool-use with human interaction
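
The agent backbones run locally via Ollama (see the Slurm section below), so a Phi3 model can be pulled ahead of time. The exact model tag on the Ollama registry may differ, so treat this as a sketch:

# Pull a local Phi3 model for the out-of-the-box and agent evaluations
ollama pull phi3
# Quick smoke test of the local model
ollama run phi3 "Write a Python function that reverses a string."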

SWE-Bench Evaluation

Agentless and Interactive-Learning: Evaluation

To run the evaluation for agentless or interactive-learning, check out the evaluation notebook.
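
Assuming a standard Jupyter setup, the notebook (referred to as Evaluation.ipynb later in this README; the exact path may differ) can be opened with:

# Install Jupyter if it is not available yet, then open the notebook
pip install notebook
jupyter notebook Evaluation.ipynb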

SmolCoder: Evaluation

Python 3.11 or newer is required, and Docker needs to be installed and running as a daemon.
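
A quick way to check both prerequisites, assuming python and docker are on your PATH:

# Should print 3.11 or newer
python --version
# Fails with an error if the Docker daemon is not running
docker info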

Test the SWE-Bench installation

  1. Navigate into the SWE-bench folder and make sure the requirements are installed:
cd Evaluation
cd SWE-bench
pip install -e .
  2. Test the installation:
python -m swebench.harness.run_evaluation \
    --predictions_path gold \
    --max_workers 1 \
    --instance_ids sympy__sympy-20590 \
    --run_id validate-gold
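
Here --predictions_path gold tells the harness to evaluate the dataset's own reference (gold) patches, so a successful run mainly confirms that the harness and your Docker setup work. While the evaluation is running, you can watch the per-instance containers from a second terminal:

# The harness builds images and runs one container per task instance
docker ps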

Run the SWE-Bench evaluation

  1. Install the required Python packages; we recommend using conda: create an environment with conda create --name <env> --file requirements.txt and activate it with conda activate <env>.
  2. Get your predictions by running the appropriate part of Evaluation.ipynb, making sure to choose the correct dataset (either swe-bench.json for the full dataset or swe-bench-lite.json for a smaller version). Alternatively, you can run evaluation.py inside the Evaluation folder.
  3. To evaluate the predictions, navigate to the SWE-Bench folder
cd Evaluation
cd SWE-bench
  4. Run the evaluation with the following command; you may need to customize it:
python -m swebench.harness.run_evaluation \
 --predictions_path ../prediction.json \
 --max_workers 1 \
 --dataset_name ../swe-bench-lite.json \
 --run_id YOUR_ID
  5. You should find a JSON report listing the evaluation results of your predictions, named with YOUR_ID, inside the SWE-bench directory.
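
As a quick sanity check (the exact report filename depends on your model name and run id, so this is only a sketch):

# List and pretty-print the report produced for your run
ls *YOUR_ID*.json
python -m json.tool *YOUR_ID*.json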

Run the SWE-Bench evaluation on "BwUniCluster" or another Slurm batch system

  1. Connect and log in to your server.
  2. Clone this repository and put an ollama binary into the folder.
  3. Create a new file evaluate.sh (e.g. with vi evaluate.sh) with the following content:
#!/bin/bash
#SBATCH --job-name=evaluate_gemma_2
#SBATCH --mem 20000
#SBATCH --nodes 1
#SBATCH --time 5
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
module load devel/miniconda
# Start the Ollama server on CPU (drop OLLAMA_LLM_LIBRARY to use CUDA instead)
OLLAMA_LLM_LIBRARY="cpu_avx2" ./ollama-linux-amd64 serve &
# Give the server time to start and download the model
sleep 60
python evaluate.py --logging_enabled=True --model_name="gemma2" --output_file="prediction_gemma2B.json"
  • Set the memory depending on the model, e.g. for a 2B model memory <= 10GB, for an 8B model memory <= 20GB.
  • Remove the line setting CUDA_VISIBLE_DEVICES and remove OLLAMA_LLM_LIBRARY="cpu_avx2" if you want to use CUDA.
  • Modify job-name, model_name, and output_file as needed.
  • The sleep 60 is there because downloading the model takes some time.
  4. Queue the job with sbatch -p single evaluate.sh
  5. To check the progress: squeue
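
By default, Slurm writes the job's stdout and stderr to a file named slurm-<jobid>.out in the submission directory, so the evaluation output can be followed with something like:

# Replace 123456 with the job id reported by sbatch (or shown by squeue)
tail -f slurm-123456.out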

For more on Slurm jobs, check this website out.

Resources

Very Relevant Papers:


Less Relevant Papers:


Misc:
