In [1]:
import pandas as pd
from pandas import Series
import git
import os
import validators
import shutil

# Evaluating the LLM Agent on SWE-bench

Two datasets are available for generating predictions: `swe-bench.json`, which has 2200 entries, and `swe-bench-dev-dataset.json`, which has 225 entries. Both come from [SWE-Bench](https://github.com/princeton-nlp/SWE-bench/tree/main).

In [2]:
df = pd.read_json("SWEBench/swe-bench-dev-dataset.json")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 225 entries, 0 to 224
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype              
---  ------                    --------------  -----              
 0   repo                      225 non-null    object             
 1   instance_id               225 non-null    object             
 2   base_commit               225 non-null    object             
 3   patch                     225 non-null    object             
 4   test_patch                225 non-null    object             
 5   problem_statement         225 non-null    object             
 6   hints_text                225 non-null    object             
 7   created_at                225 non-null    datetime64[ns, UTC]
 8   version                   225 non-null    float64            
 9   FAIL_TO_PASS              225 non-null    object             
 10  PASS_TO_PASS              225 non-null    object             
 11  environment_setup_c

In [3]:
df.iloc[0]

repo                                                        sqlfluff/sqlfluff
instance_id                                           sqlfluff__sqlfluff-4764
base_commit                          a820c139ccbe6d1865d73c4a459945cd69899f8f
patch                       diff --git a/src/sqlfluff/cli/commands.py b/sr...
test_patch                  diff --git a/test/cli/commands_test.py b/test/...
problem_statement           Enable quiet mode/no-verbose in CLI for use in...
hints_text                                                                   
created_at                                          2023-04-16 14:24:42+00:00
version                                                                   1.4
FAIL_TO_PASS                [test/cli/commands_test.py::test__cli__fix_mul...
PASS_TO_PASS                [test/cli/commands_test.py::test__cli__command...
environment_setup_commit             d19de0ecd16d298f9e3bfb91da122734c40c01e5
Name: 0, dtype: object

After running our LLM on the dataset to generate solutions, the output must be in the following format:
```
{
    "instance_id": "<Unique task instance ID>",
    "model_patch": "<.patch file content string>",
    "model_name_or_path": "<Model name here (e.g. SWE-Llama-13b)>"
}
```
Multiple predictions are collected in a list: `[<prediction 1>, <prediction 2>,... <prediction n>]`.

**Example:**
```
{
    "instance_id": "django__django-15127",
    "model_name_or_path": "test",
    "model_patch": "--- a/django/contrib/messages/storage/base.py\n+++ b/django/contrib/messages/storage/base.py\n@@ -52,6 +52,7 @@\n                 if self._loaded_data is None:\n                     self._loaded_data = self.load()\n                 level, message, extra_tags = self._loaded_data\n+                extra_tags.update(self.get_level_tags())\n                 return {\n                     'message': message,\n                     'level': level,\n"
}
```
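The format above can be produced with the standard `json` module. The following is a minimal sketch, not part of the evaluation harness: the helper name `write_predictions`, the output file name, and the shortened patch string are assumptions made for illustration.

```python
import json
import os
import tempfile

# Keys every prediction needs, taken from the format described above
REQUIRED_KEYS = {"instance_id", "model_patch", "model_name_or_path"}

def write_predictions(predictions, path):
    """Write a list of prediction dicts to `path` as a single JSON array."""
    for prediction in predictions:
        missing = REQUIRED_KEYS - prediction.keys()
        if missing:
            raise ValueError(f"prediction is missing keys: {sorted(missing)}")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(predictions, f, indent=4)

# Illustrative entry; the patch content is a shortened placeholder
predictions = [
    {
        "instance_id": "django__django-15127",
        "model_name_or_path": "test",
        "model_patch": "--- a/file.py\n+++ b/file.py\n",
    }
]

# Hypothetical output location, chosen so the example leaves no file in the project
out_path = os.path.join(tempfile.gettempdir(), "predictions_example.json")
write_predictions(predictions, out_path)
```

Validating the keys before writing means a malformed entry fails at write time rather than later during evaluation.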

# Generating our Predictions

## Defining the AgentWrapper

We first define an `AgentWrapper` whose job is to:
- Clone the repo and reset it to the correct commit.
- Call our internal agent, which makes the changes.
- Stage the changes.
- Compute the git diff, which it returns.

In [4]:
class AgentStub():
    """Stand-in agent that writes a fixed file instead of calling an LLM."""
    def __init__(self, start_cwd):
        self.start_cwd = start_cwd

    def __call__(self, userprompt):
        new_file = os.path.join(self.start_cwd, 'test_file.md')

        # The context manager closes the file even if the write fails
        with open(new_file, 'w+') as fp:
            fp.write('This is a test file, which tests whether the git diff gets calculated correctly.')

        return ""

class AgentWrapper():
    def __init__(self, agent, name, working_directory="repos", logging=True):
        self.name = name
        self.working_directory = working_directory
        self.agent = agent
        self.logging_enabled = logging

        if not os.path.isdir(working_directory):
            os.makedirs(working_directory)

    def predict(self, row_input: Series):
        repo_dir = self._clone_repo(row_input["repo"], row_input["base_commit"])
        
        result = self.agent("Question: " + str(row_input["problem_statement"]))
        
        if self.logging_enabled:
            print("LOGGING: Finished the Agent with the following as return: " + str(result))
        repo = git.Repo(repo_dir)
        repo.git.add("*")
        
        diff = repo.git.diff("--cached")
        
        if self.logging_enabled:
            print("LOGGING: Finished caculating the git diff.")
            if diff == '': print("LOGGING: Git Diff is empty.")
        
        return diff

    def _clone_repo(self, repo_name: str, base_commit: str, overwrite_cloned_repo : bool = True):
        repo_url = "https://github.com/" + repo_name
        repo_dir = os.path.join(self.working_directory, repo_name.split('/', 1)[1])
        
        if not validators.url(repo_url):
            raise Exception("The Repo url is not valid: " + repo_url)
                    
        if not (os.path.isdir(repo_dir)) or (overwrite_cloned_repo):
            if os.path.exists(repo_dir):
                shutil.rmtree(repo_dir)
            os.makedirs(repo_dir)

            # clones the repo on which llm will work
            git.Repo.clone_from(repo_url, repo_dir)

        if self.logging_enabled:
            print("LOGGING: Repo " + str(repo_url) + "downloaded in folder `" + str(repo_dir) + "`.")
            
        # we need to make sure we have the correct commit stage
        repo = git.Repo(repo_dir)
        repo.git.reset('--hard', base_commit)

        if self.logging_enabled:
            print("LOGGING: Reset Git repo to the commit " + str(base_commit))
        
        return repo_dir

## Testing AgentWrapper

This checks that cloning a repo and resetting it to the correct git commit both work.

In [5]:
df.iloc[0]["base_commit"]

'a820c139ccbe6d1865d73c4a459945cd69899f8f'

In [6]:
stub = AgentStub("repos")
agent = AgentWrapper(stub, "stub")

print(agent.name)
print("----------------")
print("RESULT: " + str(agent.predict(df.iloc[0])))

stub
----------------
LOGGING: Repo https://github.com/sqlfluff/sqlfluffdownloaded in folder `repos/sqlfluff`.
LOGGING: Reset Git repo to the commit a820c139ccbe6d1865d73c4a459945cd69899f8f
LOGGING: Finished the Agent with the following as return: 
LOGGING: Finished caculating the git diff.
LOGGING: Git Diff is empty.
RESULT: 
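The empty diff is consistent with where the stub writes its file: `AgentStub("repos")` creates `repos/test_file.md`, whereas the clone lives in `repos/sqlfluff`, so `git add` inside the clone never sees the new file. The helper below is a hypothetical sketch (`is_inside_repo` is not part of the notebook) for checking whether a path lies inside a repository directory before expecting it in the diff.

```python
from pathlib import Path

def is_inside_repo(file_path, repo_dir):
    """Return True if `file_path` is `repo_dir` itself or lies somewhere below it."""
    # resolve() makes both paths absolute, so relative inputs compare correctly
    file_path = Path(file_path).resolve()
    repo_dir = Path(repo_dir).resolve()
    return file_path == repo_dir or repo_dir in file_path.parents

# Path written by AgentStub("repos") versus a path inside the clone
print(is_inside_repo("repos/test_file.md", "repos/sqlfluff"))
print(is_inside_repo("repos/sqlfluff/test_file.md", "repos/sqlfluff"))
```

Under this reading, constructing the stub with the clone's directory (for example `AgentStub("repos/sqlfluff")`) would be one way to obtain a non-empty diff.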


# Testing SmolCoder

This requires the `phi3:latest` model to be running in Ollama.

In [7]:
import sys
import os

sys.path.append(str(os.path.abspath('SmolCoder')))
print(sys.path)

['/home/lupos/miniconda3/envs/llm/lib/python311.zip', '/home/lupos/miniconda3/envs/llm/lib/python3.11', '/home/lupos/miniconda3/envs/llm/lib/python3.11/lib-dynload', '', '/home/lupos/miniconda3/envs/llm/lib/python3.11/site-packages', '/home/lupos/interactive-learning/SmolCoder']


In [8]:
from pathlib import Path

from SmolCoder.src.agent import SmolCoder
from SmolCoder.src.llm_wrapper import LLM
from SmolCoder.src.toolkit import Toolkit

from SmolCoder.src.tools.list_methods import ListMethods
from SmolCoder.src.tools.list_classes import ListClasses
from SmolCoder.src.tools.list_files import ListFiles
from SmolCoder.src.tools.replace_method import ReplaceMethod
from SmolCoder.src.tools.finish import Finish
from SmolCoder.src.tools.execute_python import ExecutePythonCode
from SmolCoder.src.tools.show_method import ShowMethodBody
from SmolCoder.src.tools.move_folder import MoveFolder

In [9]:
# Tool Definition
class_sumary = ListMethods()
list_classes = ListClasses()
list_files = ListFiles()
replace_method = ReplaceMethod()
finish = Finish()
execute_python = ExecutePythonCode()
show_method = ShowMethodBody()
move_folder = MoveFolder()

## Testing Execute Python Tool

In [10]:
tools = Toolkit([execute_python])

model = LLM("llama3:instruct")
smolCoder = SmolCoder(model, Path("tests/test_codebase"), tools, mode=1)
agent = AgentWrapper(smolCoder, "smolecoder")

prompt = df.iloc[0]

In [11]:
# result = agent.predict(prompt)
#print("RESULT: " + str(result))

In [12]:
#print(smolCoder.inspect_history(n=5))

# SmolCoder on SWE

This tests SmolCoder on a single instance of SWE-bench.
The agent does not try to reproduce the bug first; it is bare-bones ReAct with tools.
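In the ReAct prompt printed by the cell below, actions are written as `Tool_Name[argument]`, for example `List_Files[.]`. As a rough illustration of how such a string could be split into a tool name and an argument, here is a minimal sketch. It is an assumption for illustration only and not SmolCoder's actual parser; `parse_action` and `ACTION_PATTERN` are hypothetical names.

```python
import re

# Optional "Action:" prefix, then a tool name followed by a bracketed argument
ACTION_PATTERN = re.compile(r"^\s*(?:Action:\s*)?(\w+)\[(.*)\]\s*$", re.DOTALL)

def parse_action(text):
    """Return (tool_name, argument), or None if `text` is not a well-formed action."""
    match = ACTION_PATTERN.match(text)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(parse_action("List_Files[.]"))                # ('List_Files', '.')
print(parse_action("Action: List_Classes[test.py]"))  # ('List_Classes', 'test.py')
print(parse_action("no action here"))               # None
```

Returning `None` for malformed input lets the calling loop feed an error back to the model as an observation instead of raising.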

In [13]:
toolkit = Toolkit([list_files, move_folder, list_classes, class_sumary, show_method, replace_method, finish])

model = LLM("phi3:latest")
smol_coder = SmolCoder(model, Path("repos"), toolkit=toolkit)

agent = AgentWrapper(smol_coder, working_directory="repos", name="SmolCoder")

In [14]:
print(agent.name)
print("----------------")
print(agent.predict(df.iloc[0]))

SmolCoder
----------------
LOGGING: Repo https://github.com/sqlfluff/sqlfluffdownloaded in folder `repos/sqlfluff`.
LOGGING: Reset Git repo to the commit a820c139ccbe6d1865d73c4a459945cd69899f8f

------------

You will be given `question` and you will respond with `answer`.

To do this, you will interleave Thought, Action, and Observation steps.

Thought can reason about the current situation.
Action can be the following types, 
(1) List_Files[folder], which lists all the files and subfolder that are in the folder. Example use: List_Files[some_dir] or List_Files[.] to list the files of the current directory
(2) Move_to_Folder[new_directory], which sets the current working directory to `new_directory` Example use: Move_to_Folder[new_directory]
(3) List_Classes[file_name], which lists all class names and their docstring that are in the Python file file_name. Example use: List_Classes[test.py]
(4) List_Methods[class_name], which lists the signatures and docstring of all the method of the 

AttributeError: 'Thought' object has no attribute 'unpack'

In [None]:
# print(smol_coder.inspect_history(n=5))

## Generating all Predictions

When running this on a server, the process may crash or raise an error that is not caught, so it is important to write each prediction to disk as soon as it is produced.


In [17]:
# This implementation uses checkpoints: if the program is
# interrupted, it can resume where it left off.

import json

#tools = Toolkit([class_sumary, list_classes, list_files, finish])
#model = LLM("phi3:latest")
#smol_coder = SmolCoder(model, Path("repos"), tools)
#agent = AgentWrapper(smol_coder, working_directory="repos", name="SmolCoder")

# AgentStub needs the directory it writes to; AgentWrapper's second argument is its name
stub = AgentStub("repos")
agent = AgentWrapper(stub, "stub", working_directory="repos")

checkpoint_file = 'checkpoint.txt'
resume_index = 0  # index of the next row to process

activated = True

if activated:
    # Read the index of the next unprocessed row, if a checkpoint exists
    try:
        with open(checkpoint_file, 'r') as f:
            resume_index = int(f.read().strip())
    except FileNotFoundError:
        pass
    except Exception as e:
        print(f"Error reading checkpoint file: {e}")

    if resume_index < len(df):
        # Append, so predictions from an earlier run are kept.
        # Plain utf-8 avoids a byte-order mark, which JSON parsers may reject.
        with open('predictions.json', 'a', encoding="utf-8") as json_file:
            if resume_index == 0:
                json_file.write('[')  # Start of JSON array
                json_file.write('\n')
            # Generating our solution
            for index, row in df.iterrows():
                if index % 10 == 0: print("Current idx: " + str(index))
                # Skip rows that were already processed
                if index < resume_index:
                    continue

                prediction = {
                    "instance_id": row["instance_id"],
                    "model_patch": agent.predict(row),
                    "model_name_or_path": agent.name
                }
                # Convert the dictionary to a JSON formatted string and write to file
                json_data = json.dumps(prediction, indent=4)
                json_file.write(json_data)
                if index < len(df) - 1:
                    json_file.write(',')
                json_file.write('\n')
                json_file.flush()  # push the entry to disk before updating the checkpoint

                # Store the next index, so a resumed run does not repeat this row
                with open(checkpoint_file, 'w') as f:
                    f.write(str(index + 1))

            if index == len(df) - 1:
                json_file.write(']')  # End of JSON array

Current idx: 0
Current idx: 10
Current idx: 20
Current idx: 30
Current idx: 40
Current idx: 50
Current idx: 60
Current idx: 70
Current idx: 80
Current idx: 90
Current idx: 100
Current idx: 110
Current idx: 120
Current idx: 130
Current idx: 140
Current idx: 150
Current idx: 160
Current idx: 170
Current idx: 180
Current idx: 190
Current idx: 200
Current idx: 210
Current idx: 220
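The checkpointing cell above has to manage the opening bracket, commas and closing bracket of the JSON array by hand. An alternative, shown here only as a sketch and not as the approach used in this notebook, is to append one JSON object per line (JSON Lines) and build the final array once at the end. All function names below are hypothetical.

```python
import json
import os

def append_prediction(jsonl_path, prediction):
    """Append one prediction as a single line of JSON."""
    with open(jsonl_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(prediction) + "\n")

def processed_ids(jsonl_path):
    """Return the instance IDs already written, so a resumed run can skip them."""
    if not os.path.exists(jsonl_path):
        return set()
    with open(jsonl_path, encoding="utf-8") as f:
        return {json.loads(line)["instance_id"] for line in f if line.strip()}

def jsonl_to_array(jsonl_path, json_path):
    """Convert the JSON Lines file into the single JSON array SWE-bench expects."""
    with open(jsonl_path, encoding="utf-8") as f:
        predictions = [json.loads(line) for line in f if line.strip()]
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(predictions, f, indent=4)
    return len(predictions)
```

With this layout the predictions file doubles as the checkpoint, since the set of written instance IDs says which rows are done. One caveat: a crash in the middle of a write can leave a partial last line, which would have to be removed before `json.loads` accepts the file.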


# Meta Tokenizer

In [15]:
from pathlib import Path

from SmolCoder.src.llm_wrapper import LLM
from SmolCoder.src.prompting_strategy import PromptingStrategy
from SmolCoder.src.toolkit import Toolkit
from SmolCoder.src.tools.list_methods import ListMethods
from SmolCoder.src.tools.list_files import ListFiles
from SmolCoder.src.tools.list_classes import ListClasses
from SmolCoder.src.tools.finish import Finish
from SmolCoder.src.meta_tokenizer import MetaTokenizer

from SmolCoder.src.agent import SmolCoder

list_methods = ListMethods()
list_classes = ListClasses()
list_files = ListFiles()
finish = Finish()

toolkit = Toolkit([list_methods, list_classes, list_files, finish])

smol = SmolCoder(model=LLM("phi3:latest"), codebase_dir= Path("test_codebase/"), toolkit=toolkit)

In [16]:
smol("Question: What methods do the classes in test.py have?")


------------

You will be given `question` and you will respond with `answer`.

To do this, you will interleave Thought, Action, and Observation steps.

Thought can reason about the current situation.
Action can be the following types, 
(1) List_Methods[class_name], which lists the signatures and docstring of all the method of the class `class_name`. Example use: List_Methods[MyClass]
(2) List_Classes[file_name], which lists all class names and their docstring that are in the Python file file_name. Example use: List_Classes[test.py]
(3) List_Files[folder], which lists all the files and subfolder that are in the folder. Example use: List_Files[some_dir] or List_Files[.] to list the files of the current directory
(4) Finish[answer], which finishes the program and returns the answer. Example use: Finish["The Answer is 42."]
 Input variables of the tools do not need quotation marks around them. In addition, do NOT use the `finish` tool before having made all changes to remedy the issue.
-

'You will be given `question` and you will respond with `answer`.\n\nTo do this, you will interleave Thought, Action, and Observation steps.\n\nThought can reason about the current situation.\nAction can be the following types, \n(1) List_Methods[class_name], which lists the signatures and docstring of all the method of the class `class_name`. Example use: List_Methods[MyClass]\n(2) List_Classes[file_name], which lists all class names and their docstring that are in the Python file file_name. Example use: List_Classes[test.py]\n(3) List_Files[folder], which lists all the files and subfolder that are in the folder. Example use: List_Files[some_dir] or List_Files[.] to list the files of the current directory\n(4) Finish[answer], which finishes the program and returns the answer. Example use: Finish["The Answer is 42."]\n Input variables of the tools do not need quotation marks around them. In addition, do NOT use the `finish` tool before having made all changes to remedy the issue.\n---\