In [11]:
import pandas as pd
import json

# Evaluating the LLM Agent on SWE-bench

We have two datasets we can generate predictions for: `swe-bench.json`, which has 2200 entries, and `swe-bench-dev-dataset.json`, which has 225 entries. Both come from [SWE-bench](https://github.com/princeton-nlp/SWE-bench/tree/main).

In [6]:
df = pd.read_json("SWEBench/swe-bench-dev-dataset.json")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 225 entries, 0 to 224
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype              
---  ------                    --------------  -----              
 0   repo                      225 non-null    object             
 1   instance_id               225 non-null    object             
 2   base_commit               225 non-null    object             
 3   patch                     225 non-null    object             
 4   test_patch                225 non-null    object             
 5   problem_statement         225 non-null    object             
 6   hints_text                225 non-null    object             
 7   created_at                225 non-null    datetime64[ns, UTC]
 8   version                   225 non-null    float64            
 9   FAIL_TO_PASS              225 non-null    object             
 10  PASS_TO_PASS              225 non-null    object             
 11  environment_setup_c

In [9]:
df.head(1)

| | repo | instance_id | base_commit | patch | test_patch | problem_statement | hints_text | created_at | version | FAIL_TO_PASS | PASS_TO_PASS | environment_setup_commit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sqlfluff/sqlfluff | sqlfluff__sqlfluff-4764 | a820c139ccbe6d1865d73c4a459945cd69899f8f | diff --git a/src/sqlfluff/cli/commands.py b/sr... | diff --git a/test/cli/commands_test.py b/test/... | Enable quiet mode/no-verbose in CLI for use in... | | 2023-04-16 14:24:42+00:00 | 1.4 | [test/cli/commands_test.py::test__cli__fix_mul... | [test/cli/commands_test.py::test__cli__command... | d19de0ecd16d298f9e3bfb91da122734c40c01e5 |


After running our LLM on the dataset to generate a solution for each problem, each prediction must be in the following format:
```
{
    "instance_id": "<Unique task instance ID>",
    "model_patch": "<.patch file content string>",
    "model_name_or_path": "<Model name here (e.g. SWE-Llama-13b)>"
}
```
Multiple predictions are stored as a list: `[<prediction 1>, <prediction 2>, ..., <prediction n>]`.

**Example:**
```
{
    "instance_id": "django__django-15127",
    "model_name_or_path": "test",
    "model_patch": "--- a/django/contrib/messages/storage/base.py\n+++ b/django/contrib/messages/storage/base.py\n@@ -52,6 +52,7 @@\n                 if self._loaded_data is None:\n                     self._loaded_data = self.load()\n                 level, message, extra_tags = self._loaded_data\n+                extra_tags.update(self.get_level_tags())\n                 return {\n                     'message': message,\n                     'level': level,\n"
}
``` 
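Before saving, it can help to check each prediction against the format above. The following is a minimal sketch and not part of the SWE-bench tooling: `REQUIRED_KEYS` and `validate_prediction` are hypothetical names, and the check only covers key presence and string types, not whether the patch actually applies.

```python
# Hypothetical helper (not part of SWE-bench): checks the shape of one prediction.
REQUIRED_KEYS = {"instance_id", "model_patch", "model_name_or_path"}


def validate_prediction(prediction: dict) -> list:
    """Return a list of problems found; an empty list means the shape is fine."""
    problems = []
    present = set(prediction)
    missing = REQUIRED_KEYS - present
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    # Every required key that is present must hold a string value.
    for key in sorted(REQUIRED_KEYS & present):
        if not isinstance(prediction[key], str):
            problems.append(f"{key} must be a string")
    return problems


example = {
    "instance_id": "django__django-15127",
    "model_name_or_path": "test",
    "model_patch": "--- a/file.py\n+++ b/file.py\n",  # shortened placeholder patch
}
print(validate_prediction(example))
print(validate_prediction({"instance_id": 42}))
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a prediction in a single pass.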

In [13]:
# For this we will use a list of dictionaries with the keys named above.
predictions = []

# Generating our Predictions

In [25]:
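class LLM_Stub():
    """Placeholder model that returns an empty patch for every problem."""

    def __init__(self):
        self.name = "stub"

    def predict(self, prompt: str) -> str:
        # A real model would return the content of a .patch file here.
        return ""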

In [28]:
llm = LLM_Stub()

# Generate one prediction per task instance
for _, row in df.iterrows():
    model_patch = llm.predict(row["problem_statement"])
    predictions.append({
        "instance_id": row["instance_id"],
        "model_patch": model_patch,
        "model_name_or_path": llm.name,
    })

# Saving our data to disk

In [29]:
# Convert the list of dictionaries to a JSON formatted string
json_data = json.dumps(predictions, indent=4)

# Save the JSON string to a file
with open('predictions.json', 'w') as json_file:
    json_file.write(json_data)
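
After saving, a quick round trip can confirm that the file is valid JSON and that every task instance received a prediction. This is a sketch under assumptions: `load_predictions` and `missing_instance_ids` are hypothetical helpers, and the demo uses made-up IDs and a temporary file so that it runs on its own. In this notebook, the expected IDs would be `df["instance_id"].tolist()` and the path would be `predictions.json`.

```python
import json
import os
import tempfile


def load_predictions(path: str) -> list:
    """Read a predictions file back from disk as a list of dictionaries."""
    with open(path) as json_file:
        return json.load(json_file)


def missing_instance_ids(predictions: list, expected_ids: list) -> list:
    """Return the expected instance IDs that have no prediction, in order."""
    predicted = {p["instance_id"] for p in predictions}
    return [i for i in expected_ids if i not in predicted]


# Self-contained demo with made-up IDs (not real SWE-bench instances).
demo_predictions = [
    {"instance_id": "demo__repo-1", "model_patch": "", "model_name_or_path": "stub"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "predictions.json")
    with open(demo_path, "w") as json_file:
        json.dump(demo_predictions, json_file, indent=4)
    loaded = load_predictions(demo_path)

print(missing_instance_ids(loaded, ["demo__repo-1", "demo__repo-2"]))
```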
