# Manual HumanEval

## 1. Run HumanEval(Plus) on pre-generated GPT-4 output

In [None]:
# Run the EvalPlus evaluation on pre-generated GPT-4 outputs.
# The samples were downloaded from
# https://github.com/evalplus/evalplus/releases/download/v0.1.0/gpt-4_temp_0.0.zip
# (the file is described in more detail at https://github.com/evalplus/evalplus/releases/tag/v0.1.0)

! evalplus.evaluate --dataset humaneval --samples gpt-4_temp_0.0

## 2. Select hard problems (unsolved by GPT-4)

Note: in testing, `HumanEval/32` did not appear to be easily solvable even with the canonical solution, which makes it less interesting for manual investigation.

In [None]:
# Load the results and find failed tests

import pandas as pd

GPT4_SAMPLES_DIR = "gpt-4_temp_0.0"

# Read the results of the evaluation
df = pd.read_json(f'{GPT4_SAMPLES_DIR}/eval_results.json')

# Keep tasks whose first sample did not pass the base tests,
# then sort the task ids by their numeric suffix
filtered_df = df[df['eval'].apply(lambda x: x["base"][0][0] != "success")]
failed_ids = sorted(filtered_df.index, key=lambda item: int(item.split("/")[1]))
failed_ids
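
The same selection can be sketched without pandas, which also makes it easy to compare failures on the base tests with failures on the extra EvalPlus tests. This is a minimal sketch: `sample_eval` is a hypothetical miniature of the `eval` mapping, the `"failed"` status string is a placeholder, and the nesting simply mirrors what the cell above indexes (`x["base"][0][0]`).

```python
# Hypothetical miniature of the "eval" mapping in eval_results.json.
# Only the nesting (suite -> first sample -> status) is taken from the cell above;
# the "failed" status string is a placeholder for any non-"success" value.
sample_eval = {
    "HumanEval/10": {"base": [["success", []]], "plus": [["success", []]]},
    "HumanEval/2": {"base": [["success", []]], "plus": [["failed", []]]},
    "HumanEval/32": {"base": [["failed", []]], "plus": [["failed", []]]},
}


def failed_task_ids(eval_map, suite="base"):
    """Return task ids whose first sample did not pass `suite`, in numeric order."""
    failed = [tid for tid, res in eval_map.items() if res[suite][0][0] != "success"]
    return sorted(failed, key=lambda tid: int(tid.split("/")[1]))


base_failed = failed_task_ids(sample_eval, "base")
plus_failed = failed_task_ids(sample_eval, "plus")
print("Failed base tests:", base_failed)
print("Failed plus tests:", plus_failed)
```

Sorting on the numeric suffix matters because a plain string sort would place `HumanEval/10` before `HumanEval/2`.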

## 3. Generate Task

Generate the prompt and the canonical solution for a given task.

In [None]:
# For a given test case id, build the canonical answer

import evalplus.data

TESTCASE_ID = "HumanEval/68"

# Load the dataset once and reuse the selected problem
problem = evalplus.data.get_human_eval_plus()[TESTCASE_ID]
prompt = problem["prompt"]
full_solution = prompt + problem["canonical_solution"]

print(f"PROMPT\n\n{prompt}\n\n------\n\nFULL SOLUTION\n\n{full_solution}")
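
Before running the full evaluation, it can help to check that a candidate solution at least executes and returns a plausible value. The following is a minimal sketch, not part of EvalPlus: `smoke_test` and `toy_solution` are hypothetical names, and the toy function stands in for a real task. Only run `exec` on code you trust.

```python
def smoke_test(solution_src, entry_point, args, expected):
    """Execute solution source in a fresh namespace and check a single call."""
    namespace = {}
    exec(solution_src, namespace)  # defines the function(s) from the source text
    return namespace[entry_point](*args) == expected


# Hypothetical stand-in for a prompt plus solution body
toy_solution = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b
'''

ok = smoke_test(toy_solution, "add", (2, 3), 5)
print(f"Smoke test passed: {ok}")
```

For a real task, `solution_src` would be a string such as `full_solution`, and `entry_point` the name of the function defined in the prompt.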

## 4. Test solution

We test the solution using the EvalPlus command-line tools. This requires a hacky workaround: the custom solution is inserted into a complete collection of HumanEval generations, because the tool evaluates a full set of samples. There are certainly better ways of doing this, but it works for now.


In [None]:
SOLUTION_TESTCASE_ID = TESTCASE_ID
SOLUTION = """<YOURSOLUTION>"""
#SOLUTION = full_solution

In [None]:
# To test a solution we need a full copy of the model samples.
# To achieve this we copy the existing GPT-4 samples
# and swap in the new solution for the selected task.

import shutil
import uuid
import os

src = GPT4_SAMPLES_DIR
random_suffix = str(uuid.uuid4())[:6]
# Append the random UUID to the original directory name
dst = f"tmp_samples/{src}_{random_suffix}"
# Copy the directory
shutil.copytree(src, dst)

# Overwrite the stored sample for the selected task with the custom solution
solution_file = f"{dst}/{SOLUTION_TESTCASE_ID.replace('/','_')}/0.py"
with open(solution_file, 'w') as file:
    file.write(SOLUTION)

# remove existing results, otherwise tests don't rerun
results_json_path = dst + "/eval_results.json"
os.remove(results_json_path)

os.environ["TEMP_SAMPLE_DIR"] = dst
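
The copy-and-swap step can also be wrapped in a small helper. This is a minimal sketch under stated assumptions: `make_patched_samples` is a hypothetical name, the `<task>_<id>/0.py` layout is taken from the cell above, and the demo runs on a throwaway directory instead of the real GPT-4 samples.

```python
import os
import shutil
import tempfile


def make_patched_samples(src_dir, task_id, solution, dst_root):
    """Copy a samples directory and overwrite one task's sample with `solution`."""
    dst = os.path.join(dst_root, os.path.basename(src_dir) + "_patched")
    shutil.copytree(src_dir, dst)
    target = os.path.join(dst, task_id.replace("/", "_"), "0.py")
    with open(target, "w") as fh:
        fh.write(solution)
    # Drop cached results so the evaluation is rerun on the patched copy
    stale = os.path.join(dst, "eval_results.json")
    if os.path.exists(stale):
        os.remove(stale)
    return dst


# Demo on a throwaway directory that mimics the assumed layout
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "samples")
    os.makedirs(os.path.join(src, "HumanEval_68"))
    with open(os.path.join(src, "HumanEval_68", "0.py"), "w") as fh:
        fh.write("old")
    with open(os.path.join(src, "eval_results.json"), "w") as fh:
        fh.write("{}")

    patched_dir = make_patched_samples(src, "HumanEval/68", "new", root)
    with open(os.path.join(patched_dir, "HumanEval_68", "0.py")) as fh:
        patched_text = fh.read()
    with open(os.path.join(src, "HumanEval_68", "0.py")) as fh:
        original_text = fh.read()
    stale_removed = not os.path.exists(os.path.join(patched_dir, "eval_results.json"))

print(patched_text, original_text, stale_removed)
```

Working on a copy keeps the original samples and their cached results intact, so the baseline numbers remain reproducible.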


In [None]:
# Run eval on custom samples (this basically runs a bunch of unit tests)
# Note this will run all tests just to verify one result
! evalplus.evaluate --dataset humaneval --samples ${TEMP_SAMPLE_DIR}

Results for the GPT-4 outputs without any changes:

```
Base
{'pass@1': 0.8841463414634146}
Base + Extra
{'pass@1': 0.7804878048780488}
```
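
With one sample per problem, pass@1 is simply the fraction of problems solved, so the scores above can be converted back into counts. A short sketch, assuming the standard HumanEval size of 164 problems (`solved_count` is a hypothetical helper):

```python
TOTAL_PROBLEMS = 164  # size of the HumanEval benchmark


def solved_count(pass_at_1, total=TOTAL_PROBLEMS):
    """Convert a single-sample pass@1 score into a number of solved problems."""
    return round(pass_at_1 * total)


base_solved = solved_count(0.8841463414634146)
plus_solved = solved_count(0.7804878048780488)
print(f"Base: {base_solved}/{TOTAL_PROBLEMS} solved, {TOTAL_PROBLEMS - base_solved} failed")
print(f"Base + Extra: {plus_solved}/{TOTAL_PROBLEMS} solved, {TOTAL_PROBLEMS - plus_solved} failed")
```

The number of base failures computed this way should match the length of `failed_ids` from section 2.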

In [None]:
df_custom = pd.read_json(results_json_path)
eval_result = df_custom[df_custom.index == SOLUTION_TESTCASE_ID]["eval"].iloc[0]
passed_default_tests = dict(eval_result)['base'][0][0] == "success"
passed_plus_tests = dict(eval_result)['plus'][0][0] == "success"
print(f"Passed default HumanEval unit tests: {passed_default_tests}")
print(f"Passed HumanEvalPlus-specific unit tests: {passed_plus_tests}")

# clean up
shutil.rmtree(dst)



Done. You have successfully tested your EvalPlus test case.