# Manual HumanEval

## 1. Run HumanEval(Plus) on pre-generated GPT-4 output

In [None]:
# Run the EvalPlus evaluation on pre-generated GPT-4 outputs
# Downloaded from 
# https://github.com/evalplus/evalplus/releases/download/v0.1.0/gpt-4_temp_0.0.zip
# (File described in more detail in https://github.com/evalplus/evalplus/releases/tag/v0.1.0)

! evalplus.evaluate --dataset humaneval --samples gpt-4_temp_0.0

## 2. Select hard problems (unsolved by GPT-4)

Note: testing indicated that `HumanEval/32` may not pass even with the canonical solution, which makes it less interesting for manual investigation.

In [None]:
# Load the results and find failed tests

import pandas as pd

GPT4_SAMPLES_DIR = "gpt-4_temp_0.0"

# Read the results of the evaluation
df = pd.read_json(f'{GPT4_SAMPLES_DIR}/eval_results.json')

# Keep the tasks whose base (original HumanEval) test status is not "success"
filtered_df = df[df['eval'].apply(lambda x: x["base"][0][0] != "success")]
failed_ids = sorted(filtered_df.index, key=lambda item: int(item.split("/")[1]))
failed_ids
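
The filter above depends on the layout of `eval_results.json`. As a minimal sketch, assuming each task maps to `base` and `plus` lists whose first entry starts with a status string (which is what the indexing `x["base"][0][0]` implies), the same selection can be written without pandas. The data in `toy_results`, the `"failed"` status string, and the helper name `failed_task_ids` are made up for illustration.

```python
# Toy stand-in for the "eval" part of eval_results.json (illustrative data only).
toy_results = {
    "HumanEval/100": {"base": [["failed", []]], "plus": [["failed", []]]},
    "HumanEval/32": {"base": [["failed", []]], "plus": [["failed", []]]},
    "HumanEval/0": {"base": [["success", []]], "plus": [["success", []]]},
}


def failed_task_ids(results):
    """Return the ids of tasks whose base status is not "success", in numeric order."""
    failed = [
        task_id
        for task_id, outcome in results.items()
        if outcome["base"][0][0] != "success"
    ]
    # Sort by the numeric suffix; a plain string sort would put
    # "HumanEval/100" before "HumanEval/32".
    return sorted(failed, key=lambda task_id: int(task_id.split("/")[1]))


print(failed_task_ids(toy_results))  # ['HumanEval/32', 'HumanEval/100']
```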

Notes on the hard tasks:
- 32: Potential bug in unit tests (canonical solution does not appear to work either).
- 68: Candidate, but fairly abstract
- 74: Simple but very weirdly formulated
- 75: Candidate, about prime numbers, requires a longer solution
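- 83: Suspected bug in unit tests, but the canonical solution holds up. The task (count the n-digit positive integers that start or end with 1) is abstract but relatively easy for humans. For n >= 2, inclusion-exclusion gives 10 ** (n-1) + 9 * (10 ** (n-2)) - 10 ** (n-2) = 18 * (10 ** (n-2)), which matches the canonical solution; 19 * (10 ** (n-2)) counts the numbers that both start and end with 1 twice. A GPT-4 answer based on 19 * (10 ** (n-2)) is therefore a genuine failure, not a false negative.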
- 84: GPT-4 fails because the prompt is ambiguous: the expected solution sums the decimal digits and then converts that sum to binary, whereas GPT-4 converts to binary first and then sums the binary digits.
- 91: GPT-4 fails because it uses a standard-library module without importing it.
- 93: STRONG CANDIDATE. GPT-4 forgets to also swap case when it's a vowel.
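
The notes on tasks 83 and 84 can be checked mechanically. The sketch below assumes that task 83 asks for the count of n-digit positive integers that start or end with 1, and that task 84 asks for the decimal digit sum of N rendered in binary. It compares a brute-force count against the two closed forms mentioned for task 83, and shows how the two readings of task 84 diverge. All function names are hypothetical, and `binary_then_bit_sum` only illustrates the alternative reading; it is not GPT-4's actual output.

```python
def count_starts_or_ends_with_one(n):
    """Brute-force count of n-digit positive integers that start or end with 1."""
    return sum(
        1
        for k in range(10 ** (n - 1), 10 ** n)
        if str(k)[0] == "1" or str(k)[-1] == "1"
    )


def digit_sum_then_binary(N):
    """Reading 1: sum the decimal digits, then write that sum in binary."""
    return bin(sum(int(d) for d in str(N)))[2:]


def binary_then_bit_sum(N):
    """Reading 2: write N in binary, then sum the binary digits."""
    return str(bin(N).count("1"))


# Task 83: the brute-force count agrees with 18 * 10 ** (n - 2) for n >= 2.
for n in (2, 3, 4):
    print(n, count_starts_or_ends_with_one(n), 18 * 10 ** (n - 2), 19 * 10 ** (n - 2))

# Task 84: the two readings give different answers for the same input.
print(digit_sum_then_binary(150), binary_then_bit_sum(150))  # 110 4
```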


## 3. Generate Task

Generate the prompt and the canonical solution for a given task.

In [None]:
# For a given test case id, build the canonical answer

import evalplus.data

TESTCASE_ID = 'HumanEval/93' # ADAPT THIS

problem = evalplus.data.get_human_eval_plus()[TESTCASE_ID]  # load the dataset only once
prompt = problem["prompt"]
full_solution = prompt + problem["canonical_solution"]

print(f"PROMPT\n\n{prompt}\n\n------\n\nFULL SOLUTION\n\n{full_solution}")

## 4. Test solution

We test the solution using the EvalPlus command line tools.


In [None]:
SOLUTION_TESTCASE_ID = TESTCASE_ID
SOLUTION = """
def encode(message):
    vowels = 'aeiouAEIOU'
    encoded_message = ''
    
    for char in message:
        if char.isalpha():
            char = char.swapcase()
            if char in vowels:
                if char.islower():
                    char = chr(((ord(char) - ord('a') + 2) % 26) + ord('a'))
                else:
                    char = chr(((ord(char) - ord('A') + 2) % 26) + ord('A'))
        encoded_message += char
    
    return encoded_message
"""
#SOLUTION = full_solution
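
Because the full evaluation runs every task, it can be worth checking a candidate against the docstring examples first. The following is a minimal sketch, not part of EvalPlus: it executes a solution string in a scratch namespace and compares the outputs with the two examples from the `HumanEval/93` prompt. The names `CANDIDATE` and `check_examples` are hypothetical.

```python
CANDIDATE = """
def encode(message):
    vowels = 'aeiouAEIOU'
    encoded_message = ''
    for char in message:
        if char.isalpha():
            char = char.swapcase()
            if char in vowels:
                base = ord('a') if char.islower() else ord('A')
                char = chr(((ord(char) - base + 2) % 26) + base)
        encoded_message += char
    return encoded_message
"""

# Examples taken from the task's docstring.
DOCSTRING_EXAMPLES = {
    "test": "TGST",
    "This is a message": "tHKS KS C MGSSCGG",
}


def check_examples(solution_source, function_name, examples):
    """Execute the solution source and return the inputs whose output is wrong."""
    namespace = {}
    exec(solution_source, namespace)  # only run code you trust
    func = namespace[function_name]
    return [arg for arg, expected in examples.items() if func(arg) != expected]


print(check_examples(CANDIDATE, "encode", DOCSTRING_EXAMPLES))  # []
```

An empty list only means the docstring examples pass; the EvalPlus run remains the actual test.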

In [None]:
# To test a solution we need a full set of model samples.
# We copy the existing GPT-4 samples and swap in the new solution
# for the selected task.

import shutil
import uuid
import os

src = GPT4_SAMPLES_DIR
random_suffix = str(uuid.uuid4())[:6]
# Append the random suffix to the original directory name
dst = f"tmp_samples/{src}_{random_suffix}"
# Copy all samples (copytree also creates dst and its parents). The source's
# eval_results.json is skipped so that the results read later come from this run.
shutil.copytree(src, dst, ignore=shutil.ignore_patterns("eval_results.json"))

# Replace the sample for the selected task
task_dir = f"{dst}/{SOLUTION_TESTCASE_ID.replace('/', '_')}"
os.makedirs(task_dir, exist_ok=True)
solution_file = f"{task_dir}/0.py"
with open(solution_file, 'w') as file:
    file.write(SOLUTION)

os.environ["TEMP_SAMPLE_DIR"] = dst


In [None]:
# Run eval on custom samples (this basically runs a bunch of unit tests)
# Note this will run all tests just to verify one result
! evalplus.evaluate --dataset humaneval --samples ${TEMP_SAMPLE_DIR}

Example results for the GPT-4 outputs without any changes:

```
Base
{'pass@1': 0.8841463414634146}
Base + Extra
{'pass@1': 0.7804878048780488}
```
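
With a single sample per task, pass@1 is simply the fraction of tasks solved. Assuming the 164 tasks of HumanEval, the two values above are consistent with 145 tasks passing the base tests and 128 passing the base and extra tests. These counts are inferred from the reported fractions, not read from the results file, and the helper name `pass_at_1` is hypothetical.

```python
TOTAL_TASKS = 164  # number of HumanEval problems


def pass_at_1(num_passed, total=TOTAL_TASKS):
    """pass@1 with one sample per task: the fraction of tasks solved."""
    return num_passed / total


# Inferred counts (assumption): 145 tasks pass base, 128 pass base + extra.
print(pass_at_1(145))
print(pass_at_1(128))
```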

In [None]:
results_json_path = dst + "/eval_results.json"
df_custom = pd.read_json(results_json_path)
eval_result = df_custom[df_custom.index == SOLUTION_TESTCASE_ID]["eval"].iloc[0]
passed_default_tests = dict(eval_result)['base'][0][0] == "success"
passed_plus_tests = dict(eval_result)['plus'][0][0] == "success"
print(f"Passed default HumanEval unit tests: {passed_default_tests}")
print(f"Passed HumanEvalPlus-specific unit tests: {passed_plus_tests}")

# clean up
shutil.rmtree(dst)



Done. You have successfully tested your EvalPlus test case.

In [None]:
eval_result

In [None]:
def encode(message):
    """
    Write a function that takes a message, and encodes in such a 
    way that it swaps case of all letters, replaces all vowels in 
    the message with the letter that appears 2 places ahead of that 
    vowel in the english alphabet. 
    Assume only letters. 
    
    Examples:
    >>> encode('test')
    'TGST'
    >>> encode('This is a message')
    'tHKS KS C MGSSCGG'
    """


    def switch_case(ch):
        if ord("A") <= ord(ch) <= ord("Z"):
            return chr(ord(ch) + 32)
        elif ord("a") <= ord(ch) <= ord("z"):
            return chr(ord(ch) - 32)
        else:
            return ch
    
    def vowel_change(ch):
        return ch if ch not in "aeiouAEIOU" else chr(ord(ch) + 2)
    
    m = "".join(map(switch_case, message))
    return "".join(map(vowel_change, m))



In [None]:
def encode(message):
    """
    Write a function that takes a message, and encodes in such a 
    way that it swaps case of all letters, replaces all vowels in 
    the message with the letter that appears 2 places ahead of that 
    vowel in the english alphabet. 
    Assume only letters. 
    
    Examples:
    >>> encode('test')
    'TGST'
    >>> encode('This is a message')
    'tHKS KS C MGSSCGG'
    """
    vowels = 'aeiouAEIOU'
    encoded_message = ''
    
    for char in message:
        if char.isalpha():
            char = char.swapcase()
            if char in vowels:
                if char.islower():
                    char = chr(((ord(char) - ord('a') + 2) % 26) + ord('a'))
                else:
                    char = chr(((ord(char) - ord('A') + 2) % 26) + ord('A'))
        encoded_message += char
    
    return encoded_message



In [None]:
encode('This is a message')
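
The two `encode` cells above shift vowels differently: one adds 2 to the code point directly, the other wraps modulo 26. A small differential test can confirm that they agree on ASCII letters and spaces. This is a sketch under that input assumption; the functions are condensed copies renamed `encode_a` and `encode_b` so that both can coexist, and the random inputs are arbitrary.

```python
import random
import string


def encode_a(message):
    """First variant: manual case switch, then shift vowels by adding 2."""
    def switch_case(ch):
        if "A" <= ch <= "Z":
            return chr(ord(ch) + 32)
        if "a" <= ch <= "z":
            return chr(ord(ch) - 32)
        return ch

    def vowel_change(ch):
        return ch if ch not in "aeiouAEIOU" else chr(ord(ch) + 2)

    return "".join(vowel_change(switch_case(ch)) for ch in message)


def encode_b(message):
    """Second variant: swapcase, then shift vowels modulo 26."""
    out = ""
    for char in message:
        if char.isalpha():
            char = char.swapcase()
            if char in "aeiouAEIOU":
                base = ord("a") if char.islower() else ord("A")
                char = chr(((ord(char) - base + 2) % 26) + base)
        out += char
    return out


ALPHABET = string.ascii_letters + " "
rng = random.Random(0)  # fixed seed so the run is reproducible
samples = ["".join(rng.choice(ALPHABET) for _ in range(20)) for _ in range(200)]
mismatches = [s for s in samples if encode_a(s) != encode_b(s)]
print(len(mismatches))  # 0
```

The variants agree because no vowel is close enough to `z` for the modulo wrap to matter; inputs outside ASCII are not covered by this check.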