## Benchpress Hackathon

The challenge comes with a Jupyter notebook for your implementation and various utilities.
We provide a development set and a validation set you can use to develop your solution.
The development set is for testing your code and consists of 300 problems with a varying number of test cases.
You are free to use all data provided with a problem. A sample has the following structure:

```python
{
    # Unique identifier for the problem in the APPS dataset.
    "problem_id": 4424,
    # The problem statement
    "question": "Given three integers ...",
    # The expected function name and the input/output examples
    # representing test cases.
    "input_output": {
        "fn_name": "expression_matter",
        "inputs": [ ... ],
        "outputs": [ ... ]
    },
    "url": "https://www.codewars.com/kata/5ae62fcf252e66d44d00008e",
    "difficulty": "introductory",
    # The starter code for the problem.
    "starter_code": "def expression_matter(a, b, c):\n\t"
}
```
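As a small illustration of how the `input_output` entry can be consumed, the sketch below pairs each argument list with its expected output. The `sample` dictionary and the `iter_test_cases` helper are hypothetical and only mirror the structure shown above; the values are made up for the example.

```python
# Hypothetical sample with the same shape as the structure above.
sample = {
    "input_output": {
        "fn_name": "expression_matter",
        "inputs": [[1, 2, 3], [2, 1, 2]],
        "outputs": [[9], [6]],
    },
}


def iter_test_cases(problem: dict):
    """Yield (args, expected) pairs from a problem's `input_output` entry."""
    io = problem["input_output"]
    for args, expected in zip(io["inputs"], io["outputs"]):
        yield args, expected


cases = list(iter_test_cases(sample))
for args, expected in cases:
    print(sample["input_output"]["fn_name"], args, "->", expected)
```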

The validation set consists of 200 problems and includes an additional key, `test_cases`, which the provided scoring function uses to score your solution.

```python
{
    ...
    "test_cases": {
        "fn_name": "expression_matter",
        "inputs": [ ... ],
        "outputs": [ ... ]
    },
    ...
}
```

### Loading Problems

Use the `load_sample` function to load a problem from the development or validation set.

```python
from utilities import load_sample

problem = load_sample(index=0, dataset_path="./data/dev")
```

### Generating Code

Use the `aleph_alpha_client` package to generate code.
Make sure `AA_TOKEN` holds your API token and `MODEL` names the model you want to query.

```python
from aleph_alpha_client import Client, CompletionRequest, Prompt

client = Client(AA_TOKEN)

request = CompletionRequest(
    prompt=Prompt.from_text("Your prompt."),
    maximum_tokens=256,
)

# API reference for the client:
# https://aleph-alpha-client.readthedocs.io/en/latest/
response = client.complete(request, model=MODEL)
```
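Instruction-tuned models often wrap their answer in a Markdown code fence and add surrounding prose, so the raw completion may not be directly executable. The following is a minimal sketch of one way to pull the code out of a completion string. The helper name `extract_code` is hypothetical and not part of the provided utilities.

```python
import re

# Build the fence marker programmatically so this example can itself
# live inside a fenced code block.
FENCE = "`" * 3
_FENCED_BLOCK = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(completion: str) -> str:
    """Return the first fenced block if present, else the whole completion."""
    match = _FENCED_BLOCK.search(completion)
    code = match.group(1) if match else completion
    return code.strip()


raw = (
    "Here is the code:\n"
    + FENCE + "python\n"
    + "def f(x):\n    return x + 1\n"
    + FENCE + "\nHope this helps."
)
print(extract_code(raw))
```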

### Running Tests

Use the `run_test_cases` function to run the generated code against the test cases.
The function returns a dictionary with the test results, including the expected output, the generated output, a boolean indicating whether the test passed and a traceback in case of an error.

```python
from utilities import run_test_cases

test_results = run_test_cases(
    problem=problem, 
    generation=response.completions[0].completion, 
    timeout=10,
)
```
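To get a quick overview of a test run, the results can be condensed into a pass rate and the first traceback. This sketch assumes the results are available as a list of per-test records with `passed` and `traceback` keys, matching the fields described above; the `summarize` helper and the example records are hypothetical.

```python
def summarize(results: list[dict]) -> dict:
    """Condense per-test records into counts, a pass rate, and the first error."""
    total = len(results)
    passed = sum(1 for record in results if record.get("passed"))
    first_error = next(
        (record["traceback"] for record in results if record.get("traceback")),
        None,
    )
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "first_error": first_error,
    }


# Made-up records for illustration only.
example_results = [
    {"passed": True, "output": [9], "expected_output": [9], "traceback": None},
    {"passed": False, "output": None, "expected_output": [6], "traceback": "SyntaxError"},
]
summary = summarize(example_results)
print(summary)
```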

### Scoring

Use the `score` function to score your solution on the validation set.
It expects a function that takes a problem and a client and returns a generation.

```python
from utilities import score

passed_problems, passed_test_cases = score(
    generation_func=generate_code, 
    client=client,
    dataset_path="./data/val", 
    length=50,
)
```
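The callable passed as `generation_func` only needs to accept a problem and a client and return a string. As a rough sketch of that contract, the wrapper below turns any text-completion callable into such a function, which makes it possible to try the pipeline offline with a fake completion. The names `build_prompt`, `make_generation_func`, and `fake_complete` are hypothetical.

```python
def build_prompt(problem: dict) -> str:
    """Combine the problem statement and the starter code into one prompt."""
    return f"{problem['question']}\n\n{problem['starter_code']}"


def make_generation_func(complete):
    """Wrap `complete(prompt, client) -> str` into `(problem, client) -> str`."""

    def generation_func(problem: dict, client) -> str:
        return complete(build_prompt(problem), client)

    return generation_func


def fake_complete(prompt: str, client) -> str:
    # Stand-in for a real model call; always returns the same solution.
    return "def expression_matter(a, b, c):\n    return a + b + c"


offline_generate = make_generation_func(fake_complete)
generation = offline_generate(
    {
        "question": "Given three integers ...",
        "starter_code": "def expression_matter(a, b, c):\n\t",
    },
    client=None,
)
print(generation)
```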

In [3]:
%pip install --upgrade pip
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [1]:
import os

# Read the token from the environment instead of hardcoding a secret in the notebook.
AA_TOKEN = os.environ.get("AA_TOKEN")
# MODEL = "llama-3.1-8b-instruct-long-context"
MODEL = "llama-3.1-70b-instruct-long-context"

if AA_TOKEN is None:
    raise ValueError("Aleph Alpha Playground token is not set.")


In [3]:
from aleph_alpha_client import Client, CompletionRequest, Prompt
from utilities import load_sample
from multiprocessing import Pool, Manager

client = Client(AA_TOKEN)

cache = Manager().dict()

def generate_prompt(problem: dict) -> str:
    prompt = (
        "You are a highly skilled Senior Software Engineer tasked with implementing a Python function based on the following details:\n\n"
        f"PROBLEM DESCRIPTION:\n{problem['question']}\n\n"
        "STARTER CODE (if any):\n"
        f"{problem['starter_code']}\n\n"
        "INSTRUCTIONS:\n"
        "1. Write **only the Python function implementation** that solves the problem.\n"
        "2. Ensure the implementation is:\n"
        "   - Free of syntax errors and ready to execute.\n"
        "   - Fully adherent to Python's PEP8 standards for code style.\n"
        "   - Optimal, concise, and free from redundancy.\n"
        "3. Do not include any:\n"
        "   - Comments or explanations.\n"
        "   - Test cases or print statements.\n"
        "   - Extra text outside the function body.\n"
        "4. Handle edge cases to avoid runtime errors such as key errors, division by zero, or type mismatches.\n\n"
        "EXAMPLE FUNCTION FORMAT:\n"
        "```\n"
        "def function_name(parameters):\n"
        "    # Replace this line with the implementation\n"
        "```\n\n"
        "ADDITIONAL CONSIDERATIONS:\n"
        "1. If there are input constraints, enforce them in the implementation.\n"
        "2. If the function relies on a specific data structure or library, ensure its usage is correct and imported.\n\n"
        "IMPLEMENT THE FUNCTION BELOW:\n"
    )
    return prompt


def clean_code(generated_code: str) -> str:
    cleaned_code = generated_code.replace("```python", "").replace("```", "").strip()

    if cleaned_code.endswith(","):
        cleaned_code = cleaned_code.rsplit(",", 1)[0].strip()

    return cleaned_code


def fetch_or_generate(prompt: str, client: Client) -> str:
    if prompt in cache:
        return cache[prompt]

    request = CompletionRequest(
        prompt=Prompt.from_text(prompt),
        maximum_tokens=128,
        temperature=0.3,
    )
    response = client.complete(request, model=MODEL)
    result = clean_code(response.completions[0].completion)
    cache[prompt] = result
    return result


def generate_single_code(args) -> str:
    prompt, client = args
    return fetch_or_generate(prompt, client)


def generate_multiple_codes(problem: dict, client: Client, num_variations: int = 10) -> list[str]:
    prompts = [generate_prompt(problem) for _ in range(num_variations)]
    
    with Pool() as pool:
        results = pool.map(generate_single_code, [(prompt, client) for prompt in prompts])
    
    return results


def evaluate_code_variations_on_inputs(
    generated_codes: list[str], inputs: list[list], fn_name: str
) -> dict:
    gen_code_outputs = {}
    for input_set in inputs:
        input_key = str(input_set)
        gen_code_outputs[input_key] = []

        for code in generated_codes:
            # Use a single namespace so that imports, helper functions, and
            # recursive calls inside the generated code resolve correctly.
            scope = {}
            try:
                exec(code, scope)
                if fn_name not in scope:
                    gen_code_outputs[input_key].append(f"Error: Function '{fn_name}' not defined")
                    continue

                func = scope[fn_name]
                result = func(*input_set)
                gen_code_outputs[input_key].append(result)

            except Exception as e:
                gen_code_outputs[input_key].append(f"Error: {str(e)}")

    return gen_code_outputs


def extract_ground_truth_outputs(problem: dict) -> dict:
    inputs = problem["input_output"]["inputs"]
    outputs = problem["input_output"]["outputs"]

    return {str(inputs[i]): outputs[i][0] for i in range(len(inputs))}


def find_correct_code(generated_codes: list[str], results: dict, ground_truth: dict) -> dict:
    correct_code_mapping = {}

    for input_key, outputs in results.items():
        if input_key in ground_truth:
            correct_indices = [
                idx for idx, output in enumerate(outputs) if output == ground_truth[input_key]
            ]
            correct_code_mapping[input_key] = [
                generated_codes[idx] for idx in correct_indices
            ]

    return correct_code_mapping


def generate_code(problem: dict, client: Client) -> str:
    generated_code_variations = generate_multiple_codes(problem, client, num_variations=10)

    ground_truth_outputs = extract_ground_truth_outputs(problem)

    inputs = problem["input_output"]["inputs"]
    fn_name = problem["input_output"]["fn_name"]
    gen_code_outputs = evaluate_code_variations_on_inputs(
        generated_codes=generated_code_variations,
        inputs=inputs,
        fn_name=fn_name,
    )

    correct_code_mapping = find_correct_code(
        generated_codes=generated_code_variations,
        results=gen_code_outputs,
        ground_truth=ground_truth_outputs,
    )

    for input_key, codes in correct_code_mapping.items():
        if codes:
            return codes[0]

    return "Error: No correct implementation found."


problem = load_sample(index=1, dataset_path="./data/dev")
correct_code = generate_code(problem, client)

print("Correct Code:\n", correct_code)


Correct Code:
 def kooka_counter(laughing):
    return (len(laughing) + 1) // 2
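`generate_code` above returns the first variation that matches the expected output for any single input. One possible refinement, sketched here, is to count how many inputs each variation gets right and return the one with the highest count. The sketch assumes `results` and `ground_truth` have the shapes produced by `evaluate_code_variations_on_inputs` and `extract_ground_truth_outputs`; the helper name `pick_best_candidate` and the example data are hypothetical.

```python
def pick_best_candidate(codes: list[str], results: dict, ground_truth: dict):
    """Return (code, passed_count) for the variation that passes the most inputs."""
    if not codes:
        return None, 0

    scores = [0] * len(codes)
    for input_key, outputs in results.items():
        if input_key not in ground_truth:
            continue
        for idx, output in enumerate(outputs):
            if output == ground_truth[input_key]:
                scores[idx] += 1

    # On ties, max() keeps the earliest variation.
    best = max(range(len(codes)), key=scores.__getitem__)
    return codes[best], scores[best]


# Made-up data: variation "B" passes both inputs, variation "A" passes none.
example_codes = ["A", "B"]
example_results = {"[1]": [1, 2], "[2]": [5, 4]}
example_truth = {"[1]": 2, "[2]": 4}
best_code, best_score = pick_best_candidate(example_codes, example_results, example_truth)
print(best_code, best_score)
```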


In [5]:
from utilities import score

passed_problems, passed_test_cases = score(
    generation_func=generate_code, 
    client=client,
    dataset_path="./data/val", 
    length=5,
)

print(f"Passed {passed_problems*100}% of problems")
print(f"Passed {passed_test_cases*100}% of test cases")

  0%|          | 0/5 [00:00<?, ?it/s]

type 0 compilation error = invalid syntax (<string>, line 16)


 20%|██        | 1/5 [00:00<00:00,  8.64it/s]

[{'passed': False, 'input': None, 'output': None, 'expected_output': None, 'traceback': 'Traceback (most recent call last):\n  File "/home/gulden/makeathon/benchpress-hackathon/utilities/testing_util.py", line 185, in run_test\n    tmp_sol = RuntimeModule.from_string("tmp_sol", "", sol)\n  File "/home/gulden/makeathon/benchpress-hackathon/.venv/lib64/python3.9/site-packages/pyext.py", line 169, in _newf\n    return self._items[f.__name__][len(args)](*args, **kwargs)\n  File "/home/gulden/makeathon/benchpress-hackathon/.venv/lib64/python3.9/site-packages/pyext.py", line 279, in from_string\n    _exec(s, g)\n  File "/home/gulden/makeathon/benchpress-hackathon/.venv/lib64/python3.9/site-packages/pyext.py", line 97, in _exec\n    def _exec(m,g): exec(m,g)\n  File "<string>", line 16\n    Error: No correct implementation found.\n              ^\nSyntaxError: invalid syntax\n'}]


 40%|████      | 2/5 [00:03<00:06,  2.19s/it]

[{'passed': True, 'input': [[0.5, 0.5, 0.5], 30], 'output': [[0.5, 0.5, 0.5, 1.5, 2.5, 4.5, 8.5, 15.5, 28.5, 52.5, 96.5, 177.5, 326.5, 600.5, 1104.5, 2031.5, 3736.5, 6872.5, 12640.5, 23249.5, 42762.5, 78652.5, 144664.5, 266079.5, 489396.5, 900140.5, 1655616.5, 3045153.5, 5600910.5, 10301680.5]], 'expected_output': [[0.5, 0.5, 0.5, 1.5, 2.5, 4.5, 8.5, 15.5, 28.5, 52.5, 96.5, 177.5, 326.5, 600.5, 1104.5, 2031.5, 3736.5, 6872.5, 12640.5, 23249.5, 42762.5, 78652.5, 144664.5, 266079.5, 489396.5, 900140.5, 1655616.5, 3045153.5, 5600910.5, 10301680.5]], 'traceback': None}]
type 0 compilation error = invalid syntax (<string>, line 16)


 60%|██████    | 3/5 [00:09<00:07,  3.65s/it]

[{'passed': False, 'input': None, 'output': None, 'expected_output': None, 'traceback': 'Traceback (most recent call last):\n  File "/home/gulden/makeathon/benchpress-hackathon/utilities/testing_util.py", line 185, in run_test\n    tmp_sol = RuntimeModule.from_string("tmp_sol", "", sol)\n  File "/home/gulden/makeathon/benchpress-hackathon/.venv/lib64/python3.9/site-packages/pyext.py", line 169, in _newf\n    return self._items[f.__name__][len(args)](*args, **kwargs)\n  File "/home/gulden/makeathon/benchpress-hackathon/.venv/lib64/python3.9/site-packages/pyext.py", line 279, in from_string\n    _exec(s, g)\n  File "/home/gulden/makeathon/benchpress-hackathon/.venv/lib64/python3.9/site-packages/pyext.py", line 97, in _exec\n    def _exec(m,g): exec(m,g)\n  File "<string>", line 16\n    Error: No correct implementation found.\n              ^\nSyntaxError: invalid syntax\n'}]


 80%|████████  | 4/5 [00:13<00:04,  4.05s/it]

[{'passed': True, 'input': [['B', 'C', '', '']], 'output': [''], 'expected_output': [''], 'traceback': None}]


100%|██████████| 5/5 [00:19<00:00,  3.84s/it]

[{'passed': True, 'input': ['PLPPLPLLEELELRPFFMAAGGTPLAMMGG'], 'output': [50.0], 'expected_output': [50], 'traceback': None}]
Passed 60.0% of problems
Passed 60.0% of test cases
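In the run above, two problems failed with a `SyntaxError` because the string `Error: No correct implementation found.` was handed to the scorer and executed as code. A minimal sketch of one way to avoid this is to check candidates with `ast.parse` and fall back to a stub built from the starter code. The helpers `is_valid_python` and `first_valid` are hypothetical, and the stub will still fail the tests; it only keeps the scorer from hitting a compilation error.

```python
import ast


def is_valid_python(code: str) -> bool:
    """Return True if the string parses as Python source."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False
    return True


def first_valid(candidates: list[str], starter_code: str) -> str:
    """Return the first parseable candidate, else a stub based on the starter code."""
    for code in candidates:
        if is_valid_python(code):
            return code
    stub = starter_code.rstrip() + "\n    pass\n"
    return stub if is_valid_python(stub) else "pass\n"


chosen = first_valid(
    ["Error: No correct implementation found.", "def f():\n    return 1"],
    starter_code="def f():\n\t",
)
fallback = first_valid(
    ["Error: No correct implementation found."],
    starter_code="def f(a):\n\t",
)
print(chosen)
print(fallback)
```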



