## CodeI/O

Original paper (DeepSeek): https://arxiv.org/pdf/2502.07316

The approach begins by obtaining high quality raw code data and preprocessing it by prompting an LLM. The output of this preprocessing, for each raw code file used, should be:

- cleaned reference code, with a main entrypoint function
- a query, converting the reference code into a question (along the lines of "given [function parameters...] how can we obtain [desired outputs...]")
- a natural language description of all inputs (function parameters) and outputs (function return values)
- an input generator, which can generate a dictionary of valid inputs for the function

This notebook seeks to experiment with prompting an LLM to this end, as a starting point. The raw code data is from this GitHub repository that the DeepSeek paper mentions as one of their raw code sources: https://github.com/TheAlgorithms/Python

NOTE: Be careful with the raw code you input into this, as cells later execute the LLM-generated outputs.

In [1]:
!git clone https://github.com/TheAlgorithms/Python.git

import shutil
from pathlib import Path

repo_dir = Path("Python")
raw_code_dir = Path("raw_files")
raw_code_dir.mkdir(exist_ok=True)

def process_dir(directory: Path):
    # Move all the Python code files to the raw code file directory
    # Handles subdirectories recursively
    dirname = directory.name
    for file in directory.iterdir():
        if file.is_dir():
            process_dir(file)
        elif file.name.endswith(".py") and file.name != "__init__.py":
            file.rename(raw_code_dir / f"{dirname}_{file.name}")

for repo_child in repo_dir.iterdir():
    # For this repo, algorithms are divided into categories by subdirectories
    if not repo_child.is_dir() or repo_child.name.startswith("."):
        continue
    process_dir(repo_child)

shutil.rmtree(repo_dir)

Cloning into 'Python'...
remote: Enumerating objects: 20925, done.[K
remote: Counting objects: 100% (13/13), done.[K
remote: Compressing objects: 100% (11/11), done.[K
remote: Total 20925 (delta 6), reused 2 (delta 2), pack-reused 20912 (from 3)[K
Receiving objects: 100% (20925/20925), 14.86 MiB | 17.27 MiB/s, done.
Resolving deltas: 100% (13469/13469), done.


In [2]:
import random
from pathlib import Path
from dotenv import load_dotenv
load_dotenv()
raw_files = list(Path("raw_files/").iterdir())

Note that the below prompt is built for DeepSeekV3. It may not work with other LLMs.

In [3]:
format_prompt_template = """
You are tasked with preprocessing a raw file of Python code into a standard format. The format is made up of several components. Here is a very simple example of a raw code file:

def kg_to_pounds(weights):
    return [w * 2.20462 for w in weights]

def filter_weekly(original_measurements, days):
    return [m for i, m in enumerate(original_measurements) if i % 7 == 0]

def main(kgs, days):
    lbs = kg_to_pounds(kgs)

    for measurement in filter_weekly(lbs, days):
        print(measurement)

1. Cleaned reference code, with a main entrypoint function that takes all required arguments as parameters and returns all outputs.

The name of the main entrypoint function should be `main`. The parameters should be clearly named but do not require type hints. The function should return a dict mapping output names to values. The function should contain all the necessary code to perform the functionality, without splitting into several functions. The function should not print or otherwise output anything; results should be returned as part of the result dict. Ensure you include any imports necessary, prior to the function definition.

Example function signature: `def main(weights_kg, days):`

2. A query, defined as natural language description of the question the function answers.

Example query: "You are given two lists of integers, `weights_kg` and `days`. The unit of `weights_kg` is kilograms. `days` refers to the number of days passed, starting from zero. Your task is to convert the integers to pounds and filter to only one weight measurement every 7 days. Return the list of integers in pounds."

The query should be as detailed as the code requires to be fully explained. It should be clear what the function does, what the inputs are, and what the outputs are.

3. A natural language description of all inputs (function parameters) and outputs (return values) of the function.

Example description:

Input:
    weights_kg (list of int): List of weight values in kilograms.
    days (list of int): List of integers representing the number of days passed, starting from zero.

Output:
    return (dict): A dictionary with one key:
    - weights_lb (list of int): List of filtered weight values in pounds.

4. Python 3.11 code for an input generator, which randomly generates valid sets of inputs for the functions.

The input generator should return a dict mapping parameter names to values. The values should be randomly generated, but should be valid inputs for the function. You have access to `random` in the input generator. Do not import any other modules.

Example input generator:

def input_generator():
    weights = [random.randint(100) for _ in range(40)]
    days = list(range(40))
    return {{"weights_kg": weights, "days": days}}

Using the guidelines and example above, preprocess the following raw code file into the standard format:

{0}

Output the components (reference code, query, description, input generator) in order. Separate each component with a line of dashes (---). Avoid code blocks and do not output any Markdown formatting. Respond only with the four components, no prefix or additional text.
"""

Edit the below cell or appropriate env variables to utilise different API providers, etc

In [19]:
import asyncio
import os
from openai import AsyncOpenAI
from openai.types.chat import ChatCompletion, ChatCompletionMessageParam
from typing import Any, Iterable

# Cap concurrent requests. I had to set this to 1 for the DeepSeek API to work, YMMV
semaphore = asyncio.Semaphore(1)

async def llm_generate(
    client: AsyncOpenAI,
    messages: Iterable[ChatCompletionMessageParam],
    sampling_params: dict[str, Any],
    retry_empty_response: bool = True,
    max_retries: int = 3,
) -> ChatCompletion:
    for trial in range(max_retries):
        async with semaphore:
            try:
                completion = await client.chat.completions.create(
                    messages=messages, **sampling_params
                )
                if completion.choices[0].message.content or not retry_empty_response:
                    return completion
                await asyncio.sleep(5)
            except Exception as e:
                print(f"Failure response (trial {trial}):", e)
                await asyncio.sleep(3 * (trial + 1))
                if trial == max_retries - 1:
                    raise

client = AsyncOpenAI(
    base_url=os.getenv("API_BASE_URL"),
    api_key=os.getenv("API_KEY"),
    timeout=120.0,
)

sampling_params = {
    "model": "deepseek-chat",  # For DeepSeek API
    #"model": "deepseek/deepseek-chat:free",  # For OpenRouter
    "max_tokens": 8192,
}

Demo cell to illustrate the LLM preprocessing:

In [13]:
raw_file = random.choice(raw_files)

print(raw_file)

raw_code = raw_file.read_text()

prompt = format_prompt_template.format(raw_code)

messages = [
    {"role": "user", "content": prompt},
]

response = await llm_generate(client, messages, sampling_params)
print(response.choices[0].message.content)

raw_files/genetic_algorithm_basic_string.py
def main(target: str, genes: list[str], debug: bool = True) -> dict:
    if N_POPULATION < N_SELECTED:
        raise ValueError(f"{N_POPULATION} must be bigger than {N_SELECTED}")
    
    not_in_genes_list = sorted({c for c in target if c not in genes})
    if not_in_genes_list:
        raise ValueError(f"{not_in_genes_list} is not in genes list, evolution cannot converge")
    
    population = []
    for _ in range(N_POPULATION):
        population.append("".join([random.choice(genes) for _ in range(len(target))]))
    
    generation, total_population = 0, 0
    
    while True:
        generation += 1
        total_population += len(population)
        
        population_score = [evaluate(item, target) for item in population]
        population_score = sorted(population_score, key=lambda x: x[1], reverse=True)
        
        if population_score[0][0] == target:
            return {
                "generation": generation,
           

Run the below cell to preprocess all the raw code files for real. This will send quite a lot of requests to OpenRouter.

In [None]:
import json
from tqdm import tqdm

async def process_file(raw_file):
    raw_code = raw_file.read_text()
    prompt = format_prompt_template.format(raw_code)
    messages = [{"role": "user", "content": prompt}]

    try:
        response = await llm_generate(client, messages, sampling_params)
        content = response.choices[0].message.content
        code, query, parameters, generator = [el.strip() for el in content.split("\n---\n")]
        return code, query, parameters, generator
    except Exception as e:
        print("Error processing file", raw_file, e)

async def process_all_files(raw_code_files: list[Path], out_file: Path):
    process_tasks = []
    for raw_file in raw_code_files:
        process_tasks.append(asyncio.create_task(process_file(raw_file)))
    for future in tqdm(asyncio.as_completed(process_tasks), total=len(process_tasks)):
        code, query, parameters, generator = await future
        out_object = {"query": query, "reference_code": code, "parameters": parameters, "input_generator": generator}
        out_json = json.dumps(out_object)
        with out_file.open("a") as f:
            f.write(out_json + "\n")

out_file = Path("processed_code.jsonl")
await process_all_files(raw_files, out_file)

Failure response (trial 1): Expecting value: line 1 column 1 (char 0)
Error processing file raw_files/graphs_page_rank.py Expecting value: line 1 column 1 (char 0)
Failure response (trial 1): Expecting value: line 1 column 1 (char 0)
Error processing file raw_files/problem_002_sol2.py Expecting value: line 1 column 1 (char 0)


Load one of the processed outputs to test the reference code and input generator.

The below cell executes the loaded LLM-generated code, so exercise caution.

In [None]:
rng = random.Random()

sample_object = json.loads(out_file.read_text().splitlines()[0])

def generate_io_pairs(main_code: str, input_generator_code: str, num_pairs: int = 100):
    local_vars = {"random": rng}
    exec(main_code, {"random": rng}, local_vars)
    exec(input_generator_code, {"random": rng}, local_vars)
    io_pairs = []
    for _ in range(num_pairs):
        inputs = local_vars["input_generator"]()
        outputs = local_vars["main"](**inputs)
        io_pairs.append((inputs, outputs))
    return io_pairs

io_pairs = generate_io_pairs(sample_object["reference_code"], sample_object["input_generator"], num_pairs=2)
io_pairs

{'particles': [{'x': 46.08733176390575, 'y': -79.53711508439847, 'z': 45.779499438274655, 'mass': 9.121897656796}, {'x': -37.62801734935914, 'y': 94.62608762267024, 'z': -88.900444530177, 'mass': 13.267310061939007}, {'x': 57.04088821817467, 'y': 42.54071907694012, 'z': -73.71739928081027, 'mass': 33.13376982254907}, {'x': -25.913090702690695, 'y': 97.27894813174453, 'z': -68.24577317209872, 'mass': 20.409856607552626}, {'x': -7.993371736001535, 'y': 5.784333365689022, 'z': 82.05216927454009, 'mass': 97.18903185914192}, {'x': 8.028265944329263, 'y': -16.980411042271342, 'z': -38.28350230155666, 'mass': 68.56437969046345}, {'x': 72.19027810108415, 'y': 40.80441736137902, 'z': -27.381163108822662, 'mass': 31.705269244558238}]}
{'particles': [{'x': -82.51989169298639, 'y': 79.31892816610184, 'z': 74.79703074246333, 'mass': 8.173913842116992}, {'x': 40.50078366091543, 'y': -81.62144939582438, 'z': -90.67215023121767, 'mass': 69.66013035036612}, {'x': 23.07410631316951, 'y': 52.578733900890

[({'particles': [{'x': 46.08733176390575,
     'y': -79.53711508439847,
     'z': 45.779499438274655,
     'mass': 9.121897656796},
    {'x': -37.62801734935914,
     'y': 94.62608762267024,
     'z': -88.900444530177,
     'mass': 13.267310061939007},
    {'x': 57.04088821817467,
     'y': 42.54071907694012,
     'z': -73.71739928081027,
     'mass': 33.13376982254907},
    {'x': -25.913090702690695,
     'y': 97.27894813174453,
     'z': -68.24577317209872,
     'mass': 20.409856607552626},
    {'x': -7.993371736001535,
     'y': 5.784333365689022,
     'z': 82.05216927454009,
     'mass': 97.18903185914192},
    {'x': 8.028265944329263,
     'y': -16.980411042271342,
     'z': -38.28350230155666,
     'mass': 68.56437969046345},
    {'x': 72.19027810108415,
     'y': 40.80441736137902,
     'z': -27.381163108822662,
     'mass': 31.705269244558238}]},
  {'center_of_mass': {'x': 12.23, 'y': 16.89, 'z': -0.42}}),
 ({'particles': [{'x': -82.51989169298639,
     'y': 79.31892816610184,


Next in the paper they synthesized chains of thought from the LLM for use in building a supervised finetuning dataset. Excerpt:

> Since we aim for the input-output prediction tasks, we construct the prompt using a designed template to combine the function, the query, the reference code, and either a specific input or output. The response should ideally be a natural language CoT to reason about how to derive the correct output or a feasible input.

The below prompts are also from the paper. Synthesized chains of thought are not our main goal, but the cells below provide a demo nonetheless.

In [20]:
synthetic_cot_prompt_prefix = """
You are given a question that requires some input and output variables as follows:

{0}

The input and output requirements are as follows:

{1}
"""

synthetic_cot_prompt_suffix = """
Tip: Here is a reference code snippet for this question. You can refer to this code to guide your reasoning but not copy spans of code directly.

{3}
"""

synthetic_cot_prompt_input_prediction = synthetic_cot_prompt_prefix + """
Given the following output:

{2}

Can you predict a feasible input without writing any code? Please reason and put your final answer in the following json format: "input": <your input>, where <your input> should be a dictionary, even if the there is only one input variable, with keys strictly matching the input variables' names as specified.
""" + synthetic_cot_prompt_suffix

synthetic_cot_prompt_output_prediction = synthetic_cot_prompt_prefix + """
Given the following input:

{2}

Can you predict the output without writing any code? Please reason and put your final answer in the following json format: "output": <your output>, where <your output> should strictly match the the output requirement as specified.
""" + synthetic_cot_prompt_suffix

In [21]:
async def predict_input(query, parameters, output, reference_code):
    messages = [
        {"role": "user", "content": synthetic_cot_prompt_input_prediction.format(query, parameters, output, reference_code)},
    ]
    response = await llm_generate(client, messages, sampling_params)
    return response.choices[0].message.content

await predict_input(sample_object["query"], sample_object["parameters"], io_pairs[0][1], sample_object["reference_code"])

"To predict a feasible input that would result in the given output `{'center_of_mass': {'x': 12.23, 'y': 16.89, 'z': -0.42}}`, we need to consider the formula for calculating the center of mass in 3D space. The center of mass is calculated as the weighted average of the positions of the particles, where the weights are the masses of the particles.\n\nThe formula for the center of mass is:\n\n\\[\n\\text{center\\_of\\_mass\\_x} = \\frac{\\sum (x_i \\cdot m_i)}{\\sum m_i}\n\\]\n\\[\n\\text{center\\_of\\_mass\\_y} = \\frac{\\sum (y_i \\cdot m_i)}{\\sum m_i}\n\\]\n\\[\n\\text{center\\_of\\_mass\\_z} = \\frac{\\sum (z_i \\cdot m_i)}{\\sum m_i}\n\\]\n\nGiven the output, we can work backward to estimate the input. Let's assume we have two particles for simplicity:\n\n1. **Particle 1**:\n   - Position: \\(x_1 = 10.0\\), \\(y_1 = 15.0\\), \\(z_1 = 0.0\\)\n   - Mass: \\(m_1 = 2.0\\)\n\n2. **Particle 2**:\n   - Position: \\(x_2 = 14.0\\), \\(y_2 = 18.0\\), \\(z_2 = -1.0\\)\n   - Mass: \\(m_2 = 3.

In [22]:
async def predict_output(query, parameters, input, reference_code):
    messages = [
        {"role": "user", "content": synthetic_cot_prompt_output_prediction.format(query, parameters, input, reference_code)},
    ]
    response = await llm_generate(client, messages, sampling_params)
    return response.choices[0].message.content

await predict_output(sample_object["query"], sample_object["parameters"], io_pairs[1][0], sample_object["reference_code"])

'To calculate the center of mass for the given list of particles, we need to follow these steps:\n\n1. **Check for Errors**: \n   - Ensure that the list of particles is not empty.\n   - Ensure that all particles have a mass greater than zero.\n\n2. **Calculate Total Mass**: \n   - Sum the masses of all particles.\n\n3. **Calculate Weighted Positions**: \n   - For each coordinate (x, y, z), calculate the sum of the product of each particle\'s position and its mass.\n\n4. **Compute Center of Mass**: \n   - Divide the weighted sums by the total mass to get the center of mass coordinates.\n   - Round the results to two decimal places.\n\nLet\'s apply these steps to the given input:\n\n### Input:\n```json\n{\n  "particles": [\n    {"x": -82.51989169298639, "y": 79.31892816610184, "z": 74.79703074246333, "mass": 8.173913842116992},\n    {"x": 40.50078366091543, "y": -81.62144939582438, "z": -90.67215023121767, "mass": 69.66013035036612},\n    {"x": 23.07410631316951, "y": 52.57873390089097, 