# 1. Come up with high-level code problems

In [None]:
from collections import Counter
from llms import generate_json
import json

model = "t-o3"

prompt_ideation = """You are helping come up with python programming problems that should be similar to the ones in HumanEval.
Your goal here is to come up with 100 unique but realistic problems that can be framed as a single Python function, and can be resolved in 10-30 lines of code.

For each problem, you should assign the following key:
- a problem_id that counts from 1 to 100.
- a category that decides the kind of algorithm involved in the problem. Do not use any category more than 5 times.
- a name that is a short description of the problem.
- a domain that decides the kind of problem. Do not use any domain more than 5 times.
- a description that uniquely describes the problem in detail, it should be 20-40 words long.

For example, here's one example problem statement you could generate:
{"problem_id": 1,"category": "sorting", "name": "standardize_and_sort_names", "domain": "database", "description": "Given a list of names of people in various formats, return the list in a sorted order, having standardized the names to keep only the first and last name, capitalized accordingly. Remove any middle names or initials. It should be sorted by last name then first name."}

Careful:
- [Unique Problems] All the problems should be unique.
- [Testable Through Unit Tests] All the problems should be testable through 5 unit tests for accuracy. For example, we cannot have any problem that requires efficience (implement in O(log n) time or space) as that is not testable through unit tests.
- [No Complexity] Avoid very complex problems, for example that involve advanced data structures or dynamic programming.
- [Problem Twist] A problem should not be a very basic problem, but should involve at least one twist or additional constraint that is not obvious. The full description of the problem should be required to solve the problem (and not just going off the high-level problem description).

Now produce 100 such problems, following the following JSON format
{"problems": [...]}"""

response = generate_json([{"role": "user", "content": prompt_ideation}], model=model)

categories = Counter([response["category"] for response in response["problems"]])
print(categories)

domains = Counter([response["domain"] for response in response["problems"]])
print(domains)

problems = response["problems"]
for problem in problems:
    problem["sample_type"] = "code_synthetic"

with open("data/code_synthetic_problems_0.1.json", "w") as f:
    json.dump(problems, f)

# 2. Implement Solution and Unit Tests

In [1]:
from llms import generate_json
import tqdm, json

MODEL = "t-o3"

prompt_solution = """
You are given the high-level description of a Python programming problem. Your objective is to produce two outputs:
1. `solution` - a block of Python code which only includes (1) imports if necessary, (2) a single function definition with the name `[[NAME]]`. So it should start as :`def [[name]]`
2. `tests` - a list of 4-6 unit tests for the given problem. Each unit test uses the following schema: {"type": "basic|edge_case", "inputs": [1, [1,2,3]], "output": 4} where the `type` is either `basic` or `edge_case`, and the `inputs` are the inputs to the function. The `output` is the expected output of the function.

The `solution` should be a valid Python function that can be executed. The `tests` should be a list of valid Python unit tests that can be executed.

Here's an example of of the completion of the task:
Example Problem:
Sort the given list by string length and if length is the same, then by alphabetical order.

Problem Name: sort_by_length_and_alphabetical

{"solution": "def sort_by_length_and_alphabetical(lst):\n\treturn sorted(lst, key=lambda x: (len(x), x))",
"tests": [{"type": "basic", "inputs": [["apple", "banana", "cherry"]], "output": ["apple", "cherry", "banana"]},
{"type": "edge_case", "inputs": [["abc", "def", "ghi"]], "output": ["abc", "def", "ghi"]},
{"type": "edge_case", "inputs": [[]], "output": []},
{"type": "edge_case", "inputs": [["c", "d", "c"]], "output": ["c", "c", "d"]}
]
}

Now complete the task for the following problem:

Problem Name: [[NAME]]

Problem Description: [[DESCRIPTION]]

Only output JSON in the format shown in the example above."""

with open("data/code_synthetic_problems_0.1.json", "r") as f:
    data = json.load(f)

def evaluate_synthetic_problem(extracted_answer, tests, func_name, printing=False):
    import ast
    # Clean the code
    pred_python_code = extracted_answer.replace("```python", "").replace("```", "").strip()
    
    # Parse the code to extract imports
    try:
        tree = ast.parse(pred_python_code)
        imports = []
        for node in tree.body:
            if isinstance(node, (ast.Import, ast.ImportFrom)):
                imports.append(ast.unparse(node))
        
        # Create global scope with builtins
        global_scope = {"__builtins__": __builtins__}
        
        # Execute imports first to populate global scope
        for import_stmt in imports:
            exec(import_stmt, global_scope)
        
        # Execute the full code in the enriched global scope
        local_scope = {}
        exec(pred_python_code, global_scope, local_scope)
        
        func = local_scope.get(func_name)
        if not callable(func):
            raise ValueError(f"Function {func_name} not found in extracted answer.")

        for test in tests:
            inputs = test["inputs"]
            output = test["output"]
            result = func(*inputs)
            if printing:
                valid = result == output
                color = "\033[92m" if valid else "\033[91m"
                print(f"   {color}{func_name}({inputs}) -> {result} vs. {output}\033[0m")
            if result != output:
                return False
        return True
        
    except Exception as e:
        if printing:
            print(f"\033[91mError executing code: {e}\033[0m")
        return False


for problem in tqdm.tqdm_notebook(data):
    if "verified" in problem:
        continue
    n_tries = 0
    while n_tries < 3:
        n_tries += 1
        print(f"[Problem {problem['name']}] Try {n_tries}/3")
        try:
            response = generate_json([{"role": "user", "content": prompt_solution}], model=MODEL, variables={"NAME": problem["name"], "DESCRIPTION": problem["description"]})
            solution = response["solution"]
            tests = response["tests"]
        except Exception as e:
            continue

        problem["verified"] = evaluate_synthetic_problem(solution, tests, problem["name"], printing=True)
        if problem["verified"]:
            problem["reference_solution"] = solution
            problem["reference_tests"] = tests
            break

    with open("data/code_synthetic_problems_0.1.json", "w") as f:
        json.dump(data, f)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for problem in tqdm.tqdm_notebook(data):


  0%|          | 0/100 [00:00<?, ?it/s]

[Problem format_country_codes] Try 1/3




   [92mformat_country_codes([['us', 'ca', 'mx']]) -> US, CA, MX vs. US, CA, MX[0m
   [91mformat_country_codes([['fr', 'de', 'es', 'it', 'pt']]) -> FR, DE, ES
IT, PT vs. FR, DE, ES\nIT, PT[0m
[Problem format_country_codes] Try 2/3




   [92mformat_country_codes([['us', 'ca', 'mx']]) -> US,CA,MX vs. US,CA,MX[0m
   [92mformat_country_codes([['us', 'ca', 'mx', 'uk', 'fr']]) -> US,CA,MX
UK,FR vs. US,CA,MX
UK,FR[0m
   [92mformat_country_codes([['aa', 'bb', 'cc', 'dd', 'ee', 'ff']]) -> AA,BB,CC
DD,EE,FF vs. AA,BB,CC
DD,EE,FF[0m
   [92mformat_country_codes([['in']]) -> IN vs. IN[0m
   [92mformat_country_codes([[]]) ->  vs. [0m
[Problem check_lat_lon] Try 1/3




   [91mcheck_lat_lon(['23.5,-45.2']) -> (23.5, -45.2) vs. [23.5, -45.2][0m
[Problem check_lat_lon] Try 2/3




   [91mcheck_lat_lon(['12.34,56.78']) -> (12.34, 56.78) vs. [12.34, 56.78][0m
[Problem check_lat_lon] Try 3/3




   [91mcheck_lat_lon(['23,42']) -> (23.0, 42.0) vs. [23.0, 42.0][0m
[Problem population_weighted_centroid] Try 1/3




Cannot parse
|{"solution": "def population_weighted_centroid(triples):\n    total_population = 0.0\n    weighted_lat_sum = 0.0\n    weighted_lon_sum = 0.0\n\n    for lat, lon, population in triples:\n        total_population += population\n        weighted_lat_sum += lat * population\n        weighted_lon_sum += lon * population\n\n    # Handle empty input or zero total population\n    if total_population == 0:\n        return (0.0, 0.0)\n\n    avg_lat = round(weighted_lat_sum / total_population, 5)\n    avg_lon = round(weighted_lon_sum / total_population, 5)\n\n    return (avg_lat, avg_lon)\n",
"tests": [
    {"type": "basic", "inputs": [[(10, 20, 100), (20, 30, 200)]], "output": (16.66667, 26.66667)},
    {"type": "basic", "inputs": [[(40.7128, -74.0060, 8405837)]], "output": (40.7128, -74.006)},
    {"type": "edge_case", "inputs": [[]], "output": (0.0, 0.0)},
    {"type": "edge_case", "inputs": [[(10, 10, 0), (20, 20, 0)]], "output": (0.0, 0.0)},
    {"type": "basic", "inputs": [[(3



Cannot parse
|{"solution": "def population_weighted_centroid(points):\n    \"\"\"Compute population-weighted centroid of (lat, lon, population) triples.\n    Returns (lat, lon) rounded to five decimals. If the input list is empty or\n    total population is zero, returns (0.0, 0.0).\"\"\"\n    if not points:\n        return (0.0, 0.0)\n\n    total_population = sum(p for _, _, p in points)\n    if total_population == 0:\n        return (0.0, 0.0)\n\n    weighted_lat = sum(lat * p for lat, _, p in points) / total_population\n    weighted_lon = sum(lon * p for _, lon, p in points) / total_population\n\n    return (round(weighted_lat, 5), round(weighted_lon, 5))",
"tests": [
  {"type": "basic", "inputs": [[(10.0, 20.0, 100)]], "output": (10.0, 20.0)},
  {"type": "basic", "inputs": [[(0.0, 0.0, 50), (10.0, 10.0, 50)]], "output": (5.0, 5.0)},
  {"type": "basic", "inputs": [[(0.0, 0.0, 10), (10.0, 10.0, 30)]], "output": (7.5, 7.5)},
  {"type": "edge_case", "inputs": [[]], "output": (0.0, 0.0)



Cannot parse
|{"solution": "def population_weighted_centroid(points):\n    \"\"\"Compute population weighted centroid.\n    Args:\n        points (list of tuple): Each tuple is (lat, lon, population).\n    Returns:\n        tuple: (weighted_lat, weighted_lon) rounded to 5 decimals. If the\n                list is empty or total population is zero, returns (0.0, 0.0).\n    \"\"\"\n    if not points:\n        return (0.0, 0.0)\n\n    total_pop = 0.0\n    sum_lat = 0.0\n    sum_lon = 0.0\n\n    for lat, lon, pop in points:\n        total_pop += pop\n        sum_lat += lat * pop\n        sum_lon += lon * pop\n\n    if total_pop == 0:\n        return (0.0, 0.0)\n\n    weighted_lat = round(sum_lat / total_pop, 5)\n    weighted_lon = round(sum_lon / total_pop, 5)\n\n    return (weighted_lat, weighted_lon)", "tests": [{"type": "basic", "inputs": [[(10, 20, 100)]], "output": [10.0, 20.0]}, {"type": "basic", "inputs": [[(0, 0, 1), (10, 10, 1)]], "output": [5.0, 5.0]}, {"type": "basic", "inputs":



   [92mfilter_border_regions([[{'name': 'Canada', 'is_border': True}, {'name': 'Xenon', 'is_border': True}, {'name': 'USA', 'is_border': False}]]) -> ['Canada'] vs. ['Canada'][0m
   [92mfilter_border_regions([[]]) -> [] vs. [][0m
   [92mfilter_border_regions([[{'name': 'Brazil', 'is_border': True}, {'name': 'Argentina', 'is_border': True}]]) -> ['Brazil', 'Argentina'] vs. ['Brazil', 'Argentina'][0m
   [92mfilter_border_regions([[{'name': 'xenia', 'is_border': True}, {'name': 'Xavier', 'is_border': True}]]) -> ['xenia'] vs. ['xenia'][0m
   [92mfilter_border_regions([[{'name': 'XCountry', 'is_border': True}]]) -> [] vs. [][0m
[Problem gradebook_sort] Try 1/3




   [92mgradebook_sort([[['Alice', 88], ['Bob', 95], ['Charlie', 90]]]) -> [['Bob', 95], ['Charlie', 90], ['Alice', 88]] vs. [['Bob', 95], ['Charlie', 90], ['Alice', 88]][0m
   [92mgradebook_sort([[['Alice', 90], ['Bob', 90], ['Charlie', 85]]]) -> [['Alice', 90], ['Bob', 90], ['Charlie', 85]] vs. [['Alice', 90], ['Bob', 90], ['Charlie', 85]][0m
   [92mgradebook_sort([[]]) -> [] vs. [][0m
   [92mgradebook_sort([[['Dave', 72.5], ['Eve', 72.5], ['Frank', 100.0]]]) -> [['Frank', 100.0], ['Dave', 72.5], ['Eve', 72.5]] vs. [['Frank', 100.0], ['Dave', 72.5], ['Eve', 72.5]][0m
   [92mgradebook_sort([[['Anna', -10], ['Bella', -5], ['Cara', -5]]]) -> [['Bella', -5], ['Cara', -5], ['Anna', -10]] vs. [['Bella', -5], ['Cara', -5], ['Anna', -10]][0m
[Problem camel_to_snake_identifiers] Try 1/3




   [92mcamel_to_snake_identifiers([['CamelCase']]) -> ['camel_case'] vs. ['camel_case'][0m
   [92mcamel_to_snake_identifiers([['HTMLParser']]) -> ['html_parser'] vs. ['html_parser'][0m
   [92mcamel_to_snake_identifiers([['getURLResponse']]) -> ['get_url_response'] vs. ['get_url_response'][0m
   [92mcamel_to_snake_identifiers([['already_snake']]) -> ['already_snake'] vs. ['already_snake'][0m
   [92mcamel_to_snake_identifiers([['JSON2XMLConverter', 'MyHTTPServer']]) -> ['json2_xml_converter', 'my_http_server'] vs. ['json2_xml_converter', 'my_http_server'][0m
[Problem average_grade_drop] Try 1/3




   [92maverage_grade_drop([[100, 90, 95, 80]]) -> 15 vs. 15[0m
   [92maverage_grade_drop([[1, 2, 3, 4]]) -> 0 vs. 0[0m
   [92maverage_grade_drop([[80]]) -> 0 vs. 0[0m
   [92maverage_grade_drop([[]]) -> 0 vs. 0[0m
   [92maverage_grade_drop([[90, 70, 70, 60]]) -> 20 vs. 20[0m
   [92maverage_grade_drop([[100, 90, 90, 95, 85, 85, 70]]) -> 15 vs. 15[0m
[Problem classroom_projector_area] Try 1/3




Cannot parse
|{"solution": "import math\n\ndef classroom_projector_area(points):\n    \"\"\"Return the area of a rectangle given its four corner coordinates (unordered).\n\n    points: list/tuple with four elements, each an (x, y) coordinate.\n    The rectangle can be axis-aligned or rotated; points can appear in any order.\n    \"\"\"\n    if len(points) != 4:\n        raise ValueError(\"Exactly four points are required\")\n\n    # Compute all pair-wise squared distances (there are 6 of them)\n    dists_sq = []\n    for i in range(4):\n        x1, y1 = points[i]\n        for j in range(i + 1, 4):\n            x2, y2 = points[j]\n            dx, dy = x1 - x2, y1 - y2\n            dists_sq.append(dx * dx + dy * dy)\n\n    # Filter out zero distances (shouldn't exist for valid rectangle)\n    dists_sq = [d for d in dists_sq if d > 0]\n    if len(dists_sq) != 6:\n        raise ValueError(\"Invalid or repeated points provided\")\n\n    # Sort distances; the two smallest *distinct* values c



   [92mclassroom_projector_area([[[0, 0], [4, 0], [4, 3], [0, 3]]]) -> 12.0 vs. 12.0[0m
   [92mclassroom_projector_area([[[0, 0], [1, 1], [2, 0], [1, -1]]]) -> 2.0 vs. 2.0[0m
   [92mclassroom_projector_area([[[5, 0], [6, 3], [1, 3], [0, 0]]]) -> 15.0 vs. 15.0[0m
   [92mclassroom_projector_area([[[-2, -1], [2, -1], [2, 1], [-2, 1]]]) -> 8.0 vs. 8.0[0m
   [92mclassroom_projector_area([[[1, 2], [4, 6], [1, 10], [-2, 6]]]) -> 24.0 vs. 24.0[0m
[Problem split_into_semesters] Try 1/3




   [92msplit_into_semesters([['Math', 'Physics', 'Chemistry', 'Biology', 'English'], 2]) -> [['Math', 'Physics'], ['Chemistry', 'Biology'], ['English']] vs. [['Math', 'Physics'], ['Chemistry', 'Biology'], ['English']][0m
   [92msplit_into_semesters([['History', 'Art', 'Music', 'PE', 'Drama'], 5]) -> [['History', 'Art', 'Music', 'PE', 'Drama']] vs. [['History', 'Art', 'Music', 'PE', 'Drama']][0m
   [92msplit_into_semesters([['CS', 'Math'], 10]) -> [['CS', 'Math']] vs. [['CS', 'Math']][0m
   [92msplit_into_semesters([['A', 'B', 'C'], 1]) -> [['A'], ['B'], ['C']] vs. [['A'], ['B'], ['C']][0m
[91mError executing code: Semester size 'n' must be at least 1[0m
[Problem split_into_semesters] Try 2/3




   [92msplit_into_semesters([['CS101', 'CS102', 'CS103', 'CS104'], 2]) -> [['CS101', 'CS102'], ['CS103', 'CS104']] vs. [['CS101', 'CS102'], ['CS103', 'CS104']][0m
   [92msplit_into_semesters([[1, 2, 3, 4, 5, 6, 7], 3]) -> [[1, 2, 3], [4, 5, 6], [7]] vs. [[1, 2, 3], [4, 5, 6], [7]][0m
   [92msplit_into_semesters([[], 2]) -> [] vs. [][0m
   [92msplit_into_semesters([['A', 'B'], 1]) -> [['A'], ['B']] vs. [['A'], ['B']][0m
   [92msplit_into_semesters([[10, 20], 5]) -> [[10, 20]] vs. [[10, 20]][0m
[Problem possible_loot_combinations] Try 1/3




   [92mpossible_loot_combinations([5, [1, 2, 5]]) -> 4 vs. 4[0m
   [92mpossible_loot_combinations([3, [2]]) -> 0 vs. 0[0m
   [92mpossible_loot_combinations([0, [1, 2]]) -> 1 vs. 1[0m
   [92mpossible_loot_combinations([10, [2, 3, 5, 6]]) -> 5 vs. 5[0m
   [92mpossible_loot_combinations([3, [1, 1, 2]]) -> 2 vs. 2[0m
[Problem daily_login_streaks] Try 1/3




   [92mdaily_login_streaks([['2023-01-01', '2023-01-02', '2023-01-05', '2023-01-06', '2023-01-07']]) -> 3 vs. 3[0m
   [92mdaily_login_streaks([[]]) -> 0 vs. 0[0m
   [92mdaily_login_streaks([['2023-03-10', '2023-03-11', '2023-03-12', '2023-03-13', '2023-03-14', '2023-03-15']]) -> 6 vs. 6[0m
   [92mdaily_login_streaks([['2023-01-01', '2023-01-03', '2023-01-05']]) -> 1 vs. 1[0m
   [92mdaily_login_streaks([['2020-02-28', '2020-02-29', '2020-03-01']]) -> 3 vs. 3[0m
[Problem decode_save_file] Try 1/3




   [92mdecode_save_file(['health=100;mana=50;name=Link']) -> {'health': 100, 'mana': 50, 'name': 'Link'} vs. {'health': 100, 'mana': 50, 'name': 'Link'}[0m
   [92mdecode_save_file(['lives=-3;level=2']) -> {'lives': -3, 'level': 2} vs. {'lives': -3, 'level': 2}[0m
   [92mdecode_save_file(['code=007bond;secret=42']) -> {'code': '007bond', 'secret': 42} vs. {'code': '007bond', 'secret': 42}[0m
   [92mdecode_save_file(['x=1;y=2;z=three;']) -> {'x': 1, 'y': 2, 'z': 'three'} vs. {'x': 1, 'y': 2, 'z': 'three'}[0m
   [92mdecode_save_file(['']) -> {} vs. {}[0m
[Problem first_unopened_chest] Try 1/3




Cannot parse
|{"solution": "def first_unopened_chest(chests):\n    \"\"\"Return the index of the first False in the list `chests`.\n    If every element is True (or the list is empty), return -1.\n    \"\"\"\n    for idx, opened in enumerate(chests):\n        if not opened:\n            return idx\n    return -1\n", "tests": [{"type": "basic", "inputs": [[true, false, true]], "output": 1}, {"type": "basic", "inputs": [[false, true, false]], "output": 0}, {"type": "edge_case", "inputs": [[true, true, true]], "output": -1}, {"type": "edge_case", "inputs": [[]], "output": -1}, {"type": "edge_case", "inputs": [[[True]*1000 + [False]]], "output": 1000}]}|
[Problem first_unopened_chest] Try 2/3




Cannot parse
|{"solution": "def first_unopened_chest(chests):\n    \"\"\"Return the index of the first unopened chest (False) in the list.\n    If all chests are opened (True) or list is empty, return -1.\n    \"\"\"\n    for idx, opened in enumerate(chests):\n        if not opened:\n            return idx\n    return -1\n", "tests": [{"type": "basic", "inputs": [[True, False, True]], "output": 1}, {"type": "basic", "inputs": [[False, True, True, False]], "output": 0}, {"type": "edge_case", "inputs": [[True, True, True]], "output": -1}, {"type": "edge_case", "inputs": [[]], "output": -1}, {"type": "edge_case", "inputs": [[True, False, False, False]], "output": 1}] }|
[Problem first_unopened_chest] Try 3/3




Cannot parse
|{"solution":"def first_unopened_chest(chests):\n    \"\"\"Return the smallest index of a chest that has not been opened (False).\n    If all chests are opened (True) or the list is empty, return -1.\n    \"\"\"\n    try:\n        return chests.index(False)\n    except ValueError:\n        return -1","tests":[{"type":"basic","inputs":[[True,False,True]],"output":1},{"type":"basic","inputs":[[True,True,True]],"output":-1},{"type":"edge_case","inputs":[[]],"output":-1},{"type":"edge_case","inputs":[[False,True,False]],"output":0},{"type":"edge_case","inputs":[[True,True,True,True,True,True,True,True,True,True,False]],"output":10}]}|
[Problem unique_player_signature] Try 1/3




   [92munique_player_signature(['a', '']) -> 0CC175B9 vs. 0CC175B9[0m
   [92munique_player_signature(['', '']) -> D41D8CD9 vs. D41D8CD9[0m
   [92munique_player_signature(['ab', 'c']) -> 90015098 vs. 90015098[0m
   [92munique_player_signature(['', 'password']) -> 5F4DCC3B vs. 5F4DCC3B[0m
   [92munique_player_signature(['a', 'b']) -> 187EF443 vs. 187EF443[0m
[Problem science_common_genes] Try 1/3




   [92mscience_common_genes([[['BRCA1', 'TP53', 'EGFR'], ['EGFR', 'TP53', 'MTOR'], ['TP53', 'EGFR', 'BRCA1']]]) -> ['EGFR', 'TP53'] vs. ['EGFR', 'TP53'][0m
   [92mscience_common_genes([[['A', 'B', 'A'], ['B', 'B', 'C', 'A'], ['A', 'B']]]) -> ['A', 'B'] vs. ['A', 'B'][0m
   [92mscience_common_genes([[['A', 'B'], ['C', 'D']]]) -> [] vs. [][0m
   [92mscience_common_genes([[]]) -> [] vs. [][0m
   [92mscience_common_genes([[[], ['A', 'B']]]) -> [] vs. [][0m
[Problem toggle_experiment_flags] Try 1/3




   [92mtoggle_experiment_flags([5, [0]]) -> 4 vs. 4[0m
   [92mtoggle_experiment_flags([0, [1, 3]]) -> 10 vs. 10[0m
   [92mtoggle_experiment_flags([6, [1, 2, 1]]) -> 2 vs. 2[0m
   [92mtoggle_experiment_flags([15, []]) -> 15 vs. 15[0m
   [92mtoggle_experiment_flags([1, [10]]) -> 1025 vs. 1025[0m
[Problem population_decay] Try 1/3




   [92mpopulation_decay([1000, 10, 3]) -> [1000, 900, 810, 729] vs. [1000, 900, 810, 729][0m
   [92mpopulation_decay([500, 20, 2]) -> [500, 400, 320] vs. [500, 400, 320][0m
   [92mpopulation_decay([123, 5, 0]) -> [123] vs. [123][0m
   [92mpopulation_decay([1000, 0, 4]) -> [1000, 1000, 1000, 1000, 1000] vs. [1000, 1000, 1000, 1000, 1000][0m
   [92mpopulation_decay([5, 50, 1]) -> [5, 2] vs. [5, 2][0m
[Problem chemical_chain_rewrite] Try 1/3




   [92mchemical_chain_rewrite(['AB']) -> BA vs. BA[0m
   [92mchemical_chain_rewrite(['AAB']) -> BAA vs. BAA[0m
   [92mchemical_chain_rewrite(['ABAB']) -> BBAA vs. BBAA[0m
   [92mchemical_chain_rewrite(['']) ->  vs. [0m
   [92mchemical_chain_rewrite(['ABBA']) -> BBAA vs. BBAA[0m
   [92mchemical_chain_rewrite(['AAA']) -> AAA vs. AAA[0m
[Problem count_prime_measurements] Try 1/3




Cannot parse
|{"solution": "def count_prime_measurements(measurements):\n    \"\"\"Return the number of prime integers in the given iterable of measurements.\n    A prime number is a positive integer greater than 1 that has no positive divisors other than 1 and itself.\n    \"\"\"\n    import math\n\n    def is_prime(n: int) -> bool:\n        if n <= 1:\n            return False\n        if n <= 3:\n            return True  # 2 and 3 are prime\n        if n % 2 == 0 or n % 3 == 0:\n            return False\n        # Check for factors of the form 6k ± 1 up to sqrt(n)\n        limit = int(math.isqrt(n))\n        i = 5\n        while i <= limit:\n            if n % i == 0 or n % (i + 2) == 0:\n                return False\n            i += 6\n        return True\n\n    # Count primes in the provided measurements list\n    return sum(1 for x in measurements if is_prime(x))",
"tests": [
    {"type": "basic", "inputs": [[1, 2, 3, 4, 5]], "output": 3},
    {"type": "basic", "inputs": [[0, -1



   [92mcount_prime_measurements([[2, 3, 4, 5, 6, 7]]) -> 4 vs. 4[0m
   [92mcount_prime_measurements([[10, 11, 12, 13, 14, 15]]) -> 2 vs. 2[0m
   [92mcount_prime_measurements([[-3, -2, 0, 1]]) -> 0 vs. 0[0m
   [92mcount_prime_measurements([[]]) -> 0 vs. 0[0m
   [92mcount_prime_measurements([[99991, 99989, 100000]]) -> 2 vs. 2[0m
[Problem minimum_truck_loads] Try 1/3




   [92mminimum_truck_loads([[4, 8, 1, 4, 2, 1], 10]) -> 2 vs. 2[0m
   [92mminimum_truck_loads([[9, 8, 2, 2, 5], 10]) -> 3 vs. 3[0m
   [92mminimum_truck_loads([[], 15]) -> 0 vs. 0[0m
   [92mminimum_truck_loads([[5, 5, 5], 5]) -> 3 vs. 3[0m
   [92mminimum_truck_loads([[1, 1, 1], 10]) -> 1 vs. 1[0m
[Problem pad_tracking_numbers] Try 1/3




   [92mpad_tracking_numbers([['12345', '987654321']]) -> ['000000012345', '000987654321'] vs. ['000000012345', '000987654321'][0m
   [92mpad_tracking_numbers([[123, 456789012345]]) -> ['000000000123', '456789012345'] vs. ['000000000123', '456789012345'][0m
   [92mpad_tracking_numbers([[]]) -> [] vs. [][0m
   [92mpad_tracking_numbers([['000123456789']]) -> ['000123456789'] vs. ['000123456789'][0m
   [92mpad_tracking_numbers([[0]]) -> ['000000000000'] vs. ['000000000000'][0m
[Problem validate_container_code] Try 1/3




   [92mvalidate_container_code(['MSCU1234567']) -> True vs. True[0m
   [92mvalidate_container_code(['MSIU1234567']) -> False vs. False[0m
   [92mvalidate_container_code(['ABCD12345A7']) -> False vs. False[0m
   [92mvalidate_container_code(['ABcD1234567']) -> False vs. False[0m
   [92mvalidate_container_code(['ABCD123456']) -> False vs. False[0m
   [92mvalidate_container_code(['ABCD12345678']) -> False vs. False[0m
[Problem average_delivery_delay] Try 1/3




Cannot parse
|{"solution": "from datetime import datetime\nfrom typing import List, Tuple\n\ndef average_delivery_delay(pairs: List[Tuple[str, str]]) -> float:\n    \"\"\"Compute the average delivery delay in days.\n\n    Each element in `pairs` is a tuple of two datetime strings:\n      (expected_datetime_str, actual_datetime_str)\n\n    The delay for one shipment is (actual - expected) measured in days\n    (positive means late). The function returns the average delay across\n    all pairs, rounded to two decimal places. If the input list is empty,\n    the function returns 0.0.\n    \"\"\"\n    if not pairs:\n        return 0.0\n\n    total_days = 0.0\n    for expected_str, actual_str in pairs:\n        expected_dt = datetime.fromisoformat(expected_str)\n        actual_dt = datetime.fromisoformat(actual_str)\n        delta_seconds = (actual_dt - expected_dt).total_seconds()\n        total_days += delta_seconds / 86400.0  # seconds in a day\n\n    average = total_days / len(pairs)\n 



Cannot parse
|{"solution": "from datetime import datetime\n\ndef average_delivery_delay(pairs):\n    \"\"\"Compute the average delivery delay in days (actual - expected).\n    Input: list of tuples/lists with expected and actual datetime strings.\n    Returns float rounded to two decimal places; positive means late.\n    If the list is empty, returns 0.0.\n    \"\"\"\n    if not pairs:\n        return 0.0\n\n    total_days = 0.0\n    for expected_str, actual_str in pairs:\n        expected_dt = datetime.fromisoformat(expected_str)\n        actual_dt = datetime.fromisoformat(actual_str)\n        delta_days = (actual_dt - expected_dt).total_seconds() / 86400.0  # seconds in a day\n        total_days += delta_days\n\n    average = total_days / len(pairs)\n    return round(average, 2)\n", "tests": [{"type": "basic", "inputs": [[("2023-01-01", "2023-01-03")]], "output": 2.0}, {"type": "basic", "inputs": [[("2023-01-01", "2023-01-03"), ("2023-01-05", "2023-01-03")]], "output": 0.0}, {"type":



Cannot parse
|{"solution": "import datetime\n\ndef average_delivery_delay(deliveries):\n    \"\"\"Compute the average delivery delay in days rounded to two decimals.\n    Each element in `deliveries` is a tuple (expected_datetime_str, actual_datetime_str).\n    Positive values indicate late delivery, negative values early delivery.\n    \"\"\"\n    if not deliveries:\n        return 0.0\n\n    total_days = 0.0\n    for expected_str, actual_str in deliveries:\n        # Parse using datetime.fromisoformat which supports YYYY-MM-DD and full ISO datetime strings\n        expected_dt = datetime.datetime.fromisoformat(expected_str)\n        actual_dt = datetime.datetime.fromisoformat(actual_str)\n        diff_days = (actual_dt - expected_dt).total_seconds() / 86400  # seconds per day\n        total_days += diff_days\n\n    average = total_days / len(deliveries)\n    return round(average, 2)\n", "tests": [{"type": "basic", "inputs": [[("2021-01-01", "2021-01-03"), ("2021-02-01", "2021-02-02")



Cannot parse
|{"solution": "def filter_heavy_packages(packages, threshold, exclusion_list):\n    \"\"\"Return a list of package IDs whose weight strictly exceeds the given\n    threshold and whose destination is not in the exclusion list.\n\n    Args:\n        packages (list[dict]): Each dict must contain keys 'id', 'weight', and 'destination'.\n        threshold (int | float): Weight threshold. Packages heavier than this value qualify.\n        exclusion_list (list[str]): Destinations to exclude.\n\n    Returns:\n        list: IDs of the qualifying packages, in their original order.\n    \"\"\"\n    if not isinstance(exclusion_list, set):\n        exclusion_set = set(exclusion_list)\n    else:\n        exclusion_set = exclusion_list\n\n    qualified_ids = []\n    for pkg in packages:\n        try:\n            if pkg['weight'] > threshold and pkg['destination'] not in exclusion_set:\n                qualified_ids.append(pkg['id'])\n        except KeyError:\n            # Skip packages



Cannot parse
|{"solution": "def filter_heavy_packages(packages, weight_threshold, excluded_destinations):\n    \"\"\"Return a list of package IDs whose weight exceeds the given threshold and whose\n    destination is not in the exclusion list.\n\n    Parameters\n    ----------\n    packages : list of dict\n        Each dictionary must contain the keys 'id', 'weight', and 'destination'.\n    weight_threshold : int | float\n        The minimum weight (exclusive) a package must exceed to be included.\n    excluded_destinations : set | list | tuple\n        Iterable of destination names to be excluded.\n\n    Returns\n    -------\n    list\n        IDs of packages that satisfy the conditions, in the original order.\n    \"\"\"\n    # Convert exclusion list to a set for O(1) look-ups\n    exclusions = set(excluded_destinations)\n\n    # Filter and collect IDs maintaining original order\n    result_ids = [pkg['id'] for pkg in packages\n                  if pkg.get('weight', 0) > weight_thres



   [92mfilter_heavy_packages([[{'id': 1, 'weight': 15, 'destination': 'NY'}, {'id': 2, 'weight': 5, 'destination': 'CA'}, {'id': 3, 'weight': 20, 'destination': 'TX'}], 10, ['CA']]) -> [1, 3] vs. [1, 3][0m
   [92mfilter_heavy_packages([[{'id': 'A', 'weight': 50, 'destination': 'LA'}, {'id': 'B', 'weight': 40, 'destination': 'NY'}, {'id': 'C', 'weight': 70, 'destination': 'LA'}], 45, ['NY']]) -> ['A', 'C'] vs. ['A', 'C'][0m
   [92mfilter_heavy_packages([[], 30, ['LA']]) -> [] vs. [][0m
   [92mfilter_heavy_packages([[{'id': 10, 'weight': 5, 'destination': 'SF'}, {'id': 11, 'weight': 9, 'destination': 'LA'}], 10, []]) -> [] vs. [][0m
   [92mfilter_heavy_packages([[{'id': 'X', 'weight': 100, 'destination': 'NK'}, {'id': 'Y', 'weight': 100, 'destination': 'NK'}], 50, ['NK']]) -> [] vs. [][0m
[Problem sort_products_by_margin] Try 1/3




   [92msort_products_by_margin([[{'id': 1, 'cost': 50, 'price': 100}, {'id': 2, 'cost': 30, 'price': 60}, {'id': 3, 'cost': 20, 'price': 25}]]) -> [{'id': 1, 'cost': 50, 'price': 100}, {'id': 2, 'cost': 30, 'price': 60}, {'id': 3, 'cost': 20, 'price': 25}] vs. [{'id': 1, 'cost': 50, 'price': 100}, {'id': 2, 'cost': 30, 'price': 60}, {'id': 3, 'cost': 20, 'price': 25}][0m
   [92msort_products_by_margin([[{'id': 10, 'cost': 80, 'price': 120}, {'id': 11, 'cost': 40, 'price': 60}, {'id': 12, 'cost': 50, 'price': 100}]]) -> [{'id': 12, 'cost': 50, 'price': 100}, {'id': 10, 'cost': 80, 'price': 120}, {'id': 11, 'cost': 40, 'price': 60}] vs. [{'id': 12, 'cost': 50, 'price': 100}, {'id': 10, 'cost': 80, 'price': 120}, {'id': 11, 'cost': 40, 'price': 60}][0m
   [92msort_products_by_margin([[{'id': 4, 'cost': 20, 'price': 30}, {'id': 5, 'cost': 40, 'price': 60}, {'id': 6, 'cost': 50, 'price': 60}]]) -> [{'id': 4, 'cost': 20, 'price': 30}, {'id': 5, 'cost': 40, 'price': 60}, {'id': 6, 'cost'



Cannot parse
|{"solution": "def truncate_product_titles(titles):\n    \"\"\"Shorten titles longer than 60 characters without cutting words.\n    If a title is truncated, append '...' to the result. Words are kept whole;\n    if there is no space within the first 60 characters the title is returned\n    unchanged.\n    Args:\n        titles (list[str]): List of product title strings.\n    Returns:\n        list[str]: List with titles truncated where necessary.\n    \"\"\"\n    result = []\n    for title in titles:\n        if len(title) <= 60:\n            # No need to truncate\n            result.append(title)\n            continue\n        # Find the last space within the first 60 characters (0..59)\n        cut_position = title.rfind(' ', 0, 61)  # 61 because the stop index is exclusive\n        if cut_position == -1:\n            # No space found -> cannot truncate without breaking a word\n            result.append(title)\n        else:\n            truncated = title[:cut_position] 



Cannot parse
|{"solution": "def truncate_product_titles(titles):\n    \"\"\"Shorten titles longer than 60 characters without cutting words; append '...' if truncated.\"\"\"\n    result = []\n    for title in titles:\n        # No truncation needed\n        if len(title) <= 60:\n            result.append(title)\n            continue\n\n        # Preliminary cut at 60 characters\n        cutoff = title[:60]\n        last_space = cutoff.rfind(' ')\n\n        if last_space == -1:\n            # No space within the first 60 chars – cut hard\n            truncated = cutoff\n        else:\n            # Cut at the last full word that fits\n            truncated = cutoff[:last_space]\n\n        result.append(truncated.rstrip() + '...')\n\n    return result", "tests": [{"type": "basic", "inputs": [["This is a simple product", "This is a really long product title that definitely exceeds sixty characters in total length"]], "output": ["This is a simple product", "This is a really long product tit



Cannot parse
|{"solution": "def truncate_product_titles(titles):\n    \"\"\"Truncate each product title in the provided list so that if the title's\n    length exceeds 60 characters it is shortened without breaking words.\n    The truncation takes the longest prefix not exceeding 60 characters that\n    ends at a space; if no such space exists in the first 60 characters, the\n    cut occurs exactly at the 60th character. An ellipsis ('...') is appended\n    to every title that was shortened. The original list is not modified; a\n    new list with processed titles is returned.\n\n    Args:\n        titles (list[str]): List of product title strings.\n\n    Returns:\n        list[str]: List with titles processed according to the rules above.\n    \"\"\"\n    processed = []\n    for title in titles:\n        if len(title) <= 60:\n            processed.append(title)\n            continue\n\n        # Find the last space within the first 60 characters\n        cutoff_section = title[:60]\n  



Cannot parse
|{"solution": "def calculate_cart_total(cart, discount):\n    \"\"\"\n    Calculate the total cost for a shopping cart.\n\n    Parameters\n    ----------\n    cart : iterable of (price, quantity)\n        A sequence where each element is a tuple of (unit_price, quantity).\n    discount : float or int\n        Percentage discount to apply to the subtotal. A value of 10 means 10% off.\n\n    Returns\n    -------\n    float\n        The final total, rounded to two decimal places.\n    \"\"\"\n    # Subtotal: sum of (price * quantity) for every item in the cart\n    subtotal = sum(price * quantity for price, quantity in cart)\n\n    # Apply percentage discount\n    total_after_discount = subtotal * (1 - discount / 100.0)\n\n    # Round to two decimal places\n    return round(total_after_discount, 2)\n", "tests": [{"type": "basic", "inputs": [[(10, 2), (5, 1)], 10], "output": 22.5}, {"type": "edge_case", "inputs": [[], 0], "output": 0.0}, {"type": "edge_case", "inputs": [[(20, 



   [92mcalculate_cart_total([[10, 20, 30], [1, 2, 3], 10]) -> 126.0 vs. 126.0[0m
   [92mcalculate_cart_total([[5.5, 3.25], [2, 4], 0]) -> 24.0 vs. 24.0[0m
   [92mcalculate_cart_total([[100], [1], 100]) -> 0.0 vs. 0.0[0m
   [92mcalculate_cart_total([[], [], 5]) -> 0.0 vs. 0.0[0m
   [92mcalculate_cart_total([[10, 20], [1], 50]) -> 5.0 vs. 5.0[0m
[Problem gift_box_diagonal] Try 1/3




   [92mgift_box_diagonal([1, 2, 2]) -> 3.0 vs. 3.0[0m
   [92mgift_box_diagonal([3, 4, 12]) -> 13.0 vs. 13.0[0m
   [92mgift_box_diagonal([1.5, 2.5, 3.5]) -> 4.555 vs. 4.555[0m
   [92mgift_box_diagonal([0, 0, 0]) -> 0.0 vs. 0.0[0m
   [92mgift_box_diagonal([0.333, 0.444, 0.555]) -> 0.785 vs. 0.785[0m
[Problem batch_orders] Try 1/3




Cannot parse
|{"solution": "def batch_orders(orders, k):\n    \"\"\"\n    Group order ids into batches of at most *k* where each batch contains\n    orders from the same country. The relative order of orders (and thereby\n    of the batches themselves) must remain the same as in the original list.\n\n    Parameters\n    ----------\n    orders : list[tuple | list]\n        Each element is a 2-item tuple/list: (order_id, country).\n    k : int\n        Maximum size of a batch.\n\n    Returns\n    -------\n    list[list]\n        A list of batches where every batch is a list of order_ids.\n    \"\"\"\n    if k <= 0:\n        raise ValueError(\"k must be a positive integer\")\n\n    batches = []          # final list of batches\n    current_batch = []    # ids for the batch we're building\n    current_country = None\n\n    for order_id, country in orders:\n        # If we can append to current batch\n        if current_batch and country == current_country and len(current_batch) < k:\n     



Cannot parse
|{"solution": "def batch_orders(orders, k):\n    \"\"\"Group order ids into batches of at most k where all belong to the same country.\n       The relative order of orders must be preserved across batches.\n\n       Parameters:\n       orders : list of tuples (order_id, country)\n       k      : positive int, maximum size of each batch\n\n       Returns: list of lists, where each inner list contains order_ids.\n    \"\"\"\n    if k <= 0:\n        raise ValueError(\"k must be positive\")\n\n    batches = []            # final result\n    current_batch = []      # order_ids being collected\n    current_country = None  # country of current_batch\n\n    for order_id, country in orders:\n        if not current_batch:  # starting a new batch\n            current_country = country\n\n        if country == current_country and len(current_batch) < k:\n            current_batch.append(order_id)\n        else:\n            # finalize existing batch\n            batches.append(current



Cannot parse
|{"solution": "def batch_orders(orders, k):\n    \"\"\"Group order ids into batches of at most k where all belong to the same country.\n    The original order of orders is preserved across the produced batches.\n\n    Parameters\n    ----------\n    orders : list[tuple[int,str]]\n        A list of (order_id, country_code) pairs in their original arrival order.\n    k : int\n        Maximum number of orders allowed in a single batch. Must be positive.\n\n    Returns\n    -------\n    list[list[int]]\n        A list of batches, each batch being a list of order ids.\n    \"\"\"\n    if k <= 0:\n        raise ValueError(\"k must be a positive integer\")\n\n    batches = []            # final list of batches to return\n    current_batch = []      # accumulating ids for the current batch\n    current_country = None  # country of the current batch\n\n    for order_id, country in orders:\n        # Need a new batch if country changes or batch is full\n        if (current_country i



[91mError executing code: name 'MOD' is not defined[0m
[Problem possible_anagrams_count] Try 2/3




[91mError executing code: name 'MOD' is not defined[0m
[Problem possible_anagrams_count] Try 3/3




   [92mpossible_anagrams_count(['abc']) -> 6 vs. 6[0m
   [92mpossible_anagrams_count(['aab']) -> 3 vs. 3[0m
   [92mpossible_anagrams_count(['zzzz']) -> 1 vs. 1[0m
   [92mpossible_anagrams_count(['abab']) -> 6 vs. 6[0m
   [92mpossible_anagrams_count(['']) -> 1 vs. 1[0m
[Problem closest_palindrome_date] Try 1/3




   [92mclosest_palindrome_date(['2021-12-01']) -> 2021-12-02 vs. 2021-12-02[0m
   [92mclosest_palindrome_date(['2015-05-10']) -> 2020-02-02 vs. 2020-02-02[0m
   [92mclosest_palindrome_date(['2021-12-02']) -> 2030-03-02 vs. 2030-03-02[0m
   [92mclosest_palindrome_date(['2099-12-31']) -> 2101-10-12 vs. 2101-10-12[0m
   [92mclosest_palindrome_date(['1999-12-31']) -> 2001-10-02 vs. 2001-10-02[0m
[91mError executing code: date value out of range[0m
[Problem closest_palindrome_date] Try 2/3




   [92mclosest_palindrome_date(['2021-12-01']) -> 2021-12-02 vs. 2021-12-02[0m
   [92mclosest_palindrome_date(['2021-12-02']) -> 2030-03-02 vs. 2030-03-02[0m
   [92mclosest_palindrome_date(['1999-12-31']) -> 2001-10-02 vs. 2001-10-02[0m
   [92mclosest_palindrome_date(['2011-11-11']) -> 2020-02-02 vs. 2020-02-02[0m
   [92mclosest_palindrome_date(['2001-10-01']) -> 2001-10-02 vs. 2001-10-02[0m
[Problem ipa_symbol_extractor] Try 1/3




   [92mipa_symbol_extractor(['The word /kæt/ means cat.']) -> ['k', 'æ', 't'] vs. ['k', 'æ', 't'][0m
   [92mipa_symbol_extractor(['English /ðə/ word for /kat/ is cat.']) -> ['ð', 'ə', 'k', 'a', 't'] vs. ['ð', 'ə', 'k', 'a', 't'][0m
   [92mipa_symbol_extractor(['I said /bæt/ and /bæd/.']) -> ['b', 'æ', 't', 'd'] vs. ['b', 'æ', 't', 'd'][0m
   [92mipa_symbol_extractor(["The phrase /həˈloʊ wɜːld/ is 'hello world'."]) -> ['h', 'ə', 'ˈ', 'l', 'o', 'ʊ', 'w', 'ɜ', 'ː', 'd'] vs. ['h', 'ə', 'ˈ', 'l', 'o', 'ʊ', 'w', 'ɜ', 'ː', 'd'][0m
   [91mipa_symbol_extractor(['/səmˈtaɪmz/ slashes can /əˈpɪə/ more than once.']) -> ['s', 'ə', 'm', 'ˈ', 't', 'a', 'ɪ', 'z', 'p'] vs. ['s', 'ə', 'm', 'ˈ', 't', 'a', 'ɪ', 'z', 'l', 'a', 'ʃ', 'e', 'c', 'n', 'p', 'ɪ', 'r'][0m
[Problem ipa_symbol_extractor] Try 2/3




   [92mipa_symbol_extractor(['English cat is /kæt/ and bat is /bæt/.']) -> ['k', 'æ', 't', 'b'] vs. ['k', 'æ', 't', 'b'][0m
   [92mipa_symbol_extractor(['/pɪn/ to /pɪŋ/ vs /pɪn/']) -> ['p', 'ɪ', 'n', 'ŋ'] vs. ['p', 'ɪ', 'n', 'ŋ'][0m
   [92mipa_symbol_extractor(['Pronounce / t ʃ / or /dʒ /']) -> ['t', 'ʃ', 'd', 'ʒ'] vs. ['t', 'ʃ', 'd', 'ʒ'][0m
   [92mipa_symbol_extractor(['Start /s p']) -> ['s', 'p'] vs. ['s', 'p'][0m
   [92mipa_symbol_extractor(['']) -> [] vs. [][0m
[Problem find_word_variant] Try 1/3




   [92mfind_word_variant([['café', 'jalapeño', 'piñata', 'résumé'], 'resume', {'é': 'e', 'ñ': 'n', 'í': 'i', 'á': 'a', 'ó': 'o'}]) -> 3 vs. 3[0m
   [92mfind_word_variant([['año', 'árbol', 'niño'], 'arbol', {'ñ': 'n', 'á': 'a'}]) -> 1 vs. 1[0m
   [92mfind_word_variant([['café', 'jalapeño', 'piñata', 'résumé'], 'cafeine', {'é': 'e', 'ñ': 'n', 'í': 'i', 'á': 'a', 'ó': 'o'}]) -> -1 vs. -1[0m
   [92mfind_word_variant([[], 'anything', {'é': 'e'}]) -> -1 vs. -1[0m
   [92mfind_word_variant([['él', 'ella', 'ellos'], 'el', {'é': 'e'}]) -> 0 vs. 0[0m
[Problem phonetic_hash] Try 1/3




   [92mphonetic_hash(['Robert']) -> R163 vs. R163[0m
   [92mphonetic_hash(['Rupert']) -> R163 vs. R163[0m
   [92mphonetic_hash(['Rubin']) -> R150 vs. R150[0m
   [92mphonetic_hash(['Ashcraft']) -> A261 vs. A261[0m
   [92mphonetic_hash(['Pfister']) -> P236 vs. P236[0m
   [92mphonetic_hash(['Tymczak']) -> T522 vs. T522[0m
[Problem common_visible_stars] Try 1/3




   [92mcommon_visible_stars([[[1, 2, 3], [2, 3, 4], [2, 5]]]) -> [2] vs. [2][0m
   [92mcommon_visible_stars([[[10, 20, 30], [30, 20, 10]]]) -> [10, 20, 30] vs. [10, 20, 30][0m
   [92mcommon_visible_stars([[]]) -> [] vs. [][0m
   [92mcommon_visible_stars([[[5, 3, 1]]]) -> [1, 3, 5] vs. [1, 3, 5][0m
   [92mcommon_visible_stars([[[1, 2], [3, 4]]]) -> [] vs. [][0m
[Problem encode_constellation_pattern] Try 1/3




Cannot parse
|{"solution": "def encode_constellation_pattern(stars):\n    \"\"\"Encode a list of star indices (0–15) into a 16-bit integer.\n\n    Each bit i (0-indexed from the least-significant bit) is set to 1\n    if star i is present in the input list. Duplicate indices are\n    ignored. The function assumes all provided indices are within the\n    valid range 0–15.\n    \"\"\"\n    pattern = 0\n    for idx in stars:\n        # Set bit idx\n        pattern |= 1 << idx\n    return pattern\n", "tests": [{"type": "basic", "inputs": [[0, 2, 5]], "output": 37}, {"type": "basic", "inputs": [[3, 3, 3]], "output": 8}, {"type": "edge_case", "inputs": [[]], "output": 0}, {"type": "edge_case", "inputs": [[i for i in range(16)]], "output": 65535}, {"type": "basic", "inputs": [[15, 0, 7]], "output": 32897}] }|
[Problem encode_constellation_pattern] Try 2/3




   [92mencode_constellation_pattern([[0, 1, 2]]) -> 7 vs. 7[0m
   [92mencode_constellation_pattern([[3, 3, 3]]) -> 8 vs. 8[0m
   [92mencode_constellation_pattern([[]]) -> 0 vs. 0[0m
   [92mencode_constellation_pattern([[0, 2, 4, 6, 8, 10, 12, 14]]) -> 21845 vs. 21845[0m
   [92mencode_constellation_pattern([[15]]) -> 32768 vs. 32768[0m
[Problem meteor_shower_sim] Try 1/3




   [92mmeteor_shower_sim([[10, 10, 10]]) -> [8, 16, 20] vs. [8, 16, 20][0m
   [92mmeteor_shower_sim([[1, 2, 3]]) -> [1, 3, 4] vs. [1, 3, 4][0m
   [92mmeteor_shower_sim([[]]) -> [] vs. [][0m
   [92mmeteor_shower_sim([[0, 0, 0]]) -> [0, 0, 0] vs. [0, 0, 0][0m
   [92mmeteor_shower_sim([[5.5, 2.2]]) -> [5, 7] vs. [5, 7][0m
[Problem binary_star_orbit_steps] Try 1/3




   [92mbinary_star_orbit_steps([1, 1, 5]) -> [1, 1, 2, 3, 5, 8] vs. [1, 1, 2, 3, 5, 8][0m
   [92mbinary_star_orbit_steps([2, 3, 4]) -> [2, 3, 5, 8, 13] vs. [2, 3, 5, 8, 13][0m
   [92mbinary_star_orbit_steps([7, 10, 0]) -> [7] vs. [7][0m
   [92mbinary_star_orbit_steps([0, 0, 3]) -> [0, 0, 0, 0] vs. [0, 0, 0, 0][0m
   [92mbinary_star_orbit_steps([5, 8, 1]) -> [5, 8] vs. [5, 8][0m
[Problem count_eclipses_in_range] Try 1/3




   [92mcount_eclipses_in_range([['2020-06-21', '2020-12-14', '2021-06-10', '2021-12-04', '2022-04-30'], '2020-01-01', '2020-12-31']) -> 2 vs. 2[0m
   [92mcount_eclipses_in_range([['2020-06-21', '2020-12-14', '2021-06-10', '2021-12-04', '2022-04-30'], '2019-01-01', '2019-12-31']) -> 0 vs. 0[0m
   [92mcount_eclipses_in_range([['2020-06-21', '2020-12-14', '2021-06-10', '2021-12-04', '2022-04-30'], '2020-06-21', '2022-04-30']) -> 5 vs. 5[0m
   [92mcount_eclipses_in_range([[], '2020-01-01', '2022-12-31']) -> 0 vs. 0[0m
   [92mcount_eclipses_in_range([['2020-06-21', '2021-06-10'], '2021-07-01', '2021-01-01']) -> 0 vs. 0[0m
[Problem team_selection_budget] Try 1/3




   [92mteam_selection_budget([50, [10, 20, 30], [5, 10, 15]]) -> 2 vs. 2[0m
   [92mteam_selection_budget([100, [30, 20, 50, 10], [3, 2, 5, 1]]) -> 3 vs. 3[0m
   [92mteam_selection_budget([5, [10, 20], [1, 2]]) -> 0 vs. 0[0m
   [92mteam_selection_budget([0, [5, 5], [1, 1]]) -> 0 vs. 0[0m
   [91mteam_selection_budget([100, [30, 40, 25], [10, 40, 10]]) -> 3 vs. 2[0m
[Problem team_selection_budget] Try 2/3




Cannot parse
|{"solution": "def team_selection_budget(budget, players):\n    \"\"\"Return the maximum number of players that can be bought without\n    exceeding the given budget using a greedy strategy based on the\n    cost/skill ratio (lower ratio first).  If several players share the\n    same ratio, they are ordered by lower cost first, then by higher\n    skill (to spend budget more efficiently while keeping the same\n    ratio).\n\n    Parameters\n    ----------\n    budget : int\n        Available amount of money that can be spent.\n    players : list[tuple[int, int]]\n        Each tuple contains (cost, skill) for a player.\n\n    Returns\n    -------\n    int\n        The number of players selected.\n    \"\"\"\n    # Guard against empty player list or non-positive budget early.\n    if budget <= 0 or not players:\n        return 0\n\n    # Prepare list with their ratios; avoid division by zero (skill == 0)\n    processed = []  # (ratio, cost, -skill, cost)  the last field kee



Cannot parse
|{"solution": "def team_selection_budget(budget, players):\n    \"\"\"Return the maximum number of players that can be bought without\n    exceeding the given budget.  Each player is represented by a tuple\n    (cost, skill). Players are considered for purchase in increasing\n    order of their cost/skill ratio (greedy heuristic).\n    \n    Parameters\n    ----------\n    budget : int\n        The total amount of money that can be spent.\n    players : list[tuple[int, int]]\n        A list where each element is a tuple (cost, skill).\n    \n    Returns\n    -------\n    int\n        The maximum number of players that can be bought using the\n        greedy strategy.\n    \"\"\"\n    # Guard-clauses for trivial situations\n    if budget <= 0 or not players:\n        return 0\n\n    # Sort by ratio cost/skill; if ratios tie, prefer cheaper cost first\n    players_sorted = sorted(players, key=lambda cs: (cs[0] / cs[1], cs[0]))\n\n    count = 0\n    remaining = budget\n\n    



Cannot parse
|{"solution": "def scoreboard_string(matches):\n    \"\"\"Given a list/tuple with two (team, score) tuples, return a string of the\n    form 'TeamA 3-1 TeamB' where the team with the higher score is listed first.\n\n    If the scores are tied, the original order is preserved.\n    \"\"\"\n    # Basic validation\n    if not isinstance(matches, (list, tuple)) or len(matches) != 2:\n        raise ValueError(\"Input must be a list/tuple containing exactly two (team, score) pairs\")\n\n    # Attach original indices so we can preserve original order on ties\n    indexed = list(enumerate(matches))  # [(idx, (team, score)), ...]\n\n    # Sort primarily by score descending; secondarily by original index (to keep order on ties)\n    indexed.sort(key=lambda item: (-item[1][1], item[0]))\n\n    # Unpack the two sorted pairs\n    (team1, score1) = indexed[0][1]\n    (team2, score2) = indexed[1][1]\n\n    return f\"{team1} {score1}-{score2} {team2}\"", "tests": [{"type": "basic", "input



Cannot parse
|{"solution": "def scoreboard_string(match):\n    \"\"\"Convert a pair of (team, score) tuples into a scoreboard string.\n    The team with the higher score is placed first; if scores are tied,\n    the original order is preserved.\n\n    Parameters\n    ----------\n    match : list[tuple] | tuple[tuple]\n        An iterable with exactly two elements, each of which is a tuple\n        of the form (team_name: str, score: int).\n\n    Returns\n    -------\n    str\n        A string formatted as 'TeamA x-y TeamB'.\n    \"\"\"\n    if len(match) != 2:\n        raise ValueError(\"Input must contain exactly two (team, score) pairs\")\n\n    (team1, score1), (team2, score2) = match\n\n    # Decide order: higher score first; if tied, keep original order\n    if score1 > score2:\n        first_team, first_score, second_team, second_score = team1, score1, team2, score2\n    elif score2 > score1:\n        first_team, first_score, second_team, second_score = team2, score2, team1, scor



Cannot parse
|{"solution": "def scoreboard_string(match):\n    \"\"\"Given a list or tuple with exactly two (team, score) pairs, return a string\n    formatted like 'WinningTeam 3-1 LosingTeam'. If the second team has a higher\n    score, it is presented first. If scores are equal, original order is kept.\n    Examples\n    --------\n    >>> scoreboard_string([(\"A\", 3), (\"B\", 1)])\n    'A 3-1 B'\n    >>> scoreboard_string([(\"A\", 1), (\"B\", 4)])\n    'B 4-1 A'\n    >>> scoreboard_string([(\"A\", 2), (\"B\", 2)])\n    'A 2-2 B'\n    \"\"\"\n    if len(match) != 2:\n        raise ValueError(\"Exactly two teams must be provided\")\n\n    # Ensure we have mutable sequence (could be tuple);\n    (team1, score1), (team2, score2) = match\n\n    # Swap order if the second team has the higher score\n    if score2 > score1:\n        team1, team2 = team2, team1\n        score1, score2 = score2, score1\n\n    return f\"{team1} {score1}-{score2} {team2}\"",
 "tests": [
    {"type": "basic", "



   [92mvalidate_badminton_score([21, 18]) -> True vs. True[0m
   [92mvalidate_badminton_score([22, 20]) -> True vs. True[0m
   [92mvalidate_badminton_score([21, 20]) -> False vs. False[0m
   [92mvalidate_badminton_score([30, 29]) -> True vs. True[0m
   [92mvalidate_badminton_score([30, 28]) -> False vs. False[0m
   [92mvalidate_badminton_score([25, 17]) -> False vs. False[0m
[Problem player_moving_average] Try 1/3




Cannot parse
|{"solution": "from collections import deque\n\ndef player_moving_average(records, k):\n    \"\"\"Compute the moving average of the last k game points for each player.\n\n    Parameters\n    ----------\n    records : list[tuple[int, (int | float)]]\n        Chronologically ordered list of (player_id, points) tuples.\n    k : int\n        Window size for the moving average.\n\n    Returns\n    -------\n    dict[int, float]\n        Mapping from player_id to the average of their last ``k`` (or fewer) game points.\n    \"\"\"\n    if k <= 0:\n        raise ValueError(\"k must be a positive integer\")\n\n    # For each player keep a deque of the most recent k scores and running sum\n    windows = {}\n    sums = {}\n\n    for pid, pts in records:\n        if pid not in windows:\n            windows[pid] = deque()\n            sums[pid] = 0.0\n\n        q = windows[pid]\n        sums[pid] += pts\n        q.append(pts)\n\n        # Maintain window size\n        if len(q) > k:\n  



Cannot parse
|{"solution": "from collections import defaultdict, deque\n\ndef player_moving_average(records, k):\n    \"\"\"Compute the average of the last *k* game points for every player.\n\n    Parameters\n    ----------\n    records : list[tuple[int, int]]\n        A chronological list of tuples (player_id, points_scored).\n    k : int\n        The window size for the moving average (must be > 0).\n\n    Returns\n    -------\n    dict[int, float]\n        Mapping from player_id to the average of their last *k* points.  If a\n        player appears fewer than *k* times, the average of all of their\n        available points is returned.\n    \"\"\"\n    if k <= 0:\n        raise ValueError(\"k must be a positive integer\")\n\n    # keep a fixed-size queue per player to hold their last k scores\n    queues = defaultdict(lambda: deque(maxlen=k))\n\n    for pid, pts in records:\n        queues[pid].append(pts)\n\n    # compute the average of the stored points for each player\n    result



Cannot parse
|{"solution": "from collections import defaultdict, deque\n\ndef player_moving_average(records, k):\n    \"\"\"Compute the moving average of the last k game points for each player.\n    Args:\n        records (list[tuple]): Chronological list of (player_id, points) tuples.\n        k (int): Window size for moving average (must be >=1).\n    Returns:\n        dict: Mapping player_id -> average of the last k recorded points.\n    \"\"\"\n    if k <= 0:\n        raise ValueError(\"k must be a positive integer\")\n\n    # For each player maintain a deque of the last k scores and a running sum\n    windows = defaultdict(lambda: deque(maxlen=k))\n    sums = defaultdict(int)\n\n    for pid, pts in records:\n        window = windows[pid]\n        # If the window is full, subtract the point that will be popped out\n        if len(window) == k:\n            sums[pid] -= window[0]\n        window.append(pts)\n        sums[pid] += pts\n\n    # Build the averages\n    averages = {}\n  



   [92mfilter_home_games([[{'home': 'TeamA', 'away': 'TeamB'}, {'home': 'TeamC', 'away': 'TeamA'}, {'home': 'TeamA', 'away': 'TeamD'}], 'TeamA']) -> [{'home': 'TeamA', 'away': 'TeamB'}, {'home': 'TeamA', 'away': 'TeamD'}] vs. [{'home': 'TeamA', 'away': 'TeamB'}, {'home': 'TeamA', 'away': 'TeamD'}][0m
   [92mfilter_home_games([[{'home': 'TeamA', 'away': 'TeamB'}, {'home': 'TeamC', 'away': 'TeamA'}, {'home': 'TeamA', 'away': 'TeamD'}], 'TeamB']) -> [] vs. [][0m
   [92mfilter_home_games([[], 'AnyTeam']) -> [] vs. [][0m
   [92mfilter_home_games([[{'homeTeam': 'Lions', 'awayTeam': 'Tigers'}, {'homeTeam': 'Bears', 'awayTeam': 'Lions'}], 'Lions']) -> [{'homeTeam': 'Lions', 'awayTeam': 'Tigers'}] vs. [{'homeTeam': 'Lions', 'awayTeam': 'Tigers'}][0m
   [92mfilter_home_games([[{'home': 'AAA', 'away': 'BBB'}, {'home': 'AAA', 'away': 'CCC'}, {'home': 'DDD', 'away': 'AAA'}, {'home': 'AAA', 'away': 'EEE'}], 'AAA']) -> [{'home': 'AAA', 'away': 'BBB'}, {'home': 'AAA', 'away': 'CCC'}, {'home':



Cannot parse
|{"solution": "def sort_tracks_by_bpm(tracks):\n    \"\"\"Sort a list of track dictionaries by BPM (ascending). If two tracks have the\n    same BPM, they are ordered by popularity (descending).\n\n    Parameters\n    ----------\n    tracks : list[dict]\n        Each dictionary must contain at least the keys 'bpm' (numeric) and\n        'popularity' (numeric).\n\n    Returns\n    -------\n    list[dict]\n        A new list with tracks sorted as specified.\n    \"\"\"\n    if not isinstance(tracks, list):\n        raise TypeError(\"Input must be a list of dictionaries\")\n\n    # Use tuple (bpm, -popularity) for desired order\n    return sorted(tracks, key=lambda t: (t.get('bpm'), -t.get('popularity')))", "tests": [{"type": "basic", "inputs": [[{\"title\": \"A\", \"bpm\": 120, \"popularity\": 50}, {\"title\": \"B\", \"bpm\": 110, \"popularity\": 70}, {\"title\": \"C\", \"bpm\": 130, \"popularity\": 60}]], "output": [{\"title\": \"B\", \"bpm\": 110, \"popularity\": 70}, {\"t



   [92msort_tracks_by_bpm([[{'title': 'Track A', 'bpm': 120, 'popularity': 80}, {'title': 'Track B', 'bpm': 110, 'popularity': 90}, {'title': 'Track C', 'bpm': 120, 'popularity': 85}]]) -> [{'title': 'Track B', 'bpm': 110, 'popularity': 90}, {'title': 'Track C', 'bpm': 120, 'popularity': 85}, {'title': 'Track A', 'bpm': 120, 'popularity': 80}] vs. [{'title': 'Track B', 'bpm': 110, 'popularity': 90}, {'title': 'Track C', 'bpm': 120, 'popularity': 85}, {'title': 'Track A', 'bpm': 120, 'popularity': 80}][0m
   [92msort_tracks_by_bpm([[{'title': 'T1', 'bpm': 100, 'popularity': 50}, {'title': 'T2', 'bpm': 100, 'popularity': 60}]]) -> [{'title': 'T2', 'bpm': 100, 'popularity': 60}, {'title': 'T1', 'bpm': 100, 'popularity': 50}] vs. [{'title': 'T2', 'bpm': 100, 'popularity': 60}, {'title': 'T1', 'bpm': 100, 'popularity': 50}][0m
   [92msort_tracks_by_bpm([[]]) -> [] vs. [][0m
   [92msort_tracks_by_bpm([[{'title': 'Solo', 'bpm': 128, 'popularity': 70}]]) -> [{'title': 'Solo', 'bpm': 128



   [92mclean_lyrics(['[Intro] La la la\nLa la la\nLa la la']) -> La la la vs. La la la[0m
   [92mclean_lyrics(['Hello (whisper)\nHello\nWorld']) -> Hello
World vs. Hello
World[0m
   [92mclean_lyrics(['Boom\n\nBoom\nBoom']) -> Boom

Boom vs. Boom

Boom[0m
   [92mclean_lyrics(['Love [Chorus: x2]\nLove\n(Love)\nLove']) -> Love

Love vs. Love

Love[0m
   [92mclean_lyrics(['']) ->  vs. [0m
[Problem transpose_chords] Try 1/3




   [92mtranspose_chords([['C', 'G', 'Am', 'F'], 2]) -> ['D', 'A', 'Bm', 'G'] vs. ['D', 'A', 'Bm', 'G'][0m
   [92mtranspose_chords([['F#m', 'B', 'E'], -2]) -> ['Em', 'A', 'D'] vs. ['Em', 'A', 'D'][0m
   [92mtranspose_chords([['C'], 14]) -> ['D'] vs. ['D'][0m
   [92mtranspose_chords([['Bb', 'Ebm'], 3]) -> ['C#', 'F#m'] vs. ['C#', 'F#m'][0m
   [91mtranspose_chords([['G#'], -13]) -> ['G'] vs. ['B'][0m
[Problem transpose_chords] Try 2/3




   [92mtranspose_chords([['C', 'G', 'Am'], 2]) -> ['D', 'A', 'Bm'] vs. ['D', 'A', 'Bm'][0m
   [92mtranspose_chords([['F#', 'C#m', 'G#'], -1]) -> ['F', 'Cm', 'G'] vs. ['F', 'Cm', 'G'][0m
   [92mtranspose_chords([['Bb', 'Eb', 'F'], 3]) -> ['C#', 'F#', 'G#'] vs. ['C#', 'F#', 'G#'][0m
   [92mtranspose_chords([['C', 'Dm', 'E'], 0]) -> ['C', 'Dm', 'E'] vs. ['C', 'Dm', 'E'][0m
   [92mtranspose_chords([['D', 'Em', 'Gb'], 14]) -> ['E', 'F#m', 'G#'] vs. ['E', 'F#m', 'G#'][0m
[Problem speaker_array_layout] Try 1/3




Cannot parse
|{"solution": "import math\n\ndef speaker_array_layout(n, r):\n    \"\"\"Return a list of (x, y) coordinates for n equally spaced speakers\n    on a circle of radius r centred at the origin, starting at angle 0\n    along the positive x-axis and proceeding counter-clockwise.\n    \"\"\"\n    # basic validation\n    if not isinstance(n, int) or n <= 0:\n        raise ValueError(\"n must be a positive integer\")\n    if r < 0:\n        raise ValueError(\"r must be non-negative\")\n\n    coords = []\n    step = 2 * math.pi / n\n    for k in range(n):\n        angle = k * step\n        x = round(r * math.cos(angle), 10)\n        y = round(r * math.sin(angle), 10)\n        # convert -0.0 to 0.0 for neatness\n        if x == -0.0:\n            x = 0.0\n        if y == -0.0:\n            y = 0.0\n        coords.append((x, y))\n    return coords\n", "tests": [{"type": "basic", "inputs": [4, 1], "output": [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]}, {"type": "basic", "input



Cannot parse
|{"solution": "import math\n\ndef speaker_array_layout(n, r):\n    \"\"\"Return coordinates of n speakers equally spaced on a circle of radius r.\n    Speakers are placed counter-clockwise starting from (r, 0).\n    Args:\n        n (int): number of speakers (>=0)\n        r (float): circle radius (>=0)\n    Returns:\n        list[tuple[float,float]]: list of (x, y) coordinates.\n    \"\"\"\n    if n <= 0:\n        return []\n    # Pre-compute step in radians\n    step = 2 * math.pi / n\n    return [(r * math.cos(step * k), r * math.sin(step * k)) for k in range(n)]",
"tests": [{"type": "basic", "inputs": [4, 1], "output": [(1.0, 0.0), (6.123233995736766e-17, 1.0), (-1.0, 1.2246467991473532e-16), (-1.8369701987210297e-16, -1.0)]},
{"type": "basic", "inputs": [1, 5], "output": [(5.0, 0.0)]},
{"type": "edge_case", "inputs": [0, 3], "output": []},
{"type": "basic", "inputs": [3, 2], "output": [(2.0, 0.0), (-1.0000000000000002, 1.7320508075688772), (-0.9999999999999996, -1.732



Cannot parse
|{"solution": "import math\n\ndef speaker_array_layout(n, r):\n    \"\"\"Return a list of (x, y) coordinates for `n` speakers equally spaced\n    around a circle of radius `r` centered at the origin. The first speaker\n    is placed on the positive x-axis (angle 0), and subsequent speakers are\n    placed counter-clockwise at equal angular intervals.\n    The coordinates are rounded to 10 decimal places to avoid tiny floating\n    point artifacts (e.g., 6.1232339957e-17 instead of 0).\n    \"\"\"\n    if n <= 0:\n        return []\n\n    step = 2 * math.pi / n\n    coords = []\n    for i in range(n):\n        theta = step * i\n        x = round(r * math.cos(theta), 10)\n        y = round(r * math.sin(theta), 10)\n        coords.append((x, y))\n    return coords\n", "tests": [{"type": "basic", "inputs": [4, 1], "output": [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]}, {"type": "basic", "inputs": [3, 2], "output": [(2.0, 0.0), (-1.0, 1.7320508076), (-1.0, -1.7320508076)



   [92msegment_playlist([[4, 2, 3, 5], 6]) -> [2, 1, 1] vs. [2, 1, 1][0m
   [92msegment_playlist([[1, 2, 3], 3]) -> [2, 1] vs. [2, 1][0m
   [92msegment_playlist([[], 10]) -> [] vs. [][0m
   [92msegment_playlist([[5, 5, 5], 20]) -> [3] vs. [3][0m
   [92msegment_playlist([[7], 10]) -> [1] vs. [1][0m
[Problem weather_outfit_options] Try 1/3




Cannot parse
|{"solution": "def weather_outfit_options(tops, bottoms, accessories):\n    \"\"\"Return the number of possible outfits composed of:\n    • exactly one top,\n    • exactly one bottom,\n    • 0-2 distinct accessories (order does not matter).\n    \n    Parameters\n    ----------\n    tops : list\n        List of available tops.\n    bottoms : list\n        List of available bottoms.\n    accessories : list\n        List of available accessories (items assumed distinct; duplicates are treated as distinct positions).\n    \n    Returns\n    -------\n    int\n        Total number of valid outfit combinations.\n    \"\"\"\n    n_tops = len(tops)\n    n_bottoms = len(bottoms)\n    n_acc = len(accessories)\n\n    # If no tops or bottoms, no outfit can be formed\n    if n_tops == 0 or n_bottoms == 0:\n        return 0\n\n    # combinations of accessories: choose 0, 1 or 2 (without order)\n    # C(n,0) = 1, C(n,1) = n, C(n,2) = n*(n-1)//2 for n>=2 else 0\n    comb_0 = 1  # choosing



   [92mweather_outfit_options([['t1', 't2'], ['b1'], ['a1', 'a2']]) -> 8 vs. 8[0m
   [92mweather_outfit_options([['top'], ['bottom1', 'bottom2', 'bottom3'], []]) -> 3 vs. 3[0m
   [92mweather_outfit_options([['shirt'], ['jeans'], ['watch']]) -> 2 vs. 2[0m
   [92mweather_outfit_options([[], ['b1'], ['a1', 'a2', 'a3']]) -> 0 vs. 0[0m
   [92mweather_outfit_options([['t1'], ['b1'], ['a1', 'a2', 'a3', 'a4']]) -> 11 vs. 11[0m
[Problem sunrise_sequence_breaks] Try 1/3




   [92msunrise_sequence_breaks([['06:30', '06:29', '06:28']]) -> 2 vs. 2[0m
   [92msunrise_sequence_breaks([['07:00', '07:05', '07:10']]) -> 0 vs. 0[0m
   [92msunrise_sequence_breaks([[]]) -> 0 vs. 0[0m
   [92msunrise_sequence_breaks([['05:50']]) -> 0 vs. 0[0m
   [92msunrise_sequence_breaks([['06:00', '05:59', '06:01', '05:58']]) -> 2 vs. 2[0m
   [92msunrise_sequence_breaks([['23:59', '00:00']]) -> 1 vs. 1[0m
[Problem decode_weather_station] Try 1/3




   [92mdecode_weather_station(['123|52.5 13.4|20']) -> {'id': 123, 'lat': 52.5, 'lon': 13.4, 'tempF': 68.0} vs. {'id': 123, 'lat': 52.5, 'lon': 13.4, 'tempF': 68.0}[0m
   [92mdecode_weather_station(['7|-34.5 -58.4|25.6']) -> {'id': 7, 'lat': -34.5, 'lon': -58.4, 'tempF': 78.08} vs. {'id': 7, 'lat': -34.5, 'lon': -58.4, 'tempF': 78.08}[0m
   [92mdecode_weather_station(['42| 0 0 | -40']) -> {'id': 42, 'lat': 0.0, 'lon': 0.0, 'tempF': -40.0} vs. {'id': 42, 'lat': 0.0, 'lon': 0.0, 'tempF': -40.0}[0m
   [92mdecode_weather_station(['0|90 -180|100']) -> {'id': 0, 'lat': 90.0, 'lon': -180.0, 'tempF': 212.0} vs. {'id': 0, 'lat': 90.0, 'lon': -180.0, 'tempF': 212.0}[0m
   [92mdecode_weather_station(['999|  12.345   67.89  |  15.5  ']) -> {'id': 999, 'lat': 12.345, 'lon': 67.89, 'tempF': 59.9} vs. {'id': 999, 'lat': 12.345, 'lon': 67.89, 'tempF': 59.9}[0m
[Problem locate_calm_period] Try 1/3




   [92mlocate_calm_period([[3, 4, 2, 1, 5], 2, 3]) -> 2 vs. 2[0m
   [92mlocate_calm_period([[1, 1, 1], 3, 2]) -> 0 vs. 0[0m
   [92mlocate_calm_period([[4, 5, 6], 1, 4]) -> -1 vs. -1[0m
   [92mlocate_calm_period([[1, 2], 3, 5]) -> -1 vs. -1[0m
   [92mlocate_calm_period([[0, 1, 5, 1, 0], 2, 2]) -> 0 vs. 0[0m
[Problem weather_cache_key] Try 1/3




   [92mweather_cache_key([{}]) -> da39a3ee5e6b4b0d vs. da39a3ee5e6b4b0d[0m
   [92mweather_cache_key([{}]) -> da39a3ee5e6b4b0d vs. da39a3ee5e6b4b0d[0m
   [92mweather_cache_key([{}]) -> da39a3ee5e6b4b0d vs. da39a3ee5e6b4b0d[0m
   [92mweather_cache_key([{}]) -> da39a3ee5e6b4b0d vs. da39a3ee5e6b4b0d[0m
[Problem iot_shared_sensors] Try 1/3




   [92miot_shared_sensors([[[1, 2, 3], [3, 4], [5]]]) -> [3] vs. [3][0m
   [92miot_shared_sensors([[[1, 2], [2, 3], [2, 4], [5]]]) -> [2] vs. [2][0m
   [92miot_shared_sensors([[[1, 1, 1], [1], [2, 2], [2]]]) -> [1, 2] vs. [1, 2][0m
   [92miot_shared_sensors([[[1], [2], [3]]]) -> [] vs. [][0m
   [92miot_shared_sensors([[[1, 2], [1, 2], [1, 2]]]) -> [1, 2] vs. [1, 2][0m
[Problem encode_device_status] Try 1/3




Cannot parse
|{"solution": "def encode_device_status(online, battery_low, fault):\n    \"\"\"Pack three boolean-like flags into a 3-bit integer.\n    Bit 0 -> online        (LSB)\n    Bit 1 -> battery_low\n    Bit 2 -> fault\n    Each input is cast to bool so any truthy / falsy value is accepted.\n    \"\"\"\n    return (bool(online) << 0) | (bool(battery_low) << 1) | (bool(fault) << 2)\n", "tests": [{"type": "basic", "inputs": [True, False, False], "output": 1}, {"type": "basic", "inputs": [True, True, False], "output": 3}, {"type": "basic", "inputs": [False, True, True], "output": 6}, {"type": "edge_case", "inputs": [False, False, False], "output": 0}, {"type": "edge_case", "inputs": [True, True, True], "output": 7}, {"type": "edge_case", "inputs": [1, 0, 1], "output": 5}]}|
[Problem encode_device_status] Try 2/3




Cannot parse
|{"solution": "def encode_device_status(online, battery_low, fault):\n    \"\"\"Return an integer whose bits encode the three device status flags.\n    Bit 0 (LSB): online\n    Bit 1: battery_low\n    Bit 2: fault\n    Accepts booleans or truthy/falsey values; converts them to ints (0 or 1).\n    \"\"\"\n    return (int(bool(fault)) << 2) | (int(bool(battery_low)) << 1) | int(bool(online))",
 "tests": [{"type": "basic", "inputs": [True, False, False], "output": 1}, {"type": "basic", "inputs": [False, True, False], "output": 2}, {"type": "basic", "inputs": [True, True, True], "output": 7}, {"type": "edge_case", "inputs": [False, False, False], "output": 0}, {"type": "edge_case", "inputs": [False, False, True], "output": 4}, {"type": "edge_case", "inputs": [1, 1, 0], "output": 3}]}|
[Problem encode_device_status] Try 3/3




Cannot parse
|{"solution": "def encode_device_status(online, battery_low, fault):\n    \"\"\"Return an integer whose bit 0 represents `online`, bit 1 represents\n    `battery_low`, and bit 2 represents `fault`.\n    Each argument is interpreted truth-value wise.\n    \"\"\"\n    return (bool(fault) << 2) | (bool(battery_low) << 1) | bool(online)\n", "tests": [{"type": "basic", "inputs": [True, False, False], "output": 1}, {"type": "basic", "inputs": [True, True, False], "output": 3}, {"type": "basic", "inputs": [False, False, True], "output": 4}, {"type": "edge_case", "inputs": [True, True, True], "output": 7}, {"type": "edge_case", "inputs": [False, False, False], "output": 0}]}|
[Problem simulate_temperature_control] Try 1/3




   [92msimulate_temperature_control([20, [19, 20, 21]]) -> ['ON', 'ON', 'OFF'] vs. ['ON', 'ON', 'OFF'][0m
   [92msimulate_temperature_control([22.5, [23, 22, 22.3, 23.1]]) -> ['OFF', 'OFF', 'OFF', 'OFF'] vs. ['OFF', 'OFF', 'OFF', 'OFF'][0m
   [92msimulate_temperature_control([18, []]) -> [] vs. [][0m
   [92msimulate_temperature_control([18, [17, 17, 18, 19, 18, 17]]) -> ['ON', 'ON', 'ON', 'OFF', 'OFF', 'ON'] vs. ['ON', 'ON', 'ON', 'OFF', 'OFF', 'ON'][0m
   [91msimulate_temperature_control([15, [14.5, 15, 15.5, 15.6, 14.4]]) -> ['OFF', 'OFF', 'OFF', 'OFF', 'ON'] vs. ['ON', 'ON', 'ON', 'OFF', 'ON'][0m
[Problem simulate_temperature_control] Try 2/3




   [92msimulate_temperature_control([22.0, [21.0, 21.3, 22.6, 22.4, 21.4]]) -> ['ON', 'ON', 'OFF', 'OFF', 'ON'] vs. ['ON', 'ON', 'OFF', 'OFF', 'ON'][0m
   [91msimulate_temperature_control([20, [19.4, 19.6, 20.5, 19.7]]) -> ['ON', 'ON', 'ON', 'ON'] vs. ['ON', 'ON', 'OFF', 'OFF'][0m
[Problem simulate_temperature_control] Try 3/3




   [92msimulate_temperature_control([20, [19, 19.5, 20, 21]]) -> ['ON', 'ON', 'ON', 'OFF'] vs. ['ON', 'ON', 'ON', 'OFF'][0m
   [92msimulate_temperature_control([22, [23, 22.6, 22, 21.4]]) -> ['OFF', 'OFF', 'OFF', 'ON'] vs. ['OFF', 'OFF', 'OFF', 'ON'][0m
   [92msimulate_temperature_control([18.5, []]) -> [] vs. [][0m
   [92msimulate_temperature_control([25, [24, 24.4, 24.49, 24.51, 24.9, 25.6, 25.4, 25]]) -> ['ON', 'ON', 'ON', 'ON', 'ON', 'OFF', 'OFF', 'OFF'] vs. ['ON', 'ON', 'ON', 'ON', 'ON', 'OFF', 'OFF', 'OFF'][0m
[Problem device_tree_paths] Try 1/3




Cannot parse
|{"solution": "def device_tree_paths(child_mapping):\n    \"\"\"Return all root-to-leaf paths in the device tree described by a parent->children dict.\n\n    Each path is represented as a list of node IDs.  The collection of paths is\n    returned as a list of lists.  The order of the returned paths is\n    deterministic (lexicographically sorted).\n    \"\"\"\n    if not child_mapping:\n        return []\n\n    # All nodes that ever appear.\n    all_nodes = set(child_mapping.keys())\n    for children in child_mapping.values():\n        all_nodes.update(children)\n\n    # Nodes that appear as children.\n    non_roots = {child for children in child_mapping.values() for child in children}\n    # Roots are nodes that never appear as a child.\n    roots = sorted(all_nodes - non_roots)\n\n    paths = []\n\n    def dfs(node, cur_path):\n        children = child_mapping.get(node, [])\n        if not children:  # Leaf node\n            paths.append(cur_path + [node])\n            



Cannot parse
|{"solution": "def device_tree_paths(child_map):\n    \"\"\"Return all root-to-leaf paths in the given device tree.\n\n    child_map: dict where each key is a node id and its value is a list of child ids.\n    The function returns a list of lists, each inner list representing a path from a\n    root node to a leaf node (inclusive). A leaf is a node that either doesn't appear\n    as a key in child_map or is mapped to an empty list. Roots are nodes that appear\n    as keys but never as children of any other node.\n    \"\"\"\n    # Empty input yields no paths\n    if not child_map:\n        return []\n\n    # Determine roots (keys that are not listed as children anywhere)\n    all_children = set()\n    for children in child_map.values():\n        all_children.update(children)\n    roots = [node for node in child_map if node not in all_children]\n\n    paths = []\n\n    # Depth-first traversal to collect paths\n    def dfs(node, current_path):\n        # If node is a leaf (n



   [91mdevice_tree_paths([{'1': [2, 3], '2': [4], '3': [], '4': []}]) -> [['1', 2], ['1', 3], ['2', 4], ['3'], ['4']] vs. [[1, 2, 4], [1, 3]][0m
[Problem count_firmware_versions] Try 1/3




   [92mcount_firmware_versions([['1.0', '1.0', '2.0', '3.1', '3.1', '3.1']]) -> {'1.0': 2, '3.1': 3} vs. {'1.0': 2, '3.1': 3}[0m
   [92mcount_firmware_versions([['A', 'B', 'A', 'C', 'B', 'A']]) -> {'A': 3, 'B': 2} vs. {'A': 3, 'B': 2}[0m
   [92mcount_firmware_versions([[]]) -> {} vs. {}[0m
   [92mcount_firmware_versions([['v1']]) -> {} vs. {}[0m
   [92mcount_firmware_versions([['x', 'y', 'z', 'z']]) -> {'z': 2} vs. {'z': 2}[0m
[Problem min_patches_for_security] Try 1/3




   [92mmin_patches_for_security([2, [1, 2, 5, 7]]) -> [7, 5] vs. [7, 5][0m
   [92mmin_patches_for_security([3, [10, 20, 30]]) -> [30, 20, 10] vs. [30, 20, 10][0m
   [92mmin_patches_for_security([5, [4, 8]]) -> [8, 4] vs. [8, 4][0m
   [92mmin_patches_for_security([0, [1, 2, 3]]) -> [] vs. [][0m
   [92mmin_patches_for_security([-1, [1, 5]]) -> [] vs. [][0m
[Problem format_hex_dump] Try 1/3




Cannot parse
|{"solution": "def format_hex_dump(data: bytes) -> str:\n    \"\"\"Return a classic hex-dump representation of *data*.\n\n    Each output line contains:\n      1. The offset of the first byte in hexadecimal, zero-padded.\n      2. Two spaces.\n      3. Sixteen two-digit, space separated, upper-case hex values.\n         For the final line the missing byte positions are filled with\n         three spaces each so that the ASCII preview lines up.\n      4. Two spaces.\n      5. An ASCII preview where non-printable bytes are replaced by '.'.\n    \"\"\"\n    if not data:\n        return \"\"\n\n    # Determine how many hexadecimal digits are required for the largest offset\n    highest_offset = len(data) - 1\n    addr_width = max(4, len(f\"{highest_offset:X}\"))  # at least 4 characters (0000)\n\n    lines = []\n    for offset in range(0, len(data), 16):\n        chunk = data[offset:offset + 16]\n\n        # Address part\n        address = f\"{offset:0{addr_width}X}\"\n\n     



Cannot parse
|{"solution": "def format_hex_dump(data):\n    \"\"\"Return a formatted hex dump for the given bytes-like object.\n\n    Each line contains:\n      - a 4-digit hexadecimal address (offset)\n      - a colon followed by a space\n      - up to sixteen two-digit, uppercase hexadecimal byte values separated by spaces\n      - a space, a vertical bar, another space\n      - an ASCII preview in which non-printable characters are replaced by '.'\n    \"\"\"\n    # Accept bytes, bytearray or any iterable of ints (0-255)\n    if isinstance(data, (bytes, bytearray)):\n        b = data\n    else:\n        b = bytes(data)\n\n    if not b:\n        return \"\"\n\n    lines = []\n    for offset in range(0, len(b), 16):\n        chunk = b[offset:offset + 16]\n        addr = f\"{offset:04X}\"\n        hex_part = ' '.join(f\"{byte:02X}\" for byte in chunk)\n        ascii_part = ''.join(chr(byte) if 32 <= byte <= 126 else '.' for byte in chunk)\n        lines.append(f\"{addr}: {hex_part} | {



Cannot parse
|{"solution": "def format_hex_dump(data):\n    \"\"\"Return a hex dump string for the given bytes/bytearray object.\n    Each line contains up to 16 bytes, starts with the 4-digit offset in\n    hexadecimal (uppercase) followed by a colon and a space, then the\n    hexadecimal byte values (uppercase, separated by spaces), two spaces,\n    and finally an ASCII preview where non-printable characters are shown\n    as a dot ('.').\n    \"\"\"\n    # Accept both bytes and bytearray. Any other iterable of ints is\n    # converted to bytes so long as it supports the bytes constructor.\n    if not data:\n        return \"\"\n\n    if not isinstance(data, (bytes, bytearray)):\n        data = bytes(data)\n\n    lines = []\n    for offset in range(0, len(data), 16):\n        chunk = data[offset:offset + 16]\n        # Hexadecimal representation for the 16-byte chunk\n        hex_bytes = ' '.join(f\"{b:02X}\" for b in chunk)\n        # ASCII preview: printable bytes (32-126) keep the



   [92mvalidate_password_rules(['StrongPass123!', ['password', '123456']]) -> True vs. True[0m
   [92mvalidate_password_rules(['Short1!', []]) -> False vs. False[0m
   [92mvalidate_password_rules(['StrongPass123', []]) -> False vs. False[0m
   [92mvalidate_password_rules(['MyPasswordIsStrong123!', ['password']]) -> False vs. False[0m
   [92mvalidate_password_rules(['NoDigitPass!', []]) -> False vs. False[0m
   [92mvalidate_password_rules(['ValidOne#2023', []]) -> True vs. True[0m
[Problem failed_login_rate] Try 1/3




Cannot parse
|{"solution": "def failed_login_rate(attempts, reference_time):\n    \"\"\"Compute the percentage of failed login attempts that occurred within the\n    last hour (i.e., the 3600-second window ending at `reference_time`).\n\n    Parameters\n    ----------\n    attempts : list of tuples (timestamp: int/float, success: bool)\n        Each tuple represents a login attempt where `timestamp` is the epoch\n        time in seconds and `success` is True when the attempt succeeded and\n        False otherwise.\n    reference_time : int/float\n        The moment (epoch seconds) that marks the end of the one-hour window.\n\n    Returns\n    -------\n    float\n        The failure rate as a percentage (0 to 100). If no attempts were made\n        in the window, 0.0 is returned.\n    \"\"\"\n    window_start = reference_time - 3600\n    # Filter attempts that fall inside the window (exclusive of start, inclusive of end)\n    recent = [success for ts, success in attempts if window_start



Cannot parse
|{"solution": "def failed_login_rate(attempts, reference_time):\n    \"\"\"Calculate the percentage of failed login attempts within the\n    60-minute window that ends at (and includes) *reference_time*.\n\n    Parameters\n    ----------\n    attempts : list of tuple\n        Each tuple is (timestamp:int|float, success:bool). ``success`` is\n        True if the login attempt succeeded, False otherwise.\n    reference_time : int | float\n        The point in time (seconds since epoch, or any monotonic scale)\n        that serves as the right-hand boundary of the 1-hour window.\n\n    Returns\n    -------\n    float\n        Failure rate expressed as a percentage (0–100). If there are\n        no attempts in the window, 0.0 is returned.\n    \"\"\"\n\n    # Define left boundary (exclusive) and right boundary (inclusive)\n    window_start = reference_time - 3600\n\n    total = 0\n    failures = 0\n\n    for ts, success in attempts:\n        if window_start < ts <= reference_t



Cannot parse
|{"solution": "def failed_login_rate(attempts, reference_time):\n    \"\"\"Compute the percentage of failed login attempts that occurred during the\n    last hour (3600 seconds) leading up to and including `reference_time`.\n\n    Parameters\n    ----------\n    attempts : list[tuple[int, bool]]\n        Each tuple contains (timestamp, success_flag).\n        `timestamp` is an integer/float representing seconds (e.g., Unix time).\n        `success_flag` is True for a successful login, False for a failure.\n    reference_time : int | float\n        The point in time (same unit as timestamps) used as the window's upper\n        bound.\n\n    Returns\n    -------\n    float\n        The failure rate as a percentage, rounded to two decimal places.\n        If there are no attempts in the window, returns 0.0.\n    \"\"\"\n    # Define the start of the 1-hour window (3600 seconds)\n    window_start = reference_time - 3600\n\n    # Filter attempts that fall within (window_start, 



   [92mfilter_ip_range([['192.168.1.1', '10.0.0.5', '172.16.0.1'], ['192.168.0.0/16']]) -> ['10.0.0.5', '172.16.0.1'] vs. ['10.0.0.5', '172.16.0.1'][0m
   [92mfilter_ip_range([['8.8.8.8', '1.1.1.1'], []]) -> ['8.8.8.8', '1.1.1.1'] vs. ['8.8.8.8', '1.1.1.1'][0m
   [92mfilter_ip_range([[], ['10.0.0.0/8']]) -> [] vs. [][0m
   [92mfilter_ip_range([['10.0.1.1', '192.0.2.1'], ['10.0.0.0/8', '10.0.0.0/16']]) -> ['192.0.2.1'] vs. ['192.0.2.1'][0m
   [92mfilter_ip_range([['203.0.113.5', '203.0.113.6'], ['203.0.113.5/32']]) -> ['203.0.113.6'] vs. ['203.0.113.6'][0m
[Problem sort_bus_routes] Try 1/3




Cannot parse
|{"solution": "def sort_bus_routes(routes):\n    \"\"\"Sort bus route tuples primarily by ascending distance and secondarily\n    by the numeric part of the line identifier.\n\n    Each route is a tuple in the form (line, start, end, distance).\n    \"\"\"\n    def numeric_part(line):\n        # Extract leading digits; default to 0 if none found\n        num = 0\n        i = 0\n        while i < len(line) and line[i].isdigit():\n            num = num * 10 + int(line[i])\n            i += 1\n        return num\n\n    return sorted(routes, key=lambda r: (r[3], numeric_part(r[0])))",
"tests": [{"type": "basic", "inputs": [[('12A', 'X', 'Y', 10), ('03', 'A', 'B', 5), ('45', 'C', 'D', 20)]], "output": [('03', 'A', 'B', 5), ('12A', 'X', 'Y', 10), ('45', 'C', 'D', 20)]}, {"type": "basic", "inputs": [[('20', 'a', 'b', 15), ('3C', 'c', 'd', 15), ('5', 'e', 'f', 12)]], "output": [('5', 'e', 'f', 12), ('3C', 'c', 'd', 15), ('20', 'a', 'b', 15)]}, {"type": "edge_case", "inputs": [[('1



Cannot parse
|{"solution": "import re\ndef sort_bus_routes(routes):\n    \"\"\"Sort bus route tuples (line, start, end, distance)\n    primarily by ascending distance, secondarily by the numeric part\n    of the line string (as an integer), and finally by the full line\n    string to ensure deterministic ordering.\n    \"\"\"\n    def numeric_part(line):\n        # Extract the first sequence of digits from the line identifier\n        match = re.search(r\"\\d+\", str(line))\n        return int(match.group()) if match else 0\n\n    return sorted(routes, key=lambda r: (r[3], numeric_part(r[0]), r[0]))", "tests": [{"type": "basic", "inputs": [[("B12", "Station A", "Station B", 15), ("X9", "Station C", "Station D", 10), ("A1", "Start", "End", 10), ("Z100", "Q", "R", 20)]], "output": [("A1", "Start", "End", 10), ("X9", "Station C", "Station D", 10), ("B12", "Station A", "Station B", 15), ("Z100", "Q", "R", 20)]}, {"type": "basic", "inputs": [[("Line 20", "S", "E", 5), ("Route 3", "S", "E", 



Cannot parse
|{"solution": "import re\n\ndef sort_bus_routes(routes):\n    \"\"\"Sort a list of bus route tuples by ascending distance and then by the\n    numeric part of the line identifier.\n\n    Each route is a tuple (line, start, end, distance).\n    \"\"\"\n\n    def numeric_part(line):\n        # Find all consecutive digits, join them and convert to int.\n        nums = re.findall(r\"\\d+\", str(line))\n        return int(nums[0]) if nums else 0\n\n    return sorted(routes, key=lambda r: (r[3], numeric_part(r[0])))",
"tests": [
    {"type": "basic", "inputs": [[("Line10","A","B",15), ("Line2","C","D",10), ("Line5","E","F",10)]], "output": [("Line2","C","D",10), ("Line5","E","F",10), ("Line10","A","B",15)]},
    {"type": "basic", "inputs": [[("1","A","B",5), ("2","C","D",3)]], "output": [("2","C","D",3), ("1","A","B",5)]},
    {"type": "edge_case", "inputs": [[("Express","A","B",10), ("Line1","C","D",10)]], "output": [("Express","A","B",10), ("Line1","C","D",10)]},
    {"type": 



   [92mnormalize_license_plate(['abc 123']) -> ABC123 vs. ABC123[0m
   [92mnormalize_license_plate(['ab-c- 123']) -> ABC123 vs. ABC123[0m
   [92mnormalize_license_plate(['ABC123']) -> ABC123 vs. ABC123[0m
   [92mnormalize_license_plate(['  abc-123  ']) -> ABC123 vs. ABC123[0m
   [92mnormalize_license_plate(['ab@12-c']) -> AB@12C vs. AB@12C[0m
[Problem fuel_consumption_estimate] Try 1/3




Cannot parse
|{"solution": "import math\n\ndef fuel_consumption_estimate(trip_segments):\n    \"\"\"Calculate the total fuel (in liters) needed for the given trip segments.\n\n    Parameters\n    ----------\n    trip_segments : list[tuple[float, float]]\n        A list where each tuple contains two numbers:\n        (distance_in_km, efficiency_in_km_per_liter).\n\n    Returns\n    -------\n    float\n        The total liters required, after adding a 5% safety margin and\n        rounding up to the nearest 0.1 liter.\n    \"\"\"\n    if not trip_segments:\n        return 0.0\n\n    total_liters = 0.0\n    for distance, efficiency in trip_segments:\n        # Guard against zero or negative efficiency values\n        if efficiency <= 0:\n            raise ValueError(\"Efficiency must be greater than zero.\")\n        total_liters += distance / efficiency\n\n    # Apply 5% safety margin\n    total_liters *= 1.05\n\n    # Round up to one decimal place\n    total_liters = math.ceil(total_lit



   [92mfuel_consumption_estimate([[100, 150], 10]) -> 26.3 vs. 26.3[0m
   [92mfuel_consumption_estimate([[50], 8]) -> 6.6 vs. 6.6[0m
   [92mfuel_consumption_estimate([[], 12]) -> 0.0 vs. 0.0[0m
   [92mfuel_consumption_estimate([[105], 10]) -> 11.1 vs. 11.1[0m
   [91mfuel_consumption_estimate([[33.3, 66.7], 15]) -> 7.1 vs. 7.0[0m
[Problem fuel_consumption_estimate] Try 3/3




   [92mfuel_consumption_estimate([[100], 10]) -> 10.5 vs. 10.5[0m
   [92mfuel_consumption_estimate([[50, 75], 12]) -> 11.0 vs. 11.0[0m
   [92mfuel_consumption_estimate([[], 15]) -> 0.0 vs. 0.0[0m
   [92mfuel_consumption_estimate([[33.3], 13.3]) -> 2.7 vs. 2.7[0m
   [92mfuel_consumption_estimate([[0, 0], 5]) -> 0.0 vs. 0.0[0m
[Problem intersection_area_two_circles] Try 1/3




Cannot parse
|{"solution": "import math\n\ndef intersection_area_two_circles(c1, c2):\n    \"\"\"Return the area of intersection of two circles rounded to 4 decimals.\n    Each circle is given as (x, y, r).\"\"\"\n    x0, y0, r0 = c1\n    x1, y1, r1 = c2\n\n    # Euclidean distance between centres\n    d = math.hypot(x1 - x0, y1 - y0)\n\n    # One circle is completely inside the other\n    if d <= abs(r0 - r1):\n        area = math.pi * min(r0, r1) ** 2\n    # Circles do not overlap (should not happen per statement, but handled)\n    elif d >= r0 + r1:\n        area = 0.0\n    # Partial overlap\n    else:\n        # angle for circle 0\n        alpha = 2 * math.acos((d**2 + r0**2 - r1**2) / (2 * d * r0))\n        # angle for circle 1\n        beta = 2 * math.acos((d**2 + r1**2 - r0**2) / (2 * d * r1))\n\n        area = 0.5 * (\n            r0**2 * (alpha - math.sin(alpha)) +\n            r1**2 * (beta - math.sin(beta))\n        )\n\n    return round(area, 4)\n", "tests": [{"type": "basi



   [92mintersection_area_two_circles([0, 0, 2, 3, 0, 2]) -> 1.8132 vs. 1.8132[0m
   [92mintersection_area_two_circles([0, 0, 3, 5, 0, 4]) -> 6.6417 vs. 6.6417[0m
   [92mintersection_area_two_circles([0, 0, 1, 1.5, 0, 1]) -> 0.4533 vs. 0.4533[0m
   [92mintersection_area_two_circles([0, 0, 5, 1, 1, 2]) -> 12.5664 vs. 12.5664[0m
   [92mintersection_area_two_circles([0, 0, 1, 0, 0, 1]) -> 3.1416 vs. 3.1416[0m
[Problem merge_route_stops] Try 1/3




   [92mmerge_route_stops([['Alpha', 'Bravo', 'Charlie'], ['Bravo', 'Delta']]) -> ['Alpha', 'Bravo', 'Charlie', 'Delta'] vs. ['Alpha', 'Bravo', 'Charlie', 'Delta'][0m
   [92mmerge_route_stops([['Central', 'Main'], ['central', 'Park', 'main']]) -> ['Central', 'Main', 'Park'] vs. ['Central', 'Main', 'Park'][0m
   [92mmerge_route_stops([[], ['A', 'B']]) -> ['A', 'B'] vs. ['A', 'B'][0m
   [92mmerge_route_stops([[], []]) -> [] vs. [][0m
   [92mmerge_route_stops([['Avenue', 'Street'], ['Boulevard', 'street', 'Terrace']]) -> ['Avenue', 'Boulevard', 'Street', 'Terrace'] vs. ['Avenue', 'Boulevard', 'Street', 'Terrace'][0m
[Problem crop_rotation_patterns] Try 1/3




   [92mcrop_rotation_patterns([1, 3]) -> 3 vs. 3[0m
   [92mcrop_rotation_patterns([2, 2]) -> 2 vs. 2[0m
   [92mcrop_rotation_patterns([4, 3]) -> 24 vs. 24[0m
   [92mcrop_rotation_patterns([0, 5]) -> 1 vs. 1[0m
   [92mcrop_rotation_patterns([3, 1]) -> 0 vs. 0[0m
[Problem next_harvest_moon] Try 1/3




   [92mnext_harvest_moon([2019]) -> 2019-09-14 vs. 2019-09-14[0m
   [92mnext_harvest_moon([2021]) -> 2021-09-20 vs. 2021-09-20[0m
   [92mnext_harvest_moon([2023]) -> 2023-09-29 vs. 2023-09-29[0m
   [92mnext_harvest_moon([2025]) -> 2025-10-06 vs. 2025-10-06[0m
   [92mnext_harvest_moon([2018]) -> 2018-09-24 vs. 2018-09-24[0m
[Problem parse_soil_report] Try 1/3




   [91mparse_soil_report(['pH: 6.8\\nNitrogen: 45 ppm\\nPhosphorus: 20 mg/kg\\nPotassium: 150']) -> {'pH': 6.8, 'nitrogen': None, 'phosphorus': None, 'potassium': None} vs. {'pH': 6.8, 'nitrogen': 45, 'phosphorus': 20, 'potassium': 150}[0m
[Problem parse_soil_report] Try 2/3




Cannot parse
|{"solution": "import re\n\ndef parse_soil_report(report_text):\n    \"\"\"Extract numeric values for pH, nitrogen, phosphorus and potassium from a\n    multiline soil report. If a value is not present, None is returned for that\n    key.\n\n    Parameters\n    ----------\n    report_text : str\n        Raw, multiline string coming from a soil analysis report.\n\n    Returns\n    -------\n    dict\n        Dictionary with keys 'pH', 'nitrogen', 'phosphorus', 'potassium'.\n    \"\"\"\n\n    # Build one regular-expression for each sought metric.  Each RE tries to be\n    # tolerant to common formats such as:\n    #   pH: 6.4\n    #   Nitrogen (N): 42 mg/kg\n    #   P (phosphorus) ..... 15.1\n    #   Potassium- 88ppm\n    patterns = {\n        'pH': r\"\\bpH\\b[^0-9\\-+]*([-+]?\\d+(?:\\.\\d+)?)\",\n        'nitrogen': r\"\\b(?:Nitrogen|N)\\b[^0-9\\-+]*([-+]?\\d+(?:\\.\\d+)?)\",\n        'phosphorus': r\"\\b(?:Phosphorus|P)\\b[^0-9\\-+]*([-+]?\\d+(?:\\.\\d+)?)\",\n        'pot



Cannot parse
|{"solution": "import re\n\ndef parse_soil_report(report: str):\n    \"\"\"Extract pH, nitrogen, phosphorus and potassium numeric values from a soil report.\n\n    Parameters\n    ----------\n    report : str\n        Multiline text containing the soil analysis.\n\n    Returns\n    -------\n    dict\n        Dictionary with keys 'pH', 'nitrogen', 'phosphorus', 'potassium'. If a value\n        is not found it is set to None. The numbers are returned as float if they\n        contain a decimal point, otherwise as int.\n    \"\"\"\n    # Prepare output dictionary with default None\n    result = {\n        'pH': None,\n        'nitrogen': None,\n        'phosphorus': None,\n        'potassium': None\n    }\n\n    # Mapping of nutrient names to regex patterns for extraction\n    patterns = {\n        'pH': r\"pH\\s*[:=]?\\s*([0-9]+(?:\\.[0-9]+)?)\",\n        'nitrogen': r\"nitrogen\\b[^0-9\\n\\r]*([0-9]+(?:\\.[0-9]+)?)\",\n        'phosphorus': r\"phosphorus\\b[^0-9\\n\\r]*([0-



   [92mlocate_weeds_cluster([[[1, 0, 0], [1, 1, 0], [0, 0, 1]]]) -> 3 vs. 3[0m
   [92mlocate_weeds_cluster([[[0, 0], [0, 0]]]) -> 0 vs. 0[0m
   [92mlocate_weeds_cluster([[[1]]]) -> 1 vs. 1[0m
   [92mlocate_weeds_cluster([[]]) -> 0 vs. 0[0m
   [92mlocate_weeds_cluster([[[1, 1], [1, 1]]]) -> 4 vs. 4[0m
[Problem seed_lot_identifier] Try 1/3




   [92mseed_lot_identifier(['', '', '']) -> e3b0c442-98fc-4c14-9afb-f4c8996fb924 vs. e3b0c442-98fc-4c14-9afb-f4c8996fb924[0m
   [92mseed_lot_identifier(['', '', 'a']) -> ca978112-ca1b-4dca-bac2-31b39a23dc4d vs. ca978112-ca1b-4dca-bac2-31b39a23dc4d[0m
   [92mseed_lot_identifier(['hello', '', '']) -> 2cf24dba-5fb0-430e-a6e8-3b2ac5b9e29e vs. 2cf24dba-5fb0-430e-a6e8-3b2ac5b9e29e[0m
   [92mseed_lot_identifier(['a', 'b', 'c']) -> ba7816bf-8f01-4fea-8141-40de5dae2223 vs. ba7816bf-8f01-4fea-8141-40de5dae2223[0m
[Problem shared_power_sources] Try 1/3




Cannot parse
|{"solution": "def shared_power_sources(plants):\n    \"\"\"Return a set of power source names that appear in every plant's supply list.\n\n    Args:\n        plants (list[list[str]]): List containing each plant's list of power source names.\n\n    Returns:\n        set[str]: Set of names present in every plant's list. If the input list is empty,\n                   or there is no common power source, an empty set is returned.\n    \"\"\"\n    # If there are no plants, nothing can be shared\n    if not plants:\n        return set()\n\n    # Start with the unique sources of the first plant\n    common_sources = set(plants[0])\n\n    # Intersect with the unique sources of every subsequent plant\n    for supply_list in plants[1:]:\n        common_sources &= set(supply_list)\n        # Early exit if intersection becomes empty\n        if not common_sources:\n            return set()\n\n    return common_sources\n", "tests": [{"type": "basic", "inputs": [[["Coal", "Gas", "Nucle



Cannot parse
|{"solution":"def shared_power_sources(plants):\n    \"\"\"Return the set of power source names that appear in every plant's supply list.\n\n    Parameters\n    ----------\n    plants : list of iterables (e.g., list of lists/sets/tuples)\n        Each inner iterable contains the power source names (strings) for a single plant.\n\n    Returns\n    -------\n    set\n        A set containing the power source names that are present in every plant's list.\n        If `plants` is empty, or if no single power source is common to all plants,\n        an empty set is returned.\n    \"\"\"\n    # If there are no plants, there can be no shared power sources.\n    if not plants:\n        return set()\n\n    # Start intersection with the first plant's sources converted to a set.\n    shared = set(plants[0])\n\n    # Intersect with the sources of every other plant.\n    for sources in plants[1:]:\n        shared &= set(sources)\n        # Early exit if intersection becomes empty.\n     



   [92mshared_power_sources([[['Coal', 'Solar', 'Wind'], ['Gas', 'Solar', 'Wind'], ['Solar', 'Wind', 'Hydro']]]) -> ['Solar', 'Wind'] vs. ['Solar', 'Wind'][0m
   [92mshared_power_sources([[['Solar', 'Hydro']]]) -> ['Hydro', 'Solar'] vs. ['Hydro', 'Solar'][0m
   [92mshared_power_sources([[]]) -> [] vs. [][0m
   [92mshared_power_sources([[['Solar'], []]]) -> [] vs. [][0m
   [92mshared_power_sources([[['Solar', 'Solar', 'Wind'], ['Wind', 'Solar', 'Wind'], ['Solar', 'Wind']]]) -> ['Solar', 'Wind'] vs. ['Solar', 'Wind'][0m
[Problem energy_meter_flag_set] Try 1/3




   [92menergy_meter_flag_set([0]) -> 32 vs. 32[0m
   [92menergy_meter_flag_set([4]) -> 32 vs. 32[0m
   [92menergy_meter_flag_set([36]) -> 32 vs. 32[0m
   [92menergy_meter_flag_set([255]) -> 251 vs. 251[0m
   [92menergy_meter_flag_set([-1]) -> -5 vs. -5[0m
[Problem battery_drain_sim] Try 1/3




Cannot parse
|{"solution": "def battery_drain_sim(rate, boosts, start=100):\n    \"\"\"\n    Simulate battery percentage decay.\n    Arguments:\n        rate (int | float): percentage points drained each hour (e.g. 20 => -20 per hour).\n        boosts (list[tuple[int, int | float]]): list of (hour, +value) boosts. The boost is\n            applied at the beginning of the specified hour, before the drain of that hour.\n        start (int | float, optional): starting battery percentage. Defaults to 100.\n\n    Returns:\n        list[float]: battery percentage at the end of every hour, starting with the initial\n                      level and ending with 0 (battery empty).\n    \"\"\"\n    # Map boosts by hour for O(1) access. If multiple boosts occur at the same hour, sum them.\n    boost_map = {}\n    for hr, val in boosts:\n        boost_map[hr] = boost_map.get(hr, 0) + val\n\n    battery = float(start)\n    history = [battery]\n    hour = 1  # simulation starts counting hours from 1 



Cannot parse
|{"solution": "def battery_drain_sim(initial_percentage, rate, boosts):\n    \"\"\"\n    Simulate the battery percentage decay.\n\n    Parameters\n    ----------\n    initial_percentage : int or float\n        Starting battery percentage (0-100).\n    rate : int or float\n        Fixed percentage–points drained every hour.\n    boosts : list[(int, int/float)]\n        Each tuple represents (hour, positive_value) that is added to the\n        battery *after* the drain of that hour.  Battery level is capped\n        at 100.\n\n    Returns\n    -------\n    list[float]\n        Battery level after each hour, including the hour in which it\n        reaches 0.\n    \"\"\"\n\n    if initial_percentage <= 0 or rate <= 0 and not boosts:\n        # No depletion possible or already empty – nothing to simulate.\n        return []\n\n    # Aggregate boosts that occur at the same hour.\n    boost_map = {}\n    for hr, val in boosts:\n        boost_map[hr] = boost_map.get(hr, 0) + val\n



Cannot parse
|{"solution": "import math\n\ndef battery_drain_sim(initial_battery: int, rate: float, boosts):\n    \"\"\"\n    Simulate the battery percentage decay.\n\n    Parameters\n    ----------\n    initial_battery : int\n        Initial battery percentage (0-100).\n    rate : float\n        Percentage drained each hour (e.g. 20 means 20%).\n    boosts : list[tuple[int, int]]\n        List of tuples (hour, value) – a boost of `value` percent applied *after* the decay\n        of the given hour.  Hour counting starts at 1.\n\n    Returns\n    -------\n    list[int]\n        Battery level **after** each hour (post-decay, post-boost) until the battery is\n        completely empty (0).  If the battery starts at 0 the returned list is empty.\n    \"\"\"\n\n    if initial_battery <= 0:\n        return []\n\n    boosts_dict = {h: v for h, v in boosts}\n    battery = int(initial_battery)\n    hour = 0\n    history = []\n\n    while battery > 0:\n        hour += 1\n        # drain\n       



[91mError executing code: name 'power_grid_split' is not defined[0m
[Problem power_grid_split] Try 2/3




   [92mpower_grid_split([13]) -> [8, 4, 1] vs. [8, 4, 1][0m
   [92mpower_grid_split([1]) -> [1] vs. [1][0m
   [92mpower_grid_split([0]) -> [] vs. [][0m
   [92mpower_grid_split([32]) -> [32] vs. [32][0m
   [92mpower_grid_split([19]) -> [16, 2, 1] vs. [16, 2, 1][0m
[Problem count_peak_hours] Try 1/3




   [92mcount_peak_hours([[1, 3, 2, 4, 1], 2]) -> 2 vs. 2[0m
   [92mcount_peak_hours([[0, 5, 0, 6, 1, 7, 0], 4]) -> 3 vs. 3[0m
   [92mcount_peak_hours([[5, 4, 3, 2, 1], 2]) -> 0 vs. 0[0m
   [92mcount_peak_hours([[10, 9, 10, 9, 10], 10]) -> 0 vs. 0[0m
   [92mcount_peak_hours([[5, 6], 4]) -> 0 vs. 0[0m
[Problem schedule_show_times] Try 1/3




   [92mschedule_show_times([[30, 20, 10], 40]) -> [10, 20] vs. [10, 20][0m
   [92mschedule_show_times([[5, 10, 3], 8]) -> [3, 5] vs. [3, 5][0m
   [92mschedule_show_times([[15, 15, 15], 30]) -> [15, 15] vs. [15, 15][0m
   [92mschedule_show_times([[20, 30], 10]) -> [] vs. [][0m
   [92mschedule_show_times([[], 100]) -> [] vs. [][0m
[Problem format_cast_list] Try 1/3




   [92mformat_cast_list([['Brad Pitt']]) -> Brad Pitt vs. Brad Pitt[0m
   [92mformat_cast_list([['Brad Pitt', 'Leonardo DiCaprio']]) -> Brad Pitt & Leonardo DiCaprio vs. Brad Pitt & Leonardo DiCaprio[0m
   [92mformat_cast_list([['Brad Pitt', 'Leonardo DiCaprio', 'Margot Robbie']]) -> Brad Pitt, Leonardo DiCaprio, & Margot Robbie vs. Brad Pitt, Leonardo DiCaprio, & Margot Robbie[0m
   [92mformat_cast_list([[]]) ->  vs. [0m
   [92mformat_cast_list([['A', 'B', 'C', 'D']]) -> A, B, C, & D vs. A, B, C, & D[0m
[Problem validate_rating_string] Try 1/3




   [92mvalidate_rating_string(['G']) -> True vs. True[0m
   [92mvalidate_rating_string(['PG-13']) -> True vs. True[0m
   [92mvalidate_rating_string(['pg']) -> False vs. False[0m
   [92mvalidate_rating_string(['PG13']) -> False vs. False[0m
   [92mvalidate_rating_string(['NC-17']) -> True vs. True[0m
   [92mvalidate_rating_string(['NC -17']) -> False vs. False[0m
[Problem box_office_percent_change] Try 1/3




Cannot parse
|{"solution": "def box_office_percent_change(earnings):\n    \"\"\"Return list of week-over-week percentage changes for given gross earnings.\n    The first element is always None since there is no previous week to compare.\n    Changes are rounded to one decimal place. If the previous week's earnings are\n    zero, the change is recorded as None to avoid division by zero.\n    \"\"\"\n    # Validate input is iterable; if not, let Python raise naturally when iterated.\n    n = len(earnings)\n    if n == 0:\n        return []\n\n    changes = [None]  # First week has no previous week to compare.\n\n    for prev, curr in zip(earnings, earnings[1:]):\n        if prev == 0:\n            changes.append(None)  # Undefined percentage change.\n        else:\n            pct_change = (curr - prev) / prev * 100\n            changes.append(round(pct_change, 1))\n\n    return changes", "tests": [{"type": "basic", "inputs": [[100, 110, 121]], "output": [None, 10.0, 10.0]}, {"type": "ba



Cannot parse
|{"solution": "def box_office_percent_change(grosses):\n    \"\"\"Given a list of weekly gross earnings, return a list of the same length\n    where each element (except the first) is the week-over-week percentage\n    change, rounded to one decimal place. The first element is always None.\n    If the previous week's gross is 0, the change is undefined and set to None.\n    \"\"\"\n    # Handle empty input quickly\n    if not grosses:\n        return []\n\n    changes = [None]  # First week always has no prior comparison\n\n    for prev, curr in zip(grosses, grosses[1:]):\n        if prev == 0:\n            changes.append(None)\n        else:\n            percent_change = (curr - prev) / prev * 100.0\n            changes.append(round(percent_change, 1))\n    return changes\n", "tests": [{"type": "basic", "inputs": [[100, 110, 105]], "output": [None, 10.0, -4.5]}, {"type": "basic", "inputs": [[200, 200, 250]], "output": [None, 0.0, 25.0]}, {"type": "edge_case", "inputs": [[



Cannot parse
|{"solution": "def box_office_percent_change(grosses):\n    \"\"\"Return a list of week-over-week percent changes of box-office grosses.\n    The first element is always None because there is no previous week.\n    Each subsequent element is rounded to one decimal place. If the previous\n    week's gross is 0 (cannot divide), the change is set to None.\n    \n    Parameters\n    ----------\n    grosses : list[float|int]\n        Weekly gross earnings.\n\n    Returns\n    -------\n    list[None|float]\n        Percent change list, first element None.\n    \"\"\"\n\n    if not isinstance(grosses, list):\n        raise TypeError(\"Input must be a list of numbers\")\n\n    changes = [None]\n    for prev, curr in zip(grosses[:-1], grosses[1:]):\n        if prev == 0:\n            changes.append(None)\n        else:\n            pct = (curr - prev) / prev * 100\n            changes.append(round(pct, 1))\n    return changes", "tests": [{"type": "basic", "inputs": [[100, 110, 121]

In [4]:
from collections import Counter
import json

with open("data/code_synthetic_problems_0.1.json", "r") as f:
    data = json.load(f)

print(Counter([problem.get("verified", -1) for problem in data]))

Counter({True: 81, -1: 14, False: 5})


## 3. Generate Sharded Instructions

In [None]:
from llms import generate_json
import tqdm, json

MODEL = "t-o3"

prompt_solution = """You are given a problem statement for a Python programming problem, a full solution to the problem, and a list of unit tests for the problem.
Your objective is to produce a equivalent problem statement called the "sharded_instruction" which consists of exactly 4 shards that reveal the same information in the original problem, but split across the shards.

For example:
Example Problem Statement:
Given an array of integers, sort the integers that are between 1 and 9 inclusive, reverse the resulting array, and then replace each digit by its corresponding name from "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine".

Example Output Sharded Instruction:
{"sharded_instruction": [
  {"shard_id": 1, "shard": "Turn digits into names (like 'One') in a list"},
  {"shard_id": 2, "shard": "Sort numbers if they're between 1 and 9"},
  {"shard_id": 3, "shard": "Then flip the list around"},
  {"shard_id": 4, "shard": "For instance, from [1, -1, 55] you'll get ['One']"}
]}

Careful:
- [Underspecified] None of the shards should reveal enough information to solve the entire problem.
- [Informal] The shards should mimic the style of a real user, and do not need to be in perfect english or have perfect grammar.
- [Clearly Defined Inputs / Outputs] The shards should clearly define the inputs and outputs of the problem.
- [Initial Intent] Shard 1 should contain the overall intent for the problem. Follow-up shards can add to the specification.
- [Short Shards] Shards should all be short, like the ones in the example (up to 10-15 words).
- [Avoid using fancy notation] The shards must not contain any of the following characters colons, semi-colons, brackets, parentheses, etc. Just use plain english.

Now generate the sharded instruction for the following instruction, following the JSON schema shown in the example.

Instruction:
[[INSTRUCTION]]

Full Solution:
[[SOLUTION]]

Unit Tests:
[[TESTS]]"""


with open("data/code_synthetic_problems_0.1.json", "r") as f:
    data = json.load(f)

for problem in tqdm.tqdm_notebook(data):
    if not problem.get("verified", False) or "sharded_instruction" in problem:
        continue

    # Generate the sharded instruction
    try:
        problem["shards"] = generate_json([{"role": "user", "content": prompt_solution}], model=MODEL, variables={"INSTRUCTION": problem["description"], "SOLUTION": problem["reference_solution"], "TESTS": json.dumps(problem["reference_tests"])})["sharded_instruction"]
    except Exception as e:
        print(e)
        continue

    with open("data/code_synthetic_problems_0.1.json", "w") as f:
        json.dump(data, f)


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for problem in tqdm.tqdm_notebook(data):


  0%|          | 0/100 [00:00<?, ?it/s]



In [17]:
with open("data/code_synthetic_problems_0.1.json", "r") as f:
    data = json.load(f)

verified_samples = [d for d in data if d.get("verified", False)]
for d in verified_samples:
    if "sharded_instruction" in d:
        d["shards"] = d["sharded_instruction"]
        del d["sharded_instruction"]
    d["task"] = "code"
    d["task_id"] = f"sharded-synthetic-code-{d['problem_id']}"

with open("data/code_synthetic_problems_0.1_verified.json", "w") as f:
    json.dump(verified_samples, f, indent=4)

In [6]:
from collections import Counter
from IPython.display import display
import json, pandas as pd

pd.set_option('display.max_rows', 100)

with open("data/code_synthetic_problems_0.1_verified.json", "r") as f:
    data = json.load(f)

single_keys = ["full-avg", "concat-avg", "shuffle-concat-avg"]
multi_keys = ["sharded-avg"]
rows = []
for d in data:
    row = {"task_id": d["task_id"]}
    row.update({k: d["verifications"][k] for k in single_keys})
    # if they're all above 0.7
    acceptable = all(row[k] > 0.7 for k in single_keys)
    # green checkmark or red x
    row["acceptable"] = "✅" if acceptable else "❌"
    d["acceptable"] = 1 if acceptable else 0

    row["sharded-avg"] = d["verifications"]["sharded-avg"]

    rows.append(row)

df = pd.DataFrame(rows)
print(Counter(df["acceptable"]))
display(df)

final_data = [d for d in data if d["acceptable"] == 1]
print(f"Finally we have {len(final_data)} problems")

with open("data/code_synthetic_problems_0.1_final.json", "w") as f:
    json.dump(final_data, f, indent=4)

Counter({'✅': 43, '❌': 38})


Unnamed: 0,task_id,full-avg,concat-avg,shuffle-concat-avg,acceptable,sharded-avg
0,sharded-synthetic-code-26,0.0,0.5,1.0,❌,0.125
1,sharded-synthetic-code-86,0.125,1.0,1.0,❌,0.5
2,sharded-synthetic-code-82,1.0,0.75,0.75,✅,0.375
3,sharded-synthetic-code-5,1.0,1.0,1.0,✅,0.875
4,sharded-synthetic-code-66,1.0,1.0,0.875,✅,0.375
5,sharded-synthetic-code-92,1.0,1.0,1.0,✅,0.25
6,sharded-synthetic-code-91,0.0,1.0,1.0,❌,1.0
7,sharded-synthetic-code-65,1.0,1.0,1.0,✅,0.25
8,sharded-synthetic-code-6,1.0,1.0,1.0,✅,1.0
9,sharded-synthetic-code-8,0.5,1.0,0.75,❌,1.0


Finally we have 43 problems


In [1]:
import json

with open("data/code_synthetic_problems_0.1_final.json", "r") as f:
    final_data = json.load(f)

def convert_test_case_format(sample):
    org_test_cases = sample['reference_tests']
    new_test_cases = []
    for test_case in org_test_cases:
        new_test_cases.append({
            "type": test_case['type'],  # keep the original stuff from the generator
            "input": "\n".join([json.dumps(i) for i in test_case['inputs']]),
            "output": json.dumps(test_case['output']),
            "testtype": "functional"
        })
    return json.dumps(new_test_cases)

def transform_data(data):
    for sample in data:
        sample['metadata'] = {"func_name": sample['name']}
        sample['public_test_cases'] = convert_test_case_format(sample)
    return data

final_data = transform_data(final_data)

with open("data/code_synthetic_problems_0.1_final_transformed.json", "w") as f:
    json.dump(final_data, f)