# AMC Preprocessing

## Load Numina and AIME Problems

In [2]:
# Extract AMC Data from numina
# Load AIME dataset in raw/train/aime.json and raw/test/aime.json
import json
from datasets import load_dataset

ds = load_dataset("AI-MO/NuminaMath-CoT")
# Filter for amc_aime problems
amc_aime = ds['train'].filter(lambda x: x['source'] == 'amc_aime')

  from .autonotebook import tqdm as notebook_tqdm
Generating train split: 100%|██████████| 859494/859494 [00:04<00:00, 205273.66 examples/s]
Generating test split: 100%|██████████| 100/100 [00:00<00:00, 29967.88 examples/s]
Filter: 100%|██████████| 859494/859494 [00:17<00:00, 49871.99 examples/s]


In [16]:
from rllm.data import load_dataset, TrainDataset, TestDataset

train_dataset = load_dataset(TrainDataset.AIME)
test_dataset = load_dataset(TestDataset.AIME)
print("Training dataset loaded, size:", len(train_dataset))
print("Test dataset loaded, size:", len(test_dataset))
aime_dataset = train_dataset + test_dataset

Training dataset loaded, size: 975
Test dataset loaded, size: 30
1005


## Filter AMC-only Problems from Numina AMC_AIME Category

In [17]:
from rllm.utils import RAG
rag_server = RAG(docs=[d['problem'] for d in aime_dataset])

In [19]:
# Filter for AMC only problems.
amc_dataset = []
for row in amc_aime:
    problem = row['problem']
    result_dict = rag_server.top_k(problem, k=1)[0]
    score = result_dict['score']
    if score > 0.9 or 'AIME' in problem or 'aime' in problem:
        print(score)
        print(problem)
        print("Found similar problem:", result_dict['text'])
    else:
        amc_dataset.append(row)

0.9696745872497559
Patio blocks that are hexagons $1$ unit on a side are used to outline a garden by placing the blocks edge to edge with $n$ on each side. The diagram indicates the path of blocks around the garden when $n=5$.
[AIME 2002 II Problem 4.gif](https://artofproblemsolving.com/wiki/index.php/File:AIME_2002_II_Problem_4.gif)
If $n=202$, then the area of the garden enclosed by the path, not including the path itself, is $m\left(\sqrt3/2\right)$ square units, where $m$ is a positive integer. Find the remainder when $m$ is divided by $1000$.
Found similar problem: Patio blocks that are hexagons $1$ unit on a side are used to outline a garden by placing the blocks edge to edge with $n$ on each side. The diagram indicates the path of blocks around the garden when $n=5$.
 If $n=202$, then the area of the garden enclosed by the path, not including the path itself, is $m\left(\sqrt3/2\right)$ square units, where $m$ is a positive integer. Find the remainder when $m$ is divided by $100

# Refine AMC Problems for No Multiple Choice (Direct Answer)

In [21]:
import ast
import re

from rllm.utils import call_gemini_llm
from rllm.system_prompts import REFINE_AMC_PROMPT, FETCH_MC_PROMPT

def parse_llm_output(llm_output: str) -> dict:
    try:
        # Remove code fences in case they appear
        llm_output = repr(llm_output)
        cleaned = re.sub(r'```(?:json)?', '', llm_output)
        cleaned = re.sub(r'```', '', cleaned)
        cleaned = cleaned.strip()
        # Parse as Python dictionary
        parsed_dict = ast.literal_eval(cleaned)
        parsed_dict = ast.literal_eval(parsed_dict)
        return parsed_dict
    except:
        print("FAIL PARSING")
        print(llm_output)
        return {
            'problem': None,
            'A': None,
            'B': None,
            'C': None,
            'D': None,
            'E': None
        }

def process_entry_no_mc(entry):
    output_dict = {}
    # 1) Get the problem text
    problem_text = entry['problem']
    solution_text = entry['solution']
    # 2) Call Gemini LLM
    output_str = call_gemini_llm(problem_text, system_prompt=REFINE_AMC_PROMPT)
    if not output_str:
        print("Gemini not happy.")
        return {}
    # 3) Parse the LLM output into a Python dict
    python_dict = parse_llm_output(output_str)
    python_dict = dict(python_dict)
    output_dict['problem'] = python_dict['problem']
    output_dict['solution'] = entry['solution']
    if python_dict.get('A', None) is None and python_dict.get('B', None):
        return {}
    answer = call_gemini_llm(f'Problem: {problem_text} \n Solution: {solution_text}', system_prompt=FETCH_MC_PROMPT)
    answer = answer.upper()
    if len(answer) > 1 or answer not in ['A', 'B', 'C', 'D', 'E']:
        answer = call_gemini_llm(f'Problem: {problem_text} \n Solution: {solution_text}', system_prompt=FETCH_MC_PROMPT)
        print("Retrying answer fetching:")
        print(answer)
        answer = answer.upper()
    if answer not in ['A', 'B', 'C', 'D', 'E'] or answer not in python_dict:
        print('Failed extracting answer')
        print(problem_text)
        return {}
    output_dict['answer'] = python_dict[answer]
    return output_dict

In [22]:
import concurrent.futures

# Use a ProcessPoolExecutor with up to 10 workers
with concurrent.futures.ProcessPoolExecutor(max_workers=64) as executor:
    # executor.map(...) applies process_entry to each item in `subset`.
    # It returns results in the same order as `subset`.
    results = list(executor.map(process_entry_no_mc, amc_dataset))

    # Now `results` is a list of parsed dictionaries, one for each entry
    final_list = []
    counter =0
    for entry, parsed_dict in zip(amc_dataset, results):
        if parsed_dict:
            if parsed_dict['problem'] is None:
                continue
            final_list.append(parsed_dict)
            counter +=1
            if counter%100==0:
                # Save final list as json
                with open("amc_processed.json", "w") as f:
                    json.dump(final_list, f, indent=2)
# Save final list as json
with open("amc_processed.json", "w") as f:
    json.dump(final_list, f, indent=2)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

FAIL PARSING
'```json\n{\n"problem": "Which of the following is equivalent to \\"If P is true, then Q is false.\\"?",\n"A": null,\n"B": null,\n"C": null,\n"D": null,\n"E": null\n}\n```'
FAIL PARSING
'```json\n{\n"problem": "Figure 1 is called a \\"stack map.\\" The numbers tell how many cubes are stacked in each position. Fig. 2 shows these cubes, and Fig. 3 shows the view of the stacked cubes as seen from the front. Which of the following is the front view for the stack map in Fig. 4?",\n"A": null,\n"B": null,\n"C": null,\n"D": null,\n"E": null\n}\n```'
FAIL PARSING
'```json\n{\n"problem": "A given convex pentagon $ABCDE$ has the property that the area of each of the five triangles $ABC$, $BCD$, $CDE$, $DEA$, and $EAB$ is unity. Show that all pentagons with the above property have the same area, and calculate that area. Show, furthermore, that there are infinitely many non-congruent pentagons having the above area property.",\n"A": null,\n"B": null,\n"C": null,\n"D": null,\n"E": null\

## Finally, manually review dataset (have undergrad do it ;) )