## MATH Data Preprocessing

First, download the MATH dataset (`EleutherAI/hendrycks_math`) from Hugging Face into a local directory.

In [1]:
!huggingface-cli download EleutherAI/hendrycks_math --repo-type dataset --local-dir ./math_dataset

I:\AI Models\Proj\math_dataset



Fetching 16 files:   0%|          | 0/16 [00:00<?, ?it/s]Downloading 'algebra/train-00000-of-00001.parquet' to 'math_dataset\.cache\huggingface\download\algebra\train-00000-of-00001.parquet.73347f5b55869f21ff094361025dbd12d291e971aa4d72142eacee10438e449f.incomplete'
Downloading 'geometry/test-00000-of-00001.parquet' to 'math_dataset\.cache\huggingface\download\geometry\test-00000-of-00001.parquet.98720f3a8134ad9784fd8951195245264495df03ffc5c9485a9a54ff5ae2431a.incomplete'
Downloading 'counting_and_probability/test-00000-of-00001.parquet' to 'math_dataset\.cache\huggingface\download\counting_and_probability\test-00000-of-00001.parquet.c324818949c75a6227c0328a8499d58023cbbf6757cfc1736f5ab24cc8d87966.incomplete'
Downloading 'counting_and_probability/train-00000-of-00001.parquet' to 'math_dataset\.cache\huggingface\download\counting_and_probability\train-00000-of-00001.parquet.d37ae017930df41678edee50dcde18c17fb06afaff9eb505c1b95f1002bfcb31.incomplete'
Downloading 'geometry/train-00000-of

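Next, we convert the data into the chat-style JSONL format used for supervised fine-tuning (SFT): one JSON object per line, each holding a system, user, and assistant message.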
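As a rough illustration of that target format, the sketch below builds a single output line. The sample problem, solution, and shortened system prompt are made-up placeholders, not rows from the dataset.

```python
import json

# Hypothetical example row; real rows come from the dataset's
# "problem" and "solution" fields.
example = {
    "problem": "What is $1 + 1$?",
    "solution": "We have $1 + 1 = \\boxed{2}$.",
}

record = {
    "messages": [
        {"role": "system", "content": "You are a math assistant."},  # shortened placeholder prompt
        {"role": "user", "content": example["problem"]},
        {"role": "assistant", "content": example["solution"]},
    ]
}

# One record per line; ensure_ascii=False keeps non-ASCII characters readable.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Because `json.dumps` escapes newlines inside strings, each record stays on a single physical line, which is what a JSONL reader expects.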

In [13]:
import json
from pathlib import Path

from datasets import load_dataset, get_dataset_config_names, concatenate_datasets
import tiktoken
from tqdm import tqdm

In [14]:
configs = get_dataset_config_names("EleutherAI/hendrycks_math")
def load_all(split: str):
    parts = []
    for cfg in sorted(configs):
        ds = load_dataset("EleutherAI/hendrycks_math", cfg, split=split)
        print(f"{split.upper()} – {cfg}: {len(ds)} examples")
        parts.append(ds)
    full = concatenate_datasets(parts)
    print(f"{split.upper()} total: {len(full)} examples")
    return full

train_ds = load_all("train")
test_ds  = load_all("test")

TRAIN – algebra: 1744 examples


Generating train split: 100%|██████████| 771/771 [00:00<00:00, 75660.57 examples/s]
Generating test split: 100%|██████████| 474/474 [00:00<00:00, 48802.10 examples/s]


TRAIN – counting_and_probability: 771 examples


Generating train split: 100%|██████████| 870/870 [00:00<00:00, 113430.04 examples/s]
Generating test split: 100%|██████████| 479/479 [00:00<00:00, 80536.82 examples/s]


TRAIN – geometry: 870 examples


Generating train split: 100%|██████████| 1295/1295 [00:00<00:00, 121796.21 examples/s]
Generating test split: 100%|██████████| 903/903 [00:00<00:00, 88740.78 examples/s]


TRAIN – intermediate_algebra: 1295 examples


Generating train split: 100%|██████████| 869/869 [00:00<00:00, 86134.09 examples/s]
Generating test split: 100%|██████████| 540/540 [00:00<00:00, 39485.44 examples/s]


TRAIN – number_theory: 869 examples


Generating train split: 100%|██████████| 1205/1205 [00:00<00:00, 138899.51 examples/s]
Generating test split: 100%|██████████| 871/871 [00:00<00:00, 91310.42 examples/s]


TRAIN – prealgebra: 1205 examples


Generating train split: 100%|██████████| 746/746 [00:00<00:00, 94661.79 examples/s]
Generating test split: 100%|██████████| 546/546 [00:00<00:00, 57125.15 examples/s]


TRAIN – precalculus: 746 examples
TRAIN total: 7500 examples
TEST – algebra: 1187 examples
TEST – counting_and_probability: 474 examples
TEST – geometry: 479 examples
TEST – intermediate_algebra: 903 examples
TEST – number_theory: 540 examples
TEST – prealgebra: 871 examples
TEST – precalculus: 546 examples
TEST total: 5000 examples
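
A quick way to sanity-check the log above is to confirm that the per-subject counts add up to the reported totals. The numbers below are copied by hand from that output; the dictionary names are only illustrative.

```python
# Per-subject counts copied from the log output above.
train_counts = {
    "algebra": 1744,
    "counting_and_probability": 771,
    "geometry": 870,
    "intermediate_algebra": 1295,
    "number_theory": 869,
    "prealgebra": 1205,
    "precalculus": 746,
}
test_counts = {
    "algebra": 1187,
    "counting_and_probability": 474,
    "geometry": 479,
    "intermediate_algebra": 903,
    "number_theory": 540,
    "prealgebra": 871,
    "precalculus": 546,
}

train_total = sum(train_counts.values())
test_total = sum(test_counts.values())

# Both sums should match the "total" lines printed by load_all.
assert train_total == 7500
assert test_total == 5000
print(train_total, test_total)
```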


In [15]:
SYSTEM_PROMPT = (
    "You are a math assistant. Solve the problem step by step, "
    "explain your reasoning, and box the final answer using \\boxed{}."
)
MAX_TOKENS = 4000
ENC = tiktoken.get_encoding("cl100k_base")
OUT_TRAIN = Path("MATH_train_full.jsonl")
OUT_TEST  = Path("MATH_test_full.jsonl")

In [None]:
def write_jsonl(ds, out_path):
    """Write each example as one chat-format JSON line, shortening over-long solutions."""
    kept = 0
    with out_path.open("w", encoding="utf-8") as f:
        for ex in tqdm(ds, desc=f"Writing {out_path.name}"):
            messages = [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user",   "content": ex["problem"]},
                {"role": "assistant","content": ex["solution"]}
            ]
            # ---- token guard (preserve \boxed{}) ----
            # If the serialized record exceeds MAX_TOKENS, shorten the solution once.
            # Note: the shortened record is not re-checked against the limit.
            txt = json.dumps({"messages": messages}, ensure_ascii=False)
            if len(ENC.encode(txt)) > MAX_TOKENS:
                sol = ex["solution"]
                if "\\boxed{" in sol:
                    start = sol.rfind("\\boxed{")
                    # Keep the 200 characters before the answer, plus the answer itself
                    before = sol[max(0, start - 200):start]
                    after = sol[start:]  # includes \boxed{...}
                    sol = before + "\n...\n" + after
                else:
                    # Rare case: no \boxed{} in the solution.
                    # Cut by characters, assuming roughly 4 characters per token.
                    sol = sol[:MAX_TOKENS*4] + "\n..."
                messages[-1]["content"] = sol
                txt = json.dumps({"messages": messages}, ensure_ascii=False)
            f.write(txt + "\n")
            kept += 1
    print(f"{kept} lines → {out_path}")

write_jsonl(train_ds, OUT_TRAIN)
write_jsonl(test_ds,  OUT_TEST)
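
To see what the token guard in `write_jsonl` does to an over-long solution, the following standalone sketch isolates the shortening step. The function name `truncate_keep_boxed` and its parameters are hypothetical, introduced here for illustration; the defaults of 200 and 16000 characters mirror the `200` and `MAX_TOKENS*4` values used above.

```python
def truncate_keep_boxed(sol: str, context_chars: int = 200, fallback_chars: int = 16000) -> str:
    """Shorten a solution while keeping the text from the last \\boxed{ onward."""
    start = sol.rfind("\\boxed{")
    if start != -1:
        # Keep a short window before the final answer, then the answer itself.
        before = sol[max(0, start - context_chars):start]
        return before + "\n...\n" + sol[start:]
    # No boxed answer: fall back to a plain character cut.
    return sol[:fallback_chars] + "\n..."


# Made-up long solution: 5000 characters of filler followed by a boxed answer.
long_solution = "step " * 1000 + "so the answer is \\boxed{42}."
short = truncate_keep_boxed(long_solution)
print(len(long_solution), "->", len(short))
print(short[-30:])
```

Note that the window is measured in characters, not tokens, so this only approximates a token budget; a stricter variant would re-encode the result and trim again if it is still too long.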

Writing MATH_train_full.jsonl: 100%|██████████| 7500/7500 [00:05<00:00, 1256.70it/s]


7500 lines → MATH_train_full.jsonl


Writing MATH_test_full.jsonl: 100%|██████████| 5000/5000 [00:03<00:00, 1333.30it/s]

5000 lines → MATH_test_full.jsonl
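
Before using the files for training, it can help to read them back and confirm their structure. The sketch below is one possible check, not part of the original notebook: the helper `check_jsonl` is a hypothetical name, and it only verifies the role order and counts assistant messages that lack a `\boxed` answer.

```python
import json
from pathlib import Path


def check_jsonl(path: Path) -> dict:
    """Parse every line and verify the system/user/assistant message layout."""
    stats = {"records": 0, "missing_boxed": 0}
    with path.open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            roles = [m["role"] for m in record["messages"]]
            if roles != ["system", "user", "assistant"]:
                raise ValueError(f"{path}:{line_no}: unexpected roles {roles}")
            stats["records"] += 1
            # Count solutions whose final answer marker did not survive.
            if "\\boxed" not in record["messages"][-1]["content"]:
                stats["missing_boxed"] += 1
    return stats


# Check the two files written above, if they exist in the working directory.
for name in ("MATH_train_full.jsonl", "MATH_test_full.jsonl"):
    p = Path(name)
    if p.exists():
        print(name, check_jsonl(p))
```

If the files were written as shown, the `records` counts should match the line counts printed by `write_jsonl`.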



