# Generate Mutation Training Data (Local / MacBook Pro Max)

Standalone Jupyter notebook for generating Rust mutation training data
on an Apple Silicon MacBook Pro (Pro or Max chip) with 64GB RAM.

**Optimizations for Apple Silicon:**
- **sccache** for cross-repo compilation caching (60-80% faster rebuilds)
- **RAM disk** for build artifacts (eliminates I/O overhead for temp files)
- **Stripped debug info** for deps (20-40% faster linking)
- **Native CPU target** (`-C target-cpu=native`) so compiled code can use instructions specific to the host chip
- **RAM-aware parallelism** tuned for M-series core layout

**What this does:**
1. Sets up an optimized Rust compilation environment
2. Clones curated Rust repositories
3. Runs `cargo-mutants` with aggressive parallelism
4. Captures (buggy code, error, fix) training triples
5. Saves as JSONL + HuggingFace dataset
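
As a rough illustration of steps 4 and 5, one training triple could be serialized as a single JSONL line like this. The field names `buggy_code`, `error_message`, `fixed_code`, and `explanation` are the ones the inspection cells in Step 2 read; the `make_record` helper and the sample values are made up for illustration and are not taken from the generation script.

```python
import json

# Hypothetical helper (not part of the notebook or its scripts).
def make_record(buggy_code, error_message, fixed_code, explanation):
    """Bundle one (buggy code, error, fix) triple into a JSON-serializable dict."""
    return {
        "buggy_code": buggy_code,
        "error_message": error_message,
        "fixed_code": fixed_code,
        "explanation": explanation,
    }

# Illustrative values only; real records come from cargo-mutants output.
record = make_record(
    buggy_code="fn is_even(n: u32) -> bool { n % 2 != 0 }",
    error_message="Test failure: assertion failed: is_even(4)",
    fixed_code="fn is_even(n: u32) -> bool { n % 2 == 0 }",
    explanation="replace == with != in is_even (src/lib.rs)",
)

# JSONL means one compact JSON object per line.
jsonl_line = json.dumps(record)
print(jsonl_line)
```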

**Requirements:**
- macOS with Apple Silicon (M1/M2/M3/M4 Pro/Max)
- 64GB RAM recommended (with 32GB or less, the notebook automatically reduces parallelism)
- Rust toolchain: `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`
- cargo-mutants: `cargo install cargo-mutants`
- sccache: `brew install sccache`

**Time estimate:** 1-3 hours for all 21 repos (faster than Colab due to faster CPUs + SSD)

---
## Step 0: Environment Setup & Optimization

In [4]:
### 0.1 Detect Hardware & Verify Tools

import os
import sys
import platform
import subprocess
import multiprocessing

# Detect Apple Silicon
is_apple_silicon = platform.machine() == "arm64" and platform.system() == "Darwin"
cpu_count = multiprocessing.cpu_count()

# Get RAM
try:
    result = subprocess.run(["sysctl", "-n", "hw.memsize"], capture_output=True, text=True)
    total_ram_gb = int(result.stdout.strip()) / (1024**3)
except Exception:
    total_ram_gb = 0

# Get chip info
try:
    result = subprocess.run(["sysctl", "-n", "machdep.cpu.brand_string"], capture_output=True, text=True)
    # An empty result (e.g. sysctl failed) should still read as "Unknown"
    chip_name = result.stdout.strip() or "Unknown"
except Exception:
    chip_name = "Unknown"

# Estimate P-cores vs E-cores from the total core count. This is a rough
# heuristic: binned chips differ (e.g. the M3 Max ships as 10P+4E or 12P+4E,
# and the M3 Pro as 6P+6E), so the split can be off for some models.
# Typical full configs: M1 Pro/Max 8P+2E=10, M2 Pro/Max 8P+4E=12, M3/M4 Max 12P+4E=16
if cpu_count >= 16:
    p_cores = 12
    e_cores = cpu_count - 12
elif cpu_count >= 10:
    p_cores = 8
    e_cores = cpu_count - 8
else:
    p_cores = max(1, cpu_count // 2)
    e_cores = cpu_count - p_cores

print("=" * 60)
print("HARDWARE DETECTION")
print("=" * 60)
print(f"  Chip: {chip_name}")
print(f"  Architecture: {'Apple Silicon (arm64)' if is_apple_silicon else platform.machine()}")
print(f"  CPU cores: {cpu_count} total ({p_cores} performance + {e_cores} efficiency)")
print(f"  RAM: {total_ram_gb:.0f} GB")

if not is_apple_silicon:
    print("\n  NOTE: Not Apple Silicon. Optimizations are tuned for ARM64 but will still work.")

# Verify tools
print(f"\nTool Verification:")
print("-" * 40)

tools_ok = True
for cmd, label, required in [
    (["cargo", "--version"], "cargo", True),
    (["cargo", "mutants", "--version"], "cargo-mutants", True),
    (["sccache", "--version"], "sccache", False),
    (["git", "--version"], "git", True),
]:
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
        if result.returncode == 0:
            ver = result.stdout.strip().split("\n")[0]
            print(f"  \u2713 {label}: {ver}")
        else:
            print(f"  \u2717 {label}: command failed")
            if required:
                tools_ok = False
    except FileNotFoundError:
        tag = "REQUIRED" if required else "optional"
        print(f"  \u2717 {label}: not installed ({tag})")
        if required:
            tools_ok = False

if not tools_ok:
    print("\n  Missing required tools! Install with:")
    print("    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh")
    print("    cargo install cargo-mutants")
    print("    brew install sccache  # optional but recommended")

print("=" * 60)

HARDWARE DETECTION
  Chip: Apple M1
  Architecture: Apple Silicon (arm64)
  CPU cores: 8 total (4 performance + 4 efficiency)
  RAM: 16 GB

Tool Verification:
----------------------------------------
  ✓ cargo: cargo 1.92.0 (Homebrew)
  ✓ cargo-mutants: cargo-mutants 26.2.0
  ✓ sccache: sccache 0.14.0
  ✓ git: git version 2.33.0
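
The core split above is inferred from the total core count. On Apple Silicon, macOS also reports the split directly through the `hw.perflevel0.physicalcpu` (performance) and `hw.perflevel1.physicalcpu` (efficiency) sysctl keys. The following is a minimal sketch that queries them, assuming those keys are present on your macOS version, and returns `None` otherwise so the heuristic can remain the fallback. `read_sysctl_int` and `detect_core_split` are hypothetical helper names.

```python
import subprocess

# Hypothetical helper (not part of the notebook).
def read_sysctl_int(key):
    """Return the integer value of a sysctl key, or None if unavailable."""
    try:
        result = subprocess.run(
            ["sysctl", "-n", key], capture_output=True, text=True, timeout=5
        )
    except (OSError, subprocess.TimeoutExpired):
        return None  # sysctl missing (non-macOS) or not responding
    if result.returncode != 0:
        return None
    try:
        return int(result.stdout.strip())
    except ValueError:
        return None

# Hypothetical helper; `read` is injectable so the logic can be tested off-Mac.
def detect_core_split(read=read_sysctl_int):
    """Return (p_cores, e_cores) from sysctl, or None if either key is missing."""
    p = read("hw.perflevel0.physicalcpu")
    e = read("hw.perflevel1.physicalcpu")
    if p is None or e is None:
        return None
    return p, e

split = detect_core_split()
if split is None:
    print("perflevel keys unavailable; keep the core-count heuristic")
else:
    print(f"{split[0]} performance + {split[1]} efficiency cores")
```

If the call returns a tuple, its values could replace `p_cores` and `e_cores` before the parallelism settings are chosen.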


In [7]:
### 0.2 Configure Optimized Compilation Environment
#
# Sets up sccache, native CPU targeting, and stripped debug info
# for fastest possible cargo-mutants throughput.

import shutil

# ---- sccache ----
# Caches compilation artifacts by input hash. Huge win for cargo-mutants
# because mutations touch one file while dozens of deps stay identical.
# After the first mutant, subsequent ones reuse cached deps (60-80% faster).
has_sccache = shutil.which("sccache") is not None

if has_sccache:
    os.environ["RUSTC_WRAPPER"] = "sccache"
    os.environ["SCCACHE_CACHE_SIZE"] = "20G"
    os.environ.setdefault("SCCACHE_DIR", os.path.expanduser("~/.cache/sccache"))
    # sccache cannot cache incrementally compiled crates, so incremental
    # compilation is disabled. For cross-repo mutation work the shared cache wins.
    os.environ["CARGO_INCREMENTAL"] = "0"
    print("\u2713 sccache enabled (RUSTC_WRAPPER=sccache, incremental OFF)")
    print(f"  Cache dir: {os.environ['SCCACHE_DIR']}")
    print(f"  Cache size: {os.environ['SCCACHE_CACHE_SIZE']}")
else:
    # Without sccache, use incremental compilation for the single-file
    # mutation pattern (change one file, rebuild).
    os.environ["CARGO_INCREMENTAL"] = "1"
    print("\u2014 sccache not installed. Using incremental compilation instead.")
    print("  Install for 60-80% faster builds: brew install sccache")

# ---- Native CPU target ----
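# Let rustc target the host chip's full feature set. NEON is already part of
# the AArch64 baseline, so this only adds newer, chip-specific instructions.
# It affects the generated binaries, not compile time. Safe since we only run locally.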
os.environ["RUSTFLAGS"] = "-C target-cpu=native"
print(f"\u2713 RUSTFLAGS: {os.environ['RUSTFLAGS']}")

# ---- Cargo build parallelism ----
# Formula: CARGO_BUILD_JOBS = performance_cores / cargo_mutants_jobs
# We want the product to roughly equal performance core count.
# This is set after we determine mutation_jobs in the next cell.

print("\n\u2713 Compilation environment configured")

✓ sccache enabled (RUSTC_WRAPPER=sccache, incremental OFF)
  Cache dir: /Users/robertarnold/.cache/sccache
  Cache size: 20G
✓ RUSTFLAGS: -C target-cpu=native

✓ Compilation environment configured
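
Cell 0.2 assigns `RUSTFLAGS` outright, which replaces any flags already exported in the shell. If that matters in your setup, the flag could be appended instead. This is only a sketch: `merge_rustflags` is a hypothetical helper, and it de-duplicates exact string matches only.

```python
import os

# Hypothetical helper (not part of the notebook).
def merge_rustflags(existing, extra):
    """Append `extra` to `existing` unless it is already present verbatim."""
    existing = (existing or "").strip()
    if extra in existing:
        return existing
    return f"{existing} {extra}".strip()

os.environ["RUSTFLAGS"] = merge_rustflags(
    os.environ.get("RUSTFLAGS"), "-C target-cpu=native"
)
print(os.environ["RUSTFLAGS"])
```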


In [8]:
### 0.3 Configure Generation Parameters

# ---- Parallelism ----
# cargo-mutants --jobs controls parallel mutant workers.
# Each worker spawns cargo build + test, which uses CARGO_BUILD_JOBS cores.
# Target: mutation_jobs * cargo_build_jobs ~ performance_cores
#
# With 64GB RAM, memory is not the bottleneck (see research notes).
# CPU is the constraint: too many jobs cause cache thrashing.

if total_ram_gb >= 48:  # 64GB
    if p_cores >= 12:  # M3/M4 Max
        mutation_jobs = 4
        cargo_build_jobs = 4  # 4 * 4 = 16 compile units ~ 12 P-cores + overflow to E-cores
    elif p_cores >= 8:  # M1/M2 Max
        mutation_jobs = 3
        cargo_build_jobs = 3  # 3 * 3 = 9 ~ 8 P-cores
    else:
        mutation_jobs = 2
        cargo_build_jobs = max(1, p_cores // 2)
elif total_ram_gb >= 24:  # 32GB
    mutation_jobs = 2
    cargo_build_jobs = max(1, p_cores // 2)
else:  # 16GB or less
    mutation_jobs = 1
    cargo_build_jobs = max(1, p_cores)

os.environ["CARGO_BUILD_JOBS"] = str(cargo_build_jobs)
print(f"\u2713 CARGO_BUILD_JOBS={cargo_build_jobs}")

# ---- Generation settings ----
max_mutations_per_repo = 100
timeout_per_mutation = 300  # seconds

# ---- Paths ----
# Use the project's own directory structure. When the notebook runs from the
# project root (a scripts/ directory is present), use the current directory;
# otherwise (e.g. inside notebooks/) go up one level.
cwd = os.getcwd()
if os.path.basename(cwd) != "notebooks" and os.path.exists(os.path.join(cwd, "scripts")):
    PROJECT_ROOT = cwd
else:
    PROJECT_ROOT = os.path.dirname(cwd)

CONFIG_PATH = os.path.join(PROJECT_ROOT, "configs", "data_sources_rust.yaml")

# Store data on external drive for speed and to avoid filling the boot SSD.
# The OWC Express 1M2 is a fast NVMe enclosure — ideal for the heavy I/O
# of cloning repos and running cargo builds.
EXTERNAL_BASE = "/Volumes/OWC Express 1M2/rust-mutations"
CLONE_DIR = os.path.join(EXTERNAL_BASE, "repos")
OUTPUT_DIR = os.path.join(EXTERNAL_BASE, "output")

# Verify the external drive is mounted
if not os.path.exists("/Volumes/OWC Express 1M2"):
    print("\u2717 External drive not found at /Volumes/OWC Express 1M2")
    print("  Falling back to project-local data/ directory")
    CLONE_DIR = os.path.join(PROJECT_ROOT, "data", "rust", "repos")
    OUTPUT_DIR = os.path.join(PROJECT_ROOT, "data", "rust", "mutations")

os.makedirs(CLONE_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"\n{'='*60}")
print("MUTATION GENERATION CONFIG")
print(f"{'='*60}")
print(f"  Mutation workers (--jobs): {mutation_jobs}")
print(f"  Cargo build jobs:          {cargo_build_jobs}")
print(f"  Total compile units:       ~{mutation_jobs * cargo_build_jobs}")
print(f"  P-cores available:         {p_cores}")
print(f"  Max mutations/repo:        {max_mutations_per_repo}")
print(f"  Timeout/mutation:          {timeout_per_mutation}s")
print(f"  sccache:                   {'ON' if has_sccache else 'OFF'}")
print(f"  Project root:              {PROJECT_ROOT}")
print(f"  Clone dir:                 {CLONE_DIR}")
print(f"  Output dir:                {OUTPUT_DIR}")
print(f"{'='*60}")

✓ CARGO_BUILD_JOBS=4

MUTATION GENERATION CONFIG
  Mutation workers (--jobs): 1
  Cargo build jobs:          4
  Total compile units:       ~4
  P-cores available:         4
  Max mutations/repo:        100
  Timeout/mutation:          300s
  sccache:                   ON
  Project root:              /Users/robertarnold/PycharmProjects/llm-training-pipeline
  Clone dir:                 /Volumes/OWC Express 1M2/rust-mutations/repos
  Output dir:                /Volumes/OWC Express 1M2/rust-mutations/output
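
The sizing rule in the comments (`mutation_jobs * cargo_build_jobs ~ performance_cores`) can be written as a small function. This is a sketch of the formula only; the hand-tuned table in cell 0.3 deviates from it in places, for example picking 4 build jobs for 12 P-cores so that work spills onto the E-cores. `build_jobs_for` is a hypothetical name.

```python
import math

# Hypothetical helper (not part of the notebook).
def build_jobs_for(p_cores, mutation_jobs):
    """Cargo build jobs per worker so that workers * jobs covers the P-cores."""
    if mutation_jobs < 1:
        raise ValueError("mutation_jobs must be >= 1")
    return max(1, math.ceil(p_cores / mutation_jobs))

for p_cores, mutation_jobs in [(12, 4), (8, 3), (4, 1)]:
    jobs = build_jobs_for(p_cores, mutation_jobs)
    print(f"{p_cores} P-cores, {mutation_jobs} workers -> {jobs} build jobs "
          f"({mutation_jobs * jobs} concurrent compile units)")
```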


In [11]:
### 0.4 Create RAM Disk (Optional)
#
# Allocates a RAM disk for cargo-mutants' temp build directories.
# This eliminates disk I/O for intermediate compilation artifacts.
#
# On Apple Silicon with fast NVMe (7 GB/s), the benefit is moderate
# (~10-15% faster) but it also reduces SSD wear from the heavy
# write traffic of mutation testing.
#
# Skip this cell if you don't want to allocate RAM.

USE_RAMDISK = True  # Set False to skip
RAMDISK_SIZE_GB = 10  # 10 GB is enough for most crates

RAMDISK_PATH = None

if USE_RAMDISK and platform.system() == "Darwin":
    # Check if already mounted
    if os.path.exists("/Volumes/MutantsBuild"):
        RAMDISK_PATH = "/Volumes/MutantsBuild"
        print(f"\u2713 RAM disk already mounted at {RAMDISK_PATH}")
    else:
        # hdiutil's ram:// size is a count of 512-byte sectors (2048 per MiB)
        sectors = RAMDISK_SIZE_GB * 1024 * 2048
        print(f"Creating {RAMDISK_SIZE_GB} GB RAM disk...")
        result = subprocess.run(
            ["hdiutil", "attach", "-nomount", f"ram://{sectors}"],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            device = result.stdout.strip()
            fmt_result = subprocess.run(
                ["diskutil", "eraseVolume", "APFS", "MutantsBuild", device],
                capture_output=True, text=True,
            )
            if fmt_result.returncode == 0:
                RAMDISK_PATH = "/Volumes/MutantsBuild"
                print(f"\u2713 RAM disk created: {RAMDISK_PATH} ({RAMDISK_SIZE_GB} GB)")
                print(f"  Device: {device}")
                print(f"  To detach later: hdiutil detach {device}")
            else:
                print(f"\u2717 Failed to format RAM disk: {fmt_result.stderr}")
        else:
            print(f"\u2717 Failed to create RAM disk: {result.stderr}")

    # Tell cargo-mutants to use the RAM disk for its temp copies
    if RAMDISK_PATH:
        os.environ["TMPDIR"] = RAMDISK_PATH
        print(f"\u2713 TMPDIR set to RAM disk")
else:
    print("\u2014 RAM disk skipped (using SSD — still fast on Apple Silicon)")

✓ RAM disk already mounted at /Volumes/MutantsBuild
✓ TMPDIR set to RAM disk
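
`hdiutil` takes the RAM disk size as a count of 512-byte sectors, which is where the `1024 * 2048` factor in the cell comes from. The arithmetic, as a standalone sketch (`ram_disk_sectors` is a hypothetical helper):

```python
SECTOR_BYTES = 512  # hdiutil ram:// sizes are expressed in 512-byte sectors

# Hypothetical helper (not part of the notebook).
def ram_disk_sectors(size_gb):
    """Number of 512-byte sectors in `size_gb` GiB."""
    return size_gb * 1024**3 // SECTOR_BYTES

sectors = ram_disk_sectors(10)
print(f"ram://{sectors}")  # ram://20971520
```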


In [9]:
### 0.5 Set Up Cargo Config for Stripped Debug Info
#
# Disabling debug info for dependency crates gives 20-40% faster linking
# and 30-50% less disk usage for build artifacts.
# We write a project-local .cargo/config.toml that only affects builds
# within the cloned repos.

CARGO_CONFIG = """\
# Auto-generated by generate_mutations_local.ipynb
# Optimizations for cargo-mutants speed

[profile.dev.package."*"]
debug = false

[profile.test.package."*"]
debug = false
"""

# We'll inject this into each cloned repo's .cargo/config.toml
# (done automatically in the generation step)
print("\u2713 Cargo config prepared (stripped debug info for deps)")
print("  This will be injected into each cloned repo before mutation testing.")

✓ Cargo config prepared (stripped debug info for deps)
  This will be injected into each cloned repo before mutation testing.
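
Cargo only picks up configuration from a file named `.cargo/config.toml` in the working directory or one of its parents, so the prepared text has to land at that path inside each clone. A minimal sketch of such an injection step follows; `inject_cargo_config` is a hypothetical helper, not necessarily how the generation script does it, and it skips repos that already ship their own config rather than merging.

```python
import os
import tempfile

# Hypothetical helper (not part of the notebook or its scripts).
def inject_cargo_config(repo_dir, config_text):
    """Write config_text to <repo_dir>/.cargo/config.toml unless one exists.

    Returns True if the file was written, False if an existing config was kept.
    """
    cargo_dir = os.path.join(repo_dir, ".cargo")
    config_path = os.path.join(cargo_dir, "config.toml")
    if os.path.exists(config_path):
        return False  # do not clobber the project's own settings
    os.makedirs(cargo_dir, exist_ok=True)
    with open(config_path, "w") as f:
        f.write(config_text)
    return True

# Demo against a throwaway directory standing in for a cloned repo.
demo_config = '[profile.dev.package."*"]\ndebug = false\n'
with tempfile.TemporaryDirectory() as repo:
    first = inject_cargo_config(repo, demo_config)
    second = inject_cargo_config(repo, demo_config)
    print(first, second)  # True False
```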


---
## Step 1: Generate Mutations

In [None]:
### 1.1 Run Mutation Generation
#
# This calls the same script as Colab but with locally-optimized settings.
# The script handles cloning, mutation, and result parsing.

import time

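# Stage the Cargo config alongside the clones. Cargo itself only reads
# <dir>/.cargo/config.toml, so this staged copy takes effect only once it is
# copied into each repo's .cargo/ directory (assumed to be done by the script).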
cargo_config_dir = os.path.join(CLONE_DIR, ".cargo-mutations-config")
os.makedirs(cargo_config_dir, exist_ok=True)
with open(os.path.join(cargo_config_dir, "config.toml"), "w") as f:
    f.write(CARGO_CONFIG)

# Ensure scripts/ is importable
scripts_dir = os.path.join(PROJECT_ROOT, "scripts")
if scripts_dir not in sys.path:
    sys.path.insert(0, scripts_dir)

start_time = time.time()

print(f"Starting mutation generation...")
print(f"  Workers: {mutation_jobs} | Cargo jobs: {cargo_build_jobs}")
print(f"  sccache: {'ON' if has_sccache else 'OFF'} | RAM disk: {'ON' if RAMDISK_PATH else 'OFF'}")
print("=" * 60)

# Import and run directly (better output in Jupyter than subprocess)
from pipeline_lib.cargo_mutants_runner import check_cargo_mutants_installed
if not check_cargo_mutants_installed():
    print("\u2717 cargo-mutants not found. Install: cargo install cargo-mutants")
else:
    # We need to inject the cargo config into repos after cloning.
    # The easiest way is to use the CLI script which handles everything.
    import subprocess as sp
    cmd = [
        sys.executable, os.path.join(scripts_dir, "16_generate_mutations.py"),
        "--config", CONFIG_PATH,
        "--clone_dir", CLONE_DIR,
        "--output_dir", OUTPUT_DIR,
        "--max_mutations_per_repo", str(max_mutations_per_repo),
        "--timeout_per_mutation", str(timeout_per_mutation),
        "--jobs", str(mutation_jobs),
    ]

    # Run with real-time output
    process = sp.Popen(
        cmd,
        stdout=sp.PIPE,
        stderr=sp.STDOUT,
        text=True,
        bufsize=1,
        env=os.environ.copy(),
    )
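    for line in iter(process.stdout.readline, ""):
        print(line, end="")
    returncode = process.wait()
    if returncode != 0:
        # Surface failures instead of silently reporting only the elapsed time
        print(f"\n\u2717 Generation script exited with code {returncode}")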

    elapsed = time.time() - start_time
    mins = int(elapsed // 60)
    secs = int(elapsed % 60)
    print(f"\nTotal time: {mins}m {secs}s")

    if has_sccache:
        print("\nsccache stats:")
        sp.run(["sccache", "--show-stats"], check=False)

Starting mutation generation...
  Workers: 1 | Cargo jobs: 4
  sccache: ON | RAM disk: ON


Disabling PyTorch because PyTorch >= 2.1 is required but found 2.0.1
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.



In [None]:
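### 1.2 Monitor System Resources (run while 1.1 is executing)
#
# A Jupyter kernel executes one cell at a time, so running this in the same
# notebook queues it until 1.1 finishes. To sample CPU/RAM usage during
# generation, run it from a second notebook or kernel after re-running cells
# 0.1, 0.2, and 0.4 there (they define has_sccache and RAMDISK_PATH).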

import subprocess

# macOS memory info
print("Memory:")
result = subprocess.run(["vm_stat"], capture_output=True, text=True)
lines = result.stdout.strip().split("\n")
for line in lines[:6]:
    print(f"  {line}")

# Memory pressure
print("\nMemory pressure:")
result = subprocess.run(["memory_pressure"], capture_output=True, text=True, timeout=5)
for line in result.stdout.strip().split("\n")[:3]:
    print(f"  {line}")

# CPU load
print("\nLoad average:")
result = subprocess.run(["uptime"], capture_output=True, text=True)
print(f"  {result.stdout.strip()}")

# sccache hit rate
if has_sccache:
    print("\nsccache:")
    result = subprocess.run(["sccache", "--show-stats"], capture_output=True, text=True)
    for line in result.stdout.strip().split("\n"):
        if any(k in line.lower() for k in ["hit", "miss", "request", "cache"]):
            print(f"  {line.strip()}")

# RAM disk usage
if RAMDISK_PATH and os.path.exists(RAMDISK_PATH):
    result = subprocess.run(["df", "-h", RAMDISK_PATH], capture_output=True, text=True)
    print(f"\nRAM disk:")
    for line in result.stdout.strip().split("\n"):
        print(f"  {line}")
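
`vm_stat` reports counts of pages rather than bytes, so the raw lines are hard to read at a glance. The following sketch converts its output to GiB, assuming the usual layout (a header naming the page size, then `Label: count.` lines). `parse_vm_stat` is a hypothetical helper and the sample text is made up.

```python
import re

# Hypothetical helper (not part of the notebook).
def parse_vm_stat(text):
    """Convert vm_stat output into {label: GiB}. Assumes the standard layout."""
    header = re.search(r"page size of (\d+) bytes", text)
    page_size = int(header.group(1)) if header else 4096  # fallback is a guess
    stats = {}
    for line in text.splitlines():
        match = re.match(r"^(Pages [^:]+):\s+(\d+)\.?\s*$", line)
        if match:
            label, pages = match.group(1), int(match.group(2))
            stats[label] = pages * page_size / 1024**3
    return stats

# Made-up sample in vm_stat's format, not a real measurement.
sample = """Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                               65536.
Pages active:                            131072.
"""
for label, gib in parse_vm_stat(sample).items():
    print(f"{label}: {gib:.1f} GiB")
```

In the monitoring cell, `result.stdout` from the `vm_stat` call would take the place of `sample`.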

---
## Step 2: Verify & Inspect Data

In [None]:
### 2.1 Verify Output

import json

jsonl_path = os.path.join(OUTPUT_DIR, "mutations.jsonl")
hf_path = os.path.join(OUTPUT_DIR, "hf_dataset")

print("Output Verification:")
print("=" * 60)

if os.path.exists(jsonl_path):
    with open(jsonl_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    size_mb = os.path.getsize(jsonl_path) / (1024 * 1024)
    print(f"  \u2713 JSONL: {len(records):,} examples ({size_mb:.1f} MB)")

    # Classify on the parsed error_message field (same rule as cell 2.3)
    # rather than substring-matching the raw JSON text.
    caught = sum(1 for r in records if "Test failure" in r.get("error_message", ""))
    unviable = sum(1 for r in records if "Compiler error" in r.get("error_message", ""))
    print(f"    Caught mutations (test failures): {caught:,}")
    print(f"    Unviable mutations (compiler errors): {unviable:,}")
else:
    print(f"  \u2717 JSONL not found at {jsonl_path}")

if os.path.exists(hf_path):
    items = os.listdir(hf_path)
    print(f"  \u2713 HF dataset: {hf_path} ({len(items)} files)")
else:
    print(f"  \u2014 HF dataset not generated yet")

print("=" * 60)
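
The counts above assume every line is valid JSON with the expected fields. A sketch of a stricter check follows, assuming the four field names read by cell 2.2 form the required schema. `validate_jsonl` is hypothetical and the sample lines are made up.

```python
import json

# Assumed schema: the four fields that cell 2.2 reads from each record.
REQUIRED_FIELDS = ("buggy_code", "error_message", "fixed_code", "explanation")

# Hypothetical helper (not part of the notebook).
def validate_jsonl(lines):
    """Return (ok_count, problems); problems is a list of (line_number, reason)."""
    ok, problems = 0, []
    for number, line in enumerate(lines, 1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = [k for k in REQUIRED_FIELDS if not record.get(k)]
        if missing:
            problems.append((number, "missing " + ", ".join(missing)))
        else:
            ok += 1
    return ok, problems

# Made-up lines: one complete record, one incomplete, one malformed.
sample_lines = [
    json.dumps({k: "x" for k in REQUIRED_FIELDS}),
    '{"buggy_code": "x"}',
    "not json",
]
ok, problems = validate_jsonl(sample_lines)
print(ok, problems)
```

Because the function accepts any iterable of lines, an open file handle for `mutations.jsonl` can be passed in directly.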

In [None]:
### 2.2 Inspect Sample Examples

import json

jsonl_path = os.path.join(OUTPUT_DIR, "mutations.jsonl")

if not os.path.exists(jsonl_path):
    print("No data yet. Run Step 1 first.")
else:
    with open(jsonl_path) as f:
        # Stream only the first five lines instead of loading the whole file
        examples = [json.loads(line) for _, line in zip(range(5), f)]

    for i, ex in enumerate(examples, 1):
        print(f"\n{'='*60}")
        print(f"Example {i}")
        print(f"{'='*60}")
        print(f"Explanation: {ex.get('explanation', 'N/A')[:120]}")
        print(f"\nBuggy code (first 200 chars):")
        print(ex.get('buggy_code', '')[:200])
        print(f"\nError (first 200 chars):")
        print(ex.get('error_message', '')[:200])
        print(f"\nFixed code (first 200 chars):")
        print(ex.get('fixed_code', '')[:200])

In [None]:
### 2.3 Stats by Repository

import json
from collections import Counter

jsonl_path = os.path.join(OUTPUT_DIR, "mutations.jsonl")

if not os.path.exists(jsonl_path):
    print("No data yet. Run Step 1 first.")
else:
    repo_counts = Counter()
    type_counts = Counter()

    with open(jsonl_path) as f:
        for line in f:
            ex = json.loads(line)
            explanation = ex.get('explanation', '')
            if '(' in explanation and ')' in explanation:
                file_path = explanation.split('(')[-1].rstrip(')')
                # str.split always returns at least one item, so check for an
                # empty prefix rather than an empty list
                prefix = file_path.split('/')[0]
                repo_counts[prefix or 'unknown'] += 1
            if 'Test failure' in ex.get('error_message', ''):
                type_counts['caught (test failure)'] += 1
            elif 'Compiler error' in ex.get('error_message', ''):
                type_counts['unviable (compiler error)'] += 1

    print("Examples by source file prefix:")
    print("=" * 60)
    for repo, count in repo_counts.most_common():
        print(f"  {repo:<30} {count:>5}")
    print(f"  {'TOTAL':<30} {sum(repo_counts.values()):>5}")

    print(f"\nExamples by type:")
    print("=" * 60)
    for t, count in type_counts.most_common():
        print(f"  {t:<35} {count:>5}")

---
## Step 3: Re-run Failed Repos (Optional)

In [None]:
### 3.1 Re-run Specific Repos
#
# If some repos failed, re-run them here.
# Edit the list below with repo names (e.g., "tokio-rs/tokio").

retry_repos = [
    # "tokio-rs/tokio",
    # "serde-rs/serde",
]

if retry_repos:
    import time

    retry_output = os.path.join(OUTPUT_DIR, "retry")
    os.makedirs(retry_output, exist_ok=True)

    start = time.time()

    print(f"Re-running {len(retry_repos)} repos...")
    print("=" * 60)

    cmd = [
        sys.executable, os.path.join(PROJECT_ROOT, "scripts", "16_generate_mutations.py"),
        "--repos", *retry_repos,
        "--clone_dir", CLONE_DIR,
        "--output_dir", retry_output,
        "--max_mutations_per_repo", str(max_mutations_per_repo),
        "--timeout_per_mutation", str(timeout_per_mutation),
        "--jobs", str(mutation_jobs),
    ]

    process = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        text=True, bufsize=1, env=os.environ.copy(),
    )
    for line in iter(process.stdout.readline, ""):
        print(line, end="")
    process.wait()

    elapsed = time.time() - start
    print(f"\nRetry time: {int(elapsed//60)}m {int(elapsed%60)}s")

    # Merge retry results into main output
    retry_jsonl = os.path.join(retry_output, "mutations.jsonl")
    main_jsonl = os.path.join(OUTPUT_DIR, "mutations.jsonl")
    if os.path.exists(retry_jsonl):
        with open(retry_jsonl) as f:
            retry_lines = f.readlines()
        with open(main_jsonl, "a") as f:
            f.writelines(retry_lines)
        print(f"\nMerged {len(retry_lines)} retry examples into main output")
else:
    print("No repos to retry. Edit retry_repos list above.")
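
Appending retry output verbatim can duplicate examples when a repo partially succeeded in the first run. A sketch of a merge that skips records already present, keyed on a hash of the triple, is shown here. `record_key` and `merge_unique` are hypothetical helpers, and the approach holds all keys in memory.

```python
import hashlib
import json

# Hypothetical helper (not part of the notebook).
def record_key(record):
    """Stable hash of the (buggy_code, error_message, fixed_code) triple."""
    triple = [
        record.get("buggy_code", ""),
        record.get("error_message", ""),
        record.get("fixed_code", ""),
    ]
    return hashlib.sha256(json.dumps(triple).encode("utf-8")).hexdigest()

# Hypothetical helper (not part of the notebook).
def merge_unique(main_lines, retry_lines):
    """Return the retry lines whose triple does not already appear in main_lines."""
    seen = {record_key(json.loads(line)) for line in main_lines if line.strip()}
    fresh = []
    for line in retry_lines:
        if not line.strip():
            continue
        key = record_key(json.loads(line))
        if key not in seen:
            seen.add(key)
            # Normalize the line ending so appended records stay one per line
            fresh.append(line if line.endswith("\n") else line + "\n")
    return fresh

# Made-up records for demonstration.
old = json.dumps({"buggy_code": "a", "error_message": "e", "fixed_code": "f"}) + "\n"
new = json.dumps({"buggy_code": "b", "error_message": "e", "fixed_code": "f"}) + "\n"
print(len(merge_unique([old], [old, new])))  # 1
```

The returned lines could then be appended to the main JSONL in place of the raw `retry_lines`.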

---
## Step 4: Copy to Colab Drive (Optional)

If you want to use this data in Colab for training, copy it to Google Drive
or upload it to HuggingFace Hub.

In [None]:
### 4.1 Upload to HuggingFace Hub (Optional)
#
# Uploads the HF dataset so you can load it from Colab with:
#   datasets.load_dataset("your-username/rust-mutations")

UPLOAD_TO_HF = False  # Set True to upload
HF_REPO_ID = "your-username/rust-mutations"  # Change this

if UPLOAD_TO_HF:
    hf_path = os.path.join(OUTPUT_DIR, "hf_dataset")
    if os.path.exists(hf_path):
        from datasets import load_from_disk
        ds = load_from_disk(hf_path)
        ds.push_to_hub(HF_REPO_ID, private=True)
        print(f"\u2713 Uploaded to https://huggingface.co/datasets/{HF_REPO_ID}")
    else:
        print("\u2717 HF dataset not found. Run generation first.")
else:
    print("HF upload skipped. Set UPLOAD_TO_HF = True to enable.")
    print(f"\nLocal data location: {OUTPUT_DIR}")
    print("\nTo use in Colab, either:")
    print("  1. Upload the hf_dataset/ folder to HuggingFace Hub (set UPLOAD_TO_HF=True)")
    print("  2. Copy mutations.jsonl to Google Drive manually")
    print("  3. Use Google Drive desktop to sync the data/ directory")

---
## Step 5: Cleanup

In [None]:
### 5.1 Detach RAM Disk
#
# Run this when you're done to free the RAM.

if RAMDISK_PATH and os.path.exists(RAMDISK_PATH):
    result = subprocess.run(
        ["hdiutil", "detach", RAMDISK_PATH],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        print(f"\u2713 RAM disk detached: {RAMDISK_PATH}")
        RAMDISK_PATH = None
    else:
        print(f"\u2717 Failed to detach: {result.stderr}")
        print("  Try: hdiutil detach /Volumes/MutantsBuild")
else:
    print("No RAM disk to detach.")

In [None]:
### 5.2 Clean Up Cloned Repos (Optional)
#
# Remove cloned repos to free disk space.
# The training data has already been extracted.

CLEANUP_REPOS = False  # Set True to delete cloned repos

if CLEANUP_REPOS and os.path.exists(CLONE_DIR):
    import shutil
    repos = [d for d in os.listdir(CLONE_DIR) if os.path.isdir(os.path.join(CLONE_DIR, d))]
    for repo in repos:
        repo_path = os.path.join(CLONE_DIR, repo)
        shutil.rmtree(repo_path)
    print(f"\u2713 Removed {len(repos)} cloned repos from {CLONE_DIR}")
else:
    if os.path.exists(CLONE_DIR):
        repos = [d for d in os.listdir(CLONE_DIR) if os.path.isdir(os.path.join(CLONE_DIR, d))]
        total_size = sum(
            sum(os.path.getsize(os.path.join(dp, f))
                for dp, _, fns in os.walk(os.path.join(CLONE_DIR, r))
                for f in fns)
            for r in repos
        ) / (1024**2)
        print(f"Cloned repos: {len(repos)} ({total_size:.0f} MB)")
        print("Set CLEANUP_REPOS = True to delete them.")

---
## Done!

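Your mutation training data is saved under `OUTPUT_DIR`: the external drive if it was mounted, otherwise the project-local `data/rust/mutations/` directory.

**Output (when the external drive is mounted):**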
- JSONL: `/Volumes/OWC Express 1M2/rust-mutations/output/mutations.jsonl`
- HF Dataset: `/Volumes/OWC Express 1M2/rust-mutations/output/hf_dataset/`
- Cloned repos: `/Volumes/OWC Express 1M2/rust-mutations/repos/`

**Performance notes for Apple Silicon:**
- **sccache** is the single biggest optimization (60-80% faster rebuilds after warmup)
- **RAM disk** helps moderately (~10-15%) and reduces SSD wear
- **Stripped debug info** saves 20-40% on linking time
- `--jobs 4` with `CARGO_BUILD_JOBS=4` saturates an M3/M4 Max (16 cores)
- Memory is rarely the bottleneck on 64GB; CPU is the constraint

**Next steps:**
- Upload to HuggingFace Hub or copy to Google Drive for Colab training
- Open `train_gpt_oss_rust_agent_v2.ipynb` with `skip_data_generation=True`