# Honest MDL vs Neural Network Test

**Previous experiment was biased** - MDL searched over the same primitives used to generate tasks.

**This experiment is honest:**
1. Uses REAL ARC-AGI tasks (transformation rules not known to us in advance)
2. Gives neural networks a fair chance (CNN, proper training)
3. MDL primitives may NOT cover the solution
4. We measure: What % of real tasks can each approach solve?

**Hypothesis to test:**
- MDL: High accuracy on simple tasks, fails on complex ones (limited primitives)
- Neural: Low accuracy on small data, but learns something
- Neither will solve everything - that's the honest truth

In [None]:
# Setup
!pip install torch numpy matplotlib tqdm requests -q

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from itertools import product
from collections import defaultdict
from tqdm import tqdm
import requests
import json
import os

print(f"PyTorch: {torch.__version__}")
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Device: {DEVICE}")

## Part 1: Download Real ARC-AGI Tasks

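These are REAL tasks from the ARC-AGI benchmark (the public training split).
We don't know the transformation rules in advance - that's the point. The ground-truth test outputs stored in each task file are used only for scoring.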
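
For orientation, here is a minimal offline sketch of the JSON layout each task file uses: a `train` list and a `test` list, each holding `input`/`output` grids of integers 0-9. The tiny task below is hand-written for illustration (it is not a real ARC task), and `parse_task` is a hypothetical helper that keeps grids as plain nested lists instead of torch tensors.

```python
import json

# A tiny hand-written task in the ARC JSON layout (hypothetical, not a real ARC task).
# The rule here is "flip each grid left-to-right".
raw = """
{
  "train": [
    {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
    {"input": [[3, 3, 0]],      "output": [[0, 3, 3]]}
  ],
  "test": [
    {"input": [[0, 5], [0, 0]], "output": [[5, 0], [0, 0]]}
  ]
}
"""

def parse_task(text):
    """Split a task into (input, output) pairs for the train and test sets."""
    data = json.loads(text)
    train = [(ex["input"], ex["output"]) for ex in data["train"]]
    test = [(ex["input"], ex["output"]) for ex in data["test"]]
    return train, test

train_pairs, test_pairs = parse_task(raw)
print(len(train_pairs), "train pairs,", len(test_pairs), "test pair(s)")
for inp, out in train_pairs:
    # Grid sizes may differ from one example to the next within a task.
    print(f"{len(inp)}x{len(inp[0])} -> {len(out)}x{len(out[0])}")
```

Note that the two training grids have different sizes. Real tasks can do the same, which matters later for any solver that assumes a fixed grid shape.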

In [None]:
def download_arc_tasks(n_tasks=50):
    """Download real ARC-AGI training tasks."""
    
    # ARC dataset from official repo
    base_url = "https://raw.githubusercontent.com/fchollet/ARC-AGI/master/data/training/"
    
    # Get list of task files
    index_url = "https://api.github.com/repos/fchollet/ARC-AGI/contents/data/training"
    
    try:
        response = requests.get(index_url, timeout=30)
        response.raise_for_status()
        files = response.json()
        task_files = [f['name'] for f in files if f['name'].endswith('.json')][:n_tasks]
    except Exception:
        # Fallback: use known task IDs
        task_files = [
            "0a938d79.json", "0b148d64.json", "0ca9ddb6.json",
            "0d3d703e.json", "0dfd9992.json", "1b2d62fb.json",
            "1c786137.json", "1cf80156.json", "1e0a9b12.json",
            "1f85a75f.json", "2013d3e2.json", "2281f1f4.json",
            "228f6490.json", "22eb0ac0.json", "23581191.json",
            "253bf280.json", "25ff71a9.json", "264363fd.json",
            "272f95fa.json", "27a28665.json", "28bf18c6.json",
            "29c11459.json", "29ec7d0e.json", "2dc579da.json",
            "2dee498d.json", "31aa019c.json", "321b1fc6.json",
            "32597951.json", "3428a4f5.json", "3618c87e.json",
        ][:n_tasks]
    
    tasks = []
    print(f"Downloading {len(task_files)} ARC tasks...")
    
    for fname in tqdm(task_files):
        try:
            url = base_url + fname
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            task_data = response.json()
            
            # Convert to our format
            train_examples = []
            for ex in task_data['train']:
                inp = torch.tensor(ex['input'], dtype=torch.long)
                out = torch.tensor(ex['output'], dtype=torch.long)
                train_examples.append((inp, out))
            
            test_examples = []
            for ex in task_data['test']:
                inp = torch.tensor(ex['input'], dtype=torch.long)
                out = torch.tensor(ex['output'], dtype=torch.long)
                test_examples.append((inp, out))
            
            tasks.append({
                'id': fname.replace('.json', ''),
                'train': train_examples,
                'test': test_examples,
            })
        except Exception as e:
            print(f"Failed to load {fname}: {e}")
            continue
    
    print(f"Loaded {len(tasks)} tasks")
    return tasks

# Download tasks
arc_tasks = download_arc_tasks(n_tasks=30)

In [None]:
# Visualize some tasks
def visualize_task(task, max_examples=3):
    """Visualize an ARC task."""
    n_train = min(len(task['train']), max_examples)
    n_test = min(len(task['test']), 1)
    
    fig, axes = plt.subplots(n_train + n_test, 2, figsize=(6, 2*(n_train + n_test)))
    if n_train + n_test == 1:
        axes = [axes]
    
    # Color map for ARC (0-9 colors); plt.cm.get_cmap is gone in newer Matplotlib releases
    cmap = plt.get_cmap('tab10', 10)
    
    for i, (inp, out) in enumerate(task['train'][:max_examples]):
        axes[i][0].imshow(inp.numpy(), cmap=cmap, vmin=0, vmax=9)
        axes[i][0].set_title(f'Train {i+1} Input')
        axes[i][0].axis('off')
        
        axes[i][1].imshow(out.numpy(), cmap=cmap, vmin=0, vmax=9)
        axes[i][1].set_title(f'Train {i+1} Output')
        axes[i][1].axis('off')
    
    for i, (inp, out) in enumerate(task['test'][:1]):
        idx = n_train + i
        axes[idx][0].imshow(inp.numpy(), cmap=cmap, vmin=0, vmax=9)
        axes[idx][0].set_title('Test Input')
        axes[idx][0].axis('off')
        
        axes[idx][1].imshow(out.numpy(), cmap=cmap, vmin=0, vmax=9)
        axes[idx][1].set_title('Test Output (Ground Truth)')
        axes[idx][1].axis('off')
    
    plt.suptitle(f"Task: {task['id']}")
    plt.tight_layout()
    plt.show()

# Show first 3 tasks
print("Sample ARC tasks (these are REAL, not generated):")
for task in arc_tasks[:3]:
    visualize_task(task)

## Part 2: MDL Solver (Honest Version)

**Important limitations:**
- Our primitives are SIMPLE (rotate, flip, etc.)
- Real ARC tasks need COMPLEX operations (fill, count, pattern recognition)
- We EXPECT MDL to fail on most tasks
- That's honest - our primitive set is too limited
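
How limited is this primitive set in practice? A quick count, sketched here with plain nested lists rather than torch tensors (the helper names `PRIMS` and `run` are illustrative, not part of the solver), suggests that a search over 7 primitives up to depth 3 can only ever produce the 8 symmetries of a rectangle, no matter how many programs it tries.

```python
from itertools import product

def flip_h(g): return [row[::-1] for row in g]
def flip_v(g): return [row[:] for row in g[::-1]]
def transpose(g): return [list(col) for col in zip(*g)]
def rot90(g): return flip_v(transpose(g))  # counter-clockwise, like torch.rot90(x, 1)

PRIMS = {
    "id": lambda g: [row[:] for row in g],
    "rot90": rot90,
    "rot180": lambda g: rot90(rot90(g)),
    "rot270": lambda g: rot90(rot90(rot90(g))),
    "flip_h": flip_h,
    "flip_v": flip_v,
    "transpose": transpose,
}

def run(program, grid):
    """Apply a sequence of primitive names to a grid."""
    for name in program:
        grid = PRIMS[name](grid)
    return grid

# Every cell is different, so no transform is hidden by symmetry.
grid = [[1, 2, 3], [4, 5, 6]]

n_programs = 0
distinct = set()
for depth in range(1, 4):
    for program in product(PRIMS, repeat=depth):
        n_programs += 1
        distinct.add(tuple(map(tuple, run(program, grid))))

print(n_programs, "programs searched")   # 399
print(len(distinct), "distinct outputs")  # 8
```

The 399 programs (7 + 49 + 343) collapse to 8 distinct transformations, and all 8 are already reachable at depth 2. Raising `max_depth` therefore adds search cost without adding solvable tasks; only new kinds of primitives can do that.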

In [None]:
class HonestMDLSolver:
    """MDL solver with honest limitations.
    
    We acknowledge:
    - Limited primitive set
    - Can only solve tasks that match our primitives
    - Will fail on most real ARC tasks
    """
    
    def __init__(self, max_depth=3):
        self.max_depth = max_depth
        self.flops = 0
        self.learned_program = None
        self.confidence = 0.0
        
        # Our LIMITED primitives
        self.primitives = {
            'id': lambda x: x.clone(),
            'rot90': lambda x: torch.rot90(x, 1, dims=[-2, -1]),
            'rot180': lambda x: torch.rot90(x, 2, dims=[-2, -1]),
            'rot270': lambda x: torch.rot90(x, 3, dims=[-2, -1]),
            'flip_h': lambda x: torch.flip(x, dims=[-1]),
            'flip_v': lambda x: torch.flip(x, dims=[-2]),
            'transpose': lambda x: x.transpose(-2, -1),
        }
        
    def apply_program(self, grid, program):
        """Apply sequence of operations."""
        result = grid.clone()
        for op in program:
            result = self.primitives[op](result)
            self.flops += result.numel()
        return result
    
    def grids_equal(self, g1, g2):
        """Check grid equality."""
        if g1.shape != g2.shape:
            return False
        return torch.equal(g1, g2)
    
    def train(self, examples):
        """Try to find a program that works for all examples."""
        self.flops = 0
        self.learned_program = None
        self.confidence = 0.0
        
        if not examples:
            return
        
        # Try to find program from first example
        inp0, out0 = examples[0]
        
        # Search for shortest program
        for depth in range(1, self.max_depth + 1):
            for program in product(self.primitives.keys(), repeat=depth):
                try:
                    result = self.apply_program(inp0, program)
                    if self.grids_equal(result, out0):
                        # Found candidate - verify on other examples
                        works_for_all = True
                        for inp, out in examples[1:]:
                            pred = self.apply_program(inp, program)
                            if not self.grids_equal(pred, out):
                                works_for_all = False
                                break
                        
                        if works_for_all:
                            self.learned_program = program
                            self.confidence = 1.0
                            return
                except Exception:
                    continue
        
        # No program found - honest failure
        self.confidence = 0.0
    
    def predict(self, input_grid):
        """Predict using learned program."""
        if self.learned_program is None:
            return None
        
        try:
            return self.apply_program(input_grid, self.learned_program)
        except Exception:
            return None

print("MDL Solver initialized with LIMITED primitives:")
print("  - rotate (90, 180, 270)")
print("  - flip (horizontal, vertical)")
print("  - transpose")
print("")
print("HONEST EXPECTATION: Will fail on most ARC tasks")
print("because real tasks need: fill, count, color mapping, etc.")

## Part 3: Neural Network Solver (Fair Version)

**Giving neural nets a fair chance:**
- Use CNN (appropriate for grids)
- More training epochs
- Data augmentation (listed as a goal; the training loop below does not implement it yet)
- Still limited by few-shot nature of ARC
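
One possible form of the data augmentation listed above is consistent color relabeling: apply the same random permutation of colors to both the input and the output of a pair. The sketch below is an assumption-laden illustration, not the notebook's training code: `permute_colors` is a hypothetical helper, it works on nested lists, and it keeps color 0 fixed on the assumption that 0 is background. It is only valid for tasks whose rule does not depend on specific color identities.

```python
import random

def permute_colors(pair, rng, n_colors=10, keep=(0,)):
    """Relabel colors the same way in both grids of an (input, output) pair.

    Colors listed in `keep` (background 0 by default, an assumption) stay fixed.
    Returns the augmented pair and the color mapping that was used.
    """
    movable = [c for c in range(n_colors) if c not in keep]
    shuffled = movable[:]
    rng.shuffle(shuffled)
    mapping = {c: c for c in keep}
    mapping.update(zip(movable, shuffled))

    def remap(grid):
        return [[mapping[c] for c in row] for row in grid]

    inp, out = pair
    return (remap(inp), remap(out)), mapping

rng = random.Random(0)  # seeded so the augmentation is reproducible
pair = ([[0, 1], [2, 1]], [[1, 0], [1, 2]])
(aug_in, aug_out), mapping = permute_colors(pair, rng)
print(mapping)
print(aug_in, "->", aug_out)
```

Because the mapping is a bijection, cells that shared a color before still share one afterwards, so the spatial structure of the example is preserved.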

In [None]:
class FairNeuralSolver:
    """Neural solver with fair architecture.
    
    Uses CNN (appropriate for grids) and proper training.
    Still limited by few-shot learning problem.
    """
    
    def __init__(self, max_size=30, n_colors=10):
        self.max_size = max_size
        self.n_colors = n_colors
        self.model = None
        self.flops = 0
        
    def build_model(self, in_shape, out_shape):
        """Build CNN appropriate for the task."""
        in_h, in_w = in_shape
        out_h, out_w = out_shape
        
        class GridCNN(nn.Module):
            def __init__(self, n_colors, out_h, out_w):
                super().__init__()
                self.out_h = out_h
                self.out_w = out_w
                self.n_colors = n_colors
                
                # Embedding for colors
                self.embed = nn.Embedding(n_colors, 16)
                
                # CNN encoder
                self.conv1 = nn.Conv2d(16, 32, 3, padding=1)
                self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
                self.conv3 = nn.Conv2d(64, 128, 3, padding=1)
                
                # Adaptive pooling to handle variable sizes
                self.pool = nn.AdaptiveAvgPool2d((8, 8))
                
                # Decoder
                self.fc1 = nn.Linear(128 * 8 * 8, 512)
                self.fc2 = nn.Linear(512, out_h * out_w * n_colors)
                
            def forward(self, x):
                # x: (batch, H, W) of integers
                batch_size = x.shape[0]
                
                # Embed colors: (batch, H, W, 16)
                x = self.embed(x)
                # Rearrange to (batch, 16, H, W)
                x = x.permute(0, 3, 1, 2).contiguous()
                
                # CNN
                x = F.relu(self.conv1(x))
                x = F.relu(self.conv2(x))
                x = F.relu(self.conv3(x))
                
                # Pool and flatten - use reshape instead of view for non-contiguous tensors
                x = self.pool(x)
                x = x.reshape(batch_size, -1)
                
                # Decode
                x = F.relu(self.fc1(x))
                x = self.fc2(x)
                
                # Reshape to (batch, out_h, out_w, n_colors)
                x = x.reshape(batch_size, self.out_h, self.out_w, self.n_colors)
                
                return x
        
        return GridCNN(self.n_colors, out_h, out_w).to(DEVICE)
    
    def train(self, examples, epochs=500):
        """Train on the raw examples with full-batch updates (no augmentation is applied)."""
        self.flops = 0
        
        if not examples:
            return
        
        # Get output shape from first example
        _, out0 = examples[0]
        out_shape = out0.shape
        
        # Check if all outputs have same shape AND all inputs have same shape
        in_shape = examples[0][0].shape
        for inp, out in examples:
            if out.shape != out_shape or inp.shape != in_shape:
                # Variable shapes - can't use this simple approach
                self.model = None
                return
        
        # Build model
        self.model = self.build_model(in_shape, out_shape)
        
        optimizer = torch.optim.Adam(self.model.parameters(), lr=0.001)
        criterion = nn.CrossEntropyLoss()
        
        # Prepare data
        inputs = torch.stack([inp for inp, _ in examples]).to(DEVICE)
        targets = torch.stack([out for _, out in examples]).to(DEVICE)
        
        # Training loop
        self.model.train()
        for epoch in range(epochs):
            optimizer.zero_grad()
            
            # Forward
            outputs = self.model(inputs)  # (batch, H, W, n_colors)
            
            # Reshape for loss
            outputs_flat = outputs.reshape(-1, self.n_colors)
            targets_flat = targets.reshape(-1)
            
            loss = criterion(outputs_flat, targets_flat)
            
            # Backward
            loss.backward()
            optimizer.step()
            
            # Count FLOPs (very rough proxy: 3 ops per parameter per step,
            # ignoring batch size and convolution weight reuse)
            self.flops += sum(p.numel() for p in self.model.parameters()) * 3
        
        self.model.eval()
    
    def predict(self, input_grid):
        """Predict output grid."""
        if self.model is None:
            return None
        
        try:
            with torch.no_grad():
                inp = input_grid.unsqueeze(0).to(DEVICE)
                out = self.model(inp)  # (1, H, W, n_colors)
                pred = out.argmax(dim=-1).squeeze(0).cpu()
                self.flops += sum(p.numel() for p in self.model.parameters())
                return pred
        except Exception:
            # Fail quietly (e.g. on an unexpected input shape) and report "no prediction"
            return None

print("Neural Solver initialized with:")
print("  - CNN architecture (appropriate for grids)")
print("  - Color embedding layer")
print("  - 500 training epochs")
print("")
print("HONEST EXPECTATION: Will struggle with few-shot learning")
print("A handful of examples per task is not enough to learn complex patterns")

## Part 4: Baseline - Random and Constant Predictors

**Important**: We need baselines to know if our methods are better than random.
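
For reference, a baseline that really picks the most frequent training output could be sketched as follows. This is a hypothetical helper on nested lists (`most_common_output` is not defined elsewhere in the notebook); it counts identical output grids with `collections.Counter`.

```python
from collections import Counter

def most_common_output(examples):
    """Return the output grid that appears most often among (input, output) pairs."""
    if not examples:
        return None
    # Nested lists are unhashable, so count tuple-of-tuples keys instead.
    counts = Counter(tuple(map(tuple, out)) for _, out in examples)
    best, _ = counts.most_common(1)[0]
    return [list(row) for row in best]

examples = [
    ([[1]], [[0, 0]]),
    ([[2]], [[3, 3]]),
    ([[4]], [[3, 3]]),
]
print(most_common_output(examples))  # [[3, 3]]
```

When every training output is different, ties are broken in favor of the first one seen, so this reduces to simply returning the first training output.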

In [None]:
class RandomSolver:
    """Random baseline - just outputs random grid."""
    
    def __init__(self):
        self.output_shape = None
        self.flops = 0
        
    def train(self, examples):
        if examples:
            _, out = examples[0]
            self.output_shape = out.shape
        self.flops = 1
    
    def predict(self, input_grid):
        if self.output_shape is None:
            return None
        return torch.randint(0, 10, self.output_shape)


class CopySolver:
    """Copy baseline - just copies input to output."""
    
    def __init__(self):
        self.flops = 0
        
    def train(self, examples):
        self.flops = 1
    
    def predict(self, input_grid):
        return input_grid.clone()


class MostCommonSolver:
    """Output the first training output (a cheap stand-in for the most common one)."""
    
    def __init__(self):
        self.common_output = None
        self.flops = 0
    
    def train(self, examples):
        if examples:
            # Just use first output as "most common"
            _, self.common_output = examples[0]
        self.flops = 1
    
    def predict(self, input_grid):
        return self.common_output.clone() if self.common_output is not None else None

print("Baselines initialized:")
print("  - Random: outputs random grid")
print("  - Copy: copies input to output")
print("  - MostCommon: outputs first training output")

## Part 5: Run Honest Experiment

In [None]:
def evaluate_solver(solver, task):
    """Evaluate a solver on a task."""
    # Train
    solver.train(task['train'])
    
    # Test
    correct = 0
    total = len(task['test'])
    
    for test_inp, test_out in task['test']:
        pred = solver.predict(test_inp)
        
        if pred is not None and pred.shape == test_out.shape:
            if torch.equal(pred, test_out):
                correct += 1
    
    return {
        'correct': correct,
        'total': total,
        'accuracy': correct / total if total > 0 else 0,
        'flops': solver.flops
    }


def run_honest_experiment(tasks):
    """Run experiment on real ARC tasks."""
    
    results = defaultdict(list)
    
    solvers_config = [
        ('MDL', lambda: HonestMDLSolver(max_depth=3)),
        ('CNN', lambda: FairNeuralSolver()),
        ('Random', lambda: RandomSolver()),
        ('Copy', lambda: CopySolver()),
        ('MostCommon', lambda: MostCommonSolver()),
    ]
    
    for task in tqdm(tasks, desc="Evaluating tasks"):
        for name, solver_fn in solvers_config:
            solver = solver_fn()
            result = evaluate_solver(solver, task)
            result['task_id'] = task['id']
            results[name].append(result)
    
    return results

print("Running honest experiment on real ARC tasks...")
print("This may take a few minutes...")
results = run_honest_experiment(arc_tasks)

In [None]:
# Analyze results
def analyze_honest_results(results):
    """Compute summary statistics."""
    summary = {}
    
    for method, trials in results.items():
        total_correct = sum(t['correct'] for t in trials)
        total_tests = sum(t['total'] for t in trials)
        tasks_solved = sum(1 for t in trials if t['accuracy'] == 1.0)
        avg_flops = sum(t['flops'] for t in trials) / len(trials)
        
        summary[method] = {
            'accuracy': total_correct / total_tests if total_tests > 0 else 0,
            'tasks_solved': tasks_solved,
            'total_tasks': len(trials),
            'avg_flops': avg_flops,
        }
    
    return summary

summary = analyze_honest_results(results)

print("\n" + "="*70)
print("HONEST RESULTS ON REAL ARC-AGI TASKS")
print("="*70)
print(f"{'Method':<15} {'Accuracy':>10} {'Tasks Solved':>15} {'Avg FLOPs':>15}")
print("-"*70)

for method in ['MDL', 'CNN', 'Random', 'Copy', 'MostCommon']:
    s = summary[method]
    solved_str = f"{s['tasks_solved']}/{s['total_tasks']}"
    print(f"{method:<15} {s['accuracy']:>10.1%} {solved_str:>15} {s['avg_flops']:>15,.0f}")

In [None]:
# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

methods = ['MDL', 'CNN', 'Random', 'Copy', 'MostCommon']
colors = ['green', 'blue', 'gray', 'orange', 'purple']

# Plot 1: Accuracy
ax1 = axes[0]
accuracies = [summary[m]['accuracy'] * 100 for m in methods]
bars1 = ax1.bar(methods, accuracies, color=colors)
ax1.set_ylabel('Accuracy (%)')
ax1.set_title('Accuracy on Real ARC Tasks')
ax1.set_ylim(0, max(accuracies) * 1.2 + 5)
for bar, acc in zip(bars1, accuracies):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1, 
             f'{acc:.1f}%', ha='center', va='bottom')

# Plot 2: Tasks Solved
ax2 = axes[1]
solved = [summary[m]['tasks_solved'] for m in methods]
bars2 = ax2.bar(methods, solved, color=colors)
ax2.set_ylabel('Tasks Completely Solved')
ax2.set_title('Number of Tasks Solved (100% accuracy)')
for bar, s in zip(bars2, solved):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1, 
             str(s), ha='center', va='bottom')

plt.tight_layout()
plt.savefig('honest_arc_results.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Show which tasks MDL solved (if any)
print("\n" + "="*70)
print("TASKS SOLVED BY MDL")
print("="*70)

mdl_solved = [r for r in results['MDL'] if r['accuracy'] == 1.0]

if mdl_solved:
    print(f"MDL solved {len(mdl_solved)} tasks:")
    for r in mdl_solved:
        print(f"  - {r['task_id']}")
else:
    print("MDL solved 0 tasks.")
    print("This is EXPECTED - our primitives are too simple for real ARC tasks.")

print("\n" + "="*70)
print("TASKS SOLVED BY CNN")
print("="*70)

cnn_solved = [r for r in results['CNN'] if r['accuracy'] == 1.0]

if cnn_solved:
    print(f"CNN solved {len(cnn_solved)} tasks:")
    for r in cnn_solved:
        print(f"  - {r['task_id']}")
else:
    print("CNN solved 0 tasks.")
    print("This is EXPECTED - few-shot learning is hard for neural networks.")

## Part 6: Honest Conclusions
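
Before reading the conclusions, it helps to see how much uncertainty a 30-task sample carries. The sketch below computes an approximate 95% Wilson score interval for a solve rate. The function name `wilson_interval` and the example counts are illustrative assumptions, not values measured in this notebook.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# Hypothetical solve counts out of 30 tasks, for illustration only.
for solved in (0, 1, 3):
    lo, hi = wilson_interval(solved, 30)
    print(f"{solved}/30 solved: {lo:.1%} to {hi:.1%}")
```

With so few tasks the intervals for neighboring counts overlap heavily, so a difference of one or two solved tasks between methods should not be read as evidence that one method is better.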

In [None]:
print("""
======================================================================
HONEST CONCLUSIONS
======================================================================

WHAT WE LEARNED:

(The percentages below are hard-coded rough estimates, not computed from
`results`; compare them with the results table printed above.)

1. MDL with simple primitives:
   - Expected to solve ~0-5% of real ARC tasks
   - Only works when the task matches our limited primitive set
   - LIMITATION: We need more/better primitives

2. Neural networks (CNN):
   - Also expected to solve ~0-5% of real ARC tasks
   - Few-shot learning (a handful of examples per task) is genuinely hard
   - LIMITATION: Need more data or meta-learning

3. Baselines:
   - Random/Copy/MostCommon are expected to solve ~0-2%
   - If our methods beat these, they're learning something
   - If not, they're no better than the trivial baselines

WHAT THIS MEANS FOR THE MDL HYPOTHESIS:

The hypothesis "compression = intelligence" may be TRUE, but:
- Finding the right compression (primitives) is the hard part
- Our simple primitives don't capture ARC's complexity
- Real intelligence needs richer primitive sets

HONEST NEXT STEPS:

1. Expand MDL primitives:
   - Add: fill, flood-fill, count, color-mapping
   - Add: pattern detection, symmetry detection
   - This is where the real work is

2. Better neural approaches:
   - Meta-learning (MAML, etc.)
   - Test-time training (like Mithil's approach)
   - More sophisticated architectures

3. Hybrid approaches:
   - Use neural networks to LEARN primitives
   - Then use MDL to compose them
   - This is likely the path forward

THE REAL INSIGHT:

Neither pure MDL nor pure neural nets solve ARC well.
The challenge is finding the RIGHT LEVEL OF ABSTRACTION.
That's what intelligence really is.
""")

## Bonus: What Would Actually Work?

Based on Mithil Vakde's 27.5% approach and ARC research:

In [None]:
print("""
======================================================================
WHAT ACTUALLY WORKS ON ARC (Based on Research)
======================================================================

1. MITHIL'S MDL APPROACH (27.5%)
   - Uses a transformer to learn compression
   - Joint compression of input+output
   - Test-time training: retrain on each puzzle
   - Key: Learning to COMPRESS, not predict

2. PROGRAM SYNTHESIS APPROACHES (~20-30%)
   - Rich DSL (Domain Specific Language)
   - Operations: objects, relations, transformations
   - Search over program space
   - Key: Right primitives + search

3. HUMAN-LEVEL (~85%)
   - Humans use common sense, analogy, abstraction
   - We recognize objects, patterns, goals
   - We transfer knowledge from experience
   - Key: Massive prior knowledge

THE GAP:

Human: 85%
Best AI: ~35%
Our simple MDL: ~2%

Closing this gap requires:
- Richer primitives (closer to human concepts)
- Better search (compositional, hierarchical)
- Prior knowledge (meta-learning, pretraining)

This is why ARC is used as a benchmark for progress toward AGI.
It's genuinely hard.
""")