# Overview of the ARC Challenge
Author: Sami Halabieh<br>
Date: Mar 14 2025<br>
<br>The Abstraction and Reasoning Corpus (ARC) is a benchmark of visual reasoning puzzles introduced by François Chollet. Each ARC task is a small grid-based puzzle with a few given example input-output pairs and a test input for which the solution must be inferred. There are 400 training tasks and a similar number of evaluation tasks, each with only 3–4 demonstration pairs on average.
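To make the task format concrete, here is a minimal sketch of a task in the `train`/`test` layout of `input`/`output` grids that the loader later in this notebook reads. The grid contents and the helper names `grid_shape` and `describe_task` are invented for illustration; real tasks come from the JSON files.

```python
# A toy task in the ARC JSON layout. Grids are lists of rows; each cell is a
# color index 0-9. The contents are invented for illustration (the hidden rule
# here is a left-right mirror).
toy_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0, 0], [0, 2, 0]], "output": [[0, 0, 2], [0, 2, 0]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [
        {"input": [[0, 0, 4], [4, 0, 0]], "output": [[4, 0, 0], [0, 0, 4]]},
    ],
}

def grid_shape(grid):
    """Return (height, width) of a grid given as a list of rows."""
    return (len(grid), len(grid[0]))

def describe_task(task):
    """Summarize how many demonstration and test pairs a task has, and their shapes."""
    return {
        "n_train": len(task["train"]),
        "n_test": len(task["test"]),
        "train_shapes": [
            (grid_shape(p["input"]), grid_shape(p["output"])) for p in task["train"]
        ],
    }

summary = describe_task(toy_task)
print(summary["n_train"], "demonstration pairs,", summary["n_test"], "test input(s)")
```

Note that grid sizes can differ between pairs of the same task, which is one reason the later cells pad everything to a fixed size.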

The tasks are extremely diverse: some require pixel-level operations (e.g. changing colors in specific locations), others involve object-centric transformations (moving or rotating shapes), and others test higher-level abstractions like symmetry or counting.

 Michael Hodel noted that one can categorize tasks by the level of transformation: grid-level (global pattern), object-level (manipulating entire objects), or cell-level (individual pixel rules). The combination of high diversity and very few examples per task makes ARC especially challenging.

 For humans, these puzzles are relatively straightforward – studies show people solve about 80% of ARC tasks – but for machines they pose a major challenge. Traditional machine learning struggles because there is no large training set of similar problems to learn from (each task is essentially unique).

 A model must generalize from just a handful of examples of a novel task. Pure deep learning models (like standard CNNs or Transformers pre-trained on other data) lack the proper abstraction capabilities for these kinds of visual reasoning.

 On the other hand, brute-force search or hand-crafted logic using a domain-specific language can solve some tasks, but combinatorial explosion makes it intractable to cover the full variety of tasks.

In summary, purely neural approaches tend to overfit or guess randomly due to the scarce data, while purely symbolic approaches get lost in the enormous search space of possible solutions. The current state-of-the-art for ARC hovers around 50% task success, achieved by hybrid approaches that combine learning with symbolic program synthesis. These neuro-symbolic methods leverage machine learning to guide or generate programs, which are then executed to produce the solution. While this is a significant improvement over naive brute force (around 20% success using pure program search), it’s still far from human-level performance, which is why we are drawn to the challenge.

# Michael Hodel’s DSL Repository
To better tackle ARC tasks, Michael Hodel developed a domain-specific language (DSL) tailored for ARC. The purpose of this DSL is to provide a symbolic, high-level way to describe transformations on the grid, using a set of primitives that capture common operations in ARC tasks. Hodel’s DSL is both expressive (able to represent solutions for essentially any ARC task) and compact, consisting of a small number of generic primitives that can be composed for different tasks. In fact, the DSL breaks down complex transformations into sequences of basic operations – 165 elementary operations in total – and Hodel demonstrated its power by manually writing programs in this DSL to solve all 400 training tasks. Some complex tasks required sequences of up to 50–60 DSL operations strung together, but they were representable, confirming the DSL’s coverage.

How the DSL works: Each primitive in the DSL corresponds to a specific transformation or query on the grid. For example, one primitive is `objects(grid, univalued, diagonal, without_bg)`, which extracts connected components (“objects”) from a grid that meet certain criteria (e.g. all cells have the same color, connectivity can be 4-directional only, etc.). Another primitive, `colorfilter(objects, value)`, filters a set of objects by a specified color. There are functional combinators too, such as `rbind(function, fixed)`, which fixes the rightmost argument of a function (useful for partial application), or `compose(outer, inner)`, which chains two functions. Finally, primitives like `fill(grid, value, patch)` paint all cells in a given region (patch) with a certain color. Using such primitives, a DSL program can symbolically specify a solution: for instance, “find all objects of color X that satisfy property Y, then paint those objects with color Z.” The repository provides many primitives (for symmetry, rotation, reflection, counting objects, etc.) to cover varied tasks.

One big advantage of using a DSL is readability and verifiability. A solution in code form is interpretable – one can understand the sequence of operations being applied, which is not the case for a raw neural network mapping. Additionally, the DSL enables systematic search or program synthesis. Instead of learning a direct input-output mapping, one can search for a combination of primitives that turns the input into the output. This was the approach of some early ARC solvers: they searched over a DSL’s program space for a program that fits the given examples. Hodel himself built a program synthesis solver using this DSL for the ARCathon 2022 competition (a kind of ARC mini-challenge), managing to solve some tasks via breadth-first search.
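The composition and search ideas above can be sketched in a few lines. The functions below are simplified stand-ins that borrow primitive names from the DSL (`hmirror`, `vmirror`, `ofcolor`, `fill`, `compose`, `rbind`); they are not Hodel's implementations, and `recolor_cells`, `PRIMITIVES`, and `search_program` are hypothetical helpers introduced only for this example. The search enumerates primitive sequences shortest-first and returns the first one that reproduces every example pair.

```python
from itertools import product

# Simplified stand-ins for a few primitives (NOT the actual DSL code).
# Grids are tuples of tuples so they can be compared and hashed.
def hmirror(grid):
    """Mirror across the horizontal axis (reverse the row order)."""
    return tuple(reversed(grid))

def vmirror(grid):
    """Mirror across the vertical axis (reverse each row)."""
    return tuple(tuple(reversed(row)) for row in grid)

def ofcolor(grid, value):
    """Indices (i, j) of all cells with the given color."""
    return frozenset((i, j) for i, row in enumerate(grid)
                     for j, v in enumerate(row) if v == value)

def fill(grid, value, patch):
    """Paint every cell whose index is in patch with the given color."""
    return tuple(tuple(value if (i, j) in patch else v for j, v in enumerate(row))
                 for i, row in enumerate(grid))

def compose(outer, inner):
    return lambda x: outer(inner(x))

def rbind(function, fixed):
    """Fix the rightmost argument of a two-argument function."""
    return lambda x: function(x, fixed)

def recolor_cells(src, dst):
    """Hypothetical helper: paint all cells of color src with color dst."""
    return lambda grid: fill(grid, dst, ofcolor(grid, src))

rot180 = compose(hmirror, vmirror)   # two mirrors make a 180-degree rotation
cells_of_six = rbind(ofcolor, 6)     # grid -> indices of color 6

PRIMITIVES = {
    "hmirror": hmirror,
    "vmirror": vmirror,
    "recolor_6_to_2": recolor_cells(6, 2),
}

def search_program(examples, max_depth=2):
    """Try all primitive sequences of length 1, then 2, ... (shortest first)."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def run(grid, names=names):
                for name in names:
                    grid = PRIMITIVES[name](grid)
                return grid
            if all(run(inp) == out for inp, out in examples):
                return list(names)
    return None  # nothing found within the depth limit

examples = [
    (((6, 0), (0, 1)), ((0, 2), (1, 0))),
    (((1, 6, 6),), ((2, 2, 1),)),
]
found = search_program(examples)
print(found)
```

With only three primitives the search is instant; the number of candidate sequences grows as the number of primitives raised to the program length, which is the combinatorial explosion mentioned earlier.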
The DSL also makes it easier to generate synthetic data: because the transformations are explicit, one can apply a DSL program to randomly generated inputs to create new input-output pairs that follow the same rule. In fact, Hodel’s RE-ARC project builds on the DSL to procedurally generate an effectively unlimited number of new instances of the 400 training tasks with different grids and colors. This addresses the data scarcity by producing more training examples for learning-based methods.

Compared to purely neural approaches, a DSL-centric approach has the advantage of injecting a strong prior: essentially the DSL encodes knowledge of what operations are useful, reducing the hypothesis space the solver needs to consider. Instead of learning from scratch that “objects can move” or “colors can change,” the DSL already provides those concepts. This makes the search (or learning) more sample-efficient – fewer examples are needed to identify the correct program, since the space of possible programs is constrained to meaningful operations. The DSL approach is also exact – if the program captures the task’s rule, it produces the correct output on every input by construction, unlike a neural net that might output slight mistakes.

The trade-off, however, is that searching in the space of programs can be extremely slow when tasks get complex (the number of possible programs grows combinatorially). That’s why combining DSL-based reasoning with learning (to guide the search or predict the program) is seen as a promising direction.

# Architecture
To leverage both neural pattern recognition and symbolic reasoning, we propose a dual-headed Convolutional Neural Network (CNN) architecture. In this design, a single CNN backbone processes the input grid and then bifurcates into two output “heads” that produce two complementary predictions:
(A) Grid Prediction Head: This head outputs a transformed grid (of the same size as the input) by predicting a class label for each cell. Essentially, it performs per-pixel classification to determine the color of each output pixel. If the ARC output grid is for example a 10×10 grid with colors 0–9, this head would produce a 10×10×(10 colors) output of logits, which can be turned into a 10×10 grid of predicted colors by taking the argmax per cell.
(B) DSL Sequence Head: This head outputs a sequence of DSL tokens (a program) that, when executed, would carry out the transformation from the input to output. The DSL vocabulary is predefined (the set of primitives and maybe some structural tokens), and the head produces a sequence of logits over this vocabulary, for each step in the program. For instance, the vocabulary might include tokens representing primitives like objects, filter, rotate, etc., as well as some markers or end-of-sequence token. The network might be configured to output a fixed-length sequence (with padding if necessary) or use an autoregressive decoder to generate tokens one by one. In our implementation, for simplicity, we use a fixed maximum length and have the model output that many tokens at once (with a special no-op or padding token when the program is shorter).<br>
# CNN Backbone:
The backbone is a series of convolutional layers with ReLU activations (and possibly batch normalization) that transforms the input grid into a set of high-level feature maps. Because ARC grids are small (at most 30×30), we don’t need a very deep network. For example, we could use two or three convolutional layers. We also represent the input grid in a format suitable for CNNs: a common approach is one-hot encoding the grid’s colors into separate channels. For instance, if there are 10 possible colors, the input can be a 10-channel image where each channel is a binary mask for one color (1’s where that color appears, 0 elsewhere). This way, the convolutional filters can easily pick up on color-specific patterns. The CNN’s job is to extract features that are relevant to the transformation – for example, detecting particular shapes or arrangements in the input. After the conv layers, we have a tensor of feature maps (say of size H×W with some number of channels). We then branch into two heads:<br>
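As a framework-free illustration of the one-hot encoding described above, the sketch below turns a grid into one binary mask per color. The function name `one_hot_grid` is made up for this example; the PyTorch cells later in the notebook do the same thing with tensors.

```python
def one_hot_grid(grid, num_colors=10):
    """Return channels[c][i][j], which is 1 where grid[i][j] == c and 0 elsewhere."""
    return [
        [[1 if cell == c else 0 for cell in row] for row in grid]
        for c in range(num_colors)
    ]

grid = [[0, 3],
        [3, 9]]
channels = one_hot_grid(grid)

print(len(channels))   # one channel per color
print(channels[3])     # mask of where color 3 appears
```

Each cell is set in exactly one channel, so summing the masks over the color axis gives a grid of ones.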
# Grid Head:
We use a 1×1 convolution (and possibly upsampling if the feature map is smaller than the input size, though if we keep stride 1 and padding, the feature map can remain the same size as input) to produce an output with depth equal to the number of color classes. This yields a H×W×C tensor of logits, where C is the number of colors. This is essentially treating the task as an image-to-image mapping, like a segmentation network where each pixel is classified into a color category. A softmax can be applied across the color channels for each pixel during training to compute a cross-entropy loss against the true output grid.<br>
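A 1×1 convolution applies the same linear map to the feature vector of every cell, so the grid head can be sketched without any framework. The weights, features, and function names below are invented for illustration (2 features, 3 color classes); the actual model uses `nn.Conv2d` with learned parameters.

```python
def conv1x1(features, weight, bias):
    """Map features[H][W][F] to logits[H][W][C] using weight[C][F] and bias[C]."""
    return [
        [
            [sum(w * f for w, f in zip(w_row, cell)) + b
             for w_row, b in zip(weight, bias)]
            for cell in row
        ]
        for row in features
    ]

def argmax_grid(logits):
    """Pick the highest-scoring color class for every cell."""
    return [[max(range(len(cell)), key=cell.__getitem__) for cell in row]
            for row in logits]

# Toy numbers: 2 features per cell, 3 color classes.
weight = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.5]
features = [
    [[2.0, 1.0], [0.0, 3.0]],
    [[-1.0, -1.0], [0.5, 0.7]],
]

logits = conv1x1(features, weight, bias)
predicted = argmax_grid(logits)
print(predicted)
```

During training the softmax and cross-entropy are applied to `logits`; the argmax is only needed when reading off a predicted grid.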
# DSL Head:
 For the DSL output, we need the network to produce a sequence of tokens. One simple approach is to flatten the spatial feature map (i.e., take all the features from the final conv layer and flatten into a 1D vector) so that we have a fixed-length representation of the input. Then we pass this through one or more fully connected layers to produce the sequence output. In our design, we use a single linear layer that maps the flattened features to a vector of length equal to (max_program_length × vocab_size). We then reshape that vector into a matrix of shape (max_program_length, vocab_size), which can be interpreted as the logits for a sequence of tokens of length max_program_length. For example, if we allow at most 10 tokens in the program and our DSL vocabulary has 50 tokens, the linear layer outputs a 10×50 logits matrix. The first row corresponds to the logits of the first token in the program (over 50 possible tokens), the second row corresponds to the second token, and so on. During inference, we could take argmax on each row to get a predicted sequence, and ideally we would include an end-of-sequence token so that the model can predict when the program ends (any tokens after that could be considered padding and ignored). <br>
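The reshape-and-argmax decoding described above can be sketched as follows. The toy vocabulary, the `<END>` and `<PAD>` token names, and the helpers `decode_program` and `one_hot_row` are assumptions made for this example rather than the notebook's actual vocabulary.

```python
VOCAB = ["<PAD>", "<END>", "objects", "colorfilter", "fill"]  # hypothetical toy vocabulary

def decode_program(flat_logits, max_len, vocab):
    """Reshape a flat logit vector into max_len rows and take the argmax of each row.

    Decoding stops at the first <END>; <PAD> predictions are dropped.
    """
    vocab_size = len(vocab)
    assert len(flat_logits) == max_len * vocab_size
    tokens = []
    for k in range(max_len):
        row = flat_logits[k * vocab_size:(k + 1) * vocab_size]
        token = vocab[max(range(vocab_size), key=row.__getitem__)]
        if token == "<END>":
            break
        if token != "<PAD>":
            tokens.append(token)
    return tokens

def one_hot_row(winner, size):
    """Fake logits where one index clearly wins (illustration only)."""
    return [1.0 if i == winner else 0.0 for i in range(size)]

# Pretend the head predicted: objects, colorfilter, <END>, fill
flat = []
for winner in [2, 3, 1, 4]:
    flat.extend(one_hot_row(winner, len(VOCAB)))

program = decode_program(flat, max_len=4, vocab=VOCAB)
print(program)
```

Anything the head predicts after `<END>` (here the trailing `fill`) is ignored, which mirrors the padding convention described in the text.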
This dual-headed design effectively lets the model learn the task in two ways: directly in the space of outputs (grid) and indirectly via a symbolic program description. The hope is that the two outputs will regularize each other during training. The grid head forces the model to produce precise pixel outputs, while the DSL head forces the model to capture the structure of the transformation. Even if the DSL sequence is not 100% correct, it might guide the CNN to learn features that correspond to meaningful operations (for example, focusing on specific objects in the grid because the DSL supervision signals the need to detect those objects). Likewise, predicting the DSL sequence is a harder, high-level task, so having the pixel-wise loss helps ensure the CNN extracts low-level features correctly. The architecture is neuro-symbolic in that it blends continuous neural prediction with a discrete symbolic output. At inference time, we can use either or both heads: the grid head can output an answer directly, and the DSL head can output a program that we then execute using Hodel’s DSL interpreter to produce an answer. Ideally, both should coincide – i.e., the program when executed yields the same grid the grid-head produced.


#Training Pipeline
Training this dual-headed model requires careful preparation of data and appropriate loss functions for the two types of outputs. Data Preprocessing: We need to convert the ARC tasks into a large set of training examples for supervised learning. Each ARC task provides a few example pairs (input grid, output grid). We treat each example pair as a training sample. (In a more advanced setup, one could do meta-learning where the model sees multiple examples from the same task to internalize the concept, but here we assume a simpler supervised approach where each input->output mapping is learned as an independent mapping, hoping the model learns generalizable patterns across tasks.) Using Hodel’s DSL repository, we can obtain the ground-truth DSL program for each training pair – recall that Hodel wrote solver programs for all training tasks, so for any (input, output) pair from a training task, we have an underlying DSL program that produces that output from that input.

This gives us target sequences for the DSL head. We also one-hot encode input grids as described earlier, so that they can be fed to the CNN. Similarly, it’s convenient to represent the output grid as a 2D array of class labels (color indices) for computing the loss. Because 400 tasks with 3–4 examples each yield only about 1,300 examples, we need to augment and expand the training data to avoid overfitting the neural network.

One strategy is to use synthetic data generation via the DSL. Using the known DSL programs, we can generate new random inputs and compute outputs by executing the DSL program on those inputs. This is essentially what Hodel’s RE-ARC does, creating potentially infinite variations of each task. For example, if a task’s program says “find the red squares and turn them blue,” we can generate dozens of random grids that contain some red squares in different configurations, then apply the program to get the corresponding outputs. These become additional training pairs for the model, all labeled with the same DSL sequence (since the underlying transformation is the same).
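A minimal sketch of this idea, assuming the rule is simplified to "turn every red cell blue" (cells rather than squares, to keep it short). The color indices, the function names, and the size limits are assumptions for illustration; this is not how RE-ARC's generators are written.

```python
import random

RED, BLUE = 2, 1  # assumed color indices; adjust to the palette in use

def recolor_program(grid, src=RED, dst=BLUE):
    """The known rule for this toy task: paint every src cell with dst."""
    return [[dst if c == src else c for c in row] for row in grid]

def generate_pairs(program, n_pairs, rng, max_size=6, num_colors=10):
    """Sample random input grids and label them by executing the program."""
    pairs = []
    for _ in range(n_pairs):
        h, w = rng.randint(2, max_size), rng.randint(2, max_size)
        grid = [[rng.randrange(num_colors) for _ in range(w)] for _ in range(h)]
        # Make sure the rule has something to act on.
        grid[rng.randrange(h)][rng.randrange(w)] = RED
        pairs.append((grid, program(grid)))
    return pairs

rng = random.Random(0)  # seeded so the sample is reproducible
pairs = generate_pairs(recolor_program, 5, rng)
print(len(pairs), "synthetic pairs generated")
```

Every generated pair shares the same DSL label, so the program tokens only need to be stored once per task.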

We can also apply data augmentation in a more generic sense: operations like rotating the entire grid, flipping it horizontally/vertically, or permuting the color palette. ARC’s tasks are usually invariant to such transformations (for instance, rotating the entire puzzle shouldn’t change the core logic of the task, just its manifestation). By augmenting each example in these ways, we increase the diversity of inputs the model sees for the same underlying rule, which should improve generalization. (One must be careful that the augmentation doesn’t violate the task’s logic – e.g., rotating might not make sense if the task involves a specific orientation-sensitive pattern – but color permutations are almost always safe because colors are abstract labels in ARC.)
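The generic augmentations above can be sketched as below. The key point is that the same rotation, flip, and color permutation are applied to both the input and the output of a pair. Keeping color 0 fixed is an assumption (it is treated as background elsewhere in this notebook), and all function names are invented for the example.

```python
import random

def rot90(grid):
    """Rotate a grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def flip_lr(grid):
    """Mirror a grid left-to-right."""
    return [row[::-1] for row in grid]

def permute_colors(grid, mapping):
    """Relabel colors according to a dict {old_color: new_color}."""
    return [[mapping[c] for c in row] for row in grid]

def augment_pair(inp, out, rng, num_colors=10, keep_background=True):
    """Apply one random rotation/flip/color permutation to both grids of a pair."""
    k = rng.randrange(4)
    flip = rng.random() < 0.5
    movable = list(range(1 if keep_background else 0, num_colors))
    shuffled = movable[:]
    rng.shuffle(shuffled)
    mapping = {c: c for c in range(num_colors)}
    mapping.update(zip(movable, shuffled))

    def apply(grid):
        for _ in range(k):
            grid = rot90(grid)
        if flip:
            grid = flip_lr(grid)
        return permute_colors(grid, mapping)

    return apply(inp), apply(out)

rng = random.Random(0)
aug_in, aug_out = augment_pair([[0, 1, 2], [3, 4, 5]], [[0, 1, 2], [3, 4, 5]], rng)
print(aug_in)
```

One caveat when DSL labels are attached: a geometric augmentation can change the program that explains the pair (rotating both grids of a horizontal-mirror task by 90 degrees yields a vertical-mirror task), and a color permutation changes any color constants in the program, so the target tokens may need to be adjusted accordingly.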

# Loss Functions
We train the model to minimize a combination of losses from the two heads:
- For the grid output head, we use a pixel-wise cross-entropy loss. This compares the predicted class distribution for each pixel to the true class (color) at that pixel. We sum (or average) this over all pixels in the output grid. Essentially this treats each output cell as an independent classification problem (which in implementation is done by flattening the grid and applying cross-entropy). If the output grid has size N×M, there are K color classes, and the batch size is B, the loss is <br><br>
 $\frac{1}{B}\sum_{b=1}^B \frac{1}{NM}\sum_{i,j} \ell_{\text{CE}}\big(p_{b,i,j},\, y_{b,i,j}\big)$ <br><br>
 where $p_{b,i,j}$ is the predicted distribution over the K colors for pixel (i,j) and $y_{b,i,j}$ is the true color.
- For the DSL sequence head, we use a sequence cross-entropy loss: we have the ground-truth token sequence for the program, and we compute cross-entropy at each position of the sequence. (With an autoregressive decoder this would be done under teacher forcing; with the fixed-length head used here, all positions are predicted in a single pass.) For example, if the true DSL program is [objects, colorfilter, fill, END, PAD, PAD] (assuming a fixed length of 6 with END and PAD tokens for termination and padding), and the model predicted logits for 6 time-steps over the vocabulary, we compute the loss for each time-step where a real token or END is expected. We mask out the loss for padding positions (so the model isn’t punished for what it predicts after the end of the program). If we denote the target sequence of length L (including END) as $t_1, t_2, ..., t_L$, and the model’s predicted probability for token $t_k$ at position k as $q_{k}(t_k)$, then, adding a batch index b, the sequence loss can be
<br> <br>
$\frac{1}{B}\sum_{b=1}^B \frac{1}{L_b}\sum_{k=1}^{L_b} -\log q_{b,k}(t_{b,k})$
 <br><br> (where $L_b$ is the length of sequence b, $t_{b,k}$ is its k-th target token, and we don’t count padded steps). In practice, we implement this by shaping the prediction as (Batch × SeqLength, VocabSize) and the target as (Batch × SeqLength,) and using cross-entropy with an ignore index for the padding token.
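Both formulas can be written out directly in plain Python, which is a useful cross-check against a framework implementation. This is a sketch with invented function names, and it assumes every sequence has at least one non-padding target.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def grid_cross_entropy(logits, target):
    """logits[B][N][M][K], target[B][N][M]: mean over the batch of the per-grid mean CE."""
    total = 0.0
    for grid_logits, grid_target in zip(logits, target):
        cell_losses = [
            -log_softmax(cell)[y]
            for row_logits, row_target in zip(grid_logits, grid_target)
            for cell, y in zip(row_logits, row_target)
        ]
        total += sum(cell_losses) / len(cell_losses)
    return total / len(logits)

def sequence_cross_entropy(logits, target, pad_idx):
    """logits[B][L][V], target[B][L]: padded positions are skipped.

    Averages over the L_b real positions of each sequence, then over the batch.
    """
    total = 0.0
    for seq_logits, seq_target in zip(logits, target):
        step_losses = [
            -log_softmax(row)[t]
            for row, t in zip(seq_logits, seq_target)
            if t != pad_idx
        ]
        total += sum(step_losses) / len(step_losses)
    return total / len(logits)

# Uniform logits over 4 classes give a loss of log(4) per position.
grid_loss = grid_cross_entropy([[[[0.0] * 4, [0.0] * 4]]], [[[1, 3]]])
seq_loss = sequence_cross_entropy([[[0.0] * 4] * 3], [[1, 2, 0]], pad_idx=0)
print(round(grid_loss, 4), round(seq_loss, 4))
```

One difference worth knowing: PyTorch's `nn.CrossEntropyLoss(ignore_index=...)` with the default mean reduction averages over all non-ignored positions in the flattened batch, which only equals the per-sequence average above when all sequences have the same length.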


# More on Training
The total loss is a weighted sum of the grid loss and sequence loss. We can simply weight them equally (just add them) or adjust the weights if one learning signal is more important. In our experiments, treating them with equal weight works, since both are cross-entropy losses of comparable scale. By minimizing this combined loss, the model learns to produce both correct output grids and correct DSL programs for the training examples.

Training Procedure: We iterate over the training examples (augmented as described) and perform standard gradient-based optimization (e.g., using the Adam or SGD optimizer). In each iteration, we:
- Run a forward pass: feed the input grid through the CNN to get both outputs.
- Compute the grid loss and DSL sequence loss against the known targets.
- Backpropagate to compute gradients for all parameters (the backbone and both heads).
- Update the weights using the optimizer.

Because the dataset (after augmentation) can still be relatively small, we must be cautious about overfitting. Techniques like early stopping or using a validation split (perhaps taking a subset of tasks as “dev” tasks not to train on) are useful. Also, we should shuffle tasks and mix data so the model doesn’t see one task’s examples all in a row (to avoid it simply memorizing per-task idiosyncrasies).
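The two safeguards mentioned above, a task-level dev split and early stopping, could look roughly like this. `split_by_task` and `EarlyStopping` are hypothetical helpers written for this sketch, and the patience value and loss numbers are arbitrary.

```python
import random

def split_by_task(samples, dev_fraction, rng):
    """Hold out whole tasks for dev. samples is a list of (task_id, input, output)."""
    task_ids = sorted({s[0] for s in samples})
    rng.shuffle(task_ids)
    n_dev = max(1, int(len(task_ids) * dev_fraction))
    dev_ids = set(task_ids[:n_dev])
    train = [s for s in samples if s[0] not in dev_ids]
    dev = [s for s in samples if s[0] in dev_ids]
    return train, dev

class EarlyStopping:
    """Signal a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Ten toy tasks with two samples each; grids are omitted (None) for brevity.
samples = [(f"task{i}", None, None) for i in range(10) for _ in range(2)]
train, dev = split_by_task(samples, dev_fraction=0.2, rng=random.Random(0))

stopper = EarlyStopping(patience=2)
decisions = [stopper.should_stop(v) for v in [1.0, 0.8, 0.9, 0.85]]
print(len(train), len(dev), decisions)
```

Splitting by task rather than by sample keeps the dev tasks entirely unseen, which is closer to how ARC itself is evaluated.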
Evaluation Criteria: During training (and especially for final evaluation), we consider a task “solved” by the model if it can produce the correct output for the test input of that task. There are two ways our model can produce an output:
- Direct Grid Prediction: We take the grid head’s output for the test input and compare it to the expected output grid cell by cell. If every pixel matches, then the solution is correct. (ARC tasks typically require an exact match of the entire output grid.)
- DSL Execution: We take the DSL sequence predicted by the model’s second head, and run it through Michael Hodel’s DSL interpreter to generate an output grid. We then compare that output grid to the expected output. If they match, the task is solved (via the program).

In principle, if the model has learned well, both the direct grid and the executed DSL should match the expected output. In practice, we might get cases where the grid head gets it right but the DSL sequence is slightly off (or vice versa). Because the DSL is a discrete output, even a single token error can cause the executed result to be wrong or nonsensical. However, one nice property is that the DSL execution provides a guaranteed check: even if the program uses a different sequence of operations than the ground-truth, as long as the final result matches the target, we consider it a valid solution (ARC is agnostic to how you get the answer, only that the answer is correct). We therefore ultimately evaluate the model by checking if either of its outputs yields the correct grid. For quantitative evaluation across many tasks, we can report:
- Grid accuracy: the percentage of test input grids (from various tasks) for which the grid head’s output is exactly correct.
- Program accuracy: the percentage of cases where the DSL head’s output, when executed, yields the correct output.
- Task success rate: the percentage of tasks for which the model got the correct output on the test input (some tasks might have multiple test inputs; depending on ARC evaluation, usually you need all test outputs correct to count the task as solved).

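The three metrics can be computed from per-test-input records as in the sketch below. The record layout (`task_id`, `grid_ok`, `program_ok`) and the function name `summarize` are assumptions for illustration; a task counts as solved only if every one of its test inputs is solved by at least one head.

```python
def summarize(results):
    """results: one dict per test input with keys task_id, grid_ok, program_ok."""
    n = len(results)
    grid_accuracy = sum(r["grid_ok"] for r in results) / n
    program_accuracy = sum(r["program_ok"] for r in results) / n

    solved_by_task = {}
    for r in results:
        solved = r["grid_ok"] or r["program_ok"]
        # A task stays solved only while all of its test inputs are solved.
        solved_by_task[r["task_id"]] = solved_by_task.get(r["task_id"], True) and solved
    task_success_rate = sum(solved_by_task.values()) / len(solved_by_task)

    return {
        "grid_accuracy": grid_accuracy,
        "program_accuracy": program_accuracy,
        "task_success_rate": task_success_rate,
    }

# Toy records: task "b" has two test inputs and fails one of them.
results = [
    {"task_id": "a", "grid_ok": True, "program_ok": False},
    {"task_id": "b", "grid_ok": True, "program_ok": True},
    {"task_id": "b", "grid_ok": False, "program_ok": False},
    {"task_id": "c", "grid_ok": False, "program_ok": True},
]
metrics = summarize(results)
print(metrics)
```

Here `grid_ok` and `program_ok` would come from exact, cell-by-cell comparison of each head's output with the expected grid.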
Additionally, because the model produces interpretable programs, we can do a symbolic verification: using Hodel’s DSL engine, we can simulate the predicted program on all example inputs to ensure it indeed transforms those correctly (this is another check one could incorporate – e.g., if the model proposes a program, one could verify it against the known examples and perhaps choose between multiple hypotheses, akin to how a human would test their reasoning on the examples). In summary, the training pipeline leverages the DSL to generate lots of training data and supervision signals, and uses multi-task learning (pixel-wise and program-wise) to train a CNN that generalizes the concept of ARC transformations. With enough synthetic data, we hope to overcome the small sample size issue and let the neural network learn the “language of transformations” that the DSL provides, enabling it to solve unseen tasks.
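The symbolic verification step can be sketched as a filter over candidate programs: keep the first one that reproduces every demonstration pair. The candidates below are plain Python callables standing in for executed DSL programs, and `verify_program` and `pick_verified` are hypothetical helper names.

```python
def verify_program(program, demo_pairs):
    """True if the program reproduces every demonstration output.

    A program that crashes on any input is treated as a failed hypothesis.
    """
    try:
        return all(program(inp) == out for inp, out in demo_pairs)
    except Exception:
        return False

def pick_verified(candidates, demo_pairs):
    """Return the name of the first candidate that passes verification, else None."""
    for name, program in candidates:
        if verify_program(program, demo_pairs):
            return name
    return None

# Stand-ins for decoded programs; "broken" indexes outside the grid and raises.
candidates = [
    ("broken", lambda g: g[5][5]),
    ("hmirror", lambda g: g[::-1]),
    ("vmirror", lambda g: [row[::-1] for row in g]),
]
demo_pairs = [
    ([[1, 2], [3, 4]], [[2, 1], [4, 3]]),
    ([[5, 0, 0]], [[0, 0, 5]]),
]
chosen = pick_verified(candidates, demo_pairs)
print(chosen)
```

With a sequence head that can produce several hypotheses (for example, the top few tokens per position), the same filter would choose among them before the test input is ever touched.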

In [None]:
# Clone directly into re_arc/ so the folder name is a valid Python identifier
!git clone https://github.com/michaelhodel/re-arc.git re_arc

In [None]:
import sys
sys.path.append("re_arc")
import dsl


In [None]:
%run re_arc/dsl.py


In [None]:
import zipfile
import os

zip_path = "re_arc/arc_original.zip"
extract_to = "arc_data"

# Create the target directory if it doesn't exist.
os.makedirs(extract_to, exist_ok=True)

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_to)

print(f"Extracted files to: {extract_to}")


Extracted files to: arc_data


In [None]:
import json
import os

def load_official_arc_data(folder_path):
    """
    folder_path: directory with the 400 .json task files
    Returns: a list of (task_id, input_grid, output_grid) tuples,
    one per demonstration pair in each task's "train" list
    """
    samples = []
    # Sort file names so the sample order is reproducible across runs
    for fn in sorted(os.listdir(folder_path)):
        if fn.endswith(".json"):
            task_id = fn.replace(".json", "")
            with open(os.path.join(folder_path, fn), 'r') as f:
                data = json.load(f)
            # data["train"] is a list of {input, output} pairs
            for pair in data["train"]:
                inp = pair["input"]
                out = pair["output"]
                samples.append((task_id, inp, out))
    return samples

# Example usage (the zip extracts into an arc_original/ subfolder)
arc_folder = "arc_data/arc_original/training"
arc_samples = load_official_arc_data(arc_folder)
print(f"Loaded {len(arc_samples)} training examples.")


In [None]:
arc_folder = "arc_data/arc_original/training"
arc_samples = load_official_arc_data(arc_folder)
print(f"Loaded {len(arc_samples)} training examples.")


Loaded 1301 training examples.


In [None]:
import os
import zipfile

zip_path = "re_arc/re_arc.zip"

extract_path = "/content/rearcdata"

# Create the target directory if it doesn't exist.
os.makedirs(extract_path, exist_ok=True)

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

print(f"Files extracted to {extract_path}")


Files extracted to /content/rearcdata


In [None]:
import numpy as np
import random

# Define colors and background
ALL_COLORS = list(range(10))  # 0-9 possible colors
BG_COLOR = 0  # background color (black in ARC)

# Task transformation functions
def horizontal_mirror(grid):
    # Flip vertically (top-to-bottom)
    return np.flipud(grid)

def vertical_mirror(grid):
    # Flip horizontally (left-to-right)
    return np.fliplr(grid)

def rotate180(grid):
    return np.rot90(grid, k=2)  # rotate 180 = 2 * 90

def replace_color(grid, src=6, dst=2):
    out = grid.copy()
    out[out == src] = dst
    return out

def upscale(grid, factor=2):
    # Repeat each cell in a factor x factor block
    return np.repeat(np.repeat(grid, factor, axis=0), factor, axis=1)

# Dictionary of task specifications
tasks = {
    "hmirror": {
        "transform": horizontal_mirror,
        "dsl_tokens": ["hmirror"]
    },
    "vmirror": {
        "transform": vertical_mirror,
        "dsl_tokens": ["vmirror"]
    },
    "rot180": {
        "transform": rotate180,
        "dsl_tokens": ["rot180"]
    },
    "replace_6_2": {
        "transform": lambda g: replace_color(g, 6, 2),
        "dsl_tokens": ["replace", "6", "2"]
    },
    "upscale2": {
        "transform": lambda g: upscale(g, 2),
        "dsl_tokens": ["upscale", "2"]
    },
    "upscale3": {
        "transform": lambda g: upscale(g, 3),
        "dsl_tokens": ["upscale", "3"]
    }
}

# Generate random grids and apply transformations
def generate_examples(task_key, n_examples=50):
    """Generate (input, output, program_tokens) examples for the given task."""
    transform_fn = tasks[task_key]["transform"]
    dsl_tokens = tasks[task_key]["dsl_tokens"]
    examples = []
    for _ in range(n_examples):
        # Random grid size: at least 3x3, at most 15x15 (10x10 for upscale3,
        # so that the upscaled output still fits in the 30x30 padded grid)
        if task_key.startswith("upscale"):
            max_size = 10 if "3" in task_key else 15  # upscale3 uses smaller inputs
        else:
            max_size = 15
        h = random.randint(3, max_size)
        w = random.randint(3, max_size)
        # Random grid content
        grid = np.random.choice(ALL_COLORS, size=(h, w))
        # Ensure special conditions (e.g., include color 6 for replace task)
        if task_key == "replace_6_2":
            # ensure at least one 6 in grid
            if 6 not in grid:
                # place a 6 at a random position
                rh, rw = random.randrange(h), random.randrange(w)
                grid[rh, rw] = 6
        # Apply transformation
        output_grid = transform_fn(grid)
        # Pad input and output to 30x30
        H, W = grid.shape
        outH, outW = output_grid.shape
        pad_input = np.full((30, 30), BG_COLOR, dtype=int)
        pad_output = np.full((30, 30), BG_COLOR, dtype=int)
        pad_input[:H, :W] = grid
        pad_output[:outH, :outW] = output_grid
        examples.append((pad_input, pad_output, dsl_tokens))
    return examples

# Create training and validation splits
train_data = []
val_data = []
for task_key in tasks.keys():
    examples = generate_examples(task_key, n_examples=100)  # generate 100 examples per task
    # 80% train, 20% val
    split = int(0.8 * len(examples))
    train_data += examples[:split]
    val_data += examples[split:]

print(f"Generated {len(train_data)} training examples and {len(val_data)} validation examples.")
# Example: inspect one sample (input and output shapes, DSL)
sample_in, sample_out, sample_dsl = train_data[0]
print("Input shape:", sample_in.shape, "Output shape:", sample_out.shape, "DSL:", sample_dsl)


Generated 480 training examples and 120 validation examples.
Input shape: (30, 30) Output shape: (30, 30) DSL: ['hmirror']


In [None]:
import torch
import torch.nn as nn

# Define the DSL vocabulary and utility for encoding DSL sequences
DSL_VOCAB = ["<PAD>", "hmirror", "vmirror", "rot180", "replace", "upscale", "2", "3", "6"]
vocab_size = len(DSL_VOCAB)  # number of token types
token_to_idx = {tok: i for i, tok in enumerate(DSL_VOCAB)}
pad_idx = token_to_idx["<PAD>"]

# Utility: encode DSL token list to tensor of indices (length 3 with padding)
def encode_dsl(tokens):
    idxs = [token_to_idx[t] for t in tokens]
    # pad to length 3
    if len(idxs) < 3:
        idxs = idxs + [pad_idx] * (3 - len(idxs))
    return torch.tensor(idxs[:3], dtype=torch.long)

# Create training and validation tensors
# (Convert to torch tensors: one-hot input, target grid, target DSL sequence)
def one_hot_encode_grid(grid):
    # grid is 30x30 numpy array of ints
    tensor = torch.tensor(grid, dtype=torch.long)
    # One-hot along the color channel (will be 30x30x10, then permute to 10x30x30)
    one_hot = nn.functional.one_hot(tensor, num_classes=10).permute(2,0,1).float()
    return one_hot

train_inputs = []
train_grid_targets = []
train_dsl_targets = []
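# Note: the loop variable is named dsl_tokens so it does not shadow the imported dsl module
for inp, out, dsl_tokens in train_data:
    train_inputs.append(one_hot_encode_grid(inp))
    train_grid_targets.append(torch.tensor(out, dtype=torch.long))
    train_dsl_targets.append(encode_dsl(dsl_tokens))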
# Stack into tensors
train_inputs = torch.stack(train_inputs)        # shape [N, 10, 30, 30]
train_grid_targets = torch.stack(train_grid_targets)  # shape [N, 30, 30]
train_dsl_targets = torch.stack(train_dsl_targets)    # shape [N, 3]

val_inputs = []
val_grid_targets = []
val_dsl_targets = []
for inp, out, dsl_tokens in val_data:  # avoid shadowing the imported dsl module
    val_inputs.append(one_hot_encode_grid(inp))
    val_grid_targets.append(torch.tensor(out, dtype=torch.long))
    val_dsl_targets.append(encode_dsl(dsl_tokens))
val_inputs = torch.stack(val_inputs)
val_grid_targets = torch.stack(val_grid_targets)
val_dsl_targets = torch.stack(val_dsl_targets)

print("Training tensor shapes:", train_inputs.shape, train_grid_targets.shape, train_dsl_targets.shape)

# Define the Dual-Head CNN model
class DualHeadCNN(nn.Module):
    def __init__(self):
        super(DualHeadCNN, self).__init__()
        # Convolutional backbone
        self.conv1 = nn.Conv2d(in_channels=10, out_channels=32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2)  # reduce 30x30 -> 15x15
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1)
        # Fully connected to embedding
        self.flatten = nn.Flatten()
        self.fc_emb = nn.Linear(64 * 15 * 15, 128)  # compress to embedding vector

        # Grid head (CNN decoder)
        self.upconv = nn.ConvTranspose2d(in_channels=64, out_channels=32, kernel_size=4, stride=2, padding=1)
        self.conv_out = nn.Conv2d(in_channels=32, out_channels=10, kernel_size=1)

        # DSL head (token classifier for each of 3 positions)
        self.fc_dsl = nn.Linear(128, vocab_size * 3)  # outputs logits for 3 tokens

    def forward(self, x):
        # x: [batch, 10, 30, 30] one-hot input grid
        # Backbone
        x = nn.functional.relu(self.conv1(x))
        x = nn.functional.relu(self.conv2(x))
        x = self.pool(x)
        x = nn.functional.relu(self.conv3(x))
        # Save feature map for grid head decoding
        feat_map = x  # [batch, 64, 15, 15]
        # Flatten for DSL head
        emb = self.flatten(x)         # [batch, 64*15*15]
        emb = nn.functional.relu(self.fc_emb(emb))  # [batch, 128]

        # Grid head decoding
        x_up = nn.functional.relu(self.upconv(feat_map))  # [batch, 32, 30, 30]
        grid_logits = self.conv_out(x_up)                 # [batch, 10, 30, 30]

        # DSL head output
        dsl_logits = self.fc_dsl(emb)                     # [batch, 3 * vocab_size]
        # Reshape DSL logits to [batch, 3, vocab_size]
        dsl_logits = dsl_logits.view(-1, 3, vocab_size)
        return grid_logits, dsl_logits

# Instantiate model
model = DualHeadCNN()


Training tensor shapes: torch.Size([480, 10, 30, 30]) torch.Size([480, 30, 30]) torch.Size([480, 3])


In [None]:
# Training setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Loss functions
grid_loss_fn = nn.CrossEntropyLoss()  # for grid output
dsl_loss_fn = nn.CrossEntropyLoss(ignore_index=pad_idx)  # for DSL output (ignore pad token)

# Convert data to device
train_inputs, train_grid_targets, train_dsl_targets = train_inputs.to(device), train_grid_targets.to(device), train_dsl_targets.to(device)
val_inputs, val_grid_targets, val_dsl_targets = val_inputs.to(device), val_grid_targets.to(device), val_dsl_targets.to(device)

# Training loop
epochs = 20
batch_size = 16

def get_batches(X, y_grid, y_dsl, batch_size):
    # Generator to yield mini-batches
    for i in range(0, len(X), batch_size):
        yield X[i:i+batch_size], y_grid[i:i+batch_size], y_dsl[i:i+batch_size]

for epoch in range(1, epochs+1):
    model.train()
    total_loss = 0.0
    # Shuffle training data indices for each epoch
    perm = torch.randperm(train_inputs.size(0), device=train_inputs.device)  # keep indices on the same device as the data
    X_shuffled = train_inputs[perm]
    grid_shuffled = train_grid_targets[perm]
    dsl_shuffled = train_dsl_targets[perm]
    # Mini-batch training
    for X_batch, grid_batch, dsl_batch in get_batches(X_shuffled, grid_shuffled, dsl_shuffled, batch_size):
        optimizer.zero_grad()
        grid_logits, dsl_logits = model(X_batch)
        # Calculate losses
        loss_grid = grid_loss_fn(grid_logits, grid_batch)            # grid logits: [B,10,H,W], grid_batch: [B,H,W]
        loss_dsl = dsl_loss_fn(dsl_logits.view(-1, vocab_size), dsl_batch.view(-1))
        loss = loss_grid + loss_dsl
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * X_batch.size(0)
    avg_loss = total_loss / train_inputs.size(0)

    # Validation
    model.eval()
    with torch.no_grad():
        grid_logits_val, dsl_logits_val = model(val_inputs)
        val_grid_loss = grid_loss_fn(grid_logits_val, val_grid_targets).item()
        val_dsl_loss = dsl_loss_fn(dsl_logits_val.view(-1, vocab_size), val_dsl_targets.view(-1)).item()
    if epoch % 5 == 0 or epoch == epochs:
        print(f"Epoch {epoch}/{epochs}: Train Loss = {avg_loss:.4f}, Val Grid Loss = {val_grid_loss:.4f}, Val DSL Loss = {val_dsl_loss:.4f}")


Epoch 5/20: Train Loss = 2.0016, Val Grid Loss = 0.7452, Val DSL Loss = 1.2831
Epoch 10/20: Train Loss = 1.8796, Val Grid Loss = 0.7653, Val DSL Loss = 1.3195
Epoch 15/20: Train Loss = 1.7165, Val Grid Loss = 0.7334, Val DSL Loss = 1.4100
Epoch 20/20: Train Loss = 1.3619, Val Grid Loss = 0.7338, Val DSL Loss = 1.8157
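
The DSL loss in the training cell is built with `ignore_index=pad_idx`, so padded positions in a program target contribute nothing, and the mean is taken over the remaining tokens only. The framework-free sketch below illustrates that behaviour. The function name `masked_cross_entropy`, the toy logits, and the choice `pad=0` are assumptions for illustration, not values from the notebook.

```python
import math


def masked_cross_entropy(logits, targets, pad):
    """Mean negative log-likelihood over positions whose target is not `pad`.

    logits:  list of per-position score lists (one score per class)
    targets: list of class indices, with `pad` marking positions to skip
    """
    losses = []
    for scores, target in zip(logits, targets):
        if target == pad:
            continue  # padded position: excluded from both sum and count
        m = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_norm - scores[target])
    # With no valid positions the mean is undefined
    return sum(losses) / len(losses) if losses else float("nan")


# Three token positions, three classes; the last target is padding (pad = 0)
logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
targets = [1, 2, 0]
loss = masked_cross_entropy(logits, targets, pad=0)

# Changing the scores at the padded position leaves the loss unchanged
logits_alt = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [-3.0, 7.0, 1.0]]
assert math.isclose(loss, masked_cross_entropy(logits_alt, targets, pad=0))
```

Because only real tokens are averaged, short programs are not rewarded for trivially predicting padding.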


In [None]:
# Evaluation on validation set
model.eval()
with torch.no_grad():
    # Get predictions
    grid_logits, dsl_logits = model(val_inputs)
    # Predicted grids: choose argmax color at each cell
    pred_grids = grid_logits.argmax(dim=1)  # [N, 30, 30]
    # Predicted DSL tokens: argmax for each of the 3 positions
    pred_dsl_tokens = dsl_logits.argmax(dim=2)  # [N, 3]

# Compute grid accuracy
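# Flatten the spatial dims before reducing; a tuple `dim` in `.all()` is only accepted by newer PyTorch releases
exact_matches = (pred_grids == val_grid_targets).flatten(start_dim=1).all(dim=1)  # [N] bool, True for a perfect match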
grid_exact_acc = exact_matches.float().mean().item()

# Compute DSL program accuracy
# An example's program is correct if all tokens (ignoring pads) match
dsl_correct = []
token_correct = 0
total_tokens = 0
for i in range(val_dsl_targets.size(0)):
    true_seq = val_dsl_targets[i].tolist()
    pred_seq = pred_dsl_tokens[i].tolist()
    # Remove padding for comparison
    # (find first pad in true_seq; in our data, true_seq has pad only after actual tokens)
    if pad_idx in true_seq:
        true_len = true_seq.index(pad_idx)
    else:
        true_len = len(true_seq)
    # Compare sequences up to true_len
    if pred_seq[:true_len] == true_seq[:true_len]:
        dsl_correct.append(1)
    else:
        dsl_correct.append(0)
    # Token-wise accuracy
    for t_true, t_pred in zip(true_seq[:true_len], pred_seq[:true_len]):
        if t_true == t_pred:
            token_correct += 1
        total_tokens += 1

prog_acc = np.mean(dsl_correct)
token_acc = token_correct / total_tokens
print(f"Grid Exact Match Accuracy: {grid_exact_acc*100:.1f}%")
print(f"Program Exact Match Accuracy: {prog_acc*100:.1f}%")
print(f"Program Token Accuracy: {token_acc*100:.1f}%")


Grid Exact Match Accuracy: 38.3%
Program Exact Match Accuracy: 79.8%
Program Token Accuracy: 84.1%
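
Program exact match (79.8%) is lower than token accuracy (84.1%) because one wrong token invalidates the whole program while its other tokens still count as correct. The toy sketch below computes both metrics the same way as the evaluation loop, on hand-made sequences. The `PAD` value, the token ids, and the helper name `program_metrics` are illustrative assumptions.

```python
PAD = 0  # hypothetical pad token id for this toy example


def program_metrics(true_seqs, pred_seqs, pad=PAD):
    """Return (program exact-match accuracy, token accuracy), ignoring padded positions."""
    exact, tokens_ok, tokens_total = 0, 0, 0
    for true_seq, pred_seq in zip(true_seqs, pred_seqs):
        # Real tokens end at the first pad in the target sequence
        true_len = true_seq.index(pad) if pad in true_seq else len(true_seq)
        t, p = true_seq[:true_len], pred_seq[:true_len]
        exact += int(t == p)
        tokens_ok += sum(a == b for a, b in zip(t, p))
        tokens_total += true_len
    return exact / len(true_seqs), tokens_ok / tokens_total


true_seqs = [[3, 4, PAD], [5, 6, 7]]
pred_seqs = [[3, 4, 9], [5, 6, 8]]  # first program correct, second wrong in its last token
prog_acc, token_acc = program_metrics(true_seqs, pred_seqs)
print(prog_acc, token_acc)  # 0.5 0.8
```

Note that, as in the evaluation cell, whatever the model predicts at padded positions is ignored, so the prediction `9` in the first sequence does not count against it.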
