# Create test time fine-tuning datasets

## Goal

In this notebook I will create different versions of the evaluation dataset that could be used for test time fine-tuning.

The only requirement is that the test outputs are not used, because they are not available at submission time.

## Imports

In [19]:
import os
import json
import random
import matplotlib.pyplot as plt
from matplotlib import colors
import matplotlib as mpl
import numpy as np

plt.plot()
plt.close('all')
plt.rcParams["figure.figsize"] = (20, 5)
mpl.rcParams['lines.linewidth'] = 3
mpl.rcParams['font.size'] = 16

## Code

In [None]:
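def plot_task(task, task_id):
    """Plot the inputs and then the outputs of all train and test samples.

    Every sample, including the test ones, must have an 'output' key.
    """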
    all_samples = task['train'] + task['test']
    for plot_idx, sample in enumerate(all_samples):
        plt.subplot(1, len(all_samples), plot_idx+1)
        plot_grid(sample['input'])
        if plot_idx < len(task['train']):
            plt.title(f'train {plot_idx}')
        else:
            plt.title(f'test {plot_idx-len(task["train"])}')
    plt.suptitle(f'Inputs for task {task_id}')
    plt.show()
    for plot_idx, sample in enumerate(all_samples):
        plt.subplot(1, len(all_samples), plot_idx+1)
        plot_grid(sample['output'])
        if plot_idx < len(task['train']):
            plt.title(f'train {plot_idx}')
        else:
            plt.title(f'test {plot_idx-len(task["train"])}')
    plt.suptitle(f'Outputs for task {task_id}')
    plt.show()


def plot_grid(grid):
    grid = np.array(grid)
    cmap = colors.ListedColormap(
        ['#000000', '#0074D9','#FF4136','#2ECC40','#FFDC00',
         '#AAAAAA', '#F012BE', '#FF851B', '#7FDBFF', '#870C25'])
    norm = colors.Normalize(vmin=0, vmax=9)
    plt.imshow(grid, cmap=cmap, norm=norm)
    plt.grid(True,which='both',color='lightgrey', linewidth=0.5)
    plt.xticks(np.arange(-0.5, grid.shape[1]), [])
    plt.yticks(np.arange(-0.5, grid.shape[0]), [])
    plt.xlim(-0.5, grid.shape[1]-0.5)

    for i in range(grid.shape[0]):
        for j in range(grid.shape[1]):
            plt.text(j, i, grid[i, j], ha='center', va='center')

## Load data

In [None]:
with open('/mnt/hdd0/Kaggle/arc24/data/arc-agi_evaluation_challenges.json') as f:
    data = json.load(f)
len(data)

In [20]:
output_dir = '/mnt/hdd0/Kaggle/arc24/data/test_time_fine-tuning'
os.makedirs(output_dir, exist_ok=True)

## Train with n-1 samples

The idea is pretty simple: use only the train samples, and hold one of them out as if it were the test sample.

This will result in tasks with a small number of examples. Some tasks may no longer be completely defined by the remaining examples, but I hope that fine-tuning on them will bias the model towards the test transformation, so it should still be helpful.
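To make the split concrete, here is a minimal sketch on a toy task. The helper name `leave_one_out_tasks` and the toy grids are made up for illustration; the splitting logic is intended to mirror what is applied to the real data further down.

```python
def leave_one_out_tasks(task_id, task):
    """Create one new task per train sample, holding that sample out as the test."""
    new_tasks = {}
    for i, held_out in enumerate(task['train']):
        new_tasks[f'{task_id}_{i}'] = dict(
            # All the train samples except the held-out one
            train=[sample for j, sample in enumerate(task['train']) if j != i],
            # The held-out train sample plays the role of the test sample
            test=[held_out],
        )
    return new_tasks


# Hypothetical toy task: 3 train samples and 1 test sample without output
toy_task = {
    'train': [
        {'input': [[0]], 'output': [[1]]},
        {'input': [[1]], 'output': [[2]]},
        {'input': [[2]], 'output': [[3]]},
    ],
    'test': [{'input': [[3]]}],
}

toy_new_tasks = leave_one_out_tasks('toy', toy_task)
print(list(toy_new_tasks.keys()))
print([len(task['train']) for task in toy_new_tasks.values()])
```

A task with n train samples yields n new tasks with n-1 train samples each, and the original test sample is never touched.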

In [None]:
n_train_samples = [len(task['train']) for task in data.values()]
plt.hist(n_train_samples, bins=np.arange(0.5, 9));
plt.xlabel('Number of training samples')
plt.ylabel('Number of tasks')

In [None]:
new_data = dict()
for task_id, task in data.items():
    for i, test_sample in enumerate(task['train']):
        new_data[f'{task_id}_{i}'] = dict(
            train=[sample for j, sample in enumerate(task['train']) if i != j],
            test=[test_sample],
        )
len(new_data)

In [None]:
for task_id, task in new_data.items():
    if random.random() < 0.005:
        plot_task(task, task_id)

Looks good to me!  
If this works, we could try an even more extreme version where we train with even fewer samples.
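One possible way to do that, as a rough sketch and not something I have tried yet, is to hold out k train samples instead of one. The function name `leave_k_out_tasks`, the key format and the choice of using the held-out samples as test samples are all assumptions made for illustration.

```python
from itertools import combinations


def leave_k_out_tasks(task_id, task, k=2):
    """Create one new task per combination of k held-out train samples.

    Tasks with k or fewer train samples are skipped, because no train
    sample would be left.
    """
    train = task['train']
    new_tasks = {}
    if len(train) <= k:
        return new_tasks
    for held_out in combinations(range(len(train)), k):
        key = f'{task_id}_' + '-'.join(str(idx) for idx in held_out)
        new_tasks[key] = dict(
            train=[sample for j, sample in enumerate(train) if j not in held_out],
            # Assumption: the held-out samples are used as test samples
            test=[train[j] for j in held_out],
        )
    return new_tasks


# Hypothetical toy task with 4 train samples
toy_task_4 = {
    'train': [{'input': [[i]], 'output': [[i + 1]]} for i in range(4)],
    'test': [{'input': [[4]]}],
}

toy_n_minus_2 = leave_k_out_tasks('toy', toy_task_4, k=2)
print(len(toy_n_minus_2))
```

The number of new tasks grows combinatorially with k, so for tasks with many train samples it might be necessary to sample a subset of the combinations.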

In [21]:
with open(os.path.join(output_dir, 'evaluation_n-1.json'), 'w') as f:
    json.dump(new_data, f, indent=2)