# Geometric Invariant Hash Functions

Geometric invariant hash functions are relevant to grid worlds with wraparound (closed manifolds), such as [Conway's Reverse Game of Life](https://www.kaggle.com/c/conway-s-reverse-game-of-life) and [Halite](https://www.kaggle.com/c/halite), and also to the [Abstraction and Reasoning Corpus](https://www.kaggle.com/c/abstraction-and-reasoning-challenge).

They are invariant to `np.roll()` and, optionally, to `np.flip()` and `np.rot90()`.

To achieve this, we also need a set of Summable Prime Numbers. By the Unique Factorization Theorem, the product of any combination of primes is a unique number. This is not guaranteed for summation, but a search can find a subset of prime numbers for which the property does hold. This property is important for preventing hash collisions.
- https://www.kaggle.com/jamesmcguigan/summable-primes/
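
As a minimal sketch of what "summable" means, the brute-force checks below test a candidate set for unique subset sums and for the weaker pairwise variant used later in this notebook. The helper names `has_unique_pair_sums` and `has_unique_subset_sums` are illustrative assumptions and are not taken from the linked notebook.

```python
from itertools import combinations


def has_unique_pair_sums(values):
    """True if no two distinct unordered pairs of values share the same sum."""
    sums = [a + b for a, b in combinations(values, 2)]
    return len(sums) == len(set(sums))


def has_unique_subset_sums(values):
    """True if every subset has a distinct sum (exponential, small inputs only)."""
    sums = {0}                               # sum of the empty subset
    for v in values:
        shifted = {s + v for s in sums}      # sums of subsets that include v
        if sums & shifted:                   # an old subset already has that sum
            return False
        sums |= shifted
    return True


candidates = [2, 7, 23, 47, 61]                  # first five hashable primes used below
print(has_unique_pair_sums(candidates))          # True
print(has_unique_subset_sums(candidates[:4]))    # True
print(has_unique_subset_sums(candidates))        # False: 23 + 47 == 2 + 7 + 61
```

The last line illustrates the gap between the two constraints: a set can have unique pair sums while larger combinations still collide.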


I would like to propose two geometric invariant hash functions suitable for binary pixel images with wraparound.


The underlying concept is to take 1-dimensional views of the board from the perspective of each pixel in each of the up/down/left/right directions, and to compare whether the sum of pixelwise views is the same, regardless of the individual xy coordinates of each pixel. The hash value remains invariant under geometric transforms such as `np.roll()`, `np.flip()` and `np.rot90()`.

If we encoded each view-direction array as a binary number and summed the views from each direction, then multiple combinations of views could result in the same hash code, e.g.:
`b100 == b100 + b000 == b011 + b001 == b001 + b011 == b010 + b010`

If we encode the pixel distances using a set of summable prime numbers, then we can create a unique hash for each pixelwise view of the board which is invariant to geometric transforms. If we apply the lesser constraint that no 2-pair of prime numbers in the set has the same sum (rather than any combination), then in practice the hash algorithm still works, but we are not guaranteed to be free of hash collisions.
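
The binary example above can be reproduced in a few lines. This is only a sketch: `encode_binary`, `encode_primes` and the three-element `PRIMES` list are hypothetical names chosen for illustration, and the result only shows that this particular set of view pairs stops colliding once primes replace powers of two.

```python
import numpy as np


def encode_binary(view):
    """Read a 0/1 view as a binary number, most significant bit first."""
    return int("".join(str(int(bit)) for bit in view), 2)


def encode_primes(view, primes):
    """Sum the primes at the positions where the view contains a 1."""
    return int(np.sum(np.asarray(view) * np.asarray(primes)))


PRIMES = [23, 7, 2]  # assumption: one prime per bit position

# Pairs of views taken from the binary example above
pairs = [
    ([1, 0, 0], [0, 0, 0]),
    ([0, 1, 1], [0, 0, 1]),
    ([0, 1, 0], [0, 1, 0]),
]
binary_sums = [encode_binary(a) + encode_binary(b) for a, b in pairs]
prime_sums  = [encode_primes(a, PRIMES) + encode_primes(b, PRIMES) for a, b in pairs]

print(binary_sums)  # [4, 4, 4]    every pair collides
print(prime_sums)   # [23, 11, 14] the same pairs are now distinguishable
```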


## Hash Function Limitations

When the `hash_geometric_linear()` and `hash_translations()` functions were applied to [Conway's Reverse Game of Life](https://www.kaggle.com/c/conway-s-reverse-game-of-life), it was discovered that hash collisions occurred in about 29% of the dataset. This was mostly the result of boards containing multiple self-contained objects separated by whitespace, such that the pixels could not see each other horizontally or vertically. In such cases, the hash function matches when the shape and number of individual objects are the same, but it cannot detect differences in the whitespace gaps between the objects.
- https://www.kaggle.com/jamesmcguigan/game-of-life-hashmap-solver
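
The collision described above can be reproduced on a toy board. The following is a pure-NumPy re-sketch of the linear view on a 7x7 grid; `linear_hash_sketch` is an illustrative name (not the numba function defined later), and it assumes an odd board size and only the first four hashable primes.

```python
import numpy as np

PRIMES = np.array([2, 7, 23, 47], dtype=np.int64)  # first four hashable primes


def linear_hash_sketch(board):
    """Sum over all cells of the product of the four 1D views, with wraparound."""
    size = board.shape[0]
    half = size // 2
    l_primes = PRIMES[:half + 1]
    r_primes = l_primes[::-1]
    total = 0
    for x in range(size):
        for y in range(size):
            col = np.roll(board[:, y], half - x)   # cell (x, y) moved to the center
            row = np.roll(board[x, :], half - y)
            total += int(
                np.sum(col[half:] * l_primes) * np.sum(col[:half + 1] * r_primes)
                * np.sum(row[half:] * l_primes) * np.sum(row[:half + 1] * r_primes)
            )
    return total


# Two single-pixel "objects" that share no row or column, at different spacings
near = np.zeros((7, 7), dtype=np.int64)
near[1, 1] = near[2, 2] = 1
far = np.zeros((7, 7), dtype=np.int64)
far[1, 1] = far[4, 5] = 1

print(linear_hash_sketch(near), linear_hash_sketch(far))  # 32 32
```

Each live pixel only sees itself (a view of 2 in all four directions, so 2**4 = 16), and every empty cell has at least one empty direction, so both boards hash to 32 even though the gap between the pixels differs.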

`hash_geometric_concentric()` has now been implemented, which replaces horizontal and vertical lines with concentric circles emanating from each pixel.
This allows the hash function to "see" in all directions and detect self-contained objects separated by whitespace, but at a 2x runtime performance cost.

In [None]:
import os
import sys
import math
import time
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from typing import Tuple
from numba import njit
from fastcache import clru_cache

# Primes

In [None]:
def generate_primes(count):
    primes = [2]
    for n in range(3, sys.maxsize, 2):
        if len(primes) >= count: break
        if all( n % i != 0 for i in range(3, int(math.sqrt(n))+1, 2) ):
            primes.append(n)
    return primes

primes_np = np.array( generate_primes(625), dtype=np.int64 )  # first 625 primes 


# 50 Summable Primes
# with the lesser constraint that: 
#     no 2-pair of prime numbers in the set has the same sum (rather than any combination)
# This works in practice, but is not guaranteed to prevent hash collisions
hashable_primes = np.array([
        2,     7,    23,    47,    61,     83,    131,    163,    173,    251,
      457,   491,   683,   877,   971,   2069,   2239,   2927,   3209,   3529, 
     4451,  4703,  6379,  8501,  9293,  10891,  11587,  13457,  13487,  17117,
    18869, 23531, 23899, 25673, 31387,  31469,  36251,  42853,  51797,  72797,
    76667, 83059, 87671, 95911, 99767, 100801, 100931, 100937, 100987, 100999,
], dtype=np.int64)

# Geometric Invariant Hash

Given that the board experiences wraparound, this is equivalent to saying that the image is a [Closed Manifold](https://en.wikipedia.org/wiki/Closed_manifold).

A view from each pixel (in each direction) is encoded using the set of Summable Primes. The directions are combined by taking the product. 
The hash for the board is the sum of all pixel views.

The location of each pixel has not been encoded into the hash, only the relative distances of each pixel to each other pixel. As such, this hash is invariant to geometric transforms such as `identity`, `np.roll`, `np.rot90` and `np.flip`.
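
The per-pixel view described above can be sketched in one dimension. This is an illustrative example under assumptions: `view_1d` is a hypothetical helper, the line has odd length, and only the first four hashable primes are needed for a length of 7.

```python
import numpy as np

PRIMES = np.array([2, 7, 23, 47], dtype=np.int64)  # first four hashable primes


def view_1d(line, x, primes=PRIMES):
    """Return the (forward, backward) prime-encoded views from index x, with wraparound."""
    half     = len(line) // 2
    centered = np.roll(line, half - x)        # index x is moved to the center
    l_primes = primes[:half + 1]              # distance 0, 1, 2, ... from the center
    r_primes = l_primes[::-1]                 # same distances, mirrored
    forward  = int(np.sum(centered[half:]     * l_primes))
    backward = int(np.sum(centered[:half + 1] * r_primes))
    return forward, backward


line = np.array([0, 1, 0, 0, 1, 0, 0])
print(view_1d(line, 1))                   # (49, 2): itself (2) plus a pixel 3 steps ahead (47)
print(view_1d(np.roll(line, 3), 4))       # (49, 2): rolling the line moves the pixel, not its view
```

Because only relative offsets enter the encoding, rolling the line and the viewing index together leaves the view unchanged.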

In [None]:
@njit()
def hash_geometric_linear(board: np.ndarray) -> int:
    """
    Takes the 1D pixelwise view from each pixel (up, down, left, right) with wraparound
    the distance to each pixel is encoded as a prime number, the sum of these is the hash for each view direction
    the hash for each cell is the product of view directions and the hash of the board is the sum of these products
    this produces a geometric invariant hash that will be identical for roll / flip / rotate operations
    """
    assert board.shape[0] == board.shape[1]  # assumes square board
    size     = board.shape[0]
    l_primes = hashable_primes[:size//2+1]   # each distance index is represented by a different prime
    r_primes = l_primes[::-1]                # symmetric distance values in reversed direction from center

    hashed = 0
    for x in range(size):
        for y in range(size):
            # current pixel is moved to the center index (size//2)
            horizontal = np.roll( board[:,y], size//2 - x)
            vertical   = np.roll( board[x,:], size//2 - y)
            left       = np.sum( horizontal[size//2:]   * l_primes )
            right      = np.sum( horizontal[:size//2+1] * r_primes )
            down       = np.sum( vertical[size//2:]     * l_primes )
            up         = np.sum( vertical[:size//2+1]   * r_primes )
            hashed    += left * right * down * up
    return hashed

In [None]:
@njit()
def get_concentric_prime_mask(shape: Tuple[int,int]=(25,25), pattern='diamond') -> np.ndarray:
    assert shape[0] == shape[1]
    assert pattern in [ 'diamond', 'oval' ]

    # Center coordinates
    x      = (shape[0])//2
    y      = (shape[1])//2
    max_r  = max(shape) + 1 if max(shape) % 2 == 0 else max(shape)   
    
    # Create diagonal lines of primes (r_mask) in the bottom right quadrant
    mask = np.zeros(shape, dtype=np.int64)
    for r in range(max_r):
        primes = hashable_primes[:r+1]
        for dr in range(r+1): 
            if   pattern == 'diamond':  prime = primes[r]                 # creates symmetric diamond
            elif pattern == 'oval':     prime = primes[r] + primes[dr]    # creates rotation sensitive oval
            
            coords = {
                (x+(r-dr),y+(dr)), # bottom right
                (x-(r-dr),y+(dr)), # bottom left
                (x+(r-dr),y-(dr)), # top    right
                (x-(r-dr),y-(dr)), # top    left
            }
            for coord in coords:
                if min(coord) >= 0 and max(coord) < min(shape): 
                    mask[coord] = prime 
    return mask
        
@njit()
def hash_geometric_concentric(board: np.ndarray) -> int:
    """
    Takes the concentric diamond pixelwise view from each pixel with wraparound
    the offset to each pixel is encoded as a prime number via get_concentric_prime_mask()
    the view for each cell is the sum of board values weighted by this mask
    and the hash of the board is the sum of these per-cell views
    this produces a geometric invariant hash that will be identical for roll / flip / rotate operations

    The concentric version of this function allows the hash function to "see" in all directions
    and detect self-contained objects separated by whitespace, but at a 2x runtime performance cost.
    """
    assert board.shape[0] == board.shape[1]  # assumes square board
    mask = get_concentric_prime_mask(shape=board.shape)

    hashed = 0
    for x in range(board.shape[0]):
        for y in range(board.shape[1]):
            for dx in range(mask.shape[0]):
                for dy in range(mask.shape[1]):
                    coords  = ( (x+dx)%board.shape[0], (y+dy)%board.shape[1] )
                    hashed += board[coords] * mask[dx,dy]
    return hashed

## Visualization

This is a visualization of how the hash function applies itself to each pixel.

In [None]:
# Hashable Primes: 2, 7, 23, 47, 61
geometric_hash_pattern = np.array([
    [0,0,0,0,61,0,0,0,0],
    [0,0,0,0,47,0,0,0,0],
    [0,0,0,0,23,0,0,0,0],
    [0,0,0,0, 7,0,0,0,0],
 [61,47,23,7, 2,7,23,47,61],
    [0,0,0,0, 7,0,0,0,0],
    [0,0,0,0,23,0,0,0,0],
    [0,0,0,0,47,0,0,0,0],
    [0,0,0,0,61,0,0,0,0],
])
transforms = {
    'identity': geometric_hash_pattern,
    'flip':     np.flip(geometric_hash_pattern),
    'rot90':    np.rot90(geometric_hash_pattern),
    'roll':     np.roll(np.roll(geometric_hash_pattern, -2, axis=0), -1, axis=1),    
}

plt.figure(figsize=(len(transforms)*5, 5))
for i, (name, grid) in enumerate(transforms.items()):
    plt.subplot(1, len(transforms), i+1)
    plt.title(name)
    plt.imshow(grid, cmap='nipy_spectral')

`hash_geometric_concentric()` replaces horizontal and vertical lines with concentric circles emanating from each pixel.
This allows the hash function to "see" in all directions and detect self-contained objects separated by whitespace, but at a 2x runtime performance cost.
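
For intuition, the diamond pattern amounts to indexing the prime list by Manhattan distance from the center. The vectorized sketch below is an assumption-laden illustration: `diamond_mask_sketch` is a hypothetical name, it only handles odd square sizes, and it uses just the first five hashable primes, which is enough for a 5x5 mask.

```python
import numpy as np

HASHABLE_PRIMES = np.array([2, 7, 23, 47, 61], dtype=np.int64)  # first five hashable primes


def diamond_mask_sketch(size, primes=HASHABLE_PRIMES):
    """Square mask whose value at each cell is primes[Manhattan distance to the center]."""
    assert size % 2 == 1, "sketch assumes an odd side length"
    assert 2 * (size // 2) < len(primes), "need one prime per possible distance"
    offsets  = np.abs(np.arange(size) - size // 2)
    distance = offsets[:, None] + offsets[None, :]   # |dx| + |dy|
    return primes[distance]


mask = diamond_mask_sketch(5)
print(mask)
# The mask is unchanged by flips and 90 degree rotations
print(np.array_equal(mask, np.flip(mask)), np.array_equal(mask, np.rot90(mask)))  # True True
```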

In [None]:
for pattern in [ 'diamond', 'oval' ]:
    geometric_hash_pattern = get_concentric_prime_mask((9,9), pattern=pattern)
    transforms = {
        'identity': geometric_hash_pattern,
        'flip':     np.flip(geometric_hash_pattern),
        'rot90':    np.rot90(geometric_hash_pattern),
        'roll':     np.roll(np.roll(geometric_hash_pattern, 3, axis=0), 3, axis=1),    
    }
    plt.figure(figsize=(len(transforms)*5, 5))
    for i, (name, grid) in enumerate(transforms.items()):
        plt.subplot(1, len(transforms), i+1)
        plt.title(f'{pattern} - {name}')
        #plt.imshow(grid, cmap='nipy_spectral')
        plt.imshow(grid)

## Tests
We can test this using the Conway's Reverse Game of Life dataset.

In [None]:
train_file = '../input/conways-reverse-game-of-life-2020/train.csv'
test_file  = '../input/conways-reverse-game-of-life-2020/test.csv'

train_df   = pd.read_csv(train_file, index_col='id')
test_df    = pd.read_csv(test_file,  index_col='id')

def csv_to_numpy_list(df, key='start') -> np.ndarray:
    return df[[ f'{key}_{n}' for n in range(25**2) ]].values.reshape(-1,25,25)

In [None]:
def test_hash_geometric(hash_fn):
    count  = 1000 if os.environ.get('KAGGLE_KERNEL_RUN_TYPE') == 'Interactive' else sys.maxsize
    boards = csv_to_numpy_list(train_df, key='start')[:count]
    for board in boards:
        transforms = {
            "identity": board,
            "roll_0":   np.roll(board, 1, axis=0),
            "roll_1":   np.roll(board, 1, axis=1),
            "flip_0":   np.flip(board, axis=0),
            "flip_1":   np.flip(board, axis=1),
            "rot90":    np.rot90(board, 1),
            "rot180":   np.rot90(board, 2),
            "rot270":   np.rot90(board, 3),
        }
        hashes = { f'{key:8s}': hash_fn(value) for key, value in transforms.items()}  # use the function under test

        # all geometric transforms should produce the same hash
        assert len(set(hashes.values())) == 1
    return len(boards)
        
count = test_hash_geometric(hash_fn=hash_geometric_linear)
print(f'hash_geometric_linear()     - {count} tests passed')

count = test_hash_geometric(hash_fn=hash_geometric_concentric)
print(f'hash_geometric_concentric() - {count} tests passed')

# Translation Invariant Hash

However, sometimes we may wish to distinguish between `np.roll` translations and `np.flip` or `np.rot90` geometric transforms.

We can modify the above hash to be orientation specific by only using views in the left and down directions.

NOTE: Simply taking the sum of the pixel hashes from `hash_translations_board()` causes hash collisions between `identity` and `rot180`.
The pixelwise views still produce different numbers, but those numbers sum to the same total.
This can be fixed by sorting the pixel hashes by value and multiplying by an array of prime numbers.
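
A toy version of the sort-then-weight step is sketched below. `sorted_prime_hash` is a hypothetical helper and the two-element inputs are made up for illustration; the point is only that two collections with the same plain sum can be separated, while the result stays independent of element order.

```python
import numpy as np


def sorted_prime_hash(values, primes):
    """Sort values in descending order, weight them by ascending primes, and sum."""
    ordered = np.sort(np.asarray(values).ravel())[::-1]
    return int(np.sum(ordered * np.asarray(primes)[:len(ordered)]))


primes = [2, 3]
a = [1, 4]   # plain sum 5
b = [2, 3]   # plain sum 5, so np.sum() alone cannot tell a and b apart

print(sorted_prime_hash(a, primes))        # 11 = 4*2 + 1*3
print(sorted_prime_hash(a[::-1], primes))  # 11, order of the inputs does not matter
print(sorted_prime_hash(b, primes))        # 12 = 3*2 + 2*3
```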

In [None]:
@njit()
def hash_translations(board: np.ndarray) -> int:
    """
    Takes the 1D pixelwise view from each pixel (left, down) with wraparound
    by only using two directions, this hash is only invariant for roll operations, but not flip or rotate
    this allows determining which operations are required to solve a transform

    NOTE: rot180, i.e. np.rot90(board, 2), produces the same sum as board, but with different numbers which is fixed via: sorted * primes
    """
    assert board.shape[0] == board.shape[1]
    hashes = hash_translations_board(board)
    ordered = np.sort(hashes.flatten())  # named `ordered` to avoid shadowing the builtin sorted()
    hashed  = np.sum(ordered[::-1] * primes_np[:len(ordered)])  # multiply big with small numbers | hashable_primes is too small
    return int(hashed)


@njit()
def hash_translations_board(board: np.ndarray) -> np.ndarray:
    """ Returns a board with hash values for individual cells """
    assert board.shape[0] == board.shape[1]  # assumes square board
    size = board.shape[0]

    # NOTE: using the same list of primes for each direction, results in the following identity splits:
    # NOTE: np.rot180() produces the same np.sum() hash, but using different numbers which is fixed via: sorted * primes
    #   with v_primes == h_primes and NOT sorted * primes:
    #       identity == np.roll(axis=0) == np.roll(axis=1) == np.rot180()
    #       np.flip(axis=0) == np.flip(axis=1) == np.rot90() == np.rot270() != np.rot180()
    #   with v_primes == h_primes and sorted * primes:
    #       identity == np.roll(axis=0) == np.roll(axis=1)
    #       np.flip(axis=0) == np.rot270()
    #       np.flip(axis=1) == np.rot90()
    h_primes = hashable_primes[ 0*size : 1*size ]
    v_primes = hashable_primes[ 1*size : 2*size ]
    output   = np.zeros(board.shape, dtype=np.int64)
    for x in range(size):
        for y in range(size):
            # current pixel is moved to left [0] index
            horizontal  = np.roll( board[:,y], -x )
            vertical    = np.roll( board[x,:], -y )
            left        = np.sum( horizontal * h_primes )
            down        = np.sum( vertical   * v_primes )
            output[x,y] = left * down
    return output

## Visualization

This is a visualization of what is happening.

In [None]:
# Hashable Primes: 2, 7, 23, 47, 61, 83, 131, 163, 173, 251
translation_hash_pattern = np.array([
       [0,0,0,0, 83,0,0,0,0],
       [0,0,0,0,131,0,0,0,0],
       [0,0,0,0,163,0,0,0,0],
       [0,0,0,0,173,0,0,0,0],
[83,131,163,173,  2,7,23,47,61],
       [0,0,0,0,  7,0,0,0,0],
       [0,0,0,0, 23,0,0,0,0],
       [0,0,0,0, 47,0,0,0,0],
       [0,0,0,0, 61,0,0,0,0],
])
transforms = {
    'identity': translation_hash_pattern,
    'flip':     np.flip(translation_hash_pattern),
    'rot90':    np.rot90(translation_hash_pattern),
    'roll':     np.roll(np.roll(translation_hash_pattern, -2, axis=0), -1, axis=1),    
}

plt.figure(figsize=(len(transforms)*5, 5))
for i, (name, grid) in enumerate(transforms.items()):
    plt.subplot(1, len(transforms), i+1)
    plt.title(name)
    plt.imshow(grid, cmap='nipy_spectral')

In [None]:
board = csv_to_numpy_list(test_df.loc[85291], key='stop')[0]
for transforms in [
    {
        "identity": board,
        "np.roll(0)":   np.roll(board, 10, axis=0),
        "np.roll(1)":   np.roll(board, 10, axis=1),
        "np.rot180()":  np.rot90(board, 2),
    },
    {
        "np.flip(0)":   np.flip(board, axis=0),
        "np.flip(1)":   np.flip(board, axis=1),
        "np.rot90()":   np.rot90(board, 1),
        "np.rot180()":  np.rot90(board, 2),
        "np.rot270()":  np.rot90(board, 3),
    }    
]:
    figure = plt.figure(figsize=(len(transforms)*5, 5*2))
    figure.tight_layout(pad=10.0)
    for i, (name, grid) in enumerate(transforms.items()):
        plt.subplot(2, len(transforms), i+1)
        plt.title(name)
        plt.imshow(grid, cmap = 'binary')

    for i, (name, grid) in enumerate(transforms.items()):
        hashmap          = hash_translations_board(grid)
        sum_hash         = np.sum(hashmap)
        sum_x_prime_hash = hash_translations(grid)

        plt.subplot(2, len(transforms), len(transforms) + i+1)
        plt.title(f'sum = {sum_hash}\nsum * primes = {sum_x_prime_hash}')
        plt.imshow(hashmap, cmap = 'nipy_spectral')

## Tests

In [None]:
def test_hash_translations():
    count  = 1000 if os.environ.get('KAGGLE_KERNEL_RUN_TYPE') == 'Interactive' else sys.maxsize
    boards = csv_to_numpy_list(train_df, key='start')[:count]
    for board in boards:
        if np.count_nonzero(board) < 50: continue  # skip small symmetric boards
        transforms = {
            "identity": board,
            "roll_0":   np.roll(board, 13, axis=0),
            "roll_1":   np.roll(board, 13, axis=1),
            "flip_0":   np.flip(board, axis=0),
            "flip_1":   np.flip(board, axis=1),
            "rot90":    np.rot90(board, 1),
            "rot180":   np.rot90(board, 2),
            "rot270":   np.rot90(board, 3),
        }
        hashes  = { key: hash_translations(value) for key, value in transforms.items()  }

        # rolling the board should not change the hash, but other transforms should
        assert hashes['identity'] == hashes['roll_0']
        assert hashes['identity'] == hashes['roll_1']

        # all other flip / rotate transformations should produce different hashes
        assert hashes['identity'] != hashes['flip_0']
        assert hashes['identity'] != hashes['flip_1']
        assert hashes['identity'] != hashes['rot90']
        assert hashes['identity'] != hashes['rot180']
        assert hashes['identity'] != hashes['rot270']
        assert hashes['flip_0'] != hashes['flip_1'] != hashes['rot90']  != hashes['rot180'] != hashes['rot270']
    return len(boards)

count = test_hash_translations()
print(f'{count} tests passed')
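
One detail worth keeping in mind when reading the final assertion in the test above: a chained `!=` in Python only compares adjacent operands, so it does not by itself prove that every pair differs. The sketch below shows the difference; `all_distinct` is a hypothetical helper, and whether the stricter check would pass on the dataset has not been verified here.

```python
def all_distinct(values):
    """True only if no value appears more than once."""
    values = list(values)
    return len(set(values)) == len(values)


a, b, c = 1, 2, 1
chained = (a != b != c)           # evaluated as (a != b) and (b != c)

print(chained)                    # True, even though a == c
print(all_distinct([a, b, c]))    # False
```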

# Profiler Performance

In [None]:
count  = 10000 if os.environ.get('KAGGLE_KERNEL_RUN_TYPE') == 'Interactive' else sys.maxsize
boards = csv_to_numpy_list(train_df, key='start')[:count]

time_start = time.perf_counter()
hashes     = [ hash_geometric_linear(board) for board in boards ]
time_taken = time.perf_counter() - time_start
print(f'{1000 * time_taken/len(boards):0.3f}ms = hash_geometric_linear()')

time_start = time.perf_counter()
hashes     = [ hash_geometric_concentric(board) for board in boards ]
time_taken = time.perf_counter() - time_start
print(f'{1000 * time_taken/len(boards):0.3f}ms = hash_geometric_concentric()')

time_start = time.perf_counter()
hashes     = [ hash_translations(board) for board in boards ]
time_taken = time.perf_counter() - time_start

print(f'{1000 * time_taken/len(boards):0.3f}ms = hash_translations()')
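
The three timing blocks above repeat the same pattern, which could be factored into a helper. The sketch below is an assumption, not part of the original benchmark: `ms_per_call` is a hypothetical name, and the optional warm-up call is there because numba compiles a function the first time it is called, which would otherwise be counted in the measurement if no earlier cell had already triggered compilation.

```python
import time


def ms_per_call(fn, items, warmup=True):
    """Average wall-clock milliseconds per call of fn over items."""
    items = list(items)
    if warmup and items:
        fn(items[0])                      # untimed call, e.g. to trigger JIT compilation
    start = time.perf_counter()
    for item in items:
        fn(item)
    elapsed = time.perf_counter() - start
    return 1000 * elapsed / max(len(items), 1)


# Usage with a stand-in function; in the notebook this would be one of the hash functions
print(f'{ms_per_call(sum, [[1, 2, 3]] * 1000):0.3f}ms = sum()')
```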