# Pytorch Game of Life

Conway's Game of Life is an example of a 2D cellular automaton.

Before going for the difficult "reverse" problem, let's see if we can solve the easier "forward" problem first and implement the `life_step()` function as a neural network.
We already have a code implementation for this function, so the real challenge is getting a neural network to train to 100% accuracy, which, as we will see, is indeed possible.

For this we will use PyTorch and a pure CNN architecture. I have provided a detailed explanation of my choices for network sizing, data flows, and state space analysis.

This network was able to implement the `life_step()` function with 100% accuracy tested over 1 million generated boards. 

In batch mode, the profiler reports that the neural network implementation can be faster than `numpy.roll()`, numba `@njit`, and even `scipy.signal.convolve2d()`.


## Update

This model, with 128 channels in its first convolutional layer, is greatly oversized compared to the theoretical minimum, 
but it reliably trains from random weight initialization and acts as a proof of concept that a 
neural network can indeed be trained to 100% accuracy.

It turns out the Game of Life forward function can actually be implemented using a minimalist network architecture with only 1 or 2 3x3 CNN channels and only 3 neurons arranged in 2 layers.

I have written a tutorial on hardcoding neural network weights, and how to implement counting and boolean logic gates using linear algebra.
- https://www.kaggle.com/jamesmcguigan/pytorch-game-of-life-hardcoding-network-weights/

# Classical Implementations of life_step()

Using the classic ruleset on a 25x25 board with wraparound, the game evolves at each timestep according to the following rules:

- Overpopulation: if a living cell is surrounded by more than three living cells, it dies.
- Stasis: if a living cell is surrounded by two or three living cells, it survives.
- Underpopulation: if a living cell is surrounded by fewer than two living cells, it dies.
- Reproduction: if a dead cell is surrounded by exactly three living cells, it becomes a live cell.
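
As a quick sanity check of the four rules above, here is a minimal sketch that expresses them as a function of a single cell and its neighbour count. The name `next_cell_state()` is hypothetical and is not used elsewhere in this notebook.

```python
def next_cell_state(cell: int, neighbours: int) -> int:
    """Return the next state (1 = alive, 0 = dead) of a single cell."""
    if cell == 1 and neighbours in (2, 3):   # stasis
        return 1
    if cell == 0 and neighbours == 3:        # reproduction
        return 1
    return 0                                 # underpopulation, overpopulation, or stays dead

# Enumerate every combination of cell state and neighbour count (0..8)
for cell in (0, 1):
    row = [next_cell_state(cell, n) for n in range(9)]
    print(f'cell={cell}:', row)
# cell=0: [0, 0, 0, 1, 0, 0, 0, 0, 0]
# cell=1: [0, 0, 1, 1, 0, 0, 0, 0, 0]
```

Only 3 of the 18 combinations produce a live cell, which is why the function is so compact when written as boolean logic.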

Here are three different implementations of the `life_step()` function, using `numpy.roll()`, `scipy.signal.convolve2d()`, and a numba `@njit` optimized version in pure Python.

In [None]:
# Functions for implementing Game of Life Forward Play
from typing import List

import numpy as np
import scipy.sparse
from joblib import delayed
from joblib import Parallel
from numba import njit


# Source: https://www.kaggle.com/ianmoone0617/reversing-conways-game-of-life-tutorial
def life_step_numpy(X: np.ndarray):
    """Game of life step using generator expressions"""
    nbrs_count = sum(np.roll(np.roll(X, i, 0), j, 1)
                     for i in (-1, 0, 1) for j in (-1, 0, 1)
                     if (i != 0 or j != 0))
    return (nbrs_count == 3) | (X & (nbrs_count == 2))


# Source: https://www.kaggle.com/ianmoone0617/reversing-conways-game-of-life-tutorial
def life_step_scipy(X: np.ndarray):
    """Game of life step using scipy tools"""
    from scipy.signal import convolve2d
    nbrs_count = convolve2d(X, np.ones((3, 3)), mode='same', boundary='wrap') - X
    return (nbrs_count == 3) | (X & (nbrs_count == 2))


@njit
def life_step_njit(board: np.ndarray) -> np.ndarray:
    """Game of life step using explicit loops, compiled by numba @njit"""
    size_x = board.shape[0]
    size_y = board.shape[1]
    output = np.zeros(board.shape, dtype=np.int8)
    for x in range(size_x):
        for y in range(size_y):
            cell       = board[x,y]
            neighbours = life_neighbours_xy(board, x, y, max_value=3)
            if ( (cell == 0 and      neighbours == 3 )
              or (cell == 1 and 2 <= neighbours <= 3 )
            ):
                output[x, y] = 1
    return output

# NOTE: @njit doesn't like np.roll(axis=) so reimplement explicitly
@njit
def life_neighbours_xy(board: np.ndarray, x, y, max_value=3):
    size_x = board.shape[0]
    size_y = board.shape[1]
    neighbours = 0
    for i in (-1, 0, 1):
        for j in (-1, 0, 1):
            if i == j == 0: continue    # ignore self
            xi = (x + i) % size_x
            yj = (y + j) % size_y
            neighbours += board[xi, yj]
            if neighbours > max_value:  # shortcircuit return 4 if overpopulated
                return neighbours
    return neighbours

@njit
def life_neighbours(board: np.ndarray, max_value=3):
    size_x = board.shape[0]
    size_y = board.shape[1]
    output = np.zeros(board.shape, dtype=np.int8)
    for x in range(size_x):
        for y in range(size_y):
            output[x,y] = life_neighbours_xy(board, x, y, max_value)
    return output




# For generating large quantities of data, we can use joblib Parallel() to take advantage of all 4 CPUs available in a Kaggle Notebook
life_step = life_step_njit
def life_steps(boards: List[np.ndarray]) -> List[np.ndarray]:
    """ Parallel version of life_step() but for an array of boards """
    return Parallel(-1)( delayed(life_step)(board) for board in boards )

Rather than relying on the public train dataset, we can just reimplement the kaggle dataset generation function

In [None]:
# RULES: https://www.kaggle.com/c/conway-s-reverse-game-of-life/data
def generate_random_board() -> np.ndarray:
    # An initial board was chosen by filling the board with a random density between 1% full (mostly zeros) and 99% full (mostly ones).
    # DOCS: https://cmdlinetips.com/2019/02/how-to-create-random-sparse-matrix-of-specific-density/
    density = np.random.random() * 0.98 + 0.01
    board   = scipy.sparse.random(25, 25, density=density, data_rvs=np.ones).toarray()

    # The starting board's state was recorded after the 5 "warmup steps". These are the values in the start variables.
    for t in range(5):
        board = life_step(board)
        if np.count_nonzero(board) == 0: return generate_random_board()  # exclude empty boards and try again
    return board

def generate_random_boards(count) -> List[np.ndarray]:
    generated_boards = Parallel(-1)( delayed(generate_random_board)() for _ in range(count) )
    return generated_boards

# Pytorch Base Class

There is a little bit of core infrastructure code that needs to be written, to handle common functionality such as:
- model save/autoload
- casting between data formats
- freezing and unfreezing
- training loop functions

By putting this all in a base class, we can separate infrastructure code from application logic, and make code reuse easier by subclassing these functions for different use cases.

In [None]:
# First check if we have a GPU available 
# __file__ is implicitly defined when running on localhost, but needs to be manually set when running inside a Kaggle Notebook
import torch
device   = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
__file__ = './notebook.ipynb'

In [None]:
from __future__ import annotations

import os
from abc import ABCMeta
from typing import List
from typing import TypeVar
from typing import Union

import humanize
import numpy as np
import torch
import torch.nn as nn


# noinspection PyTypeChecker
T = TypeVar('T', bound='GameOfLifeBase')
class GameOfLifeBase(nn.Module, metaclass=ABCMeta):
    """
    Base class for GameOfLife based NNs
    Handles: save/autoload, freeze/unfreeze, casting between data formats, and training loop functions
    """

    def __init__(self):
        super().__init__()
        self.loaded    = False  # can't call self.load() in constructor, as weights/layers have not been defined yet
        self.device    = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
        self.criterion = nn.MSELoss()


    @staticmethod
    def weights_init(layer):
        if isinstance(layer, (nn.Conv2d, nn.ConvTranspose2d)):
            nn.init.kaiming_normal_(layer.weight)
            nn.init.constant_(layer.bias, 0.1)

        
    ### Prediction

    def __call__(self, *args, **kwargs) -> torch.Tensor:
        if not self.loaded: self.load()  # autoload on first function call
        return super().__call__(*args, **kwargs)

    def predict(self, inputs: Union[List[np.ndarray], np.ndarray, torch.Tensor]) -> np.ndarray:
        """ Wrapper function around __call__() that returns a numpy int8 array for external usage """
        outputs = self(inputs)
        outputs = self.cast_int(outputs).squeeze().cpu().numpy()
        return outputs



    ### Training

    def loss(self, outputs, expected, input):
        return self.criterion(outputs, expected)

    def accuracy(self, outputs, expected, inputs) -> float:
        # noinspection PyTypeChecker
        return torch.sum(self.cast_int(outputs) == self.cast_int(expected)).cpu().numpy() / np.prod(outputs.shape)



    ### Freeze / Unfreeze

    def freeze(self: T) -> T:
        if not self.loaded: self.load()
        for name, parameter in self.named_parameters():
            parameter.requires_grad = False
        return self

    def unfreeze(self: T) -> T:
        if not self.loaded: self.load()
        for name, parameter in self.named_parameters():
            parameter.requires_grad = True
        return self



    ### Load / Save Functionality

    @property
    def filename(self) -> str:
        return os.path.join( os.path.dirname(__file__), 'models', f'{self.__class__.__name__}.pth')


    # DOCS: https://pytorch.org/tutorials/beginner/saving_loading_models.html
    def save(self: T) -> T:
        os.makedirs(os.path.dirname(self.filename), exist_ok=True)
        torch.save(self.state_dict(), self.filename)
        print(f'{self.__class__.__name__}.savefile(): {self.filename} = {humanize.naturalsize(os.path.getsize(self.filename))}')
        return self


    def load(self: T) -> T:
        if os.path.exists(self.filename):
            try:
                self.load_state_dict(torch.load(self.filename))
                print(f'{self.__class__.__name__}.load(): {self.filename} = {humanize.naturalsize(os.path.getsize(self.filename))}')
            except Exception as exception:
                # Ignore errors caused by model size mismatch
                print(f'{self.__class__.__name__}.load(): model has changed dimensions, discarding saved weights\n')
                pass

        self.loaded = True    # prevent any infinite if self.loaded loops
        self.to(self.device)  # ensure all weights, either loaded or untrained are moved to GPU
        self.eval()           # default to production mode - disable dropout
        self.freeze()         # default to production mode - disable training
        return self



    ### Casting

    def cast_bool(self, x: torch.Tensor) -> torch.Tensor:
        # noinspection PyTypeChecker
        return (x > 0.5)

    def cast_int(self, x: torch.Tensor) -> torch.Tensor:
        return self.cast_bool(x).to(torch.int8)

    def cast_int_float(self, x: torch.Tensor) -> torch.Tensor:
        return self.cast_bool(x).to(torch.float32).requires_grad_(True)


    def cast_to_tensor(self, x: Union[np.ndarray, torch.Tensor]) -> torch.Tensor:
        if torch.is_tensor(x):
            return x.to(torch.float32).to(device)
        if isinstance(x, list):
            x = np.array(x)
        if isinstance(x, np.ndarray):
            x = torch.from_numpy(x).to(torch.float32)
            x = x.to(device)
            return x  # x.shape = (42,3)
        raise TypeError(f'{self.__class__.__name__}.cast_to_tensor() invalid type(x) = {type(x)}')


    # DOCS: https://towardsdatascience.com/understanding-input-and-output-shapes-in-convolution-network-keras-f143923d56ca
    # pytorch requires:    contiguous_format = (batch_size, channels, height, width)
    # tensorflow requires: channels_last     = (batch_size, height, width, channels)
    def cast_inputs(self, x: Union[List[np.ndarray], np.ndarray, torch.Tensor]) -> torch.Tensor:
        x = self.cast_to_tensor(x)
        if x.dim() == 1:             # single row from dataframe
            size = int(np.sqrt(x.shape[0]))  # view() needs python ints, and torch.sqrt() only accepts tensors
            x = x.view(1, 1, size, size)
        elif x.dim() == 2:
            if x.shape[0] == x.shape[1]:  # single 2d board
                x = x.view(1, 1, x.shape[0], x.shape[1])
            else: # rows of flattened boards
                size = int(np.sqrt(x.shape[1]))
                x = x.view(-1, 1, size, size)
        elif x.dim() == 3:                                 # numpy  == (batch_size, height, width)
            x = x.view(x.shape[0], 1, x.shape[1], x.shape[2])   # x.shape = (batch_size, channels, height, width)
        elif x.dim() == 4:
            pass  # already in (batch_size, channels, height, width) format, so do nothing
        return x


# Pytorch CNN

Now comes the application logic, in the form of a convolutional neural network (CNN).

We have one 3x3 convolution, followed by three layers of 1x1 convolutions. We also remind the final output layer of the original input before it makes a final prediction.

Notice that there is no `nn.Linear` layer here that gets to see the entire board state. The effect of this is that we have actually used CNN semantics to create a small 4 layer (128:16:8:1) dense network, with 9 inputs and 1 output, that is reused to predict each pixel individually, dependent only on its immediate neighbours. This saves us having to write a looping function over the entire board, and we get `padding_mode='circular'` for free.
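
To illustrate why a stack of 1x1 convolutions behaves like a small dense network applied to each pixel, here is a minimal NumPy sketch. The random `features`, `weights`, and `bias` arrays are stand-ins for illustration only, not the trained model's values.

```python
import numpy as np

rng      = np.random.default_rng(0)
features = rng.normal(size=(128, 25, 25))   # (channels, height, width), e.g. after the 3x3 convolution
weights  = rng.normal(size=(16, 128))       # a 1x1 convolution mapping 128 -> 16 channels
bias     = rng.normal(size=(16,))

# 1x1 convolution: mixes channels at every pixel, never looking at neighbouring pixels
conv_1x1 = np.einsum('oc,chw->ohw', weights, features) + bias[:, None, None]

# The same computation written as a dense layer applied to one pixel at a time
dense = np.zeros_like(conv_1x1)
for h in range(25):
    for w in range(25):
        dense[:, h, w] = weights @ features[:, h, w] + bias

print(np.allclose(conv_1x1, dense))  # True
```

The only layer that ever looks sideways is the initial 3x3 convolution; everything after it is per-pixel arithmetic.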

If we think about the core logic implemented in the Game of Life `life_step()` function, all it is really doing is counting up the number of neighbourhood cells and comparing it to the value of the center cell (hence why we pass the original board state back into the final layer).
- 1 Alive + 2-3 neighbours = 1 Alive
- 0 Dead  +   3 neighbours = 1 Alive
- Otherwise                = 0 Dead
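
The count-then-compare logic above can be sketched directly in NumPy. This is an illustrative reimplementation (the name `life_step_counting()` is hypothetical), using `np.pad(mode='wrap')` as a stand-in for circular padding.

```python
import numpy as np

def life_step_counting(board: np.ndarray) -> np.ndarray:
    """One Game of Life step via a 3x3 neighbour count with wraparound edges."""
    padded = np.pad(board, 1, mode='wrap')          # copy opposite edges, like circular padding
    height, width = board.shape
    neighbours = np.zeros_like(board)
    for dy in range(3):
        for dx in range(3):
            if dy == 1 and dx == 1:
                continue                            # skip the center cell
            neighbours += padded[dy:dy + height, dx:dx + width]
    alive = (neighbours == 3) | ((board == 1) & (neighbours == 2))
    return alive.astype(board.dtype)

# A blinker oscillates between a horizontal and a vertical bar
blinker = np.zeros((5, 5), dtype=np.int8)
blinker[2, 1:4] = 1
after_one = life_step_counting(blinker)     # vertical bar in column 2
after_two = life_step_counting(after_one)   # back to the horizontal bar
print(np.array_equal(after_two, blinker))   # True
```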

To correctly size the layers in the neural network, we need to think in terms of state complexity. 

The function only needs to know about its distance 1 nearest neighbours, which requires a single 3x3 convolution. There are 2^9 = 512 possible states given 3x3 binary pixels, however this function only needs to be able to count to either 2 or 3. If we do the maths, [(9 choose 2) == 36](https://www.wolframalpha.com/input/?i=9+choose+2) + [(9 choose 3) == 84](https://www.wolframalpha.com/input/?i=9+choose+3) = [120](https://www.wolframalpha.com/input/?i=sum%289+choose+n%29+for+n%3D2..3), and we can round that up to the nearest power of 2 (128) to give the network a little extra capacity.
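
The arithmetic behind the choice of 128 channels can be checked in a few lines. This is just a worked version of the sum above, using the standard library.

```python
import math

two_alive   = math.comb(9, 2)                 # ways to place 2 live cells in a 3x3 window
three_alive = math.comb(9, 3)                 # ways to place 3 live cells in a 3x3 window
total       = two_alive + three_alive
channels    = 1 << (total - 1).bit_length()   # round up to the next power of two

print(two_alive, three_alive, total, channels)  # 36 84 120 128
```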

In theory it might be possible for the neural network to implement `(n choose 3)` as `(n choose 2) && (n choose 1) && (n choose 1) not in (n choose 2)` by using a clever choice of state representations, which might require fewer states, but let's stick to 128 channels in the first layer for now.


Neural networks are capable of implementing multi-input boolean logic gates. NAND, NOR, and NOT can each be implemented in a single layer, while the constructions linked below use 2 layers for XNOR and 3 for XOR.
- https://medium.com/@stanleydukor/neural-representation-of-and-or-not-xor-and-xnor-logic-gates-perceptron-algorithm-b0275375fea1
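
As a rough illustration of single-neuron logic gates, here is a sketch using a hard step activation. The weights and biases are hand-picked for this example (many other values work), and the helper `neuron()` is hypothetical.

```python
import numpy as np

def neuron(inputs, weights, bias) -> int:
    """A single neuron with a hard step activation: fires when w.x + b > 0."""
    return int(np.dot(weights, inputs) + bias > 0)

# Single-layer gates
def AND(a, b):  return neuron([a, b], [ 1,  1], -1.5)
def OR(a, b):   return neuron([a, b], [ 1,  1], -0.5)
def NAND(a, b): return neuron([a, b], [-1, -1],  1.5)
def NOR(a, b):  return neuron([a, b], [-1, -1],  0.5)
def NOT(a):     return neuron([a],    [-1],      0.5)

# XNOR is not linearly separable, so it needs a second layer: OR(AND, NOR)
def XNOR(a, b): return OR(AND(a, b), NOR(a, b))

print('a b AND OR NAND NOR XNOR')
for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b), NAND(a, b), NOR(a, b), XNOR(a, b))
```

A trained network uses smooth activations rather than a hard step, so it has to learn weights that approximate these thresholds.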

Thus for a neural network trying to implement a boolean logic function, we need 3 layers in a pyramid design and let the output layer act as a final AND gate to choose between the following rule states, which will have been computed by the previous layers:
- 1 Alive + <2 neighbours  = 0 Dead
- 1 Alive + >3 neighbours  = 0 Dead
- 1 Alive + ==2 neighbours = 1 Alive
- 1 Alive + ==3 neighbours = 1 Alive
- 0 Dead  + <3 neighbours  = 0 Dead
- 0 Dead  + >3 neighbours  = 0 Dead
- 0 Dead  + ==3 neighbours = 1 Alive

ReLU `max(0,n)` has a very easy to compute derivative, `1 if n > 0 else 0`. It essentially implements a greater-than `>` or less-than `<` comparison, with distance given in only one direction (returning 0 for anything on the other side of the line). LeakyReLU `0.01*n if n < 0 else n`, however, can report distances in both directions by using a negative-side slope that is two orders of magnitude smaller.
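
For reference, here is a minimal NumPy sketch of both activations and their derivatives. The function names are chosen for this example, and the `slope=0.01` default matches the LeakyReLU formula quoted above.

```python
import numpy as np

def relu(n):
    return np.maximum(0.0, n)

def relu_grad(n):
    return np.where(n > 0, 1.0, 0.0)         # gradient vanishes on the negative side

def leaky_relu(n, slope=0.01):
    return np.where(n > 0, n, slope * n)

def leaky_relu_grad(n, slope=0.01):
    return np.where(n > 0, 1.0, slope)       # small but non-zero gradient on the negative side

n = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(n), relu_grad(n))
print(leaky_relu(n), leaky_relu_grad(n))
```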

We also add in dropout, which adds a layer of double-check redundancy to the network, as it needs to achieve 100% accuracy with any arbitrary 10% of the network being removed, thus every logic node must essentially have a backup.
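
To make the dropout idea concrete, here is a sketch of "inverted" dropout in NumPy, which zeroes a random fraction `p` of activations during training and rescales the survivors by `1/(1-p)`. The standalone `dropout()` function is a hypothetical stand-in for `nn.Dropout`, written only to show the mechanism.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: zero a fraction p of activations and rescale the rest."""
    if not training or p == 0.0:
        return x                              # identity in eval mode, or when disabled
    keep = rng.random(x.shape) >= p           # True for the activations that survive
    return x * keep / (1.0 - p)

rng         = np.random.default_rng(0)
activations = np.ones(10_000)
train_out   = dropout(activations, p=0.1, training=True,  rng=rng)
eval_out    = dropout(activations, p=0.1, training=False, rng=rng)

print(f'dropped during training: {(train_out == 0).mean():.1%}')
print(f'unchanged in eval mode:  {np.array_equal(eval_out, activations)}')
```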

The `__init__()` function simply defines the layers, and the `forward()` function calls them in sequence, adding `F.leaky_relu()` and `self.dropout()` between them. `torch.cat()` is used to append the original X value to the inputs of the final output layer, which is passed through a `sigmoid()` function to produce an output in the 0-1 domain, which is what we want for predicting `1 = Alive` or `0 = Dead` output cells.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class GameOfLifeForward(GameOfLifeBase):
    """
    This implements the life_step() function as a Neural Network function
    Trained/tested to 100% accuracy over 10,000 random boards
    """
    def __init__(self):
        super().__init__()
        self.dropout = nn.Dropout(p=0.0)  # disabling dropout as it causes excessive (hour+) runtimes to reach 100% accuracy 

        # epoch: 240961 | board_count: 6024000 | loss: 0.0000000000 | accuracy = 1.0000000000 | time: 0.611ms/board | with dropout=0.1
        # Finished Training: GameOfLifeForward - 240995 epochs in 3569.1s
        self.conv1   = nn.Conv2d(in_channels=1,     out_channels=128, kernel_size=(3,3), padding=1, padding_mode='circular')
        self.conv2   = nn.Conv2d(in_channels=128,   out_channels=16,  kernel_size=(1,1))
        self.conv3   = nn.Conv2d(in_channels=16,    out_channels=8,   kernel_size=(1,1))
        self.conv4   = nn.Conv2d(in_channels=1+8,   out_channels=1,   kernel_size=(1,1))
        self.apply(self.weights_init)


    def forward(self, x):
        x = input = self.cast_inputs(x)

        x = self.conv1(x)
        x = F.leaky_relu(x)

        x = self.conv2(x)
        x = F.leaky_relu(x)
        x = self.dropout(x)

        x = self.conv3(x)
        x = F.leaky_relu(x)
        x = self.dropout(x)

        x = torch.cat([ x, input ], dim=1)  # remind CNN of the center cell state before making final prediction
        x = self.conv4(x)
        x = torch.sigmoid(x)

        return x

# Training Loop

To be honest, this could have been written in a more object-oriented and reusable fashion, however this has the advantage of showing the full training loop logic in one place.

First we load the model, and call the `.train()` and `.unfreeze()` functions to set the model into training mode. Pytorch by default creates models in this state, but my base class has decided to default to production mode with calls to `.eval()` and `.freeze()` being called as part of the `.load()` function. Note that `.freeze()`, `.unfreeze()`, `.load()` and `.save()` are not native to pytorch and have been custom defined in the `GameOfLifeBase` class. 

The optimizer chosen is RMSprop with momentum, using a CyclicLR scheduler whose amplitude decays over time. The criterion loss function is MSELoss, which is defined in the `GameOfLifeBase` class.

I have observed that when using CyclicLR, it is possible to set a higher maximum learning rate, because training starts off slow at `base_lr` and makes sure the gradient is pointing in the right direction before ramping up towards `max_lr` over 250 epochs (`step_size_up=250`), which helps speed up the training. The code below uses `base_lr=1e-4` and `max_lr=1e-2`. Combined with dropout, this helps prevent the neural network from getting stuck on solutions such as everything=1 or everything=0.
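
To get a feel for the schedule, here is a pure Python sketch of a triangular cyclic learning rate whose amplitude is scaled by `gamma**step`. This is my reading of the `exp_range` mode and is an assumption, not a copy of the PyTorch implementation; the function `cyclic_lr()` is hypothetical and its defaults mirror the scheduler config used in the training loop.

```python
def cyclic_lr(step, base_lr=1e-4, max_lr=1e-2, step_size_up=250, gamma=0.8):
    """Triangular cyclic learning rate with amplitude scaled by gamma**step (assumed exp_range behaviour)."""
    cycle_length = 2 * step_size_up                       # equal number of steps up and down
    position     = (step % cycle_length) / step_size_up   # 0..2 within the current cycle
    triangle     = 1.0 - abs(position - 1.0)              # ramps 0 -> 1 -> 0
    return base_lr + (max_lr - base_lr) * triangle * gamma ** step

for step in (0, 5, 50, 250, 500):
    print(f'step {step:3d}: lr = {cyclic_lr(step):.6f} | without decay = {cyclic_lr(step, gamma=1.0):.6f}')
```

If this reading is right, `gamma=0.8` shrinks the amplitude very quickly, so the learning rate stays close to `base_lr` after the first few dozen steps; a `gamma` much closer to 1 would be needed for the rate to actually approach `max_lr`.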

Most of the code in the loop is just data generation, keeping track of statistics, as well as exit conditions such as timeout or 100% accuracy over the last 10,000 boards.

The code also contains unused implementations for computing L1 and L2 regularization losses by simply looping over the model parameters and summing up the weights.


The meat of the training loop is simply:
```
optimizer.zero_grad()
outputs = model(inputs)
loss    = model.loss(outputs, expected, inputs)
loss.backward()
optimizer.step()
```

The `.loss()` and `.accuracy()` functions are defined in the `GameOfLifeBase` class:
```
class GameOfLifeBase:
    def loss(self, outputs, expected, input):
        self.criterion = nn.MSELoss()
        return self.criterion(outputs, expected)
        
    def accuracy(self, outputs, expected, inputs) -> float:
        return torch.sum(self.cast_int(outputs) == self.cast_int(expected)).cpu().numpy() / np.prod(outputs.shape)
```

Note that the `accuracy()` metric is checking for an exact match (after int rounding) between the output and expected state, whereas the `loss()` function is providing a Mean Squared Error heuristic for the number of incorrect pixels and the distance from the neural network's float outputs to the expected state.
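
The difference between the two metrics is easiest to see with a tiny worked example. The `outputs` values below are made up for illustration: every cell lands on the correct side of 0.5, so accuracy is 100%, yet the loss is still non-zero.

```python
import numpy as np

expected = np.array([1.0, 0.0, 1.0, 0.0])
outputs  = np.array([0.6, 0.4, 0.9, 0.1])   # hypothetical network outputs

accuracy = np.mean((outputs > 0.5) == (expected > 0.5))   # exact match after thresholding at 0.5
mse_loss = np.mean((outputs - expected) ** 2)             # still penalizes distance from 0 or 1

print(f'accuracy = {accuracy:.3f}')   # accuracy = 1.000
print(f'mse_loss = {mse_loss:.3f}')   # mse_loss = 0.085
```

This is why the loss can keep falling long after the accuracy has reached its maximum.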

In [None]:
import atexit
import sys
import time

import numpy as np
import torch
import torch.optim as optim
from torch import tensor


def train(model, batch_size=25, l1=0, l2=0, timeout=0, reverse_input_output=False):
    print(f'Training: {model.__class__.__name__}')
    time_start = time.perf_counter()

    atexit.register(model.save)      # save on exit - BrokenPipeError doesn't trigger finally:
    model.load().train().unfreeze()  # enable training and dropout

    # NOTE: criterion loss function now defined via model.loss()
    optimizer = optim.RMSprop(model.parameters(), lr=0.01, momentum=0.9)

    # epoch: 14481 | board_count: 362000 | loss: 0.0000385726 | accuracy = 0.9990336000 | time: 0.965ms/board
    scheduler = None

    # epoch: 240961 | board_count: 6024000 | loss: 0.0000000000 | accuracy = 1.0000000000 | time: 0.611ms/board
    # Finished Training: GameOfLifeForward - 240995 epochs in 3569.1s
    scheduler_config = {
        'optimizer': optimizer,
        'max_lr':       1e-2,
        'base_lr':      1e-4,
        'step_size_up': 250,
        'mode':         'exp_range',
        'gamma':        0.8
    }
    scheduler = torch.optim.lr_scheduler.CyclicLR(**scheduler_config)

    success_count = 10_000
    
    epoch        = 0
    board_count  = 0
    last_loss    = np.inf
    loop_loss    = 0
    loop_acc     = 0
    loop_count   = 0
    epoch_losses     = [last_loss]
    epoch_accuracies = [ 0 ]
    num_params = torch.sum(torch.tensor([
        torch.prod(torch.tensor(param.shape))
        for param in model.parameters()
    ]))
    try:
        for epoch in range(1, sys.maxsize):
            if np.min(epoch_accuracies[-success_count//batch_size:]) == 1.0:   break  # multiple epochs of 100% accuracy to pass
            if timeout and timeout < time.perf_counter() - time_start:  break
            epoch_start = time.perf_counter()

            inputs_np   = [ generate_random_board() for _     in range(batch_size) ]
            expected_np = [ life_step(board)        for board in inputs_np         ]
            inputs      = model.cast_inputs(inputs_np).to(device)
            expected    = model.cast_inputs(expected_np).to(device)

            # This is for GameOfLifeReverseOneStep() function, where we are trying to learn the reverse function
            if reverse_input_output:
                inputs_np, expected_np = expected_np, inputs_np
                inputs,    expected    = expected,    inputs
                assert np.all( life_step(expected_np[0]) == inputs_np[0] )


            optimizer.zero_grad()
            outputs = model(inputs)
            loss    = model.loss(outputs, expected, inputs)
            if l1 or l2:
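                # torch.stack() keeps the autograd graph; wrapping the sums in tensor([...]) would detach the penalties
                l1_loss = torch.stack([ torch.sum(torch.abs(param)) for param in model.parameters() ]).sum() / num_params
                l2_loss = torch.stack([ torch.sum(param**2)         for param in model.parameters() ]).sum() / num_params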
                loss   += ( l1_loss * l1 ) + ( l2_loss * l2 )

            loss.backward()
            optimizer.step()
            if scheduler is not None:
                # scheduler.step(loss)  # only required for
                scheduler.step()

            # noinspection PyTypeChecker
            last_accuracy = model.accuracy(outputs, expected, inputs)  # torch.sum( outputs.to(torch.bool) == expected.to(torch.bool) ).cpu().numpy() / np.prod(outputs.shape)
            last_loss     = loss.item() / batch_size

            epoch_losses.append(last_loss)
            epoch_accuracies.append( last_accuracy )

            loop_loss   += last_loss
            loop_acc    += last_accuracy
            loop_count  += 1
            board_count += batch_size
            epoch_time   = time.perf_counter() - epoch_start
            time_taken   = time.perf_counter() - time_start

            # Print statistics after each epoch
            if( (epoch <= 10)
             or (np.log10(board_count) % 1 == 0)
             or (board_count <  10_000 and board_count %  1_000 == 0)  
             or (board_count < 100_000 and board_count % 10_000 == 0) 
             or (board_count % 100_000 == 0)
            ):
                print(f'epoch: {epoch:6d} | board_count: {board_count:7d} | loss: {loop_loss/loop_count:.10f} | accuracy = {loop_acc/loop_count:.10f} | time: {1000*epoch_time/batch_size:.3f}ms/board | {time_taken//60:2.0f}m {time_taken%60:02.0f}s')
                loop_loss  = 0
                loop_acc   = 0
                loop_count = 0
        print(f'Successfully trained to 100% accuracy over the last {success_count} boards')
        print(f'epoch: {epoch:6d} | board_count: {board_count:7d} | loss: {np.mean(epoch_losses[-success_count//batch_size:]):.10f} | accuracy = {np.min(epoch_accuracies[-success_count//batch_size:]):.10f} | time: {1000*epoch_time/batch_size:.3f}ms/board')
                
    except (BrokenPipeError, KeyboardInterrupt):
        pass
    except Exception as exception:
        print(exception)
        raise exception
    finally:
        time_taken = time.perf_counter() - time_start
        print(f'Finished Training: {model.__class__.__name__} - {epoch} epochs in {time_taken:.1f}s')
        model.save()
        atexit.unregister(model.save)   # model now saved, so cancel atexit handler
        # model.eval()                  # disable dropout

# Training

Here we run the training loop. 

Learnings:
- There is a long wait time between 99.999% accuracy and 100%, which is due to the network figuring out how to implement robust redundancy through the dropout layer
- `max_lr=1`, even using CyclicLR, is far too high, and was causing weights to get stuck in local minima such as all zeros or all ones. Lowering the learning rate produced more stable training cycles.
- Conv2d weights should be explicitly initialized using `nn.init.kaiming_normal_()`. Before implementing this, I had observed that even with a notebook restart, there would sometimes be very high accuracy levels after epoch 1, leading me to suspect that CUDA memory was being reused.
- One interesting observation from my localhost experiments is that restarting training during the era of 99.999% accuracy will instantly result in a 100% accuracy score. I have not been able to explain this effect yet, so please comment below if you understand it.


In [None]:
# !rm ./models/GameOfLifeForward.pth
model = GameOfLifeForward()
train(model)

# Testing

Did you know that you can write unit tests for Neural Networks?

We are asserting 100% accuracy for our neural network, so let's see if it can correctly predict a million boards in a row.

Note that the `GameOfLifeBase` class defaults to `.eval()` production mode, which disables the dropout layers (which are only applied during training).

In [None]:
def test_GameOfLifeForward_single():
    """
    Test GameOfLifeForward().predict() works with single board datastructure semantics
    """
    model = GameOfLifeForward()

    input    = generate_random_board()
    expected = life_step(input)
    output   = model.predict(input)
    assert np.all( output == expected )  # assert 100% accuracy



def test_GameOfLifeForward_batch(count=1000):
    """
    Test GameOfLifeForward().predict() also works with batch-mode datastructure semantics
    Batch mode is limited to only 1000 boards at a time, which prevents CUDA out of memory exceptions
    """
    model = GameOfLifeForward()

    # As well as in batch mode
    for _ in range(max(1,count//1000)):
        inputs   = generate_random_boards(1000)
        expected = life_steps(inputs)
        outputs  = model.predict(inputs)
        assert np.all( outputs == expected )  # assert 100% accuracy

In [None]:
# GameOfLifeForward can successfully predict a million boards in a row correctly
time_start = time.perf_counter()
test_GameOfLifeForward_single()
test_GameOfLifeForward_batch(1_000_000)
time_taken = time.perf_counter() - time_start
print(f'All tests passed! ({time_taken:.0f}s)')

# Profiling

So how fast are GPU neural networks really?

We can write a profiling loop and speed test them against our range of classical implementations.

It is also worth noting that neural networks can be used to either predict a single output, or in batch mode to predict an array of input data. Due to the Python overhead of function calling and getting the data into and out of CUDA, batch mode is where neural network performance really shines, and in this case it can even outperform `numpy.roll()`, numba `@njit`, and `scipy.signal.convolve2d()`.
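
The loop-versus-batch effect is not specific to neural networks. As a rough illustration, here is a NumPy-only sketch comparing a per-board loop against one vectorized call over the whole batch. The functions `life_step_single()` and `life_step_batch()` are written for this example and are not the implementations profiled below; timings will vary by machine, so none are quoted here.

```python
import timeit
import numpy as np

def life_step_single(board):
    """One step for a single (height, width) board using np.roll."""
    n = sum(np.roll(np.roll(board, i, 0), j, 1)
            for i in (-1, 0, 1) for j in (-1, 0, 1) if (i, j) != (0, 0))
    return ((n == 3) | ((board == 1) & (n == 2))).astype(np.int8)

def life_step_batch(boards):
    """One step for a whole (batch, height, width) array in a single vectorized call."""
    n = sum(np.roll(np.roll(boards, i, 1), j, 2)
            for i in (-1, 0, 1) for j in (-1, 0, 1) if (i, j) != (0, 0))
    return ((n == 3) | ((boards == 1) & (n == 2))).astype(np.int8)

rng    = np.random.default_rng(0)
boards = rng.integers(0, 2, size=(1000, 25, 25), dtype=np.int8)

loop_time  = timeit.timeit(lambda: [life_step_single(b) for b in boards], number=1)
batch_time = timeit.timeit(lambda: life_step_batch(boards),               number=1)
print(f'loop  = {1e6 * loop_time  / len(boards):.1f}µs/board')
print(f'batch = {1e6 * batch_time / len(boards):.1f}µs/board')
```

Both versions produce identical boards; the batch version simply pays the Python call overhead once instead of once per board.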

In [None]:
# Discovery:
# neural networks in batch mode can be faster than C compiled classical code, but slower when called in a loop
# also scipy.signal.convolve2d() runs compiled C code under the hood (it does not use numba)
#
# number: 1 | batch_size = 1000
# life_step() - numpy         = 181.7µs
# life_step() - scipy         = 232.1µs
# life_step() - njit          = 433.5µs   # includes @njit compile times
# gameOfLifeForward() - loop  = 567.7µs
# gameOfLifeForward() - batch = 1807.5µs  # includes .pth file loadtimes
#
# number: 100 | batch_size = 1000
# gameOfLifeForward() - batch =  29.8µs  # faster than even @njit or scipy
# life_step() - scipy         =  35.6µs
# life_step() - njit          =  42.8µs
# life_step() - numpy         = 180.8µs
# gameOfLifeForward() - loop  = 618.3µs  # much slower, but still fast compared to an expensive function

def profile_GameOfLifeForward():
    import timeit
    import operator

    model  = GameOfLifeForward().load().to(device)
    for batch_size in [1, 10_000]:
        boards = generate_random_boards(batch_size)
        number = 1
        timings = {
            'gameOfLifeForward() - batch': timeit.timeit(lambda:   model(boards),                                number=number),
            'gameOfLifeForward() - loop':  timeit.timeit(lambda: [ model(board)           for board in boards ], number=number),
            'life_step() - njit':          timeit.timeit(lambda: [ life_step_njit(board)  for board in boards ], number=number),
            'life_step() - numpy':         timeit.timeit(lambda: [ life_step_numpy(board) for board in boards ], number=number),
            'life_step() - scipy':         timeit.timeit(lambda: [ life_step_scipy(board) for board in boards ], number=number),
        }
        print(f'{device} | batch_size = {len(boards)}')
        for key, value in sorted(timings.items(), key=operator.itemgetter(1)):
            print(f'- {key:27s} = {value/number/len(boards) * 1_000:7.3f}ms')
        print()

In [None]:
device = torch.device("cpu")
profile_GameOfLifeForward()

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    profile_GameOfLifeForward()

# Further Reading

I have written an interactive playable demo of the forward version of this game in React Javascript:
- https://life.jamesmcguigan.com/


This notebook is part of a series exploring neural network implementations of the Game of Life Forward Problem:
- [Pytorch Game of Life - First Attempt](https://www.kaggle.com/jamesmcguigan/pytorch-game-of-life-first-attempt)
- [Pytorch Game of Life - Hardcoding Network Weights](https://www.kaggle.com/jamesmcguigan/pytorch-game-of-life-hardcoding-network-weights)
- [Its Easy for Neural Networks To Learn Game of Life](https://www.kaggle.com/jamesmcguigan/its-easy-for-neural-networks-to-learn-game-of-life)

This is preliminary research towards the harder Reverse Game of Life problem, for which I have already designed a novel Ouroboros loss function:
- [OuroborosLife - Function Reversal GAN](https://www.kaggle.com/jamesmcguigan/ouroboroslife-function-reversal-gan)


I also have an extended series of notebooks exploring different approaches to the Reverse Game of Life problem.

My first attempt was to use the Z3 Constraint Satisfaction SAT solver. This gets 100% accuracy on most boards, but there are a few which it cannot solve. This approach can be slow for boards with large cell counts and large deltas. I managed to figure out how to get cluster compute working inside Kaggle Notebooks, but this solution is estimated to require 10,000+ hours of CPU time to complete.    
- [Game of Life - Z3 Constraint Satisfaction](https://www.kaggle.com/jamesmcguigan/game-of-life-z3-constraint-satisfaction)

The second approach was to create a Geometrically Invariant Hash function using Summable Primes, then use forward play and a dictionary lookup table to create a database of known states. For known input/output states at a given delta, the problem is reduced to simply solving the geometric transform between inputs and applying the same function to the outputs. The Hashmap Solver was able to solve about 10% of the test dataset.
- [Summable Primes](https://www.kaggle.com/jamesmcguigan/summable-primes)
- [Geometric Invariant Hash Functions](https://www.kaggle.com/jamesmcguigan/geometric-invariant-hash-functions)
- [Game of Life - Repeating Patterns](https://www.kaggle.com/jamesmcguigan/game-of-life-repeating-patterns)
- [Game of Life - Hashmap Solver](https://www.kaggle.com/jamesmcguigan/game-of-life-hashmap-solver)
- [Game of Life - Image Segmentation Solver](https://www.kaggle.com/jamesmcguigan/game-of-life-image-segmentation-solver)