# ConnectX - Implementing Functions in Pytorch

According to the [Universal Approximation Theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem), a sufficiently large neural network, given a sufficiently large sample of inputs and outputs, should be capable of approximating any continuous mathematical function.

Many of the use cases presented as Kaggle competitions involve using neural networks to model functions that would be impossible to code classically by hand.

This notebook covers the opposite use case: creating a neural network implementation of a Python function that we already have the source code for, using reinforcement learning to (hopefully) achieve 100% accuracy, and then comparing the performance of the two implementations.

# Bitshifting Implementation

The example function we will be attempting to reverse-engineer is `is_gameover()`, which was described in detail in my
[Vectorized Bitshifting Tutorial](https://www.kaggle.com/jamesmcguigan/connectx-vectorized-bitshifting-tutorial/).

The function is a combination of two edge cases for detecting the endgame state of a Connect4 game:
1. Win/Loss  - One of the players has achieved 4-in-a-row
2. Stalemate - All the squares on the board have been played, and there are no more valid actions left 
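As a rough illustration of the two checks listed above, here is a plain-Python sketch of the mask test. It assumes the same layout as the notebook code (bit index = `col + row * columns`, with row 0 as the top row) and, for brevity, covers only the horizontal direction. The helper names `horizontal_masks`, `has_horizontal_win` and `is_board_full` are hypothetical and are not part of the original source.

```python
# Plain-Python sketch of the bitmask tests (no numpy / numba required)
COLUMNS, ROWS, INAROW = 7, 6, 4

def horizontal_masks():
    """All horizontal 4-in-a-row masks; bit index = col + row * COLUMNS."""
    base = (1 << INAROW) - 1  # 0b1111
    return [
        base << (col + row * COLUMNS)
        for row in range(ROWS)
        for col in range(COLUMNS - INAROW + 1)  # stop before the mask wraps onto the next row
    ]

def has_horizontal_win(tokens: int) -> bool:
    """True if every bit of at least one mask is set in `tokens`."""
    return any((tokens & mask) == mask for mask in horizontal_masks())

def is_board_full(played: int) -> bool:
    """Stalemate test: the top row (the lowest 7 bits) is completely filled."""
    top_row = (1 << COLUMNS) - 1
    return (played & top_row) == top_row

bottom_row = (ROWS - 1) * COLUMNS
four  = sum(1 << (bottom_row + col) for col in range(1, 5))  # columns 1-4 on the bottom row
three = sum(1 << (bottom_row + col) for col in range(1, 4))  # columns 1-3 on the bottom row
print(has_horizontal_win(four), has_horizontal_win(three))   # True False
```

The pattern `(tokens & mask) == mask` is the same test that `get_winner()` and `has_no_more_moves()` apply below, just without the vectorization over all masks at once.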

This is the bitshifting code required to implement this function:

In [None]:
# Source: https://github.com/JamesMcGuigan/ai-games/blob/master/games/connectx/core/ConnectXBBNN.py

import itertools
import random
import time
import numpy as np
from numba import njit, int8, int64
from collections import defaultdict, namedtuple
from typing import Union, Tuple, List



Configuration = namedtuple('configuration', ['rows', 'columns', 'inarow', 'steps', 'timeout'])
configuration = Configuration(**{'columns': 7, 'rows': 6, 'inarow': 4, 'steps': 1000, 'timeout': 8})


def is_gameover(bitboard: np.ndarray) -> bool:
    if has_no_more_moves(bitboard):  return True
    if get_winner(bitboard) != 0:    return True
    return False


def has_no_more_moves(bitboard: np.ndarray) -> bool:
    """If all the squares on the top row have been played, then there are no more moves"""
    return bitboard[0] & mask_legal_moves == mask_legal_moves 


def get_winner(bitboard: np.ndarray) -> int:
    """ Endgame get_winner: 0 = no winner, 1 = player 1, 2 = player 2"""
    p2_wins = (bitboard[0] &  bitboard[1]) & gameovers[:] == gameovers[:]
    if np.any(p2_wins): return 2
    p1_wins = (bitboard[0] & ~bitboard[1]) & gameovers[:] == gameovers[:]
    if np.any(p1_wins): return 1
    return 0


def get_gameovers() -> np.ndarray:
    """Creates a list of all winning board positions, over 4 directions: horizontal, vertical and 2 diagonals"""
    rows    = configuration.rows
    columns = configuration.columns
    inarow  = configuration.inarow

    gameovers = []

    mask_horizontal  = 0
    mask_vertical    = 0
    mask_diagonal_dl = 0
    mask_diagonal_ul = 0
    for n in range(inarow):  # use prange() with numba(parallel=True)
        mask_horizontal  |= 1 << n
        mask_vertical    |= 1 << n * columns
        mask_diagonal_dl |= 1 << n * columns + n
        mask_diagonal_ul |= 1 << n * columns + (inarow - 1 - n)

    row_inner = rows    - inarow
    col_inner = columns - inarow
    for row in range(rows):         # use prange() with numba(parallel=True)
        for col in range(columns):  # use prange() with numba(parallel=True)
            offset = col + row * columns
            if col <= col_inner:
                gameovers.append( mask_horizontal << offset )
            if row <= row_inner:
                gameovers.append( mask_vertical << offset )
            if col <= col_inner and row <= row_inner:
                gameovers.append( mask_diagonal_dl << offset )
                gameovers.append( mask_diagonal_ul << offset )
    return np.array(gameovers, dtype=np.int64)  # return an ndarray, matching the declared return type
gameovers = get_gameovers()



### Utility Functions

def empty_bitboard() -> np.ndarray:
    bitboard = np.array([0, 0], dtype=np.int64)
    return bitboard


def is_bitboard(bitboard) -> bool:
    return isinstance(bitboard, np.ndarray) and bitboard.dtype == np.int64 and bitboard.shape == (2,)


@njit
def list_to_bitboard(listboard: Union[np.ndarray,List[int]]) -> np.ndarray:
    # bitboard[0] = played, is the square filled     | 0 = empty,    1 = filled
    # bitboard[1] = player, whose token is this      | 0 = player 1, 1 = player 2 (only meaningful if filled)
    bitboard_played = 0  # 42 bit number for if board square has been played
    bitboard_player = 0  # 42 bit number for player 0=p1 1=p2
    if isinstance(listboard, np.ndarray): listboard = listboard.flatten()
    for n in range(len(listboard)):  # prange
        if listboard[n] != 0:
            bitboard_played |= (1 << n)        # is a square filled (0 = empty | 1 = filled)
            if listboard[n] == 2:
                bitboard_player |= (1 << n)    # mark as player 2 square, else assume p1=0 as default
    bitboard = np.array([bitboard_played, bitboard_player], dtype=np.int64)
    return bitboard


@njit(int8[:,:](int64[:]))
def bitboard_to_numpy2d(bitboard: np.ndarray) -> np.ndarray:
    global configuration
    rows    = configuration.rows
    columns = configuration.columns
    size    = rows * columns
    output  = np.zeros((size,), dtype=np.int8)
    for i in range(size):  # prange
        is_played = (bitboard[0] >> i) & 1
        if is_played:
            player = (bitboard[1] >> i) & 1
            output[i] = 1 if player == 0 else 2
    return output.reshape((rows, columns))



# Use string reverse to create mirror bit lookup table: mirror_bits[ 0100000 ] == 0000010
mirror_bits = np.array([
    int( "".join(reversed(f'{n:07b}')), 2 )
    for n in range(2**configuration.columns)
], dtype=np.int64)

@njit
def mirror_bitstring( bitstring: int ) -> int:
    """ Return the mirror view of the board for hashing:  0100000 -> 0000010 """
    global configuration

    if bitstring == 0:
        return 0  # short-circuit for empty board

    bitsize     = configuration.columns * configuration.rows        # total number of bits to process
    unit_size   = configuration.columns                             # size of each row in bits
    unit_mask   = (1 << unit_size) - 1                              # == 0b1111111 | 0x7f
    offsets     = np.arange(0, bitsize, unit_size, dtype=np.int64)  # == [ 0, 7, 14, 21, 28, 35 ]

    # This can technically be done as a one liner:
    output = np.sum( mirror_bits[ (bitstring & (unit_mask << offsets)) >> offsets ] << offsets )

    return int(output)


@njit
def mirror_bitboard( bitboard: np.ndarray ) -> np.ndarray:
    return np.array([
        mirror_bitstring(bitboard[0]),
        mirror_bitstring(bitboard[1]),
    ], dtype=bitboard.dtype)


@njit
def get_bitcount_mask(size: int = configuration.columns * configuration.rows) -> np.ndarray:
    # return np.array([1 << index for index in range(0, size)], dtype=np.int64)
    return 1 << np.arange(0, size, dtype=np.int64)
bitcount_mask = get_bitcount_mask()


@njit
def get_move_number(bitboard: np.ndarray) -> int:
    global configuration
    if bitboard[0] == 0: return 0
    size          = configuration.columns * configuration.rows
    mask_bitcount = get_bitcount_mask(size)
    move_number   = np.count_nonzero(bitboard[0] & mask_bitcount)
    return move_number


@njit
def current_player_id( bitboard: np.ndarray ) -> int:
    """ Returns next player to move: 1 = p1, 2 = p2 """
    move_number = get_move_number(bitboard)
    next_player = 1 if move_number % 2 == 0 else 2  # player 1 has the first move on an empty board
    return next_player


@njit
def next_player_id(player_id: int) -> int:
    assert player_id in [1,2]
    return 1 if player_id == 2 else 2


@njit
def get_next_index(bitboard: np.ndarray, action: int) -> int:
    global configuration
    assert is_legal_move(bitboard, action)

    # Start at the ground, and return first row that contains a 0
    for row in range(configuration.rows-1, -1, -1):
        index = action + (row * configuration.columns)
        value = (bitboard[0] >> index) & 1
        if value == 0:
            return index
    return action  # this should never happen - implies not is_legal_move(action)


@njit
def result_action(bitboard: np.ndarray, action: int, player_id: int) -> np.ndarray:
    assert is_legal_move(bitboard, action)
    index    = get_next_index(bitboard, action)
    mark     = 0 if player_id == 1 else 1
    output = np.array([
        bitboard[0] | 1    << index,
        bitboard[1] | mark << index
    ], dtype=bitboard.dtype)
    return output


mask_board       = (1 << configuration.columns * configuration.rows) - 1
mask_legal_moves = (1 << configuration.columns) - 1        
_is_legal_move_mask  = ((1 << configuration.columns) - 1)
_is_legal_move_cache = np.array([ 
    [ int( (bits >> action) & 1 == 0 ) for action in range(configuration.columns) ]
    for bits in range(2**configuration.columns)
], dtype=np.int8)
@njit
def is_legal_move(bitboard: np.ndarray, action: int) -> int:
    bits = bitboard[0] & _is_legal_move_mask   # faster than: int( (bitboard[0] >> action) & 1 == 0 )
    return _is_legal_move_cache[bits, action]  # NOTE: [bits,action] is faster than [bits][action]


@njit
def get_random_move(bitboard: np.ndarray) -> int:
    """ This is slightly quicker than random.choice(get_all_moves())"""
    while True:
        action = np.random.randint(0, configuration.columns)
        if is_legal_move(bitboard, action):
            return action

# Neural Network Implementation

Given that we have a source code implementation of the function, we can generate a supply of training examples. Let's see if we can train a neural network to learn this function.

First we define a base class to handle the mechanics of loading and saving model files, as well as an interface for casting between bitboard and pytorch data formats.

This improves code cleanliness, as the neural network model configuration will be defined in a subclass, uncluttered by infrastructure logic.
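To make the casting step more concrete: the base class below one-hot encodes each square of the board into 3 classes before passing it to the network. The following is a minimal plain-Python sketch of that encoding, without PyTorch. The names `one_hot` and `encode_board` are hypothetical helpers for illustration only; the notebook itself uses `F.one_hot()`.

```python
from typing import List

ROWS, COLUMNS, NUM_CLASSES = 6, 7, 3  # classes: 0 = empty, 1 = player 1, 2 = player 2

def one_hot(value: int, num_classes: int = NUM_CLASSES) -> List[float]:
    """Encode a single square as a vector with a 1.0 at index `value`."""
    return [1.0 if value == k else 0.0 for k in range(num_classes)]

def encode_board(board: List[List[int]]) -> List[List[List[float]]]:
    """Encode a rows x columns board as rows x columns x num_classes."""
    return [[one_hot(cell) for cell in row] for row in board]

board = [[0] * COLUMNS for _ in range(ROWS)]
board[5][3] = 1  # player 1 plays the middle column
board[4][3] = 2  # player 2 stacks on top
encoded = encode_board(board)

# Flattened, this is the input size of a dense layer: 6 * 7 * 3
flat = [x for row in encoded for cell in row for x in cell]
print(len(flat))  # 126
```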

In [None]:
# Copy save files from previous notebook
!cp -v ../input/*/*.pth ./

In [None]:
import torch
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
device

In [None]:
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
__file__ = './undefined.py'


class BitboardNN(nn.Module):
    """Base class for bitboard based NNs, handles casting inputs and save/load functionality"""

    def __init__(self):
        super().__init__()
        self.device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

    def cast(self, x):
        if is_bitboard(x):
            x = bitboard_to_numpy2d(x)
        x = torch.from_numpy(x).to(torch.int64)  # int64 required for functional.one_hot()
        x = F.one_hot(x, num_classes=self.one_hot_size)
        x = x.to(torch.float32)                  # float32 required for self.fc1(x)
        x = x.to(device)  
        return x  # x.shape = (6,7,3) for a single board: (rows, columns, one_hot_size)

    def cast_to_labels(self, expected):
        labels = torch.tensor([ expected ], dtype=torch.float).to(device)  # nn.MSELoss() == float | nn.CrossEntropyLoss() == long
        return labels

    def cast_from_outputs(self, outputs: torch.Tensor) -> bool:
        """ convert (1,1) tensor back to bool """
        actual = bool( round( outputs.data.cpu().numpy().flatten()[0] ) )
        return actual

    @property
    def filename(self):
        return os.path.join( os.path.dirname(__file__), f'{self.__class__.__name__}.pth')

    # DOCS: https://pytorch.org/tutorials/beginner/saving_loading_models.html
    def save(self):
        torch.save(self.state_dict(), self.filename)

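    def load(self):
        if os.path.exists(self.filename):
            # Ignore errors caused by model size mismatch
            try:
                # map_location allows weights saved on a GPU to be loaded on a CPU-only machine
                self.load_state_dict(torch.load(self.filename, map_location=self.device))
                self.eval()
            except Exception as exception:
                print(f'{self.__class__.__name__}.load() ignored: {exception}')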

# Dense Layer Model

The first approach to try is a dense only model, with 3 hidden layers of 128 nodes, which outputs a single scalar.
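As a quick sanity check on the size of this model, the number of trainable parameters can be worked out by hand. This is a small arithmetic sketch, assuming a one-hot encoded input of 6 x 7 x 3 = 126 values and fully connected layers with biases; `dense_parameter_count` is a hypothetical helper.

```python
def dense_parameter_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out  # weight matrix + bias vector
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

# input (6*7*3) -> 3 hidden layers of 128 -> single scalar output
layer_sizes = [6 * 7 * 3, 128, 128, 128, 1]
total = dense_parameter_count(layer_sizes)
print(total)  # 49409
```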

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F


# self.model_size = 128 | game:  100000 | move:  2130207 | loss: 0.137 | accuracy: 0.810 / 0.953 | time: 519s
# self.model_size = 128 | game:  200000 | move:  2132342 | loss: 0.134 | accuracy: 0.834 / 0.953
# self.model_size = 128 | game: 1000000 | move: 17053998 | loss: 0.099 | accuracy: 0.890 / 0.953 | time: 4274.1s
class IsGameoverSquareNN(BitboardNN):
    def __init__(self):
        super().__init__()
        self.device       = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
        self.one_hot_size = 3
        self.input_size   = configuration.rows * configuration.columns
        self.output_size  = 1
        self.model_size   = 128

        self.fc1    = nn.Linear(self.input_size * self.one_hot_size, self.model_size)
        self.fc2    = nn.Linear(self.model_size, self.model_size)
        self.fc3    = nn.Linear(self.model_size, self.model_size)
        self.output = nn.Linear(self.model_size, self.output_size)

    def cast(self, x):
        x = super().cast(x)
        x = x.view(-1, self.input_size * self.one_hot_size)
        return x

    def forward(self, x):
        x = self.cast(x)                   # x.shape = (1,126)
        x = F.leaky_relu(self.fc1(x))      # x.shape = (1,128)
        x = F.leaky_relu(self.fc2(x))      # x.shape = (1,128)
        x = F.leaky_relu(self.fc3(x))      # x.shape = (1,128)
        x = torch.sigmoid(self.output(x))  # x.shape = (1,1)
        x = x.view(1)                      # x.shape = (1,)  | return 1d array of outputs, to match isGameoverCNN
        return x


isGameoverSquareNN = IsGameoverSquareNN()
isGameoverSquareNN.to(device)

# CNN Model

A second approach is to try a CNN architecture. Connect4 functions lend themselves naturally to a 4x4 sized CNN kernel, since every possible 4-in-a-row line (horizontal, vertical or diagonal) fits inside a 4x4 window.
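The layer sizes used by the CNN follow from the standard convolution output-size formula. Here is a short arithmetic sketch, assuming no padding and a stride of 1 (the `nn.Conv2d` defaults) and the 26 output channels used in the model; `conv2d_output_shape` is a hypothetical helper.

```python
def conv2d_output_shape(height, width, kernel_size, stride=1, padding=0):
    """Output (height, width) of a square-kernel 2D convolution."""
    out_height = (height + 2 * padding - kernel_size) // stride + 1
    out_width  = (width  + 2 * padding - kernel_size) // stride + 1
    return out_height, out_width

rows, columns, inarow, channels = 6, 7, 4, 26
out_height, out_width = conv2d_output_shape(rows, columns, kernel_size=inarow)
cnn_output_size = out_height * out_width * channels

# The dense layers then halve in size at each step
dense_sizes = [cnn_output_size // 2, cnn_output_size // 4, cnn_output_size // 8]
print((out_height, out_width), cnn_output_size, dense_sizes)  # (3, 4) 312 [156, 78, 39]
```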

In [None]:
# DOCS: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
import torch
import torch.nn as nn
import torch.nn.functional as F


class IsGameoverCNN(BitboardNN):
    def __init__(self):
        super().__init__()
        self.one_hot_size     = 3
        self.input_size       = configuration.rows * configuration.columns * self.one_hot_size
        self.output_size      = 1
        self.cnn_channels     = (10 + 16)  # 4 vertical, 4 horizontal, 2 diagonal lines = 10 + 16 squares
        self.cnn_kernel_size  = configuration.inarow
        self.cnn_output_size  = (configuration.rows-self.cnn_kernel_size+1) * (configuration.columns-self.cnn_kernel_size+1) * self.cnn_channels
        self.dense_layer_size = self.cnn_output_size // 2

        self.conv1  = nn.Conv2d(self.one_hot_size, self.cnn_channels, self.cnn_kernel_size)
        self.fc1    = nn.Linear(self.cnn_output_size,    self.cnn_output_size//2)
        self.fc2    = nn.Linear(self.cnn_output_size//2, self.cnn_output_size//4)
        self.fc3    = nn.Linear(self.cnn_output_size//4, self.cnn_output_size//8)
        self.output = nn.Linear(self.cnn_output_size//8, self.output_size)


    # DOCS: https://towardsdatascience.com/understanding-input-and-output-shapes-in-convolution-network-keras-f143923d56ca
    # pytorch requires:    contiguous_format = (batch_size, channels, height, width)
    # tensorflow requires: channels_last     = (batch_size, height, width, channels)
    def cast(self, x):
        x = super(IsGameoverCNN, self).cast(x)                                        # x.shape = (height, width, channels)
        x = x.view(-1, configuration.rows, configuration.columns, self.one_hot_size)  # x.shape = (batch_size, height, width, channels)
        x = x.permute(0, 3, 1, 2)                                                     # x.shape = (batch_size, channels, height, width)
        return x


    def forward(self, x):
        x = self.cast(x)                    # x.shape = (1,3,6,7)
        x = F.leaky_relu(self.conv1(x))     # x.shape = (1,26,3,4)
        x = x.permute(0,2,3,1).reshape(-1)  # x.shape = (312,) + convert to channels_last (batch_size, height, width, channels)
        x = F.leaky_relu(self.fc1(x))       # x.shape = (156,)
        x = F.leaky_relu(self.fc2(x))       # x.shape = (78,)
        x = F.leaky_relu(self.fc3(x))       # x.shape = (39,)
        x = torch.sigmoid(self.output(x))   # x.shape = (1,)
        return x


isGameoverCNN = IsGameoverCNN()
isGameoverCNN.to(device)

# Training Loop

Input data (bitboard positions) can be generated by continually simulating Connect4 games between two random agents.

However, one of the issues faced by reinforcement learning is the imbalanced nature of generated datasets.
We spend most of the time playing moves, and only at the very end of each game do we get a single `is_gameover()` event.
Then there is the `has_no_more_moves()` draw edge case, which happens only rarely when two random agents play each other.

The raw statistical distribution of the generated dataset is:
- not_gameover():      0.95305 (95% of the generated dataset)
- is_gameover():       0.04680 (average game lasts 21.3 moves)
- has_no_more_moves(): 0.00016 (1 in 6250 moves or 1 in 292.5 games)

My initial implementation generated a new datapoint for each iteration of the training loop.
This approach ran into the imbalanced dataset problem. Most of the inputs had an expected output of 0, and the network was unable to achieve greater than 95% accuracy,
which is the baseline accuracy of a model that always predicts `0 = not_gameover`.
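The baseline figure can be reproduced with a toy example. This sketch assumes a dataset of 1000 labels with roughly the split reported above, and scores a "model" that always predicts 0. The helper names `accuracy` and `recall` are hypothetical.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(int(p == l) for p, l in zip(predictions, labels)) / len(labels)

def recall(predictions, labels, positive=1):
    """Fraction of the positive labels that were actually predicted."""
    hits  = sum(int(p == positive and l == positive) for p, l in zip(predictions, labels))
    total = sum(int(l == positive) for l in labels)
    return hits / total

labels      = [0] * 953 + [1] * 47  # roughly 95.3% not_gameover, 4.7% gameover
always_zero = [0] * len(labels)     # a model that has learnt nothing

print(accuracy(always_zero, labels))  # 0.953
print(recall(always_zero, labels))    # 0.0
```

High accuracy here says nothing about the rare class, which is why accuracy alone is a misleading metric until the dataset has been rebalanced.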

I borrowed a couple of ideas from the [RL Course by David Silver](https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ) lectures:
- Instead of generating the examples inside the training loop, it is possible to generate a batched dataset, then treat it as a supervised learning problem.
- RL training examples need not be use-once-and-throw-away, but can be reused to train the model for several iterations. This also reduces the computational burden of generating new datasets.

By creating a batched dataset, it was possible to normalize the distribution of generated data via undersampling.
Dataset size is 1000 bitboards, with 500 examples of `not_gameover`, and 250 each of the `is_gameover()` and `has_no_more_moves()` edge cases.
Games of random_agent vs random_agent are repeatedly run until there are enough examples of each of the edge cases to create a dataset.
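The rebalancing step can be sketched in isolation. This is a simplified illustration rather than the notebook's exact method: it draws from each pool with replacement via `random.choices()`, whereas the training code duplicates the pools and uses `random.sample()`. The function name `balanced_sample` and the toy pools are assumptions made for the example.

```python
import random
from collections import Counter

def balanced_sample(pools, counts, seed=None):
    """Draw a fixed number of items per class, regardless of how common each class is."""
    rng = random.Random(seed)
    sample = []
    for label, count in counts.items():
        # choices() samples with replacement, so items from a rare class may repeat
        sample += [(label, item) for item in rng.choices(pools[label], k=count)]
    rng.shuffle(sample)
    return sample

# Toy pools with roughly the raw distribution: the rare class has very few items
pools = {
    'not_gameover':      list(range(9530)),
    'is_gameover':       list(range(468)),
    'has_no_more_moves': list(range(2)),
}
counts  = { 'not_gameover': 500, 'is_gameover': 250, 'has_no_more_moves': 250 }
dataset = balanced_sample(pools, counts, seed=42)
print(Counter(label for label, item in dataset))
```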

Roughly 1.5 million bitboards must be generated to observe 250 `has_no_more_moves()` events. On my RazerBlade laptop with multithreading, this requires 80s of compute (50μs/sample). Inside a Kaggle notebook, the same dataset generation is 9.5x slower (473μs/sample), which can be improved slightly with `@njit` to only 7.8x slower (391μs/sample). This would still mean a data generation time of around 10 minutes per epoch.
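These figures can be checked with some back-of-the-envelope arithmetic, using only the event rate and per-sample timings quoted in this notebook:

```python
target_events  = 250      # has_no_more_moves() examples wanted
event_rate     = 0.00016  # observed frequency per generated bitboard
samples_needed = target_events / event_rate  # roughly 1.56 million bitboards

microseconds_per_sample = {
    "laptop":               50,
    "kaggle with @njit":    391,
    "kaggle without @njit": 473,
}
seconds = {
    name: samples_needed * us / 1_000_000
    for name, us in microseconds_per_sample.items()
}
for name, value in seconds.items():
    print(f'{name:22s} {value:6.0f}s = {value / 60:4.1f} min')
```

This gives roughly 78s on the laptop and a little over 10 minutes on Kaggle with `@njit`, in line with the numbers above.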


For the purposes of simplicity, the neural network is trained and backpropagated on single values, rather than mini-batches.
I am unsure how this affects training, accuracy and computational performance.
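For anyone wanting to compare the two approaches, here is a minimal sketch of how a dataset could be chunked into mini-batches. The `minibatches` generator is a hypothetical helper and is not used by the training loop in this notebook. Note that the models' `forward()` methods would also need adapting, since `x.view(1)` and `reshape(-1)` assume a single example per call.

```python
import random

def minibatches(dataset, batch_size, shuffle=True, seed=None):
    """Yield successive chunks of the dataset; the last chunk may be smaller."""
    items = list(dataset)
    if shuffle:
        random.Random(seed).shuffle(items)
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

dataset = list(range(1000))  # stand-in for 1000 (expected, bitboard) pairs
batches = list(minibatches(dataset, batch_size=32, seed=0))
print(len(batches), len(batches[0]), len(batches[-1]))  # 32 32 8
```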

In [None]:
#!/usr/bin/env python3
import itertools
import os
import random
import time
from collections import defaultdict, namedtuple
from typing import List, Tuple

import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from joblib import delayed
from joblib import Parallel

DatasetItem = namedtuple('DatasetItem', ['expected', 'bitboard'])
def generate_dataset(dataset_size: int, bitboard_fn, verbose=False) -> List[DatasetItem]:
    """ Creates a statistically balanced dataset of all the edgecases """
    time_start   = time.perf_counter() 
    dataset_size = int(dataset_size/20)  # 20 = (5 + 5 + 10)
    data = {
        "not_gameover":      [],
        "is_gameover":       [],
        "has_no_more_moves": []
    }
    while min(map(len, data.values())) < dataset_size:
        def generate(games=100):
            data = defaultdict(list)
            for n in range(games):
                # reset the board for each game, otherwise only the first game would be played
                bitboard  = empty_bitboard()
                player_id = current_player_id(bitboard)
                while not is_gameover(bitboard):
                    action    = get_random_move(bitboard)
                    bitboard  = result_action(bitboard, action, player_id)
                    mirror    = mirror_bitboard(bitboard)  # store the mirror too, so we sample from both mirrors and bitboards
                    player_id = next_player_id(player_id)
                    if   has_no_more_moves(bitboard): data['has_no_more_moves'].extend([ bitboard, mirror ])
                    elif is_gameover(bitboard):       data['is_gameover'      ].extend([ bitboard, mirror ])
                    else:                             data['not_gameover'     ].extend([ bitboard, mirror ])
            return data

        # batched_datas = [ generate() ]  # single-process version, for debugging
        batched_datas = Parallel(n_jobs=os.cpu_count())([ delayed(generate)(100) for round in range(100) ])
        for (key, value), batched_data in itertools.product(data.items(), batched_datas):
            data[key] += batched_data[key]

    # has_no_more_moves are very rare, so duplicate datapoints, mirror and then resample from the rest
    bitboards = [
        *random.sample(data['not_gameover']      * 10,  dataset_size * 10),
        *random.sample(data['is_gameover']       *  5,  dataset_size *  5),
        *random.sample(data['has_no_more_moves'] *  5,  dataset_size *  5),  
    ]
    np.random.shuffle(bitboards)
    output = [ 
        DatasetItem(bitboard_fn(bitboard), bitboard) 
        for bitboard in bitboards
    ]    
    if verbose:
        time_taken = time.perf_counter() - time_start
        data_count = sum(map(len, data.values()))
        statistics = {
            "time_taken": f'{time_taken:.0f}s',
            "time_count": f'{time_taken/data_count*1000000:.0f}μs',
            "count": data_count,
            **{ key: round( len(value)/data_count, 5) for key, value in data.items() }
        }
        print('dataset statistics: ', statistics)
        # {'time_taken':  '84s', 'time_count':  '50μs', 'count': 1693800, 'not_gameover': 0.95312, 'is_gameover': 0.04673, 'has_no_more_moves': 0.00015}  # RazerBlade laptop
        # {'time_taken':  '77s', 'time_count': '391μs', 'count':  197574, 'not_gameover': 0.95293, 'is_gameover': 0.04694, 'has_no_more_moves': 0.00013}  # Kaggle Notebook with    @njit - 10% dataset
        # {'time_taken':  '67s', 'time_count': '473μs', 'count':  142134, 'not_gameover': 0.95286, 'is_gameover': 0.04696, 'has_no_more_moves': 0.00018}  # Kaggle Notebook without @njit - 10% dataset
    return output


def train(model, criterion, optimizer, bitboard_fn=is_gameover, dataset_size=1000, timeout=4*60*60):
    print(f'Training: {model.__class__.__name__}')
    time_start = time.perf_counter()
    epoch = 0
    try:
        model.load()
        hist_accuracy = [0]

        # dataset generation is expensive, so loop over each dataset until fully learnt
        while np.min(hist_accuracy[-10:]) < 1.0:  # need multiple epochs of 100% accuracy to pass
            if time.perf_counter() - time_start > timeout: break
            epoch         += 1                    
            epoch_start    = time.perf_counter()
            bitboard_count = 0
            
            dataset_epoch = 0
            dataset = generate_dataset(dataset_size, bitboard_fn, verbose=True)        
            while dataset_epoch == 0 or hist_accuracy[-1] < 1.0:  # train at least once per dataset, then loop until 100% accuracy
                if time.perf_counter() - time_start > timeout: break
                if dataset_epoch > 100: break  # if we plateau after many iterations, then generate a new dataset
                    
                dataset_epoch   += 1
                running_accuracy = 0
                running_loss     = 0.0
                running_count    = 0

                random.shuffle(dataset)
                for (expected, bitboard) in dataset:
                    assert isinstance(expected, bool)
                    assert isinstance(bitboard, np.ndarray)
                    
                    bitboard_count += 1
                    labels = model.cast_to_labels(expected)

                    # zero the parameter gradients
                    optimizer.zero_grad()

                    outputs = model(bitboard)
                    loss    = criterion(outputs, labels)
                    loss.backward()
                    optimizer.step()

                    # Update running losses and accuracy
                    actual            = model.cast_from_outputs(outputs)  # convert (1,1) tensor back to bool
                    running_accuracy += int( actual == expected )
                    running_loss     += loss.item()
                    running_count    += 1

                epoch_time       = time.perf_counter() - epoch_start                    
                last_loss        = running_loss     / running_count
                last_accuracy    = running_accuracy / running_count
                hist_accuracy.append(last_accuracy)

                # Print statistics every 10,000 bitboards
                if bitboard_count % 10000 == 0:
                    print(f'epoch: {epoch:4d} | bitboards: {bitboard_count:5d} | loss: {last_loss:.5f} | accuracy: {last_accuracy:.5f} | time: {epoch_time :.0f}s')
            print(f'epoch: {epoch:4d} | bitboards: {bitboard_count:5d} | loss: {last_loss:.5f} | accuracy: {last_accuracy:.5f} | time: {epoch_time :.0f}s')
            model.save()

                    
    except KeyboardInterrupt:
        pass
    except Exception as exception:
        print(exception)
        raise exception
    finally:
        time_taken = time.perf_counter() - time_start
        print(f'Finished Training: {model.__class__.__name__} - {epoch} epochs in {time_taken:.1f}s')
        model.save()

# Training the Models

[Adadelta](https://ruder.io/optimizing-gradient-descent/index.html#adadelta) is chosen as the optimizer because it automatically adjusts the learning rate without
the additional complexity of an externally defined scheduler.
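To give a feel for how Adadelta adapts its own step size, here is a minimal single-parameter sketch of the update rule in plain Python. It is an illustration under simplifying assumptions (one scalar parameter, `rho=0.9` and `eps=1e-6` chosen as illustrative values), not the `optim.Adadelta` implementation itself; `adadelta_minimize` is a hypothetical name.

```python
import math

def adadelta_minimize(grad_fn, x, steps=100, rho=0.9, eps=1e-6):
    """Minimize a 1D function with the Adadelta update rule (no learning rate to tune)."""
    avg_sq_grad  = 0.0  # running average of squared gradients
    avg_sq_delta = 0.0  # running average of squared updates
    for _ in range(steps):
        grad         = grad_fn(x)
        avg_sq_grad  = rho * avg_sq_grad + (1 - rho) * grad * grad
        # step size = ratio of RMS(previous updates) to RMS(gradients)
        delta        = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * grad
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
        x           += delta
    return x

# f(x) = x**2 has gradient 2x and a minimum at x = 0
x_final = adadelta_minimize(lambda x: 2 * x, x=1.0)
print(x_final)  # moves from 1.0 towards 0.0
```

The first steps are small, because the running average of past updates starts at zero, and they grow as that history accumulates.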

On Kaggle after 4 hours of training, regenerating each dataset after 100 epochs, I was able to get:
> Dense: | epoch:   13 | bitboards: 100000 | loss: 0.01836 | accuracy: 0.98032 | time: 1099s
>
> CNN:   | epoch:   12 | bitboards: 100000 | loss: 0.01455 | accuracy: 0.98483 | time: 1148s

The second run created mirror bitboards, duplicated `has_no_more_moves` datapoints, and modified the training loop to keep training on a given dataset until it reached 100% accuracy. Accuracy plateaued at:
- 0.98960 for isGameoverSquareNN (Dense Model)
- 0.99470 for isGameoverCNN.

The third run regenerates the dataset after 100 iterations to prevent a plateau, and also tests the effect of `weight_decay`. [This article](https://machinelearningmastery.com/how-to-reduce-overfitting-in-deep-learning-with-weight-regularization/) seems to suggest that `weight_decay` can be added later in the process to improve results, with 0.01 or 0.001 being optimal values (for their use case).

However, for this use case, `weight_decay=0.001` reduces accuracy to:
- 0.81190 for isGameoverSquareNN (Dense Model)
- 0.83840 for isGameoverCNN

The fourth experiment is to switch from ReLU to leaky ReLU. ReLU outputs zero for every input on one side of the line it draws, so it can only see distance in one direction. Leaky ReLU keeps a small slope on the negative side, so it can see distance in both directions away from the line. Let's see if this improves the accuracy of the model. `weight_decay` is also removed here.
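The difference between the two activation functions is easiest to see numerically. This plain-Python sketch assumes a negative slope of 0.01, which is the default for `F.leaky_relu()`. The gradient helpers are hypothetical names added for illustration.

```python
def relu(x):
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    return x if x >= 0 else negative_slope * x

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # no gradient flows for negative inputs

def leaky_relu_grad(x, negative_slope=0.01):
    return 1.0 if x >= 0 else negative_slope  # a small gradient always flows

for x in [-2.0, -0.5, 0.5, 2.0]:
    print(f'x={x:5.1f} | relu={relu(x):5.2f} grad={relu_grad(x):4.2f}'
          f' | leaky_relu={leaky_relu(x):6.3f} grad={leaky_relu_grad(x):4.2f}')
```

With ReLU, the output and the gradient are both zero for every negative input, so the distance from the boundary is lost on that side. Leaky ReLU preserves it, scaled down by the slope.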

In [None]:
model     = isGameoverSquareNN
criterion = nn.MSELoss()  # NOTE: nn.CrossEntropyLoss() is for multi-class classification
optimizer = optim.Adadelta(model.parameters())
train(model, criterion, optimizer, bitboard_fn=is_gameover)

In [None]:
model     = isGameoverCNN
criterion = nn.MSELoss()  # NOTE: nn.CrossEntropyLoss() is for multi-class classification
optimizer = optim.Adadelta(model.parameters())
train(model, criterion, optimizer, bitboard_fn=is_gameover)

# Further Reading

This notebook is part of a series exploring neural network implementations, which I have extended to the Game of Life Forward Problem:
- [Pytorch Game of Life - First Attempt](https://www.kaggle.com/jamesmcguigan/pytorch-game-of-life-first-attempt)
- [Pytorch Game of Life - Hardcoding Network Weights](https://www.kaggle.com/jamesmcguigan/pytorch-game-of-life-hardcoding-network-weights)
- [Its Easy for Neural Networks To Learn Game of Life](https://www.kaggle.com/jamesmcguigan/its-easy-for-neural-networks-to-learn-game-of-life)
- [OuroborosLife - Function Reversal GAN](https://www.kaggle.com/jamesmcguigan/ouroboroslife-function-reversal-gan)