# ConnectX - Vectorized Bitshifting Tutorial

This notebook tutorial describes how to implement the core functions of the ConnectX game using vectorized bitshifting for performance optimized code.

The code was used to implement a performance-optimized version of Monte Carlo Tree Search combined with a Bitsquares Heuristic, resulting in a top 5% score on the leaderboard.
- https://www.kaggle.com/jamesmcguigan/connectx-mcts-bitboard-bitsquares-heuristic/

This code will outperform most Python agents that use integer list datastructures, with greater search depth leading to higher winrates. However, it is worth noting that Python as a language is very slow, and the secret of the top entries on the leaderboard is that their implementations are actually written in a compiled language such as C (or Rust, Go, Nim) and imported into Python using ctypes or cffi.

In [None]:
%%html
<style> table { display: inline-block } </style>

In [None]:
import numpy as np
from numba import njit
import timeit
from collections import namedtuple
from typing import Tuple, List, Union

# Kaggle Datastructures

These are the datastructures passed into the `agent(observation: Struct, configuration: Struct)` function required for game submission.

In [None]:
observation   = {'mark': 1, 'board': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
configuration = {'columns': 7, 'rows': 6, 'inarow': 4, 'steps': 1000, 'timeout': 8}

Kaggle uses a `Struct` datatype internally, not a dict as shown. However, numba doesn't like the `Struct` datatype, so we can cast to `namedtuple()` instead, which preserves the same dot notation and is compatible with numba.

In [None]:
Observation   = namedtuple('observation',   ['mark', 'board'])
observation   = Observation(**observation)

Configuration = namedtuple('configuration', ['rows', 'columns', 'inarow', 'steps', 'timeout'])
configuration = Configuration(**configuration)

## configuration
- `configuration.columns` + `configuration.rows` defines the shape of the game board
- `configuration.inarow` defines the number of pieces in a row needed for a win
- `configuration.steps` I assume is the maximum number of turns the game can go on for, which is actually 42 (needs verifying)
- `configuration.timeout` defines the maximum amount of time an agent has to think before kaggle_environments triggers a timeout 

The configuration is currently set to be the same as that used in the physical boardgame. The competition, however, is called ConnectX rather than Connect4, which suggests that these configuration values might change in future iterations of the game. Code should be written to read values from the configuration rather than hardcoding numbers and assuming they will always be constant.


## observation
- `observation.mark` is the player whose turn it is to move next
- `observation.board` is a 1-dimensional representation of the board. 

`observation.board` has `len()` == 42 representing a 7x6 board, starting from the sky in the top-left corner and descending row by row, with the last value representing the bottom-right corner. Values in the array will be in the set `{0,1,2}`, with 0 representing an empty square, and 1 or 2 representing a player token.

The first 7 values represent the top row of the board, which can also be used to check if a move is valid. A 0 in the top row of a given column means the sky is empty and column/action is valid to play in. 

If using numpy, then the flat list can be reshaped into a 2D board using: 
```
board = np.array(observation.board, dtype=np.int8).reshape(configuration.rows, configuration.columns)
```
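
The top-row check for valid moves described above can be sketched as follows. This is only an illustration: the helper name `valid_actions` is my own and is not part of kaggle_environments.

```python
from typing import List

def valid_actions(board: List[int], columns: int) -> List[int]:
    """Return the columns whose top square is still empty (0)."""
    # The first `columns` entries of the flat board are the top row.
    return [col for col in range(columns) if board[col] == 0]

example_board = [0] * 42
example_board[3] = 1   # pretend column 3 has been filled right up to the sky
print(valid_actions(example_board, columns=7))
```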

# List Implementation

The [kaggle_environments](https://github.com/Kaggle/kaggle-environments/blob/master/kaggle_environments/envs/connectx/connectx.py) source code 
provides a classical implementation of these functions using lists.

## is_draw_kaggle()

The `is_draw()` function is fairly simple. 

The game is drawn if all squares have been played, thus it loops over the board array and checks for the presence of any remaining zeros representing empty squares.

An equivalent definition of a draw/stalemate is when the next player has no more valid actions/columns to play. This is true because tokens always fall to the lowest row in a column, and the top row can only be played after all the other squares in the column have been played.

This function could be optimized by simply checking the first `configuration.columns = 7` values, rather than the entire array.
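
A minimal sketch of that optimization, assuming the flat top-row-first layout described earlier (the name `is_draw_top_row` is hypothetical). Like the original, it only tests for a full board and does not check whether somebody has already won.

```python
from typing import List

def is_draw_top_row(board: List[int], columns: int) -> bool:
    """Draw check that inspects only the top row of the flat board."""
    return all(mark != 0 for mark in board[:columns])

full_board  = [1, 2] * 21   # 42 played squares
empty_board = [0] * 42
print(is_draw_top_row(full_board, 7), is_draw_top_row(empty_board, 7))
```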

In [None]:
# Source: https://github.com/Kaggle/kaggle-environments/blob/master/kaggle_environments/envs/connectx/connectx.py

EMPTY = 0

# NOTE: kaggle_environments implements this as inline code rather than a standalone function
def is_draw_kaggle(board):
    # Check for a tie.
    return all(mark != EMPTY for mark in board)

## is_win_kaggle()

`is_win()` optimises itself by only checking for connect4 lines in the last column played, rather than validating the entire board.

It starts by figuring out the row height for the target column. Rows are counted downwards with `row=0` being the sky, thus `min()` returns the smallest row index (the highest square on the board) and `max()` returns the largest row index (the lowest square). If `has_played=False`, the board is the old view from before the player moved, thus the row number is the position of the lowest EMPTY square rather than the highest played square.

The inline `count()` function counts the player's tokens extending outwards from that square. It is given a direction vector with components in `{-1, 0, +1}` and iterates up to `configuration.inarow - 1 = 3` squares until it either hits the edge of the board or encounters an EMPTY or opponent token. If the sum of any pair of opposite directions is at least `configuration.inarow - 1` (the starting square supplies the final token), then the function returns `True` as the win condition has been met.

In [None]:
# Source: https://github.com/Kaggle/kaggle-environments/blob/master/kaggle_environments/envs/connectx/connectx.py

def is_win_kaggle(board, column, mark, config, has_played=True):
    columns = config.columns
    rows    = config.rows
    inarow  = config.inarow - 1
    row = (
             min([ r for r in range(rows) if board[column + (r * columns)] == mark  ])
        if has_played
        else max([ r for r in range(rows) if board[column + (r * columns)] == EMPTY ])
    )

    def count(offset_row, offset_column):
        for i in range(1, inarow + 1):
            r = row    + offset_row    * i
            c = column + offset_column * i
            if (
                   r <  0
                or r >= rows
                or c <  0
                or c >= columns
                or board[c + (r * columns)] != mark
            ):
                return i - 1
        return inarow

    return (
            count( 1,  0)                 >= inarow  # vertical.
        or (count( 0,  1) + count(0, -1)) >= inarow  # horizontal.
        or (count(-1, -1) + count(1,  1)) >= inarow  # top left diagonal.
        or (count(-1,  1) + count(1, -1)) >= inarow  # top right diagonal.
    )

# Bitshifting with Bitboards

One of the key challenges of implementing a competitive ConnectX bot is writing performance-optimized code. If you have written a Minimax or MCTS agent, then the depth of your search depends on how fast you can iterate over gamestates. 

Bitboards are based on the idea that an integer is really just a 64-bit sequence of 1s and 0s, and that rather than treating it as a number, we can effectively treat it as an array of 64 boolean flags. The advantage is that this allows us to operate on this datastructure using bitshifting operations, which are implemented in the CPU as one-cycle instructions (as fast as adding two numbers together). 

The catch is that we need to learn the language of bitshifting and think differently about the types of operations we can efficiently do on this datatype.
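
As a small preview of that language, here is a sketch of using a plain Python integer as an array of boolean flags. The operators are explained in the sections below, and the variable names are only illustrative.

```python
flags = 0                       # every flag starts as False

flags |= 1 << 5                 # switch on flag 5
flags |= 1 << 9                 # switch on flag 9

flag5 = (flags >> 5) & 1        # read flag 5
flag6 = (flags >> 6) & 1        # read flag 6, which was never set

print(f'{flags:016b}', flag5, flag6)
```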

NOTE: bitshifting in Python still comes with the overhead of the Python runtime, so it is still much slower than a compiled C implementation, however it is much faster than asking Python to loop over an array of 42 integers. Python speed can be increased using vectorized numpy bitshifting, and possibly even further using numba `@njit`, however I have had technical issues getting numba to run reliably inside the kaggle submissions environment.

# empty_bitboard()

The integer list version of the board contains 42 integers, however, there are 3 possible values for each integer `{0,1,2}`. The largest value, 2, is `10` in binary, so we actually need 2 bits to represent each square. Thus we need 42 bits to represent which squares have been played (by either player) and another 42 bits to represent which player owns each square (assuming it has been played).

There are several possible ways of representing this as a bitboard:

Native Python supports arbitrary-sized integers, so we could use a single 84-bit integer, and use shifting and masking to extract the required bits as and when we need them.
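
A rough sketch of that single-integer layout, assuming the played bits live in the low 42 bits and the player bits in the high 42 bits. The `pack()` and `unpack()` helpers are hypothetical and are not used elsewhere in this notebook.

```python
BOARD_BITS = 42  # 7 columns * 6 rows

def pack(played: int, player: int) -> int:
    """Store `player` in the high 42 bits and `played` in the low 42 bits."""
    return (player << BOARD_BITS) | played

def unpack(state: int):
    """Split the 84-bit state back into (played, player)."""
    mask = (1 << BOARD_BITS) - 1
    return state & mask, (state >> BOARD_BITS) & mask

played = (1 << 30) | (1 << 37) | (1 << 38)   # three tokens on the board
player = (1 << 30)                           # the token at index 30 belongs to player 2
state  = pack(played, player)
print(f'{state:084b}')
```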

However, numpy and numba only support up to 64-bit integers (which get compiled down to C types), thus we need an array of two 64-bit integers to store the full state information for the game. The first number represents the played tokens (0 = unplayed | 1 = played) and the second represents the player number (0 = p1 | 1 = p2).

At the start of the game, no tokens have been played so the value of the first bitboard is 0. For simplicity the value of the second bitboard also defaults to 0.

In [None]:
# Source: https://github.com/JamesMcGuigan/ai-games/blob/master/games/connectx/core/ConnectXBBNN.py

#@njit
def empty_bitboard() -> np.ndarray:
    bitboard = np.array([0, 0], dtype=np.int64)
    return bitboard

# mask_board + mask_legal_moves

The first bitshifting operation to learn is the left shift `<<` operator.

The binary number 1 is represented as a series of 0s with a single 1 in the rightmost bit. 

> Writing the least significant bit on the right is loosely analogous to little-endian ordering. Strictly speaking, endianness describes the order of bytes in memory: x86 processors are little-endian, whereas some historical computers chose big-endian format, which reverses the byte order.

To set a bit by index, we can start with a binary `1` and left shift it into position `1 << index`.

If we wish to create a bitmask with X number of 1s, then we can left shift one bit too far and then subtract 1: `(1 << X) - 1`
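
As a quick sanity check of the mask trick, here is a tiny sketch (the helper name `ones_mask` is my own):

```python
def ones_mask(x: int) -> int:
    """Return an integer whose lowest `x` bits are all 1."""
    return (1 << x) - 1

print(f'{ones_mask(7):b}')               # seven 1s
print(bin(ones_mask(42)).count('1'))     # number of 1 bits in a 42-bit mask
```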

In [None]:
mask_board       = (1 << configuration.columns * configuration.rows) - 1
mask_legal_moves = (1 << configuration.columns) - 1

print(f'1                =  {1:042b}')
print(f'1 << 42          = {1 << 42 :042b}')
print(f'mask_board       =  {mask_board:042b}')
print(f'mask_legal_moves =  {mask_legal_moves:042b}')

# list_to_bitboard()

The first practical task is to write a casting function to convert the kaggle listboard format into a bitboard array.

Traditionally we are used to reading lists in left-to-right (big-endian) format, with `list[0]` being the leftmost item, however the bitboard is little-endian and `list[0]` will be represented by the rightmost bit. 

`list_to_bitboard()` starts with 0-initialized bitboards, then loops over the flattened listboard array. If the list value is 0, then the bits have already been set and we don't need to do anything. If the list value indicates player 1, then we only need to set the corresponding bit in `bitboard_played` as `0 = p1` is the default in `bitboard_player`. If the list value indicates player 2, then we will need to set the same bit in both bitboards.

To create a bitmask for a single bit, we use leftshift `1 << n`, then perform a logical OR `|` with the bitboard `bitboard_played |= (1 << n)` and `bitboard_player |= (1 << n)`.

We can use logical OR `bitboard = bitboard | mask` to set specific bits to `1` without unsetting any of the bits previously added.

---

# Logical OR |

| A | B | A \| B | 
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |



In [None]:
#@njit
def list_to_bitboard(listboard: Union[np.ndarray,List[int]]) -> np.ndarray:
    # bitboard[0] = played, is a square filled           | 0 = empty, 1 = filled
    # bitboard[1] = player, whose token is it, if filled | 0 = p1,    1 = p2
    bitboard_played = 0  # 42 bit number for if board square has been played
    bitboard_player = 0  # 42 bit number for player 0=p1 1=p2
    if isinstance(listboard, np.ndarray): listboard = listboard.flatten()
    for n in range(len(listboard)):  # prange
        if listboard[n] != 0:
            bitboard_played |= (1 << n)        # is a square filled (0 = empty | 1 = filled)
            if listboard[n] == 2:
                bitboard_player |= (1 << n)    # mark as player 2 square, else assume p1=0 as default
    bitboard = np.array([bitboard_played, bitboard_player], dtype=np.int64)
    return bitboard

To get the true positions of player tokens, we must logical AND `&` the two halves of the bitboard. For player 2, the player bits are already set to `1` so we are good to go, however player 1 is represented by a `0` bit, so we first need to flip the bits using logical NOT `~`.


---

# Logical NOT ~

| A | ~A | 
|---|---|
| 0 | 1 |
| 1 | 0 |


Logical NOT `~` is a unary operator and will flip all the 1s and 0s in the bitboard, but be careful as this will flip all 64 bits, not just the 42 bits we are using for data. 
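
To illustrate the 64-bit caveat, here is a small sketch assuming a 7x6 board and a single player 2 token at index 30. Masking with the 42-bit board mask after the NOT discards the unwanted high bits.

```python
import numpy as np

mask_board = (1 << 42) - 1               # 42 data bits for a 7x6 board
player     = np.int64(1 << 30)           # one player 2 token at index 30

flipped        = ~player                 # all 64 bits inverted, including the sign bit
flipped_masked = ~player & mask_board    # keep only the 42 board bits

print(flipped)                           # negative, because the sign bit is now set
print(f'{flipped_masked:042b}')
```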

---


# Logical AND &

| A | B | A&B | 
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |


Logical AND `&` will only keep bits that are set in both bitstrings, thus it can be used with a bitmask to extract specific bit ranges. 
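
For example, a single row can be extracted by shifting it down to the lowest bits and masking with seven 1s. This is a sketch that assumes the same 7x6 layout as the rest of the notebook, with tokens at indices 30, 37 and 38.

```python
columns, rows = 7, 6
played = (1 << 30) | (1 << 37) | (1 << 38)     # bits for three example tokens

row_mask   = (1 << columns) - 1                # seven 1s
top_row    = played & row_mask                 # bits 0..6
bottom_row = (played >> ((rows - 1) * columns)) & row_mask  # bits 35..41 moved down to 0..6

print(f'top_row    = {top_row:07b}')
print(f'bottom_row = {bottom_row:07b}')
```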

---


# Logical XOR ^

| A | B | A^B | 
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |


For completeness, there is also the bitwise XOR `^` operator (exclusive or), which returns 1 if the inputs are different, or 0 if both inputs are the same.

It can be used to do clever things such as swapping the values of two integers without using a third variable, however I have not needed to use it in my codebase.
```
a ^= b; b ^= a; a ^= b
```

---

# Example Bitboard

This is an example of player 1 setting up for a double attack.

Notice how the little-endian bitboard renders the board from bottom-right to top-left. 

`bitboard[0]` represents the played tokens. Reading left-to-right, there are three 0s before the two 1s in the bottom row, then a single bit in the second row.

`bitboard[1]` identifies player ownership, but only in squares that have already been played. Note that `0` represents both the default unset value and the indicator for player 1.

`p1_tokens = bitboard[0] & ~bitboard[1]` uses logical AND `&` and logical NOT `~` to only indicate (with 1s) which squares player 1 has actually played.

`p2_tokens = bitboard[0] &  bitboard[1]` is simpler because `bitboard[1]` already uses 1s to indicate the player 2. 

In theory we could shortcircuit `p2_tokens == bitboard[1]`, but this is based on the assumption that `bitboard[1]` defaults to 0 and 1 bits are only set after a token has been played. It's slightly safer not to make this assumption if we are using logical NOT `~` elsewhere in the codebase (as this might set higher order bits to 1). 

In [None]:
listboard = [
    0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,
    0,0,2,0,0,0,0,
    0,0,1,1,0,0,0,
]
bitboard  = list_to_bitboard(listboard)
p1_tokens = bitboard[0] & ~bitboard[1]
p2_tokens = bitboard[0] &  bitboard[1]

print( f'bitboard    = {bitboard}' )
print( f'bitboard[0] = {bitboard[0]:042b}' )
print( f'bitboard[1] = {bitboard[1]:042b}' )
print( f'p1_tokens   = {p1_tokens:042b}' )
print( f'p2_tokens   = {p2_tokens:042b}' )

# bitboard_to_numpy2d()

We can also do the reverse cast, which turns a bitboard array back into a 2d numpy array. This is useful for visualization and debugging.

The process works in reverse to `list_to_bitboard()`. We iterate over the size of the bitboard and use right shift `>>` to move the indexed bit into the 1 position, then use logical AND `&` with 1 to check if that specific bit is set. 

`(bitboard[0] >> i) & 1` will evaluate to `1 == True` if the bit at index `i` is set, else it will evaluate to `0 == False`.

An alternative method would be to use left shift. This would still evaluate `True` and `False` correctly, but the return value for `True` would be a large positive number instead of the much simpler 1. 
```
bitboard[0] & (1 << i) 
```
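
The difference between the two tests can be seen in this small sketch, which uses a plain Python integer in place of `bitboard[0]`:

```python
bitboard_played = (1 << 30) | (1 << 37)

i = 37
via_right_shift = (bitboard_played >> i) & 1   # always 0 or 1
via_left_shift  = bitboard_played & (1 << i)   # 0 or the full value of (1 << i)

print(via_right_shift, via_left_shift)
```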

In [None]:
# Source: https://github.com/JamesMcGuigan/ai-games/blob/master/games/connectx/core/ConnectXBBNN.py

#@njit(int8[:,:](int64[:]))
def bitboard_to_numpy2d(bitboard: np.ndarray) -> np.ndarray:
    global configuration
    rows    = configuration.rows
    columns = configuration.columns
    size    = rows * columns
    output  = np.zeros((size,), dtype=np.int8)
    for i in range(size):  # prange
        is_played = (bitboard[0] >> i) & 1
        if is_played:
            player = (bitboard[1] >> i) & 1
            output[i] = 1 if player == 0 else 2
    return output.reshape((rows, columns))

In [None]:
bitboard_to_numpy2d(bitboard)

# has_no_more_moves()

Here we use similar logic to the kaggle implementation `is_draw(board)`, but with the simplifying assumption that we only need to check the values in the top row.

Here we use a combination of AND `&` and EQUALS `==`. The condition will only return True if ALL the bits in `mask_legal_moves` are also set in `bitboard[0]`.

In [None]:
# Source: https://github.com/JamesMcGuigan/ai-games/blob/master/games/connectx/core/ConnectXBBNN.py
#@njit
def has_no_more_moves(bitboard: np.ndarray) -> bool:
    """If all the squares on the top row have been played, then there are no more moves"""
    return bitboard[0] & mask_legal_moves == mask_legal_moves 

def has_no_empty_spaces(bitboard: np.ndarray) -> bool:
    """If every square on the board has been played, then there are no empty spaces"""
    return bitboard[0] == mask_board 

In [None]:
print(f'has_no_more_moves(bitboard) = {has_no_more_moves(bitboard)}')
print(f'mask_legal_moves = {mask_legal_moves:042b}')
print(f'bitboard[0]      = {bitboard[0]:042b}')

# CPU Performance

In theory, from a CPU performance perspective, we have reduced the time complexity of this function from a loop with 42 integer comparisons (and 42 indices to be set) down to 2 one-cycle bit operations.

Also notice that in the bitboard implementation we could equally check for all the bits using `mask_board` or only the top row using `mask_legal_moves`.

Unfortunately, the bitboard implementation of this simple 1 microsecond function actually turns out to be 1.5x slower than the list comprehension example. 
> 1 second = 1,000 milliseconds = 1,000,000 microseconds 

There are several possible explanations for this:
- Bitshifting in Python is much slower than in C
- Python list comprehensions are optimized inside the Python interpreter
- There is static function overhead and argument checking involved in the numpy array lookup
- Numpy has additional overhead in casting back from C `np.int64` to Python `int()`

We can gain some minor performance improvements:

- Using `Tuple[int,int]` rather than `np.ndarray` to store our bitboard results in a 0.3µs performance increase
- `has_no_empty_spaces()` avoids the bitwise AND operation, which saves us another 0.6µs (Python bitshifting is much slower than C)


# Numba

Numba can also be used to improve performance:
- Numba `@njit` compiles the function to machine code (technically via LLVM)
- We get a 2x performance increase for this tiny function
- There is a one-time performance hit of `@njit` compiling the function
  - The compile time is in the order of 50 milliseconds for this small function
  - This is equivalent to 40,000 function calls on first execution
  - Kaggle does permit an additional 8 seconds of runtime on your first move to account for data loading and compile times
- There is a static overhead of casting datatypes between C and Python
    - Numpy arrays are already cast into C, so there is little overhead in passing a numpy array to a numba function  
    - Numba works better for larger functions with nested loops
    - The overhead of passing Tuple[int,int] into these numba functions almost completely offsets the performance advantage of Numba   
- Bitshifting in C is much faster, thus the additional cost of logical AND has been reduced from 0.6µs to 0.005µs (120x performance increase)
- I am still trying to figure out the technical constraints of using numba practically within the kaggle submission environment

In [None]:
njit_has_no_more_moves   = njit(has_no_more_moves)
njit_has_no_empty_spaces = njit(has_no_empty_spaces)

tuple_bitboard = tuple(bitboard)
print( f'has_no_more_moves(bitboard):         {timeit.timeit(lambda: has_no_more_moves(bitboard),         number=1000000):.4f} µs' )
print( f'has_no_more_moves(tuple_bitboard):   {timeit.timeit(lambda: has_no_more_moves(tuple_bitboard),   number=1000000):.4f} µs' )
print( f'is_draw_kaggle(listboard):           {timeit.timeit(lambda: is_draw_kaggle(listboard),           number=1000000):.4f} µs' )
print( f'has_no_empty_spaces(bitboard):       {timeit.timeit(lambda: has_no_empty_spaces(bitboard),       number=1000000):.4f} µs' )
print( f'has_no_empty_spaces(tuple_bitboard): {timeit.timeit(lambda: has_no_empty_spaces(tuple_bitboard), number=1000000):.4f} µs' )
print( f'-----')
print( f'njit_has_no_more_moves(tuple_bitboard):    {timeit.timeit(lambda: njit_has_no_more_moves(tuple_bitboard),   number=1)*1000000:.0f} µs -> {timeit.timeit(lambda: njit_has_no_more_moves(tuple_bitboard),   number=1000000):.4f} µs' )
print( f'njit_has_no_empty_spaces(tuple_bitboard):  {timeit.timeit(lambda: njit_has_no_empty_spaces(tuple_bitboard), number=1)*1000000:.0f} µs -> {timeit.timeit(lambda: njit_has_no_empty_spaces(tuple_bitboard), number=1000000):.4f} µs' )
print( f'njit_has_no_more_moves(bitboard):          {timeit.timeit(lambda: njit_has_no_more_moves(bitboard),         number=1)*1000000:.0f} µs -> {timeit.timeit(lambda: njit_has_no_more_moves(bitboard),   number=1000000):.4f} µs' )
print( f'njit_has_no_empty_spaces(bitboard):        {timeit.timeit(lambda: njit_has_no_empty_spaces(bitboard),       number=1)*1000000:.0f} µs -> {timeit.timeit(lambda: njit_has_no_empty_spaces(bitboard), number=1000000):.4f} µs' )


# get_gameovers()

This is a more complicated function which will hopefully show the power of bitboards.

In the kaggle `is_win()` function, a nested loop is used to iterate, coordinate by coordinate, in all directions from each token in the last played column.

In the bitshifting version, we precompute `gameovers = get_gameovers()` to avoid repeating this calculation for every bitboard. It returns an np.array() of all possible bitmask locations that could result in a gameover Connect4 line.

First we generate 4x4 bitmasks for each direction (horizontal + vertical + 2x diagonal). This is done by leftshifting `<<` each bit into position and using bitwise OR `|` to combine the results. Note that it is also possible to use `np.sum()` to quickly bitwise OR an array of bitmasks, but only if you can guarantee there will be no overlapping bits.
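
The caveat about overlapping bits can be seen in this small sketch, which compares `np.sum()` against `np.bitwise_or.reduce()` on two made-up arrays of masks:

```python
import numpy as np

disjoint    = np.array([0b0001, 0b0010, 0b0100], dtype=np.int64)
overlapping = np.array([0b0011, 0b0010, 0b0100], dtype=np.int64)

or_disjoint  = np.bitwise_or.reduce(disjoint)      # 0b0111
sum_disjoint = np.sum(disjoint)                    # 0b0111, same as the OR
or_overlap   = np.bitwise_or.reduce(overlapping)   # 0b0111
sum_overlap  = np.sum(overlapping)                 # 0b1001, the carry corrupts the mask

print(f'{or_overlap:04b} vs {sum_overlap:04b}')
```

`np.bitwise_or.reduce()` is the safe choice whenever the masks might share bits.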

Then for each of these directional bitmasks, we iterate over the coordinates of the board (skipping positions where the line would run off an edge), then leftshift `<<` the mask by the offset required to move the directional bitmask into position: 
```
for n in range(inarow): mask_diagonal_ul |= 1 << n * columns + (inarow - 1 - n)
offset = col + row * columns
mask_diagonal_ul << offset
```

In [None]:
#@njit(int64[:]())
def get_gameovers() -> np.ndarray:
    """Creates a list of all winning board positions, over 4 directions: horizontal, vertical and 2 diagonals"""
    global configuration

    rows    = configuration.rows
    columns = configuration.columns
    inarow  = configuration.inarow

    gameovers = []

    mask_horizontal  = 0
    mask_vertical    = 0
    mask_diagonal_dl = 0
    mask_diagonal_ul = 0
    for n in range(inarow):  # use prange() with numba(parallel=True)
        mask_horizontal  |= 1 << n
        mask_vertical    |= 1 << n * columns
        mask_diagonal_dl |= 1 << n * columns + n
        mask_diagonal_ul |= 1 << n * columns + (inarow - 1 - n)

    print(f'mask_horizontal  = {mask_horizontal:042b}')
    print(f'mask_vertical    = {mask_vertical:042b}')
    print(f'mask_diagonal_dl = {mask_diagonal_dl:042b}')
    print(f'mask_diagonal_ul = {mask_diagonal_ul:042b}')

    row_inner = rows    - inarow
    col_inner = columns - inarow
    for row in range(rows):         # use prange() with numba(parallel=True)
        for col in range(columns):  # use prange() with numba(parallel=True)
            offset = col + row * columns
            if col <= col_inner:
                gameovers.append( mask_horizontal << offset )
            if row <= row_inner:
                gameovers.append( mask_vertical << offset )
            if col <= col_inner and row <= row_inner:
                gameovers.append( mask_diagonal_dl << offset )
                gameovers.append( mask_diagonal_ul << offset )

    print(f'gameovers = [ \n' + '\n'.join([ f'{gameover:042b}' for gameover in gameovers ]) + '\n]')             
    return np.array(gameovers, dtype=np.int64)


gameovers = get_gameovers()

# get_winner()

Once we have generated this carefully computed array of bitmasks, we can use numpy vectorization to perform bitwise operations over the entire array.

Gameover is declared if any of the `gameovers` bitmasks has all 4 of its bits set in the player's tokens, which we test using bitwise AND `&`.

- `(bitboard[0] & ~bitboard[1])` computes the scalar value of `p1_tokens`. 
- `& gameovers[:]` computes the bitwise AND `&` against each value in the numpy array, returning a numpy array
- `== gameovers[:]` checks, for every bitmask, whether all 4 of its bits survived the AND, returning an array of True or False
- `np.any(p1_wins)` checks if any element of the numpy array is True, as we only need one Connect4 line to win
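
Here is a scaled-down sketch of the same steps, using just two hand-written masks instead of the full `gameovers` array. The masks and token positions are made up for illustration.

```python
import numpy as np

# Two hypothetical 4-in-a-row masks on the bottom row of a 7-wide board:
# columns 0..3 (bits 35..38) and columns 1..4 (bits 36..39).
masks  = np.array([0b1111 << 35, 0b1111 << 36], dtype=np.int64)
tokens = np.int64((1 << 35) | (1 << 36) | (1 << 37) | (1 << 38))  # one player's tokens

matches = (tokens & masks) == masks   # one True/False per mask
has_won = bool(np.any(matches))

print(matches, has_won)
```

Only the first mask has all four of its bits present in `tokens`; the second is missing bit 39, so its comparison fails.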

In [None]:
#@njit
def get_winner(bitboard: np.ndarray) -> int:
    """ Endgame get_winner: 0 for no get_winner, 1 = player 1, 2 = player 2"""
    p2_wins = (bitboard[0] &  bitboard[1]) & gameovers[:] == gameovers[:]
    if np.any(p2_wins): return 2
    p1_wins = (bitboard[0] & ~bitboard[1]) & gameovers[:] == gameovers[:]
    if np.any(p1_wins): return 1
    return 0

    # NOTE: above implementation is 2x faster than this original attempt
    # gameovers_played = gameovers[ gameovers & bitboard[0] == gameovers ]  # exclude any unplayed squares
    # if np.any(gameovers_played):                                          # have 4 tokens been played in a row yet
    #     p1_wins = gameovers_played & ~bitboard[1] == gameovers_played
    #     if np.any(p1_wins): return 1
    #     p2_wins = gameovers_played &  bitboard[1] == gameovers_played
    #     if np.any(p2_wins): return 2
    # return 0

# CPU Performance

When we looked at the performance of `has_no_more_moves()`, the static overhead of additional function calls outweighed the performance advantages of bitshifting.

Notice here that the situation is reversed for this more expensive function. The ratios below come from my original run, so rerun the cell to measure them on your own machine.
- `get_winner(bitboard)` has an almost identical runtime to `has_no_more_moves(bitboard)`, showing that most of the cost is in function overhead
- `get_winner(bitboard)` is 8x faster than `is_win(listboard, column, mark)`, even though `is_win()` only needs to check a single column
- `get_win_kaggle(listboard)` implements the same specs as `get_winner(bitboard)` but forces it to check all columns; its runtime is 100x slower than the bitboard version
- `njit_get_winner(bitboard)` shows a 2x speedup over `get_winner(bitboard)`, but the `@njit` compile time costs the same as 30,000 executions

In [None]:
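gameovers       = np.asarray(gameovers, dtype=np.int64)  # numba expects a global numpy array here, not a python list
njit_get_winner = njit(get_winner)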

def get_win_kaggle(listboard):
    if   any(is_win_kaggle(listboard, column=column, mark=1, config=configuration, has_played=False) for column in range(configuration.columns)): return 1
    elif any(is_win_kaggle(listboard, column=column, mark=2, config=configuration, has_played=False) for column in range(configuration.columns)): return 2
    else: return 0
    
    
print( f'get_win_kaggle(listboard):              {timeit.timeit(lambda: get_win_kaggle(listboard),   number=10000)*100:.4f} µs' )
print( f'is_win_kaggle(listboard, column, mark): {timeit.timeit(lambda: is_win_kaggle(listboard, column=2, mark=2, config=configuration), number=1000000):.4f} µs' )
print( f'get_winner(bitboard):                   {timeit.timeit(lambda: get_winner(bitboard),        number=1000000):.4f} µs' )
print( f'njit_get_winner(bitboard):              {timeit.timeit(lambda: njit_get_winner(bitboard),   number=1)*1000000:.0f} µs -> {timeit.timeit(lambda: njit_get_winner(bitboard),   number=1000000):.4f} µs' )
print( bitboard_to_numpy2d(bitboard))

# is_gameover()

This is a simple wrapper function, that combines the two edgecases for game termination.

It is important to note for the pytorch model later on that these two edgecases have different probability distributions. 

Based on a sample of 70,000 games played by a random agent on both sides:

- Average moves per game: 21.3
- Average number of draws: 1 in 293 games

In [None]:
#@njit
def is_gameover(bitboard: np.ndarray) -> bool:
    if has_no_more_moves(bitboard):  return True
    if get_winner(bitboard) != 0:    return True
    return False

# get_move_number()

Another useful bitshifting technique is bitcounting. This is required to compute the move number and player turn directly from the bitboard. 

Using an integer listboard, this could be done using:
```
np.count_nonzero(listboard)
```

To do this the bitshifting way, we compute an array of single-bit masks for each bit position. `np.arange(x)` is equivalent to `np.array(range(x))`. 

We can place this numpy array on the right side of the left-shift operator `<<` to shift the 1 into all the required locations in a single vectorized operation, 
then use bitwise AND `&` to test which bits in the bitboard are set.
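
For comparison, when the bitboard is held as a plain Python integer (rather than `np.int64`), the set bits can be counted without building a mask array. This is a sketch only: it assumes player 1 always moves first, and `int.bit_count()` needs Python 3.10 or newer.

```python
played = (1 << 30) | (1 << 37) | (1 << 38)   # three tokens on the board

move_number = bin(played).count('1')         # count the 1 bits via the binary string
# On Python 3.10+ the built-in method gives the same answer:
# move_number = played.bit_count()

player_to_move = 1 if move_number % 2 == 0 else 2   # assumes player 1 moves first
print(move_number, player_to_move)
```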

In [None]:
#@njit(int64[:](int8))
def get_bitcount_mask(size: int = configuration.columns * configuration.rows) -> np.ndarray:
    # return np.array([1 << index for index in range(0, size)], dtype=np.int64)
    return 1 << np.arange(0, size, dtype=np.int64)

bitcount_mask = get_bitcount_mask()


#@njit(int8(int64[:]))
def get_move_number(bitboard: np.ndarray) -> int:
    global configuration
    if bitboard[0] == 0: return 0
    size          = configuration.columns * configuration.rows
    mask_bitcount = get_bitcount_mask(size)
    move_number   = np.count_nonzero(bitboard[0] & mask_bitcount[:])
    return move_number

print('bitboard = ')
print(bitboard_to_numpy2d(bitboard))
print(f'get_move_number(bitboard) = {get_move_number(bitboard)}' )
print(f'bitcount_mask = [\n' + '\n'.join(f'{bitmask:042b}' for bitmask in bitcount_mask) + '\n]' )

# result_action()

One final useful function to show is `result_action(bitboard, action, player_id)`, which is the bitboard equivalent of the kaggle `play()` function.

`assert` can be used outside of unit tests to enforce pre- and post- conditions in the [Design By Contract](https://en.wikipedia.org/wiki/Design_by_contract) style of programming.
Assertions can be disabled for production using the `python3 -O` or `python3 -OO` [cli flags](https://docs.python.org/3/using/cmdline.html). 
My [kaggle_compile.py](https://www.kaggle.com/jamesmcguigan/kaggle-compile-py-python-ide-to-kaggle-compiler) comments them out via a regular expression.

In this case, we need to assert that a move is legal before we can make it. Rather than recompute the bitmasks each time, these can be statically computed once upon load.
First we bitwise AND the bitboard with a top-row bitmask, then perform a numpy table lookup on `_is_legal_move_cache[bits, action]` which has been prepopulated with 
all possible `2**configuration.columns = 128` combinations of bits that could result from `bitboard[0] & _is_legal_move_mask`. This requires a negligible amount of memory.

`get_next_index()` is called less often than `is_legal_move()` so simply uses a loop to test for individual bits. 
Using a vectorized implementation here would require testing all bits in the column, but a for loop can short circuit once the first 0 bit is found. 
We count upwards from the ground because a game is more likely to be half-empty than half-full (we start with an empty board which rarely gets filled completely).

`result_action()` allows us to quickly run Connect4 agent simulations without needing to use the kaggle list implementations.
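
The same idea can be sketched with plain Python integers and without the lookup cache. The helper names `legal_moves()` and `drop_token()` are hypothetical and only mirror the logic described above.

```python
from typing import List, Tuple

COLUMNS, ROWS = 7, 6

def legal_moves(played: int) -> List[int]:
    """Columns whose top-row bit (indices 0..COLUMNS-1) is still 0."""
    return [col for col in range(COLUMNS) if (played >> col) & 1 == 0]

def drop_token(played: int, player: int, col: int, player_id: int) -> Tuple[int, int]:
    """Plain-int sketch of result_action(): returns the new (played, player) pair."""
    for row in range(ROWS - 1, -1, -1):          # scan upwards from the ground
        index = col + row * COLUMNS
        if (played >> index) & 1 == 0:           # first empty square in this column
            mark = 0 if player_id == 1 else 1
            return played | (1 << index), player | (mark << index)
    raise ValueError(f'column {col} is full')

played, player = 0, 0
played, player = drop_token(played, player, col=3, player_id=1)   # lands on index 38
played, player = drop_token(played, player, col=3, player_id=2)   # lands on index 31
print(f'{played:042b}')
print(f'{player:042b}')
```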

In [None]:
# Source: https://github.com/Kaggle/kaggle-environments/blob/master/kaggle_environments/envs/connectx/connectx.py
def play(board, column, mark, config):
    columns = config.columns
    rows = config.rows
    row = max([r for r in range(rows) if board[column + (r * columns)] == EMPTY])
    board[column + (row * columns)] = mark

In [None]:
_is_legal_move_mask  = ((1 << configuration.columns) - 1)
_is_legal_move_cache = np.array([
    [
        int( (bits >> action) & 1 == 0 )
        for action in range(configuration.columns)
    ]
    for bits in range(2**configuration.columns)
], dtype=np.int8)

#@njit
def is_legal_move(bitboard: np.ndarray, action: int) -> int:
    bits = bitboard[0] & _is_legal_move_mask   # faster than: int( (bitboard[0] >> action) & 1 == 0 )
    return _is_legal_move_cache[bits, action]  # NOTE: [bits,action] is faster than [bits][action]

#@njit
def get_next_index(bitboard: np.ndarray, action: int) -> int:
    assert is_legal_move(bitboard, action)

    # Start at the ground, and return first row that contains a 0
    for row in range(configuration.rows-1, -1, -1):
        index = action + (row * configuration.columns)
        value = (bitboard[0] >> index) & 1
        if value == 0:
            return index
    return action  # this should never happen - implies not is_legal_move(action)

#@njit
def result_action(bitboard: np.ndarray, action: int, player_id: int) -> np.ndarray:
    assert is_legal_move(bitboard, action)
    index    = get_next_index(bitboard, action)
    mark     = 0 if player_id == 1 else 1
    output = np.array([
        bitboard[0] | 1    << index,
        bitboard[1] | mark << index
    ], dtype=bitboard.dtype)
    return output


In [None]:
print(f'bitboard = \n', bitboard_to_numpy2d(bitboard))
print(f'is_legal_move(bitboard, 1) = {is_legal_move(bitboard, 1)}')
print(f'result_action(bitboard, 1, 2) = \n{bitboard_to_numpy2d(result_action(bitboard, action=1, player_id=2))}')
print(f'_is_legal_move_mask = {_is_legal_move_mask:042b}')
print(f'_is_legal_move_cache.shape = {_is_legal_move_cache.shape}')
print(f'_is_legal_move_cache = \n {_is_legal_move_cache}')

# Further Bitshifting Code Examples

I have a full bitshifting implementation of Connect4 on my github, including examples of how to design heuristics with bitshifting:
- https://github.com/JamesMcGuigan/ai-games/blob/master/games/connectx/core/ConnectXBBNN.py
- https://github.com/JamesMcGuigan/ai-games/blob/master/games/connectx/core/ConnectXBBN_test.py
- https://github.com/JamesMcGuigan/ai-games/blob/master/games/connectx/heuristics/BitsquaresHeuristic.py
- https://github.com/JamesMcGuigan/ai-games/blob/master/games/connectx/heuristics/BitboardGameoversHeuristic.py
- https://github.com/JamesMcGuigan/ai-games/blob/master/games/connectx/heuristics/OddEvenHeuristic.py

# Implementing Bitshifting Functions in Pytorch

According to the Universal Approximation Theorem, a sufficiently large neural network, with a sufficiently large sample of inputs/outputs, should be capable of approximating any mathematical function.

Many of the usecases presented as Kaggle Competitions involve using neural networks to model functions that would be impossible to code classically by hand.

My new notebook will cover the opposite usecase, that of trying to create a neural network implementation of the above bitshifting function `is_gameover()`, which we already have the sourcecode for. The aim is to use reinforcement learning to (hopefully) achieve 100% accuracy, then to compare the performance of these two implementations.

- https://www.kaggle.com/jamesmcguigan/connectx-implementing-functions-in-pytorch