In [None]:
from kaggle_environments import make, evaluate
import numpy as np
from math import sqrt, log
from operator import itemgetter
import copy

This agent is an implementation of [Pure Monte Carlo Tree Search](https://en.wikipedia.org/wiki/Monte_Carlo_tree_search#Pure_Monte_Carlo_game_search) (PMCTS), the simplest of the Monte Carlo Tree Search (MCTS) variants. For an exhaustive review of the topic, see [this](http://www.diego-perez.net/papers/MCTSSurvey.pdf) publication. The main idea behind PMCTS is to explore each state $\textbf{s'}=(s'_1, \dots ,s'_m)$ reachable from an initial state $s_0$, one for each possible action in $\textbf{a}=(a_1, \dots ,a_m)$. Then, for each new state $s'_i$, we simulate the game $n$ times, until a terminal state (loss, win, draw) is reached. The *score* of the state $s'_i$ is a function of the terminal states reached over the $n$ simulations (e.g., the average number of wins). The *score* should approximate the real *value* of the state $s'_i$, that is, the likelihood that the position leads to a win. After simulating each new state, we pick the action $a_i$ corresponding to the state with the highest score.

For this implementation, the set of possible actions is $\textbf{a}=(0, \dots ,6)$, corresponding to the columns a token can be dropped into.
Both the *score* and the *value* are $1$ for a victory, $-1$ for a loss and $0$ for a draw.
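To make the idea concrete before turning to Connect Four, here is a minimal sketch of pure Monte Carlo move selection on a toy game. The game (a single pile of stones, each player takes 1 to 3, whoever takes the last stone wins) and all the function names are assumptions chosen for illustration; they are not part of this agent. Scores follow the same convention: $1$ for a win and $-1$ for a loss (this toy game has no draws).

```python
import random

def legal_moves(stones):
    # A player may take 1, 2 or 3 stones, but never more than remain.
    return [k for k in (1, 2, 3) if k <= stones]

def random_playout(stones, to_move, me, rng):
    # Play uniformly random moves until the pile is empty.
    # The player who takes the last stone wins.
    while True:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            return 1 if to_move == me else -1
        to_move = 1 - to_move

def pure_monte_carlo(stones, n_simulations=200, seed=0):
    rng = random.Random(seed)
    me, opponent = 0, 1
    scores = {}
    for action in legal_moves(stones):
        remaining = stones - action
        if remaining == 0:
            scores[action] = 1.0  # taking the last stone wins outright
            continue
        # Average result of n random games played from the new state.
        total = sum(random_playout(remaining, opponent, me, rng)
                    for _ in range(n_simulations))
        scores[action] = total / n_simulations
    best = max(scores, key=scores.get)
    return best, scores

best, scores = pure_monte_carlo(3)
```

With three stones left, taking all three wins immediately, so that action gets the highest score and is selected. The other two actions are scored purely by the average result of their random playouts.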

The main parameter of the $\verb|agent|$ function is the number of simulations to run, while the main variable is the $\verb|graph|$. The $\verb|graph|$ is the dictionary where we store each state, with all its $\verb|attributes|$. Each node of the graph is, in fact, a state. The $\verb|attributes|$ give us information about the state:
* **bitboard**: contains the bitboard representation of the board
* **parent**: contains the ID of the parent node
* **children**: contains the IDs of the children nodes
* **move**: is the move that brought us *to* the state.
* **turn**: is the player who has to move *from* the state.
* **score**: is the score of the state
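As a small illustration of this structure, the sketch below builds a root node and attaches one child to it. The helpers `make_node` and `add_child` are hypothetical names introduced here for clarity; the agent itself builds the same dictionaries inline.

```python
def make_node(bitboard, parent, move, turn, score=None):
    # One graph node: a state plus the attributes listed above.
    return {'bitboard': bitboard, 'parent': parent, 'children': set(),
            'move': move, 'turn': turn, 'score': score}

def add_child(graph, parent_id, bitboard, move, score=None):
    child_id = len(graph)                # IDs are assigned sequentially
    turn = 3 - graph[parent_id]['turn']  # players are numbered 1 and 2
    graph[child_id] = make_node(bitboard, parent_id, move, turn, score)
    graph[parent_id]['children'].add(child_id)
    return child_id

# Root: empty board, player 1 to move.
graph = {0: make_node({1: 0, 2: 0}, parent=None, move=None, turn=1, score=0)}
# Player 1 drops a token in column 0 (bit 0 of its bitstring is set).
child = add_child(graph, 0, {1: 1, 2: 0}, move=0)
```

Keeping the parent ID and the move in every node means the move that leads to any child of the root can be read back directly from the child.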

The remaining variables are the *board* itself, the number ($1$ or $2$) identifying the agent, the number identifying the *opponent*, and the two bitstrings representing the players' positions. The bitstrings are described in the next paragraph.

In [None]:
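def sim_agent(obs, conf):
    SIMULS = 500  # number of random games simulated for each candidate move

    # Decode the observation first: the root node below depends on these values.
    board = np.array(obs.board, np.int8).reshape(conf.rows, conf.columns)
    agent = obs.mark
    opponent = switch_player(agent)
    agent_mask, opponent_mask = bitboard(board, agent)

    # Root node (ID 0): the current position, with the agent to move.
    graph = {0: {'bitboard': {agent: agent_mask, opponent: opponent_mask}, 'parent': None,
                 'children': set(), 'move': None, 'turn': agent, 'score': 0}}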

The main function *agent* is built from a number of helper functions, each one performing one particular task.
The *bitboard* function transforms the $6\times 7$ input board representation into a bitboard made of two bitstrings, one for each player. This is crucial for the simulations: the time budget is limited to 5 seconds, so running more simulations requires making each of them as cheap as possible. For the sake of brevity, I omit the details. If you want to know more, check [this](https://towardsdatascience.com/creating-the-perfect-connect-four-ai-bot-c165115557b0) article. I used its implementation, with some slight modifications.
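The bit layout can be made explicit with a short sketch. Under my reading of the `bitboard` function below, each column takes 7 bits (6 cells plus an always-empty sentinel bit above them), column 0 occupies the lowest bits, and the bottom cell of a column is its lowest bit. The names `bit_index` and `to_bitboards` are hypothetical, and the board is a plain list of lists rather than a NumPy array.

```python
ROWS, COLUMNS = 6, 7

def bit_index(row, col):
    # Row 0 is the top of the board and row 5 the bottom.
    # Bit 7*col is the bottom cell of the column, bit 7*col + 5 the top
    # cell, and bit 7*col + 6 the sentinel that is never set.
    return 7 * col + (ROWS - 1 - row)

def to_bitboards(board, player):
    # board: list of 6 rows, each a list of 7 ints (0 empty, 1 or 2 a token)
    mine, theirs = 0, 0
    for row in range(ROWS):
        for col in range(COLUMNS):
            if board[row][col] == player:
                mine |= 1 << bit_index(row, col)
            elif board[row][col] != 0:
                theirs |= 1 << bit_index(row, col)
    return mine, theirs

board = [[0] * COLUMNS for _ in range(ROWS)]
board[5][0] = 1  # player 1 in the bottom-left cell
board[5][3] = 2  # player 2 at the bottom of the middle column
board[4][3] = 1  # player 1 stacked on top of it
mine, theirs = to_bitboards(board, player=1)
```

The sentinel bit is what keeps a run of tokens at the top of one column from being mistaken for a run that continues at the bottom of the next.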

In [None]:
def bitboard(board, player):
    agent_mask, opponent_mask = '', ''
    for j in range(6, -1, -1):
        opponent_mask += '0'
        agent_mask += '0'
        for i in range(6):
            if board[i,j] == 0:
                agent_mask += '0'
                opponent_mask += '0'
            elif board[i,j] == player:
                agent_mask += '1'
                opponent_mask += '0'
            else:
                opponent_mask += '1'
                agent_mask += '0'
    return int(agent_mask, 2), int(opponent_mask, 2)


The same holds for the next helper function. *Outcome* returns the value of the state if it is terminal. Otherwise it returns None. The last if-statement differs from the previously mentioned article: when checking for a draw, the two bitstrings are merged (using $OR$, which is equivalent to $XOR$ here because the two players never occupy the same spot) in order to have a bitstring that contains all the occupied positions. If all of them are occupied, the bitstring equals one specific number (in binary, $0111111011111101111110111111011111101111110111111$). In that case neither player can make another move, hence a draw.
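The shift-based test can be tried in isolation. The sketch below is a compact variant written for illustration, not the agent's own function: `has_four`, `FULL_BOARD` and `is_draw` are hypothetical names, and the layout assumed is 7 bits per column with the bottom cell as the lowest bit.

```python
def has_four(position):
    # Shift 1 looks along a column, 7 along a row, 6 and 8 along the diagonals.
    for shift in (1, 6, 7, 8):
        # Bits that have a neighbour one step away in this direction.
        pairs = position & (position >> shift)
        # Two such pairs, two steps apart, make four in a row.
        if pairs & (pairs >> (2 * shift)):
            return True
    return False

# Every playable cell occupied: six 1s per column, sentinel bit left at 0.
FULL_BOARD = int(('0' + '1' * 6) * 7, 2)

def is_draw(position_a, position_b):
    # Only meaningful once neither player has four in a row.
    return (position_a | position_b) == FULL_BOARD

vertical = 0b1111                                     # four stacked in column 0
horizontal = sum(1 << (7 * col) for col in range(4))  # bottom row, columns 0-3
```

Both `has_four(vertical)` and `has_four(horizontal)` hold, while tokens in the top two cells of one column and the bottom two cells of the next do not count as a line, because the sentinel bit separates them.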

In [None]:
def outcome(bb):
    # Four in a row for the agent. Shifts: 7 = horizontal,
    # 6 and 8 = the two diagonals, 1 = vertical.
    par = bb[agent] & (bb[agent] >> 7)
    if par & (par >> 14):
        return 1
    par = bb[agent] & (bb[agent] >> 6)
    if par & (par >> 12):
        return 1
    par = bb[agent] & (bb[agent] >> 8)
    if par & (par >> 16):
        return 1
    par = bb[agent] & (bb[agent] >> 1)
    if par & (par >> 2):
        return 1

    # Same four checks for the opponent.
    par = bb[opponent] & (bb[opponent] >> 7)
    if par & (par >> 14):
        return -1
    par = bb[opponent] & (bb[opponent] >> 6)
    if par & (par >> 12):
        return -1
    par = bb[opponent] & (bb[opponent] >> 8)
    if par & (par >> 16):
        return -1
    par = bb[opponent] & (bb[opponent] >> 1)
    if par & (par >> 2):
        return -1

    # Draw: the merged bitstrings of both players fill every cell.
    if bb[agent] | bb[opponent] == int(('0' + '1'*6)*7, 2):
        return 0
    else:
        return None

Since some columns may be completely filled, we need to check which ones are still free before making a move. $\verb|Possible_moves|$ merges the two bitstrings to obtain the bitstring representing all the filled spots. Then it performs a bitwise $AND$ with a bitstring having $1$s only in the first (top) row. The resulting bitstring has a $1$ in a spot of the first row only if there is a token in that spot. Finally, it iterates over the column indices $i \in \{0,\dots,6\}$, generating a bitstring with a single $1$ in the $i$-th spot of the first row. If the bitwise $AND$ of this bitstring with the previous result is non-zero, column $i$ is full; otherwise $i$ is added to the list of available moves.

$\verb|New_board|$ returns the bitstring of the player defined by $\verb|turn|$ after that player performs a $\verb|move|$. It works like this: suppose that the token is dropped into column $1$ and that the column, in the merged bitstring of the previous state, looks like $001111$, where the zeros are the free spots of the column. After the drop, the column should be $011111$. To obtain this, consider the column with a single token in the bottom row, $000001$. Adding it to the initial state gives $001111 + 000001 = 010000$, which is exactly the spot where the dropped token should land. A bitwise $OR$ between this last result and the initial column state gives $001111 \mid 010000 = 011111$, which is exactly the desired result. Since the merged bitstring now also contains the new token, a bitwise $XOR$ with the bitstring of the other player leaves only the tokens of the player who moved.
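The arithmetic above can be checked with a few lines of code. This is a sketch under the assumption that the bottom cell of column `col` is bit `7 * col` and its top cell is bit `7 * col + 5`; `TOP_ROW`, `legal_columns` and `drop` are hypothetical names used only for this example.

```python
# Bitstring with a 1 in the top cell of every column.
TOP_ROW = sum(1 << (7 * col + 5) for col in range(7))

def legal_columns(mask):
    # Keep only the filled cells of the top row, then test each column.
    filled_top = mask & TOP_ROW
    return [col for col in range(7) if not filled_top & (1 << (7 * col + 5))]

def drop(mover, other, col):
    mask = mover | other  # all occupied cells
    # Adding the bottom bit of the column carries up to the first free cell;
    # OR-ing keeps the tokens that were already there.
    new_mask = mask | (mask + (1 << (7 * col)))
    # Removing the other player's tokens leaves the mover's new bitstring.
    return other ^ new_mask

# Column 0 holds four tokens (001111), two for each player.
mover, other = 0b001010, 0b000101
after = drop(mover, other, 0)  # the mover gains the cell 010000
```

Here `after` equals `0b011010`: the mover's two old tokens plus the new one on top of the stack.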

In [None]:
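# Assumed definition, derived from the layout produced by bitboard():
# a 1 in the top cell of every column (bit 7*i + 5 for column i).
COMPLETELY_FILLED_FIRST_ROW = int(('01' + '0'*5)*7, 2)

def possible_moves(bb):
    complete_mask = bb[1] ^ bb[2]
    first_row = complete_mask & COMPLETELY_FILLED_FIRST_ROW
    moves = []
    for i in range(7):
        # Column i is playable while its top cell is still empty.
        if not first_row & 2**(7*(i+1)-2):
            moves.append(i)
    return moves

def new_board(bb, move, turn):
    complete_mask = bb[1] ^ bb[2]
    # The bottom cell of column `move` is bit 7*move in this layout:
    # adding it carries up to the first free cell of the column.
    new_complete_mask = complete_mask | (complete_mask + (1 << (7 * int(move))))
    # Removing the other player's tokens leaves the mover's new bitstring.
    return bb[switch_player(turn)] ^ new_complete_mask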


This simple function changes the player. It is useful when simulating a game.

In [None]:
def switch_player(player):
    # Players are numbered 1 and 2: 1 -> 2 and 2 -> 1.
    return [2, 1][player - 1]
    

The $\verb|best_move|$ helper iterates over all the children of the initial state and returns the move corresponding to the child node with the highest score. Its return value is the final output of the agent.

In [None]:
def best_move():
    # Pair the move of each child of the root with its score,
    # then keep the move with the highest score.
    candidates = [(graph[child_id]['move'], graph[child_id]['score'])
                  for child_id in graph[0]['children']]
    move, _ = max(candidates, key=itemgetter(1))
    return int(move)

The $\verb|simulate|$ helper performs the simulations on a specific node. It initialises a $\verb|performance|$ variable that sums the results of the simulated games. Since each result is between $-1$ and $1$, the final *score*, calculated as $\verb|performance|$ divided by the number of simulations (last line), is still between $-1$ and $1$. At the start of each simulation the bitboard is copied; then, move after move, a random move is chosen among the available ones and performed. The new bitstring is calculated and substituted for the old one, only for the player who just moved. The $\verb|result|$ is then checked: if the state is terminal, the $\verb|result|$ is added to $\verb|performance|$ and another simulation starts. Otherwise, the turn passes to the other player and the game continues. At the end, the score of the node is updated.
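For readers who want to experiment outside the agent, here is a self-contained sketch of the same playout loop. It is not the agent's code: it uses Python's `random` module instead of NumPy, the names `playout` and `estimate` are hypothetical, and it assumes the bit layout with 7 bits per column and the bottom cell as the lowest bit. The starting position must not be terminal.

```python
import random

# Every playable cell occupied (sentinel bits excluded).
FULL_BOARD = sum(0b111111 << (7 * col) for col in range(7))

def has_four(position):
    for shift in (1, 6, 7, 8):  # vertical, diagonal, horizontal, diagonal
        pairs = position & (position >> shift)
        if pairs & (pairs >> (2 * shift)):
            return True
    return False

def playout(positions, to_move, me, rng):
    pos = dict(positions)  # a shallow copy is enough: the values are ints
    while True:
        mask = pos[1] | pos[2]
        # Columns whose top cell is still empty.
        columns = [c for c in range(7) if not mask & (1 << (7 * c + 5))]
        col = rng.choice(columns)
        new_mask = mask | (mask + (1 << (7 * col)))
        pos[to_move] = pos[3 - to_move] ^ new_mask
        if has_four(pos[to_move]):
            return 1 if to_move == me else -1
        if new_mask == FULL_BOARD:
            return 0
        to_move = 3 - to_move

def estimate(positions, to_move, me, n_simulations=200, seed=0):
    # Mean result of n random games, always between -1 and 1.
    rng = random.Random(seed)
    total = sum(playout(positions, to_move, me, rng)
                for _ in range(n_simulations))
    return total / n_simulations

score = estimate({1: 0, 2: 0}, to_move=1, me=1)
```

Fixing the seed makes the estimate reproducible, which is convenient when comparing different numbers of simulations.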

In [None]:
def simulate(_id):
    performance = 0
    for _ in range(SIMULS):
        # Every simulated game starts from a copy of the node's position,
        # so the bitboard stored in the graph is never modified.
        sim_player = graph[_id]['turn']
        sim_board = copy.deepcopy(graph[_id]['bitboard'])
        while True:
            random_move = int(np.random.choice(possible_moves(sim_board)))
            sim_board[sim_player] = new_board(sim_board, random_move, sim_player)
            result = outcome(sim_board)
            if result is not None:
                performance += result
                break
            sim_player = switch_player(sim_player)
    # Mean result over all simulated games, between -1 and 1.
    graph[_id]['score'] = performance / SIMULS

Given the initial state $s_0$, we need to populate the graph with all the reachable states $s'_i$. For each possible move, a new node ID and a copy of the bitboard are generated. Then, the copy is updated with the new position of the player who moved. The outcome of the new board (None if the state is not terminal) is stored as the $\verb|score|$ of the new node, and the ID of the new node is added to the set of the $\verb|children|$ of the initial state.

In [None]:
def explore_node(_id):
    for move in possible_moves(graph[_id]['bitboard']):
        new_node_turn = switch_player(graph[_id]['turn'])
        new_node_id = len(graph)
        new_node_bitboard = copy.deepcopy(graph[_id]['bitboard'])
        new_node_bitboard[graph[_id]['turn']] = new_board(graph[_id]['bitboard'],
                                                         move, graph[_id]['turn'])
        new_node_score = outcome(new_node_bitboard)
        graph[new_node_id] = {'bitboard': new_node_bitboard, 'parent': _id, 
                              'children': set(), 'move': move, 'turn': new_node_turn, 
                              'score': new_node_score}
        graph[_id]['children'].add(new_node_id)

The last lines of the $\verb|agent|$ run the algorithm itself: the initial node is expanded; a child node that is a winning state is played immediately, while the other (non-terminal) children are scored by running a sequence of simulations. Finally, the move corresponding to the node with the highest score is returned.

In [None]:
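# These lines close the body of sim_agent.
explore_node(0)
for child_id in graph[0]['children']:
    if graph[child_id]['score'] == 1:
        # Immediate win: play it without simulating.
        return int(graph[child_id]['move'])
    if graph[child_id]['score'] is None:
        # Non-terminal child: estimate its value with random games.
        # A terminal child (a draw) keeps the score set by explore_node.
        simulate(child_id)
return best_move()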