# Connect 4

---

Author: S. Menary [sbmenary@gmail.com]

Date  : 2023-01-03, last edit 2023-01-15

Brief : Develop a simple Connect 4 game environment and implement a bot using Monte Carlo Tree Search (MCTS)

---

### Summary

- Connect 4 is a two-player, fully-observable, zero-sum game. 
- The game states may be represented as a tree structure. We can therefore implement a bot using tree-search algorithms. We choose Connect 4 because it is simple, and therefore provides a launch-pad for more complex games such as checkers or chess.
- Initially we implement vanilla MCTS with no machine learning. We expect this to be limited by (i) the stochastic rollout of the tree and (ii) the simplicity of the simulation policy.
- To introduce ML, we would perform alternate steps of MCTS evaluation and simulation policy improvement. In this way, the simulated games will _hopefully_ begin to approach "good play", and the final MCTS values will reflect the behaviour of good players.
- MCTS configuration:
    + Tree-traversal policy is:
        1. From the current node, uniformly-randomly select a non-expanded child if one is available
        2. Otherwise select child with highest UCB-1 score, traverse to this node and repeat
    + Resulting node is expanded by adding all possible children and selecting one by performing a uniformly-random action
    + Simulation policy is to select a uniformly-random action
- The UCB-1 score is designed to balance exploration and exploitation for stationary multi-armed bandits. Strictly speaking, we are applying it in a non-stationary environment, because the reward distribution for each action changes as the downstream tree evolves. This makes UCB-1 theoretically sub-optimal. However, it is often used nonetheless.
- When playing an actual move (i.e. inference time), greedily select the action with the max average score from its MCTS visits (do not use UCB-1 since we are no longer exploring).
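
The UCB-1 traversal rule described above can be sketched as follows. This is a minimal illustration rather than the code used in `connect4.MCTS`: the function names, the argument names and the exploration constant `c = sqrt(2)` are assumptions made for this example.

```python
import math

def ucb1_score(total_score: float, num_visits: int, parent_visits: int,
               c: float = math.sqrt(2)) -> float:
    """Mean score plus an exploration bonus (c is an assumed constant)."""
    if num_visits == 0:
        # Unvisited children are always preferred over visited ones
        return math.inf
    mean = total_score / num_visits
    bonus = c * math.sqrt(math.log(parent_visits) / num_visits)
    return mean + bonus

def select_child(children: list[tuple[float, int]]) -> int:
    """Return the index of the (total_score, num_visits) pair with the highest UCB-1 score."""
    parent_visits = sum(n for _, n in children)
    scores = [ucb1_score(t, n, parent_visits) for t, n in children]
    return max(range(len(children)), key=scores.__getitem__)
```

The bonus shrinks as a child accumulates visits, so rarely visited children are revisited occasionally even when their mean score is low. At inference time the bonus is dropped and only the mean is compared.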

Observations:
- Strength of decision-making depends on how many iterations of MCTS we perform:
    1. When the tree is shallow, we effectively assume that future play is random, so we favour moves with the greatest number of winning continuations. We may therefore neglect to defend against an imminent loss, preferring a different move with many ways to win (bad behaviour).
    2. When the tree is deep and the UCB-1 scores converge towards the true means, at least for the best moves, we effectively assume that future play is optimal. As the play-count goes to infinity, our scores become unbiased.
    3. For a finite but sufficient run-time, we assume optimal play, but the mean scores remain biased because our early simulations used random play instead of optimal play.
- This explains why even random simulation MCTS is pretty good - we end up doing most of our simulations with pretty effective play, at least for the next few moves where our tree is sufficiently grown.
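
The configuration and observations above can be tried out on a much smaller game. The sketch below runs vanilla MCTS on a toy subtraction game (players alternately take 1 to 3 stones, and whoever takes the last stone wins). The game, the `Node` class and all function names are assumptions made for illustration and are not the `connect4` implementation. As a simplification, it adds one child per expansion rather than all children at once.

```python
import math
import random

class Node:
    def __init__(self, stones, parent=None):
        self.stones = stones      # stones left; the player to move takes 1-3
        self.parent = parent
        self.children = {}        # action (stones taken) -> Node
        self.visits = 0
        self.total = 0.0          # summed score, seen by the player who moved INTO this node

    def untried(self):
        return [a for a in (1, 2, 3) if a <= self.stones and a not in self.children]

def rollout(stones, rng):
    """Uniformly-random play; return +1 if the player to move wins, else -1."""
    if stones == 0:
        return -1                 # previous player took the last stone
    sign = 1
    while True:
        stones -= rng.randint(1, min(3, stones))
        if stones == 0:
            return sign
        sign = -sign

def mcts_step(root, rng, c=math.sqrt(2)):
    node = root
    # 1. Traversal: follow the highest UCB-1 child while fully expanded
    while node.stones > 0 and not node.untried():
        parent = node
        node = max(parent.children.values(),
                   key=lambda ch: ch.total / ch.visits
                   + c * math.sqrt(math.log(parent.visits) / ch.visits))
    # 2. Expansion: add one uniformly-random untried child
    if node.stones > 0:
        action = rng.choice(node.untried())
        node.children[action] = Node(node.stones - action, parent=node)
        node = node.children[action]
    # 3. Simulation: result is from the view of the player to move at `node`
    score = -rollout(node.stones, rng)
    # 4. Backpropagation: flip the sign at each level of the tree
    while node is not None:
        node.visits += 1
        node.total += score
        score = -score
        node = node.parent

def mean_score(node):
    return node.total / node.visits

def best_action(root):
    """Greedy choice at inference time: highest mean score, no exploration bonus."""
    return max(root.children, key=lambda a: mean_score(root.children[a]))

rng = random.Random(0)
root = Node(5)
for _ in range(2000):
    mcts_step(root, rng)
for action, child in sorted(root.children.items()):
    print(f"take {action}: N={child.visits}, mean={mean_score(child):+.3f}")
```

Re-running this with only a handful of steps, and then with many, is one way to see the shift from "assume random play" towards "assume good play" described in the observations.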


## Imports

In [1]:
###
###  Required imports
###  - all imports should be placed here
###


##  Python core libs
from __future__ import annotations   # kept first: required if this cell is run as a script
import sys, time
from enum import IntEnum
from abc  import ABC, abstractmethod, abstractstaticmethod

##  PyPI libs
import numpy as np

##  Local packages
from connect4.utils import DebugLevel
from connect4.game  import GameBoard
from connect4.MCTS  import Node_VanillaMCTS, PolicyStrategy
from connect4.bot   import Bot_VanillaMCTS


In [2]:
###
###  Print version for reproducibility
###

print(f"Python version is {sys.version}")
print(f"Numpy  version is {np.__version__}")

Python version is 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:25:29) [Clang 14.0.6 ]
Numpy  version is 1.23.2


##  GameBoard

The `GameBoard` object is used to interact with a game of Connect 4.

In [3]:
###
###  Setup a small game
###  - 4x4 grid
###  - line of 3 needed to win
###

##  Create game board
game_board = GameBoard(4, 4, 3)

##  Show initial game board
print(f"Initial game board:\n{game_board}")

##  Play a few moves
game_board.apply_action(1)
game_board.apply_action(2)
game_board.apply_action(1)

##  Show updated game state
print(f"\nAfter a few moves:\n{game_board}")


Initial game board:
+---+---+---+---+
| . | . | . | . |
| . | . | . | . |
| . | . | . | . |
| . | . | . | . |
+---+---+---+---+
| 0 | 1 | 2 | 3 |
+---+---+---+---+
Game result is: NONE

After a few moves:
+---+---+---+---+
| . | . | . | . |
| . | . | . | . |
| . | X | . | . |
| . | X | O | . |
+---+---+---+---+
| 0 | 1 | 2 | 3 |
+---+---+---+---+
Game result is: NONE
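
For readers curious about what such a board has to do internally, here is a minimal sketch of dropping a piece and checking for a line. It is not the `connect4.game.GameBoard` implementation: the array layout (row 0 at the top, `0` for empty, `+1`/`-1` for the two players) and the names `drop_piece` and `has_line` are assumptions made for this example.

```python
import numpy as np

def drop_piece(board: np.ndarray, column: int, player: int) -> int:
    """Place `player` (+1 or -1) in the lowest empty cell of `column`; return its row."""
    empty_rows = np.flatnonzero(board[:, column] == 0)
    if empty_rows.size == 0:
        raise ValueError(f"column {column} is full")
    row = int(empty_rows[-1])     # row index grows downwards, so the last empty row is the lowest
    board[row, column] = player
    return row

def has_line(board: np.ndarray, player: int, target: int) -> bool:
    """True if `player` has `target` pieces in a horizontal, vertical or diagonal line."""
    rows, cols = board.shape
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                end_r, end_c = r + (target - 1) * dr, c + (target - 1) * dc
                if not (0 <= end_r < rows and 0 <= end_c < cols):
                    continue      # line would run off the board
                if all(board[r + i * dr, c + i * dc] == player for i in range(target)):
                    return True
    return False

# Reproduce the three moves played above on a 4x4 board with a line of 3 to win
board = np.zeros((4, 4), dtype=int)
drop_piece(board, 1, +1)   # X
drop_piece(board, 2, -1)   # O
drop_piece(board, 1, +1)   # X
print(has_line(board, +1, 3))
```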


##  MCTS

The `Node_VanillaMCTS` object is used to perform vanilla MCTS searches.

In [4]:
###
###  Perform a few MCTS steps
###  - transitions into a critical state where O player needs to be careful not to 
###    blunder a win for X
###

##  Create a root node at the current game state
root_node = Node_VanillaMCTS(game_board, label="ROOT")

##  Print the initial value tree (should be a ROOT node with no children)
print("Initial tree:")
print(root_node.tree_summary())
print()

##  Perform several MCTS steps with a MEDIUM debug level
root_node.multi_step_MCTS(num_steps=10, max_sim_steps=-1, debug_lvl=DebugLevel.MEDIUM)

##  Print the updated value tree 
print("Updated tree:")
print(root_node.tree_summary())
print()


Initial tree:
> [0: ROOT] N=0, T=0.000, E=inf, Q=-inf
     > None
     > None
     > None
     > None

Running MCTS step 0
Select unvisited action O:1
Simulation ended with result X with compound_discount=1.000
Simulated trajectory was: X:0 O:0 X:3 O:0 X:2 O:1 X:2
Node O:1 with parent=O, N=0, T=0.00 receiving score -1.00
Node ROOT with parent=NONE, N=0, T=0.00 receiving score 0.00

Running MCTS step 1
Select unvisited action O:2
Simulation ended with result X with compound_discount=1.000
Simulated trajectory was: X:2 O:0 X:3 O:2 X:0 O:3 X:3 O:0 X:1
Node O:2 with parent=O, N=0, T=0.00 receiving score -1.00
Node ROOT with parent=NONE, N=1, T=0.00 receiving score 0.00

Running MCTS step 2
Select unvisited action O:3
Simulation ended with result X with compound_discount=1.000
Simulated trajectory was: X:2 O:0 X:3
Node O:3 with parent=O, N=0, T=0.00 receiving score -1.00
Node ROOT with parent=NONE, N=2, T=0.00 receiving score 0.00

Running MCTS step 3
Select unvisited action O:0
Simulation 

##  Bot

The `Bot_VanillaMCTS` object is used to apply bot actions using vanilla MCTS.

In [9]:
###
###  Use MCTS to play a move
###

##  Use MCTS to search for an optimal action
bot    = Bot_VanillaMCTS()
action = bot.choose_action(game_board, duration=1, debug_lvl=DebugLevel.LOW)

##  Play bot move
game_board.apply_action(action)

##  Show updated game state
print(game_board)


Selecting greedy action from posterior values
Action values are:  0.191   -0.174  0.791   -0.130
Visit counts are:   47      23      804     23    
Selecting action 2
+---+---+---+---+
| . | . | . | . |
| . | O | . | . |
| . | X | X | . |
| . | X | O | . |
+---+---+---+---+
| 0 | 1 | 2 | 3 |
+---+---+---+---+
Game result is: NONE
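
The greedy selection reported in the debug output can be reproduced directly from the printed numbers. The snippet below is only an illustration of that final step, with variable names chosen for this example. It also shows the visit-count criterion, which some MCTS implementations use instead of the mean score.

```python
import numpy as np

# Values copied from the bot's debug output above
action_values = np.array([0.191, -0.174, 0.791, -0.130])
visit_counts  = np.array([47, 23, 804, 23])

greedy_action = int(np.argmax(action_values))    # highest mean score
most_visited  = int(np.argmax(visit_counts))     # alternative criterion
visit_share   = visit_counts / visit_counts.sum()

print(f"Greedy action      : {greedy_action}")
print(f"Most-visited action: {most_visited}")
print(f"Visit share        : {np.round(visit_share, 3)}")
```

Here both criteria agree on column 2, which also received the large majority of the visits because UCB-1 concentrates the search on promising moves.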


## Play a game

Play a game of Connect 4 against our bot!

Just add new calls to `game_board.apply_action(column_index)` to play a move in column `column_index`, and `bot.take_move(game_board, duration)` to play a bot move in response. Turning up the `duration` parameter will improve the bot by allowing it to search for longer.
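
To see why a longer `duration` helps, consider a generic time-budgeted search loop. The sketch below is an assumption about how such a budget might be enforced, not the code inside `Bot_VanillaMCTS`: the helper name `run_for`, the `step` callback and the interpretation of the budget as seconds are all hypothetical.

```python
import time
from typing import Callable

def run_for(duration_s: float, step: Callable[[], None]) -> int:
    """Call `step` repeatedly until `duration_s` seconds have elapsed; return the call count."""
    deadline = time.perf_counter() + duration_s
    num_steps = 0
    while time.perf_counter() < deadline:
        step()                # e.g. one MCTS iteration
        num_steps += 1
    return num_steps

# Stand-in for a single search iteration
calls = []
num_steps = run_for(0.05, lambda: calls.append(1))
print(f"Completed {num_steps} steps within the budget")
```

A larger budget means more iterations, and therefore a deeper tree and more reliable mean scores before the greedy choice is made.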

In [None]:
##  Create a new game

game_board = GameBoard()
bot        = Bot_VanillaMCTS(greedy=True)
print(game_board)


In [None]:
##  Play a move in column index 3

game_board.apply_action(3)
print(game_board)

if not game_board.get_result() :
    bot.take_move(game_board, duration=10, debug_lvl=DebugLevel.LOW)
    print(game_board)



... and so on until the game is complete!


## Bot-only game

Let's watch the bot play itself!

In [None]:
#  Play a bot game!

game_board = GameBoard()
bot        = Bot_VanillaMCTS()
print(game_board)

result = game_board.get_result()
while not result :
    bot.take_move(game_board, duration=10, debug_lvl=DebugLevel.LOW)
    result = game_board.get_result()
    print(game_board)
