#  Debug neural MCTS

---

Author: S. Menary [sbmenary@gmail.com]

Date  : 2023-01-15, last edit 2023-01-19

Brief : Debug the behaviour of a neural-network bot that uses Monte Carlo Tree Search (MCTS)

---

## Imports

---

In [1]:
##=====================================##
##  All imports should be placed here  ##
##=====================================##

##  Python core libs
import pickle, sys, time

##  PyPI libs
import numpy as np
from matplotlib import pyplot as plt

##  Local packages
from connect4.utils    import DebugLevel
from connect4.game     import BinaryPlayer, GameBoard, GameResult
from connect4.MCTS     import Node_NeuralMCTS, PolicyStrategy
from connect4.bot      import Bot_NeuralMCTS, Bot_VanillaMCTS
from connect4.parallel import generate_from_processes
from connect4.neural   import load_model
from connect4.methods  import get_training_data_from_bot_game


In [2]:
##=====================================##
##  Print version for reproducibility  ##
##=====================================##

print(f"{'Python'    .rjust(12)} version is {sys.version}")
print(f"{'Numpy'     .rjust(12)} version is {np.__version__}")


      Python version is 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:25:29) [Clang 14.0.6 ]
       Numpy version is 1.23.2


In [3]:
##============================##
##  Set global config values  ##
##============================##

model_idx = 4
model_name = f"../models/.neural_model_v{model_idx}.h5"

print(f"Using model: {model_name}")


Using model: ../models/.neural_model_v4.h5


##  Test neural model MCTS

- Test that we can propagate values and make decisions correctly with neural MCTS
- Find a good value for the `duration` parameter (the smallest value that yields stable posteriors)
- These cells cannot be run during a regular run, since TensorFlow cannot be used in the main process before spawning child processes
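One way to put a number on "stable posteriors" is to compare the posterior policies produced by separate searches from the same position. The sketch below assumes the posterior is the normalised root visit count, which appears consistent with the values logged in this notebook. The helpers `posterior_from_visits` and `total_variation` are hypothetical names introduced here for illustration and are not part of the `connect4` package.

```python
import numpy as np

def posterior_from_visits(visit_counts):
    """Normalise MCTS root visit counts into a posterior policy (assumed definition)."""
    counts = np.asarray(visit_counts, dtype=float)
    total = counts.sum()
    if total <= 0:
        raise ValueError("need at least one visit to form a posterior")
    return counts / total

def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return 0.5 * np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)).sum()

# Root visit counts on the empty board, copied from the logged output of this notebook:
# one search with duration=1 and one with duration=5.
visits_short = [1, 10, 1, 226, 3, 1, 1]
visits_long  = [3, 24, 13, 825, 9, 1, 2]

post_short = posterior_from_visits(visits_short)
post_long  = posterior_from_visits(visits_long)

print("Short-search posterior:", np.round(post_short, 2))
print(f"TV distance between the two posteriors: {total_variation(post_short, post_long):.3f}")
```

Repeating the search several times at each candidate `duration` and picking the smallest value whose pairwise distances stay below a chosen tolerance would be one possible tuning procedure; the tolerance itself is a judgment call.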


In [5]:
##============================##
##  Perform a few MCTS steps  ##
##============================##

##  Create game board
game_board = GameBoard()
print(f"\nInitial game board:\n{game_board}")

##  Create a root node at the current game state
model      = load_model(model_name)
root_node  = Node_NeuralMCTS(game_board, params=[model, 1.], label="ROOT")

##  Print the initial value tree (should be a ROOT node with no children)
print("Initial tree:")
print(root_node.tree_summary())
print()

##  Perform several MCTS steps with a MEDIUM debug level
root_node.multi_step_MCTS(num_steps=20, max_sim_steps=-1, discount=0.99, debug_lvl=DebugLevel.MEDIUM)

##  Print the updated value tree 
print("Updated tree:")
print(root_node.tree_summary())
print()



Initial game board:
+---+---+---+---+---+---+---+
| . | . | . | . | . | . | . |
| . | . | . | . | . | . | . |
| . | . | . | . | . | . | . |
| . | . | . | . | . | . | . |
| . | . | . | . | . | . | . |
| . | . | . | . | . | . | . |
+---+---+---+---+---+---+---+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
+---+---+---+---+---+---+---+
Game result is: NONE
Initial tree:
> [0: ROOT] N=0, T=0.000, E=nan, Q=-inf
     > None
     > None
     > None
     > None
     > None
     > None
     > None

Running MCTS step 0
Select unvisited action X:0
Simulation using prior value 0.0227
Node X:0 with parent=X, N=0, T=0.00 receiving score 0.02
Node ROOT with parent=NONE, N=0, T=0.00 receiving score 0.00

Running MCTS step 1
Select unvisited action X:6
Simulation using prior value 0.0227
Node X:6 with parent=X, N=0, T=0.00 receiving score 0.02
Node ROOT with parent=NONE, N=1, T=0.00 receiving score 0.00

Running MCTS step 2
Select unvisited action X:2
Simulation using prior value 0.0008
Node X:2 with parent=X, N=0, 

In [6]:
##==========================================##
##  Play a game and generate training data  ##
##==========================================##

model_inputs, posteriors, values = get_training_data_from_bot_game(model, duration=1, discount=0.99,
                                                                  debug_lvl = DebugLevel.LOW)


Using bot <connect4.bot.Bot_NeuralMCTS object at 0x12708e7d0>
+---+---+---+---+---+---+---+
| . | . | . | . | . | . | . |
| . | . | . | . | . | . | . |
| . | . | . | . | . | . | . |
| . | . | . | . | . | . | . |
| . | . | . | . | . | . | . |
| . | . | . | . | . | . | . |
+---+---+---+---+---+---+---+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
+---+---+---+---+---+---+---+
Game result is: NONE
Selecting uniformly random action
Action values are:  0.023   0.076   0.001   0.046   0.020   -0.025  0.023 
Visit counts are:   1       10      1       226     3       1       1     
Selecting action 0
+---+---+---+---+---+---+---+
| . | . | . | . | . | . | . |
| . | . | . | . | . | . | . |
| . | . | . | . | . | . | . |
| . | . | . | . | . | . | . |
| . | . | . | . | . | . | . |
| X | . | . | . | . | . | . |
+---+---+---+---+---+---+---+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
+---+---+---+---+---+---+---+
Game result is: NONE
Sampling action from posterior policy 0.00 0.00 0.00 0.98 0.00 0.00 0.00
Action valu

Sampling action from posterior policy 0.01 0.01 0.15 0.37 0.06 0.39 0.02
Action values are:  0.193   0.489   0.619   0.827   0.673   0.796   0.090 
Visit counts are:   2       2       38      91      16      96      4     
Selecting action 3
+---+---+---+---+---+---+---+
| . | . | . | . | . | . | . |
| . | . | . | O | . | . | . |
| . | . | X | X | . | . | . |
| . | . | O | O | . | . | . |
| . | . | O | X | . | . | . |
| X | X | X | O | O | O | X |
+---+---+---+---+---+---+---+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
+---+---+---+---+---+---+---+
Game result is: NONE
Sampling action from posterior policy 0.00 0.03 0.08 0.02 0.13 0.65 0.09
Action values are:  -0.943  -0.855  -0.882  -0.989  -0.821  -0.733  -0.792
Visit counts are:   1       7       20      6       33      170     23    
Selecting action 0
+---+---+---+---+---+---+---+
| . | . | . | . | . | . | . |
| . | . |

Sampling action from posterior policy 0.35 0.00 0.01 0.00 0.59 0.04 0.01
Action values are:  0.159   0.070   -0.194  0.124   0.064   -0.002
Visit counts are:   90      2       1       150     11      2     
Selecting action 0
+---+---+---+---+---+---+---+
| . | O | . | . | . | . | . |
| O | X | . | O | . | . | . |
| X | O | X | X | . | . | . |
| O | X | O | O | X | . | . |
| X | O | O | X | X | O | . |
| X | X | X | O | O | O | X |
+---+---+---+---+---+---+---+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
+---+---+---+---+---+---+---+
Game result is: NONE
Sampling action from posterior policy 0.89 0.00 0.01 0.02 0.03 0.05 0.01
Action values are:  -0.236  -0.838  -0.720  -0.650  -0.412  -0.927
Visit counts are:   225     2       5       7       12      2     
Selecting ac

Sampling action from posterior policy 0.00 0.00 0.03 0.00 0.09 0.80 0.08
Action values are:  -0.961  -0.966  -0.979  -0.976
Visit counts are:   65      222     1914    191   
Selecting action 5
+---+---+---+---+---+---+---+
| X | O | . | O | . | . | . |
| O | X | X | O | . | X | O |
| X | O | X | X | O | O | X |
| O | X | O | O | X | X | O |
| X | O | O | X | X | O | X |
| X | X | X | O | O | O | X |
+---+---+---+---+---+---+---+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
+---+---+---+---+---+---+---+
Game result is: NONE
Sampling action from posterior policy 0.00 0.00 0.00 0.00 0.52 0.37 0.11
Action values are:  0.479   1.000   0.984   0.975 
Visit counts are:   17      

In [7]:
##====================================================##
##  Check the data generated by the game is sensible  ##
##====================================================##

for inp, pos, val in zip(model_inputs, posteriors, values) :
    print(inp[:,:,0], ",  posterior="+"  ".join([f"{x:.2f}" for x in pos]), f",  value = {val[0]:.3f}")

[[0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]] ,  posterior=0.00  0.04  0.00  0.93  0.01  0.00  0.00 ,  value = -0.662
[[-1  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]] ,  posterior=0.00  0.00  0.00  0.98  0.00  0.00  0.00 ,  value = 0.669
[[ 1  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [-1  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]] ,  posterior=0.10  0.05  0.18  0.54  0.09  0.04  0.00 ,  value = -0.676
[[-1  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [-1  0  0  0  0  0]
 [ 1  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]] ,  posterior=0.00  0.00  0.82  0.16  0.00  0.00  0.00 ,  value = 0.683
[[ 1  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 1 -1  0  0  0  0]
 [-1  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]] ,  posterior=0.0
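The printed values suggest a pattern that can be checked automatically: consecutive values flip sign and grow in magnitude by a factor of `1/discount`, as expected if the final result is discounted once per move and expressed from the perspective of the player to move. This is an inference from the logged numbers, not a confirmed description of `get_training_data_from_bot_game`. The helper `check_training_data` below is a hypothetical sketch of such a check.

```python
import numpy as np

def check_training_data(posteriors, values, discount=0.99, atol=5e-3):
    """Return a dict of boolean sanity checks on one game's training data."""
    posteriors = np.asarray(posteriors, dtype=float)
    values = np.asarray(values, dtype=float).reshape(-1)
    # Each posterior should be a valid probability distribution
    valid_probs = bool(np.all(posteriors >= 0)
                       and np.allclose(posteriors.sum(axis=1), 1.0, atol=atol))
    # Assumed pattern: v[t+1] = -v[t] / discount (sign flips with the player to move)
    alternating = bool(np.allclose(values[1:], -values[:-1] / discount, atol=atol))
    # Discounted game results should stay within [-1, 1]
    bounded = bool(np.all(np.abs(values) <= 1.0 + atol))
    return {"valid_probs": valid_probs,
            "alternating_discount": alternating,
            "bounded": bounded}

# First four values printed by the cell above
logged_values = [[-0.662], [0.669], [-0.676], [0.683]]
# Illustrative, exactly normalised posteriors (NOT taken from the game)
example_posteriors = [[0, 0, 0, 1, 0, 0, 0]] * 4

checks = check_training_data(example_posteriors, logged_values)
print(checks)
```

The tolerance `atol` is set loosely here because the logged values are rounded to three decimals; with the unrounded arrays returned by the game a tighter tolerance may be appropriate.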

In [8]:
##===================================================================================================##
##  Use MCTS to search for an optimal action, and compare the prior policy/value with the posterior  ##
##===================================================================================================##

game_board = GameBoard()
bot = Bot_NeuralMCTS(model, policy_strategy=PolicyStrategy.GREEDY_POSTERIOR_POLICY)

while not game_board.get_result() :
    player = game_board.to_play
    action = bot.choose_action(game_board, duration=5, discount=0.99, debug_lvl=DebugLevel.LOW)
    print("Prior policy was :  " + "  ".join([f"{c:.2f}" for c in bot.root_node.child_priors]))
    print("Prior values were:  " + "  ".join([f"{player.value*c.prior_value:.2f}" for c in bot.root_node.children]))
    game_board.apply_action(action)
    print(game_board)


Selecting greedy action from posterior policy 0.00 0.03 0.01 0.94 0.01 0.00 0.00
Action values are:  0.014   0.005   -0.007  0.017   -0.025  -0.025  -0.087
Visit counts are:   3       24      13      825     9       1       2     
Selecting action 3
Prior policy was :  0.00  0.00  0.01  0.98  0.01  0.00  0.00
Prior values were:  0.02  0.71  0.00  -0.08  0.41  -0.03  0.02
+---+---+---+---+---+---+---+
| . | . | . | . | . | . | . |
| . | . | . | . | . | . | . |
| . | . | . | . | . | . | . |
| . | . | . | . | . | . | . |
| . | . | . | . | . | . | . |
| . | . | . | X | . | . | . |
+---+---+---+---+---+---+---+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
+---+---+---+---+---+---+---+
Game result is: NONE
Selecting greedy action from posterior policy 0.00 0.03 0.00 0.96 0.00 0.00 0.00
Action values are:  -0.494  -0.045  -0.414  0.007   -0.451  -0.945  -0.866
Visit counts are:   1       25      2       852     1       1       1     
Selecting action 3
Prior policy was :  0.00  0.00  0.01  0.97  0.

Selecting greedy action from posterior policy 0.10 0.30 0.15 0.16 0.08 0.14 0.08
Action values are:  -0.964  -0.977  -0.969  -0.966  -0.964  -0.964  -0.962
Visit counts are:   886     2687    1321    1403    752     1279    676   
Selecting action 1
Prior policy was :  0.05  0.52  0.15  0.11  0.05  0.10  0.03
Prior values were:  -0.47  -0.92  0.53  0.28  0.46  0.53  0.11
+---+---+---+---+---+---+---+
| . | . | . | . | . | . | . |
| . | . | . | O | . | . | . |
| . | . | . | O | X | . | . |
| . | . | . | X | O | . | . |
| . | O | . | O | X | . | . |
| X | X | . | X | O | . | . |
+---+---+---+---+---+---+---+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 |
+---+---+---+---+---+---+---+
Game result is: NONE
Selecting greedy action from posterior policy 0.00 0.00 1.00 0.00 0.00 0.00 0.00
Action values are:  -0.291  -0.064  1.000   -0.845  0.046   0.188   -0.331
Visit counts are:   16      25      17802