## Monte Carlo Tree Search
Monte Carlo Tree Search (MCTS) is a widely used search algorithm for complex games such as Go and chess. Combined with convolutional neural networks, MCTS was used by Google DeepMind in AlphaGo and AlphaZero. In MCTS, the computer plays a number of simulated games against itself. It estimates the expected value of each game state through trial and error, which removes the need to explore every state. The cost is that the chosen move is not guaranteed to be optimal. As the computer simulates more games, it learns which game states are favorable (more likely to result in a victory) and which are unfavorable. In this section, you will implement Monte Carlo Tree Search for tic-tac-toe.

The version you will implement is adversarial: two players, max and min, alternate moves. For additional implementation details, refer to Cameron Browne's lecture on Monte Carlo Tree Search: ccg.doc.gold.ac.uk/ccg_old/teaching/ludic_computing/ludic16.pdf
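
To make the idea of estimating state values by trial and error concrete, the sketch below averages the results of random playouts from a tic-tac-toe position. It is independent of the lab code: `winner`, `random_playout` and `estimate_value` are hypothetical helper names, and the board is assumed to be a flat list of nine cells rather than the lab's environment.

```python
import random

# The eight winning lines of a 3x3 board stored as a flat list of 9 cells.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(cells):
    """Return 'X' or 'O' if that mark completes a line, else None."""
    for a, b, c in LINES:
        if cells[a] != ' ' and cells[a] == cells[b] == cells[c]:
            return cells[a]
    return None

def random_playout(cells, to_move, rng):
    """Play uniformly random moves to the end: +1 if X wins, -1 if O wins, 0 for a draw."""
    cells = list(cells)  # copy so the caller's position is untouched
    while True:
        w = winner(cells)
        if w is not None:
            return 1 if w == 'X' else -1
        empty = [i for i, v in enumerate(cells) if v == ' ']
        if not empty:
            return 0
        cells[rng.choice(empty)] = to_move
        to_move = 'O' if to_move == 'X' else 'X'

def estimate_value(cells, to_move, num_playouts=2000, seed=0):
    """Average playout result from X's point of view."""
    rng = random.Random(seed)
    total = sum(random_playout(cells, to_move, rng) for _ in range(num_playouts))
    return total / num_playouts

# Estimated value of the empty board with X to move, under purely random play.
print(estimate_value([' '] * 9, 'X'))
```

Plain averaging like this treats every move as equally likely. MCTS improves on it by spending more playouts on the moves that look promising so far.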

### Introduction

We will be reusing an adapted version of Morgan Swanson's tic-tac-toe environment.

**Do not change any of these methods**


In [None]:
# Lab by Morgan Swanson, adapted by Rodrigo Canaan
import numpy as np
from tqdm.notebook import tqdm
from enum import Enum

class Token():
   def __init__(self, node, time) -> None:
      self.node = node
      self.time = time


class Quarter(Enum):
   Q1 = 1
   Q2 = 2
   Q3 = 3
   Q4 = 4

class Board():
   def __init__(self, quarter, tokens=None) -> None:
      # Avoid a shared mutable default: each board gets its own token list
      self.tokens = tokens if tokens is not None else []
      self.quad:Quarter = quarter


    


example_board = np.array([[' ', ' ', ' '],
                          [' ', ' ', ' '],
                          [' ', ' ', ' ']])
song_lengths = {Quarter.Q1: 222, Quarter.Q2: 1111, Quarter.Q3: 11, Quarter.Q4: 33}

def classifier(board:Board):
   # TODO: use an SVM to predict a score from the board's set of tokens; a fixed placeholder is returned for now
   return 0.2

# Returns all possible tokens for the board's quarter
def get_possible_moves(board:Board):
    moves = []
    #how t
    return moves

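# Hash function that returns an identifier for each board (Python's built-in hash, an integer based on object identity because Board defines no __hash__ or __eq__)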
def get_hash(board:Board):
   return hash(board)
  
# Returns the value of a board:
#  the classifier's probability once the board reaches the optimal length for its quarter
#  None for non-terminal states (optimal length not reached yet)
def get_score(board:Board):
    if (len(board.tokens) == song_lengths[board.quad]):  # Board stores its quarter in .quad
       return classifier(board)
    else:
        # Optimal length not reached yet
        return None

# Returns the next player to act
def next_player(player):
  if player=='max':
    return 'min'
  elif player== 'min':
    return 'max'
  else:
    raise ValueError('player must be max or min')

### Storing Simulation Results

In order to remember how good each state is, we will keep track of the results of a state in a class called Node. This class has some basic functionality to add children to a node, calculate UCB and print information about a node or the subtree starting at the node.
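
As a rough illustration of the UCB value that `Node` reports, the sketch below adds an exploration bonus to a mean score, using the same exploration term as `get_explore_term` and the same default for unvisited nodes as `get_ucb` in the cell that follows. The function name `ucb1` and its arguments are hypothetical stand-ins, not part of the lab code.

```python
import math

def ucb1(mean_score, child_count, parent_count, c=1.0, default=6.0):
    """Mean score plus an exploration bonus that shrinks as the child is visited more."""
    if child_count == 0:
        # Unvisited children get a large default value so they are tried first
        return default
    explore = c * math.sqrt(2 * math.log(parent_count) / child_count)
    return mean_score + explore

# Two children with the same mean score: the less-visited one gets the larger value.
print(ucb1(0.5, child_count=10, parent_count=100))
print(ucb1(0.5, child_count=50, parent_count=100))

# With c=0 the bonus vanishes and only the mean score matters.
print(ucb1(0.5, child_count=10, parent_count=100, c=0))
```

Larger values of `c` favor exploring rarely visited children, while `c=0` is purely greedy, which is how the final move is chosen at the end of a search.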


**Do not change any of these methods**



In [11]:
import math
import queue
import random



class Node:
  
    def __init__(self, board, parent, player='max'):
        self.score = 0.0
        self.min_wins = 0
        self.max_wins = 0
        self.board:Board = board
        self.parent = parent
        self.count = 0
        self.children = {} #
        self.player = player #
    
    def __str__(self):
      if self.parent is None:
        p = "None"
      else:
        p = get_hash(self.parent.board)
      try:
        expected_value = self.get_expected_value()
      except ValueError:
        expected_value = 0
      s = "Node {}\n with parent {}\n Count = {}\n Max Wins = {}\n Min Wins = {}\n Expected Value = {}\n UCB = {}".format(get_hash(self.board),p,self.count,self.max_wins,self.min_wins,expected_value,self.get_ucb())
      return s
      # TODO: look into self.getucb not passing c

    def add_child(self, child_board):
      key = get_hash(child_board)
      if key in self.children.keys():
        raise ValueError('child already exists')
      else:
        # list.append returns None, so build the extended copy with concatenation
        new_tokens = self.board.tokens + [child_board]
        new_board = Board(self.board.quad, new_tokens)
        self.children[key] = Node(new_board,self)
        return self.children[key]
       
    # def get_p_win(self, player):
    #     try:
    #         if player == 'min':
    #             return self.min_wins / self.count
    #         elif player == 'max':
    #             return self.max_wins / self.count
    #         else:
    #             raise ValueError('player {} must be min or max'.format(player))
    #     except ZeroDivisionError:
    #         raise ValueError('must be updated at least once \
    #                           to get win probability')

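    def get_expected_value(self):
        # backup() accumulates the total score, so the expected value is the mean per visit
        if self.count == 0:
            raise ValueError('must be updated at least once to get expected value')
        return self.score / self.count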


    def get_explore_term(self, parent, c=1):
        if self.parent is not None:
            return c * (2* math.log(parent.count) / self.count) ** (1 / 2)
        else:
            return 0 
        
        
    def get_ucb(self, c=1, default=6):
        if self.count:
            p_win = self.get_expected_value()
            explore_term = self.get_explore_term(self.parent,c)
            #TODO  can multiply with probability of adding that node aswell (MCTS paper did this)
            return p_win + explore_term
        else:
            return default

    # Prints the nodes of the subtree in breadth-first order (called by mcts_search when print_final_tree is set)
    def print_subtree(self, max_nodes = None):
      if max_nodes is None:
        max_nodes = len(self.children)+1
      print("\n\nPrinting the subtree starting from node {} up to a maximum of {} nodes\n\n".format(get_hash(self.board),max_nodes))
      q = queue.Queue()
      q.put(self)
      node_count = 0
      while not q.empty() and node_count<max_nodes:
        node_count+=1
        n = q.get()
        print(n)
        for key in n.children.keys():
          q.put(n.children[key])

### MCTS Method Stubs
The methods below implement the expand, tree_policy, best_child, default_policy and backup methods.

With the exception of expand (which is already implemented correctly), the four remaining methods are stubs with some default (incorrect) functionality. It will be your job in tasks 1, 2 and 3 to implement the correct version of these methods.
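
As a rough sketch of what the backup step does, the example below walks from a leaf to the root, incrementing visit counts and accumulating the rollout score along the way. `ToyNode` and `toy_backup` are hypothetical, simplified stand-ins for the lab's `Node` and `backup`, keeping only the fields that backup touches.

```python
class ToyNode:
    """Simplified stand-in for the lab's Node."""
    def __init__(self, parent=None):
        self.parent = parent
        self.count = 0      # number of rollouts that passed through this node
        self.score = 0.0    # sum of the rollout scores seen so far

def toy_backup(node, score):
    """Propagate one rollout result from `node` up to the root."""
    while node is not None:
        node.count += 1
        node.score += score
        node = node.parent

root = ToyNode()
child = ToyNode(parent=root)
leaf = ToyNode(parent=child)
sibling = ToyNode(parent=root)

toy_backup(leaf, 1.0)      # updates leaf, child and root
toy_backup(sibling, -1.0)  # updates sibling and root only

print(root.count, root.score)    # 2 0.0
print(child.count, child.score)  # 1 1.0
```

Because the root lies on every path, its count always equals the total number of rollouts performed.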

In [12]:
# If the node has unexpanded successors, expands a new successor and adds it to the node's children. Otherwise, returns None
# This method already works correctly, DO NOT MODIFY IT!
def expand(node, player):
  for successor in get_possible_moves(node.board):
    child = None
    try:
      child = node.add_child(successor)
    except ValueError: 
      # Guards against expanding the same child multiple times
      continue
    return child

# This method is supposed to implement tree policy, which returns the next node to expand, starting from the root node
# Right now, the method simply expands the node if it has no children, and returns a random child otherwise
# TODO: Task 1 will have you implement the correct behavior of tree_policy:
#    A) If the node is terminal (that is, if get_score does not return None for its board), return the node itself
#    B) If the node has some unexpanded children, expand it (you can check this by comparing the number of children it has to the number of successor states in get_possible_moves, or if expand returns None)
#    C) Otherwise, apply tree policy recursively to the node's best child and next player
# My implementation has 8 lines
def tree_policy(node, player):
  if get_score(node.board) is not None: #A
    return node
  else:
    if len(node.children) != len(get_possible_moves(node.board)):
      return expand(node, player) #B
    else:
      return tree_policy(best_child(node), next_player(player)) #C


# This method is supposed to return the best already-expanded successor of the current node according to the UCB formula
# Right now, it simply returns a random successor
# TODO: Task 2 will have you modify it such that it returns the best value according to the UCT formula, which you can access for each child via child.get_ucb(c)
# My fairly naive implementation has 8 lines, which can be significantly shortened with list comprehension
def best_child(node,c=1):
  # Pass c through to get_ucb so that c=0 gives a purely greedy choice
  return max(node.children.values(), key=lambda child: child.get_ucb(c))
  

# This method is supposed to increment the visit count by 1 for the current node.
# It should also increment either max_wins or min_wins if the score is positive or negative respectively. Note that the score can also be zero, in which case no win counter needs to be updated.
# Finally, this method should recursively call itself and perform the exact same updates, unless the node is the root (which you can check by node.parent being None)
# TODO: Task 2 will have you implement this method correctly.
def backup(node, score):
    # Iterative equivalent of the recursion: walk up until we pass the root
    while node is not None:
      node.count += 1
      node.score += score
      if score > 0:
        node.max_wins += 1
      elif score < 0:
        node.min_wins += 1
      node = node.parent



# This method is supposed to return the estimated value of the current node by performing rollouts until it reaches a terminal state.
#  Depending on the print_rollout_result parameter, it should also print the final board of the rollout, to help with debugging
#  Right now, it simply prints the current board and returns either 1 or -1 randomly
# TODO: Task 3 will have you implement the correct behavior for this method.
# If the node is terminal (in which case score will NOT be None), you should return the score.
# Otherwise you should list all successors with get_possible_moves and pick one randomly, and keep doing this until you hit a terminal state.
# Remember to update player to next_player(player) between calls to get_possible_moves
# Printing the final board and score is optional, but might help you debug your program.
def default_policy(node, print_rollout_result = False):
  board = node.board
  player = node.player  # local copy, so the rollout does not change the tree node
  score = get_score(board)
  while score is None:
    # Assumption: get_possible_moves returns candidate tokens (as add_child expects),
    # so applying a move means building a new board with that token appended
    token = random.choice(get_possible_moves(board))
    board = Board(board.quad, board.tokens + [token])
    player = next_player(player)
    score = get_score(board)
  if print_rollout_result:
    print(board)
    print(score)
  return score




### Running example
The code below defines the main mcts method, and an example of how to run it.
**Do NOT modify the mcts_search method** (but feel free to modify the call to it on line 16 to check the effect of its parameters).
Note that, with the default implementation of tree policy, best child, default policy and backup, this method does not do much: it simply adds a single successor to the tree and never updates its meta-information.
You can see that it tries to add the same state repeatedly thanks to the flag print_added_nodes, and that the final tree has a single node through the flag print_final_tree
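
To show how selection, expansion, rollout and backup fit together once they are all implemented, here is a minimal end-to-end sketch on a toy game that is not part of the lab: players alternately remove 1 to 3 stones from a pile, and whoever takes the last stone wins. All names (`NimNode`, `nim_ucb`, `nim_mcts` and so on) are hypothetical, and the sketch makes its own simplifying choices, such as picking the most-visited move at the end.

```python
import math
import random

MOVES = (1, 2, 3)  # toy game: remove 1-3 stones; taking the last stone wins

class NimNode:
    def __init__(self, stones, player, parent=None):
        self.stones = stones    # stones left in the pile
        self.player = player    # 'max' or 'min': who moves next
        self.parent = parent
        self.children = {}      # move -> NimNode
        self.count = 0
        self.score = 0.0        # sum of rollout results, from max's point of view

def other(player):
    return 'min' if player == 'max' else 'max'

def legal_moves(stones):
    return [m for m in MOVES if m <= stones]

def final_score(player_to_move):
    """Score of an empty pile: the player who just moved took the last stone."""
    return 1.0 if other(player_to_move) == 'max' else -1.0

def nim_ucb(child, c):
    mean = child.score / child.count
    if child.parent.player == 'min':
        mean = -mean  # min prefers low scores
    return mean + c * math.sqrt(2 * math.log(child.parent.count) / child.count)

def select_or_expand(node, c=1.0):
    """Selection and expansion: descend by UCB until a new child can be added."""
    while node.stones > 0:
        untried = [m for m in legal_moves(node.stones) if m not in node.children]
        if untried:
            move = untried[0]
            child = NimNode(node.stones - move, other(node.player), node)
            node.children[move] = child
            return child
        node = max(node.children.values(), key=lambda ch: nim_ucb(ch, c))
    return node  # terminal node: nothing to expand

def nim_rollout(node, rng):
    """Simulation: play random moves until the pile is empty."""
    stones, player = node.stones, node.player
    while stones > 0:
        stones -= rng.choice(legal_moves(stones))
        player = other(player)
    return final_score(player)

def nim_backup(node, score):
    """Backpropagation: update every node on the path to the root."""
    while node is not None:
        node.count += 1
        node.score += score
        node = node.parent

def nim_mcts(stones, num_iterations=2000, seed=0):
    rng = random.Random(seed)
    root = NimNode(stones, 'max')
    for _ in range(num_iterations):
        leaf = select_or_expand(root)
        nim_backup(leaf, nim_rollout(leaf, rng))
    # Final choice ignores the exploration bonus: pick the most-visited move
    return max(root.children, key=lambda m: root.children[m].count)

print(nim_mcts(5))
```

In this toy game a pile that is a multiple of 4 is a losing position for the player to move, so from 5 stones the strong move is to take 1. A search of this kind would be expected to settle on that move as the number of iterations grows.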

In [None]:
def mcts_search(board, player, num_iterations = 15, print_added_nodes=False, print_rollout_result = False, print_final_tree = False, nodes_to_print = None):
  start_node = Node(board,None,'max')
  for iteration in tqdm(range(num_iterations)):
    v = tree_policy(start_node, 'max')
    if print_added_nodes:
      print(v)
      print("Adding new node {} with parent {}".format(get_hash(v.board),get_hash(v.parent.board)))
    value = default_policy(v,print_rollout_result)
    backup(v,value)
  action = best_child(start_node,0).board
  if print_final_tree:
    start_node.print_subtree(nodes_to_print)
  print("Action is :\n {}".format(action))
  return action

mcts_search(example_board,'max',10, print_added_nodes = False, print_final_tree= False, nodes_to_print = float("inf"))

### Playing against your agent
Use this to play against your agent. You play as the second player ("min") by default, but this can be changed by changing line 10 from "max" to "min". With 1000 iterations, the agent should select moves in about 1 second, and should play very well after Task 3, and somewhat reasonably after Task 2.
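
The demo maps the number you type (1 to 9) to a board cell using integer division and remainder. A small sketch of that mapping, with `move_to_cell` as a hypothetical helper name that is not part of the lab code:

```python
def move_to_cell(move_number):
    """Convert a 1-based move number (1-9) to a (row, col) pair on a 3x3 board."""
    if not 1 <= move_number <= 9:
        raise ValueError("move must be between 1 and 9")
    index = move_number - 1        # the demo subtracts 1 before indexing
    return index // 3, index % 3   # row-major order: 1-3 top row, 7-9 bottom row

print(move_to_cell(1))  # (0, 0): top-left
print(move_to_cell(6))  # (1, 2): middle row, right column
print(move_to_cell(9))  # (2, 2): bottom-right
```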

In [17]:
# Starts a game against the AI Program
def run_demo():
    board = np.array([[' ', ' ', ' '],
                      [' ', ' ', ' '],
                      [' ', ' ', ' ']])
    history = {}
    score = get_score(board)
    player = "max"
    while score is None:
        if player == "max":
            board = mcts_search(board, player, num_iterations= 1000)
        else:
            move_entered = False
            while not move_entered:
                try:
                    move = int(input('Choose a move...')) - 1
                    if not 0 <= move <= 8:
                        print("Enter an integer between 1 and 9.\n")
                        continue
                    elif not board[move//3][move%3] == ' ':
                        print("That spot is already taken.\n")
                        continue
                    else:
                        board[move//3][move%3]= 'O'
                        move_entered = True
                except ValueError:
                    print("Enter an integer.\n")
        score = get_score(board)
        player = "min" if player == "max" else "max"
        print(board)
    if (score == 0):
        print("Draw")
    elif (score > 0):
        print("You Lose")
    else:
        print("You Win")
      
      
run_demo()

Action is :
 [[' ' ' ' ' ']
 [' ' ' ' ' ']
 ['X' ' ' ' ']]
[[' ' ' ' ' ']
 [' ' ' ' ' ']
 ['X' ' ' ' ']]
Enter an integer.

[['O' ' ' ' ']
 [' ' ' ' ' ']
 ['X' ' ' ' ']]


Action is :
 [['O' ' ' ' ']
 [' ' 'X' ' ']
 ['X' ' ' ' ']]
[['O' ' ' ' ']
 [' ' 'X' ' ']
 ['X' ' ' ' ']]
[['O' ' ' 'O']
 [' ' 'X' ' ']
 ['X' ' ' ' ']]


Action is :
 [['O' 'X' 'O']
 [' ' 'X' ' ']
 ['X' ' ' ' ']]
[['O' 'X' 'O']
 [' ' 'X' ' ']
 ['X' ' ' ' ']]
[['O' 'X' 'O']
 [' ' 'X' ' ']
 ['X' 'O' ' ']]


Action is :
 [['O' 'X' 'O']
 [' ' 'X' 'X']
 ['X' 'O' ' ']]
[['O' 'X' 'O']
 [' ' 'X' 'X']
 ['X' 'O' ' ']]
Enter an integer.

[['O' 'X' 'O']
 ['O' 'X' 'X']
 ['X' 'O' ' ']]


Action is :
 [['O' 'X' 'O']
 ['O' 'X' 'X']
 ['X' 'O' 'X']]
[['O' 'X' 'O']
 ['O' 'X' 'X']
 ['X' 'O' 'X']]
Draw

