## Tic-Tac-Toe with Deep-Q Learning
#### ELEC-ENG 473: Final Project
Yemi Kelani

---

##### **Background and Motivation**
In Reinforcement Learning, agents learn to make decisions by interacting with their environments. Typically, an environment will produce an observation (also referred to as a state), the agent will take an action based on the observation, and the environment will supply the agent with a reward (either positive, neutral, or negative). Over time, the agent will learn to take actions that help it maximize its potential rewards. Choosing the action currently believed to be best is called exploitation. Alternatively, the agent will sometimes opt to make random choices in order to discover better actions. This is known as exploration.
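
The trade-off between the two can be sketched as an epsilon-greedy rule with multiplicative decay. This is a simplified, plain-Python illustration rather than the agent's actual code: the function names are hypothetical, and the default constants (a decay rate of 0.999 and a floor of 0.001 times the starting epsilon) are assumed to mirror the values set in the `DeepQAgent` class later in this notebook.

```python
import random

def select_action(q_values, valid_actions, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit the best valid action."""
    if rng.random() < epsilon:
        return rng.choice(valid_actions)                  # exploration
    return max(valid_actions, key=lambda a: q_values[a])  # exploitation

def steps_until_floor(epsilon=1.0, decay=0.999, floor_ratio=0.001):
    """Count how many multiplicative decay steps it takes epsilon to reach its floor."""
    floor = floor_ratio * epsilon
    steps = 0
    while epsilon > floor:
        epsilon *= decay
        steps += 1
    return steps

# With epsilon = 0 the choice is purely greedy among the valid actions.
print(select_action([0.1, 0.9, 0.5], valid_actions=[0, 2], epsilon=0.0))
print(steps_until_floor())
```

With these constants, epsilon needs several thousand decay steps to reach its floor, so the agent keeps exploring well into training.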

In Tic-Tac-Toe, the game board contains 9 separate cells, with each cell holding an "X", an "O", or an empty space. This means there are 19,683 (3 ^ 9) possible states. (Note: states and game sequences are different; there are well over 200,000 ways a game can be played.) Here, the term state refers to any possible configuration of the aforementioned values on the game board. Given that not all of these states are valid configurations of the game board, there are strictly fewer than 19,683 valid states in the search space. A state is considered valid if the number of "X"s equals the number of "O"s or exceeds it by exactly one; we assume that "X" always takes the first turn.
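
The counting argument above can be checked by brute force. The sketch below enumerates every board and keeps those that satisfy the balance rule; `is_balanced` is a hypothetical helper, and the rule is interpreted as "the X count minus the O count is 0 or 1". Note that this is still an upper bound on reachable positions, since some balanced boards (for example, ones where both players have three in a row) can never occur in a real game.

```python
from itertools import product

def is_balanced(board):
    """True if X has played as many times as O, or exactly once more."""
    x_count = board.count("X")
    o_count = board.count("O")
    return x_count - o_count in (0, 1)

# Every assignment of "X", "O", or empty to the 9 cells.
all_boards = list(product("XO ", repeat=9))
balanced = [b for b in all_boards if is_balanced(b)]

print(len(all_boards))  # 19683
print(len(balanced))    # 6046
```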

##### **Project Goals and Objectives**
Reinforcement Learning is a powerful tool for training agents to optimize actions within a given environment. In order for agents to maximize the value of their actions, they need some sort of mechanism for determining the values of states and/or state-action pairs. Deep-Q Learning is a non-tabular method of approximating these values. With this project, I aim to familiarize myself with the core principles of Deep-Q Learning by developing a Tic-Tac-Toe learning system from scratch.
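
For reference, the one-step target that the `replay` method of `DeepQAgent` (defined later in this notebook) regresses toward can be written as follows. The notation is introduced here for illustration: $Q_\theta$ is the network with parameters $\theta$, $s$ and $a$ are the current state and action, $r$ is the reward, $s'$ is the next state, and $\gamma$ is the discount factor (`GAMMA`).

$$
y =
\begin{cases}
r & \text{if the episode ended after taking } a \\
r + \gamma \max_{a'} Q_\theta(s', a') & \text{otherwise}
\end{cases}
$$

The network is then nudged so that $Q_\theta(s, a)$ moves toward $y$, using the Huber loss chosen in `supply_model`.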

---

In [27]:
import torch
import torch.nn as nn

# torch config
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

import random
import numpy as np
from tqdm import tqdm
from pathlib import Path
from collections import deque
from datetime import datetime

In [28]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


##### **Method**
I created two classes for this project: a `TicTacToeGame` class, which serves as the environment for the game board, and a `DeepQAgent` class, which serves as the agent. Implementations for both classes can be viewed below.

I trained four agents, each for a different number of episodes and against a different combination of opponent levels. Disregarding failed training sessions, I trained them for a total of 14,000 episodes. I used a hidden layer size of 100. The rest of the parameters used can be viewed below.

Note that the `opponent_move` method in the `TicTacToeGame` class has two levels: *easy* and *hard*. The easy mode randomly selects a valid next move for the opponent, while the hard mode selects the optimal valid move using the `minmax` algorithm. The issue with hard mode is that it is inefficient and slow at the start of the game. As moves are made, it gets faster (because the search space shrinks). To combat this issue, if hard mode is selected, the opponent plays naively for its first one or two moves (in the code below, whenever more than seven cells are still empty) and then, once the search space is computationally acceptable, makes the optimal moves.
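
One possible way to reduce the early-game cost, which I did not use in this project, is to cache search results so that each position is evaluated only once. The sketch below is an alternative formulation (negamax with `functools.lru_cache`) on a flat tuple board, not the `minmax` method of `TicTacToeGame`; the names `winner`, `minimax`, and `best_move` are hypothetical. It keeps the same encoding as the class below: -1 for "X", 1 for "O", 0 for empty.

```python
from functools import lru_cache

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return -1 or 1 if that player has three in a row, else 0."""
    for a, b, c in WIN_LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

@lru_cache(maxsize=None)
def minimax(board, to_move):
    """Value of `board` for the player about to move: 1 win, 0 tie, -1 loss."""
    w = winner(board)
    if w != 0:
        return 1 if w == to_move else -1
    if 0 not in board:
        return 0  # full board, no winner: tie
    best = -1
    for i in range(9):
        if board[i] == 0:
            child = board[:i] + (to_move,) + board[i + 1:]
            # The opponent's best outcome is our worst, hence the negation.
            best = max(best, -minimax(child, -to_move))
    return best

def best_move(board, player):
    """Pick the empty cell whose resulting position is best for `player`."""
    moves = [i for i in range(9) if board[i] == 0]
    return max(moves, key=lambda i: -minimax(board[:i] + (player,) + board[i + 1:], -player))

print(minimax((0,) * 9, -1))                         # 0: optimal play ends in a tie
print(best_move((-1, -1, 0, 1, 1, 0, 0, 0, 0), -1))  # 2: X completes the top row
```

Because boards are hashable tuples, repeated positions reached through different move orders are looked up rather than searched again.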

---

In [57]:
class TicTacToeGame():
  def __init__(self, device, opponent_level="easy"):
    self.device = device
    self.positions = {-1: "X", 1: "O", 0: " "}
    self.opponent_level = opponent_level
    self.board = torch.zeros((3, 3))
    self.player1 = -1   # the agent
    self.player2 = 1    # the opponent
    self.winner = None
    self.minmaxWinner = None

  def get_state(self):
    return torch.clone(self.board).to(self.device)
  
  def take_action(self, action):
    i = action % 3
    j = action // 3
    if self.is_valid_move(i, j):
      self.board[j][i] = self.player1
      reward, done = self.is_game_over()

      if not done:
        self.opponent_move()
        reward, done = self.is_game_over()
    else:
      reward, done = -10, True

    next_state = torch.clone(self.board).to(self.device)
    return next_state, reward, done

  def is_valid_move(self, i, j):
    # check if the cell is empty
    if self.board[j][i] != 0:
      print(f"action {(3*j)+i} is not valid.")
      self.print_board()
      return False

    # check to see if the Xs and Os are balanced
    board_sum = self.board.sum() + self.player1
    balanced = board_sum >= -1 and board_sum <= 1
    if not balanced:
      print("board is unbalanced.")
      self.print_board()

    return balanced

  def get_valid_moves(self, board=None):
    board = self.board if board is None else board
    valid_moves, indicies = [], []
    mask = torch.zeros((9,)).to(self.device)
    for i in range(3):
      for j in range(3):
        if board[j][i] == 0:
          valid_moves.append((j, i))
          index = (3 * j) + i
          indicies.append(index)
          mask[index] = 1

    return valid_moves, mask, indicies

  def opponent_move(self):
    # with torch.no_grad():
    valid_moves, _, _ = self.get_valid_moves()

    force_naive_move = True
    num_moves = len(valid_moves)
    if num_moves == 0:
      return
    elif num_moves <= 7:
      force_naive_move = False

    if self.opponent_level == "easy" or force_naive_move:
      # random choice
      move = valid_moves[np.random.choice(num_moves)]
    elif self.opponent_level == "hard":
      # optimal choice
      scores = []
      board = torch.clone(self.board).detach()
      moves, _, _ = self.get_valid_moves(board=board)
      for move in moves:
        board[move[0]][move[1]] = self.player2 
        scores.append(self.minmax(self.player2, self.player1, board))
        board[move[0]][move[1]] = 0

      move = moves[np.argmax(scores)]
    
    self.board[move[0]][move[1]] = self.player2
  
  def minmax(self, role, player, board):
    _, done = self.is_game_over(board)
    if done:
      # NOTE: minmax rewards != Q value rewards
      if self.minmaxWinner is not None:
        # return 1 if role wins else return -1
        # -1 *  1 = -1 [role "X" loses to "O"]
        # -1 * -1 =  1 [role "X" beats op "O"]
        #  1 *  1 =  1 [role "O" beats op "X"]
        #  1 * -1 = -1 [role "O" loses to "X"]
        return role * self.minmaxWinner

      # return 0 in case of a tie
      return 0

    scores = []
    moves, _, _ = self.get_valid_moves(board=board)
    next_player = self.player1 if player == self.player2 else self.player2
    for move in moves:
      board[move[0]][move[1]] = player                     # make move
      scores.append(self.minmax(role, next_player, board)) # score move
      board[move[0]][move[1]] = 0                          # revert board

    return max(scores) if player == role else min(scores)

  def is_game_over(self, board=None):
    revertWinner = board is not None
    board = self.board if board is None else board

    # check rows and columns
    for i in range(self.board.size()[0]):
      if board[i][0] == board[i][1] == board[i][2] != 0:
        self.winner = board[i][0]
      elif board[0][i] == board[1][i] == board[2][i] != 0:
        self.winner = board[0][i]
    
    # check diagonals
    if board[0][0] == board[1][1] == board[2][2] != 0:
      self.winner = board[0][0]
    elif board[0][2] == board[1][1] == board[2][0] != 0:
      self.winner = board[0][2]

    # check if there's a tie
    tie = torch.where(board != 0, 1.0, 0.0).sum() == 9

    reward = -1 if tie else 0
    if self.winner is not None:
      if self.winner == self.player1:
        reward = 1  # reward for winning
      else:
        reward = -2 # reward for losing

    done = self.winner is not None or tie

    # if this function is being used for minmax, 
    # revert the stored winner to None
    if revertWinner:
      self.minmaxWinner = self.winner
      self.winner = None

    return reward, done

  def reset(self):
    self.board = torch.zeros((3, 3))
    self.player1 = -1   # the agent
    self.player2 = 1    # the opponent
    self.winner = None
    self.minmaxWinner = None

  def print_board(self, board=None):
    board = self.board if board is None else board
    print("_______")
    for a in range(board.size()[0]):
      row = r""
      for b in range(board.size()[1]):
        row += "|" + self.positions[board[a][b].item()]
        row += "|" if b == board.size()[1] - 1 else ""
      print(row)
    print("‾‾‾‾‾‾‾")

In [58]:
class DeepQAgent(nn.Module):
  def __init__(
      self,
      device,
      epsilon:float, 
      gamma:float,
      state_space:int, 
      action_space:int, 
      hidden_size:int = 100,
      dropout:float = 0.15,
      train_start:int = 50,
      batch_size:int = 64,
  ):
    
    super(DeepQAgent, self).__init__()

    self.epsilon = epsilon
    self.epsilon_min = 0.001 * epsilon
    self.epsilon_decay_rate = 0.999
    self.gamma = gamma
    self.state_space = state_space
    self.action_space = action_space
    self.train_start = train_start
    self.batch_size = batch_size
    self.memory = deque(maxlen=2000)
    self.device = device
    
    self.model = nn.Sequential(
        nn.Linear(state_space, hidden_size),
        nn.Dropout(dropout),
        nn.Tanh(),
        nn.Linear(hidden_size, hidden_size),
        nn.Dropout(dropout),
        nn.Tanh(),
        nn.Linear(hidden_size, action_space)
    )

  def forward(self, state):
    return self.model(torch.flatten(state))

  def select_action(self, state, mask, indicies):
    # select action greedily with respect to state
    # or select randomly from uniform distribution
    if np.random.rand() <= self.epsilon:
      return np.random.choice(indicies)
    else:
      q_values = self.forward(state)
      q_masked = torch.where(mask != 0, q_values, -1000)
      return torch.argmax(q_masked)

  def decay_epsilon(self):
    if len(self.memory) > self.train_start:
      if self.epsilon > self.epsilon_min:
        self.epsilon *= self.epsilon_decay_rate

  def remember(self, state, action, reward, next_state, done):
    self.memory.append((state, action, reward, next_state, done))
  
  def replay(self, optimizer, criterion):
    if len(self.memory) > self.train_start:

      batch = random.sample(self.memory, min(len(self.memory), self.batch_size))
      for state, action, reward, next_state, done in batch:

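        # build the TD target without tracking gradients, so the
        # loss only back-propagates through the prediction below
        with torch.no_grad():
          q_target = self.forward(state)
          if done:
            q_target[action] = reward
          else:
            q_values_next = self.forward(next_state)
            q_target[action] = reward + (self.gamma * torch.max(q_values_next))

        # optimize the model
        loss = criterion(self.forward(state), q_target)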
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
      
      self.decay_epsilon()
  
  def save_model(self, path):
    torch.save(self.state_dict(), path)
    print(f"Model saved to {path}")

In [61]:
# DeepQ parameters
BATCH_SIZE    = 64
NUM_EPISODES  = 5000 if torch.cuda.is_available() else 100
STATE_SPACE   = 9
ACTION_SPACE  = 9
EPSILON       = 1.0
GAMMA         = 0.95
HIDDEN_SIZE   = 100
LEARNING_RATE = 0.001
DROPOUT       = 0.15
TRAIN_START   = 1000

# save path
MODEL_PATH = "/content/drive/MyDrive/EE 473/TicTacToeAgent.pt"

# helper functions
def supply_model(overwrite:bool=False):
  model = DeepQAgent(
      device        = DEVICE,
      epsilon       = EPSILON, 
      gamma         = GAMMA,
      state_space   = STATE_SPACE, 
      action_space  = ACTION_SPACE, 
      hidden_size   = HIDDEN_SIZE,
      dropout       = DROPOUT,
      train_start   = TRAIN_START,
      batch_size    = BATCH_SIZE
  )

  if Path(MODEL_PATH).exists() and not overwrite:
    # load saved parameters
    print("Loading Model Parameters...")
    # map_location lets a checkpoint saved on GPU load on a CPU-only runtime
    model.load_state_dict(torch.load(MODEL_PATH, map_location=DEVICE))
  else:
    # initialize parameters
    print("Initializing Model Parameters...")
    for p in model.parameters():
      if p.dim() > 1:
          nn.init.xavier_uniform_(p)
  
  model = model.to(DEVICE)

  # create optimizer and criterion
  optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
  criterion = nn.SmoothL1Loss() # Huber Loss

  return model, optimizer, criterion

def train_agent(
    agent,
    num_episodes: int,
    epsilon,
    optimizer,
    criterion,
    device,
    save_path:str = "DeepQAgent.pt",
    opponent_level:str = "easy"
):
  reward_history = []
  environment = TicTacToeGame(device, opponent_level)
  for episode in range(num_episodes):
    state = environment.get_state()
    reward_total, done = 0, False
    steps = 0

    while not done:
      _, mask, indicies = environment.get_valid_moves()
      action = agent.select_action(state, mask, indicies)
      next_state, reward, done = environment.take_action(action)
      reward_total += reward
      agent.remember(state, action, reward, next_state, done)
      state = next_state.to(device) if next_state is not None else None
      agent.replay(optimizer, criterion)
      steps += 1

    timestampStr = datetime.now().strftime("%H:%M:%S")
    print(
        "episode: {}/{}, steps: {}, reward_total: {}, e: {:.2}, time: {}"
        .format(episode+1, num_episodes, steps, reward_total, agent.epsilon, timestampStr)
    )

    reward_history.append(reward_total)
    environment.print_board()
    environment.reset()
  
  agent.save_model(save_path)

def test(model, num_episodes, device, opponent_level="easy"):
  model.eval()
  with torch.no_grad():
    agent = model
    environment = TicTacToeGame(device, opponent_level)
    games_won = 0
    games_drawn = 0
    games_lost = 0
    for episode in tqdm(range(num_episodes)):
      state = environment.get_state()
      done = False
      steps = 0

      while not done:
        _, mask, _ = environment.get_valid_moves()
        q_values = agent.forward(state)
        q_masked = torch.where(mask != 0, q_values, -1000)
        action = torch.argmax(q_masked)
        next_state, reward, done = environment.take_action(action)

        steps += 1
        if done:
          # print(
          #   "episode: {}/{}, steps: {}, reward: {}"
          #   .format(episode+1, num_episodes, steps, reward)
          # )

          if reward == 1:
            games_won += 1
          elif reward == -1:
            games_drawn += 1
          elif reward == -2:
            games_lost += 1
          
          break
      
      # environment.print_board()
      environment.reset()
    
    print()
    print(f"Win rate: {round((games_won/num_episodes) * 100, 4)}%")
    print(f"Draw rate: {round((games_drawn/num_episodes) * 100, 4)}%")
    print(f"Loss rate: {round((games_lost/num_episodes) * 100, 4)}%")

#### **Results**
**Baseline: Untrained Model** \
Versus EASY opponent: Win rate: 50.42%, Draw rate: 22.52%, Loss rate: 27.06% \
Versus HARD opponent: Win rate: 0.0%, Draw rate: 22.5%, Loss rate: 77.5% 

**Model 1: Naive Agent** - 6,000 episodes vs EASY \
Versus EASY opponent: Win rate: 84.72%, Draw rate: 7.03%, Loss rate: 8.25% \
Versus HARD opponent: Win rate: 24.6%, Draw rate: 0.0%, Loss rate: 75.4% 

**Model 2: Mixed Agent 1** - 11,000 episodes (6,000 vs EASY, 5,000 vs HARD) \
Versus EASY opponent: Win rate: 76.86%, Draw rate: 8.01%, Loss rate: 15.13% \
Versus HARD opponent: Win rate: 48.6%, Draw rate: 26.6%, Loss rate: 24.8% 

**Model 3: HARD Agent** - 2,000 episodes vs HARD \
Versus EASY opponent: Win rate: 75.55%, Draw rate: 12.89%, Loss rate: 11.56% \
Versus HARD opponent: Win rate: 25.2%, Draw rate: 25.8%, Loss rate: 49.0% 

**Model 4: Mixed Agent 2** - 3,000 episodes (2,000 vs HARD, 1,000 vs EASY) \
Versus EASY opponent: Win rate: 53.3%, Draw rate: 6.95%, Loss rate: 39.75% \
Versus HARD opponent: Win rate: 0.0%, Draw rate: 13.6%, Loss rate: 86.4% 

#### **Obstacles**
Does a naive opponent make for a dull model? The answer is still unclear. I trained my agent against a naive opponent that made its moves randomly by selecting from a list of valid next moves. While the agent did learn to beat the opponent most of the time, it seemed to really love diagonals and had a hard time recognizing better moves. This led me to create an optimal opponent (hard mode).

As I mentioned, the state space contains, in total, 19,683 states. This is a large space to search, but relatively feasible compared to that of something like Chess or Go. I assumed that my agent would be able to learn not to take actions that led to invalid states if I punished them with a large negative reward (i.e. -10). I was sorely mistaken. I trained the model for about 15,000 games with little to no improvement in output. Eventually, I opted to constrain the agent's actions to only valid moves using a mask. Only then did I begin to see massive improvements in performance. I suspect the first method requires significantly more games to be played as well as a robust opponent to play against in order to make progress.
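
The masking idea can be illustrated in isolation. This is a plain-Python sketch rather than the PyTorch code used in `select_action`; `masked_argmax` is a hypothetical helper. The notebook fills invalid entries with -1000, while the sketch swaps in negative infinity, which is an assumption made here so that an invalid move can never win the argmax regardless of how low the valid Q-values are.

```python
def masked_argmax(q_values, mask):
    """Return the index of the largest Q-value among cells where mask is nonzero."""
    # Invalid cells get -inf so they can never be selected over a valid cell.
    scores = [q if m else float("-inf") for q, m in zip(q_values, mask)]
    return max(range(len(scores)), key=scores.__getitem__)

q_values = [0.5, 2.0, -0.3]
mask = [1, 0, 1]  # the middle action is occupied, so it is masked out
print(masked_argmax(q_values, mask))  # 0
```

With the mask in place, the agent never receives the invalid-move penalty, so every stored transition reflects an actual game outcome.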

#### **Conclusion**
Surprisingly, the Naive Agent performed best against the easy opponent with a win rate of 84.72%. Mixed Agent 1 had the second-best performance with a win rate of 76.86%. It also led in performance against the hard opponent with a win rate of 48.6%. For comparison, an untrained model had a win rate of 0.0% against the hard opponent, while the Naive Agent reached roughly half of Mixed Agent 1's performance with a win rate of 24.6%. It seems that a combination of training against the easy and hard opponents can produce strong results when done in that order. Conversely, Mixed Agent 2, which trained in the opposite order, produced worse results, with a win rate of 0.0% against the hard opponent.

As for next steps, I would test out more training configurations and even compare this to a tabular method.
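
As a pointer for that comparison, a tabular method would store one value per state-action pair instead of approximating them with a network. The following is a minimal sketch of a single tabular Q-learning update, not something implemented in this project: `q_update`, the learning rate `alpha`, and the dictionary-based table are hypothetical, while `gamma=0.95` mirrors the `GAMMA` value used above.

```python
from collections import defaultdict

def q_update(table, state, action, reward, next_state, done, alpha=0.1, gamma=0.95):
    """Apply one Q-learning update in place and return the new Q(state, action)."""
    best_next = 0.0 if done else max(table[next_state])
    target = reward + gamma * best_next
    table[state][action] += alpha * (target - table[state][action])
    return table[state][action]

# One row of 9 action values per board; boards are hashable tuples.
table = defaultdict(lambda: [0.0] * 9)

empty = (0,) * 9
after_center = (0, 0, 0, 0, -1, 0, 0, 0, 0)
new_value = q_update(table, empty, 4, 1.0, after_center, done=True)
print(new_value)
```

With only a few thousand valid boards, such a table would fit comfortably in memory.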

##### **Appendix**
You can see the results for training against an easy opponent and a hard opponent here.

---

In [32]:
model, optimizer, criterion = supply_model(overwrite=False)

Loading Model Parameters...


In [33]:
train_agent(
    model,
    num_episodes    = NUM_EPISODES,
    epsilon         = EPSILON,
    optimizer       = optimizer,
    criterion       = criterion,
    device          = DEVICE,
    save_path       = MODEL_PATH,
    opponent_level  = "hard"
)

Streaming output truncated to the last 5000 lines.
‾‾‾‾‾‾‾
episode: 4168/5000, steps: 5, reward_total: -1, e: 0.001, time: 00:42:40
_______
|O|X|O|
|O|X|X|
|X|O|X|
‾‾‾‾‾‾‾
episode: 4169/5000, steps: 5, reward_total: -1, e: 0.001, time: 00:42:42
_______
|O|X|X|
|X|O|O|
|X|O|X|
‾‾‾‾‾‾‾
episode: 4170/5000, steps: 3, reward_total: 1, e: 0.001, time: 00:42:43
_______
|O|O|X|
|O|X| |
|X| | |
‾‾‾‾‾‾‾
episode: 4171/5000, steps: 3, reward_total: 1, e: 0.001, time: 00:42:43
_______
|O|O|X|
|O|X| |
|X| | |
‾‾‾‾‾‾‾
episode: 4172/5000, steps: 4, reward_total: -2, e: 0.001, time: 00:42:45
_______
|O|O|X|
|X|O|X|
|X|O| |
‾‾‾‾‾‾‾
episode: 4173/5000, steps: 4, reward_total: -2, e: 0.001, time: 00:42:46
_______
|X|O|X|
|O|O|X|
|X|O| |
‾‾‾‾‾‾‾
episode: 4174/5000, steps: 3, reward_total: 1, e: 0.001, time: 00:42:48
_______
|O| |X|
|O|X|O|
|X| | |
‾‾‾‾‾‾‾
episode: 4175/5000, steps: 3, reward_total: 1, e: 0.001, time: 00:42:49
_______
|O|O|X|
|O|X| |
|X| | |
‾‾‾‾‾‾‾
episode: 4176/5000, steps: 

In [34]:
test(model, 10000, DEVICE, opponent_level="easy")

100%|██████████| 10000/10000 [00:54<00:00, 182.15it/s]


Win rate: 76.86%
Draw rate: 8.01%
Loss rate: 15.13%





In [36]:
test(model, 1000, DEVICE, opponent_level="hard")

100%|██████████| 1000/1000 [06:45<00:00,  2.47it/s]


Win rate: 48.6%
Draw rate: 26.6%
Loss rate: 24.8%





In [40]:
print("Untrained Model Results:")

Untrained Model Results:


In [38]:
# For comparison, this is how an untrained model
# performs against a naive opponent
test_model, _, _ = supply_model(overwrite=True)
test(test_model, 10000, DEVICE, opponent_level="easy")

Initializing Model Parameters...


100%|██████████| 10000/10000 [01:09<00:00, 143.77it/s]


Win rate: 50.42%
Draw rate: 22.52%
Loss rate: 27.06%





In [39]:
# untrained model versus HARD opponent
test_model, _, _ = supply_model(overwrite=True)
test(test_model, 1000, DEVICE, opponent_level="hard")

Initializing Model Parameters...


100%|██████████| 1000/1000 [09:14<00:00,  1.80it/s]


Win rate: 0.0%
Draw rate: 22.5%
Loss rate: 77.5%





In [42]:
def test_saved_model(opponent_level="easy"):
  test_model, _, _ = supply_model(overwrite=False)
  num_episodes = 10000 if opponent_level == "easy" else 1000
  test(test_model, num_episodes, DEVICE, opponent_level)

MODEL_PATH="/content/drive/MyDrive/EE 473/TicTacToeAgent-6000-episodes-naive.pt"
print("Naive Opponent Model (6000 episodes) Results:")
test_saved_model(opponent_level="easy")
test_saved_model(opponent_level="hard")

Naive Opponent Model (6000 episodes) Results:
Loading Model Parameters...


100%|██████████| 10000/10000 [00:48<00:00, 205.77it/s]



Win rate: 84.72%
Draw rate: 7.03%
Loss rate: 8.25%
Loading Model Parameters...


100%|██████████| 1000/1000 [06:51<00:00,  2.43it/s]


Win rate: 24.6%
Draw rate: 0.0%
Loss rate: 75.4%





In [43]:
model, optimizer, criterion = supply_model(overwrite=False)

Initializing Model Parameters...


In [48]:
train_agent(
    model,
    num_episodes    = 1000,
    epsilon         = EPSILON,
    optimizer       = optimizer,
    criterion       = criterion,
    device          = DEVICE,
    save_path       = MODEL_PATH,
    opponent_level  = "hard"
)

Streaming output truncated to the last 5000 lines.
‾‾‾‾‾‾‾
episode: 168/1000, steps: 3, reward_total: 1, e: 0.034, time: 01:54:21
_______
|O|O|X|
|O|X| |
|X| | |
‾‾‾‾‾‾‾
episode: 169/1000, steps: 3, reward_total: -2, e: 0.034, time: 01:54:22
_______
|O|O|O|
|X|X| |
| | |X|
‾‾‾‾‾‾‾
episode: 170/1000, steps: 3, reward_total: -2, e: 0.034, time: 01:54:23
_______
| | |O|
|X|X|O|
|X| |O|
‾‾‾‾‾‾‾
episode: 171/1000, steps: 5, reward_total: -1, e: 0.034, time: 01:54:24
_______
|X|O|X|
|O|X|X|
|O|X|O|
‾‾‾‾‾‾‾
episode: 172/1000, steps: 3, reward_total: 1, e: 0.034, time: 01:54:25
_______
|O| |X|
|O|X| |
|X|O| |
‾‾‾‾‾‾‾
episode: 173/1000, steps: 3, reward_total: -2, e: 0.034, time: 01:54:26
_______
|X| |X|
| |X| |
|O|O|O|
‾‾‾‾‾‾‾
episode: 174/1000, steps: 3, reward_total: 1, e: 0.034, time: 01:54:27
_______
|O|O|X|
|O|X| |
|X| | |
‾‾‾‾‾‾‾
episode: 175/1000, steps: 3, reward_total: 1, e: 0.034, time: 01:54:28
_______
|O|O|X|
|O|X| |
|X| | |
‾‾‾‾‾‾‾
episode: 176/1000, steps: 3, reward

In [49]:
test(model, 10000, DEVICE, opponent_level="easy")

100%|██████████| 10000/10000 [00:56<00:00, 175.82it/s]


Win rate: 75.55%
Draw rate: 12.89%
Loss rate: 11.56%





In [50]:
test(model, 1000, DEVICE, opponent_level="hard")

100%|██████████| 1000/1000 [06:35<00:00,  2.53it/s]


Win rate: 25.2%
Draw rate: 25.8%
Loss rate: 49.0%





In [51]:
model.epsilon = 1.0
train_agent(
    model,
    num_episodes    = 1000,
    epsilon         = EPSILON,
    optimizer       = optimizer,
    criterion       = criterion,
    device          = DEVICE,
    save_path       = MODEL_PATH,
    opponent_level  = "easy"
)

Streaming output truncated to the last 5000 lines.
‾‾‾‾‾‾‾
episode: 168/1000, steps: 3, reward_total: -2, e: 0.49, time: 02:18:52
_______
| |O| |
|X|O| |
|X|O|X|
‾‾‾‾‾‾‾
episode: 169/1000, steps: 5, reward_total: 1, e: 0.49, time: 02:18:53
_______
|X|O|O|
|O|X|X|
|O|X|X|
‾‾‾‾‾‾‾
episode: 170/1000, steps: 4, reward_total: 1, e: 0.49, time: 02:18:53
_______
|X|O|X|
| |X|O|
|X|O|O|
‾‾‾‾‾‾‾
episode: 171/1000, steps: 5, reward_total: 1, e: 0.49, time: 02:18:54
_______
|X|O|O|
|X|X|X|
|O|O|X|
‾‾‾‾‾‾‾
episode: 172/1000, steps: 4, reward_total: -2, e: 0.48, time: 02:18:55
_______
|X|O|X|
|O|O| |
|X|O|X|
‾‾‾‾‾‾‾
episode: 173/1000, steps: 5, reward_total: 1, e: 0.48, time: 02:18:55
_______
|X|O|O|
|X|X|X|
|O|X|O|
‾‾‾‾‾‾‾
episode: 174/1000, steps: 4, reward_total: 1, e: 0.48, time: 02:18:56
_______
|O|X|O|
|O|X|O|
|X|X| |
‾‾‾‾‾‾‾
episode: 175/1000, steps: 4, reward_total: 1, e: 0.48, time: 02:18:57
_______
|X|O|O|
|X|X| |
|O|O|X|
‾‾‾‾‾‾‾
episode: 176/1000, steps: 4, reward_total: 1,

In [63]:
MODEL_PATH="/content/drive/MyDrive/EE 473/TicTacToeAgent-6000-episodes-naive.pt"
print("Mixed Opponent Model (3000 episodes) Results:")
test_saved_model(opponent_level="easy")
test_saved_model(opponent_level="hard")

Mixed Opponent Model (3000 episodes) Results:
Loading Model Parameters...


100%|██████████| 10000/10000 [01:04<00:00, 154.15it/s]



Win rate: 53.3%
Draw rate: 6.95%
Loss rate: 39.75%
Loading Model Parameters...


100%|██████████| 1000/1000 [09:04<00:00,  1.84it/s]


Win rate: 0.0%
Draw rate: 13.6%
Loss rate: 86.4%



