<a name="1"></a>
## 1 - Import Packages

We'll make use of the following packages:
- `numpy` is a package for scientific computing in Python.
- `deque` will be our data structure for our memory buffer.
- `namedtuple` will be used to store the experience tuples.
- `copy` and `random` are used to copy the game board and to sample moves and mini-batches.
- `matplotlib` and `pandas` are used to plot the training history.
- `time` is used to measure how long the training takes.
- We will use several modules from the `tensorflow.keras` framework for building deep learning models.

Run the cell below to import all the necessary packages.

In [None]:
import time
from collections import deque, namedtuple
import numpy as np
import tensorflow as tf
import copy
import random
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import pandas as pd

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.losses import MSE
from tensorflow.keras.optimizers import Adam

<a name="2"></a>
## 2 - Hyperparameters

Run the cell below to set the hyperparameters.

In [None]:
MEMORY_SIZE = 100_000     # size of memory buffer
GAMMA = 0.9              # discount factor
ALPHA = 1e-3              # learning rate  
NUM_STEPS_FOR_UPDATE = 5  # perform a learning update every C time steps
X = "X"
O = "O"
EMPTY = None
board_size = 5
state_size = 25
action_size = 25
# Store experiences as named tuples
experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])


MINIBATCH_SIZE = 64   # mini-batch size
TAU = 1e-3            # soft update parameter
E_DECAY = 0.999       # ε decay rate for ε-greedy policy
E_MIN = 0.05          # minimum ε value for ε-greedy policy
BOARD_SIZE = 5
ALL_ACTIONS = [(r, c) for r in range(BOARD_SIZE) for c in range(BOARD_SIZE)]
SELF_RACIO = 0.4
# Set the random seed for TensorFlow
SEED = 0              # seed for pseudo-random number generator
random.seed(SEED)
tf.random.set_seed(SEED)

<a name="3"></a>
## 3 - The Five Tigers Game

<a name="3.1"></a>
### 3.1 Action Space

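An action is a cell `(i, j)` that is still empty on the board. A mask is applied to the $Q$-values so that the agent never selects an occupied cell (an invalid move).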
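
As a minimal sketch of the masking idea, assuming the board is flattened row by row into 25 entries with `0` marking an empty cell. The helper name `masked_argmax` is hypothetical, and the penalty of `-100` simply mirrors the value used by the notebook's own masking code:

```python
# Illustrative sketch: pick the best *legal* cell from 25 Q-values.
BOARD_SIZE = 5

def masked_argmax(q_values, flat_state, penalty=-100.0):
    """Return the (row, col) of the highest Q-value among empty cells.

    flat_state uses 0 for an empty cell; any other value means occupied.
    Occupied cells receive a large negative penalty so they are not chosen
    (this assumes real Q-values stay well above the penalty).
    """
    masked = [q + (0.0 if s == 0 else penalty)
              for q, s in zip(q_values, flat_state)]
    best = max(range(len(masked)), key=masked.__getitem__)
    return divmod(best, BOARD_SIZE)

state = [0] * 25
state[12] = 1            # centre cell (2, 2) is already taken
q = [0.0] * 25
q[12] = 5.0              # the highest raw Q-value sits on the occupied cell
q[7] = 1.0               # the best legal cell is index 7, i.e. (1, 2)
chosen = masked_argmax(q, state)
print(chosen)
```

Without the mask the agent would pick the occupied centre cell; with it, the choice falls on the best empty cell.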

<a name="3.2"></a>
### 3.2 State

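The state records the moves already made on the board, cell by cell:
* `1` for a self move (`X` in the code)
* `0` for an empty cell
* `-1` for a rival move (`O` in the code)

The 5x5 board is expressed as a vector of length 25 before it is passed to the network.
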
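
The encoding can be sketched as follows, assuming the `X -> 1`, `O -> -1`, `EMPTY -> 0` mapping stored in the class's `translation` dictionary. The function name `encode_board` is hypothetical and only used for illustration:

```python
# Illustrative sketch: encode a 5x5 board as a length-25 vector.
X, O, EMPTY = "X", "O", None
TRANSLATION = {X: 1, O: -1, EMPTY: 0}

def encode_board(board):
    """Flatten the board row by row, mapping X -> 1, O -> -1, empty -> 0."""
    return [TRANSLATION[cell] for row in board for cell in row]

board = [[EMPTY] * 5 for _ in range(5)]
board[0][0] = X          # row 0, column 0 -> flat index 0
board[2][3] = O          # row 2, column 3 -> flat index 2 * 5 + 3 = 13
vector = encode_board(board)
print(len(vector), vector[0], vector[13])
```
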
<a name="3.3"></a>
### 3.3 Rewards

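* Winning the game: +10 points.
* Losing the game: -15 points.
* A tie: +0 points.
* Every move also earns a step reward = weight1 * points_self_get + weight2 * points_block_rival_get, where points_block_rival_get is the number of points the rival would have scored by playing the same cell.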
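
The step reward can be sketched as below. This follows our reading of the `update_score` method later in the notebook (a scale of `1.5`, a self weight equal to `SELF_RACIO = 0.4`, and a penalty of `0.1` per move already played); the function and parameter names are hypothetical:

```python
# Illustrative sketch of the per-move reward; all names are hypothetical.
SELF_RATIO = 0.4   # weight on the mover's own points (SELF_RACIO in the notebook)

def step_reward(points_self, points_blocked, moves_played,
                self_ratio=SELF_RATIO, scale=1.5, step_penalty=0.1):
    """Blend own points and blocked rival points, minus a per-move penalty."""
    shaped = self_ratio * points_self + (1 - self_ratio) * points_blocked
    return scale * shaped - step_penalty * moves_played

# A move that completes a 5-point row for the mover, blocks nothing,
# and is played after 10 earlier moves:
reward = step_reward(points_self=5, points_blocked=0, moves_played=10)
print(reward)
```

Because the blocking weight (`1 - SELF_RATIO`) is larger than the self weight here, denying the rival a pattern is rewarded more than completing the same pattern oneself.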

<a name="3.4"></a>
### 3.4 Game Termination

* The game ends when all 25 cells have been filled.

<a name="4"></a>
## 4 - Make the Game


In [None]:
class five_tigers():

    def __init__(self, initial=[[EMPTY for _ in range(board_size)] for _ in range(board_size)]):
        """
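        Initialize game board.
        Each game board has
            - `board`: a 5x5 list of lists holding X, O, or EMPTY
            - `player`: 0 or 1 to indicate which player's turn
            - `winner`: None while the game is running, then 0 or 1 for the winner, or 2 for a tie
            - `scores`: the points collected so far by player 0 and player 1
            - `left`: the number of empty cells remaining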
        """
        self.board = copy.deepcopy(initial)
        self.player = 0
        self.winner = None
        self.scores = [0, 0]
        self.left = 25
        self.translation = {X:1, O:-1, EMPTY:0}

    @classmethod
    def available_actions(cls, board):
        """
        five_tigers.available_actions(board) takes a `board` list as input
        and returns all of the available actions `(i, j)` in that state.

        Action `(i, j)` represents placing a piece at row `i`, column `j`.
        """
        actions = set()
        for i in range(5) :
            for j in range(5) :
                if board[i][j] == EMPTY :
                    actions.add((i,j))
        return actions

    @classmethod
    def other_player(cls, player):
        """
        five_tigers.other_player(player) returns the player that is not
        `player`. Assumes `player` is either 0 or 1.
        """
        return 0 if player == 1 else 1

    def switch_player(self):
        """
        Switch the current player to the other player.
        """
        self.player = five_tigers.other_player(self.player)

    def now_player_sign(self, player):
        if player == 0:
            return X
        else:
            return O
        
    def update_score(self, action):
        related = {
            (0, 0): [1, 6, 11, 21, 30, 31, ],
            (0, 1): [1, 7, 13, 22, 31, 32, ],
            (0, 2): [1, 8, 17, 18, 21, 23, 32, 33, ],
            (0, 3): [1, 9, 14, 22, 33, 34, ],
            (0, 4): [1, 10, 12, 23, 30, 34, ],
            (1, 0): [2, 6, 15, 24, 31, 35, ],
            (1, 1): [2, 7, 11, 17, 21, 25, 31, 32, 35, 36, ],
            (1, 2): [2, 8, 13, 14, 22, 24, 26, 32, 33, 36, 37, ],
            (1, 3): [2, 9, 12, 18, 23, 25, 33, 34, 37, 38, ],
            (1, 4): [2, 10, 16, 26, 34, 38, ],
            (2, 0): [3, 6, 17, 20, 21, 27, 35, 39, ],
            (2, 1): [3, 7, 14, 15, 22, 24, 28, 35, 36, 39, 40, ],
            (2, 2): [3, 8, 11, 12, 21, 23, 25, 27, 29, 30, 36, 37, 40, 41, ],
            (2, 3): [3, 9, 13, 16, 22, 26, 28, 37, 38, 41, 42, ],
            (2, 4): [3, 10, 18, 19, 23, 29, 38, 42, ],
            (3, 0): [4, 6, 14, 24, 39, 43, ],
            (3, 1): [4, 7, 12, 20, 25, 27, 39, 40, 43, 44, ],
            (3, 2): [4, 8, 15, 16, 24, 26, 28, 40, 41, 44, 45, ],
            (3, 3): [4, 9, 11, 19, 25, 29, 41, 42, 45, 46, ],
            (3, 4): [4, 10, 13, 26, 42, 46, ],
            (4, 0): [5, 6, 12, 27, 30, 43, ],
            (4, 1): [5, 7, 16, 28, 43, 44, ],
            (4, 2): [5, 8, 19, 20, 27, 29, 44, 45, ],
            (4, 3): [5, 9, 15, 28, 45, 46, ],
            (4, 4): [5, 10, 11, 29, 30, 46, ]
        }
        tmp_score = [0, 0]
        def add_score(base_sign, score, k):
            if base_sign == X:
                tmp_score[0] += score
                if k:
                    self.scores[0] += score
            else:
                tmp_score[1] += score
                if k:
                    self.scores[1] += score
        def check_raw(base_sign, i, k):
            if self.board[i][0] == base_sign and self.board[i][1] == base_sign and self.board[i][2] == base_sign and self.board[i][3] == base_sign and self.board[i][4] == base_sign:
                add_score(base_sign,5,k)
        def check_column(base_sign, i, k):
            if self.board[0][i] == base_sign and self.board[1][i] == base_sign and self.board[2][i] == base_sign and self.board[3][i] == base_sign and self.board[4][i] == base_sign:
                add_score(base_sign,5,k)
        def check_5x(base_sign, i, k):
            if i == 1 and self.board[2][2] == base_sign and self.board[0][0] == base_sign and self.board[1][1] == base_sign and self.board[3][3] == base_sign and self.board[4][4] == base_sign:
                add_score(base_sign,5,k)
            if i == 2 and self.board[2][2] == base_sign and self.board[0][4] == base_sign and self.board[1][3] == base_sign and self.board[3][1] == base_sign and self.board[4][0] == base_sign:
                add_score(base_sign,5,k)
        def check_4x(base_sign, i, k):
            if i == 2 and self.board[0][3] == base_sign and self.board[1][2] == base_sign and self.board[2][1] == base_sign and self.board[3][0] == base_sign:
                add_score(base_sign,4,k)
            if i == 1 and self.board[0][1] == base_sign and self.board[1][2] == base_sign and self.board[2][3] == base_sign and self.board[3][4] == base_sign:
                add_score(base_sign,4,k)
            if i == 3 and self.board[4][3] == base_sign and self.board[3][2] == base_sign and self.board[2][1] == base_sign and self.board[1][0] == base_sign:
                add_score(base_sign,4,k)
            if i == 4 and self.board[4][1] == base_sign and self.board[3][2] == base_sign and self.board[2][3] == base_sign and self.board[1][4] == base_sign:
                add_score(base_sign,4,k)
        def check_3x(base_sign, i, k):
            if i == 1 and self.board[0][2] == base_sign and self.board[1][1] == base_sign and self.board[2][0] == base_sign:
                add_score(base_sign,3,k)
            if i == 2 and self.board[0][2] == base_sign and self.board[1][3] == base_sign and self.board[2][4] == base_sign:
                add_score(base_sign,3,k)
            if i == 4 and self.board[4][2] == base_sign and self.board[3][1] == base_sign and self.board[2][0] == base_sign:
                add_score(base_sign,3,k)
            if i == 3 and self.board[4][2] == base_sign and self.board[3][3] == base_sign and self.board[2][4] == base_sign:
                add_score(base_sign,3,k)
        def check_big5(base_sign, k):
            if self.board[2][2] == base_sign and self.board[0][0] == base_sign and self.board[4][0] == base_sign and self.board[0][4] == base_sign and self.board[4][4] == base_sign:
                add_score(base_sign,10,k)
        def check_small5(base_sign, index, k):
            i, j = index // 3 + 1, index % 3 + 1
            if self.board[i][j] == base_sign and self.board[i-1][j-1] == base_sign and self.board[i-1][j+1] == base_sign and self.board[i+1][j-1] == base_sign and self.board[i+1][j+1] == base_sign:
                add_score(base_sign,5,k)
        def check_well(base_sign, index, k):
            i, j = index // 4, index % 4
            if self.board[i][j] == base_sign and self.board[i][j+1] == base_sign and self.board[i+1][j] == base_sign and self.board[i+1][j+1] == base_sign:
                add_score(base_sign,1,k)
        
        i, j = action  # tuple
        base_sign = self.board[i][j]
        for index in related[action]:
            if 1 <= index <= 5:
                check_raw(base_sign, index-1, 1)
            elif 6 <= index <= 10:
                check_column(base_sign, index-6, 1)
            elif 11 <= index <= 12:
                check_5x(base_sign, index-10, 1)
            elif 13 <= index <= 16:
                check_4x(base_sign, index-12, 1)
            elif 17 <= index <= 20:
                check_3x(base_sign, index-16, 1)
            elif 21 <= index <= 29:
                check_small5(base_sign, index-21, 1)
            elif index == 30:
                check_big5(base_sign, 1)
            elif 31 <= index <= 46:
                check_well(base_sign, index-31, 1)

        self.board[i][j] = self.now_player_sign(1-self.player)
        base_sign = self.board[i][j]
        for index in related[action]:
            if 1 <= index <= 5:
                check_raw(base_sign, index-1, 0)
            elif 6 <= index <= 10:
                check_column(base_sign, index-6, 0)
            elif 11 <= index <= 12:
                check_5x(base_sign, index-10, 0)
            elif 13 <= index <= 16:
                check_4x(base_sign, index-12, 0)
            elif 17 <= index <= 20:
                check_3x(base_sign, index-16, 0)
            elif 21 <= index <= 29:
                check_small5(base_sign, index-21, 0)
            elif index == 30:
                check_big5(base_sign, 0)
            elif 31 <= index <= 46:
                check_well(base_sign, index-31, 0)

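        # Restore the mover's piece, then blend the two scores:
        # tmp_score[1 - index] holds the mover's own points (weight SELF_RACIO) and
        # tmp_score[index] holds the points the rival would have scored on this cell.
        self.board[i][j] = self.now_player_sign(self.player)
        index = int((self.translation[self.board[i][j]]+1)/2)
        # Subtract a penalty of 0.1 for every move already played.
        return 1.5*((1 - SELF_RACIO) * tmp_score[index] + SELF_RACIO * tmp_score[1- index]) - 0.1*(25-self.left)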

    def check_winner(self):
        if self.scores[0] > self.scores[1]:
            return 0
        elif self.scores[0] < self.scores[1]:
            return 1
        else:
            return 2
        
    def board_to_state(self):
        state = tuple([tuple([self.translation[j] for j in row])for row in self.board])
        return state  # 2 dimension tuple
        
    def move(self, action):
        """
        Make the move `action` for the current player.
        `action` must be a tuple `(i, j)`.
        """
        raw, column = action  # tuple must be valid move

        # Check for errors
        if self.winner is not None:
            raise Exception("Game already won")
        elif raw < 0 or raw >= 5 or column < 0 or column >= 5:
            raise Exception("Invalid move")
        
        if self.board[raw][column] is not None:  # should not happen while the action mask is applied
            print("error: AI made an invalid move")
            return self.board_to_state(), 0, 1

        # Update board
        self.board[raw][column] = self.now_player_sign(self.player)
        reward = self.update_score(action)
        self.switch_player()
        self.left -= 1

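        if self.left == 0:
            # The board is full: decide the winner and end the game.
            self.winner = self.check_winner()
            return self.board_to_state(), reward, 1
        elif self.left == 1:
            # This was the mover's last move (the final cell belongs to the other player).
            return self.board_to_state(), reward, 1
        else:
            return self.board_to_state(), reward, 0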
        
    def render(self):
        print()
        print("board:")
        print("   0 1 2 3 4")
        for i in range(5):
            print(i,end="  ")
            for j in self.board[i]:
                print(j if j is not None else '-',end=" ")
            print()
        print()
        

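In order to build our neural network later on we need to know the size of the state vector. Here it is `state_size = 25`, one entry per cell of the 5x5 board, and the network outputs `action_size = 25` $Q$-values, one per cell.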

<a name="5"></a>
## 5 - Interacting with the Game

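* Play one move.
* Use the current state and the $Q$-Network to decide on an action.
* The new state is the last state plus two moves, the agent's own move and the rival's predicted reply: `last_state` + 2 actions = `new_state`.
* The reward is computed as explained in [Section 3.3](#3.3).
* Store the `(state, action, reward, next_state, done)` tuple in the memory buffer.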
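
The "last state plus two actions" step can be sketched like this, assuming a flat length-25 state with `0` for empty cells. The helper `apply_two_moves` and its flat-index arguments are hypothetical; in the notebook the rival's reply is chosen with the target network:

```python
# Illustrative sketch: next_state = state + the agent's move + the rival's reply.
def apply_two_moves(flat_state, own_move, rival_move, own_mark=1):
    """Return a new length-25 tuple with both moves applied.

    own_move and rival_move are flat indices (row * 5 + col). The rival
    always gets the opposite mark. Both cells must be empty beforehand.
    """
    if own_move == rival_move:
        raise ValueError("the two moves must target different cells")
    nxt = list(flat_state)
    for idx, mark in ((own_move, own_mark), (rival_move, -own_mark)):
        if nxt[idx] != 0:
            raise ValueError(f"cell {idx} is already occupied")
        nxt[idx] = mark
    return tuple(nxt)

state = (0,) * 25
next_state = apply_two_moves(state, own_move=12, rival_move=0)
empty_cells = sum(1 for v in next_state if v == 0)
print(next_state[12], next_state[0], empty_cells)
```

Building `next_state` this way means the stored transition ends in a position where the same player is to move again, which is what the $\max_{a'}$ in the Bellman target expects.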

<a name="6"></a>
## 6 - Deep Q-Learning

In cases where both the state and action space are discrete we can estimate the action-value function iteratively by using the Bellman equation:

$$
Q_{i+1}(s,a) = R + \gamma \max_{a'}Q_i(s',a')
$$

This iterative method converges to the optimal action-value function $Q^*(s,a)$ as $i\to\infty$. This means that the agent just needs to gradually explore the state-action space and keep updating the estimate of $Q(s,a)$ until it converges to the optimal action-value function $Q^*(s,a)$. However, when the state space is continuous, or discrete but very large as in this game (each of the 25 cells can take one of three values, giving up to $3^{25}$ board encodings), it becomes practically impossible to explore the entire state-action space. Consequently, it also becomes practically impossible to estimate $Q(s,a)$ in this way until it converges to $Q^*(s,a)$.
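
To make the iteration concrete, here is a minimal sketch on a toy two-state problem that is small enough to enumerate. The states, actions, and rewards are invented for illustration and have nothing to do with the game itself:

```python
# Illustrative sketch: tabular Q-value iteration on a toy deterministic chain.
GAMMA = 0.9

# transitions[state][action] = (reward, next_state); next_state None is terminal.
transitions = {
    "s0": {"left": (0.0, "s0"), "right": (0.0, "s1")},
    "s1": {"left": (0.0, "s0"), "right": (1.0, None)},
}

def q_iteration(transitions, gamma, sweeps=100):
    """Apply Q_{i+1}(s,a) = R + gamma * max_a' Q_i(s',a') for a fixed number of sweeps."""
    q = {s: {a: 0.0 for a in acts} for s, acts in transitions.items()}
    for _ in range(sweeps):
        new_q = {}
        for s, acts in transitions.items():
            new_q[s] = {}
            for a, (reward, nxt) in acts.items():
                future = 0.0 if nxt is None else max(q[nxt].values())
                new_q[s][a] = reward + gamma * future
        q = new_q
    return q

q = q_iteration(transitions, GAMMA)
print(q["s1"]["right"], q["s0"]["right"], q["s0"]["left"])
```

With only four state-action pairs the table settles after a few sweeps. The same bookkeeping is out of reach for the game's board space, which is why the table is replaced by a network.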

In Deep $Q$-Learning, we solve this problem by using a neural network to estimate the action-value function $Q(s,a)\approx Q^*(s,a)$. We call this neural network a $Q$-Network and it can be trained by adjusting its weights at each iteration to minimize the mean-squared error in the Bellman equation.

Unfortunately, **using neural networks in reinforcement learning to estimate action-value functions has proven to be highly unstable**. Luckily, there are a couple of techniques that can be employed to avoid instabilities. These techniques consist of using a ***Target Network*** and ***Experience Replay***. We will explore these two techniques in the following sections.

<a name="6.1"></a>
### 6.1 Target Network

We can train the $Q$-Network by adjusting its weights at each iteration to minimize the mean-squared error in the Bellman equation, where the target values are given by:

$$
y = R + \gamma \max_{a'}Q(s',a';w)
$$

where $w$ are the weights of the $Q$-Network. This means that we are adjusting the weights $w$ at each iteration to minimize the following error:

$$
\overbrace{\underbrace{R + \gamma \max_{a'}Q(s',a'; w)}_{\rm {y~target}} - Q(s,a;w)}^{\rm {Error}}
$$

Notice that this is a problem because the $y$ target changes on every iteration. **Having a constantly moving target can lead to oscillations and instabilities**. To avoid this, we can create a separate neural network for generating the $y$ targets. We call this separate neural network the **target $\hat Q$-Network** and it will have the same architecture as the original $Q$-Network. By using the target $\hat Q$-Network, the above error becomes:

$$
\overbrace{\underbrace{R + \gamma \max_{a'}\hat{Q}(s',a'; w^-)}_{\rm {y~target}} - Q(s,a;w)}^{\rm {Error}}
$$

where $w^-$ and $w$ are the weights of the target $\hat Q$-Network and the $Q$-Network, respectively.
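
The error above can be sketched with plain numbers, assuming the target network's outputs for $s'$ are already available as a list. The function names are hypothetical, and the `done` flag (taken from the loss function later in the notebook) switches off the bootstrap term at the end of an episode:

```python
# Illustrative sketch: y target and error for a single transition.
def td_target(reward, next_q_values, done, gamma=0.9):
    """y = R if the episode ended, otherwise R + gamma * max_a' Q_hat(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def td_error(q_sa, reward, next_q_values, done, gamma=0.9):
    """Error = y target - Q(s, a)."""
    return td_target(reward, next_q_values, done, gamma) - q_sa

target_outputs = [0.5, 2.0, -1.0]   # Q_hat(s', a'; w^-) for three actions
y = td_target(1.0, target_outputs, done=False)
err = td_error(2.0, 1.0, target_outputs, done=False)
print(y, err)
```

Only `q_sa` depends on the trainable weights $w$, so freezing (or slowly moving) the weights behind `target_outputs` keeps the target steady while $w$ is updated.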

In practice, we will use the following algorithm: every $C$ time steps we will use the $\hat Q$-Network to generate the $y$ targets and update the weights of the target $\hat Q$-Network using the weights of the $Q$-Network. We will update the weights $w^-$ of the target $\hat Q$-Network using a **soft update**. This means that we will update the weights $w^-$ using the following rule:
 
$$
w^-\leftarrow \tau w + (1 - \tau) w^-
$$

where $\tau\ll 1$. By using the soft update, we are ensuring that the target values, $y$, change slowly, which greatly improves the stability of our learning algorithm.
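
A small numeric sketch shows how slowly the target weights move, assuming for simplicity that the online weights stay fixed while the rule is applied repeatedly. Weights are plain Python lists here, and `soft_update` is a hypothetical helper rather than the notebook's function:

```python
# Illustrative sketch: repeated soft updates with tau = 1e-3.
TAU = 1e-3

def soft_update(target_weights, online_weights, tau=TAU):
    """Return tau * w + (1 - tau) * w_minus for every pair of weights."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_weights, target_weights)]

target = [0.0, 0.0]
online = [1.0, -2.0]
for _ in range(1000):
    target = soft_update(target, online)
print(target)
```

After $n$ updates against fixed online weights the remaining gap shrinks by a factor of $(1-\tau)^n$, so with $\tau = 10^{-3}$ it takes on the order of a thousand updates for the target to cover most of the distance.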

In [None]:
# Create the Q-Network
q_network = Sequential([
    ### START CODE HERE ### 
    Input(shape=(state_size,)),
    Dense(128, activation='relu'),
    Dense(128, activation='relu'),
    Dense(128, activation='relu'),
    Dense(action_size, activation='linear')
    ### END CODE HERE ### 
    ])

# Create the target Q^-Network
target_q_network = Sequential([
    ### START CODE HERE ### 
    Input(shape=(state_size,)),
    Dense(128, activation='relu'),
    Dense(128, activation='relu'),
    Dense(128, activation='relu'),
    Dense(action_size, activation='linear')
    ### END CODE HERE ###
    ])

### START CODE HERE ### 
optimizer = Adam(ALPHA)
### END CODE HERE ###

<a name="6.2"></a>
### 6.2 Experience Replay

When an agent interacts with the environment, the states, actions, and rewards the agent experiences are sequential by nature. If the agent tries to learn from these consecutive experiences it can run into problems due to the strong correlations between them. To avoid this, we employ a technique known as **Experience Replay** to generate uncorrelated experiences for training our agent. Experience replay consists of storing the agent's experiences (i.e., the states, actions, and rewards the agent receives) in a memory buffer and then sampling a random mini-batch of experiences from the buffer to do the learning. The experience tuples $(S_t, A_t, R_t, S_{t+1})$ will be added to the memory buffer at each time step as the agent interacts with the environment.

For convenience, we will store the experiences as named tuples.

By using experience replay we avoid problematic correlations, oscillations and instabilities. In addition, experience replay also allows the agent to potentially use the same experience in multiple weight updates, which increases data efficiency.
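
As a minimal sketch of the idea, using only the standard library: a bounded `deque` drops the oldest experiences once it is full, and `random.sample` draws a mini-batch without regard to the order in which experiences arrived. The tiny capacity and the integer states are placeholders chosen so the behaviour is easy to see:

```python
# Illustrative sketch: a bounded replay buffer with random mini-batch sampling.
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience",
                        ["state", "action", "reward", "next_state", "done"])

buffer = deque(maxlen=5)            # tiny capacity so eviction is visible
for t in range(8):
    buffer.append(Experience(state=t, action=0, reward=0.0,
                             next_state=t + 1, done=False))

random.seed(0)
batch = random.sample(list(buffer), k=3)   # sample without replacement
stored_states = [e.state for e in buffer]
print(stored_states)
print(len(batch))
```

The three oldest experiences have been evicted, and the sampled batch mixes experiences from different time steps, which is what breaks the correlation between consecutive transitions.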

<a name="7"></a>
## 7 - Deep Q-Learning Algorithm with Experience Replay

Now that we know all the techniques that we are going to use, we can put them together to arrive at the Deep Q-Learning Algorithm With Experience Replay.
<br>
<br>
<figure>
  <img src = "images/deep_q_algorithm.png" width = 90% style = "border: thin silver solid; padding: 0px">
      <figcaption style = "text-align: center; font-style: italic">Fig 3. Deep Q-Learning with Experience Replay.</figcaption>
</figure>

In [None]:
def update_target_network(q_network, target_q_network):
    for target_weights, q_net_weights in zip(target_q_network.weights, q_network.weights):
        target_weights.assign(TAU * q_net_weights + (1.0 - TAU) * target_weights)
        
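def get_mask(states):  # states: (batch, 25), e.g. (64, 25)
    # Occupied cells (1 or -1) become -100 so they cannot win the argmax;
    # empty cells stay 0 and leave the Q-value unchanged.
    mask = tf.where(tf.equal(states, 1.0), -100.0, states)
    mask = tf.where(tf.equal(mask, -1.0), -100.0, mask)
    return mask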


In [None]:
def compute_loss(experiences, gamma, q_network, target_q_network):
    """ 
    Calculates the loss.
    
    Args:
      experiences: (tuple) tuple of tensors (states, actions, rewards, next_states, done_vals)
      gamma: (float) The discount factor.
      q_network: (tf.keras.Sequential) Keras model for predicting the q_values
      target_q_network: (tf.keras.Sequential) Keras model for predicting the targets
          
    Returns:
      loss: (TensorFlow Tensor(shape=(), dtype=float32)) the Mean-Squared Error between
            the y targets and the Q(s,a) values.
    """
    
    # Unpack the mini-batch of experience tuples. all of them are tf.tensor
    states, actions, rewards, next_states, done_vals = experiences

    # Compute max Q^(s,a)
    mask = get_mask(next_states)
    masked_target_q = target_q_network(next_states) + mask
    max_qsa = tf.reduce_max(masked_target_q, axis=1)  # (64,) tensor

    # Set y = R if episode terminates, otherwise set y = R + γ max Q^(s,a).
    ### START CODE HERE ### 
    y_targets = done_vals * rewards + (1 - done_vals) * (rewards + gamma * max_qsa)
    ### END CODE HERE ###
    
    # Get the q_values
    q_values = q_network(states)
    q_values = tf.gather(q_values, actions, axis=1, batch_dims=1)  # (64,)tensor
        
    # Compute the loss
    ### START CODE HERE ### 
    loss = MSE(q_values, y_targets) 
    ### END CODE HERE ### 
    
    return loss

<a name="8"></a>
## 8 - Update the Network Weights

We will use the `agent_learn` function below to implement lines ***12 - 14*** of the algorithm outlined in [Fig 3](#7). The `agent_learn` function will update the weights of the $Q$ and target $\hat Q$ networks using a custom training loop. Because we are using a custom training loop we need to retrieve the gradients via a `tf.GradientTape` instance, and then call `optimizer.apply_gradients()` to update the weights of our $Q$-Network. Note that we are also using the `@tf.function` decorator, which compiles the function into a TensorFlow graph to increase performance. Without this decorator, training can take roughly twice as long. If you would like to know more about how to increase performance with `@tf.function` take a look at the [TensorFlow documentation](https://www.tensorflow.org/guide/function).

The last line of this function updates the weights of the target $\hat Q$-Network using a [soft update](#6.1). If you want to know how this is implemented in code we encourage you to take a look at the `update_target_network` function defined in [Section 7](#7).

In [None]:
#q_network.save('q_model_128', save_format='tf')
#target_q_network.save('target_q_model_128', save_format='tf')
# ... load the models later ...
#q_network = tf.keras.models.load_model('q_model')
#target_q_network = tf.keras.models.load_model('target_q_model')


#optimizer = Adam(ALPHA)

In [None]:
@tf.function
def agent_learn(experiences, gamma):
    """
    Updates the weights of the Q networks.
    
    Args:
      experiences: (tuple) tuple of tensors (states, actions, rewards, next_states, done_vals)
      gamma: (float) The discount factor.
    
    """
    
    # Calculate the loss
    with tf.GradientTape() as tape:
        loss = compute_loss(experiences, gamma, q_network, target_q_network)

    # Get the gradients of the loss with respect to the weights.
    gradients = tape.gradient(loss, q_network.trainable_variables)
    
    # Update the weights of the q_network.
    optimizer.apply_gradients(zip(gradients, q_network.trainable_variables))

    # update the weights of target q_network
    update_target_network(q_network, target_q_network)

    return loss  # to report the loss history

<a name="9"></a>
## 9 - Train the Agent

We are now ready to train our agent to play the Five Tigers game. In the cell below we will implement the algorithm in [Fig 3](#7) line by line (please note that we have included the same algorithm below for easy reference, so that you do not need to scroll up and down the notebook):

* **Line 1**: We initialize the `memory_buffer` with a capacity of $N =$ `MEMORY_SIZE`. Notice that we are using a `deque` as the data structure for our `memory_buffer`.


* **Line 2**: We skip this line since we already initialized the `q_network` in [Section 6.1](#6.1).


* **Line 3**: We initialize the `target_q_network` by setting its weights to be equal to those of the `q_network`.


* **Line 4**: We start the outer loop. Notice that we have set $M =$ `num_rounds = 50000`, so training runs for 50,000 games.


* **Line 5**: We create a new `five_tigers()` instance to reset the game to an empty board, and use the `.board_to_state()` method to get the current state.


* **Line 6**: We start the inner loop. Notice that here $T = 25$: a game always lasts exactly 25 moves, one per cell of the board, so no additional time limit is needed.


* **Line 7**: The agent observes the current `state` and chooses an `action` using an $\epsilon$-greedy policy. Our agent starts out using a value of $\epsilon =$ `epsilon = 1` which yields an $\epsilon$-greedy policy that is equivalent to the equiprobable random policy over the valid moves. This means that at the beginning of our training, the agent is just going to take random actions regardless of the observed `state`. As training progresses we will decrease the value of $\epsilon$ slowly towards a minimum value using a given $\epsilon$-decay rate. We want this minimum value to be close to zero because a value of $\epsilon = 0$ will yield an $\epsilon$-greedy policy that is equivalent to the greedy policy. This means that towards the end of training, the agent will lean towards selecting the `action` that it believes (based on its past experiences) will maximize $Q(s,a)$. We set the minimum $\epsilon$ value to `E_MIN = 0.05` and not exactly 0 because we always want to keep a little bit of exploration during training. If you want to know how this is implemented in code we encourage you to take a look at the `get_action` function defined in the code below.


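* **Line 8**: We use the `game.move(action)` method to take the given `action` and get the `next_state`, the `reward`, and the `done` flag. Because the two players alternate, we then use the target network to predict the rival's reply and add it to `next_state`, so that `next_state` is again a position in which the same player is to move.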


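* **Line 9**: We store the `experience(state, action, reward, next_state, done)` tuple in our `memory_buffer`. Notice that we also store the `done` variable so that we can keep track of when an episode terminates. This allowed us to set the $y$ targets in the `compute_loss` function in [Section 7](#7). The experiences for the last two moves of a game are held back until the game is over, so that the win (+10) or loss (-15) bonus can be added to their rewards.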


* **Line 10**: We check if the conditions are met to perform a learning update. We do this by using our custom `check_update_conditions` function. This function checks if $C =$ `NUM_STEPS_FOR_UPDATE = 5` time steps have occurred and if our `memory_buffer` has enough experience tuples to fill a mini-batch. For example, if the mini-batch size is `64`, then our `memory_buffer` should hold more than `64` experience tuples in order to pass the latter condition. If the conditions are met, then the `check_update_conditions` function will return a value of `True`, otherwise it will return a value of `False`.


* **Lines 11 - 14**: If the `update` variable is `True` then we perform a learning update. The learning update consists of sampling a random mini-batch of experience tuples from our `memory_buffer`, setting the $y$ targets, performing gradient descent, and updating the weights of the networks. We will use the `agent_learn` function we defined in [Section 8](#8) to perform the latter 3.


* **Line 15**: At the end of each iteration of the inner loop we set `next_state` as our new `state` so that the loop can start again from this new state. In addition, we check if the episode has reached a terminal state (i.e., we check if `done = True`). If a terminal state has been reached, then we break out of the inner loop.


* **Line 16**: At the end of each iteration of the outer loop we update the value of $\epsilon$, and check if the environment has been solved. We consider that the environment has been solved if the agent receives an average of `200` points in the last `100` episodes. If the environment has not been solved we continue the outer loop and start a new episode.

Finally, we wanted to note that we have included some extra variables to keep track of the total number of points the agent received in each episode. This will help us determine if the agent has solved the environment and it will also allow us to see how our agent performed during training. We also use the `time` module to measure how long the training takes. 
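
To get a feel for the $\epsilon$ schedule described for Line 7, the following sketch counts how many decay steps it takes to reach the floor. It reuses the values `E_DECAY = 0.999` and `E_MIN = 0.05` from the hyperparameters and assumes, as described for Line 16, that $\epsilon$ is decayed once per round:

```python
# Illustrative sketch: how long epsilon takes to decay from 1.0 to E_MIN.
E_DECAY = 0.999   # multiplicative decay applied once per round (assumed)
E_MIN = 0.05      # floor for epsilon

def get_new_eps(epsilon):
    """Decay epsilon by E_DECAY, but never below E_MIN."""
    return max(E_MIN, E_DECAY * epsilon)

epsilon, rounds = 1.0, 0
while epsilon > E_MIN:
    epsilon = get_new_eps(epsilon)
    rounds += 1
print(rounds, epsilon)
```

Under these assumptions the floor is reached after roughly 3,000 rounds, so for most of a 50,000-round run the agent explores with only a 5% chance of a random move.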

<br>
<br>
<figure>
  <img src = "images/deep_q_algorithm.png" width = 90% style = "border: thin silver solid; padding: 0px">
      <figcaption style = "text-align: center; font-style: italic">Fig 4. Deep Q-Learning with Experience Replay.</figcaption>
</figure>
<br>

In [None]:
def get_experiences(memory_buffer):
    experiences = random.sample(memory_buffer, k=MINIBATCH_SIZE)  # list
    states = tf.convert_to_tensor(np.array([np.array(e.state).flatten() for e in experiences if e is not None]),dtype=tf.float32) # 64*25 tensor
    actions = tf.convert_to_tensor(np.array([ALL_ACTIONS.index(e.action) for e in experiences if e is not None]), dtype=tf.int32) # (64,) tensor
    rewards = tf.convert_to_tensor(np.array([e.reward for e in experiences if e is not None]), dtype=tf.float32) # (64,) tensor
    next_states = tf.convert_to_tensor(np.array([np.array(e.next_state).flatten() for e in experiences if e is not None]),dtype=tf.float32) # 64*25 tensor
    done_vals = tf.convert_to_tensor(np.array([e.done for e in experiences if e is not None]).astype(np.uint8),           # (64,) tensor
                                     dtype=tf.float32)
    return (states, actions, rewards, next_states, done_vals)


def check_update_conditions(t, num_steps_upd, memory_buffer):
    # Update every `num_steps_upd` steps, once the buffer holds more than one mini-batch.
    return (t + 1) % num_steps_upd == 0 and len(memory_buffer) > MINIBATCH_SIZE
    
    
def get_new_eps(epsilon):
    return max(E_MIN, E_DECAY*epsilon)

def get_action(q_values, state, epsilon=0):
    state = tf.squeeze(state).numpy()  # (25,)
    valid_actions = [i for i in range(25) if state[i] == 0]  # 0 is available
    if random.random() > epsilon:
        action_mask = tf.convert_to_tensor(np.array([0 if i in valid_actions else -100 for i in range(25)]),dtype=tf.float32)
        q_values = q_values[0] + action_mask
        return ALL_ACTIONS[tf.argmax(q_values).numpy()]
    else:
        return ALL_ACTIONS[random.choice(valid_actions)]

def plot_history(reward_history, rolling_window=20, lower_limit=None,
                 upper_limit=None, plot_rw=True, plot_rm=True):
    
    if lower_limit is None or upper_limit is None:
        rh = reward_history
        xs = [x for x in range(len(reward_history))]
    else:
        rh = reward_history[lower_limit:upper_limit]
        xs = [x for x in range(lower_limit,upper_limit)]
    
    df = pd.DataFrame(rh)
    rollingMean = df.rolling(rolling_window).mean()

    plt.figure(figsize=(10,7), facecolor='white')
    
    if plot_rw:
        plt.plot(xs, rh, linewidth=1, color='cyan')
    if plot_rm:
        plt.plot(xs, rollingMean, linewidth=2, color='magenta')

    text_color = 'black'
        
    ax = plt.gca()
    ax.set_facecolor('black')
    plt.grid()
#     plt.title("Total Point History", color=text_color, fontsize=40)
    plt.xlabel('Episode', color=text_color, fontsize=30)
    plt.ylabel('Total Points', color=text_color, fontsize=30)
    yNumFmt = mticker.StrMethodFormatter('{x:,}')
    ax.yaxis.set_major_formatter(yNumFmt)
    ax.tick_params(axis='x', colors=text_color)
    ax.tick_params(axis='y', colors=text_color)
    plt.show()

def print_last_game(buffer):
    for e in list(buffer)[-25:]:
        for i in range(5):
            print(f'{e.state[i]}   ->   {e.next_state[i]}')
        print(f'action: {e.action}, reward:{e.reward}, done: {e.done}')

    
    

In [None]:
def state_to_input(state):
    state_flat = np.array(state).flatten()
    return tf.convert_to_tensor(state_flat, dtype=tf.float32)[tf.newaxis, :]

In [None]:
start = time.time()

num_rounds = 50000

total_loss_history = []

epsilon = 1    # initial ε value for ε-greedy policy

# Create a memory buffer D with capacity N
memory_buffer = deque(maxlen=MEMORY_SIZE)

# Set the target network weights equal to the Q-Network weights
target_q_network.set_weights(q_network.get_weights())

for i in range(num_rounds):
    
    # Reset the environment to the initial state and get the initial state
    game = five_tigers()
    
    for t in range(25):
        
        state = game.board_to_state() # 2 dimension tuple
        # From the current state S choose an action A using an ε-greedy policy
        # state needs to be the right shape for the q_network
        input_qn = state_to_input(state) # (1,25) tf tensor
        q_values = q_network(input_qn) # (1,25) tf tensor
        action = get_action(q_values, input_qn, epsilon) # tuple(i, j)

        # Make move
        next_state, reward, done = game.move(action)

        if t != 24:
            # Use target_q_network to simulate the rival's reply: the free cell with the largest Q-value
            next_state_list = [list(row) for row in next_state]  # (5,5) list
            input_qn = state_to_input(next_state)
            # Cells holding 1 or -1 get a -100 penalty so they are never selected
            mask = tf.where(tf.equal(input_qn, 1.0), -100.0, input_qn)
            mask = tf.where(tf.equal(mask, -1.0), -100.0, mask)
            masked_target_q = target_q_network(input_qn) + mask
            r, c = ALL_ACTIONS[tf.reshape(tf.argmax(masked_target_q, axis=1), [])]
            # Named `stone` so it does not clash with the `update` flag used below
            stone = 1 if game.player == 0 else -1
            next_state_list[r][c] = stone
            next_state = tuple([tuple(row) for row in next_state_list])

        # Store experience tuple (S,A,R,S') in the memory buffer.
        # We store the done variable as well for convenience.
        if t < 23:
            memory_buffer.append(experience(state, action, reward, next_state, done))
        elif t == 23:
            O_state, O_action, O_reward, O_next_state, O_done = state, action, reward, next_state, done
        else:
            if game.winner == 1:
                memory_buffer.append(experience(O_state, O_action, O_reward+10, O_next_state, O_done))
                memory_buffer.append(experience(state, action, reward-15, next_state, done))
            elif game.winner == 0:
                memory_buffer.append(experience(O_state, O_action, O_reward-15, O_next_state, O_done))
                memory_buffer.append(experience(state, action, reward+10, next_state, done))
            else:
                memory_buffer.append(experience(O_state, O_action, O_reward, O_next_state, O_done))
                memory_buffer.append(experience(state, action, reward, next_state, done))
        
        # Only update the network every NUM_STEPS_FOR_UPDATE time steps.
        update = check_update_conditions(t, NUM_STEPS_FOR_UPDATE, memory_buffer)
        
        if update:
            # Sample random mini-batch of experience tuples (S,A,R,S') from Deque
            experiences = get_experiences(memory_buffer)
            
            # Set the y targets, perform a gradient descent step,
            # and update the network weights.
            loss = agent_learn(experiences, GAMMA)
            total_loss_history.append(loss)

        
    # Update the ε value
    epsilon = get_new_eps(epsilon)
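    # Avoid np.mean([]) (NaN plus a RuntimeWarning) before the first learning update
    av_latest_loss = np.mean(total_loss_history[-1000:]) if total_loss_history else float("nan")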
    print(f"\rEpisode {i+1} | Total loss average: {av_latest_loss:.2f}; each score: {game.scores[0]}, {game.scores[1]}    ", end="")
    
    if (i+1) % 1000 == 0:
        print(f"\rEpisode {i+1} | Total loss average: {av_latest_loss:.2f}; epsilon: {epsilon}; loss: {total_loss_history[-1]}           ")
    #if av_latest_loss > 3:
    #    break 

        
tot_time = time.time() - start

print(f"\nTotal Runtime: {tot_time:.2f} s ({(tot_time/60):.2f} min)")
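
The end-of-game bookkeeping in the loop above can be read in isolation. The following is a minimal sketch, not the notebook's code: `final_rewards` is a hypothetical helper, and it assumes that the move at `t == 23` was made by player 1 and the move at `t == 24` by player 0. The +10 bonus and the -15 penalty are the values used in the loop.

```python
WIN_BONUS = 10      # added to the winner's last move (value taken from the loop above)
LOSS_PENALTY = 15   # subtracted from the loser's last move


def final_rewards(winner, reward_t23, reward_t24):
    """Hypothetical helper: adjust the rewards of the last two moves.

    Assumes the move at t == 23 belongs to player 1 and the move at
    t == 24 to player 0; any other winner value is treated as a tie.
    """
    if winner == 1:
        return reward_t23 + WIN_BONUS, reward_t24 - LOSS_PENALTY
    if winner == 0:
        return reward_t23 - LOSS_PENALTY, reward_t24 + WIN_BONUS
    return reward_t23, reward_t24


print(final_rewards(1, 0, 0))   # player 1 wins
print(final_rewards(0, 2, 1))   # player 0 wins
print(final_rewards(2, 3, 3))   # tie: rewards unchanged
```

Holding back the move at `t == 23` until the game ends is what allows both final moves to receive the outcome-dependent adjustment.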

In [None]:
print_last_game(memory_buffer)

We can plot the loss history to see how training progressed. Note that `plot_history` labels the y-axis 'Total Points' even though the values plotted here are losses.

In [None]:
# Plot the loss history
plot_history(total_loss_history)
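
`plot_history` overlays a rolling mean on the raw curve. As a rough illustration of what `df.rolling(rolling_window).mean()` computes, here is a plain-Python sketch. `rolling_mean` is a hypothetical helper, not part of the notebook, and it returns `None` where pandas would return `NaN` (the first `window - 1` entries).

```python
def rolling_mean(values, window):
    """Trailing-window mean; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


losses = [4.0, 2.0, 3.0, 1.0, 5.0]
print(rolling_mean(losses, 3))  # [None, None, 3.0, 2.0, 3.0]
```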

<a name="10"></a>
## 10 - Play with the five-tigers AI

Now that we have trained our agent, we can see it in action. The `play` function starts a game between a human and the trained $Q$-Network. Its `settings` argument is a list of 0s and 1s: each entry sets whether the human moves first or second in one game, so the length of the list is the number of games played.
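
As a sketch of how `settings` is interpreted, the hypothetical helper below mirrors the first lines of `play`: an empty list means one game with a random order, otherwise there is one game per entry. It is an illustration only and is not called by the notebook.

```python
import random


def resolve_human_order(settings=None, rng=random):
    """Return one human player index (0 or 1) per game to be played."""
    if not settings:
        return [rng.randint(0, 1)]   # single game, order chosen at random
    return list(settings)            # one game per entry, order as given


print(resolve_human_order([1, 0]))   # [1, 0]
print(len(resolve_human_order()))    # 1
```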

In [None]:
def play(human_experiences, settings=None):
    """
    Play human games against the AI.

    `settings` is a list of 0s and 1s, one entry per game, that specifies
    whether the human player moves first or second.
    An empty list (or None) plays one game with the order chosen at random.
    The play data is stored in `human_experiences`.
    """
    # Avoid a mutable default argument
    settings = settings or []
    loop = 1 if len(settings) == 0 else len(settings)
    # If no player order is set, choose the human's order randomly
    for l in range(loop):
        if len(settings) == 0:
            human_player = random.randint(0, 1)
        else:
            human_player = settings[l]

        # Create new game
        game = five_tigers()

        # Game loop
        for t in range(25):

            # Print contents of board
            game.render()

            # Compute available actions
            available_actions = five_tigers.available_actions(game.board)  # tuple(i, j)
            time.sleep(1)

            state = game.board_to_state() # 2 dimension tuple

            # Let human make a move
            if game.player == human_player:
                print("Your Turn")
                while True:
                    try:
                        row = int(input("Choose Row: "))
                        column = int(input("Choose Column: "))
                        if (row, column) in available_actions:
                            action = (row, column)
                            break
                        print("Invalid move, try again.")
                    except ValueError:
                        # int() failed: the input was not a whole number
                        print("Invalid move, try again.")


            # Have AI make a move
            else:
                print("AI's Turn")
                input_qn = state_to_input(state) # (1, 25) tf tensor
                q_values = q_network(input_qn) # (1, 25) tf tensor
                print(f'q values = {tf.reshape(tf.reduce_max(q_values, axis=1), [])}')
                action = get_action(q_values, input_qn) # tuple(i, j)
                row, column = action
                print(f"AI chose to move row {row}, column {column}.")

            # Make move
            next_state, reward, done = game.move(action)

            if t != 24:
                # Simulate the rival's reply: the free cell with the largest target Q-value
                next_state_list = [list(row) for row in next_state]  # (5,5) list
                input_qn = state_to_input(next_state)
                mask = tf.where(tf.equal(input_qn, 1.0), -100.0, input_qn)
                mask = tf.where(tf.equal(mask, -1.0), -100.0, mask)
                masked_target_q = target_q_network(input_qn) + mask
                r, c = ALL_ACTIONS[tf.reshape(tf.argmax(masked_target_q, axis=1), [])]
                update = 1 if game.player == 0 else -1
                next_state_list[r][c] = update
                next_state = tuple([tuple(row) for row in next_state_list])
            # Store experience tuple (S,A,R,S') in human_experiences, as the docstring states.
            # We store the done variable as well for convenience.
            if t < 23:
                human_experiences.append(experience(state, action, reward, next_state, done))
            elif t == 23:
                # Hold back the second-to-last move until the winner is known
                O_state, O_action, O_reward, O_next_state, O_done = state, action, reward, next_state, done
            else:
                if game.winner == 1:
                    human_experiences.append(experience(O_state, O_action, O_reward+10, O_next_state, O_done))
                    human_experiences.append(experience(state, action, reward-15, next_state, done))
                elif game.winner == 0:
                    human_experiences.append(experience(O_state, O_action, O_reward-15, O_next_state, O_done))
                    human_experiences.append(experience(state, action, reward+10, next_state, done))
                else:
                    human_experiences.append(experience(O_state, O_action, O_reward, O_next_state, O_done))
                    human_experiences.append(experience(state, action, reward, next_state, done))

            # Check for winner
            if t == 24:
                game.render()
                print()
                print("GAME OVER")
                if game.winner == human_player:
                    print("Winner is human")
                elif game.winner == 1-human_player:
                    print("Winner is AI")
                elif game.winner == 2:
                    print("Tie")
                else:
                    print("Sorry, AI made an invalid move, you win")
                print('scores:')
                print(f"Human: {game.scores[human_player]}")
                print(f"AI   : {game.scores[1-human_player]}")
                break
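
Both the training loop and `play` pick the opponent's simulated reply by penalising occupied cells before taking the argmax. The sketch below reproduces that idea with plain Python lists instead of TensorFlow tensors. `best_free_cell` is a hypothetical name, and the flat state is assumed to use 0 for empty cells and 1 or -1 for occupied ones.

```python
BOARD_SIZE = 5
ALL_ACTIONS = [(r, c) for r in range(BOARD_SIZE) for c in range(BOARD_SIZE)]
OCCUPIED_PENALTY = -100.0  # same constant as in the mask above


def best_free_cell(flat_state, q_values):
    """Return the (row, col) with the highest Q-value among empty cells."""
    masked = [
        q + (OCCUPIED_PENALTY if cell != 0 else 0.0)
        for cell, q in zip(flat_state, q_values)
    ]
    best_index = max(range(len(masked)), key=masked.__getitem__)
    return ALL_ACTIONS[best_index]


state = [0] * 25
state[0], state[1] = 1, -1          # two cells already taken
q = [0.0] * 25
q[0], q[1], q[7] = 9.0, 8.0, 3.0    # highest raw values sit on occupied cells
print(best_free_cell(state, q))     # (1, 2)
```

An additive penalty of this kind only excludes occupied cells while the genuine Q-values stay well above -100.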

In [None]:
human_experiences = deque(maxlen=MEMORY_SIZE)

In [None]:
play(human_experiences, settings=[1,0])

In [None]:
# Take advantage of the board's symmetry (4 rotations x 2 reflections) to expand the data 8-fold
def expend_experiences(human_experiences):
    def point_rotate(action, k):
        r, c = action[0]-2, action[1]-2
        for i in range(k):
            r, c = -c, r
        return (r+2, c+2)
    def point_flip(action):
        r, c = action
        return (r, 4-c)
    def as_tuple(board):
        # Convert a (5, 5) array back to the nested-tuple state format
        return tuple(map(tuple, board.tolist()))

    expended_human_experiences = deque(maxlen=MEMORY_SIZE)
    for exp in human_experiences:
        expended_human_experiences.append(exp)
        state = np.array(exp.state)
        action = exp.action
        reward = exp.reward
        next_state = np.array(exp.next_state)
        done_val = exp.done
        # Mirror image (left-right flip)
        flip_state = np.fliplr(state)
        flip_action = point_flip(action)
        flip_next_state = np.fliplr(next_state)
        expended_human_experiences.append(experience(
            as_tuple(flip_state), flip_action, reward,
            as_tuple(flip_next_state), done_val))
        # Quarter turns of the original and of the mirror image
        for k in range(1, 4):
            expended_human_experiences.append(experience(
                as_tuple(np.rot90(state, k)), point_rotate(action, k), reward,
                as_tuple(np.rot90(next_state, k)), done_val))
            expended_human_experiences.append(experience(
                as_tuple(np.rot90(flip_state, k)), point_rotate(flip_action, k), reward,
                as_tuple(np.rot90(flip_next_state, k)), done_val))
    return expended_human_experiences
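
A quick way to gain confidence in the coordinate maps is to check them against an actual grid transform. The sketch below is self-contained and uses nested lists instead of NumPy. `rot90_ccw` and `fliplr` are hypothetical stand-ins intended to behave like `np.rot90(grid, 1)` (one counter-clockwise quarter turn) and `np.fliplr`, while `point_rotate` and `point_flip` are copied from the cell above.

```python
N = 5  # board size


def rot90_ccw(grid):
    """Rotate a square grid 90 degrees counter-clockwise."""
    n = len(grid)
    return [[grid[j][n - 1 - i] for j in range(n)] for i in range(n)]


def fliplr(grid):
    """Mirror a grid left to right."""
    return [row[::-1] for row in grid]


def point_rotate(action, k):
    r, c = action[0] - 2, action[1] - 2
    for _ in range(k):
        r, c = -c, r
    return (r + 2, c + 2)


def point_flip(action):
    r, c = action
    return (r, N - 1 - c)


def find_mark(grid):
    """Return the (row, col) of the single cell that holds a 1."""
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value == 1:
                return (r, c)


action = (0, 1)
grid = [[0] * N for _ in range(N)]
grid[action[0]][action[1]] = 1

# The marked cell must land where point_rotate says it should
rotated = grid
for k in range(1, 4):
    rotated = rot90_ccw(rotated)
    assert find_mark(rotated) == point_rotate(action, k)

assert find_mark(fliplr(grid)) == point_flip(action)
print("coordinate maps agree with the grid transforms")
```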
        

In [None]:
# Fine-tune the Q-network on human_experiences to try to improve the AI's play.
# In practice this led to overfitting.
if len(human_experiences) > MINIBATCH_SIZE:
    start = time.time()

    num_rounds = 250

    total_loss_history = []

    epsilon = 0.1    # initial ε value for ε-greedy policy

    # Build the augmented buffer once; it does not change during fine-tuning
    buffer = expend_experiences(human_experiences)

    # Set the target network weights equal to the Q-Network weights
    #target_q_network.set_weights(q_network.get_weights())

    for i in range(num_rounds):
        
        # Sample random mini-batch of experience tuples (S,A,R,S') from Deque
        experiences = get_experiences(buffer)
        
        # Set the y targets, perform a gradient descent step,
        # and update the network weights.
        loss = agent_learn(experiences, GAMMA)
        total_loss_history.append(loss)

        # Update the ε value
        epsilon = get_new_eps(epsilon)
        av_latest_loss = np.mean(total_loss_history[-100:])
        print(f"\rEpisode {i+1} | Total loss average: {av_latest_loss:.2f}    ", end="")

        # With num_rounds = 250 a 1000-step interval would never print
        if (i+1) % 50 == 0:
            print(f"\rEpisode {i+1} | Total loss average: {av_latest_loss:.2f}; epsilon: {epsilon}                          ")

            
    tot_time = time.time() - start

    print(f"\nTotal Runtime: {tot_time:.2f} s ({(tot_time/60):.2f} min)")
else:
    print('Please play more games against the AI to build a large enough training set.')
    print(f'Current training set size: {len(human_experiences)}')
    print(f'Required training set size: more than {MINIBATCH_SIZE}')