## Reinforcement Learning - Maze Game

<b> Reinforcement learning </b> is a type of machine learning in which an agent learns to make decisions in an environment by interacting with it and receiving rewards or punishments based on its actions. The goal of the agent is to learn a policy, or a set of actions to take in different states, that maximizes the total reward it receives over time. Reinforcement learning is often used in scenarios where there is no clear answer or set of rules for how to act, and the agent must learn through trial and error. Some common applications of reinforcement learning include game playing, robotics, and autonomous driving.

#### Maze Challenge

In this game, the player controls a red ball and tries to reach the green square (the finish point) by navigating through the maze. The game has multiple levels with increasing difficulty, and the player needs to avoid obstacles and enemies while reaching the finish point. The obstacles include walls and blue squares that act as traps, while the enemies are the red balls that move randomly in the maze and can kill the player on contact. The game also has a scoring system where the player earns points by reaching the finish point and loses points for hitting obstacles or enemies. The game ends when the player completes all the levels or runs out of points.


#### A reinforcement learning approach

In Maze Game, the agent is a little blue dot that moves around the maze, and its goal is to reach the red square at the end of the maze. We can use Reinforcement Learning (RL) to solve this game.

We can represent the maze as a grid, where each cell in the grid represents a state, and the possible actions are moving up, down, left or right. At each time step, the agent observes the current state and takes an action, which transitions it to a new state and gives it a reward based on the new state. The goal of the agent is to learn a policy that maximizes its expected cumulative reward over time.

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
import numpy as np
import pandas as pd
import tkinter as tk

In [None]:
class RL:
    def __init__(self, n_action, n_feature,
                 lr=0.01, reward_decay=0.9, batch_size=64, replace_step=200, memory_size=500, e_greedy=0.9, epsilon_increment=None):
        self.n_action = n_action
        self.n_feature = n_feature
        self.lr = lr
        self.reward_decay = reward_decay
        self.batch_size = batch_size
        self.replace_step = replace_step
        self.memory_size = memory_size
        self.epsilon_max = e_greedy
        self.epsilon_increment = epsilon_increment
        self.epsilon = 0 if self.epsilon_increment is not None else self.epsilon_max

        ## counter
        self.learning_counter = 0
        self.memory_counter = 0

        ## memory
        self.memory = np.zeros(shape=(self.memory_size, 2 * self.n_feature + 2))

        ## initializer
        self.w_initializer = tf.initializers.random_normal(mean=0, stddev=0.3)
        self.b_initializer = tf.initializers.constant(0.2)

        ## loss & optimizer & metric
        self.loss = tf.losses.mean_squared_error
        self.optimizer = tf.optimizers.RMSprop(self.lr)
        self.metrics = ['acc']

        ## build net
        self.eval_net = self.build_eval_net()
        self.target_net = self.build_target_net()

        ## summary
        print(self.eval_net.summary())
        print(self.target_net.summary())

    def build_eval_net(self):
        model = Sequential([
            Dense(32, activation='relu', kernel_initializer=self.w_initializer,
                  bias_initializer=self.b_initializer, name='evaluate_Dense1', input_shape=[self.n_feature]),
            Dense(self.n_action, activation='softmax', kernel_initializer=self.w_initializer,
                  bias_initializer=self.b_initializer, name='evaluate_Dense2')
        ])
        return model

    def build_target_net(self):
        model = Sequential([
            Dense(32, activation='relu', name='target_Dense1', input_shape=[self.n_feature]),
            Dense(self.n_action, activation='softmax', name='target_Dense2')
        ])
        return model

    def replace_parameters(self):
        self.eval_net.weights
        w = self.eval_net.get_weights()
        self.target_net.set_weights(w)

    def choose_action(self, observation):
        observation = observation[np.newaxis, :]

        if np.random.rand() > self.epsilon:
            action = np.random.randint(0, self.n_action)
        else:
            q_eval = self.eval_net.predict(observation)
            action = np.argmax(q_eval)
        return action

    def store_transition(self, s, a, r, s_):
        transition = np.hstack((s, (a, r), s_))
        index = self.memory_counter % self.memory_size
        self.memory[index:] = transition
        self.memory_counter += 1

    def get_q_target(self, batch_memory):
        ## extarct q_eval (shape -> q_target)
        row_index = np.arange(0, self.batch_size)
        column_index = batch_memory[:, self.n_feature].astype(np.int)
        index = list(zip(row_index, column_index))

        ## y_true
        q_next = self.target_net.predict(batch_memory[:, -self.n_feature:])
        rewards = batch_memory[:, self.n_feature + 1]
        q_target = rewards + self.reward_decay * np.max(q_next, axis=1)

        return q_target, index

    @tf.function
    def train_model(self, batch_memory, q_target, index):
        with tf.GradientTape() as tape:
            q_eval = self.eval_net(batch_memory[:, :self.n_feature])
            q_eval = tf.gather_nd(q_eval, index)
            loss = self.loss(q_target, q_eval)

        ## optimize
        gradients = tape.gradient(loss, self.eval_net.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.eval_net.trainable_variables))

    def learn(self):
        if self.learning_counter % self.replace_step == 0:
            self.replace_parameters()

        ## sample batch
        if self.memory_counter < self.memory_size:
            index = np.random.choice(self.memory_counter, size=self.batch_size)
        else:
            index = np.random.choice(self.memory_size, size=self.batch_size)
        batch_memory = self.memory[index, :]

        ## training model
        q_target, index = self.get_q_target(batch_memory)
        batch_memory, q_target, index = tf.convert_to_tensor(batch_memory), \
                                        tf.convert_to_tensor(q_target), \
                                        tf.convert_to_tensor(index)
        self.train_model(batch_memory, q_target, index)

        self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max
        self.learning_counter += 1

In [2]:
UNIT = 40   # pixels
MAZE_H = 4  # grid height
MAZE_W = 4  # grid width


class Maze(tk.Tk, object):
    def __init__(self):
        super(Maze, self).__init__()
        self.action_space = ['u', 'd', 'l', 'r']
        self.n_actions = len(self.action_space)
        self.n_features = 2
        self.title('maze')
        self.geometry('{0}x{1}'.format(MAZE_H * UNIT, MAZE_H * UNIT))
        self._build_maze()

    def _build_maze(self):
        self.canvas = tk.Canvas(self, bg='white',
                           height=MAZE_H * UNIT,
                           width=MAZE_W * UNIT)

        # create grids
        for c in range(0, MAZE_W * UNIT, UNIT):
            x0, y0, x1, y1 = c, 0, c, MAZE_H * UNIT
            self.canvas.create_line(x0, y0, x1, y1)
        for r in range(0, MAZE_H * UNIT, UNIT):
            x0, y0, x1, y1 = 0, r, MAZE_W * UNIT, r
            self.canvas.create_line(x0, y0, x1, y1)

        # create origin
        origin = np.array([20, 20])

        # hell
        hell1_center = origin + np.array([UNIT * 2, UNIT])
        self.hell1 = self.canvas.create_rectangle(
            hell1_center[0] - 15, hell1_center[1] - 15,
            hell1_center[0] + 15, hell1_center[1] + 15,
            fill='black')

        # create oval
        oval_center = origin + UNIT * 2
        self.oval = self.canvas.create_oval(
            oval_center[0] - 15, oval_center[1] - 15,
            oval_center[0] + 15, oval_center[1] + 15,
            fill='yellow')

        # create red rect
        self.rect = self.canvas.create_rectangle(
            origin[0] - 15, origin[1] - 15,
            origin[0] + 15, origin[1] + 15,
            fill='red')

        # pack all
        self.canvas.pack()

    def reset(self):
        self.update()
        time.sleep(0.1)
        self.canvas.delete(self.rect)
        origin = np.array([20, 20])
        self.rect = self.canvas.create_rectangle(
            origin[0] - 15, origin[1] - 15,
            origin[0] + 15, origin[1] + 15,
            fill='red')
        # return observation
        return (np.array(self.canvas.coords(self.rect)[:2]) - np.array(self.canvas.coords(self.oval)[:2]))/(MAZE_H*UNIT)

    def step(self, action):
        s = self.canvas.coords(self.rect)
        base_action = np.array([0, 0])
        if action == 0:   # up
            if s[1] > UNIT:
                base_action[1] -= UNIT
        elif action == 1:   # down
            if s[1] < (MAZE_H - 1) * UNIT:
                base_action[1] += UNIT
        elif action == 2:   # right
            if s[0] < (MAZE_W - 1) * UNIT:
                base_action[0] += UNIT
        elif action == 3:   # left
            if s[0] > UNIT:
                base_action[0] -= UNIT

        self.canvas.move(self.rect, base_action[0], base_action[1])  # move agent

        next_coords = self.canvas.coords(self.rect)  # next state

        # reward function
        if next_coords == self.canvas.coords(self.oval):
            reward = 1
            done = True
        elif next_coords in [self.canvas.coords(self.hell1)]:
            reward = -1
            done = True
        else:
            reward = 0
            done = False
        s_ = (np.array(next_coords[:2]) - np.array(self.canvas.coords(self.oval)[:2]))/(MAZE_H*UNIT)
        return s_, reward, done

    def render(self):
        # time.sleep(0.01)
        self.update()

In [3]:
def run():
    step = 0
    for episode in tf.range(300):
        observation = env.reset()

        while True:
            env.render()
            action = DQL.choose_action(observation)
            observation_, reward, done = env.step(action)
            DQL.store_transition(observation, action, reward, observation_)

            if(step > 250) and (step % 5 ==0):
                DQL.learn()

            step += 1
            observation = observation_
            if done:
                break


if __name__ == '__main__':
    env = Maze()
    DQL = RL(env.n_actions, env.n_features)
    env.after(100, run())
    env.mainloop()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 evaluate_Dense1 (Dense)     (None, 32)                96        
                                                                 
 evaluate_Dense2 (Dense)     (None, 4)                 132       
                                                                 
Total params: 228
Trainable params: 228
Non-trainable params: 0
_________________________________________________________________
None
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 target_Dense1 (Dense)       (None, 32)                96        
                                                                 
 target_Dense2 (Dense)       (None, 4)                 132       
                                                                 
Total params: 228
Trainable par



None




Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  column_index = batch_memory[:, self.n_feature].astype(np.int)
















































































































































