## TF-Agent based HungryGoose using DQN

- Port the Hungry Geese env to TF-Agents
    - We create a 7x11 2D grid as the input to the DQN.
    - The output is the Q-value for each of the 4 possible actions.
    - Tutorial link: https://www.tensorflow.org/agents/tutorials/2_environments_tutorial

- Train a simple DQN policy on the new environment
    - Tutorial link: https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial
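
To make the grid input concrete, here is a minimal NumPy-only sketch of the encoding idea. It assumes board cells are numbered row-major from 0 to 76, and it reuses the identifier scheme from Step 1 (1 to 4 for goose bodies, 5 for food, 6 for other heads, 7 for my head). The function name `encode_board` and the sample positions are hypothetical and only serve as an illustration.

```python
import numpy as np

ROWS, COLS = 7, 11

def encode_board(geese, food, my_index=0):
    """Encode flat cell indices into a ROWS x COLS grid (illustrative only)."""
    grid = np.zeros((ROWS, COLS), dtype=np.float32)
    for i, goose in enumerate(geese):
        for j, pos in enumerate(goose):
            row, col = divmod(pos, COLS)  # flat index -> (row, col)
            if j == 0:
                # heads: 7 for my goose, 6 for every other goose
                grid[row, col] = 7 if i == my_index else 6
            else:
                # body segments carry the goose id (1..4)
                grid[row, col] = i + 1
    for pos in food:
        grid[pos // COLS, pos % COLS] = 5
    return grid

# Made-up position: my goose on cells 12 (head) and 13, one opponent on cell 40
grid = encode_board(geese=[[12, 13], [40], [], []], food=[5, 70])
print(grid.shape)  # (7, 11)
```

The DQN then maps such a grid to four numbers, one Q-value per direction, and the greedy policy picks the direction with the largest value.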

### Import packages

In [None]:
# imports kaggle environment
from kaggle_environments.envs.hungry_geese.hungry_geese import (Observation, 
                                                                Configuration, 
                                                                Action, 
                                                                row_col, 
                                                                translate, 
                                                                greedy_agent)
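# `utils` is not imported from kaggle_environments because the TF-Agents
# `utils` module imported below would shadow it anyway
from kaggle_environments import evaluate, make
# import other necessary packages
import abc
import random
import numpy as np
from tqdm import tqdm
# install tf-agents before importing tensorflow and tf_agents
!pip install -q tf-agents
import tensorflow as tf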

from tf_agents.environments import py_environment
from tf_agents.environments import tf_environment
from tf_agents.environments import tf_py_environment
from tf_agents.environments import utils
from tf_agents.specs import array_spec
from tf_agents.environments import wrappers
from tf_agents.environments import suite_gym
from tf_agents.trajectories import time_step as ts

tf.compat.v1.enable_v2_behavior()

## Step 1: Port HungryGoose to TF-Agent Env

In [None]:
class TFHungryGoose(py_environment.PyEnvironment):
    """The TF-Agent ported env of HungryGoose"""

    def __init__(self):
        """Init the env with basic info
        """
        # first create the kaggle hungry goose env
        env = make("hungry_geese")
        # init the trainer (with 3 greedy adversaries); our agent is the `None` slot at index 0
        self.trainer = env.train([None, greedy_agent, greedy_agent, greedy_agent])
        obs = self.trainer.reset()
        # there are 4 actions -- the 4 directions
        self._action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype=np.int32, minimum=0, maximum=3, name='action')
        # observation is a 7x11 2d matrix
        self._observation_spec = array_spec.BoundedArraySpec(
            shape=(1, 7, 11, 1), dtype=np.float32, minimum=0, maximum=10, name='observation')
        # init state
        self._state = self.create_grid_from_geese_position(obs)
        # init ending state == False
        self._episode_ended = False
        # action mapping
        self.action_name_mapping = {0: 'NORTH', 1: 'SOUTH', 2: 'EAST', 3: 'WEST'}

    def create_grid_from_geese_position(self, obs, grid_cols=11, grid_rows=7):
        """Create a grid from the given geese positions and game board dimensions.
        Identifiers ---
        {0}: free space; {1, 2, 3, 4}: bodies of the 4 geese, where 1 is me; {5}: food;
        {6}: head of another agent; {7}: head of my agent
        """
        # extract the required info from obs
        geese_position = obs.geese
        foods = obs.food
        my_index = obs.index # 0 here: our agent fills the `None` slot at position 0 in env.train
        # create a matrix filled with free space
        matrix = np.zeros((grid_rows, grid_cols))
        # for each goose position, add a specific identifier to the matrix
        goose_id = [1, 2, 3, 4]
        for i, goose_position in enumerate(geese_position):
            for j, pos in enumerate(goose_position):
                row, col = row_col(pos, grid_cols)
                if j == 0:
                    if i!=my_index: # mark as head
                        matrix[row][col] = 6
                    else:
                        matrix[row][col] = 7
                else:# normal body
                    matrix[row][col] = goose_id[i]
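        # add the identifier for food (np.put works on flat, row-major indices)
        np.put(matrix, foods, [5])
        # reshape to the (1, rows, cols, 1) layout declared in the observation spec
        return matrix.reshape(1, grid_rows, grid_cols, 1).astype('float32')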

    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec

    def _reset(self):
        """Reset the env"""
        obs = self.trainer.reset()
        self._state = self.create_grid_from_geese_position(obs)
        self._episode_ended = False
        return ts.restart(self._state)

    def __reward_manager(self, reward, step, geese):
        """Modify the default reward of the env.
        Mods:
        1. Every step you survive gives you 50 points
        2. Every food you eat gives you an additional 50 points
        3. If you lose, you get -1000 points
        4. If you win, you get 1000 points
        5. The env's first-step reward of 200 is replaced with the plain survive reward
        """
        if step == 1 and (reward != 0): # first step and survived, return only survive reward
            return 50
        elif (reward == 0) or (len(geese[0])==0): # you lose (our goose is at index 0), hence large neg reward
            return -1000
            return -1000
        # check if you won or not
        elif (max([len(goose) for goose in geese[1:]]) == 0) and (reward != 0): #you just won
            return 1000
        elif (reward%100)==0: # you survived but not won nor ate, hence only survive reward
            return 50
        else: # you survived and ate, hence
            return 100
        
    def _step(self, action):
        """Define the operation on env provided some action"""
        
        if self._episode_ended:
            # The last action ended the episode. Ignore the current action and start
            # a new episode.
            return self.reset()
        
        # map the action to the env Action
        action = self.action_name_mapping[int(action)]
        
        # perform the action; returned are -- (obs, reward, done, info)
        obs, reward, self._episode_ended, info = self.trainer.step(action)
        
        # mod the reward
        reward = self.__reward_manager(reward, obs.step, obs.geese)
        # modify the state
        self._state = self.create_grid_from_geese_position(obs)
        
        # handle the env termination or transition based on the outcome of action on the env
        if self._episode_ended:
            return ts.termination(self._state, reward)
        else:
            return ts.transition(self._state, reward=reward, discount=1.0)        

### Test the env, it should not throw any errors!

In [None]:
## test the env on sample episodes
env = TFHungryGoose()
utils.validate_py_environment(env, episodes=5)

## Step 2: DQN agent on the ported HungryGoose env

In [None]:
import base64
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks import sequential
from tf_agents.policies import random_tf_policy, policy_saver
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.trajectories import trajectory, time_step, policy_step
from tf_agents.specs import tensor_spec
from tf_agents.utils import common

# use the Keras layers bundled with TensorFlow, which is what tf_agents' Sequential expects
from tensorflow.keras.layers import Dense, Conv2D, Flatten

# !pip install wandb
import wandb

### Set the hyperparameter

In [None]:
# Hyperparameters for the run, kept in one dict so they are easy to track
config = {
    'num_iterations': 20000,

    'initial_collect_steps': 100,
    'collect_steps_per_iteration': 1,
    'replay_buffer_max_length': 100000,
    'replay_buffer_num_steps': 2,

    'batch_size': 64,
    'learning_rate': 1e-3,
    'log_interval': 1000,

    'num_eval_episodes': 10,
    'eval_interval': 10000,
}

### Create the Q-Network for the DQN

- At the heart of the DQN is a neural network that takes the state as input and returns one Q-value per action in its last layer.
- We will use a simple convolutional network, since the input is a 2D grid.
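
As a rough sanity check of how a small convolutional stack shrinks the 7x11 grid, the sketch below traces the feature-map sizes by hand. It assumes the Keras `Conv2D` defaults of stride 1 and `padding="valid"`, and two 3x3 convolutions with 64 and 32 filters as in the model defined in the next cell. `conv_out` is a hypothetical helper, not part of any library.

```python
def conv_out(size, kernel=3, stride=1):
    """Output length of an unpadded ('valid') convolution along one axis."""
    return (size - kernel) // stride + 1

h, w = 7, 11  # board height and width
for filters in (64, 32):
    h, w = conv_out(h), conv_out(w)
    print(f"Conv2D({filters}): {h} x {w} x {filters}")

flat = h * w * 32  # number of features after Flatten()
print("Flatten:", flat)
```

Under these assumptions the first dense layer sees 672 features. A third unpadded 3x3 convolution would leave only a 1x5 map, so going deeper would likely need `padding="same"` or smaller kernels.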

In [None]:
# The QNetwork model
model = sequential.Sequential([
    Conv2D(64, kernel_size=3, activation="relu"),
    Conv2D(32, kernel_size=3, activation="relu"),
    Flatten(),
    Dense(48, activation="relu"),
    Dense(24, activation="relu"),
    Dense(4, activation=None), # 4 actions hence last layer outputs 4
])

### Create the DQN agent

In [None]:
# create the envs to train and eval
train_py_env = TFHungryGoose()
eval_py_env = TFHungryGoose()
# convert the env to TF
train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)

# set the optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=config['learning_rate'])

# step counter
train_step_counter = tf.Variable(0)

# create the DQN agent
agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=model,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)

# init the agent
agent.initialize()

### Create policies and ReplayBuffer for Experience Replay

- Define the policies and create a function to calculate the average return of a policy.
    Agents contain two policies:
    - agent.policy — The main policy that is used for evaluation and deployment.
    - agent.collect_policy — A second policy that is used for data collection.


- Create the Experience Replay module
    
    - Each row of the replay buffer stores only a single observation step. 
    Since the DQN agent needs both the current and the next observation to compute the loss, 
    the dataset pipeline samples two adjacent rows for each item in the batch (num_steps=2).
    The dataset is also prefetched; parallel calls can be enabled through the `num_parallel_calls` argument, which is commented out in the code below.
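
The adjacent-row sampling described above can be illustrated with a toy buffer in plain Python. This is only a sketch of the idea, not how `TFUniformReplayBuffer` is implemented: the class `PairReplayBuffer` and its methods are hypothetical, and it ignores episode boundaries.

```python
import random

class PairReplayBuffer:
    """Toy buffer that stores single steps and samples adjacent pairs."""

    def __init__(self, max_length):
        self.max_length = max_length
        self.steps = []

    def add(self, observation, action, reward):
        if len(self.steps) == self.max_length:
            self.steps.pop(0)  # drop the oldest step once the buffer is full
        self.steps.append((observation, action, reward))

    def sample(self, batch_size, rng=random):
        # Pick a start index i and return rows (i, i + 1) as one training item,
        # mirroring num_steps=2.
        starts = [rng.randrange(len(self.steps) - 1) for _ in range(batch_size)]
        return [(self.steps[i], self.steps[i + 1]) for i in starts]

buffer = PairReplayBuffer(max_length=5)
for t in range(8):
    buffer.add(observation=f"s{t}", action=t % 4, reward=50)

batch = buffer.sample(batch_size=3, rng=random.Random(0))
for current, nxt in batch:
    print(current[0], "->", nxt[0])
```

Each sampled item carries the current step and its successor, which is what a one-step TD loss needs: the reward and action from the first row and the next observation from the second.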



In [None]:
# get the policies
eval_policy = agent.policy
collect_policy = agent.collect_policy

# func to get the avg return and the fraction of episodes counted as wins
def compute_avg_return(environment, policy, num_episodes=10):

  total_return, won = 0.0, 0
  for _ in range(num_episodes):

    time_step = environment.reset()
    episode_return = 0.0

    while not time_step.is_last():
      action_step = policy.action(time_step)
      time_step = environment.step(action_step.action)
      episode_return += time_step.reward
    total_return += episode_return
    # heuristic win check: a return above 1000 is counted as a win, although a
    # long losing episode (many +50 survive rewards) can also cross this threshold
    if episode_return > 1000:
      won += 1

  avg_return = total_return / num_episodes
  return avg_return, won / num_episodes

# test the policy
# compute_avg_return(TFHungryGoose(), eval_policy)

## Experience Replay
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=config['replay_buffer_max_length'])

def collect_step(environment, policy, buffer):
  time_step = environment.current_time_step()
  action_step = policy.action(time_step)
  next_time_step = environment.step(action_step.action)
  traj = trajectory.from_transition(time_step, action_step, next_time_step)
  # Add trajectory to the replay buffer
  buffer.add_batch(traj)

def collect_data(env, policy, buffer, steps):
  for _ in range(steps):
    collect_step(env, policy, buffer)

# collect some initial experiences with the agent's collect policy (epsilon-greedy, not purely random)
collect_data(train_env, collect_policy, replay_buffer, config['initial_collect_steps'])

# Dataset generates trajectories with shape [Bx2x...]
dataset = replay_buffer.as_dataset(
#     num_parallel_calls=3, 
    sample_batch_size=config['batch_size'], 
    num_steps=config['replay_buffer_num_steps']).prefetch(3)

# create an iterator
iterator = iter(dataset)

# create a policy saver
saver = policy_saver.PolicySaver(eval_policy, batch_size=None)
# checkpointer 
train_checkpointer = common.Checkpointer(
        ckpt_dir="./agent_checkpoint",
        max_to_keep=1,
        agent=agent,
        policy=agent.policy,
        replay_buffer=replay_buffer,
        global_step=train_step_counter
    )
# init or reset the check point -- load existing model if present
train_checkpointer.initialize_or_restore()

### Train the agent!!

In [None]:
# (Optional) Optimize by wrapping some of the code in a graph using TF function.
agent.train = common.function(agent.train)

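# Reset the train step (note: this also discards a step count restored from a checkpoint)
agent.train_step_counter.assign(0)

# Evaluate the agent's policy once before training.
# compute_avg_return returns (avg_return, win_rate), so unpack both values.
avg_return, won = compute_avg_return(eval_env, agent.policy, config['num_eval_episodes'])
returns = [avg_return]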

for _ in range(config['num_iterations']):

  # Collect a few steps using collect_policy and save to the replay buffer.
  collect_data(train_env, agent.collect_policy, replay_buffer, config['collect_steps_per_iteration'])

  # Sample a batch of data from the buffer and update the agent's network.
  experience, unused_info = next(iterator)
  train_loss = agent.train(experience).loss

  step = agent.train_step_counter.numpy()

  if step % config['log_interval'] == 0:
    print('step = {0}: loss = {1}'.format(int(step), train_loss))

  if step % config['eval_interval'] == 0:
    avg_return, won = compute_avg_return(eval_env, agent.policy, config['num_eval_episodes'])
    print('step = {0}: Average Return = {1}'.format(int(step), avg_return))
    returns.append(avg_return)
    # save checkpoint
    train_checkpointer.save(train_step_counter)

### Test the agent

In [None]:
# create the helper env once; it is only used for its grid encoder and action mapping
helper_env = TFHungryGoose()

# create TF-agent playing function
def tf_agent_play(obs, conf):
    # get the state info
    state = helper_env.create_grid_from_geese_position(obs)
    # create timestep: (step_type, reward, discount, observation)
    timestep = time_step.TimeStep(
        np.array(0, dtype='int32'),
        np.array(0, dtype='float32'),
        np.array(0, dtype='float32'),
        np.array(state, dtype='float32')
    )
    # run the policy
    action_step = agent.policy.action(timestep)
    # map the action index back to the env's action name
    action = helper_env.action_name_mapping[int(action_step.action)]
    return action

# def go_west(obs, conf):
#     return 'WEST'
    
# # test the agent
# env_test = make("hungry_geese", debug=True)
# _ = env_test.run([tf_agent_play, go_west, go_west, go_west])
# env_test.render(mode="ipython", width=500, height=450)

In [None]:
## evaluate the agent over multiple trials
from joblib import Parallel, delayed

# variable
trials = 10

# run one episode per trial (Parallel() without n_jobs runs them one after another)
results = Parallel()( 
    delayed(evaluate)("hungry_geese", [
        tf_agent_play, 
        greedy_agent, 
        greedy_agent, 
        greedy_agent, 
    ], num_episodes=1) 
for _ in range(trials) )

# np.int was removed in NumPy 1.24; the builtin int is the drop-in replacement
mean_score = np.mean(results, axis=0).astype(int).flatten()
print("mean", mean_score)

max_score = np.max(results, axis=0).astype(int).flatten()
print("max ", max_score)

### Observation

- As the evaluations show, the trained model is not winning any competitions, to say the least :)
- This is expected, as we trained the agent for just 20k iterations with a quite simple architecture and hyperparameters.
- That said, the motive was to port HungryGoose to TF-Agents, so that we can try out more sophisticated algorithms and training procedures.
- I have trained this same agent for ~1M iterations with little improvement.
- Some areas I am planning to explore to improve the result are:
    - Train for more iterations.
    - Try other algorithms like DDQN with prioritized replay.
    - Modify the state to encode more information.
    - Use a more sophisticated Q-network (maybe more CNN layers).
    - Modify the reward function.
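
To give a feel for what switching to DDQN would change, here is a small NumPy sketch that compares the standard DQN target with the Double DQN target. The function names and all Q-values are made up for illustration. In TF-Agents itself, the `dqn_agent` module also provides a `DdqnAgent` class, which should be usable in place of `DqnAgent`, though that swap is untested here.

```python
import numpy as np

def dqn_target(reward, discount, q_next_target):
    # Standard DQN: the target network both selects and evaluates the next action.
    return reward + discount * q_next_target.max(axis=1)

def double_dqn_target(reward, discount, q_next_online, q_next_target):
    # Double DQN: the online network selects the action,
    # the target network evaluates it.
    best = q_next_online.argmax(axis=1)
    return reward + discount * q_next_target[np.arange(len(best)), best]

reward = np.array([50.0, 100.0])
discount = np.array([0.99, 0.0])  # 0.0 marks a terminal transition
q_next_online = np.array([[1.0, 3.0, 2.0, 0.0], [0.5, 0.1, 0.2, 0.3]])
q_next_target = np.array([[4.0, 2.0, 5.0, 1.0], [9.0, 9.0, 9.0, 9.0]])

print(dqn_target(reward, discount, q_next_target))
print(double_dqn_target(reward, discount, q_next_online, q_next_target))
```

In the first row the plain DQN target uses the largest target-network value (5.0), while Double DQN uses the value of the action preferred by the online network (2.0). That decoupling is what reduces the overestimation of Q-values.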
    
    