# Baby steps towards a Halite Reinforcement Learning agent

This kernel contains the first steps I have taken in creating a reinforcement learning agent for Halite. Being a beginner in reinforcement learning, I have worked my way up in baby steps using the tf-agents package. 

At this point, here are the steps that I have taken:
1. Create an environment and teach an agent to control a single ship, with no opponents. The reward is the collected halite. The agent learns to maximise collected halite by mining the cells of the board that contain the most halite.
2. Add a single opponent ship that moves randomly without collecting any halite. The reward is still the collected halite, but with a high penalty if the agent's ship is destroyed. The agent learns to maximise collected halite while avoiding the opponent ship.
3. Use the agent trained at step 2 as an opponent to train a second agent. The reward is now the difference in collected halite between the agent and its opponent. The agent must learn to collect halite more efficiently than its opponent.

All these steps are applied to a simplified version of the game: the board is 7 x 7 and the game lasts 20 turns (instead of 21 x 21 and 400 turns).

The main difficulties I still have to overcome to train an agent on the full game are:
1. How can the agent control multiple ships? In this code, the agent's observation is defined relative to the ship it must control. My plan is to call the agent once per ship, each time giving it the observation specific to that ship. For this, the order in which the ships are submitted to the agent needs to be defined. Adapting the reward is also a challenge.
2. How can "league training" be implemented? The idea is to have the agent train against multiple versions of itself in a virtual league, so that it does not overspecialise in defeating one particular strategy and still manages to defeat earlier versions of itself.
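
For the first point, one possible shape of the per-ship loop is sketched below in plain Python. The `Ship` tuple, the ordering rule (largest cargo first, ties broken by id) and `toy_policy` are hypothetical placeholders chosen for illustration. None of them come from this kernel or from tf-agents.

```python
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple


class Ship(NamedTuple):
    """Hypothetical stand-in for a Halite ship."""
    id: str
    position: Tuple[int, int]
    halite: int


def order_ships(ships: List[Ship]) -> List[Ship]:
    # Assumed rule: ships carrying the most cargo choose first, ties broken by id.
    return sorted(ships, key=lambda s: (-s.halite, s.id))


def assign_actions(ships: List[Ship],
                   policy: Callable[[Ship], Optional[str]]) -> Dict[str, Optional[str]]:
    """Call the single-ship policy once per ship and collect the chosen actions."""
    actions = {}
    for ship in order_ships(ships):
        actions[ship.id] = policy(ship)
    return actions


def toy_policy(ship: Ship) -> Optional[str]:
    # Placeholder: a trained policy would receive the ship-centred observation instead.
    return None if ship.halite > 0 else "NORTH"


fleet = [Ship("0-1", (0, 0), 0), Ship("0-3", (2, 1), 50), Ship("0-2", (4, 4), 50)]
print([s.id for s in order_ships(fleet)])   # ['0-2', '0-3', '0-1']
print(assign_actions(fleet, toy_policy))
```

A fixed ordering like this would at least make the calls reproducible. It does not address how to split the reward between ships.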

Unfortunately the Kaggle kernel does not yet include tf-agents, so it is not possible to submit a bot to the competition using this package. I am hoping they will add it for future Halite competitions ! 

You can find a Jupyter notebook version of this code on my GitHub. It will probably be more up to date than this kernel, so you are welcome to check it out: https://github.com/alexandersmedley/halite_rl

Any feedback or comments will be greatly appreciated! As I said, I am a beginner in reinforcement learning, so there are probably loads of ways this code can be improved.


# Code description
The code is based on the tf-agents DQN tutorial for the cartpole gym environment : https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial?hl=en

## Environment 
To adapt the tf-agents cartpole tutorial to Halite, the main difficulty was to create a compatible environment. In the tutorial, the cartpole gym environment already provides everything needed for tf-agents to interact with an agent. For Halite, a custom environment compatible with tf-agents must be created. 

I created this environment by following a tic-tac-toe tutorial (https://towardsdatascience.com/creating-a-custom-environment-for-tensorflow-agent-tic-tac-toe-example-b66902f73059). Basically, we need to create a Python environment class containing the following methods:
* `__init__` : Create a board instance and initialize internal variables.
* `_step` : Apply the action chosen by the agent to the board and calculate the reward. Return the observation and reward in a time_step.
* `_reset` : Reset the board at the end of a game.
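
The lifecycle these three methods implement can be sketched without any dependency. The toy class below is not a tf-agents `PyEnvironment`. `TimeStep`, `CountdownEnv` and the step-type constants are simplified stand-ins invented for this illustration. It only shows the contract: reset returns a first step, each step returns a transition or a termination, and stepping a finished game starts a new one (as the wrapper's `_step` does further down).

```python
from collections import namedtuple

# Simplified stand-in for a tf-agents time step (hypothetical, for illustration only).
TimeStep = namedtuple("TimeStep", ["step_type", "reward", "observation"])
FIRST, MID, LAST = 0, 1, 2


class CountdownEnv:
    """Toy environment: every step yields reward 1 and the game lasts max_turns steps."""

    def __init__(self, max_turns=3):
        self._max_turns = max_turns
        self._turn = 0
        self._game_over = False

    def reset(self):
        self._turn = 0
        self._game_over = False
        return TimeStep(FIRST, 0.0, self._turn)

    def step(self, action):
        # Stepping a finished game starts a new one instead of raising an error.
        if self._game_over:
            return self.reset()
        self._turn += 1
        self._game_over = self._turn >= self._max_turns
        step_type = LAST if self._game_over else MID
        return TimeStep(step_type, 1.0, self._turn)


env = CountdownEnv(max_turns=3)
types = [env.reset().step_type] + [env.step(0).step_type for _ in range(4)]
print(types)  # [0, 1, 1, 2, 0]
```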

I also added the following functions to the environment (I guess they could be placed elsewhere, in a helper class for example):
* get_observation : Get agent observation from the current board state. 
* set_opponent_behaviour and opponent_agent : Define opponent behaviour. 

## Observation
The observation is defined relative to the ship the agent currently controls. For now, the agent has two "radars": one contains the surrounding halite, the other contains the ally and enemy ships. For ships, I distinguish ally ships (we don't want to collide with those), enemy ships carrying more cargo than ours (we want to collide with those, since the lighter ship survives a collision) and enemy ships carrying the same or less cargo (we absolutely don't want to collide with those).
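
As a rough illustration of the radar idea, the pure-Python sketch below crops a ship-centred window that wraps around the board edges, and encodes ships with the same values as `get_observation` in the code further down. The helper names (`to_row_col`, `crop_wrapped`, `encode_ship`) are made up for this example. Modulo indexing is used here in place of the `np.tile` trick, and should give the same window.

```python
def to_row_col(position, size):
    """Convert a Halite (x, y) position (origin bottom-left) to (row, col) matrix indices."""
    x, y = position
    return size - 1 - y, x


def crop_wrapped(grid, position, half_width=3):
    """Return the square window centred on position, wrapping around the board edges."""
    size = len(grid)
    row, col = to_row_col(position, size)
    offsets = range(-half_width, half_width + 1)
    return [[grid[(row + dr) % size][(col + dc) % size] for dc in offsets]
            for dr in offsets]


def encode_ship(is_ally, is_self, halite, ref_halite):
    """Ship channel value: 0 self, -1 ally, 1 heavier enemy, -2 equal or lighter enemy."""
    if is_ally:
        return 0 if is_self else -1
    return 1 if halite > ref_halite else -2


# 7 x 7 board whose cells are numbered 0..48, row by row.
grid = [[r * 7 + c for c in range(7)] for r in range(7)]
window = crop_wrapped(grid, position=(0, 0))
print(window[3][3])  # 42: the ship's own cell sits at the centre of the window
```

On the 7 x 7 board, a 7-cell-wide window contains every cell exactly once, so the radar effectively shows the whole board re-centred on the ship.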

## Reward 
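The reward at each step is the halite gained by the agent's ship during that step minus the halite gained by the opponent's ship. If the opponent's behaviour is defined as randomly moving, it never stays on a cell to mine, so its collected halite will always be 0. Losing the agent's ship gives a reward of -10000, and destroying the opponent's ship gives +10000.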
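
The same rule, written as a standalone function mirroring the reward branch of `_step` in the code below. The function name and its keyword arguments are chosen for this sketch only.

```python
def step_reward(my_cargo, my_prev_cargo, opp_cargo, opp_prev_cargo,
                my_ship_alive=True, opp_ship_alive=True,
                destroyed_penalty=-10000, kill_bonus=10000):
    """Per-step reward: difference in cargo gained, or a large terminal value."""
    if not my_ship_alive:
        return destroyed_penalty
    if not opp_ship_alive:
        return kill_bonus
    return (my_cargo - my_prev_cargo) - (opp_cargo - opp_prev_cargo)


print(step_reward(125, 100, 0, 0))                      # 25
print(step_reward(125, 100, 40, 10))                    # -5
print(step_reward(0, 100, 0, 0, my_ship_alive=False))   # -10000
```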

## Neural network 
The agent neural network is a Conv2D layer followed by two Dense layers. Conv2D is used as a feature extractor. 
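
To get a feel for the size of this network, here is a back-of-the-envelope shape and parameter count in plain Python. It assumes the values used in the code below (a 7 x 7 x 2 observation, 128 filters with a 3 x 3 kernel and 'same' padding, two Dense layers of 128 units). It also assumes that the categorical Q-network ends in one logit per (action, atom) pair. That last point is my understanding of the architecture, not something checked against the tf-agents source.

```python
def conv2d_same(height, width, in_channels, filters, kernel=3):
    """Output shape and parameter count of a stride-1 Conv2D with padding='same'."""
    out_shape = (height, width, filters)  # 'same' padding keeps height and width
    params = kernel * kernel * in_channels * filters + filters  # weights + biases
    return out_shape, params


def dense(inputs, units):
    """Parameter count of a fully connected layer (weights + biases)."""
    return inputs * units + units


obs_shape = (7, 7, 2)            # halite channel + ship channel
num_actions, num_atoms = 5, 51   # assumed output head: one logit per (action, atom)

conv_shape, conv_params = conv2d_same(*obs_shape, filters=128)
flat = conv_shape[0] * conv_shape[1] * conv_shape[2]
layer_params = {
    "conv2d": conv_params,
    "dense_1": dense(flat, 128),
    "dense_2": dense(128, 128),
    "output": dense(128, num_actions * num_atoms),
}
print(conv_shape, flat)                        # (7, 7, 128) 6272
print(layer_params, sum(layer_params.values()))
```

Under these assumptions, almost all of the weights sit in the first Dense layer, because the flattened Conv2D output has 6272 values.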

I have not tested any alternative architecture. This simple neural network gave good results on the simplified version of the game. 

# Install tf-agents

In [None]:
pip install tensorflow_probability==0.11.1

In [None]:
pip install tf_agents==0.6.0

# Import 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import tensorflow as tf 

from kaggle_environments import evaluate, make
from kaggle_environments.envs.halite.helpers import Board, ShipAction

from tf_agents.environments import py_environment
from tf_agents.environments import tf_py_environment
from tf_agents.trajectories import time_step as ts
from tf_agents.networks import categorical_q_network
from tf_agents.agents.categorical_dqn import categorical_dqn_agent
from tf_agents.utils import common
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.trajectories import trajectory
from tf_agents.policies.policy_saver import PolicySaver
from tf_agents.specs import BoundedArraySpec

from random import randint

# Hyperparameters

In [None]:
num_iterations = 20000

num_atoms = 51 # number of atoms for Categorical Q-Network
min_q_value = -1000
max_q_value = 1000 
epsilon_greedy = 0.1

initial_collect_steps = 400 
collect_steps_per_iteration = 10
replay_buffer_max_length = 100000

batch_size = 64
learning_rate = 1e-3
log_interval = 200

num_eval_episodes = 10
eval_interval = 1000
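
A note on `num_atoms`, `min_q_value` and `max_q_value`: a categorical DQN predicts a probability distribution over a fixed grid of return values (the "support") instead of a single Q-value. The plain-Python sketch below shows what that grid looks like for the values above. The function names are illustrative only.

```python
def make_support(v_min, v_max, num_atoms):
    """Evenly spaced return values on which the predicted distribution lives."""
    delta = (v_max - v_min) / (num_atoms - 1)
    return [v_min + i * delta for i in range(num_atoms)]


def expected_q(probabilities, support):
    """Q-value of an action: expectation of the distribution over the support."""
    return sum(p * z for p, z in zip(probabilities, support))


def clip_to_support(value, v_min, v_max):
    """Values outside the support are clipped to its end points."""
    return max(v_min, min(v_max, value))


support = make_support(-1000, 1000, 51)
print(support[0], support[1], support[-1])  # -1000.0 -960.0 1000.0

# A distribution that puts all its mass on atom 30 has an expected value of 200.
one_hot = [1.0 if i == 30 else 0.0 for i in range(51)]
print(expected_q(one_hot, support))  # 200.0
```

If I read the algorithm correctly, target returns outside [min_q_value, max_q_value] are clipped onto the end atoms. The -10000 and +10000 rewards used in `_step` further down would then be treated as -1000 and +1000, so widening the range may be worth testing.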

# Environment

In [None]:
def time_step_ok(time_step_nok):
    """Convert an unbatched time_step built from a raw observation into a batched one that a policy accepts."""
    # Add a leading batch dimension of size 1 to every field and convert it to a tensor.
    tensors = []
    for array in time_step_nok:
        new_array = np.expand_dims(array, axis=0)
        tensor = tf.constant(new_array)
        tensors.append(tensor)

    # TimeStep fields, in order: step_type, reward, discount, observation
    return ts.TimeStep(tensors[0], tensors[1], tensors[2], tensors[3])

In [None]:
class HaliteWrapper(py_environment.PyEnvironment):
    def __init__(self):
        
        # Create environment and board 
        self._board_size = 7
        self._agent_count = 2
        self._max_turns = 20
        self._starting_halite = 2500
        self.env = make("halite", configuration={'episodeSteps':self._max_turns, 'size':self._board_size, 
                                                 'startingHalite':self._starting_halite})
        self.state = self.env.reset(num_agents=self._agent_count)
        
        self.obs = self.state[0].observation
        self.config = self.env.configuration
        self.max_cell_halite = self.config.maxCellHalite
        self.board = Board(self.obs, self.config)
        
        # Define observation and action specs 
        self.observation_shape = self.get_observation(self.board).shape
        self._observation_spec = BoundedArraySpec(
            shape=self.observation_shape, 
            dtype=np.int64, maximum=None, minimum=None)
        
        self._action_def = {0: None,
                            1: ShipAction.NORTH,
                            2: ShipAction.EAST,
                            3: ShipAction.SOUTH,
                            4: ShipAction.WEST,
#                             5: ShipAction.CONVERT
                           }
        self._action_spec = BoundedArraySpec(shape=(), dtype=np.int32, maximum=len(self._action_def)-1, minimum=0)
        
        # Define opponent behaviour 
        self.opponent_behaviour = 'static'
        self.opponent_policy = None
        
        # Define counters 
        self.turn_counter = 1 # turn that will be solved on next step, not last solved turn
        self.previous_cargo_halite = 0
        self.opponent_previous_cargo_halite = 0
        self.game_over = False
        
    def get_observation(self, board, position=(0,0), player_id=0):
        """
        Agent observation. Halite and ship channels with set radius around input position. 
        """
        size = board.configuration.size

        # Halite distribution
        halite = np.array(board.observation['halite']).reshape(size, size)
        halite = np.tile(halite, (3,3))

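        # Convert the Halite SDK position (x, y), origin bottom-left, to matrix (row, col) indices.
        # The extra "+ size" offset points at the central copy of the 3 x 3 tiled board,
        # so the window below wraps around the board edges.
        row = size - 1 - position[1] + size
        col = position[0] + size
        radius = 4  # gives a (2 * radius - 1) = 7 cell wide window centred on the ship
        halite_radius = halite[row-radius+1:row+radius, col-radius+1:col+radius]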
        halite_radius = np.expand_dims(halite_radius, axis=-1)

        # Ship distribution
        if board[position].ship:
            ref_halite = board[position].ship.halite
            ref_ship_id = board[position].ship.id
        else:
            ref_halite = 0
            ref_ship_id = None 

        ships = np.zeros((size, size))
        for ship in board.ships.values():
            if ship.player_id == player_id: # ally
                if ship.id == ref_ship_id:
                    value = 0
                else:
                    value = -1
            elif ship.halite > ref_halite: # opponent with more halite
                value = 1
            else: # opponent with equal or less halite 
                value = -2
            ships[size-1-ship.position[1], ship.position[0]] = value

        ships = np.tile(ships, (3,3))
        ships_radius = ships[row-radius+1:row+radius, col-radius+1:col+radius]
        ships_radius = np.expand_dims(ships_radius, axis=-1)

        # Concatenate channels
        observation = np.concatenate([halite_radius, ships_radius], axis=-1)
        observation = np.array(observation, dtype='int64')
        return observation
        
    def observation_spec(self):
        return self._observation_spec
    
    def action_spec(self):
        return self._action_spec
    
    def _reset(self):
        # reset environment and board
        self.env = make("halite", configuration={'episodeSteps':self._max_turns, 'size':self._board_size, 
                                                 'startingHalite':self._starting_halite})
        self.state = self.env.reset(num_agents=self._agent_count)
        
        self.obs = self.state[0].observation
        self.config = self.env.configuration
        self.max_cell_halite = self.config.maxCellHalite
        self.board = Board(self.obs, self.config)
        
        # reset counters
        self.turn_counter = 1
        self.previous_cargo_halite = 0
        self.opponent_previous_cargo_halite = 0
        self.game_over = False
        
        observation = self.get_observation(self.board, position=self.board.players[0].ships[0].position, player_id=0)
        
        return_object = ts.restart(observation)
        
        return return_object
    
    def _step(self, action):
        
        if self.game_over:
            return self._reset()
        
        # Agent action
        action = int(action)
        self.board.ships['0-1'].next_action = self._action_def[action]
        
        # Opponent action
        self.board.ships['0-2'].next_action = self.opponent_agent()
        
        self.board = self.board.next()
        
        # Calculate reward
        if len(self.board.players[0].ships) == 0: # ship destroyed
            reward = -10000
        elif len(self.board.players[1].ships) == 0: # enemy ship destroyed
            reward = 10000
        else:
            reward = (self.board.ships['0-1'].halite - self.previous_cargo_halite 
                      - (self.board.ships['0-2'].halite - self.opponent_previous_cargo_halite))
        
        # Update counters 
        self.turn_counter += 1
        if len(self.board.players[0].ships) > 0:
            self.previous_cargo_halite = self.board.ships['0-1'].halite
        else:
            self.previous_cargo_halite = 0
        
        if len(self.board.players[1].ships) > 0:
            self.opponent_previous_cargo_halite = self.board.ships['0-2'].halite
        else:
            self.opponent_previous_cargo_halite = 0
            
        if self.turn_counter >= self._max_turns:
            self.game_over = True
        elif len(self.board.players[0].ships) == 0 or len(self.board.players[1].ships) == 0:
            self.game_over = True
            
        if len(self.board.players[0].ships) > 0:
            observation = self.get_observation(self.board, position=self.board.players[0].ships[0].position, player_id=0)
        else:
            observation = self.get_observation(self.board, position=(0,0), player_id=0)
        
        # Return 
        if self.game_over:
            return_object = ts.termination(observation, reward)
            return return_object
        else:
            return_object = ts.transition(observation, reward, discount=1.0)
            return return_object
        
    def set_opponent_behaviour(self, behaviour, policy_file=None):
        """
        Set opponent behaviour internal variables. 
        If behaviour is load_policy then policy_file must be specified. 
        """
        self.opponent_behaviour = behaviour
        if policy_file:
            self.opponent_policy = tf.saved_model.load(policy_file)
            
    def opponent_agent(self):
        """
        Return opponent action based on opponent behaviour. 
        """
        
        if self.opponent_behaviour == 'static':
            action = None
            
        elif self.opponent_behaviour == 'random_moving':
            action = self._action_def[randint(1,4)]
            
        elif self.opponent_behaviour == 'random_collecting':
            action = self._action_def[randint(0,4)]
            
        elif self.opponent_behaviour == 'load_policy':
            me = self.board.opponents[0]
            observation = self.get_observation(self.board, position=me.ships[0].position, player_id=me.id)
            
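            # self.obs is only refreshed on reset, so read the current turn from the board,
            # which board.next() advances on every step.
            if self.board.step == 0:
                time_step = ts.restart(observation)
            elif self.board.step >= self.config.episodeSteps-1:
                time_step = ts.termination(observation, reward=0)
            else:
                time_step = ts.transition(observation, reward=0)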
            
            time_step = time_step_ok(time_step)
            
            action_int = int(self.opponent_policy.action(time_step).action)
            action = self._action_def[action_int]

        return action

In [None]:
train_py_env = HaliteWrapper()
eval_py_env = HaliteWrapper()

train_py_env.set_opponent_behaviour(behaviour='random_moving', policy_file=None)
eval_py_env.set_opponent_behaviour(behaviour='random_moving', policy_file=None)

train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)

# Agent

In [None]:
preprocessing_layers = tf.keras.models.Sequential()
preprocessing_layers.add(tf.keras.layers.Conv2D(filters=128, kernel_size=(3,3), input_shape=train_py_env.observation_shape, 
                                                padding='same', activation='relu'))
preprocessing_layers.add(tf.keras.layers.Flatten())

fc_layer_params = (128,128)

q_net = categorical_q_network.CategoricalQNetwork(
    input_tensor_spec=train_env.observation_spec(),
    action_spec=train_env.action_spec(),
    preprocessing_layers=preprocessing_layers,
    num_atoms=num_atoms, 
    fc_layer_params=fc_layer_params)

In [None]:
optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate)

train_step_counter = tf.Variable(0)

agent = categorical_dqn_agent.CategoricalDqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    categorical_q_network=q_net,
    optimizer=optimizer,
    min_q_value=min_q_value, 
    max_q_value=max_q_value,
    epsilon_greedy=epsilon_greedy,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)

agent.initialize()

In [None]:
eval_policy = agent.policy
collect_policy = agent.collect_policy
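
The difference between the two policies, as I understand it: `agent.policy` always picks the action with the highest Q-value, while `agent.collect_policy` picks a random action with probability `epsilon_greedy` in order to explore. A minimal pure-Python sketch of that rule (not the tf-agents implementation; the function names are illustrative):

```python
import random


def greedy_action(q_values):
    """Index of the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy_action(q_values, epsilon, rng):
    """Random action with probability epsilon, greedy action otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


rng = random.Random(0)
q_values = [0.1, 0.5, 0.2, 0.0, 0.3]  # one made-up value per ship action
picks = [epsilon_greedy_action(q_values, 0.1, rng) for _ in range(1000)]
share = picks.count(1) / len(picks)
print(greedy_action(q_values), share)  # the greedy action is picked roughly 92% of the time
```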

In [None]:
random_policy = random_tf_policy.RandomTFPolicy(train_env.time_step_spec(),
                                                train_env.action_spec())

In [None]:
def compute_avg_return(environment, policy, num_episodes=10):

    total_return = 0.0
    for _ in range(num_episodes):
        time_step = environment.reset()
        episode_return = 0.0

        while not time_step.is_last():
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
        total_return += episode_return

    avg_return = total_return / num_episodes
    return avg_return.numpy()[0]

In [None]:
compute_avg_return(eval_env, random_policy, num_eval_episodes)

# Collect data

In [None]:
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=replay_buffer_max_length)

def collect_step(environment, policy, buffer):
    time_step = environment.current_time_step()
    action_step = policy.action(time_step)
    next_time_step = environment.step(action_step.action)
    traj = trajectory.from_transition(time_step, action_step, next_time_step)

    # Add trajectory to the replay buffer
    buffer.add_batch(traj)

def collect_data(env, policy, buffer, steps):
    for _ in range(steps):
        collect_step(env, policy, buffer)

collect_data(train_env, random_policy, replay_buffer, initial_collect_steps)

# Dataset generates trajectories with shape [Bx2x...]
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3, 
    sample_batch_size=batch_size, 
    num_steps=2).prefetch(3)

dataset

In [None]:
iterator = iter(dataset)

print(iterator)
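
Why `num_steps=2`? The DQN update needs a step and its successor to form a transition, so the dataset samples pairs of adjacent steps from the buffer. The toy buffer below sketches that idea in plain Python. `TinyReplayBuffer` is a hypothetical class written for this example, and it ignores details such as episode boundaries and batching that the real buffer has to handle.

```python
import random
from collections import deque


class TinyReplayBuffer:
    """Fixed-size buffer that samples pairs of consecutive frames uniformly."""

    def __init__(self, max_length):
        # Oldest frames are dropped automatically once max_length is reached.
        self._frames = deque(maxlen=max_length)

    def add(self, frame):
        self._frames.append(frame)

    def sample_pairs(self, batch_size, rng):
        # Pick a start index uniformly, then return it with its successor.
        starts = [rng.randrange(len(self._frames) - 1) for _ in range(batch_size)]
        return [(self._frames[i], self._frames[i + 1]) for i in starts]


buffer = TinyReplayBuffer(max_length=5)
for t in range(8):
    buffer.add(t)            # frames 0..2 are evicted, 3..7 remain

batch = buffer.sample_pairs(batch_size=4, rng=random.Random(0))
print(batch)                 # four (t, t + 1) pairs drawn from frames 3..7
```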

# Train agent

In [None]:
%%time

# (Optional) Optimize by wrapping some of the code in a graph using TF function.
agent.train = common.function(agent.train)

# Reset the train step
agent.train_step_counter.assign(0)

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
returns = [avg_return]

for _ in range(num_iterations):

    # Collect a few steps using collect_policy and save to the replay buffer.
    collect_data(train_env, agent.collect_policy, replay_buffer, collect_steps_per_iteration)

    # Sample a batch of data from the buffer and update the agent's network.
    experience, unused_info = next(iterator)
    train_loss = agent.train(experience).loss

    step = agent.train_step_counter.numpy()

    if step % log_interval == 0:
        print('step = {0}: loss = {1}'.format(step, train_loss))

    if step % eval_interval == 0:
        avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
        print('step = {0}: Average Return = {1}'.format(step, avg_return))
        returns.append(avg_return)

In [None]:
iterations = range(0, num_iterations + 1, eval_interval)
plt.plot(iterations, returns)
plt.ylabel('Average Return')
plt.xlabel('Iterations')
plt.ylim(top=6000)

# plt.savefig('return_per_iteration.png', bbox_inches='tight')

# Watch our agent play the game !

In [None]:
def my_rl_agent(obs, config):
    board = Board(obs, config)
    me = board.current_player
    
    observation = train_py_env.get_observation(board, position=me.ships[0].position, player_id=me.id)
    
    if obs.step == 0:
        time_step = ts.restart(observation)
    elif obs.step >= config.episodeSteps-1:
        time_step = ts.termination(observation, reward=0)
    else:
        time_step = ts.transition(observation, reward=0)
    
    time_step = time_step_ok(time_step)
    
    action_int = int(eval_policy.action(time_step).action)
    action = train_py_env._action_def[action_int]
    
    for ship in me.ships:
        ship.next_action = action

    return me.next_actions

In [None]:
def random_ship_agent(obs, config):
    board = Board(obs, config)
    me = board.current_player
    
    for ship in me.ships:
        ship.next_action = train_py_env._action_def[randint(0,4)]
    
    return me.next_actions

def static_ship_agent(obs, config):
    board = Board(obs, config)
    me = board.current_player
    
    return me.next_actions

def random_moving_ship_agent(obs, config):
    board = Board(obs, config)
    me = board.current_player
    
    for ship in me.ships:
        ship.next_action = train_py_env._action_def[randint(1,4)]
    
    return me.next_actions

In [None]:
train_py_env._reset()
env = train_py_env.env
env.run([my_rl_agent, random_moving_ship_agent])
env.render(mode='ipython', width=400, height=300)

# Save the agent

In [None]:
# Use this to save the agent policy and load it as an opponent 
saver = PolicySaver(agent.policy, batch_size=None)
# saver.save('agent_policy') 

# You can then use it as an opponent to train a second agent by calling: 
# train_py_env.set_opponent_behaviour(behaviour='load_policy', policy_file='agent_policy')
# eval_py_env.set_opponent_behaviour(behaviour='load_policy', policy_file='agent_policy')
# instead of the random_moving behaviour defined here
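
Saved policies could also be the starting point for the "league training" idea mentioned in the introduction. The sketch below shows one possible way to choose an opponent for each game among saved checkpoints. The checkpoint names, the `sample_opponent` helper and the 50/50 split between the latest and earlier versions are assumptions made for this illustration. I have not tested this scheme.

```python
import random


def sample_opponent(checkpoints, rng, latest_weight=0.5):
    """Pick the latest checkpoint with probability latest_weight, else an earlier one uniformly."""
    if len(checkpoints) == 1 or rng.random() < latest_weight:
        return checkpoints[-1]
    return rng.choice(checkpoints[:-1])


# Hypothetical directories, each written by saver.save(...) at a different training stage.
league = ["agent_policy_v1", "agent_policy_v2", "agent_policy_v3"]

rng = random.Random(0)
draws = [sample_opponent(league, rng) for _ in range(1000)]
print({name: draws.count(name) for name in league})
```

Each sampled name could then be passed as `policy_file` to `set_opponent_behaviour(behaviour='load_policy', ...)` before the next batch of games.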