## Designing game AI with Reinforcement learning

![game](http://www.andreykurenkov.com/writing/images/2016-4-15-a-brief-history-of-game-ai/5-samuel.jpg)
> On February 24, 1956, Arthur Samuel’s Checkers program, which was developed for play on the IBM 701, was demonstrated to the public on television.

## Introduction
In this notebook, we design a neural network that learns to play a game. The game is **Halite** by **Two Sigma**. It is a resource management game where you build and control a small armada of ships. Your algorithms determine their movements to collect halite, a luminous energy source. The player with the most halite at the end of the match wins, but it's up to you to figure out how to make effective and efficient moves. You control your fleet, build new ships, create shipyards, and mine the regenerating halite on the game board.<br>
We will use an Actor-Critic agent as our base reinforcement learning model. The agent's job is to predict moves that steer a ship to collect halite and deposit it in the shipyard.

## Rules of the Game
For the details of the rules of the game, I recommend the notebook [Getting started with Halite](https://www.kaggle.com/alexisbcook/getting-started-with-halite) by [Alexis Cook](https://www.kaggle.com/alexisbcook). She has elaborated on and explained the rules of the game very well in her notebook.

## What is Reinforcement Learning?
Reinforcement Learning is the science of making optimal decisions using experience. It is best grouped alongside supervised learning and unsupervised learning as a learning paradigm, rather than alongside Machine Learning and Deep Learning as a whole. Reinforcement learning deals with the interaction between an agent and an environment through actions, and it does not depend on labeled or unlabeled data. (There is also a category called **Semi-supervised learning**, which is a hybrid of supervised and unsupervised learning.)<br><br>
The word "reinforce" means strengthen or support (an object or substance), especially with additional material.
<br>
But what are we strengthening here, and what does the strengthening?<br>
We are trying to strengthen the ability of an **agent** to understand its environment. Something similar happens in machine learning and deep learning, where a model learns patterns from training data while minimizing a loss and improving accuracy. In reinforcement learning, the factor that strengthens learning is the **Reward**. The agent receives a high positive reward for making a correct decision and is penalized for making a wrong one. It should also receive a slight negative reward at every time step in which it has not yet made a correct decision. The penalty is only "slight" because we would rather the agent take more time to decide than make the wrong decision.
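
The reward scheme described above can be sketched as a tiny function. The name `shaped_reward` and all numeric values are hypothetical, chosen only to illustrate the idea; they are not the rewards used later in this notebook.

```python
def shaped_reward(outcome, correct=1.0, wrong=-1.0, step_penalty=-0.01):
    """Return an illustrative reward for one time step.

    outcome: "correct", "wrong", or "pending" (no correct decision yet).
    All numeric values are assumptions for illustration only.
    """
    if outcome == "correct":
        return correct
    if outcome == "wrong":
        return wrong
    return step_penalty  # small cost for each step without a correct decision


# Waiting three steps and then deciding correctly still scores higher
# than deciding wrongly straight away.
patient = sum(shaped_reward(o) for o in ["pending", "pending", "pending", "correct"])
hasty = shaped_reward("wrong")
print(patient > hasty)
```

Because the step penalty is much smaller than the penalty for a wrong decision, a slow but correct agent is preferred over a fast but wrong one.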

Now, let's talk about some of the most important terms, like agent, policy, and state.
<br>
In reinforcement learning, an **Agent** is a self-learning model that learns from its interaction with the environment. The agent wants to achieve some kind of **goal** within that environment while it interacts with it. This interaction is divided into time steps. In each time step, the agent performs an **action**. This action changes the **state** of the environment, and based on how successful it was, the agent gets a certain **reward**. This way the agent learns which actions should be performed and which shouldn't in a given environment state.

At each time step, the agent takes an action on the environment based on its policy $\pi(a_t|s_t)$, where $s_t$ is the current observation from the environment, and receives a reward $r_{t+1}$ and the next observation $s_{t+1}$ from the environment. The goal is to improve the policy so as to maximize the sum of rewards (return). 
> a policy is an agent's strategy.
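
The "sum of rewards" is usually discounted, so that rewards far in the future count less: $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$ with a discount factor $0 \le \gamma \le 1$. Below is a minimal sketch of that computation in plain Python; the function name `discounted_returns` and the toy reward list are assumptions made for illustration.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_{t+1} + gamma * r_{t+2} + ... for every time step t.

    rewards[t] holds the reward received after the action at step t.
    """
    returns = []
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


rewards = [1.0, 0.0, 2.0]
print(discounted_returns(rewards, gamma=0.5))  # [1.5, 1.0, 2.0]
```

The last entry is just the final reward, and each earlier entry adds its own reward to half of the return that follows it.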

## What is Actor-Critic agent?
Before jumping into the concept of the Actor-Critic agent, I recommend getting some basic knowledge of Q-Learning, followed by deep Q-Learning, because without these two the significance and necessity of the Actor-Critic agent will be hard to appreciate.
<br>
In short,<br>
As an agent takes actions and moves through an environment, it learns to map the observed state of the environment to two possible outputs:
* Recommended action: A probability value for each action in the action space. The part of the agent responsible for this output is called the **actor**.
* Estimated rewards in the future: Sum of all rewards it expects to receive in the future. The part of the agent responsible for this output is the **critic**.

The actor and the critic learn to perform their tasks such that the recommended actions from the actor maximize the rewards.<br>
Source - [Keras.io](https://keras.io/examples/rl/actor_critic_cartpole/)
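
To make the two outputs concrete, here is a minimal sketch in plain Python of how a single time step could contribute to the actor and critic losses. The helper names `huber` and `actor_critic_losses` and the example numbers are assumptions for illustration; the notebook itself computes these quantities with TensorFlow further down.

```python
import math
import random


def huber(error, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    a = abs(error)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def actor_critic_losses(action_probs, action, value, ret):
    """Return (actor_loss, critic_loss) for one time step.

    action_probs: probabilities produced by the actor
    action:       index of the action that was actually taken
    value:        the critic's estimate of the future reward
    ret:          the return that was actually observed
    """
    advantage = ret - value  # how much better things went than predicted
    actor_loss = -math.log(action_probs[action]) * advantage
    critic_loss = huber(value - ret)
    return actor_loss, critic_loss


probs = [0.1, 0.2, 0.4, 0.2, 0.1]
# The actor samples an action according to its probabilities.
action = random.choices(range(len(probs)), weights=probs, k=1)[0]
print(actor_critic_losses(probs, action, value=0.5, ret=1.0))
```

A positive advantage means the action turned out better than the critic expected, so minimizing the actor loss raises the probability of that action.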

## Implementation

In [None]:
!pip install kaggle-environments --upgrade

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import sys
import PIL.Image

import tensorflow as tf
import logging

from sklearn import preprocessing
import random
import matplotlib.pyplot as plt
import seaborn as sns

from kaggle_environments import evaluate, make
from kaggle_environments.envs.halite.helpers import *


In [None]:
seed = 123
# Seed every random number generator used in this notebook
random.seed(seed)
np.random.seed(seed)
tf.compat.v1.set_random_seed(seed)
session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_conf)
tf.compat.v1.keras.backend.set_session(sess)
# Silence log output from the libraries
logging.disable(sys.maxsize)

## Analyzing the environment
Let's take a tour of our environment and its settings first.

In [None]:
env = make("halite", debug=True)
env.run(["random"])
env.render(mode="ipython",width=800, height=600)

In [None]:
env.configuration

In [None]:
env.specification

In [None]:
env.specification.reward

In [None]:
env.specification.action

In [None]:
env.specification.observation

## The game begins
So let's run a simple rule-based agent against an opponent that takes random actions and see what happens...

In [None]:
def getDirTo(fromPos, toPos, size):
    # Positions are (x, y) pairs; the modulo keeps coordinates on the board
    fromX, fromY = fromPos[0] % size, fromPos[1] % size
    toX, toY = toPos[0] % size, toPos[1] % size
    if fromY < toY: return ShipAction.NORTH
    if fromY > toY: return ShipAction.SOUTH
    if fromX < toX: return ShipAction.EAST
    if fromX > toX: return ShipAction.WEST
    return None  # Already at the target position

# Directions a ship can move
directions = [ShipAction.NORTH, ShipAction.EAST, ShipAction.SOUTH, ShipAction.WEST]

# Will keep track of whether a ship is collecting halite or carrying cargo to a shipyard
ship_states = {}

# Returns the commands we send to our ships and shipyards
def simple_agent(obs, config):
    size = config.size
    board = Board(obs, config)
    me = board.current_player
    # If there are no ships, use first shipyard to spawn a ship.
    if len(me.ships) == 0 and len(me.shipyards) > 0:
        me.shipyards[0].next_action = ShipyardAction.SPAWN

    # If there are no shipyards, convert first ship into shipyard.
    if len(me.shipyards) == 0 and len(me.ships) > 0:
        me.ships[0].next_action = ShipAction.CONVERT
    
    for ship in me.ships:
        if ship.next_action is None:
            
            ### Part 1: Set the ship's state 
            if ship.halite < 200: # If cargo is too low, collect halite
                ship_states[ship.id] = "COLLECT"
            if ship.halite > 500: # If cargo gets very big, deposit halite
                ship_states[ship.id] = "DEPOSIT"
                
            ### Part 2: Use the ship's state to select an action
            if ship_states[ship.id] == "COLLECT":
                # If halite at current location running low, 
                # move to the adjacent square containing the most halite
                if ship.cell.halite < 100:
                    neighbors = [ship.cell.north.halite, ship.cell.east.halite, 
                                 ship.cell.south.halite, ship.cell.west.halite]
                    best = max(range(len(neighbors)), key=neighbors.__getitem__)
                    ship.next_action = directions[best]
            if ship_states[ship.id] == "DEPOSIT":
                # Move towards shipyard to deposit cargo
                direction = getDirTo(ship.position, me.shipyards[0].position, size)
                if direction: ship.next_action = direction
                
    return me.next_actions

In [None]:
trainer = env.train([None, "random"])
observation = trainer.reset()
while not env.done:
    my_action = simple_agent(observation, env.configuration)
    print("My Action", my_action)
    observation = trainer.step(my_action)[0]
    print("Reward gained",observation.players[0][0])

In [None]:
env.render(mode="ipython",width=800, height=600)

## Our objective
As the results show, the yellow ships on the left-hand side show almost no movement, whereas the red ships on the right-hand side make some smart movements: they collect halite, deposit it in the shipyard, and spawn accordingly. Our objective is to train the yellow ships through reinforcement learning and build an AI model that performs this task as efficiently as possible.

## The Obstacles
I am not a mastermind of reinforcement learning, so I faced some problems. I discuss them here, along with how I tried to solve some of them.
* Controlling only one ship - I have written this program so that the agent can control only one ship at a time, which means I have disabled the spawning of multiple ships.
* 5 Moves Game - Because of the first problem, I have limited the game to predicting the 4 directions (East, West, North, South) and predicting when to convert the ship into a shipyard. I have removed SPAWN from the predictions because I have not yet found a way to control multiple ships through an Actor-Critic agent.
* Deposit to the last shipyard - If there is no ship, a new ship spawns from the most recently built shipyard, and halite is deposited at that same shipyard. It would have been better to find the nearest shipyard for depositing the collected halite.
![chess](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Ftse1.mm.bing.net%2Fth%3Fid%3DOIP.GTWpPAXsc0-kjWXyEqpGywHaEt%26pid%3DApi&f=1)
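
One possible way to address the third limitation is sketched below: pick the shipyard with the smallest Manhattan distance, taking into account that the Halite board wraps around at its edges. The helpers `wrapped_distance` and `nearest_position` are hypothetical names, positions are assumed to be plain `(x, y)` pairs, and this idea is not used in the agent that follows.

```python
def wrapped_distance(a, b, size):
    """Manhattan distance between (x, y) points on a board that wraps around."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    # Going "around the edge" can be shorter than the direct route.
    return min(dx, size - dx) + min(dy, size - dy)


def nearest_position(origin, candidates, size):
    """Return the candidate position closest to origin."""
    return min(candidates, key=lambda p: wrapped_distance(origin, p, size))


ship_position = (1, 1)
shipyard_positions = [(10, 10), (20, 1), (5, 5)]
print(nearest_position(ship_position, shipyard_positions, size=21))  # (20, 1)
```

With the Halite helpers, the same function could presumably be called with `ship.position` and `[shipyard.position for shipyard in me.shipyards]`, since positions behave like `(x, y)` pairs.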

## The Actor-Critic model

In [None]:
def ActorModel(num_actions,in_):
    common = tf.keras.layers.Dense(128, activation='tanh')(in_)
    common = tf.keras.layers.Dense(32, activation='tanh')(common)
    common = tf.keras.layers.Dense(num_actions, activation='softmax')(common)
    
    return common

In [None]:
def CriticModel(in_):
    common = tf.keras.layers.Dense(128)(in_)
    common = tf.keras.layers.ReLU()(common)
    common = tf.keras.layers.Dense(32)(common)
    common = tf.keras.layers.ReLU()(common)
    common = tf.keras.layers.Dense(1)(common)
    
    return common

In [None]:
input_ = tf.keras.layers.Input(shape=[441,])
model = tf.keras.Model(inputs=input_, outputs=[ActorModel(5,input_),CriticModel(input_)])
model.summary()
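
The input size of 441 corresponds to the default 21 x 21 Halite board flattened into one list (21 * 21 = 441), which is how the observation stores the halite on each cell. As a sketch, the flat list can be turned back into rows like this; `to_grid` is a hypothetical helper, the values are made up, and row-by-row ordering is an assumption.

```python
def to_grid(flat, size):
    """Split a flat list of size*size values into `size` rows."""
    if len(flat) != size * size:
        raise ValueError("expected %d values, got %d" % (size * size, len(flat)))
    return [flat[row * size:(row + 1) * size] for row in range(size)]


size = 21
flat_halite = [float(i) for i in range(size * size)]  # stand-in for the halite list
grid = to_grid(flat_halite, size)
print(len(flat_halite), len(grid), len(grid[0]))  # 441 21 21
```

The dense layers used here ignore this 2D structure and simply see 441 numbers, which keeps the model small but discards information about which cells are neighbors.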

In [None]:
# `learning_rate` replaces the deprecated `lr` argument
optimizer = tf.keras.optimizers.Adam(learning_rate=7e-4)

In [None]:
huber_loss = tf.keras.losses.Huber()
action_probs_history = []
critic_value_history = []
rewards_history = []
running_reward = 0
episode_count = 0
num_actions = 5
eps = np.finfo(np.float32).eps.item()
gamma = 0.99  # Discount factor for past rewards
env = make("halite", debug=True)
trainer = env.train([None,"random"])

## Encoding our moves

In [None]:
le = preprocessing.LabelEncoder()
label_encoded = le.fit_transform(['NORTH', 'SOUTH', 'EAST', 'WEST', 'CONVERT'])
label_encoded
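
`LabelEncoder` assigns integer codes in the sorted order of the labels, not in the order they were passed in. The plain-Python sketch below reproduces that mapping without scikit-learn, to show which network output index corresponds to which move; the dictionary names `encode` and `decode` are assumptions for illustration.

```python
labels = ['NORTH', 'SOUTH', 'EAST', 'WEST', 'CONVERT']

# Sorted, de-duplicated labels play the role of the encoder's classes.
classes = sorted(set(labels))
encode = {name: index for index, name in enumerate(classes)}
decode = {index: name for name, index in encode.items()}

print(encode)  # {'CONVERT': 0, 'EAST': 1, 'NORTH': 2, 'SOUTH': 3, 'WEST': 4}
print([encode[name] for name in labels])  # [2, 3, 1, 4, 0]
print(decode[0])  # CONVERT
```

So when the actor samples action `0`, the agent decodes it as `CONVERT`, and actions `1` to `4` are the four directions in alphabetical order.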

In [None]:
def getDirTo(fromPos, toPos, size):
    # Positions are (x, y) pairs; the modulo keeps coordinates on the board
    fromX, fromY = fromPos[0] % size, fromPos[1] % size
    toX, toY = toPos[0] % size, toPos[1] % size
    if fromY < toY: return ShipAction.NORTH
    if fromY > toY: return ShipAction.SOUTH
    if fromX < toX: return ShipAction.EAST
    if fromX > toX: return ShipAction.WEST
    return None  # Already at the target position

# Directions a ship can move
directions = [ShipAction.NORTH, ShipAction.EAST, ShipAction.SOUTH, ShipAction.WEST]
   
def decodeDir(act_):
    if act_ == 'NORTH':return directions[0]
    if act_ == 'EAST':return directions[1]
    if act_ == 'SOUTH':return directions[2]
    if act_ == 'WEST':return directions[3]
    
# Will keep track of whether a ship is collecting halite or carrying cargo to a shipyard
ship_states = {}
ship_ = 0
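def update_L1():
    # NOTE: `ship_` is not declared global here, so this line raises
    # UnboundLocalError. The try/except in advanced_agent swallows it,
    # so ship_ stays 0 and shipyards[ship_-1] is the last shipyard in the list.
    ship_+=1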
# Returns the commands we send to our ships and shipyards
def advanced_agent(obs, config, action):
    size = config.size
    board = Board(obs, config)
    me = board.current_player 
    act = le.inverse_transform([action])[0]
    global ship_
    
    # If there are no ships, use a shipyard to spawn a ship.
    if len(me.ships) == 0 and len(me.shipyards) > 0:
        me.shipyards[ship_-1].next_action = ShipyardAction.SPAWN

    # If there are no shipyards, convert first ship into shipyard.
    if len(me.shipyards) == 0 and len(me.ships) > 0 and ship_==0:
        me.ships[0].next_action = ShipAction.CONVERT   
    try: 
        if act=='CONVERT':
            me.ships[0].next_action = ShipAction.CONVERT
            update_L1()
            if len(me.ships)==0 and len(me.shipyards) > 0:
                me.shipyards[ship_-1].next_action = ShipyardAction.SPAWN
        if me.ships[0].halite < 200:
            ship_states[me.ships[0].id] = 'COLLECT'
        if me.ships[0].halite > 800:
            ship_states[me.ships[0].id] = 'DEPOSIT' 

        if ship_states[me.ships[0].id] == 'COLLECT': 
            if me.ships[0].cell.halite < 100:
                me.ships[0].next_action = decodeDir(act)
        if ship_states[me.ships[0].id] == 'DEPOSIT':
            # Move towards shipyard to deposit cargo
            direction = getDirTo(me.ships[0].position, me.shipyards[ship_-1].position, size)
            if direction: me.ships[0].next_action = direction
    except Exception:
        # Deliberately ignore errors such as a missing ship or an unset ship state
        pass
                
    return me.next_actions

In [None]:
while not env.done:    
    state = trainer.reset()
    episode_reward = 0
    with tf.GradientTape() as tape:
        for timestep in range(1,env.configuration.episodeSteps+200):
            # Use the flattened halite map from the observation as the network input
            state_ = tf.convert_to_tensor(state.halite)
            state_ = tf.expand_dims(state_, 0)
            # Predict action probabilities and estimated future rewards
            # from environment state
            action_probs, critic_value = model(state_)
            critic_value_history.append(critic_value[0, 0])
            
            # Sample action from action probability distribution
            action = np.random.choice(num_actions, p=np.squeeze(action_probs))
            action_probs_history.append(tf.math.log(action_probs[0, action]))
            
            # Apply the sampled action in our environment
            action = advanced_agent(state, env.configuration, action)
            state = trainer.step(action)[0]
            gain=state.players[0][0]/5000
            rewards_history.append(gain)
            episode_reward += gain
            
            if env.done:
                state = trainer.reset() 
        # Update running reward to check condition for solving
        running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward

        # Calculate expected value from rewards
        # - At each timestep what was the total reward received after that timestep
        # - Rewards in the past are discounted by multiplying them with gamma
        # - These are the labels for our critic
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)
        # Normalize
        returns = np.array(returns)
        returns = (returns - np.mean(returns)) / (np.std(returns) + eps)
        returns = returns.tolist()
        # Calculating loss values to update our network
        history = zip(action_probs_history, critic_value_history, returns)
        actor_losses = []
        critic_losses = []
        for log_prob, value, ret in history:
            # At this point in history, the critic estimated that we would get a
            # total reward = `value` in the future. We took an action with log probability
            # of `log_prob` and ended up receiving a total reward = `ret`.
            # The actor must be updated so that it predicts an action that leads to
            # high rewards (compared to critic's estimate) with high probability.
            diff = ret - value
            actor_losses.append(-log_prob * diff)  # actor loss

            # The critic must be updated so that it predicts a better estimate of
            # the future rewards.
            critic_losses.append(
                huber_loss(tf.expand_dims(value, 0), tf.expand_dims(ret, 0))
            )
        # Backpropagation
        loss_value = sum(actor_losses) + sum(critic_losses)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        
        # Clear the loss and reward history
        action_probs_history.clear()
        critic_value_history.clear()
        rewards_history.clear()
        
    # Log details
    episode_count += 1
    if episode_count % 10 == 0:
        template = "running reward: {:.2f} at episode {}"
        print(template.format(running_reward, episode_count))

    if running_reward > 550:  # Condition to consider the task solved
        print("Solved at episode {}!".format(episode_count))
        break
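
Two small pieces of the training loop above are easier to see in isolation: normalizing the returns, and smoothing the episode reward into a running reward. The sketch below mirrors both in plain Python; the function names and the example numbers are assumptions for illustration.

```python
import statistics


def normalize(returns, eps=1e-7):
    """Shift returns to zero mean and scale them to unit standard deviation."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)  # population std, like np.std
    return [(g - mean) / (std + eps) for g in returns]


def update_running_reward(running, episode_reward, alpha=0.05):
    """Exponential moving average used to track progress across episodes."""
    return alpha * episode_reward + (1 - alpha) * running


normalized = normalize([1.0, 2.0, 3.0])
print(normalized)

running = 0.0
for _ in range(3):
    running = update_running_reward(running, 100.0)
print(round(running, 4))
```

Normalizing keeps the scale of the loss similar from episode to episode, and with `alpha = 0.05` the running reward moves only 5% of the way towards each new episode reward, so it takes many good episodes to cross the "solved" threshold.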

In [None]:
# Play one fresh game with the trained model so the result can be rendered
state = trainer.reset()
while not env.done:
    state_ = tf.convert_to_tensor(state.halite)
    state_ = tf.expand_dims(state_, 0)
    action_probs, critic_value = model(state_)
    # Sample an action from the actor's probabilities; no training history is needed here
    action = np.random.choice(num_actions, p=np.squeeze(action_probs))
    action = advanced_agent(state, env.configuration, action)
    state = trainer.step(action)[0]

## Results
The yellow ships and shipyards are controlled by our trained actor-critic model, and the red ships and shipyards are controlled by the built-in random agent.

In [None]:
env.render(mode="ipython",width=800, height=600)

## Conclusion
Hey!! Our ship is performing pretty smartly, you see....
<br><br>
Our ship collects halite, converts into a shipyard, and a new ship is spawned when no ship is available. In simple words, we have successfully trained our agent to direct the ship to collect halite efficiently. I wish I could have found a way to track the nearest shipyard for depositing the collected halite, or a way to control multiple ships through reinforcement learning. There are research papers on multi-agent reinforcement learning and multi-goal reinforcement learning, which I think could be more useful for solving this problem. For now my knowledge is limited to the Actor-Critic agent, and I will study more to find a better solution than this.

## Thank you.
![quote](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fquotefancy.com%2Fmedia%2Fwallpaper%2F3840x2160%2F1741586-Magnus-Carlsen-Quote-Some-people-think-that-if-their-opponent.jpg&f=1&nofb=1)
<br><br>
### Please <span style="color:red">Up-Vote</span> and <span style="color:red">Share</span> this notebook if you like it or find the content informative. Also, let me know your opinions and suggestions in the comment section below.