# Capture the Flag (RL - Policy Gradient)

- Seung Hyun Kim
- skim449@illinois.edu

## Implementation Details

- Simple policy gradient (REINFORCE) with an experience buffer.
- The network implementation is slightly different:
    - Cleaner code for mini-batch updates
    - Self-play for the red team
    - 19x19 vision window
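
As a rough illustration of the policy-gradient objective named above, the sketch below computes the vanilla REINFORCE surrogate loss in NumPy. It is not the code inside `network.REINFORCE`; the function name `reinforce_loss` and its arguments are assumptions made only for this example.

```python
import numpy as np

def reinforce_loss(action_probs, actions, returns):
    """Vanilla policy-gradient surrogate loss: -mean(log pi(a|s) * G).

    Hypothetical helper for illustration; not part of the notebook's network.
    """
    action_probs = np.asarray(action_probs, dtype=np.float64)
    actions = np.asarray(actions, dtype=np.int64)
    returns = np.asarray(returns, dtype=np.float64)
    # Probability the policy assigned to the action that was actually taken.
    chosen = action_probs[np.arange(len(actions)), actions]
    # A small epsilon guards against log(0).
    return -np.mean(np.log(chosen + 1e-8) * returns)

# Two steps, two actions: the second step took an unlikely action
# that nevertheless earned the larger return.
probs = np.array([[0.5, 0.5], [0.9, 0.1]])
loss = reinforce_loss(probs, actions=[0, 1], returns=[1.0, 2.0])
```

Minimizing this loss raises the log-probability of actions in proportion to the return that followed them.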

### Sampling
- [x] Mini-batch to update 'average' gradient
- [x] Experience Replay for Random Sampling
- [ ] Importance Sampling
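
The two checked items can be pictured with a small replay buffer that stores transitions and returns uniformly sampled mini-batches. This is a minimal sketch; the notebook's own `Experience_buffer` lives in `utility.utils` and may behave differently. The class name `ReplayBuffer`, its `capacity` argument, and the drop-oldest policy are assumptions.

```python
import random

class ReplayBuffer:
    """Fixed-capacity buffer with uniform random sampling (illustrative only)."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.storage = []
        self.rng = random.Random(seed)

    def add(self, transitions):
        # Append new transitions, then drop the oldest ones beyond capacity.
        self.storage.extend(transitions)
        overflow = len(self.storage) - self.capacity
        if overflow > 0:
            del self.storage[:overflow]

    def sample(self, batch_size):
        # Sample without replacement; never ask for more than is stored.
        k = min(batch_size, len(self.storage))
        return self.rng.sample(self.storage, k)

    def flush(self):
        self.storage.clear()

buffer = ReplayBuffer(capacity=5, seed=0)
buffer.add([(i, 0, 1.0) for i in range(8)])  # only the 5 newest are kept
batch = buffer.sample(3)
```

Random sampling breaks the temporal correlation between consecutive frames of one episode, which is the usual motivation for replay.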
    
### Deterministic Policy Gradient
- [ ] DDPG
- [ ] MADDPG

### Stability and Reducing Variance
- [ ] Target Network
- [ ] TRPO
- [ ] PPO

### Multiprocessing
- [ ] Synchronous Environment Rolling
- [ ] Synchronous Training (A2C)
- [ ] Asynchronous Training (A3C)

### Applied Training Methods:
- [x] Self-play
- [ ] Batch Policy
- [x] Variable Reward

## Notes

- This notebook includes:
    - Building the structure of the policy-driven network.
    - Training with or without rendering.
    - A saver that saves the model and weights to the ./model directory.
    - A writer that records the necessary data to ./logs.

- This notebook does not include:
    - Simulation with the RL policy
        - The simulation can be run using policy_RL.py
    - cap_test.py is changed appropriately.
    
## References :
- https://github.com/awjuliani/DeepRL-Agents/blob/master/Vanilla-Policy.ipynb (source)
- https://www.youtube.com/watch?v=PDbXPBwOavc
- https://github.com/lilianweng/deep-reinforcement-learning-gym/blob/master/playground/policies/actor_critic.py (source)
- https://github.com/spro/practical-pytorch/blob/master/reinforce-gridworld/reinforce-gridworld.ipynb

## TODO:

- enemy with different policies (zero, patrol)
- stochastic interaction
- Reward -> only 100 for completion (with small observation)

In [1]:
!rm -rf logs/B4R4_Rzero_FTW/ model/B4R4_Rzero_FTW

In [2]:
TRAIN_NAME='B4R4_Rzero_FTW'
LOG_PATH='./logs/'+TRAIN_NAME
MODEL_PATH='./model/' + TRAIN_NAME
GPU_CAPACITY=0.125 # fraction of GPU memory this process may use (0.125 = 12.5%)

In [3]:
import os

import signal

import tensorflow as tf
import tensorflow.contrib.slim as slim
import tensorflow.contrib.layers as layers
from tensorflow.python.client import device_lib
import matplotlib.pyplot as plt
%matplotlib inline

import time
from datetime import datetime
import gym
import gym_cap
import gym_cap.envs.const as CONST
import numpy as np
import random

# the modules that you can use to generate the policy.
import policy.patrol 
import policy.random
import policy.policy_RL
import policy.zeros

# Data Processing Module
from utility.dataModule import one_hot_encoder
from utility.utils import MovingAverage as MA
from utility.utils import Experience_buffer, discount_rewards

# Import Network
from network.REINFORCE import REINFORCE as RF

%load_ext autoreload
%autoreload 2

## Hyperparameters

In [4]:
# Training Related
max_ep = 150
update_frequency = 50
batch_size = 2000
experience_size=10000

# Saving Related
save_network_frequency = 1000
save_stat_frequency = 100
moving_average_step = 100

# Parameters
LEARNING_RATE = 1e-3
gamma = 0.99
MAP_SIZE = 10
VISION_RANGE = 9
VISION_dX, VISION_dY = 2*VISION_RANGE+1, 2*VISION_RANGE+1

## Environment Setting

In [5]:
if not os.path.exists(MODEL_PATH):
    os.makedirs(MODEL_PATH)
    
# Create a directory for the summary logs
if not os.path.exists(LOG_PATH):
    os.makedirs(LOG_PATH)

In [6]:
env = gym.make("cap-v0") # initialize the environment
policy_red = policy.zeros.PolicyGen(env.get_map, env.get_team_red)
#plt.imshow(env.render(mode='rgb_array'))

# Environment Related
action_space = 5
n_agent = len(env.get_team_blue)

print('red number : ', len(env.get_team_red))
print('blue number : ', len(env.get_team_blue))

red number :  4
blue number :  4


## Network Setting

### Reward Criteria:

- Ref.: https://arxiv.org/pdf/1807.01281.pdf (pp. 3, 27-28)

> Since game outcome as the only reward signal is too sparse for RL to be effective, we require rewards $r_t$ to direct the learning process towards winning yet are more frequently available than the game outcome. In our approach, we operationalise the idea that each agent has a dense internal reward function (60,61,74), by specifying $r_t = w(\rho_t)$ based on the available game points signals $\rho_t$

> w is optimised for winning probability through population based training, another level of training performed at yet a slower time scale than RL.

- Team-wise points

1. Team captured flag
2. Team not captured flag
3. Team captured enemy
4. Team died by enemy
5. Enemy captured flag
6. Enemy not captured flag

- Individual points

7. I'm on Blue
8. I'm on Red
9. I moved
10. I see flag
11. I see enemy
12. I see an ally

- Global

13. Time's up but lost
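
To make the idea of a dense reward $r_t = w(\rho_t)$ concrete, the sketch below treats $w$ as a plain weight vector applied to the 13 binary signals listed above. This is a simplification: the notebook's network learns a fully connected layer instead. The function name `dense_reward` is hypothetical, and the weights are illustrative values borrowed from the scoreboard numbers (100, 25, -25, -100).

```python
import numpy as np

def dense_reward(signals, weights):
    """Per-agent reward as a weighted sum of binary game signals.

    signals: array of shape [n_agents, n_signals] with 0/1 entries.
    weights: array of shape [n_signals].
    """
    signals = np.asarray(signals, dtype=np.float32)
    weights = np.asarray(weights, dtype=np.float32)
    return signals @ weights

# Rows follow the order of the 13 criteria above (illustrative values).
signals = np.array([
    [1, 0, 0, 0, 0, 1,  1, 0, 1, 0, 0, 1,  0],  # agent 0: team captured the flag
    [0, 1, 0, 1, 0, 1,  0, 1, 0, 0, 1, 0,  0],  # agent 1: team member was captured
])
weights = np.zeros(13, dtype=np.float32)
weights[0], weights[2], weights[3], weights[4] = 100.0, 25.0, -25.0, -100.0
rewards = dense_reward(signals, weights)
```

With these weights the first agent receives 100 and the second receives -25; all other signals carry zero weight until training adjusts them.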

In [7]:
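r_criteria = 13  # number of reward signals listed above
# NOTE: the two lists below hold 11 and 12 entries respectively, not r_criteria (13),
# and neither is referenced again in this notebook.
r_CtF_scoreboard = [100, 0, 25, -25, -100, 0, 0, 0, 0, 0, 0]
r_signal = [1,-1,1,-1,-1,1,1,-1,1,1,1,1]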

In [8]:
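def log_uniform_initializer(mu, std):
    # NOTE: despite the name, this draws from a log-normal distribution
    # (np.random.lognormal), so every initial weight is strictly positive.
    def _initializer(shape, dtype=None, partition_info=None):
        out = np.random.lognormal(mean=mu, sigma=std, size=shape).astype(np.float32)
        return tf.constant(out)
    return _initializer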

In [9]:
reward_update_freq = 10
class reward_signal:
    def __init__(self, lr):
        self.signals_holder = tf.placeholder(shape=[None, r_criteria], dtype = tf.float32)
        self.winning_holder = tf.placeholder(shape=[None], dtype=tf.float32)
        self.reward_matrix = layers.fully_connected(self.signals_holder, r_criteria, weights_initializer=log_uniform_initializer(0.1,10.0))
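        # Sum over the signal axis only, giving one reward per row (agent);
        # the rollout indexes the result per agent as r[idx].
        self.reward = tf.reduce_sum(self.reward_matrix, axis=1, name='reward')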

        self.loss = -tf.reduce_sum(tf.log(tf.reduce_sum(self.reward_matrix, 1))*self.winning_holder)
        self.optimizer=tf.train.AdamOptimizer(learning_rate=lr)
        self.update = self.optimizer.minimize(self.loss)
        
        tf.summary.histogram('signals', self.reward_matrix)
        
    @staticmethod
    def signal(s0, r, a, s1, agents, env, tubl=False):
        # Build one row of r_criteria binary signals per agent.
        # NOTE: the indexing below treats the first axis of each observation as the
        # channel axis; adjust it if one_hot_encoder returns channel-last data.
        center = VISION_RANGE // 2
        signals = []
        for idx, agent in enumerate(agents):
            signal = []
            # Team captured flag
            signal.append(env.blue_win)
            # Team not captured flag
            signal.append(not env.blue_win)
            # Team captured enemy
            signal.append(r > 0 and r < 100)
            # Team died by enemy
            signal.append(r < 0 and r > -100)
            # Enemy captured flag
            signal.append(env.red_win)
            # Enemy not captured flag
            signal.append(not env.red_win)
            
            # I'm on Blue BG
            signal.append(s1[idx][1][center][center] == 0)
            # I'm on Red BG
            signal.append(s1[idx][1][center][center] == 1)
            # I moved (need some more info)
            signal.append(r != 0)
            # I see flag
            signal.append(np.any(s1[idx][4] == -1))
            # I see enemy
            signal.append(np.any(s1[idx][2] == -1))
            # I see an ally
            signal.append(np.any(s1[idx][2] == 1))
            
            # time's up but lost
            signal.append(tubl)
            # Append inside the loop so that every agent contributes a row
            signals.append(signal)
        return np.stack(signals).astype(np.float32)

In [10]:
tf.reset_default_graph() # Clear the Tensorflow graph.
rs = reward_signal(lr=1e-4)
myAgent = RF(lr=LEARNING_RATE,in_size=[None,VISION_dX,VISION_dY,6],action_size=5,grad_clip_norm=0,trainable=True)
global_step = tf.Variable(0, trainable=False, name='global_step') # global step
increment_global_step_op = tf.assign(global_step, global_step+1)
merged = tf.summary.merge_all()

ValueError: None values not supported.

## Session

In [None]:
# Launch the session
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=GPU_CAPACITY, allow_growth=True)

sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
#sess = tf.Session()

ma_reward = MA(moving_average_step)
ma_length = MA(moving_average_step)
ma_captured = MA(reward_update_freq)

# Setup Save and Restore Network
saver = tf.train.Saver(tf.global_variables())
writer = tf.summary.FileWriter(LOG_PATH, sess.graph)

ckpt = tf.train.get_checkpoint_state(MODEL_PATH)
if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path):
    saver.restore(sess, ckpt.model_checkpoint_path)
    print("Load Model : ", ckpt.model_checkpoint_path)
else:
    sess.run(tf.global_variables_initializer())
    print("Initialized Variables")    

In [None]:
def record(summary_):
    with tf.device('/cpu:0'): 
        summary = tf.Summary()
        summary.value.add(tag='Records/mean_reward', simple_value=ma_reward())
        summary.value.add(tag='Records/mean_length', simple_value=ma_length())
        summary.value.add(tag='Records/mean_succeed', simple_value=ma_captured())
        writer.add_summary(summary, sess.run(global_step))
        
        #summary_str = sess.run(merged,feed_dict={myAgent.state_input:obs})
        writer.add_summary(summary_, sess.run(global_step))
        
        writer.flush()

In [None]:
def handler(signum, frame):
    raise Exception('Action took too much time')

In [None]:
def policy_rollout(PARTIAL=False):
    # Run single episode, return the results
    # Temporary fix for episode reset
    
    s = env.reset(map_size=MAP_SIZE, policy_red=policy_red)
        
    if PARTIAL:
        obs_next = one_hot_encoder(s, env.get_team_blue) # partial observation
    else:
        obs_next = one_hot_encoder(env._env, env.get_team_blue, VISION_RANGE) # Full observation
    
    ep_history = []
    indv_history = [[] for _ in range(len(env.get_team_blue))]
    
    was_alive = [ag.isAlive for ag in env.get_team_blue]
    prev_reward=0
    frame=0
    for frame in range(max_ep+1):
        obs = obs_next
        
        with tf.device('/cpu:0'):
            act_prob = sess.run(myAgent.output, feed_dict={myAgent.state_input:obs})
        act = [np.random.choice(action_space, p=act_prob[x]/sum(act_prob[x])) for x in range(n_agent)] # divide by sum : normalize
            
        s,r1,d,_ = env.step(act) # Step the environment with the sampled actions
        if PARTIAL:
            obs_next = one_hot_encoder(s, env.get_team_blue) # partial observation
        else:
            obs_next = one_hot_encoder(env._env, env.get_team_blue, VISION_RANGE) # Full observation
        
        rr = r1-prev_reward

        '''if frame == max_ep and d == False:
            #r -= frame * (30/1000)
            r = -100
            r1 = -100'''
        r_signals = reward_signal.signal(obs, rr, act, obs_next, env.get_team_blue, env, frame==max_ep and not d)
        r = sess.run(rs.reward, feed_dict={rs.signals_holder:r_signals})
        
        # Push history for individual that 'was' alive previous frame
        # (store only this agent's own signal row alongside its transition)
        for idx, agent in enumerate(env.get_team_blue):
            if was_alive[idx]:
                indv_history[idx].append([obs[idx],act[idx],r[idx],r_signals[idx]])
        
        # State Transition
        prev_reward = r1
        was_alive = [ag.isAlive for ag in env.get_team_blue]
        
        if d:
            break

    for idx, history in enumerate(indv_history):
        if len(history)==0: continue
        _history = np.array(history, dtype=object) # rows mix arrays and scalars, so keep an object array
        _history[:,2] = discount_rewards(_history[:,2], gamma)
        ep_history.extend(_history)
            
    if len(ep_history) > 0:        
        ep_history = np.stack(ep_history)
    
    return [frame, ep_history, r1, env.blue_win]
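
The rollout above replaces each agent's per-step rewards with discounted returns through `discount_rewards`, which is imported from `utility.utils` and not shown here. A minimal sketch of that computation, assuming the standard definition $G_t = r_t + \gamma G_{t+1}$, could look as follows; the name `discounted_returns` is hypothetical and the real helper may differ, for example by normalizing the result.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    rewards = np.asarray(rewards, dtype=np.float64)
    returns = np.zeros_like(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A single reward at the final step is propagated back with decay.
returns = discounted_returns([0.0, 0.0, 1.0], gamma=0.99)
# returns is approximately [0.9801, 0.99, 1.0]
```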

## Training

In [None]:
def run_training(num_ep):
    ep = sess.run(global_step)

    exp_buffer = Experience_buffer(experience_shape=4)
    try:
        progbar = tf.keras.utils.Progbar(num_ep,width=5, interval=0.5)
        for i in range(num_ep):
            progbar.update(i) # update progress bar
            ep += 1
            # Run episode
            frame, history, reward, did_won = policy_rollout(True)

            # Add history
            exp_buffer.add(history)

            batch_history = exp_buffer.sample(batch_size) # Sample from experience replay
            if len(batch_history) > 0:
                feed_dict={myAgent.reward_holder:batch_history[:,2],
                           myAgent.action_holder:batch_history[:,1],
                           myAgent.state_input:np.stack(batch_history[:,0])}
                with tf.device('/gpu:0'):
                    sess.run(myAgent.accumulate_gradient, feed_dict=feed_dict)

            if ep % update_frequency == 0 and ep != 0:
                with tf.device('/gpu:0'):
                    sess.run(myAgent.update_batch)
                    sess.run(myAgent.clear_batch)
                exp_buffer.flush()
                                
            # summarize and record
            ma_reward.append(reward)
            ma_length.append(frame)
            ma_captured.append(env.blue_win)

            if ep % reward_update_freq == 0 and ep != 0 and len(batch_history) > 0:
                # update reward signal weights
                sess.run(rs.update, feed_dict={rs.signals_holder : np.stack(batch_history[:,3]),
                                               rs.winning_holder : np.array(ma_captured.tolist())})
                
            if ep % save_stat_frequency == 0 and ep != 0 and len(batch_history) > 0:
                # dict.update() returns None, so build the summary feed in a separate step
                summary_feed = dict(feed_dict)
                summary_feed.update({rs.signals_holder : np.stack(batch_history[:,3]),
                                     rs.winning_holder : np.array(ma_captured.tolist())})
                summary_ = sess.run(merged, feed_dict=summary_feed)
                record(summary_)

            # save weight
            if ep % save_network_frequency == 0:
                saver.save(sess, MODEL_PATH+'/ctf_policy.ckpt', global_step=global_step)

            sess.run(increment_global_step_op)
        return 0

    except KeyboardInterrupt:
        print('\n\nManually stopped the training (KeyboardInterrupt)')
        saver.save(sess, MODEL_PATH+'/ctf_policy.ckpt', global_step=global_step)
        print("save: ", sess.run(global_step), 'episodes')
        
        return 1
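
The training loop accumulates gradients every episode and applies them only every `update_frequency` episodes. The sketch below shows that accumulate, apply, clear pattern in NumPy. It is only an illustration under assumptions: the class `GradientAccumulator` is hypothetical, and whether the notebook's network averages or sums the accumulated gradients is not visible here.

```python
import numpy as np

class GradientAccumulator:
    """Collects gradients over several batches and exposes their mean."""

    def __init__(self, shape):
        self.total = np.zeros(shape)
        self.count = 0

    def accumulate(self, grad):
        self.total += grad
        self.count += 1

    def average(self):
        # With nothing accumulated, the update is a zero step.
        if self.count == 0:
            return np.zeros_like(self.total)
        return self.total / self.count

    def clear(self):
        self.total[:] = 0.0
        self.count = 0

weights = np.array([1.0, -2.0])
acc = GradientAccumulator(weights.shape)
for grad in (np.array([0.2, 0.4]), np.array([0.4, 0.0])):
    acc.accumulate(grad)          # one call per episode
weights -= 0.1 * acc.average()    # a single SGD step on the mean gradient
acc.clear()                       # start the next accumulation window
```

Averaging over several episodes before stepping reduces the variance of each update at the cost of fewer updates per episode.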

## Self-Play Run

In [None]:
initial_zero_training=0
weight_change_freq = 50000

In [None]:
print('Training with fixed policy')
policy_red = policy.zeros.PolicyGen(env.get_map, env.get_team_red)
run_training(initial_zero_training)
print('training with fixed red: Done')

In [None]:
policy_red = policy.policy_RL.PolicyGen(env.get_map, env.get_team_red,
                                        model_dir=MODEL_PATH, color='red')

while True:
    if run_training(weight_change_freq): break
    print('training at : ', sess.run(global_step))
    if sess.run(global_step) % weight_change_freq == 0:
        policy_red.reset_network()
        print('red policy updated')