# Capture the Flag (RL - Policy Gradient)

- Seung Hyun Kim
- skim449@illinois.edu

## Notes
- This notebook includes:
    - Building the structure of the policy-driven network.
    - Training with or without rendering.
    - A Saver that saves the model and weights to the ./model directory.
    - A Writer that records the necessary training data to ./logs.
- This notebook does not include running the CtF game with the RL policy. Using the trained network is scripted separately in policy/policy_RL1.py.
    - cap_test.py has been changed accordingly.
    
## References :
- https://github.com/awjuliani/DeepRL-Agents/blob/master/Vanilla-Policy.ipynb (source)
- https://www.youtube.com/watch?v=PDbXPBwOavc
- https://github.com/lilianweng/deep-reinforcement-learning-gym/blob/master/playground/policies/actor_critic.py (source)
- https://www.datahubbs.com/policy-gradients-and-advantage-actor-critic/

## TODO:
try to check if convolution

### Sampling
- [x] Mini-batch to update 'average' gradient
- [x] Experience Replay for Random Sampling
    - [x] Importance Sampling
    
### Move onto Deterministic
- [ ] DDPG and MADDPG

### Stability and Reducing Variance
- [ ] Target Network
- [ ] TRPO
- [x] PPO

### Multiprocessing
- [ ] Multiprocessing for Synchronous Training (A2C)
    - [ ] Asynchronous Training (A3C)

In [1]:
!rm -rf logs/B4R4_Rzero_AC_ImportSamp/ model/B4R4_Rzero_AC_ImportSamp

In [2]:
TRAIN_NAME='B4R4_Rzero_AC_ImportSamp'
LOG_PATH='./logs/'+TRAIN_NAME
MODEL_PATH='./model/' + TRAIN_NAME
GPU_CAPACITY=0.35 # fraction of GPU memory this process may use (0.35 = 35%)

In [3]:
import os

import signal

import tensorflow as tf
import tensorflow.contrib.slim as slim
import tensorflow.contrib.layers as layers
from tensorflow.python.client import device_lib
import matplotlib.pyplot as plt
%matplotlib inline

import time
import gym
import gym_cap
import gym_cap.envs.const as CONST
import numpy as np
import random

from IPython.display import clear_output

# the modules that you can use to generate the policy.
import policy.patrol 
import policy.random
import policy.simple # custom-written policy
#import policy.policy_RL
import policy.zeros

# Data Processing Module
from DataModule import one_hot_encoder
from Utils import MovingAverage as MA
from Utils import Experience_buffer, discount_rewards, normalize

## Hyperparameters

In [4]:
# Replay Variables
total_episodes = 5000000 #Set total number of episodes to train agent on.
max_ep = 150
mini_batch = 64
batch_size = 2048
experience_size=10000

# Saving Related
save_network_frequency = 1000
save_stat_frequency = 100
moving_average_step = 100

# Training Variables
LEARNING_RATE_FIX = True
LEARNINGRATE_AC  = 1e-4
LEARNINGRATE_CRITIC = 1e-1
LR_ACTOR_DECAY = 0.999
LR_CRITIC_DECAY = 0.995
LR_ACTOR_FINAL = 1e-5
LR_CRITIC_FINAL = 5e-5
gamma = 0.95
discount_factor = 0.9
pre_train = 3500

# Env Settings
MAP_SIZE = 10
VISION_RANGE = 4
VISION_dX, VISION_dY = 2*VISION_RANGE+1, 2*VISION_RANGE+1

## Environment Setting

In [None]:
if not os.path.exists(MODEL_PATH):
    os.makedirs(MODEL_PATH)
    
# Create a directory for the TensorBoard logs
if not os.path.exists(LOG_PATH):
    os.makedirs(LOG_PATH)

In [None]:
env = gym.make("cap-v0") # initialize the environment
#plt.imshow(env.render(mode='rgb_array'))
policy_red = policy.zeros.PolicyGen(env.get_map, env.get_team_red)
action_space = 5
n_agent = len(env.get_team_blue)
print('red number : ', len(env.get_team_red))
print('blue number : ', len(env.get_team_blue))

red number :  4
blue number :  4


## Policy Network

In [None]:
class Agent():
    def __init__(self, in_size, action_size, grad_clip_norm):
        # Parameters
        self.grad_clip_norm = grad_clip_norm
        
        # Learning Rate Variables
        self.learning_rate = tf.placeholder(tf.float32, shape=None, name='learning_rate')
        self.learning_rate_critic = tf.placeholder(tf.float32, shape=None, name='learning_rate_critic')
        
        # Placeholders
        self.state_input = tf.placeholder(shape=in_size,dtype=tf.float32, name='state')
        self.action_holder = tf.placeholder(shape=[None],dtype=tf.int32, name='action')
        self.action_OH = tf.one_hot(self.action_holder, action_size)
        self.reward_holder = tf.placeholder(shape=[None],dtype=tf.float32, name='reward')
        self.td_target_holder = tf.placeholder(shape=[None], dtype=tf.float32, name='td_target')
        self.behavior_policy_holder = tf.placeholder(shape=[None,action_size], dtype=tf.float32, name='IS')
        
        # Feed-Forward Network
        # Actor stream
        layer = slim.conv2d(self.state_input, 32, [3,3], activation_fn=tf.nn.relu,
                            weights_initializer=layers.xavier_initializer_conv2d(),
                            padding='VALID',
                            scope='conv_1')
        layer = slim.max_pool2d(layer, [2,2])
        layer = slim.conv2d(layer, 64, [2,2], activation_fn=tf.nn.relu,
                            weights_initializer=layers.xavier_initializer_conv2d(),
                            padding='VALID',
                            scope='conv_2')
        layer = slim.flatten(layer)
        actor = layers.fully_connected(layer, 128,
                                    activation_fn=tf.nn.relu,
                                    scope='dense_1')
        self.actor = layers.fully_connected(actor, action_size,
                                    activation_fn=None,
                                    scope='dense_2')
        self.output = tf.nn.softmax(self.actor, name='action')
        self.output_argmax = tf.argmax(self.output, axis=1,output_type=tf.int32, name='argmax')
        self.actor_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='conv')+tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='dense')
        
        # Value stream
        self.critic = layers.fully_connected(layer, 1,
                                activation_fn=None,
                                            scope='critic_1')
        self.critic = tf.reshape(self.critic, [-1])
        self.critic_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='critic')
        
        # Verbose : Print Graph Structure
        print('Actor Network')
        for var in self.actor_vars:
            print(var)
        print('Critic Network')
        for var in self.critic_vars:
            print(var)
        
        # Backward pass
        # - Compute the loss, use it to find the gradients, and update the network
        # - May need to add a bootstrap value at the end of the value
        #self.selected=tf.equal(self.action_holder, self.output_argmax)
        with tf.name_scope('train'):
            self.entropy = -tf.reduce_mean(self.output * tf.log(self.output+1e-8), name='entropy')
            self.advantage = self.td_target_holder - self.critic
        
        with tf.name_scope('critic_train'):
            self.loss_critic = tf.reduce_mean(tf.square(self.advantage))
            self.optimizer_critic = tf.train.AdamOptimizer(self.learning_rate_critic)
            self.grads_critic = self.optimizer_critic.compute_gradients(self.loss_critic, self.critic_vars)
            if self.grad_clip_norm:
                self.grads_critic = [(tf.clip_by_norm(grad, self.grad_clip_norm), var) for grad, var in self.grads_critic]
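            # Gradient buffers start at zero (initializing them from the weights would pollute the first update)
            self.grad_holders_critic = [(tf.Variable(tf.zeros(var.shape.as_list()), trainable=False, dtype=tf.float32, name=var.op.name+'_buffer'), var) for var in self.critic_vars]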
            self.accumulate_critic = tf.group([tf.assign_add(a[0],b[0]) for a,b in zip(self.grad_holders_critic, self.grads_critic)]) # add gradient to buffer
            self.update_critic = self.optimizer_critic.apply_gradients(self.grad_holders_critic)

        with tf.name_scope('actor_train'):
            self.policy_outputs = tf.reduce_sum(self.output * self.action_OH, 1)
            self.behav_policy_output = tf.reduce_sum(self.behavior_policy_holder * self.action_OH,1)
            self.sampling_weight = self.policy_outputs / (self.behav_policy_output+1e-8)
            self.objective_function = tf.clip_by_value(self.sampling_weight,0.8,1.2) * tf.log(self.policy_outputs+1e-8)
            self.loss_actor = -tf.reduce_sum(self.objective_function*tf.stop_gradient(self.advantage))
            '''self.optimizer_actor = tf.train.AdamOptimizer(self.learning_rate)
            self.grads_actor = self.optimizer_actor.compute_gradients(self.loss_actor)#, self.actor_vars)
            if self.grad_clip_norm:
                self.grads_actor = [(tf.clip_by_norm(grad, self.grad_clip_norm), var) for grad, var in self.grads_actor]
            self.grad_holders_actor = [(tf.Variable(var, trainable=False, dtype=tf.float32, name=var.op.name+'_buffer'), var) for var in self.actor_vars]
            self.update_actor = self.optimizer_actor.apply_gradients(self.grad_holders_actor)'''
            
        with tf.name_scope('update'):
            self.loss = self.loss_critic*0.5 + self.loss_actor - self.entropy * 0.001
            '''self.accumulate_gradient = tf.group([tf.assign_add(a[0],b[0]) for a,b in zip(self.grad_holders_actor, self.grads_actor)],
                                                [tf.assign_add(a[0],b[0]) for a,b in zip(self.grad_holders_critic, self.grads_critic)])
            self.clear_batch = tf.group([tf.assign(a[0],a[0]*0.0) for a in self.grad_holders_actor],
                                        [tf.assign(a[0],a[0]*0.0) for a in self.grad_holders_critic])
            self.update_batch = tf.group(self.update_actor, self.update_critic)'''
            # update using total loss
            self.optimizer = tf.train.AdamOptimizer(self.learning_rate)
            self.grads = self.optimizer.compute_gradients(self.loss)
            if self.grad_clip_norm:
                self.grads = [(tf.clip_by_norm(grad, self.grad_clip_norm), var) for grad, var in self.grads]
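            # Gradient buffers start at zero (initializing them from the weights would pollute the first update)
            self.grad_holders = [(tf.Variable(tf.zeros(var.shape.as_list()), trainable=False, dtype=tf.float32, name=var.op.name+'_buffer'), var) for var in tf.trainable_variables()]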
            self.update_batch = self.optimizer.apply_gradients(self.grad_holders)
            self.accumulate_gradient = tf.group([tf.assign_add(a[0],b[0]) for a,b in zip(self.grad_holders, self.grads)]) # add gradient to buffer
            self.clear_batch = tf.group([tf.assign(a[0],a[0]*0.0) for a in self.grad_holders])
            
        # Summary
        # Histogram output
        with tf.name_scope('debug_parameters'):
            tf.summary.histogram('output', self.output)
            tf.summary.histogram('actor', self.actor)
            tf.summary.histogram('critic', self.critic)        
            tf.summary.histogram('action', self.action_holder)
            tf.summary.histogram('IS_weight', self.sampling_weight)
            tf.summary.histogram('objective_function', self.objective_function)
            tf.summary.histogram('td_target', self.td_target_holder)
            tf.summary.histogram('rewards_in', self.reward_holder)
        
        # Graph summary Loss
        with tf.name_scope('summary'):
            tf.summary.scalar(name='actor_loss', tensor=self.loss_actor)
            tf.summary.scalar(name='critic_loss', tensor=self.loss_critic)
            tf.summary.scalar(name='total_loss', tensor=self.loss)
            tf.summary.scalar(name='Entropy', tensor=self.entropy)
        
        with tf.name_scope('weights_bias'):
            # Histogram weights and bias
            for var in slim.get_model_variables():
                tf.summary.histogram(var.op.name, var)
                
        with tf.name_scope('gradients'):
            # Histogram Gradients
            for var, grad in zip(self.critic_vars, self.grads_critic): # pair critic variables with the critic gradients
                tf.summary.histogram(var.op.name+'/grad_critic', grad[0])
            '''for var, grad in zip(self.actor_vars, self.grads_actor):
                tf.summary.histogram(var.op.name+'/grad_actor', grad[0])'''
        
        with tf.name_scope('Learning_Rate'):
            # Learning Rate
            tf.summary.scalar(name='actor_lr', tensor=self.learning_rate)
            #tf.summary.scalar(name='critic_lr', tensor=self.learning_rate_critic)

In [None]:
tf.reset_default_graph() # Clear the Tensorflow graph.
myAgent = Agent(in_size=[None,VISION_dX,VISION_dY,6],action_size=5, grad_clip_norm=30) #Load the agent.
with tf.variable_scope('global_step'):
    global_step = tf.Variable(0, trainable=False, name='global_step') # global step
    increment_global_step_op = tf.assign(global_step, global_step+1)
merged = tf.summary.merge_all()

Actor Network
<tf.Variable 'conv_1/weights:0' shape=(3, 3, 6, 32) dtype=float32_ref>
<tf.Variable 'conv_1/biases:0' shape=(32,) dtype=float32_ref>
<tf.Variable 'conv_2/weights:0' shape=(2, 2, 32, 64) dtype=float32_ref>
<tf.Variable 'conv_2/biases:0' shape=(64,) dtype=float32_ref>
<tf.Variable 'dense_1/weights:0' shape=(256, 128) dtype=float32_ref>
<tf.Variable 'dense_1/biases:0' shape=(128,) dtype=float32_ref>
<tf.Variable 'dense_2/weights:0' shape=(128, 5) dtype=float32_ref>
<tf.Variable 'dense_2/biases:0' shape=(5,) dtype=float32_ref>
Critic Network
<tf.Variable 'critic_1/weights:0' shape=(256, 1) dtype=float32_ref>
<tf.Variable 'critic_1/biases:0' shape=(1,) dtype=float32_ref>
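
The `(256, 128)` shape of `dense_1/weights` printed above follows from the 9x9 vision window and the `VALID` padding used by each layer. The sketch below traces the spatial size through the network, assuming the slim defaults of stride 1 for `conv2d` and stride 2 for `max_pool2d`; the helper `valid_out` is a hypothetical name used only for illustration.

```python
def valid_out(size, kernel, stride=1):
    """Output length along one axis for a VALID (unpadded) conv or pool."""
    return (size - kernel) // stride + 1

VISION_RANGE = 4
side = 2 * VISION_RANGE + 1            # 9x9 input window
side = valid_out(side, 3)              # conv_1, 3x3 kernel      -> 7x7
side = valid_out(side, 2, stride=2)    # max_pool2d, 2x2 kernel  -> 3x3
side = valid_out(side, 2)              # conv_2, 2x2 kernel      -> 2x2
flat = side * side * 64                # 64 output channels from conv_2

print(flat)  # 256, the input width of dense_1 and critic_1
```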


## Session

In [None]:
# Launch the session
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction   = GPU_CAPACITY,
                            allow_growth                      = True)

sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
#sess = tf.Session()

ma_reward = MA(moving_average_step)
ma_length = MA(moving_average_step)
ma_captured = MA(moving_average_step)

# Setup Save and Restore Network
saver = tf.train.Saver(tf.global_variables())
writer = tf.summary.FileWriter(LOG_PATH, sess.graph)

ckpt = tf.train.get_checkpoint_state(MODEL_PATH)
if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path):
    saver.restore(sess, ckpt.model_checkpoint_path)
    print("Load Model : ", ckpt.model_checkpoint_path)
else:
    sess.run(tf.global_variables_initializer())
    print("Initialized Variables")

Initialized Variables


In [None]:
def record(summary_):
    with tf.device('/cpu:0'): 
        summary = tf.Summary()
        summary.value.add(tag='Records/mean_reward', simple_value=ma_reward())
        summary.value.add(tag='Records/mean_length', simple_value=ma_length())
        summary.value.add(tag='Records/mean_succeed', simple_value=ma_captured())
        writer.add_summary(summary, sess.run(global_step))
        
        #summary_str = sess.run(merged,feed_dict={myAgent.state_input:obs})
        writer.add_summary(summary_, sess.run(global_step))
        
        writer.flush()

In [None]:
# Timeout handler: raises when env.reset() takes too long
def handler(signum, frame):
    print('Reset Taking Too Long')
    raise Exception('Action took too much time')
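
The handler above is paired with `signal.alarm` in `policy_rollout` so that a hanging `env.reset` is abandoned and retried. The sketch below isolates that pattern in a reusable helper. The names `ResetTimeout` and `call_with_timeout` are hypothetical, `signal.setitimer` is used instead of `signal.alarm` to allow sub-second timeouts, and the approach only works on Unix in the main thread.

```python
import signal
import time

class ResetTimeout(Exception):
    """Raised by the alarm handler when a call runs too long."""

def _on_alarm(signum, frame):
    raise ResetTimeout('call took too long')

def call_with_timeout(fn, seconds, retries=3):
    """Run fn(); if it exceeds `seconds`, abandon it and retry."""
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    try:
        for _ in range(retries):
            signal.setitimer(signal.ITIMER_REAL, seconds)   # arm the timer
            try:
                return fn()
            except ResetTimeout:
                continue                                    # timed out: try again
            finally:
                signal.setitimer(signal.ITIMER_REAL, 0)     # always disarm
        raise ResetTimeout('gave up after %d attempts' % retries)
    finally:
        signal.signal(signal.SIGALRM, old_handler)          # restore previous handler

# Usage: the first call stalls and is interrupted, the second returns promptly.
calls = []
def flaky_reset():
    calls.append(1)
    if len(calls) == 1:
        time.sleep(1.0)
    return 'ok'

result = call_with_timeout(flaky_reset, 0.05)
```

Catching a dedicated exception type, rather than everything, lets a `KeyboardInterrupt` during reset still stop the run.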

In [None]:
def policy_rollout(EXPLORE=False, DETERMINISTIC=False):
    # Run single episode, return the results
    flag = True
    while flag:
        signal.signal(signal.SIGALRM, handler)
        signal.alarm(3) #Set the parameter to the amount of seconds you want to wait
        try:
            s = env.reset(map_size=MAP_SIZE, policy_red=policy_red)
            flag = False
        except Exception: # do not swallow KeyboardInterrupt
            print('timeout. retry:')
            flag = True
        signal.alarm(0) #Disables the alarm
    

    #obs = one_hot_encoder(s, env.get_team_blue) # partial observation
    obs_next = one_hot_encoder(env._env, env.get_team_blue, VISION_RANGE)

    
    ep_history = []
    indv_history = [[] for _ in range(len(env.get_team_blue))]
    was_alive = [ag.isAlive for ag in env.get_team_blue]
    
    prev_reward=0
    total_reward = 0
    frame=0
    
    for frame in range(max_ep+1):
        obs = obs_next
        
        with tf.device('/cpu:0'):
            act_prob, v0 = sess.run([myAgent.output, myAgent.critic], feed_dict={myAgent.state_input:obs})
        act = [np.random.choice(action_space, p=act_prob[x]/sum(act_prob[x])) for x in range(n_agent)] # divide by sum : normalize
        behavior_policy = act_prob#np.log(act_chosen)
        
        s,r1,d,_ = env.step(act) # Step the environment with the blue team's actions
        r = r1-prev_reward
        #if r >=1 and r < 100: # ignore capturing
        #    r = 0
        if frame == max_ep and d == False:
            r = -100
            r1 = -100
            d = True
        total_reward += r
        
        if d:
            v1 = np.array([0.0 for _ in range(n_agent)])
        else:
            obs_next = one_hot_encoder(env._env, env.get_team_blue, VISION_RANGE) # Full Observation
            v1 = sess.run(myAgent.critic, feed_dict={myAgent.state_input:obs_next})
        
        # Push history for each individual that was alive in the previous frame
        # [state, action, reward, gamma * V(next state), behavior_policy]
        for idx, agent in enumerate(env.get_team_blue):
            if was_alive[idx]:
                indv_history[idx].append([obs[idx],act[idx],r,gamma*v1[idx],behavior_policy[idx]])
      
        # State Transition
        prev_reward = r1
        was_alive = [ag.isAlive for ag in env.get_team_blue]
        
        if d:
            break

    # Policy rollout for all agents is done.
    # Compute the discounted return and TD target for each individual history
    for idx, history in enumerate(indv_history):
        if len(history)==0:
            continue
        _history = np.array(history)
        _history[:,2] = discount_rewards(_history[:,2], discount_factor)#, normalize=True)
        _history[:,3] += _history[:,2] # Td target
        ep_history.extend(_history)
      
    if len(ep_history) > 0:        
        ep_history = np.stack(ep_history)

    return [frame, ep_history, r1, env.blue_win, total_reward]
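
At the end of `policy_rollout`, column 2 of each history is replaced by its discounted return and then added to column 3, which already holds `gamma * V(next state)`, to form the TD target. `Utils.discount_rewards` is not shown in this notebook; the version below assumes the usual reverse cumulative sum, and `td_targets` is a hypothetical helper that mirrors the two lines of array arithmetic.

```python
def discount_rewards(rewards, discount):
    """Discounted return G_t = r_t + discount * G_{t+1}, computed back to front."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        returns[t] = running
    return returns

def td_targets(rewards, next_values, gamma=0.95, discount_factor=0.9):
    """Target used in the notebook: discounted return plus gamma * V(next state)."""
    returns = discount_rewards(rewards, discount_factor)
    return [g + gamma * v for g, v in zip(returns, next_values)]

# Three-step episode with a single terminal reward; V is 0 after the last step.
returns = discount_rewards([0.0, 0.0, 1.0], 0.5)
targets = td_targets([0.0, 0.0, 1.0], [1.0, 1.0, 0.0], gamma=0.5, discount_factor=0.5)
```

Note that this target combines a full Monte Carlo return with a bootstrapped value, which is how the notebook computes it rather than the standard one-step target `r + gamma * V(next state)`.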

## Training
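
The training loop follows an accumulate, apply, clear cycle: gradients from every episode's sampled batch are added to a buffer (`accumulate_gradient`), the buffer is applied once every `mini_batch` episodes (`update_batch`), and then it is zeroed (`clear_batch`). The sketch below shows that cycle on plain Python lists. The class `GradientAccumulator` is hypothetical, and it applies a plain gradient-descent step as a stand-in for the Adam optimizer used in the notebook.

```python
class GradientAccumulator:
    """Sums gradients over several batches before applying one update."""

    def __init__(self, n_params):
        self.buffer = [0.0] * n_params          # buffers must start at zero

    def accumulate(self, grads):
        self.buffer = [b + g for b, g in zip(self.buffer, grads)]

    def apply(self, params, lr):
        # Plain SGD step on the summed gradient (assumption: stand-in for Adam).
        return [p - lr * b for p, b in zip(params, self.buffer)]

    def clear(self):
        self.buffer = [0.0] * len(self.buffer)

params = [0.0, 0.0]
acc = GradientAccumulator(len(params))
acc.accumulate([1.0, 2.0])      # gradient from episode 1
acc.accumulate([3.0, 4.0])      # gradient from episode 2
params = acc.apply(params, lr=0.5)
acc.clear()                     # ready for the next group of episodes
```

Because the buffer holds a sum rather than a mean, the effective step size grows with the number of accumulated episodes.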

In [None]:
if __name__=='__main__':
    ep = sess.run(global_step)
    exp_buffer = Experience_buffer(experience_shape=5, buffer_size=experience_size)
    batch_history = []
    try:
        progbar = tf.keras.utils.Progbar(total_episodes,width=5)
        while True:#ep < total_episodes+1:
            progbar.update(ep) # update progress bar

            # Run episode

            frame, history, reward, did_won, total_reward = policy_rollout()

            # Add history
            exp_buffer.add(history)

            if len(exp_buffer) > 0:
                batch_history = exp_buffer.sample(size=batch_size, shuffle=True)
                feed_dict={myAgent.learning_rate    :LEARNINGRATE_AC,
                           myAgent.state_input            :np.stack(batch_history[:,0]),
                           myAgent.action_holder          :batch_history[:,1],
                           myAgent.reward_holder          :batch_history[:,2],
                           myAgent.td_target_holder       :batch_history[:,3],
                           myAgent.behavior_policy_holder :np.stack(batch_history[:,4])}
                with tf.device('/gpu:0'):
                    sess.run(myAgent.accumulate_gradient, feed_dict=feed_dict)

            if ep % mini_batch == 0 and ep != 0:
                with tf.device('/gpu:0'):
                    sess.run(myAgent.update_batch, feed_dict={myAgent.learning_rate    :LEARNINGRATE_AC})
                    sess.run(myAgent.clear_batch)

            # decay lr
            if not LEARNING_RATE_FIX:
                LEARNINGRATE_AC = max(LEARNINGRATE_AC*LR_ACTOR_DECAY,LR_ACTOR_FINAL) # LEARNINGRATE_ACTOR was never defined
                LEARNINGRATE_CRITIC = max(LEARNINGRATE_CRITIC*LR_CRITIC_DECAY,LR_CRITIC_FINAL)

            # summarize and record
            ma_reward.append(reward)
            ma_length.append(frame)
            ma_captured.append(env.blue_win)

            if ep % save_stat_frequency == 0 and ep != 0 and len(batch_history) > 0:
                summary_ = sess.run(merged, feed_dict=feed_dict)
                record(summary_)

            # save weight
            if ep % save_network_frequency == 0 and ep != 0:
                saver.save(sess, MODEL_PATH+'/ctf_policy.ckpt', global_step=global_step)

            # Proceed to next episode
            ep += 1
            sess.run(increment_global_step_op)

            #clear_output()
    except KeyboardInterrupt:
        print('\n\nManually stopped the training (KeyboardInterrupt)')
        saver.save(sess, MODEL_PATH+'/ctf_policy.ckpt', global_step=global_step)
        print("save: ", sess.run(global_step), 'episodes')

   3761/5000000 [.....] - ETA: 895:13:24Reset Taking Too Long
timeout. retry:
   6463/5000000 [.....] - ETA: 913:41:46Reset Taking Too Long
timeout. retry:
   7819/5000000 [.....] - ETA: 907:11:00

   9406/5000000 [.....] - ETA: 906:07:02Reset Taking Too Long
timeout. retry:
  13094/5000000 [.....] - ETA: 908:11:46Reset Taking Too Long
timeout. retry:
  13434/5000000 [.....] - ETA: 907:46:34Reset Taking Too Long
timeout. retry:
  17354/5000000 [.....] - ETA: 904:50:22Reset Taking Too Long
timeout. retry:
  17658/5000000 [.....] - ETA: 904:49:11Reset Taking Too Long
timeout. retry:
  20759/5000000 [.....] - ETA: 905:53:17Reset Taking Too Long
timeout. retry:
  22200/5000000 [.....] - ETA: 905:39:49Reset Taking Too Long
timeout. retry:
  26212/5000000 [.....] - ETA: 900:29:02Reset Taking Too Long
timeout. retry:
  35089/5000000 [.....] - ETA: 902:16:46Reset Taking Too Long
timeout. retry:
  35602/5000000 [.....] - ETA: 902:13:26Reset Taking Too Long
timeout. retry:
  40986/5000000 [.....] - ETA: 906:22:35Reset Taking Too Long
timeout. retry:
  41114/5000000 [.....] - ETA: 906:45:40Reset Taking Too Long
timeout. retry:
  44541/5000000 [.....] - ETA: 909:54:42Reset Taking Too Long
ti