# Capture the Flag (RL - Policy Gradient)

- Seung Hyun Kim
- skim449@illinois.edu

## Implementation Details

- Simple policy gradient with an experience buffer.
- The network implementation is slightly different:
    - Cleaner mini-batch code
    - Self-play for the red team
    - Vision range of 19 (a 39x39 observation window, since `VISION_dX = 2*VISION_RANGE+1`)

### Sampling
- [x] Mini-batch to update 'average' gradient
- [x] Experience Replay for Random Sampling
- [ ] Importance Sampling
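
The experience replay item above can be illustrated with a minimal sketch. The notebook imports `Experience_buffer` from `utility.utils`, whose source is not shown here, so the class below (`ReplayBufferSketch`, with its `capacity` and `seed` parameters) is a hypothetical stand-in that only mirrors the `add`, `sample`, and `flush` calls used in the training loop.

```python
import random


class ReplayBufferSketch:
    """Hypothetical stand-in for utility.utils.Experience_buffer (not the real class)."""

    def __init__(self, capacity=10000, seed=None):
        self.capacity = capacity
        self.storage = []
        self.rng = random.Random(seed)

    def add(self, transitions):
        # Append new [state, action, return] rows and drop the oldest on overflow.
        self.storage.extend(transitions)
        overflow = len(self.storage) - self.capacity
        if overflow > 0:
            del self.storage[:overflow]

    def sample(self, batch_size):
        # Uniform sampling without replacement; returns every row if the buffer is small.
        k = min(batch_size, len(self.storage))
        return self.rng.sample(self.storage, k)

    def flush(self):
        # Discard all stored rows, as the training loop does after each weight update.
        self.storage.clear()


buffer = ReplayBufferSketch(capacity=5, seed=0)
buffer.add([["s%d" % i, i % 5, float(i)] for i in range(8)])
batch = buffer.sample(3)
```

Sampling uniformly from stored rows breaks the correlation between consecutive frames of one episode, which is the usual motivation for replay.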
    
### Deterministic Policy Gradient
- [ ] DDPG
- [ ] MADDPG

### Stability and Reducing Variance
- [ ] Target Network
- [ ] TRPO
- [ ] PPO

### Multiprocessing
- [ ] Synchronous Environment Rolling
- [ ] Synchronous Training (A2C)
- [ ] Asynchronous Training (A3C)

### Applied Training Methods:
- [ ] Self-play
- [ ] Batch Policy

## Notes

- This notebook includes:
    - Building the structure of the policy-driven network.
    - Training with or without rendering.
    - A saver that saves the model and weights to the ./model directory.
    - A writer that records summary data to ./logs.

- This notebook does not include:
    - Simulation with the RL policy
        - The simulation can be run using policy_RL.py
    - cap_test.py has been changed accordingly.
    
## References
- https://github.com/awjuliani/DeepRL-Agents/blob/master/Vanilla-Policy.ipynb (source)
- https://www.youtube.com/watch?v=PDbXPBwOavc
- https://github.com/lilianweng/deep-reinforcement-learning-gym/blob/master/playground/policies/actor_critic.py (source)
- https://github.com/spro/practical-pytorch/blob/master/reinforce-gridworld/reinforce-gridworld.ipynb

## TODO:

- enemy with different policies (zero, patrol)
- stochastic interaction
- Reward -> only 100 for completion (with small observation)

In [1]:
!rm -rf logs/VANILLA/ model/VANILLA

In [2]:
TRAIN_NAME='VANILLA'
LOG_PATH='./logs/'+TRAIN_NAME
MODEL_PATH='./model/' + TRAIN_NAME
GPU_CAPACITY=0.5 # fraction of GPU memory this process may use (0.5 = 50%)

In [3]:
import os

import signal
from itertools import count

import tensorflow as tf
import tensorflow.contrib.slim as slim
import tensorflow.contrib.layers as layers
from tensorflow.python.client import device_lib
import matplotlib.pyplot as plt
%matplotlib inline

import time
from datetime import datetime
import gym
import gym_cap
import gym_cap.envs.const as CONST
import numpy as np
import random

# the modules that you can use to generate the policy.
import policy.patrol 
import policy.random
import policy.policy_RL
import policy.zeros

# Data Processing Module
from utility.dataModule import one_hot_encoder
from utility.utils import MovingAverage as MA
from utility.utils import Experience_buffer, discount_rewards

## Hyperparameters

In [4]:
# Training Related
total_episode = 150000
max_ep = 150
update_frequency = 50
batch_size = 1000
experience_size=10000

# Saving Related
save_network_frequency = 1000
save_stat_frequency = 100
moving_average_step = 100

# Parameters
LEARNING_RATE = 1e-3
gamma = 0.99
MAP_SIZE = 20
VISION_RANGE = 19
VISION_dX, VISION_dY = 2*VISION_RANGE+1, 2*VISION_RANGE+1
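
As a quick check on what these settings imply, the sketch below derives the observation window size and the number of weight updates in a full run. The channel count of 6 is taken from the `in_size` argument passed to the agent in the network cell; `N_CHANNELS`, `window`, and `n_updates` are names introduced here for illustration only.

```python
# Values copied from the hyperparameter cell.
total_episode = 150000
update_frequency = 50
VISION_RANGE = 19
N_CHANNELS = 6  # assumption: matches in_size=[None, VISION_dX, VISION_dY, 6]

window = 2 * VISION_RANGE + 1          # cells visible along each axis
input_shape = (window, window, N_CHANNELS)
n_updates = total_episode // update_frequency  # one weight update every 50 episodes

print(input_shape)
print(n_updates)
```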

## Environment Setting

In [None]:
if not os.path.exists(MODEL_PATH):
    os.makedirs(MODEL_PATH)
    
# Create a directory for the TensorBoard logs
if not os.path.exists(LOG_PATH):
    os.makedirs(LOG_PATH)

In [None]:
env = gym.make("cap-v0") # initialize the environment
policy_red = policy.zeros.PolicyGen(env.get_map, env.get_team_red)
#plt.imshow(env.render(mode='rgb_array'))

# Environment Related
action_space = 5
n_agent = len(env.get_team_blue)

print('red number : ', len(env.get_team_red))
print('blue number : ', len(env.get_team_blue))

red number :  4
blue number :  4


## Network Setting

In [None]:
class agent():
    def __init__(self, lr, in_size,action_size, grad_clip_norm):
        self.grad_clip_norm=grad_clip_norm
        
        with tf.name_scope('Network_Param'):
            self.input_shape=tf.constant(np.array([VISION_dX,VISION_dY]), name='input_shape')
        
        # These lines establish the feed-forward part of the network: the agent takes a state and produces an action.
        self.state_input = tf.placeholder(shape=in_size,dtype=tf.float32, name='state')
        self.action_holder = tf.placeholder(shape=[None],dtype=tf.int32)
        self.action_OH = tf.one_hot(self.action_holder, action_size)
        self.reward_holder = tf.placeholder(shape=[None],dtype=tf.float32, name='reward')
        
        layer = slim.conv2d(self.state_input, 32, [5,5], activation_fn=tf.nn.relu,
                            weights_initializer=layers.xavier_initializer_conv2d(),
                            biases_initializer=tf.zeros_initializer(),
                            padding='SAME',
                            scope='conv1')
        layer = slim.max_pool2d(layer, [2,2])
        layer = slim.conv2d(layer, 64, [3,3], activation_fn=tf.nn.relu,
                            weights_initializer=layers.xavier_initializer_conv2d(),
                            biases_initializer=tf.zeros_initializer(),
                            padding='SAME',
                            scope='conv2')
        layer = slim.max_pool2d(layer, [2,2])
        layer = slim.conv2d(layer, 64, [2,2], activation_fn=tf.nn.relu,
                            weights_initializer=layers.xavier_initializer_conv2d(),
                            biases_initializer=tf.zeros_initializer(),
                            padding='SAME',
                            scope='conv3')
        layer = slim.flatten(layer)
        layer = layers.fully_connected(layer, 128, 
                                    activation_fn=tf.nn.relu)
        self.dense = layers.fully_connected(layer, action_size,
                                    activation_fn=None,
                                    scope='output_fc')
        self.output = tf.nn.softmax(self.dense, name='action')
        
        # Update Operations
        with tf.name_scope('train'):
            self.responsible_outputs = tf.reduce_sum(self.output * self.action_OH, 1)
            self.loss = -tf.reduce_sum(tf.log(self.responsible_outputs)*self.reward_holder)
            '''self.loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
                        logits=self.dense, labels=self.action_holder)*self.reward_holder)'''
            self.optimizer = tf.train.AdamOptimizer(learning_rate=lr)
            self.gradients = self.optimizer.compute_gradients(self.loss)
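            # compute_gradients returns (gradient, variable) pairs; clip only the gradient
            self.grads = [(tf.clip_by_norm(grad, self.grad_clip_norm), var) for grad, var in self.gradients]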
            
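            # Zero-initialized, non-trainable buffers that accumulate gradients between updates
            self.grad_holders = [(tf.Variable(tf.zeros(var.get_shape()), trainable=False, dtype=tf.float32, name=var.op.name+'_buffer'), var)
                                 for var in tf.trainable_variables()]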
            self.update_batch = self.optimizer.apply_gradients(self.grad_holders)
            self.accumulate_gradient = tf.group([tf.assign_add(a[0],b[0]) for a,b in zip(self.grad_holders, self.grads)]) # add gradient to buffer
            self.clear_batch = tf.group([tf.assign(a[0],a[0]*0.0) for a in self.grad_holders])
                                        
        # Summary
        # Histogram output
        with tf.variable_scope('debug_parameters'):
            tf.summary.histogram('output', self.output)
            tf.summary.histogram('actor', self.dense)     
            tf.summary.histogram('action', self.action_holder)
        
        # Graph summary Loss
        with tf.variable_scope('summary'):
            tf.summary.scalar(name='total_loss', tensor=self.loss)
        
        with tf.variable_scope('weights_bias'):
            # Histogram weights and bias
            for var in slim.get_model_variables():
                tf.summary.histogram(var.op.name, var)
                
        with tf.variable_scope('gradients'):
            # Histogram Gradients
            for var, grad in zip(slim.get_model_variables(), self.gradients):
                tf.summary.histogram(var.op.name+'/grad', grad[0])
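
The loss in the `train` scope is the policy gradient objective: the negative sum of log-probabilities of the actions actually taken, each weighted by its discounted return. The sketch below reproduces that arithmetic in plain Python so it can be checked by hand. It is not the TensorFlow graph itself, and `softmax` and `policy_gradient_loss` are helper names introduced here.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_loss(batch_logits, actions, returns):
    """Compute -sum_t log pi(a_t | s_t) * G_t over a batch of samples."""
    loss = 0.0
    for logits, action, ret in zip(batch_logits, actions, returns):
        prob_taken = softmax(logits)[action]  # the "responsible output"
        loss -= math.log(prob_taken) * ret
    return loss


# Two samples with five actions each (matching action_size=5).
logits = [[0.0, 0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0, 0.0]]
loss = policy_gradient_loss(logits, actions=[3, 0], returns=[1.0, -1.0])
```

Minimizing this quantity raises the probability of actions with positive return and lowers it for actions with negative return.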

In [None]:
tf.reset_default_graph() # Clear the Tensorflow graph.
myAgent = agent(lr=LEARNING_RATE,in_size=[None,VISION_dX,VISION_dY,6],action_size=5,grad_clip_norm=50) #Load the agent.
global_step = tf.Variable(0, trainable=False, name='global_step') # global step
increment_global_step_op = tf.assign(global_step, global_step+1)
merged = tf.summary.merge_all()

## Session

In [None]:
# Launch the session
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=GPU_CAPACITY, allow_growth=True)

sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
#sess = tf.Session()

ma_reward = MA(moving_average_step)
ma_length = MA(moving_average_step)
ma_captured = MA(moving_average_step)

# Setup Save and Restore Network
saver = tf.train.Saver(tf.global_variables())
writer = tf.summary.FileWriter(LOG_PATH, sess.graph)

ckpt = tf.train.get_checkpoint_state(MODEL_PATH)
if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path):
    saver.restore(sess, ckpt.model_checkpoint_path)
    print("Load Model : ", ckpt.model_checkpoint_path)
else:
    sess.run(tf.global_variables_initializer())
    print("Initialized Variables")    

Initialized Variables


In [None]:
def record(summary_):
    with tf.device('/cpu:0'): 
        summary = tf.Summary()
        summary.value.add(tag='Records/mean_reward', simple_value=ma_reward())
        summary.value.add(tag='Records/mean_length', simple_value=ma_length())
        summary.value.add(tag='Records/mean_succeed', simple_value=ma_captured())
        writer.add_summary(summary, sess.run(global_step))
        
        #summary_str = sess.run(merged,feed_dict={myAgent.state_input:obs})
        writer.add_summary(summary_, sess.run(global_step))
        
        writer.flush()
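
`record` reads the three moving averages by calling them, and the training loop feeds them with `append`. The `MovingAverage` class lives in `utility.utils` and is not shown, so the following is only a guess at an equivalent windowed mean. `MovingAverageSketch` and its empty-window behaviour are assumptions.

```python
from collections import deque


class MovingAverageSketch:
    """Hypothetical stand-in for utility.utils.MovingAverage."""

    def __init__(self, size):
        self.window = deque(maxlen=size)  # old values fall off automatically

    def append(self, value):
        # float() also accepts booleans such as env.blue_win.
        self.window.append(float(value))

    def __call__(self):
        # Mean of the stored values; 0.0 while the window is still empty.
        if not self.window:
            return 0.0
        return sum(self.window) / len(self.window)


ma = MovingAverageSketch(3)
for reward in [10, 20, 30, 40]:
    ma.append(reward)
```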

In [None]:
def handler(signum, frame):
    print('Reset Taking Too Long')
    raise Exception('Action took too much time')

In [None]:
def policy_rollout(PARTIAL=False):
    
    # Run single episode, return the results
    # Temporary fix for episode reset
    flag = True
    while flag:
        signal.signal(signal.SIGALRM, handler)
        signal.alarm(3) #Set the parameter to the amount of seconds you want to wait
        try:
            s = env.reset(map_size=MAP_SIZE, policy_red=policy_red)
            flag = False
        except Exception:  # do not swallow KeyboardInterrupt
            print('timeout. retry:')
            flag = True
        signal.alarm(0) #Disables the alarm
        
    if PARTIAL:
        obs_next = one_hot_encoder(s, env.get_team_blue, VISION_RANGE) # partial observation
    else:
        obs_next = one_hot_encoder(env._env, env.get_team_blue, VISION_RANGE) # Full observation
    
    ep_history = []
    indv_history = [[] for _ in range(len(env.get_team_blue))]
    
    was_alive = [ag.isAlive for ag in env.get_team_blue]
    prev_reward=0
    frame=0
    for frame in range(max_ep+1):
        obs = obs_next
        
        with tf.device('/cpu:0'):
            act_prob = sess.run(myAgent.output, feed_dict={myAgent.state_input:obs})
        act = [np.random.choice(action_space, p=act_prob[x]/sum(act_prob[x])) for x in range(n_agent)] # divide by sum : normalize
            
        s,r1,d,_ = env.step(act) # Step the environment with the blue team's actions

        r = r1-prev_reward

        if frame == max_ep and d == False:
            #r -= frame * (30/1000)
            r = -100
            r1 = -100

        if PARTIAL:
            obs_next = one_hot_encoder(s, env.get_team_blue, VISION_RANGE) # partial observation
        else:
            obs_next = one_hot_encoder(env._env, env.get_team_blue, VISION_RANGE) # Full observation
        
        # Push history for individual that 'was' alive previous frame
        for idx, agent in enumerate(env.get_team_blue):
            if was_alive[idx]:
                indv_history[idx].append([obs[idx],act[idx],r])
        
        # State Transition
        prev_reward = r1
        was_alive = [ag.isAlive for ag in env.get_team_blue]
        
        if d:
            break

    for idx, history in enumerate(indv_history):
        if len(history)==0: continue
        _history = np.array(history)
        _history[:,2] = discount_rewards(_history[:,2], gamma)
        ep_history.extend(_history)
            
    if len(ep_history) > 0:        
        ep_history = np.stack(ep_history)
    
    return [frame, ep_history, r1, env.blue_win, obs]
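
The rollout does two things to the reward signal: it stores the per-step difference `r1 - prev_reward`, and it replaces each agent's reward column with discounted returns. A minimal sketch of both steps follows. `discount_rewards` comes from `utility.utils` and may differ (for example, it might normalize the result), so `discount_rewards_sketch` and `incremental_rewards` are hypothetical.

```python
def incremental_rewards(cumulative):
    # The rollout stores r = r1 - prev_reward: the change in the running score.
    prev, out = 0.0, []
    for total in cumulative:
        out.append(total - prev)
        prev = total
    return out


def discount_rewards_sketch(rewards, gamma):
    """Hypothetical version of discount_rewards: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


steps = incremental_rewards([0.0, 0.0, 1.0, 3.0])
returns = discount_rewards_sketch(steps, gamma=0.5)
```

With `gamma = 0.99`, as set in the hyperparameters, rewards late in a 150-step episode still contribute noticeably to the return of early steps.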

## Training

In [None]:
def run_training(num_ep):
    ep = sess.run(global_step)

    exp_buffer = Experience_buffer(experience_shape=3)
    try:
        if num_ep == -1: 
            progbar = tf.keras.utils.Progbar(9999999,width=5, interval=0.5)
        else:
            progbar = tf.keras.utils.Progbar(num_ep,width=5, interval=0.5)
            
        count = 0
        while num_ep == -1 or count < num_ep:
            count += 1
            ep += 1
            progbar.update(count) # update progress bar

            # Run episode
            frame, history, reward, did_won, obs = policy_rollout(False)

            # Add history
            exp_buffer.add(history)

            batch_history = exp_buffer.sample(batch_size) # Sample from experience replay
            if len(batch_history) > 0:
                feed_dict={myAgent.reward_holder:batch_history[:,2],
                           myAgent.action_holder:batch_history[:,1],
                           myAgent.state_input:np.stack(batch_history[:,0])}
                with tf.device('/gpu:0'):
                    sess.run(myAgent.accumulate_gradient, feed_dict=feed_dict)

            if ep % update_frequency == 0 and ep != 0:
                with tf.device('/gpu:0'):
                    sess.run(myAgent.update_batch)
                    sess.run(myAgent.clear_batch)
                exp_buffer.flush()
                                
            # summarize and record
            ma_reward.append(reward)
            ma_length.append(frame)
            ma_captured.append(env.blue_win)

            if ep % save_stat_frequency == 0 and ep != 0 and len(batch_history) > 0:
                # feed_dict only exists when a batch was sampled this episode
                summary_ = sess.run(merged, feed_dict=feed_dict)
                record(summary_)

            # save weight
            if ep % save_network_frequency == 0:
                saver.save(sess, MODEL_PATH+'/ctf_policy.ckpt', global_step=global_step)

            sess.run(increment_global_step_op)
            
        return 0

    except KeyboardInterrupt:
        print('\n\nManually stopped the training (KeyboardInterrupt)')
        saver.save(sess, MODEL_PATH+'/ctf_policy.ckpt', global_step=global_step)
        print("save: ", sess.run(global_step), 'episodes')
        
        return 1
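
The training loop accumulates clipped gradients every episode and only applies them every `update_frequency` episodes, after which the buffer is cleared. The sketch below shows that accumulate, apply, clear cycle on plain lists. It substitutes ordinary gradient descent for the Adam optimizer used in the notebook, and `clip_by_norm` and `train` are illustrative helpers rather than the notebook's functions.

```python
import math


def clip_by_norm(grad, max_norm):
    # Rescale the vector only when its L2 norm exceeds max_norm.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]


def train(weights, episode_grads, lr, update_frequency, max_norm):
    """Plain gradient descent stand-in for the accumulate/apply/clear cycle."""
    buffer = [0.0] * len(weights)
    for ep, grad in enumerate(episode_grads, start=1):
        clipped = clip_by_norm(grad, max_norm)
        buffer = [b + g for b, g in zip(buffer, clipped)]  # accumulate_gradient
        if ep % update_frequency == 0:
            weights = [w - lr * b for w, b in zip(weights, buffer)]  # update_batch
            buffer = [0.0] * len(weights)  # clear_batch
    return weights, buffer


grads = [[3.0, 4.0], [0.0, 1.0], [1.0, 0.0]]
weights, leftover = train([0.0, 0.0], grads, lr=0.5, update_frequency=2, max_norm=1.0)
```

In this example the first gradient is scaled from norm 5 down to norm 1, two gradients are summed before the single update, and the third gradient remains in the buffer because no update has fired yet.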

## Run

In [None]:
print('Training with fixed policy')
policy_red = policy.zeros.PolicyGen(env.get_map, env.get_team_red)
run_training(total_episode)
print('training with fixed red: Done')

Training with fixed policy
  1594/150000 [.....] - ETA: 35:38:41