# Capture the Flag (RL - Policy Gradient)

- Seung Hyun Kim
- skim449@illinois.edu

## Implementation Details

- Actor-critic
- Experience Replay + Retrace
- TRPO
- Off Policy
- Asynchronous
- Hindsight Experience Replay
    - Goal Replay
    - Action Replay

### Sampling
- [ ] Mini-batch to update 'average' gradient
- [x] Experience Replay
- [x] Importance Sampling
    - [x] Retrace
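As an illustration of the importance-sampling step, the sketch below computes the per-step ratio between the current policy and the behavior policy that generated the replayed action. The function name `importance_weights`, the `clip` argument, and the `eps` guard are assumptions made for this example and are not taken from the notebook's utilities.

```python
import numpy as np

def importance_weights(pi_probs, beta_probs, actions, clip=None, eps=1e-8):
    """Return rho_t = pi(a_t|s_t) / beta(a_t|s_t) for each replayed step.

    pi_probs   : (T, n_actions) action probabilities under the current policy
    beta_probs : (T,) probability the behavior policy gave to the taken action
    actions    : (T,) indices of the actions that were taken
    clip       : optional upper bound (truncated importance sampling)
    """
    pi_probs = np.asarray(pi_probs, dtype=np.float64)
    beta_probs = np.asarray(beta_probs, dtype=np.float64)
    actions = np.asarray(actions, dtype=np.int64)

    # Probability the current policy assigns to each taken action
    pi_taken = pi_probs[np.arange(len(actions)), actions]
    # Guard against division by zero for very unlikely behavior actions
    rho = pi_taken / np.maximum(beta_probs, eps)
    if clip is not None:
        rho = np.minimum(rho, clip)
    return rho

pi = [[0.7, 0.3], [0.2, 0.8]]
beta = [0.35, 0.8]
rho = importance_weights(pi, beta, actions=[0, 1], clip=1.0)
```

Truncating the ratio bounds the variance of the off-policy update at the cost of some bias, which is the trade-off Retrace and ACER are designed to manage.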

### Stability and Reducing Variance
- [x] Gradient clipping
- [x] Retrace
- [ ] Normalized Reward/Advantage
- [ ] Target Network
- [x] TRPO
- [ ] PPO
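A minimal sketch of a Retrace-style target, assuming the backward recursion used in ACER with truncated weights c_t = min(1, rho_t). Because the worker code only queries a state-value critic, this sketch lets V(s_t) stand in for Q(s_t, a_t); that substitution, and the function name `retrace_targets`, are assumptions and may differ from the notebook's `q_retrace` helper.

```python
import numpy as np

def retrace_targets(rewards, values_ext, gamma, rho):
    """Compute Retrace-style targets by scanning the trajectory backward.

    rewards    : (T,) rewards r_t
    values_ext : (T+1,) critic values V(s_0..s_T); the last entry is the bootstrap
    gamma      : discount factor
    rho        : (T,) importance weights pi/beta for the taken actions
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values_ext = np.asarray(values_ext, dtype=np.float64)
    c = np.minimum(1.0, np.asarray(rho, dtype=np.float64))  # truncated weights

    q_ret = np.zeros(len(rewards))
    next_q = values_ext[-1]  # bootstrap from the value of the final state
    for t in reversed(range(len(rewards))):
        q_ret[t] = rewards[t] + gamma * next_q
        # Carry back only the truncated off-policy correction
        next_q = c[t] * (q_ret[t] - values_ext[t]) + values_ext[t]
    return q_ret

on_policy = retrace_targets([0, 0, 1], [0.5, 0.5, 0.5, 0.0], 0.5, [1, 1, 1])
far_off_policy = retrace_targets([0, 0, 1], [0.5, 0.5, 0.5, 0.0], 0.5, [0, 0, 0])
```

With all weights at 1 the recursion reduces to bootstrapped n-step returns, and with all weights at 0 it reduces to one-step TD targets, so the truncated weight interpolates between the two.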

### Multiprocessing
- [ ] Synchronous Training (A2C)
- [x] Asynchronous Training (A3C)

### Applied Training Methods:
- [ ] Self-play

## Notes

- This notebook includes:
    - Building the structure of the policy network.
    - Training with or without rendering.
    - A saver that writes the model and weights to the ./model directory.
    - A writer that records training statistics to ./logs.

- This notebook does not include:
    - Simulation with the RL policy
        - The simulation can be run using policy_RL.py
    - cap_test.py has been modified accordingly.
    
## References
- https://github.com/awjuliani/DeepRL-Agents/blob/master/Vanilla-Policy.ipynb (source)
- https://www.youtube.com/watch?v=PDbXPBwOavc
- https://github.com/lilianweng/deep-reinforcement-learning-gym/blob/master/playground/policies/actor_critic.py (source)
- https://github.com/spro/practical-pytorch/blob/master/reinforce-gridworld/reinforce-gridworld.ipynb
- https://arxiv.org/pdf/1611.01224.pdf
- http://papers.nips.cc/paper/6538-safe-and-efficient-off-policy-reinforcement-learning.pdf

## TODO:
- [x] Check if simple off-policy VANILLA trains
- [x] Build full ACER with asynchronous training:
    - [x] ER
    - [ ] TRPO/PPO (optional)
    - [ ] Retrace (Importance Sampling)
        - [ ] Q value retrace
        - [ ] Truncated Importance Sampling
    - [x] Asynchronous
- [x] Modify and make universal MDP (include g)
- [x] Modify environment pipeline
- [x] Modify existing ACER network for Universal
- [x] Implement HER algorithm
    - [x] goal replay
    - [x] action replay
        - Disabled: the entire replay buffer is used as the goal buffer
        - The goal is sampled from the replay buffer
    - Improvisation: the multi-agent setup takes 2 global goals and 2 sampled goals
- [x] Train
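The goal-replay idea from Hindsight Experience Replay can be sketched as follows: a stored trajectory is re-scored against a substitute goal, so that an episode which missed the original goal still yields a rewarded example. The function `relabel_with_goal`, the sparse 0/1 reward, and the cut at the first step that reaches the goal are assumptions for illustration and are not the notebook's `HER.goal_replay` implementation.

```python
def relabel_with_goal(positions, goal, reach_reward=1.0):
    """Re-score a trajectory against `goal`.

    positions : positions reached after each step (e.g. GPS coordinates)
    goal      : position treated as the target for this replay
    Returns (rewards, length); the trajectory is cut at the first step
    that reaches the goal.
    """
    rewards = []
    for pos in positions:
        reached = tuple(pos) == tuple(goal)
        rewards.append(reach_reward if reached else 0.0)
        if reached:
            break
    return rewards, len(rewards)

trajectory = [(0, 1), (0, 2), (1, 2), (2, 2)]

# Original goal that was never reached: every step scores zero
miss_rewards, miss_length = relabel_with_goal(trajectory, goal=(9, 9))

# Hindsight goal sampled from the visited positions: the replay succeeds
hit_rewards, hit_length = relabel_with_goal(trajectory, goal=(1, 2))
```

Sampling the substitute goal from positions the agent actually visited guarantees that the relabeled trajectory contains at least one rewarded step.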

In [1]:
!rm -rf logs/HER_A3CER_v4/ model/HER_A3CER_v4

In [2]:
TRAIN_NAME='HER_A3CER_v4'
LOG_PATH='./logs/'+TRAIN_NAME
MODEL_PATH='./model/' + TRAIN_NAME
GPU_CAPACITY=0.5 # fraction of GPU memory this process may use (0.5 = 50%)

In [3]:
import os

import multiprocessing
import threading

import tensorflow as tf

import time
import gym
import gym_cap
import gym_cap.envs.const as CONST
import numpy as np
import random
import math
import matplotlib.pyplot as plt
%matplotlib inline

# Modules that can be used to generate a policy
import policy.random
import policy.roomba
import policy.policy_RL
import policy.zeros

# Data Processing Module
from utility.dataModule import state_processor
from utility.utils import MovingAverage as MA
from utility.utils import discount_rewards, store_args, q_retrace
from utility.her import HER

from network.u_ACER import UACER as Network

%load_ext autoreload
%autoreload 2

## Hyperparameters

In [4]:
# Replay Variables
total_episodes= 200000
max_ep = 150
entropy_beta = 0.001
trajectory_length = 150
train_frequency = 1  # not yet implemented
minibatch_size = 200
action_replay_count = 1

# Saving Related
save_network_frequency = 1200
save_stat_frequency = 128
moving_average_step = 128

# Training Variables
lr_a = 1e-5
lr_c = 5e-5

gamma = 0.98 # discount_factor

# Env Settings
MAP_SIZE = 50
VISION_RANGE = 19 # determines the network input size (2*VISION_RANGE+1 per side)
VISION_dX, VISION_dY = 2*VISION_RANGE+1, 2*VISION_RANGE+1
IN_SIZE = [None,VISION_dX,VISION_dY,6]
GPS_SIZE = [None, 2]
ACTION_SPACE = 5
N_AGENT = 1
NENV = 8


## Environment Setting

In [5]:
if not os.path.exists(MODEL_PATH):
    os.makedirs(MODEL_PATH)
    
# Create a directory for the training logs
if not os.path.exists(LOG_PATH):
    os.makedirs(LOG_PATH)

In [6]:
global_rewards = MA(moving_average_step)
global_length = MA(moving_average_step)
global_succeed = MA(moving_average_step)
global_episodes = 0

# Launch the session
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=GPU_CAPACITY, allow_growth=True)

sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
progbar = tf.keras.utils.Progbar(total_episodes,interval=1)

## Worker

In [7]:
class Worker(object):
    @store_args
    def __init__(self, name, global_network, sess, global_step=0):
        # Initialize Environment worker
        print(f'worker: {name} initiated')
        self.env = gym.make("cap-v0").unwrapped
        self.env.reset(map_size=MAP_SIZE,
                       policy_red=policy.zeros.PolicyGen(self.env.get_map, self.env.get_team_red))
        print(f'worker: {name} environment info')
        print(f'    number of blue agents : {len(self.env.get_team_blue)}')
        print(f'    number of red agents  : {len(self.env.get_team_red)}')
        
        # Create Network for Worker
        self.network = Network(in_size=IN_SIZE, gps_size=GPS_SIZE, action_size=ACTION_SPACE,
                               scope=self.name, lr_actor=lr_a, lr_critic=lr_c,
                               grad_clip_norm=0, entropy_beta = entropy_beta,
                               sess=sess, global_network=global_network)
        
        # Create HER Module
        self.her = HER(depth=6)
        
    def work(self, saver, writer):
        global global_rewards, global_episodes, global_length, global_succeed
                
        # loop
        with self.sess.as_default(), self.sess.graph.as_default():
            while not coord.should_stop() and global_episodes < total_episodes:
                s0 = self.env.reset()
                s_local_0, s_gps_0, g_global = state_processor(s0, self.env.get_team_blue, VISION_RANGE, self.env._env,
                                                               flatten=False, partial=False)
                # Use the global goals for all agents.
                # Alternative (disabled): g_global[:2] + self.her.sample_goal(2, g_global[0])
                self.goal = g_global[:]
                
                # Bootstrap
                a1, pi1 = self.network.get_action(s_local_0, s_gps_0, self.goal)
                is_alive = [ag.isAlive for ag in self.env.get_team_blue]
                indv_history = [ [] for _ in range(len(self.env.get_team_blue)) ]
                for step in range(max_ep+1):
                    # Transition
                    a, pi0 = a1, pi1
                    was_alive = is_alive
                    
                    # Action
                    s1, _, d, _ = self.env.step(a)
                    s_local_1, s_gps_1, _ = state_processor(s1, self.env.get_team_blue, VISION_RANGE, self.env._env,
                                                            flatten=False, partial=False)
                    reward = [self.her.reward(s,a,g) for s,a,g in zip(s_gps_1, a, self.goal)]
                    a1, pi1 = self.network.get_action(s_local_1, s_gps_1, self.goal)
                    is_alive = [ag.isAlive for ag in self.env.get_team_blue]
                    
                    # Force termination once the step limit is reached
                    if step == max_ep and not d:
                        d = True

                    # push to buffer
                    for idx, agent in enumerate(self.env.get_team_blue):
                        if was_alive[idx]:
                            indv_history[idx].append([[s_local_0[idx], s_gps_0[idx]],
                                                      a[idx],
                                                      [s_local_1[idx], s_gps_1[idx]],
                                                      pi0[idx][a[idx]],
                                                      reward[idx]
                                                     ])
                        # if not is_alive[idx]:
                        #    self.her.action_replay(s_gps_1[idx])
                            
                    if (step % trajectory_length == 0 and step > 0) or d:
                        self.process_history(indv_history)
                        indv_history = [[] for _ in self.env.get_team_blue]
                        
                    # Iteration Reset
                    s_local_0=s_local_1
                    s_gps_0=s_gps_1

                    if d:
                        r_episode = 1 if self.env.blue_win else -1
                        aloss, closs, etrpy = self.train()
                        break

                global_rewards.append(r_episode)
                global_length.append(step)
                global_succeed.append(self.env.blue_win)
                global_episodes += 1
                self.sess.run(global_step_next)
                progbar.update(global_episodes)
                
                if global_episodes % save_stat_frequency == 0 and global_episodes != 0:
                    summary = tf.Summary()
                    summary.value.add(tag='Records/mean_reward', simple_value=global_rewards())
                    summary.value.add(tag='Records/mean_length', simple_value=global_length())
                    summary.value.add(tag='Records/mean_succeed', simple_value=global_succeed())
                    summary.value.add(tag='summary/Entropy', simple_value=etrpy)
                    summary.value.add(tag='summary/actor_loss', simple_value=aloss)
                    summary.value.add(tag='summary/critic_loss', simple_value=closs)
                    writer.add_summary(summary,global_episodes)
                    writer.flush()
                    
                if global_episodes % save_network_frequency == 0 and global_episodes != 0:
                    saver.save(self.sess, MODEL_PATH+'/ctf_policy.ckpt', global_step=global_episodes)
                        
    def process_history(self, indv_buffer):
        for idx, buffer in enumerate(indv_buffer):
            played_size = len(buffer)
            if played_size == 0:
                continue
                
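            # Unpack each record: [[local, gps], action, [next local, next gps], beta(a|s), reward]
            local_obs, gps_obs, action, local_obs_1, gps_obs_1, beta_policy = [],[],[],[],[],[]
            for mdp in buffer:
                local_obs.append(mdp[0][0])    # local observation before the step
                gps_obs.append(mdp[0][1])      # GPS position before the step
                action.append(mdp[1])          # action taken
                local_obs_1.append(mdp[2][0])  # local observation after the step
                gps_obs_1.append(mdp[2][1])    # GPS position after the step
                beta_policy.append(mdp[3])     # behavior-policy probability of the action
                
            goal = self.goal[idx]              # this agent's global goal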
            # Goal Replay
            #     global_goal + sampled_goal
            goal_list = [goal] + random.sample(gps_obs_1, min(action_replay_count,len(gps_obs_1)))
            for subgoal in goal_list:
                # Discount Reward and Universal Advantage
                reward, length = self.her.goal_replay(gps_obs_1, action, subgoal)
                critic = self.network.get_critic(local_obs[:length],
                                                 gps_obs[:length],
                                                 [subgoal]*length)
                bootstrap = self.network.get_critic(local_obs_1[length-1:length],
                                                    gps_obs_1[length-1:length],
                                                    [subgoal])
                # Importance Sampling weight
                _, action_probs = self.network.get_action(local_obs[:length], gps_obs[:length], [subgoal]*length)
                pi_i = np.array([prob[a] for a, prob in zip(action, action_probs)])
                is_weight = pi_i / beta_policy[:length]
                
                value_ext = np.append(critic, [bootstrap])
                #td_target = reward + gamma * value_ext[1:]
                q_ret = q_retrace(reward, value_ext, gamma, is_weight)
                advantage = q_ret - value_ext[:-1]
                advantage = discount_rewards(advantage,gamma)

                td_target = q_ret[:]#.tolist()   # 2
                advantage = advantage.tolist()   # 4

                for i in range(length):
                    transition = [[local_obs[i], gps_obs[i]],
                                   action[i], td_target[i], subgoal, advantage[i], is_weight[i]]
                    self.her.store_transition(transition)
            # Action Replay
            self.her.action_replay(gps_obs_1[-1])
        
    def train(self):
        aloss, closs, entropy = [],[],[]
        #print(f'Buffer Size : {len(self.her.replay_buffer.buffer)}')
        while not self.her.buffer_empty():
            minibatch = self.her.sample_minibatch(minibatch_size)
            local_obs, gps_obs, action, advantage, goal, td_target, is_weight = [],[],[],[],[],[],[]
            for tr in minibatch:
                local_obs.append(tr[0][0])
                gps_obs.append(tr[0][1])
                action.append(tr[1])
                td_target.append(tr[2])
                goal.append(tr[3])
                advantage.append(tr[4])
                is_weight.append(tr[5])
            al, cl, entr = self.network.update_global(local_obs, gps_obs, goal,
                                       action, advantage, td_target, is_weight)
            aloss.append(al)
            closs.append(cl)
            entropy.append(entr)
            
        self.network.pull_global()
        return np.mean(aloss), np.mean(closs), np.mean(entropy)
    

## Run

In [8]:
global_step = tf.Variable(0, trainable=False, name='global_step')
global_step_next = tf.assign_add(global_step, 1)
global_weights = Network(in_size=IN_SIZE, gps_size=GPS_SIZE, action_size=ACTION_SPACE,
                         scope='global', sess=sess,
                         set_global=True)

# Local workers
workers = []
# Create one worker per environment
for idx in range(NENV):
    name = 'W_%i' % idx
    workers.append(Worker(name, global_weights, sess, global_step=global_step))
    print(f'worker: {name} initiated')
saver = tf.train.Saver(max_to_keep=3)
writer = tf.summary.FileWriter(LOG_PATH, sess.graph)
    
ckpt = tf.train.get_checkpoint_state(MODEL_PATH)
if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path):
    saver.restore(sess, ckpt.model_checkpoint_path)
    print("Load Model : ", ckpt.model_checkpoint_path)
else:
    sess.run(tf.global_variables_initializer())
    print("Initialized Variables")
    
coord = tf.train.Coordinator()
worker_threads = []
global_episodes = sess.run(global_step)

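for worker in workers:
    # Pass the bound method directly. A lambda would look up `worker` only when
    # the thread runs, so it could pick up a later worker from this loop.
    t = threading.Thread(target=worker.work, args=(saver, writer))
    t.start()
    worker_threads.append(t)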
coord.join(worker_threads)

worker: W_0 initiated
worker: W_0 environment info
    number of blue agents : 1
    number of red agents  : 0


worker: W_0 initiated
worker: W_1 initiated
worker: W_1 environment info
    number of blue agents : 1
    number of red agents  : 0
worker: W_1 initiated
worker: W_2 initiated
worker: W_2 environment info
    number of blue agents : 1
    number of red agents  : 0
worker: W_2 initiated
worker: W_3 initiated
worker: W_3 environment info
    number of blue agents : 1
    number of red agents  : 0
worker: W_3 initiated
worker: W_4 initiated
worker: W_4 environment info
    number of blue agents : 1
    number of red agents  : 0
worker: W_4 initiated
worker: W_5 initiated
worker: W_5 environment info
    number of blue agents : 1
    number of red agents  : 0
worker: W_5 initiated
worker: W_6 initiated
worker: W_6 environment info
    number of blue agents : 1
    number of red agents  : 0
worker: W_6 initiated
worker: W_7 initiated
worker: W_7 environment info
    number of blue agents : 1
    number of red agents  : 0
worker: W_7 initiated
Initialized Variables

KeyboardInterrupt: 