# Capture the Flag (A3C Benchmark Training)

- Seung Hyun Kim
- skim449@illinois.edu

## Implementation Details

- Actor-critic
- On Policy
- Self-play

### Stability and Reducing Variance
- [x] Gradient clipping
- [x] Normalized Reward/Advantage
- [ ] Target Network
- [ ] TRPO
- [ ] PPO
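
The two checked items above can be illustrated without any framework. This notebook does not show how `network.a3c` applies them, so the following is only a minimal sketch of the general techniques; the function names `normalize` and `clip_by_global_norm` and the flat list of gradients are assumptions made for illustration.

```python
import math

def normalize(values, eps=1e-8):
    """Shift values to zero mean and scale them to unit standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / (std + eps) for v in values]

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for g in grads))
    if global_norm <= max_norm:
        return list(grads), global_norm
    scale = max_norm / global_norm
    return [g * scale for g in grads], global_norm

# Hypothetical advantages and gradients, chosen only to make the arithmetic easy.
advantages = normalize([1.0, 2.0, 3.0, 4.0])
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Normalizing the advantage keeps the scale of the policy gradient roughly constant across batches, and clipping by the global norm preserves the gradient direction while bounding the step size.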

### Multiprocessing
- [ ] Synchronous Training (A2C)
- [x] Asynchronous Training (A3C)

### Applied Training Methods:
- [x] Self-play
- [ ] Batch Policy

## Notes

- This notebook includes:
    - Building the structure of the policy-driven network.
    - Training with or without rendering.
    - A saver that writes the model and weights to the ./model directory.
    - A writer that records training data to ./logs.

- This notebook does not include:
    - Simulation with RL policy
        - The simulation can be done using policy_RL.py
    - cap_test.py is changed appropriately.
    
## References :
- https://github.com/awjuliani/DeepRL-Agents/blob/master/Vanilla-Policy.ipynb (source)
- https://www.youtube.com/watch?v=PDbXPBwOavc
- https://github.com/lilianweng/deep-reinforcement-learning-gym/blob/master/playground/policies/actor_critic.py (source)
- https://github.com/spro/practical-pytorch/blob/master/reinforce-gridworld/reinforce-gridworld.ipynb

## TODO:


!rm -rf logs/A3C_benchmark2/ model/A3C_benchmark2

In [1]:
TRAIN_NAME='A3C_benchmark2'
LOG_PATH='./logs/'+TRAIN_NAME
MODEL_PATH='./model/' + TRAIN_NAME
GPU_CAPACITY=0.3 # fraction of GPU memory this process may use (0.3 = 30%)

In [2]:
import os
import configparser

import signal
import threading
import multiprocessing

import tensorflow as tf

import time
import gym
import numpy as np
import random

# the modules that you can use to generate the policy. 
import policy.roomba
import policy.policy_A3C

# Data Processing Module
from utility.dataModule import one_hot_encoder as one_hot_encoder
from utility.utils import MovingAverage as MA
from utility.utils import Experience_buffer, discount_rewards


from network.a3c import ActorCritic as Network
from network.base import initialize_uninitialized_vars

from worker.worker import Worker

%load_ext autoreload
%autoreload 2
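
The imports above include a `discount_rewards` helper whose implementation is not shown in this notebook. As a rough illustration of what such a helper usually computes, the sketch below builds discounted returns for one episode. The name `discounted_returns`, its signature, and the `gamma` values are assumptions, not the actual `utility.utils` API.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

result = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
print(result)  # [1.5, 1.0, 2.0]
```

In an actor-critic setup these returns typically serve as the critic's regression target, and the advantage is the return minus the critic's value estimate.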

## Hyperparameters

In [3]:
# Importing global configuration
config = configparser.ConfigParser()
config.read('config.ini')

## Environment
action_space = config.getint('DEFAULT','ACTION_SPACE')
vision_range = 9  # config.getint('DEFAULT','VISION_RANGE')

moving_average_step = config.getint('TRAINING','MOVING_AVERAGE_SIZE')

## GPU
gpu_capacity = GPU_CAPACITY #config.getfloat('GPU_CONFIG','GPU_CAPACITY')
gpu_allowgrow = config.getboolean('GPU_CONFIG', 'GPU_ALLOWGROW')
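
The contents of `config.ini` are not part of this notebook. The sketch below shows a file layout that would satisfy the lookups in the cell above; the section and key names are taken from that cell, while every value is a hypothetical placeholder.

```python
import configparser

# Hypothetical config.ini contents; only the section and key names come from the notebook.
SAMPLE_CONFIG = """
[DEFAULT]
ACTION_SPACE = 5
VISION_RANGE = 9

[TRAINING]
MOVING_AVERAGE_SIZE = 100

[GPU_CONFIG]
GPU_CAPACITY = 0.3
GPU_ALLOWGROW = True
"""

sample = configparser.ConfigParser()
sample.read_string(SAMPLE_CONFIG)

sample_action_space = sample.getint('DEFAULT', 'ACTION_SPACE')
sample_ma_size = sample.getint('TRAINING', 'MOVING_AVERAGE_SIZE')
sample_gpu_capacity = sample.getfloat('GPU_CONFIG', 'GPU_CAPACITY')
sample_gpu_allowgrow = sample.getboolean('GPU_CONFIG', 'GPU_ALLOWGROW')
```

Note that `ConfigParser.read` silently skips files it cannot find, so a missing `config.ini` only surfaces later as a missing-option error on the first `getint` call.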

In [4]:
# Env Settings
vision_dx, vision_dy = 2*vision_range+1, 2*vision_range+1
nchannel = 6
in_size = [None,vision_dx,vision_dy,nchannel]
nenv = 4  # number of parallel workers; alternatively int(multiprocessing.cpu_count())

# Asynch Settings
global_scope = 'global'
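
As a worked example of the shape arithmetic above: with `vision_range = 9` and six channels, each observation is a 19x19x6 tensor. The interpretation of the window as `vision_range` cells on each side of the agent plus the agent's own cell is an assumption; the cell above only gives the formula.

```python
vision_range = 9
nchannel = 6

# Side length: vision_range cells on each side plus one centre cell (assumed interpretation).
side = 2 * vision_range + 1
in_size = [None, side, side, nchannel]  # None stands for the batch dimension

# Number of scalar inputs the network sees per observation.
features_per_obs = side * side * nchannel
```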

## Environment Setting

In [None]:
if not os.path.exists(MODEL_PATH):
    os.makedirs(MODEL_PATH)
    
# Create a directory for the summary logs
if not os.path.exists(LOG_PATH):
    os.makedirs(LOG_PATH)

In [None]:
global_rewards = MA(moving_average_step)
global_ep_rewards = MA(moving_average_step)
global_length = MA(moving_average_step)
global_succeed = MA(moving_average_step)
global_episodes = 0

# Launch the session
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=gpu_capacity, allow_growth=gpu_allowgrow)

sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
progbar = tf.keras.utils.Progbar(1e6,interval=1)
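
The `MA` objects above track recent rewards, episode lengths, and success rates over a fixed window. The interface of `utility.utils.MovingAverage` is not shown in this notebook, so the class below is a hypothetical stand-in that only sketches the idea of a fixed-size moving average.

```python
from collections import deque

class WindowedAverage:
    """Hypothetical stand-in: mean over the most recent `size` values."""

    def __init__(self, size):
        # A bounded deque drops the oldest value automatically when full.
        self.values = deque(maxlen=size)

    def append(self, value):
        self.values.append(value)

    def mean(self):
        return sum(self.values) / len(self.values) if self.values else 0.0

avg = WindowedAverage(3)
for reward in [1.0, 2.0, 3.0, 4.0]:
    avg.append(reward)
# The window now holds 2.0, 3.0, 4.0, so avg.mean() is 3.0.
```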

## Run

In [None]:
# Global Network
global_step = tf.Variable(0, trainable=False, name='global_step')
global_step_next = tf.assign_add(global_step, 1)
global_network = Network(in_size=in_size, action_size=action_space, scope=global_scope, sess=sess)
global_vars = global_network.get_vars
global_vars.append(global_step)


In [None]:
# Local workers
workers = []
# Create one worker per environment
for idx in range(nenv):
    name = 'W_%i' % idx
    workers.append(Worker(name, global_network, sess,
                 global_episodes=global_step, increment_step_op=global_step_next,
                 progbar=progbar, selfplay=False))
    print(f'worker: {name} initiated')

#saver = tf.train.Saver(var_list=global_vars, max_to_keep=3)
saver = tf.train.Saver(max_to_keep=3)
writer = tf.summary.FileWriter(LOG_PATH, sess.graph)
    
ckpt = tf.train.get_checkpoint_state(MODEL_PATH)
if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path):
    saver.restore(sess, ckpt.model_checkpoint_path)
    print("Load Model : ", ckpt.model_checkpoint_path)
    initialize_uninitialized_vars(sess)
    print("Initialized uninitialized variables: Done")
else:
    sess.run(tf.global_variables_initializer())
    print("Initialized Variables")
    
coord = tf.train.Coordinator()
worker_threads = []
global_episodes = sess.run(global_step)

saver.save(sess, MODEL_PATH+'/ctf_policy.ckpt', global_step=global_episodes)
print('    initial save done')

recorder = {'reward':global_rewards,
            'length':global_length,
            'succeed':global_succeed}

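for worker in workers:
    # Pass the bound method and its arguments directly. A lambda would look up
    # `worker` only when the thread runs, after the loop may have moved on.
    t = threading.Thread(target=worker.work,
                         args=(saver, writer, coord, recorder, MODEL_PATH))
    t.start()
    worker_threads.append(t)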
coord.join(worker_threads)
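
The cell above starts one thread per worker and waits for all of them through a `tf.train.Coordinator`. The sketch below reproduces that pattern with the standard library only: `threading.Event` stands in for the coordinator's stop signal, and the `work` function, the shared counter, and the episode budget of 100 are hypothetical.

```python
import threading

def work(name, stop_event, counter, lock, max_episodes):
    """Hypothetical worker loop: count 'episodes' until the shared budget is used up."""
    while not stop_event.is_set():
        with lock:
            if counter['episodes'] >= max_episodes:
                stop_event.set()  # ask every other worker to stop as well
                break
            counter['episodes'] += 1
            counter[name] = counter.get(name, 0) + 1

stop_event = threading.Event()
lock = threading.Lock()
counter = {'episodes': 0}

threads = []
for idx in range(4):
    # Arguments are bound when the thread is created, not when it first runs.
    t = threading.Thread(target=work,
                         args=('W_%i' % idx, stop_event, counter, lock, 100))
    t.start()
    threads.append(t)

# Block until every worker has seen the stop signal and returned.
for t in threads:
    t.join()
```

Passing `target` and `args` explicitly avoids the late-binding pitfall of building a `lambda` inside the loop, where every closure refers to the same loop variable.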

  result = entry_point.load(False)


---- String Representation of Environment ----
map objects : 
  Blue : 5 ground, 0 air
  Red  : 5 ground, 0 air
settings : 
  Stochastic Attack On
  Stochastic Map On
  Red operates under full vision
  Dense Reward
worker: W_0 initiated
---- String Representation of Environment ----
map objects : 
  Blue : 5 ground, 0 air
  Red  : 5 ground, 0 air
settings : 
  Stochastic Attack On
  Stochastic Map On
  Red operates under full vision
  Dense Reward
worker: W_1 initiated
---- String Representation of Environment ----
map objects : 
  Blue : 5 ground, 0 air
  Red  : 5 ground, 0 air
settings : 
  Stochastic Attack On
  Stochastic Map On
  Red operates under full vision
  Dense Reward
worker: W_2 initiated
---- String Representation of Environment ----
map objects : 
  Blue : 5 ground, 0 air
  Red  : 5 ground, 0 air
settings : 
  Stochastic Attack On
  Stochastic Map On
  Red operates under full vision
  Dense Reward
worker: W_3 initiated
INFO:tensorflow:Restoring parameters from ./model/A3