# A2C

A2C (advantage actor-critic) is a simpler, synchronous version of the [A3C (asynchronous advantage actor-critic)](https://arxiv.org/abs/1602.01783) algorithm. It is similar to the policy gradient algorithm, with the following adaptations:
* A number of environments are run in parallel to reduce correlations between samples.
* The neural network predicts both the policy outputs (action probabilities) and the value function. The value function is used to bootstrap the return when the trajectory length is fixed (in plain policy gradients the trajectory always had to run until the end of the episode). Additionally, the value function is used as a baseline when calculating the advantage.
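
The bootstrapped return and the advantage from the second point can be written out explicitly. This is a sketch in common textbook notation, not notation taken from this notebook: $T$ is the fixed trajectory length, $d_t$ is 1 if the episode ended at step $t$ and 0 otherwise, and $V$ is the value predicted by the network.

```latex
R_T = V(s_T), \qquad
R_t = r_t + \gamma \, (1 - d_t) \, R_{t+1} \quad (t = T-1, \dots, 0), \qquad
A_t = R_t - V(s_t)
```

If no episode ends inside the window, the recursion unrolls to $R_t = r_t + \gamma r_{t+1} + \dots + \gamma^{T-t-1} r_{T-1} + \gamma^{T-t} V(s_T)$, so the value estimate stands in for all rewards beyond the window.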

The following is a rough implementation of the algorithm, which at least seems to work for Atari Pong.

In [1]:
import gym
import numpy as np

from keras.models import Model
from keras.layers import Input, Conv2D, Flatten, Dense
from keras.initializers import RandomNormal
from keras.optimizers import RMSprop

from atari_utils import RandomizedResetEnv, AtariRescale42x42Env, AtariRescale84x84Env, ObservationBuffer

Using TensorFlow backend.


In [2]:
# helper function to create environment with required wrappers
def create_env(env_id):
    env = gym.make(env_id)
    env = RandomizedResetEnv(env)
    #env = AtariRescale42x42Env(env)
    # resize screen to 84x84
    env = AtariRescale84x84Env(env)
    env = ObservationBuffer(env)
    return env

In [3]:
# create temporary environment to fetch observation and action space
env = create_env('PongDeterministic-v4')

print("Observation space:", env.observation_space)
print("Action space:", env.action_space)

OBS_SHAPE = env.observation_space.shape
NUM_ACTIONS = env.action_space.n

env.close()

Observation space: Box(84, 84, 4)
Action space: Discrete(6)
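
The observation shape `(84, 84, 4)` suggests that each observation holds the four most recent 84x84 frames stacked along the last axis. The source of `atari_utils` is not shown here, so that is an assumption. The class below is a hypothetical stand-in, not the actual `ObservationBuffer`, and only sketches how such a buffer might work.

```python
from collections import deque

import numpy as np


class FrameStack:
    """Hypothetical frame-stacking buffer (not the atari_utils code)."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, frame):
        # fill the whole buffer with the first frame of the episode
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return self.observation()

    def push(self, frame):
        # appending to a full deque drops the oldest frame automatically
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        # stack along the last axis: four (84, 84) frames -> (84, 84, 4)
        return np.stack(list(self.frames), axis=-1)


buf = FrameStack()
obs = buf.reset(np.zeros((84, 84)))
obs = buf.push(np.ones((84, 84)))
print(obs.shape)  # (84, 84, 4)
```

Stacking consecutive frames lets a feed-forward network see motion, such as the direction of the ball, which a single frame cannot convey.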


In [None]:
# create model with standard DeepMind architecture
x = Input(shape=OBS_SHAPE, name='x')
h = Conv2D(16, 8, strides=4, padding='valid', activation='relu', name='c1')(x)
h = Conv2D(32, 4, strides=2, padding='valid', activation='relu', name='c2')(h)
h = Flatten(name='fl')(h)

# policy head
h1 = Dense(256, activation='relu', name='h1')(h)
p = Dense(NUM_ACTIONS, activation='softmax', kernel_initializer=RandomNormal(stddev=0.01), name='p')(h1)

# value head
h2 = Dense(256, activation='relu', name='h2')(h)
v = Dense(1, name="v")(h2)

# model for training
model = Model(x, [p, v])
model.compile(loss=['sparse_categorical_crossentropy', 'mse'], optimizer=RMSprop(lr=0.0003))
model.summary()

# prediction-only models for policy and value
policynet = Model(x, p)
valuenet = Model(x, v)

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
x (InputLayer)                  (None, 84, 84, 4)    0                                            
__________________________________________________________________________________________________
c1 (Conv2D)                     (None, 20, 20, 16)   4112        x[0][0]                          
__________________________________________________________________________________________________
c2 (Conv2D)                     (None, 9, 9, 32)     8224        c1[0][0]                         
__________________________________________________________________________________________________
fl (Flatten)                    (None, 2592)         0           c2[0][0]                         
_____________________________________
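
The output shapes and parameter counts in the summary above can be checked by hand. The sketch below uses the usual formulas for a convolution with `'valid'` padding; the helper names `conv_out` and `conv_params` are made up for this example.

```python
def conv_out(size, kernel, stride):
    # output length of a 'valid' convolution along one spatial axis
    return (size - kernel) // stride + 1


def conv_params(kernel, in_channels, filters):
    # kernel x kernel x in_channels weights per filter, plus one bias per filter
    return kernel * kernel * in_channels * filters + filters


c1_size = conv_out(84, kernel=8, stride=4)
c2_size = conv_out(c1_size, kernel=4, stride=2)
flat = c2_size * c2_size * 32

print(c1_size, conv_params(8, 4, 16))   # 20 4112
print(c2_size, conv_params(4, 16, 32))  # 9 8224
print(flat)                             # 2592
```

These numbers match the `c1`, `c2` and `fl` rows of the summary.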

In [None]:
GAMMA = 0.9

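# discounting using value function for bootstrapping
# rewards, dones: arrays of shape (NUM_ACTORS, NUM_TIMESTEPS)
# value: value estimates for the state after the last timestep, shape (NUM_ACTORS,)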
def discount(rewards, dones, value):
    ret = value
    returns = np.empty_like(rewards)
    for t in reversed(range(rewards.shape[1])):
        ret = rewards[:, t] + GAMMA * (1 - dones[:, t]) * ret
        returns[:, t] = ret
    return returns
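
To see what `discount` computes, here is a small worked example with two actors and three timesteps. The function is repeated so that the block runs on its own; the rewards, done flags and bootstrap values are made up for illustration.

```python
import numpy as np

GAMMA = 0.9


def discount(rewards, dones, value):
    ret = value
    returns = np.empty_like(rewards)
    for t in reversed(range(rewards.shape[1])):
        ret = rewards[:, t] + GAMMA * (1 - dones[:, t]) * ret
        returns[:, t] = ret
    return returns


# two actors, three timesteps each
rewards = np.array([[0.0, 0.0, 1.0],
                    [0.0, -1.0, 0.0]])
# the second actor's episode ends at t=1
dones = np.array([[0, 0, 0],
                  [0, 1, 0]], dtype=np.uint8)
# value estimates for the state after the last timestep
bootstrap = np.array([2.0, 3.0])

returns = discount(rewards, dones, bootstrap)
print(returns)
# actor 0: approximately [2.268, 2.52, 2.8]
# actor 1: approximately [-0.9, -1.0, 2.7]
```

For the first actor the bootstrap value flows back through all three steps. For the second actor the episode end at `t=1` cuts the chain: the return at `t=1` is just the reward `-1.0`, and nothing from the next episode leaks into it.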

In [None]:
NUM_ACTORS = 20
NUM_TIMESTEPS = 5

# create buffers for training data
states = np.empty((NUM_ACTORS, NUM_TIMESTEPS + 1) + OBS_SHAPE)
actions = np.empty((NUM_ACTORS, NUM_TIMESTEPS), dtype=np.uint8)
rewards = np.empty((NUM_ACTORS, NUM_TIMESTEPS))
dones = np.empty((NUM_ACTORS, NUM_TIMESTEPS), dtype=np.uint8)

# create environments for all actors
envs = []
for i in range(NUM_ACTORS):
    env = create_env('PongDeterministic-v4')
    envs.append(env)

    state = env.reset()
    states[i, 0] = state
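
The `states` buffer holds `NUM_TIMESTEPS + 1` observations per actor, so that the observation after the last step is available for bootstrapping. The sketch below uses tiny made-up shapes to illustrate how such a buffer can be flattened for a batched prediction and reshaped back without mixing up actors and timesteps.

```python
import numpy as np

NUM_ACTORS, NUM_TIMESTEPS = 3, 2
OBS_SHAPE = (2, 2, 1)  # tiny stand-in for (84, 84, 4)

states = np.zeros((NUM_ACTORS, NUM_TIMESTEPS + 1) + OBS_SHAPE)
# tag each observation with actor * 10 + timestep so it can be tracked
for i in range(NUM_ACTORS):
    for t in range(NUM_TIMESTEPS + 1):
        states[i, t] = i * 10 + t

# merge the actor and timestep axes into one batch axis
flat = states.reshape((-1,) + states.shape[2:])
print(flat.shape)  # (9, 2, 2, 1)

# pretend the "value" of an observation is its tag, then restore both axes
values = flat[:, 0, 0, 0].reshape((NUM_ACTORS, NUM_TIMESTEPS + 1))
print(values[:, -1])  # last-timestep values per actor: 2, 12, 22
```

The last column contains exactly the entries for the extra, final observation of each actor, which is the slot used for the bootstrap value.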



In [None]:
TOTAL_TIMESTEPS = 5000000

episode_rewards = []
episode_lengths = []
actor_rewards = np.zeros(NUM_ACTORS)
actor_lengths = np.zeros(NUM_ACTORS)
# train for 5M timesteps
for n in range(TOTAL_TIMESTEPS // (NUM_ACTORS * NUM_TIMESTEPS)):
    # train every NUM_TIMESTEPS
    for t in range(NUM_TIMESTEPS):
        # predict action probabilities using policy network
        probs = policynet.predict_on_batch(states[:,t])
        # step all actors
        for i in range(NUM_ACTORS):
            # sample action
            action = np.random.choice(NUM_ACTIONS, p=probs[i])
            # step environment
            state, reward, done, info = envs[i].step(action)

            # record episode reward and length
            actor_rewards[i] += reward
            actor_lengths[i] += 1

            # if episode finished
            if done:
                # start new episode
                state = envs[i].reset()
                
                # log episode data
                episode_rewards.append(actor_rewards[i])
                episode_lengths.append(actor_lengths[i])
                actor_rewards[i] = 0
                actor_lengths[i] = 0
                
                print("Episode:", len(episode_rewards), "Reward:", episode_rewards[-1], "Length:", episode_lengths[-1], "Actor:", i, "Timestep:", n * NUM_ACTORS * NUM_TIMESTEPS + t * NUM_ACTORS + i + 1)

            # record training data
            states[i, t + 1] = state
            actions[i, t] = action
            rewards[i, t] = reward
            dones[i, t] = done
    
    # predict baseline values for all states, including the last
    values = valuenet.predict_on_batch(states.reshape((-1,) + states.shape[2:]))
    values = values.reshape((NUM_ACTORS, NUM_TIMESTEPS + 1))
    # perform discounting using the last timestep value for bootstrapping
    returns = discount(rewards, dones, values[:, -1])
    # compute advantages
    advantages = returns - values[:, :-1]

    # train model
    losses = model.train_on_batch(states[:, :-1].reshape((-1,) + states.shape[2:]), 
        [actions.reshape((-1, 1)), returns.reshape((-1, 1))], 
        sample_weight=[advantages.flatten(), None])
    #print("Timestep:", n, "losses:", losses)
    
    # copy last state to be the first
    states[:, 0] = states[:, -1]

Episode: 1 Reward: -21.0 Length: 761.0 Actor: 0 Timestep: 15201
Episode: 2 Reward: -21.0 Length: 761.0 Actor: 4 Timestep: 15205
Episode: 3 Reward: -21.0 Length: 762.0 Actor: 7 Timestep: 15228
Episode: 4 Reward: -21.0 Length: 762.0 Actor: 9 Timestep: 15230
Episode: 5 Reward: -21.0 Length: 790.0 Actor: 5 Timestep: 15786
Episode: 6 Reward: -21.0 Length: 817.0 Actor: 1 Timestep: 16322
Episode: 7 Reward: -21.0 Length: 817.0 Actor: 14 Timestep: 16335
Episode: 8 Reward: -21.0 Length: 820.0 Actor: 15 Timestep: 16396
Episode: 9 Reward: -21.0 Length: 844.0 Actor: 17 Timestep: 16878
Episode: 10 Reward: -21.0 Length: 848.0 Actor: 18 Timestep: 16959
Episode: 11 Reward: -20.0 Length: 871.0 Actor: 12 Timestep: 17413
Episode: 12 Reward: -21.0 Length: 880.0 Actor: 16 Timestep: 17597
Episode: 13 Reward: -21.0 Length: 911.0 Actor: 19 Timestep: 18220
Episode: 14 Reward: -21.0 Length: 913.0 Actor: 8 Timestep: 18249
Episode: 15 Reward: -20.0 Length: 925.0 Acto

Episode: 122 Reward: -21.0 Length: 757.0 Actor: 4 Timestep: 107365
Episode: 123 Reward: -21.0 Length: 758.0 Actor: 15 Timestep: 107596
Episode: 124 Reward: -21.0 Length: 761.0 Actor: 18 Timestep: 108119
Episode: 125 Reward: -21.0 Length: 758.0 Actor: 7 Timestep: 108488
Episode: 126 Reward: -21.0 Length: 758.0 Actor: 0 Timestep: 108721
Episode: 127 Reward: -21.0 Length: 762.0 Actor: 12 Timestep: 109553
Episode: 128 Reward: -21.0 Length: 759.0 Actor: 13 Timestep: 109754
Episode: 129 Reward: -21.0 Length: 757.0 Actor: 17 Timestep: 109978
Episode: 130 Reward: -21.0 Length: 761.0 Actor: 16 Timestep: 110177
Episode: 131 Reward: -21.0 Length: 757.0 Actor: 9 Timestep: 110350
Episode: 132 Reward: -21.0 Length: 758.0 Actor: 6 Timestep: 111067
Episode: 133 Reward: -21.0 Length: 759.0 Actor: 10 Timestep: 111811
Episode: 134 Reward: -21.0 Length: 761.0 Actor: 8 Timestep: 112049
Episode: 135 Reward: -21.0 Length: 758.0 Actor: 1 Timestep: 113562
Episode: 136 Reward: -20.0 Length: 844.0 Actor: 14 Time

Episode: 244 Reward: -21.0 Length: 762.0 Actor: 7 Timestep: 199868
Episode: 245 Reward: -20.0 Length: 833.0 Actor: 18 Timestep: 200699
Episode: 246 Reward: -21.0 Length: 762.0 Actor: 12 Timestep: 200733
Episode: 247 Reward: -21.0 Length: 759.0 Actor: 13 Timestep: 201154
Episode: 248 Reward: -21.0 Length: 759.0 Actor: 0 Timestep: 201261
Episode: 249 Reward: -21.0 Length: 760.0 Actor: 16 Timestep: 201417
Episode: 250 Reward: -21.0 Length: 762.0 Actor: 17 Timestep: 202498
Episode: 251 Reward: -21.0 Length: 761.0 Actor: 10 Timestep: 203231
Episode: 252 Reward: -21.0 Length: 759.0 Actor: 8 Timestep: 203329
Episode: 253 Reward: -21.0 Length: 760.0 Actor: 1 Timestep: 204882
Episode: 254 Reward: -20.0 Length: 914.0 Actor: 6 Timestep: 205667
Episode: 255 Reward: -21.0 Length: 758.0 Actor: 9 Timestep: 205710
Episode: 256 Reward: -21.0 Length: 762.0 Actor: 11 Timestep: 206312
Episode: 257 Reward: -21.0 Length: 761.0 Actor: 14 Timestep: 207915
Episode: 258 Reward: -21.0 Length: 759.0 Actor: 3 Time

Episode: 366 Reward: -21.0 Length: 904.0 Actor: 15 Timestep: 300916
Episode: 367 Reward: -20.0 Length: 964.0 Actor: 12 Timestep: 301293
Episode: 368 Reward: -21.0 Length: 927.0 Actor: 18 Timestep: 302639
Episode: 369 Reward: -20.0 Length: 898.0 Actor: 6 Timestep: 303287
Episode: 370 Reward: -21.0 Length: 818.0 Actor: 16 Timestep: 305077
Episode: 371 Reward: -20.0 Length: 895.0 Actor: 8 Timestep: 305889
Episode: 372 Reward: -21.0 Length: 938.0 Actor: 7 Timestep: 306888
Episode: 373 Reward: -20.0 Length: 960.0 Actor: 9 Timestep: 307290
Episode: 374 Reward: -20.0 Length: 896.0 Actor: 11 Timestep: 307552
Episode: 375 Reward: -21.0 Length: 820.0 Actor: 5 Timestep: 309546
Episode: 376 Reward: -21.0 Length: 998.0 Actor: 10 Timestep: 309691
Episode: 377 Reward: -21.0 Length: 971.0 Actor: 3 Timestep: 309904
Episode: 378 Reward: -21.0 Length: 1004.0 Actor: 1 Timestep: 310342
Episode: 379 Reward: -20.0 Length: 1012.0 Actor: 2 Timestep: 313323
Episode: 380 Reward: -20.0 Length: 928.0 Actor: 14 Tim

Episode: 487 Reward: -21.0 Length: 848.0 Actor: 7 Timestep: 419648
Episode: 488 Reward: -20.0 Length: 1017.0 Actor: 9 Timestep: 420010
Episode: 489 Reward: -20.0 Length: 985.0 Actor: 0 Timestep: 420561
Episode: 490 Reward: -20.0 Length: 896.0 Actor: 5 Timestep: 420686
Episode: 491 Reward: -21.0 Length: 841.0 Actor: 4 Timestep: 422545
Episode: 492 Reward: -20.0 Length: 1267.0 Actor: 13 Timestep: 422614
Episode: 493 Reward: -19.0 Length: 1119.0 Actor: 16 Timestep: 422617
Episode: 494 Reward: -19.0 Length: 958.0 Actor: 1 Timestep: 422762
Episode: 495 Reward: -20.0 Length: 1048.0 Actor: 8 Timestep: 425329
Episode: 496 Reward: -20.0 Length: 1073.0 Actor: 14 Timestep: 426655
Episode: 497 Reward: -20.0 Length: 885.0 Actor: 3 Timestep: 427004
Episode: 498 Reward: -20.0 Length: 868.0 Actor: 10 Timestep: 428311
Episode: 499 Reward: -20.0 Length: 895.0 Actor: 18 Timestep: 430559
Episode: 500 Reward: -20.0 Length: 895.0 Actor: 2 Timestep: 431223
Episode: 501 Reward: -21.0 Length: 838.0 Actor: 12 T

Episode: 608 Reward: -19.0 Length: 960.0 Actor: 7 Timestep: 539808
Episode: 609 Reward: -20.0 Length: 986.0 Actor: 13 Timestep: 540934
Episode: 610 Reward: -21.0 Length: 1019.0 Actor: 1 Timestep: 541162
Episode: 611 Reward: -16.0 Length: 1297.0 Actor: 16 Timestep: 541397
Episode: 612 Reward: -21.0 Length: 1013.0 Actor: 17 Timestep: 542498
Episode: 613 Reward: -20.0 Length: 1174.0 Actor: 5 Timestep: 542506
Episode: 614 Reward: -19.0 Length: 1046.0 Actor: 3 Timestep: 545784
Episode: 615 Reward: -20.0 Length: 841.0 Actor: 8 Timestep: 546029
Episode: 616 Reward: -20.0 Length: 1245.0 Actor: 14 Timestep: 546355
Episode: 617 Reward: -20.0 Length: 1086.0 Actor: 12 Timestep: 547793
Episode: 618 Reward: -21.0 Length: 942.0 Actor: 11 Timestep: 547812
Episode: 619 Reward: -21.0 Length: 867.0 Actor: 9 Timestep: 548870
Episode: 620 Reward: -17.0 Length: 1279.0 Actor: 2 Timestep: 550883
Episode: 621 Reward: -18.0 Length: 1134.0 Actor: 4 Timestep: 553265
Episode: 622 Reward: -18.0 Length: 1040.0 Actor
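
The training step passes the advantages as `sample_weight` for the `sparse_categorical_crossentropy` loss. The NumPy sketch below is independent of Keras and uses made-up numbers. It illustrates why this gives the policy gradient loss: the cross-entropy against the action that was taken is `-log p(a|s)`, so weighting it by the advantage gives `-A * log p(a|s)`. How Keras reduces the weighted per-sample losses to a single number is not reproduced here; a plain mean is assumed.

```python
import numpy as np

# action probabilities from a hypothetical policy: 3 samples, 2 actions
probs = np.array([[0.7, 0.3],
                  [0.2, 0.8],
                  [0.5, 0.5]])
actions = np.array([0, 1, 1])            # actions that were actually taken
advantages = np.array([1.0, -0.5, 0.0])

# sparse categorical cross-entropy: -log of the probability of the taken action
cross_entropy = -np.log(probs[np.arange(len(actions)), actions])

# weighting each sample by its advantage gives the policy gradient loss terms
weighted = advantages * cross_entropy
policy_loss = weighted.mean()  # assumed reduction, see the note above
print(weighted, policy_loss)
```

Minimizing the first term raises the probability of an action with positive advantage, minimizing the second lowers the probability of an action with negative advantage, and a zero advantage contributes nothing.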

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# plot results
plt.figure(figsize=(13, 5))
plt.subplot(1, 2, 1)
plt.plot(episode_rewards)
plt.title("Episode rewards")
plt.subplot(1, 2, 2)
plt.plot(episode_lengths)
plt.title("Episode lengths")
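
Per-episode rewards are noisy, so a moving average can make the trend easier to read. The helper `moving_average` below is a suggested addition, not part of the original notebook, and the demo rewards are made up rather than taken from the run above.

```python
import numpy as np


def moving_average(values, window):
    # mean over a sliding window; the output is shorter by window - 1 entries
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode='valid')


# made-up rewards purely for illustration
demo_rewards = [-21.0, -21.0, -20.0, -19.0, -21.0, -18.0]
smoothed = moving_average(demo_rewards, window=3)
print(smoothed)
```

In the notebook, something like `plt.plot(moving_average(episode_rewards, 100))` could be drawn alongside the raw curve; the window size of 100 is an arbitrary choice.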