# Tutorial for A2C Reinforcement Learning using Keras

### MD Muhaimin Rahman
sezan92[at]gmail[dot]com

#### Target Readers

If you already have a working knowledge of Q-learning, deep Q-learning, and deep neural networks, this tutorial is for you. Otherwise, you should learn those topics first.

<a id ="libraries"></a>
### Importing Libraries


In [1]:
from __future__ import print_function,division
import gym
import keras
from keras import layers
from keras import backend as K
from collections import deque
from tqdm import tqdm
import random
import numpy as np
import copy
SEED =123
np.random.seed(SEED)

Using TensorFlow backend.


Important constants

In [2]:
num_episodes = 1000
steps_per_episode=200
BATCH_SIZE=256
TAU=0.001
GAMMA=0.99
actor_lr=0.001
critic_lr=0.001
SHOW= False
action_list = [0,1]#,2]

In [3]:
from keras.models import Model

<a id ="model"></a>
### Model Definition

After some trial and error, I settled on this architecture. The actor network is an MLP with three hidden layers of 320 units each, and the critic network is an MLP with three hidden layers of 640 units each. Note the return values of ```create_critic_network```: besides the model, it returns the state input tensor and the ```R_tensor``` input, which are needed later to build the actor loss.

In [4]:
def create_actor_network(state_shape,action_shape):
    state=layers.Input(shape=state_shape,name="state")
    action_prob =layers.Input(shape=(2,),name="action_probability")
    l1 =layers.Dense(320,activation="relu")(state)
    l2 =layers.Dense(320,activation="relu")(l1)
    l3 =layers.Dense(320,activation="relu")(l2)
    action =layers.Dense(action_shape,activation="softmax")(l3)
    actor= Model(state,action)
    return actor,action_prob

def create_critic_network(state_shape):
    state = layers.Input(shape=state_shape,name="state")
    R_tensor = layers.Input(shape=(1,),name="R_tensor")
    l1 = layers.Dense(640,activation="relu")(state)
    l2 = layers.Dense(640,activation="relu")(l1)
    l3 = layers.Dense(640,activation="relu")(l2)
    value = layers.Dense(1)(l3)
    critic = Model(inputs=state,outputs=value)
    return critic,state,R_tensor
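
To make the actor's shapes concrete, here is a minimal NumPy sketch of a forward pass through an MLP with the same layer sizes (4 inputs, three ReLU layers of 320 units, a 2-way softmax output). The function name `mlp_policy_forward`, the random weights, and the weight scale are assumptions for illustration only; this is not the Keras model itself.

```python
import numpy as np

def mlp_policy_forward(state, weights):
    """Forward pass: ReLU hidden layers followed by a softmax output layer."""
    h = state
    for W, b in weights[:-1]:
        h = np.maximum(0.0, np.dot(h, W) + b)  # ReLU hidden layer
    W, b = weights[-1]
    logits = np.dot(h, W) + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)  # softmax over actions

rng = np.random.RandomState(123)
sizes = [4, 320, 320, 320, 2]  # state dim, three hidden layers, two actions
# Hypothetical random weights, only to exercise the shapes
weights = [(rng.randn(m, n) * 0.05, np.zeros(n))
           for m, n in zip(sizes[:-1], sizes[1:])]

probs = mlp_policy_forward(rng.randn(1, 4), weights)
n_params = sum(W.size + b.size for W, b in weights)
print(probs.shape, probs.sum(), n_params)
```

The output is one probability per action, and each row sums to 1, which is what lets the training loop sample an action from it.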

I am using the ```CartPole-v0``` environment, mainly because my GPU is not powerful enough for higher-dimensional state spaces.

In [5]:
env = gym.make("CartPole-v0")

WARN: gym.spaces.Box autodetected dtype as <type 'numpy.float32'>. Please provide explicit dtype.


In [6]:
state_shape= env.observation_space.sample().shape

In [7]:
action_shape=(env.action_space.n,)

In [8]:
action_shape

(2,)

In [9]:
actor,action_prob = create_actor_network(state_shape,action_shape[0])
critic,state_tensor,R_tensor = create_critic_network(state_shape)

In [10]:
action_prob

<tf.Tensor 'action_probability:0' shape=(?, 2) dtype=float32>

In [11]:
R_tensor

<tf.Tensor 'R_tensor:0' shape=(?, 1) dtype=float32>

I chose the ```RMSProp``` optimizer because it trained more stably than Adam in my experiments. This is an empirical choice made by trial and error; I have no theoretical justification for it.

In [12]:
actor_optimizer = keras.optimizers.RMSprop(actor_lr)

critic_optimizer = keras.optimizers.RMSprop(critic_lr)

In [13]:
critic.compile(loss="mse",optimizer=critic_optimizer)

Instructions for updating:
keep_dims is deprecated, use keepdims instead


#### Actor training

I think this is the most critical part of implementing A2C in Keras. The ```critic``` and ```actor``` objects each have a ```__call__``` method that returns an output tensor when given an input tensor. We use this to get the tensor for the critic's value estimate.

In [14]:
CriticValues = critic([state_tensor])


In [15]:
advantage = R_tensor-CriticValues

In [16]:
logp=K.log(action_prob)

In [17]:
logp

<tf.Tensor 'Log:0' shape=(?, 2) dtype=float32>

In [18]:
TD=-logp*advantage

In [19]:
advantage

<tf.Tensor 'sub:0' shape=(?, 1) dtype=float32>

In [20]:
logp

<tf.Tensor 'Log:0' shape=(?, 2) dtype=float32>

In [21]:
TD

<tf.Tensor 'mul:0' shape=(?, 2) dtype=float32>

In [22]:
entropy= -action_prob*logp

In [23]:
action_loss = TD-0.01*entropy
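
The expressions above combine a policy-gradient term with an entropy bonus. As a rough NumPy sketch of that arithmetic on concrete numbers, the function below computes a batch loss of the form `-log pi(a|s) * advantage - beta * entropy`. It makes assumptions the cells above do not: the chosen action is picked out with a one-hot mask, both terms are summed over the action axis, and a small `eps` guards the logarithm. The name `a2c_policy_loss` and the sample values are illustrative only.

```python
import numpy as np

def a2c_policy_loss(probs, actions_one_hot, returns, values, beta=0.01, eps=1e-8):
    """Mean of -log pi(a|s) * advantage - beta * entropy over a batch."""
    logp = np.log(probs + eps)
    advantage = returns - values  # shape (batch, 1)
    # Log-probability of the action that was actually taken
    selected_logp = np.sum(actions_one_hot * logp, axis=1, keepdims=True)
    policy_term = -selected_logp * advantage
    # Entropy of the full action distribution (encourages exploration)
    entropy = -np.sum(probs * logp, axis=1, keepdims=True)
    return float(np.mean(policy_term - beta * entropy))

probs = np.array([[0.5, 0.5], [0.9, 0.1]])    # policy outputs
actions = np.array([[1.0, 0.0], [0.0, 1.0]])  # one-hot taken actions
returns = np.array([[1.0], [2.0]])            # discounted returns R
values = np.array([[0.5], [1.0]])             # critic estimates V(s)
loss = a2c_policy_loss(probs, actions, returns, values)
print(loss)
```

With zero advantage the policy term vanishes and only the entropy bonus remains, which is a quick sanity check for the sign conventions.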

In [29]:
action_loss

TypeError: 'Tensor' object is not callable

In [25]:
actor.trainable_weights

[<tf.Variable 'dense_1/kernel:0' shape=(4, 320) dtype=float32_ref>,
 <tf.Variable 'dense_1/bias:0' shape=(320,) dtype=float32_ref>,
 <tf.Variable 'dense_2/kernel:0' shape=(320, 320) dtype=float32_ref>,
 <tf.Variable 'dense_2/bias:0' shape=(320,) dtype=float32_ref>,
 <tf.Variable 'dense_3/kernel:0' shape=(320, 320) dtype=float32_ref>,
 <tf.Variable 'dense_3/bias:0' shape=(320,) dtype=float32_ref>,
 <tf.Variable 'dense_4/kernel:0' shape=(320, 2) dtype=float32_ref>,
 <tf.Variable 'dense_4/bias:0' shape=(2,) dtype=float32_ref>]

In [26]:
K.mean(action_loss)

<tf.Tensor 'Mean:0' shape=() dtype=float32>

In [30]:
updates = actor_optimizer.get_updates(params=actor.trainable_weights,loss=action_loss)

ValueError: Tried to convert 'x' to a tensor and failed. Error: None values not supported.

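Now we create a function that trains the actor network. Note that the ```get_updates``` call above fails as written: ```action_loss``` is built only from the ```action_prob``` and ```R_tensor``` inputs and the critic output, so it most likely has no gradient path to ```actor.trainable_weights```, which would explain the ```None``` values in the error.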

In [None]:
actor_train = K.function(inputs=[state_tensor,R_tensor,action_prob],outputs=[actor(state_tensor),
                                                                      K.mean(action_loss)],
                   updates=updates)


<a id ="training"></a>
### Training

In [None]:
steps_per_episodes=200
max_total_reward=0
for episode in range(num_episodes):
    values =[]
    action_probs=[]
    rewards=[]
    states=[]
    action_probs=[]
    terminals=[]
    R_list=[]
    advantages=[]
    state= env.reset()
    state = state.reshape((-1),)
    total_reward=0
    value_loss=0
    action_loss=0
#     states =deque(max)
    for step in range(steps_per_episodes):
        action_probability= actor.predict(state.reshape(1,-1))
        action = np.random.choice(action_list,p=action_probability[0])
        # One-hot encode the sampled action (comparing probability values breaks when both are equal)
        action_probability = np.zeros_like(action_probability)
        action_probability[0][action] = 1
        action_probs.append(action_probability[0])
        next_state,reward,done,_ = env.step(action)
        total_reward=total_reward+reward
        states.append(state)
        rewards.append(reward)
        terminals.append(done)
        value = critic.predict(state.reshape(1,-1))
        if SHOW:
            env.render()
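        if done or step==(steps_per_episodes-1):
            if total_reward<steps_per_episodes:
                # CartPole gives +1 per step, so a shorter episode means the pole fell
                print("Failed!",end=" ")
                R=0
            else:
                print("Passed!",end=" ")
                # Episode reached the step limit: bootstrap from the critic's estimate
                R = value[0][0]
            break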
        
        state=next_state
        
    print("Episode %d Total Reward %f"%(episode,total_reward))
    
    for t in reversed(range(len(rewards))):
        R = rewards[t]+GAMMA*R
        # Insert at the front so R_list stays aligned with states (forward time order)
        R_list.insert(0,R)
        advantage = R-critic.predict(states[t].reshape(1,-1))
        advantages.insert(0,advantage)
#         value_loss =value_loss+advantage**2
#         action_prob = actor.predict(states[t].reshape(1,-1))
#         action_log_prob = np.log(action_prob[0][action_probs[t]]+1e-5)
#         entropy= -action_prob[0][action_probs[t]]*action_log_prob
#         policy_loss = policy_loss-action_log_prob*advantage-0.01*entropy
    states = np.vstack(states)
    R_list= np.vstack(R_list)
    action_probs = np.vstack(action_probs)
    loss=critic.train_on_batch(x=states,y=R_list)
    _,action_loss = actor_train(inputs=[states,R_list,action_probs])
    print("action loss %f"%action_loss)
    #print("Weights ")
    #print(actor.get_weights()[:-1])
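
The backward pass over `rewards` in the loop above computes discounted returns. A minimal standalone sketch of that recursion, `R_t = r_t + gamma * R_{t+1}`, is shown below. The helper name `discounted_returns` and the `bootstrap_value` argument are assumptions for illustration; the sketch writes each return at index `t` so the result lines up with the states in forward time order.

```python
import numpy as np

def discounted_returns(rewards, bootstrap_value=0.0, gamma=0.99):
    """Return a (T, 1) array where row t holds r_t + gamma * R_{t+1}."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = bootstrap_value  # 0 for a terminal state, V(s) if truncated
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running   # index t keeps forward time order
    return returns.reshape(-1, 1)

example = discounted_returns([1.0, 1.0, 1.0], bootstrap_value=0.0, gamma=0.5)
print(example.ravel())  # [1.75 1.5  1.  ]
```

Earlier steps accumulate more future reward, so the first entry is the largest when all rewards are positive.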
    
        

In [None]:
action_probability

In [None]:
action_loss

### Video

Please watch at 2x speed. I fixed a few small mistakes after recording the video, so the rewards will not match exactly.

[![](http://img.youtube.com/vi/9Fe_n-ovIaA/0.jpg)](http://www.youtube.com/watch?v=9Fe_n-ovIaA "Keras tutorial DDPG")