## Reinforcement Learning Tutorial -3: DDPG

### MD Muhaimin Rahman
contact: sezan92[at]gmail[dot]com

In the last tutorial, I tried to Explain DQN. DQN solves one problem, that is it can deal with continuous state space. But it cannot output continuous action. To solve that problem, here comes DDPG! It means Deep Deterministic Policy Gradient

- [Importing Libraries](#libraries)
- [Algorithm](#algorithm)
- [Model Definition](#model)
- [Replay Buffer](#buffer)
- [Noise Class](#noise)
- [Training](#training)


<a id ="libraries"></a>
### Importing Libraries


In [1]:
from __future__ import print_function,division
import gym
import keras
from keras import layers
from keras import backend as K
from collections import deque
from tqdm import tqdm
import random
import numpy as np
import copy
SEED =123
np.random.seed(SEED)

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


Important constants

In [2]:
num_episodes = 100
steps_per_episode=500
BATCH_SIZE=256
TAU=0.001
GAMMA=0.95
actor_lr=0.0001
critic_lr=0.001
SHOW= False

In [3]:
from keras.models import Model

<a id ="algorithm"></a>
### Algorithm

The actual algorithm was developed by Timothy lilicap et al. The algorithm is an actor-critic based algorithm.. Which means, it has two networks to train- an actor network, which predicts action based on the current state. The other networ- known as Critic network- evaluates the state and action. This is the case for all actor-critic networks. The critic network is updated using Bellman Equation like DQN. The difference is the training of actor network. In DDPG , we train Actor network by trying to get the maximum value of gradient of $Q(s,a)$ for given action $a$ in a state $s$. In a normal machine learning classification and regression algorithm, our target is to get the value with minimum loss. Then we train the network by gradient descent technique using the gradient of Loss .

\begin{equation}
\theta \gets \theta - \alpha \frac{\partial L}{\partial \theta}
\end{equation}

Here, $\theta$ is the weight parameter of the network, and $L$ is loss

But in our case, we have to get the maximize the $Q$ value. So we have to set the weight parameters such that we get the maximum $Q$ value. This technique is known as Gradient Ascent, as it does the exact opposite of Gradient Descent

\begin{equation}
\theta_a \gets \theta_a - \alpha (-\frac{\partial Q(s,a) }{\partial \theta_a}) 
\end{equation}

The above equation looks like the actual Gradient Descent equation. Only difference is , the minus sign. It makes the equation to minimize the negative value of $Q$ , which in turn maximizes $Q$ value.


So the training is as following

- 1) Define Actor network $actor$ and Critic Network $critic$
- 2) Define Target Actor and Critic Networks - $actor_{target}$ and $critic_{target}$ with exact same weights
- 3) Initialize Replay Buffer 
- 4) Get the initial state , $state$
- 5) Get the action $a$ from , $a \gets actor(state)$ + Noise .[Here Noise is given to make the process stochastic and not so deterministic. The paper uses ornstein uhlenbeck noise process , so we will as well]
- 6) Get Next state $state_{next}$ , Reward $r$ , Terminal from agent for given $state$ and $action$
- 7) Add the experience , $state$,$action$,$reward$,$state_{next}$,$terminal$ to replay buffer
- 8) Get sample minibatch from Replay buffer
- 9) Train Critic Network Using Bellman Equation. Like DQN
- 10) Train Actor Network using Gradient Ascent with gradients of $Q$ . $\theta_a \gets \theta_a - \alpha (-\frac{\partial Q(s,a) }{\partial \theta_a}) $
- 11) Update weights of $actor_{target}$ and $critic_{target}$ using the equation $ \theta \gets \tau \theta + (1-\tau)\theta_{target}$

<a id ="model"></a>
### Model Definition

After some trials and errors, I have selected this network. The Actor Network is 3 layer MLP with 320 hidden nodes in each layer. The critic network is also a 3 layer MLP with 640 hidden nodes in each layer.Notice that the return arguments of function ```create_critic_network```.

In [4]:
def create_actor_network(state_shape,action_shape):
    in1=layers.Input(shape=state_shape,name="state")
    l1 =layers.Dense(320,activation="relu")(in1)
    l2 =layers.Dense(320,activation="relu")(l1)
    l3 =layers.Dense(320,activation="relu")(l2)
    action =layers.Dense(action_shape,activation="tanh")(l3)
    actor= Model(in1,action)
    return actor

def create_critic_network(state_shape,action_shape):
    in1 = layers.Input(shape=state_shape,name="state")
    in2 = layers.Input(shape=action_shape,name="action")
    l1 = layers.concatenate([in1,in2])
    l2 = layers.Dense(640,activation="relu")(l1)
    l3 = layers.Dense(640,activation="relu")(l2)
    l4 = layers.Dense(640,activation="relu")(l3)
    value = layers.Dense(1)(l4)
    critic = Model(inputs=[in1,in2],outputs=value)
    return critic,in1,in2

I am chosing ```MountainCarContinuous-v0``` game. Mainly because my GPU is not that good to work on higher dimensional state space

In [5]:
env = gym.make("MountainCarContinuous-v0")

[33mWARN: gym.spaces.Box autodetected dtype as <type 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <type 'numpy.float32'>. Please provide explicit dtype.[0m


In [6]:
state_shape= env.observation_space.sample().shape

In [7]:
action_shape=env.action_space.sample().shape

In [8]:
actor = create_actor_network(state_shape,action_shape[0])
critic,state_tensor,action_tensor = create_critic_network(state_shape,action_shape)
target_actor=create_actor_network(state_shape,action_shape[0])
target_critic,_,_ = create_critic_network(state_shape,action_shape)
target_actor.set_weights(actor.get_weights())
target_critic.set_weights(critic.get_weights())

I have chosen ```RMSProp``` optimizer, due to more stability compared to Adam . I found it after trials and errors, no theoritical background on chosing this optimizer

In [9]:
actor_optimizer = keras.optimizers.RMSprop(actor_lr)

critic_optimizer = keras.optimizers.RMSprop(critic_lr)

In [10]:
critic.compile(loss="mse",optimizer=critic_optimizer)

Instructions for updating:
keep_dims is deprecated, use keepdims instead


#### Actor training

I think this is the most critical part of ddpg in keras. The object ```critic``` and ```actor``` has a ```__call__``` method inside it, which will give output tensor if you give input a tensor. So to get the tensor object of ```Q``` we will use this functionality.

In [11]:
CriticValues = critic([state_tensor,actor(state_tensor)])

Now it is time to get the gradient value of $-\frac{\partial Q(s,a)}{\theta_a}$

In [12]:
updates = actor_optimizer.get_updates(
    params=actor.trainable_weights,loss=-K.mean(CriticValues))

Now we will create a function which will train the actor network.

In [13]:
actor_train = K.function(inputs=[state_tensor],outputs=[actor(state_tensor),CriticValues],
                   updates=updates)


<a id ="buffer"></a>
### Replay Buffer

In [14]:
memory = deque(maxlen=10000)

In [15]:
state = env.reset()
state = state.reshape(-1,)
for _ in tqdm(range(memory.maxlen)):
    action = env.action_space.sample()
    next_state,reward,terminal,_=env.step(action)
    
    state=next_state
    if terminal:
        reward=-100
        state= env.reset()
        state = state.reshape(-1,)
    memory.append((state,action,reward,next_state,terminal))

100%|██████████| 10000/10000 [00:00<00:00, 21304.88it/s]


In [16]:
next_state

array([-0.49533114, -0.00344986])

<a id ="noise"></a>
### Noise class

The use of Noise is to make the process Stochastic and to help the agent explore different actions. The paper used Orstein Uhlenbeck Noise class

In [17]:
class OrnsteinUhlenbeckProcess(object):
    def __init__(self, theta, mu=0, sigma=1, x0=0, dt=1e-2, n_steps_annealing=10, size=1):
        self.theta = theta
        self.sigma = sigma
        self.n_steps_annealing = n_steps_annealing
        self.sigma_step = - self.sigma / float(self.n_steps_annealing)
        self.x0 = x0
        self.mu = mu
        self.dt = dt
        self.size = size
    def restart(self):
        self.x0=copy.copy(self.mu)
    def generate(self, step):
        #sigma = max(0, self.sigma_step * step + self.sigma)
        x = self.x0 + self.theta * (self.mu - self.x0) * self.dt + self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.size)
        self.x0 = x
        return x

<a id ="training"></a>
### Training

In [18]:
steps_per_episodes=5000
ou = OrnsteinUhlenbeckProcess(theta=0.35,mu=0.8,sigma=0.4,n_steps_annealing=10)
max_total_reward=0
for episode in range(num_episodes):
    state= env.reset()
    state = state.reshape(-1,)
    total_reward=0
    ou.restart()
    for step in range(steps_per_episodes):
        action= actor.predict(state.reshape(1,-1))+ou.generate(episode)
        next_state,reward,done,_ = env.step(action)
        total_reward=total_reward+reward
        #random minibatch from buffer
        
        batches=random.sample(memory,BATCH_SIZE)
        states= np.array([batch[0].reshape((-1,)) for batch in batches])
        actions= np.array([batch[1] for batch in batches])
        actions=actions.reshape(-1,1)
        rewards=np.array([batch[2] for batch in batches])
        rewards = rewards.reshape((-1,1))
        new_states=np.array([batch[3].reshape((-1,)) for batch in batches])
        terminals=np.array([batch[4] for batch in batches])
        terminals = terminals.reshape((-1,1))
        #training
        
        target_actions = target_actor.predict(new_states)
        target_Qs = target_critic.predict([new_states,target_actions])
        
        new_Qs = rewards+GAMMA*target_Qs*terminals
        critic.fit([states,actions],new_Qs,verbose=False)
        _,critic_values=actor_train(inputs=[states])
        target_critic_weights=[TAU*weight+(1-TAU)*target_weight for weight,target_weight in zip(critic.get_weights(),target_critic.get_weights())]
        target_actor_weights=[TAU*weight+(1-TAU)*target_weight for weight,target_weight in zip(actor.get_weights(),target_actor.get_weights())]
        target_critic.set_weights(target_critic_weights)
        target_actor.set_weights(target_actor_weights)
        print("Total Reward %f"%total_reward,end="\r")
        if SHOW:
            env.render()
        if done or step==(steps_per_episodes-1):
            
            if total_reward<0:
                print("Failed!",end=" ")
                reward=-100
            elif total_reward>0:
                print("Passed!",end=" ")
                reward=100
            memory.append((state,action,reward,next_state,done))
            break
        
        memory.append((state,action,reward,next_state,done))
        state=next_state
    if total_reward>max_total_reward:
        actor.save_weights("MC_DDPG_Weights/Actor_Best_weights episode %d_GAMMA_%f_TAU%f_lr_%f.h5"%(episode,GAMMA,TAU,actor_lr))
        critic.save_weights("MC_DDPG_Weights/Critic_Best_weights episode %d_GAMMA_%f_TAU%f_lr_%f.h5"%(episode,GAMMA,TAU,critic_lr))
        max_total_reward=total_reward
    print("Episode %d Total Reward %f"%(episode,total_reward))

Failed! Episode 0 Total Reward -121.829750
Failed! Episode 1 Total Reward -118.666569
Failed! Episode 2 Total Reward -162.041720
Failed! Episode 3 Total Reward -87.619343
Failed! Episode 4 Total Reward -96.425173
Failed! Episode 5 Total Reward -143.686660
Failed! Episode 6 Total Reward -38.429055
Passed! Episode 7 Total Reward 76.095012
Passed! Episode 8 Total Reward 78.345259
Failed! Episode 9 Total Reward -0.019448
Passed! Episode 10 Total Reward 48.135478
Passed! Episode 11 Total Reward 54.750747
Passed! Episode 12 Total Reward 79.410100
Passed! ward 83.0033361Episode 13 Total Reward 83.003336
Passed! Episode 14 Total Reward 83.174427
Failed! Episode 15 Total Reward -48.217783
Passed! Episode 16 Total Reward 87.284775
Passed! Episode 17 Total Reward 32.635075
Passed! Episode 18 Total Reward 84.165725
Passed! Episode 19 Total Reward 89.819676
Passed! Episode 20 Total Reward 76.334761
Passed! Episode 21 Total Reward 58.299554
Passed! Episode 22 Total Reward 81.035049
Failed! Episode 2

### Video

Please watch at 2x speed. I changed some simple mistakes after the video so the rewards are not exactly the same

[![](http://img.youtube.com/vi/9Fe_n-ovIaA/0.jpg)](http://www.youtube.com/watch?v=9Fe_n-ovIaA "Keras tutorial DDPG")