
### DDPG

> About

- Consists of an Actor and a Critic: the Actor outputs a continuous action for a given state, and the Critic estimates the expected return (Q value) for a given state and action
- Target networks (slowly updated copies of the Actor and Critic) are used to stabilize the training of the Critic
- The Critic is trained first, using the actual reward plus the discounted future value predicted by the target networks as its target value; the updated Critic is then used to train the Actor by scoring the actions it proposes
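
The update described above can be written compactly. This is a sketch in notation that is assumed here rather than taken from the notebook: $\mu_\theta$ is the Actor, $Q_\phi$ is the Critic, primes mark the target networks, $d$ is the done flag, $N$ is the batch size, and $\gamma$ and $\tau$ correspond to `gamma` and `tau` in the code below.

$$
\begin{aligned}
y_i &= r_i + \gamma\,(1 - d_i)\,Q_{\phi'}\big(s'_i,\ \mu_{\theta'}(s'_i)\big) \\
L_{\text{critic}} &= \frac{1}{N}\sum_{i=1}^{N}\big(Q_\phi(s_i, a_i) - y_i\big)^2 \\
L_{\text{actor}} &= -\frac{1}{N}\sum_{i=1}^{N} Q_\phi\big(s_i,\ \mu_\theta(s_i)\big) \\
\phi' &\leftarrow \tau\,\phi + (1 - \tau)\,\phi', \qquad \theta' \leftarrow \tau\,\theta + (1 - \tau)\,\theta'
\end{aligned}
$$

The first line is the target value, the second is the mean squared error the Critic minimizes, the third is the Actor loss, and the last is the partial (soft) update of the target networks.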

> Pro

- Simplicity and Ease of Implementation

> Con

- Sample Inefficiency

```
class Replay_buffer():
  def __init__(self, max_size=capacity):
    - Initialize the storage, maximum size, and pointer to keep track of current index

  def push(self, data):
    - If the length of the storage has reached the maximum size, start replacing the data from the oldest item in the storage
    - Otherwise, append the data to the storage

  def sample(self, batch_size):
    - Select random batch_size items from the storage and return them
```
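
To make the overwrite behaviour concrete, here is a minimal, dependency-free sketch of the same idea with a capacity of 3. The class name `RingBuffer` and the string "transitions" are illustrative assumptions; the real buffer below stores tuples and samples with NumPy.

```python
import random


class RingBuffer:
    """Fixed-capacity buffer that overwrites its oldest entry once full."""

    def __init__(self, max_size):
        self.storage = []
        self.max_size = max_size
        self.ptr = 0  # index of the oldest entry once the buffer is full

    def push(self, data):
        if len(self.storage) == self.max_size:
            # Full: overwrite the oldest slot and advance the pointer.
            self.storage[self.ptr] = data
            self.ptr = (self.ptr + 1) % self.max_size
        else:
            self.storage.append(data)

    def sample(self, batch_size):
        # Sampling with replacement, so a batch may contain duplicates.
        return [random.choice(self.storage) for _ in range(batch_size)]


buffer = RingBuffer(max_size=3)
for transition in ["t0", "t1", "t2", "t3", "t4"]:
    buffer.push(transition)

print(buffer.storage)  # ['t3', 't4', 't2']
```

After five pushes the two oldest items (`t0`, `t1`) have been replaced in place, so the list order no longer matches insertion order. That is harmless here because sampling is uniform.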

```
class Actor(nn.Module):
  def __init__(self, state_dim, action_dim, max_action):
    - Map state input -> hidden layers -> continuous action value (DDPG is deterministic, so the output is the action itself, not a probability)
    - Also stores the maximum value for continuous action in variable self.max_action

  def forward(self, x):
    - Feed the input through the neural network, apply tanh to its output, and multiply the result by self.max_action. Since tanh squeezes values into the range -1 to 1, the output of forward lies between -max_action and +max_action
```
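
The effect of the tanh scaling can be checked on plain numbers. This is a small sketch using only the standard library; `scale_action` is a hypothetical helper, and `2.0` is the torque limit of Pendulum-v1, the environment used below.

```python
import math


def scale_action(raw_output, max_action):
    """Squash an unbounded network output into [-max_action, max_action]."""
    return max_action * math.tanh(raw_output)


max_action = 2.0  # Pendulum-v1 torque limit

for raw in (-10.0, 0.0, 0.5, 10.0):
    # Large magnitudes saturate near the bounds; 0 maps to 0.
    print(raw, round(scale_action(raw, max_action), 4))
```

However large the raw output of the last layer becomes, the scaled action never leaves the valid range, so the Actor itself never produces an out-of-range action.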

```
class Critic(nn.Module):
  def __init__(self, state_dim, action_dim):
    - Map state input and action -> hidden layer -> expected reward value
  
  def forward(self, x, u):
    - Feed the x: state and u: action through the neural network
```
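
In the implementation below, the Critic takes the state and action as one concatenated vector per sample. The following NumPy sketch shows only the resulting shape; the batch size of 4 is arbitrary, and the state and action sizes are those of Pendulum-v1.

```python
import numpy as np

batch_size, state_dim, action_dim = 4, 3, 1  # Pendulum-v1 sizes, small batch

states = np.zeros((batch_size, state_dim), dtype=np.float32)
actions = np.ones((batch_size, action_dim), dtype=np.float32)

# Same layout as torch.cat([x, u], 1): one row per sample, state columns first.
critic_input = np.concatenate([states, actions], axis=1)

print(critic_input.shape)  # (4, 4)
```

This is why the first Critic layer has `state_dim + action_dim` input features.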

```
class DDPG(object):
  def __init__(self, state_dim, action_dim, max_action):
    - Initialize the actor and target actor networks, and copy the weights of actor network to target actor network. Initialize the actor network optimizer
    - Initialize the critic and target critic networks, and copy the weights of critic network to target critic network. Initialize the critic network optimizer
    - Initialize the replay buffer

  def select_action(self, state):
    - Feed the state to actor network and get the action value
    - Return the action value

  def update(self):
    - For the specified number of update iterations, run the following update loop
    - Update loop is as follows
    # Critic network update
    - 1. Fetch a sample batch from the replay buffer and unpack the state, action, next state, done, and reward. Each of these is a tensor with batch_size rows
    - 2. Compute the target Q value by feeding the next state, together with the action the target actor network picks for it, to the target critic network. This is the expected return from the next state onward. Then calculate the final target Q value using the formula final target Q = reward + (1 - done) * gamma * target Q value, so terminal transitions do not bootstrap. This is the value that the critic network should aim to hit. The target networks are used to stabilize training: they keep the target Q value from shifting in step with the current Q value, which would otherwise lead to oscillation
    - 3. Compute the current Q value by feeding the state and action to critic network
    - 4. Calculate the loss between current Q value and target Q value
    - 5. Apply backpropagation to critic network
    # Actor network update
    - 1. Generate action tensor by feeding state tensor to actor network
    - 2. Feed the state tensor and generated action tensor to critic network to get the expected reward tensor
    - 3. Take the negated mean of the expected reward tensor as the Actor loss. This measures how much reward the critic expects from the actions chosen by the actor network: if the expected reward is large, the Actor loss is small (due to the negation), whereas if the expected reward is small, the Actor loss is large. Minimizing the loss therefore pushes the actor toward actions the critic rates more highly
    - 4. Apply backpropagation
    - Afterwards, the target networks are partially updated from main networks

  def save(self):
    - Save the weights of the actor and critic networks

  def load(self):
    - Load the weights into the actor and critic networks
```
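
Two steps of the update loop are plain arithmetic and can be checked by hand: the target Q value and the partial update of the target networks. The sketch below uses NumPy arrays in place of tensors and network parameters. The helper names `td_target` and `soft_update` and all numbers are made up for illustration; the defaults mirror `gamma = 0.99` and `tau = 0.005` from the settings below.

```python
import numpy as np


def td_target(reward, done, next_q, gamma=0.99):
    """Reward plus discounted next-state value.

    The bootstrap term is zeroed for terminal transitions (done == 1).
    """
    return reward + (1.0 - done) * gamma * next_q


def soft_update(target_params, params, tau=0.005):
    """Move each target parameter a fraction tau toward the main one."""
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(params, target_params)]


reward = np.array([[1.0], [1.0]])
done = np.array([[0.0], [1.0]])      # the second transition is terminal
next_q = np.array([[10.0], [10.0]])  # stand-in for the target critic output

# Row 0: 1 + 0.99 * 10 = 10.9; row 1: terminal, so just the reward 1.0.
print(td_target(reward, done, next_q))

target = [np.array([0.0, 0.0])]
main = [np.array([1.0, 2.0])]

# Each target value moves 0.5% of the way toward the main value.
print(soft_update(target, main))
```

With `tau` this small, the target networks trail the main networks by many updates, which is what keeps the target Q value from moving as fast as the current Q value.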

```
def main():
  - If it is test mode, run the agent for some number of episodes and report the reward and number of steps achieved in each episode
  - Else if it is train mode, for some number of episodes, collect transitions (each a tuple of state, next state, action, reward, done) into the replay buffer, and update the agent after each episode
```
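
During training, the code below perturbs the Actor's output with Gaussian noise and clips the result to the environment's action bounds, since a deterministic policy would otherwise never explore. Here is a minimal NumPy sketch of that step. The helper `noisy_action` and the greedy action `1.98` are hypothetical; the noise scale `0.1` matches `exploration_noise`, and the bounds are those of Pendulum-v1.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable


def noisy_action(action, noise_std, low, high):
    """Add zero-mean Gaussian noise, then clip to the valid action range."""
    noise = rng.normal(0.0, noise_std, size=action.shape)
    return np.clip(action + noise, low, high)


low, high = np.array([-2.0]), np.array([2.0])  # Pendulum-v1 action bounds
greedy = np.array([1.98])                      # hypothetical actor output

samples = [noisy_action(greedy, 0.1, low, high) for _ in range(1000)]

# Every sample stays inside the bounds, even though the noise often
# pushes the unclipped value above 2.0.
print(min(s[0] for s in samples), max(s[0] for s in samples))
```

Clipping after adding noise matters because the Actor's tanh scaling only bounds the noise-free action.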

In [None]:
# Import necessary libraries and modules
from itertools import count
import os, sys, random
import numpy as np
import gym  # written against the pre-0.26 Gym API: reset() returns obs, step() returns 4 values
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Normal
from tensorboardX import SummaryWriter

# Set various hyperparameters and settings for the algorithm
mode = 'train'
env_name = "Pendulum-v1"
tau = 0.005
target_update_interval = 1
test_iteration = 10
learning_rate = 1e-4
gamma = 0.99
capacity = 1000000
batch_size = 100
seed = False
random_seed = 9527
sample_frequency = 2000
render = False
log_interval = 50
load = False
render_interval = 100
exploration_noise = 0.1
max_episode = 100000
print_log = 5
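update_iteration = 200
# Step cap per test episode. main() references this name but it was never
# defined; 2000 is an assumed value.
max_length_of_trajectory = 2000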

# Detect if CUDA is available and set device accordingly
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Script name for identifying experiments
script_name = "ddpg"

# Initialize environment
env = gym.make(env_name)

# Seed random numbers if 'seed' is True
if seed:
    env.seed(random_seed)
    torch.manual_seed(random_seed)
    np.random.seed(random_seed)

# Extract state and action dimensions from environment
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
max_action = float(env.action_space.high[0])
min_Val = torch.tensor(1e-7).float().to(device)

# Directory for saving models and logging
directory = './exp' + script_name + env_name + '/'

# Class definition for Replay Buffer
class Replay_buffer():
    def __init__(self, max_size=capacity):
        self.storage = []
        self.max_size = max_size
        self.ptr = 0

    def push(self, data):
        if len(self.storage) == self.max_size:
            self.storage[int(self.ptr)] = data
            self.ptr = (self.ptr + 1) % self.max_size
        else:
            self.storage.append(data)

    def sample(self, batch_size):
        ind = np.random.randint(0, len(self.storage), size=batch_size)
        x, y, u, r, d = [], [], [], [], []

        for i in ind:
            X, Y, U, R, D = self.storage[i]
            # np.asarray avoids a copy when possible; np.array(..., copy=False)
            # raises on NumPy >= 2.0 when a copy is required (e.g. for scalars)
            x.append(np.asarray(X))
            y.append(np.asarray(Y))
            u.append(np.asarray(U))
            r.append(np.asarray(R))
            d.append(np.asarray(D))

        return np.array(x), np.array(y), np.array(u), np.array(r).reshape(-1, 1), np.array(d).reshape(-1, 1)

# Class definition for Actor network
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()

        self.l1 = nn.Linear(state_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, action_dim)

        self.max_action = max_action

    def forward(self, x):
        x = F.relu(self.l1(x))
        x = F.relu(self.l2(x))
        x = self.max_action * torch.tanh(self.l3(x))
        return x

# Class definition for Critic network
class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()

        self.l1 = nn.Linear(state_dim + action_dim, 400)
        self.l2 = nn.Linear(400 , 300)
        self.l3 = nn.Linear(300, 1)

    def forward(self, x, u):
        x = F.relu(self.l1(torch.cat([x, u], 1)))
        x = F.relu(self.l2(x))
        x = self.l3(x)
        return x

# Main class definition for DDPG algorithm
class DDPG(object):
    def __init__(self, state_dim, action_dim, max_action):
        self.actor = Actor(state_dim, action_dim, max_action).to(device)
        self.actor_target = Actor(state_dim, action_dim, max_action).to(device)
        self.actor_target.load_state_dict(self.actor.state_dict())
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=1e-4)

        self.critic = Critic(state_dim, action_dim).to(device)
        self.critic_target = Critic(state_dim, action_dim).to(device)
        self.critic_target.load_state_dict(self.critic.state_dict())
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=1e-3)

        self.replay_buffer = Replay_buffer()
        self.writer = SummaryWriter(directory)

        self.num_critic_update_iteration = 0
        self.num_actor_update_iteration = 0
        self.num_training = 0

    def select_action(self, state):
        state = torch.FloatTensor(state.reshape(1, -1)).to(device)
        return self.actor(state).cpu().data.numpy().flatten()

    def update(self):

        for it in range(update_iteration):
            # Sample replay buffer
            x, y, u, r, d = self.replay_buffer.sample(batch_size)
            state = torch.FloatTensor(x).to(device)
            action = torch.FloatTensor(u).to(device)
            next_state = torch.FloatTensor(y).to(device)
            # Note: this holds 1 - d, i.e. a "not done" mask (0 for terminal transitions)
            done = torch.FloatTensor(1-d).to(device)
            reward = torch.FloatTensor(r).to(device)

            # Compute the target Q value
            target_Q = self.critic_target(next_state, self.actor_target(next_state))
            target_Q = reward + (done * gamma * target_Q).detach()

            # Get current Q estimate
            current_Q = self.critic(state, action)

            # Compute critic loss
            critic_loss = F.mse_loss(current_Q, target_Q)
            self.writer.add_scalar('Loss/critic_loss', critic_loss, global_step=self.num_critic_update_iteration)
            # Optimize the critic
            self.critic_optimizer.zero_grad()
            critic_loss.backward()
            self.critic_optimizer.step()

            # Compute actor loss
            actor_loss = -self.critic(state, self.actor(state)).mean()
            self.writer.add_scalar('Loss/actor_loss', actor_loss, global_step=self.num_actor_update_iteration)

            # Optimize the actor
            self.actor_optimizer.zero_grad()
            actor_loss.backward()
            self.actor_optimizer.step()

            # Update the frozen target models
            for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
                target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

            for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
                target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

            self.num_actor_update_iteration += 1
            self.num_critic_update_iteration += 1

    def save(self):
        torch.save(self.actor.state_dict(), directory + 'actor.pth')
        torch.save(self.critic.state_dict(), directory + 'critic.pth')

    def load(self):
        self.actor.load_state_dict(torch.load(directory + 'actor.pth'))
        self.critic.load_state_dict(torch.load(directory + 'critic.pth'))
        print("====================================")
        print("model has been loaded...")
        print("====================================")

# Main function for training or testing
def main():
    agent = DDPG(state_dim, action_dim, max_action)
    ep_r = 0
    if mode == 'test':
        agent.load()
        for i in range(test_iteration):
            state = env.reset()
            for t in count():
                action = agent.select_action(state)
                next_state, reward, done, info = env.step(np.float32(action))
                ep_r += reward
                env.render()
                if done or t >= max_length_of_trajectory:
                    print("Ep_i \t{}, the ep_r is \t{:0.2f}, the step is \t{}".format(i, ep_r, t))
                    ep_r = 0
                    break
                state = next_state

    elif mode == 'train':
        if load: agent.load()
        total_step = 0
        for i in range(max_episode):
            total_reward = 0
            step = 0
            state = env.reset()
            for t in count():
                action = agent.select_action(state)
                action = (action + np.random.normal(0, exploration_noise, size=env.action_space.shape[0])).clip(
                    env.action_space.low, env.action_space.high)

                next_state, reward, done, info = env.step(action)
                if render and i >= render_interval : env.render()
                # np.float was removed in NumPy 1.24; the builtin float is equivalent here
                agent.replay_buffer.push((state, next_state, action, reward, float(done)))

                state = next_state
                # Accumulate before the break so the final step's reward is counted
                total_reward += reward
                if done:
                    break
                step += 1
            total_step += step+1
            print("Total T:{} Episode: \t{} Total Reward: \t{:0.2f}".format(total_step, i, total_reward))
            agent.update()
           # "Total T: %d Episode Num: %d Episode T: %d Reward: %f

            if i % log_interval == 0:
                agent.save()
    else:
        raise ValueError("mode wrong!!!")

# Start the program
if __name__ == '__main__':
    main()