
# **Policy Gradient methods**

In [1]:
#importing libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
import gym

## **Vanilla Policy Gradient (VPG)**

VPG optimizes the policy directly using the policy gradient theorem. It weights the log-probability of each action taken during an episode by the return $G_t$, the discounted sum of rewards from step $t$ onward.
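As a small worked example of that weighting, the sketch below computes $G_t = r_t + \gamma G_{t+1}$ for a made-up three-step episode and then forms the surrogate loss $-\frac{1}{T}\sum_t \log \pi(a_t \mid s_t)\, G_t$. The rewards, the discount of 0.5, and the action probabilities are illustrative assumptions, not values from the CartPole run.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical three-step episode with a reward of 1 per step.
rewards = [1.0, 1.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)
print(returns)  # values are 1.75, 1.5 and 1.0

# Hypothetical probabilities the policy assigned to the actions it actually took.
taken_action_probs = np.array([0.5, 0.5, 0.5])

# Surrogate loss: negative mean of log-probability weighted by the return.
surrogate_loss = -np.mean(np.log(taken_action_probs) * returns)
print(surrogate_loss)
```

Earlier steps receive larger returns, so the actions taken early in a long episode are reinforced more strongly than the final ones.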

In [2]:
# Environment setup
env = gym.make('CartPole-v1')
learning_rate = 0.01
gamma = 0.99  # Discount factor
epochs = 500
hidden_units = 128

# Policy Network
class PolicyNetwork(tf.keras.Model):
    def __init__(self, action_space):
        super(PolicyNetwork, self).__init__()
        self.hidden = layers.Dense(hidden_units, activation='relu')
        self.output_layer = layers.Dense(action_space, activation='softmax')

    def call(self, state):
        x = self.hidden(state)
        return self.output_layer(x)

policy = PolicyNetwork(env.action_space.n)
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

# Compute discounted rewards
def compute_discounted_rewards(rewards, gamma):
    discounted = np.zeros_like(rewards, dtype=np.float32)
    running_add = 0
    for t in reversed(range(len(rewards))):
        running_add = rewards[t] + gamma * running_add
        discounted[t] = running_add
    return discounted

# Training loop
for epoch in range(epochs):
    state = env.reset()
    state = np.expand_dims(state, axis=0)
    states, actions, rewards = [], [], []
    done = False

    # Generate trajectory
    while not done:
        state_tensor = tf.convert_to_tensor(state, dtype=tf.float32)
        action_probs = policy(state_tensor)
        action = np.random.choice(env.action_space.n, p=np.squeeze(action_probs.numpy()))
        next_state, reward, done, _ = env.step(action)

        states.append(state)
        actions.append(action)
        rewards.append(reward)

        state = np.expand_dims(next_state, axis=0)

    # Compute returns
    returns = compute_discounted_rewards(rewards, gamma)

    # Update Policy
    with tf.GradientTape() as tape:
        state_tensor = tf.convert_to_tensor(np.vstack(states), dtype=tf.float32)
        action_tensor = tf.convert_to_tensor(actions, dtype=tf.int32)
        return_tensor = tf.convert_to_tensor(returns, dtype=tf.float32)

        action_probs = policy(state_tensor)
        action_log_probs = tf.math.log(tf.reduce_sum(action_probs * tf.one_hot(action_tensor, env.action_space.n), axis=1))

        # Loss is negative of expected return
        policy_loss = -tf.reduce_mean(action_log_probs * return_tensor)

    grads = tape.gradient(policy_loss, policy.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy.trainable_variables))

    # Logging
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Total Reward: {sum(rewards)}")



Epoch 0, Total Reward: 21.0
Epoch 10, Total Reward: 19.0
Epoch 20, Total Reward: 38.0
Epoch 30, Total Reward: 31.0
Epoch 40, Total Reward: 72.0
Epoch 50, Total Reward: 76.0
Epoch 60, Total Reward: 63.0
Epoch 70, Total Reward: 40.0
Epoch 80, Total Reward: 32.0
Epoch 90, Total Reward: 52.0
Epoch 100, Total Reward: 58.0
Epoch 110, Total Reward: 34.0
Epoch 120, Total Reward: 31.0
Epoch 130, Total Reward: 25.0
Epoch 140, Total Reward: 20.0
Epoch 150, Total Reward: 18.0
Epoch 160, Total Reward: 17.0
Epoch 170, Total Reward: 16.0
Epoch 180, Total Reward: 17.0
Epoch 190, Total Reward: 15.0
Epoch 200, Total Reward: 18.0
Epoch 210, Total Reward: 39.0
Epoch 220, Total Reward: 49.0
Epoch 230, Total Reward: 50.0
Epoch 240, Total Reward: 84.0
Epoch 250, Total Reward: 64.0
Epoch 260, Total Reward: 82.0
Epoch 270, Total Reward: 133.0
Epoch 280, Total Reward: 55.0
Epoch 290, Total Reward: 52.0
Epoch 300, Total Reward: 42.0
Epoch 310, Total Reward: 75.0
Epoch 320, Total Reward: 49.0
Epoch 330, Total Rew

## **REINFORCE with Baseline**

REINFORCE with baseline extends VPG by subtracting a baseline (a learned value function) from the return, which reduces the variance of the updates. Instead of $G_t$, it weights each log-probability by the advantage:

$$A_t = G_t - V(s_t)$$

where $V(s_t)$ is the value function, estimated by a critic network.
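To see why subtracting a baseline helps, the sketch below computes the exact mean and variance of the score-function gradient estimate for a toy two-armed bandit, with and without a baseline. The bandit, its rewards, and the helper `gradient_stats` are hypothetical. The constant baseline (the expected reward) stands in for $V(s_t)$ in this single-state problem.

```python
import numpy as np

# Hypothetical two-armed bandit: softmax policy with equal probabilities,
# deterministic rewards of 10 and 11.
probs = np.array([0.5, 0.5])
rewards = np.array([10.0, 11.0])

def gradient_stats(baseline):
    """Exact mean and variance of the gradient estimate for logit 0."""
    # For a softmax policy, d/d(logit_0) log pi(a) = 1[a == 0] - pi(0).
    score = np.array([1.0, 0.0]) - probs[0]
    samples = score * (rewards - baseline)  # one estimate per possible action
    mean = np.sum(probs * samples)
    var = np.sum(probs * (samples - mean) ** 2)
    return mean, var

mean_plain, var_plain = gradient_stats(baseline=0.0)
mean_base, var_base = gradient_stats(baseline=np.sum(probs * rewards))

print("without baseline:", mean_plain, var_plain)
print("with baseline:   ", mean_base, var_base)
```

Both estimators have the same mean, so the baseline leaves the expected gradient unchanged. Only the variance differs, and in this toy case it drops to zero.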

In [3]:
# Value Network
class ValueNetwork(tf.keras.Model):
    def __init__(self):
        super(ValueNetwork, self).__init__()
        self.hidden = layers.Dense(hidden_units, activation='relu')
        self.value = layers.Dense(1, activation=None)

    def call(self, state):
        x = self.hidden(state)
        return self.value(x)

value_net = ValueNetwork()
value_optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

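# Training loop
# Note: `policy` and `optimizer` are reused from the VPG cell above, so this
# run continues from the VPG-trained weights rather than a fresh network.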
for epoch in range(epochs):
    state = env.reset()
    state = np.expand_dims(state, axis=0)
    states, actions, rewards = [], [], []
    done = False

    # Generate trajectory
    while not done:
        state_tensor = tf.convert_to_tensor(state, dtype=tf.float32)
        action_probs = policy(state_tensor)
        action = np.random.choice(env.action_space.n, p=np.squeeze(action_probs.numpy()))
        next_state, reward, done, _ = env.step(action)

        states.append(state)
        actions.append(action)
        rewards.append(reward)

        state = np.expand_dims(next_state, axis=0)

    # Compute returns
    returns = compute_discounted_rewards(rewards, gamma)

    # Update Policy
    with tf.GradientTape() as tape:
        state_tensor = tf.convert_to_tensor(np.vstack(states), dtype=tf.float32)
        action_tensor = tf.convert_to_tensor(actions, dtype=tf.int32)
        return_tensor = tf.convert_to_tensor(returns, dtype=tf.float32)

        # Use baseline (value function) to reduce variance
        values = tf.squeeze(value_net(state_tensor))
        advantages = return_tensor - values

        action_probs = policy(state_tensor)
        action_log_probs = tf.math.log(tf.reduce_sum(action_probs * tf.one_hot(action_tensor, env.action_space.n), axis=1))

        policy_loss = -tf.reduce_mean(action_log_probs * advantages)

    grads = tape.gradient(policy_loss, policy.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy.trainable_variables))

    # Update Critic
    with tf.GradientTape() as value_tape:
        values = tf.squeeze(value_net(state_tensor))
        value_loss = tf.reduce_mean((return_tensor - values) ** 2)

    value_grads = value_tape.gradient(value_loss, value_net.trainable_variables)
    value_optimizer.apply_gradients(zip(value_grads, value_net.trainable_variables))

    # Logging
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Total Reward: {sum(rewards)}, Value Loss: {value_loss.numpy():.4f}")


Epoch 0, Total Reward: 104.0, Value Loss: 1807.7913
Epoch 10, Total Reward: 121.0, Value Loss: 2038.6803
Epoch 20, Total Reward: 105.0, Value Loss: 1348.1639
Epoch 30, Total Reward: 48.0, Value Loss: 305.3307
Epoch 40, Total Reward: 84.0, Value Loss: 554.8294
Epoch 50, Total Reward: 87.0, Value Loss: 356.2919
Epoch 60, Total Reward: 107.0, Value Loss: 329.7827
Epoch 70, Total Reward: 100.0, Value Loss: 101.3395
Epoch 80, Total Reward: 96.0, Value Loss: 66.2135
Epoch 90, Total Reward: 104.0, Value Loss: 81.6713
Epoch 100, Total Reward: 106.0, Value Loss: 59.9112
Epoch 110, Total Reward: 115.0, Value Loss: 103.0607
Epoch 120, Total Reward: 115.0, Value Loss: 91.3867
Epoch 130, Total Reward: 321.0, Value Loss: 1276.1494
Epoch 140, Total Reward: 149.0, Value Loss: 1866.3496
Epoch 150, Total Reward: 155.0, Value Loss: 178.0727
Epoch 160, Total Reward: 144.0, Value Loss: 185.4613
Epoch 170, Total Reward: 110.0, Value Loss: 34.6652
Epoch 180, Total Reward: 113.0, Value Loss: 51.9295
Epoch 190

## **Deterministic Policy Gradient (DPG)**

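DPG optimizes a deterministic policy $\mu(s \mid \theta^\mu)$ that maps each state $s$ directly to an action $a$, rather than sampling actions from a probability distribution as in stochastic policy gradient methods. This makes it particularly useful for continuous action spaces. The training cell in this section also uses a replay buffer, target networks with soft updates, and Gaussian exploration noise, as in DDPG.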
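The actor update in DPG follows the chain rule $\nabla_\theta J \approx \mathbb{E}_s\left[\nabla_a Q(s, a)\big|_{a=\mu(s)} \, \nabla_\theta \mu(s \mid \theta)\right]$. The sketch below applies it to a hypothetical one-dimensional problem with a hand-written critic $Q(s, a) = -(a - 2s)^2$ and a linear actor $\mu(s) = \theta s$. All names and values are illustrative assumptions and are not part of the Pendulum setup.

```python
import numpy as np

# Hypothetical critic: Q(s, a) = -(a - 2s)^2, maximized by a = 2s.
def critic_action_grad(s, a):
    """dQ/da for the hand-written critic above."""
    return -2.0 * (a - 2.0 * s)

def actor(s, theta):
    """Deterministic linear policy mu(s) = theta * s."""
    return theta * s

states = np.array([-1.0, 0.5, 1.0])  # hypothetical batch of states
theta = 0.0
learning_rate = 0.1

for _ in range(100):
    actions = actor(states, theta)
    dmu_dtheta = states  # d(theta * s)/d(theta) = s
    # Chain rule: dQ/da at a = mu(s), times dmu/dtheta, averaged over states.
    grad = np.mean(critic_action_grad(states, actions) * dmu_dtheta)
    theta += learning_rate * grad  # gradient ascent on Q

print(theta)
```

Here `theta` approaches 2, the value at which the actor outputs the action the critic rates highest in every state. With neural networks, automatic differentiation performs the same chain rule when the actor loss is the negative critic output.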

In [14]:
import tensorflow as tf
import numpy as np
import gym

# Environment with continuous action space (e.g., Pendulum)
env = gym.make("Pendulum-v1")
state_shape = env.observation_space.shape[0]
action_space = env.action_space.shape[0]
action_bounds = env.action_space.high[0]

# Hyperparameters
learning_rate_actor = 0.001
learning_rate_critic = 0.002
gamma = 0.99
tau = 0.005  # For soft updates
episodes = 500

# Actor Network (Deterministic Policy)
class ActorNetwork(tf.keras.Model):
    def __init__(self):
        super(ActorNetwork, self).__init__()
        self.hidden = tf.keras.layers.Dense(128, activation='relu')
        self.output_layer = tf.keras.layers.Dense(action_space, activation='tanh')  # Output between -1 and 1

    def call(self, state):
        x = self.hidden(state)
        return self.output_layer(x) * action_bounds

# Critic Network (Q-value approximation)
class CriticNetwork(tf.keras.Model):
    def __init__(self):
        super(CriticNetwork, self).__init__()
        self.state_layer = tf.keras.layers.Dense(64, activation='relu')
        self.action_layer = tf.keras.layers.Dense(64, activation='relu')
        self.output_layer = tf.keras.layers.Dense(1)  # Q-value output

    def call(self, state, action):
        state_out = self.state_layer(state)
        action_out = self.action_layer(action)
        combined = tf.concat([state_out, action_out], axis=-1)
        return self.output_layer(combined)

# Initialize networks and target networks
actor = ActorNetwork()
critic = CriticNetwork()
target_actor = ActorNetwork()
target_critic = CriticNetwork()

# Copy weights to target networks
target_actor.set_weights(actor.get_weights())
target_critic.set_weights(critic.get_weights())

# Optimizers
actor_optimizer = tf.keras.optimizers.Adam(learning_rate_actor)
critic_optimizer = tf.keras.optimizers.Adam(learning_rate_critic)

# Replay buffer
class ReplayBuffer:
    def __init__(self, capacity=100000):
        self.buffer = []
        self.capacity = capacity

    def store(self, transition):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
        self.buffer.append(transition)

    def sample(self, batch_size):
        idx = np.random.choice(len(self.buffer), batch_size, replace=False)
        return [self.buffer[i] for i in idx]

# Initialize replay buffer
replay_buffer = ReplayBuffer()

# Training loop
batch_size = 64
for episode in range(episodes):
    state = env.reset()
    state = np.expand_dims(state, axis=0)  # Add batch dimension
    total_reward = 0
    done = False

    while not done:
        # Actor forward pass (deterministic action)
        state_tensor = tf.convert_to_tensor(state, dtype=tf.float32)
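        action = actor(state_tensor).numpy()[0]
        action += np.random.normal(0, 0.1, size=action.shape)  # Add exploration noise
        action = np.clip(action, -action_bounds, action_bounds)  # Keep the noisy action within the valid range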

        # Take action in environment
        next_state, reward, done, _ = env.step(action)
        next_state = np.expand_dims(next_state, axis=0)

        # Store transition in replay buffer
        replay_buffer.store((state, action, reward, next_state, done))
        total_reward += reward

        # Sample from replay buffer
        if len(replay_buffer.buffer) >= batch_size:
            transitions = replay_buffer.sample(batch_size)
            states, actions, rewards, next_states, dones = zip(*transitions)

            states = tf.convert_to_tensor(np.vstack(states), dtype=tf.float32)
            actions = tf.convert_to_tensor(np.vstack(actions), dtype=tf.float32)
            rewards = tf.convert_to_tensor(np.vstack(rewards), dtype=tf.float32)
            next_states = tf.convert_to_tensor(np.vstack(next_states), dtype=tf.float32)
            dones = tf.convert_to_tensor(np.vstack(dones), dtype=tf.float32)

            # Update Critic
            next_actions = target_actor(next_states)
            next_q_values = target_critic(next_states, next_actions)
            q_targets = rewards + gamma * next_q_values * (1 - dones)

            with tf.GradientTape() as tape:
                q_values = critic(states, actions)
                critic_loss = tf.reduce_mean((q_values - q_targets) ** 2)

            critic_grads = tape.gradient(critic_loss, critic.trainable_variables)
            critic_optimizer.apply_gradients(zip(critic_grads, critic.trainable_variables))

            # Update Actor
            with tf.GradientTape() as tape:
                actions_pred = actor(states)
                actor_loss = -tf.reduce_mean(critic(states, actions_pred))  # Maximize Q-value

            actor_grads = tape.gradient(actor_loss, actor.trainable_variables)
            actor_optimizer.apply_gradients(zip(actor_grads, actor.trainable_variables))

            # Soft update target networks
            for target_var, var in zip(target_actor.trainable_variables, actor.trainable_variables):
                target_var.assign(tau * var + (1 - tau) * target_var)

            for target_var, var in zip(target_critic.trainable_variables, critic.trainable_variables):
                target_var.assign(tau * var + (1 - tau) * target_var)

        state = next_state

    print(f"Episode {episode}, Total Reward: {total_reward}")


Episode 0, Total Reward: -1209.9862589632853
Episode 1, Total Reward: -992.1452181828838
Episode 2, Total Reward: -1042.7546274058489
Episode 3, Total Reward: -1695.459039064374
Episode 4, Total Reward: -1228.367104256855
Episode 5, Total Reward: -1220.0449549743448
Episode 6, Total Reward: -892.4569701524111
Episode 7, Total Reward: -1867.7670082580462
Episode 8, Total Reward: -1901.8891868492767
Episode 9, Total Reward: -1051.1809790804864
Episode 10, Total Reward: -1317.9979993748045
Episode 11, Total Reward: -1055.785661554805
Episode 12, Total Reward: -1054.3756826409874
Episode 13, Total Reward: -1107.8952674661527
Episode 14, Total Reward: -1813.256416099839
Episode 15, Total Reward: -1302.7823337558102
Episode 16, Total Reward: -999.2457439816972
Episode 17, Total Reward: -1861.025215601512
Episode 18, Total Reward: -1776.3825262583262
Episode 19, Total Reward: -1305.9766174842684
Episode 20, Total Reward: -1496.8940657648004
Episode 21, Total Reward: -1772.9861103905994
Episod

## **Inference**

- Vanilla Policy Gradient (VPG): estimates the policy gradient from Monte Carlo returns of a stochastic policy. It is simple to implement, but the updates have high variance; the logged CartPole rewards fluctuate between 15 and 133.
- REINFORCE with Baseline: uses the same Monte Carlo update, but subtracts a learned value function from the return. The baseline reduces variance without changing the expected gradient, at the cost of training a critic.
- Deterministic Policy Gradient (DPG): learns a deterministic policy for continuous action spaces and updates it through the critic's gradient with respect to the action. Training off-policy from a replay buffer can improve sample efficiency, although the Pendulum rewards logged for the first episodes show no clear improvement yet.