## <center>CSE 546: Reinforcement Learning</center>
### <center>Prof. Alina Vereshchaka</center>
#### <center>Spring 2025</center>

Welcome to Assignment 3, Part 1: Introduction to Actor-Critic Methods! It covers the implementation of simple actor and critic networks and best practices used in modern Actor-Critic algorithms.

## Section 0: Setup and Imports

In [79]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import gymnasium
import gym
import matplotlib.pyplot as plt
from collections import deque

# Set seed for reproducibility
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)

<torch._C.Generator at 0x7ab8c5ace170>

## Section 1: Actor-Critic Network Architectures and Loss Computation

In this section, you will explore two common architectural designs for Actor-Critic methods and implement their corresponding loss functions using dummy tensors. These architectures are:
- A. Completely separate actor and critic networks
- B. A shared network with two output heads

Both designs are widely used in practice. Shared networks are often more efficient and generalize better, while separate networks offer more control and flexibility.

---


### Task 1a – Separate Actor and Critic Networks with Loss Function

Define a class `SeparateActorCritic`. Your goal is to:
- Create two completely independent neural networks: one for the actor and one for the critic.
- The actor should output a probability distribution over discrete actions (use `nn.Softmax`).
- The critic should output a single scalar value.

 Use `nn.ReLU()` as your activation function. Include at least one hidden layer of reasonable width (e.g. 64 or 128 units).

```python
# TODO: Define SeparateActorCritic class
```

 Next, simulate training using dummy tensors:
1. Generate dummy tensors for log-probabilities, returns, estimated values, and entropies.
2. Compute the actor loss using the advantage (return - value).
3. Compute the critic loss as mean squared error between values and returns.
4. Use a single optimizer for both the Actor and the Critic. In this case, combine the actor and critic losses into a total loss and perform backpropagation.
5. Use separate optimizers for the Actor and the Critic. In this case, keep the actor and critic losses separate and backpropagate each one on its own.

```python
# TODO: Simulate loss computation and backpropagation
```

🔗 Helpful references:
- PyTorch Softmax: https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html
- PyTorch MSE Loss: https://pytorch.org/docs/stable/generated/torch.nn.functional.mse_loss.html

---

In [2]:
# TODO: Define a class SeparateActorCritic with separate networks for actor and critic

# BEGIN_YOUR_CODE

class SeparateActorCritic(nn.Module):
    def __init__(self, input_dim, hidden_dim, action_dim):
        super(SeparateActorCritic, self).__init__()
        self.actor = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1)
        )
        self.critic = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

    def forward(self, x):
        return self.actor(x), self.critic(x)

# Dummy settings
input_dim = 8
hidden_dim = 128
action_dim = 4
batch_size = 5
dummy_states = torch.randn(batch_size, input_dim)
dummy_returns = torch.randn(batch_size, 1)

#JOINT OPTIMIZER
model1 = SeparateActorCritic(input_dim, hidden_dim, action_dim)

policy, values = model1(dummy_states)
log_probs = torch.log(policy.gather(1, torch.randint(0, action_dim, (batch_size, 1))))
entropy = -(policy * torch.log(policy + 1e-8)).sum(dim=1, keepdim=True)
advantage = dummy_returns - values.detach()

actor_loss = -(log_probs * advantage).mean() - 0.01 * entropy.mean()
critic_loss = F.mse_loss(values, dummy_returns)

optimizer_joint = optim.Adam(list(model1.actor.parameters()) + list(model1.critic.parameters()), lr=1e-3)
optimizer_joint.zero_grad()
(actor_loss + critic_loss).backward()
optimizer_joint.step()

print("Joint optimizer step completed.")
print("Actor Loss:", actor_loss.item())
print("Critic Loss:", critic_loss.item())

#SEPARATE OPTIMIZERS
model2 = SeparateActorCritic(input_dim, hidden_dim, action_dim)

policy, values = model2(dummy_states)
log_probs = torch.log(policy.gather(1, torch.randint(0, action_dim, (batch_size, 1))))
entropy = -(policy * torch.log(policy + 1e-8)).sum(dim=1, keepdim=True)
advantage = dummy_returns - values.detach()

actor_loss = -(log_probs * advantage).mean() - 0.01 * entropy.mean()
critic_loss = F.mse_loss(values, dummy_returns)

optimizer_actor = optim.Adam(model2.actor.parameters(), lr=1e-3)
optimizer_critic = optim.Adam(model2.critic.parameters(), lr=1e-3)

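# Update actor
# The actor and critic graphs are disjoint (separate networks, detached
# advantage), so retain_graph=True is not needed.
optimizer_actor.zero_grad()
actor_loss.backward()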
optimizer_actor.step()

# Update critic
optimizer_critic.zero_grad()
critic_loss.backward()
optimizer_critic.step()

print("Separate optimizers step completed.")
print("Actor Loss:", actor_loss.item())
print("Critic Loss:", critic_loss.item())


# END_YOUR_CODE

Joint optimizer step completed.
Actor Loss: 0.3700183928012848
Critic Loss: 0.1863611489534378
Separate optimizers step completed.
Actor Loss: 0.4754375219345093
Critic Loss: 0.1904817372560501


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

### **Joint Optimizer: One Optimizer for Actor & Critic**

#### **Motivation**:
- **Simplifies training**: Only one optimizer to manage.
- **Single optimizer state**: One optimizer object tracks the state (like moment estimates in Adam) for all parameters, so there is only one object to checkpoint and schedule.
- **Synchronized updates**: Both networks are updated together in one pass.

#### **When to Use**:
- When the actor and critic networks are tightly coupled or **share parameters** (e.g., shared CNN base in some actor-critic implementations).
- When you want a simpler training loop.
- When both networks benefit from **shared learning rates** and optimization dynamics.

### **Separate Optimizers: One for Actor, One for Critic**

#### **Motivation**:
- **Decouples learning dynamics**: You can control learning rates, weight decay, and schedulers **independently**.
- More **fine-grained control** over training stability, especially in complex environments.
- Better when actor and critic have **different convergence behavior**.

#### **When to Use**:
- In large-scale training setups (e.g., PPO- or A3C-style agents) where **training stability** is critical.
- When the actor and critic are entirely separate networks (as in Task 1a).
- If you want to **freeze** one of the networks at any point (e.g., to prevent value overfitting).
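
A middle ground between the two setups is a single optimizer with one parameter group per network. This keeps a single `step()` call while still allowing independent learning rates. The sketch below is illustrative only: the two small networks and both learning rates are placeholder assumptions, not values prescribed by the assignment.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical stand-ins for the actor and critic sub-networks.
actor = nn.Sequential(nn.Linear(8, 128), nn.ReLU(), nn.Linear(128, 4))
critic = nn.Sequential(nn.Linear(8, 128), nn.ReLU(), nn.Linear(128, 1))

# One optimizer, two parameter groups with independent learning rates.
optimizer = optim.Adam([
    {"params": actor.parameters(), "lr": 3e-4},   # assumed actor learning rate
    {"params": critic.parameters(), "lr": 1e-3},  # assumed critic learning rate
])

group_lrs = [group["lr"] for group in optimizer.param_groups]
print(group_lrs)
```

Each group can also carry its own `weight_decay` or be targeted by a scheduler, which covers much of the flexibility of two separate optimizers.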

### Task 1b – Shared Network with Actor and Critic Heads + Loss Function

Now define a class `SharedActorCritic`:
- Build a shared base network (e.g., linear layer + ReLU)
- Create two heads: one for actor (output action probabilities) and one for critic (output state value)

```python
# TODO: Define SharedActorCritic class
```

Then:
1. Pass a dummy input tensor through the model to obtain action probabilities and value.
2. Simulate dummy rewards and compute advantage.
3. Compute the actor and critic losses, combine them, and backpropagate.

```python
# TODO: Simulate shared network loss computation and backpropagation
```

 Use `nn.Softmax` for actor output and `nn.Linear` for scalar critic output.

🔗 More reading:
- Policy Gradient Methods: https://spinningup.openai.com/en/latest/algorithms/vpg.html
- Actor-Critic Overview: https://www.tensorflow.org/agents/tutorials/6_reinforce_tutorial
- PyTorch Categorical Distribution: https://pytorch.org/docs/stable/distributions.html#categorical
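
The `Categorical` distribution from the reading list can replace the manual `gather` and `log` arithmetic used for log-probabilities and entropy. The following is a minimal sketch using a made-up batch of probabilities (5 states, 4 actions) in place of real network output; it checks that both routes give the same numbers.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)

# Hypothetical batch of action probabilities (5 states, 4 actions).
probs = torch.softmax(torch.randn(5, 4), dim=-1)
dist = Categorical(probs=probs)

actions = dist.sample()             # shape: (5,)
log_probs = dist.log_prob(actions)  # shape: (5,)
entropy = dist.entropy()            # shape: (5,)

# The same quantities computed by hand, for comparison.
manual_log_probs = torch.log(probs.gather(1, actions.unsqueeze(1))).squeeze(1)
manual_entropy = -(probs * torch.log(probs)).sum(dim=1)

print(torch.allclose(log_probs, manual_log_probs, atol=1e-5))
print(torch.allclose(entropy, manual_entropy, atol=1e-5))
```

Sampling actions from the distribution, rather than drawing them uniformly with `torch.randint`, is what an on-policy training loop would normally do.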

---

In [3]:
# BEGIN_YOUR_CODE
class SharedActorCritic(nn.Module):
    def __init__(self, input_dim, hidden_dim, action_dim):
        super(SharedActorCritic, self).__init__()
        self.shared = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU()
        )
        self.actor_head = nn.Sequential(
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1)
        )
        self.critic_head = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        base = self.shared(x)
        probs = self.actor_head(base)
        value = self.critic_head(base)
        return probs, value

input_dim = 8
hidden_dim = 128
action_dim = 4
batch_size = 5

dummy_states = torch.randn(batch_size, input_dim)
dummy_returns = torch.randn(batch_size, 1)
model = SharedActorCritic(input_dim, hidden_dim, action_dim)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
probs, values = model(dummy_states)
actions = torch.randint(0, action_dim, (batch_size, 1))
log_probs = torch.log(probs.gather(1, actions))
entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1, keepdim=True)
advantage = dummy_returns - values.detach()

actor_loss = -(log_probs * advantage).mean() - 0.01 * entropy.mean()
critic_loss = F.mse_loss(values, dummy_returns)
total_loss = actor_loss + critic_loss
optimizer.zero_grad()
total_loss.backward()
optimizer.step()

print("Shared network training step complete.")
print("Actor Loss:", actor_loss.item())
print("Critic Loss:", critic_loss.item())


# END_YOUR_CODE

Shared network training step complete.
Actor Loss: 0.1559678316116333
Critic Loss: 0.3759419322013855


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

### **Shared Network**

#### **Motivation & Benefits:**
- **Efficient representation**: A shared base helps learn common features from the state, saving computation.
- **Parameter efficiency**: Fewer parameters = faster training, better for memory-limited scenarios.
- **Shared gradient signal**: The shared features receive gradients from both the actor and critic losses, which can speed up convergence.

#### **When to use:**
- In environments where actor and critic benefit from **shared understanding of the state**.
- When training needs to be fast or memory-efficient (e.g., on-device RL).
- Common in **vanilla A2C, A3C, PPO**.
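
The parameter-efficiency point can be checked by counting trainable parameters in both designs. This sketch rebuilds the layer layouts with the same dimensions as the dummy settings (8 inputs, 128 hidden units, 4 actions); `count_params` is a helper introduced here for illustration.

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    """Number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

input_dim, hidden_dim, action_dim = 8, 128, 4

# Separate design: each network has its own hidden layer.
separate_actor = nn.Sequential(
    nn.Linear(input_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, action_dim)
)
separate_critic = nn.Sequential(
    nn.Linear(input_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
)

# Shared design: one hidden layer feeding two heads.
shared_base = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
actor_head = nn.Linear(hidden_dim, action_dim)
critic_head = nn.Linear(hidden_dim, 1)

n_separate = count_params(separate_actor) + count_params(separate_critic)
n_shared = count_params(shared_base) + count_params(actor_head) + count_params(critic_head)
print(f"separate: {n_separate}, shared: {n_shared}")
```

With these dimensions the separate design has 2949 parameters and the shared design has 1797; the saving is exactly one hidden layer (1152 parameters).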

## Section 2: Auto-Adaptive Network Setup for Environments

You will now create a function that builds a shared actor-critic network that adapts to any Gymnasium environment. This function should inspect the environment and build input/output layers accordingly.

### Task 2: Auto-generate Input and Output Layers
Write a function `create_shared_network(env)` that constructs a neural network using the following rules:
- The input layer should match the environment's observation space.
- The output layer for the **actor** should depend on the action space:
  - For discrete actions: output probabilities using `nn.Softmax`.
  - For continuous actions: output mean and log std for a Gaussian distribution.
- The **critic** always outputs a single scalar value.

```python
# TODO: Define function `create_shared_network(env)`
```

#### Environments to Support:
Test your function with the following environments:
1. `CliffWalking-v0` (Use one-hot encoding for discrete integer observations.)
2. `LunarLander-v3` (Standard Box space for observations and discrete actions.)
3. `PongNoFrameskip-v4` (Use gym wrappers for Atari image preprocessing.)
4. `HalfCheetah-v5` (Continuous observation and continuous action.)

```python
# TODO: Loop through environments and test `create_shared_network`
```

Hint: Use `gym.spaces` utilities to determine observation/action types dynamically.

🔗 Observation/Action Space Docs:
- https://gymnasium.farama.org/api/spaces/

---

In [3]:
!pip install swig
!pip install "gymnasium[box2d]"
!pip install "gymnasium[mujoco]"


In [57]:
from gymnasium.spaces import Discrete, Box

class SharedNetwork(nn.Module):
    def __init__(self, input_dim, hidden_dim, action_space):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU()
        )

        self.action_space = action_space

        if isinstance(action_space, Discrete):
            self.actor = nn.Sequential(
                nn.Linear(hidden_dim, action_space.n),
                nn.Softmax(dim=-1)
            )
        elif isinstance(action_space, Box):
            self.actor_mean = nn.Linear(hidden_dim, action_space.shape[0])
            self.actor_log_std = nn.Parameter(torch.zeros(action_space.shape[0]))
        else:
            raise NotImplementedError("Unsupported action space.")

        self.critic = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = self.shared(x)
        if isinstance(self.action_space, Discrete):
            return self.actor(x), self.critic(x)
        elif isinstance(self.action_space, Box):
            mean = self.actor_mean(x)
            log_std = self.actor_log_std.expand_as(mean)
            return (mean, log_std), self.critic(x)


def create_shared_network(env, hidden_dim=128):
    obs_space = env.observation_space
    act_space = env.action_space

    if isinstance(obs_space, Discrete):
        input_dim = obs_space.n
    elif isinstance(obs_space, Box):
        input_dim = int(np.prod(obs_space.shape))
    else:
        raise NotImplementedError("Unsupported observation space.")

    return SharedNetwork(input_dim, hidden_dim, act_space)


def preprocess_obs(obs, obs_space):
    if isinstance(obs_space, Discrete):
        return torch.eye(obs_space.n)[obs].float().unsqueeze(0)
    elif isinstance(obs_space, Box):
        return torch.tensor(obs, dtype=torch.float32).flatten().unsqueeze(0)
    else:
        raise NotImplementedError


env_ids = [
    "CliffWalking-v0",       # Gymnasium
    "LunarLander-v3",        # Gymnasium
    "PongNoFrameskip-v4",    # Gym
    "HalfCheetah-v5"         # Gymnasium
]

for env_id in env_ids:
    try:
        env = gymnasium.make(env_id)
        obs, _ = env.reset()
        model = create_shared_network(env)
        obs_tensor = preprocess_obs(obs, env.observation_space)

        with torch.no_grad():
            output = model(obs_tensor)

        print(f"✅ {env_id} passed.")
        print("  Actor Output:", output[0])
        print("  Critic Output:", output[1].item())
        print()

        env.close()

    except Exception as e:
        print(f"❌ {env_id} failed: {e}")


✅ CliffWalking-v0 passed.
  Actor Output: tensor([[0.2403, 0.2580, 0.2575, 0.2442]])
  Critic Output: 0.013050481677055359

✅ LunarLander-v3 passed.
  Actor Output: tensor([[0.2480, 0.2451, 0.2258, 0.2811]])
  Critic Output: -0.0473642498254776

❌ PongNoFrameskip-v4 failed: Environment `PongNoFrameskip` doesn't exist.
✅ HalfCheetah-v5 passed.
  Actor Output: (tensor([[ 0.0835,  0.0516, -0.0133,  0.0180, -0.0043,  0.0262]]), tensor([[0., 0., 0., 0., 0., 0.]], requires_grad=True))
  Critic Output: -0.011038020253181458



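Here, `PongNoFrameskip-v4` fails under `gymnasium.make` because the Atari environments are not registered in this Gymnasium installation; they are available through the legacy `gym` package with the Atari extras. The cell below therefore runs `PongNoFrameskip-v4` with `gym`.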

In [2]:
!pip install numpy==1.23.5
!pip install "gym[atari,accept-rom-license]" ale-py autorom
!AutoROM --accept-license
import gym

print(gym.envs.registry.keys())



In [56]:
from gym.wrappers import FrameStack, GrayScaleObservation, ResizeObservation

def make_pong_env(env_id="PongNoFrameskip-v4"):
    env = gym.make(env_id)
    env = gym.wrappers.AtariPreprocessing(env, grayscale_obs=True, scale_obs=False, frame_skip=4)
    env = ResizeObservation(env, shape=84)
    env = FrameStack(env, num_stack=4)
    return env

class SharedNetwork(nn.Module):
    def __init__(self, input_shape, action_space):
        super(SharedNetwork, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),  # from DeepMind Nature paper
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
        )

        conv_out_size = self._get_conv_output(input_shape)
        self.fc = nn.Linear(conv_out_size, 512)

        # Actor
        if isinstance(action_space, gym.spaces.Discrete):
            self.actor = nn.Linear(512, action_space.n)
        else:
            self.actor_mean = nn.Linear(512, action_space.shape[0])
            self.actor_log_std = nn.Parameter(torch.zeros(action_space.shape[0]))

        # Critic
        self.critic = nn.Linear(512, 1)

        self.action_space = action_space

    def _get_conv_output(self, shape):
        o = torch.zeros(1, *shape)
        o = self.conv(o)
        return int(np.prod(o.size()))

    def forward(self, x):
        x = x / 255.0
        x = x.squeeze(-1)
        x = self.conv(x)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc(x))

        value = self.critic(x)

        if isinstance(self.action_space, gym.spaces.Discrete):
            logits = self.actor(x)
            probs = F.softmax(logits, dim=-1)
            return probs, value
        else:
            mean = self.actor_mean(x)
            log_std = self.actor_log_std.expand_as(mean)
            return (mean, log_std), value

def create_shared_network(env):
    input_shape = env.observation_space.shape
    action_space = env.action_space
    return SharedNetwork(input_shape, action_space)

if __name__ == "__main__":
    env = make_pong_env()
    net = create_shared_network(env)

    obs = env.reset()
    obs = obs[0]
    obs = torch.tensor(obs, dtype=torch.float32).unsqueeze(0)

    with torch.no_grad():
        actor_out, critic_out = net(obs)

    print("Actor Output:", actor_out)
    print("Critic Output:", critic_out.item())

Actor Output: tensor([[0.1694, 0.1722, 0.1685, 0.1668, 0.1615, 0.1615]])
Critic Output: -0.004982149228453636


In [26]:
print(gymnasium.envs.registry.keys())

dict_keys(['CartPole-v0', 'CartPole-v1', 'MountainCar-v0', 'MountainCarContinuous-v0', 'Pendulum-v1', 'Acrobot-v1', 'phys2d/CartPole-v0', 'phys2d/CartPole-v1', 'phys2d/Pendulum-v0', 'LunarLander-v3', 'LunarLanderContinuous-v3', 'BipedalWalker-v3', 'BipedalWalkerHardcore-v3', 'CarRacing-v3', 'Blackjack-v1', 'FrozenLake-v1', 'FrozenLake8x8-v1', 'CliffWalking-v0', 'Taxi-v3', 'tabular/Blackjack-v0', 'tabular/CliffWalking-v0', 'Reacher-v2', 'Reacher-v4', 'Reacher-v5', 'Pusher-v2', 'Pusher-v4', 'Pusher-v5', 'InvertedPendulum-v2', 'InvertedPendulum-v4', 'InvertedPendulum-v5', 'InvertedDoublePendulum-v2', 'InvertedDoublePendulum-v4', 'InvertedDoublePendulum-v5', 'HalfCheetah-v2', 'HalfCheetah-v3', 'HalfCheetah-v4', 'HalfCheetah-v5', 'Hopper-v2', 'Hopper-v3', 'Hopper-v4', 'Hopper-v5', 'Swimmer-v2', 'Swimmer-v3', 'Swimmer-v4', 'Swimmer-v5', 'Walker2d-v2', 'Walker2d-v3', 'Walker2d-v4', 'Walker2d-v5', 'Ant-v2', 'Ant-v3', 'Ant-v4', 'Ant-v5', 'Humanoid-v2', 'Humanoid-v3', 'Humanoid-v4', 'Humanoid-v5

### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

### **1. Discrete Observations (e.g., `CliffWalking-v0`)**
- **Setup:**
  - Observation space is `Discrete(n)` → Use **one-hot encoding**.
  - Action space is `Discrete(k)` → Actor outputs a **Softmax distribution** over actions.
- **Motivation:**
  - One-hot encoding is a simple and efficient way to embed discrete states.
  - Softmax makes the actor output interpretable as a probability distribution, suitable for categorical policies (e.g., in REINFORCE or PPO).
- **Use Case:**
  - Classic tabular-style environments like grid worlds, where each state is a discrete index.
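
As a small illustration of the one-hot encoding described above, the sketch below encodes a single discrete state with NumPy. The helper `one_hot` is hypothetical, and state index 36 is just an example value; the 48 states follow from CliffWalking's 4 x 12 grid.

```python
import numpy as np

def one_hot(state: int, n_states: int) -> np.ndarray:
    """Encode a discrete state index as a float32 one-hot vector."""
    vec = np.zeros(n_states, dtype=np.float32)
    vec[state] = 1.0
    return vec

# CliffWalking uses a 4 x 12 grid, so there are 48 states; index 36 is
# used here purely as an example state.
encoded = one_hot(36, 48)
print(encoded.shape, int(encoded.argmax()), float(encoded.sum()))
```

The resulting vector has the same length as `obs_space.n`, which is why the network's input dimension is set to `n` for `Discrete` observation spaces.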

---

### **2. Continuous Observations + Discrete Actions (e.g., `LunarLander-v3`)**
- **Setup:**
  - Observation space is a `Box` → Flatten and feed to a fully connected layer.
  - Action space is `Discrete(k)` → Actor uses **Softmax** over logits.
- **Motivation:**
  - Many physics-based simulators use continuous observations (e.g., positions, velocities).
  - Discrete actions still apply, so a classification-style output is appropriate.
- **Use Case:**
  - Great for problems where state features are continuous but actions are discrete (robotics, lunar lander, car control).

---

### **3. Preprocessed Image Inputs + Discrete Actions (e.g., `PongNoFrameskip-v4`)**
- **Setup:**
  - Use **Atari wrappers** (e.g., grayscale, resize, frame stack) to get a 4x84x84 input.
  - CNN → Flatten → FC → Actor (softmax) + Critic.
- **Motivation:**
  - CNNs are essential for spatial and temporal feature extraction from pixel input.
  - Atari games have discrete actions, so Softmax fits well.
- **Use Case:**
  - Image-based environments where the agent perceives the world visually (e.g., Atari, vision-based robotics).
  - This architecture aligns with the DeepMind DQN and A3C setups.
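
The size of the flattened CNN output can be worked out by hand instead of with a dummy forward pass. The sketch below applies the standard convolution output formula to the three kernel/stride pairs used in the Pong network, assuming an 84 x 84 input and no padding; `conv_out` is a helper introduced here for illustration.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# (kernel, stride) pairs of the three conv layers in the Pong network.
layers = [(8, 4), (4, 2), (3, 1)]

size = 84
sizes = []
for kernel, stride in layers:
    size = conv_out(size, kernel, stride)
    sizes.append(size)

flat_features = 64 * size * size  # 64 output channels in the last conv layer
print(sizes, flat_features)
```

The spatial size shrinks 84 → 20 → 9 → 7, so the fully connected layer receives 64 × 7 × 7 = 3136 features.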

---

### **4. Continuous Observations + Continuous Actions (e.g., `HalfCheetah-v5`)**
- **Setup:**
  - Observation space is a `Box` → Flatten → FC layers.
  - Action space is `Box` → Actor outputs:
    - Mean vector via a Linear layer.
    - `log_std` as a learnable, state-independent parameter (it could optionally be predicted per state by an extra layer).
- **Motivation:**
  - For continuous action policies, the agent samples from a **Gaussian distribution**, parameterized by `mean` and `std`.
  - This is common in **policy gradient methods** like PPO, TRPO, SAC.
- **Use Case:**
  - Locomotion and control tasks with precise force/torque outputs (e.g., MuJoCo, PyBullet).
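
To show how a `(mean, log_std)` pair becomes a policy, the sketch below builds a diagonal Gaussian with `torch.distributions.Normal`, samples actions, and sums the per-dimension log-probabilities. The tensors are random stand-ins for network output, and the sizes (5 states, 6 action dimensions) are assumptions chosen for illustration.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)

batch_size, action_dim = 5, 6  # placeholder sizes for illustration

# Stand-ins for the network outputs: a per-state mean and a
# state-independent log standard deviation.
mean = torch.randn(batch_size, action_dim)
log_std = torch.zeros(action_dim).expand_as(mean)

dist = Normal(mean, log_std.exp())
actions = dist.sample()  # shape: (5, 6)

# Action dimensions are treated as independent, so the joint
# log-probability (and entropy) is the sum over the last axis.
log_probs = dist.log_prob(actions).sum(dim=-1)  # shape: (5,)
entropy = dist.entropy().sum(dim=-1)            # shape: (5,)
print(actions.shape, log_probs.shape, entropy.shape)
```

Parameterizing the log of the standard deviation keeps the standard deviation positive after `exp()` without any explicit constraint.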



### Task 3: Write Observation Normalization Function
Create a function `normalize_observation(obs, env)` that:
- Checks if the observation space is `Box` and has `low` and `high` attributes.
- If so, normalize the input observation.
- Otherwise, return the observation unchanged.

```python
# TODO: Define `normalize_observation(obs, env)`
```

Test this function with observations from:
- `LunarLander-v3`
- `PongNoFrameskip-v4`

Note: Atari observations are image arrays. Normalize pixel values to [0, 1]. For LunarLander-v3, the different elements in the observation vector have different ranges. Normalize them to [0, 1] using the `low` and `high` attributes of the observation space.


---

In [80]:
# BEGIN_YOUR_CODE
from gym.wrappers.atari_preprocessing import AtariPreprocessing

def normalize_observation(obs, env):
    obs_space = env.observation_space
    if isinstance(obs_space, (gym.spaces.Box, gymnasium.spaces.Box)) and hasattr(obs_space, 'low') and hasattr(obs_space, 'high'):
        if not isinstance(obs, np.ndarray):
            obs = np.array(obs)

        if np.issubdtype(obs.dtype, np.uint8):
            print("\nImage-based obs detected, scaling by 255.")
            return obs.astype(np.float32) / 255.0

        obs = obs.astype(np.float32)
        low = obs_space.low
        high = obs_space.high
        normalized = np.copy(obs)
        finite_mask = np.isfinite(low) & np.isfinite(high)
        if np.any(finite_mask):
            normalized[finite_mask] = (
                (obs[finite_mask] - low[finite_mask]) /
                (high[finite_mask] - low[finite_mask] + 1e-8)
            )

        return normalized
    return obs

env_lunar = gymnasium.make("LunarLander-v3")
obs_lunar, _ = env_lunar.reset()
norm_obs_lunar = normalize_observation(obs_lunar, env_lunar)

print("LunarLander-v3:")
print("Original:", obs_lunar)
print("Normalized:", norm_obs_lunar)
print("Range: (min =", np.min(norm_obs_lunar), ", max =", np.max(norm_obs_lunar), ")")

env_pong = gym.make("PongNoFrameskip-v4")
env_pong = AtariPreprocessing(env_pong, frame_skip=1, scale_obs=False)
env_pong = FrameStack(env_pong, 4)
obs_pong, _ = env_pong.reset()
obs_pong = np.array(obs_pong)
norm_obs_pong = normalize_observation(obs_pong, env_pong)

print("PongNoFrameskip-v4:")
print("Original dtype:", obs_pong.dtype)
print("Normalized dtype:", norm_obs_pong.dtype)
print("Shape:", norm_obs_pong.shape)
print("Range: (min =", np.min(norm_obs_pong), ", max =", np.max(norm_obs_pong), ")")
# END_YOUR_CODE

LunarLander-v3:
Original: [ 0.00143003  1.401336    0.1448246  -0.4259648  -0.00165019 -0.03280495
  0.          0.        ]
Normalized: [0.500286   0.7802672  0.50724125 0.47870177 0.49986866 0.49835977
 0.         0.        ]
Range: (min = 0.0 , max = 0.7802672 )

Image-based obs detected, scaling by 255.
PongNoFrameskip-v4:
Original dtype: uint8
Normalized dtype: float32
Shape: (4, 84, 84)
Range: (min = 0.20392157 , max = 0.9254902 )


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

### **Motivation Behind the Normalization Function**

Observation normalization is a **critical preprocessing step** for many reinforcement learning (RL) algorithms. The main goal is to ensure the input values fall within a standardized range, often `[0, 1]` or `[-1, 1]`. This improves:

- **Stability of training**
- **Faster convergence**
- **Better generalization across environments**
- **More balanced gradient updates**

## **LunarLander-v3 (Low-Dimensional Vector Observations)**

### **Observation Characteristics**
- Observation: 8-dimensional continuous vector
- Range: Varies across each dimension
- Example: Position, velocity, angle, leg contacts

### **Normalization Method**
- Normalize each component using:
  $\text{normalized}_i = \frac{x_i - \text{low}_i}{\text{high}_i - \text{low}_i}$

- Ensures all features are in the range `[0, 1]` if bounds are finite.
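
A worked instance of the formula above, as a minimal NumPy sketch. The bounds and observation are made-up numbers, and `min_max_normalize` is a hypothetical helper; it leaves components with infinite bounds unchanged, mirroring the finite-mask idea in `normalize_observation`.

```python
import numpy as np

def min_max_normalize(obs: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Scale components with finite bounds to [0, 1]; leave the rest unchanged."""
    obs = obs.astype(np.float32)
    out = obs.copy()
    finite = np.isfinite(low) & np.isfinite(high) & (high > low)
    out[finite] = (obs[finite] - low[finite]) / (high[finite] - low[finite])
    return out

# Made-up bounds: two bounded components and one unbounded component.
low = np.array([-2.0, 0.0, -np.inf], dtype=np.float32)
high = np.array([2.0, 1.0, np.inf], dtype=np.float32)
obs = np.array([1.0, 0.25, 7.0], dtype=np.float32)

normalized = min_max_normalize(obs, low, high)
print(normalized)
```

The first component maps to (1 − (−2)) / (2 − (−2)) = 0.75, the second stays at 0.25 because its bounds are already [0, 1], and the unbounded third component passes through as 7.0.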

### **Why This Matters**
- Helps algorithms like **DQN, PPO, A2C** treat all features equally in magnitude.
- Prevents features with larger numeric ranges from dominating others.
- Especially useful when observation components vary greatly (e.g., positions vs. contact flags).

### **When to Use**
- For any **low-dimensional Box space** with well-defined bounds.
- Ideal when using **MLPs** or other non-convolutional architectures.

## **PongNoFrameskip-v4 (Image Observations)**

### **Observation Characteristics**
- Observation: Stack of 4 grayscale frames (shape: `(4, 84, 84)`)
- Data type: `uint8` (pixel values in `[0, 255]`)

### **Normalization Method**
- Divide by 255 to scale values to `[0, 1]`

### **Why This Matters**
- Neural networks (especially CNNs) work better with inputs in a standardized range.
- Prevents large gradients from pixel values that are too high.
- Improves learning efficiency and stability, especially for vision-based policies.

### **When to Use**
- Any time you're dealing with **image-based observations** (Atari, VizDoom, etc.)
- Essential when using CNNs or pretrained models that expect normalized input


## Section 4: Gradient Clipping

To prevent exploding gradients, it's common practice to clip gradients before optimizer updates.

### Task 4: Clip Gradients for Actor-Critic Networks
Use dummy tensors and apply gradient clipping with the following PyTorch method:
```python
# During training, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```

Reuse the loss computation from Task 1a or 1b. After computing the gradients, apply gradient clipping.
Print the gradient norm before and after clipping to verify it’s applied.

🔗 PyTorch Docs: https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html


---

In [82]:
from torch.nn.utils import clip_grad_norm_

class ActorCritic(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(ActorCritic, self).__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.actor_fc = nn.Linear(128, output_dim)
        self.critic_fc = nn.Linear(128, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        action_logits = self.actor_fc(x)
        state_value = self.critic_fc(x)
        return action_logits, state_value

input_dim = 8
output_dim = 4
model = ActorCritic(input_dim, output_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy Training Loop
for step in range(1, 6):
    dummy_obs = torch.randn(1, input_dim)
    dummy_action = torch.randint(0, output_dim, (1,))
    dummy_return = torch.rand(1, 1)

    logits, value = model(dummy_obs)

    policy_loss = F.cross_entropy(logits, dummy_action)
    value_loss = F.mse_loss(value, dummy_return)
    loss = policy_loss + value_loss

    optimizer.zero_grad()
    loss.backward()

    #before clipping
    total_norm_before = torch.norm(
        torch.stack([p.grad.norm() for p in model.parameters() if p.grad is not None])
    )

    # Clip gradients
    clip_grad_norm_(model.parameters(), max_norm=0.5)

    # after clipping
    total_norm_after = torch.norm(
        torch.stack([p.grad.norm() for p in model.parameters() if p.grad is not None])
    )

    optimizer.step()
    print(f"[Step {step}] Loss: {loss.item():.4f} | Grad Norm Before: {total_norm_before:.4f} | After: {total_norm_after:.4f}")


[Step 1] Loss: 1.5469 | Grad Norm Before: 4.5488 | After: 0.5000
[Step 2] Loss: 1.5651 | Grad Norm Before: 2.6962 | After: 0.5000
[Step 3] Loss: 1.8967 | Grad Norm Before: 3.5366 | After: 0.5000
[Step 4] Loss: 1.8945 | Grad Norm Before: 4.4850 | After: 0.5000
[Step 5] Loss: 1.6763 | Grad Norm Before: 5.1891 | After: 0.5000


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

In deep reinforcement learning (especially with Actor-Critic architectures), gradients can sometimes **explode**, particularly:
- When rewards are very large or highly variable
- When the network becomes unstable due to off-policy learning or poor initial exploration
- When using deep or recurrent networks with many layers

#### Problem: Exploding Gradients
- Large gradients → large parameter updates → unstable learning → diverging policy/value function

#### Solution: Gradient Clipping
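This technique **caps the total gradient norm** at a fixed maximum (here, `0.5`) so that updates remain stable. Gradients whose total norm is already below the maximum are left unchanged; larger ones are rescaled while keeping their direction.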
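
As a minimal sketch of what the clipping does numerically, the example below sets a gradient by hand on a single parameter and clips it. The gradient values are chosen so the norm is easy to verify (a 3-4-5 triangle); nothing here comes from a real training run.

```python
import torch
from torch.nn.utils import clip_grad_norm_

# A single parameter with a hand-set gradient of norm 5 (3-4-5 triangle).
param = torch.nn.Parameter(torch.zeros(2))
param.grad = torch.tensor([3.0, 4.0])

max_norm = 0.5
# clip_grad_norm_ rescales gradients in place and returns the total norm
# measured before clipping.
norm_before = clip_grad_norm_([param], max_norm=max_norm)
norm_after = param.grad.norm()

print(f"before: {norm_before.item():.4f}, after: {norm_after.item():.4f}")
```

Because `clip_grad_norm_` returns the pre-clipping norm, that return value can be logged directly instead of computing the norm by hand before the call. The gradient is scaled from `[3, 4]` to roughly `[0.3, 0.4]`, so its direction is preserved.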


If you are working in a team, provide a contribution summary.
| Team Member | Step# | Contribution (%) |
|---|---|---|
| Ruthvik Vasantha Kumar  | Task 1 | 100%  |
| Ruthvik Vasantha Kumar  | Task 2 | 100%  |
| Shreyas Bellary Manjunath  | Task 3 | 100%  |
| Shreyas Bellary Manjunath  | Task 4 | 100%  |
|   | **Total** | 100%  |
