## <center>CSE 546: Reinforcement Learning</center>
### <center>Prof. Alina Vereshchaka</center>
#### <center>Spring 2025</center>

Welcome to the Assignment 3, Part 1: Introduction to Actor-Critic Methods! It includes the implementation of simple actor and critic networks and best practices used in modern Actor-Critic algorithms.

<b>Submitted By:</b><br>
Name - Shivansh Gupta<br>
UBIT No - 50604127<br>
UBIT Name - sgupta67<br>
UB Email ID - sgupta67@buffalo.edu<br>

Name - Karan Ramchandani<br>
UBIT No - 50610533<br>
UBIT Name - karamchan<br>
UB Email ID - karamchan@buffalo.edu<br>

## Section 0: Setup and Imports

In [43]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import gymnasium as gym
import matplotlib.pyplot as plt
from collections import deque
from gymnasium.wrappers import AtariPreprocessing
import ale_py

# Set seed for reproducibility
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)

<torch._C.Generator at 0x112001b70>

## Section 1: Actor-Critic Network Architectures and Loss Computation

In this section, you will explore two common architectural designs for Actor-Critic methods and implement their corresponding loss functions using dummy tensors. These architectures are:
- A. Completely separate actor and critic networks
- B. A shared network with two output heads

Both designs are widely used in practice. Shared networks are often more efficient and generalize better, while separate networks offer more control and flexibility.

---


### Task 1a – Separate Actor and Critic Networks with Loss Function

Define a class `SeparateActorCritic`. Your goal is to:
- Create two completely independent neural networks: one for the actor and one for the critic.
- The actor should output a probability distribution over discrete actions (use `nn.Softmax`).
- The critic should output a single scalar value.

 Use `nn.ReLU()` as your activation function. Include at least one hidden layer of reasonable width (e.g. 64 or 128 units).

```python
# TODO: Define SeparateActorCritic class
```

 Next, simulate training using dummy tensors:
1. Generate dummy tensors for log-probabilities, returns, estimated values, and entropies.
2. Compute the actor loss using the advantage (return - value).
3. Compute the critic loss as mean squared error between values and returns.
4. Use a single optimizer for both the Actor and the Critic. In this case, combine the actor and critic losses into a total loss and perform backpropagation.
5. Use a separate optimizers for both the Actor and the Critic. In this case, keep the actor and critic losses separate and perform backpropagation.

```python
# TODO: Simulate loss computation and backpropagation
```

🔗 Helpful references:
- PyTorch Softmax: https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html
- PyTorch MSE Loss: https://pytorch.org/docs/stable/generated/torch.nn.functional.mse_loss.html

---

In [44]:
# TODO: Define a class SeparateActorCritic with separate networks for actor and critic

# BEGIN_YOUR_CODE

# Creating a completely independent actor-critic neural network architecture
class SeparateActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim):
        super(SeparateActorCritic, self).__init__()
        
        # Actor network
        self.a_fc1 = nn.Linear(state_dim, hidden_dim)
        self.a_fc2 = nn.Linear(hidden_dim, action_dim)
        self.a_softmax = nn.Softmax(dim=-1)
        
        # Critic network
        self.c_fc1 = nn.Linear(state_dim, hidden_dim)
        self.c_fc2 = nn.Linear(hidden_dim, 1)
        
    def forward(self, state):
        
        # Forward pass for actor
        f_actor = F.relu(self.a_fc1(state))
        logits_actor = self.a_fc2(f_actor)
        action_probs = self.a_softmax(logits_actor)
        
        # Forward pass for critic
        f_critic = F.relu(self.c_fc1(state))
        state_value = self.c_fc2(f_critic)
        
        return action_probs, state_value
    
if __name__ == "__main__":
    
    batch_size = 5
    state_dim = 4
    action_dim = 2
    hidden_dim = 128
    dummy_states = torch.randn(batch_size, state_dim)
    
    # Initialize the actor-critic model
    AC_model = SeparateActorCritic(state_dim, action_dim, hidden_dim)
    
    # Forward pass through the model to get the action probabilities and state values
    action_probs, state_values = AC_model(dummy_states)
    
    # For simulation of loss computation, by using multinomial sampling we sampled an action for each state
    dummy_actions = action_probs.multinomial(num_samples=1).squeeze(-1)
    
    # Computing the log probabilities of the dummy actions taken
    dummy_log_probs = torch.log(action_probs[range(batch_size), dummy_actions])
    
    # Computing the entropy of the each action probability distribution
    dummy_entropy_value = -torch.sum(action_probs * torch.log(action_probs + 1e-10), dim=1)
    
    # Creating dummy rewards (or target values) for the critic evaluation
    dummy_critic_rewards = torch.randn(batch_size)
    
    # Computing the advantage estimates which is the difference between the returns and the estimated state values
    compute_advantage = dummy_critic_rewards - state_values.squeeze(-1)
    
    # Computing the actor loss using the max log probabilities and the advantage estimates
    actor_loss = -torch.mean(dummy_log_probs * compute_advantage) - 0.01 * torch.mean(dummy_entropy_value)
    
    # Computing the critic loss using the mean squared error between the returns and the estimated state values
    critic_loss = F.mse_loss(state_values.squeeze(-1), dummy_critic_rewards)
    
    print("Actor Loss:", actor_loss.item())
    print("Critic Loss:", critic_loss.item())
    
    # Combining and using a simple optimizer for both actor and critic
    combined_optimizer = optim.Adam(AC_model.parameters(), lr=0.001)
    combined_loss_computed = actor_loss + critic_loss
    
    # Performing a backward pass and optimizing the model using the combined loss
    combined_optimizer.zero_grad()
    combined_loss_computed.backward()
    combined_optimizer.step()
    print("Combined Loss:", combined_loss_computed.item())
    
    print("\nCombined Optimizer Step Performed....")
    
    #--------------------------------------------------------------------------------------------"""
    
    # Using seperate optimizers for actor and critic
    parameter_Actor = list(AC_model.a_fc1.parameters()) + list(AC_model.a_fc2.parameters())
    parameter_Critic = list(AC_model.c_fc1.parameters()) + list(AC_model.c_fc2.parameters())
    
    optimizer_actor = optim.Adam(parameter_Actor, lr=0.001)
    optimizer_critic = optim.Adam(parameter_Critic, lr=0.001)
    
    # Forward pass again to get a fresh computation graph as we first performed a backward pass using a combined optimizer
    action_probs, state_values = AC_model(dummy_states)

    # Recomputing dummy actions, log-probabilities, and entropies
    dummy_actions = action_probs.multinomial(num_samples=1).squeeze(-1)
    dummy_log_probs = torch.log(action_probs[range(batch_size), dummy_actions])
    dummy_entropy_value = -torch.sum(action_probs * torch.log(action_probs + 1e-10), dim=1)

    # Recomputing the advantage and losses for the above computed values
    compute_advantage = dummy_critic_rewards - state_values.squeeze(-1)
    actor_loss = -torch.mean(dummy_log_probs * compute_advantage) - 0.01 * torch.mean(dummy_entropy_value)
    critic_loss = F.mse_loss(state_values.squeeze(-1), dummy_critic_rewards)

    # Now performing a backward pass and optimizing the model using the seperate optimizers
    optimizer_actor.zero_grad()
    optimizer_critic.zero_grad()

    torch.autograd.backward(
        [actor_loss, critic_loss],
        [torch.ones_like(actor_loss), torch.ones_like(critic_loss)]
    )

    optimizer_actor.step()
    optimizer_critic.step()
    
    print("\nSeperate Optimizer Step Performed....")
    
        
# END_YOUR_CODE

Actor Loss: -0.41221699118614197
Critic Loss: 2.0711171627044678
Combined Loss: 1.6589001417160034

Combined Optimizer Step Performed....

Seperate Optimizer Step Performed....


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

In [45]:
# The reason for using separate networks for the actor and the critic is based on the requirement to maintain the independent nature of learning for both modules with regards to policy learning (actor) and value estimation (critic). The seperate nature of the two modules confirms that both networks can focus on their respective duties of policy improvement for the actor and value function approximation for the critic—without interfering with each other. This seperation encourages more flexible learning dynamics, especially in situations where the value function and policy representations are very different.

# The use of this kind of setup are particularly beneficial in environments where the value function and policy have conflicting or diverging learning patterns. For example, in Atari environments such as PongNoFrameskip-v4, the actor can learn in a better way by capturing fast-moving dynamics for selecting actions, while the critic could need to focus on more stable value evaluation for the long-term. Environments like HalfCheetah-v3,the actor and the critic could learning effectively by having independent model paths that successfully manage to capture different dimensions of movement and reward.

### Task 1b – Shared Network with Actor and Critic Heads + Loss Function

Now define a class `SharedActorCritic`:
- Build a shared base network (e.g., linear layer + ReLU)
- Create two heads: one for actor (output action probabilities) and one for critic (output state value)

```python
# TODO: Define SharedActorCritic class
```

Then:
1. Pass a dummy input tensor through the model to obtain action probabilities and value.
2. Simulate dummy rewards and compute advantage.
3. Compute the actor and critic losses, combine them, and backpropagate.

```python
# TODO: Simulate shared network loss computation and backpropagation
```

 Use `nn.Softmax` for actor output and `nn.Linear` for scalar critic output.

🔗 More reading:
- Policy Gradient Methods: https://spinningup.openai.com/en/latest/algorithms/vpg.html
- Actor-Critic Overview: https://www.tensorflow.org/agents/tutorials/6_reinforce_tutorial
- PyTorch Categorical Distribution: https://pytorch.org/docs/stable/distributions.html#categorical

---

In [46]:
# BEGIN_YOUR_CODE

# Creating a shared base network for both actor and critic
class SharedActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim):
        super(SharedActorCritic, self).__init__()
        
        # Shared base network
        self.shared_fc1 = nn.Linear(state_dim, hidden_dim)
        
        # Seoerate heads for actor and critic
        self.actor_head = nn.Linear(hidden_dim, action_dim)
        self.softmax_actor = nn.Softmax(dim=-1)
        self.critic_head = nn.Linear(hidden_dim, 1)
        
    def forward(self, state):
        shared_output = F.relu(self.shared_fc1(state))
        
        # Forward pass for actor
        actor_logits = self.actor_head(shared_output)
        probab_action = self.softmax_actor(actor_logits)
        
        # Forward pass for critic
        state_value = self.critic_head(shared_output)
        
        return probab_action, state_value
    
if __name__ == "__main__":
    batch_size = 5
    state_dim = 4
    action_dim = 2
    hidden_dim = 128
    dummy_states_shared = torch.randn(batch_size, state_dim)
    
    # Initializing the actor-critic model
    AC_model_shared = SharedActorCritic(state_dim, action_dim, hidden_dim)
    
    # Forward pass through the model to get the action probabilities and state values
    action_probs_shared, state_values_shared = AC_model_shared(dummy_states)
    
    # For simulation of loss computation, by using multinomial sampling we sampled an action for each state
    dummy_actions_shared = action_probs_shared.multinomial(num_samples=1).squeeze(-1)
    
    # Computing the log probabilities of the dummy actions taken
    dummy_log_probs_shared = torch.log(action_probs_shared[range(batch_size), dummy_actions_shared])
    
    # Computing the entropy of the each action probability distribution
    dummy_entropy_value_shared = -torch.sum(action_probs_shared * torch.log(action_probs_shared + 1e-10), dim=1)
    
    # Creating dummy rewards (or target values) for the critic evaluation
    dummy_critic_rewards_shared = torch.randn(batch_size)
    
    # Computing the advantage estimates which is the difference between the returns and the estimated state values
    compute_advantage_shared = dummy_critic_rewards_shared - state_values_shared.squeeze(-1)
    
    # Computing the actor loss using the max log probabilities and the advantage estimates
    actor_loss_shared = -torch.mean(dummy_log_probs_shared * compute_advantage_shared) - 0.01 * torch.mean(dummy_entropy_value_shared)
    
    # Computing the critic loss using the mean squared error between the returns and the estimated state values
    critic_loss_shared = F.mse_loss(state_values_shared.squeeze(-1), dummy_critic_rewards_shared)
    
    print("Actor Loss (Shared):", actor_loss_shared.item())
    print("Critic Loss (Shared):", critic_loss_shared.item())
    
    total_loss_shared = actor_loss_shared + critic_loss_shared
    print("Total Loss (Shared):", total_loss_shared.item())
    
    # Using a single optimizer for the shared model
    optimizer_shared = optim.Adam(AC_model_shared.parameters(), lr=0.001)
    
    # Performing a backward pass and optimizing the model using the shared optimizer
    optimizer_shared.zero_grad()
    total_loss_shared.backward()
    optimizer_shared.step()
    
    print("Shared Optimizer Step Performed....")
    
    
# END_YOUR_CODE

Actor Loss (Shared): -0.1714530736207962
Critic Loss (Shared): 0.45052018761634827
Total Loss (Shared): 0.27906709909439087
Shared Optimizer Step Performed....


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

In [47]:
# The motivation behin using a shared base network for both actor and critic is to improve learning efficiency. In this setup, the actor and critic share the same layers to extract features, by following this process they don’t have to learn different sets of features. The shared layers capture general patterns from input states, which the actor and critic then use for their respective outputs and this ultimately leads to fewer parameters, allowing faster learning. This is highly crucial as shared architecture simplifies the system, making it easier to scale and manage.

# The use of this kind of setup are particularly beneficial in environments when both the actor and critic benefit from using the same information from the environment, and this occur when the poilicy and value function rely on somewhat same features. For example, in environments like CartPole-v1, the actor and critic can learn effectively by having a shared base network that captures the same features like cart position, angle and velocity. Similarly, in environments like LunarLander-v3, the actor and critic can learn effectively by having a shared base network that captures the same features like lander position and velocity which can help take better decisions.

## Section 2: Auto-Adaptive Network Setup for Environments

You will now create a function that builds a shared actor-critic network that adapts to any Gymnasium environment. This function should inspect the environment and build input/output layers accordingly.

### Task 2: Auto-generate Input and Output Layers
Write a function `create_shared_network(env)` that constructs a neural network using the following rules:
- The input layer should match the environment's observation space.
- The output layer for the **actor** should depend on the action space:
  - For discrete actions: output probabilities using `nn.Softmax`.
  - For continuous actions: output mean and log std for a Gaussian distribution.
- The **critic** always outputs a single scalar value.

```python
# TODO: Define function `create_shared_network(env)`
```

#### Environments to Support:
Test your function with the following environments:
1. `CliffWalking-v0` (Use one-hot encoding for discrete integer observations.)
2. `LunarLander-v3` (Standard Box space for observations and discrete actions.)
3. `PongNoFrameskip-v4` (Use gym wrappers for Atari image preprocessing.)
4. `HalfCheetah-v5` (Continuous observation and continuous action.)

```python
# TODO: Loop through environments and test `create_shared_network`
```

Hint: Use `gym.spaces` utilities to determine observation/action types dynamically.

🔗 Observation/Action Space Docs:
- https://gymnasium.farama.org/api/spaces/

---

In [12]:
# BEGIN_YOUR_CODE
def create_shared_network(env, hidden_dim):
    observation_space = env.observation_space
    action_space = env.action_space
    
    # Finding the input dimension based on the observation space
    if isinstance(observation_space, gym.spaces.Discrete):
        input_dimension = observation_space.n
        is_discrete_observation = True
    elif isinstance(observation_space, gym.spaces.Box):
        input_dimension = int(np.prod(observation_space.shape))
        is_discrete_observation = False
    else:
        raise ValueError("Unsupported observation space type".format(type(observation_space)))
    
    # Finding the actor head setting based on the action space
    continuous_action = False
    if isinstance(action_space, gym.spaces.Discrete):
        actor_output_dimension = action_space.n
    elif isinstance(action_space, gym.spaces.Box):
        actor_output_dimension = int(np.prod(action_space.shape))
        continuous_action = True
    else:
        raise ValueError("Unsupported action space type".format(type(action_space)))
    
    # Creating the shared actor-critic model
    class Shared_AC_Netwok(nn.Module):
        def __init__(self):
            super(Shared_AC_Netwok, self).__init__()
            self.is_discrete_observation = is_discrete_observation
            self.continuous_action = continuous_action
            self.input_dimension = input_dimension
            
            # Shared base network for actor and critic
            self.shared_fc1 = nn.Linear(input_dimension, hidden_dim)
            self.relu_layer = nn.ReLU()
            
            # Actor head configuration
            if self.continuous_action:
                self.actor_head_avg = nn.Linear(hidden_dim, actor_output_dimension)
                self.actor_log_std = nn.Parameter(torch.zeros(actor_output_dimension))
            else:
                self.actor_head = nn.Linear(hidden_dim, actor_output_dimension)
                self.actor_softmax = nn.Softmax(dim=-1)
                
            # Critic head configuration
            self.critic_head = nn.Linear(hidden_dim, 1)
            
        def forward(self, state):
            if self.is_discrete_observation:
                state = F.one_hot(state.long(), num_classes=self.input_dimension).float()
            else:
                if len(state.shape) > 2:
                    state = state.view(state.size(0), -1)
                    
            state = self.relu_layer(self.shared_fc1(state))
            state_value = self.critic_head(state)
            
            # Computing the action probabilities for continuous and discrete action spaces
            if self.continuous_action:
                mean = self.actor_head_avg(state)
                log_std = torch.exp(self.actor_log_std)
                return (mean, log_std), state_value
            else:
                logits = self.actor_head(state)
                action_probs = self.actor_softmax(logits)
                return action_probs, state_value
            
    return Shared_AC_Netwok()

if __name__ == "__main__":
    used_env_names = [
        "CliffWalking-v0",
        "LunarLander-v3",
        "PongNoFrameskip-v4",
        "HalfCheetah-v5"
    ]
    
    for env_name in used_env_names:
        print(f"\n-----Environment: {env_name} imported successfully-----\n")
        try:
            test_env = gym.make(env_name)
        except gym.error.NameNotFound as e:
            print(f"Error creating environment {env_name}: {e}")
            continue
        
        # Apply Atari Wrappers for preprocessing for Atari pong
        if env_name == "PongNoFrameskip-v4":
            try:
                test_env = gym.wrappers.AtariPreprocessing(test_env, frame_skip=4, grayscale_obs=True, scale_obs=True)
                test_env = gym.wrappers.FrameStackObservation(test_env, stack_size=4)
            except Exception as e:
                print(f"Error applying Atari wrappers: {e}")
                continue
        
        hidden_dim = 128
        
        # Initializing the shared actor-critic model
        shared_ac_network = create_shared_network(test_env, hidden_dim)
        
        # Creating a dummy simulation for testing all the envrionments
        batch_size = 2
        if isinstance(test_env.observation_space, gym.spaces.Discrete):
            dummy_states = torch.randint(0, test_env.observation_space.n, (batch_size,))
        elif isinstance(test_env.observation_space, gym.spaces.Box):
            dummy_states = torch.randn(batch_size, *test_env.observation_space.shape)
        else:
            dummy_states = None
        
        # Performing a forward pass through the shared actor-critic model
        if dummy_states is not None:
            shared_newtork_output = shared_ac_network(dummy_states)
            if isinstance(test_env.action_space, gym.spaces.Box):
                (state_mean, log_std), state_value = shared_newtork_output
                print(f"Action Mean: {state_mean}, \nAction Log Std: {log_std}, \nState Value Estimates: {state_value}")
            else:
                action_probs, state_value = shared_newtork_output
                print(f"Action Probabilities: {action_probs}, \nState Value Estimates: {state_value}")
        
        
# END_YOUR_CODE


-----Environment: CliffWalking-v0 imported successfully-----

Action Probabilities: tensor([[0.2648, 0.2445, 0.2479, 0.2428],
        [0.2665, 0.2409, 0.2354, 0.2572]], grad_fn=<SoftmaxBackward0>), 
State Value Estimates: tensor([[-0.0416],
        [ 0.0014]], grad_fn=<AddmmBackward0>)

-----Environment: LunarLander-v3 imported successfully-----

Action Probabilities: tensor([[0.1729, 0.3826, 0.2302, 0.2143],
        [0.2624, 0.1999, 0.2749, 0.2629]], grad_fn=<SoftmaxBackward0>), 
State Value Estimates: tensor([[0.1171],
        [0.1530]], grad_fn=<AddmmBackward0>)

-----Environment: PongNoFrameskip-v4 imported successfully-----

Action Probabilities: tensor([[0.1869, 0.1190, 0.1025, 0.1307, 0.1793, 0.2816],
        [0.1485, 0.1148, 0.2463, 0.1504, 0.1575, 0.1825]],
       grad_fn=<SoftmaxBackward0>), 
State Value Estimates: tensor([[0.0064],
        [0.4165]], grad_fn=<AddmmBackward0>)

-----Environment: HalfCheetah-v5 imported successfully-----

Action Mean: tensor([[-0.3002, -0.059

### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

In [49]:
# The motivation behind using a general purpose shared network `create_shared_network(env)` is to design a flexible and scalable network architecture that can automatically handles different environments without needing manual adjustments. As we know Gym environments are very different in terms of what the observation space are and what type of actions are needed to be taken in those environments. The observations can be discrete numbers, continuous lists, or groups of image frames, whereas actions can be discrete choices, continuous actions, multi-discrete in nature. So this function dynamically looks at the types of spaces in these environments and then, it creates a shared actor-critic model that encodes these inputs accordingly and additionally select the right output head structure for either discrete or continuous actions. 

# The use of this kind of setup is particularly beneficial when the environments are hetrogeneous in nature - like CliffWalking-v0, which involves fixed states and actions, or LunarLander-v3, where you have box observations and fixed choices for actions. In PongNoFrameskip-v4, you work with image-based input and fixed actions, while in HalfCheetah-v5, you deal with continuous states and actions using MuJoCo, so instead of designing separate network architectures for each environments, we can use this efficient actor-critic model directly using this function. This architecture is highly useful in rprojects where experiments span various benchmarks, as it allows for rapid prototyping and testing of different algorithms without the need for extensive code changes.

### Task 3: Write Observation Normalization Function
Create a function `normalize_observation(obs, env)` that:
- Checks if the observation space is `Box` and has `low` and `high` attributes.
- If so, normalize the input observation.
- Otherwise, return the observation unchanged.

```python
# TODO: Define `normalize_observation(obs, env)`
```

Test this function with observations from:
- `LunarLander-v3`
- `PongNoFrameskip-v4`

Note: Atari observations are image arrays. Normalize pixel values to [0, 1]. For LunarLander-v3, the different elements in the observation vector have different ranges. Normalize them to [0, 1] using the `low` and `high` attributes of the observation space.


---

In [13]:
# BEGIN_YOUR_CODE
def normalize_observation(obs, env):
    
    # Check if the observation space is a Box Space and has low and high attributes
    if isinstance(env.observation_space, gym.spaces.Box) and hasattr(env.observation_space, 'low') and hasattr(env.observation_space, 'high'):
        low_attr = env.observation_space.low
        high_attr = env.observation_space.high
        
        # For image based observations we take the low as 0 and high as 255, so divide by 255 to normalize
        if np.all(low_attr == 0) and np.all(high_attr == 255):
            obs = obs.astype(np.float32) / 255.0
            return obs
        # For other observations we normalize the values between 0 and 1
        else:
            difference_value = high_attr - low_attr
            difference_value[difference_value == 0] = 1
            obs = (obs - low_attr) / difference_value
            return obs
    else:
        raise ValueError("Observation space is not a Box space or does not have low/high attributes.")
    
if __name__ == '__main__':
    print(f"\n-----Testing for Environment: LunarLander-v3-----\n")
    lunar_test_env = gym.make("LunarLander-v3")
    observation_lunar, _ = lunar_test_env.reset()
    print(f"Initial Observation Space for LunarLander: \n{observation_lunar}")
    print("Lunar observation range: {} to {}\n".format(observation_lunar.min(), observation_lunar.max()))
    normalized_observation_lunar = normalize_observation(observation_lunar, lunar_test_env)
    print(f"Normalized Observation Space for LunarLander: \n{normalized_observation_lunar}")
    print("Lunar normalized observation range: {} to {}\n".format(normalized_observation_lunar.min(), normalized_observation_lunar.max()))
    
    print(f"\n-----Testing for Environment: PongNoFrameskip-v4-----\n")
    pong_test_env = gym.make("PongNoFrameskip-v4")
    observation_pong, _ = pong_test_env.reset()
    
    if hasattr(pong_test_env, '__array__'):
        observation_pong = np.array(observation_pong)
    
    print(f"Initial Observation Space shape for Pong: {observation_pong.shape}")
    print("Pong observation pixel range: {} to {}\n".format(observation_pong.min(), observation_pong.max()))
    normalized_observation_pong = normalize_observation(observation_pong, pong_test_env)
    print(f"Normalized Observation Space shape for Pong:{normalized_observation_pong.shape}")
    print("Pong normalized observation pixel range: {} to {}".format(normalized_observation_pong.min(), normalized_observation_pong.max()))
            
       
# END_YOUR_CODE


-----Testing for Environment: LunarLander-v3-----

Initial Observation Space for LunarLander: 
[ 3.1213759e-04  1.4149177e+00  3.1597104e-02  1.7766991e-01
 -3.5485908e-04 -7.1572280e-03  0.0000000e+00  0.0000000e+00]
Lunar observation range: -0.007157227955758572 to 1.4149177074432373

Normalized Observation Space for LunarLander: 
[0.5000624  0.78298354 0.5015799  0.5088835  0.49997178 0.49964213
 0.         0.        ]
Lunar normalized observation range: 0.0 to 0.7829835414886475


-----Testing for Environment: PongNoFrameskip-v4-----

Initial Observation Space shape for Pong: (210, 160, 3)
Pong observation pixel range: 0 to 228

Normalized Observation Space shape for Pong:(210, 160, 3)
Pong normalized observation pixel range: 0.0 to 0.8941176533699036


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

In [50]:
# The motivation behind Normalization of observation is a crucial step in the preprocessing pipeline of reinforcement learning, especially when training deep neural networks. Raw observations can vary highly in size and value, which can make training unstable, and cause issues like disappearing or excessively large gradients, and exponentially slow down learning. The "normalize_observation" function helps by adjusting all these observations to fit within the same range, i.e [0,1], so that the changes in the observations are more consistent and easier to learn from. This method ensures that the neural network receives inputs that are all on a consistent scale and motivate the network to learn more easily give better outputs for the actor and critic heads.

# Normalization is highly used when the environments have continuous observation spaces or when dealing with raw pixel inputs. For example, LunarLander-v3, uses an 8-dimensional state vector, which can have high numbers for position or velocity. By normalizing these numbers, none of them become too dominant, which helps the training process stay stable. While, environment like PongNoFrameskip-v4 consists of images that are 210 X 160 pixels which are RGB in nature so the pixel values will range from 0 to 255 and by scaling these values to a range between 0.0 and 1.0, the changes in obseervation becomes more stable and suitable for training neural networks.

## Section 4: Gradient Clipping

To prevent exploding gradients, it's common practice to clip gradients before optimizer updates.

### Task 4: Clip Gradients for Actor-Critic Networks
Use dummy tensors and apply gradient clipping with the following PyTorch method:
```python
# During training, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```

Reuse the loss computation from Task 1a or 1b. After computing the gradients, apply gradient clipping.
Print the gradient norm before and after clipping to verify it’s applied.

🔗 PyTorch Docs: https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html


---

In [42]:
# BEGIN_YOUR_CODE

# Creating a Seperate Actor-Critic model as per 1a
class SeparateActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim):
        super(SeparateActorCritic, self).__init__()
        
        # Actor network
        self.a_fc1 = nn.Linear(state_dim, hidden_dim)
        self.a_fc2 = nn.Linear(hidden_dim, action_dim)
        self.a_softmax = nn.Softmax(dim=-1)
        
        # Critic network
        self.c_fc1 = nn.Linear(state_dim, hidden_dim)
        self.c_fc2 = nn.Linear(hidden_dim, 1)
        
    def forward(self, state):
        
        # Forward pass for actor
        f_actor = F.relu(self.a_fc1(state))
        logits_actor = self.a_fc2(f_actor)
        action_probs = self.a_softmax(logits_actor)
        
        # Forward pass for critic
        f_critic = F.relu(self.c_fc1(state))
        state_value = self.c_fc2(f_critic)
        
        return action_probs, state_value
    
# Computing L2 norm of the gradients of all parameters in the model
def get_gradient_normalization(model):
    total = 0
    for param in model.parameters():
        if param.grad is not None:
            normalized_norm = param.grad.data.norm(2)
            total += normalized_norm.item() ** 2
    total = total ** 0.5
    return total

# Running a dummy simulation to test the gradient normalization
if __name__ == "__main__":
    
    batch_size = 5
    state_dim = 4
    action_dim = 2
    hidden_dim = 128
    dummy_states = torch.randn(batch_size, state_dim)
    
    # Initializing the actor-critic model
    AC_model = SeparateActorCritic(state_dim, action_dim, hidden_dim)
    
    # Forward pass through the model to get the action probabilities and state values
    action_probs, state_values = AC_model(dummy_states)
    
    # For simulation of loss computation, by using multinomial sampling we sampled an action for each state
    dummy_actions = action_probs.multinomial(num_samples=1).squeeze(-1)
    
    # Computing the log probabilities of the dummy actions taken
    dummy_log_probs = torch.log(action_probs[range(batch_size), dummy_actions])
    
    # Computing the entropy of the each action probability distribution
    dummy_entropy_value = -torch.sum(action_probs * torch.log(action_probs + 1e-10), dim=1)
    
    # Creating dummy rewards (or target values) for the critic evaluation
    dummy_critic_rewards = torch.randn(batch_size)
    
    # Computing the advantage estimates which is the difference between the returns and the estimated state values
    compute_advantage = dummy_critic_rewards - state_values.squeeze(-1)
    
    # Computing the actor loss using the max log probabilities and the advantage estimates
    actor_loss = -torch.mean(dummy_log_probs * compute_advantage) - 0.01 * torch.mean(dummy_entropy_value)
    
    # Computing the critic loss using the mean squared error between the returns and the estimated state values
    critic_loss = F.mse_loss(state_values.squeeze(-1), dummy_critic_rewards)
    
    print("Actor Loss:", actor_loss.item())
    print("Critic Loss:", critic_loss.item())
    total_loss = actor_loss + critic_loss
    print("Combined Loss:", total_loss.item())
    
    # Performing a backward pass and optimizing the model using the combined loss
    optimizer = optim.Adam(AC_model.parameters(), lr=0.001)
    optimizer.zero_grad()
    
    total_loss.backward()
    
    # Computing the gradient normalization
    previous_norm = get_gradient_normalization(AC_model)
    print("Gradient Norm before optimization:", previous_norm)
    
    # Performing the gradient clipping
    torch.nn.utils.clip_grad_norm_(AC_model.parameters(), max_norm=0.5)
    
    # Computing the gradient normalization after clipping
    clipped_norm = get_gradient_normalization(AC_model)
    print("Gradient Norm after optimization:", clipped_norm)
    
    # Performing the optimization step
    optimizer.step()

# END_YOUR_CODE

Actor Loss: -0.41626057028770447
Critic Loss: 1.5845913887023926
Combined Loss: 1.1683307886123657
Gradient Norm before optimization: 3.8339337676779257
Gradient Norm after optimization: 0.4999999033364367


### Discuss the motivation behind each setup and when it may be preferred in practice.

YOUR ANSWER:

In [1]:
# The motivation behind using Gradient clipping is to prevent the exploding gradient issue during backpropogation, as in the case of actor-critic methods, the optimization for both the actor and critic is performed simultaneously, which can lead to large gradients due to high variance environments or when the rewards has super rapid highs or lows, which ultimately can destabilize the training process. So by applyin L2-Norm clipping we can limit the maximum allowed gradient norm to 0.5 as used above, ensuring that the updates remain within a reasonable and controlled range, and allow the model to have smoother, more stable learning.

# Gradient clipping is particularly beneficial in environments with variable or big rewards. For example, Atari environment like BipedalWalker-v3, where rewards heavily depend on prior exploration, and during RNN training where gradients may explode or vanish. In the dummy simulation used above, gradient clipping is being applied on a typical separate actor-critic neural network, and  effect in the  gradient norms before and after (clipping) can be clearly observed. This is routinely done with optimizers like Adam, or RMS, which allow for fast convergence with high numerical stability in various RL tasks.

If you are working in a team, provide a contribution summary.
| Team Member | Step# | Contribution (%) |
|---|---|---|
|  50604127 (sgupta67), 506010533 (karamchan) | Task 1 |  80%, 20% |
|  50604127 (sgupta67) | Task 2 |  100% |
|  506010533 (karamchan) | Task 3 | 100%  |
| 50604127 (sgupta67), 506010533 (karamchan)  | Task 4 | 20%, 80%  |
|   | **Total** | 100%  |
