# ABE tutorial 4
## Using functional approximation

In this fourth tutorial let's dive deeper into how we use neural network models with RL. Now that we understand the concept of functional approximation, we can see how these neural network models are operating. We'll spend some time on some of the options when it comes to building neural networks for functional approximation, and go over some computational hurdles. 

Steps:
* Go over the building of a neural network
* Normalization
* Converting A2C to continuous action spaces
* Test out the continuous A2C algorithm in new environments


# Building a Neural Network

We'll use PyTorch as our Python package to build neural networks. We've seen these networks in our earlier tutorials, but here let's dive into the details a little more.

The first thing to note is that we are using a sequential approach to building our neural networks. In this approach we just need to specify a network by providing an ordered list of layers. Let's take a look at how to do this below, by building a simple three layer network:

* **Input layer**: this is the layer where the data comes into the model. Let's assume there are 4 input variables.

* **First hidden layer**: this is a layer of nodes that is connected to the input layer; it transforms the input data and passes the transformed values on. Let's assume this hidden layer has 32 nodes. The code below also includes a second hidden layer of 32 nodes between this one and the output layer.

* **Output layer**: this output layer will take the transformed values and output values that can be used to inform what actions can be taken. Let's assume there are two actions that can be taken.

You should see three `nn.Linear` layers below, and how each layer's shape corresponds to the data: e.g., the 4 input values get passed to the 32 nodes in the first hidden layer, those nodes pass their transformed values to a second set of 32 nodes, and those in turn pass their values to the 2 actions in the output layer.



In [None]:
import torch
import torch.nn as nn

# Build a simple network from three linear layers
my_net = nn.Sequential(
            nn.Linear(4, 32),
            nn.Linear(32, 32),
            nn.Linear(32, 2)
        )

print(my_net)

To see a little more about the network we can install torchsummary. Just make sure you are still in your virtual environment and run the following.

```
pip install torchsummary
```

Then you should be able to run

In [None]:
from torchsummary import summary

# Specify the input shape as a tuple
input_shape = (1, 4)

# Print the summary
summary(my_net, input_shape)

We can see in the summary above that the model has 1282 parameters! These are all the weights (one for each connection between nodes) and the biases (one for each node in the hidden and output layers).
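The 1282 figure can be checked by hand: an `nn.Linear(n_in, n_out)` layer holds `n_in * n_out` weights plus `n_out` biases. Here is a minimal sketch of that check; the helper name `count_parameters` is our own and not part of PyTorch.

```python
import torch.nn as nn

# Same architecture as my_net above
net = nn.Sequential(
    nn.Linear(4, 32),
    nn.Linear(32, 32),
    nn.Linear(32, 2),
)


def count_parameters(model):
    """Hypothetical helper: total number of trainable values in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Weights + biases for each layer, computed by hand
by_hand = (4 * 32 + 32) + (32 * 32 + 32) + (32 * 2 + 2)

print(count_parameters(net), by_hand)
```

Both numbers should agree with the total reported by `summary`.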

In the book, we saw that these weights and biases on their own are really just linear equations applied to some inputs... and that to capture non-linear relationships we had to introduce activation functions. Let's do that now!

In [None]:
# Build the same network, now with ReLU activations after the hidden layers
my_net_2 = nn.Sequential(
            nn.Linear(4, 32),
            nn.ReLU(),
            nn.Linear(32, 32),
            nn.ReLU(),
            nn.Linear(32, 2)
        )

summary(my_net_2, input_shape)

By placing a ReLU activation after each hidden layer we set any negative node outputs to zero, while positive outputs pass through unchanged. This cutoff is what lets a neural network model non-linear relationships.

You'll notice that the model has the same number of weight and bias parameters. This is because the activation function is really just a filter and requires no new parameters.
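To make this concrete, here is a small sketch (the example values are made up) showing three things: ReLU zeroes out negative values, it has no parameters of its own, and two linear layers stacked without an activation in between behave exactly like one single linear layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

relu = nn.ReLU()
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
y = relu(x)
print(y)  # negative inputs become 0, positive inputs pass through

# ReLU has nothing to learn, so it contributes no parameters
n_relu_params = sum(p.numel() for p in relu.parameters())
print(n_relu_params)

# Two linear layers with no activation between them collapse into one linear map
lin1, lin2 = nn.Linear(4, 32), nn.Linear(32, 2)
x_in = torch.randn(5, 4)
with torch.no_grad():
    stacked = lin2(lin1(x_in))
    combined_weight = lin2.weight @ lin1.weight            # shape (2, 4)
    combined_bias = lin2.weight @ lin1.bias + lin2.bias    # shape (2,)
    collapsed = x_in @ combined_weight.T + combined_bias
same_function = torch.allclose(stacked, collapsed, atol=1e-5)
print(same_function)
```

The last check is why the activation matters: without it, adding more linear layers adds parameters but not expressive power.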

You'll notice too that there is no activation function applied after the output layer. This is because we want the output layer to output a continuous value and we want to keep negative values as an option. We'll see that for the output layer we have to think more about what kinds of outputs we want (continuous numeric, restricted to be between 0 and 1, etc.) and that will determine how we build this last layer. Internally, however, with the hidden layers we will generally use ReLU activation functions.
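As a rough illustration of those output choices, the sketch below applies three common PyTorch functions to the same raw output values. The raw values are made up, and which function is appropriate depends on the task.

```python
import torch

# Example raw values coming out of a 2-node output layer
raw = torch.tensor([[2.0, -1.0]])

probs = torch.softmax(raw, dim=-1)  # non-negative and sums to 1: action probabilities
squashed = torch.sigmoid(raw)       # each value independently squeezed into (0, 1)
bounded = torch.tanh(raw)           # each value independently squeezed into (-1, 1)

print(raw)       # left as-is: unrestricted continuous values
print(probs)
print(squashed)
print(bounded)
```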

# Some useful layers

There are many kinds of layers we can build into our networks, and we'll learn a few as we build our agents. Generally, these layers solve some problem for us, or allow our agents to experience the world in a different way.

In this tutorial we'll learn about the normalization layer (`nn.LayerNorm`). This layer solves a problem for us. As our agents continuously learn and adjust the weights/biases in their neural networks, the values passed from one layer to the next can become very large, which makes subsequent learning harder. To keep these values in a manageable range, and to let our agents stay flexible while learning, the normalization layer rescales the outputs of a layer, separately for each input, so that they have a mean of 0 and a standard deviation of 1. Larger values stay larger than smaller ones, but their overall magnitudes are reduced. This solves, or helps to solve, the computational issues that come with very large values flowing through the network.

Let's see how to add that into our network. With the sequential approach we just have to stack the new layers in like lego blocks.

In [None]:
# Build the same network, now with LayerNorm after each hidden layer
my_net_3 = nn.Sequential(
            nn.Linear(4, 32),
            nn.ReLU(),
            nn.LayerNorm(32),
            nn.Linear(32, 32),
            nn.ReLU(),
            nn.LayerNorm(32),
            nn.Linear(32, 2)
        )

summary(my_net_3, input_shape)


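We can see that we have more parameters in this normalized model: the total rises from 1282 to 1410. Each `LayerNorm(32)` layer adds 64 learnable parameters (a scale and a shift for each of its 32 values), which are used to rescale the layer's outputs after they have been normalized.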

Again we don't add the LayerNorm to the output layer, as the magnitudes of the outputs are meaningful and we want to keep them.
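To see what `nn.LayerNorm` does to the values passing through it, here is a small sketch with made-up inputs. It normalizes each sample across its 32 features and carries one learnable scale and one learnable shift per feature.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer_norm = nn.LayerNorm(32)

# A batch of 3 samples with 32 features each, deliberately large and off-centre
activations = torch.randn(3, 32) * 50.0 + 10.0

with torch.no_grad():
    normalized = layer_norm(activations)

print(activations.mean(dim=-1), activations.std(dim=-1))  # large values going in
print(normalized.mean(dim=-1))                  # close to 0 for every sample
print(normalized.std(dim=-1, unbiased=False))   # close to 1 for every sample

# 32 scale values + 32 shift values
n_params = sum(p.numel() for p in layer_norm.parameters())
print(n_params)
```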

# Learning in Neural Networks

We've seen how we can build neural networks using a Lego-like approach and different kinds of layers. Let's see now how we can update the weights/biases of these layers so that the network can learn. To do this, let's:

* Simulate some data to use as input
* Measure how far the network predictions are from the "right" answer
* Adjust the weights and biases to make better predictions
* Do this many times, until the network is making good predictions!
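The steps above can be sketched as follows. This is only one possible version, under our own assumptions: the data-generating rule, the mean squared error loss, the Adam optimizer, the learning rate and the number of steps are all illustrative choices rather than anything fixed by the tutorial.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 1. Simulate some data: 4 inputs and 2 targets from a made-up non-linear rule
inputs = torch.randn(256, 4)
targets = torch.stack(
    [torch.sin(inputs[:, 0]) + inputs[:, 1], inputs[:, 2] * inputs[:, 3]],
    dim=1,
)

net = nn.Sequential(
    nn.Linear(4, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 2),
)
loss_fn = nn.MSELoss()  # 2. measures how far predictions are from the "right" answer
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)

losses = []
for step in range(200):  # 4. repeat many times
    predictions = net(inputs)
    loss = loss_fn(predictions, targets)

    # 3. Adjust the weights and biases to reduce the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

The loss should drop over the course of training, which is the network "learning" the made-up rule.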

In [None]:
# Simulate some data



# Continuous A2C

Here we will look at how we can handle actions that are continuous. Up until now we've been relying on discrete actions, e.g., left or right, rather than continuous ones, e.g., motion adjusted by -0.23. The difference here is that there is no fixed set of actions to choose from; instead the agent chooses some amount of action in a continuous space. This will be very useful as our agent bodies, and their abilities to interact with their environment, become more open-ended. That is, we want to allow our agent to find many different ways to interact with the environment, and we don't want to constrain the agent to a few discrete actions. This has a cost: discrete actions are much easier to learn, and limiting the available actions is one way for us to help our agents learn faster. We'll see that, as in the discrete case, we can add some information/constraints to continuous action spaces to help our agent learn.

First let's see where we need to alter our A2C agent to allow for continuous action spaces.

### Neural network adjustments



Let's start with the actor network:

```python

self.actor = nn.Sequential(
            nn.Linear(np.prod(state_shape), hidden_size),
            nn.ReLU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, np.prod(action_shape))
        )

```

This model can remain the same, as it will output a continuous value for each action dimension. We'll use this value as the mean of a Gaussian distribution.

We'll then have to add an additional parameter for the standard deviation of the Gaussian distribution. We store it as a log standard deviation, one value per action dimension.

```python
        self.actor_log_std = nn.Parameter(torch.zeros(np.prod(action_shape)), requires_grad=True)

```

Next in the forward pass, where our model is used to make predictions about which actions to take, we'll have to modify how those actions are chosen.

```python
        # Actor network outputs: mean and std deviation for Gaussian distribution
        action_mean = self.actor(obs)
        action_log_std = self.actor_log_std.clamp(-20, 2)  # Clipping for numerical stability
        action_std = action_log_std.exp()  # Convert log std to std
```

Then when we return the action choice we keep both the mean and the std of the actions:

```python
return action_mean, action_std, state_value
```
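Here is a small standalone sketch of why the standard deviation is stored as a log value and then clamped. The means and log standard deviations are made-up stand-ins for what the actor network and `actor_log_std` would provide.

```python
import torch

torch.manual_seed(0)

# Pretend the actor produced these means for one state with 2 action dimensions
action_mean = torch.tensor([[0.3, -0.1]])

# Unconstrained log standard deviations, one per action dimension
log_std = torch.zeros(2)

action_std = log_std.clamp(-20, 2).exp()  # exp(0) = 1 for both dimensions
dist = torch.distributions.Normal(action_mean, action_std)
action = dist.sample()
print(action_std, action)

# However extreme the raw parameter gets, the resulting std stays positive and bounded
extreme_std = torch.tensor([-100.0, 100.0]).clamp(-20, 2).exp()
print(extreme_std)
```

Because the optimizer can push the raw parameter to any value, exponentiating it guarantees a valid (positive) standard deviation, and the clamp keeps it from collapsing to zero or exploding.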

### Policy adjustments


When taking actions we'll have to adjust how these actions are chosen:

```python
def forward(self, batch, state=None, **kwargs):

        #run the model and get the action means and the uncertainty (std)
        action_mean, action_std, _ = self.model(batch.obs)
        
        # Create Gaussian distribution for continuous actions
        dist = torch.distributions.Normal(action_mean, action_std)

        #sample actions from the Gaussian distribution
        action = dist.sample()
        
        # Clip actions to be within the environment’s action space: i.e., make sure the actions make sense / are possible
        action = torch.clamp(action, self.action_space.low[0], self.action_space.high[0])
        
        return Batch(act=action.cpu().numpy(), dist=dist)
```
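The clipping above uses the first entry of `low` and `high` for every action dimension, which works when all dimensions share the same bounds. If an environment had different bounds per dimension, one possible variant is to clip against the full bound vectors. The bounds below are hypothetical; in Gymnasium they would come from `action_space.low` and `action_space.high`.

```python
import numpy as np
import torch

# Hypothetical per-dimension bounds for a 2-dimensional action space
low = torch.as_tensor(np.array([-1.0, -0.5], dtype=np.float32))
high = torch.as_tensor(np.array([1.0, 0.5], dtype=np.float32))

# Two sampled actions: the first is outside the bounds, the second is inside
action = torch.tensor([[2.0, -3.0], [0.2, 0.1]])

# Element-wise clip: each column is limited by its own low/high value
clipped = torch.max(torch.min(action, high), low)
print(clipped)
```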

For the learning section of the policy we need to calculate the log probability of each action in a slightly different way now that we have continuous actions.

```python
    def learn(self, batch, **kwargs):
        
        # Forward pass to get mean, std, and value
        action_mean, action_std, state_values = self.model(batch.obs)
        dist = torch.distributions.Normal(action_mean, action_std)
        
        # Compute log probabilities of the taken actions
        log_probs = dist.log_prob(batch.act).sum(dim=-1)

        #... the rest of the policy code stays the same

```
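To see what `log_prob(...).sum(dim=-1)` computes, here is a small check with made-up numbers. Each action dimension is treated as an independent Gaussian, so the log probability of the whole action is the sum of the per-dimension log probabilities.

```python
import math
import torch

mean = torch.tensor([[0.0, 1.0]])
std = torch.tensor([[1.0, 0.5]])
action = torch.tensor([[0.5, 0.5]])

dist = torch.distributions.Normal(mean, std)
per_dim = dist.log_prob(action)  # shape (1, 2): one value per action dimension
joint = per_dim.sum(dim=-1)      # shape (1,): one value per action

# The same quantity by hand, from the Gaussian log density:
# log N(a | mu, sigma) = -(a - mu)^2 / (2 sigma^2) - log(sigma) - 0.5 * log(2 pi)
by_hand = (
    -((action - mean) ** 2) / (2 * std ** 2)
    - torch.log(std)
    - 0.5 * math.log(2 * math.pi)
).sum(dim=-1)

print(joint, by_hand)
```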

# Testing out our new A2C

Full model code

In [None]:
import gymnasium as gym
import torch
import torch.nn as nn
import numpy as np
from tianshou.env import DummyVectorEnv
from tianshou.data import Batch, ReplayBuffer, Collector
from tianshou.policy import BasePolicy
from torch.utils.tensorboard import SummaryWriter
import tianshou as ts


class ActorCriticNet(nn.Module):
    def __init__(self, state_shape, action_shape, hidden_size=128):
        super().__init__()

        self.actor = nn.Sequential(
            nn.Linear(np.prod(state_shape), hidden_size),
            nn.ReLU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, np.prod(action_shape))
        )

        # Separate layer for log std to allow independent learning
        self.actor_log_std = nn.Parameter(torch.zeros(np.prod(action_shape)), requires_grad=True)


        self.critic = nn.Sequential(
            nn.Linear(np.prod(state_shape), hidden_size),
            nn.ReLU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, 1)
        )
        
        
    def forward(self, obs, state=None, info={}):
        if isinstance(obs, np.ndarray):
            obs = torch.tensor(obs, dtype=torch.float32)

        # Actor network outputs: mean and std deviation for Gaussian distribution
        action_mean = self.actor(obs)
        action_log_std = self.actor_log_std.clamp(-20, 2)  # Clipping for numerical stability
        action_std = action_log_std.exp()  # Convert log std to std
        
        # Critic network output: state value
        state_value = self.critic(obs).squeeze(-1)
        
        return action_mean, action_std, state_value


class A2CPolicy(BasePolicy):
    def __init__(self, model, optim, action_space, gamma=0.99):
        super().__init__(action_space=action_space)
        self.model = model
        self.optim = optim
        self.gamma = gamma

    def forward(self, batch, state=None, **kwargs):
        action_mean, action_std, _ = self.model(batch.obs)
        
        # Create Gaussian distribution for continuous actions
        dist = torch.distributions.Normal(action_mean, action_std)

        # Sample an action from the Gaussian distribution
        action = dist.sample()
        
        # Clip actions to be within the environment’s action space: i.e., make sure the action is possible
        action = torch.clamp(action, self.action_space.low[0], self.action_space.high[0])
        
        return Batch(act=action.cpu().numpy(), dist=dist)

    def learn(self, batch, **kwargs):
        
        # Forward pass to get mean, std, and value
        action_mean, action_std, state_values = self.model(batch.obs)
        dist = torch.distributions.Normal(action_mean, action_std)
        
        # Compute log probabilities of the taken actions
        log_probs = dist.log_prob(batch.act).sum(dim=-1)

        # Compute the critic's next state values (for TD target)
        with torch.no_grad():
            _, _, next_state_values = self.model(batch.obs_next)
            td_target = batch.rew + self.gamma * (1 - batch.done) * next_state_values
            
            # Calculate the normalized advantage
            advantage = td_target - state_values  # Advantage calculation
            advantage = (advantage - advantage.mean()) / (advantage.std() + 1e-8)


        # Calculate entropy for the policy distribution
        #entropy = dist.entropy().mean()
        
        # Calculate policy (actor) loss (the entropy regularization term is currently commented out)
        policy_loss = -(log_probs * advantage.detach()).mean() #- 0.01 * entropy  # Adjust weight as needed
        
        # Calculate value (critic) loss
        value_loss = nn.functional.mse_loss(state_values, td_target)
        
        # Combine the losses
        loss = policy_loss + value_loss
        
        # Backpropagation
        self.optim.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
        self.optim.step()

        return {"loss": loss.item(), "policy_loss": policy_loss.item(), "value_loss": value_loss.item()}
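To make the TD target and advantage lines in `learn` easier to follow, here is a tiny worked example with made-up numbers for three transitions. Note how the `(1 - done)` mask removes the next-state value for the transition that ended an episode.

```python
import torch

gamma = 0.99
rew = torch.tensor([1.0, 0.0, 5.0])
done = torch.tensor([0.0, 0.0, 1.0])            # the third transition ends the episode
state_values = torch.tensor([2.0, 1.0, 4.0])     # critic's estimate for each state
next_state_values = torch.tensor([1.0, 4.0, 3.0])  # critic's estimate for each next state

# Reward now, plus the discounted value of what comes next (nothing after a terminal step)
td_target = rew + gamma * (1 - done) * next_state_values

# How much better the outcome was than the critic expected
advantage = td_target - state_values

# Normalizing keeps the size of the policy update stable across batches
normalized = (advantage - advantage.mean()) / (advantage.std() + 1e-8)

print(td_target)
print(advantage)
print(normalized)
```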

In [None]:
# Create a single environment instance to access the space information
single_env = gym.make("MountainCarContinuous-v0")
state_shape = single_env.observation_space.shape 
action_shape = single_env.action_space.shape  # continuous action spaces expose .shape rather than .n
action_space = single_env.action_space


# Setting up the actor-critic network and A2C policy
net = ActorCriticNet(state_shape, action_shape)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-5, weight_decay=1e-4)
policy = A2CPolicy(model=net, optim=optimizer, action_space=action_space, gamma=0.99)




In [None]:

# Custom training loop
max_epoch = 10
step_per_epoch = 10000
keep_n_steps = 30
buffer = ReplayBuffer(size=keep_n_steps)

# Set up collectors
train_collector = Collector(policy, single_env, buffer)
test_collector = Collector(policy, single_env)

#start a logger
logger_a2c = ts.utils.TensorboardLogger(SummaryWriter('log/a2c_cont_custom'))

for epoch in range(max_epoch):
    train_collector.reset()
    for step in range(step_per_epoch):
        # Collect keep_n_steps transitions and store them in the buffer
        #train_collector.collect(n_step=1)
        train_collector.collect(n_step=keep_n_steps)

        # Sample the most recent observations from the buffer
        #batch, _ = train_collector.buffer.sample(batch_size=30)
        batch = train_collector.buffer[-keep_n_steps:]

        # Manually convert each field to a torch tensor
        batch.obs = torch.tensor(batch.obs, dtype=torch.float32)
        batch.act = torch.tensor(batch.act, dtype=torch.float32)  # continuous actions must stay floats
        batch.rew = torch.tensor(batch.rew, dtype=torch.float32)
        batch.done = torch.tensor(batch.done, dtype=torch.float32)
        batch.obs_next = torch.tensor(batch.obs_next, dtype=torch.float32)

        # Perform A2C learning and keep the loss information for reporting
        loss_info = policy.learn(batch)

    # Testing and evaluation
    result = test_collector.collect(n_episode=10, reset_before_collect=True)
    print(f'Epoch #{epoch + 1}: reward = {result.returns.mean()}, loss = {loss_info["loss"]}')

    # Log the average reward for the epoch
    logger_a2c.writer.add_scalar("Reward/test_avg", result.returns.mean(), epoch)

Did it learn? Do you see rewards increasing? 

If so, let's save the model (make sure a `models` folder exists first):

In [None]:
torch.save(net.state_dict(), "models/A2C_mountain_model.pth")

Let's test out the model, and watch what it learnt.

Load in the trained model.

In [None]:
# Initialize a new network with the same architecture
loaded_net = ActorCriticNet(state_shape, action_shape)
loaded_net.load_state_dict(torch.load("models/A2C_mountain_model.pth"))


Let's create an environment and build a policy based on our saved model.

In [None]:
# Create the environment for evaluation with rendering enabled
eval_env = gym.make("MountainCarContinuous-v0", render_mode="human")

# Set the loaded network as the model for a new A2C policy
loaded_policy = A2CPolicy(model=loaded_net, optim=optimizer, action_space=action_space, gamma=0.99)  # the optimizer is not used while evaluating


Now let's run our agent in the environment. Note: you can change the number of episodes to watch!

In [None]:

# Set the number of episodes you want to watch
num_episodes = 10

for episode in range(num_episodes):
    obs, _ = eval_env.reset()
    done = False
    total_reward = 0
    
    print(f"Starting episode {episode + 1}")

    while not done:
        # Create a batch for the current observation
        obs_batch = Batch(obs=[obs])
        
        # Sample an action from the loaded policy's Gaussian distribution
        action = loaded_policy.forward(obs_batch).act[0]
        
        # Step the environment with the selected action
        obs, reward, done, truncated, _ = eval_env.step(action)
        total_reward += reward

        # Check if the episode has ended
        if done or truncated:
            print(f"Episode {episode + 1} ended with total reward: {total_reward}")
            break  # Break out of the loop to start the next episode


# Close the environment after finishing all episodes
eval_env.close()

**Things to try**

Try changing the environment or changing the hyperparameters:

* **Learning rate** (how fast to learn from new data): too high and the agent might learn spurious correlations between actions and outcomes, too low and it might take the agent forever to figure out what actions lead to good rewards.

* **Discount factor** or **gamma** (how much does the agent value future vs. near rewards): too high and the agent might miss near rewards, too low and the agent might be too focused on the short term and miss longer term outcomes.

Try altering some of these hyperparameters and see how that changes the ability of your agent to learn! Which hyperparameters work best?
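To get a feel for the discount factor before experimenting, the sketch below shows how much a reward 100 steps in the future is worth today for a few values of gamma. The helper `discounted_weight` is our own, and `1 / (1 - gamma)` is only a common rule of thumb for how far ahead the agent effectively looks.

```python
def discounted_weight(gamma, steps_ahead):
    """Hypothetical helper: present weight of a reward received steps_ahead steps from now."""
    return gamma ** steps_ahead


for gamma in (0.9, 0.99, 0.999):
    weight = discounted_weight(gamma, 100)
    horizon = 1 / (1 - gamma)  # rule-of-thumb planning horizon, in steps
    print(f"gamma={gamma}: weight of a reward 100 steps away = {weight:.5f}, horizon ~ {horizon:.0f} steps")
```

In an environment like MountainCarContinuous, where the main reward only arrives when the car reaches the goal, a low gamma can make that reward nearly invisible from the start of an episode.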



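Now that actions can be continuous, this opens up the possibility of more complex agent bodies. Let's try MuJoCo! These environments need the MuJoCo extras, which can usually be installed with `pip install "gymnasium[mujoco]"`.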

In [None]:
import gymnasium as gym

In [None]:
# Create a single environment instance to access the space information
single_env = gym.make("HalfCheetah-v4")
state_shape = single_env.observation_space.shape 
action_shape = single_env.action_space.shape  # continuous action spaces expose .shape rather than .n
action_space = single_env.action_space


# Setting up the actor-critic network and A2C policy
net = ActorCriticNet(state_shape, action_shape)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-5, weight_decay=1e-4)
policy = A2CPolicy(model=net, optim=optimizer, action_space=action_space, gamma=0.99)

