# ABE tutorial 5
## Using continuous actions!

In this fifth tutorial let's allow our agents to take continuous actions. Rather than our agents choosing to move left or right, we'll allow it to move by -0.23 or +0.12. This will expand the types of actions/bodies/environments that we can construct and study!

Steps:
* Converting A2C to continuous action spaces
* Test out the continuous A2C algorithm in new environments


# Continuous A2C

Here we will look at how we can handle actions that are continuous. Up until now we've been relying on actions being e.g., ;eft, right, rather thane.g., motion adjusted by -0.23. The difference here is that there are not discrete actions to choose from, rather some amount of action in continuous space. This will be very usful as our agent bodies and their abilities to interact with their environment become more open ended. That is, we want to allow our agent to find many different ways to interact with the environment, and we don't want to constrain the agent to a few discrete actions. This has some costs, as it is much easier to learn how to use discrete actions, and is a way for us to help our agents to learn faster if there is some limit to their actions. We'll see that similar to discrete cases, we can add in some information/constraints in continuous action spaces to help our agent learn.

First let's see where we need to alter our A2C agent to allow for continuous action spaces.

### Neural network adjustments



Let's start with the actor network:

```python

self.actor = nn.Sequential(
            nn.Linear(np.prod(state_shape), hidden_size),
            nn.ReLU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, np.prod(action_shape))
        )

```

This model can remain the same, as it will output a continuous value for each action space. We'll use this value as the mean of a guassian distribution. 

We'll then have to add an additional parameter for the standard deviation of the guassian distribution.

```python
        self.actor_log_std = nn.Parameter(torch.zeros(np.prod(action_shape)), requires_grad=True)

```

Next in the forward pass, where our model is used to make predictions about which actions to take, we'll have to modify how those actions are chosen.

```python
        # Actor network outputs: mean and std deviation for Gaussian distribution
        action_mean = self.actor(base_features)
        action_log_std = self.actor_log_std.clamp(-20, 2)  # Clipping for numerical stability
        action_std = action_log_std.exp()  # Convert log std to std
```

Then when we return the action choice we keep both the mean and the std of the actions:

```python
return action_mean, action_std, state_value
```

### Policy adjustments


When taking actions we'll have to adjust how these actions are chosen:

```python
def forward(self, batch, state=None, **kwargs):

        #run the model and get the action means and the uncertainty (std)
        action_mean, action_std, _ = self.model(batch.obs)
        
        # Create Gaussian distribution for continuous actions
        dist = torch.distributions.Normal(action_mean, action_std)

        #sample actions from the Gaussian distribution
        action = dist.sample()
        
        # Clip actions to be within the environment’s action space: i.e., make sure the actions make sense / are possible
        action = action = torch.clamp(action, self.action_space.low, self.action_space.high)
        
        return Batch(act=action.cpu().numpy(), dist=dist)
```

For the learning section of the policy we need to calculate the log probability of each action in a slightly different way now that we have continuous actions.

```python
    def learn(self, batch, **kwargs):
        
        # Forward pass to get mean, std, and value
        action_mean, action_std, state_values = self.model(batch.obs)
        dist = torch.distributions.Normal(action_mean, action_std)
        
        # Compute log probabilities of the taken actions
        log_probs = dist.log_prob(batch.act).sum(dim=-1)

        #... the rest of the policy code stays the same

```

# Testing out our new A2C

Full model code

In [1]:
import gymnasium as gym
import torch
import torch.nn as nn
import numpy as np
from tianshou.env import DummyVectorEnv
from tianshou.data import Batch, ReplayBuffer, Collector
from tianshou.policy import BasePolicy
from torch.utils.tensorboard import SummaryWriter
import tianshou as ts


class ActorCriticNet(nn.Module):
    def __init__(self, state_shape, action_shape, hidden_size=128):
        super().__init__()

        self.actor = nn.Sequential(
            nn.Linear(np.prod(state_shape), hidden_size),
            nn.ReLU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, np.prod(action_shape))
        )

        # Separate layer for log std to allow independent learning
        self.actor_log_std = nn.Parameter(torch.zeros(np.prod(action_shape)), requires_grad=True)


        self.critic = nn.Sequential(
            nn.Linear(np.prod(state_shape), hidden_size),
            nn.ReLU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, 1)
        )
        
        
    def forward(self, obs, state=None, info={}):
        if isinstance(obs, np.ndarray):
            obs = torch.tensor(obs, dtype=torch.float32)

        # Actor network outputs: mean and std deviation for Gaussian distribution
        action_mean = self.actor(obs)
        action_log_std = self.actor_log_std.clamp(-20, 2)  # Clipping for numerical stability
        action_std = action_log_std.exp()  # Convert log std to std
        
        # Critic network output: state value
        state_value = self.critic(obs).squeeze(-1)
        
        return action_mean, action_std, state_value


class A2CPolicy(BasePolicy):
    def __init__(self, model, optim, action_space, gamma=0.99):
        super().__init__(action_space=action_space)
        self.model = model
        self.optim = optim
        self.gamma = gamma

    def forward(self, batch, state=None, **kwargs):
        action_mean, action_std, _ = self.model(batch.obs)
        
        # Create Gaussian distribution for continuous actions
        dist = torch.distributions.Normal(action_mean, action_std)

        # Sample an action from the guassian distribution
        action = dist.sample()
        
        # Convert action space bounds to PyTorch tensors
        action_min = torch.tensor(self.action_space.low, dtype=torch.float32, device=action.device)
        action_max = torch.tensor(self.action_space.high, dtype=torch.float32, device=action.device)
        
        # Clip actions to be within the environment’s action space
        action = torch.clamp(action, action_min, action_max)
        
        return Batch(act=action.cpu().numpy(), dist=dist)

    def learn(self, batch, **kwargs):
        
        # Forward pass to get mean, std, and value
        action_mean, action_std, state_values = self.model(batch.obs)
        dist = torch.distributions.Normal(action_mean, action_std)
        
        # Compute log probabilities of the taken actions
        log_probs = dist.log_prob(batch.act).sum(dim=-1)

        # Compute the critic's next state values (for TD target)
        with torch.no_grad():
            _, _, next_state_values = self.model(batch.obs_next)
            td_target = batch.rew + self.gamma * (1 - batch.done) * next_state_values
            
            # Calculate the normalized advantage (maybe show this only in tutorial 5?)
            advantage = td_target - state_values  # Advantage calculation
            advantage = (advantage - advantage.mean()) / (advantage.std() + 1e-8)


        # Calculate entropy for the policy distribution (maybe show this only in tutorial 5?)
        entropy = dist.entropy().mean()
        
        # Calculate policy (actor) loss (include entropy regularization)
        policy_loss = -(log_probs * advantage.detach()).mean() - 0.01 * entropy  # Adjust weight as needed
        
        # Calculate value (critic) loss
        value_loss = nn.functional.mse_loss(state_values, td_target)
        
        # Combine the losses
        loss = policy_loss + value_loss
        
        # Backpropagation
        self.optim.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=0.5)
        self.optim.step()

        return {"loss": loss.item(), "policy_loss": policy_loss.item(), "value_loss": value_loss.item()}

In [3]:
# Create a single environment instance to access the space information
single_env = gym.make("MountainCarContinuous-v0")
state_shape = single_env.observation_space.shape 
action_shape = single_env.action_space.shape #change n to shape
action_space = single_env.action_space


# Setting up the actor-critic network and A2C policy
net = ActorCriticNet(state_shape, action_shape)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-5, weight_decay=1e-4)
policy = A2CPolicy(model=net, optim=optimizer, action_space=action_space, gamma=0.99)


Let's now train our mountain car!

In [4]:

# Custom training loop
max_epoch = 10
step_per_epoch = 10000
keep_n_steps = 30
buffer = ReplayBuffer(size=keep_n_steps)

# Set up collectors
train_collector = Collector(policy, single_env, buffer)
test_collector = Collector(policy, single_env)

#start a logger
logger_a2c = ts.utils.TensorboardLogger(SummaryWriter('log/a2c_cont_custom'))

for epoch in range(max_epoch):
    train_collector.reset()
    for step in range(step_per_epoch):
        # Collect one transition and store it in the buffer
        #train_collector.collect(n_step=1)
        train_collector.collect(n_step=keep_n_steps)

        # Sample the most recent observations from the buffer
        #batch, _ = train_collector.buffer.sample(batch_size=30)
        batch = train_collector.buffer[-keep_n_steps:]

        # Manually convert each field to a torch tensor
        batch.obs = torch.tensor(batch.obs, dtype=torch.float32)
        batch.act = torch.tensor(batch.act, dtype=torch.long)
        batch.rew = torch.tensor(batch.rew, dtype=torch.float32)
        batch.done = torch.tensor(batch.done, dtype=torch.float32)
        batch.obs_next = torch.tensor(batch.obs_next, dtype=torch.float32)

        # Perform A2C learning
        policy.learn(batch)

    # Testing and evaluation
    result = test_collector.collect(n_episode=10, reset_before_collect=True)
    print(f'Epoch #{epoch + 1}: reward = {result.returns.mean()}, loss = {policy.learn(batch)["loss"]}')

    # Log the average reward for the epoch
    logger_a2c.writer.add_scalar("Reward/test_avg", result.returns.mean(), epoch)



Epoch #1: reward = -48.375916737774574, loss = 0.6401092410087585
Epoch #2: reward = -44.70324852508283, loss = -0.218825563788414
Epoch #3: reward = -30.13676958747182, loss = -0.18150193989276886
Epoch #4: reward = -38.4824174600884, loss = -0.321969211101532
Epoch #5: reward = -35.13535820313831, loss = -0.29562821984291077


KeyboardInterrupt: 

Did it learn? Do you see rewards increasing? 

If so let's save the model:

In [5]:
torch.save(net.state_dict(), "models/A2C_mountain_model.pth")

Let's test out the model, and watch what it learnt.

Load in the trained model.

In [6]:
# Initialize a new network with the same architecture
loaded_net = ActorCriticNet(state_shape, action_shape)
loaded_net.load_state_dict(torch.load("models/A2C_mountain_model.pth"))


  loaded_net.load_state_dict(torch.load("models/A2C_mountain_model.pth"))


<All keys matched successfully>

Let's create an environment and build a policy based on our saved model.

In [7]:
# Create the environment for evaluation with rendering enabled
eval_env = gym.make("MountainCarContinuous-v0", render_mode="human")

# Set the loaded network as the model for a new SARSA policy
loaded_policy = A2CPolicy(model=loaded_net, optim=optimizer, action_space=action_space, gamma=0.99)  # Set epsilon=0 for pure exploitation


Now let's run our agent in the environment. Note: you can change the number of episodes to watch!

In [8]:

# Set the number of episodes you want to watch
num_episodes = 1

for episode in range(num_episodes):
    obs, _ = eval_env.reset()
    done = False
    total_reward = 0
    
    print(f"Starting episode {episode + 1}")

    while not done:
        # Create a batch for the current observation
        obs_batch = Batch(obs=[obs])
        
        # Get action based on loaded model's Q-values (no exploration)
        action = loaded_policy.forward(obs_batch).act[0]
        
        # Step the environment with the selected action
        obs, reward, done, truncated, _ = eval_env.step(action)
        total_reward += reward

        # Check if the episode has ended
        if done or truncated:
            print(f"Episode {episode + 1} ended with total reward: {total_reward}")
            break  # Break out of the loop to start the next episode


# Close the environment after finishing all episodes
eval_env.close()

Starting episode 1
Episode 1 ended with total reward: -37.1962358648925
Starting episode 2


KeyboardInterrupt: 

**Things to try**

Try changing the environment or changing the hyperparameters:

* **Learning rate** (how fast to learn from new data): too high and the agent might learn sprious correlations between actions and outcomes, too low and it might take the agent for ever to figure what actions lead to good rewards.

* **Discount factor** or **gamma** (how much does the agent value future vs. near rewards): too high and the agent might miss near rewards, too low and the agent might be too focused on the short term and miss longer term outcomes.

Try altering some of these hyperparameters and see how that changes the ability of your agent to learn! Which hyperparameters work best?



Now that actions can be continuous this opens up the possibility of more complex agent bodies. Let's try to train a elaborate body to learn how to walk!

You'll have to install mujoco by typing the following into your terminal (remember this will install packages into your virtual environment so make sure it's still active):

```
pip install mujoco
pip install "gymnasium[mujoco]"
```

We can run the code below to make sure all is working well!
It should create the "Hopper" environment and print out the observation space, and action spaces. 
You can read up on the Hopper environment, and see the actions the agent can take, the observations it can see, and the rewards it can acheive. You'll note they are continuous actions!
* https://gymnasium.farama.org/environments/mujoco/hopper/

In [9]:
import gymnasium as gym
env = gym.make("Ant-v4")  
print(env.observation_space)
print(env.action_space)

RuntimeError: Could not find supported GCC executable.

HINT: On OS X, install GCC 9.x with `brew install gcc@9`. or `port install gcc9`.

Let's then use our training code to train a Hopper!

In [10]:
# Create a single environment instance to access the space information
single_env = gym.make("Ant-v4")
state_shape = single_env.observation_space.shape 
action_shape = single_env.action_space.shape #change n to shape
action_space = single_env.action_space


# Setting up the actor-critic network and A2C policy
net = ActorCriticNet(state_shape, action_shape)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4, weight_decay=1e-4)
policy = A2CPolicy(model=net, optim=optimizer, action_space=action_space, gamma=0.99)



RuntimeError: Could not find supported GCC executable.

HINT: On OS X, install GCC 9.x with `brew install gcc@9`. or `port install gcc9`.

Let's now train our Ant!

Note: this training will take some time. You should be starting to see training taking more and more time as we build our agents towards a modern RL-agent. We'll start to talk more about the kinds of hardware you can access (i.e., free GPU on google colab), and how hardware becomes ever more important as we go along.

In [None]:

# Custom training loop
max_epoch = 10
step_per_epoch = 1000
keep_n_steps = 200
buffer = ReplayBuffer(size=keep_n_steps)

# Set up collectors
train_collector = Collector(policy, single_env, buffer)
test_collector = Collector(policy, single_env)

#start a logger
logger_a2c = ts.utils.TensorboardLogger(SummaryWriter('log/a2c_ant'))

for epoch in range(max_epoch):
    train_collector.reset()
    for step in range(step_per_epoch):
        # Collect one transition and store it in the buffer
        train_collector.collect(n_step=keep_n_steps)

        # Sample the most recent observations from the buffer
        batch = train_collector.buffer[-keep_n_steps:]

        # Manually convert each field to a torch tensor
        batch.obs = torch.tensor(batch.obs, dtype=torch.float32)
        batch.act = torch.tensor(batch.act, dtype=torch.long)
        batch.rew = torch.tensor(batch.rew, dtype=torch.float32)
        batch.done = torch.tensor(batch.done, dtype=torch.float32)
        batch.obs_next = torch.tensor(batch.obs_next, dtype=torch.float32)

        # Normalize rewards in the collected batch
        batch.rew = (batch.rew - batch.rew.mean()) / (batch.rew.std() + 1e-8)

        # Perform A2C learning
        policy.learn(batch)

    # Testing and evaluation
    result = test_collector.collect(n_episode=10, reset_before_collect=True)
    print(f'Epoch #{epoch + 1}: reward = {result.returns.mean()}, loss = {policy.learn(batch)["loss"]}')

    # Log the average reward for the epoch
    logger_a2c.writer.add_scalar("Reward/test_avg", result.returns.mean(), epoch)



Epoch #1: reward = -162.7210415896879, loss = 2.000091314315796
Epoch #2: reward = -128.76497461739456, loss = 2.927469491958618
Epoch #3: reward = -39.84995910090584, loss = 3.4323105812072754
Epoch #4: reward = -115.4043885090005, loss = 4.375180244445801
Epoch #5: reward = -80.79453522350789, loss = 4.126519203186035
Epoch #6: reward = -152.55011291506122, loss = 3.291814088821411
Epoch #7: reward = -81.24397693740792, loss = 5.396844387054443
Epoch #8: reward = -63.7591292514662, loss = 7.346578598022461
Epoch #9: reward = -35.531873256604804, loss = 8.251445770263672
Epoch #10: reward = 20.11763495070202, loss = 2.450540542602539


Did it learn? Do you see rewards increasing? 

If so let's save the model:

In [26]:
torch.save(net.state_dict(), "models/A2C_ant_model.pth")

Let's test out the model, and watch what it learnt.

Load in the trained model.

In [8]:
# Initialize a new network with the same architecture
loaded_net = ActorCriticNet(state_shape, action_shape)
loaded_net.load_state_dict(torch.load("models/A2C_ant_model.pth"))

NameError: name 'state_shape' is not defined

Let's create an environment and build a policy based on our saved model.

In [51]:
# Create the environment for evaluation with rendering enabled
eval_env = gym.make("Ant-v4", render_mode="rgb_array")

# Set the loaded network as the model for a new SARSA policy
loaded_policy = A2CPolicy(model=loaded_net, optim=optimizer, action_space=action_space, gamma=0.99)  # Set epsilon=0 for pure exploitation


Now let's run our ant agent in the environment. Note: you can change the number of episodes to watch!

In [2]:
import imageio

# Set the number of episodes you want to watch
num_episodes = 1

frames = []

for episode in range(num_episodes):
    obs, _ = eval_env.reset()
    done = False
    total_reward = 0
    
    print(f"Starting episode {episode + 1}")

    while not done:
        # Create a batch for the current observation
        obs_batch = Batch(obs=[obs])
        
        # Get action based on loaded model's Q-values (no exploration)
        action = loaded_policy.forward(obs_batch).act[0]
        
        # Step the environment with the selected action
        obs, reward, done, truncated, _ = eval_env.step(action)
        total_reward += reward

        # Check if the episode has ended
        if done or truncated:
            print(f"Episode {episode + 1} ended with total reward: {total_reward}")
            break  # Break out of the loop to start the next episode

        #record what's going on
        frames.append(eval_env.render())


# Close the environment after finishing all episodes
eval_env.close()

imageio.mimsave("ant_v4_simulation.mp4", frames, fps=30)

NameError: name 'eval_env' is not defined

In [1]:
frames

NameError: name 'frames' is not defined