# ABE Tutorial 5  
## Using Continuous Actions

In this tutorial, we extend our previous Advantage Actor-Critic (A2C) implementation to handle continuous action spaces. Instead of selecting from a set of discrete actions (e.g., left or right), our agent will choose a value from a continuous range (e.g., a movement of $-0.23$ or $+0.12$). This flexibility allows us to model more realistic and complex environments.

**Tutorial Outline:**
- Converting A2C to work with continuous action spaces
- Testing the continuous A2C algorithm in new environments

***
## **Walkthrough**: Continuous A2C

In previous tutorials, the agent’s actions were limited to a set of discrete choices. In many real-world scenarios (for example, controlling a robot or a vehicle), it is more natural to work with continuous actions.

Here, we explain how to modify the actor network and policy so that:
- The actor outputs parameters (mean and standard deviation) of a Gaussian distribution.
- Actions are sampled from this distribution and then clipped to remain within valid bounds.

### 1. Neural Network Adjustments



Let's start with the actor network:

```python

import numpy as np
import torch
import torch.nn as nn


class ActorNetwork(nn.Module):
    """
    Actor network for continuous actions.

    This network processes the input state and outputs the mean values
    for each action dimension. A separate parameter is used to learn the
    log standard deviation.
    """
    def __init__(self, state_shape, action_shape, hidden_size=128):
        super().__init__()
        input_dim = int(np.prod(state_shape))
        output_dim = int(np.prod(action_shape))

        # Define a sequential network to output the action means.
        self.actor = nn.Sequential(
            nn.Linear(input_dim, hidden_size),
            nn.ReLU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, output_dim)
        )

```

This part of the model can remain the same: it outputs one continuous value per action dimension. We'll use this value as the mean of a Gaussian distribution.

We'll then have to add an additional parameter for the standard deviation of the Gaussian distribution.

```python
        # Define a learnable parameter for log standard deviation.
        # Clipping is performed later in the forward pass for numerical stability.
        self.actor_log_std = nn.Parameter(torch.zeros(output_dim), requires_grad=True)
```

Next in the forward pass, where our model is used to make predictions about which actions to take, we'll have to modify how those actions are chosen.

```python
    def forward(self, state):
        """
        Perform a forward pass through the actor network.

        Parameters:
            state (torch.Tensor): The input state.

        Returns:
            action_mean (torch.Tensor): Mean of the Gaussian distribution for each action.
            action_std (torch.Tensor): Standard deviation for each action.
        """
        # Compute the mean values from the actor network.
        action_mean = self.actor(state)
        # Clamp log_std for numerical stability, then exponentiate.
        action_log_std = self.actor_log_std.clamp(-20, 2)
        action_std = action_log_std.exp()
```

Then when we return the action choice we keep both the mean and the std of the actions:

```python
        return action_mean, action_std
```
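
To get a feel for what the clamp bounds of $-20$ and $2$ mean in practice, here is a minimal standard-library sketch of the same "clamp, then exponentiate" step. The helper name `std_from_log_std` is hypothetical and only mirrors the logic of the forward pass above; it is not part of the tutorial's network.

```python
import math

def std_from_log_std(log_std, low=-20.0, high=2.0):
    """Clamp a log standard deviation to [low, high], then exponentiate."""
    clamped = max(low, min(high, log_std))
    return math.exp(clamped)

# A zero-initialised log-std corresponds to a standard deviation of 1.
print(std_from_log_std(0.0))     # 1.0
# Extreme values are clamped first, so the result stays finite and positive.
print(std_from_log_std(-100.0))  # exp(-20), roughly 2.06e-09
print(std_from_log_std(50.0))    # exp(2), roughly 7.389
```

Learning the log of the standard deviation means any real-valued parameter maps to a positive standard deviation, and the clamp keeps that value away from zero and from very large numbers.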

### 2. Policy Adjustments

When selecting actions, the policy now builds a Gaussian from the network's outputs, samples from it, and clips the sample to the environment's bounds:

```python
import torch
from torch.distributions import Normal
from tianshou.data import Batch

def policy_forward(model, batch, action_space):
    """
    Forward pass for the policy.

    Parameters:
        model (nn.Module): The actor-critic network.
        batch (Batch): A batch of data from the environment.
        action_space (gym.Space): The action space to clip actions.

    Returns:
        Batch: Contains the sampled actions and the distribution.
    """
    # Forward pass through the model to obtain mean and std for the Gaussian.
    action_mean, action_std, _ = model(batch.obs)

    # Create a Gaussian distribution with the obtained parameters.
    dist = Normal(action_mean, action_std)

    # Sample an action from the distribution.
    action = dist.sample()

    # Clip actions to ensure they are within the environment's bounds.
    # Convert action space bounds to tensors for compatibility.
    action_min = torch.tensor(action_space.low, dtype=torch.float32, device=action.device)
    action_max = torch.tensor(action_space.high, dtype=torch.float32, device=action.device)
    action = torch.clamp(action, action_min, action_max)

    return Batch(act=action.cpu().numpy(), dist=dist)
```
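
The sample-then-clip step can also be tried without PyTorch. The sketch below uses only the standard library; `sample_clipped_action` and the chosen mean, standard deviation, and bounds are illustrative assumptions, not values from the tutorial.

```python
import random

def sample_clipped_action(mean, std, low, high, rng):
    """Draw one value per action dimension from N(mean, std) and clip it to [low, high]."""
    action = []
    for m, s, lo, hi in zip(mean, std, low, high):
        raw = rng.gauss(m, s)                 # Unbounded Gaussian sample.
        action.append(max(lo, min(hi, raw)))  # Clip to the valid range.
    return action

rng = random.Random(0)  # Seeded so repeated runs give the same samples.
actions = [sample_clipped_action([0.9], [1.0], [-1.0], [1.0], rng) for _ in range(1000)]
at_upper_bound = sum(a[0] == 1.0 for a in actions)
print(f"{at_upper_bound} of {len(actions)} samples were clipped to the upper bound")
```

When the mean sits close to a bound, a noticeable share of samples ends up exactly at that bound. This is one reason to keep an eye on the learned standard deviation during training.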

For the learning section of the policy we need to calculate the log probability of each action in a slightly different way now that we have continuous actions.

```python
def policy_learn(model, optim, batch, gamma=0.99):
    """
    Learn method for the continuous action policy.

    This method computes the loss for both the policy (actor) and value function (critic),
    then performs backpropagation to update the network parameters.

    Parameters:
        model (nn.Module): The actor-critic network.
        optim (torch.optim.Optimizer): The optimizer.
        batch (Batch): A batch of transitions from the replay buffer.
        gamma (float): Discount factor for future rewards.

    Returns:
        dict: A dictionary with loss components.
    """
    # Forward pass to get mean, std, and state value.
    action_mean, action_std, state_values = model(batch.obs)
    dist = Normal(action_mean, action_std)

    # Compute log probabilities of the taken actions. Sum over action dimensions.
    log_probs = dist.log_prob(batch.act).sum(dim=-1)
```
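
Summing over the last dimension works because each action dimension is modelled as an independent Gaussian, so the joint log-probability is the sum of the per-dimension log-densities. A minimal sketch of that calculation in plain Python follows; `gaussian_log_prob` is a hypothetical helper written for illustration.

```python
import math

def gaussian_log_prob(action, mean, std):
    """Log-density of `action` under independent Gaussians, summed over dimensions."""
    total = 0.0
    for a, m, s in zip(action, mean, std):
        # log N(a | m, s) = -0.5 * ((a - m) / s)^2 - log(s) - 0.5 * log(2 * pi)
        total += -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
    return total

# A two-dimensional action taken exactly at the mean, with unit standard deviation.
lp = gaussian_log_prob([0.0, 0.0], [0.0, 0.0], [1.0, 1.0])
print(lp)  # equals -log(2 * pi), roughly -1.8379
```

Unlike the discrete case, these values are log-densities rather than log-probabilities of a finite set of choices, so they can be positive when the standard deviation is small.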

***
## **Full Implementation and Testing**: Continuous A2C

Below, we integrate the neural network and policy adjustments into a full Actor-Critic model. We will define our actor-critic network, the A2C policy, and then run a training loop on the "MountainCarContinuous-v0" environment.

### 1. Preliminary Code

In [None]:
import os
import shutil
import subprocess
import tempfile
import time
from datetime import datetime

import torch
from torch.utils.tensorboard import SummaryWriter
import tianshou as ts

# Timestamped ID to avoid overwriting previous runs.
agent_id = datetime.now().strftime("%Y%m%d_%H%M%S")  # Format: YYYYMMDD_HHMMSS

# Setup directories for models and logs.
continuous_a2c_dir = f"continuous_a2c_run_{agent_id}"
logs_dir = os.path.join(continuous_a2c_dir, "logs")
models_dir = os.path.join(continuous_a2c_dir, "models")

os.makedirs(continuous_a2c_dir, exist_ok=True)
print(f"Run files will be saved in: {continuous_a2c_dir}")
os.makedirs(logs_dir, exist_ok=True)
print(f"TensorBoard logs will be saved in: {logs_dir}")
os.makedirs(models_dir, exist_ok=True)
print(f"Models will be saved in: {models_dir}")

# Create a TensorBoard logger.
logger = ts.utils.TensorboardLogger(SummaryWriter(logs_dir))
print("TensorBoard logging is active.")

# Select the appropriate device.
device = torch.device("cuda" if torch.cuda.is_available() else 
                      "mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")

### 2. Actor-Critic Network for Continuous Actions

In [None]:
import numpy as np
import torch
import torch.nn as nn
from torch.distributions import Normal
from tianshou.data import Batch

class ActorCriticNet(nn.Module):
    """
    Combined Actor-Critic network for continuous action environments.

    The actor branch outputs the mean and standard deviation for a Gaussian distribution,
    from which actions are sampled. The critic branch outputs a scalar value estimate for
    the current state.

    This network is callable, meaning that calling an instance will invoke the forward pass.
    """
    def __init__(self, state_shape, action_shape, hidden_size=128):
        super().__init__()
        input_dim = int(np.prod(state_shape))
        output_dim = int(np.prod(action_shape))
        
        # Actor network: processes the input state and produces the mean of the Gaussian.
        self.actor = nn.Sequential(
            nn.Linear(input_dim, hidden_size),  # Input to hidden layer.
            nn.ReLU(),                          # Non-linear activation.
            nn.LayerNorm(hidden_size),          # Normalization for stable training.
            nn.Linear(hidden_size, hidden_size),# Second hidden layer.
            nn.ReLU(),                          # Activation function.
            nn.LayerNorm(hidden_size),          # Normalization layer.
            nn.Linear(hidden_size, output_dim)  # Output layer for action mean.
        )
        
        # Log standard deviation as a learnable parameter.
        # We use log_std to ensure the standard deviation remains positive after exponentiation.
        self.actor_log_std = nn.Parameter(torch.zeros(output_dim), requires_grad=True)
        
        # Critic network: processes the input state to estimate its value.
        self.critic = nn.Sequential(
            nn.Linear(input_dim, hidden_size),  # Input to hidden layer.
            nn.ReLU(),                          # Activation function.
            nn.LayerNorm(hidden_size),          # Normalization for stability.
            nn.Linear(hidden_size, hidden_size),# Second hidden layer.
            nn.ReLU(),                          # Activation function.
            nn.LayerNorm(hidden_size),          # Normalization layer.
            nn.Linear(hidden_size, 1)           # Output layer for state value.
        )

    def forward(self, obs, state=None, info={}):
        """
        Forward pass for the Actor-Critic network.

        Parameters:
            obs (np.ndarray or torch.Tensor): The input state observations.
            state (optional): Additional state information (if any).
            info (dict, optional): Additional information (if any).

        Returns:
            tuple: (action_mean, action_std, state_value)
                - action_mean: Mean values for the Gaussian distribution of each action.
                - action_std: Standard deviation for each action.
                - state_value: Scalar value estimation of the current state.
        """
        # Convert numpy array input to torch tensor if necessary.
        if isinstance(obs, np.ndarray):
            obs = torch.tensor(obs, dtype=torch.float32, device=next(self.parameters()).device)
        
        # Compute the mean for actions using the actor network.
        action_mean = self.actor(obs)
        # Clamp log_std for numerical stability and compute the standard deviation.
        action_log_std = self.actor_log_std.clamp(-20, 2)
        action_std = action_log_std.exp()
        # Compute the state value using the critic network.
        state_value = self.critic(obs).squeeze(-1)
        return action_mean, action_std, state_value

    def __call__(self, obs, state=None, info={}):
        """
        Enables the network to be called like a function.

        This method delegates the call to the forward method.

        Parameters:
            obs (np.ndarray or torch.Tensor): The input state observations.
            state (optional): Additional state information.
            info (dict, optional): Additional information.

        Returns:
            tuple: (action_mean, action_std, state_value) as defined in forward().
        """
        return self.forward(obs, state, info)

### 3. Advantage Actor-Critic (A2C) Policy for Continuous Actions

In [None]:
import numpy as np
import torch
import torch.nn as nn
from torch.distributions import Normal
from tianshou.data import Batch

class A2CPolicy:
    """
    Advantage Actor-Critic (A2C) policy for continuous actions.

    This policy class provides methods for selecting actions and updating the model
    based on collected experiences. It is made callable and includes a map_action method
    to integrate seamlessly with the latest Tianshou Collector API.
    """
    def __init__(self, model, optim, action_space, gamma=0.99):
        self.model = model
        self.optim = optim
        self.action_space = action_space
        self.gamma = gamma

    def forward(self, batch, state=None, **kwargs):
        """
        Perform a forward pass to select an action.

        Parameters:
            batch (Batch): Contains the observation data.
            state (optional): Additional state information.
            **kwargs: Additional keyword arguments.

        Returns:
            Batch: A batch containing the sampled action (in numpy format)
                   and its associated distribution.
        """
        # Get the action mean and standard deviation from the model.
        action_mean, action_std, _ = self.model(batch.obs)
        # Create a Gaussian distribution with the obtained parameters.
        dist = Normal(action_mean, action_std)
        # Sample an action from the distribution.
        action = dist.sample()

        # Convert the action space bounds to torch tensors and clip the action.
        action_min = torch.tensor(self.action_space.low, dtype=torch.float32, device=action.device)
        action_max = torch.tensor(self.action_space.high, dtype=torch.float32, device=action.device)
        action = torch.clamp(action, action_min, action_max)

        return Batch(act=action.cpu().numpy(), dist=dist)

    def learn(self, batch, **kwargs):
        """
        Update the policy using a batch of transitions.

        The method calculates the loss for both the actor and critic components,
        performs backpropagation, and updates the model parameters.

        Parameters:
            batch (Batch): A batch of transitions with observations, actions, rewards, etc.
            **kwargs: Additional keyword arguments.

        Returns:
            dict: A dictionary with the overall loss and its components:
                  "loss", "policy_loss", and "value_loss".
        """
        # Forward pass to get current predictions.
        action_mean, action_std, state_values = self.model(batch.obs)
        # Create a Gaussian distribution to compute log probabilities.
        dist = Normal(action_mean, action_std)
        # Calculate log probabilities for the taken actions (summing over action dimensions).
        log_probs = dist.log_prob(batch.act).sum(dim=-1)

        # Compute the Temporal Difference (TD) target using next state values.
        with torch.no_grad():
            _, _, next_state_values = self.model(batch.obs_next)
            td_target = batch.rew + self.gamma * (1 - batch.done) * next_state_values
            # Calculate and normalize the advantage.
            advantage = td_target - state_values
            advantage = (advantage - advantage.mean()) / (advantage.std() + 1e-8)

        # Compute the entropy for encouraging exploration.
        entropy = dist.entropy().mean()
        # Actor loss: encourages actions with higher advantage, includes entropy regularization.
        policy_loss = -(log_probs * advantage.detach()).mean() - 0.01 * entropy
        # Critic loss: mean squared error between predicted state value and TD target.
        value_loss = nn.functional.mse_loss(state_values, td_target)
        # Total loss is a combination of both.
        loss = policy_loss + value_loss

        # Perform backpropagation and update the network parameters.
        self.optim.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=0.5)
        self.optim.step()

        return {"loss": loss.item(), "policy_loss": policy_loss.item(), "value_loss": value_loss.item()}

    def map_action(self, action):
        """
        Maps the raw action output to the environment's action space.
        For continuous actions, this is an identity mapping.

        Parameters:
            action: The raw action output.

        Returns:
            The action, unchanged.
        """
        return action

    def __call__(self, batch, state=None, **kwargs):
        """
        Makes the policy callable, enabling it to be used directly by Tianshou's Collector.

        This method delegates the call to the forward method.

        Parameters:
            batch (Batch): Contains the observation data.
            state (optional): Additional state information.
            **kwargs: Additional keyword arguments.

        Returns:
            Batch: The output from the forward method.
        """
        return self.forward(batch, state, **kwargs)
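
The TD target and advantage normalisation used in `learn` can be checked by hand on a tiny batch. The sketch below is a plain-Python approximation of those two steps, not the tutorial's implementation; the function names and all numbers are made up for illustration.

```python
import statistics

def td_targets(rewards, dones, next_values, gamma=0.99):
    """One-step TD targets: r + gamma * (1 - done) * V(s')."""
    return [r + gamma * (1.0 - d) * v for r, d, v in zip(rewards, dones, next_values)]

def normalized_advantages(targets, values, eps=1e-8):
    """Advantage = target - V(s), rescaled to zero mean and unit standard deviation."""
    adv = [t - v for t, v in zip(targets, values)]
    mean = statistics.mean(adv)
    std = statistics.stdev(adv)  # Sample standard deviation, like torch.Tensor.std() by default.
    return [(a - mean) / (std + eps) for a in adv]

# Hypothetical three-step batch; the last transition ends the episode (done = 1).
rewards = [-0.1, -0.1, 100.0]
dones = [0.0, 0.0, 1.0]
values = [0.5, 1.5, 3.0]        # Critic estimates V(s).
next_values = [1.5, 3.0, 7.0]   # Critic estimates V(s'); ignored where done = 1.

targets = td_targets(rewards, dones, next_values)
advantages = normalized_advantages(targets, values)
print([round(t, 3) for t in targets])     # approximately [1.385, 2.87, 100.0]
print([round(a, 3) for a in advantages])
```

The `(1 - done)` factor removes the bootstrapped value on terminal transitions, so the last target is just the reward. After normalisation the advantages have zero mean, which means some actions in every batch are pushed up and others pushed down regardless of the raw reward scale.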

### 4. Create the "MountainCarContinuous-v0" Environment

In [None]:
import gymnasium as gym

# Create a single environment to access space details.
env = gym.make("MountainCarContinuous-v0")
state_shape = env.observation_space.shape
action_shape = env.action_space.shape
action_space = env.action_space
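
These shapes determine the layer sizes in `ActorCriticNet`, which flattens each shape into a single dimension with `np.prod`. The sketch below repeats that calculation with the standard library, assuming the shapes `(2,)` and `(1,)` that Gymnasium documents for MountainCarContinuous-v0; the image-like shape at the end is a hypothetical example.

```python
import math

# Assumed shapes for MountainCarContinuous-v0.
state_shape = (2,)    # Observation: position and velocity.
action_shape = (1,)   # Action: a single force value.

input_dim = math.prod(state_shape)    # Size of the first Linear layer's input.
output_dim = math.prod(action_shape)  # Number of action means (and log-stds).
print(input_dim, output_dim)  # 2 1

# A multi-dimensional shape is flattened the same way.
print(math.prod((3, 4)))  # 12
```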

### 5. Initialize the Actor-Critic Network, A2C Policy, and Optimizer

In [None]:
from torch import optim

net = ActorCriticNet(state_shape, action_shape).to(device)
optimizer = optim.Adam(net.parameters(), lr=1e-5, weight_decay=1e-4)
policy = A2CPolicy(model=net, optim=optimizer, action_space=action_space, gamma=0.99)

### 6. Training on MountainCarContinuous-v0

In [None]:
import os
import shutil
import subprocess
import tempfile
import time

import torch
import numpy as np
from torch.utils.tensorboard import SummaryWriter
from tianshou.data import ReplayBuffer, Collector, Batch
import tianshou as ts
from tqdm.notebook import tqdm
from IPython.display import IFrame, display

# === TensorBoard Setup ===
def kill_port(port):
    """
    Terminates any processes that are listening on the specified port.
    Works on both Unix-based systems and Windows.
    """
    try:
        if os.name == 'nt':
            # Windows: Use netstat and taskkill to kill processes on the given port.
            cmd = f'for /f "tokens=5" %a in (\'netstat -aon ^| findstr :{port}\') do taskkill /F /PID %a'
            subprocess.run(cmd, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            print(f"Killed processes on port {port}.")
        else:
            # Unix (Linux/Mac): Use lsof to find processes on the port and kill them.
            cmd = f"lsof -ti:{port} | xargs kill -9"
            subprocess.run(cmd, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            print(f"Killed processes on port {port}.")
    except subprocess.CalledProcessError as e:
        if "returned non-zero exit status 1" in str(e):
            pass
        else:
            print(f"Could not kill process on port {port}: {e}")

# Kill any processes on port 6006 to ensure it is free.
kill_port(6006)

# Ensure that logs_dir is defined.
# (logs_dir is normally created in the "Preliminary Code" cell above.)
if 'logs_dir' not in globals():
    logs_dir = "./logs"  # Fallback path

# Clear previous TensorBoard sessions.
tensorboard_info = os.path.join(tempfile.gettempdir(), ".tensorboard-info")
if os.path.exists(tensorboard_info):
    shutil.rmtree(tensorboard_info)

# Launch TensorBoard in the background on port 6006.
tb_command = [
    "tensorboard",
    "--logdir", logs_dir,
    "--port", "6006",
    "--host", "localhost",
    "--reload_interval", "30"
]
tb_process = subprocess.Popen(tb_command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# Allow time for TensorBoard to start and display its dashboard.
time.sleep(5)
display(IFrame(src="http://localhost:6006", width="100%", height="800"))
# ------------------------------------------------------------------------------

# === Training Hyperparameters and Setup ===
max_epoch = 10               # Total number of epochs for training.
steps_per_epoch = 10000      # Number of training steps per epoch.
keep_n_steps = 30            # Number of recent transitions to use for learning.

# Create a ReplayBuffer to store transitions.
buffer = ReplayBuffer(size=keep_n_steps)

# Create collectors for training and testing.
train_collector = Collector(policy, env, buffer)
test_collector = Collector(policy, env)

# Lists to store summaries for analysis.
epoch_training_losses = []
epoch_test_rewards = []
epoch_durations = []

global_start_time = time.time()  # Start overall training timer.

# === Training Loop with Detailed Progress Tracking ===
for epoch in range(max_epoch):
    epoch_start_time = time.time()  # Timer for the current epoch.
    train_collector.reset()         # Reset collector at the start of each epoch.
    running_loss = 0.0              # Accumulate loss to compute average loss.

    # Set up a tqdm progress bar with dynamic post-fix metrics.
    progress_bar = tqdm(
        range(steps_per_epoch),
        desc=f"Epoch {epoch+1}/{max_epoch}",
        dynamic_ncols=True
    )
    
    for step in progress_bar:
        # Collect a fixed number of steps and store transitions in the buffer.
        train_collector.collect(n_step=keep_n_steps)
        # Retrieve the most recent transitions.
        batch = train_collector.buffer[-keep_n_steps:]
        
        # Convert batch fields to torch tensors on the same device as the model.
        batch.obs = torch.tensor(batch.obs, dtype=torch.float32, device=device)
        batch.act = torch.tensor(batch.act, dtype=torch.float32, device=device)  # Continuous actions use float type.
        batch.rew = torch.tensor(batch.rew, dtype=torch.float32, device=device)
        batch.done = torch.tensor(batch.done, dtype=torch.float32, device=device)
        batch.obs_next = torch.tensor(batch.obs_next, dtype=torch.float32, device=device)
        
        # Update the policy using the collected batch and capture the loss.
        learn_info = policy.learn(batch)
        loss_val = learn_info.get("loss", 0)
        running_loss += loss_val
        
        global_step = epoch * steps_per_epoch + step
        
        # Log step-level metrics to TensorBoard.
        logger.writer.add_scalar("Loss/train_step", loss_val, global_step)
        logger.writer.add_scalar("Loss/train_running_avg", running_loss / (step + 1), global_step)
        
        # Flush logs periodically.
        if step % 50 == 0:
            logger.writer.flush()
        
        # Update progress bar with current metrics.
        progress_bar.set_postfix({
            "Step": f"{step}/{steps_per_epoch}",
            "Loss": f"{loss_val:07.3f}",
            "AvgLoss": f"{running_loss / (step + 1):07.3f}"
        })
        
        # Print summary at every 25% of the epoch.
        if step % (steps_per_epoch // 4) == 0:
            print(
                f"Epoch {epoch+1}, Step {step}/{steps_per_epoch}: "
                f"Step Loss = {loss_val}, Running Avg Loss = {running_loss / (step + 1)}"
            )
    
    # Compute average loss over the epoch.
    avg_loss = running_loss / steps_per_epoch

    # Evaluate the agent on 10 episodes.
    test_collector.reset()
    test_result = test_collector.collect(n_episode=10)
    mean_reward = np.mean(test_result["rews"])
    std_reward = np.std(test_result["rews"])
    min_reward = np.min(test_result["rews"])
    p25_reward = np.percentile(test_result["rews"], 25)
    median_reward = np.median(test_result["rews"])
    p75_reward = np.percentile(test_result["rews"], 75)
    max_reward = np.max(test_result["rews"])

    # Log epoch-level metrics to TensorBoard.
    logger.writer.add_scalar("Reward/test_avg", mean_reward, epoch)
    logger.writer.add_scalar("Loss/train_avg", avg_loss, epoch)
    logger.writer.flush()

    # Calculate epoch elapsed time.
    epoch_elapsed = time.time() - epoch_start_time
    epoch_training_losses.append(avg_loss)
    epoch_test_rewards.append(mean_reward)
    epoch_durations.append(epoch_elapsed)

    # Print detailed epoch summary.
    print(
        f"\nEpoch {epoch+1} Summary:\n"
        f"  - Epoch Elapsed Time      : {epoch_elapsed} seconds\n"
        f"  - Training Updates        : {steps_per_epoch}\n"
        f"  - Average Training Loss   : {avg_loss}\n"
        f"  - Mean Test Reward        : {mean_reward}\n"
        f"  - Std Test Reward         : {std_reward}\n"
        f"  - Min Test Reward         : {min_reward}\n"
        f"  - 25th Percentile Reward  : {p25_reward}\n"
        f"  - Median Test Reward      : {median_reward}\n"
        f"  - 75th Percentile Reward  : {p75_reward}\n"
        f"  - Max Test Reward         : {max_reward}\n"
    )

# Final flush and close the TensorBoard writer.
logger.writer.close()

# Overall training statistics.
total_elapsed = time.time() - global_start_time
overall_avg_loss = np.mean(epoch_training_losses)
overall_avg_reward = np.mean(epoch_test_rewards)
total_epochs = len(epoch_durations)

print("\nOverall Training Summary:")
print(f"  - Total Epochs            : {total_epochs}")
print(f"  - Overall Average Loss    : {overall_avg_loss}")
print(f"  - Overall Average Reward  : {overall_avg_reward}")
print(f"  - Total Elapsed Time      : {total_elapsed} seconds")

print("\nFinal Epoch Summary:")
print(
    f"  - Epoch {total_epochs}:\n"
    f"      * Average Training Loss : {epoch_training_losses[-1]}\n"
    f"      * Average Test Reward   : {epoch_test_rewards[-1]}\n"
    f"      * Epoch Elapsed Time    : {epoch_durations[-1]} seconds\n"
)

Did it learn? Do you see rewards increasing? 

If so, let's save the model's state dictionary. This allows you to reload the trained agent later without retraining.

In [None]:
import os
import torch

# Save the model's state dictionary for future use.
model_path = os.path.join(models_dir, f"A2C_mountain_model_{agent_id}.pth")
torch.save(net.state_dict(), model_path)
print(f"Model saved to {model_path}")

### 7. Testing the Trained Model

First, we load the saved Actor-Critic model and prepare it for evaluation.

In [None]:
import os
import torch

model_path = os.path.join(models_dir, f"A2C_mountain_model_{agent_id}.pth")

# Initialize a new network with the same architecture and load saved parameters.
loaded_net = ActorCriticNet(state_shape, action_shape).to(device)
loaded_net.load_state_dict(torch.load(model_path, map_location=device))
print("Model loaded successfully.")

We create an evaluation environment and build a policy based on the loaded model.

In [None]:
import gymnasium as gym

# Set to True to record the evaluation video, False to render in human mode.
output_video = False

# Create the evaluation environment.
if output_video:
    eval_env = gym.make("MountainCarContinuous-v0", render_mode="rgb_array")
else:
    eval_env = gym.make("MountainCarContinuous-v0", render_mode="human")

loaded_policy = A2CPolicy(model=loaded_net, optim=optimizer, action_space=action_space, gamma=0.99)
print("Evaluation environment and policy set up.")

Now, let's run our agent in the environment. **Note**: You can change the number of episodes to watch!

In [None]:
import imageio
import numpy as np
from tianshou.data import Batch

num_episodes = 20       # Number of evaluation episodes.
frames = []             # List to store frames for video recording.
episode_rewards = []    # List to store total rewards per episode.
episode_lengths = []    # List to store the number of steps per episode.

# Loop over each evaluation episode.
for episode in range(num_episodes):
    obs, _ = eval_env.reset()  # Reset the environment and get the initial observation.
    done = False               # Flag to track the end of an episode.
    total_reward = 0           # Accumulate the total reward for the episode.
    step_count = 0             # Counter for the number of steps.
    print(f"Starting episode {episode + 1}")
    
    # Run the episode until termination.
    while not done:
        # Create a Batch object from the current observation.
        obs_batch = Batch(obs=[obs])
        # Get the action from the loaded policy.
        action = loaded_policy.forward(obs_batch).act[0]
        # Apply the action to the environment.
        obs, reward, done, truncated, _ = eval_env.step(action)
        total_reward += reward  # Accumulate reward.
        step_count += 1         # Increment the step counter.
        # Record the current frame; render() only returns an image in "rgb_array" mode.
        if output_video:
            frames.append(eval_env.render())
        
        # End the episode if finished.
        if done or truncated:
            print(f"Episode {episode + 1} ended with total reward: {total_reward} after {step_count} steps.")
            break

    episode_rewards.append(total_reward)
    episode_lengths.append(step_count)

# Convert lists to numpy arrays for statistical calculations.
episode_rewards = np.array(episode_rewards)
episode_lengths = np.array(episode_lengths)

# Compute and print comprehensive performance statistics.
if episode_rewards.size > 0 and episode_lengths.size > 0:
    count_rewards = len(episode_rewards)
    mean_rewards = np.mean(episode_rewards)
    std_rewards = np.std(episode_rewards)
    min_rewards = np.min(episode_rewards)
    p25_rewards = np.percentile(episode_rewards, 25)
    median_rewards = np.median(episode_rewards)
    p75_rewards = np.percentile(episode_rewards, 75)
    max_rewards = np.max(episode_rewards)
    
    count_lengths = len(episode_lengths)
    mean_lengths = np.mean(episode_lengths)
    std_lengths = np.std(episode_lengths)
    min_lengths = np.min(episode_lengths)
    p25_lengths = np.percentile(episode_lengths, 25)
    median_lengths = np.median(episode_lengths)
    p75_lengths = np.percentile(episode_lengths, 75)
    max_lengths = np.max(episode_lengths)
    
    print("\nFinal Evaluation Performance Summary:")
    print(f"Total Episodes Evaluated: {num_episodes}\n")
    header = "{:<22} {:>15} {:>20}".format("Statistic", "Rewards", "Episode Lengths")
    print(header)
    print("-" * len(header))
    print("{:<22} {:>15d} {:>20d}".format("Count", count_rewards, count_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("Mean", mean_rewards, mean_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("Std Dev", std_rewards, std_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("Min", min_rewards, min_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("25th Percentile", p25_rewards, p25_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("Median", median_rewards, median_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("75th Percentile", p75_rewards, p75_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("Max", max_rewards, max_lengths))
else:
    print("No performance data was collected. Please verify the Collector configuration.")

# Close the environment after evaluation.
eval_env.close()
print("Evaluation completed and environment closed.")

if output_video:
    try:
        # Save the recorded frames as a video file.
        video_path = os.path.join(continuous_a2c_dir, f"mountain_car_simulation_{agent_id}.mp4")
        imageio.mimsave(video_path, frames, fps=60)
        print(f"Simulation video saved as {video_path}")
    except Exception as e:
        print(f"An error occurred while saving the simulation video: {e}")
else:
    print("No video was saved. Set output_video = True to render in 'rgb_array' mode and record one.")


***
**Things to Try**

Try changing the environment or changing the hyperparameters:

- Experiment with different environments (e.g., `Pendulum-v1`, `BipedalWalker-v3`) to see how the continuous A2C performs.
- Tweak hyperparameters such as the learning rate and discount factor. Note that a learning rate that is too high can make training unstable, while one that is too low slows learning down.
- Observe the rewards: Do they increase steadily? What hyperparameters seem to work best for your specific environment?
***
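
To build intuition for the discount factor before sweeping it, the sketch below computes the discounted return $G = \sum_t \gamma^t r_t$ for a constant reward stream under a few values of $\gamma$. The reward sequence and the helper name `discounted_return` are illustrative assumptions, not part of the tutorial's A2C code.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: a reward of 1.0 on each of 200 steps.
rewards = [1.0] * 200

for gamma in (0.9, 0.99, 0.999):
    print(f"gamma={gamma}: discounted return = {discounted_return(rewards, gamma):.2f}")
```

With $\gamma = 0.9$ the return saturates near $1 / (1 - 0.9) = 10$, so rewards more than a few dozen steps ahead barely influence the update. Values closer to 1 let distant rewards matter, usually at the cost of noisier estimates.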

## The Ant‑v5 Environment
### Training an Agent in a More Complex Environment

Now that our agent can handle continuous actions, we can train it in a more challenging environment. For example, the `Ant-v5` environment requires the agent to control a multi-legged robot. This is also an opportunity to explore more complex body dynamics.

First, run the code below to verify that the Ant‑v5 environment is correctly installed. It creates the environment and prints its observation and action spaces.
You can read up on the Ant environment to see the actions the agent can take, the observations it receives, and the rewards it can achieve. You'll note that the actions are continuous!
* https://gymnasium.farama.org/environments/mujoco/ant/

Understanding these spaces is essential, as they define what the agent can "see" (observations) and how it can "act" (actions) in the environment.

In [None]:
import gymnasium as gym

# Create the Ant-v5 environment.
env_ant = gym.make("Ant-v5")

# Print the observation space and action space.
print("Observation space:", env_ant.observation_space)
print("Action space:", env_ant.action_space)

# Close the environment to free resources.
env_ant.close()

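* The observation space tells you what state features are available. For Ant-v5 with default settings this is a Box with shape (105,): 27 position and velocity values plus 78 contact-force values. If the printed shape differs for your version or settings, trust the printed output.
* The action space shows the allowed actions (for Ant-v5, a Box with shape (8,) with each value bounded between -1 and 1).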
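
To make the role of the action bounds concrete, here is a minimal sketch of sampling one Gaussian value per action dimension and clipping it into the valid range. It uses only the standard library; the function name `sample_clipped_action` and the mean and standard deviation values are hypothetical stand-ins for what a policy network would output, and the bounds of -1 and 1 are taken from the action space described above.

```python
import random


def sample_clipped_action(means, stds, low=-1.0, high=1.0, rng=random):
    """Sample one Gaussian value per action dimension and clip it to [low, high]."""
    action = []
    for mean, std in zip(means, stds):
        raw = rng.gauss(mean, std)
        action.append(min(max(raw, low), high))
    return action


# Hypothetical policy output for an 8-dimensional action space.
rng = random.Random(0)
means = [0.0, 0.5, -0.5, 0.9, -0.9, 0.2, -0.2, 0.0]
stds = [0.5] * 8

action = sample_clipped_action(means, stds, rng=rng)
print([round(a, 3) for a in action])
```

Dimensions whose mean sits near a bound are clipped often, which is one reason to keep an eye on how large the learned standard deviation becomes.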

### 1. Preliminary Code

In [None]:
import os
import shutil
import subprocess
import tempfile
import time
from datetime import datetime

import torch
from torch.utils.tensorboard import SummaryWriter
import tianshou as ts

# Timestamped ID to avoid overwriting previous runs.
agent_id = datetime.now().strftime("%Y%m%d_%H%M%S")  # Format: YYYYMMDD_HHMMSS

# Setup directories for models and logs.
a2c_ant_agent_dir = f"a2c_ant_agent_{agent_id}"
logs_dir = os.path.join(a2c_ant_agent_dir, "logs")
models_dir = os.path.join(a2c_ant_agent_dir, "models")

os.makedirs(a2c_ant_agent_dir, exist_ok=True)
print(f"Run files will be saved in: {a2c_ant_agent_dir}")
os.makedirs(logs_dir, exist_ok=True)
print(f"TensorBoard logs will be saved in: {logs_dir}")
os.makedirs(models_dir, exist_ok=True)
print(f"Models will be saved in: {models_dir}")

# Create a TensorBoard logger.
logger = ts.utils.TensorboardLogger(SummaryWriter(logs_dir))
print("TensorBoard logging is active.")

# Select the appropriate device.
device = torch.device("cuda" if torch.cuda.is_available() else 
                      "mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")

### 2. Setting Up the Agent for Ant-v5

Next, we set up the environment and initialize our Actor-Critic network and A2C policy for the Ant‑v5 environment.  
The agent uses the observation and action spaces to configure its neural network, which outputs parameters for a Gaussian distribution over actions.

We will then use our training code to train the Ant!

In [None]:
import gymnasium as gym
from torch import optim

# Create a single instance of the Ant-v5 environment to extract its properties.
single_env_ant = gym.make("Ant-v5")
state_shape = single_env_ant.observation_space.shape    # Shape of the observation vector.
action_shape = single_env_ant.action_space.shape        # Shape of the action vector.
action_space = single_env_ant.action_space              # The action space (used for clipping actions).

# Initialize the Actor-Critic network for Ant-v5 and move it to the appropriate device.
net_ant = ActorCriticNet(state_shape, action_shape).to(device)

# Use an optimizer with a slightly higher learning rate due to the increased complexity of Ant-v5.
optimizer_ant = optim.Adam(net_ant.parameters(), lr=1e-4, weight_decay=1e-4)

# Create the A2C policy using the network, optimizer, and action space.
policy_ant = A2CPolicy(model=net_ant, optim=optimizer_ant, action_space=action_space, gamma=0.99)

# Close the temporary environment instance.
single_env_ant.close()
print("Ant-v5 setup complete.")
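
Since the network outputs Gaussian parameters, it can help to see how the log-probability of an action is computed from a mean and a log standard deviation. The sketch below assumes independent Gaussians per action dimension with the per-dimension log-densities summed; whether `A2CPolicy` does exactly this is an assumption, and the function name and numbers are illustrative only.

```python
import math


def diag_gaussian_log_prob(action, means, log_stds):
    """Log-density of `action` under independent Gaussians, summed over dimensions."""
    total = 0.0
    for a, mu, log_std in zip(action, means, log_stds):
        std = math.exp(log_std)
        z = (a - mu) / std
        total += -0.5 * z * z - log_std - 0.5 * math.log(2.0 * math.pi)
    return total


# Hypothetical values for a 2-dimensional action.
means = [0.0, 0.3]
log_stds = [0.0, -1.0]  # standard deviations of 1.0 and about 0.368

print(diag_gaussian_log_prob([0.0, 0.3], means, log_stds))  # action at the mean
print(diag_gaussian_log_prob([1.0, 1.0], means, log_stds))  # action far from the mean
```

Actions close to the mean receive a higher log-probability. A policy-gradient update raises the log-probability of actions with positive advantage and lowers it for actions with negative advantage.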

### 3. Training the Ant Agent (Ant-v5)

Let's now train our Ant!

Note: this training will take some time. As our agents move closer to a modern RL agent, training takes longer and longer. We'll start to talk more about the kinds of hardware you can access (e.g., a free GPU on Google Colab), and how hardware becomes ever more important as we go along.

In [None]:
import os
import shutil
import subprocess
import tempfile
import time

import torch
import numpy as np
from torch.utils.tensorboard import SummaryWriter
from tianshou.data import ReplayBuffer, Collector, Batch
import tianshou as ts
from tqdm.notebook import tqdm
from IPython.display import IFrame, display

# === TensorBoard Setup ===
def kill_port(port):
    """
    Terminates any processes that are listening on the specified port.
    Works on both Unix-based systems and Windows.
    """
    try:
        if os.name == 'nt':
            cmd = f'for /f "tokens=5" %a in (\'netstat -aon ^| findstr :{port}\') do taskkill /F /PID %a'
            subprocess.run(cmd, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            print(f"Killed processes on port {port}.")
        else:
            cmd = f"lsof -ti:{port} | xargs kill -9"
            subprocess.run(cmd, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            print(f"Killed processes on port {port}.")
    except subprocess.CalledProcessError as e:
        if "returned non-zero exit status 1" in str(e):
            pass
        else:
            print(f"Could not kill process on port {port}: {e}")

kill_port(6006)

# Use logs_dir defined in the setup_run cell, or use a fallback.
if 'logs_dir' not in globals():
    logs_dir = "./logs"

# Clear previous TensorBoard sessions.
tensorboard_info = os.path.join(tempfile.gettempdir(), ".tensorboard-info")
if os.path.exists(tensorboard_info):
    shutil.rmtree(tensorboard_info)

# Launch TensorBoard in the background on port 6006.
tb_command = [
    "tensorboard",
    "--logdir", logs_dir,
    "--port", "6006",
    "--host", "localhost",
    "--reload_interval", "30"
]
# Discard TensorBoard's output: a pipe that is never read can fill up and stall the process.
tb_process = subprocess.Popen(tb_command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

time.sleep(5)
display(IFrame(src="http://localhost:6006", width="100%", height="800"))
# ------------------------------------------------------------------------------

# === Training Hyperparameters and Setup for Ant-v5 ===
max_epoch = 10             # Total number of training epochs.
steps_per_epoch = 1000     # Number of training steps per epoch.
keep_n_steps = 200         # Number of transitions to collect per update.

# Create a ReplayBuffer to store transitions.
buffer_ant = ReplayBuffer(size=keep_n_steps)

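# Create collectors for training and testing. Each collector gets its own fresh
# Ant-v5 environment: `env_ant` was closed after inspection, and sharing a single
# environment between collectors would mix their episode state.
import gymnasium as gym

train_env_ant = gym.make("Ant-v5")
test_env_ant = gym.make("Ant-v5")
train_collector_ant = Collector(policy_ant, train_env_ant, buffer_ant)
test_collector_ant = Collector(policy_ant, test_env_ant)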

# Lists to store training summaries for analysis.
epoch_training_losses = []
epoch_test_rewards = []
epoch_durations = []

global_start_time = time.time()  # Start overall training timer.

# === Training Loop with Detailed Progress Tracking ===
for epoch in range(max_epoch):
    epoch_start_time = time.time()  # Timer for the current epoch.
    train_collector_ant.reset()       # Reset the collector at the start of the epoch.
    running_loss = 0.0                # Accumulate loss for averaging.

    # Set up a tqdm progress bar for this epoch.
    progress_bar = tqdm(
        range(steps_per_epoch),
        desc=f"Ant Epoch {epoch+1}/{max_epoch}",
        dynamic_ncols=True
    )
    
    for step in progress_bar:
        # Collect a fixed number of transitions.
        train_collector_ant.collect(n_step=keep_n_steps)
        # Retrieve the latest batch of transitions.
        batch = train_collector_ant.buffer[-keep_n_steps:]
        
        # Convert batch data to torch tensors.
        batch.obs = torch.tensor(batch.obs, dtype=torch.float32, device=device)
        batch.act = torch.tensor(batch.act, dtype=torch.float32, device=device)
        batch.rew = torch.tensor(batch.rew, dtype=torch.float32, device=device)
        batch.done = torch.tensor(batch.done, dtype=torch.float32, device=device)
        batch.obs_next = torch.tensor(batch.obs_next, dtype=torch.float32, device=device)
        
        # Normalize rewards to stabilize training.
        batch.rew = (batch.rew - batch.rew.mean()) / (batch.rew.std() + 1e-8)
        
        # Update the policy using the collected batch.
        loss_dict = policy_ant.learn(batch)
        loss_val = loss_dict.get("loss", 0)
        running_loss += loss_val
        
        global_step = epoch * steps_per_epoch + step
        
        # Log step-level metrics to TensorBoard.
        logger.writer.add_scalar("Loss/train_step_ant", loss_val, global_step)
        logger.writer.add_scalar("Loss/train_running_avg_ant", running_loss / (step + 1), global_step)
        
        if step % 50 == 0:
            logger.writer.flush()
        
        # Update progress bar with current metrics.
        progress_bar.set_postfix({
            "Step": f"{step}/{steps_per_epoch}",
            "Loss": f"{loss_val:07.3f}",
            "AvgLoss": f"{running_loss / (step + 1):07.3f}"
        })
        
        if step % (steps_per_epoch // 4) == 0:
            print(
                f"Ant Epoch {epoch+1}, Step {step}/{steps_per_epoch}: "
                f"Step Loss = {loss_val}, Running Avg Loss = {running_loss / (step + 1)}"
            )
    
    # Compute average loss for the epoch.
    avg_loss = running_loss / steps_per_epoch

    # Evaluate the agent on 10 episodes.
    test_collector_ant.reset()
    test_result = test_collector_ant.collect(n_episode=10)
    mean_reward = np.mean(test_result["rews"])
    std_reward = np.std(test_result["rews"])
    min_reward = np.min(test_result["rews"])
    p25_reward = np.percentile(test_result["rews"], 25)
    median_reward = np.median(test_result["rews"])
    p75_reward = np.percentile(test_result["rews"], 75)
    max_reward = np.max(test_result["rews"])

    # Log epoch-level metrics to TensorBoard.
    logger.writer.add_scalar("Reward/test_avg_ant", mean_reward, epoch)
    logger.writer.add_scalar("Loss/train_avg_ant", avg_loss, epoch)
    logger.writer.flush()

    epoch_elapsed = time.time() - epoch_start_time
    epoch_training_losses.append(avg_loss)
    epoch_test_rewards.append(mean_reward)
    epoch_durations.append(epoch_elapsed)

    print(
        f"\nAnt Epoch {epoch+1} Summary:\n"
        f"  - Epoch Elapsed Time      : {epoch_elapsed} seconds\n"
        f"  - Updates Performed       : {steps_per_epoch} ({keep_n_steps} transitions each)\n"
        f"  - Average Training Loss   : {avg_loss}\n"
        f"  - Mean Test Reward        : {mean_reward}\n"
        f"  - Std Test Reward         : {std_reward}\n"
        f"  - Min Test Reward         : {min_reward}\n"
        f"  - 25th Percentile Reward  : {p25_reward}\n"
        f"  - Median Test Reward      : {median_reward}\n"
        f"  - 75th Percentile Reward  : {p75_reward}\n"
        f"  - Max Test Reward         : {max_reward}\n"
    )

# Final flush and close the TensorBoard writer.
logger.writer.close()

total_elapsed = time.time() - global_start_time
overall_avg_loss = np.mean(epoch_training_losses)
overall_avg_reward = np.mean(epoch_test_rewards)
total_epochs = len(epoch_durations)

print("\nOverall Training Summary (Ant):")
print(f"  - Total Epochs            : {total_epochs}")
print(f"  - Overall Average Loss    : {overall_avg_loss}")
print(f"  - Overall Average Reward  : {overall_avg_reward}")
print(f"  - Total Elapsed Time      : {total_elapsed} seconds")

print("\nFinal Epoch Summary (Ant):")
print(
    f"  - Epoch {total_epochs}:\n"
    f"      * Average Training Loss : {epoch_training_losses[-1]}\n"
    f"      * Average Test Reward   : {epoch_test_rewards[-1]}\n"
    f"      * Epoch Elapsed Time    : {epoch_durations[-1]} seconds\n"
)
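
The training loop standardizes the rewards of each batch before the update. As a minimal illustration of that step, the sketch below applies the same formula, `(r - mean) / (std + eps)`, to a short list of made-up rewards using only the standard library. The helper name `standardize` and the reward values are hypothetical.

```python
import statistics


def standardize(values, eps=1e-8):
    """Shift to zero mean and scale by the sample standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample std (n - 1), like torch.Tensor.std() by default
    return [(v - mean) / (std + eps) for v in values]


# Hypothetical per-step rewards from one collected batch.
batch_rewards = [0.2, -1.5, 0.9, 3.1, -0.4, 0.0]
normalized = standardize(batch_rewards)

print([round(v, 3) for v in normalized])
print("mean:", round(statistics.fmean(normalized), 6))
print("std :", round(statistics.stdev(normalized), 6))
```

Because the statistics are recomputed for every batch, the same raw reward can map to different normalized values across updates. This keeps the scale of the gradients steady, but it discards information about the absolute reward level.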

Did it learn? Do you see rewards increasing? 

If so, let's save the model's state dictionary. This allows you to reload the trained agent later without retraining.

In [None]:
import os
import torch

# Save the model's state dictionary for future use.
model_path = os.path.join(models_dir, f"ant_v5_model_{agent_id}.pth")
torch.save(net_ant.state_dict(), model_path)
print(f"Model saved to {model_path}")

Let's test out the model, and watch what it learnt.

### 4. Setup the Evaluation Environment for Ant-v5

Load in the trained model.

In [None]:
import os
import torch

model_path = os.path.join(models_dir, f"ant_v5_model_{agent_id}.pth")

# Initialize a new network with the same architecture and load the saved parameters.
loaded_net_ant = ActorCriticNet(state_shape, action_shape).to(device)
loaded_net_ant.load_state_dict(torch.load(model_path, map_location=device))
print("Ant-v5 model loaded successfully.")

Let's create an environment and build a policy based on our saved model.

In [None]:
import gymnasium as gym

# Set to True to record the evaluation video, False to render in human mode.
output_video = False

# Create the evaluation environment for Ant-v5 with rendering mode enabled.
if output_video:
    eval_env_ant = gym.make("Ant-v5", render_mode="rgb_array")
else:
    eval_env_ant = gym.make("Ant-v5", render_mode="human")

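# Build the evaluation policy using the loaded model.
# Note: `optimizer_ant` still references the parameters of `net_ant`; that is
# harmless here because no updates are made during evaluation.
loaded_policy_ant = A2CPolicy(model=loaded_net_ant, optim=optimizer_ant, action_space=action_space, gamma=0.99)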
print("Ant-v5 evaluation environment and policy are ready.")

### 5. Running the Ant-v5 Agent and Recording a Video

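Now, let's run our Ant agent in the environment. We record each frame to create a simulation video. Usable frames are only returned when `output_video` was set to `True` in the previous cell; in human mode the simulation is shown in a window instead.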

**Note**: You can change the number of episodes to watch!
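
The evaluation loop relies on the five values returned by `step`: an episode ends either when the environment reports a terminal state or when it is cut off by a time limit. The sketch below shows that control flow on a toy environment. `ToyEnv` and `run_episode` are hypothetical names; the class only mimics the `reset`/`step` signatures and is not a real Gymnasium environment.

```python
class ToyEnv:
    """Hypothetical stand-in that mimics the reset/step signatures of a Gymnasium environment."""

    def __init__(self, goal=5, max_steps=10):
        self.goal = goal
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position, {}

    def step(self, action):
        self.position += action
        self.steps += 1
        terminated = self.position >= self.goal   # task-defined end state
        truncated = self.steps >= self.max_steps  # time limit reached
        reward = 1.0 if terminated else -0.1
        return self.position, reward, terminated, truncated, {}


def run_episode(env, policy):
    """Run one episode and return (total_reward, step_count)."""
    obs, _ = env.reset()
    total_reward, step_count = 0.0, 0
    terminated = truncated = False
    while not (terminated or truncated):
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total_reward += reward
        step_count += 1
    return total_reward, step_count


print(run_episode(ToyEnv(), policy=lambda obs: 1))  # reaches the goal
print(run_episode(ToyEnv(), policy=lambda obs: 0))  # stopped by the time limit
```

Checking both flags matters for an environment like Ant-v5, where an episode can stop because the robot becomes unhealthy or because the step limit is reached.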

In [None]:
import os

import imageio
import numpy as np
from tianshou.data import Batch

num_episodes = 20       # Number of evaluation episodes.
frames_ant = []         # List to store frames for the video.
episode_rewards = []    # List to store total rewards per episode.
episode_lengths = []    # List to store the number of steps per episode.

# Loop over each evaluation episode.
for episode in range(num_episodes):
    obs, _ = eval_env_ant.reset()  # Reset the environment.
    done = False                   # Flag to determine if the episode is finished.
    total_reward = 0               # Initialize total reward.
    step_count = 0                 # Initialize step counter.
    print(f"Starting episode {episode + 1}")
    
    # Run the episode.
    while not done:
        # Create a Batch object from the current observation.
        obs_batch = Batch(obs=[obs])
        # Obtain the action from the loaded policy.
        action = loaded_policy_ant.forward(obs_batch).act[0]
        # Apply the action to the environment.
        obs, reward, done, truncated, _ = eval_env_ant.step(action)
        total_reward += reward  # Accumulate reward.
        step_count += 1         # Increment step count.
        frames_ant.append(eval_env_ant.render())  # Record the current frame.
        
        # End the episode if finished.
        if done or truncated:
            print(f"Episode {episode + 1} ended with total reward: {total_reward} after {step_count} steps.")
            break

    episode_rewards.append(total_reward)
    episode_lengths.append(step_count)

# Convert lists to numpy arrays for statistical calculations.
episode_rewards = np.array(episode_rewards)
episode_lengths = np.array(episode_lengths)

# Compute and print comprehensive performance statistics.
if episode_rewards.size > 0 and episode_lengths.size > 0:
    count_rewards = len(episode_rewards)
    mean_rewards = np.mean(episode_rewards)
    std_rewards = np.std(episode_rewards)
    min_rewards = np.min(episode_rewards)
    p25_rewards = np.percentile(episode_rewards, 25)
    median_rewards = np.median(episode_rewards)
    p75_rewards = np.percentile(episode_rewards, 75)
    max_rewards = np.max(episode_rewards)
    
    count_lengths = len(episode_lengths)
    mean_lengths = np.mean(episode_lengths)
    std_lengths = np.std(episode_lengths)
    min_lengths = np.min(episode_lengths)
    p25_lengths = np.percentile(episode_lengths, 25)
    median_lengths = np.median(episode_lengths)
    p75_lengths = np.percentile(episode_lengths, 75)
    max_lengths = np.max(episode_lengths)
    
    print("\nFinal Evaluation Performance Summary:")
    print(f"Total Episodes Evaluated: {num_episodes}\n")
    header = "{:<22} {:>15} {:>20}".format("Statistic", "Rewards", "Episode Lengths")
    print(header)
    print("-" * len(header))
    print("{:<22} {:>15d} {:>20d}".format("Count", count_rewards, count_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("Mean", mean_rewards, mean_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("Std Dev", std_rewards, std_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("Min", min_rewards, min_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("25th Percentile", p25_rewards, p25_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("Median", median_rewards, median_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("75th Percentile", p75_rewards, p75_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("Max", max_rewards, max_lengths))
else:
    print("No performance data was collected. Please verify the Collector configuration.")

# Close the evaluation environment.
eval_env_ant.close()
print("Evaluation completed and environment closed.")

try:
    # Save the recorded frames as a video file.
    video_ant_path = os.path.join(a2c_ant_agent_dir, f"ant_v5_simulation_{agent_id}.mp4")
    imageio.mimsave(video_ant_path, frames_ant, fps=60)
    print(f"Ant simulation video saved as {video_ant_path}")
except Exception as e:
    print("If a video is intended, make sure that the rendering mode is set to 'rgb_array' in the environment.")
    print(f"Otherwise, an error occurred while saving the simulation video: {e}")
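
The summary table is computed twice in this tutorial with near-identical code, so it could be factored into a small helper. The sketch below is one possible shape for it, written with the standard library rather than NumPy so that it runs on its own. The name `summarize` and the sample returns are hypothetical; `method="inclusive"` is chosen because it uses the same linear interpolation as `np.percentile`'s default.

```python
import statistics


def summarize(values):
    """Return the summary statistics reported in the evaluation table."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),  # population std, like np.std's default
        "min": min(values),
        "p25": q1,
        "median": median,
        "p75": q3,
        "max": max(values),
    }


# Hypothetical episode returns, for illustration only.
episode_returns = [10.0, 12.0, 9.0, 15.0, 11.0]
for name, value in summarize(episode_returns).items():
    print(f"{name:<8} {value:>10.2f}")
```

Calling the helper once for rewards and once for episode lengths would give the two columns of the table.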