# ABE Tutorial 1
## Setting up an ABE Workshop

In this first tutorial let's set up a workshop to build agents, environments, and RL algorithms!

Steps:
* Install tianshou
* Check that it works
* Explore available algorithms and environments


## Tianshou

Tianshou is a Python library that makes working with deep reinforcement learning easier. Its focus is on developing implementations of reinforcement learning algorithms that can interact with a wide range of environments. You can read more in the documentation here: https://tianshou.org

There are some good tutorials on how to use the different modules of tianshou here: https://tianshou.org/en/stable/02_notebooks/L0_overview.html

Below we'll go over much of what those tutorials cover, but with a focus on the material in the ABE book.


## 1. Setting Up Your Data Science Environment with Conda

A **virtual environment** is an isolated workspace that lets you maintain separate sets of packages for different projects. This isolation means that changes made in one environment do not affect others, which helps avoid conflicts between package versions.

For example, one project might need an older version of a library while another requires the latest version. Virtual environments allow you to work on both projects without interference.

You can also export a list of installed packages from a virtual environment to create a reproducible setup. This file can be shared, so others can recreate the same environment for your project.

**Conda** is a popular package manager that simplifies creating and managing virtual environments. It lets you install, update, and remove packages, as well as easily switch between different environments.

Compared to Python’s default package manager, **pip**, conda includes a package solver that checks for compatibility between dependencies. This feature helps prevent conflicts and makes it simpler to install complex libraries that have many dependencies, including non-Python ones (for instance, libraries in C or C++). For example, deep learning frameworks like TensorFlow and PyTorch, which often rely on system-level libraries such as CUDA for GPU support, are easier to install and update with conda.

**Miniforge** offers a minimal installer for conda that is pre-configured to use the community-driven **conda-forge** channel and uses the **libmamba** solver by default for faster dependency resolution.

### Installing Conda using Miniforge

If you would like to download Miniforge and install Conda manually using a GUI, visit the [Miniforge releases page](https://github.com/conda-forge/miniforge/releases/latest) and download the appropriate installer for your system.

**Otherwise**, below are streamlined instructions to download Miniforge and install Conda using command-line tools for Windows, macOS, and Linux:

#### Windows

1. **Open Windows PowerShell as Administrator**:
   - Press `Win + X` and select **Windows PowerShell (Admin)**.
   - Or, search for `PowerShell`, right-click the result, and select **Run as administrator**.

2. **Download the Miniforge Installer**:
   - Run the following command to download the latest Miniforge installer:

In [None]:
# Download the Miniforge installer
Invoke-WebRequest -Uri "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Windows-x86_64.exe" -OutFile "$env:TEMP\Miniforge3-Windows-x86_64.exe"

3. **Install Conda Silently**:
   - Execute the installer with default settings:

In [None]:
# Install Miniforge silently to the user folder
Start-Process -FilePath "$env:TEMP\Miniforge3-Windows-x86_64.exe" -ArgumentList "/InstallationType=JustMe /RegisterPython=0 /S /D=$env:USERPROFILE\Miniforge3" -NoNewWindow -Wait

4. **Add Miniforge to the System PATH and Initialize Conda**:
   - After running the commands below, close and reopen PowerShell to apply the changes:

In [None]:
# Permanently add Miniforge directories to the user PATH
$currentUserPath = [Environment]::GetEnvironmentVariable("Path", "User")
$newEntries = "$env:USERPROFILE\Miniforge3\Scripts;$env:USERPROFILE\Miniforge3\condabin"
if ($currentUserPath -notlike "*$env:USERPROFILE\Miniforge3*") {
    $newUserPath = "$currentUserPath;$newEntries"
    [Environment]::SetEnvironmentVariable("Path", $newUserPath, "User")
}

# Initialize Conda for PowerShell
conda init powershell

5. **Remove Any Conflicting Alias for Conda**:
   - Open your AllHosts profile in Notepad, by running the following command:

In [None]:
notepad $PROFILE.CurrentUserAllHosts

   - At the end of the file, add the following line, then save and close the file, and restart PowerShell:

In [None]:
Remove-Item alias:conda -ErrorAction SilentlyContinue

#### macOS

1. **Open Terminal**:
   - Navigate to **Applications** > **Utilities** > **Terminal**.

2. **Download the Miniforge Installer**:
   - Use `curl` to download the installer:

In [None]:
# Download the Miniforge installer
curl -fsSLo Miniforge3-MacOSX-$(uname -m).sh "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-$(uname -m).sh"

3. **Run the Installer**:
   - Make the installer executable and run it:

In [None]:
# Make the installer executable
chmod +x Miniforge3-MacOSX-$(uname -m).sh

# Install Miniforge silently (batch mode)
./Miniforge3-MacOSX-$(uname -m).sh -b

4. **Permanently Add Miniforge to the System PATH**:
   - Depending on which shell you use, run one of the following commands:

In [None]:
# Permanently add Miniforge to PATH

# If using bash:
echo 'export PATH="$HOME/miniforge3/bin:$PATH"' >> ~/.bash_profile

# If using zsh:
echo 'export PATH="$HOME/miniforge3/bin:$PATH"' >> ~/.zshrc

5. **Initialize Conda**:
   - Run the following command to initialize Conda, then close and reopen Terminal to apply the changes:

In [None]:
# Initialize Conda for the current shell
~/miniforge3/bin/conda init "$(basename "${SHELL}")"

#### Linux

1. **Open Terminal**.

2. **Download the Miniforge Installer**:
   - Use `wget` to download the installer:

In [None]:
# Download the Miniforge installer
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-$(uname -m).sh" -O Miniforge3-Linux.sh

3. **Run the Installer**:
   - Make the installer executable and run it:

In [None]:
# Make the installer executable
chmod +x Miniforge3-Linux.sh

# Install Miniforge silently (batch mode)
./Miniforge3-Linux.sh -b

4. **Permanently Add Miniforge to the System PATH**:
   - Depending on which shell you use, run one of the following commands:

In [None]:
# Permanently add Miniforge to PATH

# If using bash:
echo 'export PATH="$HOME/miniforge3/bin:$PATH"' >> ~/.bashrc

# If using zsh:
echo 'export PATH="$HOME/miniforge3/bin:$PATH"' >> ~/.zshrc

5. **Initialize Conda**:
   - Run the following command to initialize Conda, then close and reopen Terminal to apply the changes:

In [None]:
# Initialize Conda for the current shell
~/miniforge3/bin/conda init "$(basename "${SHELL}")"

### Managing Conda Environments

#### Update Conda

Before creating a new environment, make sure conda is up to date: first update conda itself, then update all packages in the base environment:

In [None]:
conda update conda
conda update --all

#### Create a New Environment

To create a new environment named `ABE_tutorial_env` with a specific version of Python (3.12 in this case), use:

In [None]:
conda create --name ABE_tutorial_env python=3.12

#### Activate an Environment

To activate the environment, use:

In [None]:
conda activate ABE_tutorial_env

#### Additional Commands

- Deactivate an Environment: `conda deactivate`
- List all Environments: `conda env list`
- Clone an Environment: `conda create --name new_env --clone ABE_tutorial_env`
- Remove an Environment: `conda env remove --name ABE_tutorial_env`
- Export Environment to a File: `conda env export --name ABE_tutorial_env > environment.yml`
- Create Environment from File: `conda env create --file environment.yml`

#### Additional Resources

For more detailed information on conda, consider the following resources:

- **Official Documentation**: https://docs.conda.io/projects/conda/en/latest/

- **Cheat Sheet**: https://docs.conda.io/projects/conda/en/latest/user-guide/cheatsheet.html

## 2. Installing Required Packages

To install the required packages for this series of tutorials, first ensure that you have activated the `ABE_tutorial_env` environment. Then, run the following command:

In [None]:
conda install pytorch tianshou gymnasium mujoco cudatoolkit numpy tensorboard torchinfo matplotlib torchvision imageio imageio-ffmpeg pygame ipython ipykernel ipywidgets tqdm torchaudio

| **Package**       | **Description**                                                                                                                                                          | **Deep Reinforcement Learning Use Examples**                                                                                                                                                   |
|-------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **PyTorch**       | A deep learning library for efficient tensor computations and dynamic network construction.                                                                             | Constructing and training neural networks for policies and value functions in deep RL models.                                                                                                   |
| **Tianshou**      | A reinforcement learning library built on PyTorch that offers flexible algorithm implementations and training utilities.                                                 | Implementing RL algorithms such as DQN, PPO, and A2C using customizable training loops.                                                                                                        |
| **Gymnasium**     | A toolkit providing standardized environments and interfaces for developing and evaluating RL algorithms.                                                                  | Simulating diverse environments—from classic control tasks to Atari games—for training and benchmarking RL agents.                                                                                |
| **Mujoco**        | A high-fidelity physics engine designed for simulating continuous control tasks, particularly in robotics.                                                                 | Training agents in realistic continuous control scenarios like robotic locomotion and manipulation.                                                                                             |
| **CUDA Toolkit**  | A suite of tools for GPU acceleration that speeds up tensor computations and model training.                                                                              | Accelerating the training process of deep RL models, especially when handling large networks or complex simulation environments.                                                                |
| **NumPy**         | A numerical computing library that supports multi-dimensional arrays and matrices.                                                                                        | Managing state representations and performing batch computations required by RL algorithms.                                                                                                     |
| **TensorBoard**   | A visualization tool for tracking training metrics and model architectures during deep learning experiments.                                                               | Monitoring reward curves, loss trends, and network structures throughout deep RL training sessions.                                                                                            |
| **TorchInfo**     | A model inspection tool that provides detailed summaries of PyTorch neural network architectures, including layer outputs and parameter counts.                         | Debugging and verifying network architectures used in deep RL to ensure proper layer configurations and resource allocation.                                                                   |
| **Matplotlib**    | A plotting library for creating static, animated, and interactive visualizations.                                                                                        | Visualizing training progress and performance metrics of deep RL agents over time.                                                                                                                |
| **TorchVision**   | A library offering datasets, pre-trained models, and image transformation utilities focused on computer vision tasks.                                                      | Preprocessing and augmenting visual inputs—such as frames from game environments—for image-based RL tasks.                                                                                        |
| **Imageio**       | A library for reading and writing images in various formats, useful for managing simulation frames.                                                                       | Saving and processing individual image frames produced during RL experiments for further analysis or reporting.                                                                                 |
| **Imageio-ffmpeg**| A Python wrapper for FFMPEG that provides video encoding and decoding capabilities.                                                                                      | Recording simulation frames and converting them into video files to visually inspect agent performance during training.                                                                          |
| **Pygame**        | A set of modules for creating games and interactive graphical interfaces in Python.                                                                                      | Building custom RL environments or rendering real-time visualizations of agent behavior in simulation settings.                                                                                 |
| **IPython**       | An interactive computing environment offering an enhanced Python shell for prototyping and quick experimentation.                                                         | Prototyping RL code interactively and debugging algorithms step by step before deploying full-scale experiments.                                                                                 |
| **ipykernel**     | The IPython kernel that powers Jupyter notebooks, enabling interactive execution of Python code.                                                                          | Supporting dynamic RL experiments in notebook environments, which are ideal for iterative development and testing.                                                                               |
| **ipywidgets**    | A collection of interactive HTML widgets that facilitate the creation of dynamic user interfaces in Jupyter notebooks.                                                     | Creating interactive dashboards and controls (like sliders) for real-time hyperparameter tuning and monitoring during RL training.                                                                 |
| **tqdm**          | A fast, extensible progress bar library that provides visual feedback for long-running loops.                                                                              | Tracking the progress of training episodes or iterations in deep RL experiments, offering immediate insight into execution status.                                                              |
| **TorchAudio**    | A library for loading, transforming, and analyzing audio data.                                                                                                           | Processing auditory signals when RL agents incorporate sound cues or require audio-based feedback as part of their input state.                                                                    |

In [None]:
pip install torchsummary

This command installs:

- **torchsummary**: Provides a summary of PyTorch model architectures.

Using `pip` within a conda environment is common, but it's important to remember that `pip` and `conda` manage packages differently. To avoid conflicts, it's best to use `conda` whenever possible.

### Check Installed Packages

#### List Installed Packages

To verify that the packages are installed correctly, activate the `ABE_tutorial_env` environment. Then, you can list the installed packages:

In [None]:
conda list

This command displays all packages installed in the current environment.

#### Check PyTorch and Tianshou Installation

Make sure that this notebook is running in the `ABE_tutorial_env` environment. Then, run the following code to check the PyTorch installation:

In [None]:
import torch
print(torch.__version__)
print(torch.cuda.is_available())

If PyTorch is installed correctly, you should see the version number and whether PyTorch has access to your GPU.

To check the Tianshou installation, run the following code:

In [None]:
import tianshou
print(tianshou.__version__)

If Tianshou is installed correctly, you should see the version number.
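
If you'd like to check several packages at once, here is a minimal sketch that uses only the Python standard library. The helper name `check_modules` and the list of module names are our own choices for illustration. Note that it looks for *import* names (e.g., `torch`), which can differ from the conda package names (e.g., `pytorch`).

```python
from importlib.util import find_spec

# Import names (not conda package names) for a few of the packages installed above.
MODULES = ["torch", "tianshou", "gymnasium", "numpy", "matplotlib"]

def check_modules(names):
    """Return {module_name: True/False} depending on whether the module can be found."""
    return {name: find_spec(name) is not None for name in names}

for name, found in check_modules(MODULES).items():
    print(f"{name:12s} {'OK' if found else 'MISSING'}")
```

Any module reported as `MISSING` is not visible to the Python interpreter running this notebook, which usually means the notebook is not running in the `ABE_tutorial_env` environment.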

## 3. Train our first RL agent

For this first example we'll focus just on the pieces that need to be in place for us to train an RL agent. Later as we move through the tutorials we'll learn more about each of the pieces and even start to customize some of them!

But for now let's use existing RL algorithms and some existing environments to just see how it all works together.

Import some libraries:

* **gymnasium** will have some environments for us to use (https://gymnasium.farama.org/)
* **torch** will let us build some neural networks (https://pytorch.org/)
* **TensorBoard** will let us see how well our agent is doing


In [None]:
import gymnasium as gym
import torch
from torch.utils.tensorboard import SummaryWriter
import tianshou as ts
import pygame

Generate a unique timestamped ID for our agent so that we can keep track of it in TensorBoard and compare to other agents later. Then we create subdirectories for the logs and models that we'll save during training.

In [None]:
import os
from datetime import datetime

# Timestamped ID for this run to avoid overwriting previous runs and to keep track of different runs
agent_id = datetime.now().strftime("%Y%m%d_%H%M%S")  # Format: YYYYMMDD_HHMMSS

first_dqn_dir = os.path.join("dqn", agent_id)
os.makedirs(first_dqn_dir, exist_ok=True) # Ensure the directory exists
print(f"All files for this run will be saved in the directory: {first_dqn_dir}")

logs_dir = os.path.join(first_dqn_dir, "logs")
os.makedirs(logs_dir, exist_ok=True) # Ensure the directory exists
print(f"Tensorboard logs will be saved in the directory: {logs_dir}")

models_dir = os.path.join(first_dqn_dir, "models")
os.makedirs(models_dir, exist_ok=True) # Ensure the directory exists
print(f"Models will be saved in the directory: {models_dir}")

Let's then start a "logger" so we can see what is going on. The code below creates a TensorBoard logger that writes to the logs directory we just created. We will use it to store summary statistics of our agent's performance.

In [None]:
logger = ts.utils.TensorboardLogger(SummaryWriter(logs_dir))
print(f"TensorBoard logs are being saved in: {logs_dir}")

### Set up an environment

To start, let's create an instance of the CartPole environment from gymnasium. This is a simple environment where the agent must balance a pole on a cart by moving the cart left or right.

In this environment, the agent receives a reward of +1 for each time step the pole remains upright. The episode ends if either:

- the pole falls over (angle > 12 degrees from vertical)
- the cart moves too far to the left or right (position > 2.4 units from center)
- the episode length is greater than 500

In [None]:
CartPole_env = gym.make("CartPole-v1")
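
The three end-of-episode rules listed above can be written out as a small check. This is only a sketch of the rules as described in this tutorial, not Gymnasium's own code: the function name `episode_status` is hypothetical, and we assume the step limit is hit once 500 steps have been taken. Gymnasium reports the first two rules as `terminated` and the step limit as `truncated`, so the sketch returns both flags.

```python
import math

X_LIMIT = 2.4                    # cart position limit (units from center)
THETA_LIMIT = math.radians(12)   # pole angle limit, converted to radians
MAX_STEPS = 500                  # step limit for CartPole-v1 (assumed to trigger at 500)

def episode_status(cart_position, pole_angle, step_count):
    """Return (terminated, truncated) for the state reached after step_count steps."""
    terminated = abs(cart_position) > X_LIMIT or abs(pole_angle) > THETA_LIMIT
    truncated = step_count >= MAX_STEPS
    return terminated, truncated

print(episode_status(0.0, 0.05, 10))   # pole nearly upright, cart centered
print(episode_status(2.5, 0.0, 10))    # cart has moved too far
print(episode_status(0.0, 0.0, 500))   # step limit reached
```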

**What is the agent able to see in this environment?**

To find out, we can check the observation space of the environment.

In [None]:
print(CartPole_env.observation_space)
print(type(CartPole_env.observation_space))
print(CartPole_env.observation_space.dtype)
print(CartPole_env.observation_space.shape)

We see that the observation space is a Box space with four dimensions. This means that the agent receives a four-dimensional observation at each time step.

In [None]:
print(CartPole_env.observation_space.low)
print(CartPole_env.observation_space.high)

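This shows us the lower and upper bounds of the observation space, so every observation the agent receives falls within these bounds. Note that the position and angle bounds are wider than the termination thresholds listed earlier (2.4 units and 12 degrees), so an episode ends before those values reach their limits.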

Each of the four dimensions corresponds to a different aspect of the environment:

1. The cart's horizontal position (from -4.8 to 4.8)
2. The cart's velocity (from -Inf to Inf)
3. The pole's angle (from ~ -0.418 rad (-24°) to ~ 0.418 rad (24°))
4. The pole's angular velocity (from -Inf to Inf)

When the environment is reset at the beginning of each episode, each of these four values is initialized to a random value between -0.05 and 0.05.
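
To get a feel for what a starting observation looks like, here is a small stand-in for the reset step, written in plain Python. It is an illustration of the rule just described rather than the environment's actual implementation, and `sample_initial_state` is a hypothetical helper name.

```python
import random

def sample_initial_state(rng):
    """Draw the four CartPole state variables uniformly from [-0.05, 0.05]."""
    return [rng.uniform(-0.05, 0.05) for _ in range(4)]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
state = sample_initial_state(rng)
print(state)             # [position, velocity, angle, angular velocity]
```

Every episode therefore starts with the cart near the center and the pole almost, but not exactly, upright.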

We could also ask: **What actions can the agent take in this environment?**

To find out, we can check the action space of the environment.

In [None]:
print(CartPole_env.action_space)

We learn that the agent can take one of two discrete actions at each time step:

- Action 0: Move the cart to the left
- Action 1: Move the cart to the right
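
Before we train anything, it helps to see what "mapping observations to actions" means in the simplest possible case. The sketch below is a hand-written rule, not a learned policy: it pushes the cart toward the side the pole is leaning. The name `lean_policy` is hypothetical, and we assume that a positive pole angle means the pole is leaning to the right.

```python
def lean_policy(observation):
    """Push the cart toward the side the pole is leaning to.

    observation = [cart position, cart velocity, pole angle, pole angular velocity]
    Returns 0 (push left) or 1 (push right).
    """
    pole_angle = observation[2]
    return 1 if pole_angle > 0 else 0

print(lean_policy([0.0, 0.0, 0.03, 0.0]))    # pole leans right
print(lean_policy([0.0, 0.0, -0.03, 0.0]))   # pole leans left
```

The agent we build next replaces this fixed rule with a neural network whose weights are learned from experience.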

### Set up an agent

Let's start building our agent. 

To start off let's build a neural network that takes what the agent observes and converts that into actions.

In [None]:
# Import the network class and utility function for flattening spaces.
from tianshou.utils.net.common import Net
from gymnasium.spaces.utils import flatdim

# Select the appropriate device: CUDA (NVIDIA GPUs), MPS (Apple GPUs), or CPU.
# AMD GPUs with ROCm support are also accessed using the 'cuda' device string.
device = torch.device("cuda" if torch.cuda.is_available() else
                      "mps" if torch.backends.mps.is_available() else
                      "cpu")
print(f"Using device: {device}")

# Get observation and action dimensions from the CartPole environment.
state_shape = flatdim(CartPole_env.observation_space)   # Total number of elements in the observation space.
action_shape = flatdim(CartPole_env.action_space)       # Total number of elements in the action space (usually the number of discrete actions).

# Build a network that maps observations to action values.
# Available parameters:
#   - state_shape: Dimension of the flattened observation.
#   - action_shape: Dimension of the action output (number of actions).
#   - hidden_sizes: List defining the sizes of hidden layers (adjust for model capacity).
#   - device (optional): Computation device ('cpu', 'cuda', or 'mps').
#   - activation (optional): Activation function between layers (default: torch.nn.ReLU).

net = Net(
    state_shape=state_shape,        # Input dimension (flattened observation).
    action_shape=action_shape,      # Output dimension (number of actions).
    hidden_sizes=[128, 128, 128],   # Hidden layer sizes; change to adjust model complexity.
    device=device,                  # Device for computations; selected automatically above.
    # activation=torch.nn.ReLU      # Activation function; consider torch.nn.Tanh for different behavior.
)

# Print the network architecture.
print(net)

The `hidden_sizes` argument above sets the number of nodes in each hidden layer of the neural network. These hidden layers sit between the input (the 4 observed values of the current state) and the output (one value for each of the 2 discrete actions).
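
To see how the layer sizes translate into model size, we can count the weights by hand. This sketch assumes the network is a plain fully connected network with a bias term in every layer, which is what the printed architecture should show. `mlp_param_count` is a hypothetical helper, not part of Tianshou.

```python
def mlp_param_count(input_dim, hidden_sizes, output_dim):
    """Count the weights and biases in a fully connected network."""
    sizes = [input_dim, *hidden_sizes, output_dim]
    total = 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus bias vector
    return total

print(mlp_param_count(4, [128, 128, 128], 2))   # the network used in this tutorial
print(mlp_param_count(4, [64, 64], 2))          # a smaller alternative
```

Under these assumptions, the three hidden layers of 128 nodes give 33,922 trainable parameters, while two hidden layers of 64 nodes would give 4,610.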

Now we'll need to build an optimizer to allow our agent to learn! This optimizer will adjust the weights in the neural network to link observations to actions that lead to more rewards.

In the case of the CartPole environment, the rewards are the steps where the pole is held upright (i.e., less than 12 degrees from vertical).

We'll use a pre-built optimizer called Adam, with a learning rate of 0.001. The learning rate determines how quickly the agent adapts its weights during each step. This is a hyperparameter that we will return to later in the book.
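
To build some intuition for the learning rate, here is a toy example that uses plain gradient descent (a simpler method than Adam) to move a single weight `w` toward a target value of 3. The function and its default values are made up for illustration.

```python
def gradient_descent(lr, steps=50, start=0.0, target=3.0):
    """Minimize (w - target)**2 with plain gradient descent."""
    w = start
    for _ in range(steps):
        grad = 2.0 * (w - target)   # derivative of (w - target)**2 with respect to w
        w -= lr * grad              # step against the gradient, scaled by the learning rate
    return w

for lr in (0.001, 0.1):
    print(f"lr={lr}: w after 50 steps = {gradient_descent(lr):.4f}")
```

With the larger learning rate the weight gets very close to 3 within 50 steps, while with the smaller one it has barely moved. Larger is not always better, though: on harder problems a learning rate that is too high can make training unstable.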

In [None]:
# Create an Adam optimizer to update the network parameters.

# Available parameters for torch.optim.Adam:
#   - params: Iterable of parameters to optimize.
#   - lr: Learning rate (default: 0.001 here).
#   - eps (optional): Term added for numerical stability (default: 1e-08).
#   - weight_decay (optional): L2 penalty (default: 0).
#   - amsgrad (optional): Boolean flag to enable AMSGrad variant (default: False).

optim = torch.optim.Adam(
    net.parameters(),       # Parameters of the network to optimize.
    lr=0.001,               # Learning rate; adjust to speed up or slow down convergence.
    # eps=1e-08,            # Small constant for numerical stability.
    # weight_decay=0,       # L2 regularization factor to prevent overfitting.
    # amsgrad=False         # Whether to use the AMSGrad variant of Adam.
)

# Print the optimizer details.
print(optim)

Now that we have a network and an optimizer let's define a policy that will control how learning takes place.

> The discount factor is how much the agent takes into account future rewards vs. immediate rewards. A choice of 0.9 suggests that the agent should prioritize future rewards, while a choice of 0.1 suggests the agent should prioritize immediate rewards.

> estimation_step is how many steps into the future the agent should look when calculating the value of different actions.

> target_update_freq is how many network updates to perform between refreshes of the target network, a periodically synced copy of the Q-network that is used to compute the learning targets.

Some of these parameters are specific to the RL algorithm we are using here (i.e., estimation_step, and target_update_freq).
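
The discount factor and estimation_step are easier to understand with numbers. The sketch below computes a discounted sum of rewards, and then an n-step target that adds a discounted value estimate for the state reached after the last step. This is the general idea behind n-step returns, written in plain Python; it is not Tianshou's implementation, and the function names and the `bootstrap_value` of 10 are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its delay in steps."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def n_step_target(rewards, gamma, bootstrap_value):
    """n-step return: n observed rewards plus a discounted value estimate.

    rewards         -- the n rewards observed after taking the action
    bootstrap_value -- estimated value of the state reached after n steps
    """
    n = len(rewards)
    return discounted_return(rewards, gamma) + gamma ** n * bootstrap_value

rewards = [1.0, 1.0, 1.0]   # CartPole gives +1 per step; 3 steps matches estimation_step=3
print(discounted_return(rewards, gamma=0.9))    # future rewards still count for a lot
print(discounted_return(rewards, gamma=0.1))    # future rewards are almost ignored
print(n_step_target(rewards, gamma=0.9, bootstrap_value=10.0))
```

With a discount factor of 0.9 the three rewards are worth about 2.71 in total, but with 0.1 they are worth only about 1.11, because the second and third rewards are heavily discounted.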

In [None]:
# Create the DQN policy.

# Available parameters for ts.policy.DQNPolicy:
#   - model: The Q-network approximating state-action values.
#   - optim: Optimizer for the network parameters.
#   - discount_factor: Gamma; balances immediate vs. future rewards.
#   - action_space: Informs the policy of available actions.
#   - estimation_step: Number of steps for n-step return calculations.
#   - target_update_freq: Steps between updating the target network (ensures stability).
#   - reward_normalization (optional): Flag to normalize rewards.

policy = ts.policy.DQNPolicy(
    model=net,                                  # Q-network approximating state-action values.
    optim=optim,                                # Network optimizer; adjust learning rate or try a different optimizer if needed.
    discount_factor=0.9,                        # Gamma; values near 1 favor long-term rewards.
    action_space=CartPole_env.action_space,     # Provides the policy with information about available actions.
    estimation_step=3,                          # Number of steps in n-step returns; higher values may incorporate more future rewards.
    target_update_freq=320,                     # Frequency (in steps) to update the target network; higher values yield more stable targets.
    # reward_normalization=False,               # Whether to normalize rewards; set True if rewards have high variance.
)

# Print the policy details.
print(policy)
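
To make target_update_freq more concrete, here is a toy sketch of the idea: one set of weights changes at every learning step, while a second, frozen copy is only refreshed every `freq` updates. The class `TargetSync` and its stand-in "gradient update" (adding a constant to each weight) are hypothetical and only meant to show the timing, not how Tianshou implements it.

```python
class TargetSync:
    """Keep a frozen copy of the weights, refreshed every `freq` updates."""

    def __init__(self, weights, freq):
        self.online = list(weights)   # updated at every learning step
        self.target = list(weights)   # frozen copy used to compute learning targets
        self.freq = freq
        self.updates = 0

    def learn_step(self, delta):
        """Apply a stand-in gradient update, then sync the target copy if it is due."""
        self.online = [w + delta for w in self.online]
        self.updates += 1
        if self.updates % self.freq == 0:
            self.target = list(self.online)

sync = TargetSync(weights=[0.0, 0.0], freq=320)
for _ in range(319):
    sync.learn_step(delta=1.0)
print(sync.target)    # still the initial weights
sync.learn_step(delta=1.0)
print(sync.target)    # now matches the online weights
```

Keeping the targets fixed between syncs gives the learning network a steadier goal to move toward.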

Now let's set up a collector to feed observations to the policy as the agent interacts with its environment.

> We'll add a test collector that will run tests periodically to see how well our agent is performing.

In [None]:
# Setup the training and testing data collectors.

# Available parameters for ts.data.Collector:
#   - policy: The policy used to interact with the environment.
#   - env: The environment from which to collect experiences.
#   - buffer: Memory buffer for storing experiences. Here, VectorReplayBuffer is used for vectorized environments.
#   - exploration_noise (optional): Boolean that enables the policy's exploration strategy during action selection; for DQN this is epsilon-greedy.
#   - preprocess_fn (optional): Function to preprocess data before storing.

train_collector = ts.data.Collector(
    policy,                                 # Policy used to collect training experiences.
    CartPole_env,                           # Environment instance for training.
    ts.data.VectorReplayBuffer(10000, 1),   # Replay buffer with capacity 10,000 and 1 environment; adjust capacity if needed.
    exploration_noise=True                  # Enables the policy's exploration strategy; for DQN this is epsilon-greedy action selection.
)

# Set up the test collector for evaluation.
# With exploration_noise=True, test episodes also use epsilon-greedy, with whatever epsilon value is set for testing.
test_collector = ts.data.Collector(
    policy,                         # Policy used for evaluation.
    CartPole_env,                   # Environment instance for testing.
    exploration_noise=True          # Epsilon-greedy during testing; a small epsilon keeps evaluation close to deterministic.
)
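
The replay buffer used by the training collector can be pictured as a fixed-size memory: new experiences go in, the oldest ones fall out once it is full, and training draws random batches from it. Here is a minimal sketch of that idea in plain Python. `TinyReplayBuffer` is a hypothetical stand-in and is much simpler than Tianshou's `VectorReplayBuffer`.

```python
import random
from collections import deque

class TinyReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done):
        """Store one transition (one step of experience)."""
        self.storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size, rng=random):
        """Return batch_size transitions chosen uniformly at random."""
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)

buffer = TinyReplayBuffer(capacity=10)
for step in range(15):   # add 15 transitions to a buffer that holds 10
    buffer.add(obs=step, action=0, reward=1.0, next_obs=step + 1, done=False)

print(len(buffer))                               # only the 10 newest transitions remain
print(buffer.sample(4, rng=random.Random(0)))    # a random batch of 4 transitions
```

Sampling at random, rather than replaying experiences in order, means each training batch mixes experiences from different moments and different episodes.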

Now that we have:

1. An environment
2. A Policy with a network model and an optimizer
3. A collector to store the agent experiences

We can now train our agent!

We'll use something called an off-policy trainer for now. An off-policy trainer stores the agent's experiences in the replay buffer and trains the neural network on batches sampled from that buffer, so the agent can keep learning from experiences gathered under older versions of its policy. Our DQN policy adds a second mechanism that helps with the stability of learning: the learning targets are computed with a separate copy of the network (the target network) that is only updated periodically, as set by target_update_freq. In a few tutorials we'll see on-policy trainers, where the agent learns only from experiences collected by its current policy.

Parameters:
> max_epoch is the maximum number of training epochs to run before training stops.

> step_per_epoch is the number of environment steps (actions) the agent takes per epoch.

> step_per_collect is the number of environment steps to collect into the replay buffer (a list of stored experiences) before each round of network updates.

> episode_per_test is the number of episodes to run during the test phase at the end of each epoch. This estimates how much our agent has learned.

> batch_size is the number of experiences sampled from the replay buffer for each update when training the neural network model.

> train_fn is a function that is called at the start of each training epoch. Here it sets the eps (epsilon) parameter to 0.1, telling the agent to try an exploratory (random) action 10% of the time rather than the action it currently thinks is best.

> test_fn is the same as train_fn, but it is called at the start of each test phase.

> stop_fn is a function that stops the training when its condition is met. Here it stops when the average test reward reaches a specific threshold.
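
Two of these ideas are easy to sketch in plain Python: the epsilon-greedy rule that eps controls, and a stopping rule based on a reward threshold. Both functions below are illustrations with made-up values, not Tianshou code. In particular, the threshold of 475 and the action values in `q_values` are assumptions chosen for the example.

```python
import random

def epsilon_greedy(q_values, eps, rng):
    """Pick a random action with probability eps, otherwise the highest-valued action."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def stop_fn(mean_rewards, threshold=475.0):
    """Return True once the average test reward reaches the threshold."""
    return mean_rewards >= threshold

rng = random.Random(0)
q_values = [0.2, 0.8]   # hypothetical action values for "left" and "right"
actions = [epsilon_greedy(q_values, eps=0.1, rng=rng) for _ in range(1000)]
print(f"fraction of greedy actions: {actions.count(1) / len(actions):.2f}")
print(stop_fn(480.0), stop_fn(200.0))
```

Note that with eps set to 0.1 the agent picks the best-looking action a bit more than 90% of the time, because a random pick will sometimes land on the best action too.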


In [None]:
import os
import shutil
import subprocess
import tempfile
import time

import torch
import numpy as np
from torch.utils.tensorboard import SummaryWriter
from tianshou.data import ReplayBuffer, Collector, Batch
import tianshou as ts
from tqdm.notebook import tqdm
from IPython.display import IFrame, display

def kill_port(port):
    """
    Terminates any processes that are listening on the specified port.
    Works on both Unix-based systems and Windows.
    """
    try:
        if os.name == 'nt':
            # Windows: Use netstat and taskkill to kill processes on the given port.
            # The command below might fail (exit status 1) if no process is found.
            cmd = f'for /f "tokens=5" %a in (\'netstat -aon ^| findstr :{port}\') do taskkill /F /PID %a'
            result = subprocess.run(cmd, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            print(f"Killed processes on port {port}.")
        else:
            # Unix (Linux/Mac): Use lsof to find processes on the port and kill them.
            cmd = f"lsof -ti:{port} | xargs kill -9"
            result = subprocess.run(cmd, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            print(f"Killed processes on port {port}.")
    except subprocess.CalledProcessError as e:
        # If the error message indicates that no process was found, we can ignore it.
        if "returned non-zero exit status 1" in str(e):
            pass
        else:
            print(f"Could not kill process on port {port}: {e}")

# Kill any processes on port 6006 to ensure it is free.
kill_port(6006)

# Clear previous TensorBoard sessions (cross-platform)
tensorboard_info = os.path.join(tempfile.gettempdir(), ".tensorboard-info")
if os.path.exists(tensorboard_info):
    shutil.rmtree(tensorboard_info)

# Launch TensorBoard in the background on port 6006.
tb_command = [
    "tensorboard",
    "--logdir", logs_dir,
    "--port", "6006",
    "--host", "localhost",
    "--reload_interval", "30"
]
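# Discard TensorBoard's output: PIPE buffers that are never read can fill up and stall the process.
tb_process = subprocess.Popen(tb_command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)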

# Allow time for TensorBoard to start and display its dashboard.
time.sleep(5)
display(IFrame(src="http://localhost:6006", width="100%", height="800"))

#------------------------------------------------------------------------------ 

# Start training using OffpolicyTrainer.

# Available parameters for ts.trainer.OffpolicyTrainer:
#   - policy: The policy instance to be trained.
#   - train_collector: Collector that gathers training experiences.
#   - test_collector: Collector that gathers evaluation data.
#   - max_epoch: Maximum number of epochs for training.
#   - step_per_epoch: Number of environment steps per epoch.
#   - step_per_collect: Steps to collect between each policy update.
#   - episode_per_test: Number of episodes to run during each test phase.
#   - batch_size: Mini-batch size for sampling from the replay buffer.
#   - update_per_step: Ratio of gradient updates per collected environment step.
#   - train_fn: Function executed during training (e.g., to adjust epsilon for exploration).
#   - test_fn: Function executed during testing (e.g., to adjust epsilon).
#   - stop_fn: Function to decide when to stop training (e.g., when reward threshold is met).
#   - logger: Logger to capture training progress and metrics.
#   - save_checkpoint_fn: Function to save training checkpoints (default: None).

trainer = ts.trainer.OffpolicyTrainer(
    policy=policy,                                                                      # DQN policy to be trained.
    train_collector=train_collector,                                                    # Collector for training experiences.
    test_collector=test_collector,                                                      # Collector for evaluation data.
    max_epoch=5,                                                                        # Total training epochs; increase for longer training.
    step_per_epoch=10000,                                                               # Number of environment steps per epoch.
    step_per_collect=100,                                                               # Steps to collect between each update; lower values mean more frequent updates.
    episode_per_test=10,                                                                # Episodes to run during evaluation; adjust for reliable performance metrics.
    batch_size=64,                                                                      # Mini-batch size for training updates; modify based on memory and stability.
    update_per_step=1 / 10,                                                             # Ratio of gradient updates per collected environment step.
    train_fn=lambda epoch, env_step: policy.set_eps(0.1),                               # Function to adjust training parameters; here setting epsilon to 0.1.
    test_fn=lambda epoch, env_step: policy.set_eps(0.05),                               # Function to adjust evaluation parameters; here lowering epsilon to 0.05.
    stop_fn=lambda mean_rewards: mean_rewards >= CartPole_env.spec.reward_threshold,    # Stops training when average reward meets/exceeds the environment's threshold.
    logger=logger,                                                                      # Logger for tracking and recording training progress.
    # save_checkpoint_fn=None                                                           # Optional function to save training checkpoints.
).run()

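# Print the full training summary.
# Note: .run() returns the training statistics, so `trainer` holds that result
# (assumed here to be dict-like), not the OffpolicyTrainer object itself.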
print("\nTraining Summary:\n")
for key, value in trainer.items():
    print(f"{key}: {value}")

Once the code has finished you can save the trained agent to be used later!

In [None]:
import os
import torch

# Save the model's state dictionary for future use.
model_path = os.path.join(models_dir, f"{first_dqn_dir}.pth")
torch.save(net.state_dict(), model_path)
print(f"Model saved to {model_path}")

You can then load the agent.

In [None]:
import os
import torch

# Select the appropriate device: CUDA (NVIDIA GPUs), MPS (Apple GPUs), or CPU.
# AMD GPUs with ROCm support are also accessed using the 'cuda' device string.
device = torch.device("cuda" if torch.cuda.is_available() else
                      "mps" if torch.backends.mps.is_available() else
                      "cpu")
print(f"Using device: {device}")

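# Load the trained weights.
# The file holds net.state_dict() (saved above), so load it back into `net`.
# Assuming `net` is the network that was passed to the policy, the policy uses the loaded weights too.
model_path = os.path.join(models_dir, f"{first_dqn_dir}.pth")
net.load_state_dict(torch.load(model_path, map_location=device))
print("Model loaded successfully!")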

Let's see how this works by watching the trained agent in a new environment.


In [None]:
# Create the environment with rendering enabled.
# The 'render_mode' set to "human" allows you to visually inspect the agent's performance.
env = gym.make("CartPole-v1", render_mode="human")

# Reset the environment to start with an initial state.
# The returned state is not used further, but the call ensures the environment is ready.
env.reset()

# Set the policy to evaluation mode.
# This ensures that layers like dropout or batch normalization work in inference mode.
policy.eval()

# Create a Collector that manages the interaction between the policy and the environment.
# For evaluation, we disable extra exploration by setting 'exploration_noise' to False.
collector = ts.data.Collector(policy, env, exploration_noise=False)

# Testing parameters:
n_episodes = 20         # Total number of episodes to run for testing.
frame_rate = 1 / 60     # Render delay between frames (60 frames per second).

# Collect data for the specified number of episodes.
# The 'render' parameter controls the visualization speed.
results = collector.collect(n_episode=n_episodes, render=frame_rate)

# Extract performance data from the results.
# 'rews' is expected to be a list or array of total rewards per episode.
# 'lens' is expected to be a list or array of the episode lengths (number of steps).
episode_rewards = results.get("rews", [])
episode_lengths = results.get("lens", [])

# Close the environment to free up resources and close any open rendering windows.
env.close()

# Convert lists to numpy arrays for statistical operations.
episode_rewards = np.array(episode_rewards)
episode_lengths = np.array(episode_lengths)

# Compute and print performance statistics if data has been collected.
if episode_rewards.size > 0 and episode_lengths.size > 0:
    # Calculate statistics for episode rewards.
    count_rewards   = len(episode_rewards)
    mean_rewards    = np.mean(episode_rewards)
    std_rewards     = np.std(episode_rewards)
    min_rewards     = np.min(episode_rewards)
    p25_rewards     = np.percentile(episode_rewards, 25)
    median_rewards  = np.median(episode_rewards)
    p75_rewards     = np.percentile(episode_rewards, 75)
    max_rewards     = np.max(episode_rewards)
    
    # Calculate statistics for episode lengths.
    count_lengths   = len(episode_lengths)
    mean_lengths    = np.mean(episode_lengths)
    std_lengths     = np.std(episode_lengths)
    min_lengths     = np.min(episode_lengths)
    p25_lengths     = np.percentile(episode_lengths, 25)
    median_lengths  = np.median(episode_lengths)
    p75_lengths     = np.percentile(episode_lengths, 75)
    max_lengths     = np.max(episode_lengths)
    
    # Print the summary table.
    print("Final Evaluation Performance Summary:")
    print(f"Total Episodes Evaluated: {n_episodes}\n")
    header = "{:<22} {:>15} {:>20}".format("Statistic", "Rewards", "Episode Lengths")
    print(header)
    print("-" * len(header))
    print("{:<22} {:>15d} {:>20d}".format("Count", count_rewards, count_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("Mean", mean_rewards, mean_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("Std Dev", std_rewards, std_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("Min", min_rewards, min_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("25th Percentile", p25_rewards, p25_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("Median", median_rewards, median_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("75th Percentile", p75_rewards, p75_lengths))
    print("{:<22} {:>15.2f} {:>20.2f}".format("Max", max_rewards, max_lengths))
else:
    print("No performance data was collected. Please verify the Collector configuration.")

How does your trained agent do? Was it better than you at keeping the pole upright?
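
One way to answer that is to compare the mean episode reward with the environment's reward threshold, the same criterion `stop_fn` uses during training. The sketch below is plain Python and does not touch the collector. The reward list is made-up example data, `summarize` is a hypothetical helper, and 475.0 is the reward threshold registered for CartPole-v1 (check `CartPole_env.spec.reward_threshold` to confirm it for your version).

```python
from statistics import mean

# Reward threshold registered for CartPole-v1 (episodes are capped at 500 steps).
CARTPOLE_V1_REWARD_THRESHOLD = 475.0


def summarize(episode_rewards, threshold):
    """Summarize per-episode rewards and flag whether the mean meets the threshold."""
    average = mean(episode_rewards)
    return {
        "mean": average,
        "best": max(episode_rewards),
        "worst": min(episode_rewards),
        "meets_threshold": average >= threshold,
    }


# Made-up rewards for illustration; replace with the rewards collected above.
example_rewards = [500.0, 500.0, 431.0, 500.0, 488.0]
summary = summarize(example_rewards, CARTPOLE_V1_REWARD_THRESHOLD)
for name, value in summary.items():
    print(f"{name}: {value}")
```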

**Things to try**

> Change the environment to another classic control environment

>> Go to https://gymnasium.farama.org/environments/classic_control/

>> Choose another environment, for example Mountain Car or Acrobot. It has to have discrete actions to work with the RL algorithm we are using here! Later we'll look in more depth at other algorithms that can handle both discrete and continuous actions.

>> Use the code above as a guide and try to train an agent on one of these new environments below!
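
When switching environments, three things in the code above typically need to change: the environment ID, the network's input and output sizes, and the reward threshold used by the stopping rule. The sketch below records these for three classic control tasks as listed in the Gymnasium documentation at the time of writing. Treat the numbers as values to double-check against `env.observation_space`, `env.action_space`, and `env.spec.reward_threshold`. The helper `layer_sizes` and the hidden layer sizes are hypothetical.

```python
# Documented properties of three discrete-action classic control environments.
# Verify them against the environment objects before relying on them.
ENV_SPECS = {
    "CartPole-v1":    {"obs_dim": 4, "n_actions": 2, "reward_threshold": 475.0},
    "MountainCar-v0": {"obs_dim": 2, "n_actions": 3, "reward_threshold": -110.0},
    "Acrobot-v1":     {"obs_dim": 6, "n_actions": 3, "reward_threshold": -100.0},
}


def layer_sizes(env_id, hidden_sizes=(128, 128)):
    """Return [input, hidden..., output] sizes for a Q-network on the given environment."""
    spec = ENV_SPECS[env_id]
    return [spec["obs_dim"], *hidden_sizes, spec["n_actions"]]


for env_id, spec in ENV_SPECS.items():
    print(f"{env_id}: layers={layer_sizes(env_id)}, "
          f"reward_threshold={spec['reward_threshold']}")
```

Mountain Car and Acrobot both give a negative reward on every step, which is why their thresholds are negative: the agent is rewarded for finishing quickly.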



In [None]:
# Try out another environment

## Conclusions

After this tutorial you should be able to train an agent in some environments with discrete actions using tianshou. Next, we'll dig deeper into each of the sections above to get a better sense of how each part works, and move the training closer to continuous learning within a single lifetime, which better reflects the challenges animals face when learning to behave in their environments.