# Notebook 2: Model Building and Training

**Objective:** The goal of this notebook is to define a custom Convolutional Neural Network (CNN) using PyTorch and then use it to train a Reinforcement Learning agent with Stable Baselines3. The agent will learn to balance the pole in the custom visual environment.

We will perform four main tasks:
1.  **Set up the environment** and import the custom `ImageWrapper`.
2.  **Define a custom CNN** feature extractor using PyTorch.
3.  **Configure a PPO agent** from Stable Baselines3 to use the network.
4.  **Run the training loop** and save the final, trained model.

In [2]:
# ==============================================================================
# SETUP AND INSTALLATIONS
# ==============================================================================
from google.colab import drive
import os

# 1. Mount Google Drive
drive.mount('/content/drive')

# 2. Define Project Path and Create Symlink
PROJECT_PATH = '/content/drive/My Drive/PortfolioProjects/visual-cart-pole-rl'

# Create a symlink for easy access
if not os.path.exists('/project'):
    !ln -sfn '{PROJECT_PATH}' /project
else:
    print("Symlink '/project' already exists.")

# 3. Install System Dependencies
!apt-get update && apt-get install -y swig

# 4. Install Python Libraries
!pip install "gymnasium[box2d]" "stable-baselines3[extra]" opencv-python-headless -q

print("✅ Setup complete. Environment is ready.")

Mounted at /content/drive

## Step 1: Imports and Environment Setup 📚

First, we import the libraries this notebook needs, including the custom `ImageWrapper` from the `utils/env_preprocessing.py` file we created. We also add the project directory to the system path so that the import works reliably in Colab.

In [3]:
# ==============================================================================
# IMPORTS AND ENVIRONMENT SETUP
# ==============================================================================
import sys
import gymnasium as gym
import torch
import torch.nn as nn
from stable_baselines3 import PPO
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecTransposeImage

# Add the project directory to the system path to ensure imports work
if '/project' not in sys.path:
    sys.path.append('/project')

# Import our custom classes
from utils.env_preprocessing import ImageWrapper

print("Imports complete and custom wrapper is ready to be used.")



Imports complete and custom wrapper is ready to be used.


## Step 2: Define a Custom CNN Feature Extractor 🧠

Stable Baselines3's `CnnPolicy` is powerful, but to gain more control over the architecture we will design our own feature extractor. We will build a custom CNN using PyTorch that inherits from `BaseFeaturesExtractor`.

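Our network consists of three convolutional layers followed by a fully connected layer. The kernel sizes and strides (8×8 with stride 4, 4×4 with stride 2, 3×3 with stride 1) follow the layout widely used for image-based reinforcement learning since the DQN work on Atari, which Stable Baselines3 also uses in its default `NatureCNN`. The layers progressively downsample the image while extracting key spatial features.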
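To see how these layers downsample the image, the spatial size after each unpadded convolution can be worked out with `floor((size - kernel) / stride) + 1`. The sketch below assumes an 84×84 input (the shape the wrapped environment reports in the test cell) and 64 output channels in the last convolution; `conv_out` is a helper introduced here purely for illustration.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# (kernel_size, stride) of the three convolutional layers in CustomCNN
layers = [(8, 4), (4, 2), (3, 1)]

size = 84  # assumed input height/width
sizes = []
for kernel, stride in layers:
    size = conv_out(size, kernel, stride)
    sizes.append(size)

c3_out = 64  # assumed channel count of the last conv layer
n_flatten = c3_out * size * size

print(sizes)      # [20, 9, 7]
print(n_flatten)  # 3136
```

The result, 3136, is the `in_features` value of the linear layer that appears when the model is printed in the test cell. `CustomCNN` obtains the same number by passing a sample observation through the convolutional stack instead of hard-coding it.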

In [4]:
# ==============================================================================
# CUSTOM CNN CLASS DEFINITION
# ==============================================================================
class CustomCNN(BaseFeaturesExtractor):
    """
    A custom CNN feature extractor with configurable channel dimensions.

    :param observation_space: The observation space of the environment.
    :param features_dim: The number of features to extract.
    :param c1_out: Number of output channels for the first convolutional layer.
    :param c2_out: Number of output channels for the second convolutional layer.
    :param c3_out: Number of output channels for the third convolutional layer.
    """
    def __init__(
        self,
        observation_space: gym.spaces.Box,
        features_dim: int = 128,
        c1_out: int = 32,
        c2_out: int = 64,
        c3_out: int = 64
    ):
        super().__init__(observation_space, features_dim)
        n_input_channels = observation_space.shape[0]

        # Define the CNN layers using the configurable parameters
        self.cnn = nn.Sequential(
            nn.Conv2d(n_input_channels, c1_out, kernel_size=8, stride=4, padding=0),
            nn.ReLU(),
            nn.Conv2d(c1_out, c2_out, kernel_size=4, stride=2, padding=0),
            nn.ReLU(),
            nn.Conv2d(c2_out, c3_out, kernel_size=3, stride=1, padding=0),
            nn.ReLU(),
            nn.Flatten(),
        )

        # Compute the flattened feature size after the CNN layers
        with torch.no_grad():
            sample_tensor = torch.as_tensor(observation_space.sample()[None]).float()
            n_flatten = self.cnn(sample_tensor).shape[1]

        # Define the final linear layer
        self.linear = nn.Sequential(
            nn.Linear(n_flatten, features_dim),
            nn.ReLU()
        )

    def forward(self, observations: torch.Tensor) -> torch.Tensor:
        return self.linear(self.cnn(observations))


### Step 2.1: Test the Custom CNN 🧪

Before using our `CustomCNN` in the full training pipeline, we can do a quick sanity check. We will:
1.  Create a temporary, dummy environment to get the correct observation space shape.
2.  Instantiate our `CustomCNN` model.
3.  Create a random dummy observation (input tensor).
4.  Perform a single forward pass and check the shape of the output tensor.

This ensures that our network's layers are connected correctly and produce an output with the expected dimensions (`batch_size`, `features_dim`).

---

Note:

1. `make_vec_env`: The Training Accelerator
    - A helper function from Stable Baselines3 that creates multiple copies of an environment and wraps them in a single vectorized environment.
    - Instead of training on a single instance of the CartPole env, `make_vec_env` sets up several envs that are stepped together. When we call `step()`, it sends one action to each environment and collects the results (the next image, the reward, etc.) from all of them as a batch. By default the copies run one after another in the same process (`DummyVecEnv`); passing `vec_env_cls=SubprocVecEnv` runs them in separate processes.

2. `VecTransposeImage`: The Data Shaper
    - A wrapper that changes the order of an image's dimensions.
    - Our environment (using OpenCV and Gymnasium) produces images in channels-last format: (Height, Width, Channels). PyTorch's convolutional layers expect channels-first format: (Channels, Height, Width).
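
As a rough illustration of that reordering (a sketch of the idea, not the wrapper's actual source), the snippet below moves the channel axis of a dummy frame with NumPy. The 84×84×1 frame shape is an assumption inferred from the `(1, 84, 84)` observation shape printed in the test cell.

```python
import numpy as np

# Dummy grayscale frame in channels-last layout: (Height, Width, Channels)
frame_hwc = np.zeros((84, 84, 1), dtype=np.uint8)

# Single image: move the channel axis to the front -> (Channels, Height, Width)
frame_chw = np.transpose(frame_hwc, (2, 0, 1))

# Batch from 4 environments: (N, H, W, C) -> (N, C, H, W)
batch_nhwc = np.zeros((4, 84, 84, 1), dtype=np.uint8)
batch_nchw = np.transpose(batch_nhwc, (0, 3, 1, 2))

print(frame_chw.shape)   # (1, 84, 84)
print(batch_nchw.shape)  # (4, 1, 84, 84)
```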


In [5]:
# ==============================================================================
# TEST THE CUSTOM CNN
# ==============================================================================
print("--- Testing CustomCNN ---")

# 1. Create a dummy environment to get the observation space
# We follow the same wrapping steps as we will for the real training
test_env = make_vec_env('CartPole-v1', n_envs=4, env_kwargs={'render_mode': 'rgb_array'}, wrapper_class=ImageWrapper)
test_env = VecTransposeImage(test_env)
obs_space = test_env.observation_space
print(f"Observation space shape: {obs_space.shape}") # Should be (1, 84, 84)

# 2. Instantiate the CustomCNN
# We use the observation space from our wrapped environment
cnn_test_model = CustomCNN(observation_space=obs_space, features_dim=128)
print(f"CNN Model Initialized:\n{cnn_test_model}")

# 3. Create a dummy observation
# .reset() gives us a sample observation from the environment
dummy_obs = test_env.reset()
input_tensor = torch.as_tensor(dummy_obs).float()
print(f"\nInput tensor shape: {input_tensor.shape}") # Should be (n_envs, 1, 84, 84)

# 4. Perform a forward pass
features = cnn_test_model(input_tensor)
print(f"Output features shape: {features.shape}") # Should be (n_envs, features_dim)

# 5. Verify the output shape
n_envs = test_env.num_envs
expected_shape = (n_envs, cnn_test_model.features_dim)
assert features.shape == expected_shape, f"Shape mismatch! Expected {expected_shape}, got {features.shape}"

print("\n✅ Test passed! The CustomCNN is working as expected.")

# Clean up the test environment
test_env.close()

--- Testing CustomCNN ---
Observation space shape: (1, 84, 84)
CNN Model Initialized:
CustomCNN(
  (cnn): Sequential(
    (0): Conv2d(1, 32, kernel_size=(8, 8), stride=(4, 4))
    (1): ReLU()
    (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
    (3): ReLU()
    (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
    (5): ReLU()
    (6): Flatten(start_dim=1, end_dim=-1)
  )
  (linear): Sequential(
    (0): Linear(in_features=3136, out_features=128, bias=True)
    (1): ReLU()
  )
)





Input tensor shape: torch.Size([4, 1, 84, 84])
Output features shape: torch.Size([4, 128])

✅ Test passed! The CustomCNN is working as expected.




## Step 3: Configure and Train the PPO Agent 🚀

This is the final step of the notebook. We now define the learning agent, connect it to our custom environment and neural network, and start training.

Here's a detailed breakdown of the components involved:

* **The Agent (PPO):** We are using **Proximal Policy Optimization (PPO)**, a widely used policy-gradient reinforcement learning algorithm known for its stability. Its core job is to analyze the agent's experiences (the images, actions, and rewards) and update the neural network to produce better actions over time. PPO limits how far each update can move the policy (the `clip_range` of 0.2 in the training log), and its stochastic policy provides the balance between **exploration** (trying new things to find better strategies) and **exploitation** (using the best-known strategies).

* **The Policy (`CnnPolicy`):** This is a pre-built policy structure from Stable Baselines3 that is specifically designed for tasks with image-based observations. It provides the general architecture, which includes a feature extractor (the CNN part) and "heads" that decide on the next action and estimate the value of the current state.

* **The Custom Feature Extractor (`CustomCNN`):** This is our key customization. We are telling the `CnnPolicy` to *not* use its default CNN. Instead, we instruct it to use our `CustomCNN` class as its feature extractor. We pass this instruction using the `policy_kwargs` dictionary, which allows us to configure the policy's internal components.

* **The Training Process (`model.learn()`):** When we call this function, the main training loop begins. The agent will repeatedly perform actions in our four parallel environments, collect batches of experience, and use the PPO algorithm to update its network weights. If training goes well, `ep_rew_mean` in the log should rise gradually as the agent progresses from random guessing to intentional, stable control.
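
The size of those batches follows from simple arithmetic. The sketch below assumes PPO's default hyperparameters in Stable Baselines3 (`n_steps=2048`, `batch_size=64`, `n_epochs=10`), since the training cell does not override them; the variable names are illustrative.

```python
import math

n_envs = 4                 # parallel environments used in this notebook
n_steps = 2048             # assumed: SB3 PPO default rollout length per environment
batch_size = 64            # assumed: SB3 PPO default minibatch size
n_epochs = 10              # assumed: SB3 PPO default passes over each rollout
total_timesteps = 250_000  # value passed to model.learn()

# Transitions collected across all environments before each policy update
rollout_size = n_envs * n_steps

# Whole rollouts are collected, so the step budget is rounded up
iterations = math.ceil(total_timesteps / rollout_size)
actual_timesteps = iterations * rollout_size

# Minibatch gradient updates performed after each rollout
updates_per_iteration = (rollout_size // batch_size) * n_epochs

print(rollout_size)           # 8192
print(iterations)             # 31
print(actual_timesteps)       # 253952
print(updates_per_iteration)  # 1280
```

The first figure matches the `total_timesteps` of 8192 reported after iteration 1 in the training log. Under these assumptions, training would run slightly past the requested 250,000 steps because only whole rollouts are collected.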

In [6]:
# Check if a CUDA-enabled GPU is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"✅ GPU is available and will be used: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("⚠️ GPU not found, using CPU.")

✅ GPU is available and will be used: Tesla T4


In [8]:
# ==============================================================================
# TRAINING THE AGENT
# ==============================================================================
# --- Configure the Policy ---
# The policy_kwargs dictionary is our way of telling the PPO agent's CnnPolicy
# to use our custom CNN class and its specific parameters.
policy_kwargs = dict(
    features_extractor_class=CustomCNN,
    features_extractor_kwargs=dict(
        features_dim=128,
        c1_out=32,
        c2_out=64,
        c3_out=64
    ),
)

# --- Create the Final Vectorized Environment ---
# We use our previously tested setup with 4 parallel environments.
vec_env = make_vec_env(
    'CartPole-v1',
    n_envs=4,
    env_kwargs={'render_mode': 'rgb_array'},
    wrapper_class=ImageWrapper
)
vec_env = VecTransposeImage(vec_env)

# --- Define the PPO Model ---
# We instantiate the PPO agent with our policy, environment, and custom arguments.
# `device` comes from the GPU check above; `verbose=1` prints the training progress.
model = PPO(
    "CnnPolicy",
    vec_env,
    policy_kwargs=policy_kwargs,
    verbose=1,
    device=device,
    tensorboard_log="/project/logs/ppo_visual_cartpole_tensorboard/"
)

# --- Start Training ---
# The agent will train for 250,000 total steps. A "step" is a single
# action taken in one of the parallel environments.
model.learn(total_timesteps=250000)

# --- Save the Trained Model ---
# After training, the model's learned weights are saved to a file. This file
# contains the complete agent, ready for evaluation in the next notebook.
model.save("/project/models/ppo_visual_cartpole")

print("\n✅ Training complete and model saved to /project/models/ppo_visual_cartpole.zip")

Using cuda device
Logging to /project/logs/ppo_visual_cartpole_tensorboard/PPO_2
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 21.6     |
|    ep_rew_mean     | 21.6     |
| time/              |          |
|    fps             | 285      |
|    iterations      | 1        |
|    time_elapsed    | 28       |
|    total_timesteps | 8192     |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 22.1         |
|    ep_rew_mean          | 22.1         |
| time/                   |              |
|    fps                  | 246          |
|    iterations           | 2            |
|    time_elapsed         | 66           |
|    total_timesteps      | 16384        |
| train/                  |              |
|    approx_kl            | 0.0047462834 |
|    clip_fraction        | 0.011        |
|    clip_range           | 0.2          |
|    entropy_loss   