# CSC_52081_EP Project

Advanced Machine Learning and Autonomous Agents Project

## Introduction

Reinforcement Learning (RL) has emerged as a robust framework for training autonomous agents to learn optimal behaviors through environmental interactions. This study utilizes the [`CarRacing-v3`](https://gymnasium.farama.org/environments/box2d/car_racing/) environment from Gymnasium, which presents a challenging control task in a racing scenario.

### Environment

The environment features a high-dimensional observation space, represented by a $96 \times 96$ RGB image capturing the car and track, necessitating the use of deep convolutional neural networks (CNNs) for effective feature extraction.

#### Action Space

The action space in CarRacing-v3 supports both continuous and discrete control modes.

In **continuous mode**, the agent outputs three real-valued commands:

- steering (from $-1$, full left, to $+1$, full right)
- gas (from $0$ to $1$)
- braking (from $0$ to $1$)

In **discrete mode**, the action space is simplified to five actions:

- do nothing
- steer left
- steer right
- gas
- brake

This dual action representation enables a comprehensive evaluation of various RL algorithms under different control settings.

#### Reward

The reward combines a penalty of $-0.1$ per frame with a reward of $+\frac{1000}{N}$ for each newly visited track tile, where $N$ is the total number of tiles on the track. This incentivizes the agent to cover the track (visiting tiles) while doing so quickly (using few frames). For example, visiting all $N$ tiles in 732 frames yields $1000 - 0.1 \times 732 = 926.8$ points.
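The arithmetic above can be checked with a short sketch. The function name `episode_return` and its parameters are illustrative and are not part of Gymnasium's API; the sketch assumes only the two reward terms described in this section.

```python
def episode_return(tiles_visited: int, total_tiles: int, frames: int) -> float:
    """Return under the reward described above: +1000/N per new tile, -0.1 per frame."""
    if total_tiles <= 0:
        raise ValueError("total_tiles must be positive")
    tile_reward = 1000.0 * tiles_visited / total_tiles
    frame_penalty = 0.1 * frames
    return tile_reward - frame_penalty


# All N tiles visited in 732 frames (N = 300 is an arbitrary illustrative value).
full_lap = episode_return(tiles_visited=300, total_tiles=300, frames=732)
print(round(full_lap, 1))  # 926.8
```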

### Objective

The primary objective of this project is to compare RL policies across discrete and continuous action modalities. For discrete control, **Deep Q-Network** (DQN) and **SARSA** are implemented. Continuous control is explored with derivative-free approaches, namely the **Cross-Entropy Method** (CEM) and a **Self-Adaptive Evolution Strategy** (SA-ES), and with gradient-based actor-critic methods such as **Proximal Policy Optimization** (PPO) and **Soft Actor-Critic** (SAC). This comparative analysis aims to understand the strengths and limitations of each method in handling complex decision spaces.

The high-dimensional visual inputs in `CarRacing-v3` require effective feature extraction, addressed through a tailored CNN architecture. Transitioning between discrete and continuous action representations also demands careful algorithmic design and parameter tuning to ensure stable learning and convergence. While prior studies have often focused on either discrete or continuous action spaces separately, this work adopts a comparative approach, evaluating different agents within the same environment to assess performance under similar conditions.
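To make the size of the feature extractor concrete, the sketch below computes the spatial output size of a convolutional stack. The project's `CNN` module is not reproduced in this document, so the three-layer configuration used here (8×8 stride 4, 4×4 stride 2, 3×3 stride 1, 64 output channels) is an assumption borrowed from the commonly used DQN layout, not the project's confirmed architecture.

```python
from typing import Iterable, Tuple


def conv_output_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def stack_output_size(size: int, layers: Iterable[Tuple[int, int]]) -> int:
    """Apply conv_output_size for each (kernel, stride) pair in order."""
    for kernel, stride in layers:
        size = conv_output_size(size, kernel, stride)
    return size


# ASSUMED layer configuration (DQN-style); the real CNN module may differ.
ASSUMED_LAYERS = [(8, 4), (4, 2), (3, 1)]
ASSUMED_OUT_CHANNELS = 64

for side in (96, 84):  # raw frame size and the resized frame used in the notebook
    out = stack_output_size(side, ASSUMED_LAYERS)
    print(side, "->", out, "features:", ASSUMED_OUT_CHANNELS * out * out)
```

Under these assumed layers, a 96×96 input yields an 8×8 feature map and an 84×84 input yields a 7×7 one, which is one reason downscaling the frames reduces the size of the first fully connected layer.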

At this stage, the work outlines the methodology and anticipated challenges, focusing on designing the CNN-based feature extractor, implementing RL algorithms, and establishing a framework for performance comparison. Preliminary findings are yet to be finalized, but the study is expected to provide insights into applying RL in high-dimensional, real-time control tasks. Limitations include the preliminary nature of experiments and the need for further tuning and validation. Future work will involve extensive empirical evaluations, exploring additional policy gradient methods, and refining the network architecture to better handle the complexities of `CarRacing-v3`.

### GitHub

The project's code is available on [GitHub](https://github.com/tr0fin0/ensta_CSC_52081_EP_project), offering a reproducible framework for future investigations and extensions.

## Installation

### Environment

#### WSL, Linux or macOS

This project uses a Python virtual environment. To create it, run the following in a terminal from the project folder (the `apt` line applies only to Debian/Ubuntu-based systems, including WSL):

```bash
sudo apt install python3.10-venv
python3 -m venv env
source env/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt
```

### Imports

In [3]:
from collections import deque
from ipywidgets import interact
from IPython.display import Video
from pathlib import Path
from tqdm.notebook import tqdm
from typing import cast, List, Tuple, Deque, Optional, Callable
import os
import gymnasium as gym
import gymnasium.wrappers as gym_wrap
import itertools
import torch
import torch.nn as nn
import torch.optim as optim
import random
from CNN import CNN
from SkipFrame import SkipFrame
from ReplayBuffer import ReplayBuffer
from tensordict import TensorDict
from torchrl.data import TensorDictReplayBuffer, LazyMemmapStorage

In [11]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import csv

### Setup

#### Directories

In [5]:
# Use Path objects throughout so that the `/` operator works on every directory.
DIRECTORY_OUTPUT = Path("output")
DIRECTORY_MODELS = DIRECTORY_OUTPUT / "models"
DIRECTORY_VIDEOS = DIRECTORY_OUTPUT / "videos"

# Create the output folders if they do not exist yet.
DIRECTORY_VIDEOS.mkdir(parents=True, exist_ok=True)
DIRECTORY_MODELS.mkdir(parents=True, exist_ok=True)

## Demonstration

In [4]:
VIDEO_DEMO = "CSC_52081_EP_demonstration"
# Remove any previous recording; RecordVideo appends "-episode-0" to the prefix.
(DIRECTORY_VIDEOS / f"{VIDEO_DEMO}-episode-0.mp4").unlink(missing_ok=True)

env = gym.make(
    "CarRacing-v3",
    render_mode="rgb_array",
    lap_complete_percent=0.95,
    domain_randomize=False,
    continuous=False
)
env = gym.wrappers.RecordVideo(env, video_folder=str(DIRECTORY_VIDEOS), name_prefix=VIDEO_DEMO)

done = False
observation, info = env.reset()

while not done:
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

env.close()

Video(
    DIRECTORY_VIDEOS / f"{VIDEO_DEMO}-episode-0.mp4",
    embed=True,
    html_attributes="controls autoplay loop",
)



## Description

Only the demonstration above has been validated. Everything from this point on is experimental.

### Global Definitions

#### Constants

In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Set the device to CUDA if available, otherwise use CPU

env = gym.make("CarRacing-v3", 
               render_mode="rgb_array",
               lap_complete_percent=0.95,
               continuous=False)

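env = SkipFrame(env, skip=4)  # local wrapper; presumably repeats each action for 4 frames
env = gym_wrap.GrayscaleObservation(env)  # RGB frame -> single grayscale channel
env = gym_wrap.ResizeObservation(env, shape=(84, 84))  # 96x96 -> 84x84
env = gym_wrap.FrameStackObservation(env, stack_size=4)  # stack the 4 most recent frames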
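
`SkipFrame` is imported from a local module that is not reproduced here. A frame-skipping wrapper typically repeats the chosen action for `skip` frames and sums the rewards; the sketch below illustrates that idea with a hypothetical `SkipFrameSketch` class and a toy `CountingEnv`, neither of which is the project's actual code. It deliberately avoids Gymnasium so that it runs on its own.

```python
class CountingEnv:
    """Toy stand-in for an environment: reward 1.0 per frame, ends after `limit` frames."""

    def __init__(self, limit: int = 10):
        self.limit = limit
        self.frames = 0

    def step(self, action):
        self.frames += 1
        truncated = self.frames >= self.limit
        # Same 5-tuple layout as Gymnasium's step(): obs, reward, terminated, truncated, info
        return self.frames, 1.0, False, truncated, {}


class SkipFrameSketch:
    """Hypothetical wrapper: repeat each action `skip` times and accumulate the reward."""

    def __init__(self, env, skip: int = 4):
        self.env = env
        self.skip = skip

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.skip):
            obs, reward, terminated, truncated, info = self.env.step(action)
            total_reward += reward
            if terminated or truncated:
                break  # stop early so no frames are stepped after the episode ends
        return obs, total_reward, terminated, truncated, info


wrapped = SkipFrameSketch(CountingEnv(limit=10), skip=4)
results = [wrapped.step(action=0) for _ in range(3)]
print([(obs, reward, truncated) for obs, reward, _, truncated, _ in results])
```

After the remaining wrappers in the cell above, each observation is a stack of four 84×84 grayscale frames, so the agent sees a short history rather than a single image.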

### Functions

In [7]:
import matplotlib

is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    from IPython import display


def plot_reward(episode_num, reward_list, actions, save_dir) -> None:
    """
    Plots the reward per episode and the moving average of the reward.

    Args:
        episode_num (int): The current episode number.
        reward_list (list): A list of rewards obtained per episode.
        actions (int): The total number of actions taken so far.
        save_dir (str or Path): Directory in which `reward_plot.png` is saved.

    Returns:
        None
    """
    plt.figure(1)
    rewards_tensor = torch.tensor(reward_list, dtype=torch.float)

    if len(rewards_tensor) >= 11:
        eval_reward = torch.clone(rewards_tensor[-10:])
        mean_eval_reward = round(torch.mean(eval_reward).item(), 2)
        std_eval_reward = round(torch.std(eval_reward).item(), 2)

        plt.clf()
        plt.title(
            f'#{episode_num}: {actions} actions, [{mean_eval_reward:.1f}±{std_eval_reward:.1f}]'
        )
    else:
        plt.clf()
        plt.title('Training...')

    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.plot(rewards_tensor.numpy())

    if len(rewards_tensor) >= 50:
        reward_f = torch.clone(rewards_tensor[:50])
        means = rewards_tensor.unfold(0, 50, 1).mean(1).view(-1)
        means = torch.cat((torch.ones(49)*torch.mean(reward_f), means))
        plt.plot(means.numpy())

    plt.savefig(f"{save_dir}/reward_plot.png")

    #plt.pause(0.001)
    if is_ipython:
        display.display(plt.gcf())
        display.clear_output(wait=True)
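
The smoothed curve in `plot_reward` is a 50-episode moving average whose first 49 points are filled with the mean of the first 50 rewards, so that it has the same length as the raw curve. The same computation in plain Python, as a sketch (the helper name `padded_moving_average` is illustrative and not part of the notebook):

```python
from statistics import fmean
from typing import List, Sequence


def padded_moving_average(rewards: Sequence[float], window: int = 50) -> List[float]:
    """Moving average of `rewards`, left-padded to len(rewards) with the first window's mean."""
    if len(rewards) < window:
        return []  # plot_reward draws no smoothed curve in this case
    means = [fmean(rewards[i:i + window]) for i in range(len(rewards) - window + 1)]
    return [means[0]] * (window - 1) + means


# Small window so the result is easy to inspect by hand.
curve = padded_moving_average(list(range(10)), window=4)
print(curve)
```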

### Deep SARSA

In [None]:
class DeepSARSA():
    def __init__(
        self,
        environment,
        device,
        CNN,
        gamma = 0.95,
        epsilon = 0.95,
        epsilon_decay = 0.98,
        epsilon_min = 0.02
    ):
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.env = environment
        self.shape_state = self.env.observation_space.shape
        self.shape_action = self.env.action_space.n
        self.device = device

        self.updating_network = CNN(self.shape_state, self.shape_action).float()
        self.updating_network = self.updating_network.to(device=self.device)
        self.frozen_network = CNN(self.shape_state, self.shape_action).float()
        self.frozen_network = self.frozen_network.to(device=self.device)
        self.optimizer = torch.optim.Adam(self.updating_network.parameters(), lr=0.0002)
        self.loss_function = torch.nn.MSELoss()
        self.buffer = TensorDictReplayBuffer(
            storage=LazyMemmapStorage(10000, device=torch.device('cpu'))
        )
        self.updates = 0

    def take_action(self, state):
        if np.random.rand() < self.epsilon:
            action = np.random.randint(self.shape_action)
        else:
            state = torch.tensor(
                state,
                dtype=torch.float32,
                device=self.device
            ).unsqueeze(0)
            # Greedy action: no gradient is needed for action selection.
            with torch.no_grad():
                action_values = self.updating_network(state)
            action = torch.argmax(action_values, dim=1).item()

        return action
    
    def add_sample(self, state, action, reward, next_state, next_action, done):
        """
        Store a sample in the replay buffer.

        Args:
            state (np.ndarray): The current state.
            action (int): The action taken.
            reward (float): The reward received.
            next_state (np.ndarray): The next state.
            next_action (int): The action taken in the next state.
            done (bool): Whether the episode has terminated.
        """
        self.buffer.add(
            TensorDict({
                "state": torch.tensor(state),
                "action": torch.tensor(action),
                "reward": torch.tensor(reward),
                "next_state": torch.tensor(next_state),
                "next_action": torch.tensor(next_action),
                "done": torch.tensor(done)
            }, batch_size=[])
        )
    
    def get_samples(self, batch_size):
        """
        Sample a batch of transitions from the replay buffer.

        Args:
            batch_size (int): The number of transitions to sample.

        Returns:
            tuple: A tuple containing batches of states, actions, rewards, next states, next actions, and termination flags.
        """
        batch = self.buffer.sample(batch_size)

        states = batch.get('state').float().to(self.device)
        actions = batch.get('action').squeeze().to(self.device)
        # Cast rewards to float32 so the TD target matches the network output dtype.
        rewards = batch.get('reward').squeeze().float().to(self.device)
        next_states = batch.get('next_state').float().to(self.device)
        next_actions = batch.get('next_action').squeeze().to(self.device)
        dones = batch.get('done').squeeze().to(self.device)
    
        return states, actions, rewards, next_states, next_actions, dones
    
    def update_network(self, batch_size):
        self.updates += 1
        states, actions, rewards, next_states, next_actions, dones = self.get_samples(batch_size)
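        # Q(s, a) for the actions that were actually taken.
        action_values = self.updating_network(states)
        current_estimation = action_values[np.arange(batch_size), actions]
        with torch.no_grad():
            # SARSA target: bootstrap from Q(s', a') of the action actually chosen next,
            # evaluated with the frozen (target) network.
            next_action_values = self.frozen_network(next_states)
            target_estimation = rewards + (1 - dones.float()) * self.gamma * next_action_values[np.arange(batch_size), next_actions]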

        loss = self.loss_function(current_estimation, target_estimation)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        loss = loss.item()

        return current_estimation, loss
    
    def save(self, save_name:str = 'DEEP_SARSA'):
        """
        Save the current model to a file.

        Args:
            save_name (str): The name to use for the saved model file.
        """
        save_path = str(DIRECTORY_MODELS / f"{save_name}_{self.updates}.pt")

        torch.save({
            'upd_model_state_dict': self.updating_network.state_dict(),
            'frz_model_state_dict': self.frozen_network.state_dict(),
            'optimizer_state_dict': self.optimizer.state_dict(),
            'epsilon': self.epsilon
        }, save_path)
        print(f"Model saved to {save_path} at update {self.updates}")

    def load(self, model_name):
        """
        Load a model from a file.

        Args:
            model_name (str): The name of the model file to load.
        """
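        # map_location lets a checkpoint trained on GPU be loaded on a CPU-only machine.
        loaded_model = torch.load(
            str(DIRECTORY_MODELS / model_name),
            map_location=self.device,
            weights_only=False
        )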

        updating_network_parameters = loaded_model['upd_model_state_dict']
        frozen_network_parameters = loaded_model['frz_model_state_dict']
        optimizer_parameters = loaded_model['optimizer_state_dict']

        self.updating_network.load_state_dict(updating_network_parameters)
        self.frozen_network.load_state_dict(frozen_network_parameters)
        self.optimizer.load_state_dict(optimizer_parameters)

    def write_log(
            self,
            rewards,
            losses,
            epsilons,
            log_filename='log_DEEP_SARSA.csv'
        ):
        """
        Write training logs to a CSV file.

        Args:
            rewards (list): List of rewards for each episode.
            losses (list): List of losses for each episode.
            epsilons (list): List of epsilon values for each episode.
            log_filename (str, optional): The name of the log file. Defaults to 'log_DEEP_SARSA.csv'.
        """
        rows = [
            ['reward'] + rewards,
            ['loss'] + losses,
            ['epsilon'] + epsilons
        ]
        # Path(...) makes this work whether DIRECTORY_OUTPUT is a str or a Path.
        with open(Path(DIRECTORY_OUTPUT) / log_filename, 'w', newline='') as csvfile:
            csvwriter = csv.writer(csvfile)
            csvwriter.writerows(rows)
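
`update_network` regresses $Q(s, a)$ toward the SARSA target $r + \gamma\,(1 - d)\,Q_{\text{frozen}}(s', a')$, where $a'$ is the action the ε-greedy policy actually selected in $s'$ and $d$ is the stored `done` flag. For contrast, a DQN-style target would use $\max_{a'} Q_{\text{frozen}}(s', a')$ instead. The scalar sketch below shows the difference; the function names are illustrative and plain lists stand in for network outputs.

```python
from typing import Sequence


def sarsa_target(reward: float, done: bool, gamma: float,
                 next_q_values: Sequence[float], next_action: int) -> float:
    """On-policy target: bootstrap from the action actually taken next."""
    return reward + (0.0 if done else gamma * next_q_values[next_action])


def dqn_target(reward: float, done: bool, gamma: float,
               next_q_values: Sequence[float]) -> float:
    """Off-policy target, shown only for comparison: bootstrap from the greedy action."""
    return reward + (0.0 if done else gamma * max(next_q_values))


next_q = [0.5, 2.0, 1.0]  # made-up frozen-network outputs for s'
print(round(sarsa_target(1.0, False, 0.95, next_q, next_action=2), 2))
print(round(dqn_target(1.0, False, 0.95, next_q), 2))
print(round(sarsa_target(1.0, True, 0.95, next_q, next_action=2), 2))
```

The two targets coincide whenever the next action happens to be the greedy one, and both reduce to the immediate reward when `done` is true.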
        

## Train the Model

In [None]:
EPISODES = 1000
BATCH_SIZE = 32

agent = DeepSARSA(
    environment=env,
    device=device,
    CNN=CNN,
)

episode_epsilons = []
episode_rewards = []
episode_lengths = []
episode_losses = []

target_network_sync_period = 30
interval_save = 200
interval_plot = 10
iteration = 0

for episode in range(EPISODES):
    episode_reward = 0
    episode_length = 0

    done = False
    losses = []
    episode_epsilons.append(agent.epsilon)

    state, info = env.reset()
    action = agent.take_action(state)

    while not done:
        episode_length += 1
        iteration += 1

        next_state, reward, terminated, truncated, info = env.step(action)
        next_action = agent.take_action(next_state)

        done = (terminated or truncated)
        agent.add_sample(state, action, reward, next_state, next_action, done)
 
        episode_reward += float(reward)

        state = next_state
        action = next_action

        if len(agent.buffer) > BATCH_SIZE:
            q, loss = agent.update_network(BATCH_SIZE)
            losses.append(loss)

        if iteration % target_network_sync_period == 0:
            agent.frozen_network.load_state_dict(agent.updating_network.state_dict())

    agent.epsilon = max(agent.epsilon * agent.epsilon_decay, agent.epsilon_min)

    episode_rewards.append(episode_reward)
    episode_lengths.append(episode_length)
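    # Guard against episodes with no network update (np.mean of an empty list warns and returns nan).
    episode_losses.append(float(np.mean(losses)) if losses else float("nan"))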

    if episode % interval_save == 0:
        agent.save()
        agent.write_log(episode_rewards, episode_losses, episode_epsilons)

    if episode % interval_plot == 0:
        plot_reward(episode, episode_rewards, episode_length, save_dir=DIRECTORY_OUTPUT)

agent.save()
env.close()

plt.ioff()
plt.show()

Model saved to output\models\DEEP_SARSA_249117.pt at update 249117
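
With the default hyperparameters of `DeepSARSA` (`epsilon=0.95`, `epsilon_decay=0.98`, `epsilon_min=0.02`) and one decay step per episode, exploration reaches its floor well before the 1000 training episodes are over. A small sketch of the schedule (the helper `epsilon_schedule` is illustrative and not part of the notebook):

```python
from typing import List


def epsilon_schedule(episodes: int, epsilon: float = 0.95,
                     decay: float = 0.98, minimum: float = 0.02) -> List[float]:
    """Epsilon used in each episode when it is decayed once at the end of every episode."""
    values = []
    for _ in range(episodes):
        values.append(epsilon)
        epsilon = max(epsilon * decay, minimum)
    return values


schedule = epsilon_schedule(1000)
first_floor_episode = schedule.index(0.02)  # first episode run at the minimum epsilon
print(first_floor_episode, schedule[0], schedule[-1])
```

The floor is reached after roughly 190 episodes, so most of the training run uses ε = 0.02. A slower decay would keep exploration alive for longer, at the cost of noisier early returns.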



In [2]:
plot_reward(episode, episode_rewards, episode_length, DIRECTORY_OUTPUT)



## Test the Model

In [10]:
VIDEO_EVAL = "DEEP_SARSA_EVALUATION"
# Remove any previous recording; RecordVideo appends "-episode-0" to the prefix.
(DIRECTORY_VIDEOS / f"{VIDEO_EVAL}-episode-0.mp4").unlink(missing_ok=True)

env = gym.wrappers.RecordVideo(env, video_folder=str(DIRECTORY_VIDEOS), name_prefix=VIDEO_EVAL)

agent = DeepSARSA(
    environment=env,
    device=device,
    CNN=CNN,
)
MODEL_ID = 249117
agent.load(f"DEEP_SARSA_{MODEL_ID}.pt")

agent.epsilon = 0
seed_id = 1234

score = 0
action_count = 0

state, info = env.reset(seed=seed_id)
action = agent.take_action(state)
updating = True

while updating:
    next_state, reward, terminated, truncated, info = env.step(action)
    next_action = agent.take_action(next_state)

    updating = not (terminated or truncated)
    score += reward
    action_count += 1

    state = next_state
    action = next_action


print(f"Score:{score:.2f}, actions: {action_count}")

env.close()

Video(
    DIRECTORY_VIDEOS / f"{VIDEO_EVAL}-episode-0.mp4",
    embed=True,
    html_attributes="controls autoplay loop",
)

Score:778.79, actions: 250
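
As a sanity check, the reported score can be related back to the reward definition. The sketch assumes that `SkipFrame(skip=4)` repeats each action for four frames and sums their rewards, so that 250 agent actions correspond to 1000 environment frames; the helper name `implied_tile_fraction` is illustrative.

```python
def implied_tile_fraction(score: float, frames: int,
                          frame_penalty: float = 0.1,
                          full_track_reward: float = 1000.0) -> float:
    """Fraction of track tiles visited implied by score = 1000 * fraction - 0.1 * frames."""
    return (score + frame_penalty * frames) / full_track_reward


ACTIONS = 250
ASSUMED_SKIP = 4  # assumption: every agent action spans 4 environment frames
fraction = implied_tile_fraction(score=778.79, frames=ACTIONS * ASSUMED_SKIP)
print(f"{fraction:.1%} of tiles visited")
```

Under these assumptions, the agent covered just under 88% of the track in this evaluation episode.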
