# CSC_52081_EP Project

Advanced Machine Learning and Autonomous Agents Project

## Introduction

Reinforcement Learning (RL) has emerged as a robust framework for training autonomous agents to learn optimal behaviors through environmental interactions. This study utilizes the [`CarRacing-v3`](https://gymnasium.farama.org/environments/box2d/car_racing/) environment from Gymnasium, which presents a challenging control task in a racing scenario.

### Environment

The environment features a high-dimensional observation space, represented by a $96 \times 96$ RGB image capturing the car and track, necessitating the use of deep convolutional neural networks (CNNs) for effective feature extraction.

#### Action Space

The action space in CarRacing-v3 supports both continuous and discrete control modes.

In **continuous mode**, the agent outputs three real-valued commands:

- steering (ranging from $-1$ to $+1$)
- gas (ranging from $0$ to $1$)
- braking (ranging from $0$ to $1$)

In **discrete mode**, the action space is simplified to five actions:

- do nothing
- steer left
- steer right
- gas
- brake

This dual action representation enables a comprehensive evaluation of various RL algorithms under different control settings.
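
As a rough illustration of the continuous mode, the sketch below maps three unbounded values (for example, raw network outputs) onto the command ranges listed above, using `tanh` for steering and a sigmoid for gas and braking. The function name `squash_action` is hypothetical, and the plain sigmoid is not protected against overflow for very large negative inputs.

```python
import math


def squash_action(raw_steer: float, raw_gas: float, raw_brake: float) -> list[float]:
    """Map three unbounded values onto the continuous command ranges."""
    steering = math.tanh(raw_steer)              # lies in (-1, 1)
    gas = 1.0 / (1.0 + math.exp(-raw_gas))       # lies in (0, 1)
    brake = 1.0 / (1.0 + math.exp(-raw_brake))   # lies in (0, 1)
    return [steering, gas, brake]


# Hypothetical raw values: neutral steering, strong gas, weak brake
action = squash_action(0.0, 2.0, -2.0)
print(action)
```

The same squashing idea appears later in the CEM agent's `get_action` method.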

#### Reward

The reward combines a penalty of $-0.1$ per frame with a bonus of $+\frac{1000}{N}$ for each new track tile visited, where $N$ is the total number of tiles. This rewards the agent for covering the track while penalizing every frame spent doing so, so faster laps score higher. For example, visiting all $N$ tiles in 732 frames yields $1000 - 0.1 \times 732 = 926.8$ points.
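
The arithmetic can be checked with a few lines of Python. This is only a sketch of the formula as stated here: the helper name `episode_return` and the tile count of 300 are assumptions chosen for illustration, since the actual number of tiles varies from track to track.

```python
def episode_return(tiles_visited: int, total_tiles: int, frames: int) -> float:
    """Return under the stated rule: +1000/N per new tile, -0.1 per frame."""
    tile_bonus = 1000.0 * tiles_visited / total_tiles
    frame_penalty = 0.1 * frames
    return tile_bonus - frame_penalty


# Hypothetical track with 300 tiles, fully covered in 732 frames
full_lap = episode_return(tiles_visited=300, total_tiles=300, frames=732)

# Same number of frames, but only half of the tiles visited
half_lap = episode_return(tiles_visited=150, total_tiles=300, frames=732)

print(round(full_lap, 1), round(half_lap, 1))
```

The full lap reproduces the 926.8 points of the example, while covering only half the track in the same time loses exactly 500 points.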

### Objective

The primary objective of this project is to compare RL policies across discrete and continuous action modalities. For discrete control, **Deep Q-Network** (DQN) and **SARSA** are implemented. Continuous control is explored with derivative-free methods such as the **Cross-Entropy Method** (CEM) and the **Self-Adaptive Evolution Strategy** (SA-ES), and with gradient-based methods such as **Proximal Policy Optimization** (PPO) and **Soft Actor-Critic** (SAC). This comparative analysis aims to understand the strengths and limitations of each method in handling complex decision spaces.

The high-dimensional visual inputs in `CarRacing-v3` require effective feature extraction, addressed through a tailored CNN architecture. Transitioning between discrete and continuous action representations also demands careful algorithmic design and parameter tuning to ensure stable learning and convergence. While prior studies have often focused on either discrete or continuous action spaces separately, this work adopts a comparative approach, evaluating different agents within the same environment to assess performance under similar conditions.

At this stage, the work outlines the methodology and anticipated challenges, focusing on designing the CNN-based feature extractor, implementing RL algorithms, and establishing a framework for performance comparison. Preliminary findings are yet to be finalized, but the study is expected to provide insights into applying RL in high-dimensional, real-time control tasks. Limitations include the preliminary nature of experiments and the need for further tuning and validation. Future work will involve extensive empirical evaluations, exploring additional policy gradient methods, and refining the network architecture to better handle the complexities of `CarRacing-v3`.

### GitHub

The project's code is available on [GitHub](https://github.com/tr0fin0/ensta_CSC_52081_EP_project), offering a reproducible framework for future investigations and extensions.

## Installation

### Environment

#### WSL, Linux or MacOS

This project uses a Python virtual environment. To create it, run the following in a terminal from the project folder:

```bash
# Debian/Ubuntu (including WSL) only; skip this line on macOS
sudo apt install python3.10-venv
python3 -m venv env
source env/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt
```

### Imports

In [None]:
import datetime
import gymnasium as gym
import gymnasium.wrappers as gym_wrap
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters
import csv


from IPython.display import Video
from pathlib import Path
from typing import List

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np


plt.ion()

In [None]:
import seaborn as sns

sns.set_context("talk")

In [None]:
def video_selector(file_path: Path) -> Video:
    return Video(file_path, embed=True, html_attributes="controls autoplay loop")

### Setup

#### Directories

In [None]:
DIRECTORY_OUTPUT = "output"
DIRECTORY_MODELS = Path(f"{DIRECTORY_OUTPUT}/models/")
DIRECTORY_FIGURES = Path(f"{DIRECTORY_OUTPUT}/images/")
DIRECTORY_LOGS = Path(f"{DIRECTORY_OUTPUT}/logs/")

# Create the output directories if they do not exist yet
for directory in (DIRECTORY_FIGURES, DIRECTORY_MODELS, DIRECTORY_LOGS):
    directory.mkdir(parents=True, exist_ok=True)

print(DIRECTORY_OUTPUT)
print(DIRECTORY_MODELS)
print(DIRECTORY_FIGURES)
print(DIRECTORY_LOGS)

## Demonstration

In [None]:
VIDEO_DEMO = "CSC_52081_EP_demonstration"
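# RecordVideo appends "-episode-<n>" to the prefix, so remove that file from earlier runs
(DIRECTORY_FIGURES / f"{VIDEO_DEMO}-episode-0.mp4").unlink(missing_ok=True)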


env = gym.make(
    "CarRacing-v3",
    render_mode="rgb_array",
    lap_complete_percent=0.95,
    domain_randomize=False,
    continuous=False
)
env = gym.wrappers.RecordVideo(env, video_folder=str(DIRECTORY_FIGURES), name_prefix=VIDEO_DEMO)


done = False
observation, info = env.reset()

while not done:
    action = env.action_space.sample()

    observation, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

env.close()


Video(
    DIRECTORY_FIGURES / f"{VIDEO_DEMO}-episode-0.mp4",
    embed=True,
    html_attributes="controls autoplay loop",
)

## Description

Only the demonstration above is known to work. Everything below this point is experimental.

## Plotting

In [None]:
is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    from IPython import display

def plot_reward(generation, reward_list, sigmas):
    """
    Plot the reward per generation and a moving average.

    Args:
        generation (int): Current generation number.
        reward_list (list): List of best rewards per generation.
        sigmas (list): List of sigma values per generation.
    """
    plt.figure(1)
    rewards_tensor = torch.tensor(reward_list, dtype=torch.float)

    if len(rewards_tensor) >= 11:
        eval_reward = torch.clone(rewards_tensor[-10:])
        mean_eval_reward = round(torch.mean(eval_reward).item(), 2)
        std_eval_reward = round(torch.std(eval_reward).item(), 2)
        plt.clf()
        plt.title(
            f'Gen #{generation}: Best Reward: {reward_list[-1]:.2f}, Sigma: {sigmas[-1]:.4f}, '
            f'[{mean_eval_reward:.1f}±{std_eval_reward:.1f}]'
        )
    else:
        plt.clf()
        plt.title('Training...')

    plt.xlabel('Generation')
    plt.ylabel('Reward')
    plt.plot(rewards_tensor.numpy())

    if len(rewards_tensor) >= 50:
        reward_f = torch.clone(rewards_tensor[:50])
        means = rewards_tensor.unfold(0, 50, 1).mean(1).view(-1)
        means = torch.cat((torch.ones(49) * torch.mean(reward_f), means))
        plt.plot(means.numpy())

    plt.pause(0.001)
    if is_ipython:
        display.display(plt.gcf())
        display.clear_output(wait=True)
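
The smoothed curve drawn by `plot_reward` is a sliding-window mean whose first `window - 1` entries are padded with the mean of the first window, so that it has the same length as the raw reward curve. The sketch below restates that idea in plain Python; the helper name `padded_moving_average` is hypothetical, and the window of 3 is used only to keep the example small (the plot uses 50).

```python
def padded_moving_average(values: list[float], window: int) -> list[float]:
    """Sliding-window mean, left-padded so the output matches the input length."""
    if len(values) < window:
        return []  # not enough data yet; plot_reward skips the curve in this case
    means = [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
    # Repeat the first window mean for the positions that have no full window
    return [means[0]] * (window - 1) + means


smoothed = padded_moving_average([1.0, 2.0, 3.0, 4.0, 5.0], window=3)
print(smoothed)
```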


### Global Definitions

#### Environment

In [None]:
class SkipFrame(gym.Wrapper):
    """
    Gym environments custom wrapper to skip a specified number of frames.

    Attributes:
        env (gym.Env): The environment to wrap.
        _skip (int): The number of frames to skip.

    Methods:
        step(action):
            Repeats the given action for the specified number of frames and
            accumulates the reward.
    """
    def __init__(self, env, skip):
        super().__init__(env)
        self._skip = skip

    def step(self, action):
        total_reward = 0.0
        for _ in range(self._skip):
            state, reward, terminated, truncated, info = self.env.step(action)
            total_reward += reward
            # Stop repeating the action as soon as the episode ends for any reason
            if terminated or truncated:
                break
        return state, total_reward, terminated, truncated, info
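
To see the effect of frame skipping without installing Box2D, the sketch below applies the same repeat-and-accumulate logic to a stand-in environment. `CountingEnv` and `step_with_skip` are hypothetical names introduced for this illustration; the stub only mimics the five-value return of `step`.

```python
class CountingEnv:
    """Hypothetical stub: reward 1.0 per step, terminates after `horizon` steps."""

    def __init__(self, horizon: int):
        self.horizon = horizon
        self.steps = 0

    def step(self, action):
        self.steps += 1
        terminated = self.steps >= self.horizon
        return self.steps, 1.0, terminated, False, {}


def step_with_skip(env, action, skip: int):
    """Repeat `action` up to `skip` times and sum the rewards."""
    total_reward = 0.0
    for _ in range(skip):
        state, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break  # never step an environment that has already ended
    return state, total_reward, terminated, truncated, info


stub_env = CountingEnv(horizon=6)
first = step_with_skip(stub_env, action=None, skip=4)   # uses steps 1 to 4
second = step_with_skip(stub_env, action=None, skip=4)  # ends early at step 6
print(first[1], second[1], second[2])
```

The second call accumulates only two rewards because the episode ends before the four repeats are used up.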


In [None]:
def get_environment_continuous():
    """
    Create a continuous version of the CarRacing-v3 environment with appropriate wrappers.
    """
    env_cont = gym.make(
        "CarRacing-v3",
        render_mode="rgb_array",
        continuous=True
    )
    env_cont = SkipFrame(env_cont, skip=4)
    env_cont = gym_wrap.GrayscaleObservation(env_cont)
    env_cont = gym_wrap.ResizeObservation(env_cont, shape=(84, 84))
    env_cont = gym_wrap.FrameStackObservation(env_cont, stack_size=4)
    return env_cont

### CNN

In [None]:
class CNN(nn.Module):
    """
    A Convolutional Neural Network (CNN) for feature extraction from high-dimensional input.

    Attributes:
        net (nn.Sequential): The sequential model defining the CNN architecture.

    Methods:
        __init__(input_dimensions, output_dimensions):
            Initializes the CNN with the given input and output dimensions.
        forward(input):
            Defines the forward pass of the network.
    """
    def __init__(self, input_dimensions, output_dimensions):
        super().__init__()
        channel_n, height, width = input_dimensions

        if height != 84 or width != 84:
            raise ValueError(f"Invalid input shape ({height}, {width}). Expected: (84, 84)")

        self.net = nn.Sequential(
            nn.Conv2d(in_channels=channel_n, out_channels=16, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(in_channels=16, out_channels=32, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(2592, 256),
            nn.ReLU(),
            nn.Linear(256, output_dimensions),
        )

    def forward(self, input):
        return self.net(input)
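
The input size of the first linear layer, 2592, follows from the two convolutions applied to an $84 \times 84$ input. The sketch below recomputes it with the standard output-size formula for an unpadded convolution; the helper name `conv_output_size` is an assumption made for this example.

```python
def conv_output_size(size: int, kernel_size: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


after_conv1 = conv_output_size(84, kernel_size=8, stride=4)           # first Conv2d
after_conv2 = conv_output_size(after_conv1, kernel_size=4, stride=2)  # second Conv2d

# 32 output channels times the remaining spatial grid
flattened = 32 * after_conv2 * after_conv2
print(after_conv1, after_conv2, flattened)
```

This is also why the constructor rejects inputs that are not $84 \times 84$: any other size would change the flattened dimension and break the hard-coded `nn.Linear(2592, 256)`.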

## CEM

### Agent

In [None]:
class Agent_CEM:
    """
    Agent using the Cross-Entropy Method (CEM) for policy optimization.

    Attributes:
        model (torch.nn.Module): Neural network used as the policy.
        mean (torch.Tensor): Flat vector with the current parameters (mean of the distribution).
        sigma (float): Standard deviation for sampling.
        population_size (int): Number of candidates per generation.
        elite_frac (float): Fraction of the best candidates (elite).
        num_elites (int): Number of elites (population_size * elite_frac).
        device (torch.device): Device for computation.
        dir_models (Path): Directory to save models.
        dir_logs (Path): Directory to save logs.

    Methods:
        get_action(state): Returns the continuous action for a given state.
        sample_candidate(): Generates a candidate (parameter vector) with noise.
        update_policy(candidates, rewards): Updates the policy using the elite candidates.
        save(save_name): Saves the current model and distribution parameters.
        load(model_name): Loads saved model and parameters.
        write_log(...): Records training metrics in a CSV file.
    """

    def __init__(self, state_shape, action_dim, device, directory_models, directory_logs, CNN,
                 population_size=50, elite_frac=0.2, initial_std=0.1, load_state=False, load_model=None):
        self.device = device
        self.dir_models = directory_models
        self.dir_logs = directory_logs

        self.population_size = population_size
        self.elite_frac = elite_frac
        self.num_elites = int(self.population_size * self.elite_frac)
        self.sigma = initial_std

        # Initialize the policy network (CNN)
        self.model = CNN(state_shape, action_dim).float().to(self.device)

        # Initialize the parameter vector (distribution mean) from the model
        self.mean = parameters_to_vector(self.model.parameters()).detach().clone()

        if load_state:
            if load_model is None:
                raise ValueError("Specify the name of the model to load.")
            self.load(load_model)

    def get_action(self, state):
        """
        Return the continuous action for the given state.

        For continuous CarRacing-v3, the action space is:
         - steering: [-1, 1]
         - gas: [0, 1]
         - brake: [0, 1]

        The network outputs three values, which are squashed into these ranges
        (tanh for steering, sigmoid for gas and brake).
        """
        state_tensor = torch.tensor(state, dtype=torch.float32, device=self.device).unsqueeze(0)
        with torch.no_grad():
            action = self.model(state_tensor)
        steering = torch.tanh(action[0, 0])
        gas = torch.sigmoid(action[0, 1])
        brake = torch.sigmoid(action[0, 2])
        return np.array([steering.item(), gas.item(), brake.item()])

    def sample_candidate(self):
        """
        Generate a candidate parameter vector by adding Gaussian noise to the current mean.
        """
        noise = torch.randn_like(self.mean) * self.sigma
        candidate = self.mean + noise
        return candidate, noise

    def update_policy(self, candidates, rewards):
        """
        Update the policy from the elite candidates.
        """
        rewards = np.array(rewards)
        elite_indices = rewards.argsort()[-self.num_elites:]
        elites = [candidates[i] for i in elite_indices]
        new_mean = torch.stack(elites, dim=0).mean(dim=0)
        # Update sigma as the per-parameter standard deviation across the elites, averaged
        new_sigma = torch.stack(elites, dim=0).std(dim=0).mean().item()
        self.mean = new_mean
        self.sigma = new_sigma
        # Copy the new mean into the model parameters
        vector_to_parameters(self.mean, self.model.parameters())

    def save(self, save_name='CEM'):
        """
        Save the model and the distribution parameters to a file.
        """
        save_path = str(self.dir_models / f"{save_name}.pt")
        torch.save({
            'model_state_dict': self.model.state_dict(),
            'mean': self.mean,
            'sigma': self.sigma
        }, save_path)
        print(f"Model saved to {save_path}")

    def load(self, model_name):
        """
        Load the model and the distribution parameters.
        """
        # map_location lets a checkpoint saved on GPU be loaded on a CPU-only machine
        loaded = torch.load(str(self.dir_models / model_name), map_location=self.device)
        self.model.load_state_dict(loaded['model_state_dict'])
        self.mean = loaded['mean']
        self.sigma = loaded['sigma']
        vector_to_parameters(self.mean, self.model.parameters())
        print(f"Model {model_name} loaded.")

    def write_log(self, generations, best_rewards, avg_rewards, sigmas, log_filename='log_CEM.csv'):
        """
        Write the training logs to a CSV file, one row per metric.
        """
        rows = [
            ['generation'] + generations,
            ['best_reward'] + best_rewards,
            ['avg_reward'] + avg_rewards,
            ['sigma'] + sigmas
        ]
        # newline='' prevents blank lines between rows on Windows
        with open(str(self.dir_logs / log_filename), 'w', newline='') as csvfile:
            csvwriter = csv.writer(csvfile)
            csvwriter.writerows(rows)
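
The sample, rank, and refit cycle used by `Agent_CEM` can be tried on a toy problem before running it on CarRacing. The sketch below is not the agent itself: it uses only the standard library, replaces the episode reward with the negative squared distance to a hypothetical target vector, and keeps a single scalar `sigma` as the agent does. All names and default values are assumptions chosen for illustration.

```python
import random
import statistics


def cem_toy(target, population_size=50, elite_frac=0.2,
            initial_std=1.0, generations=30, seed=0):
    """Cross-Entropy Method on a toy objective: get close to `target`."""
    rng = random.Random(seed)
    mean = [0.0] * len(target)
    sigma = initial_std
    # At least two elites are needed to compute a standard deviation
    num_elites = max(2, int(population_size * elite_frac))

    def score(candidate):
        # Higher is better, like an episode reward
        return -sum((c - t) ** 2 for c, t in zip(candidate, target))

    for _ in range(generations):
        # Sample candidates around the current mean
        candidates = [
            [m + rng.gauss(0.0, sigma) for m in mean]
            for _ in range(population_size)
        ]
        # Keep the best-scoring fraction
        elites = sorted(candidates, key=score)[-num_elites:]
        columns = list(zip(*elites))
        # Refit: new mean per parameter, sigma averaged over parameters
        mean = [statistics.fmean(column) for column in columns]
        sigma = statistics.fmean([statistics.stdev(column) for column in columns])
    return mean, sigma


toy_target = [1.0, -2.0, 0.5]
toy_mean, toy_sigma = cem_toy(toy_target)
print([round(m, 3) for m in toy_mean], toy_sigma)
```

Because `sigma` is refit from the elites at every generation, it shrinks as the population concentrates, which is the same mechanism that drives the `Sigma` value reported during training.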


In [None]:
def evaluate_policy(agent, env, episodes=1):
    """
    Evaluate the current policy of the agent in the environment, returning the average reward.
    """
    rewards = []
    for _ in range(episodes):
        state, info = env.reset()
        done = False
        total_reward = 0
        while not done:
            action = agent.get_action(state)
            state, reward, terminated, truncated, info = env.step(action)
            total_reward += reward
            done = terminated or truncated
        rewards.append(total_reward)
    return np.mean(rewards)

In [None]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# CEM hyperparameters
POPULATION_SIZE = 10
ELITE_FRAC = 0.2
INITIAL_STD = 0.1
GENERATIONS = 10        # Number of generations
EVAL_EPISODES = 1      # Episodes to evaluate each candidate

env = get_environment_continuous()
state, info = env.reset()

# For continuous control, the action dimension is 3
action_dim = 3

agent = Agent_CEM(
    state_shape=state.shape,
    action_dim=action_dim,
    device=DEVICE,
    directory_models=DIRECTORY_MODELS,
    directory_logs=DIRECTORY_LOGS,
    CNN=CNN,
    population_size=POPULATION_SIZE,
    elite_frac=ELITE_FRAC,
    initial_std=INITIAL_STD,
    load_state=False
)

# Lists for logging metrics
generation_numbers = []
best_rewards = []
average_rewards = []
sigma_values = []
generation_dates = []
generation_times = []

interval_log = 5

# Main training loop (by generation)
for generation in range(1, GENERATIONS + 1):
    candidates = []
    rewards = []

    # Generate the population and evaluate each candidate
    for i in range(POPULATION_SIZE):
        candidate_params, _ = agent.sample_candidate()
        # Apply candidate parameters to the model
        vector_to_parameters(candidate_params, agent.model.parameters())
        candidate_reward = evaluate_policy(agent, env, episodes=EVAL_EPISODES)
        candidates.append(candidate_params)
        rewards.append(candidate_reward)

    best_reward = np.max(rewards)
    avg_reward = np.mean(rewards)

    # Update the policy with elite candidates
    agent.update_policy(candidates, rewards)

    # Record metrics for the current generation
    generation_numbers.append(generation)
    best_rewards.append(best_reward)
    average_rewards.append(avg_reward)
    sigma_values.append(agent.sigma)

    now = datetime.datetime.now()
    generation_dates.append(now.date().strftime('%Y-%m-%d'))
    generation_times.append(now.time().strftime('%H:%M:%S'))

    print(f"Generation {generation}: Best Reward = {best_reward:.2f}, Average Reward = {avg_reward:.2f}, Sigma = {agent.sigma:.4f}")

    plot_reward(generation, best_rewards, sigma_values)

    # Save the model every interval_log generations
    if generation % interval_log == 0:
        agent.save(save_name=f"CEM_gen_{generation}")
        agent.write_log(
            generation_numbers,
            best_rewards,
            average_rewards,
            sigma_values
        )

# Final save and log writing
agent.save()
agent.write_log(
    generation_numbers,
    best_rewards,
    average_rewards,
    sigma_values
)

env.close()
plt.ioff()
plt.show()

In [None]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_ID = 10  # Generation of the checkpoint to load; with GENERATIONS = 10 and interval_log = 5, only 5 and 10 exist
NUM_EPISODES = 3

# Directory where videos will be saved
VIDEO_DIRNAME = "cem_videos"

# Remove old video files (RecordVideo uses the "rl-video" prefix by default)
for old_video in (DIRECTORY_FIGURES / VIDEO_DIRNAME).glob("rl-video-episode-*.mp4"):
    old_video.unlink(missing_ok=True)

# Function to create the CarRacing environment with video recording
def get_environment_evaluation():
    """
    Create a continuous version of the CarRacing-v3 environment with video recording.

    Uses the same preprocessing as training (frame skip, grayscale, resize, frame stack)
    and a separate name, so the training factory is not overwritten.
    """
    env = gym.make("CarRacing-v3", render_mode="rgb_array", continuous=True)
    # Record on the raw environment so that skipped frames still appear in the video
    env = gym.wrappers.RecordVideo(env, video_folder=str(DIRECTORY_FIGURES / VIDEO_DIRNAME), episode_trigger=lambda x: True)
    env = SkipFrame(env, skip=4)
    env = gym_wrap.GrayscaleObservation(env)
    env = gym_wrap.ResizeObservation(env, shape=(84, 84))
    env = gym_wrap.FrameStackObservation(env, stack_size=4)
    env = gym.wrappers.RecordEpisodeStatistics(env, buffer_length=NUM_EPISODES)
    return env

env = get_environment_evaluation()
state, info = env.reset()

# For continuous control, the action dimension is 3
action_dim = 3

agent = Agent_CEM(
    state_shape=state.shape,
    action_dim=action_dim,
    device=DEVICE,
    directory_models=DIRECTORY_MODELS,
    directory_logs=DIRECTORY_LOGS,
    CNN=CNN,
    load_state=True,
    load_model=f"CEM_gen_{MODEL_ID}.pt"
)
agent.sigma = 0  # No exploration during evaluation

# Run episodes and record videos
for episode_index in range(NUM_EPISODES):
    total_reward = 0.0
    state, info = env.reset()
    episode_over = False

    while not episode_over:
        action = agent.get_action(state)
        state, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        episode_over = terminated or truncated

    print(f"Episode {episode_index}, Total Reward: {total_reward:.2f}")

# Print episode statistics
print(f"Episode total rewards: {env.return_queue}")
print(f"Episode lengths: {env.length_queue}")

env.close()


In [None]:
Video(DIRECTORY_FIGURES / VIDEO_DIRNAME / "rl-video-episode-0.mp4", embed=True, html_attributes="controls autoplay loop")

In [None]:
Video(DIRECTORY_FIGURES / VIDEO_DIRNAME / "rl-video-episode-1.mp4", embed=True, html_attributes="controls autoplay loop")

In [None]:
Video(DIRECTORY_FIGURES / VIDEO_DIRNAME / "rl-video-episode-2.mp4", embed=True, html_attributes="controls autoplay loop")

In [None]:
Video(DIRECTORY_FIGURES / VIDEO_DIRNAME / "rl-video-episode-3.mp4", embed=True, html_attributes="controls autoplay loop")