# Metadrive

We now introduce the environment we will be using for this notebook: [Metadrive](https://github.com/metadriverse/metadrive)

Check out the notebook [here](../quickstart.ipynb) for a quick introduction to Metadrive as well as how to install it.

Let's test out the environment by creating an instance of it and taking a random action at each timestep.

In [None]:
# We need to import metadrive to register the environments
import metadrive
import gymnasium as gym 

env = gym.make("MetaDrive-validation-v0", config={"use_render": True})
env.reset()
for i in range(100):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    print(reward)  # per-step reward, listed in the output below
    if terminated or truncated:
        break
env.close()

Known pipe types:
  glxGraphicsPipe
(1 aux display modules not yet loaded.)


8.31221473623458e-05
0.009233961947637252
0.029430664503088177
0.0385080765555831
0.051269874340499226
0.024825500350000494
-0.004269349232754249
-0.012220657157477671
-0.011318983514367316
0.010826522401143868
0.019600871726443807
0.00518795331222229
0.0002647879602980098
-0.00020965787401731797
0.0027548640214803766
0.0036451430530871957
0.00203804888209766
4.7831459837470954e-05
-0.0008139261675983708
0.0002649450169180858
-0.004138602420487799
-0.003833691689167148
0.005658243306209003
0.0019439392483036298
0.0017501587413670697
0.0013271993844638941
-0.0012417981516064886
0.00047796214394067146
0.005576002975745207
0.011957905409772306
0.019367010686041318
0.03269041932155331
0.02264725586738955
0.010750797150814074
0.023785212217481203
0.019831233304526522
0.015112038625447117
0.017808277012386936
0.017960457098097456
0.020937374686892773
-0.0005746566562039816
-0.001424047920491952
0.0013622548321156132
-0.0023082070131272346
-0.0007034053034793181
0.0011865168268845503
0.000501

Metadrive uses the [Farama Gymnasium](https://gymnasium.farama.org/) API, a standard interface for interacting with environments. A few functions and properties are worth knowing:
1. `reset()`: Resets the environment to its initial state and returns the initial observation.
    * Documentation: https://gymnasium.farama.org/api/env/#gymnasium.Env.reset
2. `step(action)`: Takes an action and returns the next observation, the reward for taking the action, whether the episode is terminated, whether the episode is truncated (ran out of time), and any additional information.
    * Documentation: https://gymnasium.farama.org/api/env/#gymnasium.Env.step
3. `close()`: Closes the environment.
    * Documentation: https://gymnasium.farama.org/api/env/#gymnasium.Env.close
4. `action_space`: The action space of the environment, which tells us the shape and bounds of the action space.
    * Documentation: https://gymnasium.farama.org/api/env/#gymnasium.Env.action_space
5. `observation_space`: The observation space of the environment, which tells us the shape and bounds of the observation space.
    * Documentation: https://gymnasium.farama.org/api/env/#gymnasium.Env.observation_space
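
To make the call pattern concrete, the sketch below drives a tiny stand-in environment through `reset()`, `step(action)`, and `close()`. `CountdownEnv` and `run_episode` are hypothetical names written for illustration; they are not part of Gymnasium or Metadrive and only mimic the return signatures listed above.

```python
class CountdownEnv:
    """Hypothetical toy environment that mimics the Gymnasium call pattern."""

    def __init__(self, start: int = 3, time_limit: int = 10):
        self.start = start
        self.time_limit = time_limit
        self.counter = start
        self.steps = 0

    def reset(self):
        # Returns the initial observation and an info dict
        self.counter = self.start
        self.steps = 0
        return self.counter, {}

    def step(self, action: int):
        # Action 1 decrements the counter, action 0 does nothing
        self.steps += 1
        self.counter -= action
        reward = float(action)
        terminated = self.counter <= 0             # reached a terminal state
        truncated = self.steps >= self.time_limit  # ran out of time
        return self.counter, reward, terminated, truncated, {}

    def close(self):
        pass


def run_episode(env, policy):
    """Run one episode and report the return and how the episode ended."""
    obs, info = env.reset()
    total_reward = 0.0
    while True:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total_reward += reward
        if terminated or truncated:
            break
    env.close()
    return total_reward, terminated, truncated
```

A policy that always returns 1 ends the episode through `terminated`, while a policy that always returns 0 runs until the time limit and ends through `truncated`.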

Let's take a closer look at what our observation and action spaces are:

In [None]:
import metadrive
import gymnasium as gym 

env = gym.make("MetaDrive-validation-v0", config={"use_render": False})
print("Observation Space:", env.observation_space)
print("Action Space:", env.action_space)


Observation Space: Box(-0.0, 1.0, (259,), float32)
Action Space: Box(-1.0, 1.0, (2,), float32)


Box spaces represent a continuous space. As the documentation states, a Box represents the Cartesian product of $n$ closed intervals.
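
As a rough illustration of the "Cartesian product of closed intervals" idea, the sketch below samples a point from such a product and checks membership using only the standard library. The helper names `sample_box` and `box_contains` are hypothetical and are not Gymnasium functions.

```python
import random


def sample_box(low, high):
    """Draw one point uniformly from the product of intervals [low[i], high[i]]."""
    return [random.uniform(lo, hi) for lo, hi in zip(low, high)]


def box_contains(point, low, high):
    """Check that every coordinate lies inside its own closed interval."""
    return len(point) == len(low) and all(
        lo <= x <= hi for x, lo, hi in zip(point, low, high)
    )


# Bounds matching the shape of a 2-dimensional space on [-1.0, 1.0]
low, high = [-1.0, -1.0], [1.0, 1.0]
point = sample_box(low, high)
print(point, box_contains(point, low, high))
```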

For our observation space, we have a 259-dimensional vector, where each element is in the range $[0.0, 1.0]$.
* Documentation: https://metadrive-simulator.readthedocs.io/en/latest/observation.html

For the action space, we have a 2-dimensional vector, where each element is in the range $[-1.0, 1.0]$. The first element represents the steering angle, and the second element represents the throttle.
* Documentation: https://metadrive-simulator.readthedocs.io/en/latest/action_and_dynamics.html


There's one problem you might see here: we previously stated that we were focusing on the discrete case of the policy gradient. Yet our action space seems to be continuous. So what gives?

The answer is that we are going to discretize our action space. We will discretize the steering angle into 3 bins, and the throttle into 3 bins. This will give us a total of 9 actions. We'll write a function to convert our discrete action into a continuous action.

In [None]:
import numpy as np
import numpy.typing as npt

def discrete2continuous(action: int) -> npt.NDArray[np.float32]:
    """
    Convert a discrete action in [0, 9) to a continuous [steering, throttle] action
    """
    assert 0 <= action < 9
    throttle_magnitude = 1.0
    steering_magnitude = 0.6
    # The action vector is ordered [steering, throttle], so steering comes first
    match action:
        case 0:
            return np.array([steering_magnitude, throttle_magnitude], dtype=np.float32)
        case 1:
            return np.array([0, throttle_magnitude], dtype=np.float32)
        case 2:
            return np.array([-steering_magnitude, throttle_magnitude], dtype=np.float32)
        case 3:
            return np.array([steering_magnitude, 0], dtype=np.float32)
        case 4:
            return np.array([0, 0], dtype=np.float32)
        case 5:
            return np.array([-steering_magnitude, 0], dtype=np.float32)
        case 6:
            return np.array([steering_magnitude, -throttle_magnitude], dtype=np.float32)
        case 7:
            return np.array([0, -throttle_magnitude], dtype=np.float32)
        case 8:
            return np.array([-steering_magnitude, -throttle_magnitude], dtype=np.float32)
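
The 3 × 3 grid of actions can also be enumerated with integer division instead of an explicit case per action. The sketch below is one possible way to do that, assuming the `[steering, throttle]` order described for the action space and the magnitudes 0.6 and 1.0 used above; `discrete_to_pair` and the two level tuples are hypothetical names.

```python
# Levels indexed 0, 1, 2: positive, zero, negative
STEERING_LEVELS = (0.6, 0.0, -0.6)
THROTTLE_LEVELS = (1.0, 0.0, -1.0)


def discrete_to_pair(action: int) -> tuple[float, float]:
    """Map an action index in [0, 9) to a (steering, throttle) pair."""
    if not 0 <= action < 9:
        raise ValueError(f"action must be in [0, 9), got {action}")
    # The quotient selects the throttle level, the remainder the steering level
    throttle_idx, steering_idx = divmod(action, 3)
    return STEERING_LEVELS[steering_idx], THROTTLE_LEVELS[throttle_idx]


for a in range(9):
    print(a, discrete_to_pair(a))
```

With this layout, actions 0 to 2 accelerate, 3 to 5 coast, and 6 to 8 apply negative throttle.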


In [None]:
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F

GAMMA = 0.75  # Discount factor used when computing the reward-to-go
ENTROPY_BONUS = 0.1

def deviceof(m: nn.Module) -> torch.device:
    return next(m.parameters()).device

class Actor(nn.Module):
    def __init__(self):
        super(Actor, self).__init__()
        self.num_actions = 9
        self.fc1 = nn.Linear(259, 128)
        self.fc2 = nn.Linear(128, self.num_actions)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        # output in (Batch, Action)
        output = F.softmax(x, dim=1)
        return output


def compute_policy_gradient_loss(
    # Current policy network's probability of choosing an action
    # in (Batch, Action)
    pi_theta_given_st: torch.Tensor,
    # One hot encoding of which action was chosen
    # in (Batch, Action)
    a_t: torch.Tensor,
    # Rewards To Go for the chosen action
    # in (Batch,)
    R_t: torch.Tensor,
) -> torch.Tensor:
    r"""
    Computes the policy gradient loss for a vector of examples, and reduces with mean.


    https://spinningup.openai.com/en/latest/algorithms/vpg.html#key-equations

    The standard policy gradient is given by the expected value over trajectories of:

    :math:`\sum_{t=0}^{T} \nabla_{\theta} (\log \pi_{\theta}(a_t|s_t))R_t`
    
    where:
    * :math:`\pi_{\theta}(a_t|s_t)` is the current policy's probability to perform action :math:`a_t` given :math:`s_t`
    * :math:`R_t` is the rewards-to-go from the state at time t to the end of the episode from which it came.
    """

    # here, the multiplication and sum extract the probability of the chosen action
    # in (Batch,)
    pi_theta_at_given_st = torch.sum(pi_theta_given_st * a_t, 1)

    # Note: this loss doesn't actually represent whether the action was good or bad
    # it is a dummy loss, that is only used to compute the gradient

    # Recall that the policy gradient for a single transition (state-action pair) is given by:
    # $\nabla_{\theta} \log \pi_{\theta}(a_t|s_t)R_t$
    # However, it's easier to work with losses, rather than raw gradients.
    # Therefore we construct a loss, that when differentiated, gives us the policy gradient.
    # this loss is given by:
    # $-\log \pi_{\theta}(a_t|s_t)R_t$

    # in (Batch,)
    policy_loss_per_example = -torch.log(pi_theta_at_given_st) * R_t

    # in (Batch,)
    entropy_per_example = -torch.sum(
        torch.log(pi_theta_given_st) * pi_theta_given_st, 1
    )

    # we reward entropy, since excessive certainty indicates the model is 'overfitting'
    loss_per_example = policy_loss_per_example - ENTROPY_BONUS * entropy_per_example

    # we take the average loss over all examples
    return loss_per_example.mean()


def train_policygradient(
    actor: Actor,
    actor_optimizer: torch.optim.Optimizer,
    observation_batch: list[npt.NDArray[np.float32]],
    action_batch: list[int],
    value_batch: list[float],
) -> float:
    # assert that the batch_lengths are the same
    assert len(observation_batch) == len(action_batch)
    assert len(observation_batch) == len(value_batch)

    # get device
    device = deviceof(actor)

    # convert data to tensors on correct device

    # in (Batch, Observation)
    observation_batch_tensor = torch.tensor(
        np.stack(observation_batch), dtype=torch.float32, device=device
    )

    # in (Batch,)
    value_batch_tensor = torch.tensor(
        value_batch, dtype=torch.float32, device=device
    )

    # in (Batch, Action)
    chosen_action_tensor = F.one_hot(
        torch.tensor(action_batch).to(device).long(), num_classes=actor.num_actions
    )

    # train actor
    actor_optimizer.zero_grad()
    action_probs = actor.forward(observation_batch_tensor)
    actor_loss = compute_policy_gradient_loss(
        action_probs, chosen_action_tensor, value_batch_tensor
    )
    actor_loss.backward()
    actor_optimizer.step()

    # return the actor loss
    return actor_loss.item()

def compute_rtg(
    trajectory_rewards: npt.NDArray[np.float32],
) -> npt.NDArray[np.float32]:
    """
    Computes the gamma discounted reward-to-go for each state in the trajectory.
    """

    trajectory_len = len(trajectory_rewards)

    v_batch = np.zeros(trajectory_len, dtype=np.float32)

    v_batch[-1] = trajectory_rewards[-1]

    # Use GAMMA to discount future rewards
    for t in reversed(range(trajectory_len - 1)):
        v_batch[t] = trajectory_rewards[t] + GAMMA * v_batch[t + 1]

    return v_batch
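
As a quick worked example of the discounted reward-to-go, the sketch below runs the same backward recursion on a three-step trajectory in plain Python. The function name `reward_to_go` and the toy rewards are illustrative assumptions; the discount of 0.75 matches `GAMMA` above.

```python
def reward_to_go(rewards, gamma):
    """Backward recursion: rtg[t] = rewards[t] + gamma * rtg[t + 1]."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


print(reward_to_go([1.0, 0.0, 2.0], gamma=0.75))  # [2.125, 1.5, 2.0]
```

Reading from the end: the last step keeps its own reward of 2.0, the middle step gets 0.0 + 0.75 × 2.0 = 1.5, and the first step gets 1.0 + 0.75 × 1.5 = 2.125.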

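To get a feel for the loss, the sketch below evaluates the same expression for a single example in plain Python: the negative log-probability of the chosen action times the reward-to-go, minus the entropy bonus. The function name `policy_gradient_loss` and the toy probabilities are hypothetical; the bonus of 0.1 matches `ENTROPY_BONUS` above.

```python
import math


def policy_gradient_loss(probs, action, reward_to_go, entropy_bonus):
    """Single-example loss: -log(pi(a|s)) * R - bonus * entropy(pi)."""
    policy_term = -math.log(probs[action]) * reward_to_go
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return policy_term - entropy_bonus * entropy


# Same chosen action and positive reward-to-go, two different policies
uncertain = policy_gradient_loss([0.5, 0.5], action=0, reward_to_go=2.0, entropy_bonus=0.1)
confident = policy_gradient_loss([0.9, 0.1], action=0, reward_to_go=2.0, entropy_bonus=0.1)
print(uncertain, confident)
```

With a positive reward-to-go, the policy that assigns more probability to the chosen action has the lower loss, so gradient descent pushes probability toward actions that were followed by high returns.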

In [None]:
import numpy as np
import numpy.typing as npt
import torch
import os


EPISODES_PER_BATCH = 200
TRAIN_EPOCHS = 500000
MODEL_SAVE_INTERVAL = 100
SUMMARY_STATS_INTERVAL = 10
RANDOM_SEED = 42

SUMMARY_DIR = './summary'
MODEL_DIR = './models'

# create result directories
os.makedirs(SUMMARY_DIR, exist_ok=True)
os.makedirs(MODEL_DIR, exist_ok=True)

use_cuda = torch.cuda.is_available()
torch.manual_seed(RANDOM_SEED)

cuda = torch.device("cuda")
cpu = torch.device("cpu")

if use_cuda:
    device = cuda
else:
    device = cpu


In [None]:
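# Initialize Network
ACTOR_LR = 1e-4  # Lower lr stabilises training greatly
actor = Actor().to(device)
# Assumption: the original cell does not name an optimizer; Adam is used here
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=ACTOR_LR)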


In [None]:
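# Train
from torch.utils.tensorboard import SummaryWriter

def collect_trajectory(env, actor: Actor):
    """
    Plays one episode by sampling actions from the actor's policy.
    Note: this helper is a reconstruction; the original cell called an undefined `play` function.
    """
    ep_s, ep_a, ep_r = [], [], []
    obs, info = env.reset()
    while True:
        with torch.no_grad():
            obs_tensor = torch.tensor(obs, dtype=torch.float32, device=deviceof(actor))
            # in (Action,)
            probs = actor(obs_tensor.unsqueeze(0))[0]
        action = int(torch.multinomial(probs, 1).item())
        next_obs, reward, terminated, truncated, info = env.step(discrete2continuous(action))
        ep_s.append(obs)
        ep_a.append(action)
        ep_r.append(float(reward))
        obs = next_obs
        if terminated or truncated:
            break
    ep_rtg = compute_rtg(np.array(ep_r, dtype=np.float32)).tolist()
    return ep_s, ep_a, ep_r, ep_rtg

# Assumption: `writer` was undefined in the original; a TensorBoard SummaryWriter is used here
writer = SummaryWriter(SUMMARY_DIR)

# Reuses `env`, the non-rendering environment created in the observation space cell above
for step in range(TRAIN_EPOCHS):
    s_batch: list[npt.NDArray[np.float32]] = []
    a_batch: list[int] = []
    rtg_batch: list[float] = []
    episode_returns: list[float] = []

    for _ in range(EPISODES_PER_BATCH):
        # play one episode
        ep_s, ep_a, ep_r, ep_rtg = collect_trajectory(env, actor)

        # now update the minibatch
        s_batch += ep_s
        a_batch += ep_a
        rtg_batch += ep_rtg
        episode_returns.append(sum(ep_r))

    actor_loss = train_policygradient(
        actor,
        actor_optimizer,
        s_batch,
        a_batch,
        rtg_batch
    )

    if step % SUMMARY_STATS_INTERVAL == 0:
        writer.add_scalar('actor_loss', actor_loss, step)
        writer.add_scalar('avg_episode_return', float(np.mean(episode_returns)), step)

    if step % MODEL_SAVE_INTERVAL == 0:
        # Save the neural net parameters to disk.
        torch.save(actor.state_dict(), f"{MODEL_DIR}/nn_model_ep_{step}_actor.ckpt")

env.close()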