# Metadrive

We now introduce the environment we will be using for this notebook: [Metadrive](https://github.com/metadriverse/metadrive)

Check out the notebook [here](../quickstart.ipynb) for a quick introduction to Metadrive as well as how to install it.

Let's get started by doing a bunch of imports.

In [1]:
# We need to import metadrive to register the environments
import metadrive
import gymnasium as gym
import typing
import numpy as np
import numpy.typing as npt
import torch
import torch.nn as nn
import torch.nn.functional as F


Successfully registered the following environments: ['MetaDrive-validation-v0', 'MetaDrive-10env-v0', 'MetaDrive-100envs-v0', 'MetaDrive-1000envs-v0', 'SafeMetaDrive-validation-v0', 'SafeMetaDrive-10env-v0', 'SafeMetaDrive-100envs-v0', 'SafeMetaDrive-1000envs-v0', 'MARLTollgate-v0', 'MARLBottleneck-v0', 'MARLRoundabout-v0', 'MARLIntersection-v0', 'MARLParkingLot-v0', 'MARLMetaDrive-v0'].


Let's test out the environment by creating an instance of it and taking a random action at each timestep.

In [2]:
# horizon represents the number of steps in an episode before truncation
env = gym.make("MetaDrive-validation-v0", config={"use_render": True, "horizon": 100})
env.reset()
while True:
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        break
env.close()

  logger.deprecation(
Known pipe types:
  glxGraphicsPipe
(1 aux display modules not yet loaded.)
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 


Metadrive uses the [Farama Gymnasium](https://gymnasium.farama.org/), which has a standard API for interacting with environments. There are a couple of functions and properties that are good to know about:
1. `reset()`: Resets the environment to its initial state and returns the initial observation.
    * Documentation: https://gymnasium.farama.org/api/env/#gymnasium.Env.reset
2. `step(action)`: Takes an action and returns the next observation, the reward for taking the action, whether the episode is terminated, whether the episode is truncated (ran out of time), and any additional information.
    * Documentation: https://gymnasium.farama.org/api/env/#gymnasium.Env.step
3. `close()`: Closes the environment.
    * Documentation: https://gymnasium.farama.org/api/env/#g1ymnasium.Env.close
4. `action_space`: The action space of the environment, which tells us the shape and bounds of the action space.
    * Documentation: https://gymnasium.farama.org/api/env/#gymnasium.Env.action_space
5. `observation_space`: The observation space of the environment, which tells us the shape and bounds of the observation space.
    * Documentation: https://gymnasium.farama.org/api/env/#gymnasium.Env.observation_space

Let's take a closer look at what our observation and action spaces are:

In [3]:
env = gym.make("MetaDrive-validation-v0", config={"use_render": False, "horizon": 100})
print("Observation Space:", env.observation_space)
print("Action Space:", env.action_space)
env.close()

Observation Space: Box(-0.0, 1.0, (259,), float32)
Action Space: Box(-1.0, 1.0, (2,), float32)


Box spaces represent a continuous space. As the documentation states, a Box represents the Cartesian product of $n$ closed intervals.

For our observation space, we have a 259 dimensional vector, where each element is in the range $[0.0, 1.0]$.
* Documentation: https://metadrive-simulator.readthedocs.io/en/latest/observation.html

For the action space, we have a 2 dimensional vector, where each element is in the range $[-1.0, 1.0]$. The first element represents the steering angle, and the second element represents the throttle.
* Documentation: https://metadrive-simulator.readthedocs.io/en/latest/action_and_dynamics.html


There's one problem you might see here: we previously stated that we were focusing on the discrete case of the policy gradient. Yet our action space seems to be continuous. So what gives?

The answer is that we are going to discretize our action space. We will discretize the steering angle into 2 bins, and the throttle into 2 bins. This will give us a total of 4 actions. We'll write a function to convert our discrete action into a continuous action, which we can provide to the environment.

In [4]:
NUM_ACTIONS = 4

def discrete2continuous(action:int) -> npt.NDArray[np.float32]:
    """
    Convert discrete action to continuous action
    """
    assert 0 <= action < 4
    throttle_magnitude = 1.0
    steering_magnitude = 0.6
    match action:
        case 0:
            return np.array([throttle_magnitude, 0.0])
        case 1:
            return np.array([0.0, steering_magnitude])
        case 2:
            return np.array([0.0, -steering_magnitude])
        case 3:
            return np.array([-throttle_magnitude, 0.0])


With this, we can write a function to collect a trajectory given a policy.

In [5]:
def collect_trajectory(env:gym.Env, policy:typing.Callable[[npt.NDArray], int]) -> tuple[list[npt.NDArray], list[int], list[float]]:
    """
    Collect a trajectory from the environment using the given policy
    """
    observations = []
    actions = []
    rewards = []
    obs, info = env.reset()
    
    while True:
        observations.append(obs)
        action = policy(obs)
        actions.append(action)
        obs, reward, terminated, truncated, info = env.step(discrete2continuous(action))
        rewards.append(reward)
        if terminated or truncated:
            break

    return observations, actions, rewards

Let's test out the function:

In [6]:
env = gym.make("MetaDrive-validation-v0", config={"use_render": False, "horizon": 100})
# horizon is the max number of steps in a trajectory
def random_policy(obs:npt.NDArray) -> int:
    """
    A random policy that returns a random action
    """
    return np.random.randint(0, NUM_ACTIONS)

obs, actions, rewards = collect_trajectory(env, random_policy)
# print the first 10 observations, actions, and rewards
print("Observations:", obs[:10])
print("Actions:", actions[:10])
print("Rewards:", rewards[:10])
env.close()

INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 


Observations: [array([0.09722222, 0.4861111 , 0.5       , 0.01234568, 0.5       ,
       0.5       , 0.5       , 0.        , 0.5       , 0.55      ,
       0.465     , 0.        , 0.5       , 0.5       , 0.95      ,
       0.46500003, 0.        , 0.5       , 0.5       , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.  

Our `collect_trajectory` function allows us to gather rewards from the trajectory. However, recall that what we'll actually train the network on is the reward-to-go. Let's now create the function./

In [7]:
GAMMA = 0.9

def rewards_to_go(trajectory_rewards: list[float]) -> list[float]:
    """
    Computes the gamma discounted reward-to-go for each state in the trajectory.
    """

    trajectory_len = len(trajectory_rewards)

    v_batch = np.zeros(trajectory_len)

    v_batch[-1] = trajectory_rewards[-1]

    # Use GAMMA to decay the advantage
    for t in reversed(range(trajectory_len - 1)):
        v_batch[t] = trajectory_rewards[t] + GAMMA * v_batch[t + 1]

    return list(v_batch)

So, now that we have a function that lets us go from policies to trajectories, we should work on creating a neural network based policy. The network should take in an observation, and return a probability for each number between 0 and 9.

We're going to keep the network fairly small for now.

In [8]:
# policy network
class Policy(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(259, 128)
        self.fc2 = nn.Linear(128, NUM_ACTIONS)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        # output in (Batch, Width)
        output = F.softmax(x, dim=1)
        return output

Let's define a policy function that allows us to sample an action:

In [9]:
def deviceof(m: nn.Module) -> torch.device:
    """
    Get the device of the given module
    """
    return next(m.parameters()).device

def nn_policy(net:Policy, obs:npt.NDArray) -> int:
    """
    A neural network policy that returns an action based on the given observation
    """
    # convert observation to a tensor
    obs_tensor = torch.from_numpy(obs).float().to(deviceof(net))
    # add batch dimension
    obs_tensor = obs_tensor.unsqueeze(0)
    # get the action probabilities
    action_probs = net(obs_tensor)
    # sample an action from the action probabilities
    action = torch.multinomial(action_probs, 1)
    return action.item()

Let's test it out (untrained)

In [10]:
env = gym.make("MetaDrive-validation-v0", config={"use_render": False, "horizon": 100})

policy = Policy()
obs, actions, rewards = collect_trajectory(env, lambda obs: nn_policy(policy, obs))
# print the first 10 observations, actions, and rewards
print("Observations:", obs[:10])
print("Actions:", actions[:10])
print("Rewards:", rewards[:10])

env.close()

INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 


Observations: [array([0.09722222, 0.4861111 , 0.5       , 0.01234568, 0.5       ,
       0.5       , 0.5       , 0.        , 0.5       , 0.55      ,
       0.465     , 0.        , 0.5       , 0.5       , 0.95      ,
       0.46500003, 0.        , 0.5       , 0.5       , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.        , 1.        ,
       1.        , 1.        , 1.        , 1.  

As you can see, it performs basically random actions, since it hasn't been trained yet.

Let's now work on the meat of the problem: the Policy Gradient algorithm.

In [11]:
ENTROPY_BONUS = 0.1

def compute_policy_gradient_loss(
    # Current policy network's probability of choosing an action
    # in (Batch, Action)
    pi_theta_given_st: torch.Tensor,
    # One hot encoding of which action was chosen
    # in (Batch, Action)
    a_t: torch.Tensor,
    # Rewards To Go for the chosen action
    # in (Batch,)
    R_t: torch.Tensor,
) -> torch.Tensor:
    r"""
    Computes the policy gradient loss for a vector of examples, and reduces with mean.

    The standard policy gradient is given by the expected value over trajectories of:

    :math:`\sum_{t=0}^{T} \nabla_{\theta} (\log \pi_{\theta}(a_t|s_t))R_t`
    
    where:
    * :math:`\pi_{\theta}(a_t|s_t)` is the current policy's probability to perform action :math:`a_t` given :math:`s_t`
    * :math:`R_t` is the rewards-to-go from the state at time t to the end of the episode from which it came.
    """

    # here, the multiplication and sum is in order to extract the
    # in (Batch,)
    pi_theta_at_given_st = torch.sum(pi_theta_given_st * a_t, 1)

    # Note: this loss has doesn't actually represent whether the action was good or bad
    # it is a dummy loss, that is only used to compute the gradient

    # Recall that the policy gradient for a single transition (state-action pair) is given by:
    # $\nabla_{\theta} \log \pi_{\theta}(a_t|s_t)R_t$
    # However, it's easier to work with losses, rather than raw gradients.
    # Therefore we construct a loss, that when differentiated, gives us the policy gradient.
    # this loss is given by:
    # $-\log \pi_{\theta}(a_t|s_t)R_t$

    # in (Batch,)
    policy_loss_per_example = -torch.log(pi_theta_at_given_st) * R_t

    # in (Batch,)
    entropy_per_example = -torch.sum(
        torch.log(pi_theta_given_st) * pi_theta_given_st, 1
    )

    # we reward entropy, since excessive certainty indicate the model is 'overfitting'
    loss_per_example = policy_loss_per_example - ENTROPY_BONUS * entropy_per_example

    # we take the average loss over all examples
    return loss_per_example.mean()


def train_policygradient(
    policy: Policy,
    actor_optimizer: torch.optim.Optimizer,
    observation_batch: list[npt.NDArray],
    action_batch: list[int],
    rtg_batch: list[float],
) -> float:
    # assert that the batch_lengths are the same
    assert len(observation_batch) == len(action_batch)
    assert len(observation_batch) == len(rtg_batch)

    # get device
    device = deviceof(policy)

    # convert data to tensors on correct device

    # in (Batch, Width)
    observation_batch_tensor = torch.from_numpy(np.stack(observation_batch)).to(device)

    # in (Batch,)
    rtg_batch_tensor = torch.tensor(
        rtg_batch, dtype=torch.float32, device=device
    )

    # in (Batch, Action)
    chosen_action_tensor = F.one_hot(
        torch.tensor(action_batch).to(device).long(), num_classes=NUM_ACTIONS
    )

    # train actor
    actor_optimizer.zero_grad()
    action_probs = policy.forward(observation_batch_tensor)
    actor_loss = compute_policy_gradient_loss(
        action_probs, chosen_action_tensor, rtg_batch_tensor
    )
    actor_loss.backward()
    actor_optimizer.step()

    # return the respective losses
    return actor_loss.item()


With that, we're done. Let's train!

In [15]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

policy = Policy().to(device)
actor_optimizer = torch.optim.Adam(policy.parameters(), lr=5e-4)

step = 0
rewards = []
losses = []

In [13]:
env = gym.make("MetaDrive-validation-v0", config={"use_render": False, "horizon": 500})

In [16]:
TRAIN_EPOCHS = 500
EPISODES_PER_BATCH = 10

# Train
while step < TRAIN_EPOCHS:
    obs_batch:list[npt.NDArray[np.float32]] = []
    act_batch:list[int] = []
    rtg_batch:list[float] = []
    
    trajectory_returns = []

    for _ in range(EPISODES_PER_BATCH):
        # Collect trajectory
        obs_traj, act_traj, rew_traj = collect_trajectory(env, lambda obs: nn_policy(policy, obs))
        rtg_traj = rewards_to_go(rew_traj)

        # Update batch
        obs_batch.extend(obs_traj)
        act_batch.extend(act_traj)
        rtg_batch.extend(rtg_traj)

        # Update trajectory returns
        trajectory_returns.append(sum(rew_traj))

    policy_loss = train_policygradient(
        policy,
        actor_optimizer,
        obs_batch,
        act_batch,
        rtg_batch
    )

    print(f"Step {step}, Policy Loss: {policy_loss:.3f}, Avg. Returns: {np.mean(trajectory_returns):.3f}")
    rewards.append(np.mean(trajectory_returns))
    losses.append(policy_loss)
    step += 1

INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode

Step 0, Policy Loss: -0.091, Avg. Returns: 1.303


INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode

Step 1, Policy Loss: -0.087, Avg. Returns: 1.663


INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode

Step 2, Policy Loss: -0.075, Avg. Returns: 2.203


INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode

Step 3, Policy Loss: -0.047, Avg. Returns: 3.314


INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode

Step 4, Policy Loss: -0.017, Avg. Returns: 4.501


INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode

Step 5, Policy Loss: 0.036, Avg. Returns: 6.589


INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode

Step 6, Policy Loss: 0.094, Avg. Returns: 9.013


INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode

Step 7, Policy Loss: 0.196, Avg. Returns: 13.775


INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: out_of_road.
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Epis

Step 8, Policy Loss: 0.230, Avg. Returns: 14.223


INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: out_of_road.
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: out_of_road.
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: out_of_road.
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.p

Step 9, Policy Loss: 0.309, Avg. Returns: 17.272


INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: out_of_road.
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: out_of_road.
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:E

Step 10, Policy Loss: 0.315, Avg. Returns: 16.580


INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 
INFO:/home/fidgetsinner/myworkspace/metadrive/metadrive/envs/base_env.py:Episode ended! Index: 0 Reason: max step 


In [None]:
env.close()

Let's visualize how the policy drives:

In [None]:
env = gym.make("MetaDrive-validation-v0", config={"use_render": True, "horizon": 500})
obs, act, rew = collect_trajectory(env, lambda obs: nn_policy(policy, obs))
env.close()

print("Reward:", sum(rew))

For us, we got returns of around 12-14 after training.