In [None]:
# For tips on running notebooks in Google Colab, see
# https://pytorch.org/tutorials/beginner/colab


# Reinforcement Learning (DQN) Tutorial
**Author**: [Adam Paszke](https://github.com/apaszke)
            [Mark Towers](https://github.com/pseudo-rnd-thoughts)

Heavily edited by [Matthew Dupree](https://github.com/4onen) and Heekyung Lee for UCSB ECE 157B 272B Winter 2025

This tutorial shows how to use PyTorch to train a Deep Q Learning (DQN) agent
on the CartPole-v1 task from [Gymnasium](https://gymnasium.farama.org).

**Task**

The agent has to decide between two actions - moving the cart left or
right - so that the pole attached to it stays upright. You can find more
information about the environment and other more challenging environments at
[Gymnasium's website](https://gymnasium.farama.org/environments/classic_control/cart_pole/).

**CartPole**

As the agent observes the current state of the environment and chooses
an action, the environment *transitions* to a new state, and also
returns a reward that indicates the consequences of the action. In this
task, rewards are +1 for every incremental timestep and the environment
terminates if the pole falls over too far or the cart moves more than 2.4
units away from center. This means better performing scenarios will run
for longer duration, accumulating larger return.

The CartPole task is designed so that the inputs to the agent are 4 real
values representing the environment state (position, velocity, etc.).
We take these 4 inputs without any scaling and pass them through a
small fully-connected network with 2 outputs, one for each action.
The network is trained to predict the expected value for each action,
given the input state. The action with the highest expected value is
then chosen.


**Packages**

First, let's import needed packages. For our environments, we need
[gymnasium](https://gymnasium.farama.org/), installed by using `pip`.
It is a fork of the original OpenAI Gym project and has been maintained by
the same team since Gym v0.19.

In order for the `box2d` environments to install correctly, we
first need to install the `swig` package. This is a limitation of
the build process for the `box2d` environments, so it needs to be
run as a separate step.

For our model, we'll use the `torch` package, which is the PyTorch
machine learning library. Training these simple reinforcement learning
environments is quite doable on CPU, as the serial nature of the
environments means that we don't get much benefit from GPU parallelism.
Thus, a CPU-only installation link for PyTorch is provided, commented
out, below.

We use `numpy` to manipulate values in multiple places in the code, and
plot using the `plotly` library (which requires `pandas` and `nbformat`).
`tqdm` provides nice progress bars for our training loops, and `ipywidgets`
allows for a nice increase in interactivity for some cells.

Finally, `ffmpeg` and `moviepy` are used to create the videos of the
agent's performance in the environment, when the environment is wrapped
with the `RecordVideo` wrapper from the `gymnasium.wrappers` module.

In [1]:
%pip install swig
# %pip install torch --index-url https://download.pytorch.org/whl/cpu
%pip install torch "gymnasium[classic_control,box2d]" pandas numpy nbformat plotly tqdm moviepy ffmpeg ipywidgets

Collecting swig
  Downloading swig-4.3.0-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl.metadata (3.5 kB)
Downloading swig-4.3.0-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.9 MB)
Installing collected packages: swig
Successfully installed swig-4.3.0
Collecting ffmpeg
  Downloading ffmpeg-1.4.tar.gz (5.1 kB)
  Preparing metadata (setup.py) ...

For `moviepy` to work in making recordings of the environment, you will need to have the `ffmpeg` executable installed. If you are running this in Google Colab and encounter issues, run:

```bash
!apt-get install ffmpeg
```

in a code cell. If you are running this locally, you can install `ffmpeg` by following the instructions [here](https://ffmpeg.org/download.html).

We'll also use the following from PyTorch:

-  neural networks (``torch.nn``)
-  optimization (``torch.optim``)
-  automatic differentiation (``torch.autograd``)


In [22]:
from typing import Tuple, NamedTuple
import math
import random
from collections import deque
from itertools import count

from pathlib import Path
from zipfile import ZipFile

import gymnasium as gym
from gymnasium.wrappers import TimeLimit
from gymnasium.wrappers import RecordVideo

import numpy as np
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px

from tqdm import trange, tqdm

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

MAX_ENV_STEPS = 750
# TODO
GYM_ENV = "LunarLander-v3"
VIDEOS_DIR = f"{GYM_ENV}_videos"
env = gym.make(GYM_ENV, render_mode="rgb_array", max_episode_steps=MAX_ENV_STEPS)
env = RecordVideo(env, video_folder=VIDEOS_DIR, episode_trigger=lambda episode_id: episode_id < 50 or episode_id % 25 == 0)
# if GPU is to be used
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Replay Memory

We'll be using experience replay memory for training our DQN. It stores
the transitions that the agent observes, allowing us to reuse this data
later. By sampling from it randomly, the transitions that build up a
batch are decorrelated. It has been shown that this greatly stabilizes
and improves the DQN training procedure.

For this, we're going to need two classes:

-  ``Transition`` - a named tuple representing a single transition in
   our environment. It essentially maps (state, action) pairs
   to their (next_state, reward) result, with the state being the
   observation vector returned by the environment.
-  ``ReplayMemory`` - a cyclic buffer of bounded size that holds the
   transitions observed recently. It also implements a ``.sample()``
   method for selecting a random batch of transitions for training.
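
To make the cyclic-buffer behaviour concrete, here is a minimal sketch using a bare `deque` with a small `maxlen`. The name `buffer`, the capacity of 3, and the tuple contents are illustrative assumptions, not the values used in this notebook.

```python
import random
from collections import deque

# A bounded buffer: once full, appending evicts the oldest entry.
buffer = deque([], maxlen=3)
for step in range(5):
    # Hypothetical transitions: (state, action, next_state, reward)
    buffer.append((step, 0, step + 1, 1.0))

# Only the three most recent transitions remain.
print([t[0] for t in buffer])

# Uniform random sampling without replacement, as a training batch would use.
batch = random.sample(list(buffer), 2)
print(len(batch))
```

Sampling uniformly from the buffer, rather than taking the latest transitions in order, is what breaks up the correlation between consecutive steps.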




In [67]:
class Transition(NamedTuple):
    state: np.ndarray
    action: int
    next_state: np.ndarray
    reward: float

class ReplayMemory(object):

    def __init__(self, capacity: int):
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        """Save a transition"""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

Now, let's define our model. But first, let's quickly recap what a DQN is.

## DQN algorithm

Our environment is deterministic, so all equations presented here are
also formulated deterministically for the sake of simplicity. In the
reinforcement learning literature, they would also contain expectations
over stochastic transitions in the environment.

Our aim will be to train a policy that tries to maximize the discounted,
cumulative reward
$R_{t_0} = \sum_{t=t_0}^{\infty} \gamma^{t - t_0} r_t$, where
$R_{t_0}$ is also known as the *return*. The discount,
$\gamma$, should be a constant between $0$ and $1$
that ensures the sum converges. A lower $\gamma$ makes
rewards from the uncertain far future less important for our agent
than the ones in the near future that it can be fairly confident
about. It also encourages the agent to collect a reward sooner
rather than an equivalent reward far in the future.
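
As a small worked example of the return, the sketch below sums a finite list of rewards with discounting. The function name `discounted_return` and the sample rewards are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite episode."""
    total = 0.0
    # Iterate backwards so each step applies R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, 1.0))  # no discounting: 3.0
print(discounted_return(rewards, 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With $\gamma = 0.5$ the third reward contributes only a quarter of its face value, which is the sense in which distant rewards matter less.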

The main idea behind Q-learning is that if we had a function
$Q^*: State \times Action \rightarrow \mathbb{R}$, that could tell
us what our return would be, if we were to take an action in a given
state, then we could easily construct a policy that maximizes our
rewards:

\begin{align}\pi^*(s) = \arg\!\max_a \ Q^*(s, a)\end{align}
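
In code, the greedy policy is just an argmax over the action values for one state. This is a minimal sketch; `greedy_action` and the Q-values are hypothetical and stand in for the output of a trained network.

```python
def greedy_action(q_values):
    """Return the index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical Q-values for one state with two actions (0 = left, 1 = right).
q_s = [0.3, 1.2]
print(greedy_action(q_s))
```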

However, we don't know everything about the world, so we don't have
access to $Q^*$. But, since neural networks are universal function
approximators, we can simply create one and train it to resemble
$Q^*$.

For our training update rule, we'll use the fact that every $Q$
function for some policy obeys the Bellman equation:

\begin{align}Q^{\pi}(s, a) = r + \gamma Q^{\pi}(s', \pi(s'))\end{align}

The difference between the two sides of the equality is known as the
temporal difference error, $\delta$:

\begin{align}\delta = Q(s, a) - (r + \gamma \max_{a'} Q(s', a'))\end{align}
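
For a single transition, the temporal difference error can be computed by hand. The sketch below uses made-up numbers; `td_error` is a hypothetical helper, not a function from this notebook.

```python
def td_error(q_sa, reward, next_q_values, gamma):
    """delta = Q(s, a) - (r + gamma * max over a' of Q(s', a'))."""
    target = reward + gamma * max(next_q_values)
    return q_sa - target

# Current estimate Q(s, a) = 1.0, reward 1.0, and two next-state action values.
delta = td_error(q_sa=1.0, reward=1.0, next_q_values=[0.5, 2.0], gamma=0.5)
print(delta)  # 1.0 - (1.0 + 0.5 * 2.0) = -1.0
```

A negative error here means the current estimate is below the bootstrapped target, so training would push $Q(s, a)$ upward.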

To minimize this error, we will use the [Huber
loss](https://en.wikipedia.org/wiki/Huber_loss). The Huber loss acts
like the mean squared error when the error is small, but like the mean
absolute error when the error is large - this makes it more robust to
outliers when the estimates of $Q$ are very noisy. We calculate
this over a batch of transitions, $B$, sampled from the replay
memory:

\begin{align}\mathcal{L} = \frac{1}{|B|}\sum_{(s, a, s', r) \ \in \ B} \mathcal{L}(\delta)\end{align}

\begin{align}\text{where} \quad \mathcal{L}(\delta) = \begin{cases}
     \frac{1}{2}{\delta^2}  & \text{for } |\delta| \le 1, \\
     |\delta| - \frac{1}{2} & \text{otherwise.}
   \end{cases}\end{align}
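
The piecewise definition translates directly into plain Python. This is a sketch for intuition only, with hypothetical helper names; the training code relies on PyTorch's `smooth_l1_loss`, whose default settings correspond to the same threshold of 1.

```python
def huber(delta):
    """Piecewise Huber loss with threshold 1."""
    if abs(delta) <= 1.0:
        return 0.5 * delta * delta   # quadratic near zero
    return abs(delta) - 0.5          # linear for large errors

def batch_loss(deltas):
    """Mean Huber loss over a batch of TD errors."""
    return sum(huber(d) for d in deltas) / len(deltas)

print(huber(0.5))               # quadratic region: 0.125
print(huber(-3.0))              # linear region: 2.5
print(batch_loss([0.5, -3.0]))  # (0.125 + 2.5) / 2 = 1.3125
```

Note that a squared-error loss would charge 4.5 for the error of 3, so the Huber loss limits how much a single noisy estimate can dominate the batch.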

### Q-network

Our model will be a feed-forward neural network that takes in the
environment's observation vector. It has one output per action; for
CartPole these are $Q(s, \mathrm{left})$ and
$Q(s, \mathrm{right})$ (where $s$ is the input to the
network). In effect, the network is trying to predict the *expected return* of
taking each action given the current input.




In [68]:
class DQN(nn.Module):

    def __init__(self, n_observations: int, n_actions: int):
        super(DQN, self).__init__()
        self.layer1 = nn.Linear(n_observations, 128)
        self.layer2 = nn.Linear(128, 128)
        self.layer3 = nn.Linear(128, n_actions)

    # Called with either one element to determine next action, or a batch
    # during optimization. Returns tensor([[left0exp,right0exp]...]).
    def forward(self, x):
        # Call the layers directly rather than through .forward()
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        return self.layer3(x)

## Training

### Hyperparameters and utilities
This cell instantiates our model and its optimizer, and defines some
utilities:

-  ``select_action`` - selects an action according to an epsilon-greedy
   policy. Simply put, we'll sometimes use our model to choose the
   action, and sometimes we'll just sample one uniformly. The
   probability of choosing a random action starts at ``EPS_START``
   and decays exponentially towards ``EPS_END``. ``EPS_DECAY``
   controls the rate of the decay.
-  ``plot_durations`` - a helper for plotting the duration and total
   reward of each episode. The figure also reserves traces for an average
   over the last 100 episodes (the measure used in the official
   evaluations). The plot will be underneath the cell containing the main
   training loop, and will update after every episode.
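
To get a feel for the exploration schedule, the sketch below evaluates the same exponential decay formula at a few step counts. The constants mirror the hyperparameter cell; the function name `eps_threshold` and the chosen step counts are assumptions for illustration.

```python
import math

# Same values as the hyperparameter cell in this notebook.
EPS_START, EPS_END, EPS_DECAY = 0.9, 0.05, 8000

def eps_threshold(steps_done):
    """Probability of taking a random action after `steps_done` environment steps."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-steps_done / EPS_DECAY)

for steps in (0, 8000, 40000):
    print(steps, round(eps_threshold(steps), 3))
```

After `EPS_DECAY` steps the gap between the current value and ``EPS_END`` has shrunk by a factor of $e$, so with these settings the agent still explores roughly a third of the time at step 8000.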




In [69]:
# BATCH_SIZE is the number of transitions sampled from the replay buffer
# GAMMA is the discount factor as mentioned in the previous section
# EPS_START is the starting value of epsilon
# EPS_END is the final value of epsilon
# EPS_DECAY controls the rate of exponential decay of epsilon, higher means a slower decay
# TAU is the update rate of the target network
# LR is the learning rate of the ``AdamW`` optimizer
BATCH_SIZE = 128
GAMMA = 0.99
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 8000
TAU = 0.005
LR = 7e-4

# Get number of actions from gym action space
n_actions: int = int(env.action_space.n)  # type: ignore
# Get the number of state observations
state, info = env.reset()
n_observations: int = len(state)

policy_net_decompiled = DQN(n_observations, n_actions).to(device)
policy_net: torch.jit.ScriptModule = torch.jit.script(policy_net_decompiled)  # type: ignore
target_net_decompiled = DQN(n_observations, n_actions).to(device)
target_net: torch.jit.ScriptModule = torch.jit.script(target_net_decompiled)  # type: ignore
target_net.load_state_dict(policy_net.state_dict())

optimizer = optim.AdamW(policy_net.parameters(), lr=LR, amsgrad=True, capturable=False)
memory = ReplayMemory(10000)


def select_action(state: torch.Tensor, steps_done: int):
    sample = torch.rand(1, dtype=torch.float16, requires_grad=False)
    eps_threshold = EPS_END + (EPS_START - EPS_END) * math.exp(
        -1.0 * steps_done / EPS_DECAY
    )
    if sample > eps_threshold:
        with torch.no_grad():
            # t.max(1) will return the largest column value of each row.
            # second column on max result is index of where max element was
            # found, so we pick action with the larger expected reward.
            return policy_net(state).max(1).indices.view(1, 1)
    else:
        return torch.tensor(
            [[env.action_space.sample()]], device=device, dtype=torch.long
        )


class PlotDurations:
    def __init__(self) -> None:
        fig = make_subplots(
            rows=2,
            cols=1,
            shared_xaxes=True,
            vertical_spacing=0.02,
            subplot_titles=("Episode Duration", "Episode Reward"),
        )
        fig.add_trace(
            go.Scatter(x=[], y=[], mode="lines", name="Duration"), row=1, col=1
        )
        fig.add_trace(
            go.Scatter(x=[], y=[], mode="lines", name="Reward"), row=2, col=1
        )
        fig.add_trace(
            go.Scatter(
                x=[],
                y=[],
                mode="lines",
                name="Duration (100 episode average)",
            ),
            row=1,
            col=1,
        )
        fig.add_trace(
            go.Scatter(
                x=[],
                y=[],
                mode="lines",
                name="Reward (100 episode average)",
            ),
            row=2,
            col=1,
        )
        fig.update_layout(
            title="Training...",
            showlegend=False,
            margin=dict(l=0, r=0, t=30, b=0),
        )
        fig.update_yaxes(title_text="Duration", row=1, col=1)
        fig.update_yaxes(title_text="Reward", row=2, col=1)
        fig.update_xaxes(title_text="Episode", row=2, col=1)
        self.fig_widget = go.FigureWidget(fig)

        try:
            import IPython.display as ipdisplay

            self.ipdisplay = ipdisplay
        except ImportError:
            self.ipdisplay = None

    def show(self):
        self.fig_widget.show()

    def update(self, episode_duration: int, episode_reward: float, redisplay: bool=True):
        duration_scatter: go.Scatter = self.fig_widget.data[0]  # type: ignore
        reward_scatter: go.Scatter = self.fig_widget.data[1]  # type: ignore
        episode = len(duration_scatter.x) + 1  # type: ignore
        new_episode_axis: Tuple[int] = duration_scatter.x + (episode,)  # type: ignore
        new_duration_axis: Tuple[int] = duration_scatter.y + (episode_duration,)  # type: ignore
        new_reward_axis: Tuple[int] = reward_scatter.y + (episode_reward,)  # type: ignore
        with self.fig_widget.batch_update():
            duration_scatter.x = new_episode_axis
            reward_scatter.x = new_episode_axis
            duration_scatter.y = new_duration_axis
            reward_scatter.y = new_reward_axis

        if redisplay and (self.ipdisplay is not None):
            self.ipdisplay.clear_output(wait=True)
            self.fig_widget.show()

    def finish(self, title: str = "Results"):
        self.fig_widget.update_layout(title=title)


plot_durations = PlotDurations()

### Training loop

Finally, the code for training our model.

Here, you can find an ``optimize_model`` function that performs a
single step of the optimization. It first samples a batch, concatenates
all the tensors into a single one, computes $Q(s_t, a_t)$ and
$V(s_{t+1}) = \max_a Q(s_{t+1}, a)$, and combines them into our
loss. By definition we set $V(s) = 0$ if $s$ is a terminal
state. We also use a target network to compute $V(s_{t+1})$ for
added stability. The target network is updated at every step with a
[soft update](https://arxiv.org/pdf/1509.02971.pdf) controlled by
the hyperparameter ``TAU``, which was previously defined.
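
The soft update can be illustrated on plain lists of numbers. This is a sketch only: `soft_update` is a hypothetical helper, and `tau=0.5` is chosen to make the arithmetic easy to read rather than to match ``TAU``.

```python
def soft_update(target_params, policy_params, tau):
    """In-place Polyak update: theta' <- tau * theta + (1 - tau) * theta'."""
    for i, (t, p) in enumerate(zip(target_params, policy_params)):
        target_params[i] = tau * p + (1.0 - tau) * t

target = [0.0, 10.0]
policy = [1.0, 0.0]
soft_update(target, policy, tau=0.5)
print(target)  # [0.5, 5.0]
```

With a small `tau` the target parameters trail the policy parameters slowly, which keeps the bootstrapped targets from shifting abruptly between optimization steps.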




In [70]:
@torch.compile(fullgraph=True)
def do_optimization_inner(state_batch, action_batch, reward_batch, non_final_next_states, non_final_mask):
    # Compute Q(s_t, a) - the model computes Q(s_t), then we select the
    # columns of actions taken. These are the actions which would've been taken
    # for each batch state according to policy_net
    state_action_values = policy_net_decompiled(state_batch).gather(1, action_batch)

    # Compute V(s_{t+1}) for all next states.
    # Expected values of actions for non_final_next_states are computed based
    # on the "older" target_net; selecting their best reward with max(1).values
    # This is merged based on the mask, such that we'll have either the expected
    # state value or 0 in case the state was final.
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    with torch.no_grad():
        next_state_values[non_final_mask] = target_net_decompiled(non_final_next_states).max(1).values
    # Compute the expected Q values
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch

    # Compute Huber loss
    loss = nn.functional.smooth_l1_loss(state_action_values, expected_state_action_values.unsqueeze(1))

    return loss

def optimize_model():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    # Transpose the batch (see https://stackoverflow.com/a/19343/3343043 for
    # detailed explanation). This converts batch-array of Transitions
    # to Transition of batch-arrays.
    batch = Transition(*zip(*transitions))

    # Compute a mask of non-final states and concatenate the batch elements
    # (a final state would've been the one after which simulation ended)
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
                                          batch.next_state)), device=device, dtype=torch.bool)
    non_final_next_states = torch.cat([s for s in batch.next_state
                                                if s is not None])
    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)

    loss = do_optimization_inner(state_batch, action_batch, reward_batch, non_final_next_states, non_final_mask)

    # Optimize the model
    optimizer.zero_grad()
    loss.backward()
    # In-place gradient clipping
    torch.nn.utils.clip_grad_value_(policy_net.parameters(), 100)
    optimizer.step()

In [71]:
@torch.compile(fullgraph=True)
@torch.no_grad
def policy_net_update():
    # Soft update of the target network's weights
    # θ′ ← τ θ + (1 −τ )θ′
    for target_net_param, policy_net_param in zip(target_net_decompiled.parameters(), policy_net_decompiled.parameters()):
        torch.Tensor.lerp_(target_net_param, policy_net_param, TAU)

Below, you can find the main training loop. At the beginning we reset
the environment and obtain the initial ``state`` Tensor. Then, we sample
an action, execute it, observe the next state and the reward (always
1 in CartPole), and optimize our model once. When the episode ends (the
environment terminates or the step limit is reached), we restart the loop.

Below, `num_episodes` is set to 600. Training RL agents can be a
noisy process, so restarting training can produce better results
if convergence is not observed.

**== EPILEPSY WARNING: The training graph in the cell below may flash repeatedly during training in Google Colab. ==**

If you are sensitive to flashing lights, set `DISPLAY_GRAPH_DURING_TRAINING` to `False` in the cell below,
and the graph will only be displayed at the end of training.

In [72]:
DISPLAY_GRAPH_DURING_TRAINING = True

num_episodes = 600

steps_done = 0
steps_at_max_duration = 0

for i_episode in trange(num_episodes):
    # Initialize the environment and get its state
    state, info = env.reset()
    state = torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
    total_reward = 0.0
    for t in count():
        action = select_action(state, steps_done)
        observation, reward, terminated, truncated, _ = env.step(action.item())
        total_reward += reward # type: ignore
        # Store the reward as float32 so it matches the network's output dtype
        reward = torch.tensor([reward], dtype=torch.float32, device=device)
        done = terminated or truncated

        if terminated:
            next_state = None
        else:
            next_state = torch.tensor(observation, dtype=torch.float32, device=device).unsqueeze(0)

        # Store the transition in memory
        memory.push(state, action, next_state, reward)

        # Move to the next state
        state = next_state # type: ignore

        # Perform one step of the optimization (on the policy network)
        optimize_model()
        # Update the target network, copying all weights and biases in DQN
        policy_net_update()

        steps_done += 1

        if done:
            break
    plot_durations.update(t, total_reward, redisplay=DISPLAY_GRAPH_DURING_TRAINING)

    # Early stopping check
    # If we're hitting the maximum duration too often, we're probably stuck
    if t > MAX_ENV_STEPS*0.99:
        steps_at_max_duration += 1
        if steps_at_max_duration > 25:
            print(f"Breaking early, episode {i_episode}")
            break
    else:
        steps_at_max_duration = 0

print('Complete')
plot_durations.finish("Training Results")
plot_durations.show()

100%|██████████| 600/600 [26:34<00:00,  2.66s/it]

Complete





# Export the training videos

We want to be able to see the training we just completed, so let's wrap the videos into a zip file for download from Colab. (Local users can find the videos in the current working directory and skip this step.)

Note that if you didn't set up a RecordVideo wrapper on your environment, there won't be any videos!

In [73]:
video_paths = list(Path(VIDEOS_DIR).glob("*.mp4"))
if not video_paths:
    print(f"No videos found in {VIDEOS_DIR}")
else:
    print(f"Zipping {len(video_paths)} videos from {VIDEOS_DIR}...")
    with ZipFile(f"{VIDEOS_DIR}.zip", "w") as video_zip:
        for video_path in video_paths:
            video_zip.write(video_path)

    # Report where the zip archive was written
    print(f"Videos zipped to {VIDEOS_DIR}.zip")

Zipping 136 videos from LunarLander-v3_videos...
Videos zipped to LunarLander-v3_videos.zip


# Save the model

We're going to write the reinforcement learning model to disk so we can use it later.

We save it in PyTorch's TorchScript format, which will allow us to load it and run it even if we change the Python code above. This is useful for deployment, as we can load the model in a C++ application, for example. We can also load the model back into Python and continue training it, or use it to make predictions.

In [74]:
LANDER_NET_FILE = f"{GYM_ENV}_dqn.pt"
torch.jit.save(policy_net.cpu(), LANDER_NET_FILE)

# Report where the model file was written
print(f"Model saved to {LANDER_NET_FILE}")

Model saved to LunarLander-v3_dqn.pt


# Evaluate the model

We're going to evaluate the model by running it in the environment and seeing how well it performs over a hundred runs. This is evaluation only, so we should be able to run this part faster than the training part from above.

In [75]:
LANDER_NET_FILE = f"{GYM_ENV}_dqn.pt"
model = torch.jit.load(LANDER_NET_FILE).to(device)

In [76]:
# Unwrap the environment to drop the RecordVideo wrapper, then re-apply the step limit.
eval_env = TimeLimit(env.unwrapped, max_episode_steps=MAX_ENV_STEPS)

In [77]:
durations = []
# TODO: Track reward for the episodes
# Code here
episode_rewards = []

In [78]:
with torch.no_grad():
    EVAL_EPISODES = 100
    for _ in trange(EVAL_EPISODES):
        state, info = eval_env.reset()
        state = torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
        # Accumulate the undiscounted reward for this episode
        total_reward = 0.0
        for t in count():
            # Always act greedily during evaluation (no exploration)
            action = model(state).argmax().item()
            observation, reward, terminated, truncated, _ = eval_env.step(action)
            total_reward += float(reward)
            if terminated or truncated:
                break
            state = torch.tensor(observation, dtype=torch.float32, device=device).unsqueeze(0)
        durations.append(t)
        episode_rewards.append(total_reward)

100%|██████████| 100/100 [00:25<00:00,  3.87it/s]


# Plot the evaluation results

The CartPole assignment is considered solved when the agent can pass all steps in a given episode without the pole falling over. Here, we've limited the evaluation to 750 steps, so we should see over 700 steps on average in our durations histogram below.

In [79]:
mean_duration = np.mean(durations)
fig = px.histogram(x=durations, title=f"Episode Durations ({len(durations)} episodes)")
fig.add_vline(x=mean_duration, line_dash="dash", line_color="green", annotation_text=f"Mean: {mean_duration:.2f}", annotation_position="top right")
fig.update_layout(xaxis_title="Duration", yaxis_title="Frequency", showlegend=False, margin=dict(l=0, r=0, t=30, b=0))
fig.show()
print(f"Mean duration: {mean_duration:.2f} steps")

Mean duration: 331.77 steps


For other gyms, like LunarLander, the environment is considered solved not on duration, but on _reward_. Specifically, for LunarLander, the gym is solved when the agent can achieve an average reward of 200 over 100 episodes. The evaluation loop above records each episode's total reward in `episode_rewards`, which the next cell plots.
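
The solved criterion can be written as a small check over a list of per-episode rewards. This is a sketch; `is_solved`, its default arguments, and the sample reward lists are hypothetical and not part of the notebook or the autograder.

```python
def is_solved(episode_rewards, threshold=200.0, window=100):
    """True if the mean of the last `window` episode rewards reaches `threshold`."""
    if len(episode_rewards) < window:
        return False  # not enough episodes to judge yet
    recent = episode_rewards[-window:]
    return sum(recent) / window >= threshold

# Made-up reward lists for illustration.
print(is_solved([235.5] * 100))
print(is_solved([150.0] * 100))
```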

In [80]:
mean_reward = np.mean(episode_rewards)
# Histogram of the total reward per evaluation episode
fig = go.Figure(data=[go.Histogram(x=episode_rewards, nbinsx=20)])
fig.add_vline(x=mean_reward, line_dash="dash", line_color="red", annotation_text=f"Mean: {mean_reward:.2f}")
fig.update_layout(title="Reward Histogram", xaxis_title="Reward", yaxis_title="Frequency")
fig.show()
print(f"Mean reward: {mean_reward:.2f}")

Mean reward: 235.50


Remember to submit this notebook and your completed model to the GradeScope autograder.