<h1 style="text-align:center; font-size:28px;">Higher-Level RL Libraries</h1>

<br>

In Chapter 6, Deep Q-Networks, we implemented the deep Q-network (DQN) model published by DeepMind in 2015 (https://deepmind.com/research/publications/playing-atari-deep-reinforcement-learning). This paper had a significant effect on the RL field by demonstrating that it's possible to use nonlinear approximators in RL.

In this chapter, we will take another step and discuss the higher-level RL libraries, which will allow you to build your code from higher-level blocks. Most of the chapter will describe the __PyTorch Agent Net (PTAN)__ library. This library will be used in the rest of the book to avoid code repetition.

We will cover:

- The motivation for using high-level libraries, rather than reimplementing everything from scratch.
- The PTAN library, with code examples.
- Implementing DQN (on CartPole) using the PTAN library.
- Other RL libraries that you might consider.

<br>

# 1. Why RL libraries?
---

The implementation of DQN (in Chapter 6) wasn't very long or complicated: about 200 lines of training code plus 120 lines of environment wrappers.

When you are becoming familiar with RL methods, it is very useful to implement everything from scratch. However, the more involved you become in the field, the more often you will realize that you are writing the same code over and over again. This repetition is inefficient: bugs might be introduced every time, which will cost you time for debugging and understanding. In addition, carefully designed code that has been used in several projects usually has a higher quality in terms of performance, unit tests, readability, and documentation.

<br>

# 2. The PTAN library

<hr>

The PTAN library is available at https://github.com/Shmuma/ptan. All the subsequent examples were implemented using version 0.6 of PTAN, which can be installed by running the following line:

In [9]:
!pip install ptan==0.6



The original goal of PTAN was to simplify the RL experiments, and it tries to keep the balance between two extremes:
- Import the library and then write one line of code for training (an example is the OpenAI Baselines project). This approach is very inflexible, since we can't change the small details.
- Implement everything from scratch, which is error-prone, boring, and inefficient.

<br>

At a high level, PTAN provides the following entities:
- __Agent:__ This class can convert a batch of observations to a batch of actions. Also, it can contain an optional state (if you want to track some information between consecutive actions in one episode). The library provides several agents for the most common RL cases, but you can always write your own subclass of BaseAgent as well.


- __ActionSelector:__ For choosing the action from some output of the network. It works in tandem with Agent.


- __ExperienceSource (and variations):__ Can provide information about the trajectory from episodes. In its simplest form, it is one single (a, r, s') transition at a time, but its functionality goes beyond this.


- __ExperienceSourceBuffer (and friends):__ Replay buffers with various characteristics. They include a simple replay buffer and two versions of prioritized replay buffers.


- __Various utility classes:__ Like TargetNet, and wrappers for time series preprocessing (used for tracking training progress in TensorBoard).


- __Wrappers for Gym environments:__ Like wrappers for Atari games.


- PyTorch Ignite helpers to integrate PTAN into the Ignite framework.

That's basically it.

<br>

# 3. Action selectors

---

An action selector is an object that helps with going from network output to concrete action values. The most common cases include:

- __Argmax:__ Commonly used by Q-value methods when the network predicts Q-values for a set of actions and the desired action is the action with the largest Q(s, a).


- __Policy-based:__ The network outputs the probability distribution (in the form of logits or normalized distribution), and an action needs to be sampled from this distribution.

The concrete classes provided by the library are:


- __ArgmaxActionSelector:__ Applies the argmax on the second axis of a passed tensor. (It assumes a matrix with batch dimension along the first axis.)


- __ProbabilityActionSelector:__ Samples from the probability distribution of a discrete set of actions.


- __EpsilonGreedyActionSelector:__ Has the parameter epsilon, which specifies the probability of a random action to be taken.

All the classes assume that NumPy arrays will be passed to them. The complete example from this section can be found in Chapter07/01_actions.py.
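
To make the selection rules above concrete, here is a minimal plain-Python sketch of argmax and epsilon-greedy selection over a batch of Q-value rows. It is not PTAN's implementation; the function names `argmax_actions` and `epsilon_greedy_actions` are hypothetical and only illustrate the idea.

```python
import random

def argmax_actions(q_rows):
    """Return the index of the largest value in each row (one row per observation)."""
    return [max(range(len(row)), key=row.__getitem__) for row in q_rows]

def epsilon_greedy_actions(q_rows, epsilon, rng):
    """With probability epsilon, replace the greedy action by a uniformly random one."""
    actions = argmax_actions(q_rows)
    for i, row in enumerate(q_rows):
        if rng.random() < epsilon:
            actions[i] = rng.randrange(len(row))
    return actions

q_vals = [[1, 2, 3], [1, -1, 0]]
rng = random.Random(0)
greedy = epsilon_greedy_actions(q_vals, epsilon=0.0, rng=rng)   # always the argmax
explore = epsilon_greedy_actions(q_vals, epsilon=1.0, rng=rng)  # always random
print(greedy, explore)
```

With epsilon set to 0.0 the result is always the argmax, and with 1.0 every action is drawn at random, which mirrors the two extremes shown in the cells of this section.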

In [1]:
# Import the libraries
import ptan
import numpy as np

In [2]:
# Create Q-values
q_vals = np.array([[1, 2, 3], [1, -1, 0]])
print("q_vals")
print(q_vals)

q_vals
[[ 1  2  3]
 [ 1 -1  0]]


In [3]:
# Argmax action selector  (CHOOSE MAXIMUM)
selector = ptan.actions.ArgmaxActionSelector()        # Returns indices of actions with the largest values
print("argmax:", selector(q_vals))

argmax: [2 0]


In [29]:
# Epsilon greedy action selector with epsilon of 0 (TOTALLY GREEDY)
selector = ptan.actions.EpsilonGreedyActionSelector(epsilon = 0.0)
print("Action Selector with Epsilon of 0.0: \t", selector(q_vals))

Action Selector with Epsilon of 0.0: 	 [2 0]


In [45]:
# Epsilon greedy action selector with epsilon of 1 (TOTALLY RANDOM)
selector.epsilon = 1.0
print("Action Selector with Epsilon of 1.0: \t", selector(q_vals))

Action Selector with Epsilon of 1.0: 	 [2 1]


In [61]:
# Epsilon greedy action selector with epsilon of 0.5 (HALF RANDOM)
selector.epsilon = 0.5
print("Action Selector with Epsilon of 0.5: \t", selector(q_vals))

Action Selector with Epsilon of 0.5: 	 [2 0]


In [77]:
# Epsilon greedy action selector with epsilon of 0.1 (10% RANDOM)
selector.epsilon = 0.1
print("Action Selector with Epsilon of 0.1: \t", selector(q_vals))

Action Selector with Epsilon of 0.1: 	 [2 0]


Working with ProbabilityActionSelector is the same, but the input needs to be a normalized probability distribution:

In [78]:
# Probability action selector (BASED ON GIVEN PROBABILITIES)
selector = ptan.actions.ProbabilityActionSelector()

# Report
print("Actions sampled from three prob distributions:")
for _ in range(10):
    acts = selector(np.array([[0.1, 0.8, 0.1],
                              [0.0, 0.0, 1.0],
                              [0.5, 0.5, 0.0]]))
    print(acts)

Actions sampled from three prob distributions:
[1 2 1]
[1 2 0]
[1 2 0]
[2 2 0]
[1 2 0]
[1 2 1]
[2 2 0]
[1 2 0]
[1 2 1]
[1 2 1]


In the preceding example, we sample from three distributions: in the first, the action with index 1 is chosen with probability 80%; in the second distribution, we always select action number 2; and in the third, actions 0 and 1 are equally likely.
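
The probabilities described above can be checked empirically. The following is a small sketch, assuming only the standard library, that samples many times from the same three distributions and prints the observed frequencies; `sample_actions` is a hypothetical helper, not part of PTAN.

```python
import random
from collections import Counter

def sample_actions(prob_rows, rng):
    """Draw one action index per row, using each row as the sampling weights."""
    return [rng.choices(range(len(row)), weights=row, k=1)[0] for row in prob_rows]

probs = [[0.1, 0.8, 0.1],
         [0.0, 0.0, 1.0],
         [0.5, 0.5, 0.0]]
rng = random.Random(42)
counts = [Counter() for _ in probs]
for _ in range(10_000):
    for row_counter, action in zip(counts, sample_actions(probs, rng)):
        row_counter[action] += 1

# Observed frequencies should be close to the requested probabilities.
for row, row_counter in zip(probs, counts):
    freqs = [row_counter[a] / 10_000 for a in range(len(row))]
    print(row, "->", [round(f, 2) for f in freqs])
```

Actions with zero probability are never selected, which is why the third column of the sampled output never contains action 2.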

<br>

# 4. The agent

---

The agent entity bridges observations from the environment and the actions we want to execute. The abstraction has to accommodate three kinds of variation:

1. So far, we've only used a simple DQN agent that uses a neural network to obtain action values and behaves greedily with respect to them, with epsilon-greedy behavior for exploring the environment. This could be more complicated. For example, instead of predicting the values of the actions, our agent could predict a probability distribution over the actions. Such agents are called policy agents.


2. The agent may need to keep state between observations. For example, very often one observation (or even the last k observations) is not enough to make a decision about the action, and we want to keep some memory in the agent to capture the necessary information. There is a whole subdomain of RL that tries to address this complication with the partially observable Markov decision process (POMDP) formalism, which is not covered in the book.


3. The third variant of the agent is very common in continuous control problems. In such cases, actions are not discrete anymore but some continuous value, and the agent needs to predict them from the observations.

To capture all those variants and make the code flexible, the agent in PTAN is implemented as an extensible hierarchy of classes with the ptan.agent.BaseAgent abstract class at the top. From a high level, the agent needs to accept the batch of observations (in the form of a NumPy array) and return the batch of actions. The batch is used to make the processing more efficient, as processing several observations in one pass in a graphics processing unit (GPU) is frequently much faster than processing them individually. The abstract base class doesn't define the types of input and output, which makes it very flexible and easy to extend. For example, in the continuous domain, our actions will no longer be indices of discrete actions, but float values.
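
As an illustration of the "batch of observations in, batch of actions out, plus optional state" contract, here is a minimal sketch of a stateful agent in plain Python. The class `StepCountingAgent` is hypothetical and does not subclass anything from PTAN; it only mirrors the (actions, states) tuple shape that appears in the outputs of this chapter.

```python
from typing import Any, List, Optional, Tuple

class StepCountingAgent:
    """Hypothetical agent: picks action (steps seen so far) mod n_actions."""

    def __init__(self, n_actions: int):
        self.n_actions = n_actions

    def __call__(self, observations: List[Any],
                 states: Optional[List[Optional[int]]] = None
                 ) -> Tuple[List[int], List[int]]:
        # One state slot per observation; None means "no history yet".
        if states is None:
            states = [None] * len(observations)
        actions, new_states = [], []
        for _obs, state in zip(observations, states):
            count = 0 if state is None else state
            actions.append(count % self.n_actions)
            new_states.append(count + 1)
        return actions, new_states

agent = StepCountingAgent(n_actions=3)
actions, states = agent(["obs_a", "obs_b"])
print(actions, states)
actions, states = agent(["obs_a", "obs_b"], states)
print(actions, states)
```

The caller passes the returned states back on the next call, so the agent itself stays free of per-episode bookkeeping.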

PTAN provides two of the most common ways to convert observations into actions: DQNAgent and PolicyAgent. Let's check them out.

In [11]:
# Import the libraries
import ptan
import torch
import torch.nn as nn

<br>

##### DQN using PTAN

In [23]:
# DQN Network
class DQNNet(nn.Module):
    
    # Constructor function
    def __init__(self, actions: int):
        
        # Call the parent's constructor
        super(DQNNet, self).__init__()
        
        # Initialize the action
        self.actions = actions
 
    # Forward function
    def forward(self, x):
        
        # Produce a diagonal tensor of shape (batch_size, actions)
        output = torch.eye(x.size()[0], self.actions)
        
        return output

In [32]:
# Initialize the DQN network
net = DQNNet(actions = 3)

# Feedforward
net_out = net(torch.zeros(2, 10))

print("dqn_net: \n", net_out)

dqn_net: 
 tensor([[1., 0., 0.],
        [0., 1., 0.]])


In [33]:
### DQN Agent (using Argmax)

# Argmax action selector
selector = ptan.actions.ArgmaxActionSelector()

# DQN Agent
agent = ptan.agent.DQNAgent(dqn_model = net, action_selector = selector)

# Output of the agent
ag_out = agent(torch.zeros(2, 5))

print("Argmax:", ag_out)

Argmax: (array([0, 1]), [None, None])


In [34]:
### DQN Agent (using Epsilon-Greedy)

# Epsilon-greedy action selector
selector = ptan.actions.EpsilonGreedyActionSelector(epsilon = 1.0)

# DQN agent
agent = ptan.agent.DQNAgent(dqn_model = net, action_selector = selector)

# Output of the agent
ag_out = agent(torch.zeros(10, 5))[0]

# Report
print("eps=1.0:", ag_out)

eps=1.0: [0 1 0 2 1 0 1 2 1 1]


In [35]:
### DQN Agent (using Epsilon-Greedy)

# Set epsilon to 0.5
selector.epsilon = 0.5

# Output of the agent
ag_out = agent(torch.zeros(10, 5))[0]

# Report
print("eps=0.5:", ag_out)

eps=0.5: [0 1 2 2 0 0 0 1 2 2]


In [8]:
### DQN Agent (using Epsilon-Greedy)

# Set epsilon to 0.1
selector.epsilon = 0.1

# Output of the agent
ag_out = agent(torch.zeros(10, 5))[0]

# Report
print("eps=0.1:", ag_out)

eps=0.1: [0 1 2 0 0 0 0 0 0 0]


<br>

##### PolicyNet using PTAN

In [37]:
# Policy network
class PolicyNet(nn.Module):
    
    # Constructor function
    def __init__(self, actions: int):
        
        # Call the parent's constructor
        super(PolicyNet, self).__init__()
        
        # Initialize the actions
        self.actions = actions

    # Forward function
    def forward(self, x):
        
        # Get the shape of output
        shape = (x.size()[0], self.actions)
        
        # Initialize the output with zeros
        output = torch.zeros(shape, dtype = torch.float32)
        
        # Update the output with first two actions having the same logit scores
        output[:, 0] = 1
        output[:, 1] = 1
        
        return output

In [38]:
# Initialize the policy network
net = PolicyNet(actions = 5)

# Feedforward
net_out = net(torch.zeros(6, 10))      # Input values are ignored; the first two output columns are set to 1

print("policy_net: \n", net_out)

policy_net: 
 tensor([[1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.]])


In [39]:
# Probability action selector
selector = ptan.actions.ProbabilityActionSelector()

# Policy agent
agent = ptan.agent.PolicyAgent(model = net, action_selector = selector, apply_softmax = True)

# Output of the agent
ag_out = agent(torch.zeros(6, 5))[0]

print(ag_out)

[3 1 1 0 0 0]


<br>

# 5. DQNAgent

---

This class is applicable in Q-learning when the action space is not very large, which covers Atari games and lots of classical problems. This representation is not universal, and later in the book, you will see ways of dealing with that. DQNAgent takes a batch of observations on input (as a NumPy array), applies the network on them to get Q-values, and then uses the provided ActionSelector to convert Q-values to indices of actions.

Let's consider a small example. For simplicity, our network always produces the same output for the input batch.

In [40]:
# Import the libraries
import torch
import torch.nn as nn

In [49]:
# DQN Network
class DQNNet(nn.Module):
    
    # Constructor function
    def __init__(self, actions: int):
        
        # Call the parent's constructor
        super(DQNNet, self).__init__()
        
        # Initialize the action
        self.actions = actions
 
    # Forward function
    def forward(self, x):
        
        # Produce a diagonal tensor of shape (batch_size, actions)
        output = torch.eye(x.size()[0], self.actions)
        
        return output

Once we have defined the above class, we can use it as a DQN model:

In [50]:
# Initialize the DQN network
net = DQNNet(actions=3)

# Feedforward
net(torch.zeros(2, 10))

tensor([[1., 0., 0.],
        [0., 1., 0.]])

We start with the simple argmax policy, so the agent will always return actions corresponding to 1s in the network output.

In [51]:
# Action selector (Argmax)
selector = ptan.actions.ArgmaxActionSelector()

# DQN Agent
agent = ptan.agent.DQNAgent(dqn_model = net, action_selector = selector)

# Pass a zero tensor to agent
agent(torch.zeros(2, 5))

(array([0, 1]), [None, None])

On the input, a batch of two observations, each having five values, was given, and on the output, the agent returned a tuple of two objects:

- An array with actions to be executed for every observation in the batch. In our case, this is action 0 for the first batch sample and action 1 for the second.
- A list with the agent's internal state. This is used for stateful agents and is a list of None in our case. As our agent is stateless, you can ignore it.

Now let's make the agent with an epsilon-greedy exploration strategy. For this, we just need to pass a different action selector:

In [52]:
# Action selector (Epsilon-Greedy)
selector = ptan.actions.EpsilonGreedyActionSelector(epsilon = 1.0)

# DQN Agent
agent = ptan.agent.DQNAgent(dqn_model = net, action_selector = selector)

# Pass a tensor to agent
agent(torch.zeros(10, 5))[0]

array([1, 1, 0, 1, 2, 0, 0, 1, 1, 0])

As epsilon is 1.0, all the actions will be random, regardless of the network's output. But we can change the epsilon value on the fly, which is very handy during the training when we are supposed to anneal epsilon over time:

In [80]:
# Change epsilon
selector.epsilon = 0.5

# Pass a tensor to DQN agent
agent(torch.zeros(10, 5))[0]

array([2, 1, 2, 2, 1, 2, 0, 2, 0, 0])

In [121]:
# Change epsilon
selector.epsilon = 0.1

# Pass a tensor to DQN agent
agent(torch.zeros(10, 5))[0]

array([0, 1, 2, 0, 1, 0, 0, 0, 0, 0])
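
Annealing epsilon over time usually means computing it from the current training step. The following is a minimal sketch of a linear schedule; the function name `linear_epsilon` and its default values are assumptions chosen for illustration, not values taken from PTAN or from this chapter.

```python
def linear_epsilon(step: int, eps_start: float = 1.0,
                   eps_final: float = 0.1, decay_steps: int = 1000) -> float:
    """Linearly interpolate from eps_start to eps_final, then stay at eps_final."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_final - eps_start)

for step in (0, 500, 1000, 5000):
    # In a training loop one would assign: selector.epsilon = linear_epsilon(step)
    print(step, round(linear_epsilon(step), 3))
```

Because the selector reads its epsilon attribute on every call, assigning the new value once per training step is enough for the agent to pick it up.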

<br>

# 6. PolicyAgent

---

PolicyAgent expects the network to produce policy distribution over a discrete set of actions. Policy distribution could be either logits (unnormalized) or a normalized distribution. In practice, you should always use logits to improve the numeric stability of the training process.

Let's reimplement our previous example, but now the network will produce policy logits instead of Q-values:

In [122]:
# Policy network
class PolicyNet(nn.Module):
    
    # Constructor function
    def __init__(self, actions: int):
        
        # Call the parent's constructor
        super(PolicyNet, self).__init__()
        
        # Initialize the actions
        self.actions = actions

    # Forward function
    def forward(self, x):
        
        # Get the shape of output
        shape = (x.size()[0], self.actions)
        
        # Initialize the output with zeros
        output = torch.zeros(shape, dtype = torch.float32)
        
        # Update the output with first two actions having the same logit scores
        output[:, 0] = 1
        output[:, 1] = 1
        
        return output

The class above could be used to get the action logits for a batch of observations (which is ignored in our example):

In [123]:
# Initialize the PolicyNet
net = PolicyNet(actions=5)

# Pass a tensor into the PolicyNet
net(torch.zeros(6, 10))

tensor([[1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.]])

Now we can use PolicyAgent in combination with ProbabilityActionSelector. As the latter expects normalized probabilities, we need to ask PolicyAgent to apply softmax to the network's output.

In [20]:
# Action selector
selector = ptan.actions.ProbabilityActionSelector()

# Policy Agent
agent = ptan.agent.PolicyAgent(model = net, action_selector = selector, apply_softmax = True)

# Pass a tensor into the agent
agent(torch.zeros(6, 5))[0]

array([1, 4, 1, 1, 4, 3])

Please note that the softmax operation produces non-zero probabilities for zero logits, so our agent can still select actions with an index greater than 1.

In [125]:
# Apply softmax to the logits (to obtain a probability distribution)
torch.nn.functional.softmax(input = net(torch.zeros(1, 10)), 
                            dim = 1)

tensor([[0.3222, 0.3222, 0.1185, 0.1185, 0.1185]])
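
The numbers above can be reproduced by hand. Here is a short sketch of softmax in plain Python, using the common max-subtraction trick as one example of why working with logits helps numeric stability; the helper name `softmax` is ours and this is not how PyTorch implements it internally.

```python
import math

def softmax(logits):
    """Softmax with the usual max-subtraction trick to avoid overflow in exp()."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 1.0, 0.0, 0.0, 0.0])
print([round(p, 4) for p in probs])   # [0.3222, 0.3222, 0.1185, 0.1185, 0.1185]

# Large logits would overflow a naive exp(), but the shifted version stays finite.
print([round(p, 4) for p in softmax([1000.0, 1000.0, 999.0])])
```

Each of the two logits equal to 1 gets probability e / (2e + 3), and each zero logit gets 1 / (2e + 3), which matches the tensor printed above.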

<br>

# 7. Experience source

---

The agent abstraction described in the previous section allows us to implement environment communications in a generic way. These communications happen in the form of trajectories, produced by applying the agent's actions to the Gym environment.

At a high level, experience source classes take the agent instance and environment and provide you with step-by-step data from the trajectories. The functionality of those classes includes:
- Support of multiple environments being communicated at the same time. This allows efficient GPU utilization as a batch of observations is being processed by the agent at once.
- A trajectory can be preprocessed and presented in a convenient form for further training. For example, there is an implementation of subtrajectory rollouts with accumulation of the reward. That preprocessing is convenient for DQN and n-step DQN, when we are not interested in individual intermediate steps in subtrajectories, so they can be dropped. This saves memory and reduces the amount of code we need to write.
- Support of vectorized environments from OpenAI Universe. We will cover this in Chapter 16, Web Navigation, for web automation and MiniWoB environments.

So, the experience source classes act as a "magic black box" to hide the environment interaction and trajectory handling complexities from the library user. But the overall PTAN philosophy is to be flexible and extensible, so if you want, you can subclass one of the existing classes or implement your own version as needed.

There are three classes provided by the system:
- __ExperienceSource:__ using the agent and the set of environments, it produces n-step subtrajectories with all intermediate steps
- __ExperienceSourceFirstLast:__ this is the same as ExperienceSource, but instead of a full subtrajectory (with all steps), it keeps only the first and last steps, with proper reward accumulation in between. This can save a lot of memory in the case of n-step DQN or advantage actor-critic (A2C) rollouts
- __ExperienceSourceRollouts:__ this follows the asynchronous advantage actor-critic (A3C) rollouts scheme described in Mnih's paper about Atari games (referenced in Chapter 12, The Actor-Critic Method)
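
The "first and last steps, with proper reward accumulation in between" idea can be pictured with a small sketch. The names `Step`, `FirstLast`, and `collapse` below are hypothetical and the code is a simplified illustration, not PTAN's implementation.

```python
from collections import namedtuple

Step = namedtuple("Step", ["state", "action", "reward"])
FirstLast = namedtuple("FirstLast", ["state", "action", "reward", "last_state"])

def collapse(steps, last_state, gamma):
    """Fold an n-step subtrajectory into one record with the discounted reward sum."""
    total = 0.0
    # Walk backwards so each earlier reward discounts everything after it.
    for step in reversed(steps):
        total = step.reward + gamma * total
    return FirstLast(steps[0].state, steps[0].action, total, last_state)

# Three consecutive steps, each with reward 1.0.
steps = [Step(0, 1, 1.0), Step(1, 1, 1.0), Step(2, 1, 1.0)]
print(collapse(steps, last_state=3, gamma=1.0))   # reward is 3.0
print(collapse(steps, last_state=3, gamma=0.5))   # reward is 1 + 0.5 + 0.25 = 1.75
```

Only the first state, the first action, the accumulated reward, and the final state are kept, so the intermediate steps do not need to be stored.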

All the classes are written to be efficient both in terms of central processing unit (CPU) and memory, which is not very important for toy problems, but might become an issue when you want to solve Atari games and need to keep 10M samples in the replay buffer using commodity hardware.

In [146]:
# Import the libraries
import gym
import ptan
from typing import List, Optional, Tuple, Any
from pprint import pprint

In [127]:
# Environment
class ToyEnv(gym.Env):
    """
    Environment with observations 0-4 and actions 0-2. Observations rotate sequentially mod 5, and the reward
    equals the given action. Episodes have a fixed length of 10 steps.
    """

    # Constructor function
    def __init__(self):
        
        # Call the parent's constructor
        super(ToyEnv, self).__init__()
        
        # Initialize the observation space
        self.observation_space = gym.spaces.Discrete(n=5)
        
        # Initialize the action space
        self.action_space = gym.spaces.Discrete(n=3)
        
        # Initialize the step index
        self.step_index = 0

        
    # Reset function
    def reset(self):
        
        # Reset the step index to zero
        self.step_index = 0
        
        return self.step_index

    
    # Step function
    def step(self, action):
        
        # Get is_done
        is_done = self.step_index == 10
        
        # If step index is 10
        if is_done:
            
            return self.step_index % self.observation_space.n, 0.0, is_done, {}
        
        # Increment the step index
        self.step_index += 1
        
        return self.step_index % self.observation_space.n, float(action), self.step_index == 10, {}

In [149]:
# Initialize the environment
env = ToyEnv()

# Reset the environment
s = env.reset()

print("env.reset() : %s" % s)

env.reset() : 0


In [150]:
# Take action 1
s = env.step(1)
print("env.step(1) : %s" % str(s))

env.step(1) : (1, 1.0, False, {})


In [151]:
# Take action 2
s = env.step(2)
print("env.step(2) : %s" % str(s))

env.step(2) : (2, 2.0, False, {})


In [136]:
# Loop 10 times
for _ in range(10):

    # Take action 0
    r = env.step(0)
    
    print(r)

(3, 0.0, False, {})
(4, 0.0, False, {})
(0, 0.0, False, {})
(1, 0.0, False, {})
(2, 0.0, False, {})
(3, 0.0, False, {})
(4, 0.0, False, {})
(0, 0.0, True, {})
(0, 0.0, True, {})
(0, 0.0, True, {})


In [128]:
# Agent
class DullAgent(ptan.agent.BaseAgent):
    """
    Agent always returns the fixed action
    """
    
    # Constructor
    def __init__(self, action: int):
        
        # Initialize the action
        self.action = action

    # Call function
    def __call__(self, observations: List[Any], state: Optional[List] = None) -> Tuple[List[int], Optional[List]]:
        return [self.action for _ in observations], state

In [137]:
# Initialize the agent
agent = DullAgent(action = 1)

print("agent:", agent([1, 2])[0])

agent: [1, 1]


In [140]:
# Experience source (2 step count)
exp_source = ptan.experience.ExperienceSource(env = env, agent = agent, steps_count = 2)

# Print the first 16 subtrajectories
for idx, exp in enumerate(exp_source):
    if idx > 15:
        break
    print(exp)

(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))
(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))
(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))
(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False))
(Experience(state=4, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))
(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))
(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))
(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))
(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=True))
(E

In [145]:
# Experience source (4 step count)
exp_source = ptan.experience.ExperienceSource(env = env, agent = agent, steps_count = 4)

pprint(next(iter(exp_source)))

(Experience(state=0, action=1, reward=1.0, done=False),
 Experience(state=1, action=1, reward=1.0, done=False),
 Experience(state=2, action=1, reward=1.0, done=False),
 Experience(state=3, action=1, reward=1.0, done=False))


In [147]:
# Experience source (multiple environment)
exp_source = ptan.experience.ExperienceSource(env = [ToyEnv(), ToyEnv()], agent = agent, steps_count = 2)

# Print the first 5 subtrajectories
for idx, exp in enumerate(exp_source):
    if idx > 4:
        break
    print(exp)

(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))
(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))
(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))
(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))
(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))
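
In the output above, consecutive items alternate between the two environments. This interleaving can be pictured with a plain-Python sketch; `env_states` and `round_robin` are hypothetical stand-ins used only to show the ordering, not PTAN code.

```python
from itertools import count, islice

def env_states(name):
    """Hypothetical stand-in for one environment: yields (name, step number)."""
    for t in count():
        yield name, t

def round_robin(*sources):
    """Take one item from each source in turn, forever."""
    while True:
        for source in sources:
            yield next(source)

print(list(islice(round_robin(env_states("env0"), env_states("env1")), 6)))
```

Each source advances independently, so every environment contributes one item before any of them contributes a second one.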


In [148]:
# Experience source (first-last)
exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=1.0, steps_count=1)

# Print the first 12 experiences
for idx, exp in enumerate(exp_source):
    print(exp)
    if idx > 10:
        break

ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)
ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)
ExperienceFirstLast(state=3, action=1, reward=1.0, last_state=4)
ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=0)
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)
ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)
ExperienceFirstLast(state=3, action=1, reward=1.0, last_state=4)
ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=None)
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)


<br>

# 8. Toy environment

---

For demonstration, we will implement a very simple Gym environment with a small predictable observation state to show how experience source classes work. This environment has integer observation, which increases from 0 to 4, integer action, and a reward equal to the action given.
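
Before looking at the Gym class, it may help to see the observation sequence on its own. This short sketch, which needs no Gym installation, lists the observation returned by reset() followed by the observation after each of the 10 steps of one episode; the constant and function names are ours.

```python
N_OBSERVATIONS = 5    # observations cycle through 0..4
EPISODE_LENGTH = 10   # the episode ends after 10 steps

def episode_observations():
    """Observation returned by reset, followed by the one after each step."""
    return [t % N_OBSERVATIONS for t in range(EPISODE_LENGTH + 1)]

obs = episode_observations()
print(obs)   # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0]
```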

In [80]:
# Import the libraries
import gym
from typing import List, Tuple, Any, Optional

In [81]:
# Environment
class ToyEnv(gym.Env):
    
    # Constructor
    def __init__(self):
        
        # Call the parent's constructor
        super(ToyEnv, self).__init__()
        
        # Observation space
        self.observation_space = gym.spaces.Discrete(n=5)
        
        # Action space
        self.action_space = gym.spaces.Discrete(n=3)
        
        # Step index
        self.step_index = 0
       
    
    # Reset function
    def reset(self):
        
        # Set step_index to zero
        self.step_index = 0
        
        return self.step_index
    
    
    # Step function
    def step(self, action):
        
        # Boolean if step_index is 10
        is_done = self.step_index == 10
        
        # If step index is equal to 10
        if is_done:
            
            return self.step_index % self.observation_space.n, 0.0, is_done, {}
        
        # Increment the step index
        self.step_index += 1
        
        return self.step_index % self.observation_space.n, float(action), self.step_index == 10, {}

In addition to this environment, we will use an agent that always generates fixed actions regardless of observations:

In [82]:
# Agent
class DullAgent(ptan.agent.BaseAgent):
    """
    Agent always returns the fixed action
    """
    
    # Constructor
    def __init__(self, action: int):
        
        # Initialize the actions
        self.action = action

    # Call function that generates fixed actions regardless of observations
    def __call__(self, observations: List[Any], state: Optional[List] = None) -> Tuple[List[int], Optional[List]]:
        
        return [self.action for _ in observations], state

<br>

# 9. The ExperienceSource class

---

The first class is __ptan.experience.ExperienceSource__, which generates chunks of agent trajectories of the given length. The implementation automatically handles the end of episode situation (when the step() method in the environment returns is_done=True) and resets the environment.

The constructor accepts several arguments:
- The Gym environment to be used. Alternatively, it could be the list of environments.
- The agent instance.
- steps_count=2: the length of subtrajectories to be generated.
- vectorized=False: if set to True, the environment needs to be an OpenAI Universe vectorized environment. We will discuss such environments in detail in Chapter 16, Web Navigation.

The class instance provides the standard Python iterator interface, so you can just iterate over this to get subtrajectories:

In [87]:
# Instantiate the environment
env = ToyEnv()

# Instantiate the agent
agent = DullAgent(action=1)

# Instantiate the experience source (2 steps count)
exp_source = ptan.experience.ExperienceSource(env = env, agent = agent, steps_count = 2)

# Get trajectories until 2nd index
for idx, exp in enumerate(exp_source):
    if idx > 2:
        break
    print(exp)

(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))
(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))
(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))


On every iteration, ExperienceSource returns a piece of the agent's trajectory in environment communication. It might look simple, but there are several things happening under the hood of our example:
1. reset() was called in the environment to get the initial state
2. The agent was asked to select the action to execute from the state returned
3. The step() method was executed to get the reward and the next state
4. This next state was passed to the agent for the next action
5. Information about the transition from one state to the next state was returned
6. The process repeated (from step 3) until we stopped iterating over the experience source
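
The steps above can be sketched as a plain Python generator. This is a simplified illustration under our own assumptions, not PTAN's actual code: `CounterEnv` is a hypothetical three-step environment that needs no Gym, and `experience_windows` is a hypothetical name.

```python
from collections import deque, namedtuple
from itertools import islice

Experience = namedtuple("Experience", ["state", "action", "reward", "done"])

class CounterEnv:
    """Hypothetical stand-in for the toy environment: 3-step episodes, no Gym needed."""
    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, float(action), self.t == 3, {}

def experience_windows(env, policy, steps_count):
    """Yield sliding windows of up to steps_count consecutive transitions."""
    state = env.reset()                                  # 1. initial state
    window = deque(maxlen=steps_count)
    while True:
        action = policy(state)                           # 2. and 4. pick an action
        next_state, reward, done, _ = env.step(action)   # 3. environment step
        window.append(Experience(state, action, reward, done))
        if len(window) == steps_count or done:
            yield tuple(window)                          # 5. return the transitions
        state = next_state
        if done:
            # Drain the shorter tail windows, then start a new episode.
            while len(window) > 1:
                window.popleft()
                yield tuple(window)
            window.clear()
            state = env.reset()

for item in islice(experience_windows(CounterEnv(), lambda s: 1, steps_count=2), 5):
    print([(e.state, e.done) for e in item])
```

In this sketch the windows have length 2, except for a single shorter window at the end of each episode, after which the environment is reset and iteration continues.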


If the agent changes the way it generates actions (we can get this by updating the network weights, decreasing epsilon, or by some other means), it will immediately affect the experience trajectories that we get.

The ExperienceSource instance returns tuples of length equal to or less than the steps_count argument passed on construction. In our case, we asked for two-step subtrajectories, so tuples will be of length 2 or 1 (at the end of episodes). Every object in a tuple is an instance of the ptan.experience.Experience class, which is a namedtuple with the following fields:
- state: the state we observed before taking the action
- action: the action we completed
- reward: the immediate reward we got from env
- done: whether the episode was done

If the episode reaches the end, the subtrajectory will be shorter and the underlying environment will be reset automatically, so we don't need to bother with this and can just keep iterating.

In [88]:
# Get trajectories until 15th index
for idx, exp in enumerate(exp_source):
    if idx > 15:
        break
    print(exp)

(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))
(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))
(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))
(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False))
(Experience(state=4, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))
(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))
(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))
(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False))
(Experience(state=3, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=True))
(E

We can ask ExperienceSource for subtrajectories of any length.

In [89]:
# Experience source (4 step count)
exp_source = ptan.experience.ExperienceSource(env = env, agent = agent, steps_count = 4)

In [90]:
# Get the next trajectory
next(iter(exp_source))

(Experience(state=0, action=1, reward=1.0, done=False),
 Experience(state=1, action=1, reward=1.0, done=False),
 Experience(state=2, action=1, reward=1.0, done=False),
 Experience(state=3, action=1, reward=1.0, done=False))

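We can pass it several instances of gym.Env. In that case, they will be used in round-robin fashion. Note that the cell below passes the same env object twice, so both slots step one shared environment; that is why the states within each printed tuple skip values.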

In [44]:
# Experience source (multiple environment)
exp_source = ptan.experience.ExperienceSource(env = [env, env], agent = agent, steps_count = 4)

In [45]:
# Get trajectories until 4th index
for idx, exp in enumerate(exp_source):
    if idx > 4:
        break
    print(exp)    

(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False))
(Experience(state=0, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False))
(Experience(state=1, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=False), Experience(state=0, action=1, reward=1.0, done=False), Experience(state=2, action=1, reward=1.0, done=False))
(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False), Experience(state=1, action=1, reward=1.0, done=False), Experience(state=3, action=1, reward=1.0, done=True))
(Experience(state=2, action=1, reward=1.0, done=False), Experience(state=4, action=1, reward=1.0, done=False), Experi

<br>

# 10. ExperienceSourceFirstLast

---

The class ExperienceSource provides us with full subtrajectories of the given length as a tuple of (s, a, r) objects. The next state, s', is returned in the next tuple, which is not always convenient. For example, in DQN training, we want to have tuples (s, a, r, s') at once to do one-step Bellman approximation during the training. In addition, some extensions of DQN, like n-step DQN, might want to collapse longer sequences of observations into (first-state, action, total-reward-for-n-steps, state-after-step-n).

To support this in a generic way, a simple subclass of ExperienceSource is implemented: ExperienceSourceFirstLast. It accepts almost the same arguments in the constructor, but returns different data.

In [91]:
# Experience source first-last (1 steps count)
exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma = 1.0, steps_count = 1)

# Get trajectories until 10th index
for idx, exp in enumerate(exp_source):
    print(exp)
    if idx > 10:
        break

ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)
ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)
ExperienceFirstLast(state=3, action=1, reward=1.0, last_state=4)
ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=0)
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)
ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)
ExperienceFirstLast(state=3, action=1, reward=1.0, last_state=4)
ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=None)
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)


Now it returns a single object on every iteration, which is again a namedtuple with the following fields:
- state: the state we used to decide on the action to take
- action: the action we took at this step
- reward: the partial accumulated reward for steps_count (in our case, steps_count=1, so it is equal to the immediate reward)
- last_state: the state we got after executing the action. If our episode ends, we have None here

This data is much more convenient for DQN training, as we can apply Bellman approximation directly to it.
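
To make that concrete, here is a minimal sketch of a one-step Bellman target computed from such a record. The helper `bellman_target` and the toy Q-table are hypothetical and exist only for illustration; a real DQN would obtain the Q-values from a network.

```python
from collections import namedtuple

# Local stand-in with the same fields as the records printed above
ExperienceFirstLast = namedtuple(
    "ExperienceFirstLast", ["state", "action", "reward", "last_state"]
)


def bellman_target(exp, q_values, gamma):
    """Return r + gamma * max_a Q(s', a), or just r for a terminal transition.

    q_values is a hypothetical callable mapping a state to a list of action values.
    """
    if exp.last_state is None:
        # The episode ended, so there is no future value to bootstrap from
        return exp.reward
    return exp.reward + gamma * max(q_values(exp.last_state))


# Toy Q-table: two actions per state, values chosen arbitrarily for the demo
q_table = {s: [float(s), float(s) + 0.5] for s in range(5)}

mid_episode = ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
terminal = ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=None)

print(bellman_target(mid_episode, q_table.get, gamma=0.5))  # 1.0 + 0.5 * 1.5
print(bellman_target(terminal, q_table.get, gamma=0.5))     # reward only
```

For a record that collapses n steps, the bootstrap term would be discounted by `gamma ** n` rather than `gamma`.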

Let's check the result with a larger number of steps:

In [92]:
# Experience source first-last (2 steps count)
exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma = 1.0, steps_count = 2)

# Get trajectories until 10th index
for idx, exp in enumerate(exp_source):
    print(exp)
    if idx > 10:
        break

ExperienceFirstLast(state=0, action=1, reward=2.0, last_state=2)
ExperienceFirstLast(state=1, action=1, reward=2.0, last_state=3)
ExperienceFirstLast(state=2, action=1, reward=2.0, last_state=4)
ExperienceFirstLast(state=3, action=1, reward=2.0, last_state=0)
ExperienceFirstLast(state=4, action=1, reward=2.0, last_state=1)
ExperienceFirstLast(state=0, action=1, reward=2.0, last_state=2)
ExperienceFirstLast(state=1, action=1, reward=2.0, last_state=3)
ExperienceFirstLast(state=2, action=1, reward=2.0, last_state=4)
ExperienceFirstLast(state=3, action=1, reward=2.0, last_state=None)
ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=None)
ExperienceFirstLast(state=0, action=1, reward=2.0, last_state=2)
ExperienceFirstLast(state=1, action=1, reward=2.0, last_state=3)


So, now we are collapsing two steps on every iteration and calculating the accumulated reward over them (that's why reward=2.0 for most of the samples, given gamma=1.0). More interesting samples are at the end of the episode:

        ExperienceFirstLast(state=3, action=1, reward=2.0, last_state=None)
        ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=None)
        
As the episode ends, we have last_state=None in those samples, but additionally, we calculate the reward for the tail of the episode. Those tiny details are very easy to implement wrongly if you are doing all the trajectory handling yourself.
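
The collapsing logic, including the end-of-episode case, can be sketched as follows. This is an illustrative helper with a hypothetical name and signature (`collapse`), not PTAN's code; it assumes the caller passes a window that never spans two episodes, together with the observation that follows the window.

```python
from collections import namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "done"])
ExperienceFirstLast = namedtuple(
    "ExperienceFirstLast", ["state", "action", "reward", "last_state"]
)


def collapse(steps, next_state, gamma):
    """Collapse consecutive steps into one first-last record.

    steps:      non-empty list of Experience objects from a single episode
    next_state: observation seen after the final step (ignored if that step ended the episode)
    gamma:      discount factor applied to each later reward
    """
    total_reward = 0.0
    # Walk backwards so every reward is discounted once per step of delay
    for exp in reversed(steps):
        total_reward = exp.reward + gamma * total_reward
    last_state = None if steps[-1].done else next_state
    return ExperienceFirstLast(steps[0].state, steps[0].action, total_reward, last_state)


# A mid-episode window and the two tail windows from the output above
mid = collapse([Experience(0, 1, 1.0, False), Experience(1, 1, 1.0, False)],
               next_state=2, gamma=1.0)
tail_two = collapse([Experience(3, 1, 1.0, False), Experience(4, 1, 1.0, True)],
                    next_state=None, gamma=1.0)
tail_one = collapse([Experience(4, 1, 1.0, True)], next_state=None, gamma=1.0)

print(mid)
print(tail_two)
print(tail_one)
```

With `gamma` below 1.0, the same loop yields the discounted n-step reward instead of a plain sum.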

<br>

# 11. Experience replay buffers

---

In DQN, we rarely deal with immediate experience samples, as they are heavily correlated, which leads to instability in the training. Normally, we have large replay buffers, which are populated with experience pieces. Then the buffer is sampled (randomly or with priority weights) to get the training batch. The replay buffer normally has a maximum capacity, so old samples are pushed out when the replay buffer reaches the limit.

There are several implementation tricks here, which become extremely important when you need to deal with large problems:

- How to efficiently sample from a large buffer
- How to push old samples from the buffer
- In the case of a prioritized buffer, how priorities need to be maintained and handled in the most efficient way

All this becomes a non-trivial task if you want to solve Atari, keeping 10-100M samples, where every sample is an image from the game. A small mistake can lead to a 10-100x memory increase and major slowdowns of the training process.

PTAN provides several variants of replay buffers, which integrate simply with ExperienceSource and Agent machinery. Normally, what you need to do is ask the buffer to pull a new sample from the source and sample the training batch. The provided classes are:

- __ExperienceReplayBuffer:__ a simple replay buffer of predefined size with uniform sampling.
- __PrioReplayBufferNaive:__ a simple, but not very efficient, prioritized replay buffer implementation. The complexity of sampling is O(n), which might become an issue with large buffers. Its advantage over the optimized class is much simpler code.
- __PrioritizedReplayBuffer:__ uses segment trees for sampling, which makes the code cryptic, but gives O(log(n)) sampling complexity.

The following shows how the replay buffer could be used:

In [95]:
# Import the libraries
import gym
import ptan
from typing import List, Optional, Tuple, Any

In [96]:
# Environment
class ToyEnv(gym.Env):
    """
    Environment with observations 0-4 and actions 0-2. Observations rotate sequentially mod 5, and the reward
    is equal to the given action. Episodes have a fixed length of 10.
    """

    # Constructor
    def __init__(self):
        
        # Call the parent's constructor
        super(ToyEnv, self).__init__()
        
        # Observation space
        self.observation_space = gym.spaces.Discrete(n=5)
        
        # Action space
        self.action_space = gym.spaces.Discrete(n=3)
        
        # Initialize the step index to zero
        self.step_index = 0
        

    # Reset function
    def reset(self):
        
        # Set step index back to zero
        self.step_index = 0
        
        return self.step_index

    
    # Step function
    def step(self, action):
        
        # Get is_done
        is_done = self.step_index == 10
        
        # If terminal state
        if is_done:
            
            return self.step_index % self.observation_space.n, 0.0, is_done, {}
        
        # Increment the step index
        self.step_index += 1
        
        return self.step_index % self.observation_space.n, float(action), self.step_index == 10, {}

In [97]:
# Agent
class DullAgent(ptan.agent.BaseAgent):
    """
    Agent always returns the fixed action
    """
    
    # Constructor
    def __init__(self, action: int):
        
        # Initialize the action
        self.action = action

    # Call function
    def __call__(self, observations: List[Any], state: Optional[List] = None) -> Tuple[List[int], Optional[List]]:
        
        return [self.action for _ in observations], state

In [99]:
# Start the program
if __name__ == "__main__":
    
    # Instantiate the environment
    env = ToyEnv()
    
    # Instantiate the agent
    agent = DullAgent(action=1)
    
    # Instantiate the experience source
    exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma = 1.0, steps_count = 1)
    
    # Instantiate the experience buffer
    buffer = ptan.experience.ExperienceReplayBuffer(exp_source, buffer_size = 100)

    # Loop over 6 times
    for step in range(6):
        
        # Populate the buffer
        buffer.populate(1)
        
        # If the buffer holds fewer than 5 samples
        if len(buffer) < 5:
            
            # Start the loop again
            continue
            
        # Sample batches
        batch = buffer.sample(4)
        
        print("\nTrain time, %d batch samples:" % len(batch))
        
        # Iterate over the samples in the batch
        for s in batch:
            
            print(s)


Train time, 4 batch samples:
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=0)

Train time, 4 batch samples:
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=0)
ExperienceFirstLast(state=4, action=1, reward=1.0, last_state=0)
ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)


All replay buffers provide the following interface:
- A Python iterator interface to walk over all the samples in the buffer
- The method populate(N) to get N samples from the experience source and put them into the buffer
- The method sample(N) to get the batch of N experience objects

So, the normal training loop for DQN looks like an infinite repetition of the following steps:
1. Call buffer.populate(1) to get a fresh sample from the environment
2. batch = buffer.sample(BATCH_SIZE) to get the batch from the buffer
3. Calculate the loss on the sampled batch
4. Backpropagate
5. Repeat until convergence (hopefully)

All the rest happens automatically: resetting the environment, handling subtrajectories, buffer size maintenance, and so on.
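
To show what such a buffer has to do internally, here is a minimal sketch of a uniform replay buffer exposing the same three-part interface (iteration, `populate`, `sample`). The class `SimpleReplayBuffer` is a hypothetical illustration rather than PTAN's implementation, and it assumes that sampling with replacement is acceptable.

```python
import random
from itertools import count


class SimpleReplayBuffer:
    """Hypothetical fixed-capacity buffer with uniform sampling."""

    def __init__(self, source, capacity):
        self.source_iter = iter(source)   # any iterable of experience objects
        self.capacity = capacity
        self.buffer = []
        self.pos = 0                      # slot that will be written next

    def __len__(self):
        return len(self.buffer)

    def __iter__(self):
        return iter(self.buffer)

    def populate(self, n):
        """Pull n samples from the source, overwriting the oldest ones when full."""
        for _ in range(n):
            sample = next(self.source_iter)
            if len(self.buffer) < self.capacity:
                self.buffer.append(sample)
            else:
                self.buffer[self.pos] = sample
            self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        """Return batch_size samples drawn uniformly, with replacement."""
        return random.choices(self.buffer, k=batch_size)


# Integers stand in for experience objects in this demo
demo_buffer = SimpleReplayBuffer(count(), capacity=5)
demo_buffer.populate(8)
print(sorted(demo_buffer))    # the five most recent samples
print(demo_buffer.sample(4))  # a random batch
```

Overwriting slots in a preallocated list keeps eviction at constant cost, which matters once the buffer holds millions of samples.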

<br>

# 12. The TargetNet class

---

TargetNet is a small but useful class that allows us to synchronize two NNs of the same architecture. The purpose of this was described in the previous chapter: improving training stability. TargetNet supports two modes of such synchronization:

- sync(): weights from the source network are copied into the target network.
- alpha_sync(): the source network's weights are blended into the target network with some alpha weight (between 0 and 1).

The first mode is the standard way to perform a target network sync in discrete action space problems, like Atari and CartPole. We did this in Chapter 6, Deep Q-Networks. The latter mode is used in continuous control problems, which will be described in several chapters in part four of the book. In such problems, the transition between the two networks' parameters should be smooth, so alpha blending is used, given by the formula ${w_i = w_i \alpha + s_i(1-\alpha)}$, where ${w_i}$ is the target network's i<sup>th</sup> parameter and ${s_i}$ is the source network's corresponding parameter. The following is a small example of how _TargetNet_ should be used in code.

In [100]:
# Import the libraries
import ptan
import torch.nn as nn

In [101]:
# DQN Network
class DQNNet(nn.Module):
    
    # Constructor
    def __init__(self):
        
        # Call the parent's constructor
        super(DQNNet, self).__init__()
        
        # Initialize a linear layer
        self.ff = nn.Linear(5, 3)

    # Forward function
    def forward(self, x):
        
        return self.ff(x)

In [112]:
# Start the program
if __name__ == "__main__":
    
    # Instantiate the DQN network (source & target network)
    net = DQNNet()
    tgt_net = ptan.agent.TargetNet(net)
    print("* Source or Target Network: \n", net, "\n")
    
    # Print the weights (source & target network)
    print("* Source Network's Weight: \n", net.ff.weight, "\n")
    print("* Target Network's Weight: \n", tgt_net.target_model.ff.weight, "\n")
    
    # Add 1 to source network
    net.ff.weight.data += 1.0
    print("* Source Network's Weight (After Adding 1 to Source Network): \n", net.ff.weight, "\n")
    print("* Target Network's Weight (After Adding 1 to Source Network): \n", tgt_net.target_model.ff.weight, "\n")
    
    # Sync the weights
    tgt_net.sync()
    print("* Source Network's Weight (After Synching): \n", net.ff.weight, "\n")
    print("* Target Network's Weight (After Synching): \n", tgt_net.target_model.ff.weight)

* Source or Target Network: 
 DQNNet(
  (ff): Linear(in_features=5, out_features=3, bias=True)
) 

* Source Network's Weight: 
 Parameter containing:
tensor([[-0.0047, -0.2140, -0.2696, -0.4327, -0.2010],
        [ 0.0379, -0.1393,  0.0023, -0.3331, -0.3991],
        [ 0.2437, -0.3072, -0.3382, -0.0287, -0.2591]], requires_grad=True) 

* Target Network's Weight: 
 Parameter containing:
tensor([[-0.0047, -0.2140, -0.2696, -0.4327, -0.2010],
        [ 0.0379, -0.1393,  0.0023, -0.3331, -0.3991],
        [ 0.2437, -0.3072, -0.3382, -0.0287, -0.2591]], requires_grad=True) 

* Source Network's Weight (After Adding 1 to Source Network): 
 Parameter containing:
tensor([[0.9953, 0.7860, 0.7304, 0.5673, 0.7990],
        [1.0379, 0.8607, 1.0023, 0.6669, 0.6009],
        [1.2437, 0.6928, 0.6618, 0.9713, 0.7409]], requires_grad=True) 

* Target Network's Weight (After Adding 1 to Source Network): 
 Parameter containing:
tensor([[-0.0047, -0.2140, -0.2696, -0.4327, -0.2010],
        [ 0.0379, -0.13
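
The blending formula given earlier in this section can be tried out on plain Python lists. This is a framework-free sketch with a hypothetical helper name (`alpha_blend`); it illustrates only the arithmetic, not how TargetNet stores or updates its parameters.

```python
def alpha_blend(target, source, alpha):
    """Return w_i * alpha + s_i * (1 - alpha) for every pair of parameters."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    return [w * alpha + s * (1.0 - alpha) for w, s in zip(target, source)]


target_params = [0.0, 0.0]
source_params = [1.0, 2.0]

# Three consecutive soft updates move the target halfway towards the source each time
for _ in range(3):
    target_params = alpha_blend(target_params, source_params, alpha=0.5)

print(target_params)  # [0.875, 1.75]
```

With this convention, an alpha close to 1 keeps most of the old target weights, so the target network trails the source network slowly.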

<br>

# 13. Ignite helpers

---

PyTorch Ignite was briefly discussed in Chapter 3, Deep Learning with PyTorch, and it will be used in the rest of the book to reduce the amount of training loop code. PTAN provides several small helpers to simplify integration with Ignite, which reside in the ptan.ignite package:
- __EndOfEpisodeHandler:__ attached to the ignite.Engine, it emits an EPISODE_COMPLETED event and tracks the episode's reward and number of steps in the engine's metrics. It can also emit an event when the average reward for the last episodes reaches a predefined boundary, which is intended for stopping the training once some goal reward is reached.

- __EpisodeFPSHandler:__ tracks the number of interactions between the agent and environment that are performed and calculates performance metrics as frames per second. It also keeps the number of seconds passed since the start of the training.

- __PeriodicEvents:__ emits corresponding events every 10, 100, or 1,000 training iterations. It is useful for reducing the amount of data being written into TensorBoard.

A detailed illustration of how the preceding classes can be used will be given in the next chapter, where we will use them to reimplement DQN training from Chapter 6, Deep Q-Networks, and then check several DQN extensions and tweaks to improve basic DQN convergence.

<br>

# 14. The PTAN CartPole solver

---

Let's now take the PTAN classes (without Ignite so far) and try to combine everything together to solve our first environment: CartPole. The complete code is in Chapter07/06_cartpole.py. I will show only the important parts of the code related to the material that we have just covered.

In [115]:
# Import the libraries
import gym
import ptan
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

In [116]:
# Hyperparameters
HIDDEN_SIZE = 128
BATCH_SIZE = 16
TGT_NET_SYNC = 10
GAMMA = 0.9
REPLAY_SIZE = 1000
LR = 1e-3
EPS_DECAY=0.99

In [117]:
# Network
class Net(nn.Module):
    
    # Constructor
    def __init__(self, obs_size, hidden_size, n_actions):
        
        # Call the parent's constructor
        super(Net, self).__init__()
        
        # Network
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    # Forward function
    def forward(self, x):
        
        # Feedforward the x into the network
        output = self.net(x.float())
        
        return output

In [118]:
# Unpack the batch into tensors and compute the target Q-values (no gradients are tracked)
@torch.no_grad()
def unpack_batch(batch, net, gamma):
    
    # Initialize empty list for S, A, R, is_done, S'
    states = []
    actions = []
    rewards = []
    done_masks = []
    last_states = []
    
    # Loop over experiences in batch
    for exp in batch:
        
        # Append the S, A, R, is_done, S' to list
        states.append(exp.state)
        actions.append(exp.action)
        rewards.append(exp.reward)
        done_masks.append(exp.last_state is None)
        if exp.last_state is None:
            last_states.append(exp.state)
        else:
            last_states.append(exp.last_state)

    # Convert S, A, R, S' to torch tensor
    states_v = torch.tensor(states)
    actions_v = torch.tensor(actions)
    rewards_v = torch.tensor(rewards)
    last_states_v = torch.tensor(last_states)
    
    # Feedforward the last state into network
    last_state_q_v = net(last_states_v)
    
    # Get the maximum value
    best_last_q_v = torch.max(last_state_q_v, dim=1)[0]
    
    # Zero out the next-state value for terminal transitions
    best_last_q_v[done_masks] = 0.0
    
    return states_v, actions_v, best_last_q_v * gamma + rewards_v

In [119]:
# Start the program
if __name__ == "__main__":
    
    # Create the CartPole environment
    env = gym.make("CartPole-v0")
    
    # Get the observation size
    obs_size = env.observation_space.shape[0]
    
    # Get the number of actions
    n_actions = env.action_space.n
    
    # Instantiate the source network
    net = Net(obs_size, HIDDEN_SIZE, n_actions)
    
    # Instantiate the target network
    tgt_net = ptan.agent.TargetNet(net)
    
    # Action selector (argmax)
    selector = ptan.actions.ArgmaxActionSelector()
    
    # Action selector (epsilon greedy)
    selector = ptan.actions.EpsilonGreedyActionSelector(epsilon = 1, selector = selector)
    
    # DQN Agent
    agent = ptan.agent.DQNAgent(net, selector)
    
    # Experience source first-last
    exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma = GAMMA)
    
    # Experience buffer
    buffer = ptan.experience.ExperienceReplayBuffer(exp_source, buffer_size = REPLAY_SIZE)
    optimizer = optim.Adam(net.parameters(), LR)

    # Initialize the step number
    step = 0
    
    # Initialize the episode number
    episode = 0
    
    # Initialize the 'solved' to false
    solved = False

    # Infinite loop
    while True:
        
        # Increment the step number
        step += 1
        
        # Populate the buffer
        buffer.populate(1)

        # Iterate over reward and steps of current experience source
        for reward, steps in exp_source.pop_rewards_steps():
            
            # Increment the episode number
            episode += 1
            
            # Report
            print("%d: episode %d done, reward=%.3f, epsilon=%.2f" % (step, episode, reward, selector.epsilon))
            
            # If reward is higher than 150 then set 'solved' to true
            solved = reward > 150
            
        # If solved
        if solved:
            
            # Report
            print("Congrats!")
            
            # Break the loop
            break

        # If the buffer holds fewer than 2*BATCH_SIZE samples
        if len(buffer) < 2*BATCH_SIZE:
            
            # Start the loop over
            continue

        # Sample from buffer
        batch = buffer.sample(BATCH_SIZE)
        
        # Unpack the batch
        states_v, actions_v, tgt_q_v = unpack_batch(batch, tgt_net.target_model, GAMMA)
        
        # Reset the gradients of optimizer to zero
        optimizer.zero_grad()
        
        # Feedforward
        q_v = net(states_v)
        
        # Pick the Q-value of the action that was actually taken in each state
        q_v = q_v.gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
        
        # Calculate the MSE loss
        loss_v = F.mse_loss(q_v, tgt_q_v)
        
        # Backpropagation
        loss_v.backward()
        
        # Optimize
        optimizer.step()
        
        # Decay the value of epsilon
        selector.epsilon *= EPS_DECAY

        # Every TGT_NET_SYNC steps
        if step % TGT_NET_SYNC == 0:
            
            # Copy the source network's weights into the target network
            tgt_net.sync()

16: episode 1 done, reward=15.000, epsilon=1.00
67: episode 2 done, reward=51.000, epsilon=0.70
81: episode 3 done, reward=14.000, epsilon=0.61
95: episode 4 done, reward=14.000, epsilon=0.53
108: episode 5 done, reward=13.000, epsilon=0.47
120: episode 6 done, reward=12.000, epsilon=0.41
131: episode 7 done, reward=11.000, epsilon=0.37
143: episode 8 done, reward=12.000, epsilon=0.33
152: episode 9 done, reward=9.000, epsilon=0.30
162: episode 10 done, reward=10.000, epsilon=0.27
171: episode 11 done, reward=9.000, epsilon=0.25
181: episode 12 done, reward=10.000, epsilon=0.22
193: episode 13 done, reward=12.000, epsilon=0.20
202: episode 14 done, reward=9.000, epsilon=0.18
212: episode 15 done, reward=10.000, epsilon=0.16
226: episode 16 done, reward=14.000, epsilon=0.14
238: episode 17 done, reward=12.000, epsilon=0.13
253: episode 18 done, reward=15.000, epsilon=0.11
264: episode 19 done, reward=11.000, epsilon=0.10
282: episode 20 done, reward=18.000, epsilon=0.08
298: episode 21 
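
The epsilon values in the log can be reproduced with a little arithmetic. Training (and therefore the decay) only starts once the buffer holds 2*BATCH_SIZE = 32 samples, and each episode report is printed before that step's decay is applied. The helper below is a hypothetical illustration and not part of the script; it assumes exactly one sample is added per loop iteration, as `buffer.populate(1)` does.

```python
EPS_DECAY = 0.99
BATCH_SIZE = 16


def epsilon_at_report(step, eps_start=1.0):
    """Epsilon shown in a report printed at the given step.

    One decay is applied per completed training step, and training begins
    at step 2 * BATCH_SIZE, when the buffer is large enough to sample from.
    """
    first_training_step = 2 * BATCH_SIZE
    decays = max(0, step - first_training_step)
    return eps_start * EPS_DECAY ** decays


# Steps taken from the log above
for step in (16, 67, 81, 282):
    print("%d: epsilon=%.2f" % (step, epsilon_at_report(step)))
```

The printed values agree with the corresponding log lines, which is a quick sanity check that epsilon decays once per training step rather than once per episode.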

<br>

# 15. Other RL libraries

---

As we discussed earlier, there are several RL-specific libraries available. Overall, TensorFlow-based libraries are more common than PyTorch-based ones, as TensorFlow is more widespread in the deep learning community. The following is my (very biased) list of libraries:
- __Keras-RL:__ started by Matthias Plappert in 2016, this includes basic deep RL methods. As suggested by the name, this library was implemented using Keras, which is a higher-level wrapper around TensorFlow (https://github.com/keras-rl/keras-rl).
- __Dopamine:__ a library from Google published in 2018. It is TensorFlow-specific, which is not surprising for a library from Google (https://github.com/google/dopamine).
- __Ray:__ a library for distributed execution of machine learning code. It includes RL utilities as part of the library (https://github.com/ray-project/ray).
- __TF-Agents:__ another library from Google published in 2018 (https://github.com/tensorflow/agents).
- __ReAgent:__ a library from Facebook Research. It uses PyTorch internally and uses a declarative style of configuration (you create a JSON file to describe your problem), which limits extensibility. But, of course, as it is open source, you can always extend the functionality (https://github.com/facebookresearch/ReAgent).
- __Catalyst.RL:__ a project started by Sergey Kolesnikov (one of this book's technical reviewers). It uses PyTorch as the backend (https://github.com/catalyst-team/catalyst).
- __SLM Lab:__ another PyTorch RL library (https://github.com/kengz/SLM-Lab).

<br>

# 16. Summary

---

In this chapter, we talked about higher-level RL libraries, their motivation, and their requirements. Then we took a deep look into the PTAN library, which will be used in the rest of the book to simplify example code.
In the next chapter, we will return to DQN methods by exploring extensions that researchers and practitioners have discovered since the classic DQN introduction to improve the stability and performance of the method.

# Good Job!