# Higher-Level RL Libraries

## PTAN

### Action Selectors

An action selector is an object that converts network output into concrete action values. The two most common cases are:

-> **Argmax**: used by Q-value methods, where the network predicts Q-values for a set of actions and the desired action is the one with the largest *Q(s,a)*.

-> **Policy-based**: the network outputs a probability distribution, and an action is sampled from this distribution.

An action selector is used by the Agent and rarely needs to be customized, although you have this option. The concrete classes provided by the library are:

-> **ArgmaxActionSelector**: applies argmax on the second axis of the passed tensor. (It assumes a matrix with the batch dimension along the first axis.)

-> **ProbabilityActionSelector**: samples from the probability distribution of a discrete set of actions.

-> **EpsilonGreedyActionSelector**: has the parameter epsilon, which specifies the probability of taking a random action.
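
The three selection rules can be sketched in plain Python. This only illustrates the logic and is not PTAN's implementation; the function names and the explicit `rng` argument are assumptions made for the example.

```python
import random

def argmax_select(scores):
    """Return, for each row, the index of the largest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

def probability_select(probs, rng):
    """Sample one action index per row from a normalized distribution."""
    return [rng.choices(range(len(row)), weights=row, k=1)[0] for row in probs]

def epsilon_greedy_select(scores, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the argmax."""
    greedy = argmax_select(scores)
    return [
        rng.randrange(len(row)) if rng.random() < epsilon else best
        for row, best in zip(scores, greedy)
    ]

rng = random.Random(0)
q_vals = [[1, 2, 3], [1, -1, 0]]
greedy_actions = argmax_select(q_vals)                    # [2, 0]
same_as_greedy = epsilon_greedy_select(q_vals, 0.0, rng)  # epsilon=0: always greedy
# Degenerate distributions leave only one possible outcome per row.
sampled = probability_select([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]], rng)
```

Passing the random generator explicitly keeps the sketch reproducible; the library classes manage randomness themselves.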




In [4]:
!pip3 install ptan

Collecting ptan
  Downloading https://files.pythonhosted.org/packages/d9/0b/c93ddb49b9f291062d1d3f63efd3d7e6614749214d15c8d8af2211d1b220/ptan-0.7.tar.gz
Building wheels for collected packages: ptan
  Building wheel for ptan (setup.py) ... done
  Created wheel for ptan: filename=ptan-0.7-cp36-none-any.whl size=23502 sha256=6ca73266235078f169df26ce4b6122f58643170b2e57767e5a5e74029fa6802c
  Stored in directory: /root/.cache/pip/wheels/2c/58/0c/a42dad12a5cc0e130453042707b3e2205adfb901ae35cfad75
Successfully built ptan
Installing collected packages: ptan
Successfully installed ptan-0.7


In [5]:
import numpy as np 
import ptan

In [5]:
q_vals = np.array([[1,2,3], [1,-1,0]])
q_vals

array([[ 1,  2,  3],
       [ 1, -1,  0]])

In [7]:
selector = ptan.actions.ArgmaxActionSelector()
selector(q_vals)
# Returns the index of the largest value in each row

array([2, 0])

In [10]:
selector_epsilon = ptan.actions.EpsilonGreedyActionSelector(epsilon=1)
selector_epsilon(q_vals)
# With epsilon=0 no random actions are taken, so we get the argmax of the Q-values.
# With epsilon=1 every action is chosen at random, so the result varies between runs and can differ from the argmax.

array([1, 2])

In [12]:
# ProbabilityActionSelector expects each row to be a normalized probability distribution.
# Each row holds the action probabilities for one batch item, and each printed
# line holds one sampled action index per row, e.g. [1 2 0].
# Row 1, [0.1, 0.8, 0.1]: the action at index 1 is the most likely.
# Row 2, [0.0, 0.0, 1.0]: the action at index 2 is always chosen.
# Row 3, [0.5, 0.5, 0.0]: actions 0 and 1 are equally likely, so either may be sampled.
selector_prob = ptan.actions.ProbabilityActionSelector()
for _ in range(10):
  acts = selector_prob(np.array([
                                 [0.1,0.8, 0.1],
                                 [0.0, 0.0,1.0],
                                 [0.5, 0.5, 0.0]
  ]))
  print(acts)

[1 2 0]
[1 2 1]
[1 2 0]
[1 2 1]
[1 2 0]
[1 2 0]
[0 2 0]
[1 2 0]
[1 2 0]
[1 2 0]


## DQNAgent 

This class is applicable in Q-learning when the action space is not very large, which covers Atari games and many classical problems.

A DQNAgent takes a batch of observations as input, applies the network to them to get Q-values, and then uses the provided ActionSelector to convert the Q-values into action indices.
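
The flow from observations to actions can be sketched without any libraries. `SketchDQNAgent`, `toy_q_network`, and `argmax_selector` are hypothetical names used only for illustration; the real class additionally handles tensors, devices, and preprocessing.

```python
def toy_q_network(observations):
    """Hypothetical stand-in for a network: one row of Q-values per observation."""
    return [[float(sum(obs)), 0.5, -1.0] for obs in observations]

def argmax_selector(q_values):
    """Pick the index of the largest Q-value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in q_values]

class SketchDQNAgent:
    """Maps a batch of observations to a batch of action indices."""

    def __init__(self, model, action_selector):
        self.model = model
        self.action_selector = action_selector

    def __call__(self, observations, agent_states=None):
        if agent_states is None:
            agent_states = [None] * len(observations)  # stateless agent
        q_values = self.model(observations)            # one row of Q-values per observation
        actions = self.action_selector(q_values)       # one action index per row
        return actions, agent_states

sketch_agent = SketchDQNAgent(toy_q_network, argmax_selector)
actions, states = sketch_agent([[0.0, 0.0], [1.0, 2.0]])
```

Because the selector is injected, swapping argmax for an epsilon-greedy rule changes exploration behaviour without touching the agent.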





In [2]:
import torch.nn as nn
import torch

class DQNNet(nn.Module):
       def __init__(self, actions):
           super(DQNNet, self).__init__()
           self.actions = actions

       def forward(self, x):
           return torch.eye(x.size()[0], self.actions)

net = DQNNet(actions=3)
net(torch.zeros(2,10))

tensor([[1., 0., 0.],
        [0., 1., 0.]])

In [19]:
# The input is a batch of two observations, each with 5 values.
# The agent returns a tuple of two objects.
# The first is the array of actions to take, one per observation in the batch.
# The second is the list of the agent's internal states; the agent is stateless, so each entry is None.
agent = ptan.agent.DQNAgent(dqn_model=net, action_selector=selector)
agent(torch.zeros(2,5))

(array([0, 1]), [None, None])

In [22]:
# The same network with the epsilon-greedy selector and epsilon set to 1:
# every action is random, regardless of the network's output.
agent2 = ptan.agent.DQNAgent(dqn_model=net, action_selector=selector_epsilon)
agent2(torch.zeros(10,5))[0]

array([0, 0, 1, 2, 2, 0, 1, 1, 1, 2])

In [25]:
# The epsilon value can be changed on the fly during training 
selector_epsilon.epsilon = 0.5
agent2(torch.zeros(10,5))[0]

array([0, 2, 2, 0, 0, 0, 1, 2, 0, 2])

In [30]:
class PolicyNet(nn.Module):
  def __init__(self, actions):
    super(PolicyNet, self).__init__()
    self.actions = actions
  def forward(self,x):
    # Now we produce the tensor with first two actions having the same logit scores
    shape = (x.size()[0], self.actions)
    res = torch.zeros(shape, dtype=torch.float32)
    res[:, 0] = 1
    res[:, 1] = 1
    return res

In [31]:
net = PolicyNet(actions=5)
net(torch.zeros(6,10))

tensor([[1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 0., 0., 0.]])

In [32]:
# ProbabilityActionSelector expects the probabilities to be normalized,
# so we ask the agent to apply softmax to the network's outputs.
# Actions are sampled, so the result varies between runs; actions 0 and 1 are the most likely.
agent_prob = ptan.agent.PolicyAgent(model=net, action_selector=selector_prob, apply_softmax=True)
agent_prob(torch.zeros(6,5))[0]

### Experience source 

The experience source classes take the agent instance and environment and provide you with step-by-step data from the trajectories. 
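
A minimal sketch of the idea, assuming a single environment and a single episode: step the environment with the agent's actions and yield sliding windows of consecutive steps. `CountingEnv`, `Step`, and `experience_windows` are hypothetical names, and the handling of short tails at the end of an episode is deliberately omitted.

```python
from collections import deque, namedtuple

Step = namedtuple("Step", ["state", "action", "reward", "done"])

class CountingEnv:
    """Hypothetical environment: the state counts steps, the episode ends after 4."""

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, float(action), self.t == 4, {}

def experience_windows(env, policy, steps_count):
    """Yield tuples of `steps_count` consecutive steps from a single episode."""
    window = deque(maxlen=steps_count)  # keeps only the most recent steps
    state, done = env.reset(), False
    while not done:
        action = policy(state)
        next_state, reward, done, _ = env.step(action)
        window.append(Step(state, action, reward, done))
        if len(window) == steps_count:
            yield tuple(window)
        state = next_state

# A constant policy that always picks action 1.
windows = list(experience_windows(CountingEnv(), lambda s: 1, steps_count=2))
```

Each yielded tuple overlaps the previous one by `steps_count - 1` steps, which is why consecutive items share a state.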


In [15]:
from typing import List, Optional, Any, Tuple
import ptan
import gym

class ToyEnv(gym.Env):
  def __init__(self):
    super(ToyEnv, self).__init__()
    self.observation_space = gym.spaces.Discrete(n=5)
    self.action_space = gym.spaces.Discrete(n=3)
    self.step_index = 0
  
  def reset(self):
    self.step_index = 0
    return self.step_index
  
  def step(self, action):
    is_done = self.step_index == 10
    if is_done:
      return self.step_index % self.observation_space.n, 0.0, is_done, {} 
    self.step_index += 1
    return self.step_index % self.observation_space.n, float(action), self.step_index == 10, {}

class DullAgent(ptan.agent.BaseAgent):
       """
       Agent always returns the fixed action
       """
       def __init__(self, action: int):
           self.action = action
       def __call__(self, observations: List[Any], state: Optional[List] = None) -> Tuple[List[int], Optional[List]]:
           return [self.action for _ in observations], state


In [None]:
# The ExperienceSource class
# Outputs:
# (Experience(state=0, action=1, reward=1.0, done=False),
#  Experience(state=1, action=1, reward=1.0, done=False))
#  (Experience(state=1, action=1, reward=1.0, done=False),
#  Experience(state=2, action=1, reward=1.0, done=False))
#  (Experience(state=2, action=1, reward=1.0, done=False),
#  Experience(state=3, action=1, reward=1.0, done=False))
# On every iteration, ExperienceSource returns a piece of the agent's trajectory in the environment.

import gym

env = ToyEnv()
agent = DullAgent(action=1)

exp_source = ptan.experience.ExperienceSource(env=env,agent=agent, steps_count=2)

for idx, exp in enumerate(exp_source):
  if idx > 2:
    break
  print(exp)

In [None]:
# The class ExperienceSource provides us with full subtrajectories of the given length as a list of (s, a, r) objects.
# The next state, s', is returned in the next tuple, which is not always convenient. For example, in DQN training, we want to have tuples (s, a, r, s') at once to do the one-step Bellman approximation during training.
# In addition, some extensions of DQN, like n-step DQN, might want to collapse longer sequences of observations into (first-state, action, total-reward-for-n-steps, state-after-step-n).
# To support this in a generic way, a simple subclass of ExperienceSource is implemented: ExperienceSourceFirstLast.
# It accepts almost the same arguments in the constructor, but returns different data.

# Outputs:
# ExperienceFirstLast(state=0, action=1, reward=1.0, last_state=1)
# ExperienceFirstLast(state=1, action=1, reward=1.0, last_state=2)
# ExperienceFirstLast(state=2, action=1, reward=1.0, last_state=3)
# ExperienceFirstLast(state=3, action=1, reward=1.0, last_state=4)

exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent,gamma=1.0, steps_count=1)

for idx, exp in enumerate(exp_source):
  if idx > 2:
    break
  print(exp)
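
The collapse of a sub-trajectory into a single first/last transition can be sketched as follows. This is an illustration rather than the library's code: the name `collapse`, the `FirstLast` tuple, and the convention that `state_after` is `None` when the episode has ended are assumptions of the example.

```python
from collections import namedtuple

FirstLast = namedtuple("FirstLast", ["state", "action", "reward", "last_state"])

def collapse(steps, state_after, gamma):
    """Collapse a sub-trajectory into one transition.

    steps: list of (state, action, reward) triples in time order.
    state_after: state observed after the last step, or None if the episode ended.
    """
    total_reward = 0.0
    # Accumulate from the last step backwards: r_0 + gamma * r_1 + gamma^2 * r_2 + ...
    for _, _, reward in reversed(steps):
        total_reward = reward + gamma * total_reward
    first_state, first_action, _ = steps[0]
    return FirstLast(first_state, first_action, total_reward, state_after)

one_step = collapse([(0, 1, 1.0)], state_after=1, gamma=1.0)
two_step = collapse([(0, 1, 1.0), (1, 1, 1.0)], state_after=2, gamma=0.5)
```

With one step and `gamma=1.0` the result matches the first output line listed above; with two steps the second reward is discounted once.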

## Experience replay buffers

In DQN, we rarely deal with immediate experience samples, as they are heavily correlated, which leads to instability in training. Normally, we have a large replay buffer, which is populated with pieces of experience. The buffer is then sampled to get the training batch. The replay buffer normally has a maximum capacity, so old samples are pushed out when the buffer reaches the limit.

1. ExperienceReplayBuffer: a simple replay buffer of predefined size with uniform sampling.

2. PrioReplayBufferNaive: a simple, but not very efficient, prioritized replay buffer implementation. The complexity of sampling is O(n), which might become an issue with large buffers. Its advantage over the optimized class is that the code is much easier to follow.

3. PrioritizedReplayBuffer: uses segment trees for sampling, which makes the code cryptic but gives O(log(n)) sampling complexity.
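
The two simpler strategies can be sketched in a few lines of plain Python. The names `UniformReplayBuffer` and `naive_prioritized_sample` are hypothetical, and the sketch leaves out everything the real classes do beyond storage and sampling (for example, pulling new samples from an experience source).

```python
import random
from collections import deque

class UniformReplayBuffer:
    """Fixed-capacity buffer with uniform sampling."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)  # oldest entries fall out automatically

    def add(self, item):
        self.items.append(item)

    def sample(self, batch_size, rng):
        # Uniform sampling without replacement.
        return rng.sample(list(self.items), batch_size)

    def __len__(self):
        return len(self.items)

def naive_prioritized_sample(items, priorities, batch_size, rng):
    """O(n) per call: sample in proportion to priority (with replacement)."""
    indices = rng.choices(range(len(items)), weights=priorities, k=batch_size)
    return [items[i] for i in indices]

rng = random.Random(0)
buf = UniformReplayBuffer(capacity=3)
for transition in range(5):
    buf.add(transition)          # 0 and 1 are pushed out once capacity is reached
batch = buf.sample(2, rng)
only_last = naive_prioritized_sample(["a", "b", "c"], [0.0, 0.0, 1.0], 4, rng)
```

The linear cost of the naive version comes from scanning all priorities on every draw, which is what a segment tree avoids.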




## The TargetNet Class

TargetNet is a small but useful class that allows us to synchronize two NNs of the same architecture.

TargetNet supports two modes of such synchronization:

1. sync(): weights from the source network are copied into the target network.
   
2. alpha_sync(): the source network's weights are blended into the target
network with some alpha weight (between 0 and 1).
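
The difference between the two modes can be illustrated with weights stored as plain floats. `SketchTargetNet` is a hypothetical stand-in, and the blending convention used here (alpha is the share of the old target value that is kept) is an assumption of the sketch that should be checked against the library before relying on it.

```python
import copy

class SketchTargetNet:
    """Keeps a lagging copy of a dict of weights (name -> float)."""

    def __init__(self, weights):
        self.model = weights                        # live source weights
        self.target_model = copy.deepcopy(weights)  # independent copy

    def sync(self):
        # Hard update: the target becomes an exact copy of the source.
        self.target_model = copy.deepcopy(self.model)

    def alpha_sync(self, alpha):
        # Soft update. Assumed convention: alpha is the share of the old target kept.
        for name, value in self.model.items():
            old = self.target_model[name]
            self.target_model[name] = alpha * old + (1.0 - alpha) * value

source = {"w": 0.0}
tgt = SketchTargetNet(source)
source["w"] = 1.0               # the source network has been trained further
tgt.alpha_sync(0.75)            # target moves a quarter of the way toward the source
blended = tgt.target_model["w"]
tgt.sync()
copied = tgt.target_model["w"]
```

Hard synchronization is typically done every fixed number of steps, while soft blending is applied on every step with alpha close to 1.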

## Ignite Helpers

PTAN provides several small helpers to simplify integration with Ignite, which reside in the ptan.ignite package:

1. EndOfEpisodeHandler: attached to the ignite.Engine, it emits an EPISODE_COMPLETED event, and tracks the reward and number of steps in the event in the engine's metrics. It also can emit an event when the average reward for the last episodes reaches the predefined boundary, which is supposed to be used to stop the training on some goal reward.

2. EpisodeFPSHandler: tracks the number of interactions between the agent and environment that are performed and calculates performance metrics as frames per second. It also keeps the number of seconds passed since the start of the training.
 
3. PeriodicEvents: emits corresponding events every 10, 100, or 1,000 training iterations. It is useful for reducing the amount of data being written into TensorBoard.
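
The periodic-event idea reduces to a modulo check on the iteration counter. The sketch below is plain Python with a hypothetical function name and does not use the Ignite or PTAN APIs.

```python
def periods_reached(iteration, periods=(10, 100, 1000)):
    """Return the periods whose event would fire at this iteration."""
    return [p for p in periods if iteration % p == 0]

# Which periodic events fire at a few sample iterations.
fired = {i: periods_reached(i) for i in (7, 10, 200, 3000)}
```

Attaching expensive logging only to the 100- or 1,000-iteration events is what keeps the volume of logged data manageable.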

In [None]:
# The PTAN CartPole Solver

# Net, obs_size, n_actions, env, optimizer, unpack_batch, F (torch.nn.functional) and the
# upper-case constants are assumed to be defined elsewhere; they are not shown in this section.

# We create a simple feed-forward NN, its target network, an epsilon-greedy action selector, and a DQNAgent.
# Then the experience source and replay buffer are created.
net = Net(obs_size, HIDDEN_SIZE, n_actions)
tgt_net = ptan.agent.TargetNet(net)
selector = ptan.actions.ArgmaxActionSelector()
selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=1, selector=selector)
agent = ptan.agent.DQNAgent(net, selector)
exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=GAMMA)
buffer = ptan.experience.ExperienceReplayBuffer(exp_source, buffer_size=REPLAY_SIZE)

# Counters and the stop flag must exist before the loop uses them
step = 0
episode = 0
solved = False

while True:
  step += 1
  buffer.populate(1)
  # pop_rewards_steps() returns the list of tuples with information about episodes completed since the last call to the method
  for reward, steps in exp_source.pop_rewards_steps():
    episode += 1
    print("%d: episode %d done, reward=%.3f, epsilon=%.2f" % (
        step, episode, reward, selector.epsilon))
    solved = reward > 150
  if solved:
    print("Congrats!")
    break
  # Wait until the buffer holds enough samples before training starts
  if len(buffer) < 2*BATCH_SIZE:
    continue
  batch = buffer.sample(BATCH_SIZE)
  states_v, actions_v, tgt_q_v = unpack_batch(
      batch, tgt_net.target_model, GAMMA)
  optimizer.zero_grad()
  q_v = net(states_v)
  # Keep only the Q-value of the action that was actually taken
  q_v = q_v.gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
  loss_v = F.mse_loss(q_v, tgt_q_v)
  loss_v.backward()
  optimizer.step()
  selector.epsilon *= EPS_DECAY
  if step % TGT_NET_SYNC == 0:
    tgt_net.sync()
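
The loop above relies on `unpack_batch`, which is not shown in this section. A plausible core of such a helper is the one-step Bellman target, sketched here in plain Python. The function name, the tuple layout, and the convention that `last_state` is `None` for a finished episode are assumptions of the sketch, not the notebook's actual helper.

```python
def bellman_targets(batch, q_function, gamma):
    """Compute r + gamma * max_a Q(s', a) for each transition.

    batch: list of (state, action, reward, last_state) tuples;
           last_state is None when the episode ended on that step.
    q_function: callable returning a list of Q-values for a state
                (the target network plays this role in training).
    """
    targets = []
    for _, _, reward, last_state in batch:
        if last_state is None:
            targets.append(reward)  # no future value after a terminal step
        else:
            targets.append(reward + gamma * max(q_function(last_state)))
    return targets

def toy_q(state):
    """Hypothetical Q-function used only for the example."""
    return [0.0, float(state), 1.0]

targets = bellman_targets([(0, 1, 1.0, 2), (3, 0, 1.0, None)], toy_q, gamma=0.5)
```

Evaluating the next state with the target network rather than the trained network is what keeps these targets stable between synchronizations.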

      
