# Link to this colab

https://colab.research.google.com/drive/1wCK9AeiYLQebBUHmxH6EbxoMmclsuklX#scrollTo=xazcwu5soxjb



# Report

For this project, we used Comet[[1]](https://www.comet.ml/george-gca/rl-project-02/view/djbemH9jy8r2r3XeuWygmQrmK) to record our experiments, so we also carried out our analysis there. This colab contains the full code, **but please check the full report at the link below**:
- https://www.comet.ml/george-gca/rl-project-02/reports/project-02-report

# Video


**Please check the video at the link below**:
- https://youtu.be/PEhwNhPwdaA

# Group 9 Members

* Alceu Emanuel Bissoto - 191077
* Levy Gurgel Chaves - 264958 
* George Corrêa de Araújo - 191075
* Stephane de Freitas Schwarz - 211518
* Jing Yang - 262891

Our team split the solutions among the team members as follows:

- REINFORCE - Stephane
- A2C - George
- DQN - Alceu and Levy
- Comparison - Jing

The remaining parts of the project, including the analysis, were created and revised by every team member, with equal contribution.

# Problem description

---
In this assignment, we implemented three reinforcement learning control methods to solve the CartPole environment from OpenAI Gym. The inverted pendulum problem consists of finding an action policy that balances a pole attached by an un-actuated joint to a cart while the cart moves along a frictionless horizontal track. The pendulum starts in the upright position, and the goal is to keep it upright by applying a lateral force to the cart in either direction (in other words, increasing or reducing the cart's velocity) while keeping the cart within the track limits. The cart speed depends on the angle at which the pole is pointing, since the pendulum's center of gravity increases the energy needed to move the cart.

At any time, a CartPole observation has four attributes, each randomly initialized between $\pm$ 0.05: the linear cart position $x$, cart velocity $\dot{x}$, pole angle $\theta$, and pole angular velocity $\dot{\theta}$. The table below shows the limits of each observation attribute.

| Attribute | Min. | Max. |
| :-- | :-- | :-- |
| Cart Position | -4.8 | 4.8 |
| Cart Velocity | -Inf | Inf |
| Pole Angle | -24 deg | 24 deg |
| Pole Angular Velocity | -Inf | Inf |

\
The model simulation derives from Newtonian physics or Lagrangian methods. The following equation describes the system dynamics.

\
\begin{equation}
\ddot{x} = \frac{(I+ml^{2})(F+ml\dot{\theta}^{2}\sin\theta) - gm^{2}l^{2}\sin\theta \cos\theta}{I(M+m)+ml^{2}(M+m\sin^{2}\theta)}
\end{equation}

\
\begin{equation}
\ddot{\theta} = -\frac{ml[F \cos\theta + ml\dot{\theta}^{2} \sin\theta \cos\theta - (M+m) g \sin\theta]}{I(M+m)+ml^{2}(M+m\sin^{2}\theta)}
\end{equation}
\
Where $I$ is the moment of inertia, $l$ is the length of the pendulum, $m$ is the mass of the pendulum, $M$ is the cart mass, $g$ is the acceleration due to gravity, and $F$ is the force applied on the cart [[2]](http://ethesis.nitrkl.ac.in/6302/1/E-64.pdf). Knowing the model dynamics, we can conclude that the CartPole environment is deterministic: applying a force $F$ in a state $s$ fully determines the next state $s'$. The force $F$ applied horizontally to the cart represents the action space and takes one of two possible values, $-f$ (left) or $+f$ (right).
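As a rough illustration, the two equations above can be evaluated directly. The sketch below is not Gym's own implementation; the function name `cartpole_accelerations` is hypothetical, and the default values for $M$, $m$, $l$, $g$ and $I$ are assumptions chosen only to make the example runnable.

```python
import math


def cartpole_accelerations(theta, theta_dot, force,
                           M=1.0, m=0.1, l=0.5, g=9.8, I=0.0):
    """Evaluate the two dynamics equations; returns (x_ddot, theta_ddot).

    The default parameter values are illustrative assumptions, not values
    taken from the report or from Gym.
    """
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    # Both equations share the same denominator
    denom = I * (M + m) + m * l**2 * (M + m * sin_t**2)
    x_ddot = ((I + m * l**2) * (force + m * l * theta_dot**2 * sin_t)
              - g * m**2 * l**2 * sin_t * cos_t) / denom
    theta_ddot = -m * l * (force * cos_t
                           + m * l * theta_dot**2 * sin_t * cos_t
                           - (M + m) * g * sin_t) / denom
    return x_ddot, theta_ddot
```

With the pole exactly upright and at rest, and with $I = 0$, the first equation reduces to $\ddot{x} = F/M$, which is a quick sanity check on the formula.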

The CartPole task is episodic. An episode ends when the cart position is more than 2.4 units from the center (the center of the cart reaches the edge of the display), when the pole angle exceeds $\pm$ 12 degrees, or when the episode reaches 500 steps. By default, every timestep taken, including the terminal one, yields a reward of 1. The problem is considered solved when the average reward is greater than or equal to 475 over 100 consecutive trials [[3]](https://www.google.com/url?q=https://github.com/openai/gym/blob/8e5a7ca3e6b4c88100a9550910dfb1a6ed8c5277/gym/envs/__init__.py%23L50&sa=D&ust=1608765798928000&usg=AFQjCNEjJcLV6B2yYti1uY8csySiGpCuGw).
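The termination and "solved" conditions can be sketched as two small checks. The helper names `episode_done` and `is_solved` are hypothetical (they are not Gym functions), and the thresholds are the ones quoted in the text; the pole angle is assumed to be given in radians, as in the observation vector.

```python
import math


def episode_done(cart_position, pole_angle_rad, step,
                 x_limit=2.4, angle_limit_deg=12.0, max_steps=500):
    """True when any of the three termination conditions holds."""
    return (abs(cart_position) > x_limit
            or abs(pole_angle_rad) > math.radians(angle_limit_deg)
            or step >= max_steps)


def is_solved(episode_returns, threshold=475.0, window=100):
    """True when the mean of the last `window` returns reaches `threshold`."""
    if len(episode_returns) < window:
        return False
    return sum(episode_returns[-window:]) / window >= threshold
```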


---
# Install and import everything

In [None]:
#remove " > /dev/null 2>&1" to see what is going on under the hood
!apt-get update > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
!pip install gym[atari] pyvirtualdisplay > /dev/null 2>&1
!pip install matplotlib memory_profiler pandas seaborn > /dev/null 2>&1
!pip install comet-ml > /dev/null 2>&1

In [None]:
from comet_ml import Experiment

import base64
import glob
import io
import locale
import logging
import math
import os
import sys
import time
from collections import namedtuple
from datetime import datetime
from pathlib import Path
from typing import List

import gym
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from gym import wrappers
from gym.wrappers import Monitor
from IPython import display as ipythondisplay
from IPython.display import HTML
from pyvirtualdisplay import Display
from sklearn.kernel_approximation import RBFSampler
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

# from tqdm.notebook import trange
# from tqdm import trange

locale.setlocale(locale.LC_ALL, '')
sns.set_style('darkgrid')
# sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 1.5})
sns.set_context("notebook", rc={"lines.linewidth": 1.5})

%load_ext memory_profiler


import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.nn.utils as utils
import torchvision.transforms as T
from torch.autograd import Variable


os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

# Create auxiliary functions

In [None]:
def print_environment_info(env):
  # https://stackoverflow.com/a/52774794/14459430
  name = env.unwrapped.spec.id
  spec = gym.spec(name)

  print(f'Action Space: {env.action_space}')
  print(f'Observation Space: {env.observation_space}')
  print(f'Max Episode Steps: {spec.max_episode_steps}')
  print(f'Nondeterministic: {spec.nondeterministic}')
  print(f'Reward Range: {env.reward_range}')
  print(f'Reward Threshold: {spec.reward_threshold}\n')


"""
Utility functions to enable video recording of gym environment 
and displaying it.
To enable video, just do "env = wrap_env(env)".
"""

def show_video(episode: int=-1) -> None:
  mp4list = sorted(glob.glob('video/*.mp4'))
  if len(mp4list) > 0:
    mp4 = mp4list[episode]
    print(f'Displaying {mp4}')
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")

def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env

<a name="reward-functions"></a>
## **Reward Functions**
---

In this project, we adopted three reward functions to shape the reward the model receives, aiming to boost its performance. The simplest one, the finish penalty, punishes the learning algorithm when the episode is flagged as done before reaching the maximum number of steps. Otherwise, the system keeps receiving a positive reward, all the way up to the 500th iteration.

Similarly, the finish bonus approach takes the step count into account. It checks whether the current step is higher than a checkpoint threshold. If so, it returns a positive reward equal to the floor division of the step by the threshold, plus 1. If the step is at or below the checkpoint, the reward is 1.

In contrast with those methods, the last one punishes the algorithm if the pole angle is more than 3 degrees from the center. The general idea of the CartPole task is to keep the pole upright, and the pole angle is one of the observation attributes that can end the episode. Given that, we decided to build a safe boundary, penalizing any angle outside it.
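As a small worked example of the safe boundary: the pole angle in the observation is assumed to be expressed in radians, so the 3 degree limit has to be converted before comparing. The helper name `outside_safe_band` below is hypothetical and only illustrates the check.

```python
import math


def outside_safe_band(pole_angle_rad, limit_deg=3.0):
    """True when the pole angle (in radians) is beyond the safe boundary."""
    return abs(pole_angle_rad) > math.radians(limit_deg)


# 3 degrees expressed in radians, roughly 0.0524
safe_limit_rad = math.radians(3.0)
```

Here `math.radians(3.0)` is the same quantity as `3 * 2 * math.pi / 360`.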

In [None]:
# rewards functions
def finish_bonus(steps, checkpoint_steps=25):
  if steps > checkpoint_steps:
    # get extra reward every checkpoint_steps
    return steps // checkpoint_steps + 1
  else:
    # get penalty for not getting to first checkpoint_steps
    # return steps - checkpoint_steps
    return 1

def finish_penalty(done_signal, steps, penalty = 500):
  if done_signal and steps < 500:
    return -penalty
  else:
    return 1

def pole_angle_based_penalty(observation, step):        
  pole_angle = observation[2]
  angle = 3 * 2 * math.pi / 360
  # If the pole is more than 3 degrees from the center, apply the penalty.
  if (pole_angle < -angle) or (pole_angle > angle):    
    return -200
  return 100

---
# On-policy Method

### REINFORCE algorithm

REINFORCE (Monte-Carlo policy gradient) is a policy gradient method that relies on returns estimated by Monte-Carlo methods from episode samples to update the policy parameter $\theta$ [[4]](https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html).
The policy gradient update rule is as follows:
\begin{aligned}
\nabla_{\theta} J(\theta) &=\mathbb{E}_{\pi}\left[Q^{\pi}(s, a) \nabla_{\theta} \ln \pi_{\theta}(a \mid s)\right] \\
&=\mathbb{E}_{\pi}\left[G_{t} \nabla_{\theta} \ln \pi_{\theta}\left(A_{t} \mid S_{t}\right)\right]
\end{aligned}

It relies on a full trajectory, which is why it is a Monte-Carlo method.
The general process of the REINFORCE algorithm is:

1.   Initialize the policy parameter $\theta$ at random;
2.   Generate one trajectory on policy $\pi_\theta$: $S_1,A_1,R_2,S_2,A_2,…,S_T$;
3. For $t=1, 2, … , T$:

    - Estimate the return $G_t$;
    - Update policy parameters: $\theta \leftarrow \theta+\alpha \gamma^{t} G_{t} \nabla_{\theta} \ln \pi_{\theta}\left(A_{t} \mid S_{t}\right)$
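The return $G_t$ used in step 3 can be computed for every timestep of a finished episode with a single backward pass over the rewards. The sketch below assumes the usual discounted return $G_t = R_{t+1} + \gamma G_{t+1}$; the function name `discounted_returns` is a hypothetical helper, not part of the notebook.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [G_0, G_1, ...] for one episode, computed back to front."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        # G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


# Three steps with reward 1 and gamma = 0.5
example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
# → [1.75, 1.5, 1.0]
```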



In [None]:
def reward_function(method=None, smooth=True):

    """
      Possible options:
        0 : no penalty function, always returns 1
        1 : step based penalty
        2 : pole angle based penalty
    """
    def none_penalty(observation, step, smooth_reward=True):
      
      return 1

    def step_penalty(observation, step, check_point=25, smooth_reward=True):

      if step > check_point:
        
        return step // check_point
      
      return step - check_point

    def pole_angle_based_penalty(observation, step, smooth_reward=True):
        
      pole_angle = observation[2]

      angle = 3 * 2 * math.pi / 360

      # If the pole is more than 3 degrees from the center, apply the penalty.
      if (pole_angle < -angle) or (pole_angle > angle):
        
        if smooth_reward:
          return -2
        return -200
    
      return 2
    
    if method is None:
      return None

    if method == 1:
      return step_penalty
    if method == 2:
      return pole_angle_based_penalty
    
    return none_penalty


In [None]:
class Policy(nn.Module):
    def __init__(self, hidden_size, num_inputs, action_space):
        super(Policy, self).__init__()
        self.action_space = action_space
        num_outputs = action_space.n

        self.linear1 = nn.Linear(num_inputs, hidden_size)
        self.linear2 = nn.Linear(hidden_size, num_outputs)

    def forward(self, inputs):
        x = inputs
        x = F.relu(self.linear1(x))
        action_scores = self.linear2(x)
        return F.softmax(action_scores, dim=-1)


class REINFORCE:
    def __init__(self, hidden_size, num_inputs, action_space, lr=1e-3):
        self.action_space = action_space
        self.model = Policy(hidden_size, num_inputs, action_space)
        self.model = self.model.cuda()
        self.optimizer = optim.Adam(self.model.parameters(), lr=lr)
        self.model.train()

    def select_action(self, state):
        probs = self.model(Variable(state).cuda())       
        action = probs.multinomial(1).data
        prob = probs[:, action[0,0]].view(1, -1)
        log_prob = prob.log()
        entropy = - (probs*probs.log()).sum()

        return action[0], log_prob, entropy

    def update_parameters(self, rewards, log_probs, entropies, gamma):
        R = torch.zeros(1, 1)
        loss = 0
        for i in reversed(range(len(rewards))):
            R = gamma * R + rewards[i]
            loss = loss - (log_probs[i]*(Variable(R).expand_as(log_probs[i])).cuda()).sum() - (0.0001*entropies[i].cuda()).sum()
        loss = loss / len(rewards)
		
        self.optimizer.zero_grad()
        loss.backward()
        utils.clip_grad_norm_(self.model.parameters(), 40)
        self.optimizer.step()
        return loss

In [None]:
def train_REINFORCE(env_name, exp_name, tag, reward_mode=None, num_episodes=1000, 
                    num_steps=500, gamma=0.99, seed=123,
                    hidden_size=128, ckpt_freq=50, lr=1e-3, idx=[0,1,2,3], smooth_reward=True, save_video=False):
  
  def saving_video():

    env = wrap_env(gym.make('CartPole-v1'))
    display = Display(visible=0, size=(1400, 900))
    display.start()

    obs = env.reset()
    for _ in range(1000):

        action, _, _ = agent.select_action(torch.Tensor([obs]))
        action = action.cpu()

        obs, reward, done, info = env.step(action.numpy()[0])
        
        env.render()

        if done:
            break

    env.close()

    mp4list = glob.glob('video/*.mp4')
    if len(mp4list) > 0:
        print('LIST MP4: ', mp4list)
        mp4 = mp4list[-1]
        experiment.log_asset(mp4, file_name='BEST_REINFORCE.mp4')

    experiment.end()

# ==============================================================================
# ==============================================================================

    
  def train_model():
    
    agent.model.train()

    state = torch.Tensor([env.reset()[idx]])

    entropies = []
    log_probs = []
    rewards = []

    for t in range(num_steps):

      action, log_prob, entropy = agent.select_action(state)
      action = action.cpu()

      next_state, reward, done, _ = env.step(action.numpy()[0])
        
      if reward_mode != None:
        
        if reward_mode != 0:      
            
            reward = reward_fun(next_state[idx], t, smooth_reward=smooth_reward)
        else:
            
            if done and t < 499:
                if smooth_reward:
                  reward = -2
                else:
                  reward = -200

      entropies.append(entropy)
      log_probs.append(log_prob)
      rewards.append(reward)
      state = torch.Tensor([next_state[idx]])
      train_REINFORCE.train_sum_rewards += reward

      if done:
        break

    ep_loss = agent.update_parameters(rewards, log_probs, entropies, gamma)
    train_REINFORCE.train_sum_steps += t
    
    experiment.log_metric('train/steps', t, epoch=e+1)
    experiment.log_metric('train/reward', sum(rewards), epoch=e+1)
    experiment.log_metric('train/avg_steps', train_REINFORCE.train_sum_steps/(e+1), epoch=e+1)
    experiment.log_metric('train/avg_reward', train_REINFORCE.train_sum_rewards/(e+1), epoch=e+1)
    experiment.log_metric('train/loss', ep_loss)
    experiment.log_metric('train/entropy', sum(entropies), epoch=e+1)
                           

    if e%ckpt_freq == 0:
      print("TRAIN: Episode: {}, reward: {}, steps: {}".format(e, sum(rewards), t))
  
# ==============================================================================
# ==============================================================================

  def test_model():
    
    agent.model.eval()    

    ep_entropies = []
    ep_rewards = []
    ep_step_size = []
    test_ep_size = 10
    
    test_sum_steps = 0
    test_sum_rewards = 0.
    
    for test_e in range(test_ep_size):

      entropies = []
      rewards = []

      state = torch.Tensor([env.reset()[idx]])

      for t in range(num_steps):

        action, log_prob, entropy = agent.select_action(state)
        action = action.cpu()

        next_state, reward, done, _ = env.step(action.numpy()[0])
          
        if reward_mode != None:
          
          if reward_mode != 0:      
              
            reward = reward_fun(next_state[idx], t, smooth_reward=smooth_reward)
          else:
              
            if done and t < 499:            
              reward = -2     
                

        entropies.append(entropy)
        rewards.append(reward)
        state = torch.Tensor([next_state[idx]])
        test_sum_rewards += reward

        if done:
          break

      ep_step_size.append(t)
      ep_entropies.append(sum(entropies))
      ep_rewards.append(sum(rewards))
      test_sum_steps += t

    experiment.log_metric('eval/steps', t, epoch=e+1)
    experiment.log_metric('eval/rewards', sum(ep_rewards)/test_ep_size, epoch=e+1)
    experiment.log_metric('eval/entropy', sum(ep_entropies)/test_ep_size, epoch=e+1)
    experiment.log_metric('eval/avg_steps', test_sum_steps/test_ep_size, epoch=e+1)
    experiment.log_metric('eval/avg_reward', test_sum_rewards/test_ep_size, epoch=e+1)
    
    print("EVAL: Episode: {}, reward: {}".format(e, sum(rewards)))
  
# ==============================================================================
# ==============================================================================
    
    
  experiment = Experiment(project_name='rl-project-02', 
                          api_key='ZiJ5PvGmmaSWzp3XvePfCOYxk',
                          auto_metric_logging=False,
                          workspace='george-gca')
  
  hyper_params = {
  "RF-env_name":env_name,
  "RF-exp_name":exp_name,
  "RF-tag":tag,
  "RF-reward_mode":reward_mode,
  "RF-num_episodes":num_episodes,
  "RF-num_steps":num_steps,
  "RF-gamma":gamma,
  "RF-seed":seed,
  "RF-hidden_size":hidden_size,
  "RF-ckpt_freq":ckpt_freq,
  "RF-lr":lr,
  "RF-idx":idx,
  "RF-smooth_reward":smooth_reward
  }
  
  experiment.log_parameters(hyper_params)
  experiment_name = exp_name
  experiment.set_name(experiment_name)
  experiment.add_tag(tag)

  env = gym.make(env_name)

  # Set seeds
  env.seed(seed)
  torch.manual_seed(seed)
  np.random.seed(seed)

  # Set reward mode
  reward_fun = reward_function(method=reward_mode)
  
  # Build a REINFORCE Agent
  agent = REINFORCE(hidden_size, len(idx), env.action_space, lr=lr)
    
  train_REINFORCE.train_sum_steps = 0
  train_REINFORCE.train_sum_rewards = 0.
  
  start_time = time.time()

  # Start training    
  for e in range(num_episodes):
    
    train_model()
    
    # Evaluate every ckpt_freq episodes
    if e % ckpt_freq == 0:
        
        test_model()
    
    
  experiment.log_metric('exp_time', time.time() - start_time)
  total_params = sum(p.numel() for p in agent.model.parameters())
  experiment.log_other('total params', total_params)
  experiment.log_other('trainable params', sum(p.numel() for p in agent.model.parameters() if p.requires_grad))
  env.close()
  if save_video:
    # saving_video() records one episode and also ends the experiment
    saving_video()
  else:
    experiment.end()

  return agent, experiment

In [None]:
gammas = [0.1, 0.5, 0.75, 0.9, 0.99, 0.999]
lrs = [1e-2, 1e-3, 1e-4, 1e-5]
reward_modes = [0, 1, 2, None]
idxs = [[0], [1], [2], [3], [0,1], [3,0], [3,1], [3,2], [0,1,2,3]]
smooth_reward = [True, False]

param_to_test = [gammas, lrs, reward_modes, idxs, smooth_reward]
parameters_list = ['gamma', 'lr', 'reward_mode', 'idx', 'smooth_reward']

parameters_search = {}
n_episode = 1000
for parameters, parameter_name in zip(param_to_test, parameters_list):

  for parameter in parameters:

    # Just change the value passed
    p = {parameter_name:parameter}
    _ = train_REINFORCE(env_name='CartPole-v1',
                        exp_name='{}-{}'.format(parameter_name,parameter),
                        tag=parameter_name,
                        **p)


In [None]:
p = {'env_name': 'CartPole-v1',
'exp_name': 'VIDEO_BEST_REINFORCE',
'tag': 'VIDEO_BEST_REINFORCE',
'reward_mode':0,
'num_episodes':1000,
'num_steps':500,
'gamma':0.999,
'seed':123,
'hidden_size':128,
'ckpt_freq':50,
'lr':1e-3,
'idx':[0,1,2,3],
'smooth_reward':True,
'save_video':True}

_ = train_REINFORCE(**p)


COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/2bbb09e63e5c470f9e40cdb7f9998a81

  


TRAIN: Episode: 0, reward: 10.0, steps: 12
EVAL: Episode: 0, reward: 17.0
TRAIN: Episode: 50, reward: 8.0, steps: 10
EVAL: Episode: 50, reward: 16.0
TRAIN: Episode: 100, reward: 76.0, steps: 78
EVAL: Episode: 100, reward: 14.0
TRAIN: Episode: 150, reward: 81.0, steps: 83
EVAL: Episode: 150, reward: 76.0
TRAIN: Episode: 200, reward: 96.0, steps: 98
EVAL: Episode: 200, reward: 108.0
TRAIN: Episode: 250, reward: 155.0, steps: 157
EVAL: Episode: 250, reward: 180.0
TRAIN: Episode: 300, reward: 158.0, steps: 160
EVAL: Episode: 300, reward: 123.0
TRAIN: Episode: 350, reward: 157.0, steps: 159
EVAL: Episode: 350, reward: 135.0
TRAIN: Episode: 400, reward: 110.0, steps: 112
EVAL: Episode: 400, reward: 53.0
TRAIN: Episode: 450, reward: 142.0, steps: 144
EVAL: Episode: 450, reward: 500.0
TRAIN: Episode: 500, reward: 249.0, steps: 251
EVAL: Episode: 500, reward: 457.0
TRAIN: Episode: 550, reward: 263.0, steps: 265
EVAL: Episode: 550, reward: 186.0
TRAIN: Episode: 600, reward: 293.0, steps: 295
EVA

  
COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/2bbb09e63e5c470f9e40cdb7f9998a81
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     eval/avg_reward [1000]  : (11.545454545454545, 453.6363636363636)
COMET INFO:     eval/avg_steps [1000]   : (11.545454545454545, 453.6363636363636)
COMET INFO:     eval/entropy [1000]     : (8.438605308532715, 282.6722412109375)
COMET INFO:     eval/rewards [1000]     : (10.7, 500.0)
COMET INFO:     eval/steps [1000]       : (8, 499)
COMET INFO:     exp_time                : 2134.7172038555145
COMET INFO:     train/avg_reward [1000] : (12.0, 245.91)
COMET INFO:     train/avg_steps [1000]  : (12.0, 245.91)
COMET INFO:     train/entropy [1000]    : (5.079311847686768, 295.7603759765625)
COMET INFO:     train/loss [1000]       : (1.5

LIST MP4:  ['video/openaigym.video.0.61.video000000.mp4']


COMET INFO: Uploading stats to Comet before program termination (may take several seconds)


In [None]:
a, _ = train_REINFORCE(env_name='CartPole-v1',
                        exp_name='{}-{}'.format('best_parameters', '2000ep'),
                        tag='best')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/stephanefschwarz/reinforce-test/3fb49bb4118d481c8c1cfe5e5c6ade5a

  
  


Episode: 0, reward: 13.0 || SUM: 13.0


  
  
  
  
  
  


Episode: 500, reward: 114.0 || SUM: 114.0


  
  
  


Episode: 1000, reward: 175.0 || SUM: 175.0
Episode: 1500, reward: 262.0 || SUM: 262.0


COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/stephanefschwarz/reinforce-test/3fb49bb4118d481c8c1cfe5e5c6ade5a
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     ep_avg_rewards [2000] : (1.002004008016032, 1.125)
COMET INFO:     ep_entropy [2000]     : (5.642023086547852, 297.89984130859375)
COMET INFO:     ep_loss [2000]        : (2.678617238998413, 47.66950225830078)
COMET INFO:     ep_rewards [2000]     : (9.0, 500.0)
COMET INFO:     ep_size [2000]        : (8, 499)
COMET INFO:     exp_time              : 567.8271110057831
COMET INFO:   Others:
COMET INFO:     Name : best_parameters-2000ep
COMET INFO:   Parameters:
COMET INFO:     gamma       : 0.99
COMET INFO:     idx         : [0, 1, 2, 3]
COMET INFO:     lr          : 0.001
COMET INFO:     reward_mode : 1
COMET INFO:   Uploads:
C

In [None]:
env = wrap_env(gym.make('CartPole-v1'))
display = Display(visible=0, size=(1400, 900))
display.start()

obs = env.reset()
for _ in range(1000):

    action, _, _ = a.select_action(torch.Tensor([obs]))
    action = action.cpu()

    obs, reward, done, info = env.step(action.numpy()[0])
    
    env.render()
    if done:
        break

env.close()
show_video()

  


Displaying video/openaigym.video.3.1153.video000000.mp4


**Discussion on REINFORCE**  

- Advantages [[5]](https://pylessons.com/Beyond-DQN/)
    - Better convergence properties (guaranteed to converge on a local maximum) 
    - More effective in high dimensional action spaces, or when using continuous actions
    - Can learn a stochastic policy (i.e. handles the exploration/exploitation trade off without hard coding it)

- Disadvantages

    - Noisy gradient and high variability in log probabilities (log of the policy distribution).
    - Instability and slow convergence.

### A2C algorithm

The Actor-Critic method is largely similar to REINFORCE, but it aims to solve some of its main problems: high variance and instability. It achieves this by subtracting a baseline from the cumulative reward, which generates smaller gradients and thus smaller and more stable updates. As the baseline, we used the state-value function [[6]](https://jfpettit.github.io/blog/2020/08/19/diving-deep-with-reinforce-and-a2c). This function is trained to estimate the future return $G_t$ from the current state. At a high level, the resulting algorithm involves a loop that alternates between:

- Actor: a policy gradient algorithm that decides on an action to take;
- Critic: value-based algorithm that critiques the action that the actor selected, providing feedback on how to adjust.

The critic allows us to calculate the advantage, so we can use it in the policy loss instead of relying purely on the observed returns as in REINFORCE. This reduces the variance of the policy gradient estimate and leads to more stable training. Intuitively, the value function captures how good it is to be in a given state, while the advantage measures how much better it is to take a specific action compared to the average action taken in that state.

The algorithm implemented here, called Advantage Actor Critic (A2C), uses the following update rule for policy gradient:

\begin{array}{c}
{A}^{\pi_{\theta}}({s}, {a})=r({s}, {a})+\gamma \hat{V}^{\pi_{\theta}}\left({s}^{\prime}\right)-\hat{V}^{\pi_{\theta}}({s}) \\
\nabla_{\theta} J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s) A^{\pi_{\theta}}(s, a)\right] 
\end{array}

The general process of the A2C algorithm [[7]](http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_5_actor_critic_pdf):
1. Initialize the actor $\pi_{\theta}$ and the critic $V^{\pi_{\theta}}$ with random weights;
2. Take action ${a} \sim \pi_{\theta}({a} \mid {s}),$ get $\left({s}, {a}, {s}^{\prime}, r\right)$;
3. Update $\hat{V}^{\pi_{\theta}}$ using target $r+\gamma \hat{V}^{\pi_{\theta}}\left({s}^{\prime}\right)$;
4. Evaluate ${A}^{\pi}({s},{a})=r({s}, {a})+\gamma \hat{V}^{\pi_{\theta}}\left({s}^{\prime}\right)-\hat{V}^{\pi_{\theta}}({s})$;
5. Estimate the policy gradient: $\nabla_{\theta} J(\theta) \approx \nabla_{\theta} \log \pi_{\theta}({a} \mid {s}) {A}^{\pi}({s}, {a})$;
6. Update the policy parameters: $\theta \leftarrow \theta+\alpha \nabla_{\theta} J(\theta)$.
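Steps 3 and 4 above both rely on the same one-step target, $r+\gamma \hat{V}(s')$. A minimal sketch with plain floats is given below. The function name `td_advantage` is hypothetical, and dropping the bootstrap term on terminal transitions (the `done` flag) is a common convention assumed here, not something stated in the text.

```python
def td_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step advantage estimate: r + gamma * V(s') - V(s)."""
    # Assumption: no bootstrapping from V(s') after a terminal transition
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


# r = 1, V(s) = 10, V(s') = 10, gamma = 0.5: 1 + 5 - 10
adv = td_advantage(1.0, 10.0, 10.0, gamma=0.5)
# → -4.0
```

A negative advantage such as this one means the critic expected more from the state than the action delivered, so the policy gradient step lowers the probability of that action.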

In [None]:
!git clone https://github.com/dongminlee94/deep_rl.git > /dev/null 2>&1

In [None]:
%cd /content/deep_rl/

/content/deep_rl


#### Default Arguments and Auxiliary Functions

In [None]:
Arguments = namedtuple('Arguments', ['algo', 'env', 'episode_bonus', 'eval_num_episodes', 'eval_per_train', 'gpu_index',
                                     'iterations', 'load', 'max_steps', 'phase', 'render', 'render_in_train', 'seed',
                                     'step_bonus', 'threshold_reward'])

In [None]:
def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env


def show_video(video_path='video', prefix='', index=-1):
  """
  Taken from https://github.com/eleurent/highway-env

  :param video_path: (str) Path to the folder containing videos
  :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
  """
  html = []
  mp4list = sorted(Path(video_path).glob(f'{prefix}*.mp4'))
  mp4 = mp4list[index]
  # print(f'Found {mp4list}')
  # print(f'Using {mp4}')
  video_b64 = base64.b64encode(mp4.read_bytes())
  html.append('''<video alt="{}" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{}" type="video/mp4" />
            </video>'''.format(mp4, video_b64.decode('ascii')))
  ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

In [None]:
episodes = 1_000
# episodes = 100
eval_num_episodes = 100
gym_env = 'CartPole-v1'
gpu_index = 0
max_steps = 10_000
seed = 0

In [None]:
# Initialize environment
env = wrap_env(gym.make(gym_env))
obs_dim = env.observation_space.shape[0]
act_num = env.action_space.n

# Set a random seed
env.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

device = torch.device('cuda', index=gpu_index) if torch.cuda.is_available() else torch.device('cpu')

In [None]:
def run_agent(agent, args, episode_bonus_fn=None, tags=[], experiment_name=''):
  # Create an experiment logger to comet ml
  experiment = Experiment(project_name='RL Project 02', api_key='vgqoTyQHCNqFm15LosJnDBwEo',
                          auto_metric_logging=False)
  
  experiment.log_parameter('gamma', agent.gamma)
  experiment.log_parameter('policy_lr', agent.policy_lr)
  experiment.log_parameter('vf_lr', agent.vf_lr)
  experiment.log_parameter('obs_dim', agent.obs_dim)
  if agent.observations_to_use is not None:
    experiment.log_parameter('observations_to_use', agent.observations_to_use)
  else:
    experiment.log_parameter('observations_to_use', [0, 1, 2, 3])
  
  trainable_params = sum(p.numel() for p in agent.policy.parameters() if p.requires_grad)
  trainable_params += sum(p.numel() for p in agent.vf.parameters() if p.requires_grad)

  total_params = sum(p.numel() for p in agent.policy.parameters())
  total_params += sum(p.numel() for p in agent.vf.parameters())

  experiment.log_other('trainable params', trainable_params)
  experiment.log_other('total params', total_params)
  
  experiment.set_name(experiment_name)
  if len(tags) > 0:
    experiment.add_tags(tags)
  experiment.add_tag(args.algo)
  experiment.log_parameters(args)

  start_time = time.time()

  train_sum_steps = 0
  train_sum_rewards = 0.
  max_eval_avg_steps = 0.

  ckpt_path = Path('./save_model')
  ckpt_path.mkdir(parents=True, exist_ok=True)
  ckpt_path = ckpt_path / f'{args.env}_{experiment_name}_best.pt'

  if episode_bonus_fn is not None:
    train_sum_rewards_with_bonus = 0.

  # Main loop
  for i in range(args.iterations):
    # Perform the training phase, during which the agent learns
    if args.phase == 'train':
      agent.eval_mode = False
  
      # Run one episode
      train_step_length, train_episode_reward = agent.run(args.max_steps)
      
      train_sum_steps += train_step_length
      train_sum_rewards += train_episode_reward
      
      train_average_reward = train_sum_rewards / (i+1)
      train_average_steps = train_sum_steps / (i+1)

      # Log experiment result for training episodes
      metrics = {'train/steps': train_step_length, 'train/reward': train_episode_reward,
                 'train/avg_steps': train_average_steps, 'train/avg_reward': train_average_reward}

      if episode_bonus_fn is not None:
        finish_bonus = episode_bonus_fn(train_step_length)
        train_sum_rewards_with_bonus += train_episode_reward + finish_bonus
        metrics.update({'train/reward_with_bonus': train_episode_reward + finish_bonus,
                        'train/reward_bonus': finish_bonus,
                        'train/avg_reward_with_bonus': train_sum_rewards_with_bonus / (i+1)})

      metrics.update(agent.logger)
      experiment.log_metrics(metrics, epoch=i+1)

    # Perform the evaluation phase -- no learning
    if (i + 1) % args.eval_per_train == 0:
      eval_sum_rewards = 0.
      eval_sum_steps = 0
      agent.eval_mode = True
      
      for j in range(args.eval_num_episodes):
        # Run one episode
        eval_step_length, eval_episode_reward = agent.run(args.max_steps)
        eval_sum_rewards += eval_episode_reward
        eval_sum_steps += eval_step_length

      eval_average_reward = eval_sum_rewards / args.eval_num_episodes
      eval_average_steps = eval_sum_steps / args.eval_num_episodes

      if eval_average_steps >= max_eval_avg_steps:
        max_eval_avg_steps = eval_average_steps

        # Save the trained model
        torch.save({
          'policy_state_dict': agent.policy.state_dict(),
          'vf_state_dict': agent.vf.state_dict(),
          'policy_optimizer_state_dict': agent.policy_optimizer.state_dict(),
          'vf_optimizer_state_dict': agent.vf_optimizer.state_dict(),
          }, ckpt_path)
          
        experiment.log_model(ckpt_path.name, f'{ckpt_path}', overwrite=True)

      # Log experiment result for evaluation episodes
      metrics = {'eval/avg_steps': eval_average_steps, 'eval/avg_reward': eval_average_reward}
      experiment.log_metrics(metrics, epoch=i+1)
      
      if args.phase == 'train':
        print('---------------------------------------')
        print('Iterations:', i + 1)
        print('Steps:', train_sum_steps)
        print('Episodes:', i)
        print('EpisodeReturn:', round(train_episode_reward, 2))
        print('AverageReturn:', round(train_average_reward, 2))
        print('EvalEpisodes:', args.eval_num_episodes)
        print('EvalEpisodeReturn:', round(eval_episode_reward, 2))
        print('EvalAverageReturn:', round(eval_average_reward, 2))
        print('OtherLogs:', agent.logger)
        print('Time:', int(time.time() - start_time))
        print('---------------------------------------')

      elif args.phase == 'test':
        print('---------------------------------------')
        print('EvalEpisodes:', args.eval_num_episodes)
        print('EvalEpisodeReturn:', round(eval_episode_reward, 2))
        print('EvalAverageReturn:', round(eval_average_reward, 2))
        print('Time:', int(time.time() - start_time))
        print('---------------------------------------')
  
  # generate video from the best set of parameters
  checkpoint = torch.load(ckpt_path)
  agent.policy.load_state_dict(checkpoint['policy_state_dict'])
  agent.vf.load_state_dict(checkpoint['vf_state_dict'])
  agent.policy_optimizer.load_state_dict(checkpoint['policy_optimizer_state_dict'])
  agent.vf_optimizer.load_state_dict(checkpoint['vf_optimizer_state_dict'])
  
  agent.env = wrap_env(gym.make(args.env))
  agent.render = True
  agent.eval_mode = True

  # Run one episode
  step_length, episode_reward = agent.run(args.max_steps)
  print(f'Test ended after {step_length} steps with reward {episode_reward}')
  agent.env.close()
  
  mp4list = sorted(Path('./video').glob('*.mp4'))
  if len(mp4list) > 0:
    print(f'Found files: {[str(mp4) for mp4 in mp4list]}')
    print(f'Uploading file: {mp4list[-1]}')
    experiment.log_asset(mp4list[-1], file_name=f'{args.env}_{experiment_name}_best.mp4')

  experiment.end()
  show_video()
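
The loop above runs an evaluation every `eval_per_train` training episodes and overwrites the checkpoint whenever the average number of evaluation steps is at least as high as the best seen so far. The sketch below isolates that bookkeeping in plain Python. The helper names `eval_points` and `best_checkpoint` are hypothetical, the value of 1000 iterations is an assumption that matches the metric counts in the Comet summaries, and the scores are illustrative only.

```python
def eval_points(iterations, eval_per_train):
    """Iterations (1-based) at which an evaluation phase is triggered."""
    return [i + 1 for i in range(iterations) if (i + 1) % eval_per_train == 0]


def best_checkpoint(eval_scores):
    """Return (iteration, score) of the checkpoint that would be kept.

    Mirrors the `>=` comparison in the loop: on a tie, the later
    evaluation overwrites the earlier checkpoint.
    """
    best_score, best_iter = 0.0, None
    for iteration, score in eval_scores:
        if score >= best_score:
            best_score, best_iter = score, iteration
    return best_iter, best_score


points = eval_points(iterations=1000, eval_per_train=50)
print(len(points), points[:3])  # 20 [50, 100, 150]

# Illustrative (iteration, average eval steps) pairs, not real results
scores = [(50, 92.65), (100, 267.95), (150, 195.1), (200, 500.0), (250, 500.0)]
print(best_checkpoint(scores))  # (250, 500.0)
```

Because ties replace the stored model, the saved checkpoint is the most recent one that reached the best score, not the first.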

#### Implementation

In [None]:
import numpy as np
import torch
import torch.optim as optim
import torch.nn.functional as F

from agents.common.networks import *


class A2CAgent(object):
   """An implementation of the Advantage Actor-Critic (A2C) agent."""

   def __init__(self,
                env,
                args,
                device,
                obs_dim,
                act_num,
                steps=0,
                gamma=0.99,
                policy_lr=1e-4,
                vf_lr=1e-3,
                eval_mode=False,
                policy_losses=None,
                vf_losses=None,
                logger=None,
                additional_reward_fn=None
   ):
      self.additional_reward = additional_reward_fn
      self.env = env
      self.args = args
      self.device = device
      if isinstance(obs_dim, list):
        self.observations_to_use = obs_dim
        self.obs_dim = len(obs_dim)
      else:
        self.observations_to_use = None
        self.obs_dim = obs_dim
      self.act_num = act_num
      self.steps = steps
      self.gamma = gamma
      self.policy_lr = policy_lr
      self.vf_lr = vf_lr
      self.eval_mode = eval_mode
      self.render = args.render
      # Fresh containers per agent: mutable defaults such as list() or dict()
      # are created once and would be shared by every A2CAgent instance
      self.policy_losses = [] if policy_losses is None else policy_losses
      self.vf_losses = [] if vf_losses is None else vf_losses
      self.logger = {} if logger is None else logger
      
      # Policy network
      self.policy = CategoricalPolicy(self.obs_dim, self.act_num, activation=torch.tanh).to(self.device)
      # Value network
      self.vf = MLP(self.obs_dim, 1, activation=torch.tanh).to(self.device)
      
      # Create optimizers
      self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=self.policy_lr)
      self.vf_optimizer = optim.Adam(self.vf.parameters(), lr=self.vf_lr)
      
   def select_action(self, obs):
      """Select an action from the set of available actions."""
      action, _, log_pi  = self.policy(obs)
      
      # Prediction V(s)
      v = self.vf(obs)

      # Add logπ(a|s), V(s) to transition list
      self.transition.extend([log_pi, v])
      return action.detach().cpu().numpy()

   def train_model(self):
      log_pi, v, reward, next_obs, done = self.transition

      # Prediction V(s')
      next_v = self.vf(torch.Tensor(next_obs).to(self.device))
      
      # Target for Q regression
      q = reward + self.gamma*(1-done)*next_v
      # .to() returns a tensor instead of modifying q in place, so keep the result
      q = q.to(self.device)

      # Advantage = Q - V
      advant = q - v

      # A2C losses
      policy_loss = -log_pi*advant.detach()
      vf_loss = F.mse_loss(v, q.detach())

      # Update value network parameter
      self.vf_optimizer.zero_grad()
      vf_loss.backward()
      self.vf_optimizer.step()
      
      # Update policy network parameter
      self.policy_optimizer.zero_grad()
      policy_loss.backward()
      self.policy_optimizer.step()

      # Save losses
      self.policy_losses.append(policy_loss.item())
      self.vf_losses.append(vf_loss.item())

   def run(self, max_step):
      step_number = 0
      total_reward = 0.
      if self.additional_reward is not None and not self.eval_mode:
        reward_without_additional = 0.

      obs = self.env.reset()
      if self.observations_to_use is not None:
        obs = [ obs[i] for i in range(len(obs)) if i in self.observations_to_use ]
      done = False

      # Keep interacting until agent reaches a terminal state.
      while not (done or step_number == max_step):
         if self.render and (self.args.render_in_train or self.eval_mode):
            self.env.render()

         if self.eval_mode:
           with torch.no_grad():
             # use only the probabilities
             _, pi, _ = self.policy(torch.Tensor(obs).to(self.device))
           action = pi.argmax().detach().cpu().numpy()
           next_obs, reward, done, _ = self.env.step(action)
           if self.observations_to_use is not None:
             next_obs = [ next_obs[i] for i in range(len(next_obs)) if i in self.observations_to_use ]
         else:
           self.steps += 1

           # Create a transition list
           self.transition = []

           # Collect experience (s, a, r, s') using some policy
           action = self.select_action(torch.Tensor(obs).to(self.device))
           next_obs, reward, done, _ = self.env.step(action)

           if self.observations_to_use is not None:
             next_obs = [ next_obs[i] for i in range(len(next_obs)) if i in self.observations_to_use ]

           if self.additional_reward is not None:
             reward_without_additional += reward
             reward += self.additional_reward(step_number, next_obs)

           # Add (r, s') to transition list
           self.transition.extend([reward, next_obs, done])
           
           self.train_model()

         total_reward += reward
         step_number += 1
         obs = next_obs
      
      # Save total average losses
      self.logger['LossPi'] = round(np.mean(self.policy_losses), 5)
      self.logger['LossV'] = round(np.mean(self.vf_losses), 5)
      if self.additional_reward is not None and not self.eval_mode:
        self.logger['custom_reward'] = total_reward
        return step_number, reward_without_additional
      return step_number, total_reward
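
The update in `train_model` works on a single transition at a time. As a worked example, the sketch below repeats the same arithmetic with plain floats instead of tensors. The helper `one_step_a2c_terms` is hypothetical, and the values chosen for `v`, `next_v`, and the action probability are assumptions made up for illustration.

```python
import math


def one_step_a2c_terms(reward, gamma, done, v, next_v, log_pi):
    """Return (q, advantage, policy_loss, vf_loss) for one transition."""
    # TD target: bootstrap from V(s') unless the episode has ended
    q = reward + gamma * (1 - done) * next_v
    # Advantage = Q - V
    advantage = q - v
    # Policy loss: negative log-probability weighted by the advantage
    policy_loss = -log_pi * advantage
    # Value loss: squared error between V(s) and the TD target
    vf_loss = (v - q) ** 2
    return q, advantage, policy_loss, vf_loss


# Assumed values: the action was taken with probability 0.5
log_pi = math.log(0.5)

# Non-terminal step: the target includes the discounted next-state value
mid_episode = one_step_a2c_terms(1.0, 0.99, False, v=10.0, next_v=10.5, log_pi=log_pi)
# Terminal step: (1 - done) removes the bootstrap term, so q equals the reward
terminal = one_step_a2c_terms(1.0, 0.99, True, v=10.0, next_v=10.5, log_pi=log_pi)

print('mid-episode:', [round(x, 4) for x in mid_episode])
print('terminal:   ', [round(x, 4) for x in terminal])
```

In the terminal case the advantage becomes strongly negative when the critic predicted a high value, which pushes the policy away from the action that ended the episode.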

#### Running With Default Parameters

In [None]:
args = {
  'algo': 'a2c',
  'env': gym_env,
  'episode_bonus': None,
  'eval_num_episodes': eval_num_episodes,
  'eval_per_train': 50,
  'gpu_index': gpu_index,
  'iterations': episodes,
  'load': None,
  'max_steps': max_steps,
  'phase': 'train',
  'render': False,
  'render_in_train': False,
  'seed': seed,
  'step_bonus': None,
  'threshold_reward': 500,
}

args = Arguments(**args)
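
`Arguments` is defined elsewhere in the notebook. The class listing printed in the cell output suggests it is a `namedtuple` whose fields are the sorted dictionary keys. The sketch below shows one way such a container could be built; it is an assumption about the setup, not the notebook's actual definition, and `config` holds only a few of the keys for brevity.

```python
from collections import namedtuple

# Reduced, illustrative configuration (the real dictionary has more keys)
config = {'algo': 'a2c', 'eval_per_train': 50, 'phase': 'train', 'render': False}

# Build a read-only record type whose fields are the sorted keys
Arguments = namedtuple('Arguments', sorted(config))
args = Arguments(**config)

print(args.algo, args.eval_per_train)

# namedtuples are immutable; _replace returns a modified copy
test_args = args._replace(phase='test')
print(args.phase, test_args.phase)
```

Attribute access (`args.phase`) is what `run_agent` and `A2CAgent` rely on, which is why a plain dictionary is converted first.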

In [None]:
# Create an agent
observations_to_use = [0, 1, 2, 3]
agent = A2CAgent(env, args, device, observations_to_use, act_num)
run_agent(agent, args, tags=['gamma', 'policy_lr', 'vf_lr', 'episode_bonus', 'step_bonus', 'observation_slice'],
          experiment_name=f'{args.algo}_default')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/6c11d15962c9440694e0df4ff22ecbb7

from operator import itemgetter as _itemgetter
from collections import OrderedDict

class Arguments(tuple):
    'Arguments(algo, env, episode_bonus, eval_num_episodes, eval_per_train, gpu_index, iterations, load, max_steps, phase, render, render_in_train, seed, step_bonus, threshold_reward)'

    __slots__ = ()

    _fields = ('algo', 'env', 'episode_bonus', 'eval_num_episodes', 'eval_per_train', 'gpu_index', 'iterations', 'load', 'max_steps', 'phase', 'render', 'render_in_train', 'seed', 'step_bonus', 'threshold_reward')

    def __new__(_cls, algo, env, episode_bonus, eval_num_episodes, eval_per_train, gpu_index, iterations, load, max_steps, phase, render, render_in_train, seed, step_bonus, threshold_reward):
        'Create new instance of Arguments(algo, env, episode_bonus, eval_num_episodes, eval_per_train, gpu_index, iterations, load, max_steps, phase, render

---------------------------------------
Iterations: 50
Steps: 1258
Episodes: 49
EpisodeReturn: 110.0
AverageReturn: 25.16
EvalEpisodes: 100
EvalEpisodeReturn: 75.0
EvalAverageReturn: 92.65
OtherLogs: {'LossPi': 0.17537, 'LossV': 5.81515}
Time: 12
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 4617
Episodes: 99
EpisodeReturn: 113.0
AverageReturn: 46.17
EvalEpisodes: 100
EvalEpisodeReturn: 336.0
EvalAverageReturn: 267.95
OtherLogs: {'LossPi': -0.02935, 'LossV': 9.98169}
Time: 35
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 14523
Episodes: 149
EpisodeReturn: 75.0
AverageReturn: 96.82
EvalEpisodes: 100
EvalEpisodeReturn: 80.0
EvalAverageReturn: 195.1
OtherLogs: {'LossPi': -0.0218, 'LossV': 7.6562}
Time: 74
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 31944
Episodes: 199
EpisodeReturn: 500.0
AverageReturn: 159.72
EvalEpis

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/6c11d15962c9440694e0df4ff22ecbb7
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.08002, 0.62732)
COMET INFO:     LossV [1000]            : (1.00028, 12.7709)
COMET INFO:     eval/avg_reward [20]    : (92.65, 500.0)
COMET INFO:     eval/avg_steps [20]     : (92.65, 500.0)
COMET INFO:     train/avg_reward [1000] : (17.6, 357.45208333333335)
COMET INFO:     train/avg_steps [1000]  : (17.6, 357.45208333333335)
COMET INFO:     train/reward [1000]     : (8.0, 500.0)
COMET INFO:     train/steps [1000]      : (8, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_default
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameters:
C

Test ended after 500 steps with reward 500.0
Found files: ['video/openaigym.video.1.63.video000000.mp4']
Uploading file: video/openaigym.video.1.63.video000000.mp4




#### Testing $\gamma$
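
The discount factor $\gamma$ controls how far into the future rewards influence the TD target. With a constant reward of 1 per step, as in CartPole, the discounted return approaches $1/(1-\gamma)$, which can be read as a rough effective horizon. The sketch below computes this for the values tested in this section plus the default of 0.99. The helper names are hypothetical, and the 500-step cap is taken from the maximum episode length seen in the logs.

```python
def effective_horizon(gamma):
    """Limit of the discounted return for a constant reward of 1 per step."""
    return 1.0 / (1.0 - gamma)


def discounted_return(gamma, steps, reward=1.0):
    """Discounted sum of a constant reward over a finite number of steps."""
    return sum(reward * gamma ** t for t in range(steps))


for gamma in (0.1, 0.5, 0.75, 0.9, 0.99, 0.999):
    print(f'gamma={gamma}: horizon ~ {effective_horizon(gamma):.1f} steps, '
          f'return over 500 steps = {discounted_return(gamma, 500):.2f}')
```

With $\gamma = 0.1$ the target looks barely more than one step ahead, whereas with $\gamma = 0.999$ the horizon exceeds the episode length, so the 500-step return stays well below the limit.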

In [None]:
args = {
  'algo': 'a2c',
  'env': gym_env,
  'episode_bonus': None,
  'eval_num_episodes': eval_num_episodes,
  'eval_per_train': 50,
  'gpu_index': gpu_index,
  'iterations': episodes,
  'load': None,
  'max_steps': max_steps,
  'phase': 'train',
  'render': False,
  'render_in_train': False,
  'seed': seed,
  'step_bonus': None,
  'threshold_reward': 500,
}

args = Arguments(**args)

In [None]:
# Create an agent
agent = A2CAgent(env, args, device, obs_dim, act_num, gamma=0.1)
run_agent(agent, args, tags=['gamma'], experiment_name=f'{args.algo}_gamma_{agent.gamma}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/687d8c41b8714baa8af1a49439b9f09a

---------------------------------------
Iterations: 50
Steps: 1070
Episodes: 49
EpisodeReturn: 12.0
AverageReturn: 21.4
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.52
OtherLogs: {'LossPi': -0.03075, 'LossV': 12.6975}
Time: 9
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2144
Episodes: 99
EpisodeReturn: 18.0
AverageReturn: 21.44
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.3
OtherLogs: {'LossPi': -0.03066, 'LossV': 12.65834}
Time: 17
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 3394
Episodes: 149
EpisodeReturn: 36.0
AverageReturn: 22.63
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 11.14
OtherLogs: {'LossPi': -0.03054, 'LossV': 12.61306}
Time: 27
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 4487
Episodes: 199
EpisodeReturn: 13.0
AverageReturn: 22.43
EvalEpisodes: 100


COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/687d8c41b8714baa8af1a49439b9f09a
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.03085, -0.02894)
COMET INFO:     LossV [1000]            : (11.94901, 12.7359)
COMET INFO:     eval/avg_reward [20]    : (9.3, 90.03)
COMET INFO:     eval/avg_steps [20]     : (9.3, 90.03)
COMET INFO:     train/avg_reward [1000] : (17.375, 26.384223918575064)
COMET INFO:     train/avg_steps [1000]  : (17.375, 26.384223918575064)
COMET INFO:     train/reward [1000]     : (8.0, 131.0)
COMET INFO:     train/steps [1000]      : (8, 131)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_gamma_0.1
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameter

Test ended after 111 steps with reward 111.0
Found files: ['video/openaigym.video.2.63.video000000.mp4']
Uploading file: video/openaigym.video.2.63.video000000.mp4




In [None]:
# Create an agent
agent = A2CAgent(env, args, device, obs_dim, act_num, gamma=0.5)
run_agent(agent, args, tags=['gamma'], experiment_name=f'{args.algo}_gamma_{agent.gamma}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/f519c9a9f8de4d01bc7b732912e7a4d4

---------------------------------------
Iterations: 50
Steps: 1544
Episodes: 49
EpisodeReturn: 12.0
AverageReturn: 30.88
EvalEpisodes: 100
EvalEpisodeReturn: 46.0
EvalAverageReturn: 59.25
OtherLogs: {'LossPi': -0.02881, 'LossV': 11.89935}
Time: 12
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 4143
Episodes: 99
EpisodeReturn: 53.0
AverageReturn: 41.43
EvalEpisodes: 100
EvalEpisodeReturn: 87.0
EvalAverageReturn: 110.06
OtherLogs: {'LossPi': -0.02865, 'LossV': 11.81655}
Time: 30
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 8249
Episodes: 149
EpisodeReturn: 79.0
AverageReturn: 54.99
EvalEpisodes: 100
EvalEpisodeReturn: 153.0
EvalAverageReturn: 135.52
OtherLogs: {'LossPi': -0.02836, 'LossV': 11.68801}
Time: 53
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 13902
Episodes: 199
EpisodeReturn: 131.0
AverageReturn: 69.51
EvalE

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/f519c9a9f8de4d01bc7b732912e7a4d4
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.02891, -0.02151)
COMET INFO:     LossV [1000]            : (8.85155, 11.94829)
COMET INFO:     eval/avg_reward [20]    : (24.44, 500.0)
COMET INFO:     eval/avg_steps [20]     : (24.44, 500.0)
COMET INFO:     train/avg_reward [1000] : (18.5, 129.181)
COMET INFO:     train/avg_steps [1000]  : (18.5, 129.181)
COMET INFO:     train/reward [1000]     : (12.0, 500.0)
COMET INFO:     train/steps [1000]      : (12, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_gamma_0.5
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameters:
COMET INFO:     _

Test ended after 500 steps with reward 500.0
Found files: ['video/openaigym.video.3.63.video000000.mp4']
Uploading file: video/openaigym.video.3.63.video000000.mp4




In [None]:
# Create an agent
agent = A2CAgent(env, args, device, obs_dim, act_num, gamma=0.75)
run_agent(agent, args, tags=['gamma'], experiment_name=f'{args.algo}_gamma_{agent.gamma}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/8c62e66ee3674799b0caeee44511d929

---------------------------------------
Iterations: 50
Steps: 1317
Episodes: 49
EpisodeReturn: 23.0
AverageReturn: 26.34
EvalEpisodes: 100
EvalEpisodeReturn: 47.0
EvalAverageReturn: 49.82
OtherLogs: {'LossPi': -0.02143, 'LossV': 8.82888}
Time: 13
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 3749
Episodes: 99
EpisodeReturn: 39.0
AverageReturn: 37.49
EvalEpisodes: 100
EvalEpisodeReturn: 41.0
EvalAverageReturn: 57.31
OtherLogs: {'LossPi': -0.02142, 'LossV': 8.7866}
Time: 30
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 7406
Episodes: 149
EpisodeReturn: 63.0
AverageReturn: 49.37
EvalEpisodes: 100
EvalEpisodeReturn: 144.0
EvalAverageReturn: 140.74
OtherLogs: {'LossPi': -0.02136, 'LossV': 8.72353}
Time: 54
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 16286
Episodes: 199
EpisodeReturn: 374.0
AverageReturn: 81.43
EvalEpisod

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/8c62e66ee3674799b0caeee44511d929
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.0215, -0.01633)
COMET INFO:     LossV [1000]            : (6.5339, 8.85134)
COMET INFO:     eval/avg_reward [20]    : (36.73, 500.0)
COMET INFO:     eval/avg_steps [20]     : (36.73, 500.0)
COMET INFO:     train/avg_reward [1000] : (13.0, 190.05291005291005)
COMET INFO:     train/avg_steps [1000]  : (13.0, 190.05291005291005)
COMET INFO:     train/reward [1000]     : (10.0, 500.0)
COMET INFO:     train/steps [1000]      : (10, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_gamma_0.75
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameter

Test ended after 500 steps with reward 500.0
Found files: ['video/openaigym.video.4.63.video000000.mp4']
Uploading file: video/openaigym.video.4.63.video000000.mp4




In [None]:
# Create an agent
agent = A2CAgent(env, args, device, obs_dim, act_num, gamma=0.9)
run_agent(agent, args, tags=['gamma'], experiment_name=f'{args.algo}_gamma_{agent.gamma}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/23b1fa6aedf14103b0c9ec62d8fe1e51

---------------------------------------
Iterations: 50
Steps: 1515
Episodes: 49
EpisodeReturn: 18.0
AverageReturn: 30.3
EvalEpisodes: 100
EvalEpisodeReturn: 39.0
EvalAverageReturn: 44.59
OtherLogs: {'LossPi': -0.01632, 'LossV': 6.52232}
Time: 16
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 4973
Episodes: 99
EpisodeReturn: 111.0
AverageReturn: 49.73
EvalEpisodes: 100
EvalEpisodeReturn: 193.0
EvalAverageReturn: 232.64
OtherLogs: {'LossPi': -0.01651, 'LossV': 6.49298}
Time: 45
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 13703
Episodes: 149
EpisodeReturn: 184.0
AverageReturn: 91.35
EvalEpisodes: 100
EvalEpisodeReturn: 236.0
EvalAverageReturn: 226.9
OtherLogs: {'LossPi': -0.01668, 'LossV': 6.41459}
Time: 90
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 23571
Episodes: 199
EpisodeReturn: 36.0
AverageReturn: 117.86
EvalEp

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/23b1fa6aedf14103b0c9ec62d8fe1e51
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.01669, -0.01368)
COMET INFO:     LossV [1000]            : (5.12886, 6.53378)
COMET INFO:     eval/avg_reward [20]    : (44.59, 500.0)
COMET INFO:     eval/avg_steps [20]     : (44.59, 500.0)
COMET INFO:     train/avg_reward [1000] : (15.0, 200.74400871459696)
COMET INFO:     train/avg_steps [1000]  : (15.0, 200.74400871459696)
COMET INFO:     train/reward [1000]     : (10.0, 500.0)
COMET INFO:     train/steps [1000]      : (10, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_gamma_0.9
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Paramete

Test ended after 500 steps with reward 500.0
Found files: ['video/openaigym.video.5.63.video000000.mp4']
Uploading file: video/openaigym.video.5.63.video000000.mp4




In [None]:
# Create an agent
agent = A2CAgent(env, args, device, obs_dim, act_num, gamma=0.999)
run_agent(agent, args, tags=['gamma'], experiment_name=f'{args.algo}_gamma_{agent.gamma}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/0d01e8f0b4784005a461cecdd4c8b7be

---------------------------------------
Iterations: 50
Steps: 1244
Episodes: 49
EpisodeReturn: 31.0
AverageReturn: 24.88
EvalEpisodes: 100
EvalEpisodeReturn: 87.0
EvalAverageReturn: 66.72
OtherLogs: {'LossPi': -0.01344, 'LossV': 5.13454}
Time: 18
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 3984
Episodes: 99
EpisodeReturn: 30.0
AverageReturn: 39.84
EvalEpisodes: 100
EvalEpisodeReturn: 69.0
EvalAverageReturn: 73.44
OtherLogs: {'LossPi': -0.01327, 'LossV': 5.15541}
Time: 42
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 8332
Episodes: 149
EpisodeReturn: 107.0
AverageReturn: 55.55
EvalEpisodes: 100
EvalEpisodeReturn: 222.0
EvalAverageReturn: 96.47
OtherLogs: {'LossPi': -0.01302, 'LossV': 5.20115}
Time: 71
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 16293
Episodes: 199
EpisodeReturn: 274.0
AverageReturn: 81.47
EvalEpiso

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/0d01e8f0b4784005a461cecdd4c8b7be
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.02213, -0.00538)
COMET INFO:     LossV [1000]            : (5.12837, 47.68618)
COMET INFO:     eval/avg_reward [20]    : (12.57, 500.0)
COMET INFO:     eval/avg_steps [20]     : (12.57, 500.0)
COMET INFO:     train/avg_reward [1000] : (18.0, 259.6070381231672)
COMET INFO:     train/avg_steps [1000]  : (18.0, 259.6070381231672)
COMET INFO:     train/reward [1000]     : (10.0, 500.0)
COMET INFO:     train/steps [1000]      : (10, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_gamma_0.999
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Paramet

Test ended after 500 steps with reward 500.0
Found files: ['video/openaigym.video.6.63.video000000.mp4']
Uploading file: video/openaigym.video.6.63.video000000.mp4




#### Testing Policy Learning Rates

In [None]:
args = {
  'algo': 'a2c',
  'env': gym_env,
  'episode_bonus': None,
  'eval_num_episodes': eval_num_episodes,
  'eval_per_train': 50,
  'gpu_index': gpu_index,
  'iterations': episodes,
  'load': None,
  'max_steps': max_steps,
  'phase': 'train',
  'render': False,
  'render_in_train': False,
  'seed': seed,
  'step_bonus': None,
  'threshold_reward': 500,
}

args = Arguments(**args)

In [None]:
# Create an agent
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-5)
run_agent(agent, args, tags=['policy_lr'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/25e8bc99ee7740e78792a88c9dc54f0d

---------------------------------------
Iterations: 50
Steps: 1049
Episodes: 49
EpisodeReturn: 46.0
AverageReturn: 20.98
EvalEpisodes: 100
EvalEpisodeReturn: 43.0
EvalAverageReturn: 42.03
OtherLogs: {'LossPi': -0.02111, 'LossV': 47.64773}
Time: 20
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2450
Episodes: 99
EpisodeReturn: 38.0
AverageReturn: 24.5
EvalEpisodes: 100
EvalEpisodeReturn: 23.0
EvalAverageReturn: 21.98
OtherLogs: {'LossPi': -0.02112, 'LossV': 47.60261}
Time: 41
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 3788
Episodes: 149
EpisodeReturn: 26.0
AverageReturn: 25.25
EvalEpisodes: 100
EvalEpisodeReturn: 18.0
EvalAverageReturn: 20.89
OtherLogs: {'LossPi': -0.02129, 'LossV': 47.56127}
Time: 62
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 5339
Episodes: 199
EpisodeReturn: 19.0
AverageReturn: 26.7
EvalEpisodes

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/25e8bc99ee7740e78792a88c9dc54f0d
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.03246, -0.02103)
COMET INFO:     LossV [1000]            : (42.31322, 47.68555)
COMET INFO:     eval/avg_reward [20]    : (20.89, 500.0)
COMET INFO:     eval/avg_steps [20]     : (20.89, 500.0)
COMET INFO:     train/avg_reward [1000] : (12.5, 201.063)
COMET INFO:     train/avg_steps [1000]  : (12.5, 201.063)
COMET INFO:     train/reward [1000]     : (9.0, 500.0)
COMET INFO:     train/steps [1000]      : (9, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_1e-05
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameters:
COMET INFO:     _f

Test ended after 500 steps with reward 500.0
Found files: ['video/openaigym.video.7.63.video000000.mp4']
Uploading file: video/openaigym.video.7.63.video000000.mp4


COMET INFO: Still uploading


In [None]:
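# Create an A2C agent with policy_lr=1e-3 (value-function learning rate left at its default) and run it, logging to Comet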
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-3)
run_agent(agent, args, tags=['policy_lr'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/05827dea055a4bed8341227246730f08

---------------------------------------
Iterations: 50
Steps: 1474
Episodes: 49
EpisodeReturn: 32.0
AverageReturn: 29.48
EvalEpisodes: 100
EvalEpisodeReturn: 37.0
EvalAverageReturn: 52.26
OtherLogs: {'LossPi': -0.03145, 'LossV': 42.27285}
Time: 25
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 6738
Episodes: 99
EpisodeReturn: 8.0
AverageReturn: 67.38
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.45
OtherLogs: {'LossPi': -0.03106, 'LossV': 42.18538}
Time: 59
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 7207
Episodes: 149
EpisodeReturn: 10.0
AverageReturn: 48.05
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.42
OtherLogs: {'LossPi': -0.03105, 'LossV': 42.17777}
Time: 79
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 7668
Episodes: 199
EpisodeReturn: 8.0
AverageReturn: 38.34
EvalEpisodes: 

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/05827dea055a4bed8341227246730f08
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.03172, -0.03078)
COMET INFO:     LossV [1000]            : (41.94391, 42.31275)
COMET INFO:     eval/avg_reward [20]    : (9.22, 52.26)
COMET INFO:     eval/avg_steps [20]     : (9.22, 52.26)
COMET INFO:     train/avg_reward [1000] : (11.11111111111111, 84.35526315789474)
COMET INFO:     train/avg_steps [1000]  : (11.11111111111111, 84.35526315789474)
COMET INFO:     train/reward [1000]     : (8.0, 500.0)
COMET INFO:     train/steps [1000]      : (8, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.001
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155


Test ended after 42 steps with reward 42.0
Found files: ['video/openaigym.video.8.63.video000000.mp4']
Uploading file: video/openaigym.video.8.63.video000000.mp4


COMET INFO: Still uploading


In [None]:
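# Create an A2C agent with policy_lr=1e-2 (value-function learning rate left at its default) and run it, logging to Comet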
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-2)
run_agent(agent, args, tags=['policy_lr'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/76b60358d0b34003a046531ad301b137

---------------------------------------
Iterations: 50
Steps: 481
Episodes: 49
EpisodeReturn: 9.0
AverageReturn: 9.62
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.32
OtherLogs: {'LossPi': -0.03077, 'LossV': 41.92966}
Time: 20
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 941
Episodes: 99
EpisodeReturn: 11.0
AverageReturn: 9.41
EvalEpisodes: 100
EvalEpisodeReturn: 8.0
EvalAverageReturn: 9.39
OtherLogs: {'LossPi': -0.03076, 'LossV': 41.91534}
Time: 41
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 1400
Episodes: 149
EpisodeReturn: 9.0
AverageReturn: 9.33
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.39
OtherLogs: {'LossPi': -0.03075, 'LossV': 41.90105}
Time: 61
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 1869
Episodes: 199
EpisodeReturn: 10.0
AverageReturn: 9.35
EvalEpisodes: 100
EvalE

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/76b60358d0b34003a046531ad301b137
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.03078, -0.03057)
COMET INFO:     LossV [1000]            : (41.65508, 41.94348)
COMET INFO:     eval/avg_reward [20]    : (9.17, 9.55)
COMET INFO:     eval/avg_steps [20]     : (9.17, 9.55)
COMET INFO:     train/avg_reward [1000] : (9.314285714285715, 14.0)
COMET INFO:     train/avg_steps [1000]  : (9.314285714285715, 14.0)
COMET INFO:     train/reward [1000]     : (8.0, 14.0)
COMET INFO:     train/steps [1000]      : (8, 14)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.01
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameters:
COMET

Test ended after 10 steps with reward 10.0
Found files: ['video/openaigym.video.9.63.video000000.mp4']
Uploading file: video/openaigym.video.9.63.video000000.mp4


COMET INFO: Still uploading


In [None]:
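# Create an A2C agent with policy_lr=1e-1 (value-function learning rate left at its default) and run it, logging to Comet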
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-1)
run_agent(agent, args, tags=['policy_lr'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/c5814f5b1ca24b738033f385187d2e74

---------------------------------------
Iterations: 50
Steps: 470
Episodes: 49
EpisodeReturn: 9.0
AverageReturn: 9.4
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.4
OtherLogs: {'LossPi': -0.03055, 'LossV': 41.64118}
Time: 20
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 937
Episodes: 99
EpisodeReturn: 9.0
AverageReturn: 9.37
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.46
OtherLogs: {'LossPi': -0.03054, 'LossV': 41.62684}
Time: 41
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 1415
Episodes: 149
EpisodeReturn: 10.0
AverageReturn: 9.43
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.22
OtherLogs: {'LossPi': -0.03053, 'LossV': 41.61215}
Time: 61
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 1887
Episodes: 199
EpisodeReturn: 9.0
AverageReturn: 9.44
EvalEpisodes: 100
EvalEpis

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/c5814f5b1ca24b738033f385187d2e74
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.03056, -0.03035)
COMET INFO:     LossV [1000]            : (41.3689, 41.65478)
COMET INFO:     eval/avg_reward [20]    : (9.22, 9.49)
COMET INFO:     eval/avg_steps [20]     : (9.22, 9.49)
COMET INFO:     train/avg_reward [1000] : (9.318181818181818, 10.0)
COMET INFO:     train/avg_steps [1000]  : (9.318181818181818, 10.0)
COMET INFO:     train/reward [1000]     : (8.0, 11.0)
COMET INFO:     train/steps [1000]      : (8, 11)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.1
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameters:
COMET I

Test ended after 9 steps with reward 9.0
Found files: ['video/openaigym.video.10.63.video000000.mp4']
Uploading file: video/openaigym.video.10.63.video000000.mp4


COMET INFO: Still uploading
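
As a compact recap of the four policy learning rate runs above, the sketch below collects the best `eval/avg_reward` reported in each Comet summary together with the reward of the final test episode, and prints them side by side. The numbers are copied from the logs above; the dictionary layout and the `best_lr` helper are illustrative assumptions and are not part of the project code.

```python
# Values copied from the Comet summaries and "Test ended" lines above.
# policy_lr -> (max eval/avg_reward, reward of the final test episode)
policy_lr_results = {
    1e-5: (500.0, 500.0),
    1e-3: (52.26, 42.0),
    1e-2: (9.55, 10.0),
    1e-1: (9.49, 9.0),
}


def best_lr(results):
    """Return the learning rate with the highest max eval/avg_reward (hypothetical helper)."""
    return max(results, key=lambda lr: results[lr][0])


print(f"{'policy_lr':>10} {'max eval avg':>13} {'test reward':>12}")
for lr, (max_eval, test_reward) in policy_lr_results.items():
    print(f"{lr:>10g} {max_eval:>13.2f} {test_reward:>12.1f}")

print("best policy_lr:", best_lr(policy_lr_results))
```

Among these runs, only `policy_lr=1e-5` reached the 500-step cap. Each setting is shown for one run, so the ranking should be read with that in mind.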


#### Testing Value Function Learning Rates

In [None]:
args = {
  'algo': 'a2c',
  'env': gym_env,
  'episode_bonus': None,
  'eval_num_episodes': eval_num_episodes,
  'eval_per_train': 50,
  'gpu_index': gpu_index,
  'iterations': episodes,
  'load': None,
  'max_steps': max_steps,
  'phase': 'train',
  'render': False,
  'render_in_train': False,
  'seed': seed,
  'step_bonus': None,
  'threshold_reward': 500,
}

args = Arguments(**args)

In [None]:
# Create an A2C agent with vf_lr=1e-5 (policy learning rate left at its default) and run it, logging to Comet
agent = A2CAgent(env, args, device, obs_dim, act_num, vf_lr=1e-5)
run_agent(agent, args, tags=['vf_lr'], experiment_name=f'{args.algo}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/23bacd83a9d04e3c8a6afbfd440c573d

---------------------------------------
Iterations: 50
Steps: 1131
Episodes: 49
EpisodeReturn: 18.0
AverageReturn: 22.62
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.52
OtherLogs: {'LossPi': -0.02978, 'LossV': 41.33534}
Time: 23
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2008
Episodes: 99
EpisodeReturn: 9.0
AverageReturn: 20.08
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.45
OtherLogs: {'LossPi': -0.02939, 'LossV': 41.30936}
Time: 45
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 2641
Episodes: 149
EpisodeReturn: 11.0
AverageReturn: 17.61
EvalEpisodes: 100
EvalEpisodeReturn: 8.0
EvalAverageReturn: 9.47
OtherLogs: {'LossPi': -0.02919, 'LossV': 41.2907}
Time: 66
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 3206
Episodes: 199
EpisodeReturn: 10.0
AverageReturn: 16.03
EvalEpisodes: 100


COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/23bacd83a9d04e3c8a6afbfd440c573d
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.03032, -0.02866)
COMET INFO:     LossV [1000]            : (41.05384, 41.36727)
COMET INFO:     eval/avg_reward [20]    : (9.2, 9.52)
COMET INFO:     eval/avg_steps [20]     : (9.2, 9.52)
COMET INFO:     train/avg_reward [1000] : (11.018, 55.0)
COMET INFO:     train/avg_steps [1000]  : (11.018, 55.0)
COMET INFO:     train/reward [1000]     : (8.0, 62.0)
COMET INFO:     train/steps [1000]      : (8, 62)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_vflr_1e-05
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameters:
COMET INFO:     _fields    

Test ended after 9 steps with reward 9.0
Found files: ['video/openaigym.video.11.63.video000000.mp4']
Uploading file: video/openaigym.video.11.63.video000000.mp4


COMET INFO: Still uploading


In [None]:
# Create an A2C agent with vf_lr=1e-4 (policy learning rate left at its default) and run it, logging to Comet
agent = A2CAgent(env, args, device, obs_dim, act_num, vf_lr=1e-4)
run_agent(agent, args, tags=['vf_lr'], experiment_name=f'{args.algo}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/3ede4a3e910448e4b28a7de704990331

---------------------------------------
Iterations: 50
Steps: 1096
Episodes: 49
EpisodeReturn: 19.0
AverageReturn: 21.92
EvalEpisodes: 100
EvalEpisodeReturn: 11.0
EvalAverageReturn: 11.37
OtherLogs: {'LossPi': -0.02831, 'LossV': 41.0229}
Time: 23
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 1792
Episodes: 99
EpisodeReturn: 17.0
AverageReturn: 17.92
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.38
OtherLogs: {'LossPi': -0.02823, 'LossV': 41.00489}
Time: 44
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 2510
Episodes: 149
EpisodeReturn: 12.0
AverageReturn: 16.73
EvalEpisodes: 100
EvalEpisodeReturn: 14.0
EvalAverageReturn: 12.39
OtherLogs: {'LossPi': -0.02818, 'LossV': 40.98657}
Time: 66
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 3194
Episodes: 199
EpisodeReturn: 18.0
AverageReturn: 15.97
EvalEpisodes

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/3ede4a3e910448e4b28a7de704990331
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.02866, -0.02457)
COMET INFO:     LossV [1000]            : (40.36938, 41.05297)
COMET INFO:     eval/avg_reward [20]    : (9.38, 81.44)
COMET INFO:     eval/avg_steps [20]     : (9.38, 81.44)
COMET INFO:     train/avg_reward [1000] : (15.959798994974875, 37.0)
COMET INFO:     train/avg_steps [1000]  : (15.959798994974875, 37.0)
COMET INFO:     train/reward [1000]     : (9.0, 155.0)
COMET INFO:     train/steps [1000]      : (9, 155)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_vflr_0.0001
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Paramete

Test ended after 107 steps with reward 107.0
Found files: ['video/openaigym.video.12.63.video000000.mp4']
Uploading file: video/openaigym.video.12.63.video000000.mp4


COMET INFO:   Uploads [count]:
COMET INFO:     asset               : 1
COMET INFO:     code                : 1 (18 KB)
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     git metadata        : 1
COMET INFO:     installed packages  : 1
COMET INFO:     model-element [7]   : 7
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Still uploading
COMET INFO: Waiting for completion of the file uploads (may take several seconds)
COMET INFO: Still uploading


In [None]:
# Create an A2C agent with vf_lr=1e-2 (policy learning rate left at its default) and run it, logging to Comet
agent = A2CAgent(env, args, device, obs_dim, act_num, vf_lr=1e-2)
run_agent(agent, args, tags=['vf_lr'], experiment_name=f'{args.algo}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/0b6375cb0628426ead7ecd4afd519ffc

---------------------------------------
Iterations: 50
Steps: 967
Episodes: 49
EpisodeReturn: 17.0
AverageReturn: 19.34
EvalEpisodes: 100
EvalEpisodeReturn: 21.0
EvalAverageReturn: 25.13
OtherLogs: {'LossPi': -0.02464, 'LossV': 40.34724}
Time: 23
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 4429
Episodes: 99
EpisodeReturn: 42.0
AverageReturn: 44.29
EvalEpisodes: 100
EvalEpisodeReturn: 133.0
EvalAverageReturn: 109.65
OtherLogs: {'LossPi': -0.02482, 'LossV': 40.28473}
Time: 57
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 9527
Episodes: 149
EpisodeReturn: 52.0
AverageReturn: 63.51
EvalEpisodes: 100
EvalEpisodeReturn: 50.0
EvalAverageReturn: 64.85
OtherLogs: {'LossPi': -0.02455, 'LossV': 40.19522}
Time: 94
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 15212
Episodes: 199
EpisodeReturn: 36.0
AverageReturn: 76.06
EvalEpis

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/0b6375cb0628426ead7ecd4afd519ffc
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.02484, -0.02254)
COMET INFO:     LossV [1000]            : (39.44371, 40.36914)
COMET INFO:     eval/avg_reward [20]    : (25.13, 314.36)
COMET INFO:     eval/avg_steps [20]     : (25.13, 314.36)
COMET INFO:     train/avg_reward [1000] : (10.0, 83.45327102803738)
COMET INFO:     train/avg_steps [1000]  : (10.0, 83.45327102803738)
COMET INFO:     train/reward [1000]     : (8.0, 500.0)
COMET INFO:     train/steps [1000]      : (8, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_vflr_0.01
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Paramete

Test ended after 331 steps with reward 331.0
Found files: ['video/openaigym.video.13.63.video000000.mp4']
Uploading file: video/openaigym.video.13.63.video000000.mp4


COMET INFO: Still uploading


In [None]:
# Create an A2C agent with vf_lr=1e-1 (policy learning rate left at its default) and run it, logging to Comet
agent = A2CAgent(env, args, device, obs_dim, act_num, vf_lr=1e-1)
run_agent(agent, args, tags=['vf_lr'], experiment_name=f'{args.algo}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/661b169c68d646e4907701527ab977ea

---------------------------------------
Iterations: 50
Steps: 1119
Episodes: 49
EpisodeReturn: 24.0
AverageReturn: 22.38
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.29
OtherLogs: {'LossPi': -0.02255, 'LossV': 39.43278}
Time: 24
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2090
Episodes: 99
EpisodeReturn: 11.0
AverageReturn: 20.9
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.23
OtherLogs: {'LossPi': -0.02254, 'LossV': 39.41768}
Time: 47
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 3119
Episodes: 149
EpisodeReturn: 25.0
AverageReturn: 20.79
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.44
OtherLogs: {'LossPi': -0.02253, 'LossV': 39.40316}
Time: 71
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 4186
Episodes: 199
EpisodeReturn: 13.0
AverageReturn: 20.93
EvalEpisodes: 10

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/661b169c68d646e4907701527ab977ea
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.02258, -0.022)
COMET INFO:     LossV [1000]            : (39.15991, 39.44708)
COMET INFO:     eval/avg_reward [20]    : (9.23, 80.6)
COMET INFO:     eval/avg_steps [20]     : (9.23, 80.6)
COMET INFO:     train/avg_reward [1000] : (14.0, 32.0)
COMET INFO:     train/avg_steps [1000]  : (14.0, 32.0)
COMET INFO:     train/reward [1000]     : (8.0, 153.0)
COMET INFO:     train/steps [1000]      : (8, 153)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_vflr_0.1
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameters:
COMET INFO:     _fields        

Test ended after 60 steps with reward 60.0
Found files: ['video/openaigym.video.14.63.video000000.mp4']
Uploading file: video/openaigym.video.14.63.video000000.mp4


COMET INFO: Still uploading
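
The periodic log blocks printed by the runs above follow a simple `Key: value` layout between dashed rules, so they can be turned into dictionaries for offline inspection. The sketch below is one possible way to do that; `parse_log_blocks` is a hypothetical helper written for this illustration (it is not part of the project code), and the sample text is copied from the first two blocks of the `vf_lr=1e-1` run.

```python
import ast


def parse_log_blocks(text):
    """Split printed training logs into one dict per dashed block (hypothetical helper)."""
    blocks, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line and set(line) == {"-"}:
            # A dashed rule closes whatever block has been collected so far.
            if current:
                blocks.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if not sep or not value.strip():
            continue  # skip blank lines and lines without a value
        try:
            # Numbers and the OtherLogs dict are valid Python literals.
            current[key.strip()] = ast.literal_eval(value.strip())
        except (ValueError, SyntaxError):
            current[key.strip()] = value.strip()
    if current:
        # Keep a trailing block even if its closing rule is missing.
        blocks.append(current)
    return blocks


sample = """
---------------------------------------
Iterations: 50
Steps: 1119
Episodes: 49
EpisodeReturn: 24.0
AverageReturn: 22.38
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.29
OtherLogs: {'LossPi': -0.02255, 'LossV': 39.43278}
Time: 24
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2090
Episodes: 99
EpisodeReturn: 11.0
AverageReturn: 20.9
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.23
OtherLogs: {'LossPi': -0.02254, 'LossV': 39.41768}
Time: 47
---------------------------------------
"""

blocks = parse_log_blocks(sample)
for block in blocks:
    print(block["Iterations"], block["EvalAverageReturn"], block["OtherLogs"]["LossV"])
```

Note that a log cut off in the middle of a value would still be parsed as whatever prefix remains, so truncated output should be checked by hand.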


#### Testing the Relation Between Policy LR and Value Function LR
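
The runs in this subsection pair a policy learning rate with a value-function learning rate and name each experiment after both values. The sketch below enumerates such pairs with `itertools.product` and builds names using the same f-string pattern as the cells here. The candidate lists and the `experiment_name` helper are assumptions made for illustration; they do not necessarily match the full set of pairs that was actually run.

```python
from itertools import product

# Illustrative candidate values (assumed for this sketch).
policy_lrs = [1e-5, 1e-4]
vf_lrs = [1e-5, 1e-4]


def experiment_name(algo, policy_lr, vf_lr):
    """Build a run name in the '<algo>_plr_<policy_lr>_vflr_<vf_lr>' pattern (hypothetical helper)."""
    return f"{algo}_plr_{policy_lr}_vflr_{vf_lr}"


# Every (policy_lr, vf_lr) combination, in row-major order.
grid = list(product(policy_lrs, vf_lrs))
names = [experiment_name("a2c", plr, vflr) for plr, vflr in grid]

for name in names:
    print(name)
```

Python formats `1e-5` as `1e-05` and `1e-4` as `0.0001`, which is why the Comet experiment names in this subsection mix the two notations.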

In [None]:
args = {
  'algo': 'a2c',
  'env': gym_env,
  'episode_bonus': None,
  'eval_num_episodes': eval_num_episodes,
  'eval_per_train': 50,
  'gpu_index': gpu_index,
  'iterations': episodes,
  'load': None,
  'max_steps': max_steps,
  'phase': 'train',
  'render': False,
  'render_in_train': False,
  'seed': seed,
  'step_bonus': None,
  'threshold_reward': 500,
}

args = Arguments(**args)

In [None]:
# Create an A2C agent with policy_lr=1e-5 and vf_lr=1e-5 and run it, logging to Comet
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-5, vf_lr=1e-5)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/b7171149023e486889b8a7ef732c6de9

---------------------------------------
Iterations: 50
Steps: 1056
Episodes: 49
EpisodeReturn: 9.0
AverageReturn: 21.12
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.26
OtherLogs: {'LossPi': 0.67846, 'LossV': 0.9978}
Time: 8
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2128
Episodes: 99
EpisodeReturn: 29.0
AverageReturn: 21.28
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.41
OtherLogs: {'LossPi': 0.65979, 'LossV': 0.977}
Time: 12
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 3091
Episodes: 149
EpisodeReturn: 23.0
AverageReturn: 20.61
EvalEpisodes: 100
EvalEpisodeReturn: 11.0
EvalAverageReturn: 9.24
OtherLogs: {'LossPi': 0.63523, 'LossV': 0.98718}
Time: 17
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 4119
Episodes: 199
EpisodeReturn: 22.0
AverageReturn: 20.59
EvalEpisodes: 100
EvalEpis

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/b7171149023e486889b8a7ef732c6de9
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (0.39267, 0.71208)
COMET INFO:     LossV [1000]            : (0.9728, 2.34529)
COMET INFO:     eval/avg_reward [20]    : (9.13, 9.75)
COMET INFO:     eval/avg_steps [20]     : (9.13, 9.75)
COMET INFO:     train/avg_reward [1000] : (17.6, 36.0)
COMET INFO:     train/avg_steps [1000]  : (17.6, 36.0)
COMET INFO:     train/reward [1000]     : (8.0, 68.0)
COMET INFO:     train/steps [1000]      : (8, 68)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_1e-05_vflr_1e-05
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameters:
COMET INFO:     _fields 

Test ended after 10 steps with reward 10.0
Found files: ['video/openaigym.video.1.61.video000000.mp4']
Uploading file: video/openaigym.video.1.61.video000000.mp4


COMET INFO:     filename            : 1
COMET INFO:     git metadata        : 1
COMET INFO:     installed packages  : 1
COMET INFO:     model-element [5]   : 5
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Uploading stats to Comet before program termination (may take several seconds)


In [None]:
# Create an A2C agent with policy_lr=1e-5 and vf_lr=1e-4 and run it, logging to Comet
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-5, vf_lr=1e-4)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/7f93f0e5d8c04790915ad2d93e9cda31

---------------------------------------
Iterations: 50
Steps: 1203
Episodes: 49
EpisodeReturn: 19.0
AverageReturn: 24.06
EvalEpisodes: 100
EvalEpisodeReturn: 18.0
EvalAverageReturn: 19.85
OtherLogs: {'LossPi': 0.39968, 'LossV': 2.30931}
Time: 5
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2542
Episodes: 99
EpisodeReturn: 19.0
AverageReturn: 25.42
EvalEpisodes: 100
EvalEpisodeReturn: 27.0
EvalAverageReturn: 25.36
OtherLogs: {'LossPi': 0.39482, 'LossV': 2.44054}
Time: 11
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 3662
Episodes: 149
EpisodeReturn: 18.0
AverageReturn: 24.41
EvalEpisodes: 100
EvalEpisodeReturn: 29.0
EvalAverageReturn: 21.56
OtherLogs: {'LossPi': 0.38425, 'LossV': 2.6602}
Time: 16
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 4974
Episodes: 199
EpisodeReturn: 12.0
AverageReturn: 24.87
EvalEpisodes: 100


COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/7f93f0e5d8c04790915ad2d93e9cda31
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (0.06941, 0.40032)
COMET INFO:     LossV [1000]            : (2.29929, 6.98914)
COMET INFO:     eval/avg_reward [20]    : (19.85, 470.79)
COMET INFO:     eval/avg_steps [20]     : (19.85, 470.79)
COMET INFO:     train/avg_reward [1000] : (15.0, 72.259)
COMET INFO:     train/avg_steps [1000]  : (15.0, 72.259)
COMET INFO:     train/reward [1000]     : (8.0, 500.0)
COMET INFO:     train/steps [1000]      : (8, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_1e-05_vflr_0.0001
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameters:
COMET INFO

Test ended after 500 steps with reward 500.0
Found files: ['video/openaigym.video.2.61.video000000.mp4']
Uploading file: video/openaigym.video.2.61.video000000.mp4


COMET INFO:     gpu_index           : 0
COMET INFO:     iterations          : 1000
COMET INFO:     load                : None
COMET INFO:     max_steps           : 10000
COMET INFO:     obs_dim             : 4
COMET INFO:     observations_to_use : [0, 1, 2, 3]
COMET INFO:     phase               : train
COMET INFO:     policy_lr           : 1e-05
COMET INFO:     render              : False
COMET INFO:     render_in_train     : False
COMET INFO:     seed                : 0
COMET INFO:     step_bonus          : None
COMET INFO:     threshold_reward    : 500
COMET INFO:     vf_lr               : 0.0001
COMET INFO:   Uploads [count]:
COMET INFO:     asset               : 1
COMET INFO:     code                : 1 (14 KB)
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     git metadata        : 1
COMET INFO:     installed packages  : 1
COMET INFO:     model-element [10]  : 10
COMET INFO:     notebook            : 1
COMET INFO:     os packages      

In [None]:
# Learning-rate sweep: A2C agent with policy_lr=1e-5 and vf_lr=1e-3
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-5, vf_lr=1e-3)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/7189652af811444ab389c3a50945d07f

---------------------------------------
Iterations: 50
Steps: 1005
Episodes: 49
EpisodeReturn: 27.0
AverageReturn: 20.1
EvalEpisodes: 100
EvalEpisodeReturn: 11.0
EvalAverageReturn: 10.93
OtherLogs: {'LossPi': 0.07482, 'LossV': 6.89651}
Time: 5
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2597
Episodes: 99
EpisodeReturn: 16.0
AverageReturn: 25.97
EvalEpisodes: 100
EvalEpisodeReturn: 28.0
EvalAverageReturn: 27.13
OtherLogs: {'LossPi': 0.07392, 'LossV': 6.95293}
Time: 13
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 3845
Episodes: 149
EpisodeReturn: 32.0
AverageReturn: 25.63
EvalEpisodes: 100
EvalEpisodeReturn: 40.0
EvalAverageReturn: 35.74
OtherLogs: {'LossPi': 0.0694, 'LossV': 7.05419}
Time: 20
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 5645
Episodes: 199
EpisodeReturn: 27.0
AverageReturn: 28.23
EvalEpisodes: 100
E

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/7189652af811444ab389c3a50945d07f
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.04391, 0.07502)
COMET INFO:     LossV [1000]            : (6.89225, 11.66455)
COMET INFO:     eval/avg_reward [20]    : (10.93, 500.0)
COMET INFO:     eval/avg_steps [20]     : (10.93, 500.0)
COMET INFO:     train/avg_reward [1000] : (9.0, 270.551)
COMET INFO:     train/avg_steps [1000]  : (9.0, 270.551)
COMET INFO:     train/reward [1000]     : (9.0, 500.0)
COMET INFO:     train/steps [1000]      : (9, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_1e-05_vflr_0.001
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameters:
COMET INFO:

Test ended after 500 steps with reward 500.0
Found files: ['video/openaigym.video.3.61.video000000.mp4']
Uploading file: video/openaigym.video.3.61.video000000.mp4


COMET INFO:     vf_lr               : 0.001
COMET INFO:   Uploads [count]:
COMET INFO:     asset               : 1
COMET INFO:     code                : 1 (14 KB)
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     git metadata        : 1
COMET INFO:     installed packages  : 1
COMET INFO:     model-element [15]  : 15
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Still uploading


In [None]:
# Learning-rate sweep: A2C agent with policy_lr=1e-5 and vf_lr=1e-2
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-5, vf_lr=1e-2)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/c4540ac2e33c4fda9d5cc3b316e8179a

---------------------------------------
Iterations: 50
Steps: 1224
Episodes: 49
EpisodeReturn: 19.0
AverageReturn: 24.48
EvalEpisodes: 100
EvalEpisodeReturn: 31.0
EvalAverageReturn: 32.78
OtherLogs: {'LossPi': -0.02582, 'LossV': 11.67032}
Time: 11
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2307
Episodes: 99
EpisodeReturn: 28.0
AverageReturn: 23.07
EvalEpisodes: 100
EvalEpisodeReturn: 32.0
EvalAverageReturn: 26.36
OtherLogs: {'LossPi': -0.02587, 'LossV': 11.6628}
Time: 22
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 3714
Episodes: 149
EpisodeReturn: 35.0
AverageReturn: 24.76
EvalEpisodes: 100
EvalEpisodeReturn: 59.0
EvalAverageReturn: 54.14
OtherLogs: {'LossPi': -0.02608, 'LossV': 11.64772}
Time: 34
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 5375
Episodes: 199
EpisodeReturn: 41.0
AverageReturn: 26.88
EvalEpisode

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/c4540ac2e33c4fda9d5cc3b316e8179a
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.03277, -0.0236)
COMET INFO:     LossV [1000]            : (11.60884, 13.42117)
COMET INFO:     eval/avg_reward [20]    : (26.36, 332.44)
COMET INFO:     eval/avg_steps [20]     : (26.36, 332.44)
COMET INFO:     train/avg_reward [1000] : (17.0, 86.55496264674493)
COMET INFO:     train/avg_steps [1000]  : (17.0, 86.55496264674493)
COMET INFO:     train/reward [1000]     : (9.0, 488.0)
COMET INFO:     train/steps [1000]      : (9, 488)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_1e-05_vflr_0.01
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:  

Test ended after 500 steps with reward 500.0
Found files: ['video/openaigym.video.4.61.video000000.mp4']
Uploading file: video/openaigym.video.4.61.video000000.mp4


COMET INFO:   Uploads [count]:
COMET INFO:     asset               : 1
COMET INFO:     code                : 1 (15 KB)
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     git metadata        : 1
COMET INFO:     installed packages  : 1
COMET INFO:     model-element [6]   : 6
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Still uploading
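
The progress blocks printed during training (`Iterations`, `Steps`, `EvalAverageReturn`, and so on) are easier to compare across runs once they are parsed into records. The following is a minimal sketch, assuming the plain `Key: value` layout shown in the output above; `parse_progress` is a hypothetical helper, not part of the notebook, and it deliberately skips the `OtherLogs` line because that line holds a dict rather than a single number. The sample text is copied from the first two blocks of the `vf_lr=1e-2` run.

```python
import re

# First two progress blocks of the policy_lr=1e-5, vf_lr=1e-2 run.
SAMPLE = """\
---------------------------------------
Iterations: 50
Steps: 1224
Episodes: 49
EpisodeReturn: 19.0
AverageReturn: 24.48
EvalEpisodes: 100
EvalEpisodeReturn: 31.0
EvalAverageReturn: 32.78
OtherLogs: {'LossPi': -0.02582, 'LossV': 11.67032}
Time: 11
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2307
Episodes: 99
EpisodeReturn: 28.0
AverageReturn: 23.07
EvalEpisodes: 100
EvalEpisodeReturn: 32.0
EvalAverageReturn: 26.36
OtherLogs: {'LossPi': -0.02587, 'LossV': 11.6628}
Time: 22
---------------------------------------
"""

# Matches lines such as "Steps: 1224" or "AverageReturn: 24.48".
NUMERIC_LINE = re.compile(r'^(\w+):\s*(-?\d+(?:\.\d+)?)$')


def parse_progress(text):
    """Return one dict of numeric fields per dashed block."""
    blocks, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if line and set(line) == {'-'}:  # dashed separator closes a block
            if current:
                blocks.append(current)
                current = {}
            continue
        match = NUMERIC_LINE.match(line)
        if match:  # non-numeric lines (e.g. OtherLogs) are ignored
            current[match.group(1)] = float(match.group(2))
    if current:  # keep a trailing block that has no closing separator
        blocks.append(current)
    return blocks


blocks = parse_progress(SAMPLE)
eval_curve = [(b['Iterations'], b['EvalAverageReturn']) for b in blocks]
```

Because truncated lines such as a bare `EvalEpisodes:` do not match the numeric pattern, a cut-off final block simply yields a record with fewer keys instead of raising an error.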


In [None]:
# Learning-rate sweep: A2C agent with policy_lr=1e-5 and vf_lr=1e-1
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-5, vf_lr=1e-1)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/571121c175c643a5b8415f5497b8aade

---------------------------------------
Iterations: 50
Steps: 1138
Episodes: 49
EpisodeReturn: 16.0
AverageReturn: 22.76
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.36
OtherLogs: {'LossPi': -0.02327, 'LossV': 13.44034}
Time: 11
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2185
Episodes: 99
EpisodeReturn: 31.0
AverageReturn: 21.85
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 10.13
OtherLogs: {'LossPi': -0.02321, 'LossV': 13.46718}
Time: 21
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 3425
Episodes: 149
EpisodeReturn: 15.0
AverageReturn: 22.83
EvalEpisodes: 100
EvalEpisodeReturn: 12.0
EvalAverageReturn: 12.25
OtherLogs: {'LossPi': -0.0232, 'LossV': 13.49253}
Time: 33
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 4793
Episodes: 199
EpisodeReturn: 30.0
AverageReturn: 23.96
EvalEpisodes:

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/571121c175c643a5b8415f5497b8aade
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.02369, -0.02245)
COMET INFO:     LossV [1000]            : (13.42364, 13.93437)
COMET INFO:     eval/avg_reward [20]    : (9.36, 49.65)
COMET INFO:     eval/avg_steps [20]     : (9.36, 49.65)
COMET INFO:     train/avg_reward [1000] : (9.0, 25.84583761562179)
COMET INFO:     train/avg_steps [1000]  : (9.0, 25.84583761562179)
COMET INFO:     train/reward [1000]     : (9.0, 124.0)
COMET INFO:     train/steps [1000]      : (9, 124)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_1e-05_vflr_0.1
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Param

Test ended after 39 steps with reward 39.0
Found files: ['video/openaigym.video.5.61.video000000.mp4']
Uploading file: video/openaigym.video.5.61.video000000.mp4


COMET INFO:     render              : False
COMET INFO:     render_in_train     : False
COMET INFO:     seed                : 0
COMET INFO:     step_bonus          : None
COMET INFO:     threshold_reward    : 500
COMET INFO:     vf_lr               : 0.1
COMET INFO:   Uploads [count]:
COMET INFO:     asset               : 1
COMET INFO:     code                : 1 (15 KB)
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     git metadata        : 1
COMET INFO:     installed packages  : 1
COMET INFO:     model-element [11]  : 11
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Still uploading
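
Each Comet experiment summary reports metrics as `name [count] : (min, max)`. To compare runs without opening every experiment page, those lines can be read back from the captured output. The sketch below is an illustration under that layout assumption only; `METRIC_RE` and `parse_metrics` are hypothetical names, not part of Comet's API, and the sample lines are copied from the summary of the `policy_lr=1e-5`, `vf_lr=0.1` run.

```python
import re

# Excerpt of the Comet summary printed for a2c_plr_1e-05_vflr_0.1.
SUMMARY = """\
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.02369, -0.02245)
COMET INFO:     LossV [1000]            : (13.42364, 13.93437)
COMET INFO:     eval/avg_reward [20]    : (9.36, 49.65)
COMET INFO:     train/reward [1000]     : (9.0, 124.0)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_1e-05_vflr_0.1
"""

# name [count] : (min, max), with an integer count and float bounds.
METRIC_RE = re.compile(
    r'^COMET INFO:\s+(?P<name>\S+)\s+\[(?P<count>\d+)\]\s*:\s*'
    r'\((?P<min>[-+.\deE]+),\s*(?P<max>[-+.\deE]+)\)\s*$'
)


def parse_metrics(text):
    """Map metric name -> (count, min, max) for every metric line found."""
    metrics = {}
    for line in text.splitlines():
        match = METRIC_RE.match(line)
        if match:  # header and "Others" lines do not match and are skipped
            metrics[match['name']] = (
                int(match['count']),
                float(match['min']),
                float(match['max']),
            )
    return metrics


metrics = parse_metrics(SUMMARY)
best_eval = metrics['eval/avg_reward'][2]  # maximum average evaluation reward
```

For this run `best_eval` is the 49.65 shown in the summary, far below the 500 threshold, which is consistent with the test episode ending after 39 steps.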


In [None]:
# Learning-rate sweep: A2C agent with policy_lr=1e-4 and vf_lr=1e-5
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-4, vf_lr=1e-5)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/e41ed570c5444c62bf6fe43a07c60b7d

---------------------------------------
Iterations: 50
Steps: 1076
Episodes: 49
EpisodeReturn: 11.0
AverageReturn: 21.52
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 10.0
OtherLogs: {'LossPi': -0.02091, 'LossV': 13.9049}
Time: 11
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 1793
Episodes: 99
EpisodeReturn: 25.0
AverageReturn: 17.93
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.29
OtherLogs: {'LossPi': -0.02009, 'LossV': 13.88531}
Time: 21
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 2468
Episodes: 149
EpisodeReturn: 13.0
AverageReturn: 16.45
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.26
OtherLogs: {'LossPi': -0.0194, 'LossV': 13.86707}
Time: 31
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 3115
Episodes: 199
EpisodeReturn: 12.0
AverageReturn: 15.57
EvalEpisodes: 100

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/e41ed570c5444c62bf6fe43a07c60b7d
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.02242, -0.0176)
COMET INFO:     LossV [1000]            : (13.66387, 13.93388)
COMET INFO:     eval/avg_reward [20]    : (9.19, 10.0)
COMET INFO:     eval/avg_steps [20]     : (9.19, 10.0)
COMET INFO:     train/avg_reward [1000] : (10.904714142427283, 24.95)
COMET INFO:     train/avg_steps [1000]  : (10.904714142427283, 24.95)
COMET INFO:     train/reward [1000]     : (8.0, 62.0)
COMET INFO:     train/steps [1000]      : (8, 62)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.0001_vflr_1e-05
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   P

Test ended after 11 steps with reward 11.0
Found files: ['video/openaigym.video.6.61.video000000.mp4']
Uploading file: video/openaigym.video.6.61.video000000.mp4


COMET INFO:     code                : 1 (15 KB)
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     git metadata        : 1
COMET INFO:     installed packages  : 1
COMET INFO:     model-element       : 1
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Still uploading


In [None]:
# Learning-rate sweep: A2C agent with policy_lr=1e-4 and vf_lr=1e-4
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-4, vf_lr=1e-4)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/8419ece95ccd4cb98e99974856b9e020

---------------------------------------
Iterations: 50
Steps: 925
Episodes: 49
EpisodeReturn: 10.0
AverageReturn: 18.5
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.31
OtherLogs: {'LossPi': -0.01665, 'LossV': 13.641}
Time: 11
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 1711
Episodes: 99
EpisodeReturn: 15.0
AverageReturn: 17.11
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.52
OtherLogs: {'LossPi': -0.01623, 'LossV': 13.62416}
Time: 21
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 2666
Episodes: 149
EpisodeReturn: 28.0
AverageReturn: 17.77
EvalEpisodes: 100
EvalEpisodeReturn: 13.0
EvalAverageReturn: 12.57
OtherLogs: {'LossPi': -0.01566, 'LossV': 13.60524}
Time: 32
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 3785
Episodes: 199
EpisodeReturn: 20.0
AverageReturn: 18.93
EvalEpisodes: 10

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/8419ece95ccd4cb98e99974856b9e020
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.01757, -0.00112)
COMET INFO:     LossV [1000]            : (13.10476, 13.67027)
COMET INFO:     eval/avg_reward [20]    : (9.31, 500.0)
COMET INFO:     eval/avg_steps [20]     : (9.31, 500.0)
COMET INFO:     train/avg_reward [1000] : (16.817391304347826, 111.846)
COMET INFO:     train/avg_steps [1000]  : (16.817391304347826, 111.846)
COMET INFO:     train/reward [1000]     : (9.0, 500.0)
COMET INFO:     train/steps [1000]      : (9, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.0001_vflr_0.0001
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET

Test ended after 500 steps with reward 500.0
Found files: ['video/openaigym.video.7.61.video000000.mp4']
Uploading file: video/openaigym.video.7.61.video000000.mp4


COMET INFO:     observations_to_use : [0, 1, 2, 3]
COMET INFO:     phase               : train
COMET INFO:     policy_lr           : 0.0001
COMET INFO:     render              : False
COMET INFO:     render_in_train     : False
COMET INFO:     seed                : 0
COMET INFO:     step_bonus          : None
COMET INFO:     threshold_reward    : 500
COMET INFO:     vf_lr               : 0.0001
COMET INFO:   Uploads [count]:
COMET INFO:     asset               : 1
COMET INFO:     code                : 1 (15 KB)
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     git metadata        : 1
COMET INFO:     installed packages  : 1
COMET INFO:     model-element [10]  : 10
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Still uploading


In [None]:
# Learning-rate sweep: A2C agent with policy_lr=1e-4 and vf_lr=1e-3
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-4, vf_lr=1e-3)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/f87ce3a1ba684510884d7640461ed815

---------------------------------------
Iterations: 50
Steps: 1314
Episodes: 49
EpisodeReturn: 45.0
AverageReturn: 26.28
EvalEpisodes: 100
EvalEpisodeReturn: 13.0
EvalAverageReturn: 16.35
OtherLogs: {'LossPi': -0.00495, 'LossV': 13.11366}
Time: 14
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2975
Episodes: 99
EpisodeReturn: 40.0
AverageReturn: 29.75
EvalEpisodes: 100
EvalEpisodeReturn: 50.0
EvalAverageReturn: 60.95
OtherLogs: {'LossPi': -0.00499, 'LossV': 13.10883}
Time: 31
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 8144
Episodes: 149
EpisodeReturn: 75.0
AverageReturn: 54.29
EvalEpisodes: 100
EvalEpisodeReturn: 74.0
EvalAverageReturn: 88.17
OtherLogs: {'LossPi': -0.00548, 'LossV': 13.07318}
Time: 60
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 15147
Episodes: 199
EpisodeReturn: 252.0
AverageReturn: 75.73
EvalEpis

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/f87ce3a1ba684510884d7640461ed815
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.01604, -0.00486)
COMET INFO:     LossV [1000]            : (12.93338, 13.42296)
COMET INFO:     eval/avg_reward [20]    : (16.35, 500.0)
COMET INFO:     eval/avg_steps [20]     : (16.35, 500.0)
COMET INFO:     train/avg_reward [1000] : (25.2, 337.974)
COMET INFO:     train/avg_steps [1000]  : (25.2, 337.974)
COMET INFO:     train/reward [1000]     : (11.0, 500.0)
COMET INFO:     train/steps [1000]      : (11, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.0001_vflr_0.001
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameters:
COME

Test ended after 500 steps with reward 500.0
Found files: ['video/openaigym.video.8.61.video000000.mp4']
Uploading file: video/openaigym.video.8.61.video000000.mp4


COMET INFO:     phase               : train
COMET INFO:     policy_lr           : 0.0001
COMET INFO:     render              : False
COMET INFO:     render_in_train     : False
COMET INFO:     seed                : 0
COMET INFO:     step_bonus          : None
COMET INFO:     threshold_reward    : 500
COMET INFO:     vf_lr               : 0.001
COMET INFO:   Uploads [count]:
COMET INFO:     asset               : 1
COMET INFO:     code                : 1 (15 KB)
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     git metadata        : 1
COMET INFO:     installed packages  : 1
COMET INFO:     model-element [15]  : 15
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Still uploading
COMET INFO: Waiting for completion of the file uploads (may take several seconds)
COMET INFO: Still uploading


In [None]:
# Learning-rate sweep: A2C agent with policy_lr=1e-4 and vf_lr=1e-2
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-4, vf_lr=1e-2)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/6e26c3101fda49cfb2a87a276cdc9319

---------------------------------------
Iterations: 50
Steps: 951
Episodes: 49
EpisodeReturn: 25.0
AverageReturn: 19.02
EvalEpisodes: 100
EvalEpisodeReturn: 13.0
EvalAverageReturn: 12.22
OtherLogs: {'LossPi': -0.01569, 'LossV': 13.34839}
Time: 17
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2535
Episodes: 99
EpisodeReturn: 23.0
AverageReturn: 25.35
EvalEpisodes: 100
EvalEpisodeReturn: 36.0
EvalAverageReturn: 54.85
OtherLogs: {'LossPi': -0.0157, 'LossV': 13.35136}
Time: 38
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 5308
Episodes: 149
EpisodeReturn: 52.0
AverageReturn: 35.39
EvalEpisodes: 100
EvalEpisodeReturn: 84.0
EvalAverageReturn: 79.29
OtherLogs: {'LossPi': -0.016, 'LossV': 13.34452}
Time: 65
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 9732
Episodes: 199
EpisodeReturn: 70.0
AverageReturn: 48.66
EvalEpisodes: 

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/6e26c3101fda49cfb2a87a276cdc9319
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.01665, -0.01526)
COMET INFO:     LossV [1000]            : (13.27847, 13.38028)
COMET INFO:     eval/avg_reward [20]    : (12.22, 109.86)
COMET INFO:     eval/avg_steps [20]     : (12.22, 109.86)
COMET INFO:     train/avg_reward [1000] : (10.0, 71.57680250783699)
COMET INFO:     train/avg_steps [1000]  : (10.0, 71.57680250783699)
COMET INFO:     train/reward [1000]     : (9.0, 350.0)
COMET INFO:     train/steps [1000]      : (9, 350)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.0001_vflr_0.01
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:

Test ended after 157 steps with reward 157.0
Found files: ['video/openaigym.video.9.61.video000000.mp4']
Uploading file: video/openaigym.video.9.61.video000000.mp4


COMET INFO:     seed                : 0
COMET INFO:     step_bonus          : None
COMET INFO:     threshold_reward    : 500
COMET INFO:     vf_lr               : 0.01
COMET INFO:   Uploads [count]:
COMET INFO:     asset               : 1
COMET INFO:     code                : 1 (16 KB)
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     git metadata        : 1
COMET INFO:     installed packages  : 1
COMET INFO:     model-element [5]   : 5
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Still uploading


In [None]:
# Learning-rate sweep: A2C agent with policy_lr=1e-4 and vf_lr=1e-1
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-4, vf_lr=1e-1)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/9a2cea7d52dc4e9aabf24248d39cf801

---------------------------------------
Iterations: 50
Steps: 1066
Episodes: 49
EpisodeReturn: 12.0
AverageReturn: 21.32
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.29
OtherLogs: {'LossPi': -0.01543, 'LossV': 13.40379}
Time: 18
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 1877
Episodes: 99
EpisodeReturn: 13.0
AverageReturn: 18.77
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.31
OtherLogs: {'LossPi': -0.01536, 'LossV': 13.40425}
Time: 36
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 2672
Episodes: 149
EpisodeReturn: 16.0
AverageReturn: 17.81
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.39
OtherLogs: {'LossPi': -0.01533, 'LossV': 13.40565}
Time: 54
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 3781
Episodes: 199
EpisodeReturn: 22.0
AverageReturn: 18.91
EvalEpisodes: 

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/9a2cea7d52dc4e9aabf24248d39cf801
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.01545, -0.01513)
COMET INFO:     LossV [1000]            : (13.39188, 13.42688)
COMET INFO:     eval/avg_reward [20]    : (9.17, 9.47)
COMET INFO:     eval/avg_steps [20]     : (9.17, 9.47)
COMET INFO:     train/avg_reward [1000] : (14.852, 33.0)
COMET INFO:     train/avg_steps [1000]  : (14.852, 33.0)
COMET INFO:     train/reward [1000]     : (8.0, 88.0)
COMET INFO:     train/steps [1000]      : (8, 88)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.0001_vflr_0.1
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameters:
COMET INFO:     

Test ended after 9 steps with reward 9.0
Found files: ['video/openaigym.video.10.61.video000000.mp4']
Uploading file: video/openaigym.video.10.61.video000000.mp4


COMET INFO:     model-element [6]   : 6
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Still uploading


In [None]:
# Learning-rate sweep: A2C agent with policy_lr=1e-3 and vf_lr=1e-5
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-3, vf_lr=1e-5)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/0836f42938cc4feeac4c20318342a684

---------------------------------------
Iterations: 50
Steps: 746
Episodes: 49
EpisodeReturn: 11.0
AverageReturn: 14.92
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.29
OtherLogs: {'LossPi': -0.0148, 'LossV': 13.40364}
Time: 18
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 1251
Episodes: 99
EpisodeReturn: 9.0
AverageReturn: 12.51
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.45
OtherLogs: {'LossPi': -0.01468, 'LossV': 13.39736}
Time: 35
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 1751
Episodes: 149
EpisodeReturn: 11.0
AverageReturn: 11.67
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 11.46
OtherLogs: {'LossPi': -0.01459, 'LossV': 13.39121}
Time: 52
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 2302
Episodes: 199
EpisodeReturn: 10.0
AverageReturn: 11.51
EvalEpisodes: 100

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/0836f42938cc4feeac4c20318342a684
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.01517, -0.01439)
COMET INFO:     LossV [1000]            : (13.30492, 13.41262)
COMET INFO:     eval/avg_reward [20]    : (9.2, 11.46)
COMET INFO:     eval/avg_steps [20]     : (9.2, 11.46)
COMET INFO:     train/avg_reward [1000] : (9.833, 29.666666666666668)
COMET INFO:     train/avg_steps [1000]  : (9.833, 29.666666666666668)
COMET INFO:     train/reward [1000]     : (8.0, 36.0)
COMET INFO:     train/steps [1000]      : (8, 36)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.001_vflr_1e-05
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   P

Test ended after 11 steps with reward 11.0
Found files: ['video/openaigym.video.11.61.video000000.mp4']
Uploading file: video/openaigym.video.11.61.video000000.mp4


COMET INFO: Still uploading


In [None]:
# Learning-rate sweep: A2C agent with policy_lr=1e-3 and vf_lr=1e-4
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-3, vf_lr=1e-4)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/f2bf0ff154684df8878d19875ee3ee91

---------------------------------------
Iterations: 50
Steps: 515
Episodes: 49
EpisodeReturn: 11.0
AverageReturn: 10.3
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.26
OtherLogs: {'LossPi': -0.01433, 'LossV': 13.29901}
Time: 17
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 985
Episodes: 99
EpisodeReturn: 10.0
AverageReturn: 9.85
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.31
OtherLogs: {'LossPi': -0.01432, 'LossV': 13.29446}
Time: 34
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 1462
Episodes: 149
EpisodeReturn: 10.0
AverageReturn: 9.75
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.28
OtherLogs: {'LossPi': -0.01432, 'LossV': 13.29012}
Time: 51
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 1929
Episodes: 199
EpisodeReturn: 10.0
AverageReturn: 9.64
EvalEpisodes: 100
Ev

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/f2bf0ff154684df8878d19875ee3ee91
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.01439, -0.0142)
COMET INFO:     LossV [1000]            : (13.18932, 13.30476)
COMET INFO:     eval/avg_reward [20]    : (9.24, 9.41)
COMET INFO:     eval/avg_steps [20]     : (9.24, 9.41)
COMET INFO:     train/avg_reward [1000] : (9.424335378323109, 18.25)
COMET INFO:     train/avg_steps [1000]  : (9.424335378323109, 18.25)
COMET INFO:     train/reward [1000]     : (8.0, 26.0)
COMET INFO:     train/steps [1000]      : (8, 26)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.001_vflr_0.0001
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Par

Test ended after 9 steps with reward 9.0
Found files: ['video/openaigym.video.12.61.video000000.mp4']
Uploading file: video/openaigym.video.12.61.video000000.mp4


COMET INFO: Still uploading
COMET INFO: Waiting for completion of the file uploads (may take several seconds)


In [None]:
# Learning-rate sweep: A2C agent with policy_lr=1e-3 and vf_lr=1e-3
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-3, vf_lr=1e-3)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/5a64d210ab654d468e15bdc9b9d3540b

---------------------------------------
Iterations: 50
Steps: 760
Episodes: 49
EpisodeReturn: 13.0
AverageReturn: 15.2
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 10.91
OtherLogs: {'LossPi': -0.01411, 'LossV': 13.18405}
Time: 18
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 1990
Episodes: 99
EpisodeReturn: 44.0
AverageReturn: 19.9
EvalEpisodes: 100
EvalEpisodeReturn: 82.0
EvalAverageReturn: 55.92
OtherLogs: {'LossPi': -0.01403, 'LossV': 13.17445}
Time: 40
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 6278
Episodes: 149
EpisodeReturn: 91.0
AverageReturn: 41.85
EvalEpisodes: 100
EvalEpisodeReturn: 500.0
EvalAverageReturn: 297.84
OtherLogs: {'LossPi': -0.01392, 'LossV': 13.15894}
Time: 81
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 12339
Episodes: 199
EpisodeReturn: 127.0
AverageReturn: 61.7
EvalEpisod

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/5a64d210ab654d468e15bdc9b9d3540b
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.01686, -0.01388)
COMET INFO:     LossV [1000]            : (12.62613, 13.1891)
COMET INFO:     eval/avg_reward [20]    : (10.91, 500.0)
COMET INFO:     eval/avg_steps [20]     : (10.91, 500.0)
COMET INFO:     train/avg_reward [1000] : (13.071428571428571, 253.25294117647059)
COMET INFO:     train/avg_steps [1000]  : (13.071428571428571, 253.25294117647059)
COMET INFO:     train/reward [1000]     : (9.0, 500.0)
COMET INFO:     train/steps [1000]      : (9, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.001_vflr_0.001
COMET INFO:     total params     : 9155
COMET INFO:     trainabl

Test ended after 500 steps with reward 500.0
Found files: ['video/openaigym.video.13.61.video000000.mp4']
Uploading file: video/openaigym.video.13.61.video000000.mp4


COMET INFO:     threshold_reward    : 500
COMET INFO:     vf_lr               : 0.001
COMET INFO:   Uploads [count]:
COMET INFO:     asset               : 1
COMET INFO:     code                : 1 (17 KB)
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     git metadata        : 1
COMET INFO:     installed packages  : 1
COMET INFO:     model-element [8]   : 8
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Still uploading
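
Each run prints a progress block every 50 iterations, and the line most useful for comparing configurations is `EvalAverageReturn`. The sketch below shows one way to reduce such printed output to an evaluation curve. The function `parse_eval_curve` is hypothetical and not part of the notebook. The sample text is copied from the `policy_lr=1e-3`, `vf_lr=1e-3` run above, including its truncated last block, which the parser skips because it has no `EvalAverageReturn` line.

```python
SAMPLE_LOG = """\
---------------------------------------
Iterations: 100
Steps: 1990
Episodes: 99
EpisodeReturn: 44.0
AverageReturn: 19.9
EvalEpisodes: 100
EvalEpisodeReturn: 82.0
EvalAverageReturn: 55.92
OtherLogs: {'LossPi': -0.01403, 'LossV': 13.17445}
Time: 40
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 6278
Episodes: 149
EpisodeReturn: 91.0
AverageReturn: 41.85
EvalEpisodes: 100
EvalEpisodeReturn: 500.0
EvalAverageReturn: 297.84
OtherLogs: {'LossPi': -0.01392, 'LossV': 13.15894}
Time: 81
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 12339
Episodes: 199
EpisodeReturn: 127.0
AverageReturn: 61.7
"""


def parse_eval_curve(log_text):
    """Return [(iteration, eval_average_return), ...] from printed progress blocks.

    Hypothetical helper: blocks that were cut off before their
    EvalAverageReturn line contribute nothing to the result.
    """
    curve = []
    iteration = None
    for line in log_text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # separator lines and cut-off fragments
        if key == "Iterations":
            iteration = int(value)
        elif key == "EvalAverageReturn" and iteration is not None:
            curve.append((iteration, float(value)))
            iteration = None
    return curve


print(parse_eval_curve(SAMPLE_LOG))
```

The same values are also recorded in Comet as `eval/avg_reward`, so this kind of parsing is only a fallback for when the printed output is all that is at hand.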


In [None]:
# Learning-rate sweep: A2C agent with policy_lr=1e-3 and vf_lr=1e-2
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-3, vf_lr=1e-2)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/a20a2170e22a4f658fe2d17d2ef5f036

---------------------------------------
Iterations: 50
Steps: 957
Episodes: 49
EpisodeReturn: 11.0
AverageReturn: 19.14
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.29
OtherLogs: {'LossPi': -0.01662, 'LossV': 12.64084}
Time: 22
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 1891
Episodes: 99
EpisodeReturn: 9.0
AverageReturn: 18.91
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.23
OtherLogs: {'LossPi': -0.01662, 'LossV': 12.64148}
Time: 44
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 2550
Episodes: 149
EpisodeReturn: 16.0
AverageReturn: 17.0
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.44
OtherLogs: {'LossPi': -0.01659, 'LossV': 12.63976}
Time: 65
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 3107
Episodes: 199
EpisodeReturn: 10.0
AverageReturn: 15.54
EvalEpisodes: 100


COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/a20a2170e22a4f658fe2d17d2ef5f036
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.0167, -0.01632)
COMET INFO:     LossV [1000]            : (12.62403, 12.64181)
COMET INFO:     eval/avg_reward [20]    : (9.23, 88.69)
COMET INFO:     eval/avg_steps [20]     : (9.23, 88.69)
COMET INFO:     train/avg_reward [1000] : (13.0, 25.483164983164983)
COMET INFO:     train/avg_steps [1000]  : (13.0, 25.483164983164983)
COMET INFO:     train/reward [1000]     : (8.0, 218.0)
COMET INFO:     train/steps [1000]      : (8, 218)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.001_vflr_0.01
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   P

Test ended after 65 steps with reward 65.0
Found files: ['video/openaigym.video.14.61.video000000.mp4']
Uploading file: video/openaigym.video.14.61.video000000.mp4


COMET INFO:     git metadata        : 1
COMET INFO:     installed packages  : 1
COMET INFO:     model-element [5]   : 5
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Still uploading


In [None]:
# Learning-rate sweep: A2C agent with policy_lr=1e-3 and vf_lr=1e-1
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-3, vf_lr=1e-1)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/35ded586d88a40d58a820c2e00379b50

---------------------------------------
Iterations: 50
Steps: 714
Episodes: 49
EpisodeReturn: 10.0
AverageReturn: 14.28
EvalEpisodes: 100
EvalEpisodeReturn: 8.0
EvalAverageReturn: 9.36
OtherLogs: {'LossPi': -0.01653, 'LossV': 12.64597}
Time: 21
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 1187
Episodes: 99
EpisodeReturn: 10.0
AverageReturn: 11.87
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.44
OtherLogs: {'LossPi': -0.01655, 'LossV': 12.64392}
Time: 42
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 1658
Episodes: 149
EpisodeReturn: 10.0
AverageReturn: 11.05
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.48
OtherLogs: {'LossPi': -0.01655, 'LossV': 12.64191}
Time: 63
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 2124
Episodes: 199
EpisodeReturn: 9.0
AverageReturn: 10.62
EvalEpisodes: 10

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/35ded586d88a40d58a820c2e00379b50
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.01657, -0.01612)
COMET INFO:     LossV [1000]            : (12.62868, 12.6965)
COMET INFO:     eval/avg_reward [20]    : (9.19, 38.95)
COMET INFO:     eval/avg_steps [20]     : (9.19, 38.95)
COMET INFO:     train/avg_reward [1000] : (10.11484593837535, 17.666666666666668)
COMET INFO:     train/avg_steps [1000]  : (10.11484593837535, 17.666666666666668)
COMET INFO:     train/reward [1000]     : (8.0, 99.0)
COMET INFO:     train/steps [1000]      : (8, 99)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.001_vflr_0.1
COMET INFO:     total params     : 9155
COMET INFO:     trainable params

Test ended after 47 steps with reward 47.0
Found files: ['video/openaigym.video.15.61.video000000.mp4']
Uploading file: video/openaigym.video.15.61.video000000.mp4


COMET INFO:     render              : False
COMET INFO:     render_in_train     : False
COMET INFO:     seed                : 0
COMET INFO:     step_bonus          : None
COMET INFO:     threshold_reward    : 500
COMET INFO:     vf_lr               : 0.1
COMET INFO:   Uploads [count]:
COMET INFO:     asset               : 1
COMET INFO:     code                : 1 (17 KB)
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     git metadata        : 1
COMET INFO:     installed packages  : 1
COMET INFO:     model-element [7]   : 7
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Still uploading


In [None]:
# Learning-rate sweep: A2C agent with policy_lr=1e-2 and vf_lr=1e-5
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-2, vf_lr=1e-5)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/2b500571ac0e4b59a7c83763cccc765b

---------------------------------------
Iterations: 50
Steps: 467
Episodes: 49
EpisodeReturn: 10.0
AverageReturn: 9.34
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.59
OtherLogs: {'LossPi': -0.01614, 'LossV': 12.68906}
Time: 21
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 936
Episodes: 99
EpisodeReturn: 8.0
AverageReturn: 9.36
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.31
OtherLogs: {'LossPi': -0.01613, 'LossV': 12.68479}
Time: 42
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 1402
Episodes: 149
EpisodeReturn: 10.0
AverageReturn: 9.35
EvalEpisodes: 100
EvalEpisodeReturn: 8.0
EvalAverageReturn: 9.35
OtherLogs: {'LossPi': -0.01613, 'LossV': 12.68057}
Time: 64
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 1873
Episodes: 199
EpisodeReturn: 10.0
AverageReturn: 9.37
EvalEpisodes: 100
Eval

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/2b500571ac0e4b59a7c83763cccc765b
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.01615, -0.01603)
COMET INFO:     LossV [1000]            : (12.6169, 12.69323)
COMET INFO:     eval/avg_reward [20]    : (9.25, 9.59)
COMET INFO:     eval/avg_steps [20]     : (9.25, 9.59)
COMET INFO:     train/avg_reward [1000] : (8.666666666666666, 9.45945945945946)
COMET INFO:     train/avg_steps [1000]  : (8.666666666666666, 9.45945945945946)
COMET INFO:     train/reward [1000]     : (8.0, 11.0)
COMET INFO:     train/steps [1000]      : (8, 11)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.01_vflr_1e-05
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 91

Test ended after 9 steps with reward 9.0
Found files: ['video/openaigym.video.16.61.video000000.mp4']
Uploading file: video/openaigym.video.16.61.video000000.mp4


COMET INFO: Still uploading


In [None]:
# Learning-rate sweep: A2C agent with policy_lr=1e-2 and vf_lr=1e-4
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-2, vf_lr=1e-4)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/aa5d480b98b44e8d848541d452985bb5

---------------------------------------
Iterations: 50
Steps: 466
Episodes: 49
EpisodeReturn: 9.0
AverageReturn: 9.32
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.5
OtherLogs: {'LossPi': -0.01602, 'LossV': 12.6129}
Time: 21
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 941
Episodes: 99
EpisodeReturn: 11.0
AverageReturn: 9.41
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.22
OtherLogs: {'LossPi': -0.01602, 'LossV': 12.60939}
Time: 42
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 1413
Episodes: 149
EpisodeReturn: 10.0
AverageReturn: 9.42
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.28
OtherLogs: {'LossPi': -0.01601, 'LossV': 12.6061}
Time: 63
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 1878
Episodes: 199
EpisodeReturn: 9.0
AverageReturn: 9.39
EvalEpisodes: 100
EvalEpis

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/aa5d480b98b44e8d848541d452985bb5
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.01603, -0.01591)
COMET INFO:     LossV [1000]            : (12.53106, 12.61682)
COMET INFO:     eval/avg_reward [20]    : (9.22, 9.53)
COMET INFO:     eval/avg_steps [20]     : (9.22, 9.53)
COMET INFO:     train/avg_reward [1000] : (9.0, 9.5)
COMET INFO:     train/avg_steps [1000]  : (9.0, 9.5)
COMET INFO:     train/reward [1000]     : (8.0, 11.0)
COMET INFO:     train/steps [1000]      : (8, 11)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.01_vflr_0.0001
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameters:
COMET INFO:     _fields

Test ended after 9 steps with reward 9.0
Found files: ['video/openaigym.video.17.61.video000000.mp4']
Uploading file: video/openaigym.video.17.61.video000000.mp4


COMET INFO: Still uploading


In [None]:
# Learning-rate sweep: A2C agent with policy_lr=1e-2 and vf_lr=1e-3
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-2, vf_lr=1e-3)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/4f8d46777eac4853a6d61bebf9e975d9

---------------------------------------
Iterations: 50
Steps: 471
Episodes: 49
EpisodeReturn: 10.0
AverageReturn: 9.42
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.37
OtherLogs: {'LossPi': -0.0159, 'LossV': 12.52735}
Time: 21
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 941
Episodes: 99
EpisodeReturn: 10.0
AverageReturn: 9.41
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.31
OtherLogs: {'LossPi': -0.0159, 'LossV': 12.52291}
Time: 42
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 1411
Episodes: 149
EpisodeReturn: 9.0
AverageReturn: 9.41
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.3
OtherLogs: {'LossPi': -0.01589, 'LossV': 12.51843}
Time: 64
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 1879
Episodes: 199
EpisodeReturn: 10.0
AverageReturn: 9.39
EvalEpisodes: 100
EvalEpi

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/4f8d46777eac4853a6d61bebf9e975d9
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.01591, -0.01578)
COMET INFO:     LossV [1000]            : (12.44456, 12.53097)
COMET INFO:     eval/avg_reward [20]    : (9.24, 9.52)
COMET INFO:     eval/avg_steps [20]     : (9.24, 9.52)
COMET INFO:     train/avg_reward [1000] : (9.0, 9.633333333333333)
COMET INFO:     train/avg_steps [1000]  : (9.0, 9.633333333333333)
COMET INFO:     train/reward [1000]     : (8.0, 22.0)
COMET INFO:     train/steps [1000]      : (8, 22)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.01_vflr_0.001
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Paramete

Test ended after 10 steps with reward 10.0
Found files: ['video/openaigym.video.18.61.video000000.mp4']
Uploading file: video/openaigym.video.18.61.video000000.mp4


COMET INFO: Still uploading
COMET INFO: Waiting for completion of the file uploads (may take several seconds)
COMET INFO: Still uploading


In [None]:
# Create an agent
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-2, vf_lr=1e-2)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/0097df50caf44bf994dec0cfc7af709c

from operator import itemgetter as _itemgetter
from collections import OrderedDict

class Arguments(tuple):
    'Arguments(algo, env, episode_bonus, eval_num_episodes, eval_per_train, gpu_index, iterations, load, max_steps, phase, render, render_in_train, seed, step_bonus, threshold_reward)'

    __slots__ = ()

    _fields = ('algo', 'env', 'episode_bonus', 'eval_num_episodes', 'eval_per_train', 'gpu_index', 'iterations', 'load', 'max_steps', 'phase', 'render', 'render_in_train', 'seed', 'step_bonus', 'threshold_reward')

    def __new__(_cls, algo, env, episode_bonus, eval_num_episodes, eval_per_train, gpu_index, iterations, load, max_steps, phase, render, render_in_train, seed, step_bonus, threshold_reward):
        'Create new instance of Arguments(algo, env, episode_bonus, eval_num_episodes, eval_per_train, gpu_index, iterations, load, max_steps, phase, render

---------------------------------------
Iterations: 50
Steps: 472
Episodes: 49
EpisodeReturn: 9.0
AverageReturn: 9.44
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.27
OtherLogs: {'LossPi': -0.01577, 'LossV': 12.44203}
Time: 21
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 945
Episodes: 99
EpisodeReturn: 10.0
AverageReturn: 9.45
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.47
OtherLogs: {'LossPi': -0.01576, 'LossV': 12.43889}
Time: 43
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 1409
Episodes: 149
EpisodeReturn: 9.0
AverageReturn: 9.39
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.48
OtherLogs: {'LossPi': -0.01576, 'LossV': 12.43616}
Time: 64
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 1880
Episodes: 199
EpisodeReturn: 9.0
AverageReturn: 9.4
EvalEpisodes: 100
EvalE

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/0097df50caf44bf994dec0cfc7af709c
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.01578, -0.01566)
COMET INFO:     LossV [1000]            : (12.37189, 12.44451)
COMET INFO:     eval/avg_reward [20]    : (9.27, 9.48)
COMET INFO:     eval/avg_steps [20]     : (9.27, 9.48)
COMET INFO:     train/avg_reward [1000] : (9.0, 9.75)
COMET INFO:     train/avg_steps [1000]  : (9.0, 9.75)
COMET INFO:     train/reward [1000]     : (8.0, 11.0)
COMET INFO:     train/steps [1000]      : (8, 11)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.01_vflr_0.01
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameters:
COMET INFO:     _fields

Test ended after 10 steps with reward 10.0
Found files: ['video/openaigym.video.19.61.video000000.mp4']
Uploading file: video/openaigym.video.19.61.video000000.mp4


COMET INFO: Still uploading


In [None]:
# Create an agent
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-2, vf_lr=1e-1)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/c296e04ca5a4468eaf6920a25817e42d

---------------------------------------
Iterations: 50
Steps: 463
Episodes: 49
EpisodeReturn: 9.0
AverageReturn: 9.26
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.42
OtherLogs: {'LossPi': -0.01566, 'LossV': 12.37199}
Time: 22
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 930
Episodes: 99
EpisodeReturn: 9.0
AverageReturn: 9.3
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.39
OtherLogs: {'LossPi': -0.01565, 'LossV': 12.3702}
Time: 43
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 1401
Episodes: 149
EpisodeReturn: 11.0
AverageReturn: 9.34
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.29
OtherLogs: {'LossPi': -0.01565, 'LossV': 12.3684}
Time: 65
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 1869
Episodes: 199
EpisodeReturn: 10.0
AverageReturn: 9.35
EvalEpisodes: 100
EvalEpis

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/c296e04ca5a4468eaf6920a25817e42d
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.01566, -0.01555)
COMET INFO:     LossV [1000]            : (12.33818, 12.37381)
COMET INFO:     eval/avg_reward [20]    : (9.12, 9.51)
COMET INFO:     eval/avg_steps [20]     : (9.12, 9.51)
COMET INFO:     train/avg_reward [1000] : (9.0, 9.5)
COMET INFO:     train/avg_steps [1000]  : (9.0, 9.5)
COMET INFO:     train/reward [1000]     : (8.0, 11.0)
COMET INFO:     train/steps [1000]      : (8, 11)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.01_vflr_0.1
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameters:
COMET INFO:     _fields   

Test ended after 10 steps with reward 10.0
Found files: ['video/openaigym.video.20.61.video000000.mp4']
Uploading file: video/openaigym.video.20.61.video000000.mp4


COMET INFO: Still uploading
COMET INFO: Waiting for completion of the file uploads (may take several seconds)
COMET INFO: Still uploading


In [None]:
# Create an agent
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-1, vf_lr=1e-5)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/cbf9f7e898f64bcfa2be5294c0c15279

---------------------------------------
Iterations: 50
Steps: 467
Episodes: 49
EpisodeReturn: 10.0
AverageReturn: 9.34
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.38
OtherLogs: {'LossPi': -0.01555, 'LossV': 12.33419}
Time: 22
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 932
Episodes: 99
EpisodeReturn: 10.0
AverageReturn: 9.32
EvalEpisodes: 100
EvalEpisodeReturn: 8.0
EvalAverageReturn: 9.3
OtherLogs: {'LossPi': -0.01554, 'LossV': 12.33024}
Time: 43
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 1403
Episodes: 149
EpisodeReturn: 10.0
AverageReturn: 9.35
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.33
OtherLogs: {'LossPi': -0.01554, 'LossV': 12.32627}
Time: 65
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 1870
Episodes: 199
EpisodeReturn: 10.0
AverageReturn: 9.35
EvalEpisodes: 100
Eval

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/cbf9f7e898f64bcfa2be5294c0c15279
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.01555, -0.01544)
COMET INFO:     LossV [1000]            : (12.26842, 12.3381)
COMET INFO:     eval/avg_reward [20]    : (9.18, 9.52)
COMET INFO:     eval/avg_steps [20]     : (9.18, 9.52)
COMET INFO:     train/avg_reward [1000] : (8.8, 9.397058823529411)
COMET INFO:     train/avg_steps [1000]  : (8.8, 9.397058823529411)
COMET INFO:     train/reward [1000]     : (8.0, 11.0)
COMET INFO:     train/steps [1000]      : (8, 11)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.1_vflr_1e-05
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameters

Test ended after 10 steps with reward 10.0
Found files: ['video/openaigym.video.21.61.video000000.mp4']
Uploading file: video/openaigym.video.21.61.video000000.mp4


COMET INFO: Still uploading


In [None]:
# Create an agent
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-1, vf_lr=1e-4)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/ec0733caf0c74cfc85ea254041868ec4

---------------------------------------
Iterations: 50
Steps: 470
Episodes: 49
EpisodeReturn: 8.0
AverageReturn: 9.4
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.44
OtherLogs: {'LossPi': -0.01544, 'LossV': 12.2647}
Time: 22
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 935
Episodes: 99
EpisodeReturn: 8.0
AverageReturn: 9.35
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.2
OtherLogs: {'LossPi': -0.01543, 'LossV': 12.26164}
Time: 44
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 1411
Episodes: 149
EpisodeReturn: 10.0
AverageReturn: 9.41
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.21
OtherLogs: {'LossPi': -0.01543, 'LossV': 12.25882}
Time: 66
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 1883
Episodes: 199
EpisodeReturn: 9.0
AverageReturn: 9.41
EvalEpisodes: 100
EvalEpis

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/ec0733caf0c74cfc85ea254041868ec4
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.01544, -0.01533)
COMET INFO:     LossV [1000]            : (12.19392, 12.26835)
COMET INFO:     eval/avg_reward [20]    : (9.2, 9.44)
COMET INFO:     eval/avg_steps [20]     : (9.2, 9.44)
COMET INFO:     train/avg_reward [1000] : (9.0, 9.5)
COMET INFO:     train/avg_steps [1000]  : (9.0, 9.5)
COMET INFO:     train/reward [1000]     : (8.0, 11.0)
COMET INFO:     train/steps [1000]      : (8, 11)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.1_vflr_0.0001
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameters:
COMET INFO:     _fields   

Test ended after 9 steps with reward 9.0
Found files: ['video/openaigym.video.22.61.video000000.mp4']
Uploading file: video/openaigym.video.22.61.video000000.mp4


COMET INFO: Still uploading
COMET INFO: Waiting for completion of the file uploads (may take several seconds)
COMET INFO: Still uploading


In [None]:
# Create an agent
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-1, vf_lr=1e-3)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/49e7344053dc43beabeda6b3c66569c6

---------------------------------------
Iterations: 50
Steps: 472
Episodes: 49
EpisodeReturn: 9.0
AverageReturn: 9.44
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.33
OtherLogs: {'LossPi': -0.01533, 'LossV': 12.19014}
Time: 24
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 941
Episodes: 99
EpisodeReturn: 10.0
AverageReturn: 9.41
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.4
OtherLogs: {'LossPi': -0.01532, 'LossV': 12.18595}
Time: 48
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 1418
Episodes: 149
EpisodeReturn: 8.0
AverageReturn: 9.45
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.33
OtherLogs: {'LossPi': -0.01532, 'LossV': 12.18168}
Time: 72
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 1890
Episodes: 199
EpisodeReturn: 8.0
AverageReturn: 9.45
EvalEpisodes: 100
EvalEpi

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/49e7344053dc43beabeda6b3c66569c6
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.01533, -0.01523)
COMET INFO:     LossV [1000]            : (12.11081, 12.19384)
COMET INFO:     eval/avg_reward [20]    : (9.14, 9.45)
COMET INFO:     eval/avg_steps [20]     : (9.14, 9.45)
COMET INFO:     train/avg_reward [1000] : (9.0, 9.6875)
COMET INFO:     train/avg_steps [1000]  : (9.0, 9.6875)
COMET INFO:     train/reward [1000]     : (8.0, 11.0)
COMET INFO:     train/steps [1000]      : (8, 11)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.1_vflr_0.001
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameters:
COMET INFO:     _fi

Test ended after 10 steps with reward 10.0
Found files: ['video/openaigym.video.23.61.video000000.mp4']
Uploading file: video/openaigym.video.23.61.video000000.mp4


COMET INFO: Still uploading


In [None]:
# Create an agent
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-1, vf_lr=1e-2)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/4569c1604df045b3ab2cc7b6f57ea4c2

---------------------------------------
Iterations: 50
Steps: 474
Episodes: 49
EpisodeReturn: 10.0
AverageReturn: 9.48
EvalEpisodes: 100
EvalEpisodeReturn: 11.0
EvalAverageReturn: 9.44
OtherLogs: {'LossPi': -0.01522, 'LossV': 12.10781}
Time: 24
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 944
Episodes: 99
EpisodeReturn: 10.0
AverageReturn: 9.44
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.33
OtherLogs: {'LossPi': -0.01522, 'LossV': 12.10423}
Time: 47
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 1411
Episodes: 149
EpisodeReturn: 10.0
AverageReturn: 9.41
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.33
OtherLogs: {'LossPi': -0.01521, 'LossV': 12.10057}
Time: 70
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 1880
Episodes: 199
EpisodeReturn: 9.0
AverageReturn: 9.4
EvalEpisodes: 100
Eva

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/4569c1604df045b3ab2cc7b6f57ea4c2
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.01523, -0.01512)
COMET INFO:     LossV [1000]            : (12.05732, 12.11077)
COMET INFO:     eval/avg_reward [20]    : (9.26, 9.45)
COMET INFO:     eval/avg_steps [20]     : (9.26, 9.45)
COMET INFO:     train/avg_reward [1000] : (8.0, 9.538461538461538)
COMET INFO:     train/avg_steps [1000]  : (8.0, 9.538461538461538)
COMET INFO:     train/reward [1000]     : (8.0, 11.0)
COMET INFO:     train/steps [1000]      : (8, 11)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.1_vflr_0.01
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameters

Test ended after 9 steps with reward 9.0
Found files: ['video/openaigym.video.24.61.video000000.mp4']
Uploading file: video/openaigym.video.24.61.video000000.mp4


COMET INFO: Still uploading


In [None]:
# Create an agent
agent = A2CAgent(env, args, device, obs_dim, act_num, policy_lr=1e-1, vf_lr=1e-1)
run_agent(agent, args, tags=['lr_relation'], experiment_name=f'{args.algo}_plr_{agent.policy_lr}_vflr_{agent.vf_lr}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/ee6ad172ab334b029344f6b55f1f81a1

---------------------------------------
Iterations: 50
Steps: 471
Episodes: 49
EpisodeReturn: 10.0
AverageReturn: 9.42
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.4
OtherLogs: {'LossPi': -0.01512, 'LossV': 12.05804}
Time: 22
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 935
Episodes: 99
EpisodeReturn: 8.0
AverageReturn: 9.35
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.45
OtherLogs: {'LossPi': -0.01511, 'LossV': 12.0564}
Time: 45
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 1403
Episodes: 149
EpisodeReturn: 10.0
AverageReturn: 9.35
EvalEpisodes: 100
EvalEpisodeReturn: 8.0
EvalAverageReturn: 9.38
OtherLogs: {'LossPi': -0.01511, 'LossV': 12.05478}
Time: 67
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 1855
Episodes: 199
EpisodeReturn: 9.0
AverageReturn: 9.28
EvalEpisodes: 100
EvalEpi

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/ee6ad172ab334b029344f6b55f1f81a1
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.01512, -0.01502)
COMET INFO:     LossV [1000]            : (12.02744, 12.05969)
COMET INFO:     eval/avg_reward [20]    : (9.24, 9.47)
COMET INFO:     eval/avg_steps [20]     : (9.24, 9.47)
COMET INFO:     train/avg_reward [1000] : (9.255707762557078, 10.0)
COMET INFO:     train/avg_steps [1000]  : (9.255707762557078, 10.0)
COMET INFO:     train/reward [1000]     : (8.0, 11.0)
COMET INFO:     train/steps [1000]      : (8, 11)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_plr_0.1_vflr_0.1
COMET INFO:     total params     : 9155
COMET INFO:     trainable params : 9155
COMET INFO:   Parameter

Test ended after 9 steps with reward 9.0
Found files: ['video/openaigym.video.25.61.video000000.mp4']
Uploading file: video/openaigym.video.25.61.video000000.mp4


COMET INFO: Still uploading
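
The learning-rate cells above repeat the same two lines for each `policy_lr`/`vf_lr` pair. As a minimal sketch, the same sweep could be enumerated in a loop. The two grids below are an assumption inferred from the runs logged in this section, and the `A2CAgent`/`run_agent` calls are left as comments because both are defined elsewhere in the notebook.

```python
from itertools import product

# Assumed grids, inferred from the experiment names logged above.
policy_lrs = [1e-2, 1e-1]
vf_lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]


def experiment_name(algo, policy_lr, vf_lr):
  # Same f-string pattern as in the cells above.
  return f'{algo}_plr_{policy_lr}_vflr_{vf_lr}'


names = []
for policy_lr, vf_lr in product(policy_lrs, vf_lrs):
  name = experiment_name('a2c', policy_lr, vf_lr)
  names.append(name)
  # In the notebook, each pair would then be run as:
  # agent = A2CAgent(env, args, device, obs_dim, act_num,
  #                  policy_lr=policy_lr, vf_lr=vf_lr)
  # run_agent(agent, args, tags=['lr_relation'], experiment_name=name)

print(names)
```

Python formats `1e-5` as `1e-05` and `1e-4` as `0.0001`, which is why the Comet experiment names mix both notations.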


#### Testing Episode Bonuses

##### Finish Bonus

In [None]:
args = {
  'algo': 'a2c',
  'env': gym_env,
  'episode_bonus': 'ckpt_finish_bonus',
  'eval_num_episodes': eval_num_episodes,
  'eval_per_train': 50,
  'gpu_index': gpu_index,
  'iterations': episodes,
  'load': None,
  'max_steps': max_steps,
  'phase': 'train',
  'render': False,
  'render_in_train': False,
  'seed': seed,
  'step_bonus': None,
  'threshold_reward': 500,
}

args = Arguments(**args)

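def finish_bonus(steps, checkpoint_steps=25):
  """Episode-end bonus computed from the number of steps the episode lasted."""
  if steps > checkpoint_steps:
    # one extra reward point for every full block of checkpoint_steps survived
    return steps // checkpoint_steps
  else:
    # penalty equal to the number of steps missing to reach the first checkpoint
    # (zero when steps == checkpoint_steps)
    return steps - checkpoint_steps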

In [None]:
# Create an agent
agent = A2CAgent(env, args, device, obs_dim, act_num)
run_agent(agent, args, episode_bonus_fn=finish_bonus, tags=['episode_bonus'],
          experiment_name=f'{args.algo}_{args.episode_bonus}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/e1d308953d554c459ac1d4d4b4d6e9cf

---------------------------------------
Iterations: 50
Steps: 1480
Episodes: 49
EpisodeReturn: 20.0
AverageReturn: 29.6
EvalEpisodes: 100
EvalEpisodeReturn: 22.0
EvalAverageReturn: 16.87
OtherLogs: {'LossPi': -0.0218, 'LossV': 39.12975}
Time: 26
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2612
Episodes: 99
EpisodeReturn: 15.0
AverageReturn: 26.12
EvalEpisodes: 100
EvalEpisodeReturn: 12.0
EvalAverageReturn: 12.23
OtherLogs: {'LossPi': -0.02181, 'LossV': 39.11319}
Time: 50
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 3692
Episodes: 149
EpisodeReturn: 17.0
AverageReturn: 24.61
EvalEpisodes: 100
EvalEpisodeReturn: 19.0
EvalAverageReturn: 25.68
OtherLogs: {'LossPi': -0.02185, 'LossV': 39.09461}
Time: 75
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 5220
Episodes: 199
EpisodeReturn: 61.0
AverageReturn: 26.1
EvalEpisodes:

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/e1d308953d554c459ac1d4d4b4d6e9cf
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]                      : (-0.02428, -0.02143)
COMET INFO:     LossV [1000]                       : (35.44117, 39.15897)
COMET INFO:     eval/avg_reward [20]               : (12.23, 500.0)
COMET INFO:     eval/avg_steps [20]                : (12.23, 500.0)
COMET INFO:     train/avg_reward [1000]            : (19.0, 247.286)
COMET INFO:     train/avg_reward_with_bonus [1000] : (11.166666666666666, 255.381)
COMET INFO:     train/avg_steps [1000]             : (19.0, 247.286)
COMET INFO:     train/reward [1000]                : (9.0, 500.0)
COMET INFO:     train/reward_bonus [1000]          : (-16, 20)
COMET INFO:     train/r

Test ended after 500 steps with reward 500.0
Found files: ['video/openaigym.video.15.63.video000000.mp4']
Uploading file: video/openaigym.video.15.63.video000000.mp4


COMET INFO: Still uploading
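
To make the shape of the finish bonus concrete, the sketch below copies `finish_bonus` from the cell above and evaluates it for a few episode lengths. The lengths themselves are chosen here for illustration. The two extremes, 9 and 500 steps, reproduce the `train/reward_bonus` range of `(-16, 20)` reported in the Comet summary.

```python
def finish_bonus(steps, checkpoint_steps=25):
  # Copied from the cell above.
  if steps > checkpoint_steps:
    return steps // checkpoint_steps
  else:
    return steps - checkpoint_steps


# Illustrative episode lengths (chosen for this example).
bonuses = {steps: finish_bonus(steps) for steps in (9, 25, 26, 100, 500)}
print(bonuses)
# {9: -16, 25: 0, 26: 1, 100: 4, 500: 20}
```

Short episodes are penalized linearly, while the bonus for long episodes only grows by one point per 25 steps.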


##### High Penalty

In [None]:
args = {
  'algo': 'a2c',
  'env': gym_env,
  'episode_bonus': 'high_penalty',
  'eval_num_episodes': eval_num_episodes,
  'eval_per_train': 50,
  'gpu_index': gpu_index,
  'iterations': episodes,
  'load': None,
  'max_steps': max_steps,
  'phase': 'train',
  'render': False,
  'render_in_train': False,
  'seed': seed,
  'step_bonus': None,
  'threshold_reward': 500,
}

args = Arguments(**args)

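def high_penalty(steps, max_steps=500):
  """Episode-end penalty of -max_steps unless the episode lasted at least max_steps - 1 steps."""
  if steps < max_steps - 1:
    # episode ended early: the penalty is as large as the maximum possible return
    return -max_steps
  else:
    return 0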

In [None]:
# Create an agent
agent = A2CAgent(env, args, device, obs_dim, act_num)
run_agent(agent, args, episode_bonus_fn=high_penalty, tags=['episode_bonus'],
          experiment_name=f'{args.algo}_{args.episode_bonus}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/091335aafef646e98b167dfd3b100684

---------------------------------------
Iterations: 50
Steps: 1039
Episodes: 49
EpisodeReturn: 11.0
AverageReturn: 20.78
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.23
OtherLogs: {'LossPi': -0.02412, 'LossV': 35.4249}
Time: 27
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2134
Episodes: 99
EpisodeReturn: 39.0
AverageReturn: 21.34
EvalEpisodes: 100
EvalEpisodeReturn: 72.0
EvalAverageReturn: 66.5
OtherLogs: {'LossPi': -0.02408, 'LossV': 35.40787}
Time: 58
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 5611
Episodes: 149
EpisodeReturn: 170.0
AverageReturn: 37.41
EvalEpisodes: 100
EvalEpisodeReturn: 500.0
EvalAverageReturn: 500.0
OtherLogs: {'LossPi': -0.02427, 'LossV': 35.35478}
Time: 113
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 16217
Episodes: 199
EpisodeReturn: 500.0
AverageReturn: 81.08
EvalEpis

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/091335aafef646e98b167dfd3b100684
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]                      : (-0.02649, -0.02403)
COMET INFO:     LossV [1000]                       : (32.1401, 35.44078)
COMET INFO:     eval/avg_reward [20]               : (9.23, 500.0)
COMET INFO:     eval/avg_steps [20]                : (9.23, 500.0)
COMET INFO:     train/avg_reward [1000]            : (14.666666666666666, 315.388)
COMET INFO:     train/avg_reward_with_bonus [1000] : (-485.3333333333333, 41.888)
COMET INFO:     train/avg_steps [1000]             : (14.666666666666666, 315.388)
COMET INFO:     train/reward [1000]                : (8.0, 500.0)
COMET INFO:     train/reward_bonus [1000]          : (-500, 0)

Test ended after 500 steps with reward 500.0
Found files: ['video/openaigym.video.16.63.video000000.mp4']
Uploading file: video/openaigym.video.16.63.video000000.mp4


COMET INFO: Still uploading


#### Testing Step Bonuses

##### Checkpoint Step Bonus

In [None]:
args = {
  'algo': 'a2c',
  'env': gym_env,
  'episode_bonus': None,
  'eval_num_episodes': eval_num_episodes,
  'eval_per_train': 50,
  'gpu_index': gpu_index,
  'iterations': episodes,
  'load': None,
  'max_steps': max_steps,
  'phase': 'train',
  'render': False,
  'render_in_train': False,
  'seed': seed,
  'step_bonus': 'ckpt_step_bonus',
  'threshold_reward': 500,
}

args = Arguments(**args)

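def step_bonus(steps, observation, checkpoint_steps=25):
  # observation is unused; it is kept so every bonus function shares one signature
  if steps > checkpoint_steps:
    # past the first checkpoint, the bonus grows by 1 for every checkpoint_steps survived
    return steps // checkpoint_steps
  else:
    return 0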
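
To make the shape of this bonus concrete, the sketch below re-declares the function with the same logic as the cell above and tabulates the value it returns for a few step counts. How the agent combines this value with the environment reward is not shown in this section, so the snippet only inspects the function itself.

```python
def step_bonus(steps, observation, checkpoint_steps=25):
    # Same logic as the notebook cell, repeated so this snippet runs on its own.
    if steps > checkpoint_steps:
        return steps // checkpoint_steps
    return 0

# The observation argument is not used by this bonus, so None is passed here.
bonus_by_step = {s: step_bonus(s, None) for s in (1, 25, 26, 49, 50, 100, 500)}
print(bonus_by_step)
# {1: 0, 25: 0, 26: 1, 49: 1, 50: 2, 100: 4, 500: 20}
```

The bonus is zero up to and including step 25 (the comparison is strict), equals 1 from step 26 to 49, and reaches 20 at the 500-step limit.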

In [None]:
# Create an agent
agent = A2CAgent(env, args, device, obs_dim, act_num, additional_reward_fn=step_bonus)
run_agent(agent, args, tags=['step_bonus'], experiment_name=f'{args.algo}_{args.step_bonus}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/0b16b5ad412f43008c8636202af75189

---------------------------------------
Iterations: 50
Steps: 1146
Episodes: 49
EpisodeReturn: 90.0
AverageReturn: 22.92
EvalEpisodes: 100
EvalEpisodeReturn: 42.0
EvalAverageReturn: 36.57
OtherLogs: {'LossPi': -0.02571, 'LossV': 32.12687, 'custom_reward': 209.0}
Time: 33
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2943
Episodes: 99
EpisodeReturn: 38.0
AverageReturn: 29.43
EvalEpisodes: 100
EvalEpisodeReturn: 80.0
EvalAverageReturn: 102.64
OtherLogs: {'LossPi': -0.02548, 'LossV': 32.11475, 'custom_reward': 50.0}
Time: 71
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 7203
Episodes: 149
EpisodeReturn: 152.0
AverageReturn: 48.02
EvalEpisodes: 100
EvalEpisodeReturn: 144.0
EvalAverageReturn: 162.81
OtherLogs: {'LossPi': -0.02473, 'LossV': 32.14188, 'custom_reward': 538.0}
Time: 118
---------------------------------------
---------------------------------------
Iterations: 200
S

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/0b16b5ad412f43008c8636202af75189
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.02584, 0.05191)
COMET INFO:     LossV [1000]            : (32.11249, 166.36647)
COMET INFO:     custom_reward [1000]    : (9.0, 5249.0)
COMET INFO:     eval/avg_reward [20]    : (36.57, 500.0)
COMET INFO:     eval/avg_steps [20]     : (36.57, 500.0)
COMET INFO:     train/avg_reward [1000] : (21.551020408163264, 299.583)
COMET INFO:     train/avg_steps [1000]  : (21.551020408163264, 299.583)
COMET INFO:     train/reward [1000]     : (9.0, 500.0)
COMET INFO:     train/steps [1000]      : (9, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_ckpt_step_bonus
COMET INFO:     total params     :

Test ended after 500 steps with reward 500.0
Found files: ['video/openaigym.video.17.63.video000000.mp4']
Uploading file: video/openaigym.video.17.63.video000000.mp4


COMET INFO: Still uploading


##### Pole Angle Based Reward

In [None]:
args = {
  'algo': 'a2c',
  'env': gym_env,
  'episode_bonus': None,
  'eval_num_episodes': eval_num_episodes,
  'eval_per_train': 50,
  'gpu_index': gpu_index,
  'iterations': episodes,
  'load': None,
  'max_steps': max_steps,
  'phase': 'train',
  'render': False,
  'render_in_train': False,
  'seed': seed,
  'step_bonus': 'pole_angle_bonus',
  'threshold_reward': 500,
}

args = Arguments(**args)

import math

def pole_angle_based_reward(steps, observation, max_angle=3):
  # observation[2] is the pole angle in radians; max_angle is given in degrees
  pole_angle = observation[2]
  angle = math.radians(max_angle)
  if abs(pole_angle) > angle:
    # penalize if the pole leans more than max_angle to either side
    return -200
  else:
    # give a bonus otherwise
    return 100
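
As a quick check of the threshold used here, the sketch below repeats the function with the same logic and evaluates it on hand-made observations. The observation values are made up for illustration and are not taken from a real episode.

```python
import math

def pole_angle_based_reward(steps, observation, max_angle=3):
    # Same logic as the notebook cell, repeated so this snippet runs on its own.
    pole_angle = observation[2]
    angle = max_angle * 2 * math.pi / 360
    if pole_angle < -angle or pole_angle > angle:
        return -200
    return 100

# 3 degrees expressed in radians, the unit of the CartPole pole angle.
threshold_rad = math.radians(3)
print(round(threshold_rad, 4))  # 0.0524

# Illustrative observations: [cart position, cart velocity, pole angle, pole velocity]
obs_inside = [0.0, 0.0, 0.02, 0.0]
obs_outside = [0.0, 0.0, 0.10, 0.0]
print(pole_angle_based_reward(1, obs_inside))   # 100
print(pole_angle_based_reward(1, obs_outside))  # -200
```

Since the observations are initialized within $\pm$ 0.05, every episode starts just inside this 0.0524 rad band, and any larger tilt switches the extra reward from +100 to -200.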

In [None]:
# Create an agent
agent = A2CAgent(env, args, device, obs_dim, act_num, additional_reward_fn=pole_angle_based_reward)
run_agent(agent, args, tags=['step_bonus'], experiment_name=f'{args.algo}_{args.step_bonus}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/49466c3a88ab4816ae5cdab7586f923a

---------------------------------------
Iterations: 50
Steps: 902
Episodes: 49
EpisodeReturn: 11.0
AverageReturn: 18.04
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.37
OtherLogs: {'LossPi': 0.03562, 'LossV': 176.24018, 'custom_reward': -389.0}
Time: 35
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 1689
Episodes: 99
EpisodeReturn: 14.0
AverageReturn: 16.89
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.31
OtherLogs: {'LossPi': 0.02522, 'LossV': 184.46916, 'custom_reward': -86.0}
Time: 70
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 2523
Episodes: 149
EpisodeReturn: 15.0
AverageReturn: 16.82
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.3
OtherLogs: {'LossPi': 0.01571, 'LossV': 192.80584, 'custom_reward': 15.0}
Time: 105
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 3303

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/49466c3a88ab4816ae5cdab7586f923a
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.09089, 0.05135)
COMET INFO:     LossV [1000]            : (166.53065, 304.00785)
COMET INFO:     custom_reward [1000]    : (-5765.0, 3963.0)
COMET INFO:     eval/avg_reward [20]    : (9.19, 9.48)
COMET INFO:     eval/avg_steps [20]     : (9.19, 9.48)
COMET INFO:     train/avg_reward [1000] : (13.688, 22.5)
COMET INFO:     train/avg_steps [1000]  : (13.688, 22.5)
COMET INFO:     train/reward [1000]     : (8.0, 63.0)
COMET INFO:     train/steps [1000]      : (8, 63)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_pole_angle_bonus
COMET INFO:     total params     : 9155
COMET INFO:     trainabl

Test ended after 9 steps with reward 9.0
Found files: ['video/openaigym.video.18.63.video000000.mp4']
Uploading file: video/openaigym.video.18.63.video000000.mp4


COMET INFO: Still uploading


#### Testing Observation Slices

In [None]:
args = {
  'algo': 'a2c',
  'env': gym_env,
  'episode_bonus': None,
  'eval_num_episodes': eval_num_episodes,
  'eval_per_train': 50,
  'gpu_index': gpu_index,
  'iterations': episodes,
  'load': None,
  'max_steps': max_steps,
  'phase': 'train',
  'render': False,
  'render_in_train': False,
  'seed': seed,
  'step_bonus': None,
  'threshold_reward': 500,
}

args = Arguments(**args)
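
The observation-slice runs in this subsection pass a list of indices, `observations_to_use`, to `A2CAgent` in the position where the earlier runs passed `obs_dim`. How the agent handles that list is not shown in this section. A minimal sketch, assuming it simply keeps the listed components of the four-element CartPole observation, could look like the following; `slice_observation` is a hypothetical helper, not part of the project code.

```python
def slice_observation(observation, observations_to_use):
    """Hypothetical helper: keep only the selected observation components."""
    return [observation[i] for i in observations_to_use]

# CartPole observation order:
# 0 = cart position, 1 = cart velocity, 2 = pole angle, 3 = pole angular velocity
full_obs = [0.01, -0.02, 0.03, -0.04]  # made-up values for illustration

observations_to_use = [1, 2, 3]        # drop the cart position
sliced = slice_observation(full_obs, observations_to_use)
print(sliced)  # [-0.02, 0.03, -0.04]

# The experiment name suffix used in the notebook is the concatenated indices.
experiment_suffix = "".join(str(i) for i in observations_to_use)
print(experiment_suffix)  # 123
```

Under this reading, a run named `a2c_123` sees everything except the cart position, and `a2c_23` sees only the pole angle and its velocity.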

In [None]:
observations_to_use = [1, 2, 3]

# Create an agent
agent = A2CAgent(env, args, device, observations_to_use, act_num)
run_agent(agent, args, tags=['observation_slice'],
          experiment_name=f'{args.algo}_{"".join([str(i) for i in observations_to_use])}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/e72a800f10984849972f134e6adc5977

---------------------------------------
Iterations: 50
Steps: 1664
Episodes: 49
EpisodeReturn: 26.0
AverageReturn: 33.28
EvalEpisodes: 100
EvalEpisodeReturn: 56.0
EvalAverageReturn: 42.68
OtherLogs: {'LossPi': -0.09071, 'LossV': 303.80037, 'custom_reward': -891.0}
Time: 39
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 5726
Episodes: 99
EpisodeReturn: 228.0
AverageReturn: 57.26
EvalEpisodes: 100
EvalEpisodeReturn: 259.0
EvalAverageReturn: 242.92
OtherLogs: {'LossPi': -0.09072, 'LossV': 303.30102, 'custom_reward': -891.0}
Time: 93
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 16730
Episodes: 149
EpisodeReturn: 123.0
AverageReturn: 111.53
EvalEpisodes: 100
EvalEpisodeReturn: 188.0
EvalAverageReturn: 161.9
OtherLogs: {'LossPi': -0.09041, 'LossV': 301.97558, 'custom_reward': -891.0}
Time: 164
---------------------------------------
---------------------------------------
Iterati

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/e72a800f10984849972f134e6adc5977
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.09089, -0.08483)
COMET INFO:     LossV [1000]            : (278.23471, 304.00657)
COMET INFO:     custom_reward [1000]    : -891.0
COMET INFO:     eval/avg_reward [20]    : (23.41, 500.0)
COMET INFO:     eval/avg_steps [20]     : (23.41, 500.0)
COMET INFO:     train/avg_reward [1000] : (10.0, 256.2232558139535)
COMET INFO:     train/avg_steps [1000]  : (10.0, 256.2232558139535)
COMET INFO:     train/reward [1000]     : (9.0, 500.0)
COMET INFO:     train/steps [1000]      : (9, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_123
COMET INFO:     total params     : 9027
COMET INFO:     tra

Test ended after 500 steps with reward 500.0
Found files: ['video/openaigym.video.19.63.video000000.mp4']
Uploading file: video/openaigym.video.19.63.video000000.mp4


COMET INFO: Still uploading
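
The Comet summaries in this section report 9155 total parameters for the run on the full observation, 9027 for the three-component slices, and 8899 for a two-component slice, a drop of 128 per removed input. The arithmetic below reproduces these totals under an assumed architecture: separate actor and critic MLPs, each with two hidden layers of 64 units and bias terms. The network definition is not part of this section, so the layer sizes are an assumption that happens to match the logs.

```python
def mlp_param_count(sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

def a2c_param_count(obs_dim, act_num=2, hidden=(64, 64)):
    # Assumed layout: an actor ending in act_num outputs and a critic ending in 1 output.
    actor = mlp_param_count([obs_dim, *hidden, act_num])
    critic = mlp_param_count([obs_dim, *hidden, 1])
    return actor + critic

counts = {obs_dim: a2c_param_count(obs_dim) for obs_dim in (4, 3, 2)}
print(counts)  # {4: 9155, 3: 9027, 2: 8899}
```

Each removed input deletes one row of 64 first-layer weights in both networks, which accounts for the 128-parameter step between the logged totals.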


In [None]:
observations_to_use = [0, 2, 3]

# Create an agent
agent = A2CAgent(env, args, device, observations_to_use, act_num)
run_agent(agent, args, tags=['observation_slice'],
          experiment_name=f'{args.algo}_{"".join([str(i) for i in observations_to_use])}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/572da676101e4e2ca685f8acd6faea0b

---------------------------------------
Iterations: 50
Steps: 1702
Episodes: 49
EpisodeReturn: 55.0
AverageReturn: 34.04
EvalEpisodes: 100
EvalEpisodeReturn: 217.0
EvalAverageReturn: 200.52
OtherLogs: {'LossPi': -0.08492, 'LossV': 278.05803, 'custom_reward': -891.0}
Time: 49
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 6748
Episodes: 99
EpisodeReturn: 116.0
AverageReturn: 67.48
EvalEpisodes: 100
EvalEpisodeReturn: 95.0
EvalAverageReturn: 112.02
OtherLogs: {'LossPi': -0.08498, 'LossV': 277.54535, 'custom_reward': -891.0}
Time: 103
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 14871
Episodes: 149
EpisodeReturn: 160.0
AverageReturn: 99.14
EvalEpisodes: 100
EvalEpisodeReturn: 252.0
EvalAverageReturn: 226.95
OtherLogs: {'LossPi': -0.0851, 'LossV': 276.70469, 'custom_reward': -891.0}
Time: 171
---------------------------------------
---------------------------------------
Iterat

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/572da676101e4e2ca685f8acd6faea0b
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.0864, -0.08239)
COMET INFO:     LossV [1000]            : (255.28455, 278.23215)
COMET INFO:     custom_reward [1000]    : -891.0
COMET INFO:     eval/avg_reward [20]    : (13.31, 500.0)
COMET INFO:     eval/avg_steps [20]     : (13.31, 500.0)
COMET INFO:     train/avg_reward [1000] : (20.0, 301.62420382165607)
COMET INFO:     train/avg_steps [1000]  : (20.0, 301.62420382165607)
COMET INFO:     train/reward [1000]     : (9.0, 500.0)
COMET INFO:     train/steps [1000]      : (9, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_023
COMET INFO:     total params     : 9027
COMET INFO:     tr

Test ended after 500 steps with reward 500.0
Found files: ['video/openaigym.video.20.63.video000000.mp4']
Uploading file: video/openaigym.video.20.63.video000000.mp4


COMET INFO:     asset               : 1
COMET INFO:     code                : 1 (22 KB)
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     git metadata        : 1
COMET INFO:     installed packages  : 1
COMET INFO:     model-element [10]  : 10
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Still uploading
COMET INFO: Waiting for completion of the file uploads (may take several seconds)
COMET INFO: Still uploading


In [None]:
observations_to_use = [0, 1, 3]

# Create an agent
agent = A2CAgent(env, args, device, observations_to_use, act_num)
run_agent(agent, args, tags=['observation_slice'],
          experiment_name=f'{args.algo}_{"".join([str(i) for i in observations_to_use])}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/7ad7777e21f7425fa138081566849a75

---------------------------------------
Iterations: 50
Steps: 1418
Episodes: 49
EpisodeReturn: 56.0
AverageReturn: 28.36
EvalEpisodes: 100
EvalEpisodeReturn: 26.0
EvalAverageReturn: 30.81
OtherLogs: {'LossPi': -0.08228, 'LossV': 255.16132, 'custom_reward': -891.0}
Time: 45
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 4269
Episodes: 99
EpisodeReturn: 85.0
AverageReturn: 42.69
EvalEpisodes: 100
EvalEpisodeReturn: 31.0
EvalAverageReturn: 37.19
OtherLogs: {'LossPi': -0.08221, 'LossV': 254.91881, 'custom_reward': -891.0}
Time: 94
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 10784
Episodes: 149
EpisodeReturn: 62.0
AverageReturn: 71.89
EvalEpisodes: 100
EvalEpisodeReturn: 92.0
EvalAverageReturn: 106.41
OtherLogs: {'LossPi': -0.08213, 'LossV': 254.352, 'custom_reward': -891.0}
Time: 157
---------------------------------------
---------------------------------------
Iterations: 20

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/7ad7777e21f7425fa138081566849a75
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.08239, -0.07937)
COMET INFO:     LossV [1000]            : (242.00405, 255.28036)
COMET INFO:     custom_reward [1000]    : -891.0
COMET INFO:     eval/avg_reward [20]    : (30.81, 458.09)
COMET INFO:     eval/avg_steps [20]     : (30.81, 458.09)
COMET INFO:     train/avg_reward [1000] : (24.96, 171.03054448871183)
COMET INFO:     train/avg_steps [1000]  : (24.96, 171.03054448871183)
COMET INFO:     train/reward [1000]     : (9.0, 500.0)
COMET INFO:     train/steps [1000]      : (9, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_013
COMET INFO:     total params     : 9027
COMET INFO:  

Test ended after 424 steps with reward 424.0
Found files: ['video/openaigym.video.21.63.video000000.mp4']
Uploading file: video/openaigym.video.21.63.video000000.mp4


COMET INFO: Still uploading


In [None]:
observations_to_use = [0, 1, 2]

# Create an agent
agent = A2CAgent(env, args, device, observations_to_use, act_num)
run_agent(agent, args, tags=['observation_slice'],
          experiment_name=f'{args.algo}_{"".join([str(i) for i in observations_to_use])}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/ca4096becd6f4ab5b21191443f4aab42

---------------------------------------
Iterations: 50
Steps: 1281
Episodes: 49
EpisodeReturn: 11.0
AverageReturn: 25.62
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 10.27
OtherLogs: {'LossPi': -0.07927, 'LossV': 241.90489, 'custom_reward': -891.0}
Time: 45
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2300
Episodes: 99
EpisodeReturn: 17.0
AverageReturn: 23.0
EvalEpisodes: 100
EvalEpisodeReturn: 13.0
EvalAverageReturn: 13.15
OtherLogs: {'LossPi': -0.07927, 'LossV': 241.82821, 'custom_reward': -891.0}
Time: 90
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 3534
Episodes: 149
EpisodeReturn: 25.0
AverageReturn: 23.56
EvalEpisodes: 100
EvalEpisodeReturn: 12.0
EvalAverageReturn: 12.07
OtherLogs: {'LossPi': -0.07921, 'LossV': 241.73455, 'custom_reward': -891.0}
Time: 135
---------------------------------------
---------------------------------------
Iterations: 200

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/ca4096becd6f4ab5b21191443f4aab42
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.08192, -0.07907)
COMET INFO:     LossV [1000]            : (235.07132, 242.00181)
COMET INFO:     custom_reward [1000]    : -891.0
COMET INFO:     eval/avg_reward [20]    : (9.37, 497.4)
COMET INFO:     eval/avg_steps [20]     : (9.37, 497.4)
COMET INFO:     train/avg_reward [1000] : (18.661057692307693, 91.888)
COMET INFO:     train/avg_steps [1000]  : (18.661057692307693, 91.888)
COMET INFO:     train/reward [1000]     : (8.0, 500.0)
COMET INFO:     train/steps [1000]      : (8, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_012
COMET INFO:     total params     : 9027
COMET INFO:    

Test ended after 500 steps with reward 500.0
Found files: ['video/openaigym.video.22.63.video000000.mp4']
Uploading file: video/openaigym.video.22.63.video000000.mp4


COMET INFO:     filename            : 1
COMET INFO:     git metadata        : 1
COMET INFO:     installed packages  : 1
COMET INFO:     model-element [10]  : 10
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Still uploading


In [None]:
observations_to_use = [2, 3]

# Create an agent
agent = A2CAgent(env, args, device, observations_to_use, act_num)
run_agent(agent, args, tags=['observation_slice'],
          experiment_name=f'{args.algo}_{"".join([str(i) for i in observations_to_use])}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/d619d0aa6fad4314b154ea0b057abf10

---------------------------------------
Iterations: 50
Steps: 1419
Episodes: 49
EpisodeReturn: 70.0
AverageReturn: 28.38
EvalEpisodes: 100
EvalEpisodeReturn: 33.0
EvalAverageReturn: 32.98
OtherLogs: {'LossPi': -0.08182, 'LossV': 234.96759, 'custom_reward': -891.0}
Time: 48
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 5949
Episodes: 99
EpisodeReturn: 80.0
AverageReturn: 59.49
EvalEpisodes: 100
EvalEpisodeReturn: 88.0
EvalAverageReturn: 86.46
OtherLogs: {'LossPi': -0.08163, 'LossV': 234.64355, 'custom_reward': -891.0}
Time: 106
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 12173
Episodes: 149
EpisodeReturn: 146.0
AverageReturn: 81.15
EvalEpisodes: 100
EvalEpisodeReturn: 260.0
EvalAverageReturn: 177.61
OtherLogs: {'LossPi': -0.08163, 'LossV': 234.21568, 'custom_reward': -891.0}
Time: 174
---------------------------------------
---------------------------------------
Iteration

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/d619d0aa6fad4314b154ea0b057abf10
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.08191, -0.07892)
COMET INFO:     LossV [1000]            : (223.17179, 235.06816)
COMET INFO:     custom_reward [1000]    : -891.0
COMET INFO:     eval/avg_reward [20]    : (32.98, 500.0)
COMET INFO:     eval/avg_steps [20]     : (32.98, 500.0)
COMET INFO:     train/avg_reward [1000] : (23.666666666666668, 185.19158361018827)
COMET INFO:     train/avg_steps [1000]  : (23.666666666666668, 185.19158361018827)
COMET INFO:     train/reward [1000]     : (10.0, 500.0)
COMET INFO:     train/steps [1000]      : (10, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_23
COMET INFO:     total params

Test ended after 500 steps with reward 500.0
Found files: ['video/openaigym.video.23.63.video000000.mp4']
Uploading file: video/openaigym.video.23.63.video000000.mp4


COMET INFO:     filename            : 1
COMET INFO:     git metadata        : 1
COMET INFO:     installed packages  : 1
COMET INFO:     model-element [7]   : 7
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Still uploading


In [None]:
observations_to_use = [1, 3]

# Create an agent
agent = A2CAgent(env, args, device, observations_to_use, act_num)
run_agent(agent, args, tags=['observation_slice'],
          experiment_name=f'{args.algo}_{"".join([str(i) for i in observations_to_use])}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/6e604b6bbb544786b5f57cf5f368c0de

---------------------------------------
Iterations: 50
Steps: 1463
Episodes: 49
EpisodeReturn: 24.0
AverageReturn: 29.26
EvalEpisodes: 100
EvalEpisodeReturn: 49.0
EvalAverageReturn: 40.87
OtherLogs: {'LossPi': -0.07882, 'LossV': 223.07652, 'custom_reward': -891.0}
Time: 50
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 3874
Episodes: 99
EpisodeReturn: 92.0
AverageReturn: 38.74
EvalEpisodes: 100
EvalEpisodeReturn: 69.0
EvalAverageReturn: 85.77
OtherLogs: {'LossPi': -0.0788, 'LossV': 222.91916, 'custom_reward': -891.0}
Time: 106
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 8256
Episodes: 149
EpisodeReturn: 74.0
AverageReturn: 55.04
EvalEpisodes: 100
EvalEpisodeReturn: 132.0
EvalAverageReturn: 139.58
OtherLogs: {'LossPi': -0.07863, 'LossV': 222.63581, 'custom_reward': -891.0}
Time: 168
---------------------------------------
---------------------------------------
Iterations: 

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/6e604b6bbb544786b5f57cf5f368c0de
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.07893, -0.07624)
COMET INFO:     LossV [1000]            : (213.9958, 223.17017)
COMET INFO:     custom_reward [1000]    : -891.0
COMET INFO:     eval/avg_reward [20]    : (40.87, 397.3)
COMET INFO:     eval/avg_steps [20]     : (40.87, 397.3)
COMET INFO:     train/avg_reward [1000] : (17.23076923076923, 153.354)
COMET INFO:     train/avg_steps [1000]  : (17.23076923076923, 153.354)
COMET INFO:     train/reward [1000]     : (9.0, 500.0)
COMET INFO:     train/steps [1000]      : (9, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_13
COMET INFO:     total params     : 8899
COMET INFO:    

Test ended after 445 steps with reward 445.0
Found files: ['video/openaigym.video.24.63.video000000.mp4']
Uploading file: video/openaigym.video.24.63.video000000.mp4


COMET INFO: Still uploading


In [None]:
observations_to_use = [1, 2]

# Create an agent
agent = A2CAgent(env, args, device, observations_to_use, act_num)
run_agent(agent, args, tags=['observation_slice'],
          experiment_name=f'{args.algo}_{"".join([str(i) for i in observations_to_use])}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/6da800a3682849dbbc1bd4f6559155a7

---------------------------------------
Iterations: 50
Steps: 1207
Episodes: 49
EpisodeReturn: 14.0
AverageReturn: 24.14
EvalEpisodes: 100
EvalEpisodeReturn: 14.0
EvalAverageReturn: 15.11
OtherLogs: {'LossPi': -0.07616, 'LossV': 213.92368, 'custom_reward': -891.0}
Time: 51
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2295
Episodes: 99
EpisodeReturn: 15.0
AverageReturn: 22.95
EvalEpisodes: 100
EvalEpisodeReturn: 12.0
EvalAverageReturn: 12.15
OtherLogs: {'LossPi': -0.07614, 'LossV': 213.86052, 'custom_reward': -891.0}
Time: 101
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 3720
Episodes: 149
EpisodeReturn: 19.0
AverageReturn: 24.8
EvalEpisodes: 100
EvalEpisodeReturn: 29.0
EvalAverageReturn: 21.35
OtherLogs: {'LossPi': -0.07609, 'LossV': 213.77696, 'custom_reward': -891.0}
Time: 153
---------------------------------------
---------------------------------------
Iterations: 20

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/6da800a3682849dbbc1bd4f6559155a7
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.08098, -0.07608)
COMET INFO:     LossV [1000]            : (205.58266, 213.99388)
COMET INFO:     custom_reward [1000]    : -891.0
COMET INFO:     eval/avg_reward [20]    : (12.15, 350.47)
COMET INFO:     eval/avg_steps [20]     : (12.15, 350.47)
COMET INFO:     train/avg_reward [1000] : (20.75, 152.451)
COMET INFO:     train/avg_steps [1000]  : (20.75, 152.451)
COMET INFO:     train/reward [1000]     : (9.0, 500.0)
COMET INFO:     train/steps [1000]      : (9, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_12
COMET INFO:     total params     : 8899
COMET INFO:     trainable params : 8

Test ended after 329 steps with reward 329.0
Found files: ['video/openaigym.video.25.63.video000000.mp4']
Uploading file: video/openaigym.video.25.63.video000000.mp4


COMET INFO: Still uploading
COMET INFO: Waiting for completion of the file uploads (may take several seconds)
COMET INFO: Still uploading


In [None]:
observations_to_use = [0, 3]

# Create an agent
agent = A2CAgent(env, args, device, observations_to_use, act_num)
run_agent(agent, args, tags=['observation_slice'],
          experiment_name=f'{args.algo}_{"".join([str(i) for i in observations_to_use])}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/f5d5f41dc2994955ad51616520ead8f7

---------------------------------------
Iterations: 50
Steps: 1333
Episodes: 49
EpisodeReturn: 16.0
AverageReturn: 26.66
EvalEpisodes: 100
EvalEpisodeReturn: 52.0
EvalAverageReturn: 77.16
OtherLogs: {'LossPi': -0.08088, 'LossV': 205.50923, 'custom_reward': -891.0}
Time: 56
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 4565
Episodes: 99
EpisodeReturn: 95.0
AverageReturn: 45.65
EvalEpisodes: 100
EvalEpisodeReturn: 142.0
EvalAverageReturn: 130.69
OtherLogs: {'LossPi': -0.08073, 'LossV': 205.33599, 'custom_reward': -891.0}
Time: 119
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 11017
Episodes: 149
EpisodeReturn: 128.0
AverageReturn: 73.45
EvalEpisodes: 100
EvalEpisodeReturn: 130.0
EvalAverageReturn: 143.29
OtherLogs: {'LossPi': -0.08072, 'LossV': 204.98935, 'custom_reward': -891.0}
Time: 192
---------------------------------------
---------------------------------------
Iterati

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/f5d5f41dc2994955ad51616520ead8f7
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.08094, -0.0806)
COMET INFO:     LossV [1000]            : (202.82244, 205.581)
COMET INFO:     custom_reward [1000]    : -891.0
COMET INFO:     eval/avg_reward [20]    : (26.94, 143.29)
COMET INFO:     eval/avg_steps [20]     : (26.94, 143.29)
COMET INFO:     train/avg_reward [1000] : (18.375, 90.06930693069307)
COMET INFO:     train/avg_steps [1000]  : (18.375, 90.06930693069307)
COMET INFO:     train/reward [1000]     : (10.0, 354.0)
COMET INFO:     train/steps [1000]      : (10, 354)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_03
COMET INFO:     total params     : 8899
COMET INFO:    

Test ended after 145 steps with reward 145.0
Found files: ['video/openaigym.video.26.63.video000000.mp4']
Uploading file: video/openaigym.video.26.63.video000000.mp4


COMET INFO:   Uploads [count]:
COMET INFO:     asset               : 1
COMET INFO:     code                : 1 (24 KB)
COMET INFO:     environment details : 1
COMET INFO:     filename            : 1
COMET INFO:     git metadata        : 1
COMET INFO:     installed packages  : 1
COMET INFO:     model-element [3]   : 3
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Still uploading


In [None]:
observations_to_use = [0, 2]

# Create an agent
agent = A2CAgent(env, args, device, observations_to_use, act_num)
run_agent(agent, args, tags=['observation_slice'],
          experiment_name=f'{args.algo}_{"".join([str(i) for i in observations_to_use])}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/47d4468767ad4a3794763b5a396dd74f

---------------------------------------
Iterations: 50
Steps: 963
Episodes: 49
EpisodeReturn: 10.0
AverageReturn: 19.26
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.41
OtherLogs: {'LossPi': -0.08054, 'LossV': 202.77107, 'custom_reward': -891.0}
Time: 53
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2225
Episodes: 99
EpisodeReturn: 28.0
AverageReturn: 22.25
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.18
OtherLogs: {'LossPi': -0.08046, 'LossV': 202.70306, 'custom_reward': -891.0}
Time: 106
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 3333
Episodes: 149
EpisodeReturn: 21.0
AverageReturn: 22.22
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.44
OtherLogs: {'LossPi': -0.0804, 'LossV': 202.64445, 'custom_reward': -891.0}
Time: 159
---------------------------------------
---------------------------------------
Iterations: 200
St

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/47d4468767ad4a3794763b5a396dd74f
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.08059, -0.08006)
COMET INFO:     LossV [1000]            : (201.94585, 202.82006)
COMET INFO:     custom_reward [1000]    : -891.0
COMET INFO:     eval/avg_reward [20]    : (9.18, 9.44)
COMET INFO:     eval/avg_steps [20]     : (9.18, 9.44)
COMET INFO:     train/avg_reward [1000] : (15.759509202453987, 43.0)
COMET INFO:     train/avg_steps [1000]  : (15.759509202453987, 43.0)
COMET INFO:     train/reward [1000]     : (8.0, 97.0)
COMET INFO:     train/steps [1000]      : (8, 97)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_02
COMET INFO:     total params     : 8899
COMET INFO:     trainabl

Test ended after 10 steps with reward 10.0
Found files: ['video/openaigym.video.27.63.video000000.mp4']
Uploading file: video/openaigym.video.27.63.video000000.mp4


COMET INFO: Still uploading


In [None]:
observations_to_use = [0, 1]

# Create an agent
agent = A2CAgent(env, args, device, observations_to_use, act_num)
run_agent(agent, args, tags=['observation_slice'],
          experiment_name=f'{args.algo}_{"".join([str(i) for i in observations_to_use])}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/eb39f8f6171b417ab849669e425de736

---------------------------------------
Iterations: 50
Steps: 1300
Episodes: 49
EpisodeReturn: 30.0
AverageReturn: 26.0
EvalEpisodes: 100
EvalEpisodeReturn: 16.0
EvalAverageReturn: 15.03
OtherLogs: {'LossPi': -0.07996, 'LossV': 201.87712, 'custom_reward': -891.0}
Time: 54
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2447
Episodes: 99
EpisodeReturn: 35.0
AverageReturn: 24.47
EvalEpisodes: 100
EvalEpisodeReturn: 28.0
EvalAverageReturn: 35.74
OtherLogs: {'LossPi': -0.07995, 'LossV': 201.81817, 'custom_reward': -891.0}
Time: 108
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 4027
Episodes: 149
EpisodeReturn: 23.0
AverageReturn: 26.85
EvalEpisodes: 100
EvalEpisodeReturn: 30.0
EvalAverageReturn: 36.88
OtherLogs: {'LossPi': -0.07994, 'LossV': 201.73681, 'custom_reward': -891.0}
Time: 164
---------------------------------------
---------------------------------------
Iterations: 20

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/eb39f8f6171b417ab849669e425de736
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.08006, -0.0798)
COMET INFO:     LossV [1000]            : (200.09243, 201.94481)
COMET INFO:     custom_reward [1000]    : -891.0
COMET INFO:     eval/avg_reward [20]    : (15.03, 46.33)
COMET INFO:     eval/avg_steps [20]     : (15.03, 46.33)
COMET INFO:     train/avg_reward [1000] : (19.0, 37.404233870967744)
COMET INFO:     train/avg_steps [1000]  : (19.0, 37.404233870967744)
COMET INFO:     train/reward [1000]     : (9.0, 112.0)
COMET INFO:     train/steps [1000]      : (9, 112)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_01
COMET INFO:     total params     : 8899
COMET INFO:     tra

Test ended after 57 steps with reward 57.0
Found files: ['video/openaigym.video.28.63.video000000.mp4']
Uploading file: video/openaigym.video.28.63.video000000.mp4


COMET INFO:     model-element [7]   : 7
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Still uploading


In [None]:
observations_to_use = [0]

# Create an agent
agent = A2CAgent(env, args, device, observations_to_use, act_num)
run_agent(agent, args, tags=['observation_slice'],
          experiment_name=f'{args.algo}_{"".join([str(i) for i in observations_to_use])}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/bbbc8aa9fa14450fbb037b13dbd09b12

---------------------------------------
Iterations: 50
Steps: 927
Episodes: 49
EpisodeReturn: 17.0
AverageReturn: 18.54
EvalEpisodes: 100
EvalEpisodeReturn: 9.0
EvalAverageReturn: 9.38
OtherLogs: {'LossPi': -0.07983, 'LossV': 200.04445, 'custom_reward': -891.0}
Time: 53
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 1580
Episodes: 99
EpisodeReturn: 12.0
AverageReturn: 15.8
EvalEpisodes: 100
EvalEpisodeReturn: 8.0
EvalAverageReturn: 9.44
OtherLogs: {'LossPi': -0.07983, 'LossV': 200.01025, 'custom_reward': -891.0}
Time: 106
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 2176
Episodes: 149
EpisodeReturn: 8.0
AverageReturn: 14.51
EvalEpisodes: 100
EvalEpisodeReturn: 8.0
EvalAverageReturn: 9.4
OtherLogs: {'LossPi': -0.0798, 'LossV': 199.97839, 'custom_reward': -891.0}
Time: 158
---------------------------------------
---------------------------------------
Iterations: 200
Steps: 2

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/bbbc8aa9fa14450fbb037b13dbd09b12
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.07987, -0.07953)
COMET INFO:     LossV [1000]            : (199.40614, 200.09109)
COMET INFO:     custom_reward [1000]    : -891.0
COMET INFO:     eval/avg_reward [20]    : (9.24, 9.44)
COMET INFO:     eval/avg_steps [20]     : (9.24, 9.44)
COMET INFO:     train/avg_reward [1000] : (12.608695652173912, 25.714285714285715)
COMET INFO:     train/avg_steps [1000]  : (12.608695652173912, 25.714285714285715)
COMET INFO:     train/reward [1000]     : (8.0, 67.0)
COMET INFO:     train/steps [1000]      : (8, 67)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_0
COMET INFO:     total params     : 87

Test ended after 9 steps with reward 9.0
Found files: ['video/openaigym.video.29.63.video000000.mp4']
Uploading file: video/openaigym.video.29.63.video000000.mp4


COMET INFO: Still uploading


In [None]:
observations_to_use = [1]

# Create an agent
agent = A2CAgent(env, args, device, observations_to_use, act_num)
run_agent(agent, args, tags=['observation_slice'],
          experiment_name=f'{args.algo}_{"".join([str(i) for i in observations_to_use])}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/989b33e7945d4bb0b579939b87fe433a

---------------------------------------
Iterations: 50
Steps: 1206
Episodes: 49
EpisodeReturn: 21.0
AverageReturn: 24.12
EvalEpisodes: 100
EvalEpisodeReturn: 12.0
EvalAverageReturn: 15.25
OtherLogs: {'LossPi': -0.07945, 'LossV': 199.34406, 'custom_reward': -891.0}
Time: 54
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2852
Episodes: 99
EpisodeReturn: 64.0
AverageReturn: 28.52
EvalEpisodes: 100
EvalEpisodeReturn: 30.0
EvalAverageReturn: 35.91
OtherLogs: {'LossPi': -0.07938, 'LossV': 199.26126, 'custom_reward': -891.0}
Time: 111
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 4272
Episodes: 149
EpisodeReturn: 25.0
AverageReturn: 28.48
EvalEpisodes: 100
EvalEpisodeReturn: 59.0
EvalAverageReturn: 37.12
OtherLogs: {'LossPi': -0.07938, 'LossV': 199.19159, 'custom_reward': -891.0}
Time: 167
---------------------------------------
---------------------------------------
Iterations: 2

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/989b33e7945d4bb0b579939b87fe433a
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.07953, -0.079)
COMET INFO:     LossV [1000]            : (197.71765, 199.40513)
COMET INFO:     custom_reward [1000]    : -891.0
COMET INFO:     eval/avg_reward [20]    : (15.25, 43.87)
COMET INFO:     eval/avg_steps [20]     : (15.25, 43.87)
COMET INFO:     train/avg_reward [1000] : (19.0, 35.20842332613391)
COMET INFO:     train/avg_steps [1000]  : (19.0, 35.20842332613391)
COMET INFO:     train/reward [1000]     : (9.0, 158.0)
COMET INFO:     train/steps [1000]      : (9, 158)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_1
COMET INFO:     total params     : 8771
COMET INFO:     trainab

Test ended after 29 steps with reward 29.0
Found files: ['video/openaigym.video.30.63.video000000.mp4']
Uploading file: video/openaigym.video.30.63.video000000.mp4


COMET INFO: Still uploading
COMET INFO: Waiting for completion of the file uploads (may take several seconds)
COMET INFO: Still uploading


In [None]:
observations_to_use = [2]

# Create an agent
agent = A2CAgent(env, args, device, observations_to_use, act_num)
run_agent(agent, args, tags=['observation_slice'],
          experiment_name=f'{args.algo}_{"".join([str(i) for i in observations_to_use])}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/3eaea6741e424a13b41a7289eed2cf3a

---------------------------------------
Iterations: 50
Steps: 952
Episodes: 49
EpisodeReturn: 14.0
AverageReturn: 19.04
EvalEpisodes: 100
EvalEpisodeReturn: 10.0
EvalAverageReturn: 9.21
OtherLogs: {'LossPi': -0.07894, 'LossV': 197.6695, 'custom_reward': -891.0}
Time: 54
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 2102
Episodes: 99
EpisodeReturn: 15.0
AverageReturn: 21.02
EvalEpisodes: 100
EvalEpisodeReturn: 8.0
EvalAverageReturn: 9.27
OtherLogs: {'LossPi': -0.07889, 'LossV': 197.61242, 'custom_reward': -891.0}
Time: 109
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 3321
Episodes: 149
EpisodeReturn: 10.0
AverageReturn: 22.14
EvalEpisodes: 100
EvalEpisodeReturn: 51.0
EvalAverageReturn: 40.83
OtherLogs: {'LossPi': -0.07884, 'LossV': 197.55224, 'custom_reward': -891.0}
Time: 165
---------------------------------------
---------------------------------------
Iterations: 200
St

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/3eaea6741e424a13b41a7289eed2cf3a
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.079, -0.07848)
COMET INFO:     LossV [1000]            : (196.81028, 197.71697)
COMET INFO:     custom_reward [1000]    : -891.0
COMET INFO:     eval/avg_reward [20]    : (9.2, 40.83)
COMET INFO:     eval/avg_steps [20]     : (9.2, 40.83)
COMET INFO:     train/avg_reward [1000] : (13.0, 22.257861635220127)
COMET INFO:     train/avg_steps [1000]  : (13.0, 22.257861635220127)
COMET INFO:     train/reward [1000]     : (8.0, 100.0)
COMET INFO:     train/steps [1000]      : (8, 100)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_2
COMET INFO:     total params     : 8771
COMET INFO:     trainable

Test ended after 40 steps with reward 40.0
Found files: ['video/openaigym.video.31.63.video000000.mp4']
Uploading file: video/openaigym.video.31.63.video000000.mp4


COMET INFO:     model-element [3]   : 3
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Still uploading


In [None]:
observations_to_use = [3]

# Create an agent
agent = A2CAgent(env, args, device, observations_to_use, act_num)
run_agent(agent, args, tags=['observation_slice'],
          experiment_name=f'{args.algo}_{"".join([str(i) for i in observations_to_use])}')

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/george-gca/rl-project-02/abd7930d07fc4c1e8c06b52c5f4925e6

---------------------------------------
Iterations: 50
Steps: 1943
Episodes: 49
EpisodeReturn: 108.0
AverageReturn: 38.86
EvalEpisodes: 100
EvalEpisodeReturn: 165.0
EvalAverageReturn: 126.28
OtherLogs: {'LossPi': -0.07836, 'LossV': 196.71343, 'custom_reward': -891.0}
Time: 61
---------------------------------------
---------------------------------------
Iterations: 100
Steps: 5678
Episodes: 99
EpisodeReturn: 74.0
AverageReturn: 56.78
EvalEpisodes: 100
EvalEpisodeReturn: 77.0
EvalAverageReturn: 93.76
OtherLogs: {'LossPi': -0.07826, 'LossV': 196.53376, 'custom_reward': -891.0}
Time: 127
---------------------------------------
---------------------------------------
Iterations: 150
Steps: 11116
Episodes: 149
EpisodeReturn: 102.0
AverageReturn: 74.11
EvalEpisodes: 100
EvalEpisodeReturn: 138.0
EvalAverageReturn: 180.44
OtherLogs: {'LossPi': -0.07816, 'LossV': 196.27846, 'custom_reward': -891.0}
Time: 201
---------------------------------------
---------------------------------------
Iterat

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/george-gca/rl-project-02/abd7930d07fc4c1e8c06b52c5f4925e6
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     LossPi [1000]           : (-0.07847, -0.07633)
COMET INFO:     LossV [1000]            : (190.55272, 196.8096)
COMET INFO:     custom_reward [1000]    : -891.0
COMET INFO:     eval/avg_reward [20]    : (71.1, 201.66)
COMET INFO:     eval/avg_steps [20]     : (71.1, 201.66)
COMET INFO:     train/avg_reward [1000] : (13.0, 141.1180625630676)
COMET INFO:     train/avg_steps [1000]  : (13.0, 141.1180625630676)
COMET INFO:     train/reward [1000]     : (9.0, 500.0)
COMET INFO:     train/steps [1000]      : (9, 500)
COMET INFO:   Others:
COMET INFO:     Name             : a2c_3
COMET INFO:     total params     : 8771
COMET INFO:     traina

Test ended after 157 steps with reward 157.0
Found files: ['video/openaigym.video.32.63.video000000.mp4']
Uploading file: video/openaigym.video.32.63.video000000.mp4


COMET INFO:     installed packages  : 1
COMET INFO:     model-element [4]   : 4
COMET INFO:     notebook            : 1
COMET INFO:     os packages         : 1
COMET INFO: ---------------------------
COMET INFO: Still uploading


#### Generating Video From Best Result

In [None]:
args = {
  'algo': 'a2c',
  'env': gym_env,
  'episode_bonus': None,
  'eval_num_episodes': eval_num_episodes,
  'eval_per_train': 50,
  'gpu_index': gpu_index,
  'iterations': episodes,
  'load': None,
  'max_steps': max_steps,
  'phase': 'train',
  'render': True,
  'render_in_train': False,
  'seed': seed,
  'step_bonus': None,
  'threshold_reward': 500,
}

args = Arguments(**args)

In [None]:
# Create an agent
observations_to_use = [0, 1, 2, 3]
agent = A2CAgent(wrap_env(gym.make(args.env)), args, device, observations_to_use, act_num)

In [None]:
# generate video from the best set of parameters
ckpt_path = Path('./save_model/CartPole-v1_a2c_default_best.pt')
# map_location lets a checkpoint saved on GPU be loaded on a CPU-only runtime
checkpoint = torch.load(ckpt_path, map_location=device)
agent.policy.load_state_dict(checkpoint['policy_state_dict'])
agent.vf.load_state_dict(checkpoint['vf_state_dict'])
agent.policy_optimizer.load_state_dict(checkpoint['policy_optimizer_state_dict'])
agent.vf_optimizer.load_state_dict(checkpoint['vf_optimizer_state_dict'])

agent.eval_mode = True

# Run one episode
step_length, episode_reward = agent.run(args.max_steps)
print(f'Ended after {step_length} steps with reward {episode_reward}')
agent.env.close()
show_video()

Ended after 500 steps with reward 500.0


#### Summarizing

- Advantages
  - Reduced variance and more stable training compared to REINFORCE
  - A2C uses state-values and advantages, which are much more stable over time than Q-values, so there is no need for target networks
  - It can deal with continuous action spaces easily

- Disadvantages
  - Utilizes two neural networks (one for the critic, one for the actor), increasing computation time and memory usage
  - A2C is an on-policy algorithm, and the major drawback of on-policy methods is their sample complexity: it is difficult to ensure that the “interesting” regions of the policy are actually discovered by the actor. If the actor is initialized in a flat region of the reward space (where rewards are scarce), policy gradient updates will change the policy only slightly, and it may take many iterations until interesting policies are discovered and fine-tuned. [[8]](https://julien-vitay.net/deeprl/ImportanceSampling.html)
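
To make the "state-values and advantages" point concrete, the snippet below is a minimal sketch of how an advantage estimate can be computed for one episode. It assumes Monte Carlo discounted returns as the target and uses made-up critic predictions; the function names `discounted_returns` and `advantages` are illustrative and are not taken from the A2C implementation used above.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, values, gamma=0.99):
    """Advantage estimate A_t = G_t - V(s_t), with V(s_t) given by the critic."""
    return [g - v for g, v in zip(discounted_returns(rewards, gamma), values)]


# Hypothetical three-step episode with CartPole-style rewards of +1 per step.
rewards = [1.0, 1.0, 1.0]
values = [2.5, 1.8, 0.9]  # made-up critic predictions, for illustration only
print(advantages(rewards, values, gamma=1.0))
```

Because the critic's estimate is subtracted from the return, the actor is updated with how much better or worse an action was than expected, which is what reduces the variance relative to using raw returns.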


---
# Off-policy Method

### DQN algorithm 

In project 1, we saw that Q-learning did not perform very well. The main reasons are:
1. Correlations between samples
2. Non-stationary targets

To tackle these problems, the Deep Q-Network (DQN) uses two strategies:

1. Experience replay
2. Fixed Q-targets
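
As an illustration of the first strategy, the following is a minimal sketch of an experience replay memory. It is a hypothetical stand-in written with the standard library only, not the buffer from `agents.common.buffers` used later in this notebook; the class name `ReplayBuffer` and its methods are assumptions made for this example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO memory of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # Once the buffer is full, the oldest transitions are dropped first.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


# Toy usage: integer "states" stand in for CartPole observations.
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(batch_size=2)
```

Sampling minibatches from this memory, instead of learning from transitions in the order they were collected, addresses the correlation problem listed above.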

As we have learned, DQN is an off-policy, value-based algorithm. The general process of the algorithm is as follows [[9]](https://julien-vitay.net/deeprl/Valuebased.html):

- Initialize value network $Q_{\theta}$ with random weights.
- Copy $Q_{\theta}$ to create the target network $Q_{\theta^{\prime}}$
- Initialize experience replay memory $\mathcal{D}$ of maximal size $N$.
- Observe the initial state $s_{0}$.
- for $t \in\left[0, T_{\text {total }}\right]$
  - Select the action $a_{t}$ based on the behavior policy derived from $Q_{\theta}\left(s_{t}, a\right)$ (e.g. $\epsilon$-greedy or softmax).
  - Perform the action $a_{t}$ and observe the next state $s_{t+1}$ and the reward $r_{t+1}$.
  - Store $\left(s_{t}, a_{t}, r_{t+1}, s_{t+1}\right)$ in the experience replay memory.
  - Every $T_{\text {train }}$ steps:
    - Sample a minibatch $\mathcal{D}_{s}$ randomly from $\mathcal{D}$.
    - For each transition $\left(s, a, r, s^{\prime}\right)$ in the minibatch:

      * Predict the Q-value of the greedy action in the next state using the target network:
\begin{equation}
\max _{a^{\prime}} Q_{\theta^{\prime}}\left(s^{\prime}, a^{\prime}\right)
\end{equation}
      
      * Compute the target value: 
\begin{equation}
y=r+\gamma \max _{a^{\prime}} Q_{\theta^{\prime}}\left(s^{\prime}, a^{\prime}\right)
\end{equation}

    - Train the value network $Q_{\theta}$ on $\mathcal{D}_{s}$ to minimize:
\begin{equation}
\mathcal{L}(\theta)=\mathbb{E}_{\mathcal{D}_{s}}\left[\left(y-Q_{\theta}(s, a)\right)^{2}\right]
\end{equation}
  - Every $T_{\text {target }}$ steps:
    - Update the target network with the trained value network: $\theta^{\prime} \leftarrow \theta$
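
The target and loss computations in the listing above can be sketched in a few lines. This is only an illustration under stated assumptions: small Q-tables (plain dictionaries) stand in for the value network $Q_{\theta}$ and the target network $Q_{\theta^{\prime}}$, the helper names `td_targets` and `mse_loss` are hypothetical, and the handling of terminal transitions (no bootstrapping when `done` is true) is a common convention that the listing does not spell out.

```python
def td_targets(batch, q_target, gamma=0.99):
    """Compute y = r + gamma * max_a' Q_target(s', a') for each transition."""
    targets = []
    for state, action, reward, next_state, done in batch:
        # Assumption: terminal transitions do not bootstrap from the next state.
        bootstrap = 0.0 if done else max(q_target[next_state])
        targets.append(reward + gamma * bootstrap)
    return targets


def mse_loss(batch, targets, q_online):
    """Mean of (y - Q(s, a))^2 over the minibatch."""
    errors = [(y - q_online[s][a]) ** 2 for (s, a, _, _, _), y in zip(batch, targets)]
    return sum(errors) / len(errors)


# Toy Q-tables standing in for the two networks (two states, two actions).
q_online = {0: [0.0, 1.0], 1: [2.0, 0.5]}
q_target = {s: list(v) for s, v in q_online.items()}  # hard copy: theta' <- theta

# Transitions are (state, action, reward, next_state, done).
batch = [(0, 1, 1.0, 1, False), (1, 0, 1.0, 0, True)]
targets = td_targets(batch, q_target, gamma=0.5)
loss = mse_loss(batch, targets, q_online)
```

Since `q_target` is a separate copy, updates to `q_online` between two synchronizations do not move the targets, which is the purpose of the fixed Q-targets strategy.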



In [None]:
!git clone https://github.com/dongminlee94/deep_rl.git > /dev/null 2>&1
%cd /content/deep_rl/

In [None]:
episodes = 1_000
eval_num_episodes = 100
gym_env = 'CartPole-v1'
gpu_index = 0
max_steps = 10_000
seed = 0

In [None]:
# Initialize environment
env = wrap_env(gym.make(gym_env))
obs_dim = env.observation_space.shape[0]
act_num = env.action_space.n

# Set a random seed
env.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

device = torch.device('cuda', index=gpu_index) if torch.cuda.is_available() else torch.device('cpu')

In [None]:
def show_video(video_path='video', prefix='', index=-1):
  """
  Taken from https://github.com/eleurent/highway-env

  :param video_path: (str) Path to the folder containing videos
  :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
  :param index: (int) Position of the video to show in the sorted list (default: the last one)
  """
  html = []
  mp4list = sorted(Path(video_path).glob(f'{prefix}*.mp4'))
  mp4 = mp4list[index]
  print(f'Found {mp4list}')
  print(f'Using {mp4}')
  video_b64 = base64.b64encode(mp4.read_bytes())
  html.append('''<video alt="{}" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{}" type="video/mp4" />
            </video>'''.format(mp4, video_b64.decode('ascii')))
  ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

In [None]:
import numpy as np
import torch
import torch.optim as optim
import torch.nn.functional as F

from agents.common.utils import *
from agents.common.buffers import *
from agents.common.networks import *


class DQNAgent(object):
   """An implementation of the Deep Q-Network (DQN), Double DQN agents."""

   def __init__(self,
                env,
                args,
                device,
                obs_dim,
                act_num,
                steps=0,
                gamma=0.99,
                epsilon=1.0,
                epsilon_decay=0.995,
                buffer_size=int(1e4),
                batch_size=64,
                target_update_step=100,
                eval_mode=False,
                q_losses=list(),
                logger=dict(),
                additional_reward_fn=None
   ):

      self.env = env
      self.args = args
      self.device = device
      if isinstance(obs_dim, list):
        self.observations_to_use = obs_dim
        self.obs_dim = len(obs_dim)
      else:
        self.observations_to_use = None
        self.obs_dim = obs_dim
      self.act_num = act_num
      self.steps = steps
      self.gamma = gamma
      self.epsilon = epsilon
      self.epsilon_decay = epsilon_decay
      self.buffer_size = buffer_size
      self.batch_size = batch_size
      self.target_update_step = target_update_step
      self.eval_mode = eval_mode
      # Copy the containers so that agents never share the mutable default arguments
      self.q_losses = list(q_losses)
      self.logger = dict(logger)
      
      #function to give extra rewards 
      self.additional_reward = additional_reward_fn

      # Main network
      self.qf = MLP(self.obs_dim, self.act_num).to(self.device)
      # Target network
      self.qf_target = MLP(self.obs_dim, self.act_num).to(self.device)
      
      # Initialize target parameters to match main parameters
      hard_target_update(self.qf, self.qf_target)

      # Create an optimizer
      self.qf_optimizer = optim.Adam(self.qf.parameters(), lr=1e-3)

      # Experience buffer
      self.replay_buffer = ReplayBuffer(self.obs_dim, 1, self.buffer_size, self.device)

   def select_action(self, obs):
      """Select an action from the set of available actions."""
      # Decaying epsilon
      self.epsilon *= self.epsilon_decay
      self.epsilon = max(self.epsilon, 0.01)

      if np.random.rand() <= self.epsilon:
         # Choose a random action with probability epsilon
         return np.random.randint(self.act_num)
      else:
         # Choose the action with highest Q-value at the current state
         action = self.qf(obs).argmax()
         return action.detach().cpu().numpy()

   def train_model(self):
      batch = self.replay_buffer.sample(self.batch_size)
      obs1 = batch['obs1']
      obs2 = batch['obs2']
      acts = batch['acts']
      rews = batch['rews']
      done = batch['done']

      if 0: # Check shape of experiences
         print("obs1", obs1.shape)
         print("obs2", obs2.shape)
         print("acts", acts.shape)
         print("rews", rews.shape)
         print("done", done.shape)

      # Prediction Q(s)
      q = self.qf(obs1).gather(1, acts.long()).squeeze(1)
      
      # Target for Q regression
      if self.args.algo == 'dqn':      # DQN
         q_target = self.qf_target(obs2)
      elif self.args.algo == 'ddqn':   # Double DQN
         q2 = self.qf(obs2)
         q_target = self.qf_target(obs2)
         q_target = q_target.gather(1, q2.max(1)[1].unsqueeze(1))
      q_backup = rews + self.gamma*(1-done)*q_target.max(1)[0]
      # Tensor.to() is not in-place, so keep the returned tensor
      q_backup = q_backup.to(self.device)

      if 0: # Check shape of prediction and target
         print("q", q.shape)
         print("q_backup", q_backup.shape)

      # Update prediction network parameters
      qf_loss = F.mse_loss(q, q_backup.detach())
      self.qf_optimizer.zero_grad()
      qf_loss.backward()
      self.qf_optimizer.step()

      # Synchronize target parameters theta' with theta every target_update_step steps
      if self.steps % self.target_update_step == 0:
         hard_target_update(self.qf, self.qf_target)
      
      # Save loss
      self.q_losses.append(qf_loss.item())

   def run(self, max_step):
      step_number = 0
      total_reward = 0.

      if self.additional_reward is not None and not self.eval_mode:
        reward_without_additional = 0.
      
      #observe the environment
      obs = self.env.reset()
      if self.observations_to_use is not None:
        obs = [ obs[i] for i in range(len(obs)) if i in self.observations_to_use ]
      done = False

      # Keep interacting until agent reaches a terminal state.
      while not (done or step_number == max_step):
         if self.args.render:
            self.env.render()       
      
         if self.eval_mode:
            q_value = self.qf(torch.Tensor(obs).to(self.device)).argmax()
            action = q_value.detach().cpu().numpy()
            next_obs, reward, done, _ = self.env.step(action)
            if self.observations_to_use is not None:
              next_obs = [ next_obs[i] for i in range(len(next_obs)) if i in self.observations_to_use ]
         else:
            self.steps += 1

            # Collect experience (s, a, r, s') using some policy
            action = self.select_action(torch.Tensor(obs).to(self.device))
            next_obs, reward, done, _ = self.env.step(action)

            #drop an observation dimension if necessary
            if self.observations_to_use is not None:
              next_obs = [ next_obs[i] for i in range(len(next_obs)) if i in self.observations_to_use ]
            
            #adjust the reward if necessary
            if self.additional_reward is not None:
              reward_without_additional += reward
              reward += self.additional_reward(step_number, next_obs)

            # Add experience to replay buffer
            self.replay_buffer.add(obs, action, reward, next_obs, done)
            
            # Start training when the number of experience is greater than batch_size
            if self.steps > self.batch_size:
               self.train_model()

         total_reward += reward
         step_number += 1
         obs = next_obs
      
      # Save total average losses
      if self.additional_reward is not None and not self.eval_mode:
        self.logger['LossQ'] = round(np.mean(self.q_losses), 5)
        self.logger['custom_reward'] = total_reward
        return step_number, reward_without_additional
      else:
        self.logger['LossQ'] = round(np.mean(self.q_losses), 5)
      return step_number, total_reward
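
Note that `select_action` multiplies epsilon by `epsilon_decay` on every training step, not once per episode, so exploration fades quickly. The sketch below estimates how many steps the default schedule (start 1.0, decay 0.995, floor 0.01) needs to reach its floor. The helper name `steps_until_floor` is hypothetical and exists only for this illustration.

```python
import math


def steps_until_floor(epsilon=1.0, decay=0.995, floor=0.01):
    """Count multiplicative decay steps until epsilon reaches its floor."""
    steps = 0
    while epsilon > floor:
        # Same update as in select_action: decay, then clip at the floor
        epsilon = max(epsilon * decay, floor)
        steps += 1
    return steps


n_steps = steps_until_floor()

# Closed form: smallest n with epsilon * decay**n <= floor
n_closed_form = math.ceil(math.log(0.01 / 1.0) / math.log(0.995))
print(n_steps, n_closed_form)  # 919 919
```

Under these defaults the agent acts almost greedily after roughly 919 environment steps, which is only a handful of episodes once the pole stays up for a few hundred steps.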


In [None]:
def run_agent(env, agent, args, episode_bonus_fn=None, tags=[], save_model=False, play_video=False):
    # Create an experiment logger to comet ml
    experiment = Experiment(workspace='george-gca', project_name='RL Project 02',
                            api_key='KKBIogMwA9HL9BuIAjOa6uoEQ',
                            auto_metric_logging=False)
    
    experiment_name = f'{args.algo}'
    if args.episode_bonus is not None:
      experiment_name += f'_{args.episode_bonus}'
    if args.step_bonus is not None:
      experiment_name += f'_{args.step_bonus}'

    if args.algo == 'dqn':
      experiment.log_parameter('gamma', agent.gamma)
      experiment.log_parameter('epsilon', agent.epsilon)
      experiment.log_parameter('batch_size', agent.batch_size)
      experiment.log_parameter('replay_buffer', agent.buffer_size)
      experiment.log_parameter('epsilon_decay', agent.epsilon_decay)
      experiment_name += f'_gamma_{agent.gamma}_epsilon_{agent.epsilon}_batch_{agent.batch_size}_replay_{agent.buffer_size}'

      trainable_params = sum(p.numel() for p in agent.qf.parameters() if p.requires_grad)
      trainable_params += sum(p.numel() for p in agent.qf_target.parameters() if p.requires_grad)

      total_params = sum(p.numel() for p in agent.qf.parameters())
      total_params += sum(p.numel() for p in agent.qf_target.parameters())

      experiment.log_other('trainable params', trainable_params)
      experiment.log_other('total params', total_params)

      if agent.observations_to_use is not None:
        experiment.log_parameter('observations_to_use', agent.observations_to_use)
      else:
        experiment.log_parameter('observations_to_use', [0, 1, 2, 3])
    
    if isinstance(agent.observations_to_use, list):
      observations_ids_str = "".join(map(str, agent.observations_to_use))
      experiment_name += f"_obs_[{observations_ids_str}]"

    experiment_name += f'_eps_{args.iterations}_steps_{args.max_steps}'
    print(f"Experiment Name: {experiment_name}")
    
    experiment.set_name(experiment_name)
    if len(tags) > 0:
      experiment.add_tags(tags)

    experiment.log_parameters(args)

    start_time = time.time()

    train_sum_steps = 0
    train_sum_rewards = 0.
    # train_num_episodes = 0

    if episode_bonus_fn is not None:
      train_sum_rewards_with_bonus = 0.

    # Main loop
    for i in range(args.iterations):
        # Perform the training phase, during which the agent learns
        if args.phase == 'train':
            agent.eval_mode = False
        
            # Run one episode
            train_step_length, train_episode_reward = agent.run(args.max_steps)
            
            train_sum_steps += train_step_length
            train_sum_rewards += train_episode_reward
            
            train_average_reward = train_sum_rewards / (i+1)
            train_average_steps = train_sum_steps / (i+1)

            # Log experiment result for training episodes
            metrics = {'train/steps': train_step_length, 'train/reward': train_episode_reward,
                       'train/avg_steps': train_average_steps, 'train/avg_reward': train_average_reward}

            if episode_bonus_fn is not None:
              finish_bonus = episode_bonus_fn(train_step_length)
              train_sum_rewards_with_bonus += train_episode_reward + finish_bonus
              metrics.update({'train/reward_with_bonus': train_episode_reward + finish_bonus,
                              'train/reward_bonus': finish_bonus,
                              'train/avg_reward_with_bonus': train_sum_rewards_with_bonus / (i+1)})

            metrics.update(agent.logger)
            experiment.log_metrics(metrics, epoch=i+1)

        # Perform the evaluation phase -- no learning
        if (i + 1) % args.eval_per_train == 0:
            eval_sum_rewards = 0.
            eval_sum_steps = 0
            agent.eval_mode = True
            
            for _ in range(args.eval_num_episodes):
                # Run one episode
                eval_step_length, eval_episode_reward = agent.run(args.max_steps)

                eval_sum_rewards += eval_episode_reward
                eval_sum_steps += eval_step_length

            eval_average_reward = eval_sum_rewards / args.eval_num_episodes
            eval_average_steps = eval_sum_steps / args.eval_num_episodes

            # Log experiment result for evaluation episodes
            metrics = {'eval/avg_steps': eval_average_steps, 'eval/avg_reward': eval_average_reward}
            experiment.log_metrics(metrics, epoch=i+1)
            
            if args.phase == 'train':
                print('---------------------------------------')
                print('Iterations:', i + 1)
                print('Steps:', train_sum_steps)
                print('Episodes:', i)
                print('EpisodeReturn:', round(train_episode_reward, 2))
                print('AverageReturn:', round(train_average_reward, 2))
                print('EvalEpisodes:', args.eval_num_episodes)
                print('EvalEpisodeReturn:', round(eval_episode_reward, 2))
                print('EvalAverageReturn:', round(eval_average_reward, 2))
                print('OtherLogs:', agent.logger)
                print('Time:', int(time.time() - start_time))
                print('---------------------------------------')

            elif args.phase == 'test':
                print('---------------------------------------')
                print('EvalEpisodes:', args.eval_num_episodes)
                print('EvalEpisodeReturn:', round(eval_episode_reward, 2))
                print('EvalAverageReturn:', round(eval_average_reward, 2))
                print('Time:', int(time.time() - start_time))
                print('---------------------------------------')
    
    # Save the trained model
    if save_model:
        print("Saving the trained model...")
        if not os.path.exists('./save_model'):
            os.mkdir('./save_model')
        
        ckpt_path = os.path.join('./save_model/' + args.env + '_' + args.algo \
                                                            + '_s_' + str(args.seed) \
                                                            + '_i_' + str(i + 1) \
                                                            + '_tr_' + str(round(train_episode_reward, 3)) \
                                                            + '_er_' + str(round(eval_episode_reward, 3)) + '.pt')
        
        if args.algo == 'dqn' or args.algo == 'ddqn':
            torch.save(agent.qf.state_dict(), ckpt_path)
        elif args.algo == 'a2c':
            torch.save(agent.policy.state_dict(), ckpt_path)
        
        experiment.log_asset(ckpt_path, file_name=f'{experiment_name}_ep_{args.iterations}', step=args.iterations)
        print("Model Saved!")

    #save the video
    mp4list = sorted(Path('./video').glob('*.mp4'))
    if len(mp4list) > 0:
        print(f'Found files: {mp4list}')
        print(f'Uploading file: {mp4list[-1]}')
        experiment.log_asset(mp4list[-1], file_name=f'{experiment_name}_ep_{args.iterations}.mp4', step=args.iterations)
    experiment.end()
    
    if play_video:
        show_video()

#### Hyperparameter Search 

In [None]:
for gamma in [0.1, 0.75, 0.9, 0.99]:
  for reward_mode in [0, 1, 2, 3]:
    for batch in [64, 256]:
      for replay_buffer in [500, int(1e4), int(1e5)]:
          # Use elif throughout so that exactly one branch runs per reward_mode
          if reward_mode == 0:
            step_bonus_fn = None
            episode_bonus = None
            args_step_bonus = None
            args_episode_bonus = 'default' 
          elif reward_mode == 1:
            step_bonus_fn = None
            episode_bonus = high_penalty
            args_step_bonus = None
            args_episode_bonus = 'high_penalty' 
          elif reward_mode == 2:
            # step_bonus is assumed to be the reward function defined earlier in
            # the notebook; a separate name keeps the loop from overwriting it
            step_bonus_fn = step_bonus
            episode_bonus = high_penalty
            args_step_bonus = 'step_bonus'
            args_episode_bonus = 'high_penalty'
          else:
            step_bonus_fn = pole_angle_based_reward
            episode_bonus = high_penalty
            args_step_bonus = 'pole_angle_bonus'
            args_episode_bonus = 'high_penalty'

          args = {
            'algo': 'dqn',
            'env': gym_env,
            'episode_bonus': args_episode_bonus,
            'eval_num_episodes': eval_num_episodes,
            'eval_per_train': 50,
            'gpu_index': gpu_index,
            'iterations': episodes,
            'load': None,
            'max_steps': max_steps,
            'phase': 'train',
            'render': False,
            'seed': seed,
            'step_bonus': args_step_bonus,
            'threshold_reward': 500,
          }
          args = Arguments(**args)
          agent = DQNAgent(env, args, device, obs_dim, act_num, buffer_size=replay_buffer, gamma=gamma, batch_size=batch, additional_reward_fn=step_bonus_fn)
          # run_agent takes play_video, not show_video
          run_agent(env, agent, args, episode_bonus_fn=episode_bonus, tags=['hyper-search'], play_video=False)
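
To get a feel for the cost of this search, the nested loops can be flattened into a single grid. The sketch below only counts configurations and does not train anything; the variable names are chosen for this illustration.

```python
from itertools import product

# Same value lists as in the nested loops of the search cell
gammas = [0.1, 0.75, 0.9, 0.99]
reward_modes = [0, 1, 2, 3]
batch_sizes = [64, 256]
buffer_sizes = [500, int(1e4), int(1e5)]

grid = list(product(gammas, reward_modes, batch_sizes, buffer_sizes))
total_runs = len(grid)

# Each run trains for `episodes = 1_000` episodes (set near the top of this section)
total_training_episodes = total_runs * 1_000
print(total_runs, total_training_episodes)  # 96 96000
```

The search therefore amounts to 96 separate Comet experiments, each with its own agent.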


#### Run 5 times with the best parameters

In [None]:
#reward_mode=3
step_bonus = pole_angle_based_reward
episode_bonus = high_penalty
args_step_bonus = 'pole_angle_bonus'
args_episode_bonus = 'high_penalty'

gamma=0.99
batch=256
replay_buffer=10000

args = {
  'algo': 'dqn',
  'env': gym_env,
  'episode_bonus': args_episode_bonus,
  'eval_num_episodes': eval_num_episodes,
  'eval_per_train': 50,
  'gpu_index': gpu_index,
  'iterations': episodes,
  'load': None,
  'max_steps': max_steps,
  'phase': 'train',
  'render': False,
  'seed': seed,
  'step_bonus': args_step_bonus,
  'threshold_reward': 500,
}

args = Arguments(**args)

for i in range(5):
  agent = DQNAgent(env, args, device, obs_dim, act_num, buffer_size=replay_buffer, gamma=gamma, batch_size=batch, additional_reward_fn=step_bonus)
  # run_agent takes play_video, not show_video
  run_agent(env, agent, args, episode_bonus_fn=episode_bonus, tags=['dqn','best'], play_video=True)

#### Occluding observation in state representation

In [None]:
step_bonus = None
episode_bonus = None
args_step_bonus = None
args_episode_bonus = 'default' 
gamma=0.99
batch=256
replay_buffer=10000

args = {
  'algo': 'dqn',
  'env': gym_env,
  'episode_bonus': args_episode_bonus,
  'eval_num_episodes': eval_num_episodes,
  'eval_per_train': 50,
  'gpu_index': gpu_index,
  'iterations': episodes,
  'load': None,
  'max_steps': max_steps,
  'phase': 'train',
  'render': False,
  'seed': seed,
  'step_bonus': args_step_bonus,
  'threshold_reward': 500,
}

args = Arguments(**args)
#observations_to_use = [1, 2, 3]
observations_to_use = [[0], [1], [2], [3], [0,1], [3,0], [3,1], [3,2], [0,1,2,3]]

# Create an agent
for obs in observations_to_use:
  agent = DQNAgent(env, args, device, obs, act_num, buffer_size=replay_buffer, gamma=gamma, batch_size=batch, additional_reward_fn=step_bonus)
  # run_agent takes play_video, not show_video
  run_agent(env, agent, args, episode_bonus_fn=episode_bonus, tags=['dqn','observation_slice'], play_video=True)
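
The occlusion itself happens inside `DQNAgent.run`, which keeps only the observation entries whose index appears in `observations_to_use`. The sketch below isolates that list comprehension in a hypothetical helper, `select_observations`, and applies it to a made-up observation.

```python
def select_observations(obs, observations_to_use):
    """Keep only the entries of obs whose index is listed in observations_to_use."""
    return [obs[i] for i in range(len(obs)) if i in observations_to_use]


# Made-up CartPole observation: position, velocity, pole angle, pole velocity
obs = [0.01, -0.02, 0.03, 0.04]

only_angle = select_observations(obs, [2])
position_and_pole_velocity = select_observations(obs, [3, 0])
print(only_angle, position_and_pole_velocity)  # [0.03] [0.01, 0.04]
```

Because the comprehension iterates over the indices of `obs`, the kept entries always appear in their original order, so `[3, 0]` and `[0, 3]` produce the same network input.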

**Discussion on DQN**

- Advantages
  - Comparatively simple
  - Experience replay reduces the correlation between the experiences used to update the network
  - Experience replay increases learning speed through mini-batches
  - Target network helps improve stability
- Disadvantages
  - Not applicable to every problem: it requires a finite set of discrete actions, so it cannot directly handle an infinite (continuous) set of possible actions.
  - If the number of possible state-action pairs in a given environment is relatively large, the Q-function can become extremely complicated, and estimating the optimal Q-values becomes intractable.
  - DQN finds very good policies, but at the cost of a very long training time. Very old transitions, generated by an initially bad policy, can be used to train the network for a very long time, and if the target network is not updated very often, the target values stay wrong for a long time. These two factors can make DQN very slow to learn.

# Discussion

**Please check the discussions in the report** (link provided in the beginning)

### Comparison between on-policy and off-policy methods


### Comparison between linear and non-linear function approximation