In [1]:
%matplotlib inline
import os, sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
from utils.helpers import launch_env, wrap_env, view_results_ipython, change_exercise, seedall, force_done, evaluate_policy
from utils.helpers import SteeringToWheelVelWrapper, ResizeWrapper, ImgWrapper

import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

import matplotlib.pyplot as plt

INFO:aido-protocols:aido-protocols 5.0.5
08:00:59|zn|__init__.py:6|<module>(): zn 2.0.3
08:01:00|zj|__init__.py:5|<module>(): zj 2.0.4
08:01:00|gym-duckietown|__init__.py:10|<module>(): gym-duckietown 5.0.3
08:01:00|gym-duckietown|__init__.py:24|reg_map_env(): Registering gym environment id: Duckietown-loop_pedestrians-v0
08:01:00|gym-duckietown|__init__.py:24|reg_map_env(): Registering gym environment id: Duckietown-small_loop_cw-v0
08:01:00|gym-duckietown|__init__.py:24|reg_map_env(): Registering gym environment id: Duckietown-loop_empty-v0
08:01:00|gym-duckietown|__init__.py:24|reg_map_env(): Registering gym environment id: Duckietown-loop_obstacles-v0
08:01:00|gym-duckietown|__init__.py:24|reg_map_env(): Registering gym environment id: Duckietown-zigzag_dists-v0

# Reinforcement Learning Basics

Reinforcement learning, as we saw in lecture, is the problem of learning a _policy_ that maximizes future (potentially discounted) rewards. Our policy, like the imitation learning network, maps raw image observations to wheel velocities; at every timestep, the agent receives a _reward_ from the environment.
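
To make "discounted rewards" concrete, here is a minimal sketch of how a discounted return could be computed from the per-step rewards of one episode. The function name `discounted_return` and the discount factor values are illustrative choices, not part of the exercise code.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma**t * rewards[t] for one episode."""
    g = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

A smaller `gamma` makes the agent care more about immediate rewards; `gamma=1.0` weights all timesteps equally.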

Rewards can be sparse (`1` if goal or task is completed, `0` otherwise) or dense; in general, dense rewards make it easier to learn policies, but as we'll see later in this exercise, defining the correct dense reward is an engineering challenge on its own.
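
As a toy illustration of the difference, consider a one-dimensional task where the agent must reach a goal position. Both functions below are hypothetical examples (the names, the `tolerance` parameter, and the task itself are assumptions made for this sketch); they are not the Duckietown rewards.

```python
def sparse_reward(position, goal, tolerance=0.05):
    """1 only when the goal is reached, 0 everywhere else."""
    return 1.0 if abs(goal - position) <= tolerance else 0.0

def dense_reward(position, goal):
    """Negative distance to the goal: informative at every step."""
    return -abs(goal - position)

for pos in (0.0, 0.5, 1.0):
    print(pos, sparse_reward(pos, goal=1.0), dense_reward(pos, goal=1.0))
```

The sparse version gives no signal until the goal is reached, while the dense version tells the agent at every step whether it is getting closer.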

Today's reinforcement learning algorithms often combine _value-based_ and _policy-gradient_ methods in what is called an _actor-critic_ formulation. Actor-critic methods have received a lot of research attention in recent years (especially in the deep reinforcement learning era), and later in this exercise we will rediscover the formulation's original problems and the methods currently used to stabilize learning.

We begin by defining two networks, an `Actor` and a `Critic`; in this exercise, we'll use a deep RL algorithm called _Deep Deterministic Policy Gradient_ (DDPG).
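
For reference, the updates usually associated with this algorithm can be summarized as below. The notation is introduced here for illustration and does not come from the exercise code: $\mu_\phi$ is the actor, $Q_\theta$ the critic, primes denote slowly updated target copies, $d$ is the episode-termination flag, $\gamma$ the discount factor, and $\tau$ a small mixing coefficient.

$$
\begin{aligned}
y &= r + \gamma\,(1 - d)\,Q_{\theta'}\big(s', \mu_{\phi'}(s')\big) \\
L(\theta) &= \mathbb{E}\big[\big(Q_\theta(s, a) - y\big)^2\big] \\
J(\phi) &= \mathbb{E}\big[Q_\theta\big(s, \mu_\phi(s)\big)\big] \\
\theta' &\leftarrow \tau\,\theta + (1 - \tau)\,\theta', \qquad \phi' \leftarrow \tau\,\phi + (1 - \tau)\,\phi'
\end{aligned}
$$

The critic is trained to minimize $L(\theta)$, and the actor is trained to maximize $J(\phi)$, that is, to output actions the critic scores highly.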

In [2]:
class Actor(nn.Module):
    def __init__(self, action_dim, max_action):
        super(Actor, self).__init__()
        
        # TODO: You'll need to change this!
        flat_size = 31968

        self.relu = nn.ReLU()
        self.tanh = nn.Tanh()

        self.conv1 = nn.Conv2d(3, 32, 8, stride=2)
        self.conv2 = nn.Conv2d(32, 32, 4, stride=2)

        self.bn1 = nn.BatchNorm2d(32)
        self.bn2 = nn.BatchNorm2d(32)

        self.dropout = nn.Dropout(.1)

        self.lin1 = nn.Linear(flat_size, 100)
        self.lin2 = nn.Linear(100, action_dim)

        self.max_action = max_action

    def forward(self, x):
        x = self.bn1(self.relu(self.conv1(x)))
        x = self.bn2(self.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)  # flatten
        x = self.dropout(x)
        x = self.relu(self.lin1(x))

        x = self.lin2(x)
        x = self.max_action * self.tanh(x)
        
        return x
    
class Critic(nn.Module):
    def __init__(self, action_dim, max_action):
        super(Critic, self).__init__()
        
        # TODO: You'll need to change this!
        flat_size = 31968

        self.relu = nn.ReLU()
        self.tanh = nn.Tanh()

        self.conv1 = nn.Conv2d(3, 32, 8, stride=2)
        self.conv2 = nn.Conv2d(32, 32, 4, stride=2)

        self.bn1 = nn.BatchNorm2d(32)
        self.bn2 = nn.BatchNorm2d(32)

        self.dropout = nn.Dropout(.1)

        # The action is concatenated with the flattened image features
        self.lin1 = nn.Linear(flat_size + action_dim, 100)
        # The critic outputs a single scalar Q-value, so no action scaling is needed
        self.lin2 = nn.Linear(100, 1)

    def forward(self, obs, action):
        x = self.bn1(self.relu(self.conv1(obs)))
        x = self.bn2(self.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)  # flatten
        x = self.dropout(x)
        x = torch.cat([x, action], 1)
        x = self.relu(self.lin1(x))

        x = self.lin2(x)  # unbounded scalar Q(s, a)
        
        return x
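
The `flat_size` marked with a TODO in both networks depends on the input resolution. A minimal sketch of how it could be derived, assuming unpadded convolutions with dilation 1 (the `nn.Conv2d` defaults used above); the helper names `conv_out` and `flat_size` are made up for this example:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of one spatial dimension after nn.Conv2d (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def flat_size(height, width, channels=32):
    # Same kernel sizes and strides as conv1 and conv2 above.
    for kernel, stride in ((8, 2), (4, 2)):
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
    return channels * height * width

print(flat_size(160, 120))  # 32 * 37 * 27 = 31968
```

For a 160 by 120 image this reproduces the hard-coded value of 31968; if you change the `ResizeWrapper` resolution or the convolution layers, recompute it the same way.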

## Reward Engineering

In this part of the exercise, we will experiment with the reward formulation. Keeping the model fixed, we'll see how different reward functions change the final trained policy.

In the section below, we'll look at the reward function implemented in `gym-duckietown` using a slightly modified training loop. Traditionally, we `reset()` the environment to start an episode and then repeatedly `step()` it forward by a fixed amount of time, executing a new action at each step. If this sounds a bit odd, especially to roboticists, you're right: in real robotics, most code runs asynchronously. So although `gym-duckietown` runs locally by pausing the environment between steps, `AIDO` submissions run asynchronously, executing the same action until a new one is received.
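
The "execute the same action until a new one is received" behaviour can be pictured with a small sketch. This is only an illustration of the idea, not how the `AIDO` infrastructure is implemented; the class `HoldLastAction` and the example actions are hypothetical.

```python
class HoldLastAction:
    """Keeps returning the most recent action until a new one arrives."""

    def __init__(self, initial_action):
        self.current = initial_action

    def update(self, new_action=None):
        # None models a control tick on which the policy produced nothing new.
        if new_action is not None:
            self.current = new_action
        return self.current

holder = HoldLastAction(initial_action=(0.0, 0.0))
# The policy only delivers an action on some ticks.
incoming = [(0.3, 0.3), None, None, (0.1, 0.4), None]
executed = [holder.update(a) for a in incoming]
print(executed)
```

A slow policy therefore does not pause the robot; it just means each action is applied for longer, which is worth keeping in mind when a policy trained in the synchronous simulator is deployed.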

In [3]:
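# Assumption: NotInLane is defined in the simulator module used by launch_env();
# the logged module path suggests simulation.gym_duckietown. Adjust if needed.
from simulation.gym_duckietown.simulator import NotInLane

def updated_reward(env):
    # Compute the collision avoidance penalty
    pos, angle, speed = env.cur_pos, env.cur_angle, env.speed
    col_penalty = env.proximity_penalty2(pos, angle) # zero or negative
    print('col_penalty: %.4f'% col_penalty)
    # Get the position relative to the right lane tangent
    try:
        lp = env.get_lane_pos2(pos, angle)
    except NotInLane:
        # Off the lane: only the scaled collision penalty applies
        reward = 40 * col_penalty
    else:
        # Compute the reward
        reward = (
                1 * speed * lp.dot_dir +  # progress along the lane direction
                1 * np.abs(lp.dist) +     # distance from the lane center
                1 * col_penalty           # proximity to obstacles
        )
    return reward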
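
To see how the weights shape behaviour, the sketch below re-creates the in-lane branch of `updated_reward` as a plain function and varies only the weight on the distance term. The function `lane_reward` and the alternative weight of `-10` are hypothetical and chosen purely for illustration.

```python
def lane_reward(speed, dot_dir, dist, col_penalty,
                w_speed=1.0, w_dist=1.0, w_col=1.0):
    """Weighted sum with the same three terms as the in-lane branch above."""
    return w_speed * speed * dot_dir + w_dist * abs(dist) + w_col * col_penalty

# Stationary robot, aligned with the lane, no obstacles nearby.
for dist in (0.0, 0.1, 0.3):
    same_sign = lane_reward(0.0, 1.0, dist, 0.0)                # weights from the cell above
    penalized = lane_reward(0.0, 1.0, dist, 0.0, w_dist=-10.0)  # hypothetical alternative
    print(f"dist={dist:.1f}  +1*|dist| -> {same_sign:+.2f}   -10*|dist| -> {penalized:+.2f}")
```

With a positive weight on `|dist|`, the reward grows as the robot drifts away from the lane center, so a policy trained on it has an incentive to leave the lane. Flipping the sign turns the same term into a penalty.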

In [4]:
nepisodes = 3

In [5]:
local_env = launch_env()
local_env = wrap_env(local_env)
local_env = ResizeWrapper(local_env)
local_env = ImgWrapper(local_env)

policy = Actor(2, 1.0)

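for _ in range(nepisodes):
    done = False
    obs = local_env.reset() # (3,160,120)
    
    while not done:
        # Drive with uniformly random actions in [0, 1); `policy` is not used yet
        obs, r, done, info = local_env.step(np.random.random(2))
        # Recompute the reward from the simulator state for comparison
        new_r = updated_reward(local_env)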
        print('Reward: %.4f --- Updated Reward: %.4f' % (r, new_r))

08:01:02|gym-duckietown|simulator.py:550|_load_map(): loading map file "/duckietown/simulation/gym_duckietown/maps/loop_empty.yaml"

action_dim:2

col_penalty: 0.0000
Reward: -0.1101 --- Updated Reward: 0.1310
col_penalty: 0.0000
Reward: -0.1101 --- Updated Reward: 0.1310
col_penalty: 0.0000
Reward: -0.1101 --- Updated Reward: 0.1310
col_penalty: 0.0000
Reward: -0.1101 --- Updated Reward: 0.1310
col_penalty: 0.0000
Reward: -0.1099 --- Updated Reward: 0.1310
col_penalty: 0.0000
Reward: -0.1099 --- Updated Reward: 0.1309
col_penalty: 0.0000
Reward: -0.1095 --- Updated Reward: 0.1308
col_penalty: 0.0000
Reward: -0.1084 --- Updated Reward: 0.1307
col_penalty: 0.0000
Reward: -0.1068 --- Updated Reward: 0.1305
col_penalty: 0.0000
Reward: -0.1042 --- Updated Reward: 0.1302
col_penalty: 0.0000
Reward: -0.1005 --- Updated Reward: 0.1297
col_penalty: 0.0000
Reward: -0.0956 --- Updated Reward: 0.1292
col_penalty: 0.0000
Reward: -0.1196 --- Updated Reward: 0.1311
col_penalty: 0.0000
Reward: -0.1328 --- Updated Reward: 0.1323
col_penalty: 0.0000
Reward: -0.1503 --- Updated Reward: 0.1337
col_penalty: 0.0000
Reward: -0.1691 --- Updated Reward: 0.1352
col_penalty: 0.0000
Reward: -0.1872 --- Updated Reward: 0.1368
col_penalty: 0.0000
Reward: -0.2121 --- Updated Reward: 0.1387
col_penalty: 0.0000
Reward: -0.2404 --- Updated Reward: 0.1407
col_penalty: 0.0000
Reward: -0.2708 --- Updated Reward: 0.1429
col_penalty: 0.0000
Reward: -0.3099 --- Updated Reward: 0.1457
col_penalty: 0.0000
Reward: -0.3536 --- Updated Reward: 0.1488
col_penalty: 0.0000
Reward: -0.4034 --- Updated Reward: 0.1522
col_penalty: 0.0000
Reward: -0.4478 --- Updated Reward: 0.1558
col_penalty: 0.0000
Reward: -0.5048 --- Updated Reward: 0.1598
col_penalty: 0.0000
Reward: -0.5667 --- Updated Reward: 0.1641
col_penalty: 0.0000
Reward: -0.6195 --- Updated Reward: 0.1687
col_penalty: 0.0000
Reward: -0.6844 --- Updated Reward: 0.1741
col_penalty: 0.0000
Reward: -0.7568 --- Updated Reward: 0.1796
col_penalty: 0.0000
Reward: -0.8307 --- Updated Reward: 0.1857
col_penalty: 0.0000
Reward: -0.9089 --- Updated Reward: 0.1917
col_penalty: 0.0000
Reward: -0.9880 --- Updated Reward: 0.1980
col_penalty: 0.0000
Reward: -1.0624 --- Updated Reward: 0.2037
col_penalty: 0.0000
Reward: -1.1382 --- Updated Reward: 0.2101
col_penalty: 0.0000
Reward: -1.2252 --- Updated Reward: 0.2166
col_penalty: 0.0000
Reward: -1.3016 --- Updated Reward: 0.2231
col_penalty: 0.0000
Reward: -1.3701 --- Updated Reward: 0.2296
col_penalty: 0.0000
Reward: -1.4311 --- Updated Reward: 0.2355
col_penalty: 0.0000
Reward: -1.4782 --- Updated Reward: 0.2410
col_penalty: 0.0000
Reward: -1.5214 --- Updated Reward: 0.2458
col_penalty: 0.0000
Reward: -1.5609 --- Updated Reward: 0.2501
col_penalty: 0.0000
Reward: -1.6066 --- Updated Reward: 0.2544
col_penalty: 0.0000
Reward: -1.6683 --- Updated Reward: 0.2593
col_penalty: 0.0000
Reward: -1.7253 --- Updated Reward: 0.2644
col_penalty: 0.0000
Reward: -1.7851 --- Updated Reward: 0.2697
col_penalty: 0.0000
Reward: -1.8407 --- Updated Reward: 0.2747
col_penalty: 0.0000
Reward: -1.8886 --- Updated Reward: 0.2798
col_penalty: 0.0000
Reward: -1.9462 --- Updated Reward: 0.2855
col_penalty: 0.0000
Reward: -2.0007 --- Updated Reward: 0.2914
col_penalty: 0.0000
Reward: -2.0530 --- Updated Reward: 0.2972
col_penalty: 0.0000
Reward: -2.1205 --- Updated Reward: 0.3034
col_penalty: 0.0000
Reward: -2.1853 --- Updated Reward: 0.3092
col_penalty: 0.0000
Reward: -2.2664 --- Updated Reward: 0.3156
col_penalty: 0.0000
Reward: -2.3505 --- Updated Reward: 0.3221
col_penalty: 0.0000
Reward: -2.4301 --- Updated Reward: 0.3282
col_penalty: 0.0000
Reward: -2.5308 --- Updated Reward: 0.3349
col_penalty: 0.0000
Reward: -2.6272 --- Updated Reward: 0.3419
col_penalty: 0.0000
Reward: -2.7133 --- Updated Reward: 0.3483
col_penalty: 0.0000
Reward: -2.7929 --- Updated Reward: 0.3545
col_penalty: 0.0000
Reward: -2.8768 --- Updated Reward: 0.3612
col_penalty: 0.0000
Reward: -2.9713 --- Updated Reward: 0.3680
col_penalty: 0.0000
Reward: -3.0617 --- Updated Reward: 0.3749
col_penalty: 0.0000
Reward: -3.1542 --- Updated Reward: 0.3825
col_penalty: 0.0000
Reward: -3.2564 --- Updated Reward: 0.3898
col_penalty: 0.0000
Reward: -3.3693 --- Updated Reward: 0.3975
col_penalty: 0.0000
Reward: -3.4750 --- Updated Reward: 0.4060
col_penalty: 0.0000
Reward: -3.5674 --- Updated Reward: 0.4142
col_penalty: 0.0000
Reward: -3.6577 --- Updated Reward: 0.4220
col_penalty: 0.0000
Reward: -1000.0000 --- Updated Reward: 0.4294

08:01:13|gym-duckietown|simulator.py:531|reset(): Starting at [3.42800181 0.         0.96675439] 0.03283258105572982

col_penalty: 0.0000
Reward: 0.9219 --- Updated Reward: 0.0277
col_penalty: 0.0000
Reward: 0.9219 --- Updated Reward: 0.0277
col_penalty: 0.0000
Reward: 0.9219 --- Updated Reward: 0.0277
col_penalty: 0.0000
Reward: 0.9219 --- Updated Reward: 0.0277
col_penalty: 0.0000
Reward: 0.9212 --- Updated Reward: 0.0278
col_penalty: 0.0000
Reward: 0.9204 --- Updated Reward: 0.0279
col_penalty: 0.0000
Reward: 0.9185 --- Updated Reward: 0.0281
col_penalty: 0.0000
Reward: 0.9169 --- Updated Reward: 0.0282
col_penalty: 0.0000
Reward: 0.9153 --- Updated Reward: 0.0284
col_penalty: 0.0000
Reward: 0.9131 --- Updated Reward: 0.0286
col_penalty: 0.0000
Reward: 0.9108 --- Updated Reward: 0.0288
col_penalty: 0.0000
Reward: 0.9086 --- Updated Reward: 0.0291


[2m08:01:14|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.477177056829412, 0, 0.9650863478284739] angle 0.014018871183508167[0m
[2m08:01:14|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.477177056829412, 0, 0.9650863478284739] angle 0.014018871183508167[0m
[2m08:01:14|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.4843431453925366, 0, 0.964990411468118] angle 0.012754626989764187[0m
[2m08:01:14|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.4843431453925366, 0, 0.964990411468118] angle 0.012754626989764187[0m
[2m08:01:14|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.4913770495848424, 0, 0.9648913892612265] angle 0.015399201486552477[0m
[2m08:01:14|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.4913770495848424, 0, 0.9648913892612265] angle 0.015399201486552477[0m
[2m08:01:14|[0mgym-duckietown[2m|simulator.py:1435|_r

col_penalty: 0.0000
Reward: 0.9068 --- Updated Reward: 0.0293
col_penalty: 0.0000
Reward: 0.9057 --- Updated Reward: 0.0294
col_penalty: 0.0000
Reward: 0.9048 --- Updated Reward: 0.0295
col_penalty: 0.0000
Reward: 0.9038 --- Updated Reward: 0.0296
col_penalty: 0.0000
Reward: 0.9024 --- Updated Reward: 0.0297
col_penalty: 0.0000
Reward: 0.9000 --- Updated Reward: 0.0299
col_penalty: 0.0000
Reward: 0.8946 --- Updated Reward: 0.0303


[2m08:01:14|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.5374795457348056, 0, 0.9630071569677039] angle 0.06576687027643449[0m
[2m08:01:14|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.5374795457348056, 0, 0.9630071569677039] angle 0.06576687027643449[0m
[2m08:01:14|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.548840050837324, 0, 0.9621181546758955] angle 0.09042236168929703[0m
[2m08:01:14|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.548840050837324, 0, 0.9621181546758955] angle 0.09042236168929703[0m
[2m08:01:14|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.5608095561684947, 0, 0.9609076922502697] angle 0.11115005834147924[0m
[2m08:01:14|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.5608095561684947, 0, 0.9609076922502697] angle 0.11115005834147924[0m
[2m08:01:14|[0mgym-duckietown[2m|simulator.py:1435|_rende

col_penalty: 0.0000
Reward: 0.8829 --- Updated Reward: 0.0311
col_penalty: 0.0000
Reward: 0.8622 --- Updated Reward: 0.0325
col_penalty: 0.0000
Reward: 0.8294 --- Updated Reward: 0.0345
col_penalty: 0.0000
Reward: 0.7825 --- Updated Reward: 0.0374
col_penalty: 0.0000
Reward: 0.7228 --- Updated Reward: 0.0409
col_penalty: 0.0000
Reward: 0.6474 --- Updated Reward: 0.0451


[2m08:01:14|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.607264392246033, 0, 0.9529912240605523] angle 0.21746502779259366[0m
[2m08:01:14|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.607264392246033, 0, 0.9529912240605523] angle 0.21746502779259366[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.6191281794909895, 0, 0.9503518473545323] angle 0.2203514928608858[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.6191281794909895, 0, 0.9503518473545323] angle 0.2203514928608858[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.6297893830714982, 0, 0.947938546556141] angle 0.2248710601878647[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.6297893830714982, 0, 0.947938546556141] angle 0.2248710601878647[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(

col_penalty: 0.0000
Reward: 0.5573 --- Updated Reward: 0.0503
col_penalty: 0.0000
Reward: 0.4611 --- Updated Reward: 0.0563
col_penalty: 0.0000
Reward: 0.3593 --- Updated Reward: 0.0629
col_penalty: 0.0000
Reward: 0.2689 --- Updated Reward: 0.0693
col_penalty: 0.0000
Reward: 0.1596 --- Updated Reward: 0.0765
col_penalty: 0.0000
Reward: 0.0598 --- Updated Reward: 0.0837


[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.6721313011778403, 0, 0.9382613041315033] angle 0.2188890355056301[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.6721313011778403, 0, 0.9382613041315033] angle 0.2188890355056301[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.681582528124741, 0, 0.9361882527054042] angle 0.21295623696928193[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.681582528124741, 0, 0.9361882527054042] angle 0.21295623696928193[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.6909277442684867, 0, 0.9341790215671475] angle 0.21059806060023423[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.6909277442684867, 0, 0.9341790215671475] angle 0.21059806060023423[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_

col_penalty: 0.0000
Reward: -0.0402 --- Updated Reward: 0.0909
col_penalty: 0.0000
Reward: -0.1298 --- Updated Reward: 0.0978
col_penalty: 0.0000
Reward: -0.2195 --- Updated Reward: 0.1048
col_penalty: 0.0000
Reward: -0.3072 --- Updated Reward: 0.1117
col_penalty: 0.0000
Reward: -0.3863 --- Updated Reward: 0.1178
col_penalty: 0.0000
Reward: -0.4717 --- Updated Reward: 0.1243


[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.7241995234086724, 0, 0.9268887442094371] angle 0.24219796037274907[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.7241995234086724, 0, 0.9268887442094371] angle 0.24219796037274907[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.731564521164257, 0, 0.9249903839524527] angle 0.26232896039019155[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.731564521164257, 0, 0.9249903839524527] angle 0.26232896039019155[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.739043271511851, 0, 0.9229276009273177] angle 0.2759263712956548[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.739043271511851, 0, 0.9229276009273177] angle 0.2759263712956548[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_im

col_penalty: 0.0000
Reward: -0.5699 --- Updated Reward: 0.1310
col_penalty: 0.0000
Reward: -0.6856 --- Updated Reward: 0.1380
col_penalty: 0.0000
Reward: -0.7836 --- Updated Reward: 0.1443
col_penalty: 0.0000
Reward: -0.8784 --- Updated Reward: 0.1508
col_penalty: 0.0000
Reward: -0.9663 --- Updated Reward: 0.1572
col_penalty: 0.0000
Reward: -1.0542 --- Updated Reward: 0.1641


[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.7604149189331437, 0, 0.9165307382922303] angle 0.2976456406628963[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.7676071370336475, 0, 0.9142765816779361] angle 0.30979276900465597[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.7676071370336475, 0, 0.9142765816779361] angle 0.30979276900465597[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.7757082903068024, 0, 0.9116169711597268] angle 0.3246351238800532[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.7757082903068024, 0, 0.9116169711597268] angle 0.3246351238800532[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.7837656317850312, 0, 0.9088628485873631] angle 0.3340944746443156[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_

col_penalty: 0.0000
Reward: -1.1351 --- Updated Reward: 0.1702
col_penalty: 0.0000
Reward: -1.2231 --- Updated Reward: 0.1768
col_penalty: 0.0000
Reward: -1.3324 --- Updated Reward: 0.1845
col_penalty: 0.0000
Reward: -1.4285 --- Updated Reward: 0.1922
col_penalty: 0.0000
Reward: -1.5345 --- Updated Reward: 0.2006
col_penalty: 0.0000
Reward: -1.6478 --- Updated Reward: 0.2094
col_penalty: 0.0000
Reward: -1.7585 --- Updated Reward: 0.2191


[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.8216820103960587, 0, 0.8950327504498716] angle 0.35810166586893155[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.8216820103960587, 0, 0.8950327504498716] angle 0.35810166586893155[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.8321122035777755, 0, 0.8910498485388261] angle 0.37144569955509554[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.8321122035777755, 0, 0.8910498485388261] angle 0.37144569955509554[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.8417736988438875, 0, 0.8872224193862626] angle 0.38292529865868086[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.8417736988438875, 0, 0.8872224193862626] angle 0.38292529865868086[0m
[2m08:01:15|[0mgym-duckietown[2m|simulator.py:1435|_ren

col_penalty: 0.0000
Reward: -1.8812 --- Updated Reward: 0.2296
col_penalty: 0.0000
Reward: -2.0102 --- Updated Reward: 0.2401
col_penalty: 0.0000
Reward: -2.1308 --- Updated Reward: 0.2500
col_penalty: 0.0000
Reward: -2.2417 --- Updated Reward: 0.2600
col_penalty: 0.0000
Reward: -2.3629 --- Updated Reward: 0.2701
col_penalty: 0.0000
Reward: -2.4734 --- Updated Reward: 0.2806


[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.8923698183817996, 0, 0.8662861208676769] angle 0.40555806961471824[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.8923698183817996, 0, 0.8662861208676769] angle 0.40555806961471824[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.901355113167131, 0, 0.8623830369672496] angle 0.41402896308099535[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.901355113167131, 0, 0.8623830369672496] angle 0.41402896308099535[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.910470801487975, 0, 0.8583471541117209] angle 0.41957185009601033[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.910470801487975, 0, 0.8583471541117209] angle 0.41957185009601033[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_

col_penalty: 0.0000
Reward: -2.5945 --- Updated Reward: 0.2914
col_penalty: 0.0000
Reward: -2.7274 --- Updated Reward: 0.3026
col_penalty: 0.0000
Reward: -2.8414 --- Updated Reward: 0.3121
col_penalty: 0.0000
Reward: -2.9538 --- Updated Reward: 0.3218
col_penalty: 0.0000
Reward: -3.0450 --- Updated Reward: 0.3315
col_penalty: 0.0000
Reward: -3.1341 --- Updated Reward: 0.3411


[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.9369489427764273, 0, 0.8469150476629168] angle 0.3910969018400132[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.9445109923399184, 0, 0.8438128234649427] angle 0.3875014282116624[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.9445109923399184, 0, 0.8438128234649427] angle 0.3875014282116624[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.9524962487963253, 0, 0.8406150385387248] angle 0.37430627338483574[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.9524962487963253, 0, 0.8406150385387248] angle 0.37430627338483574[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.9608360578618798, 0, 0.8373972971618318] angle 0.3621552246743959[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_

col_penalty: 0.0000
Reward: -3.2192 --- Updated Reward: 0.3499
col_penalty: 0.0000
Reward: -3.2945 --- Updated Reward: 0.3579
col_penalty: 0.0000
Reward: -3.3714 --- Updated Reward: 0.3662
col_penalty: 0.0000
Reward: -3.4439 --- Updated Reward: 0.3749
col_penalty: 0.0000
Reward: -3.5310 --- Updated Reward: 0.3841
col_penalty: 0.0000
Reward: -3.6024 --- Updated Reward: 0.3929


[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.9967579680809857, 0, 0.8249324539177385] angle 0.3005909295580434[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.9967579680809857, 0, 0.8249324539177385] angle 0.3005909295580434[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [4.00530298030141, 0, 0.8223864498278881] angle 0.2785628771737668[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [4.00530298030141, 0, 0.8223864498278881] angle 0.2785628771737668[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [4.013308329079746, 0, 0.8201560627696107] angle 0.2648796095499486[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [4.013308329079746, 0, 0.8201560627696107] angle 0.2648796095499486[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): 

col_penalty: 0.0000
Reward: -3.6865 --- Updated Reward: 0.4023
col_penalty: 0.0000
Reward: -3.7651 --- Updated Reward: 0.4115
col_penalty: 0.0000
Reward: -3.8249 --- Updated Reward: 0.4200
col_penalty: 0.0000
Reward: -3.8968 --- Updated Reward: 0.4279
col_penalty: 0.0000
Reward: -3.9671 --- Updated Reward: 0.4354
col_penalty: 0.0000
Reward: -4.0292 --- Updated Reward: 0.4432


[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1107|_drivable_pos(): [0m[35m[4.10111294 0.         0.79944459] corresponds to tile at (7, 1) which is not drivable: {'coords': (7, 1), 'kind': 'floor', 'angle': 0, 'drivable': False, 'texture': <simulation.gym_duckietown.graphics.Texture object at 0x7f0ff8e54588>, 'color': array([1, 1, 1])}[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1217|_valid_pose(): [0m[35mInvalid pose. Collision free: True On drivable area: False[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1218|_valid_pose(): [0m[35msafety_factor: 1.0[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1219|_valid_pose(): [0m[35mpos: [4.01341071 0.         0.81965148][0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1220|_valid_pose(): [0m[35ml_pos: [3.99657164 0.         0.74656629][0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1221|_valid_pose(): [0m[35mr_pos: [4.03024979 0.         0.89273667][0m
[2m08:01:16|[0mgym-duckietown[2m|simu

[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1217|_valid_pose(): [0m[35mInvalid pose. Collision free: True On drivable area: False[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1218|_valid_pose(): [0m[35msafety_factor: 1.3[0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1219|_valid_pose(): [0m[35mpos: [3.50618012 0.         1.81369074][0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1220|_valid_pose(): [0m[35ml_pos: [3.56298004 0.         1.73444416][0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1221|_valid_pose(): [0m[35mr_pos: [3.44938021 0.         1.89293731][0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1222|_valid_pose(): [0m[35mf_pos: [3.60127601 0.         1.88185064][0m
[2m08:01:16|[0mgym-duckietown[2m|simulator.py:1107|_drivable_pos(): [0m[35m[4.11818143 0.         1.90573522] corresponds to tile at (7, 3) which is not drivable: {'coords': (7, 3), 'kind': 'floor', 'angle': 0, 'drivable': False, 'texture': <simulation.gym_du

col_penalty: 0.0000
Reward: -1000.0000 --- Updated Reward: 0.4507


[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.7855499619477313, 0, 2.103742738477171] angle -1.5811139864690011[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.7855499619477313, 0, 2.103742738477171] angle -1.5811139864690011[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.7855499619477313, 0, 2.103742738477171] angle -1.5811139864690011[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.7855499619477313, 0, 2.103742738477171] angle -1.5811139864690011[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.785541944564376, 0, 2.105284853715503] angle -1.5708764777682305[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.7855411632734466, 0, 2.1088135355861777] angle -1.5711589989180472[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_im

col_penalty: 0.0000
Reward: 0.1994 --- Updated Reward: 0.1000
col_penalty: 0.0000
Reward: 0.1994 --- Updated Reward: 0.1000
col_penalty: 0.0000
Reward: 0.1994 --- Updated Reward: 0.1000
col_penalty: 0.0000
Reward: 0.1994 --- Updated Reward: 0.1000
col_penalty: 0.0000
Reward: 0.1996 --- Updated Reward: 0.1000
col_penalty: 0.0000
Reward: 0.1996 --- Updated Reward: 0.1000
col_penalty: 0.0000
Reward: 0.1995 --- Updated Reward: 0.1001
col_penalty: 0.0000
Reward: 0.1987 --- Updated Reward: 0.1001
col_penalty: 0.0000
Reward: 0.1970 --- Updated Reward: 0.1003
col_penalty: 0.0000
Reward: 0.1942 --- Updated Reward: 0.1005
col_penalty: 0.0000
Reward: 0.1892 --- Updated Reward: 0.1009
col_penalty: 0.0000
Reward: 0.1802 --- Updated Reward: 0.1015


[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.789240799956674, 0, 2.169072884391931] angle -1.4289661386964367[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.7908609539079796, 0, 2.179453444255895] angle -1.4029731809283048[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.7929235346414063, 0, 2.1909114383466948] angle -1.3824096222396929[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.795094635220923, 0, 2.2020132133828962] angle -1.3729311191966143[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.7970705750631994, 0, 2.211685441529286] angle -1.3656271916933436[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.799364769145895, 0, 2.2226327406682866] angle -1.3628103154019713[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_im

col_penalty: 0.0000
Reward: 0.1671 --- Updated Reward: 0.1024
col_penalty: 0.0000
Reward: 0.1505 --- Updated Reward: 0.1037
col_penalty: 0.0000
Reward: 0.1295 --- Updated Reward: 0.1054
col_penalty: 0.0000
Reward: 0.1045 --- Updated Reward: 0.1074
col_penalty: 0.0000
Reward: 0.0806 --- Updated Reward: 0.1096
col_penalty: 0.0000
Reward: 0.0591 --- Updated Reward: 0.1116
col_penalty: 0.0000
Reward: 0.0355 --- Updated Reward: 0.1139
col_penalty: 0.0000
Reward: 0.0105 --- Updated Reward: 0.1164
col_penalty: 0.0000
Reward: -0.0105 --- Updated Reward: 0.1185
col_penalty: 0.0000
Reward: -0.0317 --- Updated Reward: 0.1205
col_penalty: 0.0000
Reward: -0.0546 --- Updated Reward: 0.1228
col_penalty: 0.0000
Reward: -0.0750 --- Updated Reward: 0.1248


[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.814151302726044, 0, 2.2918961376812166] angle -1.3503041415155757[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.8161954496374237, 0, 2.3009928775194854] angle -1.3492078416357214[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.8181720793608216, 0, 2.3100455866176395] angle -1.3624391235484998[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.8201196204812122, 0, 2.3191617003903504] angle -1.3582070932837773[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.821870716881784, 0, 2.327291364968686] angle -1.3590764899239884[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.823896613398341, 0, 2.336778711611224] angle -1.361762013997794[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_im

col_penalty: 0.0000
Reward: -0.0935 --- Updated Reward: 0.1266
col_penalty: 0.0000
Reward: -0.1156 --- Updated Reward: 0.1287
col_penalty: 0.0000
Reward: -0.1363 --- Updated Reward: 0.1307
col_penalty: 0.0000
Reward: -0.1527 --- Updated Reward: 0.1327
col_penalty: 0.0000
Reward: -0.1732 --- Updated Reward: 0.1346
col_penalty: 0.0000
Reward: -0.1905 --- Updated Reward: 0.1364
col_penalty: 0.0000
Reward: -0.2101 --- Updated Reward: 0.1384
col_penalty: 0.0000
Reward: -0.2372 --- Updated Reward: 0.1407
col_penalty: 0.0000
Reward: -0.2723 --- Updated Reward: 0.1435
col_penalty: 0.0000
Reward: -0.3098 --- Updated Reward: 0.1465
col_penalty: 0.0000
Reward: -0.3508 --- Updated Reward: 0.1498
col_penalty: 0.0000
Reward: -0.3979 --- Updated Reward: 0.1532


[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.840954930622251, 0, 2.415881180946249] angle -1.3464113422743218[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.8436110869905433, 0, 2.427223511971797] angle -1.3351102922408955[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.8464440593371587, 0, 2.4385221963911716] angle -1.315142293266352[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.849722887800085, 0, 2.4505353736953968] angle -1.2935577439168153[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.8532236154128987, 0, 2.462628170228935] angle -1.2844635167164729[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.856290155228373, 0, 2.472895000730829] angle -1.2766307761596034[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(

col_penalty: 0.0000
Reward: -0.4523 --- Updated Reward: 0.1573
col_penalty: 0.0000
Reward: -0.5091 --- Updated Reward: 0.1617
col_penalty: 0.0000
Reward: -0.5772 --- Updated Reward: 0.1664
col_penalty: 0.0000
Reward: -0.6543 --- Updated Reward: 0.1716
col_penalty: 0.0000
Reward: -0.7444 --- Updated Reward: 0.1776
col_penalty: 0.0000
Reward: -0.8351 --- Updated Reward: 0.1843
col_penalty: 0.0000
Reward: -0.9243 --- Updated Reward: 0.1902
col_penalty: 0.0000
Reward: -1.0074 --- Updated Reward: 0.1966
col_penalty: 0.0000
Reward: -1.0937 --- Updated Reward: 0.2030
col_penalty: 0.0000
Reward: -1.1755 --- Updated Reward: 0.2094
col_penalty: 0.0000
Reward: -1.2540 --- Updated Reward: 0.2152


[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.8759182092207234, 0, 2.5313983103880116] angle -1.2202014051923264[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.87987488551801, 0, 2.541932542604143] angle -1.202794953100895[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.8839669867459428, 0, 2.552308738140466] angle -1.1875038094984305[0m
[2m08:01:17|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.8879073926974828, 0, 2.561990926806659] angle -1.1810797193671712[0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.8919951953821426, 0, 2.5718594494279694] angle -1.1750928408626466[0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.8963890117363027, 0, 2.5823894529458356] angle -1.1758958096718644[0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1435|_render_im

col_penalty: 0.0000
Reward: -1.3466 --- Updated Reward: 0.2224
col_penalty: 0.0000
Reward: -1.4521 --- Updated Reward: 0.2299
col_penalty: 0.0000
Reward: -1.5665 --- Updated Reward: 0.2380
col_penalty: 0.0000
Reward: -1.6825 --- Updated Reward: 0.2463
col_penalty: 0.0000
Reward: -1.7815 --- Updated Reward: 0.2543
col_penalty: 0.0000
Reward: -1.8839 --- Updated Reward: 0.2627
col_penalty: 0.0000
Reward: -1.9943 --- Updated Reward: 0.2717
col_penalty: 0.0000
Reward: -2.1069 --- Updated Reward: 0.2815
col_penalty: 0.0000
Reward: -2.2238 --- Updated Reward: 0.2915
col_penalty: 0.0000
Reward: -2.3428 --- Updated Reward: 0.3011
col_penalty: 0.0000
Reward: -2.4668 --- Updated Reward: 0.3114


[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.9241183645204005, 0, 2.648464643147203] angle -1.173789662389367[0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.9283797189076015, 0, 2.6584131766894235] angle -1.1584105657612678[0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.9328111678648145, 0, 2.668159026795947] angle -1.1296690961519282[0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.93747572044871, 0, 2.677751573481909] angle -1.1067193008745058[0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.9421048885061603, 0, 2.6867344757813636] angle -1.0831992449540357[0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.9462601193711744, 0, 2.694366172628333] angle -1.061217994527029[0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1435|_render_img(

col_penalty: 0.0000
Reward: -2.5768 --- Updated Reward: 0.3217
col_penalty: 0.0000
Reward: -2.6775 --- Updated Reward: 0.3310
col_penalty: 0.0000
Reward: -2.8019 --- Updated Reward: 0.3402
col_penalty: 0.0000
Reward: -2.9337 --- Updated Reward: 0.3496
col_penalty: 0.0000
Reward: -3.0698 --- Updated Reward: 0.3591
col_penalty: 0.0000
Reward: -3.1954 --- Updated Reward: 0.3683
col_penalty: 0.0000
Reward: -3.3080 --- Updated Reward: 0.3763
col_penalty: 0.0000
Reward: -3.4127 --- Updated Reward: 0.3844
col_penalty: 0.0000
Reward: -3.5274 --- Updated Reward: 0.3936
col_penalty: 0.0000
Reward: -3.6375 --- Updated Reward: 0.4031
col_penalty: 0.0000
Reward: -3.7627 --- Updated Reward: 0.4134
col_penalty: 0.0000
Reward: -3.8813 --- Updated Reward: 0.4237


[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.978072043363679, 0, 2.7463494904124657] angle -1.000199838735178[0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.9843378597540577, 0, 2.756025646299708] angle -0.9920877183836424[0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.990787493978155, 0, 2.7659281516630774] angle -0.9949090241903511[0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [3.996918712249813, 0, 2.775555617144254] angle -1.0125378889468344[0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [4.002565080420937, 0, 2.7848654277495375] angle -1.038689365338237[0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [4.008378684210718, 0, 2.795053393673273] angle -1.0658088403215813[0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1435|_render_img():

col_penalty: 0.0000
Reward: -3.9879 --- Updated Reward: 0.4339
col_penalty: 0.0000
Reward: -4.1168 --- Updated Reward: 0.4449
col_penalty: 0.0000
Reward: -4.2362 --- Updated Reward: 0.4563
col_penalty: 0.0000
Reward: -4.3344 --- Updated Reward: 0.4673
col_penalty: 0.0000
Reward: -4.4085 --- Updated Reward: 0.4777
col_penalty: 0.0000
Reward: -4.4978 --- Updated Reward: 0.4888
col_penalty: 0.0000
Reward: -4.5771 --- Updated Reward: 0.4989
col_penalty: 0.0000
Reward: -4.6561 --- Updated Reward: 0.5087
col_penalty: 0.0000
Reward: -4.7225 --- Updated Reward: 0.5173
col_penalty: 0.0000
Reward: -4.8039 --- Updated Reward: 0.5254
col_penalty: 0.0000
Reward: -4.9027 --- Updated Reward: 0.5345


[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1435|_render_img(): [0m[32mPos: [4.038462251288956, 0, 2.8573336090219126] angle -1.111925127786953[0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1107|_drivable_pos(): [0m[35m[4.09507326 0.         2.80259611] corresponds to tile at (7, 4) which is not drivable: {'coords': (7, 4), 'kind': 'floor', 'angle': 0, 'drivable': False, 'texture': <simulation.gym_duckietown.graphics.Texture object at 0x7f0ff8e54588>, 'color': array([1, 1, 1])}[0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1217|_valid_pose(): [0m[35mInvalid pose. Collision free: True On drivable area: False[0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1218|_valid_pose(): [0m[35msafety_factor: 1.0[0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1219|_valid_pose(): [0m[35mpos: [4.02783178 0.         2.83581634][0m
[2m08:01:18|[0mgym-duckietown[2m|simulator.py:1220|_valid_pose(): [0m[35ml_pos: [4.09507326 0.         2.80259611][0m
[2m08:01

col_penalty: 0.0000
Reward: -5.0081 --- Updated Reward: 0.5437
col_penalty: 0.0000
Reward: -1000.0000 --- Updated Reward: 0.5534


In [6]:
view_results_ipython(local_env)

In [7]:
print(local_env.cur_pos, local_env.cur_angle)
lp = local_env.get_lane_pos2(local_env.cur_pos, local_env.cur_angle)
print(lp)
print(local_env.proximity_penalty2(local_env.cur_pos, local_env.cur_angle))

[4.038462251288956, 0, 2.8573336090219126] -1.111925127786953
LanePosition(dist=-0.5533891828846278, dot_dir=0.3318322128845906, angle_deg=-70.61997886041287, angle_rad=-1.2325511488029977)
0.0
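
The fields of the printed `LanePosition` appear to be related in a simple way, and this can be checked using only the numbers from the output above. The sketch below is an observation about this one sample, not a statement about how `get_lane_pos2` is implemented: `angle_deg` looks like `angle_rad` in degrees, and `dot_dir` looks like the cosine of that angle.

```python
import math

# Values copied from the LanePosition printed above
dot_dir = 0.3318322128845906
angle_deg = -70.61997886041287
angle_rad = -1.2325511488029977

# angle_deg appears to be angle_rad converted to degrees
deg_matches = math.isclose(math.degrees(angle_rad), angle_deg, abs_tol=1e-5)

# dot_dir appears to be the cosine of the heading error, so a value near 1.0
# would mean "aligned with the lane" and a value near 0.0 "perpendicular to it"
cos_matches = math.isclose(math.cos(angle_rad), dot_dir, abs_tol=1e-5)

print(deg_matches, cos_matches)
```

Read this way, the final state of the rollout is roughly 70 degrees off the lane direction and about 0.55 units from the lane center (`dist`).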


**Question 0: After understanding the reward computed above, experiment with the constants for each component. What type of behavior does this reward function penalize? Is this good or bad in the context of autonomous driving? Name some other issues that can arise with single-objective optimization. In addition, give three sets of constants and explain qualitatively what types of behavior each penalizes or rewards (note: you may want to use an action policy other than random).** Place your answers in `reinforcement-learning-answers.txt`.
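
One way to organize this experiment is to treat the reward as a weighted sum of components and compare how different constants score the same state. The sketch below is purely illustrative: the function `weighted_reward`, its component names, and the constants `c_speed`, `c_dist`, and `c_col` are hypothetical and are not the reward function used in this notebook.

```python
def weighted_reward(speed, dot_dir, lane_dist, col_penalty,
                    c_speed=1.0, c_dist=-10.0, c_col=40.0):
    """Hypothetical dense reward: a weighted sum of driving terms."""
    return (c_speed * speed * dot_dir    # progress along the lane direction
            + c_dist * abs(lane_dist)    # cost for drifting from the lane center
            + c_col * col_penalty)       # col_penalty is assumed to be <= 0 near obstacles


# Hypothetical constant sets to compare on a single state
constant_sets = {
    "balanced": dict(c_speed=1.0, c_dist=-10.0, c_col=40.0),
    "speed_only": dict(c_speed=1.0, c_dist=0.0, c_col=0.0),
    "centering_only": dict(c_speed=0.0, c_dist=-10.0, c_col=0.0),
}

# One made-up state: fast and well aligned, but 0.5 units off the lane center
state = dict(speed=1.0, dot_dir=1.0, lane_dist=-0.5, col_penalty=0.0)

scores = {name: weighted_reward(**state, **consts)
          for name, consts in constant_sets.items()}
print(scores)  # {'balanced': -4.0, 'speed_only': 1.0, 'centering_only': -5.0}
```

Scoring one fixed state under several constant sets makes the trade-off visible: the same off-center, fast state is rewarded when only speed counts and penalized once the centering term dominates.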




# The Reinforcement Learning Code

Below we'll see a relatively naive implementation of the actor-critic training loop, which proceeds as follows: the critic is tasked with a supervised learning problem, fitting the returns acquired by the agent (each regression target is the observed reward plus the discounted value estimate for the next state). Then the policy, using policy gradients, maximizes the return according to the critic's estimate, rather than using Monte Carlo updates.
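
To make the critic's regression target concrete, here is a minimal sketch in plain Python (no PyTorch) of a bootstrapped target of the form `reward + discount * (1 - done) * Q(s', a')`. The helper names `td_targets` and `mse` and the toy numbers are assumptions made for illustration; they are not part of the notebook's code.

```python
def td_targets(rewards, next_q_values, dones, discount=0.99):
    """Bootstrapped critic targets: y = r + discount * (1 - done) * Q(s', a')."""
    targets = []
    for r, q_next, done in zip(rewards, next_q_values, dones):
        not_done = 1.0 - float(done)  # zero out the bootstrap term at episode end
        targets.append(r + discount * not_done * q_next)
    return targets


def mse(predictions, targets):
    """Mean squared error between critic predictions and their targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Toy batch of three transitions; the last one ends the episode
rewards = [1.0, 0.5, -1.0]
next_q = [2.0, 4.0, 10.0]  # hypothetical estimates of Q(s', a')
dones = [0, 0, 1]

targets = td_targets(rewards, next_q, dones, discount=0.5)
print(targets)  # [2.0, 2.5, -1.0]

# Hypothetical current critic predictions Q(s, a) for the same batch
critic_loss = mse([2.0, 2.5, 0.0], targets)
```

Note that the terminal transition's target is just its reward: the large next-state estimate (10.0) is masked out because no future return exists after the episode ends.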

Below, we see an implementation of `DDPGAgent`, a class which handles the networks and training loop. 

In [35]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class DDPGAgent(object):
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super(DDPGAgent, self).__init__()
        self.flat = False
        
        self.actor = Actor(action_dim, max_action).to(device)
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=5e-2)
        
        self.critic = Critic(action_dim, max_action).to(device)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=5e-2)
        
        # MOD: Add 2 target networks
        self.actor_target = Actor(action_dim, max_action).to(device)
        self.actor_target.load_state_dict(self.actor.state_dict())
        self.critic_target = Critic(action_dim, max_action).to(device)
        self.critic_target.load_state_dict(self.critic.state_dict())
        
        # MOD: Param for updating target networks
        self.tau = 0.001
        self.critic_losses = [] # To store critic loss
        self.actor_losses = [] # To store actor loss
        
    def predict(self, state):
        assert state.shape[0] == 3
        state = torch.FloatTensor(np.expand_dims(state, axis=0)).to(device)
        return self.actor(state).cpu().data.numpy().flatten()

    def train(self, replay_buffer, iterations, batch_size=64, discount=0.99):
        for it in range(iterations):

            # Sample replay buffer
            sample = replay_buffer.sample(batch_size, flat=self.flat)
            state = torch.FloatTensor(sample["state"]).to(device)
            action = torch.FloatTensor(sample["action"]).to(device)
            next_state = torch.FloatTensor(sample["next_state"]).to(device)
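            # Note: despite its name, `done` below is a "not done" mask:
            # 1 while the episode continues, 0 at a terminal transition.
            done = torch.FloatTensor(1 - sample["done"]).to(device)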
            reward = torch.FloatTensor(sample["reward"]).to(device)

            # MOD: Compute the target Q value
            # target_Q = self.critic(next_state, self.actor(next_state))
            target_Q = self.critic_target(next_state, self.actor_target(next_state)) # Use target networks instead
            
            
            # TODO: - no detach is a subtle, but important bug!
            target_Q = (reward + (done * discount * target_Q)).detach()

            # Get current Q estimate
            current_Q = self.critic(state, action)

            # Compute critic loss
            critic_loss = F.mse_loss(current_Q, target_Q)

            # Optimize the critic
            self.critic_optimizer.zero_grad()
            critic_loss.backward()
            self.critic_optimizer.step()

            # Compute actor loss
            actor_loss = -self.critic(state, self.actor(state)).mean()
    
            # Optimize the actor
            self.actor_optimizer.zero_grad()
            actor_loss.backward()
            self.actor_optimizer.step()
            
            # MOD: store the losses as Python floats so the autograd graphs are not kept alive
            self.critic_losses.append(critic_loss.item())
            self.actor_losses.append(actor_loss.item())
            
            # MOD: update actor_target and critic_target every iteration using soft update
            for real_param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
                target_param.data.copy_(self.tau * real_param.data + (1. - self.tau) * target_param.data)
            for real_param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
                target_param.data.copy_(self.tau * real_param.data + (1. - self.tau) * target_param.data)
                
    def save(self, filename, directory):
        torch.save(self.actor.state_dict(), '{}/{}_actor.pth'.format(directory, filename))
        torch.save(self.critic.state_dict(), '{}/{}_critic.pth'.format(directory, filename))

    def load(self, filename, directory):
        self.actor.load_state_dict(torch.load('{}/{}_actor.pth'.format(directory, filename), map_location=device))
        self.critic.load_state_dict(torch.load('{}/{}_critic.pth'.format(directory, filename), map_location=device))
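
To get a feel for how slowly the target networks in `DDPGAgent.train` follow the online networks, the soft update can be sketched on plain Python lists. This is a minimal illustration only: the `soft_update` helper and the three-element "weight vectors" are hypothetical stand-ins, not part of the notebook's code.

```python
def soft_update(online, target, tau):
    """Return new target parameters: tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


online_params = [1.0, -2.0, 0.5]  # stand-in for the online network's weights
target_params = [0.0, 0.0, 0.0]   # stand-in for the target network's weights

tau = 0.001  # same value as self.tau in DDPGAgent
steps = 1000
for _ in range(steps):
    target_params = soft_update(online_params, target_params, tau)

# With the online weights held fixed, the target closes this fraction of the gap:
fraction_covered = 1.0 - (1.0 - tau) ** steps
print(fraction_covered, target_params)
```

With `tau = 0.001`, a thousand updates close only about 63% of the gap between target and online weights, so the regression target seen by the critic changes slowly from one iteration to the next.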


You'll notice that the training loop needs a `replay_buffer` object. In value-based and actor-critic methods in deep reinforcement learning, the use of a replay buffer is crucial. In the following sections, you'll explore why this is the case, along with some other stabilization techniques that are needed in order to get the above code to work. Below, you can find an implementation of the replay buffer, as well as the training loop that we use to train DDPG.

In [36]:
import random  # needed by ReplayBuffer.add for random eviction

# Simple replay buffer
class ReplayBuffer(object):
    def __init__(self, max_size=1e6):
        self.storage = []
        self.max_size = max_size

    # Expects tuples of (state, next_state, action, reward, done)
    def add(self, state, next_state, action, reward, done):
        if len(self.storage) < self.max_size:
            self.storage.append((state, next_state, action, reward, done))
        else:
            # Remove a random element from the memory before adding a new one
            self.storage.pop(random.randrange(len(self.storage)))
            self.storage.append((state, next_state, action, reward, done))


    def sample(self, batch_size=100, flat=True):
        ind = np.random.randint(0, len(self.storage), size=batch_size)
        states, next_states, actions, rewards, dones = [], [], [], [], []

        for i in ind:
            state, next_state, action, reward, done = self.storage[i]

            # np.asarray avoids a copy when possible; np.array(..., copy=False)
            # raises in NumPy >= 2.0 whenever a copy is actually required.
            if flat:
                states.append(np.asarray(state).flatten())
                next_states.append(np.asarray(next_state).flatten())
            else:
                states.append(np.asarray(state))
                next_states.append(np.asarray(next_state))
            actions.append(np.asarray(action))
            rewards.append(np.asarray(reward))
            dones.append(np.asarray(done))

        # state_sample, action_sample, next_state_sample, reward_sample, done_sample
        return {
            "state": np.stack(states),
            "next_state": np.stack(next_states),
            "action": np.stack(actions),
            "reward": np.stack(rewards).reshape(-1,1),
            "done": np.stack(dones).reshape(-1,1)
        }
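
The `ReplayBuffer` above evicts a random element once it is full. For comparison, here is a sketch of a first-in-first-out variant built on `collections.deque`, which makes the effect of `max_size` easy to see. The class name `FIFOReplayBuffer` and the toy integer "states" are hypothetical and only for illustration; this is not the buffer used in the training loop.

```python
import random
from collections import deque


class FIFOReplayBuffer:
    """Bounded buffer that drops the oldest transition when full (illustrative variant)."""

    def __init__(self, max_size):
        self.storage = deque(maxlen=int(max_size))

    def add(self, state, next_state, action, reward, done):
        # deque(maxlen=...) silently discards the oldest entry once it is full
        self.storage.append((state, next_state, action, reward, done))

    def sample(self, batch_size, rng=random):
        # Sample with replacement, as np.random.randint does in the notebook's buffer
        return [rng.choice(self.storage) for _ in range(batch_size)]


buffer = FIFOReplayBuffer(max_size=3)
for t in range(5):
    buffer.add(state=t, next_state=t + 1, action=0.0, reward=-1.0, done=False)

oldest_state = buffer.storage[0][0]  # transitions with state 0 and 1 were evicted
batch = buffer.sample(batch_size=4)
print(oldest_state, [transition[0] for transition in batch])
```

A small `max_size` keeps only recent, highly correlated transitions, while a large one retains data gathered by much older versions of the policy.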

In [37]:
seed_ = 123
max_timesteps = 1e5  
batch_size = 64
discount = 0.99
eval_freq = 500 # 5e3
file_name = 'dt-class-rl'
start_timesteps = 1e3 # 1e3 # 1e4
expl_noise = 0.1
env_timesteps = 500 # 500 # this was not defined

In [38]:
import logging
logging.getLogger('gym-duckietown').disabled = True

In [39]:
local_env = launch_env()
# local_env = wrap_env(local_env)
local_env = ResizeWrapper(local_env)
local_env = ImgWrapper(local_env)

if not os.path.exists("./pytorch_models"):
    os.makedirs("./pytorch_models")

# Set seeds
seedall(seed_)

state_dim = local_env.observation_space.shape
action_dim = local_env.action_space.shape[0]
max_action = float(local_env.action_space.high[0])

# Initialize policy
policy = DDPGAgent(state_dim, action_dim, max_action)

replay_buffer = ReplayBuffer()

# Evaluate untrained policy
evaluations= [evaluate_policy(local_env, policy)]


# MOD: store reward and done flag
iter_rewards = []
episode_rewards = []
dones = []


total_timesteps = 0
timesteps_since_eval = 0
episode_num = 0
done = True
episode_reward = 0 # None
env_counter = 0
while total_timesteps < max_timesteps:
    if done:
        if total_timesteps != 0:
            print(("Total T: %d Episode Num: %d Episode T: %d Reward: %f") % (
                total_timesteps, episode_num, episode_timesteps, episode_reward))
            policy.train(replay_buffer, episode_timesteps, batch_size, discount)

        # Evaluate episode
        if timesteps_since_eval >= eval_freq:
            timesteps_since_eval %= eval_freq
            evaluations.append(evaluate_policy(local_env, policy))

            policy.save(file_name, directory="./pytorch_models")
            np.savez("./pytorch_models/{}.npz".format(file_name),evaluations)
        
        # MOD: log episode rewards
        episode_rewards.append(episode_reward)
        print('Episode: %d - Episode reward: %.4f' % (episode_num, episode_reward))
        
        # Reset environment
        env_counter += 1
        obs = local_env.reset()
        done = False
        episode_reward = 0
        episode_timesteps = 0
        episode_num += 1

    # Select action randomly or according to policy
    if total_timesteps < start_timesteps:
        action = local_env.action_space.sample()
    else:
        action = policy.predict(np.array(obs))
        if expl_noise != 0:
            action = (action + np.random.normal(
                0,
                expl_noise,
                size=local_env.action_space.shape[0])
            ).clip(-1, +1)

    # Perform action
    new_obs, reward, done, _ = local_env.step(action)

    if episode_timesteps >= env_timesteps:
        done = True

    done_bool = 0 if episode_timesteps + 1 == env_timesteps else float(done)
    episode_reward += reward

    # Store data in replay buffer
    replay_buffer.add(obs, new_obs, action, reward, done_bool)

    obs = new_obs

    episode_timesteps += 1
    total_timesteps += 1
    timesteps_since_eval += 1
    
    # MOD: log reward and done flag
    dones.append(done)
    iter_rewards.append(reward)
    
    print('Iteration %d - episode: %d, reward: %.4f' % (total_timesteps, episode_num, reward))

# Final evaluation
evaluations.append(evaluate_policy(local_env, policy))

# if args.save_models:
policy.save(file_name, directory="./pytorch_models")
np.savez("./pytorch_models/{}.npz".format(file_name),evaluations)

action_dim:2
action_dim:2
Episode: 0 - Episode reward: 0.0000
Iteration 1 - episode: 1, reward: -1.3040
Iteration 2 - episode: 1, reward: -1.3040
Iteration 3 - episode: 1, reward: -1.3040
Iteration 4 - episode: 1, reward: -1.3040
Iteration 5 - episode: 1, reward: -1.3039
Iteration 6 - episode: 1, reward: -1.3035
Iteration 7 - episode: 1, reward: -1.3033
Iteration 8 - episode: 1, reward: -1.3044
Iteration 9 - episode: 1, reward: -1.3073
Iteration 10 - episode: 1, reward: -1.3124
Iteration 11 - episode: 1, reward: -1.3231
Iteration 12 - episode: 1, reward: -1.3409
Iteration 13 - episode: 1, reward: -1.3588
Iteration 14 - episode: 1, reward: -1.3705
Iteration 15 - episode: 1, reward: -1.3924
Iteration 16 - episode: 1, reward: -1.4084
Iteration 17 - episode: 1, reward: -1.4188
Iteration 18 - episode: 1, reward: -1.4188
Iteration 19 - episode: 1, reward: -1.4173
Iteration 20 - episode: 1, reward: -1.4146
Iteration 21 - episode: 1, reward: -1.4097
Iteration 22 - episode: 1, reward: -1.4108
I

Iteration 193 - episode: 1, reward: -2.0248
Iteration 194 - episode: 1, reward: -2.0556
Iteration 195 - episode: 1, reward: -2.0573
Iteration 196 - episode: 1, reward: -2.0716
Iteration 197 - episode: 1, reward: -2.0824
Iteration 198 - episode: 1, reward: -2.0758
Iteration 199 - episode: 1, reward: -2.0530
Iteration 200 - episode: 1, reward: -2.0009
Iteration 201 - episode: 1, reward: -1.9690
Iteration 202 - episode: 1, reward: -1.9548
Iteration 203 - episode: 1, reward: -1.9703
Iteration 204 - episode: 1, reward: -1.9561
Iteration 205 - episode: 1, reward: -1.9715
Iteration 206 - episode: 1, reward: -1.9774
Iteration 207 - episode: 1, reward: -1.9509
Iteration 208 - episode: 1, reward: -1.9292
Iteration 209 - episode: 1, reward: -1.9231
Iteration 210 - episode: 1, reward: -1.9380
Iteration 211 - episode: 1, reward: -1.9220
Iteration 212 - episode: 1, reward: -1.8827
Iteration 213 - episode: 1, reward: -1.8741
Iteration 214 - episode: 1, reward: -1.8470
Iteration 215 - episode: 1, rewa

Iteration 381 - episode: 1, reward: 0.0168
Iteration 382 - episode: 1, reward: 0.0445
Iteration 383 - episode: 1, reward: 0.0659
Iteration 384 - episode: 1, reward: 0.0779
Iteration 385 - episode: 1, reward: 0.0898
Iteration 386 - episode: 1, reward: 0.0947
Iteration 387 - episode: 1, reward: 0.0961
Iteration 388 - episode: 1, reward: 0.0974
Iteration 389 - episode: 1, reward: 0.0974
Iteration 390 - episode: 1, reward: 0.0974
Iteration 391 - episode: 1, reward: 0.0973
Iteration 392 - episode: 1, reward: 0.0975
Iteration 393 - episode: 1, reward: 0.0981
Iteration 394 - episode: 1, reward: 0.0994
Iteration 395 - episode: 1, reward: 0.1016
Iteration 396 - episode: 1, reward: 0.1031
Iteration 397 - episode: 1, reward: 0.1024
Iteration 398 - episode: 1, reward: 0.1040
Iteration 399 - episode: 1, reward: 0.1072
Iteration 400 - episode: 1, reward: 0.1079
Iteration 401 - episode: 1, reward: 0.1092
Iteration 402 - episode: 1, reward: 0.1110
Iteration 403 - episode: 1, reward: 0.1111
Iteration 4

KeyboardInterrupt: 
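
The action-selection branch of the training loop above adds zero-mean Gaussian noise with standard deviation `expl_noise` to the policy output and clips the result to the valid action range. That step can be sketched in isolation with the standard library only. The helper name `noisy_action` and the example action values are hypothetical.

```python
import random


def noisy_action(action, expl_noise, low=-1.0, high=1.0, rng=random):
    """Add zero-mean Gaussian noise to each action component, then clip to [low, high]."""
    noisy = [a + rng.gauss(0.0, expl_noise) for a in action]
    return [min(high, max(low, a)) for a in noisy]


rng = random.Random(123)       # seeded generator for reproducibility
policy_action = [0.95, -0.2]   # stand-in for the output of policy.predict(obs)
explored = noisy_action(policy_action, expl_noise=0.1, rng=rng)
print(explored)
```

Clipping matters near the action bounds: a component such as `0.95` can be pushed above `1.0` by the noise, and without clipping the environment would receive an out-of-range command.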

# Stabilizing DDPG

As you may notice, the above model performs poorly or doesn't converge. Your job is to improve it; first in the notebook, later in the AIDO submission. This last part of the assignment consists of four sections:

**1. There are subtle, but important, bugs that have been introduced into the code above. Your job is to find them, and explain them in your `reinforcement-learning-answers.txt`. You'll want to reread the original [DQN](https://deepmind.com/research/publications/human-level-control-through-deep-reinforcement-learning) and [DDPG](https://arxiv.org/abs/1509.02971) papers in order to better understand the issue, but by answering the following subquestions (*please put the answers to these in the submission for full credit*), you'll be on the right track:**

   a) Read some literature on actor-critic methods, including the original [actor-critic](https://papers.nips.cc/paper/1786-actor-critic-algorithms.pdf) paper. What is an issue that you see related to *non-stationarity*? Define what _non-stationarity_ means in the context of machine learning and how it relates to actor-critic methods. In addition, give some hypotheses on why reinforcement learning is much more difficult (from an optimization perspective) than supervised learning, and how the answer to the previous question and this one are related.

   b) What role does the replay buffer play in off-policy reinforcement learning? Its most important parameter is `max_size` - how does changing this value (answer for both increasing and decreasing trends) qualitatively affect the training of the algorithm?

   c) **Challenge Question:** Briefly, explain how automatic differentiation works. In addition, expand on the difference between a single-element tensor (that `requires_grad`) and a scalar value as it relates to automatic differentiation; when do we want to backpropagate through a single-element tensor, and when do we not? Take a close look at the code and how losses are being backpropagated. On paper or your favorite drawing software, draw out the actor-critic architecture *as described in the code*, and label how the actor and critic losses are backpropagated. On your diagram, highlight the particular loss that will cause issues with the above code, and fix it.
   
For the next section, please pick **either** the theoretical or the practical pathway. If you don't have access to the compute needed for the exercise, please do the theoretical portion. 
   
_Theoretical Component_ 

**2. We discussed a case study of DQN in class. The original authors used quite a few tricks to get this to work. Detail some of the following, and explain what problem they solve in training the DQN:**

a) Target Networks

b) Annealed Learning Rates

c) Replay Buffer

d) Random Exploration Period

e) Preprocessing the Image


**3. Read about either [TD3](https://arxiv.org/abs/1802.09477) or [Soft Actor Critic](https://arxiv.org/abs/1801.01290); for your choice, summarize which problems with the standard actor-critic formulation it addresses, and how it solves them.**


_Practical Component_ 

**2. [Optional - if you have access to compute] Using your analysis from the reward engineering ablation, train two agents (after you've found the bugs in DDPG) - one with the standard, `gym-duckietown` reward, and another with the parameters of your choosing. Report each set of parameters, and describe qualitatively what type of behavior the agent produces.**

If you don't have the resources to actually train these agents, instead describe what types of behaviors each reward function might prioritize.

**3. [Optional - if you have access to compute] Using the instructions [here](http://docs.duckietown.org/DT19/AIDO/out/embodied_rl.html), use the saved policy files from this notebook and submit them using the template submission provided for AIDO. Report your best submission number (i.e., the one you'd like to be graded) in `reinforcement-learning-answers.txt`.**