# Training an Ant to Walk using TD3 in MuJoCo Environment

This Jupyter notebook implements the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm and uses it to train an ant to walk in a MuJoCo environment. TD3 is an actor-critic reinforcement learning algorithm designed for continuous control tasks. We use PyTorch and a few supporting libraries to build and train the neural networks that control the ant's movements in the simulation.

### Dependencies
Before we dive into the implementation, let's ensure that the necessary dependencies are installed. We'll be using PyTorch for deep learning, gym for the MuJoCo environment, and tensorboardX for visualizing training progress.

In [1]:
import numpy as np
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F
from tensorboardX import SummaryWriter

import gym
# import roboschool
import pybullet
import sys

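Install the required libraries with the following command. The roboschool package appears only because of the commented-out import above and can be skipped. Note that the Ant-v2 environment used below additionally needs mujoco-py and a working MuJoCo installation.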


pip install torch tensorboardX gym roboschool pybullet
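
Before running the notebook, it can help to check which of these packages are actually importable. The snippet below is a minimal sketch that uses only the standard library; the helper name `missing_packages` and the `REQUIRED` list are made up for this example and simply mirror the imports in the first cell.

```python
import importlib.util

# Package names taken from the import cell above (roboschool is commented out there)
REQUIRED = ["numpy", "torch", "tensorboardX", "gym", "pybullet"]

def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
print("Missing packages:", missing or "none")
```

Anything listed as missing can then be installed with pip before continuing.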

### MuJoCo Environment
MuJoCo (Multi-Joint dynamics with Contact) is a physics engine designed for research and development in robotics, biomechanics, graphics, and machine learning. In this project, we will be using the MuJoCo environment provided by the gym library to simulate the ant's movements.

Let's proceed with the implementation and training of the TD3 algorithm to observe how our neural network learns to control the ant's locomotion.

In [2]:
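def hidden_init(layer):
    """Return a uniform init range based on the layer's fan-in (not used below)."""
    # nn.Linear stores weights as (out_features, in_features), so fan-in is dim 1
    fan_in = layer.weight.data.size()[1]
    lim = 1. / np.sqrt(fan_in)
    return (-lim, lim)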

## Actor Network for TD3 Algorithm
The Actor class is the policy network of the TD3 algorithm. It maps a state to a deterministic action in the continuous action space.

### Model Architecture
The actor network consists of three fully connected layers:

First Hidden Layer (self.l1): Takes the state, of dimension state_dim, and outputs 400 features, followed by a Rectified Linear Unit (ReLU) activation.  
Second Hidden Layer (self.l2): Maps the 400 features to 300, again followed by ReLU.  
Output Layer (self.l3): Maps the 300 features to action_dim and applies the hyperbolic tangent (tanh) activation. The result is scaled by max_action so that it falls within the action range.

Initialization Parameters

The Actor class takes the following parameters during initialization:

    state_dim: Dimension of each state.
    action_dim: Dimension of each action.
    max_action: Highest action value allowed.

To obtain the actor's output (action), pass a state tensor through the network using the forward method. The output is a tensor with tanh activation, scaled by max_action.
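
As a rough illustration of the output scaling described above, the sketch below applies the same `max_action * tanh(...)` squashing to plain Python floats. The helper name `scale_action`, the sample pre-activation values, and the action bound of 1.0 are assumptions made for this example; the notebook itself does this on tensors inside `Actor.forward`.

```python
import math

def scale_action(pre_activation, max_action):
    """Squash an unbounded value into the open interval (-max_action, max_action)."""
    return max_action * math.tanh(pre_activation)

# Hypothetical outputs of the last linear layer, before the activation
raw_outputs = [-10.0, -0.5, 0.0, 0.5, 10.0]
max_action = 1.0  # assumed action bound for this example

actions = [scale_action(x, max_action) for x in raw_outputs]
print(actions)
```

However large the raw outputs get, the resulting actions stay inside the allowed range, which is why no extra clipping is needed inside the network.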

In [3]:
class Actor(nn.Module):
    """Actor (policy) network.
        Args:
            state_dim (int): Dimension of each state
            action_dim (int): Dimension of each action
            max_action (float): highest action to take
            
        Return:
            action output of network with tanh activation, scaled by max_action
    """
    
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()

        self.l1 = nn.Linear(state_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, action_dim)

        self.max_action = max_action


    def forward(self, x):
        x = F.relu(self.l1(x))
        x = F.relu(self.l2(x))
        x = self.max_action * torch.tanh(self.l3(x)) 
        return x

## Critic Network for TD3 Algorithm
The Critic class is the value function approximator of the TD3 algorithm. It estimates the quality (Q-value) of an action taken by the actor in a given state.

### Model Architecture
The critic network consists of two independent Q-value estimators, Q1 and Q2, with identical architectures:

Q1 Architecture:

First Hidden Layer (self.l1): Takes the concatenated state and action tensors and outputs 400 features, followed by a Rectified Linear Unit (ReLU) activation.
Second Hidden Layer (self.l2): Maps the 400 features to 300, followed by ReLU.
Output Layer (self.l3): Maps the 300 features to a single value, the Q1 estimate.

Q2 Architecture:

First Hidden Layer (self.l4): As in Q1, takes the concatenated state and action tensors and outputs 400 features, followed by ReLU.
Second Hidden Layer (self.l5): Maps the 400 features to 300, followed by ReLU.
Output Layer (self.l6): Maps the 300 features to a single value, the Q2 estimate.

The Critic class takes the following parameters during initialization:

    state_dim: Dimension of each state.
    action_dim: Dimension of each action.

To obtain the Q-values from both Q1 and Q2, pass the state and action tensors through the network using the forward method. Additionally, the Q1 method allows for retrieving Q1 values separately.
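
To make the layer sizes concrete, the following sketch counts the trainable parameters of the twin critic. The helper `linear_params` is made up for this example, and the dimensions (111 for the state, 8 for the action) are assumed values for Ant-v2; check `env.observation_space` and `env.action_space` for your own environment.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

# Assumed dimensions for this example (not read from the environment)
state_dim, action_dim = 111, 8

# One Q-network: (state + action) -> 400 -> 300 -> 1
single_q = (
    linear_params(state_dim + action_dim, 400)
    + linear_params(400, 300)
    + linear_params(300, 1)
)
twin_critic = 2 * single_q  # Q1 and Q2 do not share any weights

print(single_q, twin_critic)
```

Because the two estimators share no weights, the critic is twice the size of a single Q-network.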

In [4]:
class Critic(nn.Module):
    """Critic network holding two independent Q-value estimators (Q1 and Q2).
        Args:
            state_dim (int): Dimension of each state
            action_dim (int): Dimension of each action
            
        Return:
            value output of network 
    """
    
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()

        # Q1 architecture
        self.l1 = nn.Linear(state_dim + action_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, 1)

        # Q2 architecture
        self.l4 = nn.Linear(state_dim + action_dim, 400)
        self.l5 = nn.Linear(400, 300)
        self.l6 = nn.Linear(300, 1)


    def forward(self, x, u):
        xu = torch.cat([x, u], 1)

        x1 = F.relu(self.l1(xu))
        x1 = F.relu(self.l2(x1))
        x1 = self.l3(x1)

        x2 = F.relu(self.l4(xu))
        x2 = F.relu(self.l5(x2))
        x2 = self.l6(x2)
        return x1, x2


    def Q1(self, x, u):
        xu = torch.cat([x, u], 1)

        x1 = F.relu(self.l1(xu))
        x1 = F.relu(self.l2(x1))
        x1 = self.l3(x1)
        return x1

## Experience Replay Buffer for TD3 Algorithm
The ReplayBuffer class implements experience replay for the TD3 algorithm. It stores experience tuples so that the agent can learn from past interactions. Sampling them at random breaks the temporal correlation between consecutive experiences, which makes training more stable.


### Implementation Details
The replay buffer is implemented based on the code from OpenAI's Baselines library. It stores tuples of the form (state, next_state, action, reward, done). The key functionalities of the ReplayBuffer include:

Initialization: The buffer is initialized with a maximum size (max_size) to limit the number of experiences stored.

Adding Data: The add method is used to add experience tuples to the buffer. If the buffer is full, it replaces old entries in a cyclic manner.

Sampling Data: The sample method randomly selects a batch of experiences from the buffer for training the TD3 algorithm.
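
The cyclic replacement can be seen in a small, self-contained sketch. The class `TinyRingBuffer` is a made-up, stripped-down stand-in that follows the add logic described above; integers take the place of experience tuples.

```python
class TinyRingBuffer:
    """Illustrative cyclic buffer with the same add logic as described above."""

    def __init__(self, max_size):
        self.storage = []
        self.max_size = max_size
        self.ptr = 0  # index of the oldest entry once the buffer is full

    def add(self, item):
        if len(self.storage) == self.max_size:
            # Buffer full: overwrite the oldest entry and advance the pointer
            self.storage[self.ptr] = item
            self.ptr = (self.ptr + 1) % self.max_size
        else:
            self.storage.append(item)

buffer = TinyRingBuffer(max_size=3)
for step in range(5):
    buffer.add(step)

print(buffer.storage)  # [3, 4, 2]
```

The two oldest entries (0 and 1) have been overwritten, while the list never grows beyond max_size.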

In [5]:
# Code based on: 
# https://github.com/openai/baselines/blob/master/baselines/deepq/replay_buffer.py

# Expects tuples of (state, next_state, action, reward, done)
class ReplayBuffer(object):
    """Buffer to store tuples of experience replay"""
    
    def __init__(self, max_size=1000000):
        """
        Args:
            max_size (int): total amount of tuples to store
        """
        
        self.storage = []
        self.max_size = max_size
        self.ptr = 0

    def add(self, data):
        """Add experience tuples to buffer
        
        Args:
            data (tuple): experience replay tuple
        """
        
        if len(self.storage) == self.max_size:
            self.storage[int(self.ptr)] = data
            self.ptr = (self.ptr + 1) % self.max_size
        else:
            self.storage.append(data)

    def sample(self, batch_size):
        """Sample a random batch of experiences from the buffer
        
        Args:
            batch_size (int): number of experiences to sample

        Returns:
            tuple of arrays: (states, next_states, actions, rewards, dones)
        """
        
        ind = np.random.randint(0, len(self.storage), size=batch_size)
        states, next_states, actions, rewards, dones = [], [], [], [], []

        for i in ind: 
            # Tuples are stored as (state, next_state, action, reward, done)
            s, s_, a, r, d = self.storage[i]
            # np.asarray avoids a copy when possible and also accepts scalars
            states.append(np.asarray(s))
            next_states.append(np.asarray(s_))
            actions.append(np.asarray(a))
            rewards.append(np.asarray(r))
            dones.append(np.asarray(d))

        return np.array(states), np.array(next_states), np.array(actions), np.array(rewards).reshape(-1, 1), np.array(dones).reshape(-1, 1)


## Twin Delayed Deep Deterministic (TD3) Agent Implementation
The TD3 class is the agent. It holds the actor and critic networks, their target copies and optimizers, and the methods for selecting actions and training on continuous control tasks.

### Initialization
The agent is initialized with the following parameters:

state_dim: Dimension of the state space.
action_dim: Dimension of the action space.
max_action: Highest possible action value.
env: The Gym environment used for training.

The select_action method is used to select actions based on the current state. It also supports the addition of noise to actions if needed.

The train method is responsible for training both the actor and critic networks using experiences sampled from the replay buffer.

The save and load methods allow for saving and loading the trained models.
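
The core of the train method is the target value used to update the critic. The sketch below reproduces that computation on plain Python floats: clipped Gaussian noise is added to the target action, and the smaller of the two target Q-values is used. The function names `smoothed_target_action` and `td3_target` and all numeric inputs are made up for this illustration; the notebook performs the same steps on batched tensors.

```python
import random

def clip(value, low, high):
    return max(low, min(high, value))

def smoothed_target_action(target_action, max_action,
                           policy_noise=0.2, noise_clip=0.5, rng=random):
    """Add clipped Gaussian noise to the target policy's action."""
    noise = clip(rng.gauss(0.0, policy_noise), -noise_clip, noise_clip)
    return clip(target_action + noise, -max_action, max_action)

def td3_target(reward, not_done, q1_next, q2_next, discount=0.99):
    """Bootstrapped target using the minimum of the twin target critics."""
    return reward + not_done * discount * min(q1_next, q2_next)

rng = random.Random(0)  # seeded generator so the example is repeatable
next_action = smoothed_target_action(0.9, max_action=1.0, rng=rng)

# Hypothetical Q-values for the next state
target = td3_target(reward=1.0, not_done=1.0, q1_next=10.0, q2_next=8.0)
terminal_target = td3_target(reward=1.0, not_done=0.0, q1_next=10.0, q2_next=8.0)

print(next_action, target, terminal_target)
```

Taking the minimum of the two estimates counteracts the overestimation that a single critic tends to accumulate, and for terminal transitions the target reduces to the immediate reward.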

In [6]:
#AGENT

class TD3(object):
    """Agent class that handles the training of the networks and provides outputs as actions
    
        Args:
            state_dim (int): state size
            action_dim (int): action size
            max_action (float): highest action to take
            env (env): gym environment to use

        Tensors are placed on the global `device` (cuda or cpu) defined in the setup cell.
    
    """
    
    def __init__(self, state_dim, action_dim, max_action, env):
        self.actor = Actor(state_dim, action_dim, max_action).to(device)
        self.actor_target = Actor(state_dim, action_dim, max_action).to(device)
        self.actor_target.load_state_dict(self.actor.state_dict())
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=1e-3)

        self.critic = Critic(state_dim, action_dim).to(device)
        self.critic_target = Critic(state_dim, action_dim).to(device)
        self.critic_target.load_state_dict(self.critic.state_dict())
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=1e-3)

        self.max_action = max_action
        self.env = env


        
    def select_action(self, state, noise=0.1):
        """Select an appropriate action from the agent policy
        
            Args:
                state (array): current state of environment
                noise (float): standard deviation of the Gaussian noise added to actions (0 disables it)
                
            Returns:
                action (array): action clipped within action range
        
        """
        
        state = torch.FloatTensor(state.reshape(1, -1)).to(device)
        
        action = self.actor(state).cpu().data.numpy().flatten()
        if noise != 0: 
            action = (action + np.random.normal(0, noise, size=self.env.action_space.shape[0]))
            
        return action.clip(self.env.action_space.low, self.env.action_space.high)

    
    def train(self, replay_buffer, iterations, batch_size=100, discount=0.99, tau=0.005, policy_noise=0.2, noise_clip=0.5, policy_freq=2):
        """Train and update actor and critic networks
        
            Args:
                replay_buffer (ReplayBuffer): buffer for experience replay
                iterations (int): how many times to run training
                batch_size (int): batch size to sample from replay buffer
                discount (float): discount factor
                tau (float): soft update rate from main networks to target networks
                policy_noise (float): std of the noise added to target actions
                noise_clip (float): bound used to clip that noise
                policy_freq (int): update the actor and targets every policy_freq iterations
        
        """
        
        for it in range(iterations):

            # Sample replay buffer 
            x, y, u, r, d = replay_buffer.sample(batch_size)
            state = torch.FloatTensor(x).to(device)
            action = torch.FloatTensor(u).to(device)
            next_state = torch.FloatTensor(y).to(device)
            # Note: `done` holds 1 - d, so it is 1 for non-terminal transitions
            done = torch.FloatTensor(1 - d).to(device)
            reward = torch.FloatTensor(r).to(device)

            # Select action according to policy and add clipped noise 
            noise = torch.FloatTensor(u).data.normal_(0, policy_noise).to(device)
            noise = noise.clamp(-noise_clip, noise_clip)
            next_action = (self.actor_target(next_state) + noise).clamp(-self.max_action, self.max_action)

            # Compute the target Q value
            target_Q1, target_Q2 = self.critic_target(next_state, next_action)
            target_Q = torch.min(target_Q1, target_Q2)
            target_Q = reward + (done * discount * target_Q).detach()

            # Get current Q estimates
            current_Q1, current_Q2 = self.critic(state, action)

            # Compute critic loss
            critic_loss = F.mse_loss(current_Q1, target_Q) + F.mse_loss(current_Q2, target_Q) 

            # Optimize the critic
            self.critic_optimizer.zero_grad()
            critic_loss.backward()
            self.critic_optimizer.step()

            # Delayed policy updates
            if it % policy_freq == 0:

                # Compute actor loss
                actor_loss = -self.critic.Q1(state, self.actor(state)).mean()

                # Optimize the actor 
                self.actor_optimizer.zero_grad()
                actor_loss.backward()
                self.actor_optimizer.step()

                # Update the frozen target models
                for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
                    target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

                for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
                    target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)


    def save(self, filename, directory):
        """Save actor and critic weights as <directory>/<filename>_{actor,critic}.pth"""
        import os  # local import so this cell does not depend on the import cell
        os.makedirs(directory, exist_ok=True)
        torch.save(self.actor.state_dict(), os.path.join(directory, '%s_actor.pth' % filename))
        torch.save(self.critic.state_dict(), os.path.join(directory, '%s_critic.pth' % filename))


    def load(self, filename="best_avg", directory="new"):
        """Load actor and critic weights written by save().

        The defaults match the agent.save("best_avg", "new") calls in train().
        """
        import os
        self.actor.load_state_dict(torch.load(os.path.join(directory, '%s_actor.pth' % filename), map_location=device))
        self.critic.load_state_dict(torch.load(os.path.join(directory, '%s_critic.pth' % filename), map_location=device))


## Environment Runner Implementation
The Runner class executes environment steps and adds the resulting experiences to the replay buffer. At each step it asks the agent for an action, applies it to the environment, and stores the observation, action, and reward that result.

Initialization:         
The Runner is initialized with the following parameters:

env: The Gym environment    
agent: The TD3 agent    
replay_buffer: The replay buffer for storing experiences    

The next_step method is used to perform the next environment step. It takes the current number of timesteps in the episode (episode_timesteps) and an optional noise parameter. The method returns the reward obtained from the step and a boolean indicating whether the episode is done.
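
One detail of next_step deserves a closer look: an episode that ends only because the time limit was reached is stored as non-terminal, so that the critic still bootstraps from the next state. The sketch below isolates that rule. The function name `done_flag_for_buffer` and the time limit of 1000 steps are assumptions for this example.

```python
def done_flag_for_buffer(done, episode_timesteps, max_episode_steps):
    """Return the terminal flag to store alongside a transition.

    A cutoff caused by the time limit is not a real terminal state,
    so it is stored as 0.0 even though the environment reports done.
    """
    hit_time_limit = (episode_timesteps + 1 == max_episode_steps)
    return 0.0 if hit_time_limit else float(done)

MAX_STEPS = 1000  # assumed time limit for this example

fell_over = done_flag_for_buffer(True, episode_timesteps=120, max_episode_steps=MAX_STEPS)
timed_out = done_flag_for_buffer(True, episode_timesteps=999, max_episode_steps=MAX_STEPS)
mid_episode = done_flag_for_buffer(False, episode_timesteps=120, max_episode_steps=MAX_STEPS)

print(fell_over, timed_out, mid_episode)  # 1.0 0.0 0.0
```

Only the genuine failure (the ant falling over) is recorded as terminal.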


In [7]:
#RUNNER
class Runner():
    """Carries out the environment steps and adds experiences to memory"""
    
    def __init__(self, env, agent, replay_buffer):
        
        self.env = env
        self.agent = agent
        self.replay_buffer = replay_buffer
        self.obs = env.reset()
        self.done = False
        
    def next_step(self, episode_timesteps, noise=0.1):
        """Take one environment step and store the transition

        Returns:
            reward (float), done (bool)
        """
        
        action = self.agent.select_action(np.array(self.obs), noise=noise)
        
        # Perform action
        new_obs, reward, done, _ = self.env.step(action) 
        # A time-limit cutoff is not a true terminal state, so store done=0 for it
        done_bool = 0 if episode_timesteps + 1 == self.env.spec.max_episode_steps else float(done)
    
        # Store data in replay buffer
        self.replay_buffer.add((self.obs, new_obs, action, reward, done_bool))
        
        self.obs = new_obs
        
        if done:
            # Start a new episode for the next call
            self.obs = self.env.reset()
        
        return reward, done

## Policy Evaluation Function
The evaluate_policy function assesses an agent's policy by running it, without exploration noise, for a specified number of evaluation episodes and returning the average episode reward. It can optionally render the environment during evaluation for visual inspection.

This function is valuable for assessing the performance of the trained agent in a quantitative manner, providing insights into its effectiveness in the given environment.
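
The averaging logic can be tried without MuJoCo by using a stand-in environment. In the sketch below, `ConstantRewardEnv` and `average_return` are hypothetical names invented for this example; the toy environment pays a reward of 1.0 per step and ends every episode after 5 steps, and it follows the same reset/step convention as the Gym version used in this notebook.

```python
class ConstantRewardEnv:
    """Hypothetical toy environment: reward 1.0 per step, 5 steps per episode."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # dummy observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        return 0.0, 1.0, done, {}  # obs, reward, done, info

def average_return(env, choose_action, eval_episodes=10):
    """Run full episodes and return the mean of the summed episode rewards."""
    total = 0.0
    for _ in range(eval_episodes):
        obs = env.reset()
        done = False
        while not done:
            obs, reward, done, _ = env.step(choose_action(obs))
            total += reward
    return total / eval_episodes

avg = average_return(ConstantRewardEnv(), choose_action=lambda obs: 0.0)
print(avg)  # 5.0
```

Note that the result is the average total reward per episode, not the average reward per step.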


In [8]:
#Evaluate

def evaluate_policy(policy, env, eval_episodes=100, render=False):
    """run several episodes using the best agent policy
        
        Args:
            policy (agent): agent to evaluate
            env (env): gym environment
            eval_episodes (int): how many test episodes to run
            render (bool): render the environment while evaluating
        
        Returns:
            avg_reward (float): average episode reward over the number of evaluations
    
    """
    
    avg_reward = 0.
    for i in range(eval_episodes):
        obs = env.reset()
        done = False
        while not done:
            if render:
                env.render()
            action = policy.select_action(np.array(obs), noise=0)
            obs, reward, done, _ = env.step(action)
            avg_reward += reward

    avg_reward /= eval_episodes

    print("\n---------------------------------------")
    print("Evaluation over {:d} episodes: {:f}" .format(eval_episodes, avg_reward))
    print("---------------------------------------")
    return avg_reward

## Observation Function
The observe function populates the replay buffer by running episodes with random actions and storing the resulting experiences. Pre-filling the buffer with diverse experiences helps bootstrap the agent's training.

It's important to observe enough steps to provide a rich set of experiences for the agent to learn from during the initial stages of training. The progress is printed in the console for monitoring.

In [9]:
# OBSERVATION
def observe(env, replay_buffer, observation_steps):
    """run episodes while taking random actions and filling replay_buffer
    
        Args:
            env (env): gym environment
            replay_buffer(ReplayBuffer): buffer to store experience replay
            observation_steps (int): how many steps to observe for
    
    """
    
    time_steps = 0
    obs = env.reset()
    done = False

    while time_steps < observation_steps:
        action = env.action_space.sample()
        new_obs, reward, done, _ = env.step(action)

        replay_buffer.add((obs, new_obs, action, reward, done))

        obs = new_obs
        time_steps += 1

        if done:
            obs = env.reset()
            done = False

        print("\rPopulating Buffer {}/{}.".format(time_steps, observation_steps), end="")
        sys.stdout.flush()

## Training Function
The train function trains the agent with the TD3 algorithm until EXPLORATION timesteps have elapsed or the average reward reaches REWARD_THRESH.

At the end of each episode it updates the agent from the replay buffer, running one training iteration per timestep of that episode. It also evaluates the agent every EVAL_FREQUENCY timesteps, saves the best model seen so far, and logs progress to TensorBoard.

To experiment with other TD3 configurations, adjust the hyperparameters in the configuration cell to suit your environment.
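
The best-model bookkeeping relies on a rolling average of recent episode rewards. The sketch below shows that logic in isolation; the helper `rolling_average`, the window of 2, and the reward values are made up for this example (the training loop in this notebook averages over the last 100 episodes).

```python
def rolling_average(rewards, window=100):
    """Mean of the most recent `window` episode rewards."""
    recent = rewards[-window:]
    return sum(recent) / len(recent)

# Hypothetical episode rewards
episode_rewards = [10.0, 30.0, 0.0, 50.0]

history = []
best_avg = float("-inf")
saved_at = []  # episodes at which a new best model would be saved

for episode, reward in enumerate(episode_rewards):
    history.append(reward)
    avg = rolling_average(history, window=2)
    if avg > best_avg:
        best_avg = avg
        saved_at.append(episode)

print(saved_at, best_avg)  # [0, 1, 3] 25.0
```

Episode 2 does not trigger a save because its rolling average (15.0) is below the best value seen so far (20.0).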

In [10]:
#TRAIN
def train(agent, test_env):
    """Train the agent until EXPLORATION timesteps or REWARD_THRESH is reached
    
        Args:
            agent (TD3): agent to train
            test_env (env): gym environment used for periodic evaluation

        Relies on the globals env, runner, replay_buffer and the CONFIG constants.
    
    """

    total_timesteps = 0
    timesteps_since_eval = 0
    episode_num = 0
    episode_reward = 0
    episode_timesteps = 0
    done = False 
    # Start from a fresh episode and keep the runner's observation in sync
    runner.obs = env.reset()
    evaluations = []
    rewards = []
    best_avg = -2000
    
    writer = SummaryWriter(comment="-TD3_Baseline_Ant")
    
    while total_timesteps < EXPLORATION:
    
        if done: 

            if total_timesteps != 0: 
                rewards.append(episode_reward)
                avg_reward = np.mean(rewards[-100:])
                
                writer.add_scalar("avg_reward", avg_reward, total_timesteps)
                writer.add_scalar("reward_step", reward, total_timesteps)
                writer.add_scalar("episode_reward", episode_reward, total_timesteps)
                
                if best_avg < avg_reward:
                    best_avg = avg_reward
                    print("saving best model....\n")
                    agent.save("best_avg","new")

                print("\rTotal T: {:d} Episode Num: {:d} Reward: {:f} Avg Reward: {:f}".format(
                    total_timesteps, episode_num, episode_reward, avg_reward), end="")
                sys.stdout.flush()


                if avg_reward >= REWARD_THRESH:
                    break

                agent.train(replay_buffer, episode_timesteps, BATCH_SIZE, GAMMA, TAU, NOISE, NOISE_CLIP, POLICY_FREQUENCY)

                # Evaluate episode
                if timesteps_since_eval >= EVAL_FREQUENCY:
                    timesteps_since_eval %= EVAL_FREQUENCY
                    eval_reward = evaluate_policy(agent, test_env)
                    evaluations.append(eval_reward)
                    writer.add_scalar("eval_reward", eval_reward, total_timesteps)
                    # Evaluation may run on the training env, so restart the runner's episode
                    runner.obs = env.reset()

                    if best_avg < eval_reward:
                        best_avg = eval_reward
                        print("saving best model....\n")
                        agent.save("best_avg","new")

                episode_reward = 0
                episode_timesteps = 0
                episode_num += 1 

        reward, done = runner.next_step(episode_timesteps)
        episode_reward += reward

        episode_timesteps += 1
        total_timesteps += 1
        timesteps_since_eval += 1

## Initialization and Training Setup
The following cells define the hyperparameters, create the environment, set the random seeds, and build the agent, replay buffer, and runner used for training. Adjust the hyperparameters to suit your environment and evaluation criteria.
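
Of these hyperparameters, TAU controls how quickly the target networks follow the main networks. The sketch below applies the soft update rule to a single number to show the effect. The helper names `soft_update` and `steps_to_close_gap` are made up for this example, and the online value is held fixed, which is a simplification of what happens during training.

```python
def soft_update(target_value, online_value, tau):
    """Move the target a fraction tau of the way toward the online value."""
    return tau * online_value + (1.0 - tau) * target_value

def steps_to_close_gap(tau, fraction=0.5):
    """Count soft updates until the target has covered `fraction` of the gap
    to a fixed online value (target starts at 0.0, online value is 1.0)."""
    target, online, steps = 0.0, 1.0, 0
    while target < fraction:
        target = soft_update(target, online, tau)
        steps += 1
    return steps

print(steps_to_close_gap(0.05))   # 14
print(steps_to_close_gap(0.005))  # 139
```

A tenfold smaller TAU makes the targets move roughly ten times more slowly, which trades learning speed for more stable value targets.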

In [11]:
#CONFIG

ENV = "Ant-v2"
SEED = 0
OBSERVATION = 10000
EXPLORATION = 70000
BATCH_SIZE = 100
GAMMA = 0.99
TAU = 0.05
NOISE = 0.2
NOISE_CLIP = 0.5
EXPLORE_NOISE = 0.1
POLICY_FREQUENCY = 2
EVAL_FREQUENCY = 5000
REWARD_THRESH = 8000

In [12]:
# Create the Gym environment and set the device (GPU or CPU)
env = gym.make(ENV)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Set seeds for reproducibility
env.seed(SEED)
torch.manual_seed(SEED)
np.random.seed(SEED)

# Obtain state and action dimensions along with the maximum action value from the environment
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0] 
max_action = float(env.action_space.high[0])

# Create the TD3 policy using the obtained dimensions and maximum action value
policy = TD3(state_dim, action_dim, max_action, env)

# Create a replay buffer to store experiences for training
replay_buffer = ReplayBuffer()

# Create a Runner instance to handle environment steps and experience collection
runner = Runner(env, policy, replay_buffer)

# Initialize variables for tracking training progress
total_timesteps = 0
timesteps_since_eval = 0
episode_num = 0
done = True

running build_ext


In [14]:
# Populate replay buffer
observe(env, replay_buffer, OBSERVATION)

Populating Buffer 10000/10000.

In [15]:
# Train agent
train(policy, env)

saving best model....

Total T: 5000 Episode Num: 4 Reward: -2593.868000 Avg Reward: -1756.026355
---------------------------------------
Evaluation over 100 episodes: -2381.210180
---------------------------------------
Total T: 10203 Episode Num: 12 Reward: -2700.322466 Avg Reward: -1743.057244
---------------------------------------
Evaluation over 100 episodes: -1407.382413
---------------------------------------
Total T: 15514 Episode Num: 22 Reward: -2631.763569 Avg Reward: -1595.317124
---------------------------------------
Evaluation over 100 episodes: -3003.934923
---------------------------------------
Total T: 20116 Episode Num: 32 Reward: -2677.090817 Avg Reward: -1483.324740
---------------------------------------
Evaluation over 100 episodes: -673.511732
---------------------------------------
Total T: 25014 Episode Num: 110 Reward: -43.153825 Avg Reward: -413.86411692
---------------------------------------
Evaluation over 100 episodes: -113.430779
---------------------

The following code saves the trained networks and loads them back. Once the policy performs satisfactorily, it can be evaluated further, for example by rendering several evaluation runs as done at the end of the cell.
The demo of the walking ant was recorded by running this cell.

In [16]:
# Save all networks
torch.save(policy.actor.state_dict(), "actor.pth")
torch.save(policy.critic.state_dict(), "critic.pth")

# Load trained policy from the files saved above
policy.actor.load_state_dict(torch.load('actor.pth', map_location=device))
policy.critic.load_state_dict(torch.load('critic.pth', map_location=device))

# Watch the trained agent run 
for i in range(10):
    evaluate_policy(policy, env, render=True)

# Shut down the environment
env.close()

The best policy we obtained came from a TD3 instance run with TAU = 0.005 instead of the 0.05 set in the configuration cell above. Its console output is shown below.


```
WITH TAU = 0.005

Total T: 5000 Episode Num: 4 Reward: 621.704693 Avg Reward: 715.725392
---------------------------------------
Evaluation over 100 episodes: 472.979710
---------------------------------------
Total T: 10897 Episode Num: 18 Reward: 314.681440 Avg Reward: 261.822236
---------------------------------------
Evaluation over 100 episodes: 275.993126
---------------------------------------
Total T: 15234 Episode Num: 27 Reward: 200.160667 Avg Reward: 239.144880
---------------------------------------
Evaluation over 100 episodes: 208.402902
---------------------------------------
Total T: 20046 Episode Num: 40 Reward: 499.843529 Avg Reward: 222.455397
---------------------------------------
Evaluation over 100 episodes: 350.905543
---------------------------------------
Total T: 25004 Episode Num: 51 Reward: 46.759798 Avg Reward: 224.2817825
---------------------------------------
Evaluation over 100 episodes: 289.261309
---------------------------------------
Total T: 30097 Episode Num: 58 Reward: 278.929342 Avg Reward: 246.330099
---------------------------------------
Evaluation over 100 episodes: 419.907933
---------------------------------------
Total T: 35990 Episode Num: 69 Reward: 491.753643 Avg Reward: 255.749711
---------------------------------------
Evaluation over 100 episodes: 374.798351
---------------------------------------
Total T: 40250 Episode Num: 78 Reward: 481.119700 Avg Reward: 261.986176
---------------------------------------
Evaluation over 100 episodes: 402.360503
---------------------------------------
Total T: 45783 Episode Num: 87 Reward: 627.406303 Avg Reward: 275.038945
---------------------------------------
Evaluation over 100 episodes: 405.799572
---------------------------------------
Total T: 50195 Episode Num: 96 Reward: 105.710424 Avg Reward: 279.954127
---------------------------------------
Evaluation over 100 episodes: 398.072862
---------------------------------------
Total T: 55064 Episode Num: 111 Reward: 806.813517 Avg Reward: 266.249092
---------------------------------------
Evaluation over 100 episodes: 429.293722
---------------------------------------
Total T: 60430 Episode Num: 123 Reward: 796.467087 Avg Reward: 288.427197
---------------------------------------
Evaluation over 100 episodes: 734.379806
---------------------------------------
Total T: 65308 Episode Num: 136 Reward: 615.881982 Avg Reward: 310.279847
---------------------------------------
Evaluation over 100 episodes: 696.263060
---------------------------------------
Total T: 69264 Episode Num: 141 Reward: 838.117703 Avg Reward: 334.0237282
```