# Leave no trace

# **INTRODUCTION**

In this work, we study the paper "*Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning*". To do so, we train a bipedal robot to walk using the framework introduced in the paper. We build our work on an existing method, Soft Actor-Critic (SAC), and use it as a baseline.

# **Environment presentation : BipedalWalker-v2**

![Alternative text…](https://miro.medium.com/max/1348/1*1hZSWhdg1BKy9iXlPA_luA.png)

For our experiments, we use **Gym**, a toolkit for developing and comparing reinforcement learning algorithms. Gym offers many environments in which different types of agents can be trained to accomplish tasks. The environment we use is the **Bipedal walker environment**. <br><br>


In this environment a 2D bipedal walker has to learn a policy to walk without falling over.

The total reward is based on the total distance covered by the agent. The episode ends when the robot body touches the ground or the robot reaches the far right side of the environment. BipedalWalker-v2 defines "solving" as getting an average reward of 300 over 100 consecutive trials.

https://gym.openai.com/envs/BipedalWalker-v2/
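
The "solved" criterion above can be checked on a list of per-episode rewards. The following is a minimal sketch and not part of Gym: the function name `is_solved` and its parameters are hypothetical, and only the threshold of 300 and the window of 100 trials come from the definition above.

```python
from collections import deque


def is_solved(episode_rewards, threshold=300.0, window=100):
    """Return True once any `window` consecutive episodes average >= threshold."""
    recent = deque(maxlen=window)
    running_sum = 0.0
    for reward in episode_rewards:
        if len(recent) == window:
            # The oldest reward is about to be evicted by the append below.
            running_sum -= recent[0]
        recent.append(reward)
        running_sum += reward
        if len(recent) == window and running_sum / window >= threshold:
            return True
    return False
```

For instance, `is_solved(total_rewards)` could be called on the list of total rewards collected during training, assuming one entry per episode.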




<br><br>
**How does the agent interact with its environment?**<br>
Source: [OpenAI](https://openai.com/)

At each timestep, the agent chooses an action, and the environment returns an observation, a reward, a `done` flag and an `info` dictionary.

- observation (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game like Taxi.
- reward (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
- done (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
- info (dict): diagnostic information useful for debugging; it can be ignored here. Official evaluations of your agent are not allowed to use this for learning.
<br><br>

For example, for our walker :<br>
- **Observations** : a collection of 24 numeric values representing the state of our agent at a given step. They include the inclination angle of the hull (0 meaning that the hull is horizontal, thus stable), its velocity, the angles and speeds of the hip and knee joints of each leg, whether each leg touches the ground, and lidar measurements collected by the agent.


- **Rewards** : The total reward is based on the total distance covered by the agent, adding up to 300+ points at the far end. If the robot falls, it gets -100 and the game ends. The episode ends when the robot body touches the ground or the robot reaches the far right side of the environment.

- **Actions** : the input the agent provides to the environment. At each step, the agent outputs a single action made of 4 continuous values : Hip_1 Torque, Knee_1 Torque, Hip_2 Torque, Knee_2 Torque. In other words, we control the 2 legs of the robot, and each leg has 2 motors (hip and knee).
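
The interaction loop described above can be sketched without Gym. `ToyWalkerEnv` is a hypothetical stand-in that only mimics the `reset`/`step` interface, with 24-value observations and 4 torques clipped to the range [-1, 1]; its random observations and its reward are made up for illustration and have nothing to do with the real physics.

```python
import random


class ToyWalkerEnv:
    """Hypothetical stand-in exposing the same reset/step interface as a Gym env."""

    OBS_DIM = 24  # the walker observation has 24 values
    ACT_DIM = 4   # hip and knee torque for each of the two legs

    def __init__(self, max_steps=50, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0] * self.OBS_DIM

    def step(self, action):
        assert len(action) == self.ACT_DIM
        # Torques are clipped to [-1, 1] before being applied.
        action = [max(-1.0, min(1.0, a)) for a in action]
        self.t += 1
        observation = [self.rng.uniform(-1.0, 1.0) for _ in range(self.OBS_DIM)]
        reward = sum(action) / self.ACT_DIM  # toy reward, NOT the real one
        done = self.t >= self.max_steps
        return observation, reward, done, {}


def run_episode(env, policy):
    """Run one episode and return (total reward, number of steps)."""
    observation = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done:
        action = policy(observation)
        observation, reward, done, _info = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps


total, steps = run_episode(ToyWalkerEnv(max_steps=10), lambda obs: [1.0, 1.0, -1.0, -1.0])
```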

# **Benchmark Framework**

As we previously said, our goal is to do some experiments emphasizing some results of the paper. For this, we will be building our work upon an existing benchmark.
<br><br>
For this purpose, we use a PyTorch implementation of a continuous-action actor-critic algorithm. The algorithm uses DeepMind's Deep Deterministic Policy Gradient (DDPG) method to update the actor and critic networks, along with an Ornstein–Uhlenbeck process to explore the continuous action space while using a deterministic policy.
DDPG is a policy gradient algorithm that uses a stochastic behaviour policy for exploration (Ornstein–Uhlenbeck noise in this case) and outputs a deterministic target policy, which is easier to learn.
<br><br>
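
The exploration noise mentioned above can be sketched as a discrete-time Ornstein–Uhlenbeck process. This is an illustrative version only: the class name and the values `theta=0.15` and `sigma=0.2` are assumptions chosen for the example and are not taken from the implementation used in this notebook.

```python
import random


class OrnsteinUhlenbeckNoise:
    """Mean-reverting, temporally correlated noise added to a deterministic action."""

    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, seed=None):
        self.dim = dim
        self.mu = mu        # value the noise reverts to
        self.theta = theta  # strength of the pull towards mu
        self.sigma = sigma  # scale of the random perturbation
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.x = [self.mu] * self.dim

    def sample(self):
        # x <- x + theta * (mu - x) + sigma * N(0, 1), applied per dimension.
        self.x = [
            x + self.theta * (self.mu - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x in self.x
        ]
        return list(self.x)
```

An exploration action would then be the actor's output plus `noise.sample()`, typically clipped back to the valid torque range.
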
In this framework, we use a Soft Actor Critic (SAC) for the agent. Our SAC implementation consists of:
- **The actor** : a 3-layer neural network that takes the current state as input and outputs an action (it represents the policy). This network is updated by maximizing the critic score : $\sum Q(s,a)$
- **The target actor** : a neural network representing the target policy, updated with a soft update rule (a weighted sum of the current target actor parameters and the actor parameters). This is the policy used when we wish to exploit our agent policy.
- **The critic** : a 3-layer neural network that takes as input the state $s$ and the corresponding action $a$ and outputs the state-action value $Q(s,a)$. This network is updated by minimizing the loss between its prediction and the value this prediction should have, namely the reward just received plus the discounted Q-value of the next state $s_1$ and of the action $a_1$ the policy takes there (temporal difference learning) : $L_2(r+\gamma Q(s_1,a_1) - Q(s,a))$
- **The target critic** : same principle as the target actor, applied to the critic.
<br><br>
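
The two update rules named in the list above, the soft update of the target networks and the temporal-difference target of the critic, can be written on plain numbers. This is a sketch under assumptions: the function names, `tau=0.001` and `gamma=0.99` are illustrative, and the handling of terminal transitions is an addition for completeness rather than a description of the implementation used here.

```python
def soft_update(target_params, source_params, tau=0.001):
    """Weighted sum: target <- tau * source + (1 - tau) * target."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target_params, source_params)]


def td_target(reward, next_q, gamma=0.99, done=False):
    """Value the critic should predict: r + gamma * Q(s_1, a_1).

    For a terminal transition there is no next state, so the target is the reward alone
    (assumption of this sketch).
    """
    return reward if done else reward + gamma * next_q


# Example: the target parameters move halfway towards the source when tau = 0.5.
halfway = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.5)
```

The critic loss is then some distance between `td_target(...)` and the critic's current prediction $Q(s,a)$.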

**PS**:
- We relied on this [project](https://github.com/vy007vikas/PyTorch-ActorCriticRL).
- The purpose of this study is not to experiment with SAC itself, but to use it as a benchmark and as a tool to evaluate the framework introduced in the paper we study : the Leave-No-Trace framework. Therefore, from now on, SAC will be treated as a black box.

![Alternative text…](https://raw.githubusercontent.com/steph1793/Leave-No-Trace/master/images/Algo1.PNG)

# **Leave-No-Trace Framework**

The authors' idea is to introduce a framework that allows the agent to detect actions that would lead to a failure, avoid them, and take only reversible actions. This framework is said to have several advantages, which we will try to check in the following experiments:

- The agent can run for more steps before failing (fewer resets)
- The agent learns safer actions (reversible actions, allowing it to go back easily to its initial state)
- The performance of the agent is not degraded by the framework.

All those are some advantages claimed in the paper.

To achieve this, the authors introduce a second agent, the reset agent, which learns how to go from a given state back to an equilibrium state (generally the initial state) in which the agent can be considered safe and will not fail. The main agent becomes what we call the forward agent, and the two agents are trained alternately. While training the forward agent, we use the critic of the reset agent to evaluate each action the forward agent is about to take. If the action is considered bad, we switch to the reset policy and take the action it advises instead. For our reset agent, we also use a SAC since it is an off-policy method with a Q-learning network.
<br><br>
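
The switching rule described above can be sketched as a single decision function. Everything here is hypothetical and for illustration only: the policies and the critic are plain callables on numbers, and `q_min` is the threshold below which an action is considered unsafe.

```python
def choose_action(state, forward_policy, reset_policy, reset_critic, q_min):
    """Return (action, aborted): fall back on the reset policy for unsafe actions."""
    proposed = forward_policy(state)
    if reset_critic(state, proposed) < q_min:
        # Early abort: the reset critic judges the proposed action too risky.
        return reset_policy(state), True
    return proposed, False


# Toy 1-D example: the reset critic penalizes ending up far from 0.
toy_critic = lambda s, a: -abs(s + a)
toy_reset_policy = lambda s: -s  # move straight back to 0

risky = choose_action(1.0, lambda s: 10.0, toy_reset_policy, toy_critic, q_min=-5.0)
safe = choose_action(1.0, lambda s: 2.0, toy_reset_policy, toy_critic, q_min=-5.0)
```

Here `risky` is replaced by the reset action because its value of -11 falls below the threshold, while `safe`, with a value of -3, is kept.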

**PS : For the reset agent, we will have to define a reward function different from the reward function of the main agent. We will discuss it in more detail in the experiments.**

![Alternative text…](https://raw.githubusercontent.com/steph1793/Leave-No-Trace/master/images/algo2.PNG)

# **Install some dependencies and import them**
Remember to restart the notebook after installing the dependencies.

In [0]:
!pip install folium==0.2.1 > /dev/null 2>&1
!apt-get update > /dev/null 2>&1
!apt-get install cmake > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1
!pip install 'gym[box2d]' > /dev/null 2>&1

In [0]:
!git clone https://github.com/steph1793/Leave-No-Trace

In [0]:
cd Leave-No-Trace

In [0]:
!mkdir results
!mkdir results/Models
!mkdir results/Models2
!mkdir results/Models4

In [0]:
import gym
from gym import logger as gymlogger
gymlogger.set_level(40) #error only
import tensorflow as tf
import torch.nn.functional as F
import torch
from torch.autograd import Variable
import numpy as np
import random
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math

import buffer
from helper import show_video, wrap_env
from utils import create_agent
from train import Trainer

import gc

In [0]:
#remove " > /dev/null 2>&1" to see what is going on under the hood
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

In [0]:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()


def video_callable(_ep):
  if _ep<50:
    return (_ep)%10==0
  else:
    return (_ep)%50==0 

# **Visualize the environment**

In [0]:
env = gym.make('BipedalWalker-v3')
obs= env.reset()
for _ in range(15):
    action = [1,1,-1,-1]
    obs, rew, done, info = env.step(action)
    plt.imshow(env.render('rgb_array'))
    plt.grid(False)
env.close()

# **Some Hyperparameters**

Here we define some hyperparameters, such as the maximum number of episodes to run and the maximum number of steps per episode. We also initialize a "MAX_BUFFER" variable, which represents the size of the buffer in which we store the (state, action, reward, next state) tuples and from which we sample the training batches.

In [0]:
MAX_EPISODES = 1500
MAX_STEPS = 800
MAX_BUFFER = 1000000
MAX_TOTAL_REWARD = 300


S_DIM = env.observation_space.shape[0]
A_DIM = env.action_space.shape[0]
A_MAX = env.action_space.high[0]

print(' State Dimensions : ', S_DIM)
print(' Action Dimensions : ', A_DIM)
print(' Action Max : ', A_MAX)
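
The role of the buffer sized by "MAX_BUFFER" can be sketched with a fixed-capacity queue. This is a hypothetical minimal version, not the `buffer.MemoryBuffer` class used in this notebook: it stores (state, action, reward, next state) tuples, drops the oldest one when full, and samples training batches uniformly at random.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.storage = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Never ask for more transitions than are currently stored.
        batch_size = min(batch_size, len(self.storage))
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


toy_buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    toy_buffer.add(step, 0.0, 1.0, step + 1)
```

After the loop, `toy_buffer` holds only the three most recent transitions.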

# **EXPERIMENTS**

**Our experiments**:

- First, we run our benchmark model and collect some measurements that we study later in the notebook.
- Second, we run our Leave-No-Trace framework.
<br><br>

**Our goal** is to study:

- how well the SAC-only method performs
- the performance of the main agent when helped by the reset agent
- how the two methods compare
- the impact of the reset agent
- the impact of the reset reward function
<br><br>

For the experiments, we will look essentially at two **performance "metrics"**:

- The maximum number of steps achieved by the agent per episode
- The total number of rewards collected throughout each episode

We will also visualize and comment on the training using some videos.
We ran a series of four experiments. Each experiment took **at least 6 hours** to train the agent(s).
<br><br><br>


## **SAC-only experiment**

First, we create our environment. We wrap it in a wrapper that limits the maximum number of steps per episode (we want our agent to reach the maximum number of points within a given number of steps).

In [0]:
env = wrap_env(gym.make('BipedalWalker-v3'), MAX_STEPS, video_folder='./results/video', video_callable=video_callable )
ram = buffer.MemoryBuffer(MAX_BUFFER)

Then, we create the agent (Soft Actor Critic).

In [0]:
agent = create_agent(S_DIM, A_DIM, A_MAX)

We create a trainer object, a helper for exploring or exploiting the agent policy and for optimizing the agent (by updating its parameters).

In [0]:
trainer = Trainer(agent, ram)

Finally, we implement the algorithm described by the pseudo-code in the "Benchmark Framework" section above.

In [0]:
total_rewards_per_ep = []
num_steps_per_ep = []
for _ep in range(MAX_EPISODES):
  observation = env.reset()
  print()
  r = 0
  for i in range(MAX_STEPS):
    env.render()
    state = np.float32(observation)

    action = trainer.get_exploration_action(state)

    new_observation, reward, done, info = env.step(action)
    r += reward
    if done:
      new_state = None
    else:
      new_state = np.float32(new_observation)
      # push this exp in ram
      ram.add(state, action, reward, new_state)

    observation = new_observation
    print( end='\rEpisode : {}, reward : {}'.format(_ep, r))

    # perform optimization
    trainer.optimize()
    if done:
      break

  gc.collect()

  total_rewards_per_ep.append(r)
  num_steps_per_ep.append(i+1)

  if _ep==0 or (_ep+1)%100 == 0:
    trainer.save_models(_ep, './results/Models')

In [0]:
gc.collect()

In [0]:
np.savetxt("results/1_steps_per_ep.txt", num_steps_per_ep)
np.savetxt("results/1_rewards.txt", total_rewards_per_ep)

## **Leave-No-Trace experiment with SAC**

First, we implement the method described by the pseudo-code in the "Leave-No-Trace Framework" section above, with one variation: we do not program a hard reset after N failures of the reset policy, as described in the paper. We only use the hard reset built into the environment (reset when falling, etc.). We train the forward policy and the reset policy alternately. During the training of the forward agent, we use early aborts to avoid taking actions that would lead to failures. The early aborts are programmed as follows:

- At each iteration we use the reset critic to evaluate the Q value of the action. If this Q-value is higher than a given threshold (Q_min), the action is considered to be safe, otherwise, we use the reset actor policy to take an action and avoid bad actions.
<br><br>


Two "parameters" must then be taken into account in our experiments : the Q_min threshold, and the reset reward.

In [0]:
def train_with_reset_agent(reset_reward_fn, q_min, save_model_folder):
  global total_rewards_per_ep
  global num_steps_per_ep
  for _ep in range(MAX_EPISODES):
    observation = env.reset()
    print()
    r = 0
    i=0
    while i < MAX_STEPS:
      i += 1
      
      state = np.float32(observation)

      action = forward_trainer.get_exploration_action(state)

      q_value = reset_trainer.critic.forward(Variable(torch.from_numpy(np.float32([state]))), Variable(torch.from_numpy(np.float32([action]))))
      if q_value < q_min: #### early abort, we switch to reset policy (safe mode)
        action = reset_trainer.get_exploitation_action(state)
        i -= 1
      else:
        env.render()

      new_observation, reward, done, info = env.step(action)
      r += reward
      if done:
        new_state = None
      else:
        new_state = np.float32(new_observation)
        # push this exp in ram
        forward_ram.add(state, action, reward, new_state)

      observation = new_observation
      print( end='\rEpisode : {}, reward : {}'.format(_ep, r))

      # perform optimization
      forward_trainer.optimize()
      if done:
        break
    observation = env.reset()
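    # Reset-agent episode. A separate loop variable keeps the forward step count `i`
    # (logged in num_steps_per_ep below) from being overwritten.
    for _ in range(MAX_STEPS):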
      state = np.float32(observation)
      action = reset_trainer.get_exploration_action(state)
      new_observation, _, done, info = env.step(action)
      reward = reset_reward_fn(new_observation)

      if done :
        new_state = None
      else:
        new_state = np.float32(new_observation)
        reset_ram.add(state, action, reward, new_state)

      observation = new_observation

      reset_trainer.optimize()
      if done :
        break

    gc.collect()

    total_rewards_per_ep.append(r)
    num_steps_per_ep.append(i+1)

    if _ep==0 or (_ep+1)%100 == 0:
      forward_trainer.save_models(_ep, save_model_folder)

We define two reset reward functions. 
- In the first reward function, we penalize the agent when it takes actions that put it in a state far away from the state where the hull is horizontal and one leg is straight and touches the ground. We consider that position to be an equilibrium state (not necessarily the best one).
- In the second reward function, we consider the equilibrium state to be the one where the hull is horizontal and the body is higher than a certain threshold.

In the two cases, the maximum reward the reset agent can get is 0.
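
To see why 0 is the best reward the reset agent can get, the idea of the first reward function can be reproduced on plain lists. This is a simplified sketch under assumptions: `smooth_l1` is a hand-written version of a mean smooth L1 distance, and `reset_reward`, its arguments and the toy states are hypothetical.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 distance: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)


def reset_reward(state, equilibrium, indices):
    """Negative distance to the equilibrium state on the selected coordinates."""
    return -smooth_l1([state[i] for i in indices], [equilibrium[i] for i in indices])


at_equilibrium = reset_reward([0.0, 1.0], [0.0, 1.0], indices=[0, 1])
slightly_off = reset_reward([0.5, 0.0], [0.0, 0.0], indices=[0])
far_off = reset_reward([3.0], [0.0], indices=[0])
```

The distance is never negative, so the reward is at most 0 and reaches 0 only when the selected coordinates match the equilibrium state exactly.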

In [0]:

def reset_reward_fn_1(state):
  r0 = float(F.smooth_l1_loss(torch.Tensor(np.array(state)[[0]]), torch.Tensor(np.array(inital_state)[[0]])))
  r1 = float(F.smooth_l1_loss(torch.Tensor(np.array(state)[[9,11,13]]), torch.Tensor(np.array(inital_state)[[9,11,13]])))
  r2 = float(F.smooth_l1_loss(torch.Tensor(np.array(state)[[4,6,8]]), torch.Tensor(np.array(inital_state)[[4,6,8]])))
  r = - (r0 + min(r1, r2))
  return r


In [0]:

def reset_reward_fn_2(state):
  r = -5.0 * (env.hull.position[1]< 4.7)
  r += -5.0 * np.abs(state[0])
  return r

We build a helper function, that creates the environment, the agents, the trainers, and the buffers.

In [0]:
def build_env_agents(video_folder):
  env = wrap_env(gym.make('BipedalWalker-v3'), MAX_STEPS, video_folder=video_folder, video_callable=video_callable )
  inital_state = env.reset()

  forward_ram = buffer.MemoryBuffer(MAX_BUFFER)
  reset_ram = buffer.MemoryBuffer(MAX_BUFFER)

  forward_agent = create_agent(S_DIM, A_DIM, A_MAX)
  reset_agent = create_agent(S_DIM, A_DIM, A_MAX)

  forward_trainer = Trainer(forward_agent, forward_ram)
  reset_trainer = Trainer(reset_agent, reset_ram)

  return env, inital_state, forward_ram, reset_ram, forward_agent, reset_agent, forward_trainer, reset_trainer

### **Equilibrium state: horizontal hull and at least one straight leg;   *q*\_min=-5.0**

In this section, we use the first reset reward function with Q_min threshold at -5.0

In [0]:
env, inital_state, forward_ram, reset_ram, forward_agent, reset_agent, forward_trainer, reset_trainer = build_env_agents("results/video2")

In [0]:
q_min = -5.0

In [0]:
total_rewards_per_ep = []
num_steps_per_ep = []

train_with_reset_agent(reset_reward_fn=reset_reward_fn_1, q_min = q_min , save_model_folder= "results/Models2")

In [0]:
np.savetxt("results/with_reset_2_rewards.txt", np.array(total_rewards_per_ep))
np.savetxt("results/with_reset_2_steps_per_ep.txt", np.array(num_steps_per_ep))

### **Equilibrium state: horizontal hull and body higher than a certain level;   q\_min=-50.0**

In this section, we use the second reset reward function and a Q_min threshold at -50.0

In [0]:
env, inital_state, forward_ram, reset_ram, forward_agent, reset_agent, forward_trainer, reset_trainer = build_env_agents("results/video4")

In [0]:
q_min = -50.0

In [0]:
total_rewards_per_ep = []
num_steps_per_ep = []

train_with_reset_agent(reset_reward_fn=reset_reward_fn_2, q_min = q_min , save_model_folder= "results/Models4")

In [0]:
np.savetxt("results/with_reset_4_rewards.txt", np.array(total_rewards_per_ep))
np.savetxt("results/with_reset_4_steps_per_ep.txt", np.array(num_steps_per_ep))

# **RESULTS**

We reload the different measures we saved previously.

In [0]:
## Results from the simple SAC Framework experiment (number of steps per episode and total rewards per episode)
steps_per_ep_1 = np.loadtxt("results/1_steps_per_ep.txt")
rewards_1 = np.loadtxt("results/1_rewards.txt")

## results from the Leave-No-Trace Framework using Q_min =-5 and the first reset reward function
with_reset_2_steps_per_ep = np.loadtxt("results/with_reset_2_steps_per_ep.txt")
with_reset_2_rewards = np.loadtxt("results/with_reset_2_rewards.txt")

## results from the Leave-No-Trace Framework using Q_min =-50 and the second reset reward function
with_reset_4_steps_per_ep = np.loadtxt("results/with_reset_4_steps_per_ep.txt")
with_reset_4_rewards = np.loadtxt("results/with_reset_4_rewards.txt")

#### Comparative study

In [0]:
f, (ax1, ax2) = plt.subplots(1, 2,figsize=(12,4))
ax1.plot(with_reset_2_steps_per_ep)
ax1.plot(steps_per_ep_1)

ax1.set_title('Number of steps per episode')
ax1.set_xlabel("Episode")
ax1.set_ylabel("steps")

ax2.plot(with_reset_2_rewards)
ax2.plot(rewards_1)
ax2.set_xlabel("Episode")
ax2.set_ylabel("reward")

_=ax2.set_title('Total rewards per episode')

_=f.legend(["SAC with reset agent", "simple SAC"])  # labels follow the plotting order

**Comments**
<br>
We can clearly observe that the Leave-No-Trace framework allowed our agent to run many more steps without failing. It can even collect much more reward than the simple agent alone. In this experiment, the safe state is the one where the agent keeps its head horizontal and at least one leg straight and touching the ground.

In [0]:
f, (ax1, ax2) = plt.subplots(1, 2,figsize=(12,4))
ax1.plot(with_reset_4_steps_per_ep)
ax1.plot(steps_per_ep_1)

ax1.set_title('Number of steps per episode')
ax1.set_xlabel("Episode")
ax1.set_ylabel("steps")

ax2.plot(with_reset_4_rewards)
ax2.plot(rewards_1)
ax2.set_xlabel("Episode")
ax2.set_ylabel("reward")

_=ax2.set_title('Total rewards per episode')
_=f.legend(["SAC with reset agent", "simple SAC"])  # labels follow the plotting order

**Comments**
<br>
In this experiment, the equilibrium state is considered to be the state where the agent's head is horizontal and its body is higher than a certain level. The results are obtained with Q_min=-50, but we also tried Q_min=-5 and obtained similar results. The way we defined the "good" state does not work as well as in the previous case.
<br>
However, we can see that each training episode lasts much longer than with a simple SAC without a reset agent.

#### Visualisation

In [0]:
show_video("results/video/openaigym.video.0.3678.video001099.mp4")

In [0]:
show_video("results/video2/openaigym.video.0.2266.video001600.mp4")

**Comments**
<br>
The first video shows the agent after 1100 episodes in the SAC-only framework. Among all the recorded videos, it is one of the longest sequences the agent has been able to hold. It lasts only 3 seconds.
<br>
The second video shows the forward agent in the Leave-No-Trace framework after 800 episodes. (Note the episode number: the video name suggests episode 1600, but it is in fact episode 800. Because both agents are trained in the same environment, the environment's episode counter advances twice as fast as the forward agent's.)
<br><br>
When we look at the second video, we can observe that the agent tries to go forward, but in a very safe way. It restricts the actions it takes. It is therefore important to define the equilibrium state well (as we have seen before); otherwise the agent could be too cautious and end up forgetting the initial goal, which is to go forward as fast as possible.
<br><br>
**Does the reset agent considerably impact the main agent?**<br>
The answer is rather yes, in part for the reasons explained in the previous paragraph. For example, if we pay attention to the way our agent moves, it tries to keep at least one leg very straight (which corresponds exactly to what we defined in the reset reward function).

# **Conclusion**

In this work, we tried to verify the claims of this [paper](https://arxiv.org/abs/1711.06782). Our experiments indicate that introducing a reset agent allows the main agent to run longer episodes, which reduces the number of hard resets and the cost they imply. Beyond that, we have seen that our agent learns to accomplish its task by taking "safe" actions. It learns the main goal while also learning to be cautious about the actions it takes. For example, in our experiments, the agent learns how to stay balanced and not fall while learning to move forward.

**Experimental method**<br>
- **Reproducibility** : To check our results, we ran these experiments three times and reached about the same conclusions.
- **Hyperparameter tuning**: Due to our limited computing power, we could only run a limited number of experiments. We considered studying the impact of the Q_min threshold, but preferred to focus on showing the contribution of the reset agent. Further experiments could study the influence of the Q-value threshold.

# **References**

[Leave No Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning](https://arxiv.org/abs/1711.06782), *by Benjamin Eysenbach, Shixiang Gu, Julian Ibarz, Sergey Levine* 

[Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor](https://arxiv.org/abs/1801.01290), *by Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine*

[GYM Framework](https://gym.openai.com/), *by OpenAI*

[Soft-Actor-Critic Implementation](https://github.com/vy007vikas/PyTorch-ActorCriticRL), *Vikas Yadav*
