
# **OpenAI Gym BipedalWalker-v3 TD3 with TensorBoard & Video** 

This Colab notebook lets you train, evaluate and visualize your results using Stable-Baselines and TensorBoard. Google Colab doesn't support `env.render()`, so we use a workaround: we "fake" a display, record a video and then display it inline. We will be using an OpenAI Gym environment, Stable-Baselines & TD3.

## **Instructions**
Click **Open in playground** in the top-left corner.   
Then either run the notebook cell by cell (recommended)   
or click "Runtime" in the toolbar, then "Run all", and leave the tab running.  
Check back in 30-60 minutes and scroll down to the bottom.

### **Links**
[https://github.com/openai/gym/wiki/BipedalWalker-v2](https://github.com/openai/gym/wiki/BipedalWalker-v2)  
[https://stable-baselines.readthedocs.io/en/master/](https://stable-baselines.readthedocs.io/en/master/)  
[https://towardsdatascience.com/td3-learning-to-run-with-ai-40dfc512f93](https://towardsdatascience.com/td3-learning-to-run-with-ai-40dfc512f93)

----

# **A Notebook from Nextgrid.ai**
![Nextgrid Deep learning labs](https://nextgrid.ai/wp-content/uploads/2020/01/deep-learning-labs-scaled.jpg)

 
### **Nextgrid** - _The **Superlative** destination for deep & reinforcement learning startups & talent_

Learn more: [Deep learning labs](https://nextgrid.ai/deep-learning-labs/) / [Nextgrid](https://nextgrid.ai) 



▪️️️️️️️▪️️️️️️️▪️️️️️️️▪️️️️️️️▪️️️️️️️▪️️️️️️️▪️️️️️️️▪️️️️️️️  
*Notebook by Mathias*  
*I would love your feedback,*  
*or to discuss your DL/DRL startup/business idea.*   
*Find me on* _[twitter](https://twitter.com/mathiiias123)_ or _[linkedin](https://www.linkedin.com/in/imathias)_



#### **Changelog** 
```
2020/02/09 - Updated package versions and switched to BipedalWalker-v3
2020/04/08 - Updated to stable-baselines 2.10.0 & Tensorboard issue workaround
```

## Install system wide packages
Install Linux system packages using `apt-get` and Python packages using `pip`.

In [0]:
!sudo apt-get update
!apt-get install -y -qq swig cmake python3-dev libopenmpi-dev zlib1g-dev xvfb x11-utils ffmpeg #remove -qq for full output

%tensorflow_version 1.x
%load_ext tensorboard

!pip install stable-baselines==2.10.0 box2d box2d-kengz pyvirtualdisplay pyglet==1.5.0 --quiet #remove --quiet for full output 

## Dependencies
Import the dependencies required to run, train & record video

In [0]:
import gym
import imageio
import time
import numpy as np
import base64
import IPython
import PIL.Image
import pyvirtualdisplay


# Video 
from pathlib import Path
from IPython import display as ipythondisplay

# Stable baselines
from stable_baselines import TD3
from stable_baselines.td3.policies import MlpPolicy
from stable_baselines.ddpg.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise
from stable_baselines.common.vec_env import DummyVecEnv, SubprocVecEnv, VecVideoRecorder
from stable_baselines.common.evaluation import evaluate_policy

# Define & Configure our Reinforcement learning algo
Here we define our variables & hyperparameters.  
In this example we use Twin Delayed DDPG (TD3) with a few hyperparameters overridden; the commented-out ones keep their default values.  
Read more about how you define your TD3 [parameters](https://stable-baselines.readthedocs.io/en/master/modules/td3.html#parameters) 

In [0]:
### Variables
env_id = 'BipedalWalker-v3'
video_folder = 'videos/' # Relative path, same as the default used by record_video
video_length = 3000
logs_base_dir = './runs' # Log DIR
steps_total = 0 # Keep track of total steps


### Environment 
env = DummyVecEnv([lambda: gym.make(env_id)])
obs = env.reset()
score = 0
log_interval = 10          # Print avg reward after interval


### Hyperparameters 

# Action noise
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

# Model configuration
model = TD3(
    MlpPolicy,
    env,
    verbose=1,                         # display output when training, 0 = no output, 1 = show output
    gamma=0.99,                        # discount for future rewards
    learning_rate=0.003,               # learning rate
    buffer_size=100000,                # size of the replay buffer
    batch_size=1000,                   # number of transitions sampled from replay buffer per gradient update
    learning_starts=500,               # steps collected before training starts
    train_freq=1000,                   # update the model every train_freq steps.
    gradient_steps=1000,               # how many gradient updates to run at each model update
    #tau=0.005,                        # the soft update coefficient (“polyak update” of the target networks, between 0 and 1)
    #policy_delay=2,                   # policy and target networks will only be updated once every policy_delay steps per training steps. The Q values will be updated policy_delay more often (update every training step).
    #action_noise=action_noise,        # action noise type. Cf DDPG for the different action noise type.
    #target_policy_noise=0.2,          # standard deviation of Gaussian noise added to target policy 
    #target_noise_clip=0.5,            # limit for absolute value of target policy smoothing noise.
    #random_exploration=0.0,           # probability of taking a random action
    n_cpu_tf_sess=None,                # number of threads for TensorFlow operations If None, the number of cpu of the current machine will be used.

    # Tensorboard stuff
    tensorboard_log=logs_base_dir,
    full_tensorboard_log=True, 

    # seed=None, 
    # _init_setup_model=True, 
    # 
    )
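
The commented-out parameters above (`tau`, `target_policy_noise`, `target_noise_clip`) control the three ideas that distinguish TD3 from DDPG. The following is a minimal plain-Python sketch of those ideas, not the Stable-Baselines implementation: the function names are made up for illustration, the actions are assumed to lie in `[-1, 1]` (as they do for BipedalWalker), and the Q-values are passed in as plain numbers instead of coming from neural networks.

```python
import random


def clip(x, low, high):
    """Clamp x to the interval [low, high]."""
    return max(low, min(high, x))


def smoothed_target_action(target_action, noise_std=0.2, noise_clip=0.5, rng=random):
    """Target policy smoothing: add clipped Gaussian noise to each action
    dimension, then clip the result to the assumed action range [-1, 1]."""
    noisy = []
    for a in target_action:
        noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
        noisy.append(clip(a + noise, -1.0, 1.0))
    return noisy


def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: bootstrap from the smaller of the two
    target critics, and do not bootstrap past a terminal state."""
    return reward + gamma * (1.0 - float(done)) * min(q1_next, q2_next)


def soft_update(target_params, source_params, tau=0.005):
    """Polyak averaging: move each target weight a fraction tau
    towards the corresponding online weight."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target_params, source_params)]


# Hypothetical numbers, only to show the call pattern.
print(smoothed_target_action([0.9, -0.9, 0.0], rng=random.Random(0)))
print(td3_target(reward=1.0, done=False, q1_next=2.0, q2_next=3.0, gamma=0.5))
print(soft_update([0.0, 1.0], [1.0, 1.0], tau=0.5))
```

A larger `tau` makes the target networks follow the online networks more closely, while a larger `target_policy_noise` smooths the value estimate over a wider neighbourhood of actions.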


## Training & Rec/Play Video [functions]

- `def learning(name, steps=10000, prefix=env_id, eval=1000):`
- `def record(name, length=1500):`  

_that simply help us call the right functions to train our agent and to record & display video_ 

In [0]:
# Training function
def learning(name, steps=10000, prefix=env_id, eval=1000):
  model.learn(total_timesteps=steps, log_interval=log_interval)
  model.save(name + "-" + prefix)
  # Optional: evaluate the agent right after training
  # (the `evaluate` helper below is not defined in this notebook)
  # mean_reward_after_train = evaluate(model, num_steps=eval)


def record(name, length=1500):
  record_video(env_id, model, video_length=length, prefix=name)
  show_videos('videos', prefix=name)
  print(name, "steps total")

## Functions





In [0]:
### Record & Display Video

import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

# Record video
def record_video(env_id, model, video_length=500, prefix='', video_folder='videos/'):
  """
  :param env_id: (str)
  :param model: (RL model)
  :param video_length: (int)
  :param prefix: (str)
  :param video_folder: (str)
  """
  eval_env = DummyVecEnv([lambda: gym.make(env_id)])
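  # Start the video at step=0 and record video_length steps.
  # Wrap the local eval_env (not the global training env), so that closing
  # the recorder below does not close the environment used for training.
  eval_env = VecVideoRecorder(eval_env, video_folder=video_folder,
                              record_video_trigger=lambda step: step == 0, video_length=video_length,
                              name_prefix=prefix)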

  obs = eval_env.reset()
  for _ in range(video_length):
    action, _ = model.predict(obs)
    obs, _, _, _ = eval_env.step(action)

  # Close the video recorder
  eval_env.close()


## Display video
def show_videos(video_path='', prefix=''):
  html = []
  for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
      video_b64 = base64.b64encode(mp4.read_bytes())
      html.append('''<video alt="{}" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>'''.format(mp4, video_b64.decode('ascii')))
  ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))
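
The display helper above works because a browser can play a video straight from a `data:` URI, so no file server is needed inside Colab. Below is a small stand-alone sketch of just that encoding step. The helper name `video_tag` and the placeholder bytes are illustrative assumptions; a real call would pass the bytes of a recorded `.mp4` file.

```python
import base64


def video_tag(video_bytes, alt="", height_px=400):
    """Return an HTML <video> element with the video embedded as a
    base64 data URI, so it can be rendered inline without a file server."""
    b64 = base64.b64encode(video_bytes).decode("ascii")
    return (
        '<video alt="{}" autoplay loop controls style="height: {}px;">'
        '<source src="data:video/mp4;base64,{}" type="video/mp4" />'
        '</video>'
    ).format(alt, height_px, b64)


# Placeholder bytes, not a playable video; only the encoding is demonstrated.
tag = video_tag(b"not a real mp4", alt="demo")
print(tag)
```

Base64 inflates the payload by roughly a third, so long recordings make the notebook noticeably heavier.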

# Display Tensorboard inline
Run & display TensorBoard   
**PS.** *Sometimes it does not show up at all. If that happens, just run the cell again; the `%reload_ext tensorboard` line reloads the extension.*

It's correctly loaded when you see this view
![Tensorboard](https://nextgrid.ai/wp-content/uploads/2019/12/Screenshot-2019-12-27-at-16.40.02.png)

In [0]:
# Often does not load on the first try, run again until you see the screen
%tensorboard --logdir {logs_base_dir}
%reload_ext tensorboard

# Training Function
We want to automate training so that it keeps running until the result we are looking for is achieved.

In [0]:

def run_training(steps_per_round=200000,limit=300):
  # This function runs one training round of `steps_per_round` steps.
  # After each round it measures the mean reward. While the score is below `limit`
  # it keeps training, round after round, until the score limit is reached.

  global score
  global steps_total

  print("Training is starting.. ")
  
  while score < limit:
      steps_total = steps_total + steps_per_round
      learning(str(steps_total), steps=steps_per_round)
      new_evaluation = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True, render=False, callback=None, reward_threshold=None, return_episode_rewards=False)
      score = new_evaluation[0]
      record(name=steps_total, length=1000) # comment out to skip the video after each round
      print("Mean reward:", score )
    

  # Threshold reached > evaluate over 100 episodes > Video rec/display
  print("Reward limit achieved, measuring over 100 episodes & recording video, please wait...")
  record(name=steps_total, length=1750)
  # With return_episode_rewards=False the result is (mean_reward, std_reward)
  ep100 = evaluate_policy(model, env, n_eval_episodes=100, deterministic=True, render=False, callback=None, reward_threshold=None, return_episode_rewards=False)
  print("Mean Reward 100 Episodes: ", ep100[0])
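
The loop above keeps training for as long as the score stays under `limit`, which can run indefinitely if the agent plateaus. As a hedged sketch of the same train-then-evaluate pattern with a safety cap, the snippet below uses a stand-in `FakeAgent` instead of the real TD3 model; `train_until`, `max_rounds` and `FakeAgent` are hypothetical names that do not appear in this notebook.

```python
def train_until(train_round, evaluate, limit, max_rounds=20):
    """Alternate training and evaluation until the score reaches `limit`
    or `max_rounds` rounds have been run. Returns the score after each round."""
    history = []
    for _ in range(max_rounds):
        train_round()
        score = evaluate()
        history.append(score)
        if score >= limit:
            break
    return history


class FakeAgent:
    """Stand-in for a model: every training round adds 50 to its score."""

    def __init__(self):
        self.skill = 0.0

    def train(self):
        self.skill += 50.0

    def evaluate(self):
        return self.skill


agent = FakeAgent()
history = train_until(agent.train, agent.evaluate, limit=120)
print(history)  # [50.0, 100.0, 150.0]
```

In the notebook, `train_round` would correspond to a call to `learning(...)` and `evaluate` to the mean reward returned by `evaluate_policy`.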

## Train model
Use the `steps_per_round` parameter to set how many environment steps are run before each evaluation. In `limit`, set the mean reward you want the model to reach before training ends. If the score is not reached, another round is run.

In [0]:
# Training
run_training(steps_per_round=100000,limit=100)

In [0]:
run_training(steps_per_round=100000,limit=200)

# Evaluation
OpenAI Gym scores are generally measured over 100 episodes. Use the code below to measure your average score over 100 episodes.

In [0]:
evals = evaluate_policy(model, env, n_eval_episodes=100, deterministic=True, render=False, callback=None, reward_threshold=None, return_episode_rewards=False)
print(evals)
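
A single mean can hide a lot of variation between episodes. If per-episode rewards are available (in Stable-Baselines 2.10, `evaluate_policy(..., return_episode_rewards=True)` is expected to return the list of episode rewards and the list of episode lengths), they can be summarised as in the sketch below. The helper `summarize_rewards` and the sample numbers are illustrative; the `solved_threshold` of 300 mirrors the default `limit=300` of `run_training` above.

```python
import statistics


def summarize_rewards(episode_rewards, solved_threshold=300.0):
    """Summarise a list of per-episode rewards.
    The standard deviation is the population one, like numpy.std."""
    mean = statistics.mean(episode_rewards)
    return {
        "episodes": len(episode_rewards),
        "mean": mean,
        "std": statistics.pstdev(episode_rewards),
        "min": min(episode_rewards),
        "max": max(episode_rewards),
        "solved": mean >= solved_threshold,
    }


# Made-up rewards, only to show the shape of the summary.
summary = summarize_rewards([290.0, 310.0, 300.0])
print(summary)
```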

In [0]:
### Code demonstrating how to save, delete & load model
# model.save("save_as_name")
# del model # 
# model = TD3.load("name_of_model_to_load")