<a href="https://colab.research.google.com/github/vidgi/bipedalwalker-rl/blob/main/bipedalwalker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# BipedalWalker Deep Reinforcement Learning

Vidya Giri

## Project Overview

In this project, I use reinforcement learning (RL) to teach a bipedal walker to walk. I used the [OpenAI Gym BipedalWalker-v2](https://gym.openai.com/envs/#box2d) environment and [Stable Baselines](https://stable-baselines.readthedocs.io/en/master/) to train the agent with deep reinforcement learning. The model is defined, trained, and evaluated with [Proximal Policy Optimization](https://arxiv.org/abs/1707.06347) (PPO), and the agent is scored by its mean reward per episode. To compare training budgets, I first trained for 50,000 timesteps and then continued training for another 500,000 timesteps in the normal bipedal walker environment.

After this, I also tried training the bipedal walker in the hardcore environment, which adds obstacles that make it harder for the agent to walk forward.

## Approach




### Installing system wide packages

Here we install the Linux system packages needed for the project with `apt-get` and the Python packages with `pip`.

In [1]:
!apt-get install swig cmake libopenmpi-dev zlib1g-dev xvfb x11-utils ffmpeg -qq #remove -qq for full output
!pip install stable-baselines[mpi] box2d box2d-kengz pyvirtualdisplay pyglet==1.3.1 --quiet #remove --quiet for full output 
# Stable Baselines only supports tensorflow 1.x for now
%tensorflow_version 1.x


### Project Dependencies
Now we import all the packages and dependencies required to run and train the model and to record the demo video. Colab does not support `env.render()`, which is typically used to view results in standard Python notebooks, so I used a workaround that emulates a display, records the video, and then displays it in the notebook.

In [2]:
import gym
import imageio
import numpy as np
import base64
import IPython
import PIL.Image
import pyvirtualdisplay

# Video stuff 
from pathlib import Path
from IPython import display as ipythondisplay

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import VecVideoRecorder, SubprocVecEnv, DummyVecEnv
from stable_baselines import PPO2




### Define variables/functions
Here I define the variables and then create the `evaluate`, `record_video`, and `show_videos` functions, which are used later to evaluate the bipedal walker model and show the results.

In [3]:
# set environment variables
env_id = 'BipedalWalker-v2'
video_folder = '/videos'
video_length = 100

# set our initial environment
env = DummyVecEnv([lambda: gym.make(env_id)])
obs = env.reset()

In [4]:
# define the evaluate function
def evaluate(model, num_steps=1000):
  """
  Evaluate an RL agent
  :param model: (BaseRLModel object) the RL Agent
  :param num_steps: (int) number of timesteps to evaluate it
  :return: (float) Mean reward for the last 100 episodes
  """
  episode_rewards = [0.0]
  obs = env.reset()
  for i in range(num_steps):
      # _states are only useful when using LSTM policies
      action, _states = model.predict(obs)

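      # env is a DummyVecEnv, so step() returns arrays with one entry per environment
      obs, rewards, dones, info = env.step(action)

      # stats: this notebook uses a single environment, so read index 0
      episode_rewards[-1] += rewards[0]
      if dones[0]:
          obs = env.reset()
          episode_rewards.append(0.0)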
  # compute the mean reward for the last 100 episodes
  mean_100ep_reward = round(np.mean(episode_rewards[-100:]), 1)
  print("Mean reward:", mean_100ep_reward, "Num episodes:", len(episode_rewards))
  
  return mean_100ep_reward
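
One caveat with `evaluate` is that the list of episode rewards always ends with an episode that is still in progress (or with the `0.0` entry appended right after a reset), and that partial value is averaged in with the finished episodes. The sketch below shows one possible alternative that averages finished episodes only. The helper name `mean_completed_episode_reward` and its list-based inputs are hypothetical, chosen so the example runs without Gym; it is not part of the notebook or of Stable Baselines.

```python
from typing import Optional, Sequence


def mean_completed_episode_reward(rewards: Sequence[float],
                                  dones: Sequence[bool]) -> Optional[float]:
    """Average the total reward of episodes that actually finished.

    Hypothetical helper for illustration: `rewards` and `dones` are the
    per-step values that env.step() would return for a single environment.
    """
    completed = []   # total reward of each finished episode
    running = 0.0    # reward accumulated in the current episode
    for reward, done in zip(rewards, dones):
        running += reward
        if done:
            completed.append(running)
            running = 0.0
    # Steps after the last `done` belong to an unfinished episode and are ignored.
    if not completed:
        return None
    return sum(completed) / len(completed)


# Two finished episodes (totals 3.0 and -1.0) followed by one unfinished step.
example_mean = mean_completed_episode_reward(
    rewards=[1.0, 2.0, -1.0, 5.0],
    dones=[False, True, True, False],
)
print(example_mean)
```

With only a handful of episodes per evaluation, as in the runs in this notebook, a single partial episode can shift the mean noticeably, so it may be worth excluding it.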

In [5]:
# make video and set up emulated display (otherwise rendering will fail)
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

In [6]:
# define the record_video function
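def record_video(env_id, model, video_length=500, prefix='', video_folder='videos/'):
  """
  Record a video of the agent acting in a fresh copy of the environment
  :param env_id: (str) Gym environment ID to record
  :param model: (RL model) the agent that chooses the actions
  :param video_length: (int) number of steps to record
  :param prefix: (str) prefix for the video file name
  :param video_folder: (str) folder where the video is saved
  """
  # build a separate environment so that closing the recorder does not close the training env
  eval_env = DummyVecEnv([lambda: gym.make(env_id)])
  # start the video at step=0 and record video_length steps
  eval_env = VecVideoRecorder(eval_env, video_folder=video_folder,
                              record_video_trigger=lambda step: step == 0, video_length=video_length,
                              name_prefix=prefix)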

  obs = eval_env.reset()
  for _ in range(video_length):
    action, _ = model.predict(obs)
    obs, _, _, _ = eval_env.step(action)

  # close the video recorder
  eval_env.close()

In [7]:
# display video
def show_videos(video_path='', prefix=''):
  html = []
  for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
      video_b64 = base64.b64encode(mp4.read_bytes())
      html.append('''<video alt="{}" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>'''.format(mp4, video_b64.decode('ascii')))
  ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))
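
The core of `show_videos` is the embedding step: the file's bytes are base64-encoded and placed in a `data:` URI, so the notebook needs no separate file server. Here is a minimal standalone sketch of just that step. The function name `video_tag` is hypothetical, and the placeholder bytes stand in for a real `.mp4` file.

```python
import base64


def video_tag(video_bytes: bytes, alt: str = "") -> str:
    """Return an HTML <video> element with the clip inlined as a data URI."""
    encoded = base64.b64encode(video_bytes).decode("ascii")
    return (
        '<video alt="{}" autoplay loop controls style="height: 400px;">'
        '<source src="data:video/mp4;base64,{}" type="video/mp4" />'
        "</video>"
    ).format(alt, encoded)


# Placeholder bytes stand in for the contents of a real .mp4 file.
tag = video_tag(b"not a real video", alt="demo")
print(tag[:60])
```

Because the whole clip is inlined, the notebook file grows with every embedded video, which is one reason to keep `video_length` modest.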

### Configure the reinforcement learning algorithm
I used PPO2, the Stable Baselines implementation of Proximal Policy Optimization, as the reinforcement learning algorithm.
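
For intuition about what PPO optimizes, here is a small sketch of the clipped surrogate objective from the PPO paper for a single sample: `min(r * A, clip(r, 1 - eps, 1 + eps) * A)`, where `r` is the probability ratio between the new and old policy and `A` is the advantage estimate. The function name `clipped_surrogate` is hypothetical, and the default `clip_range=0.2` is the value used in the paper's experiments (I believe PPO2 exposes it as `cliprange`). The full training loss also includes value-function and entropy terms, which are left out here.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """Per-sample PPO clipped objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: pushing the ratio beyond 1 + eps earns no extra objective.
gain_capped = clipped_surrogate(ratio=1.5, advantage=2.0)
# Negative advantage: a ratio above 1 + eps keeps its full penalty (no clipping relief).
loss_kept = clipped_surrogate(ratio=1.5, advantage=-2.0)
print(gain_capped, loss_kept)
```

The clipping removes the incentive to move the policy far from the one that collected the data in a single update, which is the main reason PPO tends to train stably.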

In [8]:
# define the model
# model = PPO2(MlpPolicy, env, gamma=0.89, n_steps=128, nminibatches=4, noptepochs=4, ent_coef=0.01, verbose=1)

model = PPO2(MlpPolicy, env, gamma=0.99, verbose=0, tensorboard_log="./logs/")
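
The two model definitions in the cell above differ mainly in `gamma`, the discount factor (0.89 in the commented-out line, 0.99 in the one used). A rough way to compare them is the effective horizon `1 / (1 - gamma)`, the sum of the discount weights, which indicates how many future steps meaningfully contribute to the return. The helper names below are hypothetical and the sketch is only a worked calculation, not part of the training code.

```python
def effective_horizon(gamma: float) -> float:
    """Approximate number of future steps that matter under discount factor gamma."""
    return 1.0 / (1.0 - gamma)


def discounted_return(rewards, gamma: float) -> float:
    """Sum of gamma**t * reward_t over a sequence of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


for gamma in (0.99, 0.89):
    print(gamma, round(effective_horizon(gamma), 1))
```

By this measure `gamma=0.99` looks roughly 100 steps ahead while `gamma=0.89` looks only about 9 steps ahead, which is a plausible reason to prefer the larger value for a walking task where falling is the delayed consequence of earlier actions.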


## Result


### Train BipedalWalker model for 50k steps & evaluate results
Here we evaluate the untrained agent, train it, save the model, evaluate it again, and then record and display a video.

In [9]:

# Random Agent, before training
mean_reward_before_train = evaluate(model, num_steps=10000)

# Train model
model.learn(total_timesteps=50000)

# Save model
model.save("ppo2-walker-50000")

# Trained agent, after training
mean_reward_after_train = evaluate(model, num_steps=10000)

Mean reward: -107.1 Num episodes: 26
Mean reward: -65.5 Num episodes: 8


In [10]:
# Record & show video
record_video('BipedalWalker-v2', model, video_length=1500, prefix='ppo2-walker-50000')
show_videos('videos', prefix='ppo2-walker-50000')

Saving video to  /content/videos/ppo2-walker-50000-step-0-to-step-1500.mp4


### Train BipedalWalker model for another 500k steps & evaluate results


In [12]:
# Agent before this training round (already trained in the previous cells)
mean_reward_before_train = evaluate(model, num_steps=10000)

# Train model
model.learn(total_timesteps=500000)

# Save model
model.save("ppo2-walker-500000")

# Trained agent, after this training round
mean_reward_after_train = evaluate(model, num_steps=10000)

Mean reward: -83.5 Num episodes: 15
Mean reward: -92.2 Num episodes: 46


In [13]:
# Record & show video
record_video('BipedalWalker-v2', model, video_length=1500, prefix='ppo2-walker-500000')
show_videos('videos', prefix='ppo2-walker-500000')

Saving video to  /content/videos/ppo2-walker-500000-step-0-to-step-1500.mp4


### Train BipedalWalker model for another 100k steps & evaluate results

In [15]:
# Agent before this training round (already trained in the previous cells)
mean_reward_before_train = evaluate(model, num_steps=10000)

# Train model
model.learn(total_timesteps=100000)

# Save model
model.save("ppo2-walker-100000")

# Trained agent, after this training round
mean_reward_after_train = evaluate(model, num_steps=10000)

Mean reward: -90.0 Num episodes: 41
Mean reward: -89.4 Num episodes: 57


In [16]:
# Record & show video
record_video('BipedalWalker-v2', model, video_length=1500, prefix='ppo2-walker-100000')
show_videos('videos', prefix='ppo2-walker-100000')

Saving video to  /content/videos/ppo2-walker-100000-step-0-to-step-1500.mp4
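
To compare the three training rounds side by side, the sketch below restates the mean rewards printed by the `evaluate` calls and computes the change for each round. The labels are mine; the numbers are copied from the cell outputs. These evaluations are noisy: each one averages only a few dozen episodes at most, and as far as I know `model.predict` samples actions unless `deterministic=True` is passed. The noise is visible in the outputs, where the model scored -65.5 at the end of the first round and -83.5 at the start of the second, with no training cell shown in between.

```python
# (label, mean reward before, mean reward after), copied from the printed outputs
runs = [
    ("first 50,000 steps", -107.1, -65.5),
    ("next 500,000 steps", -83.5, -92.2),
    ("next 100,000 steps", -90.0, -89.4),
]

# change in mean reward for each training round, rounded like the printed values
changes = {label: round(after - before, 1) for label, before, after in runs}
for label, delta in changes.items():
    print("{:<20} change in mean reward: {:+.1f}".format(label, delta))
```

Only the first round shows a clear gain; the later changes are within the spread seen between repeated evaluations of the same model, so more evaluation episodes would be needed before drawing conclusions from them.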


In [17]:
%tensorflow_version 1.x
%load_ext tensorboard
%tensorboard --logdir logs


## Conclusion
A natural next step is further hyperparameter optimization, as described in the [PPO2 parameter documentation](https://stable-baselines.readthedocs.io/en/master/modules/ppo2.html#parameters): add and tweak parameters beyond the defaults, measure the resulting mean reward, and iterate. The default parameters are enough to get training running, as they were here.
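
A simple way to organize such a parameter search is a small grid over a few PPO2 arguments. The sketch below only generates the keyword-argument combinations; the candidate values are illustrative assumptions rather than tuned recommendations, and the helper name `parameter_grid` is hypothetical. The training and scoring calls are shown as comments because they depend on the notebook's `env`, `PPO2`, and `evaluate`.

```python
from itertools import product

# Candidate values are illustrative assumptions, not tuned recommendations.
search_space = {
    "gamma": [0.99, 0.999],
    "n_steps": [128, 2048],
    "ent_coef": [0.0, 0.01],
}


def parameter_grid(space):
    """Yield one keyword-argument dict per combination of candidate values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(parameter_grid(search_space))
for kwargs in configs:
    # In the notebook, each configuration would be trained and scored, e.g.:
    #   model = PPO2(MlpPolicy, env, verbose=0, **kwargs)
    #   model.learn(total_timesteps=50000)
    #   score = evaluate(model, num_steps=10000)
    print(kwargs)
```

A grid grows quickly with each added parameter (eight runs here), so for larger searches a dedicated tuning tool such as the RL Baselines Zoo listed in the references may be a better fit.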


## References
- https://nextgrid.ai/deep-learning-labs/
- https://stackoverflow.com/questions/50107530/how-to-render-openai-gym-in-google-colab
- https://stackoverflow.com/questions/63012499/tensorboard-in-google-colab-for-tensorflow-1-x
- https://stable-baselines.readthedocs.io/en/master/guide/tensorboard.html
- https://medium.com/data-from-the-trenches/choosing-a-deep-reinforcement-learning-library-890fb0307092
- https://araffin.github.io/post/sb3/
- https://gym.openai.com/envs/BipedalWalker-v2/
- https://arxiv.org/pdf/1707.06347.pdf
- https://github.com/araffin/rl-tutorial-jnrr19/blob/sb3/4_callbacks_hyperparameter_tuning.ipynb
- https://stable-baselines3.readthedocs.io/en/master/guide/examples.html
- https://github.com/DLR-RM/rl-baselines3-zoo
- https://opensourcelibs.com/lib/rl-baselines3-zoo
- https://towardsdatascience.com/elegantrl-a-lightweight-and-stable-deep-reinforcement-learning-library-95cef5f3460b
- https://github.com/mayurmadnani/BipedalWalker


## Demo
The demo clips are the videos recorded in the Result section. Each `record_video` call saves an `.mp4` file under `videos/` (for example `ppo2-walker-50000-step-0-to-step-1500.mp4`), and `show_videos` displays it inline in the notebook.
