# Midterm Assignment (Make a Bipedal Robot Walk) <a class="tocSkip">

It's midterm season, Wizards! The midterm is to make a bipedal humanoid robot walk in a simulation.
You can use OpenAI Gym for the environment; this [link](https://github.com/search?q=bipedal+gym) shows some potential solutions that you can use as references when you build your own. Submit your repository to schoolofaigrading@gmail.com. We will review your work and send back grades! We're looking for good documentation and readable code, with bonus points for using reinforcement learning in a novel way for this challenge. The due date for all midterms is October 29, 2018 at 12 PM PST. Good luck!

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#The-code" data-toc-modified-id="The-code-1">The code</a></span><ul class="toc-item"><li><span><a href="#imports" data-toc-modified-id="imports-1.1">imports</a></span></li><li><span><a href="#slider-widget-code" data-toc-modified-id="slider-widget-code-1.2">slider widget code</a></span></li><li><span><a href="#Hyperparameters-Class" data-toc-modified-id="Hyperparameters-Class-1.3">Hyperparameters Class</a></span></li><li><span><a href="#ARS-agent-classes" data-toc-modified-id="ARS-agent-classes-1.4">ARS agent classes</a></span></li><li><span><a href="#helper-functions" data-toc-modified-id="helper-functions-1.5">helper functions</a></span></li><li><span><a href="#Hyperparameter-sliders" data-toc-modified-id="Hyperparameter-sliders-1.6">Hyperparameter sliders</a></span></li><li><span><a href="#Train-the-agent" data-toc-modified-id="Train-the-agent-1.7">Train the agent</a></span></li></ul></li><li><span><a href="#Experimental-Results" data-toc-modified-id="Experimental-Results-2">Experimental Results</a></span></li><li><span><a href="#Here-is-a-video-of-one-of-the-best-results-achieved-by-training" data-toc-modified-id="Here-is-a-video-of-one-of-the-best-results-achieved-by-training-3">Here is a video of one of the best results achieved by training</a></span><ul class="toc-item"><li><span><a href="#Additional-Information" data-toc-modified-id="Additional-Information-3.1">Additional Information</a></span></li></ul></li></ul></div>

## The code

### imports

In [None]:
import os
import numpy as np
import gym
import ipywidgets as widgets

from gym import wrappers
from tqdm import tnrange, tqdm_notebook
from ipywidgets import interact, interactive
from IPython.display import display

ENV_NAME = 'BipedalWalker-v2'
hyperparameters = {}

### slider widget code

In [None]:
"""Code to create hyperparameter widgets (sliders)"""
seed = widgets.IntSlider(
    description='seed:', value=1946, min=0, max=3000, orientation='vertical')
num_episodes = widgets.IntSlider(
    description='num_episodes:', value=200, min=0, max=10000, step=10, orientation='vertical')
learning_rate = widgets.FloatSlider(
    description='learning_rate:', value=0.2, min=0.0, max=1.0, orientation='vertical')
record_every = widgets.IntSlider(
    description='record_every:', value=50, min=0, max=1000, step=10, orientation='vertical')
episode_length = widgets.IntSlider(
    description='episode_length:', value=2000, min=0, max=10000, step=100, orientation='vertical')
num_deltas = widgets.IntSlider(
    description='num_deltas:', value=16, min=0, max=100, orientation='vertical')
num_best_deltas = widgets.IntSlider(
    description='num_best_deltas:', value=16, min=0, max=100, orientation='vertical')
noise = widgets.FloatSlider(
    description='noise:', value=0.03, min=0.0, max=1.0, step=0.01, orientation='vertical')

ui_0 = widgets.HBox([
    seed, num_episodes, learning_rate, record_every, episode_length, num_deltas, num_best_deltas, noise])

def f_0(
    seed, num_episodes, learning_rate, record_every, episode_length, num_deltas,
    num_best_deltas, noise):
    hyperparameters["seed"] = seed
    hyperparameters["num_episodes"] = num_episodes
    hyperparameters["learning_rate"] = learning_rate  
    hyperparameters["record_every"] = record_every         
    hyperparameters["episode_length"] = episode_length        
    hyperparameters["num_deltas"] = num_deltas           
    hyperparameters["num_best_deltas"] = num_best_deltas
    hyperparameters["noise"] = noise   
    
out_0 = widgets.interactive_output(f_0, {
    'seed': seed, 'num_episodes': num_episodes, 
    'learning_rate': learning_rate, 'record_every': record_every,
    'episode_length': episode_length, 'num_deltas': num_deltas,
    'num_best_deltas': num_best_deltas, 'noise': noise
})

### Hyperparameters Class

In [None]:
class HyperParams():
    """Hyperparameters"""
    def __init__(self,
                 num_episodes=200,
                 episode_length=2000,
                 learning_rate=0.02,
                 num_deltas=16,
                 num_best_deltas=16,
                 noise=0.03,
                 seed=1,
                 env_name=ENV_NAME,
                 record_every=50):

        self.num_episodes = num_episodes
        self.episode_length = episode_length
        self.learning_rate = learning_rate
        self.num_deltas = num_deltas
        self.num_best_deltas = num_best_deltas
        assert self.num_best_deltas <= self.num_deltas
        self.noise = noise
        self.seed = seed
        self.env_name = env_name
        self.record_every = record_every

### ARS agent classes
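
The `Normalizer` class in this section keeps a running mean and variance for each observation dimension, using what is commonly called Welford's online update. As a rough illustration only, the scalar sketch below applies the same update to a short list of numbers and compares the result with the standard library. `RunningStats` and `samples` are hypothetical names that do not appear in the notebook code, and the sketch leaves out the variance clipping that `Normalizer.observe` applies.

```python
import statistics


class RunningStats:
    """Running mean and population variance of a stream of scalars."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def observe(self, x):
        self.n += 1
        last_mean = self.mean
        # Move the mean towards the new sample by 1/n of the gap.
        self.mean += (x - last_mean) / self.n
        # Uses both the old and the new mean, as Normalizer.observe does.
        self.m2 += (x - last_mean) * (x - self.mean)

    @property
    def variance(self):
        # Population variance (divide by n), matching mean_diff / n.
        return self.m2 / self.n


samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative data only
stats = RunningStats()
for x in samples:
    stats.observe(x)

print(stats.mean, statistics.mean(samples))
print(stats.variance, statistics.pvariance(samples))
```

The point of the online form is that the agent never has to store past observations: each new state updates the statistics in constant time.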

In [None]:


class Normalizer():
    # Normalizes the inputs using a running mean and variance of every state observed so far
    def __init__(self, nb_inputs):
        self.n = np.zeros(nb_inputs)
        self.mean = np.zeros(nb_inputs)
        self.mean_diff = np.zeros(nb_inputs)
        self.var = np.zeros(nb_inputs)

    def observe(self, x):
        self.n += 1.0
        last_mean = self.mean.copy()
        self.mean += (x - self.mean) / self.n
        self.mean_diff += (x - last_mean) * (x - self.mean)
        self.var = (self.mean_diff / self.n).clip(min = 1e-2)

    def normalize(self, inputs):
        obs_mean = self.mean
        obs_std = np.sqrt(self.var)
        return (inputs - obs_mean) / obs_std


class Policy():
    def __init__(self, input_size, output_size, hp, theta=None):
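        # `theta or ...` raises ValueError for a NumPy array, so test for None explicitly
        self.theta = np.zeros((output_size, input_size)) if theta is None else theta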
        self.hp = hp

    def evaluate(self, state, delta=None, direction=None):
        if direction is None:
            return self.theta.dot(state)
        elif direction == "+":
            return (self.theta + self.hp.noise * delta).dot(state)
        elif direction == "-":
            return (self.theta - self.hp.noise * delta).dot(state)

    def sample_deltas(self):
        return [np.random.randn(*self.theta.shape) for _ in range(self.hp.num_deltas)]

    def update(self, rollouts, sigma_rewards):
        # sigma_rewards is the standard deviation of the rewards
        step = np.zeros(self.theta.shape)
        for r_pos, r_neg, delta in rollouts:
            step += (r_pos - r_neg) * delta
        self.theta += self.hp.learning_rate / (self.hp.num_best_deltas * sigma_rewards) * step

    def get_theta(self):
        return self.theta
    
    def set_theta(self, new_theta):
        self.theta = new_theta
                

class ArsTrainer():
    def __init__(self,
                 hp=None,
                 input_size=None,
                 output_size=None,
                 normalizer=None,
                 policy=None,
                 monitor_dir=None):

        self.hp = hp or HyperParams()
        np.random.seed(self.hp.seed)
        self.env = gym.make(self.hp.env_name)
        if monitor_dir is not None:
            should_record = lambda i: self.record_video
            self.env = wrappers.Monitor(self.env, monitor_dir, video_callable=should_record, resume=True)
        self.hp.episode_length = self.env.spec.timestep_limit or self.hp.episode_length
        self.input_size = input_size or self.env.observation_space.shape[0]
        self.output_size = output_size or self.env.action_space.shape[0]
        self.normalizer = normalizer or Normalizer(self.input_size)
        self.policy = policy or Policy(self.input_size, self.output_size, self.hp)
        self.record_video = False
        
    def __del__(self):
        self.env.close()
        
    def save(self, filename='theta.npy'):
        print(f"Saving {filename}") 
        np.save(filename, self.policy.get_theta())
            
    def load(self, filename='theta.npy'):
        print(f"Loading {filename}")
        self.policy.set_theta(np.load(filename))
        
    def get_policy(self):
        return self.policy

    
    # Explore the policy on one specific direction and over one episode
    def explore(self, direction=None, delta=None):
        state = self.env.reset()
        done = False
        num_plays = 0.0
        sum_rewards = 0.0
        while not done and num_plays < self.hp.episode_length:
            self.normalizer.observe(state)
            state = self.normalizer.normalize(state)
            action = self.policy.evaluate(state, delta=delta, direction=direction)
            state, reward, done, _ = self.env.step(action)
            reward = max(min(reward, 1), -1)
            sum_rewards += reward
            num_plays += 1
        return sum_rewards

    def play(self):
        """Runs one episode with the current policy, recording video if a monitor is attached."""
        self.record_video = True
        self.explore()
        
    
    def train(self):
        """Trains the policy with ARS and returns the evaluation reward of each step."""
        best_reward, reward_evaluation = float("-inf"), float("-inf")
        
        # used by tqdm to give us a formatted progress bar
        desc = f"Current Reward: {reward_evaluation:>3.5f}, Best Reward: {best_reward:>3.5f}"
        t = tnrange(self.hp.num_episodes, desc=desc, leave=True)
        
        rewards = np.zeros((self.hp.num_episodes, 1))
        
        for step in t:
            #print(f"Step: {step}")
            # used by tqdm to give us a formatted progress bar            
            t.set_description(f"Current Reward: {reward_evaluation:>3.5f}, Best Reward: {best_reward:>3.5f}")
            t.refresh()
            
            # initialize the random noise deltas and the positive/negative rewards
            deltas = self.policy.sample_deltas()
            positive_rewards = [0] * self.hp.num_deltas
            negative_rewards = [0] * self.hp.num_deltas

            # play an episode each with positive deltas and negative deltas, collect rewards
            for k in range(self.hp.num_deltas):
                positive_rewards[k] = self.explore(direction="+", delta=deltas[k])
                negative_rewards[k] = self.explore(direction="-", delta=deltas[k])
                
            # Compute the standard deviation of all rewards
            sigma_rewards = np.array(positive_rewards + negative_rewards).std()

            # Sort the rollouts by the max(r_pos, r_neg) and select the deltas with best rewards
            scores = {k:max(r_pos, r_neg) for k,(r_pos,r_neg) in enumerate(zip(positive_rewards, negative_rewards))}
            order = sorted(scores.keys(), key = lambda x:scores[x], reverse = True)[:self.hp.num_best_deltas]
            rollouts = [(positive_rewards[k], negative_rewards[k], deltas[k]) for k in order]

            # Update the policy
            self.policy.update(rollouts, sigma_rewards)

            # Only record video during evaluation, every n steps
            if (step % self.hp.record_every == 0 and step > 0) or step == self.hp.num_episodes - 1:
                self.record_video = True
                
            # Play an episode with the new weights and print the score
            reward_evaluation = self.explore()
            rewards[step] = reward_evaluation
            
            best_reward = max(best_reward, reward_evaluation)
            self.record_video = False
            
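            # Stop early once the mean of the ten most recent evaluations reaches 300.
            # rewards is preallocated, so slice by step instead of from the end of the array;
            # entries for steps that never run stay at zero.
            if step >= 10:
                last_ten_mean = np.mean(rewards[step - 9:step + 1])
                if last_ten_mean >= 300.00:
                    print(f"Finished training early, the last_ten_mean is {last_ten_mean} "
                          f"after {step} steps")
                    break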
            
        return rewards
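
The training loop above follows the usual ARS pattern: sample random directions, evaluate the policy at `theta + noise * delta` and `theta - noise * delta`, keep the directions with the best rewards, and step along the reward-weighted sum of those directions, scaled by the standard deviation of the rewards. The sketch below replays those steps on a toy problem so the update can be inspected without a simulator. It is only an illustration under stated assumptions: `reward`, `perturb`, `ars_step`, and `target` are hypothetical stand-ins for an episode rollout, and plain Python lists replace the NumPy weight matrix.

```python
import random
import statistics


def reward(theta, target):
    """Toy stand-in for an episode: negative squared distance to a target."""
    return -sum((t - g) ** 2 for t, g in zip(theta, target))


def perturb(theta, delta, scale):
    """Return theta + scale * delta, element-wise."""
    return [t + scale * d for t, d in zip(theta, delta)]


def ars_step(theta, target, rng, num_deltas=8, num_best=4, noise=0.1, lr=0.1):
    """One ARS update with the same structure as the training loop."""
    deltas = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(num_deltas)]
    r_pos = [reward(perturb(theta, d, +noise), target) for d in deltas]
    r_neg = [reward(perturb(theta, d, -noise), target) for d in deltas]

    # Standard deviation of all 2 * num_deltas rewards, used to scale the step.
    sigma = statistics.pstdev(r_pos + r_neg)

    # Keep the directions whose better rollout scored highest.
    ranked = sorted(range(num_deltas),
                    key=lambda k: max(r_pos[k], r_neg[k]),
                    reverse=True)
    best = ranked[:num_best]

    # Reward-weighted sum of the selected directions.
    step = [0.0] * len(theta)
    for k in best:
        for i, d in enumerate(deltas[k]):
            step[i] += (r_pos[k] - r_neg[k]) * d

    scale = lr / (num_best * sigma)
    return [t + scale * s for t, s in zip(theta, step)]


rng = random.Random(0)          # fixed seed so the run is repeatable
target = [1.0, -2.0, 0.5]       # hypothetical optimum of the toy reward
theta = [0.0, 0.0, 0.0]
initial_reward = reward(theta, target)
for _ in range(50):
    theta = ars_step(theta, target, rng)
final_reward = reward(theta, target)
print(f"reward: {initial_reward:.3f} -> {final_reward:.3f}")
```

Dividing by the reward standard deviation keeps the step size roughly independent of the reward scale, which is one reason the same `learning_rate` can behave very differently once `noise` changes.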

### helper functions 

In [None]:
def mkdir(base, name):
    path = os.path.join(base, name)
    if not os.path.exists(path):
        os.makedirs(path)
    return path


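def train_agent(rewards_file='000.npy', continue_training=False):
    """Trains an ARS agent with the slider hyperparameters, then saves its weights and rewards."""
    print(hyperparameters)
    videos_dir = mkdir('.', 'videos')
    monitor_dir = mkdir(videos_dir, ENV_NAME)
    hp = HyperParams(**hyperparameters)
    trainer = ArsTrainer(hp=hp, monitor_dir=monitor_dir)

    # Resume from previously saved weights when requested
    if continue_training and os.path.exists('theta.npy'):
        trainer.load()

    rewards = trainer.train()

    # Save the weights so that play_agent() can load them later
    trainer.save()

    rewards_dir = mkdir('.', 'rewards')
    np.save(os.path.join(rewards_dir, rewards_file), rewards)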

    
def play_agent(filename='theta.npy'):
    videos_dir = mkdir('.', 'videos')
    monitor_dir = mkdir(videos_dir, ENV_NAME)    
    hp = HyperParams(**hyperparameters)
    trainer = ArsTrainer(hp=hp, monitor_dir=monitor_dir)
    trainer.load(filename=filename)
    policy = trainer.get_policy()
    print(f"Policy.theta:\n{policy.theta}")
    trainer.play()

### Hyperparameter sliders

In [8]:
display(ui_0, out_0)


In [9]:
# Let's just verify our hyperparameters
print(hyperparameters)

{'seed': 1946, 'num_episodes': 10, 'learning_rate': 0.2, 'record_every': 50, 'episode_length': 2000, 'num_deltas': 16, 'num_best_deltas': 16, 'noise': 0.03}
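
Printing the dictionary shows the values, but it does not catch settings that would break training. The sliders allow `record_every` and `num_deltas` to be set to 0, and `HyperParams` only asserts that `num_best_deltas <= num_deltas`. The helper below is a minimal sketch of a pre-flight check; `validate_hyperparameters` is a hypothetical function that is not part of the notebook, and the rules it applies are assumptions drawn from how the values are used in `ArsTrainer`.

```python
def validate_hyperparameters(params):
    """Return a list of problems found in a hyperparameter dict (empty if none)."""
    problems = []
    if params["num_deltas"] < 1:
        problems.append("num_deltas must be at least 1")
    if params["num_best_deltas"] < 1:
        problems.append("num_best_deltas must be at least 1")
    if params["num_best_deltas"] > params["num_deltas"]:
        problems.append("num_best_deltas must not exceed num_deltas")
    if params["record_every"] < 1:
        # record_every is used as a modulus, so 0 would raise ZeroDivisionError
        problems.append("record_every must be at least 1")
    if params["noise"] <= 0:
        # with zero noise the '+' and '-' policies equal the unperturbed policy
        problems.append("noise must be positive")
    if params["learning_rate"] <= 0:
        problems.append("learning_rate must be positive")
    return problems


# The values printed above
example = {
    'seed': 1946, 'num_episodes': 10, 'learning_rate': 0.2,
    'record_every': 50, 'episode_length': 2000, 'num_deltas': 16,
    'num_best_deltas': 16, 'noise': 0.03,
}
print(validate_hyperparameters(example))

# A deliberately broken variant for comparison
broken = dict(example, record_every=0, num_best_deltas=32)
print(validate_hyperparameters(broken))
```

Returning a list of messages rather than raising on the first failure makes it easier to fix several sliders in one pass.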


### Train the agent

In [6]:
train_agent(rewards_file='experiment_14.npy', continue_training=False)


{'seed': 1946, 'num_episodes': 10, 'learning_rate': 0.2, 'record_every': 50, 'episode_length': 2000, 'num_deltas': 16, 'num_best_deltas': 16, 'noise': 0.03}
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.






## Experimental Results

Across 12 experiments, **experiment 8** gave the highest best reward (281.37) within 200 episodes. All runs used seed 1946 and an episode length of 2000.

| Experiment | seed | num_episodes | learning_rate | episode_length | noise | best_reward |
|------------|------|--------------|---------------|----------------|-------|-------------|
| 1 | 1946 | 200 | 0.2 | 2000 | 0.03 | 4.08 |
| 2 | 1946 | 200 | 0.2 | 2000 | 0.00 | -1.25 |
| 3 | 1946 | 200 | 0.2 | 2000 | 0.90 | 9.63 |
| 4 | 1946 | 200 | 0.2 | 2000 | 0.90 | 263.60 |
| 5 | 1946 | 200 | 0.2 | 2000 | 0.90 | 263.57 |
| 6 | 1946 | 200 | 0.2 | 2000 | 0.50 | 2.57 |
| 7 | 1946 | 200 | 0.2 | 2000 | 0.40 | 3.17 |
| **8** | **1946** | **200** | **0.2** | **2000** | **0.25** | **281.37** |
| 9 | 1946 | 200 | 0.2 | 2000 | 0.22 | 248.47 |
| 10 | 1946 | 200 | 0.3 | 2000 | 0.22 | 190.21 |
| 11 | 1946 | 300 | 0.3 | 2000 | 0.25 | 276.58 |
| 12 | 1946 | 300 | 0.3 | 2000 | 0.10 | 38.40 |




## Here is a video of one of the best results achieved by training

<video controls src="./videos/BipedalWalker-v2/openaigym.video.9.22269.video004982.mp4"></video>