## HER - Hindsight Experience Replay

After successfully training the ant from a start position relatively close to the target (which works thanks to random successes), we observe that once the ant starts a little further from the target, it becomes hard to learn the long sequence of actions that reaches the goal, because random exploration almost never gets there.
We could instead define the reward as a function of the distance to the target, but that would give up reward sparsity and make the reward dense. We would like to reach the same target with a sparse reward (success or failure), and for that we can use HER.
With HER, we actually learn from unsuccessful episodes. The key insight that HER formalizes is what humans do intuitively: even though we have not succeeded at a specific goal, we have at least achieved a different one.
When the RL algorithm (DDPG in our case) fails to reach the goal, we pretend that the outcome it did achieve (in our case, a different target location) was the goal we wanted all along. If we repeat this process, we will eventually learn how to achieve arbitrary goals, including the goals we really want to achieve.
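
The relabeling idea can be illustrated with a small, self-contained sketch. This is not the `HerReplayBuffer` implementation from stable-baselines3; it is a simplified illustration of the `future` strategy under stated assumptions. The names `sparse_reward`, `relabel_future`, the tolerance `tol`, and the toy episode are all hypothetical and exist only for this example.

```python
import numpy as np


def sparse_reward(achieved_goal, desired_goal, tol=0.25):
    """Hypothetical sparse reward: 1.0 on arrival (within tol), otherwise 0.0."""
    dist = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal))
    return 1.0 if dist <= tol else 0.0


def relabel_future(episode, n_sampled_goal=4, rng=None):
    """Simplified 'future' relabeling.

    episode: list of dicts with keys 'achieved_goal' (position reached after
    the step) and 'desired_goal' (the real target).
    For every step we keep the original transition and add n_sampled_goal
    copies whose goal is a position reached at the same or a later step.
    """
    rng = np.random.default_rng() if rng is None else rng
    transitions = []
    for t, step in enumerate(episode):
        # Original transition with the real goal (usually reward 0 when failing).
        reward = sparse_reward(step["achieved_goal"], step["desired_goal"])
        transitions.append({**step, "reward": reward})
        for _ in range(n_sampled_goal):
            # Pretend a position reached from step t onwards was the goal.
            future_t = int(rng.integers(t, len(episode)))
            new_goal = episode[future_t]["achieved_goal"]
            reward = sparse_reward(step["achieved_goal"], new_goal)
            transitions.append({**step, "desired_goal": new_goal, "reward": reward})
    return transitions


# Toy failed episode: the ant walks along the x axis and never reaches (5, 5).
toy_episode = [
    {"achieved_goal": (1.0, 0.0), "desired_goal": (5.0, 5.0)},
    {"achieved_goal": (2.0, 0.0), "desired_goal": (5.0, 5.0)},
    {"achieved_goal": (3.0, 0.0), "desired_goal": (5.0, 5.0)},
]
toy_transitions = relabel_future(toy_episode, rng=np.random.default_rng(0))
print("transitions:", len(toy_transitions))
print("with positive reward:", sum(tr["reward"] > 0 for tr in toy_transitions))
```

Although the toy episode never reaches the real target, some of the relabeled transitions carry a positive reward, which gives the learner a signal while the reward itself stays sparse.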



In [60]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [61]:
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import sys
from stable_baselines3.common.monitor import Monitor
from stable_baselines3 import DDPG
from stable_baselines3 import HerReplayBuffer

sys.path.append('..')

import MazeEnv.MazeEnv as mz
from MazeEnv.MazeEnv import Rewards
from MazeEnv.Utils import *
from Evaluation import EvalAndSaveCallback
import Evaluation


import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda available: True


<h1> Reminder: batch size is important! </h1>

In [68]:
START_LOC = (5, 2.8)
TIMEOUT_STEPS = 300
BUFFER_SIZE = 1000 # smaller buffer for small task
TOTAL_TIME_STEPS = 10000
LEARNING_RATE = 0.001

REWARDS = Rewards(target_arrival=1, collision=-1, timeout=0)

EVAL_EPISODES=1
EVAL_FREQ=1000
VIDEO_FREQ=2


# HER parameters
N_SAMPLED = 4
STRATEGY = 'future' # future, final or episode
ONLINE_SAMPLING = True

In [80]:
# create environment :
tile_size = 0.1
maze_size = mz.MazeSize.SQUARE10
map_size = np.dot(maze_size, int(1 / tile_size))
maze_map = make_circular_map(map_size, 5 / tile_size)
# maze_map = np.zeros(map_size)

maze_env = Monitor(mz.MazeEnv(maze_size=maze_size,
                              maze_map=maze_map,
                              tile_size=tile_size,
                              start_loc=START_LOC,
                              target_loc=np.divide(maze_size, 2),
                              timeout_steps=TIMEOUT_STEPS,
                              show_gui=False,
                              rewards=REWARDS), 
                  filename="logs/DummyMaze/results")
_ = maze_env.reset()

# create separate evaluation environment:
eval_maze_env = Monitor(mz.MazeEnv(maze_size=maze_size,
                                   maze_map=maze_map,
                                   tile_size=tile_size,
                                   start_loc=START_LOC,
                                   target_loc=np.divide(maze_size, 2),
                                   timeout_steps=TIMEOUT_STEPS,
                                   show_gui=False,
                                   rewards=REWARDS)
                       )
_ = eval_maze_env.reset()
 
# create model:
model = DDPG(policy="MultiInputPolicy",
             env=maze_env,
             buffer_size=BUFFER_SIZE,
             learning_rate=LEARNING_RATE,
             device=device,  # use the device detected above instead of hard-coding 'cuda'
             train_freq=(1, "episode"),
             replay_buffer_class=HerReplayBuffer,
             replay_buffer_kwargs=dict(
                 n_sampled_goal=N_SAMPLED,
                 goal_selection_strategy=STRATEGY,
                 online_sampling=ONLINE_SAMPLING,
                 max_episode_length=TIMEOUT_STEPS,
             ),
             verbose=1)

# create callback for evaluation
callback = EvalAndSaveCallback(log_dir="logs/DummyMaze",
                               eval_env=eval_maze_env,
                               eval_freq=EVAL_FREQ,
                               eval_episodes=EVAL_EPISODES,
                               eval_video_freq=VIDEO_FREQ,
                               verbose=1)

Using cuda device
Wrapping the env in a DummyVecEnv.


In [None]:
start = time.time()

# clean all videos from the previous run
!rm -f logs/DummyMaze/*_steps.avi

torch.manual_seed(3295)
model.learn(total_timesteps=TOTAL_TIME_STEPS,
            callback=callback)
 
print("time", time.time() - start)

In [None]:
from IPython.display import Image
import ipyplot
import glob

# load best model 
best_model = DDPG.load("./logs/DummyMaze/best_model", env=maze_env)

Evaluation.plot_train_eval_results("logs/DummyMaze", EVAL_EPISODES)
episode_reward = Evaluation.record_model(best_model, eval_maze_env, "logs/DummyMaze/final.avi")
print("Reward:", episode_reward)

# clean previous gifs
!rm -f logs/DummyMaze/*_steps.gif

# Plot gifs
Evaluation.create_gifs_from_avi("logs/DummyMaze")
gifs = glob.glob("logs/DummyMaze/*.gif")
labels = [pth.split('/')[-1].split('.')[0] for pth in gifs]
ipyplot.plot_images(gifs, labels, img_width=250)