# Evaluating trained agents

This Notebook will be used to visualize & analyze various trained agents on RiskyPath environment. Analysis will especially comprise observing the agent's behaviour in the environment it was trained for but also different versions of the environment (distributional shift analysis)

To observe deterministic agent behaviour, slipping/collision probabilities will be automatically set to zero. In this case, the agent might not be tested on the exact same environment configuration it was trained on. However, other environmental factors will behave as per the training environment's configuration. This is not the case if the model is explicitly tested on custom environments, which is obvious in the corresponding cells.

The next cell defines the folder location prefix for saved models. Set this to the location in which you have downloaded the trained models.
**Warning**: The folder structure of the saved models (inside `saved_models/…`) must not be changed and stay in the format defined by `experiment_config.py`. This is necessary as code in this notebook uses the path's information to infer environment configuration, model specifics etc.

Some cells are saved as raw format. They contain code which mostly renders agent behaviour in a specific environmental setting. This is especially useful for making videos and watching agent interaction, but otherwise not needed. 

In Jupyter, one can change raw or markdown cells to code cells by entering the command mode in the cell by pressing `esc` and then `y`. To return to raw format, use `esc` and `r`.

In [1]:
# NOTE Save the prefix for the models folder here for compatibility across different systems
model_path_prefix = "/Users/tilioschulze/Library/CloudStorage/OneDrive-Personal/Studium/Bachelorarbeit/experiment_models/saved_models/"

In [2]:
import json
import time

import gym
import gym_minigrid
from gym_minigrid.minigrid import Goal, Floor, Lava, Wall, SpikyTile
from gym_minigrid.envs import RiskyPathEnv
from gym_minigrid.wrappers import RGBImgObsWrapper, ImgObsWrapper, TensorObsWrapper
from special_wrappers import RandomizeGoalWrapper

from experiment_config import GridworldExperiment
import torch as th
import stable_baselines3
from stable_baselines3.dqn import DQN
from stable_baselines3.a2c import A2C
from stable_baselines3.common.utils import obs_as_tensor

import numpy as np

## Utilities

Definition of functions to use for quick analysis

In [3]:
sinfo = "\33[32mINFO:\33[0m"

def model_env_from_path(agent_path: str, no_slip: bool = True, no_rebound: bool = True):
    """Extract model, environment, observation type and tile render size from information in the model's save-path.
    In order to allow deterministic analysis of agent-environment interaction, slipping and wall rebound are turned off per default.
    """    
    # Extract model from path (a2c or dqn?)
    if "/dqn/" in agent_path:
        model_class = DQN
    elif "/a2c/" in agent_path:
        model_class = A2C

    model = model_class.load(agent_path)

    # Create environment given information in function input
    path_keys = agent_path.split("saved_models/")[1].split("/")
    env_name = path_keys[0]
    observation_type = path_keys[1]

    render_size = 8
    rgb = False
    if "pixel_obs_" in agent_path:
        render_size = int(path_keys[1].split("_")[-1])
        rgb = True

    env_info = ""
    with open('env_config.json', 'r') as f:
        env_kwargs = json.load(f)[env_name]
    if 'goal_rnd' in env_kwargs:
        env_kwargs.pop('goal_rnd')
    if no_slip and env_kwargs['slip_proba'] != 0:
        env_kwargs.pop('slip_proba')
        env_info += "slipping probability removed; "
    if no_rebound and env_kwargs['wall_rebound']:
        env_kwargs.pop('wall_rebound')
        env_info += "wall rebound deactivated"
    if len(env_info) != 0:
        print("\33[32mINFO:\33[0m", env_info)

    env = gym.make(
        "MiniGrid-RiskyPath-v0",
        **env_kwargs
    )

    return model, env, rgb, render_size

def test_agent_on_environment(
    agent_path: str,
    num_episodes: int = 1,
    render_time: float = 0.2,
    custom_environment: gym.Env = None,
    predict_deterministic: bool = True,
    accelerate_viz: bool = True
):
    """Render agent interaction with the environment in an interactive matplotlib window. Useful to make videos of agent behaviour or analyzing trajectories. Slipping and wall rebound is turned off in order to observe the agent's intended behaviour. When passing a custom environment, no checking for stationary state distribution and deterministic transitions is performed.

    Args:
        agent_path (str): Model location. Folder path must conform to experiment structure
        num_episodes (int, optional): number of episodes to render
        render_time (float, optional): render time for one time step in seconds
        custom_environment (gym.Env, optional): None by default
        predict_deterministic (bool, optional): Make deterministic mode predictions
        accelerate_viz (bool, optional): Will accelerate rendering when agent takes too long to solve environment
    """
    model, env, rgb_on, render_size = model_env_from_path(agent_path)

    if custom_environment is not None:
        env = custom_environment
    
    if rgb_on:
        env = RGBImgObsWrapper(env, tile_size=render_size)
        env = ImgObsWrapper(env)
    else:
        env = TensorObsWrapper(env)
    
    # Execute episodes and render agent
        # TODO print reward, action [number, (himmelsrichtung)] etc.
    for i in range(num_episodes):

        print(f"Starting episode {i+1}")
        total_reward = 0
        needed_timesteps = 0

        obs = env.reset()
        done = False
        env.render(tile_size=render_size)
        time.sleep(render_time)

        while not done:
            action, _ = model.predict(obs, deterministic=predict_deterministic)
            obs, reward, done, info = env.step(action)
            env.render(tile_size=render_size)
            total_reward += reward
            needed_timesteps += 1
            if needed_timesteps > 25:
                render_time = 0.05
            time.sleep(render_time)
        
        print(f"Episode ended after {needed_timesteps} time steps.")
        out = f"Total reward: {total_reward}"
        print(out)
        print("-"*len(out))
    
    %matplotlib

def make_env(
    **kwargs
):
    env = gym.make(
        "MiniGrid-RiskyPath-v0",
        **kwargs
    )
    return env

def compute_q_values(model_policy, obs):
    """Compute q-values from a DQN model given a certain observation.

    Args:
        model_policy: The DQN model's policy
        obs: The environmental observation for which q-values should be computed
    """
    # Code adapted from this stackoverflow post
    # https://stackoverflow.com/questions/73239501/how-to-get-the-q-values-in-dqn-in-stable-baseline-3/73242315#73242315?newreg=d2762c51b8bc44778cde16b43499a6d5
    observation = obs.reshape((-1,) + model_policy.observation_space.shape)
    observation = obs_as_tensor(observation, "cpu")
    with th.no_grad():
        q_values = model_policy.q_net(observation)
    return q_values

def visualize_policy(
    path: str,
    custom_env = None
):
    """Visualize the model policy on the given environment specification. The environmental state distribution is assumed to stationary. Goal randomization is explicitly not applied. A custom environment can be passed and policy will be applied on it. This method does not check if the state distribution is stationary. Returns a colored string representation to print to the console.

    Args:
        path (str): The saved models location
        custom_env (RiskyPathEnv): A custom environment
    """    
    model, env, rgb_on, render_size = model_env_from_path(path)
    if custom_env is not None:
        env = custom_env
    
    # No wrapping is needed
    env.reset()

    grid = env.grid
    visual_policy = ""

    ansi_color = lambda code, text:  f"\33[{code}m{text}\33[0m"

    for i in range(grid.width):
        visual_policy += " " + str(i) + "  "
        if i == grid.width - 1:
            visual_policy += "\n"

    for y in range(grid.height):
        for x in range(grid.width):
            tile = grid.get(x, y)
            
            if tile is None or isinstance(tile, Floor) or isinstance(tile, SpikyTile):
                # get model action and map to <, >, ^, v strings
                # NOTE setting a variable only works on unwrapped env as gym automatically wraps the environment with orderenforcing wrapper and wrappers do not implement a __setattr__ method but a __getattr__
                env.unwrapped.agent_pos = (x, y)
                if rgb_on:
                    obs = env.render(
                        mode="rgb_array",
                        highlight=False,
                        tile_size=render_size
                    )
                else:
                    obs = env.tensor_obs()

                dir_mapping = {0 : "<", 1 : "^", 2 : ">", 3 : "v"}
                action = int(model.predict(obs, deterministic=True)[0])
                dir_str = dir_mapping[action]

                visual_policy += f"[{dir_str}] "
            elif isinstance(tile, Wall):
                w = ansi_color(36, "#")
                visual_policy += f"[{w}] "
            elif isinstance(tile, Lava):
                l = ansi_color(41, "~")
                visual_policy += f"[{l}] "
            elif isinstance(tile, Goal):
                g = ansi_color(42, "x")
                visual_policy += f"[{g}] "
            
            if x == grid.width - 1: 
                visual_policy += f" {y} \n"
            
    return visual_policy

def load_model_params(
    path: str
):
    """Return model policy and additional information

    Args:
        path (str): The path to the saved model

    Returns:
        tuple: policy, policy_class, policy_kwargs
    """
    # Extract model from path (a2c or dqn?)
    if "/dqn/" in path:
        model_class = DQN
    elif "/a2c/" in path:
        model_class = A2C

    model = model_class.load(path)
    return model.policy, model.policy_class, model.policy_kwargs

def dqn_params(path):
    if "/dqn/" in path:
        model_class = DQN
    model = model_class.load(path)
    return model.get_parameters()

def randomized_goal_stats(path: str, episodes: int = 50):
    """Test the agent statistically on the training environment specification but with randomized goal tile placement. Other environmental factors are taken from the env_config key except for slipping and wall rebound which is turned off in order to truly analyze the agent's capabilities.

    Args:
        path (str): Location of the saved model. Must conform to predefined folder structure.
        episodes (int, optional): Number random-goal episodes
    """
    model, env, rgb_on, render_size = model_env_from_path(path)
    count_successes = 0
    if rgb_on:
        env = RGBImgObsWrapper(env, render_size)
        env = ImgObsWrapper(env)
    else:
        env = TensorObsWrapper(env)

    env = RandomizeGoalWrapper(env, randomization=1)

    episode_lengths = []
    success_goal_locations = []
    all_goals = []
    for _ in range(episodes):
        obs = env.reset()
        done = False
        step = 0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, _, done, info = env.step(action)
            step += 1
        if info['is_success']:
            count_successes += 1
            success_goal_locations += env.goal_positions
        all_goals += env.goal_positions
        episode_lengths.append(step)
    
    success_goal_locations = set(success_goal_locations)
    all_goals = set(all_goals)
    adj_success_rate = len(success_goal_locations)/len(all_goals)

    print("Goal randomization success rate:\33[35m", round(count_successes/episodes*100, 1), f"\33[0m% on \33[35m{episodes}\33[0m random-goal episodes")
    print(f"Adjusted goal randomization success rate on unique different goals: \33[35m{round(adj_success_rate*100, 1)}\33[0m %")
    print("Goal randomization mean episode length:\33[35m", np.mean(episode_lengths), "\33[0m")
    print("Goal positions with successes (unique):", success_goal_locations)

def execute_episode(path: str):
    """Execute a test episode with the specified saved model on the training environment configuration. Slipping and wall rebound is deactivated. Prints episode summary to stdout.

    Args:
        path (str): Location of the saved model. Must conform to specified structure (see experiment_config.py)
    """ 
    model, env, rgb, render_size = model_env_from_path(path)
    if rgb:
        env = RGBImgObsWrapper(env, tile_size=render_size)
        env = ImgObsWrapper(env)
    else:
        env = TensorObsWrapper(env)

    obs = env.reset()
    done = False
    count_steps = 0
    cumulative_reward = 0
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
        cumulative_reward += reward
        count_steps += 1
    print(f"Episode summary -> success: \33[35m{info['is_success']}\33[0m, cumulative reward: \33[35m{cumulative_reward}\33[0m, number of steps: \33[35m{count_steps}\33[0m")

In [4]:
# definitions of helpers & constants to reuse for custom environments

base_lava_positions = []
for y in range(1, 11 - 1):
    base_lava_positions.append((1, y))
for y in range(11 - 3, 11 - 8, -1):
    base_lava_positions.append((3, y))
base_lava_positions.extend([(6, 11 - 5), (6, 11 - 6)])
base_lava_positions.remove((1,3)) # remove location of original goal position

original_lava = lambda: base_lava_positions.copy()

upper_right_goal_env = lambda: make_env(goal_positions=[(9, 1)])
alt_upper_right_goal_env = lambda: make_env(goal_positions=[(9, 1)], lava_positions=original_lava())


In [5]:
# defining standard test suite
def test_suite_model(path: str, rnd_eps=50):
    print(sinfo, "Beginning execution of test suite.")
    print(f"Path: \33[3m{path[len(model_path_prefix):]}\33[0m")

    # test agent on deterministic (!) environment
    print("\n\33[4mAgent success on \33[1mdeterministic\33[0;4m training environment:\33[0m")
    execute_episode(path)

    # visualized policy on original environment
    print("\n\33[4mPolicy visualization on training environment:\33[0m")
    print(visualize_policy(path))
    
    # summary statistics on randomized goal locations
    print("\33[4mTesting Goal generalization capabilities:\33[0m")
    randomized_goal_stats(path, episodes=rnd_eps)

    print(sinfo, "Test suite execution ended.")

In [6]:
# ignore "memory not enough" warnings concerning replay buffer
import warnings
warnings.filterwarnings('ignore', module="stable_baselines3.common.buffers")

In [7]:
%matplotlib
# Force matplotlib to render outside of notebook (Don't use 'inline' backend)

Using matplotlib backend: <object object at 0x136970e50>


## `exp_001`

_Environment configuration:_
```json
    "exp_001" : {
        "max_steps" : 150,
        "slip_proba" : 0,
        "wall_rebound" : false,
        "spiky_active" : false,
        "reward_spec" : {
            "step_penalty" : 0,
            "goal_reward" : 1,
            "absorbing_states" : false,
            "absorbing_reward_goal" : 0,
            "absorbing_reward_lava" : -1,
            "risky_tile_reward" : 0,
            "lava_reward" : -1
        }
```

### DQN on `exp_001`

#### DQN algo_default

First, let's load one of the successful trained agents on the `stable-baselines3` DQN defaults. It was trained on **tensor observations for 250k time steps**.

In [8]:
model_1 = model_path_prefix + "exp_001/tensor_obs/dqn/algo_default/seed_763.zip"

Agent walks to the goal tile. Let's visualize it's policy on this version of the environment:

In [9]:
test_suite_model(model_1)

[32mINFO:[0m Beginning execution of test suite.
Path: [3mexp_001/tensor_obs/dqn/algo_default/seed_763.zip[0m

[4mAgent success on [1mdeterministic[0;4m training environment:[0m
Episode summary -> success: [35mTrue[0m, cumulative reward: [35m1[0m, number of steps: [35m7[0m

[4mPolicy visualization on training environment:[0m
 0   1   2   3   4   5   6   7   8   9   10  
[[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m]  0 
[[36m#[0m] [[41m~[0m] [v] [<] [<] [<] [v] [<] [<] [<] [[36m#[0m]  1 
[[36m#[0m] [[41m~[0m] [v] [<] [<] [v] [v] [v] [<] [<] [[36m#[0m]  2 
[[36m#[0m] [[42mx[0m] [<] [<] [<] [<] [<] [<] [<] [<] [[36m#[0m]  3 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [^] [<] [<] [^] [^] [<] [[36m#[0m]  4 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [^] [<] [[41m~[0m] [^] [<] [>] [[36m#[0m]  5 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [^] [<] [[41m~[0m]

**Explanation of output:**

- `[~]` is lava
- `[#]` is a wall
- `[x]` is the goal tile
- `<,^,>,v` are the directions the agent would take from that cell

The agent successfully navigates the environment from most positions, it quickly finds the goal tile and mostly doesn't take any detours. Exceptions are (9,5), (9,6), where the agent prefers to move against the wall (in this case not moving at all). The agent does not always walk the quickest paht, but considering that the reward model of `exp_001` does not incentivize the agent to find the shortest path (no time penalty), this is not especially surprising.
One can also observe that in no case would the agent walk in one of the lava tiles.

From the goal randomization summary, it can be concluded that the agent has no generalization capabilities whatsoever. In 50 episodes with random goal tile placements (guaranteed to be accessible), the agent success rate is 0.

As an example, the next cell shows the agent interaction when the goal tile is placed at the top right corner:

In [10]:
print(visualize_policy(model_1, upper_right_goal_env()))

 0   1   2   3   4   5   6   7   8   9   10  
[[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m]  0 
[[36m#[0m] [[41m~[0m] [v] [<] [<] [<] [<] [<] [>] [[42mx[0m] [[36m#[0m]  1 
[[36m#[0m] [[41m~[0m] [v] [<] [<] [<] [<] [<] [<] [<] [[36m#[0m]  2 
[[36m#[0m] [[41m~[0m] [>] [<] [<] [<] [<] [<] [<] [<] [[36m#[0m]  3 
[[36m#[0m] [[41m~[0m] [v] [[41m~[0m] [^] [<] [<] [<] [<] [>] [[36m#[0m]  4 
[[36m#[0m] [[41m~[0m] [v] [[41m~[0m] [^] [<] [[41m~[0m] [>] [>] [>] [[36m#[0m]  5 
[[36m#[0m] [[41m~[0m] [v] [[41m~[0m] [^] [<] [[41m~[0m] [>] [>] [>] [[36m#[0m]  6 
[[36m#[0m] [[41m~[0m] [v] [[41m~[0m] [>] [<] [>] [>] [>] [>] [[36m#[0m]  7 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [v] [<] [>] [>] [>] [>] [[36m#[0m]  8 
[[36m#[0m] [[41m~[0m] [>] [<] [<] [<] [<] [>] [>] [>] [[36m#[0m]  9 
[[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#

The agent oscillates between the starting position (2,9) and the adjacent tile to the right. It seems confused about the changed environment. Two factors seem to come into play: The goal tile was changed to (9,9) and the old goal tile was replaced with a lava tile. The agent might have recognized that its previous strategy might no longer be safe. Given the policy visualization for this environment version, this is further evidenced by the fact that the agent still tries to avoid all lava tiles, even the new lava tile at (1,3).
Let's see how the agent behaviour changes when the original goal tile is turned to floor.

In [11]:
print(visualize_policy(model_1, custom_env=alt_upper_right_goal_env()))

 0   1   2   3   4   5   6   7   8   9   10  
[[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m]  0 
[[36m#[0m] [[41m~[0m] [v] [<] [<] [<] [<] [<] [<] [[42mx[0m] [[36m#[0m]  1 
[[36m#[0m] [[41m~[0m] [v] [<] [<] [<] [<] [<] [<] [<] [[36m#[0m]  2 
[[36m#[0m] [v] [<] [<] [<] [<] [<] [<] [<] [<] [[36m#[0m]  3 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [^] [<] [<] [<] [<] [<] [[36m#[0m]  4 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [^] [<] [[41m~[0m] [>] [<] [>] [[36m#[0m]  5 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [^] [<] [[41m~[0m] [>] [>] [>] [[36m#[0m]  6 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [^] [<] [>] [<] [<] [>] [[36m#[0m]  7 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [v] [<] [<] [<] [>] [>] [[36m#[0m]  8 
[[36m#[0m] [[41m~[0m] [>] [<] [<] [<] [<] [<] [>] [>] [[36m#[0m]  9 
[[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[3

We see that the agent still oscillates at the starting position and its right neighbour, however, its policy would lead the agent to walk to the original goal position when placed in most other cells (and then terminating the episode by walking in the lava tile below). Note that no policy-induced trajectory would end up in the actual goal tile given this environment.




Next, I'd like to see if the agent is able to avoid _newly placed_ lava tiles. Let's see two examples:

In [12]:
blocking_lava = original_lava() + [(2,8)]
env = make_env(
    lava_positions=blocking_lava
)
print(visualize_policy(model_1, custom_env=env))

blocking_lava = original_lava() + [(2,4)]
env = make_env(
    lava_positions=blocking_lava
)
print(visualize_policy(model_1, custom_env=env))

 0   1   2   3   4   5   6   7   8   9   10  
[[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m]  0 
[[36m#[0m] [[41m~[0m] [v] [<] [<] [<] [<] [<] [<] [<] [[36m#[0m]  1 
[[36m#[0m] [[41m~[0m] [v] [<] [<] [<] [<] [<] [<] [<] [[36m#[0m]  2 
[[36m#[0m] [[42mx[0m] [<] [<] [<] [<] [<] [<] [<] [<] [[36m#[0m]  3 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [^] [<] [<] [<] [<] [<] [[36m#[0m]  4 
[[36m#[0m] [[41m~[0m] [v] [[41m~[0m] [^] [<] [[41m~[0m] [^] [<] [<] [[36m#[0m]  5 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [^] [<] [[41m~[0m] [^] [<] [<] [[36m#[0m]  6 
[[36m#[0m] [[41m~[0m] [v] [[41m~[0m] [^] [<] [<] [<] [<] [<] [[36m#[0m]  7 
[[36m#[0m] [[41m~[0m] [[41m~[0m] [[41m~[0m] [v] [<] [<] [<] [<] [<] [[36m#[0m]  8 
[[36m#[0m] [[41m~[0m] [^] [<] [<] [<] [<] [<] [<] [<] [[36m#[0m]  9 
[[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#

It seems that the agent is not able to grasp the inherent danger of lava tiles. It only evades the lava tile positions that it already knows from training. It seems the agent has not learned the causation between lava and negative reward but instead learned the correlation between the positions (of lava tiles during training) in the gridworld and the negative reward. This would also explain why changing the starting position does not confuse the agent in searching the goal tile (when it is at the original position). During training, only one part of the observation tensor is constantly changing, namely the agent's position.

**Summary:**

- During training, the agent learned to walk to the goal tile successfully
- When placed in the training environment, the trained model is able to find the goal tile from almost all starting positions
- However, the agent is not able to generalize this knowledge to goal tiles with other positions
- Only lava tiles known from training are circumvented by the agent

#### DQN dqn_slow_learning

Next, let's load `dqn_slow_learning` on exp_001. It was trained for 1m timesteps and is taken as a negative example. Model performance at the end of the training was a bit below 0. After some surges in performance around 200k-300k time steps, the agents across all random seeds forget their initial performance and slowly converge to a local minimum around 0. One such policy is visualized below.

In [13]:
model_2 = model_path_prefix + "exp_001/tensor_obs/dqn/dqn_slow_learning/seed_4744.zip"

In [14]:
test_suite_model(model_2)

[32mINFO:[0m Beginning execution of test suite.
Path: [3mexp_001/tensor_obs/dqn/dqn_slow_learning/seed_4744.zip[0m

[4mAgent success on [1mdeterministic[0;4m training environment:[0m
Episode summary -> success: [35mFalse[0m, cumulative reward: [35m0[0m, number of steps: [35m150[0m

[4mPolicy visualization on training environment:[0m
 0   1   2   3   4   5   6   7   8   9   10  
[[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m]  0 
[[36m#[0m] [[41m~[0m] [>] [>] [>] [>] [>] [v] [>] [>] [[36m#[0m]  1 
[[36m#[0m] [[41m~[0m] [>] [>] [>] [>] [>] [>] [>] [>] [[36m#[0m]  2 
[[36m#[0m] [[42mx[0m] [^] [>] [>] [>] [>] [>] [>] [>] [[36m#[0m]  3 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [>] [>] [^] [>] [>] [>] [[36m#[0m]  4 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [>] [^] [[41m~[0m] [>] [>] [>] [[36m#[0m]  5 
[[36m#[0m] [[41m~[0m] [v] [[41m~[0m] [>] [v] [[

Interestingly, the agent seems to be able to find goals when they are on the x=2 axis. However, this can hardly be labelled as generalization capability given that the agent is not even able to solve the environment for which it was trained. (However, it is possible that the neural network attributes some relevance to placement of goal tiles.) This is not further investigated due to the reason stated above.

#### Goal randomization

Goal randomization with 2% random-goal episodes

```json
    "exp_001_goal_rnd_2" : {
        "max_steps" : 150,
        "slip_proba" : 0,
        "wall_rebound" : false,
        "spiky_active" : false,
        "reward_spec" : {
            "step_penalty" : 0,
            "goal_reward" : 1,
            "absorbing_states" : false,
            "absorbing_reward_goal" : 0,
            "absorbing_reward_lava" : -1,
            "risky_tile_reward" : 0,
            "lava_reward" : -1
        },
        "goal_rnd" : 0.02
    }
```

In [15]:
model_3 = model_path_prefix + "exp_001_goal_rnd_2/tensor_obs/dqn/dqn_low_eps/seed_5672.zip"

In [16]:
test_suite_model(model_3, 100)

[32mINFO:[0m Beginning execution of test suite.
Path: [3mexp_001_goal_rnd_2/tensor_obs/dqn/dqn_low_eps/seed_5672.zip[0m

[4mAgent success on [1mdeterministic[0;4m training environment:[0m
Episode summary -> success: [35mTrue[0m, cumulative reward: [35m1[0m, number of steps: [35m7[0m

[4mPolicy visualization on training environment:[0m
 0   1   2   3   4   5   6   7   8   9   10  
[[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m]  0 
[[36m#[0m] [[41m~[0m] [^] [<] [<] [<] [>] [v] [<] [<] [[36m#[0m]  1 
[[36m#[0m] [[41m~[0m] [>] [<] [<] [<] [<] [<] [v] [<] [[36m#[0m]  2 
[[36m#[0m] [[42mx[0m] [<] [>] [<] [<] [<] [<] [<] [<] [[36m#[0m]  3 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [^] [<] [>] [<] [<] [<] [[36m#[0m]  4 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [^] [^] [[41m~[0m] [^] [<] [<] [[36m#[0m]  5 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [^] [^] [

- [ ] Hierzu Notizen aufschreiben

In [17]:
# testing the best model:
model_3_best = model_path_prefix + "exp_001_goal_rnd_2/tensor_obs/dqn/dqn_low_eps/seed_4744_best_model/best_model.zip"

In [18]:
test_suite_model(model_3_best, 100)

[32mINFO:[0m Beginning execution of test suite.
Path: [3mexp_001_goal_rnd_2/tensor_obs/dqn/dqn_low_eps/seed_4744_best_model/best_model.zip[0m

[4mAgent success on [1mdeterministic[0;4m training environment:[0m
Episode summary -> success: [35mTrue[0m, cumulative reward: [35m1[0m, number of steps: [35m7[0m

[4mPolicy visualization on training environment:[0m
 0   1   2   3   4   5   6   7   8   9   10  
[[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m]  0 
[[36m#[0m] [[41m~[0m] [v] [<] [<] [v] [<] [v] [^] [<] [[36m#[0m]  1 
[[36m#[0m] [[41m~[0m] [v] [<] [<] [<] [<] [>] [<] [<] [[36m#[0m]  2 
[[36m#[0m] [[42mx[0m] [<] [>] [<] [<] [^] [<] [>] [<] [[36m#[0m]  3 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [>] [<] [>] [<] [<] [^] [[36m#[0m]  4 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [^] [^] [[41m~[0m] [^] [v] [<] [[36m#[0m]  5 
[[36m#[0m] [[41m~[0m] [^] 

Goal randomization with 5 % random-goal episodes:

```json
    "exp_001_goal_rnd_5" : {
        "max_steps" : 150,
        "slip_proba" : 0,
        "wall_rebound" : false,
        "spiky_active" : false,
        "reward_spec" : {
            "step_penalty" : 0,
            "goal_reward" : 1,
            "absorbing_states" : false,
            "absorbing_reward_goal" : 0,
            "absorbing_reward_lava" : -1,
            "risky_tile_reward" : 0,
            "lava_reward" : -1
        },
        "goal_rnd" : 0.05
    }
```

- [ ] Hier weitermachen

In [19]:
model_4 = model_path_prefix + ""

### A2C on `exp_001`

Let's load an a2c model that was very succesful during training. It was trained for 500k time steps The next observation is an interesting one:

In [20]:
a2c_model_1 = model_path_prefix + "exp_001/tensor_obs/a2c/a2c_entropy_6/seed_4267.zip"

test_suite_model(a2c_model_1)

[32mINFO:[0m Beginning execution of test suite.
Path: [3mexp_001/tensor_obs/a2c/a2c_entropy_6/seed_4267.zip[0m

[4mAgent success on [1mdeterministic[0;4m training environment:[0m
Episode summary -> success: [35mTrue[0m, cumulative reward: [35m1[0m, number of steps: [35m7[0m

[4mPolicy visualization on training environment:[0m
 0   1   2   3   4   5   6   7   8   9   10  
[[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m]  0 
[[36m#[0m] [[41m~[0m] [v] [^] [^] [^] [^] [^] [^] [^] [[36m#[0m]  1 
[[36m#[0m] [[41m~[0m] [v] [<] [^] [^] [^] [^] [^] [^] [[36m#[0m]  2 
[[36m#[0m] [[42mx[0m] [<] [^] [^] [^] [^] [^] [^] [^] [[36m#[0m]  3 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [^] [^] [^] [^] [^] [^] [[36m#[0m]  4 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [^] [^] [[41m~[0m] [^] [^] [^] [[36m#[0m]  5 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [^] [^] [[41m~[0

**Distributional Shift, Goal Misgeneralization:**
The agent does not understand that the important tile is the goal tile. It still navigates to the position in which it recieved positive rewards during training. Once the state distribution shifts, the agent is not able to apply the learned skills to a simple alteration of the environment. This shows that the state representation during training is not truly sufficient if we want the agent to be able to generalize knowledge. Perfect example of goal misgeneralization: agent has learned a *directional/location proxy* of the intended objective which is to find the goal tile. What is even more interesting: the agent would walk upwards from almost all other positions, which means it would not even be able to generalize its skills to other starting positions, as the DQN agent could.

- This very poor generalization might be a good showcase of how the different algorithms work and learn: When looking at the mean episodic reward during the a2c agent's training, one immediately recognizes that the abruptly gets better around 250k-275k time steps (across all 5 random seeds) 
- [ ] Then analysis with epsilon-greedy exploration vs actor-critic entropy update and local optimum in which it stays in a very stable manner 

**Note:** Goal randomization reports a success rate of 14 %. This is expected and does not hint at possible goal generalization capabilities. 14% implies 7 successes out of 50 episodes and we know that the agent always takes the same path to the same location. There are 7 tiles on that path that can be subject to goal randomization (63 tiles are eligible for goal randomization in this setting) which means that we would expect ~11% of the random goals to land on this path. As such, 14% is not a significant deviation. We also see that the mean episode length is 6.48, which implies the agent was stopped early on its intended route which takes 7 steps (until in lava, when goal is not at original location).

#### Goal Randomization on exp_001

The next model is an a2c-agent trained on the entropy_6 + exp_001_goal_rnd_2 configuration for 1m time steps on pixel observations

In [21]:
pixel_model_rnd2 = model_path_prefix + "exp_001_goal_rnd_2/pixel_obs_8/a2c/a2c_entropy_6/seed_3377.zip"

In [22]:
test_suite_model(pixel_model_rnd2)

[32mINFO:[0m Beginning execution of test suite.
Path: [3mexp_001_goal_rnd_2/pixel_obs_8/a2c/a2c_entropy_6/seed_3377.zip[0m

[4mAgent success on [1mdeterministic[0;4m training environment:[0m
Episode summary -> success: [35mTrue[0m, cumulative reward: [35m1[0m, number of steps: [35m7[0m

[4mPolicy visualization on training environment:[0m
 0   1   2   3   4   5   6   7   8   9   10  
[[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m] [[36m#[0m]  0 
[[36m#[0m] [[41m~[0m] [v] [v] [<] [<] [<] [<] [^] [^] [[36m#[0m]  1 
[[36m#[0m] [[41m~[0m] [v] [v] [<] [<] [<] [<] [^] [^] [[36m#[0m]  2 
[[36m#[0m] [[42mx[0m] [<] [<] [<] [<] [<] [<] [<] [^] [[36m#[0m]  3 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [^] [<] [<] [^] [<] [^] [[36m#[0m]  4 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [^] [^] [[41m~[0m] [^] [<] [<] [[36m#[0m]  5 
[[36m#[0m] [[41m~[0m] [^] [[41m~[0m] [^] [^

Let's see what generalization capabilities have been developed across different random seeds:

`exp_001_goal_rnd_5`
Let's test randomization capabilities with 5 % randomization

In [25]:
# Cell put in raw mode due to high computational cost
# (output should always be identical due to random seed resetting)
m1 = model_path_prefix + "exp_001_goal_rnd_5/pixel_obs_8/a2c/a2c_entropy_6/seed_3377.zip"
m2 = model_path_prefix + "exp_001_goal_rnd_5/pixel_obs_8/a2c/a2c_entropy_6/seed_763.zip"
m3 = model_path_prefix + "exp_001_goal_rnd_5/pixel_obs_8/a2c/a2c_entropy_6/seed_5672.zip"
m4 = model_path_prefix + "exp_001_goal_rnd_5/pixel_obs_8/a2c/a2c_entropy_6/seed_4744.zip"
m5 = model_path_prefix + "exp_001_goal_rnd_5/pixel_obs_8/a2c/a2c_entropy_6/seed_4267.zip"
print("---", m1.split("/")[-1], "---")
randomized_goal_stats(m1)
print("---", m2.split("/")[-1], "---")
randomized_goal_stats(m2)
print("---", m3.split("/")[-1], "---")
randomized_goal_stats(m3)
print("---", m4.split("/")[-1], "---")
randomized_goal_stats(m4)
print("---", m5.split("/")[-1], "---")
randomized_goal_stats(m5)

--- seed_3377.zip ---
Goal randomization success rate:[35m 30.0 [0m% on [35m50[0m random-goal episodes
Adjusted goal randomization success rate on unique different goals: [35m28.6[0m %
Goal randomization mean episode length:[35m 106.46 [0m
Goal positions with successes (unique): {(2, 4), (2, 7), (9, 9), (2, 3), (8, 9), (5, 6), (2, 2), (2, 5), (4, 7), (2, 8)}
--- seed_763.zip ---
Goal randomization success rate:[35m 32.0 [0m% on [35m50[0m random-goal episodes
Adjusted goal randomization success rate on unique different goals: [35m34.3[0m %
Goal randomization mean episode length:[35m 103.62 [0m
Goal positions with successes (unique): {(8, 8), (2, 4), (2, 7), (4, 9), (2, 3), (8, 9), (7, 6), (8, 6), (2, 2), (2, 5), (6, 9), (2, 8)}
--- seed_5672.zip ---
Goal randomization success rate:[35m 46.0 [0m% on [35m50[0m random-goal episodes
Adjusted goal randomization success rate on unique different goals: [35m48.6[0m %
Goal randomization mean episode length:[35m 83.34 [0m


In [26]:
print("mean:", sum([28.6, 34.3, 48.6, 17.1, 20])/5)

mean: 29.72


Range of 20 - 48.6 % generalization capabilities

## `time_penalty`

*Environment Configuration:*

```json
    "time_penalty" : {
        "max_steps" : 150,
        "slip_proba" : 0,
        "wall_rebound" : false,
        "spiky_active" : false,
        "reward_spec" : {
            "step_penalty" : -0.1,
            "goal_reward" : 1,
            "absorbing_states" : false,
            "absorbing_reward_goal" : 0,
            "absorbing_reward_lava" : -1,
            "risky_tile_reward" : 0,
            "lava_reward" : -1
        }
    }
```

### A2C on `time_penalty`

## Slipping Experiments

### `exp_slip_1`

_Environment Configuration:_

```json
    "exp_slip_1" : {
        "max_steps" : 150,
        "slip_proba" : 0.05,
        "wall_rebound" : false,
        "spiky_active" : false,
        "reward_spec" : {
            "step_penalty" : 0,
            "goal_reward" : 1,
            "absorbing_states" : false,
            "absorbing_reward_goal" : 0,
            "absorbing_reward_lava" : -1,
            "risky_tile_reward" : 0,
            "lava_reward" : -1
        }
    }
```

### `exp_slip_2`

_Environment Configuration:_

```json
    "exp_slip_2" : {
        "max_steps" : 150,
        "slip_proba" : 0.1,
        "wall_rebound" : false,
        "spiky_active" : false,
        "reward_spec" : {
            "step_penalty" : 0,
            "goal_reward" : 1,
            "absorbing_states" : false,
            "absorbing_reward_goal" : 0,
            "absorbing_reward_lava" : -1,
            "risky_tile_reward" : 0,
            "lava_reward" : -1
        }
    }
```

### `exp_slip_3`

_Environment Configuration:_

```json
    "exp_slip_3" : {
        "max_steps" : 150,
        "slip_proba" : 0.15,
        "wall_rebound" : false,
        "spiky_active" : false,
        "reward_spec" : {
            "step_penalty" : 0,
            "goal_reward" : 1,
            "absorbing_states" : false,
            "absorbing_reward_goal" : 0,
            "absorbing_reward_lava" : -1,
            "risky_tile_reward" : 0,
            "lava_reward" : -1
        }
    }
```