## Level 1: Easy Straight Line

#### **Important note:**

The Minecraft world is generated from a Microsoft Malmo mission XML template. Copy the mission template in this directory (`navigateDense.xml`) into the missions folder of the installed MineRL Python package, e.g. `~/anaconda3/envs/rltorch/lib/python3.7/site-packages/minerl/herobraine/env_specs/missions/`

MineRL will use this mission file to create the world.
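The copy step can also be scripted. The following is a minimal sketch, assuming the missions folder sits at `herobraine/env_specs/missions` under the package root, as in the path quoted above. The helper name `install_mission` is hypothetical and is not part of MineRL.

```python
import os
import shutil

# Sub-path taken from the note above, relative to the installed minerl package.
MISSIONS_SUBDIR = os.path.join("herobraine", "env_specs", "missions")


def install_mission(mission_xml, package_root):
    """Copy mission_xml into <package_root>/herobraine/env_specs/missions/.

    Hypothetical helper: returns the path of the copied file.
    """
    target_dir = os.path.join(package_root, MISSIONS_SUBDIR)
    if not os.path.isdir(target_dir):
        raise FileNotFoundError("missions directory not found: " + target_dir)
    return shutil.copy(mission_xml, target_dir)


# Usage (assumes minerl is importable in the active environment):
# import minerl
# install_mission("navigateDense.xml", os.path.dirname(minerl.__file__))
```

Resolving the package root from `minerl.__file__` avoids hard-coding the conda environment name or the Python version in the path.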

### Load Agent Environment Libraries

In [1]:
import gym
import minerl

from logging import getLogger
logger = getLogger(__name__)



### Load MineRL environment wrappers
* The MineRL Gym environment exposes its action and observation spaces as dictionary (`Dict`) spaces instead of `Discrete` spaces. We need a wrapper that maps the possible actions onto a discrete space.
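
To make the idea of mapping a dictionary action space onto a discrete one concrete, here is a simplified sketch. It is not the `SerialDiscreteActionWrapper` from `env_wrappers.py`. The key names are hypothetical, every key is treated as a binary button, and the real MineRL camera action is two-dimensional rather than a single number.

```python
from collections import OrderedDict


def build_discrete_actions(noop, exclude_keys=()):
    """Return a list of dict actions; the list index is the discrete action id."""
    actions = [OrderedDict(noop)]      # id 0: the no-op action
    for key in noop:
        if key in exclude_keys:
            continue                   # excluded keys always keep their no-op value
        action = OrderedDict(noop)
        action[key] = 1                # press exactly one binary key
        actions.append(action)
    return actions


# Hypothetical keys for illustration only.
noop = OrderedDict([("forward", 0), ("jump", 0), ("attack", 0), ("camera", 0)])
actions = build_discrete_actions(noop, exclude_keys=["camera"])

for action_id, action in enumerate(actions):
    print(action_id, dict(action))
```

An agent can then choose an integer in `range(len(actions))`, and a wrapper would look up the matching dictionary before passing it to the underlying environment.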

In [None]:
# Add the parent directory to sys.path to access env_wrappers.py
import sys
sys.path.insert(0,'..')

In [5]:
import chainerrl
from chainerrl.wrappers import ContinuingTimeLimit
from chainerrl.wrappers.atari_wrappers import FrameStack, ScaledFloatFrame

# Environment wrapper borrowed from minerl sample code: 
# https://github.com/minerllabs/baselines/tree/master/general/chainerrl
from env_wrappers import (
    SerialDiscreteActionWrapper, CombineActionWrapper, SerialDiscreteCombineActionWrapper,
    ContinuingTimeLimitMonitor,
    MoveAxisWrapper, FrameSkip, ObtainPoVWrapper, PoVWithCompassAngleWrapper, GrayScaleWrapper)


In [6]:
# Arguments for the wrapper
class Args:
    def __init__(self):
        self.frame_skip = None
        self.gray_scale = False
        self.env = 'MineRLNavigateDense'
        self.frame_stack = None
        self.disable_action_prior = False  # False = Discrete, True = CombineDiscrete
        self.monitor = False  # read by wrap_env when test=True; keeps video monitoring off
args = Args()

In [8]:
# This entire function is borrowed from the MineRL demo files:
# https://github.com/minerllabs/baselines/blob/master/general/chainerrl/baselines/ppo.py#L124
import os  # needed for os.path.join in the monitor branch

def wrap_env(env, test):

    if isinstance(env, gym.wrappers.TimeLimit):
        logger.info('Detected `gym.wrappers.TimeLimit`! Unwrap it and re-wrap our own time limit.')
        env = env.env
        max_episode_steps = env.spec.max_episode_steps
        env = ContinuingTimeLimit(env, max_episode_steps=max_episode_steps)

    # wrap env: observation...
    # NOTE: wrapping order matters!

    # Only reached when test=True; getattr guards against Args lacking `monitor`.
    if test and getattr(args, 'monitor', False):
        env = ContinuingTimeLimitMonitor(
            env, os.path.join(args.outdir, 'monitor'),
            mode='evaluation' if test else 'training', video_callable=lambda episode_id: True)
    if args.frame_skip is not None:
        env = FrameSkip(env, skip=args.frame_skip)
    if args.gray_scale:
        env = GrayScaleWrapper(env, dict_space_key='pov')
    if args.env.startswith('MineRLNavigate'):
        env = PoVWithCompassAngleWrapper(env)
    else:
        env = ObtainPoVWrapper(env)
    env = MoveAxisWrapper(env, source=-1, destination=0)  # convert hwc -> chw as Chainer requires.
    env = ScaledFloatFrame(env)
    if args.frame_stack is not None and args.frame_stack > 0:
        env = FrameStack(env, args.frame_stack, channel_order='chw')

    # wrap env: action...
    if not args.disable_action_prior:
        env = SerialDiscreteActionWrapper(
            env,
            always_keys=[], reverse_keys=[], exclude_keys=['camera'], exclude_noop=False)
    else:
        env = CombineActionWrapper(env)
        env = SerialDiscreteCombineActionWrapper(env)

    return env

### Load the environment

In [9]:
core_env = gym.make("MineRLNavigateDense-v0")  # A MineRLNavigateDense-v0 env

In [10]:
env = wrap_env(core_env, test=False)



In [17]:
# Initialize the environment to check that the mission XML is working
env.reset()
print('done')

done
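
A successful `reset()` only shows that the world loads. A slightly stronger check is to step the wrapped environment with random actions. The sketch below assumes the classic Gym API, where `step` returns `(observation, reward, done, info)` and `action_space.sample()` is available. `random_rollout` is a hypothetical helper, not part of MineRL or the wrappers.

```python
def random_rollout(env, max_steps=100):
    """Step env with random actions; return (steps_taken, total_reward)."""
    env.reset()
    total_reward = 0.0
    steps = 0
    for _ in range(max_steps):
        action = env.action_space.sample()           # random discrete action
        _obs, reward, done, _info = env.step(action)
        total_reward += reward
        steps += 1
        if done:                                     # episode ended early
            break
    return steps, total_reward


# Usage with the wrapped environment from this notebook:
# steps, total_reward = random_rollout(env, max_steps=50)
# print(steps, total_reward)
```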


### Define custom policy network 

Instead of the built-in policies, we will use a custom policy, because the default CNN policy is not configured for the current shape of the input observation pixels and rewards. The default MLP policy works but does not perform very well.

In [20]:
import tensorflow as tf
import numpy as np
from stable_baselines.a2c.utils import conv, linear, conv_to_fc
from stable_baselines.common.policies import FeedForwardPolicy

In [22]:
def modified_cnn(scaled_images, **kwargs):
    activ = tf.nn.relu
    layer_1 = activ(conv(scaled_images, 'c1', n_filters=32, filter_size=2, stride=1, init_scale=np.sqrt(2), **kwargs))
    layer_2 = activ(conv(layer_1, 'c2', n_filters=64, filter_size=2, stride=1, init_scale=np.sqrt(2), **kwargs))
    layer_3 = activ(conv(layer_2, 'c3', n_filters=64, filter_size=2, stride=1, init_scale=np.sqrt(2), **kwargs))
    layer_3 = conv_to_fc(layer_3)
    return activ(linear(layer_3, 'fc1', n_hidden=512, init_scale=np.sqrt(2)))

class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs, cnn_extractor=modified_cnn, feature_extraction="cnn")
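
One way to reason about the filter sizes chosen above is to track how an unpadded convolution shrinks each axis. The sketch below is an assumption-laden illustration: it assumes `'VALID'` (no) padding, and it assumes the default extractor is a Nature-DQN-style stack with filter/stride pairs (8, 4), (4, 2), (3, 1). Neither assumption is stated in this notebook. The axis length 4 is used only because it is the smallest axis of the observation space printed later.

```python
def valid_conv_out(size, filter_size, stride):
    """Output length along one axis for an unpadded ('VALID') convolution."""
    if size < filter_size:
        return 0                       # the filter does not fit on this axis
    return (size - filter_size) // stride + 1


def stack_out(size, layers):
    """Apply a list of (filter_size, stride) layers to one axis length."""
    for filter_size, stride in layers:
        size = valid_conv_out(size, filter_size, stride)
    return size


modified_layers = [(2, 1), (2, 1), (2, 1)]   # c1, c2, c3 in modified_cnn
nature_layers = [(8, 4), (4, 2), (3, 1)]     # assumed default-style stack

print(stack_out(4, modified_layers))    # 1
print(stack_out(4, nature_layers))      # 0
print(stack_out(96, modified_layers))   # 93
```

Under these assumptions, the 2x2 filters with stride 1 still produce a non-empty output on an axis of length 4, while an 8x8 first layer would not fit on that axis.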

### Define model

PPO Model by OpenAI: https://openai.com/blog/openai-baselines-ppo/

In [None]:
from stable_baselines import PPO2

In [24]:
# Rewards are logged in tensorboard
model = PPO2(policy=CustomPolicy, env=env, n_steps=64,
             verbose=1, tensorboard_log="./test_tensorboard/")

Wrapping the env in a DummyVecEnv.






Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
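
The `n_steps=64` setting fixes how many transitions are collected before each policy update. The arithmetic below is a rough sketch of what that implies for the training run in this notebook (`total_timesteps=300000`, `log_interval=100`). It assumes a single environment inside the `DummyVecEnv` and `nminibatches=4`, which is believed to be the PPO2 default but is not set explicitly here. `ppo2_schedule` is a hypothetical helper.

```python
def ppo2_schedule(total_timesteps, n_steps, n_envs=1,
                  nminibatches=4, log_interval=100):
    """Derive rollout and update counts from PPO2-style settings."""
    n_batch = n_envs * n_steps                    # transitions collected per update
    n_updates = total_timesteps // n_batch        # full policy updates that fit
    minibatch_size = n_batch // nminibatches      # samples per gradient step (assumed split)
    steps_between_logs = log_interval * n_batch   # timesteps between logged tables
    return n_batch, n_updates, minibatch_size, steps_between_logs


print(ppo2_schedule(total_timesteps=300000, n_steps=64))
# (64, 4687, 16, 6400)
```

The 6400 timesteps between logs matches the `total_timesteps` value in the second table of the training output.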




In [25]:
env.observation_space

Box(4, 96, 128)
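
The channel-first shape above is consistent with `MoveAxisWrapper(env, source=-1, destination=0)` in `wrap_env`, which moves the last axis to the front. The following pure-Python sketch shows that axis move on a shape tuple. The pre-wrap shape `(96, 128, 4)` is inferred by reversing the move and is an assumption, not something printed in this notebook.

```python
def move_axis_shape(shape, source, destination):
    """Return the shape after moving one axis, mirroring a single-axis moveaxis."""
    dims = list(shape)
    n = len(dims)
    axis_len = dims.pop(source % n)            # take out the source axis
    dims.insert(destination % n, axis_len)     # re-insert it at the destination
    return tuple(dims)


hwc_shape = (96, 128, 4)   # assumed height, width, channels before the wrapper
print(move_axis_shape(hwc_shape, source=-1, destination=0))
# (4, 96, 128)
```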

### Train model

In [26]:
# 300k steps ~ 1hr 30min

model.learn(total_timesteps=300000, log_interval=100)
model.save("level1.5_ppo2")


-------------------------------------
| approxkl           | 0.005878729  |
| clipfrac           | 0.0859375    |
| explained_variance | 0.00123      |
| fps                | 13           |
| n_updates          | 1            |
| policy_entropy     | 2.2969208    |
| policy_loss        | -0.031599294 |
| serial_timesteps   | 64           |
| time_elapsed       | 4.43e-05     |
| total_timesteps    | 64           |
| value_loss         | 673.185      |
-------------------------------------
------------------------------------
| approxkl           | 0.048252713 |
| clipfrac           | 0.5078125   |
| explained_variance | 0.898       |
| fps                | 25          |
| n_updates          | 100         |
| policy_entropy     | 1.2821951   |
| policy_loss        | 0.018677505 |
| serial_timesteps   | 6400        |
| time_elapsed       | 205         |
| total_timesteps    | 6400        |
| value_loss         | 51.58204    |
------------------------------------
------------------------