# Custom environment tutorial

This tutorial demonstrates how to create and use a custom environment in nnabla-rl.

## Preparation

Let's start by installing nnabla-rl and importing the packages required for training.

In [None]:
!pip install nnabla-rl

In [None]:
import nnabla as nn
from nnabla import functions as NF
from nnabla import parametric_functions as NPF
import nnabla.solvers as NS

import nnabla_rl
import nnabla_rl.algorithms as A
import nnabla_rl.hooks as H
from nnabla_rl.utils.evaluator import EpisodicEvaluator
from nnabla_rl.models.q_function import DiscreteQFunction
from nnabla_rl.builders import ModelBuilder, SolverBuilder
import nnabla_rl.functions as RF

## Understanding gym.Env

If you are not familiar with the gym library, the [gym documentation](https://gym.openai.com/docs/) will be helpful. Please read it before creating your own environment.

According to the [gym.Env implementation](https://github.com/openai/gym/blob/master/gym/core.py), gym.Env has the following five methods.

- `step(action)`: Run one timestep of the environment's dynamics. This method takes an action as its argument and should return next_state, reward, done, and info.

- `reset()`: Resets the environment to an initial state and returns an initial observation.

- `render()`: Renders the environment. (Optional)

- `close()`: Override close in your subclass to perform any necessary cleanup. (Optional)

- `seed()`: Sets the seed for this env's random number generator(s). (Optional)

In addition, there are three key attributes.

- `action_space`: The Space object corresponding to valid actions.

- `observation_space`: The Space object corresponding to valid observations.

- `reward_range`: A tuple corresponding to the min and max possible rewards. (Optional)

`action_space` and `observation_space` should be defined using [gym.spaces](https://github.com/openai/gym/tree/master/gym/spaces).

These methods and attributes determine how the environment works, so let's implement them!!
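
Before writing a real environment, it may help to see the `reset`/`step` contract in isolation. The sketch below is a minimal illustration in plain Python: `CountdownEnv` and `run_episode` are hypothetical names invented for this example, and the class deliberately does not subclass `gym.Env` or define any spaces, so it only demonstrates the call pattern described above.

```python
class CountdownEnv:
    """Hypothetical toy environment following the reset/step contract above.

    A real environment would subclass gym.Env and also define
    action_space and observation_space.
    """

    def __init__(self, start=3):
        self._start = start
        self._remaining = start

    def reset(self):
        # Restore the initial state and return the initial observation.
        self._remaining = self._start
        return self._remaining

    def step(self, action):
        # The action is ignored in this toy example; each step counts down by one.
        self._remaining -= 1
        done = self._remaining == 0
        reward = 1.0 if done else 0.0
        info = {}
        return self._remaining, reward, done, info


def run_episode(env, policy, max_steps=100):
    """Run one episode and return a list of (state, action, reward, next_state)."""
    transitions = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done, info = env.step(action)
        transitions.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return transitions


episode = run_episode(CountdownEnv(start=3), policy=lambda state: 0)
print(len(episode), sum(reward for _, _, reward, _ in episode))
```

Any object that exposes `reset()` and a four-value `step(action)` can be driven by a loop like `run_episode`, which is the same pattern a training algorithm uses when it interacts with an environment.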

## Creating a Simple Environment

As an example, we will create a simple environment called CliffEnv with the following settings.

<img src="./assets/CliffEnv.png" width="500">

- In this environment, the goal of the task is to reach the region where x >= 10.0 and 5.0 <= y <= 10.0.

- The state is continuous and has 2 dimensions (i.e., x and y).

- There are two discrete actions: up (y += 5) and right (x += 5).

- If the agent enters the cliff region (x > 5.0 and x < 10.0 and y > 0.0 and y < 5.0) or leaves the field (x < 0.0, y > 10.0, or y < 0.0), it receives a reward of -100 and the episode ends.

- If the agent reaches the goal (x >= 10.0 and y >= 5.0 and y <= 10.0), it receives a reward of 100 and the episode ends.

- On every other timestep, the agent receives a reward of -1.

- The initial state is x=2.5, y=2.5.

We can easily see that the optimal actions are \[ "up", "right", "right" \] and that the optimal score is 98 (-1 + -1 + 100).
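
The optimal score can also be checked by brute force. The following sketch re-implements only the rules listed above in plain Python (it does not use gym) and enumerates every action sequence of up to four steps. The helper names `outcome` and `rollout` and the four-step horizon are assumptions made for this illustration; the horizon is sufficient here because every path from the initial state ends within three steps.

```python
from itertools import product

START = (2.5, 2.5)
MOVES = {"up": (0.0, 5.0), "right": (5.0, 0.0)}


def outcome(x, y):
    """Return (reward, done) for a position, following the rules listed above."""
    in_cliff = 5.0 < x < 10.0 and 0.0 < y < 5.0
    out_of_field = x < 0.0 or y > 10.0 or y < 0.0
    if in_cliff or out_of_field:
        return -100, True
    if x >= 10.0 and 5.0 <= y <= 10.0:
        return 100, True
    return -1, False


def rollout(actions):
    """Return (total reward, steps taken, whether the episode ended)."""
    x, y = START
    total = 0
    for steps, name in enumerate(actions, start=1):
        dx, dy = MOVES[name]
        x, y = x + dx, y + dy
        reward, done = outcome(x, y)
        total += reward
        if done:
            return total, steps, True
    return total, len(actions), False


# Keep only sequences whose last action is the one that ends the episode.
finished = {}
for length in range(1, 5):
    for actions in product(MOVES, repeat=length):
        total, steps, done = rollout(actions)
        if done and steps == length:
            finished[actions] = total

best_actions = max(finished, key=finished.get)
best_return = finished[best_actions]
print(best_actions, best_return)
```

Only four distinct episodes exist under these rules, and the best of them is the "up", "right", "right" path with a total reward of 98.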


In [None]:
import gym
from gym import spaces
import numpy as np

class CliffEnv(gym.Env):
    def __init__(self):
        # action is defined as follows:
        # 0 = up, 1 = right
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Box(shape=(2,), low=-np.inf, high=np.inf, dtype=np.float32)
        # keep the state in the same dtype as the observation space
        self._state = np.array([2.5, 2.5], dtype=np.float32)

    def reset(self):
        self._state = np.array([2.5, 2.5], dtype=np.float32)
        # return a copy so that callers never share the internal state array
        return self._state.copy()

    def step(self, action):
        if action == 0:  # up (y+=5)
            self._state[1] += 5.
        elif action == 1:  # right (x+=5)
            self._state[0] += 5.
        else:
            raise ValueError(f"invalid action: {action}")

        x, y = self._state
        if (x > 5.0 and y < 5.0) or (x < 0.0) or (y > 10.0) or (y < 0.0):
            # fell into the cliff region or left the field
            done = True
            reward = -100
        elif x >= 10.0 and y >= 5.0 and y <= 10.0:
            # reached the goal
            done = True
            reward = 100
        else:
            done = False
            reward = -1

        info = {}
        # return a copy because self._state is updated in place on the next step
        return self._state.copy(), reward, done, info

After defining your own environment, it is a good idea to check that the implementation is correct by running the following code.

In [None]:
env = CliffEnv()

# first call reset and every internal state will be initialized
state = env.reset()
done = False

while not done:
    action = env.action_space.sample()  # random sample from the action space
    next_state, reward, done, info = env.step(action)
    print('next_state=', next_state, 'action=', action, 'reward=', reward, 'done=', done)
    if done:
        print("Episode is Done")
        break

## Applying nnabla-rl to an original environment

The environment is now ready for training!!\
Let's apply an nnabla-rl algorithm to the environment we created and train the agent!!

Define a Q function, a Q function builder, and a solver builder.

In [None]:
class CliffQFunction(DiscreteQFunction):
    def __init__(self, scope_name: str, n_action: int):
        super(CliffQFunction, self).__init__(scope_name)
        self._n_action = n_action

    def all_q(self, s: nn.Variable) -> nn.Variable:
        with nn.parameter_scope(self.scope_name):
            h = NF.tanh(NPF.affine(s, 64, name="affine-1"))
            h = NF.tanh(NPF.affine(h, 64, name="affine-2"))
            q = NPF.affine(h, self._n_action, name="pred-q")
        return q

class CliffQFunctionBuilder(ModelBuilder[DiscreteQFunction]):
    def build_model(self, scope_name, env_info, algorithm_config, **kwargs):
        return CliffQFunction(scope_name, env_info.action_dim)

class CliffSolverBuilder(SolverBuilder):
    def build_solver(self,  # type: ignore[override]
                     env_info,
                     algorithm_config,
                     **kwargs):
        return NS.Adam(alpha=algorithm_config.learning_rate)
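
To make the shape of this network concrete, here is a rough NumPy sketch of the computation that `all_q` describes: two tanh layers of 64 units followed by a linear layer with one output per action. The weights below are random placeholders chosen only for illustration; they are not nnabla's initial values, and nothing here is trained.

```python
import numpy as np

STATE_DIM, HIDDEN_DIM, N_ACTION = 2, 64, 2
rng = np.random.default_rng(0)

# Illustrative random parameters; nnabla initialises and trains its own.
params = {
    "affine-1": (rng.normal(scale=0.1, size=(STATE_DIM, HIDDEN_DIM)), np.zeros(HIDDEN_DIM)),
    "affine-2": (rng.normal(scale=0.1, size=(HIDDEN_DIM, HIDDEN_DIM)), np.zeros(HIDDEN_DIM)),
    "pred-q": (rng.normal(scale=0.1, size=(HIDDEN_DIM, N_ACTION)), np.zeros(N_ACTION)),
}


def all_q(states):
    """Map a (batch, 2) array of states to a (batch, 2) array of Q-values."""
    w1, b1 = params["affine-1"]
    w2, b2 = params["affine-2"]
    w3, b3 = params["pred-q"]
    h = np.tanh(states @ w1 + b1)
    h = np.tanh(h @ w2 + b2)
    return h @ w3 + b3


states = np.array([[2.5, 2.5], [2.5, 7.5]])
q_values = all_q(states)
# A greedy policy picks the action with the largest Q-value for each state.
greedy_actions = q_values.argmax(axis=1)
print(q_values.shape, greedy_actions.shape)
```

Each row of `q_values` holds one value per action (index 0 = up, index 1 = right), which is why the last layer's width is `n_action`.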

Instantiate your environment and run the training!!

In [None]:
train_env = CliffEnv()
eval_env = CliffEnv()

iteration_num_hook = H.IterationNumHook(timing=100)
evaluator = EpisodicEvaluator(run_per_evaluation=10)
evaluation_hook = H.EvaluationHook(eval_env, evaluator, timing=100)
total_timesteps = 10000

config = A.DQNConfig(
    gpu_id=0,
    gamma=0.99,
    learning_rate=1e-5,
    batch_size=32,
    learner_update_frequency=1,
    target_update_frequency=1000,
    start_timesteps=1000,
    replay_buffer_size=1000,
    max_explore_steps=10000,
    initial_epsilon=1.0,
    final_epsilon=0.0,
    test_epsilon=0.0,
)

dqn = A.DQN(train_env, config=config, q_func_builder=CliffQFunctionBuilder(),
            q_solver_builder=CliffSolverBuilder())

hooks = [iteration_num_hook, evaluation_hook]
dqn.set_hooks(hooks)

dqn.train_online(train_env, total_iterations=total_timesteps)

eval_env.close()
train_env.close()
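
The exploration settings in the config above (`initial_epsilon`, `final_epsilon`, `max_explore_steps`) can be read as a schedule for epsilon-greedy exploration. The sketch below assumes a simple linear decay and ignores the initial `start_timesteps` warm-up; `linear_epsilon` is a hypothetical helper written for this illustration, not a function from nnabla-rl, and the library's exact schedule may differ.

```python
def linear_epsilon(step, initial_epsilon=1.0, final_epsilon=0.0, max_explore_steps=10000):
    """Linearly interpolate epsilon from initial to final over max_explore_steps."""
    fraction = min(step / max_explore_steps, 1.0)
    return initial_epsilon + fraction * (final_epsilon - initial_epsilon)


schedule = {step: linear_epsilon(step) for step in (0, 2500, 5000, 10000, 20000)}
for step, epsilon in schedule.items():
    print(step, epsilon)
```

Under this assumption the agent acts fully at random at the start, explores half of the time at step 5000, and acts greedily by the end of the 10000 training steps, while evaluation always runs greedily because `test_epsilon` is 0.0.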

We can see that the agent achieves a score of 98 in the evaluation environment!! That means we solved the task. Congratulations!!