# AI-LAB SESSION 4: Tutorial

In this tutorial we will look at some additional functionality offered by OpenAI Gym environments.

## Cliff environment

The environment used is **Cliff** (taken from Sutton and Barto's book and shown in the figure below)
![CliffWalking](images/cliff.png)

The agent starts in cell $(3, 0)$ and has to reach the goal in $(3, 11)$. Falling off the cliff resets the position to the start state (the episode ends only when the goal state is reached). All other cells are safe. The action dynamics are deterministic, meaning that the agent always reaches the intended next state (although the agent does not have access to this information).

In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

import gym
import envs
import numpy as np
from utils.funcs import run_episode, plot

env = gym.make("Cliff-v0")
env.render()

o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
x  C  C  C  C  C  C  C  C  C  C  T



The cell types are the following:
* *x* - Start position
* *o* - Safe
* *C* - Cliff
* *T* - Goal

Rewards:
- <span style="color:orange">-1</span> for each "safe" cell (o)
- <span style="color:red">-100</span> for falling from the cliff (C)

In addition to the functionalities of the environments you have been using in the previous sessions, there are also a few more:
- *step(action)*: the agent performs *action* from the current state. Returns a tuple *(new_state, reward, done, info)* where:
    - *new_state*: is the new state reached as a consequence of the agent's last action
    - *reward*: the reward obtained by the agent
    - *done*: `True` if the episode has ended, `False` otherwise
    - *info*: not used, you can safely discard it
- *reset()*: the environment is reset and the agent goes back to the starting position. Returns the initial state id

In [2]:
state = env.reset()
env.step(0)  # Go UP

(24, -1, False, {'prob': 1.0})
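
The returned state id 24 is consistent with a row-major numbering of the 4x12 grid, in which the start cell $(3, 0)$ has id 36 and the cell above it, $(2, 0)$, has id 24. This numbering is an assumption inferred from the output above, not something the environment documents here. Under that assumption, a minimal sketch of the conversion (the helper names `to_state` and `to_cell` are hypothetical) could look like this:

```python
N_COLS = 12  # grid width, read off the rendered map above


def to_state(row, col):
    """Cell (row, col) -> state id, ASSUMING row-major numbering."""
    return row * N_COLS + col


def to_cell(state):
    """State id -> cell (row, col); inverse of to_state."""
    return divmod(state, N_COLS)


print(to_state(3, 0))  # start cell -> 36
print(to_state(2, 0))  # cell above the start -> 24
print(to_cell(24))     # -> (2, 0)
```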

### Disclaimer

The environment is not known a priori, so the following properties and methods are not available:
* *T(s, a, s')*: no transition matrix
* *R(s, a, s')*: no reward matrix

The action ids are different from the previous environments:

In [3]:
env.actions

{0: 'U', 1: 'R', 2: 'D', 3: 'L'}
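
To make the `step`/`reset` interface and the action ids concrete, here is a small stand-in written only from the description in this tutorial. It is not the code behind `Cliff-v0`: the class `ToyCliff` is hypothetical, and the row-major state ids, the behaviour at the grid borders and the reward on reaching the goal are all assumptions.

```python
class ToyCliff:
    """Hypothetical stand-in for Cliff-v0, built from the tutorial's description."""

    ROWS, COLS = 4, 12
    START, GOAL = (3, 0), (3, 11)
    # Same ids as env.actions: 0 = U, 1 = R, 2 = D, 3 = L
    MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}

    def __init__(self):
        self.pos = self.START

    def _state_id(self):
        # ASSUMPTION: states are numbered row by row (row-major)
        return self.pos[0] * self.COLS + self.pos[1]

    def reset(self):
        self.pos = self.START
        return self._state_id()

    def step(self, action):
        d_row, d_col = self.MOVES[action]
        # ASSUMPTION: a move that would leave the grid keeps the agent in place
        row = min(max(self.pos[0] + d_row, 0), self.ROWS - 1)
        col = min(max(self.pos[1] + d_col, 0), self.COLS - 1)
        self.pos = (row, col)
        if row == 3 and 0 < col < 11:
            # Cliff cell: back to the start, the episode does NOT end
            self.pos = self.START
            return self._state_id(), -100, False, {}
        # ASSUMPTION: the step that reaches the goal also costs -1
        return self._state_id(), -1, self.pos == self.GOAL, {}


toy = ToyCliff()
toy.reset()
print(toy.step(0))  # U: (24, -1, False, {})
print(toy.step(1))  # R: (25, -1, False, {})
print(toy.step(2))  # D, into the cliff: (36, -100, False, {})
```

The last call shows the behaviour described earlier: falling off the cliff costs -100 and sends the agent back to the start without ending the episode.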

Suppose we want to execute a random policy in the environment. We create the policy as usual (one random action id per state), reset the environment to its initial state and set a maximum number of steps for the episode

In [4]:
policy = np.random.choice(env.action_space.n, env.observation_space.n)
state = env.reset()
ep_limit = 20
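
For reference, `np.random.choice(n, size)` with an integer `n` draws `size` values uniformly from `0 .. n-1`, so the policy is simply an array holding one action id per state. The sketch below does not need the environment; the sizes 48 (a 4x12 grid) and 4 (the actions U, R, D, L) are assumptions standing in for `env.observation_space.n` and `env.action_space.n`.

```python
import numpy as np

n_states, n_actions = 48, 4  # ASSUMED sizes: 4x12 grid, actions U/R/D/L

np.random.seed(0)  # only to make the sketch reproducible
example_policy = np.random.choice(n_actions, n_states)

print(example_policy.shape)  # (48,)
print(example_policy[36])    # action id the policy picks in state 36
```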

Then we run a loop that, at each iteration, performs one step using the action the policy prescribes for the current state

In [5]:
el = 0
total_reward = 0

# Episode execution loop
for _ in range(ep_limit):
    next_state, reward, done, info = env.step(policy[state])  # Execute a step
    total_reward += reward
    el += 1  # Episode length so far
    if done:  # The episode has ended (the for loop already enforces ep_limit)
        break
    state = next_state
    
total_reward

-515
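
The loop above can be wrapped in a function so that it is easy to reuse. The sketch below is one possible way to do so; the name `rollout` is hypothetical and this is not the `run_episode` helper imported from `utils.funcs`, whose signature is not shown here. It only assumes an environment exposing `reset()` and a `step(action)` that returns the 4-tuple described earlier. To keep the example self-contained it is exercised on a tiny stub environment (`Corridor`, also hypothetical) rather than on `Cliff-v0`.

```python
def rollout(env, policy, ep_limit):
    """Run one episode of at most ep_limit steps.

    Returns (total_reward, episode_length).
    """
    state = env.reset()
    total_reward, length = 0, 0
    for _ in range(ep_limit):
        state, reward, done, _info = env.step(policy[state])
        total_reward += reward
        length += 1
        if done:  # terminal state reached before the step limit
            break
    return total_reward, length


class Corridor:
    """Hypothetical 3-state stub: action 1 moves right, state 2 is terminal."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if action == 1:
            self.state += 1
        return self.state, -1, self.state == 2, {}


print(rollout(Corridor(), [1, 1, 1], ep_limit=20))  # (-2, 2)
print(rollout(Corridor(), [0, 0, 0], ep_limit=5))   # (-5, 5)
```

The second call never reaches the terminal state, so the episode is cut off by `ep_limit`, just like the 20-step limit used above.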

A useful operation that can be done with numpy is dividing two $n$-dimensional arrays element-wise only at the positions where the denominator is $\neq 0$. For example

In [6]:
a = np.asarray([[10, 10, 8], [10, 10, 20]], dtype="float16")
b = np.asarray([[2, 2, 0], [2, 2, 0]], dtype="float16")
print("Array A:\n", a)
print("\nArray B:\n", b)
np.divide(a, b, out=a, where=b != 0)
print("\nDivision result:\n", a)

Array A:
 [[10. 10.  8.]
 [10. 10. 20.]]

Array B:
 [[2. 2. 0.]
 [2. 2. 0.]]

Division result:
 [[ 5.  5.  8.]
 [ 5.  5. 20.]]


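Notice that the result of the operation is stored in `a` itself, so its original values are overwritten. Where `b` is 0 no division takes place and the corresponding entry of `a` is left unchanged (8 and 20 in the example)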
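
If the numerator has to be preserved, or if the skipped positions should hold a known value instead of the old numerator, a fresh array can be passed as `out`. This is a minimal sketch of that variant; the names `num`, `den` and `safe`, and the choice of 0 as the fill value, are illustrative assumptions.

```python
import numpy as np

num = np.asarray([[10, 10, 8], [10, 10, 20]], dtype="float64")
den = np.asarray([[2, 2, 0], [2, 2, 0]], dtype="float64")

# Positions where den == 0 are skipped and keep the value already in `out` (0 here)
safe = np.divide(num, den, out=np.zeros_like(num), where=den != 0)

print(safe)  # quotients where den != 0, zeros elsewhere
print(num)   # the numerator is untouched
```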