In [2]:
import gym
from lib.src.rl.utils.utils import Utils
from lib.src.rl.utils.visualize import Visualization as Vis
from lib.src.rl.agents.q_learning import QLearningAgent
from lib.src.rl.agents.value_iteration import ValueIterationAgent

# CliffWalking-v0

![Cliff Walking by OpenAI Gym](./img/cliff_walking.gif)

## State space

- A $4 \times 12$ grid world
- Positions are represented as a flattened (row-major) index
  - For example, the starting point (3, 0) is represented as $3 \times 12 + 0 = 36$
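
The conversion between grid positions and flattened indices can be sketched with NumPy. The helper names `to_index` and `to_position` are made up for this illustration and are not part of the environment or of `lib`.

```python
import numpy as np

GRID_SHAPE = (4, 12)  # (rows, columns) of the CliffWalking grid


def to_index(row, col, shape=GRID_SHAPE):
    """Flatten a (row, col) position into a single state index (row-major)."""
    return int(np.ravel_multi_index((row, col), shape))


def to_position(index, shape=GRID_SHAPE):
    """Recover the (row, col) position from a flattened state index."""
    row, col = np.unravel_index(index, shape)
    return int(row), int(col)


start = to_index(3, 0)      # 3 * 12 + 0
goal = to_index(3, 11)      # 3 * 12 + 11
print(start, goal, to_position(start))
```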

## Action space
- 0: Move up
- 1: Move right
- 2: Move down
- 3: Move left
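
As a rough sketch of what the four action ids mean on the grid, each one can be mapped to a row/column offset. `ACTION_DELTAS` and `move` are hypothetical names used only here. The sketch covers movement and the grid boundary only: it ignores the cliff, the rewards, and episode termination, which the real environment handles.

```python
# Row/column offset for each action id listed above (rows grow downward).
ACTION_DELTAS = {
    0: (-1, 0),  # up
    1: (0, 1),   # right
    2: (1, 0),   # down
    3: (0, -1),  # left
}


def move(row, col, action, shape=(4, 12)):
    """Apply an action and clip the result so the agent stays on the grid."""
    d_row, d_col = ACTION_DELTAS[action]
    new_row = min(max(row + d_row, 0), shape[0] - 1)
    new_col = min(max(col + d_col, 0), shape[1] - 1)
    return new_row, new_col


print(move(3, 0, 0))  # moving up from the starting point
print(move(3, 0, 3))  # moving left from the starting point hits the wall
```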

## Initialize environment

In [7]:
env = gym.make("CliffWalking-v0")
env.reset()

(36, {'prob': 1})

## Value Iteration Agent Demo
- Set the stopping criterion $\theta = 0.001$
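
To show how a threshold such as $\theta$ is typically used, here is a minimal, generic value iteration sketch. It is not the code inside `ValueIterationAgent`. It assumes a transition table `P[state][action]` holding `(probability, next_state, reward, done)` tuples, and the three-state chain below is invented purely for the example.

```python
import numpy as np


def value_iteration(P, num_states, num_actions, theta=0.001, gamma=1.0):
    """Sweep over all states until the largest value change is below theta."""

    def action_value(V, s, a):
        # Expected return of taking action a in state s under the model P.
        return sum(
            p * (r + (0.0 if done else gamma * V[s2]))
            for p, s2, r, done in P[s][a]
        )

    V = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            best = max(action_value(V, s, a) for a in range(num_actions))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break

    # Greedy policy with respect to the converged state values.
    policy = [
        int(np.argmax([action_value(V, s, a) for a in range(num_actions)]))
        for s in range(num_states)
    ]
    return V, policy


# Toy chain (illustrative only): states 0 and 1 are regular, state 2 is terminal.
# Action 0 moves left, action 1 moves right; every step costs -1.
P = {
    0: {0: [(1.0, 0, -1.0, False)], 1: [(1.0, 1, -1.0, False)]},
    1: {0: [(1.0, 0, -1.0, False)], 1: [(1.0, 2, -1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}
V, policy = value_iteration(P, num_states=3, num_actions=2, theta=0.001)
print(V, policy)
```

A smaller $\theta$ means more sweeps before stopping and a tighter approximation of the state values.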

In [8]:
v = ValueIterationAgent(env, "CliffWalking-v0")
policy = v.value_iteration(0.001)

### Visualize the result

In [9]:
Vis.visualize("CliffWalking-v0", env.shape, v.policy)

<╔════════════╗
 ║→→→→→→→→→→→↓║
 ║→→→→→→→→→→→↓║
 ║→→→→→→→→→→→↓║
 ║↑▀▀▀▀▀▀▀▀▀▀☺║
 ╚════════════╝>


## Q-Learning Agent Demo

- Set ```agent_info```
  - The Q-Learning agent is implemented slightly differently from the Value Iteration agent: in addition to the ```env``` object, the constructor of the ```QLearningAgent``` class takes its arguments in a dictionary

In [11]:
agent_info = {
    "num_actions": env.action_space.n,
    "num_states": env.observation_space.n,
    "epsilon": 0.1,  # exploration rate
    "alpha": 0.5,    # step size (learning rate)
    "gamma": 1.0     # discount factor
}
q = QLearningAgent(env, "CliffWalking-v0", agent_info)

### Training loop
- A Q-Learning agent learns the values of state-action pairs directly, whereas a Value Iteration agent learns a state-value function and then acts greedily with respect to its state-value estimates
- A Q-Learning agent combines the advantages of Dynamic Programming and Monte Carlo methods
  - It bootstraps: it learns a guess from a guess
  - It does not require a model of the environment (the transition dynamics)
    - It learns by interacting directly with the environment
- Thus, a Value Iteration agent uses the transition dynamics of the environment to compute the state values (i.e., planning), whereas a Q-Learning agent has to interact with the environment for a number of episodes to learn (i.e., learning)
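
The bootstrapping idea can be illustrated with a textbook tabular Q-learning update. This is a sketch under assumptions, not the code inside `QLearningAgent`: the function names `epsilon_greedy` and `q_learning_update` are hypothetical, and the single transition used below (one step up from the start, reward -1) is hand-written rather than sampled from the environment.

```python
import numpy as np


def epsilon_greedy(q_table, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))
    return int(np.argmax(q_table[state]))


def q_learning_update(q_table, state, action, reward, next_state, done,
                      alpha=0.5, gamma=1.0):
    """Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a')."""
    if done:
        target = reward  # no future value after a terminal transition
    else:
        target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (target - q_table[state, action])


rng = np.random.default_rng(0)
q_table = np.zeros((48, 4))  # 48 states x 4 actions, all estimates start at 0

# One hand-written transition: from state 36, action 0 (up) leads to state 24.
q_learning_update(q_table, state=36, action=0, reward=-1.0,
                  next_state=24, done=False)
print(q_table[36])
```

No transition model appears anywhere in the update: only the observed `(state, action, reward, next_state)` tuple is needed, which is why the agent has to collect experience over episodes.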

In [12]:
num_episode = 100
for i in range(num_episode):
    observation, info = env.reset()
    q.agent_step(observation)

### Visualize the result
- For a Value Iteration agent, we pass its learned policy to the ```visualize()``` function to display the optimal route the agent should take
- For a Q-Learning agent, we only learn the q-values (stored in ```q.q```), which are enough to work out the policy
  - Since a q-value is the value of a state-action pair, the learned policy in each state is simply the action with the largest q-value
    - In my implementation, the q-value list of a Q-Learning agent and the policy list of a Value Iteration agent (both 2D lists) can be used directly to draw the route of the agent without any modification
      - ```np.argmax()``` decides the arrow drawn in each position
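
A minimal sketch of the ```np.argmax()``` step, assuming the q-values are stored as a `num_states x num_actions` array. The helper `greedy_arrows` and the tiny 2 x 2 grid are hypothetical and only illustrate the idea; they are not how `Vis.visualize` is written.

```python
import numpy as np

# Arrow for each action id: 0 = up, 1 = right, 2 = down, 3 = left.
ARROWS = np.array(["↑", "→", "↓", "←"])


def greedy_arrows(q_values, shape):
    """Pick the highest-valued action in every state and lay it out on the grid."""
    greedy_actions = np.argmax(np.asarray(q_values), axis=1)
    return ARROWS[greedy_actions].reshape(shape)


# Toy q-table for a 2 x 2 grid (4 states, 4 actions), illustrative values only.
q_demo = np.zeros((4, 4))
q_demo[0, 1] = 1.0  # state 0: right is best
q_demo[1, 2] = 1.0  # state 1: down is best
q_demo[2, 0] = 1.0  # state 2: up is best
q_demo[3, 3] = 1.0  # state 3: left is best

grid = greedy_arrows(q_demo, (2, 2))
for row in grid:
    print("".join(row))
```

The same function works on a policy stored as one row of action preferences per state, since the argmax of such a row is again the chosen action.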

In [13]:
Vis.visualize("CliffWalking-v0", env.shape, q.q)

<╔════════════╗
 ║→↓→→↓→←→↓→→↓║
 ║←→↓→↑→←↓↓↓↓↓║
 ║→→→→→→→→→→→↓║
 ║↑▀▀▀▀▀▀▀▀▀▀☺║
 ╚════════════╝>
