# <center> Introduction to Reinforcement Learning</center>

#### Import dependencies

In [None]:
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import numpy as np
from copy import deepcopy

from IntroRL_Support.helper import *
from ece4078.gym_simple_gridworlds.envs.grid_env import GridEnv
from ece4078.gym_simple_gridworlds.envs.grid_2dplot import *

from IPython.display import display, HTML

UP = 0; DOWN = 1; LEFT = 2; RIGHT = 3; STAY = np.nan

# Activity 1. Elements of an MDP (Grid World Example)

Recall the grid in which our robot lives:

![GridWorldExample.png](https://i.postimg.cc/5tMM5vqf/Grid-World-Example.png)

- The states $s \in \mathcal{S}$ correspond to locations in the grid. Each location also has an associated cell index, e.g., cell index 4 corresponds to location (row=1, col=0).
- The robot can move up, down, left, or right. Actions correspond to unit increments or decrements in the specified direction.
    - Up: (-1, 0)
    - Down: (1, 0)
    - Left: (0, -1)
    - Right: (0, 1)
- Each action is represented by a number: Up is 0, Down is 1, Left is 2, and Right is 3. No actions are available at a terminal state.
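
The indexing and action conventions listed above can be sketched in a few lines of plain Python. This is only an illustration, not the `GridEnv` implementation: the 3x4 layout with a blocked cell at (row=1, col=1) is an assumption inferred from the figure and from the 11 states used in this notebook, and `build_index_map` and `move` are hypothetical helper names.

```python
UP, DOWN, LEFT, RIGHT = 0, 1, 2, 3

# Unit increments (d_row, d_col) for each action, as listed above
ACTION_DELTAS = {UP: (-1, 0), DOWN: (1, 0), LEFT: (0, -1), RIGHT: (0, 1)}

# Assumed layout: 3 rows x 4 columns with one blocked cell at (1, 1)
N_ROWS, N_COLS = 3, 4
WALLS = {(1, 1)}


def build_index_map():
    """Number the free cells row by row, skipping blocked cells."""
    index_of = {}
    for r in range(N_ROWS):
        for c in range(N_COLS):
            if (r, c) not in WALLS:
                index_of[(r, c)] = len(index_of)
    return index_of


def move(location, action):
    """Apply an action deterministically; stay put if the move is blocked."""
    d_row, d_col = ACTION_DELTAS[action]
    target = (location[0] + d_row, location[1] + d_col)
    inside = 0 <= target[0] < N_ROWS and 0 <= target[1] < N_COLS
    if not inside or target in WALLS:
        return location
    return target


index_of = build_index_map()
print(index_of[(1, 0)])          # cell index of location (row=1, col=0)
print(move((1, 2), UP))          # location reached from (1, 2) when moving up
```

Under this assumed layout, location (row=1, col=0) receives cell index 4, which matches the example given in the list above.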

## Create Environment and Explore its Attributes

The ``noise`` parameter is the probability that the agent deviates from its chosen direction when an action is taken (e.g., moving left or right when the agent decided to move up or down).

In [None]:
# Create a Grid World instance
grid_world = GridEnv(gamma=0.9, noise=0.2, living_reward=-0.04)

### State and Action Spaces

Let's take a look at the state and action spaces of our environment.

In [None]:
# State (or observation) space
print(grid_world.observation_space)
print(grid_world.get_states())
print()

# Action space
print(grid_world.action_space)
print(grid_world.get_actions())

### Transition Function

Let's take a look at the current state transition function. Some things to keep in mind regarding the transition function:

1. Given that $\mathcal{T}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$, the ``state_transitions`` attribute of the class ``GridEnv`` corresponds to a 3-Dimensional numpy array of size $11\times4\times11$.
2. With a noise attribute set to 0.2, at state 5, if the agent chooses to move up, it will end up at:
    - state 2 with $80\%$ probability,
    - state 6 with $10\%$ probability, or
    - state 5 with $10\%$ probability
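
The 80/10/10 split above can be reproduced with a small stand-alone sketch. It assumes that the noise is divided evenly between the two directions perpendicular to the chosen action, that a blocked move leaves the agent in place, and that the grid is 3x4 with a blocked cell at (row=1, col=1). These assumptions are consistent with the numbers above but are not taken from the `GridEnv` source; `next_state` and `transition_row` are hypothetical names, and terminal states are ignored.

```python
import numpy as np

UP, DOWN, LEFT, RIGHT = 0, 1, 2, 3
DELTAS = {UP: (-1, 0), DOWN: (1, 0), LEFT: (0, -1), RIGHT: (0, 1)}
# Directions perpendicular to each action (where the noise sends the agent)
SIDEWAYS = {UP: (LEFT, RIGHT), DOWN: (LEFT, RIGHT),
            LEFT: (UP, DOWN), RIGHT: (UP, DOWN)}

# Assumed layout: 3x4 grid, blocked cell at (1, 1), free cells numbered row by row
N_ROWS, N_COLS = 3, 4
WALLS = {(1, 1)}
CELLS = [(r, c) for r in range(N_ROWS) for c in range(N_COLS) if (r, c) not in WALLS]


def next_state(state, direction):
    """Deterministic successor; the agent stays put if the move is blocked."""
    r, c = CELLS[state]
    d_row, d_col = DELTAS[direction]
    target = (r + d_row, c + d_col)
    return CELLS.index(target) if target in CELLS else state


def transition_row(state, action, noise):
    """Probability of reaching each state from `state` when choosing `action`."""
    row = np.zeros(len(CELLS))
    row[next_state(state, action)] += 1.0 - noise
    for side in SIDEWAYS[action]:
        row[next_state(state, side)] += noise / 2.0
    return row


row = transition_row(5, UP, noise=0.2)
print({s: round(float(p), 2) for s, p in enumerate(row) if p > 0})
```

From state 5, the leftward slip runs into the blocked cell, so its share of the noise falls back onto state 5 itself.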

In [None]:
# at state 5 the agent takes action 0 (going up)
print(grid_world.state_transitions[5, UP])

# Pretty print, red shows current state
print("\nPretty Print:")
pp_state_transitions(grid_world, 5, UP)

### Living Reward and Reward Function

Let's now take a quick look at the living reward (i.e., running cost) and reward function $\mathcal{R}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$.

1. The living reward corresponds to the attribute ``immediate_rewards`` of the class ``GridEnv`` and is represented as a 1-Dimensional numpy array (one entry per state)
2. The reward function corresponds to the attribute ``rewards`` of the class ``GridEnv`` and is represented as a 2-Dimensional numpy array of size $11\times4$
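
One common way to obtain an expected reward per state-action pair is to weight the per-state rewards by the transition probabilities, $\mathcal{R}(s,a) = \sum_{s'} \mathcal{T}(s,a,s')\, r(s')$. The sketch below illustrates this on a hypothetical 3-state, 2-action MDP (not the grid world). It assumes the reward is collected on entering the next state; whether `GridEnv` computes its ``rewards`` attribute exactly this way is an assumption, not something confirmed here.

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions; state 2 is terminal
immediate_rewards = np.array([-0.04, -0.04, 1.0])

# T[s, a, s'] = probability of landing in s' after taking action a in state s
T = np.zeros((3, 2, 3))
T[0, 0] = [0.1, 0.8, 0.1]
T[0, 1] = [0.9, 0.1, 0.0]
T[1, 0] = [0.0, 0.2, 0.8]
T[1, 1] = [0.8, 0.2, 0.0]
# T[2] stays all zeros: no actions are available at the terminal state

# Expected reward for every (state, action) pair: sum over next states s'
expected_rewards = T @ immediate_rewards  # shape (3, 2)

print(expected_rewards.shape)
print(expected_rewards)
```

The matrix product contracts the last axis of `T` with the reward vector, so the result has one row per state and one column per action, mirroring the $11\times4$ shape described above.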

In [None]:
# Living rewards
print(f"Living rewards for all states:\n{grid_world.immediate_rewards}\n")
print("Pretty Print:")
pp_immediate_rewards(grid_world)

# Reward function, i.e., expected reward for taking action a at state s
print(f"Reward function for all state-action pairs:\n{grid_world.rewards}\n")
print(f"The expected reward at state 5 if agent chooses to move right is: {grid_world.rewards[5,RIGHT]}")

# To do (Flux Quiz 2): what is the expected reward at state 2 if the agent chooses to move right?

### Policy

Let's see the path and total reward of an agent moving on our grid world according to the following policy $\pi$:

![example_policy.png](https://i.postimg.cc/pLjHnkj0/example-policy.png)

In [None]:
# We represent this policy as a 2-Dimensional numpy array
policy_matrix = np.array([[RIGHT, RIGHT,  RIGHT,  STAY],
                          [UP,    STAY,   UP,     STAY],
                          [UP,    LEFT,   UP,     LEFT]])
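
Because `STAY` is `np.nan`, the policy array is stored as floats, and calling `int()` on a `STAY` entry raises a `ValueError`. A minimal sketch of a safe lookup is shown below; `action_at` is a hypothetical helper, not part of the course code.

```python
import numpy as np

UP = 0; DOWN = 1; LEFT = 2; RIGHT = 3; STAY = np.nan

policy_matrix = np.array([[RIGHT, RIGHT,  RIGHT,  STAY],
                          [UP,    STAY,   UP,     STAY],
                          [UP,    LEFT,   UP,     LEFT]])


def action_at(policy, row, col):
    """Return the integer action at (row, col), or None where the policy is STAY."""
    action = policy[row, col]
    if np.isnan(action):
        return None
    return int(action)


print(policy_matrix.dtype)              # the NaN entries force a float dtype
print(action_at(policy_matrix, 2, 1))   # action stored for location (2, 1)
print(action_at(policy_matrix, 0, 3))   # a STAY entry
```

Returning `None` for `STAY` entries makes it explicit to the caller that no action should be taken at that location.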

In [None]:
print(grid_world.grid)

Let's now apply this policy and observe the agent's behavior (blue dot in the figure shown below).

In [None]:
# Create a Grid World instance
grid_world = GridEnv(gamma=0.9, noise=0.2, living_reward=-0.04)
# grid_world.seed(seed = 10)
s_x, s_y = get_state_to_plot(grid_world)

# We can visualize our grid world using the render() function
fig, ax = grid_world.render()
agent, = ax.plot([], [], 'o', color='b', linewidth=6)
reward_text = ax.text(0.02, 0.95, '', transform=ax.transAxes)

done = False
cumulative_reward = 0
cur_state = grid_world.cur_state
path_to_plot = []

while not done:
    _, cur_reward, done, _ = grid_world.step(int(policy_matrix[cur_state[0], cur_state[1]]))
    cur_state = grid_world.cur_state
    n_x, n_y = get_state_to_plot(grid_world)
    cumulative_reward += cur_reward
    path_to_plot.append([cumulative_reward, n_x, n_y])

def init():
    agent.set_data([s_x + 0.5], [s_y + 0.5])
    reward_text.set_text('')
    return agent, reward_text

def animate(i):
    if i < len(path_to_plot):
        r, n_x, n_y = path_to_plot[i]
        agent.set_data([n_x + 0.5], [n_y + 0.5])
        reward_text.set_text('Cumulative reward: %.2f' % r)
    return agent, reward_text

ani = animation.FuncAnimation(fig, animate, frames=len(path_to_plot), blit=False, interval=500, init_func=init,
                              repeat=False)

# Close the static figure so that only the animation is displayed
plt.close('all')
display(HTML(f"<div align=\"center\">{ani.to_jshtml()}</div>"))
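
The rollout above accumulates an undiscounted sum of rewards. For comparison, the sketch below computes both the running cumulative reward and the discounted return $\sum_t \gamma^t r_t$ for a list of per-step rewards. The reward sequence is hypothetical (three living-reward steps followed by a terminal reward of 1), and `returns` is an illustrative helper rather than part of the course code.

```python
import numpy as np


def returns(rewards, gamma):
    """Running undiscounted sum and total discounted return of one episode."""
    rewards = np.asarray(rewards, dtype=float)
    cumulative = np.cumsum(rewards)                  # what the animation displays
    discounts = gamma ** np.arange(len(rewards))     # 1, gamma, gamma^2, ...
    discounted_return = float(np.sum(discounts * rewards))
    return cumulative, discounted_return


# Hypothetical episode: three steps at the living reward, then a terminal reward
episode_rewards = [-0.04, -0.04, -0.04, 1.0]
cumulative, discounted = returns(episode_rewards, gamma=0.9)

print(cumulative)
print(discounted)
```

With $\gamma < 1$, rewards received later in the episode contribute less, so the discounted return is smaller than the final cumulative reward whenever the large positive reward arrives at the end.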