### Run in Colab
<a href="https://colab.research.google.com/github/racousin/rl_introduction/blob/master/notebooks/2_Dynamic_Programming.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
!pip install swig==4.2.1
!apt-get install xvfb
!pip install box2d-py==2.3.8
!pip install gymnasium[box2d,atari,accept-rom-license]==0.29.1
!pip install pyvirtualdisplay==3.0
!pip install opencv-python-headless
!pip install imageio imageio-ffmpeg
!git clone https://github.com/racousin/rl_introduction.git > /dev/null 2>&1

In [None]:
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import gymnasium as gym
import copy
import matplotlib.pyplot as plt
import seaborn as sns
from time import sleep
from rl_introduction.rl_introduction.tools import Agent, plot_values_lake
env = gym.make('FrozenLake-v1')

# 2_Dynamic_Programming

### Objective
Before diving deeper into Reinforcement Learning (RL), it is essential to understand how to compute the optimal strategy for an agent when the model of the environment is fully known. Such a problem is formalized as a Markov Decision Process (MDP). In this exercise, we solve an MDP using the FrozenLake environment as our example.



In [None]:
from rl_introduction.rl_introduction.render_colab import exp_render
exp_render({"name":'FrozenLake-v1', "fps":2, "nb_step":30})

### Understanding the Environment


In [None]:
env = gym.make('FrozenLake-v1')

#### 1) Environment Transition Model and Policy
**Question 1:** Environment Description

Describe the **observation space** and **action space** of the FrozenLake environment.

**Question 2:** Transition Model

Describe the **Transition Model** `env.P[state][action]`. Is the transition model of the environment stochastic?
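
As a reading aid for the model layout, here is a minimal sketch on a hypothetical 3-state, 2-action model. `TOY_P` and `is_stochastic` are illustrative names that are not part of the notebook or of Gymnasium. The toy model uses the same layout as FrozenLake's `env.P`, where `env.P[state][action]` is a list of `(probability, next_state, reward, terminated)` tuples.

```python
# Hypothetical toy model (not FrozenLake), written in the same layout as env.P:
# P[state][action] -> list of (probability, next_state, reward, terminated)
TOY_P = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 1, 0.0, False), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 2, 1.0, True), (0.2, 1, 0.0, False)]},
    2: {0: [(1.0, 2, 0.0, True)],
        1: [(1.0, 2, 0.0, True)]},
}

def is_stochastic(P):
    """Return True if some (state, action) pair has more than one possible outcome."""
    return any(
        sum(1 for prob, *_ in outcomes if prob > 0) > 1
        for actions in P.values()
        for outcomes in actions.values()
    )

# Inspect the outcomes of taking action 1 in state 1
for prob, next_state, reward, terminated in TOY_P[1][1]:
    print(f"p={prob}, next_state={next_state}, reward={reward}, terminated={terminated}")

print("stochastic:", is_stochastic(TOY_P))
```

The same loop can be applied to `env.P[state][action]` to answer the question for FrozenLake.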

**Exercise 1:** Implement a Random Policy

Write a **random policy** to perform experiments in the FrozenLake environment. Then run the code below.

In [None]:
# Random Policy Implementation
policy =

In [None]:
class MyAgent(Agent):
    def __init__(self, env, policy):
        super().__init__(env)
        self.policy = policy
    def act(self, state):
        action = np.random.choice(np.arange(self.env.action_space.n),p=self.policy[state])
        return action

my_agent = MyAgent(env, policy)

# Experiment Running Function
def run_experiment_episode(env, agent, nb_episode):
    rewards = np.zeros(nb_episode)
    for i in range(nb_episode):
        state = env.reset()[0]
        done = False
        rews = []
        while not done:
            action = agent.act(state)
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated  # stop on a terminal state or on the time limit
            rews.append(reward)
        rewards[i] = sum(rews)
    return rewards

# Running the experiment
rewards = run_experiment_episode(env, my_agent, 50)
plt.plot(rewards, 'o')
plt.title('Cumulative Reward per Episode for Random Agent')
plt.show()

## 2a) Policy Evaluation - Value Function

To evaluate the value function of a policy $\pi$, we use the iterative update:


$V_{k+1}(s) = \mathbb{E}_\pi [r_{t+1} + \gamma V_k(S_{t+1}) | S_t = s]$.

$(V_k)_{k\in \mathbb{N}}$ converges to $V_\pi$.

**Exercise 1:** Policy Evaluation Implementation

Complete the Python function below to evaluate the given policy using the iterative approach described above.


- Iterate over all states in the environment within the outer loop.
- For each state, compute the expected return by considering all possible actions and their probabilities under the current policy.
- Use the transition model (`env.P[s][a]`) to access the probabilities of next states and rewards for each action.
- Update the value function until the maximum change across all states is less than `theta`, indicating convergence.
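
The steps above can be sketched on a small hypothetical model before tackling FrozenLake. This is only an illustration under stated assumptions: `TOY_P` and `evaluate_policy` are made-up names, the model is a 3-state toy in the `P[s][a]` layout, and a discount of 0.9 is chosen for the example (the notebook's function defaults to `gamma=1`).

```python
import numpy as np

# Hypothetical toy model: P[s][a] -> list of (probability, next_state, reward, terminated)
TOY_P = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 1, 0.0, False), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 2, 1.0, True), (0.2, 1, 0.0, False)]},
    2: {0: [(1.0, 2, 0.0, True)],
        1: [(1.0, 2, 0.0, True)]},
}

def evaluate_policy(P, policy, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation with in-place updates on a tabular model."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = 0.0
            # Expectation over actions (policy) and over next states (model)
            for a, action_prob in enumerate(policy[s]):
                for prob, next_s, reward, _ in P[s][a]:
                    v_new += action_prob * prob * (reward + gamma * V[next_s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # largest change in this sweep is small enough
            return V

uniform_policy = np.ones((3, 2)) / 2
V_toy = evaluate_policy(TOY_P, uniform_policy)
print(V_toy)
```

The terminal state keeps a value of 0 because its only transitions are zero-reward self-loops, which is also how FrozenLake encodes its holes and goal.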

In [None]:
#TODO: evaluate the value function from the policy, rewards and transition model
def policy_evaluation(env, policy, gamma=1, theta=1e-8):
    V = np.zeros(env.observation_space.n) # initialization
    #complete here
    return V

In [None]:
# evaluate the policy
policy =  np.ones([env.observation_space.n, env.action_space.n]) / env.action_space.n
V = policy_evaluation(env, policy)
plot_values_lake(V)

## 2b) Policy Evaluation - Action Value Function

The action-value function Q(s,a) is computed as the expected return of taking an action a in state s, which includes the immediate reward plus the discounted future value as per the state transition probabilities. Mathematically, it is expressed as:

$$
\begin{aligned}
Q(s, a)
&= \sum_{s'} [r_{t+1} + \gamma V(s')] P(S_{t+1}=s'|S_t=s,A_t=a)
\end{aligned}
$$

**Exercise 2:** Action Value Function

Compute the action-value function from the value function.
- For a given state `s`, compute the action value `q(s, a)` of every action using the reward and transition model.
- Iterate over all states to get the full action-value function.
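
To make the formula concrete, here is a short worked sketch on a hypothetical toy model. `TOY_P`, `q_values` and the value vector `V_example` are illustrative assumptions, not FrozenLake data, and the discount of 0.9 is chosen for the example only.

```python
import numpy as np

# Hypothetical toy model: P[s][a] -> list of (probability, next_state, reward, terminated)
TOY_P = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 1, 0.0, False), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 2, 1.0, True), (0.2, 1, 0.0, False)]},
    2: {0: [(1.0, 2, 0.0, True)],
        1: [(1.0, 2, 0.0, True)]},
}

def q_values(P, V, s, gamma=0.9):
    """Action values of state s: sum over outcomes of p * (r + gamma * V[s'])."""
    q = np.zeros(len(P[s]))
    for a, outcomes in P[s].items():
        for prob, next_s, reward, _ in outcomes:
            q[a] += prob * (reward + gamma * V[next_s])
    return q

# Arbitrary value vector, chosen only to make the arithmetic easy to follow
V_example = np.array([0.0, 0.5, 0.0])
q_state_1 = q_values(TOY_P, V_example, s=1)
# Action 0: 1.0 * (0 + 0.9 * 0.0)                             = 0.0
# Action 1: 0.8 * (1 + 0.9 * 0.0) + 0.2 * (0 + 0.9 * 0.5)     = 0.89
print(q_state_1)
```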

In [None]:
#TODO: write the q evaluation from the value function, reward and transition model
def q_from_v(env, V, s, gamma=1):
    #complete here
    return q

## 3) Policy Improvement

The policy improvement step uses the action-value function $Q_\pi$ to make the policy greedy, so that the new policy selects the action with the highest value in each state. Mathematically, this is represented as: $\pi'(s) = \arg\max_a Q_\pi(s,a)$.
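
As an illustration of the greedy step, the sketch below turns a made-up action-value table into a one-hot policy matrix of the same shape as the `policy` arrays used in this notebook. `greedy_policy` and `Q_example` are hypothetical names and values; ties are broken by taking the first maximizing action, which is simply what `np.argmax` does.

```python
import numpy as np

def greedy_policy(Q):
    """Return a one-hot policy that puts probability 1 on argmax_a Q[s, a]."""
    n_states, n_actions = Q.shape
    policy = np.zeros((n_states, n_actions))
    policy[np.arange(n_states), np.argmax(Q, axis=1)] = 1.0
    return policy

# Made-up action values for 3 states and 2 actions
Q_example = np.array([[0.1, 0.4],
                      [0.7, 0.2],
                      [0.0, 0.0]])
pi_greedy = greedy_policy(Q_example)
print(pi_greedy)
```

Each row still sums to 1, so the result can be passed to an agent that samples actions with `np.random.choice(..., p=policy[state])`.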

**Exercise 1:** Choosing the Best Action

Complete the `best_action_from_Q` function below to determine the best action for a state s from the action-value function Q.

**Exercise 2:**

Complete the `policy_improvement` function below to generate a new, improved policy $\pi'$ based on the value function $V$ and `best_action_from_Q`.

**Exercise 3:**
Evaluate the value function of the new, improved policy $\pi'$ and compare it to that of the original policy. Is it better?

In [None]:
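#TODO: choose the best action in a state s from Q. What is the best direction/action in state 1?
def best_action_from_Q(env, Q, s):
    # Complete
    return best_a
# Build the action-value table by applying q_from_v to every state
Q = np.array([q_from_v(env, V, s) for s in range(env.observation_space.n)])
print(f"best direction/action on state 1: {best_action_from_Q(env, Q, 1)}")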

In [None]:
#TODO: write the policy improvement update step
def policy_improvement(env, V, gamma=1):
    policy = np.zeros([env.observation_space.n, env.action_space.n])
    #complete here
    return policy

## 4) Policy iteration

Policy iteration alternates between evaluating a policy and improving it until convergence.

$\pi_0 \xrightarrow[]{\text{evaluation}} V_{\pi_0} \xrightarrow[]{\text{improve}}
\pi_1 \xrightarrow[]{\text{evaluation}} V_{\pi_1} \xrightarrow[]{\text{improve}}
\pi_2 \xrightarrow[]{\text{evaluation}} \dots \xrightarrow[]{\text{improve}}
\pi_* \xrightarrow[]{\text{evaluation}} V_*$

- Policy Evaluation: Compute the value function $V_{\pi}$ for the current policy.
- Policy Improvement: Generate a new policy $\pi'$ that is greedy with respect to $V_{\pi}$.
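
The alternation of the two steps can be sketched on a hypothetical toy model. Everything here is an assumption made for illustration: the 3-state model `TOY_P`, the helper names `evaluate`, `improve` and `toy_policy_iteration`, the discount of 0.9, and the stopping rule (stop when the greedy policy no longer changes).

```python
import numpy as np

# Hypothetical toy model: P[s][a] -> list of (probability, next_state, reward, terminated)
TOY_P = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 1, 0.0, False), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 2, 1.0, True), (0.2, 1, 0.0, False)]},
    2: {0: [(1.0, 2, 0.0, True)],
        1: [(1.0, 2, 0.0, True)]},
}

def evaluate(P, policy, gamma, theta=1e-10):
    """Policy evaluation: sweep all states until the largest update is below theta."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(
                policy[s][a] * prob * (reward + gamma * V[next_s])
                for a in P[s]
                for prob, next_s, reward, _ in P[s][a]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def improve(P, V, gamma):
    """Policy improvement: one-hot policy that is greedy with respect to V."""
    n_actions = len(P[0])
    policy = np.zeros((len(P), n_actions))
    for s in P:
        q = [
            sum(prob * (reward + gamma * V[next_s]) for prob, next_s, reward, _ in P[s][a])
            for a in range(n_actions)
        ]
        policy[s, int(np.argmax(q))] = 1.0
    return policy

def toy_policy_iteration(P, gamma=0.9):
    """Alternate evaluation and improvement until the policy is stable."""
    n_actions = len(P[0])
    policy = np.ones((len(P), n_actions)) / n_actions  # start from the uniform policy
    while True:
        V = evaluate(P, policy, gamma)
        new_policy = improve(P, V, gamma)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

pi_star, V_star = toy_policy_iteration(TOY_P)
print(pi_star.argmax(axis=1), V_star)
```

On this toy model the loop stops after two rounds: the first improvement replaces the uniform policy with a greedy one, and the second improvement leaves it unchanged.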

**Exercise 1:**
Complete the policy iteration function below, which iteratively evaluates and improves a policy until it converges to the optimal policy.

In [None]:
#TODO: write the policy iteration
def policy_iteration(env):
    policy = np.ones([env.observation_space.n, env.action_space.n]) / env.action_space.n # init a random policy
    # complete here
    return policy, V

## 5) Run experiments (optimal policy vs random)

**Last Exercise:** Evaluate the effectiveness of the optimal policy obtained from policy iteration compared to a random policy. Compare the cumulative rewards obtained by an agent using a random policy with those of an agent using the optimal policy over 50 episodes. Use the `run_experiment_episode` function to collect the cumulative rewards for each policy and visualize the results.
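
One possible way to summarize such a comparison numerically is sketched below. `summarize_rewards` is a hypothetical helper, and `placeholder_rewards` is a made-up array used only to show the call, not a result from FrozenLake. Since FrozenLake gives a reward of 1 only when the goal is reached, the mean cumulative reward can be read as a success rate.

```python
import numpy as np

def summarize_rewards(rewards):
    """Return the mean episode reward and its standard error."""
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean()
    stderr = rewards.std(ddof=1) / np.sqrt(len(rewards))
    return mean, stderr

# Placeholder data for illustration only (not measured results)
placeholder_rewards = np.array([0, 1, 1, 0, 1, 1, 1, 0, 1, 1])
mean_reward, stderr_reward = summarize_rewards(placeholder_rewards)
print(f"mean reward: {mean_reward:.2f} +/- {stderr_reward:.2f}")
```

In the exercise, the same helper could be applied to the two arrays returned by `run_experiment_episode`, one per agent.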

In [None]:
#TODO: eval best Policy with run_experiment_episode

## 6) Value iteration

Value iteration directly computes the optimal value function, without evaluating intermediate policies.
We initialize $V_0$ arbitrarily and update it using:

$V_{k+1}(s) = \max_a \sum_{s'} [r_{t+1} + \gamma V_k(s')] P(S_{t+1}=s'|S_t=s,A_t=a)$ (2).
$\forall s$, $V_{\pi^*}(s)$ is a fixed point of (2), so if $(V_k)_{k\in \mathbb{N}}$ converges, it converges to $V_{\pi^*}$.

In [None]:
def value_iteration(env, gamma=1, theta=1e-8):
    V = np.zeros(env.observation_space.n)
    while True:
        delta = 0
        for s in range(env.observation_space.n):
            v = V[s]
            V[s] = max(q_from_v(env, V, s, gamma))
            delta = max(delta,abs(V[s]-v))
        if delta < theta:
            break
    policy = policy_improvement(env, V, gamma)
    return policy, V

In [None]:
policy_vi, V_vi = value_iteration(env)

# print the optimal policy
print("\nOptimal Policy (LEFT = 0, DOWN = 1, RIGHT = 2, UP = 3):")
print(policy_vi,"\n")

# plot the optimal state-value function
plot_values_lake(V_vi)

In [None]:
V_vi.sum()