# Lab03 - Monte Carlo Control

### Learning Goals:
- Understanding policies in the context of Reinforcement Learning
- Understanding Monte Carlo methods and implementing a first-visit MC control algorithm
- Visualizing the value function

In [None]:
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
from collections import defaultdict
from toolbox.blackjack import BlackjackEnv
import sys

## 3.1 Monte Carlo Control

Let's consider how Monte Carlo estimation can be used to approximate optimal policies. The overall idea is to maintain both an approximate policy and an approximate value function. The value function is repeatedly altered to more closely approximate the value function for the current policy, and the policy is repeatedly improved with respect to the current value function. 

To begin, let's recall classical policy iteration. This method alternates complete steps of policy evaluation (E) and policy improvement (I):

\begin{equation*}
   \large \pi_0 \xrightarrow[]{E} v_{\pi_0} \xrightarrow[]{I} \pi_1 \xrightarrow[]{E} v_{\pi_1} \xrightarrow[]{I} \pi_2 \xrightarrow[]{E} ... \xrightarrow[]{I} \pi_* \xrightarrow[]{E} v_*
\end{equation*}

In the Monte Carlo version of classical policy iteration, policy evaluation is done exactly as in the preceding section. Many episodes are experienced, with the approximate action-value function approaching the true function asymptotically. We assume that we observe an infinite number of episodes with exploring starts. Under these assumptions, Monte Carlo methods compute each $q_{\pi_k}$ exactly, for arbitrary $\pi_k$.
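As a rough illustration of the evaluation step, the sketch below computes first-visit returns for a single episode. The function name `first_visit_returns` and the episode layout (a list of `(state, action, reward)` tuples) are assumptions made for this example, not part of the lab toolbox.

```python
def first_visit_returns(episode, discount_factor=1.0):
    """Map each (state, action) pair to the return following its first visit.

    episode: list of (state, action, reward) tuples (assumed layout).
    """
    returns = {}
    G = 0.0
    # Walk backwards so that G accumulates the discounted return from each step.
    for state, action, reward in reversed(episode):
        G = discount_factor * G + reward
        # Earlier time steps overwrite later ones, so the value that
        # remains at the end belongs to the first visit of the pair.
        returns[(state, action)] = G
    return returns


# Toy episode with made-up states and rewards; ("s0", 0) is visited twice.
toy_episode = [("s0", 0, 0.0), ("s1", 1, 0.0), ("s0", 0, 1.0)]
toy_returns = first_visit_returns(toy_episode, discount_factor=0.5)
```

Averaging such returns over many episodes gives the Monte Carlo estimate of $q_{\pi_k}(s,a)$ for every pair that was visited.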

Policy improvement is done by making the policy greedy with respect to the current value function. No model is needed, since we have an action-value function. For each $s \in \mathcal{S}$ we deterministically choose an action with maximal action-value:
\begin{equation}
    \pi(s) \doteq \underset{a}{\operatorname{arg\,max}}\; q(s,a)
\end{equation}
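A minimal sketch of this greedy improvement step, assuming the action-value function is stored as a dictionary that maps each state to a NumPy array of action values. The helper name `greedy_policy_from_q` and the toy values are hypothetical.

```python
import numpy as np


def greedy_policy_from_q(Q):
    """Return a dict mapping each state to an action with maximal action-value."""
    # np.argmax breaks ties by returning the lowest action index.
    return {state: int(np.argmax(action_values))
            for state, action_values in Q.items()}


# Toy action-value table with two actions per state (values are made up).
Q_toy = {
    "s0": np.array([0.1, 0.7]),
    "s1": np.array([-0.3, -0.5]),
}
policy_toy = greedy_policy_from_q(Q_toy)
```

Only the table of action values is used here. No transition model of the environment is required.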

Two unlikely assumptions were made to obtain this convergence guarantee: first, that episodes have exploring starts, and second, that policy evaluation can use an infinite number of episodes.

<div class="alert alert-block alert-info">
    <b>MC Control:</b> The idea is the same as in dynamic programming: use MC policy evaluation to evaluate the current policy, then improve the policy greedily.
</div>

<div class="alert alert-block alert-warning">
    <b>The Problem:</b> How do we ensure that we explore all states if we don't know the full environment?
</div>

<div class="alert alert-block alert-success">
    <b>Solution to exploration problem:</b> Use epsilon-greedy policies instead of fully greedy policies: when making a decision, act randomly with probability epsilon and greedily otherwise. This learns the best policy among epsilon-greedy policies rather than the overall optimal policy.
</div>
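The epsilon-greedy rule can be sketched as a probability vector over actions. This is an illustrative helper under stated assumptions, not the lab solution: the name `epsilon_greedy_probs` and the example action values are made up.

```python
import numpy as np


def epsilon_greedy_probs(action_values, epsilon):
    """Return action probabilities for an epsilon-greedy policy in one state."""
    n_actions = len(action_values)
    # Every action receives an equal share of the exploration probability.
    probs = np.full(n_actions, epsilon / n_actions)
    # The greedy action receives the remaining probability mass on top.
    probs[int(np.argmax(action_values))] += 1.0 - epsilon
    return probs


probs = epsilon_greedy_probs(np.array([0.2, 0.9]), epsilon=0.1)

# Sample an action index according to these probabilities.
rng = np.random.default_rng(0)
action = rng.choice(len(probs), p=probs)
```

Because every action keeps a probability of at least `epsilon / n_actions`, each action in a visited state continues to be tried, which is what replaces the exploring-starts assumption.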

**TODO:** Implement the on-policy first-visit MC control (for $\epsilon$-soft policies) from Sutton & Barto Chapter 5.4.

In [None]:
env = BlackjackEnv()

In [None]:
def mc_control_epsilon_greedy(env, num_episodes, discount_factor=1.0, epsilon=0.1):
    """
    Monte Carlo Control using Epsilon-Greedy policies.
    Finds an optimal epsilon-greedy policy.
    
    Args:
        env: OpenAI gym environment.
        num_episodes: Number of episodes to sample.
        discount_factor: Gamma discount factor.
        epsilon: Chance to sample a random action. Float between 0 and 1.
    
    Returns:
        Q is a dictionary mapping state -> action values.
    """
    
    # TODO: Implement the on-policy first-visit MC control (for ε-soft policies) from Sutton & Barto Chapter 5.4.
   
    return Q

In [None]:
Q = mc_control_epsilon_greedy(env, num_episodes=500000, epsilon=0.1)

## 3.2 Visualizing the Value Function

Your function `mc_control_epsilon_greedy()` returns the action-value function $q_\pi(s,a)$ (`Q`) for the learned $\epsilon$-greedy policy. Use the generated `Q` to calculate the value function:

\begin{equation*}
   \large v(s) = \max_a q_\pi(s,a)
\end{equation*}
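For illustration, the maximization over actions might look like the sketch below. It assumes `Q` maps each state to a NumPy array of action values and that a state is a tuple such as `(player_sum, dealer_card, usable_ace)`. Both the state layout and the numbers are assumptions made for this toy example.

```python
import numpy as np
from collections import defaultdict

# Toy action-value table with two actions per state (values are made up).
Q_example = defaultdict(lambda: np.zeros(2))
Q_example[(20, 10, False)] = np.array([0.4, -0.8])
Q_example[(12, 6, True)] = np.array([-0.2, 0.1])

# v(s) is the largest action value available in state s.
V_example = {state: float(np.max(action_values))
             for state, action_values in Q_example.items()}
```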

**TODO:** Calculate the value function and reuse `plot_value_function()` from Lab02 to visualize it.

In [6]:
def plot_value_function(V, title="Value Function"):
    """
    Plots the value function as a surface plot.
    """
    

In [None]:
plot_value_function(V, title="Optimal Value Function")

## 3.3 Visualizing the policy

Come up with a creative way to visualize the learned policy. Keep in mind that your policy differs between hands with a usable ace and hands without one. Here is just one possible way of visualizing the policy:

![policy](images/Ex3.3_policy.png)

In [7]:
# TODO: Visualize the policy