# Lab02 - Monte Carlo Methods

### Learning Goals:
- Getting to know the blackjack environment
- Understanding policies in the context of Reinforcement Learning
- Understanding Monte Carlo methods and implementing a First-visit MC prediction algorithm
- Visualizing the value function

In [None]:
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
from collections import defaultdict
from toolbox.blackjack import BlackjackEnv

## 2.1 Blackjack Environment

The object of the popular casino card game of blackjack is to obtain cards the sum of whose numerical values is as great as possible without exceeding 21. All face cards count as 10, and an ace can count either as 1 or as 11. We consider the version in which each player competes independently against the dealer. The game begins with two cards dealt to both dealer and player. One of the dealer’s cards is face up and the other is face down. If the player has 21 immediately (an ace and a 10-card), it is called a natural. He then wins unless the dealer also has a natural, in which case the game is a draw. If the player does not have a natural, then he can request additional cards, one by one (hits), until he either stops (sticks) or exceeds 21 (goes bust). If he goes bust, he loses; if he sticks, then it becomes the dealer’s turn. The dealer hits or sticks according to a fixed strategy without choice: he sticks on any sum of 17 or greater, and hits otherwise. If the dealer goes bust, then the player wins; otherwise, the outcome - win, lose, or draw - is determined by whose final sum is closer to 21.
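The card-counting rules above can be made concrete with a small sketch. The helper names `hand_value`, `is_natural`, and `dealer_sticks` are illustrative assumptions and are not part of `toolbox.blackjack`; cards are represented as integers 1 to 10, with 1 standing for an ace.

```python
def hand_value(cards):
    """Return (total, usable_ace) for a list of card values.

    Cards are integers 1-10, where 1 is an ace and face cards count as 10.
    (Hypothetical helper, not taken from the lab's toolbox.)
    """
    total = sum(cards)
    # An ace is "usable" if counting one ace as 11 does not bust the hand.
    usable_ace = 1 in cards and total + 10 <= 21
    return (total + 10 if usable_ace else total), usable_ace


def is_natural(cards):
    """A natural is exactly two cards: an ace and a 10-valued card."""
    return sorted(cards) == [1, 10]


def dealer_sticks(cards):
    """Fixed dealer rule from the text: stick on 17 or more, hit otherwise."""
    total, _ = hand_value(cards)
    return total >= 17


print(hand_value([1, 6]))      # ace counted as 11
print(hand_value([1, 6, 9]))   # ace has to count as 1
print(is_natural([10, 1]))
```

Note that at most one ace can ever count as 11, since two aces at 11 would already exceed 21.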

In [None]:
env = BlackjackEnv()

Playing blackjack is naturally formulated as an episodic finite MDP. Each game of blackjack is an episode. Rewards of +1, -1, and 0 are given for winning, losing, and drawing, respectively. All rewards within a game are zero, and we do not discount ($\gamma = 1$); therefore these terminal rewards are also the returns. The player’s actions are to hit or to stick. The states depend on the player’s cards and the dealer’s showing card. We assume that cards are dealt from an infinite deck (i.e., with replacement) so that there is no advantage to keeping track of the cards already dealt. If the player holds an ace that he could count as 11 without going bust, then the ace is said to be usable. In this case it is always counted as 11 because counting it as 1 would make the sum 11 or less, in which case there is no decision to be made because, obviously, the player should always hit. Thus, the player makes decisions on the basis of three variables: his current sum (12–21), the dealer’s one showing card (ace–10), and whether or not he holds a usable ace.
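As a rough illustration of this formulation, the sketch below enumerates the decision states and draws cards with replacement. The names `DECK`, `draw_card`, and `STATES` are assumptions made for this example; the actual `BlackjackEnv` may represent cards and states differently.

```python
import random
from itertools import product

# Card values in one suit: ace = 1, 2-9 at face value, 10/J/Q/K all count as 10.
DECK = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]


def draw_card(rng):
    """Draw with replacement, so every draw has the same distribution."""
    return rng.choice(DECK)


# Every decision state: (player sum, dealer showing card, usable ace).
STATES = list(product(range(12, 22), range(1, 11), (False, True)))

rng = random.Random(0)  # seeded only to make the sketch repeatable
print(len(STATES))      # 10 sums * 10 dealer cards * 2 ace flags
print([draw_card(rng) for _ in range(5)])
```

Because draws are independent of earlier draws, the three variables in each state tuple are all the player needs to track.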

In [None]:
def print_observation(observation):
    score, dealer_score, usable_ace = observation
    # TODO: Print the player's score, whether the player has a usable ace, and the dealer's score
    ...

### Defining a policy
**TODO:** In the cell below, define a policy that sticks if the player’s sum is 20 or 21 and hits otherwise.

In [None]:
def strategy(observation):
    # TODO: Unpack the observation
    ...
    # TODO: Stick (action 0) if the score is >= 20, hit (action 1) otherwise
    return ...

### Getting familiar with the environment
**TODO:** Write a program that plays 10 games using the blackjack environment imported above and the policy you defined.

In [None]:
# TODO: Play 10 games using the defined policy and the imported blackjack environment. Print the results.
...

## 2.2 Monte Carlo Prediction
### 2.2.1 First-visit MC prediction

In this section we use Monte Carlo methods for learning the state-value function for a given policy. Recall that the value of a state is the expected return, starting from that state. An obvious way to estimate it from experience is to average the returns observed after visits to that state.

Suppose we want to estimate $v_\pi(s)$, the value of a state $s$ under policy $\pi$, given a set of episodes obtained by following $\pi$ and passing through $s$. Each occurrence of $s$ in an episode is called a visit to $s$. The **first-visit MC** method estimates $v_\pi(s)$ as the average of the returns following **first** visits to $s$.

 
<div class="alert alert-block alert-info">
    <b>MC Policy Evaluation</b>: Given a policy, we want to estimate the state-value function V(s). Sample episodes of experience and estimate V(s) as the return received from that state onwards, averaged over all sampled episodes. The same technique works for the action-value function Q(s, a). As the number of visits to each state grows, the estimate converges to the true value.
</div>
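To make the averaging concrete, here is a minimal sketch of first-visit MC on hand-written toy episodes. It deliberately does not use the blackjack environment: the states `"A"` and `"B"`, the non-zero intermediate rewards, and the function names are assumptions chosen only for illustration.

```python
from collections import defaultdict


def first_visit_returns(episode, gamma=1.0):
    """Map each state in one episode to the return following its first visit.

    `episode` is a list of (state, reward) pairs, where `reward` is the
    reward received after leaving `state`.
    """
    G = 0.0
    returns = {}
    # Walk backwards so G accumulates the discounted return from each step.
    for state, reward in reversed(episode):
        G = gamma * G + reward
        # Overwriting means the earliest visit (processed last) wins.
        returns[state] = G
    return returns


def first_visit_mc(episodes, gamma=1.0):
    """Average first-visit returns over a list of episodes."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        for state, G in first_visit_returns(episode, gamma).items():
            totals[state] += G
            counts[state] += 1
    return {state: totals[state] / counts[state] for state in totals}


# Two toy episodes over hypothetical states "A" and "B".
episodes = [
    [("A", 1.0), ("B", 0.0), ("A", 2.0)],  # "A" is visited twice
    [("B", 0.0), ("A", 1.0)],
]
print(first_visit_mc(episodes))  # {'A': 2.0, 'B': 1.5}
```

In the first episode only the return after the first visit to `"A"` (1 + 0 + 2 = 3) is counted; the second visit is ignored.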

**TODO:** Implement First-visit MC prediction from Sutton & Barto Chapter 5.1 to evaluate the defined policy.

In [None]:
def mc_prediction_first_visit(policy, env, num_episodes, discount_factor=1.0):
    """
    Monte Carlo prediction algorithm. Calculates the value function
    for a given policy using sampling.
    
    Args:
        policy: A function that maps an observation to an action (0 = stick, 1 = hit).
        env: OpenAI gym environment.
        num_episodes: Number of episodes to sample.
        discount_factor: Gamma discount factor.
    
    Returns:
        A dictionary that maps from state -> value.
        The state is a tuple and the value is a float.
    """

    # TODO: Implement first-visit MC prediction from Sutton & Barto Chapter 5.1 to evaluate the defined policy

    return V    

In [None]:
# TODO: Run the first-visit MC prediction function for 10,000 episodes and store the result
V_10k_first_visit = ...

### 2.2.2 Every-visit MC prediction
The **every-visit MC** method estimates $v_\pi(s)$ as the average of the returns following **all** visits to $s$. 
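The difference only shows up when a state recurs within an episode. The sketch below uses a hand-written toy episode in which the hypothetical state `"A"` appears twice; the function name and the episode format of (state, reward) pairs are assumptions for this example, not the lab's required interface.

```python
from collections import defaultdict


def every_visit_mc(episodes, gamma=1.0):
    """Average the returns following every visit to each state."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk backwards so G is the return from the current step onwards.
        for state, reward in reversed(episode):
            G = gamma * G + reward
            # No first-visit check: each occurrence contributes a return.
            totals[state] += G
            counts[state] += 1
    return {state: totals[state] / counts[state] for state in totals}


# One toy episode in which "A" is visited twice.
episodes = [[("A", 1.0), ("B", 0.0), ("A", 2.0)]]
print(every_visit_mc(episodes))  # {'A': 2.5, 'B': 2.0}
```

Here `"A"` averages the returns 3 and 2 from both visits, whereas a first-visit estimate would use only the return 3. As Sutton & Barto note for blackjack, the same state never recurs within one episode, so on this task the two methods should differ only through sampling noise between runs.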

**TODO:** Slightly change the first-visit method implemented above to obtain every-visit MC prediction.

In [None]:
def mc_prediction_every_visit(policy, env, num_episodes, discount_factor=1.0):
    """
    Monte Carlo prediction algorithm. Calculates the value function
    for a given policy using sampling.
    
    Args:
        policy: A function that maps an observation to an action (0 = stick, 1 = hit).
        env: OpenAI gym environment.
        num_episodes: Number of episodes to sample.
        discount_factor: Gamma discount factor.
    
    Returns:
        A dictionary that maps from state -> value.
        The state is a tuple and the value is a float.
    """

    # TODO: Implement every-visit MC prediction from Sutton & Barto Chapter 5.1 to evaluate the defined policy

    return V    

In [None]:
# TODO: Run the every-visit MC prediction function for 10,000 episodes and store the result
V_10k_every_visit = ...

## 2.3 Visualizing the Value Function
In this section you will visualize the value function of the evaluated policy. One possible visualization is shown below.

![value_function](images/Ex2.3_plot.png)

**TODO:** Try to recreate the given plot.
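A useful first step for any such plot is to arrange the dictionary `V` into a regular grid. The sketch below assumes that states are tuples of the form (player sum, dealer card, usable ace) and that one grid per usable-ace setting is wanted; the helper name `value_grid` and the example values are hypothetical.

```python
def value_grid(V, usable_ace, player_sums=range(12, 22), dealer_cards=range(1, 11)):
    """Arrange V into rows (player sum) by columns (dealer card).

    States missing from V (never visited) default to 0.0.
    """
    return [
        [V.get((player, dealer, usable_ace), 0.0) for dealer in dealer_cards]
        for player in player_sums
    ]


# Made-up value estimates for two states, just to exercise the helper.
V_example = {(20, 10, False): 0.4, (12, 1, True): -0.3}
grid = value_grid(V_example, usable_ace=False)
print(len(grid), len(grid[0]))  # 10 rows, 10 columns
print(grid[8][9])               # player sum 20, dealer showing 10
```

A nested list like this can be converted with `np.array` and passed, together with matching coordinate grids from `np.meshgrid`, to a 3D surface or heatmap plot.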

In [None]:
def plot_value_function(V, title="Value Function"):
    ...

In [None]:
plot_value_function(V_10k_first_visit, title="10,000 Episodes (first-visit)")

In [None]:
plot_value_function(V_10k_every_visit, title="10,000 Episodes (every-visit)")