# Practical Activity 1 (**PRA1**)

## Evaluable Practical Exercise

<u>General considerations</u>:

- The proposed solution cannot use methods, functions or parameters declared **_deprecated_** in future versions.
- This activity must be carried out on a **strictly individual** basis. Any indication of copying will be penalized with a failure for all parties involved and the possible negative evaluation of the subject in its entirety.
- It is necessary for the student to indicate **all the sources** that she/he has used to carry out the PRA. If not, the student will be considered to have committed plagiarism, being penalized with a failure and the possible negative evaluation of the subject in its entirety.

<u>Delivery format</u>:

- Some exercises may require several minutes of execution, so the delivery must be done in **Notebook format** and in **HTML format**, where the code, results and comments of each exercise can be seen. You can export the notebook to HTML from the menu File $\to$ Download as $\to$ HTML.
- There is a special type of cell to hold text. This type of cell will be very useful to answer the different theoretical questions posed throughout the activity. To change the cell type to this type, in the menu: Cell $\to$ Cell Type $\to$ Markdown.

<div class="alert alert-block alert-info">
<strong>Name and surname: Victor Brao Ruiz </strong>
</div>

## Introduction

Blackjack environment is part of the Gymnasium's [Toy Text](https://gymnasium.farama.org/environments/toy_text/) environments. Blackjack is a card game where the goal is to beat the dealer by obtaining cards that sum to closer to 21 (without going over 21) than the dealer's cards.

The card values are, as depicted in the following figure:
- Face cards (Jack, Queen, King) have a point value of **10**.
- Aces can either count as **11** (called a "usable ace") or **1**.
- Numerical cards (**2-9**) have a value equal to their number.

<img src="./figs/BlackJackCards.png" />

Game Dynamics:
1. The game starts with the dealer having one face up and one face down card, while the player has two face up cards. All cards are drawn from an infinite deck (i.e. with replacement).
2. The player has a total sum of cards. They can request additional cards (**hit**) until they decide to stop (**stick**) or exceed 21 (**bust**), which results in an immediate loss.
3. After the player decides to stick, the dealer reveals their face-down card and draws cards until their total is 17 or greater. If the dealer goes bust, the player wins.
4. If neither the player nor the dealer goes bust, the winner is whoever has a sum closer to 21.

Further information could be found at:
- Gymnasium [Blackjack](https://gymnasium.farama.org/environments/toy_text/blackjack/)

In order to initialize the environment, we will use `natural=True` to give an additional reward for starting with a natural blackjack, i.e. starting with an ace and ten (sum is 21), as depicted in the following piece of code:

In [119]:
import gymnasium as gym

env = gym.make('Blackjack-v1', natural=True, sab=False)

In [120]:
print("Action space is {} ".format(env.action_space))
print("Observation space is {} ".format(env.observation_space))

Action space is Discrete(2) 
Observation space is Tuple(Discrete(32), Discrete(11), Discrete(2)) 


## Part 1. Naïve Policy

Implement an agent that carries out the following deterministic policy: 
- The agent will **stick** if it gets a score of 20 or 21.
- Otherwise, it will **hit**.

<u>Questions</u> (**1 point**): 
1. Using this agent, simulate 100,000 games and calculate the agent's return (total accumulated reward).
2. Additionally, calculate the % of wins, natural wins, losses and draws. 
3. Comment on the results.

In [121]:
import numpy as np

def naive_policy(player_sum):
    return 0 if player_sum >= 20 else 1  # 0=stick, 1=hit


n_episodes = 100000
total_reward = 0
wins = 0
natural_wins = 0
losses = 0
draws = 0

for i in range(n_episodes):
    state, info = env.reset()
    terminated = False
    truncated = False
    while not (terminated or truncated):
        player_sum = state[0]
        action = naive_policy(player_sum)
        state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if reward == 1.0:
        wins += 1
    elif reward == 1.5:
        natural_wins += 1
        wins += 1
    elif reward == 0.0:
        draws += 1
    elif reward == -1.0:
        losses += 1

print(f"Total accumulated reward: {total_reward}")
print(f"Win percentage: {wins / n_episodes * 100:.2f}%")
print(f"Natural win percentage: {natural_wins / n_episodes * 100:.2f}%")
print(f"Loss percentage: {losses / n_episodes * 100:.2f}%")
print(f"Draw percentage: {draws / n_episodes * 100:.2f}%")

Total accumulated reward: -33285.5
Win percentage: 29.32%
Natural win percentage: 4.21%
Loss percentage: 64.71%
Draw percentage: 5.97%


With this first basic policy implementation we have a loss percentage of almost 65%, with a win % of 29.43%, which also includes the % of natural wins (a 4.13%). As far as I know, the % of natural wins is the only one that souldn't change much in the future, as the probabilities of a natural blackjack are independent of the policy chosen. The results seem pretty consisistent with the policy we have implemented. Of course, the accumulated reward is just a reflection of this low win rate.

## Part 2. Monte Carlo method

The objective of this section is to estimate the optimal policy using Monte Carlo methods. Specifically, you can choose and implement one of the algorithms related to _Control using MC methods_ (with ''exploring starts'' or without ''exploring starts'', both on-policy or off-policy).

<u>Questions</u> (**2.5 points**): 
1. Implement the selected algorithm and justify your choice.
2. Comment and justify all the parameters, such as:
- Number of episodes
- Discount factor
- Etc.
3. Implement a function that prints on the screen the optimal policy found for each state (similar to the figure in Section 3.1).
4. Using the trained agent, simulate 100,000 games and calculate the agent's return (total accumulated reward).
5. Additionally, calculate the % of wins, natural wins, losses and draws.

For this part i will implement a On-policy first-visit Monte Carlo Control with $\epsilon$-soft policy, without exploring starts. The justification is that its a simple policy that can be useful for a game like blackjack, with a small state space where we can mantain a table of Q values. And $\epsilon$-soft ensures all actions have a non-zero probability, so we ensure exploration.

I will start with 100.000 episodes, just to check if its working and end with 500.000, the same used for the CliffWalking example, that should be enough. Discount Factor = 1 as in blackjack we care about the only reward at the end of each episode. For epsilon, i will start with a value of Epsilon = 0.1 just to ensure exploration. (maybe to be changed in the future)

In [122]:
from collections import defaultdict


def make_epsilon_greedy_policy(Q, epsilon, n_actions):
   
    def policy_fn(observation):
        A = np.ones(n_actions, dtype=float) * epsilon / n_actions
        best_action = np.argmax(Q[observation])
        A[best_action] += (1.0 - epsilon)
        return A
    return policy_fn


def mc_control_on_policy_epsilon_greedy(env, num_episodes, discount=1.0, epsilon=0.1):

    returns_sum = defaultdict(float)
    returns_count = defaultdict(float)
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    for i_episode in range(1, num_episodes + 1):
        policy = make_epsilon_greedy_policy(Q, epsilon, env.action_space.n)
        episode = []
        state, _ = env.reset()
        done = False

        while not done:
            probs = policy(state)
            action = np.random.choice(np.arange(len(probs)), p=probs)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            done = terminated or truncated

        # First-visit Monte Carlo update
        visited_state_actions = set()
        G = 0
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = discount * G + reward
            if (state, action) not in visited_state_actions:
                returns_sum[(state, action)] += G
                returns_count[(state, action)] += 1.0
                Q[state][action] = returns_sum[(state, action)] / returns_count[(state, action)]
                visited_state_actions.add((state, action))

    policy = make_epsilon_greedy_policy(Q, epsilon, env.action_space.n)
    return Q, policy


In [123]:
#Training
num_episodes = 500000
discount_factor = 1.0
epsilon = 0.1

Q_mc, policy = mc_control_on_policy_epsilon_greedy(env, num_episodes, discount_factor, epsilon)


In [124]:
#Display policy
def print_policy(Q):
    usable_ace = np.full((10, 10), " ")
    no_usable_ace = np.full((10, 10), " ")

    for player_sum in range(12, 22):
        for dealer_card in range(1, 11):
            for ace in [True, False]:
                state = (player_sum, dealer_card, ace)
                if state in Q:
                    action = np.argmax(Q[state])
                    symbol = "S" if action == 0 else "H"
                    if ace:
                        usable_ace[player_sum - 12, dealer_card - 1] = symbol
                    else:
                        no_usable_ace[player_sum - 12, dealer_card - 1] = symbol

    print("\nPolicy with usable ace (S=stick, H=hit):")
    print(np.flipud(usable_ace))
    print("\nPolicy without usable ace (S=stick, H=hit):")
    print(np.flipud(no_usable_ace))

print_policy(Q)


Policy with usable ace (S=stick, H=hit):
[['S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S']
 ['S' 'S' 'S' 'S' 'H' 'S' 'S' 'S' 'S' 'S']
 ['S' 'S' 'S' 'S' 'S' 'H' 'S' 'S' 'S' 'S']
 ['H' 'H' 'H' 'H' 'S' 'H' 'S' 'H' 'H' 'H']
 ['H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H']
 ['H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H']
 ['H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H']
 ['H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H']
 ['H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H']
 ['H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H']]

Policy without usable ace (S=stick, H=hit):
[['S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S']
 ['S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S']
 ['S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S']
 ['S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S']
 ['H' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S']
 ['H' 'S' 'S' 'S' 'S' 'S' 'H' 'H' 'S' 'S']
 ['H' 'H' 'S' 'S' 'S' 'S' 'H' 'H' 'H' 'H']
 ['H' 'S' 'S' 'S' 'S' 'S' 'H' 'H' 'H' 'H']
 ['H' 'H' 'S' 'S' 'S' 'S' 'H' 'H' 'H' 'H']
 ['H' 'S' 'H' 'S' 'H' 'S' 'H' 'H' 'H' 'H']]


In [125]:
#Metrics
episodes_eval = 500000
total_reward = 0
wins = 0
natural_wins = 0
losses = 0
draws = 0

for _ in range(episodes_eval):
    state, info = env.reset()
    done = False
    while not done:
        action = np.argmax(Q[state])
        state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

    total_reward += reward
    if reward == 1.0:
        wins += 1
    elif reward == 1.5:
        natural_wins += 1
        wins += 1
    elif reward == 0.0:
        draws += 1
    elif reward == -1.0:
        losses += 1

print(f"\nTotal accumulated reward: {total_reward}")
print(f"Win percentage: {wins / episodes_eval * 100:.2f}%")
print(f"Natural win percentage: {natural_wins / episodes_eval * 100:.2f}%")
print(f"Loss percentage: {losses / episodes_eval * 100:.2f}%")
print(f"Draw percentage: {draws / episodes_eval * 100:.2f}%")


Total accumulated reward: -15564.0
Win percentage: 42.88%
Natural win percentage: 4.16%
Loss percentage: 48.07%
Draw percentage: 9.05%


Now the performance has improved a lot (if we take into account that the optimal result will be around 50%), and the full printed policy kind of reminds of the optimal policy from Sutton & Barto. Again, the % of wins includes both normal wins and natural wins.

## Part 3. TD learning

The objective of this section is to estimate the optimal policy using TD learning methods. Specifically, you have to implement the **SARSA algorithm**.

<u>Questions</u> (**2.5 points**): 
1. Implement the algorithm.
2. Comment and justify all the parameters.
3. Print on the screen the optimal policy found for each state.
4. Using the trained agent, simulate 100,000 games and calculate the agent's return (total accumulated reward).
5. Additionally, calculate the % of wins, natural wins, losses and draws.

For the SARSA algorithm I will train with 500.000 episodes, same as before, as it looks like its working well. Discount Factor = 1 as in blackjack we care about the only reward at the end of each episode. For epsilon, i will start with a value of Epsilon = 0.1 and a Learning Rate = 0.01.

In [126]:
import numpy as np
import gymnasium as gym
from collections import defaultdict


def epsilon_greedy_policy(Q, state, nA, epsilon):
    probs = np.ones(nA) * epsilon / nA
    best_action = np.argmax(Q[state])
    probs[best_action] += 1.0 - epsilon
    return probs


def SARSA(env, episodes=500000, alpha=0.01, gamma=1.0, epsilon=0.1):

    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    for i_episode in range(episodes):
        state, _ = env.reset()
        probs = epsilon_greedy_policy(Q, state, env.action_space.n, epsilon)
        action = np.random.choice(np.arange(env.action_space.n), p=probs)
        done = False

        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            if not done:
                next_probs = epsilon_greedy_policy(Q, next_state, env.action_space.n, epsilon)
                next_action = np.random.choice(np.arange(env.action_space.n), p=next_probs)
                td_target = reward + gamma * Q[next_state][next_action]
            else:
                td_target = reward

            td_error = td_target - Q[state][action]
            Q[state][action] += alpha * td_error

            state = next_state
            if not done:
                action = next_action

    policy = {state: np.argmax(actions) for state, actions in Q.items()}
    return Q, policy

In [127]:
#Metrics
def evaluate_policy(env, policy, n_episodes=500000):
    total_reward = 0
    wins = 0
    natural_wins = 0
    losses = 0
    draws = 0

    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = policy.get(state, 1)  # default = Hit
            state, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
        total_reward += reward
        if reward == 1.0:
            wins += 1
        elif reward == 1.5:
            natural_wins += 1
            wins += 1
        elif reward == 0.0:
            draws += 1
        elif reward == -1.0:
            losses += 1

    print(f"\nEvaluation results over {n_episodes} episodes:")
    print(f"Total accumulated reward: {total_reward}")
    print(f"Win percentage: {wins / n_episodes * 100:.2f}%")
    print(f"Natural win percentage: {natural_wins / n_episodes * 100:.2f}%")
    print(f"Loss percentage: {losses / n_episodes * 100:.2f}%")
    print(f"Draw percentage: {draws / n_episodes * 100:.2f}%")

In [128]:
#Display policy

def print_policy2(policy):
    usable_ace = np.full((10, 10), " ")
    no_usable_ace = np.full((10, 10), " ")

    for player_sum in range(12, 22):
        for dealer_card in range(1, 11):
            for ace in [True, False]:
                state = (player_sum, dealer_card, ace)
                if state in Q:
                    action = np.argmax(Q[state])
                    symbol = "S" if action == 0 else "H"
                    if ace:
                        usable_ace[player_sum - 12, dealer_card - 1] = symbol
                    else:
                        no_usable_ace[player_sum - 12, dealer_card - 1] = symbol

    print("\nPolicy with usable ace (S=stick, H=hit):")
    print(np.flipud(usable_ace))
    print("\nPolicy without usable ace (S=stick, H=hit):")
    print(np.flipud(no_usable_ace))

In [129]:
#Training and printing policy
Q_sarsa, policy = SARSA(env, episodes=500000, alpha=0.01, gamma=1.0, epsilon=0.1)
print_policy2(policy)
evaluate_policy(env, policy)


Policy with usable ace (S=stick, H=hit):
[['S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S']
 ['S' 'S' 'S' 'S' 'H' 'S' 'S' 'S' 'S' 'S']
 ['S' 'S' 'S' 'S' 'S' 'H' 'S' 'S' 'S' 'S']
 ['H' 'H' 'H' 'H' 'S' 'H' 'S' 'H' 'H' 'H']
 ['H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H']
 ['H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H']
 ['H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H']
 ['H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H']
 ['H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H']
 ['H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'H']]

Policy without usable ace (S=stick, H=hit):
[['S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S']
 ['S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S']
 ['S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S']
 ['S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S']
 ['H' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S' 'S']
 ['H' 'S' 'S' 'S' 'S' 'S' 'H' 'H' 'S' 'S']
 ['H' 'H' 'S' 'S' 'S' 'S' 'H' 'H' 'H' 'H']
 ['H' 'S' 'S' 'S' 'S' 'S' 'H' 'H' 'H' 'H']
 ['H' 'H' 'S' 'S' 'S' 'S' 'H' 'H' 'H' 'H']
 ['H' 'S' 'H' 'S' 'H' 'S' 'H' 'H' 'H' 'H']]

Evaluation results over 500000 episodes:
Total acc

## Part 4. Comparison of the algorithms

In this section, we will make a comparison among the algorithms.

We will compare the performance of the algorithms when changing the number of episodes, the discount factor and the *learning rate* values (in the case of the SARSA method).

For each exercise, the results must be presented and justified.

**Note**: 
- It is recommended to run the simulations multiple times for each exercise, as these are random, and to comment on the most frequent result or the average of these.

### 4.1. Comparison to the optimal policy

The optimal policy for this problem, described by [Sutton & Barto](http://incompleteideas.net/book/the-book-2nd.html) is depicted in the following image:

<img src="./figs/optimal.png" style="width: 800px;" />

<u>Questions</u> (**1 point**): 
- Compare the _optimal_ policies of the naïve, Monte Carlo and SARSA methods to the optimal one provided by Sutton & Barto.
- Comment on the results and justify your answer. 

In [130]:

opt_usable_ace = np.array([
    ["H","H","H","H","H","H","H","H","H","H"],
    ["H","H","H","H","H","H","H","H","H","H"],
    ["H","H","H","H","H","H","H","H","H","H"],
    ["H","H","H","H","H","H","H","H","H","H"],
    ["H","H","H","H","H","H","H","H","H","H"],
    ["H","H","H","H","H","H","H","H","H","H"],
    ["H","S","S","S","S","S","S","S","H","H"],
    ["S","S","S","S","S","S","S","S","S","S"],
    ["S","S","S","S","S","S","S","S","S","S"],
    ["S","S","S","S","S","S","S","S","S","S"]
])

opt_no_usable_ace = np.array([
    ["H","H","H","S","S","S","H","H","H","H"],
    ["H","S","S","S","S","S","H","H","H","H"],
    ["H","S","S","S","S","S","H","H","H","H"],
    ["H","S","S","S","S","S","H","H","H","H"],
    ["H","S","S","S","S","S","H","H","H","H"],
    ["S","S","S","S","S","S","S","S","S","S"],
    ["S","S","S","S","S","S","S","S","S","S"],
    ["S","S","S","S","S","S","S","S","S","S"],
    ["S","S","S","S","S","S","S","S","S","S"],
    ["S","S","S","S","S","S","S","S","S","S"]
])

def convert_matrix(matrix, usable_ace):
    policy_dict = {}
    for i, player_row in enumerate(matrix):
        player_sum = i + 12
        for j, action in enumerate(player_row):
            dealer_card = j + 1
            policy_dict[(player_sum, dealer_card, usable_ace)] = 0 if action == 'S' else 1
    return policy_dict

optimal_policy = {}
optimal_policy.update(convert_matrix(opt_usable_ace, usable_ace=True))
optimal_policy.update(convert_matrix(opt_no_usable_ace, usable_ace=False))
optimal_policy


{(12, 1, True): 1,
 (12, 2, True): 1,
 (12, 3, True): 1,
 (12, 4, True): 1,
 (12, 5, True): 1,
 (12, 6, True): 1,
 (12, 7, True): 1,
 (12, 8, True): 1,
 (12, 9, True): 1,
 (12, 10, True): 1,
 (13, 1, True): 1,
 (13, 2, True): 1,
 (13, 3, True): 1,
 (13, 4, True): 1,
 (13, 5, True): 1,
 (13, 6, True): 1,
 (13, 7, True): 1,
 (13, 8, True): 1,
 (13, 9, True): 1,
 (13, 10, True): 1,
 (14, 1, True): 1,
 (14, 2, True): 1,
 (14, 3, True): 1,
 (14, 4, True): 1,
 (14, 5, True): 1,
 (14, 6, True): 1,
 (14, 7, True): 1,
 (14, 8, True): 1,
 (14, 9, True): 1,
 (14, 10, True): 1,
 (15, 1, True): 1,
 (15, 2, True): 1,
 (15, 3, True): 1,
 (15, 4, True): 1,
 (15, 5, True): 1,
 (15, 6, True): 1,
 (15, 7, True): 1,
 (15, 8, True): 1,
 (15, 9, True): 1,
 (15, 10, True): 1,
 (16, 1, True): 1,
 (16, 2, True): 1,
 (16, 3, True): 1,
 (16, 4, True): 1,
 (16, 5, True): 1,
 (16, 6, True): 1,
 (16, 7, True): 1,
 (16, 8, True): 1,
 (16, 9, True): 1,
 (16, 10, True): 1,
 (17, 1, True): 1,
 (17, 2, True): 1,
 (17, 3

In [131]:
import numpy as np

def policy_from_Q_or_map(Q_or_map):

    def fn(state):
        if state in Q_or_map:
            val = Q_or_map[state]
            if isinstance(val, (list, np.ndarray)):
                return int(np.argmax(val))
            else:
                return int(val)
        player_sum = state[0]
        return 0 if player_sum >= 20 else 1
    return fn

policy_mc_fn = policy_from_Q_or_map(Q_mc)
policy_sarsa_fn = policy_from_Q_or_map(Q_sarsa)
policy_naive_fn = lambda s: 0 if s[0] >= 20 else 1

def compare_to_reference(policy_fn, reference_map):

    states = []
    actions_learned = []
    actions_ref = []
    disagreements = []

    for player_sum in range(12, 22):
        for dealer_card in range(1, 11):
            for ace in [True, False]:
                s = (player_sum, dealer_card, ace)
                a_learned = policy_fn(s)
                a_ref = reference_map.get(s, None)
                if a_ref is None:
                    # if reference not provided for this state, skip or treat as unknown
                    continue
                states.append(s)
                actions_learned.append(a_learned)
                actions_ref.append(a_ref)
                if a_learned != a_ref:
                    disagreements.append(s)

    total = len(states)
    agree = sum(1 for i in range(len(states)) if actions_learned[i] == actions_ref[i])
    return {
        "total_states_compared": total,
        "agreement_count": agree,
        "agreement_pct": agree / total * 100.0,
        "disagreements": disagreements
    }


res_mc = compare_to_reference(policy_mc_fn, optimal_policy)
res_sarsa = compare_to_reference(policy_sarsa_fn, optimal_policy)
res_naive = compare_to_reference(policy_naive_fn, optimal_policy)


print("Number of agreements for Naive:", res_naive["agreement_count"])
print("Number of agreements for MC:", res_mc["agreement_count"])
print("Number of agreements for SARSA:", res_sarsa["agreement_count"])

print("Number of disagreements for Naive:", len(res_naive["disagreements"]))
print("Number of disagreements for MC:", len(res_mc["disagreements"]))
print("Number of disagreements for SARSA:", len(res_sarsa["disagreements"]))

print("MC agreement %:", res_mc["agreement_pct"])
print("SARSA agreement %:", res_sarsa["agreement_pct"])
print("Naive agreement %:", res_naive["agreement_pct"])

Number of agreements for Naive: 130
Number of agreements for MC: 186
Number of agreements for SARSA: 184
Number of disagreements for Naive: 70
Number of disagreements for MC: 14
Number of disagreements for SARSA: 16
MC agreement %: 93.0
SARSA agreement %: 92.0
Naive agreement %: 65.0


### 4.2. Influence of the Number of Episodes

Conduct a study by varying the number of episodes in each of the algorithms.

<u>Questions</u> (**1 point**): 
- Train each algorithm multiple times with 100,000, 1,000,000, and 5,000,000 episodes and average the results.
- Indicate how the **number of episodes** influences the convergence of each algorithm by calculating the number of states where the policy differs from the optimal one, as well as the average return obtained after playing 100,000 games following each training.

In [132]:
def evaluate_policy(policy_fn, n_games=100000):
    total_reward = 0
    wins = natural_wins = losses = draws = 0
    for i in range(n_games):
        state, info = env.reset()
        terminated = truncated = False
        while not (terminated or truncated):
            action = policy_fn(state)
            state, reward, terminated, truncated, info = env.step(action)

        total_reward += reward
        if reward == 1.0:
            wins += 1
        elif reward == 1.5:
            natural_wins += 1
            wins += 1
        elif reward == 0.0:
            draws += 1
        elif reward == -1.0:
            losses += 1

    avg_reward = total_reward / n_games
    return {
        "avg_reward": avg_reward,
        "win_pct": wins / n_games * 100,
        "natural_win_pct": natural_wins / n_games * 100,
        "loss_pct": losses / n_games * 100,
        "draw_pct": draws / n_games * 100
    }

In [133]:
episodes_list = [100000, 1000000, 5000000]
n_eval_games = 100000
discount_factor = 1.0
epsilon = 0.1
alpha = 0.01


for num_episodes in episodes_list:
    print(f"\nTraining with {num_episodes:,} episodes: ")

    
    Q_mc, policy = mc_control_on_policy_epsilon_greedy(env, num_episodes, discount_factor, epsilon)
    policy_mc_fn = policy_from_Q_or_map(Q_mc)

    
    Q_sarsa, policy = SARSA(env, num_episodes, alpha=alpha, gamma=discount_factor, epsilon=epsilon)
    policy_sarsa_fn = policy_from_Q_or_map(Q_sarsa)

    eval_mc = evaluate_policy(policy_mc_fn, n_eval_games)
    eval_sarsa = evaluate_policy(policy_sarsa_fn, n_eval_games)
    eval_naive = evaluate_policy(lambda s: 0 if s[0] >= 20 else 1, n_eval_games)

    res_mc = compare_to_reference(policy_mc_fn, optimal_policy)
    res_sarsa = compare_to_reference(policy_sarsa_fn, optimal_policy)
    res_naive = compare_to_reference(lambda s: 0 if s[0] >= 20 else 1, optimal_policy)

    print(f"\nMonte Carlo ({num_episodes:,} episodes):")
    print(f"  Avg reward: {eval_mc['avg_reward']:.3f}")
    print(f"  Win%: {eval_mc['win_pct']:.2f}, Natural win%: {eval_mc['natural_win_pct']:.2f}, Loss%: {eval_mc['loss_pct']:.2f}, Draw%: {eval_mc['draw_pct']:.2f}")
    print(f"  Agreement with optimal policy: {res_mc['agreement_pct']:.2f}%")

    print(f"\nSARSA ({num_episodes:,} episodes):")
    print(f"  Avg reward: {eval_sarsa['avg_reward']:.3f}")
    print(f"  Win%: {eval_sarsa['win_pct']:.2f}, Natural win%: {eval_sarsa['natural_win_pct']:.2f}, Loss%: {eval_sarsa['loss_pct']:.2f}, Draw%: {eval_sarsa['draw_pct']:.2f}")
    print(f"  Agreement with optimal policy: {res_sarsa['agreement_pct']:.2f}%")
    
    print(f"\nNaive Policy(ignore):")
    print(f"  Avg reward: {eval_naive['avg_reward']:.3f}")
    print(f"  Win%: {eval_naive['win_pct']:.2f}, Natural win%: {eval_naive['natural_win_pct']:.2f}, Loss%: {eval_naive['loss_pct']:.2f}, Draw%: {eval_naive['draw_pct']:.2f}")
    print(f"  Agreement with optimal policy: {res_naive['agreement_pct']:.2f}%")



Training with 100,000 episodes: 

Monte Carlo (100,000 episodes):
  Avg reward: -0.041
  Win%: 42.61, Natural win%: 4.19, Loss%: 48.80, Draw%: 8.59
  Agreement with optimal policy: 88.00%

SARSA (100,000 episodes):
  Avg reward: -0.036
  Win%: 42.73, Natural win%: 4.11, Loss%: 48.37, Draw%: 8.90
  Agreement with optimal policy: 90.00%

Naive Policy(ignore):
  Avg reward: -0.334
  Win%: 29.30, Natural win%: 4.18, Loss%: 64.79, Draw%: 5.91
  Agreement with optimal policy: 65.00%

Training with 1,000,000 episodes: 

Monte Carlo (1,000,000 episodes):
  Avg reward: -0.023
  Win%: 43.10, Natural win%: 4.22, Loss%: 47.50, Draw%: 9.40
  Agreement with optimal policy: 97.00%

SARSA (1,000,000 episodes):
  Avg reward: -0.031
  Win%: 42.80, Natural win%: 4.02, Loss%: 47.91, Draw%: 9.29
  Agreement with optimal policy: 97.50%

Naive Policy(ignore):
  Avg reward: -0.336
  Win%: 29.19, Natural win%: 4.18, Loss%: 64.92, Draw%: 5.89
  Agreement with optimal policy: 65.00%

Training with 5,000,000 epi

### 4.3. Influence of the Discount Factor

Conduct a study by varying the *discount factor* in each of the algorithms.

<u>Questions</u> (**1 point**):
- Run the algorithms with *discount factor* = 0.1, 0.5, 0.9 and the rest of the parameters the same as in previous exercises. 
- Describe the changes in the optimal policy, comparing the result obtained with the result of previous exercises (*discount factor* = 1).

In [134]:
discount_factors = [0.1, 0.5, 0.9]
num_episodes = 1000000
n_eval_games = 100000
epsilon = 0.1
alpha = 0.01


for gamma in discount_factors:
    print(f"\nTraining with discount factor γ = {gamma}: ")
    
    Q_mc, _ = mc_control_on_policy_epsilon_greedy(env, num_episodes, gamma, epsilon)
    policy_mc_fn = policy_from_Q_or_map(Q_mc)
    
    Q_sarsa, policy = SARSA(env, num_episodes, alpha=alpha, gamma=gamma, epsilon=epsilon)
    policy_sarsa_fn = policy_from_Q_or_map(Q_sarsa)

    eval_mc = evaluate_policy(policy_mc_fn, n_eval_games)
    eval_sarsa = evaluate_policy(policy_sarsa_fn, n_eval_games)

    res_mc = compare_to_reference(policy_mc_fn, optimal_policy)
    res_sarsa = compare_to_reference(policy_sarsa_fn, optimal_policy)

    print(f"\nMonte Carlo (γ = {gamma}):")
    print(f"  Avg reward: {eval_mc['avg_reward']:.3f}")
    print(f"  Win%: {eval_mc['win_pct']:.2f}, Natural win%: {eval_mc['natural_win_pct']:.2f}, Loss%: {eval_mc['loss_pct']:.2f}, Draw%: {eval_mc['draw_pct']:.2f}")
    print(f"  Agreement with optimal policy (γ=1.0): {res_mc['agreement_pct']:.2f}%")

    print(f"\nSARSA (γ = {gamma}):")
    print(f"  Avg reward: {eval_sarsa['avg_reward']:.3f}")
    print(f"  Win%: {eval_sarsa['win_pct']:.2f}, Natural win%: {eval_sarsa['natural_win_pct']:.2f}, Loss%: {eval_sarsa['loss_pct']:.2f}, Draw%: {eval_sarsa['draw_pct']:.2f}")
    print(f"  Agreement with optimal policy (γ=1.0): {res_sarsa['agreement_pct']:.2f}%")



Training with discount factor γ = 0.1: 

Monte Carlo (γ = 0.1):
  Avg reward: -0.031
  Win%: 42.93, Natural win%: 4.09, Loss%: 48.08, Draw%: 8.99
  Agreement with optimal policy (γ=1.0): 95.50%

SARSA (γ = 0.1):
  Avg reward: -0.030
  Win%: 42.80, Natural win%: 4.25, Loss%: 47.96, Draw%: 9.24
  Agreement with optimal policy (γ=1.0): 94.50%

Training with discount factor γ = 0.5: 

Monte Carlo (γ = 0.5):
  Avg reward: -0.034
  Win%: 42.80, Natural win%: 4.15, Loss%: 48.28, Draw%: 8.92
  Agreement with optimal policy (γ=1.0): 95.50%

SARSA (γ = 0.5):
  Avg reward: -0.030
  Win%: 43.01, Natural win%: 4.16, Loss%: 48.07, Draw%: 8.91
  Agreement with optimal policy (γ=1.0): 95.50%

Training with discount factor γ = 0.9: 

Monte Carlo (γ = 0.9):
  Avg reward: -0.025
  Win%: 43.24, Natural win%: 4.15, Loss%: 47.80, Draw%: 8.97
  Agreement with optimal policy (γ=1.0): 97.00%

SARSA (γ = 0.9):
  Avg reward: -0.025
  Win%: 43.11, Natural win%: 4.07, Loss%: 47.62, Draw%: 9.27
  Agreement with op

Here i expected the results to improve as the df incremented, but only the Monte Carlo algorithm improves, SARSA works a little bit worse as the df value goes up to 0.9.

### 4.4. Influence of the Learning Rate

Conduct a study by varying the learning rate in the *SARSA* algorithm.

<u>Questions</u> (**1 point**):
- Run the *SARSA* algorithm with the following *learning rate* values: 0.001, 0.01, 0.1, and 0.9.
- Analyze the differences with the results obtained previously in terms of the number of errors relative to the optimal policy and the accumulated reward for every 100,000 episodes played.

In [138]:
discount_factor = 1
num_episodes = 1000000
n_eval_games = 100000
epsilon = 0.1
alpha = [0.001, 0.01, 0.1, 0.9]


for lr in alpha:
    print(f"\nTraining with learning rate α = {lr}: ")

    Q_sarsa, policy = SARSA(env, num_episodes, alpha=lr, gamma=discount_factor, epsilon=epsilon)
    
    policy_sarsa_fn = policy_from_Q_or_map(Q_sarsa)


    eval_sarsa = evaluate_policy(policy_sarsa_fn, n_eval_games)

    res_sarsa = compare_to_reference(policy_sarsa_fn, optimal_policy)

    print(f"\nSARSA (Learning rate = {lr}):")
    print(f"  Avg reward: {eval_sarsa['avg_reward']:.3f}")
    print(f"  Win%: {eval_sarsa['win_pct']:.2f}, Natural win%: {eval_sarsa['natural_win_pct']:.2f}, Loss%: {eval_sarsa['loss_pct']:.2f}, Draw%: {eval_sarsa['draw_pct']:.2f}")
    print(f"  Agreement with optimal policy (γ=1.0): {res_sarsa['agreement_pct']:.2f}%")




Training with learning rate α = 0.001: 

SARSA (Learning rate = 0.001):
  Avg reward: -0.032
  Win%: 42.72, Natural win%: 4.14, Loss%: 47.97, Draw%: 9.30
  Agreement with optimal policy (γ=1.0): 97.00%

Training with learning rate α = 0.01: 

SARSA (Learning rate = 0.01):
  Avg reward: -0.031
  Win%: 42.68, Natural win%: 4.11, Loss%: 47.86, Draw%: 9.46
  Agreement with optimal policy (γ=1.0): 95.50%

Training with learning rate α = 0.1: 

SARSA (Learning rate = 0.1):
  Avg reward: -0.039
  Win%: 42.41, Natural win%: 4.24, Loss%: 48.44, Draw%: 9.15
  Agreement with optimal policy (γ=1.0): 87.00%

Training with learning rate α = 0.9: 

SARSA (Learning rate = 0.9):
  Avg reward: -0.060
  Win%: 41.41, Natural win%: 3.80, Loss%: 49.34, Draw%: 9.25
  Agreement with optimal policy (γ=1.0): 82.00%


The SARSA algorithm looks to be working worse and worse the bigger the learning rate is, which makes sense. I thought alpha = 0.001 would be too small but apparently, for 1000000 episodes it ends up converging, and it's the one that works the best. 