# Practical Activity 1 (**PRA1**)

## Evaluable Practical Exercise

<u>General considerations</u>:

- The proposed solution must not use methods, functions or parameters that are marked as **_deprecated_** and scheduled for removal in future versions.
- This activity must be carried out on a **strictly individual** basis. Any indication of copying will be penalized with a failure for all parties involved and the possible negative evaluation of the subject in its entirety.
- The student must indicate **all the sources** used to carry out the PRA. Otherwise, the student will be considered to have committed plagiarism and will be penalized with a failure and the possible negative evaluation of the subject in its entirety.

<u>Delivery format</u>:

- Some exercises may require several minutes of execution, so the delivery must be done in **Notebook format** and in **HTML format**, where the code, results and comments of each exercise can be seen. You can export the notebook to HTML from the menu File $\to$ Download as $\to$ HTML.
- There is a special type of cell to hold text. This type of cell will be very useful to answer the different theoretical questions posed throughout the activity. To switch a cell to this type, use the menu Cell $\to$ Cell Type $\to$ Markdown.

<div class="alert alert-block alert-info">
<strong>Name and surname: Victor Brao Ruiz </strong>
</div>

## Introduction

The Blackjack environment is part of Gymnasium's [Toy Text](https://gymnasium.farama.org/environments/toy_text/) environments. Blackjack is a card game where the goal is to beat the dealer by obtaining cards whose sum is closer to 21 than the dealer's, without going over 21.

The card values, as depicted in the figure below, are:
- Face cards (Jack, Queen, King) have a point value of **10**.
- Aces can either count as **11** (called a "usable ace") or **1**.
- Numerical cards (**2-10**) have a value equal to their number.

<img src="./figs/BlackJackCards.png" />
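
The card values above can be turned into a small scoring helper. This is a minimal sketch, not the environment's own code: the name `hand_value` and the encoding of cards as integers (ace = 1, face cards = 10) are assumptions made for illustration.

```python
def hand_value(cards):
    """Return (total, usable_ace) for a hand.

    Cards are integers: ace = 1, number cards = face value,
    Jack/Queen/King = 10 (an assumed encoding for this sketch).
    """
    total = sum(cards)
    # An ace is "usable" when counting one ace as 11 does not bust the hand.
    usable_ace = 1 in cards and total + 10 <= 21
    return (total + 10 if usable_ace else total), usable_ace


print(hand_value([1, 10]))     # (21, True)
print(hand_value([1, 5, 10]))  # (16, False)
print(hand_value([10, 9, 5]))  # (24, False)
```

At most one ace can count as 11, since two aces at 11 would already total 22.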

Game Dynamics:
1. The game starts with the dealer having one face up and one face down card, while the player has two face up cards. All cards are drawn from an infinite deck (i.e. with replacement).
2. The player has a total sum of cards. They can request additional cards (**hit**) until they decide to stop (**stick**) or exceed 21 (**bust**), which results in an immediate loss.
3. After the player decides to stick, the dealer reveals their face-down card and draws cards until their total is 17 or greater. If the dealer goes bust, the player wins.
4. If neither the player nor the dealer goes bust, the winner is whoever has a sum closer to 21.
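
Steps 3 and 4 can be sketched in plain Python. The helpers `best_total`, `play_dealer` and `settle`, and the `DECK` list, are hypothetical names introduced here; the sketch assumes the dealer counts an ace as 11 whenever that does not bust, and it ignores the natural-blackjack bonus.

```python
import random

# Infinite deck: every draw is independent (ace = 1, face cards = 10).
DECK = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]


def best_total(cards):
    """Hand total, counting one ace as 11 when that does not bust."""
    total = sum(cards)
    return total + 10 if 1 in cards and total + 10 <= 21 else total


def play_dealer(dealer_cards, rng):
    """Dealer draws until the total is 17 or greater (step 3)."""
    cards = list(dealer_cards)
    while best_total(cards) < 17:
        cards.append(rng.choice(DECK))
    return cards


def settle(player_total, dealer_total):
    """Return +1 (player wins), 0 (draw) or -1 (player loses)."""
    if player_total > 21:
        return -1  # player bust: immediate loss
    if dealer_total > 21:
        return 1   # dealer bust: player wins
    # Otherwise the sum closer to 21 wins (step 4).
    return (player_total > dealer_total) - (player_total < dealer_total)


rng = random.Random(0)
dealer = play_dealer([10, 5], rng)
print(dealer, settle(20, best_total(dealer)))
```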

Further information can be found at:
- Gymnasium [Blackjack](https://gymnasium.farama.org/environments/toy_text/blackjack/)

To initialize the environment, we will use `natural=True` to give an additional reward for starting with a natural blackjack, i.e. starting with an ace and a ten-valued card (sum is 21), as shown in the following piece of code:

In [None]:
import gymnasium as gym

env = gym.make('Blackjack-v1', natural=True, sab=False)

In [3]:
print("Action space is {} ".format(env.action_space))
print("Observation space is {} ".format(env.observation_space))

Action space is Discrete(2) 
Observation space is Tuple(Discrete(32), Discrete(11), Discrete(2)) 
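
As a reading aid for the spaces printed above: per the Gymnasium Blackjack documentation, an observation is the tuple (player's current sum, dealer's showing card, usable-ace flag), and the two actions are 0 (stick) and 1 (hit). The helper `describe_observation` and the constant names below are illustrative assumptions, not part of the library.

```python
STICK, HIT = 0, 1  # the two actions in Discrete(2)


def describe_observation(obs):
    """Unpack a Blackjack observation into named fields."""
    player_sum, dealer_card, usable_ace = obs
    return {
        "player_sum": int(player_sum),
        "dealer_card": int(dealer_card),  # 1 stands for an ace
        "usable_ace": bool(usable_ace),
    }


print(describe_observation((14, 10, 0)))
# {'player_sum': 14, 'dealer_card': 10, 'usable_ace': False}
```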


## Part 1. Naïve Policy

Implement an agent that carries out the following deterministic policy: 
- The agent will **stick** if it gets a score of 20 or 21.
- Otherwise, it will **hit**.

<u>Questions</u> (**1 point**): 
1. Using this agent, simulate 100,000 games and calculate the agent's return (total accumulated reward).
2. Additionally, calculate the % of wins, natural wins, losses and draws. 
3. Comment on the results.
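
One possible way to structure this exercise is sketched below. The names `naive_policy` and `evaluate` are hypothetical, and the outcome classification assumes that, with `natural=True` and `sab=False`, a natural win is the only outcome with a reward above 1. The function accepts any object that follows the Gymnasium `reset()`/`step()` interface.

```python
from collections import Counter

STICK, HIT = 0, 1


def naive_policy(obs):
    """Stick on 20 or 21, otherwise hit."""
    player_sum, _dealer_card, _usable_ace = obs
    return STICK if player_sum >= 20 else HIT


def evaluate(env, policy, n_games=100_000):
    """Play n_games and return (total return, outcome percentages)."""
    total_return = 0.0
    outcomes = Counter()
    for _ in range(n_games):
        obs, _info = env.reset()
        episode_return = 0.0
        done = False
        while not done:
            obs, reward, terminated, truncated, _info = env.step(policy(obs))
            episode_return += reward
            done = terminated or truncated
        total_return += episode_return
        # Assumption: a return above 1 can only come from a natural win.
        if episode_return > 1.0:
            outcomes["natural_win"] += 1
        elif episode_return > 0:
            outcomes["win"] += 1
        elif episode_return < 0:
            outcomes["loss"] += 1
        else:
            outcomes["draw"] += 1
    percentages = {k: 100.0 * v / n_games for k, v in outcomes.items()}
    return total_return, percentages
```

With the environment created earlier, a call such as `evaluate(env, naive_policy)` would return the accumulated reward and the percentage of each outcome.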

In [None]:
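obs, info = env.reset()
# The observation is a tuple: (player's current sum, dealer's showing card, usable ace flag).
player_sum, dealer_card, usable_ace = obs
print(f"Starting Observation: {obs}")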

Starting Observation: {obs}


In [8]:
# Creating an agent
from collections import defaultdict
import gymnasium as gym
import numpy as np

class BlackjackAgent:
    def __init__(
        self,
        env: gym.Env,
        learning_rate: float,
        discount_factor: float = 0.95,
    ):
        self.env = env
        self.lr = learning_rate
        self.discount_factor = discount_factor

    def get_action(self, obs):
        # Placeholder: action selection is not implemented yet.
        return None

    def update(self, *args, **kwargs):
        # Placeholder: the learning update is not implemented yet.
        return None


## Part 2. Monte Carlo method

The objective of this section is to estimate the optimal policy using Monte Carlo methods. Specifically, you can choose and implement one of the algorithms related to _Control using MC methods_ (with or without "exploring starts", either on-policy or off-policy).

<u>Questions</u> (**2.5 points**): 
1. Implement the selected algorithm and justify your choice.
2. Comment and justify all the parameters, such as:
   - Number of episodes
   - Discount factor
   - Etc.
3. Implement a function that prints on the screen the optimal policy found for each state (similar to the figure in Section 4.1).
4. Using the trained agent, simulate 100,000 games and calculate the agent's return (total accumulated reward).
5. Additionally, calculate the % of wins, natural wins, losses and draws.
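
As one of the admissible choices, the sketch below shows on-policy first-visit Monte Carlo control with an epsilon-greedy policy (that is, without exploring starts). It is a minimal illustration under stated assumptions, not a reference solution: the function name `mc_control`, the default values of `gamma`, `epsilon` and `seed`, and the tie-breaking rule (ties go to action 0) are all choices made here.

```python
import random
from collections import defaultdict


def mc_control(env, n_episodes, gamma=1.0, epsilon=0.1, seed=0):
    """On-policy first-visit MC control with an epsilon-greedy policy."""
    rng = random.Random(seed)
    n_actions = 2  # 0 = stick, 1 = hit
    q = defaultdict(lambda: [0.0] * n_actions)
    visits = defaultdict(lambda: [0] * n_actions)

    def act(state):
        # Explore with probability epsilon, otherwise act greedily.
        if rng.random() < epsilon:
            return rng.randrange(n_actions)
        values = q[state]
        return values.index(max(values))

    for _ in range(n_episodes):
        # Generate one episode with the current epsilon-greedy policy.
        episode = []
        state, _info = env.reset()
        done = False
        while not done:
            action = act(state)
            next_state, reward, terminated, truncated, _info = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            done = terminated or truncated

        # Record the first time step at which each (state, action) occurs.
        first_visit = {}
        for t, (s, a, _r) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        # Walk backwards, accumulating the discounted return G.
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            g = gamma * g + r
            if first_visit[(s, a)] == t:
                visits[s][a] += 1
                # Incremental mean of the returns observed for (s, a).
                q[s][a] += (g - q[s][a]) / visits[s][a]

    policy = {s: values.index(max(values)) for s, values in q.items()}
    return q, policy
```

The returned `policy` maps each visited state to its greedy action, so it can be evaluated without exploration afterwards.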

## Part 3. TD learning

The objective of this section is to estimate the optimal policy using TD learning methods. Specifically, you have to implement the **SARSA algorithm**.

<u>Questions</u> (**2.5 points**): 
1. Implement the algorithm.
2. Comment and justify all the parameters.
3. Print on the screen the optimal policy found for each state.
4. Using the trained agent, simulate 100,000 games and calculate the agent's return (total accumulated reward).
5. Additionally, calculate the % of wins, natural wins, losses and draws.
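
A tabular SARSA loop could look like the sketch below. The function name `sarsa` and the default values of `alpha`, `gamma`, `epsilon` and `seed` are assumptions for illustration; they are not tuned or justified here. The sketch also treats any episode end as terminal, which is a simplification.

```python
import random
from collections import defaultdict


def sarsa(env, n_episodes, alpha=0.01, gamma=1.0, epsilon=0.1, seed=0):
    """Tabular SARSA with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    n_actions = 2  # 0 = stick, 1 = hit
    q = defaultdict(lambda: [0.0] * n_actions)

    def act(state):
        # Explore with probability epsilon, otherwise act greedily.
        if rng.random() < epsilon:
            return rng.randrange(n_actions)
        values = q[state]
        return values.index(max(values))

    for _ in range(n_episodes):
        state, _info = env.reset()
        action = act(state)
        done = False
        while not done:
            next_state, reward, terminated, truncated, _info = env.step(action)
            done = terminated or truncated
            if done:
                # No bootstrapping once the episode has ended.
                next_action = None
                target = reward
            else:
                # On-policy: bootstrap from the action actually chosen next.
                next_action = act(next_state)
                target = reward + gamma * q[next_state][next_action]
            q[state][action] += alpha * (target - q[state][action])
            state, action = next_state, next_action

    policy = {s: values.index(max(values)) for s, values in q.items()}
    return q, policy
```

Unlike the Monte Carlo approach, each update uses the estimate of the next state-action pair rather than the full episode return.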

## Part 4. Comparison of the algorithms

In this section, we compare the algorithms: we study how their performance changes with the number of episodes, the discount factor and, for the SARSA method, the *learning rate*.

For each exercise, the results must be presented and justified.

**Note**: 
- Because the simulations are random, it is recommended to run them several times for each exercise and to comment on the most frequent result or on the average.

### 4.1. Comparison to the optimal policy

The optimal policy for this problem, described by [Sutton & Barto](http://incompleteideas.net/book/the-book-2nd.html) is depicted in the following image:

<img src="./figs/optimal.png" style="width: 800px;" />

<u>Questions</u> (**1 point**): 
- Compare the policies obtained with the naïve, Monte Carlo and SARSA methods to the optimal one provided by Sutton & Barto.
- Comment on the results and justify your answer. 
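
To compare policies visually, a text grid with the same axes as the figure can help. The sketch below assumes a policy stored as a dictionary from `(player_sum, dealer_card, usable_ace)` to an action (0 = stick, 1 = hit); the function name `render_policy` and the `?` marker for states missing from the dictionary are choices made for illustration.

```python
def render_policy(policy, usable_ace, default="?"):
    """Text grid: rows = player sum (21 down to 12), columns = dealer card."""
    symbols = {0: "S", 1: "H"}  # S = stick, H = hit
    dealer_cards = list(range(1, 11))  # 1 is the ace
    labels = ["A"] + [str(d) for d in dealer_cards[1:]]
    lines = ["sum  " + " ".join(f"{label:>2}" for label in labels)]
    for player_sum in range(21, 11, -1):
        cells = []
        for d in dealer_cards:
            action = policy.get((player_sum, d, int(usable_ace)))
            cells.append(symbols.get(action, default))
        lines.append(f"{player_sum:>3}  " + " ".join(f"{c:>2}" for c in cells))
    return "\n".join(lines)


# Toy policy with two known states, only to show the layout.
toy_policy = {(20, 10, 0): 0, (12, 1, 0): 1}
print(render_policy(toy_policy, usable_ace=False))
```

Calling it once with `usable_ace=True` and once with `usable_ace=False` gives the two panels to compare against the figure.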

### 4.2. Influence of the Number of Episodes

Conduct a study by varying the number of episodes in each of the algorithms.

<u>Questions</u> (**1 point**): 
- Train each algorithm multiple times with 100,000, 1,000,000, and 5,000,000 episodes and average the results.
- Indicate how the **number of episodes** influences the convergence of each algorithm by calculating the number of states where the policy differs from the optimal one, as well as the average return obtained after playing 100,000 games following each training.
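
Counting the states where a learned policy differs from a reference policy can be sketched as follows. Both policies are assumed to be dictionaries from state to action; the names `count_policy_differences` and `average`, and the rule that a state missing from the learned policy counts as a difference, are assumptions. The toy dictionaries are made up and do not reproduce the Sutton & Barto policy.

```python
def count_policy_differences(policy, reference):
    """Number of reference states where `policy` picks a different action.

    A state that is missing from `policy` counts as a difference.
    """
    return sum(
        1 for state, action in reference.items() if policy.get(state) != action
    )


def average(values):
    """Arithmetic mean, e.g. of a metric over repeated training runs."""
    values = list(values)
    return sum(values) / len(values)


# Toy data only, to show the call pattern.
reference = {(20, 10, 0): 0, (12, 2, 0): 1, (16, 10, 0): 1}
learned = {(20, 10, 0): 0, (12, 2, 0): 0}
print(count_policy_differences(learned, reference))  # 2
print(average([2, 4, 3]))                            # 3.0
```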

### 4.3. Influence of the Discount Factor

Conduct a study by varying the *discount factor* in each of the algorithms.

<u>Questions</u> (**1 point**):
- Run the algorithms with *discount factor* = 0.1, 0.5, 0.9 and the rest of the parameters the same as in previous exercises. 
- Describe the changes in the optimal policy, comparing the result obtained with the result of previous exercises (*discount factor* = 1).
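
To see what the discount factor does to the quantity being learned, the sketch below computes the discounted return at every step of an episode. The helper `discounted_returns` is a hypothetical name, and the reward sequence is a made-up example of a three-step episode whose only non-zero reward arrives at the end.

```python
def discounted_returns(rewards, gamma):
    """Return G_t for every time step t of one episode."""
    returns = []
    g = 0.0
    # Work backwards: G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


rewards = [0.0, 0.0, 1.0]  # two hits, then a win on the final step
for gamma in (0.1, 0.5, 0.9, 1.0):
    print(gamma, discounted_returns(rewards, gamma))
```

With a smaller discount factor, the return credited to early decisions in a long episode shrinks towards zero, while a decision that ends the episode immediately keeps its full reward.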

### 4.4. Influence of the Learning Rate

Conduct a study by varying the learning rate in the *SARSA* algorithm.

<u>Questions</u> (**1 point**):
- Run the *SARSA* algorithm with the following *learning rate* values: 0.001, 0.01, 0.1, and 0.9.
- Analyze the differences from the results obtained previously, in terms of the number of states where the policy deviates from the optimal one and the accumulated reward over 100,000 games played.
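
The role of the learning rate can be isolated by applying the same update rule to a fixed sequence of targets. In this sketch, `td_update` and `run_updates` are hypothetical helpers, and the targets are made-up values standing in for the SARSA target (reward plus discounted next estimate).

```python
def td_update(q_value, target, alpha):
    """One step of Q <- Q + alpha * (target - Q)."""
    return q_value + alpha * (target - q_value)


def run_updates(targets, alpha, q_value=0.0):
    """Apply td_update once per target and return the final estimate."""
    for target in targets:
        q_value = td_update(q_value, target, alpha)
    return q_value


targets = [1.0, -1.0, 1.0, 1.0, -1.0, 1.0]  # noisy outcomes of one state-action pair
for alpha in (0.001, 0.01, 0.1, 0.9):
    print(alpha, round(run_updates(targets, alpha), 4))
```

A small learning rate moves the estimate slowly and averages over many outcomes, whereas a large one mostly tracks the most recent target.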