<h1>Monte Carlo Control: Exploring Starts Using First Visit</h1>

This notebook requires the following Python modules:

In [1]:
import gymnasium as gym
import numpy as np
from collections import OrderedDict
import statistics as stat
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import matplotlib as mpl
import matplotlib.image as mpimg
from IPython.display import Image
import import_ipynb
import ipynb.fs.full.policyBlackjack as policy
import ipynb.fs.full.returnsStateActionPairBlackjack as returnsSA
import ipynb.fs.full.stateActionValueBlackjack as actionValue
import ipynb.fs.full.helpers as h


<h2>Blackjack Example</h2>


Monte Carlo control algorithms, such as Exploring Starts with first-visit updates, can be used to teach an agent how to play the game of Blackjack. This notebook contains a solution that approximates an optimal policy $\pi_*$ for the Blackjack problem described in Example 5.3 in the text by Sutton and Barto. [\[1\]](#references)

<h2>Environment: Blackjack</h2>

Each game (i.e. episode) begins as soon as the first four cards are dealt (two face-up agent cards, one face-down dealer card, and one face-up dealer card). Each episode proceeds and ends as described by the Blackjack environment in Gymnasium, the open-source reinforcement learning API maintained by the Farama Foundation. [\[2\]](#references) A few of the key components of the Blackjack environment are described below.

<h3>Environment Blackjack: State Space</h3>

Each state of a Blackjack game, represented by the random variable $S$, is a 3-tuple containing (1) the player's current score, obtained by summing the values of the cards in the player's hand, (2) the value of the card shown by the dealer, and (3) whether the player is holding a usable ace. [\[2\]](#references)

$$\large S\coloneqq\{(playerSum,\,\,dealerShowing,\,\,usableAce)\in\mathbb{Z}^{3}: (0\le playerSum \le 31)\land(0\le dealerShowing \le 10)\land (0\le usableAce\le 1)\}$$

Note: The state space (often referred to as the "observation space" in the Gymnasium documentation) includes states that are not reachable in any Blackjack game. For example, player sums of 0 or 1 and a dealer showing value of 0 are not possible under the rules of Blackjack. This was a design choice made by the Gym API developers (while the project was still maintained by OpenAI) in order to facilitate "easier indexing for table based algorithms". [\[3\]](#references)

In [2]:
# Define the state spaces for player sum, dealer showing card, and usable ace.
# Note: Gymnasium's observation space is Tuple(Discrete(32), Discrete(11), Discrete(2)),
# which contains observations that cannot occur in a game. This was a choice made by
# the OpenAI developers to facilitate easy indexing.
# See https://github.com/openai/gym/issues/1410 for more information.
# There are very few unreachable states, so we keep them in the Q_s_a table
# rather than filtering them out.
stateSpacePlayerSum = list(range(32))
stateSpaceDealerShows = list(range(1, 11))
stateSpaceUsableAce = [True, False]
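
As a rough check on the size of the tables these lists imply, the sketch below enumerates every state and every state-action pair with `itertools.product`. The lowercase names are illustrative stand-ins for the lists above (and for the action list defined later); they are not part of the notebook's helper modules.

```python
from itertools import product

# Illustrative copies of the notebook's lists (assumed values)
player_sums = list(range(32))      # 0..31, matching Discrete(32)
dealer_shows = list(range(1, 11))  # 1..10
usable_ace = [True, False]
actions = [0, 1]                   # 0 = stick, 1 = hit

# Every state is a 3-tuple; every table key is a (state, action) pair
states = list(product(player_sums, dealer_shows, usable_ace))
state_action_pairs = list(product(states, actions))

print(len(states))              # 640
print(len(state_action_pairs))  # 1280
print(state_action_pairs[0])    # ((0, 1, True), 0)
```

The ordering produced by `product` matches the key order visible in the dictionary outputs later in this notebook, which start at `((0, 1, True), 0)`.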

We will use these enumerations of state values later on. For now let's talk about the Action Space of the Blackjack environment.

<h3>Environment Blackjack: Action Space</h3>

Our agent may choose to either 'hit' or 'stick' at each time step $t$ during an episode. Each action taken by the agent is represented by an instance of the random variable $A$, which takes one of two values and can therefore be modeled as a Bernoulli random variable. [\[2\]](#references)<br>
$$\large A\coloneqq\{0, 1\},\quad a=0\text{ when the player 'sticks'},\,\,a=1\text{ when the player 'hits'}$$

Below we create a list that enumerates the action space for later use within the Exploring Starts algorithm.

In [3]:
# Define the action space (0='Stick', 1='Hit') according to the Gymnasium documentation
actionSpace = list(range(2))

We will use this enumeration of actions within the MC Exploring Starts First-Visit algorithm below. For now, let's talk about how the Blackjack environment manages episodes and time steps.

<h3>Environment Blackjack: Episodes and time steps</h3>

In the Blackjack environment within the Gymnasium API, an episode corresponds to what is commonly called a game in real-life Blackjack. Each episode consists of $T$ time steps {$t_{0}$, $t_{1}$, $\dots$, $t_{T-1}$}. The initial state $s_{t=0}$ is provided by the Blackjack environment each time the environment is reset. The MC Exploring Starts algorithm then randomly selects an initial action $a_{t=0}$, which together with the initial state forms the first state-action pair ($S_{0}=s_{0}$, $A_{0}=a_{0}$). Each episode has the following form:<br>
$$\large\text{Episode } \coloneqq \{ S_{0},\,\,A_{0},\,\,R_{1},\,\,S_{1},\,\,A_{1},\,\,R_{2},\,\,\dots,\,\,S_{T-1},\,\,A_{T-1},\,\,R_{T}\}$$
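
To make the indexing concrete, here is a small sketch of how one episode could be stored as two parallel lists. The states, actions and rewards are made-up values for illustration, not output from the environment.

```python
# Hypothetical episode: the agent hits twice, then sticks and wins.
# pairs[t] holds (S_t, A_t); rewards[t] holds R_{t+1}, the reward that followed.
pairs = [
    ((12, 10, False), 1),  # S_0, A_0 = hit
    ((15, 10, False), 1),  # S_1, A_1 = hit
    ((19, 10, False), 0),  # S_2, A_2 = stick
]
rewards = [0.0, 0.0, 1.0]  # R_1, R_2, R_3

T = len(pairs)  # number of time steps in the episode

# Flatten into the sequence S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T
sequence = []
for (state, action), reward in zip(pairs, rewards):
    sequence.extend([state, action, reward])

print(T)              # 3
print(len(sequence))  # 9
print(sequence[-1])   # 1.0
```

Keeping the pairs and the rewards at the same index means that `rewards[t]` is always the reward $R_{t+1}$ that followed `pairs[t]`.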

<h3>Environment Blackjack: Rewards and Returns</h3>

Reward calculations are carried out at the end of each episode when using the [Monte Carlo Exploring Starts First-Visit algorithm](#MonteCarloExploringStartsFirstVisit).  Each game (i.e. episode) concludes with assignment of a reward to the random variable $R_{T}$ according to the following five outcomes:

$$
\large R_T
= 
\begin{cases}
-1 & \text{dealer wins} \\
0 & \text{draw} \\
1 & \text{agent wins with a non-natural} \\
1 & \text{agent wins on a natural (if natural is set to False)} \\
1.5 & \text{agent wins on a natural (if natural is set to True)}
\end{cases}
$$
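
The cases above can be written as a small lookup. This is a sketch of the table only, not Gymnasium's implementation; the function name, its arguments and the outcome labels are hypothetical.

```python
def terminal_reward(outcome, won_on_natural=False, natural=False):
    """Map an episode outcome to R_T following the cases listed above.

    outcome: 'dealer', 'draw' or 'agent' (hypothetical labels).
    won_on_natural: True if the agent's winning hand is a natural.
    natural: mirrors the environment's `natural` flag.
    """
    if outcome == "dealer":
        return -1.0
    if outcome == "draw":
        return 0.0
    if outcome == "agent":
        # The 1.5 payout applies only when the flag is on and the win is a natural
        return 1.5 if (natural and won_on_natural) else 1.0
    raise ValueError(f"unknown outcome: {outcome!r}")

print(terminal_reward("dealer"))                                     # -1.0
print(terminal_reward("agent", won_on_natural=True, natural=False))  # 1.0
print(terminal_reward("agent", won_on_natural=True, natural=True))   # 1.5
```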

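Once an episode has ended and the terminal reward $R_{T}$ has been returned by the Blackjack environment, the return from each time step $t$ can be computed. The environment returns a reward of 0 on every non-terminal step, so the return depends on $R_{T}$ alone, where $\gamma$ is the discount factor:<br>
$$\large G_{t} = \gamma^{T-t-1}R_{T}\,\,\forall\,\,0\le\,\,t\,\,\lt\,\,T$$
With $\gamma=1$ this reduces to $G_{t}=R_{T}$ for every time step of the episode.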

The MC Exploring Starts First-Visit algorithm will use a returns function, $Returns(s_{t}, a_{t})$, to help us keep track of which state-action pairs accumulate rewards $R_{T}$ at the end of each episode.<br> 

$$\large Returns(s_{t}, a_{t}) \leftarrow\text{list of accumulated returns }G\text{ that resulted from choosing action }a_{t}\text{ when in state }s_{t}\text{ over all episodes of a single run.}$$
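
A minimal sketch of how the returns $G$ follow from a reward sequence, using the backward recursion $G\leftarrow\gamma G+R_{t+1}$ that appears in the pseudocode later in this notebook. The reward list and the function name are illustrative.

```python
def returns_from_rewards(rewards, gamma=1.0):
    """Return [G_0, ..., G_{T-1}] where rewards[t] is R_{t+1}."""
    G = 0.0
    returns = [0.0] * len(rewards)
    # Walk backwards so each G_t reuses the already computed G_{t+1}
    for t in range(len(rewards) - 1, -1, -1):
        G = gamma * G + rewards[t]
        returns[t] = G
    return returns

print(returns_from_rewards([0.0, 0.0, 1.0], gamma=1.0))  # [1.0, 1.0, 1.0]
print(returns_from_rewards([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

With a discount factor below 1, pairs visited earlier in the episode receive a smaller share of the final reward.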

The dictionary R_s_a is a member of the [returnsStateActionPairsBlackjack class](#returnsStateActionPairBlackjack.ipynb). It stores key={((playerSum, dealerShows, usableAce), action)} : value={$Returns(s, a)$} pairs and is used within the MC Exploring Starts First-Visit algorithm below.

In [4]:
# Instantiate and initialize a new temporary returns function R(S_t, A_t) with empty lists
rsaTemp = returnsSA.returnsStateActionPairsBlackjack(stateSpacePlayerSum=stateSpacePlayerSum, stateSpaceDealerShows=stateSpaceDealerShows, stateSpaceUsableAce=stateSpaceUsableAce, actionSpace=actionSpace)
rsaTemp.R_s_a.items()

odict_items([(((0, 1, True), 0), []), (((0, 1, True), 1), []), (((0, 1, False), 0), []), (((0, 1, False), 1), []), (((0, 2, True), 0), []), (((0, 2, True), 1), []), (((0, 2, False), 0), []), (((0, 2, False), 1), []), ...])

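Each list collects the returns $G$ observed after taking action $a\in A$ in state $s\in S$; the average of a list estimates the expected return for that state-action pair.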

In [5]:
del rsaTemp

<h2>Policy</h2>

Our agent's policy is denoted as $\pi$ and is used to represent the conditional probability mass function $f_{A|S}$ of actions over the states:<br>
$$\large \pi\coloneqq f_{A|S}$$

Our agent must, at time step $t$, choose an action $a\in A$ (either '0' to 'stick' or '1' to 'hit') to move from its current state $s\in S$ to its next state $s'\in S$.

We denote the conditional probability of our agent selecting action $a\in A$ given its current state $s\in S$ by the following notation:
$$\large \pi(a|s)=f_{A|S}(a|s)=P(A=a|S=s)$$

Recall from the definition of the action space above that $A$ takes only the values 0 and 1. The policy used here is deterministic and greedy with respect to the action-value function $Q$, so for each state it assigns probability 1 to the action with the larger action-value and probability 0 to the other:
$$\large \pi(a_{t}=0|s_{t})
=
\begin{cases}
1\quad\quad\quad \text{if } Q(s_{t}, a_{t}=0)\gt Q(s_{t}, a_{t}=1) \\
0\quad\quad\quad \text{otherwise}
\end{cases}
\qquad\qquad
\pi(a_{t}=1|s_{t}) = 1-\pi(a_{t}=0|s_{t})
$$
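
A sketch of this greedy rule over a dictionary keyed by (state, action), mirroring the key layout used throughout this notebook. The Q-values are made up, the function name is hypothetical, and ties are resolved in favor of 'hit', as in the update step of the notebook's `MCExploringStarts` function.

```python
def greedy_policy(Q, states):
    """Return {(state, action): probability} with all mass on the larger Q."""
    pi = {}
    for s in states:
        stick_is_better = Q[(s, 0)] > Q[(s, 1)]
        pi[(s, 0)] = 1 if stick_is_better else 0
        pi[(s, 1)] = 0 if stick_is_better else 1
    return pi

# Made-up action values for two states
Q = {
    ((20, 10, False), 0): 0.4, ((20, 10, False), 1): -0.8,
    ((12, 10, False), 0): -0.6, ((12, 10, False), 1): -0.3,
}
states = [(20, 10, False), (12, 10, False)]
pi = greedy_policy(Q, states)

print(pi[((20, 10, False), 0)])  # 1 (stick)
print(pi[((12, 10, False), 1)])  # 1 (hit)
```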


The dictionary pi is a member of the [policyBlackjack class](#policyBlackjack.ipynb). It stores key={((playerSum, dealerShows, usableAce), action)} : value={$\pi(a_{t}\,\,|\,\,s_{t})$} pairs and is used within the MC Exploring Starts First-Visit algorithm below.

In [6]:
# Initialize a temporary policy pi(s)
piTemp = policy.policyBlackjack(stateSpacePlayerSum=stateSpacePlayerSum, stateSpaceDealerShows=stateSpaceDealerShows, stateSpaceUsableAce=stateSpaceUsableAce, actionSpace=actionSpace)
piTemp.pi.items()

odict_items([(((0, 1, True), 0), 1), (((0, 1, True), 1), 0), (((0, 1, False), 0), 0), (((0, 1, False), 1), 0), (((0, 2, True), 0), 0), (((0, 2, True), 1), 0), (((0, 2, False), 0), 1), (((0, 2, False), 1), 0), ...])

In [7]:
del piTemp

<h2>Action Value Function</h2>

According to the Bellman equations, the action-value $q_\pi(s_{t}, a_{t})$ at time $t$ within an episode is the expected return from choosing action $a_{t}\in A$ in state $s_{t}\in S$ and following policy $\pi$ thereafter. The algorithm below maintains an estimate of this quantity, denoted $Q(s_{t}, a_{t})$. [\[4\]](#references)

$$\large
\begin{aligned}
q_\pi(s_t, a_t) &\coloneqq \mathbb{E} [G_t | S_t=s_t, A_t=a_t] \\
&= \mathbb{E} [R_{t+1} + \gamma G_{t+1} | S_t=s_t, A_t=a_t] \\
&= \mathbb{E} [R_{t+1} + \gamma v_{\pi}(S_{t+1})\,\,|\,\,S_{t}=s_{t},\,\,A_{t}=a_t] \\
&= \sum\limits_{s_{t+1}} \sum\limits_{r} p(s_{t+1}, r| s_{t}, a_{t})[r+\gamma v_\pi(s_{t+1})]
\end{aligned}
$$
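
As a worked instance of the final line, the sketch below evaluates the double sum for a single state-action pair under a tiny, made-up transition model. The probabilities, rewards and state values are hypothetical and are not taken from Blackjack.

```python
gamma = 1.0

# p[(next_state, reward)] = probability for one fixed (s_t, a_t); sums to 1
p = {
    ("win", 1.0): 0.25,
    ("lose", -1.0): 0.25,
    ("continue", 0.0): 0.5,
}
# Hypothetical state values v_pi; terminal states have value 0
v = {"win": 0.0, "lose": 0.0, "continue": 0.5}

# q(s_t, a_t) = sum over (s', r) of p(s', r | s_t, a_t) * (r + gamma * v(s'))
q = sum(prob * (r + gamma * v[s_next]) for (s_next, r), prob in p.items())

print(q)  # 0.25
```

Monte Carlo methods avoid this sum entirely: they do not need the transition probabilities and instead average sampled returns.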


The dictionary Q_s_a is a member of the [stateActionValueBlackjack class](#stateActionValueBlackjack.ipynb). It stores key={((playerSum, dealerShows, usableAce), action)} : value={$Q(s_{t}, a_{t})$} pairs and is used within the MC Exploring Starts First-Visit algorithm below.

In [8]:
# Instantiate and initialize a new temporary action value function Q(S_t, A_t)
qsaTemp = actionValue.stateActionValueBlackjack(stateSpacePlayerSum=stateSpacePlayerSum, stateSpaceDealerShows=stateSpaceDealerShows, stateSpaceUsableAce=stateSpaceUsableAce, actionSpace=actionSpace)
qsaTemp.Q_s_a.items()

odict_items([(((0, 1, True), 0), 0.0), (((0, 1, True), 1), 0.0), (((0, 1, False), 0), 0.0), (((0, 1, False), 1), 0.0), (((0, 2, True), 0), 0.0), (((0, 2, True), 1), 0.0), (((0, 2, False), 0), 0.0), (((0, 2, False), 1), 0.0), ...])

In [9]:
del qsaTemp

<a id="MonteCarloExploringStartsFirstVisit"></a>
<h2>Monte Carlo Exploring Starts First-Visit Algorithm Pseudocode</h2>

$\large\quad\quad\text{Monte Carlo Exploring Starts Algorithm: MCExploringStarts}$<br>

$\large\quad\quad\text{Inputs:}\,\,firstVisit,\,\,numEpisodes,\,\,environment,\,\,discountFactor$<br>

$\large\quad\quad\text{ 1.}\quad\pi(a|s)\leftarrow\text{for all }s\in S,\text{ randomly assign probability 1 to one of }\pi(a=0|s),\,\pi(a=1|s)\text{ and 0 to the other}$<br>
$\large\quad\quad\text{ 2.}\quad Q(s, a)\leftarrow\text{Arbitrary initial action-value for all state-action pairs }(s\in\,S,a\in\,A)$<br>
$\large\quad\quad\text{ 3.}\quad R(s, a)\leftarrow\text{Empty list of returns for each state-action pair }(S=s, A=a)$<br>
$\large\quad\quad\text{ 4.}\quad\text{Loop forever (for each episode)}$:<br>
$\large\quad\quad\text{ 5.}\quad\quad\text{Choose }S_{0}\in S, A_{0}\in A(S_{0})\text{ randomly such that all pairs have probability > 0}$<br>
$\large\quad\quad\text{ 6.}\quad\quad\text{Generate an episode from }S_{0}, A_{0},\text{ following }\pi\colon S_{0}, A_{0}, R_{1},..., S_{T-1}, A_{T-1}, R_{T}$<br>
$\large\quad\quad\text{ 7.}\quad\quad\,G\leftarrow\,0$<br>
$\large\quad\quad\text{ 8.}\quad\quad\text{Loop for each step of episode, }t = T-1,\,T-2,\,\dots,\,0$<br>
$\large\quad\quad\text{ 9.}\quad\quad\quad\,\,\,G\leftarrow\,\gamma G+R_{t+1}$<br>
$\large\quad\quad\text{10.}\quad\quad\quad\text{ Unless the pair }S_{t}, A_{t}\text{ appears in } S_{0}, A_{0}, S_{1}, A_{1}, ..., S_{t-1}, A_{t-1}\colon$<br>
$\large\quad\quad\text{11.}\quad\quad\quad\quad\text{ Append }G\text{ to } R(S_{t}, A_{t})$<br>
$\large\quad\quad\text{12.}\quad\quad\quad\quad\,Q(S_{t}, A_{t})\leftarrow\text{average}(R(S_{t}, A_{t}))$<br>
$\large\quad\quad\text{13.}\quad\quad\quad\quad\pi(a=0|S_{t})\leftarrow\,\,1\text{ when }Q(S_{t}, a=0) \gt Q(S_{t}, a=1)\text{ and 0 otherwise};\,\,\pi(a=1|S_{t})\leftarrow 1-\pi(a=0|S_{t})$<br>
$\large\quad\quad\text{14.}\quad\text{Return }\pi,\,\,Q(S, A)$<br>
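
Step 10 is the first-visit test. The sketch below applies it to a made-up episode in which one pair repeats. In Blackjack itself, as Sutton and Barto note for this example, a state does not recur within an episode, so the test matters mainly for other environments. The function and variable names are illustrative.

```python
def first_visit_indices(pairs):
    """Indices t where pairs[t] does not appear in pairs[:t]."""
    seen = set()
    first = []
    for t, pair in enumerate(pairs):
        if pair not in seen:
            seen.add(pair)
            first.append(t)
    return first

# Made-up episode: the pair ("s1", 0) occurs at t=0 and again at t=2
pairs = [("s1", 0), ("s2", 1), ("s1", 0), ("s3", 1)]

print(first_visit_indices(pairs))  # [0, 1, 3]
```

Tracking visited pairs in a set gives a constant-time membership check, whereas testing `pairs[t] in pairs[:t]` rescans the start of the episode at every step. For Blackjack's short episodes either approach is fast enough.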

In [10]:
def MCExploringStarts(env, firstVisit=True, numEpisodes=1, discountFactor=1.0):
    # Note: pi, Q_s_a and R_s_a are notebook-level objects created in the experiment cell below
    # 4. Run numEpisodes episodes (instead of looping forever as described in the pseudocode above)
    for _ in range(numEpisodes):
        # Track the (state, action) pairs visited during the episode and the reward that followed each pair
        episodeStateActionPairs = []
        episodeRewards = []

        # 5. Reset the environment and generate a random initial state S_0
        observation, info = env.reset(seed=int(datetime.now().timestamp() * 1000000))
        # 5 continued: Choose a random action A_0
        action = env.action_space.sample()
        # Set the done flag to false
        done = False
        # 6. Generate an episode from S_0 and A_0 following policy pi
        while not done:
            # Record the pair (S_t, A_t)
            episodeStateActionPairs.append((observation, action))
            # Step to the next state by performing the action; the reward returned is R_{t+1}
            observation, reward, terminated, truncated, info = env.step(action)
            episodeRewards.append(reward)
            done = terminated or truncated
            if not done:
                # Choose the next action from policy pi based upon the new observation
                action = 0 if pi.pi[(observation, 0)] > pi.pi[(observation, 1)] else 1
        # 7. Set the return for the current episode to zero
        G = 0
        # Set T to the number of time steps in the episode, which matches
        # the episode sequence {S_0, A_0, R_1, S_1, A_1, R_2, ..., S_T-1, A_T-1, R_T}
        T = len(episodeStateActionPairs)
        # 8. Loop for each step of episode, t=T-1, T-2, ..., 0
        for t in range(T - 1, -1, -1):
            state, actionTaken = episodeStateActionPairs[t]
            # 9. Compute G_t = gamma * G_{t+1} + R_{t+1}; episodeRewards[t] holds R_{t+1}
            G = discountFactor * G + episodeRewards[t]
            # 10. When firstVisit is True, skip the update if the pair S_t, A_t
            # already appears in S_0, A_0, S_1, A_1, ... S_t-1, A_t-1
            if not firstVisit or episodeStateActionPairs[t] not in episodeStateActionPairs[:t]:
                # 11. Append G to returns R(S_t, A_t)
                R_s_a.R_s_a[(state, actionTaken)].append(G)
                # 12. Set Q(S_t, A_t) to the average of Returns(S_t, A_t)
                Q_s_a.Q_s_a[(state, actionTaken)] = sum(R_s_a.R_s_a[(state, actionTaken)]) / len(R_s_a.R_s_a[(state, actionTaken)])
                # 13. Update the policy using argmax_a Q(S_t, a); ties go to 'hit'
                if Q_s_a.Q_s_a[(state, 0)] > Q_s_a.Q_s_a[(state, 1)]:
                    pi.pi[(state, 0)] = 1
                    pi.pi[(state, 1)] = 0
                else:
                    pi.pi[(state, 0)] = 0
                    pi.pi[(state, 1)] = 1
    return pi, Q_s_a
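
Step 12 recomputes the mean of a growing list on every update. An equivalent alternative, shown here as a sketch rather than as this notebook's method, is the incremental mean $Q\leftarrow Q+(G-Q)/N$, which needs only a visit count per state-action pair. The class name and the sample returns are hypothetical.

```python
class RunningMean:
    """Incremental average: tracks the mean without storing the values."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # mean_N = mean_{N-1} + (value - mean_{N-1}) / N
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

returns = [1.0, -1.0, 0.0, 1.0]  # made-up returns for one state-action pair
rm = RunningMean()
for G in returns:
    rm.update(G)

print(rm.mean)                      # 0.25
print(sum(returns) / len(returns))  # 0.25
```

The trade-off is memory versus inspectability: the list-based version keeps every return available for later analysis, while the running mean uses constant space.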
    

In [11]:
# Set the discount factor (aka "gamma") used when computing returns
discountFactor = 1.0

# Set the number of runs of the MC exploring starts first visit
numRuns = 1

#episodes = [1, 10, 50, 100, 200, 500, 1000, 3000, 5000, 10000]
episodes = [1000]

In [12]:
# Run the experiment
for i in range(len(episodes)):
    # Repeat the experiment numRuns times
    for runNum in range(numRuns):
        # Set the environment 
        env = gym.make('Blackjack-v1', natural=False, sab=False)
        # Use the RecordEpisodeStatistics wrapper to track rewards and episode lengths
        env = gym.wrappers.RecordEpisodeStatistics(env, deque_size=episodes[i])
        # 1. Instantiate and initialize a new policy pi(a|s)
        pi = policy.policyBlackjack(stateSpacePlayerSum=stateSpacePlayerSum, stateSpaceDealerShows=stateSpaceDealerShows, stateSpaceUsableAce=stateSpaceUsableAce, actionSpace=actionSpace)
        # 2. Instantiate and initialize a new action value function Q(S_t, A_t)
        Q_s_a = actionValue.stateActionValueBlackjack(stateSpacePlayerSum=stateSpacePlayerSum, stateSpaceDealerShows=stateSpaceDealerShows, stateSpaceUsableAce=stateSpaceUsableAce, actionSpace=actionSpace)
        # 3. Instantiate and initialize a new returns function R(S_t, A_t) with empty lists
        R_s_a = returnsSA.returnsStateActionPairsBlackjack(stateSpacePlayerSum=stateSpacePlayerSum, stateSpaceDealerShows=stateSpaceDealerShows, stateSpaceUsableAce=stateSpaceUsableAce, actionSpace=actionSpace)
        # 4. Run the algorithm for episodes[i] episodes (instead of looping forever)
        policyResult, actionValueResult = MCExploringStarts(env=env, firstVisit=True, numEpisodes=episodes[i], discountFactor=discountFactor)
        # Set up the grids for plots of action value and policy when usable Ace is available
        value_grid, policy_grid = h.create_grids(pi=policyResult, Q_s_a=actionValueResult, usable_ace=True)
        # Format a string for the title of the plot when there is a usable Ace available
        title = "MC Exploring Starts First-Visit: #Episodes=" + str(episodes[i]) + " , \u03B3=" + str(discountFactor) + ", Run# " + str(runNum+1) + ", usableAce=T"
        h.create_plots(value_grid=value_grid, policy_grid=policy_grid, title=title, numEpisodes=episodes[i], runNum=runNum, discountFactor=discountFactor, firstVisit=True, usableAce=True)
        # Set up the grids for plots of action value and policy when usable Ace is not available
        value_grid, policy_grid = h.create_grids(pi=policyResult, Q_s_a=actionValueResult, usable_ace=False)
        # Format a string for the title of the plot when a usable Ace is not available
        title = "MC Exploring Starts First-Visit: #Episodes=" + str(episodes[i]) + " , \u03B3=" + str(discountFactor) + ", Run# " + str(runNum+1) + ", usableAce=F"
        h.create_plots(value_grid=value_grid, policy_grid=policy_grid, title=title, numEpisodes=episodes[i], runNum=runNum, discountFactor=discountFactor, firstVisit=True, usableAce=False)
        # Close the environment to clean up resources used by the environment
        env.close()

<a id="references"></a>
<h2>References</h2>

1. Sutton R.S., Barto A.G., "Reinforcement Learning, An Introduction", 2nd ed, MIT Press, Cambridge MA, 2018, p. 99.

2. https://gymnasium.farama.org/environments/toy_text/blackjack/

3. https://github.com/openai/gym/issues/1410

4. Sutton R.S., Barto A.G., "Reinforcement Learning, An Introduction", 2nd ed, MIT Press, Cambridge MA, 2018, p. 78. 