<a id="MonteCarloPrediction_StateValue_OffPolicy_Ordinary_ImportanceSampling_FirstVisit_Blackjack"></a>
<h1>Monte Carlo Prediction of State Value Function: Using Off Policy Ordinary Importance Sampling First Visit for Blackjack</h1>

Monte Carlo prediction algorithms, such as Off Policy Ordinary Importance Sampling First Visit, can be used to estimate the state value and action-value functions, $V(S)$ and $Q(S, A)$, described by the Bellman equations.[\[1\]](#references)  This notebook contains functions that compute estimates $V(S_{t})\approx v_{\pi}(s_{t})$ and $Q(S_{t}, A_{t})\approx q_{\pi}(s_{t}, a_{t})$ using the [Off Policy Ordinary Importance Sampling First Visit](#MonteCarloPrediction_StateValue_OffPolicy_Ordinary_ImportanceSampling_FirstVisit_Blackjack) algorithm for the Blackjack problem described in Example 5.3 of the text by Sutton and Barto. [\[2\]](#references)

This notebook will require the following python modules:

In [1]:
import gymnasium as gym
import numpy as np
from collections import OrderedDict
import statistics as stat
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import matplotlib as mpl
import matplotlib.image as mpimg
from IPython.display import Image
import import_ipynb
import ipynb.fs.full.policyBlackjack as tPolicy
import ipynb.fs.full.targetPolicyBlackjack as tPolicy
import ipynb.fs.full.behaviorPolicyBlackjack as bPolicy
import ipynb.fs.full.returnsStateBlackjack as returnsS
import ipynb.fs.full.stateActionValueBlackjack as actionValue
import ipynb.fs.full.stateValueBlackjack as stateValue
import ipynb.fs.full.helpers_MonteCarloPrediction_StateValue_OffPolicy_Ordinary_ImportanceSampling_FirstVisit_Blackjack as h
import pickle

<h2>Environment: Blackjack</h2>

Each game (i.e. episode) begins as soon as the first four cards are dealt (two face-up agent cards, one dealer card face-down, and one dealer card face-up). Each episode proceeds and ends as described by the Blackjack environment within the open source reinforcement learning API, Gymnasium, provided by the Farama Foundation. [\[3\]](#references) A few of the key components of the Blackjack environment are described below.

<h3>Environment Blackjack: State Space</h3>

Each state of a Blackjack game, represented by the random variable $S$,  consists of a 3-tuple containing 1.) the player's current score via summing the values of the cards in the player's hand, 2.) the value of the card shown by the dealer and 3.) whether the player is holding a usable ace. [\[2\]](#references)

$$\large S\coloneqq\{(playerSum,\,\,dealerShowing,\,\,usableAce)\in\mathbb{Z}^{3}: (0\le playerSum \le 31)\land(0\le dealerShowing \le 10)\land (0\le usableAce\le 1)\}$$

Note: The state space (often referred to as the "observation space" within the Gymnasium documentation) includes states that are not reachable in any Blackjack game. More specifically, player sums of 0 or 1 and dealer showing values of 0 are not possible given the constraints of the environment (i.e. the game of Blackjack). This was a design choice made by the Gym API developers (while the project was still maintained by OpenAI) in order to facilitate "easier indexing for table based algorithms". [\[4\]](#references)

In [2]:
# Define the state spaces for player sum, dealer showing card, and usable ace
# Note: Gymnasium's observation space is a Tuple(Discrete(32), Discrete(11), Discrete(2)),
# which contains observations that cannot occur in a game. This was a choice made by
# the OpenAI developers to facilitate easy indexing. See https://github.com/openai/gym/issues/1410
# for more information. There are very few such observations, so we continue to index the
# value tables according to the observation space assumed by Gymnasium.
stateSpacePlayerSum = [i for i in range(32)]
stateSpaceDealerShows = [i for i in range(1, 11)]
stateSpaceUsableAce = [True, False]
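
As a quick sanity check on the size of this tabular state space, the three lists can be combined with `itertools.product`. This is only an illustrative sketch: the name `allStates` is introduced here as an assumption and is not part of the notebook's helper classes.

```python
from itertools import product

# Same enumerations as in the cell above
stateSpacePlayerSum = [i for i in range(32)]
stateSpaceDealerShows = [i for i in range(1, 11)]
stateSpaceUsableAce = [True, False]

# Hypothetical helper list: every (playerSum, dealerShows, usableAce) tuple
allStates = list(product(stateSpacePlayerSum, stateSpaceDealerShows, stateSpaceUsableAce))

print(len(allStates))  # 32 * 10 * 2 = 640 table entries
print(allStates[:3])   # the first three state tuples
```

Most of these entries are never visited during play, which is the price paid for simple indexing.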

We will use these enumerations of state values later on. For now let's talk about the Action Space of the Blackjack environment.

<h3>Environment Blackjack: Action Space</h3>

Our agent may choose to either 'hit' or 'stick' at each time step $t$ during an episode. Each action taken by the agent is represented by an instance of the random variable $A$, which takes one of two values and therefore follows a Bernoulli distribution. [\[2\]](#references)<br>
$$\large A\coloneqq\{a\in\{0, 1\} : a=0\text{ when player 'sticks'},\,\,a=1\text{ when player 'hits'}\}$$

Below we create a list that enumerates the action space for use later on within the algorithm.

In [3]:
# Define the action space (0='Stick', 1='Hit') according to the Gymnasium documentation
actionSpace = [i for i in range(2)]

We will use this enumeration of actions within the Monte Carlo Prediction Off Policy Ordinary Importance Sampling First Visit algorithm below. For now let's talk about how the Blackjack environment manages episodes and time steps.

<h3>Environment Blackjack: Episodes and Time Steps</h3>

In the Blackjack environment within the Gymnasium API, an episode corresponds to what is commonly called a game in a real-life Blackjack scenario. Each episode consists of $T$ time steps {$t_{0}$, $t_{1}$, $\dots$, $t_{T-1}$}. The initial state $s_{t=0}$ is provided by the Blackjack environment each time the environment is reset. The Monte Carlo Prediction State Value Off Policy Ordinary Importance Sampling First Visit algorithm then samples an action $a_{t=0}$ from the behavior policy, which together with the initial state forms the state-action pair ($S_{t}=s_{t}$, $A_{t}=a_{t}$). Each episode has the following form:<br>
$$\large\text{Episode } \coloneqq \{ S_{0},\,\,A_{0},\,\,R_{1},\,\,S_{1},\,\,A_{1},\,\,R_{2},\,\,\dots,\,\,S_{T-1},\,\,A_{T-1},\,\,R_{T}\}$$

<h3>Environment Blackjack: Rewards and Returns</h3>

Reward calculations are carried out at the end of each episode when using the Monte Carlo Prediction Off Policy Ordinary Importance Sampling First Visit algorithm.  Each game (i.e. episode) concludes with assignment of a reward to the random variable $R_{T}$ according to the following five outcomes:

$$
\large R_T
= 
\begin{cases}
-1\quad\text{dealer wins} \\
0\quad\quad\text{draw} \\
1\quad\quad\text{agent wins with non natural} \\
1\quad\quad\text{agent wins on natural (if natural is set to False)} \\
1.5\quad\text{agent wins on natural (if natural is set to True)}
\end{cases}
$$

All rewards before the end of a Blackjack episode are zero. Once an episode has ended and the reward $R_{T}$ has been returned from the Blackjack environment, the undiscounted ($\gamma=1$) return from every time step $t$ of the episode is therefore equal to $R_{T}$:<br>
$$\large G_{t} = R_{T}\,\,\forall\,\,0\le\,\,t\,\,\lt\,\,T$$

The Monte Carlo Prediction Off Policy Ordinary Importance Sampling First Visit algorithm will use a returns function, $Returns(s_{t})$, to keep track of the returns that each visited state accumulates at the end of each episode.<br>

$$\large Returns(s_{t}) \leftarrow\text{list of accumulated returns }G\text{ that resulted from being in state }s_{t}\text{ over all episodes of a single run.}$$

The dictionary R_s is a member of the [returnsStateBlackjack class](#returnsStateBlackjack.ipynb) and uses key={(playerSum, dealerShows, usableAce)} : value={$Returns(s)$} pairs. It will be used later within the Monte Carlo Prediction State Value Off Policy Ordinary Importance Sampling First Visit algorithm below.

In [4]:
# Instantiate and initialize a new temporary returns function R(S_t) with empty lists
rsTemp = returnsS.returnsStateBlackjack(stateSpacePlayerSum=stateSpacePlayerSum, stateSpaceDealerShows=stateSpaceDealerShows, stateSpaceUsableAce=stateSpaceUsableAce)
rsTemp.R_s.items()

odict_items([((0, 1, True), []), ((0, 1, False), []), ((0, 2, True), []), ((0, 2, False), []), ((0, 3, True), []), ((0, 3, False), []), ((0, 4, True), []), ((0, 4, False), []), ((0, 5, True), []), ((0, 5, False), []), ((0, 6, True), []), ((0, 6, False), []), ((0, 7, True), []), ((0, 7, False), []), ((0, 8, True), []), ((0, 8, False), []), ((0, 9, True), []), ((0, 9, False), []), ((0, 10, True), []), ((0, 10, False), []), ((1, 1, True), []), ((1, 1, False), []), ((1, 2, True), []), ((1, 2, False), []), ((1, 3, True), []), ((1, 3, False), []), ((1, 4, True), []), ((1, 4, False), []), ((1, 5, True), []), ((1, 5, False), []), ((1, 6, True), []), ((1, 6, False), []), ((1, 7, True), []), ((1, 7, False), []), ((1, 8, True), []), ((1, 8, False), []), ((1, 9, True), []), ((1, 9, False), []), ((1, 10, True), []), ((1, 10, False), []), ((2, 1, True), []), ((2, 1, False), []), ((2, 2, True), []), ((2, 2, False), []), ((2, 3, True), []), ((2, 3, False), []), ((2, 4, True), []), ((2, 4, False), []),

In [5]:
del rsTemp
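
The bookkeeping described in this section can be sketched without the notebook's helper classes. The snippet below is an illustration under stated assumptions: `compute_returns` and `returns_by_state` are hypothetical names, and the three-step episode is made up. It computes $G_{t}=R_{t+1}+\gamma G_{t+1}$ backwards through one episode and appends each return to the list kept for the visited state.

```python
from collections import defaultdict

def compute_returns(rewards, discount_factor=1.0):
    """Return [G_0, ..., G_{T-1}], where rewards[t] holds R_{t+1}."""
    returns = [0.0] * len(rewards)
    G = 0.0
    # Work backwards so that G_{t+1} is available when G_t is computed
    for t in range(len(rewards) - 1, -1, -1):
        G = rewards[t] + discount_factor * G
        returns[t] = G
    return returns

# Hypothetical episode: two hits followed by a winning stick
states = [(12, 10, False), (15, 10, False), (19, 10, False)]
rewards = [0.0, 0.0, 1.0]

# Hypothetical stand-in for R_s: state -> list of returns
returns_by_state = defaultdict(list)
for s, G in zip(states, compute_returns(rewards)):
    returns_by_state[s].append(G)

print(dict(returns_by_state))
```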

<h2>Target Policy</h2>

Our agent's target policy is denoted as $\pi$ and is used to represent the conditional probability mass function $f_{A|S}$ of actions over the states:<br>
$$\large \pi\coloneqq f_{A|S}$$

We denote the conditional probability of our agent selecting action $a\in A$ given its current state $s\in S$ by the following notation:
$$\large \pi(a|s)=f_{A|S}(a|s)=P(A=a|S=s)$$

Recall from the definition of the Action Space $A$ above that $A$ follows a Bernoulli distribution. For a target policy that is greedy with respect to the action value estimates $Q$, the probability of sticking in state $s_{t}$ is:
$$\large \pi(a_{t}=0|s_{t})
=
\begin{cases}
0\quad\quad\quad \text{if } Q(s_{t}, a_{t}=0)\lt Q(s_{t}, a_{t}=1) \\
1\quad\quad\quad \text{if }Q(s_{t}, a_{t}=0)\gt Q(s_{t}, a_{t}=1)
\end{cases}
$$


The dictionary pi is a member of the [policyBlackjack class](#policyBlackjack.ipynb) and uses the key={((playerSum, dealerShows, usableAce), action)} : value={$\pi(a_{t}\,\,|\,\,s_{t})$} pairs and will be used later within the Monte Carlo Prediction State Value Off Policy Ordinary Importance Sampling First Visit algorithm below.

In [6]:
# Initialize a temporary policy pi(s)
piTemp = tPolicy.targetPolicyBlackjack(stateSpacePlayerSum=stateSpacePlayerSum, stateSpaceDealerShows=stateSpaceDealerShows, stateSpaceUsableAce=stateSpaceUsableAce, actionSpace=actionSpace)
piTemp.pi.items()
del piTemp
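
The internals of `targetPolicyBlackjack` are not reproduced in this notebook. As a rough sketch of the key structure described above, a greedy policy could be derived from a table of action values as follows. The function `greedy_policy_from_q` and the tiny `Q` table are hypothetical and exist only for illustration.

```python
def greedy_policy_from_q(Q, actions=(0, 1)):
    """Build pi[(state, action)] = probability of `action` in `state`,
    placing all probability on the action with the largest Q value
    (ties go to the lowest-numbered action)."""
    states = {state for (state, _) in Q}
    pi = {}
    for state in states:
        best = max(actions, key=lambda a: Q[(state, a)])
        for a in actions:
            pi[(state, a)] = 1.0 if a == best else 0.0
    return pi

# Hypothetical action values for two states
Q = {
    ((20, 10, False), 0): 0.4, ((20, 10, False), 1): -0.8,
    ((12, 10, False), 0): -0.6, ((12, 10, False), 1): -0.4,
}
pi = greedy_policy_from_q(Q)
print(pi[((20, 10, False), 0)])  # probability of sticking on 20
print(pi[((12, 10, False), 1)])  # probability of hitting on 12
```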

<h2>Behavior Policy</h2>

Our agent's behavior policy is denoted as $b$ and is used to represent the conditional probability mass function $f_{A|S}$ of actions over the states:<br>
$$\large b\coloneqq f_{A|S}$$

Our agent must, at time step $t$, choose an action $a\in A$ (either '0' to 'stick' or '1' to 'hit') to move from its current state $s\in S$ to its next state $s'\in S$.

We denote the conditional probability of our agent selecting action $a\in A$ given its current state $s\in S$ by the following notation:
$$\large b(a|s)=f_{A|S}(a|s)=P(A=a|S=s)$$

Recall from the definition of the Action Space $A$ above that $A$ follows a Bernoulli distribution. The behavior policy is written in the same form as the target policy, in terms of a function $f$ of the state-action pair, so the probability of sticking in state $s_{t}$ is:
$$\large b(a_{t}=0|s_{t})
=
\begin{cases}
0\quad\quad\quad\text{if }f(s_{t}, a_{t}=0)\lt f(s_{t}, a_{t}=1) \\
1\quad\quad\quad\text{if }f(s_{t}, a_{t}=0)\ge f(s_{t}, a_{t}=1)
\end{cases}
$$
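
For exploration, a behavior policy usually spreads probability over both actions. The sketch below assumes a uniform behavior policy (probability 0.5 for each action) and shows one way to sample an action from it. The names `uniform_behavior_policy` and `sample_action` are hypothetical and are not taken from `behaviorPolicyBlackjack`.

```python
import random

def uniform_behavior_policy(states, actions=(0, 1)):
    """b[(state, action)] = 1/|A| for every state-action pair."""
    p = 1.0 / len(actions)
    return {(s, a): p for s in states for a in actions}

def sample_action(policy, state, actions=(0, 1), rng=random):
    """Draw one action from policy(.|state)."""
    weights = [policy[(state, a)] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

states = [(12, 10, False), (20, 10, False)]
b = uniform_behavior_policy(states)

rng = random.Random(0)  # seeded only to make the sketch repeatable
draws = [sample_action(b, (12, 10, False), rng=rng) for _ in range(1000)]
print(sum(draws) / len(draws))  # fraction of 'hit' actions, close to 0.5
```

Sampling, rather than always taking the more probable action, is what lets the behavior policy visit both actions in every state.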


<h2>State Value Function</h2>

According to the Bellman equations, the state value function at time $t$, $v_{\pi}(s_{t})$, within an episode represents the expected return when in state $s_{t}\in S$ and following policy $\pi$. [\[1\]](#references)

$$\large
\begin{aligned}
v_\pi(s_t) &\coloneqq\mathbb{E} [G_t \,|\, S_t=s_t] \\
&=\mathbb{E} [R_{t+1} + \gamma G_{t+1} \,|\, S_t=s_t] \\
&=\mathbb{E} [R_{t+1} + \gamma v_{\pi}(S_{t+1})\,\,|\,\,S_{t}=s_{t}] \\
&=\sum\limits_{a_{t}\in A}\pi(a_{t}|s_{t})\,\sum\limits_{s_{t+1}} \sum\limits_{r} P(s_{t+1},\,r\,|\,s_{t},\,a_{t})[r+\gamma v_\pi(s_{t+1})]
\end{aligned}
$$
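
The last line of the Bellman equation can be evaluated directly when the dynamics $P$ are known. The sketch below does this for a made-up two-state problem (it is not the Blackjack dynamics, which this notebook never writes out). The names `bellman_backup`, `P`, `pi` and `v` are assumptions introduced for the illustration.

```python
def bellman_backup(state, v, pi, P, actions, gamma=1.0):
    """One application of the Bellman expectation equation for v_pi(state)."""
    total = 0.0
    for a in actions:
        for prob, next_state, reward in P[(state, a)]:
            total += pi[(state, a)] * prob * (reward + gamma * v[next_state])
    return total

actions = (0, 1)
# Hypothetical dynamics: P[(s, a)] is a list of (probability, next state, reward)
P = {
    ("start", 0): [(0.6, "end", 1.0), (0.4, "end", -1.0)],
    ("start", 1): [(1.0, "risky", 0.0)],
    ("risky", 0): [(0.75, "end", 1.0), (0.25, "end", -1.0)],
    ("risky", 1): [(1.0, "end", -1.0)],
}
# Hypothetical policy: equiprobable in "start", always action 0 in "risky"
pi = {("start", 0): 0.5, ("start", 1): 0.5, ("risky", 0): 1.0, ("risky", 1): 0.0}

v = {"start": 0.0, "risky": 0.0, "end": 0.0}
# Back up the later state first so that a single pass is exact
v["risky"] = bellman_backup("risky", v, pi, P, actions)
v["start"] = bellman_backup("start", v, pi, P, actions)
print(v)
```

Monte Carlo methods avoid this computation precisely because $P$ is usually unknown; they replace the expectation with an average over sampled returns.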


Monte Carlo Prediction methods such as Off Policy Ordinary Importance Sampling First Visit allow us to compute an estimate $V(s_{t})$ of the state value function $v_\pi(s_{t})$, as demonstrated below.

The Python dictionary V_s is a member of the [stateValue class](#stateValueBlackjack.ipynb) and uses the key={(playerSum, dealerShows, usableAce)} : value={$V(s_{t})$} pairs which can be used to store the state value function values given policy $\pi$ at any time step t.

In [7]:
# Instantiate and initialize a new temporary state value function V(S_t)
vsTemp = stateValue.stateValueBlackjack(stateSpacePlayerSum=stateSpacePlayerSum, stateSpaceDealerShows=stateSpaceDealerShows, stateSpaceUsableAce=stateSpaceUsableAce)
vsTemp.V_s.items()
del vsTemp

<h2>Action Value Function</h2>

According to the Bellman equations, the action value function $q_{\pi}(s_{t}, a_{t})$ at time $t$ within an episode represents the expected return when choosing action $a_{t}\in A$ in state $s_{t}\in S$ and following policy $\pi$ thereafter. [\[1\]](#references)

$$\large
\begin{aligned}
q_\pi(s_t, a_t) &\coloneqq\mathbb{E} [G_t \,|\, S_t=s_t,\,\,A_t=a_t] \\
&=\mathbb{E} [R_{t+1} + \gamma G_{t+1} \,|\, S_t=s_t,\,\,A_t=a_t] \\
&=\mathbb{E} [R_{t+1} + \gamma v_{\pi}(S_{t+1})\,\,|\,\,S_{t}=s_{t},\,\,A_{t}=a_t] \\
&=\sum\limits_{s_{t+1}} \sum\limits_{r} P(s_{t+1},\,r\,|\,s_{t},\,a_{t})[r+\gamma v_\pi(s_{t+1})]
\end{aligned}
$$
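
The action value is a one-step lookahead through the dynamics, and averaging it over the policy recovers the state value: $v_{\pi}(s)=\sum_{a}\pi(a|s)\,q_{\pi}(s,a)$. The sketch below checks this on a made-up problem. The dynamics `P`, the values `v` and the function `q_from_v` are hypothetical and are not the Blackjack quantities.

```python
def q_from_v(state, action, v, P, gamma=1.0):
    """q_pi(s, a) = sum over (s', r) of P(s', r | s, a) * [r + gamma * v_pi(s')]."""
    return sum(prob * (reward + gamma * v[next_state])
               for prob, next_state, reward in P[(state, action)])

# Hypothetical dynamics: P[(s, a)] is a list of (probability, next state, reward)
P = {
    ("start", 0): [(0.6, "end", 1.0), (0.4, "end", -1.0)],
    ("start", 1): [(1.0, "risky", 0.0)],
}
# Hypothetical state values of the successor states
v = {"risky": 0.5, "end": 0.0}

q_stick = q_from_v("start", 0, v, P)
q_hit = q_from_v("start", 1, v, P)

# Under an equiprobable policy in "start", v_pi(start) is the average of the two
v_start = 0.5 * q_stick + 0.5 * q_hit
print(q_stick, q_hit, v_start)
```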


Monte Carlo Prediction methods such as Off Policy Ordinary Importance Sampling First Visit allow us to compute an estimate $Q(s_{t}, a_{t})$ of the action value function $q_\pi(s_{t}, a_{t})$.

The Python dictionary Q_s_a is a member of the [stateActionValue class](#stateActionValueBlackjack.ipynb) and uses the key={((playerSum, dealerShows, usableAce), action)} : value={$Q(s_{t}, a_{t})$} pairs and will be used later within the Monte Carlo Prediction Action Value Off Policy Ordinary Importance Sampling First Visit algorithm below.

In [8]:
# Instantiate and initialize a new temporary action value function Q(S_t, A_t)
qsaTemp = actionValue.stateActionValueBlackjack(stateSpacePlayerSum=stateSpacePlayerSum, stateSpaceDealerShows=stateSpaceDealerShows, stateSpaceUsableAce=stateSpaceUsableAce, actionSpace=actionSpace)
qsaTemp.Q_s_a.items()
del qsaTemp

<a id="MonteCarloPrediction_StateValue_OffPolicy_Ordinary_ImportanceSampling_FirstVisit_Blackjack"></a>
<h2>Monte Carlo Prediction State Value Off Policy Ordinary Importance Sampling First Visit Algorithm</h2>

Our agent must, at time step $t$, choose an action $a\in A$ (either '0' to 'stick' or '1' to 'hit') to move from its current state $s\in S$ to its next state $s'\in S$ within a Blackjack episode. As in many reinforcement learning MDP problems, the choice of action taken by an agent at each time step must either exploit the estimated optimal action $a_{t}\in A$ given the current state $s_{t}\in S$ or explore the action space for actions other than the estimated optimal action, in order to see whether other actions have higher expected returns.
As discussed in Sutton and Barto's chapter on Monte Carlo methods for state value function estimation using off policy methods [\[5\]](#references), the Monte Carlo Prediction State Value Off Policy Ordinary Importance Sampling First Visit algorithm uses a target policy $\pi$ to exploit the optimal action and a behavior policy $b$ to explore the action space.


Off policy algorithms learn from data that is generated not by the target policy $\pi$ but by some behavior policy $b$, as discussed in [\[5\]](#references). This notebook demonstrates the problem of estimating the state value function $V(S)\approx v_{\pi}(S)$ with an Ordinary Importance Sampling First Visit algorithm that uses fixed, given target and behavior policies $\pi$ and $b$ respectively.

The estimate relies on the following identity, where the importance sampling ratio $\rho_{t:T-1}$ transforms the returns obtained while following the behavior policy $b$:

$$\large v_{\pi}(s_{t})\,\,=\,\,\mathbb{E}_{b} [\rho_{t:T-1}G_{t}|S_{t}=s_{t}]$$

The probability of taking a trajectory according to the target policy $\pi$ is as follows:

$$\large
\begin{aligned}
P(A_{t},\,\,S_{t+1},\,\,A_{t+1},\,\,\dots,\,\,S_{T}\,\,|\,\,S_{t},\,\,A_{t:T-1}\sim\pi)
&=\pi(A_{t}|S_{t})\,\,P(S_{t+1}|S_{t},\,\,A_{t})\,\,\pi(A_{t+1}|S_{t+1})\,\cdots\,P(S_{T}|S_{T-1},\,\,A_{T-1}) \\
&=\prod\limits_{k=t}^{T-1} \pi(A_{k}|S_{k})\,\,P(S_{k+1}|S_{k},\,\,A_{k})
\end{aligned}
$$

Similarly, the probability of taking a trajectory according to the behavior policy $b$ is as follows:

$$\large
\begin{aligned}
P(A_{t},\,\,S_{t+1},\,\,A_{t+1},\,\,\dots,\,\,S_{T}\,\,|\,\,S_{t},\,\,A_{t:T-1}\sim b)
&=b(A_{t}|S_{t})\,\,P(S_{t+1}|S_{t},\,\,A_{t})\,\,b(A_{t+1}|S_{t+1})\,\cdots\,P(S_{T}|S_{T-1},\,\,A_{T-1}) \\
&=\prod\limits_{k=t}^{T-1} b(A_{k}|S_{k})\,\,P(S_{k+1}|S_{k},\,\,A_{k})
\end{aligned}
$$

Taking the ratio of these two probabilities, the state transition probabilities cancel and we obtain the Ordinary Importance Sampling Ratio $\rho$.

$$\large\rho_{t:T-1}=\frac{\prod\limits_{k=t}^{T-1} \pi(A_{k}|S_{k})P(S_{k+1}|S_{k},\,\,A_{k})}{\prod\limits_{k=t}^{T-1} b(A_{k}|S_{k})P(S_{k+1}|S_{k},\,\,A_{k})}\,\,=\,\,\prod\limits_{k=t}^{T-1}\frac{\pi(A_{k}|S_{k})}{b(A_{k}|S_{k})}$$
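
Because the transition probabilities cancel, the ratio can be computed from the two policies alone. A minimal sketch, assuming both policies are dictionaries keyed by `(state, action)` as elsewhere in this notebook (the function name and the example policies are hypothetical):

```python
def importance_sampling_ratio(trajectory, pi, b):
    """rho_{t:T-1} for a list of (state, action) pairs covering steps t..T-1."""
    rho = 1.0
    for state, action in trajectory:
        rho *= pi[(state, action)] / b[(state, action)]
    return rho

s12, s19 = (12, 10, False), (19, 10, False)
# Hypothetical deterministic target policy: hit on 12, stick on 19
pi = {(s12, 0): 0.0, (s12, 1): 1.0, (s19, 0): 1.0, (s19, 1): 0.0}
# Hypothetical uniform behavior policy
b = {(s12, 0): 0.5, (s12, 1): 0.5, (s19, 0): 0.5, (s19, 1): 0.5}

# A trajectory the target policy would also have produced
print(importance_sampling_ratio([(s12, 1), (s19, 0)], pi, b))  # (1/0.5) * (1/0.5) = 4.0
# A trajectory the target policy would never produce
print(importance_sampling_ratio([(s12, 0)], pi, b))            # 0/0.5 = 0.0
```

With a deterministic target policy the ratio is either zero or grows geometrically with the episode length, which is one source of the high variance of ordinary importance sampling.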

The Ordinary Importance Sampling Ratio $\rho$ is used to scale the returns obtained by following behavior policy $b$. The average of these scaled returns provides an estimate $V(S_{t})\approx v_{\pi}(S_{t})$.

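$$\large V(s)=\frac{\sum\limits_{t\in \tau(s)}\rho_{t:T(t)-1}G_{t}}{|\tau(s)|}$$
$$\quad\quad\quad\quad\quad\quad\quad\quad\text{where }\tau(s)\text{ is the set of all time steps in which state }s\text{ was visited (only first visits within each episode for the first visit variant)}$$
$$\quad\quad\quad\quad\quad\quad\quad\quad\text{and }T(t)\text{ is the first time of termination following time }t.$$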
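
The estimator can be tried on a problem small enough to check by hand. The sketch below is a made-up one-step problem, not Blackjack: action 1 pays $+1$ and action 0 pays $-1$, the target policy always takes action 1 (so $v_{\pi}=1$) and the behavior policy picks uniformly. All names are assumptions introduced for the illustration.

```python
import random

def ordinary_importance_sampling(samples):
    """Average of rho * G over a list of (rho, G) pairs collected for one state."""
    return sum(rho * G for rho, G in samples) / len(samples)

rng = random.Random(0)  # seeded only to make the sketch repeatable

pi = {0: 0.0, 1: 1.0}         # hypothetical target policy
b = {0: 0.5, 1: 0.5}          # hypothetical behavior policy
reward = {0: -1.0, 1: 1.0}    # hypothetical one-step returns

samples = []
for _ in range(10000):
    a = rng.choice([0, 1])                      # act according to b
    samples.append((pi[a] / b[a], reward[a]))   # store (rho, G)

estimate = ordinary_importance_sampling(samples)
plain_average = sum(G for _, G in samples) / len(samples)
print(estimate)       # close to v_pi = 1.0
print(plain_average)  # close to v_b = 0.0
```

The unweighted average estimates the value of the behavior policy; only the weighted average estimates the value of the target policy.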

In order to estimate the State Value function for policy $\pi$ while only seeing the outcomes of following policy $b$, the Monte Carlo Prediction State Value Off Policy Ordinary Importance Sampling First Visit algorithm assumes that every action taken under $\pi$ is also taken, at least occasionally, under the behavior policy $b$, i.e. with non-zero probability. This assumption is known as coverage. [\[5\]](#references)

$$\pi(a_{t}|s_{t})\gt 0 \rightarrow b(a_{t}|s_{t})\gt0$$
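
The coverage assumption is easy to verify for tabular policies. A minimal sketch, assuming both policies are dictionaries keyed by `(state, action)`; the function name `satisfies_coverage` and the example policies are hypothetical:

```python
def satisfies_coverage(pi, b):
    """True if b(a|s) > 0 wherever pi(a|s) > 0."""
    return all(b.get(key, 0.0) > 0.0 for key, p in pi.items() if p > 0.0)

s = (15, 10, False)
pi = {(s, 0): 0.0, (s, 1): 1.0}          # target policy always hits
b_uniform = {(s, 0): 0.5, (s, 1): 0.5}   # explores both actions
b_stick = {(s, 0): 1.0, (s, 1): 0.0}     # never hits

print(satisfies_coverage(pi, b_uniform))  # True
print(satisfies_coverage(pi, b_stick))    # False
```

If the check fails, the ratio $\pi/b$ involves a division by zero for some action the target policy takes, and returns for that action are never observed.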

Pseudocode for the Off Policy Ordinary Importance Sampling State Value First Visit algorithm, which estimates the state value function $V(s_{t})$, is given below, as described in [\[6\]](#references).

$\large\quad\quad\text{Monte Carlo Prediction State Value Off Policy Ordinary Importance Sampling First Visit Algorithm}$<br>

$\large\quad\quad\text{Inputs:}\,\,\text{target policy }\pi,\,\,\text{behavior policy }b,\,\,environment,\,\,firstVisit=True,\,\,numEpisodes,\,\,discountFactor$<br>

$\large\quad\quad\text{ 1.}\quad V(s)\leftarrow\text{Arbitrarily chosen state value for all states }s\in S$<br>
$\large\quad\quad\text{ 2.}\quad R(s)\leftarrow\text{Empty list of returns for each state }(S=s)$<br>
$\large\quad\quad\text{ 3.}\quad\text{Loop forever (for each episode)}$:<br>
$\large\quad\quad\text{ 4.}\quad\quad\text{Allow the environment to provide }S_{0}\in S\text{ and choose action }A_{0}\in A(S_{0})\text{ randomly from the conditional distribution }b$<br>
$\large\quad\quad\text{ 5.}\quad\quad\text{Generate an episode from }S_{0}, A_{0},\text{ following }b\colon S_{0}, A_{0}, R_{1},..., S_{T-1}, A_{T-1}, R_{T}$<br>
$\large\quad\quad\text{ 6.}\quad\quad\,G\leftarrow\,0\quad W\leftarrow 1$<br>
$\large\quad\quad\text{ 7.}\quad\quad\text{Loop for each step of episode, }t = T-1,\,T-2,\,\dots,\,0$<br>
$\large\quad\quad\text{ 8.}\quad\quad\quad\,\,\,G\leftarrow\,\gamma G+R_{t+1}$<br>
$\large\quad\quad\text{ 9.}\quad\quad\quad\,\,\,W\leftarrow\,\,W\,\,\pi(A_{t}|S_{t}) / b(A_{t}|S_{t})\quad\text{(so that }W=\rho_{t:T-1}\text{)}$<br>
$\large\quad\quad\text{10.}\quad\quad\quad\text{ if(firstVisit==False or (firstVisit==True and state }S_{t}\text{ does not appear in } S_{0}, S_{1},\,\dots,\,S_{t-1}\text{))}\colon$<br>
$\large\quad\quad\text{11.}\quad\quad\quad\quad\text{Append }WG\text{ to } R(S_{t})$<br>
$\large\quad\quad\text{12.}\quad\quad\quad\quad V(S_{t})\leftarrow\text{average}(R(S_{t}))$<br>
$\large\quad\quad\text{13.}\quad\text{Return } V(S)$<br>

Note: In the Blackjack environment every reward before the final one is zero, so with $\gamma=1$ the return satisfies $G_{t}=R_{T}\,\,\forall\,0\le\,t\lt\,T$.

In [9]:
def MonteCarloPrediction_StateValue_OffPolicy_Ordinary_ImportanceSampling_Blackjack(targetPolicy, behaviorPolicy, env, V_s, R_s, firstVisit=True, numEpisodes=1, discountFactor=1):
    # 3. Run numEpisodes episodes (instead of looping forever as described in the pseudocode above)
    for _ in range(numEpisodes):
        # Track the (state, action) pairs and the rewards R_{t+1} observed during the episode
        episodeStateAction = []
        episodeRewards = []
        # 4. Reset the environment, which provides a random initial state S_0
        observation, info = env.reset(seed=int(datetime.now().timestamp() * 1000000))
        # Set the done flag to false
        done = False
        # 5. Generate an episode from S_0 following behavior policy b
        while not done:
            # Sample an action from b(.|S_t): stick (0) with probability b(0|S_t), otherwise hit (1)
            action = 0 if np.random.random() < behaviorPolicy.b[(observation, 0)] else 1
            # Record the state-action pair before stepping
            episodeStateAction.append((observation, action))
            # Step to the next state by performing the action
            observation, reward, terminated, truncated, info = env.step(action)
            # Record R_{t+1}; it is zero until the final step of a Blackjack episode
            episodeRewards.append(reward)
            done = terminated or truncated
        # 6. Set the return for the current episode to zero
        G = 0.0
        # 6. Set the ordinary importance sampling weight W to one
        W = 1.0
        # T is the number of (state, action) pairs in the episode sequence
        # {S_0, A_0, R_1, S_1, A_1, R_2, ..., S_T-1, A_T-1, R_T}
        T = len(episodeStateAction)
        episodeStates = [s for (s, a) in episodeStateAction]
        # 7. Loop for each step of episode, t=T-1, T-2, ..., 0
        for t in range(T-1, -1, -1):
            state = episodeStates[t]
            # 8. Compute G_t = gamma * G_{t+1} + R_{t+1}
            G = discountFactor * G + episodeRewards[t]
            # Update W on every step so that W = rho_{t:T-1} includes the action taken at time t
            W = W * targetPolicy.pi[episodeStateAction[t]] / behaviorPolicy.b[episodeStateAction[t]]
            # Record the return if every visit counts, or if this is the first visit to the state in the episode
            if (not firstVisit) or (state not in episodeStates[:t]):
                # Append the importance-weighted return rho * G to Returns(S_t)
                R_s.R_s[state].append(W * G)
                # Set V(S_t) to the average of Returns(S_t)
                V_s.V_s[state] = sum(R_s.R_s[state]) / len(R_s.R_s[state])
    return V_s
    

In [10]:
# Set the discount factor (aka "gamma") used when computing returns
discountFactor = 1.0

# Set the number of runs
numRuns = 1

#episodes = [1, 10, 50, 100, 200, 500, 1000, 3000, 5000, 10000]
episodes = [5000]

In [11]:
# Run the experiment
for i in range(len(episodes)):
    # Run an experiment j times
    for runNum in range(numRuns):
        # Set the environment 
        env = gym.make('Blackjack-v1', natural=False, sab=False)
        # Use the RecordEpisodeStatistics wrapper to track rewards and episode lengths
        env = gym.wrappers.RecordEpisodeStatistics(env, deque_size=episodes[i])
        # Open a pre-trained target policy (serialized with pickle by the notebook in reference [7]) for use as an example
        with open('../../MonteCarloControl/MonteCarloControl_OnPolicy_ExploringStarts_FirstVisit_Blackjack/results/target_policy.pickle', 'rb') as handle:
            pi = pickle.load(handle)
        # Initialize policy b(s) to equal probabilities 0.5 for all s in S
        b = bPolicy.behaviorPolicyBlackjack(stateSpacePlayerSum=stateSpacePlayerSum, stateSpaceDealerShows=stateSpaceDealerShows, stateSpaceUsableAce=stateSpaceUsableAce, actionSpace=actionSpace)
        # 1. Instantiate and initialize a new state value function V(S_t)
        V_s = stateValue.stateValueBlackjack(stateSpacePlayerSum=stateSpacePlayerSum, stateSpaceDealerShows=stateSpaceDealerShows, stateSpaceUsableAce=stateSpaceUsableAce)
        # 2. Instantiate and initialize a new returns function R(S_t) with empty lists
        R_s = returnsS.returnsStateBlackjack(stateSpacePlayerSum=stateSpacePlayerSum, stateSpaceDealerShows=stateSpaceDealerShows, stateSpaceUsableAce=stateSpaceUsableAce)
        # Compute the estimate of the state value function V
        stateValueResult = MonteCarloPrediction_StateValue_OffPolicy_Ordinary_ImportanceSampling_Blackjack(targetPolicy=pi, behaviorPolicy=b, env=env, V_s=V_s, R_s=R_s, firstVisit=True, numEpisodes=episodes[i], discountFactor=1.0)
        # Set up the grid for the state value plot when a usable Ace is available
        value_grid = h.create_state_value_grid(V_s=stateValueResult, usable_ace=True)
        # Format a string for the title of the plot when there is a usable Ace available
        title = "MC Prediction State Value\nOffPolicy Ordinary Importance Sampling First Visit\n#Episodes=" + str(episodes[i]) + ", \u03B3=" + str(discountFactor) + "\nRun# " + str(runNum+1) + ", usableAce=T\n"
        fileName ="results/MC_Prediction_StateValue_OffPolicy_Ordinary_ImportanceSampling_FirstVisit_Blackjack_Episodes_" + str(episodes[i]) + "_DiscountFactor_" + str(discountFactor) + "_Run_" + str(runNum+1) + "_usableAce_T.png"
        h.create_state_value_plot(value_grid=value_grid, title=title, fileName=fileName, numEpisodes=episodes[i], runNum=runNum, discountFactor=discountFactor, firstVisit=True, usableAce=True)
        # Set up the grid for the state value plot when a usable Ace is not available
        value_grid = h.create_state_value_grid(V_s=stateValueResult, usable_ace=False)
        # Format a string for the title of the plot when a usable Ace is not available
        title = "MC Prediction State Value\nOff Policy Ordinary Importance Sampling First Visit\n#Episodes=" + str(episodes[i]) + ", \u03B3=" + str(discountFactor) + "\nRun# " + str(runNum+1) + ", usableAce=F\n"
        fileName ="results/MC_Prediction_StateValue_OffPolicy_Ordinary_ImportanceSampling_FirstVisit_Blackjack_Episodes_" + str(episodes[i]) + "_DiscountFactor_" + str(discountFactor) + "_Run_" + str(runNum+1) + "_usableAce_F.png"
        h.create_state_value_plot(value_grid=value_grid, title=title, fileName=fileName, numEpisodes=episodes[i], runNum=runNum, discountFactor=discountFactor, firstVisit=True, usableAce=False)
        # Close the environment to clean up resources used by the environment
        env.close()

<a id="references"></a>
<h2>References</h2>

1. Sutton R.S., Barto A.G., "Reinforcement Learning, An Introduction", 2nd ed, MIT Press, Cambridge MA, 2018, p. 58. 

2. Sutton R.S., Barto A.G., "Reinforcement Learning, An Introduction", 2nd ed, MIT Press, Cambridge MA, 2018, p. 99.

3. https://gymnasium.farama.org/environments/toy_text/blackjack/

4. https://github.com/openai/gym/issues/1410

5. Sutton R.S., Barto A.G., "Reinforcement Learning, An Introduction", 2nd ed, MIT Press, Cambridge MA, 2018, p. 103-106. 

6. White A., White M. "Sample-based Learning Methods: Week 1", Sample-based Learning Methods, "Reinforcement Learning" Specialization, www.coursera.org/learn/sample-based-learning-methods/home/week1.

7. [Monte Carlo Control On Policy Exploring Starts First Visit algorithm](#../MonteCarloControl/MonteCarloControl_OnPolicy_ExploringStarts_FirstVisit_Blackjack/MonteCarloControl_OnPolicy_ExploringStarts_FirstVisit_Blackjack.ipynb)