<a id="MonteCarloPrediction_ActionValue_OffPolicy_DiscountAware_Ordinary_PerDecision_ImportanceSampling_FirstVisit_Blackjack"></a>
<h1>Monte Carlo Prediction of Action Value Function: Using Off Policy Discount Aware Ordinary Per Decision Importance Sampling First Visit for Blackjack</h1>

Monte Carlo prediction algorithms, such as Off Policy Discount Aware Ordinary Per Decision Importance Sampling First Visit, can be used to estimate the action-value function, $Q(S, A)$, described by the Bellman equations.[\[1\]](#references)  This notebook contains functions that compute estimates $Q(S_{t}, A_{t})\approx q_{\pi}(s_{t}, a_{t})$ using the [Off Policy Discount Aware Ordinary Per Decision Importance Sampling First Visit](#MonteCarloPrediction_ActionValue_OffPolicy_DiscountAware_Ordinary_PerDecision_ImportanceSampling_FirstVisit_Blackjack) algorithm for the Blackjack problem described in Example 5.3 of the text by Sutton and Barto. [\[2\]](#references)

This notebook requires the following Python modules:

In [14]:
import gymnasium as gym
import numpy as np
from collections import OrderedDict
import statistics as stat
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import matplotlib as mpl
import matplotlib.image as mpimg
from IPython.display import Image
import import_ipynb
import ipynb.fs.full.policyBlackjack as tPolicy
import ipynb.fs.full.targetPolicyBlackjack as tPolicy
import ipynb.fs.full.behaviorPolicyBlackjack as bPolicy
import ipynb.fs.full.returnsStateActionBlackjack as returnsSA
import ipynb.fs.full.stateActionValueBlackjack as actionValue
import ipynb.fs.full.helpers_MonteCarloPrediction_ActionValue_OffPolicy_DiscountAware_Ordinary_PerDecision_ImportanceSampling_FirstVisit_Blackjack as h
import pickle

<h2>Environment: Blackjack</h2>

<h3>Environment Blackjack: State Space</h3>

Each state of a Blackjack game, represented by the random variable $S$,  consists of a 3-tuple containing 1.) the player's current score via summing the values of the cards in the player's hand, 2.) the value of the card shown by the dealer and 3.) whether the player is holding a usable ace. [\[2\]](#references)

$$\large S\coloneqq\{(playerSum,\,\,dealerShowing,\,\,usableAce)\in\mathbb{Z}^{3}: (0\le playerSum \le 31)\land(0\le dealerShowing \le 10)\land (0\le usableAce\le 1)\}$$

Note: The state space (often referred to as the "observation space" within the Gymnasium documentation) covers instances of states that are not reachable in any Blackjack game. More specifically, player sums of 0 or 1 and dealer showing values of 0 are not possible given the constraints of the environment (i.e. the game of Blackjack). This was a design choice made by the Gym API developers (while Gym was still being maintained by OpenAI) in order to facilitate "easier indexing for table based algorithms". [\[3\]](#references)

In [15]:
# Define the state spaces for player sum, dealer showing card, and usable ace.
# Note: Gymnasium's observation space is a Tuple(Discrete(32), Discrete(11), Discrete(2)),
# which contains observations that are not reachable in any game. This was a choice made by
# the OpenAI developers to facilitate easy indexing; see https://github.com/openai/gym/issues/1410
# for more information. There are very few such observations, so we continue to set
# Q_s_a values according to the observation space assumed by Gymnasium.
stateSpacePlayerSum = list(range(32))
stateSpaceDealerShows = list(range(1, 11))
stateSpaceUsableAce = [True, False]

We will use these enumerations of state values later on. For now let's talk about the Action Space of the Blackjack environment.

<h3>Environment Blackjack: Action Space</h3>

Our agent may choose to either 'hit' or 'stick' at each time step $t$ during an episode. Each action taken by the agent is represented by an instance of the random variable $A$, which follows a Bernoulli distribution because exactly two actions ('Hit' or 'Stick') are available to the agent. [\[2\]](#references)<br>
$$\large A\coloneqq\{a\in\{0, 1\} : a=0\text{ when player 'sticks'},\,\,a=1\text{ when player 'hits'}\}$$

Below we create a list to help us build the state and action spaces later on within the algorithm.

In [16]:
# Define the action space (0='Stick', 1='Hit') according to the Gymnasium documentation
actionSpace = [i for i in range(2)]

We will use this enumeration of actions within the Monte Carlo Prediction Action Value Off Policy Discount Aware Ordinary Per Decision Importance Sampling First Visit algorithm below. For now let's talk about how the Blackjack environment manages episodes and time steps.

<h3>Environment Blackjack: Episodes and Time Steps</h3>

In the Blackjack environment within the Gymnasium API, an episode corresponds to what is commonly called a game in a real-life Blackjack scenario. Each episode consists of $T$ time steps {$t_{0}$, $t_{1}$, $\dots$,$t_{T-1}$}. The initial state $s_{t}$ is provided by the Blackjack environment each time the environment is reset. The Monte Carlo Prediction Action Value Off Policy Discount Aware Ordinary Per Decision Importance Sampling First Visit Blackjack algorithm then selects an action $a_{t}$ according to the behavior policy $b$, which forms the state-action pair ($S_{t}=s_{t}$, $A_{t}=a_{t}$). Each episode has the following form:<br>
$$\large\text{Episode } \coloneqq \{ S_{t},\,\,A_{t},\,\,R_{t+1},\,\,S_{t+1},\,\,A_{t+1},\,\,R_{t+2},\,\,\dots,\,\,S_{T-1},\,\,A_{T-1},\,\,R_{T}\}$$

Note: In this notebook exercise, the index of the time step immediately following the end of an episode (i.e. the first time step of the next episode) is set to $t=T+1$. $t$ is not set back to zero before starting the next episode. See [\[9\]](#references) for the details of how time indexing can be handled when doing Importance Sampling.

<h3>Environment Blackjack: Rewards and Returns</h3>

Reward calculations are carried out at the end of each episode when using the Monte Carlo Prediction Off Policy Discount Aware Ordinary Per Decision Importance Sampling First Visit algorithm.  Each game (i.e. episode) concludes with assignment of a reward to the random variable $R_{T}$ according to the following five outcomes:

$$
\large R_T
= 
\begin{cases}
-1 & \text{dealer wins} \\
0 & \text{draw} \\
1 & \text{agent wins with a non-natural} \\
1 & \text{agent wins on a natural (if natural is set to False)} \\
1.5 & \text{agent wins on a natural (if natural is set to True)}
\end{cases}
$$

Once an episode has ended and the reward $R_{T}$ has been returned from the Blackjack environment, we set the reward at every time step of the episode to $R_{T}$:<br>
$$\large R_{t+1} = R_{T}\,\,\forall\,\,0\le\,\,t\,\,\lt\,\,T$$

The Monte Carlo Prediction Action Value Off Policy Discount Aware Ordinary Per Decision Importance Sampling First Visit algorithm will use a returns function, $Returns(s_{t}, a_{t})$, to keep track of which state-action pairs accumulate returns at the end of each episode.<br>

$$\large Returns(s_{t}, a_{t}) \leftarrow\text{list of accumulated returns }G\text{ that resulted from taking action }a_{t}\text{ in state }s_{t}\text{ over all episodes of a single run.}$$

The dictionary R_s_a is a member of the [returnsStateActionBlackjack class](#returnsStateActionBlackjack.ipynb) and uses the key={((playerSum, dealerShows, usableAce), action)} : value={$Returns(s, a)$} and will be used later within the Monte Carlo Prediction Action Value Off Policy Discount Aware Ordinary Per Decision Importance Sampling First Visit algorithm below.
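The key layout described above can be sketched as follows. This is a minimal illustration rather than the actual `returnsStateActionBlackjack` class: the helper name `build_returns_table` is hypothetical, and the key ordering is an assumption inferred from the printed output of the cell that follows.

```python
from collections import OrderedDict
from itertools import product


def build_returns_table(player_sums, dealer_cards, usable_ace_flags, actions):
    """Map ((playerSum, dealerShows, usableAce), action) to an empty list of returns."""
    table = OrderedDict()
    # product() varies the last argument fastest, so the action changes first
    for player_sum, dealer_card, usable_ace, action in product(
        player_sums, dealer_cards, usable_ace_flags, actions
    ):
        table[((player_sum, dealer_card, usable_ace), action)] = []
    return table


returns_table = build_returns_table(range(32), range(1, 11), [True, False], range(2))
print(len(returns_table))  # 32 * 10 * 2 * 2 = 1280 state-action pairs
```

Each list starts empty and grows by one return every time the corresponding state-action pair qualifies for an update.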

In [17]:
# Instantiate and initialize a new temporary returns function R(S_t, A_t) with empty lists
rsaTemp = returnsSA.returnsStateActionBlackjack(stateSpacePlayerSum=stateSpacePlayerSum, stateSpaceDealerShows=stateSpaceDealerShows, stateSpaceUsableAce=stateSpaceUsableAce, actionSpace=actionSpace)
rsaTemp.R_s_a.items()

odict_items([(((0, 1, True), 0), []), (((0, 1, True), 1), []), (((0, 1, False), 0), []), (((0, 1, False), 1), []), (((0, 2, True), 0), []), (((0, 2, True), 1), []), (((0, 2, False), 0), []), (((0, 2, False), 1), []), (((0, 3, True), 0), []), (((0, 3, True), 1), []), (((0, 3, False), 0), []), (((0, 3, False), 1), []), (((0, 4, True), 0), []), (((0, 4, True), 1), []), (((0, 4, False), 0), []), (((0, 4, False), 1), []), (((0, 5, True), 0), []), (((0, 5, True), 1), []), (((0, 5, False), 0), []), (((0, 5, False), 1), []), (((0, 6, True), 0), []), (((0, 6, True), 1), []), (((0, 6, False), 0), []), (((0, 6, False), 1), []), (((0, 7, True), 0), []), (((0, 7, True), 1), []), (((0, 7, False), 0), []), (((0, 7, False), 1), []), (((0, 8, True), 0), []), (((0, 8, True), 1), []), (((0, 8, False), 0), []), (((0, 8, False), 1), []), (((0, 9, True), 0), []), (((0, 9, True), 1), []), (((0, 9, False), 0), []), (((0, 9, False), 1), []), (((0, 10, True), 0), []), (((0, 10, True), 1), []), (((0, 10, False),

In [18]:
del rsaTemp

<h2>Target Policy</h2>

Our agent's target policy is denoted as $\pi$ and is used to represent the conditional probability mass function $f_{A|S}$ of actions over the states:<br>
$$\large \pi\coloneqq f_{A|S}$$

We denote the conditional probability of our agent selecting action $a\in A$ given its current state $s\in S$ by the following notation:
$$\large \pi(a|s)=f_{A|S}(a|s)=P(A=a|S=s)$$

Recall from the definition of the action space $A$ above that $A$ follows a Bernoulli distribution. The target policy is greedy with respect to the action value estimates $Q$, so the probability of sticking ($a_{t}=0$) in state $s_{t}$ is:
$$\large \pi(a_{t}=0|s_{t})
=
\begin{cases}
0\quad\quad\quad \text{if } Q(s_{t}, a_{t}=0)\lt Q(s_{t}, a_{t}=1) \\
1\quad\quad\quad \text{if }Q(s_{t}, a_{t}=0)\gt Q(s_{t}, a_{t}=1)
\end{cases}
$$


The dictionary pi is a member of the [targetPolicyBlackjack class](#targetPolicyBlackjack.ipynb) and uses the key={((playerSum, dealerShows, usableAce), action)} : value={$\pi(a_{t}\,\,|\,\,s_{t})$} pairs and will be used later within the Monte Carlo Prediction Action Value Off Policy Discount Aware Ordinary Per Decision Importance Sampling First Visit algorithm below.
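As a rough illustration of how such a dictionary could be filled from action value estimates, the sketch below builds a greedy policy table. The function name `greedy_policy_from_q`, the example Q values, and the tie-breaking rule are all assumptions made for this example; the actual `targetPolicyBlackjack` class may differ.

```python
def greedy_policy_from_q(q_values):
    """Build pi[(state, action)] with probability 1 on the higher-valued action."""
    states = {state for state, _ in q_values}
    pi = {}
    for state in states:
        # Assumption: a tie is broken in favour of 'stick' (action 0)
        stick_is_greedy = q_values[(state, 0)] >= q_values[(state, 1)]
        pi[(state, 0)] = 1.0 if stick_is_greedy else 0.0
        pi[(state, 1)] = 0.0 if stick_is_greedy else 1.0
    return pi


# Made-up action values for two states (playerSum, dealerShows, usableAce)
q_example = {
    ((20, 10, False), 0): 0.4,
    ((20, 10, False), 1): -0.8,
    ((12, 2, False), 0): -0.3,
    ((12, 2, False), 1): -0.2,
}
pi_example = greedy_policy_from_q(q_example)
print(pi_example[((20, 10, False), 0)])  # 1.0, stick on 20
print(pi_example[((12, 2, False), 1)])   # 1.0, hit on 12
```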

In [19]:
# Initialize a temporary policy pi(s)
piTemp = tPolicy.targetPolicyBlackjack(stateSpacePlayerSum=stateSpacePlayerSum, stateSpaceDealerShows=stateSpaceDealerShows, stateSpaceUsableAce=stateSpaceUsableAce, actionSpace=actionSpace)
piTemp.pi.items()
del piTemp

<h2>Behavior Policy</h2>

Our agent's behavior policy is denoted as $b$ and is used to represent the conditional probability mass function $f_{A|S}$ of actions over the states:<br>
$$\large b\coloneqq f_{A|S}$$

Our agent must, at time step $t$, choose an action $a\in A$ (either '0' to 'stick' or '1' to 'hit') to move from its current state $s\in S$ to its next state $s'\in S$.

We denote the conditional probability of our agent selecting action $a\in A$ given its current state $s\in S$ by the following notation:
$$\large b(a|s)=f_{A|S}(a|s)=P(A=a|S=s)$$

Recall from the definition of the action space $A$ above that $A$ follows a Bernoulli distribution. The behavior policy is defined as follows:
$$\large b(a_{t}|s_{t})
=
\begin{cases}
0\quad\quad\quad\text{if }f(s_{t}, a_{t}=0)\lt f(s_{t}, a_{t}=1) \\
1\quad\quad\quad\text{if }f(s_{t}, a_{t}=0)\ge f(s_{t}, a_{t}=1)
\end{cases}
$$


# Initialize a temporary policy b(s)
bTemp = bPolicy.behaviorPolicyBlackjack(stateSpacePlayerSum=stateSpacePlayerSum, stateSpaceDealerShows=stateSpaceDealerShows, stateSpaceUsableAce=stateSpaceUsableAce, actionSpace=actionSpace)
bTemp.b.items()
del bTemp

<h2>Action Value Function</h2>

According to the Bellman equations, the action value function $q_{\pi}(s_{t}, a_{t})$ at time $t$ within an episode represents the expected return when choosing action $a_{t}\in A$ in state $s_{t}\in S$ and following policy $\pi(a_{t}|s_{t})$ thereafter. [\[4\]](#references)

$$\large
\begin{aligned}
q_\pi(s_t, a_t) &\coloneqq \mathbb{E} [G_t \,|\, S_t=s_t, A_t=a_t] \\
&= \mathbb{E} [R_{t+1} + \gamma G_{t+1} \,|\, S_t=s_t, A_t=a_t] \\
&= \mathbb{E} [R_{t+1} + \gamma v_{\pi}(S_{t+1})\,\,|\,\,S_{t}=s_{t},\,\,A_{t}=a_t] \\
&= \sum\limits_{s_{t+1}} \sum\limits_{r} P(s_{t+1}, r\,|\, s_{t}, a_{t})[r+\gamma v_\pi(s_{t+1})]
\end{aligned}
$$
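The last line of the derivation can be evaluated directly when the transition probabilities are known. The sketch below does so for a toy state-action pair; the probabilities, rewards, and state values are made up for illustration and are not taken from the Blackjack environment.

```python
def action_value(transitions, state_values, gamma):
    """Evaluate the sum over (s', r) of p(s', r | s, a) * [r + gamma * v(s')]."""
    return sum(
        probability * (reward + gamma * state_values[next_state])
        for probability, next_state, reward in transitions
    )


# Illustrative numbers only: after the action the agent wins (+1) with
# probability 0.25, loses (-1) with probability 0.5, or continues with no reward.
transitions = [
    (0.25, "terminal", 1.0),
    (0.5, "terminal", -1.0),
    (0.25, "continue", 0.0),
]
state_values = {"terminal": 0.0, "continue": 0.4}
q_example_value = action_value(transitions, state_values, gamma=1.0)
print(q_example_value)  # approximately 0.25 - 0.5 + 0.25 * 0.4 = -0.15
```

Monte Carlo methods avoid this sum because the transition probabilities are usually unknown; they average sampled returns instead.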


Monte Carlo Prediction methods such as Off Policy Discount Aware Ordinary Per Decision Importance Sampling First Visit allow us to compute an estimate $Q(s_{t}, a_{t})$ of the action value function $q_\pi(s_{t}, a_{t})$, as demonstrated later in this notebook.

The Python dictionary Q_s_a is a member of the [stateActionValue class](#stateActionValueBlackjack.ipynb) and uses the key={((playerSum, dealerShows, usableAce), action)} : value={$Q(s_{t}, a_{t})$} pairs and will be used later within the Monte Carlo Prediction Action Value Off Policy Discount Aware Ordinary Per Decision Importance Sampling First Visit algorithm below.

<a id="MonteCarloPrediction_ActionValue_OffPolicy_DiscountAware_Ordinary_PerDecision_ImportanceSampling_FirstVisit_Blackjack"></a>
<h2>Monte Carlo Prediction Action Value Off Policy Discount Aware Ordinary Per Decision Importance Sampling First Visit Algorithm</h2>

Our agent must, at time step $t$, choose an action $a\in A$ (either '0' to 'stick' or '1' to 'hit') to move from its current state $s\in S$ to its next state $s'\in S$ within a Blackjack episode. As in many reinforcement learning MDP problems, the choice of action taken by an agent at each time step must either exploit the estimated optimal action $a_{t}\in A$ given the current state $s_{t}\in S$ or explore the action space for actions other than the optimal action, in order to see whether other actions have higher expected returns and would be better to choose going forward.
As discussed in Sutton and Barto's chapter on Monte Carlo methods, which covers value function estimation using off-policy methods [\[5\]](#references), the Monte Carlo Prediction Action Value Off Policy Discount Aware Ordinary Per Decision Importance Sampling First Visit algorithm uses a target policy $\pi$ to exploit the optimal action and a behavior policy $b$ to explore the action space. 


Off-policy algorithms learn from data that is generated not by the target policy $\pi$ but by some behavior policy $b$, as discussed in [\[5\]](#references). This notebook demonstrates the problem of estimating the action value function $Q(S, A)\approx q_{\pi}(S, A)$ by using an Off Policy Discount Aware Ordinary Per Decision Importance Sampling First Visit algorithm with fixed, given target and behavior policies $\pi$ and $b$ respectively. 

$$\large q_{\pi}(s_{t}, a_{t})\,\,=\,\,\mathbb{E}_{b} [\rho_{t:T-1}G_{t}\,|\,S_{t}=s_{t}, A_{t}=a_{t}]$$

where $\rho$ transforms the returns while following the behavior policy $b$ as described below.

The probability of taking a trajectory according to the target policy $\pi$ is as follows:

$$\large P(A_{t},\,\,S_{t+1},\,\,A_{t+1},\,\,\dots,\,\,S_{T}\,\,|\,\,S_{t},\,\,A_{t:T-1}\sim\pi)$$
$$\large\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad =\pi(A_{t}|S_{t})\,\,P(S_{t+1}|S_{t},\,\,A_{t})\,\,\pi(A_{t+1}|S_{t+1})\,\,\cdots\,\,P(S_{T}|S_{T-1},\,\,A_{T-1})$$
$$\large\quad\quad\quad\quad\quad\quad\quad=\quad\prod\limits_{k=t}^{T-1} \pi(A_{k}|S_{k})\,\,P(S_{k+1}|S_{k},\,\,A_{k})$$

Similarly, the probability of taking a trajectory according to the behavior policy $b$ is as follows:

$$\large P(A_{t},\,\,S_{t+1},\,\,A_{t+1},\,\,\dots,\,\,S_{T}\,\,|\,\,S_{t},\,\,A_{t:T-1}\sim b)$$
$$\large\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\quad=b(A_{t}|S_{t})\,\,P(S_{t+1}|S_{t},\,\,A_{t})\,\,b(A_{t+1}|S_{t+1})\,\,\cdots\,\,P(S_{T}|S_{T-1},\,\,A_{T-1})$$
$$\large\quad\quad\quad\quad\quad\quad\quad=\quad\prod\limits_{k=t}^{T-1} b(A_{k}|S_{k})\,\,P(S_{k+1}|S_{k},\,\,A_{k})$$

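Taking the ratio of these two probabilities up until some time horizon $h$, we obtain the Discount Aware Ordinary Per Decision Importance Sampling Ratio $\rho$. The state transition probabilities appear in both the numerator and the denominator, so they cancel.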

$$\large\rho_{t:h}=\frac{\prod\limits_{k=t}^{h-1} \pi(A_{k}|S_{k})P(S_{k+1}|S_{k},\,\,A_{k})}{\prod\limits_{k=t}^{h-1} b(A_{k}|S_{k})P(S_{k+1}|S_{k},\,\,A_{k})}\,\,=\,\,\prod\limits_{k=t}^{h-1}\frac{\pi(A_{k}|S_{k})}{b(A_{k}|S_{k})}$$

$$\large\rho_{t}=W_{t}^{t+1}=\frac{\pi(A_{t}|\,S_{t})}{b(A_{t}|\,S_{t})}$$

$$\large W_{t}^{h}=\prod\limits_{k=t}^{h-1} W_{k}^{k+1}=\prod\limits_{k=t}^{h-1} \rho_{k}$$
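The definitions of $\rho_{k}$ and $W_{t}^{h}$ translate directly into code. The following sketch uses hypothetical helper names (`step_ratios`, `partial_ratio`) and made-up probabilities for the actions actually taken along one trajectory.

```python
from math import prod


def step_ratios(pi_probs, b_probs):
    """Per-step ratios rho_k = pi(A_k|S_k) / b(A_k|S_k) along one trajectory."""
    return [p / q for p, q in zip(pi_probs, b_probs)]


def partial_ratio(rho, t, h):
    """W_t^h: product of rho_k for k = t, ..., h-1 (equal to 1 when h == t)."""
    return prod(rho[t:h])


# Illustrative probabilities of the actions actually taken at k = 0, 1, 2
pi_probs = [1.0, 1.0, 0.0]
b_probs = [0.5, 0.5, 0.5]
rho = step_ratios(pi_probs, b_probs)
print(rho)                       # [2.0, 2.0, 0.0]
print(partial_ratio(rho, 0, 2))  # 4.0
print(partial_ratio(rho, 0, 3))  # 0.0, the target policy would never take the last action
```

A single step with $\pi(A_{k}|S_{k})=0$ zeroes every partial ratio that spans it, which is why rewards received after such a step contribute nothing to the estimate.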

The Discount Aware Ordinary Per Decision Importance Sampling Ratio $\rho$ is used to scale the returns obtained by following the behavior policy $b$. The average of these scaled returns provides an estimate $Q(S_{t}, A_{t})\approx q_{\pi}(S_{t}, A_{t})$.
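To see why the scaling works, consider a single decision with made-up numbers, ignoring the conditioning on state and action. The sketch below is illustrative only: the returns and both policies are assumptions chosen for the example, and exact fractions are used so the equality can be checked without rounding.

```python
from fractions import Fraction

# Illustrative one-step problem: each action yields a fixed return
returns = {0: Fraction(1), 1: Fraction(-1)}
pi = {0: Fraction(9, 10), 1: Fraction(1, 10)}  # target policy
b = {0: Fraction(1, 2), 1: Fraction(1, 2)}     # behavior policy

# Expected return when actions are drawn from the target policy
expected_under_pi = sum(pi[a] * returns[a] for a in returns)
# Expected importance-weighted return when actions are drawn from b
expected_weighted_under_b = sum(b[a] * (pi[a] / b[a]) * returns[a] for a in returns)
# Expected unweighted return when actions are drawn from b
expected_unweighted_under_b = sum(b[a] * returns[a] for a in returns)

print(expected_under_pi)            # 4/5
print(expected_weighted_under_b)    # 4/5
print(expected_unweighted_under_b)  # 0
```

Without the ratio, the plain average under $b$ would estimate the value of $b$ itself (0 here) rather than that of $\pi$ (4/5).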

Returns are discounted by the factors $\gamma_{i}$ and weighted by $W_{t}^{l}$, the Per Decision Importance Sampling Ratio:

$$\large \hat G_{t}=\sum \limits_{l=t+1}^{T(t)}W_{t}^{l}\prod \limits_{i=t+1}^{l-1}\gamma_{i} R_{l}$$

$$\large Q(s, a)=\frac{\sum\limits_{t\in\tau(s, a)}\hat G_{t}}{|\tau(s, a)|}$$
$$\large\quad\quad\quad\quad\quad\quad\quad\quad\text{where }\tau(s, a)\text{ is the set of all time steps at which the state-action pair }(s, a)\text{ was visited.}$$
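The formula for $\hat G_{t}$ can be computed in a single pass by extending $W_{t}^{l}$ and the discount product one step at a time. The sketch below is one possible implementation under stated assumptions: the function name `per_decision_return`, the list-based indexing, and the example numbers are hypothetical and chosen only for illustration.

```python
def per_decision_return(rho, gammas, rewards, t, T):
    """Compute G_hat_t = sum_{l=t+1}^{T} W_t^l * (prod_{i=t+1}^{l-1} gamma_i) * R_l.

    rho[k]     : importance sampling ratio of the action taken at step k
    gammas[i]  : discount factor applied at step i
    rewards[l] : reward R_l (index 0 is unused because rewards start at R_1)
    """
    g_hat = 0.0
    weight = 1.0    # running W_t^l
    discount = 1.0  # running product of gamma_i (empty product when l = t+1)
    for l in range(t + 1, T + 1):
        weight *= rho[l - 1]           # extend W_t^{l-1} to W_t^l
        if l > t + 1:
            discount *= gammas[l - 1]  # include gamma_{l-1}
        g_hat += weight * discount * rewards[l]
    return g_hat


# Illustrative two-step episode in which every reward equals the terminal reward
rho_example = [2.0, 0.5]
gammas_example = [1.0, 0.5]
rewards_example = [None, 1.0, 1.0]
print(per_decision_return(rho_example, gammas_example, rewards_example, t=0, T=2))  # 2.5
print(per_decision_return(rho_example, gammas_example, rewards_example, t=1, T=2))  # 0.5
```

Each reward $R_{l}$ is weighted only by the ratios of the decisions made before it, which is what distinguishes the per-decision form from weighting the whole return by $\rho_{t:T-1}$.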

In order to estimate the action value function for policy $\pi$ while only seeing the outcomes of following policy $b$, the Monte Carlo Prediction Action Value Off Policy Discount Aware Ordinary Per Decision Importance Sampling First Visit algorithm assumes that every action taken under $\pi$ during an MDP is also taken, at least occasionally, under the behavior policy $b$ with some non-zero probability.[\[5\]](#references) 

$$\pi(a_{t}|s_{t})\gt 0 \rightarrow b(a_{t}|s_{t})\gt0\quad\quad\quad\forall\,\,a_{t}\in\,\,A\,\,\text{and}\,\,s_{t}\in\,\,S$$
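This coverage assumption can be checked mechanically when both policies are stored as dictionaries keyed by (state, action). The helper `satisfies_coverage` and the example policies below are hypothetical and serve only as a sketch.

```python
def satisfies_coverage(pi, b):
    """Return True when b(a|s) > 0 for every (state, action) with pi(a|s) > 0."""
    return all(b.get(key, 0.0) > 0.0 for key, prob in pi.items() if prob > 0.0)


state = (15, 10, False)  # (playerSum, dealerShows, usableAce)
pi_hit = {(state, 0): 0.0, (state, 1): 1.0}          # target policy always hits
b_uniform = {(state, 0): 0.5, (state, 1): 0.5}       # explores both actions
b_always_stick = {(state, 0): 1.0, (state, 1): 0.0}  # never hits

print(satisfies_coverage(pi_hit, b_uniform))       # True
print(satisfies_coverage(pi_hit, b_always_stick))  # False
```

When the check fails, the ratio $\pi(a_{t}|s_{t})/b(a_{t}|s_{t})$ is undefined for an action the target policy takes, and the estimate cannot be formed from data generated by $b$.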

Pseudocode for the Monte Carlo Prediction Action Value Off Policy Discount Aware Ordinary Per Decision Importance Sampling First Visit algorithm, which estimates the action value function $Q(s_{t}, a_{t})$, is given below, following Sutton and Barto. [\[6\]](#references)

$\large\quad\quad\text{Monte Carlo Prediction Action Value Off Policy Discount Aware Ordinary Per Decision Importance Sampling First Visit Algorithm}$<br>

$\large\quad\quad\text{Inputs:}\,\,\text{target policy\,\,}\pi,\,\,\text{behavior policy }b,\,\,environment,\,\,firstVisit=True,\,\,numEpisodes$<br>
$\large\quad\quad\text{ 1.}\quad Q(s, a)\leftarrow\text{Randomly chosen state action values for all state-action pairs }(s\in S, a\in A)$<br>
$\large\quad\quad\text{ 2.}\quad R(s, a)\leftarrow\text{Empty list of returns for each state action pair }(s\in S, a\in A)$<br>
$\large\quad\quad\text{ 3.}\quad timeStep\leftarrow 0$<br>
$\large\quad\quad\text{ 4.}\quad T\leftarrow \text{ empty dictionary }$<br>
$\large\quad\quad\text{ 5.}\quad episodeStateAction\leftarrow\text{empty list }$<br>
$\large\quad\quad\text{ 6.}\quad discountFactors\leftarrow\text{list \{1\} }$<br>
$\large\quad\quad\text{ 7.}\quad\text{Loop forever (for each episode)}$:<br>
$\large\quad\quad\text{ 8.}\quad\quad\text{Allow the environment to provide }S_{t=startTime}\in S\text{ and choose action }A_{t=startTime}\in A(S_{t=startTime})\text{ randomly from the conditional distribution }b$<br>
$\large\quad\quad\text{ 9.}\quad\quad\text{Generate an episode from }S_{t=startTime}, A_{t=startTime},\text{ following }b\colon S_{t=startTime}, A_{t=startTime}, R_{t+1},..., S_{T(t=startTime)-1}, A_{T(t=startTime)-1}, R_{T(t=startTime)}$<br>
$\large\quad\quad\text{ 10.}\quad\quad\,\hat G\leftarrow\,0\quad W_{t}^{l}\leftarrow 1$<br>
$\large\quad\quad\text{ 11.}\quad\quad\text{Loop for each step of episode, }t = T-1,\,T-2,\,\dots,\,0$<br>
$\large\quad\quad\text{ 12.}\quad\quad\quad\,\,\,\hat G\leftarrow\,\sum \limits_{l=t+1}^{T(t)} W_{t}^{l} \prod \limits_{i=t+1}^{l-1}\gamma_{i} R_{l}$<br>
$\large\quad\quad\text{ 13.}\quad\quad\quad\text{ if(firstVisit==False or (firstVisit==True and pair (}s_{t}, a_{t}\text{) does not appear in } S_{0}, A_{0}, S_{1}, A_{1},\,\dots,\,S_{t-1},\,A_{t-1}\text{))}\colon$<br>
$\large\quad\quad\text{ 14.}\quad\quad\quad\quad\text{Append }\hat G\text{ to } R(S_{t}, A_{t})$<br>
$\large\quad\quad\text{ 15.}\quad\quad\quad\quad Q(S_{t}, A_{t})\leftarrow\text{average}(R(S_{t}, A_{t}))$<br>
$\large\quad\quad\text{ 16.}\quad\text{Return } Q(S, A)$<br>

Note: Don't forget that $R_{t+1}=R_{T}\,\,\forall\,0\le\,t\lt\,T$ when using the Blackjack env as we do in this notebook.

In [20]:
def MonteCarloPrediction_ActionValue_OffPolicy_DiscountAware_Ordinary_PerDecision_ImportanceSampling_Blackjack(targetPolicy, behaviorPolicy, env, Q_s_a, R_s_a, firstVisit=True, numEpisodes=1):
    # Per-step discount factor gamma_i (held constant here)
    gamma = 0.999
    # 3. Set timeStep to zero; time runs continuously across episodes
    timeStep = 0
    # 4. T maps the start time of each episode to its termination time
    T = {}
    # 5. Track the (state, action) pairs visited, indexed by global time step
    episodeStateAction = []
    # 6. Store the discount factor gamma_i for each global time step i
    discountFactors = [gamma]

    def sampleAction(observation):
        # Sample from b(.|s): 'hit' (1) with probability b(1|s), otherwise 'stick' (0)
        return 1 if np.random.random() < behaviorPolicy.b[(observation, 1)] else 0

    # 7. Run numEpisodes episodes (instead of looping forever as described in the pseudocode above)
    for _ in range(numEpisodes):
        # 8. Reset the environment to obtain the initial state of the episode
        observation, info = env.reset(seed=int(datetime.now().timestamp() * 1000000))
        # 8. Choose an action from behavior policy b based upon the observation
        action = sampleAction(observation)
        startTime = timeStep
        done = False
        # 9. Generate an episode from the initial state and action following behavior policy b
        while not done:
            # Record the observation (i.e. state) and action taken at this time step
            episodeStateAction.append((observation, action))
            timeStep += 1
            # Step to the next state by performing the action
            observation, reward, terminated, truncated, info = env.step(action)
            discountFactors.append(gamma)
            if terminated or truncated:
                done = True
                T[startTime] = timeStep
            else:
                # Choose the next action from behavior policy b based upon the observation
                action = sampleAction(observation)

        # The terminal reward R_T is used as R_l for every step of the episode
        # {S_0, A_0, R_1, S_1, A_1, R_2, ..., S_T-1, A_T-1, R_T}
        reward_T = reward
        # 11. Loop for each step of the episode, t=T-1, T-2, ..., startTime
        for t in range(timeStep - 1, startTime - 1, -1):
            # 12. Compute G_hat for time t (recomputed from scratch for each t)
            G_hat = 0.0
            for l in range(t + 1, T[startTime] + 1):
                # 10. W_t^l is the product of rho_k for k=t, ..., l-1, so reset it for each l
                W_t_l = 1.0
                for k in range(t, l):
                    W_t_l *= targetPolicy.pi[episodeStateAction[k]] / behaviorPolicy.b[episodeStateAction[k]]
                # Product of the discount factors gamma_i for i=t+1, ..., l-1
                prod_Gamma_i = 1.0
                for i in range(t + 1, l):
                    prod_Gamma_i *= discountFactors[i]
                G_hat += W_t_l * prod_Gamma_i * reward_T
            # 13. Update on every visit, or only if the pair did not appear earlier in this episode
            if (not firstVisit) or (episodeStateAction[t] not in episodeStateAction[startTime:t]):
                # 14. Append G_hat to the returns R(S_t, A_t)
                R_s_a.R_s_a[episodeStateAction[t]].append(G_hat)
                # 15. Set Q(S_t, A_t) to the average of Returns(S_t, A_t)
                Q_s_a.Q_s_a[episodeStateAction[t]] = sum(R_s_a.R_s_a[episodeStateAction[t]]) / len(R_s_a.R_s_a[episodeStateAction[t]])
    # 16. Return the estimate of q_pi(s, a)
    return Q_s_a
    

In [21]:
# Set the number of runs
numRuns = 1

#episodes = [1, 10, 50, 100, 200, 500, 1000, 3000, 5000, 10000]
episodes = [5000]

In [23]:
# Run the experiment
for i in range(len(episodes)):
    # Run the experiment numRuns times
    for runNum in range(numRuns):
        # Set the environment 
        env = gym.make('Blackjack-v1', natural=False, sab=False)
        # Use the RecordEpisodeStatistics wrapper to track rewards and episode lengths
        env = gym.wrappers.RecordEpisodeStatistics(env, deque_size=episodes[i])
        # Load a pre-trained target policy (serialized with pickle) from the results of the
        # Monte Carlo Control On Policy Exploring Starts First Visit notebook, for use as an example
        with open('../../MonteCarloControl/MonteCarloControl_OnPolicy_ExploringStarts_FirstVisit_Blackjack/results/target_policy.pickle', 'rb') as handle:
            pi = pickle.load(handle)
        # Initialize policy b(s) to equal probabilities 0.5 for all s in S
        b = bPolicy.behaviorPolicyBlackjack(stateSpacePlayerSum=stateSpacePlayerSum, stateSpaceDealerShows=stateSpaceDealerShows, stateSpaceUsableAce=stateSpaceUsableAce, actionSpace=actionSpace)
        # 1. Instantiate and initialize a new action value function Q(S_t, A_t)
        Q_s_a = actionValue.stateActionValueBlackjack(stateSpacePlayerSum=stateSpacePlayerSum, stateSpaceDealerShows=stateSpaceDealerShows, stateSpaceUsableAce=stateSpaceUsableAce, actionSpace=actionSpace)
        # 2. Instantiate and initialize a new returns function R(S_t, A_t) with empty lists
        R_s_a = returnsSA.returnsStateActionBlackjack(stateSpacePlayerSum=stateSpacePlayerSum, stateSpaceDealerShows=stateSpaceDealerShows, stateSpaceUsableAce=stateSpaceUsableAce, actionSpace=actionSpace)
        # Compute the estimate of the action value function Q
        actionValueResult = MonteCarloPrediction_ActionValue_OffPolicy_DiscountAware_Ordinary_PerDecision_ImportanceSampling_Blackjack(targetPolicy=pi, behaviorPolicy=b, env=env, Q_s_a=Q_s_a, R_s_a=R_s_a, firstVisit=True, numEpisodes=episodes[i])
        # Set up the grids for plots of action value and policy when usable Ace is available
        value_grid = h.create_action_value_grid(Q_s_a=actionValueResult, usable_ace=True)
        # Format a string for the title of the plot when there is a usable Ace available
        title = "MC Prediction Action Value\nOffPolicy Discount Aware Ordinary Per Decision Importance Sampling First Visit\n#Episodes=" + str(episodes[i]) + ", \u03B3 " + "\nRun# " + str(runNum+1) + ", usableAce=T\n"
        fileName ="results/MC_Prediction_ActionValue_OffPolicy_DiscountAware_Ordinary_PerDecision_ImportanceSampling_FirstVisit_Blackjack_Episodes_" + str(episodes[i]) + "_Run_" + str(runNum+1) + "_usableAce_T.png"
        h.create_action_value_plot(value_grid=value_grid, title=title, fileName=fileName, numEpisodes=episodes[i], runNum=runNum, firstVisit=True, usableAce=True)
        # Set up the grids for plots of action value and policy when usable Ace is not available
        value_grid = h.create_action_value_grid(Q_s_a=actionValueResult, usable_ace=False)
        # Format a string for the title of the plot when there is a usable Ace is not available
        title = "MC Prediction Action Value\nOff Policy Discount Aware Ordinary Per Decision Importance Sampling First Visit\n#Episodes=" + str(episodes[i]) + ", \u03B3 " + "\nRun# " + str(runNum+1) + ", usableAce=F\n"
        fileName ="results/MC_Prediction_ActionValue_OffPolicy_DiscountAware_Ordinary_PerDecision_ImportanceSampling_FirstVisit_Blackjack_Episodes_" + str(episodes[i]) + "_Run_" + str(runNum+1) + "_usableAce_F.png"
        h.create_action_value_plot(value_grid=value_grid, title=title, fileName=fileName, numEpisodes=episodes[i], runNum=runNum, firstVisit=True, usableAce=False)
        # Close the environment to clean up resources used by the environment
        env.close()

<a id="references"></a>
<h2>References</h2>

1. Sutton R.S., Barto A.G., "Reinforcement Learning, An Introduction", 2nd ed, MIT Press, Cambridge MA, 2018, p. 58. 

2. Sutton R.S., Barto A.G., "Reinforcement Learning, An Introduction", 2nd ed, MIT Press, Cambridge MA, 2018, p. 99.

3. https://gymnasium.farama.org/environments/toy_text/blackjack/

4. https://github.com/openai/gym/issues/1410

5. Sutton R.S., Barto A.G., "Reinforcement Learning, An Introduction", 2nd ed, MIT Press, Cambridge MA, 2018, p. 103-106. 

6. White A., White M. "Sample-based Learning Methods: Week 1", Sample-based Learning Methods, "Reinforcement Learning" Specialization, www.coursera.org/learn/sample-based-learning-methods/home/week1.

7. [Monte Carlo Control On Policy Exploring Starts First Visit algorithm](#../MonteCarloControl/MonteCarloControl_OnPolicy_ExploringStarts_FirstVisit_Blackjack/MonteCarloControl_OnPolicy_ExploringStarts_FirstVisit_Blackjack.ipynb)

8. Sutton R.S., Barto A.G., "Reinforcement Learning, An Introduction", 2nd ed, MIT Press, Cambridge MA, 2018, p. 114-115. 

9. Sutton R.S., "Chapter 5: Monte Carlo Methods", slide 28 of course slides at https://www.stanford.edu/class/cme241/lecture_slides/rich_sutton_slides/9-10-MC.pdf. 

10. Mahmood A., "Incremental Off-policy Reinforcement Learning Algorithms", thesis 2017, https://doi.org/10.7939/R3NG4H58D, p. 50-52. 
