<h1>Temporal Difference Methods For Prediction of State Value Function: Using On Policy One-Step TD(0) for Windy Gridworld</h1>


Temporal Difference (TD) methods are another class of tabular solution methods that can be used to estimate the state value function $V(S_{t})$ described by the Bellman equations.[[4]](#References) This notebook contains functions that estimate $V(S_{t})\approx v_{\pi}(s_{t})$ using the one-step TD(0) algorithm for the Windy Gridworld problem described in [[5]](#References).

This notebook will require the following Python modules:

In [19]:
import random
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.image as mpimg
from IPython.display import Image
import seaborn as sns
import pickle
import import_ipynb
import envWindyGridworld as wg
import targetPolicyWindyGridworld as tPolicy
import returnsStateWindyGridworld as returnsStateValueWG
import stateValueWindyGridworld as stateValueWG
import helpers as h
# import ipynb.fs.full.envWindyGridworld as wgw
# import ipynb.fs.full.targetPolicyWindyGridworld as tPolicy
# import ipynb.fs.full.returnsStateWindyGridworld as returnsStateWindyGridworld
# import ipynb.fs.full.stateValueWindyGridworld as stateValueWindyGridworld
# import ipynb.fs.full.helpers as h

<h2>Environment Windy Gridworld</h2>

<img src="images/gridworld.png" style="display:block; margin:auto"/>

The Windy Gridworld environment is composed of a class containing the maze itself, the rules for wind effects within the maze and helper functions to compute the rewards associated with any of the available actions. See the [Windy Gridworld environment class notebook](envWindyGridworld.ipynb) for details.

<h3>Environment Windy Gridworld: Maze</h3>

The maze is represented as a 20x20 numpy array of integers where each element of the 2D array represents a position in the maze and the value of each element signifies whether the position can be occupied by the agent (i.e. the position is not a wall) or not (i.e. the position is a wall).

$$\large maze[i,j]
    = 
    \begin{cases}
    -1\quad\,\,\,\,\,\,\,\normalsize\text{position is a wall} \\
    0\quad\quad \,\,\,\,\,\,\normalsize\text{position is not a wall} 
    \end{cases}
$$

A predefined maze has been saved as a 2D numpy array. Let's load the 2D numpy array.

In [20]:
maze = np.load('./maze.npy')

In [21]:
maze

array([[ 0, -1,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,
         0, -1,  0,  0],
       [ 0, -1,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,
         0, -1,  0,  0],
       [ 0,  0,  0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         0, -1,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0, -1,  0,  0],
       [ 0,  0, -1,  0,  0,  3,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0, -1,  0,  0,  3, -1,  0,  0,  0,  0, -1,  0,  0,  0,  0,
         0,  0,  0, -1],
       [ 0,  0, -1,  0,  0,  3, -1,  0,  0, -1, -1, -1,  0,  0,  0, -1,
        -1, -1, -1,  0],
       [ 0,  0, -1,  0,  0,  3, -1,  0,  0,  0,  0, -1,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0, -1,  0,  0,  3,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  3,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0, -1, -1, -1

<h3>Environment Windy Gridworld: Wind Direction and Wind Strength</h3>

The windDirections object is represented as a numpy array having the same shape as the maze object. Each position in the windDirections 2D numpy array holds an integer representing the direction in which the wind is blowing.

$$\large windDirections[i,j]
    = 
    \begin{cases}
    0\quad\quad \,\,\,\,\,\,\normalsize\text{position is not a wall and wind blows N } \\
    1\quad\quad \,\,\,\,\,\,\normalsize\text{position is not a wall and wind blows S } \\
    2\quad\quad \,\,\,\,\,\,\normalsize\text{position is not a wall and wind blows W } \\
    3\quad\quad \,\,\,\,\,\,\normalsize\text{position is not a wall and wind blows E } 
    \end{cases}
$$

A predefined windDirections has been saved as a 2D numpy array. Let's load the 2D numpy array.

In [22]:
windDirections = np.load('windDirections.npy')

In [23]:
windDirections

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.

The windStrengths object is represented as a numpy array having the same shape as the maze object. Each position in the windStrengths 2D numpy array holds an integer representing the number of positions an agent will be moved by the wind in the direction in which it is blowing.

$$\large windStrengths[i,j]
    = 
    \begin{cases}
    0\quad\quad \,\,\,\,\,\,\normalsize\text{position is not a wall and wind is not blowing} \\
    1\quad\quad \,\,\,\,\,\,\normalsize\text{position is not a wall and wind blows with intensity 1 } \\
    2\quad\quad \,\,\,\,\,\,\normalsize\text{position is not a wall and wind blows with intensity 2 } \\
    3\quad\quad \,\,\,\,\,\,\normalsize\text{position is not a wall and wind blows with intensity 3 } \\
    4\quad\quad \,\,\,\,\,\,\normalsize\text{position is not a wall and wind blows with intensity 4 }
    \end{cases}
$$

A predefined windStrengths has been saved as a 2D numpy array. Let's load the 2D numpy array.

In [24]:
windStrengths = np.load('windStrengths.npy')

In [25]:
windStrengths

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 3., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 3., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 3., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 3., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 3., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 3., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.

Notice that the positions (4, 5), (5, 5), (6, 5), (7, 5), (8, 5) and (9, 5) are all "windy": the wind at each of them blows south (direction code 1) with strength 3.
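The effect of a windy cell can be sketched in a few lines. This is only an illustration under stated assumptions, not the logic of the `envWindyGridworld` class: the helper `apply_wind`, the toy grids, and the rule that the wind pushes the agent one cell at a time and stops at walls or grid edges are all hypothetical. North is assumed to decrease the row index.

```python
# Row/column offsets for the wind direction codes defined above:
# 0 = N (row - 1), 1 = S (row + 1), 2 = W (column - 1), 3 = E (column + 1).
WIND_OFFSETS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}


def apply_wind(position, maze, wind_directions, wind_strengths):
    """Push the agent cell by cell; stop at walls and grid edges (assumed rule)."""
    row, col = position
    d_row, d_col = WIND_OFFSETS[int(wind_directions[row][col])]
    for _ in range(int(wind_strengths[row][col])):
        new_row, new_col = row + d_row, col + d_col
        inside = 0 <= new_row < len(maze) and 0 <= new_col < len(maze[0])
        if not inside or maze[new_row][new_col] == -1:
            break
        row, col = new_row, new_col
    return (row, col)


# Hypothetical 6x3 toy grid with a single windy cell at (1, 1).
toy_maze = [[0] * 3 for _ in range(6)]
toy_directions = [[0] * 3 for _ in range(6)]
toy_strengths = [[0] * 3 for _ in range(6)]
toy_directions[1][1] = 1  # wind blows S
toy_strengths[1][1] = 3   # strength 3

print(apply_wind((1, 1), toy_maze, toy_directions, toy_strengths))  # (4, 1)
print(apply_wind((0, 0), toy_maze, toy_directions, toy_strengths))  # (0, 0)
```

A cell with strength 0 leaves the agent where it is, which is why the direction code 0 stored in calm cells has no effect.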

<h3>Environment Windy Gridworld: State Space</h3>

Each white square in the maze represents a state which is a random variable denoted as $S$.
Each observed state $s\in S$ is defined as an ordered pair $(x, y)$ which represents the position of the agent on the grid.

$$\large S\coloneqq\{(x, y)\in\mathbb{Z}^2:\; 0\le x<n,\; 0\le y<m\}$$
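As a rough sketch of how the occupiable states could be listed from the maze encoding, the hypothetical helper `enumerate_states` below collects every position whose value is not -1. It assumes the first coordinate of a state is the row index and uses a small made-up maze rather than `maze.npy`.

```python
def enumerate_states(maze):
    """Return all (row, column) positions that are not walls (value -1)."""
    return [
        (i, j)
        for i, row in enumerate(maze)
        for j, cell in enumerate(row)
        if cell != -1
    ]


# Hypothetical 2x3 maze with one wall at (0, 1).
toy_maze = [
    [0, -1, 0],
    [0,  0, 0],
]

states = enumerate_states(toy_maze)
print(states)       # [(0, 0), (0, 2), (1, 0), (1, 1), (1, 2)]
print(len(states))  # 5
```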

<h3>Environment Windy Gridworld: Action Space</h3>

Each action is a random variable which we denote with the symbol $A$, and there are four actions available to the agent. Action "0" moves the agent one step "Up", "1" one step "Down", "2" one step "Left" and "3" one step "Right".


$\large A\coloneqq\{a\in\mathbb{Z}:\quad\, a=0\text{ when agent moves "Up",}$<br>
$\large\quad\quad\quad\quad\quad\quad\quad a=1\text{ when agent moves "Down",}$<br>
$\large\quad\quad\quad\quad\quad\quad\quad a=2\text{ when agent moves "Left",}$<br>
$\large\quad\quad\quad\quad\quad\quad\quad a=3\text{ when agent moves "Right"}\}$

In [26]:
# Define the action space A where:
# a=0 when agent moves "Up"
# a=1 when agent moves "Down"
# a=2 when agent moves "Left"
# a=3 when agent moves "Right"
#actionSpace = {0:"Up", 1:"Down", 2:"Left", 3:"Right"}
actionSpace = [i for i in range(4)]
actionSpace



[0, 1, 2, 3]
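To make the action codes concrete, the sketch below maps each code to a row/column offset and applies it to a position. The names `ACTION_OFFSETS` and `move` are hypothetical, "Up" is assumed to decrease the row index, and the rule that a blocked move leaves the agent in place is an assumption rather than the documented behaviour of the environment class.

```python
# Assumed offsets: 0 = Up, 1 = Down, 2 = Left, 3 = Right.
ACTION_OFFSETS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}


def move(state, action, maze):
    """Return the next position, staying put if the move hits a wall or the edge."""
    row, col = state
    d_row, d_col = ACTION_OFFSETS[action]
    new_row, new_col = row + d_row, col + d_col
    inside = 0 <= new_row < len(maze) and 0 <= new_col < len(maze[0])
    if not inside or maze[new_row][new_col] == -1:
        return state
    return (new_row, new_col)


# Hypothetical 2x2 maze with one wall at (0, 1).
toy_maze = [
    [0, -1],
    [0,  0],
]

print(move((1, 0), 0, toy_maze))  # (0, 0): one step Up
print(move((1, 0), 3, toy_maze))  # (1, 1): one step Right
print(move((0, 0), 3, toy_maze))  # (0, 0): blocked by the wall
print(move((0, 0), 0, toy_maze))  # (0, 0): blocked by the grid edge
```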

<h3>Environment Windy Gridworld: Episodes and Time Steps</h3>

In Gridworld, an episode consists of a discrete sequence of time steps that occur from the time when the agent first begins, at the "Start" square on our grid, to the time at which our agent finally reaches the "Finish" square on the grid.
    
The time step is denoted by the variable $t$ and is set to zero at the beginning of the first episode. At this time our agent is located in the "Start" square on the grid.

The time step variable $t$ is then incremented by one after each action is taken by our agent until it reaches the "Finish" square on the grid.

The variable $T$ is used to represent the value of the time step $t$ upon which our agent reaches the "Finish" square on the grid and the episode ends.

If another episode is carried out, the agent returns to the "Start" square and the time step variable $t$ is set to $T+1$.

<h3>Environment Windy Gridworld: Rewards and Returns</h3>

<h3>Environment Windy Gridworld: Rewards</h3>

At each time step the reward of an action taken is either -1 or 0 depending upon whether the next state is a finish line or non-finish line square on the grid. 

$$
    \large R_{t+1}
    =
    \begin{cases}
    0\quad\quad \Large s_{t+1}\normalsize =\text{ finish line square on the grid } \\
    -1\quad\, \Large s_{t+1}\normalsize\neq\text{ finish line square on the grid}
    \end{cases}
    $$
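The reward rule above, together with the discounted return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$, can be sketched as follows. The function names `reward` and `discounted_return` and the short trajectory are hypothetical and chosen only for illustration.

```python
def reward(next_state, finish_position):
    """0 when the next state is the finish square, -1 otherwise."""
    return 0 if tuple(next_state) == tuple(finish_position) else -1


def discounted_return(rewards, gamma):
    """Accumulate G_t = R_{t+1} + gamma * G_{t+1}, working backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


finish = (0, 19)
# Hypothetical sequence of next states visited by the agent.
next_states = [(1, 18), (0, 18), (0, 19)]
rewards = [reward(s, finish) for s in next_states]

print(rewards)  # [-1, -1, 0]
print(discounted_return(rewards, gamma=0.99))
```

Because every non-finishing step costs -1, a longer path to the finish square yields a more negative return.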

<h3>State Value Function</h3>

According to the Bellman equations, the state value function $v_{\pi}(s_{t})$ at time $t$ within an episode is the expected return when starting in state $s_{t}\in S$ and following policy $\pi$. [\[1\]](#References)

$$\large
\begin{aligned}
v_\pi(s_t) &\coloneqq \mathbb{E} [G_t \,|\, S_t=s_t] \\
&= \mathbb{E} [R_{t+1} + \gamma G_{t+1} \,|\, S_t=s_t] \\
&= \mathbb{E} [R_{t+1} + \gamma v_{\pi}(S_{t+1}) \,|\, S_{t}=s_{t}]
\end{aligned}
$$


Unlike Monte Carlo state value prediction methods, TD methods update the estimate of the state value function at each time-step within an episode (instead of updating the estimate at the end of each episode).

$$\large V(S_{t})\leftarrow V(S_{t})+\alpha[R_{t+1}+\gamma V(S_{t+1})-V(S_{t})]$$

An $m\times n$ matrix $V$ is used to store the estimated value of each state $v_\pi(s)$ on the grid where, according to the Bellman equations, the state value represents the expected return of the current state while following policy $\pi$.[[4]](#References)
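A single application of the TD(0) update rule can be worked through numerically. The helper `td0_update` and the values used here are hypothetical and only illustrate the arithmetic of the TD target $R_{t+1}+\gamma V(S_{t+1})$ and the TD error.

```python
def td0_update(v_s, v_next, reward, step_size, discount):
    """Return the updated estimate of V(S_t) after one TD(0) step."""
    td_target = reward + discount * v_next   # R_{t+1} + gamma * V(S_{t+1})
    td_error = td_target - v_s               # how far V(S_t) is from the target
    return v_s + step_size * td_error


# Hypothetical values: V(S_t) = 0.5, V(S_{t+1}) = 0.2, R_{t+1} = -1.
new_value = td0_update(v_s=0.5, v_next=0.2, reward=-1, step_size=0.1, discount=0.99)
print(round(new_value, 4))  # 0.3698
```

Here the TD target is $-1 + 0.99\cdot 0.2 = -0.802$, the TD error is $-1.302$, and the estimate moves one tenth of the way toward the target.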

<h2>Policy</h2>

Our agent's policy is denoted as $\pi$ and is used to represent the conditional probability mass function $f_{A|S}$ of actions over the states:<br>
    $$\large \pi\coloneqq f_{A|S}$$
    At any given timestep $t$, our agent must choose an action $a\in A$ to move from its current state $s\in S$ to its next state $s'\in S$.
    We denote the conditional probability of our agent selecting action $a\in A$ given its current state $s\in S$ by the following notation:
    $$\large \pi(a|s)=f_{A|S}(a|s)=P(A=a|S=s)$$

Regardless of the distribution $f_{A|S}(a|s)$, policy $\pi$ deterministically chooses the action which maximizes the conditional probability over all actions for a particular given state $s\in S$.

$$
\large P(A=a|S=s)
=
\begin{cases}
1,\quad\quad\, \Large a=\underset{a'\in A}{\operatorname{arg\,max}}\,\pi(a'|s) \\
0,\quad\quad\, \Large \text{otherwise}
\end{cases}
$$

A policy used in Windy Gridworld is represented by an $m\times n\times|A|$ numpy array pi where each element represents the probability of taking action $a\in A$ while in the state $s=(i,j)$.

As an example, a random policy, example_pi, that takes each action in $A$ with equal probability in a $2\times2$ maze would have the shape (2,2,4).

In [27]:
example_pi = tPolicy.policyWindyGridworld(stateSpaceMazeRows=20, stateSpaceMazeColumns=20)
example_pi.Pi


OrderedDict([(((0, 0), 0), 0.25),
             (((0, 0), 1), 0.25),
             (((0, 0), 2), 0.25),
             (((0, 0), 3), 0.25),
             (((0, 1), 0), 0.25),
             (((0, 1), 1), 0.25),
             (((0, 1), 2), 0.25),
             (((0, 1), 3), 0.25),
             (((0, 2), 0), 0.25),
             (((0, 2), 1), 0.25),
             (((0, 2), 2), 0.25),
             (((0, 2), 3), 0.25),
             (((0, 3), 0), 0.25),
             (((0, 3), 1), 0.25),
             (((0, 3), 2), 0.25),
             (((0, 3), 3), 0.25),
             (((0, 4), 0), 0.25),
             (((0, 4), 1), 0.25),
             (((0, 4), 2), 0.25),
             (((0, 4), 3), 0.25),
             (((0, 5), 0), 0.25),
             (((0, 5), 1), 0.25),
             (((0, 5), 2), 0.25),
             (((0, 5), 3), 0.25),
             (((0, 6), 0), 0.25),
             (((0, 6), 1), 0.25),
             (((0, 6), 2), 0.25),
             (((0, 6), 3), 0.25),
             (((0, 7), 0), 0.25),
             (

In [28]:
del example_pi
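The printed mapping above is keyed by `((row, column), action)` pairs. Assuming that key layout, a greedy action with random tie-breaking could be read off such a mapping as sketched below. The helper `greedy_action` and the toy policy are hypothetical and do not use the `policyWindyGridworld` class.

```python
import random
from collections import OrderedDict


def greedy_action(pi, state, actions, rng=random):
    """Pick an action with maximal pi(a|s), breaking ties uniformly at random."""
    best = max(pi[(state, a)] for a in actions)
    candidates = [a for a in actions if pi[(state, a)] == best]
    return rng.choice(candidates)


actions = [0, 1, 2, 3]
# Uniform probabilities in state (0, 0), a clear favourite in state (0, 1).
toy_pi = OrderedDict((((0, 0), a), 0.25) for a in actions)
toy_pi.update({((0, 1), 0): 0.1, ((0, 1), 1): 0.1, ((0, 1), 2): 0.1, ((0, 1), 3): 0.7})

print(greedy_action(toy_pi, (0, 1), actions))             # 3
print(greedy_action(toy_pi, (0, 0), actions) in actions)  # True
```

Under the uniform random policy every action ties, so the greedy choice reduces to a uniformly random action.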

<h2>State Value Function</h2>

As an example, the 2D numpy array V for a maze of shape $2\times2$ represents the state value $v_\pi(s)$ for each $s\in S$ and would have the shape (2,2).

In [29]:
example_V = np.random.sample((2,2))
print(example_V)
example_V.shape

[[0.67894282 0.19162529]
 [0.99447387 0.75623811]]


(2, 2)

In [30]:
del example_V

<h3>TD(0) Prediction State Value Algorithm</h3> 

Pseudocode for the TD(0) Prediction State Value algorithm, which estimates the state value function $V(s_t)$, is given below, following Sutton and Barto. [[1]](#References)

$\large\quad\quad\text{TD(0) Prediction State Value Algorithm}$<br>

$\large\quad\quad\text{Inputs:}\,\,\text{target policy }\pi,\,\,\text{environment},\,\,\text{stepSize }\alpha,\,\,\text{discount factor }\gamma,\,\,\text{tolerance},\,\,\text{numEpisodes}$<br>
$\large\quad\quad\text{ 1.}\quad V(terminal)\leftarrow0,\,V(s)\leftarrow\text{randomly chosen state value for all non-terminal states}$<br>
$\large\quad\quad\text{ 1.}\quad R(S_{t}, A_{t})\leftarrow \text{empty list for all state action pairs (not used by the TD(0) update)}$<br>

$\large\quad\quad\text{ 2.}\quad\text{Loop forever (for each episode)}:$<br>
$\large\quad\quad\text{ 3.}\quad\quad\text{Initialize }S_{t}\leftarrow\text{start state}$<br>
$\large\quad\quad\text{ 4.}\quad\quad\text{Loop for each step of episode}:$<br>
$\large\quad\quad\text{ 5.}\quad\quad\quad A\leftarrow\text{ action given by policy }\pi\text{ for }S_{t}$<br>
$\large\quad\quad\text{ 6.}\quad\quad\quad\text{Take action }A\text{, observe }R\text{ and }S_{t+1}$<br>
$\large\quad\quad\text{ 7.}\quad\quad\quad V(S_{t})\leftarrow\,V(S_{t})\,+\alpha[R+\gamma V(S_{t+1})\,-\,V(S_{t})]$<br>
$\large\quad\quad\text{ 8.}\quad\quad\quad S_{t}\leftarrow S_{t+1}$<br>
$\large\quad\quad\text{ 9.}\quad\quad\quad\text{until }S_{t}\text{ is terminal}$<br>
$\large\quad\quad\text{10.}\quad\text{Return }V$<br>

This simple one-step TD method is called TD(0). It is a special case of the bootstrapping method TD($\lambda$) and combines the sampling used in Monte Carlo methods with the bootstrapping used in Dynamic Programming methods. [[1]](#References)[[2]](#References) 

In [31]:
def TemporalDifferencePrediction_StateValue_TD_0_WindyGridworld(targetPolicy, environment, V_s, R_s, stepSize, discountFactor, tolerance, numEpisodes):
    # Note: R_s and tolerance are not used by the TD(0) update below
    # 2. Loop for each episode
    for _ in range(numEpisodes):
        # 3. Initialize the temporary placeholder S with the starting position
        S = environment.StartPosition()
        # 4. Loop over each step in the episode
        isDone = False
        while not isDone:
            # 5. Find action A by finding the action that maximizes pi(A=a|S=s), breaking ties at random
            argmax_A = []
            max_p_a_given_s = 0
            for action in environment.Actions:
                p_a_given_s = targetPolicy[S][action]
                if p_a_given_s > max_p_a_given_s:
                    max_p_a_given_s = p_a_given_s
                    argmax_A = [action]
                elif p_a_given_s == max_p_a_given_s:
                    argmax_A.append(action)
            A = random.choice(argmax_A)

            # 6. Take action A and observe R and S_{t+1}
            next_state, r = environment.transitions(state=S, action=A)
            # Set the isDone flag to true if the next state is the finishing position on the grid/maze
            if (next_state[0] == environment.FinishPosition[0]) and (next_state[1] == environment.FinishPosition[1]):
                isDone = True
            # 7. Update the estimate of V(S_{t}) towards the TD target r + gamma * V(S_{t+1})
            V_s.V_s[S] = V_s.V_s[S] + stepSize * (r + discountFactor * V_s.V_s[next_state] - V_s.V_s[S])
            # 8. Set S to S_{t+1}
            S = next_state

    # 10. Return the estimated state value function
    return V_s
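To see the same update loop in isolation from the notebook's helper classes, here is a minimal sketch on a hypothetical five-state corridor. It is not the Windy Gridworld environment: the function `td0_corridor`, the fixed "always move right" policy and the parameter values are assumptions made for illustration. The reward follows the same convention as above (0 on reaching the terminal state, -1 otherwise).

```python
def td0_corridor(num_states=5, num_episodes=200, step_size=0.5, discount=1.0):
    """TD(0) prediction on a corridor 0 -> 1 -> ... -> terminal."""
    terminal = num_states - 1
    V = [0.0] * num_states  # V(terminal) is initialised to 0 and never updated
    for _ in range(num_episodes):
        s = 0  # every episode starts at the left end
        while s != terminal:
            s_next = s + 1                        # fixed policy: always move right
            r = 0 if s_next == terminal else -1   # reward convention from above
            V[s] += step_size * (r + discount * V[s_next] - V[s])  # TD(0) update
            s = s_next
    return V


V = td0_corridor()
print([round(v, 3) for v in V])
```

With a discount factor of 1, the estimates should approach -3, -2, -1 and 0 for states 0 to 3, since each state's value is minus the number of non-finishing steps that remain.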

<h3>Perform TD(0) for prediction of state value function</h3>

In [32]:

# Set the step size for updating the estimate of the state value function
stepSize = 0.1
# Set the discount factor gamma for rewards
gamma = 0.99

# Set a tolerance for comparison of updated state values for state value estimation
tolerance = 0.0000000001

# Select a square in the Windy Gridworld maze to act as the starting line square
start_position = (19, 0)

# Select a square in the Windy Gridworld maze to act as the finish line square
finish_position = (0,19)

In [33]:
# Set the number of runs
numRuns = 1
# numEpisodes [1, 10, 50, 100, 200, 500, 1000, 3000, 5000, 10000]
episodes = [1]

An environment can be instantiated by choosing a start and a finish position and then calling the envWindyGridworld constructor, which takes the paths to the numpy arrays representing the maze, the wind directions and the wind strengths, the start and finish positions as tuples, and the action space.

Now run the TD(0) algorithm on the Windy Gridworld environment object using the parameters set above.


In [38]:
# Run the experiment
for numEpisodes in episodes:
    # Run an experiment numRuns times
    for runNum in range(numRuns):
        # Set the environment
        environment = wg.envWindyGridworld(path_to_maze="./maze.npy", path_to_windDirections="./windDirections.npy", path_to_windStrengths="./windStrengths.npy", start_position=start_position, finish_position=finish_position, actionSpace=actionSpace)
        # Open a pre-trained target policy that was serialized with pickle
        # pi = pickle.load()
        # Instantiate a new random policy where P(A=a|S=s)=0.25 for all a in A and all s in S
        pi = tPolicy.policyWindyGridworld(stateSpaceMazeRows=maze.shape[0], stateSpaceMazeColumns=maze.shape[1])
        # 1. Instantiate and initialize a new state value function V(S_t)
        V_s = stateValueWG.stateValueWindyGridworld(stateSpace=environment.stateSpaceWindyGridworld, actionSpace=actionSpace)
        # 1. Set the state value of walls within the maze to -inf and of the finish position to 0
        for i in range(environment.Maze.shape[0]):
            for j in range(environment.Maze.shape[1]):
                if environment.Maze[i, j] == -1:
                    V_s.V_s[(i, j)] = float('-inf')
                elif i == environment.FinishPosition[0] and j == environment.FinishPosition[1]:
                    V_s.V_s[(i, j)] = 0.
        # Instantiate and initialize a new returns function R(S_t) with empty lists
        R_s = returnsStateValueWG.returnsStateWindyGridworld(stateSpace=environment.stateSpaceWindyGridworld, actionSpace=actionSpace)
        # Run the TD(0) prediction algorithm to compute the estimate of the state value function V_s
        V_s = TemporalDifferencePrediction_StateValue_TD_0_WindyGridworld(targetPolicy=pi, environment=environment, V_s=V_s, R_s=R_s, stepSize=stepSize, discountFactor=gamma, tolerance=tolerance, numEpisodes=numEpisodes)
        # Set up the grids for plots of state value
        value_grid = h.create_state_value_grid(V_s=V_s)
        # Format a string for the title of the plot
        title = "Temporal Difference TD(0) Prediction State Value\nOn Policy #Episodes=" + str(numEpisodes) + "\nstepSize=" + str(stepSize) + ", \u03B3=" + str(gamma) + "\nRun# " + str(runNum+1)
        fileName = "results/temporalDifference_Prediction_StateValue_TD_0_windyGridworld_Episodes_" + str(numEpisodes) + "stepSize_" + str(stepSize) + "_DiscountFactor_" + str(gamma) + "_run_" + str(runNum+1) + ".png"
        h.create_state_value_plot(value_grid=value_grid, title=title, fileName=fileName, numEpisodes=numEpisodes, runNum=runNum)
        # Remember where the figure was written (assumes create_state_value_plot saves it to fileName)
        path_to_state_value_fig = fileName

<h3>TD(0) Estimation of State Value Function Results for Windy Gridworld</h3>

The resulting estimates for the state value function for all states in the Windy Gridworld environment are provided below where each position in the maze contains:
1. the direction of the wind (if any)
2. the estimated state value $v_{\pi}$ of each position on the grid while following policy $\pi$

In [None]:
Image(filename=path_to_state_value_fig)

<h2>References</h2>

1. Sutton R.S., Barto A.G., "Reinforcement Learning: An Introduction", 2nd ed., MIT Press, Cambridge MA, 2018, p. 120.
2. Sutton R.S., Barto A.G., "Reinforcement Learning: An Introduction", 2nd ed., MIT Press, Cambridge MA, 2018, pp. 142-145.
3. Sutton R.S., Barto A.G., "Reinforcement Learning: An Introduction", 2nd ed., MIT Press, Cambridge MA, 2018, pp. 287-301.
4. Sutton R.S., Barto A.G., "Reinforcement Learning: An Introduction", 2nd ed., MIT Press, Cambridge MA, 2018, p. 58.
5. Sutton R.S., Barto A.G., "Reinforcement Learning: An Introduction", 2nd ed., MIT Press, Cambridge MA, 2018, p. 130.

