<h1>Discrete Finite Markov Decision Processes: Gridworld Example</h1>


<h2>The Gridworld Experiment</h2>

> Gridworld is a well-known example of a finite Markov decision process (MDP) in which an agent must find its way through a maze represented as a discrete two-dimensional $m\times n$ grid. Below is an example of a grid where $m=20$ and $n=20$.

<center><img src="images/gridworld_grid.png"/></center>
<center>An example grid where m=20 and n=20</center>


> In Gridworld, our agent begins in the lower-left square of the maze at time $t=0$ and then chooses a series of actions. Each action is a single step from the current square to a horizontally or vertically adjacent square (no diagonal moves) that is not a wall (walls are represented by black squares).<br>

> <h3>Episodes and time steps</h3>

>> In Gridworld, an episode consists of a discrete sequence of time steps that occur from the time when the agent first begins, at the "Start" square on our grid, to the time at which our agent finally reaches the "Finish" square on the grid.

>> The time step is denoted by the variable $t$ and is set to zero at the beginning of the first episode. At this time our agent is located in the "Start" square on the grid.

>> The time step variable $t$ is then incremented by one after each action is taken by our agent until it reaches the "Finish" square on the grid.

>> The variable $T$ is used to represent the value of the time step $t$ upon which our agent reaches the "Finish" square on the grid and the episode ends. 

>> If another episode is carried out, the agent returns to the "Start" square and the time step variable $t$ is set to $T+1$.
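
>> The time-step bookkeeping described above can be sketched as a small loop. This is only an illustration: `run_episode`, `choose_action`, `step`, and `max_steps` are hypothetical names that do not come from this repo, and the one-row corridor in the usage lines stands in for a real maze.

```python
def run_episode(start, finish, choose_action, step, t=0, max_steps=10_000):
    """Run one episode and return T, the time step at which "Finish" is reached.

    choose_action(state) -> action and step(state, action) -> next_state are
    hypothetical callables supplied by the caller.
    """
    state = start
    t_start = t
    while state != finish:
        if t - t_start >= max_steps:
            raise RuntimeError("episode did not reach Finish within max_steps")
        state = step(state, choose_action(state))
        t += 1  # t is incremented by one after each action
    return t


# Toy usage: a one-row corridor in which the agent always moves right.
always_right = lambda state: "Right"
move_right = lambda state, action: (state[0] + 1, state[1])

T_first = run_episode((0, 0), (3, 0), always_right, move_right)
# The next episode starts with t = T + 1, as described above.
T_second = run_episode((0, 0), (3, 0), always_right, move_right, t=T_first + 1)
```

>> The `max_steps` guard is an added safety measure so that a policy that never reaches "Finish" does not loop forever.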

> <h3>Actions</h3>

>> The action taken at each time step is a random variable, which we denote with the symbol $A$. There are four actions available to the agent, and we reuse the symbol $A$ for the set of these actions:<br>
$$A\coloneqq\{Up, Down, Left, Right\}$$
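
>> One possible way to encode the four actions in code is as coordinate offsets. The sketch below assumes that $x$ increases to the right and $y$ increases upward (consistent with the agent starting in the lower-left square); the names `ACTIONS` and `apply_action` are hypothetical.

```python
# Hypothetical encoding of the four actions as (dx, dy) offsets.
# Assumption: x grows to the right and y grows upward.
ACTIONS = {
    "Up": (0, 1),
    "Down": (0, -1),
    "Left": (-1, 0),
    "Right": (1, 0),
}


def apply_action(state, action):
    """Return the square reached by taking `action` from `state`.

    This ignores walls and grid boundaries; it only applies the offset.
    """
    x, y = state
    dx, dy = ACTIONS[action]
    return (x + dx, y + dy)
```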

> <h3>States</h3>

>> Each white square in the maze represents a state. The state occupied by the agent is a random variable, which we denote as $S$.<br> Each observed state $s\in S$ is an ordered pair $(x,y)$ giving the position of the agent on the grid.
$$S\coloneqq\{(x, y): (x\in\mathbb{Z}) \land (0\le x<n) \land (y\in\mathbb{Z}) \land (0\le y<m)\}$$
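
>> As a rough illustration, the state set can be enumerated directly from the grid dimensions. The function name `build_states` and the `walls` argument are assumptions made for this sketch; `walls` is a hypothetical collection of black-square coordinates to exclude, so that only white squares remain as states.

```python
def build_states(m, n, walls=frozenset()):
    """Enumerate the states (x, y) with 0 <= x < n and 0 <= y < m.

    `walls` is a hypothetical collection of (x, y) squares to leave out.
    """
    return {(x, y) for x in range(n) for y in range(m) if (x, y) not in walls}


# A 20 x 20 grid with no walls has 400 states.
states = build_states(m=20, n=20)
# Removing two (made-up) wall squares leaves 398 states.
states_with_walls = build_states(m=20, n=20, walls={(1, 0), (1, 1)})
```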

> <h3>Policy</h3>

>> Our agent's policy is denoted as $\pi$ and represents the conditional probability mass function $f_{A|S}$ of actions given states:<br>
$$\large\pi\coloneqq f_{A|S}$$

>> At any given timestep $t$, our agent must choose an action $a\in A$ to move from its current state $s\in S$ to its next state $s'\in S$.

>> We denote the conditional probability of our agent selecting action $a\in A$ given its current state $s\in S$ by the following notation:
$$\large\pi(a|s)=f_{A|S}(a|s)=P(A=a|S=s)$$
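
>> A tabular policy can be stored as one probability distribution over actions per state. The sketch below uses a uniform random policy purely as an example; `uniform_policy` and `sample_action` are hypothetical helper names, and nothing here is meant to describe how the solution notebooks represent $\pi$.

```python
import random

ACTIONS = ("Up", "Down", "Left", "Right")


def uniform_policy(states):
    """Return pi as a nested dict, so that pi[s][a] == P(A=a | S=s) = 1/4."""
    p = 1.0 / len(ACTIONS)
    return {s: {a: p for a in ACTIONS} for s in states}


def sample_action(policy, state, rng=random):
    """Draw an action a with probability policy[state][a]."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


pi = uniform_policy([(0, 0), (0, 1)])
action = sample_action(pi, (0, 0))  # one of the four actions, chosen at random
```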



> <h3>Rewards</h3>

>> The reward given to our agent is a random variable which we denote as $R$. The way we define $R$ will depend on the specific solution being used to solve the Gridworld MDP. See below for a [list of gridworld solutions](#solutions) that will be discussed.

> <h3>Returns</h3>

>> The return at each time step of an MDP is a random variable we denote as $G_t$. This return is a function of the sequence of subsequent rewards (discounted by the parameter $\gamma$) obtained by the agent after time $t$ during an episode.

$$\large G_t=R_{t+1}+ \gamma R_{t+2}+ \gamma ^{2} R_{t+3}+ \dots+ \gamma ^{T-t-1}R_T$$

>> The following equivalent form of the equation above is used for brevity and generality:

$$\large G_t=\sum_{k=t+1}^{T}\gamma^{k-t-1}R_k$$

>> Note: We use this generalized form because it gives us a single mathematical representation of returns, whether or not they are discounted ($\gamma=1$ when no discounting is used). Additionally, this generalized form can be used for episodic tasks where $T<\infty$ and for continuing tasks (discussed in other sections of this repo) where $T=\infty$. When $T=\infty$, we need $\gamma<1$ for the sum to be guaranteed finite (assuming bounded rewards).
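
>> For a finite episode, the return can be computed from a recorded list of rewards. The sketch below assumes the list holds $R_{t+1},\dots,R_T$ in order; the function name `discounted_return` is hypothetical. It works backward through the rewards using the identity $G_t = R_{t+1} + \gamma G_{t+1}$, which follows from the sum above.

```python
def discounted_return(rewards, gamma=1.0):
    """Compute G_t from rewards = [R_{t+1}, R_{t+2}, ..., R_T].

    Works backward through the list using G_t = R_{t+1} + gamma * G_{t+1},
    starting from G_T = 0.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# With three rewards of 1 and gamma = 0.5: 1 + 0.5 * 1 + 0.25 * 1 = 1.75
g_discounted = discounted_return([1, 1, 1], gamma=0.5)
# With gamma = 1 (no discounting) the return is the plain sum of rewards.
g_undiscounted = discounted_return([1, 1, 1], gamma=1.0)
```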

> <h3>Environment</h3>

>> Each solution used for the gridworld MDP will make its own assumptions about the environment, its model, etc. See below for a [list of gridworld solutions](#solutions) discussed in this repo. 

<a id="solutions"></a>
> <h2>Solutions to the Gridworld MDP</h2>

>>- Dynamic Programming Solutions
>>>> 1.) [Policy Iteration](dynamic_programming_sync_policy_iteration_gridworld.ipynb)

>>>> 2.) [Value Iteration](dynamic_programming_sync_value_iteration_gridworld.ipynb)