
## Environment
We define a 2D grid world $\mathcal{G}\subset \mathbb{Z}^2$ of size $20\times 20$.

Each state $s\in \mathcal{S}$ is either a grid cell or the terminal state:

```math
\mathcal{S} = \{ (i,j)\mid 0\leq i<20,\, 0\leq j<20\}\, \cup \,\{s_T\},
```

where $s=(i,j)$ denotes the cell at row $i$ and column $j$, and $s_T = (\text{None, None})$ is a terminal state.
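
As a minimal sketch, the state space above could be enumerated in Python as follows. The names `N`, `TERMINAL`, `GRID_CELLS`, and `STATES` are illustrative assumptions, not identifiers taken from this project.

```python
from itertools import product

N = 20                      # grid side length (20 x 20)
TERMINAL = (None, None)     # s_T, the terminal state

# All grid cells (i, j) with 0 <= i < N and 0 <= j < N, in row-major order.
GRID_CELLS = list(product(range(N), range(N)))

# The full state space: every grid cell plus the terminal state.
STATES = GRID_CELLS + [TERMINAL]
```

Under these assumptions the state space contains $20\cdot 20 + 1 = 401$ states.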

## Action Space
The set of allowed actions is defined as:

```math
\mathcal{A} = \{a_1, a_2, a_3,a_4\} = \{\text{up, down, left, right}\}
```

Each action moves the agent by one cell:
- up: $(i,j)\rightarrow (i+1, j)$
- down: $(i,j)\rightarrow (i-1, j)$
- right: $(i,j)\rightarrow (i, j+1)$
- left: $(i,j)\rightarrow (i, j-1)$

Movements are clamped at the grid boundaries:
- if $i+1 \geq 20$, then $(i,j)\rightarrow (19,j)$
- if $i-1<0$, then $(i,j) \rightarrow (0,j)$
- similarly for horizontal movement in $j$
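
The action offsets and the clamping rule above can be sketched as a single deterministic step function. This is an illustrative sketch: the names `ACTIONS` and `t0` are assumptions, and the offsets follow the convention listed above, in which "up" increases the row index $i$.

```python
N = 20  # grid side length

# Row/column offsets per action, following the list above.
ACTIONS = {
    "up": (1, 0),
    "down": (-1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


def t0(state, action):
    """Deterministic next cell for a grid cell (i, j), clamped to the grid."""
    i, j = state
    di, dj = ACTIONS[action]
    # Clamp each coordinate into [0, N - 1] so the agent stays on the grid.
    ni = min(max(i + di, 0), N - 1)
    nj = min(max(j + dj, 0), N - 1)
    return (ni, nj)
```

For example, `t0((19, 5), "up")` leaves the agent at `(19, 5)` because the move would cross the boundary.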

## Terminal, Goal, Fire States
We denote the set of goal states by $\mathcal{S}_G$, the set of fire states (obstacles) by $\mathcal{S}_F$, and the terminal state by $s_T$.

The agent transitions to $s_T$ if it reaches a goal state or is explicitly terminated.

## Reward
The reward function $R:\mathcal{S}\rightarrow \mathbb{R}$ is defined as:

```math
R(s) = \begin{cases}
0 & \text{if } s=s_T  \\
r_g+r_{\text{move}} & \text{if } s\in \mathcal{S}_G \\
r_f+r_{\text{move}} & \text{if } s\in \mathcal{S}_F \\
r_{\text{move}} & \text{otherwise} \\
\end{cases},
```

where $r_g$ is the goal bonus, $r_f$ is the obstacle penalty, and $r_{\text{move}}$ is the per-step movement cost.
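
A minimal sketch of this reward function is shown below. The default values of `r_goal`, `r_fire`, and `r_move` are placeholder assumptions chosen only for illustration, since this section does not fix them.

```python
TERMINAL = (None, None)  # s_T


def reward(state, goals, fires, r_goal=10.0, r_fire=-5.0, r_move=-1.0):
    """R(s) as defined above; the numeric defaults are illustrative only."""
    if state == TERMINAL:
        return 0.0
    if state in goals:
        return r_goal + r_move   # goal bonus plus the movement cost
    if state in fires:
        return r_fire + r_move   # obstacle penalty plus the movement cost
    return r_move                # ordinary cell: movement cost only
```

Here `goals` and `fires` are sets of `(i, j)` cells standing in for $\mathcal{S}_G$ and $\mathcal{S}_F$.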

## Transition Model 

Let $s= (i,j)\in \mathcal{S}\, \backslash\, \{s_T\}$, $a\in \mathcal{A}$, and let $w\in [0,1]$ be the probability of random action substitution.

Let $T_0(s,a)$ denote the deterministic next state obtained by applying action $a$ in state $s$. The randomized transition function $T(s,a,w)$ draws the next state $s'$ from the distribution:

```math
\mathbb{P}(s'\mid s,a) = \begin{cases}
1, & \text{if } s\in \mathcal{S}_G \cup \{s_T \} \text{ and } s'=s_T \\
1- w+\frac{w}{|\mathcal{A}|}, &\text{if } s'=T_0(s,a) \text{ and } s\notin \mathcal{S}_G\\
\frac{w}{|\mathcal{A}|}, &\text{if } s'=T_0(s,a') \text{ for some } a'\neq a \text{ and } s\notin \mathcal{S}_G\\
0, &\text{o.w.}
\end{cases}.
```
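That is, with probability $1-w$ the agent takes the intended action $a$, and with probability $w$ it takes an action drawn uniformly at random from $\mathcal{A}$ (which may again be $a$). When clamping maps several actions to the same cell, their probabilities add up.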
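
The distribution above can be sketched as a function that returns a dictionary from next states to probabilities. The helper names (`t0`, `transition_probs`) are assumptions made for this example. Probabilities are accumulated per next state, so actions that clamp to the same cell are summed.

```python
N = 20
TERMINAL = (None, None)  # s_T
ACTIONS = {"up": (1, 0), "down": (-1, 0), "left": (0, -1), "right": (0, 1)}


def t0(state, action):
    """Deterministic, boundary-clamped move."""
    di, dj = ACTIONS[action]
    return (min(max(state[0] + di, 0), N - 1),
            min(max(state[1] + dj, 0), N - 1))


def transition_probs(state, action, w, goals):
    """Return P(. | state, action) as {next_state: probability}."""
    # Goal states and the terminal state move to s_T with probability 1.
    if state == TERMINAL or state in goals:
        return {TERMINAL: 1.0}
    probs = {}
    for a in ACTIONS:
        # Every action receives w / |A|; the intended one also gets 1 - w.
        p = w / len(ACTIONS) + (1.0 - w if a == action else 0.0)
        nxt = t0(state, a)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs
```

In a corner such as `(0, 0)`, both "down" and "left" clamp back to `(0, 0)`, so that cell collects the probability mass of both actions.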

## Value Iteration
Let $V_k(s)$ be the value of state $s$ at iteration $k$. The value update follows the Bellman optimality equation:

```math
V_{k+1}(s) = \max_{a\in \mathcal{A}} \bigg[ \sum_{s'\in \mathcal{S}} \, \mathbb{P} (s'\mid s,a) \, \big(R(s')+\gamma V_k(s')\big)\bigg],
```
for all $s\neq s_T$, where $\gamma$ is the discount factor; the terminal state keeps $V_k(s_T)=0$ for every $k$.
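
The update above can be sketched as a self-contained Python routine on an $n\times n$ grid. All function names, default parameter values, and the `max_iters` safety cap are assumptions made for illustration. The loop stops once the largest change in $V$ falls below a tolerance `eps`.

```python
TERMINAL = (None, None)  # s_T
ACTIONS = {"up": (1, 0), "down": (-1, 0), "left": (0, -1), "right": (0, 1)}


def t0(s, a, n):
    """Deterministic, boundary-clamped move on an n x n grid."""
    di, dj = ACTIONS[a]
    return (min(max(s[0] + di, 0), n - 1), min(max(s[1] + dj, 0), n - 1))


def transition_probs(s, a, n, w, goals):
    """P(. | s, a) as {next_state: probability}."""
    if s == TERMINAL or s in goals:
        return {TERMINAL: 1.0}
    probs = {}
    for b in ACTIONS:
        p = w / len(ACTIONS) + (1.0 - w if b == a else 0.0)
        nxt = t0(s, b, n)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs


def reward(s, goals, fires, r_goal, r_fire, r_move):
    """R(s) from the Reward section."""
    if s == TERMINAL:
        return 0.0
    if s in goals:
        return r_goal + r_move
    if s in fires:
        return r_fire + r_move
    return r_move


def value_iteration(n, goals, fires, w=0.1, gamma=0.9, eps=1e-6,
                    r_goal=10.0, r_fire=-5.0, r_move=-1.0, max_iters=10_000):
    """Iterate the Bellman optimality update until max |V_new - V| < eps."""
    states = [(i, j) for i in range(n) for j in range(n)]
    V = {s: 0.0 for s in states}
    V[TERMINAL] = 0.0  # the terminal value stays zero throughout

    def backup(s, a):
        # Expected reward-plus-discounted-value of taking action a in s.
        return sum(
            p * (reward(s2, goals, fires, r_goal, r_fire, r_move) + gamma * V[s2])
            for s2, p in transition_probs(s, a, n, w, goals).items()
        )

    for _ in range(max_iters):
        V_new = {s: max(backup(s, a) for a in ACTIONS) for s in states}
        V_new[TERMINAL] = 0.0
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < eps:
            break
    return V
```

Because rewards are collected on the state being entered, a goal cell itself has value zero in this sketch (it only leads to $s_T$), while its neighbours carry the goal bonus.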

## Optimal Policy
The optimal policy $\pi^*: \mathcal{S} \rightarrow \mathcal{A}$ is defined by:

```math
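\pi^*(s) = \operatorname*{arg\,max}_{a\in \mathcal{A}} \bigg[\sum_{s'\in \mathcal{S}}\, \mathbb{P}(s'\mid s,a) \, \big(R(s')+\gamma \, V_k(s')\big) \bigg]
```
where $V_k$ is the value function obtained when value iteration has converged.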
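
Policy extraction amounts to a per-state arg max over the bracketed expression. A minimal sketch follows, assuming a hypothetical callable `q(s, a)` that evaluates that expression for a given state and action.

```python
def greedy_policy(states, actions, q):
    """Map each state to an action maximizing q(s, a).

    `q` is a hypothetical callable standing in for the bracketed
    expectation above; ties go to the first maximal action in `actions`.
    """
    return {s: max(actions, key=lambda a: q(s, a)) for s in states}
```

Python's `max` returns the first maximal element, so the order of `actions` determines how ties are broken.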

## Simulation
Given a starting state $s_0\in \mathcal{S}\backslash \{ s_T\}$, we simulate a trajectory $\{s_0,s_1,s_2,\dots\}$ by following the policy $\pi^*$ and sampling $s_{t+1}\sim\mathbb{P}(\cdot \mid s_t, \pi^*(s_t))$, stopping when the terminal state is reached or after a maximum number of steps.

The value iteration that produces $\pi^*$ halts when

```math
\max_{s\in \mathcal{S}} |V_{k+1}(s)-V_k(s)| < \epsilon
```
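
The trajectory simulation described in this section can be sketched as follows. The function names, the `policy` dictionary format, and the `seed` parameter are illustrative assumptions. Sampling replaces the intended action with a uniformly random one with probability `w`, which matches the transition model.

```python
import random

TERMINAL = (None, None)  # s_T
ACTIONS = {"up": (1, 0), "down": (-1, 0), "left": (0, -1), "right": (0, 1)}


def t0(s, a, n):
    """Deterministic, boundary-clamped move on an n x n grid."""
    di, dj = ACTIONS[a]
    return (min(max(s[0] + di, 0), n - 1), min(max(s[1] + dj, 0), n - 1))


def sample_next(s, a, n, w, goals, rng):
    """Draw s' ~ P(. | s, a) for a non-terminal state s."""
    if s in goals:
        return TERMINAL
    if rng.random() < w:
        # Random substitution: any action, possibly the intended one again.
        a = rng.choice(list(ACTIONS))
    return t0(s, a, n)


def simulate(s0, policy, n, w, goals, max_steps=100, seed=0):
    """Follow `policy` from s0 until s_T is reached or max_steps elapse."""
    rng = random.Random(seed)
    trajectory = [s0]
    s = s0
    for _ in range(max_steps):
        if s == TERMINAL:
            break
        s = sample_next(s, policy[s], n, w, goals, rng)
        trajectory.append(s)
    return trajectory
```

Here `policy` is a dictionary from grid cells to action names, for instance one produced from a converged value function.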


