### Lab: Value Iteration in a Grid World

### University of Virginia
### Reinforcement Learning
#### Last updated: December 11, 2023

---

#### Instructions:

Implement value iteration for a $4 \times 3$ gridworld environment to compute the value of each state. A robot in this world makes discrete moves: one step up, down, left, or right. These actions are deterministic, meaning that the selected action is carried out with probability 1. There is a terminal state with reward +1 in the bottom-right corner. All other states have reward 0. The discount factor is $\gamma=0.9$. Use tolerance $\theta=0.01$. Show all code and results.
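
For reference, one common way to write the value iteration update for this deterministic setting is sketched below. It relies on two assumptions that the lab does not state: the reward $R(s')$ is collected on entering the next state $s'$, and the terminal state's value is held at 0. Here $\gamma$ is the discount factor (0.9), $V_k$ is the value estimate after sweep $k$, and $\text{next}(s, a)$ is a hypothetical name for the state reached by taking action $a$ in state $s$.

$$
V_{k+1}(s) = \max_{a \in \{\text{up},\, \text{down},\, \text{left},\, \text{right}\}} \left[ R(s') + \gamma V_k(s') \right], \qquad s' = \text{next}(s, a)
$$

Sweeps repeat until $\max_s |V_{k+1}(s) - V_k(s)| < \theta$.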

**Note**: Do not use the `networkx`, `gym`, or `gymnasium` libraries when solving this problem.

#### Total Points: 12

---

#### 1) **(POINTS: 2)** As part of your solution, create a GridWorld class with these attributes:

- `nrows` : number of rows in the grid
- `ncols` : number of columns in the grid

and these methods:

- `value_iteration()` : behavior described in part 2 below
- `get_reward()` : given the agent row and column, return the reward

The class may include additional attributes and methods as well.

Create an instance using the class, and call `nrows`, `ncols`, and `get_reward()` to verify correctness.

You will not be graded on the implementation of `value_iteration()` for this problem.

#### 2) **(POINTS: 8)** Here, you will be graded on the implementation of `value_iteration()`.
Call `value_iteration()` to calculate and return the value function array. For each sweep over the states, have the function print out the intermediate array.


#### Enter all code here (you may also use multiple cells)

In [1]:
import numpy as np

#### 1) Create and test the class

In [18]:
class GridWorld():
    def __init__(self, nrows, ncols):
        self.nrows = nrows
        self.ncols = ncols
        self.values = np.zeros((nrows, ncols))
        self.actions = ['up', 'down', 'left', 'right']

    def value_iteration(self, theta=0.01, discount_factor=0.9, max_iter=100):
        # Row/column offsets for up, down, left, right (same order as self.actions).
        moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        for _ in range(max_iter):
            delta = 0
            v = np.copy(self.values)
            for i in range(self.nrows):
                for j in range(self.ncols):
                    # The terminal state is never updated; its value stays 0.
                    if i == self.nrows - 1 and j == self.ncols - 1:
                        continue
                    max_value = -np.inf
                    for di, dj in moves:
                        # Assumption: a move off the grid leaves the agent in place.
                        ni = min(max(i + di, 0), self.nrows - 1)
                        nj = min(max(j + dj, 0), self.ncols - 1)
                        # Assumption: the reward is collected on entering (ni, nj).
                        q = self.get_reward(ni, nj) + discount_factor * self.values[ni, nj]
                        max_value = max(max_value, q)
                    v[i, j] = max_value
                    delta = max(delta, abs(v[i, j] - self.values[i, j]))
            self.values = v
            # Print the value array after every sweep, including the last one.
            print(self.values)
            if delta < theta:
                break

        return self.values

    def get_reward(self, row, col):
        if row == self.nrows - 1 and col == self.ncols - 1:
            return 1
        else:
            return 0

In [19]:
gw = GridWorld(4, 3)
print(gw.nrows)
print(gw.ncols)
print(gw.get_reward(0, 0))
print(gw.get_reward(2, 1))
print(gw.get_reward(3, 2))

4
3
0
0
1


#### 2) Run value iteration

In [20]:
gw.value_iteration(max_iter=10)

[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]
[[0.  0.  0. ]
 [0.  0.  0.9]
 [0.  0.9 1. ]
 [0.9 1.  0. ]]
[[0.   0.   0.81]
 [0.   0.81 0.9 ]
 [0.81 0.9  1.  ]
 [0.9  1.   0.  ]]
[[0.    0.729 0.81 ]
 [0.729 0.81  0.9  ]
 [0.81  0.9   1.   ]
 [0.9   1.    0.   ]]
[[0.6561 0.729  0.81  ]
 [0.729  0.81   0.9   ]
 [0.81   0.9    1.    ]
 [0.9    1.     0.    ]]
[[0.6561 0.729  0.81  ]
 [0.729  0.81   0.9   ]
 [0.81   0.9    1.    ]
 [0.9    1.     0.    ]]


array([[0.6561, 0.729 , 0.81  ],
       [0.729 , 0.81  , 0.9   ],
       [0.81  , 0.9   , 1.    ],
       [0.9   , 1.    , 0.    ]])

#### 3) **(POINTS: 2)** Based on the value function: After the agent has moved right or down, does it ever make sense for it to backtrack (move up or left)? Explain your reasoning.
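
One way to explore this question is to read the greedy actions off a value array and check whether `up` or `left` is ever among the best choices. The sketch below is illustrative only. The helper `greedy_actions` and the `MOVES` table are hypothetical names. The sketch assumes that the reward is collected on entering a cell, that a move off the grid leaves the agent in place, and that the terminal cell has value 0. None of these is stated in the lab. The value array is built directly from the closed form implied by those assumptions rather than taken from any run.

```python
import numpy as np

# Assumptions (not stated in the lab): the reward is collected on entering a
# cell, a move off the grid leaves the agent in place, and the terminal cell
# has value 0.
MOVES = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

def greedy_actions(values, goal, gamma=0.9):
    """For each non-terminal cell, return the set of actions that maximize
    reward(next cell) + gamma * V(next cell)."""
    nrows, ncols = values.shape
    best = {}
    for i in range(nrows):
        for j in range(ncols):
            if (i, j) == goal:
                continue
            q = {}
            for name, (di, dj) in MOVES.items():
                # Clamp to the grid so an off-grid move keeps the agent in place.
                ni = min(max(i + di, 0), nrows - 1)
                nj = min(max(j + dj, 0), ncols - 1)
                reward = 1.0 if (ni, nj) == goal else 0.0
                q[name] = reward + gamma * values[ni, nj]
            top = max(q.values())
            best[(i, j)] = {a for a, val in q.items() if np.isclose(val, top)}
    return best

# Hypothetical converged values under the assumptions above: a cell that is
# d steps from the goal has value 0.9 ** (d - 1).
goal = (3, 2)
values = np.zeros((4, 3))
for i in range(4):
    for j in range(3):
        d = (goal[0] - i) + (goal[1] - j)
        if d > 0:
            values[i, j] = 0.9 ** (d - 1)

policy = greedy_actions(values, goal)
# Cells where moving up or left is among the greedy actions.
backtracks = [cell for cell, acts in policy.items() if acts & {'up', 'left'}]
print(backtracks)  # prints []
```

Under these assumptions the list is empty. Values shrink by a factor of 0.9 with each extra step from the goal, so a move up or left always leads to a lower-valued cell than a move down or right.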