<h2>Environment: Gridworld</h2>


The Gridworld environment is implemented as a class that contains the maze itself, the actions allowed within the environment, and helper functions that compute the rewards and transition probabilities of the available moves. See the [Gridworld environment class notebook](envGridworld.ipynb) for details of this class.

<img src="images/gridworld_grid.png" style="display:block; margin:auto">

<h3>Environment Gridworld: Maze</h3>

The maze is represented as a 20x20 numpy array of integers. Each element of the 2D array represents a position in the maze, and its value indicates whether the agent can occupy that position (i.e. $maze[i,j]=0$) or the position is a wall (i.e. $maze[i,j]=-1$).

$$
\large maze[i,j]
= 
\begin{cases}
0\quad\quad \,\,\,\,\,\,maze[i,j]\normalsize\text{ is not a wall } \\
-1\quad\,\,\,\,\,\,\,\,maze[i,j]\normalsize\text{ is a wall}
\end{cases}
$$
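
As a minimal sketch of this encoding, the block below builds a small maze by hand and counts its open squares. The 4x4 size, the wall positions, and the helper name `is_wall` are assumptions chosen for illustration; the notebook itself loads a 20x20 maze from a file.

```python
import numpy as np

# Hypothetical 4x4 maze (the notebook's maze is 20x20):
# 0 marks a square the agent can occupy, -1 marks a wall.
maze = np.zeros((4, 4), dtype=np.int32)
maze[1, 1] = -1
maze[1, 2] = -1
maze[3, 0] = -1

def is_wall(maze, i, j):
    """Return True when position (i, j) of the maze is a wall."""
    return bool(maze[i, j] == -1)

# Number of squares the agent can occupy
n_open = int(np.sum(maze == 0))
```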

<h3>Environment Gridworld: Mask</h3>

The Gridworld class contains a data member mask, an $n \times m$ boolean numpy array with the same shape as the maze. It is passed to the helper functions in [helpers.ipynb](helpers.ipynb) that plot the state values and policies.

$$
\large mask[i,j]
= 
\begin{cases}
False\quad\quad maze[i,j]\normalsize \text{ is not a wall } \\
True\quad\,\, \,\,\,\,\,\,maze[i,j]\normalsize\text{ is a wall}
\end{cases}
$$
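
The mask follows directly from the maze with one elementwise comparison. The sketch below assumes a small hypothetical maze and a made-up array of state values. The plotting helpers are not reproduced here, so NumPy's masked arrays stand in to show which entries a wall mask hides.

```python
import numpy as np

# Hypothetical 3x3 maze: 0 is an open square, -1 is a wall
maze = np.array([[0, 0, -1],
                 [0, -1, 0],
                 [0, 0, 0]])

# True where the maze has a wall, False elsewhere
mask = (maze == -1)

# Made-up state values, one per grid position
values = np.arange(9, dtype=float).reshape(3, 3)

# Entries under a True mask are excluded from display and from reductions
masked_values = np.ma.masked_array(values, mask=mask)
visible_count = masked_values.count()
visible_sum = float(masked_values.sum())
```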

<h3>Environment Gridworld: States</h3>

Each white square in the maze represents a state. The state at a given time step is a random variable, and we denote the set of states as $S$.<br> Each observed state $s\in S$ is an ordered pair $(x,y)$ that gives the position of the agent on the grid.
$$\large S\coloneqq\{(x, y)\in\mathbb{Z}^2 : 0\le x<n,\; 0\le y<m\}$$
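
To make the set $S$ concrete, the sketch below enumerates every grid position of a small hypothetical maze and then filters out the walls. The 2x2 maze and the names `all_positions` and `open_states` are assumptions for illustration only.

```python
import numpy as np

# Hypothetical 2x2 maze with a single wall at (0, 1)
maze = np.array([[0, -1],
                 [0, 0]])
n, m = maze.shape

# Every ordered pair (x, y) with 0 <= x < n and 0 <= y < m
all_positions = [(x, y) for x in range(n) for y in range(m)]

# Only the positions the agent can occupy (the white squares)
open_states = [(x, y) for (x, y) in all_positions if maze[x, y] == 0]
```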

<h3>Environment Gridworld: Actions</h3>

The action taken at each time step is a random variable drawn from the set $A$ of the four actions available to the agent.<br>
$$\large A\coloneqq\{Up, Down, Left, Right\}$$

<h3>Environment Gridworld: Rewards</h3>

At each time step, the reward for the action taken is either 0 or -1, depending on whether the next state is the finish line square or any other square on the grid.

$$
\large R_t
= 
\begin{cases}
0\quad\quad \Large s_{t+1}\normalsize =\text{ finish line square on the grid } \\
-1\quad\, \Large s_{t+1}\normalsize\neq\text{ finish line square on the grid}
\end{cases}
$$
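
The reward rule can be sketched as a small standalone function. The function below is an illustration rather than the class method itself, and the path and finish position are hypothetical values used to show how the rewards accumulate.

```python
def reward(next_state, finish_position):
    """Return 0 when the next state is the finish square, otherwise -1."""
    return 0 if tuple(next_state) == tuple(finish_position) else -1

# Hypothetical sequence of next states that ends on the finish square
finish = (0, 3)
path = [(0, 1), (0, 2), (0, 3)]

# Every step costs -1 except the step that lands on the finish square
rewards = [reward(s, finish) for s in path]
total = sum(rewards)
```

Because every step other than the final one costs -1, a shorter path to the finish square yields a larger total reward.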

<h3>Environment Gridworld: Transitions</h3>

The Gridworld environment uses deterministic transitions: for any action $a \in A$ taken from any current state $s \in S$, there is exactly one possible next state $s' \in S$, and the agent moves to it with probability 1. If the action would move the agent into a wall or off the grid, the next state is the current state.

$$\large P(S_{t+1}=s_{t+1}|S_t=s_t, A_t=a_t) =
\begin{cases}
1\quad\quad s_{t+1}\normalsize\text{ is the next state reached from } s_t \text{ by } a_t \\
0\quad\quad \normalsize\text{otherwise}
\end{cases}
\qquad \forall \,s_{t+1}, s_t \in S,\, a_t \in A$$
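
A minimal sketch of a deterministic transition is shown below, assuming a small hypothetical maze. The names `DELTAS` and `step` are illustrative, and actions are given by name here, whereas the class method further down compares against integer indices.

```python
import numpy as np

# Row and column offsets for each action (illustrative mapping)
DELTAS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(maze, state, action):
    """Return the single next state reached from `state` by `action`."""
    di, dj = DELTAS[action]
    i, j = state[0] + di, state[1] + dj
    inside = 0 <= i < maze.shape[0] and 0 <= j < maze.shape[1]
    # Stay in place when the move leaves the grid or hits a wall
    if not inside or maze[i, j] == -1:
        return tuple(state)
    return (i, j)

# Hypothetical 3x3 maze: 0 is an open square, -1 is a wall
maze = np.array([[0, 0, -1],
                 [0, -1, 0],
                 [0, 0, 0]])

moved = step(maze, (0, 0), "right")        # open square, so the agent moves
blocked_edge = step(maze, (0, 0), "up")    # off the grid, so the agent stays
blocked_wall = step(maze, (0, 1), "right") # wall at (0, 2), so the agent stays
```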

In [1]:
import numpy as np
import import_ipynb

In [2]:
class envGridworld:
    def __init__(self, path_to_maze, finish_position):
        
        # load maze from numpy file
        self._maze = np.load(path_to_maze, mmap_mode='r')
        
        # Mask maze walls (indicated by -1) for seaborn heatmap in helper function plot_values
        self._mask = (self._maze == -1)
        
        i, j = finish_position
        # Check the grid boundary limits first, so that an out-of-range or negative
        # index is rejected before the maze is indexed; then check for a wall
        if not (0 <= i < self._maze.shape[0] and 0 <= j < self._maze.shape[1]):
            raise ValueError("Tuple finish_position is not valid")
        if self._maze[i, j] == -1:
            raise ValueError("Tuple finish position is a wall in the maze")
        self._finish_position = (i, j)
         
        # Define the four actions available to choose from in the environment
        self._actions = ("up", "down", "left", "right")
        
        # Create a 2D matrix of tuples representing the states (i.e. squares on the grid)
        self._states = np.zeros((self._maze.shape[0], self._maze.shape[1], 2), dtype=np.int32)
        for i in np.arange(self._maze.shape[0]):
            for j in np.arange(self._maze.shape[1]):
                self._states[i,j] = np.array((i,j))
        
    def reward(self, state, next_state):
        # Each step towards the finish square has a reward of -1
        # Moving to the finish line square has a reward of 0
        if (next_state != self._finish_position):
            return -1
        return 0

    # This is the conditional probability of the reward and next state given the current state and action
    def p_sp_r_given_s_a(self, next_state, reward, state, action):
        return 1.0

    # This function returns the next state after choosing an action from the current state
    # after checking to see if the action would result in hitting a maze wall or falling off
    # of the grid. In either scenario the next state is reset to the current state
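    def transitions(self, state, action):
        # Accept an action name from self._actions (e.g. "up") as well as its index 0-3
        if isinstance(action, str):
            action = self._actions.index(action)
        if (action == 0):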
            next_state = (state[0]-1, state[1])
            # Hit the upper boundary of the maze or hit wall from below
            if (state[0] == 0 or self._maze[state[0]-1, state[1]] == -1):
                next_state = state
        if (action == 1):
            next_state = (state[0]+1, state[1])
            # Hit the bottom boundary of the maze or hit a wall from above
            if (state[0] == self._maze.shape[0]-1 or self._maze[state[0]+1, state[1]] == -1):
                next_state = state
        if (action == 2):
            next_state = (state[0], state[1]-1)
            # Hit the left boundary of maze or hit a wall from the right
            if (state[1] == 0 or self._maze[state[0], state[1]-1] == -1):
                next_state = state
        if (action == 3):
            next_state = (state[0], state[1]+1)            
            # Hit the right boundary of maze or hit a wall from the left
            if (state[1] == self._maze.shape[1]-1 or self._maze[state[0], state[1]+1] == -1):
                next_state = state
        r = self.reward(state, next_state)
        p = self.p_sp_r_given_s_a(next_state, r, state, action)
        return next_state, r, p
    
    # Getters and setters via property decorators
    @property
    def finish_position(self):
        return self._finish_position
    @finish_position.setter
    def finish_position(self, finish_position):
        i, j = finish_position
        # Check the boundary limits before indexing the maze, so that an
        # out-of-range or negative index is never looked up
        if (i < 0 or i > self._maze.shape[0]-1 or j < 0 or j > self._maze.shape[1]-1):
            raise ValueError("Tuple finish_position is not valid")
        if (self._maze[i,j] == -1):
            raise ValueError("Tuple finish position is a wall in the maze")
        self._finish_position = (i, j)
    @property
    def States(self):
        return self._states
    @property
    def Actions(self):
        return self._actions
    @property
    def Maze(self):
        return self._maze
    @property
    def Mask(self):
        return self._mask

<h2>Instantiate a Gridworld object</h2>

In [3]:
# Set a finish position
#fin_position = (0,4)

# Instantiate a Gridworld object
#ex_gridworld = envGridworld(path_to_maze='./maze2.npy', finish_position=fin_position)



In [4]:
# Get the maze data member
#ex_gridworld.Maze

In [5]:
# Get the maze actions available
#ex_gridworld.Actions

In [6]:
# Set and get the finish line position on the grid
#ex_gridworld.finish_position = (0,0)
#ex_gridworld.finish_position

In [7]:
# Get the maze mask
#ex_gridworld.Mask

In [8]:
# Delete the gridworld object
#del ex_gridworld