# Gridworld

### Description

In this seminar we implement "gridworld", a simple example of the agent-environment interface used in reinforcement learning.

The world consists of an $n \times m$ grid of squares, indexed by $(i,j)$ with $i = 0, 1, \dots, n-1$ and $j = 0, 1, \dots, m-1$.

The state of the environment is the square currently occupied by the player.

The possible actions are steps in the directions "UP", "RIGHT", "DOWN", "LEFT".

The rewards and states after each action can depend on multiple factors:
- Some squares can give positive/negative rewards
- Squares might be "blocked" and not possible to step on
- Stepping off the board is impossible
- Invalid moves might give a negative reward as "punishment"
- There might be "portals" that take the player to a distant square, regardless of their action
- There might be deterministic or random effects that change the outcome of actions, e.g. "wind" or "ice".
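
The deterministic rules in this list can be summarised as a single transition function. The following is only a minimal sketch, independent of the class skeleton further down: the helper name `resolveMove` and its parameters `blocked` and `portals` are hypothetical, and it assumes that a portal's reward is looked up at its destination square.

```python
# Moves in the same order as the `MOVES` list used in this notebook
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # Up, Right, Down, Left

def resolveMove(pos, action, height, width, blocked=frozenset(),
                portals=None, rewardDict=None, invalidActionReward=0):
    """Return (newPos, reward) for one step (hypothetical helper)."""
    portals = portals or {}
    rewardDict = rewardDict or {}

    # Standing on a portal: the chosen action is ignored entirely
    # (assumption: the reward is the one stored for the destination)
    if pos in portals:
        newPos = portals[pos]
        return newPos, rewardDict.get(newPos, 0)

    di, dj = MOVES[action]
    newPos = (pos[0] + di, pos[1] + dj)

    # Off the board or blocked: stay put, return the invalid-action reward
    onBoard = 0 <= newPos[0] < height and 0 <= newPos[1] < width
    if not onBoard or newPos in blocked:
        return pos, invalidActionReward

    return newPos, rewardDict.get(newPos, 0)

# Stepping off the board from the top-left corner
print(resolveMove((0, 0), 0, 5, 5, invalidActionReward=-1))  # ((0, 0), -1)
# Stepping onto a blocked square
print(resolveMove((0, 0), 1, 5, 5, blocked={(0, 1)}, invalidActionReward=-1))  # ((0, 0), -1)
# Any action taken on a portal square leads to its destination
print(resolveMove((0, 1), 2, 5, 5, portals={(0, 1): (4, 1)}, rewardDict={(4, 1): 10}))  # ((4, 1), 10)
```

Keeping the rules in a pure function like this, one that does not modify any stored state, makes each case easy to test in isolation. Other designs, such as handling everything inside the `step` method, work just as well.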

Variants of this gridworld can be used to illustrate a wide range of concepts and algorithms in reinforcement learning.
For instance, in [Sutton & Barto](http://incompleteideas.net/book/the-book-2nd.html) see:
- Examples 3.5 and 3.8
- Figure 4.1
- Examples 6.5 and 6.6
- Figure 7.4
- Examples 8.1 and 8.3


You can implement this class either in a separate module (recommended) or inside a Jupyter cell. See `reuseCode.ipynb` for details.

### Implementation

Below is a suggested skeleton of a `GridWorld` class. Feel free to modify or rename anything.
If you want to implement this class in a separate module, create a new file `gridworld.py`, copy the code there, and delete this code cell.

In [None]:

# A list of possible moves
MOVES = [
    (-1, 0), # Up
    (0, 1),  # Right
    (1, 0),  # Down
    (0, -1)  # Left
]

class GridWorld:
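    def __init__(self, height = 5, width = 5, rewardDict = None, invalidActionReward = 0):
        # Store given dimensions and rewards in attributes
        self.height = height
        self.width = width
        # Use a fresh dict per instance; a mutable default argument like
        # `dict()` would be shared between all instances
        self.rewardDict = rewardDict if rewardDict is not None else {}
        self.invalidActionReward = invalidActionReward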

        # Set the initial position to (0, 0)
        self.pos = (0, 0)

    def step(self, action):
        """
        Perform an action.
        The argument `action` must be an integer (0, 1, 2, or 3),
        indicating one of the moves from `MOVES`.
        """
        
        # Get the move (tuple) corresponding to the given action (integer)
        move = MOVES[action]
        
        # Compute the new position
        newPos = (self.pos[0] + move[0], self.pos[1] + move[1])
        
        # Check if the new position is out of bounds
        if newPos[0] < 0 or newPos[0] >= self.height or newPos[1] < 0 or newPos[1] >= self.width:
            # If so, don't move, and return the invalid action reward
            return self.pos, self.invalidActionReward
        
        # Look up the reward for landing here
        reward = self.rewardDict.get(newPos, 0)
        
        # Update the state of the world
        self.pos = newPos

        # Return the new position and the reward
        return newPos, reward
    
    def reset(self):
        self.pos = (0, 0)
        return self.pos
    
    def drawWorld(self):
        # Loop over all rows/columns
        for i in range(self.height):
            for j in range(self.width):
                # Print an "X" if we're at the current position
                # Otherwise, print a "."
                if (i, j) == self.pos:
                    print("X", end=" ")
                else:
                    print(".", end=" ")
            # Print a newline at the end of each row
            print()
    
    def play(self):
        print('Enter an integer to make a move (0, 1, 2, or 3).')
        print('Enter anything else to quit.\n')
        self.drawWorld()
        while True:
            x = input('> ')
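            try:
                action = int(x)
            except ValueError:
                # Not an integer: quit
                break
            # Integers that are not valid action indices also quit
            if action not in range(len(MOVES)):
                break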
            (newPos, reward) = self.step(action)
            print('New position:', newPos)
            print('Reward:', reward, '\n')
            self.drawWorld()

Once you have implemented the basic methods above, you should be able to walk around in an empty gridworld! 🎉

To make things more interesting, implement for example:
- Positive rewards for reaching certain squares
- Negative rewards for "illegal" moves
- Blocked squares that cannot be entered
- Teleporting squares that move the player to another spot. This is useful to avoid optimal "back-and-forth" policies.
- A `.previewMove()` method to simulate given actions from a given start square.
- A random effect (ice, wind, ...) that changes the effect of some actions.
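
As one possible take on the random effect in the last bullet, the sketch below replaces the chosen action by a uniformly random one with a given probability, which loosely models "ice". The function name `applySlip` and the parameter `slipProb` are hypothetical; the result would be passed on to `step` in place of the original action.

```python
import random

def applySlip(action, slipProb, rng):
    """
    With probability `slipProb`, return a uniformly random action (0-3)
    instead of the requested one. Hypothetical helper.
    """
    if rng.random() < slipProb:
        return rng.randrange(4)
    return action

# A seeded generator keeps the notebook reproducible under `Run All`
rng = random.Random(0)

# Request "DOWN" (2) ten times on slightly slippery ground
effective = [applySlip(2, 0.2, rng) for _ in range(10)]
print(effective)
```

Passing the generator in explicitly, rather than using the global `random` functions, is a design choice that makes the randomness easy to seed and to test.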


### Testing

Below you can test individual aspects of your gridworld class with short code cells.

You can re-run cells or run them out of order, but it is recommended that the notebook still works if you `Run All` in a fresh Jupyter session.

If you implemented an interactive `.play()` method, you might not be able to test it from inside a Jupyter notebook, because it waits for keyboard input.
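
One way around this is to feed scripted input to the loop by temporarily replacing the built-in `input` function with `unittest.mock.patch`. The sketch below uses a small stand-in function, `readActions` (hypothetical, mimicking the input loop of `.play()`), so that it runs on its own.

```python
from unittest.mock import patch

def readActions():
    """Hypothetical stand-in for the input loop in `.play()`."""
    actions = []
    while True:
        x = input('> ')
        try:
            actions.append(int(x))
        except ValueError:
            # Anything that is not an integer ends the loop
            break
    return actions

# Each call to `input` returns the next item of `side_effect`
with patch('builtins.input', side_effect=['1', '2', 'q']):
    scripted = readActions()

print(scripted)  # [1, 2]
```

With the class from this notebook, a call to `gw.play()` could be placed inside the `with` block instead of `readActions()`. The last scripted entry should be a non-integer so that the loop terminates.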

In [None]:
# Not necessary if the class is defined in this notebook.
# If you moved it to `gridworld.py`, uncomment the import below:

# from gridworld import GridWorld


In [None]:
# Create a dictionary of rewards
rewardDict = {
    (0, 3): 10,
    (1, 3): -10,
}

# Create a new gridworld instance that uses these rewards
gw = GridWorld(6, 7, rewardDict)

# Draw the world
gw.drawWorld()

In [None]:
# Play interactively
# gw.play()

In [None]:
# Test any other behaviour that you implemented...