1 Understanding MDPs
==========

1.1 Chess
------------


1.2 LunarLander
------------

1.3 Model Based RL: Accessing Environment Dynamics
------------

The environment dynamics are defined by the *reward function* and the *state transition function*. The *reward function* determines the reward that an action will produce if taken in the current state. The *state transition function* takes the current state and a proposed action and returns the state that taking this action would produce. In a very simple implementation, the reward is positive if the action leads to a 'good' state and negative (a cost) if it leads to a 'bad' state, where the resulting state is the one returned by the state transition function. For example, imagine an actor who is tasked with caring for a plant and can either water or not water the plant once a day. The state transition function takes the plant's current state (its thirst level) and a proposed action (watering or not watering the plant) and returns the plant's thirst level after that action. The reward function can then take the resulting state and return a reward based on whether the plant is at a healthy thirst level.
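
The plant example can be sketched in a few lines of Python. This is only an illustration under assumed numbers: the thirst scale (0 to 4), the "healthy" range (1 to 2), the effect of watering (thirst drops by 2), and the names `Action`, `transition`, and `reward` are all hypothetical and do not come from any library.

```python
from enum import Enum

class Action(Enum):
    WATER = 0
    WAIT = 1

# Assumed thirst scale: 0 (overwatered) up to 4 (parched).
MIN_THIRST, MAX_THIRST = 0, 4
# Assumed "healthy" thirst levels: 1 and 2.
HEALTHY = range(1, 3)

def transition(thirst, action):
    """State transition function: next thirst level for (state, action)."""
    if action == Action.WATER:
        return max(MIN_THIRST, thirst - 2)  # watering lowers thirst
    return min(MAX_THIRST, thirst + 1)      # a dry day raises thirst

def reward(next_thirst):
    """Reward function: +1 for a healthy resulting state, -1 otherwise."""
    return 1 if next_thirst in HEALTHY else -1

# Compare both actions from the same starting state.
thirst = 3
for action in Action:
    next_thirst = transition(thirst, action)
    print(action.name, next_thirst, reward(next_thirst))
```

Starting from a thirst level of 3, watering leads to level 1 (reward +1), while waiting leads to level 4 (reward -1). Because both functions are available to the agent, it can compare these outcomes before acting, which is what distinguishes the model-based setting.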

2 Implementing a Gridworld
==========

2.1 Look up some examples
----------

1. [An introduction article that describes a simple implementation](https://towardsdatascience.com/reinforcement-learning-implement-grid-world-from-scratch-c5963765ebff)
2. [A simple visualization of deterministic and probabilistic policies for a gridworld with obstacles](https://www.youtube.com/watch?v=gThGerajccM)
3. [A simple toy gridworld with explanation](https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html)

2.2 Implementing the MDP
-----------

In [40]:
from enum import Enum
# Directions as an enum
class Direction(Enum):
    UP = 0
    LEFT = 1
    DOWN = 2
    RIGHT = 3
    
def direction_arithmetic(curr_pos, direction):
    row, col = curr_pos
    if direction == Direction.UP:
        row = row - 1
    elif direction == Direction.LEFT:
        col = col - 1
    elif direction == Direction.DOWN:
        row = row + 1
    elif direction == Direction.RIGHT:
        col = col + 1
    else:
        raise ValueError(f"Unrecognized direction: {direction}")
    return (row, col)

# Define the GridWorld Class
class GridWorld:
 
    def __init__(self, width, height):
        # store the grid dimensions and the winning (row, col) cell
        self.width = width
        self.height = height
        self.win_state = (height-1, width-1)
        # build the grid and place the agent and the win marker
        self.reset()

    # for resetting the state of the world to the beginning
    def reset(self):
        self.grid = [[0 for _ in range(self.width)] for _ in range(self.height)]
        self.agent_state = (0,0)
        self.grid[self.agent_state[0]][self.agent_state[1]] = "A"
        self.grid[self.win_state[0]][self.win_state[1]] = "W"
    
    def valid(self, state):
        row, col = state
        # the below condition would need to be updated for obstacles
        return (0 <= row < self.height) and (0 <= col < self.width)
        
    def move(self, direction):
        target_state = direction_arithmetic(self.agent_state, direction)
        if self.valid(target_state):
            self.grid[self.agent_state[0]][self.agent_state[1]] = 0
            self.agent_state = target_state
            self.grid[self.agent_state[0]][self.agent_state[1]] = "A"
    
    # for printing and debugging
    def __repr__(self):
        s = ""
        for row in range(self.height):
            s += "==" * (self.width) + "="
            s += "\n"
            for col in range(self.width):
                s += f"|{self.grid[row][col]}"
            s +="|\n"
        s += "==" * (self.width) + "="
        return s

In [41]:
g = GridWorld(3,3)
print(g)
g.move(Direction.LEFT)
print(g)
g.move(Direction.RIGHT)
print(g)
g.move(Direction.DOWN)
print(g)

|A|0|0|
|0|0|0|
|0|0|W|
|A|0|0|
|0|0|0|
|0|0|W|
|0|A|0|
|0|0|0|
|0|0|W|
|0|0|0|
|0|A|0|
|0|0|W|
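
The class above only tracks positions; it does not yet return rewards. One possible way to expose the transition and reward together is a standalone `step` function, sketched below. The `MOVES` table, the `step` name and signature, and the reward values (-1 per move, +10 for reaching the bottom-right cell) are assumptions made for illustration and are not part of the `GridWorld` class.

```python
# Hypothetical (row, col) offsets mirroring direction_arithmetic above.
MOVES = {"UP": (-1, 0), "LEFT": (0, -1), "DOWN": (1, 0), "RIGHT": (0, 1)}

def step(state, move, width, height, step_cost=-1, win_reward=10):
    """Return (next_state, reward, done) for one move on a width x height grid."""
    row, col = state
    d_row, d_col = MOVES[move]
    target = (row + d_row, col + d_col)
    # An off-grid move leaves the agent where it is, as GridWorld.move does.
    if not (0 <= target[0] < height and 0 <= target[1] < width):
        target = state
    # The episode ends when the agent reaches the bottom-right cell.
    done = target == (height - 1, width - 1)
    return target, (win_reward if done else step_cost), done

# Walk from the top-left corner to the win state on a 3x3 grid.
state, total = (0, 0), 0
for move in ["DOWN", "DOWN", "RIGHT", "RIGHT"]:
    state, r, done = step(state, move, width=3, height=3)
    total += r
print(state, total, done)
```

Keeping `step` free of side effects means the same function can serve as both the environment and the agent's model of it: a planner can call it on hypothetical states without moving the real agent.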


3 Implementing a policy
==============

3.1 Implement the basic agent
------------------

3.2 Evaluate the policy
----------------