<a href="https://colab.research.google.com/github/rahiakela/grokking-deep-reinforcement-Learning/blob/main/2-mathematical-foundations-of-reinforcement-learning/2_bandit_slippery_walk_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Bandit slippery walk example

The bandit walk (BW) is a simple grid-world (GW) environment. GWs are grids of any size and are a common type of environment for studying RL algorithms. A GW can have any model (transition and reward functions) you can think of and can make any kind of action available.

Most of them, however, make movement actions available to the agent:
- Left, 
- Down, 
- Right, 
- Up 

(or West, South, East, North, which is more precise because the agent has no heading and usually has no visibility of the full grid, but cardinal directions can also be more confusing).

And, of course, each action corresponds to its logical transition: Left goes left, and Right goes right. GWs also tend to have fully observable, discrete state and observation spaces (that is, the state equals the observation), with integers representing the ID of the cell the agent occupies.

A “walk” is a special case of grid-world environments with a single row. In reality, what I call a “walk” is more commonly referred to as a “Corridor.” But, I use the term “walk” for all the grid-world environments with a single row.

Let’s say the surface of the walk is slippery and each action has a 20% chance of sending the agent backwards. I call this environment the bandit slippery walk (BSW).

BSW is still a one-row grid world (a walk, a corridor) with only the Left and Right actions available. Again, it has three states and two actions. The reward is the same as before: +1 when landing on the rightmost state (except when coming from the rightmost state itself), and zero otherwise.

However, the transition function is different: 80% of the time the agent moves to the intended cell, and 20% of the time it moves in the opposite direction.
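As a rough illustration of the transition rule above, the following sketch samples single steps from the middle state and checks the empirical success rate. The function name `slippery_step`, the `slip_prob` parameter, and the use of Python's `random` module are assumptions made for this example; they are not part of `gym-walk`.

```python
import random

LEFT, RIGHT = 0, 1  # action ids, matching the convention used later in this notebook


def slippery_step(state, action, rng, slip_prob=0.2):
    """Sample one BSW transition (hypothetical helper, for illustration only).

    Returns (next_state, reward, done) for the three-cell walk T-1-T.
    """
    if state != 1:
        # Terminal cells loop onto themselves and give no reward.
        return state, 0.0, True
    intended = 0 if action == LEFT else 2
    opposite = 2 if action == LEFT else 0
    # With probability slip_prob the agent slips and moves the opposite way.
    next_state = opposite if rng.random() < slip_prob else intended
    reward = 1.0 if next_state == 2 else 0.0
    return next_state, reward, True


rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 10_000
hits = sum(slippery_step(1, RIGHT, rng)[0] == 2 for _ in range(n_samples))
right_success_rate = hits / n_samples
print(f"Right reached the rightmost cell in {right_success_rate:.1%} of samples")
```

With enough samples, the printed rate should sit close to 80%, which is the "action success" probability described above.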

A depiction of this environment would look as follows.

<img src='https://github.com/rahiakela/img-repo/blob/master/reinforcement-learning/bsw-environment.png?raw=1' width='800'/>

The depiction is identical to that of the BW environment! Interesting . . .

How do we know that the action effects are stochastic? How do we represent the “slippery” part of this problem? The graphical and table representations can help us with that.

A graphical representation of the BSW environment would look like the following.

<img src='https://github.com/rahiakela/img-repo/blob/master/reinforcement-learning/bsw-graph.png?raw=1' width='800'/>

The BSW environment has a stochastic transition function. Let’s now represent this environment in a table form as well.

<img src='https://github.com/rahiakela/img-repo/blob/master/reinforcement-learning/bsw-table.png?raw=1' width='800'/>

## Setup

In [None]:
!pip install git+https://github.com/mimoralea/gym-walk#egg=gym-walk

In [None]:
import gym, gym_walk

## Bandit Slippery Walk

In [None]:
############## Bandit Slippery Walk ###############################
# stochastic environment (80% action success, 20% backwards)
# 1 non-terminal state, 2 terminal states
# only reward is still at the right-most cell in the "walk"
# episodic environment, the agent terminates at the left- or right-most cell (after 1 action selection -- any action)
# agent starts in state 1 (middle of the walk) T-1-T
# actions left (0) or right (1)

P = {
    0: {
        0: [(1.0, 0, 0.0, True)],
        1: [(1.0, 0, 0.0, True)]
    },
    1: {
        0: [(0.8, 0, 0.0, True), (0.2, 2, 1.0, True)],
        1: [(0.8, 2, 1.0, True), (0.2, 0, 0.0, True)]
    },
    2: {
        0: [(1.0, 2, 0.0, True)],
        1: [(1.0, 2, 0.0, True)]
    }
}

P

{0: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 0, 0.0, True)]},
 1: {0: [(0.8, 0, 0.0, True), (0.2, 2, 1.0, True)],
  1: [(0.8, 2, 1.0, True), (0.2, 0, 0.0, True)]},
 2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]}}
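Each entry of `P[state][action]` is a list of `(probability, next_state, reward, done)` tuples. A minimal sketch of how one might sanity-check such a dictionary and read off the expected one-step reward of each action in the middle state follows. The helper names `expected_reward` and `rows_sum_to_one` are hypothetical, and `P` is repeated so the snippet runs on its own.

```python
# Hand-written BSW model, copied from the cell above.
P = {
    0: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 0, 0.0, True)]},
    1: {
        0: [(0.8, 0, 0.0, True), (0.2, 2, 1.0, True)],
        1: [(0.8, 2, 1.0, True), (0.2, 0, 0.0, True)],
    },
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}


def rows_sum_to_one(P, tol=1e-9):
    """Check that every (state, action) pair defines a probability distribution."""
    return all(
        abs(sum(prob for prob, *_ in outcomes) - 1.0) < tol
        for actions in P.values()
        for outcomes in actions.values()
    )


def expected_reward(P, state, action):
    """Probability-weighted one-step reward for taking `action` in `state`."""
    return sum(prob * reward for prob, _next_state, reward, _done in P[state][action])


model_is_valid = rows_sum_to_one(P)
reward_left = expected_reward(P, 1, 0)   # action 0 = Left
reward_right = expected_reward(P, 1, 1)  # action 1 = Right
print(model_is_valid, reward_left, reward_right)
```

Because every episode ends after a single action, these one-step expectations (0.2 for Left, 0.8 for Right) are also the expected returns of the two actions from the start state.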

In [None]:
P = gym.make("BanditSlipperyWalk-v0").env.P
P

{0: {0: [(0.8, 0, 0.0, True), (0.0, 0, 0.0, True), (0.2, 0, 0.0, True)],
  1: [(0.8, 0, 0.0, True), (0.0, 0, 0.0, True), (0.2, 0, 0.0, True)]},
 1: {0: [(0.8, 0, 0.0, True), (0.0, 1, 0.0, False), (0.2, 2, 1.0, True)],
  1: [(0.8, 2, 1.0, True), (0.0, 1, 0.0, False), (0.2, 0, 0.0, True)]},
 2: {0: [(0.8, 2, 0.0, True), (0.0, 2, 0.0, True), (0.2, 2, 0.0, True)],
  1: [(0.8, 2, 0.0, True), (0.0, 2, 0.0, True), (0.2, 2, 0.0, True)]}}
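The dictionary returned by the package lists three outcomes per action, one of them with probability 0.0, so it does not look identical to the hand-written `P` at first glance. The sketch below drops zero-probability outcomes and merges duplicates to check that the two describe the same model. The helper names `canonical` and `same_model` are hypothetical, and both dictionaries are copied from the outputs above so the snippet does not need `gym` installed.

```python
import math
from collections import defaultdict

# Copied from the output of gym.make("BanditSlipperyWalk-v0").env.P shown above.
gym_P = {
    0: {0: [(0.8, 0, 0.0, True), (0.0, 0, 0.0, True), (0.2, 0, 0.0, True)],
        1: [(0.8, 0, 0.0, True), (0.0, 0, 0.0, True), (0.2, 0, 0.0, True)]},
    1: {0: [(0.8, 0, 0.0, True), (0.0, 1, 0.0, False), (0.2, 2, 1.0, True)],
        1: [(0.8, 2, 1.0, True), (0.0, 1, 0.0, False), (0.2, 0, 0.0, True)]},
    2: {0: [(0.8, 2, 0.0, True), (0.0, 2, 0.0, True), (0.2, 2, 0.0, True)],
        1: [(0.8, 2, 0.0, True), (0.0, 2, 0.0, True), (0.2, 2, 0.0, True)]},
}

# Hand-written model from earlier in this notebook.
hand_P = {
    0: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 0, 0.0, True)]},
    1: {0: [(0.8, 0, 0.0, True), (0.2, 2, 1.0, True)],
        1: [(0.8, 2, 1.0, True), (0.2, 0, 0.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}


def canonical(P):
    """Map each (state, action) to {(next_state, reward, done): total probability}."""
    result = {}
    for state, actions in P.items():
        result[state] = {}
        for action, outcomes in actions.items():
            totals = defaultdict(float)
            for prob, next_state, reward, done in outcomes:
                if prob > 0.0:  # skip outcomes that can never happen
                    totals[(next_state, reward, done)] += prob
            result[state][action] = dict(totals)
    return result


def same_model(P_a, P_b):
    """Compare two transition dictionaries after canonicalizing them."""
    a, b = canonical(P_a), canonical(P_b)
    if a.keys() != b.keys():
        return False
    for state in a:
        if a[state].keys() != b[state].keys():
            return False
        for action in a[state]:
            dist_a, dist_b = a[state][action], b[state][action]
            if dist_a.keys() != dist_b.keys():
                return False
            if not all(math.isclose(dist_a[k], dist_b[k]) for k in dist_a):
                return False
    return True


models_match = same_model(gym_P, hand_P)
print(models_match)
```

Comparing distributions rather than raw lists keeps the check independent of the order in which outcomes are listed and of how a single outcome's probability is split across entries.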

<img src='https://github.com/rahiakela/img-repo/blob/master/reinforcement-learning/bsw-code.png?raw=1' width='800'/>