<a href="https://colab.research.google.com/github/rahiakela/grokking-deep-reinforcement-Learning/blob/main/2-mathematical-foundations-of-reinforcement-learning/1_bandit_walk_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Bandit walk example

The bandit walk (BW) is a simple grid-world (GW) environment. GWs are grids of any size and are a common type of environment for studying RL algorithms. A GW can have any model (transition and reward functions) you can think of and can make any kind of action available.

But they all commonly make move actions available to the agent:
- Left
- Down
- Right
- Up

These are sometimes called West, South, East, and North instead, which is more precise because the agent has no heading and usually cannot see the full grid, but cardinal directions can also be more confusing.

And, of course, each action corresponds with its logical transition: Left goes left, and Right goes right. GWs also tend to have fully observable, discrete state and observation spaces (that is, state equals observation), with integers representing the id of the cell the agent occupies.

A “walk” is a special case of grid-world environments with a single row. What I call a “walk” is more commonly referred to as a “corridor,” but I use the term “walk” for all the grid-world environments with a single row.

The bandit walk (BW) is a walk with three states, but only one non-terminal state. Environments that have a single non-terminal state are called “bandit” environments. “Bandit” here is an analogy to slot machines, which are also known as “one-armed bandits”; they have one arm and, if you like gambling, can empty your pockets, the same way a bandit would.

The BW environment has just two actions available: Left (action 0) and Right (action 1). BW has a deterministic transition function: a Left action always moves the agent to the left, and a Right action always moves the agent to the right. The reward signal is +1 when landing on the rightmost cell and 0 otherwise. The agent starts in the middle cell.
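The dynamics described above can be sketched as a small pure function. This is only an illustration of the rules in the text, not the `gym-walk` implementation; the names `bw_step`, `LEFT`, `RIGHT`, and `TERMINAL_STATES` are hypothetical, and treating terminal states as absorbing with zero reward is an assumption.

```python
LEFT, RIGHT = 0, 1           # action ids used in the text
TERMINAL_STATES = {0, 2}     # leftmost and rightmost cells


def bw_step(state, action):
    """Hypothetical helper: one deterministic step in the bandit walk.

    Returns a (next_state, reward, done) tuple.
    """
    if state in TERMINAL_STATES:
        # Assumption: terminal states are absorbing and give no reward.
        return state, 0.0, True
    next_state = state - 1 if action == LEFT else state + 1
    # +1 only when landing on the rightmost cell (state 2), 0 otherwise.
    reward = 1.0 if next_state == 2 else 0.0
    return next_state, reward, next_state in TERMINAL_STATES


print(bw_step(1, LEFT))   # (0, 0.0, True)
print(bw_step(1, RIGHT))  # (2, 1.0, True)
```

Because the agent starts in state 1 and both neighbors are terminal, every episode ends after exactly one action.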

<img src='https://github.com/rahiakela/img-repo/blob/master/reinforcement-learning/bw-environment.png?raw=1' width='800'/>

A graphical representation of the BW environment would look like the following.

<img src='https://github.com/rahiakela/img-repo/blob/master/reinforcement-learning/bw-graph.png?raw=1' width='800'/>

We can also represent this environment in table form.

<img src='https://github.com/rahiakela/img-repo/blob/master/reinforcement-learning/bw-table.png?raw=1' width='800'/>

## Setup

In [None]:
!pip install git+https://github.com/mimoralea/gym-walk#egg=gym-walk

In [10]:
import gym, gym_walk

## Bandit Walk

In [1]:
############## Bandit Walk ###############################
# deterministic environment (100% action success)
# 1 non-terminal state, 2 terminal states
# the only reward is at the right-most cell in the "walk"
# episodic environment: the episode terminates at the left- or right-most cell (after 1 action selection -- any action)
# agent starts in state 1 (middle of the walk) T-1-T
# actions: left (0) or right (1)
# each entry maps state -> action -> list of (probability, next_state, reward, done)

P = {
    0: {
        0: [(1.0, 0, 0.0, True)],
        1: [(1.0, 0, 0.0, True)]
    },
    1: {
        0: [(1.0, 0, 0.0, True)],
        1: [(1.0, 2, 1.0, True)]  # +1 reward for landing on the right-most cell
    },
    2: {
        0: [(1.0, 2, 0.0, True)],
        1: [(1.0, 2, 0.0, True)]
    }
}

P

{0: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 0, 0.0, True)]},
 1: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 2, 1.0, True)]},
 2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]}}
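The dictionary maps each state to its actions, and each action to a list of `(probability, next_state, reward, done)` tuples. As a rough sketch, it can be flattened into the rows of a transition table like the one mentioned earlier. The helper name `transition_rows` is hypothetical, and the dictionary below uses the +1 reward that the text describes for landing on the rightmost cell.

```python
# Bandit walk transitions: state -> action -> [(prob, next_state, reward, done)]
P = {
    0: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 0, 0.0, True)]},
    1: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}


def transition_rows(P):
    """Hypothetical helper: flatten P into one row per possible transition."""
    rows = []
    for state in sorted(P):
        for action in sorted(P[state]):
            for prob, next_state, reward, done in P[state][action]:
                rows.append((state, action, prob, next_state, reward, done))
    return rows


print("state action  prob next_state reward done")
for row in transition_rows(P):
    print("{:>5} {:>6} {:>5} {:>10} {:>6} {}".format(*row))
```

With three states, two actions, and one transition per state-action pair, this yields six rows.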

In [11]:
P = gym.make("BanditWalk-v0").env.P
P

{0: {0: [(1.0, 0, 0.0, True), (0.0, 0, 0.0, True), (0.0, 0, 0.0, True)],
  1: [(1.0, 0, 0.0, True), (0.0, 0, 0.0, True), (0.0, 0, 0.0, True)]},
 1: {0: [(1.0, 0, 0.0, True), (0.0, 1, 0.0, False), (0.0, 2, 1.0, True)],
  1: [(1.0, 2, 1.0, True), (0.0, 1, 0.0, False), (0.0, 0, 0.0, True)]},
 2: {0: [(1.0, 2, 0.0, True), (0.0, 2, 0.0, True), (0.0, 2, 0.0, True)],
  1: [(1.0, 2, 0.0, True), (0.0, 2, 0.0, True), (0.0, 2, 0.0, True)]}}

<img src='https://github.com/rahiakela/img-repo/blob/master/reinforcement-learning/bw-code.png?raw=1' width='800'/>
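The dictionary printed for `BanditWalk-v0` lists three transitions per state-action pair, two of which have probability 0.0. A minimal sketch for reading it: drop the zero-probability entries and compute the expected immediate reward of each action. The literal below is copied from the output above so the snippet runs without `gym`; the helper names `drop_zero_probability` and `expected_reward` are hypothetical.

```python
# Copied from the printed output of gym.make("BanditWalk-v0").env.P
P_gym = {
    0: {0: [(1.0, 0, 0.0, True), (0.0, 0, 0.0, True), (0.0, 0, 0.0, True)],
        1: [(1.0, 0, 0.0, True), (0.0, 0, 0.0, True), (0.0, 0, 0.0, True)]},
    1: {0: [(1.0, 0, 0.0, True), (0.0, 1, 0.0, False), (0.0, 2, 1.0, True)],
        1: [(1.0, 2, 1.0, True), (0.0, 1, 0.0, False), (0.0, 0, 0.0, True)]},
    2: {0: [(1.0, 2, 0.0, True), (0.0, 2, 0.0, True), (0.0, 2, 0.0, True)],
        1: [(1.0, 2, 0.0, True), (0.0, 2, 0.0, True), (0.0, 2, 0.0, True)]},
}


def drop_zero_probability(P):
    """Hypothetical helper: keep only transitions that can actually occur."""
    return {
        state: {
            action: [t for t in transitions if t[0] > 0.0]
            for action, transitions in actions.items()
        }
        for state, actions in P.items()
    }


def expected_reward(P, state, action):
    """Hypothetical helper: sum of probability * reward over the transitions."""
    return sum(prob * reward for prob, _, reward, _ in P[state][action])


P_compact = drop_zero_probability(P_gym)
print(P_compact[1])
# {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 2, 1.0, True)]}

print(expected_reward(P_gym, 1, 0), expected_reward(P_gym, 1, 1))
# 0.0 1.0
```

In the only non-terminal state, going Right has an expected reward of 1.0 and going Left has 0.0, which matches the reward signal described for this environment.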