# Linear World

In this notebook we implement a very simple example of the agent-environment interface used in reinforcement learning, called "linear world".

The world consists of $n$ places in a row, labelled $0, 1, \dots, n-1$, and the state of the world is the position of the player. In the initial state, the player is in the middle of the world, at position $\lfloor n/2 \rfloor$:
- The (empty) world for $n=5$: `"_ _ _ _ _"`
- With the player in position $2$: `"_ _ X _ _"`

The actions that the agent can take are `LEFT` and `RIGHT`, each moving the player one place in the indicated direction. In the two outer positions ($0$ and $n-1$), both actions result in a step towards the inside. Starting from the configuration above:
- After action `RIGHT`: `"_ _ _ X _"`
- After action `RIGHT`: `"_ _ _ _ X"`
- After action `LEFT` or `RIGHT`: `"_ _ _ X _"`

The reward is $+1$ for an action that leaves the player in one of the outer positions, and $0$ otherwise. A possible sequence of events in this setting (for $n = 5$) is:
- `"_ _ X _ _"` $S_0 = 2, A_0 = \text{"RIGHT"}$
- `"_ _ _ X _"` $R_1 = 0, S_1 = 3, A_1 = \text{"LEFT"}$
- `"_ _ X _ _"` $R_2 = 0, S_2 = 2, A_2 = \text{"LEFT"}$
- `"_ X _ _ _"` $R_3 = 0, S_3 = 1, A_3 = \text{"LEFT"}$
- `"X _ _ _ _"` $R_4 = 1, S_4 = 0, A_4 = \text{"RIGHT"}$
- `"_ X _ _ _"` $R_5 = 0, S_5 = 1, A_5 = \dots$
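The rules above fit in a few lines of code. The following is a minimal sketch, assuming the notebook's encoding `LEFT = 1`, `RIGHT = 2`; the function name `transition` is hypothetical and not part of the notebook. It is a side-effect-free version of the dynamics and replays the example sequence, printing $S_{t+1}$ and $R_{t+1}$ for each step.

```python
# Constants mirroring the notebook's encoding of the two actions.
LEFT = 1
RIGHT = 2


def transition(state, action, n):
    """Return (next_state, reward) in a linear world with n places.

    Hypothetical helper: a side-effect-free version of the dynamics.
    """
    if action not in (LEFT, RIGHT):
        raise ValueError(f"Unknown action: {action!r}")
    if state == 0:
        next_state = 1          # left edge: any action steps inward
    elif state == n - 1:
        next_state = n - 2      # right edge: any action steps inward
    elif action == LEFT:
        next_state = state - 1
    else:
        next_state = state + 1
    # Reward +1 only when the move ends in one of the two outer positions.
    reward = 1 if next_state in (0, n - 1) else 0
    return next_state, reward


# Replay the example sequence from the text (n = 5, S_0 = 2).
n = 5
state = n // 2
for action in [RIGHT, LEFT, LEFT, LEFT, RIGHT]:
    state, reward = transition(state, action, n)
    print(state, reward)
```

The printed pairs are the states $S_1, \dots, S_5$ and rewards $R_1, \dots, R_5$ of the sequence above. Keeping the dynamics as a pure function of `(state, action)` makes individual transitions easy to check in isolation.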




*<span style="color:red">Below, the parts indicated by `#??` need to be filled in!</span>*

In [1]:
# We use constants 1 and 2 to represent LEFT, RIGHT:
LEFT = 1
RIGHT = 2

In [2]:
class LinearWorld:
    def __init__(self, length):
        # Store length of world (`length` is only a local parameter,
        # so it has to be saved as an attribute)
        self.length = length
        
        # Initialize state of world in the middle
        self._state = length // 2
    
    def step(self, action):
        # Apply `action`: move to the next state and return it with the reward
        if action not in (LEFT, RIGHT):
            raise ValueError(f"Unknown action: {action!r}")

        # Compute new state
        # (1. handle the outer two positions: any action steps inward)
        if self._state == 0:
            self._state = 1
        elif self._state == self.length - 1:
            self._state = self.length - 2
        # (2. change position to the left/right)
        elif action == LEFT:
            self._state -= 1
        else:
            self._state += 1

        # Compute reward  
        if self._state == 0 or self._state == self.length - 1:
            reward = 1 # at the edges
        else:
            reward = 0
        
        # Return state and reward
        return self._state, reward
    
    def reset(self):
        # Reset the position to the middle
        self._state = self.length // 2
    
    def showWorld(self):
        # Print a representation of the linear world
        
        # Start with an empty string
        ret = ''
        
        # Add "_" for every empty spot, "X" for the player
        for i in range(self.length):
            if i != self._state:
                ret += "_ "
            else: 
                ret += "X "
        
        # Print the complete string
        print(ret)

    def __str__(self):
        # (!) Advanced concept:
        # Custom string conversion (used e.g. by `print()`).
        # Use the same logic as in .showWorld(), but return the string
        # (instead of printing it)

        # Start with an empty string
        ret = ''
        
        # Add "_" for every empty spot, "X" for the player
        for i in range(self.length):
            if i != self._state:
                ret += "_ "
            else: 
                ret += "X "
        
        # Return the complete string
        return ret

In [3]:
lw = LinearWorld(5)
lw.showWorld()
print(lw)

_ _ X _ _ 
_ _ X _ _ 


## Testing the linear world

First, we create an instance of `LinearWorld`, then we use the `.step()` method to perform actions ($A_t$) and observe the resulting state ($S_{t+1}$) and reward ($R_{t+1}$).

In [4]:
# Create a new instance of the LinearWorld class
lw = LinearWorld(7)

In [5]:
# Check the attributes `length` and `_state` of the instance
print(lw.length)
print(lw._state)

7
3


In [6]:
# Make a step and assign the outcome to variables (state + reward)
state, reward = lw.step(RIGHT)

# Print the outcome
print(state)
print(reward)

4
0


We can use the method `.showWorld()` to visualize the events "graphically":

In [None]:
# Make a step
lw.step(LEFT)

# Show the new state of the world
lw.showWorld()

## Two simple policies

Next, we implement two policies and see how they perform over a timespan of $T = 100$ steps:
- The random policy chooses `LEFT` or `RIGHT` uniformly at random
- The "right" policy always goes `RIGHT`
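Both policies can also be written as functions that map the current state to an action, so that a single loop can evaluate either of them. This is a minimal sketch: `random_policy`, `right_policy` and `run_policy` are hypothetical names, and the compact `LinearWorld` below restates only the notebook's `step()` logic (rejecting unknown actions in every position) so that the block runs on its own.

```python
import random

LEFT = 1
RIGHT = 2


class LinearWorld:
    """Compact restatement of the notebook's environment (step() only)."""

    def __init__(self, length):
        self.length = length
        self._state = length // 2

    def step(self, action):
        if action not in (LEFT, RIGHT):
            raise ValueError(f"Unknown action: {action!r}")
        if self._state == 0:
            self._state = 1
        elif self._state == self.length - 1:
            self._state = self.length - 2
        elif action == LEFT:
            self._state -= 1
        else:
            self._state += 1
        reward = 1 if self._state in (0, self.length - 1) else 0
        return self._state, reward


# A policy maps the current state to an action (hypothetical helpers).
def random_policy(state):
    return random.choice([LEFT, RIGHT])   # ignores the state


def right_policy(state):
    return RIGHT                          # always goes right


def run_policy(policy, length=7, T=100):
    """Run `policy` for T steps in a fresh world and return the total reward."""
    env = LinearWorld(length)
    state = env._state
    total = 0
    for _ in range(T):
        state, reward = env.step(policy(state))
        total += reward
    return total


print(run_policy(right_policy))   # deterministic
print(run_policy(random_policy))  # varies from run to run
```

Passing the policy as a function keeps the evaluation loop identical for both policies; only the decision rule changes. Neither policy here looks at `state`, but a smarter policy could.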

In [7]:
# We use Python's built-in `random` module to choose a random action
import random

In [8]:
# Number of steps
T = 100

In [13]:
# Run the random policy for T steps, update the total reward
lw = LinearWorld(7)
totalRandom = 0
for t in range(T):
    a = random.choice([LEFT, RIGHT])
    _, reward = lw.step(a)
    totalRandom += reward
    lw.showWorld()
    print(totalRandom)

_ _ _ _ X _ _ 
0
_ _ _ X _ _ _ 
0
_ _ X _ _ _ _ 
0
_ _ _ X _ _ _ 
0
_ _ _ _ X _ _ 
0
_ _ _ X _ _ _ 
0
_ _ X _ _ _ _ 
0
_ _ _ X _ _ _ 
0
_ _ _ _ X _ _ 
0
_ _ _ _ _ X _ 
0
_ _ _ _ X _ _ 
0
_ _ _ _ _ X _ 
0
_ _ _ _ X _ _ 
0
_ _ _ _ _ X _ 
0
_ _ _ _ _ _ X 
1
_ _ _ _ _ X _ 
1
_ _ _ _ X _ _ 
1
_ _ _ X _ _ _ 
1
_ _ X _ _ _ _ 
1
_ X _ _ _ _ _ 
1
_ _ X _ _ _ _ 
1
_ X _ _ _ _ _ 
1
_ _ X _ _ _ _ 
1
_ _ _ X _ _ _ 
1
_ _ X _ _ _ _ 
1
_ _ _ X _ _ _ 
1
_ _ _ _ X _ _ 
1
_ _ _ _ _ X _ 
1
_ _ _ _ _ _ X 
2
_ _ _ _ _ X _ 
2
_ _ _ _ X _ _ 
2
_ _ _ X _ _ _ 
2
_ _ X _ _ _ _ 
2
_ X _ _ _ _ _ 
2
_ _ X _ _ _ _ 
2
_ _ _ X _ _ _ 
2
_ _ X _ _ _ _ 
2
_ X _ _ _ _ _ 
2
_ _ X _ _ _ _ 
2
_ X _ _ _ _ _ 
2
_ _ X _ _ _ _ 
2
_ _ _ X _ _ _ 
2
_ _ _ _ X _ _ 
2
_ _ _ _ _ X _ 
2
_ _ _ _ _ _ X 
3
_ _ _ _ _ X _ 
3
_ _ _ _ _ _ X 
4
_ _ _ _ _ X _ 
4
_ _ _ _ X _ _ 
4
_ _ _ X _ _ _ 
4
_ _ X _ _ _ _ 
4
_ _ _ X _ _ _ 
4
_ _ X _ _ _ _ 
4
_ X _ _ _ _ _ 
4
_ _ X _ _ _ _ 
4
_ X _ _ _ _ _ 
4
X _ _ _ _ _ _ 
5
_ X _ _ _ _ _ 
5
X _ _ _ _ _ _ 

In [None]:
# Check the total rewards we got
print(totalRandom)
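A single run of the random policy is noisy, so its total reward says little on its own. One way to get a more stable picture is to average over many independent episodes; this sketch assumes that 1000 episodes of $T = 100$ steps in a world of length 7 are enough for a rough estimate. The helper `run_random_episode` is hypothetical: it repeats the dynamics of `.step()` inline and takes a seeded `random.Random` instance so that the estimate is reproducible.

```python
import random
import statistics

LEFT = 1
RIGHT = 2


def run_random_episode(rng, length=7, T=100):
    """Total reward of one T-step episode under the uniform random policy.

    Hypothetical helper: same dynamics as LinearWorld.step(), written
    inline so the sketch is self-contained; `rng` is a random.Random.
    """
    state = length // 2
    total = 0
    for _ in range(T):
        action = rng.choice([LEFT, RIGHT])
        if state == 0:
            state = 1
        elif state == length - 1:
            state = length - 2
        elif action == LEFT:
            state -= 1
        else:
            state += 1
        if state in (0, length - 1):
            total += 1
    return total


rng = random.Random(0)  # fixed seed so the estimate is reproducible
totals = [run_random_episode(rng) for _ in range(1000)]
print(statistics.mean(totals), statistics.stdev(totals))
```

Using a dedicated `random.Random` object instead of the module-level functions keeps the seeded stream independent of any other random calls in the notebook.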

In [14]:
# Run the "right" policy for T steps, update the total reward
lw = LinearWorld(7)
totalRight = 0
for t in range(T):
    _, reward = lw.step(RIGHT)
    totalRight += reward
    lw.showWorld()
print(totalRight)

_ _ _ _ X _ _ 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ _ X 
_ _ _ _ _ X _ 
_ _ _ _ _ 

In [None]:
# Check the total rewards we got
print(totalRight)
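Because the "right" policy is deterministic, its total reward can also be worked out by hand. Starting from $S_0 = \lfloor n/2 \rfloor$, the player needs $d$ steps to reach position $n-1$ and collects $+1$ on arrival; after that it alternates between $n-2$ and $n-1$, so it lands on the edge (and is rewarded) at steps $d, d+2, d+4, \dots$. Counting those steps up to $T$ gives the following short derivation, assuming $n \ge 3$ and $T \ge d$ (the symbols $d$ and $G_{\text{right}}$ are introduced here for illustration):

$$
G_{\text{right}} = \left\lfloor \frac{T - d}{2} \right\rfloor + 1,
\qquad d = n - 1 - \left\lfloor \frac{n}{2} \right\rfloor .
$$

For $n = 7$ and $T = 100$ this gives $d = 3$ and $G_{\text{right}} = \lfloor 97/2 \rfloor + 1 = 49$, which is the value `print(totalRight)` should report.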