# Linear World

In this notebook we implement a very simple example of the agent-environment interface used in reinforcement learning, called "linear world".

The world consists of $n$ places in a row, labelled $0, 1, \dots, n-1$, and the state of the world is the position of the player. The initial state has the player in the middle of the world, at position $\lfloor n/2 \rfloor$:
- The (empty) world for $n=5$: `"_ _ _ _ _"`
- With the player in position $2$: `"_ _ X _ _"`

The actions that the agent can take are `LEFT` and `RIGHT`, each moving the player one place in the indicated direction. In the two outer positions ($0$ and $n-1$), both actions result in a step towards the inside. Continuing the example above from position $2$:
- After action `RIGHT`: `"_ _ _ X _"`
- After action `RIGHT`: `"_ _ _ _ X"`
- After action `LEFT` or `RIGHT`: `"_ _ _ X _"`

The reward is $+1$ for an action that leaves the player in one of the outer positions, and $0$ otherwise. A possible sequence of events in this setting is:
- `"_ _ X _ _"` $S_0 = 2, A_0 = \text{``RIGHT"}$
- `"_ _ _ X _"` $R_1 = 0, S_1 = 3, A_1 = \text{``LEFT"}$
- `"_ _ X _ _"` $R_2 = 0, S_2 = 2, A_2 = \text{``LEFT"}$
- `"_ X _ _ _"` $R_3 = 0, S_3 = 1, A_3 = \text{``LEFT"}$
- `"X _ _ _ _"` $R_4 = 1, S_4 = 0, A_4 = \text{``RIGHT"}$
- `"_ X _ _ _"` $R_5 = 0, S_5 = 1, A_5 = \dots$
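
The dynamics and the reward rule described above can be summarised in a single pure function. The following is a minimal sketch, separate from the class implemented below: the name `transition` and the use of strings for the actions are assumptions made for illustration only.

```python
def transition(pos, action, n):
    """Return (new_pos, reward) for one step in a world of n places (n >= 3 assumed)."""
    if pos == 0:
        new_pos = 1            # left edge: any action moves inward
    elif pos == n - 1:
        new_pos = n - 2        # right edge: any action moves inward
    elif action == "LEFT":
        new_pos = pos - 1
    elif action == "RIGHT":
        new_pos = pos + 1
    else:
        raise ValueError(f"Invalid action: {action!r}")
    # Reward +1 only if the step ends on one of the outer positions
    reward = 1 if new_pos in (0, n - 1) else 0
    return new_pos, reward


# Replay the example sequence for n = 5, starting in the middle (position 2)
pos = 2
trace = []
for action in ["RIGHT", "LEFT", "LEFT", "LEFT", "RIGHT"]:
    pos, reward = transition(pos, action, 5)
    trace.append((pos, reward))

print(trace)  # [(3, 0), (2, 0), (1, 0), (0, 1), (1, 0)]
```

Each pair in `trace` is $(S_{t+1}, R_{t+1})$, so the only reward is collected on the step that reaches position $0$.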




In [None]:
# We use constants 1 and 2 to represent LEFT, RIGHT:
LEFT = 1
RIGHT = 2

In [None]:
class LinearWorld:
    def __init__(self, length):
        # Store length of world
        self.length = length
        
        # Initialize state of world in the middle
        self.pos = length // 2
    
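    def step(self, action):
        """
        Perform an action (LEFT or RIGHT) and return the new state and the reward
        """
        # Reject unknown actions, also when the player is at an edge
        if action not in (LEFT, RIGHT):
            raise ValueError('Invalid action!')

        # Compute new state (at the edges, both actions move inwards)
        if self.pos == 0:
            self.pos += 1
        elif self.pos == self.length - 1:
            self.pos -= 1
        elif action == LEFT:
            self.pos -= 1
        else:
            self.pos += 1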

        # Compute reward (+1 on either outer position, 0 otherwise)
        if self.pos == 0 or self.pos == self.length - 1:
            reward = 1
        else:
            reward = 0
        
        # Return state and reward
        return self.pos, reward
    
    def reset(self):
        """
        Reset the position to the middle
        """
        self.pos = self.length // 2
    
    def showWorld(self):
        """
        Print a representation of the linear world
        """
        # Reuse the string conversion defined in `__str__`
        print(self)
    
    def __str__(self):
        # (!) Advanced concept:
        # Custom string-conversion (used e.g. by `print()`)
        
        # Start with an empty string
        text = ''
        
        # Add "_" for every empty spot, "X" for the player
        for i in range(self.length):
            if i == self.pos:
                text = text + 'X '
            else:
                text = text + '_ '
        return text


## Testing the linear world

First, we create an instance of `LinearWorld`, then we use the `.step()` method to perform actions ($A_t$) and observe the resulting state ($S_{t+1}$) and reward ($R_{t+1}$).

In [None]:
# Create a new instance of the LinearWorld class
lw = LinearWorld(7)

In [None]:
# Check the attributes `length` and `pos` of the instance
print(lw.length)
print(lw.pos)

In [None]:
# Make a step
lw.step(RIGHT)

In [None]:
# Make a step, store the outcome
state, reward = lw.step(RIGHT)

# Print the outcome
print("New state:", state)
print("Reward:", reward)

We can use the method `.showWorld()` to visualize the events "graphically":

In [None]:
# Make a step
state, reward = lw.step(LEFT)

# Print/show the outcome
print("New state:", state)
print("Reward:", reward)
lw.showWorld()

## Two simple policies

Next, we implement two policies and see how they perform over a timespan of $T = 100$ steps:
- The random policy randomly chooses an action
- The "right" policy always goes `RIGHT`
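
Both policies can also be written as plain functions that map the current state to an action, which makes it easy to swap one for the other in the same loop. This is only a sketch: `random_policy`, `right_policy` and `run_policy` are hypothetical helpers, not part of the cells in this notebook, and the world is assumed to expose `pos` and a `step(action)` method returning `(state, reward)`, as `LinearWorld` does.

```python
import random

# Same action constants as defined earlier in the notebook
LEFT = 1
RIGHT = 2


def random_policy(state):
    """Pick LEFT or RIGHT with equal probability, ignoring the state."""
    return random.choice([LEFT, RIGHT])


def right_policy(state):
    """Always go RIGHT, ignoring the state."""
    return RIGHT


def run_policy(world, policy, steps):
    """Follow `policy` for `steps` steps and return the total reward."""
    state = world.pos
    total = 0
    for _ in range(steps):
        action = policy(state)
        state, reward = world.step(action)
        total += reward
    return total
```

With the class defined above, `run_policy(LinearWorld(7), right_policy, 100)` would run the "right" policy for 100 steps.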

In [None]:
# We use the standard-library `random` module to choose a random action
import random

In [None]:
# Number of steps
T = 100

In [None]:
# Run the random policy
lw = LinearWorld(7)
totalRandom = 0
for t in range(T):
    action = random.choice([LEFT, RIGHT])
    state, reward = lw.step(action)
    totalRandom += reward
    # lw.showWorld() # Uncomment this to see each step

In [None]:
# Check the total rewards we got
print(totalRandom)

In [None]:
# Run the "right" policy
lw = LinearWorld(7)
totalRight = 0
for t in range(T):
    action = RIGHT
    state, reward = lw.step(action)
    totalRight += reward
    # lw.showWorld() # Uncomment this to see each step

In [None]:
# Check the total rewards we got
print(totalRight)
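
The total for the "right" policy can be predicted by hand. Starting in the middle, the player needs $d = (n-1) - \lfloor n/2 \rfloor$ steps to reach the right edge, and from then on it bounces between positions $n-2$ and $n-1$, collecting a reward on every second step. The sketch below checks this reasoning against a direct simulation of the dynamics described at the top; the function names are hypothetical and $n \geq 3$ is assumed.

```python
def right_policy_total(n, T):
    """Closed-form total reward of the always-RIGHT policy (assumes n >= 3)."""
    d = (n - 1) - n // 2          # steps needed to first reach the right edge
    if T < d:
        return 0
    return (T - d) // 2 + 1       # rewards at steps d, d + 2, d + 4, ...


def simulate_right(n, T):
    """Simulate the always-RIGHT policy step by step (assumes n >= 3)."""
    pos, total = n // 2, 0
    for _ in range(T):
        # At the right edge the player is pushed back; otherwise it moves right
        pos = pos - 1 if pos == n - 1 else pos + 1
        total += 1 if pos == n - 1 else 0
    return total


print(right_policy_total(7, 100))   # 49
print(simulate_right(7, 100))       # 49
```

For $n = 7$ and $T = 100$ we get $d = 3$, so rewards arrive at steps $3, 5, \dots, 99$, which gives $49$ in total.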