In [1]:
import gym

In [19]:
env = gym.make("Taxi-v3").env

env.render()

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(taxi at R, passenger at G, destination at B)



In [5]:
print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))

Action Space Discrete(6)
State Space Discrete(500)


- **blue letter** represents the current passenger *pick-up* location
- **purple letter** is the current *destination*
- **yellow highlight** marks the taxi's current position
- location indices: (R, G, Y, B) $\rightarrow$ (0, 1, 2, 3)


In [11]:
state = env.encode(3, 1, 3, 0) # (taxi row, taxi column, passenger index, destination index)
print("State:", state)

env.s = state
env.render()

State: 332
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(taxi at row 3, column 1; passenger at B, destination at R)
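
The index 332 printed above follows from how the four components are packed into a single integer. Here is a minimal sketch of that arithmetic, assuming the 5 x 5 grid, five passenger positions (the four named locations plus "in the taxi"), and four destinations, which together give the 500 states reported earlier. `encode_state` and `decode_state` are illustrative helper names, not part of gym.

```python
def encode_state(taxi_row, taxi_col, passenger_idx, destination_idx):
    """Pack (row, col, passenger, destination) into one integer in [0, 500)."""
    index = taxi_row
    index = index * 5 + taxi_col         # 5 columns
    index = index * 5 + passenger_idx    # 5 passenger positions (4 = in the taxi)
    index = index * 4 + destination_idx  # 4 destinations
    return index


def decode_state(index):
    """Invert encode_state by peeling the components off in reverse order."""
    index, destination_idx = divmod(index, 4)
    index, passenger_idx = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_idx, destination_idx


print(encode_state(3, 1, 3, 0))  # 332, matching the output above
print(decode_state(332))         # (3, 1, 3, 0)
```

Because the destination occupies the lowest "digit", two states that differ only in destination are adjacent integers, while moving the taxi one row changes the index by 100.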



## The Reward Table

When the Taxi environment is created, it also builds an initial reward table called `P`. We can think of it as a matrix with one row per state and one column per action, where each entry `P[state][action]` is a list of `(probability, nextstate, reward, done)` tuples.

Mapping: $\text{states} \times \text{actions}$

- The keys 0-5 correspond to the actions (south, north, east, west, pickup, dropoff) the taxi can perform in the current state shown in the illustration.
- In this env, `probability` is always 1.0.
- `nextstate` is the state we would be in after taking the action at that key of the dict.
- All the movement actions have a -1 reward and the pickup/dropoff actions have a -10 reward in this particular state. If we were in a state where the taxi has the passenger and is on top of the right destination, we would see a reward of 20 at the dropoff action (5).
- `done` tells us when we have successfully dropped off a passenger at the right location. Each successful dropoff ends an episode.

In [12]:
env.P[state]

{0: [(1.0, 432, -1, False)],
 1: [(1.0, 232, -1, False)],
 2: [(1.0, 352, -1, False)],
 3: [(1.0, 332, -1, False)],
 4: [(1.0, 332, -10, False)],
 5: [(1.0, 332, -10, False)]}
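
To make the table above easier to read, its entries can be unpacked into named fields. This sketch copies the printed dictionary into a literal so that it runs without gym. `ACTION_NAMES` and `describe_transitions` are hypothetical helper names, and the action order is the one listed earlier (south, north, east, west, pickup, dropoff).

```python
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

CURRENT_STATE = 332

# Literal copy of the env.P[state] output shown above.
transitions = {
    0: [(1.0, 432, -1, False)],
    1: [(1.0, 232, -1, False)],
    2: [(1.0, 352, -1, False)],
    3: [(1.0, 332, -1, False)],
    4: [(1.0, 332, -10, False)],
    5: [(1.0, 332, -10, False)],
}


def describe_transitions(table):
    """Flatten {action: [(prob, next_state, reward, done)]} into labeled rows."""
    rows = []
    for action, outcomes in table.items():
        for probability, next_state, reward, done in outcomes:
            rows.append((ACTION_NAMES[action], probability, next_state, reward, done))
    return rows


rows = describe_transitions(transitions)
for name, probability, next_state, reward, done in rows:
    print(f"{name:8s} p={probability} next={next_state} reward={reward} done={done}")

# Actions that leave the taxi where it is (a wall, or an illegal pickup/dropoff).
unchanged = [name for name, _, next_state, _, _ in rows if next_state == CURRENT_STATE]
print("No state change:", unchanged)
```

In this state, moving west leaves the state at 332 because of the wall to the taxi's left, while pickup and dropoff leave it unchanged and cost -10 each.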

## Solving the environment with Brute Force
We'll create a loop that runs until one passenger reaches the destination (one episode), in other words until the received reward is 20 and `done` becomes `True`. The `env.action_space.sample()` method selects one action uniformly at random from the set of all possible actions.




In [13]:
epochs = 0  # counts timesteps within this single episode
penalties, reward = 0, 0

frames = [] # for animation

done = False

while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    if reward == -10:
        penalties += 1
    
    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
        }
    )

    epochs += 1
    
    
print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))


Timesteps taken: 222
Penalties incurred: 77
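
The counters printed above can also be recovered from the `frames` list after the fact. Below is a small sketch, assuming each frame is a dict with the `reward` key used in the loop. `summarize_episode` is a hypothetical helper, and the sample frames are made up for illustration (they are not taken from the run above).

```python
def summarize_episode(frames):
    """Return timestep count, number of -10 penalties, and total reward."""
    rewards = [frame["reward"] for frame in frames]
    return {
        "timesteps": len(rewards),
        "penalties": sum(1 for r in rewards if r == -10),
        "total_reward": sum(rewards),
    }


# Illustrative frames only: west into a wall, an illegal dropoff, then north.
example_frames = [
    {"state": 332, "action": 3, "reward": -1},
    {"state": 332, "action": 5, "reward": -10},
    {"state": 232, "action": 1, "reward": -1},
]

summary = summarize_episode(example_frames)
print(summary)
```

Calling `summarize_episode(frames)` on the list built by the loop should reproduce the timestep and penalty counts, and additionally gives the episode's total reward, which is a convenient single number for comparing the random agent with a learned one.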


In [17]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)
        
print_frames(frames)


+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
(taxi, passenger, and destination all at R)

Timestep: 222
State: 0
Action: 5
Reward: 20
