## Variations on the Taxi-Grid Environment

### Single Taxi With Fuel

In [1]:
import gym

First, we will demonstrate the most basic environment: a single taxi with fuel.

In [26]:
from onetaxifuel_env import OneTaxiFuelEnv
env = OneTaxiFuelEnv()
env.reset()
env.render()

+---------+
|[35mR[0m: |F: :G|
| : | :[43m [0m: |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
Fuel: 10



OneTaxiFuelEnv is an environment with a single taxi, represented by the yellow highlighted block. The taxi's objective is to move the passenger from the blue Y to the magenta R. The action space is (0, 1, 2, 3, 4, 5, 6): actions 0, 1, 2, and 3 move the taxi south, north, east, and west respectively, 4 picks up the passenger, 5 drops off the passenger, and 6 refuels.
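For readability, the integer actions can be given names. The following is a small sketch, not part of the environment: the `TaxiAction` enum and its member names are hypothetical, while the numeric values follow the mapping described above.

```python
from enum import IntEnum


class TaxiAction(IntEnum):
    """Hypothetical names for the integer actions described in the text."""
    SOUTH = 0
    NORTH = 1
    EAST = 2
    WEST = 3
    PICKUP = 4
    DROPOFF = 5
    REFUEL = 6


# IntEnum members compare equal to plain ints, so they can usually
# stand in wherever an integer action is expected.
print(TaxiAction(6).name)     # REFUEL
print(int(TaxiAction.NORTH))  # 1
```

Whether the environment accepts such members directly depends on how it handles its action argument; converting with `int(...)` is the safe choice.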

In [27]:
state, reward, done, _ = env.step(6)
print(f"The next state is: {state}, the reward for the last action is: {reward}, and the episode is {'' if done else 'not '}done.")

The next state is: 1858, the reward for the last action is: -10, and the episode is not done.


To perform an action, we use the step function. It returns four values: the next state (encoded as a single integer), the reward, a flag indicating whether the episode has ended, and an info object that we ignore here. In this case the reward is -10, which appears to be the penalty for an invalid action (the taxi tried to refuel while it was not at the fuel station F), and the flag is False, since the episode ends only when the taxi has dropped off the passenger. Now, we will navigate to the fuel station.

In [29]:
env.step(1)
env.step(3)
env.render()

+---------+
|[35mR[0m: |[43mF[0m: :G|
| : | : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
Fuel: 8
  (West)


Here, note that since we have moved two steps, we have consumed two units of fuel. There is a large penalty for moving when the taxi does not have fuel. We will now refuel. 

In [32]:
state, reward, done, _ = env.step(6)
env.render()

+---------+
|[35mR[0m: |[43mF[0m: :G|
| : | : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
Fuel: 10
  (Refill)
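The fuel readouts so far (10 at the start, 8 after two moves, 10 after refuelling) are consistent with a simple bookkeeping rule. The sketch below is a toy model of that rule only: `FuelTank` is a hypothetical class, not the environment's code, and it assumes one unit of fuel per move and a refuel that restores the maximum level.

```python
class FuelTank:
    """Toy fuel model (an assumption, not the environment's actual code)."""

    def __init__(self, max_fuel=10):
        self.max_fuel = max_fuel
        self.fuel = max_fuel  # start with a full tank

    def move(self):
        """Consume one unit of fuel; return False if the tank is empty."""
        if self.fuel == 0:
            # The text says the environment applies a large penalty here;
            # whether the move is also blocked is not specified.
            return False
        self.fuel -= 1
        return True

    def refuel(self):
        """Restore the tank to its maximum level."""
        self.fuel = self.max_fuel


tank = FuelTank()
tank.move()
tank.move()
print(tank.fuel)  # 8
tank.refuel()
print(tank.fuel)  # 10
```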


After refuelling, the taxi is back at full fuel. Also note that the state of the environment is encoded as a single integer. Sometimes it is useful to decode what that integer actually means (for example, when using deep Q-learning). We can do that using the decode function.

In [41]:
# decode() yields the fields in this order; x is the taxi's row and y its column.
x, y, pass_loc, pass_dest, fuel = env.decode(state)
print(f"The coordinates of the taxi are currently: ({x},{y})")
print(f"The index of the passenger location is: {pass_loc}, while the index of the passenger destination is: {pass_dest}")
print(f"Currently, the fuel level of the taxi is: {fuel}")

The coordinates of the taxi are currently: (0,2)
The index of the passenger location is: 2, while the index of the passenger destination is: 0
Currently, the fuel level of the taxi is: 10
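To make the single-integer encoding more concrete, here is a sketch of one possible mixed-radix layout. The field sizes are assumptions (a 5x5 grid, five passenger location indices, four destination indices, fuel levels 0 to 10), and the environment's actual implementation is not shown in this notebook. With these assumed sizes, the layout does reproduce the state 1858 printed earlier, which is consistent with a taxi at (1,3), passenger index 2, destination index 0, and fuel 10.

```python
# Radix of each field, in encoding order (all sizes are assumptions):
# row, column, passenger location, passenger destination, fuel level.
RADICES = (5, 5, 5, 4, 11)


def encode(row, col, pass_loc, pass_dest, fuel):
    """Pack the five fields into one integer (mixed-radix encoding)."""
    index = 0
    for value, radix in zip((row, col, pass_loc, pass_dest, fuel), RADICES):
        if not 0 <= value < radix:
            raise ValueError(f"value {value} out of range for radix {radix}")
        index = index * radix + value
    return index


def decode(index):
    """Inverse of encode: unpack an integer into the five fields."""
    fields = []
    for radix in reversed(RADICES):
        index, value = divmod(index, radix)
        fields.append(value)
    return tuple(reversed(fields))


print(encode(1, 3, 2, 0, 10))  # 1858
print(decode(1858))            # (1, 3, 2, 0, 10)
```

A decoded tuple like this is often a more convenient input for a neural network than the raw integer, since each field can be scaled or one-hot encoded separately.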


This concludes the demonstration for the single taxi with fuel environment.

### Multiple Taxis With Fuel

The most sophisticated environment is the multiple-taxi environment with fuel. It supports an arbitrary number of taxis and an arbitrary number of passengers. By default, the environment is initialized with two taxis and one passenger, and each taxi has a maximum (and starting) fuel of 8. All of these parameters can be changed.

In [67]:
from multitaxifuel_env import MultiTaxiFuelEnv
env = MultiTaxiFuelEnv(num_taxis=2, num_passengers=2, max_fuel=8)
env.reset()
env.render()

+---------+
|X: |F: :[35mX[0m|
| :[43m_[0m| : : |
| : : : : |
| | :[41m_[0m| : |
|[34;1mX[0m| :G|[35m[34;1mX[0m[0m: |
+---------+
Taxi1: Fuel: 8, Location: (1,1)
Taxi2: Fuel: 8, Location: (3,2)
Passenger1: Location: (4, 3), Destination: (0, 4)
Passenger2: Location: (4, 0), Destination: (4, 3)


After initializing the environment, the render shows the location and fuel level of each taxi, as well as the location and destination of each passenger. In the multiple-passenger environment, the episode does not end until every passenger has been delivered to their destination.

In [68]:
print(env.state)

[[[1, 1], [3, 2]], [8, 8], [[4, 3], [4, 0]], [[0, 4], [4, 3]], [0, 0]]
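The printed state can be read field by field. The sketch below wraps it in a named tuple for clarity; `MultiTaxiState` and its field names are hypothetical. The first four fields are inferred by matching the values against the rendered output above, and the meaning of the fifth field is not stated in the text, so it gets a neutral name.

```python
from typing import List, NamedTuple


class MultiTaxiState(NamedTuple):
    """Hypothetical named view of the state list (field names are assumed)."""
    taxi_locations: List[List[int]]
    fuel_levels: List[int]
    passenger_locations: List[List[int]]
    passenger_destinations: List[List[int]]
    passenger_flags: List[int]  # meaning not documented in the text


# The state list exactly as printed above.
raw_state = [[[1, 1], [3, 2]], [8, 8], [[4, 3], [4, 0]], [[0, 4], [4, 3]], [0, 0]]
state = MultiTaxiState(*raw_state)

for i, (loc, fuel) in enumerate(zip(state.taxi_locations, state.fuel_levels), start=1):
    print(f"Taxi{i}: Fuel: {fuel}, Location: {tuple(loc)}")
```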


Note that the state of this environment, rather than being stored as a single number, is stored as a list. Also, since there are multiple taxis, we now perform a joint action, which is passed as a list with one action per taxi. The actions for each individual taxi are the same as in the single-taxi fuel environment. For example, suppose that we want taxi 1 to go north (action 1) and taxi 2 to go south (action 0).

In [69]:
state, reward, done, _ = env.step([1,0])
print("Now the reward is given as the reward for each individual taxi: " + str(reward))

env.render()

Now the reward is given as the reward for each individual taxi: [-1, -1]
+---------+
|X:[43m_[0m|F: :[35mX[0m|
| : | : : |
| : : : : |
| | : | : |
|[34;1mX[0m| :[41mG[0m|[35m[34;1mX[0m[0m: |
+---------+
  (North ,South)
Taxi1: Fuel: 7, Location: (0,1)
Taxi2: Fuel: 7, Location: (4,2)
Passenger1: Location: (4, 3), Destination: (0, 4)
Passenger2: Location: (4, 0), Destination: (4, 3)
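Since both the joint action and the reward are lists with one entry per taxi, small helpers can make them easier to read. This is a sketch under assumptions: `describe_joint_action` and `total_reward` are hypothetical helpers, the action labels follow the mapping given for the single-taxi environment, and summing the per-taxi rewards is just one possible way to obtain a team reward.

```python
# Labels for actions 0..6, following the mapping described in the text.
ACTION_NAMES = ["South", "North", "East", "West", "Pickup", "Dropoff", "Refuel"]


def describe_joint_action(joint_action):
    """Return a readable label for a list with one action per taxi."""
    return "(" + ", ".join(ACTION_NAMES[a] for a in joint_action) + ")"


def total_reward(rewards):
    """Sum the per-taxi rewards into a single team reward."""
    return sum(rewards)


print(describe_joint_action([1, 0]))  # (North, South)
print(total_reward([-1, -1]))         # -2
```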


Now, we will try a random solution, and see how long it takes to finish an episode. In particular, at every time step, each of the taxis will choose a random action. We can easily do this using the env.action_space.sample() function, which returns a random sample of the action space corresponding to a random action by each taxi.
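Conceptually, sampling a joint action amounts to drawing one action per taxi. The sketch below imitates this with the standard library; `sample_joint_action` and `NUM_ACTIONS` are hypothetical names, uniform sampling is an assumption, and the real `env.action_space.sample()` may use a different random source or return type (for example, an array rather than a list).

```python
import random

NUM_ACTIONS = 7  # actions 0..6, as described for the single-taxi environment


def sample_joint_action(num_taxis, rng=random):
    """Draw one uniformly random action per taxi (illustrative stand-in)."""
    return [rng.randrange(NUM_ACTIONS) for _ in range(num_taxis)]


rng = random.Random(0)  # seeded only to make the example repeatable
joint_action = sample_joint_action(2, rng)
print(joint_action)
```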

In [72]:
env.reset()
epochs = 0
penalties, reward = 0, 0

frames = [] # for animation

done = False

while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    # Count a penalty if any taxi received the -10 reward
    if any(r == -10 for r in reward):
        penalties += 1

    # Store each rendered frame in a dict for the animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward,
    })

    epochs += 1
    
    
print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))

Timesteps taken: 11487
Penalties incurred: 11392


As the output shows, the episode takes a very long time to finish (11,487 timesteps), and the taxis incur many penalties (11,392), meaning that they attempted many invalid moves. We can see this in action by replaying the frames of that episode.
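To put those two numbers in proportion, the short calculation below computes the share of timesteps in which at least one taxi received a reward of -10. It only uses the figures printed by the run above.

```python
timesteps = 11487  # "Timesteps taken" from the run above
penalties = 11392  # "Penalties incurred" from the run above

penalty_rate = penalties / timesteps
print(f"{penalty_rate:.2%} of timesteps included a -10 reward")  # 99.17% ...
print(f"{timesteps - penalties} timesteps were penalty-free")    # 95 ...
```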

In [None]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)
        
print_frames(frames)

+---------+
|[35m[35mX[0m[0m: |F: :[43mX[0m|
| : | : :[34;1m [0m|
| : : : : |
|[34;1m [0m| : | : |
|[41mX[0m| :G|X: |
+---------+
  (Pickup ,Pickup)
Taxi1: Fuel: 0, Location: (0,4)
Taxi2: Fuel: 0, Location: (4,0)
Passenger1: Location: (1, 4), Destination: (0, 0)
Passenger2: Location: (3, 0), Destination: (0, 0)

Timestep: 2673
State: [[[4, 0], [0, 0]], [0, 0], [[1, 0], [2, 2]], [[0, 0], [0, 0]], [-1, -1]]
Action: [4 4]
Reward: [-10, -10]
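Replaying all 11,487 frames at 0.1 s per frame would take roughly 19 minutes, not counting rendering overhead. One option is to replay only a subset of the frames. The sketch below is a hypothetical helper (`subsample` is not part of the notebook) that keeps every n-th frame plus the final one; it is demonstrated on stand-in data of the same length.

```python
def subsample(frames, every):
    """Keep every `every`-th frame, always including the final frame."""
    if every < 1:
        raise ValueError("every must be a positive integer")
    picked = list(frames[::every])
    if frames and (len(frames) - 1) % every != 0:
        picked.append(frames[-1])  # make sure the ending is shown
    return picked


# Stand-in data with the same length as the episode above.
dummy_frames = [{"frame": f"frame {i}"} for i in range(11487)]
short = subsample(dummy_frames, every=100)
print(len(short))                  # 116
print(round(len(short) * 0.1, 1))  # 11.6 (seconds at 0.1 s per frame)
```

With the real data, this would be used as `print_frames(subsample(frames, every=100))`. Note that the timestep counter printed by `print_frames` would then count displayed frames rather than original timesteps.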
