<font size="5">
 <div class="alert alert-block alert-info"><b>Master in Data Science - Iscte </b>
     </div>
</font> 
 
 
     
    
  <font size="5"> OEOD </font>
  
  
  
  <font size="3"> **Diana Aldea Mendes**, October 2023 </font>
  
   
  <font size="3"> *diana.mendes@iscte-iul.pt* </font> 
  
    
 
  
    
  <font color='blue'><font size="5"> <b>Week 6 - Self-Driving Taxi Problem</b></font></font>
  

# RL libraries

- **OpenAI Gym** 
- KerasRL
- TensorFlow
- RL-Coach
- RLkit
- Stable Baselines
- Dopamine
- TF Agents

# OpenAI Gym

- OpenAI **Gym** package (now maintained as Gymnasium): https://gymnasium.farama.org/
    - can be used to build and test RL algorithms
    - several environments are available, each providing a state space and an action space, along with the rewards and outcome responses
    - you can also construct a **new** environment (*custom environment*)
    - Worked example in this notebook: **Taxi-v3 task**
    - `gym.make()` - creates the environment and returns it
    - `env.reset()` - resets the environment to an initial state


## Example - Taxi task

- **Reference**: *Habib, N. (2019), Hands-on Q-Learning with Python. Practical Q-Learning with OpenAI Gym, Keras and Tensorflow, Packt.*

- **Goal**: a self-driving taxi with the task to collect passenger(s) from a starting location and drop them off at their desired destination in the fewest steps possible


### Environment

In [None]:
import numpy as np
import gym

env = gym.make('Taxi-v3', render_mode='ansi')
state, info = env.reset()  # reset() returns a pair: (observation, info)

In [None]:
print(state)

# the value of state; it will be a different random value between 0 and 499 every time we run env.reset() 

In [None]:
## visualize the environment, 
# yellow rectangle = taxi is free
# green rectangle = taxi has a passenger
# letter = destinations (to pick-up or drop passenger)

print(env.render())

### States and Actions

- Each state variable is characterized by:
    - Where the taxi is now (out of 25 possibilities)
    - Where the passenger is now (inside the taxi or at one of the four locations marked R, G, B, or Y)
    - Where the passenger's destination is (R, G, B, or Y)
- This gives us 25 x 5 x 4 = 500 distinct states.
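
The count above can be checked with a small sketch of how the three components are packed into one integer. The packing order used here (taxi row, taxi column, passenger location, destination) is assumed to match the one used by Taxi-v3; the function names `encode` and `decode` are chosen for illustration.

```python
# Sketch of the Taxi-v3 state encoding.
# Assumed order: taxi row, taxi column, passenger location, destination.
def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four components into a single integer in [0, 499]."""
    state = taxi_row
    state = state * 5 + taxi_col        # 5 columns
    state = state * 5 + passenger_loc   # 4 landmarks + "inside the taxi"
    state = state * 4 + destination     # 4 landmarks
    return state


def decode(state):
    """Invert encode(): return (taxi_row, taxi_col, passenger_loc, destination)."""
    state, destination = divmod(state, 4)
    state, passenger_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_loc, destination


# Enumerate every combination and count the distinct state numbers
all_states = {encode(r, c, p, d)
              for r in range(5) for c in range(5)
              for p in range(5) for d in range(4)}
print(len(all_states))   # 500
print(decode(50))        # (0, 2, 2, 2)
```

Under this assumption, state 50 corresponds to the taxi in row 0, column 2, with passenger location 2 and destination 2.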

In [None]:
# run again 'reset()' function
env.reset()

In [None]:
print(env.render())
# the taxi agent has moved to a different random location

In [None]:
# get the number of states from the environment
env.observation_space
#print('State Space {}'.format(env.observation_space))

In [None]:
# get the number of actions from the environment
env.action_space

In [None]:
# the 6 actions are:
# 0: South
# 1: North
# 2: East
# 3: West
# 4: Pickup
# 5: Drop-off


# generate a valid action from the action space (randomly selects an action)
env.action_space.sample()

In [None]:
# choose an action manually (1 - move north)
env.step(1)


######################################################################

- `env.step(1)` returns the following five values:
    - observation: the new state that we are in (for example, state 232).
    - reward: the reward that we have received (-1).
    - done (named `terminated` in the Gym API): whether the episode has ended because the passenger was dropped off at the correct location (False).
    - truncated: True if the episode was cut short by a time limit or another external reason (False).
    - info: additional information that may be useful for debugging.
- Usually, we do **not set the action values manually**; instead, we let the algorithm that we are running choose them.
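
As a minimal sketch of how these five values can be handled, the snippet below gives each position a name. `StepResult` is a hypothetical helper (it is not part of Gym), and the tuple values are illustrative rather than taken from a real run.

```python
from typing import NamedTuple


class StepResult(NamedTuple):
    """Hypothetical container naming the five values returned by env.step()."""
    observation: int
    reward: float
    terminated: bool   # called `done` in this notebook
    truncated: bool
    info: dict


# Illustrative values, shaped like the tuple returned by env.step(1)
result = StepResult(232, -1, False, False, {"prob": 1.0})

# An episode is over when it either terminated or was truncated
episode_over = result.terminated or result.truncated
print(result.observation, result.reward, episode_over)
```

With a real environment, the same container could be filled with `StepResult(*env.step(action))`.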

In [None]:
print(env.render())

### Random Agent

- In this case, the agent takes random actions and does not keep track of its actions or learn from them
- This is the **baseline agent**; it serves as a control against which the performance of other RL models is compared

In [None]:
## random action - with 'env.action_space.sample()' function
# returns a random action

env.action_space.sample()


In [None]:
# apply a random action to the environment and move to the next state
env.step(env.action_space.sample())

In [None]:
observation, reward, done, truncate, info = env.step(env.action_space.sample())

In [None]:
## consider state = 50
# set the state on the underlying environment (gym.make() returns it wrapped)
env.unwrapped.s = 50
print(env.render())

In [None]:
# example (one action, one state change)
# take action 0 (move south)
env.step(0)
print(env.render())

In [None]:
print(env.step(0))

In [None]:
# example output of env.step(0) (the exact observation depends on the current state):
# observation: 452 - we are in state 452
# reward: -1
# done: False - the destination has not been reached
# truncate: False
# info: {'prob': 1.0}

#### Creating a task loop
- The agent moves randomly until it successfully reaches the goal
- Goal: drop the passenger at the correct location
- We simply use the basic Gym functions presented before


In [None]:
state = env.reset()
reward = 0
while reward != 20:
    observation, reward, done, truncate, info = env.step(env.action_space.sample())
print(env.render())

In [None]:
# reward = 0 is only the initial value, so that the loop can start
# reward != 20 is the loop condition: we keep stepping until the reward equals 20
# the loop ends when the taxi drops off the passenger at the correct location (B in this run) and receives the 20-point reward

In [None]:
# verify if the task was reached (use done output from 'env.step()')
done

########################################################
- At this point, we know that the agent has taken a series of random actions and has eventually reached the goal. 
- But what actions did it actually take? 
- How many steps did it take to get to the destination?
- All this information is important to compare with other algorithms and analyze their performance and efficiency

In [None]:
observation = env.reset()
count = 0
reward = 0
while reward != 20:
    observation, reward, done, truncate, info = env.step(env.action_space.sample())
    count += 1
print(env.render())

In [None]:
# number of steps to reach the goal
print(count)

In [None]:
# Now, we want to know each action that the agent takes during the loop
# render the game environment at each step, based on how many steps it takes for the agent to reach the destination.
##### very dense output 

observation = env.reset()
count = 0
reward = 0
while reward != 20:
    observation, reward, done, truncate, info = env.step(env.action_space.sample())
    count += 1
    print(env.render()) #render each step of the game loop

In [None]:
## suppose now that we only want to see the action taken at each step
# so we print the sampled action instead of calling 'env.render()'

observation = env.reset()
count = 0
reward = 0
while reward != 20:
    action = env.action_space.sample()
    observation, reward, done, truncate, info = env.step(action)
    count += 1
    print(action)
    #print(action, end=' ')
    

In [None]:
# define dictionary (action number : action description)
taxi_actions = {0 : 'South',1 : 'North',2 : 'East',3 : 'West',4 : 'Pickup',5 : 'Dropoff'}

In [None]:
taxi_actions.get(env.action_space.sample())
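
The step counter, the action log, and the name lookup from the cells above can be combined into one helper. This is a sketch only: `run_random_episode` and its `max_steps` safeguard are assumptions added for illustration, and `env` can be any object exposing the Gym calls used in this notebook.

```python
ACTION_NAMES = {0: "South", 1: "North", 2: "East", 3: "West", 4: "Pickup", 5: "Dropoff"}


def run_random_episode(env, max_steps=10_000):
    """Take random actions until the drop-off reward (+20) is received.

    `env` must provide reset(), step() and action_space.sample().
    `max_steps` is a safeguard (an assumption, not part of the original loop).
    Returns the list of action names taken, in order.
    """
    env.reset()
    trace = []
    for _ in range(max_steps):
        action = env.action_space.sample()
        _, reward, terminated, truncated, _ = env.step(action)
        trace.append(ACTION_NAMES[int(action)])
        if reward == 20:   # same stopping rule as the loops above
            break
    return trace
```

Calling `trace = run_random_episode(env)` with the Taxi environment gives the sequence of action names, and `len(trace)` gives the number of steps.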

### Q-learning Agent

- **Q-learning** *agent* with a smart-taxi (self-driving taxi), discrete-state environment with small state space.
- **Goal**: collect passenger(s) from a starting location and drop them off at their desired destination in the fewest steps possible
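- The taxi collects a reward (+20) when it drops off a passenger at the correct destination and gets penalties otherwise (-1 per step, -10 for an illegal pick-up or drop-off)
- The Q-table stores the estimated value (Q-value) of every state-action pair; these estimates are built from the rewards the agent receives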
- **Gym provides the environment, the actions, the states**
- *We have to provide the Q-learning algorithm that finds the optimal solution*
- *Using Gym will allow you to build reinforcement learning models, compare their performance, keep track of updated versions, and share your work*.

- Tasks:
    - Understanding how the agent updates the Q-table and uses it to make decisions
    - Adapting the appropriate Bellman equation to update the Q-table with each action
    - Understanding the role of the learning rate (alpha) and of the other hyperparameters (gamma, epsilon), and what happens when they are adjusted
    - Implementing epsilon decay to improve the performance of your agent
    
- When the algorithm is complete, the agent will start out with no knowledge of the taxi environment and will learn, through exploration of the environment, the rules that earn it the highest rewards.
- During this process, the agent starts to reach its goal more quickly and efficiently, and it learns to do this without being explicitly programmed to do so. 
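
To make the role of alpha and gamma concrete, here is a minimal sketch of a single Q-learning update written as a plain function. The name `q_update` and its default values are assumptions chosen for illustration.

```python
def q_update(q_sa, reward, max_q_next, alpha=0.1, gamma=0.9):
    """Return the new Q(s, a) after one Q-learning step.

    q_sa        current estimate Q(s, a)
    reward      reward received for taking action a in state s
    max_q_next  largest Q-value available in the next state
    alpha       learning rate: how far to move towards the target
    gamma       discount rate: how much future value counts
    """
    td_target = reward + gamma * max_q_next
    return q_sa + alpha * (td_target - q_sa)


# alpha = 1: the old value is replaced by the target
print(q_update(0.0, -1, 0.0, alpha=1.0))
# alpha = 0.1: the old value moves 10% of the way towards the target
print(q_update(0.0, -1, 0.0, alpha=0.1))
```

With alpha = 1 the rule reduces to `reward + gamma * max_q_next`, which is the simplest form of the update.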


In [None]:
## import again the libraries and the environment

import gym
import numpy as np
import random
env = gym.make('Taxi-v3', render_mode='ansi')
state = env.reset()

In [None]:
# check the number of states and actions

print("Number of actions: %d" % env.action_space.n)
print("Number of states: %d" % env.observation_space.n)

In [None]:
# create the Q-table (with all entries = zero)
# Q-table is a two-dimensional numpy array (matrix) of shape (500, 6)
# each row = one state
# each of the 6 columns = one of the 6 possible actions

Q = np.zeros([env.observation_space.n, env.action_space.n])
print(Q)

In [None]:
# Now, for example, if we are in state 1 and decide to take action 2 (East), and the Q-value we calculate for
# this state-action pair is -1, we write -1 into Q[1, 2].

# When the agent returns to state 1, it looks up the row Q[1] in the Q-table to get the action values. 
# Action 2 now has the lowest Q-value (-1, while the others are still 0), so the greedy rule np.argmax(Q[1]) picks a different action.
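
The lookup described above can be sketched with a plain Python list standing in for one row of the Q-table. The values are the hypothetical ones from the example, and `greedy_action` is an illustrative helper that mimics the tie-breaking of `np.argmax`.

```python
# Hypothetical Q-values for state 1 after the update described above
q_row = [0.0, 0.0, -1.0, 0.0, 0.0, 0.0]


def greedy_action(row):
    """Index of the largest value; ties go to the lowest index, like np.argmax."""
    return max(range(len(row)), key=lambda a: row[a])


print(greedy_action(q_row))   # 0
```

Because all untried actions still hold 0, the greedy rule keeps returning the first of them and avoids action 2.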

In [None]:
## set hyperparameters
# discount rate

gamma = 0.9               

# initialize reward
reward =0

# initialize environment (initial random state)
state = env.reset()[0]

In [None]:
# the three key steps of the algorithm

# choose the action: index of the maximum value in the <state> row of the Q-table
action = np.argmax(Q[state])  

# take the action and observe the next state and the reward
next_state, reward, done, truncate, info = env.step(action)

# update the Q-table
Q[state,action] = reward + gamma * np.max(Q[next_state])

In [None]:
#############################################

# Define Bellman equation - select action with the current highest Q-value (simple greedy strategy)
# Q[state, action] = reward + gamma * np.max(Q[next_state])

# create update loop

#############################################

while reward != 20: 
    #choose current highest-valued action
    action = np.argmax(Q[state])
    #obtain reward and next state resulting from taking action
    next_state, reward, done, truncate, info = env.step(action)
    #update Q-value for state-action pair
    Q[state, action] = reward + gamma * np.max(Q[next_state])
    #update state
    state = next_state

#render final dropoff state
print(env.render())  ## this is the final state the environment reaches

In [None]:

## The same as before, but also counting the number of steps
## This is important in order to compare with other algorithms / strategies

##########################################################

Q = np.zeros([env.observation_space.n, env.action_space.n])
gamma = 0.1
state = env.reset()[0]
count = 0
reward = 0
while reward != 20:
    action = np.argmax(Q[state])
    next_state, reward, done, truncate, info = env.step(action)
    Q[state, action] = reward + gamma * np.max(Q[next_state])
    state = next_state
    count += 1
    
print(env.render())
print('Counter: {}'.format(count))

In [None]:
### in this example run, counter = 885: the greedy agent took 885 steps before dropping off the passenger at the correct location

### running the loop several times leads to similar results

In [None]:
#################################################

## adding more hyperparameters (alpha and epsilon)

## adding alpha (learning rate) 
# Note that the update rule changes: the new Q-value is a weighted blend of the old value and the Bellman target

################################################

Q = np.zeros([env.observation_space.n, env.action_space.n])
gamma = 0.1
alpha = 0.1
state = env.reset()[0]
count = 0
reward = 0

while reward != 20:
    action = np.argmax(Q[state])
    next_state, reward, done, truncate, info = env.step(action)
    Q[state, action] = Q[state, action] + alpha * (reward + gamma * \
    np.max(Q[next_state]) - Q[state, action])
    state = next_state
    count += 1
    
print(env.render())
print('Counter: {}'.format(count))

In [None]:
###################################################

## adding and updating epsilon
## Now, the agent has the ability to explore new actions it hasn't taken yet and balance out its exploitation of the high-valued actions it's already taken.

## in this case, we only add the epsilon-greedy rule into the update loop
# the Q-table update rule (Bellman eq. ) is the same as before

##################################################

Q = np.zeros([env.observation_space.n, env.action_space.n])
gamma = 0.1
alpha = 0.1
epsilon = 0.1
state = env.reset()[0]
count = 0
reward = 0

while reward != 20:
    if np.random.rand() < epsilon:
        # exploration option
        action = env.action_space.sample()
    else:
        # exploitation option
        action = np.argmax(Q[state])
    next_state, reward, done, truncate, info = env.step(action)
    Q[state, action] = Q[state, action] + alpha * (reward + gamma * \
    np.max(Q[next_state]) - Q[state, action])
    state = next_state
    count += 1
    
    
print(env.render())
print('Counter: {}'.format(count))
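
The loop above keeps epsilon fixed at 0.1. Epsilon decay, listed among the tasks earlier, lowers epsilon as training goes on, so the agent explores a lot at first and exploits more later. The sketch below shows one common schedule (exponential decay with a floor); the function name and the default values are assumptions, not settings from the reference book.

```python
def decayed_epsilon(episode, eps_start=1.0, eps_min=0.01, decay=0.99):
    """Exponentially decay epsilon per episode, never going below eps_min."""
    return max(eps_min, eps_start * decay ** episode)


# Epsilon shrinks with the episode number until it reaches the floor
for episode in (0, 100, 1000):
    print(episode, round(decayed_epsilon(episode), 4))
```

Inside an episode loop, `epsilon = decayed_epsilon(episode)` would be computed once at the start of each episode, before the epsilon-greedy choice.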

### Performance measures and comparing models

- Note that you can test and compare different values for epsilon and alpha as part of the model-tuning process
- Now we need to test the agents' performance and make sure they are improving with respect to speed and accuracy.
- In what follows, we test the performance of our baseline agent (random-acting agent) against our Q-learning agent model and observe what happens when we run the model over more iterations.

In [None]:
#### random-acting agent algorithm
# rename the counter from 'count' to 'epochs', to distinguish a single time step (epoch)
# from a full game loop (episode) that the agent completes:

state = env.reset()
epochs = 0
reward = 0
while reward != 20:
    state, reward, done, truncate, info = env.step(env.action_space.sample())
    epochs += 1
    

print(env.render())
print("Timesteps taken: {}".format(epochs))

In [None]:
### construct an episode loop and run 100 episodes

# We calculate the average number of time steps per episode by dividing the total number of
# time steps (epochs) over all episodes by the number of episodes (game iterations run)


######################################################################

total_epochs = 0
episodes = 100
for episode in range(episodes):
    epochs = 0
    reward = 0
    state = env.reset()
    while reward != 20:
        action = env.action_space.sample()
        state, reward, done, truncate, info = env.step(action)
        epochs += 1  # count every time step inside the while loop
    total_epochs += epochs
    
print("Average timesteps taken: {}".format(total_epochs/episodes))

In [None]:
### now, we do the same for the Q-learning agent
## observe that it needs fewer timesteps on average, so the results improve

##########################################

Q = np.zeros([env.observation_space.n, env.action_space.n])
gamma = 0.1
alpha = 0.1
epsilon = 0.1
total_epochs = 0
episodes = 100


for episode in range(episodes):
    epochs=0
    reward=0
    state = env.reset()[0]  
    while reward != 20:
        if np.random.rand() < epsilon:
            # exploration option
            action = env.action_space.sample()
        else:
            # exploitation option
            action = np.argmax(Q[state])
        next_state, reward, done, truncate, info = env.step(action)
        Q[state, action] = Q[state, action] + alpha * (reward + gamma * \
        np.max(Q[next_state]) - Q[state, action])
        state = next_state
        epochs += 1
    total_epochs +=epochs
    
    
print("Average timesteps taken: {}".format(total_epochs/episodes))

## TensorFlow

- As the number of states in a Q-learning task increases, a simple Q-table is no longer a practical way of modeling the state-action transition function. 
- Instead, we can use a Q-network, which is a type of neural network that is designed to approximate Q-values.

## HW 
### Exercise 1

- Run the random agent for 1000 episodes (or 10000 episodes if you have a good computer, since it takes some time to run). What is your conclusion? Did the results improve when you increased the number of episodes?
- Do the same as before for the Q-learning agent.
- What happens when you increase the number of episodes to 100000 (Q-learning)? Does the agent's performance get better or worse, or does it stay the same?

### Exercise 2
- Tune your Q-learning algorithm by considering different values for alpha and gamma 
    - for example, fix alpha = 0.01 and vary gamma between 0.1 and 0.9
    - for example, fix gamma = 0.1 and vary alpha between 0.1 and 0.9
    - for example, fix alpha = 0.9 and vary gamma between 0.01 and 0.1
    - for example, fix gamma = 0.5 and vary alpha between 0.01 and 0.2
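
One possible way to organise such a comparison is a small sweep helper. This is only a scaffold: `sweep` is an illustrative name, and `run_fn` is a hypothetical function that you would write yourself, for example one that wraps the 100-episode Q-learning loop and returns the average number of timesteps.

```python
from itertools import product


def sweep(run_fn, alphas, gammas):
    """Call run_fn(alpha, gamma) for every combination and collect the results.

    run_fn is supplied by the caller (hypothetical here); the return value
    maps each (alpha, gamma) pair to whatever run_fn returned.
    """
    return {(alpha, gamma): run_fn(alpha, gamma)
            for alpha, gamma in product(alphas, gammas)}
```

For the first case above, the call could look like `sweep(run_fn, alphas=[0.01], gammas=[0.1, 0.3, 0.5, 0.7, 0.9])`, after which the results can be compared pair by pair.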

## Extra - Animation

In [None]:
import random
from time import sleep

import gym
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import clear_output
from matplotlib import animation


In [None]:
"""Initialize and validate the environment"""
env = gym.make("Taxi-v3", render_mode="rgb_array").env
state, _ = env.reset()

# Print dimensions of state and action space
print("State space: {}".format(env.observation_space))
print("Action space: {}".format(env.action_space))

# Sample random action
action = env.action_space.sample(env.action_mask(state))
next_state, reward, done, _, _ = env.step(action)

# Print output
print("State: {}".format(state))
print("Action: {}".format(action))
print("Action mask: {}".format(env.action_mask(state)))
print("Reward: {}".format(reward))

# Render and plot an environment frame
frame = env.render()
plt.imshow(frame)
plt.axis("off")
plt.show()

In [None]:
def run_animation(experience_buffer):
    """Function to run animation"""
    time_lag = 0.05  # Delay (in s) between frames
    for experience in experience_buffer:
        # Plot frame
        clear_output(wait=True)
        plt.imshow(experience['frame'])
        plt.axis('off')
        plt.show()

        # Print console output
        print(f"Episode: {experience['episode']}/{experience_buffer[-1]['episode']}")
        print(f"Epoch: {experience['epoch']}/{experience_buffer[-1]['epoch']}")
        print(f"State: {experience['state']}")
        print(f"Action: {experience['action']}")
        print(f"Reward: {experience['reward']}")
        # Pause animation
        sleep(time_lag)

In [None]:
def store_episode_as_gif(experience_buffer, path='./', filename='animation.gif'):
    """Store episode as gif animation"""
    fps = 5   # Set frames per second
    dpi = 300  # Set dots per inch
    interval = 50  # Interval between frames (in ms)

    # Retrieve frames from experience buffer
    frames = []
    for experience in experience_buffer:
        frames.append(experience['frame'])

    # Fix frame size
    plt.figure(figsize=(frames[0].shape[1] / dpi, frames[0].shape[0] / dpi), dpi=dpi)
    patch = plt.imshow(frames[0])
    plt.axis('off')

    # Generate animation
    def animate(i):
        patch.set_data(frames[i])

    anim = animation.FuncAnimation(plt.gcf(), animate, frames=len(frames), interval=interval)

    # Save output as gif
    anim.save(path + filename, writer='imagemagick', fps=fps)

In [None]:
"""Simulation with random agent"""
epoch = 0
num_failed_dropoffs = 0
experience_buffer = []
cum_reward = 0

done = False

state, _ = env.reset()

while not done:
    # Sample random action
    # Action selection without action mask
    action = env.action_space.sample()

    # Action selection with action mask
    #action = env.action_space.sample(env.action_mask(state))

    state, reward, done, _, _ = env.step(action)
    cum_reward += reward

    # Store experience in dictionary
    experience_buffer.append({
        "frame": env.render(),
        "episode": 1,
        "epoch": epoch,
        "state": state,
        "action": action,
        "reward": cum_reward,
        }
    )

    if reward == -10:
        num_failed_dropoffs += 1

    epoch += 1

# Run animation and print console output
run_animation(experience_buffer)

print("# epochs: {}".format(epoch))
print("# failed drop-offs: {}".format(num_failed_dropoffs))