# Q-Learning in Python

This notebook implements Q-Learning for tabular environments.

**Lab exercise created by [Víctor Campos](https://uk.linkedin.com/in/victor-campos-camunez), and adapted by [Xavier Giro-i-Nieto](https://imatge.upc.edu/web/people/xavier-giro) for the [Postgraduate course in Artificial Intelligence with Deep Learning](https://www.talent.upc.edu/ing/estudis/formacio/curs/310400/postgrau-artificial-intelligence-deep-learning/) in [UPC School](https://www.talent.upc.edu/ing/) (2020).**

![Víctor Campos](https://scholar.googleusercontent.com/citations?view_op=view_photo&user=8fzVqSkAAAAJ&citpid=2)
![Xavier Giro-i-Nieto](https://scholar.googleusercontent.com/citations?view_op=view_photo&user=M3ZUEc8AAAAJ&citpid=9)

## Import dependencies

We will use OpenAI Gym to simulate the environment and NumPy to perform computations.

In [None]:
!pip install gym[toy_text] wandb --quiet

In [None]:
%%capture
!pip install pyglet==1.5.1
!apt install python-opengl
!apt install ffmpeg
!apt install xvfb
!pip3 install pyvirtualdisplay

In [None]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(900, 900))
virtual_display.start()

In [None]:
import os
import gym
import glob
import wandb
import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

In [None]:
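def get_video_filename(video_dir="video"):
  """Return the path of the most recently modified .mp4 file in `video_dir`."""
  mp4list = glob.glob(os.path.join(video_dir, "*.mp4"))
  assert len(mp4list) > 0, "couldn't find video files"
  # glob returns files in arbitrary order, so pick the newest one explicitly
  return max(mp4list, key=os.path.getmtime)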

## Visualize the environment

We will train an agent in the [FrozenLake-v1](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/) environment. Read the documentation and make sure that you understand the structure of the problem before moving forward.
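
To make the structure of the problem concrete, the following minimal sketch decodes an integer observation into a grid cell. It assumes the default 4x4 map and the `row * ncols + col` state encoding described in the linked documentation; the names `DEFAULT_MAP` and `describe_state` are illustrative and not part of Gym.

```python
# Default 4x4 layout from the FrozenLake documentation:
# S = start, F = frozen (safe), H = hole, G = goal.
DEFAULT_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]


def describe_state(state, grid=DEFAULT_MAP):
    """Decode an integer observation into (row, col, tile).

    Assumes the documented encoding: state = row * ncols + col.
    """
    ncols = len(grid[0])
    row, col = divmod(state, ncols)
    return row, col, grid[row][col]


# Inspect the start state, one of the holes, and the goal
for s in (0, 5, 15):
    print(s, describe_state(s))
```

Under these assumptions the agent starts in the top-left corner and only receives a reward when it reaches the goal tile in the bottom-right corner.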

In [None]:
ACTIONS = {
    0: "LEFT",
    1: "DOWN",
    2: "RIGHT",
    3: "UP",
}

**Exercise #1.** Visualize a rollout of a random agent in the environment. Use the [documentation](https://gymnasium.farama.org/) for OpenAI Gym as a reference.

In [None]:
# Create an instance of the environment
env = gym.make("FrozenLake-v1", render_mode="rgb_array")
env = gym.wrappers.RecordVideo(env, "./video")

# TODO: reset the environment
...

done = False
total_rew = 0

# Allow a maximum of 10 interactions
for t in range(10):
  print("\nTimestep {}".format(t))
  env.render()

  # TODO: sample a random action
  action = ...
  
  # TODO: simulate the action in the environment (make sure to capture the 'done' signal)
  ...

  
  print(f"Action: {ACTIONS[action]}, Observation: {ob}, Reward: {r}")

  # Exit if the episode terminated
  if done:
    print("\nEpisode terminated early")
    break

In [None]:
wandb.login()

In [None]:
wandb.init(project="FROZEN_LAKE_V1")
wandb.run.name = 'frozenlake_random_agent'
mp4 = get_video_filename()
wandb.log({"Video eval": wandb.Video(mp4, fps=4, format="mp4")})
wandb.finish()

## Tabular Q-Learning Agent

We will now implement an agent that performs Q-Learning in the tabular setting. It will maintain a table of Q(s,a) values, with shape `num_states x num_actions`, that will be updated online with the stream of experience collected by interacting with the environment.

Exploration is critical in RL. The agent needs to continuously try actions that might seem suboptimal given its current beliefs in order to avoid getting trapped in local optima. We will use $\epsilon$-greedy exploration for this purpose: a strategy that will sample random actions with probability $\epsilon$, or will act greedily otherwise. We will decay $\epsilon$ through the course of training, starting with a high value that favors exploration and slowly transitioning towards a more greedy policy.
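
As a rough illustration of the decay schedule, the sketch below computes the value of $\epsilon$ after a given number of steps under a linear decay that is clipped at a final value. The function name `linear_epsilon` is hypothetical, and its default arguments simply mirror the defaults of `QLearningAgent`.

```python
def linear_epsilon(step, init_eps=1.0, final_eps=0.05, decay_steps=50000):
    """Epsilon after `step` updates of a linear decay, clipped at final_eps."""
    # Per-step increment; negative because epsilon decreases over time
    delta = (final_eps - init_eps) / decay_steps
    return max(init_eps + step * delta, final_eps)


# Epsilon at the start, halfway through, at the end of the decay, and beyond
for step in (0, 25000, 50000, 100000):
    print(step, round(linear_epsilon(step), 3))
```

Halfway through the decay the agent still takes a random action roughly half of the time, and after `decay_steps` steps the exploration rate stays at `final_eps`.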

**Exercise #2.** Implement `QLearningAgent.greedy_action()`, a function that returns $\text{argmax}_aQ(s,a)$.

**Exercise #3.** Implement `QLearningAgent.eps_greedy_action()`, a function that returns a random action with probability $\epsilon$ or $\text{argmax}_aQ(s,a)$ otherwise.

**Exercise #4.** Implement `QLearningAgent.update_q_values()`, a function that receives a tuple $(s_t,a_t,r_t,s_{t+1},d)$ and performs a TD update to the table of Q values. Pay special attention to the computation of the TD target for the last step of an episode (when `done==True`).
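
For reference, the standard Q-learning update can be written as follows. The symbols $\alpha$ (the step size) and $y_t$ (the TD target) are notation introduced here, while $\gamma$ is the discount factor:

$$
y_t =
\begin{cases}
r_t & \text{if } d = 1 \\
r_t + \gamma \max_{a'} Q(s_{t+1}, a') & \text{otherwise}
\end{cases}
\qquad
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( y_t - Q(s_t, a_t) \right)
$$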

In [None]:
class QLearningAgent:
  """Tabular Q-Learning Agent with epsilon-greedy exploration."""
  def __init__(self, env_id, step_size=0.5, gamma=0.99,
               init_eps=1.0, final_eps=0.05, eps_decay_steps=50000):
    # Use separate env instances for training and testing
    self.train_env = gym.make(env_id, render_mode="ansi")
    self.test_env = gym.make(env_id, render_mode="ansi")

    # Step size (this plays a similar role to the learning rate in SGD)
    self.step_size = step_size

    # Discount factor
    self.gamma = gamma

    # Epsilon, for epsilon-greedy exploration
    self.eps = init_eps
    self.init_eps = init_eps
    self.final_eps = final_eps
    self.eps_decay_steps = eps_decay_steps
    self.eps_delta = (self.final_eps - self.init_eps) / self.eps_decay_steps

    # Table of Q-values, initialized to zero
    self.q = np.zeros(
        (self.train_env.observation_space.n, self.train_env.action_space.n))
    
    # Keep track of the current state of the training env
    # (assumes gym>=0.26, where reset() returns an (observation, info) tuple)
    self.s, _ = self.train_env.reset()

  def update_eps(self):
    """Update the value of epsilon, ensuring that self.eps>=self.final_eps."""
    self.eps = max(self.eps + self.eps_delta, self.final_eps)

  def greedy_action(self, s):
    # TODO: Returns argmax_a Q(s,a)
    ...

  def eps_greedy_action(self, s):
    # TODO: Returns random action with prob self.eps, or greedy action otherwise.
    ...

  def update_q_values(self, s, a, r, next_s, done):
    # TODO: Given a transition (s, a, r, s', done), perform a TD update to Q(s,a).
    ...

  def perform_train_step(self):
    """Performs one RL interaction and updates the Q-values."""
    # Act epsilon-greedily
    a = self.eps_greedy_action(self.s)
    # gym>=0.26 returns separate 'terminated' and 'truncated' flags
    next_s, r, terminated, truncated, _ = self.train_env.step(a)

    # Update table of Q values; only true termination should stop bootstrapping
    self.update_q_values(self.s, a, r, next_s, terminated)

    # Reset the env if the episode ended (terminal state or time limit)
    if terminated or truncated:
      self.s, _ = self.train_env.reset()
    else:
      self.s = next_s

    # Update epsilon
    self.update_eps()

  def test(self, render=False):
    """Perform an evaluation rollout with the greedy policy.
    Returns the cumulative reward."""
    s, _ = self.test_env.reset()
    done = False
    cumulative_r = 0.
    while not done:
      if render:
        self.test_env.render()
      s, r, terminated, truncated, _ = self.test_env.step(self.greedy_action(s))
      done = terminated or truncated
      cumulative_r += r
    return cumulative_r

## Training loop

We are now ready to train the agent. We will track the performance of the agent by performing evaluation rollouts periodically. In order to account for the stochasticity of the environment, the mean over several evaluation episodes is reported.
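
To see why a single evaluation episode is not enough, the sketch below estimates the mean return and its standard error from a list of episode returns. The helper `mean_and_stderr` and the simulated 70% success rate are assumptions made purely for illustration; they are not measurements from this notebook.

```python
import math
import random


def mean_and_stderr(returns):
    """Sample mean and standard error of a list of episode returns."""
    n = len(returns)
    mean = sum(returns) / n
    # Unbiased sample variance; requires at least two episodes
    var = sum((x - mean) ** 2 for x in returns) / (n - 1)
    return mean, math.sqrt(var / n)


# Hypothetical policy that reaches the goal in 70% of episodes
# (return 1.0 on success, 0.0 otherwise).
rng = random.Random(0)
for n in (20, 200):
    returns = [1.0 if rng.random() < 0.7 else 0.0 for _ in range(n)]
    mean, stderr = mean_and_stderr(returns)
    print(f"{n} episodes: mean={mean:.2f} +/- {stderr:.2f}")
```

The standard error shrinks with the square root of the number of episodes, so averaging over more evaluation rollouts gives a smoother learning curve at the cost of extra computation.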

**Exercise #5.** Train the agent with different hyperparameter configurations. Which ones have the largest influence on the results?

In [None]:
NUM_TRAINING_STEPS = 100000
EVALUATION_FREQ = 100
NUM_EVALUATION_EPISODES = 20

agent = QLearningAgent("FrozenLake-v1",
                       step_size=0.05, 
                       gamma=0.99,
                       init_eps=1.0, 
                       final_eps=0.1, 
                       eps_decay_steps=NUM_TRAINING_STEPS)

iter_history, rew_history = [], []
for iter_idx in tqdm(range(NUM_TRAINING_STEPS)):
  agent.perform_train_step()
  if iter_idx % EVALUATION_FREQ == 0 or iter_idx == (NUM_TRAINING_STEPS - 1):
    rew = np.mean([agent.test() for _ in range(NUM_EVALUATION_EPISODES)])
    iter_history.append(iter_idx + 1)
    rew_history.append(rew)

# Plot results
fig, ax = plt.subplots(1, 1, figsize=(9,4))
ax.plot(iter_history, rew_history, label="Agent's reward")
ax.plot(iter_history, 
        [agent.train_env.spec.reward_threshold] * len(iter_history),
        'r--', label="Reward threshold")
ax.set_xlabel("Environment steps")
ax.set_ylabel("Reward")
_ = ax.legend()

## Visualizing the learned policy

In [None]:
virtual_display = Display(visible=0, size=(900, 900))
virtual_display.start()

In [None]:
agent.test_env = gym.wrappers.RecordVideo(gym.make("FrozenLake-v1", render_mode="rgb_array"), "./video")

In [None]:
agent.test()

In [None]:
wandb.init(project="FROZEN_LAKE_V1")
wandb.run.name = 'frozenlake_final_agent'
mp4 = get_video_filename()
wandb.log({"Video eval": wandb.Video(mp4, fps=4, format="mp4")})
wandb.finish()