Q-Learning
--

1. Load the environment and other components for RL from “OpenAI Gym” library for "Self Driving Cab" optimized decision making task.
2. Apply Q-Learning algorithm to do optimization of "Self Driving Cab" task. Obtain the performance metrics
3. Compare the performance with and without Q-Learning algorithm. 
4. Give inference. 



---------

**REWARDS:**
- The agent should receive a high positive reward for a successful dropoff because this behavior is highly desired. 
- The agent should be penalized if it tries to drop off a passenger in wrong locations. 
- The agent should get a slight negative reward for not making it to the destination after every time-step. "Slight" negative because we would prefer our agent to reach late instead of making wrong moves trying to reach to the destination as fast as possible.
    

In [2]:
!pip install cmake 'gym[atari]' scipy

Defaulting to user installation because normal site-packages is not writeable
Collecting cmake
  Downloading cmake-3.31.0.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.4 kB)
Collecting gym[atari]
  Downloading gym-0.26.2.tar.gz (721 kB)
[2K     [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m721.7/721.7 kB[0m [31m928.0 kB/s[0m eta [36m0:00:00[0mm eta [36m0:00:01[0m0:01[0m:01[0m
[?25h  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
Collecting cloudpickle>=1.2.0 (from gym[atari])
  Downloading cloudpickle-3.1.0-py3-none-any.whl.metadata (7.0 kB)
Collecting gym-notices>=0.0.4 (from gym[atari])
  Downloading gym_notices-0.0.8-py3-none-any.whl.metadata (1.0 kB)
Collecting ale-py~=0.8.0 (from gym[atari])
  Downloading ale_py-0.8.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)
Downloading cmake-3.31

In [2]:
import gym
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output
import time

In [3]:
env = gym.make("Taxi-v3", render_mode="ansi")  #render mode: human, ansi, rgb_array
#reset the environment and render it
state = env.reset()
#render the env as ansi text with rgb
ansi_output = env.render()
print(ansi_output)

+---------+
|R: | : :G|
| : | : : |
| :[43m [0m: : : |
| | : | : |
|[34;1mY[0m| : |[35mB[0m: |
+---------+




Problem statement from GYM documents:

""There are 4 locations (labeled by different letters), and our job is to pick up the passenger at one location and drop him off at another. We receive +20 points for a successful drop-off and lose 1 point for every time-step it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions."

In [4]:
env.reset() # reset environment to a new, random state
env.render()

print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))

Action Space Discrete(6)
State Space Discrete(500)


The filled square represents the taxi, which is yellow without a passenger and green with a passenger.
The pipe ("|") represents a wall which the taxi cannot cross.
R, G, Y, B are the possible pickup and destination locations. The blue letter represents the current passenger pick-up location, and the purple letter is the current destination.

- 0 = south
- 1 = north
- 2 = east
- 3 = west
- 4 = pickup
- 5 = dropoff

In [22]:
state = env.encode(3, 1, 2, 0) # (taxi row, taxi column, passenger index, destination index)
print("State:", state)

env.s = state
print(env.render())

State: 328
+---------+
|[35mR[0m: | : :[34;1mG[0m|
| : | : : |
| :[43m [0m: : : |
| | : | : |
|Y| : |B: |
+---------+




In [23]:
#REWARD TABLE
env.P[328]


{0: [(1.0, 428, -1, False)],
 1: [(1.0, 228, -1, False)],
 2: [(1.0, 348, -1, False)],
 3: [(1.0, 328, -1, False)],
 4: [(1.0, 328, -10, False)],
 5: [(1.0, 328, -10, False)]}

In [37]:
#Implementing QLearning
#Initializing Q-table
q_table = np.zeros([env.observation_space.n, env.action_space.n])

In [39]:
# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

In [40]:
# Training parameters
num_episodes = 1000
all_epochs = []
all_penalties = []

# Training loop
for episode in range(num_episodes):
    state = env.reset()[0]  # Get initial state
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        # Exploration vs. exploitation
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore action space
        else:
            action = np.argmax(q_table[state])  # Exploit learned values

        next_state, reward, done, _, _ = env.step(action)  # Take the action
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        # Update Q-value for current state-action pair
        q_table[state, action] = old_value + alpha * (reward + gamma * next_max - old_value)

        # Track penalties
        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1

    all_epochs.append(epochs)
    all_penalties.append(penalties)

print("Training finished.\n")


Training finished.



In [41]:
q_table[328]

array([-2.33546881, -2.32464586, -2.32769383, -2.33001073, -6.13422566,
       -4.40788394])

In [1]:
# Evaluate agent's performance after Q-learning
total_epochs, total_penalties = 0, 0
episodes = 10

for _ in range(episodes):
    state = env.reset()[0]  # Extract the state value from the reset output
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        action = np.argmax(q_table[state])  # Choose the best action based on Q-table
        next_state, reward, done, _, _ = env.step(action)  # Take action
        
        if reward == -10:
            penalties += 1

        state = next_state  # Update the state
        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print(f"Results after {episodes} episodes with Q-Learning:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")


Unexpected exception formatting exception. Falling back to standard exception


Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_47208/3641414001.py", line 6, in <module>
    state = env.reset()[0]  # Extract the state value from the reset output
NameError: name 'env' is not defined

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py", line 2105, in showtraceback
    stb = self.InteractiveTB.structured_traceback(
  File "/usr/local/lib/python3.8/dist-packages/IPython/core/ultratb.py", line 1396, in structured_traceback
    return FormattedTB.structured_traceback(
  File "/usr/local/lib/python3.8/dist-packages/IPython/core/ultratb.py", line 1287, in structured_traceback
    return VerboseTB.structured_traceback(
  File "/usr/local/lib/python3.8/dist-packages/IPython/cor