### Q-Learning

### University of Virginia
### Reinforcement Learning
#### Last updated: February 4, 2025

---


### SOURCES

- Reinforcement Learning, RS Sutton & AG Barto, 2nd edition. Chapter 6
- Mastering Reinforcement Learning with Python, Enes Bilgin. Chapter 5

### LEARNING OUTCOMES

- Explain how Q-Learning works and how it learns off policy
- Use Q-Learning to compute value functions  
- Perform sensitivity analysis on a Q-Learning algorithm
- Check for algorithm convergence

### CONCEPTS

- Q-Learning to act off policy
- The Q-Learning algorithm

---  

### I. Q-table

Recall the big picture of what we're trying to do:  
Given state space $S$ and action space $A$, learn the action values $Q(S,A)$.  
These values are organized in an array called the *Q-table*, with one entry per (state, action) pair.

Q-Learning is a method for building this table.

We initialize the table (for example with zeros, or with random values and zeros in the terminal state) and then use TD(0) updates for training.

<img src="https://github.com/tylergorecki/reinforcement_learning/blob/main/04_q_learning/Q-Learning_Matrix_Initialized_and_After_Training.png?raw=1">
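
As a minimal sketch of the two initializations mentioned above, assuming the Q-table is a NumPy array with one row per state and one column per action. The sizes and the terminal row index are illustrative assumptions that happen to match the example later in these notes.

```python
import numpy as np

num_states, num_actions = 4, 5   # illustrative sizes (assumption)
terminal_state = 3               # assumed index of the terminal state

# Option 1: all zeros
Q_zeros = np.zeros((num_states, num_actions))

# Option 2: random values, with zeros in the terminal row
rng = np.random.default_rng(0)
Q_random = rng.normal(size=(num_states, num_actions))
Q_random[terminal_state, :] = 0.0  # no action is taken from the terminal state

print(Q_zeros.shape)
print(Q_random[terminal_state])
```

Either table can then be refined with the update rule described in the next section.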

### II. Q-Learning

Q-Learning is an **off-policy TD control algorithm** that was an early breakthrough in RL.

Quick reminder of what off-policy means:

We want accurate action-value estimates for the best policy, but improving those estimates requires exploring. These two goals are at odds.

Consider: You're looking for a faster route to work. If you try different routes, some will be slower.  
These slower routes shouldn't factor into the timing of the optimal route. You separate optimal route timing from exploration.

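We do this by maintaining two policies:
- a behavior policy, which chooses the actions the agent actually takes (and therefore explores)
- a target policy, which is the policy being evaluated and improved toward optimality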
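
As a minimal sketch of this separation, assuming an epsilon-greedy behavior policy and a greedy target policy. The function names and the action values in `q_row` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def behavior_policy(action_values, epsilon):
    """Epsilon-greedy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(len(action_values)))
    return int(np.argmax(action_values))

def target_policy(action_values):
    """Greedy: always pick the action with the highest estimated value."""
    return int(np.argmax(action_values))

q_row = np.array([1.0, 5.0, 2.0])  # hypothetical action values for one state
behavior_actions = [behavior_policy(q_row, epsilon=0.3) for _ in range(1000)]

print(target_policy(q_row))            # → 1 (the greedy action)
print(sorted(set(behavior_actions)))   # actions the behavior policy actually tried
```

The target policy never explores, while the behavior policy occasionally tries the other actions.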

Now we show the update equation for improving the estimate of $q_\pi(s,a)$, where $\pi$ is the target policy.  
It is very similar to the TD(0) update step for the state value.

Since we will use sample data, $Q$ will denote estimates of $q_\pi$.

$Q(s,a) := Q(s,a) + \alpha [r + \gamma \underset{a}{\operatorname{\max}} Q(s',a) -  Q(s,a)]$

Explaining the different components:

<img src="https://github.com/tylergorecki/reinforcement_learning/blob/main/04_q_learning/q_learning_update.png?raw=1">

An important difference is the $\underset{a}{\operatorname{\max}} Q(s',a)$ term where you might have expected $Q(s',a)$  

The agent computes the most valuable action and uses this in updating.

However, the agent may not actually take this action once it reaches $S_{t+1}=s'$: the behavior policy may choose a different $A_{t+1}$.

This is what it means to act off policy: the target policy is separated from the behavior policy.
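
To make the role of the max term concrete, here is a single hand-sized update, offered as a sketch under made-up numbers. The 2-state table, the transition, and the next action are all hypothetical. The on-policy target shown for contrast is the SARSA-style target, which uses the action actually taken next.

```python
import numpy as np

alpha, gamma = 0.1, 0.99            # illustrative step size and discount
Q = np.array([[0.0, 0.0],
              [1.0, 3.0]])          # hypothetical 2-state, 2-action table

s, a, r, s_next = 0, 1, 2.0, 1      # one observed transition (assumed)
a_next = 0                          # action the behavior policy happens to take next

# Q-Learning target: uses the best next action, whatever is actually taken
q_learning_target = r + gamma * np.max(Q[s_next])
# On-policy (SARSA-style) target: uses the action actually taken next
sarsa_target = r + gamma * Q[s_next, a_next]

# Apply the Q-Learning update to the visited (state, action) entry
Q[s, a] += alpha * (q_learning_target - Q[s, a])

print(q_learning_target, sarsa_target, Q[s, a])
```

The two targets differ whenever the behavior policy takes a non-greedy next action, which is exactly the off-policy separation described above.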

---

**Septic Shock**

Next, let's look at a computational example. The objective is to reduce the chance of septic shock, measured by the proxy SOFA score, by using a drug called a vasopressor. The values are for illustration only. Following the code is a series of exercises that we will work through.

Background:  
- **Septic shock**: a life-threatening condition in which blood pressure drops to a dangerously low level after an infection.
- **Sequential Organ Failure Assessment (SOFA) score**: a scoring system that assesses the performance of several organ systems in the body. We will use this to measure state. Higher is more dangerous.
- **Vasopressor (vaso)**: a drug that healthcare providers use to constrict blood vessels, raising blood pressure in patients with low blood pressure.

```python
import random

import numpy as np

np.set_printoptions(formatter={'float': '{: 0.3f}'.format})

# Initialize states, actions, Q function

# states: SOFA levels, where 3 is the terminal (worst) state
sofa_levels = [0, 1, 2, 3]
num_states = len(sofa_levels)
terminal_state = 3

# actions: vasopressor dose levels
vaso_doses = [0, 1, 2, 3, 4]
num_actions = len(vaso_doses)

# initialize array to store action values Q
Q = np.random.normal(size=(num_states, num_actions))
Q[terminal_state, :] = 0  # no action taken from terminal state, so no value


def act(epsilon, action_values):
    '''
    epsilon-greedy policy: random action with probability epsilon,
    otherwise the action with the highest estimated value
    '''
    action_size = len(action_values)
    if np.random.rand() <= epsilon:  # random draw with prob epsilon
        return random.randrange(action_size)
    return np.argmax(action_values)  # greedy action


def calc_reward(state):
    '''
    simple reward function for illustration. lower state value is better
    '''
    if state == 3:
        reward = -100
    elif state == 2:
        reward = -10
    elif state == 1:
        reward = 0
    else:
        reward = 10
    return reward


def determine_next_state(state, action):
    '''
    return next state from the environment
    to be replaced with simulated data or alternative
    '''
    if state in [0, 1, 2] and action == 0:  # no dose raises state
        next_state = min(terminal_state, state + 1)
    elif action in [3, 4]:  # higher doses lower state (floored at zero)
        next_state = max(0, state - 1)
    else:  # doses 1 and 2 lead to state 1 or 2 at random
        next_state = random.choice([1, 2])
    return next_state


# Run the Process
num_episodes = 500
max_timesteps = 100
epsilon = 0.1
alpha = 0.1  # weight on new data
gamma = 0.99  # discount factor
verbose = False

for ep in range(num_episodes):
    if ep % 10 == 0:
        print('episode', ep + 1)
    sofa_level = 0  # initialize state
    for tm in range(max_timesteps):

        # given state, get action from the epsilon-greedy behavior policy
        vaso_dose = act(epsilon, Q[sofa_level, :])

        next_sofa = determine_next_state(sofa_level, vaso_dose)
        reward = calc_reward(next_sofa)
        done = next_sofa == terminal_state
        transition = (sofa_level, vaso_dose, reward, next_sofa, done)

        if verbose:
            print(transition)

        # update Q(S,A) using TD(0)
        # Q(S,A) = Q(S,A) + alpha (r + gamma * max_a Q(S',a) - Q(S,A))
        td_target = reward + gamma * np.max(Q[next_sofa, :])
        Q[sofa_level, vaso_dose] += alpha * (td_target - Q[sofa_level, vaso_dose])

        sofa_level = next_sofa  # update sofa for next iteration

        # stop the episode once the terminal state is reached
        if done:
            break
    if ep % 10 == 0:
        print('Q \n', Q)
```


episode 1
Q 
 [[ 0.899 -0.841  0.124  0.115  73.654]
 [-1.387 -2.533 -0.223 -0.498  21.655]
 [-0.308 -0.561 -0.884  3.142 -1.468]
 [ 0.000  0.000  0.000  0.000  0.000]]
episode 11
Q 
 [[ 177.852  106.359  256.306  333.157  613.237]
 [-1.387  33.445 -0.223 -0.498  513.998]
 [-0.308  0.402 -0.884  288.677 -1.468]
 [ 0.000  0.000  0.000  0.000  0.000]]
episode 21
Q 
 [[ 591.598  483.616  666.731  688.652  833.767]
 [ 107.298  33.445  60.161 -0.498  808.378]
 [-0.308  63.157 -0.884  673.190  71.531]
 [ 0.000  0.000  0.000  0.000  0.000]]
episode 31
Q 
 [[ 833.754  735.167  823.742  861.186  928.481]
 [ 175.198  114.271  143.179 -0.498  916.893]
 [-0.308  146.893  77.833  851.115  71.531]
 [ 0.000  0.000  0.000  0.000  0.000]]
episode 41
Q 
 [[ 939.741  879.561  906.340  946.215  968.261]
 [ 366.542  196.750  223.344  92.763  965.345]
 [-10.277  146.893  231.209  937.975  236.161]
 [ 0.000  0.000  0.000  0.000  0.000]]
episode 51
Q 
 [[ 963.081  938.100  949.304  970.955  986.426]
 [ 423.84

**Exercise 1**

If the agent is in state 0, what is the most valuable action? What is the least valuable action? Enter your final Q estimate here.

In state 0, the most valuable action is a dose of 3 or 4 (both estimated at 1000.000 in the final Q below). The least valuable action in this run is a dose of 1 (982.315), with a dose of 2 close behind (985.188).

```python
Q
```

array([[ 990.000,  982.315,  985.188,  1000.000,  1000.000],
       [ 967.268,  976.012,  979.938,  992.796,  1000.000],
       [-74.660,  841.440,  853.276,  990.000,  887.668],
       [ 0.000,  0.000,  0.000,  0.000,  0.000]])
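
The limiting values in this table can be checked by hand, assuming the toy rewards and dynamics defined in the code above (reward 10 for landing in state 0; doses 3 and 4 lower the state by one, floored at zero). In state 0, a dose of 3 or 4 keeps the patient in state 0 and earns 10 at every step, so the optimal value is a geometric series:

$Q^*(0,3) = Q^*(0,4) = \sum_{t=0}^{\infty} \gamma^t \cdot 10 = \frac{10}{1-\gamma} = \frac{10}{1-0.99} = 1000$

From state 1, a dose of 3 or 4 returns to state 0 with reward 10, so $\max_a Q^*(1,a) = 10 + 0.99 \times 1000 = 1000$. Giving no dose in state 0 moves to state 1 with reward 0, which gives:

$Q^*(0,0) = 0 + \gamma \max_a Q^*(1,a) = 0.99 \times 1000 = 990$

Both values agree with the first row of the final estimates.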

**Exercise 2**

How do your answers change with different $\alpha$? different $\epsilon$? Enter your final Q estimates here.

```python
import random

import numpy as np

np.set_printoptions(formatter={'float': '{: 0.3f}'.format})

# Initialize states, actions, Q function

# states: SOFA levels, where 3 is the terminal (worst) state
sofa_levels = [0, 1, 2, 3]
num_states = len(sofa_levels)
terminal_state = 3

# actions: vasopressor dose levels
vaso_doses = [0, 1, 2, 3, 4]
num_actions = len(vaso_doses)

# initialize array to store action values Q
Q = np.random.normal(size=(num_states, num_actions))
Q[terminal_state, :] = 0  # no action taken from terminal state, so no value


def act(epsilon, action_values):
    '''
    epsilon-greedy policy: random action with probability epsilon,
    otherwise the action with the highest estimated value
    '''
    action_size = len(action_values)
    if np.random.rand() <= epsilon:  # random draw with prob epsilon
        return random.randrange(action_size)
    return np.argmax(action_values)  # greedy action


def calc_reward(state):
    '''
    simple reward function for illustration. lower state value is better
    '''
    if state == 3:
        reward = -100
    elif state == 2:
        reward = -10
    elif state == 1:
        reward = 0
    else:
        reward = 10
    return reward


def determine_next_state(state, action):
    '''
    return next state from the environment
    to be replaced with simulated data or alternative
    '''
    if state in [0, 1, 2] and action == 0:  # no dose raises state
        next_state = min(terminal_state, state + 1)
    elif action in [3, 4]:  # higher doses lower state (floored at zero)
        next_state = max(0, state - 1)
    else:  # doses 1 and 2 lead to state 1 or 2 at random
        next_state = random.choice([1, 2])
    return next_state


# Run the Process (larger epsilon and alpha than in Exercise 1)
num_episodes = 500
max_timesteps = 100
epsilon = 0.3
alpha = 0.3  # weight on new data
gamma = 0.99  # discount factor
verbose = False

for ep in range(num_episodes):
    if ep % 10 == 0:
        print('episode', ep + 1)
    sofa_level = 0  # initialize state
    for tm in range(max_timesteps):

        # given state, get action from the epsilon-greedy behavior policy
        vaso_dose = act(epsilon, Q[sofa_level, :])

        next_sofa = determine_next_state(sofa_level, vaso_dose)
        reward = calc_reward(next_sofa)
        done = next_sofa == terminal_state
        transition = (sofa_level, vaso_dose, reward, next_sofa, done)

        if verbose:
            print(transition)

        # update Q(S,A) using TD(0)
        # Q(S,A) = Q(S,A) + alpha (r + gamma * max_a Q(S',a) - Q(S,A))
        td_target = reward + gamma * np.max(Q[next_sofa, :])
        Q[sofa_level, vaso_dose] += alpha * (td_target - Q[sofa_level, vaso_dose])

        sofa_level = next_sofa  # update sofa for next iteration

        # stop the episode once the terminal state is reached
        if done:
            break
    if ep % 10 == 0:
        print('Q \n', Q)
```


episode 1
Q 
 [[ 1.704  0.608  0.520 -0.870  1.627]
 [-2.324  1.354 -1.146 -1.972 -0.195]
 [-30.204 -2.390 -0.416 -0.746 -0.750]
 [ 0.000  0.000  0.000  0.000  0.000]]
episode 11
Q 
 [[ 654.053  591.254  621.730  625.148  692.543]
 [ 347.188  544.763  499.109  370.976  685.833]
 [-88.269  633.767  400.935  287.360  220.646]
 [ 0.000  0.000  0.000  0.000  0.000]]
episode 21
Q 
 [[ 882.578  856.642  853.482  892.148  907.840]
 [ 783.707  835.078  768.136  672.349  906.696]
 [-97.183  837.732  734.463  683.807  786.023]
 [ 0.000  0.000  0.000  0.000  0.000]]
episode 31
Q 
 [[ 964.936  957.260  950.200  975.106  977.816]
 [ 884.291  916.693  951.505  851.406  976.782]
 [-99.034  924.382  817.910  924.367  961.765]
 [ 0.000  0.000  0.000  0.000  0.000]]
episode 41
Q 
 [[ 982.222  974.199  973.744  993.292  994.628]
 [ 958.706  959.443  959.973  956.835  994.444]
 [-99.838  962.234  936.924  924.367  983.705]
 [ 0.000  0.000  0.000  0.000  0.000]]
episode 51
Q 
 [[ 988.342  976.865  975.797 

```python
Q
```

array([[ 990.000,  979.321,  984.202,  1000.000,  1000.000],
       [ 970.100,  971.643,  981.977,  1000.000,  1000.000],
       [-100.000,  988.204,  986.504,  990.000,  990.000],
       [ 0.000,  0.000,  0.000,  0.000,  0.000]])

With alpha and epsilon raised from 0.1 to 0.3, the optimal dose in state 0 is still either 3 or 4, and the least valuable dose in this run is 1 (979.321), again close to dose 2 (984.202). The estimates also converge faster.
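
One way to organize this kind of sensitivity analysis is to wrap training in a function and sweep the hyperparameters. The sketch below is a rewrite under stated assumptions, not the notebook's code: `run_q_learning`, `next_state`, and `REWARDS` are hypothetical names, the dynamics are copied from `determine_next_state` and `calc_reward`, and a seeded NumPy generator replaces the mixed `random` and `np.random` calls so that runs are repeatable.

```python
import numpy as np

REWARDS = {0: 10, 1: 0, 2: -10, 3: -100}  # reward by next state, as in the notes

def next_state(state, action, rng):
    """Toy dynamics mirroring determine_next_state in the notes."""
    if action == 0:                  # no dose raises the state
        return min(3, state + 1)
    if action in (3, 4):             # higher doses lower the state
        return max(0, state - 1)
    return int(rng.integers(1, 3))   # doses 1 and 2: state 1 or 2 at random

def run_q_learning(alpha, epsilon, gamma=0.99, num_episodes=500,
                   max_timesteps=100, seed=0):
    """Train a Q-table with tabular Q-Learning and return it."""
    rng = np.random.default_rng(seed)
    Q = rng.normal(size=(4, 5))
    Q[3, :] = 0.0                    # terminal state has no value
    for _ in range(num_episodes):
        state = 0
        for _ in range(max_timesteps):
            if rng.random() <= epsilon:
                action = int(rng.integers(5))
            else:
                action = int(np.argmax(Q[state]))
            s_next = next_state(state, action, rng)
            target = REWARDS[s_next] + gamma * np.max(Q[s_next])
            Q[state, action] += alpha * (target - Q[state, action])
            state = s_next
            if state == 3:
                break
    return Q

# Sweep a small grid and compare the state-0 row of each final table
for alpha in (0.1, 0.3):
    for epsilon in (0.1, 0.3):
        Q_final = run_q_learning(alpha, epsilon)
        print(f"alpha={alpha}, epsilon={epsilon}, Q[0]={np.round(Q_final[0], 1)}")
```

Repeating the sweep with several seeds would help separate the effect of the hyperparameters from run-to-run noise.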

**Exercise 3**

We initialized Q with standard normal deviates. How do your answers in (1) change if you initialize Q with zeros?  
Enter your final Q estimates here.

```python
import random

import numpy as np

np.set_printoptions(formatter={'float': '{: 0.3f}'.format})

# Initialize states, actions, Q function

# states: SOFA levels, where 3 is the terminal (worst) state
sofa_levels = [0, 1, 2, 3]
num_states = len(sofa_levels)
terminal_state = 3

# actions: vasopressor dose levels
vaso_doses = [0, 1, 2, 3, 4]
num_actions = len(vaso_doses)

# initialize array to store action values Q (zeros instead of random values)
Q = np.zeros(shape=(num_states, num_actions))
Q[terminal_state, :] = 0  # no action taken from terminal state, so no value


def act(epsilon, action_values):
    '''
    epsilon-greedy policy: random action with probability epsilon,
    otherwise the action with the highest estimated value
    '''
    action_size = len(action_values)
    if np.random.rand() <= epsilon:  # random draw with prob epsilon
        return random.randrange(action_size)
    return np.argmax(action_values)  # greedy action


def calc_reward(state):
    '''
    simple reward function for illustration. lower state value is better
    '''
    if state == 3:
        reward = -100
    elif state == 2:
        reward = -10
    elif state == 1:
        reward = 0
    else:
        reward = 10
    return reward


def determine_next_state(state, action):
    '''
    return next state from the environment
    to be replaced with simulated data or alternative
    '''
    if state in [0, 1, 2] and action == 0:  # no dose raises state
        next_state = min(terminal_state, state + 1)
    elif action in [3, 4]:  # higher doses lower state (floored at zero)
        next_state = max(0, state - 1)
    else:  # doses 1 and 2 lead to state 1 or 2 at random
        next_state = random.choice([1, 2])
    return next_state


# Run the Process
num_episodes = 500
max_timesteps = 100
epsilon = 0.3
alpha = 0.3  # weight on new data
gamma = 0.99  # discount factor
verbose = False

for ep in range(num_episodes):
    if ep % 10 == 0:
        print('episode', ep + 1)
    sofa_level = 0  # initialize state
    for tm in range(max_timesteps):

        # given state, get action from the epsilon-greedy behavior policy
        vaso_dose = act(epsilon, Q[sofa_level, :])

        next_sofa = determine_next_state(sofa_level, vaso_dose)
        reward = calc_reward(next_sofa)
        done = next_sofa == terminal_state
        transition = (sofa_level, vaso_dose, reward, next_sofa, done)

        if verbose:
            print(transition)

        # update Q(S,A) using TD(0)
        # Q(S,A) = Q(S,A) + alpha (r + gamma * max_a Q(S',a) - Q(S,A))
        td_target = reward + gamma * np.max(Q[next_sofa, :])
        Q[sofa_level, vaso_dose] += alpha * (td_target - Q[sofa_level, vaso_dose])

        sofa_level = next_sofa  # update sofa for next iteration

        # stop the episode once the terminal state is reached
        if done:
            break
    if ep % 10 == 0:
        print('Q \n', Q)
```


episode 1
Q 
 [[ 0.000  0.000  0.000  0.000  0.000]
 [-3.000  0.000  0.000  0.000  0.000]
 [-30.000  0.000  0.000  0.000  0.000]
 [ 0.000  0.000  0.000  0.000  0.000]]
episode 11
Q 
 [[ 660.964  597.726  638.766  717.199  664.510]
 [ 341.218  500.433  375.451  703.695  252.176]
 [-91.765  448.751  231.209  672.253  410.354]
 [ 0.000  0.000  0.000  0.000  0.000]]
episode 21
Q 
 [[ 914.171  900.305  904.681  933.142  924.584]
 [ 797.950  796.476  874.986  931.560  826.807]
 [-98.023  714.785  716.138  916.275  663.926]
 [ 0.000  0.000  0.000  0.000  0.000]]
episode 31
Q 
 [[ 960.917  947.901  960.002  975.710  972.910]
 [ 919.177  888.930  928.414  974.922  963.027]
 [-99.668  832.437  847.855  961.912  852.714]
 [ 0.000  0.000  0.000  0.000  0.000]]
episode 41
Q 
 [[ 983.984  967.419  974.944  994.721  993.924]
 [ 956.565  972.893  969.741  994.568  989.405]
 [-99.837  902.953  939.409  984.220  935.022]
 [ 0.000  0.000  0.000  0.000  0.000]]
episode 51
Q 
 [[ 988.436  974.816  977.142 

```python
Q
```

array([[ 990.000,  982.168,  979.743,  1000.000,  1000.000],
       [ 970.100,  980.337,  988.950,  1000.000,  1000.000],
       [-100.000,  979.339,  979.156,  990.000,  990.000],
       [ 0.000,  0.000,  0.000,  0.000,  0.000]])

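The answers did not noticeably change: in state 0, doses 3 and 4 are still the most valuable and doses 1 and 2 the least. Note that this run used $\alpha = \epsilon = 0.3$, as in Exercise 2; its printed Q values track that run closely, so the convergence speed looks about the same.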

**Exercise 4**

Does Q seem to converge? It will converge given enough iterations.

Q does seem to converge: the largest entries settle at 1000.000 in all three runs, although the runs approach these values at different rates.
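
A simple way to check convergence numerically is to track how much Q changes from one episode to the next. The sketch below is one possible check, not a definitive test: `step`, `REWARDS`, and `max_changes` are hypothetical names, the dynamics mirror the toy environment above, and the settings ($\alpha = \epsilon = 0.3$, zero initialization) are assumptions chosen to match Exercise 3.

```python
import numpy as np

rng = np.random.default_rng(0)
REWARDS = {0: 10, 1: 0, 2: -10, 3: -100}  # reward by next state, as in the notes

def step(state, action):
    """Toy dynamics mirroring determine_next_state above."""
    if action == 0:
        return min(3, state + 1)
    if action in (3, 4):
        return max(0, state - 1)
    return int(rng.integers(1, 3))  # state 1 or 2 at random

alpha, gamma, epsilon = 0.3, 0.99, 0.3
Q = np.zeros((4, 5))
max_changes = []  # largest absolute change in any Q entry, per episode

for ep in range(500):
    Q_before = Q.copy()
    state = 0
    for _ in range(100):
        explore = rng.random() <= epsilon
        action = int(rng.integers(5)) if explore else int(np.argmax(Q[state]))
        s_next = step(state, action)
        target = REWARDS[s_next] + gamma * np.max(Q[s_next])
        Q[state, action] += alpha * (target - Q[state, action])
        state = s_next
        if state == 3:
            break
    max_changes.append(float(np.max(np.abs(Q - Q_before))))

print("largest per-episode change:     ", max(max_changes))
print("mean change over last 10 episodes:", np.mean(max_changes[-10:]))
```

With a constant $\alpha$ and random transitions, entries for doses 1 and 2 keep fluctuating between their possible targets, so the per-episode change is expected to level off at a small value rather than reach exactly zero.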

**Exercise 5**

Modify the code to return all transitions as a list of tuples. Paste the first 10 transitions below.
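
One possible approach, sketched under assumptions rather than given as the intended solution: move the episode loop into a function that appends each `(state, action, reward, next_state, done)` tuple to a list and returns it. The name `run_episode` and the compact environment helpers are hypothetical, and the zero initialization is an arbitrary choice.

```python
import random

import numpy as np

def calc_reward(state):
    """Reward by next state, as in the notes."""
    return {0: 10, 1: 0, 2: -10, 3: -100}[state]

def determine_next_state(state, action, terminal_state=3):
    """Toy dynamics mirroring the environment in the notes."""
    if action == 0:
        return min(terminal_state, state + 1)
    if action in (3, 4):
        return max(0, state - 1)
    return random.choice([1, 2])

def run_episode(Q, epsilon=0.1, alpha=0.1, gamma=0.99,
                max_timesteps=100, terminal_state=3):
    """Run one Q-Learning episode (updating Q in place) and return its transitions."""
    transitions = []
    state = 0
    for _ in range(max_timesteps):
        if random.random() <= epsilon:
            action = random.randrange(Q.shape[1])
        else:
            action = int(np.argmax(Q[state]))
        next_state = determine_next_state(state, action)
        reward = calc_reward(next_state)
        done = next_state == terminal_state
        transitions.append((state, action, reward, next_state, done))
        target = reward + gamma * np.max(Q[next_state])
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
        if done:
            break
    return transitions

random.seed(0)
Q = np.zeros((4, 5))
all_transitions = []  # list of tuples across all episodes
for ep in range(500):
    all_transitions.extend(run_episode(Q))

for t in all_transitions[:10]:
    print(t)
```

Returning the transitions separately from the Q update also makes it easier to replace the toy environment with recorded data later.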

---

### III. Limitations of Q-Learning

As we've learned, Q-Learning involves storing and updating a table or array of values $Q(S,A)$ where each element represents the value of a *(state, action)* pair. This is called a *Q-table*.

**As the number of states and actions (the *state-action space*) grows, this approach becomes unmanageable** in terms of both storage and computation. This occurs for continuous variables or discrete variables with a massive number of possible values.

There are two approaches to handle this issue:

- Quantize the values. For example, medication doses might be bucketed into dose ranges.
- Use function approximators for Q. Function approximation is now very popular, with neural nets playing a major role.
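
As a minimal sketch of the quantization idea, assuming NumPy's `np.digitize` is used to map continuous doses to bucket indices. The dose values and bucket edges are made up for illustration and carry no clinical meaning.

```python
import numpy as np

# Hypothetical continuous doses and bucket edges (illustrative units only)
doses = np.array([0.0, 0.05, 0.12, 0.30, 0.75])
bin_edges = np.array([0.01, 0.1, 0.25, 0.5])  # 4 edges give 5 buckets, indexed 0..4

# Each dose is replaced by the index of the bucket it falls into
dose_buckets = np.digitize(doses, bin_edges)
print(dose_buckets)  # → [0 1 2 3 4]
```

The bucket index can then serve as the column index of the Q-table, at the cost of treating all doses within a bucket as the same action.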

**Going Deep**

When deep neural networks are used with Q-Learning, the model is called a *Deep Q-Network*. We will study these next.

In general, pairing reinforcement learning with a deep neural network is called *Deep Reinforcement Learning*, abbreviated Deep RL.

---