# Q-learning Tutorial for Optimal Drug Delivery in Diabetes Management

In this tutorial, we'll walk through a simple reinforcement learning scenario: a **drug delivery system** for managing a diabetic patient's blood glucose levels. <br>
Using **Q-learning**, we'll train an agent to learn the optimal insulin dosage. This tutorial will cover:
- Defining the MDP components (states, actions, rewards)
- Implementing exploration with the ε-greedy method
- Updating Q-values and deriving the optimal policy

In [7]:
import numpy as np
import random

## Step 1: Define the MDP Components

We define the states, actions, and rewards for our drug delivery problem.

- **States**: The patient's blood glucose level ("Low", "Normal", or "High").
- **Actions**: The insulin dosage to deliver ("No Insulin", "Low Dose", or "High Dose").
- **Rewards**: A positive reward when the resulting glucose level is "Normal", and penalties when it is "Low" (hypoglycemia) or "High" (hyperglycemia).

In [8]:
# Define states representing blood glucose levels
states = ['Low', 'Normal', 'High']

# Define actions (amounts of insulin to deliver)
actions = ['No Insulin', 'Low Dose', 'High Dose']

# Initialize a Q-table with all zero values
q_table = np.zeros((len(states), len(actions)))

# Define rewards (encouraging normal glucose levels)
rewards = {
    'Low': -10,       # Hypoglycemia penalty
    'Normal': 10,     # Ideal glucose level reward
    'High': -5        # Hyperglycemia penalty
}

# Print initial setup
print('States (Blood Glucose Levels):', states)
print('Actions (Insulin Dosage Options):', actions)
print('\nInitial Q-Table:')
print(q_table)

States (Blood Glucose Levels): ['Low', 'Normal', 'High']
Actions (Insulin Dosage Options): ['No Insulin', 'Low Dose', 'High Dose']

Initial Q-Table:
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]


## Step 2: Define State Transitions and Rewards
Read the definitions of the state transitions below. Do they make sense?

In [12]:
# Define state transitions
def transition(state, action):
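    '''
    Return the next glucose state for a given state and action.

    Parameters:
    state: current glucose state s ('Low', 'Normal', or 'High')
    action: insulin dosage a chosen by the agent in state s

    Returns:
    The next state s' reached from s after taking action a (deterministic)
    '''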
    if state == 'Low':
        if action == 'No Insulin':
            return 'Low'
        elif action == 'Low Dose':
            return 'Normal'
        else:
            return 'High'
    elif state == 'Normal':
        if action == 'No Insulin':
            return 'High'
        elif action == 'Low Dose':
            return 'Normal'
        else:
            return 'Low'
    else:  # High
        if action == 'No Insulin':
            return 'High'
        elif action == 'Low Dose':
            return 'Normal'
        else:
            return 'Low'

# Check transitions with print statements
current_state = 'Normal'
for action in actions:
    next_state = transition(current_state, action)
    print(f'Transition from {current_state} with action {action} -> {next_state}')

Transition from Normal with action No Insulin -> High
Transition from Normal with action Low Dose -> Normal
Transition from Normal with action High Dose -> Low
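
The check above covers only the `Normal` state. To judge whether all nine transitions make sense, it can help to see them side by side. The sketch below restates the same mapping as a lookup table and prints every (state, action) pair with the reward of the resulting state. The names `TRANSITIONS`, `STATES`, `ACTIONS`, and `REWARDS` are illustrative and not part of the notebook; the values are copied from the cells above.

```python
STATES = ['Low', 'Normal', 'High']
ACTIONS = ['No Insulin', 'Low Dose', 'High Dose']
REWARDS = {'Low': -10, 'Normal': 10, 'High': -5}

# Same mapping as transition(), written as (state, action) -> next state
TRANSITIONS = {
    ('Low', 'No Insulin'): 'Low',
    ('Low', 'Low Dose'): 'Normal',
    ('Low', 'High Dose'): 'High',
    ('Normal', 'No Insulin'): 'High',
    ('Normal', 'Low Dose'): 'Normal',
    ('Normal', 'High Dose'): 'Low',
    ('High', 'No Insulin'): 'High',
    ('High', 'Low Dose'): 'Normal',
    ('High', 'High Dose'): 'Low',
}

# Print the full transition table with the reward of each resulting state
for state in STATES:
    for action in ACTIONS:
        next_state = TRANSITIONS[(state, action)]
        reward = REWARDS[next_state]
        print(f'{state:>6} + {action:<10} -> {next_state:<6} (reward {reward:+d})')
```

Reading the table row by row makes it easier to spot transitions that look physiologically questionable.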


## Step 3: Implement Q-learning Algorithm
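
The training loop in this step applies the standard tabular Q-learning update after every transition. Written with symbols that match the code (`alpha` is $\alpha$, `gamma` is $\gamma$, `reward` is $r$, and `next_max` is the max term), it reads:

$$
\begin{aligned}
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]
\end{aligned}
$$

As a worked check against the printed trace, consider Step 7 of Episode 1: the state is Low, the action is Low Dose, the reward is 10, and the next state is Normal. At that point the largest Q-value in the Normal row is 1.00 (set in Step 4), so with $\alpha = 0.1$ and $\gamma = 0.9$:

$$
\begin{aligned}
Q(\text{Low}, \text{Low Dose}) \leftarrow 0 + 0.1 \left[ 10 + 0.9 \times 1.00 - 0 \right] = 1.09
\end{aligned}
$$

This matches the new Q-value of 1.09 shown in the trace.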

In [None]:
# Q-learning parameters
alpha = 0.1      # Learning rate
gamma = 0.9      # Discount factor
epsilon = 0.2    # Initial exploration rate

# Helper function to get action based on ε-greedy strategy
def choose_action(state_index):
    if random.uniform(0, 1) < epsilon:
        return random.choice(range(len(actions)))  # Explore
    else:
        return np.argmax(q_table[state_index])     # Exploit

# Q-learning function
def q_learning(episodes=100):
    global epsilon   # Why did we define a global epsilon?
    for episode in range(episodes):
        state = random.choice(states)
        print(f'\nEpisode {episode+1} - Starting state: {state}')

        for step in range(10):  # Assume 10 steps per episode
            state_index = states.index(state)
            action_index = choose_action(state_index)
            action = actions[action_index]

            # Observe the next state and reward
            next_state = transition(state, action)
            reward = rewards[next_state]
            next_state_index = states.index(next_state)

            # Q-learning update
            old_value = q_table[state_index, action_index]
            next_max = np.max(q_table[next_state_index])
            q_table[state_index, action_index] = old_value + alpha * (reward + gamma * next_max - old_value)

            # Print details of the Q-update step
            print(f'  Step {step+1}: State={state}, Action={action}, Reward={reward}, Next State={next_state}')
            print(f'    Old Q-value: {old_value:.2f}, New Q-value: {q_table[state_index, action_index]:.2f}')

            # Move to the next state and decay epsilon by 1% per step (floor of 0.01)
            state = next_state
            epsilon = max(epsilon * 0.99, 0.01)

# Run Q-learning with print statements to trace updates
q_learning(episodes=5)
print('\nFinal Q-Table after training:')
print(q_table)


Episode 1 - Starting state: Normal
  Step 1: State=Normal, Action=No Insulin, Reward=-5, Next State=High
    Old Q-value: 0.00, New Q-value: -0.50
  Step 2: State=High, Action=No Insulin, Reward=-5, Next State=High
    Old Q-value: 0.00, New Q-value: -0.50
  Step 3: State=High, Action=Low Dose, Reward=10, Next State=Normal
    Old Q-value: 0.00, New Q-value: 1.00
  Step 4: State=Normal, Action=Low Dose, Reward=10, Next State=Normal
    Old Q-value: 0.00, New Q-value: 1.00
  Step 5: State=Normal, Action=High Dose, Reward=-10, Next State=Low
    Old Q-value: 0.00, New Q-value: -1.00
  Step 6: State=Low, Action=No Insulin, Reward=-10, Next State=Low
    Old Q-value: 0.00, New Q-value: -1.00
  Step 7: State=Low, Action=Low Dose, Reward=10, Next State=Normal
    Old Q-value: 0.00, New Q-value: 1.09
  Step 8: State=Normal, Action=Low Dose, Reward=10, Next State=Normal
    Old Q-value: 1.00, New Q-value: 1.99
  Step 9: State=Normal, Action=Low Dose, Reward=10, Next State=Normal
    Old Q-val

### Answer the following questions on Step 3:
- In the code above, what does the variable *"episode"* represent? HINT: Look for episodic tasks in reinforcement learning!
- What is being trained here? Can you connect it to the Q-learning update from the lecture?
- What stopping criterion has been used here to check whether the Q-values have stabilized?
- Why is a "global epsilon" used in the code above?
- Are there any hyperparameters in this model? Is it possible to do hyperparameter tuning here?
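
One way to experiment with these questions is to rerun training without the print statements and with the settings passed in as arguments. The sketch below is a minimal variant under stated assumptions, not the notebook's own implementation: the names `train`, `NEXT_STATE`, `tol`, `max_episodes`, and `seed` are hypothetical, `epsilon` is held fixed rather than decayed, and the stopping rule (largest Q-value change within an episode below `tol`) is only one simple choice among many.

```python
import random
import numpy as np

STATES = ['Low', 'Normal', 'High']
ACTIONS = ['No Insulin', 'Low Dose', 'High Dose']
REWARDS = {'Low': -10, 'Normal': 10, 'High': -5}
# Same deterministic transitions as transition() in Step 2
NEXT_STATE = {
    'Low':    {'No Insulin': 'Low',  'Low Dose': 'Normal', 'High Dose': 'High'},
    'Normal': {'No Insulin': 'High', 'Low Dose': 'Normal', 'High Dose': 'Low'},
    'High':   {'No Insulin': 'High', 'Low Dose': 'Normal', 'High Dose': 'Low'},
}

def train(alpha=0.1, gamma=0.9, epsilon=0.2, max_episodes=5000,
          steps=10, tol=1e-4, seed=0):
    """Return (q_table, episodes_run). Assumption: epsilon stays fixed."""
    rng = random.Random(seed)          # local generator for reproducibility
    q = np.zeros((len(STATES), len(ACTIONS)))
    for episode in range(1, max_episodes + 1):
        s = rng.randrange(len(STATES))
        max_change = 0.0
        for _ in range(steps):
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = int(np.argmax(q[s]))
            next_state = NEXT_STATE[STATES[s]][ACTIONS[a]]
            s_next = STATES.index(next_state)
            # Q-learning update, tracking the size of the change
            target = REWARDS[next_state] + gamma * np.max(q[s_next])
            change = alpha * (target - q[s, a])
            q[s, a] += change
            max_change = max(max_change, abs(change))
            s = s_next
        # Illustrative stopping rule: no update in this episode moved by tol or more
        if max_change < tol:
            return q, episode
    return q, max_episodes

q, episodes_run = train()
print(f'Stopped after {episodes_run} episodes')
print(np.round(q, 2))
```

Calling `train` with different values of `alpha`, `gamma`, or `epsilon` and comparing the resulting tables is a simple form of hyperparameter exploration. Note that this stopping rule only looks at the state-action pairs visited in the last episode, so rarely visited pairs may still be far from their final values.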

## Step 4: Extract Optimal Policy and Interpret Results

In [11]:
# Derive the optimal policy from the Q-table
optimal_policy = {}
for state in states:
    state_index = states.index(state)
    best_action_index = np.argmax(q_table[state_index])
    optimal_policy[state] = actions[best_action_index]

# Print optimal policy for students to see the results
print('\nOptimal Policy Derived from Q-Table:')
for state, action in optimal_policy.items():
    print(f'  In state {state}, the best action is: {action}')


Optimal Policy Derived from Q-Table:
  In state Low, the best action is: Low Dose
  In state Normal, the best action is: Low Dose
  In state High, the best action is: Low Dose
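
The per-state loop above can also be written as a single `np.argmax` call along the action axis. The sketch below shows this on a hypothetical Q-table, `q_example`, rather than on the trained `q_table`. Its entries are the values that satisfy the Bellman optimality equation for the Step 2 transitions with `gamma = 0.9`, worked out by hand; a table trained for only 5 episodes will be far from these numbers. The names `q_example` and `policy` are illustrative.

```python
import numpy as np

states = ['Low', 'Normal', 'High']
actions = ['No Insulin', 'Low Dose', 'High Dose']

# Hypothetical Q-table for illustration only (rows: states, columns: actions).
# Each entry equals reward(next state) + 0.9 * 100, since the best value of
# every state is 100 under these transitions and gamma = 0.9.
q_example = np.array([
    [80.0, 100.0, 85.0],   # Low
    [85.0, 100.0, 80.0],   # Normal
    [85.0, 100.0, 80.0],   # High
])

# argmax over axis=1 picks the best action index for every state at once
best_action_indices = np.argmax(q_example, axis=1)
policy = {state: actions[i] for state, i in zip(states, best_action_indices)}

for state, action in policy.items():
    print(f'  In state {state}, the best action is: {action}')
```

Each Q-value here can be read as the immediate reward of an action plus the discounted value of acting greedily afterwards, which is why "Low Dose" leads in every row.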


### Answer the following questions on Step 4:
- In the code above, can you figure out how the optimal policy is calculated for every state from the following definition we saw in the lecture? <br>
$$
\begin{aligned}
\pi^{*}(s) = \underset{a}{\operatorname{argmax}}\; Q(s, a)
\end{aligned}
$$
- Do you agree with the optimal policy calculated from the Q-table? What do the Q-values represent?
