# Understanding the QLearningAgent: Simple vs. Neural RL

This notebook explores the `QLearningAgent` class defined in `qlearn_agent.py`. This agent implements Q-learning, a fundamental Reinforcement Learning (RL) algorithm, and offers two modes of operation:

1.  **`simple` mode:** Uses a traditional Q-table (a dictionary) to store Q-values. Suitable for problems with small, discrete state spaces.
2.  **`neural` mode:** Uses a Deep Q-Network (DQN) (a neural network) to approximate Q-values. Necessary for problems with large or continuous state spaces.

We will cover:
*   Basic concepts of Reinforcement Learning and Q-Learning.
*   The difference between the 'simple' (tabular) and 'neural' (DQN) approaches.
*   Key components of the `QLearningAgent` class.
*   A practical example demonstrating how to define states, actions, rewards, and use the agent.

## Workflow Diagram

![QLearningAgent Workflow](./dual_ql_agent.png)

## 1. Reinforcement Learning (RL) Basics

Reinforcement Learning is a type of machine learning where an **agent** learns to make decisions by interacting with an **environment**. The goal is for the agent to learn a **policy** (a strategy) that maximizes a cumulative **reward** over time.

Key components:
*   **Agent:** The learner or decision-maker (e.g., our `QLearningAgent`).
*   **Environment:** The external system the agent interacts with (e.g., a game, a simulation, the real world).
*   **State (s):** A representation of the current situation of the environment.
*   **Action (a):** A choice the agent can make in a given state.
*   **Reward (r):** A scalar feedback signal from the environment indicating how good the last action was in the previous state.
*   **Policy (π):** The agent's strategy, mapping states to actions.
*   **Value Function (e.g., Q-value):** Predicts the expected future reward an agent can get by taking a specific action in a specific state and following a certain policy thereafter.

## 2. Q-Learning: Learning Action Values

Q-Learning is a popular RL algorithm that learns from experience without needing a model of the environment.

### The Basic Setup:

1. **State (s)**: What situation the agent is in right now (e.g., "I'm in the top-left corner of a maze").

2. **Action (a)**: A choice the agent can make (e.g., "move right," "move down").

3. **Reward (r)**: Immediate feedback for taking an action in a state (e.g., "+1 for getting closer to the goal," "-1 for bumping into a wall").

The agent's goal is to pick actions that, over time, give it the most total reward.

### The Q-Table

Q-learning keeps a table of numbers called Q-values, one entry for each state-action pair, written as Q(s,a).

- High Q(s,a) means "I think doing action a in state s will pay off well."
- Low Q(s,a) means "That action probably won't help much (or might hurt)."
- Initially, you don't know anything, so you might start with all Q(s,a)=0.

### Learning by Trial and Error

1. Pick an action in the current state, sometimes at random (exploration) and sometimes the best-known one (exploitation).
2. Take the action, observe the reward r and the new state s'.
3. Update the Q-value for (s,a) to account for what you just learned.

The core update formula is:

$$ Q(s, a) \leftarrow Q(s, a) + \alpha [ r + \gamma \max_{a'} Q(s', a') - Q(s, a) ] $$

Where:
- α (alpha) is the "learning rate" (how strongly new experiences override old knowledge).
- γ (gamma) is the "discount factor" (how much you care about future rewards versus immediate ones).
- $\max_{a'} Q(s', a')$ is the best Q-value you think you can get from the new state s'.
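
The update rule above can be sketched as a small function. This is a minimal illustration, not the code from `qlearn_agent.py`: the name `q_update`, the `terminal` flag, and the two-state table are assumptions made for the example.

```python
import numpy as np

def q_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Best value reachable from the next state (zero if the episode ended).
    best_next = 0.0 if terminal else float(np.max(q_table[s_next]))
    td_error = r + gamma * best_next - q_table[s][a]
    q_table[s][a] += alpha * td_error
    return td_error

# Two states, two actions; the numbers are chosen only for illustration.
q = {"s0": np.array([0.0, 2.0]), "s1": np.array([5.0, 1.0])}
err = q_update(q, "s0", 1, r=-1.0, s_next="s1")
print(err, q["s0"][1])
```

Here the target is $-1 + 0.9 \times 5 = 3.5$, the old estimate is $2$, so the TD error is $1.5$ and the entry moves by $\alpha \times 1.5 = 0.15$ to about $2.15$.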

### Intuition Behind the Update

You look at the difference between:
- What you actually got (r + best future Q)
- What you expected (old Q(s,a))

Multiply that "surprise" by α, and add it into your old estimate. Over many trials, Q(s,a) converges toward the true long-term value of taking a in s.

### What Happens Over Time

- **At first**: The agent explores randomly, and Q-values jump around.
- **As it gathers experience**: Q(s,a) estimates get better and better.
- **Eventually**: The agent mostly picks the action with the highest Q(s,a) in each state; given enough exploration and a suitable learning rate, this greedy policy approaches the optimal one.

### Simple Example: Maze Navigation

- State = your current square in the maze.
- Actions = up/down/left/right.
- Reward = +10 for reaching the exit, -1 for each move (to encourage shorter paths).

The agent tries random moves, updates Q, and gradually learns the fastest route to the exit.

**Bottom line**: Q-learning is a straightforward way for an agent to learn how valuable each action is in each situation by continuously updating its estimates based on actual rewards received. Over time and repeated experience, it figures out which actions lead to the best long-term payoff.

## 3. The `QLearningAgent`: Simple vs. Neural Mode

Our `QLearningAgent` implements the Q-learning update logic. The key difference lies in how it stores and retrieves the $Q(s, a)$ values.

### 3.1. `mode='simple'` (Tabular Q-Learning)

*   **How it works:** Uses a Python dictionary (`self.q_table`) as a lookup table.
    *   **Keys:** State representations (must be hashable, e.g., tuples).
    *   **Values:** NumPy arrays where each element represents the Q-value for a specific action in that state. `q_table[state]` returns `[Q(state, action_0), Q(state, action_1), ...]`.
*   **State Representation:** States must be discrete and hashable. For example, if a state is defined by `(position_x, position_y)`, the tuple `(5, 3)` can be a key.
*   **Pros:**
    *   Conceptually simple and easy to understand.
    *   Guaranteed to converge to the optimal Q-values under certain conditions (enough exploration, appropriate learning rate decay).
    *   Interpretable: You can directly inspect the learned Q-values for any state.
*   **Cons:**
    *   **Curse of Dimensionality:** Doesn't scale to problems with many state variables or continuous variables. The number of possible states (and thus the size of the Q-table) grows exponentially with the number of state features. Imagine a state with 10 features, each having 10 possible values -> 10^10 states!
    *   No generalization: It learns values only for states it has explicitly visited. It cannot estimate Q-values for unseen but similar states.
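
A dictionary-backed Q-table of the kind described above might look like the following sketch. It uses `collections.defaultdict` for brevity, which is an assumption: the agent's own `self.q_table` may initialize new states differently. The last lines redo the size arithmetic from the "Curse of Dimensionality" bullet.

```python
from collections import defaultdict

import numpy as np

N_ACTIONS = 4  # assumed action count for this sketch

# Unseen states get a zero vector of Q-values on first access.
q_table = defaultdict(lambda: np.zeros(N_ACTIONS))

q_table[(5, 3)][2] = 1.5      # write Q((5, 3), action 2)
print(q_table[(5, 3)])        # row of Q-values for a visited state
print(q_table[(0, 0)])        # a new state starts at zeros
print(len(q_table))           # number of states stored so far

# Rough size of a complete table: 10 features with 10 values each.
n_states = 10 ** 10
bytes_needed = n_states * N_ACTIONS * 8   # float64 entries only, no dict overhead
print(bytes_needed / 1e9, "GB")
```

Even ignoring dictionary overhead, the full table would need $10^{10} \times 4 \times 8$ bytes, or 320 GB, which is why the tabular approach is reserved for small state spaces.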

### 3.2. `mode='neural'` (Deep Q-Network - DQN): Learning with a "Brain"

When the number of possible situations (states) becomes enormous, or when states involve continuous values (like temperature or speed), the simple table approach (`mode='simple'`) breaks down. Imagine trying to create a table entry for every possible chess board position – it's impossible! This is called the "curse of dimensionality."

To overcome this, we use a **Deep Q-Network (DQN)**. Think of it like giving our agent a small "brain" – a neural network – instead of just a lookup table.

*   **How it works:** Instead of storing a Q-value for every single state-action pair, the neural network (`self.model`) learns a *function* that *estimates* the Q-value.
    *   **The Goal:** We want the network to learn the ideal Q-function, often called $Q^*(s, a)$, which tells us the true long-term value of taking action `a` in state `s`.
    *   **The Approximation:** The network learns an *approximation* of this ideal function, represented as $Q(s, a; \theta)$. Here, $\theta$ (theta) stands for all the adjustable parameters (weights and biases) inside the network – the things the network "learns" by adjusting during training. So, the network aims to make $Q(s, a; \theta)$ as close as possible to $Q^*(s, a)$.
    *   **Input:** You feed the network a description of the current state `s`, but it must be converted into a list of numbers (a numerical vector). For example, instead of "sunny day", you might input `[1, 0, 25.5]` representing weather type and temperature.
    *   **Output:** The network outputs a list (vector) of estimated Q-values, one for each possible action the agent can take from that state. For example, if the agent can go `left`, `right`, or `stay`, the output might look like `[10.2, -5.1, 1.5]`, meaning the network estimates the value of going left as 10.2, right as -5.1, and staying as 1.5. The agent would typically choose the action with the highest estimated Q-value (in this case, `left`).

*   **State Representation (Turning Situations into Numbers):** Neural networks only understand numbers. So, any description of the state (text, images, categories, measurements) must be converted into a fixed-size list (vector) of numbers. This process is handled by the `_preprocess_state` function (which often needs customization for specific problems).
    *   This allows us to handle complex states:
        *   **Continuous values:** Quantities such as temperature (25.5°C) or speed (60.3 mph) can be used directly, perhaps after scaling them to a standard range like 0 to 1.
        *   **High-dimensional discrete features:** Inputs such as words in a sentence or pixels in an image are converted with techniques like *embeddings* (representing words/items as dense vectors capturing meaning) or *one-hot encoding* (representing categories like 'cat'/'dog'/'bird' as `[1,0,0]`, `[0,1,0]`, `[0,0,1]`).

*   **Pros (Why use a Neural Network?):**
    *   **Scalability (Handles Huge Problems):** The network learns general patterns. Its size (number of parameters $\theta$) doesn't necessarily explode even if there are billions of possible states. It can handle problems where creating a Q-table would be impossible due to memory limitations.
    *   **Generalization (Learning from Similarity):** The network can make intelligent guesses about situations it hasn't encountered before. If it learns that state A (e.g., temperature 20°C) leads to a good outcome, and it encounters state B (e.g., temperature 21°C), it can infer that state B might also be good because they are similar numerically. The simple table method can only learn about states it has explicitly visited.

*   **Cons (Downsides and Challenges):**
    *   **More Complex Setup:** Designing the neural network (choosing layers, number of neurons) and tuning its learning process (setting the learning rate, choosing an optimizer algorithm) requires more expertise and experimentation than the simple table.
    *   **Training Instability:** Sometimes, the learning process for a neural network can be unstable – the Q-value estimates might jump around wildly or fail to converge. Advanced techniques like *Experience Replay* (learning from a shuffled buffer of past experiences instead of just the latest one) and *Target Networks* (using a separate, slower-updating network to stabilize the target values) are often needed, adding complexity (these are not implemented in this basic version).
    *   **Less Interpretable ("Black Box"):** It's much harder to look inside the trained neural network and understand *exactly why* it's predicting a certain Q-value for a given state and action. With the Q-table, you could just look up the value.
    *   **Needs More Data and Power:** Neural networks typically require significantly more experience (data points) to learn effectively compared to tabular methods. Training them also demands more computational power, often benefiting greatly from GPUs (Graphics Processing Units).
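
The input/output contract described in this section (a numeric state vector in, one Q-value estimate per action out) can be sketched with plain NumPy. This is not the Keras model used by the agent: the layer sizes, the random weights, and the name `q_network` are all assumptions chosen to keep the example dependency-free.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, HIDDEN, N_ACTIONS = 3, 8, 3   # illustrative sizes

# theta: randomly initialized weights and biases of a one-hidden-layer network
W1, b1 = rng.normal(size=(STATE_DIM, HIDDEN)) * 0.1, np.zeros(HIDDEN)
W2, b2 = rng.normal(size=(HIDDEN, N_ACTIONS)) * 0.1, np.zeros(N_ACTIONS)

def q_network(state_batch):
    """Map a (batch, STATE_DIM) array to (batch, N_ACTIONS) Q-value estimates."""
    hidden = np.maximum(0.0, state_batch @ W1 + b1)   # ReLU hidden layer
    return hidden @ W2 + b2                           # linear output, one value per action

state = np.array([[1.0, 0.0, 0.255]])   # e.g. two weather flags plus a scaled temperature
q_values = q_network(state)
greedy_action = int(np.argmax(q_values[0]))
print(q_values.shape, greedy_action)
```

The number of parameters depends only on the layer sizes, not on how many distinct states exist, which is the scalability argument made in the "Pros" list.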

## 4. Key Components of `QLearningAgent`

Let's look at some important methods in the `qlearn_agent.py` script.

### `__init__(...)`
Initializes the agent. Key parameters:
*   `alpha`: Learning rate (how quickly the agent adapts).
*   `gamma`: Discount factor (preference for immediate vs. future rewards).
*   `epsilon`: Exploration rate (probability of choosing a random action vs. the best-known action).
*   `mode`: `'simple'` or `'neural'`.
*   `state_dim`: **Required** for `'neural'` mode. Defines the number of features in the input vector to the neural network *after* preprocessing.
*   `n_actions`: Number of possible actions the agent can take.
*   `data_file`: Path for saving/loading episode/error data (and Q-table in simple mode).
*   `model_file`: Path for saving/loading the neural network model (neural mode).

### `load_data()` / `save_data()`
These methods handle persistence.
*   **Simple Mode:** Saves/loads the `q_table` dictionary, `episodes` count, and `error_list` to/from `data_file` using `pickle`.
*   **Neural Mode:** Saves/loads the neural network (`self.model`) using Keras' `save_model`/`load_model` to/from `model_file`. It also saves/loads `episodes` and `error_list` to/from `data_file` using `pickle`.
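
As a rough sketch of the simple-mode persistence described above, the following round-trips a Q-table, episode count, and error list through `pickle`. The helper names and the dictionary layout are hypothetical; the actual file format written by `save_data()` may differ.

```python
import os
import pickle
import tempfile

import numpy as np

def save_agent_data(path, q_table, episodes, error_list):
    """Write the agent's learnable state to a single pickle file."""
    payload = {"q_table": q_table, "episodes": episodes, "error_list": error_list}
    with open(path, "wb") as f:
        pickle.dump(payload, f)

def load_agent_data(path):
    """Return saved state, or fresh defaults if no file exists yet."""
    if not os.path.exists(path):
        return {"q_table": {}, "episodes": 0, "error_list": []}
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), "simple_q_data.pkl")
save_agent_data(path, {(0, 0): np.array([0.0, 1.0, 0.0, 0.0])}, episodes=3, error_list=[0.5, 0.2])
restored = load_agent_data(path)
print(restored["episodes"], restored["q_table"][(0, 0)])
```

Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.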

### `_preprocess_state(state)` (Crucial for Neural Mode)
This method is responsible for converting the raw state representation into a numerical NumPy array suitable for the neural network. **You will almost certainly need to customize this method based on your specific problem.**

The current implementation assumes the input `state` is already a list or tuple of numbers and reshapes it into `(1, state_dim)`.

**Example Customization:**
Let's say our state for a chatbot response agent is represented by a dictionary:
`state = {'user_sentiment': 0.8, 'message_length': 55, 'topic': 'booking'}`
And possible topics are `['greeting', 'booking', 'support', 'other']`.
Our desired `state_dim` needs to account for all features numerically.

1.  **Numerical Features:**
    *   `user_sentiment`: Already numerical (e.g., -1 to 1). Might need scaling if the range is very large. Let's assume it's fine. (1 feature)
    *   `message_length`: Numerical. Let's scale it to be roughly between 0 and 1 (e.g., assuming max length is 200). Scaled length = `min(message_length / 200.0, 1.0)`. (1 feature)
2.  **Categorical Features:**
    *   `topic`: Needs conversion. **One-Hot Encoding** is common. 'booking' would become `[0, 1, 0, 0]`. (4 features)

**Total `state_dim` = 1 (sentiment) + 1 (scaled length) + 4 (topic OHE) = 6**

A customized `_preprocess_state` might look like this:

In [None]:
import numpy as np # Already imported, but good practice in cell
import logging # Already imported

def custom_preprocess_state(state_dict, state_dim_expected=6):
    """
    Example preprocessing for a specific state dictionary.
    Converts {'user_sentiment': float, 'message_length': int, 'topic': str}
    into a NumPy array of shape (1, 6).
    """
    topics = ['greeting', 'booking', 'support', 'other']
    
    try:
        # 1. Extract and scale numerical features
        sentiment = state_dict.get('user_sentiment', 0.0) # Default to neutral
        
        msg_len = state_dict.get('message_length', 0)
        scaled_length = min(msg_len / 200.0, 1.0) # Example scaling
        
        # 2. One-Hot Encode categorical features
        topic = state_dict.get('topic', 'other') # Default to 'other'
        topic_vector = np.zeros(len(topics))
        # Unknown topics fall back to 'other'
        topic_index = topics.index(topic) if topic in topics else topics.index('other')
        topic_vector[topic_index] = 1.0
            
        # 3. Concatenate features into a single vector
        feature_vector = np.concatenate(([sentiment, scaled_length], topic_vector))
        
        # 4. Reshape for Keras (batch size of 1)
        processed_arr = feature_vector.reshape(1, -1)
        
        # 5. Validation
        if processed_arr.shape[1] != state_dim_expected:
            raise ValueError(f"Processed state shape {processed_arr.shape} != expected (1, {state_dim_expected})")
            
        return processed_arr.astype(np.float32)

    except Exception as e:
        logging.error(f"Error processing state {state_dict}: {e}")
        # Return zeros or handle error appropriately
        return np.zeros((1, state_dim_expected), dtype=np.float32) 

# --- Test it ---
example_state = {'user_sentiment': 0.8, 'message_length': 55, 'topic': 'booking'}
processed = custom_preprocess_state(example_state)
print(f"Original state: {example_state}")
print(f"Processed state shape: {processed.shape}")
print(f"Processed state content: {processed}")

example_state_unknown = {'user_sentiment': -0.5, 'message_length': 300, 'topic': 'complaint'}
processed_unknown = custom_preprocess_state(example_state_unknown)
print(f"\nOriginal state: {example_state_unknown}")
print(f"Processed state shape: {processed_unknown.shape}")
print(f"Processed state content: {processed_unknown}")


*Note: In the actual agent, you would replace the body of `_preprocess_state` with this custom logic, ensuring `self.state_dim` matches the output.*

### `get_q_values(state)`
Returns the Q-values for all actions in the given `state`.
*   **Simple Mode:** Looks up `state` in `self.q_table`. If the state is new, it initializes Q-values (usually to zeros). Requires the state to be hashable (e.g., tuple).
*   **Neural Mode:** Calls `_preprocess_state(state)` and then uses `self.model.predict()` (or `self.model()`) on the processed state to get the Q-values.

### `update_q_value(...)` / `_train_step(...)`
This is where learning happens based on a transition `(state, action, reward, next_state)`.
*   **Simple Mode:** Directly applies the Q-learning update rule to modify the value in `self.q_table[state][action]`.
*   **Neural Mode:**
    1.  Calculates the TD target: $\text{target} = r + \gamma \max_{a'} Q(s', a'; \theta)$. This requires predicting Q-values for the `next_state` using the network. For a terminal transition the bootstrap term is dropped and the target is just $r$ (the examples in Section 5 signal this with `use_next_state=not done`).
    2.  Calls `_train_step(state_tensor, action_tensor, target_tensor)`.
    3.  `_train_step` performs one step of gradient descent using the optimizer (`self.optimizer`). It calculates the loss (e.g., Mean Squared Error) between the network's prediction for the *action taken*, $Q(s, a; \theta)$, and the calculated `target`. The network weights ($\theta$) are updated to minimize this loss.
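
To make the neural-mode steps concrete without requiring TensorFlow, here is a sketch that swaps the network for a linear model $Q(s) = sW$ and takes one gradient step by hand. The function name `td_train_step`, the plain SGD update, and the $\tfrac{1}{2}(\cdot)^2$ loss scaling are assumptions for illustration; the agent itself uses a Keras model and `self.optimizer`.

```python
import numpy as np

def td_train_step(W, state, action, reward, next_state, gamma=0.9, lr=0.01, terminal=False):
    """One semi-gradient step on 0.5 * (target - Q(s, a))**2 for a linear model Q(s) = s @ W."""
    q_next = next_state @ W
    # The target is treated as a constant: no gradient flows through it.
    target = reward if terminal else reward + gamma * np.max(q_next)
    q_sa = state @ W[:, action]
    td_error = target - q_sa
    # Gradient of Q(s, a) with respect to W[:, action] is simply `state`.
    W[:, action] += lr * td_error * state
    return 0.5 * td_error ** 2, td_error

W = np.zeros((2, 4))   # state_dim=2, n_actions=4, as in the grid world of Section 5
s, s_next = np.array([0.0, 0.5]), np.array([0.5, 0.5])
loss_before, _ = td_train_step(W, s, action=3, reward=-1.0, next_state=s_next)
loss_after, _ = td_train_step(W, s, action=3, reward=-1.0, next_state=s_next)
print(loss_before, loss_after)
```

Each call reports the loss measured before its own update, so the second value reflects the effect of the first step. Only the column of `W` belonging to the action taken is changed, mirroring how the loss is computed only for $Q(s, a; \theta)$.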

### `choose_action(state)`
Implements the **Epsilon-Greedy** strategy for exploration/exploitation:
1.  Generate a random number between 0 and 1.
2.  If the number is less than `self.epsilon`:
    *   **Explore:** Choose a random action.
3.  Otherwise (with probability `1 - epsilon`):
    *   **Exploit:** Choose the action with the highest Q-value according to `get_q_values(state)`.
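
The epsilon-greedy rule above fits in a few lines. This standalone sketch takes the Q-values and a NumPy random generator as arguments, which is an assumption made for testability; the agent's `choose_action` obtains Q-values through `get_q_values(state)` instead.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Return a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

rng = np.random.default_rng(42)
q_values = np.array([10.2, -5.1, 1.5])

greedy = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
picks = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(200)]
print(greedy, sorted(set(picks)))
```

With `epsilon=0.0` the function always returns the index of the largest Q-value; with `epsilon=1.0` every call is a uniform random draw over the actions.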

## 5. Practical Example: Simple Grid World Navigation

Let's define a very simple environment: a 3x3 grid.
*   **States:** Agent's position `(row, col)`, e.g., `(0, 0)` to `(2, 2)`.
*   **Actions:** `0: Up, 1: Down, 2: Left, 3: Right`. (Total `n_actions = 4`).
*   **Environment:**
    *   Start at `(0, 0)`.
    *   Goal at `(2, 2)`.
    *   Obstacle at `(1, 1)`.
    *   Hitting a wall keeps the agent in the same cell; moving onto the obstacle ends the episode.
*   **Rewards:**
    *   +10 for reaching the goal `(2, 2)`.
    *   -10 for hitting the obstacle `(1, 1)`.
    *   -1 for any other move (encourages shortest path).
*   **Episode End:** Reaching the goal or the obstacle.

In [None]:
# Define environment parameters
GRID_SIZE = 3
GOAL_STATE = (GRID_SIZE - 1, GRID_SIZE - 1) # (2, 2)
START_STATE = (0, 0)
OBSTACLE_STATE = (1, 1)
N_ACTIONS = 4 # 0: Up, 1: Down, 2: Left, 3: Right

# Simple environment step function
def environment_step(state, action):
    """Simulates one step in the grid world."""
    row, col = state
    
    # Apply action
    if action == 0: # Up
        row = max(0, row - 1)
    elif action == 1: # Down
        row = min(GRID_SIZE - 1, row + 1)
    elif action == 2: # Left
        col = max(0, col - 1)
    elif action == 3: # Right
        col = min(GRID_SIZE - 1, col + 1)
    
    next_state = (row, col)
    
    # Assign reward and termination based on the cell the agent lands on.
    # Moving onto the obstacle is allowed and ends the episode with a penalty.
    if next_state == GOAL_STATE:
        reward = 10
        done = True
    elif next_state == OBSTACLE_STATE:
        reward = -10
        done = True
    else:
        reward = -1 # Step cost
        done = False
        
    return next_state, reward, done

### 5.1 Using `mode='simple'`

Since the state space is small (3x3 = 9 states), the simple Q-table approach is perfect. The state `(row, col)` is already a tuple, which is hashable.

In [None]:
# Assume QLearningAgent class is defined in qlearn_agent.py and importable
import os

import numpy as np
from qlearn_agent import QLearningAgent

# Agent parameters
ALPHA = 0.1 # Learning rate (step size of the tabular update)
GAMMA = 0.9 # Discount factor
EPSILON = 1.0 # Initial exploration rate (start high)
EPSILON_DECAY = 0.995 # Decay factor per episode
EPSILON_MIN = 0.05 # Minimum exploration rate
N_EPISODES = 1000 # Number of training episodes

# Initialize agent
agent_simple = QLearningAgent(
    alpha=ALPHA, 
    gamma=GAMMA, 
    epsilon=EPSILON, 
    mode='simple', 
    n_actions=N_ACTIONS,
    data_file="rl_agent/simple_q_data.pkl" # Use separate file
)
print("Simple Agent Initialized")

# Training loop
current_epsilon = EPSILON
episode_rewards = []

print("--- Training Simple Q-Learning Agent ---")
for episode in range(N_EPISODES):
    state = START_STATE
    done = False
    total_reward = 0
    steps = 0
    
    while not done and steps < 100: # Add max steps per episode
        # Choose action using epsilon-greedy
        action = agent_simple.choose_action(state)
        
        # Take action, observe outcome
        next_state, reward, done = environment_step(state, action)
        
        # Update Q-value
        _ , td_error = agent_simple.update_q_value(state, action, reward, next_state, use_next_state=not done)
        
        state = next_state
        total_reward += reward
        steps += 1
        
    episode_rewards.append(total_reward)
    
    # Decay epsilon
    current_epsilon = max(EPSILON_MIN, current_epsilon * EPSILON_DECAY)
    agent_simple.epsilon = current_epsilon # Update agent's epsilon
    
    if (episode + 1) % 100 == 0:
        avg_reward = np.mean(episode_rewards[-100:])
        print(f"Episode {episode + 1}/{N_EPISODES} | Avg Reward (last 100): {avg_reward:.2f} | Epsilon: {current_epsilon:.3f}")

print("--- Training Finished ---") 

# Inspect Q-values
print("\nQ-values for Start State (0, 0):", agent_simple.get_q_values((0, 0)))
print("Q-values for State (1, 0):", agent_simple.get_q_values((1, 0))) 
print("Q-values for State (1, 2):", agent_simple.get_q_values((1, 2))) 

# Testing Greedy Policy
print("\n--- Testing Greedy Policy ---")
state = START_STATE
steps = 0
path = [state]
agent_simple.epsilon = 0.0 # Turn off exploration
while state != GOAL_STATE and state != OBSTACLE_STATE and steps < 20:
    action = agent_simple.choose_action(state) # Choose best action
    state, reward, done = environment_step(state, action)
    path.append(state)
    steps += 1
    if done:
        break
print("Path:", path)
print("Reached Goal?", state == GOAL_STATE)
print("Hit Obstacle?", state == OBSTACLE_STATE)

# Clean up
data_file_path = "rl_agent/simple_q_data.pkl"
if os.path.exists(data_file_path):
    os.remove(data_file_path)
    print(f"Removed {data_file_path}")

### 5.2 Using `mode='neural'`

Although overkill for this tiny grid world, let's demonstrate how to set it up. We need to preprocess the state `(row, col)` into a numerical vector.

**Preprocessing:** We can represent the state as a 2-element vector `[row, col]`. It's often good practice to scale these features, especially if the grid were larger. Let's scale them to `[0, 1]` by dividing by `(GRID_SIZE - 1)`.

**`state_dim` will be 2.**

In [None]:
# Define the preprocessing function specifically for this grid world
def preprocess_grid_state(state_tuple, grid_size):
    """Converts (row, col) tuple to a scaled NumPy array [row_scaled, col_scaled]."""
    row, col = state_tuple
    # Scale features to [0, 1]
    row_scaled = row / (grid_size - 1.0)
    col_scaled = col / (grid_size - 1.0)
    
    feature_vector = np.array([row_scaled, col_scaled], dtype=np.float32)
    # Reshape for Keras: (1, state_dim)
    return feature_vector.reshape(1, -1)

# --- Test preprocessing ---
test_state = (1, 2)
state_dim = 2 # We have two features: row_scaled, col_scaled
processed_test = preprocess_grid_state(test_state, GRID_SIZE)
print(f"Original: {test_state}, Processed: {processed_test}, Shape: {processed_test.shape}")

# Note: the agent must apply this function internally. Either modify the agent's
# _preprocess_state method or monkey-patch it on the instance; the next cell monkey-patches.

In [None]:
# Agent parameters for Neural mode (might need different tuning)
ALPHA_NEURAL = 0.001 # Learning rate for Adam optimizer (often smaller for NNs)
GAMMA_NEURAL = 0.9 
EPSILON_NEURAL = 1.0 
EPSILON_DECAY_NEURAL = 0.998 # May need slower decay
EPSILON_MIN_NEURAL = 0.05
N_EPISODES_NEURAL = 2000 # May need more episodes for NN to converge

# State dimension after preprocessing
STATE_DIM_GRID = 2 

# --- Initialize agent ---
# Neural mode requires TensorFlow: pip install tensorflow
agent_neural = QLearningAgent(
    alpha=ALPHA_NEURAL, # Used by Adam optimizer if optimizer state not loaded
    gamma=GAMMA_NEURAL, 
    epsilon=EPSILON_NEURAL, 
    mode='neural', 
    state_dim=STATE_DIM_GRID, 
    n_actions=N_ACTIONS,
    data_file="rl_agent/neural_q_data.pkl", # Separate common data file
    model_file="rl_agent/neural_q_model.h5"  # Separate model file
)
print("Neural Agent Initialized")

# !! Important: Agent needs to use the custom preprocessor !!
# Option 1: Modify the class (recommended)
# Option 2: Monkey-patch the instance (quick demo)
# Option 3: Preprocess manually before every call (verbose)

# --- Training loop ---
# Relies on the monkey-patched _preprocess_state applied below
current_epsilon = EPSILON_NEURAL
episode_rewards_neural = []
episode_losses = [] # Track NN loss

print("\n--- Training Neural (DQN) Agent ---")
# Monkey-patch: replaces _preprocess_state ONLY on the 'agent_neural' instance
agent_neural._preprocess_state = lambda state: preprocess_grid_state(state, GRID_SIZE)
print("Agent's _preprocess_state monkey-patched.")

for episode in range(N_EPISODES_NEURAL):
    state_tuple = START_STATE
    done = False
    total_reward = 0
    total_loss = 0
    steps = 0
    
    while not done and steps < 100: 
        # Action choice (relies on patched _preprocess_state)
        action = agent_neural.choose_action(state_tuple) 
        
        # Environment step
        next_state_tuple, reward, done = environment_step(state_tuple, action)
        
        # Update / Train (relies on patched _preprocess_state)
        loss, td_error = agent_neural.update_q_value(
            state_tuple, 
            action, 
            reward, 
            next_state_tuple, 
            use_next_state=not done
        )
        
        if not np.isnan(loss):
             total_loss += loss
            
        state_tuple = next_state_tuple
        total_reward += reward
        steps += 1
        
    episode_rewards_neural.append(total_reward)
    avg_loss = total_loss / steps if steps > 0 else 0
    episode_losses.append(avg_loss)
    
    # Decay epsilon
    current_epsilon = max(EPSILON_MIN_NEURAL, current_epsilon * EPSILON_DECAY_NEURAL)
    agent_neural.epsilon = current_epsilon # Update agent's epsilon
    
    if (episode + 1) % 100 == 0:
        avg_reward = np.mean(episode_rewards_neural[-100:])
        avg_loss_report = np.mean(episode_losses[-100:])
        print(f"Episode {episode + 1}/{N_EPISODES_NEURAL} | Avg Reward: {avg_reward:.2f} | Avg Loss: {avg_loss_report:.4f} | Epsilon: {current_epsilon:.3f}")

print("--- Training Finished ---")

# --- Testing Greedy Policy ---
print("\n--- Testing Greedy Policy (Neural) ---")
state_tuple = START_STATE
steps = 0
path = [state_tuple]
agent_neural.epsilon = 0.0 # Turn off exploration
while state_tuple != GOAL_STATE and state_tuple != OBSTACLE_STATE and steps < 20:
    # Use the patched preprocessor implicitly via choose_action
    action = agent_neural.choose_action(state_tuple) 
    state_tuple, reward, done = environment_step(state_tuple, action)
    path.append(state_tuple)
    steps += 1
    if done:
        break
print("Path:", path)
print("Reached Goal?", state_tuple == GOAL_STATE)
print("Hit Obstacle?", state_tuple == OBSTACLE_STATE)

# --- Clean up ---
import os  # needed here if not already imported earlier in the notebook
data_file_path = "rl_agent/neural_q_data.pkl"
model_file_path = "rl_agent/neural_q_model.h5"
if os.path.exists(data_file_path):
    os.remove(data_file_path)
    print(f"Removed {data_file_path}")
if os.path.exists(model_file_path):
    os.remove(model_file_path)
    print(f"Removed {model_file_path}")

## 6. Conclusion & Next Steps

This notebook demonstrated the `QLearningAgent` which supports both tabular ('simple') and neural network-based ('neural') Q-learning.

*   **Simple Mode:** Effective and interpretable for small, discrete state spaces. Limited by the curse of dimensionality.
*   **Neural Mode (DQN):** Can handle large/continuous state spaces through function approximation and generalize to unseen states. Requires careful state preprocessing and potentially more complex training techniques.

**Key Takeaways:**
*   The choice between modes depends heavily on the nature and size of your problem's state space.
*   State representation and preprocessing are critical, especially for DQN. Features should be numerical, and often scaled or encoded appropriately (e.g., one-hot encoding for categoricals).
*   Hyperparameter tuning (alpha, gamma, epsilon, network architecture, optimizer learning rate) is crucial for good performance.
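
As one small tuning aid, it helps to know how long a multiplicative epsilon schedule takes to reach its floor. The sketch below solves $\epsilon_0 \cdot d^n \le \epsilon_{\min}$ for the smallest integer $n$, using the decay settings from Section 5; the helper name `episodes_to_floor` is hypothetical.

```python
import math

def episodes_to_floor(eps_start, decay, eps_min):
    """Smallest n such that eps_start * decay**n <= eps_min."""
    return math.ceil(math.log(eps_min / eps_start) / math.log(decay))

simple_n = episodes_to_floor(1.0, 0.995, 0.05)   # settings of the simple-mode run
neural_n = episodes_to_floor(1.0, 0.998, 0.05)   # settings of the neural-mode run
print(simple_n, neural_n)
```

With these values the simple run spends roughly 600 of its 1000 episodes decaying, and the neural run roughly 1500 of its 2000, so both finish with a few hundred episodes at the minimum exploration rate.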

**Potential Improvements (especially for DQN):**
*   **Experience Replay:** Store transitions `(s, a, r, s')` in a buffer and sample mini-batches randomly for training. This breaks correlations between consecutive samples and improves stability.
*   **Target Network:** Use a separate, periodically updated copy of the main network to calculate the TD target values. This further stabilizes training by reducing the chasing of a moving target.
*   **Epsilon Decay:** Implement a more sophisticated epsilon decay schedule.
*   **More Complex Environments:** Try applying the agent to more challenging problems.
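
A minimal experience replay buffer, as suggested in the first improvement above, could be sketched as follows. The class name, its methods, and the five-element transition tuple are assumptions for illustration; wiring it into `update_q_value` is left open because the agent currently trains on one transition at a time.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement breaks the ordering of consecutive steps.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add((0, step), 3, -1.0, (0, step + 1), False)

batch = buffer.sample(2)
print(len(buffer), batch)
```

Because the capacity is 3, only the three most recent of the five transitions remain available for sampling.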

## 7. Core Agent Usage: Initialization, Getting Q-Values, and Updating

Let's look at the fundamental operations of the `QLearningAgent`:

1.  **Initialization (`__init__`)**: Creates the agent. Key parameters include:
    *   `mode`: 'simple' or 'neural'.
    *   `state_dim` (required for 'neural'): The number of features in your state representation.
    *   `n_actions`: The total number of possible actions the agent can take.
    *   `alpha`, `gamma`, `epsilon`: Learning rate, discount factor, and exploration rate.
    *   `data_file`, `model_file`: Paths for saving/loading agent state and the neural model.

2.  **Getting Q-Values (`get_q_values`)**: Given a state, this method returns the predicted Q-values for all possible actions in that state.
    *   In 'simple' mode, it looks up the state in the Q-table (or returns defaults if the state is new).
    *   In 'neural' mode, it feeds the state representation into the neural network and gets the output layer's activations (representing Q-values for each action).

3.  **Updating Q-Values (`update_q_value`)**: This is the core learning step. It takes the experience tuple `(state, action, reward, next_state)` and updates the agent's knowledge.
    *   In 'simple' mode, it applies the Q-learning update rule directly to the Q-table entry for `(state, action)`.
    *   In 'neural' mode, it calculates the target Q-value using the reward and the maximum predicted Q-value for the `next_state`, then performs a gradient descent step (using the optimizer like Adam) to adjust the network's weights to minimize the difference (Temporal Difference error) between the predicted Q-value for `(state, action)` and the target value.

In [None]:
# Example demonstrating core agent usage (Neural Mode)
import numpy as np

# Assuming qlearn_agent.py is in the parent directory or accessible via PYTHONPATH
# If running directly from the directory containing reinforcement_learning, adjust path:
# sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
from reinforcement_learning import QLearningAgent # Make sure this import works in your environment

# --- Parameters (matching RlMetaRag context) ---
STATE_DIM = 10 # Example: 9 features + query length
N_ACTIONS = 3  # Example: 3 different RAG techniques
ALPHA = 0.01   # Learning rate for the optimizer
GAMMA = 0.95   # Discount factor
EPSILON = 0.1  # Exploration rate (less relevant for direct update demo)
DATA_FILE = "./temp_neural_data.pkl"
MODEL_FILE = "./temp_neural_model.h5"

# --- Sample Data ---
# Represent state as a numpy array (e.g., extracted query features)
sample_state = np.array([0, 1, 1, 0.5, 0.2, 15, 0.8, 0.1, 0.0, 15], dtype=np.float32)
sample_action = 1 # Agent chose the second RAG technique (index 1)
sample_reward = 0.8 # Received a positive reward (e.g., user rated the result as good)
# Next state could be the features of a subsequent query, or None if terminal
# For this example, let's assume a non-terminal step with a new state
sample_next_state = np.array([1, 2, 0, 0.8, 0.1, 25, 0.9, 0.5, 0.2, 25], dtype=np.float32)

# --- Agent Initialization ---
print(f"Initializing agent in neural mode (State Dim: {STATE_DIM}, Actions: {N_ACTIONS})")
agent = QLearningAgent(
    mode='neural',
    state_dim=STATE_DIM,
    n_actions=N_ACTIONS,
    alpha=ALPHA,
    gamma=GAMMA,
    epsilon=EPSILON,
    data_file=DATA_FILE,
    model_file=MODEL_FILE
)

# --- Get Initial Q-Values ---
initial_q_values = agent.get_q_values(sample_state)
print("Initial Q-values:", initial_q_values)

# --- Update Q-Value based on experience ---
agent.update_q_value(sample_state, sample_action, sample_reward, sample_next_state)

# --- Get Q-Values After Update ---
# Note: A single update might only slightly change the NN weights.
# Significant changes require more training steps/epochs.
updated_q_values = agent.get_q_values(sample_state)
print("Updated Q-values:", updated_q_values)

