<a href="https://colab.research.google.com/github/vigneshpalanivelr/MeachineLearningAI/blob/master/QN-DQN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Q-Learning Variables

## Learning Rate (lr) - new information vs. old knowledge

1.   **What it controls:** How much we update our Q-values with each new experience.
2.   **Think of it as:** How much do you trust new information vs. old knowledge?

**Examples:**

1.   **lr = 0.01:** "I'll barely change my belief, I trust my old knowledge more"
2.   **lr = 0.1:** "I'll update my belief by 10% based on this new experience"
3.   **lr = 1.0:** "I'll completely replace my old belief with this new experience"

In [None]:
# Current Q-value for "go right" = 5.0
# New experience suggests it should be 8.0
# Target = 8.0, Current = 5.0, Difference = 3.0
current_q, target = 5.0, 8.0

# With lr = 0.1: 5.0 + 0.1 * (8.0 - 5.0) = 5.0 + 0.3 = 5.3
new_q_value = current_q + 0.1 * (target - current_q)

# With lr = 0.5: 5.0 + 0.5 * (8.0 - 5.0) = 5.0 + 1.5 = 6.5
new_q_value = current_q + 0.5 * (target - current_q)

# With lr = 1.0: 5.0 + 1.0 * (8.0 - 5.0) = 5.0 + 3.0 = 8.0
new_q_value = current_q + 1.0 * (target - current_q)
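
The three one-step updates above each start from the same Q-value. A minimal sketch of what happens when the same update is applied several times in a row, assuming the target stays fixed at 8.0 (in real Q-learning the target itself moves as the table changes). The helper name `repeated_updates` is hypothetical. Each step closes a fraction `lr` of the remaining gap, so the gap shrinks by a factor of `1 - lr` per update.

```python
def repeated_updates(q, target, lr, steps):
    """Apply q <- q + lr * (target - q) `steps` times; return all values seen."""
    history = [q]
    for _ in range(steps):
        q = q + lr * (target - q)
        history.append(q)
    return history


for lr in (0.1, 0.5, 1.0):
    values = repeated_updates(q=5.0, target=8.0, lr=lr, steps=5)
    print(f"lr={lr}: " + ", ".join(f"{v:.3f}" for v in values))
```

A small `lr` approaches the target slowly but smooths out noisy experiences; `lr = 1.0` jumps straight to the latest target and forgets everything before it.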

## Discount Factor (gamma) - long-term vs. short-term gains



1.   **What it controls:** How much we value future rewards compared to immediate rewards.
2.   **Think of it as:** How much do you care about long-term vs. short-term gains?
3.   **Real-world analogy:** Would you rather have \$10 now or \$100 next year? Gamma represents your **"patience level."**

**Examples:**
1.   **gamma = 0.95:** "A reward one step in the future is worth 95% of the same reward now"
2.   **gamma = 0.0:** "I only care about immediate rewards" (myopic)
3.   **gamma = 1.0:** "Future rewards are just as valuable as immediate rewards"


In [None]:
# Immediate reward = 10
# Expected future reward = 100
reward, future_value = 10, 100

# With gamma = 0.0: 10 + 0.0 * 100 = 10
target = reward + 0.0 * future_value  # Only immediate reward matters

# With gamma = 0.5: 10 + 0.5 * 100 = 60
target = reward + 0.5 * future_value  # Future reward discounted by 50%

# With gamma = 0.95: 10 + 0.95 * 100 = 105
target = reward + 0.95 * future_value  # Future reward almost fully valued
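
The targets above look only one step ahead. As a rough sketch of how gamma compounds over a longer horizon, the snippet below sums a whole reward sequence, weighting the reward at step `k` by `gamma ** k`. The helper `discounted_return` and the reward sequence (three step costs of -1 followed by a goal reward of 10) are illustrative assumptions, not part of the notebook.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over the whole sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


rewards = [-1, -1, -1, 10]  # hypothetical episode: three step costs, then the goal
for gamma in (0.0, 0.5, 0.95, 1.0):
    print(f"gamma={gamma}: return = {discounted_return(rewards, gamma):.3f}")
```

With a low gamma the distant goal reward is discounted so heavily that the step costs dominate; with gamma near 1 the goal reward clearly outweighs them.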

## Epsilon (epsilon) - exploration vs. exploitation

1.   **What it controls:** The exploration vs. exploitation trade-off during action selection.
2.   **Think of it as:** How often should I try something new vs. stick with what I know works?

**Examples:**
1.   **epsilon = 0.1:** "90% of the time, choose the best action; 10% of the time, explore randomly"
2.   **epsilon = 0.5:** "50% exploration, 50% exploitation"
3.   **epsilon = 0.0:** "Always choose the best known action" (pure exploitation)

In [3]:
# Current Q-values: [2.1, 8.5, 1.2, 0.9] for actions [Up, Right, Down, Left]
# Best action is "Right" (index 1) with Q-value 8.5

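# With epsilon = 0.1:
# 90% chance: Choose "Right"
# 10% chance: Choose random action (Up, Right, Down, or Left)
# The random pick can also land on "Right", so overall
# "Right" is chosen 0.9 + 0.1 / 4 = 92.5% of the time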

# With epsilon = 0.0:
# 100% chance: Always choose "Right"
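
To make these percentages concrete, here is a small simulation sketch. The function `epsilon_greedy` is a hypothetical helper written for this illustration; it uses only the standard-library `random` module and the Q-values from the cell above. Because the exploratory pick is uniform over all four actions, the best action is selected with probability `1 - epsilon + epsilon / 4`.

```python
import random
from collections import Counter


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random index with probability epsilon, else the first best index."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


rng = random.Random(0)  # seeded so repeated runs give the same counts
q_values = [2.1, 8.5, 1.2, 0.9]
actions = ["Up", "Right", "Down", "Left"]

trials = 10_000
counts = Counter(epsilon_greedy(q_values, 0.1, rng) for _ in range(trials))
for index, name in enumerate(actions):
    print(f"{name:>5}: {counts[index] / trials:.1%}")
```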

In [4]:
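import random
import numpy as np

def train_agent(env, q_table, lr=0.1, gamma=0.95, epsilon=0.1):
    """Run one episode of Q-learning.

    Assumes env.reset() returns a state index, env.step(action) returns
    (next_state, reward, done), and q_table is a NumPy array of shape
    (num_states, num_actions).
    """
    state = env.reset()
    done = False
    num_actions = q_table.shape[1]

    while not done:
        # Epsilon controls exploration
        if random.random() < epsilon:
            action = random.randrange(num_actions)  # Explore
        else:
            action = int(np.argmax(q_table[state]))  # Exploit

        next_state, reward, done = env.step(action)

        # Gamma affects how we value future rewards
        if done:
            target = reward
        else:
            target = reward + gamma * np.max(q_table[next_state])

        # Learning rate controls how much we update
        q_table[state, action] += lr * (target - q_table[state, action])

        state = next_state

    return q_table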
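
The loop above relies on an environment object that the notebook never defines at this point. The sketch below is one way to make it runnable end to end. `CorridorEnv` and `run_episode` are hypothetical names introduced here, and the reward scheme (+10 at the goal, -1 per step, -5 for bumping the left wall) is borrowed from the 5-state example used later in this notebook. A step cap is added so that an untrained agent cannot loop forever.

```python
import random

import numpy as np


class CorridorEnv:
    """Hypothetical 5-state corridor: start at state 0, goal at the rightmost state."""

    def __init__(self, num_states=5):
        self.num_states = num_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right; the walls clamp the move.
        bumped = action == 0 and self.state == 0
        if action == 0:
            self.state = max(0, self.state - 1)
        else:
            self.state = min(self.num_states - 1, self.state + 1)
        done = self.state == self.num_states - 1
        reward = 10 if done else (-5 if bumped else -1)
        return self.state, reward, done


def run_episode(env, q_table, lr=0.1, gamma=0.95, epsilon=0.1, max_steps=100, rng=random):
    """One episode of tabular Q-learning with epsilon-greedy action selection."""
    state = env.reset()
    for _ in range(max_steps):
        if rng.random() < epsilon:
            action = rng.randrange(q_table.shape[1])  # Explore
        else:
            action = int(np.argmax(q_table[state]))  # Exploit
        next_state, reward, done = env.step(action)
        target = reward if done else reward + gamma * np.max(q_table[next_state])
        q_table[state, action] += lr * (target - q_table[state, action])
        state = next_state
        if done:
            break
    return q_table


rng = random.Random(0)
env = CorridorEnv()
q_table = np.zeros((5, 2))
for _ in range(500):
    run_episode(env, q_table, rng=rng)

# Greedy action per non-terminal state (0 = left, 1 = right)
print(np.argmax(q_table[:4], axis=1))
print(np.round(q_table, 2))
```

The row for the goal state stays at zero because the episode ends there and no action is ever taken from it.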

## np.argmax()

In [16]:
import numpy as np

def demonstrate_argmax():
    """Demonstrate how np.argmax works in different scenarios"""

    print("="*60)
    print("🔍 UNDERSTANDING np.argmax()")
    print("="*60)

    # Example 1: Basic usage
    print("\n📍 Example 1: Basic Array")
    print("-"*30)
    values = [1.2, 3.8, 2.1, 0.5]
    max_index = np.argmax(values)
    max_value = values[max_index]

    print(f"Array:           {values}")
    print(f"Indices:         [0, 1, 2, 3]")
    print(f"np.argmax():     {max_index}")
    print(f"Max value:       {max_value}")
    print(f"Explanation:     Index {max_index} contains the largest value ({max_value})")

    # Example 2: Q-learning context
    print("\n📍 Example 2: Q-Learning Actions")
    print("-"*30)
    q_values = np.array([2.1, -1.0, 0.5, 3.5])
    actions = ["Up", "Down", "Left", "Right"]
    best_action_index = np.argmax(q_values)
    best_action_name = actions[best_action_index]
    best_q_value = q_values[best_action_index]

    print("Q-values by action:")
    for i, (action, q_val) in enumerate(zip(actions, q_values)):
        marker = " 👈 BEST!" if i == best_action_index else ""
        print(f"  Index {i}: {action:>5} = {q_val:5.1f}{marker}")

    print(f"\nnp.argmax(q_values): {best_action_index}")
    print(f"Best action:         {best_action_name}")
    print(f"Best Q-value:        {best_q_value}")

    # Example 3: Edge cases
    print("\n📍 Example 3: Edge Cases")
    print("-"*30)

    # All same values
    same_values = [2.0, 2.0, 2.0, 2.0]
    print(f"All same values: {same_values}")
    print(f"np.argmax():     {np.argmax(same_values)} (returns first occurrence)")

    # Negative values
    negative_values = [-5.0, -1.0, -3.0, -2.0]
    print(f"All negative:    {negative_values}")
    print(f"np.argmax():     {np.argmax(negative_values)} (index of least negative)")

    # Single value
    single_value = [42.0]
    print(f"Single value:    {single_value}")
    print(f"np.argmax():     {np.argmax(single_value)} (only option)")

    # Example 4: Compare with related functions
    print("\n📍 Example 4: Related Functions Comparison")
    print("-"*30)
    test_array = [1.5, 4.2, 2.8, 0.9]

    print(f"Array:               {test_array}")
    print(f"np.argmax():         {np.argmax(test_array)} (index of max)")
    print(f"np.max():            {np.max(test_array)} (actual max value)")
    print(f"np.argmin():         {np.argmin(test_array)} (index of min)")
    print(f"np.min():            {np.min(test_array)} (actual min value)")

def q_learning_action_selection_demo():
    """Show how argmax is used in Q-learning action selection"""

    print("\n" + "="*60)
    print("🎮 Q-LEARNING ACTION SELECTION WITH ARGMAX")
    print("="*60)

    # Simulate different states with different Q-values
    states_info = [
        {"state": 0, "q_values": [0.1, 0.3, 0.2, 0.8], "description": "Clear best choice"},
        {"state": 1, "q_values": [2.1, 2.1, 1.5, 2.1], "description": "Tie between actions"},
        {"state": 2, "q_values": [-1.0, -0.5, -2.0, -0.3], "description": "All negative Q-values"},
        {"state": 3, "q_values": [0.0, 0.0, 0.0, 0.0], "description": "All zeros (untrained)"}
    ]

    actions = ["Up", "Down", "Left", "Right"]

    for state_info in states_info:
        state = state_info["state"]
        q_values = np.array(state_info["q_values"])
        description = state_info["description"]

        print(f"\n🏠 State {state}: {description}")
        print("-" * 40)

        # Show Q-values
        print("Q-values:")
        for i, (action, q_val) in enumerate(zip(actions, q_values)):
            print(f"  Action {i} ({action:>5}): {q_val:5.1f}")

        # Apply argmax
        best_action_idx = np.argmax(q_values)
        best_action_name = actions[best_action_idx]
        best_q_value = q_values[best_action_idx]

        print(f"\nAction Selection:")
        print(f"  np.argmax(q_values) = {best_action_idx}")
        print(f"  → Choose: Action {best_action_idx} ({best_action_name})")
        print(f"  → Q-value: {best_q_value}")

        # Show what happens with ties
        tied_indices = np.flatnonzero(q_values == best_q_value)
        if len(tied_indices) > 1:
            print(f"  ⚠️  Note: Tied with actions {tied_indices.tolist()} - argmax picks first")

def manual_vs_argmax_comparison():
    """Compare manual max finding vs np.argmax"""

    print("\n" + "="*60)
    print("🔧 MANUAL vs np.argmax() COMPARISON")
    print("="*60)

    q_values = [1.2, 3.8, 2.1, 0.5]

    print(f"Q-values: {q_values}")
    print()

    # Manual way (a linear scan that gives the same result as argmax, keeping the first maximum)
    print("🔨 Manual Method:")
    max_value = q_values[0]
    max_index = 0

    for i in range(len(q_values)):
        print(f"  Step {i+1}: Check index {i}, value = {q_values[i]}")
        if q_values[i] > max_value:
            max_value = q_values[i]
            max_index = i
            print(f"           New maximum! Update max_index to {i}")
        else:
            print(f"           Not larger than current max ({max_value})")

    print(f"  Final result: max_index = {max_index}, max_value = {max_value}")

    # Using argmax
    print(f"\n⚡ Using np.argmax():")
    argmax_result = np.argmax(q_values)
    print(f"  np.argmax(q_values) = {argmax_result}")
    print(f"  Same result: {max_index == argmax_result} ✅")

if __name__ == "__main__":
    demonstrate_argmax()
    q_learning_action_selection_demo()
    manual_vs_argmax_comparison()

    print("\n" + "="*60)
    print("📝 SUMMARY")
    print("="*60)
    print("• np.argmax(array) returns the INDEX of the largest value")
    print("• In Q-learning: argmax selects the action with highest Q-value")
    print("• If multiple values tie for max, argmax returns the first index")
    print("• This is how agents choose the 'best' action (exploitation)")
    print("="*60)

🔍 UNDERSTANDING np.argmax()

📍 Example 1: Basic Array
------------------------------
Array:           [1.2, 3.8, 2.1, 0.5]
Indices:         [0, 1, 2, 3]
np.argmax():     1
Max value:       3.8
Explanation:     Index 1 contains the largest value (3.8)

📍 Example 2: Q-Learning Actions
------------------------------
Q-values by action:
  Index 0:    Up =   2.1
  Index 1:  Down =  -1.0
  Index 2:  Left =   0.5
  Index 3: Right =   3.5 👈 BEST!

np.argmax(q_values): 3
Best action:         Right
Best Q-value:        3.5

📍 Example 3: Edge Cases
------------------------------
All same values: [2.0, 2.0, 2.0, 2.0]
np.argmax():     0 (returns first occurrence)
All negative:    [-5.0, -1.0, -3.0, -2.0]
np.argmax():     1 (index of least negative)
Single value:    [42.0]
np.argmax():     0 (only option)

📍 Example 4: Related Functions Comparison
------------------------------
Array:               [1.5, 4.2, 2.8, 0.9]
np.argmax():         1 (index of max)
np.max():            4.2 (actual max value)
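
Because `np.argmax` always returns the first maximum, an untrained Q-table of all zeros sends every greedy choice to action 0. One common workaround, sketched here as an optional variation rather than something this notebook uses, is to break ties at random. The helper name `argmax_random_tie` is hypothetical; it relies only on `np.flatnonzero` and a NumPy `Generator`.

```python
import numpy as np


def argmax_random_tie(values, rng):
    """Return the index of the maximum, choosing uniformly among tied maxima."""
    values = np.asarray(values)
    best_indices = np.flatnonzero(values == values.max())
    return int(rng.choice(best_indices))


rng = np.random.default_rng(0)
q_values = [2.1, 2.1, 1.5, 2.1]

picks = [argmax_random_tie(q_values, rng) for _ in range(1000)]
print("np.argmax always picks:", int(np.argmax(q_values)))
print("random tie-break picks from:", sorted(set(picks)))
```

When there is a single maximum, the helper behaves like `np.argmax`.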

# QLearningAgent (Bare)

In [11]:
import numpy as np
import random

class QLearningAgent:
    def __init__(self, num_states, num_actions, lr=0.1, gamma=0.95, epsilon=0.1):
        # Initialize Q-table with zeros
        self.q_table = np.zeros((num_states, num_actions))
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon

    def get_action(self, state):
        # Epsilon-greedy action selection
        if random.random() < self.epsilon:
            return random.randint(0, len(self.q_table[state]) - 1)
        else:
            return np.argmax(self.q_table[state])  # Direct table lookup

    def update(self, state, action, reward, next_state, done):
        # Direct Q-table update using Bellman equation
        if done:
            target = reward
        else:
            target = reward + self.gamma * np.max(self.q_table[next_state])

        # Update specific table entry
        self.q_table[state, action] += self.lr * (target - self.q_table[state, action])

# Usage
agent = QLearningAgent(num_states=100, num_actions=4)
state = 5
action = agent.get_action(state)  # Epsilon-greedy: usually argmax of q_table[5], random with probability epsilon
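
To see what the `update` rule does over time, the sketch below replays the same four transitions (always moving right through a 5-state corridor) several times on a fresh table. The function `q_update` is a standalone, hypothetical copy of the update logic above, and the transitions and rewards (-1 per step, +10 at the goal) are assumed for illustration. The goal reward reaches the table only at the last state in the first pass, then spreads backward roughly one state per pass.

```python
import numpy as np


def q_update(q_table, state, action, reward, next_state, done, lr=0.1, gamma=0.95):
    """Same rule as QLearningAgent.update, written as a plain function."""
    target = reward if done else reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += lr * (target - q_table[state, action])


# (state, action, reward, next_state, done) along the path 0 -> 1 -> 2 -> 3 -> 4
path = [
    (0, 1, -1, 1, False),
    (1, 1, -1, 2, False),
    (2, 1, -1, 3, False),
    (3, 1, 10, 4, True),
]

q_table = np.zeros((5, 2))
for sweep in range(1, 4):
    for transition in path:
        q_update(q_table, *transition)
    print(f"after pass {sweep}: Q(s, right) = {np.round(q_table[:4, 1], 3)}")
```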

# QLearningAgent (Verbose)

In [12]:
import numpy as np
import random

class QLearningAgent:
    def __init__(self, num_states, num_actions, lr=0.1, gamma=0.95, epsilon=0.1):
        # Initialize Q-table with zeros
        self.q_table = np.zeros((num_states, num_actions))
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon
        self.num_states = num_states
        self.num_actions = num_actions

        # Logging counters
        self.exploration_count = 0
        self.exploitation_count = 0
        self.update_count = 0

        print(f"🤖 Q-Learning Agent Initialized:")
        print(f"   📊 Q-table shape: {self.q_table.shape}")
        print(f"   📈 Learning rate (lr): {self.lr}")
        print(f"   💰 Discount factor (gamma): {self.gamma}")
        print(f"   🎯 Exploration rate (epsilon): {self.epsilon}")
        print(f"   🧠 Initial Q-table (all zeros):")
        print(f"{self.q_table}\n")

    def get_action(self, state, verbose=True):
        """Get action using epsilon-greedy strategy with detailed logging"""

        # Get current Q-values for this state
        current_q_values = self.q_table[state]
        best_action = np.argmax(current_q_values)
        best_q_value = current_q_values[best_action]

        if verbose:
            print(f"🎮 Action Selection for State {state}:")
            print(f"   📋 Current Q-values: {current_q_values}")
            print(f"   ⭐ Best action would be: {best_action} (Q-value: {best_q_value:.3f})")

        # Epsilon-greedy decision
        random_prob = random.random()
        if random_prob < self.epsilon:
            # Exploration: choose random action
            action = random.randint(0, self.num_actions - 1)
            self.exploration_count += 1
            if verbose:
                print(f"   🎲 EXPLORING! Random prob {random_prob:.3f} < epsilon {self.epsilon}")
                print(f"   ➡️  Chose random action: {action}")
                print(f"   📊 Exploration count: {self.exploration_count}")
        else:
            # Exploitation: choose best known action
            action = best_action
            self.exploitation_count += 1
            if verbose:
                print(f"   🎯 EXPLOITING! Random prob {random_prob:.3f} >= epsilon {self.epsilon}")
                print(f"   ➡️  Chose best action: {action}")
                print(f"   📊 Exploitation count: {self.exploitation_count}")

        if verbose:
            print(f"   🔄 Explore/Exploit ratio: {self.exploration_count}/{self.exploitation_count}\n")

        return action

    def update(self, state, action, reward, next_state, done, verbose=True):
        """Update Q-table with detailed logging of the learning process"""

        self.update_count += 1

        if verbose:
            print(f"📚 Q-Learning Update #{self.update_count}:")
            print(f"   🏁 State: {state} → Action: {action} → Reward: {reward} → Next State: {next_state}")
            print(f"   ⚡ Episode done: {done}")

        # Store old Q-value for comparison
        old_q_value = self.q_table[state, action]

        # Calculate target Q-value
        if done:
            target = reward
            if verbose:
                print(f"   🎯 Target calculation (episode ended):")
                print(f"      Target = reward = {reward}")
        else:
            next_q_values = self.q_table[next_state]
            max_next_q = np.max(next_q_values)
            best_next_action = np.argmax(next_q_values)
            target = reward + self.gamma * max_next_q

            if verbose:
                print(f"   🎯 Target calculation (episode continues):")
                print(f"      Next state Q-values: {next_q_values}")
                print(f"      Best next action: {best_next_action} (Q-value: {max_next_q:.3f})")
                print(f"      Target = reward + gamma * max_next_Q")
                print(f"      Target = {reward} + {self.gamma} * {max_next_q:.3f} = {target:.3f}")

        # Calculate temporal difference error
        td_error = target - old_q_value

        # Update Q-value using Q-learning formula
        new_q_value = old_q_value + self.lr * td_error
        self.q_table[state, action] = new_q_value

        if verbose:
            print(f"   🔄 Q-value Update:")
            print(f"      Old Q-value: {old_q_value:.3f}")
            print(f"      TD Error: target - old = {target:.3f} - {old_q_value:.3f} = {td_error:.3f}")
            print(f"      Learning step: lr * TD_error = {self.lr} * {td_error:.3f} = {self.lr * td_error:.3f}")
            print(f"      New Q-value: old + learning_step = {old_q_value:.3f} + {self.lr * td_error:.3f} = {new_q_value:.3f}")
            print(f"      📈 Change: {new_q_value - old_q_value:+.3f}")

        return td_error

    def print_q_table(self, title="Current Q-Table"):
        """Print the current Q-table in a readable format"""
        print(f"\n📊 {title}:")
        print("State\\Action", end="")
        for a in range(self.num_actions):
            print(f"     Action{a}", end="")
        print()
        print("-" * (12 + 12 * self.num_actions))

        for s in range(self.num_states):
            print(f"State {s:2d}   ", end="")
            for a in range(self.num_actions):
                print(f"{self.q_table[s, a]:8.3f}    ", end="")
            print()
        print()

    def print_policy(self):
        """Print the current policy (best action for each state)"""
        print("🎯 Current Policy (Best Action per State):")
        for s in range(self.num_states):
            best_action = np.argmax(self.q_table[s])
            best_q_value = self.q_table[s, best_action]
            print(f"   State {s}: Action {best_action} (Q-value: {best_q_value:.3f})")
        print()


def simple_environment_step(state, action, num_states=5):
    """
    Simple environment for demonstration:
    - States: 0, 1, 2, 3, 4
    - Actions: 0 (left), 1 (right)
    - Goal: Reach state 4 (rightmost)
    - Reward: +10 for reaching goal, -1 for each step, -5 for going out of bounds
    """

    if action == 0:  # Move left
        next_state = max(0, state - 1)
    else:  # Move right (action == 1)
        next_state = min(num_states - 1, state + 1)

    # Calculate reward
    if next_state == num_states - 1:  # Reached goal
        reward = 10
        done = True
    elif (action == 0 and state == 0) or (action == 1 and state == num_states - 1):
        # Tried to go out of bounds
        reward = -5
        done = False
    else:
        reward = -1  # Normal step cost
        done = False

    return next_state, reward, done


def train_agent(episodes=5, max_steps_per_episode=10, verbose=True):
    """Train the Q-learning agent with detailed logging"""

    # Initialize agent
    agent = QLearningAgent(num_states=5, num_actions=2, lr=0.1, gamma=0.9, epsilon=0.3)

    print("🚀 Starting Training!\n")
    print("=" * 80)

    total_rewards = []

    for episode in range(episodes):
        print(f"\n🎬 EPISODE {episode + 1}/{episodes}")
        print("=" * 40)

        # Reset environment
        state = 0  # Always start at leftmost state
        episode_reward = 0
        step_count = 0

        print(f"🏁 Starting at state {state}")

        for step in range(max_steps_per_episode):
            step_count += 1
            print(f"\n📍 Step {step_count}:")
            print("-" * 20)

            # Get action
            action = agent.get_action(state, verbose=verbose)

            # Take action in environment
            next_state, reward, done = simple_environment_step(state, action)
            episode_reward += reward

            if verbose:
                print(f"🌍 Environment Response:")
                print(f"   🎬 Action taken: {action} ({'left' if action == 0 else 'right'})")
                print(f"   📍 State transition: {state} → {next_state}")
                print(f"   💰 Reward received: {reward}")
                print(f"   📊 Episode reward so far: {episode_reward}")

            # Update Q-table
            td_error = agent.update(state, action, reward, next_state, done, verbose=verbose)

            # Move to next state
            state = next_state

            if done:
                if verbose:
                    print(f"✅ Episode completed! Goal reached in {step_count} steps!")
                break

            if step_count >= max_steps_per_episode:
                if verbose:
                    print(f"⏰ Episode ended: Maximum steps ({max_steps_per_episode}) reached")
                break

        total_rewards.append(episode_reward)

        print(f"\n📈 Episode {episode + 1} Summary:")
        print(f"   Total Reward: {episode_reward}")
        print(f"   Steps Taken: {step_count}")
        print(f"   Goal Reached: {'Yes' if done else 'No'}")

        # Print Q-table after each episode
        agent.print_q_table(f"Q-Table after Episode {episode + 1}")
        agent.print_policy()

        print("=" * 80)

    print(f"\n🎉 Training Complete!")
    print(f"📊 Total Episodes: {episodes}")
    print(f"📈 Rewards per Episode: {total_rewards}")
    print(f"🎯 Average Reward: {np.mean(total_rewards):.2f}")
    print(f"🔍 Exploration vs Exploitation: {agent.exploration_count} vs {agent.exploitation_count}")

    agent.print_q_table("Final Q-Table")
    agent.print_policy()

    return agent

# Example usage
if __name__ == "__main__":
    # Train with detailed logging
    trained_agent = train_agent(episodes=3, max_steps_per_episode=8, verbose=True)

    print("\n" + "="*80)
    print("🧪 Testing the trained agent (no more learning, just exploitation):")
    print("="*80)

    # Test the trained agent
    trained_agent.epsilon = 0.0  # No more exploration, pure exploitation
    state = 0
    step = 0

    print(f"🏁 Starting test at state {state}")

    while step < 10:
        step += 1
        action = trained_agent.get_action(state, verbose=True)
        next_state, reward, done = simple_environment_step(state, action)
        print(f"   🎬 Took action {action} → moved to state {next_state}, got reward {reward}")
        state = next_state

        if done:
            print(f"✅ Goal reached in {step} steps!")
            break

    if not done:
        print(f"❌ Failed to reach goal in {step} steps")

🤖 Q-Learning Agent Initialized:
   📊 Q-table shape: (5, 2)
   📈 Learning rate (lr): 0.1
   💰 Discount factor (gamma): 0.9
   🎯 Exploration rate (epsilon): 0.3
   🧠 Initial Q-table (all zeros):
      [[0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]]

🚀 Starting Training!


🎬 EPISODE 1/3
🏁 Starting at state 0

📍 Step 1:
--------------------
🎮 Action Selection for State 0:
   📋 Current Q-values: [0. 0.]
   ⭐ Best action would be: 0 (Q-value: 0.000)
   🎯 EXPLOITING! Random prob 0.311 >= epsilon 0.3
   ➡️  Chose best action: 0
   📊 Exploitation count: 1
   🔄 Explore/Exploit ratio: 0/1

🌍 Environment Response:
   🎬 Action taken: 0 (left)
   📍 State transition: 0 → 0
   💰 Reward received: -5
   📊 Episode reward so far: -5
📚 Q-Learning Update #1:
   🏁 State: 0 → Action: 0 → Reward: -5 → Next State: 0
   ⚡ Episode done: False
   🎯 Target calculation (episode continues):
      Next state Q-values: [0. 0.]
      Best next action: 0 (Q-value: 0.000)
      Target = reward + gamma * max_next_Q
      Tar

# QLearningAgent (Table Verbose)

In [14]:
import numpy as np
import random
from datetime import datetime

class QLearningAgent:
    def __init__(self, num_states, num_actions, lr=0.1, gamma=0.95, epsilon=0.1):
        # Initialize Q-table with zeros
        self.q_table = np.zeros((num_states, num_actions))
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon
        self.num_states = num_states
        self.num_actions = num_actions

        # Logging counters
        self.exploration_count = 0
        self.exploitation_count = 0
        self.update_count = 0

        self._print_initialization()

    def _print_initialization(self):
        """Print agent initialization in organized format"""
        print("\n" + "="*60)
        print("🤖 Q-LEARNING AGENT INITIALIZATION")
        print("="*60)
        print(f"│ Q-table Shape       │ {self.q_table.shape}")
        print(f"│ Learning Rate (α)   │ {self.lr}")
        print(f"│ Discount Factor (γ) │ {self.gamma}")
        print(f"│ Exploration Rate (ε)│ {self.epsilon}")
        print("─"*60)
        print("Initial Q-Table (all zeros):")
        self._print_q_table_compact()
        print("="*60 + "\n")

    def get_action(self, state, verbose=True):
        """Get action using epsilon-greedy strategy"""
        current_q_values = self.q_table[state]
        best_action = np.argmax(current_q_values)
        best_q_value = current_q_values[best_action]

        if verbose:
            print(f"┌─ ACTION SELECTION (State {state}) " + "─"*25)
            print(f"│ Q-values: {self._format_array(current_q_values)}")
            print(f"│ Best:     Action {best_action} (Q={best_q_value:.3f})")

        # Epsilon-greedy decision
        random_prob = random.random()
        if random_prob < self.epsilon:
            action = random.randint(0, self.num_actions - 1)
            self.exploration_count += 1
            decision_type = "EXPLORE 🎲"
            if verbose:
                print(f"│ Decision: {decision_type} ({random_prob:.3f} < {self.epsilon})")
                print(f"│ Chosen:   Action {action} (random)")
        # Exploit decision
        else:
            action = best_action
            self.exploitation_count += 1
            decision_type = "EXPLOIT 🎯"
            if verbose:
                print(f"│ Decision: {decision_type} ({random_prob:.3f} ≥ {self.epsilon})")
                print(f"│ Chosen:   Action {action} (greedy)")

        if verbose:
            total_decisions = self.exploration_count + self.exploitation_count
            explore_pct = (self.exploration_count / total_decisions * 100) if total_decisions > 0 else 0
            print(f"│ Stats:    Explore {self.exploration_count}/{total_decisions} ({explore_pct:.1f}%)")
            print("└" + "─"*50)

        return action

    def update(self, state, action, reward, next_state, done, verbose=True):
        """Update Q-table with organized logging"""
        self.update_count += 1
        old_q_value = self.q_table[state, action]

        if verbose:
            print(f"┌─ Q-UPDATE #{self.update_count} " + "─"*35)
            print(f"│ Transition: S{state} --A{action}--> S{next_state} (R={reward:+.1f})")
            print(f"│ Done:       {done}")

        # Calculate target
        if done:
            target = reward
            if verbose:
                print(f"│ Target:     {target:.3f} (episode ended, no future)")
        else:
            next_q_values = self.q_table[next_state]
            max_next_q = np.max(next_q_values)
            target = reward + self.gamma * max_next_q
            if verbose:
                print(f"│ Next Q's:   {self._format_array(next_q_values)}")
                print(f"│ Target:     {reward} + {self.gamma}×{max_next_q:.3f} = {target:.3f}")

        # Update Q-value
        td_error = target - old_q_value
        new_q_value = old_q_value + self.lr * td_error
        self.q_table[state, action] = new_q_value

        if verbose:
            print(f"│ Q-Update:   {old_q_value:.3f} + {self.lr}×{td_error:.3f} = {new_q_value:.3f}")
            print(f"│ Change:     {new_q_value - old_q_value:+.3f}")
            print("└" + "─"*50)

        return td_error

    def _format_array(self, arr, decimals=2):
        """Format numpy array for clean display"""
        return "[" + ", ".join([f"{x:.{decimals}f}" for x in arr]) + "]"

    def _print_q_table_compact(self):
        """Print Q-table in compact format"""
        print("     ", end="")
        for a in range(self.num_actions):
            print(f"   A{a}  ", end="")
        print()

        for s in range(self.num_states):
            print(f"S{s} │ ", end="")
            for a in range(self.num_actions):
                print(f"{self.q_table[s, a]:5.2f}", end=" ")
            print()

    def print_q_table(self, title="Q-TABLE"):
        """Print Q-table with header"""
        print(f"\n┌─ {title} " + "─"*(50-len(title)))
        self._print_q_table_compact()
        print("└" + "─"*50)

    def print_policy(self):
        """Print current policy"""
        print("\n┌─ CURRENT POLICY " + "─"*33)
        for s in range(self.num_states):
            best_action = np.argmax(self.q_table[s])
            best_q_value = self.q_table[s, best_action]
            print(f"│ State {s}: Action {best_action} (Q={best_q_value:.3f})")
        print("└" + "─"*50)


class TrainingLogger:
    """Separate class to handle training-level logging"""

    def __init__(self):
        self.episode_data = []
        self.start_time = datetime.now()

    def log_episode_start(self, episode, total_episodes):
        """Log episode start"""
        print(f"\n{'='*60}")
        print(f"🎬 EPISODE {episode + 1:2d}/{total_episodes} - {datetime.now().strftime('%H:%M:%S')}")
        print(f"{'='*60}")

    def log_step_start(self, step, state):
        """Log step start"""
        print(f"\n📍 STEP {step:2d} - Current State: {state}")
        print("─"*30)

    def log_environment_response(self, action, state, next_state, reward, done):
        """Log environment response"""
        action_name = "LEFT" if action == 0 else "RIGHT"
        print(f"┌─ ENVIRONMENT RESPONSE " + "─"*27)
        print(f"│ Action:     {action} ({action_name})")
        print(f"│ Transition: {state} → {next_state}")
        print(f"│ Reward:     {reward:+.1f}")
        print(f"│ Done:       {done}")
        print("└" + "─"*50)

    def log_episode_summary(self, episode, steps, reward, done, goal_state=4):
        """Log episode summary"""
        status = "SUCCESS ✅" if done and steps > 0 else "TIMEOUT ⏰"
        self.episode_data.append({
            'episode': episode + 1,
            'steps': steps,
            'reward': reward,
            'success': done
        })

        print(f"\n┌─ EPISODE {episode + 1} SUMMARY " + "─"*28)
        print(f"│ Status:     {status}")
        print(f"│ Steps:      {steps}")
        print(f"│ Reward:     {reward:+.1f}")
        print(f"│ Goal:       {'Reached' if done else 'Not reached'}")
        print("└" + "─"*50)

    def print_training_summary(self, agent):
        """Print final training summary"""
        total_episodes = len(self.episode_data)
        successful_episodes = sum(1 for ep in self.episode_data if ep['success'])
        avg_reward = np.mean([ep['reward'] for ep in self.episode_data])
        avg_steps = np.mean([ep['steps'] for ep in self.episode_data])

        print(f"\n{'='*60}")
        print("🎉 TRAINING COMPLETE")
        print(f"{'='*60}")
        print(f"│ Episodes:      {total_episodes}")
        print(f"│ Success Rate:  {successful_episodes}/{total_episodes} ({successful_episodes/total_episodes*100:.1f}%)")
        print(f"│ Avg Reward:    {avg_reward:+.2f}")
        print(f"│ Avg Steps:     {avg_steps:.1f}")
        total_decisions = agent.exploration_count + agent.exploitation_count
        explore_pct = agent.exploration_count / total_decisions * 100 if total_decisions else 0.0
        print(f"│ Exploration:   {agent.exploration_count}/{total_decisions} ({explore_pct:.1f}%)")
        print(f"│ Duration:      {(datetime.now() - self.start_time).total_seconds():.1f}s")
        print("─"*60)

        # Episode-by-episode breakdown
        print("EPISODE BREAKDOWN:")
        print("Ep# │ Steps │ Reward │ Status")
        print("────┼───────┼────────┼─────────")
        for ep in self.episode_data:
            status = "✅" if ep['success'] else "⏰"
            print(f"{ep['episode']:2d}  │  {ep['steps']:2d}   │ {ep['reward']:+6.1f} │   {status}")
        print("="*60)


def simple_environment_step(state, action, num_states=5):
    """Simple environment: move left/right, goal is rightmost state"""
    if action == 0:  # Move left
        next_state = max(0, state - 1)
    else:  # Move right
        next_state = min(num_states - 1, state + 1)

    # Rewards
    if next_state == num_states - 1:  # Goal reached
        reward = 10
        done = True
    elif (action == 0 and state == 0) or (action == 1 and state == num_states - 1):
        reward = -5  # Hit boundary
        done = False
    else:
        reward = -1  # Step cost
        done = False

    return next_state, reward, done


def train_agent(episodes=5, max_steps_per_episode=10, verbose=True):
    """Train agent with organized logging"""

    # Initialize
    agent = QLearningAgent(num_states=5, num_actions=2, lr=0.1, gamma=0.9, epsilon=0.3)
    logger = TrainingLogger()

    # Training loop
    for episode in range(episodes):
        logger.log_episode_start(episode, episodes)

        state = 0  # Start state
        episode_reward = 0
        step_count = 0

        for step in range(max_steps_per_episode):
            step_count += 1

            if verbose:
                logger.log_step_start(step_count, state)

            # Get action
            action = agent.get_action(state, verbose=verbose)

            # Environment step
            next_state, reward, done = simple_environment_step(state, action)
            episode_reward += reward

            if verbose:
                logger.log_environment_response(action, state, next_state, reward, done)

            # Update agent
            agent.update(state, action, reward, next_state, done, verbose=verbose)

            state = next_state

            if done:
                break

        # Episode summary
        logger.log_episode_summary(episode, step_count, episode_reward, done)

        # Show Q-table and policy after each episode
        agent.print_q_table(f"Q-TABLE AFTER EPISODE {episode + 1}")
        agent.print_policy()

    # Final summary
    logger.print_training_summary(agent)
    agent.print_q_table("FINAL Q-TABLE")
    agent.print_policy()

    return agent


def test_trained_agent(agent, max_steps=10):
    """Test the trained agent"""
    print(f"\n{'='*60}")
    print("🧪 TESTING TRAINED AGENT (Pure Exploitation)")
    print(f"{'='*60}")

    agent.epsilon = 0.0  # No exploration
    state = 0
    step = 0

    print(f"Starting at state {state}\n")

    while step < max_steps:
        step += 1
        print(f"Step {step}:")

        action = agent.get_action(state, verbose=False)
        next_state, reward, done = simple_environment_step(state, action)

        action_name = "LEFT" if action == 0 else "RIGHT"
        print(f"  Action: {action} ({action_name}) → State {state} → {next_state} (Reward: {reward:+.1f})")

        state = next_state

        if done:
            print(f"✅ SUCCESS! Goal reached in {step} steps!")
            break

    if not done:
        print(f"❌ FAILED! Could not reach goal in {max_steps} steps")

    print("="*60)


# Example usage
if __name__ == "__main__":
    print("🚀 Q-LEARNING TRAINING DEMONSTRATION")
    print("Environment: 5 states (0,1,2,3,4), 2 actions (left/right), goal=state 4")

    # Train with organized output
    trained_agent = train_agent(episodes=4, max_steps_per_episode=8, verbose=True)

    # Test the trained agent
    test_trained_agent(trained_agent, max_steps=10)

🚀 Q-LEARNING TRAINING DEMONSTRATION
Environment: 5 states (0,1,2,3,4), 2 actions (left/right), goal=state 4

🤖 Q-LEARNING AGENT INITIALIZATION
│ Q-table Shape       │ (5, 2)
│ Learning Rate (α)   │ 0.1
│ Discount Factor (γ) │ 0.9
│ Exploration Rate (ε)│ 0.3
────────────────────────────────────────────────────────────
Initial Q-Table (all zeros):
        A0     A1  
S0 │  0.00  0.00 
S1 │  0.00  0.00 
S2 │  0.00  0.00 
S3 │  0.00  0.00 
S4 │  0.00  0.00 


🎬 EPISODE  1/4 - 03:11:44

📍 STEP  1 - Current State: 0
──────────────────────────────
┌─ ACTION SELECTION (State 0) ─────────────────────────
│ Q-values: [0.00, 0.00]
│ Best:     Action 0 (Q=0.000)
│ Decision: EXPLOIT 🎯 (0.480 ≥ 0.3)
│ Chosen:   Action 0 (greedy)
│ Stats:    Explore 0/1 (0.0%)
└──────────────────────────────────────────────────
┌─ ENVIRONMENT RESPONSE ───────────────────────────
│ Action:     0 (LEFT)
│ Transition: 0 → 0
│ Reward:     -5.0
│ Done:       False
└──────────────────────────────────────────────────
┌─ Q-U
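
As a sanity check on what the printed Q-tables should eventually approach, the sketch below computes reference Q-values for the 5-state corridor by repeatedly applying the same target formula (`reward + gamma * max Q(next_state)`) to every state-action pair until the values stop changing. This is a different technique from Q-learning (it sweeps the known environment rules instead of learning from sampled steps) and is offered only as a reference. `corridor_step` is a compact, hypothetical restatement of `simple_environment_step`, and `optimal_q` is a name introduced here.

```python
import numpy as np


def corridor_step(state, action, num_states=5):
    """Compact restatement of the corridor rules: 0 = left, 1 = right."""
    next_state = max(0, state - 1) if action == 0 else min(num_states - 1, state + 1)
    if next_state == num_states - 1:
        return next_state, 10, True   # Goal reached
    if action == 0 and state == 0:
        return next_state, -5, False  # Bumped into the left wall
    return next_state, -1, False      # Normal step cost


def optimal_q(num_states=5, num_actions=2, gamma=0.9, sweeps=50):
    """Sweep the Bellman target over all state-action pairs until it settles."""
    q = np.zeros((num_states, num_actions))
    for _ in range(sweeps):
        for s in range(num_states - 1):  # No action is taken from the goal state
            for a in range(num_actions):
                next_state, reward, done = corridor_step(s, a, num_states)
                q[s, a] = reward if done else reward + gamma * q[next_state].max()
    return q


print(np.round(optimal_q(), 3))
```

With only a handful of training episodes, the learned table will still be far from these values; the gap narrows as more episodes are run.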