# 🎮 Tensorus Tutorial 5: RL Agent - Learning from Stored Experiences

## 🎯 Learning Objectives
- **Master** reinforcement learning with tensor-based experience replay
- **Implement** advanced RL algorithms (DQN, PPO, A3C, SAC)
- **Optimize** policy networks using stored tensor experiences
- **Scale** RL training with distributed experience collection
- **Deploy** intelligent agents for real-world decision making

**⏱️ Duration:** 35 minutes | **🎓 Level:** Expert

---

## 🧠 What is the RL Agent?

The **Reinforcement Learning Agent** uses Tensorus's tensor storage as its experience replay backend: each transition (state, action, reward, next state, done flag) is stored as tensor data, sampled back in batches, and used to train the policy network.

### 🚀 Revolutionary RL Capabilities:

| Traditional RL | **Tensorus RL Agent** |
|----------------|----------------------|
| Limited replay buffer | 🗄️ **Server-backed tensor storage** (the local fallback is capped at `buffer_size`) |
| Single-agent learning | 🤝 **Multi-agent coordination** |
| Fixed exploration | 🔍 **Adaptive exploration strategies** |
| Manual reward shaping | 🎯 **Automated reward engineering** |
| Slow convergence | ⚡ **Faster training (up to 10x claimed)** |
| Memory limitations | 🧠 **Experience history persisted outside process memory** |

### 🎯 Core RL Features:

1. **🗄️ Tensor Experience Replay** - Efficient storage and sampling of experiences
2. **🧠 Advanced Algorithms** - DQN, PPO, A3C, SAC, Rainbow DQN
3. **🔍 Intelligent Exploration** - UCB, Thompson Sampling, Curiosity-driven
4. **🎯 Reward Engineering** - Automatic reward shaping and curriculum learning
5. **🤝 Multi-Agent Systems** - Cooperative and competitive agent training
6. **📊 Policy Analysis** - Advanced policy visualization and interpretation
7. **🔄 Continual Learning** - Lifelong learning without catastrophic forgetting
8. **🌐 Distributed Training** - Scalable RL across multiple environments
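
Item 1 is the core of this tutorial, so here is a minimal sketch of proportional prioritized sampling, the strategy the replay buffer below uses when `prioritized=True`. It is an illustration only, not the Tensorus API: the function name `sample_proportional` and the exponents `alpha` and `beta` (with commonly used defaults of 0.6 and 0.4) are assumptions introduced here. The sketch also returns importance-sampling weights, which the tutorial's own buffer does not compute.

```python
import random
from typing import List, Optional, Sequence, Tuple


def sample_proportional(
    priorities: Sequence[float],
    batch_size: int,
    alpha: float = 0.6,   # assumed default: how strongly priorities skew sampling
    beta: float = 0.4,    # assumed default: strength of the bias correction
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[float]]:
    """Sample indices with probability proportional to priority ** alpha.

    Returns the sampled indices and importance-sampling weights,
    normalised so that the largest weight in the batch is 1.0.
    Priorities are expected to be strictly positive.
    """
    if not priorities:
        raise ValueError("priorities must be non-empty")
    rng = rng or random.Random()

    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    probs = [s / total for s in scaled]
    n = len(priorities)

    # Draw with replacement, so a high-priority item can appear more than once
    indices = rng.choices(range(n), weights=probs, k=batch_size)

    # w_i = (N * P(i)) ** -beta offsets the bias of non-uniform sampling
    raw = [(n * probs[i]) ** (-beta) for i in indices]
    max_w = max(raw)
    weights = [w / max_w for w in raw]
    return indices, weights


if __name__ == "__main__":
    idx, w = sample_proportional([0.1, 2.0, 0.5, 1.0], batch_size=4)
    print(idx, w)
```

In a training loop, each weight would multiply the corresponding per-sample loss before averaging. With `alpha=0` the sampling becomes uniform, and with `beta=0` all weights equal 1.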

**🌟 Result: State-of-the-art RL performance with unprecedented scalability!**

In [None]:
# 🛠️ Setup: Advanced RL Agent System
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
import requests
import json
import time
import random
import math
from typing import Dict, List, Tuple, Optional, Any, NamedTuple
from dataclasses import dataclass, field
from collections import deque, namedtuple
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from abc import ABC, abstractmethod
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
plt.style.use('seaborn-v0_8')
sns.set_palette("rocket")

# Experience tuple for RL
Experience = namedtuple('Experience', ['state', 'action', 'reward', 'next_state', 'done'])

@dataclass
class RLConfig:
    """Configuration for RL agent"""
    algorithm: str
    state_dim: int
    action_dim: int
    learning_rate: float = 0.001
    gamma: float = 0.99
    epsilon: float = 0.1
    batch_size: int = 64
    buffer_size: int = 100000
    target_update_freq: int = 100
    hidden_dims: List[int] = field(default_factory=lambda: [256, 256])

@dataclass
class TrainingMetrics:
    """RL training metrics"""
    episode: int
    total_reward: float
    episode_length: int
    loss: float
    epsilon: float
    q_values_mean: float
    exploration_rate: float

class TensorExperienceReplay:
    """Tensor-based experience replay buffer with advanced sampling"""
    
    def __init__(self, capacity: int, state_dim: int, action_dim: int, api_url: str = "http://127.0.0.1:7860"):
        self.capacity = capacity
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.api_url = api_url
        self.server_available = self._test_connection()
        
        # Local buffer for demo mode
        self.buffer = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)
        self.position = 0
        
    def _test_connection(self) -> bool:
        try:
            response = requests.get(f"{self.api_url}/health", timeout=3)
            return response.status_code == 200
        except requests.RequestException:
            return False
    
    def store(self, experience: Experience, priority: float = 1.0):
        """Store experience locally and, when the server is reachable, in Tensorus"""
        # Always keep a local copy: __len__ and the local fallback read from it,
        # so training can start even when experiences are also sent to the server
        self._store_local(experience, priority)
        if not self.server_available:
            return
        try:
            # Store in Tensorus for persistent, scalable storage
            payload = {
                "experience": {
                    "state": experience.state.tolist() if hasattr(experience.state, 'tolist') else experience.state,
                    "action": experience.action,
                    "reward": experience.reward,
                    "next_state": experience.next_state.tolist() if hasattr(experience.next_state, 'tolist') else experience.next_state,
                    "done": experience.done
                },
                "priority": priority,
                "timestamp": time.time()
            }
            requests.post(f"{self.api_url}/api/v1/rl/store_experience", json=payload, timeout=3)
        except (requests.RequestException, TypeError, ValueError):
            # Network or serialization failure; the local copy above is still available
            pass
    
    def _store_local(self, experience: Experience, priority: float):
        """Store experience locally for demo mode"""
        self.buffer.append(experience)
        self.priorities.append(priority)
    
    def sample(self, batch_size: int, prioritized: bool = True) -> List[Experience]:
        """Sample batch of experiences with optional prioritization"""
        if self.server_available:
            try:
                payload = {
                    "batch_size": batch_size,
                    "prioritized": prioritized,
                    "sampling_strategy": "proportional" if prioritized else "uniform"
                }
                response = requests.post(f"{self.api_url}/api/v1/rl/sample_experiences", json=payload, timeout=3)
                response.raise_for_status()  # Treat HTTP errors as failures and fall back
                data = response.json()
                
                experiences = []
                for exp_data in data['experiences']:
                    exp = Experience(
                        state=np.array(exp_data['state']),
                        action=exp_data['action'],
                        reward=exp_data['reward'],
                        next_state=np.array(exp_data['next_state']),
                        done=exp_data['done']
                    )
                    experiences.append(exp)
                return experiences
            except (requests.RequestException, KeyError, ValueError, TypeError):
                return self._sample_local(batch_size, prioritized)
        else:
            return self._sample_local(batch_size, prioritized)
    
    def _sample_local(self, batch_size: int, prioritized: bool) -> List[Experience]:
        """Sample experiences locally for demo mode"""
        if len(self.buffer) < batch_size:
            return list(self.buffer)
        
        if prioritized and self.priorities:
            # Prioritized sampling based on TD error
            priorities = np.array(self.priorities)
            probabilities = priorities / priorities.sum()
            indices = np.random.choice(len(self.buffer), batch_size, p=probabilities)
        else:
            # Uniform sampling
            indices = random.sample(range(len(self.buffer)), batch_size)
        
        return [self.buffer[i] for i in indices]
    
    def update_priorities(self, indices: List[int], priorities: List[float]):
        """Update experience priorities based on TD errors"""
        if self.server_available:
            try:
                payload = {
                    "indices": indices,
                    "priorities": priorities
                }
                requests.post(f"{self.api_url}/api/v1/rl/update_priorities", json=payload, timeout=3)
            except requests.RequestException:
                self._update_priorities_local(indices, priorities)
        else:
            self._update_priorities_local(indices, priorities)
    
    def _update_priorities_local(self, indices: List[int], priorities: List[float]):
        """Update priorities locally"""
        for idx, priority in zip(indices, priorities):
            if idx < len(self.priorities):
                self.priorities[idx] = priority
    
    def __len__(self):
        return len(self.buffer)

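class DQNNetwork(nn.Module):
    """Deep Q-Network: a fully connected MLP with ReLU activations and dropout"""
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dims: Optional[List[int]] = None):
        super().__init__()
        # Avoid a shared mutable default argument
        hidden_dims = hidden_dims if hidden_dims is not None else [256, 256]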
        self.state_dim = state_dim
        self.action_dim = action_dim
        
        # Build network layers
        layers = []
        prev_dim = state_dim
        
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU(),
                nn.Dropout(0.1)
            ])
            prev_dim = hidden_dim
        
        # Output layer
        layers.append(nn.Linear(prev_dim, action_dim))
        
        self.network = nn.Sequential(*layers)
        
        # Initialize weights
        self.apply(self._init_weights)
    
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.xavier_uniform_(module.weight)
            module.bias.data.fill_(0.01)
    
    def forward(self, state):
        return self.network(state)

class RLAgent:
    """Advanced Reinforcement Learning Agent"""
    
    def __init__(self, config: RLConfig, api_url: str = "http://127.0.0.1:7860"):
        self.config = config
        self.api_url = api_url
        self.server_available = self._test_connection()
        
        # Initialize networks
        self.q_network = DQNNetwork(config.state_dim, config.action_dim, config.hidden_dims)
        self.target_network = DQNNetwork(config.state_dim, config.action_dim, config.hidden_dims)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.target_network.eval()  # Disable dropout so TD targets are deterministic
        
        # Optimizer
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=config.learning_rate)
        
        # Experience replay
        self.replay_buffer = TensorExperienceReplay(
            config.buffer_size, config.state_dim, config.action_dim, api_url
        )
        
        # Training metrics
        self.training_metrics = []
        self.episode_count = 0
        self.step_count = 0
        
        # Exploration parameters
        self.epsilon = config.epsilon
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
    
    def _test_connection(self) -> bool:
        try:
            response = requests.get(f"{self.api_url}/health", timeout=3)
            return response.status_code == 200
        except requests.RequestException:
            return False
    
    def select_action(self, state: np.ndarray, training: bool = True) -> int:
        """Select action using an epsilon-greedy policy"""
        if training and random.random() < self.epsilon:
            # Exploration: random action
            return random.randint(0, self.config.action_dim - 1)
        else:
            # Exploitation: best action according to Q-network
            with torch.no_grad():
                state_tensor = torch.FloatTensor(state).unsqueeze(0)
                q_values = self.q_network(state_tensor)
                return q_values.argmax().item()
    
    def store_experience(self, state, action, reward, next_state, done):
        """Store experience in replay buffer"""
        experience = Experience(state, action, reward, next_state, done)
        
        # Calculate TD error for prioritization
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            next_state_tensor = torch.FloatTensor(next_state).unsqueeze(0)
            
            current_q = self.q_network(state_tensor)[0][action]
            next_q = self.target_network(next_state_tensor).max(1)[0]
            target_q = reward + (self.config.gamma * next_q * (1 - done))
            
            td_error = abs(current_q - target_q).item()
            priority = td_error + 1e-6  # Small epsilon to avoid zero priority
        
        self.replay_buffer.store(experience, priority)
    
    def train_step(self) -> Dict[str, float]:
        """Perform one training step"""
        if len(self.replay_buffer) < self.config.batch_size:
            return {"loss": 0.0, "q_values_mean": 0.0}
        
        # Sample batch of experiences
        experiences = self.replay_buffer.sample(self.config.batch_size, prioritized=True)
        
        # Convert to tensors
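        # Stack into NumPy arrays first; building tensors from lists of arrays is slow
        states = torch.as_tensor(np.array([e.state for e in experiences]), dtype=torch.float32)
        actions = torch.as_tensor([e.action for e in experiences], dtype=torch.long)
        rewards = torch.as_tensor([e.reward for e in experiences], dtype=torch.float32)
        next_states = torch.as_tensor(np.array([e.next_state for e in experiences]), dtype=torch.float32)
        dones = torch.as_tensor([e.done for e in experiences], dtype=torch.bool)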
        
        # Current Q values
        current_q_values = self.q_network(states).gather(1, actions.unsqueeze(1))
        
        # Target Q values
        with torch.no_grad():
            next_q_values = self.target_network(next_states).max(1)[0]
            target_q_values = rewards + (self.config.gamma * next_q_values * ~dones)
        
        # Compute loss
        # squeeze(1) keeps the batch dimension even when the batch holds one sample
        loss = F.mse_loss(current_q_values.squeeze(1), target_q_values)
        
        # Optimize
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.q_network.parameters(), 1.0)
        self.optimizer.step()
        
        # Update target network
        if self.step_count % self.config.target_update_freq == 0:
            self.target_network.load_state_dict(self.q_network.state_dict())
        
        # Decay epsilon
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
        
        self.step_count += 1
        
        return {
            "loss": loss.item(),
            "q_values_mean": current_q_values.mean().item()
        }
    
    def train_episode(self, env_simulator, max_steps: int = 1000) -> TrainingMetrics:
        """Train for one episode"""
        state = env_simulator.reset()
        total_reward = 0
        episode_length = 0
        losses = []
        q_values = []
        
        for step in range(max_steps):
            # Select and execute action
            action = self.select_action(state, training=True)
            next_state, reward, done, _ = env_simulator.step(action)
            
            # Store experience
            self.store_experience(state, action, reward, next_state, done)
            
            # Train
            train_metrics = self.train_step()
            losses.append(train_metrics["loss"])
            q_values.append(train_metrics["q_values_mean"])
            
            total_reward += reward
            episode_length += 1
            state = next_state
            
            if done:
                break
        
        self.episode_count += 1
        
        # Create training metrics
        metrics = TrainingMetrics(
            episode=self.episode_count,
            total_reward=total_reward,
            episode_length=episode_length,
            loss=np.mean(losses) if losses else 0.0,
            epsilon=self.epsilon,
            q_values_mean=np.mean(q_values) if q_values else 0.0,
            exploration_rate=self.epsilon
        )
        
        self.training_metrics.append(metrics)
        return metrics

# Simple environment simulator for demonstration
class SimpleEnvironment:
    """Simple grid world environment for RL demonstration"""
    
    def __init__(self, size: int = 10):
        self.size = size
        self.state_dim = 2  # x, y coordinates
        self.action_dim = 4  # up, down, left, right
        self.reset()
    
    def reset(self):
        self.agent_pos = [0, 0]
        self.goal_pos = [self.size - 1, self.size - 1]
        return np.array(self.agent_pos, dtype=np.float32)
    
    def step(self, action):
        # Move agent
        if action == 0:  # up
            self.agent_pos[1] = min(self.size - 1, self.agent_pos[1] + 1)
        elif action == 1:  # down
            self.agent_pos[1] = max(0, self.agent_pos[1] - 1)
        elif action == 2:  # left
            self.agent_pos[0] = max(0, self.agent_pos[0] - 1)
        elif action == 3:  # right
            self.agent_pos[0] = min(self.size - 1, self.agent_pos[0] + 1)
        
        # Calculate reward
        distance_to_goal = abs(self.agent_pos[0] - self.goal_pos[0]) + abs(self.agent_pos[1] - self.goal_pos[1])
        
        if self.agent_pos == self.goal_pos:
            reward = 100.0  # Large reward for reaching goal
            done = True
        else:
            reward = -0.1 - distance_to_goal * 0.01  # Small penalty for each step + distance penalty
            done = False
        
        return np.array(self.agent_pos, dtype=np.float32), reward, done, {}

# Initialize RL system
print("🎮 RL AGENT TUTORIAL")
print("=" * 50)
print("🚀 Ready to train intelligent agents!")
print("\n🎯 Today: Master reinforcement learning with tensor storage!")
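
The DQN update that `train_step` applies to a sampled batch can be summarized as follows. The symbols are introduced here for illustration and do not appear in the code: $\theta$ denotes the parameters of `q_network`, $\theta^-$ those of `target_network`, $B$ the sampled batch, and $d_i \in \{0, 1\}$ the `done` flag.

$$
y_i = r_i + \gamma \,(1 - d_i)\, \max_{a'} Q_{\theta^-}(s'_i, a')
$$

$$
\mathcal{L}(\theta) = \frac{1}{|B|} \sum_{i \in B} \bigl(Q_\theta(s_i, a_i) - y_i\bigr)^2
$$

`store_experience` applies the same target to a single transition and uses $p = |Q_\theta(s, a) - y| + 10^{-6}$ as its initial priority. With the defaults in `RLConfig`, $\gamma = 0.99$, and $\theta^-$ is overwritten with $\theta$ every `target_update_freq = 100` training steps.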