# ML Practice Questions - Part 10: Advanced Topics and Specialized Methods

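This notebook covers three advanced machine learning topics: generative models (Gaussian Mixture Models and Variational Autoencoders), reinforcement learning (Q-learning and policy gradients), and meta-learning (MAML-style few-shot learning).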

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_blobs, load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from scipy.stats import multivariate_normal
from scipy.optimize import minimize
import warnings
warnings.filterwarnings('ignore')

plt.style.use('default')
sns.set_palette("husl")

## Question 1: Generative Models - Gaussian Mixture Models and Variational Autoencoders

**Question**: Implement Gaussian Mixture Models from scratch and explain the EM algorithm. Compare with a simple Variational Autoencoder implementation and analyze the different approaches to generative modeling.

### Theory: Generative Models

**1. Gaussian Mixture Model (GMM):**
$$p(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x | \mu_k, \Sigma_k)$$

Where:
- $\pi_k$ = mixing coefficient for component $k$
- $\mathcal{N}(x | \mu_k, \Sigma_k)$ = Gaussian distribution with mean $\mu_k$ and covariance $\Sigma_k$

**2. EM Algorithm:**

*E-step (Expectation):*
$$\gamma_{nk} = \frac{\pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x_n | \mu_j, \Sigma_j)}$$

*M-step (Maximization):*
$$\pi_k = \frac{N_k}{N}, \quad \mu_k = \frac{\sum_{n=1}^N \gamma_{nk} x_n}{N_k}, \quad \Sigma_k = \frac{\sum_{n=1}^N \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^T}{N_k}$$

Where $N_k = \sum_{n=1}^N \gamma_{nk}$
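
The E-step ratio above multiplies densities that can underflow to zero for points far from every component. As a minimal sketch (not the implementation used in the next cell), the same responsibilities can be computed in log space. The helper name `log_responsibilities` and the demo data are hypothetical; `scipy.special.logsumexp` and `multivariate_normal.logpdf` are existing SciPy functions.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal


def log_responsibilities(X, weights, means, covs):
    """Return (gamma, log_likelihood) computed in log space.

    Hypothetical helper for illustration; gamma[n, k] matches the E-step formula.
    """
    n_samples, n_components = X.shape[0], len(weights)
    log_joint = np.empty((n_samples, n_components))
    for k in range(n_components):
        # log(pi_k) + log N(x_n | mu_k, Sigma_k)
        log_joint[:, k] = np.log(weights[k]) + multivariate_normal.logpdf(
            X, mean=means[k], cov=covs[k]
        )
    # Log of the E-step denominator: log sum_j pi_j N(x_n | mu_j, Sigma_j)
    log_norm = logsumexp(log_joint, axis=1, keepdims=True)
    gamma = np.exp(log_joint - log_norm)
    return gamma, float(np.sum(log_norm))


# Toy data: two well-separated 2-D clusters (assumed for the demo only)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
                    rng.normal(2.0, 1.0, size=(50, 2))])
gamma_demo, ll_demo = log_responsibilities(
    X_demo,
    weights=np.array([0.5, 0.5]),
    means=np.array([[-2.0, -2.0], [2.0, 2.0]]),
    covs=np.array([np.eye(2), np.eye(2)]),
)
print(gamma_demo.sum(axis=1)[:3])  # each row of responsibilities sums to 1
```

The sum of `log_norm` is the data log-likelihood, so one pass yields both the responsibilities and the convergence metric.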

**3. Variational Autoencoder (VAE):**
$$\log p(x) \geq \mathbb{E}_{q(z|x)}[\log p(x|z)] - KL(q(z|x) || p(z))$$

Where:
- $q(z|x)$ = encoder (approximate posterior)
- $p(x|z)$ = decoder (likelihood)
- $p(z)$ = prior distribution (usually $\mathcal{N}(0, I)$)
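
When $q(z|x)$ is a diagonal Gaussian and $p(z) = \mathcal{N}(0, I)$, the KL term in the bound has a closed form, which is the expression used in `compute_loss` later in this notebook. The sketch below checks that closed form against a Monte Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$. The function names and demo values are hypothetical and chosen only for illustration.

```python
import numpy as np


def kl_diag_gaussian_to_standard(mu, logvar):
    """Closed-form KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over dimensions."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))


def kl_monte_carlo(mu, logvar, n_draws=200_000, seed=0):
    """Monte Carlo estimate of the same KL: E_q[log q(z) - log p(z)]."""
    rng = np.random.default_rng(seed)
    std = np.exp(0.5 * logvar)
    eps = rng.standard_normal((n_draws, mu.size))
    z = mu + std * eps  # reparameterized draws from q
    # log q(z): (z - mu) / std equals eps, so the quadratic term is eps**2
    log_q = -0.5 * np.sum(eps**2 + logvar + np.log(2 * np.pi), axis=1)
    log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
    return float(np.mean(log_q - log_p))


mu_demo = np.array([0.5, -1.0])      # assumed encoder mean
logvar_demo = np.array([0.0, -0.5])  # assumed encoder log-variance
kl_exact = kl_diag_gaussian_to_standard(mu_demo, logvar_demo)
kl_mc = kl_monte_carlo(mu_demo, logvar_demo)
print(f"closed form: {kl_exact:.4f}, Monte Carlo: {kl_mc:.4f}")
```

The two numbers should agree up to sampling noise, and the KL is exactly zero when `mu` and `logvar` are both zero, that is, when the approximate posterior equals the prior.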

In [None]:
class GaussianMixtureModelCustom:
    def __init__(self, n_components=2, max_iter=100, tol=1e-6, random_state=None):
        self.n_components = n_components
        self.max_iter = max_iter
        self.tol = tol
        self.random_state = random_state
        
        # Model parameters
        self.weights_ = None
        self.means_ = None
        self.covariances_ = None
        
        # Training history
        self.log_likelihood_history_ = []
        self.responsibilities_ = None
    
    def _initialize_parameters(self, X):
        """Initialize GMM parameters"""
        if self.random_state is not None:
            np.random.seed(self.random_state)
        
        n_samples, n_features = X.shape
        
        # Initialize weights uniformly
        self.weights_ = np.ones(self.n_components) / self.n_components
        
        # Initialize means randomly
        self.means_ = X[np.random.choice(n_samples, self.n_components, replace=False)]
        
        # Initialize covariances as identity matrices
        self.covariances_ = np.array([np.eye(n_features) for _ in range(self.n_components)])
    
    def _multivariate_gaussian_pdf(self, X, mean, cov):
        """Compute multivariate Gaussian PDF"""
        n_features = X.shape[1]
        
        # Add small value to diagonal for numerical stability
        cov_reg = cov + np.eye(n_features) * 1e-6
        
        # Compute PDF
        diff = X - mean
        cov_inv = np.linalg.inv(cov_reg)
        cov_det = np.linalg.det(cov_reg)
        
        # Avoid numerical issues
        if cov_det <= 0:
            cov_det = 1e-6
        
        normalization = 1.0 / np.sqrt((2 * np.pi) ** n_features * cov_det)
        exponent = -0.5 * np.sum(diff @ cov_inv * diff, axis=1)
        
        return normalization * np.exp(exponent)
    
    def _e_step(self, X):
        """Expectation step: compute responsibilities"""
        n_samples = X.shape[0]
        responsibilities = np.zeros((n_samples, self.n_components))
        
        # Compute responsibilities for each component
        for k in range(self.n_components):
            responsibilities[:, k] = (self.weights_[k] * 
                                    self._multivariate_gaussian_pdf(X, self.means_[k], self.covariances_[k]))
        
        # Normalize responsibilities
        total_responsibility = np.sum(responsibilities, axis=1, keepdims=True)
        total_responsibility[total_responsibility == 0] = 1e-8  # Avoid division by zero
        responsibilities /= total_responsibility
        
        self.responsibilities_ = responsibilities
        return responsibilities
    
    def _m_step(self, X, responsibilities):
        """Maximization step: update parameters"""
        n_samples, n_features = X.shape
        
        # Effective number of points assigned to each component
        N_k = np.sum(responsibilities, axis=0)
        
        # Update weights
        self.weights_ = N_k / n_samples
        
        # Update means
        for k in range(self.n_components):
            if N_k[k] > 0:
                self.means_[k] = np.sum(responsibilities[:, k:k+1] * X, axis=0) / N_k[k]
        
        # Update covariances
        for k in range(self.n_components):
            if N_k[k] > 0:
                diff = X - self.means_[k]
                self.covariances_[k] = np.sum(
                    responsibilities[:, k:k+1] * diff[:, :, np.newaxis] * diff[:, np.newaxis, :], 
                    axis=0
                ) / N_k[k]
    
    def _compute_log_likelihood(self, X):
        """Compute log-likelihood of the data"""
        likelihood = np.zeros(X.shape[0])
        
        for k in range(self.n_components):
            likelihood += (self.weights_[k] * 
                         self._multivariate_gaussian_pdf(X, self.means_[k], self.covariances_[k]))
        
        # Avoid log(0)
        likelihood[likelihood <= 0] = 1e-8
        return np.sum(np.log(likelihood))
    
    def fit(self, X):
        """Fit GMM using EM algorithm"""
        self._initialize_parameters(X)
        self.log_likelihood_history_ = []
        
        prev_log_likelihood = -np.inf
        
        for iteration in range(self.max_iter):
            # E-step
            responsibilities = self._e_step(X)
            
            # M-step
            self._m_step(X, responsibilities)
            
            # Compute log-likelihood
            log_likelihood = self._compute_log_likelihood(X)
            self.log_likelihood_history_.append(log_likelihood)
            
            # Check for convergence
            if abs(log_likelihood - prev_log_likelihood) < self.tol:
                print(f"Converged after {iteration + 1} iterations")
                break
            
            prev_log_likelihood = log_likelihood
        
        return self
    
    def predict(self, X):
        """Predict cluster assignments"""
        responsibilities = self._e_step(X)
        return np.argmax(responsibilities, axis=1)
    
    def sample(self, n_samples=1):
        """Generate samples from the fitted GMM"""
        if self.weights_ is None:
            raise ValueError("Model must be fitted before sampling")
        
        samples = []
        
        for _ in range(n_samples):
            # Choose component based on weights
            component = np.random.choice(self.n_components, p=self.weights_)
            
            # Sample from chosen component
            sample = np.random.multivariate_normal(self.means_[component], self.covariances_[component])
            samples.append(sample)
        
        return np.array(samples)

# Test GMM implementation
print("Testing Gaussian Mixture Model Implementation...")

# Generate synthetic data with known clusters
np.random.seed(42)
n_samples = 300
centers = [(-2, -2), (2, 2), (-2, 2)]
X_gmm, y_true = make_blobs(n_samples=n_samples, centers=centers, 
                          cluster_std=1.0, random_state=42)

# Fit GMM
gmm = GaussianMixtureModelCustom(n_components=3, random_state=42)
gmm.fit(X_gmm)

# Predict clusters
y_pred_gmm = gmm.predict(X_gmm)

print(f"\nGMM Results:")
print(f"Final log-likelihood: {gmm.log_likelihood_history_[-1]:.2f}")
print(f"Component weights: {gmm.weights_}")
print(f"Component means:")
for i, mean in enumerate(gmm.means_):
    print(f"  Component {i}: [{mean[0]:.2f}, {mean[1]:.2f}]")

# Generate samples from fitted model
generated_samples = gmm.sample(100)
print(f"\nGenerated {generated_samples.shape[0]} samples from fitted GMM")

In [None]:
# Simple Variational Autoencoder implementation
class SimpleVAE:
    def __init__(self, input_dim, latent_dim, hidden_dim=64):
        self.input_dim = input_dim
        self.latent_dim = latent_dim
        self.hidden_dim = hidden_dim
        
        # Encoder parameters (mean and log variance)
        self.encoder_W1 = np.random.randn(input_dim, hidden_dim) * 0.1
        self.encoder_b1 = np.zeros(hidden_dim)
        self.encoder_W_mu = np.random.randn(hidden_dim, latent_dim) * 0.1
        self.encoder_b_mu = np.zeros(latent_dim)
        self.encoder_W_logvar = np.random.randn(hidden_dim, latent_dim) * 0.1
        self.encoder_b_logvar = np.zeros(latent_dim)
        
        # Decoder parameters
        self.decoder_W1 = np.random.randn(latent_dim, hidden_dim) * 0.1
        self.decoder_b1 = np.zeros(hidden_dim)
        self.decoder_W2 = np.random.randn(hidden_dim, input_dim) * 0.1
        self.decoder_b2 = np.zeros(input_dim)
        
        # Training history
        self.loss_history = []
    
    def relu(self, x):
        return np.maximum(0, x)
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
    
    def encode(self, x):
        """Encode input to latent space parameters"""
        h = self.relu(np.dot(x, self.encoder_W1) + self.encoder_b1)
        mu = np.dot(h, self.encoder_W_mu) + self.encoder_b_mu
        logvar = np.dot(h, self.encoder_W_logvar) + self.encoder_b_logvar
        return mu, logvar
    
    def reparameterize(self, mu, logvar):
        """Reparameterization trick for sampling"""
        std = np.exp(0.5 * logvar)
        eps = np.random.randn(*mu.shape)
        return mu + std * eps
    
    def decode(self, z):
        """Decode latent variables to reconstruction"""
        h = self.relu(np.dot(z, self.decoder_W1) + self.decoder_b1)
        reconstruction = self.sigmoid(np.dot(h, self.decoder_W2) + self.decoder_b2)
        return reconstruction
    
    def forward(self, x):
        """Forward pass through VAE"""
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        reconstruction = self.decode(z)
        return reconstruction, mu, logvar, z
    
    def compute_loss(self, x, reconstruction, mu, logvar):
        """Compute VAE loss (reconstruction + KL divergence)"""
        # Reconstruction loss (binary cross-entropy)
        reconstruction_loss = -np.sum(
            x * np.log(reconstruction + 1e-8) + 
            (1 - x) * np.log(1 - reconstruction + 1e-8)
        ) / x.shape[0]
        
        # KL divergence loss
        kl_loss = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar)) / x.shape[0]
        
        total_loss = reconstruction_loss + kl_loss
        return total_loss, reconstruction_loss, kl_loss
    
    def generate(self, n_samples=1):
        """Generate new samples from the learned distribution"""
        # Sample from prior
        z = np.random.randn(n_samples, self.latent_dim)
        # Decode to data space
        generated = self.decode(z)
        return generated

# Test VAE with simple 2D data
print("\nTesting Variational Autoencoder Implementation...")

# Normalize data for VAE
X_vae = (X_gmm - X_gmm.min()) / (X_gmm.max() - X_gmm.min())

# Initialize VAE
vae = SimpleVAE(input_dim=2, latent_dim=2, hidden_dim=32)

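# Loss-monitoring loop only: no gradients are computed and no weights are updated,
# so the printed losses change only through the reparameterization noise.
# A real training loop would backpropagate through compute_loss and apply learning_rate.
n_epochs = 50
learning_rate = 0.01  # unused until a parameter update is added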

for epoch in range(n_epochs):
    # Forward pass
    reconstruction, mu, logvar, z = vae.forward(X_vae)
    
    # Compute loss
    total_loss, recon_loss, kl_loss = vae.compute_loss(X_vae, reconstruction, mu, logvar)
    vae.loss_history.append(total_loss)
    
    if epoch % 10 == 0:
        print(f"Epoch {epoch}: Total Loss = {total_loss:.4f}, "
              f"Recon Loss = {recon_loss:.4f}, KL Loss = {kl_loss:.4f}")

# Generate samples (the decoder still has its random initial weights, since the loop above never updates them)
vae_samples = vae.generate(100)
print(f"\nGenerated {vae_samples.shape[0]} samples from VAE")

# Compare models
print(f"\nModel Comparison:")
print(f"GMM - Explicit probabilistic model with interpretable components")
print(f"VAE - Neural network-based with learned representations")

In [None]:
# Visualize results
plt.figure(figsize=(15, 10))

# Original data
plt.subplot(2, 4, 1)
plt.scatter(X_gmm[:, 0], X_gmm[:, 1], c=y_true, alpha=0.7)
plt.title('Original Data (True Clusters)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# GMM predictions
plt.subplot(2, 4, 2)
plt.scatter(X_gmm[:, 0], X_gmm[:, 1], c=y_pred_gmm, alpha=0.7)
plt.title('GMM Predictions')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# GMM components
plt.subplot(2, 4, 3)
plt.scatter(X_gmm[:, 0], X_gmm[:, 1], c=y_pred_gmm, alpha=0.3)
# Plot component means
for i, (mean, cov) in enumerate(zip(gmm.means_, gmm.covariances_)):
    plt.scatter(mean[0], mean[1], marker='x', s=200, linewidths=3, label=f'Component {i}')
    # Plot covariance ellipse
    eigenvals, eigenvecs = np.linalg.eigh(cov)
    angle = np.degrees(np.arctan2(eigenvecs[1, 0], eigenvecs[0, 0]))
    width, height = 2 * np.sqrt(eigenvals)
    ellipse = plt.matplotlib.patches.Ellipse(mean, width, height, angle=angle, 
                                           alpha=0.3, linewidth=2, fill=False)
    plt.gca().add_patch(ellipse)
plt.title('GMM Components')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()

# GMM generated samples
plt.subplot(2, 4, 4)
plt.scatter(generated_samples[:, 0], generated_samples[:, 1], alpha=0.7, color='red')
plt.title('GMM Generated Samples')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# VAE latent space
plt.subplot(2, 4, 5)
reconstruction, mu, logvar, z = vae.forward(X_vae)
plt.scatter(z[:, 0], z[:, 1], c=y_true, alpha=0.7)
plt.title('VAE Latent Space')
plt.xlabel('Latent Dimension 1')
plt.ylabel('Latent Dimension 2')

# VAE reconstruction
plt.subplot(2, 4, 6)
# Scale back reconstruction
recon_scaled = reconstruction * (X_gmm.max() - X_gmm.min()) + X_gmm.min()
plt.scatter(recon_scaled[:, 0], recon_scaled[:, 1], c=y_true, alpha=0.7)
plt.title('VAE Reconstructions')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# VAE generated samples
plt.subplot(2, 4, 7)
vae_scaled = vae_samples * (X_gmm.max() - X_gmm.min()) + X_gmm.min()
plt.scatter(vae_scaled[:, 0], vae_scaled[:, 1], alpha=0.7, color='purple')
plt.title('VAE Generated Samples')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# Training curves
plt.subplot(2, 4, 8)
plt.plot(gmm.log_likelihood_history_, 'b-', label='GMM Log-Likelihood', linewidth=2)
plt.xlabel('Iteration')
plt.ylabel('Log-Likelihood')
plt.title('GMM Training Curve')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()

# VAE loss curve
plt.figure(figsize=(8, 4))
plt.plot(vae.loss_history, 'purple', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Total Loss')
plt.title('VAE Loss per Epoch (No Weight Updates)')
plt.grid(True)
plt.show()

## Question 2: Reinforcement Learning - Q-Learning and Policy Gradients

**Question**: Implement Q-learning for a simple grid world environment and compare with a basic policy gradient method. Explain the difference between value-based and policy-based methods.

### Theory: Reinforcement Learning

**1. Q-Learning (Value-based):**
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)]$$

Where:
- $Q(s,a)$ = action-value function
- $\alpha$ = learning rate
- $\gamma$ = discount factor
- $r_{t+1}$ = immediate reward
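
To make the update rule concrete, here is a single worked update on a toy table. The table size and the pre-filled values are hypothetical; $\alpha = 0.1$ and $\gamma = 0.95$ match the defaults of the agent defined later in this notebook.

```python
import numpy as np

alpha, gamma = 0.1, 0.95   # learning rate and discount factor
q_demo = np.zeros((3, 2))  # toy table: 3 states, 2 actions (hypothetical sizes)
q_demo[1] = [2.0, 5.0]     # pretend state 1 already has value estimates

state, action, reward, next_state = 0, 1, -0.1, 1

# TD target: r + gamma * max_a Q(s', a) = -0.1 + 0.95 * 5.0
td_target = reward + gamma * np.max(q_demo[next_state])
# TD error: how far the current estimate is from the target
td_error = td_target - q_demo[state, action]
# Move the estimate a fraction alpha of the way toward the target
q_demo[state, action] += alpha * td_error

print(f"target={td_target:.3f}, error={td_error:.3f}, new Q={q_demo[state, action]:.3f}")
```

Only the entry for the visited state-action pair changes; the `max` over next-state actions is what makes the update off-policy.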

**2. Policy Gradient (REINFORCE):**
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right]$$

Where:
- $\pi_\theta(a|s)$ = policy parameterized by $\theta$
- $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$ = discounted return from time step $t$
- $J(\theta)$ = expected return
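
For a softmax policy over per-state logits, the score term $\nabla \log \pi_\theta(a|s)$ with respect to the logits is the one-hot vector for $a$ minus the action probabilities. The sketch below verifies this identity with finite differences. The function names and the demo logits are hypothetical.

```python
import numpy as np


def softmax(logits):
    exp_x = np.exp(logits - np.max(logits))
    return exp_x / np.sum(exp_x)


def grad_log_softmax(logits, action):
    """Analytic gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    grad = -softmax(logits)
    grad[action] += 1.0
    return grad


def numeric_grad_log_softmax(logits, action, h=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = np.zeros_like(logits)
    for i in range(logits.size):
        step = np.zeros_like(logits)
        step[i] = h
        up = np.log(softmax(logits + step)[action])
        down = np.log(softmax(logits - step)[action])
        grad[i] = (up - down) / (2 * h)
    return grad


logits_demo = np.array([0.2, -0.5, 1.0, 0.0])  # assumed logits for one state
analytic = grad_log_softmax(logits_demo, action=2)
numeric = numeric_grad_log_softmax(logits_demo, action=2)
print(np.max(np.abs(analytic - numeric)))  # should be close to zero
```

Multiplying this vector by $G_t$ gives the per-step REINFORCE update for a tabular softmax policy.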

**3. Bellman Optimality Equation:**
$$V^*(s) = \max_a \sum_{s'} P(s'|s,a)[r(s,a,s') + \gamma V^*(s')]$$
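
When the transition model is known, this equation can be solved directly by value iteration. The sketch below does so for a deterministic grid with the same defaults as the `GridWorld` class in this notebook (goal in the bottom-right corner, goal reward 10, step penalty -0.1) and $\gamma = 0.95$. Because moves are deterministic, the sum over $s'$ collapses to a single term. The function name `value_iteration_grid` is hypothetical.

```python
import numpy as np


def value_iteration_grid(size=5, goal_reward=10.0, step_penalty=-0.1,
                         gamma=0.95, tol=1e-8, max_sweeps=1000):
    """Value iteration on a deterministic grid; the goal is the bottom-right cell."""
    moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # up, right, down, left
    goal = (size - 1, size - 1)
    V = np.zeros((size, size))
    for _ in range(max_sweeps):
        V_new = np.zeros_like(V)
        for r in range(size):
            for c in range(size):
                if (r, c) == goal:
                    continue  # terminal state keeps value 0
                candidates = []
                for dr, dc in moves:
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < size and 0 <= nc < size):
                        nr, nc = r, c  # hitting a wall leaves the agent in place
                    if (nr, nc) == goal:
                        candidates.append(goal_reward)  # episode ends, no bootstrap
                    else:
                        candidates.append(step_penalty + gamma * V[nr, nc])
                V_new[r, c] = max(candidates)  # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V


V_star = value_iteration_grid()
print(np.round(V_star, 2))
```

A well-trained Q-table should satisfy $\max_a Q(s,a) \approx V^*(s)$, so this grid is a useful reference for the Q-learning value heatmap plotted later.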

**4. Value vs Policy Methods:**
- Value-based: Learn optimal action-value function, derive policy
- Policy-based: Directly optimize policy parameters

In [None]:
class GridWorld:
    def __init__(self, size=5, goal_reward=10, step_penalty=-0.1):
        self.size = size
        self.goal_reward = goal_reward
        self.step_penalty = step_penalty
        
        # Actions: 0=up, 1=right, 2=down, 3=left
        self.actions = [0, 1, 2, 3]
        self.n_actions = len(self.actions)
        
        # State space
        self.n_states = size * size
        
        # Goal state (bottom-right corner)
        self.goal_state = (size - 1, size - 1)
        
        # Current state
        self.current_state = (0, 0)
        
        # Action effects
        self.action_effects = {
            0: (-1, 0),  # up
            1: (0, 1),   # right
            2: (1, 0),   # down
            3: (0, -1)   # left
        }
    
    def reset(self):
        """Reset environment to initial state"""
        self.current_state = (0, 0)
        return self.state_to_index(self.current_state)
    
    def state_to_index(self, state):
        """Convert (row, col) state to index"""
        return state[0] * self.size + state[1]
    
    def index_to_state(self, index):
        """Convert index to (row, col) state"""
        return (index // self.size, index % self.size)
    
    def is_valid_state(self, state):
        """Check if state is within grid bounds"""
        return 0 <= state[0] < self.size and 0 <= state[1] < self.size
    
    def step(self, action):
        """Take action and return (next_state, reward, done)"""
        # Calculate next state
        delta = self.action_effects[action]
        next_state = (self.current_state[0] + delta[0], self.current_state[1] + delta[1])
        
        # Check bounds
        if not self.is_valid_state(next_state):
            next_state = self.current_state  # Stay in place if hitting wall
        
        # Calculate reward
        if next_state == self.goal_state:
            reward = self.goal_reward
            done = True
        else:
            reward = self.step_penalty
            done = False
        
        self.current_state = next_state
        return self.state_to_index(next_state), reward, done
    
    def get_possible_actions(self, state):
        """Get valid actions from a state"""
        return self.actions

class QLearningAgent:
    def __init__(self, n_states, n_actions, learning_rate=0.1, discount_factor=0.95, 
                 epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01):
        self.n_states = n_states
        self.n_actions = n_actions
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        
        # Initialize Q-table
        self.q_table = np.zeros((n_states, n_actions))
        
        # Training history
        self.episode_rewards = []
        self.episode_lengths = []
    
    def choose_action(self, state, training=True):
        """Choose action using epsilon-greedy policy"""
        if training and np.random.random() < self.epsilon:
            # Explore: random action
            return np.random.choice(self.n_actions)
        else:
            # Exploit: best known action
            return np.argmax(self.q_table[state])
    
    def update_q_value(self, state, action, reward, next_state, done):
        """Update Q-value using Q-learning update rule"""
        if done:
            target = reward
        else:
            target = reward + self.discount_factor * np.max(self.q_table[next_state])
        
        # Q-learning update
        self.q_table[state, action] += self.learning_rate * (target - self.q_table[state, action])
    
    def decay_epsilon(self):
        """Decay exploration rate"""
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
    
    def train(self, env, n_episodes=1000, max_steps_per_episode=100):
        """Train the Q-learning agent"""
        for episode in range(n_episodes):
            state = env.reset()
            total_reward = 0
            steps = 0
            
            for step in range(max_steps_per_episode):
                action = self.choose_action(state)
                next_state, reward, done = env.step(action)
                
                self.update_q_value(state, action, reward, next_state, done)
                
                state = next_state
                total_reward += reward
                steps += 1
                
                if done:
                    break
            
            self.episode_rewards.append(total_reward)
            self.episode_lengths.append(steps)
            self.decay_epsilon()
            
            if episode % 100 == 0:
                avg_reward = np.mean(self.episode_rewards[-100:])
                print(f"Episode {episode}, Average Reward: {avg_reward:.2f}, Epsilon: {self.epsilon:.3f}")

class PolicyGradientAgent:
    def __init__(self, n_states, n_actions, learning_rate=0.01, discount_factor=0.95):
        self.n_states = n_states
        self.n_actions = n_actions
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        
        # Policy parameters (simple linear policy)
        self.policy_weights = np.random.randn(n_states, n_actions) * 0.1
        
        # Training history
        self.episode_rewards = []
        self.episode_lengths = []
    
    def softmax(self, x):
        """Softmax function for policy probabilities"""
        exp_x = np.exp(x - np.max(x))
        return exp_x / np.sum(exp_x)
    
    def get_action_probabilities(self, state):
        """Get action probabilities for a state"""
        logits = self.policy_weights[state]
        return self.softmax(logits)
    
    def choose_action(self, state):
        """Sample action from policy"""
        probabilities = self.get_action_probabilities(state)
        return np.random.choice(self.n_actions, p=probabilities)
    
    def compute_returns(self, rewards):
        """Compute discounted returns"""
        returns = []
        G = 0
        for reward in reversed(rewards):
            G = reward + self.discount_factor * G
            returns.insert(0, G)
        return np.array(returns)
    
    def update_policy(self, states, actions, returns):
        """Update policy using REINFORCE"""
        for state, action, G in zip(states, actions, returns):
            # Get action probabilities
            probs = self.get_action_probabilities(state)
            
            # Compute gradient
            gradient = np.zeros(self.n_actions)
            gradient[action] = G * (1 - probs[action])
            for a in range(self.n_actions):
                if a != action:
                    gradient[a] = -G * probs[a]
            
            # Update weights
            self.policy_weights[state] += self.learning_rate * gradient
    
    def train(self, env, n_episodes=1000, max_steps_per_episode=100):
        """Train the policy gradient agent"""
        for episode in range(n_episodes):
            states, actions, rewards = [], [], []
            
            state = env.reset()
            
            for step in range(max_steps_per_episode):
                action = self.choose_action(state)
                next_state, reward, done = env.step(action)
                
                states.append(state)
                actions.append(action)
                rewards.append(reward)
                
                state = next_state
                
                if done:
                    break
            
            # Compute returns and update policy
            returns = self.compute_returns(rewards)
            self.update_policy(states, actions, returns)
            
            self.episode_rewards.append(sum(rewards))
            self.episode_lengths.append(len(rewards))
            
            if episode % 100 == 0:
                avg_reward = np.mean(self.episode_rewards[-100:])
                print(f"Episode {episode}, Average Reward: {avg_reward:.2f}")

# Test RL implementations
print("Testing Reinforcement Learning Implementations...")

# Create environment
env = GridWorld(size=5)
print(f"Grid World Environment: {env.size}x{env.size}")
print(f"Goal position: {env.goal_state}")
print(f"Number of states: {env.n_states}")
print(f"Number of actions: {env.n_actions}")

# Train Q-learning agent
print("\nTraining Q-Learning Agent...")
q_agent = QLearningAgent(env.n_states, env.n_actions)
q_agent.train(env, n_episodes=500)

# Train policy gradient agent
print("\nTraining Policy Gradient Agent...")
pg_agent = PolicyGradientAgent(env.n_states, env.n_actions)
pg_agent.train(env, n_episodes=500)

print("\nTraining completed!")

In [None]:
# Evaluate and visualize results
def evaluate_agent(agent, env, n_episodes=100, max_steps=50):
    """Evaluate trained agent"""
    total_rewards = []
    success_rate = 0
    
    for _ in range(n_episodes):
        state = env.reset()
        total_reward = 0
        
        for _ in range(max_steps):
            if hasattr(agent, 'q_table'):
                action = agent.choose_action(state, training=False)
            else:
                action = agent.choose_action(state)
            
            state, reward, done = env.step(action)
            total_reward += reward
            
            if done:
                success_rate += 1
                break
        
        total_rewards.append(total_reward)
    
    return np.mean(total_rewards), success_rate / n_episodes

# Evaluate both agents
q_avg_reward, q_success = evaluate_agent(q_agent, env)
pg_avg_reward, pg_success = evaluate_agent(pg_agent, env)

print(f"\nEvaluation Results:")
print(f"Q-Learning - Average Reward: {q_avg_reward:.2f}, Success Rate: {q_success:.2%}")
print(f"Policy Gradient - Average Reward: {pg_avg_reward:.2f}, Success Rate: {pg_success:.2%}")

# Visualize results
plt.figure(figsize=(15, 10))

# Training curves
plt.subplot(2, 3, 1)
window = 50
q_rewards_smooth = np.convolve(q_agent.episode_rewards, np.ones(window)/window, mode='valid')
pg_rewards_smooth = np.convolve(pg_agent.episode_rewards, np.ones(window)/window, mode='valid')
plt.plot(q_rewards_smooth, 'b-', label='Q-Learning', linewidth=2)
plt.plot(pg_rewards_smooth, 'r-', label='Policy Gradient', linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Average Reward')
plt.title('Learning Curves')
plt.legend()
plt.grid(True)

# Q-table heatmap
plt.subplot(2, 3, 2)
q_values_grid = np.zeros((env.size, env.size))
for i in range(env.size):
    for j in range(env.size):
        state_idx = env.state_to_index((i, j))
        q_values_grid[i, j] = np.max(q_agent.q_table[state_idx])

im = plt.imshow(q_values_grid, cmap='viridis')
plt.colorbar(im)
plt.title('Q-Learning Value Function')
plt.xlabel('Column')
plt.ylabel('Row')

# Q-learning policy
plt.subplot(2, 3, 3)
policy_grid = np.zeros((env.size, env.size))
action_symbols = ['↑', '→', '↓', '←']
for i in range(env.size):
    for j in range(env.size):
        state_idx = env.state_to_index((i, j))
        best_action = np.argmax(q_agent.q_table[state_idx])
        policy_grid[i, j] = best_action
        plt.text(j, i, action_symbols[best_action], ha='center', va='center', fontsize=12)

plt.imshow(policy_grid, cmap='Set3', alpha=0.3)
plt.title('Q-Learning Policy')
plt.xlabel('Column')
plt.ylabel('Row')
plt.xticks(range(env.size))
plt.yticks(range(env.size))

# Policy gradient policy probabilities
plt.subplot(2, 3, 4)
pg_policy_entropy = np.zeros((env.size, env.size))
for i in range(env.size):
    for j in range(env.size):
        state_idx = env.state_to_index((i, j))
        probs = pg_agent.get_action_probabilities(state_idx)
        entropy = -np.sum(probs * np.log(probs + 1e-8))
        pg_policy_entropy[i, j] = entropy

im = plt.imshow(pg_policy_entropy, cmap='plasma')
plt.colorbar(im)
plt.title('Policy Gradient Entropy')
plt.xlabel('Column')
plt.ylabel('Row')

# Episode lengths
plt.subplot(2, 3, 5)
q_lengths_smooth = np.convolve(q_agent.episode_lengths, np.ones(window)/window, mode='valid')
pg_lengths_smooth = np.convolve(pg_agent.episode_lengths, np.ones(window)/window, mode='valid')
plt.plot(q_lengths_smooth, 'b-', label='Q-Learning', linewidth=2)
plt.plot(pg_lengths_smooth, 'r-', label='Policy Gradient', linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Episode Length')
plt.title('Episode Lengths')
plt.legend()
plt.grid(True)

# Comparison metrics
plt.subplot(2, 3, 6)
methods = ['Q-Learning', 'Policy Gradient']
avg_rewards = [q_avg_reward, pg_avg_reward]
success_rates = [q_success * 100, pg_success * 100]

x = np.arange(len(methods))
width = 0.35

plt.bar(x - width/2, avg_rewards, width, label='Avg Reward', alpha=0.8)
plt.bar(x + width/2, success_rates, width, label='Success Rate (%)', alpha=0.8)

plt.xlabel('Method')
plt.ylabel('Performance')
plt.title('Performance Comparison')
plt.xticks(x, methods)
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nKey Differences:")
print(f"Q-Learning (Value-based):")
print(f"  - Learns action-value function Q(s,a)")
print(f"  - Uses ε-greedy exploration")
print(f"  - Off-policy learning")
print(f"  - Deterministic optimal policy")
print(f"\nPolicy Gradient (Policy-based):")
print(f"  - Directly learns policy parameters")
print(f"  - Natural exploration through stochastic policy")
print(f"  - On-policy learning")
print(f"  - Can handle continuous action spaces")

## Question 3: Meta-Learning and Few-Shot Learning

**Question**: Implement a simple meta-learning algorithm (MAML-style) and demonstrate few-shot learning. Compare with traditional transfer learning approaches and analyze adaptation mechanisms.

### Theory: Meta-Learning

**1. Model-Agnostic Meta-Learning (MAML):**
$$\theta_i' = \theta - \alpha \nabla_\theta L_{T_i}(\theta)$$
$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{i} L_{T_i}(\theta_i')$$

Where:
- $\theta$ = meta-parameters
- $\theta_i'$ = parameters adapted to task $T_i$
- $\alpha$ = inner learning rate
- $\beta$ = outer learning rate
- $L_{T_i}$ = task-specific loss for task $T_i$

**2. Few-Shot Learning Setup:**
- Support set: $S = \{(x_i, y_i)\}_{i=1}^{Nk}$ ($k$ examples for each of $N$ classes)
- Query set: $Q = \{(x_j, y_j)\}_{j=1}^m$ (test examples)
- Goal: Learn from $S$ to classify $Q$
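
The setup above can be sketched as an episode sampler that draws a support set and a disjoint query set from a labelled pool. The function name `sample_episode`, its parameters, and the synthetic pool are hypothetical and only illustrate the data layout.

```python
import numpy as np


def sample_episode(X, y, n_way=3, k_shot=2, n_query=4, seed=None):
    """Sample one N-way k-shot episode; labels are remapped to 0..n_way-1."""
    rng = np.random.default_rng(seed)
    classes = rng.choice(np.unique(y), size=n_way, replace=False)
    xs, ys, xq, yq = [], [], [], []
    for new_label, cls in enumerate(classes):
        # Shuffle this class's indices, then split into support and query
        idx = rng.permutation(np.flatnonzero(y == cls))[: k_shot + n_query]
        xs.append(X[idx[:k_shot]])
        ys.append(np.full(k_shot, new_label))
        xq.append(X[idx[k_shot:]])
        yq.append(np.full(len(idx) - k_shot, new_label))
    return (np.concatenate(xs), np.concatenate(ys),
            np.concatenate(xq), np.concatenate(yq))


# Synthetic pool: 5 classes with 20 examples each (assumed for the demo)
rng_demo = np.random.default_rng(0)
y_pool = np.repeat(np.arange(5), 20)
X_pool = rng_demo.normal(size=(100, 8)) + y_pool[:, None]

Xs, ys, Xq, yq = sample_episode(X_pool, y_pool, n_way=3, k_shot=2, n_query=4, seed=1)
print(Xs.shape, Xq.shape)  # support: n_way * k_shot rows, query: n_way * n_query rows
```

Remapping labels per episode keeps the classifier head at `n_way` outputs regardless of how many classes the pool contains.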

**3. Meta-Learning Objective:**
$$\min_\theta \sum_{T_i \sim p(T)} L_{T_i}(f_{\theta_i'})$$

Where $\theta_i' = \theta - \alpha \nabla_\theta L_{T_i}(f_\theta)$
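
The outer gradient differentiates through the inner update, which introduces a second-order factor. As a toy sketch, assume a scalar parameter and the quadratic task loss $L_T(\theta) = \tfrac{1}{2}(\theta - c_T)^2$, so that $d\theta'/d\theta = 1 - \alpha$. The code compares the exact meta-gradient, the first-order approximation that drops this factor, and a finite-difference check. All function names and values are hypothetical.

```python
def task_loss(theta, target):
    """Toy quadratic task loss: 0.5 * (theta - target)^2."""
    return 0.5 * (theta - target) ** 2


def task_grad(theta, target):
    return theta - target


def maml_meta_gradient(theta, targets, inner_lr):
    """Exact meta-gradient, summed over tasks, for the quadratic toy loss."""
    total = 0.0
    for target in targets:
        theta_adapted = theta - inner_lr * task_grad(theta, target)  # inner step
        # Chain rule: dL(theta')/dtheta = L'(theta') * dtheta'/dtheta,
        # and dtheta'/dtheta = 1 - inner_lr * L''(theta) = 1 - inner_lr here.
        total += task_grad(theta_adapted, target) * (1.0 - inner_lr)
    return total


def first_order_meta_gradient(theta, targets, inner_lr):
    """First-order approximation: treats dtheta'/dtheta as 1."""
    return sum(task_grad(theta - inner_lr * task_grad(theta, t), t) for t in targets)


def numeric_meta_gradient(theta, targets, inner_lr, h=1e-6):
    """Finite-difference gradient of the meta-objective, used as a check."""
    def meta_objective(th):
        return sum(task_loss(th - inner_lr * task_grad(th, t), t) for t in targets)
    return (meta_objective(theta + h) - meta_objective(theta - h)) / (2 * h)


theta_demo, targets_demo, inner_lr_demo = 0.0, [1.0, 3.0], 0.1
exact_grad = maml_meta_gradient(theta_demo, targets_demo, inner_lr_demo)
fo_grad = first_order_meta_gradient(theta_demo, targets_demo, inner_lr_demo)
num_grad = numeric_meta_gradient(theta_demo, targets_demo, inner_lr_demo)
print(exact_grad, fo_grad, num_grad)
```

In this example the first-order gradient differs from the exact one only by the constant factor $1 - \alpha$. For neural networks the factor is a Hessian-dependent matrix, which is why first-order variants are cheaper but only approximate.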

**4. Adaptation vs Transfer:**
- Transfer Learning: Pre-train → Fine-tune
- Meta-Learning: Learn to adapt quickly to new tasks

In [None]:
class SimpleMAML:
    def __init__(self, input_dim, hidden_dim, output_dim, inner_lr=0.01, outer_lr=0.001):
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        self.inner_lr = inner_lr
        self.outer_lr = outer_lr
        
        # Initialize meta-parameters
        self.meta_params = self._initialize_parameters()
        
        # Training history
        self.meta_losses = []
        self.adaptation_accuracies = []
    
    def _initialize_parameters(self):
        """Initialize neural network parameters"""
        params = {
            'W1': np.random.randn(self.input_dim, self.hidden_dim) * 0.1,
            'b1': np.zeros(self.hidden_dim),
            'W2': np.random.randn(self.hidden_dim, self.output_dim) * 0.1,
            'b2': np.zeros(self.output_dim)
        }
        return params
    
    def relu(self, x):
        return np.maximum(0, x)
    
    def softmax(self, x):
        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)
    
    def forward(self, X, params):
        """Forward pass through network"""
        h = self.relu(np.dot(X, params['W1']) + params['b1'])
        logits = np.dot(h, params['W2']) + params['b2']
        probs = self.softmax(logits)
        return probs, h
    
    def compute_loss(self, X, y, params):
        """Compute cross-entropy loss"""
        probs, _ = self.forward(X, params)
        # Convert labels to one-hot if needed
        if len(y.shape) == 1:
            y_onehot = np.zeros((y.shape[0], self.output_dim))
            y_onehot[np.arange(y.shape[0]), y] = 1
            y = y_onehot
        
        loss = -np.sum(y * np.log(probs + 1e-8)) / X.shape[0]
        return loss
    
    def compute_gradients(self, X, y, params):
        """Compute gradients using backpropagation"""
        batch_size = X.shape[0]
        
        # Forward pass
        probs, h = self.forward(X, params)
        
        # Convert labels to one-hot if needed
        if len(y.shape) == 1:
            y_onehot = np.zeros((y.shape[0], self.output_dim))
            y_onehot[np.arange(y.shape[0]), y] = 1
            y = y_onehot
        
        # Backward pass
        # Output layer gradients
        dlogits = (probs - y) / batch_size
        dW2 = np.dot(h.T, dlogits)
        db2 = np.sum(dlogits, axis=0)
        
        # Hidden layer gradients
        dh = np.dot(dlogits, params['W2'].T)
        dh_relu = dh * (h > 0)  # ReLU derivative
        
        dW1 = np.dot(X.T, dh_relu)
        db1 = np.sum(dh_relu, axis=0)
        
        gradients = {
            'W1': dW1,
            'b1': db1,
            'W2': dW2,
            'b2': db2
        }
        
        return gradients
    
    def inner_update(self, X_support, y_support, params, num_steps=1):
        """Perform inner loop adaptation"""
        adapted_params = {k: v.copy() for k, v in params.items()}
        
        for _ in range(num_steps):
            gradients = self.compute_gradients(X_support, y_support, adapted_params)
            
            # Update parameters
            for key in adapted_params:
                adapted_params[key] -= self.inner_lr * gradients[key]
        
        return adapted_params
    
    def meta_update(self, tasks):
        """Perform a first-order meta-update (no second derivatives) over multiple tasks"""
        meta_gradients = {k: np.zeros_like(v) for k, v in self.meta_params.items()}
        total_loss = 0
        
        for X_support, y_support, X_query, y_query in tasks:
            # Inner loop: adapt to support set
            adapted_params = self.inner_update(X_support, y_support, self.meta_params)
            
            # Compute loss on query set
            query_loss = self.compute_loss(X_query, y_query, adapted_params)
            total_loss += query_loss
            
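            # First-order meta-gradient: gradient of the query loss evaluated at the
            # adapted parameters; the Hessian term from the inner update is ignored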
            query_gradients = self.compute_gradients(X_query, y_query, adapted_params)
            
            # Add to meta-gradients
            for key in meta_gradients:
                meta_gradients[key] += query_gradients[key]
        
        # Average gradients across tasks
        num_tasks = len(tasks)
        for key in meta_gradients:
            meta_gradients[key] /= num_tasks
        
        # Update meta-parameters
        for key in self.meta_params:
            self.meta_params[key] -= self.outer_lr * meta_gradients[key]
        
        avg_loss = total_loss / num_tasks
        return avg_loss
    
    def evaluate_adaptation(self, X_support, y_support, X_query, y_query, num_steps=1):
        """Evaluate adaptation on a single task"""
        # Before adaptation
        probs_before, _ = self.forward(X_query, self.meta_params)
        pred_before = np.argmax(probs_before, axis=1)
        acc_before = np.mean(pred_before == y_query)
        
        # After adaptation
        adapted_params = self.inner_update(X_support, y_support, self.meta_params, num_steps)
        probs_after, _ = self.forward(X_query, adapted_params)
        pred_after = np.argmax(probs_after, axis=1)
        acc_after = np.mean(pred_after == y_query)
        
        return acc_before, acc_after

class FewShotDataGenerator:
    def __init__(self, n_classes=5, n_features=20, noise_level=0.1):
        self.n_classes = n_classes
        self.n_features = n_features
        self.noise_level = noise_level
        
        # Generate class prototypes
        self.class_prototypes = np.random.randn(n_classes, n_features)
    
    def generate_task(self, n_way=3, k_shot=5, n_query=10):
        """Generate a few-shot learning task"""
        # Sample classes for this task
        task_classes = np.random.choice(self.n_classes, n_way, replace=False)
        
        X_support, y_support = [], []
        X_query, y_query = [], []
        
        for i, class_idx in enumerate(task_classes):
            prototype = self.class_prototypes[class_idx]
            
            # Generate support examples
            support_examples = prototype + np.random.randn(k_shot, self.n_features) * self.noise_level
            X_support.extend(support_examples)
            y_support.extend([i] * k_shot)
            
            # Generate query examples
            query_examples = prototype + np.random.randn(n_query, self.n_features) * self.noise_level
            X_query.extend(query_examples)
            y_query.extend([i] * n_query)
        
        # Convert to numpy arrays and shuffle
        X_support, y_support = np.array(X_support), np.array(y_support)
        X_query, y_query = np.array(X_query), np.array(y_query)
        
        # Shuffle
        support_idx = np.random.permutation(len(X_support))
        query_idx = np.random.permutation(len(X_query))
        
        X_support, y_support = X_support[support_idx], y_support[support_idx]
        X_query, y_query = X_query[query_idx], y_query[query_idx]
        
        return X_support, y_support, X_query, y_query

# Test meta-learning implementation
print("Testing Meta-Learning Implementation...")

# Initialize
np.random.seed(42)
n_features = 20
n_way = 3  # 3-way classification
k_shot = 5  # 5 examples per class
n_query = 10  # 10 query examples per class

# Create data generator and MAML model
data_generator = FewShotDataGenerator(n_classes=10, n_features=n_features)
maml = SimpleMAML(input_dim=n_features, hidden_dim=40, output_dim=n_way)

print(f"Few-shot setup: {n_way}-way, {k_shot}-shot")
print(f"Features: {n_features}, Hidden units: 40")

# Meta-training
n_meta_iterations = 200
tasks_per_iteration = 4

print(f"\nMeta-training for {n_meta_iterations} iterations...")

for iteration in range(n_meta_iterations):
    # Generate batch of tasks
    tasks = []
    for _ in range(tasks_per_iteration):
        task = data_generator.generate_task(n_way=n_way, k_shot=k_shot, n_query=n_query)
        tasks.append(task)
    
    # Meta-update
    meta_loss = maml.meta_update(tasks)
    maml.meta_losses.append(meta_loss)
    
    # Evaluate adaptation periodically
    if iteration % 50 == 0:
        # Test on a new task
        test_task = data_generator.generate_task(n_way=n_way, k_shot=k_shot, n_query=n_query)
        acc_before, acc_after = maml.evaluate_adaptation(*test_task)
        
        print(f"Iteration {iteration}: Meta Loss = {meta_loss:.4f}, "
              f"Accuracy Before = {acc_before:.3f}, After = {acc_after:.3f}")

print("\nMeta-training completed!")
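
Hand-written backpropagation of this kind is easy to get subtly wrong, so a finite-difference check is a useful sanity test. The sketch below is a standalone illustration, not part of `SimpleMAML`: it rebuilds the same two-layer ReLU network with softmax cross-entropy (using a log-softmax rather than `log(probs + 1e-8)`) and compares the analytic gradients with central differences. The names `loss_and_grads` and `max_grad_error` are assumptions made for this example.

```python
import numpy as np


def loss_and_grads(X, y, p):
    """Mean cross-entropy of a 2-layer ReLU network and its analytic gradients."""
    h = np.maximum(0, X @ p['W1'] + p['b1'])
    logits = h @ p['W2'] + p['b2']
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    n = X.shape[0]
    loss = -log_probs[np.arange(n), y].mean()

    # d(loss)/d(logits) = (softmax - one_hot) / n
    dlogits = np.exp(log_probs)
    dlogits[np.arange(n), y] -= 1.0
    dlogits /= n
    dh = (dlogits @ p['W2'].T) * (h > 0)  # ReLU derivative
    grads = {'W1': X.T @ dh, 'b1': dh.sum(axis=0),
             'W2': h.T @ dlogits, 'b2': dlogits.sum(axis=0)}
    return loss, grads


def max_grad_error(X, y, p, eps=1e-5):
    """Largest absolute gap between analytic and central-difference gradients."""
    _, grads = loss_and_grads(X, y, p)
    worst = 0.0
    for key, value in p.items():
        for idx in np.ndindex(value.shape):
            original = value[idx]
            value[idx] = original + eps
            loss_plus, _ = loss_and_grads(X, y, p)
            value[idx] = original - eps
            loss_minus, _ = loss_and_grads(X, y, p)
            value[idx] = original  # restore the parameter
            numeric = (loss_plus - loss_minus) / (2 * eps)
            worst = max(worst, abs(numeric - grads[key][idx]))
    return worst


rng = np.random.default_rng(0)
X, y = rng.normal(size=(6, 4)), rng.integers(0, 3, size=6)
params = {'W1': rng.normal(size=(4, 5)) * 0.5, 'b1': rng.normal(size=5) * 0.1,
          'W2': rng.normal(size=(5, 3)) * 0.5, 'b2': np.zeros(3)}
print(f"max gradient error: {max_grad_error(X, y, params):.2e}")
```

A mismatch much larger than roughly `1e-6` usually points to a bug in the backward pass, although a pre-activation that sits almost exactly at the ReLU kink can also produce a spurious gap.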

In [None]:
# Compare with traditional transfer learning
class TraditionalTransferLearning:
    def __init__(self, input_dim, hidden_dim, output_dim, learning_rate=0.01):
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        self.learning_rate = learning_rate
        
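        # Stand-in for pre-trained parameters: these are randomly initialized and
        # no pre-training is run, so this baseline is fine-tuning from scratch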
        self.pretrained_params = {
            'W1': np.random.randn(input_dim, hidden_dim) * 0.1,
            'b1': np.zeros(hidden_dim),
            'W2': np.random.randn(hidden_dim, output_dim) * 0.1,
            'b2': np.zeros(output_dim)
        }
    
    def forward(self, X, params):
        h = np.maximum(0, np.dot(X, params['W1']) + params['b1'])
        logits = np.dot(h, params['W2']) + params['b2']
        probs = np.exp(logits - np.max(logits, axis=1, keepdims=True))
        probs = probs / np.sum(probs, axis=1, keepdims=True)
        return probs, h
    
    def compute_gradients(self, X, y, params):
        batch_size = X.shape[0]
        probs, h = self.forward(X, params)
        
        if len(y.shape) == 1:
            y_onehot = np.zeros((y.shape[0], self.output_dim))
            y_onehot[np.arange(y.shape[0]), y] = 1
            y = y_onehot
        
        dlogits = (probs - y) / batch_size
        dW2 = np.dot(h.T, dlogits)
        db2 = np.sum(dlogits, axis=0)
        
        dh = np.dot(dlogits, params['W2'].T)
        dh_relu = dh * (h > 0)
        
        dW1 = np.dot(X.T, dh_relu)
        db1 = np.sum(dh_relu, axis=0)
        
        return {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}
    
    def fine_tune(self, X_support, y_support, num_steps=10):
        params = {k: v.copy() for k, v in self.pretrained_params.items()}
        
        for _ in range(num_steps):
            gradients = self.compute_gradients(X_support, y_support, params)
            for key in params:
                params[key] -= self.learning_rate * gradients[key]
        
        return params
    
    def evaluate(self, X_support, y_support, X_query, y_query, num_steps=10):
        # Before fine-tuning
        probs_before, _ = self.forward(X_query, self.pretrained_params)
        pred_before = np.argmax(probs_before, axis=1)
        acc_before = np.mean(pred_before == y_query)
        
        # After fine-tuning
        finetuned_params = self.fine_tune(X_support, y_support, num_steps)
        probs_after, _ = self.forward(X_query, finetuned_params)
        pred_after = np.argmax(probs_after, axis=1)
        acc_after = np.mean(pred_after == y_query)
        
        return acc_before, acc_after

# Comprehensive evaluation
print("\nComprehensive Evaluation...")

transfer_model = TraditionalTransferLearning(n_features, 40, n_way)

# Generate test tasks
n_test_tasks = 50
maml_results = {'before': [], 'after': []}
transfer_results = {'before': [], 'after': []}

for _ in range(n_test_tasks):
    test_task = data_generator.generate_task(n_way=n_way, k_shot=k_shot, n_query=n_query)
    
    # MAML evaluation
    acc_before, acc_after = maml.evaluate_adaptation(*test_task, num_steps=5)
    maml_results['before'].append(acc_before)
    maml_results['after'].append(acc_after)
    
    # Transfer learning evaluation
    acc_before, acc_after = transfer_model.evaluate(*test_task, num_steps=20)
    transfer_results['before'].append(acc_before)
    transfer_results['after'].append(acc_after)

# Compute statistics
maml_before_mean = np.mean(maml_results['before'])
maml_after_mean = np.mean(maml_results['after'])
transfer_before_mean = np.mean(transfer_results['before'])
transfer_after_mean = np.mean(transfer_results['after'])

print(f"\nEvaluation Results on {n_test_tasks} test tasks:")
print(f"MAML:")
print(f"  Before adaptation: {maml_before_mean:.3f} ± {np.std(maml_results['before']):.3f}")
print(f"  After adaptation:  {maml_after_mean:.3f} ± {np.std(maml_results['after']):.3f}")
print(f"  Improvement: {maml_after_mean - maml_before_mean:.3f}")

print(f"\nTransfer Learning:")
print(f"  Before fine-tuning: {transfer_before_mean:.3f} ± {np.std(transfer_results['before']):.3f}")
print(f"  After fine-tuning:  {transfer_after_mean:.3f} ± {np.std(transfer_results['after']):.3f}")
print(f"  Improvement: {transfer_after_mean - transfer_before_mean:.3f}")

# Visualize results
plt.figure(figsize=(15, 8))

# Meta-learning curve
plt.subplot(2, 3, 1)
plt.plot(maml.meta_losses, 'b-', linewidth=2)
plt.xlabel('Meta-iteration')
plt.ylabel('Meta Loss')
plt.title('MAML Meta-training Curve')
plt.grid(True)

# Adaptation comparison
plt.subplot(2, 3, 2)
methods = ['MAML', 'Transfer Learning']
before_scores = [maml_before_mean, transfer_before_mean]
after_scores = [maml_after_mean, transfer_after_mean]

x = np.arange(len(methods))
width = 0.35

plt.bar(x - width/2, before_scores, width, label='Before Adaptation', alpha=0.8)
plt.bar(x + width/2, after_scores, width, label='After Adaptation', alpha=0.8)

plt.xlabel('Method')
plt.ylabel('Accuracy')
plt.title('Few-Shot Learning Performance')
plt.xticks(x, methods)
plt.legend()
plt.grid(True, alpha=0.3)

# Improvement distribution
plt.subplot(2, 3, 3)
maml_improvements = np.array(maml_results['after']) - np.array(maml_results['before'])
transfer_improvements = np.array(transfer_results['after']) - np.array(transfer_results['before'])

plt.hist(maml_improvements, alpha=0.7, label='MAML', bins=15, density=True)
plt.hist(transfer_improvements, alpha=0.7, label='Transfer Learning', bins=15, density=True)
plt.xlabel('Accuracy Improvement')
plt.ylabel('Density')
plt.title('Distribution of Improvements')
plt.legend()
plt.grid(True, alpha=0.3)

# Sample adaptation trajectory
plt.subplot(2, 3, 4)
sample_task = data_generator.generate_task(n_way=n_way, k_shot=k_shot, n_query=n_query)
X_support, y_support, X_query, y_query = sample_task

# Track adaptation over multiple steps
adaptation_steps = range(1, 11)
maml_trajectory = []
transfer_trajectory = []

for steps in adaptation_steps:
    _, acc_maml = maml.evaluate_adaptation(X_support, y_support, X_query, y_query, steps)
    _, acc_transfer = transfer_model.evaluate(X_support, y_support, X_query, y_query, steps)
    maml_trajectory.append(acc_maml)
    transfer_trajectory.append(acc_transfer)

plt.plot(adaptation_steps, maml_trajectory, 'b-o', label='MAML', linewidth=2)
plt.plot(adaptation_steps, transfer_trajectory, 'r-s', label='Transfer Learning', linewidth=2)
plt.xlabel('Adaptation Steps')
plt.ylabel('Accuracy')
plt.title('Adaptation Trajectory')
plt.legend()
plt.grid(True)

# Error analysis
plt.subplot(2, 3, 5)
methods = ['MAML\nBefore', 'MAML\nAfter', 'Transfer\nBefore', 'Transfer\nAfter']
means = [maml_before_mean, maml_after_mean, transfer_before_mean, transfer_after_mean]
stds = [np.std(maml_results['before']), np.std(maml_results['after']), 
        np.std(transfer_results['before']), np.std(transfer_results['after'])]

plt.bar(range(len(methods)), means, yerr=stds, capsize=5, alpha=0.8)
plt.xlabel('Method and Stage')
plt.ylabel('Accuracy')
plt.title('Performance with Error Bars')
plt.xticks(range(len(methods)), methods)
plt.grid(True, alpha=0.3)

# Statistical significance test (simplified)
plt.subplot(2, 3, 6)
from scipy import stats

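# Compare final performance. Both methods are scored on the same test tasks,
# so the samples are paired; the independent-samples test below is a simplification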
t_stat, p_value = stats.ttest_ind(maml_results['after'], transfer_results['after'])

comparison_text = f"MAML vs Transfer Learning\n\n"
comparison_text += f"MAML Mean: {maml_after_mean:.3f}\n"
comparison_text += f"Transfer Mean: {transfer_after_mean:.3f}\n\n"
comparison_text += f"t-statistic: {t_stat:.3f}\n"
comparison_text += f"p-value: {p_value:.6f}\n\n"
comparison_text += f"Significant difference: {p_value < 0.05}"

plt.text(0.1, 0.5, comparison_text, transform=plt.gca().transAxes, 
         fontsize=10, verticalalignment='center',
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))
plt.axis('off')
plt.title('Statistical Comparison')

plt.tight_layout()
plt.show()

print(f"\nKey Insights:")
print(f"1. MAML learns initialization that enables fast adaptation")
print(f"2. Transfer learning relies on feature extraction + fine-tuning")
print(f"3. MAML shows {'better' if maml_after_mean > transfer_after_mean else 'similar or worse'} few-shot performance")
print(f"4. Both methods demonstrate the importance of prior knowledge")
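
Because both methods are evaluated on the same test tasks, the per-task accuracies form pairs, and a paired test is the more natural comparison. The sketch below computes the paired t-statistic with the standard library only; the accuracy lists are made-up illustrative values, not results from this notebook, and `paired_t_statistic` is a hypothetical helper. SciPy's `scipy.stats.ttest_rel` performs the same test and also returns a p-value.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t-statistic for the mean of the per-task differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Made-up per-task accuracies, for illustration only
method_a = [0.90, 0.80, 0.85, 0.95, 0.70]
method_b = [0.85, 0.70, 0.80, 0.90, 0.60]
t = paired_t_statistic(method_a, method_b)
print(f"t = {t:.3f} with {len(method_a) - 1} degrees of freedom")
```

Pairing removes the task-to-task variation in difficulty from the comparison, which typically makes the test more sensitive than treating the two accuracy lists as independent samples.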

## Summary: Advanced Topics and Specialized Methods

This notebook covered three advanced machine learning topics:

### 1. Generative Models
- **Gaussian Mixture Models (GMM)**: Classical probabilistic approach using EM algorithm
- **Variational Autoencoders (VAE)**: Neural network-based generative modeling
- **Key Differences**: GMM provides interpretable components; VAE learns complex representations
- **Applications**: Density estimation, data generation, unsupervised learning

### 2. Reinforcement Learning
- **Q-Learning**: Value-based method learning optimal action-value function
- **Policy Gradient**: Policy-based method directly optimizing policy parameters
- **Key Trade-offs**: Q-learning is often more sample-efficient because it learns off-policy, but its basic form assumes discrete actions; policy gradients handle continuous actions but suffer from higher-variance gradient estimates
- **Applications**: Game playing, robotics, autonomous systems

### 3. Meta-Learning and Few-Shot Learning
- **MAML**: Model-Agnostic Meta-Learning for fast adaptation
- **Transfer Learning**: Traditional pre-training + fine-tuning approach
- **Key Innovation**: Learning to learn - optimizing for fast adaptation rather than just task performance
- **Applications**: Few-shot classification, rapid adaptation, learning across domains

### Common Themes
1. **Adaptation**: All methods involve adapting to new scenarios (new data distributions, environments, tasks)
2. **Prior Knowledge**: Leveraging previous experience to improve performance on new problems
3. **Optimization**: Advanced optimization techniques beyond standard supervised learning
4. **Generalization**: Going beyond memorization to learn transferable patterns

These topics range from classical methods (GMMs, Q-learning) to active research areas (VAEs, meta-learning), and they are most useful in settings with limited data, dynamic environments, or complex data distributions.