# Understanding Mixture-of-Experts (MoE) Models in AI
## A Deep Dive into Their Structure and Applications

This notebook walks through the core ideas behind Mixture-of-Experts (MoE) models and builds a small MoE in PyTorch, from the expert and gating networks through training on synthetic data.

In [None]:
# Import required libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Basic MoE Architecture

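A Mixture-of-Experts model consists of multiple expert networks and a gating network that decides how much each expert contributes for a given input. In the basic version implemented below, every expert processes every input, and the gate's softmax weights determine how the expert outputs are blended.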

In [None]:
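class Expert(nn.Module):
    """A single expert: a two-layer feed-forward network."""

    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        # x: (batch, input_dim) -> (batch, output_dim)
        return self.net(x)

class GatingNetwork(nn.Module):
    """Maps each input to a probability distribution over the experts."""

    def __init__(self, input_dim, num_experts):
        super().__init__()
        self.gate = nn.Linear(input_dim, num_experts)

    def forward(self, x):
        # Softmax over the expert axis, so each row sums to 1: (batch, num_experts)
        return F.softmax(self.gate(x), dim=-1)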
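
As a quick sanity check on the gating idea, the sketch below builds a standalone linear layer followed by a softmax and confirms that each input receives a probability distribution over the experts. The sizes (4 samples, 10 features, 3 experts) and the seed are arbitrary values chosen for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # fixed seed so the illustration is repeatable

# Illustrative sizes, not taken from the rest of the notebook
batch_size, input_dim, num_experts = 4, 10, 3

gate = nn.Linear(input_dim, num_experts)
x = torch.randn(batch_size, input_dim)

# One row of weights per sample, one column per expert
gate_weights = F.softmax(gate(x), dim=-1)

print(gate_weights.shape)        # torch.Size([4, 3])
print(gate_weights.sum(dim=-1))  # every entry is 1 up to floating-point error
```

Because the softmax is taken over the last dimension, the weights for each sample are non-negative and sum to 1, which is what lets the model treat them as mixing proportions.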

## 2. Complete MoE Implementation

Now let's implement the full MoE model, which combines the experts and the gating network.

In [None]:
class MoEModel(nn.Module):
    def __init__(self, num_experts, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.experts = nn.ModuleList([
            Expert(input_dim, hidden_dim, output_dim) 
            for _ in range(num_experts)
        ])
        self.gating = GatingNetwork(input_dim, num_experts)
        
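    def forward(self, x):
        # Gate weights for each sample: (batch, num_experts)
        gates = self.gating(x)

        # Stack expert outputs along dim=1: (batch, num_experts, output_dim)
        expert_outputs = torch.stack([expert(x) for expert in self.experts], dim=1)

        # Weight each expert's output by its gate and sum over the expert axis.
        # gates.unsqueeze(-1) has shape (batch, num_experts, 1) and broadcasts
        # over output_dim, giving a result of shape (batch, output_dim).
        final_output = torch.sum(gates.unsqueeze(-1) * expert_outputs, dim=1)
        return final_output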
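
The gate-weighted combination is easy to verify by hand on a tiny example. The sketch below uses made-up gate weights and expert outputs (hand-picked numbers, not produced by the model), assuming the layout `(batch, num_experts)` for the gates and `(batch, num_experts, output_dim)` for the expert outputs.

```python
import torch

# Two samples, three experts (hand-picked illustrative values)
gates = torch.tensor([[1.0, 0.0, 0.0],
                      [0.5, 0.5, 0.0]])                    # (batch, num_experts)

# Each expert emits one value per sample                   # (batch, num_experts, output_dim)
expert_outputs = torch.tensor([[[10.0], [20.0], [30.0]],
                               [[10.0], [20.0], [30.0]]])

# Broadcast the gates over the output dimension, then sum over the expert axis
combined = (gates.unsqueeze(-1) * expert_outputs).sum(dim=1)  # (batch, output_dim)

# The same contraction written with einsum
combined_einsum = torch.einsum('be,beo->bo', gates, expert_outputs)

print(combined)  # first row is [10.], second row is [15.]
```

The first sample puts all of its weight on expert 0, so its output is 10. The second splits its weight evenly between experts 0 and 1, giving 0.5 * 10 + 0.5 * 20 = 15.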

## 3. Training Example

Let's train the MoE model on a simple synthetic regression task: predicting the sum of squares of a 10-dimensional input.

In [None]:
# Generate synthetic data
def generate_data(n_samples=1000):
    X = torch.randn(n_samples, 10)  # 10-dimensional input
    y = torch.sum(X ** 2, dim=1)  # Target is sum of squares
    return X, y

# Create model and optimizer
model = MoEModel(num_experts=5, input_dim=10, hidden_dim=32, output_dim=1)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.MSELoss()

# Training loop
losses = []
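for epoch in range(100):
    # Each "epoch" here is a single gradient step on a freshly sampled batch
    X, y = generate_data()
    optimizer.zero_grad()

    # Drop the trailing output dimension so predictions match y: (n_samples,)
    output = model(X).squeeze(-1)
    loss = criterion(output, y)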
    
    loss.backward()
    optimizer.step()
    
    losses.append(loss.item())
    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item():.4f}')
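
To put the printed losses in context, it can help to know what trivial predictors score on this task. With 10 independent standard-normal features, the sum-of-squares target follows a chi-squared distribution with 10 degrees of freedom, so it has mean 10 and variance 20. The sketch below estimates two constant baselines empirically; the sample size and seed are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)  # fixed seed so the estimates are repeatable

# Same target definition as the training example, with a larger sample
X = torch.randn(100_000, 10)
y = torch.sum(X ** 2, dim=1)

# MSE of a predictor that always outputs the sample mean of y
mean_baseline_mse = torch.mean((y - y.mean()) ** 2).item()

# MSE of a predictor that always outputs zero
zero_baseline_mse = torch.mean(y ** 2).item()

print(f"mean of y:            {y.mean().item():.2f}")    # close to 10
print(f"predict-the-mean MSE: {mean_baseline_mse:.2f}")  # close to 20
print(f"predict-zero MSE:     {zero_baseline_mse:.2f}")  # close to 120
```

A training loss near 120 therefore means the model is still predicting values near zero, and only losses well under 20 indicate that it is using the input rather than just predicting the average target.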

## 4. Visualizing Training Progress

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('MoE Training Loss')
plt.grid(True)
plt.show()