# 🌡️ Workshop: Thermodynamic-Regularized Training for Energy-Aware LLM Fine-Tuning

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 30px; border-radius: 15px; color: white; text-align: center; margin: 20px 0;">
  <h2 style="margin: 0; font-size: 28px;">Reducing GPU Energy Costs in LLM Fine-Tuning</h2>
  <h3 style="margin: 10px 0; font-size: 20px; font-weight: normal;">Entropy-Regularized Optimization + Real Energy Monitoring</h3>
  <p style="margin: 15px 0; font-size: 14px; opacity: 0.9;">Inspired by thermodynamic computing ideas (e.g., Extropic, 2024)</p>
</div>

---

## 🎯 Workshop Objectives

By the end of this workshop, you will:

1. **Understand** how entropy regularization changes the optimization landscape
2. **Implement** a Thermodynamic Sampling Unit (TSU) as stochastic weights
3. **Measure** GPU energy consumption during training (NVML)
4. **Compare** a cross-entropy baseline (trained with AdamW) vs. free-energy-style objectives
5. **Analyze** energy–performance trade-offs with clear metrics

---

## 🧭 What This Notebook *Is* (and Is Not)

- **Is:** A practical, hypothesis-driven experiment on a small GPT model
- **Is not:** A guarantee of energy savings (results depend on hardware + settings)
- **Scope:** GPU training, TSU stochastic weights, energy measurements, and careful comparison

---

## 🏛️ Training Paradigms (Reality vs. Optional Concepts)

### **Paradigm 1: Classical Supervised Learning**
$$\min_\theta \; \mathcal{L}(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(f_\theta(x), y)]$$

### **Paradigm 2: Entropy-Regularized Training (This Notebook)**
$$\min_{q(\theta)} \; \mathbb{E}_{\theta\sim q}[\mathcal{L}(\theta)] - T \cdot S(q) + \lambda D_{KL}[q||p]$$

where:
- $\mathcal{L}(\theta)$: Standard loss function (cross-entropy)
- $S(q)$: Entropy of the parameter distribution (exploration pressure)
- $T$: Temperature (exploration–exploitation trade-off)
- $D_{KL}[q||p]$: Regularization toward a prior

### **Paradigm 3: QPU Optimization (Optional Demo)**
Included as a **standalone illustration** (not integrated into training results).

---

## 📊 Intended Outcomes (Hypotheses)

- **Energy Efficiency:** Potential reduction in energy per unit loss improvement
- **Training Stability:** Smoother gradients from stochastic exploration
- **Generalization:** Mild improvements in validation loss under some regimes


---

## 📐 Mathematical Framework

### **1. Classical Objective Function**

Standard supervised learning minimizes empirical risk:

$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(f_\theta(x_i), y_i)$$

**For language modeling:**
$$\mathcal{L}(\theta) = -\frac{1}{T}\sum_{t=1}^{T} \log P_\theta(x_t | x_{<t})$$

---

### **2. Thermodynamic Reformulation: Free Energy / Variational Objective**

We treat parameters as random variables with a factorized Gaussian:
$$q(\theta) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$$

We optimize a **free-energy-style** objective:

$$\boxed{\mathcal{F}(q) = \mathbb{E}_{q(\theta)}[\mathcal{L}(\theta)] - T \cdot S(q) + \lambda D_{KL}[q||p]}$$

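This connects to variational Bayes: setting $T = 0$ leaves a negative-ELBO-style loss in which the KL term is weighted by $\lambda$:

$$\mathbb{E}_{q}[\mathcal{L}] + \lambda D_{KL}[q||p] \quad \text{(negative-ELBO-style)}$$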

**Key intuition:**
- The loss term pulls parameters toward accuracy.
- The entropy term pushes toward exploration (wider distributions).
- The KL term keeps the distribution grounded in a prior.

---

### **3. Entropy and KL for Gaussian Weights**

**Differential Entropy:**
$$S(q) = \frac{1}{2}\sum_{i=1}^{d} \left(1 + \log(2\pi\sigma_i^2)\right)$$

**KL to standard normal prior:**
$$D_{KL} = \frac{1}{2}\sum_{i=1}^{d}\left(\mu_i^2 + \sigma_i^2 - \log(\sigma_i^2) - 1\right)$$
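
The two closed forms above can be sanity-checked for a single weight by comparing them with a brute-force numerical integral. The sketch below is only an illustrative check: the function names are made up for this example, and the values `mu = 0.02` and `log_var = -5` are chosen to resemble the initialization used later in the notebook.

```python
import math

def gaussian_entropy(var: float) -> float:
    """Differential entropy of N(mu, var) in nats (it does not depend on mu)."""
    return 0.5 * (1.0 + math.log(2.0 * math.pi * var))

def kl_to_standard_normal(mu: float, var: float) -> float:
    """KL[N(mu, var) || N(0, 1)] in nats."""
    return 0.5 * (mu * mu + var - math.log(var) - 1.0)

def numeric_entropy_and_kl(mu: float, var: float, n: int = 20001, width: float = 10.0):
    """Riemann-sum estimates of entropy and KL over mu +/- width * sigma."""
    sigma = math.sqrt(var)
    lo = mu - width * sigma
    dx = 2.0 * width * sigma / (n - 1)
    ent = kl = 0.0
    for k in range(n):
        x = lo + k * dx
        q = math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
        log_q = math.log(q)
        log_p = -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)  # standard normal prior
        ent -= q * log_q * dx
        kl += q * (log_q - log_p) * dx
    return ent, kl

mu, var = 0.02, math.exp(-5.0)  # assumed example values
print("closed form:", gaussian_entropy(var), kl_to_standard_normal(mu, var))
print("numerical:  ", numeric_entropy_and_kl(mu, var))
```

Note that the differential entropy is negative for small variances, so $S(q)$ summed over many narrow weights starts out as a large negative number.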

---

### **4. Attention Entropy (Monitoring Only)**

We also track **attention entropy** for interpretability:

$$H(A_i) = -\sum_{j=1}^{T} A_{ij}\log A_{ij}$$

This is **not part of the objective** here, but helps interpret training dynamics.

---

### **5. GPU Energy Consumption Model**

Total training energy:

$$E_{total} = \int_{0}^{T_{train}} P(t) \, dt \approx \sum_{i=1}^{N_{steps}} P_i \cdot \Delta t_i$$

Measured via NVIDIA NVML: `nvmlDeviceGetPowerUsage()`

**Energy efficiency metric** (loss reduction per joule, with $\Delta\mathcal{L} = \mathcal{L}_{initial} - \mathcal{L}_{final}$):
$$\eta = \frac{\Delta \mathcal{L}}{E_{total}}$$
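
As a worked illustration of the sum above, the sketch below integrates a few timestamped power readings with the trapezoidal rule and converts the result into $\eta$. The helper names, the readings, and the loss values are hypothetical and exist only to show the arithmetic; they are not measurements.

```python
from typing import List, Tuple

def integrate_energy_j(samples: List[Tuple[float, float]]) -> float:
    """Trapezoidal estimate of energy (J) from (time_s, power_w) samples."""
    total = 0.0
    for (t1, p1), (t2, p2) in zip(samples, samples[1:]):
        total += 0.5 * (p1 + p2) * (t2 - t1)
    return total

def efficiency(loss_initial: float, loss_final: float, energy_j: float) -> float:
    """Loss reduction per joule; returns 0.0 when no energy was recorded."""
    return (loss_initial - loss_final) / energy_j if energy_j > 0 else 0.0

# Hypothetical readings: (seconds since start, watts)
readings = [(0.0, 100.0), (1.0, 120.0), (3.0, 120.0), (4.0, 80.0)]
energy = integrate_energy_j(readings)  # 110 J + 240 J + 100 J
print(energy, efficiency(4.2, 1.7, energy))
```

Unevenly spaced samples are handled naturally because each interval contributes its own $\Delta t_i$.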


---

## 🚀 Experimental Pipeline

```ascii
┌─────────────────────────────────────────────────────────────┐
│          ENTROPY-REGULARIZED GPU WORKFLOW                   │
└─────────────────────────────────────────────────────────────┘

  STEP 1: Environment Setup & GPU Verification
     ├─ ✅ Verify CUDA/RTX availability
     ├─ ⚡ Initialize NVML energy monitoring
     └─ 📦 Install: PyTorch, pynvml

  STEP 2: TSU Implementation
     ├─ 🌡️ Stochastic weights via TSU linear layers
     └─ 🧮 Closed-form entropy and KL per layer

  STEP 3: Model Architecture
     ├─ 🏗️ Minimal GPT (Transformer blocks)
     ├─ 👁️ Causal self-attention + entropy tracking
     └─ 📏 ~3M parameters in the Step 7 configuration (laptop-friendly)

  STEP 4: Data Preparation
     ├─ 📚 Load Tiny Shakespeare (character-level)
     ├─ 🔤 Build vocabulary + tokenizer
     └─ 📊 Train/val splits (90/10)

  STEP 5: Training Functions (Baseline vs. TSU)
     ├─ 📉 Baseline: standard cross-entropy minimization
     ├─ 🔄 TSU: optimize L - T·S + λ·KL
     └─ ⏱️ Measure: time, energy (J), loss

  STEP 6: Optional QPU Demo (Standalone)
     ├─ 🔮 PennyLane QAOA circuit example
     └─ 🧪 Demonstration only (not in training loop)

  STEP 7: Run Experiments
     ├─ 📊 Baseline run establishes the reference
     ├─ 📈 TSU run tracks loss, free energy, entropy evolution
     └─ ⚡ Record energy for both runs

  STEP 8: Comparative Analysis
     ├─ 📊 Baseline vs. TSU
     ├─ ⚡ Energy consumption analysis
     ├─ 🎯 Training stability metrics
     └─ 💡 Efficiency gains report
```


---

## 🛠️ STEP 1: Environment Setup & GPU Verification

**Goal:** Confirm GPU availability and enable energy monitoring.

We do *not* assume any fixed speedup factor. Instead, we measure actual training time and energy on your hardware.


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import numpy as np
import math
import time
from typing import Optional, Tuple
import warnings
warnings.filterwarnings('ignore')

# Check CUDA availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"🎮 Device: {device}")

if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"🔢 CUDA Capability: {torch.cuda.get_device_capability(0)}")
    print(f"💾 Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️  CPU mode - GPU not available")


In [None]:
# NVML Energy Monitoring
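try:
    import pynvml
    pynvml.nvmlInit()
    GPU_MONITORING = True
    print("✅ NVML initialized - Energy monitoring available")
except Exception as exc:  # ImportError, or an NVML error when no NVIDIA driver/GPU is present
    GPU_MONITORING = False
    print(f"⚠️  NVML energy monitoring unavailable ({type(exc).__name__}: {exc})")
    print("   If pynvml is missing, install with: pip install pynvml")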

class NVMLPowerMeter:
    """Real-time GPU power measurement using NVIDIA Management Library"""
    def __init__(self, device_idx=0):
        if not GPU_MONITORING:
            raise RuntimeError("pynvml not available")
        self.handle = pynvml.nvmlDeviceGetHandleByIndex(device_idx)
        self.measurements = []
        
    def start(self):
        self.measurements = []
        self.start_time = time.time()
        self.sample()  # anchor the integration window at the start time
        return self

    def sample(self):
        """Get instantaneous power (Watts)"""
        power_mw = pynvml.nvmlDeviceGetPowerUsage(self.handle)  # NVML reports milliwatts
        power_w = power_mw / 1000.0
        self.measurements.append((time.time(), power_w))
        return power_w

    def stop(self) -> dict:
        """Calculate total energy consumed (Joules) by trapezoidal integration"""
        self.sample()  # close the window so the final stretch of work is counted
        if len(self.measurements) < 2:
            return {'energy_j': 0.0, 'avg_power_w': 0.0,
                    'duration_s': 0.0, 'peak_power_w': 0.0}

        total_energy = 0.0
        for (t1, p1), (t2, p2) in zip(self.measurements, self.measurements[1:]):
            total_energy += 0.5 * (p1 + p2) * (t2 - t1)

        duration = self.measurements[-1][0] - self.measurements[0][0]
        avg_power = total_energy / duration if duration > 0 else 0.0

        return {
            'energy_j': total_energy,
            'avg_power_w': avg_power,
            'duration_s': duration,
            'peak_power_w': max(p for _, p in self.measurements)
        }

print("⚡ NVMLPowerMeter class loaded")
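
The training loops later in this notebook call `sample()` once every 10 batches, so the sampling interval depends on how fast a batch runs. One possible alternative, sketched below under stated assumptions, is to sample on a fixed wall-clock interval from a background thread. `BackgroundPowerSampler` and its `read_power_w` argument are hypothetical names introduced for this example; the demo uses a fake constant reader so it runs without a GPU.

```python
import threading
import time

class BackgroundPowerSampler:
    """Samples a power-reading callable at a fixed interval in a daemon thread."""

    def __init__(self, read_power_w, interval_s: float = 0.1):
        self._read = read_power_w          # callable returning watts (assumption)
        self._interval = interval_s
        self._stop = threading.Event()
        self._thread = None
        self.samples = []                  # list of (timestamp_s, power_w)

    def _run(self):
        while not self._stop.is_set():
            self.samples.append((time.perf_counter(), self._read()))
            self._stop.wait(self._interval)

    def __enter__(self):
        self.samples = []
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()
        # One last reading so the integration window ends at exit time
        self.samples.append((time.perf_counter(), self._read()))
        return False

    def energy_j(self) -> float:
        """Trapezoidal integral of the collected samples."""
        pairs = zip(self.samples, self.samples[1:])
        return sum(0.5 * (p1 + p2) * (t2 - t1) for (t1, p1), (t2, p2) in pairs)

# Demo with a fake reader that always reports 50 W
with BackgroundPowerSampler(lambda: 50.0, interval_s=0.01) as sampler:
    time.sleep(0.05)
print(len(sampler.samples), "samples,", round(sampler.energy_j(), 3), "J")
```

With NVML available, the reader could be `lambda: pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0`, where `handle` comes from `nvmlDeviceGetHandleByIndex` as in `NVMLPowerMeter`.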

---

## 🌡️ STEP 2: Thermodynamic Sampling Unit (TSU) Implementation

**Key Idea:** Replace selected linear layers with *stochastic weights*.

We model each weight as:
$$w_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$$

**Reparameterized sampling (differentiable):**
$$w_i = \mu_i + \sigma_i \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0,1)$$

**Entropy term (Gaussian):**
$$S(q) = \frac{1}{2}\sum_i \left(1 + \log(2\pi\sigma_i^2)\right)$$

**Training objective:**
$$\mathcal{F} = \mathcal{L} - T \cdot S(q) + \lambda D_{KL}[q||p]$$

This creates a controlled exploration pressure on the parameters. If the entropy weight is too high, the model can inflate variance without improving the loss, so we keep it small and monitor convergence.
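
The reparameterized draw can be checked empirically with nothing but the standard library. The sketch below assumes the convention used by `TSULinear` in the next cell, where the sampling standard deviation is `temperature * exp(0.5 * log_var)`; the helper names and the example values are made up for illustration.

```python
import math
import random
import statistics

def sample_weights(mu, log_var, temperature=1.0, n=20000, seed=0):
    """Draw n reparameterized samples w = mu + std * eps with std = T * exp(0.5 * log_var)."""
    rng = random.Random(seed)
    std = math.exp(0.5 * log_var) * temperature
    return [mu + std * rng.gauss(0.0, 1.0) for _ in range(n)]

def effective_log_var(log_var, temperature):
    """Log-variance of the distribution actually sampled when std is scaled by T."""
    return log_var + 2.0 * math.log(temperature)

draws = sample_weights(mu=0.1, log_var=-5.0, temperature=2.0)
print("sample mean:    ", statistics.fmean(draws))
print("sample variance:", statistics.pvariance(draws))
print("T^2 * sigma^2:  ", math.exp(effective_log_var(-5.0, 2.0)))
```

One consequence worth keeping in mind: scaling the standard deviation by $T$ adds $\log T$ to the per-weight entropy of the sampled distribution, whereas an entropy computed from `log_var` alone describes the $T = 1$ case.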


In [None]:
import math

class TSULinear(nn.Module):
    '''
    Stochastic linear layer using Gaussian weight distributions.
    Reparameterization keeps gradients flowing to (mu, log_var).
    '''
    def __init__(self, in_features, out_features, temperature=1.0, bias=True,
                 log_var_init=-5.0, log_var_min=-10.0, log_var_max=2.0):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.temperature = temperature
        self.log_var_min = log_var_min
        self.log_var_max = log_var_max

        # Mean and log-variance for weights
        self.weight_mean = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_log_var = nn.Parameter(torch.full((out_features, in_features), log_var_init))

        if bias:
            self.bias = nn.Parameter(torch.zeros(out_features))
        else:
            self.bias = None

        # Initialize means like a standard Linear layer
        nn.init.kaiming_uniform_(self.weight_mean, a=math.sqrt(5))

    def _clamped_log_var(self):
        return torch.clamp(self.weight_log_var, self.log_var_min, self.log_var_max)

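    def sample_weight(self):
        # Reparameterization: w = mu + std * eps keeps gradients flowing to mu and log_var.
        # One weight sample is drawn per forward pass and shared across the batch.
        log_var = self._clamped_log_var()
        # Note: temperature scales the sampling std only; entropy() and kl_divergence()
        # below are computed from log_var alone (i.e. for the T = 1 distribution).
        std = torch.exp(0.5 * log_var) * self.temperature
        eps = torch.randn_like(std)
        return self.weight_mean + eps * std

    def forward(self, x):
        # Use stochastic weights during training, mean weights during eval
        w = self.sample_weight() if self.training else self.weight_mean
        return F.linear(x, w, self.bias)

    def entropy(self):
        # Differential entropy of a diagonal Gaussian: 0.5 * sum(1 + log(2*pi) + log_var)
        log_var = self._clamped_log_var()
        return 0.5 * torch.sum(1.0 + log_var + math.log(2.0 * math.pi))

    def kl_divergence(self):
        # KL[N(mu, sigma^2) || N(0, 1)] = 0.5 * sum(mu^2 + sigma^2 - log_var - 1)
        log_var = self._clamped_log_var()
        return -0.5 * torch.sum(1.0 + log_var - self.weight_mean.pow(2) - log_var.exp())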

print("🌡️  TSULinear class loaded")
print("   - Gaussian weight sampling with reparameterization")
print("   - Entropy and KL available for free-energy objectives")


---

## 🏗️ STEP 3: Model Architecture - Minimal GPT with Attention Entropy Tracking

**Key Points:**
- We use a small GPT to keep experiments tractable.
- Attention entropy is logged for interpretability (not optimized directly).
- TSU is injected by swapping the fused query/key/value projection (`c_attn`) for `TSULinear`; the output projection and the MLP stay deterministic.

### **Self-Attention Mechanism:**

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

### **Attention Entropy (Monitoring):**

$$H(A_i) = -\sum_{j=1}^{T} A_{ij} \log A_{ij}$$

### **Causal Masking:**

$$A_{ij} = \begin{cases}
\frac{\exp(q_i \cdot k_j / \sqrt{d_k})}{\sum_{j'\leq i}\exp(q_i \cdot k_{j'} / \sqrt{d_k})} & j \leq i \\
0 & j > i
\end{cases}$$
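
A small pure-Python illustration of the two formulas above: one causally masked attention row and its entropy. The function names and the toy scores are hypothetical; positions are 0-indexed here, so row `i` can attend to `i + 1` positions.

```python
import math

def causal_attention_row(scores, i):
    """Softmax over positions j <= i; masked positions (j > i) get weight 0."""
    visible = scores[: i + 1]
    m = max(visible)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in visible]
    z = sum(exps)
    return [e / z for e in exps] + [0.0] * (len(scores) - i - 1)

def row_entropy(row):
    """H = -sum_j A_ij log A_ij in nats, skipping zeros (0 log 0 is taken as 0)."""
    return -sum(a * math.log(a) for a in row if a > 0.0)

uniform = causal_attention_row([0.0, 0.0, 0.0, 0.0], i=2)   # equal scores
peaked = causal_attention_row([0.0, 10.0, 0.0, 0.0], i=2)   # one dominant score
print(uniform, row_entropy(uniform), math.log(3))
print(peaked, row_entropy(peaked))
```

Because a row that sees $i$ positions has entropy at most $\log i$, the first row always contributes zero, and an average over rows depends on the sequence length. That is worth remembering when comparing logged entropies across different context sizes.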


In [None]:
class CausalSelfAttention(nn.Module):
    '''
    Causal self-attention with entropy tracking.
    Optionally uses TSULinear for stochastic attention projections.
    '''
    def __init__(self, n_embd: int, n_head: int, block_size: int, dropout: float = 0.1,
                 use_tsu: bool = False, tsu_temperature: float = 1.0):
        super().__init__()
        assert n_embd % n_head == 0

        self.n_head = n_head
        self.n_embd = n_embd
        self.dropout = dropout
        self.use_tsu = use_tsu

        # Key, Query, Value projections
        if use_tsu:
            self.c_attn = TSULinear(n_embd, 3 * n_embd, temperature=tsu_temperature)
        else:
            self.c_attn = nn.Linear(n_embd, 3 * n_embd)
        self.c_proj = nn.Linear(n_embd, n_embd)

        # Regularization
        self.attn_dropout = nn.Dropout(dropout)
        self.resid_dropout = nn.Dropout(dropout)

        # Causal mask
        self.register_buffer("bias", torch.tril(torch.ones(block_size, block_size))
                            .view(1, 1, block_size, block_size))

        # Track attention entropy (for thermodynamic analysis)
        self.last_attn_entropy = None

    def forward(self, x):
        B, T, C = x.size()  # Batch, Sequence length, Embedding dim

        # Calculate Q, K, V
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        # Attention scores
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)

        # Compute attention entropy: H(p) = -Σ p·log(p) (monitoring only, so no autograd graph)
        # Note: .item() forces a GPU->CPU sync on every forward pass, which adds to wall time.
        with torch.no_grad():
            att_entropy = -(att * torch.log(att + 1e-10)).sum(dim=-1).mean()
            self.last_attn_entropy = att_entropy.item()

        att = self.attn_dropout(att)
        y = att @ v  # (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C)

        return self.resid_dropout(self.c_proj(y))

class TransformerBlock(nn.Module):
    '''Transformer block with attention + MLP'''
    def __init__(self, n_embd: int, n_head: int, block_size: int, dropout: float = 0.1,
                 use_tsu: bool = False, tsu_temperature: float = 1.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size, dropout,
                                        use_tsu=use_tsu, tsu_temperature=tsu_temperature)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

class TinyGPT(nn.Module):
    '''
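    Minimal GPT-style language model
    ~10.8M parameters with the defaults below; ~3.2M with the Step 7 config
    (n_embd=256, n_layer=4, block_size=128, 65-character vocabulary)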
    '''
    def __init__(self, vocab_size: int, block_size: int = 256,
                 n_embd: int = 384, n_head: int = 6, n_layer: int = 6, dropout: float = 0.1,
                 use_tsu: bool = False, tsu_temperature: float = 1.0):
        super().__init__()
        self.block_size = block_size

        self.transformer = nn.ModuleDict({
            'wte': nn.Embedding(vocab_size, n_embd),  # Token embeddings
            'wpe': nn.Embedding(block_size, n_embd),  # Position embeddings
            'drop': nn.Dropout(dropout),
            'h': nn.ModuleList([TransformerBlock(n_embd, n_head, block_size, dropout,
                                                use_tsu=use_tsu, tsu_temperature=tsu_temperature)
                               for _ in range(n_layer)]),
            'ln_f': nn.LayerNorm(n_embd)
        })
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)

        # Weight tying
        self.transformer.wte.weight = self.lm_head.weight

        # Initialize weights
        self.apply(self._init_weights)

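    def _init_weights(self, module):
        # Note: TSULinear is not an nn.Linear, so its weight_mean keeps the Kaiming init
        # from TSULinear.__init__ rather than the N(0, 0.02) init applied here.
        if isinstance(module, nn.Linear):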
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert t <= self.block_size, f"Sequence length {t} exceeds block_size {self.block_size}"

        pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0)

        # Forward pass
        tok_emb = self.transformer.wte(idx)
        pos_emb = self.transformer.wpe(pos)
        x = self.transformer.drop(tok_emb + pos_emb)

        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)

        logits = self.lm_head(x)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

        return logits, loss

    def get_num_params(self):
        return sum(p.numel() for p in self.parameters())

print("🏗️  TinyGPT model architecture loaded")
print("   - Optional TSU linear layers in attention")
print("   - Causal self-attention with entropy tracking")
print("   - Configurable depth: n_layer, n_embd, n_head")
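
To cross-check `get_num_params()` without instantiating the model, the parameter count can be written in closed form. The sketch below is derived by reading the layer definitions above; `tinygpt_param_count` is a hypothetical helper, and the 65-character vocabulary is taken from the dataset statistics in Step 4.

```python
def tinygpt_param_count(vocab_size, block_size, n_embd, n_layer, use_tsu=False):
    """Closed-form trainable-parameter count for the TinyGPT layout defined above."""
    # wte is tied to lm_head, so the token matrix is counted once; wpe is separate
    embeddings = vocab_size * n_embd + block_size * n_embd
    c_attn_w = 3 * n_embd * n_embd
    if use_tsu:
        c_attn_w *= 2  # TSULinear stores weight_mean and weight_log_var
    attn = c_attn_w + 3 * n_embd + n_embd * n_embd + n_embd  # c_attn + bias, c_proj + bias
    mlp = 8 * n_embd * n_embd + 5 * n_embd                   # two Linear layers with biases
    layer_norms = 4 * n_embd                                 # ln1 and ln2, weight + bias each
    final_ln = 2 * n_embd
    return embeddings + n_layer * (attn + mlp + layer_norms) + final_ln

print(tinygpt_param_count(65, 128, 256, 4))                # Step 7 baseline configuration
print(tinygpt_param_count(65, 128, 256, 4, use_tsu=True))  # same, with TSU attention projections
print(tinygpt_param_count(65, 256, 384, 6))                # constructor defaults
```

The causal mask is registered as a buffer, so it does not appear in the count. The TSU variant carries roughly 25% more trainable parameters in this configuration, which is relevant when comparing energy per step.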


---

## 📚 STEP 4: Data Preparation - Tiny Shakespeare Dataset

**Mathematical Foundation:**

### **Character-Level Language Modeling:**

Given a sequence $x = (x_1, ..., x_T)$ where $x_t \in \mathcal{V}$ (vocabulary):

$$P(x) = \prod_{t=1}^{T} P(x_t | x_{<t})$$

### **Cross-Entropy Loss:**

$$\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T} \log P_\theta(x_t | x_{<t}) = -\frac{1}{T}\sum_{t=1}^{T} \sum_{v \in \mathcal{V}} \mathbb{1}[x_t = v] \log P_\theta(v | x_{<t})$$

### **Perplexity:**

$$\text{PPL} = \exp(\mathcal{L}) = \exp\left(-\frac{1}{T}\sum_{t=1}^{T}\log P_\theta(x_t | x_{<t})\right)$$

Lower perplexity = better model.
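
A useful reference point when reading the training logs is the loss of a model that guesses uniformly over the vocabulary. The sketch below computes it, assuming the loss is a mean negative log-likelihood in nats (which is what `F.cross_entropy` returns); the function names are illustrative.

```python
import math

def perplexity(mean_nll_nats: float) -> float:
    """PPL = exp(L) for a mean negative log-likelihood measured in nats."""
    return math.exp(mean_nll_nats)

def uniform_baseline_loss(vocab_size: int) -> float:
    """Cross-entropy of a uniform predictor over vocab_size symbols."""
    return math.log(vocab_size)

loss_uniform = uniform_baseline_loss(65)   # 65-character vocabulary
print(loss_uniform, perplexity(loss_uniform))
```

For a 65-character vocabulary this is about 4.17 nats (perplexity 65), so an untrained model should start near that value and any useful training run should end well below it.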

### **Dataset Statistics:**

- Total tokens: $N \approx 1.1M$
- Vocabulary size: $|\mathcal{V}| = 65$ (characters)
- Train/Val split: $90\% / 10\%$
- Context window: $T_{ctx} = 128$ tokens

In [None]:
import urllib.request

# Download Tiny Shakespeare
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
print("📥 Downloading Tiny Shakespeare...")
with urllib.request.urlopen(url) as response:
    text = response.read().decode('utf-8')

print(f"✅ Downloaded {len(text):,} characters")
print(f"📖 Preview:\n{text[:200]}...")

# Build vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

print(f"\n🔤 Vocabulary size: {vocab_size}")
print(f"   Characters: {''.join(chars[:20])}...")

# Train/val split
n = len(text)
train_data = torch.tensor(encode(text[:int(0.9*n)]), dtype=torch.long)
val_data = torch.tensor(encode(text[int(0.9*n):]), dtype=torch.long)

print(f"\n📊 Dataset splits:")
print(f"   Train: {len(train_data):,} tokens")
print(f"   Val:   {len(val_data):,} tokens")

In [None]:
class CharDataset(Dataset):
    """Character-level dataset with sliding window"""
    def __init__(self, data, block_size):
        self.data = data
        self.block_size = block_size
    
    def __len__(self):
        return len(self.data) - self.block_size
    
    def __getitem__(self, idx):
        chunk = self.data[idx:idx + self.block_size + 1]
        x = chunk[:-1]
        y = chunk[1:]
        return x, y

# Create dataloaders
BLOCK_SIZE = 128
BATCH_SIZE = 32

train_dataset = CharDataset(train_data, BLOCK_SIZE)
val_dataset = CharDataset(val_data, BLOCK_SIZE)

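# Pinned host memory only helps when batches are copied to a CUDA device
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True,
                         num_workers=0, pin_memory=torch.cuda.is_available())
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False,
                       num_workers=0, pin_memory=torch.cuda.is_available())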

print(f"✅ DataLoaders created:")
print(f"   Block size: {BLOCK_SIZE}")
print(f"   Batch size: {BATCH_SIZE}")
print(f"   Train batches: {len(train_loader)}")
print(f"   Val batches: {len(val_loader)}")
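
Because `CharDataset` slides its window one character at a time, an epoch contains almost as many samples as there are training characters. The arithmetic below makes that concrete. It assumes a corpus length of 1,115,394 characters (check this against the download printout above); `num_windows`, `batches_per_epoch`, and the `stride` parameter are hypothetical and not part of the notebook's dataset class.

```python
import math

def num_windows(n_tokens: int, block_size: int, stride: int = 1) -> int:
    """Number of (x, y) windows of length block_size whose targets fit in the data."""
    last_start = n_tokens - block_size - 1
    return 0 if last_start < 0 else last_start // stride + 1

def batches_per_epoch(n_tokens: int, block_size: int, batch_size: int, stride: int = 1) -> int:
    return math.ceil(num_windows(n_tokens, block_size, stride) / batch_size)

n_train = int(0.9 * 1_115_394)  # assumed corpus length, 90% training split
print("stride 1:  ", batches_per_epoch(n_train, 128, 32))
print("stride 128:", batches_per_epoch(n_train, 128, 32, stride=128))
```

Under these assumptions one epoch is roughly 31 thousand optimizer steps, so on a laptop it may be worth subsampling windows (for example with non-overlapping strides) or capping the number of steps per epoch.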

---

## 🔬 STEP 5: Training Functions - Baseline vs. TSU

**Optimization Logic (Important):**

- **Baseline** minimizes only loss: $\mathcal{L}(\theta)$.
- **TSU training** minimizes a *free-energy-like* objective:

$$\mathcal{F} = \mathcal{L} - T\cdot S(q) + \lambda D_{KL}[q||p]$$

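This adds two forces:
1. **Entropy pressure** (explore) pushes variance up.
2. **KL pressure** (stabilize) pulls each variance toward the prior's $\sigma^2 = 1$ and each mean toward 0. Because the TSU layers start at $\log\sigma^2 = -5$, far below the prior, the KL term initially pushes variance *up* as well; within $\mathcal{F}$, only the data loss can push it down.

**Failure modes to watch for:**
- If the entropy weight is too high, variance inflates and loss stalls.
- If the entropy weight is too low, TSU collapses back to baseline.

We therefore use **small entropy and KL weights**, optional temperature schedules, and compare energy per unit loss.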
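
To see where the entropy and KL terms balance, consider a single weight and ignore the data loss. The short derivation below assumes the standard-normal prior and the per-weight Gaussian formulas from the Mathematical Framework section; $w_S$ and $\lambda$ stand for `entropy_weight` and `kl_weight` in `train_with_tsu`.

$$
\begin{aligned}
\mathcal{F}_{reg}(\mu_i, \sigma_i^2) &= -\frac{T w_S}{2}\left(1 + \log(2\pi\sigma_i^2)\right) + \frac{\lambda}{2}\left(\mu_i^2 + \sigma_i^2 - \log\sigma_i^2 - 1\right) \\
\frac{\partial \mathcal{F}_{reg}}{\partial \sigma_i^2} &= -\frac{T w_S}{2\sigma_i^2} + \frac{\lambda}{2}\left(1 - \frac{1}{\sigma_i^2}\right) = 0
\;\;\Longrightarrow\;\; \sigma_i^{2\,*} = 1 + \frac{T w_S}{\lambda} \\
\frac{\partial \mathcal{F}_{reg}}{\partial \mu_i} &= \lambda \mu_i = 0 \;\;\Longrightarrow\;\; \mu_i^{*} = 0
\end{aligned}
$$

With $T = 1$ and `entropy_weight = kl_weight = 1e-4` (the function defaults), the regularizers alone would settle at $\sigma^2 = 2$, i.e. $\log\sigma^2 \approx 0.69$, which lies inside the clamp range $[-10, 2]$ of `TSULinear`. Any smaller variance observed after training is therefore being held down by the data loss. This sketch ignores the clamp and AdamW's decoupled weight decay, which also acts on `weight_log_var`.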


In [None]:
def train_baseline(model, train_loader, val_loader, epochs=5, lr=3e-4):
    '''
    Baseline training: Standard cross-entropy minimization
    Returns: training metrics + energy consumption
    '''
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    # Energy monitoring
    if GPU_MONITORING:
        power_meter = NVMLPowerMeter()
        power_meter.start()

    metrics = {'train_loss': [], 'val_loss': [], 'epoch_times': [],
               'energy_j': 0.0, 'avg_power_w': 0.0}

    print("🚀 Starting BASELINE training...")
    for epoch in range(epochs):
        epoch_start = time.time()
        model.train()
        train_losses = []

        for batch_idx, (x, y) in enumerate(train_loader):
            x, y = x.to(device), y.to(device)

            # Forward pass
            logits, loss = model(x, targets=y)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            train_losses.append(loss.item())

            # Sample power
            if GPU_MONITORING and batch_idx % 10 == 0:
                power_meter.sample()

        # Validation
        model.eval()
        val_losses = []
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                _, loss = model(x, targets=y)
                val_losses.append(loss.item())

        epoch_time = time.time() - epoch_start
        avg_train = np.mean(train_losses)
        avg_val = np.mean(val_losses)

        metrics['train_loss'].append(avg_train)
        metrics['val_loss'].append(avg_val)
        metrics['epoch_times'].append(epoch_time)

        print(f"Epoch {epoch+1}/{epochs} | Train: {avg_train:.4f} | Val: {avg_val:.4f} | Time: {epoch_time:.2f}s")

    # Energy report
    if GPU_MONITORING:
        energy_stats = power_meter.stop()
        metrics['energy_j'] = energy_stats['energy_j']
        metrics['avg_power_w'] = energy_stats['avg_power_w']
        print(f"\n⚡ Energy consumed: {energy_stats['energy_j']:.2f} J")
        print(f"   Avg power: {energy_stats['avg_power_w']:.2f} W")

    return metrics

print("✅ train_baseline() function loaded")


In [None]:
def collect_tsu_stats(model):
    '''Aggregate entropy and KL across TSU layers in the model.'''
    entropy = 0.0
    kl = 0.0
    count = 0
    for module in model.modules():
        if isinstance(module, TSULinear):
            entropy = entropy + module.entropy()
            kl = kl + module.kl_divergence()
            count += 1
    return entropy, kl, count


def train_with_tsu(model, train_loader, val_loader, epochs=5, lr=3e-4,
                   temperature=1.0, entropy_weight=1e-4, kl_weight=1e-4,
                   temp_decay=1.0):
    '''
    TSU Training: Free-energy-like objective
    F = L - T * entropy_weight * S + kl_weight * KL
    '''
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    # Energy monitoring
    if GPU_MONITORING:
        power_meter = NVMLPowerMeter()
        power_meter.start()

    metrics = {
        'train_loss': [], 'val_loss': [], 'free_energy': [],
        'entropy': [], 'kl': [], 'epoch_times': [],
        'energy_j': 0.0, 'avg_power_w': 0.0
    }

    print(f"🌡️  Starting TSU training (T={temperature}, entropy_weight={entropy_weight}, kl_weight={kl_weight})...")
    for epoch in range(epochs):
        epoch_start = time.time()
        model.train()
        train_losses, free_energies, entropies, kls = [], [], [], []

        # Optional temperature schedule
        temp_t = temperature * (temp_decay ** epoch)

        for batch_idx, (x, y) in enumerate(train_loader):
            x, y = x.to(device), y.to(device)

            # Forward pass
            logits, loss = model(x, targets=y)

            # Entropy + KL regularization from TSU layers
            entropy, kl, n_tsu = collect_tsu_stats(model)
            entropy_term = temp_t * entropy_weight * entropy
            kl_term = kl_weight * kl

            # Free energy objective
            free_energy = loss - entropy_term + kl_term

            # Backward pass
            optimizer.zero_grad()
            free_energy.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            train_losses.append(loss.item())
            free_energies.append(free_energy.item())
            # float() handles tensors and the plain 0.0 returned when the model has no TSU layers
            entropies.append(float(entropy) / max(n_tsu, 1))
            kls.append(float(kl) / max(n_tsu, 1))

            # Sample power
            if GPU_MONITORING and batch_idx % 10 == 0:
                power_meter.sample()

        # Validation (deterministic: uses mean weights in TSU)
        model.eval()
        val_losses = []
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                _, loss = model(x, targets=y)
                val_losses.append(loss.item())

        epoch_time = time.time() - epoch_start
        avg_train = np.mean(train_losses)
        avg_val = np.mean(val_losses)
        avg_fe = np.mean(free_energies)
        avg_entropy = np.mean(entropies)
        avg_kl = np.mean(kls)

        metrics['train_loss'].append(avg_train)
        metrics['val_loss'].append(avg_val)
        metrics['free_energy'].append(avg_fe)
        metrics['entropy'].append(avg_entropy)
        metrics['kl'].append(avg_kl)
        metrics['epoch_times'].append(epoch_time)

        print(f"Epoch {epoch+1}/{epochs} | Loss: {avg_train:.4f} | FE: {avg_fe:.4f} | "
              f"S: {avg_entropy:.2f} | KL: {avg_kl:.2f} | Val: {avg_val:.4f} | Time: {epoch_time:.2f}s")

    # Energy report
    if GPU_MONITORING:
        energy_stats = power_meter.stop()
        metrics['energy_j'] = energy_stats['energy_j']
        metrics['avg_power_w'] = energy_stats['avg_power_w']
        print(f"\n⚡ Energy consumed: {energy_stats['energy_j']:.2f} J")
        print(f"   Avg power: {energy_stats['avg_power_w']:.2f} W")

    return metrics

print("✅ train_with_tsu() function loaded")
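
The temperature schedule in `train_with_tsu` is a plain geometric decay per epoch, and in the code above it only rescales the entropy coefficient (the sampling temperature stored in each `TSULinear` is fixed at construction). The sketch below reproduces the schedule outside the training loop; the helper names are illustrative.

```python
def temperature_schedule(temperature: float, temp_decay: float, epochs: int):
    """Per-epoch temperatures T * decay**epoch, matching temp_t in train_with_tsu."""
    return [temperature * (temp_decay ** epoch) for epoch in range(epochs)]

def entropy_coefficient(temp_t: float, entropy_weight: float) -> float:
    """Effective multiplier on S(q) in the free-energy objective."""
    return temp_t * entropy_weight

schedule = temperature_schedule(temperature=1.0, temp_decay=0.98, epochs=3)
print(schedule)
print([entropy_coefficient(t, 1e-4) for t in schedule])
```

With `temp_decay=0.98` the temperature takes about 34 epochs to halve, so over a 3-epoch run the schedule changes the entropy coefficient by less than 4%. A faster decay would be needed for the schedule to have a visible effect at this training length.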


---

## 🔮 STEP 6: Quantum Optimization with PennyLane (Optional Demo)

**Important:** This section is a *standalone illustration* of QAOA-style optimization.
It is **not** integrated into the training results in this notebook.

### **Quantum Approximate Optimization Algorithm (QAOA):**

**Ansatz state:**
$$|\psi(\vec{\gamma}, \vec{\beta})\rangle = \prod_{p=1}^{P} U_M(H_M, \beta_p) U_P(H_C, \gamma_p) |+\rangle^{\otimes n}$$

where:
- $U_P(H_C, \gamma) = e^{-i\gamma H_C}$: Problem unitary
- $U_M(H_M, \beta) = e^{-i\beta H_M}$: Mixer unitary
- $|+\rangle = \frac{1}{\sqrt{2}}(|0\rangle + |1\rangle)$

### **Cost Hamiltonian (Toy Example):**
$$H_C = \sum_{i=1}^{n} h_i Z_i + \sum_{i<j} J_{ij} Z_i Z_j$$

### **Expectation Value:**
$$\langle H_C \rangle = \langle \psi(\vec{\gamma}, \vec{\beta}) | H_C | \psi(\vec{\gamma}, \vec{\beta}) \rangle$$


In [None]:
# Quantum optimization with PennyLane (optional - requires installation)
try:
    import pennylane as qml
    QUANTUM_AVAILABLE = True
    print("✅ PennyLane available - Quantum optimization enabled")
except ImportError:
    QUANTUM_AVAILABLE = False
    print("⚠️  PennyLane not installed - Quantum features disabled")
    print("   Install with: pip install pennylane")

if QUANTUM_AVAILABLE:
    # Define quantum device (simulator)
    n_qubits = 4
    dev = qml.device('default.qubit', wires=n_qubits)
    
    @qml.qnode(dev)
    def qaoa_circuit(params, hamiltonian_coeffs):
        """
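        Single-layer QAOA-style circuit (toy example).

        Note: the cost layer below applies only the single-qubit h_i Z_i terms.
        The J_ij Z_i Z_j couplings from the cost Hamiltonian above are not
        implemented, and the CNOT ring is an extra entangling layer rather than
        part of standard QAOA.

        Args:
            params: [gamma, beta] angles for the single layer
            hamiltonian_coeffs: Local-field coefficients h_i (derived from attention weights)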
        """
        # Initial state: uniform superposition
        for i in range(n_qubits):
            qml.Hadamard(wires=i)
        
        # QAOA layers
        gamma, beta = params[0], params[1]
        
        # Problem Hamiltonian (encode attention parameters)
        for i in range(n_qubits):
            qml.RZ(gamma * hamiltonian_coeffs[i], wires=i)
        
        # Mixer Hamiltonian
        for i in range(n_qubits):
            qml.RX(beta, wires=i)
        
        # Entangling layer
        for i in range(n_qubits - 1):
            qml.CNOT(wires=[i, i+1])
        qml.CNOT(wires=[n_qubits-1, 0])  # Circular
        
        # Measurement
        return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]
    
    def quantum_parameter_optimization(attention_weights, n_iterations=20):
        """
        Use QAOA to optimize attention head parameters
        
        Args:
            attention_weights: Current attention weights [n_qubits]
            n_iterations: Optimization steps
            
        Returns:
            Optimized weights
        """
        # Normalize weights to [-π, π]
        hamiltonian_coeffs = np.pi * np.tanh(attention_weights[:n_qubits])
        
        # Initial QAOA parameters
        params = np.array([0.5, 0.5])  # [gamma, beta]
        
        # Simple gradient descent
        learning_rate = 0.1
        for _ in range(n_iterations):
            # Compute expectation values
            expectations = qaoa_circuit(params, hamiltonian_coeffs)
            
            # Simple cost: negative sum of expectations (maximize alignment)
            cost = -np.sum(expectations)
            
            # Numerical gradient (finite difference)
            grad = np.zeros_like(params)
            eps = 0.01
            for i in range(len(params)):
                params_plus = params.copy()
                params_plus[i] += eps
                cost_plus = -np.sum(qaoa_circuit(params_plus, hamiltonian_coeffs))
                grad[i] = (cost_plus - cost) / eps
            
            # Update
            params -= learning_rate * grad
        
        # Final expectations → optimized weights
        final_expectations = qaoa_circuit(params, hamiltonian_coeffs)
        optimized_weights = attention_weights.copy()
        optimized_weights[:n_qubits] = final_expectations
        
        return optimized_weights
    
    print(f"🔮 QAOA circuit configured:")
    print(f"   - Qubits: {n_qubits}")
    print(f"   - Device: default.qubit (simulator)")
    print(f"   - Layers: 1 QAOA-style layer (problem + mixer) followed by a CNOT ring")

---

## 🧪 STEP 7: Run Experiments & Comparative Analysis

**Experimental Design:**

We compare two training paradigms:

1. **Baseline:** $\min_\theta \mathcal{L}(\theta)$
2. **TSU:** $\min_{q} \; \mathbb{E}_q[\mathcal{L}] - T\cdot S(q) + \lambda D_{KL}[q||p]$

**Metrics:**
- Training loss: $\mathcal{L}_{train}$
- Validation loss: $\mathcal{L}_{val}$
- Energy consumption: $E_{total} = \int P(t) dt$
- Entropy evolution: $S(t)$
- Training time: $T_{wall}$

**Hypothesis:**
TSU can reduce *energy per unit loss reduction* by smoothing optimization dynamics.
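
The energy metric $E_{total} = \int P(t)\,dt$ has to be approximated from discrete power samples. The following is a minimal sketch, assuming power readings in watts paired with timestamps in seconds (for example converted from NVML's `nvmlDeviceGetPowerUsage`, which reports milliwatts). `integrate_energy` is a hypothetical helper introduced here for illustration and is not one of the notebook's training functions.

```python
def integrate_energy(timestamps_s, power_w):
    """Approximate E = integral of P(t) dt with the trapezoidal rule (joules)."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    energy_j = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        # Average of the two neighbouring power readings times the interval
        energy_j += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy_j


# Illustrative samples only (not measured data), spaced 1 s apart
demo_energy = integrate_energy([0.0, 1.0, 2.0, 3.0], [100.0, 120.0, 120.0, 100.0])
print(f"{demo_energy:.1f} J")
```

A shorter sampling interval reduces the integration error but adds monitoring overhead, so the interval should be the same for both experiments.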


In [None]:
# Initialize models
TSU_TEMPERATURE = 1.0

model_config = {
    'vocab_size': vocab_size,
    'block_size': BLOCK_SIZE,
    'n_embd': 256,
    'n_head': 4,
    'n_layer': 4,
    'dropout': 0.1,
    'use_tsu': False,
    'tsu_temperature': TSU_TEMPERATURE
}

model_baseline = TinyGPT(**model_config)
print(f"🏗️  Baseline model initialized: {model_baseline.get_num_params():,} parameters")

# TSU model uses stochastic attention projections
model_config_tsu = dict(model_config)
model_config_tsu['use_tsu'] = True

# Experiment configuration
EPOCHS = 3  # Laptop-friendly (increase for real experiments)
LEARNING_RATE = 3e-4


In [None]:
# Experiment 1: Baseline Training
print("="*60)
print("EXPERIMENT 1: BASELINE (Classical SGD)")
print("="*60)

# Re-create the baseline so that re-running this cell starts from fresh weights
model_baseline = TinyGPT(**model_config)
baseline_metrics = train_baseline(
    model_baseline, train_loader, val_loader, 
    epochs=EPOCHS, lr=LEARNING_RATE
)

In [None]:
# Experiment 2: TSU Training
print("\n" + "="*60)
print("EXPERIMENT 2: TSU (Free Energy Minimization)")
print("="*60)

model_tsu = TinyGPT(**model_config_tsu)
tsu_metrics = train_with_tsu(
    model_tsu, train_loader, val_loader,
    epochs=EPOCHS, lr=LEARNING_RATE,
    temperature=TSU_TEMPERATURE, entropy_weight=1e-4, kl_weight=1e-4, temp_decay=0.98
)


---

## 📊 STEP 8: Comparative Visualization & Statistical Analysis

**Mathematical Analysis:**

### **Loss Convergence Rate:**

$$r = \frac{\mathcal{L}(0) - \mathcal{L}(T)}{\mathcal{L}(0)} \times 100\%$$

### **Energy Efficiency Metric:**

$$\eta_{energy} = \frac{\mathcal{L}_{initial} - \mathcal{L}_{final}}{\int_0^T P(t) dt}$$

Higher $\eta$ = more loss reduction per Joule consumed. In this section $T$ denotes the end of training (total training time), not the temperature.
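
Both quantities above can be computed from a per-epoch loss list and a total energy figure. This is a sketch with hypothetical helper names and made-up numbers; it assumes the first list entry is the initial loss, which may not hold if losses are only recorded after each epoch.

```python
def relative_loss_reduction(losses):
    """r = (L_first - L_last) / L_first * 100, in percent."""
    return (losses[0] - losses[-1]) / losses[0] * 100.0


def energy_efficiency(losses, energy_j):
    """eta = (L_initial - L_final) / E_total, in loss units per joule."""
    if energy_j <= 0:
        raise ValueError("energy must be positive")
    return (losses[0] - losses[-1]) / energy_j


# Illustrative numbers only (not results from this notebook)
demo_losses = [4.0, 3.0, 2.5]
demo_r = relative_loss_reduction(demo_losses)
demo_eta = energy_efficiency(demo_losses, energy_j=3000.0)
print(f"r = {demo_r:.1f}%, eta = {demo_eta:.6f} loss/J")
```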

### **Pareto Optimality:**

A method is **Pareto optimal** if no other method is at least as good on both criteria and strictly better on at least one:
- Final loss: $\mathcal{L}_{final}' \leq \mathcal{L}_{final}$
- Energy: $E_{total}' \leq E_{total}$
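
A small sketch of a Pareto check over (final loss, energy) pairs, using the usual dominance test: one run dominates another when it is no worse on both metrics and strictly better on at least one. The run names and numbers are invented for illustration.

```python
def dominates(a, b):
    """True if run `a` dominates run `b` (lower is better on both metrics)."""
    no_worse = a["loss"] <= b["loss"] and a["energy_j"] <= b["energy_j"]
    strictly_better = a["loss"] < b["loss"] or a["energy_j"] < b["energy_j"]
    return no_worse and strictly_better


def pareto_front(runs):
    """Return the names of runs that no other run dominates."""
    return [
        name
        for name, run in runs.items()
        if not any(dominates(other, run)
                   for other_name, other in runs.items() if other_name != name)
    ]


# Illustrative values only
demo_runs = {
    "baseline": {"loss": 2.0, "energy_j": 1000.0},
    "tsu": {"loss": 1.9, "energy_j": 1200.0},
    "tsu_hot": {"loss": 2.1, "energy_j": 1300.0},
}
demo_front = pareto_front(demo_runs)
print(demo_front)
```

With only two methods, both are on the front unless one is better on loss and energy at once, so the check becomes more informative as more configurations are added.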

### **Statistical Significance (t-test):**

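$$t = \frac{\bar{E}_{baseline} - \bar{E}_{TSU}}{s_p \sqrt{\frac{2}{n}}}$$

where $\bar{E}$ is the mean energy over $n$ independent runs (seeds) per method and $s_p = \sqrt{(s_{baseline}^2 + s_{TSU}^2)/2}$ is the pooled standard deviation. The test needs $n \geq 2$ runs per method; a single run of each experiment cannot establish significance.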
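
The statistic above can be computed with the standard library once energy has been measured over several seeds. This sketch assumes equal sample sizes, in which case the pooled standard deviation is the root of the average of the two sample variances. The function name and the energy values are illustrative, not measurements.

```python
import math
import statistics


def pooled_t_statistic(sample_a, sample_b):
    """Two-sample t statistic with pooled variance for equal-size samples."""
    n = len(sample_a)
    if n != len(sample_b) or n < 2:
        raise ValueError("need two samples of equal size with at least 2 runs each")
    # statistics.variance is the unbiased sample variance (divides by n - 1)
    s_p = math.sqrt((statistics.variance(sample_a) + statistics.variance(sample_b)) / 2.0)
    if s_p == 0.0:
        raise ValueError("pooled standard deviation is zero")
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / (s_p * math.sqrt(2.0 / n))


# Illustrative energies in joules for three seeds per method
demo_baseline_j = [1000.0, 1010.0, 990.0]
demo_tsu_j = [950.0, 960.0, 940.0]
demo_t = pooled_t_statistic(demo_baseline_j, demo_tsu_j)
print(f"t = {demo_t:.3f} with {2 * len(demo_baseline_j) - 2} degrees of freedom")
```

The resulting value is compared against a t distribution with $2n - 2$ degrees of freedom.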


In [None]:
import matplotlib.pyplot as plt

# Loss comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Training loss
axes[0].plot(baseline_metrics['train_loss'], 'o-', label='Baseline', linewidth=2)
axes[0].plot(tsu_metrics['train_loss'], 's-', label='TSU', linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Training Loss')
axes[0].set_title('Training Loss Comparison')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Validation loss
axes[1].plot(baseline_metrics['val_loss'], 'o-', label='Baseline', linewidth=2)
axes[1].plot(tsu_metrics['val_loss'], 's-', label='TSU', linewidth=2)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Validation Loss')
axes[1].set_title('Validation Loss Comparison')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Energy consumption
if GPU_MONITORING:
    methods = ['Baseline', 'TSU']
    energies = [baseline_metrics['energy_j'], tsu_metrics['energy_j']]
    colors = ['#3498db', '#e74c3c']
    
    bars = axes[2].bar(methods, energies, color=colors, alpha=0.7, edgecolor='black')
    axes[2].set_ylabel('Energy (Joules)')
    axes[2].set_title('Total Energy Consumption')
    axes[2].grid(True, axis='y', alpha=0.3)
    
    # Add value labels
    for bar, energy in zip(bars, energies):
        height = bar.get_height()
        axes[2].text(bar.get_x() + bar.get_width()/2., height,
                    f'{energy:.1f} J', ha='center', va='bottom', fontweight='bold')
else:
    axes[2].text(0.5, 0.5, 'GPU monitoring\nnot available', 
                ha='center', va='center', transform=axes[2].transAxes, fontsize=12)
    axes[2].axis('off')

plt.tight_layout()
plt.show()

print("\n📊 Visualization complete!")

In [None]:
# Detailed metrics report
print("\n" + "="*70)
print("📋 FINAL METRICS REPORT")
print("="*70)

print("\n🎯 BASELINE (Classical SGD):")
print(f"   Final train loss: {baseline_metrics['train_loss'][-1]:.4f}")
print(f"   Final val loss:   {baseline_metrics['val_loss'][-1]:.4f}")
print(f"   Total time:       {sum(baseline_metrics['epoch_times']):.2f}s")
if GPU_MONITORING:
    print(f"   Energy consumed:  {baseline_metrics['energy_j']:.2f} J")
    print(f"   Avg power:        {baseline_metrics['avg_power_w']:.2f} W")

print("\n🌡️  TSU (Free Energy Minimization):")
print(f"   Final train loss:  {tsu_metrics['train_loss'][-1]:.4f}")
print(f"   Final val loss:    {tsu_metrics['val_loss'][-1]:.4f}")
print(f"   Final free energy: {tsu_metrics['free_energy'][-1]:.4f}")
print(f"   Final entropy:     {tsu_metrics['entropy'][-1]:.2f}")
print(f"   Final KL:          {tsu_metrics['kl'][-1]:.2f}")
print(f"   Total time:        {sum(tsu_metrics['epoch_times']):.2f}s")
if GPU_MONITORING:
    print(f"   Energy consumed:   {tsu_metrics['energy_j']:.2f} J")
    print(f"   Avg power:         {tsu_metrics['avg_power_w']:.2f} W")

if GPU_MONITORING and baseline_metrics['energy_j'] > 0:
    energy_reduction = (1 - tsu_metrics['energy_j'] / baseline_metrics['energy_j']) * 100
    print(f"\n⚡ ENERGY EFFICIENCY:")
    print(f"   TSU vs Baseline: {energy_reduction:+.2f}% change")

    if energy_reduction > 0:
        print(f"   ✅ TSU used {energy_reduction:.1f}% less energy in this run")
    else:
        print(f"   ⚠️  TSU used {-energy_reduction:.1f}% more energy in this run (sampling/regularization overhead is one possible cause)")

print("\n" + "="*70)


---

## 🎨 STEP 9: Text Generation & Quality Evaluation

**Mathematical Foundation:**

### **Autoregressive Generation:**

$$P(x_{1:T}) = \prod_{t=1}^{T} P_\theta(x_t | x_{<t})$$

### **Sampling Strategies:**

**Greedy Decoding:**
$$x_t = \arg\max_{v \in \mathcal{V}} P_\theta(v | x_{<t})$$

**Temperature Sampling:**
$$P'(x_t = v | x_{<t}) = \frac{\exp(\text{logit}_v / \tau)}{\sum_{v'} \exp(\text{logit}_{v'} / \tau)}$$

Higher $\tau$ → more random, Lower $\tau$ → more deterministic
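
The effect of $\tau$ can be seen on a toy logit vector. This is a pure-Python sketch of the temperature softmax above (the function name is hypothetical); it rejects $\tau \leq 0$ because the formula divides the logits by $\tau$, and greedy decoding is the limit $\tau \to 0$ rather than a valid input.

```python
import math


def softmax_with_temperature(logits, tau):
    """Softmax of logits / tau, computed in a numerically stable way."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [logit / tau for logit in logits]
    max_scaled = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


demo_logits = [2.0, 1.0, 0.0]
demo_sharp = softmax_with_temperature(demo_logits, tau=0.5)
demo_flat = softmax_with_temperature(demo_logits, tau=2.0)
print([round(p, 3) for p in demo_sharp])
print([round(p, 3) for p in demo_flat])
```

The low-temperature distribution concentrates mass on the largest logit, while the high-temperature one moves toward uniform.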

### **Generation Quality Metrics:**

**Perplexity:**
$$\text{PPL} = \exp\left(-\frac{1}{T}\sum_{t=1}^{T}\log P_\theta(x_t | x_{<t})\right)$$

**Entropy of generation:**
$$H = -\sum_{v \in \mathcal{V}} P(v) \log P(v)$$
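
Both metrics reduce to a few lines once per-token log-probabilities or a next-token distribution are available. The sketch below uses natural logarithms and hypothetical helper names; the inputs are toy values, not model outputs.

```python
import math


def perplexity(token_log_probs):
    """PPL = exp(-(1/T) * sum of log P(x_t | x_<t)) over T tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def distribution_entropy(probs):
    """H = -sum p * log p in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# A model that assigns probability 0.25 to every observed token
demo_ppl = perplexity([math.log(0.25)] * 5)
# A uniform next-token distribution over four tokens
demo_entropy = distribution_entropy([0.25, 0.25, 0.25, 0.25])
print(f"PPL = {demo_ppl:.2f}, H = {demo_entropy:.4f} nats")
```

For a uniform distribution over $k$ tokens the entropy is $\log k$ and the perplexity is $k$, which is a useful sanity check for any implementation.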

In [None]:
def generate_text(model, prompt="To be or not to be", max_new_tokens=100, temperature=0.8):
    """
    Generate text using the trained model
    """
    model.eval()
    model = model.to(device)
    
    # Encode prompt
    context = torch.tensor([encode(prompt)], dtype=torch.long, device=device)
    
    with torch.no_grad():
        for _ in range(max_new_tokens):
            # Crop context to block_size
            context_crop = context[:, -model.block_size:]
            
            # Forward pass
            logits, _ = model(context_crop)
            logits = logits[:, -1, :] / temperature
            
            # Sample
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            
            # Append
            context = torch.cat([context, next_token], dim=1)
    
    generated = decode(context[0].tolist())
    return generated

# Generate samples from both models
print("📝 Text Generation Samples:\n")

print("=" * 60)
print("BASELINE MODEL:")
print("=" * 60)
baseline_text = generate_text(model_baseline, prompt="ROMEO:", max_new_tokens=80)
print(baseline_text)

print("\n" + "=" * 60)
print("TSU MODEL:")
print("=" * 60)
tsu_text = generate_text(model_tsu, prompt="ROMEO:", max_new_tokens=80)
print(tsu_text)

print("\n✅ Text generation complete!")

---

## 🔬 STEP 10: Advanced Thermodynamic Analysis & Phase Transitions

**Mathematical Foundation:**

### **Free Energy Landscape:**

$$F(q, T) = \mathbb{E}_q[\mathcal{L}] - T \cdot S(q)$$

As $T \to 0$: Free energy $\to$ expected loss (pure exploitation)  
As $T \to \infty$: Free energy dominated by entropy (pure exploration)
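
The two limits can be checked numerically. This sketch assumes, purely for illustration, that the parameter distribution is a diagonal Gaussian with standard deviations `sigmas`, whose differential entropy has a closed form; the notebook's TSU layers may parameterize the distribution differently.

```python
import math


def gaussian_entropy(sigmas):
    """Differential entropy of a diagonal Gaussian: sum(log sigma_i) + d/2 * log(2*pi*e)."""
    d = len(sigmas)
    return sum(math.log(s) for s in sigmas) + 0.5 * d * math.log(2.0 * math.pi * math.e)


def free_energy(expected_loss, entropy, temperature):
    """F = expected loss - T * S."""
    return expected_loss - temperature * entropy


demo_sigmas = [1.0, 1.0]
demo_entropy_s = gaussian_entropy(demo_sigmas)
demo_f_cold = free_energy(2.0, demo_entropy_s, temperature=0.0)
demo_f_hot = free_energy(2.0, demo_entropy_s, temperature=10.0)
print(f"S = {demo_entropy_s:.4f}, F(T=0) = {demo_f_cold:.4f}, F(T=10) = {demo_f_hot:.4f}")
```

Differential entropy becomes negative for small standard deviations, so the sign of the entropy term can flip as the distribution narrows.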

### **Entropy Evolution Dynamics:**

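$$\frac{dS}{dt} = \nabla_\sigma S \cdot \frac{d\sigma}{dt}$$

This is the chain rule, where $\sigma$ denotes the scale parameters of $q$; shrinking $\sigma$ lowers the entropy.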

**Phase Transition Detection:**

Critical temperature at which the entropy changes most sharply:
$$T_c = \arg\max_T \left|\frac{\partial S}{\partial T}\right|$$
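
One way to read the critical temperature is as the temperature at which entropy changes fastest. Under that reading, a rough estimate can be taken from recorded (temperature, entropy) pairs with finite differences. The helper name and the data below are hypothetical, and a three-epoch run provides too few points for a meaningful estimate.

```python
def estimate_critical_temperature(temperatures, entropies):
    """Midpoint temperature of the interval with the steepest |dS/dT|."""
    best_slope, best_temperature = -1.0, None
    for i in range(1, len(temperatures)):
        d_temp = temperatures[i] - temperatures[i - 1]
        if d_temp == 0:
            continue  # skip repeated temperatures to avoid dividing by zero
        slope = abs((entropies[i] - entropies[i - 1]) / d_temp)
        if slope > best_slope:
            best_slope = slope
            best_temperature = 0.5 * (temperatures[i] + temperatures[i - 1])
    return best_temperature


# Illustrative cooling schedule with a sharp entropy drop between T=0.6 and T=0.4
demo_temps_tc = [1.0, 0.8, 0.6, 0.4, 0.2]
demo_entropies_tc = [10.0, 9.8, 9.5, 5.0, 4.8]
demo_tc = estimate_critical_temperature(demo_temps_tc, demo_entropies_tc)
print(f"Estimated T_c ~ {demo_tc:.2f}")
```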

### **Information Bottleneck:**

$$\min I(X; \Theta) \text{ subject to } I(\Theta; Y) \geq I_{min}$$

where $I$ is mutual information.

### **Thermodynamic Integration:**

Total work done by entropy forces:
$$W_{entropy} = \int_{0}^{T_{train}} T(t) \cdot \frac{dS}{dt} dt$$
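
With per-epoch records, the integral above becomes a discrete sum. This sketch assumes the temperature decays multiplicatively once per epoch, $T_k = T_0 \cdot \text{decay}^k$, which is an assumption about how `temp_decay` is applied inside `train_with_tsu`; the entropy values are invented.

```python
def entropy_work(temperatures, entropies):
    """Discrete W = sum over k of T_mid,k * (S_{k+1} - S_k)."""
    work = 0.0
    for k in range(len(entropies) - 1):
        # Use the mean temperature of the interval (midpoint rule)
        t_mid = 0.5 * (temperatures[k] + temperatures[k + 1])
        work += t_mid * (entropies[k + 1] - entropies[k])
    return work


t0, decay = 1.0, 0.98  # same values as TSU_TEMPERATURE and temp_decay in Experiment 2
demo_temps_w = [t0 * decay ** k for k in range(4)]
demo_entropies_w = [10.0, 9.0, 8.5, 8.0]  # illustrative only
demo_work = entropy_work(demo_temps_w, demo_entropies_w)
print(f"W_entropy = {demo_work:.4f}")
```

A negative value means the entropy decreased over training, weighted by the temperature at which each decrease happened.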

### **Fluctuation-Dissipation Theorem:**

$$\langle (\Delta \theta)^2 \rangle = 2T \cdot D \cdot \Delta t$$

where $D$ is a mobility-like coefficient (the effective diffusion coefficient is $T \cdot D$), connecting temperature to parameter fluctuations.
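
The scaling of fluctuations with temperature can be illustrated by drawing Gaussian noise steps whose variance follows the relation above. This is a toy one-dimensional sketch with a hypothetical function name, not a model of the actual TSU weight noise.

```python
import math
import random


def simulate_displacement_variance(temperature, D, dt, n_samples=20000, seed=0):
    """Empirical mean squared displacement of Gaussian steps with variance 2*T*D*dt."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 * temperature * D * dt)
    samples = [rng.gauss(0.0, std) for _ in range(n_samples)]
    return sum(x * x for x in samples) / n_samples


demo_var_cold = simulate_displacement_variance(temperature=0.5, D=1.0, dt=0.01)
demo_var_hot = simulate_displacement_variance(temperature=2.0, D=1.0, dt=0.01)
print(f"cold: {demo_var_cold:.4f} (theory 0.0100), hot: {demo_var_hot:.4f} (theory 0.0400)")
```

Quadrupling the temperature quadruples the mean squared displacement, which is the linear dependence on $T$ stated in the formula.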

In [None]:
# Thermodynamic analysis of TSU training
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Free Energy Evolution
axes[0, 0].plot(tsu_metrics['free_energy'], 'o-', color='#e74c3c', linewidth=2, markersize=8)
axes[0, 0].set_xlabel('Epoch', fontsize=11)
axes[0, 0].set_ylabel('Free Energy F(θ)', fontsize=11)
axes[0, 0].set_title('Free Energy Minimization: F(θ) = L(θ) - T·S(θ)', fontsize=12, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)

# 2. Entropy Evolution
axes[0, 1].plot(tsu_metrics['entropy'], 's-', color='#9b59b6', linewidth=2, markersize=8)
axes[0, 1].set_xlabel('Epoch', fontsize=11)
axes[0, 1].set_ylabel('Entropy S(θ)', fontsize=11)
axes[0, 1].set_title('Parameter Distribution Entropy', fontsize=12, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)

# 3. Loss vs Free Energy comparison
axes[1, 0].plot(tsu_metrics['train_loss'], 'o-', label='Loss L(θ)', color='#3498db', linewidth=2)
axes[1, 0].plot(tsu_metrics['free_energy'], 's-', label='Free Energy F(θ)', color='#e74c3c', linewidth=2)
axes[1, 0].set_xlabel('Epoch', fontsize=11)
axes[1, 0].set_ylabel('Value', fontsize=11)
axes[1, 0].set_title('Loss vs Free Energy Dynamics', fontsize=12, fontweight='bold')
axes[1, 0].legend(fontsize=10)
axes[1, 0].grid(True, alpha=0.3)

# 4. Energy-Accuracy Trade-off
if GPU_MONITORING:
    baseline_final_loss = baseline_metrics['val_loss'][-1]
    tsu_final_loss = tsu_metrics['val_loss'][-1]
    baseline_energy = baseline_metrics['energy_j']
    tsu_energy = tsu_metrics['energy_j']
    
    axes[1, 1].scatter([baseline_energy], [baseline_final_loss], 
                      s=300, marker='o', color='#3498db', edgecolor='black', linewidth=2,
                      label='Baseline', zorder=3)
    axes[1, 1].scatter([tsu_energy], [tsu_final_loss],
                      s=300, marker='s', color='#e74c3c', edgecolor='black', linewidth=2,
                      label='TSU', zorder=3)
    
    axes[1, 1].set_xlabel('Energy Consumption (J)', fontsize=11)
    axes[1, 1].set_ylabel('Final Validation Loss', fontsize=11)
    axes[1, 1].set_title('Energy-Performance Trade-off', fontsize=12, fontweight='bold')
    axes[1, 1].legend(fontsize=10)
    axes[1, 1].grid(True, alpha=0.3)
    
    # Add arrows and annotations
    axes[1, 1].annotate('', xy=(tsu_energy, tsu_final_loss), 
                       xytext=(baseline_energy, baseline_final_loss),
                       arrowprops=dict(arrowstyle='->', lw=2, color='green', alpha=0.6))
    
    # Pareto improvement region
    axes[1, 1].axvline(baseline_energy, color='gray', linestyle='--', alpha=0.3)
    axes[1, 1].axhline(baseline_final_loss, color='gray', linestyle='--', alpha=0.3)
else:
    axes[1, 1].text(0.5, 0.5, 'GPU Energy Monitoring\nNot Available\n\nInstall pynvml:\npip install pynvml',
                   ha='center', va='center', transform=axes[1, 1].transAxes, 
                   fontsize=11, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    axes[1, 1].axis('off')

plt.tight_layout()
plt.show()

print("\n✅ Advanced thermodynamic analysis complete!")
print("\n📊 Key Insights:")
print(f"   - Free energy trajectory shows {'convergence' if tsu_metrics['free_energy'][-1] < tsu_metrics['free_energy'][0] else 'instability'}")
print(f"   - Entropy evolution: {tsu_metrics['entropy'][0]:.2f} → {tsu_metrics['entropy'][-1]:.2f}")
entropy_decreased = tsu_metrics['entropy'][-1] < tsu_metrics['entropy'][0]
print(f"   - Entropy {'decreased' if entropy_decreased else 'did not decrease'} during training "
      f"(parameter distribution {'narrowing' if entropy_decreased else 'not narrowing'})")

---

## 🏁 Workshop Conclusions & Key Takeaways

### 🎯 Summary

This workshop explored **entropy-regularized training** for energy-aware LLM fine-tuning on GPUs.

### 📊 Key Results (Interpret Carefully)

**1. Objective:**
$$\boxed{\mathcal{F}(q) = \mathbb{E}_q[\mathcal{L}] - T\cdot S(q) + \lambda D_{KL}[q||p]}$$

**2. Energy Efficiency:**
- Baseline: Standard SGD with loss $\mathcal{L}(\theta)$
- TSU: Stochastic weights + entropy/KL regularization
- Outcome: **Measure** energy per unit loss reduction; do not assume a reduction a priori

**3. Training Dynamics:**
- Entropy evolves: $S(0) \to S(T)$ (exploration → exploitation)
- Too much entropy can stall learning; too little collapses the method to the baseline

### 🔬 Theoretical Insights

$$\underbrace{\mathcal{F}}_{\text{Free Energy}} = \underbrace{\mathbb{E}_q[\mathcal{L}]}_{\text{Internal Energy}} - \underbrace{T\cdot S(q)}_{\text{Entropic Term}} + \underbrace{\lambda D_{KL}}_{\text{Stability}}$$

### 🚀 Practical Applications

1. **Energy-Aware Fine-Tuning:** Track Joules per loss improvement
2. **Stochastic Regularization:** Improve robustness in small-data regimes
3. **Hardware Profiling:** Use NVML to connect algorithmic choices to energy

### 🔮 Future Directions

1. **Temperature Schedules:** Adaptive $T(t)$ tied to gradient norms
2. **Objective Balancing:** Tune $(T, \lambda)$ for stability vs exploration
3. **Larger-Scale Studies:** Repeat with more epochs + multiple seeds

---

## 📖 References & Further Reading

1. Extropic (2025): "An efficient probabilistic hardware architecture for diffusion-like models" (arXiv:2510.23972v1)
2. Friston, K. (2010): "The free-energy principle: a unified brain theory?" Nature Reviews Neuroscience
3. Farhi et al. (2014): "A Quantum Approximate Optimization Algorithm" arXiv:1411.4028
4. Hinton & Van Camp (1993): "Keeping neural networks simple by minimizing the description length" COLT 1993

---

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 25px; border-radius: 15px; color: white; text-align: center; margin: 20px 0;">
  <h2 style="margin: 0; font-size: 24px;">🌟 Thank You for Participating! 🌟</h2>
  <p style="margin: 15px 0; font-size: 16px;">Questions? Discussions? Let's explore thermodynamic AI together!</p>
  <p style="margin: 10px 0; font-size: 14px; opacity: 0.9;">Contact: [Your Workshop Details Here]</p>
</div>
