# HOPE: Hierarchical Optimization with Persistent Experience for AI-Generated Text Detection

This notebook implements HOPE (Hierarchical Optimization with Persistent Experience), a neural architecture based on Google's Nested Learning paradigm for AI-generated text detection.

## Key Components:
1. **Neural Memory Module** - A learnable MLP that stores key-value associations with surprise-based updates
2. **Continuum Memory System (CMS)** - Multiple memory modules updating at different frequencies
3. **Nested Optimization** - Multi-level optimization with different update rates
4. **Self-Referential Learning** - Memory modules that can optimize their own parameters

## References:
- [Nested Learning: A new ML paradigm for continual learning](https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/)
- [Titans: Learning to Memorize at Test Time](https://arxiv.org/abs/2501.00663)

Flow: Setup → Data → Neural Memory Module → CMS → HOPE Model → Training → Inference

In [None]:
# Core imports
import os
import random
import math
from pathlib import Path
from typing import Optional, List, Tuple, Dict, Any
from dataclasses import dataclass

import numpy as np
import pandas as pd
import joblib

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

from transformers import (
    AutoModel,
    AutoTokenizer,
    set_seed,
)
from sklearn.model_selection import StratifiedKFold, GroupKFold
from sklearn.metrics import roc_auc_score, accuracy_score
from tqdm.auto import tqdm

# Set seeds for reproducibility
SEED = 42
set_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {DEVICE}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Configuration
@dataclass
class HOPEConfig:
    """Configuration for HOPE model."""
    # Base encoder
    base_model: str = "microsoft/deberta-v3-base"
    max_length: int = 512
    
    # Neural Memory Module
    memory_dim: int = 256  # Dimension of memory keys/values
    memory_hidden_dim: int = 512  # Hidden dimension in memory MLP
    memory_layers: int = 2  # Number of layers in memory MLP
    
    # Continuum Memory System (CMS)
    num_memory_levels: int = 3  # Number of memory modules at different frequencies
    update_frequencies: Tuple[int, ...] = (1, 4, 16)  # Update every N steps for each level
    
    # Surprise mechanism
    surprise_momentum: float = 0.9  # eta_t: decay for past surprise
    surprise_scale: float = 0.1  # theta_t: scale for momentary surprise
    forget_rate: float = 0.01  # alpha_t: forgetting factor
    
    # Training
    learning_rate: float = 2e-5
    memory_lr: float = 1e-3  # Higher LR for memory modules (faster adaptation)
    weight_decay: float = 0.01
    num_epochs: int = 5
    batch_size: int = 128
    gradient_accumulation: int = 2
    warmup_ratio: float = 0.1
    
    # CV
    n_splits: int = 5
    
    # Classifier
    num_classes: int = 2
    dropout: float = 0.1
    
    # Nested optimization
    inner_steps: int = 1  # Steps for inner (memory) optimization per outer step
    use_self_referential: bool = True  # Enable HOPE's self-referential updates

config = HOPEConfig()
print(f"Config: {config}")

In [None]:
# Data paths - using cleaned train/test datasets
CWD = Path.cwd()

# Find the data directory
if (CWD / "src/ai_vs_human").exists():
    DATA_DIR = CWD / "src/ai_vs_human"
else:
    # Covers running from inside ai_vs_human as well as any other directory
    DATA_DIR = CWD

# Specific dataset files
TRAIN_FILE = "merged_ai_human_multisocial_features_cleaned_train.csv"
TEST_FILE = "merged_ai_human_multisocial_features_cleaned_test.csv"

DATA_PATH = DATA_DIR / TRAIN_FILE
TEST_PATH = DATA_DIR / TEST_FILE

if not DATA_PATH.exists():
    raise FileNotFoundError(f"Training data not found at {DATA_PATH}")

if not TEST_PATH.exists():
    print(f"Warning: Test data not found at {TEST_PATH}, will skip test inference")
    TEST_PATH = None

WORK_DIR = DATA_DIR
MODEL_DIR = WORK_DIR / "models" / "hope"
MODEL_DIR.mkdir(parents=True, exist_ok=True)
OOF_DIR = WORK_DIR / "oof"
OOF_DIR.mkdir(exist_ok=True)

print(f"Training data: {DATA_PATH}")
print(f"Test data: {TEST_PATH}")
print(f"Model directory: {MODEL_DIR}")

## Neural Memory Module

The core of Titans/HOPE is a **neural memory module** - an MLP that learns key-value associations through surprise-based updates.

### Key Equations:
- **Memory Update**: $\mathcal{M}_t = (1 - \alpha_t)\mathcal{M}_{t-1} + S_t$
- **Surprise Metric**: $S_t = \eta_t S_{t-1} - \theta_t \nabla\ell(\mathcal{M}_{t-1}; x_t)$
- **Loss**: $\ell(\mathcal{M}_{t-1}; x_t) = \|\mathcal{M}_{t-1}(k_t) - v_t\|_2^2$

Where:
- $\alpha_t$ controls forgetting
- $\eta_t$ is surprise momentum (past surprise decay)
- $\theta_t$ scales momentary surprise
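
As a rough illustration of the three equations above, the sketch below applies the surprise-based update to a tiny *linear* memory $\mathcal{M}(k) = Wk$ in plain Python. The linear memory, the fixed scalar values for $\alpha$, $\eta$, and $\theta$, and the helper names (`matvec`, `assoc_loss`, `surprise_step`) are assumptions made for readability; the module defined in the next cell uses an MLP and input-dependent gates instead.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(W: Matrix, k: Vector) -> Vector:
    """Return W @ k for a small dense matrix."""
    return [sum(w_ij * k_j for w_ij, k_j in zip(row, k)) for row in W]


def assoc_loss(W: Matrix, k: Vector, v: Vector) -> float:
    """Associative memory loss ||W k - v||^2."""
    return sum((p - t) ** 2 for p, t in zip(matvec(W, k), v))


def surprise_step(
    W: Matrix, S: Matrix, k: Vector, v: Vector,
    alpha: float = 0.01, eta: float = 0.9, theta: float = 0.1,
) -> Tuple[Matrix, Matrix]:
    """One update: S_t = eta*S_{t-1} - theta*grad, M_t = (1-alpha)*M_{t-1} + S_t."""
    err = [p - t for p, t in zip(matvec(W, k), v)]
    # Gradient of ||W k - v||^2 w.r.t. W is the outer product 2 * err * k^T.
    grad = [[2.0 * e * k_j for k_j in k] for e in err]
    S_new = [[eta * s - theta * g for s, g in zip(s_row, g_row)]
             for s_row, g_row in zip(S, grad)]
    W_new = [[(1.0 - alpha) * w + s for w, s in zip(w_row, s_row)]
             for w_row, s_row in zip(W, S_new)]
    return W_new, S_new


W = [[0.0, 0.0], [0.0, 0.0]]    # empty memory
S = [[0.0, 0.0], [0.0, 0.0]]    # no accumulated surprise yet
k, v = [1.0, 0.0], [0.5, -0.5]  # one unit-norm key and its target value

loss_before = assoc_loss(W, k, v)
for _ in range(20):
    W, S = surprise_step(W, S, k, v)
loss_after = assoc_loss(W, k, v)
print(f"loss before: {loss_before:.4f}, after 20 updates: {loss_after:.4f}")
```

Because the surprise term carries momentum, the stored value overshoots and oscillates around the target before settling, and the forgetting factor keeps the fixed point slightly below the target.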

In [None]:
class NeuralMemoryModule(nn.Module):
    """
    Neural Memory Module from Titans architecture.
    
    Unlike traditional RNNs that use fixed-size vectors, this module uses
    an MLP to store key-value associations in its parameters. The memory
    is updated based on "surprise" - how unexpected each input is.
    """
    
    def __init__(
        self,
        input_dim: int,
        memory_dim: int,
        hidden_dim: int,
        num_layers: int = 2,
        surprise_momentum: float = 0.9,
        surprise_scale: float = 0.1,
        forget_rate: float = 0.01,
    ):
        super().__init__()
        self.input_dim = input_dim
        self.memory_dim = memory_dim
        self.hidden_dim = hidden_dim
        
        # Hyperparameters for surprise-based updates
        self.eta = surprise_momentum  # Past surprise decay
        self.theta = surprise_scale   # Momentary surprise scale
        self.alpha = forget_rate      # Forgetting factor
        
        # Key-Value projections
        self.key_proj = nn.Linear(input_dim, memory_dim)
        self.value_proj = nn.Linear(input_dim, memory_dim)
        self.query_proj = nn.Linear(input_dim, memory_dim)
        
        # Memory MLP: maps keys to values
        layers = []
        dims = [memory_dim] + [hidden_dim] * (num_layers - 1) + [memory_dim]
        for i in range(len(dims) - 1):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:
                layers.append(nn.SiLU())  # SiLU activation as in Titans
                layers.append(nn.LayerNorm(dims[i + 1]))
        self.memory_mlp = nn.Sequential(*layers)
        
        # Learnable gates for adaptive surprise parameters
        self.eta_gate = nn.Sequential(
            nn.Linear(input_dim, 1),
            nn.Sigmoid()
        )
        self.theta_gate = nn.Sequential(
            nn.Linear(input_dim, 1),
            nn.Sigmoid()
        )
        self.alpha_gate = nn.Sequential(
            nn.Linear(input_dim, 1),
            nn.Sigmoid()
        )
        
        # Accumulated surprise (momentum buffer)
        self.register_buffer('surprise_momentum_buffer', None)
        
        # Output projection
        self.output_proj = nn.Linear(memory_dim, input_dim)
        
    def compute_surprise(
        self, keys: torch.Tensor, values: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Compute a surprise proxy from the memory's prediction error.

        Titans measures surprise with the gradient of the associative memory
        loss; this implementation uses the per-token squared error instead.

        Returns:
            (surprise [B, S, 1], mean associative loss as a scalar tensor)
        """
        # Forward pass through memory
        predicted_values = self.memory_mlp(keys)
        
        # Associative memory loss: ||M(k) - v||^2
        loss = F.mse_loss(predicted_values, values, reduction='none')
        surprise = loss.mean(dim=-1, keepdim=True)  # Per-token surprise
        
        return surprise, loss.mean()
    
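    def forward(
        self,
        x: torch.Tensor,
        update_memory: bool = True,
        return_surprise: bool = False,
    ) -> torch.Tensor:
        """
        Read from memory and optionally report surprise metrics.
        
        Args:
            x: Input tensor [batch, seq_len, input_dim] or [batch, input_dim]
            update_memory: Reserved for surprise-based updates. It is currently
                unused: the memory MLP is trained by backpropagation in the
                training loop.
            return_surprise: Whether to also return a dict of surprise metrics
        
        Returns:
            The output tensor, or (output, metrics) if return_surprise is True.
        """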
        # Handle 2D input
        if x.dim() == 2:
            x = x.unsqueeze(1)
        
        batch_size, seq_len, _ = x.shape
        
        # Project to key-value-query space
        keys = self.key_proj(x)  # [B, S, memory_dim]
        values = self.value_proj(x)
        queries = self.query_proj(x)
        
        # Normalize keys and queries (as in Titans)
        keys = F.normalize(keys, p=2, dim=-1)
        queries = F.normalize(queries, p=2, dim=-1)
        
        # Compute adaptive gate values
        x_pooled = x.mean(dim=1)  # [B, input_dim]
        eta_t = self.eta * self.eta_gate(x_pooled).squeeze(-1)  # Data-dependent decay
        theta_t = self.theta * self.theta_gate(x_pooled).squeeze(-1)
        alpha_t = self.alpha * self.alpha_gate(x_pooled).squeeze(-1)
        
        # Compute surprise
        surprise, assoc_loss = self.compute_surprise(keys, values)
        
        # Memory read: retrieve based on query
        retrieved = self.memory_mlp(queries)  # [B, S, memory_dim]
        
        # Output projection
        output = self.output_proj(retrieved)
        
        if output.size(1) == 1:
            output = output.squeeze(1)
        
        if return_surprise:
            return output, {
                'surprise': surprise.mean(),
                'assoc_loss': assoc_loss,
                'eta': eta_t.mean(),
                'theta': theta_t.mean(),
                'alpha': alpha_t.mean(),
            }
        
        return output
    
    def get_memory_loss(self, x: torch.Tensor) -> torch.Tensor:
        """Get the associative memory loss for training."""
        if x.dim() == 2:
            x = x.unsqueeze(1)
        
        keys = F.normalize(self.key_proj(x), p=2, dim=-1)
        values = self.value_proj(x)
        
        predicted = self.memory_mlp(keys)
        return F.mse_loss(predicted, values)


# Test the memory module
print("Testing Neural Memory Module...")
mem = NeuralMemoryModule(input_dim=768, memory_dim=256, hidden_dim=512)
test_input = torch.randn(4, 10, 768)  # [batch=4, seq=10, dim=768]
output, metrics = mem(test_input, return_surprise=True)
print(f"Input shape: {test_input.shape}")
print(f"Output shape: {output.shape}")
print(f"Metrics: {metrics}")

## Continuum Memory System (CMS)

HOPE extends Titans with a **Continuum Memory System** - multiple memory modules that update at different frequencies:
- **Fast memory** (level 0): Updates every step, captures immediate context
- **Medium memory** (level 1): Updates every 4 steps, captures recent patterns  
- **Slow memory** (level 2): Updates every 16 steps, consolidates abstract knowledge

This creates a spectrum of memory timescales, loosely analogous to the way human memory consolidates short-term experience into longer-term knowledge.
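
To make the schedule concrete, the sketch below lists which levels are due for an update at each training step, using the same `step % freq == 0` rule as the `ContinuumMemorySystem` cell that follows. The helper name `levels_due` is hypothetical and only used for illustration.

```python
from typing import List, Tuple


def levels_due(step: int, update_frequencies: Tuple[int, ...] = (1, 4, 16)) -> List[int]:
    """Return the indices of memory levels whose frequency divides `step`."""
    return [level for level, freq in enumerate(update_frequencies) if step % freq == 0]


schedule = {step: levels_due(step) for step in range(17)}
for step, levels in schedule.items():
    print(f"step {step:2d}: update levels {levels}")
```

Over steps 0 to 15, level 0 is due 16 times, level 1 four times, and level 2 once.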

In [None]:
class ContinuumMemorySystem(nn.Module):
    """
    Continuum Memory System (CMS) from HOPE.
    
    A stack of neural memory modules, each updating at a different frequency.
    This creates a spectrum of memory timescales from fast (immediate context)
    to slow (consolidated knowledge).
    """
    
    def __init__(
        self,
        input_dim: int,
        memory_dim: int,
        hidden_dim: int,
        num_levels: int = 3,
        update_frequencies: Tuple[int, ...] = (1, 4, 16),
        surprise_momentum: float = 0.9,
        surprise_scale: float = 0.1,
        forget_rate: float = 0.01,
    ):
        super().__init__()
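        if len(update_frequencies) < num_levels:
            raise ValueError("update_frequencies must provide one entry per memory level")
        self.num_levels = num_levels
        self.update_frequencies = update_frequencies[:num_levels]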
        
        # Create memory modules for each level
        # Slower levels have stronger momentum (more persistent memory)
        self.memory_levels = nn.ModuleList([
            NeuralMemoryModule(
                input_dim=input_dim,
                memory_dim=memory_dim,
                hidden_dim=hidden_dim,
                surprise_momentum=min(surprise_momentum + 0.05 * i, 0.99),  # Increase momentum for slower levels, kept below 1
                surprise_scale=surprise_scale / (i + 1),  # Decrease scale for slower levels
                forget_rate=forget_rate / (i + 1),  # Slower forgetting for slower levels
            )
            for i in range(num_levels)
        ])
        
        # Fusion layer to combine outputs from all levels
        self.fusion = nn.Sequential(
            nn.Linear(input_dim * num_levels, input_dim),
            nn.SiLU(),
            nn.LayerNorm(input_dim),
            nn.Linear(input_dim, input_dim),
        )
        
        # Learnable level weights
        self.level_weights = nn.Parameter(torch.ones(num_levels) / num_levels)
        
        # Step counter for frequency-based updates
        self.register_buffer('step_counter', torch.tensor(0))
        
    def forward(
        self,
        x: torch.Tensor,
        return_level_outputs: bool = False,
    ) -> torch.Tensor:
        """
        Forward pass through all memory levels.
        
        Args:
            x: Input tensor [batch, seq_len, input_dim] or [batch, input_dim]
            return_level_outputs: Return individual level outputs for analysis
        """
        level_outputs = []
        level_metrics = []
        
        for i, (memory, freq) in enumerate(zip(self.memory_levels, self.update_frequencies)):
            # Check if this level should update
            should_update = (self.step_counter % freq == 0)
            
            # Forward through memory level
            output, metrics = memory(x, update_memory=should_update, return_surprise=True)
            level_outputs.append(output)
            level_metrics.append(metrics)
        
        # Increment step counter
        if self.training:
            self.step_counter += 1
        
        # Weighted combination of level outputs
        weights = F.softmax(self.level_weights, dim=0)
        
        # Handle different output shapes
        if level_outputs[0].dim() == 2:
            # [batch, dim] outputs
            stacked = torch.stack(level_outputs, dim=-1)  # [B, D, L]
            weighted = (stacked * weights.view(1, 1, -1)).sum(dim=-1)  # [B, D]
            
            # Also compute fusion output
            concat = torch.cat(level_outputs, dim=-1)  # [B, D*L]
            fused = self.fusion(concat)  # [B, D]
            
            # Combine weighted sum and fusion
            output = weighted + fused
        else:
            # [batch, seq, dim] outputs
            stacked = torch.stack(level_outputs, dim=-1)  # [B, S, D, L]
            weighted = (stacked * weights.view(1, 1, 1, -1)).sum(dim=-1)  # [B, S, D]
            
            concat = torch.cat(level_outputs, dim=-1)  # [B, S, D*L]
            fused = self.fusion(concat)  # [B, S, D]
            
            output = weighted + fused
        
        if return_level_outputs:
            return output, level_outputs, level_metrics
        
        return output
    
    def get_total_memory_loss(self, x: torch.Tensor) -> torch.Tensor:
        """Get combined associative memory loss from all levels."""
        total_loss = 0
        for i, (memory, freq) in enumerate(zip(self.memory_levels, self.update_frequencies)):
            # Weight loss by update frequency (faster levels contribute more)
            weight = 1.0 / freq
            total_loss += weight * memory.get_memory_loss(x)
        return total_loss
    
    def reset_step_counter(self):
        """Reset step counter (call at start of each epoch)."""
        self.step_counter.zero_()


# Test CMS
print("Testing Continuum Memory System...")
cms = ContinuumMemorySystem(
    input_dim=768,
    memory_dim=256,
    hidden_dim=512,
    num_levels=3,
    update_frequencies=(1, 4, 16)
)
test_input = torch.randn(4, 768)
output, level_outputs, metrics = cms(test_input, return_level_outputs=True)
print(f"Input shape: {test_input.shape}")
print(f"Output shape: {output.shape}")
print(f"Level weights: {F.softmax(cms.level_weights, dim=0).data}")
print(f"Memory loss: {cms.get_total_memory_loss(test_input):.4f}")

## HOPE Model for Text Classification

The full HOPE architecture combines:
1. **Encoder** - DeBERTa-v3-base for text encoding
2. **Continuum Memory System** - Multi-frequency memory for capturing patterns
3. **Self-Referential Module** - Allows memory to optimize itself (HOPE's key innovation)
4. **Classifier Head** - For binary AI/Human classification
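
As a quick orientation, the sketch below traces tensor shapes through these four stages. The key names mirror the variables used in `HOPEClassifier.forward` in the next cell; the function `hope_shapes` itself is hypothetical, and `hidden=768` is an assumption that matches the `input_dim=768` used in the test cells above.

```python
from typing import Dict, Tuple


def hope_shapes(batch: int, seq_len: int, hidden: int = 768,
                num_classes: int = 2) -> Dict[str, Tuple[int, ...]]:
    """Trace tensor shapes from encoder output to classifier logits."""
    return {
        "hidden_states": (batch, seq_len, hidden),  # encoder output
        "mean_pooled": (batch, hidden),             # masked mean, fed to the CMS
        "attn_pooled": (batch, hidden),             # attention-weighted pooling
        "memory_output": (batch, hidden),           # CMS (+ self-referential) output
        "combined": (batch, 2 * hidden),            # concat of attn_pooled and memory_output
        "logits": (batch, num_classes),             # classifier head
    }


for name, shape in hope_shapes(batch=4, seq_len=512).items():
    print(f"{name:>14}: {shape}")
```

Note that the memory path only sees one pooled vector per document, so the sequence dimension is gone before the CMS is applied.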

In [None]:
class SelfReferentialModule(nn.Module):
    """
    Self-Referential Module - HOPE's key innovation over Titans.
    
    In this notebook it is approximated by a gated residual correction: a
    meta-network predicts a bounded update to the CMS output from the input
    and the CMS output, and a learnable persistent-memory vector is added.
    """
    
    def __init__(self, dim: int, hidden_dim: int = 256):
        super().__init__()
        self.dim = dim
        
        # Meta-network that predicts parameter updates
        self.meta_net = nn.Sequential(
            nn.Linear(dim * 2, hidden_dim),
            nn.SiLU(),
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, dim),
            nn.Tanh(),  # Bound updates
        )
        
        # Gating for self-referential updates
        self.gate = nn.Sequential(
            nn.Linear(dim, 1),
            nn.Sigmoid(),
        )
        
        # Persistent memory (learnable, data-independent)
        self.persistent_memory = nn.Parameter(torch.randn(1, dim) * 0.02)
        
    def forward(self, x: torch.Tensor, memory_output: torch.Tensor) -> torch.Tensor:
        """
        Self-referential forward pass.
        
        Args:
            x: Original input
            memory_output: Output from CMS
        """
        # Combine input and memory output for meta-learning
        combined = torch.cat([x, memory_output], dim=-1)
        
        # Predict update direction
        update = self.meta_net(combined)
        
        # Gated update
        gate = self.gate(memory_output)
        
        # Apply self-referential update
        output = memory_output + gate * update
        
        # Add persistent memory (expands to batch size)
        persistent = self.persistent_memory.expand(x.size(0), -1)
        output = output + 0.1 * persistent
        
        return output


class HOPEClassifier(nn.Module):
    """
    HOPE (Hierarchical Optimization with Persistent Experience) for Text Classification.
    
    Architecture:
    1. Text Encoder (DeBERTa) -> contextual embeddings
    2. Continuum Memory System -> multi-frequency memory processing
    3. Self-Referential Module -> meta-learning for memory optimization
    4. Classifier Head -> AI/Human prediction
    """
    
    def __init__(self, config: HOPEConfig):
        super().__init__()
        self.config = config
        
        # Text encoder
        self.encoder = AutoModel.from_pretrained(config.base_model)
        self.encoder_dim = self.encoder.config.hidden_size
        
        # Optionally freeze the encoder embeddings (for efficiency)
        # for param in self.encoder.embeddings.parameters():
        #     param.requires_grad = False
        
        # Continuum Memory System
        self.cms = ContinuumMemorySystem(
            input_dim=self.encoder_dim,
            memory_dim=config.memory_dim,
            hidden_dim=config.memory_hidden_dim,
            num_levels=config.num_memory_levels,
            update_frequencies=config.update_frequencies,
            surprise_momentum=config.surprise_momentum,
            surprise_scale=config.surprise_scale,
            forget_rate=config.forget_rate,
        )
        
        # Self-referential module (HOPE's addition)
        self.self_ref = SelfReferentialModule(
            dim=self.encoder_dim,
            hidden_dim=config.memory_hidden_dim,
        ) if config.use_self_referential else None
        
        # Attention pooling: one learned score per token
        self.attention_pool = nn.Sequential(
            nn.Linear(self.encoder_dim, 1),
        )
        
        # Classifier head
        self.dropout = nn.Dropout(config.dropout)
        self.classifier = nn.Sequential(
            nn.Linear(self.encoder_dim * 2, self.encoder_dim),
            nn.SiLU(),
            nn.LayerNorm(self.encoder_dim),
            nn.Dropout(config.dropout),
            nn.Linear(self.encoder_dim, config.num_classes),
        )
        
    def forward(
        self,
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor,
        labels: Optional[torch.Tensor] = None,
        return_memory_loss: bool = False,
    ) -> Dict[str, torch.Tensor]:
        """
        Forward pass.
        
        Args:
            input_ids: Token IDs [batch, seq_len]
            attention_mask: Attention mask [batch, seq_len]
            labels: Optional labels for loss computation
            return_memory_loss: Whether to return memory loss for nested optimization
        """
        # Encode text
        encoder_output = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        hidden_states = encoder_output.last_hidden_state  # [B, S, D]
        
        # Mean pooling for CMS input
        mask_expanded = attention_mask.unsqueeze(-1).float()
        sum_hidden = (hidden_states * mask_expanded).sum(dim=1)
        mean_pooled = sum_hidden / mask_expanded.sum(dim=1).clamp(min=1e-9)
        
        # Attention-weighted pooling
        attn_weights = self.attention_pool(hidden_states).squeeze(-1)  # [B, S]
        attn_weights = attn_weights.masked_fill(~attention_mask.bool(), float('-inf'))
        attn_weights = F.softmax(attn_weights, dim=-1)  # [B, S]
        attn_pooled = (hidden_states * attn_weights.unsqueeze(-1)).sum(dim=1)  # [B, D]
        
        # Process through Continuum Memory System
        memory_output = self.cms(mean_pooled)  # [B, D]
        
        # Self-referential processing (HOPE's key feature)
        if self.self_ref is not None:
            memory_output = self.self_ref(mean_pooled, memory_output)
        
        # Combine encoder output with memory output
        combined = torch.cat([attn_pooled, memory_output], dim=-1)  # [B, D*2]
        combined = self.dropout(combined)
        
        # Classification
        logits = self.classifier(combined)  # [B, num_classes]
        
        output = {'logits': logits}
        
        if labels is not None:
            loss = F.cross_entropy(logits, labels)
            output['loss'] = loss
        
        if return_memory_loss:
            # Memory loss for nested optimization
            memory_loss = self.cms.get_total_memory_loss(mean_pooled)
            output['memory_loss'] = memory_loss
        
        return output
    
    def get_memory_parameters(self):
        """Get parameters of memory modules (for separate optimizer)."""
        params = list(self.cms.parameters())
        if self.self_ref is not None:
            params.extend(list(self.self_ref.parameters()))
        return params
    
    def get_encoder_parameters(self):
        """Get parameters of encoder and classifier (for main optimizer)."""
        params = list(self.encoder.parameters())
        params.extend(list(self.classifier.parameters()))
        params.extend(list(self.attention_pool.parameters()))
        return params


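# Quick smoke test: build the full model once and report parameter counts.
# Note: this loads (and, on first use, downloads) the encoder weights.
print("Testing HOPE Classifier...")
test_config = HOPEConfig()
print(f"Loading {test_config.base_model}...")
test_model = HOPEClassifier(test_config)
n_mem = sum(p.numel() for p in test_model.get_memory_parameters())
n_enc = sum(p.numel() for p in test_model.get_encoder_parameters())
print(f"Memory params: {n_mem:,} | Encoder/classifier params: {n_enc:,}")
del test_model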

In [None]:
# Dataset class
class TextDataset(Dataset):
    """Dataset for AI vs Human text classification."""
    
    def __init__(
        self,
        df: pd.DataFrame,
        tokenizer,
        text_col: str = "text",
        label_col: Optional[str] = "label",
        max_length: int = 512,
    ):
        self.df = df.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.text_col = text_col
        self.label_col = label_col
        self.max_length = max_length
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx: int):
        text = str(self.df.loc[idx, self.text_col])
        
        encoding = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            padding='max_length',
            return_tensors='pt',
        )
        
        item = {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
        }
        
        if self.label_col and self.label_col in self.df.columns:
            item['labels'] = torch.tensor(int(self.df.loc[idx, self.label_col]))
        
        return item


# Load and prepare data
print("Loading data...")
train_df = pd.read_csv(DATA_PATH)
test_df = pd.read_csv(TEST_PATH) if TEST_PATH is not None else None

text_col = "text"
label_col = "label"

# Handle alternate column names
if text_col not in train_df.columns:
    for alt in ["text_content", "content"]:
        if alt in train_df.columns:
            train_df = train_df.rename(columns={alt: text_col})
            if test_df is not None:
                test_df = test_df.rename(columns={alt: text_col})
            break

print(f"Train shape: {train_df.shape}")
print(f"Label distribution:\n{train_df[label_col].value_counts()}")

if test_df is not None:
    print(f"\nTest shape: {test_df.shape}")
    if label_col in test_df.columns:
        print(f"Test label distribution:\n{test_df[label_col].value_counts()}")

y = train_df[label_col].values

# Setup CV
N_SPLITS = config.n_splits
cv = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED)
folds = list(cv.split(train_df, y))
print(f"\nUsing {N_SPLITS}-fold StratifiedKFold")

## Training with Nested Optimization

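HOPE uses **nested optimization**: parameter groups that are optimized at different rates. This notebook uses two optimizers driven by the same combined loss (classification loss plus a weighted memory loss):

1. **Outer level** (slow): Updates encoder, attention-pooling, and classifier parameters with `learning_rate`
2. **Inner level** (fast): Updates memory and self-referential parameters with the higher `memory_lr`, taking `inner_steps` optimizer steps per outer step

This allows the memory to adapt quickly while the encoder learns stable representations.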
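
The effect of the two rates can be seen in a toy setting. The sketch below assumes plain gradient descent on a two-parameter quadratic rather than AdamW on the real model; the default learning rates mirror `memory_lr` and `learning_rate` from `HOPEConfig`, and the function name `two_rate_descent` is hypothetical. Unlike the training cell, the toy recomputes the gradient at every inner step.

```python
from typing import Tuple


def two_rate_descent(steps: int, fast_lr: float = 1e-3, slow_lr: float = 2e-5,
                     inner_steps: int = 1) -> Tuple[float, float]:
    """Minimise (fast - 1)^2 + (slow - 1)^2 with a separate learning rate per parameter."""
    fast, slow = 0.0, 0.0
    for _ in range(steps):
        # Inner level: the fast parameter may take several steps per outer step.
        for _ in range(inner_steps):
            fast -= fast_lr * 2.0 * (fast - 1.0)
        # Outer level: the slow parameter takes a single, smaller step.
        slow -= slow_lr * 2.0 * (slow - 1.0)
    return fast, slow


fast, slow = two_rate_descent(steps=100)
print(f"after 100 steps: fast={fast:.4f}, slow={slow:.4f} (target 1.0)")
```

After the same number of outer steps, the fast parameter has moved much further toward the optimum than the slow one, which is the behaviour the two-optimizer setup aims for.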

In [None]:
def train_fold(
    fold: int,
    train_idx: np.ndarray,
    val_idx: np.ndarray,
    config: HOPEConfig,
) -> Tuple[np.ndarray, float, str]:
    """
    Train a single fold with nested optimization.
    
    Returns:
        oof_preds: Out-of-fold predictions for validation set
        best_auc: Best validation AUC achieved
        model_path: Path to saved model
    """
    print(f"\n{'='*50}")
    print(f"Training Fold {fold + 1}/{N_SPLITS}")
    print(f"{'='*50}")
    
    # Prepare data
    train_data = train_df.iloc[train_idx]
    val_data = train_df.iloc[val_idx]
    
    tokenizer = AutoTokenizer.from_pretrained(config.base_model, use_fast=True)
    
    train_dataset = TextDataset(train_data, tokenizer, text_col, label_col, config.max_length)
    val_dataset = TextDataset(val_data, tokenizer, text_col, label_col, config.max_length)
    
    train_loader = DataLoader(
        train_dataset,
        batch_size=config.batch_size,
        shuffle=True,
        num_workers=4,
        pin_memory=True,
    )
    val_loader = DataLoader(
        val_dataset,
        batch_size=config.batch_size * 2,
        shuffle=False,
        num_workers=4,
        pin_memory=True,
    )
    
    # Initialize model
    model = HOPEClassifier(config).to(DEVICE)
    
    # Nested optimization: two optimizers with different learning rates
    # Outer optimizer: encoder + classifier (slow)
    outer_optimizer = AdamW(
        model.get_encoder_parameters(),
        lr=config.learning_rate,
        weight_decay=config.weight_decay,
    )
    
    # Inner optimizer: memory modules (fast)
    inner_optimizer = AdamW(
        model.get_memory_parameters(),
        lr=config.memory_lr,
        weight_decay=config.weight_decay * 0.1,  # Less regularization for memory
    )
    
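    # Learning rate schedulers: linear warmup followed by cosine decay
    total_steps = max(1, len(train_loader) * config.num_epochs // config.gradient_accumulation)
    warmup_steps = int(total_steps * config.warmup_ratio)
    
    def lr_lambda(current_step: int) -> float:
        if current_step < warmup_steps:
            return (current_step + 1) / warmup_steps
        progress = (current_step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    
    outer_scheduler = torch.optim.lr_scheduler.LambdaLR(outer_optimizer, lr_lambda)
    inner_scheduler = torch.optim.lr_scheduler.LambdaLR(inner_optimizer, lr_lambda)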
    
    # Training loop
    best_auc = 0
    best_model_state = None
    fold_dir = MODEL_DIR / f"fold_{fold}"
    fold_dir.mkdir(exist_ok=True)
    
    scaler = torch.amp.GradScaler('cuda') if DEVICE.type == 'cuda' else None
    
    for epoch in range(config.num_epochs):
        model.train()
        model.cms.reset_step_counter()  # Reset CMS step counter each epoch
        
        epoch_loss = 0
        epoch_memory_loss = 0
        num_batches = 0
        
        pbar = tqdm(train_loader, desc=f"Epoch {epoch + 1}/{config.num_epochs}")
        
        outer_optimizer.zero_grad()
        inner_optimizer.zero_grad()
        
        for step, batch in enumerate(pbar):
            input_ids = batch['input_ids'].to(DEVICE)
            attention_mask = batch['attention_mask'].to(DEVICE)
            labels = batch['labels'].to(DEVICE)
            
            # Forward pass with mixed precision
            with torch.amp.autocast('cuda', enabled=scaler is not None):
                outputs = model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    labels=labels,
                    return_memory_loss=True,
                )
                
                # Classification loss
                cls_loss = outputs['loss'] / config.gradient_accumulation
                
                # Memory loss (for nested optimization)
                mem_loss = outputs['memory_loss'] / config.gradient_accumulation
                
                # Combined loss
                total_loss = cls_loss + 0.1 * mem_loss  # Weight memory loss
            
            # Backward pass
            if scaler is not None:
                scaler.scale(total_loss).backward()
            else:
                total_loss.backward()
            
            epoch_loss += cls_loss.item() * config.gradient_accumulation
            epoch_memory_loss += mem_loss.item() * config.gradient_accumulation
            num_batches += 1
            
            # Gradient accumulation step
            if (step + 1) % config.gradient_accumulation == 0:
                if scaler is not None:
                    # Clip gradients
                    scaler.unscale_(outer_optimizer)
                    scaler.unscale_(inner_optimizer)
                    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                    
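                    # Nested optimization: inner steps for memory
                    # (GradScaler raises if step() is called twice on the same optimizer
                    # before update(), so inner_steps must stay at 1 on this AMP path)
                    for _ in range(config.inner_steps):
                        scaler.step(inner_optimizer)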
                    
                    # Outer step for encoder
                    scaler.step(outer_optimizer)
                    scaler.update()
                else:
                    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                    for _ in range(config.inner_steps):
                        inner_optimizer.step()
                    outer_optimizer.step()
                
                outer_optimizer.zero_grad()
                inner_optimizer.zero_grad()
                
                outer_scheduler.step()
                inner_scheduler.step()
            
            pbar.set_postfix({
                'loss': f"{epoch_loss / num_batches:.4f}",
                'mem_loss': f"{epoch_memory_loss / num_batches:.4f}",
            })
        
        # Validation
        model.eval()
        val_preds = []
        val_labels = []
        
        with torch.no_grad():
            for batch in tqdm(val_loader, desc="Validating"):
                input_ids = batch['input_ids'].to(DEVICE)
                attention_mask = batch['attention_mask'].to(DEVICE)
                
                with torch.amp.autocast('cuda', enabled=scaler is not None):
                    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
                
                # Cast to float32 so half-precision logits do not coarsen the probabilities
                probs = F.softmax(outputs['logits'].float(), dim=-1)[:, 1]
                val_preds.extend(probs.cpu().numpy())
                val_labels.extend(batch['labels'].numpy())
        
        val_preds = np.array(val_preds)
        val_labels = np.array(val_labels)
        
        val_auc = roc_auc_score(val_labels, val_preds)
        val_acc = accuracy_score(val_labels, (val_preds > 0.5).astype(int))
        
        print(f"Epoch {epoch + 1}: Loss={epoch_loss/num_batches:.4f}, "
              f"Val AUC={val_auc:.5f}, Val Acc={val_acc:.5f}")
        
        # Save best model
        if val_auc > best_auc:
            best_auc = val_auc
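            # Snapshot the weights on CPU; a shallow dict .copy() would share the
            # live tensors and silently track later training updates.
            best_model_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}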
            print(f"  -> New best model! AUC: {best_auc:.5f}")
    
    # Save best model
    model_path = fold_dir / "best_model.pt"
    torch.save(best_model_state, model_path)
    
    # Load best model for final predictions
    model.load_state_dict(best_model_state)
    model.eval()
    
    # Get OOF predictions
    oof_preds = []
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(DEVICE)
            attention_mask = batch['attention_mask'].to(DEVICE)
            
            with torch.amp.autocast('cuda', enabled=scaler is not None):
                outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            
            # Cast to float32 so half-precision logits do not coarsen the probabilities
            probs = F.softmax(outputs['logits'].float(), dim=-1)[:, 1]
            oof_preds.extend(probs.cpu().numpy())
    
    # Cleanup
    del model
    torch.cuda.empty_cache()
    
    return np.array(oof_preds), best_auc, str(model_path)
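
The training loop above steps both optimizers only when `(step + 1) % config.gradient_accumulation == 0`. The following is a minimal pure-Python sketch of that schedule. `accumulation_schedule` is a hypothetical helper, not part of this notebook, and it only counts batches rather than touching a model.

```python
from typing import List, Tuple


def accumulation_schedule(num_batches: int, accumulation: int) -> Tuple[List[int], int]:
    """Return the 0-based batch indices that trigger an optimizer step,
    plus the number of trailing batches left without a step."""
    if num_batches < 0 or accumulation < 1:
        raise ValueError("num_batches must be >= 0 and accumulation >= 1")
    # Same condition as the training loop: step after every `accumulation` batches.
    step_at = [i for i in range(num_batches) if (i + 1) % accumulation == 0]
    leftover = num_batches % accumulation
    return step_at, leftover


step_at, leftover = accumulation_schedule(num_batches=10, accumulation=4)
print(step_at, leftover)
```

With 10 batches and an accumulation factor of 4, updates happen after batches 3 and 7, and the gradients from the last two batches are not applied within that epoch. A loader length divisible by the accumulation factor, or an extra optimizer step after the loop, avoids the leftover.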

In [None]:
# Run training for all folds
oof_predictions = np.zeros(len(train_df))
fold_aucs = []
model_paths = []

for fold, (train_idx, val_idx) in enumerate(folds):
    oof_preds, best_auc, model_path = train_fold(
        fold=fold,
        train_idx=train_idx,
        val_idx=val_idx,
        config=config,
    )
    
    oof_predictions[val_idx] = oof_preds
    fold_aucs.append(best_auc)
    model_paths.append(model_path)

# Overall OOF performance
overall_auc = roc_auc_score(y, oof_predictions)
print(f"\n{'='*50}")
print("HOPE OOF Results")
print(f"{'='*50}")
print(f"Fold AUCs: {[f'{auc:.5f}' for auc in fold_aucs]}")
print(f"Mean Fold AUC: {np.mean(fold_aucs):.5f} (+/- {np.std(fold_aucs):.5f})")
print(f"Overall OOF AUC: {overall_auc:.5f}")

# Save OOF predictions
pd.DataFrame({'oof_hope': oof_predictions}).to_csv(OOF_DIR / 'oof_hope.csv', index=False)
print(f"\nSaved OOF predictions to {OOF_DIR / 'oof_hope.csv'}")
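
The cell above reports both the mean of the per-fold AUCs and a single AUC over the pooled out-of-fold predictions. The two can differ when fold models produce scores on different scales, because the pooled AUC also compares examples across folds. Here is a toy sketch with made-up scores. `roc_auc` is a hypothetical pure-Python stand-in for `roc_auc_score`, using the pairwise definition of AUC.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive scores higher, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Two toy folds: each ranks its own examples perfectly, but fold 2 scores
# everything higher than fold 1 (illustrative numbers, not real results).
toy_folds = [
    ([0, 1], [0.1, 0.2]),
    ([0, 1], [0.8, 0.9]),
]

fold_aucs = [roc_auc(labels, scores) for labels, scores in toy_folds]
pooled_labels = [y for labels, _ in toy_folds for y in labels]
pooled_scores = [s for _, scores in toy_folds for s in scores]
pooled_auc = roc_auc(pooled_labels, pooled_scores)

print(fold_aucs, pooled_auc)
```

Both toy folds reach an AUC of 1.0, yet the pooled AUC is 0.75 because the fold 1 positive (0.2) ranks below the fold 2 negative (0.8). A gap between the two numbers is therefore a hint of calibration drift across folds rather than of a bug.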

In [None]:
# Inference on test set
def predict_test(test_df: pd.DataFrame, model_paths: List[str], config: HOPEConfig) -> np.ndarray:
    """
    Generate predictions on test set by averaging across all fold models.
    """
    tokenizer = AutoTokenizer.from_pretrained(config.base_model, use_fast=True)
    test_dataset = TextDataset(test_df, tokenizer, text_col, None, config.max_length)
    test_loader = DataLoader(
        test_dataset,
        batch_size=config.batch_size * 2,
        shuffle=False,
        num_workers=4,
        pin_memory=True,
    )
    
    all_preds = []
    
    for model_path in model_paths:
        print(f"Loading model from {model_path}")
        model = HOPEClassifier(config).to(DEVICE)
        model.load_state_dict(torch.load(model_path, map_location=DEVICE, weights_only=True))
        model.eval()
        
        fold_preds = []
        with torch.no_grad():
            for batch in tqdm(test_loader, desc="Predicting"):
                input_ids = batch['input_ids'].to(DEVICE)
                attention_mask = batch['attention_mask'].to(DEVICE)
                
                with torch.amp.autocast('cuda', enabled=DEVICE.type == 'cuda'):
                    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
                
                # Cast to float32 so half-precision logits do not coarsen the probabilities
                probs = F.softmax(outputs['logits'].float(), dim=-1)[:, 1]
                fold_preds.extend(probs.cpu().numpy())
        
        all_preds.append(fold_preds)
        del model
        torch.cuda.empty_cache()
    
    # Average across folds
    return np.mean(all_preds, axis=0)


if test_df is not None:
    print("\nRunning inference on test set...")
    test_preds = predict_test(test_df, model_paths, config)
    
    # Save submission
    submission = pd.DataFrame({
        'id': test_df.index if 'id' not in test_df.columns else test_df['id'],
        'prediction': test_preds,
    })
    submission.to_csv(WORK_DIR / 'submission_hope.csv', index=False)
    print(f"Saved submission to {WORK_DIR / 'submission_hope.csv'}")
    
    # If test has labels, compute AUC
    if label_col in test_df.columns:
        test_auc = roc_auc_score(test_df[label_col], test_preds)
        test_acc = accuracy_score(test_df[label_col], (test_preds > 0.5).astype(int))
        print(f"Test AUC: {test_auc:.5f}")
        print(f"Test Accuracy: {test_acc:.5f}")
else:
    print("\nNo test data available - skipping inference.")

In [None]:
# Summary and comparison with other models
print("\n" + "="*60)
print("HOPE Model Summary")
print("="*60)

# Load OOF from other models for comparison
comparison_models = {
    'HOPE': oof_predictions,
}

for name in ['deberta', 'lgb', 'sgd', 'stack']:
    path = OOF_DIR / f'oof_{name}.csv'
    if path.exists():
        df = pd.read_csv(path)
        col = f'oof_{name}'
        if col in df.columns and len(df) == len(y):
            comparison_models[name.upper()] = df[col].values

print("\nOOF ROC-AUC Comparison:")
print("-" * 40)
results = []
for name, preds in comparison_models.items():
    auc = roc_auc_score(y, preds)
    results.append({'Model': name, 'OOF AUC': f'{auc:.5f}'})
    print(f"{name:15s}: {auc:.5f}")

print("\n" + "="*60)
print("HOPE Architecture Components:")
print("-" * 40)
print(f"  - Base Encoder: {config.base_model}")
print(f"  - Memory Levels: {config.num_memory_levels}")
print(f"  - Update Frequencies: {config.update_frequencies}")
print(f"  - Self-Referential: {config.use_self_referential}")
print(f"  - Nested Optimization: inner_steps={config.inner_steps}")
print("="*60)
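
The summary lists `config.update_frequencies` for the memory levels. One plausible reading, assumed here rather than taken from this notebook's CMS implementation, is that level `i` is updated once every `update_frequencies[i]` steps, so fast levels track recent inputs while slow levels change rarely. A minimal sketch of that schedule follows. `levels_due`, `count_updates`, and the frequencies `[1, 4, 16]` are illustrative.

```python
from typing import Dict, List, Sequence


def levels_due(step: int, update_frequencies: Sequence[int]) -> List[int]:
    """Indices of memory levels scheduled to update at a 1-based step,
    assuming level i updates once every update_frequencies[i] steps."""
    return [i for i, freq in enumerate(update_frequencies) if step % freq == 0]


def count_updates(total_steps: int, update_frequencies: Sequence[int]) -> Dict[int, int]:
    """Count how often each level is updated over total_steps steps."""
    counts = {i: 0 for i in range(len(update_frequencies))}
    for step in range(1, total_steps + 1):
        for level in levels_due(step, update_frequencies):
            counts[level] += 1
    return counts


print(levels_due(4, [1, 4, 16]))
print(count_updates(16, [1, 4, 16]))
```

Under this assumption, step 4 updates levels 0 and 1, and over 16 steps the three levels receive 16, 4, and 1 updates respectively.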

## References

1. [Nested Learning: A new ML paradigm for continual learning](https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/) - Google Research Blog
2. [Titans: Learning to Memorize at Test Time](https://arxiv.org/abs/2501.00663) - arXiv paper
3. [HOPE: Hierarchical Optimization with Persistent Experience](https://www.etavrian.com/news/google-nested-learning-hope-outperforms-transformers) - Etavrian news article covering the NeurIPS 2025 work

### Key Innovations:
- **Neural Memory Module**: Uses MLP parameters to store key-value associations
- **Surprise-Based Updates**: Memory updates scale with the gradient of the memory's prediction error, which serves as the "surprise" signal
- **Continuum Memory System**: Multiple memory modules at different update frequencies
- **Self-Referential Learning**: Memory can optimize its own parameters (HOPE's addition)
- **Nested Optimization**: Separate optimizers and learning rates for the encoder (outer loop) and the memory modules (inner loop)
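
The surprise-based update listed above can be sketched for a single key-value write. This is a minimal illustration under stated assumptions, not this notebook's Neural Memory Module. The memory is a plain linear map rather than an MLP, and the update follows the momentum-plus-forgetting rule described in the Titans paper. `surprise_update`, `lr`, `eta`, and `alpha` are hypothetical names with illustrative values.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matvec(m: Matrix, x: List[float]) -> List[float]:
    """Multiply matrix m by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def surprise_update(
    memory: Matrix,
    momentum: Matrix,
    key: List[float],
    value: List[float],
    lr: float = 0.1,      # step size on the surprise gradient (assumed value)
    eta: float = 0.5,     # momentum decay, i.e. how long past surprise lingers
    alpha: float = 0.01,  # forgetting rate applied to the old memory
) -> Tuple[Matrix, Matrix, float]:
    """One memory write. Returns (new_memory, new_momentum, loss_before_update)."""
    error = [p - v for p, v in zip(matvec(memory, key), value)]
    loss = sum(e * e for e in error)
    # Gradient of ||M k - v||^2 with respect to M is 2 * outer(error, key).
    grad = [[2.0 * e * k for k in key] for e in error]
    # S_t = eta * S_{t-1} - lr * grad
    new_momentum = [
        [eta * s - lr * g for s, g in zip(s_row, g_row)]
        for s_row, g_row in zip(momentum, grad)
    ]
    # M_t = (1 - alpha) * M_{t-1} + S_t
    new_memory = [
        [(1.0 - alpha) * w + s for w, s in zip(w_row, s_row)]
        for w_row, s_row in zip(memory, new_momentum)
    ]
    return new_memory, new_momentum, loss


memory = [[0.0, 0.0], [0.0, 0.0]]
momentum = [[0.0, 0.0], [0.0, 0.0]]
losses = []
for _ in range(10):
    memory, momentum, loss = surprise_update(
        memory, momentum, key=[1.0, 0.0], value=[1.0, -1.0]
    )
    losses.append(loss)

print([round(x, 4) for x in losses])
```

Repeatedly writing the same pair drives the loss from 2.0 toward zero: a well-stored association produces little surprise and therefore small updates. The momentum term can cause a slight overshoot, so the loss need not decrease at every step.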