# Gated DeltaNet Research Notebook

This notebook is for researching Gated DeltaNet on Google Colab.

**Paper**: [Gated Delta Networks: Improving Mamba2 with Delta Rule](https://arxiv.org/abs/2412.06464)

## 1. Check GPU Availability

In [5]:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️ WARNING: No GPU detected! Go to Runtime > Change runtime type > GPU")

CUDA available: True
GPU: Tesla T4
GPU Memory: 15.83 GB


## 2. Setup and Installation

In [6]:
# Clone the repository
!git clone https://github.com/vukrosic/deltanet.git
%cd deltanet

Cloning into 'deltanet'...
remote: Enumerating objects: 16235, done.
remote: Counting objects: 100% (16235/16235), done.
remote: Compressing objects: 100% (4519/4519), done.
remote: Total 16235 (delta 11634), reused 16235 (delta 11634), pack-reused 0 (from 0)
Receiving objects: 100% (16235/16235), 5.88 MiB | 12.53 MiB/s, done.
Resolving deltas: 100% (11634/11634), done.
/content/flash-linear-attention/deltanet


In [7]:
# Install dependencies
!pip install -e .
!pip install transformers einops

Obtaining file:///content/flash-linear-attention/deltanet
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: flash-linear-attention
  Building editable for flash-linear-attention (pyproject.toml) ... done
  Created wheel for flash-linear-attention: filename=flash_linear_attention-0.4.1-0.editable-py3-none-any.whl size=6427 sha256=c23ff8267c46dcc33837fe57ac4ee6e2acf5529926984e7071b590c5f5217b04
  Stored in directory: /tmp/pip-ephem-wheel-cache-b6a_t059/wheels/41/d0/36/a81a7c4cb1b511f113149b9c4b029e59f6cb849b5c86664b3a
Successfully built flash-linear-attention
Installing collected packages: flash-linear-attention
  Attempting uninstall: flash-linear-attention
    Found existing installation: flash-linear-attention 0.4.1
    Uninstallin



## 3. Import Gated DeltaNet

In [8]:
import torch
from fla.layers import GatedDeltaNet
from fla.models import GatedDeltaNetConfig, GatedDeltaNetForCausalLM, GatedDeltaNetModel

# Set device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")
print("Imports successful!")

Using device: cuda
Imports successful!


## 4. Basic Layer Usage

In [9]:
# Create a Gated DeltaNet layer and move to GPU
layer = GatedDeltaNet(
    hidden_size=512,
    expand_v=2.0,
    head_dim=64,
    num_heads=6,  # 6 * 64 = 384 = 0.75 * 512
    mode='chunk',
    use_gate=True,
    use_short_conv=True,
).to(device)

# Test with random input on GPU
batch_size = 2
seq_len = 128
hidden_size = 512

x = torch.randn(batch_size, seq_len, hidden_size, device=device)
output, _, _ = layer(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Input device: {x.device}")
print(f"Output device: {output.device}")

Input shape: torch.Size([2, 128, 512])
Output shape: torch.Size([2, 128, 512])
Input device: cuda:0
Output device: cuda:0


## 5. Model Configuration

In [10]:
# Create a small Gated DeltaNet model
config = GatedDeltaNetConfig(
    hidden_size=768,
    num_hidden_layers=12,
    num_heads=12,
    head_dim=64,
    vocab_size=50257,
)

model = GatedDeltaNetForCausalLM(config).to(device)
print(f"Model created with {sum(p.numel() for p in model.parameters())/1e6:.2f}M parameters")
print(f"Model device: {next(model.parameters()).device}")

Model created with 190.83M parameters
Model device: cuda:0


## 6. Forward Pass Test

In [None]:
# Test forward pass
input_ids = torch.randint(0, config.vocab_size, (2, 64), device=device)
outputs = model(input_ids)

print(f"Input IDs shape: {input_ids.shape}")
print(f"Logits shape: {outputs.logits.shape}")
print(f"Logits device: {outputs.logits.device}")

## 7. Memory Usage Check

In [None]:
if torch.cuda.is_available():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
    print(f"GPU memory reserved: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")
    print(f"Max GPU memory allocated: {torch.cuda.max_memory_allocated(0) / 1e9:.2f} GB")

## 8. Test Different Sequence Lengths

In [None]:
import time

# Test different sequence lengths
seq_lengths = [64, 128, 256, 512, 1024]
batch_size = 2
hidden_size = 512

layer = GatedDeltaNet(
    hidden_size=hidden_size,
    expand_v=2.0,
    head_dim=64,
    num_heads=6,
    mode='chunk',
    use_gate=True,
    use_short_conv=True,
).to(device)

print("Sequence Length | Time (ms) | Memory (MB)")
print("-" * 45)

for seq_len in seq_lengths:
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    
    x = torch.randn(batch_size, seq_len, hidden_size, device=device)
    
    # Warmup
    _ = layer(x)
    
    # Measure time (perf_counter is monotonic and higher resolution than time.time)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    output, _, _ = layer(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) * 1000
    
    mem_mb = torch.cuda.max_memory_allocated(0) / 1e6 if torch.cuda.is_available() else 0
    
    print(f"{seq_len:14d} | {elapsed:9.2f} | {mem_mb:11.2f}")
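
A single timed call is noisy, especially on a shared Colab GPU. The sketch below shows one way to make the measurement more stable: run a few untimed warmup calls, then report the median of several timed calls. The helper name `time_call` and its `sync` parameter are hypothetical and not part of `fla` or PyTorch; the stand-in workload is CPU-only so the block runs anywhere.

```python
import statistics
import time


def time_call(fn, *, warmup=2, repeats=10, sync=None):
    """Return the median wall-clock time of fn() in milliseconds.

    sync is an optional zero-argument callable (for example
    torch.cuda.synchronize) invoked before each clock read so that
    queued GPU work is included in the measurement.
    """
    def _sync():
        if sync is not None:
            sync()

    for _ in range(warmup):  # untimed runs absorb compilation and caching costs
        fn()

    samples = []
    for _ in range(repeats):
        _sync()
        start = time.perf_counter()
        fn()
        _sync()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# CPU-only stand-in workload; replace with the layer call when on GPU.
ms = time_call(lambda: sum(i * i for i in range(10_000)))
print(f"median: {ms:.3f} ms")
```

In the benchmark loop, the equivalent call would be `time_call(lambda: layer(x), sync=torch.cuda.synchronize if torch.cuda.is_available() else None)`.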

## 9. Research Experiments

Add your research experiments below:

In [None]:
# Your experiments here


## 10. Key Architecture Components

### Gated DeltaNet Layer Parameters:

- **hidden_size**: Hidden dimension
- **expand_v**: Value dimension expansion ratio (default: 2.0)
- **head_dim**: Dimension per head
- **num_heads**: Number of attention heads (recommended sizing when use_gate=True: num_heads * head_dim = 0.75 * hidden_size; this is a parameter-budget guideline, not a hard constraint)
- **num_v_heads**: Number of value heads (GVA if > num_heads)
- **mode**: Kernel mode ('chunk' for training, 'fused_recurrent' for inference)
- **use_beta**: Use beta parameter
- **use_gate**: Use output gating (recommended: True)
- **use_short_conv**: Use short convolutions (crucial for performance!)
- **allow_neg_eigval**: Allow negative eigenvalues in the state transition (scales beta by 2, so it ranges over [0, 2] instead of [0, 1])
- **conv_size**: Convolution kernel size (default: 4)
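
The sizing guideline for `num_heads` and `head_dim` can be checked with simple arithmetic. The sketch below assumes the query/key width is `num_heads * head_dim` and the value width is that times `expand_v` (with `num_v_heads` equal to `num_heads`); both helper names are hypothetical, and the assumption should be verified against the installed `fla` version.

```python
def gated_deltanet_dims(hidden_size, num_heads, head_dim, expand_v=2.0):
    """Derive projection widths from the layer hyperparameters (illustrative helper)."""
    key_dim = num_heads * head_dim       # width of the q and k projections
    value_dim = int(key_dim * expand_v)  # width of the v projection (assumes num_v_heads == num_heads)
    return {
        "key_dim": key_dim,
        "value_dim": value_dim,
        "key_ratio": key_dim / hidden_size,  # 0.75 under the recommended sizing
    }


def heads_for_ratio(hidden_size, head_dim, ratio=0.75):
    """Pick num_heads so that num_heads * head_dim == ratio * hidden_size, or raise."""
    target = ratio * hidden_size
    if target % head_dim != 0:
        raise ValueError(f"{target} is not a multiple of head_dim={head_dim}")
    return int(target // head_dim)


# The layer from section 4: 6 heads of width 64 on a 512-wide model.
print(gated_deltanet_dims(hidden_size=512, num_heads=6, head_dim=64))
print(heads_for_ratio(hidden_size=512, head_dim=64))
```

For `hidden_size=512` and `head_dim=64` this yields 6 heads, matching the `# 6 * 64 = 384 = 0.75 * 512` comment in section 4.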

### Key Operations:
- `chunk_gated_delta_rule`: Chunk-based implementation (training)
- `fused_recurrent_gated_delta_rule`: Fused recurrent (inference)
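
Both kernels evaluate the gated delta rule from the paper: the state is updated as S_t = S_{t-1} (α_t (I − β_t k_t k_tᵀ)) + β_t v_t k_tᵀ and the output is o_t = S_t q_t. The pure-Python reference below is a sketch for intuition only, not the library's implementation. The function name `gated_delta_rule_reference` is hypothetical; it handles a single head and omits the query scaling, key normalization, and log-space gating that the real kernels may apply.

```python
def gated_delta_rule_reference(queries, keys, values, alphas, betas):
    """Naive single-head gated delta rule, one token at a time.

    The state S has shape (d_v, d_k). Per step:
        S <- alpha * (S - beta * (S k) k^T) + beta * v k^T
        o  = S q
    """
    d_k, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v, alpha, beta in zip(queries, keys, values, alphas, betas):
        # What the memory currently returns for key k.
        Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
        # Decay the state, erase the old association for k, write the new one.
        S = [
            [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
             for j in range(d_k)]
            for i in range(d_v)
        ]
        outputs.append([sum(S[i][j] * q[j] for j in range(d_k)) for i in range(d_v)])
    return outputs


# Writing twice to the same unit-norm key with alpha = beta = 1 overwrites the value.
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
outs = gated_delta_rule_reference(keys, keys, values, alphas=[1.0, 1.0], betas=[1.0, 1.0])
print(outs)
```

Querying with the key that was just written returns the most recent value, which is the overwrite behavior the delta rule adds on top of plain additive linear attention. Setting `alpha` to 0 at a step clears everything stored before it.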

### GPU Requirements:
Rough guidelines; the practical limit also depends on batch size, sequence length, and precision:
- T4 (16GB): Good for models up to ~350M parameters
- V100 (16GB): Good for models up to ~1B parameters
- A100 (40GB): Good for models up to ~7B parameters
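
These figures can be sanity-checked with a back-of-the-envelope estimate of weight memory. The sketch below assumes fp32 weights and an Adam-style optimizer with two extra buffers per parameter, and it ignores activations entirely; the helper name `estimate_memory_gb` is hypothetical.

```python
def estimate_memory_gb(num_params, bytes_per_param=4, optimizer_states=2, training=True):
    """Rough memory for weights (plus gradients and optimizer state when training), in GB.

    Ignores activations, temporary buffers, and framework overhead, which
    often dominate at long sequence lengths or large batch sizes.
    """
    copies = 1                      # the weights themselves
    if training:
        copies += 1                 # gradients
        copies += optimizer_states  # e.g. Adam's first and second moments
    return num_params * bytes_per_param * copies / 1e9


n = 190.83e6  # parameter count reported for the model built in section 5
print(f"inference weights: {estimate_memory_gb(n, training=False):.2f} GB")
print(f"training (fp32 + Adam): {estimate_memory_gb(n):.2f} GB")
```

For the 190.83M-parameter model this gives about 0.76 GB for the weights alone and about 3.05 GB with gradients and optimizer state, so on a 16 GB card most of the remaining budget goes to activations.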