# Quantum Transformer: Complete Tutorial and Experiments

This notebook provides a comprehensive guide to the Quantum Transformer library, covering:

1. **Installation and Setup** - Environment configuration and library imports
2. **Basic Quantum Transformer Experiments** - Model creation and forward pass
3. **Quantum Self-Attention** - Attention mechanisms and visualization
4. **Molecular Property Prediction** - SMILES tokenization and energy prediction
5. **Benchmarking and Analysis** - Performance comparison with classical models
6. **Advanced Quantum Features** - Positional encoding and variational layers

---

## Theoretical Background

### Quantum Attention Mechanism

Classical attention computes similarity scores using dot products: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

In quantum transformers, we replace this with quantum circuits that compute overlaps between quantum states.
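
To make the contrast concrete, here is a minimal NumPy sketch that computes attention once with scaled dot-product scores and once with overlap scores $|\langle q_i|k_j\rangle|^2$. It assumes queries and keys are encoded as unit-normalized state vectors; the helper names (`dot_product_scores`, `overlap_scores`, `attend`) are hypothetical and are not part of the Quantum Transformer library, whose circuits may compute scores differently.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with max-subtraction for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dot_product_scores(Q, K):
    """Classical scores: Q K^T / sqrt(d_k)."""
    return Q @ K.T / np.sqrt(Q.shape[-1])

def overlap_scores(Q, K):
    """Overlap scores |<q_i|k_j>|^2 between unit-normalized rows.

    Assumption: each row is treated as the amplitude vector of a quantum state.
    """
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
    return np.abs(Qn @ Kn.conj().T) ** 2

def attend(scores, V):
    """Turn scores into weights with softmax and mix the value vectors."""
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))

classical_out = attend(dot_product_scores(Q, K), V)
quantum_like_out = attend(overlap_scores(Q, K), V)
print(classical_out.shape, quantum_like_out.shape)  # (4, 8) (4, 8)
```

Unlike dot products, overlap scores are bounded in $[0, 1]$ and discard the sign of the similarity, so the softmax over them is flatter than in the classical case.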

### SWAP Test

The SWAP test is a quantum algorithm for measuring the overlap (fidelity) between two quantum states $|\psi\rangle$ and $|\phi\rangle$:

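1. Prepare an ancilla qubit in state $|0\rangle$
2. Apply a Hadamard gate to the ancilla, giving $|+\rangle = \frac{1}{\sqrt{2}}(|0\rangle + |1\rangle)$
3. Apply a controlled-SWAP (Fredkin) gate that swaps the two state registers when the ancilla is $|1\rangle$
4. Apply a second Hadamard gate to the ancilla
5. Measure the ancilla: $P(0) = \frac{1}{2}(1 + |\langle\psi|\phi\rangle|^2)$, so the overlap is recovered as $|\langle\psi|\phi\rangle|^2 = 2P(0) - 1$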
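
The five steps can be checked with a small state-vector simulation. The sketch below uses plain NumPy rather than PennyLane so that it runs without a quantum backend; `swap_test_p0` is a hypothetical helper written for illustration, not the library's `SwapTestAttention`.

```python
import numpy as np

def swap_test_p0(psi, phi):
    """Simulate the SWAP test and return P(ancilla = 0)."""
    psi = np.asarray(psi, dtype=complex)
    phi = np.asarray(phi, dtype=complex)
    psi = psi / np.linalg.norm(psi)
    phi = phi / np.linalg.norm(phi)
    d = psi.size

    # Step 1: joint state |0>_anc (x) |psi> (x) |phi>, axes = (ancilla, reg_a, reg_b)
    state = np.zeros((2, d, d), dtype=complex)
    state[0] = np.outer(psi, phi)

    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

    # Step 2: Hadamard on the ancilla
    state = np.tensordot(H, state, axes=([1], [0]))

    # Step 3: controlled-SWAP, exchanging the two registers when ancilla = 1
    state[1] = state[1].T.copy()

    # Step 4: second Hadamard on the ancilla
    state = np.tensordot(H, state, axes=([1], [0]))

    # Step 5: probability of measuring the ancilla in |0>
    return float(np.sum(np.abs(state[0]) ** 2))

zero = [1, 0]
plus = [1, 1]  # normalized inside the function
p0 = swap_test_p0(zero, plus)
print(f"P(0) = {p0:.4f}, estimated overlap = {2 * p0 - 1:.4f}")
```

For $|0\rangle$ and $|+\rangle$ the squared overlap is $\frac{1}{2}$, so $P(0)$ comes out close to $0.75$. On hardware, $P(0)$ has to be estimated from repeated measurements, so the overlap estimate carries shot noise that this exact simulation does not show.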

### Quantum Feed-Forward Networks

Variational quantum circuits serve as feed-forward layers with:
- Data encoding via rotation gates (RY, RZ)
- Parameterized rotation layers (Rot gates)
- Entangling layers (CNOT gates)
- Measurement to extract classical output
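
As a rough illustration of these four ingredients, the following NumPy sketch simulates a two-qubit variational circuit. It assumes RY-only data encoding, a general rotation $R(\phi, \theta, \omega) = R_Z(\omega) R_Y(\theta) R_Z(\phi)$ per qubit, and one CNOT per layer; `variational_ffn` is a hypothetical function for illustration and does not reproduce the library's `QuantumFeedForward`.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
Z = np.diag([1.0, -1.0]).astype(complex)
# CNOT with qubit 0 (most significant bit) as control and qubit 1 as target
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def rz(t):
    return np.array([[np.exp(-1j * t / 2), 0], [0, np.exp(1j * t / 2)]], dtype=complex)

def rot(phi, theta, omega):
    """General single-qubit rotation: RZ(phi) first, then RY(theta), then RZ(omega)."""
    return rz(omega) @ ry(theta) @ rz(phi)

def variational_ffn(x, weights):
    """x: two input features; weights: array of shape (n_layers, 2, 3).

    Returns the expectation values <Z> on each qubit, each in [-1, 1].
    """
    state = np.zeros(4, dtype=complex)
    state[0] = 1.0  # start in |00>

    # Data encoding: RY(x_i) on qubit i
    state = np.kron(ry(x[0]), ry(x[1])) @ state

    for layer in weights:
        # Parameterized rotation layer
        state = np.kron(rot(*layer[0]), rot(*layer[1])) @ state
        # Entangling layer
        state = CNOT @ state

    # Measurement: Pauli-Z expectation on each qubit
    z0 = np.real(np.vdot(state, np.kron(Z, I2) @ state))
    z1 = np.real(np.vdot(state, np.kron(I2, Z) @ state))
    return np.array([z0, z1])

rng = np.random.default_rng(42)
weights = rng.uniform(0, 2 * np.pi, size=(3, 2, 3))
print(variational_ffn(np.array([0.3, -1.2]), weights))
```

Because the outputs are expectation values bounded in $[-1, 1]$, a layer like this is typically wrapped in classical linear projections to map to and from the model dimension.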

---
## 1. Installation and Setup

In [None]:
# Install required packages (uncomment if needed)
# Quote the version specifiers so the shell does not treat ">" as a redirect
# !pip install "torch>=2.0.0" "pennylane>=0.32.0" matplotlib seaborn numpy

In [None]:
import sys
import warnings
warnings.filterwarnings('ignore')

# Core libraries
import numpy as np
import torch
import torch.nn as nn
import pennylane as qml

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"PennyLane version: {qml.__version__}")
print(f"NumPy version: {np.__version__}")

In [None]:
# Import Quantum Transformer components
import sys
sys.path.insert(0, '..')

from quantum_transformers import (
    QuantumTransformer,
    QuantumTransformerConfig,
    QuantumMultiHeadAttention,
    QuantumAttentionConfig,
    SwapTestAttention,
    QuantumFeedForward,
    QuantumPositionalEncoding,
    QuantumSinusoidalEncoding,
    QuantumRotationalEncoding,
    QuantumAmplitudeEncoding,
    QuantumAngleEncoding,
    VariationalLayer,
    QuantumTransformerForMolecules,
    MolecularModelConfig,
    SMILESTokenizer,
    get_info,
)

# Display library info
info = get_info()
for key, value in info.items():
    print(f"{key}: {value}")

---
## 2. Basic Quantum Transformer Experiments

In [None]:
# Create QuantumTransformerConfig with NISQ-friendly parameters
config = QuantumTransformerConfig(
    n_qubits=4,       # 4 qubits for NISQ simulation
    n_heads=2,        # 2 attention heads
    n_layers=4,       # 4 transformer layers
    d_model=16,       # Model dimension (must be <= 2^n_qubits * n_heads)
    d_ff=64,          # Feed-forward dimension
    max_seq_len=64,   # Maximum sequence length
    dropout=0.1,      # Dropout rate
    attention_type="swap_test",
    device="default.qubit"
)

print("Quantum Transformer Configuration:")
print(f"  - Number of qubits: {config.n_qubits}")
print(f"  - Number of heads: {config.n_heads}")
print(f"  - Number of layers: {config.n_layers}")
print(f"  - Model dimension: {config.d_model}")
print(f"  - Max sequence length: {config.max_seq_len}")

In [None]:
# Create Quantum Transformer model
model = QuantumTransformer(config)

# Display model architecture
print("Quantum Transformer Model:")
print(model)

# Count parameters
params = model.count_parameters()
print(f"\nModel Parameters:")
print(f"  - Total: {params['total']:,}")
print(f"  - Trainable: {params['trainable']:,}")

In [None]:
# Create dummy input tensor
batch_size = 2
seq_len = 8
d_model = 16

dummy_input = torch.randn(batch_size, seq_len, d_model)
print(f"Input shape: {dummy_input.shape}")
print(f"Input sample (first 3 values of first sequence):")
print(dummy_input[0, 0, :3])

In [None]:
# Forward pass through Quantum Transformer
model.eval()
with torch.no_grad():
    output = model(dummy_input)

print(f"Output shape: {output.shape}")
print(f"Expected shape: ({batch_size}, {seq_len}, {d_model})")
print(f"\nOutput sample (first sequence, first position):")
print(output[0, 0, :])

In [None]:
# Visualize input vs output distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(dummy_input.flatten().numpy(), bins=50, alpha=0.7, color='steelblue')
axes[0].set_title('Input Distribution')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')

axes[1].hist(output.flatten().numpy(), bins=50, alpha=0.7, color='darkorange')
axes[1].set_title('Output Distribution')
axes[1].set_xlabel('Value')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

---
## 3. Quantum Self-Attention

In [None]:
# Create Quantum Attention Configuration
attn_config = QuantumAttentionConfig(
    n_qubits=4,
    n_heads=2,
    d_k=8,
    attention_type="swap_test",
    dropout=0.1
)

# Create Multi-Head Attention module
mha = QuantumMultiHeadAttention(attn_config)
print("Quantum Multi-Head Attention:")
print(mha)

In [None]:
# Create SWAP Test Attention for visualization
swap_attention = SwapTestAttention(n_qubits=4)

# Create sample query and key tensors
batch_size = 1
seq_len = 4  # Small sequence for visualization
d_k = 8

query = torch.randn(batch_size, seq_len, d_k)
key = torch.randn(batch_size, seq_len, d_k)

print(f"Query shape: {query.shape}")
print(f"Key shape: {key.shape}")

In [None]:
# Compute attention scores using SWAP test
with torch.no_grad():
    attention_scores = swap_attention(query, key)

print(f"Attention scores shape: {attention_scores.shape}")
print(f"\nAttention matrix:")
print(attention_scores[0].numpy())

In [None]:
# Visualize attention pattern as heatmap
fig, ax = plt.subplots(figsize=(8, 6))

attn_matrix = attention_scores[0].numpy()
sns.heatmap(
    attn_matrix,
    annot=True,
    fmt='.3f',
    cmap='viridis',
    xticklabels=[f'K{i}' for i in range(seq_len)],
    yticklabels=[f'Q{i}' for i in range(seq_len)],
    ax=ax
)
ax.set_title('Quantum Attention Pattern (SWAP Test)')
ax.set_xlabel('Key Position')
ax.set_ylabel('Query Position')
plt.tight_layout()
plt.show()

In [None]:
# Apply softmax to get attention weights
attention_weights = torch.softmax(attention_scores, dim=-1)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Raw scores
sns.heatmap(attention_scores[0].numpy(), annot=True, fmt='.3f',
            cmap='coolwarm', ax=axes[0])
axes[0].set_title('Raw Attention Scores')
axes[0].set_xlabel('Key Position')
axes[0].set_ylabel('Query Position')

# Softmax weights
sns.heatmap(attention_weights[0].numpy(), annot=True, fmt='.3f',
            cmap='viridis', ax=axes[1])
axes[1].set_title('Attention Weights (After Softmax)')
axes[1].set_xlabel('Key Position')
axes[1].set_ylabel('Query Position')

plt.tight_layout()
plt.show()

---
## 4. Molecular Property Prediction

In [None]:
# Initialize SMILES Tokenizer
tokenizer = SMILESTokenizer()

print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"PAD token ID: {tokenizer.pad_token_id}")
print(f"UNK token ID: {tokenizer.unk_token_id}")
print(f"\nSample vocabulary entries:")
for token, idx in list(tokenizer.vocab.items())[:10]:
    print(f"  '{token}': {idx}")

In [None]:
# Sample molecules for prediction
molecules = {
    "Ethanol": "CCO",
    "Cyclohexane": "C1CCCCC1",
    "Carbon Dioxide": "O=C=O",
    "Benzene": "c1ccccc1",
    "Methane": "C",
}

# Tokenize molecules
print("Tokenization Results:")
print("-" * 60)
for name, smiles in molecules.items():
    tokens = tokenizer.tokenize(smiles)
    encoded = tokenizer.encode(smiles, max_length=32)
    print(f"{name} ({smiles}):")
    print(f"  Tokens: {tokens}")
    print(f"  Encoded: {encoded[:10]}...")
    print()

In [None]:
# Create Molecular Model Configuration
mol_config = MolecularModelConfig(
    vocab_size=tokenizer.vocab_size,
    max_seq_len=32,
    n_qubits=4,
    n_heads=2,
    n_layers=2,  # Reduced for faster execution
    d_model=16,
    dropout=0.1,
    task="energy"
)

# Create model
mol_model = QuantumTransformerForMolecules(mol_config)
print("Molecular Transformer Configuration:")
print(f"  - Task: {mol_config.task}")
print(f"  - Vocab size: {mol_config.vocab_size}")
print(f"  - Model dimension: {mol_config.d_model}")

In [None]:
# Predict energies for molecules
mol_model.eval()
predictions = {}

print("Energy Predictions:")
print("-" * 40)

for name, smiles in molecules.items():
    input_ids = tokenizer(smiles, max_length=32, return_tensors="pt")
    
    with torch.no_grad():
        energy = mol_model.predict_energy(input_ids)
    
    predictions[name] = energy.item()
    print(f"{name}: {energy.item():.4f} Ha")

print("\n(Note: Values are from untrained model for demonstration)")

In [None]:
# Visualize predictions
fig, ax = plt.subplots(figsize=(10, 6))

names = list(predictions.keys())
values = list(predictions.values())
colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(names)))

bars = ax.bar(names, values, color=colors, edgecolor='black', linewidth=1.2)
ax.set_ylabel('Predicted Energy (Hartree)')
ax.set_title('Quantum Transformer Molecular Energy Predictions')
ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)

# Add value labels (bar_label also positions labels correctly for negative bars)
ax.bar_label(bars, fmt='%.3f', padding=3, fontsize=10)

plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

---
## 5. Benchmarking and Analysis

In [None]:
def calculate_mae(predictions, targets):
    """Calculate Mean Absolute Error."""
    predictions = np.array(predictions)
    targets = np.array(targets)
    return np.mean(np.abs(predictions - targets))

def calculate_rmse(predictions, targets):
    """Calculate Root Mean Square Error."""
    predictions = np.array(predictions)
    targets = np.array(targets)
    return np.sqrt(np.mean((predictions - targets) ** 2))

# Reference energies (dummy data for demonstration)
reference_energies = {
    "Ethanol": -154.08,
    "Cyclohexane": -234.51,
    "Carbon Dioxide": -188.22,
    "Benzene": -232.15,
    "Methane": -40.42,
}

print("Reference vs Predicted Energies:")
print("-" * 50)
for name in molecules.keys():
    ref = reference_energies[name]
    pred = predictions[name]
    error = abs(ref - pred)
    print(f"{name}:")
    print(f"  Reference: {ref:.2f} Ha")
    print(f"  Predicted: {pred:.4f} Ha")
    print(f"  |Error|: {error:.2f} Ha")

In [None]:
# Classical Transformer for comparison (simplified)
class ClassicalTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=16, n_heads=2, n_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)
    
    def forward(self, x):
        x = self.embedding(x)
        x = self.encoder(x)
        x = x.mean(dim=1)
        return self.head(x)

# Create classical model
classical_model = ClassicalTransformer(tokenizer.vocab_size)
classical_model.eval()

# Get classical predictions
classical_predictions = {}
for name, smiles in molecules.items():
    input_ids = tokenizer(smiles, max_length=32, return_tensors="pt")
    with torch.no_grad():
        energy = classical_model(input_ids)
    classical_predictions[name] = energy.item()

print("Classical Transformer Predictions:")
for name, val in classical_predictions.items():
    print(f"  {name}: {val:.4f} Ha")

In [None]:
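# Calculate MAE and RMSE for both models
# Index by molecule name so all three lists share the same order
ref_values = [reference_energies[name] for name in molecules]
quantum_preds = [predictions[name] for name in molecules]
classical_preds = [classical_predictions[name] for name in molecules]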

quantum_mae = calculate_mae(quantum_preds, ref_values)
classical_mae = calculate_mae(classical_preds, ref_values)

quantum_rmse = calculate_rmse(quantum_preds, ref_values)
classical_rmse = calculate_rmse(classical_preds, ref_values)

print("Performance Comparison:")
print("-" * 40)
print(f"Quantum Transformer MAE:  {quantum_mae:.2f} Ha")
print(f"Classical Transformer MAE: {classical_mae:.2f} Ha")
print(f"Quantum Transformer RMSE:  {quantum_rmse:.2f} Ha")
print(f"Classical Transformer RMSE: {classical_rmse:.2f} Ha")
print("\n(Note: Both models are untrained, so these metrics are illustrative only)")

In [None]:
# Visualize performance comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# MAE Comparison
models = ['Quantum Transformer', 'Classical Transformer']
mae_values = [quantum_mae, classical_mae]
colors = ['#2ecc71', '#e74c3c']

axes[0].bar(models, mae_values, color=colors, edgecolor='black', linewidth=1.5)
axes[0].set_ylabel('Mean Absolute Error (Ha)')
axes[0].set_title('MAE Comparison')
for i, v in enumerate(mae_values):
    axes[0].text(i, v + 1, f'{v:.2f}', ha='center', va='bottom', fontweight='bold')

# Prediction Comparison
x = np.arange(len(molecules))
width = 0.25

axes[1].bar(x - width, ref_values, width, label='Reference', color='#3498db')
axes[1].bar(x, quantum_preds, width, label='Quantum', color='#2ecc71')
axes[1].bar(x + width, classical_preds, width, label='Classical', color='#e74c3c')

axes[1].set_ylabel('Energy (Ha)')
axes[1].set_title('Prediction Comparison by Molecule')
axes[1].set_xticks(x)
axes[1].set_xticklabels(list(molecules.keys()), rotation=45, ha='right')
axes[1].legend()

plt.tight_layout()
plt.show()

---
## 6. Advanced Quantum Features

In [None]:
# Quantum Positional Encoding
d_model = 16
max_len = 32
n_qubits = 4

pos_encoder = QuantumSinusoidalEncoding(d_model=d_model, max_len=max_len, n_qubits=n_qubits)

# Get positional encodings
pe = pos_encoder.pe.numpy()

print(f"Positional Encoding shape: {pe.shape}")
print(f"Sample encoding for position 0: {pe[0, :4]}")

In [None]:
# Visualize positional encodings
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Heatmap of positional encodings
sns.heatmap(pe[:16, :], cmap='RdBu_r', center=0, ax=axes[0])
axes[0].set_title('Positional Encoding Heatmap')
axes[0].set_xlabel('Dimension')
axes[0].set_ylabel('Position')

# Line plot of first 4 dimensions
positions = np.arange(max_len)
for dim in range(4):
    axes[1].plot(positions, pe[:, dim], label=f'Dim {dim}', linewidth=2)
axes[1].set_title('Positional Encoding Curves')
axes[1].set_xlabel('Position')
axes[1].set_ylabel('Encoding Value')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Quantum Amplitude Encoding
amplitude_encoder = QuantumAmplitudeEncoding(n_qubits=4)

# Create sample input
sample_input = torch.randn(4, 16)  # 4 samples, 16 features

# Before encoding - visualize input amplitudes
print("Before Encoding:")
print(f"  Input shape: {sample_input.shape}")
print(f"  Input range: [{sample_input.min():.3f}, {sample_input.max():.3f}]")

# After encoding
encoded = amplitude_encoder(sample_input)
print(f"\nAfter Encoding:")
print(f"  Encoded shape: {encoded.shape}")
print(f"  Encoded norm: {torch.norm(encoded[0]):.4f}")

In [None]:
# Visualize quantum amplitudes before and after encoding
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Before encoding (normalized input)
normalized_input = sample_input / (sample_input.norm(dim=-1, keepdim=True) + 1e-8)
axes[0].bar(range(16), normalized_input[0].numpy(), color='steelblue', alpha=0.7)
axes[0].set_title('Normalized Input Amplitudes')
axes[0].set_xlabel('Feature Index')
axes[0].set_ylabel('Amplitude')

# After encoding (quantum state amplitudes)
# np.abs returns the magnitude for both real and complex amplitudes
amplitudes = encoded[0].detach().numpy()
axes[1].bar(range(len(amplitudes)), np.abs(amplitudes), color='darkorange', alpha=0.7)
axes[1].set_title('Quantum State Amplitudes')
axes[1].set_xlabel('Basis State Index')
axes[1].set_ylabel('|Amplitude|')

plt.tight_layout()
plt.show()

In [None]:
# Variational Feed-Forward Layer
ffn = QuantumFeedForward(
    d_model=16,
    n_qubits=4,
    n_layers=3,
    dropout=0.0,
    entanglement="circular"
)

# Create batch input
batch_input = torch.randn(2, 4, 16)  # batch=2, seq=4, d_model=16

print("Variational Feed-Forward Layer:")
print(f"  Input shape: {batch_input.shape}")

# Forward pass
with torch.no_grad():
    ffn_output = ffn(batch_input)

print(f"  Output shape: {ffn_output.shape}")
print("\n  Weight parameters:")
print(f"    Shape: {ffn.weights.shape}")
print(f"    Range: [{ffn.weights.min():.3f}, {ffn.weights.max():.3f}]")

In [None]:
# Visualize FFN transformation
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Input distribution
axes[0].hist(batch_input.flatten().numpy(), bins=30, color='steelblue', alpha=0.7)
axes[0].set_title('Input Distribution')
axes[0].set_xlabel('Value')

# Output distribution
axes[1].hist(ffn_output.flatten().numpy(), bins=30, color='darkorange', alpha=0.7)
axes[1].set_title('Output Distribution (After Variational Circuit)')
axes[1].set_xlabel('Value')

# Transformation visualization
in_flat = batch_input[0, 0, :8].numpy()
out_flat = ffn_output[0, 0, :8].numpy()
x = np.arange(8)
width = 0.35

axes[2].bar(x - width/2, in_flat, width, label='Input', color='steelblue')
axes[2].bar(x + width/2, out_flat, width, label='Output', color='darkorange')
axes[2].set_title('Input vs Output (First 8 dims)')
axes[2].set_xlabel('Dimension')
axes[2].legend()

plt.tight_layout()
plt.show()

In [None]:
# Visualize variational circuit weights
weights = ffn.weights.detach().numpy()

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

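# qml.Rot(phi, theta, omega) applies RZ(phi), then RY(theta), then RZ(omega).
# These labels assume the layer uses Rot gates, as described in the theory section.
gate_names = ['phi (RZ)', 'theta (RY)', 'omega (RZ)']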
for layer_idx in range(3):
    im = axes[layer_idx].imshow(weights[layer_idx], cmap='coolwarm', aspect='auto')
    axes[layer_idx].set_title(f'Layer {layer_idx + 1} Rotation Parameters')
    axes[layer_idx].set_xlabel('Rotation Gate')
    axes[layer_idx].set_ylabel('Qubit')
    axes[layer_idx].set_xticks([0, 1, 2])
    axes[layer_idx].set_xticklabels(gate_names)
    plt.colorbar(im, ax=axes[layer_idx])

plt.tight_layout()
plt.show()

---
## Summary

This notebook demonstrated the key features of the Quantum Transformer library:

1. **Basic Model Creation**: Built a QuantumTransformer with 4 qubits, 2 heads, and 4 layers
2. **SWAP Test Attention**: Visualized quantum attention patterns using the SWAP test algorithm
3. **Molecular Prediction**: Used SMILES tokenization and energy prediction for molecules
4. **Benchmarking**: Compared untrained Quantum and Classical Transformers against placeholder reference energies using MAE and RMSE
5. **Advanced Features**: Explored positional encoding, amplitude encoding, and variational layers

### Key Takeaways

- Quantum Transformers use quantum circuits for attention computation
- SWAP test measures state overlap for similarity scoring
- Variational circuits provide trainable non-linear transformations
- NISQ-friendly designs use 4-8 qubits for practical simulation

### Next Steps

- Train models on the real QM9 dataset for molecular property prediction
- Experiment with different entanglement patterns
- Deploy on real quantum hardware via PennyLane plugins