# Quantum Transformer Architecture for Molecular Intelligence

## Abstract
This research notebook presents a conceptual implementation of a **Quantum Transformer** architecture designed for sequence modeling, with a focus on molecular representation. We explore the translation of classical transformer components - specifically Self-Attention, Positional Encoding, and Feed-Forward Networks - into their quantum mechanical counterparts. Using **PennyLane** as the quantum simulation framework, we demonstrate how quantum interference via the SWAP test can serve as an attention mechanism, and how variational quantum circuits can function as trainable transformation layers. We explicitly acknowledge the simulation nature of this implementation and discuss the theoretical advantages alongside the significant challenges posed by Noisy Intermediate-Scale Quantum (NISQ) devices, such as barren plateaus and scalability constraints.

## 1. Background Theory

### 1.1 Classical Transformers
The Transformer architecture revolutionized natural language processing by relying entirely on the **attention mechanism** to draw global dependencies between input and output. The core operation is the Scaled Dot-Product Attention:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

where $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings.
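
As a point of reference for the quantum version developed later, the formula above can be sketched in plain NumPy. The function name `scaled_dot_product_attention` and the toy shapes are illustrative choices, not part of the notebook's pipeline.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Classical attention: softmax(Q K^T / sqrt(d_k)) V, applied row-wise."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # [seq_q, seq_k] similarities
    scores = scores - scores.max(axis=-1, keepdims=True)     # stabilise the exponentials
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 2))   # 3 tokens, d_k = 2 (toy sizes)
K = rng.normal(size=(3, 2))
V = rng.normal(size=(3, 4))
context, attn_weights = scaled_dot_product_attention(Q, K, V)
print(context.shape, attn_weights.sum(axis=-1))
```

Each output row is a convex combination of the value vectors, which is the property the hybrid quantum block later preserves by applying a classical softmax to the quantum scores.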

### 1.2 Motivation for Quantum Transformers
Quantum computing offers a unique perspective on attention:
- **Exponential State Space**: The state of an $n$-qubit system lives in a $2^n$-dimensional complex vector space, potentially allowing for richer embeddings of high-dimensional molecular data.
- **Interference**: Interference-based routines such as the SWAP test estimate the squared overlap $|\langle \psi | \phi \rangle|^2$ between two states, a similarity measure that plays the role of the dot product in attention mechanisms.
- **Entanglement**: Quantum states can represent correlations between particles (qubits) that classical statistical models struggle to capture efficiently, making them promising for modeling electron correlations in molecules.

### 1.3 Limitations for Molecules
Classical graph neural networks and transformers treat atoms as discrete nodes with classical feature vectors. They often require heavy approximations to model quantum mechanical properties like orbital interactions or superposition, which are native to a quantum computing framework.

## 2. Quantum Computing Primer

We use the **PennyLane** library for hybrid quantum-classical computing.

- **Qubit**: The fundamental unit of quantum information, represented as $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$.
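- **Superposition**: A qubit can occupy a linear combination of $|0\rangle$ and $|1\rangle$ with complex amplitudes satisfying $|\alpha|^2 + |\beta|^2 = 1$; a measurement returns 0 or 1 with probabilities $|\alpha|^2$ and $|\beta|^2$.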
- **Parameterized Quantum Circuits (PQC)**: Quantum circuits with tunable parameters $\theta$. These are the "neural networks" of the quantum world.
- **Measurement**: Extracting information from the quantum state, typically calculating the expectation value $\langle Z \rangle$.
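
To make the measurement bullet concrete, the following sketch evaluates $\langle Z \rangle$ by hand for the state $R_y(\theta)|0\rangle = \cos(\theta/2)|0\rangle + \sin(\theta/2)|1\rangle$, using only the standard library. The helper name `expval_z_after_ry` is hypothetical; this is a pencil-and-paper check, not a PennyLane call.

```python
import math

def expval_z_after_ry(theta):
    """<Z> for RY(theta)|0>, computed from the amplitudes alpha and beta."""
    alpha = math.cos(theta / 2)   # amplitude of |0>
    beta = math.sin(theta / 2)    # amplitude of |1>
    # Z has eigenvalue +1 on |0> and -1 on |1>, so <Z> = |alpha|^2 - |beta|^2
    return alpha**2 - beta**2

for theta in (0.0, math.pi / 2, math.pi):
    print(f"theta={theta:.3f}  <Z>={expval_z_after_ry(theta):+.3f}")
```

The result equals $\cos\theta$, so a single rotation angle is read out as a bounded value in $[-1, 1]$.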

In [None]:
import pennylane as qml
from pennylane import numpy as np
import torch
import matplotlib.pyplot as plt

# Setting random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("PennyLane version:", qml.__version__)
print("PyTorch version:", torch.__version__)

## 3. Quantum Data Encoding

To process classical data (like molecular features) on a quantum computer, we must map it to a quantum state. We use **Angle Encoding** for its simplicity and efficiency in parameterized circuits.

Given a feature vector $x = [x_1, \dots, x_N]$, we map each $x_i$ to the rotation angle of a qubit:

$$ |x\rangle = \bigotimes_{i=1}^N R_y(x_i)|0\rangle $$

Where $R_y(\theta)$ is a rotation around the Y-axis.

In [None]:
def angle_encoding(features, wires):
    """
    Encodes a classical feature vector into quantum state using Ry rotations.
    
    Args:
        features (array): Input vector of size N.
        wires (list): List of qubits to encode onto.
    """
    # Angles are used as-is; rescale features (e.g. to [0, pi]) beforehand if needed.
    # Features beyond the number of available wires are silently ignored.
    for wire, feature in zip(wires, features):
        qml.RY(feature, wires=wire)
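
The state prepared by `angle_encoding` is a product state, so it can be written out explicitly. The sketch below rebuilds it with NumPy Kronecker products as an independent check. `angle_encoded_state` is a hypothetical helper, and the ordering (first feature on the most significant qubit) is an assumption chosen to match PennyLane's default wire ordering.

```python
import numpy as np

def angle_encoded_state(features):
    """Statevector of the tensor product of RY(x_i)|0> over all features."""
    state = np.array([1.0])
    for x in features:
        qubit = np.array([np.cos(x / 2), np.sin(x / 2)])  # RY(x)|0>
        state = np.kron(state, qubit)                      # append one qubit
    return state

psi = angle_encoded_state([0.1, 0.2])
print(psi)               # 4 real amplitudes for 2 qubits
print(np.sum(psi**2))    # normalisation check
```

With this encoding, $N$ features occupy $N$ qubits but fix only $N$ real parameters of the $2^N$-dimensional state, and the result is unentangled.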

## 4. Quantum Positional Encoding

In transformers, the order of the sequence matters. We propose a **Rotational Quantum Positional Encoding**.

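After the data rotation $R_y(\theta_{data})$, each qubit receives an additional position-dependent rotation by an angle $\lambda_{pos}$:
$$ |\psi_{pos}\rangle = R_z(\lambda_{pos})\, R_y(\theta_{data})|0\rangle $$

This is implemented as an additional rotation layer. Because the rotation is about the Z-axis while the data rotation is about the Y-axis, the position enters as a relative phase rather than as a literal sum of the two angles. The layer is meant to follow `angle_encoding`; the block in Section 7 does not call it yet.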

In [None]:
def positional_encoding(position_index, wires):
    """
    Adds a position-dependent rotation to the qubits.
    The angle depends on the position index in the sequence.
    """
    for i, wire in enumerate(wires):
        # Frequency schedule of classical sinusoidal encodings, used directly as an angle
        angle = position_index / (10000 ** (2 * i / len(wires)))
        qml.RZ(angle, wires=wire)
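
The angle schedule used in `positional_encoding` can be inspected without a quantum device. The following sketch reproduces only the classical angle computation; `positional_angles` is a hypothetical name.

```python
def positional_angles(position_index, n_wires):
    """RZ angles applied to each wire for a given sequence position."""
    return [position_index / (10000 ** (2 * i / n_wires)) for i in range(n_wires)]

for pos in range(3):
    print(pos, positional_angles(pos, n_wires=2))
```

Position 0 receives no rotation, and wires with a larger index rotate more slowly. An RZ gate acting directly on $|0\rangle$ contributes only a global phase, so the encoding has an observable effect only when it is applied after the data rotation.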

## 5. Quantum Self-Attention (SWAP Test)

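Self-attention is the heart of the Quantum Transformer. We replace the dot product $Q \cdot K^T$ with a quantum fidelity measure computed via the **SWAP Test**. Unlike the dot product, this fidelity is symmetric in its two arguments and never negative.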

The SWAP test estimates the overlap $|\langle \psi | \phi \rangle|^2$ between two states $|\psi\rangle$ (Query) and $|\phi\rangle$ (Key).

**Protocol**:
1. Prepare an ancilla qubit in $|0\rangle$.
2. Apply Hadamard to ancilla.
3. Apply Controlled-SWAP (CSWAP) between the Query register and Key register, controlled by the ancilla.
4. Apply Hadamard to ancilla.
5. Measure ancilla in Z basis. The probability of measuring $|0\rangle$ is $P(0) = \frac{1}{2} (1 + |\langle \psi | \phi \rangle|^2)$.
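
The probability in step 5 follows from a short calculation, sketched here for completeness. Tracking the ancilla together with the two registers through steps 2 to 4 gives:

$$
\begin{aligned}
|0\rangle|\psi\rangle|\phi\rangle
&\xrightarrow{H} \tfrac{1}{\sqrt{2}}\big(|0\rangle + |1\rangle\big)|\psi\rangle|\phi\rangle \\
&\xrightarrow{\text{CSWAP}} \tfrac{1}{\sqrt{2}}\big(|0\rangle|\psi\rangle|\phi\rangle + |1\rangle|\phi\rangle|\psi\rangle\big) \\
&\xrightarrow{H} \tfrac{1}{2}|0\rangle\big(|\psi\rangle|\phi\rangle + |\phi\rangle|\psi\rangle\big) + \tfrac{1}{2}|1\rangle\big(|\psi\rangle|\phi\rangle - |\phi\rangle|\psi\rangle\big)
\end{aligned}
$$

The probability of finding the ancilla in $|0\rangle$ is the squared norm of the first term. The cross terms contribute $\langle\psi|\phi\rangle\langle\phi|\psi\rangle = |\langle \psi | \phi \rangle|^2$ each:

$$
P(0) = \tfrac{1}{4}\big\| |\psi\rangle|\phi\rangle + |\phi\rangle|\psi\rangle \big\|^2 = \tfrac{1}{4}\big(2 + 2|\langle \psi | \phi \rangle|^2\big) = \tfrac{1}{2}\big(1 + |\langle \psi | \phi \rangle|^2\big)
$$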

In [None]:
dev_swap = qml.device("default.qubit", wires=5) # 2 for Q, 2 for K, 1 ancilla

@qml.qnode(dev_swap)
def quantum_attention_score(q_params, k_params):
    """
    Computes attention score between Query and Key states.
    
    Args:
        q_params (array): Parameters to prepare Query state.
        k_params (array): Parameters to prepare Key state.
    """
    # Register mapping
    ancilla = 0
    q_wires = [1, 2]
    k_wires = [3, 4]
    
    # 1. State Preparation (Embedding Q and K)
    angle_encoding(q_params, q_wires)
    angle_encoding(k_params, k_wires)
    
    # 2. SWAP Test
    qml.Hadamard(wires=ancilla)
    for q, k in zip(q_wires, k_wires):
        qml.CSWAP(wires=[ancilla, q, k])
    qml.Hadamard(wires=ancilla)
    
    # Return probability of measuring |0>
    return qml.probs(wires=ancilla)

# Visualization
q_dummy = [0.1, 0.2]
k_dummy = [0.1, 0.2] # Identical states should give overlap = 1.0 -> P(0) = 1.0
probs = quantum_attention_score(q_dummy, k_dummy)
overlap = 2 * probs[0] - 1

print(f"P(0): {probs[0]:.4f}")
print(f"Calculated Overlap |<Q|K>|^2: {overlap:.4f}")

qml.draw_mpl(quantum_attention_score)(q_dummy, k_dummy)
plt.show()
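
Because angle encoding produces product states of real single-qubit rotations, the overlap measured by the circuit above has a closed form, $|\langle \psi | \phi \rangle|^2 = \prod_i \cos^2\big((q_i - k_i)/2\big)$, which gives a classical reference value for debugging. The helper `angle_encoding_fidelity` below is a hypothetical name, and the formula holds only for the RY-only encoding used here (no positional RZ layer).

```python
import math

def angle_encoding_fidelity(q_params, k_params):
    """|<psi|phi>|^2 for two RY angle-encoded product states."""
    fidelity = 1.0
    for q, k in zip(q_params, k_params):
        # <0|RY(q)^dagger RY(k)|0> = cos((q - k) / 2) for a single qubit
        fidelity *= math.cos((q - k) / 2) ** 2
    return fidelity

fid = angle_encoding_fidelity([0.1, 0.2], [0.1, 0.2])
p0 = 0.5 * (1 + fid)   # probability the SWAP test should report for the ancilla
print(f"fidelity={fid:.4f}, expected P(0)={p0:.4f}")
```

Comparing this value with `2 * probs[0] - 1` from the simulated circuit is a quick sanity check that the register mapping and CSWAP wiring are correct.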

## 6. Quantum Feed-Forward Network (QFFN)

Instead of a classical MLP, we use a Variational Quantum Circuit (VQC). This acts as a parameterized transformation $U(\theta)$ on the attention output.

**Structure**:
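1. **Encoding**: The input features are loaded with the angle encoding from Section 3.
2. **Strongly Entangling Layers**: Each layer applies a general single-qubit rotation with three trainable angles to every qubit, followed by CNOT gates that mix information across qubits.
3. **Measurement**: Expectation values of Pauli-Z operators on each qubit.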

In [None]:
# QFFN Device
n_wires_ffn = 2
dev_ffn = qml.device("default.qubit", wires=n_wires_ffn)

@qml.qnode(dev_ffn)
def quantum_feed_forward(inputs, weights):
    """
    Variational Quantum Circuit acting as FFN.
    
    Args:
        inputs (array): Input state parameters.
        weights (array): Trainable weights for the circuit layers.
    """
    # Encode the classical input as rotation angles
    angle_encoding(inputs, wires=range(n_wires_ffn))
    
    # Trainable highly entangled layers
    qml.StronglyEntanglingLayers(weights, wires=range(n_wires_ffn))
    
    # Measurement: the expectation values are trigonometric (hence non-linear)
    # functions of the encoded inputs, although the circuit itself is unitary
    return [qml.expval(qml.PauliZ(i)) for i in range(n_wires_ffn)]
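
To see where the non-linearity of such a circuit comes from, consider a deliberately reduced single-qubit case: an input rotation $R_y(x)$ followed by one trainable rotation $R_y(w)$ and a Pauli-Z measurement. Rotations about the same axis compose additively, so $\langle Z \rangle = \cos(x + w)$. This is a simplification of `quantum_feed_forward`, not its exact output, and `toy_vqc_output` is a hypothetical helper.

```python
import math

def toy_vqc_output(x, w):
    """<Z> after RY(w) RY(x)|0>; same-axis rotations add, giving cos(x + w)."""
    theta = x + w
    alpha, beta = math.cos(theta / 2), math.sin(theta / 2)
    return alpha**2 - beta**2

w = 0.3  # stand-in for a trained weight
for x in (0.0, 0.5, 1.0):
    print(f"x={x:.1f} -> <Z>={toy_vqc_output(x, w):+.4f}")

# Non-linearity check: doubling the input does not double the output
print(toy_vqc_output(1.0, w), 2 * toy_vqc_output(0.5, w))
```

The output is a bounded trigonometric function of the input even though the circuit acts linearly on the quantum state; the encoding and the expectation value together supply the non-linearity.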

## 7. Quantum Transformer Block Integration

We now assemble the full block. For simplicity in simulation:
1. **Input**: A sequence of token vectors.
2. **Self-Attention**: Compute pairwise overlaps using the SWAP test function.
3. **Aggregation**: Classically weight the Value vectors (here a classical linear projection of the input) by the attention scores. (Note: Full coherent quantum aggregation is extremely difficult; we use a hybrid approach).
4. **Residual**: Add input to attention output.
5. **QFFN**: Pass the first `n_qubits` features of the result through the VQC.

In [None]:
class QuantumTransformerBlock(torch.nn.Module):
    def __init__(self, d_model, n_qubits, n_layers_ffn):
        super().__init__()
        self.d_model = d_model
        self.n_qubits = n_qubits
        
        # FF Weights (Random init)
        shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers_ffn, n_wires=n_qubits)
        self.ffn_weights = torch.nn.Parameter(torch.rand(shape))
        
        # Projection matrices for Q, K, V (Classical for hybrid simulation efficiency)
        self.W_q = torch.nn.Linear(d_model, n_qubits)
        self.W_k = torch.nn.Linear(d_model, n_qubits)
        self.W_v = torch.nn.Linear(d_model, d_model)
        
    def forward(self, x):
        """
        Args:
            x: tensor of shape [batch, seq_len, d_model]
        """
        _, seq, _ = x.shape
        
        # Only the first sequence in the batch is processed (for demonstration)
        seq_x = x[0]
        
        # 1. Linear Projections
        Q = self.W_q(seq_x) # [seq, n_qubits]
        K = self.W_k(seq_x) # [seq, n_qubits]
        V = self.W_v(seq_x) # [seq, d_model]
        
        # 2. Quantum Attention Mechanism
        attention_output = torch.zeros_like(seq_x)
        
        for i in range(seq): # For each query token
            scores = []
            for j in range(seq): # For each key token
                # Compute Quantum Fidelity via SWAP Test
                # Note: detach() cuts the autograd graph, so W_q and W_k receive no
                # gradients here. In real training, we'd batch this or use a
                # differentiable QNode (e.g. PennyLane's torch interface).
                q_vec = Q[i].detach().numpy()
                k_vec = K[j].detach().numpy()
                
                # Run quantum circuit
                prob_0 = quantum_attention_score(q_vec, k_vec)[0]
                overlap = 2 * prob_0 - 1 # |<Q|K>|^2, lies in [0, 1]
                scores.append(float(overlap))
            
            # Softmax on quantum scores
            scores = torch.tensor(scores, dtype=V.dtype)
            weights = torch.softmax(scores, dim=0)
            
            # Weighted sum of Values
            context = torch.zeros(self.d_model)
            for j in range(seq):
                context += weights[j] * V[j]
            
            attention_output[i] = context
            
        # Residual Connection
        x_res = seq_x + attention_output
        
        # 3. Quantum Feed-Forward
        final_output = torch.zeros_like(x_res)
        for i in range(seq):
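            # Map vector to QFFN input gates
            # We take first n_qubits features for encoding
            ff_in = x_res[i, :self.n_qubits].detach().numpy()
            
            # Run VQC (detach() means ffn_weights receive no gradients in this demo)
            q_out = quantum_feed_forward(ff_in, self.ffn_weights.detach().numpy())
            
            # Write the n_qubits expectation values into the first slots;
            # the remaining d_model - n_qubits features stay zero
            final_output[i, :self.n_qubits] = torch.tensor(
                [float(v) for v in q_out], dtype=final_output.dtype
            )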
            
        return final_output.unsqueeze(0)
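
The inner loop over `j` in `forward` is a matrix product in disguise. Assuming the pairwise overlaps have already been collected into a matrix, the softmax and aggregation steps can be vectorised as sketched below. As a stand-in for circuit runs, the matrix is filled with $\prod_i \cos^2((q_i - k_i)/2)$, the analytic overlap of RY-encoded product states. All names are illustrative, and NumPy replaces PyTorch for brevity.

```python
import numpy as np

def aggregate_values(overlaps, V):
    """Row-wise softmax over overlap scores, then weighted sum of value vectors."""
    shifted = overlaps - overlaps.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V, weights          # [seq, d_model], [seq, seq]

rng = np.random.default_rng(42)
seq, n_qubits, d_model = 3, 2, 4
Q = rng.uniform(0, np.pi, size=(seq, n_qubits))   # stand-ins for W_q(x)
K = rng.uniform(0, np.pi, size=(seq, n_qubits))   # stand-ins for W_k(x)
V = rng.normal(size=(seq, d_model))               # stand-ins for W_v(x)

# Analytic overlap of RY-encoded product states: prod_i cos^2((q_i - k_i) / 2)
diff = Q[:, None, :] - K[None, :, :]              # [seq, seq, n_qubits]
overlaps = np.prod(np.cos(diff / 2) ** 2, axis=-1)

context, weights = aggregate_values(overlaps, V)
print(context.shape, weights.sum(axis=1))
```

Because the overlaps lie in $[0, 1]$, the softmax weights within a row differ by at most a factor of $e$, so a scaling factor on the scores may be needed if sharper attention is desired.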

## 8. Mini Experiment

We simulate a sequence of 3 tokens, each with 4 features (`d_model = 4`) that are projected onto 2 qubits. This could represent a simplified molecule like $H_2O$, where each token holds the features of one atom.

In [None]:
# Initialize Model
d_model = 4
n_qubits = 2 # Matches QFFN and Attention wire count
n_layers_vqc = 2

q_transformer = QuantumTransformerBlock(d_model, n_qubits, n_layers_vqc)

# Mock Input: Batch=1, Seq=3, Dim=4
input_data = torch.rand(1, 3, d_model)

print("Input Sequence shape:", input_data.shape)

# Run Forward Pass
print("\nRunning Quantum Transformer Block...")
output = q_transformer(input_data)

print("\nOutput shape:", output.shape)
print("Output tensor:\n", output)

# Interpretation
print("\nInterpretation:")
print("The output tensor represents the transformed feature representation of the sequence,")
print("where relationships between tokens have been processed via interaction (SWAP test)")
print("and non-linear transformation (VQC).")

## 9. Limitations and Research Challenges

While the conceptual framework is promising, several hurdles exist for practical deployment:

1.  **Scalability**: Self-attention over a sequence of length $N$ needs $O(N^2)$ SWAP tests, and on hardware each test must be repeated over many shots, because the statistical error of the estimated overlap shrinks only as $1/\sqrt{S}$ for $S$ shots.
2.  **Barren Plateaus**: Sufficiently deep or expressive variational quantum circuits (like the QFFN) can suffer from gradients that vanish exponentially with the number of qubits, making training difficult.
3.  **Data Loading**: Angle encoding, as used here, is shallow but spends one qubit per feature. General amplitude encoding packs $2^n$ values into $n$ qubits but is circuit-deep; efficient state preparation is broadly considered possible only for specific data structures.
4.  **Hardware Noise**: Without error correction, gate errors accumulate with circuit depth, so the fidelity of SWAP tests drops rapidly and the attention scores become less accurate.
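
The cost in item 1 can be made more concrete with a rough estimate. On hardware each overlap is estimated from a finite number of shots, and the binomial standard error of $P(0)$ propagates to the overlap $2P(0) - 1$ with a factor of 2. The sketch below assumes independent shots and no hardware noise; the function names and the target precision are illustrative.

```python
import math

def shots_for_overlap_precision(epsilon, p0=0.5):
    """Shots needed so the overlap estimate 2*P(0) - 1 has standard error <= epsilon."""
    # std(P0_hat) = sqrt(p0 * (1 - p0) / shots); std(overlap) = 2 * std(P0_hat)
    # p0 = 0.5 is the worst case, since P(0) always lies in [0.5, 1]
    return math.ceil(4 * p0 * (1 - p0) / epsilon**2)

def total_circuit_executions(seq_len, epsilon):
    """All-pairs SWAP tests for one attention layer applied to one sequence."""
    return seq_len**2 * shots_for_overlap_precision(epsilon)

for n in (3, 32, 512):
    print(n, total_circuit_executions(n, epsilon=0.01))
```

Halving the target error quadruples the shot count, and this cost is paid again for every layer and every training step.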

## 10. Conclusion

This notebook demonstrated a prototype of a Quantum Transformer, run entirely in simulation. We replaced the core mathematical operations of a transformer with quantum circuit analogues: a SWAP-test fidelity in place of dot product attention, and a variational circuit in place of the feed-forward non-linearity. Future work should focus on **coherent quantum attention** mechanisms that do not require intermediate measurement, so that the computation stays quantum throughout the entire pipeline; whether this yields a practical advantage remains an open question.