# Transformers in Biomedicine: From Clinical Notes to Genomics

## Section 1: Overview & Prerequisites

### Summary of the Lecture Topic

This notebook explores the transformative impact of Transformers and Large Language Models (LLMs) on the biomedical field, as detailed in the lecture by Dr. Vivek Natarajan. We will journey from the first principles of why sequence-based models like Transformers are a natural fit for biomedical data (clinical notes, electronic health records, proteins, genomes) to specific, state-of-the-art applications. 

The lecture highlights that while general-purpose LLMs encode a surprising amount of clinical knowledge, they require careful alignment to meet the safety-critical standards of medicine. We will examine **Med-PaLM**, a model aligned using instruction prompt tuning, which substantially narrows the gap with expert clinicians in medical question answering.

Furthermore, we will delve into the biological stack, exploring architectural innovations required for modeling long biological sequences. This includes the **Performer** architecture for efficient protein modeling, **ProtNLM** for protein function annotation, **DeepConsensus** for improving genomic sequencing accuracy, and **Enformer** for predicting gene expression by modeling long-range DNA interactions. The overarching theme is the convergence of Transformers as a foundational architecture for building powerful, multimodal AI models in biomedicine.

### Prerequisite Knowledge

**Mathematical Concepts:**
- **Linear Algebra:** Vectors, matrices, dot products, matrix multiplication.
- **Calculus:** Gradients and their role in optimization (e.g., gradient descent).
- **Probability & Statistics:** Softmax function, probability distributions, conditional probability.
- **Information Theory:** Cross-entropy loss.
- **Matrix Decompositions:** Understanding of low-rank approximation (conceptual level for the Performer model).

**Machine Learning / Computer Science Concepts:**
- **Fundamentals of Neural Networks:** Neurons, layers, activation functions, backpropagation.
- **Sequence Models:** Basic understanding of Recurrent Neural Networks (RNNs) and their limitations (vanishing/exploding gradients).
- **The Transformer Architecture:** Self-attention, multi-head attention, positional encodings, encoder-decoder structure.
- **Large Language Models (LLMs):** Concepts of pre-training on large corpora and fine-tuning for specific tasks.
- **Prompt Engineering:** Few-shot prompting, chain-of-thought prompting.
- **Convolutional Neural Networks (CNNs):** Basic understanding of convolutions and receptive fields.

**Biology (Conceptual):**
- **Proteins:** Sequences of amino acids.
- **Genomics:** DNA, base pairs (A, T, C, G), genes, coding vs. non-coding variants, transcription.

### Hierarchy of Topics & Learning Objectives

**Hierarchy:**
1.  **Foundations:** Why Transformers are suited for sequential biomedical data.
2.  **Core Mechanism:** Deep dive into the self-attention mechanism, the engine of Transformers.
3.  **Clinical Applications:** Applying LLMs to medical question answering (Med-PaLM) and the importance of domain alignment.
4.  **Biological Applications & Architectural Innovations:**
    - Modeling long protein sequences efficiently (Performer).
    - Annotating protein functions (ProtNLM).
    - Correcting DNA sequencing errors (DeepConsensus).
    - Predicting gene expression from DNA (Enformer).
5.  **Experimental Analysis & Future Directions:** Reproducing key experimental insights and discussing the future of biomedical AI.

**Learning Objectives:**
- Understand the fundamental reasons for applying Transformers to biomedical data.
- Implement the core self-attention mechanism from scratch.
- Grasp the concept and importance of aligning LLMs for safety-critical domains like medicine.
- Understand the architectural challenges posed by long biological sequences and the solutions proposed (e.g., Performer, Enformer).
- Appreciate the breadth of applications for Transformers in biomedicine, from clinical practice to fundamental biology.

**Estimated Time:** 2 - 3 hours.

## Section 2: Mathematical Foundations

The core innovation of the Transformer is the **self-attention mechanism**. It allows the model to weigh the importance of different words (or tokens, amino acids, nucleotides) in an input sequence when processing a specific word. This directly addresses the challenge of modeling long-range dependencies, a key feature in all the biomedical applications discussed.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets
from IPython.display import display
import time
import math

def softmax(x):
    """Compute softmax values for each set of scores in x (along the last axis)."""
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True)) # Subtract max for numerical stability
    return e_x / e_x.sum(axis=-1, keepdims=True)

def educational_self_attention(X, mask=None):
    """
    Clear implementation of self-attention for understanding.
    - Based directly on the original Transformer paper.
    - Extensive comments explaining each step.
    - No black-box libraries for the core concept.
    
    Args:
        X (np.array): Input matrix of shape (sequence_length, embedding_dim).
        mask (np.array): Optional mask to hide certain positions.
    
    Returns:
        Tuple[np.array, np.array]: The context vectors (output) and the attention weights.
    """
    seq_len, d_model = X.shape
    d_k = d_v = d_model # For simplicity, let key/value/query dims be the same

    # Step 1: Initialize weight matrices for Query, Key, Value.
    # In a real model, these are learned during training.
    # Here, we use random matrices for demonstration.
    np.random.seed(42)
    W_q = np.random.randn(d_model, d_k)
    W_k = np.random.randn(d_model, d_k)
    W_v = np.random.randn(d_model, d_v)

    # Step 2: Project the input X into Query, Key, and Value spaces.
    # Q, K, V are the core components of attention.
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v

    # Step 3: Calculate attention scores.
    # This is the dot product of a query with all keys. It measures similarity.
    # The result is a (seq_len, seq_len) matrix where scores[i, j] is the attention
    # of token i to token j.
    scores = (Q @ K.T) / np.sqrt(d_k)

    # Step 4: Apply mask (if provided).
    # This is used in decoders to prevent a position from attending to future positions.
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores) # Set masked values to a large negative number

    # Step 5: Convert scores to probabilities (attention weights) using softmax.
    # The weights for each token will sum to 1.
    attention_weights = softmax(scores)

    # Step 6: Compute the context vector (the output of the attention layer).
    # This is a weighted sum of the Value vectors, where the weights are the
    # attention probabilities. Tokens with higher attention contribute more.
    context_vectors = attention_weights @ V

    return context_vectors, attention_weights

# --- Demonstration ---
# Let's simulate a sequence of 5 tokens (e.g., amino acids), each with an embedding of size 8.
sequence_length = 5
embedding_dim = 8
X_input = np.random.randn(sequence_length, embedding_dim)

context, weights = educational_self_attention(X_input)

print("Input Shape:", X_input.shape)
print("Output (Context Vectors) Shape:", context.shape)
print("Attention Weights Shape:", weights.shape)
print("\nAttention Weights (rounded):")
print(np.round(weights, 2))

# Verify that weights for each token sum to 1
print("\nSum of weights for each token:", np.sum(weights, axis=1))
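
The `mask` argument is not exercised in the demonstration above. As a minimal, self-contained sketch of what Step 4 does, the snippet below builds a causal (lower-triangular) mask with `np.tril` and applies it to random scores. The names `row_softmax`, `demo_scores`, and `causal_weights` are illustrative, chosen so that they do not overwrite variables used elsewhere in the notebook.

```python
import numpy as np

def row_softmax(x):
    """Numerically stable softmax over the last axis."""
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

n_tokens = 5
rng = np.random.default_rng(0)
# Stand-in for the scaled scores Q @ K.T / sqrt(d_k).
demo_scores = rng.normal(size=(n_tokens, n_tokens))

# 1 = may attend, 0 = hidden. Row i keeps columns 0..i only.
causal_mask = np.tril(np.ones((n_tokens, n_tokens), dtype=int))

# Same rule as Step 4: masked scores become a large negative number,
# so their softmax weight underflows to zero.
masked_scores = np.where(causal_mask == 0, -1e9, demo_scores)
causal_weights = row_softmax(masked_scores)

print(np.round(causal_weights, 2))
```

Every row still sums to 1, but all weight above the diagonal is zero: token `i` can only attend to itself and earlier tokens.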

### Interactive Visualization of Self-Attention

The matrix of attention weights shows *which* tokens are paying attention to *which other* tokens. A high value at `weights[i, j]` means that when the model is processing token `i`, it places a high importance on token `j`.

Use the slider below to select a token from the input sequence and visualize its attention weights with respect to all other tokens in the sequence.

In [None]:
def interactive_attention_explorer():
    """
    Interactive widget to explore attention weights.
    """
    # Use the weights calculated from the previous cell
    labels = [f"Token {i+1}" for i in range(sequence_length)]

    def plot_attention(token_idx):
        plt.figure(figsize=(10, 6))
        
        # Plot the full attention matrix
        ax1 = plt.subplot(1, 2, 1)
        sns.heatmap(weights, annot=True, fmt=".2f", cmap='viridis', xticklabels=labels, yticklabels=labels, ax=ax1)
        ax1.set_title('Full Attention Matrix (W)')
        ax1.set_xlabel('Key Tokens')
        ax1.set_ylabel('Query Tokens')
        
        # Highlight the selected query token's row
        rect = plt.Rectangle((0, token_idx), sequence_length, 1, fill=False, edgecolor='red', lw=3)
        ax1.add_patch(rect)

        # Plot the attention weights for the selected token
        ax2 = plt.subplot(1, 2, 2)
        ax2.bar(labels, weights[token_idx])
        ax2.set_title(f'Attention from {labels[token_idx]} to others')
        ax2.set_ylabel('Attention Weight')
        ax2.set_ylim(0, 1.0)
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.show()

    token_slider = widgets.IntSlider(
        value=0, min=0, max=sequence_length - 1, step=1, description='Query Token:'
    )
    
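    # Display the widget explicitly: this function returns None, so the
    # notebook would otherwise render nothing.
    display(widgets.interactive(plot_attention, token_idx=token_slider))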

interactive_attention_explorer()

## Section 3: Prerequisite Algorithms

### Building a Transformer Block

Before tackling the advanced models from the lecture, it's crucial to understand a standard Transformer block. This block is the fundamental repeating unit in Transformer architectures. It typically consists of:

1.  **Multi-Head Self-Attention:** An extension of the self-attention we just implemented. It runs the attention mechanism multiple times in parallel with different, learned linear projections. This allows the model to jointly attend to information from different representation subspaces at different positions.
2.  **Add & Norm (Layer Normalization):** A residual connection (adding the input to the output of the attention layer) followed by layer normalization. This helps stabilize training.
3.  **Feed-Forward Network:** A simple position-wise fully connected neural network.
4.  **Another Add & Norm Layer.**

Let's build a conceptual implementation to see how these pieces fit together.

In [None]:
class EducationalTransformerBlock:
    """
    A simplified, educational implementation of a Transformer block.
    """
    def __init__(self, d_model, num_heads):
        """
        Args:
            d_model (int): The dimensionality of the input embeddings.
            num_heads (int): The number of attention heads.
        """
        self.d_model = d_model
        self.num_heads = num_heads
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_head = d_model // num_heads

        # --- Multi-Head Attention Parameters ---
        # In a real model, these are learned. We initialize them randomly.
        self.W_q = [np.random.randn(d_model, self.d_head) for _ in range(num_heads)]
        self.W_k = [np.random.randn(d_model, self.d_head) for _ in range(num_heads)]
        self.W_v = [np.random.randn(d_model, self.d_head) for _ in range(num_heads)]
        self.W_o = np.random.randn(d_model, d_model) # Output projection matrix

        # --- Feed-Forward Network Parameters ---
        self.W1_ff = np.random.randn(d_model, d_model * 4)
        self.b1_ff = np.random.randn(d_model * 4)
        self.W2_ff = np.random.randn(d_model * 4, d_model)
        self.b2_ff = np.random.randn(d_model)
    
    def _layer_norm(self, x, epsilon=1e-5):
        # Simple from-scratch layer normalization
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + epsilon)

    def _multi_head_attention(self, X):
        heads = []
        for i in range(self.num_heads):
            # Project to Q, K, V for this head
            Q_i = X @ self.W_q[i]
            K_i = X @ self.W_k[i]
            V_i = X @ self.W_v[i]
            
            # Calculate attention scores for this head
            scores_i = (Q_i @ K_i.T) / np.sqrt(self.d_head)
            weights_i = softmax(scores_i)
            
            # Calculate context for this head
            head_output = weights_i @ V_i
            heads.append(head_output)
        
        # Concatenate heads and apply final linear projection
        concatenated_heads = np.concatenate(heads, axis=-1)
        mha_output = concatenated_heads @ self.W_o
        return mha_output

    def _feed_forward(self, x):
        # Simple two-layer feed-forward network with ReLU activation
        hidden = np.maximum(0, x @ self.W1_ff + self.b1_ff) # ReLU
        output = hidden @ self.W2_ff + self.b2_ff
        return output

    def forward(self, X):
        """Forward pass through the Transformer block."""
        # 1. Multi-Head Attention sub-layer
        attention_output = self._multi_head_attention(X)
        
        # 2. Add & Norm for the attention sub-layer
        sublayer1_output = self._layer_norm(X + attention_output)
        
        # 3. Feed-Forward sub-layer
        ff_output = self._feed_forward(sublayer1_output)
        
        # 4. Add & Norm for the feed-forward sub-layer
        block_output = self._layer_norm(sublayer1_output + ff_output)
        
        return block_output

# --- Demonstration ---
d_model = 64
num_heads = 8
seq_len = 10
input_sequence = np.random.randn(seq_len, d_model)

transformer_block = EducationalTransformerBlock(d_model=d_model, num_heads=num_heads)
output_sequence = transformer_block.forward(input_sequence)

print("Input Sequence Shape:", input_sequence.shape)
print("Output Sequence Shape:", output_sequence.shape)

## Section 4: Core Research Content from the Lecture

Now we will dive into the specific models and concepts presented in the lecture, building upon our foundational understanding of Transformers.

### 4.1 Med-PaLM: Aligning LLMs for Clinical Applications

**Motivation:** General-purpose LLMs like FLAN-PaLM, while knowledgeable, are not ready for clinical use out-of-the-box. They can hallucinate, provide incomplete answers, and fail to meet the high safety and quality bar of the medical domain. 

**Contribution:** The Med-PaLM paper introduces a methodology for aligning an LLM to the medical domain using a highly data- and compute-efficient technique called **Instruction Prompt Tuning**.

**Instruction Prompt Tuning:**
- The massive base LLM (e.g., 540B parameter FLAN-PaLM) is **frozen**. Its weights are not updated.
- A small set of new, trainable vectors (a "soft prompt") is prepended to the input embeddings.
- During tuning, only these prompt vectors are updated using a small, high-quality dataset of instruction-response pairs curated by clinicians.
- This teaches the model *how* to access and formulate its existing medical knowledge in a way that aligns with clinical expectations (e.g., being comprehensive, cautious, and avoiding harm).

Below is a conceptual code block illustrating how one might set up such a tuning process. Because we don't have access to the PaLM model, the base model is mocked: the cell runs, but it only traces the data flow and performs no actual training.

In [None]:
def conceptual_instruction_prompt_tuning():
    """
    Conceptual demonstration of instruction prompt tuning.
    This is purely illustrative.
    """
    
    class MockLargeLanguageModel:
        def __init__(self):
            print("Initialized a massive, FROZEN language model (e.g., FLAN-PaLM 540B).")
            # In reality, this would contain billions of learned weights.
            self.is_frozen = True
        
        def forward(self, input_embedding):
            # This is a massive simplification of the forward pass.
            print(f"LLM processing input of shape: {input_embedding.shape}")
            # The model generates text based on the input embedding.
            return "...generated medical answer..."

    # 1. Load the massive, pre-trained base model and freeze it.
    base_llm = MockLargeLanguageModel()

    # 2. Define a small, trainable prompt.
    # Let's say our prompt consists of 20 tokens, and model embedding size is 1024.
    prompt_length = 20
    embedding_dim = 1024
    trainable_prompt_vectors = np.random.randn(prompt_length, embedding_dim) # These are the ONLY parameters we train!

    # 3. Prepare the actual input.
    # Example question from the MultiMedQA benchmark.
    input_question = "What are the side effects of metformin?"
    # This text would be converted to token embeddings.
    # Let's say it becomes 8 tokens.
    input_embeddings = np.random.randn(8, embedding_dim)

    # 4. Prepend the trainable prompt to the input embeddings.
    combined_input = np.concatenate([trainable_prompt_vectors, input_embeddings], axis=0)
    print(f"Original input shape: {input_embeddings.shape}")
    print(f"Trainable prompt shape: {trainable_prompt_vectors.shape}")
    print(f"Combined input shape for LLM: {combined_input.shape}")

    # 5. Get the model's output.
    model_output = base_llm.forward(combined_input)

    # 6. During training:
    #    - Compare `model_output` with a clinician-curated reference answer.
    #    - Calculate a loss.
    #    - Backpropagate the error to update ONLY `trainable_prompt_vectors`.
    #    - The `base_llm` weights remain unchanged.
    print("\n--- Training Phase (Conceptual) ---")
    print("Model output compared to clinician's ideal answer.")
    print("Loss is computed and gradients are backpropagated.")
    print("ONLY the trainable_prompt_vectors are updated.")
    print("The massive base_llm remains frozen. This is highly efficient!")

conceptual_instruction_prompt_tuning()
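
The cell above only prints the training steps. As a hedged, runnable toy version of step 6, the sketch below replaces the LLM with a single frozen matrix `W_frozen` and a mean-pooled output, then runs gradient descent on the soft prompt alone. Everything here (`toy_model`, the squared-error loss, the dimensions, the learning rate) is an assumption made for illustration and is not how Med-PaLM was trained; the point is only that the loss falls while the frozen weights never change.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_dim, prompt_len, input_len = 4, 3, 5
n_rows = prompt_len + input_len

W_frozen = rng.normal(size=(toy_dim, toy_dim))   # stand-in for the frozen base model
W_snapshot = W_frozen.copy()                     # kept to confirm it never changes
x_embed = rng.normal(size=(input_len, toy_dim))  # fixed input token embeddings
target = rng.normal(size=toy_dim)                # stand-in for a reference answer
soft_prompt = np.zeros((prompt_len, toy_dim))    # the ONLY trainable parameters

def toy_model(prompt):
    """Prepend the prompt, apply the frozen weights, mean-pool over tokens."""
    combined = np.concatenate([prompt, x_embed], axis=0)
    return (combined @ W_frozen).mean(axis=0)

def toy_loss(prompt):
    return 0.5 * np.sum((toy_model(prompt) - target) ** 2)

initial_loss = toy_loss(soft_prompt)
learning_rate = 0.5
for _ in range(200):
    error = toy_model(soft_prompt) - target
    # Analytic gradient of the loss w.r.t. each prompt row (identical for all rows
    # because of the mean pooling). No gradient is ever applied to W_frozen.
    grad_per_row = (W_frozen @ error) / n_rows
    soft_prompt -= learning_rate * grad_per_row
final_loss = toy_loss(soft_prompt)

print(f"loss before tuning: {initial_loss:.4f}")
print(f"loss after tuning:  {final_loss:.4f}")
print("frozen weights unchanged:", np.array_equal(W_frozen, W_snapshot))
```

In a real setup the gradient would be obtained by backpropagating through the frozen network with an autodiff framework; the hand-derived gradient is possible here only because the toy model is linear.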

### 4.2 Performer: Efficient Transformers for Long Biological Sequences

**Motivation:** The standard self-attention mechanism has a complexity of O(L²) with respect to the sequence length L, both in time and memory. This is because it computes a full L x L attention matrix. For long biological sequences like proteins (thousands of amino acids) or DNA (billions of base pairs), this is computationally prohibitive.

**Contribution:** The Performer model introduces an efficient Transformer architecture that approximates the softmax attention mechanism, reducing the complexity to O(L). It does this via a technique called **FAVOR+ (Fast Attention Via Positive Orthogonal Random Features)**.

**Mathematical Derivation (High-Level):**
1.  Standard attention is computed as `Attention(Q, K, V) = softmax(QKᵀ/√d_k)V`.
2.  Softmax is a row-normalized exponential, so the core quantity is the kernel value `exp(q_i • k_j)` for each query `q_i` and key `k_j` (with the `1/√d_k` scaling absorbed into `q_i` and `k_j`).
3.  The key insight is that this exponential kernel is an expectation over random features: `exp(x•y) = E[φ_w(x) • φ_w(y)]`, where `φ_w(x) = exp(w•x − ‖x‖²/2)` and `w` is drawn from a standard Gaussian. Averaging over `m` random draws of `w` gives an `m`-dimensional feature map `φ` whose dot product approximates the kernel.
4.  Applying `φ` to every row of `Q` and `K` gives `Q'` and `K'` (each of shape `L x m`), and the unnormalized attention matrix `A = exp(QKᵀ/√d_k)` is approximated as `A ≈ Q'K'ᵀ`.
5.  The output becomes `D⁻¹ (Q' (K'ᵀ V))`, where `D` is the diagonal matrix holding the row sums of `Q'K'ᵀ`. Crucially, the order of multiplication is changed. Instead of computing the `L x L` matrix `Q'K'ᵀ`, we first compute the `m x d_v` matrix `K'ᵀ V`, where `m` is the number of random features and is chosen independently of `L`. This avoids the quadratic bottleneck.

The result is a linear-time and linear-memory attention mechanism.

In [None]:
def educational_performer_attention(Q, K, V, num_random_features):
    """
    Clear implementation of the Performer's core approximation for understanding.
    - Based directly on the FAVOR+ mechanism described in the paper.
    - No black-box libraries for the core approximation.
    
    Args:
        Q, K, V: Query, Key, Value matrices of shape (seq_len, d_k).
        num_random_features (int): The number of random features (m) to use for approximation.
    
    Returns:
        np.array: The approximated context vectors.
    """
    seq_len, d_k = Q.shape

    # Step 1: Generate a random projection matrix.
    # This is the core of the random feature map φ.
    np.random.seed(0)
    random_projection_matrix = np.random.randn(d_k, num_random_features)

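    # Step 2: Define the positive feature map
    #   φ(x) = exp(xW - ||x||² / 2) / sqrt(m),
    # which satisfies E[φ(x) · φ(y)] = exp(x · y) for Gaussian W.
    # Inputs are pre-scaled by d_k^(-1/4) so that φ(q) · φ(k) targets
    # exp(q · k / sqrt(d_k)), matching scaled dot-product attention.
    # The paper additionally orthogonalizes the random projections to
    # reduce variance; that refinement is omitted here.
    def feature_map(x):
        x = x / d_k ** 0.25
        projected_x = x @ random_projection_matrix
        half_squared_norm = np.sum(x ** 2, axis=-1, keepdims=True) / 2.0
        return np.exp(projected_x - half_squared_norm) / np.sqrt(num_random_features)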
    
    # Step 3: Map Q and K to the random feature space.
    Q_prime = feature_map(Q)
    K_prime = feature_map(K)

    # Step 4: Compute attention by re-ordering matrix multiplications.
    # This is the key step that avoids the quadratic L x L matrix.
    
    # First, compute the softmax denominator for each query.
    # Q_prime @ K_prime.sum(axis=0) equals the row sums of the approximated
    # attention matrix, obtained without forming that matrix. Shape: (L,)
    D_inv = 1.0 / (Q_prime @ K_prime.sum(axis=0))
    
    # Next, compute K_prime.T @ V. Shape: (m, d_v)
    # This aggregates values based on their key projections.
    K_prime_T_V = K_prime.T @ V

    # Now, compute Q_prime @ (K_prime.T @ V). Shape: (L, d_v)
    # This calculates the weighted sum for each query.
    Q_prime_K_prime_T_V = Q_prime @ K_prime_T_V

    # Finally, apply the normalization row by row via broadcasting.
    # (Building np.diag(D_inv) would allocate the very L x L matrix we are avoiding.)
    approximated_context = D_inv[:, None] * Q_prime_K_prime_T_V
    
    return approximated_context


# --- Demonstration ---
L = 1024  # A longer sequence length
d = 64
m = 128   # Number of random features (m << L)

Q_long = np.random.randn(L, d)
K_long = np.random.randn(L, d)
V_long = np.random.randn(L, d)

print(f"Sequence Length (L): {L}")
print(f"Embedding Dim (d): {d}")
print(f"Num Random Features (m): {m}")

context_performer = educational_performer_attention(Q_long, K_long, V_long, num_random_features=m)

print("\nPerformer Approximated Output Shape:", context_performer.shape)
print("Notice that no (L x L) matrix was created in the process.")
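
The demonstration above checks only the output shape. As a hedged sanity check, the self-contained sketch below compares a random-feature approximation against exact softmax attention on a small problem. It uses the positive feature map with the `−‖x‖²/2` correction term and plain (non-orthogonal) Gaussian projections; the sizes, the 0.5 input scale, and names such as `phi` and `rel_err` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_seq, d_head, n_features = 64, 16, 4096

Q_small = 0.5 * rng.normal(size=(n_seq, d_head))
K_small = 0.5 * rng.normal(size=(n_seq, d_head))
V_small = rng.normal(size=(n_seq, d_head))

# Exact softmax attention (forms the full n_seq x n_seq matrix).
scores = Q_small @ K_small.T / np.sqrt(d_head)
exact_weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
exact_weights /= exact_weights.sum(axis=-1, keepdims=True)
exact_out = exact_weights @ V_small

# Random-feature approximation (never forms that matrix).
W_rand = rng.normal(size=(d_head, n_features))

def phi(x):
    """Positive random features whose dot product estimates exp(q.k / sqrt(d))."""
    x = x / d_head ** 0.25
    half_sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(x @ W_rand - half_sq_norm) / np.sqrt(n_features)

Q_feat, K_feat = phi(Q_small), phi(K_small)
normalizer = Q_feat @ K_feat.sum(axis=0)                      # shape (n_seq,)
approx_out = (Q_feat @ (K_feat.T @ V_small)) / normalizer[:, None]

rel_err = np.linalg.norm(approx_out - exact_out) / np.linalg.norm(exact_out)
print(f"relative error of the approximation: {rel_err:.4f}")
```

Lowering `n_features` should make the error grow, which is the accuracy/efficiency trade-off controlled by `m`.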

### 4.3 ProtNLM: Annotating Proteins with Natural Language

**Motivation:** Over 50% of known protein sequences are uncharacterized, meaning their function is unknown. Manually annotating them is a massive bottleneck. Automating this process could significantly accelerate biological research.

**Contribution:** ProtNLM frames this as a sequence-to-sequence task, analogous to machine translation or image captioning. It uses a T5 (Text-to-Text Transfer Transformer) model to "translate" a sequence of amino acids into a natural language description of the protein's function.

- **Input:** A sequence of amino acids (e.g., `M A D Q G V...`)
- **Output:** A natural language text (e.g., `"This protein is involved in ATP binding and metabolic processes."`) 

The model was trained on the UniProt database, which contains existing protein sequences and their annotations. This approach successfully generated plausible descriptions for 49 million previously uncharacterized proteins.
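
To make the "amino acids as tokens" framing concrete, here is a minimal sketch of how a protein string could be turned into integer IDs before being fed to a sequence-to-sequence model. The vocabulary layout, the `PAD_ID`/`UNK_ID` conventions, and the `encode_protein` helper are hypothetical; the source does not describe ProtNLM's actual tokenizer.

```python
# The 20 standard amino acids, by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical special tokens: 0 for padding, 1 for unknown residues.
PAD_ID, UNK_ID = 0, 1
token_to_id = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}

def encode_protein(sequence):
    """Map each residue to an integer ID; unrecognized letters map to UNK_ID."""
    return [token_to_id.get(residue, UNK_ID) for residue in sequence.upper()]

example_ids = encode_protein("MADQGV")
print(example_ids)
```

The decoder side would use an ordinary text vocabulary, since the target is a natural language description.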

### 4.4 DeepConsensus: Improving Genome Sequencing Accuracy

**Motivation:** Modern DNA sequencers (like PacBio) read a DNA molecule multiple times to generate several "subreads". These are then combined into a single Circular Consensus Sequence (CCS). However, the resulting consensus read still contains errors.

**Contribution:** DeepConsensus is a Transformer-based model that takes the raw subreads and the initial CCS read as input to produce a more accurate, polished sequence. 

- **Input:** Multiple noisy subreads of a DNA segment, the initial CCS read, and other instrument features.
- **Output:** A single, corrected DNA sequence.

A key innovation mentioned in the lecture is the use of a special **alignment-based loss function** instead of standard cross-entropy loss. 

**Why not Cross-Entropy?**
DNA sequences can have insertion and deletion errors. If the model predicts a sequence that is shifted by one base, cross-entropy loss would penalize every single subsequent prediction, even if the rest of the sequence is correct. An alignment loss (related to edit distance) can correctly identify that the error is a single insertion/deletion and provide a more meaningful gradient for training.
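
The shift problem can be seen on a tiny example. The sketch below deletes one base from a reference string and compares a position-by-position mismatch count (the kind of signal a per-position cross-entropy responds to) with the Levenshtein edit distance. Note the assumption: plain edit distance is used here as a simple, non-differentiable stand-in, not the alignment loss DeepConsensus actually trains with.

```python
def positional_mismatches(truth, pred):
    """Count positions that disagree, plus any difference in length."""
    mismatches = sum(t != p for t, p in zip(truth, pred))
    return mismatches + abs(len(truth) - len(pred))

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,                        # deletion
                            curr[j - 1] + 1,                    # insertion
                            prev[j - 1] + substitution_cost))   # match / substitution
        prev = curr
    return prev[-1]

truth = "ACGTACGTAC"
pred = truth[:2] + truth[3:]  # the same sequence with a single base deleted

print("position-wise mismatches:", positional_mismatches(truth, pred))
print("edit distance:           ", edit_distance(truth, pred))
```

A single deletion makes almost every later position disagree, while the alignment-aware measure correctly reports one error.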

### 4.5 Enformer: Predicting Gene Expression from Long-Range DNA Interactions

**Motivation:** About 90% of genetic variants associated with diseases lie in non-coding regions of DNA. These regions don't code for proteins but act as regulatory elements (like "enhancers"). An enhancer can be very far from a gene in the linear DNA sequence but influence its expression because the DNA folds in 3D space, bringing them close together. Predicting these long-range effects is a major challenge.

**Contribution:** The Enformer model uses a hybrid architecture (CNNs followed by Transformer blocks) to predict gene expression from DNA sequences. The Transformer's ability to model long-range dependencies is critical here.

- **Input:** A long DNA sequence (200k base pairs).
- **Output:** A prediction of gene expression levels across thousands of genomic tracks.
- **Architecture:** The initial CNN layers efficiently learn local patterns and shorten the sequence. Their output is then fed into Transformer blocks, which can model interactions between elements as far as roughly 100k base pairs apart within the input window.
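
A rough sketch of why the convolutional front end matters: attention cost grows with the square of the number of positions, so shortening the sequence first makes the Transformer stage tractable. Below, DNA is one-hot encoded and then average-pooled, with the pooling acting as a crude stand-in for the learned convolution and pooling layers. The sequence length of 2048 and the pooling factor of 128 are illustrative assumptions, as are the helper names.

```python
import numpy as np

BASES = "ACGT"

def one_hot_dna(sequence):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    indices = np.array([BASES.index(base) for base in sequence])
    return np.eye(4)[indices]

def average_pool(x, factor):
    """Average non-overlapping windows of `factor` positions along the sequence axis."""
    usable = (x.shape[0] // factor) * factor
    return x[:usable].reshape(-1, factor, x.shape[1]).mean(axis=1)

rng = np.random.default_rng(0)
dna = "".join(rng.choice(list(BASES), size=2048))

encoded = one_hot_dna(dna)          # one row per base pair
pooled = average_pool(encoded, 128) # one row per 128-bp bin

print("positions before pooling:", encoded.shape[0])
print("positions after pooling: ", pooled.shape[0])
print("attention matrix entries before:", encoded.shape[0] ** 2)
print("attention matrix entries after: ", pooled.shape[0] ** 2)
```

Each pooled row is the base composition of its bin; a real model would instead produce learned feature vectors, but the reduction in sequence length is the same.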

## Section 5: Experimental Analysis

This section reproduces some of the key experimental findings and comparisons discussed in the lecture.

### 5.1 Med-PaLM: Closing the Gap to Clinicians

The lecture presented results from a comprehensive human evaluation framework for the medical question answering task. Long-form answers from **FLAN-PaLM** (the base model), **Med-PaLM** (the aligned model), and expert **Clinicians** were rated by other clinicians and lay users across several axes.

**Key Findings:**
- **FLAN-PaLM vs. Med-PaLM:** The instruction-prompt-tuned Med-PaLM consistently and significantly outperformed the general-purpose FLAN-PaLM across axes like factual accuracy, medical reasoning, and potential for harm.
- **Med-PaLM vs. Clinicians:** While clinicians still performed best overall, Med-PaLM closed the gap considerably. For example, in terms of answers agreeing with scientific consensus, FLAN-PaLM was rated at ~60%, while Med-PaLM jumped to over 90%, very close to the clinician-generated answers.
- **Lay User Helpfulness:** For non-expert users, Med-PaLM's answers were rated as helpful 80% of the time, a large improvement over FLAN-PaLM's 60%, but still below the clinicians' 90%. This highlights the importance of evaluating from multiple perspectives.

### 5.2 Performer: Parameter Sensitivity and Efficiency Gains

A key claim of the Performer paper is its linear complexity. Let's analyze the computational complexity of standard attention versus the Performer's approximation. The dominant operation in standard attention is the `Q @ K.T` matrix multiplication, which is O(L²d). In Performer, the dominant steps are the random feature projections `Q @ W` (O(Ldm)) and the final aggregations like `Q_prime @ K_prime_T_V` (O(Lmd)), resulting in an overall complexity of O(Lmd). Since `m` is a fixed hyperparameter independent of `L`, this is **O(L)**.
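
As a back-of-the-envelope check on these orders of growth, the sketch below counts multiply-adds for the dominant matrix products only. It assumes `d = 64` and `m = 128` (the values used in the Section 4.2 demonstration) and ignores softmax, the exponentials, and the normalizer, so the numbers are indicative rather than exact.

```python
def standard_attention_madds(seq_len, dim):
    """Q @ K.T is (L x d)(d x L); weights @ V is (L x L)(L x d)."""
    return 2 * seq_len ** 2 * dim

def performer_attention_madds(seq_len, dim, n_features):
    projections = 2 * seq_len * dim * n_features  # Q @ W and K @ W
    keys_values = seq_len * n_features * dim      # K'.T @ V
    queries = seq_len * n_features * dim          # Q' @ (K'.T @ V)
    return projections + keys_values + queries

for seq_len in (256, 1024, 4096, 16384):
    standard = standard_attention_madds(seq_len, 64)
    performer = performer_attention_madds(seq_len, 64, 128)
    print(f"L={seq_len:>6}: standard={standard:.2e}  "
          f"performer={performer:.2e}  ratio={standard / performer:.1f}")
```

Under this count the ratio is `L / (2m)`, so the two methods break even at `L = 2m` and the approximation only pays off for sequences longer than that.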

Let's visualize this scaling difference.