# Demo: How Input Sequence Length Impacts Inference Speed

### Goal
To demonstrate that longer input sequences (prompts) increase the total time it takes for a model to generate a response, even when the number of *new* tokens being generated is the same.

We will compare the end-to-end latency and tokens per second for a short prompt versus a long prompt on both CPU and GPU.

## Part 1: Setup and Configuration

### 1.1 Import Libraries

In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import time

### 1.2 Load Model and Tokenizer

We will use `gpt2-medium`, a moderately sized model, for this demonstration. We also need to ensure the tokenizer has a `pad_token` defined. If it doesn't, we'll set it to the `eos_token` (end-of-sequence token), which is a common practice.

In [3]:
model_name = "gpt2-medium"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Ensure a pad token is set for the tokenizer and model configuration
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id

print(f"Successfully loaded model and tokenizer for: {model_name}")

Successfully loaded model and tokenizer for: gpt2-medium


## Part 2: Preparing the Prompts

### 2.1 Define Prompts of Different Lengths

Here, we create two prompts: a very short one and a much longer one. We will instruct the model to generate the exact same number of new tokens for both, allowing us to isolate the effect of the initial prompt length.

In [4]:
num_new_tokens_to_generate = 50 # The number of new tokens we want the model to generate

short_prompt = "The three primary colors are"

long_prompt = ("Modern deep learning models, especially transformers like GPT and BERT, "
               "rely heavily on matrix multiplications and attention mechanisms. "
               "While GPUs offer massive parallel processing capabilities via thousands of cores, "
               "the speed is often limited not by computation (FLOPs) but by the rate at which "
               "data (weights, activations, intermediate states) can be fetched from High Bandwidth Memory (HBM). "
               "This memory bandwidth bottleneck becomes particularly pronounced during the inference phase of "
               "large language models where parameter counts reach billions, requiring constant data movement "
               "which can leave the powerful compute units waiting. "
               "The core of the issue is that for every single token generated, the model must read its enormous weights "
               "and the state of all previous tokens (the key-value cache). As the context grows, this cache gets larger, "
               "demanding more memory bandwidth and slowing down the time to generate the next token.")

### 2.2 Tokenize Prompts and Verify Lengths

Now, we'll use the tokenizer to convert our text prompts into numerical IDs that the model can understand. We'll print the number of tokens in each prompt to confirm their difference in length.

In [5]:
inputs_short = tokenizer(short_prompt, return_tensors="pt")
inputs_long = tokenizer(long_prompt, return_tensors="pt")

short_prompt_len = inputs_short['input_ids'].shape[1]
long_prompt_len = inputs_long['input_ids'].shape[1]

print(f"Short prompt length: {short_prompt_len} tokens")
print(f"Long prompt length:  {long_prompt_len} tokens")
print(f"\nWe will generate {num_new_tokens_to_generate} new tokens for each prompt.")

Short prompt length: 5 tokens
Long prompt length:  172 tokens

We will generate 50 new tokens for each prompt.


## Part 3: Benchmarking Inference Time

### 3.1 Benchmarking on CPU

First, we'll measure the performance on a standard CPU. We will:
1.  Move the model and both prompts to the CPU.
2.  Time how long it takes to generate 50 tokens for the short prompt.
3.  Time how long it takes to generate 50 tokens for the long prompt.
4.  Compare the results.

In [6]:
# --- Step 1: Setup CPU Environment ---
print("Moving model and data to CPU...")
cpu_device = torch.device("cpu")
model.to(cpu_device)
inputs_short_cpu = {k: v.to(cpu_device) for k, v in inputs_short.items()}
inputs_long_cpu = {k: v.to(cpu_device) for k, v in inputs_long.items()}
print("Setup complete.")

# --- Step 2: Time Short Prompt on CPU ---
print("\nRunning inference for a SHORT prompt on CPU...")
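# perf_counter() is a monotonic, high-resolution clock intended for measuring intervals
start_time = time.perf_counter()
with torch.no_grad():
    model.generate(inputs_short_cpu["input_ids"], max_new_tokens=num_new_tokens_to_generate)
end_time = time.perf_counter()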

cpu_latency_short = end_time - start_time
cpu_tps_short = num_new_tokens_to_generate / cpu_latency_short
print(f"CPU - Short Prompt: Latency={cpu_latency_short:.4f}s, TPS={cpu_tps_short:.2f}")

# --- Step 3: Time Long Prompt on CPU ---
print("\nRunning inference for a LONG prompt on CPU...")
start_time = time.perf_counter()
with torch.no_grad():
    model.generate(inputs_long_cpu["input_ids"], max_new_tokens=num_new_tokens_to_generate)
end_time = time.perf_counter()

cpu_latency_long = end_time - start_time
cpu_tps_long = num_new_tokens_to_generate / cpu_latency_long
print(f"CPU - Long Prompt:  Latency={cpu_latency_long:.4f}s, TPS={cpu_tps_long:.2f}")

# --- Step 4: Compare CPU Performance ---
if cpu_latency_short > 0:
    latency_increase = cpu_latency_long / cpu_latency_short
    print(f"\nCPU Latency Increase (Long/Short): {latency_increase:.2f}x")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Moving model and data to CPU...
Setup complete.

Running inference for a SHORT prompt on CPU...
CPU - Short Prompt: Latency=4.0583s, TPS=12.32

Running inference for a LONG prompt on CPU...
CPU - Long Prompt:  Latency=4.7266s, TPS=10.58

CPU Latency Increase (Long/Short): 1.16x
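The warnings in the output above appear because only `input_ids` was passed to `generate`. A minimal sketch of one way to avoid them is to forward the tokenizer's `attention_mask` and an explicit `pad_token_id` as well. The helper name `build_generate_kwargs` is hypothetical, and the example values below are stand-ins so the snippet runs without loading a model.

```python
def build_generate_kwargs(encoded, pad_token_id, max_new_tokens):
    """Collect keyword arguments for model.generate from a tokenizer output.

    `encoded` is any mapping that contains "input_ids" and "attention_mask",
    such as the dictionaries built in the benchmark cell.
    """
    return {
        "input_ids": encoded["input_ids"],
        "attention_mask": encoded["attention_mask"],
        "pad_token_id": pad_token_id,
        "max_new_tokens": max_new_tokens,
    }


# Stand-in values (not real token IDs for any prompt) so the sketch runs anywhere.
example_encoded = {
    "input_ids": [[1, 2, 3, 4, 5]],
    "attention_mask": [[1, 1, 1, 1, 1]],
}
kwargs = build_generate_kwargs(example_encoded, pad_token_id=50256, max_new_tokens=50)
print(sorted(kwargs))

# In the notebook, the call would look like this:
# model.generate(**build_generate_kwargs(
#     inputs_short_cpu, tokenizer.eos_token_id, num_new_tokens_to_generate))
```

With a single unpadded prompt the generated text should not change; the explicit arguments mainly remove the ambiguity the warnings describe.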


### 3.2 Benchmarking on GPU

Now, let's see how a GPU handles this. The process is similar, but with two key differences for accurate measurement:

1.  **GPU Warm-up:** The first time you run an operation on a GPU, it incurs one-time setup costs (such as CUDA context initialization and kernel loading). We'll perform a small, untimed run first to get these out of the way.
2.  **Accurate Timing:** GPU code runs asynchronously: a call can return to Python before the GPU has finished, so a CPU-based timer like `time.time()` can stop too early unless you synchronize first. We'll use `torch.cuda.Event` to record timestamps on the GPU itself and call `torch.cuda.synchronize()` before reading the elapsed time.

In [7]:
if torch.cuda.is_available():
    print(f"CUDA device found: {torch.cuda.get_device_name(0)}\n")

    # --- Step 1: Setup GPU Environment ---
    print("Moving model and data to GPU...")
    gpu_device = torch.device("cuda")
    model.to(gpu_device)
    inputs_short_gpu = {k: v.to(gpu_device) for k, v in inputs_short.items()}
    inputs_long_gpu = {k: v.to(gpu_device) for k, v in inputs_long.items()}
    print("Setup complete.")

    # --- Step 2: GPU Warm-up ---
    print("\nPerforming GPU warm-up run...")
    with torch.no_grad():
        _ = model.generate(inputs_short_gpu["input_ids"], max_new_tokens=5)
    torch.cuda.synchronize() # Wait for the warm-up to finish
    print("Warm-up complete.")

    # --- Step 3: Time Short Prompt on GPU ---
    print("\nRunning inference for a SHORT prompt on GPU...")
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    
    start_event.record()
    with torch.no_grad():
        model.generate(inputs_short_gpu["input_ids"], max_new_tokens=num_new_tokens_to_generate)
    end_event.record()

    torch.cuda.synchronize() # IMPORTANT: Wait for the GPU to finish the work
    gpu_latency_short_ms = start_event.elapsed_time(end_event)
    gpu_latency_short = gpu_latency_short_ms / 1000.0
    gpu_tps_short = num_new_tokens_to_generate / gpu_latency_short
    print(f"GPU - Short Prompt: Latency={gpu_latency_short:.4f}s, TPS={gpu_tps_short:.2f}")

    # --- Step 4: Time Long Prompt on GPU ---
    print("\nRunning inference for a LONG prompt on GPU...")
    start_event.record()
    with torch.no_grad():
        model.generate(inputs_long_gpu["input_ids"], max_new_tokens=num_new_tokens_to_generate)
    end_event.record()

    torch.cuda.synchronize()
    gpu_latency_long_ms = start_event.elapsed_time(end_event)
    gpu_latency_long = gpu_latency_long_ms / 1000.0
    gpu_tps_long = num_new_tokens_to_generate / gpu_latency_long
    print(f"GPU - Long Prompt:  Latency={gpu_latency_long:.4f}s, TPS={gpu_tps_long:.2f}")
    
    # --- Step 5: Compare GPU Performance ---
    if gpu_latency_short > 0:
        latency_increase_gpu = gpu_latency_long / gpu_latency_short
        print(f"\nGPU Latency Increase (Long/Short): {latency_increase_gpu:.2f}x")

else:
    print("CUDA not available. Skipping GPU comparison.")

CUDA device found: Tesla T4

Moving model and data to GPU...
Setup complete.

Performing GPU warm-up run...
Warm-up complete.

Running inference for a SHORT prompt on GPU...
GPU - Short Prompt: Latency=0.7099s, TPS=70.43

Running inference for a LONG prompt on GPU...
GPU - Long Prompt:  Latency=0.7304s, TPS=68.46

GPU Latency Increase (Long/Short): 1.03x
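A single timed run can be noisy, and a difference as small as 1.03x is hard to interpret on its own. As a minimal sketch (not part of the original notebook), the hypothetical `benchmark` helper below performs untimed warm-up calls and then reports the median, fastest, and slowest of several timed calls. The workload is a stand-in so the snippet runs anywhere; in the notebook, `fn` would wrap a `model.generate(...)` call, and for GPU work it would also need to call `torch.cuda.synchronize()` before returning.

```python
import statistics
import time


def benchmark(fn, warmup=1, repeats=5):
    """Return (median, fastest, slowest) wall-clock seconds for calling fn()."""
    for _ in range(warmup):
        fn()  # untimed: absorbs one-time setup costs

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), min(samples), max(samples)


# Stand-in workload; replace with a function that runs the generation to be timed.
median_s, fastest_s, slowest_s = benchmark(lambda: sum(range(100_000)))
print(f"median={median_s:.6f}s  fastest={fastest_s:.6f}s  slowest={slowest_s:.6f}s")
```

Comparing the spread between the fastest and slowest runs with the gap between the short and long prompts gives a rough sense of whether that gap is meaningful.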


## Part 4: Analysis and Conclusion

### 4.1 Key Takeaways

Let's break down what we observed and why it happened.

#### Observation
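A **longer input prompt leads to higher end-to-end latency** (it takes more time) and therefore **lower throughput** (fewer tokens per second), even though the model is generating the same number of new tokens in both cases.

In our run, the slowdown appears on both devices but differs in size: 1.16x on CPU (4.06 s to 4.73 s) and only 1.03x on the Tesla T4 GPU (0.710 s to 0.730 s). Each figure comes from a single timed run, so the small GPU difference could fall within run-to-run variation.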
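To see the size of the effect in absolute terms, the short calculation below reuses the latencies recorded in the outputs above. The extra time is averaged over the 50 generated tokens; it also includes the cost of processing the longer prompt itself, so it should be read as an average rather than a per-step measurement.

```python
NEW_TOKENS = 50

# Latencies in seconds, copied from the benchmark outputs above.
latency_s = {
    "CPU": {"short": 4.0583, "long": 4.7266},
    "GPU": {"short": 0.7099, "long": 0.7304},
}

summary = {}
for device, runs in latency_s.items():
    ratio = runs["long"] / runs["short"]
    extra_ms_per_token = (runs["long"] - runs["short"]) / NEW_TOKENS * 1000
    summary[device] = (ratio, extra_ms_per_token)
    print(f"{device}: {ratio:.2f}x latency, "
          f"about {extra_ms_per_token:.2f} ms extra per generated token")
```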

#### The "Why": Self-Attention

The slowdown follows from how self-attention works in the Transformer architecture.

Before it can generate anything, the model must process every prompt token once, so a longer prompt costs more up front.

Then, to generate **each subsequent token**, the model performs a self-attention operation in which the new token must "attend to" (or look at) **all the previous tokens**. This includes the original prompt *and* all the tokens it has already generated.

   - **Short Prompt:** To generate token #11, the model attends to the 5 prompt tokens + 10 generated tokens = **15 tokens**.
   - **Long Prompt:** To generate token #11, the model attends to the 172 prompt tokens + 10 generated tokens = **182 tokens**.
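The per-token attention step can be sketched in plain Python. This is a toy, single-head illustration rather than GPT-2's actual implementation: the `attend` and `toy_context` helpers, the head dimension of 64, and the made-up key and value vectors are all assumptions chosen for illustration. The sketch scores one query against every earlier position, so the work grows linearly with the number of positions attended to.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all earlier positions."""
    d = len(query)
    # One dot product per earlier position.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Numerically stable softmax over the scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors.
    value_dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(value_dim)]
    # Multiply-adds: n*d for the scores plus n*value_dim for the weighted sum.
    multiply_adds = len(keys) * d + len(values) * value_dim
    return output, multiply_adds


HEAD_DIM = 64  # illustrative assumption


def toy_context(n):
    """Build n made-up key and value vectors."""
    keys = [[0.01 * (j % 7)] * HEAD_DIM for j in range(n)]
    values = [[float(j)] * HEAD_DIM for j in range(n)]
    return keys, values


query = [0.1] * HEAD_DIM
results = {}
for label, n in [("short", 15), ("long", 182)]:
    keys, values = toy_context(n)
    results[label] = attend(query, keys, values)
    print(f"{label}: {n} positions, {results[label][1]} multiply-adds")
```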

#### Implication

The cost of generation is not just about the number of tokens you create; it's heavily influenced by the total context length (prompt + generation). 
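To put rough numbers on this, the sketch below totals how many positions are attended to across all 50 generation steps, using the same counting as the token #11 example (prompt tokens plus tokens generated so far). The helper name `attended_positions` is hypothetical, and the count ignores everything except the number of positions.

```python
def attended_positions(prompt_len, new_tokens):
    """Positions attended to when generating token #1 through #new_tokens."""
    # Token #k sees the prompt plus the k - 1 tokens generated before it.
    return [prompt_len + k - 1 for k in range(1, new_tokens + 1)]


NEW_TOKENS = 50
short_steps = attended_positions(5, NEW_TOKENS)
long_steps = attended_positions(172, NEW_TOKENS)

# Index 10 corresponds to token #11.
print("Token #11:", short_steps[10], "vs", long_steps[10])
print("Total over all steps:", sum(short_steps), "vs", sum(long_steps))
print(f"Ratio: {sum(long_steps) / sum(short_steps):.2f}x")
```

Under this counting the long prompt attends to roughly 6.7 times as many positions in total, yet the measured latency rose only 1.16x on CPU and 1.03x on GPU. One plausible reading is that, at these context lengths, work that does not depend on context length (such as the feed-forward layers and reading the model weights) accounts for most of the per-token time.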

This is why handling long contexts efficiently is a major area of research and engineering.