<a href="https://colab.research.google.com/github/rahiakela/genai-research-and-practice/blob/main/hands-on-llm-serving-and-optimization/05_llm_batching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

In [None]:
!pip install --quiet vllm transformers tiktoken

In [2]:
import torch
import gc
import time

# Unload a model and clean up the GPU memory cache.
def free_gpu(model):
    # `del` only drops this function's local reference. The caller must also
    # drop its own reference (e.g. `del llm`) before the memory can be reclaimed.
    if model is not None:
        del model

    # Run garbage collection first so unreferenced tensors are actually freed.
    gc.collect()

    # Then release cached GPU memory that is no longer needed.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()

free_gpu(None)

## Load Model

In [None]:
import time
from vllm import LLM, SamplingParams

# Load model with vLLM
llm = LLM(
    model="Qwen/Qwen2.5-0.5B",
    dtype="float16",
    trust_remote_code=True,
    max_model_len=2048
)

## Batching

In [4]:
# Define the prompt.
prompt = """You are an expert AI historian writing a detailed chapter for a book titled "The Evolution of Human-AI Collaboration."

Begin by summarizing the early stages of artificial intelligence in the 1950s, touching on symbolic logic and rule-based systems. Then transition into the rise of machine learning, particularly deep learning in the 2010s.

Afterward, describe how large language models like GPT transformed human-computer interaction, enabling applications in education, creative writing, customer support, and software development.

Finally, reflect on the societal and ethical implications of AI, such as misinformation, bias, and the alignment problem.

Write in a formal tone, with rich detail and examples in each era."""

In [None]:
import torch
import gc
import time
from vllm import LLM, SamplingParams
from transformers import pipeline

# Prompts for batch generation, 4 input sequences
prompts = [
    "What is the meaning of life?",
    "Write a short story about a robot learning to love.",
    "Explain quantum physics in simple terms.",
    "Translate 'Hello, world!' into Spanish."
]

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=100
)

In [7]:
# perf_counter() is a monotonic, high-resolution clock suited to timing code.
start_time = time.perf_counter()
# Process all four input sequences (prompts) together in one batch.
vllm_outputs = llm.generate(prompts, sampling_params)
end_time = time.perf_counter()
vllm_time = end_time - start_time

print(f"\nvLLM generation time for 4 prompts in a batch: {vllm_time:.4f} seconds")
print(f"VLLM output: {vllm_outputs}")

vLLM generation time for 4 prompts in a batch: 0.9375 seconds
VLLM output: [RequestOutput(request_id=8, prompt='What is the meaning of life?', prompt_token_ids=[3838, 374, 279, 7290, 315, 2272, 30], encoder_prompt=None, encoder_prompt_token_ids=None, prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=" The question of meaning, particularly in relation to the concept of life itself, has been a topic of debate among philosophers and thinkers for centuries. Some may consider the question as an existential question, while others may view it as a philosophical or metaphysical inquiry.\n\nThe most common interpretation of the question of meaning in life is that it is a question about the meaning of existence itself. Philosophers argue that life is not just an individual's experience of the world, but rather a process that is interconnected with the", token_ids=(576, 3405, 315, 7290, 11, 7945, 304, 12687, 311, 279, 7286, 315, 2272, 5086, 11, 702, 1012, 264, 8544, 315, 11004, 4221,
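The raw `RequestOutput` repr above is hard to read. A minimal sketch of pulling out just the prompt and the generated text follows. It relies only on the `prompt`, `outputs`, and `text` fields visible in the repr. The helper name `summarize_outputs` and the stand-in objects are hypothetical, used here so the example runs without a GPU.

```python
from types import SimpleNamespace


def summarize_outputs(request_outputs):
    """Return (prompt, generated_text) pairs from RequestOutput-like objects."""
    pairs = []
    for request_output in request_outputs:
        # Each request holds a list of completions; index 0 is the first sample.
        completion = request_output.outputs[0]
        pairs.append((request_output.prompt, completion.text.strip()))
    return pairs


# Stand-in objects (an assumption for illustration) that mimic the fields
# seen in the repr above. With a loaded model, pass `vllm_outputs` instead.
demo_outputs = [
    SimpleNamespace(
        prompt="What is the meaning of life?",
        outputs=[SimpleNamespace(index=0, text=" placeholder completion text")],
    )
]

for prompt_text, generated_text in summarize_outputs(demo_outputs):
    print(f"Prompt: {prompt_text!r}")
    print(f"Generated: {generated_text!r}\n")
```

In the notebook itself, `summarize_outputs(vllm_outputs)` should give one pair per prompt, in the same order as `prompts`.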

In [8]:
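# Process the prompts one at a time (batch size 1) for comparison.
sequential_outputs = []
start_time = time.time()
# A separate loop variable keeps the `prompt` defined earlier from being overwritten.
for single_prompt in prompts:
    sequential_outputs.extend(llm.generate([single_prompt], sampling_params))
end_time = time.time()
vllm_time = end_time - start_time

print(f"\nvLLM generation time for 4 prompts one by one: {vllm_time:.4f} seconds")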


vLLM generation time for 4 prompts one by one: 2.2677 seconds
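To put the two measurements side by side, here is a small sketch that computes the speedup and the average time per prompt from the timings printed above (0.9375 s batched, 2.2677 s sequential). The helper name `batching_speedup` is hypothetical. These are single-run numbers, so they will likely vary between runs and hardware.

```python
def batching_speedup(batched_seconds, sequential_seconds, num_prompts):
    """Compare one batched run against the same prompts run one by one."""
    return {
        # How many times faster the batched run finished.
        "speedup": sequential_seconds / batched_seconds,
        # Average wall-clock time attributed to each prompt.
        "batched_per_prompt": batched_seconds / num_prompts,
        "sequential_per_prompt": sequential_seconds / num_prompts,
    }


# Timings copied from the two cell outputs above.
stats = batching_speedup(batched_seconds=0.9375, sequential_seconds=2.2677, num_prompts=4)

print(f"Speedup from batching: {stats['speedup']:.2f}x")  # prints 2.42x
print(f"Batched, per prompt: {stats['batched_per_prompt']:.4f} s")
print(f"Sequential, per prompt: {stats['sequential_per_prompt']:.4f} s")
```

For this run, batching the four prompts finished roughly 2.4 times faster than issuing them one by one.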
