# Exercise: Measure Latency & Throughput
**Objective:** Get hands-on experience loading a ~1B-parameter Llama-family model (the code below uses `TinyLlama/TinyLlama-1.1B-Chat-v1.0`) and timing a simple generation, so you understand raw latency and throughput.

**Tasks:**
1. Load the model and tokenizer.
2. Prepare a prompt: pick or write ~30–50 words, then tokenize it with `tokenizer(...)`.
3. Time your generation.
4. Record:
   - Latency (s)
   - Throughput (tokens/s)

**Requirements:**
- `transformers`
- `torch`
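
As a quick illustration of the two metrics to record: latency is the wall-clock time of one `generate` call, and throughput is the number of newly generated tokens divided by that latency. The sketch below uses a hypothetical helper, `generation_metrics`, and made-up timestamps and token counts. It is not part of the exercise code and only shows the arithmetic.

```python
def generation_metrics(start_s, end_s, prompt_tokens, total_tokens):
    """Return (latency in seconds, throughput in generated tokens per second)."""
    latency_s = end_s - start_s
    if latency_s <= 0:
        raise ValueError("end_s must be later than start_s")
    # Only tokens produced by the model count, not the prompt tokens.
    generated = total_tokens - prompt_tokens
    return latency_s, generated / latency_s


# Illustrative numbers only: a 30-token prompt, 80 tokens in the output
# sequence (prompt + 50 new tokens), and a 2-second generation.
latency_s, throughput_tps = generation_metrics(10.0, 12.0, prompt_tokens=30, total_tokens=80)
print(f"Latency: {latency_s:.3f} s, throughput: {throughput_tps:.1f} tokens/s")
# prints "Latency: 2.000 s, throughput: 25.0 tokens/s"
```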


In [None]:
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

In [None]:
# Settings
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
MAX_NEW_TOKENS = 50

In [None]:
# Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
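# float16 halves memory on GPU; fall back to float32 on CPU, where half precision is slow or unsupported for some ops
dtype = torch.float16 if DEVICE.type == "cuda" else torch.float32
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=dtype)
model.to(DEVICE)  # explicit placement; device_map="auto" would additionally require the `accelerate` package
model.eval()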

In [None]:
# Prepare prompt (about 25 words as written; lengthen it to reach the 30–50 word target)
prompt = (
    "Over the next decade, sustainable energy solutions will revolutionize "
    "global power grids, reducing carbon footprints and fostering resilient "
    "communities through innovative storage and distribution technologies."
)
inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
input_len = inputs.input_ids.size(1)
print(f"Input tokens: {input_len}")

In [None]:
# Warm-up pass: keeps one-time setup costs (e.g. CUDA initialization) out of the timed run
_ = model.generate(**inputs, max_new_tokens=5)
if DEVICE.type == "cuda":
    torch.cuda.synchronize()
print("Warm-up done.")

In [None]:
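# Time generation
if DEVICE.type == "cuda":
    torch.cuda.synchronize()  # make sure no earlier GPU work leaks into the timed region
start_time = time.perf_counter()  # monotonic, high-resolution clock intended for timing
outputs = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
if DEVICE.type == "cuda":
    torch.cuda.synchronize()  # wait for queued GPU kernels before stopping the clock
end_time = time.perf_counter()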

# Compute metrics
latency_s = end_time - start_time
generated_tokens = outputs.shape[1] - input_len  # can be < MAX_NEW_TOKENS if EOS is generated early
throughput_tps = generated_tokens / latency_s

# Report
print(f"Generated tokens: {generated_tokens}")
print(f"Latency: {latency_s:.3f} s")
print(f"Throughput: {throughput_tps:.1f} tokens/s")
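
A single timed run can be noisy, so one possible extension is to repeat the measurement and report the median. The sketch below is an assumption-laden illustration rather than part of the exercise: `time_runs` is a hypothetical helper, and `fake_generate` is a stand-in workload so the block runs without a model.

```python
import statistics
import time


def time_runs(fn, n_runs=5, sync=None):
    """Call fn() n_runs times and return the per-run latencies in seconds.

    sync is an optional zero-argument callable (hypothetical hook) that is
    invoked before and after each run, e.g. to wait for pending GPU work.
    """
    latencies = []
    for _ in range(n_runs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        latencies.append(time.perf_counter() - start)
    return latencies


def fake_generate():
    # Stand-in workload; replace with a real generation call in the notebook.
    sum(range(100_000))


latencies = time_runs(fake_generate, n_runs=5)
print(f"median: {statistics.median(latencies):.6f} s, min: {min(latencies):.6f} s")
```

In this notebook, `fn` could be `lambda: model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)`, with `sync=torch.cuda.synchronize` when running on a GPU. The median is less sensitive than the mean to an occasional slow run.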