# Starter Notebook: Your First LLM with vLLM

Welcome! This notebook is written for complete beginners.

By the end, you will know how to:
1. Install and import **vLLM**.
2. Load a **small instruction model**.
3. Build a prompt with a **chat template** (including a **system message**).
4. Generate text and **extract model outputs** clearly.

---

> **Who this is for:** People with little to no prior LLM experience.

## Cell 1 — (Optional) Install dependencies

If you're running this in a fresh environment, run this cell once.

- `vllm` is the serving/inference engine.
- `transformers` provides tokenizer + chat template utilities.

> If you already installed these packages, you can skip this cell.

In [None]:
# Cell 1: Install dependencies (optional)
# The leading "!" runs pip as a shell command in Jupyter/Colab.
# Drop the "!" only if you run this line in a regular terminal.
!pip install -q vllm transformers
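If you are not sure whether the packages are already installed, a small check like the one below can tell you before you run the install cell. This is only a sketch: `installed_version` is a helper name made up for this notebook, and it relies on Python's standard `importlib.metadata` module rather than anything from vLLM itself.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string for a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the status of the two packages this notebook relies on.
for name in ("vllm", "transformers"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If both lines show a version number, you can likely skip the install cell.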

## Cell 2 — Imports and quick environment check

This cell imports what we need, then prints the PyTorch version and whether a CUDA GPU is visible.

In [None]:
# Cell 2: Imports
import torch
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

## Cell 3 — Choose a small model

We'll use a small instruction-tuned model:

- **Model:** `Qwen/Qwen2-0.5B-Instruct`

Why this one?
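- At roughly 0.5 billion parameters, it is quick to download and light enough to be beginner-friendly.
- It is instruction-tuned and ships with a chat template, so it supports chat-style prompting.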

> If your machine is very limited, you may need to try an even smaller or quantized model.

In [None]:
# Cell 3: Pick model name
model_name = "Qwen/Qwen2-0.5B-Instruct"
print(f"Using model: {model_name}")

## Cell 4 — Load tokenizer and vLLM model

This can take a little time on first run because model files are downloaded.

- `AutoTokenizer` loads the tokenizer, which carries the model's chat template.
- `LLM(...)` loads the model with vLLM.

In [None]:
# Cell 4: Load tokenizer + model
# Tip: on multi-GPU machines, you can pass tensor_parallel_size=<number of GPUs> to LLM(...).
tokenizer = AutoTokenizer.from_pretrained(model_name)

llm = LLM(
    model=model_name,
    trust_remote_code=True,  # lets the model repo run its own code; only enable for repos you trust (usually unnecessary for Qwen2)
)

print("Tokenizer and model loaded successfully.")

## Cell 5 — Create chat messages (system + user)

A **system message** sets global behavior for the assistant.
A **user message** is the actual question/task.

In [None]:
# Cell 5: Define messages
messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful tutor for beginners. "
            "Explain concepts clearly using short examples."
        ),
    },
    {
        "role": "user",
        "content": "In 3 bullets, what is an LLM and what can it do?",
    },
]

messages
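Since you will probably want to try several questions with the same system message, it can help to wrap the message list in a small function. The sketch below is one way to do that; `make_messages` is a hypothetical helper name, not part of vLLM or `transformers`.

```python
def make_messages(system_text: str, user_text: str) -> list:
    """Build a two-message chat: one system message followed by one user message."""
    return [
        {"role": "system", "content": system_text},
        {"role": "user", "content": user_text},
    ]


# Example: same tutor persona, different question.
example = make_messages(
    "You are a helpful tutor for beginners.",
    "What is a token?",
)
print(example)
```

The returned list has the same shape as `messages` above, so it can be passed to the same later steps.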

## Cell 6 — Apply the chat template

`apply_chat_template(...)` converts role-based messages into the model's expected text format.

Important options:
- `tokenize=False`: return a string prompt.
- `add_generation_prompt=True`: append the opening of the assistant turn, so the model writes a reply instead of continuing the user's message.

In [None]:
# Cell 6: Build prompt from chat messages
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

print("Prompt sent to the model:")
print("-" * 80)
print(prompt_text)
print("-" * 80)
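To make the template less mysterious, here is a hand-written approximation of the ChatML-style layout that Qwen chat models are generally understood to use. Treat it as an illustration only: `render_chatml` is a made-up function, and the real template stored in the tokenizer may differ in details (for example, default system messages). Always use `apply_chat_template(...)` for actual prompts.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate a ChatML-style prompt string (illustration, not the real template)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in start/end markers and labelled with its role.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open the assistant turn but leave it unfinished for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


demo_messages = [
    {"role": "system", "content": "You are a helpful tutor for beginners."},
    {"role": "user", "content": "What is an LLM?"},
]

with_prompt = render_chatml(demo_messages, add_generation_prompt=True)
without_prompt = render_chatml(demo_messages, add_generation_prompt=False)

print(with_prompt)
# Show exactly what add_generation_prompt=True appended.
print(repr(with_prompt[len(without_prompt):]))
```

Comparing the two strings shows that the only difference is the opened assistant turn at the end.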

## Cell 7 — Set generation parameters

These control how the model writes:
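- `temperature`: randomness of token sampling. Higher values give more varied output; `0` makes the model always pick the most likely token (greedy decoding).
- `max_tokens`: maximum number of tokens to generate (the prompt's tokens are not counted).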

In [None]:
# Cell 7: Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=200,
)

sampling_params
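If you are curious why temperature changes the output, the sketch below shows the textbook idea: token scores are divided by the temperature before being turned into probabilities. The scores here are invented, and `softmax_with_temperature` is a hypothetical helper; vLLM's real sampling pipeline does more than this.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        # Temperature 0 is handled separately in practice (greedy decoding).
        raise ValueError("temperature must be greater than 0 for this sketch")
    scaled = [score / temperature for score in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(score - highest) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up scores for three candidate next tokens.
logits = [2.0, 1.0, 0.0]

for temperature in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it across the alternatives.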

## Cell 8 — Generate output with vLLM

We call `llm.generate(...)` with a list of prompts.
Even for one prompt, we pass a list (this supports batching).

In [None]:
# Cell 8: Generate response
results = llm.generate([prompt_text], sampling_params)

type(results), len(results)

## Cell 9 — Extract and print the assistant text

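`llm.generate(...)` returns a list with one `RequestOutput` per prompt, and each `RequestOutput` holds its completions in `.outputs`. For one prompt + one completion:
- `results[0]` = the first prompt's `RequestOutput`
- `results[0].outputs[0].text` = the generated assistant text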

In [None]:
# Cell 9: Extract generated text
assistant_text = results[0].outputs[0].text

print("Assistant output:")
print(assistant_text)
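Once you send several prompts, or ask for more than one completion per prompt (for example with `n` in `SamplingParams`), indexing with `[0]` everywhere gets awkward. The sketch below collects every completion's text in one go. `extract_texts` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the `.outputs[i].text` shape so the example runs without a model.

```python
from types import SimpleNamespace


def extract_texts(results):
    """Return one list of completion strings per prompt."""
    return [[completion.text for completion in result.outputs] for result in results]


# Stand-in objects (not real vLLM outputs) with the same nested shape.
fake_results = [
    SimpleNamespace(outputs=[SimpleNamespace(text="First answer")]),
    SimpleNamespace(
        outputs=[
            SimpleNamespace(text="Second answer A"),
            SimpleNamespace(text="Second answer B"),
        ]
    ),
]

for prompt_index, texts in enumerate(extract_texts(fake_results)):
    for completion_index, text in enumerate(texts):
        print(f"prompt {prompt_index}, completion {completion_index}: {text}")
```

With real results from `llm.generate(...)`, you would call `extract_texts(results)` in the same way.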

## Cell 10 — (Optional) Inspect output metadata

This is useful if you want more than just text (for logging, debugging, evaluation).

In [None]:
# Cell 10: Inspect useful metadata
first_result = results[0]
first_completion = first_result.outputs[0]

print(f"Prompt length (tokens): {len(first_result.prompt_token_ids)}")
print(f"Generated length (tokens): {len(first_completion.token_ids)}")
print(f"Finish reason: {first_completion.finish_reason}")
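The finish reason is worth a quick look because it tells you whether an answer was cut short. The sketch below maps the values I understand vLLM to report (`"stop"` and `"length"`) to plain-English descriptions; `describe_finish_reason` is a made-up helper, and other values may appear depending on your vLLM version.

```python
def describe_finish_reason(finish_reason):
    """Translate a finish reason string into a beginner-friendly description."""
    if finish_reason == "length":
        return "truncated: the output reached the max_tokens limit"
    if finish_reason == "stop":
        return "complete: the model ended its answer on its own"
    if finish_reason is None:
        return "unfinished: generation had not completed"
    return f"other: {finish_reason}"


# Try the helper on a few example values.
for reason in ("stop", "length", None):
    print(f"{reason!r} -> {describe_finish_reason(reason)}")
```

If you see the truncated case often, raising `max_tokens` is usually the first thing to try.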

## Cell 11 — Next steps (ideas)

Try changing one thing at a time:
1. Replace the user question.
2. Change the system message style.
3. Lower `temperature` to `0.2` for more consistent, less varied answers.
4. Increase `max_tokens` for longer outputs.

You're now using vLLM with chat templates end-to-end ✅