##### Master Degree in Computer Science and Data Science for Economics

# vLLM

### Sergio Picascia

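[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving. This notebook uses it to load the AWQ-quantized Qwen2.5-VL-7B-Instruct vision-language model and generate a description of a single image.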

In [None]:
%pip install vllm qwen-vl-utils

In [7]:
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor
import torch

In [None]:
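# Prefer the first CUDA GPU, then Apple's MPS backend, and fall back to the CPU.
DEVICE = (
    "cuda:0"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)
# AWQ-quantized Qwen2.5-VL 7B instruct checkpoint on the Hugging Face Hub.
MODEL_PATH = "Qwen/Qwen2.5-VL-7B-Instruct-AWQ"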

In [9]:
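model = LLM(
    model=MODEL_PATH,
    enforce_eager=False,  # do not force eager mode, so CUDA graphs can be used
    max_model_len=8192,  # maximum context length (prompt plus generated tokens)
    device=DEVICE,
    gpu_memory_utilization=0.3,  # fraction of GPU memory this engine may use
    limit_mm_per_prompt={"video": 0, "image": 1},  # one image, no video per prompt
    max_num_batched_tokens=2048,  # token budget per scheduler step
    max_num_seqs=64,  # maximum number of sequences scheduled per step
    enable_prefix_caching=True,  # reuse KV cache across shared prompt prefixes
    enable_chunked_prefill=True,  # split long prefills into smaller chunks
)
processor = AutoProcessor.from_pretrained(MODEL_PATH, use_fast=True)
sampling_params = SamplingParams(
    temperature=0,  # greedy decoding: always pick the most likely token
    max_tokens=4096,  # upper bound on generated tokens per request
)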

INFO 04-22 07:16:43 [config.py:689] This model supports multiple tasks: {'score', 'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 04-22 07:16:43 [awq_marlin.py:114] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 04-22 07:16:43 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-22 07:16:44 [core.py:61] Initializing a V1 LLM engine (v0.8.4) with config: model='Qwen/Qwen2.5-VL-7B-Instruct-AWQ', speculative_config=None, tokenizer='Qwen/Qwen2.5-VL-7B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda:1, decoding_config=DecodingConfig(guided_decod

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


INFO 04-22 07:16:48 [gpu_model_runner.py:1276] Starting to load model Qwen/Qwen2.5-VL-7B-Instruct-AWQ...
INFO 04-22 07:16:48 [config.py:3466] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
INFO 04-22 07:16:48 [weight_utils.py:265] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:02<00:00,  1.32s/it]



INFO 04-22 07:16:52 [loader.py:458] Loading weights took 2.73 seconds
INFO 04-22 07:16:52 [gpu_model_runner.py:1291] Model loading took 6.5947 GiB and 3.862872 seconds
INFO 04-22 07:16:53 [gpu_model_runner.py:1560] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
INFO 04-22 07:17:03 [backends.py:416] Using cache directory: /home/serusr01/.cache/vllm/torch_compile_cache/b89ccceadd/rank_0_0 for vLLM's torch.compile
INFO 04-22 07:17:03 [backends.py:426] Dynamo bytecode transform time: 7.61 s
INFO 04-22 07:17:07 [backends.py:132] Cache the graph of shape None for later use
INFO 04-22 07:17:34 [backends.py:144] Compiling a graph for general shape takes 30.01 s
INFO 04-22 07:17:50 [monitor.py:33] torch.compile takes 37.62 s in total
INFO 04-22 07:17:51 [kv_cache_utils.py:634] GPU KV cache size: 329,312 tokens
INFO 04-22 07:17:51 [kv_cache_utils.py:637] Maximum concurrency for 8,192 tokens per request: 40.20x
INFO 04
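The KV cache size and the maximum concurrency reported in the log above are related by a simple division. The check below uses only the two numbers printed in the log; the variable names are illustrative, and it assumes the concurrency figure is the cache capacity divided by `max_model_len`.

```python
# Values copied from the engine log above.
kv_cache_tokens = 329_312  # "GPU KV cache size: 329,312 tokens"
max_model_len = 8_192      # max_model_len passed to LLM(...)

# Assumption: concurrency = how many full-length requests fit in the KV cache.
max_concurrency = kv_cache_tokens / max_model_len

print(f"{max_concurrency:.2f}x")  # 40.20x
```

Requests shorter than 8,192 tokens occupy less of the cache, so more of them can be resident at once, subject to the `max_num_seqs=64` limit set when the engine was created.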

In [10]:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://redpandanetwork.org/get/files/image/galleries/28138502587_a0a020ae9a_k.jpeg",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "Describe the image."},
        ],
    },
]
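The `min_pixels` and `max_pixels` entries bound the image area after resizing, and therefore the number of visual tokens the image contributes to the prompt. The sketch below converts the two bounds into token counts. It assumes that one visual token covers a 28 × 28 pixel block (14-pixel patches merged 2 × 2), which is the usual Qwen2.5-VL configuration but is not stated in this notebook.

```python
# Assumption: one visual token covers a 28 x 28 pixel block
# (14-pixel ViT patches merged 2 x 2), as in the usual Qwen2.5-VL setup.
PIXELS_PER_TOKEN = 28 * 28

min_pixels = 224 * 224       # lower bound from the message above
max_pixels = 1280 * 28 * 28  # upper bound from the message above

min_visual_tokens = min_pixels // PIXELS_PER_TOKEN
max_visual_tokens = max_pixels // PIXELS_PER_TOKEN

print(min_visual_tokens, max_visual_tokens)  # 64 1280
```

Under that assumption the image costs between 64 and 1,280 tokens, which leaves most of the 8,192-token context for text and for the generated answer.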

In [11]:
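# Render the chat messages into a single prompt string (no tokenization yet).
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Fetch and resize the image referenced in the messages; video inputs are unused.
image_inputs, _ = process_vision_info(messages)
# vLLM takes the prompt text together with the raw multimodal data.
llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": {"image": image_inputs},
}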
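With `tokenize=False`, `apply_chat_template` returns a plain string rather than token IDs. As a rough illustration of what that string looks like, the sketch below renders a conversation by hand. It assumes the ChatML-style layout and vision placeholder tokens used by the Qwen2-VL family; `render_chatml` is a hypothetical helper, and the real template ships with the model's processor and may differ in detail.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Hypothetical, simplified stand-in for the model's chat template."""
    parts = []
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            body = content
        else:
            # Images become placeholder tokens; text items are copied verbatim.
            body = "".join(
                "<|vision_start|><|image_pad|><|vision_end|>"
                if item["type"] == "image"
                else item["text"]
                for item in content
            )
        parts.append(f"<|im_start|>{message['role']}\n{body}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Same shape as the notebook's messages; the image path is a placeholder.
example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "placeholder.jpg"},
            {"type": "text", "text": "Describe the image."},
        ],
    },
]
print(render_chatml(example))
```

The image itself never appears in the string. It travels separately under `multi_modal_data`, and the placeholder marks where its visual tokens belong in the sequence.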

In [12]:
output = model.generate([llm_inputs], sampling_params=sampling_params)

Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.14s/it, est. speed input: 1117.17 toks/s, output: 99.24 toks/s]
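The progress bar reports rates rather than token counts, but approximate counts can be recovered by multiplying each rate by the elapsed time. The sketch below does that with the figures printed above. It assumes each rate is total tokens divided by elapsed time, and because the elapsed time is rounded in the log, the results are estimates only.

```python
# Figures copied from the progress bar above.
elapsed_s = 1.14      # seconds for the single request (rounded in the log)
input_rate = 1117.17  # prompt tokens per second
output_rate = 99.24   # generated tokens per second

# Assumption: each rate is total tokens divided by elapsed time.
approx_prompt_tokens = input_rate * elapsed_s
approx_output_tokens = output_rate * elapsed_s

print(round(approx_prompt_tokens), round(approx_output_tokens))  # 1274 113
```

A prompt of roughly 1,270 tokens for two short sentences of text suggests that the image placeholders account for most of the input.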


In [14]:
output[0].outputs[0].text

"The image shows a red panda, also known as a lesser panda or a red cat-bear, perched on a tree branch. The red panda has a distinctive coat with a mix of black, white, and reddish-brown fur. Its face is predominantly white with a black nose and a white muzzle. The red panda's ears are upright and rounded, and its eyes are dark and expressive. The background is a blurred green, suggesting a natural, forested environment. The red panda appears to be in a relaxed state, possibly observing its surroundings."