# Extracting Embeddings with vLLM

The vLLM source code is available on GitHub at https://github.com/vllm-project/vllm, and the official website is https://vllm.ai/.

When extracting embeddings from large language models, the vLLM library offers several distinct advantages over the traditional HuggingFace approach. Here is a detailed comparison:

1. **Significantly Faster Inference Speed**
   vLLM is optimized for high-throughput inference. It utilizes advanced scheduling, continuous batching, and more efficient memory management to maximize GPU utilization, resulting in much faster embedding extraction, especially when handling large batches or many texts.

2. **Better GPU Memory Efficiency**
   vLLM uses PagedAttention and memory-paging techniques to minimize redundant memory allocation, allowing it to serve more requests and larger batches without running into out-of-memory issues as quickly as HuggingFace’s default pipelines. This enables larger models (such as 10B+ parameters) to run more smoothly on available hardware.

3. **Superior Multi-GPU Support**
   With built-in support for tensor parallelism and highly efficient multi-GPU scheduling, vLLM can automatically split models and workloads across multiple GPUs, minimizing bottlenecks and manual configuration. Multi-GPU inference with HuggingFace requires more manual setup and often scales less efficiently for high-throughput embedding extraction.

4. **Higher Throughput with Long Contexts**
   vLLM is specifically designed to handle long input sequences efficiently, making it far better suited when extracting embeddings from long documents or paragraphs, whereas HuggingFace methods can become memory-bound or slow in these scenarios.

5. **Production-Ready and Easy to Deploy**
   vLLM provides API and HTTP server support for production inference out of the box, allowing for quick deployment at scale. Integrating HuggingFace models into production serving pipelines typically requires additional engineering effort to achieve similar throughput and stability.


In [1]:
import random
import numpy as np
import torch
import os


def set_seed(seed):
   random.seed(seed)
   np.random.seed(seed)
   os.environ['PYTHONHASHSEED'] = str(seed)
   torch.manual_seed(seed)
   torch.cuda.manual_seed(seed)
   torch.cuda.manual_seed_all(seed)
   torch.backends.cudnn.deterministic = True
   torch.backends.cudnn.benchmark = False


set_seed(42)

In [2]:
import torch
import subprocess
import vllm

def get_gpu_info():
    try:
        cmd = "nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader"
        result = subprocess.check_output(cmd, shell=True, encoding='utf-8').strip().split('\n')
        print("GPU info:")
        for idx, line in enumerate(result):
            name, total_mem, free_mem = [s.strip() for s in line.split(',')]
            print(f"  GPU {idx}: {name}, Total Memory: {total_mem}, Free Memory: {free_mem}")
        print(f"Detected {len(result)} GPU(s)")
    except Exception as e:
        print("Failed to get GPU info:", e)

def print_env_info():
    print(f"PyTorch version: {torch.__version__}")
    print(f"vllm version: {vllm.__version__}")
    cuda_available = torch.cuda.is_available()
    print(f"CUDA available: {cuda_available}")
    if cuda_available:
        print(f"CUDA version: {torch.version.cuda}")
        print(f"cuDNN version: {torch.backends.cudnn.version()}")
print('Testing environment:')
print_env_info()
print()
get_gpu_info()

Testing environment:
PyTorch version: 2.7.1+cu128
vllm version: 0.10.1.dev1+gbcc0a3cbe
CUDA available: True
CUDA version: 12.8
cuDNN version: 90701

GPU info:
  GPU 0: NVIDIA A40, Total Memory: 46068 MiB, Free Memory: 45403 MiB
  GPU 1: NVIDIA A40, Total Memory: 46068 MiB, Free Memory: 45403 MiB
Detected 2 GPU(s)
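
The per-GPU lines printed above can also be parsed into numbers, which is convenient when choosing `tensor_parallel_size` or `gpu_memory_utilization`. The following is a minimal sketch that assumes the `name, memory.total, memory.free` CSV layout queried above; `parse_gpu_line` is a hypothetical helper, not part of vLLM or PyTorch.

```python
def parse_gpu_line(line):
    """Parse one 'name, total MiB, free MiB' line from the nvidia-smi query above."""
    name, total, free = [field.strip() for field in line.split(",")]
    # Memory fields look like '46068 MiB'; keep only the integer part.
    return {
        "name": name,
        "total_mib": int(total.split()[0]),
        "free_mib": int(free.split()[0]),
    }


# Sample lines copied from the output above.
sample = [
    "NVIDIA A40, 46068 MiB, 45403 MiB",
    "NVIDIA A40, 46068 MiB, 45403 MiB",
]
gpus = [parse_gpu_line(line) for line in sample]
total_free_gib = sum(g["free_mib"] for g in gpus) / 1024
print(f"{len(gpus)} GPUs, {total_free_gib:.1f} GiB free in total")
```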


# Traditional embedding extraction

In [3]:

from transformers import AutoTokenizer, AutoModel
import torch
import time

model_path = "/path/to/Genos-10B"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# With two or more GPUs, let device_map="auto" shard the model across them.
model = AutoModel.from_pretrained(
    model_path,
    device_map="auto" if torch.cuda.device_count() >= 2 else None
)

text = "ATCG"
inputs = tokenizer(text, return_tensors="pt")
# Move the inputs to the device that holds the model's first parameters.
inputs = {k: v.to(model.device) for k, v in inputs.items()}

start_time = time.time()
with torch.no_grad():
    outputs = model(**inputs)
    last_hidden_state = outputs.last_hidden_state
    print(f"Last hidden state shape: {last_hidden_state.shape}")

    # MEAN pooling
    if "attention_mask" in inputs:
        mask = inputs["attention_mask"].unsqueeze(-1)  # [batch, seq, 1]
        masked_hidden = last_hidden_state * mask
        sum_hidden = masked_hidden.sum(dim=1)
        lengths = mask.sum(dim=1)  # [batch, 1]
        mean_pooled = sum_hidden / lengths
    else:
        # Without an attention mask, average over all tokens.
        mean_pooled = last_hidden_state.mean(dim=1)

    print(f"Mean pooled embedding shape: {mean_pooled.shape}")
    print(f"Mean pooled embedding: {mean_pooled.cpu().numpy()}")

end_time = time.time()
elapsed = end_time - start_time
print(f"Embedding extraction time: {elapsed:.4f} seconds")


Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]

Last hidden state shape: torch.Size([1, 4, 4096])
Mean pooled embedding shape: torch.Size([1, 4096])
Mean pooled embedding: [[-0.7643822  -0.12757319  0.5734206  ... -0.12371235 -0.00170616
  -0.33227795]]
Embedding extraction time: 0.7419 seconds
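
The pooling step above divides the masked sum of token vectors by the number of unmasked tokens. For a single unpadded sequence such as `"ATCG"` the mask is all ones, but in a padded batch it keeps padding positions out of the average. Below is a minimal pure-Python sketch of the same computation on toy values; the helper name `masked_mean_pool` and the numbers are illustrative only.

```python
def masked_mean_pool(hidden, mask):
    """Average the token vectors whose mask entry is 1.

    hidden: list of token vectors, shape [seq][dim]
    mask:   list of 0/1 flags, shape [seq]
    """
    count = sum(mask)
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    dim = len(hidden[0])
    pooled = [0.0] * dim
    for vec, keep in zip(hidden, mask):
        if keep:
            for j in range(dim):
                pooled[j] += vec[j]
    return [value / count for value in pooled]


# Toy sequence: two real tokens followed by one padding token.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

print(masked_mean_pool(hidden, mask))       # padding position is ignored
print(masked_mean_pool(hidden, [1, 1, 1]))  # padding position skews the mean
```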


# vLLM embedding extraction

In [2]:
import torch
from vllm import LLM
from vllm.config import PoolerConfig


model_path = "/path/to/Genos-10B"
seq_length = 128 * 1024 
gpu_num = 2

llm = LLM(
    model=model_path,
    trust_remote_code=True,
    tensor_parallel_size=gpu_num, 
    block_size=32,
    enable_prefix_caching=True,  # vLLM turns this off for non-"last" pooling (see the log output)
    enforce_eager=True,
    gpu_memory_utilization=0.85,  # Fraction of each GPU's memory that vLLM may use
    dtype=torch.bfloat16,
    max_model_len=seq_length,
    max_num_batched_tokens=seq_length,
    # Mean-pool the token hidden states without normalization.
    # Omit this override to obtain the complete per-token hidden states instead.
    override_pooler_config=PoolerConfig(pooling_type="MEAN", normalize=False),
    task='reward',
    enable_chunked_prefill=False
)

INFO 12-17 01:29:32 [__init__.py:235] Automatically detected platform cuda.
INFO 12-17 01:29:44 [config.py:3440] Downcasting torch.float32 to torch.bfloat16.
INFO 12-17 01:29:44 [config.py:1604] Using max model len 131072
INFO 12-17 01:29:48 [config.py:4628] Only "last" pooling supports chunked prefill and prefix caching; disabling both.
INFO 12-17 01:29:48 [core.py:572] Waiting for init message from front-end.
INFO 12-17 01:29:48 [core.py:71] Initializing a V1 LLM engine (v0.10.1.dev1+gbcc0a3cbe) with config: model='/path/to/Genos-10B', speculative_config=None, tokenizer='/path/to/Genos-10B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decodi

Loading safetensors checkpoint shards:   0% Completed | 0/9 [00:00<?, ?it/s]


(VllmWorker rank=1 pid=6387) INFO 12-17 01:30:07 [default_loader.py:262] Loading weights took 13.83 seconds
(VllmWorker rank=0 pid=6386) INFO 12-17 01:30:07 [default_loader.py:262] Loading weights took 13.87 seconds
(VllmWorker rank=1 pid=6387) INFO 12-17 01:30:08 [gpu_model_runner.py:1892] Model loading took 10.0644 GiB and 14.017120 seconds
(VllmWorker rank=0 pid=6386) INFO 12-17 01:30:08 [gpu_model_runner.py:1892] Model loading took 10.0644 GiB and 14.020643 seconds
(VllmWorker rank=1 pid=6387) INFO 12-17 01:30:22 [gpu_worker.py:255] Available KV cache memory: 22.02 GiB
(VllmWorker rank=0 pid=6386) INFO 12-17 01:30:22 [gpu_worker.py:255] Available KV cache memory: 22.02 GiB
INFO 12-17 01:30:23 [kv_cache_utils.py:833] GPU KV cache size: 481,120 tokens
INFO 12-17 01:30:23 [kv_cache_utils.py:837] Maximum concurrency for 131,072 tokens per request: 3.67x
INFO 12-17 01:30:23 [kv_cache_utils.py:833] GPU KV cache
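
The "maximum concurrency" figure in the log follows directly from the KV cache size and the configured `max_model_len`. The short calculation below reproduces it from the logged values. The block count is an inference from `block_size=32` in the configuration above, not a number that vLLM reports in this log.

```python
kv_cache_tokens = 481_120   # "GPU KV cache size" from the log
max_model_len = 128 * 1024  # seq_length passed to LLM(...), 131,072 tokens
block_size = 32             # block_size passed to LLM(...)

# How many maximum-length requests fit in the KV cache at once.
concurrency = kv_cache_tokens / max_model_len
print(f"max concurrency: {concurrency:.2f}x")

# Assumed interpretation: the cache is split into fixed-size blocks of tokens.
blocks = kv_cache_tokens // block_size
print(f"KV cache blocks of {block_size} tokens: {blocks}")
```

With these values the first line prints 3.67x, matching the log: roughly three full 128K-token requests fit in the cache at the same time.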

In [None]:
import time
tokenizer = llm.get_tokenizer()
seqs = ['ATCG']

token_ids = tokenizer(seqs, add_special_tokens=False)["input_ids"]

start_time = time.time()
outputs = llm.encode(prompt_token_ids=token_ids)
# Collect the pooled embedding tensor of each request.
pooleds = [output.outputs.data for output in outputs]
end_time = time.time()
elapsed = end_time - start_time
print(f"Embedding extraction time: {elapsed:.4f} seconds")
print(pooleds[0])

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Embedding extraction time: 0.1067 seconds


In [None]:
vllm_mean_pool = pooleds[0]
# Compute L1 distance, L2 distance, and Pearson correlation coefficient between vllm_mean_pool and mean_pooled

l1_distance = torch.norm(vllm_mean_pool.cpu() - mean_pooled.cpu(), p=1).item()
l2_distance = torch.norm(vllm_mean_pool.cpu() - mean_pooled.cpu(), p=2).item()

# Flatten tensors for correlation computation
vllm_mean_flat = vllm_mean_pool.view(-1).cpu().numpy()
mean_pooled_flat = mean_pooled.view(-1).cpu().numpy()

if vllm_mean_flat.std() == 0 or mean_pooled_flat.std() == 0:
    pearson_corr = float('nan')
else:
    pearson_corr = np.corrcoef(vllm_mean_flat, mean_pooled_flat)[0, 1]

print(f"L1 Distance between vllm_mean_pool and mean_pooled: {l1_distance:.6f}")
print(f"L2 Distance between vllm_mean_pool and mean_pooled: {l2_distance:.6f}")
print(f"Pearson Correlation between vllm_mean_pool and mean_pooled: {pearson_corr:.6f}")



L1 Distance between vllm_mean_pool and mean_pooled: 7.521327
L2 Distance between vllm_mean_pool and mean_pooled: 0.195607
Pearson Correlation between vllm_mean_pool and mean_pooled: 0.999995
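
Because the L1 and L2 distances are accumulated over all embedding dimensions, dividing by the dimension count makes them easier to interpret. The worked calculation below uses the values printed above and takes the dimension (4096) from the `[1, 4096]` pooled shape reported earlier.

```python
import math

dim = 4096               # embedding size, from the pooled shape [1, 4096]
l1_distance = 7.521327   # values printed above
l2_distance = 0.195607

mean_abs_diff = l1_distance / dim        # average |difference| per dimension
rms_diff = l2_distance / math.sqrt(dim)  # root-mean-square difference per dimension

print(f"mean abs diff per dimension: {mean_abs_diff:.6f}")
print(f"RMS diff per dimension:      {rms_diff:.6f}")
```

Both per-dimension figures are on the order of a few thousandths. A gap of this size is plausibly explained by vLLM running in bfloat16 while the HuggingFace model was loaded without an explicit dtype, although that attribution is an assumption rather than something measured here.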


In this run, vLLM extracted the embedding in 0.1067 seconds versus 0.7419 seconds for the HuggingFace approach, roughly a **7x speedup**. The two embeddings also match closely (Pearson correlation 0.999995), so the speedup does not come at the cost of a meaningfully different result.
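
The timings above come from a single call each, so they can include one-time costs such as warm-up. For a steadier comparison, a small timing helper with warm-up runs and repeated measurements could be used. The sketch below is one possible approach; `time_call` is a hypothetical helper, and the workload passed to it is a stand-in for the real extraction call.

```python
import time


def time_call(fn, warmup=1, repeats=5):
    """Run fn() `warmup` times untimed, then return the middle of `repeats` timed runs.

    For an odd `repeats` the returned value is the median.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]


# Speedup implied by the single-run timings reported above.
hf_seconds, vllm_seconds = 0.7419, 0.1067
print(f"speedup: {hf_seconds / vllm_seconds:.2f}x")

# Example usage with a stand-in workload; replace the lambda with the real call.
print(f"median: {time_call(lambda: sum(range(100_000))):.6f} s")
```

When timing GPU code this way, the timed function should wait for the result to be ready, for example by moving it to the CPU as the cells above do.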


# Traditional extraction runs out of memory on ultra-long sequences

In [None]:
text = "ATCG" * 32 * 1024  # 131,072 bases (128K)
inputs = tokenizer(text, return_tensors="pt")
# Move the inputs to the device that holds the model's first parameters.
inputs = {k: v.to(model.device) for k, v in inputs.items()}

start_time = time.time()
with torch.no_grad():
    outputs = model(**inputs)
    last_hidden_state = outputs.last_hidden_state
    print(f"Last hidden state shape: {last_hidden_state.shape}")

    # MEAN pooling
    if "attention_mask" in inputs:
        mask = inputs["attention_mask"].unsqueeze(-1)  # [batch, seq, 1]
        masked_hidden = last_hidden_state * mask
        sum_hidden = masked_hidden.sum(dim=1)
        lengths = mask.sum(dim=1)  # [batch, 1]
        mean_pooled = sum_hidden / lengths
    else:
        # Without an attention mask, average over all tokens.
        mean_pooled = last_hidden_state.mean(dim=1)

    print(f"Mean pooled embedding shape: {mean_pooled.shape}")
    print(f"Mean pooled embedding: {mean_pooled.cpu().numpy()}")

end_time = time.time()
elapsed = end_time - start_time
print(f"Embedding extraction time: {elapsed:.4f} seconds")

Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 GiB. GPU 0 has a total capacity of 44.34 GiB of which 10.62 GiB is free. Process 4005586 has 33.71 GiB memory in use. Of the allocated memory 33.39 GiB is allocated by PyTorch, and 12.94 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

# vLLM extracts ultra-long sequences

In [3]:
import time
tokenizer = llm.get_tokenizer()
seqs = ['ATCG' * 32 * 1024]

token_ids = tokenizer(seqs, add_special_tokens=False)["input_ids"]

start_time = time.time()
outputs = llm.encode(prompt_token_ids=token_ids)
# Collect the pooled embedding tensor of each request.
pooleds = [output.outputs.data for output in outputs]
end_time = time.time()
elapsed = end_time - start_time
print(f"Embedding extraction time: {elapsed:.4f} seconds")
print(pooleds[0])

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Embedding extraction time: 21.7921 seconds
tensor([-0.0255,  0.1597, -0.1451,  ..., -0.0175,  0.0131, -0.1934])
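
The reported time can be converted into an approximate throughput. The calculation below assumes one token per base, which is consistent with the `[1, 4, 4096]` hidden-state shape obtained for `"ATCG"` earlier. It covers the whole `llm.encode` call, so it is an end-to-end figure rather than a pure model throughput.

```python
num_tokens = 4 * 32 * 1024  # "ATCG" * 32 * 1024, assuming one token per base
elapsed_seconds = 21.7921   # "Embedding extraction time" reported above

tokens_per_second = num_tokens / elapsed_seconds
print(f"{num_tokens} tokens in {elapsed_seconds} s")
print(f"about {tokens_per_second:,.0f} tokens/s")
```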


# Summary

Based on the experiments above, we can draw the following conclusions:

1. vLLM extracts embeddings significantly faster than the traditional HuggingFace approach: about 7 times faster for the short test sequence (0.1067 seconds versus 0.7419 seconds).

2. The embeddings produced by vLLM are highly consistent with those generated by the conventional method (Pearson correlation 0.999995), indicating that vLLM can be used to improve efficiency without materially changing the results.

3. vLLM can extract embeddings from much longer sequences on the same hardware: the 128K input that caused an out-of-memory error with HuggingFace was processed by vLLM in about 22 seconds, making it well suited to downstream research tasks that require ultra-long inputs.

4. The demo above can be used as a template for extracting embeddings with vLLM.
