# vLLM  

There are two modes of using vLLM: local and remote. Let's start with the former, which requires a CUDA environment available locally.

### Install vLLM

`pip install vllm` <br>
or, if you prefer to build it yourself, you can [compile from source](https://docs.vllm.ai/en/latest/getting_started/installation.html)

### Orca-7b Completion Example


In [18]:
%pip install llama-index-llms-vllm

Note: you may need to restart the kernel to use updated packages.





In [None]:
import os
os.environ["HF_HOME"] = "model/"

In [None]:
from llama_index.llms.vllm import Vllm, VllmServer

In [60]:
llm = Vllm(
    model="microsoft/Orca-2-7b",
    tensor_parallel_size=4,
    max_new_tokens=100,
    vllm_kwargs={"swap_space": 1, "gpu_memory_utilization": 0.5},
)

Vllm mock initialized
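As a rough sanity check on the settings above, the sketch below estimates how much weight memory each GPU holds when a model is sharded with `tensor_parallel_size`, and how much memory `gpu_memory_utilization` leaves vLLM to work with. The parameter count (about 7 billion), 2 bytes per parameter (float16), and the 24 GiB card are illustrative assumptions, and the estimate ignores the KV cache and activations.

```python
def weights_per_gpu_gib(n_params: float, bytes_per_param: int, tensor_parallel_size: int) -> float:
    """Approximate weight memory per GPU in GiB, assuming the weights are sharded evenly."""
    total_bytes = n_params * bytes_per_param
    return total_bytes / tensor_parallel_size / 2**30


def vllm_budget_gib(gpu_total_gib: float, gpu_memory_utilization: float) -> float:
    """Memory vLLM may claim on one GPU, given the utilization fraction."""
    return gpu_total_gib * gpu_memory_utilization


# Assumed values: ~7e9 parameters, float16 (2 bytes), 4 GPUs, hypothetical 24 GiB cards.
per_gpu = weights_per_gpu_gib(7e9, 2, 4)
budget = vllm_budget_gib(24.0, 0.5)
print(f"weights per GPU: {per_gpu:.2f} GiB, vLLM budget per GPU: {budget:.1f} GiB")
```

Whatever remains of the budget after the weights is what is available for the KV cache and other runtime buffers. Note that `tensor_parallel_size=4` only works on a machine with at least four GPUs.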


In [61]:
llm.complete(
    "[INST]You are a helpful assistant[/INST] What is a black hole ?"
)

Skipped llm.complete() call: [INST]You are a helpful assistant[/INST] What is a black hole ?


'[INST]You are a helpful assistant[/INST] What is a black hole ?'

### CodeLlama-7b Completion Example


In [62]:
llm = Vllm(
    model="codellama/CodeLlama-7b-hf",
    dtype="float16",
    tensor_parallel_size=4,
    temperature=0,
    max_new_tokens=100,
    vllm_kwargs={
        "swap_space": 1,
        "gpu_memory_utilization": 0.5,
        "max_model_len": 4096,
    },
)

Vllm mock initialized


In [63]:
llm.complete("import socket\n\ndef ping_exponential_backoff(host: str):")

Skipped llm.complete() call: import socket

def ping_exponential_backoff(host: str):


'import socket\n\ndef ping_exponential_backoff(host: str):'

### Mistral chat 7b Completion Example


In [64]:
llm = Vllm(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    dtype="float16",
    tensor_parallel_size=4,
    temperature=0,
    max_new_tokens=100,
    vllm_kwargs={
        "swap_space": 1,
        "gpu_memory_utilization": 0.5,
        "max_model_len": 4096,
    },
)

Vllm mock initialized


In [65]:
llm.complete(" What is a black hole ?")

Skipped llm.complete() call:  What is a black hole ?


' What is a black hole ?'
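Instruction-tuned Mistral models are usually prompted with the user turn wrapped in `[INST] ... [/INST]` tags, while the call above sends the raw question. The helper below is a minimal sketch of that wrapping; `to_mistral_instruct` is a hypothetical name, and it assumes the tokenizer adds the beginning-of-sequence token on its own.

```python
def to_mistral_instruct(user_message: str) -> str:
    """Wrap a single user turn in Mistral-style instruction tags.

    Assumption: the BOS token is added by the tokenizer, so it is not included here.
    """
    return f"[INST] {user_message.strip()} [/INST]"


prompt = to_mistral_instruct(" What is a black hole ?")
print(prompt)
```

The resulting string could then be passed to `llm.complete(prompt)` in place of the bare question.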

# Calling vLLM via HTTP

In this mode there is no need to install the `vllm` package or to have CUDA available locally. To set up the vLLM API server, you can follow the guide [here](https://docs.vllm.ai/en/latest/serving/distributed_serving.html). 
Note: the `llama-index-llms-vllm` module is a client for `vllm.entrypoints.api_server`, which is only [a demo](https://github.com/vllm-project/vllm/blob/abfc4f3387c436d46d6701e9ba916de8f9ed9329/vllm/entrypoints/api_server.py#L2). <br>
If the vLLM server is launched with `vllm.entrypoints.openai.api_server` as an [OpenAI Compatible Server](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server) or via [Docker](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html), you need the `OpenAILike` class from the `llama-index-llms-openai-like` [module](localai.ipynb#llamaindex-interaction) instead.
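To make the client/server split concrete, here is a minimal standard-library sketch of the kind of request the demo server's `/generate` endpoint accepts: a JSON body with a `prompt`, a `stream` flag, and sampling parameters. The helper name `build_generate_request`, the URL (a server assumed to listen locally on port 8000), and the chosen parameters are illustrative assumptions; the exact fields `VllmServer` sends may differ.

```python
import json
import urllib.request


def build_generate_request(api_url: str, prompt: str, *, max_tokens: int = 100,
                           temperature: float = 0.0, stream: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST request for the demo /generate endpoint."""
    payload = {
        "prompt": prompt,
        "stream": stream,
        # Remaining keys are treated as sampling parameters by the demo server.
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_request("http://localhost:8000/generate", "what is a black hole ?")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# Sending it requires a running server, so it is left commented out:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["text"][0])
```

Building the request separately from sending it keeps the payload easy to inspect before a server is available.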

### Completion Response 

In [66]:
from llama_index.core.llms import ChatMessage

In [67]:
llm = VllmServer(
    api_url="http://localhost:8000/generate", max_new_tokens=100, temperature=0
)

VllmServer mock initialized


In [68]:
llm.complete("what is a black hole ?")

Skipped llm.complete() call: what is a black hole ?


'what is a black hole ?'

In [69]:
message = [ChatMessage(content="hello", role="user")]
llm.chat(message)

Skipped llm.chat call: [ChatMessage(role=<MessageRole.USER: 'user'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='hello')])]


[ChatMessage(role=<MessageRole.USER: 'user'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='hello')])]

### Streaming Response

In [70]:
list(llm.stream_complete("what is a black hole"))[-1]

Skipped llm.stream_complete call: what is a black hole


'what is a black hole'
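Taking the last element of the stream works because, as far as the demo server is concerned, each streamed chunk carries the text generated so far rather than only the newest tokens. The sketch below parses such a stream from raw bytes with the standard library. The function name `iter_stream_texts` is hypothetical, the byte chunks are simulated rather than captured from a server, and the delimiter is a parameter because the framing byte is an assumption that may differ between vLLM versions.

```python
import json
from typing import Iterable, Iterator


def iter_stream_texts(byte_chunks: Iterable[bytes], delimiter: bytes = b"\0") -> Iterator[str]:
    """Yield the cumulative text from each delimiter-terminated JSON message."""
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        # A network chunk may hold several messages, or only part of one.
        while delimiter in buffer:
            raw, buffer = buffer.split(delimiter, 1)
            if raw:
                yield json.loads(raw)["text"][0]


# Simulated network chunks; the second message is split across two chunks.
simulated = [
    b'{"text": ["what is a black hole A"]}\0{"te',
    b'xt": ["what is a black hole A region"]}\0',
]
texts = list(iter_stream_texts(simulated))
print(texts[-1])
```

In an interactive application you would typically display each item as it arrives instead of collecting the whole list first.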

In [71]:
message = [ChatMessage(content="what is a black hole", role="user")]
[x for x in llm.stream_chat(message)][-1]

Skipped llm.stream_chat call: [ChatMessage(role=<MessageRole.USER: 'user'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='what is a black hole')])]


ChatMessage(role=<MessageRole.USER: 'user'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='what is a black hole')])

### Async Response

In [72]:
import asyncio
await llm.acomplete("What is a black hole")

Skipped llm.acomplete call: What is a black hole


'What is a black hole'

In [73]:
await llm.achat(message)

Skipped llm.achat call: [ChatMessage(role=<MessageRole.USER: 'user'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='what is a black hole')])]


[ChatMessage(role=<MessageRole.USER: 'user'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='what is a black hole')])]

In [74]:
[x async for x in await llm.astream_complete("what is a black hole")][-1]

Skipped llm.astream_complete call: what is a black hole


'what is a black hole'

In [75]:
[x async for x in await llm.astream_chat(message)][-1]

Skipped llm.astream_chat call: [ChatMessage(role=<MessageRole.USER: 'user'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='what is a black hole')])]


ChatMessage(role=<MessageRole.USER: 'user'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='what is a black hole')])
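The cells above rely on the notebook's support for top-level `await`. In a plain Python script the coroutines need an event loop, and the main benefit of the async methods is issuing several requests concurrently. The sketch below shows that pattern with `asyncio.gather`; `fake_acomplete` is a stand-in for `llm.acomplete` so the example runs without a server.

```python
import asyncio


async def fake_acomplete(prompt: str) -> str:
    """Stand-in for llm.acomplete: waits briefly, then echoes the prompt."""
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"


async def main() -> list:
    prompts = ["What is a black hole", "What is a neutron star", "What is a quasar"]
    # gather runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fake_acomplete(p) for p in prompts))


# In a script, start the event loop explicitly; inside a notebook, use `await main()` instead.
results = asyncio.run(main())
print(results)
```

With a real client, replacing `fake_acomplete(p)` with `llm.acomplete(p)` should be enough, since `gather` only needs awaitables.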