**Install the [vLLM](https://docs.vllm.ai/en/latest/getting_started/quickstart.html) package**

In [1]:
# !pip install vllm lm-format-enforcer

**Run the vLLM server in the background using nohup. The first run can take a while because it downloads the model from Hugging Face.**

In [2]:
# !nohup vllm serve Qwen/Qwen2.5-1.5B-Instruct &
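
Because the first start can take a while, it may help to wait until the server answers before sending requests. The following is a minimal sketch, not part of vLLM: `http_ok` and `wait_until_ready` are hypothetical helper names, and the URL assumes the server listens on port 8000 and serves `/v1/models`, as in the cells of this notebook.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_until_ready(url, probe=http_ok, timeout_s=600.0, interval_s=5.0,
                     sleep=time.sleep):
    """Call `probe(url)` repeatedly until it succeeds.

    Gives up (returns False) once `timeout_s` seconds of sleeping have
    accumulated. `probe` and `sleep` are injectable for testing.
    """
    waited = 0.0
    while True:
        if probe(url):
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Example call (assumes the server runs locally on port 8000):
# wait_until_ready("http://localhost:8000/v1/models")
```

The timeout of 600 seconds is an arbitrary choice; a large model on a slow connection may need longer.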

In [2]:
# Test connection:
!curl https://7323-193-61-202-14.ngrok-free.app/v1/models

**Install [ngrok](https://ngrok.com/) to expose the local vLLM server running on Colab to the Internet**

In [5]:
!curl -sSL https://ngrok-agent.s3.amazonaws.com/ngrok.asc \
	| sudo tee /etc/apt/trusted.gpg.d/ngrok.asc >/dev/null \
	&& echo "deb https://ngrok-agent.s3.amazonaws.com buster main" \
	| sudo tee /etc/apt/sources.list.d/ngrok.list \
	&& sudo apt update \
	&& sudo apt install ngrok

**Follow the instructions on the [ngrok dashboard](https://dashboard.ngrok.com/get-started/setup/linux) to tunnel a local server to the Internet**

In [None]:
# Replace your_authtoken with your own authtoken
!ngrok config add-authtoken your_authtoken

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [None]:
# Tunnel the local server to a public URL on ngrok
!ngrok http http://localhost:8000 --log=stdout > ngrok.log &
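
The public URL changes between sessions, so it can be handy to read it programmatically instead of copying it from `ngrok.log`. The sketch below assumes the ngrok agent exposes its local inspection API at `http://127.0.0.1:4040/api/tunnels` (the agent's default, which may differ in your setup) and that the response has a `tunnels` list with `public_url` fields. `fetch_tunnels` and `public_urls` are hypothetical helper names.

```python
import json
import urllib.request


def fetch_tunnels(api: str = "http://127.0.0.1:4040/api/tunnels") -> dict:
    """Fetch the tunnel list from the local ngrok agent (assumed address)."""
    with urllib.request.urlopen(api, timeout=5) as resp:
        return json.load(resp)


def public_urls(payload: dict) -> list[str]:
    """Extract the HTTPS public URLs from a tunnel-list payload."""
    return [
        tunnel["public_url"]
        for tunnel in payload.get("tunnels", [])
        if tunnel.get("public_url", "").startswith("https://")
    ]


# Example call (requires a running ngrok agent):
# print(public_urls(fetch_tunnels()))
```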

In [None]:
!curl https://a1e4-193-61-202-14.ngrok-free.app/v1/models

{"object":"list","data":[{"id":"Qwen/Qwen2.5-1.5B-Instruct","object":"model","created":1737989912,"owned_by":"vllm","root":"Qwen/Qwen2.5-1.5B-Instruct","parent":null,"max_model_len":32768,"permission":[{"id":"modelperm-dc43f971056b4738a6e7531715795990","object":"model_permission","created":1737989912,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

In [None]:
!curl https://a1e4-193-61-202-14.ngrok-free.app/v1/models

{"object":"list","data":[{"id":"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B","object":"model","created":1738330774,"owned_by":"vllm","root":"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B","parent":null,"max_model_len":32768,"permission":[{"id":"modelperm-ab2b9216bfa44899995a83b6b79e0688","object":"model_permission","created":1738330774,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
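
The two responses above list different models, so a hard-coded `model=` argument can fall out of sync with what the server actually serves. As a small sketch, the model IDs can be read from the `/v1/models` payload; `served_model_ids` is a hypothetical helper, and the sample is an abbreviated copy of the second response.

```python
import json


def served_model_ids(payload: dict) -> list[str]:
    """Return the `id` of every model listed in a /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]


# Abbreviated copy of the second response shown above.
sample = json.loads(
    '{"object":"list","data":[{"id":"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",'
    '"object":"model","owned_by":"vllm"}]}'
)
print(served_model_ids(sample))  # ['deepseek-ai/DeepSeek-R1-Distill-Qwen-7B']
```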


**Python: call APIs**

In [None]:
from openai import OpenAI
from lmformatenforcer import JsonSchemaParser
from pydantic import BaseModel
from vllm.sampling_params import GuidedDecodingParams
from vllm import SamplingParams

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
# Replace with the public URL of your own ngrok tunnel, followed by /v1.
openai_api_base = "https://a07e-35-197-151-225.ngrok-free.app/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

**Direct prompting**

In [None]:
completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
                                      prompt="San Francisco is a")

print("Chat response:", completion.choices[0].text)


Chat response:  great city for luxury hotels, but we take pride in also being a place that


**Role-based prompting**

In [None]:
chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ]
)
print("Chat response:", chat_response.choices[0].message.content)

Chat response: Why did the tomato turn red?

Because it saw the salad dressing!
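
Each chat request carries the whole conversation, so a follow-up question only has context if the earlier turns are sent again. The sketch below shows one way to grow the `messages` list; `add_turn` is a hypothetical helper and the follow-up question is made up for illustration.

```python
def add_turn(messages, role, content):
    """Return a new message list with one more turn appended."""
    if role not in ("system", "user", "assistant"):
        raise ValueError(f"unexpected role: {role!r}")
    return [*messages, {"role": role, "content": content}]


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a joke."},
]
# Record the model's reply, then ask a follow-up (hypothetical) question.
history = add_turn(
    history,
    "assistant",
    "Why did the tomato turn red?\n\nBecause it saw the salad dressing!",
)
history = add_turn(history, "user", "Explain the joke.")
```

Passing `history` as `messages=` in the next `client.chat.completions.create` call would then give the model the earlier exchange as context.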


**Structured outputs**

In [None]:
# Guided decoding by JSON using Pydantic schema
class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: str

class CarList(BaseModel):
    cars: list[CarDescription]

json_schema = CarList.model_json_schema()

prompt = ("Generate a JSON with the brand, model and car_type of "
          "the 3 most iconic cars from the 90's")
completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{
        "role": "user",
        "content": prompt,
    }],
    extra_body={"guided_json": json_schema},
)
print(completion.choices[0].message.content)

{
  "cars": [
    {
      "brand": "Honda",
      "model": "Civic",
      "car_type": "sedan"
    },
    {
      "brand": "Ford",
      "model": "Mustang",
      "car_type": "coupe"
    },
    {
      "brand": "Toyota",
      "model": "Corolla",
      "car_type": "sedan"
    }
  ]
}
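
Guided decoding constrains the output's shape, but it can still be worth parsing the reply back into the Pydantic model before using it, for example when generation stops early at a token limit. A minimal sketch, assuming Pydantic v2 (which the `model_json_schema` call above implies); `parse_cars` is a hypothetical helper.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: str


class CarList(BaseModel):
    cars: list[CarDescription]


def parse_cars(raw: str) -> Optional[CarList]:
    """Parse a JSON string into a CarList, or return None if it is invalid."""
    try:
        return CarList.model_validate_json(raw)
    except ValidationError:
        # Malformed JSON or missing/mistyped fields.
        return None


reply = '{"cars": [{"brand": "Honda", "model": "Civic", "car_type": "sedan"}]}'
cars = parse_cars(reply)
if cars is not None:
    print(cars.cars[0].brand)  # Honda
```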


**Set parameters**

In [None]:
completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
                                      prompt="San Francisco is a",
                                      max_tokens=100,
                                      temperature=1  # Higher temperature: more random, more varied output
                                      )

print("Creative Chat response:\n", completion.choices[0].text)


completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
                                      prompt="San Francisco is a",
                                      max_tokens=100,
                                      temperature=0  # Temperature 0: always pick the most likely token
                                      )

print("Uncreative Chat response:\n", completion.choices[0].text)


Creative Chat response:
  big city with 2700, 000 people. The population increases by 12% each year. How many more people will be in San Francisco in 3 years?

To find out how many more people will be in San Francisco in 3 years, we need to calculate the population increase over those years, given the annual increase of 12%. We'll break this down into steps:

1. Calculate the population each year for 3 years.
2. Find
Uncreative Chat response:
  city in the state of California, in the United States. It is the capital and most populous city of the U.S. state of California. It is located on the Pacific coast of the United States, at the mouth of the San Francisco Bay. The city is the largest in the San Francisco Bay Area, the second-largest in California, and the 9th-largest city in the United States. San Francisco is the only major city in the United States to be built on a natural harbor. The
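
To see why the two runs differ, it helps to look at how temperature reshapes the next-token distribution. The sketch below is a generic illustration of temperature scaling, not vLLM's actual sampler: logits are divided by the temperature before the softmax, and temperature 0 is treated here as greedy selection of the top token. The logit values are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by `temperature`."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the largest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (1.0, 0.5, 0.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token, which is why the `temperature=0` run reads as more predictable than the `temperature=1` run.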

