# Lab 7: LLM API server and Web interfaces

In this lab, you will learn how to serve modern large models on Linux servers with an easy-to-use user interface. We will use Python as our main programming language; no knowledge of front-end languages such as JavaScript or CSS is required.

## 1 Calling Web Service APIs

In this experiment, we'll equip you with the basic knowledge and practical skills to start making powerful HTTP requests in Python. We'll cover GET and POST methods, and explore JSON data exchange. So, buckle up, let's code!

First, we need the `requests` library. It is usually already installed in your Python environment; if it is not, you can install it using pip:

In [None]:
# %pip install requests

### 1.1 Basic `GET`

GET retrieves information from a specific web address (URL). Parameters are passed either in the path itself or as query parameters (after the `?` in the URL).
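To make the two parameter styles concrete, here is a small sketch that uses only the standard library and sends nothing over the network. The base URL, the `category` value, and the `query`/`limit` parameter names are made up for illustration; they do not belong to any API used in this lab.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

base = "https://example.com/jokes"  # hypothetical endpoint, for illustration only

# 1) Path parameter: the value becomes part of the path itself.
category = "dev ops"
path_url = f"{base}/{quote(category)}"

# 2) Query parameters: key=value pairs after the "?", joined by "&".
query = urlencode({"query": "你好", "limit": 3})
query_url = f"{base}/search?{query}"

print(path_url)
print(query_url)

# Parsing the URL recovers the original values (always as strings).
parsed = parse_qs(urlsplit(query_url).query)
print(parsed)
```

With `requests`, you rarely build the query string yourself: `requests.get(url, params={...})` performs the same encoding for you.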

Let's try the GET method to retrieve a random joke!

In [None]:
import requests

# Target URL
url = "https://api.chucknorris.io/jokes/random"

# Send a GET request and store the response
response = requests.get(url)

# Check the response status code (2XX means success)
print(f"Status code: {response.status_code}")

# Access the response content (raw bytes)
content = response.content

# Decode the content to text (fall back to UTF-8 if the server does not declare an encoding)
text = content.decode(response.encoding or "utf-8")

# Print the response
print("\n--- Response Text ---")
print(text)

### 1.2 Playing with JSON

Many APIs and websites return data in the JSON format, a structured way to organize information. We can easily convert this JSON string to a Python dictionary for easy access:

In [None]:
import json
from pprint import pprint

joke = json.loads(text)  # avoid naming the variable `dict`, which would shadow the built-in type
pprint(joke)

encoded_json = json.dumps(joke)
print(encoded_json)
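
The mapping between JSON and Python types can be seen in a small offline sketch. The payload below is invented for illustration and is not real output from any API.

```python
import json

# A made-up payload shaped like a typical API response (not real API output).
raw = '{"id": "abc123", "value": "Hello", "tags": ["demo", "json"], "score": null}'

record = json.loads(raw)          # JSON text -> Python objects
print(type(record).__name__)      # a JSON object becomes a dict
print(record["tags"][0])          # a JSON array becomes a list
print(record["score"] is None)    # JSON null becomes None

# Python objects -> JSON text; indent makes it readable, sort_keys makes it stable.
print(json.dumps(record, indent=2, sort_keys=True))

# A round trip through dumps/loads preserves the data.
round_trip = json.loads(json.dumps(record))
```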

### 1.3 Moving on to POST Requests

While GET requests fetch data, POST requests send information to a server, like submitting a form. As an example, we'll use a dummy API that echoes back the data we send.

In [None]:
# Define URL and data
url = "https://httpbin.org/anything"
data = {"name": "John Doe", "age": 30}  # a python dictionary

# Send POST request with data
response = requests.post(url, data=data) # a dict passed as `data` is form-encoded; use `json=data` to send JSON

# Check status code and print response
print(f"Status code: {response.status_code}")
print(response.text)

We can see that the server actually received the data we sent: the `form` field of the response shows the same keys and values (as strings, because the data was sent as a form).
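The difference between a form-encoded body and a JSON body is easy to see offline. The sketch below only builds the two body strings with the standard library; it does not contact any server.

```python
import json
from urllib.parse import urlencode

data = {"name": "John Doe", "age": 30}

# What a form-encoded body looks like (Content-Type: application/x-www-form-urlencoded).
form_body = urlencode(data)

# What a JSON body looks like (Content-Type: application/json).
json_body = json.dumps(data)

print(form_body)
print(json_body)
```

In `requests`, `requests.post(url, data=data)` sends the first form and `requests.post(url, json=data)` sends the second. Note that form encoding flattens everything to strings, while JSON keeps `30` as a number.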

This is just the tip of the iceberg! Now you have seen how we can utilize the existing web service. In the remaining experiments, you will be building your own API server and web service with a nice user interface.

## 2 Creating an API server using FastAPI

Most of you have already used the LLM APIs we provided, which give your program access to the power of large language models. Here we will guide you to build your own LLM service, using the `fastapi` library for Python.

`fastapi` takes care of launching a web server and serving the API calls. You only need to define a function that takes the input data from the request and produces the output. `fastapi` handles the rest for you.
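To appreciate what `fastapi` automates, here is a rough sketch of a single `GET` endpoint written by hand with Python's built-in `http.server`. This is only an illustration: the names `ManualHandler` and `fetch_once` and the `/g/<data>` route are chosen for this example, and this is not how FastAPI works internally.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class ManualHandler(BaseHTTPRequestHandler):
    """Hand-written routing: GET /g/<data> returns a JSON string, anything else is 404."""

    def do_GET(self):
        if self.path.startswith("/g/"):
            body = json.dumps(f"Processed {self.path[3:]} by hand!").encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # silence per-request logging for the demo


def fetch_once(path):
    """Serve on a free local port, issue one GET request, then shut down."""
    server = HTTPServer(("127.0.0.1", 0), ManualHandler)  # port 0: the OS picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("GET", path)
        resp = conn.getresponse()
        payload = resp.read().decode("utf-8")
        conn.close()
        return resp.status, payload
    finally:
        server.shutdown()
        server.server_close()


print(fetch_once("/g/hello"))
print(fetch_once("/unknown")[0])
```

Routing, parameter parsing, response encoding, and headers are all manual here. With a framework, each of these collapses into a decorated function with typed arguments.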

First, install `fastapi` and its dependencies if needed:

### 2.1 Basics on FastAPI

In [None]:
#%pip install uvicorn fastapi websockets

In [None]:
%%file /tmp/fastapi_example.py

from fastapi import FastAPI, Request
from pydantic import BaseModel
import uvicorn

app = FastAPI()

## path parameters
@app.get('/g/{data}')
async def process_data(data: str):
    return f'Processed {data} by FastAPI!'

fake_items_db = [{"item_name": "Foo"}, {"item_name": "Bar"}, {"item_name": "Baz"}]
# Query parameters
@app.get("/items/")
async def read_item(skip: int = 0, limit: int = 10):
    return fake_items_db[skip : skip + limit]


## The data model
from typing import List
class Sale(BaseModel):
    day: int
    price: float
    
class Item(BaseModel):
    name: str
    inventory: int | None = 10
    sales: List[Sale] = []

# Getting Parameters from Request
@app.post("/post")
async def create_item(item: Item):
    return f'Hello {item.name}, {item.inventory} in stock, sold {len(item.sales)} items'

# The main() function is the entry point of the script
if __name__ == '__main__':
    uvicorn.run(app, host='0.0.0.0', port=54223, workers=1)


In [None]:
## run the following command in your terminal to start the server
## python /tmp/fastapi_example.py 

In [None]:
# you can visit your web service at:

response = requests.get('http://localhost:54223/g/hello')
print(f"Status code: {response.status_code}")
response.content

In [None]:
# Using the query parameter
response = requests.get('http://localhost:54223/items/?skip=2&limit=3')  # trailing slash matches the route and avoids a redirect
print(f"Status code: {response.status_code}")
response.content

In [None]:
# Now let the magic happen.  
# Set port forwarding in your VSCode devcontainer to forward port 54223 to your local machine
# Then visit `http://127.0.0.1:54223/g/hello` in your browser, you will be able to see the return string in the browser!

In [None]:
# Also test the POST processing, with a complex data structure as input

url = "http://localhost:54223/post"
data = { "name": "Apple", 
         "inventory": 33, 
         "sales": [{"day": 0, "price": 3.4}, {"day": 1, "price": 3.3}]
         }
response = requests.post(url, json=data)  # json= serializes the dict and sets Content-Type: application/json
print(f"Status code: {response.status_code}")
print(response.text)

In [None]:
# Another FastAPI magic: automatic document generation
# Visit http://localhost:54223/docs in your browser to see the API documentation
# (Assuming that you have your port forwarding set up correctly)

### 2.2 Creating an API to serve local LLM model

First, let's recall how you run a local LLM. The following script starts a Phi-4-mini model.

In [None]:
%%file /tmp/local_llm.py

import os
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


def chat_resp(model, tokenizer, user_prompt=None, history=None):  # None instead of a mutable default
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )   
    generation_args = {
        "max_new_tokens": 500,
        "return_full_text": False,
        "temperature": 0.6,
        "do_sample": True,
    }
    if not history:
        messages = [{"role": "system", "content": "You are a helpful assistant."},]
    else:
        messages = list(history)  # copy so the caller's list is not modified
    if user_prompt:
        prompt_msg = [{"role": "user", "content": user_prompt}]
        messages.extend(prompt_msg)
    output = pipe(messages, **generation_args)
    return output

## The main function is the entry point of the script
if __name__ == '__main__':
    model_path = '/ssdshare/share/model/Phi-4-mini-instruct'
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, 
                                             device_map="cuda:0", 
                                             torch_dtype="auto", 
                                             trust_remote_code=True,
                                             )
    resp = chat_resp(model, tokenizer, "What is the meaning of life?")
    print(resp)


In [None]:
## First verify that you can run the LLM locally (it should print the result, despite many warnings).
## python /tmp/local_llm.py

In [None]:
%%file /tmp/llm_api.py

import os
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
         
from fastapi import FastAPI, Request
from pydantic import BaseModel
import uvicorn

from urllib.parse import unquote

app = FastAPI()

def chat_resp(model, tokenizer, user_prompt=None, history=None):  # None instead of a mutable default
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )   
    generation_args = {
        "max_new_tokens": 500,
        "return_full_text": False,
        "temperature": 0.6,
        "do_sample": True,
    }
    if not history:
        messages = [{"role": "system", "content": "You are a helpful assistant."},]
    else:
        messages = list(history)  # copy so the caller's list is not modified
    if user_prompt:
        prompt_msg = [{"role": "user", "content": user_prompt}]
        messages.extend(prompt_msg)
    output = pipe(messages, **generation_args)
    return output

#### Your Task ####
## Implement a GET handler that takes in a single string as prompt from user,
## and return the response as a single string.
#### End Task ####

#### Your Task ####
## Implement a POST handler that takes in a single string and a history
## and return the response as a single string.
#### End Task ####

#### Your Task ####
## The main function is the entry point of the script, you should load the model
## and then start the FastAPI server.
#### End Task ####


In [None]:
## run the following command in your terminal to start the server
## python /tmp/llm_api.py

In [None]:
## Run a single query to test the API, using GET

import urllib.parse
# The prompt means "What is the capital of China?"; non-ASCII text must be URL-encoded
params = {"q": "中国的首都是哪里？"}
prompt_url = urllib.parse.urlencode(params)
url = f'http://localhost:54223/run?{prompt_url}'
print(url)
response = requests.get(url)
print(f"Status code: {response.status_code}")
print(response.content.decode(response.encoding or "utf-8"))

In [None]:
#### Your Task ####
## Run a single-line LLM query with POST, and add chat history (history stored on the client side only)


## 3 Creating OpenAI-Compatible API server using vLLM

In the previous section, we created a simple API server using FastAPI. However, the OpenAI-style API has become the de facto standard for LLM services, and implementing it by hand is tedious. Luckily, many open-source frameworks provide OpenAI-compatible APIs. In this section, we will use vLLM to create an OpenAI-compatible API server.

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. It uses a novel GPU memory management technique called "PagedAttention" to enable efficient inference of large models.
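The paging idea can be illustrated with a toy allocator: instead of reserving one large contiguous region per request, the cache is split into fixed-size blocks, and each sequence keeps a table of the blocks it owns. The sketch below is for intuition only and is not vLLM's actual implementation; the class `PagedKVCache` and its methods are hypothetical names.

```python
class PagedKVCache:
    """Toy block allocator: each sequence owns a list of fixed-size blocks (its block table)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids not in use
        self.block_tables = {}                      # seq_id -> list of physical block ids
        self.lengths = {}                           # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a new block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # all owned blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):
    cache.append_token("seq-A")   # 6 tokens need 2 blocks
for _ in range(3):
    cache.append_token("seq-B")   # 3 tokens need 1 block
print(cache.block_tables, len(cache.free_blocks))
cache.free("seq-A")
print(len(cache.free_blocks))
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block, and memory released by a finished request can be reused immediately by others.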

vLLM has two modes: Offline Inference and OpenAI-Compatible Server:
- **Offline Inference**: This mode is just like the huggingface transformers library. You can load a model and run inference by using vllm as a library.
- **OpenAI-Compatible Server**: This mode provides endpoints compatible with the OpenAI API, allowing you to run your own LLMs with a similar interface.

In [None]:
#%pip install vllm

### 3.1 Offline Inference

The offline API is based on the `LLM` class. To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.

The `LLM` class provides various methods for offline inference. See the "Engine Arguments" page of the vLLM documentation for a list of options when initializing the model.

In [None]:
from vllm import LLM

llm = LLM(
    model="/ssdshare/share/model/Qwen3-0.6B-Base",
)

In vLLM, generative models implement the VllmModelForTextGeneration interface. Based on the final hidden states of the input, these models output log probabilities of the tokens to generate, which are then passed through Sampler to obtain the final text.

The `generate` method is available to all generative models in vLLM. It is similar to its counterpart in HF Transformers, except that tokenization and detokenization are also performed automatically.


In [None]:
outputs = llm.generate("Which city is the capital of China?")

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

You can optionally control the language generation by passing SamplingParams.

In [None]:
from vllm import SamplingParams

params = SamplingParams(
    temperature=0.7,
    max_tokens=128,
)
outputs = llm.generate("Which city is the capital of China?", params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
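
To build intuition for the `temperature` setting, the following standalone sketch shows how dividing the logits by the temperature reshapes the token probabilities before sampling. The logits are made-up numbers, and the helper `softmax_with_temperature` is written for this example; it is not part of vLLM.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A low temperature concentrates almost all probability on the top token (close to greedy decoding), while a high temperature flattens the distribution and makes the output more varied.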

The chat method implements chat functionality on top of generate. In particular, it accepts input similar to OpenAI Chat Completions API and automatically applies the model’s chat template to format the prompt.

In general, only instruction-tuned models have a chat template. Base models may perform poorly as they are not trained to respond to the chat conversation.
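Conceptually, a chat template just flattens the list of messages into a single prompt string with role markers. The sketch below uses an invented `<|role|> ... <|end|>` format and a hypothetical helper `render_chat`; every real model defines its own template, so treat this purely as an illustration.

```python
def render_chat(messages, add_generation_prompt=True):
    """Flatten a list of {"role", "content"} messages into one prompt string.

    The <|role|> ... <|end|> markers are a made-up format for illustration;
    every real model defines its own template.
    """
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|assistant|>\n")  # cue the model to answer as the assistant
    return "".join(parts)


conversation = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
]
prompt = render_chat(conversation)
print(prompt)
```

In Hugging Face Transformers, the model's real template is applied with `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`.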

In [None]:
from vllm import LLM
import gc

# terminate the previous LLM instance to free up memory
llm = None
gc.collect()
llm = LLM(
    model="/ssdshare/share/model/Phi-4-mini-instruct",
    max_model_len=8192,
    max_num_seqs=1,
    gpu_memory_utilization=0.5,
)


In [None]:
conversation = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Hello"
    },
    {
        "role": "assistant",
        "content": "Hello! How can I assist you today?"
    },
    {
        "role": "user",
        "content": "Write a long essay about the importance of higher education.",
    },
]
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=1024,
)
outputs = llm.chat(conversation, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r},\n\nGenerated text: {generated_text!r}")

In [None]:
llm = None
gc.collect()

### 3.2 OpenAI-Compatible Server

You can start the server via the vllm serve command:

In [None]:
# run it in your terminal
# vllm serve /ssdshare/share/model/Phi-4-mini-instruct --dtype auto --api-key token-abc123 --max-model-len 16384

Now you can use the OpenAI Python package to access the endpoint:

In [None]:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
  model="/ssdshare/share/model/Phi-4-mini-instruct",
  messages=[
    {"role": "user", "content": "Hello!"}
  ]
)

print(completion.choices[0].message.content)

vLLM provides a set of endpoints that are compatible with the OpenAI API, such as completions, chat completions, and embeddings. You can find the full list of endpoints in the vLLM documentation.
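"OpenAI-compatible" means that the server accepts plain HTTP requests in the shape the OpenAI client produces. As a sketch, the block below builds (but does not send) a chat completion request with the standard library, reusing the port, API key, and model path from the cells above.

```python
import json
from urllib.request import Request

# Values taken from the `vllm serve` command and the client setup above.
base_url = "http://localhost:8000/v1"
api_key = "token-abc123"

payload = {
    "model": "/ssdshare/share/model/Phi-4-mini-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
}

# Build (but do not send) a POST request to the chat completions endpoint.
req = Request(
    url=f"{base_url}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    },
    method="POST",
)

print(req.get_method(), req.full_url)
print(req.get_header("Authorization"))
```

With the server running, `urllib.request.urlopen(req)` would send the request, and the reply text would be found under `choices[0].message.content` in the returned JSON.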

Moreover, vLLM also exposes a set of metrics that can be used to monitor the state and performance of the server. Two important ones are TTFT and TPOT.

TTFT (time to first token) is the time it takes to generate the first token of the response. TPOT (time per output token) is the average time it takes to generate each subsequent token. Both are latency metrics that are commonly used to define SLOs (Service Level Objectives) for LLM serving, and vLLM puts a lot of work into optimizing them.
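As a worked example, TTFT and TPOT can be computed from the arrival time of each token. The helper `latency_metrics` and the timestamps below are invented for illustration; this is not how vLLM collects its metrics.

```python
def latency_metrics(request_time, token_times):
    """Compute TTFT and TPOT (in seconds) from a request timestamp and token arrival timestamps."""
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_time
    # TPOT averages the gaps between consecutive tokens after the first one.
    if len(token_times) > 1:
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot


# Made-up timestamps: request sent at t=0.0 s, first token at 0.5 s, then one token every 0.25 s.
ttft, tpot = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(f"TTFT = {ttft:.2f} s, TPOT = {tpot:.2f} s")
```

For an interactive chat, TTFT determines how long the user waits before anything appears, while TPOT determines how fast the rest of the answer streams in.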

## 4 Adding a Web User Interface using `gradio`

Demoing a machine learning application is important: it gives users a direct, interactive experience of your algorithm. Here we'll build a demo using `gradio`, a popular Python library for ML demos. Let's install this library.

### 4.1 Basic Gradio

In [None]:
#%pip install gradio --upgrade

Then we can write an example UI that takes in a text string and outputs a processed string.

In [None]:
%%file /tmp/gradio_example.py

import gradio as gr

def greet(name, intensity):
    return "Hello, hello " + name + "!" * int(intensity)

demo = gr.Interface(
    fn=greet,
    inputs=["text", "slider"],
    outputs=["text"],
)

demo.launch()


In [None]:
# Start the gradio server by running the following command

# python /tmp/gradio_example.py

In [None]:
## Add the port forwarding (port 7860 by default), and you can see http://localhost:7860 in your browser

## Try changing the last line (launch) to

## demo.launch(share=True) 
## observe the output and see the link to open (without the need of port forwarding)


### 4.2 The ChatInterface

In [None]:
%%file /tmp/gradio_example.py

import random

def random_response(message, history):
    return random.choice(["Yes", "No"])

import gradio as gr
gr.ChatInterface(random_response).launch()

In [None]:
# Kill your previous process, and restart the new process

# python /tmp/gradio_example.py

## Add the port forwarding (port 7860 by default), and you can see http://localhost:7860 in your browser
## If you do not kill the previous one, the port number will change to 7861 automatically. 
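
The `history` argument carries the earlier turns of the conversation, so a chat function can depend on them. The sketch below assumes that `history` is a list of `{"role": ..., "content": ...}` dictionaries; the exact format depends on your Gradio version and settings, so check it before relying on this. The function name `count_turns` is made up for this example.

```python
def count_turns(message, history):
    """Reply with the message and the number of earlier user turns.

    Assumes `history` is a list of {"role": ..., "content": ...} dicts.
    """
    earlier = sum(1 for m in history if m.get("role") == "user")
    return f"Turn {earlier + 1}: you said {message!r}"


history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Turn 1: you said 'Hi'"},
]
print(count_turns("How are you?", history))
```

To try it in the UI, pass `count_turns` to `gr.ChatInterface` in place of `random_response`.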

### 4.3 Quick and dirty way of creating a UI for a HuggingFace pipeline

In [None]:
%%file /tmp/simpleui.py

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import gradio as gr

model_path = '/ssdshare/share/model/Phi-4-mini-instruct'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, 
                                             device_map="cuda:0", 
                                             torch_dtype="auto", 
                                             trust_remote_code=True)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    temperature=0.6,
    do_sample=True,
    return_full_text=False,
    max_new_tokens=500,
) 
gr.Interface.from_pipeline(pipe).launch(debug=True)

In [None]:
# python /tmp/simpleui.py

## Add the port forwarding (port 7860 by default), and you can see http://localhost:7860 in your browser
## If you do not kill the previous one, the port number will change to 7861 or 7862 automatically. 

### 4.4 A better way to build a web UI for LLM (through an LLM API server)

Next, you should implement a script that interacts with the Phi-4-mini Chat API server you just created.

Note that you should call the API server directly using `requests`, instead of running the LLM within your UI server process.

![Illustration of request](./assets/request.jpg)

In [None]:
%%file /tmp/chatUI.py

import gradio as gr
import requests
import json

def predict(message, history):
    #### Your Task ####
    # Insert code here to perform the inference
    # You can use either the hand-crafted API server or the OpenAI-compatible vLLM server
    #### End Task ####
    raise NotImplementedError("Implement predict() before launching the UI")

gr.ChatInterface(predict).launch()

In [None]:
## Do not forget to start your API server first (use your /chat API from above, or the vLLM server)

In [None]:
## Add the port forwarding (port 7860 by default), and you can see http://localhost:7860 in your browser
## If you do not kill the previous one, the port number will change to 7861 or 7862 automatically. 

### 4.5 More Gradio: Streaming and Multi-media

Gradio also supports streaming and multi-media input and output.

The magic comes from the `streaming=True` parameter.

In [None]:
%%file /tmp/transcribe.py

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import numpy as np
import gradio as gr

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_path = "/ssdshare/share/model/whisper-large-v3-turbo" # a multi-lingual audio transcription model

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_path, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_path)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)


def transcribe(stream, new_chunk):
    sr, y = new_chunk

    # Convert to mono if stereo
    if y.ndim > 1:
        y = y.mean(axis=1)

    # normalize to [-1, 1]; skip silent chunks to avoid dividing by zero
    y = y.astype(np.float32)
    peak = np.max(np.abs(y))
    if peak > 0:
        y /= peak

    if stream is not None:
        stream = np.concatenate([stream, y])
    else:
        stream = y
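    # return_timestamps and the Whisper task/language are call-time options, not part of the audio input
    result = pipe(
        {"sampling_rate": sr, "raw": stream},
        return_timestamps=True,
        generate_kwargs={"task": "transcribe", "language": "chinese"},
    )
    return stream, result["text"]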


demo = gr.Interface(
    transcribe,
    ["state", gr.Audio(sources=["microphone"], streaming=True)], # note the streaming=True
    ["state", "text"],
    live=True,
    time_limit=30,
    stream_every=0.5,
)

demo.launch()

Try it!

In [None]:
## python /tmp/transcribe.py
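
Streaming also applies to text output. A chat function can be written as a generator that yields a progressively longer reply, and `gr.ChatInterface` updates the displayed message with each yielded value. The sketch below only shows the generator pattern and runs without Gradio; the function name `stream_reply` and its word-by-word splitting are choices made for this example.

```python
def stream_reply(message, history):
    """Yield a growing prefix of the reply, one word at a time."""
    words = f"You said: {message}".split()
    partial = ""
    for word in words:
        partial = f"{partial} {word}".strip()
        yield partial


chunks = list(stream_reply("streaming is fun", []))
print(chunks[0])
print(chunks[-1])
```

In a real LLM UI, each yielded value would be the text accumulated so far from a streaming API response rather than a fixed sentence.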

### 4.6 Build your own Gradio UI

Create a separate Gradio UI to serve other models, for example an image model from Lab 5, or a translation model combined with a transcription model. Explore the Gradio documentation and the Hugging Face model cards for ideas.

In [None]:
#### Your Task ####
#### End Task ####