# Lab 7: LLM API server and Web interfaces

In this lab, you will learn how to serve modern large models on a Linux server behind an easy-to-use user interface. We will use Python as our main programming language; no knowledge of front-end languages such as JavaScript or CSS is required.

## 1 Calling Web Service APIs

In this experiment, you will learn the basics of making HTTP requests in Python. We'll cover the GET and POST methods and explore JSON data exchange. So buckle up, and let's code!

First, we need the `requests` library. Install it with the following command.

In [20]:
%pip install requests

Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Note: you may need to restart the kernel to use updated packages.


### 1.1 Basic `GET`

GET retrieves information from a specific web address (URL). Parameters are passed either in the path itself or as query parameters (after the `?` in the URL).
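As a small illustration of the query-parameter form, the sketch below builds a URL and then parses it again using only the standard library. The endpoint `https://example.com/search` and the parameter names are made up for this example; no request is sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint and parameters, used only for illustration
base_url = "https://example.com/search"
params = {"q": "chuck norris", "limit": 5}

# urlencode turns the dict into "key=value" pairs joined by "&"
query_url = f"{base_url}?{urlencode(params)}"
print(query_url)

# Splitting the URL again recovers the path and the query parameters
parts = urlsplit(query_url)
print(parts.path)
print(parse_qs(parts.query))
```

With `requests`, passing the same dictionary as `requests.get(base_url, params=params)` performs this encoding for you.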

Let's try the GET method to retrieve a random joke!

In [21]:
import requests

# Target URL
url = "https://api.chucknorris.io/jokes/random"

# Send a GET request and store the response
response = requests.get(url)

# Check the response status code (2XX means success)
print(f"Status code: {response.status_code}")

# Access the response content (raw bytes)
content = response.content

# Decode the content to text (fall back to UTF-8 if the server declares no encoding)
text = content.decode(response.encoding or "utf-8")

# Print the response
print("\n--- Response Text ---")
print(text)

Status code: 200

--- Response Text ---
{"categories":[],"created_at":"2020-01-05 13:42:27.496799","icon_url":"https://assets.chucknorris.host/img/avatar/chuck-norris.png","id":"MCqtvLI4SaumUznRg7A5BA","updated_at":"2020-01-05 13:42:27.496799","url":"https://api.chucknorris.io/jokes/MCqtvLI4SaumUznRg7A5BA","value":"Chuck Norris's power level is over 9000......in his sleep."}


### 1.2 Playing with JSON

Many APIs and websites return data in the JSON format, a structured way to organize information. We can easily convert this JSON string to a Python dictionary for easy access:

In [22]:
import json
from pprint import pprint

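# Avoid naming the variable `dict`, which would shadow Python's built-in type
joke = json.loads(text)          # JSON string -> Python dictionary
pprint(joke)

encoded_json = json.dumps(joke)  # Python dictionary -> JSON string
print(encoded_json)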

{'categories': [],
 'created_at': '2020-01-05 13:42:27.496799',
 'icon_url': 'https://assets.chucknorris.host/img/avatar/chuck-norris.png',
 'id': 'MCqtvLI4SaumUznRg7A5BA',
 'updated_at': '2020-01-05 13:42:27.496799',
 'url': 'https://api.chucknorris.io/jokes/MCqtvLI4SaumUznRg7A5BA',
 'value': "Chuck Norris's power level is over 9000......in his sleep."}
{"categories": [], "created_at": "2020-01-05 13:42:27.496799", "icon_url": "https://assets.chucknorris.host/img/avatar/chuck-norris.png", "id": "MCqtvLI4SaumUznRg7A5BA", "updated_at": "2020-01-05 13:42:27.496799", "url": "https://api.chucknorris.io/jokes/MCqtvLI4SaumUznRg7A5BA", "value": "Chuck Norris's power level is over 9000......in his sleep."}
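Once the response is a dictionary, individual fields can be read by key. The sketch below works on a shortened, hand-written copy of the response above rather than on a live request; `indent` and `ensure_ascii` are standard `json.dumps` options.

```python
import json

# A shortened copy of the JSON text returned above (hand-written for this example)
sample_text = '{"categories": [], "id": "MCqtvLI4SaumUznRg7A5BA", "value": "Chuck Norris\'s power level is over 9000......in his sleep."}'

joke = json.loads(sample_text)      # JSON string -> Python dict
print(joke["value"])                # read a single field
print(joke.get("missing", "n/a"))   # .get() avoids a KeyError for absent keys

# indent=2 pretty-prints; ensure_ascii=False keeps non-ASCII characters readable
print(json.dumps({"city": "北京", "tags": joke["categories"]}, indent=2, ensure_ascii=False))
```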


### 1.3 Moving on to POST Requests

While GET requests fetch data, POST requests send information to a server, like submitting a form. As an example, we'll use a dummy API that echoes back the data we send.

In [23]:
# Define URL and data
url = "https://httpbin.org/anything"
data = {"name": "John Doe", "age": 30}  # a python dictionary

# Send POST request with data
response = requests.post(url, data=data)  # a dict passed via data= is sent as a URL-encoded form (use json= to send JSON)

# Check status code and print response
print(f"Status code: {response.status_code}")
print(response.text)

Status code: 200
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "age": "30", 
    "name": "John Doe"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br, zstd", 
    "Content-Length": "20", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.31.0", 
    "X-Amzn-Trace-Id": "Root=1-66446c2a-60db626201f88d25347c63e0"
  }, 
  "json": null, 
  "method": "POST", 
  "origin": "114.253.254.93", 
  "url": "https://httpbin.org/anything"
}



We can see that the server received the data we sent (`form` shows exactly the same data).
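The `Content-Type` header in the echoed response shows that the dictionary travelled as a URL-encoded form. The sketch below, which uses only the standard library and sends nothing over the network, compares that form encoding with the JSON encoding of the same dictionary.

```python
import json
from urllib.parse import urlencode

data = {"name": "John Doe", "age": 30}

# Form encoding: corresponds to the "form" field and Content-Type echoed above
form_body = urlencode(data)
print(form_body, len(form_body))

# JSON encoding: what an API expecting a JSON body would receive instead
json_body = json.dumps(data)
print(json_body)
```

The form body is 20 bytes long, which matches the `Content-Length` header in the echoed response.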

This is just the tip of the iceberg! You have now seen how to use an existing web service. In the remaining experiments, you will build your own API server and web service with a nice user interface.

## 2 Creating an API server using FastAPI

Most of you have probably used the LLM APIs we provided, which give your programs access to the power of large language models. Here we will guide you through building your own LLM service using Python's `fastapi` library.

`fastapi` takes care of launching a web server and serving the API calls. You only need to define a function that takes the input data from the request and produces the output; `fastapi` handles the rest for you.

First, install `fastapi` and its dependencies.

### 2.1 Basics on FastAPI

In [24]:
%pip install uvicorn fastapi websockets

Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Note: you may need to restart the kernel to use updated packages.


In [25]:
%%file /tmp/fastapi_example.py

from fastapi import FastAPI, Request
from pydantic import BaseModel
import uvicorn

app = FastAPI()

## path parameters
@app.get('/g/{data}')
async def process_data(data: str):
    return f'Processed {data} by FastAPI!'

fake_items_db = [{"item_name": "Foo"}, {"item_name": "Bar"}, {"item_name": "Baz"}]
# Query parameters
@app.get("/items/")
async def read_item(skip: int = 0, limit: int = 10):
    return fake_items_db[skip : skip + limit]


## The data model
from typing import List
class Sale(BaseModel):
    day: int
    price: float
    
class Item(BaseModel):
    name: str
    inventory: int | None = 10
    sales: List[Sale] = []

# Getting Parameters from Request
@app.post("/post")
async def create_item(item: Item):
    return f'Hello {item.name}, {item.inventory} in stock, sold {len(item.sales)} items'

# The main() function is the entry point of the script
if __name__ == '__main__':
    uvicorn.run(app, host='0.0.0.0', port=54223, workers=1)


Overwriting /tmp/fastapi_example.py


In [26]:
## run the following command in your terminal to start the server
## python /tmp/fastapi_example.py 

In [28]:
# you can visit your web service at:

response = requests.get('http://localhost:54223/g/hello')
print(f"Status code: {response.status_code}")
response.content

Status code: 200


b'"Processed hello by FastAPI!"'

In [29]:
# Using the query parameter

response = requests.get('http://localhost:54223/items?skip=2&limit=3')
print(f"Status code: {response.status_code}")
response.content


Status code: 200


b'[{"item_name":"Baz"}]'
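The result contains only `Baz` because `skip` and `limit` are applied as a list slice. The sketch below reproduces that logic without a server: it parses the query string with the standard library and applies the same slice. The helper name `read_item_from_url` is made up for this example; in the real service, FastAPI does the parsing and the `int` conversion based on the type hints.

```python
from urllib.parse import urlsplit, parse_qs

fake_items_db = [{"item_name": "Foo"}, {"item_name": "Bar"}, {"item_name": "Baz"}]

def read_item_from_url(url: str, skip: int = 0, limit: int = 10):
    """Hypothetical helper: mimic the /items/ handler for a given URL."""
    query = parse_qs(urlsplit(url).query)
    # Query values arrive as strings; convert them, falling back to the defaults
    skip = int(query.get("skip", [skip])[0])
    limit = int(query.get("limit", [limit])[0])
    return fake_items_db[skip : skip + limit]

print(read_item_from_url("http://localhost:54223/items/?skip=2&limit=3"))
print(read_item_from_url("http://localhost:54223/items/"))
```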

In [30]:
# Now let the magic happen.
# Set port forwarding in your VSCode devcontainer to forward port 54223 to your local machine
# Then visit `http://127.0.0.1:54223/g/hello` in your browser; you will see the returned string there!


In [31]:
# Also test the POST processing, with a complex data structure as input

url = "http://localhost:54223/post"
data = { "name": "Apple", 
         "inventory": 33, 
         "sales": [{"day": 0, "price": 3.4}, {"day": 1, "price": 3.3}]
         }
encoded = json.dumps(data).encode("utf-8")
response = requests.post(url, data=encoded)  # the parameters should be encoded as JSON
print(f"Status code: {response.status_code}")
print(response.text)

Status code: 200
"Hello Apple, 33 in stock, sold 2 items"


In [32]:
# Another FastAPI magic: automatic document generation
# Visit http://localhost:54223/docs in your browser to see the API documentation
# (Assuming that you have your port forwarding set up correctly)
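The POST example earlier in this section sends the nested dictionary as a JSON document in the request body. As a sketch of what that request looks like, the block below builds (but does not send) an equivalent request with the standard library's `urllib.request`. The explicit `Content-Type` header is an addition for this example, since many servers expect it for JSON bodies.

```python
import json
import urllib.request

url = "http://localhost:54223/post"
data = {
    "name": "Apple",
    "inventory": 33,
    "sales": [{"day": 0, "price": 3.4}, {"day": 1, "price": 3.3}],
}

# Serialize the nested structure to JSON bytes for the request body
body = json.dumps(data).encode("utf-8")
req = urllib.request.Request(
    url,
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Inspect the prepared request; nothing is sent until urlopen(req) is called
print(req.get_method(), req.full_url)
print(req.get_header("Content-type"))
print(json.loads(req.data)["sales"])
```

With `requests`, `requests.post(url, json=data)` performs the serialization and sets the header in one step.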

### 2.2 Creating an API to serve local LLM model

First, let's recall how to run a local LLM. The following script starts a Phi-3 model.

In [33]:
%%file /tmp/local_llm.py

import os
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


def chat_resp(model, tokenizer, user_prompt=None, history=[]):
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )   
    generation_args = {
        "max_new_tokens": 500,
        "return_full_text": False,
        "temperature": 0.6,
        "do_sample": True,
    }
    if not history:
        messages = [{"role": "system", "content": "You are a helpful assistant."},]
    else:
        messages = history
    if user_prompt:
        prompt_msg = [{"role": "user", "content": user_prompt}]
        messages.extend(prompt_msg)
    output = pipe(messages, **generation_args)
    return output

## The main function is the entry point of the script
if __name__ == '__main__':
    model_path = '/ssdshare/Phi-3-mini-128k-instruct/'
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, 
                                             device_map="cuda:0", 
                                             torch_dtype="auto", 
                                             trust_remote_code=True)
    resp = chat_resp(model, tokenizer, "What is the meaning of life?")
    print(resp)


Overwriting /tmp/local_llm.py


In [34]:
## First verify that you can run the LLM locally (it should print the results, despite many warnings).
## python /tmp/local_llm.py
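The input to the pipeline is a list of role/content dictionaries. The sketch below isolates just that message-building step as a small function that copies the incoming history, so the caller's list stays unchanged. The helper name `build_messages` and its `system_prompt` parameter are made up for this example, and no model is loaded.

```python
def build_messages(user_prompt=None, history=None,
                   system_prompt="You are a helpful assistant."):
    """Hypothetical helper: assemble the chat messages for one request."""
    if history:
        messages = list(history)  # copy, so the caller's list is left unchanged
    else:
        messages = [{"role": "system", "content": system_prompt}]
    if user_prompt:
        messages.append({"role": "user", "content": user_prompt})
    return messages

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
messages = build_messages("What is the meaning of life?", history)
print(len(history), len(messages))
print(build_messages("What is the meaning of life?"))
```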

In [9]:
%%file /tmp/api_llm.py

import os
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
         
from fastapi import FastAPI, Request
from pydantic import BaseModel
import uvicorn

from urllib.parse import unquote

app = FastAPI()

def chat_resp(model, tokenizer, user_prompt=None, history=[]):
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )   
    generation_args = {
        "max_new_tokens": 500,
        "return_full_text": False,
        "temperature": 0.6,
        "do_sample": True,
    }
    if not history:
        messages = [{"role": "system", "content": "You are a helpful assistant."},]
    else:
        messages = history
    if user_prompt:
        prompt_msg = [{"role": "user", "content": user_prompt}]
        messages.extend(prompt_msg)
    output = pipe(messages, **generation_args)
    return output

#### Your Task ####
## Implement a GET handler that takes in a single string as prompt from user,
## and return the response as a single string.
class Prompt(BaseModel):
    text: str

@app.get("/ask/")
async def get_response(prompt: str):
    return chat_resp(model, tokenizer, user_prompt=prompt)
#### End Task ####

#### Your Task ####
## Implement a POST handler that takes in a single string and a history
## and return the response as a single string.
class PH(BaseModel):
    prompt: str
    history: list

@app.post("/post/")
async def post_response(request: Request, ph: PH):
    return chat_resp(model, tokenizer, user_prompt=ph.prompt, history=ph.history)

#### End Task ####

#### Your Task ####
## The main function is the entry point of the script, you should load the model
## and then start the FastAPI server.
if __name__ == '__main__':
    model_path = '/ssdshare/Phi-3-mini-128k-instruct/'
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, 
                                             device_map="cuda:0", 
                                             torch_dtype="auto", 
                                             trust_remote_code=True)
    uvicorn.run(app, host='0.0.0.0', port=54223, workers=1)
#### End Task ####


Overwriting /tmp/api_llm.py


In [36]:
## run the following command in your terminal to start the server
## python /tmp/api_llm.py

In [55]:
## Run a single query to test the API, using GET
import requests
import urllib.parse

# Prompt: "What is the capital of China?"
params = {"prompt": "中国的首都是哪里？"}
prompt_url = urllib.parse.urlencode(params)  # percent-encode the non-ASCII prompt
url = f"http://localhost:54223/ask/?{prompt_url}"
print(url)
response = requests.get(url)
print(f"Status code: {response.status_code}")
print(response.text)

# The same question with two characters transposed ("都首" instead of "首都")
params = {"prompt": "中国的都首是哪里？"}
prompt_url = urllib.parse.urlencode(params)
url = f"http://localhost:54223/ask/?{prompt_url}"
print(url)
response = requests.get(url)
print(f"Status code: {response.status_code}")
print(response.text)

http://localhost:54223/ask/?prompt=%E4%B8%AD%E5%9B%BD%E7%9A%84%E9%A6%96%E9%83%BD%E6%98%AF%E5%93%AA%E9%87%8C%EF%BC%9F
Status code: 200
[{"generated_text":" 中国的首都是北京。北京是中国的政治、文化和国际交流中心，也是世界上最大的人口密集地区之一。"}]
http://localhost:54223/ask/?prompt=%E4%B8%AD%E5%9B%BD%E7%9A%84%E9%83%BD%E9%A6%96%E6%98%AF%E5%93%AA%E9%87%8C%EF%BC%9F
Status code: 200
[{"generated_text":" 中国的首都是北京。北京是中华人民共和国的政治、文化和国际交流中心，也是其最大的城市。北京的历史可追溯到更早的秦朝，有着超过三千年的历史。"}]
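The handler above returns the pipeline output unchanged, which is a list containing one dictionary with a `generated_text` key. A client that only wants the answer text can unpack it as sketched below. The sample string is copied from the first response above, and the helper name `extract_text` is made up for this example.

```python
import json

def extract_text(raw_json: str) -> str:
    """Hypothetical helper: pull the answer text out of the /ask/ response body."""
    outputs = json.loads(raw_json)  # a list of dicts, each with a "generated_text" key
    return outputs[0]["generated_text"].strip()

sample = '[{"generated_text":" 中国的首都是北京。北京是中国的政治、文化和国际交流中心，也是世界上最大的人口密集地区之一。"}]'
print(extract_text(sample))
```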


In [10]:
#### Your Task ####
## Run a LLM single line query with POST, and add chat history (history stored on the client side only)
import requests

url = 'http://localhost:54223/post/'
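data = {
    "prompt": "中国的都首是哪里？",  # "What is the capital of China?" (with "首都" transposed)
    "history": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "法国的都首是哪里?"},   # "What is the capital of France?"
        {"role": "assistant", "content": "是巴黎."},        # "It is Paris."
        {"role": "user", "content": "美国的都首是哪里?"},   # "What is the capital of the United States?"
        {"role": "assistant", "content": "是华盛顿."},      # "It is Washington."
    ]
}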
response = requests.post(url, json=data)
print(f"Status code: {response.status_code}")
print(response.json())


Status code: 200
[{'generated_text': ' 中国的都首是北京。'}]
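Because the history lives on the client side only, the client has to append every exchange to its own list and send the full list with each request. The sketch below shows that bookkeeping with a stand-in for the network call: `fake_backend` is a placeholder that would be replaced by a `requests.post` call to the `/post/` endpoint, and `chat_turn` is a made-up helper name.

```python
def fake_backend(prompt, history):
    """Placeholder for the real API call; it just reports how much context it saw."""
    return f"(reply to {prompt!r} with {len(history)} earlier messages)"

def chat_turn(prompt, history, backend=fake_backend):
    """Send one prompt with the history so far, then record both sides of the turn."""
    reply = backend(prompt, history)
    history.append({"role": "user", "content": prompt})
    history.append({"role": "assistant", "content": reply})
    return reply

history = [{"role": "system", "content": "You are a helpful assistant."}]
print(chat_turn("What is the capital of France?", history))
print(chat_turn("And of China?", history))
print(len(history))
```

The prompt is appended to the history only after the call, since the server adds the current prompt to the received history itself.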


## 3 Adding a Web User Interface using `gradio`

Demonstrating a machine learning application is important: it gives users a direct, interactive experience of your algorithm. Here we'll build a demo using `gradio`, a popular Python library for ML demos. Let's install the library.

### 3.1 Basic Gradio

In [30]:
%pip install gradio --upgrade

Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Note: you may need to restart the kernel to use updated packages.


Then we can write an example UI that takes a text string as input and outputs a processed string.

In [12]:
%%file /tmp/gradio_example.py

import gradio as gr

def greet(name, intensity):
    return "Hello, hello " + name + "!" * int(intensity)

demo = gr.Interface(
    fn=greet,
    inputs=["text", "slider"],
    outputs=["text"],
)

demo.launch()

Overwriting /tmp/gradio_example.py


In [None]:
# Start the gradio server by running the following command

# python /tmp/gradio_example.py

In [None]:
## Add the port forwarding (port 7860 by default), and you can see http://localhost:7860 in your browser

## Try changing the last line (launch) to

## demo.launch(share=True)
## Observe the output and open the link it prints (no port forwarding needed)


### 3.2 The ChatInterface

In [77]:
%%file /tmp/gradio_example.py

import random

import gradio as gr

def random_response(message, history):
    return random.choice(["Yes", "No"])

# launch() picks the port (7860 by default); pass server_port=... to launch() to pin it
gr.ChatInterface(fn=random_response).launch(share=True)

Overwriting /tmp/gradio_example.py


In [None]:
# Kill your previous process, and restart the new process

# python /tmp/gradio_example.py

## Add the port forwarding (port 7860 by default), and you can see http://localhost:7860 in your browser
## If you do not kill the previous one, the port number will change to 7861 automatically. 
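The port fallback mentioned above can be pictured as a search for the first free port starting at 7860. The sketch below illustrates the idea with the standard `socket` module. It is not Gradio's actual implementation, and `find_free_port` is a made-up helper name.

```python
import socket

def find_free_port(start: int = 7860, max_tries: int = 100) -> int:
    """Hypothetical helper: return the first port >= start that can be bound locally."""
    for port in range(start, min(start + max_tries, 65536)):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("127.0.0.1", port))
            except OSError:
                continue  # port is in use (or not permitted); try the next one
            return port
    raise RuntimeError(f"no free port found starting at {start}")

print(find_free_port())
```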

### 3.3 Quick and dirty way of creating a UI for a HuggingFace pipeline

In [13]:
%%file /tmp/simpleui.py

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import gradio as gr

model_path = '/ssdshare/Phi-3-mini-128k-instruct/'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, 
                                             device_map="cuda:0", 
                                             torch_dtype="auto", 
                                             trust_remote_code=True)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    temperature=0.6,
    do_sample=True,
    return_full_text=False,
    max_new_tokens=500,
) 
gr.Interface.from_pipeline(pipe).launch(share=True)

Overwriting /tmp/simpleui.py


In [None]:
# python /tmp/simpleui.py

## Add the port forwarding (port 7860 by default), and you can see http://localhost:7860 in your browser
## If you do not kill the previous one, the port number will change to 7861 or 7862 automatically. 

### 3.4 A better way to build a web UI for LLM (through an LLM API server)

Next, implement a script that interacts with the Phi-3 chat API server you just created.

Note that you should call the API server directly using `requests`, instead of running the LLM inside your UI server process.

![Illustration of request](./assets/request.jpg)

In [1]:
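import os
# Optional proxy settings: replace the placeholders with the values for your own environment
os.environ['HTTP_PROXY'] = "http://<user>:<password>@<proxy-host>:7890"
os.environ['HTTPS_PROXY'] = "http://<user>:<password>@<proxy-host>:7890"
os.environ['ALL_PROXY'] = "socks5://<user>:<password>@<proxy-host>:7893"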

In [14]:
%%file /tmp/chatUI.py

import gradio as gr
import requests
import json

API_SERVER_URL = "http://localhost:54223" # Don't forget to start your local API server

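def predict(message, history):

#### Your Task ####
    # Gradio passes the history either as [user, assistant] pairs or as
    # {"role", "content"} dicts, depending on its version and settings;
    # convert both forms into the message list that the API server expects.
    messages = []
    for turn in history or []:
        if isinstance(turn, dict):
            messages.append({"role": turn["role"], "content": turn["content"]})
        else:
            user_msg, bot_msg = turn
            messages.append({"role": "user", "content": user_msg})
            if bot_msg:
                messages.append({"role": "assistant", "content": bot_msg})
    payload = {"prompt": message, "history": messages}
    # Send a POST request to the /post/ endpoint of the API server
    response = requests.post(f"{API_SERVER_URL}/post/", json=payload)
    if response.status_code == 200:
        # The server returns [{"generated_text": "..."}]; hand the text to the chat UI
        return response.json()[0]["generated_text"]
    else:
        return "Error %s: Failed to get response from the server" % response.status_code
#### End Task ####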

gr.ChatInterface(predict).launch(show_error=True)

Overwriting /tmp/chatUI.py


In [None]:
## Do not forget to start your API server (from above, with the /post/ endpoint).

In [None]:
## Add the port forwarding (port 7860 by default), and you can see http://localhost:7860 in your browser
## If you do not kill the previous one, the port number will change to 7861 or 7862 automatically. 

You can also test it programmatically using `gradio-client`.

In [28]:
%pip install gradio-client

Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Note: you may need to restart the kernel to use updated packages.


In [15]:
from gradio_client import Client

client = Client("http://127.0.0.1:7861/")
result = client.predict(
		message="Hello!!",
		api_name="/chat"
)
print(result)

Loaded as API: http://127.0.0.1:7861/ ✔
[{'generated_text': ' Hello there! How can I assist you today?'}]


### 3.5 More Gradio: create a UI to serve an image model you created in Lab 5

You can either use `from_pipeline()` or create your own, more advanced UI. Either way, you will need to allow API access to your service (it will be needed in the following labs).

If you feel more adventurous, try a multimedia model, such as text-to-speech or speech recognition. We have downloaded some for you at `/share/model/speecht5_hifigan/`, `/share/model/speecht5_tts/`, and `/share/model/whisper-medium/`.




In [1]:
#### Your Task ####
## follow the instructions above

import gradio as gr
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("/share/LLMs/stable-diffusion-2-1", use_auth_token=False)
pipe = pipe.to("cuda")
gr.Interface.from_pipeline(pipe).launch()

Keyword arguments {'use_auth_token': False} are not expected by StableDiffusionPipeline and will be ignored.



Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.



