# Lecture 8: Serving Large Language Models and Beyond

In this lecture, you will learn how to serve modern large models on Linux servers with an easy-to-use user interface. We will use Python as our main programming language, and we do not require knowledge of front-end languages such as JavaScript or CSS.

## Preliminaries

We start by reviewing some basics you should be familiar with. We assume that you are already familiar with the Python language. Make sure you have a workspace with Python and pip available.

### Docker

#### What is Docker?

Imagine a self-contained box carrying everything an application needs to run smoothly: its code, runtime environment, and required libraries. That's what a Docker container is! It provides a standardized, isolated environment for applications, regardless of the underlying system. This simplifies deployment, sharing, and scaling applications.

For example, on your MacBook you can run a Docker image (think of it as a minimal virtual machine) that hosts your personal website on Ubuntu (one of the most popular Linux distributions). You will not need to write additional code, as someone else has already built such an image for you. Nor will you need to worry about the dependencies of the software running your website, as they are already packed into the image.

![Docker illustration image created by AI](./assets/docker.png)

#### Benefits of Docker

- **Consistency**. Applications run identically on any system with Docker installed.
- **Isolation**. Containers share resources but don't interfere with each other, improving stability.
- **Portability**. Move containers easily between systems without worrying about environment conflicts.
- **Reproducibility**. Share configurations and ensure applications run the same way everywhere.

#### Basic Docker Workflow

1. Pull: Download a pre-built container image from a public registry (like Docker Hub).
2. Run: Start the container, bringing the application to life.
3. Interact: Use the container as you would any other application.
4. Stop: Terminate the container when you're done.

#### Key Docker commands

- `docker pull`: Download a container image from the registry.
- `docker run`: Start a container based on an image.
- `docker ps`: See a list of running containers.
- `docker stop`: Stop a running container.
- `docker exec`: Execute a command inside a running container.
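
The workflow and commands above can be sketched as a small Python script. This is only an illustration under stated assumptions: the image `nginx:latest` and the container name `lecture-demo` are placeholders that do not come from this lecture, and by default the script prints the commands instead of executing them.

```python
import shlex
import shutil
import subprocess

IMAGE = "nginx:latest"      # assumed example image, not part of the lecture
CONTAINER = "lecture-demo"  # hypothetical container name


def docker_workflow(image=IMAGE, name=CONTAINER):
    """Return the pull / run / ps / exec / stop commands as argument lists."""
    return [
        ["docker", "pull", image],                       # 1. Pull
        ["docker", "run", "-d", "--name", name, image],  # 2. Run (detached)
        ["docker", "ps"],                                # list running containers
        ["docker", "exec", name, "ls", "/"],             # 3. Interact
        ["docker", "stop", name],                        # 4. Stop
    ]


def run_workflow(commands, dry_run=True):
    """Print each command; execute it only if dry_run is False and Docker is installed."""
    for command in commands:
        print("$", shlex.join(command))
        if not dry_run and shutil.which("docker"):
            subprocess.run(command, check=True)


run_workflow(docker_workflow())
```

Calling `run_workflow(docker_workflow(), dry_run=False)` on a machine with Docker installed would actually run the five steps in order.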

Here are some resources for further exploration: [Interactive Docker Tutorial](https://www.docker.com/play-with-docker/), [Docker Official Documentation](https://docs.docker.com/).

### Kubernetes

#### What is Kubernetes?

Think of a conductor managing a whole orchestra of Docker containers. That's Kubernetes! It automates the deployment, scaling, and management of containerized applications across multiple servers. Kubernetes ensures your applications run smoothly, even when things get complicated.

You can think of Docker as a container manager for a single machine, and Kubernetes as a container manager for a group of machines!

![Illustration of Kubernetes](assets/k8s.png)

#### Benefits of Kubernetes

- **Automation**. Manage deployment, scaling, and updates of containerized applications automatically.
- **Scalability**. Easily scale your applications up or down based on demand.
- **High Availability**. Kubernetes automatically restarts failed containers and distributes workloads, ensuring service continuity.
- **Portability**. Kubernetes applications can be deployed anywhere with the same setup.

#### Key Kubernetes Concepts

- Pods: Groups of containers that share resources and work together.
- Deployments: Define how and how many pods of a specific application should run.
- Services: Provide a stable endpoint for accessing your pods, even if individual pods are restarted or replaced.
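
To make these concepts concrete, here is a minimal sketch of a Deployment and a Service written as Python dictionaries and printed as JSON. The names (`demo-web`, `web`), the image `nginx:latest`, the replica count, and the port numbers are all illustrative assumptions, not values from this lecture.

```python
import json

APP_LABELS = {"app": "demo-web"}  # hypothetical label that ties pods to the Service

# Deployment: how the pods should look and how many copies should run.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "demo-web"},
    "spec": {
        "replicas": 3,  # assumed number of pods
        "selector": {"matchLabels": APP_LABELS},
        "template": {  # the pod template
            "metadata": {"labels": APP_LABELS},
            "spec": {
                "containers": [
                    {
                        "name": "web",
                        "image": "nginx:latest",  # assumed example image
                        "ports": [{"containerPort": 80}],
                    }
                ]
            },
        },
    },
}

# Service: a stable endpoint that forwards traffic to every pod with the labels.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "demo-web"},
    "spec": {
        "selector": APP_LABELS,
        "ports": [{"port": 80, "targetPort": 80}],
    },
}

for manifest in (deployment, service):
    print(json.dumps(manifest, indent=2))
```

Kubernetes manifests are usually written in YAML, but `kubectl apply -f` also accepts JSON files, so each printed object could be saved to a file and applied to a cluster.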

For further exploration: [play with k8s](https://labs.play-with-k8s.com/), [official documentation](https://kubernetes.io/docs/home/).

## Experiment 0: Serving and Requesting a Web Service

In this experiment, we'll equip you with the basic knowledge and practical skills to start making powerful HTTP requests in Python. We'll cover GET and POST methods, and explore JSON data exchange. So, buckle up, let's code!

First, we will need the `requests` library. Install it with the following command.

In [None]:
%pip install requests

#### Basic `GET`

Imagine asking a librarian for a book. That's essentially what a GET request does! It retrieves information from a specific web address (URL). Let's try the GET method to retrieve a random joke!

In [1]:
import requests

# Target URL
url = "https://api.chucknorris.io/jokes/random"

# Send a GET request and store the response
response = requests.get(url)

# Check the response status code (2XX means success)
print(f"Status code: {response.status_code}")

# Access the response content (raw bytes)
content = response.content

# Decode the content to text (may differ depending on API)
text = content.decode("utf-8")

# Print the response
print("\n--- Response Text ---")
print(text)

Status code: 200

--- Response Text ---
{"categories":[],"created_at":"2020-01-05 13:42:21.795084","icon_url":"https://assets.chucknorris.host/img/avatar/chuck-norris.png","id":"cmfIxasPR9ejJi7l_aEPfg","updated_at":"2020-01-05 13:42:21.795084","url":"https://api.chucknorris.io/jokes/cmfIxasPR9ejJi7l_aEPfg","value":"Chuck Norris can get road rage in a fighter jet."}


#### Playing with JSON

Many APIs and websites return data in JSON, a structured text format for organizing information. We can convert this JSON string into a Python dictionary so that individual fields are easy to access:

In [2]:
import json
from pprint import pprint

pprint(json.loads(text))

{'categories': [],
 'created_at': '2020-01-05 13:42:21.795084',
 'icon_url': 'https://assets.chucknorris.host/img/avatar/chuck-norris.png',
 'id': 'cmfIxasPR9ejJi7l_aEPfg',
 'updated_at': '2020-01-05 13:42:21.795084',
 'url': 'https://api.chucknorris.io/jokes/cmfIxasPR9ejJi7l_aEPfg',
 'value': 'Chuck Norris can get road rage in a fighter jet.'}
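
Once the text is parsed, individual fields are reached by key. The sketch below uses an abbreviated copy of the response shown above, so it runs without network access. With `requests`, calling `response.json()` should give the same dictionary in one step.

```python
import json

# Abbreviated copy of the response text shown above (only three fields kept).
text = (
    '{"categories":[],"id":"cmfIxasPR9ejJi7l_aEPfg",'
    '"value":"Chuck Norris can get road rage in a fighter jet."}'
)

joke = json.loads(text)            # JSON string -> Python dict
print(joke["value"])               # look up a field by key
print(joke.get("missing", "n/a"))  # .get() avoids a KeyError for absent keys

# dict -> JSON string again, e.g. to store it or send it elsewhere
round_trip = json.dumps(joke, indent=2)
print(json.loads(round_trip) == joke)
```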


#### Moving on to POST Requests

While GET requests fetch data, POST requests send information to a server, like submitting a form. As an example, we'll use a dummy API that echoes back the data we send.

In [3]:
# Define URL and data
url = "https://httpbin.org/anything"
data = {"name": "John Doe", "age": 30}

# Send POST request with data
response = requests.post(url, data=data)

# Check status code and print response
print(f"Status code: {response.status_code}")
print(response.text)

Status code: 200
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "age": "30", 
    "name": "John Doe"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br, zstd", 
    "Content-Length": "20", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.31.0", 
    "X-Amzn-Trace-Id": "Root=1-660ab349-6fd3466d69452d2d1775de2d"
  }, 
  "json": null, 
  "method": "POST", 
  "origin": "114.253.244.26", 
  "url": "https://httpbin.org/anything"
}



We can see that the server actually received the data we sent: `form` contains the same fields, with the values encoded as strings.
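
The reason `age` comes back as the string `"30"` is that `data=` sends a form-encoded body. The sketch below reproduces that encoding with the standard library only and compares it with a JSON body, which keeps the number as a number. Passing `json=data` instead of `data=data` to `requests.post` sends the JSON form, in which case the echo service would be expected to fill its `json` field rather than `form`.

```python
import json
from urllib.parse import parse_qs, urlencode

data = {"name": "John Doe", "age": 30}

# Form encoding: what `requests.post(url, data=data)` puts in the request body.
form_body = urlencode(data)
print(form_body)       # name=John+Doe&age=30
print(len(form_body))  # 20, matching the Content-Length header above

# Decoding a form body yields strings only, so the integer type is lost.
print(parse_qs(form_body))

# JSON encoding: what `requests.post(url, json=data)` would send instead.
json_body = json.dumps(data)
print(json_body)       # {"name": "John Doe", "age": 30}
print(json.loads(json_body)["age"] + 1)  # still an int after decoding
```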

This is just the tip of the iceberg! Now you have seen how to use existing web services. In the remaining experiments, you will build your own API server and web service with a nice user interface.

## Experiment 1: API Server for LLMs with GPU Support

Most of you have already used the LLM APIs we provided, which give your programs access to the power of large language models. Here we will guide you through building your own LLM service using Python's `fastapi` library.

`fastapi`, together with the `uvicorn` server, takes care of launching a web server and serving the API calls. You only need to define a function that takes the input data from the request and produces the output; `fastapi` handles the rest for you.

First, install `fastapi` and its dependencies.

In [None]:
%pip install uvicorn fastapi websockets

In [4]:
%%file /tmp/fastapi_example.py

import fastapi

app = fastapi.FastAPI()

@app.get('/inference')
def process_string(data: str):
    return f'Processed {data} by FastAPI!'

Writing /tmp/fastapi_example.py


In [5]:
!uvicorn --app-dir /tmp fastapi_example:app --port 54223 --host 0.0.0.0

INFO:     Started server process [13226]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:54223 (Press CTRL+C to quit)
INFO:     127.0.0.1:56128 - "GET /inference?data=hello HTTP/1.1" 200 OK
INFO:     127.0.0.1:56128 - "GET /favicon.ico HTTP/1.1" 404 Not Found
^C
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [13226]


By visiting `http://127.0.0.1:54223/inference?data=hello` in your browser while the server is running, you will see the returned string:

```
"Processed hello by FastAPI!"
```

Note that if you are running on a remote server, you may need to forward the port to your local machine to see the result in your browser.
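
The same endpoint can also be called from Python instead of a browser. The following is a minimal sketch using only the standard library, so it has no extra dependencies; `requests.get(url, params={"data": text})` from Experiment 0 would work just as well. The helper names `build_inference_url` and `call_inference` are hypothetical, and the host and port are the ones from the `uvicorn` command above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:54223"  # host and port from the uvicorn command above


def build_inference_url(text, base_url=BASE_URL):
    """Encode `text` as the `data` query parameter of the /inference endpoint."""
    return f"{base_url}/inference?{urlencode({'data': text})}"


def call_inference(text, base_url=BASE_URL, timeout=10):
    """Send a GET request to the running server and decode its JSON reply."""
    with urlopen(build_inference_url(text, base_url), timeout=timeout) as response:
        return json.load(response)


# Building the URL needs no server; spaces are escaped automatically.
print(build_inference_url("hello world"))

# With the server from the previous cell running, this returns the processed string:
# call_inference("hello")
```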

Now it is your turn to implement a script that serves an API running `GPT-2`.

## Experiment 2: Serving a User Interface using `gradio`

Demoing a machine learning application is important: it gives users a direct, interactive experience of your algorithm. Here we'll build a demo using `gradio`, a popular Python library for ML demos. Let's install this library.

In [None]:
%pip install gradio

Then we can write an example UI that takes in a text string and outputs a processed string.

In [6]:
%%file /tmp/gradio_example.py

import gradio as gr

def greet(name):
    return "Hello " + name + "!"

demo = gr.Interface(fn=greet, inputs="text", outputs="text")
    
if __name__ == "__main__":
    demo.launch(show_api=False)   

Writing /tmp/gradio_example.py


In [7]:
!python /tmp/gradio_example.py

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
^C
Keyboard interruption in main thread... closing server.


Now you should be able to see a simple website that consumes your input text. Next, you should implement a script that interacts with the GPT-2 API you just created.

![Illustration of request](./assets/request.jpg)

## Experiment 3: Serving a Custom Model

In this experiment, you are required to serve another model from Hugging Face, e.g., a vision-language model (VLM). You should design your own UI and your own API service.