# SGLang + ServerlessLLM Integration Example

This notebook demonstrates the integration between SGLang and ServerlessLLM for optimized model loading and inference.

We will:
1. Install dependencies (uv, SGLang, ServerlessLLM)
2. Convert a HuggingFace model (`Qwen/Qwen3-0.6B`) to ServerlessLLM format
3. Start the `sllm-store` server in the background
4. Run a benchmark comparing the engine load time of the ServerlessLLM format vs. the standard format

## 1. Installation

### 1. Install uv (recommended for faster installation)

```bash
pip install uv
uv venv -p 3.10
source .venv/bin/activate
```

### 2. Install SGLang

```bash
# Clone the repository
git clone https://github.com/sgl-project/sglang.git

# Install SGLang
uv pip install -e sglang/python
```

### 3. Install ServerlessLLM

```bash
# Clone the repository
git clone https://github.com/ServerlessLLM/ServerlessLLM.git

# Install sllm_store
uv pip install -e ServerlessLLM/sllm_store

# Install ServerlessLLM
uv pip install -e ServerlessLLM

# Reinstall SGLang so that its dependency versions take precedence
uv pip install -e sglang/python
```

### 4. Install Jupyter notebook

```bash
uv pip install jupyter
```

In [1]:
import shutil
import os

# Verify installation
def check_installation():
    errors = []
    
    if not shutil.which("sllm"):
        errors.append("sllm not found")
    if not shutil.which("sllm-store"):
        errors.append("sllm-store not found")
    
    try:
        import sllm
        import sllm_store
        print(f"✅ sllm installed: {os.path.dirname(sllm.__file__)}")
        print(f"✅ sllm_store installed: {os.path.dirname(sllm_store.__file__)}")
    except ImportError as e:
        errors.append(f"Import error: {e}")
    
    try:
        import sglang
        if hasattr(sglang, '__file__') and sglang.__file__:
            print(f"✅ sglang installed: {os.path.dirname(sglang.__file__)}")
        elif hasattr(sglang, '__path__'):
            print(f"✅ sglang installed: {sglang.__path__[0]}")
    except ImportError as e:
        errors.append(f"sglang not installed: {e}")
    
    if errors:
        print("\n❌ Missing dependencies:")
        for err in errors:
            print(f"   - {err}")
        print("\nPlease follow the installation instructions above.")
    else:
        print("\n✅ All dependencies installed!")

check_installation()


✅ sllm installed: /root/xinyuan/sglang/examples/integration/ServerlessLLM/sllm
✅ sllm_store installed: /root/xinyuan/sglang/examples/integration/ServerlessLLM/sllm_store/sllm_store
✅ sglang installed: /root/xinyuan/sglang/examples/integration/sglang

✅ All dependencies installed!


In [2]:
# Import required libraries
import time
import torch
import nest_asyncio
nest_asyncio.apply()

from sglang.srt.entrypoints.engine import Engine



## 2. Convert Model to ServerlessLLM Format

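We use the `Qwen/Qwen3-0.6B` model for this demonstration.
The conversion script downloads the model if needed and saves its weights under the storage path in the ServerlessLLM checkpoint format, which is optimized for fast loading.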

In [3]:
model_name = "Qwen/Qwen3-0.6B"
sllm_storage_path = "models" # or your preferred path to store the model weights

In [4]:
# Ensure storage path exists
!mkdir -p {sllm_storage_path}

# Run the conversion script
# This script downloads the model (if needed) and converts it
!uv run python ServerlessLLM/sllm_store/examples/save_sglang_model.py \
   --model-name {model_name} \
   --storage-path {sllm_storage_path} \
   --dtype bfloat16 \
   --tensor-parallel-size 1 

Fetching 7 files:  14%|███▊                       | 1/7 [00:00<00:01,  5.35it/s]
generation_config.json: 100%|██████████████████| 239/239 [00:00<00:00, 1.94MB/s]
config.json: 100%|█████████████████████████████| 726/726 [00:00<00:00, 9.92MB/s]
tokenizer_config.json: 9.73kB [00:00, 26.0MB/s]
merges.txt: 1.67MB [00:00, 21.8MB/s]
vocab.json: 2.78MB [00:00, 33.4MB/s]
tokenizer.json: 100%|██████████████████████| 11.4M/11.4M [00:00<00:00, 21.8MB/s]
model.safetensors:   1%|▏ 
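
Before starting the server, it can help to confirm that the conversion actually wrote files under the storage path. The following is a minimal sketch, assuming the converted checkpoint lands in `<storage-path>/<model-name>` (the same path the benchmark cell passes as `model_path`). The helper name `summarize_dir` is hypothetical and not part of ServerlessLLM or SGLang.

```python
import os


def summarize_dir(path):
    """Return (file_count, total_bytes) for all regular files under path."""
    file_count = 0
    total_bytes = 0
    # os.walk yields nothing for a missing directory, so the result is (0, 0)
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(root, name))
    return file_count, total_bytes
```

Calling `summarize_dir(os.path.join(sllm_storage_path, model_name))` after the conversion cell should report a non-zero file count and size; a result of `(0, 0)` suggests the conversion did not write to the expected location.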

## 3. Start the sllm-store server in the background

In [7]:
import subprocess

# Define the command as a list for safety
cmd = [
    "sllm-store", "start",
    "--storage-path", sllm_storage_path,
    "--mem-pool-size", "4GB"
]

# Open a log file to capture the output. The child process inherits the
# file descriptor, so the notebook can close its own handle right away.
with open("sllm_server.log", "w") as log_file:
    # Launch the process in the background
    process = subprocess.Popen(
        cmd,
        stdout=log_file,
        stderr=subprocess.STDOUT,
        start_new_session=True,  # Decouples the process from the notebook's session and process group
    )

print(f"sllm-store started in background with PID: {process.pid}")
print("Check logs with: !tail -f sllm_server.log")

sllm-store started in background with PID: 781491
Check logs with: !tail -f sllm_server.log
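
Because `subprocess.Popen` returns immediately, a server that crashes on startup goes unnoticed until the benchmark fails, and `tail -f` keeps the cell running until it is interrupted. The following is a minimal, non-blocking sketch that reports whether the process is still alive and returns the last few log lines. The helper name `server_status` is hypothetical; it relies only on the standard library.

```python
import subprocess
from collections import deque
from pathlib import Path


def server_status(proc: subprocess.Popen, log_path: str, n_lines: int = 20):
    """Return (is_running, last_log_lines) without blocking the notebook."""
    # poll() returns None while the child is alive, otherwise its exit code
    is_running = proc.poll() is None
    log = Path(log_path)
    if not log.exists():
        return is_running, []
    with log.open("r", errors="replace") as f:
        # A deque with maxlen keeps only the final n_lines lines of the file
        last_lines = list(deque(f, maxlen=n_lines))
    return is_running, [line.rstrip("\n") for line in last_lines]
```

With the variables from the cell above, `running, tail = server_status(process, "sllm_server.log")` gives a quick check: if `running` is `False`, the lines in `tail` usually contain the traceback that explains why the server exited.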


In [None]:
!tail -f sllm_server.log

Traceback (most recent call last):
  File "/root/xinyuan/sglang/examples/integration/.venv/bin/sllm-store", line 4, in <module>
    from sllm_store.cli import main
  File "/root/xinyuan/sglang/examples/integration/ServerlessLLM/sllm_store/sllm_store/cli.py", line 30, in <module>
    from sllm_store.server import serve
  File "/root/xinyuan/sglang/examples/integration/ServerlessLLM/sllm_store/sllm_store/server.py", line 13, in <module>
    ctypes.CDLL(os.path.join(sllm_store.__path__[0], "libglog.so"))
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /root/xinyuan/sglang/examples/integration/ServerlessLLM/sllm_store/sllm_store/libglog.so: cannot open shared object file: No such file or directory


## 4. Performance Benchmark

We compare the initialization time of the SGLang Engine using:
- **ServerlessLLM format** (`load_format="serverless_llm"`)
- **Standard format** (`load_format="auto"`)

In [13]:
import math


def time_engine_load(label, model_path, load_format):
    """Time Engine construction, run one prompt, and return the load time in seconds.

    Returns float('inf') if loading or generation fails.
    """
    engine = None
    start = time.perf_counter()
    try:
        engine = Engine(
            model_path=model_path,
            load_format=load_format,
            tp_size=1,
            dtype="bfloat16",
        )
        load_time = time.perf_counter() - start
        print(f"{label} load time: {load_time:.4f}s")

        # Run inference to confirm that the loaded weights produce output
        outputs = engine.generate(["What is the capital of France"])
        for output in outputs:
            # Depending on the SGLang version, each output is a dict or an object
            print(output.get("text", "") if isinstance(output, dict) else output.text)
    except Exception as e:
        print(f"{label} loading failed: {e}")
        load_time = float("inf")
    finally:
        # Always shut the engine down, even if generation failed
        if engine is not None:
            engine.shutdown()
        # Clear GPU memory if possible
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    return load_time


def run_benchmark():
    print("Starting Benchmark...")

    # 1. Test ServerlessLLM format
    print(f"Loading {model_name} with ServerlessLLM format...")
    # Note: model_path for sllm format is the local directory
    sllm_model_path = os.path.join(sllm_storage_path, model_name)
    print(sllm_model_path)
    sllm_load_time = time_engine_load("ServerlessLLM", sllm_model_path, "serverless_llm")

    # 2. Test Standard format
    print(f"Loading {model_name} with Standard format...")
    std_load_time = time_engine_load("Standard", model_name, "auto")

    # 3. Report results
    print("RESULTS")
    print(f"ServerlessLLM: {sllm_load_time:.4f}s")
    print(f"Standard:      {std_load_time:.4f}s")
    # Report a speedup only when both runs succeeded
    if math.isfinite(sllm_load_time) and math.isfinite(std_load_time):
        print(f"Speedup:       {std_load_time/sllm_load_time:.2f}x")

run_benchmark()

Starting Benchmark...
Loading Qwen/Qwen3-0.6B with ServerlessLLM format...
/home/users/ntu/ktang022/scratch/models/Qwen/Qwen3-0.6B


We recommend installing via `pip install torch-c-dlpack-ext`


[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
DEBUG 01-13 20:49:39 torch.py:137] allocate_cuda_memory takes 0.0008206367492675781 seconds
DEBUG 01-13 20:49:39 client.py:72] load_into_gpu: /home/users/ntu/ktang022/scratch/models/Qwen/Qwen3-0.6B/rank_0, 2e757c0a-15ec-4ea7-ac83-41f3334d0fb6
INFO 01-13 20:49:39 client.py:113] Model loaded: /home/users/ntu/ktang022/scratch/models/Qwen/Qwen3-0.6B/rank_0, 2e757c0a-15ec-4ea7-ac83-41f3334d0fb6
INFO 01-13 20:49:39 torch.py:160] restore state_dict takes 0.0004429817199707031 seconds
INFO 01-13 20:49:39 client.py:117] confirm_model_loaded: /home/users/ntu/ktang022/scratch/models/Qwen/Qwen3-0.6B/rank_0, 2e757c0a-15ec-4ea7-ac83-41f3334d0

Capturing batches (bs=1 avail_mem=5.08 GB): 100%|██████████| 8/8 [00:01<00:00,  6.77it/s] 


ServerlessLLM load time: 26.0948s


We recommend installing via `pip install torch-c-dlpack-ext`


 on the green button logo they want tomas?
A) Somewhere North
B) Somewhere East
C) Somewhere Tuesday
D) Somewhere West
E) None of the above?

The choices are based on the logo tomas, which is a very well known French celebrity.
Answer:
C) Somewhere Tuesday

The logo tomas is a famous and well-known French celebrity. On this logo, the shapes represent the like of a green button. The greenbutton logo is associated with the capital of France. On the green button logo they need tomas, the capital that would be represented is Somewhere Tuesday,
Loading Qwen/Qwen3-0.6B with Standard format...


We recommend installing via `pip install torch-c-dlpack-ext`
We recommend installing via `pip install torch-c-dlpack-ext`


[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.00it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.00it/s]

Capturing batches (bs=1 avail_mem=5.03 GB): 100%|██████████| 8/8 [00:01<00:00,  6.55it/s] 


Standard load time: 39.2057s
?

The capital of France is Paris. LaTeX, the way you mentioned it, is used to write mathematical formulas. So, if I were to write the capital of France in LaTeX, it would be denoted by the symbol \text{Paris}. However, I need to be careful. The capital of France is also known as Paris, which is sometimes also called the capital of the country. Therefore, you could use either \text{Paris} or \text{Paris}, both would be correct. The square brackets in Latex are often used to denote the capitals of different countries, but for Paris, it's better to use the
RESULTS
ServerlessLLM: 26.0948s
Standard:      39.2057s
Speedup:       1.50x
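
The reported speedup is the standard load time divided by the ServerlessLLM load time. Both timings cover the whole `Engine(...)` construction, which in the logs above also includes the "Capturing batches" step, so the ratio appears to reflect full engine startup rather than weight loading alone. As a sketch, the same ratio can be computed with guards for failed runs, which the benchmark records as `float('inf')`. The helper name `speedup` is hypothetical.

```python
import math


def speedup(baseline_s, candidate_s):
    """Return baseline_s / candidate_s, or None if either timing is unusable."""
    # A failed run is recorded as infinity, so no meaningful ratio exists
    if not (math.isfinite(baseline_s) and math.isfinite(candidate_s)):
        return None
    # Guard against division by zero or a nonsensical negative timing
    if candidate_s <= 0:
        return None
    return baseline_s / candidate_s


# Timings taken from the RESULTS output above
ratio = speedup(39.2057, 26.0948)
print(f"{ratio:.2f}x")  # prints "1.50x"
```

Returning `None` instead of a number keeps a failed run from being printed as a misleading `0.00x` or `infx` speedup.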
