# SGLang + ServerlessLLM Integration Example

This notebook demonstrates the integration between SGLang and ServerlessLLM for optimized model loading and inference.

We will:
1. Install dependencies.
2. Convert a Hugging Face model (`Qwen/Qwen3-0.6B`) to the ServerlessLLM format.
3. Start the `sllm-store` server in the background.
4. Benchmark SGLang engine startup time with the ServerlessLLM format against the standard format.

**Note:** We use `uv` for faster installation as recommended.

## 1. Install Dependencies

In [6]:
import os
import shutil

# Install uv
if shutil.which("uv"):
    print("✅ uv is already installed.")
else:
    print("⚠️ Installing uv...")
    !pip install uv
    print("✅ uv installed.")

# Install ServerlessLLM (both the `sllm` and `sllm-store` CLIs must be present)
if shutil.which("sllm") and shutil.which("sllm-store"):
    print("✅ ServerlessLLM is already installed.")
else:
    print("📥 Cloning ServerlessLLM repository...")
    !git clone https://github.com/ServerlessLLM/ServerlessLLM.git
    print("✅ Repository cloned.")

    # Install sllm_store
    if shutil.which("sllm-store"):
        print("✅ sllm_store is already installed.")
    else:
        print("⚠️ Installing sllm_store...")
        !uv pip install -e ServerlessLLM/sllm_store
        print("✅ sllm_store installed.")

    # Install ServerlessLLM
    if shutil.which("sllm") or shutil.which("serverless-llm"):
        print("✅ ServerlessLLM is already installed.")
    else:
        print("⚠️ Installing ServerlessLLM...")
        !uv pip install -e ServerlessLLM
        print("✅ ServerlessLLM installed.")

print("\n🎉 All done! ServerlessLLM is ready to use.")

# Report where each package was imported from
import sllm
import sllm_store

sllm_location = os.path.dirname(sllm.__file__)
print(f"   Built from: {sllm_location}")

sllm_store_location = os.path.dirname(sllm_store.__file__)
print(f"   Built from: {sllm_store_location}")


✅ uv is already installed.
✅ ServerlessLLM is already installed.

🎉 All done! ServerlessLLM is ready to use.
   Built from: /scratch/users/ntu/ktang022/ServerlessLLM/sllm
   Built from: /scratch/users/ntu/ktang022/ServerlessLLM/sllm


In [7]:
# Install SGLang with ServerlessLLM support from the feature branch
if shutil.which("sglang"):
    print("✅ sglang is already installed.")
else:
    print("⚠️ Installing sglang...")
    # Clone into `sglang_fork` so the path matches the install step below
    !git clone -b feat-sllm-load-format https://github.com/JustinTong0323/sglang sglang_fork
    !cd sglang_fork/python && uv pip install -e .
    print("✅ sglang installed.")

import os
import sglang

if getattr(sglang, "__file__", None) is not None:
    sglang_location = os.path.dirname(sglang.__file__)
elif getattr(sglang, "__path__", None):
    # Namespace packages have no __file__; fall back to the first __path__ entry
    sglang_location = str(list(sglang.__path__)[0])
else:
    sglang_location = "unknown (namespace package)"
print(f"   Built from: {sglang_location}")

✅ sglang is already installed.
   Built from: /scratch/users/ntu/ktang022/sglang
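
The cells above detect existing installations by looking for command-line entry points with `shutil.which`. A complementary check, sketched below, asks Python's import system whether each package is importable in the current kernel, which can differ from what is on `PATH`. The helper name `missing_modules` is hypothetical and not part of either project.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that the current kernel cannot import."""
    return [name for name in names if find_spec(name) is None]


# Package names taken from the imports used in this notebook.
required = ["sllm", "sllm_store", "sglang"]
missing = missing_modules(required)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All required packages are importable.")
```

`find_spec` only locates a module without executing it, so this check is cheap and has no side effects.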


## 2. Convert Model to ServerlessLLM Format

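We use the `Qwen/Qwen3-0.6B` model for this demonstration.
The conversion script downloads the model if needed and rewrites its checkpoint into the ServerlessLLM format under `{sllm_storage_path}/{model_name}`, a layout intended for fast loading.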

In [12]:
import os
os.environ["HF_HOME"] = "/home/users/ntu/ktang022/scratch/hf_cache"

In [3]:

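import os

# Hugging Face model ID used throughout the notebook
model_name = "Qwen/Qwen3-0.6B"

# Directory for checkpoints converted to the ServerlessLLM format
sllm_storage_path = "/home/users/ntu/ktang022/scratch/models"

# Directory holding the standard Hugging Face checkpoint used as the baseline
hf_storage_path = "/home/users/ntu/ktang022/scratch/hf_models"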
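
The benchmark in this notebook reads the standard checkpoint from `os.path.join(hf_storage_path, model_name)`, but no cell shows how that directory is populated. One possible way, assuming the `huggingface_hub` package is installed and network access is available, is to download a snapshot into that exact directory. The helper names `local_checkpoint_dir` and `download_hf_checkpoint` are hypothetical.

```python
import os


def local_checkpoint_dir(storage_path: str, model_name: str) -> str:
    """Build the directory the benchmark reads: <storage_path>/<model_name>."""
    return os.path.join(storage_path, model_name)


def download_hf_checkpoint(storage_path: str, model_name: str) -> str:
    """Download a Hugging Face snapshot into the benchmark's expected directory."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target_dir = local_checkpoint_dir(storage_path, model_name)
    os.makedirs(target_dir, exist_ok=True)
    snapshot_download(repo_id=model_name, local_dir=target_dir)
    return target_dir


# Example (requires network access):
# download_hf_checkpoint(hf_storage_path, model_name)
```

Keeping the path construction in one helper makes it harder for the download location and the benchmark's `model_path` to drift apart.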

In [14]:
# Uses `model_name` and `sllm_storage_path` defined in the previous cell

# Ensure storage path exists
!mkdir -p {sllm_storage_path}

# Run the conversion script
# This script downloads the model (if needed) and converts it
!python ServerlessLLM/sllm_store/examples/save_sglang_model.py \
   --model-name {model_name} \
   --storage-path {sllm_storage_path} \
   --dtype bfloat16 \
   --tensor-parallel-size 1 

[Output truncated: Hugging Face download progress for 7 files, including `config.json`, `generation_config.json`, `merges.txt`, `vocab.json`, `tokenizer.json`, and `tokenizer_config.json`]
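
The benchmark logs later in this notebook show the converted weights being loaded from a `rank_0` subdirectory of `{sllm_storage_path}/{model_name}`. A quick sanity check after conversion is sketched here, under the assumption that each tensor-parallel rank gets its own `rank_<i>` directory; only `rank_0` is confirmed by the logs. The helper name `list_rank_dirs` is hypothetical.

```python
from pathlib import Path


def list_rank_dirs(storage_path, model_name):
    """Return the sorted names of `rank_*` subdirectories for a converted model."""
    model_dir = Path(storage_path) / model_name
    if not model_dir.is_dir():
        # Conversion has not produced an output directory yet.
        return []
    return sorted(
        p.name for p in model_dir.iterdir()
        if p.is_dir() and p.name.startswith("rank_")
    )


# Example, using the variables defined earlier in the notebook:
# print(list_rank_dirs(sllm_storage_path, model_name))
```

An empty list suggests the conversion did not complete or wrote to a different storage path.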

## 3. Start the sllm-store Server in the Background

In [2]:
import subprocess
import os

# Ensure the storage path exists
os.makedirs(sllm_storage_path, exist_ok=True)

# Define the command as a list for safety
cmd = [
    "sllm-store", "start",
    "--storage-path", sllm_storage_path,
    "--mem-pool-size", "4GB"
]

# Open a log file to capture the output
log_file = open("sllm_server.log", "w")

# Launch the process in the background
process = subprocess.Popen(
    cmd,
    stdout=log_file,
    stderr=subprocess.STDOUT,
    preexec_fn=os.setpgrp # Decouples the process from the notebook's process group
)

print(f"sllm-store started in background with PID: {process.pid}")
print("Check logs with: !tail -f sllm_server.log")

sllm-store started in background with PID: 1491768
Check logs with: !tail -f sllm_server.log
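
`subprocess.Popen` returns immediately, so a printed PID does not guarantee the server stayed up. The sketch below waits briefly, then reports whether the process is still running and returns the last lines of its log. The helper name `check_background_process` and the default grace period are assumptions for illustration; a live process also does not prove the server is already accepting requests.

```python
import subprocess
import time


def check_background_process(process: subprocess.Popen, log_path: str,
                             grace_seconds: float = 2.0, tail_lines: int = 10):
    """Wait briefly, then return (is_alive, last_log_lines) for a background process."""
    time.sleep(grace_seconds)
    # poll() returns None while the process is still running.
    alive = process.poll() is None
    try:
        with open(log_path, "r", errors="replace") as f:
            tail = f.readlines()[-tail_lines:]
    except FileNotFoundError:
        tail = []
    return alive, [line.rstrip("\n") for line in tail]


# Example, using the objects created in the cell above:
# alive, tail = check_background_process(process, "sllm_server.log")
# print("running" if alive else "exited", *tail, sep="\n")
```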


## 4. Performance Benchmark

We compare the initialization time of the SGLang Engine using:
1. **ServerlessLLM format** (`load_format="serverless_llm"`)
2. **Standard format** (`load_format="auto"`)

In [5]:
import os
import time

import nest_asyncio
import torch
from sglang.srt.entrypoints.engine import Engine

# Allow SGLang's event loop to run inside the notebook's already-running loop
nest_asyncio.apply()


def run_benchmark():
    print("Starting Benchmark...") 
    
    # 1. Test ServerlessLLM format
    print(f"Loading {model_name} with ServerlessLLM format...")
    start = time.time()
    # Note: model_path for sllm format is the local directory
    sllm_model_path = os.path.join(sllm_storage_path, model_name)
    print(sllm_model_path)
    
    try:
        engine_sllm = Engine(
            model_path=sllm_model_path,
            load_format="serverless_llm",
            tp_size=1,
            dtype="bfloat16"
        )
        sllm_load_time = time.time() - start
        print(f"ServerlessLLM load time: {sllm_load_time:.4f}s")
        prompts = ["What is the capital of France"]
        # engine.generate returns a list of dictionaries/objects depending on the SGLang version
        outputs = engine_sllm.generate(prompts)
        
        for output in outputs:
            # Depending on SGLang version, output might be a dict or an object
            if isinstance(output, dict):
                print(output.get("text", ""))
            else:
                print(output.text)
    
        # Shut down the engine so its GPU memory is released before the next run
        engine_sllm.shutdown()
    except Exception as e:
        print(f"ServerlessLLM loading failed: {e}")
        sllm_load_time = float('inf')

    
        
    # Clear GPU memory if possible
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    
    # 2. Test Standard format
    print(f"Loading {model_name} with Standard format...")
    hf_model_path = os.path.join(hf_storage_path, model_name)
    start = time.time()
    try:
        engine_std = Engine(
            model_path=hf_model_path,
            load_format="auto",
            tp_size=1,
            dtype="bfloat16"
        )
        std_load_time = time.time() - start
        print(f"Standard load time: {std_load_time:.4f}s")
        # Run inference    
        prompts = ["What is the capital of France"]
        # engine.generate returns a list of dictionaries/objects depending on the SGLang version
        outputs = engine_std.generate(prompts)
        
        for output in outputs:
            # Depending on SGLang version, output might be a dict or an object
            if isinstance(output, dict):
                print(output.get("text", ""))
            else:
                print(output.text)

        # Shut down the engine to release its GPU memory
        engine_std.shutdown()
    except Exception as e:
        print(f"Standard loading failed: {e}")
        std_load_time = float('inf')


        
    
    print("RESULTS")
    print(f"ServerlessLLM: {sllm_load_time:.4f}s")
    print(f"Standard:      {std_load_time:.4f}s")
    # Report a speedup only when both engines loaded successfully
    if 0 < sllm_load_time < float('inf') and std_load_time < float('inf'):
        print(f"Speedup:       {std_load_time/sllm_load_time:.2f}x")

if __name__ == "__main__":
    # Simple check to avoid running if imported
    run_benchmark()

Starting Benchmark...
Loading Qwen/Qwen3-0.6B with ServerlessLLM format...
/home/users/ntu/ktang022/scratch/models/Qwen/Qwen3-0.6B


We recommend installing via `pip install torch-c-dlpack-ext`


[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
DEBUG 01-13 20:49:39 torch.py:137] allocate_cuda_memory takes 0.0008206367492675781 seconds
DEBUG 01-13 20:49:39 client.py:72] load_into_gpu: /home/users/ntu/ktang022/scratch/models/Qwen/Qwen3-0.6B/rank_0, 2e757c0a-15ec-4ea7-ac83-41f3334d0fb6
INFO 01-13 20:49:39 client.py:113] Model loaded: /home/users/ntu/ktang022/scratch/models/Qwen/Qwen3-0.6B/rank_0, 2e757c0a-15ec-4ea7-ac83-41f3334d0fb6
INFO 01-13 20:49:39 torch.py:160] restore state_dict takes 0.0004429817199707031 seconds
INFO 01-13 20:49:39 client.py:117] confirm_model_loaded: /home/users/ntu/ktang022/scratch/models/Qwen/Qwen3-0.6B/rank_0, 2e757c0a-15ec-4ea7-ac83-41f3334d0

Capturing batches (bs=1 avail_mem=5.08 GB): 100%|██████████| 8/8 [00:01<00:00,  6.77it/s]


ServerlessLLM load time: 26.0948s


We recommend installing via `pip install torch-c-dlpack-ext`


 on the green button logo they want tomas?
A) Somewhere North
B) Somewhere East
C) Somewhere Tuesday
D) Somewhere West
E) None of the above?

The choices are based on the logo tomas, which is a very well known French celebrity.
Answer:
C) Somewhere Tuesday

The logo tomas is a famous and well-known French celebrity. On this logo, the shapes represent the like of a green button. The greenbutton logo is associated with the capital of France. On the green button logo they need tomas, the capital that would be represented is Somewhere Tuesday,
Loading Qwen/Qwen3-0.6B with Standard format...


We recommend installing via `pip install torch-c-dlpack-ext`
We recommend installing via `pip install torch-c-dlpack-ext`


[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.00it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.00it/s]

Capturing batches (bs=1 avail_mem=5.03 GB): 100%|██████████| 8/8 [00:01<00:00,  6.55it/s]


Standard load time: 39.2057s
?

The capital of France is Paris. LaTeX, the way you mentioned it, is used to write mathematical formulas. So, if I were to write the capital of France in LaTeX, it would be denoted by the symbol \text{Paris}. However, I need to be careful. The capital of France is also known as Paris, which is sometimes also called the capital of the country. Therefore, you could use either \text{Paris} or \text{Paris}, both would be correct. The square brackets in Latex are often used to denote the capitals of different countries, but for Paris, it's better to use the
RESULTS
ServerlessLLM: 26.0948s
Standard:      39.2057s
Speedup:       1.50x
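
The reported speedup is the standard load time divided by the ServerlessLLM load time. Both times span the whole `Engine` construction, including steps such as the "Capturing batches" phase visible in the logs, and each comes from a single run. The sketch below reproduces the arithmetic and shows one way to time a block with `time.perf_counter`, which is a monotonic clock. The helper names `timed` and `speedup` are hypothetical.

```python
import math
import time
from contextlib import contextmanager


@contextmanager
def timed(results, label):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def speedup(baseline_seconds, candidate_seconds):
    """Return baseline / candidate, or None if either timing is unusable."""
    if not (math.isfinite(baseline_seconds) and math.isfinite(candidate_seconds)):
        return None
    if candidate_seconds <= 0:
        return None
    return baseline_seconds / candidate_seconds


# Load times reported in the benchmark output above.
ratio = speedup(39.2057, 26.0948)
print(f"Speedup: {ratio:.2f}x")  # prints "Speedup: 1.50x"
```

Returning `None` for a failed run avoids printing a misleading ratio when one of the engines did not load.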
