# LoRA Serving

SGLang enables the use of [LoRA adapters](https://arxiv.org/abs/2106.09685) with a base model. By incorporating techniques from [S-LoRA](https://arxiv.org/pdf/2311.03285) and [Punica](https://arxiv.org/pdf/2310.18547), SGLang can efficiently support multiple LoRA adapters for different sequences within a single batch of inputs.
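
To make the idea of mixing adapters within one batch concrete, here is a minimal pure-Python sketch. It is not how SGLang's kernels are implemented (those follow S-LoRA and Punica). It only illustrates the arithmetic, assuming each adapter is a pair of low-rank matrices `A` and `B` and that the LoRA scaling factor is folded into `B`. All function and variable names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def batched_lora_forward(
    w: Matrix,
    adapters: Dict[str, Tuple[Matrix, Matrix]],
    inputs: List[Vector],
    adapter_names: List[Optional[str]],
) -> List[Vector]:
    """Apply the shared base weight to every input, then add that
    input's own low-rank update B @ (A @ x) if it names an adapter."""
    outputs = []
    for x, name in zip(inputs, adapter_names):
        y = matvec(w, x)  # base model path, shared by all sequences
        if name is not None:
            a, b = adapters[name]  # A: (r x d_in), B: (d_out x r)
            delta = matvec(b, matvec(a, x))
            y = [yi + di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs


# Toy example: a 2x2 identity base weight and one rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]
adapters = {"lora0": ([[1.0, 1.0]], [[0.5], [0.0]])}
batch = [[1.0, 2.0], [1.0, 2.0]]

# The first sequence uses lora0; the second uses the base model only.
result = batched_lora_forward(w, adapters, batch, ["lora0", None])
print(result)
```

The base weight is applied once per sequence regardless of adapter, so only the small low-rank term differs between sequences in the batch.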

## Arguments for LoRA Serving

The following server arguments are relevant for multi-LoRA serving:

* `lora_paths`: A mapping from each adapter's name to its path, in the form of `{name}={path} {name}={path}`.

* `max_loras_per_batch`: Maximum number of adapters used by each batch. This argument affects the amount of GPU memory reserved for multi-LoRA serving, so set it to a smaller value when memory is scarce. Defaults to 8.

* `lora_backend`: The backend used to run GEMM kernels for LoRA modules. It can be either `triton` or `flashinfer`, and defaults to `triton`. For better performance and stability, we recommend the Triton LoRA backend. Faster backends built on CUTLASS or CUDA kernels will be added in the future.

* `max_lora_rank`: The maximum LoRA rank that should be supported. If not specified, it will be automatically inferred from the adapters provided in `--lora-paths`. This argument is needed when you expect to dynamically load adapters of larger LoRA rank after server startup.

* `lora_target_modules`: The union set of all target modules where LoRA should be applied (e.g., `q_proj`, `k_proj`, `gate_proj`). If not specified, it will be automatically inferred from the adapters provided in `--lora-paths`. This argument is needed when you expect to dynamically load adapters of different target modules after server startup.

* `tp_size`: SGLang supports LoRA serving together with tensor parallelism. `tp_size` controls the number of GPUs used for tensor parallelism. More details on the tensor sharding strategy can be found in the [S-LoRA](https://arxiv.org/pdf/2311.03285) paper.
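
The arguments above can be assembled into a launch command programmatically. The sketch below is one way to do it, using only flags that appear in this notebook. The helper name `build_launch_command` and its parameters are hypothetical, not part of SGLang.

```python
import shlex
from typing import Dict, List, Optional


def build_launch_command(
    model_path: str,
    lora_paths: Dict[str, str],
    max_loras_per_batch: int = 8,
    lora_backend: str = "triton",
    max_lora_rank: Optional[int] = None,
    lora_target_modules: Optional[List[str]] = None,
) -> str:
    """Build a `sglang.launch_server` command line (hypothetical helper)."""
    args = ["python3", "-m", "sglang.launch_server", "--model-path", model_path]
    if lora_paths:
        # --lora-paths takes space-separated {name}={path} entries.
        args.append("--lora-paths")
        args += [f"{name}={path}" for name, path in lora_paths.items()]
    args += ["--max-loras-per-batch", str(max_loras_per_batch)]
    args += ["--lora-backend", lora_backend]
    # Optional flags are emitted only when set, so the server can infer them.
    if max_lora_rank is not None:
        args += ["--max-lora-rank", str(max_lora_rank)]
    if lora_target_modules:
        args += ["--lora-target-modules", *lora_target_modules]
    return shlex.join(args)


cmd = build_launch_command(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    {
        "lora0": "algoprog/fact-generation-llama-3.1-8b-instruct-lora",
        "lora1": "Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16",
    },
    max_loras_per_batch=2,
)
print(cmd)
```

Building the argument list first and joining it with `shlex.join` keeps each `{name}={path}` entry as a single shell token.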

On the client side, the user provides a list of strings as the input batch, along with a list of adapter names, one for each input sequence.
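
As a rough sketch of that client-side contract, the helper below builds a `/generate` request body in the same shape as the request examples in this notebook, where `None` in `lora_path` selects the base model. The helper name and its length check are assumptions added for illustration. They are not part of SGLang.

```python
from typing import Any, Dict, List, Optional


def build_generate_payload(
    prompts: List[str],
    adapter_names: List[Optional[str]],
    max_new_tokens: int = 32,
    temperature: float = 0.0,
) -> Dict[str, Any]:
    """Pair each prompt with an adapter name (hypothetical helper)."""
    if len(prompts) != len(adapter_names):
        raise ValueError(
            f"got {len(prompts)} prompts but {len(adapter_names)} adapter names"
        )
    return {
        "text": list(prompts),
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
        # One entry per prompt; None means "use the base model".
        "lora_path": list(adapter_names),
    }


payload = build_generate_payload(
    ["List 3 countries and their capitals.", "AI is a field of computer science focused on"],
    ["lora0", None],
)
print(payload["lora_path"])
```

Checking the lengths before sending catches a mismatched batch on the client, before any request reaches the server.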

## Usage

### Serving a Single Adapter

In [1]:
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, terminate_process

import json
import requests

In [2]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \
    --max-loras-per-batch 1 --lora-backend triton \
    --disable-radix-cache
"""
)

wait_for_server(f"http://localhost:{port}")

[2025-07-14 06:16:10] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, skip_server_warmup=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, impl='auto', host='127.0.0.1', port=32950, nccl_port=None, mem_fraction_static=0.874, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_outp

[2025-07-14 06:16:26] Attention backend not set. Use fa3 backend by default.
[2025-07-14 06:16:26] Init torch distributed begin.


[2025-07-14 06:16:26] Init torch distributed ends. mem usage=0.00 GB


[2025-07-14 06:16:28] Load weight begin. avail mem=75.96 GB
[2025-07-14 06:16:28] The weight of LmHead is not packed


[2025-07-14 06:16:28] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.29it/s]


Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.21it/s]


Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.17it/s]


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.58it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.42it/s]

[2025-07-14 06:16:32] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=45.11 GB, mem usage=30.84 GB.
[2025-07-14 06:16:32] Using triton as backend of LoRA kernels.
[2025-07-14 06:16:32] Loading weight of LoRA adapter lora0 from algoprog/fact-generation-llama-3.1-8b-instruct-lora


[2025-07-14 06:16:32] Using model weights format ['*.safetensors']
[2025-07-14 06:16:32] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 132.50it/s]

[2025-07-14 06:16:32] Gate projection base_model.model.model.layers.0.mlp.gate_proj.lora_A.weight does not have a corresponding up projection base_model.model.model.layers.0.mlp.up_proj.lora_A.weight. Initializing up projection to zero.
[2025-07-14 06:16:32] Gate projection base_model.model.model.layers.0.mlp.gate_proj.lora_B.weight does not have a corresponding up projection base_model.model.model.layers.0.mlp.up_proj.lora_B.weight. Initializing up projection to zero.
[2025-07-14 06:16:32] Gate projection base_model.model.model.layers.1.mlp.gate_proj.lora_A.weight does not have a corresponding up projection base_model.model.model.layers.1.mlp.up_proj.lora_A.weight. Initializing up proje

[2025-07-14 06:16:32] Gate projection base_model.model.model.layers.12.mlp.gate_proj.lora_A.weight does not have a corresponding up projection base_model.model.model.layers.12.mlp.up_proj.lora_A.weight. Initializing up projection to zero.
[2025-07-14 06:16:32] Gate projection base_model.model.model.layers.12.mlp.gate_proj.lora_B.weight does not have a corresponding up projection base_model.model.model.layers.12.mlp.up_proj.lora_B.weight. Initializing up projection to zero.
[2025-07-14 06:16:32] Gate projection base_model.model.model.layers.13.mlp.gate_proj.lora_A.weight does not have a corresponding up projection base_model.model.model.layers.13.mlp.up_proj.lora_A.weight. Initializing up projection to zero.
[2025-07-14 06:16:32] Gate projection base_model.model.model.layers.13.mlp.gate_proj.lora_B.weight does not have a corresponding up projection base_model.model.model.layers.13.mlp.up_proj.lora_B.weight. Initializing up projection to zero.
[2025-07-14 06:16:32] Gate projection base_m

[2025-07-14 06:16:33] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=131072, available_gpu_mem=41.86 GB


[2025-07-14 06:16:34] INFO:     Started server process [2602485]
[2025-07-14 06:16:34] INFO:     Waiting for application startup.
[2025-07-14 06:16:34] INFO:     Application startup complete.
[2025-07-14 06:16:34] INFO:     Uvicorn running on http://127.0.0.1:32950 (Press CTRL+C to quit)


[2025-07-14 06:16:34] INFO:     127.0.0.1:40910 - "GET /v1/models HTTP/1.1" 200 OK


[2025-07-14 06:16:35] INFO:     127.0.0.1:40916 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-07-14 06:16:35] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-14T06:16:35.035206


[2025-07-14 06:16:36] INFO:     127.0.0.1:40922 - "POST /generate HTTP/1.1" 200 OK
[2025-07-14 06:16:36] The server is fired up and ready to roll!


In [3]:
url = f"http://127.0.0.1:{port}"
json_data = {
    "text": [
        "List 3 countries and their capitals.",
        "AI is a field of computer science focused on",
    ],
    "sampling_params": {"max_new_tokens": 32, "temperature": 0},
    # The first input uses lora0, and the second input uses the base model
    "lora_path": ["lora0", None],
}
response = requests.post(
    url + "/generate",
    json=json_data,
)
print(f"Output 0: {response.json()[0]['text']}")
print(f"Output 1: {response.json()[1]['text']}")

[2025-07-14 06:16:39] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-14T06:16:39.499994


[2025-07-14 06:16:40] Decode batch. #running-req: 1, #token: 0, token usage: 0.00, cuda graph: False, gen throughput (token/s): 5.62, #queue-req: 1, timestamp: 2025-07-14T06:16:40.557064
[2025-07-14 06:16:40] Prefill batch. #new-seq: 1, #new-token: 10, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-14T06:16:40.557302


[2025-07-14 06:16:41] INFO:     127.0.0.1:37602 - "POST /generate HTTP/1.1" 200 OK
Output 0:  Each country and capital should be on a new line. 
France, Paris
Japan, Tokyo
Brazil, Brasília
List 3 countries and their capitals
Output 1:  creating intelligent machines that can perform tasks that typically require human intelligence, such as learning, problem-solving, decision-making, and perception. AI has many applications in various


In [4]:
terminate_process(server_process)

[2025-07-14 06:16:41] Child process unexpectedly failed with exitcode=9. pid=2603599


### Serving Multiple Adapters

In [5]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \
    lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \
    --max-loras-per-batch 2 --lora-backend triton \
    --disable-radix-cache
"""
)

wait_for_server(f"http://localhost:{port}")

[2025-07-14 06:16:53] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, skip_server_warmup=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, impl='auto', host='127.0.0.1', port=37311, nccl_port=None, mem_fraction_static=0.874, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_outp

[2025-07-14 06:17:05] Attention backend not set. Use fa3 backend by default.
[2025-07-14 06:17:05] Init torch distributed begin.


[2025-07-14 06:17:05] Init torch distributed ends. mem usage=0.51 GB


[2025-07-14 06:17:06] Load weight begin. avail mem=57.45 GB
[2025-07-14 06:17:06] The weight of LmHead is not packed


[2025-07-14 06:17:06] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.26it/s]


Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.19it/s]


Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.14it/s]


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.55it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.39it/s]

[2025-07-14 06:17:10] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=27.11 GB, mem usage=30.34 GB.
[2025-07-14 06:17:10] Using triton as backend of LoRA kernels.
[2025-07-14 06:17:10] Loading weight of LoRA adapter lora0 from algoprog/fact-generation-llama-3.1-8b-instruct-lora


[2025-07-14 06:17:10] Using model weights format ['*.safetensors']
[2025-07-14 06:17:10] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 128.49it/s]

[2025-07-14 06:17:10] Gate projection base_model.model.model.layers.0.mlp.gate_proj.lora_A.weight does not have a corresponding up projection base_model.model.model.layers.0.mlp.up_proj.lora_A.weight. Initializing up projection to zero.
[2025-07-14 06:17:10] Gate projection base_model.model.model.layers.0.mlp.gate_proj.lora_B.weight does not have a corresponding up projection base_model.model.model.layers.0.mlp.up_proj.lora_B.weight. Initializing up projection to zero.
[2025-07-14 06:17:10] Gate projection base_model.model.model.layers.1.mlp.gate_proj.lora_A.weight does not have a corresponding up projection base_model.model.model.layers.1.mlp.up_proj.lora_A.weight. Initializing up proje

[2025-07-14 06:17:10] Gate projection base_model.model.model.layers.12.mlp.gate_proj.lora_A.weight does not have a corresponding up projection base_model.model.model.layers.12.mlp.up_proj.lora_A.weight. Initializing up projection to zero.
[2025-07-14 06:17:10] Gate projection base_model.model.model.layers.12.mlp.gate_proj.lora_B.weight does not have a corresponding up projection base_model.model.model.layers.12.mlp.up_proj.lora_B.weight. Initializing up projection to zero.
[2025-07-14 06:17:10] Gate projection base_model.model.model.layers.13.mlp.gate_proj.lora_A.weight does not have a corresponding up projection base_model.model.model.layers.13.mlp.up_proj.lora_A.weight. Initializing up projection to zero.
[2025-07-14 06:17:10] Gate projection base_model.model.model.layers.13.mlp.gate_proj.lora_B.weight does not have a corresponding up projection base_model.model.model.layers.13.mlp.up_proj.lora_B.weight. Initializing up projection to zero.
[2025-07-14 06:17:10] Gate projection base_m

[2025-07-14 06:17:11] Using model weights format ['*.safetensors']
[2025-07-14 06:17:11] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 92.40it/s]



[2025-07-14 06:17:11] LoRA manager ready. Loaded LoRA adapters: lora0, lora1
[2025-07-14 06:17:11] KV Cache is allocated. #tokens: 20480, K size: 1.25 GB, V size: 1.25 GB
[2025-07-14 06:17:11] Memory pool end. avail mem=22.22 GB


[2025-07-14 06:17:12] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=131072, available_gpu_mem=22.12 GB


[2025-07-14 06:17:13] INFO:     Started server process [2605602]
[2025-07-14 06:17:13] INFO:     Waiting for application startup.
[2025-07-14 06:17:13] INFO:     Application startup complete.
[2025-07-14 06:17:13] INFO:     Uvicorn running on http://127.0.0.1:37311 (Press CTRL+C to quit)


[2025-07-14 06:17:13] INFO:     127.0.0.1:36992 - "GET /v1/models HTTP/1.1" 200 OK


[2025-07-14 06:17:14] INFO:     127.0.0.1:37002 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-07-14 06:17:14] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-14T06:17:14.123785


[2025-07-14 06:17:15] INFO:     127.0.0.1:37008 - "POST /generate HTTP/1.1" 200 OK
[2025-07-14 06:17:15] The server is fired up and ready to roll!


In [6]:
url = f"http://127.0.0.1:{port}"
json_data = {
    "text": [
        "List 3 countries and their capitals.",
        "AI is a field of computer science focused on",
    ],
    "sampling_params": {"max_new_tokens": 32, "temperature": 0},
    # The first input uses lora0, and the second input uses lora1
    "lora_path": ["lora0", "lora1"],
}
response = requests.post(
    url + "/generate",
    json=json_data,
)
print(f"Output 0: {response.json()[0]['text']}")
print(f"Output 1: {response.json()[1]['text']}")

[2025-07-14 06:17:18] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-14T06:17:18.595647
[2025-07-14 06:17:18] Prefill batch. #new-seq: 1, #new-token: 10, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0, timestamp: 2025-07-14T06:17:18.597151


[2025-07-14 06:17:20] Decode batch. #running-req: 2, #token: 0, token usage: 0.00, cuda graph: False, gen throughput (token/s): 9.64, #queue-req: 0, timestamp: 2025-07-14T06:17:20.014009
[2025-07-14 06:17:20] INFO:     127.0.0.1:44002 - "POST /generate HTTP/1.1" 200 OK
Output 0:  Each country and capital should be on a new line. 
France, Paris
Japan, Tokyo
Brazil, Brasília
List 3 countries and their capitals
Output 1:  creating intelligent machines capable of performing tasks that typically require human intelligence. AI research has led to the development of various techniques, including machine learning, natural language processing,


In [7]:
terminate_process(server_process)

[2025-07-14 06:17:20] Child process unexpectedly failed with exitcode=9. pid=2606521
[2025-07-14 06:17:20] Child process unexpectedly failed with exitcode=9. pid=2606119


## Dynamic LoRA loading

### Basic Usage

Instead of specifying all adapters at server startup via `--lora-paths`, you can also load and unload LoRA adapters dynamically via the `/load_lora_adapter` and `/unload_lora_adapter` APIs.

(Note that at least one adapter must currently be specified in `--lora-paths` to enable the LoRA feature. This limitation will be lifted soon.)

In [8]:
server_process, port = launch_server_cmd(
    """
    python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --lora-paths lora0=philschmid/code-llama-3-1-8b-text-to-sql-lora \
    --cuda-graph-max-bs 2 \
    --max-loras-per-batch 2 --lora-backend triton \
    --disable-radix-cache
    """
)

url = f"http://127.0.0.1:{port}"
wait_for_server(url)

[2025-07-14 06:17:30] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, skip_server_warmup=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, impl='auto', host='127.0.0.1', port=33270, nccl_port=None, mem_fraction_static=0.874, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_outp

[2025-07-14 06:17:42] Attention backend not set. Use fa3 backend by default.
[2025-07-14 06:17:42] Init torch distributed begin.


[2025-07-14 06:17:42] Init torch distributed ends. mem usage=0.00 GB


[2025-07-14 06:17:43] Load weight begin. avail mem=41.35 GB
[2025-07-14 06:17:43] The weight of LmHead is not packed


[2025-07-14 06:17:44] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.28it/s]


Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.19it/s]


Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.10it/s]


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.50it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.36it/s]

[2025-07-14 06:17:47] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=59.64 GB, mem usage=-18.29 GB.
[2025-07-14 06:17:47] Using triton as backend of LoRA kernels.
[2025-07-14 06:17:47] Loading weight of LoRA adapter lora0 from philschmid/code-llama-3-1-8b-text-to-sql-lora


[2025-07-14 06:17:47] Using model weights format ['*.safetensors']
[2025-07-14 06:17:47] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 88.85it/s]



[2025-07-14 06:17:48] LoRA manager ready. Loaded LoRA adapters: lora0


[2025-07-14 06:17:49] KV Cache is allocated. #tokens: 20480, K size: 1.25 GB, V size: 1.25 GB
[2025-07-14 06:17:49] Memory pool end. avail mem=58.17 GB


[2025-07-14 06:17:49] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=131072, available_gpu_mem=58.07 GB


[2025-07-14 06:17:50] INFO:     Started server process [2608748]
[2025-07-14 06:17:50] INFO:     Waiting for application startup.
[2025-07-14 06:17:50] INFO:     Application startup complete.
[2025-07-14 06:17:50] INFO:     Uvicorn running on http://127.0.0.1:33270 (Press CTRL+C to quit)


[2025-07-14 06:17:51] INFO:     127.0.0.1:56762 - "GET /v1/models HTTP/1.1" 200 OK


[2025-07-14 06:17:51] INFO:     127.0.0.1:56776 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-07-14 06:17:51] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-14T06:17:51.480923


[2025-07-14 06:17:52] INFO:     127.0.0.1:56792 - "POST /generate HTTP/1.1" 200 OK
[2025-07-14 06:17:52] The server is fired up and ready to roll!


In [9]:
response = requests.post(
    url + "/load_lora_adapter",
    json={
        "lora_name": "lora1",
        "lora_path": "Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16",
    },
)

if response.status_code == 200:
    print("LoRA adapter loaded successfully.", response.json())
else:
    print("Failed to load LoRA adapter.", response.json())

[2025-07-14 06:17:56] Start load Lora adapter. Lora name=lora1, path=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16
[2025-07-14 06:17:56] LoRA adapter loading starts: name=lora1, path=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16. avail mem=56.89 GB
[2025-07-14 06:17:56] Loading weight of LoRA adapter lora1 from Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16
[2025-07-14 06:17:56] Using model weights format ['*.safetensors']


[2025-07-14 06:17:56] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 98.68it/s]

[2025-07-14 06:17:56] LoRA adapter loading completes: name=lora1, path=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16. avail mem=56.89 GB
[2025-07-14 06:17:56] Cache flushed successfully!
[2025-07-14 06:17:56] INFO:     127.0.0.1:56798 - "POST /load_lora_adapter HTTP/1.1" 200 OK
LoRA adapter loaded successfully. {'success': True, 'error_message': '', 'loaded_adapters': {'lora0': 'philschmid/code-llama-3-1-8b-text-to-sql-lora', 'lora1': 'Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16'}}
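
The cell above checks only the HTTP status code. Since the response body shown in the output also carries `success`, `error_message`, and `loaded_adapters` fields, a client could inspect those as well. The helper below is a hypothetical sketch. It assumes a failed request reports `success` as false with a non-empty `error_message`, which the output above does not confirm.

```python
from typing import Any, Dict


def loaded_adapters_or_raise(status_code: int, body: Dict[str, Any]) -> Dict[str, str]:
    """Return the name -> path mapping of loaded adapters, or raise on failure."""
    if status_code != 200 or not body.get("success", False):
        # Fall back to the status code when no error message is provided.
        message = body.get("error_message") or f"HTTP status {status_code}"
        raise RuntimeError(f"LoRA adapter request failed: {message}")
    return dict(body.get("loaded_adapters", {}))


# Response body copied from the output above.
example_body = {
    "success": True,
    "error_message": "",
    "loaded_adapters": {
        "lora0": "philschmid/code-llama-3-1-8b-text-to-sql-lora",
        "lora1": "Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16",
    },
}
loaded = loaded_adapters_or_raise(200, example_body)
print(sorted(loaded))
```

With a live server, this could be called as `loaded_adapters_or_raise(response.status_code, response.json())`.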


In [10]:
response = requests.post(
    url + "/generate",
    json={
        "text": [
            "List 3 countries and their capitals.",
            "List 3 countries and their capitals.",
        ],
        "sampling_params": {"max_new_tokens": 32, "temperature": 0},
        "lora_path": ["lora0", "lora1"],
    },
)
print(f"Output from lora0: {response.json()[0]['text']}")
print(f"Output from lora1: {response.json()[1]['text']}")

[2025-07-14 06:17:56] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-14T06:17:56.505299
[2025-07-14 06:17:56] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0, timestamp: 2025-07-14T06:17:56.506398


[2025-07-14 06:17:58] INFO:     127.0.0.1:56806 - "POST /generate HTTP/1.1" 200 OK
Output from lora0:  Country 1 has a capital of Bogor? No, that's not correct. The capital of Country 1 is actually Bogor is not the capital,
Output from lora1:  Give the countries and capitals in the correct order.
Countries: Japan, Brazil, Australia
Capitals: Tokyo, Brasilia, Canberra
1. Japan -


In [11]:
response = requests.post(
    url + "/unload_lora_adapter",
    json={
        "lora_name": "lora0",
    },
)

[2025-07-14 06:17:58] Start unload Lora adapter. Lora name=lora0
[2025-07-14 06:17:58] LoRA adapter unloading starts: name=lora0. avail mem=56.89 GB
[2025-07-14 06:17:58] Unloading LoRA adapter lora0
[2025-07-14 06:17:58] LoRA adapter unloading completes: name=lora0. avail mem=56.89 GB
[2025-07-14 06:17:58] Cache flushed successfully!
[2025-07-14 06:17:58] INFO:     127.0.0.1:56810 - "POST /unload_lora_adapter HTTP/1.1" 200 OK


In [12]:
response = requests.post(
    url + "/load_lora_adapter",
    json={
        "lora_name": "lora2",
        "lora_path": "pbevan11/llama-3.1-8b-ocr-correction",
    },
)

if response.status_code == 200:
    print("LoRA adapter loaded successfully.", response.json())
else:
    print("Failed to load LoRA adapter.", response.json())

[2025-07-14 06:17:58] Start load Lora adapter. Lora name=lora2, path=pbevan11/llama-3.1-8b-ocr-correction
[2025-07-14 06:17:58] LoRA adapter loading starts: name=lora2, path=pbevan11/llama-3.1-8b-ocr-correction. avail mem=56.89 GB
[2025-07-14 06:17:58] Loading weight of LoRA adapter lora2 from pbevan11/llama-3.1-8b-ocr-correction


[2025-07-14 06:17:58] Using model weights format ['*.safetensors']
[2025-07-14 06:17:58] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 89.29it/s]

[2025-07-14 06:17:58] LoRA adapter loading completes: name=lora2, path=pbevan11/llama-3.1-8b-ocr-correction. avail mem=56.89 GB
[2025-07-14 06:17:58] Cache flushed successfully!
[2025-07-14 06:17:58] INFO:     127.0.0.1:56812 - "POST /load_lora_adapter HTTP/1.1" 200 OK
LoRA adapter loaded successfully. {'success': True, 'error_message': '', 'loaded_adapters': {'lora1': 'Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16', 'lora2': 'pbevan11/llama-3.1-8b-ocr-correction'}}


In [13]:
response = requests.post(
    url + "/generate",
    json={
        "text": [
            "List 3 countries and their capitals.",
            "List 3 countries and their capitals.",
        ],
        "sampling_params": {"max_new_tokens": 32, "temperature": 0},
        "lora_path": ["lora1", "lora2"],
    },
)
print(f"Output from lora1: {response.json()[0]['text']}")
print(f"Output from lora2: {response.json()[1]['text']}")

[2025-07-14 06:17:58] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-14T06:17:58.719297
[2025-07-14 06:17:58] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0, timestamp: 2025-07-14T06:17:58.720364


[2025-07-14 06:17:59] INFO:     127.0.0.1:40260 - "POST /generate HTTP/1.1" 200 OK
Output from lora1:  Give the countries and capitals in the correct order.
Countries: Japan, Brazil, Australia
Capitals: Tokyo, Brasilia, Canberra
1. Japan -
Output from lora2:  1.  Japan 2.  China 3.  India  Capital of Japan:  Tokyo Capital of China:  Beijing Capital of India:


In [14]:
terminate_process(server_process)

[2025-07-14 06:17:59] Child process unexpectedly failed with exitcode=9. pid=2609567
[2025-07-14 06:17:59] Child process unexpectedly failed with exitcode=9. pid=2609167


### Advanced: hosting adapters of different shapes

In some cases, you may want to load LoRA adapters with different ranks or target modules (e.g., `q_proj`, `k_proj`) simultaneously. To ensure the server can accommodate all expected LoRA shapes, it's recommended to explicitly specify `--max-lora-rank` and/or `--lora-target-modules` at startup.

For backward compatibility, SGLang will infer these values from `--lora-paths` if they are not explicitly provided. This means it's safe to omit them **only if** all dynamically loaded adapters share the same shape (rank and target modules) as those in the initial `--lora-paths` or are strictly "smaller".
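
One way to choose these two values is to take the maximum rank and the union of target modules over every adapter you expect to host. The sketch below assumes each adapter is described by a dict with `r` and `target_modules` keys, as in a PEFT-style `adapter_config.json`. The helper names and the example shapes are illustrative, not part of SGLang.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple


def plan_lora_shapes(configs: Iterable[Dict[str, Any]]) -> Tuple[int, List[str]]:
    """Return (max rank, sorted union of target modules) over all adapters."""
    max_rank = 0
    modules: Set[str] = set()
    for cfg in configs:
        max_rank = max(max_rank, cfg["r"])
        modules.update(cfg["target_modules"])
    return max_rank, sorted(modules)


def fits(cfg: Dict[str, Any], max_rank: int, modules: Iterable[str]) -> bool:
    """True if the adapter is no larger than the planned shape."""
    return cfg["r"] <= max_rank and set(cfg["target_modules"]) <= set(modules)


# Illustrative adapter shapes (assumed values, not read from real files).
small = {"r": 4, "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"]}
large = {
    "r": 64,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",
    ],
}

max_rank, target_modules = plan_lora_shapes([small, large])
print(f"--max-lora-rank {max_rank} --lora-target-modules {' '.join(target_modules)}")

# Shapes inferred from the small adapter alone would not cover the large one.
inferred_rank, inferred_modules = plan_lora_shapes([small])
print(fits(large, inferred_rank, inferred_modules))
```

The final check mirrors the rule above: an adapter loaded later is safe only if its rank and target modules are covered by what the server was started with.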

In [15]:
lora0 = "Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16"  # rank - 4, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj
lora1 = "algoprog/fact-generation-llama-3.1-8b-instruct-lora"  # rank - 64, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj


# The `--lora-target-modules` param below is technically not needed, as the server will infer it from lora0 which already has all the target modules specified.
# We are adding it here just to demonstrate usage.
server_process, port = launch_server_cmd(
    f"""
    python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --lora-paths lora0={lora0} \
    --cuda-graph-max-bs 2 \
    --max-loras-per-batch 2 --lora-backend triton \
    --disable-radix-cache \
    --max-lora-rank 64 \
    --lora-target-modules q_proj k_proj v_proj o_proj down_proj up_proj gate_proj
    """
)

url = f"http://127.0.0.1:{port}"
wait_for_server(url)

[2025-07-14 06:18:10] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, skip_server_warmup=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, impl='auto', host='127.0.0.1', port=34872, nccl_port=None, mem_fraction_static=0.874, max_running_requests=200, max_total_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_outp

[2025-07-14 06:18:22] Attention backend not set. Use fa3 backend by default.
[2025-07-14 06:18:22] Init torch distributed begin.


[2025-07-14 06:18:23] Init torch distributed ends. mem usage=0.00 GB


[2025-07-14 06:18:24] Load weight begin. avail mem=60.49 GB
[2025-07-14 06:18:24] The weight of LmHead is not packed


[2025-07-14 06:18:25] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.28it/s]


Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.20it/s]


Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.15it/s]


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.54it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.39it/s]

[2025-07-14 06:18:28] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=45.47 GB, mem usage=15.02 GB.
[2025-07-14 06:18:28] Using triton as backend of LoRA kernels.


[2025-07-14 06:18:28] Loading weight of LoRA adapter lora0 from Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16
[2025-07-14 06:18:28] Using model weights format ['*.safetensors']
[2025-07-14 06:18:28] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 99.26it/s]



[2025-07-14 06:18:29] LoRA manager ready. Loaded LoRA adapters: lora0


[2025-07-14 06:18:29] KV Cache is allocated. #tokens: 20480, K size: 1.25 GB, V size: 1.25 GB
[2025-07-14 06:18:29] Memory pool end. avail mem=41.93 GB


[2025-07-14 06:18:30] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=131072, available_gpu_mem=41.84 GB


[2025-07-14 06:18:30] INFO:     Started server process [2612321]
[2025-07-14 06:18:30] INFO:     Waiting for application startup.
[2025-07-14 06:18:30] INFO:     Application startup complete.
[2025-07-14 06:18:30] INFO:     Uvicorn running on http://127.0.0.1:34872 (Press CTRL+C to quit)


[2025-07-14 06:18:30] INFO:     127.0.0.1:42728 - "GET /v1/models HTTP/1.1" 200 OK


[2025-07-14 06:18:31] INFO:     127.0.0.1:42732 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-07-14 06:18:31] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-14T06:18:31.569546


[2025-07-14 06:18:32] INFO:     127.0.0.1:42748 - "POST /generate HTTP/1.1" 200 OK
[2025-07-14 06:18:32] The server is fired up and ready to roll!


In [16]:
response = requests.post(
    url + "/load_lora_adapter",
    json={
        "lora_name": "lora1",
        "lora_path": lora1,
    },
)

if response.status_code == 200:
    print("LoRA adapter loaded successfully.", response.json())
else:
    print("Failed to load LoRA adapter.", response.json())

[2025-07-14 06:18:35] Start load Lora adapter. Lora name=lora1, path=algoprog/fact-generation-llama-3.1-8b-instruct-lora
[2025-07-14 06:18:35] LoRA adapter loading starts: name=lora1, path=algoprog/fact-generation-llama-3.1-8b-instruct-lora. avail mem=41.43 GB
[2025-07-14 06:18:35] Loading weight of LoRA adapter lora1 from algoprog/fact-generation-llama-3.1-8b-instruct-lora
[2025-07-14 06:18:36] Using model weights format ['*.safetensors']


[2025-07-14 06:18:36] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 110.68it/s]

[2025-07-14 06:18:36] Gate projection base_model.model.model.layers.0.mlp.gate_proj.lora_A.weight does not have a corresponding up projection base_model.model.model.layers.0.mlp.up_proj.lora_A.weight. Initializing up projection to zero.
[2025-07-14 06:18:36] Gate projection base_model.model.model.layers.0.mlp.gate_proj.lora_B.weight does not have a corresponding up projection base_model.model.model.layers.0.mlp.up_proj.lora_B.weight. Initializing up projection to zero.
[2025-07-14 06:18:36] Gate projection base_model.model.model.layers.1.mlp.gate_proj.lora_A.weight does not have a corresponding up projection base_model.model.model.layers.1.mlp.up_proj.lora_A.weight. Initializing up projection to zero.
[2025-07-14 06:18:36] Gate projection base_model.mod

[2025-07-14 06:18:36] Gate projection base_model.model.model.layers.23.mlp.gate_proj.lora_A.weight does not have a corresponding up projection base_model.model.model.layers.23.mlp.up_proj.lora_A.weight. Initializing up projection to zero.
[2025-07-14 06:18:36] Gate projection base_model.model.model.layers.23.mlp.gate_proj.lora_B.weight does not have a corresponding up projection base_model.model.model.layers.23.mlp.up_proj.lora_B.weight. Initializing up projection to zero.
[2025-07-14 06:18:36] Gate projection base_model.model.model.layers.24.mlp.gate_proj.lora_A.weight does not have a corresponding up projection base_model.model.model.layers.24.mlp.up_proj.lora_A.weight. Initializing up projection to zero.
[2025-07-14 06:18:36] Gate projection base_model.model.model.layers.24.mlp.gate_proj.lora_B.weight does not have a corresponding up projection base_model.model.model.layers.24.mlp.up_proj.lora_B.weight. Initializing up projection to zero.
[2025-07-14 06:18:36] Gate projection base_m
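
The warnings above indicate that this adapter provides LoRA weights for `gate_proj` but not for `up_proj`, so the missing matrices are initialized to zero. A zero-valued LoRA pair adds nothing to the layer output, which the pure-Python sketch below illustrates using the standard LoRA update `y = W x + scaling * B (A x)`. The helper names, shapes, and values are hypothetical and are not taken from SGLang's implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(base_w, lora_a, lora_b, x, scaling=1.0):
    """Compute y = W x + scaling * B (A x) for one linear layer."""
    base_out = matvec(base_w, x)
    delta = matvec(lora_b, matvec(lora_a, x))
    return [b + scaling * d for b, d in zip(base_out, delta)]


# Toy 2x2 base weight and input (illustrative values only).
base_w = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, -1.0]

# Rank-1 LoRA matrices filled with zeros, as for the missing up projection.
zero_a = [[0.0, 0.0]]    # shape: rank x in_features
zero_b = [[0.0], [0.0]]  # shape: out_features x rank

y_base = matvec(base_w, x)
y_zero = lora_forward(base_w, zero_a, zero_b, x)

# A non-zero pair, for contrast, does change the output.
y_lora = lora_forward(base_w, [[1.0, 0.0]], [[1.0], [0.0]], x)

print(y_base, y_zero, y_lora)
```

With zero matrices the layer behaves exactly like the base model, so the warning does not by itself signal a broken adapter.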

In [17]:
url = f"http://127.0.0.1:{port}"
json_data = {
    "text": [
        "List 3 countries and their capitals.",
        "AI is a field of computer science focused on",
    ],
    "sampling_params": {"max_new_tokens": 32, "temperature": 0},
    # The first input uses lora0, and the second input uses lora1
    "lora_path": ["lora0", "lora1"],
}
response = requests.post(
    url + "/generate",
    json=json_data,
)
# Parse the response body once; the code indexes results in input order.
outputs = response.json()
print(f"Output from lora0: {outputs[0]['text']}")
print(f"Output from lora1: {outputs[1]['text']}")

[2025-07-14 06:18:36] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-14T06:18:36.451552
[2025-07-14 06:18:36] Prefill batch. #new-seq: 1, #new-token: 10, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0, timestamp: 2025-07-14T06:18:36.452702


[2025-07-14 06:18:37] INFO:     127.0.0.1:42756 - "POST /generate HTTP/1.1" 200 OK
Output from lora0:  Give the countries and capitals in the correct order.
Countries: Japan, Brazil, Australia
Capitals: Tokyo, Brasilia, Canberra
1. Japan -
Output from lora1:  creating intelligent machines that can perform tasks that typically require human intelligence. AI is a broad field that encompasses many subfields, including machine learning, natural language processing,
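
Because each prompt is matched to an adapter by position, a length mismatch between `text` and `lora_path` is easy to introduce when batches are built programmatically. The sketch below shows one way to assemble the request body used above and check the pairing before sending it. `build_generate_payload` is a hypothetical helper, not part of SGLang; the field names are the ones used in the request above.

```python
def build_generate_payload(prompts, adapter_names, max_new_tokens=32, temperature=0):
    """Build a /generate request body that pairs each prompt with one adapter name."""
    if len(prompts) != len(adapter_names):
        raise ValueError(
            f"Got {len(prompts)} prompts but {len(adapter_names)} adapter names; "
            "each prompt needs exactly one adapter."
        )
    return {
        "text": list(prompts),
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
        # The i-th adapter name applies to the i-th prompt.
        "lora_path": list(adapter_names),
    }


payload = build_generate_payload(
    [
        "List 3 countries and their capitals.",
        "AI is a field of computer science focused on",
    ],
    ["lora0", "lora1"],
)
print(payload["lora_path"])
```

The resulting dictionary can be passed as `json=payload` to `requests.post(url + "/generate", ...)` exactly as in the cell above.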


In [18]:
terminate_process(server_process)

[2025-07-14 06:18:37] Child process unexpectedly failed with exitcode=9. pid=2613496
[2025-07-14 06:18:37] Child process unexpectedly failed with exitcode=9. pid=2612953


## Future Work

The development roadmap for LoRA-related features can be found in this [issue](https://github.com/sgl-project/sglang/issues/2929). Currently, radix attention is incompatible with LoRA and must be manually disabled. Other features, including Unified Paging, a Cutlass backend, and dynamic loading/unloading, are still under development.
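
As a rough illustration of disabling radix attention manually, the sketch below assembles a server launch command from the arguments described earlier. It assumes the `sglang.launch_server` entry point and the `--disable-radix-cache` flag; check both against your installed SGLang version. `build_launch_command` is a hypothetical helper used only for illustration.

```python
import shlex


def build_launch_command(model_path, lora_paths, max_loras_per_batch=8, lora_backend="triton"):
    """Return an argument list for launching an SGLang server with LoRA adapters."""
    cmd = ["python3", "-m", "sglang.launch_server", "--model-path", model_path]
    # Adapters are passed as space-separated {name}={path} pairs.
    cmd.append("--lora-paths")
    cmd += [f"{name}={path}" for name, path in lora_paths.items()]
    cmd += [
        "--max-loras-per-batch", str(max_loras_per_batch),
        "--lora-backend", lora_backend,
        # Assumed flag for turning off radix attention (the prefix cache).
        "--disable-radix-cache",
    ]
    return cmd


command = build_launch_command(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    {"lora0": "Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16"},
    max_loras_per_batch=1,
)
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, so it can be handed to `subprocess.Popen` without shell quoting issues.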