# LoRA Serving

SGLang enables the use of [LoRA adapters](https://arxiv.org/abs/2106.09685) with a base model. By incorporating techniques from [S-LoRA](https://arxiv.org/pdf/2311.03285) and [Punica](https://arxiv.org/pdf/2310.18547), SGLang can efficiently support multiple LoRA adapters for different sequences within a single batch of inputs.

## Arguments for LoRA Serving

The following server arguments are relevant for multi-LoRA serving:

* `lora_paths`: A mapping from each adapter's name to its path, in the form of `{name}={path} {name}={path}`.

* `max_loras_per_batch`: Maximum number of adapters used by each batch. This argument affects the amount of GPU memory reserved for multi-LoRA serving, so it should be set to a smaller value when memory is scarce. Defaults to 8.

* `lora_backend`: The backend used to run GEMM kernels for LoRA modules. It can be either `triton` or `flashinfer`, and defaults to `triton`. For better performance and stability, we recommend the Triton LoRA backend. Faster backends built on CUTLASS or CUDA kernels will be added in the future.

* `max_lora_rank`: The maximum LoRA rank that should be supported. If not specified, it will be automatically inferred from the adapters provided in `--lora-paths`. This argument is needed when you expect to dynamically load adapters of larger LoRA rank after server startup.

* `lora_target_modules`: The union set of all target modules where LoRA should be applied (e.g., `q_proj`, `k_proj`, `gate_proj`). If not specified, it will be automatically inferred from the adapters provided in `--lora-paths`. This argument is needed when you expect to dynamically load adapters of different target modules after server startup.

* `tp_size`: SGLang supports LoRA serving together with tensor parallelism. `tp_size` controls the number of GPUs used for tensor parallelism. More details on the tensor sharding strategy can be found in the [S-LoRA](https://arxiv.org/pdf/2311.03285) paper.
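As a rough illustration of how these arguments fit together, the sketch below assembles a launch command from a name-to-path mapping. `build_launch_args` is a hypothetical helper, not part of SGLang. The flag names are the ones used elsewhere in this document, except `--tp-size`, which is assumed here to be the CLI spelling of `tp_size`.

```python
from typing import Dict, List, Optional


def build_launch_args(
    model_path: str,
    lora_paths: Dict[str, str],
    max_loras_per_batch: int = 8,
    lora_backend: str = "triton",
    max_lora_rank: Optional[int] = None,
    tp_size: int = 1,
) -> List[str]:
    """Assemble an argv list for `python3 -m sglang.launch_server`.

    Hypothetical helper for illustration only.
    """
    if not lora_paths:
        raise ValueError("at least one adapter must be given in lora_paths")
    if lora_backend not in ("triton", "flashinfer"):
        raise ValueError(f"unknown lora_backend: {lora_backend!r}")

    args = ["python3", "-m", "sglang.launch_server", "--model-path", model_path]
    # --lora-paths takes space-separated {name}={path} entries.
    args.append("--lora-paths")
    args.extend(f"{name}={path}" for name, path in lora_paths.items())
    args += ["--max-loras-per-batch", str(max_loras_per_batch)]
    args += ["--lora-backend", lora_backend]
    if max_lora_rank is not None:
        args += ["--max-lora-rank", str(max_lora_rank)]
    if tp_size > 1:
        # Assumption: `tp_size` is exposed on the command line as --tp-size.
        args += ["--tp-size", str(tp_size)]
    return args


if __name__ == "__main__":
    cmd = build_launch_args(
        "meta-llama/Meta-Llama-3.1-8B-Instruct",
        {"lora0": "algoprog/fact-generation-llama-3.1-8b-instruct-lora"},
        max_loras_per_batch=1,
    )
    print(" ".join(cmd))
```

Returning a list rather than a single string keeps each adapter entry as its own argument, which avoids quoting problems if a path contains spaces.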

On the client side, the user provides a list of strings as the input batch, along with a list of adapter names, one for each input sequence.
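A minimal sketch of that pairing, assuming the `/generate` request fields used in the examples in this document (`text`, `sampling_params`, `lora_path`). `build_generate_payload` is a hypothetical helper name. It only builds the JSON body and checks that every prompt has a matching adapter entry.

```python
from typing import Any, Dict, List, Optional


def build_generate_payload(
    prompts: List[str],
    adapters: List[Optional[str]],
    max_new_tokens: int = 32,
    temperature: float = 0.0,
) -> Dict[str, Any]:
    """Pair each prompt with an adapter name; None selects the base model."""
    if len(prompts) != len(adapters):
        raise ValueError(
            f"got {len(prompts)} prompts but {len(adapters)} adapter entries"
        )
    return {
        "text": list(prompts),
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
        "lora_path": list(adapters),
    }


if __name__ == "__main__":
    payload = build_generate_payload(
        ["List 3 countries and their capitals.", "AI is a field of"],
        ["lora0", None],
    )
    print(payload)
```

The resulting dictionary can be passed as the `json=` argument of `requests.post(url + "/generate", ...)`.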

## Usage

### Serving a Single Adapter

In [1]:
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, terminate_process

import json
import requests

In [2]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \
    --max-loras-per-batch 1 --lora-backend triton \
    --disable-radix-cache
"""
)

wait_for_server(f"http://localhost:{port}")

[2025-07-19 05:11:57] Attention backend not explicitly specified. Use fa3 backend by default.
[2025-07-19 05:11:57] Init torch distributed begin.


[2025-07-19 05:11:58] Init torch distributed ends. mem usage=0.00 GB


[2025-07-19 05:12:00] Load weight begin. avail mem=75.96 GB


[2025-07-19 05:12:01] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.21it/s]


Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.10it/s]


Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.06it/s]


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.44it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.29it/s]

[2025-07-19 05:12:04] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=45.02 GB, mem usage=30.94 GB.
[2025-07-19 05:12:04] Using triton as backend of LoRA kernels.
[2025-07-19 05:12:04] Loading weight of LoRA adapter lora0 from algoprog/fact-generation-llama-3.1-8b-instruct-lora


[2025-07-19 05:12:04] Using model weights format ['*.safetensors']
[2025-07-19 05:12:04] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 139.82it/s]

[2025-07-19 05:12:04] Gate projection base_model.model.model.layers.0.mlp.gate_proj.lora_A.weight does not have a corresponding up projection base_model.model.model.layers.0.mlp.up_proj.lora_A.weight. Initializing up projection to zero.

[2025-07-19 05:12:05] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=131072, available_gpu_mem=41.86 GB


[2025-07-19 05:12:06] INFO:     Started server process [344283]
[2025-07-19 05:12:06] INFO:     Waiting for application startup.
[2025-07-19 05:12:06] INFO:     Application startup complete.
[2025-07-19 05:12:06] INFO:     Uvicorn running on http://127.0.0.1:38951 (Press CTRL+C to quit)
[2025-07-19 05:12:06] INFO:     127.0.0.1:36900 - "GET /v1/models HTTP/1.1" 200 OK


[2025-07-19 05:12:07] INFO:     127.0.0.1:36912 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-07-19 05:12:07] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-19T05:12:07.386512


[2025-07-19 05:12:08] INFO:     127.0.0.1:36928 - "POST /generate HTTP/1.1" 200 OK
[2025-07-19 05:12:08] The server is fired up and ready to roll!


In [3]:
url = f"http://127.0.0.1:{port}"
json_data = {
    "text": [
        "List 3 countries and their capitals.",
        "AI is a field of computer science focused on",
    ],
    "sampling_params": {"max_new_tokens": 32, "temperature": 0},
    # The first input uses lora0, and the second input uses the base model
    "lora_path": ["lora0", None],
}
response = requests.post(
    url + "/generate",
    json=json_data,
)
print(f"Output 0: {response.json()[0]['text']}")
print(f"Output 1: {response.json()[1]['text']}")

[2025-07-19 05:12:11] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 1, timestamp: 2025-07-19T05:12:11.465568


[2025-07-19 05:12:12] Decode batch. #running-req: 1, #token: 0, token usage: 0.00, cuda graph: False, gen throughput (token/s): 5.96, #queue-req: 1, timestamp: 2025-07-19T05:12:12.496302
[2025-07-19 05:12:12] Prefill batch. #new-seq: 1, #new-token: 10, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-19T05:12:12.496539


[2025-07-19 05:12:13] INFO:     127.0.0.1:53362 - "POST /generate HTTP/1.1" 200 OK
Output 0:  Each country and capital should be on a new line. 
France, Paris
Japan, Tokyo
Brazil, Brasília
List 3 countries and their capitals
Output 1:  creating intelligent machines that can perform tasks that typically require human intelligence, such as learning, problem-solving, decision-making, and perception. AI has many applications in various
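In the log above, the two prompts are prefilled in separate batches (`#new-seq: 1` twice), even though they arrived in one request. This would be consistent with `--max-loras-per-batch 1` if the base model (`None`) counts as its own entry, but that is an assumption here, not something this document states. The toy function below illustrates such a constraint by greedily splitting a request list into batches with a bounded number of distinct adapters. It is a simplified illustration with a hypothetical name, not SGLang's scheduler.

```python
from typing import List, Optional, Set


def split_by_adapter_limit(
    adapters: List[Optional[str]], max_loras_per_batch: int
) -> List[List[int]]:
    """Group request indices so each batch uses at most
    `max_loras_per_batch` distinct adapters.

    In this toy model, None (the base model) is counted as one entry.
    """
    if max_loras_per_batch < 1:
        raise ValueError("max_loras_per_batch must be at least 1")

    batches: List[List[int]] = []
    current: List[int] = []
    in_use: Set[Optional[str]] = set()
    for idx, name in enumerate(adapters):
        # Start a new batch when a new adapter would exceed the limit.
        if name not in in_use and len(in_use) == max_loras_per_batch:
            batches.append(current)
            current, in_use = [], set()
        current.append(idx)
        in_use.add(name)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    print(split_by_adapter_limit(["lora0", None], max_loras_per_batch=1))
    print(split_by_adapter_limit(["lora0", "lora1"], max_loras_per_batch=2))
```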


In [4]:
terminate_process(server_process)

[2025-07-19 05:12:13] Child process unexpectedly failed with exitcode=9. pid=345274


### Serving Multiple Adapters

In [5]:
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \
    lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \
    --max-loras-per-batch 2 --lora-backend triton \
    --disable-radix-cache
"""
)

wait_for_server(f"http://localhost:{port}")

[2025-07-19 05:12:36] Attention backend not explicitly specified. Use fa3 backend by default.
[2025-07-19 05:12:36] Init torch distributed begin.


[2025-07-19 05:12:36] Init torch distributed ends. mem usage=0.00 GB


[2025-07-19 05:12:38] Load weight begin. avail mem=57.01 GB


[2025-07-19 05:12:38] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.28it/s]


Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.15it/s]


Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.13it/s]


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.54it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.38it/s]

[2025-07-19 05:12:41] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=27.09 GB, mem usage=29.92 GB.
[2025-07-19 05:12:41] Using triton as backend of LoRA kernels.
[2025-07-19 05:12:41] Loading weight of LoRA adapter lora0 from algoprog/fact-generation-llama-3.1-8b-instruct-lora


[2025-07-19 05:12:41] Using model weights format ['*.safetensors']
[2025-07-19 05:12:41] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 135.67it/s]

[2025-07-19 05:12:41] Gate projection base_model.model.model.layers.0.mlp.gate_proj.lora_A.weight does not have a corresponding up projection base_model.model.model.layers.0.mlp.up_proj.lora_A.weight. Initializing up projection to zero.

[2025-07-19 05:12:42] Using model weights format ['*.safetensors']
[2025-07-19 05:12:42] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 110.73it/s]

[2025-07-19 05:12:42] LoRA manager ready. Loaded LoRA adapters: lora0, lora1
[2025-07-19 05:12:42] KV Cache is allocated. #tokens: 20480, K size: 1.25 GB, V size: 1.25 GB
[2025-07-19 05:12:42] Memory pool end. avail mem=23.57 GB


[2025-07-19 05:12:43] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=131072, available_gpu_mem=22.12 GB


[2025-07-19 05:12:43] INFO:     Started server process [347404]
[2025-07-19 05:12:43] INFO:     Waiting for application startup.
[2025-07-19 05:12:43] INFO:     Application startup complete.
[2025-07-19 05:12:43] INFO:     Uvicorn running on http://127.0.0.1:32062 (Press CTRL+C to quit)


[2025-07-19 05:12:44] INFO:     127.0.0.1:56258 - "GET /v1/models HTTP/1.1" 200 OK
[2025-07-19 05:12:44] INFO:     127.0.0.1:56268 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-07-19 05:12:44] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-19T05:12:44.602359


[2025-07-19 05:12:45] INFO:     127.0.0.1:56284 - "POST /generate HTTP/1.1" 200 OK
[2025-07-19 05:12:45] The server is fired up and ready to roll!


In [6]:
url = f"http://127.0.0.1:{port}"
json_data = {
    "text": [
        "List 3 countries and their capitals.",
        "AI is a field of computer science focused on",
    ],
    "sampling_params": {"max_new_tokens": 32, "temperature": 0},
    # The first input uses lora0, and the second input uses lora1
    "lora_path": ["lora0", "lora1"],
}
response = requests.post(
    url + "/generate",
    json=json_data,
)
print(f"Output 0: {response.json()[0]['text']}")
print(f"Output 1: {response.json()[1]['text']}")

[2025-07-19 05:12:49] Prefill batch. #new-seq: 2, #new-token: 19, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-19T05:12:49.474773


[2025-07-19 05:12:50] Decode batch. #running-req: 2, #token: 0, token usage: 0.00, cuda graph: False, gen throughput (token/s): 9.04, #queue-req: 0, timestamp: 2025-07-19T05:12:50.983816
[2025-07-19 05:12:50] INFO:     127.0.0.1:56296 - "POST /generate HTTP/1.1" 200 OK
Output 0:  Each country and capital should be on a new line. 
France, Paris
Japan, Tokyo
Brazil, Brasília
List 3 countries and their capitals
Output 1:  creating intelligent machines capable of performing tasks that typically require human intelligence. AI research encompasses a range of disciplines, including machine learning, natural language processing, computer vision,


In [7]:
terminate_process(server_process)

[2025-07-19 05:12:50] Child process unexpectedly failed with exitcode=9. pid=348325


## Dynamic LoRA loading

### Basic Usage

Instead of specifying all adapters at server startup via `--lora-paths`, you can also load and unload LoRA adapters dynamically via the `/load_lora_adapter` and `/unload_lora_adapter` APIs.

(Note that, currently, at least one adapter must still be specified in `--lora-paths` to enable the LoRA feature. This limitation will be lifted soon.)

In [8]:
server_process, port = launch_server_cmd(
    """
    python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --lora-paths lora0=philschmid/code-llama-3-1-8b-text-to-sql-lora \
    --cuda-graph-max-bs 2 \
    --max-loras-per-batch 2 --lora-backend triton \
    --disable-radix-cache
    """
)

url = f"http://127.0.0.1:{port}"
wait_for_server(url)

[2025-07-19 05:13:14] Attention backend not explicitly specified. Use fa3 backend by default.
[2025-07-19 05:13:14] Init torch distributed begin.


[2025-07-19 05:13:14] Init torch distributed ends. mem usage=0.00 GB


[2025-07-19 05:13:15] Load weight begin. avail mem=57.64 GB


[2025-07-19 05:13:16] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.30it/s]


Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.21it/s]


Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.21it/s]


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.65it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.47it/s]

[2025-07-19 05:13:19] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.48 GB, mem usage=-22.26 GB.
[2025-07-19 05:13:19] Using triton as backend of LoRA kernels.
[2025-07-19 05:13:19] Loading weight of LoRA adapter lora0 from philschmid/code-llama-3-1-8b-text-to-sql-lora
[2025-07-19 05:13:19] Using model weights format ['*.safetensors']


[2025-07-19 05:13:19] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 92.48it/s]



[2025-07-19 05:13:20] LoRA manager ready. Loaded LoRA adapters: lora0
[2025-07-19 05:13:20] KV Cache is allocated. #tokens: 20480, K size: 1.25 GB, V size: 1.25 GB
[2025-07-19 05:13:20] Memory pool end. avail mem=58.17 GB


[2025-07-19 05:13:21] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=131072, available_gpu_mem=58.07 GB


[2025-07-19 05:13:21] INFO:     Started server process [350621]
[2025-07-19 05:13:21] INFO:     Waiting for application startup.
[2025-07-19 05:13:21] INFO:     Application startup complete.
[2025-07-19 05:13:21] INFO:     Uvicorn running on http://127.0.0.1:32773 (Press CTRL+C to quit)


[2025-07-19 05:13:22] INFO:     127.0.0.1:40548 - "GET /v1/models HTTP/1.1" 200 OK


[2025-07-19 05:13:22] INFO:     127.0.0.1:40556 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-07-19 05:13:22] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-19T05:13:22.742784


[2025-07-19 05:13:23] INFO:     127.0.0.1:40560 - "POST /generate HTTP/1.1" 200 OK
[2025-07-19 05:13:23] The server is fired up and ready to roll!


In [9]:
response = requests.post(
    url + "/load_lora_adapter",
    json={
        "lora_name": "lora1",
        "lora_path": "Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16",
    },
)

if response.status_code == 200:
    print("LoRA adapter loaded successfully.", response.json())
else:
    print("Failed to load LoRA adapter.", response.json())

[2025-07-19 05:13:27] Start load Lora adapter. Lora name=lora1, path=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16
[2025-07-19 05:13:27] LoRA adapter loading starts: name=lora1, path=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16. avail mem=56.89 GB
[2025-07-19 05:13:27] Loading weight of LoRA adapter lora1 from Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16
[2025-07-19 05:13:27] Using model weights format ['*.safetensors']


[2025-07-19 05:13:27] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 102.07it/s]

[2025-07-19 05:13:27] LoRA adapter loading completes: name=lora1, path=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16. avail mem=56.89 GB
[2025-07-19 05:13:27] Cache flushed successfully!
[2025-07-19 05:13:27] INFO:     127.0.0.1:40570 - "POST /load_lora_adapter HTTP/1.1" 200 OK
LoRA adapter loaded successfully. {'success': True, 'error_message': '', 'loaded_adapters': {'lora0': 'philschmid/code-llama-3-1-8b-text-to-sql-lora', 'lora1': 'Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16'}}
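The load response printed above is a JSON object with `success`, `error_message`, and `loaded_adapters` fields. A small helper along the following lines could turn a failed load into an exception rather than a printed message. The function name is hypothetical, and it assumes that error responses carry the same fields, which this document does not show.

```python
from typing import Any, Dict


def loaded_adapters_or_raise(status_code: int, body: Dict[str, Any]) -> Dict[str, str]:
    """Return the name-to-path mapping of loaded adapters, or raise on failure."""
    if status_code != 200 or not body.get("success", False):
        # Fall back to the HTTP status if no error message is present.
        message = body.get("error_message") or f"HTTP status {status_code}"
        raise RuntimeError(f"LoRA adapter request failed: {message}")
    return dict(body.get("loaded_adapters", {}))


if __name__ == "__main__":
    sample = {
        "success": True,
        "error_message": "",
        "loaded_adapters": {
            "lora0": "philschmid/code-llama-3-1-8b-text-to-sql-lora",
            "lora1": "Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16",
        },
    }
    print(sorted(loaded_adapters_or_raise(200, sample)))
```

With a live server, it would be called as `loaded_adapters_or_raise(response.status_code, response.json())`.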


In [10]:
response = requests.post(
    url + "/generate",
    json={
        "text": [
            "List 3 countries and their capitals.",
            "List 3 countries and their capitals.",
        ],
        "sampling_params": {"max_new_tokens": 32, "temperature": 0},
        "lora_path": ["lora0", "lora1"],
    },
)
print(f"Output from lora0: {response.json()[0]['text']}")
print(f"Output from lora1: {response.json()[1]['text']}")

[2025-07-19 05:13:27] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-19T05:13:27.368612
[2025-07-19 05:13:27] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0, timestamp: 2025-07-19T05:13:27.369620


[2025-07-19 05:13:29] INFO:     127.0.0.1:40582 - "POST /generate HTTP/1.1" 200 OK
Output from lora0:  Country 1 has a capital of Bogor? No, that's not correct. The capital of Country 1 is actually Bogor is not the capital,
Output from lora1:  Give the countries and capitals in the correct order.
Countries: Japan, Brazil, Australia
Capitals: Tokyo, Brasilia, Canberra
1. Japan -


In [11]:
response = requests.post(
    url + "/unload_lora_adapter",
    json={
        "lora_name": "lora0",
    },
)

[2025-07-19 05:13:29] Start unload Lora adapter. Lora name=lora0
[2025-07-19 05:13:29] LoRA adapter unloading starts: name=lora0. avail mem=56.88 GB
[2025-07-19 05:13:29] Unloading LoRA adapter lora0
[2025-07-19 05:13:29] LoRA adapter unloading completes: name=lora0. avail mem=56.88 GB
[2025-07-19 05:13:29] Cache flushed successfully!
[2025-07-19 05:13:29] INFO:     127.0.0.1:40590 - "POST /unload_lora_adapter HTTP/1.1" 200 OK


In [12]:
response = requests.post(
    url + "/load_lora_adapter",
    json={
        "lora_name": "lora2",
        "lora_path": "pbevan11/llama-3.1-8b-ocr-correction",
    },
)

if response.status_code == 200:
    print("LoRA adapter loaded successfully.", response.json())
else:
    print("Failed to load LoRA adapter.", response.json())

[2025-07-19 05:13:29] Start load Lora adapter. Lora name=lora2, path=pbevan11/llama-3.1-8b-ocr-correction
[2025-07-19 05:13:29] LoRA adapter loading starts: name=lora2, path=pbevan11/llama-3.1-8b-ocr-correction. avail mem=56.88 GB
[2025-07-19 05:13:29] Loading weight of LoRA adapter lora2 from pbevan11/llama-3.1-8b-ocr-correction


[2025-07-19 05:13:29] Using model weights format ['*.safetensors']


[2025-07-19 05:13:29] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 91.35it/s]

[2025-07-19 05:13:29] LoRA adapter loading completes: name=lora2, path=pbevan11/llama-3.1-8b-ocr-correction. avail mem=56.88 GB
[2025-07-19 05:13:29] Cache flushed successfully!
[2025-07-19 05:13:29] INFO:     127.0.0.1:40606 - "POST /load_lora_adapter HTTP/1.1" 200 OK
LoRA adapter loaded successfully. {'success': True, 'error_message': '', 'loaded_adapters': {'lora1': 'Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16', 'lora2': 'pbevan11/llama-3.1-8b-ocr-correction'}}


In [13]:
response = requests.post(
    url + "/generate",
    json={
        "text": [
            "List 3 countries and their capitals.",
            "List 3 countries and their capitals.",
        ],
        "sampling_params": {"max_new_tokens": 32, "temperature": 0},
        "lora_path": ["lora1", "lora2"],
    },
)
print(f"Output from lora1: {response.json()[0]['text']}")
print(f"Output from lora2: {response.json()[1]['text']}")

[2025-07-19 05:13:29] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-19T05:13:29.637794
[2025-07-19 05:13:29] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0, timestamp: 2025-07-19T05:13:29.638760


[2025-07-19 05:13:30] INFO:     127.0.0.1:40610 - "POST /generate HTTP/1.1" 200 OK
Output from lora1:  Give the countries and capitals in the correct order.
Countries: Japan, Brazil, Australia
Capitals: Tokyo, Brasilia, Canberra
1. Japan -
Output from lora2:  1.  Japan 2.  China 3.  India  Capital of Japan:  Tokyo Capital of China:  Beijing Capital of India:


In [14]:
terminate_process(server_process)

[2025-07-19 05:13:30] Child process unexpectedly failed with exitcode=9. pid=351685
[2025-07-19 05:13:30] Child process unexpectedly failed with exitcode=9. pid=351171


### Advanced: hosting adapters of different shapes

In some cases, you may want to load LoRA adapters with different ranks or target modules (e.g., `q_proj`, `k_proj`) simultaneously. To ensure the server can accommodate all expected LoRA shapes, it's recommended to explicitly specify `--max-lora-rank` and/or `--lora-target-modules` at startup.

For backward compatibility, SGLang will infer these values from `--lora-paths` if they are not explicitly provided. This means it's safe to omit them **only if** all dynamically loaded adapters share the same shape (rank and target modules) as those in the initial `--lora-paths` or are strictly "smaller".
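One way to choose these values is to take the largest rank and the union of target modules across every adapter you expect to host. The sketch below does that for adapter configurations given as dictionaries with `r` and `target_modules` keys, which are the field names PEFT writes to `adapter_config.json`. The helper names are hypothetical, and the sample values mirror the ranks and target modules of the two adapters used in this section.

```python
from typing import Any, Dict, Iterable, List, Tuple

AdapterConfig = Dict[str, Any]


def required_lora_shape(configs: Iterable[AdapterConfig]) -> Tuple[int, List[str]]:
    """Return (max rank, sorted union of target modules) over all adapters."""
    configs = list(configs)
    if not configs:
        raise ValueError("at least one adapter config is required")
    max_rank = max(int(cfg["r"]) for cfg in configs)
    modules = sorted({m for cfg in configs for m in cfg["target_modules"]})
    return max_rank, modules


def fits(config: AdapterConfig, max_rank: int, modules: Iterable[str]) -> bool:
    """True if the adapter's rank and target modules are covered by the given shape."""
    return int(config["r"]) <= max_rank and set(config["target_modules"]) <= set(modules)


if __name__ == "__main__":
    small = {"r": 4, "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"]}
    large = {
        "r": 64,
        "target_modules": [
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
    }
    rank, modules = required_lora_shape([small, large])
    print(f"--max-lora-rank {rank} --lora-target-modules {' '.join(modules)}")
    # The larger adapter does not fit a server sized only for the smaller one.
    print(fits(large, small["r"], small["target_modules"]))
```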

In [15]:
lora0 = "Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16"  # rank - 4, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj
lora1 = "algoprog/fact-generation-llama-3.1-8b-instruct-lora"  # rank - 64, target modules - q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj


# The `--lora-target-modules` param below is technically not needed, as the server will infer it from lora0 which already has all the target modules specified.
# We are adding it here just to demonstrate usage.
server_process, port = launch_server_cmd(
    f"""
    python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --lora-paths lora0={lora0} \
    --cuda-graph-max-bs 2 \
    --max-loras-per-batch 2 --lora-backend triton \
    --disable-radix-cache \
    --max-lora-rank 64 \
    --lora-target-modules q_proj k_proj v_proj o_proj down_proj up_proj gate_proj
    """
)

url = f"http://127.0.0.1:{port}"
wait_for_server(url)

[2025-07-19 05:13:54] Attention backend not explicitly specified. Use fa3 backend by default.
[2025-07-19 05:13:54] Init torch distributed begin.


[2025-07-19 05:13:54] Init torch distributed ends. mem usage=0.00 GB


[2025-07-19 05:13:56] Load weight begin. avail mem=60.49 GB


[2025-07-19 05:13:56] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.22it/s]


Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.12it/s]


Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.10it/s]


Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.49it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.33it/s]

[2025-07-19 05:13:59] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=45.47 GB, mem usage=15.02 GB.
[2025-07-19 05:13:59] Using triton as backend of LoRA kernels.
[2025-07-19 05:13:59] Loading weight of LoRA adapter lora0 from Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16


[2025-07-19 05:13:59] Using model weights format ['*.safetensors']
[2025-07-19 05:14:00] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 110.28it/s]



[2025-07-19 05:14:00] LoRA manager ready. Loaded LoRA adapters: lora0


[2025-07-19 05:14:00] KV Cache is allocated. #tokens: 20480, K size: 1.25 GB, V size: 1.25 GB
[2025-07-19 05:14:00] Memory pool end. avail mem=41.93 GB


[2025-07-19 05:14:01] max_total_num_tokens=20480, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=200, context_len=131072, available_gpu_mem=41.84 GB


[2025-07-19 05:14:02] INFO:     Started server process [354142]
[2025-07-19 05:14:02] INFO:     Waiting for application startup.
[2025-07-19 05:14:02] INFO:     Application startup complete.
[2025-07-19 05:14:02] INFO:     Uvicorn running on http://127.0.0.1:32694 (Press CTRL+C to quit)


[2025-07-19 05:14:02] INFO:     127.0.0.1:46422 - "GET /v1/models HTTP/1.1" 200 OK


[2025-07-19 05:14:03] INFO:     127.0.0.1:46432 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-07-19 05:14:03] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-19T05:14:03.047923


[2025-07-19 05:14:04] INFO:     127.0.0.1:46436 - "POST /generate HTTP/1.1" 200 OK
[2025-07-19 05:14:04] The server is fired up and ready to roll!
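
The KV cache sizes reported in the startup log above can be checked by hand. The sketch below is a rough sanity check, not SGLang's own accounting: it assumes the publicly documented Llama 3.1 8B configuration (32 layers, 8 key/value heads, head dimension 128), a 2-byte bfloat16 cache matching the model dtype in the log, and that the log reports binary gigabytes. The function name `kv_cache_bytes_per_tensor` is hypothetical.

```python
def kv_cache_bytes_per_tensor(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes needed to store K (or V) for `num_tokens` tokens across all layers."""
    return num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed model configuration (Llama 3.1 8B) and cache dtype (bfloat16 = 2 bytes).
k_bytes = kv_cache_bytes_per_tensor(
    num_tokens=20480,  # "#tokens: 20480" from the log
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
    bytes_per_elem=2,
)
k_gib = k_bytes / 2**30
print(f"K size: {k_gib:.2f} GB")  # prints "K size: 1.25 GB"
```

Under these assumptions the result matches the `K size: 1.25 GB` line in the log, and the V cache has the same size.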


In [16]:
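# Load a second adapter into the running server and register it as "lora1".
response = requests.post(
    url + "/load_lora_adapter",
    json={
        "lora_name": "lora1",  # name used to select this adapter in later requests
        "lora_path": lora1,  # path or Hugging Face repo ID of the adapter
    },
)

if response.status_code == 200:
    print("LoRA adapter loaded successfully.", response.json())
else:
    print("Failed to load LoRA adapter.", response.json())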

[2025-07-19 05:14:07] Start load Lora adapter. Lora name=lora1, path=algoprog/fact-generation-llama-3.1-8b-instruct-lora
[2025-07-19 05:14:07] LoRA adapter loading starts: name=lora1, path=algoprog/fact-generation-llama-3.1-8b-instruct-lora. avail mem=41.43 GB
[2025-07-19 05:14:07] Loading weight of LoRA adapter lora1 from algoprog/fact-generation-llama-3.1-8b-instruct-lora
[2025-07-19 05:14:07] Using model weights format ['*.safetensors']


[2025-07-19 05:14:08] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 112.46it/s]

[2025-07-19 05:14:08] Gate projection base_model.model.model.layers.0.mlp.gate_proj.lora_A.weight does not have a corresponding up projection base_model.model.model.layers.0.mlp.up_proj.lora_A.weight. Initializing up projection to zero.
[2025-07-19 05:14:08] Gate projection base_model.model.model.layers.0.mlp.gate_proj.lora_B.weight does not have a corresponding up projection base_model.model.model.layers.0.mlp.up_proj.lora_B.weight. Initializing up projection to zero.
[2025-07-19 05:14:08] Gate projection base_model.model.model.layers.1.mlp.gate_proj.lora_A.weight does not have a corresponding up projection base_model.model.model.layers.1.mlp.up_proj.lora_A.weight. Initializing up projection to zero.
[2025-07-19 05:14:08] Gate projection base_model.mod

[2025-07-19 05:14:08] Gate projection base_model.model.model.layers.24.mlp.gate_proj.lora_A.weight does not have a corresponding up projection base_model.model.model.layers.24.mlp.up_proj.lora_A.weight. Initializing up projection to zero.
[2025-07-19 05:14:08] Gate projection base_model.model.model.layers.24.mlp.gate_proj.lora_B.weight does not have a corresponding up projection base_model.model.model.layers.24.mlp.up_proj.lora_B.weight. Initializing up projection to zero.
[2025-07-19 05:14:08] Gate projection base_model.model.model.layers.25.mlp.gate_proj.lora_A.weight does not have a corresponding up projection base_model.model.model.layers.25.mlp.up_proj.lora_A.weight. Initializing up projection to zero.
[2025-07-19 05:14:08] Gate projection base_model.model.model.layers.25.mlp.gate_proj.lora_B.weight does not have a corresponding up projection base_model.model.model.layers.25.mlp.up_proj.lora_B.weight. Initializing up projection to zero.
[2025-07-19 05:14:08] Gate projection base_m
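
The repeated warnings above say that `lora1` adapts `gate_proj` but not `up_proj`, so the missing up-projection weights are initialized to zero. A plausible reading, stated here as an assumption rather than a description of SGLang internals, is that the two projections are handled together and both therefore need LoRA weights. The sketch below only illustrates why zero weights are harmless: a LoRA update has the form `scaling * B @ (A @ x)`, so zero `A` and `B` matrices contribute nothing. All helper names and toy dimensions are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_delta(lora_a, lora_b, x, scaling=1.0):
    """LoRA update added to the base projection output: scaling * B @ (A @ x)."""
    return [scaling * y for y in matvec(lora_b, matvec(lora_a, x))]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# Toy sizes: hidden=4, rank=2, intermediate=3.
x = [1.0, -2.0, 0.5, 3.0]

# The adapter provides weights for the gate projection only.
gate_a = [[0.1, 0.2, 0.0, -0.1], [0.0, 0.3, 0.1, 0.0]]  # rank x hidden
gate_b = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]  # intermediate x rank

# The missing up projection is filled with zeros, as the warning describes.
up_a = zeros(2, 4)
up_b = zeros(3, 2)

gate_delta = lora_delta(gate_a, gate_b, x)
up_delta = lora_delta(up_a, up_b, x)

print("gate delta:", gate_delta)  # non-zero: the adapter changes the gate projection
print("up delta:", up_delta)  # all zeros: the up projection keeps its base output
```

In this toy setting the up projection's output stays equal to the base model's, which is consistent with the warnings being informational for adapters that target only `gate_proj`.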

In [17]:
url = f"http://127.0.0.1:{port}"
json_data = {
    "text": [
        "List 3 countries and their capitals.",
        "AI is a field of computer science focused on",
    ],
    "sampling_params": {"max_new_tokens": 32, "temperature": 0},
    # The first input uses lora0, and the second input uses lora1
    "lora_path": ["lora0", "lora1"],
}
response = requests.post(
    url + "/generate",
    json=json_data,
)
# Parse the response once; outputs are returned in the same order as the inputs.
outputs = response.json()
print(f"Output from lora0: {outputs[0]['text']}")
print(f"Output from lora1: {outputs[1]['text']}")

[2025-07-19 05:14:08] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, timestamp: 2025-07-19T05:14:08.383412
[2025-07-19 05:14:08] Prefill batch. #new-seq: 1, #new-token: 10, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0, timestamp: 2025-07-19T05:14:08.384627


[2025-07-19 05:14:09] INFO:     127.0.0.1:46448 - "POST /generate HTTP/1.1" 200 OK
Output from lora0:  Give the countries and capitals in the correct order.
Countries: Japan, Brazil, Australia
Capitals: Tokyo, Brasilia, Canberra
1. Japan -
Output from lora1:  creating intelligent machines that can perform tasks that typically require human intelligence. AI is a broad field that encompasses many subfields, including machine learning, natural language processing,
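
The request above pairs each prompt with an adapter by position, so the two lists must stay aligned. A minimal sketch of a helper that builds such a payload and catches length mismatches before the request is sent is shown below. The function `build_generate_payload` is hypothetical and not part of SGLang; it only reproduces the JSON shape used in the cell above.

```python
def build_generate_payload(prompts, adapters, max_new_tokens=32, temperature=0):
    """Build a /generate request body that assigns one adapter name per prompt."""
    if len(prompts) != len(adapters):
        raise ValueError(
            f"Got {len(prompts)} prompts but {len(adapters)} adapter names; "
            "each prompt needs exactly one adapter."
        )
    return {
        "text": list(prompts),
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
        # The i-th adapter name applies to the i-th prompt.
        "lora_path": list(adapters),
    }


payload = build_generate_payload(
    ["List 3 countries and their capitals.", "AI is a field of computer science focused on"],
    ["lora0", "lora1"],
)
```

The resulting `payload` can be passed as `json=payload` to `requests.post(url + "/generate", ...)` in the same way as `json_data` above.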


In [18]:
terminate_process(server_process)

[2025-07-19 05:14:09] Child process unexpectedly failed with exitcode=9. pid=355367


## Future Work

The development roadmap for LoRA-related features can be found in this [issue](https://github.com/sgl-project/sglang/issues/2929). Currently, radix attention is incompatible with LoRA and must be manually disabled. Other features, including Unified Paging, Cutlass backend, and dynamic loading/unloading, are still under development.