# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
from sglang.utils import stream_and_merge, async_stream_and_merge
import sglang as sgl
import asyncio
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    import patch


llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")





### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")



Prompt: Hello, my name is
Generated text:  Kevin, I am a licensed therapist and a Professor of Counseling at a university. My work focuses on the intersection of culture and human psychology, and how cultural factors influence our perceptions of reality, ourselves, and our relationships with others. I specialize in multicultural counseling, diversity training, and humanistic psychology.
I have experience working with diverse populations, including individuals from low-income backgrounds, LGBTQ+ individuals, and people of color. My approach is holistic, and I believe in addressing the interconnectedness of individual, family, and community well-being. I also emphasize the importance of empowerment, self-awareness, and personal growth in achieving positive change.
My
Prompt: The president of the United States is
Generated text:  the head of the U.S. government. In addition to being the commander-in-chief of the armed forces, the president is also the head of the federal executive branch 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is




Generated text:  Kaida. I'm a 25-year-old freelance writer and artist living in a small town in the Pacific Northwest. I enjoy hiking, reading, and trying out new recipes in my spare time. I'm a bit of a introvert, but I'm always up for a good conversation.
This self-introduction is neutral because it doesn't reveal any personal biases or opinions. It simply states the character's name, occupation, and interests. This type of introduction is useful for a character who is still getting to know others or for a character who is trying to keep a low profile.
Here are a few things to consider when writing a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. Paris is located in the northern part of the country and is situated on the Seine River. It is the largest city in France and is known for its rich history, art, fashion, and culture. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral. The city is also known for its romantic atmosphere and is often referred to as the City of Light. Paris is a popular tourist destination and is considered one of the most beautiful and culturally rich cities in the world. The city has a population of over 2.1 million people and is a major hub

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  a topic of much speculation and debate. Some experts predict that AI will become increasingly integrated into our daily lives, while others warn of the potential risks and challenges associated with its development. Here are some possible future trends in artificial intelligence:
1. Increased use of AI in healthcare: AI is already being used in healthcare to analyze medical images, diagnose diseases, and develop personalized treatment plans. In the future, AI is likely to play an even more significant role in healthcare, with the potential to revolutionize the way we diagnose and treat diseases.
2. Widespread adoption of AI-powered virtual assistants: Virtual assistants like Siri, Alexa, and Google
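
The "overlap removal" mentioned in the cell above can be pictured with a small, library-free sketch. It assumes that a streamed chunk may repeat the tail of the text received so far, and it drops that repeated part before appending. The names `strip_overlap` and `merge_chunks` are illustrative only; the actual helper in `sglang.utils` may work differently.

```python
def strip_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_chunks(chunks):
    """Concatenate chunks while removing overlapping text between neighbors."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats part of the previous text.
demo_text = merge_chunks([" Paris", " Paris is", " is located"])
print(repr(demo_text))
```

One caveat of this heuristic: it cannot tell a repeated chunk boundary from text that legitimately repeats (for example, the same word twice in a row), so it is only appropriate when chunks are known to overlap.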



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Kaida. I'm a skilled linguist who specializes in deciphering ancient languages. My work often takes me to remote locations where I can study artifacts and texts that hold secrets of the past. I'm currently based in Tokyo, where I'm working on a project to translate an ancient Sumerian tablet.
Kaida's introduction focuses on her profession and work, providing a neutral and straightforward description of herself. It doesn't reveal any personal traits or feelings, keeping the tone professional and objective. This type of self-introduction is suitable for a formal setting, such as a business or academic environment. However, in a more casual or social

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is located at the heart of the Île-de-France region in the northern part of the country. The city has a
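
When the engine is embedded in a larger asynchronous program, such as a custom server, several requests are typically awaited together. The sketch below shows only that general `asyncio` pattern. `FakeEngine` is a hypothetical stand-in so the example runs without a GPU or SGLang installed; it is not part of SGLang, and its method signature is an assumption made for illustration.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an inference engine; NOT part of SGLang."""

    async def async_generate(self, prompt, sampling_params=None):
        # Simulate generation latency without blocking the event loop.
        await asyncio.sleep(0.01)
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params=None):
    # Create one coroutine per prompt; gather returns results in input order.
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*requests)


demo_prompts = ["The capital of France is", "The future of AI is"]
demo_outputs = asyncio.run(generate_concurrently(FakeEngine(), demo_prompts))
for prompt, output in zip(demo_prompts, demo_outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Because `asyncio.gather` preserves input order, the results can be zipped back with the prompts in the same way as in the cell above.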

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Rohan and I'm a 25-year-old student. I'm currently pursuing a degree in computer science. I'm not really sure what the future holds, but I'm excited to see where life takes me.
I'm a bit of a quiet and reserved person, but I have a passion for coding and technology. I enjoy spending my free time learning about new programming languages and experimenting with different projects. I'm also an avid reader and love getting lost in a good book. Outside of academics, I'm a bit of a movie buff and enjoy watching classic films and sci-fi movies.
When I'm not studying or coding, you can



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is located in the Île-de-France region in the north-central part of the country. Paris is the largest city in France, with over 2.1 million inhabitants within its administrative limits and over 12 million in its metropolitan area. Paris is the economic, cultural, and political center of France and is one of the world's leading business and cultural centers, hosting many international organizations and events, including the UNESCO headquarters and the Paris Air Show. Paris is known for its iconic landmarks, such as the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe,



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  inherently uncertain, but here are some potential developments that could shape the field in the coming years.
More sophisticated deep learning techniques could lead to breakthroughs in areas like image and speech recognition, natural language processing, and decision-making. Potential advancements include:
  1. Multimodal learning: AI systems could learn to combine multiple sources of data, such as text, images, and audio, to gain a deeper understanding of complex tasks.
  2. Transfer learning: AI models could learn to adapt to new tasks and domains more efficiently, reducing the need for extensive retraining.
  3. Explainability: As AI systems become more




In [6]:
llm.shutdown()

### Return Hidden States

In [7]:
llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3.1-8B-Instruct", return_hidden_states=True
)







In [8]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 10}

outputs = llm.generate(prompts, sampling_params=sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    meta = output["meta_info"]
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
    print(f"Prompt_Tokens: {meta['prompt_tokens']}\tCompletion_tokens: {meta['completion_tokens']}")
    print(f"Hidden states: {[i.shape for i in meta['hidden_states']]}")
    print()



Prompt: Hello, my name is
Generated text:  Ashley Rodriguez and I am a 4th grade
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The president of the United States is
Generated text:  a powerful leader who serves as the head of state
Prompt_Tokens: 8	Completion_tokens: 10
Hidden states: [torch.Size([8, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])]

Prompt: The capital of France is
Generated text:  Paris, but the country is divided into 13
Prompt_Tokens: 6	Completion_tokens: 10
Hidden states: [torch.Size([6, 4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096]), torch.Size([4096])
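
In the shapes printed above, the first entry is a 2-D block with one row per prompt token, and each later entry is a single 1-D vector. The sketch below shows one way such a list could be flattened into one vector per position. It uses nested Python lists in place of tensors so that it runs without `torch`, and the helper name `flatten_hidden_states` is hypothetical. The layout is inferred only from the printed shapes and should be treated as an assumption.

```python
def flatten_hidden_states(hidden_states):
    """Return one hidden vector per position.

    Assumes the layout seen in the printed shapes: the first entry is a
    2-D block (one row per prompt token) and every later entry is a single
    1-D vector for a decoded token.
    """
    rows = []
    for entry in hidden_states:
        if entry and isinstance(entry[0], (list, tuple)):
            rows.extend(list(row) for row in entry)  # 2-D block: add each row
        else:
            rows.append(list(entry))  # 1-D vector: add as a single row
    return rows


# Toy data: 3 prompt tokens, 2 later vectors, hidden size 4 (not 4096).
toy_hidden_states = [
    [[0.0] * 4, [0.1] * 4, [0.2] * 4],
    [1.0] * 4,
    [2.0] * 4,
]
flat = flatten_hidden_states(toy_hidden_states)
print(len(flat), len(flat[0]))
```

With real tensors, a roughly equivalent approach would be to call `unsqueeze(0)` on the 1-D entries and combine everything with `torch.cat`, although that has not been checked against the engine's actual output types here.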

In [9]:
llm.shutdown()