# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
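
The patch matters because standard asyncio entry points such as `asyncio.run()` refuse to start while an event loop is already running in the same thread, which is the situation inside IPython or Jupyter. The following standard-library sketch reproduces that error without SGLang; the helper names `inner` and `call_run_inside_loop` are illustrative only.

```python
import asyncio


async def inner() -> str:
    return "done"


async def call_run_inside_loop() -> str:
    """Try to call asyncio.run() while an event loop is already running."""
    coro = inner()
    try:
        # A loop is already active in this thread, as in a notebook cell,
        # so asyncio.run() raises instead of starting a second loop.
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


result = asyncio.run(call_run_inside_loop())
print(result)
```

Calling `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.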

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio

import sglang as sgl
import sglang.test.doc_patch  # imported for its side effects
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# One call handles the whole batch; outputs come back in prompt order.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Robin and I'm 34 years old and I have recently started dating again. I was introduced to her on Instagram and we began dating soon after. We both have a very similar personality and interests, and we have a lot of in common, such as having both of us enjoy going for walks and spending time at the beach. We have a lot of common interests, such as rock climbing and hiking. We both enjoy taking long walks and spend a lot of time at the beach. However, we seem to have a lot in common, such as having both of us enjoy going for walks and spending time at the beach. It seems
Prompt: The president of the United States is
Generated text:  represented by the Vice President. The Vice President is represented by the Speaker of the House. How many representatives are there in total? The President of the United States has a Vice President. 
The Vice President has a Speaker of the House. 
Therefore, there are a total of 2 representatives in the country. 
The
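
For larger offline jobs it can be convenient to submit prompts in fixed-size chunks, for example to save progress between chunks, and to keep each prompt next to its completion. The sketch below assumes only what the cell above shows: `generate` accepts a list of prompts plus a sampling-parameter dict and returns one dict per prompt with a `"text"` key. `generate_in_chunks` and `chunk_size` are hypothetical names, not part of SGLang.

```python
from typing import Any, Dict, Iterator, List


def generate_in_chunks(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    chunk_size: int = 2,
) -> Iterator[Dict[str, str]]:
    """Yield {"prompt", "text"} records, calling engine.generate once per chunk.

    `engine` is anything with a generate(prompts, sampling_params) method that
    returns one {"text": ...} dict per prompt (an assumption based on the
    usage shown above).
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            yield {"prompt": prompt, "text": output["text"]}
```

With the engine from this document, this could be used as `for record in generate_in_chunks(llm, prompts, sampling_params): print(record)`.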

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # stream_and_merge consumes the stream and returns the merged text.
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your interests and passions. What can you tell me about yourself? I'm a [insert a short description of your character or personality]. And what's your favorite hobby or activity? I love [insert a short description of your favorite hobby or activity]. And what's your favorite book or movie? I love [insert a short description of your favorite book or movie]. And what's your favorite color? I love [insert a short description of your favorite color]. And what's your favorite food? I love [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville de Paris" and "La Grande Ouvrière" (The Great Work). It is the largest city in France and the third largest in the world, with a population of over 2. 5 million people. Paris is known for its rich history, art, and culture, including the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also a major economic and financial center, with a thriving fashion industry and a large number of international companies. Paris is a popular tourist destination, with many attractions and events throughout the year. The city is also home to many

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence, allowing for more complex and nuanced decision-making. This could lead to a more human-like experience for users, as AI systems become more capable of understanding and responding to human emotions and motivations.

2. Greater emphasis on ethical considerations: As AI systems become more advanced, there will be a greater emphasis on ethical considerations, including issues such as bias, transparency, and accountability. This will likely lead to more rigorous testing and validation of AI systems, as well as greater transparency and accountability
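
The "overlap removal" in the cell above can be pictured as follows: when consecutive streamed chunks repeat text that has already been emitted, only the unseen suffix of each chunk is appended. This is a minimal sketch of that idea and not necessarily how `stream_and_merge` is implemented; `trim_overlap` and `merge_chunks` are illustrative names.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the end of existing."""
    max_possible = min(len(existing), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, dropping text each chunk shares with the merged tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks([" Paris", " Paris is", " is the capital"]))
```

One caveat of this greedy matching is that a genuinely repeated fragment (for example `"ha"` followed by another `"ha"`) is treated as overlap and dropped.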



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; outputs come back in prompt order.
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jane. I am a curious and analytical thinker with a knack for problem-solving. I have a thirst for knowledge and a passion for learning. I am a communicator who can connect with people on a deep level and inspire them with my words. I am a driven individual who is always looking for ways to improve myself and find new ways to solve problems. I am a true believer in perseverance and will not give up easily. I am a confident and assertive personality who is comfortable in both public and private settings. I am a dedicated professional with a sense of humor and enjoy taking on new challenges. I am someone who always tries to find the

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located on the banks of the Seine River and known for its iconic landmarks such as Notre Dame Cathedral, the Eiffel Tower, and 
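
When requests arrive independently instead of as one list, one option is to issue one `async_generate` call per prompt and await them together, with a cap on how many are in flight. The sketch below assumes that `await engine.async_generate(prompt, sampling_params)` with a single prompt returns a dict with a `"text"` key; `generate_concurrently` and `max_in_flight` are hypothetical names.

```python
import asyncio
from typing import Any, Dict, List


async def generate_concurrently(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    max_in_flight: int = 2,
) -> List[str]:
    """Run one async_generate call per prompt, at most max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt: str) -> str:
        async with semaphore:
            # Assumed call shape: single prompt in, single {"text": ...} dict out.
            output = await engine.async_generate(prompt, sampling_params)
            return output["text"]

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(one(p) for p in prompts))
```

Inside a coroutine this could be called as `texts = await generate_concurrently(llm, prompts, sampling_params)`.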

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling
        # async_generate directly, so repeated text is not printed twice.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Age] year old [Occupation]. I've always loved nature and trying to understand the world around me. My hobbies include reading, hiking, and gardening. I'm also a big advocate for sustainability and live a simple, low-impact lifestyle. What are your hobbies and what makes you unique? (Note: The answer should be specific and only you) Hi, my name is [Name], and I'm a [Age] year old [Occupation]. I've always loved nature and trying to understand the world around me. My hobbies include reading, hiking, and gardening. I'm also

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic Eiffel Tower and Parisian boulevards. It's a bustling metropolis with a rich history dating back to the Roman period and now playing a vital role in French culture and politics. Paris is the cultural and economic heart of the country, with a thriving food scene, art and fashion scene, and many museums and landmarks.

That's great! Can you tell me more about the Eiffel Tower and its history? Sure! The Eiffel Tower is a famous landmark in Paris, and it was originally built as a communication tower for the Paris Semaphore Railway. The tower was completed in 1

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  bright, with significant trends expected in the coming years:

1. Advancements in machine learning: With the help of big data, AI will continue to improve. Neural networks will become more sophisticated and autonomous, enabling more complex and sophisticated tasks.

2. Personalization: With the growing amount of personal data, AI will be able to provide more personalized and relevant services to users.

3. Autonomous vehicles: AI will become more advanced, enabling the development of fully autonomous vehicles that can navigate roads and handle various road conditions.

4. Healthcare: AI will be used in healthcare to develop more accurate and personalized treatments for diseases.

5. Education: AI
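
The streaming loop above writes each cleaned chunk with `end=""` as soon as it arrives, so the pieces concatenate into the full completion. The sketch below simulates that consumption pattern with a stand-in async generator so it runs without a model; `fake_stream` and `collect_stream` are hypothetical names, and with the engine the source of the `async for` would be `async_stream_and_merge(llm, prompt, sampling_params)` as in the cell above.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(pieces: List[str]) -> AsyncIterator[str]:
    """Stand-in for a model stream: yields text fragments one at a time."""
    for piece in pieces:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def collect_stream(stream: AsyncIterator[str]) -> str:
    """Print fragments as they arrive and return the concatenated text."""
    parts: List[str] = []
    async for piece in stream:
        print(piece, end="", flush=True)
        parts.append(piece)
    print()  # finish the line once the stream ends
    return "".join(parts)


text = asyncio.run(collect_stream(fake_stream([" Paris", ",", " known", " for"])))
```

Keeping the fragments in a list and joining once at the end avoids repeated string concatenation for long completions.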




```python
# Release the engine and its resources when finished.
llm.shutdown()
```
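
To make sure the engine is released even if generation raises, the shutdown call can be wrapped in a context manager. This is a minimal sketch that assumes only that the engine object exposes `shutdown()` as used above; `managed_engine` and `make_engine` are hypothetical names.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def managed_engine(make_engine: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine, yield it, and always call shutdown() afterwards."""
    engine = make_engine()
    try:
        yield engine
    finally:
        # Runs on normal exit and when the with-body raises.
        engine.shutdown()
```

With the engine from this document, usage could look like `with managed_engine(lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")) as llm: ...`.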