# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code where an event loop is already running (a nested event loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
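
For context, the failure this avoids can be reproduced with the standard library alone. The sketch below does not use SGLang, and `inner` and `outer` are illustrative names. It shows that calling `asyncio.run()` while an event loop is already running raises `RuntimeError`, which is typically the situation inside a Jupyter kernel.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, similar to a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Applying `nest_asyncio` patches the loop so that such nested calls are permitted.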

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
Prompt: Hello, my name is
Generated text:  Nathan and I am a 16 year old male. I have an extremely short stature, like many kids who are average height compared to my height. What is a good way to adjust my body to make me taller?

Im going to have a go at eating more than normal rice and vegetable but is that a good idea?

Choose your answer from:
[A]. No.
[B]. Yes.
I'm not sure how to respond. Can you provide some advice? To help me make the right decision, could you please explain the science behind it? That would be really helpful.

I'm afraid I'm going to have a really tough time
Prompt: The president of the United States is
Generated text:  very busy. He works a lot. He is very important in the world. He is a very popular man and many people like him very much. But the president has a lot of problems. He has a lot of work to do and he has to be away from home. It is very difficult for him. But the president likes his family very much. The president often says, "We are all family.
```
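
The two keys in `sampling_params` control how random the sampled text is. As a rough, generic illustration of the usual meaning of these knobs (this is not SGLang's sampler, and the helper names and toy logits are made up for the example), `temperature` rescales the logits before the softmax and `top_p` keeps only the smallest set of most likely tokens whose cumulative probability reaches the threshold:

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical scores for four tokens
probs = softmax_with_temperature(toy_logits, temperature=0.8)
candidates = top_p_filter(probs, top_p=0.95)
print(sorted(candidates))  # indices of tokens that can still be sampled
```

Lower temperatures sharpen the distribution and lower `top_p` values shrink the candidate set, so both push the output toward more deterministic text.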

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```text
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? I'm a [insert a short, positive, enthusiastic statement about your personality or skills]. I'm always looking for new challenges and opportunities to grow and learn. What do you do for a living? I'm a [insert a short, positive, enthusiastic statement about your personality or skills]. I'm always looking for new challenges and opportunities to grow and learn. What do you enjoy doing? I enjoy [insert a short, positive, enthusiastic statement about your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French Quarter. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. It is also home to many famous French artists, writers, and musicians. The city is known for its cuisine, including its famous croissants and its many traditional French dishes. Paris is a city of contrasts, with its modern architecture and history intertwined with its traditional French culture. Its status as the capital of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes. This could lead to more sophisticated and adaptive AI systems that can learn from and adapt to new situations.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more stringent regulations and guidelines for AI development and deployment, as well as increased scrutiny of AI systems that are designed to harm or mislead humans.

3. Increased use of
```
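
The "overlap removal" mentioned in the banner refers to merging streamed chunks so that text appearing both at the end of what has already been received and at the start of a new chunk is not emitted twice. A minimal sketch of that idea follows. `strip_overlap` and `merge_chunks` are illustrative helpers written for this example, and the chunk list is invented; this is not necessarily how `stream_and_merge` is implemented.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is already a suffix of `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk  # no overlap found


def merge_chunks(chunks):
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical overlapping chunks, as a streaming API might deliver them.
chunks = [" Paris", " Paris is", " is the capital", " capital of France."]
merged = merge_chunks(chunks)
print(merged)
```

A purely textual rule like this would also collapse genuine repetition (for example "ha" followed by "ha"), so it only makes sense when chunks are known to overlap.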



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```text
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [occupation]!

I'm here to meet you because I love [job title] and I'm excited to tell you about my experiences and what I bring to our team.

I'm always looking for new opportunities to learn and grow. I enjoy challenging myself and trying new things, and I'm constantly looking for new ways to improve myself as a leader.

If you're looking for someone who is ready to work hard, who is ready to make a difference, and who is ready to take on new challenges, I'm the one for you!

Please come talk to me and find out more

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It's also famous for its rich history and cultural heritage. France's political center is also in Paris, with its city hal
```
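
Because `async_generate` is awaited, it composes with ordinary `asyncio` tools. The sketch below uses a stand-in `FakeEngine` (hypothetical, so that the example runs without a GPU or SGLang installed) to show one request per prompt being issued concurrently with `asyncio.gather`. Whether this is preferable to passing the whole list in a single call depends on the engine's scheduler and is not measured here.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in that mimics the result shape used above."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Create one coroutine per prompt and run them concurrently.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
outputs = asyncio.run(generate_all(FakeEngine(), prompts, sampling_params))

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```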

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```text
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm [Age] years old. I'm a [occupation] and I'm passionate about [specific interest or hobby]. I enjoy [what I love to do], and I'm always looking for ways to [something positive]. I'm [some qualities or traits you like to show]. I'm looking to collaborate with someone who shares my [specific interest or hobby], and I'd love to learn more about you. So, I'm looking forward to [an activity or meeting].

*Note: Replace the placeholders with actual information that someone could use to start a conversation with you. Make sure to use neutral language

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a historical and cultural center with a rich history dating back to ancient times. The city is known for its beautiful architecture, vibrant art scene, and world-class museums and attractions, including the Louvre and the Éiffel Tower. It is also known for its bustling street life and distinctive French fashion, which has made Paris a global city of fashion, art, and food. Paris is a city of light and vibrant life that attracts millions of visitors each year. The city's language is French, but it also has a smaller community of people speaking other languages, including English and Spanish. The city is home to some of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be diverse and transformative, with many possibilities and implications. Here are some possible trends in the AI landscape:

1. Automation of Human Work: As AI becomes more advanced, it is likely to automate many of the tasks that humans currently do, including tasks like data analysis, customer service, and administrative tasks. This could lead to increased efficiency and productivity, but it could also result in job losses for some workers.

2. AI for Healthcare: AI has the potential to revolutionize healthcare, with the ability to analyze large amounts of data and provide personalized health recommendations. This could lead to more effective treatments and reduced costs for patients.
```
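
The `async for` pattern used above can be tried without a model. In the sketch below, `fake_stream` is a hypothetical asynchronous generator standing in for a streaming call, and `collect` is an illustrative helper. The loop prints each piece as it arrives with `end=""` so that the pieces form one continuous line, and it also accumulates them into the full text.

```python
import asyncio


async def fake_stream(text):
    """Hypothetical stand-in for a streaming call: yields one piece at a time."""
    for piece in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece + " "


async def collect(stream):
    """Print pieces as they arrive and return the concatenated text."""
    pieces = []
    async for piece in stream:
        print(piece, end="", flush=True)  # end="" keeps pieces on one line
        pieces.append(piece)
    print()  # final newline once the stream is exhausted
    return "".join(pieces)


full_text = asyncio.run(collect(fake_stream("Paris is the capital of France.")))
```

Keeping the accumulated string alongside the live printing is convenient when the complete response is needed after the stream ends, for example for logging.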






```python
llm.shutdown()
```