# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
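
As a rough illustration of the custom-server idea, the sketch below wires an engine-like object into a small HTTP endpoint using only the Python standard library. It is not the linked `custom_server.py`: `StubEngine`, `handle_generate`, `make_handler`, `run_server`, and the JSON request shape are hypothetical names chosen for this example, and `StubEngine` stands in for `sgl.Engine` so the code runs without a GPU.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; returns a canned completion."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" <stub completion for {prompt!r}>"}


def handle_generate(engine, raw_body):
    """Decode a JSON request body, call the engine, and encode the JSON reply."""
    request = json.loads(raw_body)
    output = engine.generate(request["prompt"], request.get("sampling_params"))
    return json.dumps({"text": output["text"]}).encode("utf-8")


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # This sketch answers every POST path the same way.
            length = int(self.headers.get("Content-Length", 0))
            payload = handle_generate(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return GenerateHandler


def run_server(engine, host="127.0.0.1", port=8000):
    """Serve POST requests until interrupted (blocking call)."""
    with ThreadingHTTPServer((host, port), make_handler(engine)) as server:
        server.serve_forever()
```

Calling `run_server(StubEngine())` would start the server. In a real deployment the stub would be replaced by the engine instance, and error handling and streaming would still need to be added.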



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
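
To see why nested event loops are a problem, the following standard-library sketch calls `asyncio.run()` from inside a coroutine that is already running on an event loop, which is roughly the situation in IPython or Jupyter. The function names `inner` and `outer` are illustrative only; the sketch does not use `nest_asyncio` or SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # An event loop is already running here, as it is inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # Starting a second loop is rejected by asyncio.
    except RuntimeError as exc:
        coro.close()  # Avoid a "coroutine was never awaited" warning.
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

The printed message starts with `RuntimeError:`. Patching the loop with `nest_asyncio.apply()` is intended to allow this kind of re-entrant call.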

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Jack. I come from New York. I like to eat chocolate ice cream. I love to go to the park with my family. At the park, we usually have fun and have ice cream at the same time. My dad and I have a big family and we have a lot of fun in the park. The most important thing in the world for me is family. It's impossible to have a wonderful life without my family. I love my mom, my dad, and my grandma. But my sister is my best friend. She is nice to me. She helps me and makes me happy. I love my sister. She can
Prompt: The president of the United States is
Generated text:  a person. He is the president of the United States, a country, a person. There are no logical contradictions in that statement.

Does this relate to another true statement? Yes, it does relate to another true statement. The statement "The president of the United States is a person" is true because it is a factual fact that the current president of the United States, Joe Biden, is a 
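
The `temperature` and `top_p` entries in `sampling_params` control how the next token is drawn. The sketch below shows the conventional meaning of these two parameters on a toy list of logits: temperature scaling followed by nucleus (top-p) filtering. It is an illustration under that assumption, not SGLang's sampler, and `top_p_filter` is a hypothetical helper name.

```python
import math


def top_p_filter(logits, temperature, top_p):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    # Lower temperatures sharpen the distribution; higher ones flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # Subtract the max for stability.
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
print(sorted(top_p_filter(toy_logits, temperature=1.0, top_p=0.9)))
print(sorted(top_p_filter(toy_logits, temperature=0.2, top_p=0.9)))
```

With these toy logits, a temperature of 1.0 keeps three candidate tokens, while a temperature of 0.2 concentrates almost all probability on the first token, so only that one survives the filter.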

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? I'm a [insert a short description of your character or personality]. What do you like to do outside of work? I enjoy [insert a short description of your hobbies or interests]. What's your favorite book or movie? I love [insert a short description of your favorite book or movie]. What's your favorite hobby? I love [insert a short description of your favorite hobby]. What's your favorite place to relax? I love [insert

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. 

A. True
B. False
A. True

Paris is the capital city of France, and it is the largest city in the country. It is known for its rich history, beautiful architecture, and vibrant culture. Paris is also a major transportation hub, with many major highways and rail lines connecting the city to other parts of France and the world. The city is home to many famous landmarks, including the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is a popular tourist destination, with millions of visitors each year. The city is also known for its cuisine, with many

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing it to learn from and adapt to human behavior and decision-making processes.

2. Greater emphasis on ethical considerations: As AI becomes more prevalent in various industries, there will be a greater emphasis on ethical considerations, including issues such as bias, transparency, and accountability.

3. Development of more advanced AI: AI is likely to become more advanced and capable of performing a wider range of tasks, including tasks that were previously considered impossible or dangerous.

4. Increased use of AI
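
The "overlap removal" mentioned in the banner can be pictured with a small standalone sketch. It assumes that consecutive streamed chunks may repeat the tail of the text already received, and it drops the repeated prefix before appending. The helpers `strip_overlap` and `merge_chunks` are hypothetical names for this illustration and are not presented as the implementation behind `stream_and_merge`.

```python
def strip_overlap(existing_text, new_chunk):
    """Return new_chunk without the prefix that already ends existing_text."""
    longest = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shorter ones.
    for size in range(longest, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


print(merge_chunks(["Hello wor", "world", "ld!"]))  # prints "Hello world!"
```

One design note: matching only on suffix and prefix means that text which legitimately repeats at a chunk boundary would also be trimmed, so this approach suits streams where overlap is expected.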



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm [Age]. I've always been fascinated by [current interest or hobby]. What can you tell me about yourself? [Introduce your character to the reader, including any unique characteristics or personality traits that make you stand out]. As someone who enjoys [mention a hobby or interest you've always been interested in], I'm always looking for new ways to challenge myself. What do you think makes you unique? [Ask the reader a question that elicits a response that describes a unique trait or quality of your character]. I'm [Age] years old, and I have a lot of experiences under my belt.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is the 15th-largest city in the world by population. According to the French census from 2019, the city’s population was 2.21 million people, making it the 17t
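
The practical benefit of the asynchronous call is that other coroutines keep running while generation is awaited. The sketch below demonstrates this with a stub in place of the real engine: `StubAsyncEngine` is a hypothetical class whose `async_generate` simply sleeps and returns dictionaries with a `"text"` key, and `heartbeat` is an unrelated task that keeps ticking in the meantime.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for the engine; simulates latency with a sleep."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.05)
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(stop, ticks):
    # Unrelated work that should keep running during generation.
    while not stop.is_set():
        ticks.append(1)
        await asyncio.sleep(0.01)


async def run_demo():
    engine = StubAsyncEngine()
    stop, ticks = asyncio.Event(), []
    beat = asyncio.create_task(heartbeat(stop, ticks))
    outputs = await engine.async_generate(["a", "b"], {"temperature": 0.8})
    stop.set()
    await beat
    return outputs, len(ticks)


outputs, tick_count = asyncio.run(run_demo())
print(len(outputs), "outputs;", "heartbeat ran:", tick_count > 0)
```

The heartbeat records at least one tick because awaiting the stubbed generation hands control back to the event loop.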

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert your name], and I'm a [insert occupation] at [insert employer's name]. I have been in the field of [insert area of expertise] for [insert number of years] years. I am passionate about [insert a hobby or interest that you enjoy]. I am always looking for new ways to learn and grow as a professional and someone who enjoys [insert a profession or hobby]. How do you like to spend your free time? I enjoy [insert something you do that you're not sure if you have done before or have done before, but you're excited to do].
I'm a [insert your name]



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a historic city located on the banks of the Seine River. It is known as the "City of Light" for its rich cultural heritage and vibrant arts scene, and is home to the Louvre Museum, the Eiffel Tower, and numerous iconic landmarks such as the Notre-Dame Cathedral. Paris is also known for its fashion industry and is home to numerous museums and art galleries. The city is a major transportation hub, with its airport serving as a major gateway for international visitors. It is also known for its climate and its rolling countryside. As the capital of France, Paris is a cultural and political center with a rich



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  incredibly dynamic, with a wide array of emerging trends shaping how it will evolve. Here are some of the key trends that could drive significant changes in the field:

1. Deep Learning and Hyper-Parameter Tuning: Deep learning is rapidly advancing and will continue to make significant progress in the coming years. However, hyper-parameter tuning will become even more important. This will involve fine-tuning the hyperparameters of deep learning models to improve accuracy and efficiency.

2. Personalized AI: AI is becoming increasingly personalized, with the ability to learn from individual users and adapt their behavior accordingly. This will have a significant impact on industries such as healthcare,
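
Consuming a stream with `async for` can be sketched without the engine. The example below assumes, purely for illustration, a stream whose items carry the cumulative text generated so far, and it wraps that stream in an async generator that yields only the newly added part. `stub_stream`, `deltas`, and `collect` are hypothetical names and do not describe how `async_stream_and_merge` is implemented.

```python
import asyncio


async def stub_stream(text, step=4):
    """Hypothetical stream: each item holds the cumulative text so far."""
    for end in range(step, len(text) + step, step):
        await asyncio.sleep(0)  # Yield control, as a real stream would.
        yield {"text": text[:end]}


async def deltas(stream):
    """Yield only the part of each cumulative chunk that has not been seen."""
    seen = 0
    async for chunk in stream:
        current = chunk["text"]
        yield current[seen:]
        seen = len(current)


async def collect(text):
    parts = []
    async for piece in deltas(stub_stream(text)):
        parts.append(piece)
    return parts


parts = asyncio.run(collect("Paris is the capital."))
print(parts)
```

Printing each piece with `end=""` and `flush=True`, as the cell above does, then produces one continuous line of text as the pieces arrive.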




```python
llm.shutdown()
```