# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython, or in other code that already runs inside an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
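The reason for this patch can be seen with the standard library alone. The sketch below is only an illustration, not SGLang code: it calls `asyncio.run()` from inside a coroutine, which is the situation an IPython cell is in when a loop is already running. The names `inner` and `outer` are made up for the example.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like
    # code executed in an IPython session with an active loop.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Without a patch such as `nest_asyncio.apply()`, the nested call fails with a `RuntimeError`, which is why the snippet above is needed before using `asyncio.run(...)` in such environments.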

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Chandra, I am 13 years old and I live in New York. I am an English major and I enjoy reading novels. I have been a member of a school reading club for the past year, and I recently read the book "The Man Who Planted Trees" by Madeline Miller. 

I'm a big fan of the book and have been rereading it. It's a beautiful story, and I really enjoyed the way that Madeline Miller writes in the book. I also enjoyed the way that the characters came to life, and the way that the book's plot was engaging.

I have been trying to
Prompt: The president of the United States is
Generated text:  a very important person in our country. He or she is in charge of the country. Presidents in other countries have different jobs. But they are very important. President Barack Obama is one of the most important people in the United States. He was born in 1961. He was a good student. He became president of the United States in 2009. He began his first year in office on Jan

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also home to the French Parliament and the French National Museum. Paris is a bustling city with a rich history and culture, and is a popular tourist destination. It is also known for its fashion industry and its role in the French Revolution. The city is home to many famous French artists and writers, and is a cultural hub for the region. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. It is a city that has been a major influence on French culture

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human needs.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more robust AI systems that are designed to be transparent, accountable, and responsible.

3. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and
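The "overlap removal" mentioned in the cell above can be pictured with a small, self-contained sketch. It assumes that consecutive streamed chunks may repeat text that has already been emitted, and it trims the longest such repeated prefix before appending. The functions `trim_overlap` and `merge_chunks` and the `fake_stream` data are illustrative assumptions, not necessarily how `stream_and_merge` is implemented.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    final_text = ""
    for chunk in chunks:
        final_text += trim_overlap(final_text, chunk)
    return final_text


# Hypothetical chunks that partially repeat earlier text.
fake_stream = [" Paris", " Paris, known", " known for its", " landmarks."]
merged = merge_chunks(fake_stream)
print(merged)
```

One caveat of this naive approach: a chunk that legitimately starts with text identical to the end of the previous output would also be trimmed, so it is a sketch of the idea rather than a robust merge.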



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name]. I am a [insert job title, field, or profession] with [insert years of experience]. I have always been passionate about [insert something like sports, music, art, or education]. I love challenges and trying new things, and I am always looking for ways to expand my skills and knowledge. I am a [insert something like a great cook, natural disaster response specialist, or technology expert]. How can I get started in my field? I am always open to learning new skills and opportunities. Thanks for taking the time to meet me! What's your name and what do you do? What's your name

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the 6th most populous city in the world. It is known as the city of light, the world's first city to be built on the sea, and the oldest continuously inhabited city in Euro
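The cell above awaits a single `async_generate` call for the whole prompt list. When prompts come from independent callers, as in a custom server, one possible pattern is to issue one call per prompt and collect the results with `asyncio.gather`. The sketch below uses a hypothetical `FakeEngine` stand-in so that it runs without a GPU. It only illustrates the asyncio pattern and makes no claim about how the real engine schedules such requests.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; NOT the SGLang API itself."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Create one awaitable per prompt and wait for all of them.
    # asyncio.gather returns results in the order the awaitables were passed.
    calls = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*calls)


prompts = ["The capital of France is", "The future of AI is"]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
outputs = asyncio.run(generate_all(FakeEngine(), prompts, sampling_params))

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Because `gather` preserves input order, the `zip(prompts, outputs)` pairing stays valid even if individual requests finish at different times.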

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through async_stream_and_merge, the overlap-aware helper, instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a friendly and warm-hearted human. I am originally from [Country], but I've moved to [City], and I've been working hard to get used to my new life here. I'm an [occupation] who loves to travel and explore new places. I'm always looking for new experiences and adventures. I'm always looking for new opportunities to make a difference in the world. And, I'm always open to learning and growing as a person. So, if you ever need anything or just want to chat, don't hesitate to reach out. And I'm ready to meet you. [Name].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a vibrant and historic city that serves as the political, cultural, and economic center of the country. It is known for its rich history, renowned museums, stunning architecture, and vibrant nightlife. Paris is also a major center for international trade and tourism, and its many landmarks and historical sites attract millions of visitors each year. The city has a unique blend of old and new, with its historic landmarks and modern institutions blending seamlessly in a city that has endured for centuries. Paris is a city of innovation and creativity, attracting world-renowned artists, writers, and scholars from all over the globe. The city is also home to some of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be full of exciting developments and possibilities. Here are some possible trends to expect in the future:

1. Improved machine learning algorithms: AI experts are already using machine learning algorithms to make more accurate predictions and better understand complex data. This will likely continue to be a key area of focus in the coming years.

2. Autonomous vehicles: Self-driving cars and other autonomous vehicles are becoming more advanced, and are expected to become a major part of daily life in the coming years. This could revolutionize transportation and reduce traffic congestion.

3. Personalized AI: AI systems that can learn and adapt to individual needs and preferences will become even more
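The streaming loop in the cell above prints each cleaned chunk as soon as it arrives. The sketch below mimics that flow with the standard library only, assuming an async generator that yields dictionaries with a `"text"` field and may repeat already-emitted text. The generator `fake_chunks`, the wrapper `stream_without_repeats`, and the helper `trim_overlap` are hypothetical stand-ins, not the SGLang implementation.

```python
import asyncio


def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    for size in range(min(len(existing_text), len(new_chunk)), 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


async def fake_chunks():
    """Hypothetical stand-in for a streaming engine response."""
    for text in [" Paris", " Paris is", " is the capital."]:
        await asyncio.sleep(0)  # yield control, as real I/O would
        yield {"text": text}


async def stream_without_repeats(chunks):
    """Yield only the new part of each chunk as it arrives."""
    final_text = ""
    async for chunk in chunks:
        cleaned = trim_overlap(final_text, chunk["text"])
        final_text += cleaned
        yield cleaned


async def collect():
    pieces = []
    async for piece in stream_without_repeats(fake_chunks()):
        print(piece, end="", flush=True)  # incremental display
        pieces.append(piece)
    print()
    return pieces


pieces = asyncio.run(collect())
```

Printing with `end=""` and `flush=True` is what makes the pieces appear as one continuous line while generation is still in progress.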




In [6]:
llm.shutdown()