# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
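
The four modes differ along two axes: whether the call blocks (synchronous) or is awaited (asynchronous), and whether the result arrives whole or in chunks. The sketch below illustrates only that call shape, using a hypothetical `TinyEngine` stand-in that echoes canned tokens so it runs without a model or GPU. The class, its `stream` flag, and its return values are assumptions made for illustration, not the SGLang API; the cells in the Offline Batch Inference section show the real calls.

```python
import asyncio


class TinyEngine:
    """Hypothetical stand-in for an engine; it only echoes canned tokens."""

    TOKENS = [" Paris", ",", " France", "."]

    def generate(self, prompt, stream=False):
        if stream:
            # Streaming: hand back an iterator of partial results.
            return ({"text": tok} for tok in self.TOKENS)
        # Non-streaming: return the finished result in one piece.
        return {"text": "".join(self.TOKENS)}

    async def async_generate(self, prompt, stream=False):
        if stream:
            async def chunks():
                for tok in self.TOKENS:
                    await asyncio.sleep(0)  # yield control to the event loop
                    yield {"text": tok}
            return chunks()
        await asyncio.sleep(0)
        return {"text": "".join(self.TOKENS)}


engine = TinyEngine()
prompt = "The capital of France is"

# 1. Non-streaming synchronous
sync_full = engine.generate(prompt)["text"]

# 2. Streaming synchronous
sync_streamed = "".join(c["text"] for c in engine.generate(prompt, stream=True))


async def run_async():
    # 3. Non-streaming asynchronous
    full = (await engine.async_generate(prompt))["text"]
    # 4. Streaming asynchronous
    parts = []
    async for chunk in await engine.async_generate(prompt, stream=True):
        parts.append(chunk["text"])
    return full, "".join(parts)


async_full, async_streamed = asyncio.run(run_async())
print(sync_full, sync_streamed, async_full, async_streamed, sep="\n")
```

All four variables end up holding the same text; only the way the caller receives it changes.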

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, a Jupyter notebook, or any other environment where an event loop is already running, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
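
The snippet below is a small, standard-library-only illustration of the failure that `nest_asyncio` works around: calling `asyncio.run()` while an event loop is already running raises `RuntimeError`. The coroutine names `inner` and `outer` are made up for this sketch, and `outer` plays the role of a notebook cell that already runs inside a loop.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Applying `nest_asyncio` patches the loop so that such nested calls are permitted.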

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Zhou Yu, and I’m a writer based in Beijing. I write my articles on general topics such as personal finance, health, and travel, and I also write about historical events, travel and travel travel. How can I write articles on general topics, such as personal finance, health, and travel?
Certainly! Here are some tips for writing articles on general topics, such as personal finance, health, and travel:

1. Research: Before you start writing, make sure you have a thorough understanding of the topic you want to write about. Read extensively on your own or consult with experts in the field.

2. Write for a reader
Prompt: The president of the United States is
Generated text:  now expected to lead a coalition of Democrats and Republicans, and not be a lone wolf. Republicans will try to overturn the rules and provide a platform for the president to speak candidly about the 2016 election and the future of the country. Democrats will want to advance issue
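
The `temperature` and `top_p` entries in `sampling_params` control how random the generated text is. As a rough illustration of the idea, the sketch below implements textbook temperature scaling followed by nucleus (top-p) filtering over a toy list of logits. The function `sample_top_p` and the logits are hypothetical, and SGLang's actual sampler may differ in its details.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=None):
    """Illustrative temperature + nucleus (top-p) sampling over a logit list."""
    rng = rng or random.Random()
    # Temperature scaling: lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw one token id from the kept set, weighted by probability.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [4.0, 3.0, 0.5, -2.0]
rng = random.Random(0)
draws = [sample_top_p(logits, rng=rng) for _ in range(200)]
print(sorted(set(draws)))
```

With these toy logits, the two most likely tokens already cover more than 95% of the probability mass, so the two unlikely tokens are never sampled.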

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a brief description of your profession or role]. I enjoy [insert a brief description of your hobbies or interests]. I'm always looking for new experiences and learning opportunities. What's your favorite hobby or activity? I'm always looking for new experiences and learning opportunities. What's your favorite hobby or activity? I'm always looking for new experiences and learning opportunities. What's your favorite hobby or activity? I'm always looking for new experiences and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, and is the largest city in Europe by population. It is located on the Seine River and is the seat of government, administration, and culture for the country. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. It is also home to many world-renowned museums, including the Louvre and the Musée d'Orsay. Paris is a popular tourist destination and is known for its rich history, art, and cuisine. The city is also home to many cultural institutions,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some possible future trends in AI:

1. Increased automation and robotics: As AI technology continues to improve, we can expect to see more automation and robotics in various industries, from manufacturing to healthcare. This will lead to increased efficiency, productivity, and cost savings for businesses.

2. Enhanced personalization: AI will enable businesses to better understand their customers and provide personalized experiences. This will lead to increased customer satisfaction and loyalty, as well as increased revenue.

3. Improved healthcare: AI will enable more accurate
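
The cell above mentions overlap removal: when streamed chunks repeat text that was already received, the repeated part has to be dropped before appending. The following is a minimal sketch of one plausible way to do that, using a fabricated chunk list instead of a live engine. The names `trim_overlap` and `merge_stream` are chosen for this sketch, and the actual `stream_and_merge` helper in `sglang.utils` may be implemented differently.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Accumulate streamed chunks into one string without repeated overlaps."""
    final_text = ""
    for chunk in chunks:
        final_text += trim_overlap(final_text, chunk["text"])
    return final_text


# Hypothetical stream in which consecutive chunks repeat earlier text.
fake_chunks = [
    {"text": " Paris"},
    {"text": " Paris, the"},
    {"text": " the City of Light"},
]
merged = merge_stream(fake_chunks)
print(repr(merged))
```

The suffix/prefix comparison is quadratic in the worst case, which is acceptable for short chunks but worth revisiting for very long outputs.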



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a software engineer. I've been working in the tech industry for [Years] years now and have gained a wealth of experience in both design and development. I enjoy being creative and taking risks while learning from failures. I'm confident and dedicated, and I'm always eager to improve and advance my career. I love to collaborate with others and work towards shared goals. I'm a team player and always strive to exceed expectations. Thank you for considering me for a job offer. 

In your personal life, what's your greatest strength? As an engineer, my greatest strength is my ability to think creatively and come

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, an ancient French city located in the central region of the country, on the banks of the Seine River. The city is home to the Notre-Dame
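
One reason to prefer the asynchronous call is that other coroutines can make progress while generation is pending. The sketch below shows this with `asyncio.gather` and a hypothetical `fake_generate` coroutine that merely sleeps; it stands in for an awaitable generation call and is not part of SGLang.

```python
import asyncio
import time


async def fake_generate(prompt, delay=0.05):
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(delay)  # simulates waiting on the engine
    return {"text": f" <completion for: {prompt}>"}


async def heartbeat(ticks):
    """Other work that keeps running while generation is pending."""
    for _ in range(3):
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.01)


async def main():
    ticks = []
    prompts = ["Hello, my name is", "The capital of France is"]
    # gather() runs all awaitables concurrently and preserves input order.
    *outputs, _ = await asyncio.gather(
        *(fake_generate(p) for p in prompts),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(main())
for out in outputs:
    print(out["text"])
```

Inside a notebook, replace `asyncio.run(main())` with `await main()` or apply `nest_asyncio` first.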

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert your name]. I'm excited to meet you and explore the world of fiction with you. How can I help you today?
As an AI language model, I am always ready to help you with any questions or concerns you may have. Please feel free to ask me anything you need help with, and I'll do my best to assist you. What's a fictional character? A fictional character is a character that exists in a story or a fictional world created by the writer. They can be characters from real life or from books, movies, or other forms of media. They can be imaginary, but they still have a human-like

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the 2015–2016 capital city.

Paris is the 2015–2016 capital city, and the second-largest city in the European Union after Brussels. The city is home to the European Parliament, the United Nations Headquarters, and numerous museums and cultural institutions. It is also known for its wine industry, fashion industry, and fashion-forward fashion. The city is home to the Louvre and the Eiffel Tower, as well as numerous landmarks and museums throughout the city. Paris is considered one of the world's most iconic cities and is home to the World Trade Center,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain, and many factors such as technological advancements, policy changes, and changes in societal values and norms are likely to shape its trajectory. However, based on current trends and projections, here are some potential future trends in AI:

1. Increased use of AI in everyday life: As AI continues to evolve, its impact on everyday life is likely to increase. This could include more widespread use of self-driving cars, smarter homes and devices, and more personalized and efficient healthcare solutions.

2. AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI continues to improve, it is likely to become
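
The asynchronous streaming cell consumes chunks with `async for` and prints each one with `end=""` so that the text appears incrementally on a single line. The sketch below reproduces that consumption pattern with a hypothetical `fake_token_stream` async generator in place of the engine, which makes it runnable without a model.

```python
import asyncio


async def fake_token_stream(tokens):
    """Hypothetical async generator standing in for a streaming engine call."""
    for token in tokens:
        await asyncio.sleep(0)  # let other tasks run between tokens
        yield token


async def collect(tokens):
    pieces = []
    async for piece in fake_token_stream(tokens):
        # end="" keeps consecutive tokens on one line instead of one per line
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()  # finish the line once the stream is exhausted
    return "".join(pieces)


text = asyncio.run(collect([" Paris", ",", " the", " capital", " of", " France", "."]))
```

Collecting the pieces in a list and joining once at the end avoids repeated string concatenation when the full text is also needed.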




In [6]:
llm.shutdown()
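
In a script, an exception raised between engine creation and `llm.shutdown()` would skip the shutdown call. A minimal sketch of one way to guard against that is shown below, assuming only that the engine exposes a `shutdown()` method as in the cell above. `DummyEngine` and `managed` are hypothetical names for this illustration and are not part of SGLang.

```python
from contextlib import contextmanager


class DummyEngine:
    """Hypothetical stand-in that only records whether shutdown() ran."""

    def __init__(self):
        self.closed = False

    def generate(self, prompt):
        if self.closed:
            raise RuntimeError("engine already shut down")
        return {"text": f" <completion for: {prompt}>"}

    def shutdown(self):
        self.closed = True


@contextmanager
def managed(engine):
    """Yield the engine and always call shutdown(), even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = DummyEngine()
try:
    with managed(engine) as llm:
        llm.generate("Hello, my name is")
        raise ValueError("simulated failure mid-run")
except ValueError:
    pass

print(engine.closed)
```

A plain `try`/`finally` around the generation code achieves the same effect without the helper.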