# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop (a nested loop), you need to add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
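To illustrate why this is needed, the sketch below uses only the standard library to reproduce the error that plain `asyncio` raises when `asyncio.run` is called while a loop is already running, which is the situation inside IPython or Jupyter. The function names `inner` and `outer` are illustrative and are not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second loop is not allowed by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` patches `asyncio` so that such nested calls are permitted.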

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Chris. I am 17 years old. I am from China. Now I'm in England. I'm a student. I go to school in a middle school. The students there are all boys. The Chinese girls are not so many. But they have a lot of friends. In the evening, I like to watch TV with my friends, like watching movies and listening to music. They are very kind to me. They are kind to me. I like to play sports. I play basketball and soccer. I like to play with my friends. I want to be a football player when I grow up. I hope to be
Prompt: The president of the United States is
Generated text:  running for a second term. How many years will it take for him to spend his second term? The answer is 7 years. The president of the United States typically serves two terms in a single term, and it takes 7 years to complete a full term of 4 years in the Senate. Therefore, it will take 7 years for the president to spend his second term. 

The answer is \boxed{7}.
Prompt: The capital of Fra
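
The `temperature` and `top_p` entries in `sampling_params` control how random the generated text is. The sketch below shows the textbook definitions of temperature scaling and top-p (nucleus) filtering on a toy four-token vocabulary. It is an illustration only and does not claim to mirror SGLang's internal sampler; the helper names `apply_temperature` and `top_p_filter` and the example logits are assumptions made for this example.

```python
import math


def apply_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
probs = apply_temperature(logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print(nucleus)
```

A lower temperature sharpens the distribution and a lower `top_p` discards more of the unlikely tokens, which is why the streaming examples that use `temperature=0.2` tend to produce more repetitive, predictable text.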

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a few key points about yourself, such as your education, experience, or hobbies]. I'm looking forward to meeting you and learning more about you. What can you tell me about yourself? I'm a [insert a few key points about yourself, such as your education, experience, or hobbies]. I'm looking forward to meeting you and learning more about you. What can you tell me about yourself? I'm a [insert a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its cuisine, fashion, and art scene. Paris is a major tourist destination and a cultural hub, attracting millions of visitors each year. The city is also home to many important institutions such as the French Academy of Sciences and the Louvre Museum. Paris is a vibrant and dynamic city that continues to be a major center of culture and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some potential future trends include:

1. Increased use of AI in healthcare: AI is already being used to diagnose and treat diseases, and it has the potential to revolutionize the field of medicine. AI-powered diagnostic tools, such as AI-powered X-rays and AI-powered pathology analysis, are already being used in hospitals around the world.

2. AI in finance: AI is already being used to analyze financial data and make investment decisions. In the future, we may see even more advanced AI-powered financial tools, such as AI-powered trading
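
The "overlap removal" mentioned in the cell above can be pictured as follows: if a streamed chunk begins with text that was already received, only the new suffix is appended. This is a minimal sketch of that idea on simulated chunks, assuming chunks may repeat the tail of earlier output. The names `trim_overlap` and `merge_chunks` and the sample chunks are illustrative, and the actual helper in `sglang.utils` may be implemented differently.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Accumulate chunks into one string, skipping repeated text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Simulated stream in which each chunk repeats part of the previous one.
simulated_chunks = [" Paris", " Paris is", " is the capital", " capital of France."]
print(merge_chunks(simulated_chunks))
```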



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  John Smith, and I'm an experienced writer of fiction. My writing has been published in several literary magazines and I'm currently working on a series of short stories set in a distant future where technology has advanced to the point where humans have colonized the planet Mars. I'm also a freelance writer with a background in marketing, and I'm always looking for new writing prompts and ideas to spark my imagination. Thank you for taking the time to meet me. How can I assist you with your writing projects? As John Smith, can you provide me with some writing tips on how to improve my storytelling skills? Absolutely! I can certainly provide you

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a historical and cultural center with iconic landmarks such as the Eiffel Tower and Notre-Dame Cathedral. 
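
Because the asynchronous call is awaited, it can be combined with other `asyncio` work. The sketch below shows the general pattern of awaiting several requests concurrently with `asyncio.gather` while keeping results in prompt order. It uses a hypothetical stand-in coroutine, `fake_async_generate`, in place of the real engine so that it runs without a GPU. The stand-in and its `delay` parameter are assumptions for illustration only.

```python
import asyncio


async def fake_async_generate(prompt: str, delay: float) -> dict:
    """Hypothetical stand-in for an engine call; returns a dict with a 'text' key."""
    await asyncio.sleep(delay)
    return {"text": f" <completion for: {prompt}>"}


async def run_all(prompts):
    # Later prompts finish first here, yet gather returns results in input order.
    tasks = [
        fake_async_generate(p, delay=0.01 * (len(prompts) - i))
        for i, p in enumerate(prompts)
    ]
    return await asyncio.gather(*tasks)


prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(run_all(prompts))
for prompt, output in zip(prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Inside a notebook, the final `asyncio.run` call requires `nest_asyncio.apply()` as described in the Nest Asyncio section.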

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [age] year old girl who has a love for [favorite hobby or activity]. I enjoy [how I spend my free time] with my family and friends. I'm always trying to learn new things and I'm always looking for ways to make my life more interesting. And above all, I'm a [character trait or personality]. I'm [excellent at], and I enjoy [exciting activity]. If you had the chance to meet me, what would you tell me about me? [Name] is [type of person]. [Name] is [happy or sad]. [Name] is

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is located on the north bank of the Seine River in the center of the country.

To summarize the passage in one sentence, you could say:
The capital city of France, Paris, is situated on the north bank of the Seine River, in the heart of the country.

This concise statement captures the key information about Paris's location on the Seine River and its central position within France. It provides a brief yet comprehensive overview of Paris's historical significance and its role as the capital. The sentence structure is clear and easy to understand, making it an appropriate summary for most readers. However, a more poetic or

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be driven by several trends, including:

1. Increased integration with other technologies: AI is already becoming integrated into other technologies such as smart homes, self-driving cars, and virtual and augmented reality. We may see more integration between AI and other emerging technologies as these technologies become more prevalent.

2. Autonomous vehicles: Autonomous vehicles are expected to be a major driver of AI in the future. The technology is already in the works and will revolutionize transportation and become more prevalent in the coming years.

3. Personalized AI: As AI becomes more sophisticated, we may see more personalized AI solutions that adapt to individual users' needs and preferences

In [6]:
llm.shutdown()