# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
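
As a rough illustration of the custom-server use case, the core of such a server is a request handler that forwards prompts to the engine and returns the generated text. The sketch below is not the linked `custom_server.py`: `handle_generate`, the JSON field names, and the `EchoEngine` stand-in are hypothetical, and the only engine behavior assumed is the `generate(prompts, sampling_params)` call and the `output["text"]` field used later in this document.

```python
import json


class EchoEngine:
    """Hypothetical stand-in so the sketch runs without a GPU or SGLang."""

    def generate(self, prompts, sampling_params):
        # Mimic the output shape used in this document:
        # one dict with a "text" key per prompt.
        return [{"text": f" (echo) {prompt}"} for prompt in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body."""
    request = json.loads(body)
    prompts = request["prompts"]
    sampling_params = request.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    response = {"texts": [output["text"] for output in outputs]}
    return json.dumps(response).encode("utf-8")


# With SGLang installed, `engine` would be an `sgl.Engine(...)` instance instead.
engine = EchoEngine()
request_body = json.dumps(
    {"prompts": ["Hello"], "sampling_params": {"temperature": 0.8}}
).encode("utf-8")
print(handle_generate(engine, request_body).decode("utf-8"))
```

Keeping the handler independent of any web framework means the same function can sit behind whichever HTTP layer you choose.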



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
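
If the same script has to run both as a plain Python program and inside a notebook, one option is to apply the patch only when an event loop is already running. This is a minimal sketch under that assumption; `in_running_event_loop` and `apply_nest_asyncio_if_needed` are hypothetical helper names, and `nest_asyncio` is imported only when the patch is actually needed.

```python
import asyncio


def in_running_event_loop() -> bool:
    """Return True when the caller is executing inside a running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio for nested use only when a loop is already running."""
    if not in_running_event_loop():
        return False
    import nest_asyncio  # imported lazily so plain scripts do not need it

    nest_asyncio.apply()
    return True


print("patched:", apply_nest_asyncio_if_needed())
```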

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Jack, my birthday is on May 22nd, and I live in a big city. My mom and I are very busy every day, but we always have a lot of fun. We have many afternoons free during the summer when we go to the beach. My friend and I usually like to go to the beach on Sunday afternoon. We usually have a beach party on the beach. And we usually swim in the ocean. We don't usually eat a lot of food. We usually have a big meal with our friends. We like to talk a lot and laugh a lot. We like to eat ice cream and drinks
Prompt: The president of the United States is
Generated text:  a very important person in the country. He/she is supposed to be the leader of the country. The president also has to answer the questions that people ask to the people who want to ask him/her. 2.1.1 Do you think the president is always right? 2.1.2 Can the president control the world? 2.1.3 Can the president make the people happier? 2.1.4 Can the president make the people healthier? 2
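
In an offline batch job the results are usually saved rather than printed. The sketch below pairs each prompt with the `text` field of its output and serializes the pairs as JSON Lines. `to_jsonl` and the record field names are hypothetical; the example outputs are hand-written placeholders with the same shape as the list returned by `llm.generate` above.

```python
import json


def to_jsonl(prompts, outputs) -> str:
    """Serialize prompt/completion pairs as JSON Lines, one record per prompt."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Placeholder data, not real model output.
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
print(to_jsonl(example_prompts, example_outputs))
```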

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [Age] year old [Gender] [Occupation]. I am currently [Current Location] and I am [Current Job Title]. I am a [Current Hobby/Interest] enthusiast and I love [Favorite Food/Drink/Activity]. I am always looking for new experiences and I am always eager to learn new things. I am a [Current Goal/Objective] and I am always looking for ways to [Describe a new skill or hobby]. I am a [Current Personality Type] and I am always [Describe a positive trait or characteristic]. I am a [Current Motivation] and I am

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also the seat of the French government and the country's cultural and political center. Paris is a bustling metropolis with a rich history dating back to the Roman Empire and the French Revolution. It is a popular tourist destination and a major economic hub, with a diverse array of restaurants, shops, and entertainment options. The city is also home to many famous museums and art galleries, including the Musée d'Orsay and the Musée Rodin. Paris is a city of contrasts, with

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence, allowing it to learn and adapt in ways that are difficult for humans to do. This could lead to more efficient and effective decision-making in a wide range of applications.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations, including issues such as bias, transparency, and accountability. This will require a more rigorous and transparent approach to AI development and deployment.

3. Increased use of AI in healthcare: AI is
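
The "overlap removal" mentioned in the banner refers to merging streamed chunks without duplicating text that a new chunk shares with what has already been received. The following sketch illustrates that idea on plain strings. It is not necessarily how `stream_and_merge` is implemented; `drop_overlap` and `merge_chunks` are names chosen for this example, and the chunk list is made up.

```python
def drop_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` without the longest prefix that `existing` already ends with."""
    max_overlap = min(len(existing), len(chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping text that repeats the end of the merged result."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one repeats part of the previous text.
print(repr(merge_chunks([" Paris", " Paris is", " is the capital"])))
```

A purely textual overlap check like this one cannot tell a re-sent prefix from a genuine repetition, so `merge_chunks(["ha", "ha"])` yields `"ha"`; treat it as a heuristic.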



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [gender] [gender role] with [job title] at [organization]. I'm passionate about [one or more of the following] and I am excited to dive into this new chapter of my life! [Justify the passion, experience, and skills you have for the role you are taking on here.] [Make sure to include specific details about your passion, experience, and skills relevant to your job or role. ] [Make sure your introduction is neutral and professional, and doesn't come across as too personal or self-centered. ] [Remember to include your current location, any hobbies, interests

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

A) Incorrect - Paris is not the capital of France.
B) Incorrect - Paris is the capital of France.
C) Correct - Paris is the capital of France.
D) Incorrect - The capital of France is n
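
One reason to prefer the asynchronous API is that several requests can be awaited together. The sketch below runs the same prompt list under several sampling settings with `asyncio.gather`. It assumes the engine accepts concurrent `async_generate(prompts, sampling_params)` calls, which this document does not demonstrate; `sweep_sampling_params` and `FakeAsyncEngine` are hypothetical, and the fake engine exists only so the example runs without a GPU.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in that mimics the call shape of `llm.async_generate`."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        temperature = sampling_params["temperature"]
        return [{"text": f" [T={temperature}] {prompt}"} for prompt in prompts]


async def sweep_sampling_params(engine, prompts, param_grid):
    """Run one batch per sampling setting and await all batches together."""
    tasks = [engine.async_generate(prompts, params) for params in param_grid]
    # gather returns results in the same order as `param_grid`.
    results = await asyncio.gather(*tasks)
    return list(zip(param_grid, results))


async def demo():
    grid = [{"temperature": 0.2}, {"temperature": 0.8}]
    for params, outputs in await sweep_sampling_params(FakeAsyncEngine(), ["Hello"], grid):
        print(params, [output["text"] for output in outputs])


asyncio.run(demo())
```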

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I am a dedicated amateur photographer who has been traveling the world for the past few years. I have a passion for capturing the beauty of nature and documenting its every moment. My photography style is minimalist and focuses on creating a sense of calm and serenity in my subjects. I am confident in my ability to capture moments that speak to the human experience and share them with others. If you're looking for a photographer who is passionate about capturing beauty and inspiring others, then I'm the one for you! What kind of photography are you particularly interested in, and what motivates you to take photos? My photography style is

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its historical landmarks, vibrant fashion scene, and world-class museums. It is a major cultural and economic hub with a population of over 2. 5 million people. Paris is also famous for its food culture, particularly its famous Parisian boudin, a traditional Parisian dish made with pork skin and spices. The city is known for its romantic and romantic nightlife, with many locations for couples to enjoy romantic dinners and concerts. In addition to being a global cultural and economic center, Paris is also a hub of transportation, with many bus and train lines connecting it to other cities. The city is also


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be highly competitive, driven by the rapid advancements in computing technology and the increasing complexity of tasks that AI systems need to perform. Here are some possible future trends in AI:

1. Advancements in Machine Learning and Deep Learning: Machine learning and deep learning will continue to advance, with the ability to build more complex and sophisticated AI systems. This will enable AI to learn from data and make more accurate predictions and decisions.

2. Improved Security and Privacy: As AI systems are used for more complex tasks, there will be a need to ensure their security and privacy. This will require advancements in security protocols and data protection measures.

3.



```python
llm.shutdown()
```
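
In a script, an exception raised during inference would skip a trailing `shutdown()` call. One way to guard against that is a small context manager built on `try`/`finally`. This is a sketch: `engine_session` is a hypothetical helper, and `FakeEngine` is a stand-in that only records whether `shutdown()` was called.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(make_engine):
    """Create an engine and guarantee `shutdown()` runs, even if inference raises."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in so the sketch runs without SGLang."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


with engine_session(FakeEngine) as engine:
    pass  # run inference here

print("shut down:", engine.is_shut_down)
```

With the real engine, the factory could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`.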