# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
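
The likely reason is that asyncio refuses to start a second event loop while one is already running in the current thread, which is the situation inside IPython and Jupyter. The sketch below reproduces that failure with the standard library only (no SGLang involved); `inner` and `outer` are hypothetical names used for illustration.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are now inside a running event loop, as in a Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempt to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that this kind of nested call is allowed.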

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Linus and I'm a programmer. I am really proud of my recent work on a new version of the chess game, which I call "CyberChess". The new version has a much better user interface, and I hope that it will be even better in the future.

At the moment, I am working on a new chess piece called "Rogue". I want to make the Rogue a more realistic representation of an enemy chess piece. The Rogue should have the following characteristics:

  * The Rogue should be a unit that can attack the player's pieces.
  * The Rogue should be a unit that can attack the enemy's pieces
Prompt: The president of the United States is
Generated text:  a person, a state is a place, and a person's job is a position. Therefore, the president of the United States is a position. Which of the following, if true, most strengthens the argument?
A: The president of the United States is a person, and a position is a person's job.
B: The president of the United States is a state, and
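
For batch jobs it is often convenient to keep each prompt together with its output. The sketch below assumes only what the cell above shows: `generate` takes a list of prompts plus a sampling-parameter dict and returns one dict with a `"text"` key per prompt. `StubEngine` and `collect_results` are hypothetical names; the stub stands in for `sgl.Engine` so the example runs without a GPU.

```python
from typing import Any, Dict, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        # Mimics the return shape used above: one dict with a "text" key per prompt.
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def collect_results(
    engine: Any, prompts: List[str], sampling_params: Dict[str, Any]
) -> List[Dict[str, str]]:
    """Run one batch and pair each prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected one output per prompt")
    return [{"prompt": p, "text": o["text"]} for p, o in zip(prompts, outputs)]


records = collect_results(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for record in records:
    print(record)
```

Passing the real `llm` object in place of `StubEngine()` should work the same way, provided the return shape matches the one shown above.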

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a cultural and historical center with a rich history dating back to the Roman Empire and the Middle Ages. Paris is a popular tourist destination and a major economic hub, with a diverse array of restaurants, shops, and entertainment venues. It is also home to many notable French artists, writers, and composers. The city is known for its cuisine, including its famous Parisian dishes such as croissants, escargot, and charcuterie. Paris is a vibrant and dynamic city with a strong

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased Use of AI in Healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As the technology continues to advance, we can expect to see even more widespread use of AI in healthcare, with more sophisticated algorithms and machine learning models being developed to improve diagnosis, treatment, and patient care.

2. Increased Use of AI in Finance: AI is already being used in finance to improve risk management, fraud detection, and trading



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a/an [Job Title] with [Number of Years in This Position] years of experience in [field or industry]. I'm a/an [Type of Personality] person and I'm constantly seeking opportunities to learn and grow. I'm a/an [Motivation Level] person and I'm always ready to make a positive impact in the world. I'm [Positive Attitude] and I'm always willing to do whatever it takes to help others succeed. I'm a/an [Loyalty Level] person and I'm always willing to work hard for my goals. I'm [Resilience Level] and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris, the 19th and largest city of France, is a global metropolis known for its rich culture, art, and cuisine. The city's architecture, including the iconic Eiffel Tower and Notre-Dame Cathedral, is a UNESCO World Heritage site. It is a major transport
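
Because `async_generate` is awaitable, several independent batches can be submitted from one coroutine with `asyncio.gather`. The sketch below assumes the engine accepts concurrent `async_generate` calls, which this document does not confirm. `StubAsyncEngine` and `run_batches` are hypothetical names, and the stub replaces `sgl.Engine` so the example runs anywhere.

```python
import asyncio
from typing import Any, Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; only async_generate is mimicked."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(
    engine: Any, batches: List[List[str]], sampling_params: Dict[str, Any]
) -> List[List[Dict[str, str]]]:
    """Submit every batch at once and wait for all of them to finish."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    # gather preserves input order, so results[i] belongs to batches[i].
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(StubAsyncEngine(), batches, {"temperature": 0.8}))
print([len(batch_outputs) for batch_outputs in results])
```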

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], and I'm a/an [character name], a/an [occupation], with [number of years] years of experience in this field. I'm fluent in [language] and have [number of books] published in the [genre] world. I am dedicated to [specific goal or mission], and I am always looking to expand my knowledge and skills. I'm here to meet new people and learn from them, and I'm always on the lookout for a good conversation. I enjoy reading, and I'm always up for a good laugh. I believe in the power of empathy, and I strive to make a positive impact

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic Eiffel Tower and its stunning gardens. Paris is a sprawling and vibrant metropolis with a rich history dating back to the Middle Ages and a modern city-state today. The city is home to some of the world’s most famous museums, art galleries, and landmarks, including the Louvre, the Centre Pompidou, and the Champs-Élysées. Paris is also famous for its cuisine, with its famous beignets, croissants, and pastries, as well as its wine and cheeses. The city is a great place to visit for both history buffs and food enthusiasts

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly unpredictable, and it is difficult to predict exactly what will happen. However, there are several potential trends that could be expected in the coming years:

1. Increased integration with human AI: AI is becoming more integrated with human AI in the workplace. For example, AI-powered chatbots are becoming more prevalent, and they are being used to interact with customers and employees in a more human-like way.

2. Better handling of ethical and legal issues: There will be a greater emphasis on creating AI systems that are more ethical and fair, and that comply with legal and regulatory standards. This will require a greater focus on privacy and data protection,
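
When streaming asynchronously, you may want to show chunks as they arrive and still keep the complete text afterwards. The sketch below shows that pattern with a stub async generator. `stub_chunk_stream` and `print_and_collect` are hypothetical names, and the stub stands in for `async_stream_and_merge(llm, prompt, sampling_params)`, which the cell above iterates with `async for` in the same way.

```python
import asyncio
from typing import AsyncIterator, List


async def stub_chunk_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for async_stream_and_merge(llm, prompt, sampling_params)."""
    for chunk in [" Paris", ",", " the", " capital", "."]:
        await asyncio.sleep(0)  # yield control, as a real stream would between chunks
        yield chunk


async def print_and_collect(chunks: AsyncIterator[str]) -> str:
    """Print each chunk as it arrives and return the full text at the end."""
    parts: List[str] = []
    async for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(print_and_collect(stub_chunk_stream()))
```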




In [6]:
llm.shutdown()
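
In a script, as opposed to a notebook, it can help to make sure `shutdown()` runs even when generation raises. The sketch below wraps the call in `try`/`finally`. `StubEngine` and `run_and_shutdown` are hypothetical names, and the stub replaces `sgl.Engine` so the example runs without a GPU.

```python
from typing import Any, Dict, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown ran."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self) -> None:
        self.is_shut_down = True


def run_and_shutdown(
    engine: Any, prompts: List[str], sampling_params: Dict[str, Any]
) -> List[Dict[str, str]]:
    """Generate once, then shut the engine down whether or not generation failed."""
    try:
        return engine.generate(prompts, sampling_params)
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    run_and_shutdown(engine, [], {"temperature": 0.8})  # empty batch triggers the error path
except ValueError as exc:
    print(f"generation failed: {exc}")
print(engine.is_shut_down)
```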