# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that runs inside an already-running event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
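
The reason for this patch can be seen with the standard library alone. The sketch below is not SGLang code, and `inner` and `outer` are illustrative names. It calls `asyncio.run()` from inside a coroutine, which mimics an environment where an event loop is already running (such as a Jupyter kernel), and captures the resulting `RuntimeError`:

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # An event loop is already running here, much like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Without patching, the nested call is rejected. `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.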

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Casey, I am 18 years old, born in 1990, I will be 19 in March 2008. I've been in the UK for over 12 years now, but when I look at the clock, I see that I am only 78 days away from reaching 19. This is causing me to be extremely nervous, since I have never been in the UK for over 12 years, or seen my clock before. What should I do to calm my nerves? I am not worried about things like getting lost or anything, I am just nervous about the
Prompt: The president of the United States is
Generated text:  a political office, which means they are not necessarily president of the U. S. that is held for a term of more than two years. There are many different types of leadership positions that the president may hold, including the office of chief executive officer, chairman, CEO, governor, senator, lieutenant governor, and so forth. President Obama has been the first African American president in U. S. history, and he has been in office for two years. In 
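
For batch jobs, the list returned by `generate` is usually paired back with the prompts and persisted. The following is a minimal sketch, assuming (as the cell above shows) that each output is a dict with a `text` key. `StubEngine` and `run_batch` are hypothetical names, and the stub only stands in for `sgl.Engine` so that the snippet runs without a GPU:

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; mimics only the call shape used above."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" [completion {i}]"} for i, _ in enumerate(prompts)]


def run_batch(engine, prompts, sampling_params):
    """Generate for all prompts at once and pair each prompt with its output text."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected exactly one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


records = run_batch(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)

# One JSON object per line (JSONL) is a common format for batch results.
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```

With a real engine, the same `run_batch` helper could be called with `llm` in place of `StubEngine()`.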

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm a [Number] year old, [Gender] and [Country]. I'm [Number] of [Number] years old. I'm [Number] of [Number] years old, [Gender] and [Country]. I'm [Number] of [Number] years old, [Gender] and [Country]. I'm [Number] of [Number] years old, [Gender] and [Country]. I'm [Number] of [Number] years old, [Gender] and [Country]. I'm [Number] of [Number

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French National Museum of Modern Art. Paris is a cultural and economic hub, known for its rich history, art, and cuisine. It is a popular tourist destination, attracting millions of visitors each year. The city is also home to the French Parliament, the French National Museum of Modern Art, and the Louvre Museum. It is a major center for politics, culture, and art in France. Paris is a city of contrasts, with its modern architecture

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing for more complex and nuanced interactions between humans and machines.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations, including issues such as bias, transparency, and accountability.

3. Greater use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, but there is a potential for even greater use in the future, with the potential to improve diagnosis, treatment, and patient care.
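
The cell above describes `stream_and_merge` as streaming "with overlap removal". The following is an illustrative sketch of what overlap-aware merging can look like; it is not the implementation in `sglang.utils`. `trim_overlap` and `merge_chunks` are hypothetical names, and the assumption is that a chunk may begin by repeating the tail of the text already received:

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends `existing`."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = ["The capital", " capital of France", " France is Paris."]
result = merge_chunks(chunks)
print(result)
```

The search starts from the longest possible overlap so that the largest repeated span is removed first.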





### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am [Age]. I am a [job title] with a degree in [major/subject]. I have been in the [industry/field of study] for [number] years and I have had [number] years of experience in [occupation]. I am very [positive/attentive] and I am always [ambitious]. I am a [professional/creative] who enjoys [activity or hobby]. I am a [team player]. I am a [devoted/career] and I am always [self-motivated]. I am very [ambitious]. My goal is to achieve

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic Eiffel Tower, French Riviera, and renowned museums like the Louvre and the Musée d’Orsay. It's also famous for its rich history and culture, including the arrival of Victor Hugo. Paris is the world's 21st most populous city and a UNESCO World Heritage site. The French capital is home to many di
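
One reason to prefer the asynchronous call is that the event loop stays free to do other work while generation is awaited. The sketch below illustrates the pattern with `asyncio.gather`. `StubEngine` and `run_two_batches` are hypothetical names, the stub only imitates the `async_generate(prompts, sampling_params)` call shape shown above, and whether overlapping calls benefit a real engine has not been verified here:

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend the model is working
        return [{"text": f" [completion for {p!r}]"} for p in prompts]


async def run_two_batches(engine):
    params = {"temperature": 0.8, "top_p": 0.95}
    batch_a = ["Hello, my name is"]
    batch_b = ["The capital of France is", "The future of AI is"]
    # Both requests are awaited together instead of one after the other.
    return await asyncio.gather(
        engine.async_generate(batch_a, params),
        engine.async_generate(batch_b, params),
    )


results_a, results_b = asyncio.run(run_two_batches(StubEngine()))
print(len(results_a), len(results_b))
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list still lines up with its own prompt batch.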

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  ___________ and I'm an AI language model. My purpose is to assist and provide helpful responses to users. I am always ready to learn and improve, and I am here to answer any questions you may have. What can I do for you today? Do you have any specific topic or area of expertise you'd like me to help with? Is there anything in particular I'd like to know about, such as a specific topic or problem you're looking to solve? Whatever it is, I'm here to help you find the answers you need. How can I assist you today? Let me know what you'd like to learn or



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France and the second-largest city in the European Union, with a population of over 2. 5 million people. Its metro system is the world's seventh-largest. It has a rich cultural heritage dating back to its Roman origins and a history of significant influence from the French monarchy, including Louis XIV. Paris is known for its romantic atmosphere, with its many grand boulevards, historic landmarks, and cafes. The city is also famous for its fashion, gastronomy, and art scene. Its skyline features iconic buildings such as the Eiffel Tower and Notre-Dame Cathedral. Paris is a



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain, but there are several possible trends that are likely to shape the field in the coming years:

1. Increased personalization: AI will become even more personalized in the future. As AI systems learn more about individual users, they will be able to provide more tailored and relevant recommendations.

2. Self-driving cars: Self-driving cars are already becoming a reality, and it's likely that AI will continue to advance in this area. Autonomous vehicles will become more sophisticated and will be able to navigate roads and intersections more efficiently.

3. Medical imaging: AI will play a role in medical imaging, helping doctors to better diagnose and treat diseases.
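
The streaming loop above prints each chunk as it arrives. When the full text is also needed afterwards, the chunks can be collected while iterating. A minimal sketch of that pattern using only the standard library; `stub_stream` and `collect` are hypothetical names, and the stub stands in for the async iterator that `async_stream_and_merge` is used as above:

```python
import asyncio


async def stub_stream(text):
    """Hypothetical async generator that yields a completion word by word."""
    for index, word in enumerate(text.split(" ")):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word if index == 0 else " " + word


async def collect(stream):
    """Consume an async stream and return the concatenated text."""
    pieces = []
    async for chunk in stream:
        pieces.append(chunk)
    return "".join(pieces)


full_text = asyncio.run(collect(stub_stream("Paris is the capital of France.")))
print(full_text)
```

Appending to a list and joining once at the end avoids repeated string concatenation for long generations.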




```python
llm.shutdown()
```