# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
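As a rough illustration of the custom-server use case, the sketch below wraps an engine in a small request handler. `FakeEngine`, `handle_generate`, and the JSON request shape are hypothetical names chosen for this example so that it runs without a GPU; the only thing borrowed from this document is the `generate(prompt, sampling_params)` call and the `output["text"]` access. The linked script shows how this is done with the real engine.

```python
import json


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        # Assumption: mirrors the output["text"] access used in this document.
        return {"text": f" <completion for: {prompt}>"}


def handle_generate(engine, raw_body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(raw_body)
    output = engine.generate(request["prompt"], request.get("sampling_params", {}))
    return json.dumps({"text": output["text"]}).encode("utf-8")


engine = FakeEngine()
body = json.dumps({"prompt": "The capital of France is"}).encode("utf-8")
response = json.loads(handle_generate(engine, body))
print(response["text"])
```

Keeping the handler a plain function of `(engine, bytes)` makes it easy to plug into whichever web framework the custom server uses.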



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
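The likely reason for this patch is that IPython and Jupyter already run an event loop, and the standard library refuses to start a second one inside it. The standard-library sketch below reproduces that kind of error without SGLang; `inner` and `outer` are illustrative names only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in IPython or Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` patches asyncio so that nested use of the loop is allowed, which is why it is needed in those environments.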

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Sabrina! I'm a 4th grader, who enjoys playing with dolls, playing video games, and listening to music. I hope to be a teacher one day and help people learn new things. What's your name, and what's your favorite hobby? 

Based on the preceding passage, is the hypothesis "Sabrina enjoys playing video games for fun and learning new things." true?

 a). yes;
 b). it is not possible to tell;
 c). no;
a). yes; 

The passage clearly states that Sabrina enjoys playing video games, which is the opposite of her favorite hobby, which is playing dolls and
Prompt: The president of the United States is
Generated text:  seeking to increase the number of women and minorities in the federal government. He has proposed a bill that would establish a bipartisan panel of experts, who would evaluate the qualifications of individuals wishing to serve on the federal government. The purpose of the panel would be to ensure that the new hires are qualified and will be a
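The `temperature` and `top_p` entries in `sampling_params` control how random the sampled text is. The pure-Python sketch below shows the general idea of temperature scaling followed by nucleus (top-p) filtering on a toy set of logits. It is an illustration of the concept only, not SGLang's sampler; `top_p_filter` and the example logits are made up for this sketch.

```python
import math


def top_p_filter(logits, temperature, top_p):
    """Return {token_index: probability} kept after temperature + top-p filtering."""
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]
sharp = top_p_filter(logits, temperature=0.2, top_p=0.9)
flat = top_p_filter(logits, temperature=0.8, top_p=0.95)
print(sorted(sharp), sorted(flat))
```

With the low temperature only the top token survives the filter, while the higher temperature leaves several candidates, which is why higher values tend to produce more varied text.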

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title]. I enjoy [job title] because [reason for interest]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I love [hobby or activity], and I'm always looking for ways to expand my skills and knowledge. What's your favorite book or movie? I love

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also home to the French Parliament, the French Academy of Sciences, and the French Quarter. Paris is a cultural and historical center with a rich history dating back to ancient times. It is a popular tourist destination and a major economic hub. The city is known for its cuisine, fashion, and art, and is home to many world-renowned museums, theaters, and galleries. Paris is a vibrant and dynamic city with a diverse population and a rich cultural heritage. It is often referred to as the "City

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some potential future trends include:

1. Increased integration of AI into everyday life: AI is already being integrated into our daily lives through smart home devices, self-driving cars, and virtual assistants like Siri and Alexa. As AI continues to advance, we can expect to see even more integration into our daily lives, from smart homes to virtual assistants to self-driving cars.

2. AI becoming more autonomous: As AI becomes more advanced, we can expect to see more autonomous vehicles on the road. This will likely involve the use of AI to
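The "overlap removal" mentioned in the cell above can be pictured as follows: when a streamed chunk repeats text that has already been received, the repeated prefix is dropped before the chunk is appended. The sketch below is one way to do that under the assumption that each chunk is a dict with a `"text"` field; `strip_overlap`, `merge_chunks`, and `fake_stream` are hypothetical names, and this is not necessarily how `stream_and_merge` is implemented.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    for size in range(min(len(existing_text), len(new_chunk)), 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks while skipping repeated text."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk["text"])
    return merged


# Hypothetical stream: the second chunk repeats the tail of the first.
fake_stream = [
    {"text": " Paris is"},
    {"text": " is the capital"},
    {"text": " of France."},
]
merged = merge_chunks(fake_stream)
print(merged)
```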



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am [Age]. I am a/an [occupation] with over [number of years] experience in [industry, field]. I have always been passionate about [career goal] and have always been motivated to [positive action], but I am not afraid to take risks and be bold in my pursuit. I am a/an [occupation] with over [number of years] experience in [industry, field], and I have always been motivated to [positive action], but I am not afraid to take risks and be bold in my pursuit. I am a/an [occupation] with over [number of years] experience in

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The statements should be concise and accurate. For example, "Paris is the capital of France." or "Paris is the most populous city in Europe." These should be considered correct and concise statements about Paris, but they may n
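One reason to prefer the asynchronous API is that other coroutines can make progress while generation is awaited. The sketch below shows that pattern with `asyncio.gather` and a stub engine. `FakeAsyncEngine` and `generate_all` are hypothetical; only the `async_generate` method name and the `"text"` field follow the usage in this document.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate time spent generating
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts):
    # Start one request per prompt and wait for all of them together.
    pending = [engine.async_generate(p) for p in prompts]
    return await asyncio.gather(*pending)


results = asyncio.run(generate_all(FakeAsyncEngine(), ["a", "b", "c"]))
print([r["text"] for r in results])
```

`asyncio.gather` returns results in the order the awaitables were passed in, so the outputs can still be zipped with the prompts.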

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I am a [career goal] in a [industry or field]. I specialize in [your specialty] and am always looking to learn and improve. Whether it's using my knowledge to help others or just exploring my interests, I am always eager to learn more and grow in my field. Thanks for taking the time to meet me! What's your background and what excites you about your career goal?

Hello, my name is [Your Name] and I am a [career goal] in a [industry or field]. I specialize in [your specialty] and am always looking to learn and improve. Whether it



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the most populous city in the country.

The French capital Paris is the administrative, cultural, and economic center of France and the largest metropolitan area in Europe. It is also a major tourist destination, known for its well-preserved old city and historic landmarks. The city is known for its artistic and cultural scene, including the Louvre, the Eiffel Tower, and the Notre-Dame Cathedral. It is also a major financial center and has been a hub for the French economy since its establishment in the 12th century.

Paris is also a center for politics, including the French legislative body, the National Assembly,



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to see more widespread adoption of machine learning, deep learning, and neural networks. AI is set to become increasingly ubiquitous, with more and more applications being developed that rely on the technology. AI will be integrated into a wider range of industries, from healthcare to finance to transportation, and will be used to solve increasingly complex problems. AI will also become more accessible to the general public, with more devices and platforms being equipped with AI algorithms. AI will continue to evolve, with new technologies and algorithms being developed at a rapid pace, making it difficult to predict the future of AI. However, with the right investment and focus, AI has the
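The streaming output above arrives one small piece at a time through `async for`. The sketch below shows the same consumption pattern with a stub async generator, so it runs without a model. `fake_token_stream` and `collect` are hypothetical names, and splitting on spaces is only a stand-in for real tokenization.

```python
import asyncio


async def fake_token_stream(text, delay=0.0):
    """Hypothetical async generator that yields one word-like piece at a time."""
    for piece in text.split(" "):
        await asyncio.sleep(delay)  # yield control, as a real stream would
        yield piece + " "


async def collect(stream):
    pieces = []
    async for piece in stream:
        # A real consumer could print or forward each piece immediately.
        pieces.append(piece)
    return pieces


pieces = asyncio.run(collect(fake_token_stream("Paris is the capital")))
print(pieces)
```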




```python
llm.shutdown()
```
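In a script, it can be useful to make sure `shutdown()` runs even when generation raises. A minimal sketch of that pattern with `try`/`finally` follows. `FakeEngine` is a hypothetical stub that only records whether `shutdown()` was called, and the simulated error is made up for the example.

```python
class FakeEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def generate(self, prompt, sampling_params=None):
        if "fail" in prompt:
            raise ValueError("simulated generation error")
        return {"text": " ok"}

    def shutdown(self):
        self.closed = True


engine = FakeEngine()
error = None
try:
    engine.generate("please fail")
except ValueError as exc:
    error = str(exc)
finally:
    engine.shutdown()  # runs whether or not generation raised

print(engine.closed, error)
```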