# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
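
For context, the restriction that `nest_asyncio` works around comes from `asyncio` itself: `asyncio.run()` refuses to start while another event loop is already running in the same thread, which is the situation inside IPython. The sketch below reproduces that error with the standard library only. The `inner` and `outer` coroutines are illustrative names, not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, as in an IPython cell.
    coro = inner()
    try:
        # Starting a second loop from here is rejected by asyncio.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` before such code patches the loop so that nested runs are permitted.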

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Niki and I am a software engineer at Uber.
What do you do?
As an Uber engineer, I help ensure that our software systems are reliable, secure, and scalable. We develop and maintain our software in collaboration with customers, our team members, and our customers and partners.
What is the difference between a software engineer and a developer?
A software engineer designs, develops, and tests computer programs and software. This includes creating new software applications and applications that are based on existing software applications. Developers create new software code, and then write unit tests that run against the software code to ensure that the software functions as expected.
Why do you
Prompt: The president of the United States is
Generated text:  in his home in his bedroom. His wife is in the kitchen, and his children are at school. The president is very busy talking on the telephone. The telephone rang. The president answered it and as
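
The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough illustration of what these two settings conventionally mean, the sketch below applies temperature scaling and nucleus (top-p) filtering to a small list of logits in plain Python. It is not SGLang's sampler, and the function name `temperature_top_p_distribution` and the example logits are assumptions made for this illustration.

```python
import math


def temperature_top_p_distribution(logits, temperature, top_p):
    """Return {token_index: probability} after temperature and top-p filtering."""
    # Temperature rescales the logits: values below 1 sharpen the
    # distribution, values above 1 flatten it.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose cumulative
    # probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


example_logits = [2.0, 1.0, 0.0, -1.0]
wide = temperature_top_p_distribution(example_logits, temperature=1.0, top_p=0.9)
narrow = temperature_top_p_distribution(example_logits, temperature=1.0, top_p=0.5)
print(sorted(wide), sorted(narrow))
```

A lower `top_p` keeps fewer candidate tokens, and a lower `temperature` concentrates probability on the most likely ones, which is why the streaming examples that use `temperature=0.2` tend to produce more predictable text.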

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your character, such as "funny, witty, and always up for a good laugh"]. I enjoy [insert a short description of your character's interests, such as "reading, cooking, and playing sports"]. I'm always looking for new experiences and challenges, and I'm always eager to learn and grow. What's your favorite hobby or activity? I'm a huge [insert a hobby or activity, such

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous world-renowned museums, theaters, and festivals. Paris is known for its rich history, including the influence of French Revolution and Napoleon Bonaparte, and its influence on modern French culture and politics. It is a popular tourist destination, attracting millions of visitors each year. Paris is also home to the French Parliament, the French Supreme Court, and the French Academy of Sciences. The city is known for its cuisine, including its famous croiss

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some potential future trends include:

1. Increased integration with human intelligence: AI systems will become more integrated with human intelligence, allowing them to learn from and adapt to human behavior and decision-making processes.

2. Enhanced privacy and security: As AI systems become more sophisticated, there will be increased concerns about privacy and security, with more emphasis on data protection and encryption.

3. Greater use of AI in healthcare: AI will play a more significant role in healthcare, with the ability to analyze medical data and provide personalized treatment recommendations.

4.
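
The cell above refers to "overlap removal" when merging streamed chunks. As a minimal sketch of that idea, and not necessarily how `stream_and_merge` is implemented, assume each new chunk may begin with text that was already emitted. The helpers `trim_overlap` and `merge_chunks` and the sample chunks below are hypothetical and exist only for this illustration.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the maximal repeat is removed.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous text.
chunks = ["The capital", " capital of France", " France is Paris."]
print(merge_chunks(chunks))
```

One limitation of this simple approach is that it cannot tell an accidental repeat from an intentional one: a chunk that legitimately starts with the same characters the text ends with would also be trimmed.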



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [age] year old [gender] who is a member of [specific group or organization]. I am currently [current position or role]. I enjoy [why you enjoy doing what you do]. I like to [what you do as a hobby or to pass the time]. I am [description of what you are]. I have always been [why you are unique]. I believe in [what you believe in]. I am passionate about [what you believe about life or the world]. I am determined to [what you plan to achieve in the future]. I am [why you are that way]. I am

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the Loire Valley region of France, known as the "City of Love" due to its popular romantic and cultural attractions. France's capital city is often referred to as "the City of Love" due to its iconic landmarks like Notre Dame de Paris, the Eiffel 
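
Because `async_generate` is awaited, other coroutines can make progress while generation is pending. The sketch below illustrates that pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in that only mimics the call shape used above (a list of dictionaries with a `text` key), so the example runs without a GPU or a model. The `heartbeat` coroutine is likewise illustrative.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; it performs no real inference."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend the model is working
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(log):
    # Some unrelated work that should keep running during generation.
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0.001)


async def main():
    engine = FakeEngine()
    log = []
    # Run generation and the heartbeat concurrently on one event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["Hello", "Bonjour"], {"temperature": 0.8}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(main())
for output in outputs:
    print(output["text"])
print(log)
```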

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm [Age]. I'm a [Field of Interest] enthusiast who is always eager to learn and discover new things. I love [Interests or hobbies], such as [mention specific interests or hobbies here]. I have a strong work ethic, and am always up to [mention any extra responsibilities you have, such as volunteer work, extra-curricular activities, etc.]. I'm [Degree], and I'm currently pursuing [a specific career path or area of interest]. I'm confident in my abilities and am always ready to learn and grow. If you have any questions or need assistance, please don't hesitate

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the capital of France and is the largest city in the country. It is home to the European Parliament, the French National Senate, and the French Supreme Court. The city is known for its beautiful architecture, cultural richness, and festive celebrations during the annual Carnival. Paris is also a major center for business and finance, hosting the Eiffel Tower, the Louvre Museum, and many other important landmarks. With a population of over 2. 3 million people, Paris is a vibrant and diverse city that is a global center of culture, art, and philosophy. The city's historic landmarks and museums have made it

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and complex, with many potential developments that could shape the future of human civilization. Some of the most common trends in AI include:

1. Increased use of AI in healthcare: AI can help doctors diagnose and treat diseases with greater accuracy and efficiency, while also improving patient outcomes and reducing healthcare costs.

2. Automation of tasks: AI is already being used to automate tasks in industries such as manufacturing and transportation, reducing the need for human labor and increasing productivity.

3. Personalization of experiences: AI is being used to personalize experiences for individuals, from recommendations for movies and TV shows to personalized shopping recommendations.

4. Autonomous vehicles: AI is




In [6]:
llm.shutdown()