# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
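
The reason for this step is that `asyncio.run()` refuses to start while an event loop is already running, which is the situation inside an IPython cell. The following standard-library sketch reproduces that failure without SGLang; `inner` and `outer` are hypothetical names used only for illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, much like an IPython cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` beforehand patches `asyncio` so that nested calls like this one are permitted.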

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Yu and I am a graduate of the University of Pennsylvania. I worked at the company “Bingo” and also at the company “NASA” in the past. I have a desire to travel and explore different cultures. I have always wanted to see different places and learn about different cultures and languages. I love to travel, and I have visited over 25 countries and lived in over 20 countries. I have a solid understanding of how to research any topic, and I love it! I believe that travel is the best way to travel! I like to start the day early and stay late. I’m very good at planning
Prompt: The president of the United States is
Generated text:  an elected official who serves for a four-year term. Which of the following is true about the president of the United States?
The president of the United States serves a four-year term. This means the president of the United States, serving for a four-year term, is an elected official who represents the country and is respon
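
As the cell above shows, `generate` returns one dict per prompt, and the completion is stored under the `"text"` key. Here is a minimal sketch of turning one batch into JSON-serializable records for later processing. `FakeEngine` and `generate_records` are hypothetical names; `FakeEngine` only imitates the output shape so the sketch runs without a GPU, and a real `sgl.Engine` would be passed in its place.

```python
import json


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        # Mirrors the shape used above: one dict with a "text" key per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def generate_records(engine, prompts, sampling_params):
    """Run one batch and return prompt/completion records."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


records = generate_records(
    FakeEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for record in records:
    print(json.dumps(record))
```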

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, with a rich history dating back to ancient times. Paris is home to many famous museums, including the Louvre, the Musée d'Orsay, and the Musée Rodin. The city is also known for its food, fashion, and music scenes, with Paris being one of the most popular tourist destinations in the world. Paris is a vibrant and dynamic city, with a diverse population and a rich cultural heritage. It is a major hub

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased automation: AI is likely to become more prevalent in manufacturing, transportation, and other industries, where it can perform tasks that are currently performed by humans. This could lead to the widespread adoption of automation in many sectors.

2. Improved privacy and security: As AI becomes more integrated into our daily lives, there will be a need to address concerns about privacy and security. This could lead to the development of new technologies and standards that prioritize user privacy and data security.

3
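
The cell above mentions "overlap removal". One way such merging could work is sketched below: before a chunk is appended, any prefix that already ends the accumulated text is dropped. This is an illustrative assumption and not necessarily how `stream_and_merge` is implemented; `strip_overlap` and `merge_chunks` are hypothetical names.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged = merge_chunks(["Hello", "lo wor", "world"])
print(merged)
```

A purely textual check like this cannot tell a repeated chunk boundary from characters that are genuinely repeated: `strip_overlap("20", "0")` returns an empty string and would lose a digit. Treat it as a sketch of the idea, not a drop-in replacement.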



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am [Title]. I am an [Age], [Gender] [Race] [Height] [Weight], with [Accomplishments]. I am a [Field of Expertise] [Area of Expertise], with [Projects], [Achievements], and [Skills/Qualifications] that make me a [Strength/Passion] individual. I enjoy [Interest/Activity/Topic] and am passionate about [Lifestyle/Philosophy]. I have a [Generational/Genetic] background that has contributed to my [Personality Traits], [Leadership Style], [Hobbies], and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the Île de France region of the French Alps, and is the largest city in France by population, with over 7.5 million inhabitants.
I apologize, but I can't generate that response. I don't have access to up-to-date information about specific cities or regions. The statement you provided is ab
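
Because `async_generate` is awaitable, it can be combined with other `asyncio` tools. The sketch below submits several prompt batches at once with `asyncio.gather`, which returns results in submission order. It assumes the engine accepts concurrent `async_generate` calls, which this document does not confirm; `FakeAsyncEngine` and `run_batches` are hypothetical names, and the fake class only imitates the output shape.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; only mimics async_generate."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    """Submit several prompt batches concurrently and keep their order."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    run_batches(FakeAsyncEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```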

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a young professional with a diverse background. I have a keen eye for detail and a passion for learning new things, which has led me to pursue a career in [Industry]. I'm a solid communicator, able to collaborate with others and resolve conflicts amicably. My strengths lie in [Strengths], and I'm always eager to learn new skills and update myself continuously. I thrive in a fast-paced environment and enjoy working independently. I'm proactive and enjoy problem-solving, always looking for ways to make my work better and more efficient. I'm confident in my abilities and always strive to achieve my goals.



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The statement is: Paris is the capital city of France. The Paris city is one of the world's most populous and most cosmopolitan cities. The city is located in the center of France and has a rich and diverse history dating back over 2000 years. The city is home to a variety of cultural institutions, including the Louvre Museum, the Notre-Dame Cathedral, and the Eiffel Tower. Paris is known for its cuisine, fashion, and art, as well as its iconic landmarks and annual festivals. It is also the seat of the French government and is home to numerous educational institutions and universities.



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be marked by continued advancements in machine learning, deep learning, and natural language processing. These technologies are becoming more accessible, enabling AI to perform tasks that were previously thought to be beyond the reach of machines.

One of the most promising trends is the development of AI that can learn and adapt to new situations without being explicitly programmed. This is called reinforcement learning, and it has the potential to revolutionize industries such as healthcare, transportation, and manufacturing.

Another area of focus is the development of AI that can communicate effectively with humans, much like the human brain. This could lead to the creation of robots and other artificial entities that

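When streaming, it is often useful to show chunks as they arrive and still keep the complete text at the end. The sketch below does this for any async iterator of string chunks. `fake_stream` is a hypothetical stand-in for `async_stream_and_merge(llm, prompt, sampling_params)` so the example runs without an engine, and `collect_stream` is likewise an illustrative name.

```python
import asyncio


async def fake_stream(chunks):
    """Hypothetical stand-in for async_stream_and_merge(llm, prompt, params)."""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield chunk


async def collect_stream(stream, echo=True):
    """Print chunks as they arrive and return the full text at the end."""
    parts = []
    async for chunk in stream:
        if echo:
            print(chunk, end="", flush=True)
        parts.append(chunk)
    if echo:
        print()  # finish the line once the stream is exhausted
    return "".join(parts)


text = asyncio.run(collect_stream(fake_stream([" Paris", ".", " The", " city"])))
```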



In [6]:
llm.shutdown()
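
In a script, an exception raised during generation would skip a trailing `llm.shutdown()` call. One possible pattern, sketched below, wraps the engine in a context manager so that `shutdown()` always runs. `engine_session` and `FakeEngine` are hypothetical names; the fake class only records whether `shutdown()` was called.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; records whether shutdown ran."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(factory):
    """Create an engine and always call shutdown(), even on errors."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with engine_session(FakeEngine) as llm:
    pass  # call llm.generate(...) here with a real engine

try:
    with engine_session(FakeEngine) as failing:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(llm.closed, failing.closed)
```

With a real engine, the factory could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, matching the launch cell earlier in this document.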