# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
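
The snippet below is a minimal sketch, using only the standard library, of the failure that nested event loops cause. It is an illustration of the general `asyncio` behavior, not of SGLang internals; the names `inner` and `outer` are hypothetical. Calling `asyncio.run` while a loop is already running (as is the case inside IPython or Jupyter) raises a `RuntimeError`.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # Inside this coroutine an event loop is already running,
    # which mirrors the situation in IPython or Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Without patching, the nested call is rejected; `nest_asyncio.apply()` is the workaround this section recommends for such environments.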

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Vanessa. I'm a student in Grade 6 and I play computer games a lot. My favorite computer games are Pac Man and Candy Crush Saga. The Pac Man game teaches me to be patient and the Candy Crush Saga makes me learn to play the game. What games do you like? Question 1: What games do you like? Question 2: What games do you like to play? Question 3: What games do you like to play? Question 4: What games do you like to play? The correct answer for question 1 is ______. The correct answer for question 2 is ______. The correct answer for
Prompt: The president of the United States is
Generated text:  a position with a high degree of ____________.
authority and responsibility
authority and integrity
independence and responsibility
innovation and responsibility
authority and control
Answer:
authority and responsibility

Which of the following statements about the B2B e-commerce model is incorrect?
A. The main participants in the B2B model are platform compa

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [job title] at [company name]. I have been working here for [number of years] years and I have always been passionate about [job title] and have always wanted to [job title] at [company name]. I am always looking for new challenges and opportunities to grow and learn, and I am always eager to learn more about [job title] and the company. I am a team player and always strive to work collaboratively with others to achieve our goals. I am always looking for ways to improve my skills and knowledge, and I am always open to new ideas and perspectives. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city that was founded in 787 AD and is the largest city in Europe by population. It is also the seat of the French government, the French parliament, and the headquarters of the French Foreign Ministry. Paris is known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. It is also home to many famous French artists, writers, and musicians. Paris is a vibrant and diverse city with a rich cultural heritage that has been shaped by its history and its people. The city is known for its delicious cuisine, including French cuisine, and its annual festivals

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some possible future trends in AI:

1. Increased automation and robotics: As AI technology continues to advance, we can expect to see more automation and robotics in various industries, from manufacturing to healthcare. This will lead to increased efficiency, productivity, and cost savings for businesses.

2. Enhanced human-AI collaboration: AI will continue to become more integrated into our daily lives, and we can expect to see more collaboration between humans and AI. This will lead to more efficient and effective communication, as well as
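
The cell above mentions "overlap removal" when merging streamed chunks. The following is a minimal sketch of what such merging can look like in general. It is an assumption for illustration, not the implementation of `stream_and_merge`; the helper name `merge_with_overlap` and the sample chunks are hypothetical.

```python
def merge_with_overlap(existing: str, chunk: str) -> str:
    """Append chunk to existing, dropping the longest prefix of chunk
    that repeats the tail of existing."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return existing + chunk[size:]
    return existing + chunk  # no overlap found


# Hypothetical chunks whose boundaries repeat some text.
chunks = ["The capital", " capital of", " of France is", " Paris."]

merged = ""
for chunk in chunks:
    merged = merge_with_overlap(merged, chunk)

print(merged)
```

A purely textual overlap check like this one can also drop text that is legitimately repeated (for example, `"ha"` followed by `"ha"`), so it is only appropriate when chunk boundaries are known to overlap.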



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am an experienced freelancer in the field of [Your field of expertise]. I am passionate about delivering high-quality work that meets your needs and provides value to your clients. I am organized, efficient, and able to manage multiple projects simultaneously. My team includes a high-level of technical knowledge and problem-solving skills. I am always committed to delivering projects on time and within budget. My goal is to help you achieve your goals, whether it's through creative, technical, or professional services. What can you tell me about yourself? Hi! I'm a freelance graphic designer with over [number of years] years of experience.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the capital city of France and is located on the banks of the Seine River in the center of t
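
An asynchronous interface also lets independent coroutines issue requests concurrently. The sketch below shows that pattern with `asyncio.gather`. To stay runnable without a GPU it uses `StubEngine`, a hypothetical stand-in that is not SGLang's `Engine`; only the shape of the call (`async_generate` returning a dict with a `"text"` key) follows the cell above, and the single-prompt call and `run_concurrently` helper are assumptions.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; NOT SGLang's Engine."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def run_concurrently(engine, prompts):
    # Create one coroutine per prompt and await them together.
    # asyncio.gather returns results in the order of its arguments.
    tasks = [engine.async_generate(p) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

results = asyncio.run(run_concurrently(StubEngine(), prompts))
for prompt, result in zip(prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {result['text']}")
```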

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am a [Role] with [Job Title], and I've always been passionate about [Your Hobby or Interests]. I'm [Age] years old, and I'm confident that I can [Achieve a Goal or Win a Prize]. I'm always striving to [Motivate Others], and I believe that everyone has the potential to achieve something great.

Let's explore a way to incorporate [One of Your Interests or Skills] into your daily life, and see how far we can go together. [Your Name], [Your Job Title], here's how we can work together to make this a



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a bustling city with a rich history, renowned for its artistic and cultural attractions, and a vibrant nightlife scene. The city is also home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a popular tourist destination for its beautiful architecture, delicious food, and vibrant culture. The city has a diverse population of over 2 million residents, including a growing immigrant community. Paris is a globally known city with a strong economy, advanced education system, and a thriving media industry. The city is also home to a number of important museums, including the Musée



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by increasing integration with the physical world, with more and more tasks performed by machines as humans take on a more hands-off approach to everyday tasks.

AI will become more capable of understanding and predicting natural language, as well as recognizing and understanding human emotions, all of which will require more sophisticated algorithms and machine learning. This will likely lead to a more natural and interactive user experience, as machines will be able to better understand and empathize with users.

AI will also become more capable of handling more complex tasks, such as playing chess, playing piano, or writing poetry, all of which will require deeper understanding of the human mind
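
The streaming loop above consumes chunks with `async for` and prints them as they arrive. As a minimal sketch of the same consumption pattern that runs without a model, the example below substitutes a hypothetical async generator, `fake_token_stream`, for the real stream and also collects the chunks into one string. Neither function is part of SGLang.

```python
import asyncio
from typing import AsyncIterator


async def fake_token_stream(text: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streamed response: yields one word at a time."""
    for token in text.split(" "):
        await asyncio.sleep(0)  # yield control, as a real stream would while waiting
        yield token + " "


async def collect(stream: AsyncIterator[str]) -> str:
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)  # show each chunk as soon as it arrives
        parts.append(chunk)
    print()  # new line once the stream is exhausted
    return "".join(parts)


full_text = asyncio.run(collect(fake_token_stream("Paris is the capital of France.")))
```

Collecting the chunks in a list and joining once at the end avoids repeated string concatenation when the full text is also needed after streaming.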




```python
llm.shutdown()
```