# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following first:
```python
import nest_asyncio

nest_asyncio.apply()

```
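To see why this matters, the following stdlib-only sketch reproduces the underlying problem: calling `asyncio.run()` while an event loop is already running raises a `RuntimeError`. It uses neither SGLang nor `nest_asyncio`; the coroutine names `inner` and `outer` are purely illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, much like code in an IPython cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

The asynchronous examples in this document call `asyncio.run(main())`, so they hit this situation whenever the surrounding environment already owns a loop.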

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  John and I'm a Senior Software Engineer working on the React Native project. I'm interested in learning more about JavaScript and React Native. Can you provide me with a brief overview of how to start learning these technologies? Sure! Here's a brief overview of how to start learning JavaScript and React Native:

  * JavaScript is a popular programming language that's used to create the user interface on websites and mobile apps. It's widely used in many areas of the tech industry, including web development, mobile app development, and backend development.
  * React Native is a JavaScript framework that allows developers to build native applications on the Android and iOS platforms.
Prompt: The president of the United States is
Generated text:  a member of the ____.
A. National People's Congress
B. Central Military Commission
C. Supreme Court
D. State Council
Answer:
A

3. There is a solid object of mass m, released from rest and falling freel
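
The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough illustration of what these two settings conventionally mean (temperature scaling followed by nucleus truncation), here is a pure-Python sketch. The function name `top_p_filter` and the toy logits are hypothetical, and this is the textbook formulation rather than SGLang's actual sampler.

```python
import math


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token_id: probability} after temperature scaling and top-p truncation."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


filtered = top_p_filter([2.0, 1.0, 0.1, -3.0], temperature=0.8, top_p=0.95)
print(sorted(filtered))
```

With these toy logits the least likely token falls outside the nucleus, so a sampler would never pick it.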

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your character or personality]. I enjoy [insert a short description of your hobbies or interests]. I'm always looking for new experiences and learning opportunities. What's your favorite hobby or activity? I love [insert a short description of your favorite hobby or activity]. I'm always looking for new ways to challenge myself and expand my knowledge. What's your favorite book or movie? I love [insert a short description of your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville Flottante" (floating city). It is the largest city in Europe and the third largest city in the world by population. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. The city is also famous for its cuisine, fashion, and art scene. Paris is a cultural and historical center that attracts millions of visitors each year. It is a major hub for business, politics, and entertainment in Europe. The city is also known for its annual "Mardi Gras" celebrations, which are a major part of its cultural

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to be integrated with human intelligence in a more seamless way. This could involve the use of AI to assist with decision-making, problem-solving, and decision-making in human decision-making processes.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations. This could involve the development of AI that is designed to be transparent, accountable, and responsible.

3. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, reduce
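
The banner printed by this cell mentions overlap removal. One way such merging could work is sketched below, assuming that a streamed chunk may repeat the tail of the text received so far. `trim_overlap` and `merge_chunks` are hypothetical helpers written for illustration; they are not the implementation of `stream_and_merge`.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that duplicates the tail of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while removing repeated text at each boundary."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged_text = merge_chunks(["The capital", "capital of France", " is Paris"])
print(merged_text)
```

A purely textual heuristic like this cannot tell a genuine repetition in the model output from a transport-level overlap, so treat it only as a sketch of the idea.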



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [Job Title] with over [Number of Years] years of experience in [Field or Industry]. I bring a wealth of knowledge, skills, and a strong work ethic to every project, and I am always eager to learn and grow. I am a team player, and I thrive in a fast-paced, collaborative environment. I am [Type of Character] and [Responsibilities]. I am also a [Other Character]. I enjoy [Fun Fact] and [Personal Statement]. Thank you for asking. Let me know if you have any other questions or if I can add more details about my character.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic Eiffel Tower and romantic opulence. It is a bustling metropolis with many historical sites, including the Louvre Museum, Notre-Dame Cathedral, and the Champs-Élysées. It's a city with a rich hi
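
The practical benefit of the asynchronous variant is that the event loop can make progress on other coroutines while generation is pending. The sketch below shows that pattern with a stub. `FakeEngine` is a hypothetical stand-in that only mimics the `async_generate` call and the `{"text": ...}` output shape used above, so the code runs without a model or GPU.

```python
import asyncio


class FakeEngine:
    """Hypothetical stub: mimics the call shape of async_generate, nothing more."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.05)  # pretend the model is busy
        return [{"text": f" (stub completion for {p!r})"} for p in prompts]


async def heartbeat(ticks, interval=0.01):
    # Unrelated work that keeps running while generation is awaited.
    for _ in range(3):
        await asyncio.sleep(interval)
        ticks.append("tick")


async def demo():
    ticks = []
    engine = FakeEngine()
    outputs, _ = await asyncio.gather(
        engine.async_generate(["Hello, my name is"], {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(demo())
print(outputs[0]["text"], ticks)
```

With the real engine, `heartbeat` could be replaced by request handling, logging, or any other coroutine that should not block on generation.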

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], I am a [position] for [Company]. I am passionate about [job-related hobby or interest]. What excites you about your role at [Company]?

[Company Name], thank you for the opportunity to introduce myself. My name is [Your Name], and I am [Your Position] for [Company]. I am passionate about [job-related hobby or interest] and find it incredibly rewarding to work for a company that not only allows me to make a positive impact in my field but also provides opportunities to grow and learn. I am eager to continue pushing the boundaries of what I can do and what [Company Name

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city that is known as "the city of a thousand gardens" and is famous for its iconic landmarks such as Notre-Dame Cathedral, Eiffel Tower, and the Louvre Museum.

For more detailed information, you can visit the official website of Paris at [Paris's official website](https://www.paris.fr/) or search for "French capital city facts" online. The city's well-preserved medieval architecture, enchanting museums, and rich cultural heritage make it a popular tourist destination.

Remember that Paris is a complex and rapidly changing city, and staying up-to-date with local events, news, and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain, but here are some possible trends that are likely to be important:

1. More autonomous and intelligent machines: AI is expected to become even more advanced and capable of performing a wide range of tasks. Autonomous vehicles, robots, and AI-powered healthcare technologies are all examples of machines that are expected to become more intelligent and autonomous in the future.

2. Greater emphasis on ethical AI: As more and more AI is used in everyday life, there will be a growing focus on ensuring that AI is used ethically and responsibly. This could involve creating more transparent and accountable systems for AI, as well as involving more people in the design and

In [6]:
llm.shutdown()
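
In a standalone script, an exception raised during generation would skip a trailing `shutdown()` call. A `try`/`finally` block is one way to guard against that. The sketch below uses a hypothetical `FakeEngine` stub that only mimics `generate()` and `shutdown()`; the helper `generate_then_shutdown` is likewise illustrative and not part of SGLang.

```python
class FakeEngine:
    """Hypothetical stub: records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": f" (stub completion for {p!r})"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


def generate_then_shutdown(engine, prompts, sampling_params=None):
    # The finally clause runs whether generate() returns or raises.
    try:
        return engine.generate(prompts, sampling_params)
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    generate_then_shutdown(engine, [])  # an empty batch makes the stub raise
except ValueError:
    pass
print(engine.is_shut_down)
```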