# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
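
As a rough illustration of what a custom server on top of the engine might look like, the sketch below wraps an engine-like object in a standard-library HTTP handler. It is not the linked `custom_server.py`: `EchoEngine`, `handle_generate`, `make_handler`, `serve`, and the JSON request shape are all hypothetical. The only engine behavior assumed is the one used in the examples in this document, namely that `generate` takes a list of prompts and returns a list of dicts with a `text` key.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" (echo) {p}"} for p in prompts]


def handle_generate(engine, body: bytes):
    """Turn a JSON request body into (status_code, JSON response bytes)."""
    try:
        request = json.loads(body)
        prompt = request["prompt"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()
    params = request.get("sampling_params", {})
    output = engine.generate([prompt], params)[0]
    return 200, json.dumps({"text": output["text"]}).encode()


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, payload = handle_generate(engine, self.rfile.read(length))
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return GenerateHandler


def serve(engine, port: int = 8000):
    """Blocking helper: serve on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), make_handler(engine)).serve_forever()


# Exercise the request logic directly, without opening a socket.
status, payload = handle_generate(EchoEngine(), b'{"prompt": "Hello"}')
print(status, payload.decode())
```

Keeping the request logic in a plain function (`handle_generate`) makes it testable without a running server; with a real engine, `serve(llm)` would be the blocking entry point.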



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
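
The reason for this patch is that a notebook kernel already runs an event loop, and stock `asyncio` refuses to start another one inside it. The sketch below reproduces that failure with the standard library only; `inner` and `outer` are illustrative names, and `outer` stands in for code running inside a notebook cell.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    coro = inner()
    try:
        # A loop is already running here, so this call is rejected.
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that this kind of nested call is allowed.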

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Michele. I am a 14-year-old girl who loves playing sports and I enjoy listening to my favorite music. I also like to cook and enjoy trying new recipes. However, I am not very good at math and I am struggling with it. 
Given a task, I need to provide an answer based on the context provided. The answer should be a complete sentence that incorporates all the information given.
Teacher: Choose the correct sentiment for this passage.
Context: In the past, children have been taught to perform certain tasks like eating vegetables, brushing teeth, and using the bathroom. However, today, it is better to focus on
Prompt: The president of the United States is
Generated text:  visiting a small village in need of food and water. The village has a total population of 100 people. If the president wants to give each person 2 gallons of water and 1 gallon of food per day, how many gallons of water and food does he need to distribute to the village in a month, 
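
For larger workloads it can help to submit prompts in fixed-size chunks and keep each result next to its prompt, for example to write intermediate results to disk between chunks. The following is a minimal sketch under stated assumptions: `generate_in_chunks`, `chunk_size`, and `StubEngine` are hypothetical names, and the only engine behavior relied on is the one shown above (a list of prompts in, a list of dicts with a `text` key out, in matching order).

```python
from typing import Dict, Iterator, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; returns canned text."""

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        return [{"text": f" <completion for {len(p)} chars>"} for p in prompts]


def generate_in_chunks(engine, prompts, sampling_params, chunk_size=2) -> Iterator[Dict]:
    """Yield {"prompt", "text"} records, calling the engine once per chunk."""
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            yield {"prompt": prompt, "text": output["text"]}


records = list(
    generate_in_chunks(
        StubEngine(),
        ["Hello, my name is", "The capital of France is", "The future of AI is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
for record in records:
    print(record)
```

Whether chunking is worthwhile depends on the workload; the engine schedules a single large batch on its own, so this pattern is mainly about bookkeeping on the caller's side.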

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [job title] at [company name]. I have been working at [company name] for [number of years] years. I have always been passionate about [job title] and have always wanted to be a [job title] myself. I am always looking for new challenges and opportunities to grow and learn. I am a [job title] and I am always looking for ways to improve my skills and knowledge. I am excited to be a part of [company name] and contribute to their success. Thank you for asking! [Name] [Company Name] [Job Title] [Company Address

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, a historic city with a rich history and diverse culture. It is located in the south of France and is the largest city in the country. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. The city is also famous for its fashion industry, art, and cuisine. Paris is a major tourist destination and is home to many world-renowned museums, theaters, and restaurants. It is a cultural and economic hub of France and a major international city. Paris is also known for its romantic

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI is already being used to automate a wide range of tasks, from manufacturing to customer service. As AI technology continues to improve, we can expect automation to become even more prevalent, with machines taking on more complex and repetitive tasks.

2. Enhanced human-computer interaction: AI is likely to become more integrated into our daily lives, with machines becoming more capable of understanding and responding to human emotions and needs. This could lead to more natural and intuitive interactions between humans and machines.

3. AI ethics and privacy concerns: As AI becomes more advanced, there will be increasing concerns
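
"Overlap removal" here refers to merging streamed chunks so that text repeated at a chunk boundary is not emitted twice. The sketch below shows one plausible way to do that. It is an illustration written for this page and not necessarily how `stream_and_merge` is implemented; `strip_overlap` and `merge_chunks` are names chosen for the example.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is already a suffix of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text repeated at each boundary."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged_text = merge_chunks([" Paris, also", " also known as", " as the City of Light"])
print(merged_text)
```

A caveat of this heuristic: text that is genuinely repeated at a boundary (for example `"ha"` followed by `"ha"`) would be collapsed as well.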



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name], and I'm a [insert character's profession] who has always been fascinated by the mysteries of the world. I enjoy reading books and attending book clubs to gain new insights into different cultures and histories. My love for learning has driven me to pursue a career in education and I'm always eager to share my knowledge with others.

I believe in the power of storytelling and use it to craft engaging and informative stories that inspire and entertain. I am always looking for new challenges and opportunities to learn and grow as a person and a professional. What kind of work are you currently doing? I'm currently a high school teacher with a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is renowned for its classical architecture, rich cultural heritage, and vibrant entertainment sc
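
Asynchronous generation is most useful when the calling code has other work to overlap or issues several requests at once. The sketch below fans out independent requests with `asyncio.gather` and caps how many are in flight. `StubAsyncEngine`, `bounded_generate`, and `max_in_flight` are hypothetical, and the stub only mimics the `async_generate` call shape used above.

```python
import asyncio
from typing import Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine with an awaitable async_generate."""

    async def async_generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        await asyncio.sleep(0.01)  # pretend to do work
        return [{"text": f" reply to: {p}"} for p in prompts]


async def bounded_generate(engine, batches, sampling_params, max_in_flight=2):
    """Run one async_generate call per batch, at most max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(batch):
        async with semaphore:
            return await engine.async_generate(batch, sampling_params)

    # gather returns results in the same order as `batches`
    return await asyncio.gather(*(run_one(batch) for batch in batches))


batches = [["Hello, my name is"], ["The capital of France is"], ["The future of AI is"]]
results = asyncio.run(bounded_generate(StubAsyncEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    print(batch[0], "->", outputs[0]["text"])
```

Whether concurrent calls help with a real engine depends on its scheduler, so treat this as a calling pattern and not as a tuning recommendation.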

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am an experienced software developer with a passion for creating innovative solutions. I have a deep understanding of programming languages, algorithms, and design patterns, and I am always seeking to improve my skills and knowledge to stay ahead of the curve. I am a creative problem solver with a keen eye for detail and an insatiable curiosity about how software can solve real-world problems. I am a strong communicator and a good listener, always striving to understand the needs and goals of my clients or colleagues. I am also a strong team player who thrives in a fast-paced environment and can work well in a team of colleagues. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known as the "City of Light" and a UNESCO World Heritage site. It is a cosmopolitan metropolis with a rich cultural history and a world-renowned art museum, the Louvre, and a thriving food culture. Paris is also a major tourist destination, attracting millions of visitors every year. With its towering Eiffel Tower and charming bistros, it is a popular destination for French tourists and locals alike. Its history and culture make it a fascinating destination for those interested in French history and art.

**Note:** This statement is factually accurate and includes important historical, cultural, and architectural details of Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and diverse, and it is expected to continue to evolve in numerous ways. Here are some possible future trends in artificial intelligence:

1. Increased Personalization: As AI becomes more integrated into our daily lives, it is expected to become even more personal. We will be able to tailor our experiences to specific individuals, such as their preferences, interests, and behaviors, to provide them with the most relevant and personalized information.

2. Augmented Reality: AI will continue to advance in augmented reality, where virtual objects or experiences can be enhanced and customized to match the user's surroundings. This will be used to enhance various aspects of our lives,
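
When consuming a stream, it is often convenient to print chunks as they arrive while also keeping the full text. The sketch below does that for any async iterator of string chunks, which is the shape the streaming loop above relies on. `fake_stream` and `print_and_collect` are hypothetical names; `fake_stream` stands in for `async_stream_and_merge(llm, prompt, sampling_params)` so the example runs without an engine.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a real chunk stream."""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control between chunks
        yield chunk


async def print_and_collect(stream: AsyncIterator[str]) -> str:
    """Print each chunk without a trailing newline and return the joined text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


text = asyncio.run(print_and_collect(fake_stream([" Paris", ",", " the", " city"])))
```

Passing `end=""` to `print` matters here: without it, every chunk would land on its own line.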




```python
llm.shutdown()
```
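
Since the engine is released only when `shutdown()` is called, it can be worth guaranteeing the call even when generation raises. The sketch below wraps that in a context manager; `running_engine` and `StubEngine` are hypothetical names, and the stub exists only so the example runs without a model.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params=None):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def running_engine(factory):
    """Create an engine via factory() and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with running_engine(StubEngine) as engine:
    outputs = engine.generate(["The capital of France is"])

print(engine.closed)
```

With the real engine, the factory could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`.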