# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
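
The underlying restriction comes from `asyncio` itself: `asyncio.run` refuses to start while another event loop is already running in the same thread. The following standard-library sketch reproduces that failure without SGLang; `inner` and `outer` are illustrative names, not part of any API.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, similar to a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second loop here is not allowed
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that this kind of nested call is permitted.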

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Leah. I'm 12 years old, and I'm from the United States. My favorite color is blue. I want to travel to Japan next month. I like Japan because it's warm and it's a peaceful place. I want to visit the Tokyo Tower and the Yoyogi Park. I want to learn some Japanese and make friends with the people there. What do you think of Japan? I like it a lot and I want to go there when it's time.
Answer the following questions based on the information given above:
(1) How old is Leah?
(2) What is Leah's favorite color?
(
Prompt: The president of the United States is
Generated text:  very busy every day. He has to go to the office to work, and he also has to go to the White House, the Capitol, and the White House. He spends about 50 minutes driving to the White House and the Capitol, and it takes about 40 minutes to drive to the White House. What is the total amount of time the president spends driving to work each day? To determine the total amount of time t
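
For downstream processing it is often convenient to keep each prompt next to its completion. The sketch below assumes only what the cell above shows: one output dict per prompt, in input order, with the generated string under the `text` key. `pair_results` and the stand-in data are hypothetical and not part of SGLang.

```python
from typing import Dict, List


def pair_results(prompts: List[str], outputs: List[Dict]) -> List[Dict[str, str]]:
    """Pair each prompt with its generated text, preserving input order."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


# Stand-in data shaped like the outputs above; a real run would pass the
# list returned by llm.generate(prompts, sampling_params) instead.
fake_outputs = [{"text": " Paris."}, {"text": " uncertain."}]
records = pair_results(
    ["The capital of France is", "The future of AI is"], fake_outputs
)
print(records[0])
```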

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville de Paris" and "La Ville de la Rose". It is the largest city in France and the second-largest city in the European Union, with a population of over 10 million people. Paris is known for its rich history, art, and culture, and is a popular tourist destination. It is also home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a vibrant and dynamic city with a rich cultural scene and a strong sense of French identity. It is a major hub for business, politics, and entertainment in France

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some potential trends that could be expected in the future of AI:

1. Increased automation and robotics: As AI technology continues to advance, we can expect to see more automation and robotics in our daily lives. This could include things like self-driving cars, robots in manufacturing, and even more advanced forms of AI that can perform tasks that were previously done by humans.

2. Improved privacy and security: As AI technology becomes more advanced, we can expect to see more privacy and security concerns. This could include
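
The "overlap removal" mentioned in the cell above can be pictured as dropping the part of each new chunk that repeats the tail of the text collected so far. The following is an illustrative sketch of that idea only; it is not taken from `stream_and_merge` and may differ from how that helper is implemented. `drop_repeated_prefix` and `merge_chunks` are hypothetical names.

```python
def drop_repeated_prefix(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks while removing overlapping text between neighbors."""
    merged = ""
    for chunk in chunks:
        merged += drop_repeated_prefix(merged, chunk)
    return merged


merged = merge_chunks([" Paris is", " is the capital", " capital of France."])
print(merged)
```

A purely textual rule like this cannot tell an accidental overlap from a genuine repetition in the model output, so it suits streams whose chunks are known to overlap.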



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Type] person. What can you tell me about yourself? My hobbies and interests range from [list hobbies or interests here]. I enjoy [mention a hobby or interest]. What brings you here today? I'm here to [mention a reason for being here]. And what do you look forward to most in your upcoming project or project? I'm really excited to [mention an upcoming project or project you're looking forward to].
Your self-introduction is clear, informative, and concise. Please provide me with a more detailed self-introduction, including a personal anecdote or a quote that exemplifies your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city that serves as the nation’s political and cultural center and has been a UNESCO World Heritage Site since 1994. It was founded by French colonists in the 11th
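
Because `async_generate` is awaited, several calls can be scheduled together from one coroutine. The sketch below shows the general `asyncio.gather` pattern with a hypothetical `FakeEngine` that mimics only the call shape used above; whether the real engine benefits from concurrent calls is not covered by this document.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in that mimics only the awaitable call used above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Schedule every batch at once and wait until all of them finish.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(FakeEngine(), batches, {"temperature": 0.8}))
print([len(batch_result) for batch_result in results])
```

`asyncio.gather` returns results in the order the awaitables were passed, so `results[i]` corresponds to `batches[i]`.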

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a [occupation]. I have always been passionate about [career interest or hobby]. I am determined to [short, positive statement that reflects your character] and always strive to be the best version of myself. I am always ready to learn and improve myself to reach my goals. I am a good listener and always try to understand others' perspectives. I am organized and always make a list of tasks to complete daily. I am always looking for new experiences and learning opportunities. I am dedicated to my job and always make time for my family and friends. I am a true believer in hard work and hard-earned success



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known for its iconic landmarks such as the Eiffel Tower, the Notre-Dame Cathedral, and the Louvre Museum, and is a major center of French culture and politics. The city also has a rich history dating back over 1000 years, and is a UNESCO World Heritage site. Paris is also home to numerous cultural and scientific institutions, and is a major tourist destination for tourists from all over the world. The city's French language is widely spoken, and French is the official language of France. Paris is also home to many influential and prestigious French institutions, such as the Institut d'Histo



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by several trends that are likely to continue over the next few decades. Some potential trends include:

1. Increased focus on ethical AI: With the increasing number of ethical concerns around AI, there is likely to be a greater emphasis on ethical guidelines and standards for AI development and deployment.

2. Greater use of AI in other industries: AI is already being used in a variety of industries, from healthcare to transportation, but there is likely to be an even greater focus on its use in other sectors in the coming years.

3. Greater reliance on AI for automation: As automation becomes more prevalent in various industries, there is likely
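
The cell above prints each chunk as it arrives. When the complete text is also needed afterwards, the chunks can be accumulated while iterating. The sketch below assumes the stream yields plain string chunks, as the loop above suggests; `fake_stream` is a hypothetical stand-in for the real stream so that the example runs without a model.

```python
import asyncio


async def fake_stream(text, size=4):
    """Hypothetical stand-in for a chunk stream: yields `text` in small pieces."""
    for start in range(0, len(text), size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start:start + size]


async def collect(stream):
    """Consume an async stream of string chunks and return the joined text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream(" Paris is the capital of France.")))
print(full_text)
```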




In [6]:
llm.shutdown()
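
If generation code raises before `shutdown()` is reached, the engine may be left running. One way to guard against that is a small context manager that always calls `shutdown()` on exit. This is a sketch under the assumption that calling `shutdown()` once at the end is sufficient, as in the cell above; `engine_session` and `FakeEngine` are hypothetical names used only for illustration.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine):
    """Yield the engine and call its shutdown() even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in that only records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


fake = FakeEngine()
try:
    with engine_session(fake) as llm:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass
print(fake.closed)
```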