# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following first:
```python
import nest_asyncio

nest_asyncio.apply()

```
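The patch is needed because standard `asyncio` refuses to start a second event loop from inside one that is already running, which is presumably what happens when the engine is driven from a notebook kernel. The sketch below reproduces that refusal with the standard library only (no SGLang involved); `inner` and `outer` are hypothetical names used for illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, just like notebook code.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

After `nest_asyncio.apply()`, such nested calls are permitted, which is why the snippet above has to run before the engine is used in those environments.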

## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Diego, I'm 14 years old. My favourite subject is English and I really like to read books. How are you, is your favourite subject any different from mine? I don't have a favorite subject. My favorite subject is science and I really enjoy doing experiments. What's your favorite book? Oh, I have a really nice book that I'm really excited to read. I can't wait to see what it's about! 

That's interesting! Can you tell me more about your book? I'm curious about what kind of books you read and what you like in them. I'm looking forward to reading a book
Prompt: The president of the United States is
Generated text:  a public official who serves as the leader of the executive branch of the government of the United States. The office of the president is a very important office; the president of the United States serves in office for a term of four years, after which time he or she must run for re-election. The president also has the power to veto legis
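For offline batch jobs, the results usually need to be persisted rather than printed. The sketch below writes prompt/completion pairs to a JSON Lines file. It relies only on the `output['text']` field used above; `write_results`, the record keys, and the demo data are hypothetical stand-ins, not part of SGLang.

```python
import json
import tempfile
from pathlib import Path


def write_results(prompts, outputs, path):
    """Write one JSON object per prompt/output pair to `path` (JSON Lines)."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "completion": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
demo_prompts = ["The capital of France is", "Hello, my name is"]
demo_outputs = [{"text": " Paris."}, {"text": " Diego."}]

out_path = Path(tempfile.mkdtemp()) / "results.jsonl"
write_results(demo_prompts, demo_outputs, out_path)

records = [
    json.loads(line)
    for line in out_path.read_text(encoding="utf-8").splitlines()
]
print(records[0])
```

With a real engine, `demo_prompts` and `demo_outputs` would be replaced by `prompts` and the list returned by `llm.generate`.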

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville de Paris" or "La Ville de la Rose" and is the largest city in Europe by population. It is located on the Seine River and is the seat of government, administration, and culture for the French Republic. Paris is known for its rich history, art, and cuisine, and is a major tourist destination. The city is also home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a vibrant and dynamic city with a rich cultural and artistic heritage. The city is also home to many important institutions such as the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI systems will become more integrated with human intelligence, allowing them to learn from and adapt to human behavior and decision-making processes.

2. Enhanced natural language processing: AI will continue to improve its ability to understand and interpret human language, allowing for more natural and intuitive interactions with machines.

3. Improved predictive analytics: AI will become more capable of predicting future events and trends, enabling organizations to make more informed decisions and take proactive measures.

4. Increased use of AI in healthcare: AI will be used to improve patient care, reduce costs
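The "overlap removal" mentioned in the streaming example refers to merging chunks whose beginning may repeat the end of the text received so far. The sketch below illustrates one way to do this: drop the longest prefix of the new chunk that matches a suffix of the accumulated text. `strip_overlap` and `merge_chunks` are hypothetical names, and this is not necessarily how `stream_and_merge` is implemented.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` without the prefix that duplicates the end of `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk  # no overlap found


def merge_chunks(chunks):
    """Concatenate chunks, removing overlaps between neighbours."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


demo_chunks = ["The capital", " capital of France", " is Paris."]
merged_text = merge_chunks(demo_chunks)
print(merged_text)  # The capital of France is Paris.
```

One caveat of this approach: text that legitimately repeats across a chunk boundary (for example `"ha"` followed by `"ha"`) is indistinguishable from an overlap and would be collapsed.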



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I'm [Your Age] years old. I'm an [career] at heart who enjoy [occupation]. I have a passion for [your hobby or sport]. I'm always looking for ways to [something related to your career or hobby]. I have a knack for [something related to your hobby or sport]. I'm a [level of experience in your chosen field]. I'm looking to [state of mind or personal traits] about myself. How can I best describe myself to someone new? [Your Answer Here] I'm [Your Name] but you can call me [Your Old Name], [Your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

Rationale: The statement provided is accurate and complete, containing only facts about the capital city of France. It does not contain any speculative or hypothetical information. Therefore, the statement can be categorized as a factual statement. 
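A benefit of the asynchronous interface is that several requests can be awaited together with other coroutines. The sketch below submits two batches concurrently with `asyncio.gather`. `StandInEngine` is a hypothetical stub that mimics only the `async_generate(prompts, sampling_params)` call shape shown above, so the example runs without a GPU; how the real engine behaves under concurrent calls is an assumption you should verify.

```python
import asyncio


class StandInEngine:
    """Hypothetical stub; mimics only the call shape of `llm.async_generate`."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    """Submit every batch at once and wait for all of them to finish."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    # gather returns results in the same order as the submitted tasks.
    return await asyncio.gather(*tasks)


demo_batches = [
    ["Hello, my name is", "The capital of France is"],
    ["The future of AI is"],
]
demo_results = asyncio.run(
    run_batches(StandInEngine(), demo_batches, {"temperature": 0.8, "top_p": 0.95})
)

for batch, outputs in zip(demo_batches, demo_results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```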


### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], and I'm a [age] year old [occupation]. My favorite [activity] is [activity]. I have a lot of friends and love to spend time with them. I'm also good at [something]. I have a love for [something], and I enjoy [something]. How are you? Let me know if you want me to add more details. [name] [quote] A good friend is someone who sticks by you, never gives up, and who you can always count on to support you through your struggles. There are many ways to be a good friend, but one thing is for sure - it

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La République" and "La Roche-Brune", which is located in the northeastern part of France. It is the largest city in France and one of the most populous cities in the world, with an estimated population of over 2 million people. Paris is known for its historical architecture, vibrant culture, and beautiful views of the city and the surrounding countryside. Its many museums, theaters, and parks are also popular tourist destinations. The city is also home to the French Parliament, the French Institute of Paris, and many other important institutions and organizations. Overall, Paris is a cultural and intellectual hub

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain and complex, but here are some possible trends that are currently being explored:

1. Increased emphasis on ethical AI: With the increasing awareness of the ethical implications of AI, there is an increasing emphasis on developing AI that is transparent, accountable, and responsible. This means that AI systems should be designed to minimize bias, transparency, and accountability in their decision-making.

2. Rise of AI-driven autonomous vehicles: As autonomous vehicles become more widespread, they will become an important part of our daily lives. This will require new AI systems that can understand complex driving scenarios, anticipate potential hazards, and make safe, autonomous decisions.

3. Advancements
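When streaming asynchronously, it is often useful to display chunks as they arrive and still keep the complete text afterwards. The sketch below shows that pattern with a generic async iterator. `stand_in_stream` is a hypothetical replacement for a real chunk stream such as the one iterated in the example above, and `collect_stream` is an illustrative helper, not an SGLang function.

```python
import asyncio


async def stand_in_stream(chunks):
    """Hypothetical stand-in for an async stream of text chunks."""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield chunk


async def collect_stream(stream, on_chunk=None):
    """Consume an async chunk stream, optionally reporting each chunk."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. print the chunk immediately
        parts.append(chunk)
    return "".join(parts)


seen = []
full_text = asyncio.run(
    collect_stream(
        stand_in_stream([" Paris", ",", " also", " known"]),
        on_chunk=seen.append,
    )
)
print(repr(full_text))
```

Passing `lambda c: print(c, end="", flush=True)` as `on_chunk` would print each chunk on the same line as it arrives.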




```python
llm.shutdown()
```
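In a script, an exception raised between engine creation and `shutdown()` would skip the cleanup call. One way to guard against that is a small context manager, sketched below. `engine_session` and `StandInEngine` are hypothetical; the stub implements only `generate` and `shutdown` so the example runs without loading a model.

```python
from contextlib import contextmanager


class StandInEngine:
    """Hypothetical stub exposing only the methods this sketch needs."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": ""} for _ in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(make_engine):
    """Create an engine and always call shutdown(), even if the body raises."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


stub = None
try:
    with engine_session(StandInEngine) as engine:
        stub = engine
        raise ValueError("simulated failure mid-run")
except ValueError:
    pass

print(stub.closed)  # True
```

With the real engine, the factory argument could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, reusing the constructor call from the launch cell.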