# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
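The background is that asyncio refuses to start a new event loop from inside one that is already running, which is the situation in environments such as Jupyter. The sketch below reproduces that restriction with the standard library only, without SGLang; `inner` and `outer` are illustrative names, not part of any API.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # Inside this coroutine an event loop is already running,
    # similar to code executed in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are permitted.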

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Claudia. I'm a fifth-year computer science graduate student. I'm passionate about machine learning and have developed several projects using R and Python. I'm also interested in developing a machine learning model for a new project that has just started.
I'm currently working on a project where I'm trying to classify images of cats. However, I'm having trouble with the preprocessing step of the image data.
I have a CSV file with the following structure:

| image_name |
| --- |
| cat1.jpg |
| cat2.jpg |
| cat3.jpg |
| cat4.jpg |
| cat5.jpg |
| cat6.jpg |
| cat
Prompt: The president of the United States is
Generated text:  a very important person. He is like the boss of the whole country. He makes many important decisions every day. He is also very important to his people, because he is like the leader of the whole country. The president of the United States is also like the head of the government. The president has very important jobs, and he i
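The engine schedules batches itself, but for very long prompt lists it can still be convenient to submit work in fixed-size chunks, for example to write results out incrementally. The following is a minimal sketch and not part of SGLang: `generate_in_chunks` and `chunk_size` are hypothetical names, and `fake_generate` is a stand-in so the snippet runs without a model. It assumes only what the example above shows, namely that `generate(prompts, sampling_params)` returns one dict with a `text` key per prompt, in order.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def generate_in_chunks(
    generate: Callable[[List[str], dict], List[Dict]],
    prompts: List[str],
    sampling_params: dict,
    chunk_size: int = 2,
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, calling `generate` once per chunk."""
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            yield prompt, output["text"]


def fake_generate(prompts, sampling_params):
    # Stand-in for `llm.generate`; returns the same shape of result.
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


pairs = list(
    generate_in_chunks(fake_generate, ["a", "b", "c"], {"temperature": 0.8})
)
print(pairs)
```

With a real engine, `llm.generate` would be passed in place of `fake_generate`.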

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in France and the third-largest city in the world by population. It is located on the Seine River and is the seat of government, administration, and culture for the country. Paris is known for its rich history, art, and cuisine, and is a major tourist destination. It is also home to many famous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a vibrant and dynamic city with a rich cultural and artistic heritage, and is a major center for business, finance, and politics in Europe. It is also a major center for science

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human needs.

2. Enhanced ethical considerations: As AI becomes more integrated with human intelligence, there will be increased scrutiny of its ethical implications. This could lead to more stringent regulations and guidelines for AI development and deployment.

3. Greater reliance on AI for decision-making: AI is likely to become more integrated into decision-making processes, allowing machines to make more informed
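
"Overlap removal" here means dropping text that a new chunk repeats from the end of what has already been received. The sketch below illustrates that idea under the assumption that a chunk may begin with a suffix of the accumulated text. `merge_with_overlap` is a hypothetical helper and is not claimed to match the implementation of `stream_and_merge`.

```python
def merge_with_overlap(existing: str, chunk: str) -> str:
    """Append `chunk` to `existing`, skipping the longest prefix of `chunk`
    that is already a suffix of `existing`."""
    max_overlap = min(len(existing), len(chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(chunk[:size]):
            return existing + chunk[size:]
    return existing + chunk


text = ""
for chunk in ["The capital", "capital of", " of France", " is Paris."]:
    text = merge_with_overlap(text, chunk)
print(text)
```

One caveat of this heuristic is that it also removes genuine repetition at a chunk boundary: merging `"ha"` with `"ha"` yields `"ha"`.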



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Job Title/Role] at [Company Name]. I'm a [short, enthusiastic intro] with a passion for [insert an interest related to your field or hobby]. I enjoy [mention what you enjoy most about your job or hobby], and I'm always looking to learn something new. I'm a [insert a trait, like resilience or kindness] and I'm always looking to help others. I'm also an [insert a personality trait or skill] and I love [mention your hobbies or passions]. I'm [insert a short, personal statement] about myself. 

[Name]: How

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the largest city in France and the seat of the Government and the City of Paris. It is the country's most populous city, with a population of over 2.3 million people. It is known for its artistic and literary heritage, as well as
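
One reason to prefer an asynchronous API is that other coroutines can make progress while a request is pending. The sketch below illustrates this with `asyncio.gather`. `fake_async_generate` is a stand-in coroutine, introduced only so the example runs without a model; it mimics the call shape of `llm.async_generate(prompts, sampling_params)` used above. Whether concurrent calls behave the same way with a real engine should be checked against the SGLang version in use.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    # Stand-in for `llm.async_generate`: same call shape, no model needed.
    await asyncio.sleep(0.01)
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def main():
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    # Two independent batches are awaited concurrently; gather returns
    # results in the order the awaitables were passed in.
    batch_a, batch_b = await asyncio.gather(
        fake_async_generate(["Hello"], sampling_params),
        fake_async_generate(["Bonjour", "Hola"], sampling_params),
    )
    return batch_a, batch_b


batch_a, batch_b = asyncio.run(main())
print(len(batch_a), len(batch_b))
```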

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First name] and I'm [Last name], a [job title] at [company name]. I'm an enthusiastic and enthusiastic [role] who is always ready to take on new challenges and make a positive impact on the world. I'm a [occupation] who is always looking for new ways to improve myself and contribute to the greater good. I love [occupation] and strive to be a role model for others to follow in their footsteps. What inspired you to become a [occupation]?

My love for [occupation] started when I was a child, and I would spend hours learning everything I could about it, reading books



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

This statement is factual, as it provides the name and official title of the capital city of France. The statement is concise and to the point, providing a clear and unambiguous answer to the question. It also follows standard format for factual statements, which includes the name of the subject, its official title, and a brief explanation of its significance or importance. The statement is not too long or too short, but it accurately conveys the necessary information in a clear and informative manner. Overall, it effectively communicates the key facts about France's capital city.



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain and depends on a wide range of factors, including technological innovation, changes in societal needs, and economic factors. However, here are some possible future trends in AI:

1. Increased use of AI in personal assistants: In the coming years, we can expect to see more advanced AI assistants that can assist with tasks like making phone calls, sending emails, and managing household chores. These assistants will be able to understand natural language and use context to provide helpful responses.

2. Autonomous vehicles: Self-driving cars, trucks, and airplanes will become increasingly common, and AI will be crucial in making them safe and efficient. Autonomous vehicles will be able
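
When the full text is needed in addition to the incremental display, the chunks can be accumulated while they are printed. This sketch uses `fake_stream`, a hypothetical async generator standing in for `async_stream_and_merge(llm, prompt, sampling_params)`, and assumes each yielded chunk is a plain string of new text, as the loop above suggests. `collect` is likewise an illustrative name.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(chunks: List[str]) -> AsyncIterator[str]:
    # Stand-in for an engine stream: yields text pieces one at a time.
    for chunk in chunks:
        await asyncio.sleep(0)
        yield chunk


async def collect(stream: AsyncIterator[str]) -> str:
    """Print each chunk as it arrives and return the joined text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream([" Paris", ".", " It", " is"])))
```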




```python
llm.shutdown()
```
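
If an exception interrupts a script before `llm.shutdown()` runs, the engine may not be cleaned up promptly. One way to guard against that is a `try`/`finally` wrapper, packaged below as a context manager. `engine_session` and `FakeEngine` are hypothetical names used for illustration; the only assumption about the real engine is the `shutdown()` method shown above.

```python
from contextlib import contextmanager


class FakeEngine:
    """Stand-in for `sgl.Engine` so the snippet runs without a model."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(factory):
    """Create an engine via `factory` and always call `shutdown()` on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


try:
    with engine_session(FakeEngine) as engine:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.closed)
```

With a real engine, the factory could be `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`.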