# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. Two general use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
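
As a rough illustration of what "a custom server on top of the engine" can mean, the sketch below shows only the request-handling core: decode a JSON request, call the engine, encode a JSON response. It is not the linked `custom_server` script. `EchoEngine` and `handle_generate_request` are hypothetical names, and the stub engine stands in for `sgl.Engine` so the snippet runs without a GPU.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        # Mimic the shape used in this document: one dict with "text" per prompt.
        return [{"text": f" (completion for: {p})"} for p in prompts]


def handle_generate_request(engine, body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(body)
    prompt = request["prompt"]
    sampling_params = request.get("sampling_params", {})
    # The engine is called with a list of prompts and returns one dict per prompt.
    output = engine.generate([prompt], sampling_params)[0]
    return json.dumps({"text": output["text"]}).encode("utf-8")


response = handle_generate_request(
    EchoEngine(), b'{"prompt": "The capital of France is"}'
)
print(response.decode("utf-8"))
```

Any web framework can wrap a function of this shape; the framework choice and the request schema here are assumptions, not part of SGLang.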



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
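
The likely reason is that a notebook kernel typically already runs an asyncio event loop, and the standard library refuses to start a second one inside it. The stdlib-only sketch below reproduces that refusal without SGLang; `inner` and `outer` are illustrative names, and `outer` plays the role of a notebook cell.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # An event loop is already running here, as it would be inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call into a running loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are permitted.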

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Henry and I'm a 19 year old college student who started taking piano lessons. I've always loved piano and just started taking piano lessons. I was in a piano competition for a year. What are some books that you would suggest for a beginner like me?
Certainly! Piano books are a great way to learn the basics and build confidence in your playing. Here are a few books that might be helpful:

1. "Piano for Musicians" by G. F. Haskins
2. "The Piano Handbook" by Frank D. Hackett
3. "The Piano for Dummies" by Charles Pe
Prompt: The president of the United States is
Generated text:  in the town hall of the state of Florida, where he meets with the governor, who is in the governor's residence in the state capital. The president, who is president of the United States, is in a town hall, and the governor is in a governor's residence. So, the total number of states in Florida and the state capital is?
To determine the total number of states in Florida and 
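
For offline batch jobs it is often handy to keep each prompt together with its completion rather than only printing them. The sketch below assumes the call shape used above (a list of prompts in, one dict with a `"text"` key per prompt out). `StubEngine` and `collect_completions` are hypothetical names; the stub replaces `sgl.Engine` so the snippet runs anywhere.

```python
class StubEngine:
    """Hypothetical stand-in for sgl.Engine; returns one dict per prompt."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


def collect_completions(engine, prompts, sampling_params):
    """Run one batched call and pair each prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


records = collect_completions(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for record in records:
    print(record)
```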

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [job title] at [company name]. I am a [job title] with [number of years] years of experience in [industry]. I have a passion for [reason for interest in the industry]. I am always looking for new challenges and opportunities to grow and learn. I am a [reason for interest in the industry] and I am always eager to learn and improve. I am a [reason for interest in the industry] and I am always eager to learn and improve. I am a [reason for interest in the industry] and I am always eager to learn and improve. I am

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is known for its rich history, including the influence of the French Revolution and the influence of the French Revolution on modern French society. Paris is also home to many famous French artists, writers, and musicians. The city is known for its cuisine, including its famous Parisian cuisine, and its fashion industry. Paris is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical AI: As more people become aware of the potential risks of AI, there is likely to be a greater emphasis on ethical considerations. This could lead to more stringent regulations and guidelines for AI development and deployment.

2. AI will become more integrated with other technologies: As AI becomes more integrated with other technologies, such as machine learning and big data, there is likely to be a greater focus on developing more sophisticated and efficient AI systems.

3. AI will become more accessible:
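
The cell above describes its result as streaming "with overlap removal". One way to picture that idea: if a streamed chunk begins with text that was already emitted, drop the repeated prefix before appending it. The sketch below illustrates this general technique with hypothetical helper names; it is not necessarily how `stream_and_merge` is implemented.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing text repeated at each chunk boundary."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks(["The capital", "capital of", "of France"])
print(merged)
```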



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I'm a [Your Profession] who has been working in [Your Field] for [X years] at [Your Company Name] for [Your Last Position]. I'm currently [Your Job Title] and have been in this industry since [Your Year Started]. I've always been passionate about [Your Passion], and I'm always looking for ways to [Your Goal]. I'm confident that my skills and experience make me a great fit for this role. Thank you for considering me for this position. What would you like me to know more about yourself? As a [Your Profession], I believe in [Your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a bustling metropolis known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Located on the left bank of the Seine river, Paris is the world’s second-largest city and a majo
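
Because `async_generate` is awaitable, several requests can be awaited together with `asyncio.gather`, which returns results in the order the requests were passed. The sketch below shows that pattern against a stub. `StubAsyncEngine` and `generate_concurrently` are hypothetical names, and whether issuing separate concurrent calls helps with a real engine is an assumption to verify for your workload.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine with an awaitable async_generate."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real request would
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def generate_concurrently(engine, prompt_groups, sampling_params):
    """Await several batched requests at once; results keep the input order."""
    tasks = [engine.async_generate(group, sampling_params) for group in prompt_groups]
    return await asyncio.gather(*tasks)


results = asyncio.run(
    generate_concurrently(
        StubAsyncEngine(),
        [["Hello, my name is"], ["The capital of France is"]],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
print(results)
```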

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name]. I'm a [insert profession], and I've been working in the field of [insert field of interest] for [insert number of years] years. I have a passion for [insert personal interest or hobby] that I've been passionate about since I was a child. In my free time, I enjoy [insert hobbies or interests]. I believe that my expertise in [insert field of interest] has allowed me to develop a strong communication and interpersonal skills, which are valuable in any career. I'm always looking for new challenges and opportunities to grow and improve my skills. Thank you for having me! To conclude,



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light. It is a sprawling metropolis of over 2.2 million people, located on the banks of the Seine River. Paris is renowned for its romantic architecture, vibrant culture, and annual celebrations such as the Eiffel Tower ceremony, the World Cup, and the festival of Lights. It is also home to many famous landmarks and museums, including the Louvre Museum, the Musée d'Orsay, and the Musée Rodin. The city has a rich history dating back to ancient Rome, and it continues to be a major cultural center and political hub in Europe. Paris



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain and is a rapidly evolving field. While there are many potential directions in which AI might develop, the following are some of the most likely areas of focus for research and development:

1. More Advanced Language Models: Future AI research will likely focus on creating more advanced language models that can understand and generate human language more accurately and with greater precision. This could include developing models that can understand and generate natural language like speech or text.

2. Deep Learning: Deep learning is an area of AI research that has been gaining significant traction in recent years. It involves training computer algorithms to recognize patterns and make predictions based on large amounts of data. Deep
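
When the streamed text is needed afterwards (for logging or post-processing), the chunks can be accumulated while they are printed. The sketch below assumes only that the stream is an async iterator of strings, as `async_stream_and_merge` is used above. `fake_stream` and `collect_stream` are hypothetical names, and the stub stream replaces the engine so the snippet runs anywhere.

```python
import asyncio
from typing import AsyncIterator


async def fake_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for async_stream_and_merge(llm, prompt, params)."""
    for chunk in [" Paris", ",", " also", " known", " as", " the", " City", " of", " Light", "."]:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield chunk


async def collect_stream(chunks: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the joined text."""
    parts = []
    async for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


text = asyncio.run(collect_stream(fake_stream()))
```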




```python
llm.shutdown()
```
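
In a script, an exception raised during generation would skip a trailing `shutdown()` call. One common way to guard against that is a small context manager. The sketch below assumes only that the engine exposes `shutdown()`, as shown above. `managed_engine` and `StubEngine` are hypothetical names, and the stub replaces `sgl.Engine` so the snippet runs anywhere.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and call shutdown() on exit, even after an exception."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with managed_engine(engine) as llm:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.is_shut_down)
```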