# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
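
To illustrate why this patch is needed, the sketch below uses only the standard library to reproduce the error that appears when `asyncio.run()` is called while an event loop is already running, which is the situation inside IPython or Jupyter. The coroutine names `inner` and `outer` are hypothetical and exist only for this illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, just like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a nested call into the event loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Without a patch, the nested call raises a `RuntimeError`. `nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed.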

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Shoshana. I'm a Junior at the University of Miami. My major is Electrical and Computer Engineering. My academic background is in mechanical engineering, so I have a natural inclination towards engineering and application of that knowledge in the real world. I have a passion for coding and am passionate about sharing that passion with others. I have a passion for Math, Physics, and Computer Science. I am proficient in several programming languages, and I have experience in creating basic and advanced software solutions. My work has been on the frontlines of the internet, from web application development to server-side development, and from hardware development to cloud computing. My work
Prompt: The president of the United States is
Generated text:  represented by a vice president. How many Vice Presidents does the President have? One.
The Vice President is a distinct official position, so the President cannot have one. However, it is possible 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French National Library. Paris is a bustling city with a rich cultural heritage and is a major tourist destination. It is also known for its cuisine, including its famous croissants and its famous French fries. Paris is a city that is constantly evolving and is home to many new and exciting developments. It is a city that is a must-visit for anyone interested in French culture and history. 

Paris is a city that is a must-visit for anyone interested

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more sophisticated and nuanced decision-making. This could lead to more personalized and context-aware AI systems that can better understand and respond to human emotions and behaviors.

2. Enhanced machine learning capabilities: AI is likely to become even more powerful and capable, with the ability to learn from vast amounts of data and adapt to new situations. This could lead to more efficient and effective AI systems that can handle a wider range of tasks and applications.

3. Increased reliance on AI for decision
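
The "overlap removal" mentioned in the banner above can be pictured with a small, self-contained sketch. It assumes that consecutive streamed chunks may repeat the tail of the text received so far. The helper names `drop_overlap` and `merge_stream` are hypothetical, and this is not necessarily how `stream_and_merge` is implemented in SGLang.

```python
def drop_overlap(existing_text: str, new_chunk: str) -> str:
    """Remove the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the largest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks) -> str:
    """Concatenate chunks in order, trimming overlap between neighbours."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


chunks = ["The capital", " capital of France", " France is Paris."]
print(merge_stream(chunks))
```

One limitation of this simple approach is that a chunk which legitimately repeats the preceding text (for example `"ha"` followed by `"ha"`) is also trimmed, so it only fits streams where repetition at chunk boundaries is known to be an artifact.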



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name here], and I'm a [insert character's profession, personality, or what makes you unique here]. I'm a [insert number of years since finishing college here]. And my [insert the most important skill or accomplishment here, such as "writing", "teaching", or "adventuring"] is [insert one or two bullet points here]. I enjoy [insert hobby here, such as "cooking", "reading", or "traveling"]. And I hope to [insert future goal here, such as "becoming a [insert occupation, like "psychologist", "engineer", or "

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a vibrant metropolis with a rich history and a cosmopolitan culture that draws tourists from all over the world. Paris is known for its beautiful museums, palaces, and art galleries, as well as its delicious cuisine, jazz music, a
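
An asynchronous API is mostly useful when several requests are in flight at once, for instance inside a custom server. The sketch below shows that pattern with `asyncio.gather`. To stay runnable without a GPU it uses `FakeEngine`, a hypothetical stand-in that only mimics the `{"text": ...}` result shape seen above; it is not part of SGLang, and `handle_requests` is likewise an illustrative name.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine; not part of SGLang."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # pretend to do some work
        return {"text": f" <completion for: {prompt}>"}


async def handle_requests(engine, prompts):
    # Start one coroutine per prompt and wait for all of them together.
    tasks = [engine.async_generate(p, {"temperature": 0.8}) for p in prompts]
    # gather() returns results in the same order as the submitted tasks.
    return await asyncio.gather(*tasks)


results = asyncio.run(handle_requests(FakeEngine(), ["a", "b", "c"]))
for result in results:
    print(result["text"])
```

With a real engine, the same structure would apply, with the stand-in replaced by the `llm` object created earlier.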

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a/an [Occupation] who has been coding for [Number] years. I'm a professional who has always been passionate about creating and improving systems that are efficient and user-friendly. I have a keen eye for detail and a talent for problem-solving that I use to help others succeed in their coding endeavors.

I have a keen eye for detail and a talent for problem-solving that I use to help others succeed in their coding endeavors. I am a self-starter, working from home and always looking for new challenges to learn from and grow. I'm a team player, always looking out for the best interests

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is a city in the south of France and the largest city in the country. It has a long and rich history dating back to the Roman Empire and its influence on French culture and literature. Today, Paris is known for its world-renowned museums, art galleries, fashion shows, and festivals, as well as its iconic landmarks such as the Eiffel Tower and the Notre-Dame Cathedral. The city is also home to many of the country's major industries, including the aerospace and automotive industries. Paris is also known for its cuisine, with its famous dishes such as croissants, and its unique cultural and historical significance

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain and rapidly evolving, and there are many possible trends that could shape the direction of the technology in the coming years. Here are some possible future trends that are currently being explored:

1. AI ethics and governance: With the rise of AI-powered systems, there will be a growing focus on addressing ethical issues and ensuring that the technology is used responsibly. This includes issues such as bias in algorithms, privacy, and data privacy.

2. Increased automation and artificial general intelligence (AGI): There is a growing expectation that AI systems will continue to become more capable, with the potential to replace human workers in many industries. However, this also

In [6]:
llm.shutdown()