# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example is available as a Python script in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
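
As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in a tiny JSON handler built on Python's standard `http.server`. `EchoEngine`, `handle_generate`, `make_handler`, `serve`, and the `/generate` route are hypothetical names chosen for this sketch, not part of SGLang. In a real deployment you would pass an `sgl.Engine` instance instead of the stub, and the linked `custom_server.py` remains the authoritative example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        # Return a dict with a "text" key, the shape read elsewhere in this document.
        return {"text": f" (echo) {prompt}"}


def handle_generate(engine, body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(body)
    output = engine.generate(request["prompt"], request.get("sampling_params"))
    return json.dumps({"text": output["text"]}).encode("utf-8")


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":  # hypothetical route name
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            payload = handle_generate(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return GenerateHandler


def serve(engine, host="127.0.0.1", port=8000):
    """Block and serve requests; call this from your own entry point."""
    HTTPServer((host, port), make_handler(engine)).serve_forever()


# Exercise the request-handling logic without opening a socket.
response = handle_generate(EchoEngine(), b'{"prompt": "Hello"}')
print(response.decode("utf-8"))
```

Keeping the request logic in a plain function (`handle_generate`) makes it easy to test independently of the transport layer.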



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
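
To see why the patch is needed, the standard-library sketch below reproduces the error that a nested event loop triggers: calling `asyncio.run` while a loop is already running raises `RuntimeError`. It does not use SGLang. `inner` and `outer` are illustrative names, and `outer` plays the role of the loop that IPython or Jupyter already runs.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are inside a running loop here, as in an IPython/Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that such re-entrant calls are allowed.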

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Emma and I'm a graphic designer. I work with a team of people to create engaging designs that are both professional and personal. I specialize in graphic design for websites, branding, advertising, social media, and packaging. I've gained over 8 years of experience in graphic design and am constantly learning. I'm looking for someone to help me with my current project. Can you provide a brief description of the project and what I need from my help? Also, please provide me with a sample of the graphic design I need.
Certainly! Please provide me with the details of the project and the specific requirements you have for the design. This
Prompt: The president of the United States is
Generated text:  from the 23rd president. In what year was he born?
To determine the year the 23rd president was born, we need to establish the sequence of U.S. presidents and their years of birth. Let's list the presidents by their years of birth:

1. Thomas Jefferson
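
The `sampling_params` dictionary above sets `temperature` and `top_p`. As an aid to intuition, here is a minimal pure-Python sketch of how these two knobs are commonly defined: temperature rescales the logits before the softmax, and top-p (nucleus) filtering keeps the smallest set of highest-probability tokens whose cumulative probability reaches `top_p`. The function names and the toy logits are assumptions made for illustration; this is not SGLang's sampler.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a four-token vocabulary
sharp = softmax_with_temperature(toy_logits, temperature=0.2)
soft = softmax_with_temperature(toy_logits, temperature=0.8)
nucleus = top_p_filter(soft, top_p=0.9)
print(sharp, soft, nucleus)
```

With these toy logits, the lower temperature concentrates almost all probability on the first token, and the nucleus filter at `top_p=0.9` drops the least likely token before renormalizing.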

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] at [company name], and I've been working here for [number of years] years. I'm a [job title] at [company name], and I've been working here for [number of years] years. I'm a [job title] at [company name], and I've been working here for [number of years] years. I'm a [job title] at [company name], and I've been working here for [number of years] years. I'm a [job title] at

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is a popular tourist destination and a major center for international business and diplomacy. The city is also known for its rich history, including the influence of the French Revolution and the influence of the French language. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. The city is also home to many famous French artists, writers, and musicians. Paris is a city of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing for more sophisticated and nuanced decision-making. This could lead to a more human-like experience for users.

2. Greater use of AI in healthcare: AI is already being used in healthcare to improve diagnosis, treatment, and patient care. As AI becomes more advanced, it is likely to be used in even more areas, including personalized medicine, drug discovery, and patient monitoring.

3. Greater use of AI in automation: AI is already being used in many industries to automate
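
The helper `stream_and_merge` is described above as performing overlap removal. One plausible way to do this, sketched below under that assumption, is to find the longest suffix of the text accumulated so far that is also a prefix of the incoming chunk and append only the remainder. `merge_chunk` and the sample chunks are hypothetical; consult `sglang.utils` for the actual implementation.

```python
def merge_chunk(accumulated: str, chunk: str) -> str:
    """Append chunk to accumulated, dropping any prefix of chunk that repeats the tail."""
    max_overlap = min(len(accumulated), len(chunk))
    for size in range(max_overlap, 0, -1):
        if accumulated.endswith(chunk[:size]):
            return accumulated + chunk[size:]
    return accumulated + chunk  # no overlap found


# Made-up stream in which each chunk repeats part of the previous text.
chunks = ["The capital", "capital of France", " France is Paris."]
merged = ""
for chunk in chunks:
    merged = merge_chunk(merged, chunk)
print(merged)
```

A caveat of this simple approach is that it cannot tell a genuine repetition in the model output from an overlap introduced by the stream, so it suits streams that are known to resend trailing text.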



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [insert your occupation or profession]. I have been working in the [insert your profession] field for [insert your duration in the field] and I have over [insert your number of years] years of experience. My expertise lies in [insert your expertise area] and I have a passion for [insert something related to your experience or skill set]. I have a [insert your current level of experience] in [insert the field you are currently working in] and I am constantly seeking out opportunities to grow my skillset and improve my overall knowledge in the field. What excites me most is [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city and the most populous city in the European Union. It was founded in 787 AD and is located on the island of Corsica. The city is known for
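
A benefit of an asynchronous API is that several requests can be in flight while other coroutines make progress. The sketch below shows that pattern with `asyncio.gather`. `StubEngine` and `generate_all` are hypothetical: the stub's `async_generate` only sleeps and echoes, so the example runs without a model. With SGLang you would call `llm.async_generate` in its place.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; not part of SGLang."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <reply to: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Create one coroutine per prompt; gather runs them concurrently
    # and returns results in the same order as the inputs.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
results = asyncio.run(generate_all(StubEngine(), prompts, {"temperature": 0.8}))
for prompt, output in zip(prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Because `asyncio.gather` preserves input order, the results can be zipped back to their prompts exactly as in the cell above.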

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Age] year-old aspiring novelist with an unshakeable belief in the power of words to transform lives. I've always been fascinated by literature and I never stop learning new things about the craft. I've been working on my novel for [Number] years now and I'm constantly inspired by the stories of the world's most talented writers. I'm also a big fan of [Genre] writing and I try to find the best ways to incorporate it into my work. I'm always looking for new and exciting challenges to try out and I'm eager to explore new writing styles and try out different genres

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

A concise factual statement about France’s capital city is: The capital city of France is Paris.

This statement accurately reflects the name and location of the capital city of France.

For context, Paris is the largest city in France and the capital of the country. It is located on the north bank of the Seine River and is the seat of the French government and the major cultural and artistic center in France. Paris is also known for its rich history, art, architecture, and cuisine. The city has a population of approximately 2.3 million people.

In summary, the concise factual statement about France’s capital

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid technological advancements, enhanced capabilities, and increasing integration with other technologies. Here are some possible trends that could shape the future of AI:

1. Increased integration with other technologies: The integration of AI with other technologies, such as sensors, machine learning, and blockchain, will likely continue to increase. This integration will enable AI to perform tasks that were previously difficult or impossible to accomplish, such as predicting disease outbreaks, optimizing supply chains, and fraud detection.

2. Enhanced capabilities: AI will continue to improve its ability to perform tasks and solve problems. This includes improvements in natural language processing, computer vision, and autonomous driving
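
The `async for` loop in the cell above consumes an asynchronous generator chunk by chunk. The same control flow can be tried without a model using the sketch below, where `fake_stream` is a hypothetical async generator standing in for `async_stream_and_merge`. It yields pre-split pieces of a fixed string rather than real model output, and `collect` is likewise an illustrative helper.

```python
import asyncio


async def fake_stream(text, delay=0.0):
    """Hypothetical token stream: yield space-delimited pieces one at a time."""
    for i, word in enumerate(text.split(" ")):
        await asyncio.sleep(delay)  # yield control, as a real stream would
        yield word if i == 0 else " " + word


async def collect(stream):
    """Print chunks as they arrive and also return the joined text."""
    pieces = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()
    return "".join(pieces)


full_text = asyncio.run(collect(fake_stream("Paris is the capital of France.")))
```

Printing with `end=""` is what lets the individual chunks read as continuous text on the terminal.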




```python
llm.shutdown()
```
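
The final cell calls `llm.shutdown()` explicitly. If generation can raise before that line is reached, one option is to guarantee the call with a small context manager, as sketched below. `engine_session` and `DummyEngine` are hypothetical names for illustration; whether `sgl.Engine` can itself be used in a `with` statement is not assumed here.

```python
from contextlib import contextmanager


class DummyEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompt, sampling_params=None):
        if not prompt:
            raise ValueError("empty prompt")
        return {"text": " ..."}

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = DummyEngine()
try:
    with engine_session(engine) as session:
        session.generate("")  # raises, but shutdown() still runs
except ValueError:
    pass
print(engine.is_shut_down)
```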