# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop (nested-loop code), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Isabella, and I am a student at the University of California, Berkeley. I am currently majoring in chemistry and have taken several advanced courses in the field. I have a natural inclination towards both theoretical and experimental chemistry, and I enjoy working with computers to understand complex systems. I also enjoy reading books, writing, and practicing my spoken English. What's your favorite hobby? I like to read and write. I also enjoy playing sports like tennis and running. What's your favorite book? I really like George Orwell's "1984." It's a book that has really impacted my life. 
What is the role of
Prompt: The president of the United States is
Generated text:  selling custom stuffed animals to raise money for his administration. In the first week, he sells 200 stuffed animals and in the second week, he sells 250% more stuffed animals than the first week. In the third week, he sells half of the total of the first two weeks. How m
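
The batched call above returns one result per prompt, each exposing the generated text under the `"text"` key. As a minimal sketch of how those results might be paired with their prompts for later processing, the example below uses `EchoEngine`, a hypothetical stand-in that only mimics the output shape shown above so that the code runs without a GPU. `generate_records` is also an illustrative name, not an SGLang function.

```python
from typing import Dict, List


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine; it does not run a model."""

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        # Mimics only the shape used above: one dict with a "text" key per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def generate_records(engine, prompts: List[str], sampling_params: Dict) -> List[Dict]:
    """Run one batched call and pair each prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected exactly one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


records = generate_records(
    EchoEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for record in records:
    print(record["prompt"] + record["completion"])
```

With a real engine, passing `llm` in place of `EchoEngine()` should work the same way, assuming the output shape matches the one printed above.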

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name], and I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name], and I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville-Marie" and "La Ville de Paris". It is the largest city in France and the second-largest city in the European Union. Paris is a cultural and historical center with many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, Louvre Museum, and the Palace of Versailles. It is also known for its cuisine, fashion, and art. Paris is a major tourist destination and a major economic center in France. It is the seat of the French government and the country's political, economic, and cultural capital. The city is also home to many international organizations and institutions

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some possible future trends in AI:

1. Increased automation and artificial intelligence: As AI technology continues to advance, we can expect to see more automation and artificial intelligence in our daily lives. This could include the development of robots and other machines that can perform tasks that are currently done by humans, such as manufacturing, transportation, and healthcare.

2. Improved privacy and security: As AI technology becomes more advanced, we can expect to see more privacy and security concerns. This could include the development of more secure
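
The phrase "overlap removal" refers to dropping text that a new streamed chunk repeats from what has already been received. The sketch below illustrates that idea in isolation. It is not the implementation of `stream_and_merge`; `trim_overlap` and `merge_chunks` are hypothetical helpers, and the example chunks are made up.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each piece repeats the tail of the previous one.
chunks = ["The capital", " capital of France", " France is Paris."]
print(merge_chunks(chunks))  # prints "The capital of France is Paris."
```

A purely textual check like this one cannot distinguish an accidental repeat from a legitimate one (for example, `"ha"` followed by `"ha"`), so it is only suitable when chunks are known to overlap.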



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [Age] year old [Occupation]. I love to [What you enjoy doing]. And I enjoy [What you enjoy doing]. I am a [Favorite Hobby]. And I enjoy [Favorite Hobby]. And I have a love for [Favorite Hobby]. And I love to [Favorite Hobby]. And I am a [Favorite Hobby]. And I enjoy [Favorite Hobby]. And I have a love for [Favorite Hobby]. And I love to [Favorite Hobby]. And I am a [Favorite Hobby]. And I enjoy [Favorite Hobby]. And I have a love for [Favorite Hobby]. And I love to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

A. True
B. False
C. I don't know
The answer is B. False. While Paris is indeed the capital of France, it is not the largest city in the country. Paris has a population of about 2.5 million people and is the third largest city in France by area, after Paris and Nice. The lar
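
Because `async_generate` is awaited, it can be combined with other coroutines. The sketch below shows one possible pattern: awaiting several batched requests at once with `asyncio.gather`. `FakeAsyncEngine` is a hypothetical stand-in that only imitates the call shape used above, and `run_batches` is an illustrative helper. Whether concurrent requests help with a real engine depends on its scheduler and is not verified here.

```python
import asyncio
from typing import Dict, List


class FakeAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine; no model is run."""

    async def async_generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        await asyncio.sleep(0.01)  # simulate time spent generating
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


async def run_batches(engine, batches: List[List[str]], sampling_params: Dict) -> List[List[Dict]]:
    """Await several batched requests concurrently; results follow the order of batches."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(FakeAsyncEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt + output["text"])
```

`asyncio.gather` returns results in the order the awaitables were passed, so each inner list lines up with its batch of prompts.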

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am an [age] year old girl who has always been fascinated by [topic]. I love [topic] because [reason]. I enjoy [topic] because it challenges me to [challenge]. I like [topic] because [reason]. I am excited to [excuse yourself] and talk about [topic] with you. [Name], how are you? I am always happy to chat with you about anything! [Name], what's been keeping you busy lately? [Name], how do you plan on spending your free time? [Name], how can I be of assistance? [Name], what's your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city renowned for its historical significance and architectural beauty.

The city of Paris, located in the north of France, is the capital of France and the seat of government, administration, and culture. It is home to iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, Louvre Museum, and the Notre-Dame Church. Paris is also known for its rich history, including the medieval city of Rome, Renaissance art, and French literature. The city's cuisine, including croissants, giselles, and Béarnese pasta, is also renowned worldwide. Paris is a bustling metropolis with a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to involve continued advancements and breakthroughs in areas such as deep learning, natural language processing, and robotics. These technologies are expected to continue to improve and become more integrated into everyday life, with applications ranging from autonomous vehicles to intelligent personal assistants. Additionally, the development of ethical and responsible AI systems will continue to be a major focus of research and development, with concerns about privacy, bias, and the impact of AI on society and the environment becoming more prominent. As AI continues to evolve and become more ubiquitous in our lives, it is likely that we will see significant changes in how we interact with technology, how we work and collaborate, and

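The streaming loop above prints each chunk as it arrives. If the complete text is also needed afterwards, the chunks can be collected while they are printed. The sketch below shows this with `fake_stream`, a hypothetical async generator that stands in for a stream of already-cleaned chunks, and `collect`, an illustrative helper. Neither is part of SGLang.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(text: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async stream of already-cleaned text chunks."""
    for word in text.split(" "):
        await asyncio.sleep(0)  # yield control, as a real stream would between chunks
        yield word + " "


async def collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and also return the full text."""
    parts: List[str] = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline once the stream is exhausted
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream("Paris is the capital of France.")))
```

Collecting into a list and joining once at the end avoids repeated string concatenation on long generations.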


```python
llm.shutdown()
```