# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
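
To make the "custom server on top of the engine" idea concrete, here is a minimal sketch of the request-handling layer only. It is not the linked `custom_server` example: `StubEngine` and `handle_generate` are hypothetical names introduced for illustration, and the stub only mimics the `async_generate(prompts, sampling_params)` call shape used later in this document so that the snippet runs without a model or GPU.

```python
import asyncio
import json


class StubEngine:
    """Hypothetical stand-in for an engine object; NOT part of SGLang."""

    async def async_generate(self, prompts, sampling_params):
        # Echo each prompt back so the sketch runs without a model.
        return [{"text": f" (echo of: {p})"} for p in prompts]


async def handle_generate(engine, raw_body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body."""
    payload = json.loads(raw_body)
    prompts = payload["prompts"]
    sampling_params = payload.get("sampling_params", {})
    outputs = await engine.async_generate(prompts, sampling_params)
    texts = [out["text"] for out in outputs]
    return json.dumps({"texts": texts}).encode("utf-8")


request_body = json.dumps(
    {
        "prompts": ["The capital of France is"],
        "sampling_params": {"temperature": 0.0},
    }
).encode("utf-8")

response = json.loads(asyncio.run(handle_generate(StubEngine(), request_body)))
print(response["texts"])
```

In a real server, a web framework of your choice would call a handler like this from its route function, with the stub replaced by an actual engine instance.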



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
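
The reason such a patch is needed can be reproduced with the standard library alone. The sketch below involves neither SGLang nor `nest_asyncio`; it only shows that `asyncio.run` refuses to start when an event loop is already running in the same thread, which is the situation inside IPython or Jupyter. The function names are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, just as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempt to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

The printed message is the `RuntimeError` raised by the nested `asyncio.run` call.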

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Pete. I'm a chemist from South Korea. My city is called Seoul. I like to eat "ice cream". I really like it! I have a lot of friends here. We have ice cream after school. Here, I'm very popular because of my ice cream. I like playing basketball with my friends. I play basketball every day. I have a lot of friends in my class. They all like me. I have a best friend. She's from China. She plays tennis. She plays tennis every day. She's so good that everyone likes her. I have a pet dog. He has two big ears,
Prompt: The president of the United States is
Generated text:  trying to decide how many military bases to have. He has 90 bases in the United States and the rest overseas. He has 3 times as many bases in each of these overseas bases than in Hawaii. Hawaii has 3 military bases. How many military bases does the president have in total?

To determine the total number of military bases the president has, we need to follow these steps:

1. Calculat
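
The `temperature` and `top_p` entries in `sampling_params` control how random the generated text is. As a rough illustration of what these two knobs conventionally mean, the following toy sampler applies temperature scaling and then nucleus (top-p) filtering to a small list of logits. It is a sketch written for this document, not SGLang's sampler; `sample_top_p` and `toy_logits` are hypothetical names.

```python
import math
import random


def sample_top_p(logits, temperature, top_p, rng):
    """Sample one token id: temperature scaling, then nucleus (top-p) filtering."""
    # Temperature scaling (temperature must be > 0): lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for idx in ranked:
        kept.append(idx)
        mass += probs[idx]
        if mass >= top_p:
            break

    # Sample among the kept tokens; random.choices normalizes the weights.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.1, -1.0]
samples = [
    sample_top_p(toy_logits, temperature=0.8, top_p=0.95, rng=rng)
    for _ in range(200)
]
print(sorted(set(samples)))
```

With these toy logits, the least likely token falls outside the 0.95 nucleus, so it is never sampled. Lowering `top_p` or `temperature` narrows the choice further.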

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting the annual Eiffel Tower Parcels Festival and hosting the World Cup of football. Paris is a popular tourist destination, with millions of visitors annually. The city is also home to many museums, including the Musée d'Orsay, the Musée Rodin, and the Musée d'Orsay. Paris is known for its rich history, including the Romanesque, Gothic, and Renaissance periods, and its influence on French

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation and efficiency: AI is expected to continue to automate a wide range of tasks, from manufacturing to customer service, and will become more efficient and effective at these tasks.

2. Enhanced human-machine collaboration: AI will continue to improve its ability to understand and interact with humans, leading to more effective collaboration between humans and machines.

3. AI will become more integrated with other technologies: AI will continue to be integrated with other technologies, such as the Internet of Things (IoT), to create more connected and intelligent systems.

4. AI will become more ethical and responsible: As AI
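
The "overlap removal" mentioned in the cell above can be pictured as follows: if a newly received chunk starts with text that the accumulated output already ends with, that shared part is dropped before appending. The snippet is a minimal sketch of this idea under that assumption; it is not the implementation of `stream_and_merge` in `sglang.utils`, and `trim_overlap` and `merge_chunks` are names chosen here for illustration.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` without the longest prefix that `existing` already ends with."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks while removing overlap between neighbours."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = ["The capital", " capital of France", " France is Paris."]
merged = merge_chunks(chunks)
print(merged)
```

The linear scan over overlap sizes keeps the sketch easy to read; for long outputs, a bounded overlap window would be cheaper.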



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Job Title] at [Company Name]. I am currently working in [Position] and I enjoy [Favorite Hobby]. I am known for my [Unique Skill/Ability] and am always looking for opportunities to [Achieve Something]. I am a [Neutral Personality Trait] and I have a positive attitude towards [Positive Attribute]. I am always looking for ways to [Make A Difference], and I am committed to [Ethical Standards]. I am a [Neutral Personality Type] and I am always ready to [Give Up].
Hello, my name is [Name] and I am a [Job Title

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

That's correct! The capital of France is Paris. Let me know if you need any other information.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  bound to be a highly dynamic one. 
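
One benefit of the asynchronous interface is that several independent requests can be awaited concurrently from the same event loop. The sketch below shows the pattern with `asyncio.gather`. To stay runnable without a model, it uses `fake_async_generate`, a hypothetical stand-in that only mimics the call shape of `llm.async_generate(prompts, sampling_params)` seen above; the real engine's scheduling behaviour is not modelled.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    """Hypothetical stand-in for `llm.async_generate`; returns dicts with a 'text' key."""
    await asyncio.sleep(0.01)  # simulate waiting on the engine
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(batches, sampling_params):
    # Launch every batch at once; gather returns results in argument order.
    tasks = [fake_async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(batches, {"temperature": 0.8, "top_p": 0.95}))

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```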

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First Name] and I'm a [Character's Occupation]. I'm passionate about [Character's Profession], and I'm always eager to learn new things. I have a natural talent for problem-solving, and I'm always looking for ways to improve my skills and knowledge. I'm a very flexible and adaptable person, and I'm always willing to try new things. And most importantly, I'm a true friend to anyone who has needed a listen or someone to share their thoughts and experiences. Let's be friends! [Character's Name] feels like a true friend to [Character's Name], and I'm always here for you.



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

How many students did the University of Paris have in 2008? The University of Paris has had students since 1540. It has a student body of about 40, 000 students.

What is the largest football club in France? The club with the most successful run in the French Football League is Ligue 1. It is Toulouse Saint-Antoine.

What is the name of the main airport in Paris? The main airport in Paris is Paris Charles de Gaulle Airport.

How many gates does Paris airport have? Paris airport has 10 gates.

What is the



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain and subject to change, but there are some trends that are likely to shape the development of this technology in the coming years. Here are some possible future trends in AI:

1. Increased Integration: AI is expected to become more integrated with other technologies in the future. For example, AI-powered devices and systems will become more prevalent, and AI will be used to improve the efficiency and accuracy of other technologies.

2. Autonomous Vehicles: AI is expected to play an increasing role in autonomous vehicles, as the technology becomes more advanced and reliable. Autonomous vehicles will be able to drive on the roads, avoid collisions, and make decisions based on
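
The "no repeats" behaviour in the cell above can also be sketched with an async generator. The example assumes a stream that yields the full text produced so far at each step and converts it into non-overlapping increments. Both `fake_cumulative_stream` and `stream_deltas` are hypothetical helpers written for this illustration; they are not SGLang APIs, and the real stream format may differ.

```python
import asyncio


async def fake_cumulative_stream(text, step=5):
    """Hypothetical stream yielding the full text generated so far at each step."""
    for end in range(step, len(text) + step, step):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[:end]


async def stream_deltas(stream):
    """Convert cumulative snapshots into non-overlapping increments."""
    seen = 0
    async for snapshot in stream:
        delta = snapshot[seen:]
        seen = len(snapshot)
        if delta:
            yield delta


async def collect(text):
    return [d async for d in stream_deltas(fake_cumulative_stream(text))]


deltas = asyncio.run(collect("Paris is the capital."))
print(deltas)
```

Joining the increments reproduces the original text exactly once, which is the property a streaming consumer needs when printing chunks as they arrive.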




In [6]:
llm.shutdown()