# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
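
As a rough illustration of the custom-server idea, the core of such a server can be a plain function that turns a JSON request body into a call to the engine's `generate` method. The sketch below is not the linked `custom_server.py`. The request fields (`prompt`, `sampling_params`), the `handle_generate` helper, and the `EchoEngine` stand-in are assumptions made for illustration; in real use you would pass an `sgl.Engine` instead of the stand-in.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        # Mirrors the output shape used in this document:
        # one dict with a "text" key per prompt.
        return [{"text": f" (echo) {p}"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response by calling engine.generate."""
    request = json.loads(body)
    prompt = request["prompt"]  # assumed request field
    sampling_params = request.get("sampling_params", {})  # assumed request field
    outputs = engine.generate([prompt], sampling_params)
    return json.dumps({"text": outputs[0]["text"]}).encode("utf-8")


response = handle_generate(
    EchoEngine(),
    json.dumps({"prompt": "The capital of France is"}).encode("utf-8"),
)
print(response.decode("utf-8"))
```

Keeping the handler independent of any web framework means the same function could be called from whichever server library you prefer.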



## Nest Asyncio
If you use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
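
One possible refinement is to apply `nest_asyncio` only when an event loop is already running, as it is in IPython or Jupyter. This is a minimal sketch; the `in_running_loop` helper is a name introduced here and is not part of SGLang or `nest_asyncio`.

```python
import asyncio


def in_running_loop() -> bool:
    """Return True if the caller is executing inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop raises RuntimeError when no loop is running.
        return False
    return True


if in_running_loop():
    # Only needed in nested-loop environments such as notebooks.
    import nest_asyncio

    nest_asyncio.apply()
```

In a plain Python script the check returns `False`, so the patch is skipped and `nest_asyncio` does not need to be installed.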

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Alex, and I have never felt so happy before. I have 10 cookies. The first cookie costs 100 cents, the second cookie costs 100 cents, the third cookie costs 100 cents, and so on. I have a friend named Sarah. Sarah has 20 cookies. I have 3 cookies less than Sarah. How much money do I need to pay for all my cookies? Let's break down the problem step by step.

First, let's calculate how many cookies Alex has in total:
- Alex has 3 cookies less than Sarah.
- Sarah has 20 cookies
Prompt: The president of the United States is
Generated text:  trying to decide whether to spend $100 million on a new military base or $100 million on social programs. In the first case, he will have to recruit 5000 more soldiers and will have to build 2000 more facilities. In the second case, he will have to recruit 1000 more soldiers and will have to build 500 more facilities. Both bases will be ready in 5 years, and the president wants to know how much more money would 
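
For offline batch jobs it is often convenient to persist each prompt together with its completion. The following is a minimal sketch of that pattern, assuming only the call shape shown above (`generate(prompts, sampling_params)` returning one dict with a `"text"` key per prompt). `StubEngine` and `write_jsonl` are hypothetical names used so the example runs without a model; substitute the real `llm` object in practice.

```python
import io
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; returns placeholder completions."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def write_jsonl(engine, prompts, sampling_params, fp) -> int:
    """Generate completions for all prompts and write one JSON record per line."""
    outputs = engine.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        fp.write(json.dumps({"prompt": prompt, "text": output["text"]}) + "\n")
    return len(outputs)


buffer = io.StringIO()  # an open file object works the same way
count = write_jsonl(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
    buffer,
)
print(buffer.getvalue(), end="")
```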

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm a [Skill/Ability] who has been [Number of Years] years in the field of [Field of Interest]. I'm passionate about [Reason for Passion] and I'm always looking for ways to [Action or Goal]. I'm [Personality Trait] and I'm [Favorite Hobby/Activity]. I'm [Favorite Book/Article/Video/Photo/Other)]. I'm [Favorite Food/Drink/Activity/Place/Other)]. I'm [Favorite Animal/Plant/Insect/Other)]. I'm [Favorite Music/Art

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville Flottante" or "La Ville Blanche" (White City). It is the largest city in Europe and the third-largest city in the world by population. Paris is a cultural and historical center, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, Louvre Museum, and the Arc de Triomphe. It is also a major financial and business center, with many of the world's major banks and financial institutions located in the city. Paris is a popular tourist destination, known for its beautiful architecture, rich history, and vibrant culture. It is also

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI is likely to become more prevalent in manufacturing, transportation, and other industries, where it can perform tasks that were previously done by humans. This could lead to the widespread adoption of automation, which could result in job losses for some workers but also create new opportunities for those who can adapt to the new job market.

2. AI will become more integrated into our daily lives: As AI becomes more integrated into our daily lives, it will become easier and more convenient for us to interact with it. This could lead to a more personalized and efficient way of doing things, such
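
To make "overlap removal" concrete, here is a sketch of one way to merge streamed chunks whose beginnings may repeat text that has already been received. It is illustrative only and not necessarily how `sglang.utils.stream_and_merge` is implemented. The `strip_overlap` and `merge_stream` names, and the assumption that each chunk is a dict with a `"text"` key, are introduced for this example.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of chunk that is already a suffix of existing."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_stream(chunks) -> str:
    """Concatenate chunk texts, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk["text"])
    return merged


# Simulated stream in which later chunks repeat the tail of earlier text.
chunks = [{"text": " Paris"}, {"text": " Paris is"}, {"text": " is the capital"}]
print(merge_stream(chunks))
```

A suffix-prefix heuristic like this can also drop text that the model genuinely repeated, so it is best suited to streams known to resend earlier output.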



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I'm a [insert occupation or profession] with a strong passion for [insert something you enjoy or have a hobby]. I'm excited to share my knowledge and experience with you. What's your name, and what's the most exciting thing you've done recently? Let me know and we can start the conversation! #SelfIntroduction #Career #Hobby #CareerHighlight #ExcitingEvent #Inspiration #RealLife #Interests #Opportunities. #Connection #Interactions #Networking #CareerGoals. #TalkItUp #TalkItUp #TalkItUp #TalkItUp. #Talk

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Explain the location and significance of Paris in terms of both culture and politics. 1. Location: Paris is the capital of France, located in the south of the country, on the banks of the Seine River. It is situated in the center of the coun
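
Because `async_generate` is a coroutine, several batches can be awaited together with `asyncio.gather`. The sketch below shows the pattern with a stub in place of the real engine; `StubAsyncEngine` and `generate_batches` are hypothetical names, and whether concurrent calls improve throughput with a real `sgl.Engine` is not something this example demonstrates.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in exposing the async_generate call shape used above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real async call would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_batches(engine, batches, sampling_params):
    """Await several batches concurrently; results keep the order of `batches`."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    generate_batches(StubAsyncEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```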

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [insert profession or major] at [insert university or institution]. I have always been a passionate learner and always sought to understand things more deeply. I'm not just any average student though, I'm the type that takes full responsibility for my studies and often goes beyond the expectation to be the best student possible. I love to help others, and I feel I have the ability to make a difference in the world. If you're interested in learning more about me, or if you have a question about anything I say, feel free to ask! I look forward to the opportunity to get to know you

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a UNESCO World Heritage Site known for its vibrant culture, stunning architecture, and historic landmarks such as Notre-Dame Cathedral and the Eiffel Tower. It is also the birthplace of French literature and art, and a major economic hub. Paris has a population of over 2 million people and is home to a diverse range of cultures, languages, and food traditions. The city's rich history and dynamic environment have made it a popular destination for travelers from around the world. Paris is often referred to as the "city of love" due to its romantic history and picturesque romantic landscapes. France’s capital city is Paris. As the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  set to be exciting and revolutionary. Here are some potential trends to consider:

1. Increased transparency and accountability: As AI systems become more complex and rely on data and algorithms, it is important that they are transparent and accountable. This means that we need to make sure that AI systems are explainable in the same way that human decision-making is. This will require more data and richer explanations of AI systems, which will require more sophisticated models and techniques.

2. Personalization: With the increasing amount of data available, AI is becoming more personalized. We will see AI systems that are able to analyze and learn from data in real-time, providing
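
When streaming asynchronously, each chunk is typically a small, token-sized piece of text, so a caller that also needs the full completion has to accumulate the pieces. The sketch below shows that consumption pattern with a fake async generator; `fake_stream` and `collect` are hypothetical helpers, and in practice the stream would come from something like `async_stream_and_merge(llm, prompt, sampling_params)`.

```python
import asyncio


async def fake_stream(pieces):
    """Hypothetical async generator that yields text pieces one at a time."""
    for piece in pieces:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield piece


async def collect(stream, on_chunk=None) -> str:
    """Consume an async stream, optionally reporting each chunk, and return the full text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. print the chunk as soon as it arrives
        parts.append(chunk)
    return "".join(parts)


seen = []
text = asyncio.run(collect(fake_stream([" Paris", ",", " a", " city"]), seen.append))
print(text)
```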




In [6]:
llm.shutdown()
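
In a script, as opposed to a notebook, it can help to guarantee that `shutdown()` runs even if generation raises. The following sketch wraps an engine in a small context manager; `shutting_down` and `StubEngine` are hypothetical names used for illustration, and the stub deliberately fails so the cleanup path is exercised.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that fails during generation."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def shutting_down(engine):
    """Yield the engine and call engine.shutdown() on exit, even after an error."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with shutting_down(engine) as llm:
        llm.generate(["Hello, my name is"], {"temperature": 0.8})
except RuntimeError as error:
    print("generation failed:", error)
print("shut down:", engine.is_shut_down)
```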