# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following:
```python
import nest_asyncio

nest_asyncio.apply()

```
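For context, an environment such as a notebook kernel already runs an event loop, and the standard library refuses to start a second one from inside it. The sketch below uses only `asyncio` (no SGLang) to reproduce that error; `inner` and `outer` are illustrative names. `nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop. Starting another one with
    # asyncio.run() is rejected by the standard library.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```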

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Rajen and I am 21 years old. I have a lot of personal stuff to deal with, but I am now going to spend some time with you as you are a well known and popular blogger on the internet. 

I have a list of questions that I would like to ask. I want you to tell me my question and I want you to give me your answer to them. Please use only the word "yes" and "no" in your answer. If you are unsure, respond with "I'm not sure". 
I'm ready when you are.

Sure thing, Rajen! I'd be happy to
Prompt: The president of the United States is
Generated text:  very busy all the time. He usually spends most of his time in the ___________. [ ]
A. State House
B. White House
C. State Capitol
D. White House Capitol

Answer:
B

Which of the following sentences is grammatically correct and makes sense?
A. The East China Sea has a large area.
B. The East China Sea has a great area.
C. The East China Sea has a lot of area.
D. The East China Sea has a lot of great area.
Ans
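The `zip` in the cell above relies on `llm.generate` returning one result per prompt, in input order, so the results can easily be collected into records for later processing. The sketch below is an illustration under assumptions: `FakeEngine` is a hypothetical stand-in that only mimics the `{"text": ...}` output shape shown above so the example runs without a GPU, and `generate_records` is an illustrative helper, not part of SGLang. With a real engine you would pass `llm` instead.

```python
import json


class FakeEngine:
    """Hypothetical stand-in that mimics the output shape of llm.generate."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def generate_records(engine, prompts, sampling_params):
    """Run one batch and pair each prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


records = generate_records(
    FakeEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)

# One JSON object per line, convenient for writing to a .jsonl file.
for record in records:
    print(json.dumps(record))
```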

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title]. I'm always looking for new challenges and opportunities to grow and learn. What do you do for a living? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title]. I'm always looking for new challenges and opportunities to grow and learn. What do you enjoy

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, and is the largest city in the European Union. It is located on the Seine River and is home to the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is known for its rich history, art, and culture, and is a major tourist destination. The city is also home to many important institutions, including the French Academy of Sciences and the French National Library. Paris is a vibrant and dynamic city with a diverse population and a rich cultural heritage. Its status as the capital of France has made it a major economic and political center. The city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more natural and intuitive interactions between humans and machines.

2. Enhanced machine learning capabilities: AI is likely to become even more powerful and capable, with the ability to learn from vast amounts of data and make more accurate predictions and decisions. This could lead to more efficient and effective decision-making in a wide range of applications.

3. Increased focus on ethical and social implications: As AI becomes more integrated with human society, there will
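The "overlap removal" mentioned above guards against a streamed chunk repeating text that has already been received. The following is a minimal sketch of that idea, not necessarily how `stream_and_merge` is implemented in `sglang.utils`; `trim_overlap` and `merge_chunks` are illustrative names, and the chunk list is made up for the example.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one partially repeats the previous text.
chunks = [" Paris", " Paris is", " is the capital", " of France."]
merged = merge_chunks(chunks)
print(merged)
```

One caveat of this simple approach is that it cannot tell a duplicated boundary from a legitimate repetition: merging `"ha"` followed by `"ha"` yields `"ha"`, not `"haha"`.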



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name of the character]. I am a [insert occupation or profession] who has always been passionate about [insert something about your hobbies, interests, or talents]. I love [insert something about your personal characteristics or traits that set you apart]. Whether you're a friend, family member, or colleague, I am always here to lend a helping hand or provide valuable advice. I am always looking for new experiences to try out and have fun with, and I enjoy making friends with people who are like-minded. I am a [insert a specific skill or skill set] that I love to hone and develop, and I am always eager

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a historical city known for its rich history, art, and cultural influences, including the works of Michelangelo, Claude Monet, and Marcel Duch
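When requests arrive independently, for example inside a custom server, one option is to issue several asynchronous calls concurrently and gather the results. The sketch below shows only the `asyncio` pattern: `FakeAsyncEngine` is a hypothetical stand-in whose `async_generate` is assumed to return a single `{"text": ...}` dict for a single prompt, and the sleep merely simulates latency.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in for an engine with an async_generate method."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    """Start one request per prompt and wait for all of them."""
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # asyncio.gather returns results in the same order as the awaitables.
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
outputs = asyncio.run(
    generate_concurrently(
        FakeAsyncEngine(), prompts, {"temperature": 0.8, "top_p": 0.95}
    )
)

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```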

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am [Your Age]. I am currently 30 years old. I have been a gamer for 10 years now, and I have played over 100 games. I love to read, watch movies and TV, and travel. I have a great sense of humor and enjoy playing word games with my friends. I like to take care of my family, and I take time for myself when I need it. I believe that I have a lot to offer, and I am always looking for new challenges and opportunities to improve myself. Thank you for taking the time to learn more about me.



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city that is renowned for its rich history and stunning architecture. It is the largest city in France and one of the largest cities in the world by population. Paris is known for its romantic architecture, such as the Eiffel Tower and the Louvre Museum, and its vibrant culture, including its annual Eiffel Tower parades and festivals. The city is also famous for its fashion industry, which has produced countless famous designers and brands. Despite its famous landmarks and cultural importance, Paris remains a vibrant and dynamic city with a rich history and culture.

In summary, Paris is a major city in France with a rich history


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to see a significant shift in the way it is used and developed, with more companies investing in research and development to improve their capabilities and create new applications. Here are some possible trends that could emerge in the coming years:

1. More advanced algorithms: As AI becomes more complex and sophisticated, there will be an increased focus on developing more advanced algorithms that can handle more complex problems and make better predictions.

2. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As the technology advances, we may see a continued increase in its use in the field, with more specialized applications being
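The streaming cell in this section consumes an async generator chunk by chunk with `async for`. As a sketch of that pattern without a running engine, the example below defines `fake_async_stream`, a hypothetical async generator standing in for `async_stream_and_merge`, and `collect_stream`, an illustrative helper that prints chunks as they arrive while also keeping the full text.

```python
import asyncio


async def fake_async_stream(chunks, delay=0.01):
    """Hypothetical stand-in that yields text chunks with a small delay."""
    for chunk in chunks:
        await asyncio.sleep(delay)
        yield chunk


async def collect_stream(stream):
    """Print each chunk immediately and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(
    collect_stream(
        fake_async_stream([" Paris,", " a city", " that is", " renowned"])
    )
)
```

Keeping the joined text alongside the live printout is a design choice that helps when the caller needs both incremental display and the final string, for instance to log the complete response.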




```python
llm.shutdown()
```