# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
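
As background, Python's `asyncio` does not allow an event loop to be re-entered by default, and IPython and Jupyter already run one. The standard-library-only sketch below reproduces that general restriction without SGLang; `inner` and `outer` are hypothetical names used only for illustration, and the snippet says nothing about SGLang internals.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` patches `asyncio` so that this kind of nested use is permitted.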

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Mikaela and I am the owner of The Royal Eiffel Tower. I am passionate about living a sustainable lifestyle and want to share my personal stories and experiences with others who share the same values.
I decided to start my own business because I feel that it is important to be a part of something bigger than myself, to help others, and to create something that has a positive impact. I am always looking for new ideas and opportunities to make a difference and I am excited to start The Royal Eiffel Tower! I hope that my customers can find value in their experience, whether they are looking for a unique gift, a cozy
Prompt: The president of the United States is
Generated text:  running for a second term. He will be replaced by a new president immediately after the inauguration. What is the probability that the president is re-elected given that he is defeated by his opponents in the election? To determine the probability that the president is re-e
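
In offline batch jobs the results are usually stored rather than printed. The following is a minimal sketch of serializing prompt/completion pairs as JSON Lines, assuming each output is a dict with a `text` key as in the loop above. The `to_jsonl` helper and the sample values are hypothetical stand-ins so that the snippet runs without a model.

```python
import json

# Stand-in data shaped like the loop above: one dict with a "text" key per prompt.
prompts = ["The capital of France is", "The future of AI is"]
outputs = [{"text": " Paris."}, {"text": " hard to predict."}]  # hypothetical values


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as one JSON object per line."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        lines.append(json.dumps({"prompt": prompt, "completion": output["text"]}))
    return "\n".join(lines)


jsonl = to_jsonl(prompts, outputs)
print(jsonl)
```

With a real engine, `outputs` would come from `llm.generate(prompts, sampling_params)` instead of the literal list.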

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title]. I enjoy [job title] because [reason for interest]. What's your favorite hobby or activity? I love [hobby or activity]. What's your favorite book or movie? I love [book or movie]. What's your favorite food? I love [food]. What's your favorite color? I love [color]. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French National Museum, and the French Academy of Sciences. Paris is a bustling metropolis with a rich cultural heritage and is a major economic and political center in Europe. The city is known for its fashion, art, and cuisine, and is a popular tourist destination. Paris is also home to the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. The city is known for its iconic landmarks and is a major economic and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some possible future trends in AI include:

1. Increased use of AI in healthcare: AI is already being used in healthcare to diagnose and treat diseases, and it has the potential to revolutionize the field. AI-powered diagnostic tools, such as AI-powered X-ray machines, could significantly improve patient outcomes.

2. Increased use of AI in finance: AI is already being used in finance to automate trading, fraud detection, and risk management. As AI technology continues to improve, we can expect to see even more sophisticated applications in finance.
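
"Overlap removal" in the cell above refers to merging streamed chunks without repeating text that has already been emitted. The sketch below shows one way such a merge could work. `trim_overlap` and `merge_chunks` are illustrative names, the chunk list is made up, and the actual `stream_and_merge` helper in `sglang.utils` may be implemented differently.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping any text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries overlap.
chunks = [" Paris is", " is the capital", " capital of France."]
print(merge_chunks(chunks))
```

A plain suffix match like this can also drop text that the model genuinely repeats at a chunk boundary, so it is best treated as a heuristic.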





### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Character's Name]. I am a [Age] year old [Occupation or Profession] who has always been passionate about [Why is it that you are passionate about [Occupation or Profession]].

I am always learning and growing, and I am always up for new challenges. I am a team player, always looking to contribute to the team and get the best out of everyone. I am an excellent communicator, always able to convey my ideas clearly and effectively. I am a hard worker, always putting in the extra effort to get things done.

And most importantly, I am a friend. I am always there for you, whether

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the largest city in France and the second-largest city in the European Union after Rome. It is known for its beautiful architecture, rich cultural heritage, and annual celebra
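
Because `async_generate` is awaited, it can in principle be combined with other coroutines in the same event loop. The sketch below illustrates that pattern with `asyncio.gather`. `fake_async_generate` is a hypothetical stand-in that only mimics the list-of-dicts return shape used above, so the snippet runs without a model or GPU. Whether issuing several concurrent calls suits your workload is an assumption to verify against the engine itself.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    # Hypothetical stand-in for llm.async_generate: echoes each prompt back.
    await asyncio.sleep(0)  # yield control, as a real request would while waiting
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(batches, sampling_params):
    # Schedule all batches concurrently and wait for every result.
    tasks = [fake_async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(batches, {"temperature": 0.8, "top_p": 0.95}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list lines up with its batch.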

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily, and I'm a friendly, laid-back barista at a local coffee shop. I'm here to serve you all the time and make sure your drink is perfect for you. I love brewing coffee and helping people find their way around the bustling coffee shop scene. I'm a go-to for those who want to start their day with a caffeine fix or a smoothie. I'm here to assist you in finding the perfect cup of coffee and bring you the best experience possible. How can I help you today? I'll take care of you and make sure that you're getting the best experience possible. What do you need help with



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, an ancient city nestled in the Saintes-Maries-de-la-Soleil mountains on the Mediterranean coast.
Paris is a vibrant metropolis known for its rich history, cultural importance, and stunning architecture. The city's streets are lined with historic monuments, including the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also home to iconic landmarks such as the Seine River and the Arc de Triomphe. Despite its size, Paris boasts a diverse population and is a major cultural and financial center in Europe. Its status as both a political and economic capital has made it a popular destination for tourists from



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of different trends that will shape how the technology is used and developed. Here are some potential areas of development that could be expected in the coming years:

1. Increased efficiency and accuracy: One of the biggest challenges facing AI is its ability to process and analyze large amounts of data quickly and accurately. As we become more data-driven, we may see a growing trend toward more efficient and accurate AI systems, with the goal of making data-driven decisions with greater speed and precision.

2. Deep learning: Deep learning is a type of machine learning that involves building complex neural networks with many layers. As the technology continues
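
The streaming loop in this section prints each chunk with `end=""` and `flush=True` so that text appears as soon as it is produced. The sketch below shows the same consumption pattern while also keeping the full text. `fake_stream` is a hypothetical async generator standing in for `async_stream_and_merge`, so that the snippet runs without a model.

```python
import asyncio


async def fake_stream(text):
    # Hypothetical stand-in for a streaming call: yields word-sized pieces.
    for piece in text.split(" "):
        await asyncio.sleep(0)  # give other tasks a chance to run
        yield piece + " "


async def collect(stream):
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)  # show text as soon as it arrives
        parts.append(chunk)
    print()  # final newline once the stream is exhausted
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream("Paris is the capital of France.")))
```

Collecting chunks in a list and joining once at the end avoids repeated string concatenation for long generations.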




```python
llm.shutdown()
```