# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed, working Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
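
The patch is needed because IPython and Jupyter already run an event loop, and standard `asyncio` refuses to start another one from inside it. The stdlib-only sketch below reproduces that restriction without SGLang; `inner` and `outer` are hypothetical names used only for illustration, and the sketch does not show what the engine does internally.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running loop here, like code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # plain asyncio rejects a nested run
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

With `nest_asyncio.apply()` in place, this kind of nested call is permitted instead of raising.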

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Aries and I am a Portuguese artist and designer.

I am an illustrator, graphic designer and pop artist. My work focuses on bringing stories to life through the visual arts.

My approach to art is unique and based on the elements of design and storytelling. I believe that art should be a means of expressing and communicating ideas that can inspire and enlighten the viewer. I am passionate about creating artwork that is both engaging and meaningful, using the elements of design and storytelling to inspire and entertain.

My work includes illustrations, graphic design, and pop art. My artwork often features abstract elements and surreal imagery. I have also created an art website that
Prompt: The president of the United States is
Generated text:  proposing a change to the federal budget, which will affect the lives of Americans. The budget proposal will directly impact how many people will be eligible for Medicaid, a program that provides healthc
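
For long prompt lists, it can be convenient to submit prompts in bounded chunks, for example to report progress or to write results out incrementally. The following is a minimal sketch under stated assumptions: `generate_in_chunks` and `EchoEngine` are hypothetical names, and `EchoEngine` only mimics the shape seen above (a `generate(prompts, sampling_params)` call returning a list of dicts with a `"text"` key) so the code runs without a GPU. Whether chunking helps throughput with the real engine is not something this document establishes.

```python
from typing import Any, Dict, Iterator, List, Tuple


def generate_in_chunks(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    chunk_size: int = 2,
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, submitting chunk_size prompts per call."""
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            yield prompt, output["text"]


class EchoEngine:
    """Hypothetical stand-in for the engine; returns placeholder completions."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


pairs = list(
    generate_in_chunks(
        EchoEngine(),
        ["Hello, my name is", "The capital of France is", "The future of AI is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
for prompt, text in pairs:
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```

In real use, the `llm` object created earlier would take the place of `EchoEngine()`.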

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your interests and passions. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your interests and passions. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your interests and passions. What can you tell me about yourself? [Name] is a [job title] at [company name]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a bustling metropolis with a rich history and culture, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. Paris is also a popular tourist destination, with its beautiful architecture, vibrant nightlife, and delicious cuisine. The city is home to many world-renowned museums, art galleries, and theaters, making it a must-visit destination for visitors from around the globe. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. Its status as the capital of France is a testament to

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI systems will become more integrated with human intelligence, allowing them to learn from and adapt to the behavior and preferences of humans. This will enable more sophisticated and personalized interactions between humans and machines.

2. Enhanced natural language processing: AI will continue to improve its ability to understand and interpret human language, allowing for more natural and intuitive interactions between humans and machines.

3. Improved decision-making: AI will become more capable of making more informed and accurate decisions, based on a wide range of data and information. This will enable machines to make
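
The "overlap removal" mentioned above can be pictured as trimming from each new chunk any prefix that already ends the accumulated text. The sketch below illustrates that idea only; `trim_overlap_sketch` and `merge_chunks` are hypothetical names, and this is not claimed to be how `stream_and_merge` is implemented in `sglang.utils`.

```python
from typing import Iterable


def trim_overlap_sketch(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk minus the longest prefix that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, dropping text that repeats the tail of what we have."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Illustrative chunks in which later pieces repeat the tail of earlier ones.
chunks = [" Paris", " Paris, also", ", also known as", " the City of Light."]
merged = merge_chunks(chunks)
print(merged)
```

One caveat of this approach: text that legitimately repeats across a chunk boundary (for example `"ha"` followed by `"ha"`) would also be trimmed, so it suits streams that are known to resend overlapping text.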



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [职业] who specializes in [职业领域] in [职业领域].

As a [职业], I am passionate about [职业领域] and have a deep understanding of [职业领域] that I aim to share with others. I am a reliable, trustworthy, and accountable person who always strives to provide the best service to my clients. 

I bring a level of professionalism and attention to detail that is unmatched in the industry, and I believe that I can help make a difference in the world. I am a true team player and enjoy working collaboratively with others to achieve common goals.

In my

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The answer is:

Paris is the capital of France, located in the Seine-et-Oise region of the North-Western region of France. It is the largest city in France by population, with over 1 million inhabitants. The city is
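
Because `async_generate` is awaited, it can be combined with other coroutines on the same event loop. The sketch below shows that pattern with `asyncio.gather`; `StubAsyncEngine` is a hypothetical stand-in that only mimics the call shape used above (an awaitable returning a list of dicts with a `"text"` key), so it runs without a model. How the real engine behaves under concurrent calls is an assumption this example does not verify.

```python
import asyncio
from typing import Any, Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in exposing an async_generate-like coroutine."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        await asyncio.sleep(0.01)  # simulate waiting on the model
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_two_batches() -> List[Dict[str, str]]:
    engine = StubAsyncEngine()
    params = {"temperature": 0.8, "top_p": 0.95}
    # Both calls are scheduled together; gather returns results in call order.
    batch_a, batch_b = await asyncio.gather(
        engine.async_generate(["Hello, my name is"], params),
        engine.async_generate(["The capital of France is"], params),
    )
    return batch_a + batch_b


results = asyncio.run(run_two_batches())
for item in results:
    print(item["text"])
```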

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  _____. I'm a/an _____. I love ____ and _____. My favorite hobby is ____ and my favorite movie is _____. I like to eat ____ and I love to _____. I'm an ____ and I'm ____! I love playing ____ and _____. I'm ____! I'm a/an ____! How do you like me? Let me know what you think!

1. "Hello, my name is [name]. I'm a/an [character]. I love [character's name] and [character's favorite hobby] most. My favorite movie is [movie title] and my favorite food is [food]. I



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic Eiffel Tower, museums, art galleries, and numerous cafes, including the Montmartre neighborhood. Paris is also home to the Notre-Dame Cathedral and the Louvre Museum, which house some of the world's most famous art pieces. The city's cuisine is also renowned for its gourmet treats like croissants, gâteaux, and petit fours. Additionally, Paris is a cosmopolitan city with a diverse population, including immigrants from around the world. It is also a popular tourist destination, with many visitors coming to experience its historical and cultural landmarks. Paris's reputation as one of the world's



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly promising and constantly evolving, with various potential trends that could transform the industry. Here are some possible future trends in AI:

1. Increased integration with other technologies: As AI becomes more advanced and pervasive, it will be increasingly integrated with other technologies, such as blockchain, IoT, and machine learning. This will create a more seamless and interconnected ecosystem, allowing for a wider range of applications and innovations.

2. Development of more ethical and responsible AI: As the AI industry continues to grow, so will the ethical and responsible use of AI. This will require the development of new ethical standards, regulations, and practices to ensure that AI is used
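
When streaming, it is often useful to display chunks as they arrive and also keep the full text for later use. The sketch below shows that pattern with a plain async generator; `fake_token_stream` and `collect` are hypothetical names, and the generator merely stands in for a real chunk stream such as the one consumed in the cell above.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_token_stream(tokens: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async stream of text chunks."""
    for token in tokens:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield token


async def collect(stream: AsyncIterator[str]) -> str:
    """Print each chunk as it arrives and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(collect(fake_token_stream([" Paris", ",", " known", " for"])))
```

Printing with `end=""` keeps the chunks on one line, which is how the streamed text is meant to read.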




In [6]:
llm.shutdown()