# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
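
The likely reason for this requirement is that `asyncio.run` refuses to start while another event loop is already running, which is the situation inside IPython and Jupyter. The sketch below reproduces that restriction with the standard library only and does not touch SGLang; the coroutine names `inner` and `outer` are illustrative, and the link to the engine's behaviour is an assumption drawn from the note above.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running at this point, as it would be in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` beforehand patches asyncio so that such nested calls are allowed.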

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Michelle and I am from St. Mary’s Church in Walsall, a suburb of Birmingham, England. I have been involved in various fields, including music, photography, and pottery. I am a member of the Birmingham City Council and a member of the local community development team. I am passionate about living and working in the region and have been a member of the Sheffield branch of the UK Arts Council for the past seven years.
I am a passionate supporter of social causes, and have been involved in community activities and charities in my local area since my teenage years. This has given me a unique perspective on the challenges and opportunities that come
Prompt: The president of the United States is
Generated text:  seeking a replacement for his first term. If there are 50 eligible candidates, what is the probability that the president will choose a candidate who is the candidate who was the 10th candidate to serve in the United States? To determine the 
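
In an offline batch job, the results are often written to disk rather than printed. The following is a minimal sketch of that step, assuming only what the cell above shows: `generate` returns one dictionary per prompt with a `"text"` key. The helper name `save_results`, the file name, and the placeholder data are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_results(prompts, outputs, path):
    """Write one JSON object per line, pairing each prompt with its generated text."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "text": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Placeholder data with the same shape as the engine output used above.
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " <generated text placeholder>"}]

out_path = Path(tempfile.mkdtemp()) / "results.jsonl"
save_results(demo_prompts, demo_outputs, out_path)

# Read the file back to confirm the round trip.
loaded = [json.loads(line) for line in out_path.read_text(encoding="utf-8").splitlines()]
print(loaded)
```

With a real engine, `demo_prompts` and `demo_outputs` would be replaced by `prompts` and `outputs` from the cell above.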

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? I'm a [insert a short description of your profession or experience here]. I enjoy [insert a short description of your hobbies or interests here]. What's your favorite hobby or activity? I love [insert a short description of your favorite hobby or activity here]. What's your favorite book or movie? I love [insert a short description of your favorite book or movie here]. What's your favorite place to go? I love [insert a short description of

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French Parliament House. Paris is a cultural and economic hub, known for its rich history, art, and cuisine. It is a popular tourist destination and a major transportation hub, with many international flights and trains connecting it to other cities in Europe and beyond. The city is also home to many notable museums, including the Louvre and the Musée d'Orsay. Paris is a vibrant and dynamic city, with

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ethical considerations. This will include issues such as bias, transparency, and accountability.

2. Greater integration with human intelligence: AI will continue to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior. This will enable machines to become more intelligent and capable of making decisions that are more aligned with human values.

3. Enhanced capabilities in
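
The "overlap removal" mentioned in the cell above can be pictured as follows: when a new chunk starts with text that already ends the accumulated output, the repeated part is dropped before appending. The sketch below illustrates that idea in plain Python. It is not the SGLang implementation of `stream_and_merge`; the function names `strip_repeated_prefix` and `merge_chunks` and the sample chunks are made up for illustration.

```python
def strip_repeated_prefix(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that `existing` already ends with."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk  # no overlap found


def merge_chunks(chunks):
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


# Hypothetical chunks that overlap at their boundaries.
demo_chunks = [" Paris is", " is the capital", " capital of France."]
merged_text = merge_chunks(demo_chunks)
print(repr(merged_text))  # ' Paris is the capital of France.'
```

One limitation of this approach is that a genuine repetition at a chunk boundary is indistinguishable from an overlap and would also be trimmed.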



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [职业/爱好] who enjoys [爱好/特长] and has been [职业/爱好] for [x年/年] (in your personal life). I'm an [年龄] years old and [身高] centimeters tall, and I have [业余爱好/特长] such as [爱好/特长] or [爱好/特长]. I'm [形容词] and [形容词] in this world, and I have a deep respect for [职业/爱好] and am always eager to learn new things. I enjoy [职业/爱好] and I believe in [职业/

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known as the city of love. It is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, as well as the city's vibrant culture and rich history. Paris is a popular tourist destination, and many people visit each year to experience its stunning architecture, vibrant atmosphere, and world-class art scene. The city is also a major center for science and techno
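
One benefit of an awaitable call is that several independent requests can be in flight on the same event loop. The sketch below shows the pattern with `asyncio.gather`. It uses a `StubEngine` class, a hypothetical stand-in that only mimics the call shape of `async_generate` seen above, so it runs without a GPU; how the real engine schedules concurrent calls is not covered here and should be checked against the SGLang examples.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; it only mimics the awaitable call shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_jobs(engine):
    # Two independent batches are awaited concurrently on one event loop.
    job_a = engine.async_generate(["Prompt A1", "Prompt A2"], {"temperature": 0.8})
    job_b = engine.async_generate(["Prompt B1"], {"temperature": 0.2})
    return await asyncio.gather(job_a, job_b)


results_a, results_b = asyncio.run(run_jobs(StubEngine()))
print(len(results_a), len(results_b))  # 2 1
```

`asyncio.gather` returns the results in the order the awaitables were passed, so each batch can be matched back to its prompts.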

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am [Age] years old. I have always loved books and art since I was a child. I have a passion for helping people find happiness and joy in life. I believe that art and books are the most effective tools for creating positive change in the world.

What is your favorite book or artist to read or paint, and why do you enjoy reading or creating with them?

As an AI language model, I don't have personal preferences or emotions, but I can tell you that I love reading and creating art, and I enjoy the process of coming up with new ideas and expressing myself through the medium of writing

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

That's correct! Paris, the capital of France, is known as the "City of Love" and is a famous cultural, historical, and artistic city. It has many attractions, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also known for its cuisine and fashion. Paris is a popular tourist destination and a major center for business and finance. With its rich history, vibrant culture, and beautiful architecture, Paris is a city that has fascinated people for centuries. It is also a symbol of France's commitment to democracy, freedom, and progress. Paris is a city that

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly promising and is expected to continue to evolve rapidly. Here are some possible trends in AI that are currently being explored and could shape the future:

1. Machine Learning: Machine learning is a key area of AI research and development, and it is expected to continue to advance in the coming years. Machine learning algorithms will become more sophisticated, enabling machines to learn from large amounts of data on their own and make more accurate predictions and decisions.

2. Natural Language Processing: Natural language processing is a crucial aspect of AI, as it allows machines to understand and interpret human language. Advances in NLP will enable machines to communicate more effectively, understand human

In [6]:
llm.shutdown()
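
In a standalone script, it can be useful to make sure `shutdown()` runs even when generation raises an exception. The following is a minimal sketch of that pattern using `try`/`finally`. The `StubEngine` class and the helper `generate_then_shutdown` are hypothetical and exist only so the example runs without a model; only the `generate` and `shutdown` method names are taken from the cells above.

```python
class StubEngine:
    """Hypothetical stand-in that fails during generation and records shutdown."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.closed = True


def generate_then_shutdown(engine, prompts, sampling_params):
    """Run one batch and always release the engine afterwards."""
    try:
        return engine.generate(prompts, sampling_params)
    finally:
        engine.shutdown()  # runs on success and on error


engine = StubEngine()
error_message = None
try:
    generate_then_shutdown(engine, ["Hello, my name is"], {"temperature": 0.8})
except RuntimeError as exc:
    error_message = str(exc)

print(engine.closed, error_message)
```

The exception still propagates to the caller, so failures stay visible while the engine is released.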