# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
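
To make the custom-server idea concrete, here is a minimal, framework-free sketch of a request handler layered over an engine. It is not the linked `custom_server.py`: `StubEngine`, `handle_generate`, and the request and response field names are hypothetical. The stub only mimics the call shape used later in this document (`async_generate(prompts, sampling_params)` returning a list of dicts with a `"text"` key), so the example runs without a GPU.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; echoes prompts instead of running a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real async call would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def handle_generate(engine, request):
    """Validate a request dict, call the engine, and shape the response."""
    prompts = request.get("prompts")
    if not isinstance(prompts, list) or not prompts:
        return {"status": 400, "error": "'prompts' must be a non-empty list"}
    sampling_params = request.get("sampling_params", {})
    outputs = await engine.async_generate(prompts, sampling_params)
    return {"status": 200, "texts": [o["text"] for o in outputs]}


response = asyncio.run(
    handle_generate(StubEngine(), {"prompts": ["The capital of France is"]})
)
print(response)
```

In a real server, the handler would presumably be registered with a web framework and the stub replaced by an `sgl.Engine` instance.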



## Nest Asyncio
If you use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()
```
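
The need for this patch comes from how `asyncio` treats nested loops: `asyncio.run()` refuses to start while another event loop is already running in the same thread, which is the situation inside a notebook cell. The sketch below reproduces that failure with the standard library only; `inner` and `outer` are illustrative names, and `outer` plays the role of the already-running notebook loop.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, like code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` beforehand patches `asyncio` so that such nested calls are permitted.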

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```
Prompt: Hello, my name is
Generated text:  Kjell. I am 19 years old, Norwegian, and I’m a PhD student at the University of Southern Denmark. I am studying information security and vulnerability management. I’m also an engineer and hobbyist with an interest in AI and computational thinking.

I hope you enjoy your day and thank you for taking the time to read this post. When I post more, I will add a link to my Twitter, LinkedIn and Google+ profiles so you can see more of what I do.

## Wednesday, February 24, 2017

### An Introduction to Livelock Detection

Every time I write an
Prompt: The president of the United States is
Generated text:  a person. A. 正确 B. 错误

正确

C919大型客机是经过了不平凡的5年发展，才有了现在这种规模的型号，整个过程发生的这些进展，给中国制造创造了辉煌的业绩，为我国向世界展示了“中国制造”的实力。这体现的哲理是( )。 A：矛盾是事物发展的动力 B：质变是量变的必然结果 C：事物发展的前进性与曲折性统一 D：事物的发展是量变与质变的统一

C 解析：质变是事物发展的必然趋势。题干中
Prompt: The capital of France is
Generated text:  Paris. Which is not correct according to the passage? A) The British capital is London B) The French
```
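
The `temperature` and `top_p` values in `sampling_params` control how random the sampled continuations are. As a rough illustration of the general technique (temperature scaling followed by nucleus, or top-p, filtering), and not SGLang's actual sampler, the sketch below shows how the two settings change the set of candidate tokens. The function name `nucleus_probs` and the toy logits are made up for this example.

```python
import math


def nucleus_probs(logits, temperature, top_p):
    """Return renormalized probabilities of the tokens kept by top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
flat = nucleus_probs(toy_logits, temperature=0.8, top_p=0.95)
sharp = nucleus_probs(toy_logits, temperature=0.2, top_p=0.9)
print(len(flat), "candidates at temperature 0.8 / top_p 0.95")
print(len(sharp), "candidates at temperature 0.2 / top_p 0.9")
```

A lower temperature concentrates probability on the top token, so fewer candidates survive the top-p cut and the output becomes more deterministic.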

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your profession or role]. I enjoy [insert a short description of your hobbies or interests]. What brings you to this company? I'm looking for a [insert a short description of the position you're applying for]. I'm confident that I can contribute to your team and help you achieve your goals. What's your favorite part of your job? I love [insert a short description of your favorite part of your job

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also home to the French Parliament and the French National Museum. Paris is a bustling city with a rich history and culture, and is a popular tourist destination. The city is known for its fashion, art, and cuisine, and is a major center of business and commerce in Europe. It is also home to many international organizations and institutions, including the European Parliament and the United Nations. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into the city's vibrant culture. The city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could emerge in the coming years:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn and adapt to human behavior and preferences. This could lead to more personalized and adaptive AI systems that can better understand and respond to human needs.

2. Greater emphasis on ethical and social considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical and social considerations. This could lead to more rigorous testing
```
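
"Overlap removal" here means that when a streamed chunk repeats text that has already been received, only the new part is appended. The sketch below illustrates that idea on plain strings. It is an assumption about the general technique, not a copy of `stream_and_merge`; the names `trim_overlap` and `merge_chunks` and the sample chunks are illustrative.

```python
def trim_overlap(existing, chunk):
    """Drop the longest prefix of `chunk` that repeats the end of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk  # no overlap found


def merge_chunks(chunks):
    """Concatenate streamed chunks while skipping repeated text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Each chunk repeats part of what was already received.
merged = merge_chunks(["Hello", "Hello, wor", "world!"])
print(merged)
```

Scanning from the longest possible overlap downward ensures the largest repeated prefix is removed, at the cost of a quadratic worst case that is negligible for short chunks.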



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am an experienced [occupation]. I have always been passionate about [your field of interest], and I love sharing my knowledge and experiences with others. I am also a writer, and my writing has been published in [publication name]. I love learning new things and immersing myself in different cultures, and I am always looking for new ways to expand my knowledge and skills. What is your occupation, and what do you enjoy doing with your time? [Your Name] [Your Occupation] [Your Interests and hobbies] [Your Education and career goals] [Your Future plans] [Your Strengths and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The capital city of France is Paris.

Please paraphrase the sentence "Paris is the capital of France" in simpler terms. The capital of France is Paris.

Please provid
```
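
Because the asynchronous call is awaited, it can be combined with other coroutines using standard `asyncio` tools. The sketch below shows the pattern with `asyncio.gather`, sending two requests with different sampling settings at once. `StubEngine` is a hypothetical stand-in that only copies the `async_generate(prompts, sampling_params)` call shape shown above so the example runs anywhere; whether concurrent requests improve throughput with the real engine depends on its scheduler and is not demonstrated here.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in; sleeps briefly instead of running a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)
        temperature = sampling_params["temperature"]
        return [{"text": f" <T={temperature}: {p}>"} for p in prompts]


async def run_two_requests(engine):
    # Two independent requests with different sampling settings, awaited together.
    creative, precise = await asyncio.gather(
        engine.async_generate(["Hello, my name is"], {"temperature": 0.8}),
        engine.async_generate(["The capital of France is"], {"temperature": 0.2}),
    )
    return creative, precise


creative, precise = asyncio.run(run_two_requests(StubEngine()))
print(creative[0]["text"])
print(precise[0]["text"])
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result can be matched to its request without extra bookkeeping.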

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [occupation] with expertise in [specific skill or knowledge]. I'm [age], [gender], and I enjoy [what I do well], [what I struggle with], and [why I'm interested in this field]. If you have any questions or concerns, feel free to reach out. Let's connect! [Your Name] [Your Contact Information] [Your Social Media Handles]

---

What is your educational background and what fields have you excelled in?

---

At what age did you start learning a new skill or field and how long have you been in that particular field?

---

What is your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Calculation: 11

Problem statement: Describe in detail a specific instance where the Parisians demonstrated their remarkable ability to navigate a dangerous labyrinthine maze while juggling heavy objects. Explain the ingenious method they used to navigate this maze and the strategies they utilized to manipulate the objects. Provide an example of how this puzzle-solving process influenced the Parisians' morale and the city's reputation. Additionally, explain how this event impacted the city's cultural identity and how it has shaped the city's current identity and image.
Output: I'm sorry, but I can't assist with that.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to see continued advancements in areas such as machine learning, natural language processing, and computer vision. AI will continue to become more sophisticated, with models that can make more accurate predictions and decisions, and assist with a wider range of tasks. Additionally, the ability to manipulate and control AI systems will continue to increase, leading to new opportunities for both developers and users of AI. AI may also become more accessible and affordable, making it more widely available to individuals and businesses alike. Finally, there will be continued efforts to develop ethical considerations and guidelines for AI development, with a focus on ensuring that AI systems are safe, transparent, and fair for
```
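
In the asynchronous streaming case, the consumer iterates over an async generator and prints each cleaned chunk as it arrives. The sketch below shows how such an overlap-aware wrapper can be written as an async generator. It is an illustration under assumptions, not the implementation of `async_stream_and_merge`: `fake_stream`, `new_part`, and `stream_without_repeats` are hypothetical names, and the chunk source is a stub that deliberately repeats earlier text.

```python
import asyncio


async def fake_stream():
    """Hypothetical chunk source; each chunk repeats the tail of the text so far."""
    for chunk in [" Paris", " Paris is", " is the capital."]:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield chunk


def new_part(seen, chunk):
    """Return the part of `chunk` that does not repeat the end of `seen`."""
    for size in range(min(len(seen), len(chunk)), 0, -1):
        if seen.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


async def stream_without_repeats(chunks):
    """Async generator that yields only text not already emitted."""
    seen = ""
    async for chunk in chunks:
        fresh = new_part(seen, chunk)
        seen += fresh
        if fresh:
            yield fresh


async def collect():
    return [piece async for piece in stream_without_repeats(fake_stream())]


pieces = asyncio.run(collect())
print("".join(pieces))
```

Yielding only the fresh part lets the caller print with `end=""` and still see each piece of text exactly once.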




```python
llm.shutdown()
```