# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server adds unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Emily, and I'm a part-time student in a French class at my local community college. I'm trying to learn how to improve my French vocabulary. I've read some of the articles online and found that many of the articles also contain grammar. But I'm having trouble grasping the grammar and I'm not sure how to incorporate it into my own vocabulary.

Could you provide me with some tips to help me learn vocabulary and improve my French grammar? I would like to start using these tips to improve my vocabulary while still focusing on the grammar.
Certainly! Learning vocabulary and grammar is a great way to improve your French. Here are some
Prompt: The president of the United States is
Generated text:  seeking to please his constituents. He is proposing a new tax on corporations, and the proposal is not well received by some of his constituents. This suggests that the president may be:

a) Influential
b) Lazy
c) Lazy and manipulative
d) Diligent and princ
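
The engine schedules a batch on its own, so passing the whole prompt list to `generate` as shown above is enough. For very long prompt lists it can still be convenient to submit prompts in chunks, for example to save partial results between calls. The sketch below assumes only what the cell above shows: `generate` takes a list of prompts plus a sampling-parameter dict and returns one dict with a `"text"` key per prompt. `chunked`, `generate_in_chunks`, and `EchoEngine` are hypothetical names; `EchoEngine` is a stand-in that makes the sketch runnable without a model.

```python
from typing import Any, Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start : start + size]


def generate_in_chunks(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    chunk_size: int,
) -> List[Dict[str, Any]]:
    """Call engine.generate once per chunk and collect outputs in prompt order."""
    results: List[Dict[str, Any]] = []
    for chunk in chunked(prompts, chunk_size):
        results.extend(engine.generate(chunk, sampling_params))
    return results


class EchoEngine:
    """Hypothetical stand-in for the engine; it only echoes the prompt."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


if __name__ == "__main__":
    demo_prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
    outputs = generate_in_chunks(EchoEngine(), demo_prompts, {"temperature": 0.8}, chunk_size=2)
    for prompt, output in zip(demo_prompts, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

With a real engine, `EchoEngine()` would be replaced by the `llm` object created earlier.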

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [occupation] with [number of years] years of experience in [field]. I'm a [character trait] and I'm always looking for ways to [describe a new challenge or opportunity]. I'm a [character trait] and I'm always looking for ways to [describe a new challenge or opportunity]. I'm a [character trait] and I'm always looking for ways to [describe a new challenge or opportunity]. I'm a [character trait] and I'm always looking for ways to [describe a new challenge or opportunity]. I'm a [character trait] and I'm always looking for

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history and a vibrant culture, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also a major center for art, fashion, and cuisine, and is home to many world-renowned museums, theaters, and restaurants. The city is also known for its annual festivals and events, including the Eiffel Tower Festival and the Parisian Carnival. Paris is a city of contrasts, with its modern architecture and cultural attractions blending seamlessly with its historic landmarks. Its status as the capital

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends that could emerge in the coming years:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ethical considerations. This could include issues such as bias, privacy, and transparency. As a result, there will be a greater focus on developing AI that is designed to be fair, transparent, and accountable.

2. Integration with other technologies: AI is likely to become more integrated with other technologies, such as machine learning, natural language processing,
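
The "overlap removal" mentioned in the cell above refers to dropping text that a new chunk repeats from what has already been received. The following is an illustrative sketch of that idea only. `strip_overlap` and `merge_chunks` are hypothetical names, and this is not presented as how `stream_and_merge` is implemented.

```python
from typing import Iterable


def strip_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shorter ones.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


if __name__ == "__main__":
    print(merge_chunks(["The capital", " capital of", " of France"]))
```

A purely textual heuristic like this one can also remove text that the model legitimately repeated at a chunk boundary, so it is best treated as a display aid rather than an exact reconstruction.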



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am an AI language model designed to assist you with your queries and tasks. I'm here to help you with any questions you may have, whether you're looking for information, support, or just some distraction-free time. How can I assist you today? Let me know if there's anything I can do for you. I'm always here to help and provide you with the best possible service. Is there anything specific you would like to know or discuss? I'm here to help you learn and improve your language skills in the best way possible. I'm here to help you with any questions or concerns you may have.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the largest city in France and the second-largest city in Europe. It is the capital of the Department of Paris, and the capital of the French Department of the Centre
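
Because `async_generate` is awaited, it can be combined with other coroutines in the same event loop. The sketch below shows one way to await several calls concurrently with `asyncio.gather`. It assumes only the call shape used above (a list of prompts plus a sampling-parameter dict, returning dicts with a `"text"` key). `FakeAsyncEngine` and `run_groups` are hypothetical names; the fake engine stands in for `llm` so the example runs without a model.

```python
import asyncio
from typing import Any, Dict, List


class FakeAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, Any]]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_groups(
    engine: Any, groups: List[List[str]], sampling_params: Dict[str, Any]
) -> List[List[Dict[str, Any]]]:
    """Await one async_generate call per prompt group concurrently."""
    tasks = [engine.async_generate(group, sampling_params) for group in groups]
    # gather returns results in the same order as the awaitables it was given.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    groups = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
    for group, outputs in zip(groups, asyncio.run(run_groups(FakeAsyncEngine(), groups, {}))):
        for prompt, output in zip(group, outputs):
            print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```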

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  John. I’m an AI language model, so it’s my job to provide helpful and informative responses to your queries. I can answer questions about history, science, literature, technology, and many other topics. I’m also skilled in logical reasoning and can assist with complex tasks and problem-solving. I enjoy helping others and learning from the experiences of others. I’m excited to learn and share my knowledge with you. Welcome to my virtual world! Let me know if you have any questions or need any assistance. John. What is the tone of this self-introduction? The tone of this self-introduction is neutral and friendly. It's

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
You are an AI assistant that helps you understand the reasons behind decisions. Don't write copies of arguments, reviews orinions. Use your own words. Français. France's capital city is Paris.

I'll provide a concise factual statement about France's capital city. The capital of France is Paris.

Paris, the vibrant capital of France, is renowned for its architecture, culture, and lively lifestyle. Known as the "City of Light," it's a bustling metropolis with a population of over 2. 5 million people, making it the most populous city in the European Union. Famous landmarks include the Eiffel Tower

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  full of exciting possibilities and possibilities for innovation, but it's important to remember that AI is a rapidly evolving field with many unknowns and challenges.

One potential future trend in AI is the development of AI that can operate in a wider variety of environments and situations, including the internet of things (IoT), space exploration, and autonomous vehicles. This would allow AI to be more adaptive and intelligent, better able to adapt to different types of work and tasks, and be more effective at handling complex and dynamic environments.

Another trend could be the development of AI that can generate human-like thought and creativity, in areas such as language translation, creative writing
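
The loop in this section prints each chunk as it arrives. When the complete text is also needed afterwards, the chunks can be accumulated while iterating. The following sketch shows that pattern with a hypothetical `fake_stream` generator standing in for the real stream; `collect` is likewise a hypothetical helper name.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async stream of text chunks."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield chunk


async def collect(stream: AsyncIterator[str]) -> str:
    """Consume an async stream and return the concatenated text."""
    parts: List[str] = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    text = asyncio.run(collect(fake_stream([" Paris", ".", " It", "'s"])))
    print("Generated text:", text)
```

Joining the parts once at the end avoids repeated string concatenation inside the loop.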




In [6]:
llm.shutdown()
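
In a script, as opposed to a notebook, it can help to make sure `shutdown()` runs even when generation raises an exception. One possible pattern is a small context manager, sketched below. `managed_engine` and `DummyEngine` are hypothetical names introduced for illustration; the only engine method assumed is the `shutdown()` call shown above.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def managed_engine(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine via `factory` and always call its shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class DummyEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


if __name__ == "__main__":
    with managed_engine(DummyEngine) as engine:
        print("Closed inside the block:", engine.closed)
    print("Closed after the block:", engine.closed)
```

With the real engine, the factory could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`.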