# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Erin and I am the founder of the Futurpreneur community. Futurpreneur is a quarterly, peer-led, alumni-focused community for people who are starting and/or growing a company. People from every industry and from all over the world come together to learn, network, and create new opportunities. Plus, we have a diverse membership that includes startup founders and venture capitalists. We provide resources and opportunities for people who are looking to grow their businesses, and help them take their company to the next level. For more information, visit www.futurpreneur.com. Erin is also a board member of BluePebbles. Here's a snippet
Prompt: The president of the United States is
Generated text:  trying to decide how many military trucks to buy for the next 5 years. He knows that the military needs to buy 100 trucks in total. He also knows that the cost of each truck depends on the number of years it will be in service. The cost of a truck for the
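
When batch results feed a later processing step, it can help to keep each prompt next to its completion. The following is a minimal sketch, assuming only what the cell above shows: `generate` returns one dict per prompt, in order, with a `text` key. `to_records` is a hypothetical helper, and the sample outputs are stand-ins rather than real model output:

```python
import json
from typing import Dict, List


def to_records(prompts: List[str], outputs: List[Dict]) -> List[Dict[str, str]]:
    """Pair each prompt with its generated text, preserving order."""
    if len(prompts) != len(outputs):
        raise ValueError("expected one output per prompt")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
sample_prompts = ["The capital of France is", "The future of AI is"]
sample_outputs = [{"text": " Paris."}, {"text": " uncertain."}]

for record in to_records(sample_prompts, sample_outputs):
    # One JSON object per line (JSONL) is convenient for later processing.
    print(json.dumps(record))
```

The length check is there because `zip` silently truncates to the shorter input, which would hide a missing output.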

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? As an AI language model, I'm designed to understand and respond to natural language input, so I can provide information and answer questions in a conversational and engaging way. How can I assist you today? Let's get started! [Name] [Job Title] [Company Name] [Company Address] [Company Phone Number] [Company Email] [Company Website] [Company LinkedIn Profile] [Company Twitter Profile] [Company Facebook Profile] [Company

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is the largest city in France and the second-largest city in the European Union. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. The city is also home to many world-renowned museums, theaters, and restaurants. Paris is a cultural and historical center that plays a significant role in France's political, economic, and social life. It is also a popular tourist destination, attracting millions of visitors each year. Paris is a vibrant and dynamic city that continues to evolve and change over

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence, allowing it to learn and adapt in ways that are difficult for humans to do. This could lead to more efficient and effective decision-making, as well as more personalized and context-aware interactions with humans.

2. Enhanced ethical considerations: As AI becomes more advanced, there will be increased scrutiny of its ethical implications. This could lead to more stringent regulations and guidelines for AI development and deployment, as well as more public debate about the potential risks and benefits of AI.

3. Greater
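
"Overlap removal" here means dropping text from a new chunk that repeats the end of what has already been collected. The sketch below illustrates that idea in isolation; `strip_overlap` and `merge_chunks` are hypothetical names, and this is not necessarily how `stream_and_merge` is implemented:

```python
from typing import Iterable


def strip_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is also a suffix of existing."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks, removing repeated text at the seams."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


print(merge_chunks(["The capital", " capital of France", " is Paris."]))
```

A pure string heuristic like this can also drop legitimately repeated characters (for example, `"hel"` followed by `"lo"` merges to `"helo"`), so it fits streams that resend already-seen text better than streams of independent tokens.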



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [job title] with over 5 years of experience in [industry]. I'm always looking for ways to make a positive impact on the world and I believe that every person has the potential to make a difference. I'm passionate about helping others and I'm always learning new things to stay up-to-date with the latest trends and technologies in my field. I'm excited to meet you and see where our careers can take us together. 

Please be sure to provide a brief introduction that highlights your experience, skills, and values. Also, make sure to include your current location and any relevant work experience. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light and the City of Fine Arts. It is the largest city in both land and sea, and home to iconic landmarks such as the Eiffel
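
Asynchronous generation is useful when several independent requests should be in flight at once. The following is a minimal sketch of bounding that concurrency with `asyncio.Semaphore` and `asyncio.gather`. `FakeEngine` is a hypothetical stand-in, and its single-prompt `async_generate(prompt, sampling_params)` returning a dict with a `text` key is an assumption made so that the example runs without a GPU:

```python
import asyncio
from typing import Dict, List


class FakeEngine:
    """Hypothetical stand-in for an engine; echoes the prompt after a short delay."""

    async def async_generate(self, prompt: str, sampling_params: Dict) -> Dict:
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_bounded(
    engine, prompts: List[str], sampling_params: Dict, max_in_flight: int = 2
) -> List[Dict]:
    """Run one request per prompt, with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt: str) -> Dict:
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


demo_prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
demo_outputs = asyncio.run(
    generate_bounded(FakeEngine(), demo_prompts, {"temperature": 0.8, "top_p": 0.95})
)
for prompt, output in zip(demo_prompts, demo_outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Passing the whole prompt list in one `async_generate` call, as in the cell above, is usually simpler; per-request tasks like these are mainly of interest when prompts arrive at different times.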

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Age] year old male. I have been known for my [Title] work, and my interests lie in [Personal Interest]. I recently moved to [City/Town], and I love spending my time [Favorite Activity or hobby]. What is your name, and what are your hobbies or interests? Hello, my name is [Name], and I'm a [Age] year old male. I have been known for my [Title] work, and my interests lie in [Personal Interest]. I recently moved to [City/Town], and I love spending my time [Favorite Activity or hobby].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the largest city in France by population and metropolitan area, with a population of over 1 million. It is home to the presidential palace and the Louvre museum. The city is also home to many famous landmarks and attractions, including the Eiffel Tower and the Notre-Dame Cathedral. Paris is also known for its artistic and cultural scene, with many museums, theaters, and restaurants in the city. The city is a major transportation hub and is located on the River Seine, which forms a significant part of its urban landscape. Paris is a leading global city, with a rich history and a rich cultural identity.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain and depends on a variety of factors, including technological advances, cultural shifts, economic conditions, and political decisions. However, there are several possible future trends in AI that are likely to continue in the coming years and decades.

1. Increased reliance on AI for critical applications: One of the most significant trends in AI is the increasing reliance on AI for critical applications, such as healthcare, finance, and transportation. AI is already being used to improve diagnostics, predict risks, and optimize operations in these areas, and it has the potential to expand these benefits in the future.

2. AI becoming more integrated with human decision-making: AI is becoming



```python
llm.shutdown()
```
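
If generation raises an exception, a trailing `shutdown()` call is never reached. The following is a minimal sketch of guarding against that with a context manager. `managed_engine` and `StubEngine` are hypothetical, and the only engine method assumed is the `shutdown()` call shown above:

```python
from contextlib import contextmanager
from typing import Callable, Iterator


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def managed_engine(factory: Callable[[], object]) -> Iterator[object]:
    """Create an engine and always call shutdown(), even if the body raises."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(StubEngine) as engine:
    pass  # with a real engine, call engine.generate(...) here

print("shut down:", engine.is_shut_down)
```

With the real engine, the factory could be `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`.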