# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
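
As a rough illustration of the custom-server idea, the request-handling core can be reduced to a function that parses a request body, calls the engine, and serializes the reply. The sketch below is not the linked `custom_server.py`. `FakeEngine` and `handle_generate` are hypothetical names, and the stand-in only mimics the `{"text": ...}` output shape used later in this document so that the code runs without a GPU.

```python
import json


class FakeEngine:
    """Hypothetical stand-in for the real engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        # Assumption: a single prompt yields a dict with a "text" key.
        return {"text": f" (echo of: {prompt})"}


def handle_generate(engine, body: bytes):
    """Turn a JSON request body into a (status_code, JSON response body) pair."""
    try:
        payload = json.loads(body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        # ValueError covers malformed JSON and undecodable bytes.
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()

    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]}).encode()


status, response = handle_generate(FakeEngine(), b'{"prompt": "Hello"}')
print(status, response.decode())
```

A web framework of your choice would call a function like this from its route handler. Swapping `FakeEngine()` for a real engine instance is the only change the handler itself should need, assuming the output shape matches.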



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
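
The underlying problem can be reproduced with the standard library alone: `asyncio.run()` refuses to start when an event loop is already running, which is typically the situation inside a Jupyter kernel. The sketch below does not use SGLang. `inner` and `outer` are illustrative names, and `outer` plays the role of the environment's already-running loop.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are inside a running event loop here, much like code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches asyncio so that nested calls of this kind are permitted.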

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Pass the whole list at once; one output dict is returned per prompt.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Amy. I'm in 7th grade, and I love to go to the library. I have a big heart and love helping people with their homework, and I really like to learn new things. I also like to collect action figures. I have a big collection of action figures, and I like to trade them with other friends. I have to give people my toys when they borrow them. I like to play baseball, and I'm a good player. I have an ice cream shop, and I like to ice cream. I have a pet, a cat named Kitty. I also have a dog named Buddy. We are best
Prompt: The president of the United States is
Generated text:  trying to decide how many armed guards should be stationed along the border with New Mexico. The cost of each guard is $20,000 and the border has a 20% chance of being attacked by terrorists each year. The president has a budget of $100 million for this purpose. 

1. Determine if the president can afford to place a guard at the border if there is an attack on the border in the 
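
When collecting batch results for later processing, keep in mind that `zip` silently stops at the shorter of its inputs. A small helper that checks the lengths first can make a mismatch visible. This is only a sketch: `pair_results` is a hypothetical helper, and the sample outputs are hand-written dicts that assume nothing beyond the `"text"` key used above.

```python
def pair_results(prompts, outputs):
    """Pair each prompt with its generated text, failing loudly on a length mismatch."""
    if len(prompts) != len(outputs):
        raise ValueError(f"got {len(outputs)} outputs for {len(prompts)} prompts")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


# Hand-written stand-ins for engine outputs; only the "text" key is assumed.
sample_prompts = ["The capital of France is", "The future of AI is"]
sample_outputs = [{"text": " Paris."}, {"text": " uncertain."}]

for record in pair_results(sample_prompts, sample_outputs):
    print(record)
```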

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

# stream_and_merge consumes the stream for one prompt and returns the merged text.
for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm currently [Current Location] and I enjoy [Favorite Activity/Interest]. I'm a [Type of Person] and I'm [Your Personality]. I'm [Your Profession]. I'm [Your Goal]. I'm [Your Motivation]. I'm [Your Purpose]. I'm [Your Vision]. I'm [Your Character]. I'm [Your Character]. I'm [Your Character]. I'm [Your Character]. I'm [Your Character]. I'm [Your Character]. I'm [Your Character]. I'm [Your Character]. I'm

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, and is the largest city in Europe by population. It is located on the Seine River and is home to many of France's most famous landmarks, including the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is also known for its rich cultural heritage, including its art museums, theaters, and opera houses. The city is known for its vibrant nightlife and is a popular tourist destination for visitors from around the world. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. The city is also home to many

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI is expected to become more integrated into various industries, leading to increased automation of tasks and processes. This could result in the creation of new jobs, but also the displacement of some traditional jobs.

2. AI ethics and privacy concerns: As AI becomes more integrated into our daily lives, there will be increasing concerns about its impact on society. This includes issues such as bias, privacy, and the potential for AI to be used for malicious purposes.

3. AI in healthcare: AI is already being used in healthcare to improve patient outcomes, but there is also potential for AI
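
The "overlap removal" mentioned in the cell above can be pictured as follows: before appending a streamed chunk, drop any prefix of the chunk that repeats the end of the text collected so far. The code below is a minimal sketch of that idea and is not presented as the implementation behind `stream_and_merge`. `trim_overlap` and `merge_chunks` are illustrative names, and the chunk list is made up.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Remove the longest prefix of new_chunk that is also a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the largest match wins.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text that repeats the current tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks whose boundaries overlap.
chunks = ["The capital", " capital of", " of France"]
print(merge_chunks(chunks))
```

One limitation of this heuristic is that a chunk which legitimately repeats the preceding text (for example `"ha"` following `"ha"`) would also be trimmed.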



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; one output dict is returned per prompt.
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a professional marketer. I specialize in creating effective and compelling content for businesses and individuals seeking to grow their businesses online. I'm always on the lookout for new ways to reach and engage my clients, and I believe in using data and analytics to inform my strategies. I'm also a strong believer in the power of storytelling to inspire and motivate my clients to take action. If you're looking to boost your online presence and reach new audiences, I'm the person to turn to. #professionalmarketing #digitalmarketing #digitalinnovation #growthstrategy #growthhype #digitalagency #digitalmarketingagency #digitalmarketing

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

Please answer the following question about the statement:
Who were the first inhabitants of Paris?
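
An asynchronous interface also makes it possible to issue requests one prompt at a time and cap how many are in flight. The sketch below shows that pattern with `asyncio.gather` and a semaphore. `FakeAsyncEngine` is a hypothetical stand-in that only imitates an awaitable `async_generate` returning a `{"text": ...}` dict, and `generate_all` and `max_in_flight` are illustrative names. Whether this is preferable to passing the whole list in one call depends on the workload and is not something this document states.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in exposing an awaitable async_generate method."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for {prompt!r}>"}


async def generate_all(engine, prompts, sampling_params, max_in_flight=2):
    """Run one request per prompt, with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def generate_one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the same order as the awaitables it was given.
    return await asyncio.gather(*(generate_one(p) for p in prompts))


demo_prompts = ["a", "b", "c"]
demo_outputs = asyncio.run(
    generate_all(FakeAsyncEngine(), demo_prompts, {"temperature": 0.8})
)
for prompt, output in zip(demo_prompts, demo_outputs):
    print(prompt, "->", output["text"])
```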



### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text: ... [insert character's name here]. I'm a [insert number 1-3 of your favorite movies] fan, and I specialize in [insert number 1-3 of your favorite genres]. I'm always up for new challenges and enjoy exploring the world of different cultures and cuisines. I love to travel and try new foods, and I'm always looking for new and exciting experiences. I'm passionate about [insert a hobby or interest you're passionate about, such as cooking, hiking, or photography]. Thank you for taking the time to meet me! 😊✨

I'm [insert your age] years old,



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the largest city and the seat of government of the country.

A summary of the information would be: France's capital is Paris.

This statement encapsulates the main information provided in the original text, focusing on the capital city of Paris and its status as the largest city and government seat of France.

The French government and official government of France does not have a separate capital city; Paris is the seat of the French government. The capital is also the largest city in France by population, following Lyon, which is the second largest city.

In summary, the statement conveys the basic facts about the French capital city without



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be characterized by many trends, including:

1. Increased automation and precision: With AI, machines are expected to be able to perform repetitive tasks with more accuracy and speed than humans. This could lead to the development of self-driving cars, robots for manufacturing, and other applications that reduce the need for human labor.

2. Greater integration with human intelligence: AI is expected to become more integrated with human intelligence, allowing machines to learn and adapt to different contexts and situations. This could lead to more sophisticated forms of human-like intelligence, such as super-intelligence, and even consciousness.

3. Improved ethical and legal considerations: As AI becomes
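
When streaming asynchronously, it is often useful to display chunks as they arrive while also keeping the complete text. The sketch below shows that consumption pattern with a stand-in async generator. `fake_stream` and `collect` are hypothetical names, and the chunk values are made up; the stand-in is assumed to yield already de-duplicated chunks, as the loop over `async_stream_and_merge` above does.

```python
import asyncio


async def fake_stream(prompt):
    """Hypothetical stand-in for a stream of already cleaned text chunks."""
    for piece in [" Paris", ",", " the", " capital", "."]:
        await asyncio.sleep(0)  # yield control, as a real stream would while waiting
        yield piece


async def collect(stream):
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream is exhausted
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream("The capital of France is")))
```

Collecting the pieces in a list and joining once at the end avoids repeated string concatenation on long generations.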




```python
# Release the engine's resources once inference is finished.
llm.shutdown()
```
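
In a script, an exception raised during generation would skip a trailing `shutdown()` call. One way to guard against that is a small context manager that always shuts the engine down. This is a sketch under assumptions: `managed_engine` is a hypothetical helper, not part of SGLang, and `FakeEngine` is a stand-in exposing only the `generate` and `shutdown` methods used in this document.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing the two methods this sketch relies on."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompt, sampling_params=None):
        return {"text": " ..."}

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine with factory() and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(FakeEngine) as engine:
    print(engine.generate("Hello, my name is")["text"])

print("shut down:", engine.is_shut_down)
```

With a real engine, the factory could be a `lambda` that constructs `sgl.Engine(...)` with the same arguments used at the top of this section.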