# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. It suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
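
The patch is needed because IPython and Jupyter already run an event loop, and the standard library refuses to start a second one inside it. The stdlib-only sketch below reproduces that failure without SGLang; `inner` and `outer` are hypothetical names used only for illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop in the same thread
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` first patches `asyncio` so that nested calls like this one are allowed.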

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sajjad Soltani and I am a PhD candidate in mathematics at the University of Toronto. My research focuses on theoretical aspects of networks, including network science, network embedding, and social networks.\nMy research interests include: fundamental properties of networks, network embedding, network structure and dynamics, and large-scale network analysis.\nI am a member of the Data Science Research Group, and an affiliate of the Montreal Research Agency on Data Science (MRADDS).\nThis content is for educational purposes only. It is not intended for use by or in any jurisdiction that does not comply with the laws of that jurisdiction. This may include content
Prompt: The president of the United States is
Generated text:  getting ready for his trip to the United Kingdom. He is going to stay for a minimum of 6 days but he is only going to stay for a certain number of days. He is going to start the trip 2 months after he was born. The president
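
The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough illustration of what these two settings conventionally mean, here is a pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a toy distribution. It follows the textbook definitions and is not SGLang's sampler; the function name `top_p_filter` and the toy logits are hypothetical.

```python
import math


def top_p_filter(logits, temperature, top_p):
    """Return {token: prob} after temperature scaling and nucleus filtering."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the smallest set of most likely tokens whose cumulative
    # probability reaches top_p.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


# Toy next-token logits, invented for this example.
toy_logits = {" Paris": 2.0, " Lyon": 1.0, " a": 0.0, " the": -3.0}
filtered = top_p_filter(toy_logits, temperature=0.8, top_p=0.95)
print(filtered)
```

With these toy values the least likely token is dropped and sampling would then draw from the three remaining candidates.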

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm a [job title] at [company name], and I'm excited to be here today. I'm [job title] at [company name], and I'm excited to be here today. I'm a [job title] at [company name], and I'm excited to be here today. I'm a [job title] at [company name], and I'm excited to be here today. I'm a [job title] at [company name], and I'm excited to be here today. I'm a [job title] at

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a bustling metropolis with a rich history and a diverse population of over 10 million people. The city is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is also known for its cuisine, fashion, and art scene, making it a popular tourist destination. The city is home to many cultural institutions and events throughout the year, including the World Cup and the Eiffel Tower Festival. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. The

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased automation and robotics: As AI technology continues to advance, we are likely to see an increase in automation and robotics in various industries. This will lead to the creation of new jobs, but it will also create new opportunities for people to work in areas such as data analysis, software development, and robotics.

2. Improved privacy and security: As AI technology becomes more advanced, there will be an increased need for privacy and security
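
The banner printed above mentions overlap removal. When consecutive streamed chunks may repeat text that was already received, one way to merge them is to trim the longest prefix of each new chunk that duplicates the end of the accumulated text. The sketch below illustrates that idea in plain Python; `strip_overlap` and `merge_chunks` are hypothetical helpers, and this is an assumption about the general technique rather than the actual implementation of `stream_and_merge`.

```python
def strip_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Invented chunks whose boundaries repeat a few characters.
chunks = [" Paris is", " is the capital", " capital of France."]
merged = merge_chunks(chunks)
print(merged)
```

A suffix-matching approach like this one cannot tell an accidental repeat from a legitimate one, so it is only appropriate when the stream is known to resend overlapping text.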



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [Your occupation] with [Number of years of experience]. I have always been passionate about [Your career objective or hobby]. I strive to be [Your character trait or unique selling point] in my work. Who is [Your Name] and what makes you unique? (Hint: Think about what makes you different from others or what you bring to the table that others don't.)
Great! That sounds like a very unique and interesting personality. Can you tell me a little bit more about yourself, like what kind of job you have and what you love to do outside of work? Also, what

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a large and diverse city with a rich history and beautiful architecture. Paris is the world’s most populous city and a major cultural and economic center. It is known for its iconic 
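
Because `async_generate` is awaited, the call can in principle be scheduled alongside other coroutines. The sketch below shows that general `asyncio` pattern with a stand-in function, so it runs without a model. `fake_async_generate` and `other_work` are hypothetical; the stand-in only mimics the list-of-dicts output shape used above and says nothing about how the real engine behaves under concurrency.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    """Hypothetical stand-in that mimics the output shape used above."""
    await asyncio.sleep(0.01)  # pretend to run inference
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def other_work():
    """Any unrelated coroutine that should not wait for generation."""
    await asyncio.sleep(0.01)
    return "other work finished"


async def main():
    prompts = ["The capital of France is", "The future of AI is"]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    # Run both coroutines concurrently and wait for both results.
    outputs, status = await asyncio.gather(
        fake_async_generate(prompts, sampling_params),
        other_work(),
    )
    return prompts, outputs, status


prompts, outputs, status = asyncio.run(main())
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
print(status)
```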

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I am a [Your Profession] who just moved to [Your City or State]. I'm excited to share my journey with you. How can I help you today? Let's chat about the best ways to learn about your industry or job and discuss any questions you might have. Let's stay in touch soon.

Note: This self-introduction should be neutral and informative, while avoiding any personal or biased statements. It should also demonstrate a level of respect and professionalism, as you are a fictional character. Additionally, it should be appropriate for a neutral context such as a job interview or chat with friends. Good



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city is renowned for its vibrant culture, historical landmarks, and stunning architecture, including the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is also home to many world-renowned museums and theaters. The city is known for its annual celebration of France's Day of the Arts and Sciences, as well as its annual fashion week. Paris is a cosmopolitan metropolis with a diverse population, including French, Spanish, and international visitors. It's a popular tourist destination and a cultural center for France, known for its world-class cuisine, art, and music. It's important to note that Paris has



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  bright, and it's possible that there will be many exciting developments in the coming years. Here are some possible future trends in artificial intelligence:

1. Increased integration with other technologies: AI will continue to be integrated with other technologies, such as machine learning, robotics, and sensors, to create more complex and powerful systems. This integration could lead to even more advanced applications and potentially help to solve complex problems.

2. Autonomous agents: AI will become more capable of performing tasks that are not currently possible with human intervention, such as autonomous vehicles, self-driving airplanes, and even human-like emotional intelligence. This will require a significant increase in AI technology
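
The streaming loop in this section prints each chunk as soon as it arrives. If the complete text is also needed afterwards, the chunks can be collected while they are printed. The sketch below shows that pattern with a stand-in async generator so that it runs without a model; `fake_stream` and `collect` are hypothetical names, and the fixed pieces are invented for illustration.

```python
import asyncio


async def fake_stream(prompt):
    """Hypothetical stand-in for a chunk stream; yields fixed pieces."""
    for piece in [" Paris", ".", " The", " city", " is", " renowned", "."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def collect(prompt):
    """Print chunks as they arrive and return the joined text."""
    pieces = []
    async for chunk in fake_stream(prompt):
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(pieces)


full_text = asyncio.run(collect("The capital of France is"))
```

Joining the pieces once at the end avoids repeated string concatenation inside the loop.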




In [6]:
llm.shutdown()