# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
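The reason this is needed comes from how `asyncio` treats nested loops: an IPython or Jupyter session already runs an event loop, and the standard library refuses to start a second one inside it. The sketch below reproduces that failure with the standard library alone. `inner` and `outer` are hypothetical names used only for illustration, and no SGLang code is involved.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, much like a notebook cell.
    coro = inner()
    try:
        # Starting a second loop from inside a running one is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches the event loop so that nested calls like this are permitted, which is why it has to be applied before using the engine in such environments.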

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Jerry. I am a busy doctor working in a small hospital in a small town in Maine. I have been married for 26 years to my wife, Joan, for many years. We have two children, a daughter and a son. We have two dogs, Poodle and Labrador Retriever. My wife Joan has a good job and is very healthy. My children are very smart and like to read and write. I love my job and the people I work with. I have a great sense of humor. And I enjoy watching funny shows and movies. Now, I want to share with you the story of my life.
Prompt: The president of the United States is
Generated text:  a high-ranking government official who serves as the leader of the country and is responsible for the national policies. Many politicians in the US have become wealthy through a variety of business practices and have benefited from the economy through their lobbying efforts. Some politicians have become rich through special election tactics, while others have been rich through 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [job title] and I'm always looking for ways to [job title] and [job title]. I'm a [job title] who is always [job title] and I'm always [job title]. I'm a [job title] who is always [job title] and I'm always [job title]. I'm a [job title] who is always [job title] and I'm always [job title]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history and a vibrant culture. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also famous for its cuisine, fashion, and art scene. Paris is a popular tourist destination and a major economic center in France. It is home to many world-renowned museums, theaters, and art galleries. The city is also known for its annual festivals and events, such as the Eiffel Tower Parade and the Louvre Festival. Paris is a city that has a unique

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more complex and nuanced decision-making. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human emotions and behaviors.

2. Enhanced machine learning capabilities: AI is likely to become even more powerful and capable, with the ability to learn from vast amounts of data and make more accurate predictions and decisions. This could lead to more efficient and effective use of resources, as well as more accurate predictions of future events.

3. Increased focus on ethical
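"Overlap removal" here means merging streamed chunks so that text repeated between consecutive chunks is emitted only once. The following is a minimal sketch of that idea under stated assumptions, not SGLang's actual implementation: `trim_overlap` and `merge_chunks` are illustrative helpers written for this example, and the chunk strings are made up.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the first match is the largest.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that the previous chunks already covered."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries overlap.
chunks = ["The capital of", " of France", " France is Paris"]
merged = merge_chunks(chunks)
print(merged)
```

The same helper also handles a stream in which every chunk carries the full text so far, because the whole existing text is then the overlapping prefix of the new chunk.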



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am [Age] years old. I was born in [Place of Birth] and I live in [City of Residence]. I am a creative writer, and I am constantly seeking to challenge my own creativity and find new ways to express myself. I love writing about the world and the people around me, and I am always looking for new ideas and perspectives to add to my work. What are your hobbies or interests outside of writing? I also enjoy reading, watching movies and TV, playing sports, and spending time with my loved ones. Thank you for asking about me, and I hope to have the opportunity to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is also known as the City of Light and has a rich history and culture. Its main attractions include the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Opera House, among o
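One reason to prefer the asynchronous API is that an awaiting coroutine yields control, so other work can proceed while generation is in flight. The sketch below illustrates the pattern with a stand-in: `FakeEngine` is a hypothetical stub that only mimics the `async_generate(prompts, sampling_params)` call shape used in this section and returns canned text, so it runs without a GPU or SGLang installed. `heartbeat` is likewise invented for the example.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for illustration; NOT part of SGLang."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend the model is busy
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(log):
    # Unrelated work that runs while generation is pending.
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0.001)


async def main():
    engine = FakeEngine()
    log = []
    # gather() runs both coroutines concurrently and waits for both to finish.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["Hello, my name is"], {"temperature": 0.8}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(main())
print(outputs[0]["text"], log)
```

With the real engine, the same structure would let a custom server keep handling other requests while a batch is being generated.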

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] with over [number of years] years of experience in [industry]. I've always been [professional trait or characteristic], [note a specific example]. I enjoy [mention something enjoyable or interesting from your life]. I'm dedicated to [mention something specific you are passionate about or enjoy doing]. As a [mention an age range], [mention your current profession]. I strive to be [mention something specific you want to be], [note your personal values and beliefs]. Thank you for having me. I am excited to share more about myself with you. [Name] <3 <3



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located on the banks of the Seine River. The city is famous for its Eiffel Tower, the Louvre Museum, and the French Quarter, which is known for its historic architecture and French food. Paris is also home to museums such as the Louvre and the Museum of Modern Art, as well as the Notre-Dame Cathedral and the Champs-Élysées. The French Republic and the French language are also significant aspects of Paris, which has a diverse population of around 6 million residents, including many students and tourists from around the world. The city is also home to many notable artists, composers, and



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  fascinating and uncertain. Some potential future trends in AI include:

1. Increased automation of jobs: AI is increasingly replacing human jobs, but it also opens up new opportunities for automation. For example, robots may soon be able to perform certain tasks that humans are currently done.

2. Improved privacy and security: As AI systems become more advanced, there may be more ways for data breaches and privacy violations to occur. Governments and organizations will need to work to improve privacy and security measures to protect their sensitive data.

3. AI for education: AI has the potential to revolutionize the education system. AI-powered learning tools could help students learn at their
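The `async for` loop in this section consumes an asynchronous generator and prints each chunk as soon as it arrives, which is why the text appears piece by piece. A minimal sketch of that consumption pattern follows. `fake_stream` is a hypothetical generator standing in for `async_stream_and_merge`, and its tokens are invented for the example.

```python
import asyncio


async def fake_stream(tokens):
    """Hypothetical stand-in that yields one text chunk at a time."""
    for token in tokens:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield token


async def collect(tokens):
    pieces = []
    async for chunk in fake_stream(tokens):
        # In the notebook cell, this is where each chunk is printed immediately.
        pieces.append(chunk)
    return "".join(pieces)


text = asyncio.run(collect([" Paris", ",", " located", " on", " the", " Se", "ine"]))
print(text)
```

Joining the chunks in arrival order reconstructs the full completion, so a caller can both display partial output and keep the final text.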




In [6]:
# shut down the engine and release its resources
llm.shutdown()