# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
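
As a rough illustration of what a custom server on top of the engine can look like, the sketch below wraps an engine-like object in a plain request-handling function. `handle_generate_request` and `EchoEngine` are hypothetical names introduced here; the stub stands in for `sgl.Engine` so the snippet runs without a GPU, and the only assumption about the engine is the `generate(prompts, sampling_params)` call shape used in the cells of this document. The linked `custom_server.py` remains the authoritative example.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        # Mimic the output shape used in this document: one dict with a "text" key per prompt.
        return [{"text": f" (echo of: {p})"} for p in prompts]


def handle_generate_request(engine, raw_body: str) -> str:
    """Turn a JSON request body into a JSON response using an engine-like object."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return json.dumps({"error": "body must be JSON with a 'prompt' field"})

    # Fall back to illustrative defaults when the caller sends no sampling parameters.
    sampling_params = payload.get("sampling_params", {"temperature": 0.8, "top_p": 0.95})
    output = engine.generate([prompt], sampling_params)[0]
    return json.dumps({"text": output["text"]})


response = handle_generate_request(EchoEngine(), '{"prompt": "Hello"}')
print(response)
```

Keeping the handler a pure function of the engine and the request body makes it easy to plug into whichever web framework you prefer and to test without starting a server.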



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
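
For background, asyncio event loops are not reentrant by default: `asyncio.run` refuses to start while another loop is already running in the same thread, which is the situation inside IPython and Jupyter. The standard-library-only sketch below reproduces that error; `inner` and `outer` are illustrative names, not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, like code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` first patches asyncio so that nested calls of this kind are allowed.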

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Rachel, I'm a 21 year old woman, and I'm interested in learning more about art. Where should I start to learn more about art?
As a 21 year old woman interested in learning more about art, it is important to start your journey by exploring various art forms. Here are some steps and resources that you can use to get started:

1. Identify your interests: What type of art do you enjoy the most? What are your favorite art movements or artists?

2. Visit museums: Museums are a great way to learn about art. They often have rotating exhibits featuring new and older works. Take advantage
Prompt: The president of the United States is
Generated text:  a very important person in the government of the United States. He is responsible for making decisions on important things like the economy and foreign relations. He is also in charge of making sure that the president serves a term of two years. The president is chosen by the people for a term of two years.
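
When the prompt list is very large, it can be convenient to submit it in slices, for example to save partial results as you go. The sketch below shows one way to do that with any object that exposes `generate(prompts, sampling_params)` as used above. `generate_in_chunks` and `EchoEngine` are hypothetical names; the stub replaces `sgl.Engine` so the snippet runs without a GPU, and the `chunk_size` value is arbitrary.

```python
from typing import Dict, Iterator, List, Tuple


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def generate_in_chunks(
    engine, prompts: List[str], sampling_params: Dict, chunk_size: int = 2
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, generated_text) pairs, submitting at most chunk_size prompts per call."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        # Outputs come back in prompt order, so zip pairs them up correctly.
        for prompt, output in zip(chunk, outputs):
            yield prompt, output["text"]


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
results = list(generate_in_chunks(EchoEngine(), prompts, {"temperature": 0.8, "top_p": 0.95}))
for prompt, text in results:
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```

Since the engine already schedules batches efficiently, slicing is mainly a bookkeeping aid rather than a performance technique.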

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower and the annual Eiffel Tower Festival. It is also the seat of the French government and the largest city in the European Union. Paris is a cultural and historical center with many museums, art galleries, and landmarks, including the Louvre and Notre-Dame Cathedral. The city is also known for its cuisine, including its famous croissants and its famous French fries. Paris is a popular tourist destination, with millions of visitors each year. It is also a major financial center, with many financial institutions and companies headquartered in the city. The city is known for its fashion industry

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations, including issues such as bias, privacy, and transparency.

3. Increased use of AI in healthcare: AI is likely to play a larger role in healthcare, with more personalized and accurate diagnoses and treatments.

4. Greater use of AI in education: AI is likely to be used more extensively in education, with
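
The "overlap removal" mentioned in the cell above refers to streamed chunks that may repeat text that was already received. The sketch below illustrates one way such merging can work. `trim_overlap` and `merge_chunks` are names chosen for this illustration, and the code is not claimed to match the implementation of `stream_and_merge` in `sglang.utils`.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks([" Paris", " Paris is", " is the capital"])
print(repr(merged))
```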



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  ____ and I'm a/an _____. I'm a/an _____. I love to ____ and I'm a/an _____. I have a/an ____ personality and I'm always ready to _____. I'm a/an ____.
Make sure to use appropriate language and include your personal experiences or background in your introduction. Remember to also provide a brief summary of your character's work, any notable achievements, or personal interests that stand out to you. The key is to make the introduction engaging and easy to read, but not overly long or boring. Good luck! How about you? How would you like to start your self-introduction? To make

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the northwestern region of the country.

Here's the concise factual statement in the form of a short paragraph:

Paris, the largest city in France and the capital, is situat
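
An asynchronous interface also makes it possible to keep several independent requests in flight at once. The sketch below issues one `async_generate` call per group of prompts and caps the number of concurrent calls with a semaphore. `generate_groups`, `EchoEngine`, and `max_in_flight` are hypothetical names for this illustration; the stub replaces `sgl.Engine` so the snippet runs without a GPU.

```python
import asyncio


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real call would while waiting
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def generate_groups(engine, prompt_groups, sampling_params, max_in_flight=2):
    """Run one async_generate call per group, with at most max_in_flight calls active."""
    limiter = asyncio.Semaphore(max_in_flight)

    async def run_group(group):
        async with limiter:
            return await engine.async_generate(group, sampling_params)

    # gather preserves argument order, so results line up with prompt_groups.
    return await asyncio.gather(*(run_group(g) for g in prompt_groups))


groups = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(
    generate_groups(EchoEngine(), groups, {"temperature": 0.8, "top_p": 0.95})
)
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```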

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a/an [Occupation]. I am [Age], and I have always been [Attraction/Interest/Challenge] to me. I am [Number of Languages] - [Language] fluently, and I speak [Language] as well. I have a passion for [Your Hobby/Interest/Challenge], and I have always been [How You Spend Your Time]. I am always [Positive/Positive] about my life and strive to make the best of it every day. And I am [Personality - Hardline, Soft, Quiet, or Something Else]. I have a strong [Strong Character

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is a historic city in northwestern France, known for its iconic Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. The city is also famous for its cuisine, fashion, and art scene, and is home to many world-renowned universities and research institutions. The capital city of France is often considered the cultural and economic center of the country. It is one of the largest and most populous cities in Europe, and is home to millions of people. The French government is located in the Eiffel Tower, which is a symbol of Paris. Paris is also the official residence of the Pope and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly exciting and constantly evolving. Here are some possible trends that could shape the field in the coming years:

1. Improved privacy and security: As more and more data is generated and processed in the digital age, there will be a need for better privacy and security measures. AI-powered systems will need to be more transparent, secure, and ethical in their use, and will need to be designed with privacy and security in mind.

2. Increased automation and human oversight: As AI becomes more sophisticated, there will be a need for humans to monitor and oversee the AI systems. This could involve training humans to be more effective at interpreting and responding to
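
If you need the complete text as well as the incremental chunks, the `async for` pattern above can be wrapped in a small collector. The sketch below shows the idea with a stand-in stream; `fake_stream` and `collect_stream` are hypothetical names, and `fake_stream` only imitates a streaming call so the snippet runs without a GPU.

```python
import asyncio


async def fake_stream(pieces):
    """Hypothetical stand-in for a streaming generation call: yields text pieces one at a time."""
    for piece in pieces:
        await asyncio.sleep(0)  # yield control between chunks
        yield piece


async def collect_stream(stream, on_chunk=None) -> str:
    """Consume an async iterator of text chunks, optionally forward each one, and return the full text."""
    collected = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)
        collected.append(chunk)
    return "".join(collected)


seen = []
full_text = asyncio.run(
    collect_stream(fake_stream([" Paris", " is", " the", " capital"]), on_chunk=seen.append)
)
print(repr(full_text))
```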




```python
llm.shutdown()
```