# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
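As a rough illustration of the second use case, the sketch below wraps an engine-like object in a small HTTP endpoint using only the Python standard library. `EchoEngine`, `handle_generate`, `make_server`, and the `/generate` route are names invented for this example, and `EchoEngine` merely stands in for `sgl.Engine` so the snippet runs without a GPU. The linked `custom_server.py` remains the authoritative example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    def generate(self, prompts, sampling_params=None):
        # Same output shape as in this document: one dict per prompt with a "text" key.
        return [{"text": f" (echo) {p}"} for p in prompts]


def handle_generate(engine, raw_body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body."""
    request = json.loads(raw_body or b"{}")
    outputs = engine.generate(
        request.get("prompts", []), request.get("sampling_params")
    )
    return json.dumps({"texts": [o["text"] for o in outputs]}).encode("utf-8")


def make_server(engine, host="127.0.0.1", port=8000):
    """Build an HTTP server that forwards POST /generate to the engine."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            payload = handle_generate(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return HTTPServer((host, port), Handler)


# Exercise the request handler directly, without opening a socket.
print(handle_generate(EchoEngine(), b'{"prompts": ["Hello"]}').decode())
```

To serve requests, call `make_server(engine).serve_forever()` with a real engine in place of `EchoEngine`. Keeping the request handling in a plain function makes it testable without a network connection.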



## Nest Asyncio
If you use the **Offline Engine** in IPython or in other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()
```

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Václav, a 19 year old computer engineer, and I'm from Prague. I'm really passionate about science, technology and history. In my free time I like to read books and watch movies. Now I'm looking for a job in the technology industry. So, I'm writing this job application. I have a master's degree in Computer Science and I'm quite comfortable with programming languages like Python and Java. I have experience with databases and software engineering. I have a lot of software testing experience. I also have experience with systems management. I have an excellent understanding of hardware. I have strong communication and collaboration skills
Prompt: The president of the United States is
Generated text:  seeking to assign three unique numbers to each of the 100 members of the United States Senate. The assignment must be made using the numbers 1 through 100, and each number can only be used once in each assignment. In how many ways can the president ass
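In offline batch jobs the outputs are usually persisted rather than printed. The sketch below pairs each prompt with its generated text and serializes the pairs as JSON Lines. `to_jsonl` is a hypothetical helper, and `fake_outputs` is hand-written stand-in data with the same `{"text": ...}` shape that `llm.generate` returns above; it is not real model output.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as JSON Lines (one object per line)."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in data; in real use this would come from llm.generate(prompts, sampling_params).
example_prompts = ["The capital of France is"]
fake_outputs = [{"text": " Paris."}]
print(to_jsonl(example_prompts, fake_outputs))
```

The length check guards against silently misaligned pairs, which `zip` alone would truncate without warning.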

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm a [insert a relevant skill or experience here]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Insert a brief answer here]. I look forward to meeting you and learning more about you. [Insert a closing statement here]. Thank you for your time. [Name]. [Name] [Company name] [Job title] [Company website] [Company logo] [Company address] [Company phone number] [Company email address] [Company website] [Company logo] [Company address]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville de Paris" or "La Ville de Paris, la capitale de l'Europe". It is the largest city in France and the second-largest city in the European Union. Paris is known for its rich history, art, and culture, as well as its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. The city is also home to many famous museums, including the Louvre and the Musée d'Orsay. Paris is a popular tourist destination and a major economic center in France. It is also home to many important political and cultural institutions

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to evolve and improve, leading to more sophisticated and accurate AI systems that can perform a wide range of tasks with increasing accuracy and efficiency. Some potential future trends in AI include:

1. Increased integration with other technologies: AI is already being integrated into a wide range of other technologies, such as smart homes, self-driving cars, and virtual assistants. As these technologies continue to evolve, we can expect to see even more integration between AI and other technologies, leading to even more advanced and sophisticated AI systems.
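The "overlap removal" named in the banner refers to merging streamed chunks without duplicating text that a chunk repeats from what has already been received. The sketch below shows one way such merging can work. It is an illustration written for this document, not necessarily how `stream_and_merge` is implemented, and `strip_overlap` and `merge_chunks` are hypothetical names.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping text repeated at each chunk boundary."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


print(merge_chunks(["Hello", "lo wor", "world"]))  # prints "Hello world"
```

A suffix/prefix match of this kind cannot tell a retransmitted fragment from a genuine repetition (for example `"ha"` followed by `"ha"`), so it suits streams where chunks are known to overlap.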





### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm an [age] year old aspiring [career]. I love [your first job or experience] because [explanation]. What brings you to this stage of your life? It's been a long time since my last job or experience, and I'm always looking for new challenges and opportunities to learn something new. I'm eager to learn more about [specific field or industry] and to gain new skills to help me be better prepared for my future career. 

In my free time, I enjoy [activity or hobby]. I'm constantly seeking to expand my knowledge and to meet new people. I'm always looking

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France by population and is the heart of the country. It was founded in the 11th century and is located on the Seine river, near the ancient town of Paris. Paris is known f
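Because `async_generate` is awaitable, several batches can be submitted from a single event loop. The sketch below bounds the number of batches in flight with a semaphore. `StubEngine` is a stand-in written for this example (it sleeps instead of running a model), and `run_batches` and `max_in_flight` are hypothetical names; only the `await llm.async_generate(prompts, sampling_params)` call shape is taken from the cell above.

```python
import asyncio


class StubEngine:
    """Stand-in for sgl.Engine; returns placeholder text after a short delay."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # Simulate inference latency.
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params, max_in_flight=2):
    """Run several prompt batches concurrently, at most `max_in_flight` at a time."""
    gate = asyncio.Semaphore(max_in_flight)

    async def run_one(batch):
        async with gate:
            return await engine.async_generate(batch, sampling_params)

    # gather returns results in the same order as `batches`.
    return await asyncio.gather(*(run_one(b) for b in batches))


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(StubEngine(), batches, {"temperature": 0.8}))
print([len(r) for r in results])  # prints [1, 2]
```

Whether this helps throughput with a real engine depends on the workload, since the engine already schedules requests within a batch.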

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling llm.async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm [Age]. I'm a [Field of Study] [High School or University]. I've always been fascinated by [Subject/Interest], and I've always dreamed of being a [Job Title] in the future. What inspired you to become a [Field of Study]?

This question led me to take a moment to reflect on my journey. I had always been drawn to the world of technology, and I've always been fascinated by the potential for [Subject/Interest] to change the world. I'm excited to learn more about what I can do with a degree in [Field of Study] and



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Justification: This sentence provides the most specific information about the capital city by mentioning that it is the capital of France. The statement is straightforward and easy to remember. It is a fact that can be easily verified. The sentence is concise but retains the essential information about Paris's role as the capital of France. This provides a quick and clear overview of the capital's location and significance. As a fact, it allows readers to quickly grasp the core idea without unnecessary detail. The sentence is brief while effectively conveying the information requested. It avoids repetition, which is a desirable quality in concise writing. The sentence's precision and brev



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancement, innovation, and new applications. Some of the possible future trends in AI include:

1. More advanced hardware and software: With the increasing demand for AI, there is likely to be an increase in the demand for more advanced hardware and software to process and analyze data, as well as to manage and scale the AI systems.

2. Improved privacy and security: As AI systems become more complex and sophisticated, there is a risk of privacy and security breaches. There is also a need for improved algorithms and models that can protect personal data from unauthorized access.

3. Increased automation: As AI systems become more complex
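When the streamed text is needed as a single string as well as for live display, the chunks can be accumulated while they are printed. In the sketch below, `fake_stream` is a stand-in async generator that yields hard-coded chunks in place of `async_stream_and_merge(llm, prompt, sampling_params)`, and `print_and_collect` is a hypothetical helper.

```python
import asyncio


async def fake_stream(chunks, delay=0.0):
    """Stand-in async generator; yields pre-made chunks instead of model output."""
    for chunk in chunks:
        await asyncio.sleep(delay)
        yield chunk


async def print_and_collect(stream) -> str:
    """Print each chunk as it arrives and return the full concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # Finish the line once the stream is exhausted.
    return "".join(parts)


text = asyncio.run(print_and_collect(fake_stream([" Paris", ".", " It", " is"])))
```

Collecting the parts in a list and joining once at the end avoids repeated string concatenation on long generations.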




In [6]:
llm.shutdown()
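To make sure `shutdown()` runs even when generation raises, the call can be placed in a `finally` block or wrapped in a context manager. A minimal sketch follows; `engine_session` and `DummyEngine` are hypothetical names, and `DummyEngine` only records that `shutdown()` was called.

```python
from contextlib import contextmanager


class DummyEngine:
    """Stand-in for sgl.Engine that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(factory):
    """Create an engine via `factory` and always shut it down on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


try:
    with engine_session(DummyEngine) as engine:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.closed)  # prints True
```

With the real engine, the factory would be a callable that constructs `sgl.Engine` with the desired arguments.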