# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional server layer would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed, working example is available as a Python script in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
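
To illustrate why this is needed, here is a small standard-library sketch (it does not use SGLang) that reproduces the error raised when `asyncio.run()` is called while an event loop is already running, which is the situation inside IPython. The function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, much like a notebook cell.
    coro = inner()
    try:
        # Without nest_asyncio, starting a second loop here is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` patches asyncio so that this kind of nested call is allowed.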

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Frederik Lekar and I’m the chief architect and project manager for the City of Providence, RI and Providence City Council. As a former professional construction manager and planner, I have worked on numerous construction projects, as well as been an investor and developer.
I’ve worked on major local construction projects, including public works and infrastructure, as well as a number of real estate development and single-family home projects. I’ve also worked on a variety of various small-scale residential and commercial projects. I’ve helped manage large and small construction projects on a day-to-day basis.
I have a degree in Civil Engineering from the University of Rhode Island,
Prompt: The president of the United States is
Generated text:  trying to decide what to do with the budget deficit. He has a budget of $500 million dollars to spend. Each of the 100 wealthiest people in the country decides to donate $50,000 each year. If the deficit
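
In an offline batch job you will often want to persist results rather than print them. The following is a minimal sketch, assuming each output is a dict with a `"text"` key as in the cell above; `write_jsonl` and the `demo_*` names are hypothetical helpers introduced here, and the stand-in data replaces a real `llm.generate(...)` call so the snippet runs without a GPU.

```python
import io
import json


def write_jsonl(prompts, outputs, fp):
    """Write one JSON object per line, pairing each prompt with its generated text."""
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


# Stand-in data shaped like the list returned by llm.generate(...) above.
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]

buffer = io.StringIO()  # swap for open("results.jsonl", "w") in a real job
write_jsonl(demo_prompts, demo_outputs, buffer)
print(buffer.getvalue(), end="")
```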

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm a [Number] year old [Gender] who has been [Number] years in the industry. I'm passionate about [Number] and have always been [Number] in my heart. I'm always looking for new challenges and opportunities to grow and learn. I'm [Number] in the industry, and I'm excited to be here. How can I help you today? [Name] [Number] [Gender] [Number] [Company Name] [Job Title] [Company Name] [Job Title] [Company Name] [Job

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a bustling city with a rich cultural heritage and is a popular tourist destination. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a vibrant and dynamic city with a rich cultural heritage and is a popular tourist destination. It is also known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes.

2. Enhanced natural language processing: AI will continue to improve its ability to understand and interpret human language, allowing machines to better communicate and interact with humans.

3. Improved decision-making: AI will become more capable of making complex decisions based on vast amounts of data, allowing machines to make more informed and accurate decisions.

4. Increased use of AI in healthcare: AI will be used to improve the accuracy
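
The cell above relies on `stream_and_merge` from `sglang.utils` and mentions overlap removal. As a rough illustration of that idea, the sketch below merges streamed chunks while dropping any prefix that repeats the tail of the text collected so far. This is an assumption about how such a helper could work, not the library's implementation; `trim_overlap`, `merge_chunks`, and the demo chunks are illustrative names and data.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical stream in which each chunk repeats the tail of the previous one.
demo_chunks = [" Paris is", " is the capital", " capital of France."]
merged_text = merge_chunks(demo_chunks)
print(merged_text)
```

One trade-off of this approach is that text which legitimately repeats across a chunk boundary would also be trimmed, so it fits best when chunks are known to overlap.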



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am [Age]. I am an [occupation] with [number of years] years of experience in [field], and I am passionate about [anything you can think of, such as [interests, hobbies, etc.]]. I enjoy [how I spend my free time, such as [sports, reading, traveling, etc.]]. I am always looking for new challenges to tackle and have [how many years] years of experience in [field], which gives me the ability to learn and grow. I am a [Type of person: responsible, creative, open-minded, detail-oriented, etc.], and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The capital of France is Paris. It is the most populous city in Europe, with an estimated population of 14 million people as of 2023. The city is home to many of the country's largest and most iconic institutions, including the Louvre Museum, the Notre
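
One reason to use the asynchronous API is to issue several independent requests concurrently, for example with different sampling parameters per prompt. The sketch below shows the pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in that only mimics the `{"text": ...}` result shape seen above, so the snippet runs without a model. With the real engine you would pass `llm` instead, and in a notebook you would need `nest_asyncio` as described in this document.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in that mimics the shape of async_generate's result."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        temperature = sampling_params["temperature"]
        return {"text": f" <completion for {prompt!r} at T={temperature}>"}


async def run_concurrently(engine, requests):
    # Create every coroutine up front; gather returns results in input order.
    tasks = [engine.async_generate(prompt, params) for prompt, params in requests]
    return await asyncio.gather(*tasks)


demo_requests = [
    ("The capital of France is", {"temperature": 0.0}),
    ("The future of AI is", {"temperature": 0.8}),
]
demo_results = asyncio.run(run_concurrently(FakeEngine(), demo_requests))
for (prompt, _), result in zip(demo_requests, demo_results):
    print(prompt, "->", result["text"])
```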

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [职业/特长] with [number of years in the industry]. I started my career in [industry/position] in [year] and have been working hard to [goals/achievements]. Currently, I am [current position] and I enjoy [reason why you are enthusiastic about your work].
[Your Name] is a dedicated professional who has dedicated over [number of years in the industry] to [industry/position]. I am always eager to learn and improve, and I believe that by being a part of this community, I can make a positive impact on the world. I am

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a major urban center and the seat of government for France. It is known for its rich history, stunning architecture, and vibrant culture. Paris is home to the Eiffel Tower, the Louvre Museum, and many other famous landmarks. The city also boasts a diverse population of people from different cultures, languages, and backgrounds. Paris is a major tourist attraction, with millions of tourists annually visiting the city to experience its beauty and culture. Its status as the capital of France is an important part of the country’s identity, and it plays a crucial role in the country’s economy and politics. The city is also known

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve many different trends and developments in the coming years. Here are some possible trends:

1. Increased Use of AI in Healthcare: AI is already being used in healthcare to assist doctors in diagnosing diseases and developing personalized treatment plans. In the future, AI could be used to improve patient outcomes and reduce healthcare costs by making healthcare more efficient and effective.

2. AI in Transportation: AI-powered transportation systems like autonomous vehicles and traffic management systems could revolutionize the way we travel. AI can help to reduce traffic congestion, save fuel, and reduce emissions, while also improving safety.

3. AI in Financial Services: AI-powered financial services


In [6]:
llm.shutdown()
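
In a standalone script, it can help to make sure the engine is shut down even if generation raises. The sketch below shows one way to do that with `try`/`finally`. `FakeEngine` is a hypothetical stand-in that exposes `generate()` and `shutdown()` like the engine above and deliberately fails, so the snippet runs without a model.

```python
class FakeEngine:
    """Hypothetical stand-in exposing generate() and shutdown() like the engine above."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


engine = FakeEngine()
try:
    engine.generate(["Hello"], {"temperature": 0.8})
except RuntimeError as exc:
    print(f"generation failed: {exc}")
finally:
    # Runs whether or not generate() raised, so resources are always released.
    engine.shutdown()

print(engine.is_shut_down)
```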