# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
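
To see why such a patch is usually needed: plain `asyncio` refuses to start a second event loop from inside one that is already running, which is the situation in IPython or Jupyter. The sketch below uses only the standard library and makes no SGLang calls; `inner` and `outer` are illustrative names. It reproduces the kind of error that `nest_asyncio.apply()` is meant to work around.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a nested event loop
    except RuntimeError as exc:
        coro.close()  # the coroutine was never run; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Without `nest_asyncio`, the inner `asyncio.run` call raises a `RuntimeError` instead of running the coroutine.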

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Iolanda, and I'm the new translator for the Zephyr project. My role is to ensure that all messages and text are accurately conveyed in Spanish. Please share your current progress and any upcoming updates for this project. Also, provide a brief explanation for each point.
Iolanda is a new person joining the Zephyr project, and I will be her translator. My current progress is focused on converting messages and texts into Spanish. This includes translating text, identifying misspellings, and ensuring that the text is grammatically correct and proper. I am currently working on translating some initial messages to Spanish and have made some progress
Prompt: The president of the United States is
Generated text:  now the 45th President of the United States. If the current President was the 44th President, how many years have passed since his term began? 
(Note: the current year is 2021)
To determine how many years have passed since the presidency beg

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [Type of Person] who is [What you do for a living]. I'm always looking for ways to [What you do for a living]. I'm always eager to learn and grow, and I'm always willing to help others. What's your favorite hobby or activity? I love [What you like to do]. I'm always looking for new experiences and adventures, and I'm always eager to try new things. What's your favorite

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French Academy of Sciences. Paris is a bustling city with a rich history and culture, and it is a popular tourist destination. The city is also known for its fashion industry, with many famous fashion designe

Prompt: Explain possible future trends in artificial intelligence. The future of AI is

Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence, allowing for more complex and nuanced decision-making. This could lead to a more human-like experience with AI, as it becomes more capable of understanding and responding to human emotions and motivations.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations, including issues such as bias, transparency, and accountability. This will require a more rigorous and transparent approach to AI development and deployment.

3. Increased use of AI in healthcare
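
The "overlap removal" named in the banner above can be pictured as follows: if a streamed chunk begins with text that already ends the accumulated output, that repeated prefix is dropped before appending. The sketch below is a minimal illustration of that idea, not the actual `sglang.utils` implementation; `drop_repeated_prefix` and `merge_chunks` are hypothetical helper names, and the chunks are made-up sample data.

```python
def drop_repeated_prefix(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk minus the longest prefix that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that overlaps with what was already merged."""
    merged = ""
    for chunk in chunks:
        merged += drop_repeated_prefix(merged, chunk)
    return merged


# Hypothetical chunks that partially repeat earlier text.
chunks = [" Paris", " Paris, which", " which is known"]
merged = merge_chunks(chunks)
print(merged)
```

One caveat of this naive approach: it cannot tell a re-sent prefix from a genuine repetition in the text (for example `"ha"` followed by `"ha"`), so it only suits streams where overlapping chunks are expected.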



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I am [Your Age]. I am a [Your Job Title] and I enjoy [Your hobbies or interests]. I come from [Your nationality or birthplace] and I have always been passionate about [Your love/hobby]. I am always eager to learn new things and challenge myself, and I strive to be the best version of myself. I am constantly improving myself and trying to reach my full potential. I am a hardworking and dedicated individual who takes pride in my accomplishments and goals. I am a true believer in the power of self-improvement and always strive to do my best. I am a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light. It is the largest city in Europe by population and serves as the political, cultural, and economic center of the country. It is renowned for its historical architec
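
Because `async_generate` is awaited, the call can in principle share an event loop with other coroutines. The sketch below illustrates that pattern without a GPU: `FakeEngine` is a hypothetical stand-in that only mimics the return shape used above (a list of dicts with a `"text"` key) and is not part of SGLang; `heartbeat` is likewise an illustrative name.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; returns canned text."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(log):
    # Another coroutine that makes progress while generation is pending.
    for i in range(3):
        log.append(f"tick {i}")
        await asyncio.sleep(0.001)


async def main():
    engine = FakeEngine()
    log = []
    prompts = ["Hello, my name is", "The capital of France is"]
    # Run generation and the heartbeat concurrently on the same loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, {"temperature": 0.8, "top_p": 0.95}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(main())
for output in outputs:
    print(output["text"])
```

With a real engine, the same structure would apply with `llm.async_generate` in place of the stand-in.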

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a young adult with a passion for [main interest/interest/subject]. I am an outgoing, confident, and assertive person who values honesty, positivity, and authenticity. My personality is strong, confident, and courageous, and I am always seeking to learn and grow.

I have a natural curiosity and love to explore new ideas and experiences. I am always eager to learn and adapt to the changing world around me. I believe in putting others before myself and always strive to make a positive impact in the world.

I have a strong work ethic and take pride in my achievements. I am always striving to learn


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its historical significance and vibrant culture.
Paris is a popular tourist destination with a rich history and vibrant culture. It is the capital of France and the largest city in the country. The city is known for its beautiful architecture, extensive art and music scene, and historical landmarks such as Notre-Dame Cathedral and the Louvre Museum. It is also home to numerous museums, including the Musée du Louvre, and many dining venues, including the famous Étapis de Feu restaurant. Paris's reputation as a city of art and culture is further celebrated by its annual Eiffel Tower Parcels competition and the world


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  one of unprecedented growth and potential, with a vast array of possibilities for both positive and negative impacts. Here are some potential trends that are likely to shape the AI landscape in the next decade:

1. Increased integration with human decision-making: As AI becomes more sophisticated, it is likely to be integrated more deeply into decision-making processes. This could lead to a greater reliance on human oversight and validation in AI-driven systems.

2. Enhanced creativity and innovation: AI is already capable of generating high-quality creative output, such as music, art, and literature. As AI technology continues to evolve, it is likely to become even more capable of generating new


In [6]:
llm.shutdown()