# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed, working example is available as a Python script in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
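The reason for this patch can be reproduced with the standard library alone. The following is a minimal sketch that assumes nothing about SGLang internals: calling `asyncio.run()` while an event loop is already running, as is the case inside IPython or Jupyter, raises a `RuntimeError`. The coroutine names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, like code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches the event loop so that such nested calls are permitted.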

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Emily and I am an aspiring author. I have always loved writing and am always looking for new ideas. I am currently working on a novel and am looking for input on my manuscript. Can you please help me with some suggestions for word choices?
Certainly! Here are some suggestions for word choices that I would recommend for your novel:

1. Imaginative - This word is used to describe something that is original and innovative.
2. Original - This word is used to describe something that is unique and fresh.
3. Artistic - This word is used to describe something that is creative and imaginative.
4. Unexpected - This word is used
Prompt: The president of the United States is
Generated text:  running for a second term. To become the next president, a candidate must be at least 40 years old and must have served in the United States military. Among the three military service members being considered for this race, only one has the age requirement, but the ca
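
To make the two sampling parameters used above more concrete, here is a small pure-Python sketch of the conventional meaning of `temperature` and `top_p` (nucleus) sampling. It is an illustration under stated assumptions, not SGLang's sampler: the function name `apply_temperature_and_top_p` and the toy logits are hypothetical, and the engine's actual implementation may differ in details.

```python
import math


def apply_temperature_and_top_p(logits, temperature=0.8, top_p=0.95):
    """Return {token_id: probability} after temperature scaling and nucleus filtering."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens; sampling would draw from this distribution.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -3.0]  # hypothetical logits for a 4-token vocabulary
dist = apply_temperature_and_top_p(toy_logits)
print(dist)
```

With these toy values the least likely token falls outside the nucleus and is never sampled, while the remaining probabilities are rescaled to sum to one.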

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [occupation] with [number] years of experience in [industry]. I'm a [job title] with [number] years of experience in [industry]. I'm a [job title] with [number] years of experience in [industry]. I'm a [job title] with [number] years of experience in [industry]. I'm a [job title] with [number] years of experience in [industry]. I'm a [job title] with [number] years of experience in [industry]. I'm a [job title] with [number] years of experience in [industry

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville Flottante" (floating city). It is the largest city in Europe and the third largest city in the world by population. Paris is known for its rich history, art, and culture, and is a major tourist destination. The city is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also known for its cuisine, including French cuisine, and is a popular destination for tourists and locals alike. The city is home to many museums, theaters, and other cultural institutions, and is a major center for business and commerce

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and experiences. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human needs.

2. Enhanced ethical considerations: As AI becomes more integrated with human intelligence, there will be increased scrutiny of its ethical implications. This could lead to more stringent regulations and guidelines for AI development and deployment.

3. Greater reliance on AI for decision-making: AI is likely to become more integrated with human decision-making processes, allowing machines to make
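
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a hedged sketch of one plausible approach, assuming that successive stream chunks may repeat text that was already emitted. The helper names `strip_overlap` and `merge_chunks` and the sample chunks are hypothetical, and this is not necessarily how `stream_and_merge` is implemented.

```python
def strip_overlap(accumulated: str, chunk: str) -> str:
    """Return the part of `chunk` that is not already at the end of `accumulated`."""
    max_len = min(len(accumulated), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if accumulated.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks while dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous text.
chunks = [" Paris", " Paris, also", " also known as"]
merged = merge_chunks(chunks)
print(repr(merged))
```

One trade-off of such suffix/prefix matching is that a legitimately repeated word at a chunk boundary would also be dropped, so it is only appropriate when chunks are known to overlap.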



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name]. I am a [role or profession] who has a passion for [insert a passion or hobby you enjoy]. I am a [insert a specific attribute or skill you possess] who have dedicated my life to [insert a reason why you choose this particular role or profession]. I am [insert a personality trait or trait you bring to the table]. I believe that [insert a reason for your choice of profession or role]. I am [insert a year or decade of your career, if applicable]. I am [insert a personal accomplishment or achievement]. I am [insert a hobby or interest you have, if applicable]. I am

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the most populous city in Europe, with an estimated population of over 2. 3 million people as of 2017. Paris is also the world’s 16th-largest city and the 17th-largest metropolitan are
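
An awaitable generation call is useful mainly because it composes with other coroutines. The sketch below shows the general `asyncio.gather` pattern with a stub in place of the engine, so it runs without a GPU. `fake_async_generate` is a hypothetical stand-in that only echoes its prompt; it is not part of SGLang.

```python
import asyncio


async def fake_async_generate(prompt: str) -> dict:
    # Hypothetical stand-in for an awaitable engine call.
    await asyncio.sleep(0.01)  # simulate inference latency
    return {"text": f"<completion for: {prompt}>"}


async def run_concurrently(prompts):
    # gather() runs the coroutines concurrently and preserves input order.
    tasks = [fake_async_generate(p) for p in prompts]
    return await asyncio.gather(*tasks)


demo_prompts = ["The capital of France is", "The future of AI is"]
results = asyncio.run(run_concurrently(demo_prompts))
for prompt, result in zip(demo_prompts, results):
    print(prompt, "->", result["text"])
```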

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name]. I'm a young and enthusiastic introvert who loves to explore new places and try new things. Whether I'm hiking through mountains, exploring a museum, or just soaking up the sun, I'm always up for a challenge. I enjoy helping others and being a good listener. And oh, I also love learning new things, whether it's a new language, a new hobby, or just a new perspective on the world. I'm passionate about sharing my experiences with others and creating a positive impact in the world. I'm a friendly, curious, and always looking for new adventures. So if you're ready to make


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also renowned for its historical significance, including being the birthplace of modern French history and cuisine. Paris is a bustling metropolis with a rich cultural and artistic heritage, and has been an important center of politics and culture for centuries. It is also known for its fashion, food, and wine, and is a popular tourist destination. Overall, Paris is a vibrant and dynamic city with a rich history and culture.



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain and will depend on many factors, including ongoing technological advancements, regulatory changes, and societal values. However, there are several potential trends that are likely to shape the future of AI:

1. Increased emphasis on ethics and privacy: As more AI systems are developed, there will likely be increased scrutiny of their development and deployment. There will be greater emphasis on ethical considerations, such as ensuring that AI systems do not cause harm or violate human rights. This will require ongoing investment in research and development to create more transparent and accountable AI systems.

2. Increased reliance on machine learning: Machine learning will become more prevalent in AI systems, with more



```python
llm.shutdown()
```
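
In a standalone script, it can be convenient to guarantee that `shutdown()` runs even when generation raises an exception. The sketch below shows one way to do that with a small context manager. `FakeEngine` is a hypothetical stand-in that exposes only a `shutdown()` method, and `managed_engine` is an illustrative helper rather than an SGLang API.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self):
        self.is_running = True

    def shutdown(self):
        self.is_running = False


@contextmanager
def managed_engine(engine):
    # Yield the engine and always shut it down afterwards, even on errors.
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed_engine(engine) as eng:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.is_running)
```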