# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
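
The patch matters because plain `asyncio` refuses to start a second event loop from inside one that is already running, which is the situation in an IPython session. The sketch below reproduces that error with the standard library only; `inner` and `outer` are illustrative names, and that the engine runs into exactly this path is an assumption rather than something this document states.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # A loop is already running here, much like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` beforehand patches `asyncio` so that such nested calls are allowed.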

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Linda. I was born in 1978. I can talk and sing. I'm very happy to be here. What's your name? Linda.
A. Yes, I am. Linda. 
B. No, I am not. Linda.
C. I'm Linda. 
D. I can't hear you.
Answer:
C

Which of the following options is closest to the average of the sample data?
A. 2
B. 3
C. 4
D. 5
Answer:
D

I'm not sure if I understood the question correctly. Can you please explain it to
Prompt: The president of the United States is
Generated text:  a political office with a term of four years. For instance, Bill Clinton served four years as the president, and Barack Obama served four years. There was a time when Barack Obama, who was running for president, served four years as president. The number of years he served as president was less than 5 years and greater than 3 years. What was the maximum number of years that Barack Obama served as president?
To determine the maximum number of years Barack Obama served as president, we need to find the large
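
In an offline batch job the generations are usually written to disk rather than printed. The helper below is a minimal sketch of that step: `to_jsonl` is a hypothetical name, not an SGLang function, and it relies only on the fact shown above that each output is a dict with a `"text"` key. The demo data is a stand-in so the snippet runs without a model.

```python
import json
from typing import Dict, List


def to_jsonl(prompts: List[str], outputs: List[Dict]) -> str:
    """Pair each prompt with its generated text, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in for the list returned by llm.generate(prompts, sampling_params).
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]
jsonl = to_jsonl(demo_prompts, demo_outputs)
print(jsonl)
```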

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [reason why you're passionate about your job], and I'm always looking for ways to [what you're looking for in a job]. I'm a [type of person] and I'm always looking for opportunities to [what you're looking for in a job]. I'm [what you're looking for in a job] and I'm always looking for ways to [what you're looking for in a job]. I'm [what

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in the country and the seat of the French government. Paris is known for its rich history, beautiful architecture, and vibrant culture. It is also a major tourist destination, with many famous landmarks and museums. The city is home to many important French institutions, including the Louvre Museum and the Eiffel Tower. Paris is a popular destination for international visitors, with many French restaurants, cafes, and shops. The city is also known for its cuisine, with many famous dishes and specialties. Overall, Paris is a city of contrasts and diversity, with a rich cultural heritage and a lively atmosphere. Paris is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Here are some possible future trends in AI:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a greater emphasis on ethical considerations. This includes issues such as bias, transparency, and accountability.

2. Greater integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more complex and nuanced interactions between humans and machines.

3. Increased use of AI in healthcare: AI is already being used in healthcare to improve diagnosis, treatment, and patient care.
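
The "overlap removal" named in the cell above can be pictured like this: when a streamed chunk starts with text that has already been emitted, only the unseen tail is appended. The code below is an illustrative sketch of that idea and not the implementation behind `stream_and_merge`; `strip_seen_prefix` and `merge_chunks` are names invented for this example, and the chunks are made up.

```python
from typing import Iterable


def strip_seen_prefix(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is already a suffix of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping the part of each one that repeats earlier text."""
    merged = ""
    for chunk in chunks:
        merged += strip_seen_prefix(merged, chunk)
    return merged


# Hypothetical chunks in which the second and third repeat earlier text.
merged = merge_chunks(["Paris is", "is the capital", " capital of France."])
print(merged)
```

One limitation of this kind of suffix matching is that a chunk which legitimately begins by repeating the previous characters would be trimmed as well.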



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Profession] [Title] [Subtitle]. I have a passion for [What does the character enjoy or do best]? I am always looking for new challenges and am always ready to learn and grow as a person. In my spare time, I enjoy [What can the character do that interests them?]. If you could become me, what would I be like? I would be someone who is [What are the main qualities I admire most in myself?]. I am [Age/Level of Education] years old. If you could do one thing in the world, what would it be? I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La République"
Paris is a cultural and artistic hub, hosting renowned museums, galleries, theaters, and opera houses. It is also home to the Eiffel Tower, the Louvre Museum, Notre Dame Cathedral, and the Champs-Élysées. The city has a rich
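
Because the asynchronous call is awaited, it can share the event loop with other coroutines. The sketch below shows the general pattern with `asyncio.gather`. `fake_async_generate` is a stand-in written for this example so that it runs without a model, and `run_concurrently` is likewise a hypothetical helper; whether several concurrent calls beat a single batched call for a given workload is not something this document establishes.

```python
import asyncio
from typing import Dict, List


async def fake_async_generate(prompt: str, sampling_params: Dict) -> Dict:
    """Stand-in for an awaitable generate call; returns a dict with a "text" key."""
    await asyncio.sleep(0.01)  # simulate waiting on the engine
    return {"text": f" <completion for: {prompt}>"}


async def run_concurrently(prompts: List[str], sampling_params: Dict) -> List[Dict]:
    # gather preserves input order, so results line up with prompts.
    tasks = [fake_async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


results = asyncio.run(run_concurrently(["a", "b", "c"], {"temperature": 0.8}))
for item in results:
    print(item["text"])
```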

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First Name] and I'm a [Occupation or Type of Work]. I’m an [Age] years old, [Gender] and [Current Occupation or Role]. I'm passionate about [Field of Interest or Career], and I’m always looking for opportunities to [Achievement or Goal]. I’m a [Personality Type or Traits] person, and I’m always [Positive Traits or Attributes]. I'm confident, outgoing, and [Indoor or Outdoor]. I enjoy [Activity or Hobby]. And I'm [Smart or Versatile]. What excites me most is [What Excites Me Most]. How would you describe



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the largest and most populous city in France and the world's largest city in terms of population. It serves as the capital of the country, as well as the heart of the European Union and the home to the City of Light, the French Riviera, the Louvre, the Eiffel Tower, and the landmarks of the World of Beaux-Arts. It is also known as the "City of Love" and the "City of Light". Paris is the cultural, artistic, and financial hub of Europe, and a major global hub of finance and business. The city is also home to the French Parliament,



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by several trends that are expected to drive significant progress and innovation. Here are some potential trends that could shape the future of artificial intelligence:

1. Increased emphasis on privacy and security: As more data is collected and shared online, there is a growing awareness of the importance of protecting personal data. This could drive increased emphasis on privacy and security in AI systems.

2. Advancements in neural networks: Neural networks are a key component of AI, and there is ongoing research into improving their performance and understanding of complex patterns in data.

3. Integration of AI into everyday products: As AI becomes more integrated into everyday products, we
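
When the streamed text is needed afterwards as well as on screen, the chunks can be collected while they are printed. The sketch below shows that pattern with an async generator. `fake_stream` is a stand-in invented for this example so that it runs without a model, and `collect` is a hypothetical helper; in the cell above the stream would come from `async_stream_and_merge(llm, prompt, sampling_params)` instead.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(text: str) -> AsyncIterator[str]:
    """Stand-in for a chunk stream; yields the text one word-sized piece at a time."""
    words = text.split(" ")
    for i, piece in enumerate(words):
        await asyncio.sleep(0)  # hand control back to the loop between chunks
        yield piece if i == 0 else " " + piece


async def collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the full text at the end."""
    parts: List[str] = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream("Paris is the capital of France.")))
```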




```python
llm.shutdown()
```
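
In a script, as opposed to a notebook, it can help to make sure `shutdown()` still runs when generation raises an exception. One way to do that is a small context manager, sketched below. `managed` and `FakeEngine` are hypothetical names written for this example; the only thing assumed about the real engine is the `shutdown()` method used above.

```python
from contextlib import contextmanager


class FakeEngine:
    """Stand-in exposing only the shutdown() method used in this document."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


@contextmanager
def managed(engine):
    """Yield the engine and call shutdown() on exit, even if an error occurs."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed(engine) as active:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.closed)
```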