# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
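
As a rough illustration of what sits between a web framework and the engine, the sketch below shows a framework-agnostic request handler. It is not the linked `custom_server` example: `StubEngine` and `handle_generate` are hypothetical names, the stub only imitates the list-of-dicts output shape shown later in this document, and the JSON fields `prompt` and `sampling_params` are assumptions chosen for the example.

```python
import json
from typing import Tuple


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        # Imitates only the output shape: one dict with a "text" key per prompt.
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def handle_generate(engine, raw_body: bytes) -> Tuple[int, bytes]:
    """Turn a JSON request body into (HTTP status code, JSON response body)."""
    try:
        payload = json.loads(raw_body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        # Covers invalid JSON, a missing field, and a non-object payload.
        error = {"error": "body must be a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()
    if not isinstance(prompt, str):
        return 400, json.dumps({"error": "'prompt' must be a string"}).encode()

    sampling_params = payload.get("sampling_params", {})
    output = engine.generate([prompt], sampling_params)[0]
    return 200, json.dumps({"text": output["text"]}).encode()


status, body = handle_generate(StubEngine(), b'{"prompt": "The capital of France is"}')
print(status, body.decode())
```

Keeping validation and the engine call in a plain function like this makes the logic testable without starting a server; a real deployment would call it from whichever web framework is in use and pass in an actual engine instance.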



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
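
The patch is needed because `asyncio.run` raises a `RuntimeError` when an event loop is already running in the current thread, which is typically the case in IPython and Jupyter. The standard-library sketch below shows one way to check for that situation; `loop_is_running` and `check_inside_loop` are hypothetical helper names, not part of SGLang or `nest_asyncio`.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside an already running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, e.g. in a plain Python script.
        return False
    return True


async def check_inside_loop() -> bool:
    # A coroutine always executes inside a running loop.
    return loop_is_running()


if __name__ == "__main__":
    print("loop running at top level:", loop_is_running())
    print("loop running inside a coroutine:", asyncio.run(check_inside_loop()))
```

If the top-level check reports a running loop in your environment, the `nest_asyncio.apply()` call above is likely required before using the engine.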

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Enrico and I'm a 28 year old Filipino medical doctor. I am a psychologist and a therapist and a psychoanalyst. I specialize in the field of mental health, but I am also skilled in many other areas, such as: anxiety, trauma, eating disorders, and the psychology of children and adults with mental health issues. I work with adults and children. I specialize in neurodevelopmental, particularly in the areas of post-traumatic stress disorder and the relationship between family relationships and child development. I am a certified Clinical Psychologist and I have experience in the fields of child development, family relationships, and the early years
Prompt: The president of the United States is
Generated text:  a title of office that lasts for a term of two years. It is not a position that can be held for more than two terms. But, can the president be replaced for a term?

Yes, the president can be replaced for a term. The United States Constitution
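
For prompt lists that are too large to keep every output in memory at once, one possible pattern is to submit the prompts in chunks and write each result out as it arrives. The sketch below assumes only the output shape used above (a list of dicts with a `text` key). `StubEngine`, `chunked`, and `run_in_chunks` are hypothetical names, and the stub replaces the real engine so the example runs without a GPU.

```python
import json
from typing import Iterator, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts: List[str], sampling_params: dict) -> List[dict]:
        # Imitates only the output shape: one dict with a "text" key per prompt.
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_chunks(engine, prompts: List[str], sampling_params: dict, chunk_size: int) -> List[dict]:
    """Call engine.generate once per chunk and return records in prompt order."""
    records = []
    for chunk in chunked(prompts, chunk_size):
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            records.append({"prompt": prompt, "text": output["text"]})
    return records


records = run_in_chunks(
    StubEngine(),
    ["Hello, my name is", "The capital of France is", "The future of AI is"],
    {"temperature": 0.8, "top_p": 0.95},
    chunk_size=2,
)
for record in records:
    print(json.dumps(record))  # one JSON object per line (JSONL)
```

Whether chunking helps depends on the workload: the engine schedules requests itself, so a single `generate` call over the full list may be simpler when the outputs fit in memory.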

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in France and the second-largest city in the European Union. It is also the oldest continuously inhabited city in Europe, having been inhabited since the 5th century BC. Paris is known for its rich history, art, and culture, including the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also a major transportation hub, with many major highways and railroads running through the city. Paris is a popular tourist destination and a major economic center in France. It is home to many world-renowned museums, including the Louvre and the Musée d'Orsay.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased automation and artificial intelligence: As AI technology continues to advance, we can expect to see more automation and artificial intelligence in our daily lives. This could include things like self-driving cars, robots in manufacturing, and even virtual assistants that can assist with tasks like scheduling appointments and managing finances.

2. Enhanced privacy and security: As AI technology becomes more advanced, we can expect to see increased privacy and security concerns.
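
The banner printed above mentions overlap removal. The idea can be sketched as follows, assuming that consecutive streamed chunks may repeat text that was already received. `strip_overlap` and `merge_chunks` are hypothetical names used for illustration; this is not presented as the implementation of `stream_and_merge`.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    longest = min(len(existing), len(chunk))
    for size in range(longest, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


print(merge_chunks(["The capital", "capital of", " of France", " is Paris."]))
```

One limitation of this heuristic is that it cannot tell an accidental overlap from a genuine repetition: if the model really emits "ha" twice in a row, the second occurrence is dropped.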



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a skilled marketer and influencer. I specialize in creating and promoting brands that resonate with people on a personal level. My work has helped countless businesses grow and achieve their marketing goals. Whether it's through social media campaigns or live video tours, I'm always ready to help my clients create successful campaigns that make a lasting impact. Remember, the key to success is authenticity, so I believe in creating content that feels genuine and connects with people on a personal level. I'm always looking for new ways to help you achieve your marketing goals, and I'm here to help you succeed! Let's chat! Let

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is an historic city with many attractions, including the Eiffel Tower and the Louvre Museum. Paris is also a
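
The asynchronous interface makes it possible to keep several requests in flight at once. The sketch below bounds the concurrency with a semaphore and collects results with `asyncio.gather`. It assumes that the engine tolerates concurrent `async_generate` calls, which should be verified against the SGLang version in use. `StubAsyncEngine` and `generate_concurrently` are hypothetical names, and the stub only simulates latency.

```python
import asyncio
from typing import List


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    async def async_generate(self, prompts: List[str], sampling_params: dict) -> List[dict]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def generate_concurrently(engine, batches, sampling_params, max_in_flight=2):
    """Run several async_generate calls at once, at most `max_in_flight` at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(batch):
        async with semaphore:
            return await engine.async_generate(batch, sampling_params)

    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(run_one(batch) for batch in batches))


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(
    generate_concurrently(StubAsyncEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

Because `asyncio.gather` preserves input order, each result list lines up with the batch that produced it even when the calls finish out of order.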

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [job title] at [company name]. I am excited to share my journey so far and to look forward to whatever new adventures I may have in store. Thank you for your time today. 

Remember, this is a neutral and friendly introduction. The character should be relatable and non-judgmental towards others. Ensure that the introduction aligns with the character's personality and traits. 

In addition, consider the tone and style of the introduction. It should be engaging and not too promotional. The introduction should be brief but informative, and should not contain any personal or sensitive information. 

Remember

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the world-renowned cultural and historical center of the nation.
Propose a complex word that can be used to describe the city's reputation and culture. The reputation of Paris is renowned for its art, gastronomy, and architectural beauty. It is also a hub of commerce and fashion, known for its fashion houses, street art, and luxury boutiques.
Write a comprehensive paragraph about the city's importance. Paris is the largest city in Europe, home to the city-state of France and a major center of culture and commerce. It is the capital of the French Republic, where the French language is the official language. Paris is a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be marked by a number of different trends and developments. Here are some of the potential future trends:

1. Increased reliance on AI for automation and productivity improvements. As AI becomes more advanced, it is expected to become even more capable of performing repetitive tasks, freeing up human workers to focus on more complex and creative work.

2. Integration of AI with other technologies. AI is already integrated into many modern devices and platforms, and it is likely to become even more integrated with emerging technologies such as the Internet of Things (IoT), virtual and augmented reality (VR/AR), and blockchain.

3. Development of more intelligent robots and

In [6]:
llm.shutdown()
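
In a script, an exception raised before `llm.shutdown()` would skip the call. One way to guard against that is a small context manager that always calls `shutdown()` on exit. The sketch below is an illustration only: `engine_session` and `StubEngine` are hypothetical names, and the stub replaces the real engine so the example runs without a GPU.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and call shutdown() on exit, even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with engine_session(engine) as session:
        session.generate(["Hello, my name is"], {"temperature": 0.8})
        raise RuntimeError("simulated failure after generation")
except RuntimeError as error:
    print(f"caught: {error}")
print("engine shut down:", engine.is_shut_down)
```

A plain `try`/`finally` around the generation code achieves the same effect; the context manager just makes the pattern reusable.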