# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
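
The reason for this patch can be reproduced with the standard library alone. The sketch below (the names `inner` and `outer` are illustrative and not part of SGLang) calls `asyncio.run()` from inside an already running event loop, which is the situation in IPython or Jupyter, and captures the `RuntimeError` that `nest_asyncio.apply()` is meant to avoid.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a nested run is rejected by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

After `nest_asyncio.apply()`, the same nested call is allowed, which is why the patch is needed before using the engine in such environments.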

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Hana and I am a professional and experienced digital marketer, you can call me Hana. If you need to add a new service, feature, or product, please let me know. I am fully committed to providing the best possible service to you and will not hesitate to tailor our services to your specific needs. Here's a snippet of what my services include:

1. Digital marketing: Develop a comprehensive and customized digital marketing strategy for your business, including online advertising, social media marketing, SEO, content marketing, email marketing, and analytics tracking.
2. Content creation: Create and manage your brand's content, such as blog posts,
Prompt: The president of the United States is
Generated text:  a three-nerd, or president of a three-noodle house. The president of a three-noodle house has a dog, a cat, and a fish. If it's a weekend, the president has a team of four people who will each share a dinner that costs $50. If it's not a weeken
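
For offline batch jobs it is often convenient to store the results rather than print them. The following is a minimal sketch, assuming each output is a dict with a `text` key as in the loop above; the helper `to_jsonl` and the sample outputs are made up for illustration and stand in for the result of `llm.generate`.

```python
import json

# Stand-in data: in the notebook, `outputs` comes from llm.generate(prompts, sampling_params).
prompts = ["Hello, my name is", "The capital of France is"]
outputs = [{"text": " Hana."}, {"text": " Paris."}]  # made-up outputs for illustration


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    return "\n".join(
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    )


jsonl = to_jsonl(prompts, outputs)
print(jsonl)
```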

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm a [Skill or Hobby] enthusiast who loves to [Describe a hobby or interest]. I'm always looking for new experiences and learning new things, and I'm always eager to share my knowledge and skills with others. I'm a [Favorite Subject] lover who loves to [Describe a favorite subject or hobby]. I'm always looking for new challenges and opportunities to grow and learn, and I'm always eager to share my experiences and insights with others. I'm a [Favorite Book or Movie] fan who loves to [Describe a favorite book or movie

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French National Museum of Modern Art. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. It is also known for its fashion industry, with iconic fashion houses like Chanel and Louis Vuitton. The city is also home to the French Parliament, the French National Museum of Modern Art, and the Eiffel Tower. Paris is a vibrant and dynamic city with a rich cultural heritage and is a major tourist destination.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing it to learn and adapt to new situations more effectively. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human emotions and behaviors.

2. Greater use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI becomes more advanced, it is likely to be used in even more areas, including personalized medicine, disease diagnosis, and drug discovery.

3. Increased use of AI in finance
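
The "overlap removal" mentioned in the cell above can be illustrated without the engine. The sketch below is one plausible way to merge streamed chunks whose beginnings repeat text that was already received; the helpers `trim_overlap_sketch` and `merge_chunks` and the sample chunks are hypothetical and are not claimed to match what `stream_and_merge` does internally.

```python
def trim_overlap_sketch(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that is not already at the end of existing."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, dropping any prefix that repeats the merged tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Made-up chunks in which each one partly repeats the previous one.
chunks = ["The capital", " capital of", " of France", " is Paris."]
merged = merge_chunks(chunks)
print(merged)
```

One limitation of this simple approach is that genuinely repeated text, such as a chunk `"ha"` arriving after the merged text already ends in `"ha"`, would be treated as overlap and dropped.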



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [ profession]. I'm excited to meet you and learn more about what you can offer. I'm always here to help, regardless of the situation. Have a great day! [Feel free to include any background information or personal anecdotes that you feel comfortable sharing. Your self-introduction should be informative yet welcoming to the listener.] Your welcome! [End with a friendly smile, a nod, or a handshake.] Good day, [Name]. [Feel free to personalize the greeting and make the introduction as engaging and informative as possible to make it a memorable meeting.] [Name] is always here for you!

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city with the most populous population. The city is known for its rich history, beautiful architecture, and vibrant culture. It is also famous for its fash
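
Because the asynchronous call is awaitable, it can be combined with other coroutines using standard `asyncio` tools. The sketch below shows the general pattern of awaiting several batches together with `asyncio.gather`. `StubEngine` is a hypothetical stand-in that only mimics the call shape of `llm.async_generate` used above, so the example runs without a GPU; whether concurrent calls bring any benefit with the real engine is not covered here.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; it only mimics the call shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Start one request per batch and wait for all of them.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    run_batches(StubEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```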

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily, and I'm an artist. I love to create beautiful things, from paintings to sculptures. I'm excited to be here and see all the amazing things you've got to offer.

Thank you for asking, I look forward to meeting you! Let's chat! What brings you to this world? Can you share a little bit about yourself? I'm just a regular person with a passion for art. I enjoy painting and sculpting, and I love to create unique and beautiful works of art.

That's great to hear! Can you tell me about a particular piece of art that you've created? I'm really into abstract and



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the largest city in France and the capital of France. It is the 12th-largest city by population and the 15th-most populous city by population in the world. It is the largest city in France by area and the most densely populated urban area in Europe. Paris is known as "la Paroisse de l'Europe" and is one of the most important cultural and economic centers of the world. It has a rich history, including ancient Roman ruins and Gothic architecture. The city is home to many important museums, including the Louvre and the Musée d'Orsay. It is also



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain and involves many possible paths and trends. However, here are some possible trends that could potentially emerge:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve diagnoses and treatment plans. Future trends could include more personalized and advanced AI algorithms, which could lead to even more accurate diagnoses and treatments for patients.

2. Automation of routine tasks: As AI becomes more advanced, it is likely that more routine tasks will be automated. This could lead to job losses for certain workers, but it could also create new job opportunities in areas such as data analysis and software development.

3. AI in the finance industry




```python
llm.shutdown()
```
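
In a standalone script, an exception raised during inference would skip a trailing `shutdown()` call. A minimal sketch of one way to guard against that is shown below. The `engine_session` helper and `StubEngine` class are hypothetical and exist only so the example runs without a GPU; the only engine method assumed from this document is `shutdown()`.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only generate() and shutdown()."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with engine_session(engine) as llm:
        llm.generate(["Hello, my name is"], {"temperature": 0.8})
        raise RuntimeError("simulated failure during inference")
except RuntimeError:
    pass

print(engine.is_shut_down)
```

The same effect can be achieved with a plain `try`/`finally` block around the inference code; the context manager only packages that pattern for reuse.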