# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that to use the **Offline Engine** in IPython or any other environment that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
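
The reason for this patch can be seen with the standard library alone: `asyncio.run()` refuses to start an event loop from inside one that is already running, which is the situation in IPython or Jupyter. The following is a minimal sketch that does not involve SGLang; `inner` and `outer` are illustrative names.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, so a nested
    # asyncio.run() call is rejected with a RuntimeError.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Applying `nest_asyncio.apply()` before such a call is what allows the nested loop to run.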

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Alex Green. I'm from Melbourne, Australia, and I'm the founder and CEO of the website, BetterSchools. I'm also the founder of a popular YouTube channel, BetterSchools. I've worked with schools to create innovative learning environments for students. In this session, I'll be discussing my experience in setting up and managing a YouTube channel, as well as my role as the founder and CEO of BetterSchools.
Can you give me an overview of your experience in setting up and managing a YouTube channel?
Certainly! As a founder and CEO of BetterSchools, I am well-versed in the principles of video production
Prompt: The president of the United States is
Generated text:  a major political office with important responsibilities. What type of official is this? ____ 
A. Judicial
B. Executive
C. Legislative
D. None of the above
Answer: B

Regarding the carbon dioxide (CO2) content in the air, which of the following statements is correct?
A. It is not included 
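
In an offline batch job the generated text is usually persisted rather than printed. The sketch below writes one JSON line per prompt and completion pair. It assumes only what the cell above shows, namely that `generate` returns one dict with a `"text"` key per prompt. `StubEngine` and `write_jsonl` are hypothetical names; the stub stands in for `sgl.Engine` so the example runs without a model or GPU.

```python
import io
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; a real engine needs a model and a GPU."""

    def generate(self, prompts, sampling_params):
        # Same result shape as in the cell above: a list of dicts with a "text" key.
        return [{"text": f" (completion {i})"} for i, _ in enumerate(prompts)]


def write_jsonl(engine, prompts, sampling_params, sink):
    """Run one batch and write a JSON line per prompt/completion pair."""
    outputs = engine.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        sink.write(json.dumps(record) + "\n")
    return len(outputs)


buffer = io.StringIO()  # replace with an open file handle in a real job
count = write_jsonl(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
    buffer,
)
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(count, records[1]["completion"])
```

With a real engine, passing `llm` in place of `StubEngine()` should be the only change needed, provided the output shape matches.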

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title]. I enjoy [job title] because [reason why you enjoy it]. I'm always looking for new challenges and opportunities to grow and learn. What do you do for a living? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title]. I enjoy [job

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville de Paris". It is the largest city in France and the second-largest city in the European Union. The city is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is known for its rich history, art, and culture, and is a popular tourist destination. It is also home to many important institutions such as the French Academy of Sciences and the French Parliament. The city is a major economic and cultural hub in Europe and plays a significant role in French politics and society. Paris is a city of contrasts, with its modern architecture

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence, allowing it to learn from and adapt to human behavior and decision-making processes.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations, including issues such as bias, transparency, and accountability.

3. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, reduce costs, and increase efficiency. As AI becomes more advanced, it is likely to be used in even more areas
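
The cell above relies on `stream_and_merge` to remove overlap between streamed chunks. The idea can be illustrated without SGLang: if a new chunk begins with text that the accumulated output already ends with, only the remainder is appended. The helper names `strip_overlap` and `merge_chunks` are hypothetical, and this is a sketch of the idea rather than the library's implementation.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that is not already at the end of `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks while dropping repeated text at the seams."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


chunks = ["The capital", " capital of France", " of France is Paris."]
print(merge_chunks(chunks))  # The capital of France is Paris.
```

Checking the longest overlap first keeps repeated words at a seam from being duplicated, at the cost of a scan per chunk that is quadratic in the overlap length in the worst case.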



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jane, and I'm a software engineer with experience in developing websites and apps for businesses. I'm constantly learning new technologies and languages, and I enjoy sharing my knowledge with others. I'm also skilled at debugging issues and fixing problems in code. If you're interested in learning more about me, just ask! What's your specialty? What are some examples of your projects that you're proud of?
As an AI language model, I don't have a physical presence or an official profession, but I can help answer questions, provide information, and assist with tasks such as writing, programming, and communication. I'm here to assist you in

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the largest city in France by population, with over 18 million people residing in the city. The city is home t
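
One reason to prefer the asynchronous interface is that several independent requests can be awaited together. The sketch below shows the pattern with `asyncio.gather`. `StubAsyncEngine` and `run_jobs` are hypothetical names; the stub only mimics the `async_generate` call shape used above, and whether a real engine handles concurrent calls this way is an assumption to verify against the SGLang examples.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine with an async_generate-shaped coroutine."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate waiting on the model
        return [{"text": f" (completion for: {p})"} for p in prompts]


async def run_jobs(engine, jobs, sampling_params):
    """Submit several prompt batches at once and wait for all of them."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in jobs]
    # gather returns results in the same order as the submitted batches.
    return await asyncio.gather(*tasks)


jobs = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_jobs(StubAsyncEngine(), jobs, {"temperature": 0.8}))
print([len(batch) for batch in results])
```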

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am [Age], [Occupation]. I am a [Current Job Title] with [Previous Job Title] and I am currently working [Current Position]. I am passionate about [My Interests or Hobby] and I love to [My Interests or Hobby]. I am [Your Personality], [Your Character] with [Your Traits]. I am always looking for new opportunities and [Your Work Style], [Your Lifestyle]. I am always up for learning and growing, and I enjoy [My Personality Trait or Hobby] and [Your Personality Trait or Hobby]. [Your Name] is a very [Your Personality Trait

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light, and is known for its iconic Eiffel Tower, which is one of the seven wonders of the world. Paris is a diverse city with many neighborhoods and landmarks, including the Latin Quarter, the Seine River, and the Champs-Élysées. The city is also home to the Louvre Museum and the Notre-Dame Cathedral. Paris has been a popular tourist destination for centuries and is considered one of the most beautiful and cosmopolitan cities in the world. The city is known for its architecture, cuisine, and culture, and attracts millions of visitors each year. Paris is a


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain and unpredictable, but there are several trends that are likely to shape the technology's development. Here are some of the most likely scenarios:

1. Increased Integration of AI into Everyday Life: As AI becomes more integrated into our daily lives, we can expect to see even more applications of AI in areas such as healthcare, finance, transportation, and even entertainment. For example, AI can be used to analyze medical data, predict disease outbreaks, and optimize transportation routes.

2. Emergence of Artificial General Intelligence: While AI is currently mainly designed to perform specific tasks, it is possible that we will see the emergence of AI that can perform complex
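
When the streamed text is needed as a whole afterwards, the chunks can be collected while they are displayed. The sketch below consumes any async iterator of text chunks. `fake_stream` and `collect` are hypothetical names; `fake_stream` stands in for a streaming call such as the `async_stream_and_merge` loop above so the example runs without an engine.

```python
import asyncio


async def fake_stream(text):
    """Hypothetical stand-in for a streaming call: yields word-sized chunks."""
    for word in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word + " "


async def collect(stream, on_chunk=None):
    """Consume an async iterator of text chunks and return the joined text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. lambda c: print(c, end="", flush=True)
        parts.append(chunk)
    return "".join(parts)


seen = []
full_text = asyncio.run(collect(fake_stream("The future of AI is open"), seen.append))
print(full_text)
```

Appending to a list and joining once at the end avoids repeated string concatenation for long generations.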




```python
llm.shutdown()
```