# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
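
The failure that `nest_asyncio` works around can be reproduced with the standard library alone. The sketch below is only an illustration, not part of SGLang: `inner` and `outer` are hypothetical coroutines, and `outer` stands in for an environment such as a notebook kernel where an event loop is already running when `asyncio.run()` is called.

```python
import asyncio


async def inner():
    # Hypothetical stand-in for any coroutine you want to run.
    return "done"


async def outer():
    # We are already inside a running event loop here, which mimics
    # calling asyncio.run() from a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Without `nest_asyncio.apply()`, the nested `asyncio.run()` call is rejected, which is why the patch is needed before using the engine's asynchronous examples in such environments.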

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  John and I am an entrepreneur, investor, and international business leader. My expertise is in artificial intelligence, machine learning, and machine vision. I have over 10 years of experience in the field and have held leadership positions in both large and small tech companies. I am passionate about helping companies drive innovation and growth through technology. My team and I have built a number of successful AI and ML-based businesses and I have led successful product launches and investor meetings. I am excited about how artificial intelligence is transforming the world and I would love to help others achieve the same through my expertise and passion for technology. How can I contribute to the development
Prompt: The president of the United States is
Generated text:  a very important person. He has a lot of power. If he wants to make a decision, he can make it. He can also do what? Options: - cancel all flights - eat cake - call a modera

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can I expect from our conversation? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can I expect from our conversation? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can I expect from our conversation? [Name] is a [job title] at [company name]. I'm excited to meet you and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is known for its rich history, art, and cuisine, and is a popular tourist destination. The city is home to many famous French artists, writers, and musicians, and is considered one of the most beautiful cities in the world. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. The city is also known for its diverse population, with many

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some possible future trends in AI:

1. Increased automation and artificial intelligence: As AI technology continues to advance, we are likely to see an increase in automation and artificial intelligence in various industries. This could lead to the creation of more efficient and cost-effective solutions, as well as the creation of new jobs that are not currently being filled by humans.

2. Enhanced privacy and security: As AI technology becomes more advanced, there will be an increased need for privacy and security measures to protect the data that is
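
The "overlap removal" mentioned in the banner above can be illustrated without the engine. The following is a minimal sketch under an assumption: each streamed chunk may begin with text that was already emitted. The names `trim_overlap` and `merge_chunks` and the sample chunks are illustrative only, and this is not presented as how `stream_and_merge` is actually implemented.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is also a suffix of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping any text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks whose beginnings repeat the end of the previous text.
chunks = [" Paris", " Paris, the city", " city known for"]
merged = merge_chunks(chunks)
print(repr(merged))
```

The search runs from the longest possible overlap down to a single character, so the largest repeated span is removed first.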



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job or role] who is passionate about [job title]. I'm a [job title] who have always been [positive trait or strength] and always strive to [adjective or trait]. I'm known for [reason why this is a good thing] and enjoy [reason why it's a good thing]. I'm always looking for ways to [reason why this is a good thing] and make the world a better place. I'm always up for [reason why this is a good thing]. My favorite hobby is [mention a hobby or activity] and I enjoy [add some details about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

A. True
B. False

B. False

The capital of France is not Paris. The city of Paris is the capital of France and is located in the Île-de-France region, in the Provence region of France. Paris has a rich history dating back to the 6th century, and is know
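
One reason to prefer the asynchronous call is that other coroutines can make progress while generation is pending. The sketch below shows that pattern with the standard library only. `fake_async_generate` is a hypothetical stand-in that merely mimics the list-of-dicts result shape used above (each item has a `"text"` key), and `heartbeat` is an arbitrary placeholder for other work.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    # Hypothetical stand-in for an awaitable generation call.
    await asyncio.sleep(0.01)
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks):
    # Placeholder for unrelated work that runs while generation is pending.
    for i in range(3):
        ticks.append(i)
        await asyncio.sleep(0.001)


async def main():
    ticks = []
    # gather() schedules both coroutines on the same event loop.
    outputs, _ = await asyncio.gather(
        fake_async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(main())
for output in outputs:
    print(output["text"])
```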

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Character Name], and I am a [Job Title] at [Company Name]. I recently graduated with a Bachelor's degree in [Field of Study] from [University Name]. I have a passion for [A Personal Hobby/Interest/Challenge]. I have a strong work ethic and enjoy making a positive impact on the world through my actions. I am committed to [Your Last Goal/Big Dream]. What is your favorite [Activity/Task/Book/Enlightenment/Other Personal Interest]? What makes you unique? How do you stay motivated? What do you enjoy doing in your free time? I would love to hear more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Explain the key elements that make Paris unique and influential in French culture, including its architectural and cultural landmarks. The historic center of Paris, the Louvre, offers a spectacular glimpse into French art history. The Eiffel Tower, the Parisian Quarter (Montmartre, Saint-Germain-des-Prés, and Champs-Elysées) and the Place de la Concorde are among the city's most iconic landmarks, each bringing a unique flavor to the French landscape. The city's architecture includes the Notre-Dame Cathedral, the Arc de Triomphe, and the Palais Royal, each contributing

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  vast and exciting, and it is likely to continue to evolve in ways that are both exciting and challenging. Here are some possible future trends in artificial intelligence:

1. Increased integration of AI with other technologies: AI is already becoming more integrated with other technologies, such as sensors, cameras, and other devices. As more and more of these technologies are used together, they may become more integrated and seamless.

2. AI becoming more human-like: As AI becomes more advanced, it may become more similar to the way humans think and reason. This could lead to a greater understanding of complex problems and a more human-like AI.

3. AI becoming
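
The `async for` loop in the cell above consumes an asynchronous generator one chunk at a time. A minimal standard-library sketch of that consumption pattern follows. `fake_stream` is a hypothetical stand-in that yields pre-split pieces of text; it does not reproduce the behaviour of `async_stream_and_merge`.

```python
import asyncio


async def fake_stream(pieces):
    # Hypothetical stand-in for a streaming call: yield pieces one by one,
    # giving control back to the event loop between them.
    for piece in pieces:
        await asyncio.sleep(0)
        yield piece


async def main():
    collected = []
    print("Generated text: ", end="", flush=True)
    async for chunk in fake_stream([" Paris", ".", " It", " is"]):
        # Print each chunk as soon as it arrives, without a newline.
        print(chunk, end="", flush=True)
        collected.append(chunk)
    print()
    return "".join(collected)


text = asyncio.run(main())
```

Printing with `end=""` and `flush=True` makes the chunks appear as one continuous line as they arrive.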




In [6]:
llm.shutdown()