# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an HTTP server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
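
The reason for this requirement can be reproduced with the standard library alone. The sketch below uses plain `asyncio` and no SGLang code; `inner` and `outer` are illustrative names. It shows that `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside IPython or Jupyter and presumably the kind of nested call the note above refers to. `nest_asyncio.apply()` patches asyncio so that such nested calls are permitted.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # At this point an event loop is already running, as in a Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Without the patch, the nested call ends in the `RuntimeError` branch.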

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Kyle. I’m 26 years old, weighing 150 pounds, and I’m on my way to becoming a father. As you can imagine, I have a lot of questions and concerns about raising a child. The first question that came to mind was, “What is the best way to give my child a love of books?” I’m very interested in how I can encourage my child to read and enjoy literature. I have an older daughter, and I want her to enjoy my company and spend some time with me. But, I am worried that reading isn’t a good way to teach her about the language of words
Prompt: The president of the United States is
Generated text:  a very important person. Before he or she becomes president, they must be approved by the members of the United States Senate. One thing that makes the people of the United States like the president is that they think that the president is the best leader in the United States. They also think that the president's family is very close to the president. The president
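
To give a feel for what the `temperature` and `top_p` entries in `sampling_params` control, here is a minimal sketch of the common textbook definitions: logits are divided by the temperature before the softmax, and nucleus (top-p) filtering keeps the smallest set of tokens whose cumulative probability reaches `top_p`. The function names and the toy logits are made up for illustration, and this is not SGLang's sampler implementation.

```python
import math


def apply_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving tokens form a probability distribution.
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
probs = apply_temperature(toy_logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)
print(nucleus)
```

With these toy values the least likely token falls outside the nucleus, so it can never be sampled. Lower temperatures sharpen the distribution and shrink the nucleus further.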

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [reason for interest in the industry], and I'm always looking for ways to [action or goal]. I'm a [reason for interest in the industry] and I'm always looking for ways to [action or goal]. I'm a [reason for interest in the industry] and I'm always looking for ways to [action or goal]. I'm a [reason for interest in the industry] and I'm always looking for ways to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is the largest city in France and the second-largest city in the European Union. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also famous for its rich history, including the French Revolution and the French Revolution Museum. Paris is a cultural and economic center of France and a major tourist destination. It is home to many famous French artists, writers, and musicians. The city is also known for its cuisine, including French cuisine, and its fashion industry. Paris is a vibrant and dynamic city that continues

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI is likely to become more prevalent in various industries, including manufacturing, healthcare, and transportation. Automation will likely lead to increased efficiency and productivity, but it will also lead to job displacement for some workers.

2. AI ethics and privacy: As AI becomes more prevalent, there will be increased scrutiny of its use and potential misuse. There will likely be a push for greater ethical considerations and privacy protections.

3. AI for human benefit: AI is likely to be used for human benefit, such as in healthcare, education, and transportation. AI will likely be used to improve
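
The "overlap removal" mentioned in the cell above can be pictured as follows. When a streamed chunk repeats text that has already been received, only the unseen suffix should be appended. The sketch below is an illustration under that assumption; `strip_overlap` and `merge_chunks` are hypothetical helpers and are not the implementation behind `stream_and_merge`.

```python
def strip_overlap(existing_text, new_chunk):
    """Return new_chunk without the prefix that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while dropping repeated text at the seams."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


chunks = ["The capital", "capital of France", " France is Paris"]
merged_text = merge_chunks(chunks)
print(merged_text)  # The capital of France is Paris
```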



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a/an [Occupation] [Name]. I'm a/an [Job Title] [Job Title], [Name]. I'm a/an [Name], [Name], [Name] at [Company Name]. I've been a/an [Job Title] for [Years] [Years] and I'm always [Job Title] and [Job Title]. I enjoy [Job Title] because [reason for job title]. I'm a/an [Name] and I come from [name]. I'm [Job Title] and [Name], [Name], [Name]. I love [Job Title] and I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is the third-largest city in the European Union by population and one of the most populous cities in the world, and also the seventh-largest city in the world by land area. Paris is known for its beautiful architecture, historical landmarks, and vibrant culture. The city is home to many world-renowned museums, including the Louvre, the Musée d’Orsay, and the Musée de l’
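
The benefit of an awaitable generation call is that other coroutines can make progress while a request is pending. The sketch below illustrates this with the standard library only; `fake_async_generate` is a hypothetical stand-in for a real generation call and merely sleeps. `asyncio.gather` returns results in the order of its arguments, so pairing prompts with outputs via `zip` stays valid even when requests finish out of order.

```python
import asyncio


async def fake_async_generate(prompt, delay):
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(delay)
    return {"text": f" <completion for: {prompt}>"}


async def run_all(prompts):
    # Give earlier prompts longer delays so they finish last.
    tasks = [
        fake_async_generate(prompt, delay=0.01 * (len(prompts) - index))
        for index, prompt in enumerate(prompts)
    ]
    return await asyncio.gather(*tasks)


demo_prompts = ["first", "second", "third"]
demo_outputs = asyncio.run(run_all(demo_prompts))
for prompt, output in zip(demo_prompts, demo_outputs):
    print(prompt, "->", output["text"])
```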

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Age] year old [Occupation or field of study]. I'm currently [Current Location] and [Favorite Activity/Activity]. I've always been [favorite trait]. I'm always ready to learn and always looking for new opportunities to grow and succeed. What do you think makes you unique? What's your ultimate goal for your life?
Welcome to the world of [Name]. As a [Occupation or field of study], I'm always on the lookout for new experiences and opportunities to learn and grow. My favorite trait is always being proactive and never giving up on my goals. I'm always

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Here's an example of a more detailed response:
Paris is the largest city in France, located in the northeast part of the country. It was founded in the 8th century by the Moors, and has been a major cultural, economic, and political center since the 15th century. The city is surrounded by the Seine river and includes the cities of Lyon, Montmartre, and Meudon. Paris is home to the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and many other notable landmarks. The city is known for its vibrant artistic and cultural scene, as

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by several trends that will shape the field as we know it today. Here are some of the most likely future trends in AI:

1. Increased use of AI in healthcare: AI is already being used in a number of healthcare applications, from diagnostics and treatment planning to patient care. As AI continues to develop and become more sophisticated, it is likely to become an even more important tool in healthcare.

2. AI in manufacturing: AI is already being used in manufacturing to automate processes and improve efficiency. As AI continues to evolve, it is likely to become an even more important tool in manufacturing.

3. AI in financial services

In [6]:
llm.shutdown()
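
In a script, it can be useful to make sure the shutdown call runs even when generation raises an exception. One possible pattern is a small context manager, sketched below. `StubEngine` and `managed` are hypothetical names used only for illustration; the stub exposes just `generate()` and `shutdown()` and does not model the real engine.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only generate() and shutdown()."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
error_message = None
try:
    with managed(engine) as llm_stub:
        llm_stub.generate([], {"temperature": 0.8})  # raises on purpose
except ValueError as exc:
    error_message = str(exc)

print(engine.is_shut_down, error_message)
```

The `finally` clause runs whether or not the body of the `with` block succeeds, so the engine is released in both cases.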