# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
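The patch is needed because environments such as IPython already run an event loop, and Python's `asyncio` refuses to start a second one from inside it. The following stdlib-only sketch reproduces that error without involving SGLang; the function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are inside a running event loop here, much like code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` first patches `asyncio` so that such nested calls are allowed.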

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Xiaoxuan and I was born in 1980, and I'm 13 years old now. How can I get in touch with you? A. Call me B. Write to me C. Post a message D. Make a phone call. Which of the following options should I choose? A. Call me B. Write to me C. Post a message D. Make a phone call. Which of the following options should I choose?
Answer:
A

When conducting construction work on lines without overhead power lines, what safety measure should be taken for equipment that is energized?
A. Short-circuit
B
Prompt: The president of the United States is
Generated text:  250 cm tall. When he walks away from a mirror, his shadow is 150 cm long. If his shadow is the same length as his height, how tall is the person in the mirror?

To determine the height of the person in the mirror, we need to use the concept of similar triangles. The height of the person and the length of his shadow form one triangle, and the height of the mirror and its shadow form another triangle.
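
For offline batch jobs it is often more convenient to persist results than to print them. The sketch below relies only on the return shape shown above (one dict with a `text` key per prompt, in prompt order). `StubEngine` and `save_results` are hypothetical names, not part of SGLang; the stub exists so the example runs without a GPU.

```python
import json
import os
import tempfile


class StubEngine:
    """Hypothetical stand-in that mimics the return shape of llm.generate()."""

    def generate(self, prompts, sampling_params=None):
        # One dict per prompt, in the same order as the input list.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def save_results(engine, prompts, sampling_params, path):
    """Run one batch and write prompt/completion pairs as JSON Lines."""
    outputs = engine.generate(prompts, sampling_params)
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            f.write(json.dumps({"prompt": prompt, "text": output["text"]}) + "\n")
    return len(outputs)


prompts = ["Hello, my name is", "The capital of France is"]
out_path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
n_written = save_results(
    StubEngine(), prompts, {"temperature": 0.8, "top_p": 0.95}, out_path
)

# Read the file back to confirm one record per prompt.
with open(out_path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(n_written, records[0]["prompt"])
```

With the real engine, `StubEngine()` would be replaced by the `llm` object created earlier.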

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? I'm a [insert a short description of your personality or skills]. And what's your favorite hobby or activity? I love [insert a hobby or activity you enjoy]. And what's your favorite book or movie? I love [insert a book or movie you've read or watched]. And what's your favorite place to go? I love [insert a place you've visited]. And what's your favorite color? I love [insert a favorite color].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich cultural heritage and is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also known for its vibrant nightlife, fashion industry, and food scene. The city is a major economic and cultural center in Europe and is home to many international institutions and organizations. It is a popular tourist destination and is considered one of the most beautiful cities in the world. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. The city is also known for

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI systems will become more integrated with human intelligence, allowing them to learn from and adapt to human behavior and decision-making processes. This will enable more sophisticated and adaptive AI systems that can handle complex and unpredictable situations.

2. Enhanced machine learning capabilities: AI systems will become more capable of learning from large amounts of data and making more accurate predictions and decisions. This will enable more advanced and sophisticated AI systems that can handle a wider range of tasks and applications.

3. Increased use of AI in healthcare: AI will play a key role in
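
The "overlap removal" mentioned in the cell above can be illustrated with a small, self-contained sketch. It assumes that a streamed chunk may repeat text that was already emitted, and drops the longest such repeated prefix before appending. This is one plausible way to do it, not necessarily how `stream_and_merge` is implemented, and the helper names are illustrative.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the largest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks(["Hello", "lo wor", "world"])
print(merged)  # Hello world
```

One caveat of this heuristic is that it cannot distinguish a repeated chunk from text that legitimately repeats: merging `"ha"` followed by `"ha"` yields `"ha"`.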



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [occupation or profession] at [Location]. I am confident, experienced, and skilled in my field. I pride myself on my ability to think critically and arrive at sound conclusions. My approach is always based on logic and evidence, and I am always willing to learn from new experiences. I have a good sense of humor, and I am always up for a good laugh. I am always on the lookout for new challenges and opportunities to learn and grow. I am excited to meet you and help you achieve your goals. If you have any questions or need help, please don't hesitate to ask. 



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic Eiffel Tower and the iconic Eiffel Line railway.
Paris is the capital of France and is known for its iconic Eiffel Tower and the iconic Eiffel Line rai
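
An awaitable generation call also lets several batches be in flight at once. The sketch below runs two batches with different sampling parameters concurrently via `asyncio.gather`. It assumes the engine tolerates concurrent `async_generate` calls; `StubAsyncEngine` is a hypothetical stand-in that only mimics the call and return shape used above.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in mimicking the awaitable llm.async_generate()."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate generation latency
        temperature = sampling_params["temperature"]
        return [{"text": f" <T={temperature}: {p}>"} for p in prompts]


async def run_batches(engine, batches):
    # Build one coroutine per (prompts, sampling_params) pair. gather() runs
    # them concurrently and returns results in the same order as the inputs.
    coros = [engine.async_generate(prompts, params) for prompts, params in batches]
    return await asyncio.gather(*coros)


batches = [
    (["The capital of France is"], {"temperature": 0.2, "top_p": 0.9}),
    (["The future of AI is"], {"temperature": 0.8, "top_p": 0.95}),
]
results = asyncio.run(run_batches(StubAsyncEngine(), batches))
for outputs in results:
    for output in outputs:
        print(output["text"])
```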

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through async_stream_and_merge (the overlap-aware helper) instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name]. I'm a [insert character's age] year old who was born in [insert birth year] in [insert birthplace]. I came to the United States when I was [insert number of years] years old, so I was a kid when I first arrived in America. I'm an [insert occupation] and have a passion for [insert a specific interest or hobby that interests you]. In my free time, I enjoy [insert something enjoyable]. I'm a [insert some characteristic or personality trait] person, and I strive to be [insert a statement that reflects your character or goal]. I'm

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

That's a very accurate statement! Given the historical and cultural significance of Paris, it's no surprise that it's home to many of France's most famous landmarks, including the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and many more. Paris is also known as the "City of Light" and is a major economic, political, and cultural center in Western Europe. It's home to many international institutions and organizations, including the French Academy of Fine Arts, UNESCO, and the European University Center. Paris is also a world-renowned gastronomic destination, with many fine dining venues and Parisian restaurants

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and transformative, with potential applications in various industries and fields. Here are some possible future trends in artificial intelligence:

1. Increased autonomy: As AI becomes more advanced, it is likely to become more autonomous, allowing machines to make decisions on their own without human input. This could lead to more efficient and effective use of resources, as well as increased safety in industries such as manufacturing and transportation.

2. Enhanced human-computer interaction: AI is becoming more integrated into human-computer interaction, with more voice-activated assistants, virtual assistants, and intelligent chatbots becoming increasingly prevalent. These technologies could help people to communicate more efficiently and effectively,

In [6]:
llm.shutdown()
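
In a standalone script, it can be useful to make sure `shutdown()` runs even when generation fails partway through. The sketch below shows one way to do that with `try`/`finally`. `StubEngine` and `run_batch` are hypothetical names; the stub only mimics the `generate()` and `shutdown()` calls used in this document.

```python
class StubEngine:
    """Hypothetical stand-in exposing generate() and shutdown() like the engine above."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


def run_batch(engine, prompts, sampling_params):
    # try/finally guarantees shutdown() runs even if generate() raises.
    try:
        return engine.generate(prompts, sampling_params)
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    run_batch(engine, [], {"temperature": 0.8})
except ValueError as exc:
    print("generation failed:", exc)
print("engine shut down:", engine.is_shut_down)
```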