# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed, working Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
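As a rough illustration of the custom-server idea, the sketch below shows only the request-handling core: parse a JSON body, call the engine, and serialize the reply. `StubEngine` and `handle_generate` are hypothetical names introduced here, and the stub assumes that a single prompt yields a single dict with a `text` key. It is not the linked `custom_server.py`, and it leaves out the web framework entirely.

```python
import json


class StubEngine:
    """Hypothetical stand-in for the engine; no model is loaded."""

    def generate(self, prompt, sampling_params):
        # Assumption: one prompt in, one {"text": ...} dict out.
        return {"text": f" <completion for: {prompt}>"}


def handle_generate(engine, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body."""
    request = json.loads(body)
    output = engine.generate(
        request["prompt"], request.get("sampling_params", {})
    )
    return json.dumps({"text": output["text"]}).encode("utf-8")


response = handle_generate(
    StubEngine(), b'{"prompt": "The capital of France is"}'
)
print(response.decode("utf-8"))
```

Keeping the handler independent of any framework means the same function could be wired into whichever server library you prefer.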



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
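To see why this is needed, the minimal sketch below (standard library only, no SGLang involved) reproduces the underlying restriction: `asyncio.run` refuses to start while another event loop is already running in the same thread, which is the situation inside IPython or Jupyter. The names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Run as a plain script, this prints the `RuntimeError` message rather than `no error`; `nest_asyncio.apply()` patches asyncio so that such nested calls are permitted.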

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Katri. I'm a Junior at UCF. I play basketball with my friends and my dad and they play with their friends too. My mom and dad are not very good at playing basketball. We have to work together a lot. We play basketball at home and at the park. I like to play in my backyard. I play basketball for 3 hours every day. It feels good to play with my friends. I think it's really fun. I like to play with my friends, and I think it makes me happy. I don't play basketball in my school. I don't play basketball for 15 hours a
Prompt: The president of the United States is
Generated text:  a very important person. Before being elected, the president has to pass a very important test, called the ____ test. The president must be at least 35 years old before he can take this test. The test is to see if the president has made the right choices in the past to help the country succeed. If the test is passed, the president will have to run for another 5 years in th
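For offline batch jobs, the results usually need to be stored rather than printed. The sketch below pairs each prompt with its output and serializes the pairs as JSON Lines. `StubEngine` and `to_jsonl` are hypothetical names used so the example runs without a GPU. The stub assumes, as the `zip` in the cell above implies, that `generate` returns one dict with a `text` key per prompt, in prompt order.

```python
import json


class StubEngine:
    """Hypothetical stand-in for the engine; returns one dict per prompt."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def to_jsonl(engine, prompts, sampling_params):
    """Run one batch and return one JSON record per line."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected one output per prompt")
    return "\n".join(
        json.dumps({"prompt": p, "completion": o["text"]})
        for p, o in zip(prompts, outputs)
    )


records = to_jsonl(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
print(records)
```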

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a brief description of your profession or experience]. I enjoy [insert a brief description of your hobbies or interests]. What do you do for a living? I'm always looking for new opportunities to learn and grow. What do you enjoy doing in your free time? I enjoy reading, playing sports, and spending time with my family. What's your favorite hobby? I love [insert a hobby you enjoy]. What's your favorite book?

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is known for its rich history, art, and cuisine, and is a popular tourist destination. It is also home to the French Parliament, the country's highest legislative body. The city is also known for its fashion industry, with many famous fashion designers and boutiques. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. It is a city that has played a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ethical considerations. This will include issues such as bias, transparency, accountability, and the impact of AI on society.

2. Integration with other technologies: AI will continue to be integrated with other technologies, such as machine learning, natural language processing, and computer vision. This will allow for more sophisticated and personalized interactions with AI-powered systems.

3. Development of new AI technologies:
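The "overlap removal" mentioned above can be pictured as follows: if a streamed chunk begins with text that has already been emitted, only the new suffix is appended. The sketch below is one way to do that under this assumption; `trim_overlap` and `merge_chunks` are illustrative names, and the helper in `sglang.utils` may be implemented differently.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping any text repeated at the seams."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks([" Paris", " Paris, the", " the city"])
print(repr(merged))
```

One caveat of this approach is that a genuinely repeated token at a chunk boundary (for example `"ha"` followed by `"ha"`) is indistinguishable from an overlap and would be trimmed.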



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [occupation] with an extensive knowledge of [field or topic]. I enjoy [job-related activity or hobby]. I'm excited to meet you and learn more about [your field of interest]. Feel free to ask me anything you'd like to know. I look forward to our conversation. [Name]: Hello! I'm [Name], a [occupation] with a wealth of experience in [field or topic]. I'm excited to meet you and discuss [your field of interest] in detail. Feel free to ask me anything you'd like to know! [Name]: I look forward to our conversation. [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

This statement is factually accurate. 

To arrive at this conclusion, I used my knowledge of French history and geography. I recalled that Paris is the largest city in France and is also the capital of France. I also remembered that
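Because `async_generate` is awaitable, several batches can be awaited together with standard asyncio tools. The sketch below shows the pattern with `asyncio.gather`, using a hypothetical `StubAsyncEngine` so it runs without a model. Whether concurrent calls improve throughput on the real engine is not covered here; this only illustrates the control flow.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in exposing an awaitable async_generate."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as real I/O or compute would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    """Await all batches together; gather keeps results in input order."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    run_batches(StubAsyncEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, results):
    print(len(batch), "prompt(s) ->", len(outputs), "output(s)")
```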

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert fictional character's name], and I am a [insert fictional character's profession or occupation]. I specialize in [insert relevant skill or expertise]. How can I be of assistance to you? [insert character's enthusiasm, charisma, or unique selling point to grab attention]. And to further my knowledge, I am currently [insert character's age, if any, and if relevant, [insert the character's hometown or any other relevant details that might make them stand out]]. And my favorite hobby is [insert hobby or activity]. And [insert character's education or qualifications], and [insert any notable achievements or accomplishments]. And [insert character


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light.

Paris, often referred to as "the City of Light" or "Lyon, " is the largest city in France and serves as the seat of government for the French Republic. It is located on the Île de la Cité, a crescent-shaped island in the Seine River. The city is famous for its historic landmarks, including the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is also home to numerous museums, theaters, and restaurants, and plays a crucial role in French culture and society. Its elegant architecture and modern technology have made it


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  constantly evolving, and there are several trends that could shape the way we interact with technology in the coming years. Here are some of the most likely trends:

1. Increased collaboration between humans and AI: As AI continues to improve, it is expected to become more capable of performing tasks that require human intelligence, such as decision-making and problem-solving. This could lead to a more collaborative relationship between humans and AI, where both parties work together to achieve their goals.

2. AI will become more integrated into our daily lives: As more and more companies and organizations incorporate AI into their operations, we may see a gradual shift towards AI-powered solutions in
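If you want the full text as well as the live stream, the chunks can be collected while they are consumed. The sketch below shows the `async for` accumulation pattern with a hypothetical async generator, `stub_stream`, standing in for a streaming helper so that it runs without a model.

```python
import asyncio


async def stub_stream(prompt):
    """Hypothetical async generator that yields text chunks one at a time."""
    for piece in [" Paris", ",", " also", " known"]:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield piece


async def collect(prompt):
    """Consume the stream and return the concatenated text."""
    parts = []
    async for chunk in stub_stream(prompt):
        parts.append(chunk)
    return "".join(parts)


text = asyncio.run(collect("The capital of France is"))
print(repr(text))
```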




In [6]:
llm.shutdown()
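In a script, as opposed to a notebook, it can be useful to make sure `shutdown()` runs even when generation raises. The sketch below shows the `try`/`finally` pattern with a hypothetical `StubEngine`. Only the `shutdown()` method name is taken from the cell above; the stub's behaviour is an assumption for illustration.

```python
class StubEngine:
    """Hypothetical stand-in that records whether shutdown was called."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


engine = StubEngine()
error = None
try:
    engine.generate([], {"temperature": 0.8})
except ValueError as exc:
    error = str(exc)
finally:
    # Runs whether or not generate() raised, so resources are released.
    engine.shutdown()

print(error, engine.closed)
```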