# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
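
To make the custom server idea concrete, here is a minimal, framework-agnostic sketch of the request-handling core such a server might wrap. It is not taken from the linked `custom_server.py`. The function name `handle_generate_request`, the JSON request shape, and the `StubEngine` class are all assumptions made for illustration. The only engine call it relies on is `generate(prompts, sampling_params)`, which returns a list of dicts with a `text` key, as in the batch inference cells of this document.

```python
import json
from typing import Any, Dict


def handle_generate_request(engine: Any, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response using the engine.

    Assumed request shape (hypothetical, for this sketch only):
    {"prompts": ["..."], "sampling_params": {...}}
    """
    request: Dict[str, Any] = json.loads(body)
    prompts = request.get("prompts")
    if not isinstance(prompts, list) or not all(isinstance(p, str) for p in prompts):
        return json.dumps({"error": "'prompts' must be a list of strings"}).encode()

    sampling_params = request.get("sampling_params", {})
    # Same call shape as the batch inference cells: a list of prompts in,
    # a list of dicts with a "text" field out.
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"texts": [o["text"] for o in outputs]}).encode()


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        return [{"text": p.upper()} for p in prompts]


if __name__ == "__main__":
    print(handle_generate_request(StubEngine(), b'{"prompts": ["hi"]}'))
```

In a real server, a web framework's route handler would call a function like this with an actual engine instance. Keeping the handler free of framework code makes it easy to test in isolation.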



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
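
The patch is only needed when an event loop is already running, which is the situation in IPython-style environments. The sketch below shows one way to detect that case with the standard library and apply `nest_asyncio` conditionally. The helper names `has_running_loop` and `maybe_apply_nest_asyncio` are hypothetical and not part of SGLang.

```python
import asyncio


def has_running_loop() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # asyncio raises RuntimeError when no loop is running in this thread.
        return False
    return True


def maybe_apply_nest_asyncio() -> bool:
    """Apply nest_asyncio only if a loop is already running.

    Returns True if the patch was applied, False otherwise.
    """
    if not has_running_loop():
        return False
    import nest_asyncio  # imported lazily so plain scripts do not need it

    nest_asyncio.apply()
    return True


if __name__ == "__main__":
    # In a plain script no loop is running, so nothing is patched.
    print("patched:", maybe_apply_nest_asyncio())
```

In a plain Python script, `asyncio.run(...)` works without any patch, so the conditional form avoids an unnecessary dependency there.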

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Greta. I am an aspiring poet. My poetry has been shared on the great websites like Miro and Twitter. I have also published a short poem in a book called "The Poet's Life: A Guide for Beginners". 

I love trying different forms of poetry to explore my own unique voice. My poetry is inspired by nature and often includes elements of music and sound. I am currently in my mid-20s and I have a passion for discovering new literary ideas and in writing my own poetry. 

I would like to share my poetry with anyone interested in literary and creative writing. Your comments and suggestions are very welcome
Prompt: The president of the United States is
Generated text:  trying to improve the ability of the U.S. military to use drones. The president believes that the U.S. military should have fewer drones and not be so dependent on them. The president thinks that instead of having a large number of drones in the air, the U.S. military should use a smaller nu
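
The cell above pairs each prompt with its output by position and reads the `text` field of each output. For batch jobs it is often convenient to collect those pairs into records instead of printing them. The following is a minimal sketch under that assumption. `generate_records` and `EchoEngine` are hypothetical names, and the stub only mimics the output shape shown in this document (a list of dicts with a `text` key).

```python
from typing import Any, Dict, List


def generate_records(
    engine: Any, prompts: List[str], sampling_params: Dict[str, Any]
) -> List[Dict[str, str]]:
    """Run one batched generate call and return prompt/completion records."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        # Positional pairing only makes sense when the lengths match.
        raise ValueError("engine returned a different number of outputs than prompts")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


if __name__ == "__main__":
    params = {"temperature": 0.8, "top_p": 0.95}
    for record in generate_records(EchoEngine(), ["Hello, my name is"], params):
        print(record)
```

With a real engine, passing `llm` in place of `EchoEngine()` should work as long as the output shape matches the one shown above.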

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a few details about your personality, skills, and accomplishments]. And what's your background? I have a [insert a few details about your education, work experience, or any other relevant information]. And what's your favorite hobby or activity? I enjoy [insert a few details about your hobbies or activities]. And what's your favorite book or movie? I love [insert a few details about your favorite books or movies]. And what's

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, and is the largest city in Europe by population. It is located on the Seine River and is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is known for its rich history, art, and culture, and is a popular tourist destination for visitors from around the world. It is also home to many important institutions, including the French Academy of Sciences and the French Parliament. The city is known for its vibrant nightlife and is a popular destination for tourists and locals alike. Paris is a city of contrasts, with its

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human needs.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more robust AI systems that are designed to be transparent, accountable, and responsible.

3. Increased focus on AI ethics: As AI becomes more integrated with human intelligence, there will be a greater
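
The cell above refers to "overlap removal" when merging streamed chunks. As a rough illustration of what such a step can look like, the sketch below drops the longest prefix of each new chunk that repeats the end of the text accumulated so far. This is an assumption about the general technique, not a description of how `stream_and_merge` is implemented. `strip_overlap` and `merge_chunks` are hypothetical names.

```python
from typing import Iterable


def strip_overlap(existing: str, new_chunk: str) -> str:
    """Remove the longest prefix of new_chunk that is also a suffix of existing."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text that repeats the end of the merged text."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


if __name__ == "__main__":
    print(merge_chunks(["Hello", "lo wor", "world"]))
```

One limitation of this heuristic is that it cannot tell an accidental overlap from a genuine repetition: merging `"ha"` and `"ha"` yields `"ha"`, not `"haha"`. It is therefore only appropriate when chunks are known to overlap.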



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name], and I am a [insert your job title here]. I am a creative problem-solver, excelling in both the creative and technical aspects of my work. I have always been fascinated by technology and have always wanted to make the world a better place. I love to experiment with new ideas and come up with creative solutions to problems. I am always looking for new challenges and opportunities to grow and learn. I am a team player and enjoy working with others to achieve our goals. I am passionate about being a part of a positive, innovative team that pushes boundaries and makes a positive impact in our community. Thank you for

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located on the French Riviera, in the western part of the country, and is the second-most populous city in Europe and the 15th mos
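
The asynchronous call above awaits a single batch. When several independent batches need to be processed, one possible pattern is to issue them concurrently while capping how many are in flight. The sketch below shows that pattern with `asyncio.gather` and a semaphore. `generate_batches` and `StubAsyncEngine` are hypothetical names, and whether concurrent calls help with a real engine is an assumption to verify, since the engine already schedules requests internally.

```python
import asyncio
from typing import Any, Dict, List


async def generate_batches(
    engine: Any,
    batches: List[List[str]],
    sampling_params: Dict[str, Any],
    max_concurrency: int = 2,
) -> List[List[str]]:
    """Run async_generate on each batch, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(batch: List[str]) -> List[str]:
        async with semaphore:
            outputs = await engine.async_generate(batch, sampling_params)
            return [output["text"] for output in outputs]

    # gather preserves the order of its arguments in the returned list.
    return await asyncio.gather(*(run_one(batch) for batch in batches))


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real async call would
        return [{"text": f"echo:{p}"} for p in prompts]


if __name__ == "__main__":
    texts = asyncio.run(generate_batches(StubAsyncEngine(), [["a", "b"], ["c"]], {}))
    print(texts)
```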

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert first name and last name]. I am [insert character's age, gender, and any unique abilities or experiences that make them stand out]. In my free time, I enjoy [insert hobbies or activities that interest me]. I am always looking for challenges and opportunities to learn new things. I am also a [insert a sport, hobby, or activity] enthusiast. What are some of the interests or hobbies that make you unique and stand out from others? As for the future, I hope to pursue [insert a career goal or goal you've been dreaming of for a long time]. What are your plans for the future? Lastly



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

In this statement, we have included the following information:

1. The city is referred to as the capital of France, which is the country's largest and most populous country in Europe.
2. Paris is the main city of France, as well as being the capital of France.
3. Paris is a major cultural and economic hub in Europe, hosting a diverse range of attractions and events throughout the year.
4. The city is located in the center of the country and is well-known for its architecture, art, and culinary traditions.
5. Paris is often referred to as "the city of a thousand faces" due to



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by several key trends:

1. Increased integration with other technologies: AI is likely to become more integrated with other technologies such as the Internet of Things (IoT), blockchain, and robotics. This integration will allow AI to adapt to new challenges and evolve in new directions.

2. Greater automation: As AI becomes more advanced, it is likely to become more autonomous and capable of performing tasks that would previously have required human intelligence. This automation will likely lead to increased efficiency and productivity, and will also create new jobs.

3. Increased use of AI for humanitarian purposes: AI has the potential to be used for humanitarian purposes such
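
The streaming cell above prints each chunk as it arrives and then discards it. If the full text is also needed afterwards, the chunks can be accumulated while iterating. The sketch below does this for any async iterator of strings. `collect_stream` and `fake_stream` are hypothetical names, and the fake stream stands in for `async_stream_and_merge(llm, prompt, sampling_params)` so the example runs without an engine.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


async def collect_stream(chunks: AsyncIterator[str]) -> Tuple[str, int]:
    """Consume an async stream of text chunks.

    Returns the concatenated text and the number of chunks received.
    """
    parts: List[str] = []
    async for chunk in chunks:
        parts.append(chunk)
    return "".join(parts), len(parts)


async def fake_stream() -> AsyncIterator[str]:
    """Hypothetical chunk source that mimics token-by-token streaming."""
    for piece in [" Paris", ".", " It", " is"]:
        await asyncio.sleep(0)  # yield control between chunks
        yield piece


if __name__ == "__main__":
    text, count = asyncio.run(collect_stream(fake_stream()))
    print(repr(text), count)
```

Collecting into a list and joining once at the end avoids repeated string concatenation for long generations.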




In [6]:
llm.shutdown()
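
Calling `shutdown()` in a final cell works in a notebook, but in a script an exception raised earlier would skip it. One way to make cleanup more robust is to wrap the engine in a context manager that always calls `shutdown()`. The sketch below assumes only that the engine object exposes a `shutdown()` method, as used above. `managed_engine` and `StubEngine` are hypothetical names.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def managed_engine(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine via factory and shut it down on exit, even on errors."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def __init__(self) -> None:
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": ""} for _ in prompts]

    def shutdown(self) -> None:
        self.closed = True


if __name__ == "__main__":
    with managed_engine(StubEngine) as engine:
        engine.generate(["Hello"], {})
    print("closed:", engine.closed)
```

With a real engine, the factory could be a lambda that constructs the engine, for example `lambda: sgl.Engine(model_path=...)` with the model path used earlier in this document.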