# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Emily and I'm a grade 10 student. I'm really interested in the arts and I have a passion for painting. I also like to read and listen to music, but I have always struggled with the beginning of my writing. 

Emily loves the outdoors and loves to explore the world around her. She really likes the sound of different animals and enjoys talking to animals. She also loves to travel to different places in the world.

What areas of life are you interested in exploring, and what hobbies do you have? I would appreciate it if you could tell me more about them.
Emily loves the outdoors, and enjoys exploring the world
Prompt: The president of the United States is
Generated text:  6 feet tall. The vice president of the United States is 5 feet 3 inches tall. What is the height difference between the vice president and the president in inches?
To determine the height difference between the vice president and the president of the United States, we need to fol
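
For batch jobs it is often convenient to store each prompt together with its completion. The sketch below assumes only what the cell above shows, namely that `generate` returns one dict per prompt with a `"text"` key. The helper `to_jsonl` and the sample `fake_outputs` are hypothetical stand-ins so that the snippet runs without loading a model.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as JSON Lines, one record per prompt."""
    if len(prompts) != len(outputs):
        raise ValueError("expected one output per prompt")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in for the list returned by llm.generate(prompts, sampling_params).
fake_prompts = ["The capital of France is", "The future of AI is"]
fake_outputs = [{"text": " Paris."}, {"text": " bright."}]

print(to_jsonl(fake_prompts, fake_outputs))
```

With a real engine, the `outputs` list from the cell above would be passed in place of `fake_outputs`.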

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Type of Character] who is [Describe your character's personality traits here]. I'm [Describe your character's appearance here]. I'm [Describe your character's hobbies or interests here]. I'm [Describe your character's strengths and weaknesses here]. I'm [Describe your character's goals here]. I'm [Describe your character's personality type here]. I'm [Describe your character's personality type here]. I'm [Describe your character's personality type here]. I'm [Describe your character's personality type here]. I'm [Describe your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich cultural heritage and is the largest city in France by population. The city is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also home to many famous museums, including the Louvre and the Musée d'Orsay. The city is known for its cuisine, including its famous croissants and its many traditional French dishes. Paris is a popular tourist destination and is home to many international companies and organizations. The city is also known for its fashion industry, with many

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and experiences. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human needs.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more rigorous testing and evaluation of AI systems, as well as greater transparency and accountability in their development and deployment.

3. Increased focus on AI ethics: As AI becomes more integrated with
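
The banner printed by this cell mentions overlap removal: when a streamed chunk repeats text that has already been emitted, the repeated prefix is dropped before the chunk is appended. The following is a standalone sketch of that idea, not necessarily how `stream_and_merge` is implemented. The names `strip_overlap` and `merge_chunks` are hypothetical.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is already a suffix of `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing any overlap with the text merged so far."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


print(merge_chunks(["The capital", " capital of France", " France is Paris."]))
```

One limitation of this approach is that a chunk that legitimately repeats the end of the previous text (for example, a model that outputs the same word twice) is treated as overlap and trimmed.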



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I am a [Your Profession] with a passion for [Your Field of Interest or Expertise]. I believe in [Your Profession] and am constantly learning and evolving in my area of expertise. I am a team player and enjoy collaborating with others to achieve our goals. I have a keen interest in [Your Profession] and strive to improve my skills and knowledge to be a better [Your Profession]. I am committed to taking on new challenges and continuing to grow as a professional. What's your name and what's your profession? [Your Name] and [Your Profession]. [Your Name] wants to learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known for its iconic Eiffel Tower, vibrant nightlife, and rich cultural heritage. The city is also home to the Louvre Museum and the Notre-Dame Cathedral, as well 
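
When testing the code that surrounds an asynchronous call, it can help to swap in a stand-in engine so that nothing has to be loaded onto a GPU. The sketch below assumes only the call shape shown in the cell above: an awaitable `async_generate(prompts, sampling_params)` that returns one dict with a `"text"` key per prompt. `FakeEngine` and `run_batch` are hypothetical names and are not part of SGLang.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in that mimics the call shape used above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control once, like a real awaitable
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batch(engine, prompts, sampling_params):
    """Await one batch and return (prompt, text) pairs in input order."""
    outputs = await engine.async_generate(prompts, sampling_params)
    return [(p, o["text"]) for p, o in zip(prompts, outputs)]


pairs = asyncio.run(run_batch(FakeEngine(), ["a", "b"], {"temperature": 0.8}))
print(pairs)
```

Because `run_batch` only relies on the method name and the `"text"` key, the same coroutine should accept the real `llm` object from the first cell.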

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [occupation or role] who is passionate about [mention an activity, hobby, or interest]. I am a [age] year old and currently residing in [your hometown or city]. I am always on the lookout for new challenges and opportunities to grow and develop as a [profession]. I am a [occupation or role] who is known for my [mention an achievement, talent, or skill]. I am an [occupation or role] with [mention an area of interest, hobby, or passion]. I am a [occupation or role] who is [mention a personal trait or quality]. I am

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic Eiffel Tower, Seine River, and many famous landmarks such as Notre-Dame Cathedral and the Louvre Museum.
Paris is the capital city of France, known for its iconic Eiffel Tower, Seine River, and many famous landmarks such as Notre-Dame Cathedral and the Louvre Museum. It is located in the center of the country and is a major international city. The city has a rich history dating back to the Middle Ages and is home to many famous museums and landmarks. It is also a popular tourist destination, with millions of visitors each year. The city has a diverse population and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a number of key trends and developments, including:

1. AI will continue to become more powerful and more widely accessible. This will likely be driven by advances in hardware and software, as well as improvements in data and computational power.
2. AI will become more integrated with other technologies, such as the Internet of Things (IoT) and the Cloud. This will allow for more connected and efficient systems, as well as more personalized and context-aware experiences.
3. AI will continue to focus on developing new forms of AI, such as those that can learn and adapt to new contexts and


In [6]:
llm.shutdown()