# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
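
Why this is needed can be reproduced with the standard library alone. The following is a minimal sketch, assuming an environment that already runs an event loop: calling `asyncio.run` from inside that loop raises a `RuntimeError`, which is the situation `nest_asyncio.apply()` is meant to work around. The function names are illustrative only and are not part of SGLang.

```python
import asyncio


async def answer():
    return 42


async def call_run_from_inside_loop():
    # We are already inside an event loop here, like a notebook cell would be.
    coro = answer()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(call_run_from_inside_loop())
print(message)
```

With `nest_asyncio.apply()` called beforehand, the nested `asyncio.run` call is expected to succeed instead of raising.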

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.45it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.44it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Max. I'm a writer and a blog writer, and I was born on October 24th, 1980, and grew up in the Philippines. I have a passion for adventure, travel, and exploring the world. My writing style is adventurous and open to new experiences, but I also have a deep respect for my culture and history. I'm currently working on a book set in the Philippines and have a tendency to go off on tangents from the main story in my writing. How can I improve my writing and make sure I don't stray from the main story in my writing?
Improving your writing and
Prompt: The president of the United States is
Generated text:  getting ready for the 2022 United States Open tennis tournament. He has 7 days to prepare for the tournament. If he has already prepared 15 games for the tournament, how many games does he need to prepare for the tournament?
To determine how many games the president of the United States needs to prepare for the 2022 United States Open tennis tournam
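
In offline batch jobs the results are usually written to disk rather than printed. The sketch below pairs each prompt with its generated text and serializes the pairs as JSON Lines. It relies only on the `text` key used in the cell above. The helper name `to_jsonl` and the sample data are hypothetical stand-ins, not part of SGLang.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data shaped like the list returned by llm.generate(...) above.
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
print(to_jsonl(example_prompts, example_outputs))
```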

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I am a [Age] year old [Occupation]. I am a [Type of Character] who has always been [Positive Traits]. I am [Positive Traits] and I am [Positive Traits]. I am [Positive Traits] and I am [Positive Traits]. I am [Positive Traits] and I am [Positive Traits]. I am [Positive Traits] and I am [Positive Traits]. I am [Positive Traits] and I am [Positive Traits]. I am [Positive Traits] and I am [Positive Traits]. I am [Positive Traits] and I am [Positive Traits]. I am [Positive Traits

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in France and the second-largest city in the European Union. Paris is known for its rich history, beautiful architecture, and vibrant culture. It is home to many famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is also a major center for business, finance, and tourism, making it a popular destination for tourists and locals alike. The city is home to many museums, theaters, and other cultural institutions, and is a major center for the arts and entertainment industry. Paris is a city of contrasts, with its modern and historic elements blending together to create

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence, allowing for more complex and nuanced decision-making. This could lead to a more human-like experience for users.

2. Enhanced privacy and security: As AI becomes more sophisticated, there will be a greater need for privacy and security measures to protect user data. This could lead to the development of new technologies and protocols to ensure that AI systems are used responsibly and ethically.

3. Greater automation and efficiency: As AI becomes more integrated with human intelligence, it is likely to become
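
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch, assuming that a streamed chunk may begin with text that was already received. The names `trim_overlap_sketch` and `merge_chunks` are hypothetical, and the logic is not necessarily what `stream_and_merge` does internally.

```python
def trim_overlap_sketch(existing_text, new_chunk):
    """Return new_chunk without any prefix that repeats the end of existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


print(merge_chunks(["The capital", " capital of France", " is Paris."]))
```

One limitation of this naive approach is that a genuine repetition at a chunk boundary (for example `"ha"` followed by `"ha"`) is also trimmed, so it should be read as an illustration rather than a drop-in utility.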



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm [Age] years old. I have a unique personality and an adventurous spirit. I love traveling, exploring new places, and learning about different cultures. I'm always up for a good challenge and always want to grow in my skills and knowledge. I'm a team player, and I enjoy helping others and creating a positive impact on the world. Thank you! 
Remember, your self-introduction should be brief and to the point, highlighting your unique qualities and personality traits. Please make sure to avoid any personal or sensitive information. Additionally, make sure to include at least one positive attribute that you believe will contribute

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a beautiful and historic city located in the south of the country, surrounded by the River Seine. The city is f
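
When requests arrive independently, for example in a custom server, one option is to issue one coroutine per prompt and await them together. The sketch below shows that pattern with `asyncio.gather`. So that it runs without a GPU, it uses `EchoEngine`, a hypothetical stand-in that only mimics the `async_generate` call shape and assumes a single prompt string is accepted. It is not the SGLang engine.

```python
import asyncio


class EchoEngine:
    """Hypothetical stand-in that mimics the async_generate call shape."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion of {prompt!r}>"}


async def generate_concurrently(engine, prompts, sampling_params):
    # One coroutine per prompt; gather returns results in input order.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


results = asyncio.run(
    generate_concurrently(EchoEngine(), ["a", "b"], {"temperature": 0.8})
)
print([r["text"] for r in results])
```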

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper async_stream_and_merge instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Age] year old [Gender] from [Your hometown]. I'm a [Occupation], and I've always been [Adaptable, Nervous, Creative, etc.] and I'm always learning and growing. I'm not afraid to take risks and try new things, and I'm always trying to improve myself in order to become the best version of myself possible. I'm passionate about [My Passion], and I'm always there for my friends and family. I have a great sense of humor and I'm always ready to laugh and have fun with friends and family. I love to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the "City of Love" due to its romantic history and cultural attractions. It is the second-largest city in France, having a population of over 7 million and is home to many famous landmarks and attractions, including the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. Paris is a UNESCO World Heritage site and is considered a cultural hub in the world, with a rich history and a lively atmosphere that makes it a popular tourist destination. The city is also home to many famous restaurants, shopping districts, and entertainment venues. It is an essential part of French culture and society, with its

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to see exponential growth, with many unexpected developments and innovations shaping our future. Here are some possible future trends in AI:

1. Deep learning: Deep learning is a subset of AI that uses algorithms to solve problems that require a high level of detail and data, such as natural language processing, computer vision, and speech recognition. This technology is expected to continue to improve, leading to even more advanced artificial intelligence systems.

2. Explainability: AI systems that are too complex or complicated to explain their decisions can be difficult to trust. Explanatory models are being developed to provide more transparency and understanding of AI systems.

3. Autonomous vehicles

In [6]:
llm.shutdown()
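
If generation raises an exception, the `shutdown()` call above is never reached. One way to guard against that is a small context manager that always shuts the engine down. This is a sketch only: `engine_session` is a hypothetical helper, and `DummyEngine` is a stand-in with the same `shutdown()` method, so the example runs without loading a model.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine):
    """Yield the engine and always call its shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


class DummyEngine:
    """Hypothetical stand-in exposing the same shutdown() method."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


dummy = DummyEngine()
try:
    with engine_session(dummy) as llm_like:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(dummy.closed)
```

With a real engine, the same pattern would wrap the object returned by `sgl.Engine(...)`.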