# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
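
As background, the standard library refuses to start a second event loop from inside a running one, which is the situation a notebook kernel creates. The sketch below reproduces that error with plain `asyncio` only; it does not touch SGLang, and the names `inner` and `outer` are illustrative. How the engine drives its loop internally is not shown in this document, so treat this as an illustration of the general restriction that `nest_asyncio` relaxes.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

With `nest_asyncio.apply()` in place, nested calls of this kind are permitted, which is why the patch is applied before the engine is created.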

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.36it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Chastity, and I was born and raised in Cambridge, Massachusetts. I am originally from the countryside where my parents raised me, but we moved to Boston, Massachusetts when I was about 18 years old. I'm married and have two children, two daughters and a son, who are now 11 and 10 years old, and 7 and 5 years old. I grew up in the heart of the city. I live in a small house in Westwood, which is part of the Boston suburbs, and I have a wonderful family who love to help me. I love to explore, travel, and
Prompt: The president of the United States is
Generated text:  a man. A. 2 B. 3 C. 4
Answer: A

Which of the following statements is true? ____ 
A. The IP address 202.113.1.10 is a Class B address.
B. The IP address 202.113.1.10 is a Class C address.
C. The IP address 202.113.1.10 is a Class D address.
D. The IP address 202.113.1.10 is a Class E address.
Answer:
Prompt: The capital of France is
Generated text:  ____
A. Bordeaux
B. Paris
C. Saint-

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [Age] year old [Occupation]. I am a [Skill] who has been [Number of Years] years in the industry. I am passionate about [What I Love to Do]. I am a [Favorite Thing to Do] and I enjoy [Why I Love It]. I am a [Favorite Book] and I love [Why I Love It]. I am a [Favorite Movie] and I love [Why I Love It]. I am a [Favorite Music] and I love [Why I Love It]. I am a [Favorite Sport] and I love [Why I Love It].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, located in the south of the country and is the largest city in the country. It is known for its rich history, beautiful architecture, and vibrant culture. Paris is home to many famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also known for its annual festivals and events, including the World Cup, the Eiffel Tower Festival, and the Paris Fashion Week. Paris is a popular tourist destination and is a major economic center in France. It is also home to many important institutions such as the French Academy of Sciences and the French Parliament. The city is known for

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we interact with technology and the world around us. Here are some of the potential trends that are likely to shape the future of AI:

1. Increased automation and robotics: As AI technology continues to advance, we are likely to see an increase in automation and robotics in various industries. This could lead to the creation of more efficient and productive machines that can perform tasks that were previously done by humans.

2. Enhanced human-computer interaction: As AI technology continues to improve, we are likely to see a greater emphasis on human-computer interaction. This could involve the
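
The banner in this cell mentions overlap removal. The internals of `stream_and_merge` are not shown in this document, so the following is only a sketch of what overlap-aware merging could look like: before appending a new chunk, drop the part of it that already appears at the end of the accumulated text. The helper names `strip_overlap` and `merge_chunks` are hypothetical and are not part of the SGLang API.

```python
def strip_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing."""
    max_len = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_chunks(chunks):
    """Concatenate chunks, removing text repeated across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


chunks = ["The capital", " capital of France", " France is Paris."]
merged = merge_chunks(chunks)
print(merged)
```

One caveat of this approach is that it cannot tell an accidental overlap from an intentional repetition, so a chunk that legitimately repeats the end of the previous text would be shortened as well.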



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a software engineer with a strong passion for [X] technology. I have been working on this technology for [X] years and have honed my skills and knowledge through [X] courses and [X] projects. I enjoy [X] and am always looking for ways to [X] and [X] with my colleagues and clients. Thank you. [Name]. Hi there, my name is [Name]. I am a [software engineer] with a passion for [X] technology. I have been working on this technology for [years] and have honed my skills and knowledge through [courses

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in the country and is known as the "city of light" due to its modern skyline. Paris is home to several world-renowned museums, monuments, and theaters, including the Louvre and Notre-Dame Cathedral. It is also known for its ric
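
An asynchronous call yields control to the event loop while it waits, so several requests can be in flight at once. The sketch below shows that pattern with `asyncio.gather`. It uses a hypothetical `StubEngine` that only mimics the call shape of `async_generate` seen above and returns placeholder text, so it runs without a model; whether concurrent calls improve throughput on the real engine is not covered here.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; it only mimics the call shape used above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # pretend to do work without blocking the loop
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Schedule every batch at once; gather returns results in input order.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(StubEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```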

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Occupation] who is passionate about [Your Passion]. I love [Reason for Passion] and am always seeking to [Action/Challenge]. How can I contribute to [Your Goal/Project]?

It's an exciting time in our community to be [Your Status]! As an [Occupation], I'm always on the lookout for new opportunities to [Action/Challenge]. How can I help [Your Goal/Project] and get involved?

Please let me know if you'd like me to share more details about myself. I'm looking forward to the possibility of working with you.

[Name

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the Var region of the Loire Valley and has been the capital city of France since 1830. It is the largest city in the world by population, with an estimated population of over 7.7 million in the city proper and an estimated 14.5 million worldwide. Paris is home to numerous famous landmarks and museums, including the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Louvre Palace. It is also known for its vibrant food culture, fashion scene, and cultural activities. Paris is a popular destination for tourists from around the world, and it is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  exciting and constantly evolving, with many potential applications and areas of growth. Here are some possible trends in AI that are likely to shape our world in the years to come:

1. Increased Integration with Other Technologies: AI is already being integrated into a wide range of technologies, including voice assistants like Siri and Alexa, self-driving cars, and medical imaging systems. In the future, we may see even more integration between AI and other technologies, such as the internet of things (IoT) and blockchain. For example, AI could be used to optimize the use of energy in homes and businesses, or to improve the efficiency of manufacturing processes.

2




In [6]:
llm.shutdown()
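
In a standalone script, it can help to guarantee that the engine is shut down even when generation raises. The sketch below shows one way to do that with `try`/`finally`. The `StubEngine` class and the `run_job` helper are hypothetical; the stub exposes only `generate()` and `shutdown()` so that the example runs without a model.

```python
class StubEngine:
    """Hypothetical stand-in exposing only generate() and shutdown()."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


def run_job(engine, prompts):
    # The finally clause runs on both success and failure.
    try:
        return engine.generate(prompts, {"temperature": 0.0})
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    run_job(engine, [])  # an empty prompt list makes the stub raise
except ValueError as exc:
    print(f"job failed: {exc}")
print("engine closed:", engine.closed)
```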