# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
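To illustrate why the patch is needed, here is a minimal sketch that uses only the standard library (no SGLang). It reproduces the kind of error raised when code tries to start an event loop from inside one that is already running, which is the situation in IPython and Jupyter. The coroutine names `inner` and `outer` are hypothetical and exist only for this illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, so starting
    # another one with asyncio.run() is rejected with a RuntimeError.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

After `nest_asyncio.apply()`, such nested calls are allowed to run instead of failing, which is why the notebook cells in this document apply it before creating the engine.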

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.49it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.48it/s]



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Sabrina. I am a senior at the University of Chicago where I study Computer Science. I am also the founder and lead developer of the social media platform, Notion, a platform that allows users to create online notes, plans, and schedules.
I spent most of my time at the University of Chicago in my third year, researching and creating algorithms to help solve the challenges faced in information systems. I also did research at MIT, the University of Chicago and the University of Waterloo.
I am currently in my final year of my studies, and I am also the founding and lead developer of Notion, an online productivity tool that allows users
Prompt: The president of the United States is
Generated text:  a person who is in charge of the United States. Is this statement true or false?
A. True
B. False
Answer:

B

In the Windows operating system, regarding the file dialog box's open and save functions, which of the following descriptions is correct?
A. Whe
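
In offline batch jobs it is often convenient to store each prompt together with its completion. The following is a minimal sketch, assuming only that each output is a dict with a `"text"` key, as used in the loop above. The helper `to_jsonl` and the sample data are hypothetical stand-ins, so the snippet runs without loading a model.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text and serialize as JSON Lines."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
sample_prompts = ["The capital of France is", "The future of AI is"]
sample_outputs = [{"text": " Paris."}, {"text": " uncertain."}]

jsonl_text = to_jsonl(sample_prompts, sample_outputs)
print(jsonl_text)
```

With a real engine, `sample_prompts` and `sample_outputs` would be replaced by the `prompts` list and the `outputs` returned by `llm.generate`.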

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Gender] [Occupation]. I'm a [Skill] with [Number] years of experience in [Field]. I'm passionate about [What you do for a living] and I'm always looking for ways to [What you do to improve yourself]. I'm [What you do to improve yourself] and I'm always looking for ways to [What you do to improve yourself]. I'm [What you do to improve yourself] and I'm always looking for ways to [What you do to improve yourself]. I'm [What you do to improve yourself] and I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also famous for its fashion industry, art, and cuisine. Paris is a popular tourist destination and a major economic center in France. It is home to many world-renowned museums, theaters, and restaurants. The city is also known for its annual Eiffel Tower Festival, which attracts millions of visitors each year. Paris is a city of contrasts,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI systems will become more integrated with human intelligence, allowing them to learn and adapt to new situations. This will enable AI to perform tasks that are currently beyond the capabilities of humans, such as playing chess or driving a car.

2. Enhanced natural language processing: AI will continue to improve its ability to understand and respond to natural language, allowing for more sophisticated and context-aware interactions with humans.

3. Improved privacy and security: As AI systems become more integrated with human intelligence, there will be increased concerns about privacy and security. AI
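
The label printed above mentions overlap removal. One way such merging could work is to drop the longest prefix of each new chunk that already appears at the end of the accumulated text. The sketch below illustrates that idea with hypothetical helpers `trim_overlap` and `merge_chunks` and hand-written chunks. It is an assumption for illustration and not necessarily how `stream_and_merge` is implemented.

```python
def trim_overlap(existing_text, new_chunk):
    """Return new_chunk without the prefix that repeats the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first, then shrink.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text that overlaps the merged result."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hand-written chunks in which some chunks repeat the tail of the previous text.
chunks = [" Paris", " Paris is the", " the capital", " of France."]
merged_text = merge_chunks(chunks)
print(repr(merged_text))
```

A trade-off of this approach is that text which genuinely repeats at a chunk boundary would also be treated as overlap and dropped.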



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm [Age]. I'm a friendly, relatable character who can provide you with reliable information and advice on a wide range of topics. I love to help people with their questions and interests, and I believe that I can make a positive difference in people's lives. My goal is to help others and bring joy to their day. I believe in doing my best work and taking pride in the work I do, and I strive to be honest and open with others. I have a positive and optimistic outlook on life and strive to make the most of my skills and abilities. I am always ready to assist and make

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the south-west of the country.

You are an AI assistant that helps you understand the logic of a problem and helps you solve it. He/She doesn’t write, but helps you by giv
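
Asynchronous generation is most useful when application code issues several requests concurrently. The following sketch shows one common pattern, bounded concurrency with `asyncio.Semaphore` and `asyncio.gather`. The coroutine `fake_generate` is a hypothetical stand-in for a real awaitable engine call, so the example runs without a model; `generate_all` and `max_concurrency` are likewise names invented for this illustration.

```python
import asyncio


async def fake_generate(prompt):
    """Hypothetical stand-in for an awaitable engine call; returns a dict with 'text'."""
    await asyncio.sleep(0.01)  # simulate generation latency
    return {"text": f" <completion for: {prompt}>"}


async def generate_all(prompts, max_concurrency=2):
    """Run one request per prompt, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(prompt):
        async with semaphore:
            return await fake_generate(prompt)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(worker(p) for p in prompts))


demo_prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
results = asyncio.run(generate_all(demo_prompts))
for prompt, result in zip(demo_prompts, results):
    print(f"{prompt!r} -> {result['text']!r}")
```

Because `gather` preserves input order, results can be zipped back to their prompts exactly as in the cell above.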

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  ______. I am a/an ________.
[optional details about your background, education, and interests]
If you could change anything about yourself, what would it be? I would like to change my name, as I do not want to be identified by it.
Is there anything I would like to know about my character before we get started? Please let me know what you think. **Your Name:** [optional] **Your Profession:** [optional] **Your Interests:** [optional] **Your Education:** [optional] **Your Background:** [optional]
I would also like to change the name of the fictional character to [your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the most populous city and the cultural and economic center of the country.

Paris is the capital of France and the seat of the government. The city is known for its historical landmarks, vibrant arts scene, and annual festivals. It is also famous for its cuisine, fashion, and fashion industry, and is home to many world-renowned museums, attractions, and entertainment venues. Despite its size, Paris has a lively nightlife and numerous cultural venues, making it a popular tourist destination.

Paris is home to over 10 million people, and it is the second-most populous city in the world after New York City. The city has a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve a variety of technologies and applications that are expected to advance at unprecedented rates. Some of the potential future trends in AI include:

1. More intelligent devices: With the advent of blockchain and other distributed ledger technologies, there is potential for more intelligent devices that can learn and adapt to their environment, making them more reliable and efficient than traditional computer systems.

2. Better personalization: AI is already being used to personalize experiences for users, but there is potential for even greater personalization through more advanced algorithms that can analyze a user's data and tailor their experiences to their specific needs.

3. More autonomous vehicles: Autonomous vehicles are already
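
The streaming cell consumes an async generator chunk by chunk and prints each chunk with `end=""` so that the text appears on one continuous line. The sketch below shows the same consumption pattern without a model. `fake_stream` is a hypothetical stand-in for a streaming engine call and `consume` is a name invented for this illustration.

```python
import asyncio


async def fake_stream(text):
    """Hypothetical stand-in for a streaming call: yields one word-sized chunk at a time."""
    for word in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield word + " "


async def consume(text):
    """Print chunks as they arrive and return the assembled text."""
    pieces = []
    async for chunk in fake_stream(text):
        print(chunk, end="", flush=True)  # no newline, so chunks form one line
        pieces.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(pieces).rstrip()


streamed_text = asyncio.run(consume("Paris is the capital of France."))
```

Collecting the chunks in a list while printing them keeps the full text available after the stream ends, which is handy for logging or post-processing.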




```python
llm.shutdown()
```