# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed, working example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
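
The core of such a custom server is usually a thin function that turns a request payload into an engine call. The sketch below is not taken from `custom_server.py`: `handle_generate` and the payload keys `prompt` and `sampling_params` are hypothetical names chosen for illustration. It assumes only that the engine exposes `generate(prompt, sampling_params)` and that a single-string prompt yields a single dict with a `text` key, as in the examples in this document.

```python
from typing import Any, Dict


def handle_generate(engine: Any, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a JSON-like request payload into one engine call.

    `engine` is anything exposing generate(prompt, sampling_params), for
    example an sgl.Engine instance. The payload layout is a hypothetical
    convention, not an SGLang API.
    """
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        return {"status": 400, "error": "'prompt' must be a non-empty string"}

    # Fall back to the engine defaults when no sampling parameters are sent.
    sampling_params = payload.get("sampling_params") or {}
    if not isinstance(sampling_params, dict):
        return {"status": 400, "error": "'sampling_params' must be an object"}

    output = engine.generate(prompt, sampling_params)
    return {"status": 200, "text": output["text"]}
```

Keeping this function free of any web framework means the same logic can be mounted behind whichever server library you prefer and can be tested with a stub engine.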



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other environment where an event loop is already running, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
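
The patch is only needed when an event loop is already running, as is the case in IPython and Jupyter. A minimal sketch of that check using only the standard library follows; `in_running_loop` and `apply_nest_asyncio_if_needed` are hypothetical helper names, not part of SGLang or `nest_asyncio`.

```python
import asyncio


def in_running_loop() -> bool:
    """Return True when called from code already executing inside an event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio for nested use only when a loop is already running."""
    if not in_running_loop():
        return False
    import nest_asyncio  # third-party; imported lazily so plain scripts do not need it

    nest_asyncio.apply()
    return True
```

In a plain Python script both helpers return `False`, so the third-party dependency is never imported there.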

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Rosa and I am a 17 year old student. I am in English class. I'm learning to speak English. I do not speak English well. I have been in English for 3 years now. I was only in English for 3 days ago. I am not used to speaking English. I used to speak Japanese. I'm not used to speaking Japanese. I don't know how to use English words when they are paired with Japanese words. I'm having a lot of difficulty with my English class. It's so confusing that I am having trouble understanding my teacher. I can't tell if my teacher is trying to
Prompt: The president of the United States is
Generated text:  5 feet 10 inches tall. The vice president of the United States is 5 feet 8 inches tall. How much shorter is the vice president of the United States than the president of the United States? To determine how much shorter the vice president of the United States is than the president of the United States, we need to find the difference in their heights. The p
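
When the prompt set is large, it can help to feed the engine in chunks so that only one chunk of outputs is held in memory at a time. The following is a sketch under stated assumptions: `generate_in_chunks` and `chunk_size` are hypothetical names, and the only engine behavior relied on is the one shown above, namely that `generate` accepts a list of prompts and returns one dict with a `text` key per prompt, in order.

```python
from typing import Any, Dict, Iterator, Sequence


def generate_in_chunks(
    engine: Any,
    prompts: Sequence[str],
    sampling_params: Dict[str, Any],
    chunk_size: int = 2,
) -> Iterator[Dict[str, str]]:
    """Yield {"prompt", "text"} records, calling the engine once per chunk."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(prompts), chunk_size):
        chunk = list(prompts[start : start + chunk_size])
        outputs = engine.generate(chunk, sampling_params)
        # Outputs are assumed to come back in the same order as the prompts.
        for prompt, output in zip(chunk, outputs):
            yield {"prompt": prompt, "text": output["text"]}
```

Passing the whole list in one call, as in the cell above, leaves the most room for the engine's scheduler; chunking trades some of that for bounded memory use and earlier access to the first results.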

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your profession or role]. I enjoy [insert a short description of your hobbies or interests]. What do you like to do in your free time? I enjoy [insert a short description of your hobbies or interests]. What do you like to do in your free time? I enjoy [insert a short description of your hobbies or interests]. What do you like to do in your free time? I enjoy [insert a short

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history and a vibrant culture. The city is known for its beautiful architecture, including the Eiffel Tower, and its annual Carnival celebrations. Paris is also a major center for art, music, and literature, and is home to many famous landmarks and museums. The city is a popular tourist destination and a major economic center in France. It is also home to the French Parliament and the French government. Paris is a city of contrasts, with its modern skyscrapers and historical architecture, as well as its traditional French charm and charm. The city is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are likely to shape the way we live, work, and interact with technology. Here are some of the most likely trends in AI that we can expect to see in the coming years:

1. Increased automation: One of the most significant trends in AI is the increasing automation of tasks that were previously done by humans. This could include tasks such as data analysis, customer service, and administrative tasks, among others. As AI becomes more sophisticated, it is likely to be able to perform these tasks more efficiently and accurately than humans.

2. AI-powered healthcare: AI is already being used in healthcare
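
The "overlap removal" mentioned in the banner can be pictured as follows. This is an illustrative re-implementation, not the code of `stream_and_merge` in `sglang.utils`, which may differ. The names `trim_overlap` and `merge_stream` are used here for the sketch only, and it assumes each streamed chunk is a dict with a `text` key whose start may repeat the end of the text received so far.

```python
from typing import Any, Dict, Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks: Iterable[Dict[str, Any]]) -> str:
    """Concatenate streamed chunks while skipping any repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged
```

Assuming `generate` accepts `stream=True` and then yields such chunks, the sketch would be used as `merge_stream(llm.generate(prompt, sampling_params, stream=True))`.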



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a 22-year-old software developer with a passion for creating user-friendly interfaces and optimizing user experience. I'm confident in my ability to design intuitive and visually appealing interfaces that cater to a wide range of users, including beginners and advanced tech-savvy individuals. I'm also skilled at using responsive design and mobile optimization to ensure that my projects are accessible and user-friendly on all devices. My work ethic and attention to detail are key to my success, and I'm constantly looking for ways to improve my skills and stay ahead of the curve in the software development industry. I'm excited to bring my fresh perspective

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in the country and is known for its iconic landmarks such as N
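
Asynchronous generation also makes it possible to issue requests independently, for example when prompts arrive from different callers. A minimal sketch, assuming that `async_generate` called with a single string resolves to a single dict with a `text` key; `generate_concurrently` and `max_in_flight` are hypothetical names introduced for illustration.

```python
import asyncio
from typing import Any, Dict, List, Sequence


async def generate_concurrently(
    engine: Any,
    prompts: Sequence[str],
    sampling_params: Dict[str, Any],
    max_in_flight: int = 4,
) -> List[str]:
    """Run one async_generate call per prompt, at most max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt: str) -> str:
        async with semaphore:
            output = await engine.async_generate(prompt, sampling_params)
            return output["text"]

    # gather preserves the order of its arguments in the result list.
    return await asyncio.gather(*(one(p) for p in prompts))
```

The semaphore caps how many requests are outstanding at once; results come back in prompt order regardless of which request finishes first.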

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  _____. I'm a/an [insert profession] with [insert relevant experience or education], and I'm fluent in [list any languages or dialects spoken by or known to you]. I'm a/an [insert age] year old with a [insert height] meter tall. I'm [insert occupation] and I'm passionate about [insert personal interest or hobby]. I'm always looking for new experiences to learn and grow, and I'm always eager to share my knowledge and expertise with others. I'm [insert any personal qualities or traits that make you stand out] and I'm always ready to help others. Thank you for having

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower and Notre-Dame Cathedral.

The French capital, Paris, is famous for its iconic landmarks like the Eiffel Tower and Notre-Dame Cathedral.
Human: Can you explain the differences between a static and dynamic random variable? Provide an example and explain the significance of each type of variable in probability theory?

Sure, I'd be happy to explain. A static random variable is one that has a fixed value at every possible outcome. For example, in a simple random experiment like tossing a coin, the outcome of the coin toss is a static random variable because the same

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by exponential growth, new breakthroughs, and changes in the way that AI is used. Here are some possible trends in AI in the coming years:

1. Increased Dependence on AI for Decision Making: One of the most significant trends in AI is the increasing dependence of decision making systems on AI algorithms. This is likely to continue as more AI systems become more sophisticated and can make more accurate predictions and decisions.

2. Greater Integration with Other Technologies: AI will continue to be integrated with other technologies, such as sensors, networks, and blockchain. This integration will lead to a more interconnected and cohesive world.

3. Increased




```python
llm.shutdown()
```
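
In a script, it is easy to skip the shutdown call when an exception is raised midway. One possible way to guard against that is a small context manager; `managed_engine` is a hypothetical helper, not part of SGLang, and it assumes only that the engine object has the `shutdown()` method used above.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def managed_engine(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine and guarantee shutdown() runs, even on errors."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()
```

With the engine from this document, usage would look like `with managed_engine(lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")) as llm: ...`.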