# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
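
As a rough illustration of the custom-server idea, the sketch below shows a framework-agnostic request handler that any web framework could call. `handle_generate_request` and `StubEngine` are hypothetical names introduced here; the stub only mimics the `generate(prompts, sampling_params)` call shape and the `{"text": ...}` outputs used later in this document, so the example runs without a model. The linked `custom_server` script remains the reference.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a model."""

    def generate(self, prompts, sampling_params=None):
        # Mimic the output shape used in this document: one dict with a "text" key per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate_request(engine, body: bytes) -> bytes:
    """Parse a JSON request body, call the engine, and return a JSON response body."""
    request = json.loads(body)
    prompts = request["prompts"]
    sampling_params = request.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    response = {"completions": [out["text"] for out in outputs]}
    return json.dumps(response).encode("utf-8")


reply = handle_generate_request(
    StubEngine(),
    json.dumps({"prompts": ["The capital of France is"]}).encode("utf-8"),
)
print(reply.decode("utf-8"))
```

Keeping the handler independent of the web framework means the same function can be reused behind whichever server you choose; swapping `StubEngine()` for a real engine instance is the only change the sketch assumes.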



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
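
To check whether your code is in the nested-loop situation described above, you can ask `asyncio` whether a loop is already running. The sketch below uses only the standard library; `loop_is_running` is a hypothetical helper name, not part of SGLang or `nest_asyncio`.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running in this thread.
        return False
    return True


async def check_inside_loop() -> bool:
    return loop_is_running()


outside = loop_is_running()  # plain script: no loop is running yet
inside = asyncio.run(check_inside_loop())  # inside a coroutine: a loop is running
print(outside, inside)
```

In a plain Python script the first check is `False`; in a Jupyter or IPython session it is typically `True`, which is the case where applying `nest_asyncio` is needed.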

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Zhen. I am a junior in 2010. I'm a big fan of online chatting and mobile phones. When I'm not busy with school work, I'm always excited to play a game. My best friend is Eric. We both like different kinds of games. We don't play together very often. I usually play games in the afternoon. When we play games, I talk about our classes and homework. Eric likes to play games, too. He plays some games on his mobile phones and sometimes he plays games on his computers. At home, he plays games in the morning. He also watches TV and reads
Prompt: The president of the United States is
Generated text:  elected by popular vote. The president has been in office for 22 years, and the president of another country has been in office for 20 years. If the president of the United States has a presidency that is equivalent to the total sum of the presidents of both countries, how long is the combined presidency of the two countries? To find the combined presidenc
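
For offline batch jobs it is often convenient to persist the results. The sketch below writes prompt/completion pairs to a JSON Lines file, assuming only that each output is a dict with a `"text"` key, as in the loop above. `save_generations` is a hypothetical helper, and the example data is a stand-in rather than real model output.

```python
import json
import os
import tempfile


def save_generations(prompts, outputs, path):
    """Write one JSON object per line, pairing each prompt with its generated text."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "text": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Stand-in data with the same shape as the `outputs` list above (not real model output).
example_prompts = ["Hello, my name is", "The capital of France is"]
example_outputs = [{"text": " <generated text 1>"}, {"text": " <generated text 2>"}]

path = os.path.join(tempfile.mkdtemp(), "generations.jsonl")
save_generations(example_prompts, example_outputs, path)

# Read the file back to confirm the round trip.
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(records)
```

With a real engine you would pass `prompts` and the `outputs` returned by `llm.generate(...)` instead of the stand-in lists.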

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] at [company name], and I'm excited to meet you. I'm a [job title] at [company name], and I'm excited to meet you. I'm a [job title] at [company name], and I'm excited to meet you. I'm a [job title] at [company name], and I'm excited to meet you. I'm a [job title] at [company name], and I'm excited to meet you. I'm a [job title] at [company name],

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous world-renowned museums, theaters, and restaurants. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is known for its rich history, diverse culture, and vibrant nightlife. It is often referred to as the "City of Light" and is a major cultural and economic center of Europe. Paris is the capital of France and is a major hub for international business and diplomacy. It is also known for its

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ethical considerations. This includes issues such as bias, privacy, and transparency. AI developers will need to take a more responsible approach to their work, and will need to consider the potential impact of their technology on society.

2. Integration with human decision-making: AI is likely to become more integrated with human decision-making in the future. This could involve the use of AI to assist
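
The "overlap removal" mentioned above can be pictured as follows: when a streamed chunk begins with text that was already emitted, only the new suffix is appended. The sketch below illustrates that idea in plain Python. It is not the implementation of `stream_and_merge`; `trim_overlap_sketch` and `merge_chunks` are hypothetical names, and the chunks are made up.

```python
def trim_overlap_sketch(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that duplicates a suffix of existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Made-up chunks in which each chunk repeats the tail of the previous one.
chunks = [" Paris is", " is the capital", " capital of France."]
print(repr(merge_chunks(chunks)))  # ' Paris is the capital of France.'
```

Note that a purely textual heuristic like this one cannot tell an accidental overlap from a genuine repetition in the model's output, so it is best treated as an illustration of the concept.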



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First Name] and I'm a [Primary职业或称号] who enjoys [Primary interest]. I'm [Your Age] years old, and I come from [Your hometown]. I'm [Your nationality or location]. I'm a [Your occupation or profession]. I've always been [Your main interest or hobby]. I'm passionate about [Your favorite hobby or interest]. I enjoy [Your favorite hobby or interest]. I'm [Your hometown or hometown's nickname]. I'm [Your occupation or profession]. I've always been [Your main interest or hobby]. I'm passionate about [Your favorite hobby or interest]. I enjoy [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as Louvain.

The city is located in the center of the country, surrounded by the Rhône Valley and the Mediterranean Sea, and is the largest city in the European Union. It is home to the Eiffel Tower, the Louvr
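
The asynchronous API also makes it possible to issue requests independently, for example as they arrive, and to await them together. The sketch below shows that pattern with `asyncio.gather` and a semaphore that caps the number of in-flight requests. `StubAsyncEngine` is a hypothetical stand-in so the example runs without a model; that a single-prompt call returns a single dict is an assumption of the stub.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for the engine; only mimics an async_generate call."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # pretend to do some work
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(engine, prompts, sampling_params, max_in_flight=2):
    """Issue one request per prompt, with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather() returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(one(p) for p in prompts))


example_prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
results = asyncio.run(
    generate_concurrently(StubAsyncEngine(), example_prompts, {"temperature": 0.8})
)
for prompt, result in zip(example_prompts, results):
    print(prompt, "->", result["text"])
```

When all prompts are known up front, passing the whole list in one call, as in the cell above, is simpler and leaves scheduling to the engine.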

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name] and I'm a [insert character's age] year old who was born in [insert birthplace] and currently living in [insert location]. I'm an [insert occupation or field of study] with a [insert hobby or interest] that I've been passionate about since childhood. I'm always looking for new things to learn and expand my knowledge, and I'm constantly updating my knowledge through reading, studying, and networking with others who share my interests. I'm always eager to learn and grow as a person, and I'm dedicated to making the most of my time and opportunities to help others. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the largest city and the most populous city in France, and is the seat of government and the largest metropolitan area in the European Union. It is located in the Hauts-de-France region and is situated on the river Seine, which flows through the city. Paris is known for its classical architecture, vibrant arts and culture, and world-renowned museums, such as the Louvre and the Musée d'Orsay. The city is also home to the Eiffel Tower, the Louvre Museum, and the Champs-Élysées, and has a rich history dating back to the Roman Empire and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to see significant advancements in several areas, including:

1. Improved AI Ethics and Privacy: As more data is collected and analyzed, there is a growing concern about ethical and privacy issues. AI systems are expected to become more transparent and responsible, with greater emphasis on using AI in a manner that is fair and beneficial to all.

2. Increased Integration of AI with Human Interaction: AI is already being integrated into everyday life through chatbots, virtual assistants, and other forms of AI-powered interaction. As this technology continues to advance, it is expected to become even more integrated with human interaction, such as through language translation and emotional intelligence.

3

In [6]:
llm.shutdown()
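
In a script, it can help to guarantee that `shutdown()` runs even if generation raises an exception. The sketch below wraps an engine in a small context manager; `managed_engine` and `StubEngine` are hypothetical names used only for illustration, and the stub exposes just the `generate()` and `shutdown()` methods used in this document.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only generate() and shutdown()."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown(), even if generation raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
with managed_engine(engine) as llm_stub:
    outputs = llm_stub.generate(["Hello, my name is"])

print(outputs[0]["text"], engine.is_shut_down)
```

A plain `try` / `finally` block around the generation code achieves the same effect if you prefer not to define a helper.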