# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. It suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, add the following before creating the engine:
```python
import nest_asyncio

nest_asyncio.apply()
```
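The reason for this patch can be illustrated with the standard library alone. The sketch below is not SGLang code; `inner` and `outer` are hypothetical names. It only shows that `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside a notebook cell.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a nested asyncio.run() is rejected by asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are allowed, which is why the notebook cells in this document can call `asyncio.run(main())` directly.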

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Harriet and I live in Port Elizabeth, South Africa. What is your name? And where are you from? And what is your nationality? And how long have you been living in Port Elizabeth?
As Harriet, I live in Port Elizabeth, South Africa, and I was born and raised there. I am of English descent. I have been living in Port Elizabeth for over 25 years. I am an African American. I am from Port Elizabeth and I was born here. I was also raised in this city. I have lived there for many years. I have a beautiful home in Port Elizabeth and I enjoy being here.
Prompt: The president of the United States is
Generated text:  very busy. He has to deal with so many problems. He has to deal with the problems of the country. He has to deal with the problems of the world. He has to deal with the problems in his own family. He has to deal with the problems in his church. He has to deal with the problems of his children. He has to deal with the problems of his grandchild

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, and is the largest city in Europe by population. It is located in the south of France and is the seat of government, administration, and culture for the country. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. It is also home to many famous museums, such as the Musée d'Orsay, the Musée Rodin, and the Musée d'Orsay. Paris is a popular tourist destination and is known for its rich history, art, and culture.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some potential future trends include:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, reduce costs, and increase efficiency. As AI technology continues to advance, we can expect to see even more sophisticated applications in this area.

2. AI in finance: AI is already being used in finance to improve fraud detection, risk management, and trading algorithms. As AI technology continues to evolve, we can expect to see even more sophisticated applications in this area.

3. AI in transportation: AI
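The "overlap removal" mentioned in the cell above can be pictured with a small, self-contained sketch. This is an illustration under assumptions, not necessarily how `stream_and_merge` is implemented: the helper names `trim_overlap` and `merge_chunks` and the `fake_stream` data are hypothetical, and no engine is involved. It assumes each streamed chunk is a dict with a `"text"` field that may repeat the tail of the text received so far.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while removing the repeated overlap between them."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk["text"])
    return merged


# Hypothetical chunks: each one repeats part of the text seen so far.
fake_stream = [
    {"text": " Paris"},
    {"text": " Paris, known"},
    {"text": " known as the City"},
    {"text": " City of Light."},
]
merged_text = merge_chunks(fake_stream)
print(merged_text)
```

Note that this kind of trimming is a heuristic: if a model legitimately repeats itself across a chunk boundary, the repeated part would be dropped as well.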



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am an experienced [occupation or hobby], [insert a profession or hobby for your occupation]. I have been in this field for [X] years now and have accumulated a lot of [mention specific skills or knowledge gained]. I am currently working as a [insert current job title or role] and have always been passionate about [mention something about your current role that makes you excited to learn new things or pursue new hobbies]. I enjoy [mention something about your current role that makes you happy and motivated]. I have a deep passion for [insert something specific about your current role that you find most rewarding or makes you feel

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. 
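One reason to prefer the asynchronous API is that other coroutines can make progress while a request is pending. The sketch below shows that pattern with `asyncio.gather`. It uses `FakeEngine`, a hypothetical stand-in that only mimics the list-of-dicts output shape seen above, so it runs without a model; `run_batches` is likewise an illustrative name. How the real engine behaves under concurrent calls is not covered here.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; it only mimics the output shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait for the model
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Start all batches at once and wait until every one has finished.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    run_batches(FakeEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list lines up with its input batch.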

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], and I am [age] years old. I was born in [date] and raised in [city]. My favorite hobbies include [mention hobbies here]. I am a [occupation] who enjoys [mention what they enjoy most here]. I am currently [state of being] and I [positive trait].

[Name] is a [occupation] who loves [mention hobby or activity]. [Name] was a [parent], [mother's name], and [father's name]. [Name] is a [occupation] who enjoys [mention hobby or activity]. [Name] was born in [date] and raised in [



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known as the City of Light.

Facts about Paris:

1. It is the largest city in Europe and the third largest in the world by area.
2. It is situated on the banks of the Seine River in the Île de France.
3. Paris is home to the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral.
4. It is a world-renowned center of culture, art, and fashion.
5. It has a diverse population, with more than 6 million residents and a large immigrant population.
6. The French Quarter is a neighborhood in the city,



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be heavily influenced by several key trends, including:

1. Increased availability of AI technology: As AI technology continues to improve, we can expect to see it become more accessible to businesses and individuals alike. This will likely lead to greater adoption of AI solutions and the development of new AI applications.

2. AI will continue to be used for various purposes, such as healthcare, finance, and transportation, as more applications of AI are realized.

3. AI will continue to be used for specific tasks, such as language translation, image recognition, and natural language processing, as these tasks require more sophisticated AI solutions.

4. AI will be




```python
llm.shutdown()
```
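In a script, as opposed to a notebook, it can be useful to make sure the shutdown call runs even when generation fails. A minimal sketch of that pattern with `try`/`finally` follows. `FakeEngine` is a hypothetical stand-in so the snippet runs without a model; with the real engine you would call `llm.shutdown()` in the `finally` block.

```python
class FakeEngine:
    """Hypothetical stand-in for sgl.Engine, used only to show the pattern."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


engine = FakeEngine()
try:
    # An empty prompt list makes the stand-in raise, simulating a failure.
    engine.generate([], {"temperature": 0.8})
except ValueError as exc:
    print(f"generation failed: {exc}")
finally:
    # Runs whether or not generate() raised.
    engine.shutdown()
```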