# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed, working example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio

Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop (a nested event loop), you need to add the following code:

```python
import nest_asyncio

nest_asyncio.apply()
```

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# A list of prompts is submitted as one batch; outputs come back in the same order.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Kristina and I am from the United States of America. I am a medical doctor and I help people understand what is happening in the brain and help them to get better. This article is about the effects of pain on the brain. The article will be based on the information that I have learned. The pain that I am talking about is the pain that is caused by cancer. I have a lot of anxiety about the pain that is caused by cancer and I am very worried that the pain will affect my mind. I am very scared and anxious and I am concerned about the pain that will affect my mind. The pain that is caused by
Prompt: The president of the United States is
Generated text:  trying to decide how many military vehicles to purchase. He has three options: 100 tanks, 300 armored vehicles, and 500 commandos. He estimates that each tank requires 20 hours of maintenance, each armored vehicle requires 30 hours of maintenance, and each commando requires 5 hours of maintenance. A
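
For very large prompt sets it can be convenient to submit the work in slices, for example to save intermediate results between slices. The sketch below assumes only what the cell above shows: `generate` takes a list of prompts plus a sampling-parameter dict and returns one dict with a `text` key per prompt. The helpers `chunked` and `generate_in_chunks` and the stand-in class `EchoEngine` are hypothetical and exist only so the example runs without a model.

```python
from typing import Any, Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def generate_in_chunks(
    engine: Any,
    prompts: Sequence[str],
    sampling_params: Dict[str, Any],
    chunk_size: int = 256,
) -> List[Dict[str, Any]]:
    """Call engine.generate once per slice and concatenate the outputs in order."""
    outputs: List[Dict[str, Any]] = []
    for chunk in chunked(prompts, chunk_size):
        outputs.extend(engine.generate(list(chunk), sampling_params))
    return outputs


class EchoEngine:
    """Hypothetical stand-in that mimics the assumed generate() interface."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


if __name__ == "__main__":
    results = generate_in_chunks(EchoEngine(), ["a", "b", "c"], {}, chunk_size=2)
    for result in results:
        print(result["text"])
```

With a real engine you would pass `llm` in place of `EchoEngine()`. Since the engine already schedules a batch internally, slicing is a bookkeeping convenience and is not expected to improve throughput.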

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # stream_and_merge consumes the stream and returns the merged text.
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? As an AI language model, I don't have a physical presence, but I'm always ready to assist you with any questions or tasks you may have. How can I help you today? Let me know if you have any questions or need any assistance. I'm here to help! [Name] [Company name] [Job title] [Company website] [LinkedIn profile] [Twitter handle] [GitHub profile] [Email address] [Phone number

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower and the Louvre Museum. It is also the seat of the French government and home to the French Parliament. Paris is a bustling metropolis with a rich history dating back to the Roman Empire and the French Revolution. The city is known for its fashion, art, and cuisine, and is a popular tourist destination. It is also home to many famous landmarks and attractions, including the Notre-Dame Cathedral and the Champs-Élysées. Paris is a vibrant and dynamic city that continues to evolve and grow. Its status as the capital of France is a testament to its importance

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased Use of AI in Healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare, including in areas such as diagnosis, treatment planning, and patient monitoring.

2. Increased Use of AI in Manufacturing: AI is already being used in manufacturing to improve efficiency, reduce costs, and increase
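
The "overlap removal" mentioned in the cell above can be illustrated with a small standalone sketch. It shows the general idea of dropping text that a new chunk repeats from the end of what has already been collected. It is not claimed to be the exact implementation inside `stream_and_merge`, and the names `trim_overlap` and `merge_chunks` are used here only for illustration.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that duplicates a suffix of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest candidate overlap first so the maximal repeat is removed.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks while removing repeated overlaps."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


if __name__ == "__main__":
    print(merge_chunks(["The cap", "capital of", " of France"]))
```

One caveat of this approach is that a chunk which legitimately starts with the same characters that end the previous text is also trimmed, so it only suits streams whose chunks may actually repeat earlier output.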



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; outputs come back in the same order as the prompts.
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I am a [insert your character's role or profession]. I have been following my passion for [insert something related to your character's field of interest or hobby], and I am always eager to learn and grow. I enjoy exploring new topics, asking questions, and always looking for ways to improve my skills. What kind of activities do you enjoy doing, and what is your favorite hobby? Feel free to add any personal anecdotes or experiences to make your introduction even more engaging. Just be sure to include any relevant details about your character, such as your personality traits, achievements, or notable experiences. Good luck with your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its famous landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. 

I
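
When requests arrive one at a time, for example inside a custom server, you may want to cap how many are awaited concurrently. The following is a minimal sketch using `asyncio.Semaphore` and `asyncio.gather`. It assumes that `async_generate` also accepts a single prompt string and then returns a single dict, which the cell above does not show, so treat that as an assumption to verify. `generate_bounded` and `FakeAsyncEngine` are hypothetical names used only so the example runs without a model.

```python
import asyncio
from typing import Any, Dict, List, Sequence


async def generate_bounded(
    engine: Any,
    prompts: Sequence[str],
    sampling_params: Dict[str, Any],
    max_in_flight: int = 8,
) -> List[Dict[str, Any]]:
    """Await one request per prompt, with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt: str) -> Dict[str, Any]:
        async with semaphore:
            # Assumption: a single string prompt yields a single output dict.
            return await engine.async_generate(prompt, sampling_params)

    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(one(p) for p in prompts))


class FakeAsyncEngine:
    """Hypothetical stand-in that mimics the assumed async_generate() interface."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0)  # yield control, as a real request would
        return {"text": f" <completion of {prompt!r}>"}


if __name__ == "__main__":
    results = asyncio.run(
        generate_bounded(FakeAsyncEngine(), ["a", "b", "c"], {}, max_in_flight=2)
    )
    for result in results:
        print(result["text"])
```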

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly;
        # each yielded chunk contains only text that has not been printed yet.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [Job Title] who is currently pursuing a [Degree] degree. I'm a [occupation] who loves [occupation-related activity] and I'm passionate about [other passion-related activity]. I've been working hard to grow as a professional and am always looking for opportunities to learn and grow even further. Thank you for taking the time to meet me! May I ask, what is your current occupation and what do you enjoy doing?
[Your Name]: Hello, my name is [Your Name], and I'm a [Job Title] who is currently pursuing a [Degree] degree. I'm

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the central region of the country. It is the largest city in France, with a population of around 10 million people, and is the cultural, economic, and political center of the country. Paris is known for its famous landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and Montmartre. The city is also home to many historical sites and museums, and is a UNESCO World Heritage site. Paris is also famous for its food culture, with its famous bouquets of baguettes and croissants. The city is home to many fashion and design brands, such

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be marked by significant advancements in many different areas, including:

1. Increased availability and accessibility: As AI continues to develop and improve, the cost of training models will decrease, making them more accessible to businesses and individuals. This will lead to a wider adoption of AI in various industries, including healthcare, transportation, and manufacturing.

2. Integration with other technologies: AI will continue to be integrated with other technologies, such as IoT, to create more efficient and effective systems. This integration will enable AI to perform tasks that would be difficult or impossible for humans to accomplish.

3. Emphasis on ethical and responsible use: As AI becomes
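
If you need the complete text as well as the live chunks, you can wrap any async chunk stream in a small collector. The sketch below relies only on what the cell above shows, namely that `async_stream_and_merge(llm, prompt, sampling_params)` can be consumed with `async for` and yields string chunks. `collect_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for the real stream so the example runs without a model.

```python
import asyncio
from typing import AsyncIterator, Callable, List, Optional


async def collect_stream(
    chunks: AsyncIterator[str],
    on_chunk: Optional[Callable[[str], None]] = None,
) -> str:
    """Consume an async stream of text chunks and return the joined text."""
    parts: List[str] = []
    async for chunk in chunks:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. print the chunk as soon as it arrives
        parts.append(chunk)
    return "".join(parts)


async def fake_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for a real chunk stream."""
    for piece in [" Paris", ",", " located"]:
        await asyncio.sleep(0)
        yield piece


if __name__ == "__main__":
    text = asyncio.run(
        collect_stream(fake_stream(), on_chunk=lambda c: print(c, end="", flush=True))
    )
    print()
    print(repr(text))
```

With a real engine, the first argument would be `async_stream_and_merge(llm, prompt, sampling_params)` in place of `fake_stream()`.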




```python
# Release the engine's resources when you are done.
llm.shutdown()
```
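
In a script, an exception raised between engine creation and `shutdown()` would skip the cleanup. One way to guard against that is a small context manager, sketched below. It assumes only that the engine object exposes the `shutdown()` method used above. `engine_session` and `FakeEngine` are hypothetical names, with `FakeEngine` standing in for a real engine so the example runs without a model.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def engine_session(engine: Any) -> Iterator[Any]:
    """Yield the engine and always call its shutdown() on exit, even on errors."""
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


if __name__ == "__main__":
    engine = FakeEngine()
    with engine_session(engine) as active:
        print("engine in use:", not active.closed)
    print("engine shut down:", engine.closed)
```

With a real engine this would read `with engine_session(sgl.Engine(model_path=...)) as llm:` followed by the generation calls.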