# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
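To see why this matters, the sketch below uses only the standard library to reproduce the situation that nested event loops create. It assumes nothing about SGLang itself: `inner` and `outer` are hypothetical coroutines, and the point is only that `asyncio.run()` refuses to start while another loop is already running, which is the restriction `nest_asyncio.apply()` is meant to lift.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, so a nested
    # asyncio.run() call is rejected with a RuntimeError.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Notebook kernels such as IPython typically keep an event loop running in the background, so code that starts its own loop hits the same error unless the loop is patched.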

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Lukas and I am an assistant professor at the University of Wisconsin-Madison. I have a PhD in mathematics from the University of Helsinki, and currently I am working on the field of topology. In this blog post, I want to share a little bit about the research topics and some interesting open problems in my area. I will share both theoretical and applied topics, and this will help the readers to understand the broad spectrum of my research interest. If you are not familiar with my research interests, I will explain that I am interested in topology. I will try my best to keep this post clear and concise. I hope you find the
Prompt: The president of the United States is
Generated text:  getting ready to address the nation. As he stands in the White House, he looks at a small collection of photographs and says, "This is the last of my family's family photos. I have to make a decision on the best way to dispose of them. So here's my decision: I will
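For offline batch jobs it is often convenient to persist results rather than print them. The following is a minimal sketch, assuming only what the cell above shows: each output is a dict with a `"text"` key, returned in the same order as the prompts. The helper `to_jsonl` and the sample `outputs` list are hypothetical stand-ins so the snippet runs without loading a model.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text and serialize as JSON Lines."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data shaped like the engine outputs shown above (assumption).
prompts = ["The capital of France is"]
outputs = [{"text": " Paris."}]

jsonl = to_jsonl(prompts, outputs)
print(jsonl)
```

With a real engine, `outputs` would be the list returned by `llm.generate(prompts, sampling_params)`.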

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [Age] year old [Occupation]. I am currently [Current Location] and I have been [Number of Years] years in this field. I am a [Skill/Ability] that I have honed over the years and I am always [Positive Trait]. I am passionate about [What I Love to Do], and I am always looking for ways to [What I Want to Improve]. I am a [Personality Type] and I am always [Positive Attitude]. I am [Favorite Color] and I am always [Friendly]. I am [Favorite Book] and I am always [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its cuisine, fashion, and art scene. Paris is a bustling metropolis with a diverse population and a rich cultural heritage. It is the largest city in France and the second-largest city in the world by population. The city is home to many world-renowned museums, theaters, and landmarks, making it a popular tourist destination.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we interact with technology and the world around us. Here are some of the most likely trends that could be expected in the future:

1. Increased automation and artificial intelligence: As AI becomes more advanced, it is likely to become more integrated into our daily lives. This could lead to the automation of many tasks, such as manufacturing, transportation, and customer service, which could result in increased efficiency and productivity.

2. AI-powered healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to improve, it is likely
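The cell above describes its result as streamed text "with overlap removal". One way to picture that idea is sketched below. This is a hypothetical illustration, not the implementation of `stream_and_merge`: the helpers `trim_overlap` and `merge_chunks` are invented names, and the sketch assumes that a new chunk may begin with text that already ends the accumulated output.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks while dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks(["The capital", "capital of France", " is Paris"])
print(merged)
```

The longest-overlap-first search keeps the merge deterministic when a chunk could match the existing tail at more than one length.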



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [age] year old [occupation]! I'm passionate about [insert a reason for passion], and I love to [insert a reason for love of work or hobbies]. I'm incredibly organized, and I love to [insert a hobby or activity I enjoy] and I'm always striving to [insert a goal or ambition]. I'm determined and I believe in my abilities! [Name], you're a role model for me, and I look up to you as a mentor. How are you today? [Name], it's been a pleasure to meet you! [Name] is a [insert

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Love. It is the largest and most populous city in France, located on the Seine River and surrounded by a historic urban district. The city is a major cultural and financial hub, with many renowned museums, galleries, and historical landmarks, such as
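A common reason to prefer the asynchronous call is that other coroutines can make progress while generation is awaited. The sketch below illustrates that pattern with the standard library only. `fake_async_generate` is a hypothetical stand-in for an awaitable engine call and is not part of SGLang, and the `max_concurrency` limit is an illustrative choice rather than an engine requirement.

```python
import asyncio


async def fake_async_generate(prompt: str) -> dict:
    """Hypothetical stand-in for an awaitable engine call (not SGLang API)."""
    await asyncio.sleep(0.01)  # simulate time spent generating
    return {"text": f" <completion for: {prompt}>"}


async def generate_all(prompts, max_concurrency: int = 2):
    """Run one request per prompt, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt: str) -> dict:
        async with semaphore:
            return await fake_async_generate(prompt)

    # gather() returns results in the same order as the input prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


prompts = ["The capital of France is", "The future of AI is", "Hello, my name is"]
results = asyncio.run(generate_all(prompts))
for prompt, result in zip(prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {result['text']}")
```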

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm [Age]. I'm a [Occupation] who has always had a passion for [Topic or Hobby]. I have a strong sense of [Qualification or Character]. I enjoy [Exercise or Social Interaction]. I'm [Influential or Curious] about [Topic or Interest]. How can I be of help to you? [Name], I'm here to help anyone who needs assistance with [What the Character can Do]. I'm excited to meet you! [Name], I'm here to help anyone who needs assistance with [What the Character can Do]. I'm excited to meet you! [



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

[Mark the correct answer]

A) True

B) False

A) True

The capital of France is indeed Paris. This statement accurately describes the official name of the capital city, which is where the government, institutions, and most of the nation's institutions are located. While Paris is a vibrant city with a rich cultural and historical heritage, it is also a major economic center, particularly for the French economy. The French government has made Paris one of its major international capitals, with the French embassy in Washington, D.C., being located in the heart of the city. Paris is also the most visited city in



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements in several areas, including the development of more efficient and powerful hardware, the expansion of data sets to allow for more sophisticated learning and analysis, the integration of multiple AI techniques and approaches, and the development of new types of AI systems that are more capable of solving complex problems. The potential applications of AI are also expected to expand, with applications in fields such as healthcare, transportation, education, and security becoming increasingly common. Additionally, the increasing use of AI in our daily lives, such as voice assistants, virtual assistants, and automated systems, is likely to continue to grow in popularity. Finally, the increasing
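The streaming loop in this section prints each chunk as it arrives. If the full text is also needed afterwards, the chunks can be accumulated while they are displayed. Below is a minimal sketch of that consumer pattern. `fake_stream` is a hypothetical async generator standing in for the engine's stream, and it yields whole words rather than real model tokens.

```python
import asyncio


async def fake_stream(text: str):
    """Hypothetical stand-in for a streaming source; yields one word at a time."""
    for word in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word + " "


async def collect(stream) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline once the stream is exhausted
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream("Paris is the capital of France.")))
```

Collecting into a list and joining once at the end avoids repeated string concatenation on long generations.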




```python
llm.shutdown()
```
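In a script, an exception raised between engine creation and `shutdown()` would skip the cleanup call. One way to guard against that is a small context manager, sketched below. The helper `shutting_down` and the `FakeEngine` class are hypothetical and exist only so the snippet runs without a model. The sketch assumes nothing about the engine beyond the `shutdown()` method used in this document.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing the same shutdown() method as the engine."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts):
        return [{"text": ""} for _ in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def shutting_down(engine):
    """Yield the engine and always call shutdown(), even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
with shutting_down(engine) as eng:
    outputs = eng.generate(["Hello, my name is"])

print(engine.closed)
```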