# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete, working Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other environment that already runs an asyncio event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```

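The reason for this step: `asyncio.run()` refuses to start when an event loop is already running in the current thread, which is the case inside a Jupyter notebook kernel. The standard-library-only sketch below reproduces that error without SGLang; `loop_is_running` and `nested_call` are hypothetical helper names used only for illustration. `nest_asyncio.apply()` patches asyncio so that such nested calls are permitted.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True if an asyncio event loop is running in this thread."""
    try:
        asyncio.get_running_loop()
        return True
    except RuntimeError:
        return False


async def nested_call() -> str:
    # We are already inside a running loop here, just like code in a notebook cell
    coro = asyncio.sleep(0)
    try:
        asyncio.run(coro)
        return "no error"
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)


print(loop_is_running())  # False when run as a plain script
message = asyncio.run(nested_call())
print(message)
```

With `nest_asyncio.apply()` in effect, the inner `asyncio.run()` call in `nested_call` would be expected to succeed instead of raising.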
## Advanced Usage

The engine also supports [VLM (vision-language model) inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch  # CI-only helper; not needed when running locally
else:
    # Allow asyncio.run() inside an already-running event loop (e.g. Jupyter)
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
Prompt: Hello, my name is
Generated text:  Gabe and I am 17 years old. I am the President of our club and I have been in the club since 2014. I have been on the team since 2016. I have been in the club for 3 years, and I have been the captain of the club since 2019. I have been on the team since 2015, and I am the president of the club. Can you tell me about yourself and your role as President of your club? Gabe enjoys sports and has been playing soccer since he was a young child. He has played
Prompt: The president of the United States is
Generated text:  a very important person. He or she is the leader of the country. The president is also the chief of government. He or she makes decisions about everything that happens in the country. This includes deciding how to spend the country's money, how to decide what happens in the military, how to make decisions about the economy, and how to make decisions about the country's foreign policy. A president is not a king or a queen. The preside
```

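This notebook uses two sampling configurations (`temperature` 0.8 / `top_p` 0.95 and `temperature` 0.2 / `top_p` 0.9). As a rough illustration of how they differ, the sketch below applies the textbook definitions of temperature scaling and nucleus (top-p) filtering to a made-up four-token distribution. It is not SGLang's sampling kernel; `sample_distribution` and `toy_logits` are hypothetical, and real implementations may differ in details such as tie-breaking at the top-p boundary.

```python
import math


def sample_distribution(logits: dict, temperature: float, top_p: float) -> dict:
    """Temperature-scale toy logits, softmax them, then apply nucleus (top-p) filtering."""
    # Temperature: divide logits before the softmax; values below 1 sharpen the distribution
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}  # subtract max for stability
    total = sum(exps.values())
    probs = sorted(((tok, e / total) for tok, e in exps.items()), key=lambda kv: kv[1], reverse=True)

    # Top-p: keep the smallest set of most likely tokens whose total probability reaches top_p
    kept, mass = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize the surviving tokens so their probabilities sum to 1
    return {tok: p / mass for tok, p in kept}


toy_logits = {"Paris": 3.0, "Lyon": 2.0, "a": 1.0, "the": 0.5}
print(sample_distribution(toy_logits, temperature=0.2, top_p=0.9))
print(sample_distribution(toy_logits, temperature=0.8, top_p=0.95))
```

With these toy logits, `temperature=0.2, top_p=0.9` leaves only "Paris", while `temperature=0.8, top_p=0.95` keeps "Paris", "Lyon", and "a". Lower temperature concentrates probability mass on the top token, so top-p filtering keeps fewer candidates.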
### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


```text
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Age] year old [Occupation]. I am a [Skill/Ability] who has been [Number of Years] years in the field of [Field of Interest]. I am passionate about [Why I love my field]. I am a [Favorite Hobby/Activity] that I enjoy [Number of Hours/Week]. I am a [Favorite Book/Article/Video] that I read [Number of Times/Week]. I am a [Favorite Music/Artist/Album] that I listen to [Number of Times/Week]. I am a [Favorite Sport/Activity/Group] that I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic Eiffel Tower and the Louvre Museum. It is also home to the French Parliament and the French Academy of Sciences. Paris is a bustling city with a rich history and a diverse population, making it a popular tourist destination. The city is also known for its fashion industry, with Paris Fashion Week being one of the largest in the world. Paris is a city that is both a cultural and political center of France. Its history and architecture are celebrated, and it continues to be a major economic and cultural hub in the country. The city is also known for its food scene, with many restaurants and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes. This could lead to more efficient and effective decision-making in various industries.

2. Enhanced machine learning capabilities: AI is likely to become more capable of learning from large amounts of data and making more accurate predictions and decisions. This could lead to more personalized and effective solutions to complex problems.

3. Increased use of AI in healthcare: AI is likely to play a more significant role in healthcare, with machines being
```

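`stream_and_merge` comes from `sglang.utils`, and the printed header says it performs overlap removal. The sketch below shows one way such merging can work: drop the longest prefix of each new chunk that already ends the merged text. It is an illustrative reimplementation, not the library's source; `trim_overlap`, `merge_stream`, and the hard-coded chunks are hypothetical stand-ins for what the engine would stream.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing."""
    # Try the largest possible overlap first so the longest match wins
    for size in range(min(len(existing), len(new_chunk)), 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks: Iterable[str]) -> str:
    """Accumulate streamed chunks into one string without duplicated overlap."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries overlap by a few characters
fake_chunks = [" Paris, which", "which is known", " known for its Eiffel Tower."]
print(merge_stream(fake_chunks))
```

The same logic handles cumulative chunks (each chunk repeating all text so far) as well as incremental chunks with overlapping edges. The trade-off is that it is a heuristic: if a new chunk legitimately begins with text that also ends the merged output, such as a repeated word, that text is dropped.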


### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


```text
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a creative thinker with a passion for helping people. I believe that creativity is the key to unlocking the power of ideas and solutions, and that's why I create art that inspires people to think differently and come up with new ideas. I'm constantly learning and growing in my field of work, and I'm excited to share my ideas with anyone who is interested. So, if you're looking for creative solutions to problems or just want to have fun with a creative mindset, I'm here to help. What's your name? And, if you could give one piece of advice to someone who wants to become a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a bustling metropolis with a rich history and a diverse cultural scene. The city is known for its stunning architecture, rich history, and lively atmosphere, making it
```

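Passing the whole list to `llm.async_generate` submits the batch in one call. When prompts arrive one at a time, as in a custom server, independent requests can instead be awaited concurrently with `asyncio.gather`. The sketch below uses a hypothetical `fake_generate` coroutine in place of the engine so it runs without a GPU; assuming `llm.async_generate` also accepts a single prompt string, it could take that coroutine's place.

```python
import asyncio
import random


async def fake_generate(prompt: str) -> dict:
    """Hypothetical stand-in for one engine request; returns a dict with a "text" key."""
    await asyncio.sleep(random.uniform(0.01, 0.05))  # simulate variable latency
    return {"text": f"<completion for {prompt!r}>"}


async def run_all() -> list:
    prompts = ["Hello, my name is", "The capital of France is"]
    # Start every request at once; gather returns results in the order of its arguments
    return await asyncio.gather(*(fake_generate(p) for p in prompts))


gathered = asyncio.run(run_all())
for out in gathered:
    print(out["text"])
```

`asyncio.gather` preserves argument order, so results line up with prompts even though the simulated requests finish in random order.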
### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through async_stream_and_merge, which yields only text that has
        # not been printed yet, so overlapping chunks are not repeated
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```text
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [job title] at [company]. I'm currently [current position] and have [number of years' experience]. I am [job title] at [company], and I'm currently [current position] at [company]. I have [number of years' experience] of experience in [field] and have a strong passion for [speciality]. In my free time, I enjoy [interests/activities]. Thank you for asking!
Sure, here is a short, neutral self-introduction for a fictional character:

---

**My Name:** [Name]
**Position:** [Current Job Title]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a historical city in the south of the country.

Paris is the capital of France and the seat of the French government, and the second most populous city in the world, after New York City. The city is known for its rich culture, architecture, food, and fashion. It is also famous for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a major tourist destination and a popular destination for luxury food and fashion. The city is home to many of the world's oldest museums, theaters, and universities, and has a long history dating back thousands of years

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be transformative and rapidly evolving, driven by a combination of current advances, the need for new applications, and societal changes. Here are some possible trends in the field:

1. Increased focus on ethical considerations: There will be a growing emphasis on how AI is used ethically and responsibly. This includes issues such as privacy, bias, and transparency. AI developers will need to consider how their technology impacts society and ensure that its use aligns with ethical principles.

2. Integration with existing technologies: AI is already being integrated into various industries, including healthcare, finance, and transportation. As this integration becomes more widespread, it will be important
```

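`async_stream_and_merge` is used above as an async generator: each iteration yields a cleaned piece of text that is printed immediately with `end=""`, which is why the output appears token by token. A minimal sketch of that pattern follows, assuming the engine's stream can be modeled as an async iterator of cumulative text chunks. `fake_stream`, `merge_async`, and `consume` are hypothetical, and this is not the library's implementation.

```python
import asyncio
from typing import AsyncIterator


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing."""
    for size in range(min(len(existing), len(new_chunk)), 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


async def fake_stream() -> AsyncIterator[str]:
    """Stand-in for an engine's async stream that yields cumulative text."""
    for text in ["Paris", "Paris is", "Paris is the capital."]:
        await asyncio.sleep(0)  # hand control back to the event loop, as a real stream would
        yield text


async def merge_async(stream: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield only the not-yet-seen part of each chunk."""
    merged = ""
    async for chunk in stream:
        cleaned = trim_overlap(merged, chunk)
        merged += cleaned
        if cleaned:
            yield cleaned


async def consume() -> list:
    pieces = []
    async for piece in merge_async(fake_stream()):
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()
    return pieces


pieces = asyncio.run(consume())
```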
Finally, shut down the engine:

```python
llm.shutdown()
```
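
If generation raises an exception, a bare `llm.shutdown()` at the end of a script is never reached. One way to guarantee cleanup is a small context manager. The sketch assumes only the `shutdown()` method used above; `managed` and `FakeEngine` are hypothetical names, and `FakeEngine` stands in for `sgl.Engine` so the example runs without a model.

```python
from contextlib import contextmanager
from typing import Iterator, Protocol


class SupportsShutdown(Protocol):
    def shutdown(self) -> None: ...


@contextmanager
def managed(engine: SupportsShutdown) -> Iterator[SupportsShutdown]:
    """Yield the engine and always call shutdown(), even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that only records shutdown calls."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


engine = FakeEngine()
try:
    with managed(engine):
        raise ValueError("simulated failure during generation")
except ValueError:
    pass
print(engine.closed)  # True: shutdown ran despite the exception
```

With a real engine, this would read `with managed(sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")) as llm:` followed by the generation calls.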