# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
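
The reason for the patch can be seen with the standard library alone. The sketch below is not SGLang code; it only shows that `asyncio.run` refuses to start when an event loop is already running, which is the situation inside IPython or Jupyter. The coroutine names `inner` and `outer` are made up for illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second loop from inside the first one
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Run as a plain script, this reports a `RuntimeError` from the nested call. `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.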

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.72it/s]



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Jack and I'm a teacher from the United States. I have a special kind of wheelchair, and I'm going to teach you about it. It is a high-powered wheelchair. It has a wheel that is about as big as a car. However, the people who use it are not afraid of this wheel. They are so excited and happy to have the wheel. 

The wheelchair has two wheels, one at the front and one at the back. However, the front wheel is so big that it can't be used for the person to push the wheelchair forward. The back wheel is small, but it can be used to push the
Prompt: The president of the United States is
Generated text:  expected to be sworn in by a general of the United States, who is a member of the armed forces. In such a situation, what is the best course of action for the president to take? To be more specific, please provide a step-by-step explanation of the process, including the roles and responsibilities of the U. S. president, the general, and the armed forc
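
To make the two values in `sampling_params` concrete, here is a minimal pure-Python sketch of how temperature scaling and nucleus (top-p) filtering are commonly defined. It illustrates the textbook definitions and is not SGLang's sampler; the function name `nucleus_filter` and the toy logits are assumptions made for this example.

```python
import math


def nucleus_filter(logits, temperature, top_p):
    """Return {token_id: renormalized probability} for the tokens kept by top-p."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four candidate tokens
loose = nucleus_filter(toy_logits, temperature=0.8, top_p=0.95)
tight = nucleus_filter(toy_logits, temperature=0.2, top_p=0.9)
print(sorted(loose))  # token ids that survive with the looser settings
print(sorted(tight))  # token ids that survive with the tighter settings
```

With these toy logits, the looser settings keep three candidate tokens and the tighter settings keep only the most likely one. Lower values therefore give more deterministic text.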

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm [age] years old, and I have [number] years of experience in [industry]. I'm a [job title] with [company name] for [number] years. I'm always looking for new opportunities to grow and learn, and I'm always eager to learn more about the world around me. I'm a [job title] with [company name] for [number] years. I'm always looking for new opportunities to grow

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is the largest city in France and the second-largest city in the European Union. Paris is home to many famous landmarks such as the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. The city is also known for its rich cultural heritage, including the French language, art, and cuisine. Paris is a popular tourist destination and a major economic center in France. It is home to many important institutions such as the French Academy of Sciences and the French National Library. The city is also known for its fashion industry, with many famous designers and boutiques. Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies will continue to improve and become more integrated into our daily lives, from self-driving cars and robots in factories to personalized medicine and virtual assistants. As AI becomes more integrated into our daily lives, we may see a shift towards more ethical and responsible use of AI, with greater emphasis on transparency, accountability, and fairness in its development and deployment. Additionally, AI will likely continue to evolve and adapt to new challenges and opportunities, leading to new applications and uses of AI in fields such as healthcare, finance, and transportation. Overall
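
The "overlap removal" mentioned in the banner can be pictured as follows. This is a hedged sketch of the general idea only. It assumes a stream whose chunks may repeat text that was already emitted. `drop_overlap` and `merge_chunks` are illustrative helpers written for this page, not a description of how `stream_and_merge` is implemented.

```python
def drop_overlap(existing: str, chunk: str) -> str:
    """Remove the longest prefix of `chunk` that is already a suffix of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one repeats the tail of the text so far.
chunks = [" Paris", " Paris, also", ", also known as", " as the City of Light."]
print(merge_chunks(chunks))
```

Plain suffix/prefix matching cannot tell a repeated chunk boundary from text that legitimately repeats (for example `"ha"` followed by `"ha"`), so a sketch like this suits display purposes more than exact reconstruction.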



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Age] year-old who moved to [Location] to pursue my dream of becoming a [Occupation] and [Career Goal]. I'm passionate about [Career Goal] and I'm always looking for new experiences to grow and learn from. I'm a [interest or hobby] person and I enjoy [How I enjoy [Job/Activity/Interest/Interests/Interest in Personal Goal].] I have a [Skill or ability] and I'm always looking for ways to [Improving Something in My Life]. I'm a [Importance in [Company/Team/Industry/Any Other

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located on the Seine River in the center of the country, serving as the cultural, economic, and political center of the country.

So, what did the previous sentence miss? The previous sentence missed the point that Paris is not only the capital of France, but also the la
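
When an application issues many independent asynchronous requests, it can be useful to cap how many are in flight at once. The sketch below shows one way to do that with `asyncio.Semaphore` and `asyncio.gather`. `FakeEngine` is a hypothetical stand-in that only mimics the `async_generate(prompt, sampling_params)` call shape used above, so the example runs without a model. Whether such a cap is needed on top of the engine's own scheduling depends on your workload.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine; not part of SGLang."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params, max_in_flight=2):
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        # At most `max_in_flight` requests run inside this block at a time.
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
outputs = asyncio.run(generate_all(FakeEngine(), prompts, sampling_params))
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```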

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name], and I am a [insert the genre of the story] writer. I have always been fascinated by [insert something specific to your genre]. I have always been interested in [insert something specific to your genre]. I have always been drawn to writing [insert something specific to your genre]. I have always been interested in [insert something specific to your genre]. I have always been fascinated by [insert something specific to your genre]. I have always been drawn to writing [insert something specific to your genre]. I have always been interested in [insert something specific to your genre]. I have always been fascinated by [insert



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic Eiffel Tower and a rich history dating back over 200 years. The city is also home to numerous art museums, including the Louvre, the Musée d'Orsay, and the Musée Rodin. Paris is also famous for its unique culinary traditions, such as croissants, sweet pastries, and the famous croissant. The city is known for its diverse culture and its role as a major cultural hub in Europe. It's a beautiful city that is home to many iconic landmarks and attractions. Is there anything else you'd like to know about Paris? The capital of



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by continued development and expansion. Some possible future trends in AI include:

1. Increased precision and accuracy: With the development of more powerful algorithms and hardware, AI systems are likely to become more precise and accurate in their decision-making processes. This could lead to more reliable and trustworthy AI systems.

2. Integration with other technologies: As AI systems become more integrated with other technologies, such as machine learning, blockchain, and IoT, it is possible that new applications and services could emerge that take advantage of these interactions.

3. Autonomous decision-making: As AI systems become more sophisticated, they may be able to make decisions on their




```python
llm.shutdown()
```
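
In a longer script it may be worth making sure `shutdown()` runs even if generation raises. A minimal sketch using `contextlib.contextmanager` follows. `DummyEngine` and `managed` are hypothetical names used so the example runs without a model, and the only method assumed from the text above is `shutdown()`.

```python
from contextlib import contextmanager


class DummyEngine:
    """Hypothetical stand-in for an engine; not part of SGLang."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = DummyEngine()
try:
    with managed(engine) as llm:
        llm.generate(["Hello, my name is"], {"temperature": 0.8})
except RuntimeError as exc:
    print(f"generation failed: {exc}")
print("shut down:", engine.is_shut_down)
```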