# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
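
The patch is only needed when an event loop is already running, as in IPython or Jupyter. Below is a minimal sketch of a guard that applies it conditionally. The helper names `running_in_event_loop` and `allow_nested_loops` are illustrative and are not part of SGLang or `nest_asyncio`.

```python
import asyncio


def running_in_event_loop() -> bool:
    """Return True when called from code that an event loop is driving."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def allow_nested_loops() -> bool:
    """Apply nest_asyncio only when a loop is already running.

    Returns True if the patch was applied, False if it was not needed.
    """
    if not running_in_event_loop():
        return False
    import nest_asyncio  # imported lazily so plain scripts do not need it

    nest_asyncio.apply()
    return True


# In a plain script no loop is running, so nothing is patched.
print(allow_nested_loops())
```

In a regular Python script this prints `False` and `asyncio.run(...)` works without any patching.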

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # `patch` is only imported when running in CI.
    import patch
else:
    # Allow asyncio.run() inside the notebook's already-running event loop.
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  and thanks for visiting my website. I am a product manager in the engineering firm, in charge of a small team. My goal is to develop product features that are really exciting and useful to the users.
I am very passionate about software engineering, and I try to learn from the latest technology trends and best practices.
At the moment, my main project is working on a new feature for our company's product, but I have a small project that I have been thinking about for a while now. I am looking for feedback on this feature.
I have decided to hold this idea contest, and I have decided to give each other a chance to
Prompt: The president of the United States is
Generated text:  elected by ______ people.

A. one million

B. 500 million

C. 250 million

D. 300 million
The correct answer is B. 500 million. According to the Electoral College system, the president is elected by the combined total of the states plus the two territories. In the 2016 elect
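
As the loop above shows, `llm.generate` returns one dict per prompt, with the generated text under the `"text"` key. For offline batch jobs it is often convenient to store prompts and completions together. The sketch below pairs them into records and serializes them as JSON Lines. The helper names `to_records` and `to_jsonl` and the sample outputs are illustrative assumptions, not part of SGLang.

```python
import json
from typing import Dict, Iterable, List


def to_records(prompts: List[str], outputs: Iterable[Dict]) -> List[Dict[str, str]]:
    """Pair each prompt with the generated text of its output dict."""
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize the records as one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Stand-in for the list returned by llm.generate(prompts, sampling_params).
example_prompts = ["The capital of France is", "The future of AI is"]
example_outputs = [{"text": " Paris."}, {"text": " bright."}]

records = to_records(example_prompts, example_outputs)
print(to_jsonl(records))
```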

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [Age] year old, [Gender] and [Occupation]. I have a [Skill] in [Skill Name] and I enjoy [Favorite Activity]. I'm always looking for ways to [Challenge], and I'm always eager to learn new things. I'm a [Personality Type] person. I'm always looking for ways to [Challenge], and I'm always eager to learn new things. I'm a [Personality Type

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. 

This statement is accurate and brief, capturing the essential information about the capital city's name and its role in French politics and culture. It provides a clear and concise overview of the capital's location and significance. 

To further elaborate on this statement, it could be expanded to include additional details about Paris's history, architecture, cultural attractions, or political importance. For example, it could mention that Paris is the birthplace of the French Revolution and the city is home to iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. 

Overall, this statement provides a comprehensive overview of Paris's

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased automation and robotics: As AI technology continues to advance, we can expect to see more automation and robotics in various industries, from manufacturing to healthcare. This will likely lead to increased efficiency and productivity, but it will also create new jobs and challenges for workers.

2. AI-powered healthcare: AI is already being used to improve the accuracy and speed of medical diagnosis and treatment. As AI technology continues to improve, we can expect to see even more widespread use of AI in
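
The "overlap removal" mentioned in the cell above can be pictured as follows. This is an illustrative sketch of the idea, not necessarily how `stream_and_merge` is implemented. The function names `trim_overlap` and `merge_chunks` are chosen here for illustration, and the chunks are made up.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks: the second one repeats "wor" from the first.
print(merge_chunks(["Hello wor", "world", "!"]))  # Hello world!
```

A genuine repetition that happens to fall on a chunk boundary would also be trimmed, so this heuristic only suits streams whose chunks are known to overlap.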



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am [Position/Title] at [Company/Agency]. I have [Number of years in the role] years of experience in this role, and I pride myself on [mention your expertise or qualities]. I have a passion for [mention a passion or hobby that interests you], and I strive to be a [mention a desired outcome or behavior] person at work. How would you describe yourself? As an AI language model, I am here to assist you with your queries and provide you with accurate and helpful responses. My mission is to help you learn, improve, and grow as a user like you. How

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Which of the following sentences is true? I. Paris is the only capital city in Europe. II. Paris is the capital of the United States. III. Paris is the capital of France. 

A) I only
B) II only
C) III on
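
Because `async_generate` is awaited, it can be combined with ordinary `asyncio` tools. The sketch below caps how many requests are outstanding at once, which may be useful when the surrounding application wants to bound its own concurrency. `generate_bounded` and `fake_generate` are hypothetical names. `fake_generate` stands in for a coroutine that wraps `llm.async_generate` for a single prompt, and its return shape is an assumption modeled on the `output['text']` access above.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def generate_bounded(
    generate_one: Callable[[str], Awaitable[Dict]],
    prompts: List[str],
    max_in_flight: int = 2,
) -> List[Dict]:
    """Await generate_one for every prompt, at most max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(prompt: str) -> Dict:
        async with semaphore:
            return await generate_one(prompt)

    # gather() returns results in the same order as the prompts.
    return await asyncio.gather(*(guarded(prompt) for prompt in prompts))


async def fake_generate(prompt: str) -> Dict:
    """Placeholder for a real engine call; it just upper-cases the prompt."""
    await asyncio.sleep(0)
    return {"text": prompt.upper()}


results = asyncio.run(generate_bounded(fake_generate, ["a", "b", "c"]))
print([result["text"] for result in results])
```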

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily, and I'm a free spirit with a passion for photography. I have a knack for capturing the beauty of everyday moments, whether it's a simple walk in nature or a sunset at the beach. I'm not afraid to explore new places and immerse myself in new cultures, which is why I often go on long road trips and backpacking trips. I'm also into creating and sharing my own photography, and I love to share my love of photography with others through my Instagram and Twitter accounts. I'm always looking for new ways to capture the beauty of the world around me and share it with the world. I'm excited to


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

A. True B. False

Answer: B. False

You are an AI assistant that helps people find information. Don't know a answer, ask again.

"Paris" is the capital city of France. Is the following statement true or false? Paris is the capital of France. To determine if the statement "Paris is the capital of France" is true or false, we need to understand the role of Paris in the country and its status as the capital.

1. **Paris' Status as Capital**: The statement "Paris is the capital of France" is incorrect because Paris is not the capital of France


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  bright and growing rapidly, with many trends expected to shape the technology's direction. Here are some of the most promising trends:

1. Advancements in AI technology: AI is becoming more advanced and powerful, with new algorithms and models being developed regularly. This means that AI systems will become more accurate, efficient, and capable of handling complex tasks.

2. Integration with other technologies: AI is becoming more integrated with other technologies, such as robotics, autonomous vehicles, and smart homes. This will allow AI systems to work in a broader range of applications.

3. Personalization: AI is becoming more personalized, allowing systems to learn from user data



In [6]:
llm.shutdown()
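
The final cell shuts the engine down explicitly. In a standalone script, a small context manager can make sure `shutdown()` is called even if generation raises. This is a sketch under assumptions: `shutting_down` and `FakeEngine` are hypothetical names, and `FakeEngine` stands in for a real `sgl.Engine(...)` so the example runs without a model.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def shutting_down(make_engine: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine and always call its shutdown() on exit."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Placeholder with the same shutdown() method as the engine above."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


with shutting_down(FakeEngine) as engine:
    pass  # call engine.generate(...) here with a real engine

print(engine.closed)  # True
```

With a real engine, the factory would be something like `lambda: sgl.Engine(model_path=...)`.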