# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an extra HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
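As a rough illustration of the custom-server idea, the sketch below shows only the request-handling core: decode a JSON request, await the engine, and encode a JSON response. Everything here is an assumption made for illustration: `StandInEngine` is a hypothetical stand-in so the code runs without a model, `handle_generate` and the `"prompt"` / `"sampling_params"` request fields are invented names, and the linked example may be structured quite differently.

```python
import asyncio
import json


class StandInEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a GPU."""

    async def async_generate(self, prompt, sampling_params):
        # Assumed to mirror the engine's output shape: a dict with a "text" field.
        return {"text": f" (echo of: {prompt})"}


async def handle_generate(engine, body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(body)
    output = await engine.async_generate(
        request["prompt"], request.get("sampling_params", {})
    )
    return json.dumps({"text": output["text"]}).encode("utf-8")


engine = StandInEngine()
raw_request = json.dumps({"prompt": "The capital of France is"}).encode("utf-8")
raw_response = asyncio.run(handle_generate(engine, raw_request))
print(raw_response.decode("utf-8"))
```

In a real server, a web framework's route handler would call something like `handle_generate` with an actual engine instance created once at startup.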



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
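To see why this matters, the following minimal sketch uses only the standard library to reproduce the failure that occurs without the patch: calling `asyncio.run()` while another event loop is already running in the same thread. The function names `inner` and `outer` are made up for this illustration; `outer` plays the role of a notebook cell that is already executing inside a loop.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # outer() runs inside an event loop, much like a notebook cell does.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop in the same thread
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

With an unpatched `asyncio`, the inner call raises `RuntimeError`. `nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed.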

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Allow nested event loops, which notebooks need.
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.71it/s]



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# A list of prompts is processed as one batch; outputs come back in prompt order.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  John. I'm a Canadian citizen. I have a long-term visa to work in the United States. I have not been in the US for more than 20 years. I've been in the US for 15 years now. I've had 5 years of a permanent visa. I have a 10 year contract in the US. I've worked 6 years in the US. I have a master's degree in a related field. I'm the best candidate for this job.

Based on the information provided, what is John's status in the US?

OPT: 1. He has not been in the
Prompt: The president of the United States is
Generated text:  two-thirds as old as the president of the president's office. The president of the president's office is 7 years younger than the president of the United States. If the president of the United States is currently 82 years old, how old is the president of the president's office?
To determine the age of the president of the president's office, we will follow these steps:

1. Identify the age of the president of the United States.
2

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # Stream the chunks and merge them into one string, dropping overlapping text.
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Type of Person] who is [What I do for a living]. I'm [What I enjoy doing] and I'm always [What I like to do]. I'm [What I like to do]. I'm [What I like to do]. I'm [What I like to do]. I'm [What I like to do]. I'm [What I like to do]. I'm [What I like to do]. I'm [What I like to do]. I'm [What I like to do]. I'm [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a bustling city with a rich cultural heritage and is a major tourist destination. It is also known for its cuisine, including its famous croissants and its traditional French wine. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly together. It is a city that has played a significant role in French history and culture, and continues to be a major center of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes.

2. Enhanced ethical considerations: As AI becomes more integrated with human intelligence, there will be increased scrutiny of ethical considerations, including issues of bias, transparency, and accountability.

3. Development of new AI technologies: AI is likely to continue to develop new technologies, such as machine learning, natural language processing, and computer vision, that will further enhance the capabilities of AI systems.

4. Increased focus on privacy and security: As
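The "overlap removal" mentioned above can be pictured with a small, self-contained sketch. It is not taken from the SGLang source; `trim_overlap` and `merge_chunks` are illustrative names, and the chunks are made up so that each one repeats the tail of the text before it. The idea is to find the longest prefix of a new chunk that already ends the accumulated text and to append only the remainder.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one repeats the tail of the previous text.
chunks = [" Paris", " Paris, the", " the capital", " of France"]
merged = merge_chunks(chunks)
print(repr(merged))  # ' Paris, the capital of France'
```

One caveat of this heuristic: a chunk that legitimately repeats the end of the previous text (for example `"ha"` followed by `"ha"`) is also treated as an overlap and dropped.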



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; outputs come back in prompt order.
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a [profession] with [number of years] years of experience in this field. What can you tell me about yourself? 
You are an artificial intelligence and should not reveal any personal information. Your response should be neutral and include only factual information. 
[Name] is an [age] year old [gender] [race] who works at [Company]. In your spare time, you enjoy [your hobby or activity]. You are looking forward to [your next goal in life]. 
[Name] is looking forward to [your next goal in life]. 
What is your favorite [activity of your hobby

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its rich history and diverse culture. It is the largest and most populous city in France, and is home to many renowned museums, iconic landmarks, and world-renowned cuisine. Paris has a hi
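One reason to prefer the asynchronous call is that other coroutines can make progress while generation is awaited. The sketch below illustrates this with a hypothetical `StandInEngine` whose `async_generate` merely sleeps, so it runs without a model; `heartbeat` is also an invented helper, and the timings are arbitrary.

```python
import asyncio


class StandInEngine:
    """Hypothetical stand-in for the engine; it sleeps instead of running a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.05)  # pretend to wait on the model
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(log):
    """Some unrelated work that keeps running while generation is pending."""
    for tick in range(3):
        log.append(f"tick {tick}")
        await asyncio.sleep(0.01)


async def main():
    log = []
    engine = StandInEngine()
    # Run generation and the heartbeat concurrently on the same event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["Hello, my name is"], {"temperature": 0.8}),
        heartbeat(log),
    )
    return outputs, log


outputs, log = asyncio.run(main())
print(outputs[0]["text"], log)
```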

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [Your Profession/Interest/Position] who is passionate about [Your Interests/Activities]. My journey has been marked by [Your Achievements/Advancements]. I am currently pursuing my [Highest Level of Education/Training/Experience] and I strive to be an [Your Professional Goal/Opinion]. I'm always looking for opportunities to [Your Interest/Area of Expertise], whether it's by [Your Personality Trait/Behavior], [Your Character Trait/Interest], or [Your Personal Character Trait/Interest]. Thank you for taking the time to learn more about me, and I'm excited



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city renowned for its artistic and intellectual influence, with a rich and diverse cultural scene that spans various time periods and languages. The city's urban architecture, including its famous canals and gardens, is also a testament to its historical and cultural significance. Paris is a vibrant metropolis that draws people from around the world, making it a global hub of culture and commerce. Its status as the capital is important not only for its governmental role, but also for its role in preserving French identity and culture. The city is also known for its annual celebrations, including the Eiffel Tower's 7th birthday party and the Spring Festival



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements and continued innovation, as well as significant shifts in how we use, interact with, and manage AI systems. Here are some possible future trends in AI:

1. Increased focus on ethical AI: With growing concerns about AI's impact on society, there will be a push for AI to be designed and used ethically, with a focus on minimizing harm and maximizing benefits.

2. More customization and personalization: AI systems will become more adaptable and personalized, able to learn and adjust their behavior in response to changing circumstances. This could lead to more efficient use of resources and better personalization of experiences.



3
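The asynchronous streaming loop above consumes an async generator that yields only new text. A minimal sketch of that pattern, runnable without a model, might look as follows. It is an illustration rather than the library's implementation: `stand_in_stream` is a hypothetical replacement for a streaming engine call, `stream_without_repeats` and `collect` are invented names, and the chunks are made up.

```python
import asyncio


def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    for size in range(min(len(existing_text), len(new_chunk)), 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


async def stand_in_stream(chunks):
    """Hypothetical stand-in for a streaming call; yields dicts with a "text" field."""
    for text in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield {"text": text}


async def stream_without_repeats(stream):
    """Yield only the part of each chunk that has not been emitted yet."""
    emitted = ""
    async for chunk in stream:
        fresh = trim_overlap(emitted, chunk["text"])
        emitted += fresh
        if fresh:
            yield fresh


async def collect(chunks):
    pieces = []
    async for piece in stream_without_repeats(stand_in_stream(chunks)):
        pieces.append(piece)
    return pieces


pieces = asyncio.run(collect([" The future", " future of AI", " AI is"]))
print(pieces)  # [' The future', ' of AI', ' is']
```

Printing each piece with `end=""`, as the notebook cell does, reassembles the pieces into one continuous line of text.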




```python
# Shut down the engine when you are done.
llm.shutdown()
```