# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
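For context, the sketch below uses only the standard library to show the error that nesting produces: calling `asyncio.run` while an event loop is already running raises `RuntimeError`. The coroutine names `inner` and `outer` are illustrative and are not part of SGLang.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running loop here, as in a Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

Applying `nest_asyncio` patches the loop so that nested calls of this kind are permitted.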

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.41it/s]



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Alex and I love to explore the world of stories. From fantasy to science fiction to historical fiction, I enjoy writing about worlds and cultures that I believe will be exciting and interesting. My writing has been featured in several publications and has been shared on social media, which has helped to make me a well-known author. I hope to inspire others to explore the world of stories and see the amazing stories that are waiting to be told. Thank you for stopping by my blog today and let me know what you think of my writing. Let me know if you have any questions! [0/2] Where did the author find inspiration for the stories
Prompt: The president of the United States is
Generated text:  a figure of great importance, and he serves as the leader of the country. Now, there is a certain number of members in the United States House of Representatives. If the president is removed, the number of members decreases by one. Then, the president is added 
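
In an offline batch job the results are usually persisted rather than printed. The following is a minimal sketch, assuming only the output shape shown above (a list of dicts with a `"text"` key, aligned with the prompts). The helper names `to_records` and `write_jsonl` are hypothetical, and the example data is made up so that the block runs without a model.

```python
import io
import json
from typing import Any, Dict, Iterable, List, TextIO


def to_records(prompts: List[str], outputs: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Pair each prompt with the generated text at the same index."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    return [{"prompt": p, "text": o["text"]} for p, o in zip(prompts, outputs)]


def write_jsonl(records: Iterable[Dict[str, str]], handle: TextIO) -> int:
    """Write one JSON object per line and return the number of lines written."""
    count = 0
    for record in records:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


# Made-up data in the same shape as the `outputs` list above.
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]

buffer = io.StringIO()  # swap for open("results.jsonl", "w") in a real job
written = write_jsonl(to_records(example_prompts, example_outputs), buffer)
print(written, buffer.getvalue(), end="")
```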

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I am a [Age] year old [Occupation]. I have always been passionate about [Your Passion], and I am always looking for ways to [Your Goal]. I am always eager to learn and grow, and I am always willing to share my knowledge with others. I am a [Your Personality] person, and I am always ready to help others. I am a [Your Character] who always strives to be the best version of myself. I am a [Your Motivation] person, and I am always determined to achieve my goals. I am a [Your Purpose] person, and I am always

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic Eiffel Tower, beautiful museums, and rich cultural heritage. It is also a major financial center and home to many world-renowned institutions such as the Louvre Museum and the Notre-Dame Cathedral. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is also home to the French Parliament and the French Academy of Sciences. 

Paris is a vibrant and diverse city with a rich history and a strong sense of French identity. It is a city that is constantly evolving and adapting to new challenges and opportunities, making it a fascinating and exciting place to visit. 



Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars and robots to personalized medicine and virtual assistants. Additionally, AI is likely to continue to be used for a wide range of applications, from healthcare and finance to transportation and entertainment, as it becomes more accessible and cost-effective. However, there are also potential risks and challenges associated with AI, including concerns about job displacement and ethical concerns around data privacy and bias. As AI continues to evolve, it is likely to play an increasingly important
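
The implementation of `stream_and_merge` is not shown in this document. As an illustration of what "overlap removal" can mean, the sketch below drops the longest prefix of each new chunk that repeats the end of the text merged so far. The names `trim_overlap` and `merge_chunks` and the sample chunks are assumptions for this example and may differ from what `sglang.utils` actually does.

```python
from typing import Iterable


def trim_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` without its longest prefix that is also a suffix of `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text repeated across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks whose boundaries overlap.
demo = merge_chunks(["The capital", "capital of", " of France"])
print(demo)
```

A heuristic like this cannot tell an overlap from a genuine repetition (for example, two consecutive chunks that both read `ha`), so it is best suited to streams that are known to resend text.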



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [What kind of character is you? A character who wears a specific outfit? A character who is an expert in a specific subject? A character who has a unique background? A character who is a selfless friend? A character who is an entertainer? A character who has a penchant for sarcasm?] 
I'm a [What's the main profession of you? What's your current job? What's your most memorable experience? What's your favorite food? What's your favorite sport? What's your favorite hobby? What's your favorite book or movie? What's your favorite place to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris also has a rich history dating back to the Roman Empire, and is a major cultural and 
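
Because `async_generate` is awaitable, several independent batches can be submitted from the same event loop. The sketch below shows one possible pattern using `asyncio.gather` with a semaphore that caps the number of batches in flight. `StubAsyncEngine` is a hypothetical stand-in that only mimics the list-in, list-out call used above so that the block runs without a model. Whether concurrent submission helps with the real engine depends on its own scheduling.

```python
import asyncio
from typing import Any, Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in for the engine; returns one dict per prompt."""

    async def async_generate(self, prompts: List[str], sampling_params: Dict[str, Any]):
        await asyncio.sleep(0)  # hand control back to the event loop once
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params, max_in_flight=2):
    """Run several prompt batches concurrently, at most `max_in_flight` at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(batch):
        async with semaphore:
            return await engine.async_generate(batch, sampling_params)

    # asyncio.gather returns results in the same order as `batches`.
    return await asyncio.gather(*(run_one(b) for b in batches))


batches = [["Hello, my name is", "The capital of France is"], ["The future of AI is"]]
batch_results = asyncio.run(run_batches(StubAsyncEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, batch_results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```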

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm a [occupation] with extensive experience in [your profession]. I'm passionate about [your profession], and I believe that [your occupation] can help me achieve [your career goal].

As an [occupation], I have a track record of [specific achievement], and I am always looking for ways to [specific goal]. I'm eager to learn and grow, and I'm excited to work with you. What can you tell me about yourself? [Your Name] with extensive experience in [your profession] and a passion for [your profession]. I am ready to work with anyone who shares my interest and goal

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

France's capital city is Paris, and it is located in the region of Île-de-France. The city has a population of about 10 million people, and it is known for its rich history, beautiful architecture, and stunning views of the city skyline.

Paris is the official capital of France and the seat of government, and it has a rich and complex history. The city was founded in 787 by Charlemagne and has since become one of the oldest continuously occupied cities in the world.

Paris is also a major hub for fashion, food, and entertainment, and it has hosted many

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain, but some possible trends that could occur over the next few decades include:

1. Increased automation: AI is increasingly replacing human labor in areas such as manufacturing, transportation, and customer service. Automation could lead to a shift in jobs and economic inequality.

2. AI for human well-being: AI can help improve healthcare, education, and transportation by providing personalized and adaptive solutions to complex problems. However, it could also lead to a loss of jobs in the field of human interaction.

3. AI for the environment: AI can help mitigate environmental problems such as climate change by optimizing energy use and reducing waste. However, it may also
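
The streaming loop in this section handles prompts one after another. If the final text per prompt is all that is needed, one consumer task per prompt can drain the streams concurrently. This is a sketch under stated assumptions: `fake_stream` is a made-up async generator standing in for a chunk stream, and `collect` and `stream_all` are hypothetical helper names.

```python
import asyncio
from typing import AsyncIterator, Callable, Dict, List


async def fake_stream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async chunk stream."""
    for piece in (" one", " two", " three"):
        await asyncio.sleep(0)  # let other consumers make progress
        yield piece


async def collect(prompt: str, stream_fn: Callable[[str], AsyncIterator[str]]) -> str:
    """Drain one stream and join its chunks into the final text."""
    parts: List[str] = []
    async for piece in stream_fn(prompt):
        parts.append(piece)
    return "".join(parts)


async def stream_all(prompts: List[str], stream_fn) -> Dict[str, str]:
    """Consume one stream per prompt concurrently; map each prompt to its text."""
    texts = await asyncio.gather(*(collect(p, stream_fn) for p in prompts))
    return dict(zip(prompts, texts))


streamed = asyncio.run(stream_all(["p1", "p2"], fake_stream))
print(streamed)
```

With the engine from this document, `stream_fn` could be `lambda p: async_stream_and_merge(llm, p, sampling_params)`, since that helper is consumed with `async for` in the same way.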




```python
llm.shutdown()
```
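
In a script, an exception raised during generation would skip a trailing `shutdown()` call. A minimal sketch of one way to guard against that, using a small context manager, is shown below. `StubEngine` and `managed` are hypothetical names; only the `shutdown()` method mirrors the call used above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in; only shutdown() mirrors the engine call used above."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


@contextmanager
def managed(engine):
    """Yield the engine and always call shutdown() afterwards, even on error."""
    try:
        yield engine
    finally:
        engine.shutdown()


stub = StubEngine()
try:
    with managed(stub) as engine:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(stub.closed)
```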