# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example as a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).

## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other code that already runs an asyncio event loop, add the following code:
```python
import nest_asyncio

# Patch asyncio so that asyncio.run() works inside an already-running event loop
nest_asyncio.apply()
```
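To see why the patch is needed, the standard-library sketch below (no SGLang involved) calls `asyncio.run()` from inside a coroutine that is already running on an event loop, which is roughly the situation inside a notebook cell. Without `nest_asyncio`, Python refuses the nested call. The function names `answer` and `cell_body` are hypothetical.

```python
import asyncio


async def answer() -> int:
    return 42


async def cell_body() -> str:
    # An event loop is already running here, much like code in an IPython/Jupyter cell.
    coro = answer()
    try:
        asyncio.run(coro)
        return "nested run succeeded"
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a "never awaited" warning
        return str(exc)


message = asyncio.run(cell_body())
print(message)
```

After `nest_asyncio.apply()`, the same nested call is expected to succeed, which is what lets the synchronous examples below call `asyncio.run(main())` inside a notebook.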

## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Allow asyncio.run() inside the notebook's already-running event loop
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

```
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.74it/s]
```

### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```
Prompt: Hello, my name is
Generated text:  David. I'm an American. I was born in Chicago and I have three daughters. I am a dad and the kids like to play tag. They have been playing tag for a long time, and I know the rule is to get the ball to a child first. But now they're upset that they can't get the ball to anyone. They always say "Oh, it's all our fault, " and I think that's really scary. I want to help them because they have been playing tag for a long time. Now I try to be a good parent. I try to explain things to them, and tell them how
Prompt: The president of the United States is
Generated text:  appointed by the _____. ____
A. President
B. Senate
C. House of Representatives
D. President of Congress
Answer:

D

Which of the following statements about what the U. S. House of Representatives does not correctly match its responsibilities?
A. The House of Representatives hears and debates legislation in the Senate.
B. The House of Representatives must present bills to the Senate
```

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```

```
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I love [favorite hobby or activity]. I'm always looking for new experiences and adventures to try. What's your favorite book or movie? I love [favorite book or movie]. I'm always looking for new reads and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French National Library. Paris is a bustling metropolis with a rich cultural heritage and is a major economic and political center in Europe. The city is known for its fashion industry, art scene, and its role in the French Revolution and the French Revolution. Paris is a popular tourist destination and is home to many museums, theaters, and other cultural institutions. It is also a major hub for international business and finance. The city is known for its cuisine,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars to personalized medicine. Additionally, there is a growing interest in AI ethics and privacy, as concerns about the potential misuse of AI systems continue to grow. As AI becomes more integrated into our daily lives, it is likely to have a significant impact on society, both positive and negative. It is important for policymakers and industry leaders to work together to ensure that AI is developed and used in a responsible and ethical manner.

In summary
```
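The "overlap removal" in the streaming examples refers to merging streamed chunks so that text repeated at a chunk boundary appears only once. Below is a minimal sketch of that idea, assuming a new chunk may begin with text that already ends the accumulated output. `trim_overlap` and `merge_chunks` are hypothetical names for illustration, not SGLang's implementation of `stream_and_merge`.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical stream in which each chunk repeats the tail of the previous one.
stream = ["The capital of", " of France", "France is Paris."]
print(merge_chunks(stream))
```

This heuristic has a trade-off: a chunk that genuinely repeats the preceding text, such as `"ha"` followed by `"ha"`, is collapsed to a single `"ha"`.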

### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```

```
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am [Age] years old, and I am [Profession]. I enjoy [Activity], and I am always [Curiosity]. What can you tell me about yourself?

I'm always up for learning, whether it's through reading books or attending classes. I'm also a big fan of podcasts and video tutorials. I like to stay up-to-date with the latest trends in my field, and I'm always eager to learn new things. What's your favorite hobby? I'm always looking for ways to express myself creatively. I enjoy painting, playing the piano, and trying new recipes. What's your favorite book

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is the largest city and the country's political, economic, and cultural center. It is also home to many notable landmarks, including Notre-Dame Cathedral, Louvre Museum, and the Eiffel Tower. Paris has a r
```

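Because `llm.async_generate` is awaited inside a coroutine, one pattern it may enable (an assumption, not documented SGLang usage) is awaiting several batches concurrently with `asyncio.gather` instead of one after another. The sketch below shows only that control flow: `fake_async_generate` is a hypothetical stand-in that mimics the output shape above (a list of dicts with a `"text"` key) and performs no real inference.

```python
import asyncio


async def fake_async_generate(prompts, sampling_params):
    # Hypothetical stand-in for llm.async_generate: yields control to the
    # event loop, then returns one {"text": ...} dict per prompt.
    await asyncio.sleep(0.01)
    return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def main():
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    batches = [
        ["The capital of France is"],
        ["The future of AI is", "Hello, my name is"],
    ]
    # Await both batches concurrently; gather returns results in the order of `batches`.
    results = await asyncio.gather(
        *(fake_async_generate(batch, sampling_params) for batch in batches)
    )
    for batch, outputs in zip(batches, results):
        for prompt, output in zip(batch, outputs):
            print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
    return results


results = asyncio.run(main())
```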
### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling llm.async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```

```
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm a [Your Profession or Role]. I've been working in the [Name of the Career] field for [Number of Years]. I'm a big fan of [Favorite Sport], [Favorite Book], [Favorite Movie], and [Favorite Music Artist]. I'm a true [Favorite Character]. I'm a [Favorite Book or Movie Character]. I'm a [Favorite Music Artist's Name]. I'm a [Favorite Sport's Name]. I'm a [Favorite Book Author]. I'm a [Favorite Movie Director]. I'm a [Favorite Director's Name]. I'm a [Favorite Actor]. I'm a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Its most famous landmark is the Eiffel Tower, which stands as one of the most iconic structures in the world. The city is also known for its extensive network of parks, including the Champs-Élysées, and its bustling fashion industry.

Additional facts about Paris include:

- The Eiffel Tower was completed in 1889
- The city is located in the northern region of France, at the mouth of the Seine River
- It is home to the headquarters of several of the world's major financial institutions
- Paris is known for its delicious cuisine and wine, as well as its

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  undoubtedly one of exponential growth and innovation. In the coming years, we are likely to see several key trends emerge:

1. Increased focus on ethical AI: As AI becomes more integrated into various sectors, such as healthcare, finance, and transportation, it will become increasingly important to consider the ethical implications of its decisions and their potential impacts on society.
2. Expansion of AI into new applications: AI is already being used in a wide range of applications, from natural language processing to robotics and autonomous vehicles. As the technology continues to advance, we can expect to see an expansion of AI into new areas that were previously thought impossible.
3. Integration
```

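The async streaming loop above consumes an asynchronous iterator of text chunks with `async for`. Below is a minimal sketch of that consumption pattern, written as an async generator that yields only the not-yet-emitted part of each chunk. `fake_stream` is a hypothetical stand-in for the engine, and this is not SGLang's `async_stream_and_merge`; the boundary-overlap trimming is an assumed heuristic.

```python
import asyncio


async def fake_stream(chunks):
    # Hypothetical stand-in for a streaming engine call: yields text chunks,
    # some of which repeat the tail of the previous chunk.
    for chunk in chunks:
        await asyncio.sleep(0)
        yield chunk


async def stream_new_text(chunks):
    """Async generator that yields only the part of each chunk not yet emitted."""
    emitted = ""
    async for chunk in fake_stream(chunks):
        # Find the longest prefix of `chunk` that already ends `emitted`.
        overlap = 0
        for size in range(min(len(emitted), len(chunk)), 0, -1):
            if emitted.endswith(chunk[:size]):
                overlap = size
                break
        new_text = chunk[overlap:]
        emitted += new_text
        if new_text:
            yield new_text


async def main():
    pieces = []
    async for piece in stream_new_text([" Paris", "Paris.", ". Its", " most famous"]):
        print(piece, end="", flush=True)  # print each piece as soon as it arrives
        pieces.append(piece)
    print()
    return pieces


pieces = asyncio.run(main())
```

Yielding from an async generator lets the caller print each piece immediately while other coroutines on the same event loop keep running.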
```python
# Shut down the engine and release its resources
llm.shutdown()
```