# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, or in any other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
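
The patch matters because `asyncio.run()` raises `RuntimeError` when it is called from a thread whose event loop is already running, which is the situation inside an IPython or Jupyter cell. The following is a minimal standard-library sketch for checking whether that situation applies. `loop_is_running` and `check_inside_loop` are hypothetical helper names used only for illustration; they are not part of SGLang or `nest_asyncio`.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


async def check_inside_loop() -> bool:
    # A coroutine always executes inside a running loop, which mimics
    # the environment of a notebook cell.
    return loop_is_running()


outside = loop_is_running()                # plain script: no loop is running
inside = asyncio.run(check_inside_loop())  # inside a coroutine: loop is running
print(outside, inside)
```

When the check returns `True`, applying `nest_asyncio` before using the engine is the approach this document recommends.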

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # Only imported when running in SGLang's CI
    import patch
else:
    # Allow asyncio.run() inside a notebook's already-running event loop
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.87it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Lynne, and I'm 15. I have a boyfriend named Aaron and we love to play sports together. Aaron is tall and he likes to play basketball very much. What do you think of Aaron's basketball playing style?
A) He is a good shooter and good at passing.
B) He is a good shooter but not so good at passing.
C) He is not good at shooting and not so good at passing.
D) He is a bad shooter and bad at passing. To determine what Lynne thinks about Aaron's basketball playing style, we need to consider the following points:

1. **Passing Skill**:
Prompt: The president of the United States is
Generated text:  very busy. On Sunday, he likes to spend a lot of time playing sports. He goes to the park to play basketball. He can't make time to see his wife and have dinner with his family. He has to get up early in the morning, get the car, and then drive to work. On the way, he has to stop for the traffic lights. He goes to work early in the morning and it takes him 30
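
In the cell above, `llm.generate` returns one dictionary per prompt, in prompt order, and the completion is read from its `text` field. A common next step in offline batch jobs is to store prompt and completion pairs. The sketch below serializes them as JSON Lines. `to_jsonl`, `sample_prompts`, and `sample_outputs` are hypothetical names, and the sample outputs are hand-written stand-ins for engine results, so the block runs without a model.

```python
import json
from typing import Any, Dict, List


def to_jsonl(prompts: List[str], outputs: List[Dict[str, Any]]) -> str:
    """Pair each prompt with its completion and emit one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    records = [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Hand-written stand-ins for the list returned by llm.generate(prompts, sampling_params).
sample_prompts = ["The capital of France is", "The future of AI is"]
sample_outputs = [{"text": " Paris."}, {"text": " uncertain."}]

jsonl = to_jsonl(sample_prompts, sample_outputs)
print(jsonl)
```

With a real engine, the second argument would be the `outputs` list from the cell above.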

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [age] year old, and I have a [job title] at [company name]. I'm a [job title] at [company name]. I'm a [job title] at [company name]. I'm a [job title] at [company name]. I'm a [job title] at [company name]. I'm a [job title] at [company name]. I'm a [job title] at [company

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in France and the third-largest city in the European Union. It is located in the south of the country and is the seat of government, administration, and culture for the French Republic. Paris is known for its rich history, art, and cuisine, and is a major tourist destination. It is also home to many famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is a cultural and economic hub of France and plays a significant role in the country's political and economic life. It is also a major hub for international trade and diplomacy. The city is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare, particularly in areas such as diagnosis, treatment planning, and patient care.

2. Increased use of AI in finance: AI is already being used in finance to improve fraud detection and risk management. As AI technology continues
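
The cell above describes `stream_and_merge` as streaming generation with overlap removal. The sketch below illustrates one way such overlap removal could work: before a chunk is appended, the longest prefix of the chunk that already ends the accumulated text is dropped. This is an illustrative assumption, not the implementation in `sglang.utils`; `strip_overlap`, `merge_chunks`, and the sample chunks are hypothetical.

```python
from typing import Iterable


def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hand-written chunks in which each one repeats the tail of the text so far.
chunks = [" Paris", " Paris. It", " It is the", " the capital."]
print(repr(merge_chunks(chunks)))  # ' Paris. It is the capital.'
```

A purely textual check like this cannot tell an accidental overlap from a genuine repetition (for example `"ha"` followed by `"ha"`), so it is only appropriate when chunks are known to overlap.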



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm an aspiring musician with a passion for music that goes beyond just playing a guitar. I'm currently pursuing a bachelor's degree in music theory and have been exploring different genres of music, including rock, classical, and jazz. I'm also interested in taking music classes and workshops to improve my skills. In my spare time, I like to listen to music and travel to new places to experience different cultures. How can you help me find the perfect music for my upcoming concert? It's a musical journey, and I'd love to hear your advice on how to make it happen! [Name] [Type of music you

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the capital city of France, located on the River Seine in the central region of the country. It is the largest city in France and the eighth-largest i
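
The cell above awaits a single `llm.async_generate` call for the whole prompt list. When an application instead awaits one request per prompt, for instance in a custom server, the number of in-flight requests can be capped with `asyncio.Semaphore`. The sketch below shows that pattern with a hypothetical stub, `fake_generate_one`, in place of a real engine call. `generate_bounded` is also a hypothetical helper, and nothing here is an SGLang API.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def generate_bounded(
    generate_one: Callable[[str], Awaitable[Dict[str, str]]],
    prompts: List[str],
    max_concurrency: int,
) -> List[Dict[str, str]]:
    """Run one request per prompt with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(prompt: str) -> Dict[str, str]:
        async with semaphore:
            return await generate_one(prompt)

    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(run(prompt) for prompt in prompts))


async def fake_generate_one(prompt: str) -> Dict[str, str]:
    # Hypothetical stub standing in for an awaited engine request.
    await asyncio.sleep(0.01)
    return {"text": f" <completion for: {prompt}>"}


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
results = asyncio.run(generate_bounded(fake_generate_one, prompts, max_concurrency=2))
for prompt, result in zip(prompts, results):
    print(f"Prompt: {prompt}\nGenerated text: {result['text']}")
```

Because `asyncio.gather` preserves input order, results can still be paired with prompts using `zip`, as in the cell above.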

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Sarah. I'm a busy writer and illustrator living in New York City. My favorite hobby is writing screenplays and I'm a big fan of anime and manga. I'm also an avid reader and love exploring new genres and authors. I'm always on the go and enjoy spending time with friends and family.

Hi there! It's nice to meet you. What brings you here today?

Hello! Nice to meet you too. What's your favorite genre to write in?

I really like writing screenplays and I'm a big fan of anime and manga. That sounds like a fun genre to work in.

What's


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

1. **Step 1: Identify the key elements of the statement**
   - **Paris**: The capital of France
   - **capital city**: It is the center of French government, politics, culture, and industry.

2. **Step 2: Determine the quantity**
   - The statement mentions Paris as a city.

3. **Step 3: Determine the extent of the statement**
   - The statement specifies Paris as the capital city.

4. **Step 4: Synthesize the information into a concise statement**
   - The concise statement is: "Paris is the capital of France."

By following


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  uncertain, but there are several potential trends that could shape the way we live, work, and interact with technology in the coming years.

1. Increased automation: The potential for AI to automate repetitive tasks and increase efficiency is likely to grow as more industries adopt AI. This could lead to job losses in certain sectors, but also create new opportunities for those who can adapt and learn to work with AI.

2. AI ethics and transparency: As AI becomes more integrated into our daily lives, there will be growing concerns about its ethical implications. Governments and organizations will need to work to ensure that AI is developed and used in a way that is fair
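
The loop above prints each cleaned chunk as soon as it arrives. As a sketch of the async-generator pattern behind that kind of helper, the block below assumes each streamed chunk carries the full text generated so far and yields only the newly added suffix. That chunk format is an assumption made for illustration, and `yield_new_text` and `fake_stream` are hypothetical names, not SGLang functions.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def yield_new_text(stream: AsyncIterator[Dict[str, str]]) -> AsyncIterator[str]:
    """Yield only the text added since the previous cumulative snapshot."""
    seen = 0
    async for chunk in stream:
        text = chunk["text"]
        if len(text) > seen:
            yield text[seen:]
            seen = len(text)


async def fake_stream() -> AsyncIterator[Dict[str, str]]:
    # Hypothetical stand-in for a streaming engine call; each item is cumulative.
    for snapshot in [" Paris", " Paris.", " Paris. It", " Paris. It is"]:
        await asyncio.sleep(0)
        yield {"text": snapshot}


async def collect() -> List[str]:
    return [piece async for piece in yield_new_text(fake_stream())]


pieces = asyncio.run(collect())
print(pieces)  # [' Paris', '.', ' It', ' is']
```

Tracking the length already emitted is cheaper than searching for overlaps, but it is only valid when every chunk extends the previous one.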




In [6]:
llm.shutdown()
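
Calling `shutdown()` explicitly, as above, works in a notebook. In a script, an exception raised during generation would skip that call. A small context manager can guarantee the shutdown either way. The sketch below is one possible pattern, not part of SGLang: `engine_session` is a hypothetical helper, and `FakeEngine` is a stub that mimics only the `generate` and `shutdown` methods used in this document, so the block runs without a model.

```python
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator, List


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def generate(self, prompts: List[str], sampling_params: Dict[str, Any]) -> List[Dict[str, str]]:
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def engine_session(make_engine: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine and always call shutdown() when the block exits."""
    engine = make_engine()
    try:
        yield engine
    finally:
        engine.shutdown()


engine_ref = None
try:
    with engine_session(FakeEngine) as engine:
        engine_ref = engine
        engine.generate(["The capital of France is"], {"temperature": 0.8})
        raise RuntimeError("simulated failure after generation")
except RuntimeError:
    pass

print(engine_ref.is_shut_down)  # True
```

With the real engine, the factory could be `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, matching the launch cell earlier in this document.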