# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython or in other code that already runs an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()
```
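
To see why the patch is needed, here is a minimal sketch that uses only the standard library (the names `inner` and `outer` are illustrative, not part of SGLang). Unpatched `asyncio` refuses to start a second event loop from inside a running one, which is the situation that can arise inside IPython or Jupyter; `nest_asyncio.apply()` patches `asyncio` so that this re-entrant use is allowed.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # An event loop is already running here, as it can be inside IPython/Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop from inside the first
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Without the patch applied, the printed line starts with `RuntimeError:`.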

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Jack and I’m very happy to be here today and share my story with you. I was born and raised in America but I moved to Canada when I was 13. I was very lucky to be able to live in a small town in the province of British Columbia in Canada. I was a talented musician and a student of music. I was also a talented writer. I began my music career as a musician and I was very lucky to have grown up in a family that supported me and encouraged me to keep on making music. When I was 21 I went to Ottawa and was offered a position at the Royal Conservatory
Prompt: The president of the United States is
Generated text:  a very important person. It's his job to make sure that everyone in the country is safe and secure. Most presidents are there because they were elected by their country's people. They must make important decisions like whether to send troops to fight in another country or make money for the government. 

Some presidents try to be fair to th
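
In an offline batch job, the results are usually persisted rather than printed. The following is a minimal sketch of one way to do that, relying only on the `output['text']` field used in the cell above. The helper name `write_results_jsonl` and the `fake_outputs` list are hypothetical; in a real run, `outputs` would be the list returned by `llm.generate(prompts, sampling_params)`.

```python
import io
import json
from typing import Any, Dict, Iterable, TextIO


def write_results_jsonl(
    prompts: Iterable[str], outputs: Iterable[Dict[str, Any]], fp: TextIO
) -> int:
    """Write one JSON object per prompt/output pair; return the record count."""
    count = 0
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


# Hypothetical stand-in for the list returned by llm.generate(...).
fake_outputs = [{"text": " Paris."}, {"text": " bright."}]

buffer = io.StringIO()  # swap for open("results.jsonl", "w") to write a file
written = write_results_jsonl(
    ["The capital of France is", "The future of AI is"], fake_outputs, buffer
)
print(written, "records written")
```

JSON Lines keeps each prompt and its completion on one line, so a partially written file is still readable if a long job is interrupted.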

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I am a [job title] at [company name]. I am passionate about [reason for being at the company]. I am always looking for ways to [what I enjoy doing at the company]. I am a [any other relevant qualities or skills]. I am excited to [what I hope to achieve at the company]. Thank you for asking! [Name] [Company Name] [Company Address] [Company Phone Number] [Company Email] [Company Website] [Company LinkedIn Profile] [Company Social Media Handles] [Company Social Media Handles] [Company Social Media Handles] [Company Social Media Handles] [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a popular tourist destination and home to many cultural institutions and events. Paris is a vibrant and diverse city with a rich history and a strong sense of French identity. The city is known for its cuisine, fashion, and art, and is a major hub for business and commerce in Europe. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. It is a city that has played a significant role in shaping French culture and identity for centuries. Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI will continue to automate tasks that are currently performed by humans, such as data analysis, decision-making, and routine maintenance. This will lead to increased efficiency and productivity, but it will also create new job opportunities.

2. Enhanced intelligence: AI will continue to improve its ability to learn and adapt, allowing it to perform tasks that were previously impossible. This will lead to new applications of AI, such as self-driving cars, personalized medicine, and virtual assistants.

3. AI ethics and privacy: As AI becomes more integrated into our daily lives, there will be increasing concerns
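
The cell above relies on "overlap removal" when merging streamed chunks. As an illustration of the idea only, the sketch below drops the part of each new chunk that repeats the end of the text collected so far. The functions `trim_overlap` and `merge_chunks` are written here for explanation, and the real `stream_and_merge` helper may work differently.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the prefix that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the largest repeat is removed.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text that repeats the end of the result."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks([" Paris is", " is the capital", " capital of France."])
print(repr(merged))
```

One limitation of this approach is that a genuine repetition in the model output (for example the chunk `"ha"` arriving twice) is indistinguishable from an overlap and is also dropped.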



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a young and ambitious businesswoman with a keen interest in social impact. I'm passionate about making a positive impact in the world and working towards a future where everyone has access to the same opportunities. I'm determined to drive change and create meaningful change in my community through my work and through my voice. My love for technology and my work ethic is a perfect fit for the role. I thrive on learning and adapting, and I'm eager to learn from others and grow as a leader. I'm confident in my ability to inspire others and make a difference in the world. Thank you. Congratulations! You've just

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light, a UNESCO World Heritage site. It is the most populous city in Europe and is the cultural and economic
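
An awaitable generate call lets other coroutines run while a request is pending. The sketch below shows one possible pattern for issuing several requests concurrently with a cap on how many are in flight. It mirrors the call shape used above (a list of prompts plus `sampling_params`, returning a list of dicts with a `text` key). `FakeAsyncEngine`, `generate_concurrently`, and `max_concurrency` are hypothetical names introduced so the example runs without a model; whether this pattern improves throughput with the real engine is not something this sketch demonstrates.

```python
import asyncio
from typing import Any, Dict, List


async def generate_concurrently(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    max_concurrency: int = 2,
) -> List[Dict[str, Any]]:
    """Send one request per prompt, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt: str) -> Dict[str, Any]:
        async with semaphore:
            outputs = await engine.async_generate([prompt], sampling_params)
            return outputs[0]

    # gather returns results in the same order as the prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


class FakeAsyncEngine:
    """Hypothetical stand-in for the engine; it only echoes the prompt."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


results = asyncio.run(
    generate_concurrently(FakeAsyncEngine(), ["a", "b", "c"], {"temperature": 0.8})
)
for item in results:
    print(item["text"])
```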

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a/an [Age] year old [Gender] girl. I have blonde hair and blue eyes, and I'm tall and thin. I have a soft laugh that makes people smile and I'm a fan of [Favorite Sports]. I love to travel and explore new places, and I've been to [Countless Countries]. I have a passion for [Favorite Activity], and I enjoy [Example: Learning new languages or trying new food]. I'm also a [Level of Fitness], which I've been getting stronger and more toned over the years. I have a great sense of humor, and I enjoy making friends

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city is known for its stunning architecture, rich history, and vibrant culture. It is the largest city in the world, home to several famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, and Louvre Museum. Paris is a popular tourist destination and has hosted numerous world events including the 1900 and 2012 Summer Olympics. It is also an important cultural hub, home to numerous museums, theaters, and opera houses. As of 2021, Paris has a population of approximately 2.1 million people. It is a bustling metropolis with a diverse population and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements and integration of new technologies. Here are some possible future trends in AI:

1. Increased focus on ethical considerations: As AI systems become more autonomous, there will be a greater emphasis on addressing the ethical implications of their decisions. This could include issues such as bias, accountability, and transparency.

2. Greater integration of AI with other technologies: As AI becomes more integrated with other technologies, such as sensors, cameras, and other data collection tools, we can expect to see more advanced and sophisticated AI systems that can work together to accomplish complex tasks.

3. Increased reliance on data: AI systems will require more


In [6]:
llm.shutdown()