# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop (a nested loop), apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
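
Trouble in such environments usually comes from asyncio refusing to start a second event loop while one is already running in the same thread. The following standard-library sketch reproduces that error without SGLang. The names `inner` and `outer` are illustrative only and are not part of any SGLang API.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, much like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that nested calls of this kind are permitted.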

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Alex and I'm a travel blogger. My goal is to share my love for the world and give a voice to people who are looking to improve their world. I love taking pictures of people and places around the world and I hope to inspire others to do the same.
My blog covers the world with a focus on diversity and inclusion. I want to share stories of people with different backgrounds and cultures and I'm passionate about sharing my experiences and stories. My heart is in the right place and I want to help others find their way. I believe that everyone deserves a good story and I'm here to help them find one.
I live in
Prompt: The president of the United States is
Generated text:  a very important person in the country. He or she is like the leader of the whole country. The president of the United States is very important because he or she is the leader of the whole country. Let's see why the president of the United States is important. First, the president 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title]. I'm always looking for new challenges and opportunities to grow and learn. What do you do for a living? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title]. I'm always looking for new challenges and opportunities to grow and learn. What do you do

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also a major cultural and economic center, hosting numerous museums, theaters, and restaurants. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is known for its rich history, including the influence of the French Revolution and the influence of the French Revolution on the world. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. The city is also home to many famous French artists, writers, and musicians. Paris is a city of art, culture

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a greater emphasis on ethical considerations. This will include issues such as bias, transparency, and accountability. AI developers will need to take a more responsible approach to their work, and will need to ensure that their algorithms are fair and unbiased.

2. Integration with other technologies: AI is likely to become more integrated with other technologies, such as machine learning, natural language processing, and computer vision. This
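
The "overlap removal" named in the cell above can be pictured as follows: when a streamed chunk repeats the tail of the text collected so far, only the new part is appended. This is a minimal standard-library sketch of that idea and not necessarily how `stream_and_merge` is implemented. The helpers `drop_overlap` and `merge_chunks` are hypothetical names used for illustration.

```python
def drop_overlap(existing_text, new_chunk):
    """Return the part of new_chunk that does not repeat the end of existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_chunks(chunks):
    """Join chunks into one string, skipping text that overlaps the previous tail."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries overlap.
merged = merge_chunks(["Hello", "lo wor", "world"])
print(merged)
```

Checking the longest candidate first ensures that the largest repeated span is removed, so no duplicated characters remain at a chunk boundary.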



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm an [Your Age] year old [Your Profession] who has been a [Your Interest/Interest] enthusiast for [Your Most Recent Publication/Biggest Success]. I'm always looking to learn more about the world and try new things. How can I assist you today? [Your Name] will be your guide, and you'll get to explore more about my interests and interests. [Your Name] will provide you with all the information you need to help you learn more about your hobbies and interests. [Your Name] will also encourage you to explore and discover more on your own. How can I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city of light and beauty. The city is famous for its historic architecture, vibrant culture, and numerous museums. It is home to the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral
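
Because `async_generate` is awaited, it can be combined with other coroutines using standard asyncio tools. The sketch below shows the general pattern of awaiting several requests concurrently with `asyncio.gather`. `fake_async_generate` is a hypothetical stand-in that only mimics the `{"text": ...}` result shape seen above, so the example runs without a model. Whether issuing separate concurrent calls helps for a given engine setup is an assumption to verify.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    # Hypothetical stand-in for an engine call; no model is involved.
    await asyncio.sleep(0.01)
    return {"text": f" <completion for: {prompt}>"}


async def generate_all(prompts, sampling_params):
    # Start one coroutine per prompt and wait for all of them.
    # asyncio.gather returns results in the same order as its arguments.
    tasks = [fake_async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)


results = asyncio.run(generate_all(["first", "second"], {"temperature": 0.8}))
for item in results:
    print(item["text"])
```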

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert first name], and I'm an [insert age] year old student in high school who loves to read, play sports, and travel. I'm also really good at [insert hobby or activity]. I'm an [insert profession] at [insert location]. What can you tell me about yourself?

[insert short, neutral self-introduction]
[insert profession] at [insert location]
What makes you unique and special about yourself?
[insert profession] at [insert location]
What's one experience that has truly shaped you into the person you are today?
[insert profession] at [insert location]
What



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the largest city in France and the second-largest city in Europe by population. It is located on the left bank of the Seine River, on the west bank of the Mediterranean Sea, and on the Rhône River to the east.

The city is also the economic and cultural center of France, and is home to numerous UNESCO World Heritage sites, including the Eiffel Tower, the Louvre Museum, the Notre Dame Cathedral, and the Arc de Triomphe.

Paris is also known for its cuisine, particularly its famous plate of foie gras. It is also famous for its museums, including



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  extremely promising, with many exciting developments and potential applications. Here are some possible future trends in AI:

1. Improved interpretability and transparency: AI models can become even more interpretable and transparent as researchers and developers learn more about how they work. This will enable users to understand how the AI system makes decisions and why it makes them certain decisions.

2. More diverse and inclusive use cases: AI is gaining popularity as a tool for improving society, and there is a growing trend of developing AI systems that are more diverse and inclusive. This includes increasing the number of people with disabilities, women, and other marginalized groups who can use and benefit from



In [6]:
llm.shutdown()