# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
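
The reason for this step can be illustrated with the standard library alone. The sketch below does not use SGLang or `nest_asyncio`; it only shows that `asyncio.run` refuses to start while another event loop is already running, which is the situation inside a notebook cell. The names `inner` and `outer` are hypothetical and exist only for this illustration.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # This coroutine runs inside an active event loop, much like notebook code.
    coro = inner()
    try:
        asyncio.run(coro)  # asyncio rejects starting a second, nested loop
    except RuntimeError as exc:
        coro.close()  # close the unused coroutine to avoid a "never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Applying `nest_asyncio` patches asyncio so that nested calls of this kind are allowed, which is why the later cells can call `asyncio.run(main())` from a notebook.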

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.17it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Jeanne and I'm a high school student majoring in Education. I have been studying education for the past three years and I have been working in the education field for the last two years. I have taken many courses on education and have learned a lot of things. I have also been a tutor for a few years now and I have tutored students from various grades, including high school and elementary school. I have also been a teacher in my high school and have had the opportunity to teach many different subjects, including English and Math. I have also been a tutor for other students and have helped them with their homework and tests.
I
Prompt: The president of the United States is
Generated text:  a very important person in the government of the United States. He or she is called the President of the United States. In the United States, the President is elected by the people. They are the leaders of the country. The President helps the country by making 
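
In an offline batch job the results are usually saved rather than printed. The following is a minimal sketch, assuming only what the example above shows: outputs come back in the same order as the prompts, and each output is a dict with a `"text"` key. The helper `to_jsonl` and the placeholder data are hypothetical and are not part of SGLang.

```python
import json
from typing import Dict, List


def to_jsonl(prompts: List[str], outputs: List[Dict[str, str]]) -> str:
    """Pair each prompt with its generated text and serialize as JSON Lines."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": prompt, "text": output["text"]}, ensure_ascii=False)
        for prompt, output in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Placeholder data standing in for the result of llm.generate(prompts, sampling_params).
example_prompts = ["The capital of France is", "The future of AI is"]
example_outputs = [{"text": " <completion 1>"}, {"text": " <completion 2>"}]

jsonl = to_jsonl(example_prompts, example_outputs)
print(jsonl)
```

With a real engine, `example_outputs` would be replaced by the list returned from `llm.generate`.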

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your character, such as "a friendly, helpful, and outgoing person" or "a dedicated, hardworking, and organized person"]. I enjoy [insert a short description of your character's interests, such as "reading", "traveling", or "traveling" and "traveling" and "traveling"]. I'm always looking for new experiences and adventures, and I'm always eager to learn and grow

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in Europe and the third-largest city in the world by population. The city is known for its rich history, beautiful architecture, and vibrant culture. It is also home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a popular tourist destination and a major economic center in Europe. It is also known for its fashion industry, art, and cuisine. The city is home to many international organizations and institutions, including the European Parliament and the European Central Bank. Paris is a cultural and intellectual center that plays a significant role in shaping the country

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased automation and artificial intelligence: As AI becomes more advanced, it is likely to become more prevalent in various industries, including manufacturing, healthcare, transportation, and customer service. Automation will likely lead to increased efficiency and productivity, but it will also create new jobs and require significant changes in work practices.

2. AI ethics and privacy concerns: As AI becomes more integrated into our daily lives, there will be increasing concerns about its ethical implications and potential privacy violations. There will likely be a
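
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a sketch of one possible approach, assuming that a streamed chunk may repeat text that has already been received; it is not claimed to be the implementation behind `stream_and_merge`. The function `merge_chunks` and the sample chunks are hypothetical.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while skipping text repeated from earlier chunks."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one repeats part of the text received so far.
chunks = [" Paris", " Paris is", " is the capital"]
print(repr(merge_chunks(chunks)))
```

A heuristic like this cannot tell an accidental overlap from a genuine repetition (for example the chunks `"ha"` and `"ha"`), so it only fits streams whose chunks are expected to overlap.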



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [Age] year old [Occupation], who has [previous experience] in [specific field]. I am passionate about [reason why I love my job or field], and I am always up for learning new things and trying new challenges. I thrive on innovation and experimentation, and I believe in taking risks to grow and develop my skills. My work ethic is strong, and I am always striving to improve my performance and effectiveness. I am a true believer in the importance of continuous learning and personal growth, and I am always eager to continue learning and evolving as a professional. Thank you for having me!

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city known as the "City of Light" and "City of Love." It is located on the Seine River and is the largest city in France and the fifth-largest city in
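
One reason to prefer the asynchronous interface is that generation can be awaited alongside other coroutines. The sketch below uses a hypothetical `StubEngine` that only mimics the call shape of `async_generate` shown above, so that it runs without a model; whether issuing several concurrent calls benefits a real engine is not something this document states.

```python
import asyncio
from typing import Dict, List


class StubEngine:
    """Hypothetical stand-in that mimics the async_generate call shape."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, float]
    ) -> List[Dict[str, str]]:
        await asyncio.sleep(0)  # yield control, as a real awaitable call would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(
    engine: StubEngine, batches: List[List[str]], sampling_params: Dict[str, float]
) -> List[str]:
    # asyncio.gather returns results in the order the awaitables were passed in.
    results = await asyncio.gather(
        *(engine.async_generate(batch, sampling_params) for batch in batches)
    )
    return [output["text"] for batch_outputs in results for output in batch_outputs]


texts = asyncio.run(
    run_batches(StubEngine(), [["a", "b"], ["c"]], {"temperature": 0.8, "top_p": 0.95})
)
print(texts)
```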

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [Your profession/role] with [Your relevant experience or education] in [Your field of interest]. In my free time, I enjoy [Your hobbies or interests], whether it's [An activity or project you've completed], [A challenge you've taken on], or [Something you've contributed to a project or cause]. What's your favorite way to unwind? I hope you enjoy our conversation! [Your Name] [Your profession/role] [Your relevant experience or education] Hello, my name is [Your Name], and I am a [Your profession/role] with [Your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

That's correct! Paris is the largest city in France and the capital of the country. It's known for its unique architecture, rich history, and vibrant culture. The city is home to many famous landmarks like the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a popular tourist destination with millions of visitors every year. How would you describe Paris? The city is a beautiful city with a rich history and a thriving culture. It has a unique blend of old and new, and a vibrant energy that keeps it lively even in the busy seasons. Paris is a city that's constantly evolving and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be a blend of progress and stagnation, with some areas set to continue advancing at a rapid pace and others at a slower pace. Here are some possible trends in the AI field in the coming years:

1. Increased focus on privacy and ethics: As more and more AI systems become pervasive in our daily lives, there is growing concern about privacy and the potential misuse of AI. Governments and organizations are likely to continue to enforce regulations around AI to protect user privacy and prevent bias in algorithms.

2. Advancements in computer vision and natural language processing: AI is becoming increasingly capable of processing visual and linguistic information, which will likely lead

In [6]:
llm.shutdown()
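
In a standalone script it can be convenient to make sure `shutdown()` runs even when generation raises an exception. The sketch below shows one way to do that with a context manager. `StubEngine` and `engine_session` are hypothetical names used only for illustration; the sketch assumes nothing about the engine beyond the `shutdown()` method called above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only a shutdown() method and a flag."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and call shutdown() on exit, even after an error."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with engine_session(engine):
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.is_shut_down)
```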