# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
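
As a rough illustration of the custom-server idea, the sketch below wraps an engine behind a single JSON endpoint using only the standard library. It is not the linked `custom_server.py` example: `EchoEngine`, `handle_generate`, and `make_handler` are hypothetical names, and the only engine behavior assumed is the `generate(prompts, sampling_params)` call returning a list of dicts with a `"text"` key, as used later in this document.

```python
import json
from http.server import BaseHTTPRequestHandler


class EchoEngine:
    """Stand-in for sgl.Engine so the sketch runs without a model (hypothetical)."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" echo:{p}"} for p in prompts]


def handle_generate(engine, body: bytes) -> tuple[int, bytes]:
    """Turn a JSON request body into a (status, JSON response) pair."""
    try:
        payload = json.loads(body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()
    sampling_params = payload.get("sampling_params", {})
    # List in, list out, mirroring the batch call used in this document.
    outputs = engine.generate([prompt], sampling_params)
    return 200, json.dumps({"text": outputs[0]["text"]}).encode()


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, response = handle_generate(engine, self.rfile.read(length))
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(response)

    return GenerateHandler


status, response = handle_generate(EchoEngine(), b'{"prompt": "hi"}')
print(status, response.decode())
```

To serve requests, the handler class could be passed to `http.server.HTTPServer`, for example `HTTPServer(("127.0.0.1", 8000), make_handler(llm)).serve_forever()`, where the address and port are placeholders. Keeping the request logic in a plain function makes it testable without opening a socket.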



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()
```
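
The need for this patch can be seen with the standard library alone. The sketch below calls `asyncio.run` from inside a coroutine, which is the situation a notebook cell is in when a loop is already running; the function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, as it is inside an IPython/Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Without patching, the nested call is rejected with a `RuntimeError`; `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.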

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.05it/s]



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Tim. I am 21 years old and I have a passion for writing about things that interest me. I love to learn and I love to explore the world. My first book came out last month, and I am very excited about it. However, my second book is still under development, so I am still in the writing process. But what I am most excited about is the potential of my book to reach a wider audience. I am looking for recommendations or suggestions on how to promote my book, whether it be through social media, marketing, or other methods. Can you please provide me with some ideas or tips on how to
Prompt: The president of the United States is
Generated text:  an executive leader. He has the authority to make the decisions of the government. He is responsible for the general direction of the country and the leadership of the nation. He has to make decisions regarding the economy and political power.
The job of the president is a very important one. It is the president
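
For prompt lists that are too large to process in one call, one option is to submit them in fixed-size chunks and collect `(prompt, text)` pairs as you go. The sketch below shows that pattern. `FakeEngine` and `generate_in_chunks` are hypothetical; the only assumption about the real engine is the `generate(prompts, sampling_params)` call shown above, which returns one dict with a `"text"` key per prompt.

```python
class FakeEngine:
    """Stand-in for sgl.Engine that records how often it is called (hypothetical)."""

    def __init__(self):
        self.calls = 0

    def generate(self, prompts, sampling_params):
        self.calls += 1
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def generate_in_chunks(engine, prompts, sampling_params, chunk_size=2):
    """Run generate() on successive slices of prompts and pair results with prompts."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    results = []
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        # Outputs are assumed to come back in the same order as the prompts.
        results.extend(zip(chunk, (output["text"] for output in outputs)))
    return results


engine = FakeEngine()
pairs = generate_in_chunks(engine, ["a", "b", "c"], {"temperature": 0.8}, chunk_size=2)
for prompt, text in pairs:
    print(f"{prompt!r} ->{text}")
```

Since the engine already schedules requests within a batch, chunking mainly helps with bounding memory for results or checkpointing progress between calls; a larger `chunk_size` gives the scheduler more to work with.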

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm a [job title] at [company name], and I'm excited to be here today. I'm a [job title] at [company name], and I'm a [job title] at [company name], and I'm a [job title] at [company name]. I'm a [job title] at [company name], and I'm a [job title] at [company name], and I'm a [job title] at [company name]. I'm a [job title] at [company name], and I'm

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, a city renowned for its historical architecture, vibrant culture, and annual festivals such as the Eiffel Tower and the Louvre Museum. It is the largest city in France and the third-largest city in the world by population. Paris is also known for its fashion industry, art scene, and its role in the French Revolution and the French Revolution. The city is home to many famous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is a major transportation hub, with the iconic Eiffel Tower serving as a symbol of the city's

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several trends, including:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes.

2. Enhanced machine learning capabilities: AI is likely to become more capable of learning from large amounts of data and making more accurate predictions and decisions.

3. Increased focus on ethical considerations: As AI becomes more integrated with human intelligence, there will be increased focus on ethical considerations and responsible use of AI.

4. Development of new AI technologies: AI is likely to continue to develop new technologies and applications, such as autonomous vehicles
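
The "overlap removal" mentioned above refers to merging streamed chunks whose beginnings may repeat text that was already received. The sketch below illustrates one way such merging can work: drop the longest prefix of the new chunk that matches a suffix of the text so far. It is an illustration of the idea, not necessarily how `stream_and_merge` is implemented, and `merge_chunks` is a hypothetical helper.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the prefix that repeats the end of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first so the largest repeat is removed.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that overlaps with what came before."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["The capital", " capital of", " of France"]))
# prints: The capital of France
```

A purely textual heuristic like this can also trim a legitimate repetition (for example, a chunk that really does start with the word the previous chunk ended on), so it suits streams that are known to resend overlapping text.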



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a professional software developer with a strong background in web development. I have a passion for creating innovative solutions that solve complex problems, and I'm always looking for ways to enhance the user experience on websites. I'm excited about the opportunity to work with a team of talented professionals and contribute my skills to make the world a better place. What brings you to this opportunity? I'm always on the lookout for new challenges and opportunities to learn and grow, and I'm excited about the possibility of contributing to the growth of a company. 

Please let me know if there are any specific roles or projects that you'd

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the capital city of France and serves as the country's political, cultural, and eco
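
Because `async_generate` is awaited, several independent batches can be in flight at once. The sketch below shows one way to do that with `asyncio.gather` and a semaphore that caps concurrency. `FakeAsyncEngine` and `generate_batches` are hypothetical; the only assumption about the real engine is the `async_generate(prompts, sampling_params)` coroutine used above.

```python
import asyncio


class FakeAsyncEngine:
    """Stand-in for sgl.Engine with an async_generate coroutine (hypothetical)."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" reply to {p!r}"} for p in prompts]


async def generate_batches(engine, batches, sampling_params, max_concurrent=2):
    """Run several batches concurrently, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(batch):
        async with semaphore:
            return await engine.async_generate(batch, sampling_params)

    # gather() returns results in the same order as the submitted batches.
    return await asyncio.gather(*(run_one(batch) for batch in batches))


batches = [["a", "b"], ["c"]]
results = asyncio.run(
    generate_batches(FakeAsyncEngine(), batches, {"temperature": 0.8})
)
for batch, outputs in zip(batches, results):
    print(batch, [output["text"] for output in outputs])
```

Whether concurrent batches improve throughput over a single large batch depends on the engine's scheduler; the pattern is mainly useful when requests arrive from independent sources.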

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  ____. I am a/an ________. I have been at this career for ____. I love ________. This is my first time working with ____. How is it going with you? Nice to meet you! That's a nice name, but it's important to choose one that accurately represents who you are. I can be an artist, but I also like to be in the field of sports. My greatest strength is my good communication skills. I have always loved learning about different subjects. My greatest fear is that I might get too excited. I'm looking forward to a great day at work! It's been a pleasure meeting you and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "la Vieille Opera" (the old opera house).

Does the following sentence make sense?
"Carpe diem means to seize the day, or to seize the present moment."?

OPTIONS: --yes --no

OPTIONS: --yes --no

OPTIONS: --yes --no

OPTIONS: --yes --no

OPTIONS: --yes --no

OPTIONS: --yes --no

OPTIONS: --yes --no

Yes, the sentence "Carpe diem means to seize the day, or to seize the present moment" makes sense. It is a correct and accurate representation of the meaning behind

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  predicted to be multi-faceted, and there are many potential trends that could shape the development of this technology. Here are some possible future trends in AI:

1. Increased Human-AI Interaction: As AI becomes more advanced, the ability of humans to interact with AI will likely increase. This could mean more natural language understanding and interaction, as well as more autonomous robots that can perform tasks in our homes and offices.

2. AI Ethics and Transparency: With AI becoming more complex and sophisticated, there will be a growing need for ethical guidelines and transparency. This could include guidelines for how AI should be developed, tested, and deployed, as
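
When consuming an asynchronous stream, it is often useful to keep the full text and to record how long the first chunk took to arrive. The sketch below does both for any async iterator of string chunks, such as the one iterated over in the cell above. `fake_stream` and `collect_stream` are hypothetical helpers, and the fake stream stands in for a real engine so the example runs without a model.

```python
import asyncio
import time


async def fake_stream(chunks, delay=0.01):
    """Stand-in for a streaming generator that yields text chunks (hypothetical)."""
    for chunk in chunks:
        await asyncio.sleep(delay)  # simulate per-chunk latency
        yield chunk


async def collect_stream(stream):
    """Consume an async stream; return the joined text and first-chunk latency."""
    start = time.perf_counter()
    first_chunk_latency = None
    parts = []
    async for chunk in stream:
        if first_chunk_latency is None:
            first_chunk_latency = time.perf_counter() - start
        parts.append(chunk)
    return "".join(parts), first_chunk_latency


text, latency = asyncio.run(collect_stream(fake_stream([" Paris", ",", " also"])))
print(repr(text))
print(f"first chunk after {latency:.3f}s")
```

Joining the parts at the end avoids repeated string concatenation, and the latency value is `None` if the stream yields nothing.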




```python
llm.shutdown()
```
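
In a script, an exception raised between engine creation and `shutdown()` would skip the shutdown call. One way to guard against that is a small context manager, sketched below. `running_engine` and `FakeEngine` are hypothetical; the only assumption about the real engine is the `shutdown()` method used above.

```python
from contextlib import contextmanager


class FakeEngine:
    """Stand-in for sgl.Engine that tracks whether shutdown() was called (hypothetical)."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for {p!r}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def running_engine(engine):
    """Yield the engine and always call shutdown() afterwards, even on errors."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
with running_engine(engine) as llm_like:
    outputs = llm_like.generate(["Hello, my name is"], {"temperature": 0.8})
print(engine.is_shut_down)
```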