# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
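The patch matters because `asyncio.run()` raises a `RuntimeError` when it is called from a thread that already has a running event loop. The sketch below uses only the standard library to check for that situation. The helper names `in_running_loop` and `check_inside_loop` are illustrative and are not part of SGLang or `nest_asyncio`.

```python
import asyncio


def in_running_loop() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe to call.
        return False
    return True


async def check_inside_loop() -> bool:
    # A coroutine only executes while a loop is running, so this reports True.
    return in_running_loop()


if in_running_loop():
    print("A loop is already running; apply nest_asyncio before asyncio.run().")
else:
    print("Loop visible from a coroutine:", asyncio.run(check_inside_loop()))
```

In a plain Python script the `else` branch runs. In a notebook kernel the first branch is the one you would expect to see.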

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Notebooks already run an event loop; nest_asyncio lets asyncio.run()
    # be called from inside it.
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Sarah and I’m writing a story. I’m looking for advice on what type of story or genre to choose. I'm writing a science fiction story. Can you give me a quick rundown of some possible story types or genres in science fiction?

Certainly! Here are a few possible story types or genres you might consider for a science fiction story:

1. **Action-Adventure**: This type of story focuses on physical action and adventure, often involving space travel, combat, or environmental survival. It's popular in both genre fiction and science fiction.

2. **Science Fiction**: This is the oldest and most well-established type of science fiction.
Prompt: The president of the United States is
Generated text:  two-thirds the age of the president of the Senate. If the president of the Senate is 82 years old, what is the total age of both the president of the United States and the president of the Senate?

To determine the total age of both the president of the United 
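For offline jobs it is often convenient to keep each prompt together with its completion and write the pairs to disk. The following is a minimal sketch under one assumption taken from the cell above: `generate(prompts, sampling_params)` returns one dict with a `"text"` key per prompt, in the same order. `StubEngine`, `generate_records`, and `to_jsonl` are hypothetical names introduced here so the example runs without a model; in real use you would pass the `llm` object instead of the stub.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompts, sampling_params):
        # Mimics the shape shown above: one dict with a "text" key per prompt.
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


def generate_records(engine, prompts, sampling_params):
    """Run one batched call and pair every prompt with its generated text."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": prompt, "text": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


records = generate_records(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
print(to_jsonl(records))
```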

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [occupation] with [number of years] years of experience in [field]. I am a [type of person] and I am always looking for ways to [describe a new skill or hobby]. I am [age] years old and I am [gender]. I am [addressable to a broad audience] and I am [addressable to a specific audience]. I am [addressable to a specific audience] and I am [addressable to a specific audience]. I am [addressable to a specific audience] and I am [addressable to a specific audience]. I am [addressable

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major center for art, culture, and politics, and is home to many world-renowned museums and historical sites. Paris is a popular tourist destination and a cultural hub for Europe. It is also known for its rich history and diverse population, which has contributed to its status as a major city in Europe. The city is home to many international organizations and institutions, including the European Parliament and the European Central Bank. Paris is a vibrant and dynamic city that continues to be a major center of global

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars and robots to personalized medicine and virtual assistants. Additionally, there is a growing focus on ethical considerations and the development of AI that is transparent, accountable, and responsible. As AI becomes more integrated into our daily lives, we can expect to see a significant impact on society, from the way we work and communicate to the way we interact with technology. Ultimately, the future of AI is likely to be one of continued innovation, collaboration
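The "overlap removal" mentioned in the cell above can be pictured as follows: if a newly streamed chunk begins with text that already ends the accumulated output, only the unseen remainder is appended. This is a simplified sketch of that idea, not a claim about how `stream_and_merge` is implemented; the function names are illustrative.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the prefix of new_chunk that repeats the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(repr(merge_chunks([" Paris", " Paris is", " is the capital"])))
```

One caveat of this approach is that a genuine repetition in the generated text (for example `"ha"` followed by `"ha"`) is indistinguishable from an overlap and would be trimmed as well.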



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [职业/专业/身份] at [Your Organization]. I have been in this industry for [X years], and I am always looking for ways to [Your Passion], whether it's [Y, a skill, or a habit]. I am also [Z]. How can I start my day today? Could you tell me more about your background and what motivates you to succeed in your field? Thank you! [Your Name] [Your Title] Introduction: As an experienced [Your Profession], I have dedicated my career to [Your Profession] for [X years]. Throughout this time,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is the largest city in France and the third-largest city in Europe, with a population of approximately 2. 4 million. It is located on the right bank of the Seine River in the heart of the Paris region, in the Paris basin. Paris is a cultural and historic
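Because `async_generate` is awaited, several calls can be in flight at once from the same event loop. The sketch below illustrates the pattern with `asyncio.gather` and a stub. `StubAsyncEngine` and `generate_groups` are hypothetical names, and the stub only imitates the call shape used above (a list of prompts in, a list of dicts with a `"text"` key out). How a real engine schedules concurrent calls is not shown here.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in exposing the async_generate shape used above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to work without blocking the loop
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_groups(engine, prompt_groups, sampling_params):
    """Issue one async_generate call per group and await them together."""
    calls = [engine.async_generate(group, sampling_params) for group in prompt_groups]
    # gather returns results in the same order as the awaitables passed in.
    results = await asyncio.gather(*calls)
    return [[output["text"] for output in outputs] for outputs in results]


groups = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
texts = asyncio.run(
    generate_groups(StubAsyncEngine(), groups, {"temperature": 0.8, "top_p": 0.95})
)
print(texts)
```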

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [BIO] who has been [HAPPY Hobbies]. I love to travel, and I often plan and create unique destinations for my travels. I'm passionate about photography and videography, and I'm always on the lookout for new ideas to capture the beauty of the world around us. I'm also a strong advocate for mental health and I believe that being happy is a fundamental aspect of a healthy and fulfilling life. I'm always looking to learn new things, try new things, and discover the world around me. Thank you for asking! What kind of work do you do for a



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the largest city in France and the most populous city in the European Union. It is located on the River Seine and is the cultural and economic center of France. It has a rich history, including the construction of the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. The city is known for its delicious cuisine, world-renowned museums and art collections, and its vibrant nightlife. Paris is also famous for its fashion industry, including brands like Chanel, Dior, and Hermès. The city is home to many iconic landmarks, including the Tour de Pitié and the Arc de Triomp



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  dynamic and changing constantly, but there are several areas where the technology is likely to continue to develop and improve:

1. Increased accuracy and precision: One of the most promising areas of AI is the development of more accurate and precise algorithms. As we get better at capturing and understanding human language, we can expect to see even more sophisticated AI systems that can improve their accuracy over time.

2. Increased collaboration and integration: AI is already becoming increasingly integrated into a wide range of applications, from self-driving cars to personalized medicine. As the technology becomes more prevalent and widely available, we can expect to see even more seamless and integrated AI systems that can



In [6]:
llm.shutdown()
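
In a script, an exception raised before the final cell would skip the `shutdown()` call. One way to guard against that is a small context manager, sketched here with a stub. `engine_session` and `StubEngine` are hypothetical names, and the only engine method assumed is the `shutdown()` call shown above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() on exit, even on errors."""
    try:
        yield engine
    finally:
        engine.shutdown()


stub = StubEngine()
try:
    with engine_session(stub) as engine:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(stub.is_shut_down)
```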