# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio

If you want to use the **Offline Engine** in IPython, or in other code that already runs an event loop, add the following code first:

```python
import nest_asyncio

nest_asyncio.apply()
```
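
The reason is that IPython and Jupyter already run an event loop, and plain `asyncio.run()` refuses to start a second one inside it. The following standard-library sketch reproduces that failure without SGLang; `inner` and `outer` are hypothetical names used only for illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # Inside a coroutine an event loop is already running, as in IPython/Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call, rejected by plain asyncio
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are permitted.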

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Ziyad. I am a graduate student in the Mathematical Sciences division at Penn State University. My research interests are in algebraic geometry and number theory, particularly the arithmetic of algebraic varieties, arithmetic dynamics and formal groups.
I am always interested in how the problems in algebraic geometry and number theory can be used to solve interesting problems in other areas of mathematics. In my free time, I enjoy doing mathematics with my wife, loving animals, and reading math books.
Prompt: The president of the United States is
Generated text:  seeking to secure a new proposal to improve the performance of the Department of Defense and the Army National Guard. The proposal is based on the idea of a “Digital Military” to replace outdated hardware and software. The president has received a proposal from an unnamed private company which claims to have developed a solution. The proposal is designed to replace outdated hardware an
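
For readers unfamiliar with the two sampling parameters used above, here is a small, self-contained sketch of how temperature scaling and top-p (nucleus) filtering are commonly defined. It illustrates the general technique only and is not SGLang's sampler; `top_p_filter` and the toy logits are hypothetical.

```python
import math


def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return {token: prob} after temperature scaling and nucleus filtering."""
    # Temperature scaling: lower values sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Softmax with max-subtraction for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    kept, mass = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        mass += p
        if mass >= top_p:
            break
    # Renormalize the surviving tokens so they sum to 1.
    return {tok: p / mass for tok, p in kept.items()}


# Toy logits invented for illustration (assumption, not model output).
toy_logits = {" Paris": 3.0, " Lyon": 2.0, " Nice": 1.0, " pizza": -3.0}
filtered = top_p_filter(toy_logits)
print(filtered)
```

With these toy values the least likely token falls outside the nucleus and is dropped before sampling.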

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Occupation]. I'm a [Skill] who has been [Number of Years] years in the field of [Field of Interest]. I'm a [Favorite Hobby] who enjoys [Favorite Activity]. I'm a [Favorite Book or Movie] who reads [Number of Books] or watches [Number of Movies]. I'm a [Favorite Sport] who plays [Favorite Sport]. I'm a [Favorite Music] who listens to [Number of Songs] or sings [Number of Songs]. I'm a [Favorite Food] who eats [Number of Meals]. I'm a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also home to the French Parliament and the French National Museum of Modern Art. Paris is a cultural and historical center with a rich history dating back to the Roman Empire and the French Revolution. It is a major transportation hub and a major tourist destination. The city is known for its cuisine, fashion, and music, and is home to many famous landmarks and attractions. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. It is a city that has played a significant role in shaping

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way that AI is used and developed. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be an increased focus on ethical considerations. This will include issues such as bias, privacy, and transparency. AI developers will need to be more mindful of the potential consequences of their work and strive to create AI that is fair, transparent, and accountable.

2. Greater use of AI in healthcare: AI is already being used in healthcare to
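
The banner above mentions overlap removal. One plausible way such a merge can work is sketched below: each incoming chunk is trimmed of any prefix that repeats the tail of the text accumulated so far. This is an illustration under assumptions, not necessarily how `stream_and_merge` is implemented; `trim_overlap`, `merge_stream`, and the fake chunk list are hypothetical stand-ins for a real streaming generator.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that duplicates the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks):
    """Concatenate streamed chunks while skipping overlapping text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Invented chunks whose edges overlap (assumption, not real engine output).
fake_chunks = [" Paris is", " is the capital", " capital of France."]
merged_text = merge_stream(fake_chunks)
print(merged_text)
```

A trade-off of this heuristic is that text which genuinely repeats at a chunk boundary would also be trimmed.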



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  John. I'm a friendly, outgoing person who enjoys trying new things and exploring new cultures. I have a natural talent for communication and I'm always looking for new ways to make people smile. And of course, I'm really passionate about helping people, whether that's helping someone in a crisis or providing them with guidance on their goals. I'm a great listener, and I'm always willing to learn and grow. If you'd like to know more about me or discuss how I can help, I'd be happy to do that. Goodbye! John. (Note: The above is a fictional self-introduction and does not reflect

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the capital city of France, known for its rich history, stunning architecture, and vibrant culture. The city is home to many famous landmarks such as the Eiffel Tower, the L
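
When the engine is embedded in a larger asyncio application, the generation coroutine can be awaited alongside other tasks. The sketch below uses a hypothetical `StubEngine` whose `async_generate` mimics only the call shape shown above (a list of prompts in, a list of dicts with a `text` key out), so it runs without a model; `heartbeat` and `run_batch` are likewise illustrative names.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the real engine; echoes prompts instead of generating."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks):
    # Other work that keeps running while generation is awaited.
    for _ in range(3):
        await asyncio.sleep(0.005)
        ticks.append("tick")


async def run_batch(engine, prompts, sampling_params):
    ticks = []
    # gather() runs both coroutines concurrently on the same event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        heartbeat(ticks),
    )
    return outputs, ticks


stub_outputs, stub_ticks = asyncio.run(
    run_batch(StubEngine(), ["The capital of France is"], {"temperature": 0.8})
)
print(stub_outputs[0]["text"], len(stub_ticks))
```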

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert your name here]. I am a [insert your profession here] with a passion for [insert something specific about your profession here]. I believe that being a professional and passionate about your work is crucial to achieving success in your field. My goal is to become an expert in my field, and I am committed to learning and growing continuously. I am always looking for new challenges and opportunities to learn and grow, and I am excited to share my knowledge with others. I am a [insert any relevant accolades or accomplishments here]. I believe that my knowledge and expertise make me a valuable asset to my organization, and I am proud to



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known for its rich history, stunning architecture, and vibrant culture. Paris is the largest and most populous city in France, with over 2. 8 million residents. It is the seat of government, the center of industry, and the heart of French culture. Paris is famous for its landmarks such as Notre-Dame Cathedral, the Eiffel Tower, the Louvre Museum, and the Place des Vosges. It is also known for its unique cuisine, fashion, and arts scene. Paris has a rich history and culture that dates back to ancient times and continues to be a vital part of French identity.



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain, but here are some possible trends we can expect to see in the coming years:

1. Improved accuracy and reliability: AI systems are becoming increasingly accurate and reliable, with improvements in natural language processing, machine learning, and computer vision.

2. Increased autonomy and decision-making: AI systems are becoming more capable of making decisions based on data, which could lead to increased autonomy and decision-making in areas like medicine, transportation, and manufacturing.

3. Enhanced creativity and personalization: AI systems are becoming more capable of generating creative outputs and personalizing experiences, which could lead to a new generation of personalized and intelligent products and services.

4
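
The `async for` loop in this section consumes an asynchronous generator chunk by chunk. The following sketch shows that consumption pattern with a hypothetical `fake_stream` generator in place of `async_stream_and_merge`, so it can run without a model; the chunk boundaries are invented for illustration.

```python
import asyncio


async def fake_stream(text, chunk_size=4):
    """Hypothetical async generator yielding text in small pieces."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start : start + chunk_size]


async def collect(stream):
    pieces = []
    async for piece in stream:
        # end="" keeps chunks on one line; flush=True shows them immediately.
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()  # final newline once the stream is exhausted
    return "".join(pieces)


streamed = asyncio.run(collect(fake_stream(" Paris is the capital of France.")))
```

Collecting the pieces as they arrive gives the caller both incremental display and the complete text at the end.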




```python
llm.shutdown()
```
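
If generation can raise an exception, it may be worth guaranteeing that `shutdown()` still runs. Below is a minimal sketch, assuming only that the engine object exposes a `shutdown()` method as used above; `engine_session` and `StubEngine` are hypothetical helpers and not part of SGLang.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() on exit, even after an error."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with engine_session(engine) as llm_stub:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass
print(engine.closed)
```

The same wrapper could be applied to a real engine instance, under the assumption that calling `shutdown()` once at exit is the desired cleanup.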