# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
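The underlying issue is that environments such as IPython already run an event loop, and plain `asyncio` refuses to start a second one inside it. The following standard-library sketch reproduces that failure mode; it does not use SGLang or `nest_asyncio`, and the function names `inner` and `outer` are purely illustrative.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, as in an IPython cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(outer())
print(message)
```

Patching the loop with `nest_asyncio.apply()` is what allows this kind of re-entrant call to succeed.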

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Import the modules used in this document, then launch the offline engine
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.26it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Rachel and I am a highly motivated, creative and innovative filmmaker, with a strong background in digital video production. I am passionate about both the creative and technical aspects of filmmaking and have completed over a dozen films since my beginning years in filmmaking.
My expertise lies in working with digital video equipment, and my unique style has helped me to continually improve my craft and hone my own unique direction. I believe that film should always be about the story and the people. I am excited to work with you to bring your story to life and help you achieve your vision. Rachel is a highly talented and dynamic filmmaker who knows how to turn the very
Prompt: The president of the United States is
Generated text:  a _____ political party.
A. left
B. right
C. center
D. moderate
Answer:

C

[Multiple Choice] The basic principles of civil litigation include ____
A. Principle of Substantive Equality
B. Principle of Fairness
C. P
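The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough illustration of what these two knobs conventionally mean (a toy sketch of the textbook definitions, not SGLang's actual sampler; the helper names and the logits are made up), temperature rescales the scores before the softmax, and top-p keeps the smallest set of tokens whose cumulative probability reaches the threshold:

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for four candidate tokens
sharp = apply_temperature(logits, 0.2)  # low temperature: peaked distribution
flat = apply_temperature(logits, 0.8)  # higher temperature: flatter distribution
print(top_p_filter(sharp, 0.9))
print(top_p_filter(flat, 0.95))
```

With these toy numbers, the low-temperature setting leaves only the top token in the candidate set, while the higher-temperature setting keeps three, which is why higher values tend to produce more varied text.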

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. Let's chat! [Name] [Company Name] [Company Address] [Company Website] [Company LinkedIn Profile] [Company Twitter Profile] [Company Facebook Profile] [Company Instagram Profile] [Company GitHub Profile] [Company LinkedIn Profile] [Company Twitter Profile] [Company Facebook Profile] [Company Instagram Profile] [Company LinkedIn Profile] [Company Twitter Profile] [Company Facebook Profile] [Company Instagram Profile] [Company LinkedIn Profile] [Company Twitter Profile] [Company Facebook

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, a city with a rich history and diverse culture. It is located on the Seine River and is home to many famous landmarks such as Notre-Dame Cathedral, the Louvre Museum, and the Eiffel Tower. Paris is also known for its vibrant nightlife, fashion industry, and annual festivals such as the Eiffel Tower Festival and the Louvre Festival. The city is a major center of business, politics, and culture in Europe and is a popular tourist destination. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. Its reputation as

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased automation and artificial intelligence: As AI becomes more advanced, it is likely to become more integrated into our daily lives. This could lead to increased automation in industries such as manufacturing, transportation, and healthcare, as well as in areas such as customer service and customer support.

2. Improved privacy and security: As AI becomes more advanced, it is likely to require more data to function effectively. This could lead to increased privacy concerns, as AI systems may be used to collect
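The banner printed above mentions overlap removal. The idea can be sketched as follows, assuming that consecutive streamed chunks may repeat the tail of the text received so far. This is an illustration of the concept only, not the code in `sglang.utils`; `trim_overlap` and `merge_chunks` are hypothetical helpers.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends `existing`."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping any repeated overlap."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = [" Paris", " Paris is", " is the capital"]
print(merge_chunks(chunks))
```

Checking the longest candidate overlap first keeps the merge from duplicating text when a chunk restates several words at once.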



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a highly skilled and determined individual with a passion for adventure and exploration. I'm a seasoned traveler who loves nothing more than setting out on journeys that push me out of my comfort zone. Whether it's exploring remote corners of the world or venturing into the unknown, I'm always on the lookout for new challenges and experiences. I love being flexible and adaptable, always looking for ways to grow and learn. I'm a natural leader and have led countless expeditions and group adventures, proving that even the strongest are capable of great things when they have the right mindset and a good sense of humor. I'm a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located on the north bank of the Seine River. It is the largest and most populous city in the country and serves as the
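When requests arrive one at a time, as in a custom server, a common pattern is to issue one asynchronous call per prompt and await them together. The sketch below uses a hypothetical `StubEngine` that only echoes its input, so it runs without a model. It mirrors the `output["text"]` shape shown above but makes no claim about the real engine's internals.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; it only echoes the prompt."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do some work
        return {"text": f" <completion of {prompt!r}>"}


async def generate_all(engine, prompts, sampling_params):
    # Build one awaitable per prompt so they can run concurrently.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
results = asyncio.run(generate_all(StubEngine(), prompts, {"temperature": 0.8}))
for prompt, result in zip(prompts, results):
    print(prompt, "->", result["text"])
```

Because `asyncio.gather` preserves input order, the results can be zipped back to their prompts exactly as in the batch example.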

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm [Age]. I'm an [occupation] who has been working hard to achieve [specific goal]. I enjoy [hobby or sport] and I believe that [career goal]. As a [character type], I am [type] and I am [relationship status]. I am [age] years old, and I am [gender]. I am an [occupation] who has been [occupation] for [time] years. I am [gender], and I believe that [character goal is]. I am currently [occupation], and I am [gender]. I am passionate about [hobby or sport],



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the “City of Light” and "the Gothic Capital". Paris is a large and dynamic city with a rich history dating back to the Middle Ages, and today it is a bustling metropolis known for its art, culture, cuisine, fashion, and nightlife. The city is famous for its landmarks, such as the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Champs-Élysées, and is a UNESCO World Heritage site. Paris has a diverse population of over 2. 5 million people, and it is home to many famous museums, theaters, and parks



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be shaped by a number of potential trends, including:

1. Increased integration with other technologies: As AI becomes more integrated with other technologies, including machine learning, big data, and the Internet of Things (IoT), we can expect to see more complex algorithms and models that can analyze and learn from a wider range of data sources. This will make AI systems more capable of understanding complex patterns and making more accurate predictions.

2. Greater focus on ethical considerations: As AI becomes more widely used, there will be a growing need to consider the ethical implications of its use. This will likely lead to more rigorous testing and evaluation of AI
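If you need both the live display and the final string, the chunks can be accumulated while they are printed. The sketch below uses a hypothetical `fake_stream` async generator in place of the engine so that it runs standalone. The `async for` consumption pattern matches the cell above.

```python
import asyncio


async def fake_stream(text):
    """Hypothetical chunk source; yields one word at a time."""
    for word in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word + " "


async def collect(stream):
    """Print chunks as they arrive and return the assembled text."""
    pieces = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()  # newline once the stream is exhausted
    return "".join(pieces).rstrip()


final_text = asyncio.run(collect(fake_stream("Paris is the capital of France")))
```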




In [6]:
llm.shutdown()
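In a script, it can be convenient to guarantee that `shutdown()` runs even when generation raises an exception. One possible sketch wraps the engine in a context manager. `StubEngine` and `managed` are hypothetical names used only for illustration, and the stub exposes nothing beyond the `shutdown()` call shown above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only a shutdown() method."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def managed(engine):
    try:
        yield engine
    finally:
        engine.shutdown()  # runs on success and on error


engine = StubEngine()
try:
    with managed(engine) as active:
        raise ValueError("generation failed")  # simulate a failure mid-run
except ValueError:
    pass
print(engine.closed)
```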