# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, for use cases where a server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
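
As a rough illustration of the idea (this is not the linked `custom_server.py`), a custom server largely reduces to a request handler that forwards a prompt to the engine and serializes the result. The sketch below is framework-agnostic. `handle_generate` and `StubEngine` are hypothetical names, and the stub stands in for `sgl.Engine` so the snippet runs without a model. It assumes only that `generate` takes a list of prompts plus a sampling-parameter dict and returns one dict with a `text` key per prompt, as in the examples in this document.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    def generate(self, prompts, sampling_params):
        # Mirrors the list-of-dicts shape used elsewhere in this document.
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


def handle_generate(engine, request_body: str) -> str:
    """Turn a JSON request body into a JSON response body."""
    payload = json.loads(request_body)
    prompt = payload["prompt"]
    sampling_params = payload.get("sampling_params", {})
    # Submit a one-element batch and take its single result.
    output = engine.generate([prompt], sampling_params)[0]
    return json.dumps({"prompt": prompt, "text": output["text"]})


response = handle_generate(
    StubEngine(),
    json.dumps({"prompt": "Hello", "sampling_params": {"temperature": 0.8}}),
)
print(response)
```

A handler like this can be mounted in whichever web framework you prefer; swapping `StubEngine()` for a real engine instance is the only change the handler itself should need under the stated assumption.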



## Nest Asyncio
To use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()
```
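
The underlying issue is that IPython and Jupyter already run an event loop, and `asyncio.run` refuses to start a second one inside it. The standard-library-only sketch below reproduces that failure. `loop_is_running` and `try_nested_run` are illustrative names, not part of SGLang or `nest_asyncio`, and the snippet is meant to be run as a plain script rather than in a notebook.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


async def try_nested_run() -> str:
    """Attempt asyncio.run from inside a running loop and report the result."""

    async def inner() -> str:
        return "ok"

    coro = inner()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


print(loop_is_running())              # no loop is running at script top level
print(asyncio.run(try_nested_run()))  # the nested asyncio.run call is rejected
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are permitted, which is why it is applied before creating the engine in a notebook.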

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Notebooks already run an event loop; nest_asyncio allows nested use.
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Alex and I'm here to make a speech to my friends. Please share a piece of your own personal experience with a friend.
My name is Alex and I'm here to make a speech to my friends. Please share a piece of your own personal experience with a friend.
Alex, how do you typically spend your free time?
It's always a bit hectic in the office, but I try to make time for something that I enjoy, whether that's reading, playing board games, or just hanging out with friends.
How did you meet your current friend?
I met her at a coffee shop where I work, and we became friends almost
Prompt: The president of the United States is
Generated text:  a representative of the American people. The people of the United States are a unified whole, composed of a(n) ________ of people with different social strata and different life statuses. A) masses B) majority C) majority of D) majority of the population
Answer:

D) majority of the population

The people of the United 
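
For larger offline jobs it can help to submit prompts in fixed-size chunks and keep each prompt next to its output. The following is a minimal sketch under stated assumptions: `batched_generate`, `chunked`, and `StubEngine` are hypothetical, the stub replaces `sgl.Engine` so the code runs without a model, and the only behavior relied on is that `generate(prompts, sampling_params)` returns one dict with a `text` key per prompt, in the same order as the prompts.

```python
from typing import Dict, Iterator, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; returns one dict per prompt."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" [{len(p)} chars seen]"} for p in prompts]


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batched_generate(engine, prompts, sampling_params, chunk_size=2) -> List[Dict[str, str]]:
    """Run generation chunk by chunk and pair each prompt with its text."""
    records = []
    for chunk in chunked(prompts, chunk_size):
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            records.append({"prompt": prompt, "text": output["text"]})
    return records


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
records = batched_generate(StubEngine(), prompts, {"temperature": 0.8, "top_p": 0.95})
for record in records:
    print(record)
```

The engine schedules each submitted batch itself, so chunking is optional; its main use here is to bound how many results are held in memory before they are written out.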

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm a [job title] at [company name], and I'm passionate about [job title] because [reason for passion]. I'm always looking for new challenges and opportunities to grow and learn, and I'm always eager to learn more about [job title] and the company. I'm a [job title] at [company name], and I'm excited to be here and contribute to [job title] at [company name]. I'm a [job title] at [company name], and I'm always looking for new challenges and opportunities to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major center for art, culture, and politics, and is home to many important museums and historical sites. Paris is a bustling metropolis with a rich history and a diverse population, making it a popular tourist destination. It is also known for its cuisine, including French cuisine, which is renowned for its rich flavors and complex ingredients. Paris is a city that has been a center of culture and politics for centuries, and continues to be a major hub for international trade and diplomacy. The

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI systems are likely to become more integrated with human intelligence, allowing them to learn from and adapt to human behavior and decision-making processes.

2. Enhanced natural language processing: AI systems will become even more capable of understanding and generating human language, allowing for more natural and intuitive interactions with humans.

3. Improved decision-making: AI systems will become more capable of making more accurate and informed decisions, based on a wide range of data and information.

4. Increased use of AI in healthcare: AI will be used to improve the accuracy and
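
"Overlap removal" in the cell above refers to dropping text that a newly streamed chunk repeats from what has already been received. One way this could be done is sketched below. `strip_repeated_prefix` and `merge_chunks` are illustrative helpers written for this document and are not claimed to match the internals of `stream_and_merge`.

```python
def strip_repeated_prefix(received: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that `received` already ends with."""
    for size in range(min(len(received), len(chunk)), 0, -1):
        if received.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping any text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


# Each chunk partly repeats the tail of the text received so far.
chunks = [" Paris", " Paris, the city", " city known for"]
print(repr(merge_chunks(chunks)))
```

A suffix/prefix heuristic like this cannot tell a retransmitted fragment from a genuine repetition in the output (for example `"ha"` followed by `"ha"`), so it is only appropriate when chunks are known to overlap.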



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text: ... [Your Name]! I'm [Your Age] years old, with [Your Gender] and [Your Name] (if applicable). I am the [Your Profession/Status] of [Your Profession/Status], and I am passionate about [Your Passion/Interest/Interest Involving You]. I am always looking to expand my horizons and learn more about the world around me, and I believe that everyone can learn to be happy and successful by following their dreams. I believe in the power of collaboration and communication, and I am always ready to learn new things and try new things. I hope that you can have a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

While I can't provide an exact, clear, and detailed factual statement about it, I can summarize it for you:

France's capital, the city of Paris, is a bustling metropolis located in the south of the country. 
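
Because `async_generate` is awaitable, several independent requests can be issued concurrently with `asyncio.gather`. The sketch below shows the pattern with a hypothetical `StubEngine` whose `async_generate` mimics the call shape used above; `generate_many` is likewise an illustrative name. Whether concurrent calls help in practice depends on the engine's own scheduling, which this snippet does not model.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine with an awaitable generate call."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <reply to {p!r}>"} for p in prompts]


async def generate_many(engine, batches, sampling_params):
    """Issue one async_generate call per batch and flatten the results."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    results = await asyncio.gather(*tasks)  # results follow the order of `batches`
    return [output["text"] for outputs in results for output in outputs]


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
texts = asyncio.run(generate_many(StubEngine(), batches, {"temperature": 0.8}))
print(texts)
```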

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a [Age] year old, [Occupation] person. I have always loved [Interest/Area of Interest]. And I am currently [Status], living in [City]. I am a [Hobby/Attitude]. I love [Reason for Interest/Attitude]. So, I am here to share my story with you. Let me know if you would like to meet me. [Type "yes" or "no"] Yes, please let me know. [Type your answer] Thanks! [Type your answer] Your message has been received. You have been invited to meet me. [Type your



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the most populous city in Europe and the largest city by area, with a population of over 1,000,000. The city is located on the left bank of the Seine River and is a major center for trade, politics, arts, culture, and industry. It is known for its historical significance, including the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral, as well as its modern fashion industry. Paris is also known for its international cuisine, including French cuisine, and its fashion industry, including haute couture and haute cuisine. The city is also



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain, and there are many different areas of research and development that are currently being explored. However, some potential trends in AI that are currently being discussed include:

1. Autonomous vehicles: Autonomous vehicles are becoming more and more common, with companies like Tesla and Uber investing heavily in the development of self-driving cars. There is also the possibility of AI being used to automate many jobs, which could have a significant impact on society.

2. Blockchain: Blockchain technology is a decentralized ledger that is used to secure and manage data. There is potential for AI to be used in the development of blockchain technology, which could have a wide range of applications
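
When the full text is needed in addition to live printing, the chunks from an async iterator can be accumulated as they arrive. This is a minimal sketch, assuming the stream yields plain string chunks as `async_stream_and_merge` does in the cell above. `fake_stream` and `print_and_collect` are hypothetical names, and `fake_stream` replaces the real engine so the code runs on its own.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(tokens: List[str]) -> AsyncIterator[str]:
    """Hypothetical token stream standing in for a real engine stream."""
    for token in tokens:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield token


async def print_and_collect(stream: AsyncIterator[str]) -> str:
    """Print each chunk as it arrives and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(print_and_collect(fake_stream([" Paris", ".", " It", " is"])))
```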




In [6]:
llm.shutdown()
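
To make sure `shutdown()` runs even when generation raises, the engine can be wrapped in a small context manager. This is a sketch, not an SGLang API: `engine_session` and `StubEngine` are hypothetical, and the only assumption is that the engine exposes `shutdown()` as in the cell above.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
with engine_session(engine) as llm:
    outputs = llm.generate(["Hello"], {"temperature": 0.8})
print(engine.closed)
```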