# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, for use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
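The linked `custom_server.py` is the reference example. As a framework-agnostic illustration of what such a server does at its core, here is a minimal sketch of the request-handling step: parse a JSON body, call the engine, and serialize the result. `FakeEngine` and `handle_generate` are hypothetical names introduced here; `FakeEngine` only mimics the shape of `llm.generate(prompt, sampling_params)` returning a dict with a `"text"` key, as used later in this document.

```python
import json


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that returns a canned completion."""

    def generate(self, prompt, sampling_params):
        temperature = sampling_params.get("temperature", 1.0)
        return {"text": f" [completion of {prompt!r} at T={temperature}]"}


def handle_generate(body: bytes, engine) -> tuple[int, bytes]:
    """Parse a JSON request, call the engine, and return (HTTP status, JSON body)."""
    try:
        request = json.loads(body)
        prompt = request["prompt"]
    except (ValueError, KeyError, TypeError):
        # Invalid JSON, missing "prompt", or a non-object payload
        error = {"error": "expected a JSON object with a 'prompt' field"}
        return 400, json.dumps(error).encode()
    sampling_params = request.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return 200, json.dumps({"text": output["text"]}).encode()


status, payload = handle_generate(
    b'{"prompt": "Hi", "sampling_params": {"temperature": 0.5}}', FakeEngine()
)
print(status, payload.decode())
```

Any HTTP framework can call a function like this from its route handler. In an async server, the asynchronous engine call (`llm.async_generate`) is usually the better fit, so a slow generation does not block the event loop.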



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other environment where an asyncio event loop is already running, you need to add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
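The snippet below uses only the standard library (no SGLang) to reproduce the underlying limitation: asyncio refuses to start a new event loop from inside one that is already running, which is exactly the situation inside an IPython or Jupyter cell. `nest_asyncio.apply()` patches asyncio so that such nested use is allowed. The function names here are illustrative.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # An event loop is already running here, just like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Run as a plain script without `nest_asyncio`, this prints a `RuntimeError` message saying that `asyncio.run()` cannot be called from a running event loop.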

## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch  # CI-specific setup
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="Qwen/Qwen2.5-0.5B-Instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

```text
Prompt: Hello, my name is
Generated text:  Sarah. I am twelve years old. I like to go to the zoo. I like the monkey because it is very smart. I like to watch the lion because he is very strong and fearsome. I like the elephants because they are so friendly and beautiful. I like the giraffes because they are very tall and they are so cute. I like the koalas and they are so cute and friendly. The elephants and the koalas are my favorite animals. I like the lion because he is very strong and scary. I like the giraffes because they are so tall and cute. I like the koalas
Prompt: The president of the United States is
Generated text:  trying to decide whether to launch a new war. The president has 1000 Americans vote on the matter. Each vote is a separate one, and the president wants to know which vote was chosen by the majority. If the president decides to launch the war, and it rains, he will have to send 500 extra soldiers to the war zone. Otherwise, he will have to reduce the number of s
```
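The `sampling_params` dict above sets `temperature` and `top_p`. To make their effect concrete, here is a pure-Python sketch of the standard technique: temperature scaling followed by nucleus (top-p) filtering over a toy logit vector. It is an illustration, not SGLang's sampler, which runs over the full vocabulary on the GPU and may differ in details such as tie handling. It also does not handle `temperature=0` (greedy decoding), which real samplers typically special-case.

```python
import math
import random


def top_p_filter(logits, temperature=1.0, top_p=1.0):
    """Return {token_id: probability} after temperature scaling and top-p truncation."""
    # Temperature scaling: dividing by T < 1 sharpens the distribution, T > 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus filtering: keep the smallest set of most likely tokens
    # whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize over the surviving tokens.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


def sample(logits, temperature, top_p, rng):
    """Draw one token id from the filtered distribution."""
    dist = top_p_filter(logits, temperature, top_p)
    token_ids = list(dist)
    return rng.choices(token_ids, weights=[dist[t] for t in token_ids], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a 4-token vocabulary
print(top_p_filter(logits, temperature=0.8, top_p=0.95))  # keeps 3 tokens
print(top_p_filter(logits, temperature=0.2, top_p=0.9))   # keeps only token 0
print(sample(logits, 0.8, 0.95, random.Random(0)))
```

With the lower temperature used in the streaming example below (`0.2`), almost all probability mass collapses onto the top token, which is why low-temperature outputs tend to be more repetitive and predictable.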

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```

```text
=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [occupation] with [number] years of experience in [field]. I am a [type of person] who is always [positive trait]. I am [gender] and I am [age]. I am [occupation] and I am [number] years old. I am [gender] and I am [age]. I am [occupation] and I am [number] years old. I am [gender] and I am [age]. I am [occupation] and I am [number] years old. I am [gender] and I am [age]. I am [occupation] and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a bustling city with a rich cultural heritage and is a popular tourist destination. The city is also known for its cuisine, including French cuisine, which is renowned for its rich flavors and use of fresh ingredients. Paris is a city that is both a cultural and historical center of France. It is home to many important historical sites and landmarks, including the Louvre, the Notre-Dame Cathedral

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing it to learn and adapt to new situations. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human needs.

2. Greater use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI becomes more advanced, it is likely to
```
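`stream_and_merge` is imported from `sglang.utils`, and the printed header describes its job as "overlap removal": stitching streamed chunks into one string while dropping text that repeats across chunk boundaries. The sketch below illustrates that idea under the assumption that a chunk may begin with a suffix of the text already received; `trim_overlap` and `merge_stream` are illustrative names, and the library's exact implementation may differ.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is also a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first so the largest match wins.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks) -> str:
    """Concatenate streamed chunks, removing overlap at each boundary."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = ["The capital", "capital of France", " is Paris."]
print(merge_stream(chunks))
```

Because the match is character-based rather than token-based, this heuristic can also drop text that legitimately repeats across a boundary: `trim_overlap("ha", "ha")` returns `""` even if the model really produced `"haha"`.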



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```

```text
=== Testing asynchronous batch generation ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [Your profession or area of expertise] [Skill level], and I have been [Time since joining the company]. I am available to assist with [Your main job role or service, if applicable]. Any questions, comments, or feedback are welcomed. I am looking forward to the opportunity to work with you. [Name] looks forward to working with you. [Name] is looking forward to working with you. [Name] is looking forward to working with you. Hello, my name is [Name], and I am a [Your profession or area of expertise] [Skill level], and I have

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in the country and one of the most populous, with over 10 million people residing in the city.
The answer is: Paris is the capital city of France. It is the largest city and one of the most popu
```
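The cell above awaits `llm.async_generate` once with the whole batch. When independent requests arrive separately (for example in a custom server), a common asyncio pattern is to launch them concurrently with `asyncio.gather`. The sketch uses a hypothetical `fake_async_generate` stand-in so it runs without a model; with SGLang you would, as an untested assumption, await `llm.async_generate(prompt, sampling_params)` in its place.

```python
import asyncio


async def fake_async_generate(prompt: str, sampling_params: dict) -> dict:
    # Hypothetical stand-in for an engine call; the sleep mimics generation latency.
    await asyncio.sleep(0.01)
    return {"text": f" <completion for {prompt!r}>"}


async def generate_all(prompts, sampling_params):
    tasks = [fake_async_generate(p, sampling_params) for p in prompts]
    # gather runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*tasks)


prompts = ["Hello, my name is", "The capital of France is"]
outputs = asyncio.run(generate_all(prompts, {"temperature": 0.8, "top_p": 0.95}))
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])
```

Because `gather` preserves input order, results can be zipped back to their prompts exactly as in the batch version.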

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # async_stream_and_merge (from sglang.utils) streams the output and yields
        # each chunk with any overlap with the previously received text removed
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


```text
=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [character type] who has a deep passion for [character's hobby or interest]. I have [number of years experience in] and have always loved creating unique, original content for various platforms. As a member of the [community or group] I've made it my mission to connect with like-minded individuals, fostering a sense of community and helping others on their journey to achieve their goals. I'm always looking to learn and grow, and I'm always eager to share my knowledge and experiences with those I meet. I believe in the power of collaboration, and I'm always willing to contribute to the success

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris, the city of love, has a rich history, with significant contributions to French, French-Romanian, and world culture. It is home to many iconic landmarks, including the Eiffel Tower, Notre Dame Cathedral, Louvre Museum, and the Louvre Academy of Fine Arts. Paris is also known for its vibrant cultural scene and has produced numerous notable figures in the arts and sciences. Despite its commercial center, the city retains many historical streets and parks, and is an important hub for transportation, finance, and media in France. Paris is one of the most visited cities in the world, attracting millions of visitors annually

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  a rapidly evolving field with potential to transform many aspects of our lives. Here are some potential future trends in AI:

1. Self-learning: The ability of AI systems to learn from feedback and improve their performance over time is an important aspect of AI research. This could lead to more autonomous and efficient systems that can learn from human feedback without explicit instruction.

2. AI ethics: As AI systems become more integrated into our lives, there will be increasing pressure to consider how they should be used ethically. This could lead to new ethical standards and guidelines for AI development and deployment.

3. AI for humans: AI could potentially be used for human
```
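The cell above streams one prompt after another. To stream several prompts at once and still handle each chunk as soon as it arrives, one option is to drive each async iterator in its own task and funnel tagged chunks through an `asyncio.Queue`. In this sketch `fake_stream` is a hypothetical stand-in for a streaming call; swapping in `async_stream_and_merge(llm, prompt, sampling_params)` is an untested assumption, and how well the engine interleaves the requests depends on its scheduler.

```python
import asyncio


async def fake_stream(prompt: str):
    # Hypothetical stand-in for a streaming engine call that yields text pieces.
    for piece in (" alpha", " beta", " gamma"):
        await asyncio.sleep(0)
        yield piece


async def pump(idx: int, prompt: str, queue: asyncio.Queue) -> None:
    async for piece in fake_stream(prompt):
        await queue.put((idx, piece))
    await queue.put((idx, None))  # sentinel: this stream is finished


async def stream_many(prompts):
    queue = asyncio.Queue()
    buffers = ["" for _ in prompts]
    tasks = [asyncio.create_task(pump(i, p, queue)) for i, p in enumerate(prompts)]
    remaining = len(prompts)
    while remaining:
        idx, piece = await queue.get()
        if piece is None:
            remaining -= 1
        else:
            buffers[idx] += piece  # a UI would print or forward the piece here
    await asyncio.gather(*tasks)  # surface any exception raised inside a task
    return buffers


results = asyncio.run(stream_many(["Prompt A", "Prompt B"]))
print(results)
```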




```python
llm.shutdown()
```
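In a standalone script, an exception between launching the engine and the final `llm.shutdown()` call would skip the shutdown. A minimal sketch of guarding against that with `try`/`finally`, wrapped in a `contextlib.contextmanager`, is shown below. `FakeEngine` and `managed_engine` are hypothetical names; with SGLang you would pass something like `lambda: sgl.Engine(model_path=...)` as the factory.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown() ran."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(factory):
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()  # runs even if the body of the with-block raises


engine_ref = None
try:
    with managed_engine(FakeEngine) as llm:
        engine_ref = llm
        raise RuntimeError("simulated failure mid-batch")
except RuntimeError:
    pass

print(engine_ref.closed)
```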