# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
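As a rough illustration of the custom-server idea, the sketch below wraps an engine in a framework-agnostic request handler. The name `handle_generate`, the JSON request shape, and the default sampling parameters are hypothetical. The code also assumes that `async_generate` returns a dict with a `text` key when given a single prompt. The linked script remains the authoritative example.

```python
import json

# Hypothetical defaults, used when the request does not supply its own.
DEFAULT_SAMPLING_PARAMS = {"temperature": 0.8, "top_p": 0.95}


async def handle_generate(engine, raw_body: str) -> str:
    """Hypothetical handler: JSON request in, JSON response out.

    Assumes `engine.async_generate(prompt, sampling_params)` returns a dict
    with a "text" key for a single prompt.
    """
    body = json.loads(raw_body)
    sampling_params = body.get("sampling_params", DEFAULT_SAMPLING_PARAMS)
    output = await engine.async_generate(body["prompt"], sampling_params)
    return json.dumps({"text": output["text"]})
```

A web framework route would then only need to read the request body, await `handle_generate(llm, body)`, and return the resulting string.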



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other environment that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
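To see why this is needed, the stdlib-only sketch below reproduces the same kind of restriction: asyncio refuses to start a new blocking run while an event loop is already running, which is the situation inside IPython or Jupyter. It does not use SGLang, and `inner` and `outer` are illustrative names.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, so a nested
    # asyncio.run() call is rejected with RuntimeError.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches asyncio so that nested calls like this one are permitted.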

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    # Allow nested event loops (needed in IPython/Jupyter)
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.11it/s]



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Fang Zhiqiang. I am an English teacher from Beijing. In this passage, the underlined word “diligently” means ________. A. greedily B. stubbornly C. diligently D. lazily
Answer:

C

According to the passage, what is the main reason why students are afraid of exams?
A. Exams are difficult.
B. Exams are too short.
C. The exam time is too close.
D. The exam results are too harsh.
Answer:

D

Please answer the following question based on the context provided in the passage. The correct answer is ____.
A. He wasn
Prompt: The president of the United States is
Generated text:  32 years older than the president of Brazil. The president of Brazil is 2/3 the age of the president of France. If the president of the United States is currently 46 years old, what is the difference in age between the president of the United States and the president of Brazil?
To find the difference in age between the president of the United States and the president of Brazil, 
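
When the prompt list is very large, one option is to submit it in smaller slices and collect the results in order. The sketch below is a minimal illustration of that pattern. `chunked` and `generate_in_chunks` are hypothetical helpers, not part of SGLang, and they assume `engine.generate(list_of_prompts, sampling_params)` returns one dict with a `text` key per prompt, in input order, as in the cell above. Whether slicing helps depends on the workload, since the engine already schedules a batch internally.

```python
from typing import Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start : start + size]


def generate_in_chunks(
    engine, prompts: List[str], sampling_params: Dict, size: int = 2
) -> List[Dict]:
    """Call engine.generate once per slice and keep results in prompt order."""
    results: List[Dict] = []
    for batch in chunked(prompts, size):
        outputs = engine.generate(batch, sampling_params)
        results.extend(
            {"prompt": prompt, "text": output["text"]}
            for prompt, output in zip(batch, outputs)
        )
    return results
```

With the engine from this document, a call might look like `generate_in_chunks(llm, prompts, sampling_params, size=2)`.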

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [Age] year old [Occupation]. I have always been passionate about [Your passion or interest]. I am always looking for new challenges and opportunities to grow and learn. I am a [Your profession or hobby]. I am always eager to learn and improve myself. I am a [Your personality or character trait]. I am a [Your favorite hobby or activity]. I am a [Your favorite book, movie, or TV show]. I am a [Your favorite person or place]. I am a [Your favorite hobby or activity]. I am a [Your favorite book, movie, or TV show

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower and its rich history dating back to the Middle Ages. It is also home to the Louvre Museum, the most famous art museum in the world, and the Notre-Dame Cathedral, which is considered one of the most beautiful in the world. Paris is a bustling city with a diverse population and is known for its fashion, food, and music scenes. It is also a popular tourist destination and a major economic center in Europe. Paris is a city that has been a center of culture and politics for centuries and continues to be a major influence on the world today. 

The city

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased automation and artificial intelligence: As AI technology continues to advance, we can expect to see more automation and artificial intelligence in various industries. This could lead to increased efficiency, productivity, and cost savings for businesses and individuals.

2. Enhanced privacy and security: As AI technology becomes more advanced, we can expect to see increased concerns about privacy and security. This could lead to new regulations and standards to protect people's data and prevent cyber attacks.

3. AI-powered healthcare:
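
The "overlap removal" mentioned in the cell above can be pictured as follows: before a new chunk is appended, any prefix of it that already appears at the end of the accumulated text is dropped. The sketch below illustrates that idea in plain Python. `trim_overlap` and `merge_chunks` are illustrative names, and this is not claimed to be how `stream_and_merge` is implemented.

```python
from typing import Iterable


def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing overlap between neighbours."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["The capital", "capital of", " France"]))
```

One caveat of this heuristic is that a genuine repetition in the generated text (for example `"ha"` followed by `"ha"`) is indistinguishable from overlap and would be trimmed.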



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm [Job/Position]. I'm here to share my experiences and accomplishments, and also to find answers to your questions. Are there any particular industries or projects that you are passionate about? I'm always eager to learn new things and contribute to the world. Feel free to ask me anything you'd like to know. [End of Self-Introduction] The title of the work is "Career Assessment." I am looking forward to discussing my experience with you and learning more about your career journey. How can I assist you today? Please share your goals and how you plan to achieve them. I'll do my best

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city where the Eiffel Tower stands and where the French people celebrate their independence day, Bastille Day. It is the largest city in France by population a
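
The asynchronous API also makes it possible to issue prompts as independent requests and bound how many are in flight at once. The sketch below shows one way to do this with `asyncio.Semaphore` and `asyncio.gather`. `generate_concurrently` and `max_concurrency` are hypothetical names, and the code assumes `engine.async_generate(prompt, sampling_params)` returns a dict for a single prompt.

```python
import asyncio
from typing import Dict, List


async def generate_concurrently(
    engine, prompts: List[str], sampling_params: Dict, max_concurrency: int = 2
) -> List[Dict]:
    """Run one async_generate call per prompt, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def generate_one(prompt: str) -> Dict:
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(generate_one(p) for p in prompts))
```

Passing the whole list to a single `async_generate` call, as in the cell above, is simpler. Per-prompt requests are mainly useful when prompts arrive at different times or need different sampling parameters.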

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [job title/role] with over [number of years] years of experience in [specific area of interest]. I am passionate about [reason for interest], and I believe that my experience in [area of interest] has allowed me to develop [specific skill or trait]. I am confident in my ability to [specific skill or trait], and I am looking forward to contributing to [specific area of interest] with my [specific contribution]. Thank you for considering me for the role. [Name] knows that their self-introduction is neutral and doesn't express any particular bias or agenda. It's an



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a historic city with a rich history and a vibrant culture. It is known for its museums, art galleries, and opera house. Paris is also famous for its landmarks such as the Eiffel Tower and Notre-Dame Cathedral. It is a popular tourist destination and a major economic center in Europe. The city has a diverse population of over 2 million people and is home to many international businesses and organizations. The city is a symbol of France and a popular destination for tourists and locals alike. Paris is a city of contrasts, and its unique blend of historic and modern elements continues to make it a unique and beloved city



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be marked by rapid advancements in several key areas:

1. **Increased Integration with Human Intelligence**: AI systems are becoming more integrated with human intelligence. This integration could lead to more complex and intelligent decision-making. For example, AI could understand human emotions, learn from past experiences, and adapt to new situations, potentially leading to more人性化(intelligent) decisions.

2. **Neural Networks and Deep Learning**: These are the core technologies driving the development of highly capable AI systems. Neural networks are trained using large amounts of data to learn complex patterns and relationships, making them powerful tools in many areas. Deep learning is a particularly promising area
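
If a stream delivers cumulative text (each chunk repeating everything generated so far), printing chunks directly would show repeats. The sketch below converts such a stream into increments. It rests on an assumption that is not confirmed by this document: that each chunk is a dict whose `text` value is cumulative. `iter_deltas` is a hypothetical helper and is not claimed to match how `async_stream_and_merge` works.

```python
import asyncio
from typing import AsyncIterator, Dict


async def iter_deltas(chunks: AsyncIterator[Dict]) -> AsyncIterator[str]:
    """Yield only the newly added text from chunks whose "text" is assumed cumulative."""
    seen = ""
    async for chunk in chunks:
        text = chunk["text"]
        if text.startswith(seen):
            delta = text[len(seen):]  # cumulative chunk: keep only the new suffix
        else:
            delta = text  # not cumulative after all: pass the chunk through
        seen += delta
        if delta:
            yield delta
```

A consumer would iterate with `async for piece in iter_deltas(stream): print(piece, end="", flush=True)`, where `stream` is whatever async iterator of chunks the engine provides.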




```python
llm.shutdown()
```
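
In a script, it can be convenient to guarantee that `shutdown()` runs even when generation raises. The sketch below wraps engine creation in a context manager. `engine_session` and its `factory` argument are hypothetical and not part of SGLang. The only engine method it relies on is `shutdown()`, which is used above.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(factory):
    """Create an engine via factory() and always call shutdown() on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        # Runs on normal exit and when the body raises.
        engine.shutdown()


# Possible usage with the engine from this document:
# with engine_session(lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")) as llm:
#     outputs = llm.generate(prompts, sampling_params)
```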