# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, for use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation
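
These four modes map onto four ordinary Python call shapes: a function, a generator, a coroutine, and an async generator. The toy sketch below uses a hypothetical `fake_tokens` helper in place of a model. None of its names are SGLang APIs; it only illustrates how the caller consumes each mode.

```python
import asyncio

# Hypothetical stand-in for a model: returns a fixed token sequence.
def fake_tokens(prompt):
    return [prompt, " ->", " a", " b"]

def generate(prompt):                      # non-streaming, synchronous
    return "".join(fake_tokens(prompt))

def generate_stream(prompt):               # streaming, synchronous
    for token in fake_tokens(prompt):
        yield token

async def async_generate(prompt):          # non-streaming, asynchronous
    await asyncio.sleep(0)                 # hand control back to the event loop
    return "".join(fake_tokens(prompt))

async def async_generate_stream(prompt):   # streaming, asynchronous
    for token in fake_tokens(prompt):
        await asyncio.sleep(0)
        yield token

async def collect(prompt):
    whole = await async_generate(prompt)
    pieces = [tok async for tok in async_generate_stream(prompt)]
    return whole, pieces

sync_whole = generate("x")
sync_pieces = list(generate_stream("x"))
async_whole, async_pieces = asyncio.run(collect("x"))
print(sync_whole, sync_pieces, async_whole, async_pieces)
```

The streaming variants hand back pieces as they become available, while the non-streaming variants return only once the full text exists.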

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
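
As a rough illustration of that pattern (not the linked script), a custom server boils down to a request handler that forwards prompts to the engine and serializes the result. The sketch below is framework-free. `StubEngine` and `handle_request` are hypothetical names, and the stub only mimics the `async_generate(prompts, sampling_params)` call and the `output['text']` field used later in this document.

```python
import asyncio
import json

class StubEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # pretend to do some work
        return [{"text": f" (completion for {p!r})"} for p in prompts]

async def handle_request(engine, body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(body)
    prompts = request["prompts"]
    sampling_params = request.get("sampling_params", {})
    outputs = await engine.async_generate(prompts, sampling_params)
    return json.dumps({"texts": [o["text"] for o in outputs]}).encode()

raw = json.dumps(
    {"prompts": ["Hello"], "sampling_params": {"temperature": 0.8}}
).encode()
response = json.loads(asyncio.run(handle_request(StubEngine(), raw)))
print(response)
```

A real server would register a handler like this with whichever web framework it uses and pass in an actual engine instance instead of the stub.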



## Nest Asyncio
To use the **Offline Engine** in IPython or in other code that already runs a nested event loop, add the following first:
```python
import nest_asyncio

nest_asyncio.apply()

```
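
The underlying issue is that standard asyncio refuses to start a second event loop from inside one that is already running, which is typically the situation in notebook kernels. A small standard-library sketch of the failure that `nest_asyncio.apply()` works around (no SGLang involved):

```python
import asyncio

async def inner():
    return "done"

async def outer():
    coro = inner()
    try:
        # Calling asyncio.run() while a loop is already running is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"

message = asyncio.run(outer())
print(message)
```

Here `outer` plays the role of the already-running loop, and the nested `asyncio.run` call fails with a `RuntimeError`.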

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.82it/s]



### Non-streaming Synchronous Generation
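
In this mode a single `llm.generate` call takes a list of prompts and returns one result per prompt, each exposing a `text` field, as the cell below shows. Offline jobs usually persist such results. A minimal sketch of that step, assuming outputs of that shape: the `outputs` list is hand-written placeholder data rather than real model output, and `to_jsonl` is a hypothetical helper.

```python
import io
import json

def to_jsonl(prompts, outputs, stream):
    """Write one JSON record per prompt/completion pair."""
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")

prompts = ["Hello, my name is", "The capital of France is"]
outputs = [{"text": " <placeholder 1>"}, {"text": " <placeholder 2>"}]

buffer = io.StringIO()  # swap for open("results.jsonl", "w") to write a file
to_jsonl(prompts, outputs, buffer)
lines = buffer.getvalue().splitlines()
print(lines)
```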

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Alex and I'm a self-taught programmer. I'm not a teacher or a writer, but I'm passionate about technology and want to share my knowledge with you.
I'd like to share my story with you about my journey to becoming a successful programmer. I was born in the year 2000 in a small town in the United States and I am currently 26 years old. I first started programming in 2010 when I enrolled in a programming bootcamp and have continued to learn and improve my skills since then.
My journey to becoming a successful programmer has been challenging, but it has also been
Prompt: The president of the United States is
Generated text:  a citizen of ______.
A. America
B. France
C. Germany
D. Japan
Answer:
A

The main reason for the impact of the welfare state on the economy is the 'crowding out' effect. A. Correct B. Incorrect
Answer:
A

During the maturity period of the industry, the most important factor influencing the market price of the stock is ____.
A. 

### Streaming Synchronous Generation
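
The cell below relies on the `stream_and_merge` helper imported from `sglang.utils` and describes its output as having overlap removed. The idea, as a self-contained sketch: when successive chunks can repeat text that has already been emitted, keep only the new suffix. `trim_overlap` and `merge_chunks` are illustrative names and make no claim to match the helper's actual implementation; the chunk list is made-up data.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk

def merge_chunks(chunks):
    """Concatenate chunks while removing repeated overlap between them."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged

chunks = ["Paris is", " is the capital", " capital of France."]
merged_text = merge_chunks(chunks)
print(merged_text)
```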

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your job or experience here]. I enjoy [insert a short description of your hobbies or interests here]. What's your favorite hobby or activity? I love [insert a short description of your favorite activity here]. What's your favorite book or movie? I love [insert a short description of your favorite book or movie here]. What's your favorite color? I love [insert a short description of your favorite color here].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French National Library. Paris is a bustling city with a rich cultural heritage and is a major tourist destination. The city is known for its fashion industry, art scene, and food culture. It is a popular destination for tourists and locals alike, with many attractions and events throughout the year. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly. The city is also known for its diverse population, with many different ethnic groups living in its

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence, allowing it to learn and adapt in ways that are difficult for humans to do. This could lead to more efficient and effective decision-making, as well as improved problem-solving.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations, including issues such as bias, privacy, and accountability. This will require a more rigorous approach to AI development and deployment, with greater attention to the potential impact of AI on society.





### Non-streaming Asynchronous Generation
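
The asynchronous API is useful when generation has to share an event loop with other work. A toy sketch of that benefit, using a hypothetical `fake_async_generate` coroutine (it only sleeps) rather than the engine:

```python
import asyncio

async def fake_async_generate(prompt, delay=0.05):
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(delay)
    return {"text": f" reply to {prompt!r}"}

async def run_concurrently(prompts):
    # gather() schedules all coroutines at once and preserves input order.
    return await asyncio.gather(*(fake_async_generate(p) for p in prompts))

results = asyncio.run(run_concurrently(["a", "b", "c"]))
texts = [r["text"] for r in results]
print(texts)
```

The three simulated calls wait concurrently instead of one after another, and the results come back in the same order as the prompts.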

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [character name] who specializes in [describe role or expertise]. I am a [level] wizard, with a reputation for being [mention a positive quality or an accomplishment]. I am skilled in the art of [mention something about the character's skill set, such as [equipment, spellcasting, combat, etc.]], and have a keen understanding of [mention a particular subject, such as [magic, healing, etc.]].
My journey has been one of [mention a specific goal or quest you've completed], and I have always been motivated by my desire to [mention a positive emotion or goal

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. The city is known for its rich history, beautiful architecture, and beautiful beaches. It is the largest city in Europe by population and the tenth most populous city in the world. Paris 

### Streaming Asynchronous Generation
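
Asynchronous streaming is consumed with `async for`, printing each piece as it arrives. A sketch of that consumption pattern, with a hypothetical async generator `fake_async_stream` in place of the engine. The cell below uses `async_stream_and_merge` instead.

```python
import asyncio

async def fake_async_stream(prompt):
    """Hypothetical stand-in that yields a completion piece by piece."""
    for piece in [" Paris", ".", " It", " is", " in", " France", "."]:
        await asyncio.sleep(0)  # let other tasks run between pieces
        yield piece

async def stream_to_string(prompt):
    received = []
    async for piece in fake_async_stream(prompt):
        print(piece, end="", flush=True)  # show text incrementally
        received.append(piece)
    print()
    return "".join(received)

full_text = asyncio.run(stream_to_string("The capital of France is"))
```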

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Character Name]. I'm a [Character Occupation] who has been [Character Occupation] for [Number of Years] years. I'm a bit of a loner who enjoys exploring new places and trying new foods. I'm always looking for adventure, whether it's in the outdoors or in the kitchen. I have a passion for creating new recipes and sharing them with others. I'm always looking for ways to make my experiences meaningful and memorable, and I'm always ready to learn something new. I'm a bit of a perfectionist, but also a bit of a doer, and I enjoy getting things done. I love to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The statement is: "Paris is the capital of France." This is a factual statement about the capital city of France. It provides a clear and straightforward description of the capital's location, which is that it is the seat of government and the largest city in France. The other options mentioned, such as the names of specific buildings or landmarks, are not part of the factual statement. Therefore, the statement itself is the correct answer to the question. As a result, it is the only factual statement that accurately describes the capital of France. The other options do not provide factual information about the capital city. They may, however, be

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain, but here are some possible trends that could shape the technology's direction:

1. Autonomous vehicles: With advancements in artificial intelligence, autonomous vehicles will become more prevalent in the future. This could lead to a reduction in traffic congestion, improved safety, and increased efficiency in transportation.

2. Personalized AI: AI will become more personalized as it learns from user data and behavior. This could result in more efficient healthcare, education, and job placement.

3. AI in manufacturing: AI will be used to improve manufacturing processes, reduce costs, and increase productivity. This could lead to more efficient production, lower energy consumption, and improved quality


In [6]:
llm.shutdown()
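
The engine stays alive until `shutdown()` is called, so it can help to guarantee that call even when generation raises. A sketch of one way to do that with a context manager. `StubEngine` and `managed_engine` are hypothetical, and the stub only mimics the `shutdown()` method used above.

```python
from contextlib import contextmanager

class StubEngine:
    """Hypothetical stand-in exposing only generate() and shutdown()."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True

@contextmanager
def managed_engine(engine):
    """Yield the engine and always shut it down afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()

engine = StubEngine()
try:
    with managed_engine(engine) as llm:
        llm.generate(["Hello"], {"temperature": 0.8})
except RuntimeError as exc:
    print(f"generation failed: {exc}")

print(engine.is_shut_down)
```

The same wrapper could be applied to a real engine instance. A plain `try`/`finally` around the generation calls achieves the same effect.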