# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other code that already runs an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
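To see why the patch is needed: standard `asyncio` refuses to start a second event loop from inside a running one, which is the situation in IPython and Jupyter. The sketch below uses only the standard library (the function names are illustrative and not part of SGLang) and reproduces the error that `nest_asyncio.apply()` is meant to avoid.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # A loop is already running here, just as inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempt to start a nested event loop
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

In a plain Python script there is no outer loop, so `asyncio.run(main())` works there without the patch.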

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Mike. I'm 15 years old and I'm a student at Xingguo Secondary School. I have a lot of friends and I like to spend my free time with them. I like to talk to my friends and to help them. I like to play computer games and I like to watch TV. I like to have fun and I like to play games. I have a lot of family and a lot of friends, but I don't have any pets. I have a brother and a sister, but we don't play together very often. I like to eat vegetables, but I like to eat meat. I like
Prompt: The president of the United States is
Generated text:  a prominent figure in the country. President Obama, for instance, is a very influential figure in the United States. Which of the following statements is true?
A. The president is the chief executive of the government
B. The president has the power of appointment and removal
C. The president has the power to regulate private business
D. The president has the power to make laws
Answer:
A

According to the mat

### Streaming Synchronous Generation
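The `stream_and_merge` helper imported earlier consumes a streaming call and returns one merged string. A minimal sketch of the idea, assuming chunks may repeat text that has already been received: each chunk is trimmed of any prefix that overlaps the end of the accumulated text before it is appended. `trim_overlap`, `merge_stream`, and `fake_stream` are illustrative names for this sketch, not SGLang APIs, and the real helper may differ.

```python
from typing import Iterable, Iterator


def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    max_overlap = min(len(existing), len(chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_stream(chunks: Iterable[str]) -> str:
    """Accumulate chunks into one string, skipping repeated text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


def fake_stream() -> Iterator[str]:
    # Hypothetical stand-in for a streaming generate call;
    # the second chunk repeats " Paris" from the first.
    yield " Paris"
    yield " Paris is the"
    yield " capital."


print(merge_stream(fake_stream()))
```

One caveat of this heuristic: a chunk that legitimately begins with text equal to the current ending would also be trimmed.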

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and I love to [job title] with my [job title] skills. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I love [favorite hobby or activity]. I'm always looking for new experiences and adventures, and I'm always eager to try new things. What's your favorite book

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also the birthplace of French literature and a major center for art, music, and film. Paris is a cultural and economic hub with a rich history dating back to the Roman Empire and the French Revolution. The city is also home to many famous museums, including the Louvre and the Musée d'Orsay. Paris is a popular tourist destination and a major economic center in Europe. It is known for its vibrant nightlife, delicious cuisine, and diverse cultural scene. The city is also

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes. This could lead to more sophisticated and adaptive AI systems that can learn from feedback and improve their performance over time.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more stringent regulations and guidelines for AI development and deployment, as well as increased scrutiny of AI systems that are designed to harm or mislead humans.

3. Increased



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Job Title] at [Company Name]. I am a [Brief Introduction] and I enjoy [What I Do]. Currently, I am [Current Position] in [Industry]. Here are some things about myself:

- I am [Age], and I live in [City, State]. 
- I enjoy [Favorite Activity], [Favorite Food], [Favorite Book], or [Favorite Music].
- I have [Number of Pets], and [Favorite Pet]. I am [Number] years old.
- I am [Religion], and I follow [Name of Faith].
- I am [Country],

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city where the Eiffel Tower and the Louvre Museum are located. Paris has a rich cultural heritage and is the birthplace of many famous figures such as Napoleon, Shakespeare, and Michelangelo. The city is known for its charming architecture, vibrant nightlife, and annual festivals and events. French peop

### Streaming Asynchronous Generation
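Asynchronous streaming consumes chunks with `async for`. The sketch below assumes an async generator that yields text chunks; `fake_async_stream`, `collect`, and `run_all` are hypothetical stand-ins so the example runs without a model, and they are not SGLang APIs. It also shows one way to interleave several streams with `asyncio.gather` rather than finishing one prompt before starting the next.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_async_stream(prompt: str) -> AsyncIterator[str]:
    # Hypothetical stand-in for a streaming engine call.
    for piece in (" one", " two", " three"):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def collect(prompt: str) -> str:
    """Consume one stream and return the prompt plus all chunks."""
    parts: List[str] = []
    async for piece in fake_async_stream(prompt):
        parts.append(piece)
    return prompt + "".join(parts)


async def run_all(prompts: List[str]) -> List[str]:
    # gather() runs the collectors concurrently and keeps input order.
    return await asyncio.gather(*(collect(p) for p in prompts))


results = asyncio.run(run_all(["A:", "B:"]))
print(results)
```

Whether concurrent streams help with a real engine depends on how it schedules requests, so treat this as a pattern to try rather than a guaranteed speedup.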

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [occupation] who has dedicated myself to [achievement or passion]. I'm excited to meet you and learn more about you, and to share my experience and knowledge with you.

I hope you enjoy this introduction and look forward to hearing more about our shared interests and experiences. Let's make this a great first encounter!

What is the short, neutral self-introduction? The short, neutral self-introduction is a friendly and concise introduction to someone's identity, experiences, and interests. It typically includes their name, occupation or profession, and any notable achievements or passions. The neutral tone of the introduction suggests that

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in France by population and the second largest in Europe, after Berlin. Paris is the cultural, political, and economic heart of the country, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It has a rich history dating back over 1,000 years and is a major cultural, economic, and political hub in Europe. Paris is also home to many famous museums, landmarks, and cultural institutions. It is a major transportation hub and a center for business, education, and entertainment. Its climate is temperate, with mild winters

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  set to be exciting and diverse, with many different trends and technologies emerging that will shape the direction of the field. Here are some possible future trends in AI:

1. Self-driving cars: Self-driving cars are already on the road, and they are expected to become even more common in the future. Autonomous vehicles will be able to navigate through complex traffic conditions and make decisions on the fly.

2. Artificial general intelligence: This is the ultimate goal of AI, where an AI system can perform any task that a human can do. It's still a long way off, but researchers are working on developing systems that can do tasks that are currently

```python
llm.shutdown()
```
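
In a script, as opposed to a notebook, it can help to guarantee that shutdown runs even when generation fails. The sketch below shows the `try`/`finally` pattern with a hypothetical `StubEngine` class standing in for the real engine; only the existence of a `shutdown()` method is taken from this document, and everything else is an assumption for illustration.

```python
class StubEngine:
    """Hypothetical stand-in for an engine with generate() and shutdown()."""

    def __init__(self) -> None:
        self.closed = False

    def generate(self, prompt: str) -> dict:
        if not prompt:
            raise ValueError("empty prompt")
        return {"text": prompt.upper()}

    def shutdown(self) -> None:
        self.closed = True


engine = StubEngine()
try:
    engine.generate("")  # raises, simulating a failure mid-run
except ValueError as exc:
    print(f"generation failed: {exc}")
finally:
    engine.shutdown()  # always runs, so resources are released

print(engine.closed)
```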