# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython, Jupyter, or any other code that already runs an event loop (a nested event loop), you need to add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
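
To see why the patch is needed, the following minimal sketch (standard library only; `inner` and `outer` are illustrative names, not SGLang APIs) reproduces the underlying restriction: `asyncio.run()` refuses to start while another event loop is already running in the same thread, which is the situation inside an IPython or Jupyter cell.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are now inside a running event loop, as in a Jupyter cell.
    coro = inner()
    try:
        # Starting a second loop from here is rejected by asyncio.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Run as a plain script without `nest_asyncio`, the nested call is expected to fail with a `RuntimeError`; `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.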

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine.
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # Helper module that is imported only in the project's CI environment.
    import patch
else:
    # Allow nested event loops when running inside a notebook.
    import nest_asyncio

    nest_asyncio.apply()


# Load the model and start the engine in-process (no HTTP server).
llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.20it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Tom, and I am 15 years old. I play basketball and take piano lessons. I will be 15 in the summer. I like playing basketball very much because I am good at it. I take piano lessons because I want to learn more music. The piano is a very cool instrument for me. I can play so many different notes, so my friends can hear my piano music. 

When I am older, I hope to be a good basketball player, or a good piano player. I like basketball better than piano, but I will learn to play it. When I grow up, I will want to become a
Prompt: The president of the United States is
Generated text:  represented by the Vice President. What does this sentence imply?

a) The President and Vice President share the same nationality
b) The President and Vice President are of the same political party
c) The Vice President and President share the same age
d) The Vice President and President share the same religion
e) The President and Vice President are of different polit
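
The `sampling_params` dictionaries in this document set `temperature` and `top_p`. As a rough illustration of what those two settings conventionally mean (a temperature-scaled softmax followed by nucleus filtering), here is a pure-Python sketch. It is not SGLang's sampler; `nucleus_filter` and the toy logits are made up for the example.

```python
import math


def nucleus_filter(logits: dict, temperature: float, top_p: float) -> dict:
    """Return renormalized probabilities of the tokens kept by top-p filtering."""
    # Temperature scaling: lower values sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    kept = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}


# Toy next-token logits (hypothetical values).
toy_logits = {"Paris": 2.0, "Lyon": 1.5, "Nice": 0.5, "Rome": -1.0}
kept = nucleus_filter(toy_logits, temperature=0.8, top_p=0.95)
print(sorted(kept))
```

With these toy numbers the least likely token falls outside the nucleus and is dropped, while the remaining probabilities are rescaled to sum to one.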

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your job or experience here]. I'm always looking for new opportunities to grow and learn, and I'm always eager to contribute to the success of [company name]. What do you do for a living? I'm always looking for new challenges and opportunities to grow and learn, and I'm always eager to contribute to the success of [company name]. What do you enjoy doing? I enjoy [insert a short description

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, a city with a rich history and a diverse population. It is located on the Seine River and is the largest city in France by population. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also famous for its cuisine, fashion, and art, and is a major tourist destination. Paris is a cultural and intellectual center of the world, and its influence can be seen in many aspects of French society and politics. The city is home to many important institutions, including the French Academy of Sciences, the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more sophisticated and nuanced interactions between humans and machines.

2. Enhanced machine learning capabilities: AI is likely to become even more capable of learning and adapting to new situations, allowing for more complex and sophisticated decision-making.

3. Improved privacy and security: As AI systems become more integrated with human intelligence, there will be an increased need for privacy and security measures to protect against potential misuse.

4. Increased focus on ethical considerations: As AI systems become more integrated with human intelligence
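
The "overlap removal" named in the banner above can be pictured as follows. This is a minimal sketch of the idea, not necessarily how `stream_and_merge` is implemented; `trim_overlap`, `merge_chunks`, and the sample chunks are illustrative names and data.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical stream in which consecutive chunks overlap.
chunks = [" Paris, also", " also known as", " as the City", " of Light"]
merged = merge_chunks(chunks)
print(merged)
```

One caveat of this heuristic: text that is legitimately repeated at a chunk boundary would be dropped as well, so it suits streams whose chunks may overlap rather than streams that already deliver clean deltas.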



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [age] year old [occupation]. I'm currently [job title] at [company name]. I have a passion for [reason why you enjoy your job] and I am always looking to [how you stay motivated]. I'm [any notable qualities or skills you possess that make you unique]. If you have any questions or would like to learn more about me, please don't hesitate to ask. 
I'm excited to chat with you! 
[Your Name] [Your Job Title] [Company Name] [Your Contact Information] 
[Your Age] [Your Occupation] [Job Title

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light, and is located on the banks of the Seine River in the central region of France.
Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, Louvre Museum, and Sacré-Cœur Basilica. It is also the bi
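
Because `async_generate` is awaitable, it can be combined with other coroutines. The sketch below issues several requests concurrently with `asyncio.gather`. `StubEngine` is a hypothetical stand-in that only echoes the prompt so the example runs without a model; with a real engine, `llm` would take its place. Whether concurrent single-prompt calls are preferable to one batched call is not something this sketch measures.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; it only echoes the prompt."""

    async def async_generate(self, prompt: str, sampling_params: dict) -> dict:
        await asyncio.sleep(0.01)  # simulate waiting on the model
        return {"text": f" <completion for: {prompt}>"}


async def run_concurrently(engine, prompts, sampling_params):
    # Create one coroutine per prompt and wait for all of them together.
    tasks = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*tasks)


prompts = ["The capital of France is", "The future of AI is"]
outputs = asyncio.run(run_concurrently(StubEngine(), prompts, {"temperature": 0.8}))
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```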

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [gender] [race] [nationality] with [age] years old. I'm currently [occupation], and I'm a [profession or title] of my [career field or industry]. I started [time period] when I [short answer, e.g., "was born, grew up, started working" or "started my career"] and I've been [career length or duration] in this field. I enjoy [reason why you enjoy your job]. My [advantages or strengths] and [disadvantages or weaknesses] are [list them]. If you could describe me as


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. The city also boasts a rich cultural scene with various museums, theaters, and a lively nightlife. It is a major hub of politics, finance, and art, and has a diverse population of around 3.8 million people. Paris is a popular tourist destination and a UNESCO World Heritage site. The French capital is recognized as a UNESCO World Heritage site due to its historical significance and cultural richness. The French capital is a major global city and a hub of economic activity. It is home to the European Parliament and the


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve a continuous evolution and growth, with the following potential trends:

1. Integration of AI with other technologies: The integration of AI with other technologies such as sensors, cameras, and machine learning algorithms will become increasingly common, enabling machines to perform tasks that are currently out of reach for humans.

2. Development of new algorithms: The development of new algorithms, such as neural networks, deep learning, and reinforcement learning, will become more sophisticated and capable, enabling machines to perform more complex and diverse tasks.

3. Increased reliance on AI in healthcare: AI is expected to play a crucial role in healthcare, from diagnosing diseases to recommending


In [6]:
llm.shutdown()
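
In a standalone script, an exception raised between engine creation and the final cell would skip the call to `shutdown()`. One way to guard against that is a `try`/`finally` block, sketched below with a hypothetical `StubEngine` that fails on purpose; with a real engine, `llm` would take its place.

```python
class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompt, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


engine = StubEngine()
error = None
try:
    engine.generate("Hello, my name is", {"temperature": 0.8})
except RuntimeError as exc:
    error = str(exc)
finally:
    # Runs whether generate() succeeded or raised.
    engine.shutdown()

print("error:", error)
print("shut down:", engine.is_shut_down)
```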