# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
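
As a rough illustration of the custom-server idea, the sketch below wraps an engine in a transport-agnostic request handler that could sit behind any web framework. `StubEngine` and `handle_request` are hypothetical names introduced here so the snippet runs without a model; the stub only mimics the `generate(prompt, sampling_params)` call and the `{"text": ...}` result shape used later in this document. The linked `custom_server.py` may be organized quite differently.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    def generate(self, prompt, sampling_params=None):
        # Mimics the {"text": ...} result shape shown in this document.
        return {"text": f" <completion for: {prompt}>"}


def handle_request(engine, body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    try:
        payload = json.loads(body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        return json.dumps({"error": "expected JSON with a 'prompt' field"}).encode()

    sampling_params = payload.get("sampling_params", {})
    output = engine.generate(prompt, sampling_params)
    return json.dumps({"text": output["text"]}).encode()


engine = StubEngine()  # in real use: sgl.Engine(model_path=...)
response = handle_request(engine, b'{"prompt": "The capital of France is"}')
print(response.decode())
```

Keeping the handler independent of the transport makes it easy to test without starting a server, and the same function can later be called from whichever HTTP layer you choose.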



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other environment that already runs an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
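
To see why this is needed, the following sketch reproduces the underlying restriction with the standard library only: calling `asyncio.run` while an event loop is already running raises a `RuntimeError`. A notebook typically keeps its own loop running, so code in a cell is in the same position as `outer` below. The function names are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, much like code in a notebook cell.
    coro = inner()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed instead of raising.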

## Advanced Usage

The engine also supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) and [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Nusret and I'm a big fan of fitness and nutrition. I'm the owner of a company that specializes in both fitness and health products. I'm really passionate about all types of fitness, including cross fit, yoga and Pilates. I'm also a huge sports fan - I'm a vegetarian. I love to ride my bike and work out at the gym.

I want to learn how to take better care of my body and add an extra layer of nutrition to my life, so I decided to start a personal nutrition plan. I don't have a problem with my weight, but I wanted to start adding some nutritious elements to
Prompt: The president of the United States is
Generated text:  proposing a new tax policy that aims to reduce the national debt. The tax policy will impose a tax on every resident of the United States, with a tax rate of 5% on income over $1 million. If the tax rate is increased to 7% on income over $2 million, and the tax bill per resident would increase by $100,000, how much would the tax bi
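
For offline batch jobs it is often convenient to persist each result next to its prompt. The sketch below assumes only that `outputs` is a list of dictionaries with a `text` key, as in the cell above. The sample data and the `save_jsonl` helper are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_jsonl(prompts, outputs, path):
    """Write one JSON object per line, pairing each prompt with its completion."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "completion": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Placeholder data standing in for `llm.generate(prompts, sampling_params)`.
prompts = ["The capital of France is", "The future of AI is"]
outputs = [{"text": " Paris."}, {"text": " uncertain."}]

out_path = Path(tempfile.mkdtemp()) / "generations.jsonl"
save_jsonl(prompts, outputs, out_path)
records = [json.loads(line) for line in out_path.read_text(encoding="utf-8").splitlines()]
print(records[0])
```

One record per line keeps the file appendable and easy to process incrementally, which suits long-running batch jobs.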

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [occupation] who has been [number of years] in the industry. I am passionate about [reason for passion], and I am always looking for ways to [action or goal]. I am always eager to learn and grow, and I am always willing to take on new challenges. I am a [character trait or quality] and I am always ready to help others. I am a [character trait or quality] and I am always willing to help others. I am a [character trait or quality] and I am always willing to help others. I am a [character trait or quality] and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French National Museum. Paris is a bustling city with a rich cultural heritage and is a popular tourist destination. The city is known for its cuisine, fashion, and art scene. It is also home to the French Parliament and the French National Museum. Paris is a vibrant and dynamic city that is a must-visit for anyone interested in French culture and history. The city is also known for its iconic landmarks such as the Eiffel Tower, Notre-D

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI will continue to automate tasks that are currently performed by humans, such as data analysis, decision-making, and routine maintenance. This will lead to increased efficiency and productivity, but it will also create new job opportunities.

2. Enhanced human-machine collaboration: AI will continue to improve its ability to understand and respond to human emotions, language, and intentions. This will lead to more natural and effective communication between humans and machines, and will also create new opportunities for collaboration and teamwork.

3. AI will become more integrated with other technologies: AI will continue to be integrated with other
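
The phrase "overlap removal" suggests that streamed chunks may repeat text that was already received, and that `stream_and_merge` drops the repeated part before concatenating. The following is a minimal sketch of one way to do that, not necessarily how `sglang.utils` implements it. `trim_overlap`, `merge_chunks`, and the sample chunks are illustrative.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing text that repeats the end of the merged result."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = [" Paris", " Paris, which", " which is known", " for its landmarks"]
merged = merge_chunks(chunks)
print(merged)
```

Note that a suffix/prefix heuristic like this cannot tell a repeated chunk from text that legitimately repeats (for example `"ha"` followed by `"ha"`), so it is only appropriate when chunks are known to overlap.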



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert first and last name] and I am a [insert occupation]. I am [insert age] years old, and I have always been [insert past experience]. I have always been [insert skill level], and I enjoy [insert hobbies or activities]. I am always [insert personality trait], and I am always looking for [insert opportunities or challenges]. I am excited to meet you and to learn more about you.
I'm [insert your profession] and I'm a great fit for the job because I'm [insert a positive statement about your profession and why you're a good fit for the role]. I always try to make

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

(Correct answer: Paris) 

Explanation: The capital of France is the city of Paris, which has been the seat of government and the heart of French culture since the 12th century. Paris is renowne
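
Because `async_generate` is awaited, it can be combined with other `asyncio` tools. Assuming the engine accepts overlapping calls, the sketch below runs several batches concurrently while capping how many are in flight. `StubEngine` is a hypothetical stand-in that returns canned text, and `generate_in_groups` and `max_concurrent` are names invented for this example.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; returns canned text after a short delay."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # pretend to do some work
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_in_groups(engine, groups, sampling_params, max_concurrent=2):
    """Run several batches concurrently, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(batch):
        async with semaphore:
            return await engine.async_generate(batch, sampling_params)

    # gather preserves the order of `groups` in its result list.
    return await asyncio.gather(*(run_one(batch) for batch in groups))


groups = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(generate_in_groups(StubEngine(), groups, {"temperature": 0.8}))
print([len(batch_outputs) for batch_outputs in results])
```

Whether this helps in practice depends on the workload, since the engine already schedules the requests inside a single batch; the pattern is mainly useful when batches arrive from independent sources.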

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [position] at [company name]. I have a strong passion for [insert the hobby, activity, or interest that connects with the position]. I enjoy helping people and making a positive impact in the world. I'm a [insert the most important characteristic of the character that makes them stand out]. And I'm always looking to learn new things and expand my skill set. I'm a [insert the most important skill that makes them stand out] and I'm always looking for ways to improve myself. So, if you're interested in becoming a part of our team and making a positive impact on the

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the Lorraine region of northern France and surrounded by the Seine River to the west and the Alps to the north.
You are an AI assistant that helps people find information. Don't use this statement in places where it could be misunderstood or misinterpreted.
Please provide me with the definition of "entertainment industry".
The entertainment industry, also known as film, television, and music, is a complex and multifaceted sector that encompasses a wide range of activities related to the production, distribution, and consumption of media and entertainment products. It includes film, television, theater, music, and video games. The industry

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a number of factors, including advances in hardware, new types of AI algorithms, and increased collaboration between AI researchers and the scientific community. Here are some possible future trends in AI:

1. AI will become more integrated into everyday life: One of the biggest benefits of AI is that it will become more integrated into our lives. As AI is trained on more data, it will become more capable of solving complex problems that were previously impossible to solve by hand. This could lead to a more efficient use of resources and a more personalized experience for customers.

2. AI will be used for more complex tasks: As AI becomes
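
The cell above streams one prompt at a time. If several streams should make progress together, one option is to consume them concurrently and collect the text per prompt. This is only a sketch: `stub_stream` is a hypothetical stand-in for a call such as `async_stream_and_merge(llm, prompt, sampling_params)`, and `stream_all` is a name invented here.

```python
import asyncio


async def stub_stream(prompt):
    """Hypothetical stand-in for a streaming call; yields text pieces for one prompt."""
    for piece in (" one", " two", " three"):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def stream_all(prompts):
    """Consume one stream per prompt concurrently and collect the text per prompt."""
    texts = {prompt: "" for prompt in prompts}

    async def consume(prompt):
        async for piece in stub_stream(prompt):
            texts[prompt] += piece

    await asyncio.gather(*(consume(p) for p in prompts))
    return texts


texts = asyncio.run(stream_all(["prompt A", "prompt B"]))
print(texts)
```

Collecting into a dictionary keyed by prompt avoids interleaving chunks from different prompts in the same output, which is what would happen if each consumer printed directly.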




```python
llm.shutdown()
```
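
In a standalone script, an exception raised before `llm.shutdown()` would skip the cleanup. One way to guard against that is a small context manager, sketched here with a hypothetical `StubEngine` so the snippet runs without a model; `managed_engine` is likewise a name made up for this example and not part of SGLang.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown ran."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompt, sampling_params=None):
        return {"text": f" <completion for: {prompt}>"}

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()  # in real use: sgl.Engine(model_path=...)
try:
    with managed_engine(engine) as llm:
        llm.generate("Hello, my name is")
        raise RuntimeError("simulated failure mid-run")
except RuntimeError:
    pass

print(engine.is_shut_down)
```

A plain `try`/`finally` around the inference code achieves the same effect; the context manager simply makes the pattern reusable.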