# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional server layer would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython, or in any other code that already runs inside an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
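The underlying restriction is that `asyncio` refuses to start a new event loop from inside one that is already running, which is the situation in an IPython session. The sketch below reproduces that restriction with the standard library only; it does not involve SGLang, and `inner` and `outer` are illustrative names.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # This coroutine already runs inside an event loop, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: asyncio rejects it
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are permitted, which is why it is applied before the engine is used in a notebook.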

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Ravi and I'm a graduate student at the University of Minnesota. I've been working on the RMT Fellowship program for the past several years and am currently preparing to start my second year of my PhD. My research is focused on the role of population balance equations in predicting the dynamics of 3D structures in protoplanetary disks. Specifically, I am interested in modeling the dynamics of protoplanetary disks as a dynamical system by using population balance equations for gas and dust. I want to learn more about this field as I prepare for my thesis, and am keen to get any feedback or advice from other PhD students on this topic
Prompt: The president of the United States is
Generated text:  running for a second term. He will serve his second term starting on July 1, 2021. He is seeking an additional term on December 31, 2023. If his term starts on the 1st day of the month, and he was born on the 1st of January 2000, on what date does he exp
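For offline batch jobs, the results are often written to disk rather than printed. The following is a minimal sketch, assuming each output is a dict with a `text` key as in the loop above. `fake_outputs` is placeholder data standing in for the return value of `llm.generate(...)`, and both the `to_jsonl` helper and the JSONL record layout are illustrative choices, not part of SGLang.

```python
import io
import json

prompts = ["Hello, my name is", "The capital of France is"]
# Placeholder data standing in for `llm.generate(prompts, sampling_params)`.
fake_outputs = [{"text": " Ada."}, {"text": " Paris."}]


def to_jsonl(prompts, outputs, sink):
    """Write one JSON object per prompt/output pair to a file-like sink."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        sink.write(json.dumps(record) + "\n")


buffer = io.StringIO()  # swap for open("results.jsonl", "w") to write a file
to_jsonl(prompts, fake_outputs, buffer)

# Read the records back to confirm that each line is valid JSON.
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(records)
```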

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Gender] [Occupation]. I'm a [Skill or特长] that I've honed through my [Education or Training] and have been [Achievement or Contribution]. I'm [Personality] and I enjoy [What I like to do]. I'm [What I like to do] because [Why I like it]. I'm [What I like to do] because [Why I like it]. I'm [What I like to do] because [Why I like it]. I'm [What I like to do] because [Why I like it].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower and its rich history dating back to the Middle Ages. It is also the country's largest city and the second most populous city in the European Union. Paris is a cultural and artistic center, with many famous museums, theaters, and art galleries. The city is also known for its fashion industry, with many famous designers and fashion houses. Paris is a popular tourist destination, with many attractions and events throughout the year. It is also a major financial center, with many banks, insurance companies, and other financial institutions. Paris is a city of contrasts, with its modern architecture and high

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with other technologies: AI is already being integrated into a wide range of devices and systems, from smartphones and computers to smart homes and industrial machinery. As more devices and systems become connected to the internet, AI will likely become even more integrated, with AI systems becoming more integrated with other technologies such as sensors, cameras, and machine learning algorithms.

2. Enhanced capabilities: AI is likely to become even more capable in the future, with the ability to learn and adapt to new situations and data. This will likely lead to more sophisticated and autonomous AI systems that
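The "overlap removal" mentioned above can be pictured as follows: when a streamed chunk repeats text that has already been received, only the new suffix is appended. This is a sketch of one plausible approach, not necessarily how `stream_and_merge` is implemented. The helper names and `fake_stream` are illustrative, and the chunks are hand-written rather than produced by the engine.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text that overlaps with what was already merged."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hand-written chunks in which later chunks repeat the tail of earlier ones.
fake_stream = [" Paris", " Paris, the", " the city", " of light"]
merged = merge_chunks(fake_stream)
print(repr(merged))
```

One trade-off of this heuristic is that a legitimately repeated word arriving as its own chunk would also be trimmed, so it suits streams that are known to resend earlier text.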



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a/an [Job Title] with [Number of Years] years of experience in [Industry]. I am currently working as a/an [Position] at [Company Name], and I am passionate about [What you like to do]. I have always been [Favorite Thing], [Favorite Sport], and [Favorite Character]. My favorite thing to do is [What you like to do]. I am also [What you like to do]. I enjoy [What you like to do]. I am confident that I have the skills and knowledge needed to succeed in this position, and I am committed to providing exceptional service and assistance

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is known as the "city of love" due to its romantic and passionate atmosphere. The city is home to numerous landmarks, including the Eiffel Tower and Notre-Dame Cathedral. Paris is a bustling hub of culture, cuisi
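An asynchronous API also makes it possible to issue requests from several places in application code and wait for them together, which is the pattern a custom server would typically rely on. The sketch below illustrates this with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in that only mimics an `async_generate(prompt, sampling_params)` coroutine; the assumption that a single prompt yields a single dict with a `text` key has not been checked against the real engine.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; it only mimics async_generate."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_all(engine, prompts, sampling_params):
    # Create one request per prompt and wait for all of them concurrently.
    requests = [engine.async_generate(p, sampling_params) for p in prompts]
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*requests)


prompts = ["The capital of France is", "The future of AI is"]
results = asyncio.run(generate_all(FakeEngine(), prompts, {"temperature": 0.8}))
for prompt, result in zip(prompts, results):
    print(prompt, "->", result["text"])
```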

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a/an [occupation] [description]. I'm [age] years old and [majority color] skin. I'm [gender] and [occupation] has been [reason for being]. I'm [ability] and [career goal]. I'm [personality trait] and [eye color]. I have a [vocabulary] personality and love [interests or hobbies]. I'm [body shape] and [political affiliation]. I enjoy [activities] and [travel]. I'm [positive outlook] and [generous]. I value [values], but I try to be [positive] about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a historic city located in the south of the country. It is known for its rich history, art, and food. Its major landmarks include the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is also famous for its fashion and restaurants. The city is also known for its annual Eiffel Tower Tower Fête, a celebration of the Eiffel Tower's 100th birthday. As of 2021, Paris has a population of over 18 million. Paris is a major cultural and economic hub of the world, and it is widely recognized as one

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  bright and with rapid advances in technology, we can expect many new and exciting trends to emerge. Here are some of the potential areas where AI is likely to continue to play a significant role:

1. Advancements in Machine Learning: The field of machine learning is constantly evolving and advancing, bringing us more accurate and intelligent algorithms. We expect to see more sophisticated methods for understanding and predicting behavior of human-like agents, such as robots and AI assistants.

2. Internet of Things (IoT): The Internet of Things is expected to transform the way we interact with technology, making it easier for us to access and control devices and appliances. This will

```python
llm.shutdown()
```