# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This is useful when an HTTP server would only add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()
```
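
The reason for this step can be seen with the standard library alone. The sketch below does not use SGLang or `nest_asyncio`; it only shows that `asyncio.run()` refuses to start while another event loop is running, which is the situation in a notebook. The function names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here (started by the
    # asyncio.run() call at the bottom), similar to code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: plain asyncio rejects this
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are allowed, which is why the notebook cells in this document can call `asyncio.run(main())` directly.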

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl

from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.98it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Katie. I am a 15-year-old girl with an 80% GPA and a passion for reading. I like to write stories, and I enjoy helping others write. I am always looking for new topics to write about. How can I improve my writing skills?
Improving your writing skills can be a challenging task, but it's definitely achievable with dedication and practice. Here are some tips to help you improve your writing skills:

1. Read extensively: Read a wide range of books, newspapers, magazines, and other written materials. This will expose you to different writing styles and techniques, as well as help you learn new
Prompt: The president of the United States is
Generated text:  30 years older than the president of Texas. The president of Texas is half the age of the president of France. If the president of France is 30 years older than the president of Japan, how old is the president of Japan?
To determine the age of the president of Japan, we need to work backwards from
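
For offline batch jobs, the results usually need to be stored rather than printed. The following is a minimal sketch of that step, assuming only what the cell above shows: `generate` returns one dictionary per prompt with a `"text"` key. The helper `to_jsonl` and the placeholder completions are hypothetical and not part of SGLang.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    lines = [
        json.dumps({"prompt": prompt, "text": output["text"]})
        for prompt, output in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Placeholder data in the same shape as the result of llm.generate(...);
# these completions are illustrative, not real model output.
prompts = ["Hello, my name is", "The capital of France is"]
outputs = [{"text": " <completion 1>"}, {"text": " <completion 2>"}]

jsonl = to_jsonl(prompts, outputs)
print(jsonl)
```

In a real run, `outputs` would come from `llm.generate(prompts, sampling_params)`, and the returned string could be written to a `.jsonl` file.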

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [Age] year old [Occupation]. I am a [Type of Vehicle] with [Number of Wheels] wheels. I have [Number of Feet] feet and [Number of Hands] hands. I am [Gender] and [Race]. I have [Number of Colors] colors and [Number of Shapes] shapes. I am [Occupation] with [Number of Jobs]. I am [Number of Pets] pets. I am [Number of Children] children. I am [Number of Grandchildren] grandchildren. I am [Number of Grandparents] grandparents. I am [Number of

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville de Paris". It is the largest city in France and the second-largest city in the European Union. The city is home to many of France's most famous landmarks, including the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is also known for its rich history, including the French Revolution and the French Revolution Monument. The city is a major center for business, culture, and entertainment, and is a popular tourist destination. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into the city's vibrant culture. The city is also home

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars and personalized medicine to virtual assistants and chatbots. Additionally, there is a growing trend towards developing AI that is more ethical and transparent, with greater emphasis on privacy and security. As AI becomes more integrated into our daily lives, we can expect to see more widespread adoption of AI-powered solutions, from healthcare and finance to transportation and entertainment. Overall, the future of AI is likely to be one of continued innovation and growth, with
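
The "overlap removal" mentioned in the cell above refers to dropping text that a streamed chunk repeats from what was already emitted. The sketch below illustrates one way to do this; it is an assumption about the general idea and not necessarily how `stream_and_merge` is implemented. The helper names `trim_overlap` and `merge_chunks` are chosen for this example.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is already a suffix of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks while removing overlapping text between neighbors."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Illustrative chunks in which each one repeats part of the previous text.
chunks = [" Paris", " Paris, the", " the capital"]
merged = merge_chunks(chunks)
print(merged)
```

A suffix/prefix heuristic like this can also strip text that the model legitimately repeats, so it is only appropriate when chunks are known to overlap.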



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Job Title] at [Company Name]. I have been [Number of Years in Industry] in the [Industry], with a [Job Title] at [Company Name], for the past [Number of Years] years. My [Job Title] has allowed me to [Tell a fact or accomplishment that showcases my skills and experience]. I am excited about the opportunity to work with you and contribute my skills to your team. Thank you for considering me for this position.
Hello, my name is [Name] and I am a [Job Title] at [Company Name]. I have been [Number of

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city of light and memory.
Paris is known for its grand boulevards, iconic Eiffel Tower, and iconic landmarks such as the Louvre and Notre-Dame Cathedral. It is a bustling center of culture, fashion, and food, and a beloved summer destinatio
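
One reason to prefer the asynchronous call is that other coroutines can make progress while generation is awaited. The sketch below demonstrates this with a stub: `StubAsyncEngine` is a hypothetical stand-in that only mimics the awaitable `async_generate(prompts, sampling_params)` call and the `"text"` output key used above, so the example runs without a model.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for the engine; not part of SGLang."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate time spent generating
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


async def heartbeat(ticks):
    # Some unrelated work that runs while generation is pending.
    for _ in range(3):
        ticks.append("tick")
        await asyncio.sleep(0.001)


async def main():
    engine = StubAsyncEngine()
    ticks = []
    # gather() schedules both coroutines on the same event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(main())
print(len(outputs), ticks)
```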

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm an [Age] year old, [Gender] with [Occupation] and I'm currently working as a [Occupation] at [Company]. I enjoy [what I do best]. What's your name?

In your spare time, you like to [what you enjoy doing]. Do you have any hobbies or interests? I'd love to learn more about you! What's your favorite hobby? [What your favorite hobby is]. What's your favorite book or movie? [What your favorite book or movie is]. What do you like to do for fun outside of work? [What you do outside of work

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the largest and most populous city in France, serving as its capital, seat of government, and cultural and economic center. It is the fifth-largest city in the world by population and the second-largest city by area. Paris was founded in the 11th century and is home to many notable landmarks, including the Eiffel Tower and the Louvre Museum. It is also known for its rich history, art, and culture, and has become a global destination for tourists, academics, and artists. The city is home to the French government, parliament, and most of the nation's political and governmental institutions. The

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain, as it depends on the interactions between people, technology, and society. However, there are some trends that are likely to shape the future of AI in the coming years. Some of the possible future trends include:

1. Increased integration of AI into the workforce: As AI technology becomes more advanced and accessible, it is likely to become more integrated into the workplace. This could lead to more jobs being automated or performed by AI, but it could also create new opportunities for people with skills in AI-related fields.

2. Expansion of AI to other sectors: AI is already being used in many sectors, including healthcare, transportation, and



In [6]:
llm.shutdown()