# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
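
To give a rough idea of what "a server on top of the engine" can look like, here is a minimal sketch that uses only the Python standard library. It is not the linked example: `StubEngine`, the `/generate` route, and the JSON field names are all hypothetical, and the stub only mimics the `{"text": ...}` output shape shown later in this document so that the sketch runs without a model or GPU. In real use you would presumably swap the stub for an `sgl.Engine` instance.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubEngine:
    """Hypothetical stand-in for an engine; echoes the prompt instead of running a model."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" (stub completion for: {prompt})"}


engine = StubEngine()


class GenerateHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only one (hypothetical) route is served in this sketch.
        if self.path != "/generate":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        output = engine.generate(body.get("prompt", ""), body.get("sampling_params"))
        payload = json.dumps({"text": output["text"]}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def post_generate(port, prompt):
    """Send one prompt to the local server and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"http://127.0.0.1:{port}/generate",
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # An empty ProxyHandler keeps the local request from being routed through a proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), GenerateHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
reply = post_generate(server.server_address[1], "Hello")
server.shutdown()
server.server_close()
print(reply)
```

The point of the design is that the handler calls the engine object directly in the same process, so the request format, routing, and batching policy are entirely up to you.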



## Nest Asyncio
If you run the **Offline Engine** in IPython, a Jupyter notebook, or any other code that already has a running event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
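
To see why the patch is needed, the following sketch (plain `asyncio`, no SGLang involved) reproduces the error that nested event loops normally trigger: calling `asyncio.run()` from code that is already running inside a loop raises `RuntimeError`. The function names `inner` and `outer` are made up for illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, so a nested
    # asyncio.run() call is rejected by the standard library.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

IPython and Jupyter keep an event loop running in the background, which puts notebook code in the same situation as `outer()` above. `nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed.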

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Lucy and I am a high school student in high school. I just want to know what are the main challenges we face in the future and how we can solve them. Can you share your thoughts on this? 
Lucy
As an AI language model, I do not have personal experiences or emotions, but I can provide general information about some of the challenges that may impact the future and suggest some potential solutions.
Some potential challenges in the future include:
1. Climate change: The effects of climate change are already being felt around the world, and it will continue to worsen unless we take action to reduce greenhouse gas emissions and adapt to
Prompt: The president of the United States is
Generated text:  32 years older than the president of Central America. The president of Central America is half the age of the president of Asia. If the president of Asia is 30 years old, what is the total of the ages of the presidents of the three continents?
To determine
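
The `temperature` and `top_p` entries in `sampling_params` control how each next token is drawn. The sketch below illustrates the general idea of temperature scaling followed by top-p (nucleus) filtering on a toy four-token vocabulary. It is a conceptual illustration only, not SGLang's sampler: `sample_token` and the logit values are made up, and it assumes `temperature` is greater than zero.

```python
import math
import random


def sample_token(logits, temperature=0.8, top_p=0.95, rng=None):
    """Pick one token index using temperature scaling, then top-p filtering."""
    rng = rng or random.Random()
    # Temperature rescales the logits: values below 1 sharpen the distribution,
    # values above 1 flatten it. This sketch assumes temperature > 0.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    probs = [weight / total for weight in weights]

    # Top-p keeps the smallest set of most likely tokens whose
    # cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for index in ranked:
        kept.append(index)
        mass += probs[index]
        if mass >= top_p:
            break

    # Sample among the kept tokens, weighted by their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [4.0, 3.0, 0.5, -2.0]  # made-up scores for a four-token vocabulary
picks = [sample_token(logits, rng=random.Random(seed)) for seed in range(200)]
print(sorted(set(picks)))
```

With these toy logits, the two most likely tokens already cover more than 95% of the probability mass, so the two unlikely tokens are never sampled. Lowering `temperature` or `top_p` narrows the choice further, which is why the streaming examples that use `temperature=0.2` produce more predictable text.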

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short, positive, enthusiastic, or neutral description of your personality or skills]. I'm always looking for new challenges and opportunities to grow and learn. What do you do for a living? I'm a [insert a short, positive, enthusiastic, or neutral description of your job or profession]. I'm always looking for new ways to improve my skills and stay up-to-date with the latest trends and technologies. What do you enjoy doing

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in France and the second-largest city in the European Union. Paris is known for its rich history, beautiful architecture, and vibrant culture. It is home to many famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. Paris is also a major transportation hub, with many major highways and railroads connecting the city to other parts of France and the world. The city is a cultural and economic center, with many museums, theaters, and restaurants serving as important attractions for residents and visitors alike. Paris is a popular tourist destination, with millions of visitors each year. It

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ethical considerations. This will include issues such as bias, transparency, and accountability.

2. Integration with other technologies: AI is likely to become more integrated with other technologies, such as machine learning, natural language processing, and computer vision. This will allow for more complex and sophisticated AI systems.

3. Development of new AI technologies: There will be a continued focus on developing new
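
The "overlap removal" mentioned above refers to merging streamed chunks so that text repeated between consecutive chunks is not emitted twice. A minimal sketch of one way to do this follows: before appending a chunk, drop the longest prefix of the chunk that already appears at the end of the accumulated text. The helper names `merge_chunk` and `merge_stream` and the sample chunks are hypothetical; this is not presented as the implementation behind `stream_and_merge`.

```python
def merge_chunk(accumulated, chunk):
    """Append chunk to accumulated, dropping any prefix that repeats its tail."""
    longest = min(len(accumulated), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(longest, 0, -1):
        if accumulated.endswith(chunk[:size]):
            return accumulated + chunk[size:]
    return accumulated + chunk


def merge_stream(chunks):
    """Fold a sequence of possibly overlapping chunks into one string."""
    text = ""
    for chunk in chunks:
        text = merge_chunk(text, chunk)
    return text


# Made-up chunks in which each one may repeat the end of the previous text.
chunks = [" Paris", " Paris is", " is the capital", " of France."]
merged = merge_stream(chunks)
print(repr(merged))  # ' Paris is the capital of France.'
```

One caveat of this heuristic is that it cannot tell an accidental overlap from a genuine repetition: merging `"ha"` with `"ha"` yields `"ha"`, not `"haha"`.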



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], and I am a [role or occupation] who has been around for [number of years] years. In my time, I have seen some of the most fascinating and unique phenomena that human knowledge has yet to fully uncover. If you have a question, please feel free to ask me anything. I am open to learning from you, and am always looking for ways to expand my own knowledge base.
I will be here when you call. Let's make a connection! Let's talk about what you like to do, what you want to learn, and where your interests lie. I am ready to help. How can I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the largest city and the oldest city in the European Union, located in the Loire Valley region of the central part of the country. It is the cultural and economic capital of France and home to many of the country's larges
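
The practical benefit of an awaitable call is that the calling coroutine yields control while it waits, so other coroutines can make progress in the meantime. The sketch below shows that pattern with `asyncio.gather`. `fake_async_generate` is a hypothetical stand-in that sleeps instead of running a model, which lets the example run without an engine.

```python
import asyncio


async def fake_async_generate(prompt, delay):
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(delay)  # simulates time spent waiting on the engine
    return {"text": f" <completion for {prompt!r}>"}


async def main():
    prompts = ["Hello, my name is", "The capital of France is"]
    # gather() runs both awaitables concurrently and returns the results
    # in the order the awaitables were passed, not in completion order.
    return await asyncio.gather(
        fake_async_generate(prompts[0], 0.02),
        fake_async_generate(prompts[1], 0.01),
    )


outputs = asyncio.run(main())
for output in outputs:
    print(output["text"])
```

Even though the second call finishes first, the results come back in input order, so they can still be paired with their prompts using `zip`.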

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [job title] at [company name]. I'm excited to have you on my team. Let me know if there's anything I can do to help you. Best, [Name] ✞️

This is a friendly and polite introduction, but do you think there could be a way to make it a bit more specific or personal? For example, instead of using the company name and job title, can we also add some more personal information about my experience working there or the skills I possess? Also, are there any additional questions or concerns I should address to make the introduction more engaging and engaging

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known as the city of light and love. It is a historical center, a hub of cultural life, and the seat of government and society for much of the country. It is a UNESCO World Heritage site, and the city is home to many landmarks such as the Eiffel Tower, the Louvre, and Notre-Dame Cathedral. Paris is also the heart of a rich cultural and artistic tradition, with a thriving film industry, a vibrant nightlife, and a diverse population. It is a popular tourist destination and a significant center of trade and commerce. Paris is a vibrant and dynamic city that continues to captivate people worldwide.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  promising and many possible trends are to follow. Here are some of the trends you might expect to see:

1. Increased automation: As AI becomes more widely used, the chances of job loss are likely to increase. However, there are also many opportunities for AI to automate mundane tasks, such as data entry, customer service, and administrative tasks.

2. AI for healthcare: AI has the potential to revolutionize the healthcare industry, with personalized medicine, early detection, and virtual assistants for healthcare professionals. AI-powered tools will help doctors diagnose diseases, prescribe treatments, and even predict patients' health outcomes.

3. AI for autonomous vehicles: Autonomous
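
The consumption pattern used in the cell above, `async for` over a stream of chunks with each chunk printed as it arrives, can be tried without an engine. In the sketch below, `fake_token_stream` is a hypothetical async generator that yields a fixed list of chunks, and `collect` is a made-up helper that prints each chunk immediately and also returns the joined text.

```python
import asyncio


async def fake_token_stream(tokens, delay=0.0):
    """Hypothetical async generator that yields text chunks one at a time."""
    for token in tokens:
        await asyncio.sleep(delay)  # stands in for waiting on the next chunk
        yield token


async def collect(stream):
    """Print chunks as they arrive and return the full text at the end."""
    pieces = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()  # finish the line once the stream is exhausted
    return "".join(pieces)


tokens = [" Paris", ",", " known", " as", " the", " city", " of", " light"]
full_text = asyncio.run(collect(fake_token_stream(tokens)))
```

Printing with `end=""` and `flush=True` is what makes the text appear incrementally on one line, while keeping the pieces in a list gives you the complete string once the stream ends.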




In [6]:
llm.shutdown()