# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
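
To illustrate why this is needed, here is a minimal sketch that uses only the standard library and no SGLang at all. The names `inner` and `outer` are illustrative assumptions. It shows that calling `asyncio.run` from code that is already running inside an event loop (as is the case in IPython or Jupyter) raises a `RuntimeError`:

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, similar to a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a nested asyncio.run is rejected by the standard library
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


nested_result = asyncio.run(outer())
print(nested_result)
```

`nest_asyncio.apply()` patches the event loop so that this kind of nested call is allowed, which is why the snippet above is recommended before using the engine in such environments.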

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Chen and I'm a 25-year-old male. I'm from a small town in the southwest of China, and I'm currently a doctor at a city hospital. I have a personal blog, but I don't have a professional blog.
My blog focuses on the things that people in the world are most worried about. What I do is not just a doctor, but also a journalist and a reporter. I wrote some articles on the dangers of burning plastic, the dangers of eating poisonous mushrooms and the dangers of eating rotten meat. I have a lot of readers and I am very popular. 
For the last 12 months
Prompt: The president of the United States is
Generated text:  a powerful man who is given a lot of power to make decisions. He gets to go to any political meetings, he gets to talk to any person he wants, he gets to go to any press conference, he gets to be on any media interview, he gets to have any kind of influence on the country, and he gets to decide all the things. He gets to have as much power as 

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm a [Number] year old, [Gender] and [Country]. I'm a [Skill] person who enjoys [Favorite Activity] and [Favorite Food]. I'm [Age] years old, [Height] inches tall, and [Weight] pounds. I have [Physical Feature] and [Physical Attribute]. I'm [Age] years old, [Height] inches tall, and [Weight] pounds. I have [Physical Feature] and [Physical Attribute]. I'm [Age] years old, [Height] inches tall, and [

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a bustling metropolis with a rich history and a vibrant culture. Paris is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also known for its delicious cuisine, including French cuisine and international dishes. Paris is a popular tourist destination and a cultural hub for France and the world. It is a city that is both old and new, with a rich history and a modern spirit. The city is known for its art, music, and fashion, and is a major center for business, science, and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI becomes more advanced, we can expect to see even more widespread use in healthcare, with more personalized and accurate diagnoses and treatments.

2. AI in manufacturing: AI is already being used in manufacturing to optimize production processes and improve quality control. As AI becomes more advanced, we can expect to see even more widespread use in manufacturing, with more efficient and accurate
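
The phrase "overlap removal" in the cell above can be illustrated with a small sketch. This is an assumption about the general idea, not necessarily how `stream_and_merge` is implemented: if a streamed chunk begins by repeating the tail of the text received so far, only the new part is appended. The helper names `strip_repeated_prefix` and `merge_chunks` and the sample chunks are hypothetical.

```python
def strip_repeated_prefix(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while dropping text repeated across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats part of the previous text.
sample_chunks = [" Paris", " Paris, also", " also known as"]
merged_text = merge_chunks(sample_chunks)
print(repr(merged_text))
```

The sketch trades speed for simplicity: it rechecks suffixes on every chunk, which is fine for short generations but is not tuned for long outputs.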



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm a [Your Profession/Role] who has been [Your Achievements/Hobbies] for the past [X] years. I've always been [Your Character Trait/Background], and I'm always looking forward to [Your Goal/Adventure]. What's your favorite hobby, movie, or TV show? I'm always up for a challenge, so if you have any [Your Area of Expertise/Interest] to share, feel free to let me know! Let's meet in person! **[Your Name]**  
Contact Information:  
- LinkedIn Profile: [Your LinkedIn Profile]
- Twitter: @

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Fontaine" or "La Fontaine Royale". The city is located on the banks of the Seine River and has a rich history dating back to ancient times. It was the capital of France from 1804 until 1969 and is now a major center for the arts, education, and
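
Because the asynchronous call is awaited, other coroutines can make progress while generation is pending. The following sketch shows that pattern without requiring a GPU or SGLang. `StubEngine` is a hypothetical stand-in that only mimics the return shape seen above (a list of dicts with a `"text"` key), and `heartbeat` and `run_batch` are illustrative names, not part of SGLang.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for the engine; it only mimics the output shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend that inference takes some time
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks):
    # Unrelated work that runs while the generation call is awaiting.
    for _ in range(3):
        ticks.append("tick")
        await asyncio.sleep(0)


async def run_batch(engine, prompts, sampling_params):
    ticks = []
    # gather runs both coroutines concurrently on the same event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(prompts, sampling_params),
        heartbeat(ticks),
    )
    return outputs, ticks


stub_outputs, stub_ticks = asyncio.run(
    run_batch(StubEngine(), ["a", "b"], {"temperature": 0.8})
)
for out in stub_outputs:
    print(out["text"])
```

With a real engine, the same structure would apply by passing `llm` in place of `StubEngine()`, assuming its `async_generate` is used as in the cell above.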

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  __________ and I am a/an (job title) at __________. I have always been passionate about __________. I am __________. I love to __________.

Sure, let me add a bit more context to make the introduction more relatable and engaging. How about:

Hello, my name is [Name], and I am a [job title] at [Company Name]. I have always been passionate about [job title] work. I am [Age], [Gender], and I love to [job title] with [Reason for love or passion]. I am [any relevant details], and I am [any relevant

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving, and there are several possible trends that could shape the landscape in the coming years. Here are some of the most likely areas of growth and development:

1. Increased Automation: As AI systems become more efficient and capable, they will likely become more integrated into a wider range of tasks, from manufacturing to customer service. This could lead to automation of repetitive and mundane tasks, freeing up human employees to focus on more complex, creative work.

2. Deep Learning and Reinforcement Learning: These technologies are gaining momentum as the primary driver of AI progress, and could lead to breakthroughs in areas such as natural language processing, computer vision,


```python
llm.shutdown()
```