# Offline Engine API

SGLang provides a direct inference engine that does not require an HTTP server. This is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
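
As a rough illustration of the idea, the sketch below wraps an engine-like object in a transport-agnostic request handler. Everything here is an assumption made for the example: `FakeEngine` is a hypothetical stand-in that only mimics the `generate` call shape used later in this document, and `handle_generate` and the JSON field names are invented. The linked `custom_server` script remains the reference for a real server.

```python
import json


class FakeEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a model."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_generate(engine, body):
    """Turn a JSON request body into a (status_code, JSON response body) pair."""
    try:
        payload = json.loads(body)
        prompt = payload["prompt"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing field, or a non-object payload.
        return 400, json.dumps({"error": "expected a JSON object with a 'prompt' field"})
    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate([prompt], sampling_params)
    return 200, json.dumps({"text": outputs[0]["text"]})


status, response = handle_generate(FakeEngine(), '{"prompt": "The capital of France is"}')
bad_status, bad_response = handle_generate(FakeEngine(), "not json")
print(status, response)
print(bad_status, bad_response)
```

Keeping the handler independent of any web framework means the same function could be called from whichever HTTP layer you choose.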



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
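
Whether this is needed depends on whether an event loop is already running in the current thread, as is the case in IPython and Jupyter. The sketch below checks for one using only the standard library. The helper name `running_in_event_loop` is made up for illustration and is not part of SGLang.

```python
import asyncio


def running_in_event_loop() -> bool:
    """Return True if an asyncio event loop is already running in this thread."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


async def check_inside_coroutine() -> bool:
    # A coroutine only executes while a loop is running.
    return running_in_event_loop()


outside = running_in_event_loop()
inside = asyncio.run(check_inside_coroutine())
print(f"outside a coroutine: {outside}, inside a coroutine: {inside}")
```

In a plain script the first check is false, so `asyncio.run` works without patching. In a notebook cell it would be true, which is the situation `nest_asyncio.apply()` addresses.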

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    # In CI, a local `patch` helper module is imported instead of nest_asyncio.
    import patch
else:
    # Allow asyncio.run() inside an already-running event loop (e.g. Jupyter).
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# A single call handles the whole batch; outputs come back in prompt order.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Maria and I am a doctor in New York. I have been working in the field of psychiatry and psychology for over 15 years. I am often asked to do presentations for the public, and I usually do my best to keep my presentations clear and simple to understand. I have a degree in Psychology and have a certification as a National Board Certified Psychologist. My specialty is grief counseling, and I also specialize in treating addictions and substance abuse. I have been working with children and their families for 25 years, and I do not have any children of my own. I am a natural speaker and enjoy listening to how
Prompt: The president of the United States is
Generated text:  a powerful position, but he has not always been so position worthy. Throughout his tenure as president, he has faced many challenges and controversies. One of the most significant controversies he has faced was the use of the term "the White House." The president has argued that the
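
Each element returned by `llm.generate` is read with `output['text']` in the cell above. A minimal sketch of collecting prompt and completion pairs as JSON Lines follows. The `outputs` list here is a hand-written stand-in with only a `text` key so that the snippet runs without a model, and the helper name `to_jsonl` is hypothetical.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    return "\n".join(
        json.dumps({"prompt": prompt, "completion": output["text"]})
        for prompt, output in zip(prompts, outputs)
    )


prompts = ["The capital of France is"]
outputs = [{"text": " Paris."}]  # stand-in for what the engine would return
jsonl = to_jsonl(prompts, outputs)
print(jsonl)
```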

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # stream_and_merge consumes the stream and returns the merged text.
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is known for its rich history, art, and cuisine, and is a popular tourist destination. The city is also home to many international organizations and institutions, including the French Academy of Sciences and the French National Library. Paris is a vibrant and dynamic city with a rich cultural heritage that continues to inspire and captivate people around the world. The city is also known for its diverse population, with many

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing for more complex and nuanced decision-making. This could lead to a more human-like experience with AI, as it becomes more capable of understanding and responding to human emotions and behaviors.

2. Enhanced privacy and security: As AI becomes more integrated with human intelligence, there will be an increased need for privacy and security measures to protect the data and information that AI generates. This could lead to the development of new technologies and protocols for handling and protecting sensitive data.

3. Increased
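
The banner above mentions overlap removal. One way such merging can work is to drop the longest prefix of each new chunk that repeats the end of the text collected so far. The following is an illustrative sketch under that assumption, not necessarily how `stream_and_merge` is implemented; `trim_overlap` and `merge_chunks` are names chosen for this example.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` without the longest prefix that duplicates the end of `existing`."""
    max_overlap = min(len(existing), len(chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing text repeated across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = [" Paris, the", " the city", " city known for"]
merged = merge_chunks(chunks)
print(merged)
```

A suffix-matching heuristic like this cannot distinguish an accidental overlap from a genuine repetition in the generated text, so it is only suitable when chunks are known to overlap.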



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; outputs come back in prompt order.
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


# Inside a notebook this relies on nest_asyncio.apply() from the first cell.
asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm [Age]. I've been writing stories for [Number] years and I'm a [Skill or Profession]. I'm passionate about [Your Passion]. I enjoy [My Hobby/Interest/Decision]. I believe in [My Core Values]. My favorite quote is [Favorite Quote]. How would you describe yourself? [Your description]. 
In a short sentence, what is your unique selling point? [Your unique selling point]. I'm always looking to learn new things and grow as a writer. I'm always eager to challenge myself and push my creativity. I enjoy writing in a variety of genres and styles.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

What is the answer? Paris is the capital of France. The answer is Paris. Let me explain in detail:

1. Identify the capital city of France: The capital city of France is Paris.
2. Explain the capital city: P
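
The awaiting pattern above can be tried without loading a model. In this sketch, `FakeEngine` is a hypothetical stand-in whose `async_generate` mimics only the call shape used in this document (a list of prompts in, a list of dicts with a `text` key out). It is not SGLang's implementation, and `generate_all` is a name invented for the example.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; echoes prompts instead of running a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # yield control once, as a real awaitable would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def generate_all(engine, prompts, sampling_params):
    """Await one batched call and map each prompt to its generated text."""
    outputs = await engine.async_generate(prompts, sampling_params)
    return {prompt: output["text"] for prompt, output in zip(prompts, outputs)}


results = asyncio.run(
    generate_all(
        FakeEngine(),
        ["Hello, my name is", "The capital of France is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
for prompt, text in results.items():
    print(f"{prompt!r} -> {text!r}")
```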

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [insert your age here] year old female. I'm a [insert your profession here] and I live in [insert your city or country]. I have a passion for [insert something you enjoy doing] and I love to travel, so I often spend my free time exploring different places around the world. I'm always eager to learn new things and to try new things, and I enjoy using my mind to think creatively. I also enjoy making friends and spending time with people who are like me, and I try to make the most of every moment I have with them. So if you're

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La République" and "La Riviera," one of the most important cities in Europe and the world. It is home to the Eiffel Tower, the Louvre Museum, and many other cultural landmarks, as well as being the center of French politics, culture, and society. Paris is a bustling metropolis with a rich history and a vibrant nightlife, making it an exciting destination for tourists and locals alike. Its status as the capital of France and its rich culture make it a must-visit destination for anyone interested in French history, culture, and city living. Paris is also one of the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly uncertain and complex, and there is no one clear trend that can be predicted with certainty. However, based on current developments and trends, some potential future trends in AI include:

1. Increasingly sophisticated AI systems: As AI technology continues to advance, we are likely to see more sophisticated and complex AI systems that can perform a wide range of tasks, including decision-making, translation, and speech recognition.

2. Improved data privacy and security: As AI systems become more integrated into our daily lives, there is a growing concern about data privacy and security. The increasing complexity and sophistication of AI systems will require us to develop new ways of handling
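
Each chunk in the streaming loop is printed with `end=""`, so the pieces concatenate into continuous text for each prompt. The sketch below reproduces that consumption pattern against a hypothetical async generator, `fake_stream`, which yields pre-split pieces instead of calling a model. Both `fake_stream` and `collect` are names assumed for illustration.

```python
import asyncio


async def fake_stream(pieces):
    """Hypothetical stand-in for a streaming call: yields text pieces one at a time."""
    for piece in pieces:
        await asyncio.sleep(0)  # let other tasks run between chunks
        yield piece


async def collect(stream):
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream is exhausted
    return "".join(parts)


text = asyncio.run(collect(fake_stream([" Paris", ",", " also", " known"])))
```

Collecting the chunks in a list while printing them keeps the full text available after the stream ends, for example for logging or post-processing.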




```python
llm.shutdown()
```
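
The notebook ends by calling `llm.shutdown()`. In a script, it may be worth making sure that call also happens when inference raises. The sketch below shows one possible pattern with a small context manager. `engine_session` is a hypothetical helper and `FakeEngine` is a stand-in that only records whether `shutdown` was called; neither is part of SGLang.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for: {p}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() on exit, even after an error."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with engine_session(engine) as session:
        session.generate(["Hello, my name is"], {"temperature": 0.8})
        raise RuntimeError("simulated failure during inference")
except RuntimeError as exc:
    print(f"caught: {exc}")

print("engine shut down:", engine.is_shut_down)
```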