# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
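The need for this patch can be reproduced with the standard library alone: `asyncio.run` refuses to start while another event loop is already running, which is the situation inside IPython or Jupyter. The sketch below does not use SGLang or `nest_asyncio`; `inner` and `outer` are illustrative names only.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # a nested call to asyncio.run is rejected
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` before such nested calls is what allows them to proceed.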

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")




### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Max. I'm from the Netherlands. I live in Amsterdam. I have a very special title: my dad is a scientist. My dad is a molecular biologist. He's a very good scientist. He has a lot of knowledge in biochemistry, but he's also really into his hobby, which is hunting sea snakes. He's always looking for new specimens in the wild. Max was born in America and my dad has two dogs, which are our pets. They're Golden Retrievers. They're a really good family dog. They're really kind of good at playing with kids. I'm pretty much a constant observer of the world
Prompt: The president of the United States is
Generated text:  a politician who holds the highest office of the United States. What is the job title for the vice president of the United States? The job title for the vice president of the United States is the President Elect.
As of the most recent election data, the previous President Elect was not yet confirmed or elected, and thus the position of Vi

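As the cell above shows, `generate` takes a list of prompts and returns one dictionary with a `"text"` key per prompt. For a long prompt list it can be convenient to process it in chunks and keep each prompt paired with its output. The following is a minimal sketch under stated assumptions: `FakeEngine` is a stand-in that only mimics that output shape so the example runs without a model, and `chunked` and `generate_in_chunks` are hypothetical helpers, not part of SGLang.

```python
from typing import Dict, Iterator, List


class FakeEngine:
    """Stand-in for sgl.Engine (an assumption for illustration, not the real class)."""

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        # Mimics only the output shape used above: one dict with a "text" key per prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


def generate_in_chunks(engine, prompts, sampling_params, chunk_size=2):
    """Call engine.generate once per chunk and pair each prompt with its text."""
    records = []
    for batch in chunked(prompts, chunk_size):
        outputs = engine.generate(batch, sampling_params)
        for prompt, output in zip(batch, outputs):
            records.append({"prompt": prompt, "text": output["text"]})
    return records


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
records = generate_in_chunks(FakeEngine(), prompts, sampling_params)
for record in records:
    print(record)
```

Whether chunking helps depends on the workload, since the engine already schedules the requests within a batch. The sketch is mainly useful when results should be written out incrementally.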
### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. It is also a major center for art, culture, and politics. Paris is a popular tourist destination and a major economic hub. The city is home to many international organizations and cultural institutions. It is also known for its cuisine, including French cuisine, which is renowned for its rich flavors and use of fresh ingredients. Paris is a city of contrasts, with its rich history and modernity. Its status as the capital of France is a testament to its importance as a cultural and political center. The city is home to

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some of the most likely future trends in AI:

1. Increased automation and robotics: As AI continues to improve, we can expect to see more automation and robotics in various industries. This will lead to increased efficiency, productivity, and cost savings for businesses.

2. Enhanced personalization: AI will enable businesses to better understand their customers' needs and preferences, leading to more personalized experiences. This will enable businesses to offer more targeted marketing and personalized product recommendations.

3. Improved healthcare: AI will play a

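The "overlap removal" mentioned above can be illustrated with a small, self-contained sketch: when consecutive streamed chunks repeat text that has already been emitted, the repeated prefix is dropped before the chunk is appended. This is only one plausible approach and not necessarily how `stream_and_merge` is implemented; `strip_overlap` and `merge_chunks` are hypothetical names.

```python
from typing import Iterable


def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text repeated at each chunk boundary."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose boundaries overlap.
chunks = [" Paris, known", " known for its", " its iconic landmarks"]
merged = merge_chunks(chunks)
print(repr(merged))
```

A purely textual check like this cannot distinguish an accidental overlap from a deliberate repetition in the generated text, so it is a heuristic.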


### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Position] at [Company]. I am a [education level] with [number] years of experience in [field of expertise]. I bring a unique blend of [qualities/attributes] to my work, and I am always looking to learn and grow in my field. I am [interests/occupations/activities] and am passionate about [career goals]. I am eager to expand my knowledge and contribute to the team to the best of my abilities.
As you have seen, I am confident, dedicated, and adaptable, and I am always striving to improve myself and be a better version of

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the largest city and capital of the country. 

A French person can speak French. The French language is one of the official languages of France. 

The weather in Paris can vary from cold to hot. However, the city has a ple

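Because `async_generate` is awaited, several independent batches can be in flight at once. The sketch below shows one way to bound that concurrency with `asyncio.Semaphore` and `asyncio.gather`. It rests on assumptions: `FakeAsyncEngine` is a stand-in that only mimics the call shape used above (a list of prompts in, a list of dictionaries out), and `run_batches` is a hypothetical helper.

```python
import asyncio
from typing import Dict, List


class FakeAsyncEngine:
    """Stand-in for the engine (assumption); only mimics the async call shape."""

    async def async_generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        await asyncio.sleep(0)  # yield control, as a real request would while waiting
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params, max_concurrency=2):
    """Run one async_generate call per batch, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(batch):
        async with semaphore:
            return await engine.async_generate(batch, sampling_params)

    # gather returns results in the same order as the submitted batches.
    return await asyncio.gather(*(run_one(batch) for batch in batches))


batches = [
    ["Hello, my name is", "The capital of France is"],
    ["The future of AI is"],
]
results = asyncio.run(run_batches(FakeAsyncEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

In a notebook, the `asyncio.run` call at the end needs the `nest_asyncio` patch described earlier.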
### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Stream through the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm [Age]. I'm from [City, State]. I like to read books and watch movies, and I enjoy trying new foods. I'm passionate about [Your passion or interest], and I love to [Describe a recent experience or activity that shows your passion or interest]. I'm looking forward to getting to know more about you and learning more about you. How are you? [Name].

I'm excited to have the chance to meet you. [Name], this is my first time. How are you? [Name]. It's always nice to meet someone new. How do you like your new


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a historic city known for its unique architecture and vibrant culture.

What is the capital of France? Paris, the capital city of France, is renowned for its distinctive architecture and thriving culture. Its historical significance is reflected in its many museums, museums, and art galleries that showcase French art and culture. Paris also has a rich history and is famous for its festivals, such as the Festival de la Musique de la Salle, which is one of the oldest music festivals in the world. The city also hosts several major international events and attracts visitors from all over the world. In addition to its cultural attractions, Paris is also known for


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve a number of different trends, some of which are outlined below:

1. Increased use of machine learning: As AI technology continues to improve, it is likely to become more powerful and capable of performing increasingly complex tasks. This trend is expected to lead to the development of even more sophisticated models that can learn from large amounts of data and make increasingly accurate predictions.

2. More intelligent interfaces: As AI technology continues to improve, it is likely to become even more integrated into our daily lives. This trend is expected to lead to the development of more intelligent and human-like interfaces, such as virtual assistants and voice assistants, that can provide

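The streaming loop above prints each chunk as it arrives. If the full text is also needed afterwards, the chunks can be collected while they are printed. The following sketch assumes only that the stream is an async iterator of text chunks; `fake_stream` is a stand-in for `async_stream_and_merge` so the example runs without a model, and `collect_stream` is a hypothetical helper.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(prompt: str) -> AsyncIterator[str]:
    """Stand-in for async_stream_and_merge (assumption): yields text chunks in order."""
    for chunk in [" Paris", ",", " a", " historic", " city", "."]:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield chunk


async def collect_stream(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts: List[str] = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_stream("The capital of France is")))
```

Since every chunk is printed with `end=""`, the streamed output appears as continuous text on one line rather than one chunk per line.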


In [6]:
llm.shutdown()