# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
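
The linked script is the reference example. Purely to illustrate the general shape of such a server, the sketch below wraps any `generate_fn(prompt) -> str` callable in a small standard-library HTTP endpoint. The `/generate` route, the `make_server` and `post_prompt` helpers, and the JSON field names are assumptions made for this illustration and are not part of SGLang. `echo_generate` is a stand-in for a function that calls the engine and returns the `text` field of its output.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable


def make_server(generate_fn: Callable[[str], str], port: int = 0) -> ThreadingHTTPServer:
    """Build a server exposing POST /generate around generate_fn (hypothetical helper)."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/generate":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            body = json.dumps({"text": generate_fn(payload.get("prompt", ""))}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            # Silence the default per-request logging.
            pass

    return ThreadingHTTPServer(("127.0.0.1", port), Handler)


def post_prompt(port: int, prompt: str) -> dict:
    """Send one prompt to the local server and decode the JSON reply."""
    request = urllib.request.Request(
        f"http://127.0.0.1:{port}/generate",
        data=json.dumps({"prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"},
    )
    # Bypass any configured proxy so the request stays on the loopback interface.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return json.loads(response.read())


def echo_generate(prompt: str) -> str:
    """Stand-in for a real engine call; replace with a function that calls your engine."""
    return prompt.upper()


server = make_server(echo_generate)  # port 0 lets the OS pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
reply = post_prompt(server.server_address[1], "hello")
server.shutdown()
server.server_close()
print(reply)
```

Injecting the generation function keeps the HTTP layer independent of the engine, so the same handler can be exercised with a stub before a model is loaded.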



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs an event loop (a nested loop), you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
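
A likely reason this is needed is that notebook kernels already run an event loop, and plain `asyncio` refuses to start a second one inside it. The standard-library sketch below reproduces that failure without SGLang. The function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, like code in a notebook cell.
    coro = inner()
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


result = asyncio.run(outer())
print(result)
```

With `nest_asyncio.apply()` in effect, the nested call is permitted instead of raising.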

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Miao Xiaoxiao. I am a high school senior who has always been passionate about writing. I am a good writer who has always been inspired by literature. I have a passion for exploring the mysteries of the world, so I hope to be a writer who can communicate ideas and emotions through words.
Based on that paragraph, what is Miao Xiaoxiao's primary goal?
The answer is: To be a writer who can communicate ideas and emotions through words. Miao Xiaoxiao's primary goal is to be a writer who can express her thoughts and emotions through words, as evidenced by her passion for exploring the mysteries of the
Prompt: The president of the United States is
Generated text:  a wealthy man who has been the president for 23 years. He has served 22 terms, and his presidency has been relatively short compared to other presidents. However, this might not be true for other presidents, as their term lengths can vary.

To estimate the average number of years each presid
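
For offline batch jobs it is often convenient to persist the results. The following is a minimal sketch, assuming only that each output is a dict with a `text` field, as in the loop above. The helper name `write_jsonl` and the record layout are hypothetical, and `demo_outputs` is hard-coded stand-in data rather than real model output.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(prompts, outputs, path):
    """Write one JSON record per prompt/output pair and return the record count."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    with open(path, "w", encoding="utf-8") as handle:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "text": output["text"]}
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(prompts)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
demo_prompts = ["The capital of France is", "The future of AI is"]
demo_outputs = [{"text": " Paris."}, {"text": " uncertain."}]

out_path = Path(tempfile.gettempdir()) / "sglang_batch_demo.jsonl"
count = write_jsonl(demo_prompts, demo_outputs, out_path)
records = [json.loads(line) for line in out_path.read_text(encoding="utf-8").splitlines()]
print(count, records[0]["text"])
```

One record per line keeps the file easy to append to and to process in a streaming fashion later.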

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and art galleries. Paris is a popular tourist destination and is known for its rich history, art, and cuisine. It is also home to many famous French artists, writers, and musicians. The city is known for its fashion industry, with Paris Fashion Week being one of the largest in the world. Paris is a vibrant and dynamic city with a rich history and a diverse population. It is a major hub for business and

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: As AI becomes more advanced, it is likely to become more integrated with human intelligence, allowing it to learn and adapt to new situations more effectively.

2. Greater emphasis on ethical considerations: As AI becomes more advanced, there will be a greater emphasis on ethical considerations, such as privacy, fairness, and accountability.

3. Increased use of AI in healthcare: AI is already being used in healthcare to diagnose and treat diseases, but there is a growing potential for AI to be used in more advanced ways, such as personalized medicine and drug discovery.

4. Greater
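
The `stream_and_merge` helper hides the chunk handling. As a rough illustration of what overlap removal can mean (not necessarily how `sglang.utils` implements it), the sketch below drops the part of a new chunk that repeats the end of the text accumulated so far. `strip_repeated_prefix` and `merge_chunks` are names chosen for this example, and the chunks are made up.

```python
def strip_repeated_prefix(existing: str, chunk: str) -> str:
    """Return chunk without the longest prefix that already ends `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, skipping text that overlaps with what was already merged."""
    merged = ""
    for chunk in chunks:
        merged += strip_repeated_prefix(merged, chunk)
    return merged


chunks = ["The capital", " capital of France", " France is Paris"]
merged = merge_chunks(chunks)
print(merged)
```

A purely textual heuristic like this cannot distinguish an overlap from a genuine repetition (for example two consecutive `"ha"` chunks), so it only makes sense when chunks are known to overlap.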



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a [Age] year old, [Occupation]. I love to [favorite hobby or activity]. I am passionate about [why you love your hobby/activity]. I also [why you love this hobby/act/vocation]. I am confident, independent, and [reason for your success/ability]. I am the type of person who [why you think this is important]. I would love to meet you, but first I want to know more about you. What is your name? Where do you come from? What's the first thing that comes to mind when you think of your name? What's your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known as the "City of Love" due to its extensive gardens and romantic ambiance. Located in the northwestern region of France, Paris has a rich history dating back to the Roman Empire and continues to be an important cultural and economic center. As of
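
Asynchronous generation is mostly useful when requests arrive independently and should be in flight together. The sketch below shows one common pattern, a semaphore-bounded `asyncio.gather`. `FakeEngine` is a stand-in invented for this example so that it runs without a model. It only mimics the call shape `await engine.async_generate(prompt, sampling_params)` and returns a dict with a `text` field. Whether a single-prompt call returns a single dict in your SGLang version is an assumption to verify.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine; not part of SGLang."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for: {prompt}>"}


async def generate_bounded(engine, prompts, sampling_params, max_in_flight=2):
    """Run one request per prompt, with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather preserves input order, so results line up with prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


demo_prompts = ["A", "B", "C"]
results = asyncio.run(
    generate_bounded(FakeEngine(), demo_prompts, {"temperature": 0.8, "top_p": 0.95})
)
for prompt, result in zip(demo_prompts, results):
    print(prompt, "->", result["text"])
```

Inside a notebook, the `asyncio.run` call relies on `nest_asyncio.apply()` having been called, as in the setup cell.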

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm [Age]. I have always been an enthusiastic learner and always aim to improve my skills. I love to immerse myself in new experiences and have a great sense of humor. I'm a hard-working person who thrives on learning and discovering new things, so I'm always eager to learn and grow. I'm passionate about travel and I love exploring new cultures and trying new foods. I'm a friendly person who values honesty and integrity and always strive to live a simple, enjoyable life. I'm looking forward to having the opportunity to meet you and to explore new things together. What's your name? What



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is also famous for its rich cultural heritage and has played an important role in the development of the country's art, literature, and politics. Paris has a diverse population of over 7 million people and is known for its vibrant nightlife, street food, and annual festivals. It is one of the world's most populous and culturally significant cities, with a rich history dating back to the 6th century BC. Paris has been the site of many important events and landmarks throughout its history, including


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a combination of trends and innovations that continue to emerge. Here are some possible trends that could shape AI in the coming years:

1. Increased integration of AI into everyday life: AI is already becoming more integrated into our lives, from self-driving cars to virtual assistants that can help us with tasks like ordering groceries or managing our schedules. As AI technology advances, we can expect to see more of these features incorporated into our everyday lives.

2. AI will continue to improve its accuracy and efficiency: As AI technology improves, it will become more capable of performing tasks that are currently too complex for humans to do. This will
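
If you need the full text as well as the incremental output, the chunks can be collected while they are printed. The sketch below uses a fake async generator in place of `async_stream_and_merge(llm, prompt, sampling_params)`. `fake_stream` and `collect_stream` are hypothetical names, and the chunks are made up.

```python
import asyncio


async def fake_stream(chunks, delay=0.0):
    """Hypothetical stand-in that yields text chunks like a streaming call would."""
    for chunk in chunks:
        await asyncio.sleep(delay)
        yield chunk


async def collect_stream(stream, on_chunk=None):
    """Consume an async iterator of text chunks and return the joined text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. lambda c: print(c, end="", flush=True)
        parts.append(chunk)
    return "".join(parts)


seen = []
full_text = asyncio.run(
    collect_stream(fake_stream([" Paris", ",", " which", " is"]), on_chunk=seen.append)
)
print(repr(full_text))
```

Passing the per-chunk action as a callback keeps display logic separate from accumulation, so the same function works for logging, printing, or silent collection.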




```python
llm.shutdown()
```
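
In a script, it can help to guarantee that `shutdown()` runs even when generation raises. The following is a small sketch of that idea with a context manager. `shutting_down` is a hypothetical helper, and `DummyEngine` is a stand-in that only records whether `shutdown()` was called.

```python
from contextlib import contextmanager


@contextmanager
def shutting_down(engine):
    """Yield engine and call engine.shutdown() on exit, even if an error occurs."""
    try:
        yield engine
    finally:
        engine.shutdown()


class DummyEngine:
    """Hypothetical stand-in that only records whether shutdown() ran."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


dummy = DummyEngine()
try:
    with shutting_down(dummy):
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass
print(dummy.closed)
```

With a real engine, the object created by `sgl.Engine(...)` would take the place of `dummy`.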