# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other environment that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
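
The patch matters because `asyncio.run` raises a `RuntimeError` when the current thread already has a running event loop, which is the situation inside IPython and Jupyter. The standard-library sketch below shows one way to detect that situation. `has_running_loop` and `check_inside_loop` are hypothetical helper names for illustration and are not part of SGLang.

```python
import asyncio


def has_running_loop() -> bool:
    """Return True if the current thread already runs an asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises when no loop is running in this thread.
        return False
    return True


async def check_inside_loop() -> bool:
    # A coroutine only executes while a loop is running.
    return has_running_loop()


if __name__ == "__main__":
    print(has_running_loop())                # plain script, before any loop starts
    print(asyncio.run(check_inside_loop()))  # evaluated from inside a coroutine
```

If `has_running_loop()` returns `True` at the top level of your session, a nested `asyncio.run` call is likely to fail unless `nest_asyncio.apply()` has been called.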

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Eunice. I'm a teacher at a middle school. I have a little dog called Mimi. I take care of her. I feed her, play with her, and take care of her until she's finished with me. I'm happy that she's a little dog and that she has a nice home. I'm really proud of her. I often say I'm proud that she's a good girl. The dog belongs to Mimi's mother, and Mimi is in the care of my husband and two sons. My husband and two sons are very nice to me. They are always there when I need them to
Prompt: The president of the United States is
Generated text:  trying to improve communication between the executive branch and Congress. He is concerned that the executive branch does not communicate effectively with Congress. What could the president do to improve communication between the two branches? I'm sorry, I cannot answer this question. This might be a political issue that requires sensitivity and restraint, and I won't comment on or express opinions about polit
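
For offline batch jobs it is often convenient to store each prompt together with its completion. The sketch below relies only on what the loop above uses, namely that each output is a dictionary with a `text` key. `to_jsonl` is a hypothetical helper, and the hard-coded `example_outputs` list is stand-in data rather than real model output.

```python
import json
from typing import Iterable, Mapping


def to_jsonl(prompts: Iterable[str], outputs: Iterable[Mapping[str, object]]) -> str:
    """Serialize prompt/completion pairs as JSON Lines, one record per line."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "completion": output["text"]}
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data shaped like the results of llm.generate (assumption: a "text" key).
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
print(to_jsonl(example_prompts, example_outputs))
```

The returned string can be written to a `.jsonl` file so that later steps can read the results one record at a time.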

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Job Title] at [Company Name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [Age], [Gender], [Nationality], [Occupation], [Hobbies], and [Favorite Food]. I'm always looking for new experiences and learning new things, and I'm always eager to share my knowledge with others. Thank you for taking the time to meet me. [Name] [Company Name] [Company Address] [Company Phone Number] [Company Email] [Company Website] [Company LinkedIn Profile] [Company Twitter

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in France and the third largest city in the world by population. It is known for its rich history, beautiful architecture, and vibrant culture. Paris is home to many famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also known for its annual festivals and events, including the Eiffel Tower Parade and the World Cup. Paris is a popular tourist destination and a major economic center in France. It is the capital of France and the largest city in the country. It is also known as the "City of Light" and "The City of Light

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies will continue to improve and become more integrated into our daily lives, from self-driving cars to personalized medicine. Additionally, AI will continue to be used for tasks such as fraud detection, cybersecurity, and environmental monitoring. As AI becomes more integrated into our daily lives, we can expect to see a significant impact on the way we work, communicate, and interact with each other. However, it is important to note that AI is still a rapidly evolving field, and there are many potential risks and challenges that need to be addressed as
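
The banner above mentions overlap removal. How `stream_and_merge` handles this internally is not shown in this document, so the following is only an illustrative sketch of the general idea: when a new chunk begins with text that the accumulated output already ends with, the repeated part is dropped before appending. `trim_overlap` and `merge_chunks` are hypothetical names, not SGLang functions.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that existing already ends with."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["The cap", "capital of", " of France"]))  # The capital of France
```

A purely textual rule like this cannot tell an accidental repeat from a legitimate one (for example, a chunk that genuinely starts with the character the output ends with), so treat it as a demonstration of the concept rather than a drop-in replacement.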



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [job title] with a [degree] in [field]. I have always been [positive] about my career and [specific area of interest] has been my passion. I love to [describe a hobby or activity]. In my spare time, I enjoy [something else]. I’m a [positive] person, so I try to [describe a positive trait or behavior]. I’ve always been [what you consider your greatest strength]. I’m always [what you consider your greatest weakness]. If you had to describe me in 30 words or less, what would it be? Hi, my

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, often referred to as the “City of Light” due to its iconic landmarks and vibrant cultural scene.
You are to answer this question: Which city is part of the European Union, Paris, or Moscow? To answer this question, I will perform a quick search or look up t
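
Passing a list of prompts, as above, lets the engine schedule the whole batch. When requests arrive independently instead, for example in a custom server, one option is to issue one awaitable call per prompt and cap how many are in flight. The sketch below is a minimal illustration with a stand-in engine so that it runs without a model. `FakeAsyncEngine` and `generate_concurrently` are hypothetical names, and the sketch assumes that `async_generate` also accepts a single prompt and returns one dictionary with a `text` key.

```python
import asyncio
from typing import Any, Dict, List


class FakeAsyncEngine:
    """Hypothetical stand-in for the engine; it does not run a model."""

    async def async_generate(self, prompt: str, sampling_params: Dict[str, Any]) -> Dict[str, str]:
        await asyncio.sleep(0)  # yield control, as a real awaitable call would
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    max_in_flight: int = 2,
) -> List[Dict[str, str]]:
    """Run one request per prompt, with at most max_in_flight active at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(prompt: str) -> Dict[str, str]:
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather() returns results in the same order as the prompts.
    return await asyncio.gather(*(run_one(p) for p in prompts))


if __name__ == "__main__":
    results = asyncio.run(
        generate_concurrently(FakeAsyncEngine(), ["a", "b", "c"], {"temperature": 0.8})
    )
    for result in results:
        print(result["text"])
```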

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Title/Role]. [Name] is a [occupation] who has always been fascinated by the idea of [objective of interest]. I’ve always been passionate about [why you are passionate] and I aim to [my long-term goal]. [Name] is an [educational background], [lifestyle], and [personal interests]. I’m a [actor], [writer, designer, or model], [musician, artist, or athlete], or [person with a unique skill]. I'm always seeking to learn more about [how you learn], [what motivates you], and [what



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Ville de Paris," and it is a major city located on the Seine River and is the seat of the French government and the largest city in the European Union by population.

Please provide the correct sentence. Paris is the capital of France and one of the largest cities in Europe. According to the 2020 European Union population count, Paris has a population of over 2.3 million people. Paris is known for its stunning architecture, rich history, and vibrant culture, making it a popular tourist destination in Europe. As the seat of government and the largest city in the European Union by population



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve a wide range of trends and developments that could greatly impact the way we live, work, and interact with technology. Here are some potential future trends in AI:

1. Improved machine learning: As AI algorithms become more complex and sophisticated, we can expect to see more efficient and accurate machine learning models. This will likely lead to faster and more accurate predictions and decision-making processes, as well as improved natural language processing and image recognition.

2. Personalized AI: As more data becomes available, we can expect to see more personalized AI experiences. This could involve recommending products and services based on individual user preferences, or providing more personalized
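
When the streamed text is needed as a single string rather than printed chunk by chunk, the chunks can be accumulated from the asynchronous iterator. The sketch below uses a stand-in stream so that it runs without a model. `fake_stream` and `collect_stream` are hypothetical names, and the only assumption about the real stream is that it yields strings, as the `print(cleaned_chunk, ...)` call above suggests.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(tokens: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for an asynchronous stream of text chunks."""
    for token in tokens:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield token


async def collect_stream(chunks: AsyncIterator[str]) -> str:
    """Consume an async iterator of text chunks and return the joined text."""
    parts: List[str] = []
    async for chunk in chunks:
        parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    text = asyncio.run(collect_stream(fake_stream([" Paris", ",", " also", " known"])))
    print(text)
```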




```python
llm.shutdown()
```
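
In a script, an exception raised before the final cell would skip the `shutdown()` call. One way to guard against that is a small context manager that always calls `shutdown()` on exit. This is a sketch under the assumption that calling `shutdown()` once is all the cleanup required. `managed_engine` and `FakeEngine` are hypothetical names, and the fake class exists only so that the example runs without a model.

```python
from contextlib import contextmanager
from typing import Any, Iterator


class FakeEngine:
    """Hypothetical stand-in exposing only the shutdown() method used above."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        self.closed = True


@contextmanager
def managed_engine(engine: Any) -> Iterator[Any]:
    """Yield the engine and call shutdown() even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


if __name__ == "__main__":
    engine = FakeEngine()
    try:
        with managed_engine(engine):
            raise ValueError("simulated failure during inference")
    except ValueError:
        pass
    print(engine.closed)  # True
```

With a real engine, the body of the `with` block would contain the `generate` calls shown earlier.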