# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed, working example in a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
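
To see why the patch is needed, the following standard-library sketch reproduces the underlying restriction: `asyncio.run` refuses to start while another event loop is already running, which is the situation inside an IPython cell. The coroutine names `inner` and `outer` are illustrative only and are not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # At this point an event loop is already running, much like inside
    # an IPython cell. Starting a second loop with asyncio.run is rejected.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Applying `nest_asyncio.apply()` before such a call patches asyncio so that the nested call is allowed instead of raising.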

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")

W0807 07:37:41.103000 852326 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0807 07:37:41.103000 852326 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.


[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  5.90it/s]



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Xia, and I am 12 years old. I am from a rural village. I love reading and watching movies. My favorite book is "The Little Prince". Do you know what's the difference between "The Little Prince" and "The Very Hungry Caterpillar"? 

The Little Prince is about a man named Louis, who goes on a journey, and the Caterpillar is about a caterpillar eating a butterfly.

I want to know how I can find my family name and where I am from.
To answer your question, I am Xia. Please tell me more about the "Little Prince" and "The Very Hungry
Prompt: The president of the United States is
Generated text:  trying to decide whether to use a robot to clean the White House. He has decided that he would need to perform 6000 tasks to get the job done. The robot would take 50 minutes to perform one task. How many hours would it take for the robot to complete all of the tasks? To determine how many hours it would take for the robot to complete all of the tasks, we need
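
For offline batch jobs it is often convenient to persist each prompt together with its completion. The sketch below shows one way to do that as JSON Lines. It relies only on what the cell above shows (one output dict with a `"text"` key per prompt, in prompt order). `fake_generate` is a hypothetical stand-in for `llm.generate` so that the snippet runs without a GPU or a model.

```python
import json


def fake_generate(prompts, sampling_params):
    """Hypothetical stand-in for llm.generate: one dict with a "text" key per prompt."""
    return [{"text": f" <completion for: {p}>"} for p in prompts]


prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = fake_generate(prompts, sampling_params)

# Pair each prompt with its completion, assuming the i-th output belongs
# to the i-th prompt, as the zip loop in the cell above does.
records = [
    {"prompt": prompt, "completion": output["text"]}
    for prompt, output in zip(prompts, outputs)
]

# One JSON object per line is easy to append to and to stream back later.
jsonl = "\n".join(json.dumps(record, ensure_ascii=False) for record in records)
print(jsonl)
```

With a real engine, replacing `fake_generate(...)` with `llm.generate(...)` should be the only change needed, provided the output shape matches the one printed above.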

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Occupation]. I'm currently [Current Location] and I'm here to [Purpose of Visit]. I'm excited to meet you and learn more about you. How can I help you today? [Name] [Age] [Occupation] [Current Location] [Purpose of Visit] [Your Name] [Your Age] [Your Occupation] [Your Current Location] [Your Purpose of Visit] [Your Name] [Your Age] [Your Occupation] [Your Current Location] [Your Purpose of Visit] [Your Name] [Your Age] [Your Occupation

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament, the French Academy of Sciences, and the French Parliament building. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. It is also known for its fashion industry, with Paris Fashion Week being one of the world's largest fashion events. The city is also home to the French Riviera, a popular tourist destination for its beaches and luxury resorts. Overall, Paris is a vibrant and diverse city with a rich history and culture.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in several key areas, including:

1. Increased integration with human intelligence: AI systems are likely to become more integrated with human intelligence, allowing them to learn and adapt to new situations more effectively. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human needs.

2. Enhanced machine learning capabilities: AI systems are likely to become even more powerful and capable, with the ability to learn from vast amounts of data and make more accurate predictions and decisions. This could lead to more efficient and effective use of resources, as well as better decision-making in various industries.

3. Increased
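
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch of one possible approach, not the implementation of `stream_and_merge` in `sglang.utils`, which may differ. The function names `drop_overlap` and `merge_chunks` are hypothetical. The idea is that if a new chunk starts with text that the merged output already ends with, only the non-overlapping tail is appended.

```python
def drop_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that existing_text already ends with."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += drop_overlap(merged, chunk)
    return merged


chunks = [" Paris is", " is the capital", " capital of France."]
merged = merge_chunks(chunks)
print(merged)  # prints " Paris is the capital of France."
```

One trade-off of this heuristic is that it cannot tell an accidental overlap from a genuine repetition: `drop_overlap("ab", "ab")` returns an empty string, so text that legitimately repeats across a chunk boundary would be dropped.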



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  John Doe, and I am a self-employed freelance writer and editor. I have been in this field for over a decade, and I love my work. I'm excited to write for clients and share my expertise with the world. Let me know if you're interested in learning more about me! 
(End of text) 
[Optional] John Doe's interests, hobbies, and hobbies include:
- Traveling extensively
- Reading
- Music (specific genre)
- Playing guitar 
- Writing songs
- Dancing 
- Watching movies 
- Traveling extensively
- Reading
- Listening to music
- Playing guitar
- Writing

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the largest and most populous city in France. It is known for its rich history, beautiful architecture, vibrant culture, and famous landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. T
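
Because the call is awaited, it can share an event loop with other coroutines. The sketch below illustrates the general asyncio pattern of issuing several awaitable requests concurrently and collecting the results in order. `fake_async_generate` is a hypothetical stand-in used so the snippet runs without a model. Whether issuing one call per prompt is preferable to passing the whole list, as the cell above does, is not something this document states.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    """Hypothetical stand-in for an awaitable generation call."""
    await asyncio.sleep(0.01)  # simulate inference latency
    return {"text": f" <completion for: {prompt}>"}


async def run_all(prompts, sampling_params):
    # Create one awaitable per prompt and run them concurrently.
    # asyncio.gather returns results in the order the awaitables were passed.
    pending = [fake_async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*pending)


prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
outputs = asyncio.run(run_all(prompts, {"temperature": 0.8, "top_p": 0.95}))

for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```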

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Occupation] with [Number of Years in Industry] years of experience in [Industry]. I have [Number of Awards/Recognition] industry awards and [Number of Publications/Contributions] in [Field]. My background is rooted in [Brief History] and I bring a diverse set of skills to the table.

In the spirit of unity and cooperation, I am excited to be part of [Company Name] and help lead [Industry/Field]. Together, we can make a meaningful impact in [Industry/Field]. Let's create a team that is not just competitive, but collaborative and driven to



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the largest and most populous city in France and the country's capital. It is known for its rich cultural heritage, artistic scene, and lively nightlife. The city also serves as the seat of government, and has a long-standing tradition of hosting international events and festivals. France's capital, like its entire country, is a unique and fascinating place to visit and explore.

An example of a specific fact about Paris is that it has a vast array of museums, art galleries, and historical landmarks, including the Louvre, the Musée d'Orsay, and the Centre Pompidou. Additionally, Paris is a



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to continue to evolve and diversify, driven by advances in technology, new data sources, and changing societal needs. Here are some possible future trends in AI:

1. Enhanced Natural Language Processing: As AI becomes more advanced, it is expected to further improve natural language processing, enabling more sophisticated natural language understanding and generation.

2. Quantum Computing: Quantum computing is expected to revolutionize AI by making it possible to process vast amounts of data much faster and more efficiently than traditional computing methods.

3. Autonomous Robots: The development of autonomous robots is expected to greatly reduce the need for human intervention, allowing for more efficient use of resources and potentially
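
When streaming asynchronously, it is common to both display chunks as they arrive and keep the full text for later use. The sketch below shows that pattern with a plain async generator. `fake_stream` is a hypothetical stand-in for a chunk stream such as the one consumed in the cell above, so the snippet runs without a model.

```python
import asyncio


async def fake_stream(prompt):
    """Hypothetical stand-in for an async stream of text chunks."""
    for chunk in [" Paris", ".", " It", " is", " the", " capital", "."]:
        await asyncio.sleep(0)  # yield control, as a real stream would between chunks
        yield chunk


async def collect(prompt):
    parts = []
    async for chunk in fake_stream(prompt):
        # Show each chunk immediately and remember it for the final text.
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline once the stream is exhausted
    return "".join(parts)


full_text = asyncio.run(collect("The capital of France is"))
```

Collecting chunks in a list and joining once at the end avoids repeated string concatenation on long generations.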




In [6]:
llm.shutdown()
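
In a script, as opposed to a notebook, an exception raised during generation would skip a trailing `shutdown()` call. One way to guard against that is a small context manager that always calls `shutdown()` on exit. The sketch below uses a hypothetical `FakeEngine` stand-in that exposes only the two methods used in this document, and the helper name `managed` is also an assumption made for illustration.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing generate() and shutdown()."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        return [{"text": " <completion>"} for _ in prompts]

    def shutdown(self):
        self.closed = True


@contextmanager
def managed(engine):
    """Yield the engine and call shutdown() even if the body raises."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed(engine) as llm:
        llm.generate(["Hello, my name is"], {"temperature": 0.8})
        raise ValueError("simulated failure during post-processing")
except ValueError:
    pass

print(engine.closed)  # prints True: shutdown ran despite the exception
```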