# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
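
As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in a request handler using only the standard library. `StandInEngine`, `handle_request`, and the payload keys are hypothetical names chosen for this example. The only engine behaviour assumed is the one shown later in this document: an awaitable `async_generate(prompts, sampling_params)` that returns a list of dicts with a `text` key.

```python
import asyncio


class StandInEngine:
    """Hypothetical stand-in so the sketch runs without a GPU or a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0)  # pretend to do asynchronous work
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def handle_request(engine, payload):
    """Validate a request payload and forward it to the engine."""
    prompts = payload.get("prompts")
    if not isinstance(prompts, list) or not prompts:
        return {"status": 400, "error": "'prompts' must be a non-empty list"}
    sampling_params = payload.get("sampling_params", {})
    outputs = await engine.async_generate(prompts, sampling_params)
    return {"status": 200, "texts": [out["text"] for out in outputs]}


engine = StandInEngine()
ok = asyncio.run(handle_request(engine, {"prompts": ["Hello"]}))
bad = asyncio.run(handle_request(engine, {}))
print(ok["status"], bad["status"])
```

In a real server, the handler would presumably be called from a web framework route and receive an `sgl.Engine` instead of the stand-in; the linked script is the authoritative example.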



## Nest Asyncio
Note: to use the **Offline Engine** in IPython, Jupyter, or any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
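
For context, the snippet below reproduces the underlying problem with the standard library alone: calling `asyncio.run()` while an event loop is already running raises `RuntimeError`. Notebook kernels typically execute cells inside a running loop, which is why nested calls need `nest_asyncio.apply()`. This only illustrates the failure mode and is not SGLang code; `inner` and `outer` are made-up names.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # At this point a loop is already running, much like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by plain asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```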

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio
import io
import os

from PIL import Image
import requests
import sglang as sgl

from sglang.srt.conversation import chat_templates
from sglang.test.test_utils import is_in_ci
from sglang.utils import async_stream_and_merge, stream_and_merge

if is_in_ci():
    import patch
else:
    import nest_asyncio

    nest_asyncio.apply()


llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```




### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Bill and I'm a software developer based in South Carolina. I've been coding since 2002, with my first project as a Web Developer at Toggl. My interest in technology has always been driven by the possibility of using it to make the world a better place, and I am passionate about working with developers to help build better software. If you have any questions, you can always reach out to me on Twitter or on LinkedIn.
Hello, my name is Bill and I'm a software developer based in South Carolina. I've been coding since 2002, with my first project as a Web Developer at
Prompt: The president of the United States is
Generated text:  a political office with a tight deadline approaching. The only way to win the next election is to become the 41st President.
One of the president’s daily tasks is to attend his own inauguration ceremony. His official chair is located next to the venue. The chair is decorated in red, white and blue. No matter the weather, it
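
For offline batch jobs it is often convenient to store each prompt next to its completion. The sketch below assumes only what the cell above shows, namely that each output is a dict with a `text` key; `to_jsonl` and the record field names are hypothetical, and the sample data stands in for a real `llm.generate` result.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its completion, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
jsonl = to_jsonl(example_prompts, example_outputs)
print(jsonl)
```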

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a brief description of your job or profession]. I enjoy [insert a short description of your hobbies or interests]. What's your favorite hobby or activity? I'm a [insert a short description of your favorite activity or hobby]. I'm always looking for new experiences and adventures, so I'm always eager to try new things. What's your favorite book or movie? I'm a [insert a short description of your favorite book or movie

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French National Museum of Modern Art. Paris is a bustling city with a rich cultural heritage and is a popular tourist destination. The city is known for its cuisine, fashion, and art, and is home to many famous museums, theaters, and landmarks. It is a major transportation hub, with many international airports and train stations. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. The city is also known

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ethical considerations. This includes issues such as bias, transparency, and accountability. AI developers will need to be more mindful of the potential impact of their work on society and the environment.

2. Greater integration with human decision-making: AI is likely to become more integrated with human decision-making in the future. This could lead to more complex and nuanced decision-making processes, as AI is
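
The phrase "overlap removal" refers to merging streamed chunks without repeating text that appears at chunk boundaries. The following is a minimal sketch of that idea, not necessarily how `stream_and_merge` is implemented; `trim_overlap` and `merge_chunks` are names chosen here for illustration.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the prefix that already ends `existing`."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, dropping text repeated at boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks(["The capital", "capital of France", " is Paris"])
print(merged)  # The capital of France is Paris
```

One limitation of this simple approach is that a legitimate repetition at a boundary (for example, a chunk that really does start with the character the text ends with) is also removed.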



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Character Name], and I'm a [Name of fictional organization or group] member. I have [number of years] years of experience working in [specific field of work] and [specific role] in [specific organization or company]. I'm passionate about [specific area of interest], and I'm committed to contributing to [specific goal or mission] through my work with [specific organization or group]. My goal is to [specific accomplishment or project]. I enjoy [specific hobby or activity] and am always looking for new challenges and opportunities to grow and learn. Thank you for considering me for a role in [specific organization or group].

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Love, a historic city with a rich cultural heritage. Its status as the world's fifth-largest city by populat
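
Because `async_generate` is awaited, several calls can in principle be in flight at once, for example when different prompt groups need different sampling parameters. The sketch below shows the `asyncio.gather` pattern with a hypothetical `StandInEngine`; it mirrors the call shape used above but says nothing about how the real engine schedules concurrent calls.

```python
import asyncio


class StandInEngine:
    """Hypothetical stand-in exposing the awaitable call shape used above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        temperature = sampling_params["temperature"]
        return [{"text": f" <T={temperature}: {p}>"} for p in prompts]


async def run_groups(engine, groups):
    """Await one async_generate call per (prompts, params) group concurrently."""
    tasks = [engine.async_generate(prompts, params) for prompts, params in groups]
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*tasks)


groups = [
    (["The capital of France is"], {"temperature": 0.2}),
    (["The future of AI is"], {"temperature": 0.9}),
]
results = asyncio.run(run_groups(StandInEngine(), groups))
for group_outputs in results:
    print(group_outputs[0]["text"])
```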

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [character]! Can you tell me a little bit more about yourself? What kind of work do you do, and where are you from? And what kind of experience do you have that would make you a great fit for this position? [Name] will be joining [Company Name] as a [Position Title] and is excited to be here. And let me know what you think! [Name] [Age] [Occupation] [City, State] [Email or Phone Number] [Company Name]

Hi, my name is [Name] and I'm a [character]! Can

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the largest city and the second most populous city in the European Union. It was founded as a Norman commune in the 11th century and has been the capital of France since 1804. It is the seat of the French government and home to many of France's major cultural and artistic institutions. Paris is also one of the world's most important cities for luxury, fashion, and art, and has been the venue of many major European and international events. It has a rich history and culture, and is known for its diverse and vibrant culture, cuisine, and fashion. The city has a unique blend

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  set to be shaped by several key trends and developments. Here are some potential future trends in the field of artificial intelligence:

1. Increased emphasis on ethical and responsible AI: As concerns about the potential impact of AI on society become more widely acknowledged, there will likely be an increased focus on ethical considerations and responsible design. This will likely involve designing AI systems that are transparent, accountable, and contribute positively to society.

2. Greater investment in AI research and development: With AI becoming a key driver of economic growth and innovation, there will likely be greater investment in research and development, as well as more collaboration between academia and industry.

3. AI
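
The streaming cell above prints each chunk as soon as it arrives. If the complete text is also needed afterwards, the chunks can be collected while they are printed. The sketch below shows that pattern with a hypothetical `fake_stream` async generator standing in for the real stream; `collect` is likewise a name chosen for this example.

```python
import asyncio


async def fake_stream(text, chunk_size=4):
    """Hypothetical stand-in for a streaming source; yields small pieces."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start:start + chunk_size]


async def collect(stream):
    """Print chunks as they arrive and return the full text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline once the stream is exhausted
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream(" Paris is the capital of France.")))
```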




```python
llm.shutdown()
```
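
In a script, an exception raised before `shutdown()` would skip the cleanup call. One way to guard against that is a small context manager built on `try`/`finally`. The sketch below uses a hypothetical `StandInEngine` so it runs without a model; `managed_engine` is also a made-up helper, not part of SGLang.

```python
from contextlib import contextmanager


class StandInEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " <completion>"} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine and always call shutdown(), even if the body raises."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(StandInEngine) as engine:
    outputs = engine.generate(["Hello, my name is"], {"temperature": 0.8})

print(engine.is_shut_down)  # True
```

With the real engine, the factory could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`, reusing the constructor call shown at the start of this section.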