# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
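The snippet below is a small standard-library illustration of the problem that `nest_asyncio` works around; it does not use SGLang. Calling `asyncio.run` from code that is already executing inside an event loop raises a `RuntimeError`. The coroutine names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in IPython.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed without nest_asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in place, nested calls of this kind are permitted, which is why the patch is needed in notebook-style environments.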

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Celine, a French artist based in Paris. My work aims to provoke the viewer to reflect on their own lives and the nature of existence. I am a visual artist who has exhibited in various galleries and galleries online. I am currently based in Paris, France. My recent body of work has been featured in "MoMA" Biennial, "Vivante" Galerie Prada, and "Départure" Galerie Rival.
My inspiration comes from the natural world and my fascination with the human experience. The natural world gives me raw materials and a strange perspective on the world. When I interact with these materials, they
Prompt: The president of the United States is
Generated text:  a person who holds the position of (A) President of the United States (B) Secretary of State (C) Secretary of the Treasury (D) Secretary of Defense. The president of the United States is a person who holds the position of A. President of the United States. 

The United States Constitution establishes the of
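For very large prompt lists, one option is to submit the prompts in fixed-size chunks and collect the results as you go. The sketch below shows that pattern under stated assumptions: `StandInEngine` is a hypothetical placeholder that only mimics the call shape used above (a list of prompts in, a list of dicts with a `"text"` key out, in the same order), and `generate_in_chunks` and `chunk_size` are illustrative names, not part of SGLang.

```python
from typing import Dict, List, Tuple


class StandInEngine:
    """Hypothetical stand-in that mimics the generate() call shape above."""

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        # One result dict per prompt, in the same order as the inputs.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def generate_in_chunks(
    engine, prompts: List[str], sampling_params: Dict, chunk_size: int = 2
) -> List[Tuple[str, str]]:
    """Submit prompts chunk by chunk and return (prompt, text) pairs."""
    results = []
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        results.extend(zip(chunk, (o["text"] for o in outputs)))
    return results


prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
pairs = generate_in_chunks(
    StandInEngine(), prompts, {"temperature": 0.8, "top_p": 0.95}
)
for prompt, text in pairs:
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```

Whether chunking helps depends on the workload; the engine already schedules the requests within a single `generate` call, so a single call is usually the simpler choice.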

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and festivals throughout the year. Paris is a popular tourist destination and a major hub for international business and diplomacy. It is also home to the French Parliament and the French National Assembly. The city is known for its rich history, art, and cuisine, and is a major center of politics, culture, and society in France. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into the city's vibrant

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI becomes more advanced, it is likely to be used in more complex and personalized ways, leading to even more accurate diagnoses and treatment plans.

2. Increased use of AI in finance: AI is already being used in finance to improve fraud detection and risk management. As AI becomes more advanced, it is likely to
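The "overlap removal" mentioned in the cell above can be pictured as follows: when a streamed chunk repeats text that has already been received, only the new suffix is appended. This is a minimal sketch of that idea, not the implementation behind `stream_and_merge`; the function names `strip_overlap` and `merge_stream` are hypothetical.

```python
from typing import Iterable


def strip_overlap(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that existing already ends with."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_stream(chunks: Iterable[str]) -> str:
    """Concatenate chunks, dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged = merge_stream(["The capital", "capital of France", " is Paris"])
print(merged)
```

The search starts from the longest possible overlap so that a repeated phrase is removed in full rather than only partially.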



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [First Name] [Last Name] and I am a [Age] year old [Gender] person. I have always been [insert unique personality trait here]. But I am currently pursuing a [insert current pursuit here]. I am always [insert unique personality trait here]. And I love [insert one or two favorite activities/ interests here]. I am [insert your favorite quote here]. I am a [insert age group here]. I am from [insert your hometown here]. And I have [insert number of pets here]. If you ever need any advice, I am always here for you. Now, what would you like to be

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

This statement succinctly captures the central location and significance of Paris as the capital city of the country. Paris is renowned for its extensive historical and cultural landmarks, including the Eiffel Tower, L
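One reason to use the asynchronous form is that several independent requests can be awaited together. The sketch below shows the pattern with `asyncio.gather`. `StandInAsyncEngine` is a hypothetical placeholder that only mimics the `async_generate` call shape shown above, and the sketch assumes the engine accepts concurrent calls; `run_groups` is an illustrative name.

```python
import asyncio
from typing import Dict, List


class StandInAsyncEngine:
    """Hypothetical stand-in that mimics the async_generate() call shape."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict
    ) -> List[Dict]:
        await asyncio.sleep(0.01)  # simulate time spent generating
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_groups(engine, groups: List[List[str]], sampling_params: Dict):
    # Start one request per group and wait for all of them together.
    tasks = [engine.async_generate(group, sampling_params) for group in groups]
    return await asyncio.gather(*tasks)


groups = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(
    run_groups(StandInAsyncEngine(), groups, {"temperature": 0.8, "top_p": 0.95})
)
for group, outputs in zip(groups, results):
    for prompt, output in zip(group, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list lines up with its prompt group.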

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm an [insert your profession/field of work] who specialize in [insert a key skill or passion of your profession]. I'm excited to dive in and learn more about your organization's current projects, and how we can help achieve our goals together. If you're looking for a team player or someone who thrives in challenging situations, I'm ready to step up and help make [insert a problem or challenge your team is facing]. I'm ready to start our journey together! How can I get to know you better? Let's schedule a call to discuss your capabilities and see if we can work together



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the northwestern region of France. It is the largest city in the country, and the seat of the state government and the seat of the French government. Paris is known for its rich history, art, and culture, and it is also a major transportation hub for Europe. It is home to numerous iconic landmarks, including the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is known for its fashion industry, and the city is home to many famous fashion houses and fashion shows. The city is also a major destination for tourists, with its beautiful scenery, vibrant culture, and delicious



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be both transformative and rapidly evolving. Here are some possible trends that could shape the AI landscape in the coming years:

1. Enhanced Predictive Analytics: AI will continue to become more accurate and predictive, allowing businesses to make better-informed decisions and forecast future trends. Machine learning algorithms will be better at understanding patterns in large datasets and making predictions based on those patterns.

2. Integration with Other Technologies: AI will continue to integrate with other technologies, such as robotics, drones, and autonomous vehicles. This will enable more complex and versatile AI systems that can perform tasks beyond what humans can do.

3. Personalization and Adaptability:
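The streaming loop above prints each chunk as it arrives. If you also want the full text afterwards, the chunks can be accumulated while they are printed. The sketch below shows that pattern with a plain async generator; `stand_in_stream` is a hypothetical placeholder for a real chunk stream and `collect_stream` is an illustrative helper name, neither of which is part of SGLang.

```python
import asyncio
from typing import AsyncIterator, List


async def stand_in_stream(chunks: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in that yields text chunks one at a time."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def collect_stream(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # final newline after the stream ends
    return "".join(parts)


full_text = asyncio.run(
    collect_stream(stand_in_stream([" Paris", ",", " located", " in", " France", "."]))
)
```

Collecting into a list and joining once at the end avoids repeated string concatenation for long generations.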




```python
llm.shutdown()
```
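In a script, it can be convenient to make sure `shutdown()` runs even when generation raises an exception. The sketch below wraps an engine in a small context manager for that purpose. `StandInEngine` is a hypothetical placeholder with the same `generate` and `shutdown` method names used in this document, and `managed_engine` is an illustrative helper, not an SGLang API.

```python
from contextlib import contextmanager


class StandInEngine:
    """Hypothetical stand-in exposing generate() and shutdown()."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(engine):
    """Yield the engine and always call shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StandInEngine()
try:
    with managed_engine(engine) as llm:
        llm.generate(["Hello, my name is"], {"temperature": 0.8})
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.is_shut_down)
```

The `finally` clause is what guarantees the cleanup: it runs whether the body of the `with` block finishes normally or raises.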