# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
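
As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in a handler built on Python's standard `http.server` module. It is not the approach of the linked example, which remains the reference. `handle_generate`, `make_handler`, and `EchoEngine` are hypothetical names, and the sketch assumes that calling `generate` with a single prompt returns a dict with a `text` key.

```python
import json
from http.server import BaseHTTPRequestHandler


def handle_generate(engine, body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(body)
    output = engine.generate(request["prompt"], request.get("sampling_params", {}))
    return json.dumps({"text": output["text"]}).encode("utf-8")


def make_handler(engine):
    """Build a request handler class bound to one engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = handle_generate(engine, self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return GenerateHandler


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(self, prompt, sampling_params=None):
        return {"text": f" (echo of: {prompt})"}


# The request logic can be exercised without opening a socket.
response = handle_generate(EchoEngine(), b'{"prompt": "Hello"}')
print(response.decode())
```

To actually serve requests, the handler class could be passed to `http.server.HTTPServer(("127.0.0.1", 8000), make_handler(engine)).serve_forever()`, with a real engine in place of the stand-in.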



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
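
To see what kind of failure this patch avoids, the standalone sketch below (standard library only, no SGLang involved) reproduces the error raised by a nested event loop: calling `asyncio.run` from code that is already running inside a loop raises `RuntimeError`. The function names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, as in IPython or Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by default
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches asyncio so that nested calls of this kind are permitted.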

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Dorian and I’m a college student from Australia who is a deep believer in using renewable energy solutions to reduce the carbon footprint of the Australian population. I believe that it is the best way to reduce the carbon emissions that are contributing to climate change. 

I was wondering if I could get some advice on how to actually get my school to start using renewable energy solutions? I would like to have a clean energy system in place by 2022, but I’m not sure where to start with this. I have tried looking at various options such as solar panels, wind turbines, and hydroelectric power, but I’m still confused
Prompt: The president of the United States is
Generated text:  a very important person. He or she is in charge of a country. The president works for the president of the United States. In America, the president works for the president of the United States. When a new president is chosen, the president of the United States becomes p
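
For offline batch jobs it is often useful to persist results rather than print them. The sketch below pairs each prompt with its output and writes one JSON object per line. It relies only on the `text` key used above. `save_results` and the sample data are hypothetical, and the hard-coded `sample_outputs` list stands in for the return value of `llm.generate` so the snippet runs without an engine.

```python
import json
import os
import tempfile


def save_results(prompts, outputs, path):
    """Write one {"prompt", "text"} JSON object per line and return the row count."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        for prompt, output in zip(prompts, outputs):
            record = {"prompt": prompt, "text": output["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(prompts)


# Stand-in data shaped like the list of dicts returned for a list of prompts.
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " Paris."}]

path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
written = save_results(sample_prompts, sample_outputs, path)
print(f"wrote {written} record(s) to {path}")
```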

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [job title] at [company name]. I have been working in this role for [number of years] years. I am a [occupation] with [number of years] years of experience in this field. I am a [occupation] with [number of years] years of experience in this field. I am a [occupation] with [number of years] years of experience in this field. I am a [occupation] with [number of years] years of experience in this field. I am a [occupation] with [number of years] years of experience in this field. I am

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, a historic and cultural center with a rich history dating back to the Middle Ages. It is the largest city in France and the second-largest city in the European Union, with a population of over 2. 5 million people. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Palace of Versailles. It is also home to many world-renowned museums, theaters, and restaurants, making it a popular destination for tourists and locals alike. Paris is a vibrant and diverse city with a rich cultural heritage and a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased automation: As AI continues to become more advanced, it is likely that we will see an increase in automation in various industries. This could lead to the creation of new jobs, but it could also lead to job displacement for some workers.

2. Improved privacy and security: As AI becomes more advanced, there will be an increased need for privacy and security measures to protect the data that is generated and used by AI systems. This could



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [short biography or introduction of the fictional character]. I enjoy [describe your favorite hobby or interest]. If you have any questions about me or anything I'm doing, please let me know! 

You might not notice me at first, but my next step in life is to become a [describe your future career or goal]. I'm currently [describe your current job or status]. What can you tell me about yourself? 
I really appreciate your time with me. I'm looking forward to hearing from you. 
[Name] 
Date
What do you like to eat? 
What do you like to

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the largest city in the country and the seat of the French government and its cultural, political, and economic center. The city is known for its rich history, picturesque architecture, and vibrant cultural 
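
Because `async_generate` is awaited, several independent batches can be awaited concurrently with `asyncio.gather`. The sketch below shows the pattern with a hypothetical `FakeEngine` whose `async_generate` only imitates the call shape used above (a list of prompts in, a list of dicts with a `text` key out), so it runs without a model. Whether concurrent calls improve throughput on a real engine depends on its scheduler and is not something this sketch demonstrates.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for sgl.Engine; returns canned text after a short delay."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Create one async_generate call per batch and wait for all of them together.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(run_batches(FakeEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```

`asyncio.gather` returns results in the order the calls were passed in, so each result list lines up with its batch.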

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  ____. My name is ____. I'm a/an ____. I'm a/an ____. My name is ____. I'm a/an ____. I'm a/an ____. I'm a/an ____. My name is ____. I'm a/an ____. I'm a/an ____. My name is ____. I'm a/an ____. I'm a/an ____. My name is ____. I'm a/an ____. My name is ____. I'm a/an ____. My name is ____. I'm a/an ____. My name is ____. I'm a/an ____. My name is ____. I'm a



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the largest city in France and the seat of government for the country. It is home to the Eiffel Tower, the Louvre Museum, and many other famous landmarks and attractions. It is also known as "the city of love" due to its historical romantic attraction. Paris is a major tourist destination and a cultural hub, hosting important cultural events and hosting the World Cup.



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  poised to be a dynamic and rapidly changing landscape. Some potential trends that are expected to shape the field include:

1. Advancements in neural networks: Neural networks are becoming increasingly powerful and capable of performing complex tasks such as image and speech recognition. As researchers continue to improve neural network architectures and algorithms, we can expect to see even more advanced AI systems emerge.

2. Integration with other technologies: The integration of AI with other technologies such as blockchain, IoT, and quantum computing is expected to expand the scope of AI's applications and capabilities. This could result in new opportunities for innovation and development in fields such as supply chain management, fraud detection




```python
llm.shutdown()
```
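
If generation raises an exception, a trailing `shutdown()` call is never reached. One way to guard against that is a small context manager that always shuts the engine down. This is a sketch: `engine_session` and `DummyEngine` are hypothetical names, and the dummy class only stands in for `sgl.Engine` so the snippet runs without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(factory):
    """Create an engine with factory() and call shutdown() on exit, even on error."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


class DummyEngine:
    """Hypothetical stand-in for sgl.Engine."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params=None):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.closed = True


with engine_session(DummyEngine) as engine:
    outputs = engine.generate(["Hello, my name is"], {"temperature": 0.8})

print(engine.closed)  # True
```

With a real engine, the factory could be something like `lambda: sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")`.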