# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, Jupyter, or other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
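
The underlying issue is that `asyncio.run()` refuses to start when an event loop is already running in the same thread, which is the situation inside a notebook cell. The sketch below reproduces that failure with the standard library only, without SGLang; `inner` and `outer` are illustrative names and not part of any API.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # outer() already runs inside an event loop, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second, nested loop is rejected
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that nested calls of this kind are allowed, which is why it should be applied before the engine is used in such an environment.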

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  **Alex Jones** and I'm a freelance IT and cybersecurity consultant based in the UK. As a business owner, my work is to help businesses find and secure cyber security solutions to protect their digital assets.

I offer a range of cybersecurity services, from data backup and recovery, to malware protection, phishing detection and protection, web application security testing and development, and more. I specialize in addressing the unique challenges and requirements of businesses of all sizes, from small startups to large corporations, and from businesses of any industry.

I'm passionate about staying up-to-date with the latest cybersecurity threats and technologies, and I'm always looking for ways to improve
Prompt: The president of the United States is
Generated text:  a person. True or False? To determine whether the statement "The president of the United States is a person" is true or false, let's analyze it step by step.

1. **Definition of 
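
In an offline batch job, the results are usually written to disk rather than printed. The following is a minimal sketch, assuming only what the cell above shows: each element of `outputs` is a dict with a `text` key. The helper name `to_jsonl` and the sample data are hypothetical.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Placeholder data standing in for the return value of llm.generate(...).
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]

jsonl = to_jsonl(example_prompts, example_outputs)
print(jsonl)
```

With a live engine, the same helper could be called as `to_jsonl(prompts, outputs)` and the returned string written to a file.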

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [job title] and I'm always looking for ways to [job title] and improve my skills. I'm a [job title] who is always looking for ways to [job title] and improve my skills. I'm a [job title] who is always looking for ways to [job title] and improve my skills. I'm a [job title] who is always looking for ways to [job title] and improve my

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, a historic city with a rich cultural heritage and a vibrant nightlife. It is located on the Seine River and is the largest city in France by population. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral, as well as its diverse cuisine and fashion scene. The city is also home to many world-renowned museums, including the Louvre and the Musée d'Orsay. Paris is a popular tourist destination and a major economic and cultural center in France. It is also known for its annual Eiffel Tower

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased integration of AI into everyday life: As AI becomes more integrated into our daily lives, we may see more widespread adoption of AI-powered technologies such as voice assistants, self-driving cars, and virtual assistants. This could lead to a more seamless and efficient use of technology, as well as a reduction in the need for human intervention.

2. Greater emphasis on ethical and responsible AI: As AI becomes more integrated into our daily lives, there
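
The cell above describes its streaming as using overlap removal. As an illustration of what such a merge can look like, here is a minimal sketch; it is not the implementation behind `stream_and_merge`, and the names `strip_overlap` and `merge_chunks` are hypothetical. It assumes that a new chunk may begin by repeating the tail of the text received so far.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Return chunk without the longest prefix that repeats the end of existing."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks):
    """Concatenate chunks, dropping any text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


print(merge_chunks(["Hello, my", " my name", "name is"]))
```

Checking the longest candidate overlap first means a chunk that repeats several words is trimmed in one step instead of leaving a partial duplicate.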



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I'm a [Level] level wizard with a [Weapon] equipped on my hands, and I'm passionate about [Purpose]. Let's chat! 🧠✨✨

---

**[Name]**

Hello, my name is [Name]. I'm a [Level] level wizard with a [Weapon] equipped on my hands, and I'm passionate about [Purpose]. Let's chat! 🧠✨✨

---

**[Name]** is a level 10 wizard with a Staff of Divination on his hands. He is deeply interested in the art of divination and the ways of the magic that

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city where the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral are located. It is the largest and most populous city in the European Union. Despite being the capital, Paris is not the largest city in France, as Lyon is the second-largest city. Paris is also home to numerous museums and
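
The asynchronous interface also makes it possible to issue several requests concurrently from the same event loop. The sketch below shows the `asyncio.gather` pattern with a stand-in object so that it runs without a GPU. `FakeEngine` and `generate_concurrently` are hypothetical, and the sketch assumes that a single-prompt `async_generate` call returns one dict with a `text` key.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in; a real run would use the sgl.Engine instance."""

    async def async_generate(self, prompt, sampling_params):
        await asyncio.sleep(0.01)  # simulate waiting on the model
        return {"text": f" [completion for: {prompt}]"}


async def generate_concurrently(engine, prompts, sampling_params):
    # One coroutine per prompt; gather() returns results in input order.
    calls = [engine.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*calls)


demo_prompts = ["The capital of France is", "The future of AI is"]
demo_outputs = asyncio.run(
    generate_concurrently(FakeEngine(), demo_prompts, {"temperature": 0.8})
)
for prompt, output in zip(demo_prompts, demo_outputs):
    print(prompt, "->", output["text"])
```

Whether separate calls are preferable to passing the whole list in one call, as the cell above does, depends on the workload; separate calls are mainly useful when requests arrive at different times.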

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Jane. I'm an avid reader and an avid writer, and I enjoy exploring new cultures and learning new things. I'm not afraid to take risks and try new things, and I thrive on creativity and innovation. I'm an extrovert who enjoys meeting new people and trying new things. If you're interested, let me know! (Note: There are no specific details about my personality, interests, or achievements given in the prompt.) Please feel free to include any other details about myself that you think might be interesting or helpful for potential clients. (Optional) Jane, I'm in the marketing department at a top-tier advertising agency



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the historic and cultural heart of the country. Paris boasts over 700 museums, including the Louvre Museum, the Musée d'Orsay, the Musée d'Orangerie, and the Musée national de la Décoration. It's also home to the Eiffel Tower, the Eiffel Sainte-Chapelle, and the Louvre Gardens. Paris is a bustling city with a vibrant culinary scene and a rich history, making it an excellent destination for tourists and locals alike. The city is known for its art, culture, and architecture. It's also famous for its annual Le Se



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  rapidly evolving, with many potential trends and areas of focus. Here are some possibilities:

1. Increased automation: As AI continues to improve, it is likely to become more capable of performing tasks that were previously done by humans, such as manufacturing, transportation, and customer service. This could lead to the widespread automation of jobs, freeing up workers for more creative and high-value tasks.

2. Ethical and moral AI: As AI becomes more advanced, there may be ethical and moral concerns that need to be addressed. For example, AI systems may become biased or perpetuate existing inequalities if not designed with fairness in mind.

3. AI for
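
The streaming cell above prints each chunk as soon as it arrives. When the full text is needed afterwards, the chunks can be collected while iterating. The following sketch shows that pattern with a plain async generator; `fake_stream` and `collect` are hypothetical stand-ins and do not call SGLang.

```python
import asyncio


async def fake_stream(chunks):
    """Hypothetical stand-in for a streaming source that yields text chunks."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def collect(stream):
    """Consume an async iterator of text chunks and return the joined text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


text = asyncio.run(collect(fake_stream([" Paris", ",", " the", " capital"])))
print(repr(text))
```

Appending to a list and joining once at the end avoids repeated string concatenation for long generations.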




In [6]:
llm.shutdown()
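
In a script, as opposed to a notebook, it can help to guarantee that `shutdown()` runs even if generation raises. A minimal sketch under that assumption follows; the `engine_session` helper and the `FakeEngine` class are hypothetical and only rely on the engine exposing a `shutdown()` method, as shown above.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine):
    """Yield the engine and always call its shutdown() on exit."""
    try:
        yield engine
    finally:
        engine.shutdown()


class FakeEngine:
    """Hypothetical stand-in used to demonstrate the cleanup path."""

    def __init__(self):
        self.closed = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure")

    def shutdown(self):
        self.closed = True


fake = FakeEngine()
try:
    with engine_session(fake) as engine:
        engine.generate(["Hello, my name is"], {"temperature": 0.8})
except RuntimeError:
    pass  # the failure is expected in this demonstration

print(fake.closed)
```

With a real engine, the object passed to `engine_session` would be the `sgl.Engine` instance created at the start of this document.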