# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an HTTP server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
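
To illustrate why the patch is needed, here is a minimal standalone sketch that uses only the standard library and no SGLang. The names `inner` and `outer` are illustrative. It reproduces the error that plain `asyncio` raises when `asyncio.run` is called from code that is already running inside an event loop, which is the situation in IPython or Jupyter.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # At this point an event loop is already running, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: plain asyncio rejects it
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "nested run succeeded"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` beforehand patches `asyncio` so that such nested calls are permitted.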

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  _____. How do you say "Hello, my name is ___. " in Spanish?
Answer:

In Spanish, "Hello, my name is _____. " can be translated as:

Hola, mi nombre es ____.

Therefore, the correct answer is:

A) Hola, mi nombre es ____.

Explanation: 
- "Hola" is the Spanish word for "Hello"
- "mi" means "my" in Spanish
- "nombre" means "name" in Spanish
- "es" means "is" in Spanish

Putting it all together, the sentence structure is the same in both languages, just with some
Prompt: The president of the United States is
Generated text:  married to Nancy and they have two children, Sarah and Matthew. If the president is a bachelor, and the president has two children, how many children does the president have?

To determine how many children the president of the United States has, let's break down the information given:

1. The president is married to Nancy.
2. The president has two children.
3. The president is a bachelor.

Since the president is a bachelor, 
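
For very large prompt lists, it can be convenient to submit prompts in smaller groups and collect results incrementally, for example to save progress between groups. The sketch below assumes only what the cell above shows: `generate(prompts, sampling_params)` returns one dict per prompt with a `"text"` key. `chunked`, `generate_in_chunks`, and `EchoEngine` are hypothetical helpers, and `EchoEngine` is a stand-in so the sketch runs without a GPU. The engine schedules a batch on its own, so chunking is optional.

```python
from typing import Any, Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


def generate_in_chunks(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, float],
    chunk_size: int = 2,
) -> List[Dict[str, str]]:
    """Call engine.generate once per chunk and pair each prompt with its text."""
    results = []
    for batch in chunked(prompts, chunk_size):
        outputs = engine.generate(batch, sampling_params)
        for prompt, output in zip(batch, outputs):
            results.append({"prompt": prompt, "text": output["text"]})
    return results


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine; it does not run a model."""

    def generate(self, prompts: List[str], sampling_params: Dict[str, float]):
        return [{"text": f" <completion for: {p}>"} for p in prompts]


results = generate_in_chunks(
    EchoEngine(),
    ["Hello, my name is", "The capital of France is", "The future of AI is"],
    {"temperature": 0.8, "top_p": 0.95},
    chunk_size=2,
)
for record in results:
    print(record)
```

To use it with a real engine, pass the `llm` object from the first cell instead of `EchoEngine()`.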

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm excited to meet you and learn more about you. What

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is also home to many notable artists, writers, and musicians. It is known for its rich history, diverse culture, and beautiful architecture. Paris is a city of contrasts, with its modern and historic elements blending together to create a unique and fascinating place. The city is also home to many international organizations and institutions, including

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI is likely to become more prevalent in manufacturing, transportation, and other industries, where it can perform tasks that are currently done by humans. This could lead to the development of more efficient and cost-effective systems.

2. Improved privacy and security: As AI systems become more sophisticated, there will be a need to ensure that they are used responsibly and ethically. This will require improvements in privacy and security measures, as well as ongoing monitoring and regulation of AI systems.

3. Greater integration with human decision-making: AI systems are likely to become more integrated with human decision-making
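
The "overlap removal" mentioned in the cell above can be pictured with a small standalone sketch. It assumes that streamed chunks may repeat the tail of the text received so far, and it trims the longest such repeated prefix before appending. `trim_overlap` and `merge_chunks` are illustrative names; this is one plausible approach and not necessarily how `stream_and_merge` is implemented.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that duplicates the end of existing."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, removing text that overlaps with what was already merged."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = ["The capital", " capital of France", "France is Paris."]
merged = merge_chunks(chunks)
print(merged)
```

This is a heuristic: a chunk that legitimately repeats the preceding text (for example `"ha"` followed by `"ha"`) would also be trimmed.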



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [Your Profession/Role] [Your Profession/Role], dedicated to [Your Profession/Role] with [Your Achievements/Experience]. I am passionate about [Your Passion], and my work has made a significant impact in [Your Field of Interest]. If you're curious about what I've achieved in my career, or what excites me, I'm always here to share my journey and insights.

Remember, my goal is to help others achieve their dreams and to help create a better world for all. And whenever you need advice, encouragement, or just a friendly chat, [Your Name]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a historic city with a rich history and diverse culture. The city was founded in 787 by the Vikings and has been the capital of France since 800 AD. It is home to many famous landmarks, such as the Eiffe
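
One reason to use the asynchronous API is to run several independent requests concurrently from the same event loop. The sketch below shows the pattern with `asyncio.gather`. It assumes that the engine accepts concurrent `async_generate(prompts, sampling_params)` calls and returns one dict per prompt with a `"text"` key, as in the cell above. `FakeAsyncEngine` and `run_batches` are hypothetical names, and `FakeAsyncEngine` stands in for the real engine so the sketch runs without a GPU.

```python
import asyncio
from typing import Dict, List


class FakeAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; it does not run a model."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict[str, float]
    ) -> List[Dict[str, str]]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    """Start one async_generate call per batch and wait for all of them."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    # gather returns results in the same order as the tasks were passed in
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
all_outputs = asyncio.run(
    run_batches(FakeAsyncEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, all_outputs):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```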

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Age] year old, [Occupation] [Job Title]! I have always been fascinated by the world of [Industry/Field] and I have been learning more and more every day. I am eager to see how this world progresses and I am always open to new experiences, new ideas, and new ways of thinking. What brings you to this field? I'm curious to hear about your background and experiences. What brings you to this field? I'm interested to know more about the world and its challenges. What do you see as the future of [Industry/Field]? I am excited

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

To expand on this information, can you provide a more detailed sentence that includes the correct spelling of the city's name? Paris is the capital of France.

Additionally, can you provide a sentence that uses the word "France" while also including a reference to a famous landmark in the city? The Louvre is one of the most famous landmarks in Paris.

Lastly, can you suggest a sentence that incorporates the word "France" and references a specific cultural institution or art piece located in the city? The Notre Dame Cathedral is a UNESCO World Heritage site in Paris, France.

Please provide the sentences in the requested format

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  undeniably exciting and rapidly evolving. Here are some possible future trends in AI:

1. Augmented and Virtual Reality: With the increasing adoption of virtual and augmented reality technologies, we can expect to see AI-driven applications that enhance our reality, such as speech recognition, image processing, and autonomous driving.

2. Deep Learning and Neural Networks: Deep learning and neural networks are the key technologies driving the future of AI. We can expect to see more efficient and accurate algorithms that can learn from vast amounts of data, enabling faster and more accurate AI-driven applications.

3. Explainable AI: With the increasing amount of data and its complexities,

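
When the streamed text is needed as a single string rather than printed chunk by chunk, the chunks from an async iterator can be accumulated. The sketch below is standalone. `fake_stream` is a hypothetical stand-in for an async chunk source such as `async_stream_and_merge(llm, prompt, sampling_params)`, and `collect` is an illustrative helper name.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(pieces: List[str]) -> AsyncIterator[str]:
    """Hypothetical chunk source; yields the given pieces one at a time."""
    for piece in pieces:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def collect(stream: AsyncIterator[str], echo: bool = False) -> str:
    """Consume an async stream of text chunks and return the joined text."""
    parts = []
    async for chunk in stream:
        if echo:
            print(chunk, end="", flush=True)
        parts.append(chunk)
    return "".join(parts)


full_text = asyncio.run(collect(fake_stream([" Paris", ".", " It", " is"])))
print(repr(full_text))
```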
In [6]:
llm.shutdown()