# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would only add complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
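The usual reason nested-loop environments need a patch like this is that Python's `asyncio` refuses to start an event loop from inside one that is already running. The sketch below reproduces that restriction with the standard library only. It does not use SGLang, and the function names `inner` and `outer` are purely illustrative.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    """Try to start a second event loop from inside a running one."""
    coro = inner()
    try:
        # asyncio.run() refuses to start while another loop is running.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Notebook kernels such as IPython already run an event loop, so code that tries to start its own loop can hit this kind of error unless `nest_asyncio.apply()` has been called.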

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# One call handles the whole batch; outputs come back in prompt order.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Joseph. I'm a 25-year-old high school student who works in the tech industry. I'm here to ask you questions about your life and what you think about life in general. I hope we have a great time together and can discuss topics we are interested in. How do you feel about life in general?

I feel... (describe how you feel about life in general) Joseph

I feel that life is mostly about working and getting paid to do work. I feel like a financial success. I enjoy being productive and productive work that doesn't require much of my time.

What do you think life is like? Do you
Prompt: The president of the United States is
Generated text:  a young man. He is a man of considerable influence. But he is also a man of many faults. He has not fulfilled his duties as a man of duty and responsibility. He has not lived up to his qualifications as a man. He has not acted with honor and integrity. He has not been a role model for his countrymen.

A President's
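
For downstream processing it is often convenient to pair each prompt with its completion in a plain record. The sketch below shows one way to do that. Because the real engine needs a GPU and model weights, it uses a hypothetical `StubEngine` that only mimics the call shape shown above: `generate(prompts, sampling_params)` returns one dict per prompt with a `"text"` key. The helper name `collect_results` is also an assumption, not part of SGLang.

```python
from typing import Dict, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; only mimics the return shape."""

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def collect_results(engine, prompts: List[str], sampling_params: Dict) -> List[Dict]:
    """Pair each prompt with its generated text, preserving input order."""
    outputs = engine.generate(prompts, sampling_params)
    if len(outputs) != len(prompts):
        raise ValueError("expected one output per prompt")
    return [
        {"prompt": p, "completion": o["text"]} for p, o in zip(prompts, outputs)
    ]


records = collect_results(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for record in records:
    print(record)
```

With the real engine, passing `llm` in place of `StubEngine()` should work the same way, provided the output shape matches the example above.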

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

# stream_and_merge streams one prompt at a time and returns the merged text.
for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short description of your profession or role]. I enjoy [insert a short description of your hobbies or interests]. What's your favorite hobby or activity? I love [insert a short description of your favorite activity]. What's your favorite book or movie? I love [insert a short description of your favorite book or movie]. What's your favorite place to go? I love [insert a short description of your favorite place]. What's your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is known for its rich history, art, and cuisine, and is a major center of politics and politics. It is also home to many notable French artists, writers, and musicians. Paris is a vibrant and dynamic city that continues to be a major cultural and economic center of France. 

The city is also

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to improve, we can expect to see even more widespread use of AI in healthcare, with the goal of improving patient care and reducing healthcare costs.

2. AI in finance: AI is already being used in finance to improve fraud detection and risk management. As AI technology continues to improve, we can expect to see even more widespread use of AI
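
To illustrate what "overlap removal" can mean when merging streamed chunks, here is a minimal sketch in plain Python. It is not necessarily how `stream_and_merge` is implemented. The names `strip_overlap` and `merge_chunks` are hypothetical, and the chunk list is made-up sample data in which each chunk repeats the tail of the text received so far.

```python
from typing import Iterable


def strip_overlap(existing: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that already ends `existing`."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, dropping text that repeats the end of the result."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


chunks = ["The capital", " capital of", " of France", " is Paris."]
merged = merge_chunks(chunks)
print(merged)  # The capital of France is Paris.
```

One limitation of this simple approach is that it cannot tell an overlap from a genuine repetition: a chunk that legitimately repeats the preceding text would be trimmed as well.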



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; outputs come back in prompt order.
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a [Type of Character] who specializes in [Your Job or Profession]. I have always been passionate about learning and growing, and I believe that with hard work and dedication, anything is possible. If you ever need any assistance or guidance, please don't hesitate to ask. [Your Name] [Your Job or Profession] [Your Address] [Your Contact Information] [Your Twitter Handle] [Your Facebook Page] [Your Instagram Handle] [Your LinkedIn Profile] [Your Website URL] [Your Project], if applicable [Your Other Contributions] I am a [Type of Character] who specializes in

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city located on the Seine river in the heart of the country. 

This statement is accurate, providing both the country and the city name. It also includes the name of a famous 
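
One reason to prefer the asynchronous API is that several requests can be awaited together. The sketch below shows that pattern with `asyncio.gather`. It uses a hypothetical `StubAsyncEngine` that only mimics the `async_generate(prompts, sampling_params)` call shape from the example above, and the helper `run_groups` is likewise an assumption. Whether issuing concurrent calls suits the real engine and your workload is something to verify separately.

```python
import asyncio
from typing import Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; only mimics async_generate's shape."""

    async def async_generate(
        self, prompts: List[str], sampling_params: Dict
    ) -> List[Dict]:
        await asyncio.sleep(0)  # yield control, as real awaited work would
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_groups(
    engine, groups: List[List[str]], sampling_params: Dict
) -> List[List[str]]:
    """Submit several prompt groups concurrently and keep the group order."""
    results = await asyncio.gather(
        *(engine.async_generate(group, sampling_params) for group in groups)
    )
    return [[out["text"] for out in outputs] for outputs in results]


groups = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
texts = asyncio.run(run_groups(StubAsyncEngine(), groups, {"temperature": 0.8}))
print(texts)
```

`asyncio.gather` returns results in the order the awaitables were passed, so each inner list lines up with its prompt group.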

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm an [Position/Position Title] with [Company Name], and I've been with the company for [Number of Years] years now. I'm passionate about [Company/Position] because of [Company/Position's significance or impact on the world]. I enjoy [Company/Position's responsibilities and challenges]. I'm a [Industry Expert/Leadership Expert] and [Company/Position is a leader]. My favorite thing about my job is [Company/Position's unique aspect]. I'm always looking for opportunities to [Company/Position's personal growth] and [Company/Position's professional development]. [



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Rose" due to the large number of roses that decorate the city's facade and street decorations. The city's official language is French and is the third largest city in France. The French Revolution, considered the first modern day French revolution, took place in Paris. Located at the foot of the Seine River, it is the third most populous city in the world and the most visited city in France. Paris is known for its museums, fashion, art, and cuisine. The city has a rich and diverse history and culture, and is a hub of intellectual and artistic activity in the world. It is also



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  set to be shaped by a variety of trends, each of which could potentially revolutionize the way we live, work, and interact with the world around us. Here are some potential future trends that could shape the future of AI:

1. Improved Facial Recognition and Biometric Security: As AI technology continues to improve, we may see an increase in the use of facial recognition and biometric security technologies in our daily lives. This could include things like facial recognition at airports, security checkpoints, and even in everyday activities like unlocking a door or entering a room.

2. Autonomous and Robotic Vehicles: The future of AI is likely to see the
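
When streaming asynchronously, you often want to display chunks as they arrive and still keep the full text afterwards. The sketch below shows that consumption pattern with `async for`. The async generator `stub_stream` is a hypothetical stand-in for `async_stream_and_merge` that yields a fixed list of made-up chunks, so the example runs without a model.

```python
import asyncio
from typing import AsyncIterator, List


async def stub_stream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for async_stream_and_merge; yields text chunks."""
    for chunk in [" Paris", ",", " the", " capital", "."]:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield chunk


async def stream_to_string(prompt: str) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts: List[str] = []
    async for chunk in stub_stream(prompt):
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = asyncio.run(stream_to_string("The capital of France is"))
```

Collecting chunks in a list and joining once at the end avoids repeated string concatenation on long generations.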




```python
# Shut down the engine when you are done with it.
llm.shutdown()
```
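
In a script, it can help to make sure `shutdown()` runs even when generation raises. A minimal sketch of that pattern with `try`/`finally` follows. The `StubEngine` class and its `is_shut_down` flag are hypothetical and exist only so the example runs without a GPU. Only the `generate` and `shutdown` method names come from the examples above.

```python
from typing import Dict, List, Optional


class StubEngine:
    """Hypothetical stand-in for sgl.Engine that fails during generation."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        raise RuntimeError("simulated failure")

    def shutdown(self) -> None:
        self.is_shut_down = True


def run_job(engine, prompts: List[str], sampling_params: Dict) -> List[Dict]:
    """Run one batch and always shut the engine down afterwards."""
    try:
        return engine.generate(prompts, sampling_params)
    finally:
        engine.shutdown()


engine = StubEngine()
error_message: Optional[str] = None
try:
    run_job(engine, ["Hello, my name is"], {"temperature": 0.8})
except RuntimeError as exc:
    error_message = str(exc)

print(error_message, engine.is_shut_down)
```

The error still propagates to the caller, but the cleanup in `finally` has already run by the time it is handled.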