# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
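
As a rough illustration of the custom-server idea, the sketch below wraps an engine's `generate` call in a request handler built only on the standard library. It is not the linked `custom_server.py`, which may use a different framework. `EchoEngine`, `handle_generate`, `make_handler`, and the JSON field names `prompts`, `sampling_params`, and `texts` are all hypothetical names chosen for this example; `EchoEngine` only mimics the `generate(prompts, sampling_params)` call shape used later on this page so that the snippet runs without a GPU.

```python
import json
from http.server import BaseHTTPRequestHandler


class EchoEngine:
    """Hypothetical stand-in with the same generate() call shape as the engine."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def handle_generate(engine, body: bytes):
    """Turn a JSON request body into (status_code, JSON response bytes)."""
    try:
        payload = json.loads(body)
        prompts = payload["prompts"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a 'prompts' list"}
        return 400, json.dumps(error).encode()
    outputs = engine.generate(prompts, payload.get("sampling_params", {}))
    return 200, json.dumps({"texts": [o["text"] for o in outputs]}).encode()


def make_handler(engine):
    """Build a request handler class bound to a single engine instance."""

    class GenerateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, data = handle_generate(engine, self.rfile.read(length))
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    return GenerateHandler


# Exercise the request logic directly, without opening a socket.
status, data = handle_generate(
    EchoEngine(), b'{"prompts": ["The capital of France is"]}'
)
print(status, json.loads(data))
```

To actually serve requests, the handler class could be passed to `http.server.HTTPServer(("127.0.0.1", 8000), make_handler(engine))` followed by `serve_forever()`, with the stand-in replaced by a real engine instance. Keeping the request logic in a plain function makes it testable without a network.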



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
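
The underlying problem can be reproduced with the standard library alone. The sketch below is an illustration rather than SGLang code: it calls `asyncio.run()` from inside a coroutine, which is the situation in IPython or Jupyter where a loop is already running. That the engine drives an event loop internally is an assumption inferred from the note above; the function names `inner` and `outer` are hypothetical.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, just as it is in IPython or Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # starting a second loop from inside the first
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are allowed, which is why it has to run before the engine is used in those environments.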

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Tom and I am 17 years old. I'm from Poland. I'm the president of the English club at school. I'm a little bit shy, but I want to be a leader. I like to talk about what I think. I like to share my ideas. I think I should learn more about what I want to be when I grow up. I want to go to a famous university to study English and go to Paris. I want to be a great language teacher. I have a strong interest in all kinds of things and I like to share my ideas. I will make a presentation to the whole school when I
Prompt: The president of the United States is
Generated text:  50 years younger than his first president. If his first president was 70 years old when he became the president, and the two presidents worked together from 1901 to 1929, what is the age of both presidents together? To determine the total age of both presidents from 1901 to 1929, we need to follow these steps:

1. Determine the age of the president from 1901 to 1929.
2. Add the a
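
The `temperature` and `top_p` values passed above control how each next token is sampled. The following is a minimal, generic sketch of temperature scaling followed by top-p (nucleus) filtering, written in plain Python. It is not SGLang's implementation; the function name `sample_token` and the toy logits are hypothetical, and the sketch assumes `temperature > 0`.

```python
import math
import random


def sample_token(logits, temperature=0.8, top_p=0.95, rng=None):
    """Sample one token id from raw logits using temperature and top-p."""
    rng = rng or random.Random()
    # Temperature rescales the logits: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for idx in ranked:
        kept.append(idx)
        mass += probs[idx]
        if mass >= top_p:
            break
    # Sample among the kept tokens; choices() renormalizes the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [4.0, 3.5, 1.0, -2.0]  # toy scores for a four-token vocabulary
rng = random.Random(0)
samples = [sample_token(logits, rng=rng) for _ in range(200)]
print(sorted(set(samples)))
```

With these toy logits, the two most likely tokens already cover more than 95% of the probability mass, so the two unlikely tokens are never sampled. A lower `temperature` or `top_p` narrows the choice further, which is why the streaming example in the next section uses more conservative values.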

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a brief description of your job or experience]. I'm always looking for new challenges and opportunities to grow and learn. What do you do for a living? I'm always looking for new ways to improve myself and make a positive impact on the world. What do you enjoy doing? I enjoy learning new things, exploring new cultures, and trying new foods. What's your favorite hobby? I love to read and travel. What's your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. 

The statement is concise and accurately describes the capital city of France. It provides the name of the city and its capital, which are both key facts about the city. The statement is clear and easy to understand, making it suitable for a brief introduction to the topic. It also includes the French name for the capital, which is Paris, which is a common and widely known fact about the city. Overall, the statement is a good representation of the factual information about the capital city of France. 

The statement is appropriate for use in a variety of contexts, such as a news article, a travel guide, or a brief introduction

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ethical considerations. This will include issues such as bias, transparency, accountability, and the impact of AI on society as a whole.

2. Greater integration with human decision-making: AI is likely to become more integrated with human decision-making in the future, as AI systems are expected to be able to make decisions based
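
The "overlap removal" mentioned in the example refers to merging streamed chunks that may repeat text already received. The sketch below shows one simple way such merging could work. It is an illustration of the idea only and makes no claim about how `stream_and_merge` is implemented; `merge_chunk` and `merge_stream` are hypothetical names.

```python
def merge_chunk(merged: str, chunk: str) -> str:
    """Append chunk to merged, skipping any prefix of chunk that merged already ends with."""
    # Try the longest possible overlap first, then shrink.
    for size in range(min(len(merged), len(chunk)), 0, -1):
        if merged.endswith(chunk[:size]):
            return merged + chunk[size:]
    return merged + chunk


def merge_stream(chunks):
    """Fold an iterable of possibly overlapping chunks into one string."""
    merged = ""
    for chunk in chunks:
        merged = merge_chunk(merged, chunk)
    return merged


chunks = ["Hello", "lo wor", "world!"]
result = merge_stream(chunks)
print(result)  # → Hello world!
```

One trade-off of this approach is that a chunk which legitimately repeats the end of the previous text (for example `"ha"` followed by `"ha"`) is treated as overlap and dropped, so it suits streams where repeated prefixes are known to be duplicates.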



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Age] year old, [Gender] who enjoys [Your field of expertise or hobby]. I'm always looking for ways to [something specific you're passionate about or a challenge you're trying to overcome], and I love to [something you enjoy doing]. I've always been [something that's been a big part of my identity, like your interests, hobbies, or other traits]. What's your name, and how do you go about interacting with others? Hi there! I'm [Name], a [Gender] who enjoys [your field of expertise or hobby]. I'm always looking for ways

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris, the city of love and art, is known as the "City of Love" and is the largest city in France. The city is home to many famous landmarks such as the Eiffel Tower, Notre Dame Cathedral, the Louvre Museum, and the Palace 
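
The main benefit of the asynchronous interface is that generation can be awaited alongside other coroutines. The sketch below shows that pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in that only mimics the `async_generate(prompts, sampling_params)` call shape shown above, so the snippet runs without a GPU; whether the real engine benefits from several concurrent calls in the same way is an assumption.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in that mimics only the async_generate call shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to do model work
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def other_work():
    """Any unrelated coroutine that should make progress in the meantime."""
    await asyncio.sleep(0.01)
    return "other work finished"


async def main():
    engine = FakeEngine()
    params = {"temperature": 0.8, "top_p": 0.95}
    batch_a = ["The capital of France is"]
    batch_b = ["The future of AI is", "Hello, my name is"]
    # gather() runs the three awaitables concurrently and keeps their order.
    return await asyncio.gather(
        engine.async_generate(batch_a, params),
        engine.async_generate(batch_b, params),
        other_work(),
    )


out_a, out_b, note = asyncio.run(main())
print(len(out_a), len(out_b), note)
```

Results come back in the order the awaitables were passed to `gather`, so each batch of outputs can still be zipped with its own prompts.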

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [Occupation or Skill] with [Number of Years Experience]. I am a hardworking, organized, and responsible individual who is always ready to take on new challenges. I am a team player, always willing to help others and work towards a common goal. I am confident in my abilities and believe that with hard work and dedication, I can achieve anything I set my mind to. I am passionate about helping others and I strive to be a good example for others to follow. I am a positive and encouraging person, and I believe in treating people with respect and kindness. I am always learning and growing



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Paris is known as the city of love and is the largest city in France, with an estimated population of over 11 million people. The city is famous for its architecture, museums, and food. The famous Eiffel Tower is located in Paris and is one of the most recognizable landmarks in the world. In addition to its iconic landmarks, Paris is home to numerous museums, theaters, and shopping malls. The city is also known for its music scene, with many famous musicians and bands performing in the city. Paris is a bustling city with a diverse population and a rich cultural heritage. The French government has worked to preserve



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  full of possibilities, and it will be interesting to see how these trends develop. Here are some of the most likely future trends in AI:

1. Increase in the use of AI in healthcare: With AI, doctors will be able to better diagnose diseases, treat patients, and provide personalized treatment plans. AI algorithms will also be used to analyze medical images, which will help in early detection of diseases and improve the accuracy of diagnoses.

2. Automation of customer service: AI-powered chatbots will be able to handle customer service tasks such as answering questions, providing recommendations, and resolving issues. This will make customer service more efficient and faster, and
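
When streamed chunks need to be both displayed as they arrive and kept for later use, the `async for` loop can accumulate them while forwarding each one to a callback. The sketch below shows that pattern with a fake stream; `fake_stream` and `collect` are hypothetical helpers, and the fake stream only imitates the chunk-by-chunk behaviour seen in the output above.

```python
import asyncio


async def fake_stream(text, delay=0.0):
    """Hypothetical stand-in that yields text word by word, like streamed chunks."""
    for i, word in enumerate(text.split(" ")):
        await asyncio.sleep(delay)
        yield word if i == 0 else " " + word


async def collect(stream, on_chunk=None):
    """Consume an async stream, forwarding each chunk and returning the full text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. print(chunk, end="", flush=True)
        parts.append(chunk)
    return "".join(parts)


seen = []
full_text = asyncio.run(collect(fake_stream("Paris is the capital"), seen.append))
print(full_text)
```

In real use, the fake stream would be replaced by the async iterator returned from the streaming call shown in the example above.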




```python
llm.shutdown()
```