# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an HTTP server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A complete example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
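
As a rough illustration of the custom-server use case, the sketch below wraps an engine-like object in a plain request handler. `EchoEngine` and `handle_request` are hypothetical names introduced here. `EchoEngine` is only a stand-in that mimics the `generate(prompts, sampling_params)` call shape used later in this document, so the snippet runs without a GPU. The linked `custom_server.py` remains the reference for a real server.

```python
import json


class EchoEngine:
    """Hypothetical stand-in for an engine; it only mimics the call shape."""

    def generate(self, prompts, sampling_params):
        # A real engine would run the model here; this stub echoes the prompt.
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def handle_request(engine, body: str) -> str:
    """Turn a JSON request body into a JSON response using the engine."""
    payload = json.loads(body)
    prompts = payload["prompts"]
    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"completions": [o["text"] for o in outputs]})


request_body = json.dumps(
    {"prompts": ["The capital of France is"], "sampling_params": {"temperature": 0.8}}
)
response = handle_request(EchoEngine(), request_body)
print(response)
```

Keeping the handler independent of any web framework means the same function could, in principle, sit behind whichever server layer you choose.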



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
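
The snippet below shows the standard-library behavior that motivates this step. It uses only `asyncio` and does not involve SGLang: calling `asyncio.run()` while an event loop is already running raises a `RuntimeError`. The function names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # An event loop is already running here, so a nested asyncio.run() fails.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Notebook kernels typically run their own event loop, which is why the same error can appear there.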

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Lisa and I'm a smart girl and I can speak English, Spanish, Chinese and Japanese.
Please, tell me how to contact me. I would like to know more about the organization, its mission and goals, and how to contribute to its success. How can I contact the organization and what kind of information do I need to provide to be included in the organization's newsletter?
Lisa wants to know more about the organization, its mission and goals, and how to contribute to its success. She also wants to know the contact information for the organization and the best way to communicate with them.
Could you please provide me with the necessary information to
Prompt: The president of the United States is
Generated text:  running for a second term. He has won x votes. If he needs 574 more votes to secure a third term, how many votes does he need to receive in total to win the presidency?
To determine the total number of votes the president of the United States needs t
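
For readers unfamiliar with the two sampling parameters used above, here is a minimal sketch of the textbook definitions of temperature scaling and top-p (nucleus) filtering. It is not SGLang's implementation. The helper names and the toy logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}  # renormalize survivors


# Toy logits for three candidate next tokens (made-up numbers).
logits = {" Paris": 2.0, " Lyon": 1.0, " Nice": 0.0}
probs_default = softmax_with_temperature(logits, temperature=1.0)
probs_sharp = softmax_with_temperature(logits, temperature=0.5)
candidates = top_p_filter(probs_default, top_p=0.9)
print(sorted(candidates))
```

In this toy case, lowering the temperature concentrates probability on the top token, and the top-p filter drops the least likely candidate before sampling.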

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [occupation] with [number] years of experience. I'm passionate about [reason for interest] and I'm always looking for ways to [action or goal]. I'm always eager to learn and grow, and I'm always willing to take on new challenges. I'm a [character trait] and I'm always ready to help others. I'm [character trait] and I'm always willing to go above and beyond to help others. I'm [character trait] and I'm always ready to take on new challenges. I'm [character trait] and I'm always ready to help others.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is the largest city in France and the third-largest city in the world by population. Paris is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum, as well as its rich history and cultural heritage. It is also a major center for finance, art, and music, and is a popular tourist destination. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into the urban landscape. The city is home to many world-renowned museums, theaters, and other cultural institutions, and is a major hub

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased automation and robotics: As AI technology continues to advance, we are likely to see an increase in automation and robotics in various industries. This could lead to the creation of new jobs and the displacement of human workers, but it could also create new opportunities for people who are skilled in AI and robotics.

2. AI-powered healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to advance, we are likely to see even
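
The "overlap removal" mentioned in the cell above can be pictured with a small sketch. It assumes that consecutive streamed chunks may repeat the tail of the text already received, and it is not necessarily how `stream_and_merge` is implemented. The names `trim_overlap` and `merge_chunks` are used here for illustration only.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the prefix that already ends existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while dropping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = [" Paris", " Paris, also", " also known as", " the City of Light."]
merged = merge_chunks(chunks)
print(merged)
```

A caveat of this heuristic is that it cannot distinguish a repeated boundary from text the model genuinely repeats (for example `"ha"` followed by `"ha"`), so it only suits streams that are known to overlap.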



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I am a young, ambitious, and sometimes grumpy person who enjoys reading, cooking, and taking walks in the park. I am also a bit of a go-to person for advice when things get tough. How about you? What are your hobbies, interests, or passions? That would be really helpful for me to get to know you better. Let's talk! [Your Name] [Your Personality] [Your Interests and Experiences] [Your Family Life and Relationships] [Your Goals and Values] [Your Challenges and Conflicts]
As an AI language model, I don't have personal hobbies,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the city of light, of scholars and artists, of kings and queens, of history and modernity. It is a city where the thunderous sound of battle echoes across the Seine and where the sounds of fashionable music, ball
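
Because `async_generate` is awaited, it can be combined with other coroutines. The sketch below submits several batches with `asyncio.gather`. `FakeAsyncEngine` is a hypothetical stand-in that only mimics the `async_generate(prompts, sampling_params)` call shape shown above, and whether a real engine benefits from concurrent calls like this is an assumption rather than something this document states.

```python
import asyncio


class FakeAsyncEngine:
    """Hypothetical stand-in; it mimics the call shape, not the model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait on the model
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Start every batch at once; gather returns results in input order.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(FakeAsyncEngine(), batches, {"temperature": 0.8}))

for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```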

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily, and I'm a 30-year-old writer living in New York City. I've always been fascinated by the world of literature, and I've recently published my first novel, which I'm excited to share with you. What's your name, and where do you work or live? Emily's response is: "Hello, my name is Emily, and I'm a 30-year-old writer living in New York City. I've always been fascinated by the world of literature, and I've recently published my first novel, which I'm excited to share with you." What do you think of Emily's personality? Emily



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known as the City of Light. The city is located on the Seine river, and is home to numerous museums, art galleries, and historic buildings. It is known for its rich history, including the birthplace of the French Revolution, the Eiffel Tower, and the Louvre Museum. The city also has a vibrant culture, with jazz and ballet performances, and a popular food scene. Paris is a cosmopolitan city with a diverse population, and is known for its artistic and intellectual atmosphere. The city is a hub for education, arts, and commerce, with numerous universities and cultural institutions. The city is



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be shaped by several trends and developments that are likely to occur in the next decade. Here are some possible future trends in AI:

1. **Increased AI Transparency**: As AI becomes more integrated into our daily lives, it will become more transparent and accessible to the public. This includes AI-powered tools and systems that display their algorithms and decision-making processes, allowing users to understand how AI is making decisions and how it is improving processes.

2. **AI Ethics and Bias**: As AI systems become more complex and sophisticated, there will be a greater emphasis on ethical considerations and the prevention of bias. This includes the development of AI systems that




In [6]:
llm.shutdown()
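
In a script, it can be convenient to guarantee that `shutdown()` runs even when generation raises an exception. The following is a minimal sketch using `contextlib.contextmanager`. `managed_engine` and `FakeEngine` are hypothetical names, and `FakeEngine` stands in for a real engine so the snippet runs anywhere.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in that only records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


@contextmanager
def managed_engine(factory):
    """Create an engine and always shut it down when the block exits."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


engine_ref = None
try:
    with managed_engine(FakeEngine) as engine:
        engine_ref = engine
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine_ref.closed)
```

With a real engine, `factory` could be a callable such as `lambda: sgl.Engine(model_path=...)`, assuming the engine is constructed as in the first cell of this document.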