# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
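
As a rough illustration of the custom-server idea, the sketch below wraps an engine-like object in a small request handler. `StubEngine`, `handle_generate`, and the JSON payload shape are assumptions made for this example; only the `generate(prompts, sampling_params)` call shape and the `text` key mirror the usage shown later in this document. In a real server, an `sgl.Engine` instance would presumably take the stub's place, and the handler would be mounted on the web framework of your choice.

```python
import json
from typing import Any, Dict, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, str]]:
        # Mirror the output shape used in this document:
        # one dict with a "text" key per prompt.
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def handle_generate(engine: Any, raw_body: str) -> str:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    body = json.loads(raw_body)
    prompts = body["prompts"]
    sampling_params = body.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"completions": [out["text"] for out in outputs]})


# Example request, as a web framework might hand it to the handler.
request = json.dumps(
    {"prompts": ["Hello"], "sampling_params": {"temperature": 0.8}}
)
response = handle_generate(StubEngine(), request)
print(response)
```

Keeping the handler independent of any web framework makes it easy to test without starting a server; the linked `custom_server.py` remains the reference for a complete setup.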



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or any other environment that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
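
The reason for this patch can be reproduced with the standard library alone. IPython and Jupyter already run an event loop, and `asyncio.run()` refuses to start while a loop is running in the same thread. The sketch below triggers that situation deliberately; `inner` and `outer` are illustrative names, not part of SGLang.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, which is the
    # situation a notebook cell is in.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: raises RuntimeError
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are allowed, which is why it is needed before running the asynchronous examples in a notebook.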

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Sarah, a 17-year-old college student who is passionate about cooking and sharing her recipes online. I love experimenting with new flavors and ingredients, and I enjoy sharing my knowledge and recipes with others. As a regular reader, I try to give feedback and help others learn cooking. I also love learning new cooking techniques and recipes, and I enjoy experimenting with different styles and cuisines.
My name is Sarah, a college student passionate about cooking. I love experimenting with new flavors and ingredients, sharing my knowledge, and helping others learn. I also enjoy learning new cooking techniques and recipes. I'm always looking for new recipes and ingredients to
Prompt: The president of the United States is
Generated text:  very busy in the summer. He has to go to the White House in Washington D. C. on the weekend. He has to go to important meetings. He has to go to important talks. He has to go to important jobs. He has to go to
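
The examples in this document pass `temperature` and `top_p` in `sampling_params`. As a conceptual aid, the toy sketch below shows what these two knobs conventionally mean: temperature rescales the logits, and top-p (nucleus) filtering keeps only the most probable tokens up to a cumulative probability. `sample_token` and `toy_logits` are hypothetical names for this illustration, and SGLang's actual sampling implementation may differ.

```python
import math
import random
from typing import Dict


def sample_token(
    logits: Dict[str, float], temperature: float, top_p: float, rng: random.Random
) -> str:
    """Pick one token via temperature scaling followed by top-p filtering."""
    # Temperature (assumed > 0) rescales logits: lower values sharpen
    # the distribution, higher values flatten it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    exp = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exp.values())
    probs = sorted(
        ((tok, e / total) for tok, e in exp.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )

    # Keep the smallest set of top tokens whose cumulative
    # probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for tok, p in probs:
        nucleus.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    tokens, weights = zip(*nucleus)
    return rng.choices(tokens, weights=weights, k=1)[0]


toy_logits = {"Paris": 5.0, "Lyon": 2.0, "Nice": 1.0, "banana": -3.0}
rng = random.Random(0)
picks = [
    sample_token(toy_logits, temperature=0.2, top_p=0.9, rng=rng)
    for _ in range(5)
]
print(picks)
```

With a low temperature such as 0.2, nearly all probability mass falls on the top token, so the nucleus contains a single candidate and the output becomes close to deterministic.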

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [insert your profession or role here] with [insert your years of experience here]. I'm passionate about [insert something that reflects your personality or interests here]. I enjoy [insert something that reflects your personality or interests here]. I'm always looking for ways to [insert something that reflects your personality or interests here]. I'm excited to [insert something that reflects your personality or interests here]. I'm looking forward to [insert something that reflects your personality or interests here]. I'm a [insert something that reflects your personality or interests here]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville-Marie" or simply "Paris". It is the largest city in France and the third-largest city in the world by population. Paris is known for its rich history, art, and culture, as well as its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also a major transportation hub and a major economic center in Europe. Paris is a UNESCO World Heritage site and is home to many famous museums, theaters, and art galleries. The city is also known for its fashion industry, with many famous fashion designers and boutiques located in the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some of the most likely trends that are expected to shape the future of AI:

1. Increased automation and robotics: As AI technology continues to advance, we are likely to see an increase in automation and robotics in various industries. This will lead to the creation of more efficient and cost-effective solutions, as well as the creation of new jobs in areas such as robotics and automation.

2. Enhanced privacy and security: As AI technology becomes more advanced, there will be an increased need for privacy and security measures
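
The banner above mentions "overlap removal". One plausible way to merge streamed chunks that may repeat the tail of the text received so far is sketched below. `strip_overlap` and `merge_chunks` are hypothetical helpers written for this illustration; they are not claimed to match the internals of `stream_and_merge`.

```python
from typing import Iterable


def strip_overlap(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that does not repeat the tail of `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while dropping overlapping prefixes."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


merged = merge_chunks(["Hello, wor", "world", "d!"])
print(merged)  # prints "Hello, world!"
```

This is a heuristic: text that legitimately repeats across a chunk boundary (for example `"ha"` followed by `"ha"`) would also be trimmed, so it fits best when chunks are known to overlap.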



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  ____. I'm a/an ____. I enjoy ____. I'm an ____. I'm a/an ____. I'm a/an ____. I'm a/an ____. I'm a/an ____. I'm a/an ____. I'm a/an ____. What's your name?
As an AI language model, I don't have personal experiences, emotions, or beliefs, so I don't have a name. However, I'm here to assist you with any questions or information you may need. What can I do for you today? It's always good to have a friendly and welcoming conversation with a helpful AI. How can I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

Additional context: Paris is the third most populous city in the world, after Beijing and Tokyo. It is the seat of government for the Fifth French Republic, and is known for its fashion, gastronomy, and romanticism. Paris is also famous for its cafes and opulent opulence, especially during the New Y
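
When requests arrive independently rather than as one list, a common asyncio pattern is to issue one coroutine per prompt and cap how many run at once. The sketch below shows that pattern with a stub; `StubAsyncEngine`, `generate_all`, and `max_in_flight` are hypothetical names, and the stub taking a single prompt per call is an assumption of this example. The engine performs its own scheduling, so a cap like this may be unnecessary in practice.

```python
import asyncio
from typing import Any, Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(
        self, prompt: str, sampling_params: Dict[str, Any]
    ) -> Dict[str, str]:
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"text": f" <completion for {prompt!r}>"}


async def generate_all(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    max_in_flight: int = 2,
) -> List[Dict[str, str]]:
    """Run one request per prompt, with at most `max_in_flight` at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt: str) -> Dict[str, str]:
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # asyncio.gather returns results in the order of its arguments,
    # regardless of which request finishes first.
    return await asyncio.gather(*(one(p) for p in prompts))


results = asyncio.run(
    generate_all(StubAsyncEngine(), ["a", "b", "c"], {"temperature": 0.8})
)
for item in results:
    print(item["text"])
```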

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name]. I'm a [Briefly describe your character's job or role]. I have [briefly describe your character's personality]. I enjoy [briefly describe your character's hobbies or interests]. And if you get a chance, I'd love to share some [briefly describe your character's skill or expertise] with you. Goodnight.

[Your Name].

I'm sorry, but I cannot fulfill the request as I am a language model and do not have access to personal information such as job titles, hobbies, interests, personality traits, or specific skills. Additionally, sharing information about the character's skills or

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the city of love and art, known for its landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to numerous museums, gardens, and cultural institutions. Paris is a vibrant city with a rich history, architecture, and vibrant food scene. The city is a popular tourist destination and a favorite among French residents and visitors alike. Its bustling street life and charming ambiance make it a popular destination for those looking for a taste of France. The French government has made Paris a cultural and economic hub, attracting millions of tourists each year and creating millions of jobs. The city is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  incredibly bright, with endless possibilities and exciting innovations to come. Here are some potential trends in AI that we can expect to see in the coming years:

1. Self-driving cars: Self-driving cars have already become more common in public places and are being tested in the real world. In the future, we can expect to see more advanced self-driving cars that are capable of navigating the streets of cities with less human intervention.

2. Robotic agriculture: Robotics is already being used in agriculture to help with tasks like planting, harvesting, and monitoring crop health. In the future, we can expect to see even more advanced robotic systems that can autonom
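
The `async for` consumption pattern used in this section can be tried without loading a model. In the sketch below, `stub_stream` is a hypothetical async generator standing in for a streaming call, and `collect` both prints chunks as they arrive and returns the full text once the stream ends.

```python
import asyncio
from typing import AsyncIterator, List


async def stub_stream(text: str, delay: float = 0.0) -> AsyncIterator[str]:
    """Hypothetical streaming source: yields one word at a time."""
    for piece in text.split(" "):
        await asyncio.sleep(delay)  # simulate time between chunks
        yield piece + " "


async def collect(stream: AsyncIterator[str]) -> str:
    """Print chunks as they arrive and return the assembled text."""
    parts: List[str] = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline once the stream is exhausted
    return "".join(parts).rstrip()


full_text = asyncio.run(collect(stub_stream("The future of AI is bright")))
```

Printing with `end=""` and `flush=True` lets text appear incrementally on one line, while the accumulated list keeps the complete output available for later use.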




```python
llm.shutdown()
```