# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. Two common use cases are:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
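
The patch is needed because IPython and Jupyter already run an asyncio event loop, and `asyncio.run()` refuses to start a second loop inside a running one. The following sketch uses only the standard library to reproduce that error (the function names `inner` and `outer` are illustrative); `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "nested call succeeded"


result = asyncio.run(outer())
print(result)
```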

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  F. I am a computer science student at the University of Illinois. I'm going to apply for a job at the university's software development department.
As an intern, I would primarily focus on learning about software development and software engineering concepts. Additionally, I would also be able to contribute to the company's success by assisting in the development of new projects or working with existing ones to improve their performance.
Could you please provide me with some guidance on how to prepare for the job interview? Thank you! Also, are there any specific software development tools that I should consider? Certainly! Preparing for a job interview involves several key steps, including
Prompt: The president of the United States is
Generated text:  a role of utmost importance. The president of the United States is the highest elected official in the United States. The president's primary role is to lead the United States, and the other fun
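
The `temperature` and `top_p` entries in `sampling_params` control how each next token is drawn. As a rough illustration of what these two settings usually mean, here is a toy pure-Python sketch with made-up logits. It is not SGLang's sampler. Temperature rescales the logits before the softmax, and top-p keeps the smallest set of highest-probability tokens whose cumulative probability reaches `top_p`.

```python
import math


def softmax_with_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# Hypothetical logits for the token after "The capital of France is".
toy_logits = {" Paris": 3.0, " Lyon": 2.0, " a": 1.0, " banana": -3.0}

probs = softmax_with_temperature(toy_logits, temperature=0.8)
candidates = top_p_filter(probs, top_p=0.95)
print(candidates)
```

With these toy numbers the very unlikely token is filtered out before sampling, while the remaining candidates keep their relative order. Lowering `temperature` concentrates probability on the top token, which is why the streaming examples that use `0.2` tend to produce more repeatable text.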

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [insert a short, interesting fact about yourself]. I'm always looking for new challenges and opportunities to grow and learn. What do you do for a living? I'm a [insert a short, interesting fact about your job]. I'm always looking for ways to improve my skills and stay up-to-date with the latest trends in my field. What do you enjoy doing in your free time? I enjoy [insert a short, interesting fact about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light, and is the largest city in the European Union. It is located on the Seine River and is home to the Eiffel Tower, the Louvre Museum, and the Notre-Dame Cathedral. Paris is known for its rich history, art, and culture, and is a popular tourist destination. The city is also home to many famous landmarks and attractions, including the Louvre, the Champs-Élysées, and the Arc de Triomphe. Paris is a vibrant and dynamic city that is known for its lively atmosphere and diverse cultural scene. It is a popular destination for

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some potential trends include:

1. Increased integration of AI into everyday life: AI is already being integrated into our daily lives, from voice assistants like Siri and Alexa to self-driving cars. As AI becomes more integrated into our daily lives, we can expect to see even more widespread adoption.

2. AI becoming more autonomous: As AI becomes more integrated into our daily lives, we can expect to see more autonomous vehicles on the road. This could lead to a decrease in accidents and a reduction in carbon emissions.

3. AI becoming more
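
The "overlap removal" mentioned in this example refers to merging streamed chunks so that text repeated at a chunk boundary is not printed twice. A minimal sketch of that idea follows. `trim_overlap_sketch` and `merge_chunks` are hypothetical names used for illustration, and the code is not claimed to match how `stream_and_merge` is implemented.

```python
def trim_overlap_sketch(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends `existing`."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: list[str]) -> str:
    """Concatenate chunks, removing text repeated at each boundary."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Made-up chunks in which each one repeats the tail of the previous one.
chunks = [" Paris is", " is the capital", " capital of France."]
merged = merge_chunks(chunks)
print(merged)
```

A purely textual heuristic like this can also swallow a genuine repetition (for example, two identical tokens in a row), so it is only appropriate when chunks are known to overlap.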



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Sarah, and I am a 25-year-old data analyst with over five years of experience in data management and analysis. I have a strong background in statistical analysis, data visualization, and predictive modeling, and I specialize in utilizing various tools and techniques to help clients improve their decision-making processes. I am also a skilled communicator and a collaborator with a proven track record of working in cross-functional teams and contributing to the success of projects. I am committed to continuous learning and improvement, and I strive to stay up to date with the latest trends and technologies in the field of data analysis and visualization. I am eager to leverage my skills and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the most populous city in Europe and is home to several world
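
One reason to prefer the asynchronous call is that the event loop stays free while generation is in flight, so other coroutines can make progress. The sketch below illustrates the pattern with a stand-in `FakeEngine` (hypothetical, so that the snippet runs without a GPU or a model). With the real engine you would await `llm.async_generate(...)` in its place.

```python
import asyncio


class FakeEngine:
    """Stand-in for the engine; it only mimics the call shape used above."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.05)  # pretend the model is busy
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def heartbeat(events: list) -> None:
    # Some unrelated work that runs while generation is pending.
    for i in range(3):
        events.append(f"tick {i}")
        await asyncio.sleep(0.01)


async def run_demo():
    engine = FakeEngine()
    events: list = []
    # gather() schedules both coroutines on the same event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["Hello, my name is"], {"temperature": 0.8}),
        heartbeat(events),
    )
    return outputs, events


outputs, events = asyncio.run(run_demo())
print(outputs[0]["text"], events)
```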

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a software engineer with experience in [specific area of software engineering]. I'm always ready to learn and always looking for ways to improve my skills. What do you do for a living? I work at [Company Name], where I develop [specific software product or service]. I enjoy problem solving, and trying new things to innovate. What's your dream job? To have the freedom to explore new ideas and work on projects that really ignite my passion. Where do you see yourself in five years? To be a big name in my field, building software that drives meaningful change. What's your favorite hobby? H



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located on the Seine River, an important waterway that provides a long-distance transport route. Paris is home to UNESCO World Heritage sites, including the Eiffel Tower, Notre-Dame Cathedral, the Louvre Museum, and the Louvre Theatre. The city is also known for its art, food, and music, with Paris being recognized as the best city in the world for food and wine. In recent years, Paris has become a popular tourist destination, with over a billion tourists annually. The city is home to numerous French-speaking communities, with many large French-language restaurants and cafes, and a strong sense of national pride



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  full of possibilities and potential. Here are some possible future trends:

1. AI will become more integrated into our daily lives: AI will become more integrated into our daily lives, from our smartphones to the virtual assistants we use at work. AI will also become more ubiquitous, with more people interacting with it on a regular basis.

2. AI will be used for self-driving cars: As autonomous vehicles become more common, AI will be used for self-driving cars. This will require the development of better algorithms and sensors to recognize traffic signs, weather, and road conditions.

3. AI will be used for healthcare: AI will be used for healthcare
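
The streaming loop in this section consumes an async generator chunk by chunk and prints each piece as it arrives. If you also need the complete text once the stream ends, one option is to accumulate the chunks while printing them. The sketch below uses a hypothetical `fake_stream` generator in place of `async_stream_and_merge` so that it runs standalone.

```python
import asyncio
from typing import AsyncIterator


async def fake_stream(text: str) -> AsyncIterator[str]:
    """Hypothetical stand-in that yields a few characters at a time."""
    for start in range(0, len(text), 4):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start : start + 4]


async def stream_and_collect(text: str) -> str:
    pieces = []
    async for chunk in fake_stream(text):
        print(chunk, end="", flush=True)  # show output as it arrives
        pieces.append(chunk)  # and keep it for later use
    print()
    return "".join(pieces)


full_text = asyncio.run(stream_and_collect(" Paris, located on the Seine River"))
```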




In [6]:
llm.shutdown()
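
If generation raises an exception, a trailing `llm.shutdown()` call is never reached. One way to guard against that is a small context manager that always calls `shutdown()`. This is only a sketch: `engine_session` and `FakeEngine` are hypothetical names, and with the real engine the factory would be something like `lambda: sgl.Engine(model_path=...)`.

```python
from contextlib import contextmanager


class FakeEngine:
    """Stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(factory):
    """Create an engine and guarantee shutdown(), even if the body raises."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


try:
    with engine_session(FakeEngine) as engine:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.is_shut_down)
```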