# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed, working Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
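
As a rough illustration of the custom-server use case, the core of such a server can be reduced to a function that maps a request payload to an engine call. The sketch below is framework-agnostic and is not taken from the linked script: `handle_generate`, the payload keys, and `StubEngine` are hypothetical names, and the stub stands in for `sgl.Engine` so the snippet runs without a GPU.

```python
import json
from typing import Any, Dict, List


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; returns canned completions."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, Any]]:
        return [{"text": f" (stub completion for {p!r})"} for p in prompts]


def handle_generate(engine: Any, raw_body: str) -> str:
    """Turn a JSON request body into a JSON response body.

    Assumed (hypothetical) request shape:
    {"prompt": "...", "sampling_params": {...}}
    """
    payload = json.loads(raw_body)
    prompt = payload["prompt"]
    sampling_params = payload.get("sampling_params", {})
    # Pass a one-element list so the result is a list of dicts,
    # the same shape used by the batch examples in this document.
    outputs = engine.generate([prompt], sampling_params)
    return json.dumps({"prompt": prompt, "text": outputs[0]["text"]})


response = handle_generate(
    StubEngine(),
    json.dumps(
        {"prompt": "The capital of France is", "sampling_params": {"temperature": 0.0}}
    ),
)
print(response)
```

A web framework of your choice would call a function like this from its route handler; swapping `StubEngine()` for a real engine instance is the only change the sketch assumes.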



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
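
The patch is needed because `asyncio.run()` raises a `RuntimeError` when it is called from a thread that already has a running event loop, as is typically the case inside a Jupyter notebook cell. The standard-library sketch below shows one way to check for that situation; the helper name `loop_is_running` is made up for illustration.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True if the current thread already has a running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises when no loop is running in this thread.
        return False
    return True


async def check_from_inside_loop() -> bool:
    # Called from a coroutine, so a loop is necessarily running here.
    return loop_is_running()


outside = loop_is_running()
inside = asyncio.run(check_from_inside_loop())
print("running loop outside coroutine:", outside)
print("running loop inside coroutine:", inside)
```

In a plain Python script the first check is expected to be false, so `nest_asyncio.apply()` is not required there.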

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Rachel and I am a sixth grade student at Granite Middle School. My name is Sara.
I have a strong sense of humor and I really like helping people. I love talking with people and I enjoy spending time with friends. I like to eat well and exercise. I love the outdoors and I love to spend time outside. I love to travel to new places. I love travel and I think it's so cool.
My favorite hobby is reading books. I really like reading books about science and technology. I think reading is really fun. My favorite books are like the classic book and I like to read books that have great stories. I
Prompt: The president of the United States is
Generated text:  trying to decide how many military bases to build in a certain country. He learned that for every base that is built, the population growth rate decreases by 0.5%. If the population of the country is 3.2 million, and the president decides to build 5 bases, how many soldiers will be needed if the grow
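
For prompt lists that are too large to submit in one call, one option is to split them into fixed-size chunks and call `generate` once per chunk. The helper below is a minimal sketch under that assumption: `generate_in_chunks` and `EchoEngine` are hypothetical names, and `EchoEngine` only mimics the list-of-dicts return shape shown above so the snippet runs without loading a model.

```python
from typing import Any, Dict, Iterator, List


class EchoEngine:
    """Hypothetical stand-in for the engine; echoes each prompt back."""

    def generate(
        self, prompts: List[str], sampling_params: Dict[str, Any]
    ) -> List[Dict[str, Any]]:
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start : start + size]


def generate_in_chunks(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    chunk_size: int = 2,
) -> List[Dict[str, Any]]:
    """Call engine.generate once per chunk and keep results in prompt order."""
    results: List[Dict[str, Any]] = []
    for chunk in chunked(prompts, chunk_size):
        results.extend(engine.generate(chunk, sampling_params))
    return results


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
outputs = generate_in_chunks(EchoEngine(), prompts, {"temperature": 0.8}, chunk_size=2)
for prompt, output in zip(prompts, outputs):
    print(f"{prompt!r} -> {output['text']!r}")
```

Because the engine already schedules a batch internally, chunking like this is mainly a convenience, for example when you want to save partial results between calls.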

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, with a rich history dating back to the Roman Empire and the French Revolution. Paris is home to many famous museums, including the Louvre, the Musée d'Orsay, and the Musée d'Art Moderne. It is also a popular tourist destination, with millions of visitors annually. The city is known for its cuisine, including French cuisine, and is home to many famous restaurants and bars. Paris is a vibrant and dynamic city,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, reduce costs, and increase efficiency. As AI becomes more advanced, we can expect to see even more widespread use of AI in healthcare, with more sophisticated algorithms and machine learning techniques being developed to improve diagnosis, treatment, and patient care.

2. Increased use of AI in finance: AI is already being used in finance to improve risk management, fraud detection, and trading algorithms.
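
The cell above relies on `stream_and_merge` to remove overlap between streamed chunks. Its implementation is not shown in this document; purely as a generic illustration of the idea, the sketch below trims the longest prefix of each new chunk that repeats the tail of the text collected so far. The function names `trim_overlap` and `merge_chunks` are hypothetical and are not part of SGLang.

```python
from typing import Iterable


def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that duplicates the end of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    # Try the longest candidate first so the largest duplicate is removed.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks while skipping text repeated across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


print(merge_chunks(["Hello wor", "world", "ld!"]))
```

A character-level match like this can also swallow legitimate repetitions (for example "ha" followed by "ha"), so treat it as a sketch of the technique rather than a drop-in replacement.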



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Character's Name]. I am a [Job Title] with [Company Name] and [Position]. I have been with [Company Name] since [Year] and I have been working hard to [Specific Achievement or Goal]. What inspired you to start your career in this field, and what do you hope to achieve in the next [Number] years? I would love to hear about what's been your greatest accomplishment and what you're looking forward to achieving next in your career.
Your introduction should be professional and concise, highlighting your skills and achievements while showing your interest in the industry. It should also be tailored to the specific job title

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the City of Light and the world’s largest city.
Paris, the City of Light, is the capital of France, located on the Seine River, in the regi
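
When requests arrive independently rather than as one list, a common asyncio pattern is to issue one call per prompt and cap how many are in flight at once. The sketch below shows that pattern with `asyncio.gather` and a semaphore; `StubAsyncEngine` and `generate_concurrently` are hypothetical names, and the single-prompt call shape on the stub is an assumption made so the example runs without a model.

```python
import asyncio
from typing import Any, Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in exposing an async_generate coroutine."""

    async def async_generate(
        self, prompt: str, sampling_params: Dict[str, Any]
    ) -> Dict[str, Any]:
        await asyncio.sleep(0)  # simulate yielding to the event loop
        return {"text": f" <completion for: {prompt}>"}


async def generate_concurrently(
    engine: Any,
    prompts: List[str],
    sampling_params: Dict[str, Any],
    max_in_flight: int = 2,
) -> List[Dict[str, Any]]:
    """Run one request per prompt, with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def run_one(prompt: str) -> Dict[str, Any]:
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(run_one(p) for p in prompts))


prompts = ["Hello, my name is", "The capital of France is"]
results = asyncio.run(
    generate_concurrently(StubAsyncEngine(), prompts, {"temperature": 0.8})
)
for prompt, result in zip(prompts, results):
    print(f"{prompt!r} -> {result['text']!r}")
```

If all prompts are known up front, passing the whole list to `async_generate` as in the cell above is the simpler choice.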

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [occupation/role]! I am [Age] years old, and I currently live in [Location]. I have a passion for [excuse me, what is your hobby or activity that you enjoy the most? - please make sure it's something you're not afraid of or that you enjoy doing! - you can even use humor to make it more relatable, if possible. - make sure to be honest and specific]. I'm always looking for new experiences and learning new things, and I enjoy sharing my knowledge with others! I believe that my interests and hobbies are what make me unique and


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city renowned for its medieval architecture, vibrant cultural scene, and influential role in French history and politics. It is the fifth-largest city in the European Union and the second-largest in the world by population. Paris is home to the Louvre Museum, the Eiffel Tower, the Arc de Triomphe, and numerous other iconic landmarks. The city is also known for its rich culture, cuisine, and social life, making it an important city for many French residents and visitors alike.

Key facts about Paris:

1. It has been a capital since 1262.
2. It is located on the Se


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  complex and rapidly evolving, with many different directions in which it may be heading. Here are some potential future trends in AI:

1. Increased integration of AI into everyday life: As AI becomes more integrated into our daily lives, we may see more of it being used for tasks that are mundane or repetitive, such as voice recognition and machine learning for predicting weather patterns.

2. AI becoming more sophisticated and autonomous: As AI becomes more sophisticated, it may become more autonomous, able to operate without human intervention in some cases. This could be seen in things like autonomous vehicles, robots, and even artificial intelligence that can perform tasks that are too complex



In [6]:
llm.shutdown()
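
To make sure `shutdown()` runs even when generation raises an exception, the call can be wrapped in a context manager. This is a minimal sketch of that pattern using `contextlib`; `managed_engine` and `StubEngine` are hypothetical names, and the stub only records whether `shutdown()` was called so the snippet runs without a model.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


class StubEngine:
    """Hypothetical stand-in that tracks whether shutdown() was called."""

    def __init__(self) -> None:
        self.is_shut_down = False

    def shutdown(self) -> None:
        self.is_shut_down = True


@contextmanager
def managed_engine(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create an engine and always call shutdown() when the block exits."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


with managed_engine(StubEngine) as engine:
    print("inside block, shut down:", engine.is_shut_down)
print("after block, shut down:", engine.is_shut_down)
```

With a real engine, the factory would be something like a `lambda` that constructs `sgl.Engine` with your model path.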