# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
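As a rough illustration of the second use case, the sketch below wraps an engine-like object in a plain request handler. `StubEngine` and `handle_generate` are hypothetical names used only for this example; the stub imitates the `generate(prompt, sampling_params)` call and the `{"text": ...}` output shape used in this document so that the snippet runs without a GPU. In a real server the stub would be replaced by an `sgl.Engine` instance, and the linked `custom_server.py` remains the reference example.

```python
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    def generate(self, prompt, sampling_params=None):
        # Mimic the {"text": ...} output shape used in this document.
        return {"text": f" <completion for: {prompt}>"}


def handle_generate(engine, raw_body: bytes) -> bytes:
    """Decode a JSON request, call the engine, and encode a JSON response."""
    request = json.loads(raw_body)
    output = engine.generate(request["prompt"], request.get("sampling_params", {}))
    return json.dumps({"text": output["text"]}).encode("utf-8")


engine = StubEngine()
body = json.dumps({"prompt": "The capital of France is"}).encode("utf-8")
response = json.loads(handle_generate(engine, body))
print(response["text"])
```

Keeping the handler independent of any web framework means the same function can be called from whichever HTTP or RPC layer the custom server uses.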



## Nest Asyncio
If you want to use the **Offline Engine** in IPython, Jupyter, or any other environment that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
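The reason for this patch can be seen with the standard library alone: `asyncio.run` refuses to start when an event loop is already running, which is the situation inside IPython or Jupyter. The sketch below reproduces that error in a plain script (the function names are illustrative); `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```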

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```



### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  John. I'm a self-employed person who works from home. I regularly undertake consulting work. My current location is in New Jersey, USA. I'm a software engineer who specializes in machine learning and data science.

I've been planning to start my own company. I've been looking at starting a company in the next 6 months. I'm thinking about offering a consulting service, which I believe will help me financially. What factors do you think I should consider when planning my own company, and how can I improve my business ideas to make it more attractive to potential customers?
Your question is quite insightful, and it's a valid one
Prompt: The president of the United States is
Generated text:  trying to finalize a new executive order that would implement a new health care plan. His plan calls for a tax on tobacco and a tax on alcohol. The tax on tobacco would increase by 20%, while the tax on alcohol would increase by 10%. If the current tax on toba

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum. It is also a major cultural and economic center, hosting numerous museums, theaters, and festivals throughout the year. Paris is a popular tourist destination and is home to many famous landmarks and attractions. It is also a major hub for international trade and diplomacy. The city is known for its rich history, including the influence of the French Revolution and the influence of the French Revolution on modern French culture and politics. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. It is a

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the way we live, work, and interact with technology. Here are some possible future trends in AI:

1. Increased automation and robotics: As AI technology continues to advance, we are likely to see more automation and robotics in various industries, including manufacturing, transportation, and healthcare. This will lead to increased efficiency, cost savings, and job displacement, but it will also create new opportunities for workers and businesses.

2. AI-powered healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As AI technology continues to advance, we are likely to
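The "overlap removal" mentioned in the cell above can be pictured as follows. Depending on how chunks are produced, a streamed chunk may repeat text that has already been received, so a merge step has to drop the repeated prefix before appending. The sketch below illustrates that idea only; it is not necessarily how `stream_and_merge` is implemented. `trim_overlap` and `merge_chunks` are hypothetical helper names, and the chunk list stands in for a real stream.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Append each chunk to the running text after removing any overlap."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Simulated stream in which each chunk repeats part of what was already sent.
chunks = [" Paris", " Paris, known", " known for its landmarks"]
merged = merge_chunks(chunks)
print(merged)
```

A purely textual heuristic like this can also remove a genuine repetition that happens to match the end of the text received so far, which is worth keeping in mind when adapting it.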



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Job Title] with [Company Name]. I'm passionate about [Job Title] and love to [Job Title] in [Industry/Location]. I'm always eager to learn new skills and expand my knowledge in my field. I'm always up for a challenge and love to [Job Title] with a smile. Thank you for asking. That sounds like a great start to my self-introduction! Can you tell me a little bit more about what you do and what you're passionate about? That will help me get to know you better and show you more about yourself. Of course, just let

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the world’s third most populous city and the largest metropolitan area in Europe. Paris is known for its beautiful architecture, rich history, and vibrant cultural scene, including the Eiffel Tower and the Louvre Museum. The city is 
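One reason to prefer the asynchronous API is that the calling coroutine can do other work while generation is in flight. The sketch below shows that pattern with `asyncio.gather`. `StubAsyncEngine` is a hypothetical stand-in that imitates the `async_generate(prompts, sampling_params)` call and output shape used above, so the snippet runs without a model; with a real engine you would pass `llm` instead.

```python
import asyncio


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; returns canned outputs."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.01)  # pretend the model is busy
        return [{"text": f" <completion {i}>"} for i, _ in enumerate(prompts)]


async def other_work():
    # Any other coroutine, e.g. loading the next batch of prompts.
    await asyncio.sleep(0.01)
    return "next batch ready"


async def run_batch(engine, prompts):
    # Run generation and the unrelated task concurrently.
    outputs, status = await asyncio.gather(
        engine.async_generate(prompts, {"temperature": 0.8, "top_p": 0.95}),
        other_work(),
    )
    return [output["text"] for output in outputs], status


texts, status = asyncio.run(
    run_batch(StubAsyncEngine(), ["Hello, my name is", "The capital of France is"])
)
print(texts, status)
```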

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name], and I'm a [insert your occupation] who is constantly amazed by the beauty of nature and the challenges of the world. I travel the world in search of new experiences, immerse myself in the culture of different places, and gain a unique perspective of life. I'm always on the lookout for new challenges to overcome, and I strive to be an advocate for environmental conservation. I'm passionate about the importance of preserving the natural world and am always on the lookout for ways to make a difference in my community. I'm always looking to learn and grow and I'm excited to learn more about myself and my passions.

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its stunning architecture, rich history, and vibrant cultural scene. It is the largest city in the country and its population is over 3. 5 million. The city is also home to many famous landmarks and museums, including the Louvre, the Eiffel Tower, and the Notre-Dame Cathedral. Paris is known for its romantic and vibrant atmosphere, and it has played a significant role in French culture and politics throughout its history.

Some facts about Paris that might intrigue you:

1. It was founded in the 7th century by Charlemagne as the capital of France.
2. Paris

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid advancements in various areas, including the following:

1. Personalized medicine: AI is likely to play a crucial role in personalizing medicine, tailoring treatments to individual patients based on their genetic makeup, medical history, and lifestyle. AI algorithms can analyze large amounts of medical data, identify patterns, and develop personalized treatment plans.

2. Autonomous vehicles: AI is likely to revolutionize the autonomous vehicle industry, allowing for self-driving cars that can navigate streets, handle emergencies, and communicate with other vehicles. Autonomous vehicles will also improve safety and reduce the number of accidents.

3. Smart cities: AI will be used


```python
llm.shutdown()
```
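If generation can raise an exception, it may be worth making sure the engine is still shut down so that its resources are released. A minimal sketch of that pattern with `try`/`finally` follows; `StubEngine` and `generate_then_shutdown` are hypothetical names used so the example runs without a model, and only the `generate` and `shutdown` calls shown in this document are assumed.

```python
class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether it was shut down."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params=None):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


def generate_then_shutdown(engine, prompts):
    """Run one batch and always shut the engine down afterwards."""
    try:
        return engine.generate(prompts, {"temperature": 0.8, "top_p": 0.95})
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    generate_then_shutdown(engine, [])
except ValueError as exc:
    print(f"generation failed: {exc}")
print("shut down:", engine.is_shut_down)
```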