# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. It suits use cases where an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
To use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
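
The reason for this patch can be seen with the standard library alone. The sketch below assumes nothing about SGLang: it shows that `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside IPython and similar environments. The helper names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, just like code
    # executed in an IPython/Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by default
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are permitted.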

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Karen Harst and I am a professional writer, editor, editor-in-chief and book designer. I am also a science writer and a science fiction writer. I started writing professionally in 2007 and have been writing science fiction, fantasy, and romance novels, poetry, and short stories for over a decade. In my spare time, I enjoy hiking, playing board games, and reading about the future. I am a member of the American Society of Science Writers, Science Fiction Writers, and Authors. I have over 3,000 books to my name. I have been featured in a number of publications and have been
Prompt: The president of the United States is
Generated text:  a senator representing a state. The president of the United States is the leader of the executive branch of the federal government of the United States. The president of the United States is the highest elected official in the United States. The president of the United States is elected annually for a four-year ter

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I am a [job title] at [company name]. I have been working in this field for [number of years] years. I have always been passionate about [job title] and have always wanted to achieve [specific goal or achievement]. I am always looking for new challenges and opportunities to grow and learn. I am a [job title] at [company name] and I am always looking for ways to improve my skills and knowledge. I am excited to work with you and contribute to your success. [Name] [Company Name] [Job Title] [Company Address] [Company Phone Number] [Company

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris. It is the largest city in France and the second-largest city in the European Union. It is known for its rich history, beautiful architecture, and vibrant culture. Paris is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major center for business, finance, and entertainment. Paris is a popular tourist destination and a cultural hub for France and the world. It is the capital of France and the second-largest city in the European Union. Paris is known for its rich history, beautiful architecture, and vibrant culture. It is home to many famous landmarks

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence. This means that AI systems will be able to learn from and adapt to human behavior, and will be able to make decisions based on human preferences and values.

2. Greater use of machine learning: Machine learning is a key component of AI, and it is likely to become more widely used in the future. This means that AI systems will be able to learn from data and make
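
The "overlap removal" mentioned in the cell above can be pictured as follows. This is a minimal sketch under stated assumptions, not necessarily how `stream_and_merge` is implemented: it assumes that a streamed chunk may repeat the tail of the text received so far, and the names `trim_overlap` and `merge_chunks` are chosen here for illustration.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found: keep the chunk as-is


def merge_chunks(chunks):
    """Concatenate chunks while dropping the repeated prefix of each one."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


chunks = ["The capital", " capital of France", " France is Paris."]
merged = merge_chunks(chunks)
print(merged)  # The capital of France is Paris.
```

A suffix/prefix heuristic like this one cannot tell a transport-level repeat from a genuine repetition in the generated text (for example "ha" followed by "ha"), so it is only appropriate when chunks are known to overlap.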



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I'm a [Your Profession] with over [Your Number of Years] years of experience in [Your Profession]. I have a passion for [Your Professional Interest or Career Goal]. I'm always looking for ways to [Your Goal or Hobby]. I enjoy [Your Passion or Hobby]. I'm confident that I can [Your Goal or Hobby]. I believe in [Your Professional Mission/Goal]. I am a [Your Interests/Values/Charities/Community]. Thank you for considering me for a potential match. Let's chat about it. Have a good day! 

This self-introduction should be neutral

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and Louvre Museum.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involv
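
One practical reason to prefer the asynchronous call is that other coroutines can make progress while generation is awaited. The sketch below illustrates this with a stand-in object, because the real engine needs a model and a GPU. `StubEngine` and `heartbeat` are hypothetical names; the only thing taken from the cell above is the shape of the call (`await engine.async_generate(prompts, sampling_params)` returning a list of dicts with a `"text"` key).

```python
import asyncio
import time


class StubEngine:
    """Hypothetical stand-in for the engine; it only simulates latency."""

    async def async_generate(self, prompts, sampling_params=None):
        await asyncio.sleep(0.2)  # pretend the model is generating
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks):
    # Unrelated work that should keep running during generation.
    for _ in range(3):
        await asyncio.sleep(0.05)
        ticks.append(time.perf_counter())


async def main():
    engine = StubEngine()
    ticks = []
    # Run generation and the heartbeat concurrently on one event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["prompt A", "prompt B"], {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(main())
for output in outputs:
    print(output["text"])
print(f"heartbeat ticks while generating: {len(ticks)}")
```

With the synchronous `generate` call, the heartbeat would have to wait until generation returned.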

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm an [age] year old girl who was born in [Birthplace]. I grew up in [city], and I fell in love with [something or someone] from a very young age. I've always been passionate about [what you can do in your spare time], and I believe that being an advocate for [or for) [a cause or idea] is my biggest passion. I'm a strong-willed and independent person who thrives on challenges, and I'm always looking for ways to make the world a better place. I'm determined to use my skills and talents to make a positive impact

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "The City of Light".

The statement should be clear, concise, and accurately describe the capital's importance and culture. It should also include the French word for "light" in the source material.

For example: "Paris is known for its beautiful architecture, vibrant cultural scene, and renowned museums like the Louvre, but it's also celebrated for its night-life and iconic Eiffel Tower."

Please format the statement as an answer in Markdown format. ```markdown # French Capital City

The capital of France, Paris, is renowned for its beautiful architecture, vibrant cultural scene, and renowned

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to involve a number of key trends and advancements. Here are some potential areas of interest:

1. Deep Learning: As neural networks become more sophisticated, they are increasingly able to recognize patterns and make decisions based on large amounts of data. Deep learning is particularly well-suited for tasks such as image recognition, natural language processing, and computer vision.

2. Explainable AI: AI systems that are responsible for making decisions can be difficult to understand and explain, which can lead to concerns about fairness and bias. Researchers are exploring ways to make AI systems more transparent and interpretable, such as through methods like probabilistic programming or statistical machine learning


In [6]:
llm.shutdown()