# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where a server would add unnecessary complexity or overhead. There are two common use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in any other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
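
The underlying restriction can be reproduced with the standard library alone. The sketch below is illustrative and does not use SGLang; `inner` and `outer` are hypothetical names. It shows that plain `asyncio` refuses to start a second event loop from code that is already running inside one, which is the situation in a notebook cell and the restriction that `nest_asyncio.apply()` lifts.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # An event loop is already running here, much like inside a notebook cell.
    coro = inner()
    try:
        # Starting a second loop from inside a running one is rejected by asyncio.
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"


message = asyncio.run(outer())
print(message)
```

Without the patch, any library call that tries to drive its own event loop from such a context fails in the same way.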

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Lin. I'm a medical student. I'm looking for a job, and the people who interviewed me thought I was smart, so I applied. I'm not a doctor, but I have a great deal of experience with patient care. How would you describe your experience? I have experience with emergency room, surgery and other doctors. I have also worked with a lot of people with different types of problems. How do you think the job market will change in the future? The job market is changing. There are so many jobs out there now, and people are changing their career choices. As a doctor, it's really hard to make
Prompt: The president of the United States is
Generated text:  elected by _____.
A. the citizens of the United States
B. the people
C. the members of the Congress
D. the members of the Executive Branch
Answer: B

Which of the following statements about the allocation of the Office of the President is incorrect?
A. The Office of the President is responsible for the formul
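
For offline batch jobs it is often convenient to persist results rather than print them. The following is a minimal sketch that relies only on what the cell above shows: the outputs come back as a list aligned with the prompts, and each item exposes a `text` key. The helper name `to_jsonl` and the sample data are hypothetical, and the sample list stands in for a real `llm.generate` result so the snippet runs without a model.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": prompt, "text": output["text"]}, ensure_ascii=False)
        for prompt, output in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Stand-in for the list returned by llm.generate(prompts, sampling_params).
sample_prompts = ["The capital of France is"]
sample_outputs = [{"text": " Paris."}]

jsonl = to_jsonl(sample_prompts, sample_outputs)
print(jsonl)
```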

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [Age] year old [Gender] [Occupation]. I'm passionate about [Your Passion], and I'm always looking for new ways to [Your Goal]. I'm a [Your Character Trait] and I'm always [Your Character Trait]. I'm [Your Character Trait] and I'm always [Your Character Trait]. I'm [Your Character Trait] and I'm always [Your Character Trait]. I'm [Your Character Trait] and I'm always [Your Character Trait]. I'm [Your Character Trait] and I'm always [Your Character Trait]. I'm [Your Character Trait]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as "La Ville de Paris". It is the largest city in France and the third-largest city in the world by population. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. It is also home to many world-renowned museums, theaters, and restaurants. Paris is a cultural and historical center that has played a significant role in French history and continues to be a major economic and political center in Europe. It is a popular tourist destination and a major hub for international business and diplomacy. Paris is also known for its

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some possible future trends in AI include:

1. Increased use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes, reduce costs, and improve the quality of care. As AI technology continues to advance, we can expect to see even more innovative uses of AI in healthcare, such as personalized medicine, drug discovery, and patient monitoring.

2. Increased use of AI in transportation: AI is already being used in transportation to improve safety, reduce congestion, and increase efficiency. As AI technology continues to advance,
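
The cell above mentions overlap removal. One plausible way to merge streamed chunks whose beginnings may repeat the end of the text received so far is sketched below. This is an illustration of the idea, not necessarily how `stream_and_merge` is implemented; `trim_overlap` and `merge_chunks` are names chosen for this sketch.

```python
def trim_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends `existing`."""
    max_len = min(len(existing), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks([" Paris is", " is the capital", " capital of France."])
print(merged)
```

Note that this heuristic cannot tell a transport-level overlap from a genuine repetition in the generated text, so it is only appropriate when chunks are known to overlap.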



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Alex. I’m a 35-year-old freelance graphic designer with a passion for creative writing. I’m known for my unique style of using humor as a tool for problem-solving and building relationships. I’m always looking for new challenges and opportunities to grow as a designer and writer. If you’re interested in taking the next step in your creative journey, I would love to hear from you. What's your name? What’s your occupation? What do you do for a living? How did you get started in your field? Can you describe your style of work? What is your writing process like? Are you a natural writer? Is

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a city located in the south of the country and is known for its iconic Eiffel Tower, Champs-Elysées, Louvre Museum, Notre-Dame Cathedral, and other landmarks.
The answer i
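
When an application issues its own per-prompt requests instead of passing one list, it can be useful to cap how many are in flight at once. The sketch below shows that pattern with `asyncio.Semaphore` and `asyncio.gather`. To stay runnable without a model it uses `fake_async_generate`, a hypothetical stand-in that mimics only the awaitable call shape and the `text` key seen above; it is not an SGLang API.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    """Hypothetical stand-in for an awaitable engine call."""
    await asyncio.sleep(0.01)  # simulate generation latency
    return {"text": f" <completion for {prompt!r}>"}


async def generate_all(prompts, sampling_params, max_concurrency=2):
    """Run one request per prompt, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def generate_one(prompt):
        async with semaphore:
            return await fake_async_generate(prompt, sampling_params)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(generate_one(p) for p in prompts))


results = asyncio.run(generate_all(["a", "b", "c"], {"temperature": 0.8}))
for result in results:
    print(result["text"])
```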

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name], and I'm a [job title] at [company name]. I have [number] of years of experience in [field of work], and my expertise lies in [specific skill or area]. I love [reason why I enjoy my job] and am always striving to grow my knowledge and skills. What's your profession or field of work? Nice try. What do you do?
I'm a [job title] at [company name]. I have [number] of years of experience in [field of work], and my expertise lies in [specific skill or area]. I love [reason why I enjoy my job]


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known for its iconic Eiffel Tower and its many museums, including the Louvre. It has a population of approximately 11 million people and is the largest city in both the European Union and the world. As the heart of France, it is a bustling and diverse city with a rich history and culture. 

To sum up, Paris is a city of contrasts, and the Eiffel Tower stands as a symbol of France's identity and creativity. Visitors can explore its beautiful art, architecture, and museums, as well as its iconic landmarks, such as the Louvre. This makes Paris the capital of the French Riv


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly dynamic and uncertain, and there is no one-size-fits-all answer to what the future holds. However, some possible future trends that could shape the development of AI include:

1. Integration of AI with other technologies: AI is becoming increasingly integrated with other technologies, including machine learning, big data, and robotics. This integration could lead to a more versatile and efficient use of AI, as well as the development of new applications that were previously unimaginable.

2. Advancements in AI ethics and safety: As AI becomes more advanced, it will be important to address ethical concerns and ensure that AI is used responsibly. This could involve developing


In [6]:
llm.shutdown()