# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
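As a rough sketch of that idea, a custom server mostly reduces to a request handler that forwards a prompt to the engine and returns the generated text. The handler below is framework-agnostic. `StubEngine`, `handle_request`, and the request fields are hypothetical names used for illustration, and the stub only mimics the `async_generate` call and the `output['text']` shape used later in this document so that the snippet runs without a GPU. The linked `custom_server.py` remains the reference example.

```python
import asyncio
import json


class StubEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs without a GPU."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0)  # yield control, as a real engine call would
        return {"text": f" (stub completion for: {prompt})"}


async def handle_request(engine, raw_body: str) -> str:
    """Parse a JSON request, call the engine, and return a JSON response."""
    try:
        body = json.loads(raw_body)
        prompt = body["prompt"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return json.dumps({"error": "expected a JSON object with a 'prompt' field"})

    # Assumed default; callers may pass their own sampling parameters.
    sampling_params = body.get("sampling_params", {"temperature": 0.8, "top_p": 0.95})
    output = await engine.async_generate(prompt, sampling_params)
    return json.dumps({"text": output["text"]})


response = asyncio.run(
    handle_request(StubEngine(), '{"prompt": "The capital of France is"}')
)
print(response)
```

In a real server, the web framework of your choice would call a handler like this from its route function, with the stub replaced by a single long-lived engine instance.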



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop (for example, a Jupyter notebook), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
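To see why the patch is needed, the sketch below reproduces the underlying restriction with the standard library alone: stock `asyncio` refuses to start a new event loop from inside one that is already running, which is the situation in a notebook cell. The function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, as in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by stock asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches `asyncio` so that such re-entrant calls are allowed, which is why it has to run before any nested `asyncio.run(...)` call.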

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Robyn and I have been working as a dental hygienist for 2 years now. I would like to apply for a dental practice. I have a Bachelor of Science in Public Health from the University of New Brunswick, and a certificate in dental hygiene. I will be entering the public health field next year. What are my next steps? What questions do you have?

---

### Re: Applying for a dental practice
---

Thank you so much for your support with my education. It was great to hear that you are an expert in dental hygiene, which is precisely what I hope to achieve as my next career endeavor.

I understand that
Prompt: The president of the United States is
Generated text:  a powerful man with the ability to make major decisions on an international level. However, he does not have the power to make his own decisions at home. This is because he does not have the authority to make decisions on his home, and his decisions are not reflected in his home. 

In addition, t
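For offline batch jobs, the results are usually written to disk rather than printed. The following is a minimal sketch that pairs each prompt with its `output['text']` and serializes the pairs as JSON Lines. `to_jsonl` and the example data are hypothetical, and the sketch assumes only the `text` field used above.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text, one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = [
        json.dumps({"prompt": prompt, "text": output["text"]}, ensure_ascii=False)
        for prompt, output in zip(prompts, outputs)
    ]
    return "\n".join(lines)


# Hypothetical outputs with the same shape as the results printed above.
example_prompts = ["The capital of France is"]
example_outputs = [{"text": " Paris."}]
print(to_jsonl(example_prompts, example_outputs))
```

With a real engine, `to_jsonl(prompts, outputs)` can be applied directly to the return value of `llm.generate(prompts, sampling_params)`.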

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [Age] year old [Gender] [Occupation]. I'm a [Skill] with [Number] years of experience in [Field]. I'm passionate about [What you do for a living] and I'm always looking for new opportunities to [What you're looking for]. I'm a [What you're good at] and I enjoy [What you do for a living]. I'm [What you're like] and I'm always ready to learn and grow. I'm [What you're like] and I'm always ready to learn and grow. I'm [What you're like

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower and the annual Eiffel Tower Festival. It is also the seat of the French government and the country's cultural and political capital. Paris is a bustling metropolis with a rich history and a diverse population of over 2 million people. The city is known for its art, architecture, and cuisine, and is a popular tourist destination. The city is also home to many museums, theaters, and other cultural institutions. Paris is a city of contrasts, with its modern skyscrapers and historical landmarks blending seamlessly into the cityscape. The city is also known for its annual festivals

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation: AI is likely to become more prevalent in various industries, with automation becoming more widespread. This could lead to the creation of new jobs, but also the creation of new opportunities for people to work in areas such as data analysis, machine learning, and software development.

2. AI will become more integrated with other technologies: AI will likely become more integrated with other technologies, such as the Internet of Things (IoT), blockchain, and quantum computing. This integration could lead to new applications and opportunities for AI, such as smart cities, autonomous vehicles, and personalized medicine.

3
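The `stream_and_merge` helper used above is described as removing overlap between streamed chunks. One way such overlap removal could work is sketched below. `strip_overlap` and `merge_chunks` are hypothetical names for illustration, and this is not claimed to be the helper's actual implementation.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already ends existing_text."""
    max_len = min(len(existing_text), len(new_chunk))
    for size in range(max_len, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks) -> str:
    """Concatenate streamed chunks, removing repeated overlap between them."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


print(merge_chunks(["The capital", " capital of France", " is Paris."]))
```

A suffix/prefix heuristic like this cannot tell an accidental repeat from an intentional one: a chunk that legitimately starts with the text that ended the previous chunk (for example `"ha"` followed by `"ha"`) would also be trimmed.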



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert name] and I am a [insert profession or occupation]. I bring a unique perspective and a strong work ethic to this role. I strive to meet deadlines and deliver high-quality work. I also have a passion for learning and constantly seek out new ideas and techniques to enhance my skills. I am excited to bring my experience and dedication to this project, and I am looking forward to working with you all. [Insert your name] [insert your profession or occupation] [insert your experience and achievements] [insert your role in the project] [insert your personal strengths and weaknesses] [insert your personality and attitude] [insert your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the western part of the country and known as the city of light and the city of music. Paris is the largest city
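When prompts arrive one at a time instead of as a ready-made list, the asynchronous API can be combined with standard `asyncio` tools. The sketch below issues one request per prompt and caps the number of in-flight requests with a semaphore. `StubEngine`, `generate_all`, and `max_in_flight` are hypothetical, and the stub only mimics `async_generate` so that the snippet runs without a GPU.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for sgl.Engine; it only mimics async_generate."""

    async def async_generate(self, prompt, sampling_params=None):
        await asyncio.sleep(0.01)  # simulate generation latency
        return {"text": f" <completion for {prompt!r}>"}


async def generate_all(engine, prompts, sampling_params, max_in_flight=2):
    """Issue one request per prompt, keeping at most max_in_flight running."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with semaphore:
            return await engine.async_generate(prompt, sampling_params)

    # gather returns results in argument order, not completion order.
    return await asyncio.gather(*(one(p) for p in prompts))


demo_prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
results = asyncio.run(generate_all(StubEngine(), demo_prompts, {"temperature": 0.8}))
for prompt, result in zip(demo_prompts, results):
    print(prompt, "->", result["text"])
```

The semaphore is only a client-side cap. Passing the whole list in one `async_generate` call, as in the cell above, leaves the batching to the engine.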

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name] and I am a/an [age] year old [occupation] from [city]. I am the best friend of [friend's name] and have been through a lot together. I love [reason why I like [friend's name]], I am [character's personality] and I am always there for [friend's name]. I enjoy [activity/thing] with [friend's name] and I am passionate about [interest/experience] that I am learning. I am always trying to help [friend's name] and spend time with them, even if it's just [briefly describe your activity]. I am



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

In 2020, Paris had a population of approximately 2.2 million people, and it is the most populous city in Europe.

Paris is known for its iconic architecture, rich cultural heritage, and vibrant street life, making it a popular tourist destination and a global hub for fashion, art, and music.

The city is also a major center of science, research, and innovation, and it hosts numerous museums, galleries, and scientific institutions.

Paris is an important center for education, research, and business, with numerous universities, professional training institutions, and companies headquartered there. It has also been



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to evolve and improve, with new applications and applications of AI being developed on a daily basis. Some potential future trends in AI include:

1. Increased focus on ethical and safety concerns: As AI becomes more prevalent in various sectors, there will be a greater emphasis on ethical and safety concerns. Governments and organizations will need to develop policies and guidelines to ensure the responsible use of AI, with a focus on minimizing risks and protecting people and machines.

2. Continued development of machine learning algorithms: As AI systems



In [6]:
llm.shutdown()
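
In a standalone script, an exception raised during generation would skip a trailing `llm.shutdown()` call. One possible pattern, sketched here with hypothetical names (`engine_session`, `StubEngine`), wraps the engine in a context manager so that `shutdown()` runs even when the body fails. It assumes only that the engine object exposes the `shutdown()` method used above.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine):
    """Yield the engine and always call its shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in that records whether shutdown() was called."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


stub = StubEngine()
try:
    with engine_session(stub) as engine:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass
print("shutdown called:", stub.closed)
```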