# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
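
The patch is needed because `asyncio.run()` refuses to start when an event loop is already running in the current thread, which is the situation inside IPython. The following stdlib-only sketch reproduces that behavior; `loop_is_running` and `nested_call` are hypothetical helper names used for illustration and are not part of SGLang or `nest_asyncio`.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe to call.
        return False
    return True


async def nested_call() -> str:
    # Inside a coroutine a loop is always running, so asyncio.run()
    # raises RuntimeError here unless the loop has been patched.
    coro = asyncio.sleep(0)
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"nested asyncio.run() failed: {exc}"
    return "nested asyncio.run() succeeded"


running_at_top_level = loop_is_running()
nested_result = asyncio.run(nested_call())
print(running_at_top_level)
print(nested_result)
```

In a plain Python script the first check reports that no loop is running, so the patch is only relevant when the check reports a running loop.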

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Laura. I'm an art student at the University of Wisconsin-Madison. I'm a big fan of the Beatles. They had many famous songs that I love. As you know, the Beatles made lots of music. I wanted to learn more about them. I got a student service job. My job was to help other students learn about the Beatles. One day, the Beatles' manager, the man in charge of the band, told me I was the right student to study for them. My boss gave me a nice little job to do. In this job, I would learn the songs and put them together. It was going
Prompt: The president of the United States is
Generated text:  an elected office. In 2020, the president of the United States was Donald Trump. The president of the United States was re-elected in 2024, when Donald Trump had just announced a new campaign campaign called the Trump Organization. A magazine named The Guardian has published a story on the 2020 election. It includes a quote from a senior United States senator e
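
For batch jobs it is often convenient to persist the results rather than print them. The loop above suggests that each output is a dict with the completion under the `"text"` key and that outputs line up with prompts by position. Under those assumptions, a minimal sketch for writing prompt/completion pairs as JSON Lines could look like this; `to_jsonl` and `fake_outputs` are hypothetical names, and the stand-in data replaces a real engine call.

```python
import json


def to_jsonl(prompts, outputs):
    """Serialize prompt/completion pairs as JSON Lines, one record per prompt."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    records = (
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    )
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Stand-in data shaped like the result of llm.generate(prompts, sampling_params).
example_prompts = ["The capital of France is", "The future of AI is"]
fake_outputs = [{"text": " Paris."}, {"text": " bright."}]

jsonl = to_jsonl(example_prompts, fake_outputs)
print(jsonl)
```

The length check guards against silently dropping records, since `zip` stops at the shorter input.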

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? I'm a [insert a brief description of your job or experience here]. I enjoy [insert a brief description of your hobbies or interests here]. What do you like to do in your free time? I enjoy [insert a brief description of your hobbies or interests here]. What do you like to do in your free time? I enjoy [insert a brief description of your hobbies or interests here]. What do you like to do in your free time? I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as "La Ville Flottante" or "La Ville Blanche" (White City). It is the largest city in France and the third largest in the world by population, with a population of over 2 million people. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and 

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human needs.

2. Greater emphasis on ethical and social considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical and social considerations. This could lead to more robust AI systems that are designed to be transparent,
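
The "overlap removal" mentioned in the banner can be pictured as follows: when a streamed chunk repeats text that has already been received, only the new part is appended. The sketch below is one possible way to do this and is not necessarily how `stream_and_merge` is implemented; `trim_overlap` and `merge_chunks` are names chosen for this illustration, and the chunk list is made up.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that is already a suffix of `existing`."""
    for size in range(min(len(existing), len(chunk)), 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Made-up chunks in which each one repeats the tail of the previous text.
chunks = ["The capital", " capital of France", " France is Paris."]
merged = merge_chunks(chunks)
print(merged)  # The capital of France is Paris.
```

A limitation of this approach is that a chunk which legitimately begins with the text the output already ends with is treated as overlap and shortened.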



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  __________. My name is Mary Jane Morgan, and I'm a medical student at __________ (insert university name). I'm currently in my third year of medical school, and I've gained valuable experience in various fields like __________ (list 3 medical professions you've been involved in). I've learned a lot from my time in the classroom, and I'm looking forward to continuing to develop my skills. What's the atmosphere like at the university? I'm excited to be in this environment and learn new things. What's your favorite activity to do when you're not studying? I enjoy going to the gym, practicing yoga,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, located in the northwestern region of the country. It is the largest city in France and has a population of over 6 million people. The city is known for its rich h
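
An awaitable call such as `async_generate` can be combined with other coroutines in the same event loop. The sketch below illustrates the general pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in that only mimics the call shape shown above, so the example runs without a model; whether issuing several calls concurrently helps with the real engine is not covered here.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; only mimics the call shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    """Await several batches concurrently and return results in batch order."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(FakeEngine(), batches, {"temperature": 0.8}))
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(prompt, "->", output["text"])
```

`asyncio.gather` returns results in the order the awaitables were passed, which keeps each result list aligned with its batch.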

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I am an experienced writer of fiction books, primarily focusing on action and adventure. I'm a quick thinker and can come up with fresh ideas, whether it's writing original stories or finding inspiration for a book proposal. I'm also a perfectionist and have a natural talent for crafting vivid descriptions and writing compelling characters. I'm adept at using descriptive language to bring my stories to life and help readers visualize the world I've created. I'm passionate about storytelling and believe that the ability to turn words into reality is one of the greatest gifts in life. I have a knack for keeping my audience engaged and will be

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the largest city in France, and the country's capital. It is a major cultural, economic, and political center. The city is home to many world-renowned landmarks, such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also known for its vibrant arts scene and diverse cuisine. Paris has a rich history, dating back thousands of years, and is home to many famous landmarks and attractions. Despite the challenges of political and economic instability, Paris remains an important cultural and economic center in France.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by:

1. Increased automation and efficiency: As AI becomes more powerful and accurate, there is the potential for machines to take over some of the more mundane tasks that humans do. This could lead to increased efficiency and productivity, as AI can process information much faster than humans.

2. Emergence of new forms of AI: With advancements in machine learning and other areas of AI, we may see the emergence of new forms of AI that go beyond traditional computer programs. These could include self-learning systems, emotional intelligence agents, and even hypothetical entities like superintelligence.

3. Growing importance of AI ethics: As AI becomes more
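
When the streamed text is needed after the loop as well as on screen, the chunks can be collected while they are printed. The following sketch shows that pattern with `async for`; `fake_stream` is a hypothetical async generator standing in for a streaming engine call, and `collect` is a name chosen for this illustration.

```python
import asyncio


async def fake_stream(text, chunk_size=4):
    """Hypothetical stand-in for a streaming call: yields text piece by piece."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start : start + chunk_size]


async def collect(stream):
    """Print chunks as they arrive and return the full text."""
    pieces = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        pieces.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(pieces)


full_text = asyncio.run(collect(fake_stream(" Paris is the capital of France.")))
```

Joining a list once at the end avoids repeated string concatenation for long generations.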




```python
llm.shutdown()
```
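
In a script, an exception raised during generation would skip a trailing `shutdown()` call. One way to guard against that is a small context manager that always shuts the engine down. This is a sketch only: `managed` is a hypothetical helper, and `FakeEngine` is a stand-in exposing just the `shutdown()` method used above so the example runs without a model.

```python
from contextlib import contextmanager


class FakeEngine:
    """Hypothetical stand-in exposing only the shutdown() call used above."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = FakeEngine()
try:
    with managed(engine):
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass  # the error is expected; the point is that shutdown() still ran

print(engine.is_shut_down)  # True
```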