# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in the form of a Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code where an event loop is already running, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
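
The likely reason this patch is needed is that asyncio refuses to start a second event loop while one is already running, which is the situation inside an IPython or Jupyter cell. The following standard-library sketch reproduces that error without SGLang; the coroutine names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # Inside this coroutine an event loop is already running,
    # which mirrors the situation in an IPython/Jupyter cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: not allowed by default
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are permitted.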

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Andrew and I'm a medical student. I am a licensed pharmacists. I have been studying for my MD for 3 years now and I'm ready to be a pharmacist. I was wondering if you could share your own personal experience as a pharmacist and what a typical day of working in a pharmacy might look like?

Certainly! As an AI language model, I don't have personal experience as a pharmacist, but I can tell you about the typical day of working in a pharmacy.

A typical day in a pharmacy could involve the following activities:

  1. Preparing prescriptions: The pharmacist will review a patient's medical history, drug
Prompt: The president of the United States is
Generated text:  a fellow of the______ of the United States.
A. American Academy of Arts and Sciences
B. American College of Physicians
C. American Philosophical Society
D. American Academy of Arts and Sciences
Answer: D

Which of the following statements about the requirements for auditing the income and 
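
If you want to post-process or checkpoint results incrementally, one option is to submit the prompt list in fixed-size slices. The sketch below assumes only what the cell above shows, namely that `generate(prompts, sampling_params)` returns one dict with a `"text"` key per prompt. `EchoEngine` and `generate_in_chunks` are hypothetical names; `EchoEngine` is a stand-in so that the sketch runs without a GPU, and a real `sgl.Engine` would take its place.

```python
from typing import Dict, Iterator, List


class EchoEngine:
    """Hypothetical stand-in for sgl.Engine; it only echoes the prompt."""

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict]:
        return [{"text": f" <completion of: {p}>"} for p in prompts]


def generate_in_chunks(engine, prompts, sampling_params, chunk_size=2) -> Iterator[Dict]:
    """Yield {"prompt", "text"} records, submitting chunk_size prompts per call."""
    for start in range(0, len(prompts), chunk_size):
        chunk = prompts[start : start + chunk_size]
        outputs = engine.generate(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            yield {"prompt": prompt, "text": output["text"]}


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
records = list(generate_in_chunks(EchoEngine(), prompts, {"temperature": 0.8, "top_p": 0.95}))
for record in records:
    print(record["prompt"], "->", record["text"])
```

Because `generate_in_chunks` is a generator, each record can be written to disk as soon as its chunk finishes.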

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to the French Parliament and the French National Museum of Modern Art. Paris is a bustling metropolis with a rich cultural heritage and is a popular tourist destination. The city is known for its diverse cuisine, including French cuisine, and is home to many famous French artists and writers. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. The city is also known for its fashion industry, with many famous fashion designers and boutiques located in the

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased integration with other technologies: AI is likely to become more integrated with other technologies, such as machine learning, natural language processing, and computer vision. This integration will enable AI to perform tasks that are currently only possible with human intelligence, such as image recognition, speech recognition, and decision-making.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with other technologies, there will be a greater emphasis on ethical considerations. This will include issues such as bias, transparency, accountability,
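
The "overlap removal" mentioned in the cell above refers to stitching streamed chunks together without repeating text that an earlier chunk already delivered. The following is a minimal sketch of that idea, not necessarily how `sglang.utils.stream_and_merge` is implemented; `trim_overlap_sketch` and `merge_chunks` are hypothetical names, and the chunk list is made up for illustration.

```python
def trim_overlap_sketch(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing any overlap with the text merged so far."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


# Made-up chunks: the second and third each repeat the tail of the text so far.
chunks = ["The capital of", " of France", "France is Paris."]
merged = merge_chunks(chunks)
print(merged)
```

A purely textual match like this would also swallow genuine repetition (for example, a chunk `"ha"` arriving after text that ends in `"ha"`), so it is best suited to streams whose chunks are known to overlap.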



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm a [Job Title] at [Company Name]. I'm passionate about [What you do for a living]. I've always loved to travel and explore new places, so I'm always eager to get my hands dirty and try new things. Whether it's visiting a new city, trying a new cuisine, or even just getting a new haircut, I'm always looking for new experiences. I've also been a lifelong learner, constantly seeking to improve myself and expand my knowledge. I enjoy sharing my experiences and knowledge with others, and I love helping others achieve their goals. And, of course, I love my

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a bustling metropolis known for its rich history, art, and vibrant culture. It's also the birthplace of the French Revolution and an important center for politics, art, and fashion. Paris has a 
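
One reason to prefer the asynchronous API is that the calling coroutine can await several requests concurrently or interleave them with other work. The sketch below illustrates this with `asyncio.gather`. `StubAsyncEngine` is a hypothetical stand-in whose `async_generate` merely sleeps and echoes, so the example runs without a GPU; it assumes, as in the cell above, that `async_generate` is awaitable and returns one dict with a `"text"` key per prompt.

```python
import asyncio
import time


class StubAsyncEngine:
    """Hypothetical stand-in for sgl.Engine; sleeps instead of running a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.2)  # simulated generation latency
        return [{"text": f" <completion of: {p}>"} for p in prompts]


async def run_two_batches(engine):
    batch_a = ["Hello, my name is"]
    batch_b = ["The capital of France is", "The future of AI is"]
    params = {"temperature": 0.8, "top_p": 0.95}
    # gather awaits both requests concurrently rather than one after the other
    return await asyncio.gather(
        engine.async_generate(batch_a, params),
        engine.async_generate(batch_b, params),
    )


start = time.perf_counter()
outputs_a, outputs_b = asyncio.run(run_two_batches(StubAsyncEngine()))
elapsed = time.perf_counter() - start
print(len(outputs_a), len(outputs_b), f"{elapsed:.2f}s")
```

With the stub, both requests overlap, so the elapsed time should be close to one simulated latency rather than two.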

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [profession, such as "doctor," "teacher," or "professor"]. I specialize in [field of study, such as "pediatrics," "psychology," or "education"]. I'm a [generally positive, such as "humanitarian," "professional," or "humane."] individual who is [generally optimistic, such as "optimistic," "energetic," or "ambitious."] with a passion for [field of interest, such as "learning," "coding," or "coding,"] and a desire to [goal, such as "helping others," "in



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is located in the northern region of the country, on the Loire River. It is one of the world's most populous cities and has a rich history dating back to the ancient Gauls and Romans. Paris is known for its stunning architecture, iconic landmarks such as the Eiffel Tower, and a vibrant cultural scene. The city is also home to many world-renowned museums and art galleries, including the Louvre and the Musée d'Orsay. Paris is a popular tourist destination and a major economic and financial center in the French Republic. Its status as the capital makes it the political, cultural, and economic



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  highly promising, with potential to revolutionize many industries and bring about a wide range of positive changes. Here are some of the possible future trends in AI:

1. Autonomous vehicles: One of the most significant potential future trends is the development of autonomous vehicles. These vehicles will be able to drive themselves without human intervention, reducing accidents and speeding up traffic. Autonomous vehicles will also improve the efficiency of the transportation industry, making it easier for people to access and use services like delivery and ride-sharing.

2. Language translation: AI will be able to translate languages faster and more accurately than ever before, making it easier for people to communicate with people who




In [6]:
llm.shutdown()
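
In a standalone script, one way to make sure the engine is released even when generation raises is to wrap the work in `try`/`finally`. The following sketch uses a hypothetical `StubEngine` so that it runs without a GPU; only the `shutdown()` call is taken from the cell above, and the simulated failure is made up for illustration.

```python
class StubEngine:
    """Hypothetical stand-in for sgl.Engine that records whether shutdown ran."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


engine = StubEngine()
error_message = None
try:
    try:
        engine.generate(["Hello, my name is"], {"temperature": 0.8})
    finally:
        # Runs whether or not generate() raised, so shutdown is never skipped.
        engine.shutdown()
except RuntimeError as exc:
    error_message = str(exc)

print(engine.is_shut_down, error_message)
```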