# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
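
The reason for this requirement can be illustrated without SGLang. Environments such as IPython and Jupyter already run an event loop, and `asyncio.run()` refuses to start a second one inside it. The sketch below uses only the standard library; `inner` and `outer` are illustrative names, and `outer` stands in for a notebook cell that is already executing inside a loop.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a nested loop
    except RuntimeError as exc:
        coro.close()  # the coroutine was never started, so close it explicitly
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` patches asyncio so that such nested calls are allowed, which is why the snippet above it is needed in notebook-style environments.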

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Luisa, I’m an 8th grade student, and I have a question about Italian restaurant chains. What are some Italian restaurants chains in the United States and what are some of their unique selling points?
Certainly! Here are some Italian restaurants chains in the United States along with their unique selling points:

1. **Bratelli** - Known for its authentic Italian cuisine with a focus on grilled meats and salads. Bratelli offers a variety of Italian dishes, including pasta, pizza, and sandwiches. They also have a strong emphasis on fresh ingredients.

2. **Alimentum** - Often associated with the "Art of Cooking
Prompt: The president of the United States is
Generated text:  trying to reduce the amount of plastic waste on the ocean. He decided to implement a policy that requires every single person to bring a reusable bag with them when they go to the store. Initially, 15% of the 30 million people in the United States brought reusable bags. The pol
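
For offline batch jobs, the generated text usually needs to be stored next to its prompt. The following is a minimal sketch, assuming only what the cell above shows: the outputs come back as a list of dictionaries with a `text` key, in the same order as the prompts. The helper names `to_records` and `to_jsonl` and the demo data are hypothetical and are not part of SGLang.

```python
import json


def to_records(prompts, outputs):
    """Pair each prompt with its generated text, preserving input order."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    return [
        {"prompt": prompt, "completion": output["text"]}
        for prompt, output in zip(prompts, outputs)
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


# Hypothetical stand-in for the list returned by llm.generate(prompts, sampling_params).
demo_prompts = ["The capital of France is", "The future of AI is"]
demo_outputs = [{"text": " Paris."}, {"text": " uncertain."}]

records = to_records(demo_prompts, demo_outputs)
print(to_jsonl(records))
```

With a real engine, `demo_outputs` would be replaced by the value returned from `llm.generate`.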

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? I'm a [job title] at [company name], and I'm passionate about [job title] and [job title] at [company name]. I'm always looking for new challenges and opportunities to grow and learn. What's your favorite hobby or activity? I love [hobby or activity], and I'm always looking for new ways to explore and discover new things. What's your favorite book or movie? I love [book/movie], and I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower and its rich history dating back to the Middle Ages. It is a bustling metropolis with a diverse population and a rich cultural heritage. The city is home to many famous landmarks such as the Louvre Museum, Notre-Dame Cathedral, and the Palace of Versailles. Paris is also known for its fashion industry, with many famous designers and boutiques. The city is a popular tourist destination and a major economic center in Europe. It is home to many important institutions such as the French Academy of Sciences and the French National Library. Paris is a city of contrasts, with its modern

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased integration with other technologies: AI is likely to become more integrated with other technologies, such as machine learning, natural language processing, and computer vision. This integration will enable AI to perform tasks that are currently performed by humans, such as image and speech recognition, autonomous driving, and personalized medicine.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with other technologies, there will be increased pressure to address ethical concerns, such as bias, transparency, and accountability. This will lead
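
The phrase "overlap removal" can be illustrated with a small standalone sketch. It assumes that consecutive streamed chunks may repeat the tail of the text received so far, and it drops the repeated prefix before appending. This is an illustration of the idea only; the actual `stream_and_merge` helper may be implemented differently, and `merge_chunks` and the demo chunks are hypothetical.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats part of the previous text.
demo_chunks = ["Hello wor", "world", "d!"]
print(merge_chunks(demo_chunks))
```

One limitation of this approach is that a chunk which legitimately begins with the same characters that end the previous text is shortened as well, so it is only appropriate when overlaps are expected in the stream.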



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name], and I'm a [Your Age, Location, or Profession] who started my own website last year. Currently, I'm [Your Current Job or Current Role] and I'm passionate about [Your Passion]. As a [Your Skill], I'm constantly learning new things and exploring new opportunities. I enjoy taking on challenges and being a leader in my field. What's your most notable achievement or accomplishment to date? Thank you! As an AI language model, I don't have personal experiences or achievements in the same way that a human does. However, I am programmed to understand and respond to various prompts, which allows

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the largest city in France and the capital of France. It is located in the Loire Valley region of southern France and is the economic, cultural, and p
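
Because `async_generate` is awaitable, several batches can be submitted from the same coroutine and awaited together. The sketch below shows that pattern with `asyncio.gather`. `FakeEngine` is a hypothetical stand-in that only mimics the output shape shown above (a list of dictionaries with a `text` key), and `run_batches` is an illustrative helper; whether concurrent submission helps with a real engine depends on the engine and is not something this sketch demonstrates.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; it only mimics the output shape."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate inference latency
        return [{"text": f" <completion for: {prompt}>"} for prompt in prompts]


async def run_batches(engine, batches, sampling_params):
    """Submit every batch at once and wait for all of them."""
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    # gather returns results in the same order as the submitted tasks.
    return await asyncio.gather(*tasks)


demo_batches = [
    ["Hello, my name is"],
    ["The capital of France is", "The future of AI is"],
]
results = asyncio.run(run_batches(FakeEngine(), demo_batches, {"temperature": 0.8}))

for batch, outputs in zip(demo_batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"{prompt!r} -> {output['text']!r}")
```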

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  Emily and I am a friendly and approachable person. I love to laugh, talk to people, and learn new things. I'm always looking for ways to make people smile and have fun. If you have any questions or need anything, please don't hesitate to reach out. My favorite thing about me is that I have a kind heart and I am always ready to help others. Thank you for considering me for a connection.

Emily, here to answer any questions you might have.

(Emily walks away, looking friendly and positive)

Emily's friendly and approachable personality is evident in her self-introduction. She uses a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, a historic city with a rich and diverse cultural history and architecture. It is the largest city in Europe and a major economic and political center, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also famous for its vibrant cultural scene, including jazz music, theater, and film, and its famous fashion industry. It has played an important role in French history and culture, including being the location of the French Revolution and being the site of the coronation of King Louis XIV. Today, Paris remains a major global hub, with many famous museums, landmarks,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be a rapidly evolving field driven by advances in computational power, data science, and machine learning. Some potential future trends in AI include:

1. Enhanced Intelligence: One of the most promising areas of AI is enhanced intelligence. This involves increasing the cognitive abilities of machines, including learning from experiences, reasoning, and problem-solving. This could lead to breakthroughs in areas such as natural language processing, robotics, and drug discovery.

2. Autonomous Systems: With the development of advanced AI, autonomous systems will become more common. These systems will be able to make decisions and take actions without human intervention, leading to safer and more efficient work environments
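
When the streamed text is needed after printing, the chunks can be accumulated while they are displayed. The sketch below shows this with a plain async generator. `fake_stream` is a hypothetical stand-in for a stream of cleaned chunks such as the one consumed in the cell above, and `collect_stream` is an illustrative helper, not an SGLang function.

```python
import asyncio


async def fake_stream(text, chunk_size=4):
    """Hypothetical stand-in that yields a text in small chunks."""
    for start in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield text[start:start + chunk_size]


async def collect_stream(stream):
    """Print chunks as they arrive and return the complete text."""
    parts = []
    async for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # newline after the stream ends
    return "".join(parts)


full_text = asyncio.run(collect_stream(fake_stream(" Paris, a historic city.")))
```

Collecting the parts in a list and joining them once at the end avoids repeated string concatenation for long generations.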




In [6]:
llm.shutdown()