# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
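
The reason this matters can be illustrated with the standard library alone. The following is a minimal sketch, not SGLang code: `inner` and `outer` are hypothetical names, and `outer` stands in for an environment such as IPython or Jupyter where an event loop is already running. Calling `asyncio.run()` from inside that loop fails, which is the situation `nest_asyncio.apply()` is meant to work around.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, much like a notebook cell.
    coro = inner()
    try:
        # Starting a second loop from within a running one is rejected by asyncio.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio` applied, the nested call is permitted instead of raising, so the notebook-style cells in this document can call `asyncio.run(main())` directly.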

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Samantha, a 23 year old American citizen. I'm at the front of the line on the 2020 US presidential election, and I'm looking forward to voting. After researching the candidates, I'm a huge fan of Michelle Obama, and I voted for her during her 2008 presidential campaign. 

I don't know how I feel about the upcoming election. It's exciting, and I have good reasons to vote for a candidate. I have a lot of personal and professional relationships with a number of other candidates who I like. I feel that it's important to vote for candidates that are trustworthy and
Prompt: The president of the United States is
Generated text:  a person who is in charge of the country. President Obama is a man in charge of the United States. He was the first African American to be the president of the United States. Obama has been president since 2009. He was born in Chicago, Illinois, in 1961. He was the youngest of five children. He grew up in a large family and t
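
For offline batch jobs, the generated text usually needs to be stored rather than printed. The following is a minimal sketch of writing prompt and completion pairs as JSON Lines. It assumes, as the `zip(prompts, outputs)` loop above suggests, that the outputs arrive in the same order as the prompts and that each output is a dict with a `"text"` key. The helper `to_jsonl` is hypothetical and the output values are placeholders, not real model output.

```python
import json

# Stand-in data shaped like the cell above: one dict with a "text" key per prompt.
prompts = ["Hello, my name is", "The capital of France is"]
outputs = [{"text": " Ada."}, {"text": " Paris."}]  # placeholders, not real model output


def to_jsonl(prompts, outputs):
    """Pair each prompt with its completion and serialize one JSON object per line."""
    if len(prompts) != len(outputs):
        raise ValueError("expected exactly one output per prompt")
    lines = [
        json.dumps({"prompt": p, "completion": o["text"]}, ensure_ascii=False)
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)


jsonl = to_jsonl(prompts, outputs)
print(jsonl)
```

The length check is a cheap guard against silently misaligned pairs, since `zip` would otherwise truncate to the shorter list.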

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about your career and interests. What can you tell me about yourself? [Name] is a [job title] at [company name]. I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history dating back to the Roman Empire and the Middle Ages. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also famous for its fashion industry, art scene, and cuisine. Paris is a vibrant and diverse city with a rich cultural heritage and a strong sense of identity. It is a popular tourist destination and a major economic and financial center in Europe. The city is home to many world-renowned museums, theaters, and art galleries. Paris is a city of

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. Some potential trends include:

1. Increased integration with human intelligence: As AI becomes more sophisticated, it is likely to become more integrated with human intelligence, allowing for more complex and nuanced interactions.

2. Greater emphasis on ethical considerations: As AI becomes more prevalent in various industries, there will be a greater emphasis on ethical considerations and regulations to ensure that AI is used in a responsible and beneficial way.

3. Development of more advanced AI systems: As AI technology continues to advance, there will be an increased focus on developing more advanced AI
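
The "overlap removal" mentioned in the cell above can be illustrated in plain Python. This is a sketch of the general idea only, under the assumption that a streamed chunk may repeat the tail of the text received so far; it is not presented as the actual implementation of `stream_and_merge`. The helpers `trim_overlap` and `merge_chunks` are hypothetical names used for illustration.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that duplicates the end of existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks while removing repeated text at the seams."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks(["The capital", "capital of France", " is Paris"])
print(merged)
```

One trade-off of this approach is that it cannot tell an accidental overlap from a genuine repetition: if the model legitimately emits the same characters twice across a chunk boundary, the second copy is trimmed as well.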



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text: ... [insert character's name] and I'm an AI language model created by Anthropic. I'm a computer program designed to assist and provide information to users. I'm always here to help and answer any questions you may have. As an AI, I'm programmed to learn from the data I receive and improve my performance over time. My main goal is to assist and improve the lives of the people who use me, by providing them with accurate and helpful information. So if you have any questions or need any help, feel free to ask, and I'll do my best to provide you with the information you need. At Anthropic

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is the cultural, intellectual, and political center of the country and is renowned for its art, architecture, and cuisine. The city has a rich history and has been a key ce
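
The cell above awaits a single call for the whole prompt list. When requests arrive one at a time, for example in a custom server, a common asyncio pattern is to issue them concurrently and cap how many are in flight. The following is a minimal sketch of that pattern. `fake_async_generate` is a hypothetical stand-in so the example runs without a model, and `generate_all` and `max_in_flight` are illustrative names, not part of SGLang.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    # Hypothetical stand-in for an async generation call; it only echoes the prompt.
    await asyncio.sleep(0.01)
    return {"text": f" <completion for: {prompt}>"}


async def generate_all(prompts, sampling_params, max_in_flight=2):
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        # At most max_in_flight requests run at the same time.
        async with semaphore:
            return await fake_async_generate(prompt, sampling_params)

    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(one(p) for p in prompts))


prompts = ["A", "B", "C"]
results = asyncio.run(generate_all(prompts, {"temperature": 0.8, "top_p": 0.95}))
for prompt, result in zip(prompts, results):
    print(prompt, "->", result["text"])
```

Whether a client-side cap is useful depends on the workload, since the engine already schedules batched requests itself. The sketch only shows how the asyncio side can be structured.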

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm an [Age] year old [Occupation]. I'm currently [Current Location] and I'm [Current Job]. I enjoy [Reason for Job] and I spend a lot of time [Favorite Activity]. What brings you to this location and what do you do there?

As an AI language model, I do not have a physical existence or a personal life, so I don't have a name, age, occupation, location, or personal activities. However, I can assist you with any questions you have about my capabilities, such as my abilities to generate text, respond to queries, or perform specific tasks

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. Here's an explanation for anyone without formal education:

Paris is the capital of France and is a major European city. It's the largest city in France and has a rich history and culture. The city is famous for its beautiful architecture, including the Eiffel Tower and the Louvre Museum. It's home to many famous people, including actors, musicians, and writers. Paris is also known for its cuisine, including its famous French fries and its traditional French breakfasts. The city is often called "the city of love" due to its romantic atmosphere. France's capital city is Paris.

The process of arriving at

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be a highly competitive and rapidly evolving field, driven by the increasing complexity and sophistication of algorithms, data, and systems that are being developed. Here are some possible future trends in AI:

1. Increased use of AI in healthcare: As AI becomes more prevalent in healthcare, it will be used to improve patient care and treatment outcomes. This could involve the use of AI-powered diagnostic tools, predictive analytics, and personalized treatment plans.

2. Development of AI-powered virtual assistants: As AI continues to advance, it is expected that we will see more AI-powered virtual assistants and personal assistants being developed. These devices will be able to understand natural




In [6]:
llm.shutdown()
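
In a script, as opposed to a notebook, it can be useful to guarantee that `shutdown()` runs even when generation raises. The following is a minimal sketch of one way to do that with a context manager. `StubEngine` is a hypothetical stand-in that only records whether `shutdown()` was called, and `engine_session` is an illustrative helper, not an SGLang API.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in for the engine; only records whether shutdown ran."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompt):
        raise RuntimeError("simulated failure during generation")

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() when the block exits."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with engine_session(engine) as llm:
        llm.generate("Hello")
except RuntimeError as exc:
    print("generation failed:", exc)

print("shut down:", engine.is_shut_down)
```

A plain `try`/`finally` around the generation code achieves the same result. The context manager simply keeps the cleanup next to the place where the engine is acquired.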