# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Jessica and I am 16. How would you describe yourself?

I am a creative, driven, and curious individual who enjoys exploring new ideas and learning new things. I am passionate about technology and being a part of the tech industry. I have always been fascinated by the future of the internet and how it will revolutionize the way we live, work, and communicate. I am always looking for new and exciting things to try and have fun while I do it. I believe that everyone has the potential to make a positive impact on the world and that every small action we take can make a big difference.

In terms of skills, I
Prompt: The president of the United States is
Generated text:  a president of the United States, while a senator is a senator of the United States.  Given the paragraphs above, in which department would you find a president of the United States?  A. legislative  B. executive  C. judicial  D. media
The answer is:
B. executive
You are an AI assis
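
For offline batch jobs it is often convenient to persist the results. The loop above reads each result through `output['text']`, so the sketch below pairs every prompt with that field and serializes the pairs as JSON Lines. `to_jsonl` and the example data are hypothetical; only the `"text"` key is taken from the cell above, and any other fields the engine may return are ignored.

```python
import json


def to_jsonl(prompts, outputs):
    """Pair each prompt with its generated text and serialize as JSON Lines."""
    if len(prompts) != len(outputs):
        raise ValueError("prompts and outputs must have the same length")
    lines = []
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


if __name__ == "__main__":
    # Stand-in for the list returned by llm.generate(prompts, sampling_params).
    example_prompts = ["The capital of France is"]
    example_outputs = [{"text": " Paris."}]
    print(to_jsonl(example_prompts, example_outputs))
```

Each line of the result is an independent JSON object, so the file can be appended to or processed one record at a time.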

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also a major cultural and economic center, with a rich history dating back to the Roman Empire and the French Revolution. Paris is home to many famous museums, including the Louvre, the Musée d'Orsay, and the Musée d'Art Moderne. The city is also known for its fashion industry, with Paris Fashion Week being one of the largest in the world. Paris is a popular tourist destination, with millions of visitors annually. It is also a major hub for international

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased automation and efficiency: AI is expected to continue to automate a wide range of tasks, from manufacturing and transportation to customer service and healthcare. This will lead to increased efficiency and productivity, as machines can perform tasks that would otherwise require human intervention.

2. Enhanced human-machine collaboration: AI will continue to improve its ability to understand and interpret human language, emotions, and intentions. This will enable machines to better understand and respond to human needs and preferences, leading to more effective and empathetic interactions.

3. AI will become more integrated with other technologies: AI will continue to be integrated with
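
The cell above mentions "overlap removal" but does not show how `stream_and_merge` does it. As a rough illustration of the idea only, the sketch below drops the part of a new chunk that repeats the end of the text accumulated so far. `strip_overlap` and `merge_chunks` are hypothetical names, and this is not necessarily how SGLang implements it.

```python
def strip_overlap(existing: str, chunk: str) -> str:
    """Drop the longest prefix of `chunk` that already ends `existing`."""
    max_len = min(len(existing), len(chunk))
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, removing text repeated across chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


if __name__ == "__main__":
    print(merge_chunks(["Hello wor", "world", "!"]))
```

One limitation of this simple approach is that a genuine repetition in the model output, such as `"ha"` followed by `"ha"`, is indistinguishable from an overlap and would be collapsed.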



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a highly intelligent, analytical person who is always looking for ways to solve problems. I have a natural aptitude for problem-solving and critical thinking, and I enjoy using my brain to come up with creative solutions to complex problems. I also enjoy learning new things and constantly exploring new ideas, which has led me to become a freelance writer, journalist, and editor. In my spare time, I enjoy reading, playing video games, and spending time with my friends and family. I am passionate about learning and growing, and I believe that this passion will help me become a more effective and successful person. Thank you for

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Paris is the capital of France and the largest city in the country, located on the Left Bank of the Seine in the 
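
Because the asynchronous call is awaited, it can be combined with other coroutines in the same event loop. The sketch below shows a common asyncio pattern for this, running one request per prompt while capping how many are in flight. `fake_generate` is a hypothetical stand-in for an awaitable engine call so that the example runs without a model; it does not reproduce SGLang's API.

```python
import asyncio


async def fake_generate(prompt: str) -> dict:
    """Hypothetical stand-in for an awaitable engine call."""
    await asyncio.sleep(0)  # yield control, as a real request would while waiting
    return {"text": f" <completion for: {prompt}>"}


async def generate_all(prompts, limit: int = 2):
    """Run one request per prompt with at most `limit` in flight at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def one(prompt):
        async with semaphore:
            return await fake_generate(prompt)

    # gather() returns results in the same order as its arguments.
    return await asyncio.gather(*(one(p) for p in prompts))


if __name__ == "__main__":
    for result in asyncio.run(generate_all(["a", "b", "c"])):
        print(result["text"])
```

Passing the whole prompt list to a single call, as in the cell above, leaves scheduling to the engine; a per-request pattern like this one is mainly useful when requests arrive at different times.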

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [profession or occupation]. I enjoy [mention something interesting or unique about your profession] and [mention any hobbies or interests you have]. Throughout the day, I like to [describe something enjoyable or engaging activity you enjoy doing]. I hope to be a [mention a new skill or level of expertise you aim to develop] in the field I'm in, and I believe that will allow me to [explain why you think this will be helpful to the character]. I hope to achieve this by [what specific action or goal you have in mind]. Overall, I am [write a brief description of yourself,

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light. It is the largest city in France by area and population, and is a UNESCO World Heritage site. The city is known for its rich history, beautiful architecture, and vibrant culture, and is one of the most visited tourist destinations in the world. Paris is also a major financial hub, home to many of the world’s largest and most influential institutions, including the European Central Bank and the French National Anti-Corruption Agency. The city is also famous for its fashion industry, with Paris Fashion Week being one of the largest and most prestigious events in the world.

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to continue to evolve and change as technology advances. Some possible trends include:

1. Increased Automation: As AI becomes more sophisticated and capable of performing tasks that were previously done by humans, we can expect to see more automation in areas like healthcare, manufacturing, and transportation. AI systems will be able to perform tasks that are currently done by humans, and will be able to perform tasks more efficiently than humans.

2. AI Will Be More Human-Friendly: As AI becomes more sophisticated, we can expect to see more AI systems that are designed to be more human-like. For example, AI systems that can empathize with human emotions and
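
The streaming cell above prints each chunk as it arrives. When the full text is needed afterwards, the chunks can also be collected while iterating. The sketch below shows that pattern with a hypothetical `fake_stream` async generator standing in for the real chunk stream, so it runs without a model.

```python
import asyncio


async def fake_stream(text: str, size: int = 4):
    """Hypothetical stand-in for an async stream of text chunks."""
    for start in range(0, len(text), size):
        await asyncio.sleep(0)  # yield control between chunks
        yield text[start:start + size]


async def collect(stream) -> str:
    """Consume an async iterator of text chunks and return the joined text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    print(asyncio.run(collect(fake_stream("The capital of France is Paris."))))
```

Appending to a list and joining once at the end avoids repeated string concatenation for long outputs.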




In [6]:
llm.shutdown()
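
If an exception is raised before `llm.shutdown()` is reached, the engine is never shut down. One way to guard against this is a small context manager that always calls `shutdown()` on exit. In the sketch below, `engine_session` and `StubEngine` are hypothetical names; the stub exists only so the example runs without loading a model.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine):
    """Yield `engine` and call its shutdown() on exit, even if an error occurs."""
    try:
        yield engine
    finally:
        engine.shutdown()


class StubEngine:
    """Hypothetical stand-in for an engine object with a shutdown() method."""

    def __init__(self):
        self.closed = False

    def shutdown(self):
        self.closed = True


if __name__ == "__main__":
    stub = StubEngine()
    with engine_session(stub) as llm:
        pass  # with a real engine, call llm.generate(...) here
    print(stub.closed)
```

With a real engine, the object created by `sgl.Engine(...)` would take the place of `StubEngine()`.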