# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
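
To see why the patch is needed: `asyncio.run` refuses to start while an event loop is already running, which is the situation inside IPython or Jupyter. The following standard-library-only sketch reproduces that situation by calling `asyncio.run` from inside a coroutine. The names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # An event loop is already running here, as it is inside IPython/Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by plain asyncio
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` beforehand patches asyncio so that such nested calls are allowed.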

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")





### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Karl. I am a foreign student from Mexico. I am now in the United States for the beginning of my third year at a university. I have a lot of homework and tests to do and I also have to take care of myself in case of the flu or cold. I have always been a nice person and I like to help others, but I have some problems. I have an English teacher and a history teacher in school. I have never met my history teacher before. And I have never met my English teacher either. I know she speaks English. I have never spoken to her before. I am a bit worried about the flu
Prompt: The president of the United States is
Generated text:  attempting to establish a new policy that will impact the global economy. The policy requires that all countries agree to a minimum level of carbon emissions per capita. The president has identified two countries, Country A and Country B, and has been tracking their emissions levels. Country A has an average carbon emissions of 

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also the birthplace of many famous French artists and writers, including Pablo Picasso and Vincent van Gogh. Paris is a bustling metropolis with a rich cultural heritage and is a popular tourist destination. It is also known for its diverse cuisine, including French cuisine, and its annual Eiffel Tower Festival. Paris is a city of contrasts, with its historic architecture and modern fashion, and is a major hub for business and commerce. It is a city of art, culture, and history,

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical considerations: As AI becomes more integrated into our daily lives, there will be a growing emphasis on ethical considerations. This includes issues such as bias, transparency, accountability, and privacy. As a result, there will be a push for more robust ethical guidelines and standards for AI development and deployment.

2. Greater integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more sophisticated and nuanced decision-making. This could lead to a more personalized and
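
The "overlap removal" mentioned in the cell above can be illustrated in plain Python. This is a minimal sketch of one way to merge streamed chunks whose beginnings may repeat text that was already received; it is not necessarily how `stream_and_merge` is implemented. The helper names `trim_overlap` and `merge_chunks` and the sample chunks are assumptions made for illustration.

```python
def trim_overlap(existing_text: str, new_chunk: str) -> str:
    """Return the part of new_chunk that does not repeat the tail of existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, dropping text that overlaps with what is already merged."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Illustrative chunks: each one repeats the end of the previous text.
chunks = [" Paris", " Paris is", " is the capital", " capital of France."]
print(merge_chunks(chunks))
```

A suffix/prefix match like this cannot tell an accidental repeat from text the model really generated twice, so it is best treated as a heuristic.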



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Your Name] and I am a friendly and outgoing person who enjoys spending time with people. I have a great sense of humor and am always looking for new experiences to try. I enjoy meeting people and learning about their interests and hobbies. I am always eager to make new friends and have a good time. If you need anything or need to chat, just let me know. Happy to meet you! [Your Name] [Your Contact Information] [Your Online Presence] [Your Interests and Hobbies] [Your Favorite Things to Do] [Your Favorite Movie, TV Show, or Book] [Your Favorite Place to Travel]

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

Does this next sentence follow, given the preceding text? Paris is the largest city in France.

OPTIONS: [i] yes. [ii] no.
[ii] no. While Paris is the largest city in France, it is not the largest

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a 28-year-old software developer with over five years of experience in the industry. I have a passion for innovation and am always looking to improve my skills. My work is focused on creating high-quality software that meets the needs of my clients. I am a team player and enjoy working with others to achieve our goals. I am excited to bring my experiences and skills to work with you and help you achieve your goals. [Name] [Company Name] - [Your title] - [Company name]

Hey, I'm [Name]! I'm a software developer with [Number] years of experience in



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Light and the City of Love. It is the largest city in Europe by population and is a major center of culture, science, and higher education. The city is known for its iconic architecture, including the Eiffel Tower and Notre-Dame Cathedral, as well as its annual fashion and food fairs. Paris is also home to the world's largest library, the Bibliothèque nationale de France, and the Louvre Museum, a symbol of French art and culture. The city is a major transportation hub, with a well-developed transportation network that includes high-speed trains and a network



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by increasing integration with other technologies, both in terms of hardware and software, and also in the integration of data and information systems. Some of the possible trends in AI include:

1. AI becoming more general-purpose: As AI can now process vast amounts of data, it is becoming more general-purpose, capable of solving complex problems in a variety of applications. This means that AI could be used to help solve problems that were previously impossible, such as climate change or predicting the spread of disease.

2. AI becoming more ubiquitous: The integration of AI with other technologies, such as sensors and IoT devices, could make AI more


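
The asynchronous streaming pattern above, consuming chunks with `async for` and printing only the new text, can be tried without a model. The sketch below is an assumption-laden illustration: `fake_stream` is a hypothetical stand-in for a streaming engine call and is assumed to yield the cumulative text so far in each chunk, while `deltas` is an illustrative helper, not an SGLang function.

```python
import asyncio


async def fake_stream():
    # Hypothetical stand-in for a streaming engine call: each item carries
    # the cumulative text generated so far (an assumption for this sketch).
    for text in [" The", " The future", " The future of AI"]:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield {"text": text}


async def deltas(stream):
    """Yield only the text that is new in each cumulative chunk."""
    seen = ""
    async for chunk in stream:
        text = chunk["text"]
        if text.startswith(seen):
            new_part = text[len(seen):]
            seen = text
        else:
            # Chunk is not cumulative; treat it as an increment.
            new_part = text
            seen += text
        yield new_part


async def collect():
    return [piece async for piece in deltas(fake_stream())]


pieces = asyncio.run(collect())
print(pieces)
```

Yielding increments rather than full text lets the caller print with `end=""` and `flush=True`, as the cell above does, without repeating earlier output.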


In [6]:
llm.shutdown()