# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio

If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:

```python
import nest_asyncio

nest_asyncio.apply()
```
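The need for this patch can be illustrated with the standard library alone. The sketch below does not use SGLang or `nest_asyncio`; it only shows that `asyncio.run()` refuses to start when an event loop is already running, which is the situation inside IPython or Jupyter. The function names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # An event loop is already running at this point, as it is in IPython/Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: rejected by plain asyncio
    except RuntimeError as exc:
        coro.close()  # the coroutine was never started, so close it explicitly
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` beforehand patches asyncio so that such nested calls are permitted.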

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# One call handles the whole batch; outputs come back in prompt order
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Bruce and I'm a software developer working at Amazon.com. I love to code, I like to help people, and I enjoy working with people and ideas.
I'm a highly motivated and excited developer with a knack for developing software systems that can make life easier for people.
Currently, I'm working at Amazon as a Software Developer. I'm really looking forward to a job where I can use my coding skills to solve problems and make people's lives better.
I love to work on new and exciting things and I'm always looking for ways to improve myself and my skills.
I'm also very active in the community and enjoy participating in coding
Prompt: The president of the United States is
Generated text:  trying to decide how many military bases to build in different countries around the world. He has decided to build 3 military bases in Europe and 3 military bases in Asia. If he can choose from a total of 10 cities available for military bases, how many different ways c
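As the output above shows, each result is a dictionary whose `text` field holds only the continuation, usually starting with a space. A minimal sketch of pairing prompts with their continuations follows. `StubEngine` and `collect_results` are hypothetical names; the stub only imitates the list-of-dicts shape returned by `llm.generate` in the cell above and does not run a model.

```python
from typing import Dict, List


class StubEngine:
    """Hypothetical stand-in that mimics the output shape of llm.generate."""

    def generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict[str, str]]:
        return [{"text": f" <completion for: {p}>"} for p in prompts]


def collect_results(engine, prompts: List[str], sampling_params: Dict) -> List[Dict[str, str]]:
    """Pair each prompt with its continuation and the concatenated full text."""
    outputs = engine.generate(prompts, sampling_params)
    return [
        {"prompt": p, "text": o["text"], "full_text": p + o["text"]}
        for p, o in zip(prompts, outputs)
    ]


records = collect_results(
    StubEngine(),
    ["Hello, my name is", "The capital of France is"],
    {"temperature": 0.8, "top_p": 0.95},
)
for record in records:
    print(record["full_text"])
```

With the real engine, `StubEngine()` would be replaced by the `llm` object created earlier.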

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # Stream the response and merge the chunks into a single string
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about you. What can you tell me about yourself? [Name] is a [job title] at [company name]. I'm excited to meet you and learn more about

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, the city known for its iconic Eiffel Tower and the annual Eiffel Tower Festival. It is the largest city in France and the second-largest city in the European Union. Paris is also the birthplace of the French Revolution and the home of the Louvre Museum. The city is known for its rich history, beautiful architecture, and vibrant culture. It is a popular tourist destination and a major economic center in France. Paris is home to many famous landmarks and museums, including the Louvre, Notre-Dame Cathedral, and the Champs-Élysées. The city is also known for its cuisine, including French cuisine

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction. Here are some of the most likely trends:

1. Increased focus on ethical AI: As more people become aware of the potential risks of AI, there is a growing emphasis on developing AI that is more ethical and responsible. This could mean developing AI that is designed to minimize harm to individuals and society as a whole, or that is designed to be transparent and accountable.

2. Greater use of AI in healthcare: AI is already being used in healthcare to improve patient outcomes and reduce costs. As the technology continues to evolve, it is likely to be
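The "overlap removal" mentioned above can be pictured with a small standalone sketch. This is an illustration of the general idea only, not the implementation of `stream_and_merge`; the helper names `trim_overlap` and `merge_chunks` are hypothetical. The assumption is that a streamed chunk may repeat text that has already been received, so the longest suffix of the accumulated text that matches a prefix of the new chunk is dropped before appending.

```python
from typing import Iterable


def trim_overlap(existing: str, chunk: str) -> str:
    """Return `chunk` without the prefix that already ends `existing`."""
    max_len = min(len(existing), len(chunk))
    # Try the longest possible overlap first
    for size in range(max_len, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate chunks, skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged_text = merge_chunks(["The capital", " capital of", " of France"])
print(merged_text)
```

One limitation of this simple approach is that a legitimate repetition at a chunk boundary is also treated as overlap and removed.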



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch without blocking the event loop
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I am a [job title] at [company name]. I have been working at this company for [number of years] years, and I am passionate about [reason why you love your job]. If you have any questions about [company name], feel free to ask me!

[Name]: Hola! Me llamo [Name]. Soy [job title] en [company name]. Desde [number of years] hace tiempo que me siento muy cómodo trabajando en [company name], y realmente me encanta lo que hace. Si tienes alguna pregunta sobre [company name], por favor no dudes en hacer

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, known as the City of Love for its romantic atmosphere and numerous museums, including the Louvre. The city's architecture and cuisine are renowned for their elegant and traditional style, while the city itself is a UNESCO World Heritage site. The city is
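Because the call is awaitable, it can be combined with other coroutines in the same event loop. The sketch below awaits two batches together with `asyncio.gather`. `StubAsyncEngine` and `run_two_batches` are hypothetical names; the stub only imitates the call shape of `llm.async_generate` shown above, and how the real engine schedules overlapping calls is not covered here.

```python
import asyncio
from typing import Dict, List


class StubAsyncEngine:
    """Hypothetical stand-in that mimics the call shape of llm.async_generate."""

    async def async_generate(self, prompts: List[str], sampling_params: Dict) -> List[Dict[str, str]]:
        await asyncio.sleep(0.01)  # simulate waiting on generation
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def run_two_batches(engine, batch_a, batch_b, sampling_params):
    # gather awaits both coroutines and returns results in argument order
    outputs_a, outputs_b = await asyncio.gather(
        engine.async_generate(batch_a, sampling_params),
        engine.async_generate(batch_b, sampling_params),
    )
    return [o["text"] for o in outputs_a], [o["text"] for o in outputs_b]


texts_a, texts_b = asyncio.run(
    run_two_batches(
        StubAsyncEngine(),
        ["Hello, my name is"],
        ["The capital of France is", "The future of AI is"],
        {"temperature": 0.8, "top_p": 0.95},
    )
)
print(len(texts_a), len(texts_b))
```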

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I am a [Career/Position] at [Company Name]. In my spare time, I enjoy [Your Interests/Activities]. What kind of experiences do you think I can look up in my resume? Sure, feel free to share a bit about yourself, and I'll do my best to help you write a concise and positive self-introduction. Good luck! [Name] [Company Name] [LinkedIn Profile] [Resume] [Email Address]

Note: I have been exploring different industries and am not bound to a specific company or career path. Please keep your responses professional and to the point. I'm



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the historical, cultural, and economic center of the country. It is the world's oldest capital city and is home to many of the country's most important landmarks, including the Eiffel Tower and the Notre-Dame Cathedral. Paris is also known for its rich culinary traditions, vibrant nightlife, and diverse museums and art galleries. Paris is a bustling metropolis with a rich history, vibrant culture, and international appeal. The French people have a strong sense of pride in their city and celebrate its culture, art, and traditions. Paris is a major hub for commerce, education, entertainment, and food, and continues to be a


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be characterized by rapid progress and integration of new technologies, making it more accessible and useful. Here are some possible future trends in artificial intelligence:

1. Increased efficiency: As AI technologies continue to advance, the efficiency of AI systems will increase exponentially. This will lead to the development of new AI systems that can perform complex tasks with greater speed and accuracy.

2. Autonomous and intelligent machines: The next generation of AI systems will likely be designed to work autonomously and intelligently. These systems will be able to learn from experience and make decisions that are based on the best available data.

3. Greater focus on ethical considerations: As AI
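The consumption pattern used in the streaming cell above, iterating with `async for` and handling each chunk as it arrives, can be tried without a model. In this sketch `stub_stream` is a hypothetical async generator standing in for `async_stream_and_merge`, and `collect_stream` is an illustrative helper that forwards each chunk to a callback while also building the full text.

```python
import asyncio
from typing import AsyncIterator, Callable, List, Optional


async def stub_stream(pieces: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming source: yields pieces one by one."""
    for piece in pieces:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield piece


async def collect_stream(
    stream: AsyncIterator[str],
    on_chunk: Optional[Callable[[str], None]] = None,
) -> str:
    """Consume a stream, passing each chunk to `on_chunk`, and return the full text."""
    parts = []
    async for chunk in stream:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


seen: List[str] = []
full_text = asyncio.run(
    collect_stream(stub_stream([" Paris", ",", " the", " capital"]), seen.append)
)
print(full_text)
```

In the notebook cell, the callback role is played by `print(cleaned_chunk, end="", flush=True)`, which displays text as soon as each chunk arrives.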




```python
# Shut down the engine when finished
llm.shutdown()
```