# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example in the form of a Python script is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
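As a rough illustration of the custom-server use case, the sketch below wraps an engine-like object in a plain request handler that any web framework could call. `StandInEngine` and `handle_request` are hypothetical names used only for this illustration; the stand-in merely imitates a `generate(prompts, sampling_params)` call that returns one dict with a `"text"` key per prompt, so the snippet runs without a GPU. The linked `custom_server.py` remains the authoritative example.

```python
import json


class StandInEngine:
    """Hypothetical stand-in for sgl.Engine so the sketch runs anywhere."""

    def generate(self, prompts, sampling_params):
        # Imitate the assumed output shape: one dict with a "text" key per prompt.
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


def handle_request(engine, body: str) -> str:
    """Turn a JSON request body into a JSON response using the engine."""
    payload = json.loads(body)
    prompts = payload["prompts"]
    sampling_params = payload.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    return json.dumps({"texts": [out["text"] for out in outputs]})


engine = StandInEngine()  # in a real server this would be an sgl.Engine instance
response = handle_request(engine, json.dumps({"prompts": ["The capital of France is"]}))
print(response)
```

Keeping the handler independent of any framework means the same function can sit behind whichever HTTP or RPC layer the deployment needs.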



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()
```
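Notebook environments typically already run an event loop, and `asyncio.run` refuses to start a second one inside it. The standard-library sketch below reproduces that error without SGLang, which is the situation `nest_asyncio.apply()` is meant to work around. The names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, like code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a nested event loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```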

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Jackson, I'm 22 years old, and I'm interested in working with dogs. I'd love to learn more about dogs and would like to become a certified dog trainer. Could you help me find any resources or places where I can learn more about dogs? Sure! There are many resources available to learn more about dogs. Here are a few options:

1. Dog websites: There are many dog-related websites that offer information about dogs, including breeds, health, and training tips. Some popular ones include Dogster, PetMD, and PetMD.

2. Animal shelters: Many animal shelters offer classes and training sessions for dogs
Prompt: The president of the United States is
Generated text:  representing the United States in a foreign policy debate. The president says, "I believe that the United States should be a great power, but I also believe that the United States should be a moral power." 

To demonstrate this, the president gives an example of a recent policy that the United 
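For offline jobs, the generated texts are usually stored rather than printed. The sketch below pairs each prompt with its output and serializes the pairs as JSON Lines. `StandInEngine` and `write_jsonl` are hypothetical names; the stand-in only imitates the `generate` call and the `output['text']` field used above, so replace it with the real `llm` object when running against a model.

```python
import io
import json


class StandInEngine:
    """Hypothetical placeholder that imitates llm.generate without a model."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


def write_jsonl(prompts, outputs, sink) -> int:
    """Write one JSON record per prompt/output pair and return the record count."""
    count = 0
    for prompt, output in zip(prompts, outputs):
        sink.write(json.dumps({"prompt": prompt, "text": output["text"]}) + "\n")
        count += 1
    return count


prompts = ["Hello, my name is", "The capital of France is"]
outputs = StandInEngine().generate(prompts, {"temperature": 0.8, "top_p": 0.95})

buffer = io.StringIO()  # a real job would open a file instead
written = write_jsonl(prompts, outputs, buffer)
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(written, records[0]["prompt"])
```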

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your interests and what you're looking for in a job. What can I do for you today? Let's get started! [Name] [Job Title] at [Company Name] is looking for someone like you to join our team. What are your skills and what excites you about the work? [Name] [Job Title] at [Company Name] is looking for someone like you to join our team. What are your skills and what excites you about the work? [Name] [Job

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, which is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also a major cultural and economic center, hosting numerous museums, theaters, and other attractions. Paris is a popular tourist destination and a major hub for international business and diplomacy. The city is home to many famous French artists, wri

Prompt: Explain possible future trends in artificial intelligence. The future of AI is

Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies are expected to continue to improve and become more integrated into our daily lives, from self-driving cars and robots to personalized medicine and virtual assistants. Additionally, there is a growing trend towards developing AI that is more ethical and transparent, with greater emphasis on privacy and security. AI will also continue to be used for a variety of applications, from healthcare to finance to transportation, and will likely become an increasingly important part of our daily lives. Finally, there is a growing interest in AI research and development, with many organizations and individuals
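The "overlap removal" named in the banner refers to trimming text that a new chunk repeats from the end of what has already been received. The following is a minimal sketch of that idea under this assumption, not the actual implementation of `stream_and_merge`. The names `trim_overlap` and `merge_chunks` are chosen for this illustration.

```python
def trim_overlap(existing: str, chunk: str) -> str:
    """Return the part of `chunk` that is not already at the end of `existing`."""
    max_overlap = min(len(existing), len(chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing.endswith(chunk[:size]):
            return chunk[size:]
    return chunk


def merge_chunks(chunks) -> str:
    """Concatenate chunks, dropping any prefix that repeats the merged tail."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


merged = merge_chunks(["Hello wor", "world", "ld!"])
print(merged)
```

A suffix/prefix heuristic like this cannot tell an accidental overlap from a legitimate repetition: `merge_chunks(["ha", "ha"])` yields `"ha"`. It therefore only suits streams whose chunks are known to overlap.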



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [job title] with over [number of years] years of experience in [industry]. I've always been passionate about [reason for interest in the field], and I've always wanted to [desired outcome]. What's your background, and what interests you in the field?

I'm excited to meet you and learn more about you! 

Please provide a personal statement or resume if you have one. If not, describe your education and your most impressive achievement, any relevant skills, and any advice you have for someone interested in pursuing a career in your field. Additionally, please provide information on any notable projects

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

- *Facts about Paris*:
  * Population: 2. 13 million
  * City centre: 7th largest metro station in the world
  * Official language: French
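One reason to prefer the asynchronous API is that several requests, possibly with different sampling parameters, can be awaited concurrently. The sketch below shows the pattern with `asyncio.gather` against a hypothetical `StandInAsyncEngine`, which only imitates the `async_generate` call shape used above. How the real engine schedules concurrent calls should be checked against the SGLang version in use.

```python
import asyncio


class StandInAsyncEngine:
    """Hypothetical stand-in that imitates llm.async_generate without a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait for the model
        return [{"text": f" <completion of {p!r}>"} for p in prompts]


async def run_jobs(engine):
    job_a = engine.async_generate(["Hello, my name is"], {"temperature": 0.8})
    job_b = engine.async_generate(["The capital of France is"], {"temperature": 0.2})
    # gather awaits both coroutines concurrently and returns results in argument order.
    return await asyncio.gather(job_a, job_b)


results = asyncio.run(run_jobs(StandInAsyncEngine()))
for outputs in results:
    print(outputs[0]["text"])
```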


### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name]. I am a computer scientist from [Location] with a degree in Computer Science from [University]. I have a passion for technology and am always looking for ways to improve my knowledge. My interest in programming languages and algorithms led me to pursue a degree in Computer Science. I have a keen eye for detail and a strong work ethic. I am a meticulous person and strive to stay up to date with the latest technology trends. I am always eager to learn and improve, and I am always looking for ways to contribute to the growth of technology. How can I get to know you better? What are your hobbies or interests? How


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the most populous city in France and the largest metropolitan area in Europe, with over 10 million people living within its urban boundaries. The city is renowned for its rich history, vibrant culture, and stunning architecture. Its status as the world's most populous urban center has made it a hub of global commerce and tourism, attracting millions of visitors each year. Paris is also known for its fashion industry, dance, and music scenes, and its status as the city of love and romance has made it a popular destination for couples and families alike. With its historical significance and modern appeal, Paris is a major cultural and economic center


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  an exciting and rapidly evolving field, with a wide range of possible trends and technologies that could shape the way we live, work, and interact with technology. Here are some possible trends in artificial intelligence that could emerge in the next few years:

1. More accurate and personalized personalization: As AI continues to advance, we can expect to see more accurate and personalized AI solutions that can be used to provide better service to customers. This could include things like chatbots that can provide personalized recommendations, voice assistants that can understand and respond to a wide range of voice commands, and predictive analytics that can provide insights into customer behavior and preferences.

2.
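The `async for` loop in the cell above consumes chunks as soon as they are produced. The sketch below shows the same consumption pattern against a hypothetical `fake_stream` async generator, which stands in for `async_stream_and_merge` so that the snippet runs without a model. It also collects the chunks so the full text is available afterwards.

```python
import asyncio


async def fake_stream(text: str):
    """Hypothetical stand-in for a streaming call: yields one word-sized chunk at a time."""
    for word in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield word + " "


async def consume(completion: str) -> str:
    pieces = []
    async for chunk in fake_stream(completion):
        print(chunk, end="", flush=True)  # show each chunk immediately
        pieces.append(chunk)  # and keep it for later use
    print()
    return "".join(pieces).rstrip()


full_text = asyncio.run(consume("Paris is the capital of France."))
```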




In [6]:
llm.shutdown()
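
In a script, as opposed to a notebook, it can help to guarantee that `shutdown()` runs even when generation raises. One possible pattern is a small context manager, sketched below with a hypothetical `StandInEngine` that only records whether `shutdown()` was called. `engine_session` is likewise an illustrative name, not part of SGLang.

```python
from contextlib import contextmanager


class StandInEngine:
    """Hypothetical stand-in that tracks whether shutdown() was called."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        return [{"text": " ..."} for _ in prompts]

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always shut it down afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StandInEngine()
try:
    with engine_session(engine) as llm:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(engine.is_shut_down)
```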