# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
Note that if you want to use the **Offline Engine** in IPython or in other code that already runs an event loop, you need to add the following code:
```python
import nest_asyncio

nest_asyncio.apply()

```
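
To see why the patch is needed, the following minimal sketch uses only the standard library (no SGLang calls) to reproduce the underlying restriction: `asyncio.run` refuses to start while an event loop is already running, which is the situation inside IPython or Jupyter. The names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, just as it is inside IPython/Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call from within a running loop
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that nested calls of this kind are permitted, which is why it is applied before using the engine in a notebook.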

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```


### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Pair each prompt with its result and print the generated text.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

Prompt: Hello, my name is
Generated text:  Diana and I am a writer and a filmmaker. I have a background in theatre, including a degree in creative writing and a master's in theater management. I have been working in the creative field since 1983 and have a particular interest in performance art. I have been a performer, a director, a writer, and a producer.
As a writer, I have written fiction, poetry, and script. As a filmmaker, I have made a number of short films and a documentary on The Royal Shakespeare Company. I have a PhD in Theater, which I completed in 2009. I have been a member of
Prompt: The president of the United States is
Generated text:  trying to decide how many armed guards should be stationed at the country's major cities. He has heard that the city of New York has a population of 8,000,000 people, and that the population of New York City is the largest in the country. The president believes that at least 40% of New York City's population should be guarded by armed gua
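
The `sampling_params` dictionary above sets `temperature` and `top_p`. As a rough illustration of what these two parameters conventionally mean, here is a small pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a toy logit vector. It is a generic textbook-style illustration, not SGLang's sampler; the function `top_p_filter` and the toy logits are assumptions made up for this example.

```python
import math


def top_p_filter(logits, temperature, top_p):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens so the probabilities sum to 1 again.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
nucleus = top_p_filter(toy_logits, temperature=0.8, top_p=0.95)
print(nucleus)
```

With these toy values the least likely token falls outside the nucleus and can never be sampled, while the remaining three are renormalized.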

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

# Stream each prompt and print the merged result once it is complete.
for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [occupation] with [number] years of experience in [field]. I'm passionate about [reason for interest] and I'm always looking for new challenges and opportunities to grow and learn. I'm a [character trait] and I'm always ready to help others and make a positive impact. I'm confident in my abilities and I'm eager to share my knowledge and experience with anyone who's interested. Thank you for taking the time to meet me. [Name] [Occupation] [Number] [Field] [Reason for interest] [Character trait] [Reason for interest] [Confidence

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also home to the French Parliament, the French Academy of Sciences, and the French National Library. Paris is a bustling metropolis with a rich cultural heritage and is a major tourist destination. The city is known for its fashion, art, and cuisine, and is a popular destination for tourists and locals alike. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly into one another. It is a city of people, with a diverse population of over 2 million residents. Paris is

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by rapid advancements in areas such as machine learning, natural language processing, and computer vision. These technologies will continue to improve and become more integrated into our daily lives, from self-driving cars to personalized medicine. Additionally, AI will continue to be used for tasks that require human-like intelligence, such as language translation and emotional intelligence. As AI becomes more integrated into our daily lives, we may see a shift towards more ethical and responsible use of AI, with a focus on minimizing harm and maximizing benefits. Overall, the future of AI is likely to be one of continued innovation and progress, with a focus on ethical and responsible use
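
The cell above prints that it merges streamed chunks "with overlap removal". The internals of `stream_and_merge` are not shown in this document, so the following is an independent sketch of the general idea rather than SGLang's implementation: when a new chunk begins with text that already ends the accumulated output, only the non-overlapping tail is appended. The helpers `strip_overlap` and `merge_chunks` and the sample chunks are hypothetical.

```python
def strip_overlap(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of new_chunk that already appears at the end of existing."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found, keep the chunk as is


def merge_chunks(chunks):
    """Concatenate chunks while skipping text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical stream in which consecutive chunks repeat part of the previous one.
chunks = ["The capital", " capital of France", " France is Paris."]
merged_text = merge_chunks(chunks)
print(merged_text)
```

Checking the longest candidate overlap first keeps the merge from leaving a partial duplicate behind when a shorter overlap would also match.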



### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch, then print each prompt with its result.
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name]. I'm [insert character's age], [insert character's occupation or profession]. I'm excited to meet you and contribute to your world in some way. What's your name? And what do you do? How have you been keeping up with the latest trends in the field? If you have any thoughts or ideas for the future of [mention a topic or industry], please let me know! I'm always open to learning and exploring new possibilities. Thanks for having me! How can I help you? [Insert character's name] is a [insert character's profession] with a passion for [mention a

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

Let me know if you need me to expand or rephrase this statement. You're welcome, I'm here to help! If you'd like, feel free to ask for clarification or expansion on any aspect of Paris. For 
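
One reason to prefer the asynchronous call is that other coroutines can make progress on the same event loop while generation is awaited. The sketch below demonstrates this with a stand-in class, so it runs without a model or GPU. `StubEngine` and `heartbeat` are hypothetical; the stub only mirrors the call shape used above (`async_generate(prompts, sampling_params)` returning a list of dicts with a `text` key).

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in for an engine; it echoes prompts instead of running a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # pretend to wait for inference
        return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks):
    # Unrelated work that keeps running while generation is pending.
    for _ in range(3):
        ticks.append("tick")
        await asyncio.sleep(0.001)


async def main():
    engine = StubEngine()
    ticks = []
    # gather() runs both coroutines concurrently on the same event loop.
    outputs, _ = await asyncio.gather(
        engine.async_generate(["a", "b"], {"temperature": 0.8}),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(main())
print(outputs, ticks)
```

The same structure is what makes the engine convenient to embed in a custom server: request handling and generation can share one event loop.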

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  _______________, and I'm a/an _______________. I'm currently living in _______________, and I have been here for _______________. I'm excited to meet you here and look forward to learning about you and your journey.
My name is _______________, and I'm a/an _______________. I'm currently living in _______________, and I have been here for _______________. I'm very excited to meet you and learn about your journey. How can I help you today? I look forward to meeting you all and learning more about you. Looking forward to our meeting, and I look forward to learning more about you and your journey

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.
Paris is the most populous city in France and its economic and cultural center. It is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, Louvre Museum, and the Tuileries Gardens. The city is also famous for its cuisine, architecture, and fashion. Paris is an important part of the European Union, where it hosts numerous international events and attracts millions of tourists each year. The city is also home to the French Parliament building, the Eiffel Tower, and the Notre-Dame Cathedral, which are considered UNESCO World Heritage sites. Paris has a rich history dating back to Roman times

Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to see significant advancements in a wide range of areas, with a focus on developing more advanced algorithms and improving the accuracy and efficiency of AI systems. Here are some possible future trends in artificial intelligence:

1. AI will become more personalized and adaptable, with the ability to learn from data and adjust its behavior accordingly. This will require new approaches to machine learning and deep learning, such as unsupervised learning and reinforcement learning.

2. AI will become more ubiquitous, with more and more applications being developed to leverage its capabilities. This will require collaboration between different fields of research, as well as partnerships with industry leaders to ensure that AI is used

```python
llm.shutdown()
```
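
In a script, as opposed to a notebook, it can be useful to guarantee that `shutdown()` runs even when generation raises an exception. The following sketch wraps engine creation in a context manager. `managed_engine` is a hypothetical helper, not an SGLang API, and `StubEngine` is a stand-in that exposes only `shutdown()` so the example runs without a model.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only a shutdown() method."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def managed_engine(factory):
    """Create an engine via factory() and always shut it down on exit."""
    engine = factory()
    try:
        yield engine
    finally:
        engine.shutdown()


created = []


def make_engine():
    engine = StubEngine()
    created.append(engine)  # keep a reference so the state can be inspected afterwards
    return engine


try:
    with managed_engine(make_engine) as engine:
        raise ValueError("simulated failure during generation")
except ValueError:
    pass

print(created[0].is_shut_down)
```

With the real engine, the factory would presumably be something like `lambda: sgl.Engine(model_path=...)`, reusing the constructor call shown at the top of this section.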