# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where a server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A working example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython, Jupyter, or other code that already runs inside an event loop, apply `nest_asyncio` first:
```python
import nest_asyncio

nest_asyncio.apply()

```
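The patch is needed because plain `asyncio` refuses to start a second event loop from inside one that is already running, which is the situation in a notebook cell. The sketch below reproduces that failure using only the standard library; `inner` and `outer` are hypothetical names chosen for illustration, and the engine itself is not involved.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running loop here, much like a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: unpatched asyncio rejects this
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Calling `nest_asyncio.apply()` beforehand patches `asyncio` so that nested calls like this one are permitted.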

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")



### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Ricky. I'm from Canada and I'm in junior high school. I love to watch soap operas and cartoons because they are so interesting. As a student, I have a lot of homework to do and I don't have time to play sports or other things I enjoy. I have a girlfriend and we have been dating for 3 years. We have a baby and we are doing a lot of family stuff, like going to the doctor, changing diapers and feeding the baby. It's a lot of work, but I'm happy. I have a friend who likes to play computer games, but he doesn't have a girlfriend or
Prompt: The president of the United States is
Generated text:  a ________. A. member of the U. S. Congress B. member of the U. S. Supreme Court C. member of the U. S. House of Representatives D. member of the U. S. Senate
Answer: D

During the term of office, which of the following situations should be suspended? 
A. The term of office has expired
B. The contract has not been fulfilled by the end of the term
C. The princ

### Streaming Synchronous Generation
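The cell in this section uses `stream_and_merge` from `sglang.utils`, which consumes streamed chunks and merges them with overlap removal. As a rough illustration of that idea, and not the library's actual code, the sketch below drops the part of each new chunk that repeats the end of the text collected so far. The names `strip_overlap` and `merge_chunks` and the sample chunks are hypothetical.

```python
def strip_overlap(existing_text: str, new_chunk: str) -> str:
    """Return new_chunk without the longest prefix that repeats the end of existing_text."""
    max_overlap = min(len(existing_text), len(new_chunk))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate chunks, removing text that overlaps with what was already merged."""
    merged = ""
    for chunk in chunks:
        merged += strip_overlap(merged, chunk)
    return merged


# Hypothetical chunks whose beginnings repeat the end of the previous chunk.
chunks = ["The capital", " capital of", " of France is Paris"]
merged = merge_chunks(chunks)
print(merged)
```

Checking the longest candidate overlap first keeps the merge from leaving a partial duplicate when a shorter overlap would also match.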

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? As an AI language model, I don't have a physical presence, but I'm always ready to assist you with any questions or tasks you may have. How can I help you today? Let's get started! [Name] [Company name] [Job title] [Company name] [Company address] [Company phone number] [Company email] [Company website] [Company logo] [Company mission statement] [Company values] [Company culture

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. It is also home to the French Parliament, the French National Library, and the French National Opera. Paris is a bustling city with a rich cultural heritage and is a popular tourist destination. It is also known for its fashion industry, with Paris Fashion Week being one of the largest in the world. The city is also home to the French Riviera, a popular tourist destination for its beautiful beaches and warm climate. Paris is a city of contrasts, with its historical architecture and modern fashion industry blending together to create a unique

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing machines to learn from and adapt to human behavior and decision-making processes. This could lead to more sophisticated and adaptive AI systems that can better understand and respond to human needs and preferences.

2. Greater emphasis on ethical considerations: As AI becomes more integrated with human intelligence, there will be a greater emphasis on ethical considerations. This could lead to more rigorous testing and evaluation of AI systems, as well as increased regulation and oversight of AI development and deployment.

3. Increased use of AI in healthcare: AI



### Non-streaming Asynchronous Generation
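An awaitable generation call lets other coroutines make progress while the batch is pending. The sketch below illustrates this with a stand-in coroutine in place of the engine: `fake_async_generate` and `heartbeat` are hypothetical, and the only detail borrowed from this document is that each output is a dict with a `"text"` key.

```python
import asyncio


async def fake_async_generate(prompts):
    # Hypothetical stand-in for an awaitable batch call; not the real engine.
    await asyncio.sleep(0.01)
    return [{"text": f" <completion for: {p}>"} for p in prompts]


async def heartbeat(ticks):
    # Unrelated work that keeps running while generation is pending.
    for _ in range(3):
        await asyncio.sleep(0.001)
        ticks.append("tick")


async def run_both():
    ticks = []
    # gather() schedules both coroutines on the same event loop.
    outputs, _ = await asyncio.gather(
        fake_async_generate(["Hello, my name is", "The capital of France is"]),
        heartbeat(ticks),
    )
    return outputs, ticks


outputs, ticks = asyncio.run(run_both())
for output in outputs:
    print(output["text"])
```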

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I’m a [Field or occupation] with a passion for [Your passion]. I’m here to share my knowledge, skills, and experiences with you. Let’s connect and discover how I can help you achieve your goals and dreams. [Tell them about yourself, including your education, experiences, and skills. Consider personal anecdotes or stories that show your personality and unique perspective. Use a tone that sounds genuine and approachable. ] I’m here to help you grow and thrive, not just to satisfy your ego. I’m [give a specific reason why you’re the best fit for this role]. Thanks for taking the

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. It is a bustling metropolis with a rich history and cultural heritage, known for its beautiful architecture, iconic landmarks, and annual world-renowned festivals. The cit

### Streaming Asynchronous Generation
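Asynchronous streaming is consumed with `async for`, printing each chunk as soon as it arrives. The sketch below shows that pattern with a stand-in async generator so it runs without a model; `fake_stream` and `collect` are hypothetical names, and real chunks would come from the engine rather than from splitting a string.

```python
import asyncio


async def fake_stream(text):
    # Hypothetical stand-in for a streaming call: yields one word at a time.
    for word in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word + " "


async def collect(text):
    pieces = []
    async for chunk in fake_stream(text):
        print(chunk, end="", flush=True)  # show each chunk immediately
        pieces.append(chunk)
    print()
    return "".join(pieces).strip()


result = asyncio.run(collect("The future of AI is"))
```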

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # async_stream_and_merge yields each chunk with overlapping text already removed
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [name] and I am a [character type] [character]. I'm currently studying [job title] [job position] at [company name], and I'm passionate about [specific hobby or interest that you enjoy]. What brings you to this position? As an [job title], I am always striving to [specific goal or aspiration], and my [job title] at [company name] is dedicated to [specific mission or objective]. I enjoy [specific activity or hobby] and I'm always looking for opportunities to [specific skill or expertise]. My [job title] at [company name] is [description of job]. I



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris. 

- **What is the capital of France?**
- **Paris, officially known as the Capital of France, is the largest city in France and the seat of the Government of France.**

- **French capital?**
- **Paris** is the official name for the capital city of France. Paris is the seat of the government of France and is the administrative, cultural, and economic center of the country. It is the third most populous city in the world, with an estimated population of around 2 million people. The city is known for its historical architecture, cultural attractions, and annual festivals. 

- **What is



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  expected to be multifaceted and diverse, with several potential trends shaping the development and evolution of AI technology. Here are some key trends that are expected to shape the AI landscape in the years to come:

1. Deep Learning and Neural Networks: Deep learning and neural networks are expected to become the dominant paradigm in AI, driving improvements in a wide range of applications, from image and speech recognition to natural language processing and computer vision.

2. Explainability and Transparency: There is growing interest in improving the transparency and explainability of AI systems, with the aim of making AI systems more trustworthy and understandable for humans. This will likely lead to more




In [6]:
llm.shutdown()