# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which is useful when a server would only add complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).
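As a rough illustration of the custom-server use case, the sketch below shows only the request-handling core that a server would wrap. It is not the linked `custom_server.py`. `StubEngine`, `handle_generate`, and the JSON field names `prompts`, `sampling_params`, and `texts` are hypothetical, and the stub stands in for `sgl.Engine` so that the sketch runs without a GPU.

```python
import json


class StubEngine:
    """Hypothetical stand-in for the engine; it only mimics the
    list-in, list-of-dicts-out shape used in this document."""

    def generate(self, prompts, sampling_params):
        return [{"text": f" <completion for {p!r}>"} for p in prompts]


def handle_generate(engine, body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body."""
    request = json.loads(body)
    prompts = request["prompts"]
    # Assumed request field; defaults to no overrides when absent.
    sampling_params = request.get("sampling_params", {})
    outputs = engine.generate(prompts, sampling_params)
    payload = {"texts": [output["text"] for output in outputs]}
    return json.dumps(payload).encode("utf-8")


response = handle_generate(
    StubEngine(), b'{"prompts": ["The capital of France is"]}'
)
print(response.decode("utf-8"))
```

Any web framework can call a function like this from its route handler; the engine itself stays a plain in-process object.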



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in any other environment that already runs an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
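To see why this is needed, the standard-library sketch below reproduces the underlying problem without SGLang: calling `asyncio.run()` while an event loop is already running raises a `RuntimeError`. The function names `inner` and `outer` are illustrative only.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # Inside this coroutine a loop is already running, which is the
    # same situation as a cell in IPython or Jupyter.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches `asyncio` so that such nested calls are allowed.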

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Annie. I'm 20 years old and I come from America. I live in America. I like to play soccer. I play soccer every day. And I'm the captain of my school team. I'm in the best team. Now I'm a member of the English club. I have two friends and one brother. But I have no parents and no family. But I love my family very much. I love my brothers and sisters. They are my best friends. They love me very much. I like dogs, too. I have a big dog. I love him very much. My dog is a very clever and
Prompt: The president of the United States is
Generated text:  trying to become more environmentally conscious. He has his car repaired at a $5000 cost, but with an additional cost of $2000 for parts, and a $3000 insurance premium. He has to pay a monthly fee of $500 for maintenance. If he uses the car 6 days a week, what is the average cost per day of running the car?

To determine the average cost per day of running the car, we need to follow these steps:

1. Cal
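For very large prompt lists it can be convenient to submit prompts in slices and store results as they arrive. The engine schedules each batch internally, so this is a convenience for the caller rather than a requirement. The helper names `chunked` and `generate_in_chunks` are hypothetical, and `fake_generate` is a stand-in that only mimics the `[{"text": ...}, ...]` output shape used above.

```python
from typing import Callable, Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start : start + size]


def generate_in_chunks(
    generate_fn: Callable[[List[str], dict], List[Dict[str, str]]],
    prompts: List[str],
    sampling_params: dict,
    chunk_size: int,
) -> List[Dict[str, str]]:
    """Call `generate_fn` per slice and pair each prompt with its text."""
    records = []
    for chunk in chunked(prompts, chunk_size):
        outputs = generate_fn(chunk, sampling_params)
        for prompt, output in zip(chunk, outputs):
            records.append({"prompt": prompt, "text": output["text"]})
    return records


def fake_generate(prompts, sampling_params):
    # Stand-in for llm.generate so the sketch runs without a model.
    return [{"text": f" <completion {len(p)}>"} for p in prompts]


prompts = ["Hello, my name is", "The capital of France is", "The future of AI is"]
records = generate_in_chunks(
    fake_generate, prompts, {"temperature": 0.8, "top_p": 0.95}, chunk_size=2
)
for record in records:
    print(record)
```

With a real engine, `llm.generate` would be passed in place of `fake_generate`.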

### Streaming Synchronous Generation

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name], and I'm a [job title] at [company name]. I'm a [job title] with [number of years] years of experience in [industry]. I'm passionate about [reason for interest] and I'm always looking for ways to [action or goal]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number of years] years of experience in [industry]. I'm a [job title] with [number

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is a historic city with a rich history and a vibrant culture, known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. Paris is also known for its fashion industry, art scene, and food culture, making it a popular tourist destination. The city is home to many famous landmarks and attractions, including the Palace of Versailles, the Arc de Triomphe, and the Champs-Élysées. Paris is a city of contrasts, with its modern architecture and historical landmarks blending seamlessly together. It is a city that

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by several key trends:

1. Increased integration with human intelligence: AI is likely to become more integrated with human intelligence, allowing for more complex and nuanced decision-making. This could lead to more sophisticated and personalized AI systems that can better understand and respond to human emotions and behaviors.

2. Enhanced privacy and security: As AI systems become more sophisticated, there will be a greater need for privacy and security measures to protect user data. This could lead to the development of new technologies and protocols that enhance privacy and security, such as blockchain-based AI systems.

3. Greater automation and efficiency: AI is likely to become more automated and
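The cell above relies on `stream_and_merge` to produce merged text "with overlap removal". One way such merging could work is sketched below. This shows the idea and is not necessarily how `sglang.utils` implements it; `trim_overlap_sketch` and `merge_chunks` are hypothetical names.

```python
from typing import Iterable


def trim_overlap_sketch(existing: str, new_chunk: str) -> str:
    """Drop the longest prefix of `new_chunk` that `existing` already ends with."""
    max_overlap = min(len(existing), len(new_chunk))
    for size in range(max_overlap, 0, -1):
        if existing.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks: Iterable[str]) -> str:
    """Concatenate streamed chunks, skipping text that was already emitted."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap_sketch(merged, chunk)
    return merged


print(merge_chunks(["Hello", "lo wor", "world!"]))
```

A caveat of this approach is that it cannot tell a genuine repetition from an overlap: a chunk that legitimately repeats the end of the previous text would be trimmed as well.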



### Non-streaming Asynchronous Generation

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name], and I'm a [Job Title] who specializes in [Your specialty]. I'm passionate about [Your passion/interest], and I'm always eager to learn new things and expand my knowledge. I'm a [Type of person] who always puts others' needs and desires first. I'm confident, organized, and ready to tackle whatever challenges come my way. 

What are you looking to achieve in life? Share your goals with me in no more than 250 words.

As an AI language model, I do not have personal goals or ambitions, but I am always ready to help and assist users in achieving

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, the third largest city and the largest metropolitan area in the European Union. It is located on the banks of the Seine River, situated in the south-central region of France and serves as the political, cultural
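The asynchronous API is useful when generation has to run alongside other coroutines. The sketch below shows the general `asyncio.gather` pattern for awaiting several independent requests and keeping results in prompt order. `fake_async_generate` is a hypothetical stand-in for an awaitable generation call, not an SGLang function.

```python
import asyncio


async def fake_async_generate(prompt, sampling_params):
    # Stand-in for an awaitable generation call; sleeps instead of running a model.
    await asyncio.sleep(0.01)
    return {"text": f" <completion for {prompt!r}>"}


async def generate_all(prompts, sampling_params):
    # gather() awaits the coroutines concurrently and returns results
    # in the same order as the inputs.
    coroutines = [fake_async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*coroutines)


prompts = ["The capital of France is", "The future of AI is"]
outputs = asyncio.run(generate_all(prompts, {"temperature": 0.8, "top_p": 0.95}))
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```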

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly,
        # so text repeated between chunks is printed only once
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [insert character's name] and I'm a [insert character's occupation or profession]. I enjoy [insert one or two hobbies and interests that I have in common with other people]. I'm a [insert character's personality type or trait] and I have a [insert character's strongest or most notable strength]. I work hard to [insert one or two things that you can do to improve your character or self-image]. I'm [insert character's age and date of birth] and I'm [insert character's nationality]. I'm located in [insert location]. I am [insert character's favorite way to entertain others and why it



Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as the City of Love.

This statement encapsulates Paris' primary importance as the city with the world's most expensive fashion market, its status as a center of art and culture, and its role as the capital of France. However, it also highlights the challenges and cultural diversity that come with being a major city in a country with a diverse population.

While Paris is not the most expensive city in the world, it is known for its high-end fashion industry, art museums, and other cultural institutions. It is also a major city with a large immigrant population, which has led to a mix of cultural and linguistic



Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be shaped by a combination of technological advances, regulatory changes, and evolving social and cultural norms. Here are some possible future trends in artificial intelligence:

1. Autonomous vehicles: Autonomous vehicles are set to become more widespread as they become more advanced and reliable. This could lead to a decrease in the number of human drivers and reduce the risk of accidents.

2. Personalized medicine: AI could help doctors make more accurate diagnoses and develop personalized treatment plans for patients with diseases and conditions.

3. Robotics and automation: The use of robots and automation in industries like manufacturing and logistics could lead to increased efficiency and productivity.

4. AI in healthcare
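The streaming cell above handles its prompts one after another. As a sketch of an alternative, several async streams can be consumed concurrently by giving each prompt its own collecting coroutine. `fake_stream` is a hypothetical async generator standing in for a streaming generation call; it is not part of SGLang.

```python
import asyncio


async def fake_stream(prompt):
    # Stand-in async generator that yields three text chunks per prompt.
    for index in range(3):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield f" {prompt}-{index}"


async def collect(prompt):
    """Consume one stream and return its concatenated text."""
    parts = []
    async for chunk in fake_stream(prompt):
        parts.append(chunk)
    return "".join(parts)


async def main():
    prompts = ["A", "B"]
    # Each stream is drained by its own coroutine; gather keeps input order.
    texts = await asyncio.gather(*(collect(p) for p in prompts))
    return dict(zip(prompts, texts))


results = asyncio.run(main())
print(results)
```

Printing chunks from concurrent streams directly would interleave them, which is why this sketch collects each stream before displaying it.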




In [6]:
llm.shutdown()
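In a script, it may be worth making sure that `shutdown()` runs even if generation raises. The sketch below wraps the call in a small context manager. `engine_session` is a hypothetical helper, and `StubEngine` is a stand-in that only provides a `shutdown()` method so that the example runs without a model.

```python
from contextlib import contextmanager


class StubEngine:
    """Hypothetical stand-in exposing only what this sketch needs."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        self.is_shut_down = True


@contextmanager
def engine_session(engine):
    """Yield the engine and always call shutdown() afterwards."""
    try:
        yield engine
    finally:
        engine.shutdown()


engine = StubEngine()
try:
    with engine_session(engine) as llm:
        raise RuntimeError("simulated failure during generation")
except RuntimeError:
    pass

print(engine.is_shut_down)
```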