# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server, which suits use cases where an additional HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

Additionally, you can build a custom server on top of the SGLang offline engine. A detailed example, written as a standalone Python script, can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you want to use the **Offline Engine** in IPython or in other code that already runs inside an event loop, add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()

```
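To see why such a patch can be needed, the sketch below reproduces the underlying error with the standard library alone: calling `asyncio.run` while an event loop is already running raises `RuntimeError`. This is presumably the situation the note above refers to. The coroutine names `inner` and `outer` are illustrative and not part of SGLang.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # A loop is already running here, much like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return type(exc).__name__
    return "no error"


result = asyncio.run(outer())
print(result)
```

`nest_asyncio.apply()` patches `asyncio` so that nested calls of this kind are permitted.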

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

In [1]:
# launch the offline engine
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")


### Non-streaming Synchronous Generation
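The cell below passes `temperature` and `top_p` in `sampling_params`. As a rough illustration of what these two knobs conventionally mean (temperature scaling followed by nucleus filtering), here is a self-contained sketch over a toy distribution. The function `filter_distribution` and the toy logits are hypothetical; this is not SGLang's sampler implementation.

```python
import math


def filter_distribution(logits, temperature, top_p):
    """Return the renormalized probabilities kept by temperature + top-p filtering."""
    # Temperature scaling: lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus (top-p) filtering: keep the most likely tokens until their
    # cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
kept = filter_distribution(toy_logits, temperature=0.8, top_p=0.95)
print(sorted(kept))
```

With these toy values the least likely token falls outside the nucleus, so a sampler following this scheme would never pick it.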

In [2]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Prompt: Hello, my name is
Generated text:  Aiden and I'm a 10th grader. I'm in 10th grade and I'm from Italy. I'm studying English, Biology, and History. I like to travel, read books, watch TV, and play soccer. What's your favorite subject? What are your favorite activities? What kind of movies do you like to watch? What are your hobbies? What kind of books do you like to read? When did you start watching TV? What is your favorite TV show? Do you have a favorite TV show or movie? What's your favorite food? What's your favorite color? What's your
Prompt: The president of the United States is
Generated text:  a person. The president of the United States is the head of state of the United States. It's a very important job, and it's not easy being president of the United States. Some people want to be president, but some other people don't want to be president. The president is also not always popular. 

Does this passage discuss the job of the president of the United States?

Pick from:
 

### Streaming Synchronous Generation
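The cell below relies on `stream_and_merge` and describes the result as streaming "with overlap removal". One plausible way to merge chunks that may repeat text already received is to drop the longest prefix of each new chunk that matches a suffix of the accumulated text. The sketch below illustrates that idea; `trim_overlap` and `merge_chunks` are illustrative names and should not be read as the actual implementation in `sglang.utils`.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the prefix of new_chunk that repeats the end of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk


def merge_chunks(chunks):
    """Concatenate streamed chunks, skipping any overlapping text."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Hypothetical chunks in which each one repeats the tail of the previous one.
chunks = ["The capital", " capital of France", " France is Paris."]
print(merge_chunks(chunks))
```

A heuristic like this cannot tell a re-sent chunk from a genuine repetition in the generated text, so it is only a sketch of the general idea.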

In [3]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()


=== Testing synchronous streaming generation with overlap removal ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is


Generated text:  [Name] and I'm a [job title] at [company name]. I'm excited to meet you and learn more about your career. What can you tell me about yourself? I'm a [insert a short description of your character, such as "fun-loving, adventurous, and always looking for new experiences."]. I'm always looking for new challenges and opportunities to grow and learn. What do you do for a living? I'm always looking for new ways to improve myself and make the world a better place. What's your favorite hobby or activity? I love to read, travel, and explore new places. What's your

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is


Generated text:  Paris, also known as the City of Light. It is the largest city in France and the third-largest city in the European Union. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. The city is also home to many famous museums, including the Louvre and the Musée d'Orsay. Paris is a cultural and economic hub of France and a major tourist destination. It is also home to many important political and historical figures, including Napoleon Bonaparte and Marie Antoinette. The city is known for its vibrant nightlife

Prompt: Explain possible future trends in artificial intelligence. The future of AI is


Generated text:  likely to be characterized by a number of trends that are expected to shape the technology's direction and impact on society. Here are some of the most likely trends that could be expected in the future:

1. Increased integration of AI into everyday life: AI is already being integrated into many aspects of our lives, from smart home devices to self-driving cars. As the technology continues to advance, we can expect to see even more integration of AI into our daily lives, from the way we work and communicate to the way we interact with technology.

2. AI will become more ethical and responsible: As AI becomes more integrated into our lives, we will



### Non-streaming Asynchronous Generation
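Because `async_generate` is awaited, as the cell below shows, several calls can in principle be scheduled together with `asyncio.gather`. The sketch uses a stand-in class, `FakeEngine`, so that it runs without a GPU or model; the class and its canned replies are hypothetical, and only the `async_generate(prompts, sampling_params)` call shape mirrors the usage shown in this document.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the engine; returns canned text instead of running a model."""

    async def async_generate(self, prompts, sampling_params):
        await asyncio.sleep(0.01)  # simulate waiting on inference
        return [{"text": f" <reply to: {p}>"} for p in prompts]


async def run_batches(engine, batches, sampling_params):
    # Schedule one async_generate call per batch and wait for all of them.
    tasks = [engine.async_generate(batch, sampling_params) for batch in batches]
    return await asyncio.gather(*tasks)


batches = [["Hello, my name is"], ["The capital of France is", "The future of AI is"]]
results = asyncio.run(
    run_batches(FakeEngine(), batches, {"temperature": 0.8, "top_p": 0.95})
)
for batch, outputs in zip(batches, results):
    for prompt, output in zip(batch, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result list lines up with its batch of prompts.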

In [4]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())


=== Testing asynchronous batch generation ===



Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  [Name] and I'm [Age] years old. I'm a [occupation] who enjoys [pastime or hobby]. I'm a [specific skill or expertise] expert, so you can trust that I'm not just a general assistant, but an authority figure in my field. How can I help you today? What are your questions or concerns? I'm here to assist you, so feel free to share your needs. And if you have any questions, I'll do my best to answer them. Remember, I'm here to help, not to provide advice or solutions. Just let me know what you need, and I

Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris.

The capital of France is Paris. It is the largest city in Europe by population, with an estimated 12.9 million inhabitants in 2020. The city is located on the Seine river and is the center of France and one of the world's most important cities. Paris is home 

### Streaming Asynchronous Generation

In [5]:
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Replace direct calls to async_generate with our custom overlap-aware version
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())


=== Testing asynchronous streaming generation (no repeats) ===

Prompt: Write a short, neutral self-introduction for a fictional character. Hello, my name is
Generated text:  ___________, and I'm a/an _____________. Let's get started! I'm currently ____________. If you have any questions or need assistance, please don't hesitate to ask. What's your name?

You have introduced yourself as "my name is ___________", which is a quite common name. You're already an adult now, so that means you've reached a certain age. You're currently ____________, which is a place or event. Let's get started by asking you a question or asking for your assistance. If you have any questions or need help, please don't hesitate to ask. What's your name


Prompt: Provide a concise factual statement about France’s capital city. The capital of France is
Generated text:  Paris, also known as Louvain-la-Neuve, and is the oldest city in Europe. Paris was founded in the 6th century as a Roman settlement, and is known for its art, architecture, cuisine, and fashion. It is also one of the largest cities in Europe, with a population of over 2.1 million people. Paris is home to many famous landmarks, including the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The city is known for its vibrant nightlife, and has been a major center of French culture and politics for centuries. The French language is also spoken in


Prompt: Explain possible future trends in artificial intelligence. The future of AI is
Generated text:  likely to be marked by several key trends, including:

1. Increased Use of AI in Healthcare: AI is already being used in a variety of healthcare applications, including disease diagnosis and treatment, personalized medicine, and predictive analytics. As AI becomes more prevalent in healthcare, we may see increased use in these areas as well.

2. Advancements in Natural Language Processing: As AI technology continues to advance, we may see more sophisticated natural language processing techniques that enable computers to better understand and interpret human language. This could lead to a greater ability to automate and automate certain aspects of the healthcare system.

3. Increased Integration



In [6]:
llm.shutdown()
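
To make sure `shutdown()` runs even when generation raises, the calls can be wrapped in `try`/`finally`. The sketch below shows the pattern with a stand-in class, `FakeEngine`, so that it runs without a model; the class, its `is_shut_down` flag, and the helper `run_and_shutdown` are hypothetical, and only the `generate(...)` and `shutdown()` calls mirror the usage in this document.

```python
class FakeEngine:
    """Hypothetical stand-in for the engine so the sketch runs without a model."""

    def __init__(self):
        self.is_shut_down = False

    def generate(self, prompts, sampling_params):
        if not prompts:
            raise ValueError("no prompts given")
        return [{"text": f" <reply to: {p}>"} for p in prompts]

    def shutdown(self):
        self.is_shut_down = True


def run_and_shutdown(engine, prompts, sampling_params):
    try:
        return engine.generate(prompts, sampling_params)
    finally:
        # Runs on success and on error, so the engine is always shut down.
        engine.shutdown()


engine = FakeEngine()
outputs = run_and_shutdown(engine, ["The capital of France is"], {"temperature": 0.8})
print(outputs[0]["text"], engine.is_shut_down)

failing = FakeEngine()
try:
    run_and_shutdown(failing, [], {"temperature": 0.8})
except ValueError:
    print("error propagated, shut down:", failing.is_shut_down)
```

With the real engine, the same structure would place `llm.generate(...)` in the `try` body and `llm.shutdown()` in the `finally` clause.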